Those who have been following our blog closely know by now that the internet is not a large truck you can just dump stuff on, and that we are big proponents of Facebook’s Scribe log server. We were early adopters of Scribe, and by combining it with Facebook’s efficient Thrift protocol we were able to scale our infrastructure from a few thousand requests a few years ago to over a billion log events per day, before the machines were even connected to the internet. However, owing to our recent growth in traffic and what we view as the diminishing community support for Scribe (as determined by number of “Likes” on Scribe’s Facebook page) we decided to investigate alternative solutions like Apache’s Kafka.
Although quite popular across Big and Medium Data companies, some of us had reservations about Kafka’s early development stage and its lack of native message buffer mechanism which we would have to implement separately. We also wanted a solution that came with native support in Ruby and Java and preferably other leading languages, like COBOL and Prolog. Some late searching on GitHub yielded a promising alternative logging platform: Twitter.com
Twitter.com as a Logging Platform
For those not familiar, Twitter.com is a light-weight, platform-agnostic messaging system predominantly used by machines but also some humans. Twitter proved far better supported than either Kafka (209) or Scribe (481) with 43,598 repositories on GitHub at the time of research. Much like Kafka, Twitter allows for concurrent reads and writes from multiple producers and consumers via an advanced API that is updated very regularly and is well documented. Beyond that, Twitter had a number of distinct features we liked:
Compact: Twitter log events (“Tweets”) are compact by design while still containing a lot of information. Other messaging systems we tested required separate compression protocols.
Twitter allows constant O(internet) access time to any Tweet via an API which scales very well irrespective of how many EC2 instances we are running
Our marketing department told us our social media presence was lacking
Our existing logging system aggregates data from our comprehensive publisher network which sees over a billion unique users a day. To support our scalability requirements, we wanted a system that:
Can scale to support at least 3B requests a day each up to 1 kilobyte in size
The producers must have their own cache to ensure message buffering in case of network downtime
The system must not one morning transform into a large vermin (a documented issue with Kafka)
Favorable review and the approval of Nathan Marz
We set up our benchmarking configuration with one medium EC2 instance for each of Scribe, Kafka and Twitter all running clones of our production configuration. We used the official Twitter Ruby library to interface with Twitter. All three machines were replaying a real stream of roughly 1MM log messages recorded the previous day over the span of two minutes. Of the three, Twitter proved by far the easiest to configure and get running, although picking a name for each server was challenging. Kafka proved a bureaucratic nightmare. Next, we looked at publisher performance.
Time to Publish A single Event
Time (in ms) to publish log event
The results of the first test seemed interesting, and we were worried that we would soon fall behind with our Twitter configuration. Luckily, we were able to register another 100 accounts on Fivrr for $5 + tax, and by using technologies like parallelization we were soon able to achieve runtimes comparable to those of a single Kafka instance. The second issue we ran into can be seen below:
As you can clearly observe, we had trouble reading back the messages, maxing out at 720 requests per API instance before running into what appears to be a limit of rates. Not very promising and an odd design decision for a messaging system. However, we found that this issue too can be solved by using multiple accounts. Since account acquisition was becoming a bottleneck, we quickly threw together a quick script that posted TaskRabbit requests for new Twitter accounts when our algorithm detected that we were at over 80% of the request limit. A few hours and one “Top User” designation on TaskRabbit later we were confidently chugging along at 66 log requests read per second.
Next, we wanted to look at the compression afforded by each system. Although storage these days seems cheap, given our use case of maintaining 50TB of compressed logs in a highly highly redundant RAID configuration, costs can easily exceed $50k. To that end, we were very impressed with the compression and succinctness afforded by “Tweets”. To (t)wit, sample log records:
Sample Log Snippets
Kafka: 133394834; 2345,11,0293; 39108323;231893f7fja333ff4ee;4=34jeeac331948
Twitter: @akIntelligence #AppleFanboy @KayneWest
Aside from being much more compact, Twitter proved the more expressive protocol out of the box. It is fairly apparent from the Twitter message that we onboarded Kanye West’s data to our partner Aggregate Knowledge with the information that he is a buyer of Apple products, whereas only one of our engineers was able to decrypt the gzipped Scribe message while the others had to use a computer. As an added benefit, we were made to understand that since the test, Kanye West’s own logging server responded to ours and a rap battle is currently under way.
We were also positively surprised that three of our 20k Twitter log servers had 15 followers each by the end of the first day of testing, while our Kafka and Scribe machines clocked in at a lowly zero – in fact, they were not even connected to the internet (we had to connect the Twitter machines for them to work). An additional upside of Twitter was the significant projected reduction in internal log storage space: from 50 to a projected 0TB and change.
Time to read 1,000,000 events
Instances required to read 1,000,000 events
Time to 15 followers
Marketing department approval
We were collectively impressed with the Twitter.com logging system, especially in light of the limited coverage it has received to date, which we are sure is bound to change. We are currently in the process of moving a number of our systems to Twitter.com logging, so be sure to keep an eye out for exciting updates from our production machines!
We would like to update the Twitter API to publish events using the GET rather than POST HTTP request type, as the latter is prohibited on the Sabbath, which undermines the reliability of our logging on weekends and Jewish high holidays.