Better, Faster, Stronger.

R. Tyler Ballance

January has been quite an exciting month at Apture. Over the last couple of weeks we’ve been working into the wee hours making infrastructure improvements to keep up with Apture’s growth and improve the service. Some of you may have noticed periodic issues with our service while we performed these upgrades. Before I get into what we’ve been doing to correct some of these issues, let me take the opportunity to introduce myself. I started at Apture in October 2009 after leaving the server team at Slide, enchanted not only by what Tristan and the Apture team have done thus far but by what they want to do in the near future. Since arriving in Apture’s shiny San Francisco office, a lot of my focus has been “underneath the covers” of Apture.com: improving infrastructure, building backend services, and helping to make our current product fast and reliable while also laying the groundwork for new projects (coming soon).

Enough about me, let’s talk about Apture.com. As I said, some of you might have noticed intermittent problems over the past couple of weeks, problems which I feel confident saying are now behind us. During that time Can (read: Jon) and I have worked with Contegix to perform some pretty serious hardware upgrades to the cluster of machines running Apture.com. In addition to the hardware upgrades, Can and I have also rolled out a number of software changes, not only to service incoming requests faster but to service them reliably even when the production machines are under heavy stress.

Warning: It’s about to get technical.

Apture’s entire code base is written in Python, for a number of reasons I won’t go into here. As a Python shop, we develop our web application with Django at its core. This has a couple of implications, both positive and negative, from a technical standpoint. As with most web frameworks, Django allows us to worry less about the “menial tasks” of building web applications; we can take advantage of all the man hours that have gone into making Django a solid application framework and focus on what is unique about Apture.

There are a couple of downsides to using Django for an application as large as Apture (in both traffic and lines of code). There’s some assumed level of “magic” going on underneath the covers, and while Django tries to be as close to “one size fits all” as possible, there are certainly some portions that need tweaking as your application grows, particularly in the ORM and database-access layer. We’ve tried to partition some of our database access across multiple database machines, but the unfortunate state of affairs is that we are still beholden to a single primary database machine for a lot of our operations (similar to the Twitter of 2007). This single DB machine was the root cause of a number of our issues over the past few weeks.

As we dug into our analysis across a couple of days, we noticed that the DB machine was particularly write-bound: Apture.com was getting so much updated data that the DB machine could not stream writes to its disks fast enough. This would then cause issues “downstream”, so to speak, with application servers backing up while they waited for their INSERT and UPDATE statements to complete, causing cascading waves of overloaded machines that always originated from the primary DB machine.

Up until a couple of weeks ago we were using a standard Apache + mod_python setup for running the web application. When the primary DB machine started to back up, causing requests in Apache to wait, the load on the web machines would spike catastrophically: 4-core machines, where a load average of 4.0 is generally considered “100% load”, would spike into triple digits. What this meant for end users was a brief period of 500 errors and connection resets as the wave of overload swept through the web servers.
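To make the load-average arithmetic concrete, here is a minimal sketch of how you might normalize a load average by core count. The function names and the 1.0 threshold are my own for illustration, not something from our monitoring setup.

```python
import os


def load_per_core(load_average, cores):
    """Divide a load average by the core count; about 1.0 means "100% load"."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return load_average / cores


def is_overloaded(load_average, cores, threshold=1.0):
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_average, cores) > threshold


def current_load_per_core():
    """Read the 1-minute load average of this machine (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)
```

By this measure a 4-core machine at a load average of 4.0 sits right at 1.0 per core, while a triple-digit load average of, say, 120 works out to 30 times what the machine can comfortably run.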

The wave would also affect our work queue (Kestrel), which would suddenly back up to the tune of hundreds of thousands of jobs while asynchronous workers waited patiently for their database writes to complete. The queue backing up also had the effect of prolonging our overload issues: as the DB started to clear up, the thousands of jobs that had been waiting in the queue would all start streaming through at once, which could in turn put enough load on the DB to start the cycle all over again!

A wise man once said:

“I do mind, the Dude minds. This will not stand, ya know, this aggression will not stand, man.”

We got to work correcting a number of deficiencies in our infrastructure. First, we replaced more than half our Apache instances with “Spawning”, a high-performance Python-specific web server capable of handling an incredible number of concurrent requests. Spawning performed so well that we updated our load balancer configuration to prefer Spawning to Apache at a 2:1 ratio. We also corrected some of our flow issues with the work queue, making sure that jobs are not constantly backing up but also that we’re not evacuating the queue too quickly and thereby overwhelming the primary DB machine. While we were rooting around in the production infrastructure we also increased our memcached capacity fivefold, giving ourselves a vast amount of memory reserved for more and more caching in front of the databases. Lastly, we added more hardware to better federate critical services, the highlight being a brand new, beefy database machine. Working with Contegix we built a machine with much higher disk throughput and more processing power than its predecessor, and we switched our primary database over to the new machine last Thursday night (with ~2 minutes of downtime).
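For the curious, the idea of not evacuating the queue too quickly can be sketched as a simple fixed-interval throttle. This is an illustration under assumptions, not our production worker code: a plain `deque` stands in for the Kestrel queue, and `drain_throttled`, `handle`, and `max_per_second` are hypothetical names.

```python
import time
from collections import deque


def drain_throttled(jobs, handle, max_per_second,
                    clock=time.monotonic, sleep=time.sleep):
    """Process every job in `jobs`, starting at most `max_per_second` per second.

    `jobs` is a deque standing in for the real work queue, and `handle` is the
    callable that would perform the database write. `clock` and `sleep` are
    injectable so the pacing can be tested without actually waiting.
    """
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    interval = 1.0 / max_per_second
    next_start = clock()
    processed = 0
    while jobs:
        now = clock()
        if now < next_start:
            # Too early: wait so a backlog cannot hit the DB all at once.
            sleep(next_start - now)
        handle(jobs.popleft())
        processed += 1
        # Space the next job at least one interval after this one started.
        next_start = max(next_start, clock()) + interval
    return processed
```

With something like `drain_throttled(backlog, write_to_db, max_per_second=200)`, a backlog of 100,000 jobs would take at least a little over eight minutes to clear instead of slamming the primary DB the moment it recovers. The right rate depends entirely on what the database can absorb.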

As it stands now, we’re continuing to make a number of infrastructure improvements to make Apture not only more reliable but faster. I’m currently working on some changes to improve the speed at which our users can access content and interact with Apture-enabled sites.

I think I can speak for the whole Apture team when I say that we appreciate your patience with us over the past couple of weeks. We’re all hard at work making Apture better every single day.
