This is the third in a series of posts explaining how we got our web infrastructure to handle a year's worth of traffic in a single week.
In part two we looked into server-side caching, dedicated servers and virtual machines. In this part, we take the idea further by introducing a critical component.
Enter stage left - the load balancer
Let's take our webserver VPS and power it down. We'll grab the disk image (it's a virtual machine, so that's just a file on the host) and make a copy of it. Now power the virtual machine up again. Then power up the duplicate.
Now we create a third virtual machine, and install software on it, but don't add any content. Then we configure this new machine as a "reverse proxy" for the other two virtual machines.
What's a reverse proxy? A forward proxy server takes requests from many users and aggregates them into a single network connection. This is a common configuration within a corporate network. A reverse proxy is the same thing, but facing the other way. Instead of aggregating many users into a single outbound internet connection, it takes the traffic arriving on a single inbound internet connection and splits it across multiple backend servers.
The reverse proxy distributes the traffic evenly across multiple backend servers. Additional capacity can be added by cloning additional servers. You can even get software which will automatically start and shut down virtual servers as demand changes.
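To make "distributes the traffic evenly" a bit more concrete, here is a minimal sketch of round-robin selection in Python. It is an illustration only: the class name and the hostnames are made up, and a real load balancer does a great deal more (connection handling, timeouts, health checks) than pick the next server in a list.

```python
class RoundRobinBalancer:
    """Hands out backends in strict rotation so each gets an equal share."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = list(backends)
        self._next = 0  # counts how many requests have been assigned so far

    def add_backend(self, backend):
        # Adding capacity is just adding another clone to the rotation.
        self.backends.append(backend)

    def pick(self):
        # Choose the backend for the next request, then advance the counter.
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return backend


# Hypothetical hostnames for the two cloned web servers.
balancer = RoundRobinBalancer(["web1.internal", "web2.internal"])
first_six = [balancer.pick() for _ in range(6)]

# Clone a third server and add it to the pool.
balancer.add_backend("web3.internal")
next_three = [balancer.pick() for _ in range(3)]
```

With two backends, the six requests alternate between them. Once the third clone is added, it starts receiving its share without any change to the other two.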
In our case we used the managed load balancer provided by our hosting provider, but the load balancer could be any reverse proxy. A common one would be Nginx.
What else is needed?
For static websites without anything dynamic (such as a CMS, forms, shopping carts, members' areas, results, etc.), what I've mentioned above would work fine without anything else. But we don't make static websites at Karmabunny...
There are a few additional things you'll need to make this work:
- Make the database available on all nodes
- Make content files available on all nodes
I'll go into more detail on each of these below.
Make the database available on all nodes
This is basically the process of duplicating the data across multiple servers. You really want this to happen in real time. There are tools to do this, and the technique is most commonly called replication. MySQL has a few different replication modes, some of which are native and others provided by plugins. Related database engines such as Percona or MariaDB have replication solutions as well.
Basically all database systems have a replication option. Some systems are even designed for replication and scaling. These solutions are often also NoSQL solutions such as MongoDB. SproutCMS doesn't support using MongoDB or any other NoSQL solution as a datastore so this wasn't an option for us (although we do kinda like MongoDB).
When we built this particular cluster, we used standard MySQL replication. We set up one of our nodes as the master, and all the other nodes as slaves. We could have set up a dedicated database cluster and had a 1:1 mapping of database servers to web servers, but we decided to keep the slaves on the web servers themselves to simplify the deployment. If we were building a larger infrastructure than our current needs require, the ability to scale database and web servers independently of each other would be a benefit, and in that case the infrastructure would probably be set up differently.
There are lots of articles as well as official documentation for setting up MySQL replication, so I won't go into much detail here. In short, you set up the master server to record all the data-changing queries to a log file called the binary log. Your slave servers then connect to the master, fetch any new entries from the binary log, and play them back on their own copies of the database.
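The record-and-replay idea can be shown with a toy model in Python. This is not how MySQL implements replication internally: it uses in-memory SQLite databases as stand-ins for the master and slave, a plain list as the "binary log", and invented class and table names. It only illustrates the principle that a slave remembers its position in the log and applies whatever is new.

```python
import sqlite3


class Master:
    """Toy master: applies writes locally and records them in a log."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.binlog = []  # ordered list of (sql, params) for every write

    def execute_write(self, sql, params=()):
        self.db.execute(sql, params)
        self.db.commit()
        self.binlog.append((sql, params))


class Slave:
    """Toy slave: replays log entries it has not yet seen."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.position = 0  # index of the next log entry to apply

    def catch_up(self, master):
        new_events = master.binlog[self.position:]
        for sql, params in new_events:
            self.db.execute(sql, params)
        self.db.commit()
        self.position += len(new_events)
        return len(new_events)


master = Master()
slave = Slave()

# The table and its contents are hypothetical.
master.execute_write("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT)")
master.execute_write("INSERT INTO pages (title) VALUES (?)", ("Home",))

applied = slave.catch_up(master)       # replays both statements
applied_again = slave.catch_up(master)  # nothing new this time
slave_titles = [row[0] for row in slave.db.execute("SELECT title FROM pages")]
```

Because the slave tracks its own position, it can fall behind and catch up later, which is also what makes the approach useful for backups.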
You also need to update your CMS software to route read queries to the local database slave, and write queries to the database master.
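As a rough illustration of that routing decision, here is a small Python sketch. It is not SproutCMS code; the function name, the keyword list and the "master"/"slave" labels are assumptions made for the example. It takes the cautious approach of sending only statements that are recognisably reads to the local slave and everything else to the master.

```python
# Statements that only read data; anything else is treated as a write.
READ_ONLY_KEYWORDS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN"}


def route_query(sql):
    """Return which connection a query should use: 'slave' or 'master'."""
    words = sql.split()
    if not words:
        raise ValueError("empty query")
    if words[0].upper() in READ_ONLY_KEYWORDS:
        return "slave"   # served by the database copy on this web server
    return "master"      # every data-changing query goes to one place


reads = route_query("SELECT title FROM pages WHERE id = 1")
writes = route_query("  update pages SET title = 'About' WHERE id = 2")
unknown = route_query("CALL rebuild_search_index()")
```

Defaulting unrecognised statements to the master is the safer choice, since a write that lands on a slave would never reach the other nodes. A real implementation would also need to think about reads that must see a write made moments earlier, because the slave may lag slightly behind the master.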
You can actually use database replication for much more than traffic scaling; another common use is for backups. You can have up-to-the-second backups on a remote system, and can take archive snapshots of the database without placing additional load on your primary database server.
Make content files available on all nodes
In many content management systems, including SproutCMS, you can upload images, documents and videos, which then become available on the website. There might also be files uploaded by end users from front-end forms, such as the jobs module, which allows an applicant to upload a CV. In SproutCMS these files are stored in the 'Media Repository'.
We needed a system to get these files onto all the webservers. We had a few options:
We could transfer the files upon upload onto an external storage service or CDN such as Amazon S3 or Rackspace. The files would then be served by those services instead of from our web servers. This is a very clean and elegant solution, but it requires support from your CMS software, and at the time we were developing this solution, SproutCMS didn't have well-tested support for external file services.
We could have set up some additional servers and then remote-mounted their filesystems. This would work in the same way as your typical network drive. You can read and write files to the network drive as if it was on your own computer, but the files are actually on another computer. In this case, we would set up a network drive on the web servers which would point to another computer which would actually host the files. This is a fairly simple solution, but there are sometimes performance issues with this option.
The solution we eventually went with was that after each upload into the media repository in the admin area, the file is copied onto each of the slave servers. This solution is simple to implement, although it only works because we don't have very many slaves in this cluster. It also requires that CMS admin access is always on the master server, which we implement using a special subdomain which points to the master instead of pointing to the load balancer.
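The copy-on-upload step can be sketched in Python as below. This is a simplified stand-in rather than our actual implementation: it copies into local directories that play the role of the slave servers, whereas a real cluster would push the file over the network with something like rsync or scp. The function name, paths and file name are all invented for the example.

```python
import shutil
import tempfile
from pathlib import Path


def distribute_upload(uploaded_file, slave_dirs):
    """Copy a freshly uploaded file into every slave's media directory."""
    uploaded_file = Path(uploaded_file)
    copies = []
    for slave_dir in slave_dirs:
        target_dir = Path(slave_dir)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / uploaded_file.name
        shutil.copy2(uploaded_file, target)  # copies contents and timestamps
        copies.append(target)
    return copies


# Demonstration using temporary directories in place of real servers.
with tempfile.TemporaryDirectory() as workspace:
    root = Path(workspace)
    upload = root / "master" / "media" / "cv.pdf"
    upload.parent.mkdir(parents=True)
    upload.write_bytes(b"example file contents")

    slave_dirs = [root / "web2" / "media", root / "web3" / "media"]
    copies = distribute_upload(upload, slave_dirs)

    copy_names = [copy.name for copy in copies]
    copies_match = all(c.read_bytes() == upload.read_bytes() for c in copies)
```

The loop runs once per slave for every upload, which is why this approach suits a small cluster and would become slow or fragile with many nodes.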
Conclusion
This has proved to be a really good solution for us. Since we implemented this we haven't had any outages due to traffic loads.
We did a lot of testing when setting this system up. There is an excellent command-line tool called ab (ApacheBench) which is very handy for determining the capacity of a system. We spent a fair bit of time figuring out what sort of capacity we actually needed, looking over Google Analytics data and coming up with figures for the requests/second and megabytes/second which we needed to be able to serve.
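The arithmetic for turning analytics figures into capacity targets looks roughly like the Python sketch below. The input numbers are invented purely to show the calculation and are not our real traffic figures. The function name, the parameters and the headroom multiplier are likewise assumptions for the example.

```python
def required_capacity(peak_hour_pageviews, requests_per_pageview,
                      avg_response_kb, headroom=2.0):
    """Estimate target requests/second and megabytes/second.

    headroom is a safety multiplier on top of the measured peak.
    """
    requests_per_second = (
        peak_hour_pageviews * requests_per_pageview / 3600 * headroom
    )
    megabytes_per_second = requests_per_second * avg_response_kb / 1024
    return requests_per_second, megabytes_per_second


# Made-up inputs: 180,000 pageviews in the busiest hour, 20 HTTP requests
# per pageview (page, images, CSS, JS), 50 KB average response size.
rps, mbps = required_capacity(180_000, 20, 50)
```

With those made-up inputs the targets come out at 2000 requests/second and about 97.7 megabytes/second. Figures like these can then be compared against what a tool such as ab reports for the cluster, for example by running it with its `-n` (total requests) and `-c` (concurrent requests) options against a test URL.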
As part of this process, we switched from Apache + FastCGI + PHP to an Nginx + PHP-FPM stack. This made a noticeable difference to the performance characteristics. We also tuned the Nginx configuration a fair amount to serve as much static content as possible directly, without using PHP.
One benefit of this system is that we can shut down a node for maintenance, and the load balancer detects the node isn't running and stops sending it traffic. This allows us to do upgrades without interfering with site availability. Once the node becomes available again, the load balancer starts sending it traffic again.
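The behaviour described above can be modelled in a few lines of Python. This is a sketch of the general technique, not the internals of our provider's managed load balancer: the class name and node names are hypothetical, and the "probe" is a simple lookup where a real balancer would make an HTTP or TCP check against each node.

```python
class HealthCheckedPool:
    """Round-robin pool that skips nodes which failed their last check."""

    def __init__(self, nodes, probe):
        self.nodes = list(nodes)
        self.probe = probe            # callable: node -> True if it responds
        self.healthy = set(self.nodes)
        self._next = 0

    def run_health_checks(self):
        # A real balancer would do this on a timer for every node.
        self.healthy = {node for node in self.nodes if self.probe(node)}

    def pick(self):
        # Try each node at most once, skipping the unhealthy ones.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


# Simulated node status; flipping a value stands in for a shutdown or restart.
node_is_up = {"web1": True, "web2": True}
pool = HealthCheckedPool(["web1", "web2"], probe=lambda node: node_is_up[node])

node_is_up["web2"] = False            # take web2 down for maintenance
pool.run_health_checks()
during_maintenance = [pool.pick() for _ in range(4)]

node_is_up["web2"] = True             # maintenance finished
pool.run_health_checks()
after_maintenance = [pool.pick() for _ in range(4)]
```

While web2 is down, every request goes to web1. After the next successful check, web2 rejoins the rotation with no manual step on the balancer.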
So this was our solution. Have you faced a similar problem? If so, we would love to hear about your solution.