In 2007 we were hearing a lot about the inevitable cloud computing revolution. Amazon Web Services was in its infancy, with just a storage service (S3) and a newly released - but very limited compared to today - virtualized server platform (EC2). These were interesting services, but on the surface it didn't seem like a significant shift from the status quo.
AWS had a combination of three things that nobody else did at the time: it was on-demand, self-service, and priced to scale. This combination of flexibility, convenience and price made us take a serious look at how we could take advantage of real on-demand computing. Being able to launch 20 servers to load test an application for an hour - and only paying the equivalent of one server for one day - that was revolutionary.
Less than a year later we took the plunge.
Coming from a bare-metal server infrastructure in our own datacenter, we had toyed around with virtualization but never fully bought into it. We had a bit of a learning curve ahead of us moving to virtualized cloud servers. At Barchart, adopting the cloud way of thinking centered around a few principles:
Servers are disposable
Applications may be stopped or replaced at any time, and usually that's out of your control. Amazon (and other cloud providers) retires instances on a regular basis and there's nothing you can do to stop this. This reality needs to be baked into both your provisioning process and application design. It has big implications for how stateful the local application is allowed to be: if a server goes down unexpectedly in the middle of a user session, it shouldn't result in data loss or major service interruptions.
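One way to make a node safely disposable is to keep session state out of the server process entirely. A minimal sketch, with illustrative names (the dict backend stands in for an external store such as Redis or a database - this is not Barchart's actual code):

```python
# Sketch: session state lives in a shared external store, so any node
# can die mid-session and another node can serve the next request.

class SessionStore:
    """Pluggable session backend; only the interface matters here."""

    def __init__(self, backend):
        self.backend = backend           # e.g. a Redis client in production

    def save(self, session_id, data):
        self.backend[session_id] = data  # write survives this node

    def load(self, session_id):
        return self.backend.get(session_id)

shared = {}                              # stand-in for an external store
node_a = SessionStore(shared)
node_b = SessionStore(shared)

node_a.save("user-42", {"cart": ["ZC*1", "ES*1"]})
# node_a is terminated; node_b picks up the session unchanged
print(node_b.load("user-42"))
```

With this shape, terminating any single node loses nothing but in-flight requests, which is exactly the property a disposable server needs.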
Many small is better than few big
Since servers are disposable, redundancy needs to happen at the software layer. Applications need to be able to share workloads across many small instances rather than a few large ones (or only one), or unexpected failures and maintenance will impact uptime and availability. To make this work you need to know where your software bottlenecks lie - scaling horizontally is a different animal than scaling vertically.
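The classic pattern for sharing work across many identical nodes is a shared queue: every worker pulls from the same pool of jobs, so capacity is a worker count rather than a server size. A toy sketch using threads as stand-ins for instances (names and the doubling "work" are illustrative):

```python
# Sketch of horizontal work-sharing: identical small workers drain a
# shared queue. Adding capacity means adding workers; losing one worker
# loses no work that it hasn't already picked up.
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            job = jobs.get_nowait()      # any worker can take any job
        except queue.Empty:
            return
        with lock:
            results.append(job * 2)      # stand-in for real processing
        jobs.task_done()

for n in range(100):
    jobs.put(n)

# "Many small": scale by changing this count, not the machine size
workers = [threading.Thread(target=worker) for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))                      # all 100 jobs processed
```

In production the queue would be an external service (SQS, a message broker) so that workers on separate instances can share it.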
Service discovery is a necessity
When you have application nodes that appear and disappear on a regular basis, your other services need to be able to find them. Static hostnames or IP lists won't work anymore. Fortunately, service discovery is something that sounds complicated but can be very simple. One of the most widespread service discovery tools is, of course, DNS. By itself it doesn't provide failure detection, but coupled with a load balancer it is a perfectly adequate service discovery solution.
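In practice this means clients resolve a stable, well-known name instead of carrying a baked-in IP list. A small sketch (the production hostname is hypothetical; in a load-balanced setup the name resolves to the balancer, which handles health checks and routing to live nodes):

```python
# Sketch: discover a service through DNS rather than a static IP list.
import socket

def discover(service_hostname, port):
    """Resolve a well-known service name to connectable addresses."""
    infos = socket.getaddrinfo(service_hostname, port,
                               proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]

# In production this would be something like
# discover("quotes.internal.example.com", 443);
# "localhost" keeps the sketch self-contained.
print(discover("localhost", 80))
```

Because the name is the contract, nodes behind it can churn freely without any client-side configuration changes.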
Scaling is easy
Capacity planning is a big deal when you're running your own hardware. If you underestimate demand, your application will be crippled until you bring new hardware online, a process that can take hours or even days. When you have applications that run in parallel on many servers, combined with an infrastructure that allows on-demand capacity expansion, underestimating demand becomes a minor inconvenience that can be fixed in about 10 minutes.
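To give a sense of what "fixed in about 10 minutes" looks like: with application nodes in an AWS Auto Scaling group, adding capacity is a single API call. A hedged sketch, not our exact commands - "web-asg" is a hypothetical group name:

```shell
# Raise the fleet from its current size to 20 instances; new nodes
# launch from the group's image and register with the load balancer.
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name web-asg \
    --desired-capacity 20
```

The remaining minutes are just instance boot and health-check time - no purchase orders, no racking hardware.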
Adopting these principles will expose any areas where you are cutting corners. Fast scaling and effective parallelism require - for one thing - servers with identical configuration at launch, with no manual tweaking allowed. This in turn requires advance planning on the software and provisioning side, because shortcuts like poor-man's load balancing of databases (via hand-tuned per-server configuration files) no longer fly.
Unfortunately, we were using several of these techniques. We thought we had a pretty good approach to server provisioning, but our first deployment to AWS exposed just how much we were missing. We weren't using real load balancing for a lot of things, we had hard-coded hosts files pointing to static IPs, our provisioning processes weren't fully automated, and we made assumptions about server statefulness that caused applications to error out when we started scaling up and down.
So all of this cloud capacity wasn't coming for free - we were going to have to work for it. We hadn't fully understood the role that application design would play in the migration to public cloud infrastructure. The expectation was that deployments would be a lot like we were used to in our private datacenter, but the reality was that the public cloud imposes a very different set of constraints.
Now that we knew what to look for, the next few weeks were spent auditing applications to figure out just how big of a project this was going to be. As in any company, the patterns we uncovered in the first application were scattered throughout our codebase. The good news was that there were few difficult fixes, just some time-consuming ones. Although we had to rethink a lot of our provisioning and configuration processes, we did not discover any major bottlenecks that would impact our ability to scale horizontally. Our code architecture was sound.
Step one was putting everything - and I mean everything - behind a load balancer. A single-server application is a single point of failure. Even if there really is only one server - put it behind a balancer. Load balancing is cheap, and if you ever need to replace the server or scale up, you will be glad you can.
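The payoff of "everything behind a balancer" is that clients only ever know one endpoint, so backends can be added or swapped freely. A toy round-robin balancer makes the point - real deployments would use a managed balancer like ELB or HAProxy, and the addresses here are made up:

```python
# Toy illustration: one stable front door, interchangeable backends.
import itertools

class Balancer:
    def __init__(self, backends):
        self._pool = itertools.cycle(backends)

    def pick(self):
        return next(self._pool)

# Even a single server goes behind it:
lb = Balancer(["10.0.1.5"])
print(lb.pick())

# Scaling up later changes nothing for clients:
lb = Balancer(["10.0.1.5", "10.0.1.6", "10.0.1.7"])
print([lb.pick() for _ in range(6)])
```

Swapping the one-server pool for a three-server pool touches only the balancer's backend list - which is exactly why you'll be glad the balancer was there from day one.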
Next, we addressed our service discovery problems. With everything load balanced, we were most of the way there - delegating health checks to the balancers removed the need to route around individual server failures in software, so we could go back to relying on well-defined DNS hostnames for service discovery. A zero-impact change.
Going through this process also highlighted the importance of making sure everything about the deployment process was fully automated. Hand-tuning server images, even reusable ones, leaves the door open to human error. To streamline our operations processes and make provisioning and deployment more deterministic, we started down the path to full deployment automation. As it turned out, this was a very gradual migration due to the number of projects involved, and it took us until just this year to finally reach all of our automation goals. We are now using a combination of Packer, Puppet, Docker and AWS CloudFormation to manage all new deployments.
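To make the Packer piece concrete, here is a minimal sketch of a template that builds an AMI from a scripted provisioning step - every value below is a placeholder, not our actual configuration:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "app-server-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "script": "provision.sh"
  }]
}
```

Because the image is built entirely from source-controlled inputs, every instance launched from it is identical at boot - which is what makes the "no manual tweaking" rule enforceable rather than aspirational.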
The Cloud Mentality
These experiences have forced us to think differently about software development and deployment, and resulted in a new set of rules for software deployment at Barchart that I call the cloud mentality:
Everything is disposable
Everything is immutable
Most things are stateless
Auto-discovery is critical
Many small nodes are better than a few big nodes
Cluster health is more important than node health
Build once, deploy everywhere
In the next few weeks I'll explore these in more detail, with real problems and concrete solutions that we are currently using in production to solve them, across areas like provisioning and deployment, cluster auto-discovery, security strategies, and large scale data replication.
Barchart had what I consider a fairly typical experience migrating into a public cloud. I know many others who have gone through this process and run into similar problems. In the end, while it was more work than we initially planned for, it was not difficult to adapt once we knew which pitfalls to avoid in our applications. The flexibility and scalability that resulted were well worth the time spent.
If you've already got the cloud mentality, the transition to a public cloud should be a smooth process. But if you rely on a lot of manual maintenance or configuration, be aware that public cloud infrastructure is not a free ticket to infinite scalability.