
The curious case of cloud evangelism

1 Rain

It is quite hard to have a day pass where someone is not touting their latest success story on Azure, Amazon, or some other cloud provider. If there is a start-up equivalent of the second coming, the cloud must be it. These tiny vaporous water molecules permeate everything from infrastructure to software, from data storage to machine learning. Everything, literally everything, must be shrouded in clouds and sold at a premium to unsuspecting tech bros, brogrammers, valley hipsters and whatever their JavaScript conference T-shirt says they are. If you are not doing “full-stack” with the latest thing Jeff Bezos and his team came up with, you might as well not be trying. Or at least, that’s the impression I’m getting while doing my daily procrastination round on tech blogs and Twitter.

The insane success of the Cloud over physical hardware can be attributed to the belief that it is cheap, easy, scalable, and fast. But compared to what, exactly? To me this was always the great puzzle. The software teams I’m part of deploy their services on old-fashioned bare-metal hardware. Custom built to handle the load, carefully lifted into a rack at a colocation center. “You must be insane, old, or both”, or so the sentiment on the blog-o-sphere goes. Might be true. But it’s entirely beside the point.

What is definitely true for a lot of cases is the following: owning hardware is cheaper, faster, easier, and more scalable than the cloud. Not something most people expect. So what follows is a rundown of the type of applications we deploy, and how the Cloud might also not be the-greatest-thing-ever for most startups.

2 Cheaper

This is by far the most common argument I hear from Cloud proponents: “the cloud is so much cheaper than having your own rig”. And it simply is not true.

We build natural language processing and machine learning software for the medical domain. It’s not quite big data, the largest data set is about 400 gigabytes of text, but it’s “annoying-sized”. “Annoying-sized” data is the kind of data you cannot simply load up on your MacBook Pro, but isn’t really NSA-sized either. I postulate that almost all data worth looking into is either small or annoying. Actual big data that can deliver real-world insights is quite rare. Sure there are large databases, but you rarely want to consider all of them in tandem. But I digress.

So let’s say that the unicorn you’re working on is of an equivalent size. You have a couple hundred gigabytes of data, require some complex calculations, power some user interface, and handle a couple thousand users. Pretty standard thing, especially if you’re just starting out.

Here’s what we did to deploy it: we rented half a rack (22U) at a co-location center, and put our own SuperMicro servers in it.

Concretely:

  • Cisco Catalyst 1 Gbps switch
  • pfSense firewall appliance
  • App server with 128GB ECC RAM, 2x Intel Xeon 6-core 2.4 GHz CPUs
  • Database server with 128GB ECC RAM, Intel Xeon 6-core 2 GHz CPU, 6x Samsung EVO 1TB SSDs, 6x 2TB spinning rust drives, 2x 256GB Intel caching SSDs
  • Machine learning worker node with 128GB ECC RAM, Intel Xeon 6-core 3.6 GHz CPU, 2x Nvidia GTX 1070 8GB GPUs, 2x 512GB Samsung EVO Pro SSDs

Total cost for the hardware? Roughly $20 000. Total monthly cost for co-location at a nice secure data center? $300. Monthly bandwidth (100 Mbit, 95th percentile) and electricity (~400 kWh): $200. So our monthly costs are about $500. The hardware is ours.

Let’s consider what this would cost at Azure. Not to pick on Azure; Amazon is similar, but its price transparency is just horrible. Note that we need these machines running 24/7. We’re deploying a user-facing SaaS after all, so there is really no point in turning it off. Now it is a bit tricky to find an equivalent, but as an approximation: at least 128GB RAM, 6 cores, fast SSDs, and at least one node with GPUs. The A11 instance comes closest. For GPUs the L16 does. Granted, you get a few more cores. But I’m sure they include HyperThreaded ones, so they don’t really count. That’s a whopping $4,477.39/month. And that doesn’t even include bandwidth, or disk operations for the database. Maybe there are cheaper options, but it’s all trade-offs, and the ballpark figure will remain the same. That is roughly $60 000/year. Compared to $6 000 per year. For hardware that costs $20 000. Let’s remember, we pay $500/month. That’s almost 10x as expensive.
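
To put some numbers on that, here is a rough back-of-the-envelope sketch (the figures are the ones above; the cloud side leaves out bandwidth and disk operations, and the colo side adds in the one-off hardware purchase):

```python
# Back-of-the-envelope comparison using the figures from this post.
# The cloud number excludes bandwidth and disk IO; the colo number
# includes the one-off hardware purchase.

hardware_once = 20_000       # servers, switch, firewall (one-off)
colo_monthly = 500           # rack space + bandwidth + electricity
cloud_monthly = 4_477.39     # quoted Azure estimate for comparable instances

years = 3
colo_total = hardware_once + colo_monthly * 12 * years
cloud_total = cloud_monthly * 12 * years
print(f"colo over {years} years:  ${colo_total:,.0f}")    # ~$38,000
print(f"cloud over {years} years: ${cloud_total:,.0f}")   # ~$161,000

# After how many months does owning the hardware pay for itself?
break_even = hardware_once / (cloud_monthly - colo_monthly)
print(f"break-even after ~{break_even:.1f} months")       # ~5 months
```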

3 Faster

Okay, hang on, why do you need 128GB RAM per machine, isn’t that a bit much? No, it’s perfectly sane. Even a bit on the low side. Here’s the thing: CPUs are fast. Like, really, really, really fast. You want to make sure the CPUs do the actual application work. The things that make your business go. You don’t want them spending all that precious time serializing stuff into JSON, or something equally stupid. So what do you do: you load it up in memory. You would be amazed how fast things go if you put them in memory, and a lot of things fit in memory these days. Before you go roll out the next Hadoop/Spark cluster with terrible distributed state, serialization overhead, and network latency, consider just writing a for loop and loading it up in memory. A RAM disk works fine. Or heap memory. I mean, it doesn’t come with a free “Big Data” sticker for that patchwork of a laptop, but it solves the problem.
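
To make the “just write a for loop” point concrete, a minimal sketch (the file name and the word-count “work” are made-up stand-ins for whatever your actual processing is):

```python
# The boring alternative to a cluster: load the working set once, keep it
# on the heap, and loop over it. File name and "work" are made up.
from collections import Counter

def load(path):
    # One pass over the file; after this everything lives in RAM.
    with open(path, encoding="utf-8") as f:
        return f.readlines()

def work(lines):
    # Stand-in for the actual business logic: a plain word count.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    lines = load("corpus.txt")  # hypothetical text dump that fits in RAM
    print(work(lines).most_common(10))
```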

In addition to memory being insanely expensive on the cloud, another problem is that the performance is totally unpredictable. You get noisy neighbors, saturated network uplinks, limited disk IOPS, throttled CPUs, you name it: I’ve seen it all. It is especially annoying if you need to link several machines together: you don’t know how far the bits have to travel. And maybe your provider decides to shuffle stuff around; God only knows what they do to your running images. So then you have to be clever and work around it, to make sure the performance is at least bearable. If you run on bare metal you don’t even have to think about it. At most the bits have to travel 20cm over CAT-6; that’s it.

Now this is completely anecdotal, but we did try to deploy our application to the cloud before deciding against it: our performance was cut in half, and we had weird hiccups (probably due to stuff being paged out of memory). We never did a head-to-head comparison, but still.

4 Scalable

Now the next big thing would be scalability. These three machines will only go so far. “Premature optimization is the root of all evil”, they say. But certainly, now would be a good time to optimize. Software goes a long way. Sure you can hook up that O(xⁿ) algorithm to your wallet with something “elastic”, but you’ll run out of money faster than you can say “cloud”. Here are some hard-won lessons:

  • Be clever about caching
  • Compression is magic
  • So is memory mapping files (see the sketch after this list)
  • Precompute whatever you can in the background
  • Use queues (you are using queues, right?)
  • Start with a monolith, and break it up only when you need to scale
  • Parallelism and concurrency are much easier with immutable data structures
  • Serialization is expensive
  • Pick a JVM or CLR based language, I prefer Clojure
  • Check the indexes on your database (stay away from NoSQL unless you really have an extremely compelling case)
  • Measure, measure, measure
  • Pick up a book about algorithms, you’d be surprised how often you’re doing the equivalent of bubble sort somewhere
  • You really don’t need those gigabytes of log files
  • You also don’t need to track every user fart
  • Minimize IO
  • Is it really the server, or that cobbled together front-end running 10MB of JavaScript?

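For the memory-mapping point in the list, a minimal sketch, assuming a plain text corpus on disk (the file name and search term are made up; the nice part is that the OS handles paging and caching for you):

```python
# Memory-mapping a large read-only file: the OS pages it in on demand and
# keeps the hot pages cached, so repeated scans don't re-read from disk.
# File name and search term are made up.
import mmap
import re

with open("corpus.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The regex engine works directly on the mapped buffer; nothing is
        # copied into Python strings until a match is inspected.
        hits = sum(1 for _ in re.finditer(rb"\baspirin\b", mm))
        print(f"found {hits} occurrences")
```
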
So once all of that is out of the way, and rewriting hot loops in C/C++ or hand-optimizing a compile-to-assembly language isn’t really your thing: buy more machines. You can buy a SuperMicro MicroBlade with 14 blades, each with 3.6 GHz quad-cores (8 threads), 32GB ECC RAM, and 2x 512GB SSDs for ... $35 000. Backed by a 10 Gbps switch. Go ahead, configure 14 machines with high clock-speed quad cores, 32GB DDR4, and no-limits 1TB SSD storage on Azure or Amazon. I’ll wait. You’ll likely spend in a month what these machines cost as an initial investment. And they are only 3U, so you can stack plenty of them.

5 Easier

Right, so this is the sensitive part. Putting stuff in racks is a nice work-out, but not for everyone. The argument often heard is “I don’t want to deal with it, I just want to write NodeJS packages”. And that’s fair, but ops is part of your business. Making things run, arguably, is one of the most important parts of your whole operation. Speed and money matter, especially the trade-off curve. Maybe spending $200 000 a year on the Azure/AWS bill is pennies for you. But if you can get it down to $20 000, that’s just smart business. Buys a free lunch.

That being said, it is not always easier. Stuff will break. Less often than you might think, but when it does break it is often annoying. So some advice:

  • Keep spares of disks, PSUs, and memory
  • Use redundant PSUs, not because they break often, but so that if one does you don’t have to scramble to get the machine up again
  • Do off-site backups (incidentally, the cloud isn’t terrible for this; see the sketch after this list)
  • Do follow deployment best practices, so you can migrate quickly (e.g. containers or otherwise reproducible builds)
  • Learn shell scripting and Makefiles (not a bad skill to have anyway)
  • Know your systems inside-out
  • Really do consider the physical security of your systems. If someone has physical access, it’s game over security-wise

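For the off-site backup point, a bare-bones sketch that just shells out to rsync (the host, paths, and retention scheme are placeholders, not our actual setup):

```python
# Bare-bones off-site backup: rsync over SSH to a box somewhere else.
# Host, paths, and the archive scheme are placeholders.
import datetime
import subprocess

SOURCE = "/srv/backups/"                              # hypothetical dump dir
TARGET = "backup@offsite.example.org:/backups/app/"   # hypothetical remote

def offsite_sync():
    stamp = datetime.date.today().isoformat()
    subprocess.run(
        ["rsync", "-az", "--delete",
         # Files that changed or were deleted get tucked away per day
         # instead of being lost: a poor man's retention policy.
         "--backup", f"--backup-dir=../archive/{stamp}",
         SOURCE, TARGET],
        check=True,
    )

if __name__ == "__main__":
    offsite_sync()
```
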
Sure, layer-2 networking things can be tricky, and if you really cannot muster the patience to learn it: hire someone. It takes maybe 4 hours to set it up, and at an hourly rate of $200 you still have plenty of money left for hiring someone to have on call too. Having a data center next to your door really does help. I can take my bicycle and be there in 20 minutes.

What about all the other nice “server-less” cloud things? Well remember, there is no cloud, it’s just someone else’s computer. Anything you can run on the cloud, you can run on bare metal. There are certainly some services that are nice to have, but those are services, not infrastructure. The difference between IaaS and SaaS is huge. SaaS also has downsides, just ask Stallman; but that’s a completely different trade-off. The cloud doesn’t magically solve your failover, backups, deployment, and redundancy. Even if you go full cloud and lock your application into a particular vendor (so they, instead of you, end up running your business), you still have to set it up. And, judging by the number of self-proclaimed AWS experts, or at least those that show it off on their resume, it still isn’t trivial.