Recently I’ve been working on a major migration from in house managed colocation hardware to Amazon Web Services. The main thrust of this piece of work was to meet the challenge of providing distributed petabyte scale object store and S3 is certainly a great solution for this. The prevailing mythology in house to this point had been that cloud storage was 4 times more expensive than owned hardware. However, the model was fundamentally flawed. Firstly, this calculation was based on very little actual analysis, minimized the cost of developing and/or maintaining what would be an in house developed solution or a 3rd party or open source solution such as GlusterFS. Now, I should say that GlusterFS is a great piece of software, but I do believe that for a real highly active object store, the price of owning rather than “renting” is comparable if not in favour of cloud storage.
The magic expression I’d like to emphasize about S3 and other cloud storage services like it is simple: $ per byte/hours. A customer only pays for the storage used and for the time it is used. Try doing this with an in house data centre. Unless you are “internet scale”, like the facebooks, zyngas and googles of the world, it’s unlikely you can compete with AWS for storage cost and sophistication. Secondly, it’s important to note what S3 is. It’s a highly available, low latency, highly durable, highly scalable object store. My client stores billions of 100k files, which need to be able to delivered to a web application in 10s of milliseconds. Certainly there are companies out there building cheaper “storage” solutions, but solutions such as backblaze ( I love backblaze btw and am a customer for backing up my macbook pro) have a very different use pattern to S3.
We recently migrated approximately 3 billion objects from an in house tokyo cabinet service to S3. I want to point out something about this. The same people at my client who espoused the “cloud is 4x more expensive” myth, also underestimated how much waste and excess capacity was inherent in their system. 5 copies were stored of every single object. That’s fine if it’s just redundancy, but it wasn’t. Really, just 3 copies could have been stored for that purpose. The reason for 5 copies was an operational one. That’s how the process of generating objects, and then pushing them out to the various DCs had evolved. It was also an artifact of a process that grew up in a startup which didn’t have time to build a highly flexible and generalized object store. Therein lies my point. Most startups that want to focus on their business objectives and key value proposition don’t want to build a really awesome distributed data store (eg. dropbox). Also, often the servers that housed the 100TB + of data obviously weren’t 99.9% full. There was, let’s say theoretically 20% unused capacity. (Realistically this number is more like 50%).
The point is, byte/hours is amazing. Let’s take for example our migration effort. We shipped approximately 120TB of data via 3TB disks to Amazon in Seattle. We then had to migrate that from raw tokyo cabinet .tch files into individual objects on S3. So, granted we needed the extra storage to move to S3 in the first place, but this sort of data migration is done all the time in house. So, all of a sudden, we would need an extra 120TB of storage, but I only need it for about 3 weeks. Where can I get such a solution in a tradition DC? I buy myself a JBOD or similar and fill it with 40 x 3TB disks or so, and then keep it around because selling it isn’t prudent. This point really outlines one of the key advantages of cloud. Those that understand it are sitting there saying, “oh, of course”. Those that don’t should be paying attention.
Incidentally, we did face some challenges importing. We ran up to 30 large instances in parallel, each pulling down .tch files with ~25 threads and then wrote objects with approximately 64 threads each. We would roughly see 45MB/sec write rates across a max 10 concurrent instances. This is roughly equivalent to 4Gbps writing. Try doing that with a single JBOD or even a lot of physical machines. This works out to about 5000 puts/sec. Incidentally, if you are going to be using S3 for heavy throughput, it’ws worth paying attention to this from S3 team, essentially you’ll need to partition your keynames so that S3 can partition the data appropriately. If you’re really pushing S3, don’t take this as advice, take it more as “do this or get throttled”. Getting throttled is one thing in our case of a data import, I’m sure it’s another thing in the case of a live production system. Also, if you’re really pushing the limits of S3, you will see this error a bit:
InternalError: We encountered an internal error. Please try again. (code 500).
The error codes for S3 can be found here. Through out discussions with the S3 team, we were advised to take an exponential backoff approach on this error, meaning on this error we should wait 1, 2, 4, 8, 32… seconds before trying again. This gives the S3 service time to scale out as it reaches partition limits and the like. It’s worth mentioning in relation to the post from the S3 team mentioned earlier that we ended up with a partitioning scheme of [A-Z][a-z][0-9][_-] {2} (essentially two alpha-numeric values with some two extra characters. It turns out to be a mod(64) operation on two independent pieces of meta data for each object). such that at the root of every keyname in our bucket is something like 0A, 0B, etc etc. This gives us a keyspace of 4096 which is relevant when you consider the blog post above regarding S3 tips and tricks. The downside is that now we have to make 4096 requests to delete a single “set” which would otherwise be stored in a single directory. The trade off is worth it. Also of note, when you start talking about billions of objects, this translates to approximately $10k per billion just in puts! So be aware of this, and make your objects bigger or chunked together if at all possible. Nonetheless, I was able to simply provision 30 large instances, attach a 1TB drive to each one, pull down the tokyo cabinet files to each server, reprocess the data, store it to S3, shutdown the instances when I was done or when we were waiting for data to arrive in the import/export service, and delete the original data once it was processed. This sort of flexibility in computing resources is really incredible and something that many companies don’t realise.
In this blog post I capture the metrics we used for doing a real cost comparison between S3 and “build it yourself” distributed datastore. It’s an interesting read.