I have the bad habit of staring at progress bars.
I was backing up my Mac laptop to a CloudArray volume.¹ With gigabit ethernet, a full backup to a CloudArray volume takes less time than a full backup to my local USB drive. Apple’s Time Machine actually only does a full backup once, followed by hourly incrementals that are rolled together into dailies, weeklies, and synthetic fulls. That’s a fantastic model for the cloud, since it saves a lot on bandwidth, but I usually tear down most of my volumes and run the full backup again. It’s a good way for me to keep an eye on a number of different variables that can affect CloudArray performance.
Anyway, I set up a backup volume and sat back to watch the progress bars. Here’s a good one:

At this point, the backup on my laptop was mostly done. You can see that my CloudArray cache still had 35 gigs of dirty data, and it was just starting to work on flushing 8 gigs out to the cloud. Also, I’d been staring too long, and popped off to do important CTO-type stuff.
A few minutes later, important CTO-type stuff being done, I checked back in on my progress bar:

The same flush was still in progress, and it was mostly done. But wait! The cache still reports 35 gigs of dirty data! (Actually, 35.1… the operating system hadn’t finished flushing its own cache the last time I checked.) But if an 8-gigabyte flush was mostly done, shouldn’t the cache be almost 8 gigs cleaner? Whatever can be going on?
The answer, of course, is a teachable moment.
I’ve been building storage arrays of one type or another for pretty much my whole career. The most important aspect of any array’s firmware is its consistency model, by which I mean: how does it ensure that the data that it stores accurately represents the data that the host applications wrote? If an application writes “AB” to the disk, how does the firmware ensure that the next time it reads from that disk, it gets back “AB”? That is absolutely the most fundamental requirement of a storage system: everything else is just icing.
That might not sound like that hard of a problem, but the nuances of storing data in a complex, shared, networked controller can be subtle. For example, if my application writes “A”, then “B”, then “C” to different locations, I always want to return A, B, C for those locations. But if you add in a cache to the controller, and assume that the cache will fail (you always assume that every hardware component will eventually fail), then it’s not enough to just store the data in the cache. If you are implementing a write-back cache, you have to store information about the order in which those writes occur, so that the underlying backing store (a physical disk, say) gets those writes in the same order. Otherwise, when that cache fails, your application might read back A and C, but not B.
Why is that a problem? What if your application is a database, (A, B) is a credit card transaction, and C is the database checkpoint? In that case, your database will correctly read A, read corrupted data in place of B, and C will tell it that the corrupted junk is just fine. That’s bad.
If your cache firmware is well implemented, though, and only gets the chance to write two blocks before the cache hardware fails, then it will write A and B. Now, when your database tries to reread the data, it’ll find (A, B), but without that crucial C, it’ll do a proper rollback of the transaction.
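To make the A, B, C argument concrete, here is a toy sketch in Python. It is not CloudArray code or any real firmware: the block names come from the example above, and the functions (destage, is_consistent, and so on) are hypothetical names invented for illustration. It models a cache that dies after pushing some number of blocks to the backing store, and checks whether the checkpoint C ever lands without the transaction (A, B) it vouches for.

```python
from itertools import permutations

# Hypothetical block names from the example above:
# A and B are the transaction, C is the checkpoint that vouches for them.
WRITES = ("A", "B", "C")


def destage(order, blocks_before_failure):
    """Blocks that reached the backing store before the cache died."""
    return set(order[:blocks_before_failure])


def is_consistent(on_disk):
    """C may only be on disk if A and B are both there too."""
    return "C" not in on_disk or {"A", "B"} <= on_disk


def ordered_outcomes():
    """Destage in write order, failing after 0, 1, 2, or 3 blocks."""
    return [is_consistent(destage(WRITES, n)) for n in range(len(WRITES) + 1)]


def unordered_failures():
    """Every (destage order, failure point) that leaves C without A and B."""
    return [
        (order, n)
        for order in permutations(WRITES)
        for n in range(len(WRITES) + 1)
        if not is_consistent(destage(order, n))
    ]


if __name__ == "__main__":
    print(ordered_outcomes())
    print(unordered_failures())
```

When the blocks go out in write order, every failure point leaves a prefix of the writes on disk, so the database can always roll back cleanly. Once the order is allowed to vary, some failure points leave C on disk without B, which is exactly the bad case described above.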
In CloudArray, we’ve got an added complication: our backing store is not a local physical drive. It’s a massively scalable set of redundant data centers probably located a thousand miles away from our cache. The performance difference between our local devices and the cloud is several orders of magnitude. So how can we maintain consistency?
The answer lies in our rather complex representation of block devices as objects. First, we notice that strict write ordering is not an absolute necessity. We simply need to ensure that our data in the cloud represents some state that existed in our virtual volume, so that if C is present in the cloud, then (A, B) is there, too, but we don’t need to represent each of the intermediate states (A), (A, B), (A, B, C). Then, we have to partition our incoming data into sequences that can represent transitions between these states: these sequences are what we call a flush, and we try to design those partitions to maximize bandwidth utilization while also minimizing the temporal distance between state transitions. Finally, after we’ve transmitted a flush to the cloud, we have to perform an atomic commit on our representation, so that the new state of the cloud is entirely consistent.
And we have to do that in a way that is mindful of the architecture of cloud storage systems, which are often designed around the (not at all scary and in fact quite cool in a nerdy way, in spite of what some people say) eventual consistency model.
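The flush-then-commit idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not CloudArray's actual object layout: ToyCloud, its manifest, the key naming scheme, and the fail_after parameter are all assumptions made up for this example, and the sketch ignores eventual consistency entirely. The point it shows is that uploaded objects stay invisible until one final step swaps in a new manifest, so a reader of the cloud copy sees either the old state or the new one, never a half-transmitted flush.

```python
class ToyCloud:
    """A stand-in object store with a single committed manifest (hypothetical)."""

    def __init__(self):
        self.objects = {}   # object key -> data
        self.manifest = {}  # committed view: block address -> object key

    def put(self, key, data):
        self.objects[key] = data

    def commit(self, new_manifest):
        # One reference swap stands in for the atomic commit.
        self.manifest = new_manifest

    def read_volume(self):
        """The volume as a reader of the cloud copy would see it."""
        return {addr: self.objects[key] for addr, key in self.manifest.items()}


def flush(cloud, flush_id, writes, fail_after=None):
    """Upload one flush, then commit it.

    writes is an ordered list of (address, data) pairs. fail_after, if set,
    simulates the network dropping after that many uploads, before the commit.
    Returns True only if the commit happened.
    """
    staged = dict(cloud.manifest)  # start from the last committed state
    for i, (addr, data) in enumerate(writes):
        if fail_after is not None and i >= fail_after:
            return False  # uploaded objects are orphaned, manifest untouched
        key = f"flush{flush_id}/{addr}"
        cloud.put(key, data)
        staged[addr] = key
    cloud.commit(staged)
    return True


if __name__ == "__main__":
    cloud = ToyCloud()
    writes = [(0, "A"), (1, "B"), (2, "C")]
    print(flush(cloud, 1, writes, fail_after=2), cloud.read_volume())
    print(flush(cloud, 2, writes), cloud.read_volume())
```

A flush that dies partway through leaves some orphaned objects behind but does not change what the volume looks like from the cloud side. Intermediate states inside a flush are never visible, which matches the observation that (A), (A, B), and (A, B, C) do not each need their own representation.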
What’s all that got to do with my progress bar?
Well, in order to make sure that our cloud data maintains consistency, especially in the presence of sometimes quite flaky networks, we can’t clean out our cache until we’ve successfully committed and verified the most recent state transition, i.e. the last flush. So my progress bar is not really indicating the amount of data that’s been emptied out of the cache: it only tells me how much of the most recent state has been transmitted to the cloud. The data can’t be marked clean in the cache until the actual, final commit has been completed.
So what happens when the flush completes? Let’s see:

Huh. There it is. The cache now has only 27.1 gigabytes of dirty pages left. Mission accomplished.
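The accounting behind those three snapshots can be sketched in Python. Again, this is a toy model and not CloudArray's implementation: ToyCache and its method names are hypothetical, and sizes are kept in whole megabytes (assuming 1 gig = 1000 MB) just to keep the arithmetic exact. The figures plugged in are the ones from this backup: 35.1 gigs dirty, one 8-gig flush.

```python
class ToyCache:
    """Tracks dirty data and one in-flight flush (hypothetical model)."""

    def __init__(self, dirty_mb):
        self.dirty_mb = dirty_mb
        self.flush_size_mb = 0
        self.flush_sent_mb = 0

    def start_flush(self, size_mb):
        self.flush_size_mb = size_mb
        self.flush_sent_mb = 0

    def transmit(self, mb):
        # Sending data moves the progress bar but cleans nothing.
        self.flush_sent_mb = min(self.flush_size_mb, self.flush_sent_mb + mb)

    def progress(self):
        """Fraction of the current flush that has been transmitted."""
        return self.flush_sent_mb / self.flush_size_mb

    def commit(self):
        # Only a fully transmitted flush can be committed, and only the
        # commit marks its data clean in the cache.
        if self.flush_sent_mb < self.flush_size_mb:
            raise RuntimeError("flush not fully transmitted")
        self.dirty_mb -= self.flush_size_mb
        self.flush_size_mb = 0
        self.flush_sent_mb = 0


if __name__ == "__main__":
    cache = ToyCache(dirty_mb=35_100)
    cache.start_flush(8_000)
    cache.transmit(7_500)
    print(cache.progress(), cache.dirty_mb)  # mostly done, nothing cleaner
    cache.transmit(500)
    cache.commit()
    print(cache.dirty_mb)
```

The dirty figure drops in one step at commit time, from 35,100 MB to 27,100 MB, rather than draining gradually as the flush is transmitted.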
And if my CloudArray were to experience some kind of catastrophe right now, like some dastardly CTO yanking out a cache storage device, what would happen? Once I restored it to operation, then Time Machine would pull the nice, consistent image out of the cloud, notice the missing 27.1 gigs, and pick right up from there. Like I said, it’s a nice piece of software, but it does rely on consistent storage.
¹It’s pretty easy to set up a Time Machine backup using the Studio Network Solutions globalSAN iSCSI initiator for OS X: just install it, point it at a CloudArray, and voilà! Up pops whatever capacity I need. Launch Time Machine, set the CloudArray volume as the target disk, and I’ve got a whole bunch of progress bars to stare at.


