
Hardware and software have co-evolved, so that disks provide an illusion of error-free operation until they throw in the towel and die. And they have consistent performance. This has worked OK so far.

With network filesystems (eg. NFS) you can choose to have a timeout or network error returned to the application as an I/O error (a soft mount, -o soft; the -o intr option separately lets signals interrupt a hung request). This is rarely used since applications aren't written to deal with those errors. So can't really blame the OS here either.



On one hand you're right: when block devices fail, they're pretty much gone--if you've ever tried to read from a bad sector you'll know this firsthand. It's worse still on SSDs, where the whole disk can fail to show up on the bus.

That said, I'm not sure I agree with the idea that we got away with a lack of error handling because disks had consistent performance. Magnetic disks have always had incredibly inconsistent random IO performance, and even inconsistent performance between different parts of the platter(s). And in the spirit of co-evolution, we engineered around it: OS disk caches are critical to decent performance on HDDs.

I think it's not that we found disks to be consistent, as much as our solution to their suboptimal behaviors was caching, rather than error reporting/timeouts. I believe this is because caching was the most transparent approach; a good cache makes a variable-speed disk look just like an ideal disk, so an application can be written assuming the disk is perfect.

Interestingly, as distributed systems have evolved, we've ended up having to engineer the error-handling constructs that might have been used for block devices; we see them in network filesystems (as mentioned) as well as most other network services. Applications have been designed to deal with errors. We just haven't propagated those constructs down to the disk devices in the recent past.
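One such construct, sketched minimally: the retry-with-backoff wrapper that network clients routinely use, and that block-device callers never did. Names and parameters here are illustrative, not from any particular library:

```python
import time

def with_retries(op, attempts=3, backoff=0.01):
    """Run op(), retrying transient OSErrors with exponential backoff.

    This is the kind of explicit error handling distributed systems
    forced applications to adopt; disk reads historically assumed
    they could never fail this way.
    """
    for i in range(attempts):
        try:
            return op()
        except OSError:
            if i == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(backoff * (2 ** i))
```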


> Magnetic disks have always had incredibly inconsistent random IO performance

From a non-realtime app POV disk seek performance is pretty consistent: you get 5-25ms seek times centered around 10ms. That's especially true in contrast to network-backed storage, where you get to contend with hiccups and contention from other users.

OS disk caching came about for a different reason.



