You can think of each host as its own Markov chain: it may be serving, have hidden or diagnosed problems at various confidence levels, be en route to one of the remedial processes (the reboots/reinstalls/repairs I cited above, to simplify), require software/firmware changes, or undergo scheduled or opportunistic diagnostics (e.g. periodic deep screening for otherwise silent problems).
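A toy sketch of that per-host chain, with made-up states and transition probabilities (none of these numbers come from a real fleet):

```python
import random

# Hypothetical host states and per-step transition probabilities.
# Illustrative numbers only, not real fleet data.
TRANSITIONS = {
    "serving":    [("serving", 0.97), ("suspect", 0.02), ("diagnosing", 0.01)],
    "suspect":    [("serving", 0.50), ("diagnosing", 0.50)],
    "diagnosing": [("serving", 0.30), ("rebooting", 0.40), ("repair", 0.30)],
    "rebooting":  [("serving", 0.90), ("repair", 0.10)],
    "repair":     [("serving", 1.00)],
}

def step(state, rng):
    """Sample the next state from the current state's transition row."""
    r = rng.random()
    cum = 0.0
    for nxt, p in TRANSITIONS[state]:
        cum += p
        if r < cum:
            return nxt
    return state  # guard against floating-point rounding at cum ~ 1.0

def simulate(steps, rng=None):
    """Count time steps spent in each state for one host."""
    rng = rng or random.Random(0)
    state, counts = "serving", {}
    for _ in range(steps):
        counts[state] = counts.get(state, 0) + 1
        state = step(state, rng)
    return counts
```

Running `simulate(10_000)` gives a rough occupancy per state; the real chains are of course time-inhomogeneous, which is the point of the next paragraphs.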
Repair workflows are even more complicated and depend on the specifics of the fault and on how components are connected (e.g. a system diagnosed with a missing GPU may really have a cable or an interposer card that is not seated correctly). Repair time is also modulated by work shifts and by the details of datacenter logistics.
Parts can become bottlenecks too. I remember one time in the early TPUv3 days: we had delays recovering from a large incident because of false-positive diagnoses of a $4 fan that was used widely across all systems.
Add that nowadays systems are not one host plus some attached cards, but contain multiple nodes: the simplest combination is a main compute node plus a smart NIC / IPU, but some systems can be a lot more complicated.
So these alone form hierarchical Markov chains, with inhomogeneous arrival times and service times that are themselves time-dependent functions. There are also a lot of long-term memory effects, so the ensemble average is not necessarily the time average. Chaotic behavior, in the mathematical sense, is common.
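The ensemble-average-vs-time-average point can be illustrated with a toy "frozen disorder" model: if some hosts are permanently lemons, no single host's history looks like the fleet-wide mix (the lemon fraction and failure rates below are invented purely for illustration):

```python
import random

# Toy illustration of ensemble average != time average: each host is
# permanently "good" (fails rarely) or a "lemon" (fails often) -- a frozen
# long-term memory effect, so one host's own history never matches the
# fleet-wide mixture. All rates are made up.
def host_failure_rate(is_lemon):
    return 0.10 if is_lemon else 0.01

def time_average(is_lemon, steps, rng):
    """Fraction of steps one host spends failed, over its own history."""
    rate = host_failure_rate(is_lemon)
    return sum(rng.random() < rate for _ in range(steps)) / steps

rng = random.Random(42)
lemon_fraction = 0.2
# Fleet-wide (ensemble) mean failure rate at one instant:
ensemble_avg = lemon_fraction * 0.10 + (1 - lemon_fraction) * 0.01
# Long-run time average for one good host:
good_host_avg = time_average(False, 100_000, rng)
```

The good host's time average converges to 0.01 while the ensemble average sits near 0.028; no amount of watching one host recovers the fleet statistic.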
Systems are built from field-replaceable units (FRUs): in some generations, you can't swap one GPU in a server that has 8; you have to swap a whole block of them. You can choose when to repair, to maximize TCO/$ based on usage patterns (how many users want all GPUs vs. a smaller number).
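A hedged back-of-envelope of that repair-timing tradeoff (every number here is invented: the 4-hour repair window, the per-GPU-hour value, the demand split):

```python
# Hypothetical sketch: when to pull an 8-GPU server whose FRU swap takes
# the whole block down. Illustrative numbers only.
def expected_loss(defer_hours, degraded_gpus, demand_full_node, gpu_hour_value=1.0):
    """Compare value lost by running degraded vs. value lost by repairing now.

    demand_full_node: fraction of demand that needs all 8 GPUs (those jobs
    can't use the degraded node at all).  Returns (defer_loss, repair_loss).
    """
    usable = 8 - degraded_gpus
    # Deferring: the broken GPUs are always lost, and the usable ones are
    # lost for the fraction of demand that insists on a full node.
    defer_loss = defer_hours * gpu_hour_value * (
        demand_full_node * usable + degraded_gpus)
    # Repairing now: the whole node is down for the repair window.
    repair_hours = 4  # assumed repair window
    repair_loss = repair_hours * gpu_hour_value * 8
    return defer_loss, repair_loss
```

With `expected_loss(24, 1, 0.0)` (nobody needs full nodes) deferring a day is cheaper than a repair; with `expected_loss(24, 1, 0.5)` the balance flips. Real models have to fold in the usage-pattern distributions the paragraph mentions.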
Some systems (e.g. TPU pods) have links between accelerator trays, both within a physical rack and across racks. So the usefulness of neighbor systems is reduced while a host/tray is being serviced, and you can get the equivalent of deadlock/livelock in repair dependencies.
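That deadlock is just a cycle in a dependency graph, so detecting it is classic DFS cycle-finding (the tray names are hypothetical):

```python
# Sketch: repair dependencies as a directed graph, where an edge a -> b
# means "servicing a safely requires b to be up".  A cycle means there is
# no safe repair order -- the scheduling equivalent of deadlock.
def find_cycle(deps):
    """Return True iff the dependency graph contains a cycle (DFS coloring)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in deps}

    def visit(n):
        color[n] = GRAY                      # on the current DFS path
        for m in deps.get(n, []):
            if color.get(m, WHITE) == GRAY:  # back edge: cycle found
                return True
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK                     # fully explored
        return False

    return any(color[n] == WHITE and visit(n) for n in list(deps))
```

E.g. `find_cycle({"trayA": ["trayB"], "trayB": ["trayA"]})` is the two-tray standoff: neither can be serviced while the other is down.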
Cluster scheduling and management is designed to maximize service levels while minimizing disruptions. This also implies that some disruptions (e.g. when you do certain repairs) are driven by which workloads you run. Workloads are power-law distributed in size, so you get small-world network dynamics and the potential for supercritical behavior: both in disruptions (picture jobs preempting other jobs as a graph) and in risks (add to the previous picture one job that triggers a hardware problem).
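A quick way to see why the power law matters: sample job sizes from a Pareto-like distribution and look at how concentrated the footprint is (the exponent and counts are illustrative, not measured):

```python
import random

# Sketch of power-law job sizes: with a heavy tail, a handful of jobs
# touch a disproportionate slice of the fleet, so preempting or losing
# one of them perturbs many machines at once.  Parameters are made up.
def sample_job_sizes(n_jobs, alpha=1.5, rng=None):
    rng = rng or random.Random(7)
    # Discretized continuous Pareto draw (size in machines, >= 1).
    return [max(1, int(rng.paretovariate(alpha))) for _ in range(n_jobs)]

sizes = sorted(sample_job_sizes(10_000), reverse=True)
total = sum(sizes)
# Share of all machine-slots held by the biggest 1% of jobs:
top_1pct_share = sum(sizes[:100]) / total
```

With a tail exponent this heavy, the top 1% of jobs hold far more than 1% of the machines, which is exactly the precondition for the cascading-preemption picture above.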
Multiply this by several thousand hosts to get the size of a datacenter cluster. Add supporting compute+storage, networking, and power and thermal constraints. Multiply by the number of clusters (some hyperscalers have global scheduling systems that make it possible to see the whole ML fleet as one). Add rare events, because at this scale you start to think about utility and electrical-grid failures.
Figuring out what to do in control systems for the current ML fleet is one problem. Simulating what kind of service future systems, a few generations of hardware, software, and datacenter design down the line, should provide, so you can define what you can offer, influence the designs, and make the right investments... that's more complicated. Both problems are my current job.