
The Adventure of the Errant Hardware

September 19, 2023 — Erich Elsen, Curtis Hawthorne, Arushi Somani

A tale of mystery, intrigue and derring-do. We recount our investigation into curious errors occurring during our large training runs–clues found, causes deciphered and solutions implemented.


If you train big models (thousands+ of accelerators) for long enough, you’ll see some things go bump in the night. Things that keep you up. Things that make you never trust your hardware quite the same ever again. We want to share some scary things we’ve seen with the community and hopefully turn an unknown unknown into a known one.

The Valley of Fear

Which of these learning curves scares you the most?

It should be the third one. The first one has slowly diverged–most likely something about our training dynamics isn’t working out quite right–colloquially we say the network has “exploded.” This is usually our fault. The second one has hit a NaN (“Not a Number”–the bogeyman of numerical calculations) out of nowhere–this probably indicates a hardware problem[1]–but at least we know something is wrong! The third one looks fine (or did it flatten out too much?), but something might have silently gone wrong and we don’t know. One number amongst the billions involved in our equations might change a little, or even a lot, and we wouldn’t immediately notice. This Happens. ECC won’t save you. And it should scare you. This is a story of how we noticed this happening while training our models and some protective measures you can take to make sure it doesn’t happen to you.

When something goes silently wrong, the errors can slowly compound over time and eventually lead to a situation like the first learning curve where the training blows up but it isn’t clear why; or, perhaps worse, it can lead to the learning being slower or stopping altogether.

We Expect Perfection

Neural network training is governed by a few simple equations. We can reduce training a distributed data parallel neural network to just three, where the _i subscript means this is local to machine i and no subscript means the quantity is the same across all machines:

  1. Each machine computes local gradients given local inputs and a consistent global view of the parameters.
    LocalGrad_i = f(Inputs_i, Targets_i, Params)
  2. Sum up all the local gradients and distribute that sum to each machine, so there is a consistent global view of the gradients.
    GlobalGrad = all_reduce(LocalGrad_i)
  3. Each machine can now locally update the parameters and optimizer state under the assumption that the exact same calculation will happen on all machines.
    NewParams, NewOptimState = g(Params, OldOptimState, GlobalGrad)
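
In code, one data-parallel training step corresponds to those three equations roughly as in the sketch below. This is an illustrative PyTorch-flavored sketch, not our training code: model, optimizer and loss_fn stand in for whatever your stack provides, torch.distributed is assumed to already be initialized, and in practice a wrapper such as DistributedDataParallel handles step (2) for you (typically averaging rather than summing).

import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs_i, targets_i):
    # (1) Compute local gradients from this machine's shard of the data,
    #     starting from an identical copy of the parameters on every machine.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs_i), targets_i)
    loss.backward()

    # (2) Sum the local gradients across machines so every machine ends up
    #     with the same global gradient.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)

    # (3) Apply the same update everywhere; since the inputs to the update are
    #     identical on every machine, the parameters and optimizer state stay
    #     identical on every machine, provided the hardware is perfect.
    optimizer.step()
    return loss.item()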

When we translate those equations into computer code, we expect the computers to be perfect. If I multiply two numbers, I always get the same and correct answer. If I save a value to memory and read it back later, I get the value I saved. The equations have no notion of “sometimes I get the right answer and sometimes it’s off by a bit[2].”

For example, modern accelerators can have nearly 100 billion transistors, each changing state a billion times a second. For years on end! Software is generally written assuming that nary a mistake is made. To put this in perspective, during one training run, this level of perfection is like expecting all eight billion people on the planet to copy the complete works of Shakespeare over and over again for the 14 billion years the universe has existed and not have a single person make a mistake! That this expectation is even possible is pretty mind-boggling.

The Game is Afoot

We started our investigation into this issue upon observing a training curve that looked like case #2–a sudden NaN. The first step was to restart training from the last good checkpoint (using the same machines). This time the curve looked identical but the NaN occurred on a different iteration. Suspect, but we didn’t have smoking gun evidence of a hardware issue just yet.

The problem is that training isn’t (usually) deterministic[3]; our chips can both be perfect and yet also produce slightly different outputs given the same inputs. This is possible because floating point arithmetic doesn’t associate–that is, (a+b)+c is not necessarily the same as a+(b+c). Processors like GPUs and TPUs are fast because they are parallel, and adding up a long list of numbers in parallel means the order of the additions might not always be the same. No ordering is more correct than another and usually the differences are tiny. Worst case examples where the differences are big can be constructed, but they are not encountered in practice when training neural networks[4].
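
A tiny example (plain Python, simulating a change of reduction order by shuffling the summands) makes the non-associativity concrete:

# Regrouping a floating point sum changes the rounding, and hence the result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0, because the 1.0 is absorbed when added to -1e16 first

# A parallel reduction that sums the same values in a different order behaves
# similarly: neither ordering is more correct than the other, but the last few
# bits of the result can differ.
import random
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
s1 = sum(xs)
random.shuffle(xs)
s2 = sum(xs)
print(s1 == s2)      # usually False: the two orderings accumulate rounding differently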

Thankfully, it is possible to make training completely deterministic on both TPUs and GPUs. Depending on your framework and hardware, this can range from trivial to difficult and come with anywhere from a negligible to a large performance penalty. We restarted our run in deterministic mode and hit a NaN on a different iteration each time; the loss values matched exactly up until the first NaN.
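
For example, in PyTorch on GPUs the relevant knobs look roughly like the sketch below; this is an illustration only, and TPUs and other frameworks have their own equivalents with their own costs.

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Must be set before the first cuBLAS call to make CUDA matmuls deterministic.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(1234)                    # fix the random number generators
torch.use_deterministic_algorithms(True)   # error out on nondeterministic ops
torch.backends.cudnn.benchmark = False     # don't autotune into different kernels

# The data order has to be fixed too: seed the DataLoader's shuffling explicitly.
dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
generator = torch.Generator()
generator.manual_seed(1234)
loader = DataLoader(dataset, batch_size=32, shuffle=True, generator=generator)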

Then we started a run where, after some number of iterations, we observed something disturbing: instead of a NaN, the loss value was just slightly different from all previous runs. This was the first indication that a training curve of type #3 was possible and likely occurring. Now we had our smoking gun. The hardware isn’t perfect after all.

But What About ECC?

Compute hardware designers are aware of the possibility of errors occurring, and enterprise-grade hardware (CPUs and GPUs) uses error-correcting codes (ECC) to protect its memory against random corruptions. Sometimes the corruptions are caused by external factors like gamma radiation from outer space interacting with a computer chip[5]. Sometimes the corruptions are caused by flaws in the chip itself or its local environment.

ECC protects against values stored in memory being changed during reads, writes, or simply while being stored. It does this using a Hamming code; usually the variant[6] where one corrupted bit can be detected and corrected while two corrupted bits can only be detected. If a two-bit error is detected, the hardware throws a fault and your program crashes to a halt.
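
As a toy illustration of SECDED (real ECC memory protects much wider words, e.g. 64 data bits with 8 check bits, but the logic is the same), here is a Hamming(7,4) code extended with an overall parity bit:

def encode(d):
    """Four data bits -> an 8-bit SECDED codeword (Hamming(7,4) plus overall parity)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4              # parity over codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4              # parity over codeword positions 4,5,6,7
    code = [p1, p2, d1, p3, d2, d3, d4]
    return code + [sum(code) % 2]  # final bit: parity over the whole codeword

def decode(word):
    """Returns (status, recovered data bits or None)."""
    c, overall = list(word[:7]), word[7]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3           # position of a single flipped bit
    parity_ok = (sum(c) + overall) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "no error"
    elif syndrome != 0 and not parity_ok:
        c[syndrome - 1] ^= 1                  # one flipped bit: fix it in place
        status = f"corrected single-bit error at position {syndrome}"
    elif syndrome == 0 and not parity_ok:
        status = "corrected a flip of the overall parity bit"
    else:
        # Two flipped bits: detectable but not correctable; the hardware raises a fault.
        return "uncorrectable double-bit error", None
    return status, [c[2], c[4], c[5], c[6]]

word = encode([1, 0, 1, 1])
one_flip = word.copy();  one_flip[5] ^= 1
two_flips = word.copy(); two_flips[1] ^= 1; two_flips[6] ^= 1
print(decode(one_flip))    # corrected, data recovered intact
print(decode(two_flips))   # detected only; this is when the program halts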

But our program didn’t crash to a halt, and the ECC counters, which keep track of how often single- and double-bit corruptions occur, showed nothing unusual.

Whatever Remains, However Improbable…

We must conclude that somewhere in the hardware a calculation is going wrong, but not in a part that is protected by ECC. This is particularly scary because while the errors that lead to NaNs are hard to miss, the errors that cause slightly different loss values are completely silent. Had this happened in our earlier runs? What effect did it have?

Our most pressing problem was figuring out on which machine the miscalculation was happening so that we could remove it and resume training on good nodes. Luckily, we now had a fully deterministic training job from our previous investigation. Run on different nodes, the deterministic job should produce exactly the same results, as long as we’re willing to wait. We launched training jobs on every node (each using all the accelerators attached to that node) and waited for a node to produce a different result.
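
In pseudocode, the screening of each node boils down to something like the sketch below, where run_deterministic_steps is a hypothetical stand-in for launching the deterministic job on that node and returning the resulting parameters as bytes, and the reference digest was produced once on a node we already trust:

import hashlib

def digest(params_bytes: bytes) -> str:
    """A short fingerprint of the training state; only equality matters."""
    return hashlib.sha256(params_bytes).hexdigest()

def node_is_healthy(run_deterministic_steps, reference_digest: str, steps: int = 1000) -> bool:
    """True if this node reproduces the known-good result bit-for-bit."""
    final_params = run_deterministic_steps(steps)   # deterministic by construction
    return digest(final_params) == reference_digest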

Within the first 1,000 seconds a machine produced a different result! We now know this is typical: if an error is going to occur, in our experience it usually happens within 1,000 seconds, only rarely takes as long as 10,000 seconds, and almost never shows up after 10,000 seconds. Replacing the machine and restarting the job led to a job that ran for weeks without encountering a NaN.

Checking only the outputs of matrix multiplication[7] sometimes corroborates the finding, but sometimes it does not! A training run exercises the entire machine more thoroughly, which surfaces additional problems (host-to-device transfers and vice versa, machine-to-machine communication, the different memories and circuits inside the chip, etc.).

You Know My Methods. Apply Them, Watson.

Depending on where in the calculation the silent error occurs, it can have different symptoms and detection possibilities.

If the error occurs in equation (1) during the calculation of the local gradients, then all nodes will have an incorrect GlobalGrad after (2), and the final update in (3) means all nodes will have consistent, but incorrect, parameters and optimizer state. There’s no way to detect this while training a model without doing extra work (i.e. repeating the computation)[8]. The severity would depend on the magnitude of the error. A tiny error on the order of the smallest possible difference would likely be under the noise floor of stochastic gradient descent. A large error could destabilize the run or permanently mess up the optimizer state. Gradient clipping might help reduce the likelihood of the latter possibility.

If the error occurs in equation (3) during the update of the parameters and optimizer state, then one node will have different values from all others. This is definitely not ideal. If the difference is tiny we might be lucky and the network might still train without noticeable degradation, but if the error was “big” then a training curve like #1 is in our future.

The good news is that we can detect an error like this with relatively minimal cost during training. We can calculate a checksum of our parameters and optimizer state every N iterations and communicate a small number of bytes to allow each machine to check if they have the same state as every other machine[9].
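
A minimal sketch of this check, assuming PyTorch with torch.distributed already initialized and fully replicated (non-sharded) parameters and optimizer state; this illustrates the idea rather than reproducing our implementation:

import hashlib
import torch
import torch.distributed as dist

def state_digest(model, optimizer) -> str:
    """Hash the exact bytes of this rank's parameters and optimizer state."""
    h = hashlib.sha256()
    tensors = list(model.parameters())
    for per_param_state in optimizer.state.values():
        tensors += [v for v in per_param_state.values() if torch.is_tensor(v)]
    for t in tensors:
        # Reinterpret the raw bytes so the hash is exact for any dtype.
        raw = t.detach().flatten().contiguous().view(torch.uint8).cpu()
        h.update(raw.numpy().tobytes())
    return h.hexdigest()

def check_replicas_agree(model, optimizer) -> None:
    """Gather every rank's digest (a few dozen bytes each) and compare them."""
    digests = [None] * dist.get_world_size()
    dist.all_gather_object(digests, state_digest(model, optimizer))
    if len(set(digests)) != 1:
        raise RuntimeError(f"replicated training state has diverged: {digests}")

Calling check_replicas_agree every N optimizer steps turns a silent divergence of this kind into a loud failure, at the cost of one local hash and one tiny collective.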

After implementing this technique, we expected to occasionally see one machine report a different checksum from all other machines. And we did see some instances of that.

But then we saw the following pattern:

How could a third of the machines end up with the same wrong answer?! Or did two-thirds of them end up with the same wrong answer?! We don’t know which checksum is correct. The probability of the same error occurring independently on each of those machines is indistinguishable from 0. To explain this, we need to take another look at the all_reduce of equation (2) and break it into its pieces: a reduce_scatter followed by an all_gather.

An all_reduce takes in a local array on each machine and returns, on every machine, the sum of all the arrays. The most common algorithm for doing this is a variant of the “ring allreduce”, which we’ll describe below. In practice, modifications of this basic algorithm might be used for better performance.

AllReduce

The starting state is three different arrays on three different machines.

Each machine passes one chunk to its clockwise neighbor, which sums the incoming chunk with its local copy.

Now the partial sums are passed along again to the clockwise neighbor, which again sums the incoming chunk with its local copy. Notice that after doing this, we have the final summed array, but it’s split into three pieces, one on each of the machines. This operation is called a reduce-scatter.

No more math is necessary, but data needs to move around until it’s all in the right place. So we are back to passing things around the ring. Each machine passes its fully summed chunk clockwise, where it overwrites the value that had been there.

Finally, the newly received chunk is passed along one more time. Now each machine has the fully summed array. This operation is called an all-gather.

If an error occurs during the reduce-scatter, it can’t be detected[10]; a summed chunk will just be wrong. However, if a corruption happens while data is being passed from one machine to another during the all-gather, that error will propagate to every machine downstream of the corruption in the ring. That matches the pattern of checksums we see exactly!
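
To make this concrete, here is a toy NumPy simulation of the ring algorithm just described (a reduce-scatter followed by an all-gather) for three machines, with an optional corruption injected into a single transfer during the all-gather phase. It is a sketch of the idea, not how a real collectives library is implemented.

import numpy as np

def ring_allreduce(local_arrays, corrupt=None):
    """local_arrays: one equal-length 1-D array per machine.
    corrupt: optional (allgather_step, sending_machine) whose transfer is corrupted."""
    n = len(local_arrays)
    # chunks[m][c] is machine m's current copy of chunk c of the array.
    chunks = [list(np.array_split(np.asarray(a, dtype=float), n)) for a in local_arrays]

    # Reduce-scatter: pass partial sums clockwise; after n-1 steps machine m
    # holds the fully summed chunk (m + 1) % n.
    for step in range(n - 1):
        sends = [((m - step) % n, chunks[m][(m - step) % n].copy()) for m in range(n)]
        for m, (c, data) in enumerate(sends):
            chunks[(m + 1) % n][c] += data

    # All-gather: pass the completed chunks clockwise; receivers overwrite
    # their stale local copies with the incoming data.
    for step in range(n - 1):
        sends = []
        for m in range(n):
            c = (m + 1 - step) % n
            data = chunks[m][c].copy()
            if corrupt == (step, m):
                data[0] += 1e6          # stand-in for a bit flip in transit
            sends.append((c, data))
        for m, (c, data) in enumerate(sends):
            chunks[(m + 1) % n][c] = data

    return [np.concatenate(chunks[m]) for m in range(n)]

rng = np.random.default_rng(0)
arrays = [rng.integers(0, 10, size=9).astype(float) for _ in range(3)]
good = ring_allreduce(arrays)
bad = ring_allreduce(arrays, corrupt=(0, 0))   # corrupt machine 0's first all-gather send
for m in range(3):
    print(f"machine {m} matches the correct result: {np.array_equal(good[m], bad[m])}")

Corrupting the first all-gather pass leaves the sender with the correct chunk it computed itself, while the two machines downstream of it receive (and forward) the identical wrong value; corrupting the last pass taints only a single machine. Both outcomes match the checksum patterns we observed.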

A Reason For Hope

While we cannot immediately identify all errors when they occur during training, it is likely that hardware which produces an error that we cannot identify will also produce an error that we can. Once an error is identified, we can roll back to a checkpoint far enough in the past that we feel certain it is “uncontaminated” by errors.

The Last Bow

Hopefully, our recounting convinces you to check on your accelerators from time to time. Despite extensive testing, errant hardware can still slip through the cracks. These types of errors are referred to as Silent Data Corruption (or SDCs for short) in the industry, but previously they’ve only been discussed as a CPU-related problem[11]. We identified some ways to check for these errors during training, which can alert you to problems at minimal cost, before they have a chance to propagate. Additionally, it might be wise to run deterministic training jobs on every machine in your fleet at some interval to isolate problematic machines.

Looking forward, one avenue to continue the exponential growth of compute applied to neural network training is to embrace the possibility of mistakes. The idea of designing circuits that can make mistakes in order to reduce power consumption, increase speed, or both, is not new[12], but to our knowledge it has not been applied to chips designed for neural network training and/or inference. Co-designing such chips and the algorithms to take advantage of them for neural network training would be an exciting area of future research. But that is a mystery for us to solve another day.

Citation

If you find this post useful and wish to cite it in your work, please use the following BibTeX:

@misc{adept-sdc,
  author = {Elsen, Erich and Hawthorne, Curtis and Somani, Arushi},
  title = {The Adventure of the Errant Hardware},
  url = {https://www.adept.ai/blog/sherlock-sdc},
  year = {2023}
}

Footnotes

  1. In the days of training with fp16, it could also indicate numerical overflow. In even older days of training RNNs, it could also just indicate a training dynamics issue. Modern transformers, with proper initialization and normalization, almost never go from completely fine to NaN in one step without a hardware problem.

  2. Pun intended.

  3. Training can be non-deterministic for other reasons as well–the inputs might not always arrive in the same order, and auto-tuning might pick different algorithms depending on the phase of the moon, to name two more possibilities.

  4. William Kahan is most likely concerned by this assertion.

  5. From NVIDIA’s docs: “At the current density and SRAM sizes, parity protected structures are vulnerable to a low level of uncorrectable SRAM errors caused by background radiation. This low background level of uncorrectable errors is fully expected and does not indicate faulty hardware.”

  6. Known as SECDED (single error correction, double error detection).

  7. For example, using a tool such as GPU burn.

  8. The function f of equation (1) can be broken into two separate functions–forward and backward–and sometimes as a memory saving technique the forward function is recomputed when computing the backward one. If doing this, then we could checksum the first forward compute results and double check that we get the same answer the second time. However, this mode is falling out of favor precisely because it requires extra computation which slows down training.

  9. Use of ZeRO-Stage 3, or equivalently FSDP Full Sharding, makes this technique impossible as the state is no longer duplicated, so an error in the update calculation can no longer be detected.

  10. Can’t be detected without repeating work, that is.

  11. For example, see this article from Google Cloud and this one from Meta’s production engineering team.

  12. See Algorithmic Methodologies for Ultra-Efficient Inexact Architectures for Sustaining Technology Scaling. Geoff Hinton also imagines Mortal Computation, where each piece of hardware is so unique that networks trained on one cannot be transferred to another.
