Dagmawi Babi
6.43K subscribers
14.8K photos
1.96K videos
231 files
2.06K links
Believer of Christ | Creative Developer.

Files Channel: https://t.me/+OZ9Ul_rSBAQ0MjNk

Community: @DagmawiBabiChat
Download Telegram
Cores that don't count.pdf
131.8 KB
Here's a very interesting paper I read this week

What happens if your CPU gets something wrong? If it wakes up one day and decides 2+2=5?

Well, most of us will never have to worry about that. But if you work at a company the size of Google, you do, which is why this paper on "mercurial cores" is so fascinating.

What the authors report--and supposedly this is common knowledge at the hyperscalers--is that a couple cores per several thousand machines are "mercurial." Due to subtle manufacturing defects or old age, they give wrong answers for certain instructions. These can cause all sorts of impossible-to-diagnose issues. Some rare problems at Google that were traced back to bad CPUs include:
• Mutexes not working, causing application crashes
• Silent data corruption
• Garbage collectors targeting live memory, causing application crashes
• Kernel state corruption causing kernel panics

What makes CPUs go bad? It's very hard to tell. The authors posit that issues are becoming more frequent as CPUs get more complex, but there aren't solid numbers behind that. There are certainly strong relationships between frequency, temperature, voltage, and bad CPU behavior--most mercurial CPUs only cause problems under very specific conditions, but those conditions vary from CPU to CPU. Age is another source of problems, as older CPUs are more likely to exhibit problems.

Bad CPUs are an especially serious problem because they're very hard to detect. If cosmic rays flip bits in storage or on the network, that can be detected through error coding. But there's no analogy for a CPU that allows cheap online verification of its correctness. Instead, the best detection techniques involve monitoring for symptoms. If a core exhibits exceptionally high rates of process crashes or kernel panics relative to its fellows, that's a strong indication something is wrong with it. For the most critical applications, the authors propose triple modular redundancy--redoing each of its computations on three cores and majority-voting a reliable result.

#Papers #CoresThatDontCount
@Dagmawi_Babi