Silent Errors in Large Scale Computing Systems | Dimitris Gizopoulos | TEDxAthens

26 Nov 2024

Introduction to Silent Errors in Computing

  • Many people have experienced situations where a computer does not function or functions incorrectly, but it is apparent that something is wrong (41s).
  • Today's discussion focuses on a more severe issue: cases where a computer's operation is incorrect, but it is not apparent that an error has occurred (55s).
  • Imagine a scenario where a couple wants to take out a loan to buy a new house and uses a spreadsheet like Excel to calculate their monthly mortgage payment based on their income and the loan they want to take (1m33s).
  • They assume the calculated number is correct and proceed with their plans, but in some cases, the calculation is incorrect, and it is not obvious that an error has occurred because the formula for calculating the mortgage payment is not easy to verify manually (1m54s).
  • The correct result is completely different from the one provided by the computer, which can cause significant problems, such as ruining their plans or changing their schedule (2m8s).
  • This issue can occur in various types of computers, and it is essential to explore where and how these silent errors appear (2m17s).
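
The mortgage scenario above can be made concrete with a short sketch. The talk does not say which formula the spreadsheet uses; the code below assumes the standard fixed-rate annuity formula, and the loan figures (200,000 at 4% over 30 years) are hypothetical. The second function replays the loan month by month as an independent check, the kind of verification that is tedious to do by hand.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed monthly payment for a fully amortizing loan (assumed annuity formula)."""
    n = years * 12            # number of monthly payments
    r = annual_rate / 12      # monthly interest rate
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def remaining_balance(principal: float, annual_rate: float, years: int,
                      payment: float) -> float:
    """Independent check: replay the loan month by month with the given payment."""
    r = annual_rate / 12
    balance = principal
    for _ in range(years * 12):
        balance = balance * (1 + r) - payment
    return balance


# Hypothetical loan: 200,000 at 4% per year over 30 years.
payment = monthly_payment(200_000, 0.04, 30)
print(round(payment, 2))                                          # 954.83
print(abs(remaining_balance(200_000, 0.04, 30, payment)) < 0.01)  # True
```

If the payment were silently miscalculated, the replayed balance would not return to zero at the end of the loan, which is what the second check detects.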

The Growing Problem of Silent Errors in Data Centers

  • The problem of silent errors in computing systems has recently emerged on a large scale, despite decades of computing, and has been particularly prevalent in the last four years in large data centers (2m32s).
  • Although the problem exists in all computers, including laptops, mobile devices, and tablets, large data centers are particularly affected. There are over 11,000 such centers worldwide, each covering an area of around 100,000 square meters and containing approximately 100,000 computers on average (3m8s).
  • New, even larger data centers are being planned by companies like OpenAI, the company behind ChatGPT, to train artificial intelligence models. These will require massive amounts of electrical power, with some centers expected to have a hundredfold increase in power consumption (3m33s).
  • These large data centers are complex structures with significant requirements for electrical power and cooling (3m50s).
  • The increasing size and complexity of these data centers make them prone to silent errors, which can have significant consequences for the users who rely on them (3m54s).
  • Companies use these data centers to execute various programs and applications, including social networking, messaging, email, and file sharing, which almost everyone uses on a daily basis (4m10s).

Real-World Examples of Silent Errors and Their Impact

  • Large-scale computing systems, such as data centers, are used by billions of active users every month, and these systems are trusted to perform computations correctly (4m28s).
  • A disturbing incident occurred 3.5 years ago when a calculation raised the number 1.1 to the power of 53 and returned an incorrect answer of zero, whereas the correct answer is approximately 156 (4m46s).
  • This error was discovered when a dissatisfied customer noticed that a file was missing from a disk, and the system reported that the file had a size of zero, which was the result of the incorrect calculation (5m6s).
  • This incident created a significant impression and tension in the field of computer science, as it revealed that the frequency of such silent errors is higher than expected (5m26s).
  • Meta (formerly Facebook), Google, and Alibaba Cloud confirmed that they had encountered the same issue, and their findings were published and widely reported, including an article in The New York Times in the summer of 2022 (6m7s).
  • The three companies agreed that approximately 1 in 1,000 chips produces these silent errors, resulting in incorrect calculations (6m32s).
  • Modern devices such as mobile phones, tablets, and laptops each contain 20 to 30 chips, including microprocessors, and these chips can produce errors that are not immediately noticeable (7m4s).
  • These silent errors are a significant issue, as they do not cause the computer to crash or display an error message, but instead result in incorrect calculations that are not visible to the user (7m24s).
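
The figures quoted above can be reproduced in a few lines. This is only an illustrative sketch: the exact integer computation is an independent cross-check added here, not the method used in the reported incident, and the fleet estimate assumes one chip per computer, which is a simplification.

```python
# Floating-point evaluation of 1.1 raised to the power of 53, truncated to an integer.
float_result = int(1.1 ** 53)

# Independent cross-check with exact integer arithmetic: floor(11^53 / 10^53).
exact_result = 11 ** 53 // 10 ** 53

print(float_result, exact_result)  # 156 156

# Rough fleet estimate, assuming one chip per computer (a simplification):
# about 100,000 computers per data center, 1 in 1,000 chips affected.
computers_per_data_center = 100_000
expected_affected = computers_per_data_center // 1_000
print(expected_affected)           # 100
```

A healthy machine gives 156 on both paths. The reported faulty machine returned zero for the first one without raising any error.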

Microprocessors: The Source of Silent Errors

  • Microprocessors are the source of powerful computing that people rely on every day for communication, entertainment, work, and other aspects of life, including transportation systems and banking systems (7m35s).
  • These microprocessors are based on silicon and involve complex patterns created with various chemical elements, with around 60 different elements used in their construction (8m13s).
  • The manufacturing process of microprocessors involves creating electrical circuits on a silicon wafer, cutting it into small pieces, and then packaging them into units such as central processing units (CPUs) (8m23s).
  • The complexity and microscopic size of microprocessors are notable, with the entire device fitting into an area smaller than the head of a needle (9m0s).
  • The silent errors produced by these microprocessors are a significant concern, as they can occur frequently without being noticed, and can have serious consequences in critical systems (7m13s).
  • Today's chips contain more than 100 million transistors, which are microscopic but very complex structures, and the challenges of designing and manufacturing them can lead to errors (9m4s).
  • The materials used in their construction are never perfectly pure and must be handled under very clean conditions, and the metal wiring drawn on the silicon can be imperfect (9m27s).
  • The machines that manufacture these microscopic objects can also make mistakes, and the situation is even worse because a chip can age, suffer from wear and tear, and lose its initial properties when used continuously in data centers or personal computers (9m51s).
  • Silicon chips wear out because they are used 24 hours a day, 7 days a week, and there are other parameters, such as the fact that chips, even if designed to be the same, are not identical and do not have the same electrical properties (10m10s).
  • The miracle is not that chips sometimes make mistakes, but that they generally seem to work correctly. Silent errors add to that impression because, being silent, they raise no alarm (10m40s).

The Nature and Causes of Silent Errors

  • The problem discussed here does not lie in the computer's memory, which is rarely at fault, or in the disk. What makes it serious is that the errors occur without being noticed (11m8s).
  • In large-scale computing systems, an incorrect result can come from the programmer or from the processor, the central processing unit (CPU), which can make mistakes in basic arithmetic operations such as addition, subtraction, multiplication, and division (11m33s).
  • When a simple arithmetic operation is performed on a personal computer, such as adding 5 and 6, the result can be easily verified as correct or incorrect, but with more complex operations, the result may appear correct at first glance but actually be incorrect (12m11s).
  • In cases where a computer is asked to perform a division operation, the result may be incorrect, but the difference may only be noticeable in the lower digits (12m37s).
  • Computers are often used to sort large lists of numbers, but errors can occur even in these tasks, and the sorted list may not be entirely accurate (12m47s).
  • A recent example of this was when a computer was asked to sort a list of integers, and the resulting sorted list appeared correct at first, but upon closer inspection, the largest number, 498, may not have been the correct largest number (13m5s).
  • This type of error can have significant consequences, such as in a recommendation system that suggests the most suitable car to purchase, where an incorrect result could lead to an inappropriate recommendation (13m17s).
  • Ranking is a crucial process in computing, often used in various applications such as recommending the most suitable theatrical play to watch or finding the most compatible partner on a dating site, and this process is frequently utilized in our machines (13m29s).
  • Because of a silent error on a personal computer, there was a better proposal in the list of 400 or 499, possibly a better car to purchase. Such a result cannot be checked by hand, because these lists often consist of millions of numbers (13m45s).
  • The problem of silent errors in large-scale computing systems was first identified 3.5 years ago by major companies when a microprocessor produced an incorrect result without crashing or showing any noticeable signs (14m23s).
  • This incorrect result can easily spread and be used by other computers in the system, leading to multiple machines producing incorrect results and continuing to perform operations based on those incorrect results (14m41s).
  • The issue is particularly concerning in large data centers, which operate on a massive scale, often involving hundreds of thousands of computers (14m58s).
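
Although checking a large sorted list by eye is impractical, a program can check it cheaply. The sketch below is one possible approach, not something described in the talk: it verifies that the output is in order and contains exactly the same items as the input. The function name `is_valid_sort` and the sample values are hypothetical.

```python
from collections import Counter


def is_valid_sort(original, result):
    """Check that `result` is ordered and is a rearrangement of `original`."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    same_items = Counter(original) == Counter(result)
    return in_order and same_items


data = [42, 498, 7, 499, 13]

good = sorted(data)
bad = [7, 13, 42, 498]  # hypothetical corrupted output: 499 was silently dropped

print(is_valid_sort(data, good))  # True
print(is_valid_sort(data, bad))   # False
```

The corrupted list looks plausible on its own, since it is in order and ends in a large number. Only the comparison against the input reveals that the true maximum is missing. The check itself runs on the same kind of hardware, so it lowers the risk rather than removing it.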

Addressing the Challenge of Silent Errors

  • David Patterson, a prominent figure in computer architecture, is a recipient of the Turing Award, which is often described as the Nobel Prize of computing (15m5s).
  • Silent errors are a significant problem in large-scale computing systems, particularly in training artificial intelligence systems, as confirmed by Patterson (15m24s).
  • The issue arises when one of the thousands of computers in a data center produces incorrect results, creating a massive problem in determining whether to trust the generated artificial intelligence model (15m52s).
  • The incorrect result could be due to the model, the person who programmed it, or the silicon, making it challenging to identify the primary cause of the error (16m7s).
  • To address this issue, efforts are being made to produce better chips, test them more exhaustively, and perform continuous tests on personal computers and data centers to ensure they function correctly over time (16m32s).
  • These efforts come at a cost, as they consume a portion of the useful time of the computing system, and resources are spent on repeating program executions, consuming energy, and using space (17m1s).
  • Additionally, computations are performed multiple times on multiple computers, which also incurs costs, highlighting the need for effective solutions to mitigate silent errors (17m18s).
  • Continuous efforts are made to ensure a computer works correctly, but it's not easy to achieve this everywhere, as seen in the example of calculating a mortgage loan payment, where it's unlikely someone would perform the calculation on multiple computers to verify the result is correct (17m26s).
  • In critical applications where computers are dominant, such as transportation, banking, and cars, the cost of redundancy, or running a calculation multiple times on one or more computers, is high because failures are unacceptable (17m56s).
  • To address any technical problem, it's essential to measure its size and dimensions accurately before deciding how much to spend on it, as stated by Lord Kelvin, "If you cannot measure it, you cannot improve it" (18m25s).
  • The challenge lies in measuring silent errors, which are difficult to detect because they are silent, and to measure something, it needs to be observed, which is a significant challenge (18m53s).
  • Measuring silent errors is like searching for a needle in a haystack, except that the needle is not stationary: it is moving and growing, and the haystack is growing too, which makes it even harder to find (19m14s).
  • The amount of data is growing in size every year, and it is essential to have confidence in computing as it is the basis of human progress and daily life (19m28s).
  • Despite the importance of computing, it is known that 1 in 1,000 chips can make an error, but this does not necessarily mean that the underlying issues are understood (19m49s).
  • The problem of silent errors in large-scale computing systems is a significant issue that top computer engineers and scientists are working to address in universities and major computing companies worldwide (20m11s).
  • These experts are actively trying to solve the problem of silent errors, which is a challenge that keeps them up at night (20m16s).
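
The redundancy mentioned above, running the same computation several times and comparing the results, can be sketched as a simple majority vote. This is a minimal illustration under assumptions, not a description of how any particular data center works. The names `vote`, `run_redundantly`, `healthy`, and `faulty` are hypothetical, and the "machines" are ordinary functions.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority, or raise if there is none."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: results cannot be trusted")
    return value


def run_redundantly(task, workers):
    """Run the same task on every worker and accept only the majority result."""
    return vote([worker(task) for worker in workers])


def healthy(task):
    return sum(task)


def faulty(task):
    return sum(task) + 1  # hypothetical silent error: wrong answer, no crash


print(run_redundantly([1, 2, 3], [healthy, faulty, healthy]))  # 6
```

The faulty result is outvoted, but the task ran three times instead of once. That extra time and energy is the cost of redundancy described above.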
