Testing and Benchmarking of AI Compilers
This is an in-depth post on bugs and how to prevent them in AI software and AI compilers specifically. I was the software lead for TPUv3 at Google and I’ve worked on a variety of AI compilers and projects across Google, Nvidia, Amazon and Facebook.
Zero is a hard number
In my estimation, XLA has the most comprehensive AI test suite of any ML compiler, so I heartily recommend XLA for mission-critical AI. XLA is used for most Google AI and has been for a decade. XLA is highly reliable. Yet, even XLA has bugs that escape into the wild for customers to encounter. The number of bugs is not zero, not even for XLA.
Anthropic published this report, diagnosing a bug in an XLA op as one of the causes of the Anthropic service giving bad responses to its users for a period of time. We should all commend Anthropic for being this open about the incident. The op in question, approximate top-k, was a new op in XLA that evidently didn’t receive as much testing as it needed. This is just one bug, yet look at what resulted. I hope no one received bad medical advice or bad suicide prevention guidance from Anthropic as a result of this XLA bug, but those are among the possibilities. It’s just one bug and people might have died. Which is how you might start to understand why Anthropic took the issue so seriously as to publish a report like that publicly. If you read between the lines, you can tell how upset the person who wrote the report was, even though they are being very professional about it. AI software correctness is serious business.
Consider that your project, whatever it is, is quite unlikely to be as error free and as well tested as XLA is. If you are responsible for an AI development effort, how many situations like this would you like your customers to encounter? Zero. The correct answer is zero. But zero is a hard number. If your project will be widely deployed, you are not going to be able to keep the number of bugs that your users will encounter at zero. In fact, many software engineers might be laughing right now reading this, at the idea of zero bugs as a concept. The only projects with zero bugs reported are projects that don’t have any customers.
It’s similar to asking how many patients a surgeon should kill because they didn’t do their job correctly. It’s a number that one would certainly hope is zero, yet humans make mistakes and a surgeon’s number isn’t going to be zero over a long career. Except if they never perform any surgeries. You can’t stay at zero. Zero is a place for people with no customers.
I’ve seen software developers discount testing because they know it will not remove all bugs. Zero is impossible, therefore any number will do. This isn’t an exaggeration or something funny. This is what some otherwise highly capable software professionals genuinely believe. They prefer to think this way because they believe their jobs will be easier if they don’t have to write tests - also incorrect, at least in the context of most AI software. It’s a bit like thinking that heavy smoking improves your life because it improves your day. I wouldn’t want surgery from a surgeon who worked this way and, for important AI applications, such as what Anthropic offers, I would avoid using AI software that was developed with such a mindset. XLA is excellent on testing and even XLA has issues such as this. “Sounds like XLA is doing a lot on this; if even they can’t do it, why should we try?” I’d suggest that you stop thinking like that.
Zero is a hard number. If that by itself leads you to discount the attempt to reduce your project’s number, since we won’t reach zero anyway, I suggest (in fact insist) that there is something wrong with your philosophy of software development.
Planes sometimes crash. Airbags sometimes don’t deploy. Surgeons sometimes kill people by mistake. Rockets with astronauts on them sometimes explode. Trains sometimes derail. Bungee jump cords sometimes snap. Parachutes sometimes fail. Buildings sometimes collapse. Even though zero is a hard number, it matters how often “sometimes” is.
Testing and benchmarking should be high status work
Your AI project needs to treat testing as lower status work about as much as it needs to treat fire escapes as optional. Yet treating testing as lower status work is commonly how it is. People often do it well out of a sense of duty, not because anyone requires it of them. Some managers are getting a free pass because they have employees who do things correctly even when they aren’t required to.
One of the problems with testing is that it is difficult to estimate how well tested a feature or product actually is. It requires good engineering judgement. To have a firm sense of this, you need to be a good engineer, you need to know everything about the feature and you need to inspect the test suite carefully. There are metrics, such as the number of tests or various kinds of code coverage measures, which are OK to use, but they are not replacements for good engineering judgement. So if employee A does poor testing and employee B does good testing, it’s not necessarily going to be obvious that this is the case without looking closely. Both engineers delivered their project, but employee B took longer. Maybe there is something wrong with employee B?
It’s OK, we’ll just count the number of bugs reported later and then we’ll know who did a good job. Employee B took longer. Now we also see more bugs reported in what he did. This employee B is real trouble. Well maybe employee B’s project was also more complex and more important to customers, so they used it more and found more of the few bugs that it did have. Maybe employee A’s project had many more bugs, but nobody used it, so it stayed at zero bugs reported. So counting bugs, as a metric by itself, is not great. It’s not a replacement for good engineering judgement.
If your CEO complains about your project having bugs, he probably just doesn’t know that zero is a hard number. Right? You can explain this to him. If it’s hard for your manager to tell whether proper testing has been done, what chance does your CEO have of figuring this out?
Well, OK, maybe figuring out if testing is good is just hard. But, surely, once a bug has been reported, we have to value our customers’ concerns and fix them quickly (true enough!). Turns out it’s quite easy to tell if your dev team is doing a good job fixing customer bugs - just ask the customers. We should probably reward engineers that are responsible for fixing bugs, because we need engineers to do this and they sometimes don’t want to. In fact, this employee A looks like a real star - he delivers all his projects quickly and he fixes more bugs than anyone else. There is no metric that is a replacement for good engineering judgement.
The CEO may notice that customers are sad about the bugs but loyal customers do in fact appreciate the close relationship that they have with the company’s dev team resulting from these quick bug fixes. Zero is a hard number, but our policy to focus on and reward bug fixing is working.
So, if you can’t tell, this is not a great situation this company is finding itself in, but everything looks reasonable. That’s the problem. Focusing on fixing bugs quickly is a fine idea, but the question is why there are so many bugs to fix in the first place. But how many is too many? I can’t tell you a specific number (well, OK, 37, that’s too many). Nobody can. There is no metric that is a replacement for good engineering judgement.
What happens if an employee notices that we aren’t doing a lot of testing and proposes to do more about testing? You are going to reduce the team’s apparent development velocity for a time if you do this. That doesn’t sound appealing. Worse, suppose this testing turns out to be effective. Then you’ve now revealed that your project in fact had many more bugs than it seemed. Your dev velocity will also now be even slower as you fix the sudden influx of bugs that your own testing revealed. So what does this look like externally to your team? Well, it might look like you first suggested doing less (to do more testing), then your project suddenly has way more bugs, then you did even less than you said you would (to fix more bugs) and, through all this, you’ve delivered nothing that any customer is happy about (for now). You are saying that you are now doing better on bugs, but actually the number of bugs reported against your project (by your own new tests) is above that of other projects in the company. So your testing effort is a success on its own, but what do things look like externally? This situation certainly calls for some careful management of perceptions.
I’m not an engineer specializing in testing, but testing has, in my opinion, been one of the underdone aspects of many projects I’ve been on, so I’ve had occasion to work quite a bit on it because I thought it needed improvement. So, in fact, I have personally set off a cascade of events like the one described above. This is a direct quote from the manager at the time: “you found more bugs in the past two weeks than our entire team did in the past year”. This was a team with a subteam doing just testing (it wasn’t their fault - they were doing their jobs in the way that they were told to do them).
I ended the story at its low point. What happened immediately after this darker chapter is that customers were still reporting bugs, but now, more often than not, the answer was: “This is already fixed in the latest version, please update.” A while later, the number of bugs reported fell dramatically. The team had been spending half their time fixing customer bugs before any of this started, now it was much less than that. So development velocity and morale were significantly improved. Bugs are far faster and easier to diagnose and fix if you have small tests to find them up front, instead of having to collaborate with a customer to figure out what is wrong later. If there is a bug somewhere in a large customer AI model, this can be very challenging and time-consuming to diagnose (it can take weeks of work to diagnose one bug). That’s the primary source of the quite significant speed-up in team velocity that occurred. Testing improves team velocity. That can just take some time to materialize - both on the way up and on the way down.
There was also another source of improved team velocity. I didn’t just write a bunch of tests myself - though I also did that. The more important thing that I did was to improve the testing infrastructure of the project. If testing is lower status work, your top engineers may not look that much at what they can do to improve testing and its infrastructure. Especially not if you then have a subteam that does the testing instead of the people writing the features doing the testing. That subteam may be expected to take instruction on what and how to test, not to improve testing infrastructure. So then no one is expected to improve testing infrastructure.
Testing AI software isn’t easy or simple and neither is testing infrastructure. What I did was, first, to reduce the amount of boilerplate involved in writing a test. So, and this is not an exaggeration, you could write a test in 3 simple lines that would have taken 30+ more complex lines before, and this improvement applied across tests. This wasn’t easy to do, it required careful API work, and the effect was more significant than it may sound like since it makes people more keen to write many tests. So it doesn’t just save you some time when writing the test, it improves testing in other ways, too. Previously, often a file would contain only a single test. Now, files could contain many tests because they were not so big. Even this is a significant improvement - it’s just easier to keep track of less code.
I also wrote a fuzzer, which found a bunch of bugs. It was based on taking existing tests and automatically making them more complicated in various ways that didn’t change the result of the test. This was very successful, acting as a force multiplier on the existing number of tests, and I would recommend that approach for any AI compiler. So you write one test, but behind the scenes it turns into 20 tests. That’s a lot more productive.
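To make this concrete, here is a minimal sketch of the idea, using a toy graph representation rather than a real compiler IR (Node, Const, Op and the specific rewrites are all illustrative assumptions, not the actual fuzzer I wrote): the fuzzer takes an existing test’s graph and wraps it in rewrites that should not change the result, and the harness then requires every variant to produce the same output as the original test.

  // Toy sketch of a semantics-preserving test fuzzer (illustrative only).
  #include <functional>
  #include <memory>
  #include <string>
  #include <vector>

  struct Node;
  using NodePtr = std::shared_ptr<Node>;

  struct Node {
    std::string op;              // e.g. "const", "add", "mul", "neg"
    double value = 0.0;          // only used when op == "const"
    std::vector<NodePtr> inputs;
  };

  NodePtr Const(double v) { return std::make_shared<Node>(Node{"const", v, {}}); }
  NodePtr Op(std::string op, std::vector<NodePtr> in) {
    return std::make_shared<Node>(Node{std::move(op), 0.0, std::move(in)});
  }

  // Each rewrite wraps a node in extra structure that leaves its value unchanged.
  std::vector<std::function<NodePtr(NodePtr)>> Rewrites() {
    return {
        [](NodePtr n) { return Op("add", {n, Const(0.0)}); },   // x -> x + 0
        [](NodePtr n) { return Op("mul", {n, Const(1.0)}); },   // x -> x * 1
        [](NodePtr n) { return Op("neg", {Op("neg", {n})}); },  // x -> -(-x)
    };
  }

  // Expand one test graph into several equivalent variants; the test harness
  // runs each variant and requires the original test's expected output.
  std::vector<NodePtr> Fuzz(const NodePtr& original) {
    std::vector<NodePtr> variants;
    for (const auto& rewrite : Rewrites()) variants.push_back(rewrite(original));
    return variants;
  }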
This work was at first somewhat hard to sell as a positive. During the dark chapter period, which lasted a few weeks, I had caused everyone to spend almost all their time fixing bugs, which is usually a software engineer’s least favorite activity. The view of this effort improved considerably once we got past the dark chapter period. The well of bugs ran dry and things were looking up.
What did the testing subteam think? They were actually quite happy. If you write tests for a living, your job is going to be more fun if you can write many tests quickly. It’s also more fun if you can write one test and then a fuzzer automatically turns it into 20 tests and then you find many more bugs. You can probably see how the status of people involved with testing rises if they find more bugs. Which they will if given proper tools. I think it also helped morale that I was talking about their work as something very important, which of course it was.
This all led to the testing subteam having some extra time due to the now increased productivity of writing tests. They had previously been expected to use the project’s APIs to write tests, but not to inspect how the code inside the project worked - it was quite complex. I proposed that the testing subteam spend some of their now freed up time on a series of improvements inside the project’s code, primarily long overdue refactorings, so that they became familiar with the internals of the project, too. I also suggested that they write a reference backend for the AI compiler, which, apart from such a backend being yet another boost to testing productivity, required them to understand how to implement every op in the whole compiler (as opposed to testing each op from the outside). It’s easier to test a project if you know how it works on the inside. It turned out that they were perfectly capable of doing such work; they just hadn’t been expected to do it previously. I would have gone further and removed the entire notion of a testing subteam, mixing those engineers in with the rest of the team, though we didn’t do that.
Was I expected or hired to do this kind of work? Absolutely not, though I didn’t have trouble justifying the time I spent on this after it got going. The whole thing took around 2 months of my time. It was successful enough that the testing approach that I used was disseminated more widely within the company through a company-specific avenue for such things. Don’t misunderstand this story - it was a great company and a strong team.
What about safety certifications? What if this team had been subjected to a safety certification process, maybe that would have led to the same changes that I made? No. I’ve been involved in such a process and nothing of what I did here would have been the result of a safety certification process. So you can perhaps see why I’m skeptical of safety certifications, even though they may indeed have some legitimate positive effects. I think that they are more a legal tool than an engineering tool. I suppose that legal tools can be important, too.
Maybe you think this is a story where I say that I’m a great engineer. Well, I do like to think so, yes, but you might have missed the bigger picture here. This was a story about the importance of testing and testing infrastructure and some of the challenges that get in the way. You underestimate these areas at your peril. I’ve never joined an AI software team that didn’t have some need for improvement in this area in my opinion (which partly explains how I was so useful on this - wasn’t my first rodeo). I think the whole AI industry is underestimating the importance of testing and benchmarking, not in one company or in one place but everywhere.
The previous section offered a perspective on how even a single bug can cause deaths and public embarrassment for you and your customers. In this section, we are talking about a high volume of bugs. So it seems that there is some kind of mismatch here? Yes, there is. That’s what I’m saying. You’ll find this mismatch everywhere in the AI industry.
If you still think that testing is or should be lower status work, then maybe read this story again. I have to say that I disagree with you. Testing AI software is not easy and it matters how you do it.
Kinds of AI software bugs and their impact
What can the impact of bugs in AI software be? There are different levels of AI software bugs:
No service bug: A no service bug is when the user consistently gets an internal error or the system is obviously broken in some other way. These bugs are obvious and bothersome, but they are the least serious kind of error. A self driving system with a no service bug like this will not be released to the world until the bug is fixed, so it’s not that serious. It’s just bothersome.
Intermittent no service bug: Like a no service bug, but it only happens some of the time. Maybe rarely. This is much more of a problem, since such bugs take time to be noticed, so they impact customers to a greater extent. For example, a self driving system with an intermittent no service bug might be released to the public, if the error does not occur during system testing, and then cause deaths in the wild if the bug does occur there.
Correctness bug: With this kind of bug, the system isn’t obviously broken, and there are no errors, but what is happening is not correct. This is a very serious kind of bug in the context of AI software. These bugs can be extremely hard to diagnose and they can go unnoticed for extended periods of time. A self driving system with such a bug will probably not be released to the public, since the bug will likely be detected during testing, but that isn’t guaranteed.
Intermittent correctness bug: This is the worst kind of bug and it is the kind of bug that Anthropic was dealing with in their public report. You can see how such a bug can escape testing efforts and go unnoticed for a long time even while it keeps causing serious problems. A self driving system with such a bug may well be released to the public.
As a customer, you might notice that no service bugs are not that serious for your business. Once something works, it’ll keep working. So you can deal with that. However, I would suggest considering that these different kinds of bugs are correlated. An AI system with many bugs of one kind is likely to have many bugs of the other kinds, too. So I would not take no service bugs lightly, even though their direct impact is limited. They are a red flag that should have you worried about encountering other bugs that may well impact your business more seriously.
Note that here we are not talking about AI that makes mistakes. That’s different. An AI mistake is when the AI functions the way it’s supposed to, but it just can’t figure out the right thing to do. This is a problem for AI researchers to deal with - they need to come up with a better transformer or use a better dataset to train the AI. That’s not what we are talking about here. We are instead talking about situations where the AI would do the right thing if the software that realizes its internal computations were functioning correctly, but that software is not functioning correctly. That’s an AI software bug (or potentially hardware bug), not an AI mistake. No matter how well an AI is trained or structured, it might still do the wrong thing if there is a bug in the underlying software that runs it. So buggy AI and wrong AI are different.
What can the impact of AI bugs be?
AI assistants: AI assistants, such as Claude, ChatGPT or Gemini, are used for advice in all areas of human activity, including suicide prevention and medical advice. There is really no limit to what might result if an AI assistant uses buggy software, since there is no limit to the potential actions that the wrong person might take if given the wrong advice at the wrong time from a source that they trust.
Is it really feasible that an AI assistant could start saying evil things due to a bug? Consider that one of the possibilities for an intermittent correctness bug is an intermittent sign error. Not a likely bug, but a perfectly possible one. Be aware that AIs internally contain many vectors, and an AI model may well have directions of such vectors that correspond to various kinds of evil behavior. An AI assistant will of course have been trained to avoid such directions. However, if there is a sign error, you might flip a vector or one of its components from a direction of “goodness” to a direction of “evilness”, with a corresponding flip in the behavior of the AI. So an intermittent sign error bug could in fact lead to an intermittently evil AI assistant that’s randomly good most of the time and optimizing towards evil when you aren’t looking. So buggy AI can potentially be quite a bit more serious than simply somewhat wrong AI.
Medical diagnosis: AI is today used for medical diagnosis, such as finding cancer on a mammogram. In such cases, currently, AI is usually used to support human judgement, so a faulty AI verdict may be corrected by a human, but humans make mistakes, too, so that isn’t guaranteed. If a hospital that I use relies on AI in its diagnostic procedures, even if only in an advisory capacity, I would very much appreciate it if they avoided buggy AI software.
Self driving: There are already self driving cars on the road, and people fall asleep while “driving” them even though they aren’t supposed to. In the future, there will be full self driving cars on the roads where you are allowed to fall asleep or where there is no human occupant at all. These are already on the road in a few areas. AI software bugs here can of course lead to traffic accidents and deaths.
These are three particularly serious applications, but there are many other applications of AI where bugs are still serious, even if not quite that serious. Ask your favorite systolic array, I mean LLM, and it’ll give you a long list of such applications.
Ah, but it’s OK, none of our customers are planning on using our product in a place where it might kill someone. Well, not right now and not as far as you know. If your AI software becomes popular, you will never know all the places it will be used. And, in any case, if a big order comes in from a medical diagnosis company, are you going to tell them that they shouldn’t use your product because it’s buggy? Probably not. You’ll take the order and hope for the best.
Maybe that medical company will require a safety certification process, but as I’ve said, these certification processes don’t assure what it sounds like they assure. You think “certified safe” software doesn’t have any serious bugs? Zero is a hard number. So the question is how effective the certification process is at finding bugs. Somewhat effective. Much of a safety certification involves making a list of all possible bugs you might have and then doing paperwork to document that you have tested for each of them. If you are honestly bad at coming up with possible bugs in your software, then your certification will be easier to complete. If I am going to receive surgery from an AI robot, what I want to know is that the people who created it were conscientious and competent. That is more powerful than any certification. Of course, I won’t object if they also have a certification.
I suggest you take software correctness and testing seriously. I also suggest that you prefer preventing bugs from escaping your software development process over focusing on fixing bugs quickly after customers report them - though of course you do need to fix bugs that customers report, and doing so quickly is preferable.
If you are buying a lot of AI hardware, you might want to ask your vendor how many bugs they’ve ever had escape to their customers (they won’t know, or if they do, don’t expect the number to be small, but watch how they respond) and what their testing story for their hardware and software is. You may have a hard time evaluating the answer, but if you get the sense that they aren’t taking that side of the business seriously, that’s something I’d be concerned about in your place. If they only emphasize their bug fixing turnaround time, and don’t have answers about their efforts to prevent bugs (e.g. testing), that’s maybe not great. Though maybe no one else ever asked before, so it’s OK if the sales team needs to go back to their dev team to ask. If they don’t take testing seriously, what that really means is that they aren’t taking your interests as a customer seriously. At least that’s how I would view it in your place.
AI hardware and software infrastructure
The first thing to know is that you need a significantly large server farm to run your tests if you will be developing large-scale AI software. During TPUv2 development, our XLA testing fleet of TPUs was so powerful that it would have been in the top 5 of world supercomputers at the time, if we ignore that such lists require higher precision computations than what TPUs do. To be fair, this happened because TPUs are incredibly fast, so we had many TPUs, but not as many as that makes it sound. Even though we had a lot of TPUs available for testing XLA, we still would have liked more. This is because you need many tests and ideally these should all run before every change to the software repository, so that bugs never even make it into the repository. This can require a lot of testing hardware.
It is quite important how long it takes to make a change to the code, compile (if in a compiled language) and run the tests. You will want a parallelized compilation flow where compilation happens in a distributed way rather than locally, since otherwise it will be slow. The same goes for testing - you will want a distributed system where tests can run in parallel across many machines, not just locally. Critically, you will want this to be easy to do from the command line. At Google (and externally if you use Bazel), you can compile and run all relevant tests by simply typing “bazel test ...” in your code directory. It will quickly compile and test in a distributed fashion automatically from that one invocation. If your workflow is not as good as that, I don’t know why you wouldn’t improve it until it is. And consider using Bazel for building and testing; it works well.
A particularly bad situation here is if a developer needs to book a machine to run tests on, has to log into it and then maybe has to install some things before running tests. Repeatedly (one-off is maybe OK). Don’t do it that way. Just use Bazel - it allows you to declare that a test requires a specific kind of hardware and it will make it happen (well, as long as you provide that kind of hardware to it, of course). At Google, you can type “bazel test my_test” and, if my_test is set up that way, this will run in parallel across all current kinds of TPUs and a few types of GPUs, reaching out to many different machines, each with its own kind of hardware. It happens in seconds. You can also tell it to run only on a specific kind of hardware. You can have that where you work, too, if you use Bazel.
[[[ Irrelevant aside Why call it “Bazel”? Seems like an odd name. Well, internally at Google, this system has always been called “Blaze” for “blazing fast”. When it was open sourced, I guess they wanted to distinguish the open version from the internal version, so they called the open version “Bazel”. It’s a bit of an odd name, but you can see the connection to “Blaze”. ]]]
The modify, compile, test cycle time is important because it has a strong effect on developer productivity. If your developers’ time isn’t expensive and valuable, you probably didn’t hire the right team to work on AI software. Proper infrastructure is a large force multiplier on your development team’s efforts.
If it takes too long to run your tests (more than a minute is already a long time, I think), buy more hardware. Keep doing that until you can’t afford it anymore. If that never happens, you have either a very large budget (unlikely), or your team isn’t adding enough tests (very likely). If you are somewhat into the development process and you as a manager aren’t getting a request for more funds to buy a larger fleet of test machines, something is wrong. You should figure out what’s going on. Perhaps your team doesn’t feel empowered to ask for what they need. Perhaps they didn’t write any tests. Something is wrong. They should eventually be complaining that they don’t have enough test hardware, no matter how much test hardware they already have.
You should buy a lot of hardware, but, eventually, you won’t have more money for more hardware. Even at Google that’s how it was eventually (though we did get a world top 5 supercomputer out of it, so hey, that was pretty nice). So what then? The most obvious solution is to ask people to stop adding so many tests. I’ve seen this proposed and used as the primary solution. That’s bad. ABAT - Always Be Adding Tests. But... then how do we solve the problem that it will take longer and longer to run all these tests? That could tank your dev team productivity, so that’s no good. What to do?
The first thing to do is to optimize your tests. The easiest way to optimize code is to run a profiler and generate a flame graph (if profiling and generating a flame graph takes more than a single command line invocation in your team’s setup, why not make a script for it?). Tests are code. So profile your tests. This will be very effective if you’ve been underway for a while - you are surely wasting time doing many silly things in your testing code. You might get a 10x this way, it can happen. A common first discovery upon doing this on an AI compiler is that generating large random arrays, commonly used to test AI software, takes way longer than you’d think. So cache those and reuse them. That alone can be a large speed-up.
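As a sketch of what that caching can look like in practice (the helper name and cache key are assumptions for illustration, not any particular framework’s API): tests request a random array by size and seed, the first request generates it, and later requests reuse the cached copy, which keeps the data deterministic per test while avoiding regenerating megabytes of random numbers over and over.

  #include <cstddef>
  #include <cstdint>
  #include <map>
  #include <mutex>
  #include <random>
  #include <utility>
  #include <vector>

  // Returns a deterministic random array for a given (size, seed), generating
  // it at most once per test process. Guarded by a mutex so tests running in
  // parallel threads can share the cache safely.
  const std::vector<float>& CachedRandomArray(std::size_t size, std::uint32_t seed) {
    static std::mutex mu;
    static std::map<std::pair<std::size_t, std::uint32_t>, std::vector<float>> cache;
    std::lock_guard<std::mutex> lock(mu);
    auto key = std::make_pair(size, seed);
    auto it = cache.find(key);
    if (it == cache.end()) {
      std::mt19937 rng(seed);
      std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
      std::vector<float> data(size);
      for (float& x : data) x = dist(rng);
      it = cache.emplace(key, std::move(data)).first;
    }
    return it->second;
  }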
If you profile your tests, you might also discover that your actual product is slow in some cases where you didn’t expect it to be. Congratulations, you just found a performance bug in your software, and fixing it will also speed up your tests. For example, if you make AI compiler tests with very large graphs, as you should, then you might well find that your AI compiler is very slow for such cases - I once discovered an O(N^6) algorithm this way in the compiler I was working on. That’s something to fix. It’ll speed up your tests and please your customers if they use large graphs. If you do this work, of course document numbers on the impact of your work for your performance evaluation / promotion case in the future.
While you are profiling your tests, pay attention to the utilization of the AI hardware that you are running your tests on. The utilization of your AI HW during the testing process will usually be very low, e.g. less than 1%. This happens because many tests use a lot of CPU cycles to compile a kernel, prepare inputs and inspect outputs for correctness. The actual kernel that runs on the AI HW usually completes very quickly. AI HW is very fast - that’s the whole point of AI HW. So your tests are likely mostly CPU bound, running on a $200 CPU, while your $10k accelerator (if you can find one that cheap) is 99% idle. So in this case you can buy twice as many $10k accelerators to double your testing capacity. That’s industry standard, I’m not joking, it’s not something funny, I’m giving you serious information without exaggeration here. This happens largely because teams don’t realize that the AI HW is poorly utilized during testing (Profile my tests? Why would I do that???). But, even when teams do realize the issue, there might not be the will to fix it. I’ve seen that as well.
The trouble is that solving low utilization of AI HW during testing is somewhat tricky. Each test needs to use the device, so a test will commonly acquire exclusive use of the device, run the test and then release the device. So if you have one device, running the tests in parallel on the CPU doesn’t help since they are serialized on acquiring the device, even though the device is mostly idle while it is acquired.
What you need is an improved way to write tests, and improved testing infrastructure, such that, naturally and without special effort in any individual test, the infrastructure arranges for each test to do as much work as it can before acquiring the device (prepare inputs and compile the test kernel), then quickly acquire the device, transfer inputs to it, run the (already compiled) test kernel, transfer the output off the device and immediately release it. Only after the device is released does the test inspect the output for correctness. That’s how to do it - you can even pipeline these steps for further optimization, but that’s less critical. Your tests will probably still be CPU bound, with a significantly idle device, but if you have 30 cores on the CPU, this might give you a ~30x improvement in testing throughput. So that’s nice. Now you can write 30x as many tests before you have a problem with the speed of testing again. ABAT - Always Be Adding Tests.
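A sketch of that ordering, with toy stand-ins for the compiler and runtime (none of these names are a real API, and a std::mutex stands in for exclusive device ownership):

  #include <mutex>
  #include <vector>

  struct CompiledKernel { int id = 0; };     // stand-in for a real compilation result
  std::mutex device_mutex;                   // stand-in for exclusive device ownership

  CompiledKernel CompileTestKernel() { return {}; }            // CPU-heavy, no device needed
  std::vector<float> PrepareInputs() { return {1.0f, 1.0f}; }  // CPU-heavy, no device needed
  std::vector<float> RunOnDevice(const CompiledKernel&, const std::vector<float>& in) {
    return {in[0] + in[1]};  // pretend upload + execute + download
  }
  void CheckOutput(const std::vector<float>& out) { /* compare against expected values */ }

  void RunDeviceTest() {
    // 1. All device-independent work happens before touching the device.
    CompiledKernel kernel = CompileTestKernel();
    std::vector<float> inputs = PrepareInputs();

    std::vector<float> output;
    {
      // 2. Hold the device only for the upload + execute + download window.
      std::lock_guard<std::mutex> device(device_mutex);
      output = RunOnDevice(kernel, inputs);
    }

    // 3. Correctness checking runs after the device has already been released,
    //    so another test can use the device while this one is still verifying.
    CheckOutput(output);
  }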
Won’t multiplexing tests on the AI HW like this lead to race conditions and flaky tests? If you do it wrong, yes. If you do it right, no. But worrying about flaky tests (tests that fail but only sometimes) is good. Flaky tests are bad. Make sure you don’t have flaky tests.
Won’t this make every test more complicated to write? If you do it wrong, it will. But if you hide all these steps behind testing infrastructure and good testing APIs, it will be no more complicated to write a test with this setup than without it. I suggest doing it that way.
Anyway, this is starting to sound like somewhat complicated software development, isn’t it? Profiling, optimizing, distributing across machines and parallelizing within machines even perhaps software pipelining. Yes, that’s right. Testing AI software isn’t a lightweight easy activity.
What do good testing APIs look like for AI software? This is what a test for 1 + 1 = 2 should look like for an AI compiler:
ExpectEq(AddInput(1) + AddInput(1), 2)
This one line does a lot of stuff:
Create an AI model graph.
Add two scalar inputs to the AI model graph.
Add an addition node of the two inputs to the AI model graph.
Add an output node to the AI model graph connected to the addition node.
Compile the model graph to a kernel that can run on the device.
Create two arrays on the host, each containing a scalar 1.
Acquire the device, so that other tests cannot use the device.
Transfer the compiled binary to the device.
Transfer both inputs to the device.
Reserve memory to hold the output on the device.
Run the binary on the device, passing in the addresses of both inputs and the output.
Wait for the binary to complete.
Transfer the output scalar from the device back to the host.
Release the device, so that other tests can use the device.
Compare the transferred output array to the expected result, in this case a scalar 2.
Report to the testing infrastructure whether the test succeeded or failed.
Deallocate memory for the AI model graph, binaries, inputs, outputs etc.
For some AI compilers, this same test will require more than 17 lines of code, and from the list above, you can maybe tell how such a thing is possible. But, in fact, you can create AI testing APIs so that the one line above does all of that. There is enough information on that one line to do all of these steps. If your AI testing infrastructure isn’t as nice as this, I suggest that you work on it until it is. You’ll also notice that the idea of acquiring the device only when necessary is already baked into this API (this also lets you later add support for pipelining the transfers between tests without changing any of the tests).
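As a sketch of how a one-liner like that can be made to work (the types below and the little interpreter standing in for compile-and-run are assumptions for illustration, not XLA’s or anyone’s actual API): AddInput returns a handle into a small graph being built, operator overloads add nodes to that graph, and ExpectEq is the call that triggers all of the steps listed above.

  #include <cassert>
  #include <memory>
  #include <utility>

  // Toy graph node; a real implementation would build the compiler's own IR.
  struct GraphNode {
    enum Kind { kInput, kAdd };
    Kind kind;
    double input_value;
    std::shared_ptr<GraphNode> lhs, rhs;
  };
  using NodeHandle = std::shared_ptr<GraphNode>;

  // Adds a scalar input node to the test graph and returns a handle to it.
  NodeHandle AddInput(double value) {
    return std::make_shared<GraphNode>(GraphNode{GraphNode::kInput, value, nullptr, nullptr});
  }

  // Builds an addition node; other ops would get similar overloads.
  NodeHandle operator+(NodeHandle a, NodeHandle b) {
    return std::make_shared<GraphNode>(
        GraphNode{GraphNode::kAdd, 0.0, std::move(a), std::move(b)});
  }

  // Stand-in for compile + acquire device + transfer + execute + release. Here
  // the toy graph is simply interpreted so that the sketch is self-contained.
  double CompileAndRun(const NodeHandle& root) {
    if (root->kind == GraphNode::kInput) return root->input_value;
    return CompileAndRun(root->lhs) + CompileAndRun(root->rhs);
  }

  // The entire test: build, compile, run "on the device", compare, report.
  void ExpectEq(const NodeHandle& root, double expected) {
    assert(CompileAndRun(root) == expected);  // real code: approximate comparison
  }

With something along these lines in place, ExpectEq(AddInput(1) + AddInput(1), 2) is a complete test.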
I’d also suggest adding an automated fuzzer to expand each test into multiple other, more complicated tests. So from one line you can generate many tests. You’ll also want a reference backend, so that a test written like the line below still checks that 1 + 1 = 2. The correct output is inferred on the CPU using the reference backend, which is simple and therefore (more often) correct.
ExpectEq(AddInput(1) + AddInput(1))
This is not useful for a trivial case like this, since it’s not a problem to say that the answer should be 2, but it’s very useful for cases where the output is very large (e.g. a million numbers). Your fuzzer might also have use for such a reference backend.
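To make “simple and therefore correct” concrete, here is a sketch of what one op in a reference backend might look like (the name and the row-major layout are assumptions): no tiling, no vectorization, no threading, just the textbook loop that is easy to convince yourself is right.

  #include <cstddef>
  #include <vector>

  // Reference matmul: (m x k) times (k x n), row-major. Deliberately naive.
  std::vector<float> ReferenceMatMul(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> out(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
      for (std::size_t j = 0; j < n; ++j)
        for (std::size_t p = 0; p < k; ++p)
          out[i * n + j] += a[i * k + p] * b[p * n + j];
    return out;
  }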
Your reference backend will be used for large outputs, and it should be implemented simply, so that you can have more confidence that it is correct, which means you cannot (should not) use complex optimizations inside it. So if you look at your test profile (you are profiling your tests, right?), you are going to see after a while that the reference backend is by far the biggest factor slowing down your testing. What to do? This one is a bit tricky. You need a simple reference backend, so maybe buy faster/more CPUs. What else? Well, here’s an idea that I think is good, even if it is a bit of trouble:
If the output from your device was previously verified as correct by comparison with the reference backend, record a stable 256-bit hash code of that correct device output. If the hash code of the current output from the device is equal to the one recorded as correct previously, then you can mark the test as passing without running the reference backend again. This way the reference backend will rarely be run (most code changes do not change the output of most tests), yet you haven’t lost any test coverage.
You might also leverage this into a nightly determinism test: if you run the tests twice with no code change, the outputs should be bitwise identical. If they aren’t, that’s a determinism bug.
There are some complications with this idea, like how to store and retrieve the hash codes, but I think it’s the best practical solution. Better than just having unnecessarily slow tests - though definitely don’t do this until the test profile shows that the reference backend is starting to be a problem. You will want to rerun everything with the reference backend once nightly to catch, in a timely manner, some very rare situations where you can get false negatives from this approach (the reference backend changed enough to trigger test failures, or, since you look up by test name, the test changed but the device output stayed the same when it should have changed). Couldn’t you store the whole output instead of a stable hash code? Yes, but these outputs can be large and would consume bandwidth to transfer, so you might not want to do that, though you could. In that case you might as well cache the reference backend outputs instead. You can’t cache hash codes of the reference backend outputs, since the comparison of the device output to the reference output is usually approximate rather than exact, for reasons such as floating point reassociation.
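Here is a sketch of how the shortcut can look, with stand-ins for the parts that would be real infrastructure (the in-process map stands in for a shared store keyed by test name, and the 64-bit FNV hash below stands in for a genuinely stable 256-bit hash such as SHA-256 of the raw output bytes):

  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <string>
  #include <unordered_map>
  #include <vector>

  // Stand-in hash over the raw bytes of the output (FNV-1a, 64-bit).
  std::uint64_t StandInStableHash(const std::vector<float>& v) {
    std::uint64_t h = 1469598103934665603ull;
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(v.data());
    for (std::size_t i = 0; i < v.size() * sizeof(float); ++i)
      h = (h ^ bytes[i]) * 1099511628211ull;
    return h;
  }

  // Device output vs reference output: approximate, not exact, comparison.
  bool ApproximatelyEqual(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
      if (std::fabs(a[i] - b[i]) > 1e-5f * std::max(1.0f, std::fabs(b[i]))) return false;
    return true;
  }

  // Returns true if the device output is correct, running the (slow) reference
  // backend only when the output differs from the last known-good output.
  bool CheckAgainstReference(const std::string& test_name,
                             const std::vector<float>& device_output,
                             const std::function<std::vector<float>()>& run_reference) {
    static std::unordered_map<std::string, std::uint64_t> known_good;  // real code: shared store
    const std::uint64_t hash = StandInStableHash(device_output);
    auto it = known_good.find(test_name);
    if (it != known_good.end() && it->second == hash) return true;  // skip the reference run
    if (!ApproximatelyEqual(device_output, run_reference())) return false;
    known_good[test_name] = hash;  // remember this device output as verified
    return true;
  }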
Most of your tests are going to be smallish unit tests, that you can run before every code change, but you are also going to need larger-scale tests that take too long to run before every code change. E.g. if your hardware is for training, one of your tests should be to train e.g. a transformer from scratch and the test criterion is that the converged accuracy of the trained model at a given training step is within expectation. If your hardware is for inference, you should be looking at accuracy end-to-end of e.g. a pretrained transformer. You might want to run this for multiple kinds of models. If you aren’t running such tests daily (nightly) or at least weekly in an automated way, that’s a problem. You’ll also want automated tracking and graphing of how model accuracy is changing over time so you can spot regressions (or improvements).
If tests take more than a few minutes to run, people will sometimes skip them when they think it’s probably OK, occasionally leading to tests that fail in the shared code repository. This is a big problem for everybody every time it happens, so people get understandably angry about it. If running the tests is slow or somehow troublesome in any other way (“just follow this 10 step process for booking a machine that you run the tests on”), the real root cause for this situation is your testing infrastructure. If I hear of a situation where this keeps happening, I already know that their tests are slow or bothersome to run in some other way. That’s why it happens. I’ve seen managers and infra teams be very angry at engineers for skipping a 2 hour (!!!) test run, causing test regressions to be submitted in rare cases, when the real underlying trouble is that no one ever bothered to profile and optimize the tests. And no one even asked for more test hardware, either. You need good testing infrastructure and a bunch of test hardware.
In the past two paragraphs, I wrote that you need tests that take too long to run before every code change, and I also wrote that if you get regressions into your code repository, that’s because your tests are too slow to run. Seems like a contradiction - it has to be fast, but also you need slow tests. How to reconcile these two ideas? The consistent rule here is that if a test is very slow, it might get regressed because it won’t and shouldn’t be run all the time, only sometimes (e.g. nightly). You’ll just have to deal with that as a permanent situation, is what I’m saying. However, ideally, you will have many fast small tests that test your entire product, so that you only very rarely have cases where a slow long-running nightly test fails but none of your fast per-code-change tests fail. That’s how you deal with this, by reducing the rate of such occurrences to be low enough that it is tolerable. If slow-test regressions are happening often enough to be bothersome, then you don’t have good enough coverage from your fast tests.
Here is something that I would suggest to avoid: It is possible to save significant test load by having everyone submit their code changes without running tests, or running just a few tests, and then only running the tests every N’th (e.g. 10th) code change on the shared code repository. In case of a failure, a code change is then automatically identified (bisected) and reverted. This is bad for developer productivity and also morale (“who submitted a regression again?”). If you don’t have any budget left (did you ask?) and you already profiled and optimized the tests (did you, though?), maybe you have to do it that way, but it’s not good. I wouldn’t do that.
You might also want to run your test suite under Valgrind, the various LLVM sanitizer modes, your favorite static analysis and coverage tools and, these days, AI analysis. This doesn’t have to happen often, some of the modes are very slow, but it’s imprudent never to do it. Once monthly or more often might make sense. Certainly always before releases. If you end up with e.g. 30 such tools and modes, you can run one of them each night, to avoid having to deal with the combined findings of 30 tools at the same time.
XLA has an excellent open source test suite. If you are developing software for your own AI HW, you might consider adding an XLA backend (not that hard) so that you can run XLA’s test suite against your software and hardware. You might consider doing so to access XLA’s test suite even if you aren’t interested in using XLA for anything else. Though XLA is a fine choice. It’s how Google does AI.
Benchmarking infrastructure
Everyone on your team needs easy access to running benchmarks. Any change, even one not intended to affect performance, can change performance for better or for worse, and it’s important for your team to be able to tell what’s going on here. It is also much easier and more motivating to do performance work if you can see right away that your change was a positive +X% for some X.
You need benchmarks both for how fast your compiler is at compiling and for how fast the generated binaries are at running an AI model. A code change can make one model faster and another model slower, so it’s important to measure across a variety of models - you can’t just look at a single model. Unless your project only supports a single model - which is starting to perhaps be a possible situation with everything being a transformer. You of course want your customers’ models to be included in your set of benchmarks, so that your team knows right away if you are making your customers’ models slower. Regressing customer models is easy to do by mistake, so it’s important to be able to tell if you did it. What if your customers want to keep their models secret from you? Well, that’s OK, but then they might get regressed (run slower). That’s the deal.
One thing that any team eventually learns over time is that some powerful optimizations that improve many models overall might still regress some models and it isn’t always practical to fix the regressions (sometimes it is, of course). So you are going to have to accept regressions sometimes. It’s a case by case judgement call. Requiring no regressions ever is not a practical policy, not even of customer models. What you’ll want to make an effort to avoid is large regressions in customer models that make their way to that customer and then there is nothing the customer can do about it. Keep in mind the possibility that if your team did many separate optimizations for a release, then perhaps all your customers’ models are improved, even if some of those individual optimizations caused regressions on their own. Don’t let fear of waves prevent a rising tide from lifting all boats (this analogy breaks down because tides are also cyclical, but don’t think about that).
It is common, maybe even industry standard, that running benchmarks is an involved process that requires a specific team member, who knows how to do it, to spend some time on it each time. So people will ask that team member to benchmark their code changes and, if the team member can be bothered that day, you might get the result the next day. The team member will probably only run one or two benchmarks, because otherwise it’s too much trouble. This is normal. It’s also not a good situation. Instead, you want a benchmark report like this to be reported automatically if you simply write “run_benchmarks” on the command line:
Performance report for change XYZ
                        Before    After    Speedup (1.0 = neutral)
    model_benchmark_A   120 ms    60 ms    2.00
    model_benchmark_B   ...
    ...
    Geomean                                1.42
Geomean means geometric mean, which is the correct mean for ratios (like speedups). This report compares the time with and without a code change. That’s what you need to know about the impact of the code change in terms of performance. You want a report for both compilation time and model benchmark time (always reported together). Ideally the time to generate such a report should be very low, but in reality it’ll probably take a while. If it’s more than an hour, that’s probably not good (you are profiling your benchmarks, right?). You want the command line to output a permanent HTTP link pointing to the report. That link can then be put into a code review, so the reviewers can see the impact of your change without having to rerun the benchmarks themselves, or it can be put into a report for your company, such as a written case for promotion.
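As a small aside on the arithmetic, this is all geomean is (a sketch; handle the empty case however your reporting tool prefers):

  #include <cmath>
  #include <vector>

  // Geometric mean of per-benchmark speedup ratios: the n-th root of their
  // product, computed as exp of the mean of logs to avoid overflow/underflow.
  double GeometricMean(const std::vector<double>& speedups) {
    double log_sum = 0.0;
    for (double s : speedups) log_sum += std::log(s);
    return std::exp(log_sum / static_cast<double>(speedups.size()));
  }

  // Example: speedups {2.0, 0.5} give a geomean of exactly 1.0 (neutral overall),
  // where an arithmetic mean would misleadingly report 1.25.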
If you don’t have a report like this, I’d suggest working at it until you do. I don’t know why you wouldn’t. It is a force multiplier for your entire team.
An important property of a benchmarking system is how noisy the numbers are. To measure this, run the benchmark a few times with a change that doesn’t do anything, so that before and after should be identical. If you don’t get a perfect 1.00 impact result every time, then your benchmark is noisy. An optimization that gives you 1% across your benchmark set is a good optimization, so you will want benchmarks that are stable enough so that you can trust a 1% difference to be real, not just noise. If you don’t make special efforts here, your benchmarks are probably going to be very noisy. You may want to have a daily (nightly) run that measures noise in your benchmarking process and shows it as a graph and with notifications in case of an increase in noise. This is important since if your benchmarks have suddenly become noisy, your team may not know about it yet, so important decisions can then end up being based on differences that in fact are just noise. Also, if you have one of these Employee A types on your team, they can run the benchmark 10 times until it gets the result that they want and that will work if the measurement is noisy. Promotions and pay are in part based on such numbers (“I improved X by Y%”), so there is a very real motivation for that kind of shenanigans. Noisy benchmarks are bad. Unless you want to confuse your organization - then maybe they are very useful. Don’t accept noisy benchmarks.
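A sketch of what a basic noise check can compute (the 1.01 threshold is just an illustrative choice): run the identical benchmark several times with no change at all and look at the spread of the measured times.

  #include <algorithm>
  #include <vector>

  // Ratio of slowest to fastest time across repeated identical runs.
  // 1.00 means perfectly repeatable; if this is above roughly 1.01, you cannot
  // trust a reported 1% speedup on this benchmark to be real.
  double NoiseRatio(const std::vector<double>& repeated_times_ms) {
    const double fastest = *std::min_element(repeated_times_ms.begin(), repeated_times_ms.end());
    const double slowest = *std::max_element(repeated_times_ms.begin(), repeated_times_ms.end());
    return slowest / fastest;
  }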
Where does all this noise in benchmark results come from? For speed, your benchmarks need to run in parallel across your fleet of machines. This will likely be the test fleet and may even be coming from a cloud provider. This introduces many noise problems:
Varying load: If machines in your test fleet are also running tests at the same time, that will introduce noise since the load on the machine is varying. So you will want to acquire exclusive use of each machine for each benchmark, so that each machine is otherwise quiet. If your test fleet is coming from a cloud provider, this may not be possible. You may need to buy physical machines only for the purpose of benchmarking.
Temperature: If the temperatures of the chips in your machines are not identical across benchmark runs, then the throttling features of high performance chips will cause variation in the results (all high performance chips have throttling features). This is true for both the CPU and your AI HW. It is hard to get identical temperatures, but one thing that may help is to leave the device idle for a second before running the benchmark, or to run the benchmark twice while discarding the first result. If nothing else works, you may also choose to pre-throttle the machine so that it is always running at a slower clock during benchmarks. This distorts the measurement a bit, since your customers will not be using that configuration, but at least it can be less noisy. Make sure that your test fleet machines are always returned to their default, faster state after a benchmark, even if the benchmark is interrupted by e.g. a segmentation fault.
Varying machines: Are all your test machines identical? Are you sure? What if one machine is standing in a hot corner of your datacenter and another machine is in a cold corner? That’s already non-identical. One way around this is to run all benchmarks in parallel across machines, but to run the before and after of each individual benchmark on the same machine. That reduces before/after noise. It will also reduce your parallelism by a factor of two, but that may well be worth it. You can also run all the benchmarks on one and the same machine every time, but then your benchmarks will likely take too long to run overall after a while (benchmarks accumulate) because you will have no parallelism.
Measure the right thing: However strange it may sound, I can report that it is quite easy to create benchmarking infrastructure that ends up measuring things that shouldn’t be included in inference or training step time, and these additional things tend to be noisier than the real signal that you are after. One bad example is when the measured time includes the time to acquire exclusive use of the device (a process that can be hidden away inside the infrastructure), which could take anywhere from a nanosecond to a minute. Inspect carefully when you start and stop the clock and what occurs between the two. You also probably want to exclude the time taken to transfer inputs/outputs and binaries to/from the device. Also take care to know whether you are measuring wall-clock time, CPU time or device cycles. Measuring all three is fine, but if you only report one, you want wall-clock time. This is the real end-to-end measure.
Natural variation: Whatever you do, there is still going to be variation. For this reason, you may prefer to run the benchmark a few times and report the median or minimum time instead of just running it once. You will likely want to run very fast benchmarks many times and long running benchmarks perhaps just once. For important runs (like the ones you promote people based on), maybe run the whole report separately a few times to ensure that you get the same result. If you use Bazel, which can also be used for benchmarks, be aware that you have to tell it to do the same thing again, otherwise it will just give you the same cached result back many times without remeasuring.
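Pulling the last few points together, a sketch of one benchmark measurement might look like the following (run_step is assumed to execute exactly one already-compiled step on the device, with inputs already resident there, so that compilation, transfers and device acquisition all happen before the clock starts):

  #include <algorithm>
  #include <chrono>
  #include <functional>
  #include <vector>

  // Warm up once (discarded), then time several runs with a wall clock and
  // report the median to damp natural variation.
  double MedianStepTimeMs(const std::function<void()>& run_step, int repeats) {
    run_step();  // warm-up run: let clocks and caches settle, then discard it
    std::vector<double> times_ms;
    for (int i = 0; i < repeats; ++i) {
      const auto start = std::chrono::steady_clock::now();
      run_step();  // only the step itself is inside the timed region
      const auto end = std::chrono::steady_clock::now();
      times_ms.push_back(std::chrono::duration<double, std::milli>(end - start).count());
    }
    std::sort(times_ms.begin(), times_ms.end());
    return times_ms[times_ms.size() / 2];  // median
  }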
Wrong baseline: You can’t cache a baseline number to compare against across your team. It may sound like a nice 2x on benchmark speed to reuse the same baseline, but it causes meaningless results. You have to compare a change to a baseline that is the revision just before that change. If your baseline is a fixed revision that changes e.g. nightly, then people are not getting just the effect of their own change. This is an optimization that I have in fact seen used and it’s not a good idea.
Getting a handle on all this is necessary if you will be building serious AI software. You need to be able to run benchmarks easily and (somewhat) quickly and you need to be able to trust the results. Benchmarking is hard.
You will also want to ensure that you store the logs of all the benchmarks. Benchmarks can crash or exhibit other problems and you need some easy way to investigate that. You probably also want a feature where you can specify a single benchmark to run, or a specific few, instead of the default larger set, on the command line.
Due to the virtues of ABAT, Always Be Adding Tests (benchmarks are performance tests), over time you will accumulate benchmarks (good) so that you will have to curate the set of benchmarks that run by default when people write “run_benchmarks” on the command line. Otherwise it will take too long. This is unfortunate but will eventually be necessary. Certainly profile and optimize your benchmarks and complain about not having enough hardware before you start doing this. What is the utilization of your AI HW during benchmarks? It can be and should be quite a bit higher than what you’ve achieved for your unit tests, since benchmarks run for longer on the device per benchmark than unit tests normally do. So make sure that that is the case.
You will also want a daily (nightly) run of your benchmarks that is permanently recorded and which feeds into graphs that your whole team can easily access, so that everyone can see how performance is trending over time for each benchmark model. This is also how you notice if a regression in performance makes it into the shared code repository (which can easily happen). For this nightly run, you will want to run many more benchmarks than are run by default when team members type “run_benchmarks”. This is feasible since it only runs once per day.
Avoid stigmatizing people for having submitted performance regressions. It happens and it’s not a big deal as long as all this infrastructure is in place to catch it. Don’t make your team afraid to change the code, including junior developers and interns. If someone keeps doing it all the time irresponsibly, sure, have a conversation about it, but it probably shouldn’t be a public conversation.
You may also want to be able to run benchmarks against competing AI software. I think that’s a great idea. How else will you be able to claim that your AI software or hardware is better than everyone else’s, as all AI software and hardware vendors claim? You can even have a version of your benchmarks where the before/after is your product against their product. There are some troubles here. You probably know how to properly configure your own product (better check, though!), but you may not know how to properly configure your competitors’ products. It’s easy to end up comparing your parallel product against a competitor’s product that, unfortunately, only got one thread to run on, which is a meaningless comparison even if it enables you to claim a 30x speedup (I’ve seen this happen). There are many ways to get this wrong. Do not expect setting up benchmarking to be easy to get right. In fact, wrong benchmarks are probably part of how it is possible for all vendors to claim that their product is best (another factor is only reporting numbers for models where you are faster). If you are a customer, never believe the benchmarks coming from vendors. If you are a vendor yourself, you probably will want to know what’s really up, though, so better to have correct benchmarks. A sophisticated customer will set this up for themselves, so you can’t hide the truth. If a specific AI model is much faster on a competitor’s product, that’s also an easy way to tell that there is something to be improved in your own product on that AI model, to the benefit of your own customers.
Does it sound like benchmarking is not that easy? If that’s what you think now, then you are correct. Teams commonly don’t do this at all, or do it quite poorly, to avoid having to deal with all of it. That is a false economy of development effort. Easy-to-run, stable and fast benchmarking is a force multiplier for your entire team, so you will want to get it right.
This is a summary of the three most important concepts (recall that benchmarks are included in “tests”):
ABAT - Always Be Adding Tests (always improve your test suite)
ABP - Always Be Profiling (always optimize your tests and inspect HW utilization)
ATTAM - Always Try To Acquire Machines (keep asking for more test hardware)
If ABP and ATTAM don’t seem necessary, you have not been doing enough ABAT (of course postpone ATTAM until you need it - don’t worry, you will soon).
There really are AI compiler teams out there that never optimized their tests at all (probably missing out on a 100x speedup) and that never considered getting more hardware. So they simply stopped writing tests, because running the tests had become too slow (and maybe some of the engineers don’t like writing tests, perhaps in part because the testing APIs are bad - you need good testing APIs). However absurd that may seem after you’ve read all this, it is a real thing that happens. Don’t let it happen to you. ABAT, ABP, ATTAM.
If you are a manager, how much test hardware should you budget for? More than that. Whatever it was. See, that’s how you do ATTAM. More seriously, you will have to weigh this carefully. Ensure your engineers are doing ABAT and ABP and you’ll have to see how far the budget for hardware can reasonably go after that. Don’t fall into the trap of thinking that “there has to be a limit” implies “any limit is as good as any other limit”. This is similar to “zero is a hard number, so any number will do.” That’s wrong and that’s why you need testing in the first place. There is a test hardware budget that is right for your situation and you’ll have to figure out what it is.
If people follow this advice literally, have I created an economically inefficient situation where engineers are trying to pull too many funds into test hardware? You could imagine such a thing. First of all, don’t follow this too literally. Second, it is more likely that you are neglecting this area than overinvesting in it. Though if your engineers want more test hardware and they didn’t even profile their tests (a common situation), then I’ve given you some ammunition to push back here.
The power of assertions and/or internal errors
It is far better to get a consistent internal error than to get a wrong result from an AI model. For this reason, you will want to put assertions / internal error checks everywhere in your AI compiler. All over the place, in every function, absolutely everywhere. If these trigger often for customers, so that your assertion practice starts to get a poor reputation, that’s still much better than not having the assertions and getting correctness errors instead. If customers are seeing too many assertions trigger, then your problem is the overall quality of your product and of your testing practices, not the assertions themselves. Even if an assertion itself is in error (which almost never happens in my experience), your testing efforts should have uncovered that. Assertions are good. Use them everywhere. Ship your product to your customers with assertions enabled - but see the paragraph below on legal liability.
Assertions can slow your product down. This happens primarily if an assertion is checking a property that takes too long to check. You might also, very rarely, have an issue with assertions, even fairly cheap ones, occurring inside inner loops and therefore taking too long. Both cases should be noticeable when you profile your product (you are profiling your product, right?). In this case you might have to remove an offending assertion that’s just too slow. That’s fine. Still keep adding other assertions, though. Most assertions are quite cheap. Assertions are good.
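As an illustration of keeping an expensive check off hot paths while leaving cheap ones on everywhere, here is a minimal sketch; the ProcessSchedule function and its invariant are made up, and the #ifndef NDEBUG guard is just one way to do it (a DCHECK-style debug-only macro serves the same purpose).

#include <algorithm>
#include <vector>

#include "absl/log/absl_check.h"

void ProcessSchedule(const std::vector<int>& schedule) {
  // Cheap assertion: fine to leave enabled everywhere, including in release.
  ABSL_CHECK(!schedule.empty());

#ifndef NDEBUG
  // O(n) invariant check: if profiling shows it matters in a hot loop,
  // restrict it to debug/test builds like this (or drop it entirely).
  ABSL_CHECK(std::is_sorted(schedule.begin(), schedule.end()));
#endif

  // ... actual work ...
}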
If using C++, never use the C assert() macro: it is compiled out entirely when NDEBUG is defined, which is precisely the configuration you ship, and it can’t carry a useful message. Instead, use a macro like the Abseil CHECK() macro, which is enabled in all build modes and lets you attach a useful dynamic message to the error, e.g.
ABSL_CHECK(a == b) << "a (" << a << ") should be equal to b (" << b << ").";
Now we know the values of a and b when the condition fails, not just that they weren’t equal. This can be very useful sometimes. This particular case is so common that there is a short-hand for it:
ABSL_CHECK_EQ(a, b);
Be aware that the code to generate the message is only ever executed if the CHECK fires, so that isn’t a performance overhead on the happy path. There might be indirect performance effects through code size and adding branches to your code - I’d consider those small overheads to be well worth it, excepting the hottest inner loops in your code.
If you don’t want to take a dependency on Abseil just for a CHECK macro, that’s fine. Just make your own CHECK macro; it fits in a small header. Don’t put unnecessary messages on every CHECK assertion, but a good message can be quite useful, including during debugging.
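For illustration, here is a minimal sketch of what such a header might contain. MY_CHECK and the mini_check namespace are made-up names; this is the general glog-style shape, not Abseil’s actual implementation.

#include <cstdlib>
#include <iostream>
#include <sstream>

namespace mini_check {

// Collects a streamed message, then reports it and aborts in its destructor.
class FatalMessage {
 public:
  FatalMessage(const char* file, int line, const char* condition) {
    stream_ << file << ":" << line << ": CHECK failed: " << condition << " ";
  }
  ~FatalMessage() {
    std::cerr << stream_.str() << std::endl;
    std::abort();
  }
  std::ostream& stream() { return stream_; }

 private:
  std::ostringstream stream_;
};

// Swallows the stream so both branches of the ternary below have type void.
struct Voidify {
  void operator&(std::ostream&) {}
};

}  // namespace mini_check

// Usage: MY_CHECK(a == b) << "a=" << a << " b=" << b;
#define MY_CHECK(cond)                                                     \
  (cond) ? (void)0                                                         \
         : mini_check::Voidify() &                                         \
               mini_check::FatalMessage(__FILE__, __LINE__, #cond).stream()

The FatalMessage temporary only exists on the failure path, so, as with ABSL_CHECK, the message code costs nothing when the condition holds.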
What should you do when an assertion triggers? The default is to shut the program down. This can be a good solution, since it forces reporting of the issue, and operating in an incorrect state is a menace - also for cyber security. However, some cases, like self-driving software, might call for a different response if the car is not currently in a safe situation to shut down or restart. My advice is: consider carefully what to do in case of a segmentation fault, then do the same for assertions. This advice doesn’t help you figure out what to do, but it halves the number of decisions to make, it might help you realize that you can in fact change what happens on a segmentation fault (and sometimes, very rarely, should) and, most importantly, it makes clear that the problem of what to do when an internal error is encountered isn’t introduced by using assertions - you have to figure that out anyway. If your solution is to continue operating the program, be sure to consider what happens if there is a high rate of segmentation faults or assertions - log disk space may overflow and, if the logging or other handling is slow, you might slow your program down so much that it can’t be used anyway (very bad in self-driving). Before you get too depressed about this issue, consider that the real line of defence is to have a very low rate of errors in the first place, and that’s something assertions help with.
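To make the “decide once” idea concrete, here is a minimal sketch of a shared fatal-error path, assuming a POSIX platform; the function names are invented, and the policy shown (report, then abort) is only the default discussed above.

#include <csignal>
#include <cstddef>
#include <cstdlib>
#include <unistd.h>

// Single point of policy for fatal internal errors: report, then shut down.
// (A self-driving stack might instead hand over to an independent fallback
// system before exiting - the point is to make this decision in one place.)
[[noreturn]] void FatalInternalError(const char* msg, std::size_t len) {
  // Only async-signal-safe calls here, since this may run in a signal handler.
  (void)write(STDERR_FILENO, msg, len);
  std::abort();
}

extern "C" void SegvHandler(int) {
  static const char kMsg[] = "fatal: segmentation fault\n";
  FatalInternalError(kMsg, sizeof(kMsg) - 1);
}

void InstallFatalErrorHandlers() {
  std::signal(SIGSEGV, SegvHandler);
  // A home-grown CHECK macro (like the sketch above) could route its failure
  // path through FatalInternalError() too, so both cases share one policy.
}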
This all leads to the awful issue, saved for last, of what to do about bugs and potential legal liability caused by assertions (or by not having them, which could be construed as an irresponsible development practice). Assertions on the whole reduce bugs (and therefore deaths), which is why I advise using them, but assertions are code. Any code that you write can result in bugs and therefore deaths. E.g. even a seemingly side-effect-free assertion can follow a wild pointer, leading to a segmentation fault. Just as any code can. You might remove assertions when you ship to the customer, but are you 100% sure that none of the code inside your assertions includes a necessary side effect that you have now removed? This is a source of bugs, so I don’t suggest it. You might also have an incorrect assertion - it is reporting an error, but there isn’t an error (the assertion itself is the error). On the whole I think the most responsible development process is to assert liberally and to handle what happens when an assertion triggers carefully. I’d suggest trying to avoid policy decisions that will motivate your development team to avoid asserting liberally - e.g. if you require an error recovery document written for every assertion, there may well be no assertions at all, so you’ve in effect decided not to use them. There is no decision that you can make here, including not using assertions in the first place, that doesn’t have some serious problems with it. If you can, sure, use formal methods and prove that every assertion is correct and side effect free, but this is unlikely to be practically feasible for most software. You’ll want to ask your lawyers about error detection and recovery, including assertions (show them this paragraph).
Whole model debugging
Suppose your AI software has a bug that causes an AI model to fail. Worse, your suite of unit tests has failed to catch the bug. Uh oh. Hopefully, you were running some whole models as part of your nightly tests and that is how you caught the bug. In that case you can figure out (bisect) which of the changes from the prior day was responsible - not great to deal with, but also not horrible. But let’s say that’s not what happened. The bug escaped your testing entirely and has either been reported by a customer, or this is a new model that you’ve just added to your test suite. Hopefully, one of your assertions caught the issue, which makes things much easier. But let’s say that’s not what happened either. There is no error, the model just doesn’t give the right result when run on your software and hardware. Let’s say the user model has 100k ops in it. Something is wrong somewhere. It can be extraordinarily challenging to figure out what is causing a situation like this. Oh, and the customer is waiting, so if you can get this done real quick, that would be nice. Sometimes it can be resolved quickly, but sometimes it takes weeks of hard work to figure something like this out.
First of all, you are in this situation because your unit test suite isn’t good enough. You clearly need more unit tests if this is happening. But zero is a hard number, so, very infrequently, even if you have an excellent suite of unit tests, this might still happen.
What to do? I can tell you what happened the first time this occurred on the XLA team. We all stared at each other and said something like “oh fuck, what now?” The whole team got together and brainstormed how to address such situations (we had a good manager who realized that this was necessary). The team had many ideas and we went off and did all of them. Over time, it became clear that just two of those ideas resolved most cases like this. It turns out that most such bugs fall into one of two categories: either some individual op (or fusion of ops) is producing an incorrect result locally, or there is some kind of memory error somewhere (an access to an address in memory that shouldn’t have been accessed - which can have non-local effects). The good news is that both kinds of errors can be diagnosed automatically if you have the right tools for it. I haven’t seen either of these two tools in other AI compilers, but I strongly suggest that everyone build them. The function of both tools is to tell you which exact op is having a problem. From there you’ll have to debug that single op, but that is much easier than whole model debugging.
The Isolator. If an op (or fusion of ops) is doing the wrong thing, the way to figure that out is to take the user model, separate it into individual ops (or fusions of ops), and test each one individually against your reference backend. So, yeah, you must have a reference backend for this. The tool in XLA that does this is called The Isolator. In most whole model debugging situations, the isolator tool will quickly tell you the exact op that is the root cause of the issue. So you are going to want to build a tool like that.
The primary complication in building an isolator tool like this is that you have to provide the isolated op (or fusion of ops) with inputs, so that you can run it and compare the result against the reference backend. One way to do this is to capture the inputs used inside the actual user model. That works, though it is some trouble to set up and generates very large files of inputs. An alternative is to just use random numbers as inputs. This has some issues, chief among them that the floating point numeric properties of the op in question may be bad on random input, even if the numerics are fine on the actual data used inside the model. This makes the output noisy and can result in a mismatch against the reference backend even when there is really nothing wrong with the op. Note that the reference backend is unlikely to produce an identical output anyway, due to floating point reassociation among other reasons. In XLA, there is a very complicated set of heuristics inside The Isolator to figure out how to generate data that is likely to be numerically stable for a given op (or fusion of ops). There are also some ops, e.g. ones that take indices, like DynamicSlice, which won’t be tested very well with purely random data. Using heuristics to resolve this can work very well, and it means that The Isolator can diagnose bugs in a model without needing any associated data for that model. If you don’t want to deal with that, you’ll have to capture live internal data for the user model, or you’ll have to live with false positives. One of those three.
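To make the shape of such a tool concrete, here is a minimal sketch. The Op/Backend/Tensor types and helper names are invented for illustration and are not XLA’s APIs, and the random input generator is only a stand-in for the data-generation heuristics described above.

#include <cmath>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

struct Tensor { std::vector<float> data; };

struct Op {
  std::string name;
  std::vector<std::size_t> input_sizes;  // one element count per input
};

// Both the backend under test and the trusted reference backend implement this.
class Backend {
 public:
  virtual ~Backend() = default;
  virtual Tensor Run(const Op& op, const std::vector<Tensor>& inputs) = 0;
};

// Stand-in for the real data-generation heuristics: values near 1.0 tend to be
// numerically tamer than arbitrary random floats, but this is only a placeholder.
Tensor MakeInput(std::size_t size, std::mt19937& rng) {
  std::uniform_real_distribution<float> dist(0.5f, 1.5f);
  Tensor t;
  t.data.reserve(size);
  for (std::size_t i = 0; i < size; ++i) t.data.push_back(dist(rng));
  return t;
}

bool ApproximatelyEqual(const Tensor& a, const Tensor& b, float tolerance) {
  if (a.data.size() != b.data.size()) return false;
  for (std::size_t i = 0; i < a.data.size(); ++i) {
    if (std::fabs(a.data[i] - b.data[i]) > tolerance) return false;
  }
  return true;
}

// Runs each op of the model in isolation on both backends and returns the
// names of ops whose outputs diverge beyond the tolerance.
std::vector<std::string> IsolateSuspectOps(const std::vector<Op>& model_ops,
                                           Backend& backend_under_test,
                                           Backend& reference_backend,
                                           float tolerance) {
  std::mt19937 rng(42);  // fixed seed so failures reproduce
  std::vector<std::string> suspects;
  for (const Op& op : model_ops) {
    std::vector<Tensor> inputs;
    for (std::size_t size : op.input_sizes) inputs.push_back(MakeInput(size, rng));
    Tensor got = backend_under_test.Run(op, inputs);
    Tensor want = reference_backend.Run(op, inputs);
    if (!ApproximatelyEqual(got, want, tolerance)) suspects.push_back(op.name);
  }
  return suspects;
}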
The Isolator is very effective. For example, the XLA bug that Anthropic found would likely have been immediately automatically diagnosed using The Isolator.
The Bounds Checker. Sometimes The Isolator will report that all the ops in a model are working correctly, yet the final result is still wrong. This is rare, but it happens. How can that be? The most likely reason is that one of the ops is accessing memory that it should not be accessing. This can cause some other op to later receive incorrect data, even while the offending op that is the root cause of the issue doesn’t output any incorrect data itself. The Isolator will not detect such a situation. For this reason, XLA has a feature called bounds checking: every memory access is checked against the bounds it is expected to stay within. Running a model with this feature enabled will pinpoint exactly which op is accessing memory incorrectly. This is so desirable that XLA:TPU enables partial bounds checking at all times, even if you don’t ask for it, which was made possible by performance work that reduced the overhead of bounds checking. Bounds that are particularly expensive to check are still left unchecked unless you ask XLA to enable them.
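To illustrate the idea only - XLA’s actual bounds checking lives inside the compiler and runtime, not in a wrapper like this - here is a minimal sketch of a checked load that attributes a violation to the op performing the access; BufferView and CheckedLoad are invented names.

#include <cstddef>
#include <string>

#include "absl/log/absl_check.h"

struct BufferView {
  const float* data;
  std::size_t size;
  std::string owning_op;  // the op whose output lives in this buffer
};

// A load that pinpoints the offending op when an out-of-bounds access happens.
inline float CheckedLoad(const BufferView& buf, std::size_t index,
                         const std::string& accessing_op) {
  ABSL_CHECK(index < buf.size)
      << "op " << accessing_op << " read index " << index
      << " out of bounds (buffer of size " << buf.size << " produced by "
      << buf.owning_op << ")";
  return buf.data[index];
}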
It is rare for neither The Isolator nor the bounds checker to resolve a whole model debugging problem down to an exact op, but it can happen. In those rare cases, it is still very useful to know that both tools failed to find the bug, because that is a big hint for where to look - the bug can’t be anywhere those tools would find it, so it has to be elsewhere. So these tools are valuable even in the rare cases where they fail to find an offending op (or fusion of ops).
Another possible class of bug is where a high-level optimization rewrote the computational graph incorrectly. Bounds checking and The Isolator work at a lower level of abstraction than this, so they will not detect such an error. This is easier to diagnose, though - disable the optimization passes one at a time (or all at once) and see what effect that has. We didn’t see many cases like this in whole model debugging, I expect because most such cases were already well covered by unit tests; these optimizations are usually not so hard to unit test. Though if you see more bugs of this kind than we did in XLA, perhaps you’ll want a tool for this kind of debugging as well.
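A minimal sketch of the pass-bisection idea, with an invented Pass type and a caller-supplied compile-run-and-compare callback; dropping one pass at a time is the crude O(n) version, and a real tool might bisect the pass list instead.

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Pass { std::string name; };

// Reports every pass whose removal makes the model match the reference again.
void FindSuspectPasses(
    const std::vector<Pass>& pipeline,
    const std::function<bool(const std::vector<Pass>&)>& compile_run_and_compare) {
  for (std::size_t i = 0; i < pipeline.size(); ++i) {
    std::vector<Pass> reduced = pipeline;
    reduced.erase(reduced.begin() + static_cast<std::ptrdiff_t>(i));
    if (compile_run_and_compare(reduced)) {
      std::cout << "disabling pass '" << pipeline[i].name
                << "' makes the model match the reference\n";
    }
  }
}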
Productive and non-error-prone APIs
Not making errors in the first place is upstream of finding them with tests. This isn’t something you can do 100%, but some software designs and APIs are much more error prone than others and it is a good thing to prefer less error prone designs, all other things being equal.
Productive test APIs and productive APIs in general are also important. I gave a concrete example of a productive test API in an earlier section.
These are two things, productive APIs and non-error-prone APIs, that I know when I see them, but I don’t know how to write a long section about how to achieve them. They are very important, though, so the brevity of this section shouldn’t be taken to mean otherwise. It may simply require long-time interest, some talent and a lot of experience to arrive at such APIs, but maybe someone cleverer than me can cover it better.
Updates
2025-01-12 Sean Silva mentioned an IR invariant checker on LinkedIn. Yep, that’s important! This is something that you should have - something to run between compiler passes that detects if IR invariants have been violated. This helps a lot to find bugs in compiler passes and can detect some intermittent issues that may otherwise escape your test suite. Running it between all compiler passes might be too slow, except during debugging, but you may well want to run it as the first and last passes even in production, and perhaps also a few places in-between.
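As a rough illustration, here is a minimal sketch of a pass driver that runs an invariant checker between passes; the Module/Pass/VerifyInvariants interfaces are invented for the example, not any particular compiler’s API.

#include <cstddef>
#include <string>
#include <vector>

#include "absl/log/absl_check.h"

struct Module { /* the IR being compiled */ };

class Pass {
 public:
  virtual ~Pass() = default;
  virtual std::string name() const = 0;
  virtual void Run(Module& module) = 0;
};

// Placeholder: the real checker would walk the module and validate types,
// operand/def-use consistency, and the rest of the IR invariants.
bool VerifyInvariants(const Module& /*module*/) { return true; }

void RunPipeline(Module& module, const std::vector<Pass*>& passes,
                 bool verify_after_every_pass) {
  ABSL_CHECK(VerifyInvariants(module)) << "input IR is already invalid";
  for (std::size_t i = 0; i < passes.size(); ++i) {
    passes[i]->Run(module);
    // Verifying after every pass is the debugging mode; in production you
    // might only verify after the first and last passes (and a few in-between).
    if (verify_after_every_pass || i == 0 || i + 1 == passes.size()) {
      ABSL_CHECK(VerifyInvariants(module))
          << "pass '" << passes[i]->name() << "' violated an IR invariant";
    }
  }
}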