I'm seeing a lot of comments saying "only 2 days? must not have been that bad of a bug". Some thoughts here:
At my current day job, our postmortem template asks "Where did we get lucky?" In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.
Additionally - the author (and his team) triaged, root caused and remediated a JS compiler bug in 2 days. The sheer amount of complexity involved in trying to narrow down where in the browser code this could all be going wrong is staggering. Consider that the reason it took him "only" two days is because he is very, _very_ good at what he does.
Days-taken-to-fix is kind of a weird measure of how difficult a bug is. It's clearly a function of a large number of things besides the bug itself, including experience and whether you have to go it alone or can talk to the right people.
The bug ticks most of the boxes for a tricky bug:
* Non-deterministic
* Enormous haystack
* Unexpected "1+1=3"-type error with a cause outside of the code itself
Like, sure, it would have been slower to debug if it took 30 hours to reproduce, and harder if he'd had to be going over Niagara Falls in a barrel while debugging it, but I'm not quite sure those things count.
I had a similar category of bug I was struggling with the other year[1] that was related to a faulty optimization in the GraalVM JVM leading to bizarre behavior in very rare circumstances. If I'd been sitting next to the right JVM engineers over at Oracle, I'm sure we'd have figured it out in days rather than the weeks it took me.
Imagine if you weren't working at Google and were trying to convince the Chromium team you found a bug in V8. That'd probably be nigh-impossible.
One thing I notice is that Google has no way whatsoever to actually just ask users "hey, are you having problems?", a definite downside of their approach to software development where there is absolutely no communication between users and developers.
I'd love to see the rest of your postmortem template! I never thought about adding a "Where did we get lucky?" question.
I recently realized that one question for me should be, "Did you panic? What was the result of that panic? What caused the panic?"
I had taken down a network, and the device led me down a pathway that required multiple apps and multiple logins I didn't have, to regain access. I panicked and, because the network was small, roamed and moved all devices to my backup network.
The following day, under no stress, I realized that my mistake was that I was scanning a QR code 90 degrees off from its proper orientation. I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation. Then it was simple to gain access to that device. I couldn't even replicate the other odd path.
The basic operation of this program is as follows:
1. Panic. You usually do so anyways, so you might as well get it over with. Just don't do anything stupid. Panic away from your machine. Then relax, and see if the steps below won't help you out.
2. ...
A good section to have is one on concept/process issues you encountered, which I think is a generalization of your question about panic.
For instance, you might be mistaken about the operation of a system in some way that prolongs an outage or complicates recovery. Or perhaps there are complicated commands that someone pasted in a comment in a Slack channel once upon a time and you have to engage in gymnastics with Sloogle™ to find them, while the PM and PO are requesting updates. Or you end up saving the day because of a random confluence of rabbit holes you'd traversed that week, but you couldn't expect anyone else on the team to have had the same flash of insight that you did.
That might be information that is valuable to document or add to training materials before it is forgotten. A lot of postmortems focus on the root cause, which is great and necessary, but don't look closely at the process of trying to stop the bleeding.
> I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation.
Same, I assumed they were designed to always work. I suspect it was whatever app or library you were using that wasn't designed to handle them correctly.
> In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.
I'm not sure this is really luck.
The fix is to just not use Math.abs. If they didn't work at Google they still would've done the same debugging and used the same fix. Working at Google probably harmed them as once they discovered Math.abs didn't work correctly they could've just immediately used `> 0` instead of asking the chrome team about it.
There's nothing lucky about slowly adding printf statements until you understand what the computer is actually doing; that's just good work.
I wish I could recall the details better, but this was 20+ years ago now. In college I had an internship at Bose, doing QA on firmware for a new multi-CD changer add-on to their flagship stereo. We were provided discs of music tracks with various characteristics, and had to listen to them over and over and over and over and over and over, running through test cases provided by QA management as we did. But also doing random ad-hoc testing once we finished the required tests on a given build.
At one point I found a bug where if you hit a sequence of buttons on the remote at a very specific time--I want to say it was "next track" twice right as a new track started--the whole device would crash and reboot. This was a show stopper; people would hit the roof if their $500 stereo crashed from hitting "next". Similar to the article, the engineering lead on the product cleared his schedule to reproduce, find, and fix the issue. He did explain what was going on at the time, but the specifics are lost to me.
Overall the work was incredibly boring. I heard the same few tracks so many times I literally started to hear them in my dreams. So it was cool to find a novel, highest severity bug by coloring outside the lines of the testcases. I felt great for finding the problem! I think the lead lost 20% of his hair in the course of fixing it, lol.
I haven't had QA as a job title in a long time but that job did teach me some important lessons about how to test outside the happy path, and how to write a reproducible and helpful bug report for the dev team. Shoutout to all the extremely underpaid and unappreciated QA folks out there. It sucks that the discipline doesn't get more respect.
That is great QAing. It also speaks to why QA should be a real role in more orgs, rather than a shrinking discipline. Engineers LOVE LOVE LOVE to test the happy path.
It's not even malice/laziness; it's that their entire interpretation of the problem/requirements drives their implementation, which then drives their testing. It's like asking restaurants to self-certify that they are up to food safety codes.
If you do not follow the happy path, something will break 100% of the time. That's why engineers always follow the happy path. Some engineers even think that anything outside the happy path is an exception and not even worth investigating. These engineers only thrive if the users are unable to switch to another product. Only competition will lead to better products.
My favorite happy-path developer, who was by far 10x worse at this than any engineer I've worked with, did the following:
Spec: allow the internal BI tool to send scheduled reports to the user
Implementation: the server required the desktop front end of said user to have been opened that day for the scheduled reports to work, even though the server side was sending the mails
Why this was hilariously bad - the only reason to have this feature is for when the user is out of office / away from desk for an extended period, precisely when they may not have opened their desktop UI for the day.
One of my favorite examples of how an engineer can get the entire premise of the problem wrong.
In the end he had taken so long and was so intransigent that the desktop support team found it easier to schedule the desktop UIs to auto-open in the Windows scheduler every day, so that the whole Rube Goldberg scheduled-reports setup would work.
You just needed to find another one like him, and bam, +4×.
(It is actually conceivable that two bad engineers could mostly cancel each other out, if they can occupy each other enough, but it’s not the most likely outcome.)
> That is great QAing. It also speaks to why QA should be a real role in more orgs, rather than a shrinking discipline.
As a software engineer, I've always been very proud of my thoroughness and attention to detail in testing my code. However, good QA people always leave me wondering "how did they even think to do that?" when reviewing bug reports.
Pedantically pointing out the difference between doing some exploratory testing ("testing outside the test cases") and QA, which is setting up processes/procedures, part of which should be "do exploratory testing as well as running the test cases". The "testing is not QA" distinction has been fought over for decades, though...
But, love the story and I collect tales like this all the time so thanks for sharing
A friend of mine has near-PTSD from watching some movie over and over and over at an optician's where she worked. It was on rotation so that their customers could gauge their eyesight.
Interesting writeup, but 2 days to debug “the hardest bug ever”, while accurate, seems a bit overdone.
Though abs() returning negative numbers is hilarious.. “You had one job…”
To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.
I’m not just talking about concurrency issues either…
The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.
The hardest one I've debugged took a few months to reproduce, and would only show up on hardware that only one person on the team had.
One of the interesting things about working on a very mature product is that bugs tend to be very rare, but those rare ones which do appear are also extremely difficult to debug. The 2-hour, 2-day, and 2-week bugs have long been debugged out already.
That reminded me of a former colleague at the desk next to me randomly exclaiming one day that he had just fixed a bug he had created 20 years ago.
The bug was actually quite funny in a way: it was in the code displaying the internal temperature of the electronics box of some industrial equipment. The string conversion was treating the temperature variable as an unsigned int when it was in fact signed. It took a brave field technician in Finland in winter, inspecting a unit in an unheated space to even discover this particular bug because the units' internal temperatures were usually about 20C above ambient.
This is a surprisingly common mistake with temperature readings. Especially when the system has a thermal safety power off that triggers if it's above some temperature, but then interprets -1 deg C as actually 255 deg C.
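For anyone who hasn't hit this class of bug: the failure mode is just reinterpreting the same byte pattern with the wrong signedness. A minimal sketch in JS (the original firmware was presumably C, but the bit pattern is the same):

  const buf = new DataView(new ArrayBuffer(1));
  buf.setInt8(0, -1);           // sensor stores -1 degC as a signed byte (0xFF)
  console.log(buf.getInt8(0));  // -1   (read back as signed: correct)
  console.log(buf.getUint8(0)); // 255  (read back as unsigned: "thermal emergency")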
The rollout is still happening, but the new resident water meters for Victoria, Australia come with a temperature fix.
Prior to this year, they could only handle 0-127 degrees for the water temperature. Which used to be sensible, but there were some issues with pressurised water starting to be delivered to houses resulting in negative temperatures being reported, like -125C, which immediately has the water switch off to prevent icing problems.
The software side also switched from COBOL to Ada. So that's kewl.
My brother is a wifi expert at a hw manufacturer. He once had a case where the customer had issues setting the transmit power to like 100 times the legal limit. They happened to be an offshore drilling platform and had an exemption for the transmission power as their antenna was basically on a buoy on the ocean. He had to convince the developer to fix that very specific bug.
During the time I was working on a mature hardware product in maintenance, if I think about the number of customer bugs we had to close as not-reproducible, or that were only present briefly in a specific setup, it was really embarrassing and we felt like a bunch of noobs.
Author here! I debugged a fair number of those when I was a systems engineer in soft real time robotics systems, but none of them felt as bad in retrospect because you're just reading up on the system and mulling over it and eventually you get the answer in a shower thought. Maybe I just find the puzzle of them fun, I don't know why they don't feel quite so bad. This was just an exhausting 2-day brute-force grind where it turned out the damn compiler was broken.
I also came to the comments to weigh in on my perception of how rough this was, but instead will ask:
Regarding "exhausting 2-day brute-force grind": is/was this just how you like to get things done, or was there external pressure of the "don't work on anything else" sort? I've never worked at a large company, and lots of descriptions of the way things get done are pretty foreign to me :). I am also used to being able to say "this isn't getting figured out today; probably going to be best if I work on something else for a bit, and sleep on it, too".
The fatal error volume was so overwhelming that we didn't have any option but understanding the problem in perfect detail so that we could fix it if the problem was on our side, or avoid it if it was caused by something like our compiler or the browser.
Our team also had a very grindy culture, so "I'm going to put in extra hours focusing exclusively on our top crash" was a pretty normalized behavior. After I left that team (and Google), most of my future teams have been more forgiving on pace for non-outages.
Same here, we had an IE8 bug that prevented the initial voice over of the screen reader (JAWS). No dev could reproduce it because we all had DevTools open.
I can't remember the actual bug now, but one of my early career memories was hunting down an IE7 issue by using bookmarklets to alert() values. (Did IE7 even have dev tools?)
There was a downloadable developer toolbar for IE6 and IE7, and scripts could be debugged in the external Windows Script Debugger. The developer toolbar even told you which elements had the famous hasLayout attribute applied, which completely changed how it was rendered and interacted with other objects, which was invaluable.
"To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added."
My favourite are bugs that not only don't appear in the debugger, but also don't reproduce anymore under normal settings after I've taken a closer look in the debugger (only to come back later at a random time).
Feels like chasing ghosts.
This repro was a few times per day, but try fixing a Linux kernel panic when you don't even have C/C++ on your resume, and everyone who originally set stuff up has left...
> To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.
A favourite of mine was a bug (specifically, a stack corruption) that I only managed to see under instrumentation. After a lot of debugging turns out that the bug was in the instrumentation software itself, which generated invalid assembly under certain conditions (calling one of its own functions with 5 parameters even though it takes only 4). Resolved by upgrading to their latest version.
I don't think the number of days something took to debug is an interesting measure at all. Trivial bugs can take weeks to debug for a noob. Insanely hard bugs take hours to debug for genius devs, maybe even without any reproducer, just by thinking about it.
In hardware, you regularly see behavior change when you probe the system. Your oscilloscope or LA probes affect the system just enough to make a marginal circuit work. It's absolutely maddening.
Yes! I've dealt with complex issues that turned out to be a vendor-swapped-hardware whoopsie, which we spent over a month trying to solve in software before finally figuring it out.
Part of it was difficulty of pinpointing the actual issue - fullness of drive vs throughput of writes.
A lot of it was unfortunately organizational politics such that the system spanned two teams with different reporting lines that didn't cooperate well / had poor testing practices.
Sometimes it isn't outright lying. I have had issues with hardware, API and SDK documentation being subtly different from the product as shipped. With hardware there can be a mixture of revisions, some conforming to the docs and others differing, and even their engineers not being clear about which is which.
For stuff like this we used an in-memory ring buffer logger that printed the logs on request. And it didn't save the strings, just the necessary data bits and a pointer to the formatting function. Writing to this logger didn't affect any timings.
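For anyone curious what that looks like, here's a rough sketch of the idea in JS (the real thing was presumably C on an embedded target, where deferring the string formatting is exactly what keeps the timing impact negligible; this only illustrates the data structure):

  // Preallocated ring buffer of (formatter, args) entries; formatting is deferred.
  const SIZE = 1024;
  const ring = new Array(SIZE).fill(null);
  let head = 0;

  function log(fmtFn, ...args) {
    ring[head] = { fmtFn, args };   // cheap on the hot path: no string building
    head = (head + 1) % SIZE;
  }

  function dump() {
    for (let i = 0; i < SIZE; i++) {
      const entry = ring[(head + i) % SIZE];
      if (entry) console.log(entry.fmtFn(...entry.args));  // format only on request
    }
  }

  // usage: log((a, b) => `diff=${a - b}`, tEnd, tStart); ... later, dump()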
I always refer to them as “quantum bugs” because the act of observing the bug changes the bug. Absolutely infuriating. I like “heisenbug” better. Has a better ring to it.
FWIW: this type of bug in Chrome is exploitable to create out-of-bounds array accesses in JIT-compiled JavaScript code.
The JIT compiler contains passes that will eliminate unnecessary bounds checks. For example, if you write “var x = Math.abs(y); if(x >= 0) arr[x] = 0xdeadbeef;”, the JIT compiler will probably delete the if statement and the internal nonnegative array index check inside the [] operator, as it can assume that x is nonnegative.
However, if Math.abs is then “optimized” such that it can produce a negative number, then the lack of bounds checks means that the code will immediately access a negative array index - which can be abused to rewrite the array’s length and enable further shenanigans.
> which can be abused to rewrite the array’s length and enable further shenanigans.
I followed all of this up until here. JavaScript lets you modify the length of an array by assigning to indexes that are negative? I'm familiar with the paradigm of negative indexing being used to access things from the end of the array (like -1 being the last element), but I don't understand what operation someone could do that would somehow modify the length of the array rather than modifying a specific element in-place. Does JIT-compiled JavaScript not follow the usual JavaScript semantics for a negative index, or are you describing something that would be used in combination with some other compiler bug (which honestly sounds a lot more severe even in the absence of an unusual Math.abs implementation)?
Normally, there would be a bounds check to ensure that the index was actually non-negative; negative indices get treated as property accesses instead of array accesses (unlike e.g. Python where they would wrap around).
However, if the JIT compiler has "proven" that the index is never negative (because it came from Math.abs), it may omit such checks. In that case, the resulting access to e.g. arr[-1] may directly access the memory that sits one position before the array elements - which could, for example, be part of the array metadata, such as the length of the array.
You can read the comments on the sample CVE's proof-of-concept to see what the JS engine "thinks" is happening, vs. what actually happens when the code is executed: https://github.com/shxdow/exploits/blob/master/CVE-2020-9802.... This exploit is a bit more complicated than my description, but uses a similar core idea.
I understand the idea of the lack of a bounds check allowing access to early memory with a negative index, but I'm mostly struggling with wrapping my head around why the underlying memory layout is accessible in JavaScript in the first place. I hadn't considered the fact that the same syntax could be used for accessing arbitrary properties rather than just array indexes; that might be the nuance I was missing.
> I followed all of this up until here. JavaScript lets you modify the length of an array by assigning to indexes that are negative?
This is my no doubt dumb understanding of what you can do, based on some funky stuff I did one time to mess with people's heads
do the following
const arr = [];
arr[-1] = "hi";
console.log(arr)
this gives you
"-1": "hi"
length: 0
which I figured is because really an array is just a special type of object. (my interpretation, probably wrong)
now we can see that the JavaScript Array length is 0, but since the value is findable in there, I would expect there is some length representation in the lower-level language that JavaScript is implemented in, in the browser. I would then think that there could even be exploits available by somehow taking advantage of the difference between this lower-level representation of length and the JS array length. (Again, all this is silly stuff I thought about and have never investigated, and is probably laughably wrong in some ways.)
I remember seeing some additions to array a few years back that made it so you could protect against the possibility of negative indexes storing data in arrays - but that memory may be faulty as I have not had any reason to worry about it.
You raise a good point that JavaScript arrays are "just" objects that let you assign to arbitrary properties through the same syntax as array indexing. I could totally imagine some sort of optimization where a compiler utilizes this to be able to map arrays directly to their underlying memory layout (presumably with a length prefix), and that would end up potentially providing access to it in the case of a mistaken assumption about omitting a bounds check.
yeah, you know, what you said made me think about these funny experiments that I haven't done in a long time, and I remember now that you can do
const arr = [];
arr[false] = "hi";
which console.log(arr); - in FF at least - gives
Array []
false: "hi"
length: 0
which means
console.log(arr[Boolean(arr.length)]); returns
hi
which is funny, I just feel there must be an exploit somewhere among this area of things, but maybe not because it would be well covered.
on edit: for example, since the index could be produced - for some reason - by a numeric operation that outputs NaN, you would then have NaN: "hi". Or since arr[-1] gives you "-1": "hi" but arr[0 - 1] returns that "hi", there are obviously type conversions going on in the indexing... which always struck me as a place where you don't expect type conversions to be going on, the way you do with a == b.
Maybe I am just easily freaked out by things as I get older.
Because after bound checks have been taken care of, loading an element of a JS array probably compiles to a simple assembly-level load like mov. If you bypass the bounds checks, that mov can read or write any mapped address.
Yeah, I understand all of that. I think my surprise was that you can access arbitrary parts of this struct from within JavaScript at all; I guess I really just haven't delved deeply enough into what JIT compiling actually is doing at runtime, because I wouldn't have expected that to be possible.
My own story: I spent >10 hours debugging an Emacs project that would occasionally cause a kernel crash on my machine. Proximate cause was a nonlocal interaction between two debug-print statements. (Wasn't my first guess). The Elisp debug-print function #'message has two effects: it appends to a log, and also does a small update notification in the corner of the editor window. If that corner-of-the-window GUI object is thrashed several hundred times in a millisecond, it would cause the GPU driver on my specific machine to lock up, for a reason I've never root-caused.
Emacs' #'message implementation has a debounce logic, that if you repeatedly debug-print the same string, it gets deduplicated. (If you call (message "foo") 50 times fast, the string printed is "foo [50 times]"). So: if you debug-print inspect a variable that infrequently changes (as was the case), no GUI thrashing occurs. The bug manifested when there were *two* debug-print statements active, which circumvented the debouncer, since the thing being printed was toggling between two different strings. Commenting out one debug-print statement, or the other, would hide the bug.
> If that corner-of-the-window GUI object is thrashed several hundred times in a millisecond, it would cause the GPU driver on my specific machine to lock up, for a reason I've never root-caused.
Until comparatively recently, it was absurdly easy to crash machines via their graphics drivers, even by accident. And I bet a lot of them were security concerns, not just DoS vectors. WebGL has been marvellous at encouraging the makers to finally fix their drivers properly, because browsers declared that kind of thing unacceptable (you shouldn’t be able to bring the computer down from an unprivileged web page¹), and developed long blacklists of cards and drivers, and brought the methodical approach browsers had finally settled on to the graphics space.
Things aren’t perfect, but they are much better than ten years ago.
—⁂—
¹ Ah, fond memories of easy IE6 crashes, some of which would even BSOD Windows 98. My favourite was, if my memory serves me correctly, <script>document.createElement("table").appendChild(document.createElement("div"))</script>. This stuff was not robust.
I experienced "crashes after 16 hours if you didn't copy the mostly empty demo Android project from the manufacturer and paste the entire existing project into it"
Turned out there was an undocumented MDM feature that would reboot the device if a package with a specific name wasn't running.
Upon decompilation it wasn't supposed to be active (they had screwed up and shipped a debug build of the MDM) and it was supposed to be 60 seconds according to the variable name, but they had mixed up milliseconds and seconds
My hardest bug story, almost circling back to the origin of the word.
An intern gets a devboard with a new mcu to play with. A new generation, but mostly backwards compatible or something like that. The intern gets the board up and running with the embedded equivalent of "hello world". They port basic product code - ${thing} does not work. After enough hair is pulled, I give them some guidance - ${thing} does not work. Okay, I instruct the intern to take the mcu vendor libraries/examples and get ${thing} running in isolation. The intern fails.
Okay, we are missing something huge that should be obvious. We start pair programming and strip the code down layer by layer. Eventually we are at a stage where we are accessing hand-coded memory addresses directly. ${thing} does not work. Okay, set up a peripheral and read state register back. Assertion fails. Okay, set up peripheral, nop some time for values to settle, read state register back. Assertion fails. Check generated assembly - nopsled is there.
We look at the manual; the bit switching the peripheral into the state we care about is not set. However we poke the mcu, whatever we write to the control register, the bit is just not set and the peripheral never switches into the mode we need. We get a new devboard (or resolder the mcu on the old one, don't remember) and it works first try.
"New device - must be new behavior" thinking with lack of easy access to the new hardware led us down a rabbit hole. Yes, nothing too fancy. However, I shudder thinking what if reading the state register gave back the value written?
> what if reading the state register gave back the value written?
I've had that experience. Turned out some boards in the wild didn't have the bodge wire that connected the shift register output to the gate that changed the behavior.
It’s amusing how so many of the comments here are like “You think two days is hard? Well, I debugged a problem which was passed down to me by my father, and his father before him”. It reminds me of the Four Yorkshiremen sketch.
Yes, of course, I greatly enjoy the stories and it’s why I opened this thread. But that’s not what my comment is about, I was specifically referencing the parts of the comments which dismiss the difficulty and length of time the author spent tracking down this particular bug. I found that funny and my comment was essentially one big joke.
At least the author worked for Google. It's another layer of fun to go through the work of tracking down a bug like that as a third party and then trying to somehow contact a person at the company who can fix it, especially when it is a big company and doubly so if the product is older and on a maintenance only schedule.
Me: "Your product is broken for all customers in this situation, probably has been so for years, here is the exact problem and how to fix it, can I talk with someone who can do the work?"
Customer Support: "Have you tried turning your machine off and turning it back on again?"
Complaining about "slow to reproduce" and talking _seconds_. Dear, oh dear those are rookie numbers!
Currently working a bug where we saw file system corruption after 3 weeks of automated testing, tens of thousands of restarts. We might never see the problem again; it's only happened once so far.
If it only happened once... it might be the final category of bugs where nothing you can do will fix it. Cosmic ray bit flipping bug. Which is something your software needs to be able to work around, or in this case, the file system itself... unless you're actually working on the file system itself, in which case, I wish you good luck.
Anything can fail, at any time. The best we can do is mitigate it and estimate bounds for how likely it is to mess up. Sometimes those bounds are acceptable.
My worst bug had me using statistics to try and correlate occurrence rates with traffic/time of day, API requests, app versions, Node.js versions, resource allocations, etc. And when that failed I was capturing Prod traffic for examination in Wireshark...
Turned out that Node.js didn't gracefully close TCP connections. It just silently dropped the connection and sent a RST packet if the other side tried to reuse it. Fun times.
Heh, not a nodejs problem but something related to TCP connections.
I won't name the product because it's not its fault, but we had an HA cluster of 3 instances of it set up. Users reported that the first login of the day would fail, but only for the first person to come into the office. You hit the login button, it takes 30 seconds to give you an invalid login, and then you try logging in again and it works fine for the rest of the day.
Turns out IT had a "passive" firewall (traffic inspection and blocking, but no NAT) in place between the nodes. The nodes established long-running TCP connections between them for synchronization. The firewall internally kept a table of known established connections and eventually drops them out if they're idle. The product had turned on TCP keepalive, but the Linux default keepalive interval is longer than the firewall's timeout. When the firewall dropped the connection from the table it didn't spit out RST packets to anyone, it just silently stopped letting traffic flow.
When the first user of the day tried to log in, all three HA nodes believed their TCP connections were still alive and happy (since they had no reason not to think that) and had to wait for the connection to timeout before tearing those down and re-establishing them. That was a fun one to figure out...
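The mismatch is easy to hit: the Linux default keepalive idle time (net.ipv4.tcp_keepalive_time) is 2 hours, while many stateful firewalls drop idle flows after far less. A rough sketch of the application-side workaround in Node.js, shortening the probe interval instead of relying on the OS default (host, port and timeout here are made-up examples):

  const net = require('net');

  const socket = net.connect(5432, 'sync-peer.internal', () => {
    // Start sending keepalive probes after 30s of idle time,
    // well under the (hypothetical) firewall's idle-flow timeout.
    socket.setKeepAlive(true, 30_000);
  });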
Networking in node.js is maddeningly stupid and extremely hard to debug, especially when you're running it in something like Azure where the port allocation can be restricted outside of your control. It's bad enough that I wouldn't consider using node.js on any new project.
Honestly, of all the stupid ideas, having your engine switch to a completely untested mode when under heavy load, a mode that no one ever checks and in which it might take years to discover bugs, is absolutely one of the most insane things I can think of. That's at best really lazy, and at worst displays a corporate culture that prizes superficial performance over reliability and quality. Thankfully no one's deploying V8 in, like, avionics. I hope.
At least this is one of those bugs you can walk away from and say, it really truly was a low-level issue. And it takes serious time and energy to prove that.
I agree with your assessment of how stupid this is, but I'm not surprised.
To be clear, there are good reasons for this different mode. The fuck-up is not testing it properly.
These kinds of modes can be tested properly in various ways, e.g. by having an override switch that forces the chosen mode to be used all the time instead of using the default heuristics for switching between modes. And then you run your test suite in that configuration in addition to the default configuration.
The challenge is that you have now at least doubled the time it takes to run all your tests. And with this kind of project (like a compiler), there are usually multiple switches of this kind, so you very quickly get into combinatorial explosion where even a company like Google falls far short of the resources it would require to run all the tests. (Consider how many -f flags GCC has... there aren't enough physical resources to run any test suite against all combinations.)
The solution I'd love to see is stochastic testing. Instead of (or, more realistically, in addition to) a single fixed test suite that runs on every check-in and/or daily, you have an ongoing testing process that continuously tests your main branch against randomly sampled (test, config) pairs from the space of { test suite } x { configuration space }. Ideally combine it with an automatic bisector which, whenever a failure is found, goes back to an older version to see if the failure is a recent regression and identifies the regression point if so.
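A minimal sketch of what I mean (names are made up, and the real thing would hand failures to a bisector and a reporting pipeline): an always-running loop that samples a random (test, config) pair each iteration instead of sweeping the full cross product.

  // Assume testSuite is a list of { name, run(config) } and configSpace is a
  // list of flag sets; both are stand-ins for whatever the project really has.
  function pick(arr) {
    return arr[Math.floor(Math.random() * arr.length)];
  }

  async function stochasticLoop(testSuite, configSpace) {
    for (;;) {
      const test = pick(testSuite);
      const config = pick(configSpace);
      const ok = await test.run(config);
      if (!ok) {
        // Hand off to an automatic bisector, file a report, etc.
        console.error(`FAIL: ${test.name} with config ${JSON.stringify(config)}`);
      }
    }
  }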
Isn't stochastic testing becoming more and more of a standard practice? Even if you have the hardware and time to run a full testsuite, you still want to add some randomness just to catch accidental dependencies between tests.
Maybe? I'd love to hear if there are some good tools for it that can be integrated into typical setups with Git repositories, Jenkins or GitHub Actions, etc.
We had a fun bug where our VPN was crashing on macOS. The error was pretty clear, we were subtracting two timestamps and getting a negative, which should never happen as these were from a monotonic clock. We spent lots of time analyzing all of the code to make sure that the arguments were all in the right order and being subtracted from the right values and everything looked fine.
However we still saw these crash reports from one device (conveniently the partner of the CEO, so we got full debug reports). However the system logs were suspicious, lots of clock jumps especially when coming out of sleep. At the end of the day we concluded it was bad hardware (an M1 Max) and the OS was trusting it too much, returning out-of-order values for a supposedly monotonic clock. We updated the code to use saturating arithmetic to mitigate the problem.
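The mitigation is simple enough to show in a few lines. This is just the idea sketched in JS (the VPN code itself presumably wasn't JS): clamp the elapsed time to zero instead of trusting the "monotonic" source unconditionally.

  // performance.now() is the monotonic-ish clock in browsers and modern Node.
  let last = performance.now();

  function elapsedMs() {
    const now = performance.now();
    // Saturating subtraction: if the clock ever steps backwards
    // (buggy hardware, sleep/wake weirdness), report 0 instead of a negative.
    const delta = Math.max(0, now - last);
    last = now;
    return delta;
  }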
My hardest debug was actually not software related, it was my first car - late 80s VW Passat. The problem was that the battery would simply not charge, and I had to jump-start it every time I used it, or park at the top of a hill/street and start it rolling down.
Bought a brand new battery, but the problem persisted. Started looking at all the various parts in the car, that were connected to the electrical system. Took them out, troubleshooting the parts to my best ability, even ended up buying a new alternator AND solenoid just out of sheer desperation.
3 months went by, countless hours in the garage, and I thought to myself...could it be...could it be the new battery I bought? Bought yet another battery, and everything worked. Just like that.
Turns out the battery I had in my car originally had degraded and couldn't store enough charge. And the second (brand new) one I bought turned out to also be defective, having the very same fault.
Those faulty batteries would charge up to measure the correct voltage, but didn't get the correct charge capacity - and thus the car couldn't draw enough current to start the engine.
And don't get me started on the weird wacky world of electronics...but the car debugging was by far the longest I've spent, at one point I had almost every component out of the car, going over the wiring.
That's the worst when you buy a new part and it still doesn't fix it, you rarely think that the new part could be bad, especially something like a battery that generally wouldn't have problems fresh from the store.
It seems to me that V8 had very bad unit tests if this wasn't caught before release. Making sure all operators act the same way when optimized and not is a no-brainer.
It sounds like their unit-tests cover abs(), but they weren't covering all of abs(), and were not reliably triggering the optimized codepath:
> When doing the refactoring, they needed to provide new implementations for every opcode. Someone accidentally turned Math.abs() into the identity function for the super-optimized level. But nobody noticed because it almost never ran — and was right half of the time when it did.
If it was never tested, plain and simple, then it doesn't matter that it 'almost never ran' or 'was right half the time'.
So the root problem here is that their test-suite neither exercised all optimized levels appropriately, nor flagged the omission as a fatal problem breaking 100% branch coverage (which for a simple primitive like abs you'd definitely want). This meant that they could break lots of other things too without noticing. OP doesn't discuss if the JS team dealt with it appropriately; one hopes they did.
Fair enough, it's busywork and easy to postpone. But code optimization is something that needs this kind of double-checking, so in the end you should have it for all opcodes, and then including the easy ones like abs isn't much extra work.
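Even a dumb property test catches this class of bug, provided the harness actually gets the function hot enough to reach the optimized tier. Something along these lines (a sketch; V8 also has internal hooks like %OptimizeFunctionOnNextCall behind --allow-natives-syntax for forcing tier-up deterministically, though leaning on engine internals is its own maintenance cost):

  function checkAbs(x) {
    const r = Math.abs(x);
    if (r < 0 || r !== Math.abs(-x)) {
      throw new Error(`Math.abs broken for ${x}: got ${r}`);
    }
  }

  // Run enough mixed-sign iterations that the JIT tiers the loop up.
  for (let i = 0; i < 1e6; i++) {
    checkAbs(i);
    checkAbs(-i);
    checkAbs((Math.random() - 0.5) * 2 ** 31);
  }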
The worst bug I've ever encountered was a JS file that kept not running, with a very cryptic, hard-to-understand trace that made no sense. TypeScript and other tools parsed it fine without any issues.
After 3 days of literally trying everything, I don't know why, I thought of rewriting the file character by character by hand and it worked. What was happening?
Eventually opened the two files side by side in a hex editor and here it is: several exotic unicode characters for "empty" space.
I've seen this happen in enterprise systems-integration work, where some data interchange spec is authored as a Word document with tables defining valid string values for certain fields, and Word helpfully replaces the plain ASCII dashes in the string constants with pretty long dashes. Team A builds their side hand-typing these constants as plain ASCII, and team B builds their side by copy-pasting the exact Unicode strings out of the Word doc.
Not a hard thing to debug once the issue is noticed, and completely preventable (write specs in plain text).
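Both failure modes (invisible Unicode whitespace and Word's "smart" dashes and quotes) are easy to detect mechanically once you think to look. A small sketch of the kind of check that would have saved those days:

  // Report every character outside printable ASCII, with its position and code point.
  function findSuspiciousChars(text) {
    const hits = [];
    for (let i = 0; i < text.length; i++) {
      const code = text.codePointAt(i);
      const printable = code === 0x0a || code === 0x0d || code === 0x09 ||
                        (code >= 0x20 && code <= 0x7e);
      if (!printable) {
        hits.push({ index: i, codePoint: 'U+' + code.toString(16).toUpperCase() });
      }
    }
    return hits;
  }

  // e.g. findSuspiciousChars('if (x \u200b=== 1) {}')
  //   -> [ { index: 6, codePoint: 'U+200B' } ]   (zero-width space)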
In the late 1990s my friend was writing a game for his TI-83 calculator in TI-Basic. He was running into this bizarre bug we boiled down to a single IF after almost an hour of back and forth over a single calculator. The IF was not behaving as you would expect and it made zero sense. In the early version of TI-Basic, operators are actually single symbols, rather than made from text characters. In frustration I delete the IF symbol, insert a new one, and fire the game up. Everything works, and my friend just about dies in disbelief. It's probably my most frustrating bug fix.
I was telling someone the story a couple years ago and they said the opcodes linked to the symbols could get corrupted or something like that.
Seems it is a story time thread. Here goes my strangest one.
Back in 2005, when my only computer access was paid-by-cash time at an internet cafe, one of the shopkeepers offered me free time on a computer IF I typed in and ran a 15-page class 12 computer project, printed on A4 sheets, in the compiler. Turbo C++. I gladly accepted the offer and typed it all in.
When I finished typing and took out all the compile errors, the program didn't work as expected. A few hours later, I found out that 1 or 2 pages of the printed source code were not in the original order. :-O So I had to swap code from one function to another to finally get it working. That was one hell of a lesson!
The shopkeeper must have sold that project to many students, and I got some free internet access.
I work on server software for online customer backups. We do thousands of mounts/umounts of a particular filesystem daily.
Once every month or so, we get an issue where a file timestamp fails to save, the error happens at the filesystem level.
Hard to reproduce! It's a filesystem bug! So it was all theoretical: reading code and seeing how it could happen.
Found out after a while, the conditions were fun. I don't remember exactly, but it was like, you need to follow these steps :
1/ Create a folder
2/ Create in it 99 files (no more no less)
3/ Create a new folder
4/ Copy the first of the 99 files in the new folder
The issue was linked to some data structure caching, and cache eviction.
> When doing the refactoring, they needed to provide new implementations for every opcode. Someone accidentally turned Math.abs() into the identity function for the super-optimized level. But nobody noticed because it almost never ran — and was right half of the time when it did.
That's the perfect optimization: extremely fast, and mostly right -- probably more often than 50% if there are more positive numbers than negative ones.
One of the interesting ones we encountered was in the JDBC driver of our chosen database at the time. Under load, the application core dumped. Mind you, this is Java, running a native JDBC driver, no JNI in sight. It took some gdb stepping to figure out that under load, the JIT compiler got a little aggressive and inlined a little more code than there was room for in the JIT buffer - result? A completely random core dump. Once I did find it, it was a simple matter of increasing the JIT buffer size and adding more heap and RAM. Tracing assembler generated from bytecode generated from Java was just part of the issue; the fact that the code itself had nothing to do with the problem is what made it interesting, as the buffer size is set in a completely different area by the JVM. Fun times.
I'm not even close to being on par with other FAANG engineers, but this is far from being a very difficult bug in my experience. The hardest bugs are the ones where the repro takes days. But nonetheless the OP's tenacity is all that matters, and I would trust them to solve any of the hard problems I've faced in the past.
Hi, author here! At my job before Google I had to debug these kinds of bugs for our mobile robotics / computer vision stack, but I found them fun so they didn't feel "hard" per se. The most time-consuming one took a month on basically a camera-mounted computer vision system, where after an hour of use the system would start stuttering unusably. But the journey took us through heat throttling on 2009-era gaming laptops, esoteric windows APIs, hardware design, and ultimately distributed queuing. But fixing it was a blast! I learned a ton. I hated that project but fixing that bug was the highlight of it.
Funkiest for me was a random crash in a C# app. No pattern whatsoever. No function or user role or part of the software or time of day. I had to learn crash dump analysis and bought my first Kindle books (on desktop, no kindle because I needed it asap), one of which had a trick to make a memory issue crash closer to the source, rather than leave it around to be stumbled over hours later. Which was the source of the randomness. Click button, crash. Move mouse, crash.
This had worked perfectly for many years but windows was upgraded underneath it, and some smartass had used clever tricks for a hover menu that didn’t work in a future (safer) version of the OS. A rarely triggered hover menu.
Thank you, authors of advanced windows debugging and advanced .net debugging.
Vendor provided an outlook plugin (ew) that linked storage directly in outlook (double ew) and contained a built in pdf viewer (disgusting) for law firms to manage their cases.
One user, regardless of PC, user account or any other isolation factor, would reliably crash the program and outlook with it.
She could work for 40 minutes on another user's logged-in account on another PC and reproduce the issue.
Turns out it was a memory allocation issue. When you open a file saved in the addons storage, via the built in pdf viewer, it would allocate memory for it. However, when you close the pdf file, it would not deallocate that memory. After debugging her usage for some time, I noted that there was a memory deallocation, but it was performed at intervals.
If there were 20 or so pdf allocations and then she switched customer case file before a deallocation, regardless of available memory, the memory allocation system in the addon would shit the bed and crash.
This one user, an absolute powerhouse of a woman I must say, could type 300 wpm and would rapidly read -> close -> assign -> allocate -> write notes faster than anyone I have ever seen before. We legitimately got her to rate limit herself to 2 files per 10 minutes as an initial workaround while waiting for a patch from the vendor.
I had to write one hell of a bug report to the vendor before they would even look at it. Naturally they could not reproduce the error through their normal tests and tried closing the bug on me several times. The first update they rolled out upped it to something like 40 pdfs viewed every 15 minutes. But she still managed to touch the new ceiling on occasion (I imagine billing each of those customers 7 minutes a pop or whatever law firms do) and ultimately they had to rewrite the entire memory system.
This is close enough to the "can't log in to computer when standing up" bug... someone had swapped the keycaps for D/F (for example) so when 5-star general tried to log in when standing up he was typing "doobar" instead of "foobar" into the password field.
With the lady, if she'd dialed it back a bit on her pace of work "because people are watching", that could have been a crazy one to debug... "only happens when no one is watching (and I'm not beastly-WPM closing cases)"
> Vendor provided an outlook plugin (ew) that linked storage directly in outlook (double ew) and contained a built in pdf viewer (disgusting) for law firms to manage their cases.
I still don't understand how we've arrived at this state of affairs
Look I supported a few different legal platforms in that role and while I hated it, it was also the best.
Heres what a lawyer does:
1. They bill for time writing emails and on phone calls
2. They bill for time reviewing emails.
3. They bill for printing (and faxing if they are diehards)
4. They also bill for the time they are face to face with a human.
They also need to gather all the data, much of which flows in and out via email (or fax if they hate you) related to the case in a single space.
The sad state is that 80% of this can be achieved in Outlook without much effort. Setting up an external application to capture all this shit is quite difficult, and generally requires mail to be run through it in some capacity. The question is, why reinvent the email client? (Sadly they reinvented the PDF reader.) I have seen some law firms literally saving out every email as HTML and uploading it with billing stats to a third-party app. It's easier for me to support, but the user experience can be awful.
The user already exists in Outlook, they already understand outlook. A few buttons in the ribbon (Mostly File this under X open case, time me, and bill this customer) make more sense from a user perspective.
From a support perspective it's an absolute nightmare. Microsoft absolutely won't take a support case about an add-on with shit memory management. And the add-on provider will usually blame Microsoft.
I didn't fix this bug but I did reproduce it so it could be fixed, though it took years. At one company I worked for we had an email archive, and we were seeing an uptick in customers having issues with deleting expired emails. Most companies have a retention policy of about 7 years, the company was now 10 years old, and early customers were beginning to delete old emails. Developers couldn't find the bug, and reducing the scope of the deletion usually worked, so it was usually marked as not reproducible. While devs tried to debug it, no one would let us poke around their prod email server very much, for obvious reasons.
I had been promoted to technical writer and I needed a better test system that didn’t have customer data for screenshots. Something I needed was unique data because the archive used single instance storage, so I put together a bash script to create and send emails generated from random lines of public domain books I got from Gutenberg.
This worked great for me and at one point I had it fire off 1 million emails just for fun. I let my test email server and archive server chew on them over the weekend. It worked great but I had nearly maxed out my storage. No problem, use the deletion function. And it didn’t work.
It Didn't Work. I had reproduced the bug in-house on a system we had full control over. Engineering and QA both took copies of my environment and started working on the bug.
I also learned the lore of the deletion feature. The founding developer didn't think anyone wanted a deletion feature because it made no sense to him. But after pressure from the CEO, the Board of Directors and customers, he banged out some code over a weekend and shipped it. It was now 10 years later, he was long gone, and it was finally beginning to bite us.
After devs banged on the code for a while they found there was a design flaw: it failed if the number of items to delete was more than 500. QA had tested the feature, repeatedly, but their test data set just happened to be smaller than 500 items, so the bug never triggered. I only exceeded that because Austin Powers is funny.
Now that we could reproduce it, we knew there was a design flaw and the deletion code needed to be replaced. It ended up taking over two years to replace it, because project management never thought it was all that important compared to new features, even though customers were complaining about it.
I had one that took literally years to reproduce. It was in PLC code, on a touchscreen controller running a soft PLC with Busybox under the hood. These devices were used 24/7 and usually absolutely bullet proof. Every now and then I’d get a comment that sometimes they’d crash on startup but a power cycle usually fixed it. Finally managed to get it to happen in the workshop, and dropped everything to try and figure it out.
The ultimate cause was in the network initialisation using a network library that was a tissue-paper-thin wrapper around Linux sockets. When downloading a new software version to the device, it would halt the PLC but this didn’t cleanly shut down open sockets, which would stay open, preventing a network service from starting until the unit was restarted. So I did the obvious thing and wrote the socket handle to a file. On startup I’d check the file and if it existed, shut that socket handle. This worked great during development.
Of course this file was still there after a power cycle. 99% of the time nothing would happen, but very occasionally, closing this random socket handle on startup would segfault the soft PLC runtime. So dumb, but so hard to actually catch in the wild.
The early-to-mid-90s "High C/C++" compiler had a bug in its floating point library for basic math functions. It ended up being a bit of a Heisenbug to track down, and I didn't initially believe it wasn't my code, but it actually ended up being in their supplied library.
It took me maybe three days to track down, from first clues to final resolution, on a 486/50 luggable with the orange on black monochrome built-in screen.
In interviews I've never forced anyone to code, what I do is try to get them to tell me these sorts of war stories - I want to hear how you fixed it, why it was cooly bizarre, and I'm hoping for some enthusiasm when you talk about it.
I couldn't always get people to talk this way, but people who did usually worked out well
This, whenever I get these sorts of questions on interviews I don't know how to answer, because my weirdest or hardest bug isn't something I've internalized as a war story, it was just another day.
It's just like those "what did you do when you had conflict with another employee" questions. I either worked it out with them like an adult or got our management involved and they worked it out for them. It's not some hero narrative I considered much past the time it happened.
No, they're selecting for the kind of person who can tell a war story when asked. They're also selecting for the kind of people who had to debug something gnarly enough and different enough that it was memorable.
Some people are not natural story tellers. Telling a story is not a usual part of the job responsibility of a software engineer—we aren't novelists. Having a memorable debugging experience doesn't directly equate to having a good story to tell.
This is really the same issue with the promo culture we see at Big Tech companies: you end up promoting the people who are good at crafting promo packets i.e. telling stories about their work. There is certainly a good overlap between that and the people who do genuinely good work, but it's not a perfect overlap.
Personally I don't really mind it because I consider myself good at story telling. But as an interviewer I would never do that to a candidate because not everyone can tell good stories.
They have forums if you can find them, I think they call them 'communities', where you can complain.
Then a high-ranked non-employee 'product expert' will be along presently to tell you that's not really a problem and to stop bothering the almighty google with such trivialities, your views are not important and they have millions of users, really why should they listen to you?
This is a very fun post, not only on its own merits, but also how it spurs many other hard-to-debug stories.
I like the hard-earned lessons that are often taken away from such sessions.
While nowhere on the scale of this story: while I was at university, I helped a fellow student whose program was outputting highly bogus numbers from punched-card deck input. I ultimately suggested that he print out the numbers that were being read by the program, and presto, the field alignments were off. This has since become my first step in debugging.
A co-op stint during my EE degree program was at a pulp bleach plant in Longview, Washington. They were implementing instrumentation of various metrics in the bleach tower. The engineers told a story about one of their instruments measuring flow or temperature or acidity. The instrument was failing, but the manufacturer couldn't find any flaw and shipped it back. The cycle repeated several times until one of the engineers accompanied the instrument to the repair lab. The technicians were standing the instrument on its side, not flat as it was in the instrument rack back at the plant. Laying it flat exposed the error.
Another bug sticks in my mind from reading Coders At Work by Peter Seibel. Guy Steele tells about a bug Bill Gosper reported in the bignum library. One thing that caught his eye was a conditional step he didn't quite understand. Since it was based on the division algorithms from Knuth: "And what caught my eye in Knuth was a comment that this step happens rarely—with a probability of roughly only one in two to the size of the word." The error was in a rarely-executed piece of code. The lesson here helped him find similar bugs.
While three of us were building a compiler at Sycor, we kept a large lab notebook in which we wrote brief release notes, and a one-line note about each bug we found and fixed.
My most recent bug was a new emacs snippet was causing errors in eval_buf. Made no sense, so ultimately decided to clear out the .emacs.d directory and start over. There were files that were over 20 years old--I just copied the directory when I built a new machine.
As far as I'm concerned if you can use a debugger it automatically shouldn't qualify as the most difficult ever.
As per the compute shader post from a few days ago, I'm currently "debugging" some pretty advanced code that's being ported to a shader, and the only way to do it is by creating an array of e.g. ints and inserting values into it in both the original and the shader code to see where they diverge. It's not the most difficult, but it's quite time consuming.
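For what it's worth, the host-side half of that technique is simple. Here is a minimal sketch in JavaScript, assuming you can read both debug arrays back as plain numbers; the function and variable names (firstDivergence, cpuTrace, gpuTrace) are made up for illustration:

    // Compare the trace written by the original code with the one read back
    // from the shader's debug buffer, and report the first index where they diverge.
    function firstDivergence(cpuTrace, gpuTrace, epsilon = 0) {
      const n = Math.min(cpuTrace.length, gpuTrace.length);
      for (let i = 0; i < n; i++) {
        if (Math.abs(cpuTrace[i] - gpuTrace[i]) > epsilon) {
          return { index: i, cpu: cpuTrace[i], gpu: gpuTrace[i] };
        }
      }
      // Differing lengths is itself a divergence worth flagging.
      if (cpuTrace.length !== gpuTrace.length) {
        return { index: n, cpu: cpuTrace[n], gpu: gpuTrace[n] };
      }
      return null; // traces agree
    }

    console.log(firstDivergence([1, 2, 3, 4], [1, 2, 5, 4]));
    // -> { index: 2, cpu: 3, gpu: 5 }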
> I do it a few more times. It’s not always the 20th iteration, but it usually happens sometime between the 10th and 40th iteration. Sometimes it never happened. Okay, the bug is nondeterministic.
That’s an incorrect assumption. Just because your test case isn’t triggering the bug reliably, it does not mean the bug is nondeterministic.
That is like saying the "OpenOffice can't print on Tuesdays" bug is nondeterministic because you can't reproduce it every day. It is deterministic, you just need to find the right set of circumstances.
From the writing it appears the author found one way to reproduce the bug sometimes and then relied on it for every test. Another approach would have been to tweak their test case until they found a situation which reproduced the bug more or less often, trying to find the threshold that causes it and continuing to deduce from there.
"Deterministic" is .. something of a moveable feast. We'd generally agree that "software is deterministic in that if you provide the same inputs to the same executable machine code it will return the same value", which is nearly always true unless someone is irradiating your processor or trying to voltage-glitch it.
But there's a lot hidden in "same inputs", because that includes everything that's an input to your program from the operating system. Which includes things like "time" (bane of reproduction), memory layout, execution scheduling order of multithreaded code, value of uninitialized memory, and so on.
> Another approach would have been to tweak their test case until they found a situation which reproduced the bug more or less often, trying to find the threshold that causes it and continuing to deduce from there.
Yes - when dealing with unknowns in a huge problem space it can be very effective to play hotter-colder and climb up the hill.
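A minimal sketch of what that hotter-colder loop can look like in JavaScript; runRepro here is a hypothetical stand-in for "load the editor and drive the test case for n iterations", and the counts are arbitrary:

    // Hypothetical stand-in for the real repro; return true if the bug triggered.
    async function runRepro(iterations) {
      // ... drive the actual test case for `iterations` iterations here ...
      return false;
    }

    // Estimate how often the bug reproduces at a given iteration count.
    async function failureRate(iterations, trials = 20) {
      let failures = 0;
      for (let t = 0; t < trials; t++) {
        if (await runRepro(iterations)) failures++;
      }
      return failures / trials;
    }

    // Walk the parameter up and watch where the failure rate starts to climb.
    (async () => {
      for (const n of [5, 10, 20, 40, 80]) {
        console.log(`${n} iterations: ${(await failureRate(n)) * 100}% failures`);
      }
    })();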
If I understood correctly, the Math.abs() value would be positive roughly half the time, regardless of the steps taken to get there. That seems definitively nondeterministic.
You don’t call Math.abs() on its own, you need to give it a number. Regardless of whether that number is positive or negative, it should always return a positive value (that’s what an absolute value is). The issue here is that it was returning a negative number when given a negative value, which is wrong:
> We rerun the repro. We look at the logged value. Math.abs() is returning negative values for negative inputs. We reload and run it again. Math.abs() is returning negative values for negative inputs. We reload and run it again. Math.abs() is returning negative values for negative inputs.
Regardless, that is beside the point. I was not arguing either way whether this was a deterministic bug; I was pointing out that the author’s conclusion does not follow from the premise. Even if the bug had turned out to be nondeterministic, they had not done the necessary steps to confidently make that assertion. There is a chasm of difference between “this bug is nondeterministic” and “I haven’t yet determined the conditions that reproduce this bug”.
My interpretation was that Math.abs() was replaced with the identity function (i.e. just returning the original value). But it's only replaced if the code is determined to be a hot spot, so it would work correctly until the code was in a tight loop, then start failing once it was passed a negative number.
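A toy illustration of that failure mode, purely hypothetical and not the actual Docs or V8 code, just what "abs becomes identity once the loop is hot" would look like from the outside:

    // In a correct engine this never logs. If an optimizing compiler wrongly
    // specialized Math.abs to the identity function once the loop got hot,
    // the log would only start appearing after many iterations, which is
    // exactly the "looks nondeterministic" behaviour described in the post.
    function step(offset) {
      const magnitude = Math.abs(offset);
      if (offset < 0 && magnitude < 0) {
        console.log(`Math.abs(${offset}) returned ${magnitude} on a hot path`);
      }
      return magnitude;
    }

    for (let i = 1; i <= 100000; i++) {
      step(i % 2 === 0 ? -i : i); // alternate negative and positive inputs
    }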
> Then we called in our Tech Lead / Manager, who had a reputation of being a human JavaScript compiler. We explained how we got here, that Math.abs() is returning negative values, and whether she could find anything that we were doing wrong. After persuading her that we weren’t somehow horribly mistaken, she sat down and looked at the code. Her CPU spun up to 100%, and she was muttering in Russian about parse trees or something while staring at the code and typing into the debug console. Finally she leaned back and declared that Math.abs() was definitely returning negative values for negative inputs.
And somewhere out there is a person reading this post and coming to the conclusion "How can Google be stupid enough to hire people stupid enough to have abs() return a negative value."
Love the story! There is so much complexity in the world around us that seemingly obviously wrong things happen through the most unlikely chains of dependency.
> And somewhere out there is a person reading this post and coming to the conclusion "How can Google be stupid enough to hire people stupid enough to have abs() return a negative value."
Weird things can happen anywhere, but I wonder why this issue wasn't caught by test cases before it escaped to production. I would think that a compiler team would have low-level tests for such common functions.
So far my record is 3 weeks. It was a heisenbug triggered when two different eBPF-based systems raced with each other. eBPF is a great tool in the right place, but is it ever a pain in the ass to debug.
The fix ended up being one character: changing the priority of an eBPF tc filter from 0 to 1.
When I was 12 I was just learning stuff and wrote something in C which crashed at unpredictable intervals, and I could not explain it. I took it to my 14-year-old uncle, who was better than me at coding, for help. Mind you, this was ~40 years ago, but I seem to remember that Borland Turbo C (I still love that IDE blue color) had debugging with breakpoints (mind blowing!), which eventually led to "duh, you didn't dispose of your pointer and are reusing it, and the memory there is now garbage" or something like that. I vaguely recall * or * being somewhere nearby. This was my first intro to RTFM and debugging, and what a powerful intro.
Worst debugging issues are always things I can't access directly, on top of being rare.
Think of a network appliance in the middle that doesn't log, or not at the level you need (and sometimes it can't log what you need).
Those usually mean that no reproduction is possible, except in production or very close to it, with tools you don't always control.
Annoying ones are the "this HTTP request is sometimes slow" cases, where chasing each box in the middle reveals a new box that is supposed to be transparent but isn't, or some rare timing issue due to boxes interacting in a funny way.
I've told my personal worst here a couple of times. So this time I'm going to talk about a co-worker named Ed.
On an embedded system, we had this bug that we couldn't find. It was around for a month or two. Random crashes that we couldn't reproduce, couldn't even debug. We started calling it "the phantom".
Finally Ed said, "I think the phantom showed up after we made that change to the ethernet driver." We reverted it, and the bug disappeared.
We never found the bug in the source code. But Ed debugged it using the calendar.
> What can I even do from here as the newsletter author? Normally I like finding a teachable lesson. But it was 2 days of grueling debugging and somehow there aren’t any teachable lessons there.
A lesson to learn seems obvious to me: the V8 team did not communicate sufficiently upfront along the lines of "oops, our Math.abs() may return negative numbers; we fixed that in version X, be warned".
Which the V8 team should be able to do in an "advisory for Google developers who work on high-performance client-side view rendering stuff" sort of weekly newsletter.
I often read these stories about hard-to-debug problems because I enjoy debugging (call it a love for software true crime), and this is the first one I've read where I had an “oh god no” reaction when the author described where they needed to look for the culprit. The description of the layout engine and all of the browser-specific tweaks makes it sound like an absolutely tedious nightmare to debug.
It's amazing how often it happens in large companies that different people from different organizations are troubleshooting or fixing the same fault, independently of each other, without even knowing. Sometimes you don't even realize until you've implemented a fix which causes a merge conflict with the fix that someone else is working on.
I suppose the Google Docs team initially thought this would surely be a bug in their own code, not in Chrome or V8, even though bisecting their own code wasn't going to help. Nobody really begins to debug by blaming the compiler.
> It didn’t correspond to a Google Docs release. The stack trace added very little information. There wasn’t an associated spike in user complaints, so we weren’t even sure it was really happening — but if it was happening it would be really bad. It was Chrome-only starting at a specific release.
That sounds like a Chrome bug. Or, at least, a bug triggered by a change in Chrome. Bisecting your code when their change reveals a crash is folly, regardless of whose bug it is.
If your job is to solve the situation, your best hope is to figure out what change caused it; understand that change; and then do whatever needs to be done.
In a large complicated application where a change to the environment revealed a crash, finding out what changed in the environment and thinking about how that affects the application makes a lot more sense than going back through application changes to see if you can find it that way.
Once you figure out what the problem is, sure, you can probably fix it in either the application or the environment, and fixing the application is often easier if the environment is Chrome. But "Chrome changed and my app is broken" means look at the changes in Chrome and work from there.
«Math.abs() is returning negative values for negative inputs.» Man, I would have reached for the Bible if that happened to me. Fascinating in hindsight.
My hardest bug to debug was related to broken drivers and a useless vendor. In total I spent around 2 months on and off trying to chase that one, and by the end was starting to go crazy.
A new customer comes in and we deploy a new VMware vSphere private cloud platform for them (our first using this type of hardware). Nothing special or too fancy, but the first one with 10G production networking.
After a few weeks, integration team complains that a random VM stopped being able to communicate with another VM, but only one other specific VM. Moving the "broken" VM to a different ESXi fixed things, so we suspected a bad cable/connection/port/switch. Various tests turned up nothing, so we just waited for something to happen again.
A few days later, same thing. Some more debugging, packet captures, nothing. Rebooting the ESXi fixed the issue, so it was probably not the cables/switch. A support ticket was opened at VMware for them to throw all sorts of useless "advice" at us (update drivers, firmware, OS, etc. etc.).
This kept happening more and more; at some point there were multiple daily occurrences - again, just specific VMs unable to reach other specific VMs, while we could always SSH in and communicate with everything else - and we had to reboot the hypervisor each time to fix it. VMware were completely and utterly useless, even with all the logs, timelines, etc.
A few weeks in, the customer is getting pissed. We say that we've tried all sorts of debugging of everything (packet captures on the ESXi, switch stuff, in the guest OSes, etc. etc.), and there's no rhyme or reason to it - all sorts of VMs, of different virtual hardware versions, on different guest OSes, different virtual NIC types, different ESXi hosts - and that we're trying things with the vendor, it probably being a software bug.
One morning I decided to just go and read all of the logs on one of the ESXi hosts, trying to see if I could spot something weird (early on we had tried grepping for errors and warnings, which yielded just VMware vomit and nothing of use). There were too many of them, and I didn't see anything. In desperation, I Googled various combinations of "vmware" "nic type" "network issues", and boom, I stumbled upon Intel forums with months of people complaining that the Intel X710 NIC's drivers were broken, threw a "Malicious Driver Detected" message (not error) in the logs, and just shut down traffic on that specific port. And what do you know, those were the NICs we were using, and we had those messages. The piece of shit driver had been known not to work for months (there was either that, or it crashing the whole machine), but it was proudly sitting on VMware's compatibility list. When I told VMware's support about it, they said they were aware internally, but refused to remove it from the compatibility list. But if we upgraded to the beta release of the next major vSphere, there was a newer driver that supposedly fixed everything. We did that and everything was finally fixed, but there were machines with similar issues where the driver wasn't updated for years after that.
This is the event that taught me that enterprise vendors don't know that much even about their own software, VMware's support is useless, hardware compatibility lists are also useless. So you actually need to know what you're doing and can't rely on support saving you.
Excellent post. I think the lesson is a good one: it's better to have fewer bugs than more, and even so, for some users the product would still have had an annoying bug.
> Next, the reproduction was slow. It took probably 20 seconds just to load the dev version of the editor, and another 40 seconds to trigger the issue.
60 seconds to reproduce? Slow!? Laughs in enterprise software
The worst bugs I've ever dealt with were a result of working at a company which was using the Clarion programming language.
The language compiler was most likely written by someone who had never read a book about compilation; it was basically as if someone had written a compiler using macros. I don't think it had anything like an optimisation pass. This, combined with it being a higher-level language, meant that debugging with a debugger was just infeasible. Even if you had figured out the issue, you wouldn't know what exactly caused it on the code side, as most lines of code would get turned into pages of assembly. Not only that, I believe the format for the debug symbols was custom, so line number information was something you would only get if you used the terrible debugger which shipped with the language. Windows is also a terrible development environment due to the incredible lack of any good documentation for almost anything at the WinAPI level.
The applications I was working on were multi-threaded Windows applications. Concurrency issues were everywhere. Troubleshooting them sometimes took months. In many cases the fixes made absolutely no sense.
The IDE (which you were basically forced to use) was incessantly buggy. You could reliably crash it in many contexts by simply clicking too fast. After 5 years of working with that tooling, I had gained an intuition for where I needed to slow down my clicks to prevent a crash.
The IDE also operated on binary blobs which encapsulated the entire project. I never put in the time to investigate the format of these blobs but, unsurprisingly given the quality of the IDE, it was possible to put these opaque blobs in erroneous states. Your only real option was to revert to a previous version of the blob and copy-paste all your work back in (there was no way of easily accessing the raw text in the IDE because of an idiotically designed templating feature which was used throughout). If your project was in a weird state, you would get mystery compiler errors with a 32-bit integer printed as hex as an error identifier.
Searching the documentation or the internet for these numbers would either produce no results or would produce forum or comp.lang.clarion results for dozens of unrelated issues.
The language itself was an insane variation of Pascal and/or COBOL. It had some nice database-related features (as it was effectively domain-specific to CRUD), but that was about it. On GitHub these days you can see people discussing the soundness and ergonomics issues of the never type in Rust for many months before even considering partially stabilising it. Meanwhile in Clarion, you got a half-arsedly written documentation page serving as the language specification, and out of it came a half-baked feature which didn't work half the time. The documentation would often have duplicate pages for some features which provided non-overlapping, sometimes conflicting or just outright wrong information.
When dealing with the WinAPI you would need to deal with pointer types, and sometimes you would need to do pointer type conversions. The language wouldn't let you just do something like `void *p = &foo;` (this is C, actually very sane compared to Clarion). You had to do the language equivalent of `void *p = 1 ? &foo : NULL;`, which magically lost enough type information for the language to let you do it. There was no documented alternative (there was casting, it just didn't work in this case); the trick itself wasn't documented either and was purely a result of frustration and trial and error.
Not only that, the people I was working with had all come to this terrible proprietary language (oh wait, did I mention you had to pay for a license for this shit?) at a time when you were otherwise writing pure WinAPI code in C or C++. So for them, the fact that it had a forms editor was so amazing that they literally never considered, for the next 25 years, looking at alternative options. So when I complained about the complete insanity of using this ridiculous language, I would get told that the alternatives were worse.
Do you want to experience living hell when debugging? Find a company writing Clarion; apparently it's still popular in the US government.