I've made this statement a bunch in other mediums: The reason AI software is alw...

mondrian · 2024-08-20T00:43:18 1724114598

Recent work from Meta uses AI to automatically increase test coverage with zero human checking of AI outputs. They do this with a strong oracle for AI outputs: whether the AI-generated test compiles, runs, and hits yet-unhit lines of code in the tested codebase.

We probably need a lot more work along this dimension of finding use cases where strong automatic verification of AI outputs is possible.

jamescostian · 2024-08-20T02:17:32 1724120252

> with zero human checking of AI outputs

It can be hard enough for humans to just look at some (already consistently passing) tests and think, "is X actually the expected behavior or should it have been Y instead?"

I think you should have a look at the abstract, especially this quote:

> 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers

This tool sounds awesome in that it generated real tests that engineers liked! "zero human checking of AI outputs" is very different though, and "this test passes" is very different from "this is a good test"

mondrian · 2024-08-21T06:29:57 1724221797

Good points regarding test quality. One takeaway for me from this paper is that you can increase code coverage with LLMs without any human checking of LLM outputs, because it’s easy to make a fully automated checker. Pure coverage may not be super-interesting but it’s still fairly interesting and nontrivial. LLM-based applications that run fully autonomously without bubbling hallucinations up to users seem elusive but this is an example.

nforgerit · 2024-08-20T00:14:51 1724112891

You hit the nail. It's been almost tragically funny how people frantically tried to juggle 5 bars of wet soap in recent 2 years solving problems that (from what I've seen so far) have been already solved in a (boring) deterministic way consuming much less resources.

Going further, our predecessors put so much work into getting non-deterministic electronics together providing us with a stable and _correct_ platform, it looks ridiculous how people were trying to squeeze another layer of non-determinism in between to solve the same classes of problems.

diatone · 2024-08-20T02:25:37 1724120737

The irony here is that there are many domains using statistical methods, that bound the complexity and failure modes of statistical methods successfully. A lot of people struggle with statistics but in domains where the glove fits I think AI will slot in all across the stack really nicely.

loa_in_ · 2024-08-20T00:03:24 1724112204

But software works only 99% of the time. For some definition of work: 99% of days it's run, 99% of clicks, 99% of CPU time in given component, 99% of versions released and linked into some business' production binary, 99% of github tags, 99% of commits, 99% of software that that one guy says is battle-tested

jjmarr · 2024-08-20T01:09:17 1724116157

If twenty components work 99% of the time, then they only have an 0.99^20 = 82% chance of working as a collective.

If your 5.1 GHz (billion instructions per second) CPU had a 0.00000001% chance of failing at a given instruction, you'd have a 40% chance of a crash every second.

If a flight had a 1% chance of killing everyone aboard 10 million people/day * 1% = 100,000 people would die every day from a plane.

diatone · 2024-08-20T02:27:14 1724120834

Gamblers fallacy

Groxx · 2024-08-20T00:09:14 1724112554

Software works so much more than 99% of the time that it's a rather deliberate strawman to claim otherwise.

Newly-"AI"-branded things that I have touched work substantially less than 90% of the time. There are like 3 orders of magnitude difference, even people who aren't paying any attention at all are noticing it.

BobbyJo · 2024-08-20T00:22:12 1724113332

Do you have to write your code presuming that sometimes 'a + b' will be wrong? I don't.

Software pretty much always "works" when you consider the definition of work to be "does what the programmer told it too". AI? Not so much.

wruza · 2024-08-20T00:57:07 1724115427

It’s all about limits and edge cases. a+b may “fail” at INT_MAX and at 0.1+0.2. You don’t `==` your doubles, you don’t (a+b)/2 your mid, and you don’t ask ai to just book you vacation. You ask it to “collect average sentiment from `these_5k_reviews()` ignoring apparently fake ones, which are defined as <…>”. You don’t care about determinism because it’s a statistical instrument.

BobbyJo · 2024-08-20T01:05:16 1724115916

> and you don’t ask ai to just book you vacation. You ask it to “collect average sentiment from `these_5k_reviews()` ignoring apparently fake ones, which are defined as <…>”.

That's exactly my point. You have to interact directly with the A.I. and be aware of what its doing.

simonw · 2024-08-20T00:25:34 1724113534

That's not true. If software works correctly today then users can expect it to work correctly tomorrow. If it doesn't work any more that's a bug.