
If I understood the graphs correctly, it only achieves 20% pass rate on their internal tests. So I have to wait 30min and pay a lot of money just to sift through walls of most likely incorrect text? Unless the possibility of hallucinations is negligible, this is just way too much content to review at once. The process probably needs to be a lot more iterative.





Here's an example of the type of question it is achieving 20% on:

The set of natural transformations between two functors F, G : C → D can be expressed as the end Nat(F,G) ≅ ∫_A Hom_D(F(A), G(A)).

Define the set of natural cotransformations from F to G to be the coend CoNat(F,G) ≅ ∫^A Hom_D(F(A), G(A)).

Let:
- F = B•(Σ₄)*/ be the under-∞-category of the nerve of the delooping of the symmetric group Σ₄ on 4 letters under the unique 0-simplex * of B•Σ₄.
- G = B•(Σ₇)*/ be the under-∞-category of the nerve of the delooping of the symmetric group Σ₇ on 7 letters under the unique 0-simplex * of B•Σ₇.

How many natural cotransformations are there between F and G?


As someone who doesn't understand anything beyond the word 'set' in that question, can anyone give an indication of how hard of a problem that actually is (within that domain)?

Also I'm curious as to what percentage of the questions in this benchmark are of this type / difficulty, vs the seemingly much easier example of "In Greek mythology, who was Jason's maternal great-grandfather?".

I'd imagine the latter is much easier for an LLM, and almost trivial for any LLM with access to external sources (such as deep research).


Btw, isn't this question at least really badly worded (and maybe incorrect)? The definitions they give for F and G are categories, not functors... (and both categories in fact have one object with a contractible space of morphisms...)

That's easy Dave: 42.

Do we actually know whether it got this specific example right? It got 20% on HLE, but I think a few questions are quite a bit easier.

It's very interesting to think about what kind of "mental model" it might have, if it's capable of "understanding" all this (to me) gibberish but is then unable to actually work the problem.

26.6% on humanity's last exam is actually impressive.

pass rate really only matters in context of the difficulty of the tasks


Only if you are asking questions at the level of a cutting edge benchmark

This is one of the actual questions:

> In Greek mythology, who was Jason's maternal great-grandfather?

https://www.google.com/search?q=In+Greek+mythology%2C+who+wa...


Did you intentionally flip through all the questions to find the one that seemed the easiest? If so, why? That's question #7, and the other 7 questions in the sample set seem ridiculously difficult to me.

In Greek mythology, Jason's maternal great-grandfather was Einstein.

This is a hard question for language models since it targets one of their known weaknesses.

Greek mythology? But seriously please elaborate for my less educated self.

It tests syllogistic reasoning: Jason's mother was Tyro, whose father was Poseidon, whose father was Kronos. It also tests whether it "eagerly" rather than comprehensively considers something: a maternal great-grandfather could be the father of either one's maternal grandmother or maternal grandfather, so the answer could also be king Aeolus of the Etruscans.

Ideally a model would be able to answer this accurately and completely.


I think there are more possible answers? Jason's mother differs depending on the author...

For example, Jason's mother was Philonis, daughter of Mestra, daughter of Daedalion, son of Hesporos. So Jason's maternal great-grandfather was Hesporos.


LLMs often don't do well on tasks that require composition into smaller subtasks. In this case there is a chain of relations that depend on the previous result.
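The chain-of-relations point can be made concrete with a toy sketch. The genealogy data below follows one traditional version mentioned elsewhere in the thread (sources disagree, so treat the names as illustrative only); the point is that each lookup depends on the previous one, which is exactly the kind of composition being discussed.

```python
# Toy illustration: "maternal great-grandfather" decomposes into a
# chain of single-step lookups, each depending on the previous result.
# Data is one traditional version of the myth, for illustration only.
parents = {
    "Jason": {"mother": "Tyro"},
    "Tyro": {"father": "Poseidon"},
    "Poseidon": {"father": "Kronos"},
}

def follow(person, *relations):
    """Walk a chain of relations, failing fast if any link is missing."""
    for rel in relations:
        person = parents.get(person, {}).get(rel)
        if person is None:
            return None
    return person

# maternal great-grandfather = mother -> father -> father
print(follow("Jason", "mother", "father", "father"))  # -> Kronos
```

A model that answers "eagerly" is, in effect, skipping steps of this walk and pattern-matching to a name instead of composing the intermediate results.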

Users don’t care about how hard something is for LLMs if they receive incorrect output.

It's categorically more than a weakness.

No it is not an actual question on this exam. From the paper: “To ensure question quality and integrity, we enforce strict submission criteria. Questions should be precise, unambiguous, solvable, and non-searchable, ensuring models cannot rely on memorization or simple retrieval methods. All submissions must be original work or non-trivial syntheses of published information, though contributions from unpublished research are acceptable. Questions typically require graduate-level expertise or test knowledge of highly specific topics (e.g., precise historical details, trivia, local customs) and have specific, unambiguous answers…”. (Emphasis mine)

It's example #7 on https://lastexam.ai/

This is an example of the submitted questions. Because it is possible to search it on the web, it is not an example of the accepted questions.

I am selling a bridge, it is a great bargain.

Maybe. Not enough data to say. Say it does a day's worth of work in a query. It is sensible to use if it takes less than a day to review ~5 days' worth of work. I don't know if we're near that threshold yet, but conceptually this would work well for actual research, where the amount of preparation is large compared to the amount of output written.
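The break-even reasoning above can be sketched with made-up numbers (the pass rate is the benchmark figure cited upthread; the work and review times are pure assumptions):

```python
# Rough break-even sketch with assumed numbers: each query produces
# work_days of output, succeeds with probability pass_rate, and a
# human spends review_days checking each attempt (pass or fail).
pass_rate = 0.20     # cited benchmark figure
work_days = 1.0      # assumed: one query ~ one day of human work
review_days = 0.15   # assumed review cost per attempt

attempts_per_success = 1 / pass_rate                # ~5 attempts
review_cost = attempts_per_success * review_days    # total review days

# Worth using if reviewing all attempts costs less than doing the work.
worth_it = review_cost < work_days
print(attempts_per_success, review_cost, worth_it)
```

At a 20% pass rate, you review roughly five attempts per usable result, which is where the "~5 days' worth of work per day of review" framing comes from.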

And eyeballing the benchmarks, it'll probably reach a >50% rate per query by the end of the year. Seems to double every model or two.


On questions even specialists in that field can’t answer correctly.

Yeah it can be more iterative. Just use individual queries and build on it yourself. This is all this is doing. It's a trick, and OpenAI is a PR hype company at this stage.

I mean you want it to grill your steak and eat it for you too?

I mean I too can complain that my iPhone doesn’t automatically screen out spammers and send my mom flowers on Mother’s Day.


Why doesn't the iPhone screen spammers yet? Pixel has had this feature for a decade.

Pixel hasn’t even been around for a decade.

The Pixel branding is 12 years old, and IIRC this feature also existed in Nexus before that.

Haha, are you referring to the Chromebook Pixel? How is that relevant to stopping spam calls?

Pixel phone launched in 2016.


The difference is that it takes a few minutes to an hour at most, so it can be run multiple times a day, using the results of previous runs to further refine the search and reasoning process and get better outcomes. Pretty much how any human research works, but much faster and with potentially vastly more world-knowledge and reasoning capability than the average human. And these capabilities will rapidly improve with further RL.


