If I understood the graphs correctly, it only achieves a 20% pass rate on their internal tests. So I have to wait 30 minutes and pay a lot of money just to sift through walls of most likely incorrect text?
Unless the possibility of hallucinations is negligible, this is just way too much content to review at once. The process probably needs to be a lot more iterative.
Here's an example of the type of question it is achieving 20% on:
The set of natural transformations between two functors F, G : C → D can be expressed as the end

    Nat(F,G) ≅ ∫_A Hom_D(F(A), G(A)).
Define the set of natural cotransformations from F to G to be the coend

    CoNat(F,G) ≅ ∫^A Hom_D(F(A), G(A)).
Let:

- F = B∙(Σ4)∗/ be the under ∞-category of the nerve of the delooping of the symmetric group Σ4 on 4 letters under the unique 0-simplex ∗ of B∙Σ4.
- G = B∙(Σ7)∗/ be the under ∞-category of the nerve of the delooping of the symmetric group Σ7 on 7 letters under the unique 0-simplex ∗ of B∙Σ7.

How many natural cotransformations are there between F and G?
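Since the plain-text paste mangles the sub- and superscripts, here are the two definitions in LaTeX (reconstructed; the subscript on the end and the superscript on the coend follow the standard convention, which the paste lost):

    \mathrm{Nat}(F,G)   \;\cong\; \int_{A} \mathrm{Hom}_{\mathcal{D}}(F(A),\, G(A))
    \mathrm{CoNat}(F,G) \;\cong\; \int^{A} \mathrm{Hom}_{\mathcal{D}}(F(A),\, G(A))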
As someone who doesn't understand anything beyond the word 'set' in that question, can anyone give an indication of how hard that problem actually is (within its domain)?
Also I'm curious as to what percentage of the questions in this benchmark are of this type / difficulty, vs the seemingly much easier example of "In Greek mythology, who was Jason's maternal great-grandfather?".
I'd imagine the latter is much easier for an LLM, and almost trivial for any LLM with access to external sources (such as deep research).
btw, isn't this question at least really badly worded (and maybe incorrect)? The definitions they give for F and G are categories, not functors... (and both categories are in fact one object with a contractible space of morphisms...)
It's very interesting to think about what kind of "mental model" it might have, if it's capable of "understanding" all this (to me) gibberish but is then unable to actually work the problem.
Did you intentionally flip through all the questions to find the one that seemed easiest? If so, why? That's question #7, and the other 7 questions in the sample set seem ridiculously difficult to me.
it tests syllogistic reasoning: Jason's mother was Tyro, whose father was Poseidon, whose father was Kronos. it also tests whether the model considers the question "eagerly" rather than comprehensively: a maternal great-grandfather could be the father of either one's maternal grandmother or one's maternal grandfather, so the answer could also be king Aeolus of the Etruscans.
ideally a model would be able to answer this accurately and completely.
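a minimal sketch of that chain as a multi-hop lookup (the genealogy data is illustrative, taken from the claims above, and only the grandfather-side path is populated):

    # illustrative genealogy from the claims above -- sources disagree,
    # which is exactly the ambiguity being tested
    mother = {"Jason": "Tyro"}
    father = {"Tyro": "Poseidon", "Poseidon": "Kronos"}

    def maternal_great_grandfathers(person):
        # both compositions: father of the maternal grandfather,
        # and father of the maternal grandmother
        results = set()
        m = mother.get(person)
        if m is None:
            return results
        for grandparent in (father.get(m), mother.get(m)):
            ggf = father.get(grandparent)
            if ggf is not None:
                results.add(ggf)
        return results

    print(maternal_great_grandfathers("Jason"))  # {'Kronos'} with this data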
I think there are more possible answers? Jason's mother differs depending on the author...
For example, Jason's mother was Philonis, daughter of Mestra, daughter of Daedalion, son of Hesperos. By that chain, Jason's maternal great-grandfather was Daedalion.
LLMs often don't do well on tasks that require decomposition into smaller subtasks. In this case there is a chain of relations, each depending on the previous result.
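A rough way to see why such chains are hard (the per-hop accuracy here is a made-up number, purely for illustration): if each relation is resolved correctly with probability p, a k-hop chain succeeds end-to-end with probability p^k.

    # hypothetical: per-hop accuracy compounds over a chain of lookups
    p = 0.9  # assumed chance of resolving any single relation correctly
    for k in (1, 2, 3, 5):
        print(f"{k}-hop chain: {p ** k:.0%} end-to-end")  # 90%, 81%, 73%, 59%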
No, it is not an actual question on this exam. From the paper: “To ensure question quality and integrity, we enforce strict submission criteria. Questions should be precise, unambiguous, solvable, and non-searchable, ensuring models cannot rely on memorization or simple retrieval methods. All submissions must be original work or non-trivial syntheses of published information, though contributions from unpublished research are acceptable. Questions typically require graduate-level expertise or test knowledge of highly specific topics (e.g., precise historical details, trivia, local customs) and have specific, unambiguous answers…”. (Emphasis mine)
Maybe. Not enough data to say. Say it does a day's worth of work in a query. It is sensible to use if it takes less than a day to review ~5 days' worth of work. I don't know if we're near that threshold yet, but conceptually this would work well for actual research, where the amount of preparation is large compared to the amount of output written.
And eyeballing the benchmarks, it'll probably reach a >50% pass rate per query by the end of the year. It seems to double every model or two.
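A back-of-the-envelope version of that threshold (all numbers are assumptions for illustration, not from the post):

    # assumed numbers, purely illustrative
    pass_rate = 0.20             # fraction of queries whose output is correct
    work_per_query_days = 1.0    # "a day's worth of work in a query"
    review_days_per_query = 0.1  # hypothetical time to review one query's output

    # to get one correct result you expect to run (and review) 1/pass_rate queries
    queries_per_success = 1 / pass_rate                        # ~5
    review_cost = queries_per_success * review_days_per_query  # ~0.5 days

    # sensible if reviewing ~5 days' worth of output beats doing the work yourself
    print(f"review cost per useful result: {review_cost:.1f} days")
    print(f"worth it: {review_cost < work_per_query_days}")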
Yeah, it can be more iterative. Just use individual queries and build on them yourself. That's all this is doing. It's a trick, and OpenAI is a PR hype company at this stage.
The difference is that it takes a few minutes to an hour at most, so it can be run multiple times a day, using the results of previous runs to further refine the search and reasoning process for better outcomes. That's pretty much how human research works, but much faster and with potentially vastly more world knowledge and reasoning capability than the average human. And these capabilities will rapidly improve with further RL.