
If I understood the graphs correctly, it only achieves 20% pass rate on their internal tests. So I have to wait 30min and pay a lot of money just to sift through walls of most likely incorrect text? Unless the possibility of hallucinations is negligible, this is just way too much content to review at once. The process probably needs to be a lot more iterative.





Here's an example of the type of question it is achieving 20% on:

The set of natural transformations between two functors F, G : C → D can be expressed as the end Nat(F,G) ≅ ∫_A Hom_D(F(A), G(A)).

Define the set of natural cotransformations from F to G to be the coend CoNat(F,G) ≅ ∫^A Hom_D(F(A), G(A)).

Let:
- F = B•(Σ₄)*/ be the under-∞-category of the nerve of the delooping of the symmetric group Σ₄ on 4 letters under the unique 0-simplex * of B•Σ₄.
- G = B•(Σ₇)*/ be the under-∞-category of the nerve of the delooping of the symmetric group Σ₇ on 7 letters under the unique 0-simplex * of B•Σ₇.

How many natural cotransformations are there between F and G?


As someone who doesn't understand anything beyond the word 'set' in that question, can anyone give an indication of how hard of a problem that actually is (within that domain)?

Also I'm curious as to what percentage of the questions in this benchmark are of this type / difficulty, vs the seemingly much easier example of "In Greek mythology, who was Jason's maternal great-grandfather?".

I'd imagine the latter is much easier for an LLM, and almost trivial for any LLM with access to external sources (such as deep research).


Btw, isn't this question at least really badly worded (and maybe incorrect)? The definitions they give for F and G are categories, not functors... (and both categories in fact have one object with a contractible space of morphisms...)

That's easy Dave: 42.

Do we actually know whether it got this specific example right? It got 20% on HLE, but I think a few questions are quite a bit easier.

It's very interesting to think about what kind of "mental model" it might have, if it's capable of "understanding" all this (to me) gibberish but is then unable to actually work the problem.

26.6% on humanity's last exam is actually impressive.

pass rate really only matters in context of the difficulty of the tasks


Only if you are asking questions at the level of a cutting edge benchmark

This is one of the actual questions:

> In Greek mythology, who was Jason's maternal great-grandfather?

https://www.google.com/search?q=In+Greek+mythology%2C+who+wa...


Did you intentionally flip through all the questions to find the one that seemed the easiest? If so, why? That's question #7, and the other 7 questions in the sample set seem ridiculously difficult to me.

In Greek mythology, Jason's maternal great-grandfather was Einstein.

This is a hard question for language models since it targets one of their known weaknesses.

Greek mythology? But seriously please elaborate for my less educated self.

It tests syllogistic reasoning: Jason's mother was Tyro, whose father was Poseidon, whose father was Kronos. It also tests whether it "eagerly" rather than comprehensively considers something: a maternal great-grandfather could be the father of either one's maternal grandmother or maternal grandfather, so the answer could also be king Aeolus of the Etruscans.

Ideally a model would be able to answer this accurately and completely.


I think there are more possible answers? Jason's mother differs depending on the author...

For example, Jason's mother was Philonis, daughter of Mestra, daughter of Daedalion, son of Hesporos. So Jason's maternal great-grandfather was Hesporos.


LLMs often don't do well on tasks that require composition into smaller subtasks. In this case there is a chain of relations that depend on the previous result.
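The chain-of-relations point can be made concrete with a toy sketch. The genealogy data below follows one traditional version mentioned elsewhere in the thread (sources disagree, so treat the names as illustrative only); the point is that each lookup depends on the previous one, which is exactly the kind of composition being discussed.

```python
# Toy illustration: "maternal great-grandfather" decomposes into a
# chain of single-step lookups, each depending on the previous result.
# Data is one traditional version of the myth, for illustration only.
parents = {
    "Jason": {"mother": "Tyro"},
    "Tyro": {"father": "Poseidon"},
    "Poseidon": {"father": "Kronos"},
}

def follow(person, *relations):
    """Walk a chain of relations, failing fast if any link is missing."""
    for rel in relations:
        person = parents.get(person, {}).get(rel)
        if person is None:
            return None
    return person

# maternal great-grandfather = mother -> father -> father
print(follow("Jason", "mother", "father", "father"))  # -> Kronos
```

A model that answers "eagerly" is, in effect, skipping steps of this walk and pattern-matching to a name instead of composing the intermediate results.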

Users don’t care about how hard something is for LLMs if they receive incorrect output.

It's categorically more than a weakness.

No it is not an actual question on this exam. From the paper: “To ensure question quality and integrity, we enforce strict submission criteria. Questions should be precise, unambiguous, solvable, and non-searchable, ensuring models cannot rely on memorization or simple retrieval methods. All submissions must be original work or non-trivial syntheses of published information, though contributions from unpublished research are acceptable. Questions typically require graduate-level expertise or test knowledge of highly specific topics (e.g., precise historical details, trivia, local customs) and have specific, unambiguous answers…”. (Emphasis mine)

It's example #7 on https://lastexam.ai/

This is an example of the submitted questions. Because it is possible to search it on the web, it is not an example of the accepted questions.

I am selling a bridge, it is a great bargain.

Maybe. Not enough data to say. Say it does a day's worth of work in a query. It is sensible to use if it takes less than a day to review ~5 days' worth of work. I don't know if we're near that threshold yet, but conceptually this would work well for actual research, where the amount of preparation is large compared to the amount of output written.
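The break-even reasoning above can be sketched with made-up numbers (the pass rate is the benchmark figure cited upthread; the work and review times are pure assumptions):

```python
# Rough break-even sketch with assumed numbers: each query produces
# work_days of output, succeeds with probability pass_rate, and a
# human spends review_days checking each attempt (pass or fail).
pass_rate = 0.20     # cited benchmark figure
work_days = 1.0      # assumed: one query ~ one day of human work
review_days = 0.15   # assumed review cost per attempt

attempts_per_success = 1 / pass_rate                # ~5 attempts
review_cost = attempts_per_success * review_days    # total review days

# Worth using if reviewing all attempts costs less than doing the work.
worth_it = review_cost < work_days
print(attempts_per_success, review_cost, worth_it)
```

At a 20% pass rate, you review roughly five attempts per usable result, which is where the "~5 days' worth of work per day of review" framing comes from.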

And eyeballing the benchmarks, it'll probably reach a >50% rate per query by the end of the year. Seems to double every model or two.


On questions even specialists in that field can’t answer correctly.

Yeah it can be more iterative. Just use individual queries and build on it yourself. This is all this is doing. It's a trick, and OpenAI is a PR hype company at this stage.

I mean you want it to grill your steak and eat it for you too?

I mean I too can complain that my iPhone doesn’t automatically screen out spammers and send my mom flowers on Mother’s Day.


Why doesn't the iPhone screen spammers yet? Pixel has had this feature for a decade.

Pixel hasn’t even been around for a decade.

The Pixel branding is 12 years old, and IIRC this feature also existed in Nexus before that.

Haha, are you referring to the Chromebook Pixel? How is that relevant to stopping spam calls?

Pixel phone launched in 2016.


The difference is that it takes a few minutes to an hour at most, so it can be run multiple times a day, using the results of previous runs to further refine the search and reasoning process and get better outcomes. Pretty much how any human research works, but much faster and with potentially vastly more world-knowledge and reasoning capability than the average human. And these capabilities will rapidly improve with further RL.


