If anyone else is curious about which ARC-AGI public eval puzzles o3 got right vs wrong (and its attempts at the ones it did get right), here's a quick visualization: https://arcagi-o3-viz.netlify.app
Hi! I've been working on theorem proving systems for some time now. I would love to help out with an AlphaProof reproduction, but I can't reach you on discord for some reason!
Thank you, those insights are invaluable! This is a specific and potentially dumb question and I completely understand if you can't answer it!
The practical motivation for MoEs is very clear, but I do worry about a loss of compositional abilities (which I think just emerge from superposed representations?) that some tasks may require, especially with the many-experts phenomenon we're seeing. This is an observation from smaller MoE models (with top-k gating etc.) that may or may not scale: dense models trained to the same loss tend to perform complex tasks "better".
Intuitively, do you think MoEs are just another stopgap trick we're using while we figure out more compute and better optimizers, or is there enough theoretical motivation to justify their continued use? If there isn't, perhaps we need to at least figure out "expert scaling laws" :)
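For concreteness, the kind of small MoE layer I have in mind is roughly the following (a minimal PyTorch sketch of top-k gating; the layer sizes and names are made up for illustration and not taken from any particular implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        # Minimal top-k gated mixture-of-experts feed-forward layer.
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.gate = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                       # x: (tokens, d_model)
            scores = self.gate(x)                   # (tokens, n_experts)
            topk_vals, topk_idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(topk_vals, dim=-1)  # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                idx = topk_idx[:, slot]
                w = weights[:, slot].unsqueeze(-1)
                for e in idx.unique():              # each token only visits its selected experts
                    mask = idx == e
                    out[mask] += w[mask] * self.experts[int(e)](x[mask])
            return out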
thanks for the thoughtful qtn! yeah i dont have the background for this one, you'll have to ask an MoE researcher (which Yi is not really either, as i found out on the pod). it does make sense that on a param-for-param basis MoEs would have less compositional ability, but i have yet to read a paper (mostly my fault, bc i dont read MoE papers that closely, but also researcher fault, in that they're not incentivized to be rigorous about the downsides of MoEs) that really identified which compositional abilities are actually hurt in MoEs. if you could, for example, identify subcategories of BigBench or similar that require compositional abilities, then we might be able to get hard evidence on this question. i'm not yet motivated enough to do this myself but it'd make a decent small research question.
HOWEVER i do opine that MoEs are kiiind of a stopgap (both on the pod and on https://latent.space/p/jan-2024) - definitely a validated efficiency/sparsity technique (esp see deepseek's moe work if you havent already, with >100 experts https://buttondown.email/ainews/archive/ainews-deepseek-v2-b...) but mostly a one-off boost over the single small dense-expert-equivalent model, rather than comparable to the capabilities of a large dense model of the same param count (aka i expect an 8x22B MoE to never outperform a 176B dense model ceteris paribus - which is difficult to get a like-for-like comparison on bc these things are expensive, partially because the MoE is usually just upcycled instead of trained from scratch, and partially because the routing layer is deepening every month). so perhaps to TLDR: there is more than enough evidence and practical motivation to justify their continued use (i would go so far as to say that all inference endpoints incl gpt4 and above should be MoEs) but they themselves are not really an architectural decision that matters for the next quantum leap in capabilities
I'm not sure how to quantify how quickly or well humans learn in-context (if you know of any work on this I'd love to read it!)
In general, there is too much fluff and confusion floating around about what these models are and are not capable of (regardless of the training mechanism). I think more people need to read Song Mei's lovely slides[1] and related work by others. These slides are the best exposition I've found of the neat ideas around ICL that researchers have been aware of for a while.
There has been some interesting work on distributed training, for example DiLoCo (https://arxiv.org/abs/2311.08105). I also know that Bittensor and Nous Research collaborated on some kind of competitive distributed model frankensteining-training thingy that seems to be going well. https://bittensor.org/bittensor-and-nous-research/
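My rough mental model of DiLoCo's inner/outer structure is something like the sketch below (hedged: this is my reading of the paper, not their code; the AdamW-inner / Nesterov-outer split follows the paper, but the hyperparameters and the worker.loss helper are illustrative):

    import copy
    import torch

    def diloco_round(global_model, workers, inner_steps=500, outer_lr=0.7):
        # One communication round: each worker trains locally for inner_steps,
        # then the averaged parameter delta ("pseudo-gradient") is applied to
        # the global params with an outer Nesterov-momentum step.
        start = [p.detach().clone() for p in global_model.parameters()]
        deltas = [torch.zeros_like(p) for p in start]

        for worker in workers:                      # conceptually in parallel
            local = copy.deepcopy(global_model)
            inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-4)
            for _ in range(inner_steps):
                loss = worker.loss(local)           # hypothetical helper: local batch + forward pass
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            for d, p0, p in zip(deltas, start, local.parameters()):
                d += (p0 - p.detach()) / len(workers)

        # outer step; in practice the outer optimizer (and its momentum state)
        # persists across rounds rather than being recreated each time
        outer_opt = torch.optim.SGD(global_model.parameters(), lr=outer_lr,
                                    momentum=0.9, nesterov=True)
        outer_opt.zero_grad()
        for p, d in zip(global_model.parameters(), deltas):
            p.grad = d
        outer_opt.step()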
Of course it gets harder as models get larger, but distributed training doesn't seem totally infeasible. For example, with MoE transformer models, perhaps separate slices of the model could be trained asynchronously and then combined with some retraining. You could have minimal regular communication of, say, the mean and variance for each layer, plus a new loss term dependent on these statistics to keep the "expertise" of each contributor distinct.
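To make that last idea a bit more concrete, here's the sort of auxiliary "distinctness" loss I'm imagining (purely my own sketch, nothing published; the hinge form, the margin, and the choice of statistics are arbitrary):

    import torch

    def distinctness_loss(my_layer_acts, peer_stats, margin=1.0):
        # Penalize a contributor whose per-layer activation statistics drift
        # too close to the (periodically communicated) stats of other contributors.
        # my_layer_acts: list of activation tensors, one per layer
        # peer_stats:    list of (mean, var) pairs per layer, averaged over peers
        loss = torch.tensor(0.0)
        for acts, (peer_mean, peer_var) in zip(my_layer_acts, peer_stats):
            my_mean, my_var = acts.mean(), acts.var()
            dist = (my_mean - peer_mean) ** 2 + (my_var - peer_var) ** 2
            # hinge-style repulsion: only penalize when the stats are too similar
            loss = loss + torch.clamp(margin - dist, min=0.0)
        return loss

    # total_loss = task_loss + lambda_distinct * distinctness_loss(acts, stats)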
I tried to do this ages ago in 2018 by adapting OpenAI's flow architecture, and it sort of seemed to work (it was at least promising). With today's models, which have a significantly more disentangled latent space, it should be much easier to do! I saw a transformer trained on the UK Biobank recently; excited for this space!
Wow, I had hoped for a more productive discussion than these 1-1 comparisons of Bard vs ChatGPT that I'm seeing everywhere. The model deployed with this version of Bard is clearly smaller than the biggest LaMDA/PaLM models Google has been working on for ages, which, according to their publications, show unprecedented results on _proof writing_ of all things (see Minerva). While their strategic decisions may be questionable (or they're just trying to quantize the model for mass deployment without burning billions per month in compute costs), it's almost silly to question Google's ability to build useful LLMs.
At the moment, unless we get more information about what metric you're supposed to evaluate it on, you could probably simplify the headline to just "Bard is much worse than ChatGPT" without any loss of accuracy.
It's not really realistic to expect people to give Google credit for these amazing models they have published results about but haven't let people play with - they have given people Bard and people are evaluating it based on the criteria most obvious to them - a comparison to a very similar product that has just been released.
They knew the war they were entering, they knew their enemies, they knew how they'd get evaluated and still decided to get this model out in its current state, leading to the conclusion: Yes, this is really the best they can do and it's much worse than the state of the art.
In any case, it's a massive marketing blunder; the public opinion formed within the last few hours was overwhelmingly "Bard sucks compared to ChatGPT."
>Yes, this is really the best they can do and it's much worse than the state of the art.
This is the best they can do under pressure.
ChatGPT surprised the world with how good it was, then Google scrambled to get something out quick.
A project like this is a massive undertaking; the first mover has the advantage that they can calmly refine their model until they find it presentable.
The question is whether what Google is delivering is good for the time frame they had, i.e. since ChatGPT exploded in popularity enough for Google leadership to take note, since that moment, realistically, is when they put pressure on their devs to push something out the door.
I think we'll see a better iteration soon. Not only from Google, but from other competitors.
I've been using Metaphor for a few weeks now and have almost entirely switched away from Google and other search engines. Keyword-based search simply doesn't come close when it comes to getting the _right_ results. While I have to sift through a few pages of results on Google and then maybe find what I'm looking for, on Metaphor there's almost no SEO spam or Wikipedia-style links dominating the top results. It directs you to sources that are relevant to your search query. I don't know how they did this (probably a lot of very specific and targeted tricks), but Alex and team have created a marvelous product and I'm excited to see where this goes! Congrats on the launch!
While your point about numerical stability is correct in general, there are no numerical stability issues here, and I think this conception, which I've seen in more than one place now, stems from a fundamental misunderstanding of the paper's results. While they _did_ come up with a faster TPU/GPU algorithm too, the primary result is not a fast matmul approximation; it is an exact algorithm comprising stepwise addition/multiplication operations, and hence is numerically stable and should work for any ring (https://ncatlab.org/nlab/show/ring). AlphaTensor itself does not do the matrix multiplication; it was used to perform an (efficiently pruned) tree search over the space of operations to find an efficient, stable algorithm.
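To illustrate what "exact algorithm comprising stepwise addition/multiplication operations" means in practice, here is classical Strassen for 2x2 matrices, which is the same flavor of bilinear algorithm AlphaTensor searches over (this is Strassen's well-known 1969 decomposition, not one of the paper's newly discovered ones):

    def strassen_2x2(A, B):
        # Multiply 2x2 matrices with 7 multiplications instead of 8.
        # Only ring operations (+, -, *) are used, so the result is exact
        # over integers, rationals, finite fields, etc.
        (a, b), (c, d) = A
        (e, f), (g, h) = B
        m1 = (a + d) * (e + h)
        m2 = (c + d) * e
        m3 = a * (f - h)
        m4 = d * (g - e)
        m5 = (a + b) * h
        m6 = (c - a) * (e + f)
        m7 = (b - d) * (g + h)
        return [[m1 + m4 - m5 + m7, m3 + m5],
                [m2 + m4,           m1 - m2 + m3 + m6]]

    # exact on integers:
    # strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]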
> One important strength of AlphaTensor is its flexibility to support complex stochastic and non-differentiable rewards (from the tensor rank to practical efficiency on specific hardware), in addition to finding algorithms for custom operations in a wide variety of spaces (such as finite fields). We believe this will spur applications of AlphaTensor towards designing algorithms that optimize metrics that we did not consider here, such as numerical stability or energy usage.
Right, but doesn't that just mean it could potentially be used to design algorithms with componentwise numerical stability over some floating point standard, whereas this result, being over finite fields, is numerically stable by definition?
(apologies if I misunderstood; I wasn't calling you out specifically, but rather a general misconception I've noticed in a lot of other discussions so far)