If anyone else is curious about which ARC-AGI public eval puzzles o3 got right vs wrong (and its attempts at the ones it did get right), here's a quick visualization: https://arcagi-o3-viz.netlify.app
Hi! I've been working on theorem proving systems for some time now. I would love to help out with an AlphaProof reproduction, but I can't reach you on discord for some reason!
Thank you, those insights are invaluable! This is a specific and potentially dumb question and I completely understand if you can't answer it!
The practical motivation for MoEs is very clear, but I do worry about a loss of compositional abilities (which I think just emerge from superposed representations?) that some tasks may require, especially with the many-experts phenomenon we're seeing. This is an observation from smaller MoE models (with top-k gating etc.) that may or may not scale: dense models trained to the same loss tend to perform complex tasks "better".
Intuitively, do you think MoEs are just another stopgap trick we're using while we figure out more compute and better optimizers, or is there enough theoretical motivation to justify their continued use? If there isn't, perhaps we need to at least figure out "expert scaling laws" :)
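For concreteness, the kind of small MoE layer I have in mind is roughly the following (a minimal PyTorch sketch of top-k gating; the layer sizes and names are made up for illustration and not taken from any particular implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        # Minimal top-k gated mixture-of-experts feed-forward layer.
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.gate = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                       # x: (tokens, d_model)
            scores = self.gate(x)                   # (tokens, n_experts)
            topk_vals, topk_idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(topk_vals, dim=-1)  # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                idx = topk_idx[:, slot]
                w = weights[:, slot].unsqueeze(-1)
                for e in idx.unique():              # each token only visits its selected experts
                    mask = idx == e
                    out[mask] += w[mask] * self.experts[int(e)](x[mask])
            return out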
thanks for the thoughtful qtn! yeah i dont have the background for this one, you'll have to ask an MoE researcher (which Yi is not really either, as i found out on the pod). it does make sense that on a param-for-param basis MoEs would have less compositional ability, but i have yet to read a paper (mostly my fault, bc i dont read MoE papers that closely, but also researcher fault, in that they're not incentivized to be rigorous about the downsides of MoEs) that really identified which compositional abilities are actually hurt in MoEs. if you could, for example, identify subcategories of BigBench or similar that require compositional abilities, then we might be able to get hard evidence on this question. i'm not yet motivated enough to do this myself but it'd make a decent small research question.
HOWEVER i do opine that MoEs are kiiind of a stopgap (both on the pod and on https://latent.space/p/jan-2024) - definitely a validated efficiency/sparsity technique (esp see deepseek's moe work if you havent already, with >100 experts https://buttondown.email/ainews/archive/ainews-deepseek-v2-b...) but mostly a one-off boost over the single small dense-expert-equivalent model, rather than comparable to the capabilities of a large dense model of the same param count (aka i expect an 8x22B MoE to never outperform a 176B dense model ceteris paribus - which is difficult to get a like-for-like comparison on bc these things are expensive, partially because the MoE is usually just upcycled instead of trained from scratch, and partially because the routing layer is deepening every month). so perhaps to TLDR: there is more than enough evidence and practical motivation to justify their continued use (i would go so far as to say that all inference endpoints incl gpt4 and above should be MoEs) but they themselves are not really an architectural decision that matters for the next quantum leap in capabilities
I'm not sure how to quantify how quickly or well humans learn in-context (if you know of any work on this I'd love to read it!)
In general, there is too much fluff and confusion floating around about what these models are and are not capable of (regardless of the training mechanism). I think more people need to read Song Mei's lovely slides[1] and related work by others. These slides are the best exposition I've found of the neat ideas around ICL that researchers have been aware of for a while.
There has been some interesting work on distributed training, for example DiLoCo (https://arxiv.org/abs/2311.08105). I also know that Bittensor and Nous Research collaborated on some kind of competitive distributed model frankensteining-training thingy that seems to be going well. https://bittensor.org/bittensor-and-nous-research/
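My rough mental model of DiLoCo's inner/outer structure is something like the sketch below (hedged: this is my reading of the paper, not their code; the AdamW-inner / Nesterov-outer split follows the paper, but the hyperparameters and the worker.loss helper are illustrative):

    import copy
    import torch

    def diloco_round(global_model, workers, inner_steps=500, outer_lr=0.7):
        # One communication round: each worker trains locally for inner_steps,
        # then the averaged parameter delta ("pseudo-gradient") is applied to
        # the global params with an outer Nesterov-momentum step.
        start = [p.detach().clone() for p in global_model.parameters()]
        deltas = [torch.zeros_like(p) for p in start]

        for worker in workers:                      # conceptually in parallel
            local = copy.deepcopy(global_model)
            inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-4)
            for _ in range(inner_steps):
                loss = worker.loss(local)           # hypothetical helper: local batch + forward pass
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            for d, p0, p in zip(deltas, start, local.parameters()):
                d += (p0 - p.detach()) / len(workers)

        # outer step; in practice the outer optimizer (and its momentum state)
        # persists across rounds rather than being recreated each time
        outer_opt = torch.optim.SGD(global_model.parameters(), lr=outer_lr,
                                    momentum=0.9, nesterov=True)
        outer_opt.zero_grad()
        for p, d in zip(global_model.parameters(), deltas):
            p.grad = d
        outer_opt.step()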
Of course it gets harder as models get larger, but distributed training doesn't seem totally infeasible. For example, with MoE transformer models, perhaps separate slices of the model could be trained asynchronously and then combined with some retraining. You could have minimal regular communication of, say, the mean and variance for each layer, plus a new loss term dependent on these statistics to keep the "expertise" of each contributor distinct.
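To make that last idea a bit more concrete, here's the sort of auxiliary "distinctness" loss I'm imagining (purely my own sketch, nothing published; the hinge form, the margin, and the choice of statistics are arbitrary):

    import torch

    def distinctness_loss(my_layer_acts, peer_stats, margin=1.0):
        # Penalize a contributor whose per-layer activation statistics drift
        # too close to the (periodically communicated) stats of other contributors.
        # my_layer_acts: list of activation tensors, one per layer
        # peer_stats:    list of (mean, var) pairs per layer, averaged over peers
        loss = torch.tensor(0.0)
        for acts, (peer_mean, peer_var) in zip(my_layer_acts, peer_stats):
            my_mean, my_var = acts.mean(), acts.var()
            dist = (my_mean - peer_mean) ** 2 + (my_var - peer_var) ** 2
            # hinge-style repulsion: only penalize when the stats are too similar
            loss = loss + torch.clamp(margin - dist, min=0.0)
        return loss

    # total_loss = task_loss + lambda_distinct * distinctness_loss(acts, stats)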
I tried to do this ages ago in 2018 by adapting OpenAI's flow architecture, and it sort of seemed to work (it was at least promising). With today's models, which have a significantly more disentangled latent space, it should be much easier to do! I saw a transformer trained on the UK Biobank recently; excited for this space!
Wow, I had hoped for a more productive discussion than these 1-1 comparisons of Bard vs ChatGPT that I'm seeing everywhere. The model deployed with this version of Bard is clearly smaller than the biggest LaMDA/PaLM models Google has been working on for ages, which, according to their publications, show unprecedented results on _proof writing_ of all things (see Minerva). While their strategic decisions may be questionable (or they're just trying to quantize the model for mass deployment without burning billions per month in compute costs), it's almost silly to question Google's ability to build useful LLMs.
At the moment, unless we get more information about what metric you're supposed to evaluate it on, you could probably simplify the headline to just "Bard is much worse than ChatGPT" without any loss of accuracy.
It's not really realistic to expect people to give Google credit for these amazing models they have published results about but haven't let people play with - they have given people Bard and people are evaluating it based on the criteria most obvious to them - a comparison to a very similar product that has just been released.
They knew the war they were entering, they knew their enemies, they knew how they'd get evaluated and still decided to get this model out in its current state, leading to the conclusion: Yes, this is really the best they can do and it's much worse than the state of the art.
In any case, it's a massive marketing blunder; the public opinion formed within the last few hours was overwhelmingly "Bard sucks compared to ChatGPT."
>Yes, this is really the best they can do and it's much worse than the state of the art.
This is the best they can do under pressure.
ChatGPT surprised the world with how good it was, then Google scrambled to get something out quick.
A project like this is a massive undertaking; the first mover has the advantage that they can calmly refine their model until they find it presentable.
The question is whether what Google is delivering is good for the time frame they had, i.e. since ChatGPT exploded in popularity enough for Google leadership to take note, since that moment, realistically, is when they put pressure on their devs to push something out the door.
I think we'll see a better iteration soon. Not only from Google, but from other competitors.
I've been using Metaphor for a few weeks now and have almost entirely switched away from Google and other search engines. Keyword-based search simply doesn't come close when it comes to getting the _right_ results. While I have to sift through a few pages of results on Google and then maybe find what I'm looking for, on Metaphor there's almost no SEO spam or Wikipedia-style links dominating the top results. It directs you to sources that are relevant to your search query. I don't know how they did this (probably a lot of very specific and targeted tricks), but Alex and team have created a marvelous product and I'm excited to see where this goes! Congrats on the launch!
While your point about numerical stability is correct in general, there are no numerical stability issues here, and I think this conception, which I've seen in more than one place now, stems from a fundamental misunderstanding of the paper's results. While they _did_ come up with a faster TPU/GPU algorithm too, the primary result is not a fast matmul approximation; it is an exact algorithm comprising stepwise addition/multiplication operations, and hence is numerically stable and should work for any ring (https://ncatlab.org/nlab/show/ring). AlphaTensor itself does not do the matrix multiplication; it was used to perform an (efficiently pruned) tree search over the space of operations to find an efficient, stable algorithm.
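To illustrate what "exact algorithm comprising stepwise addition/multiplication operations" means in practice, here is classical Strassen for 2x2 matrices, which is the same flavor of bilinear algorithm AlphaTensor searches over (this is Strassen's well-known 1969 decomposition, not one of the paper's newly discovered ones):

    def strassen_2x2(A, B):
        # Multiply 2x2 matrices with 7 multiplications instead of 8.
        # Only ring operations (+, -, *) are used, so the result is exact
        # over integers, rationals, finite fields, etc.
        (a, b), (c, d) = A
        (e, f), (g, h) = B
        m1 = (a + d) * (e + h)
        m2 = (c + d) * e
        m3 = a * (f - h)
        m4 = d * (g - e)
        m5 = (a + b) * h
        m6 = (c - a) * (e + f)
        m7 = (b - d) * (g + h)
        return [[m1 + m4 - m5 + m7, m3 + m5],
                [m2 + m4,           m1 - m2 + m3 + m6]]

    # exact on integers:
    # strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]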
> One important strength of AlphaTensor is its flexibility to support complex stochastic and non-differentiable rewards (from the tensor rank to practical efficiency on specific hardware), in addition to finding algorithms for custom operations in a wide variety of spaces (such as finite fields). We believe this will spur applications of AlphaTensor towards designing algorithms that optimize metrics that we did not consider here, such as numerical stability or energy usage.
Right, but doesn't that just mean it could potentially be used to design algorithms with componentwise numerical stability over some floating point standard, whereas this result, being over finite fields, is numerically stable by definition?
(apologies if I misunderstood; I wasn't calling you out specifically, but rather a general misconception I've noticed in a lot of other discussions so far)