* All variants were trained on 1T-1.4T tokens, which is a good amount relative to their sizes by the Chinchilla metric. Code is 4.5% of the training data (similar to others). [Table 2]
* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU-hour rates will vary, but let's take a range of $1 to $4: the 7B model would have cost ~$82-329k and the 65B something in the range of ~$1-4M (see the quick sketch after this list). They also note the total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]
* The 65B model's performance is broadly comparable to PaLM-540B. Not a small feat, and it could also indicate the benefits of a good model-vs-token size ratio [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (massive multitask language understanding) compared to PaLM-540B and Chinchilla-70B is the smaller fraction of books and academic data in their training set.
* Math and code tasks: on math tasks they are substantially worse than Minerva (comparing their 65B to Minerva-62B; they lose hands down against Minerva-540B) [Table 7]. On code tasks they are broadly competitive with PaLM-540B (HumanEval and MBPP evals) [Table 8]
* Surprising that instruction fine tuning takes such a small part of the paper (sec 4, pg. 7)
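For anyone who wants to rerun the cost bracket from the second bullet with their own rates, a quick sketch (GPU-hour counts are from Table 15; the $1-$4/hr bracket is just an assumed range, not a quoted price):

```python
# Back-of-the-envelope version of the GPU-hour cost range above.
gpu_hours = {"7B": 82_432, "65B": 1_022_362}   # from Table 15 of the paper
for model, hours in gpu_hours.items():
    print(f"{model}: ${hours * 1:,.0f} - ${hours * 4:,.0f}")
# 7B:  $82,432 - $329,728
# 65B: $1,022,362 - $4,089,448
```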
I hate when people don't include an approximation of the training done before the final hyperparameters are found, as that is usually the most costly part of the whole process.
It's just "yes, we trained it for this long" etc., but they never speak about the tens or even hundreds of runs before they finalized the model parameters and architecture -.-
> we used 2048 A100-80GB for a period of approximately 5 months
Do we know how much total energy a human consumes from birth to 20 yo? Something like 2000 calories integrated over 20 years. How does it compare to the GPUs above?
Wolfram Alpha:
- human - 17 MWh ((2000 calories per day) over 20 years in MWh)
- GPUs - 3000 MWh ((2048 * 400) W over 5 months in MWh)
We still have the edge.
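Redoing those numbers outside Wolfram Alpha (calories, power draw, and durations as stated above; the unit conversions are mine):

```python
# Rough reproduction of the human-vs-GPU energy comparison above.
KCAL_TO_J = 4184
human_j = 2000 * KCAL_TO_J * 365.25 * 20          # 2000 kcal/day for 20 years
gpu_j = 2048 * 400 * (5 * 30.44 * 24 * 3600)      # 2048 GPUs at ~400 W for ~5 months
print(f"human: {human_j / 3.6e9:.0f} MWh")        # ~17 MWh
print(f"GPUs:  {gpu_j / 3.6e9:.0f} MWh")          # ~2992 MWh
```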
LOL, I'm being downvoted, I wonder why. Some don't like the question.
You have to include our evolutionary history too. A considerable amount of our sophisticated behavior doesn't require specific training, as it is encoded in our genetic and epigenetic systems. We aren't starting from zero.
Then you would need to include our history in the GPU calculation too. GPUs require evolutionary bootstrapping - they didn't materialize alongside the first few hydrogen atoms post-Big-Bang.
Depends on what you're doing. A human is much smarter than one of these models, but the model has approximate knowledge of orders of magnitude more things. And the energy costs per word of output are a lot closer.
A thing to keep in mind is that 1 MWh of raw calories takes much more than 1 MWh to produce (fuel for tractors, inefficiency of meat etc). The GPU energy is also easier to make renewable.
I did an extremely rough calculation recently that the training of GPT-3 is comparable to one transatlantic flight (all passengers combined) in terms of emissions, very depending on the energy mix of course.
That's the entire problem. There's so much more energy that goes into a modern human beyond just what they eat. Beyond physical items you've listed like clothing there's also education and healthcare. Those two institutions are critical in making a modern human and they both have their own dependency chains of physical resource, energy, and the input of even more humans.
I remember one of their old guidebooks describing a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
At CentML, we profiled GPU utilization on a larger AI/ML research institute's cluster: 10% to 45%, mostly in the 10% range. We then offered them software optimizers (which do not affect model accuracy) to get GPU utilization to 90%.
90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large data sets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.
A lot of it appears to be non-streaming approaches to data distribution, resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than what you'd want in order to hide the latency of data moves.
>* The 65B model's performance is broadly comparable to PaLM-540B. Not a small feat, and it could also indicate the benefits of a good model-vs-token size ratio [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (massive multitask language understanding) compared to PaLM-540B and Chinchilla-70B is the smaller fraction of books and academic data in their training set.*
What do you mean by this? The OpenAI papers talk roughly about model performance scaling with parameters. Does this show the other way?
Umm... so does OpenAI. In fact this is an OpenAI discovery from [1]:
> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)
> We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)
This is the old scaling laws paper. The scaling laws in it turned out to be wrong and superseded by the Chinchilla DeepMind paper: https://arxiv.org/abs/2203.15556
By "parameters" they probably mean float32s, and 65B of those is 0.25 TB of data - more than enough to memorize a 1.5T sequence of "tokens" (3 letter triplets?). This begs the question: are these models better than a fuzzy hash table?
Yes and no. Information theoretically, tokens are pretty well compressed, and you can't get another 6x losslessly.
Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models reliably respond to samples crafted not to be in the training set and that they take into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).
That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.
Your question feels like it has a motive though. What are you really asking?
LLMs need a baseline to compare with. I suspect that when they get compared with a fuzzy hash table of a similar size (that returns a range of probabilities), their performance will become unimpressive.
You can just directly calculate what would happen. To respond to novel words (which these demonstrably do) it needs to be equivalent to a character-wise hash table, and to be the same size as LLaMA you can do lookups on around 4 characters (and you have to deal with the data sparsity in constructing many of those tuples). If you want worse output but a better hash table on the output that remains, you could hash words or common words and get contexts of up to a few words rather than a few letters.
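A rough sanity check of that "around 4 characters" figure, assuming a ~96-symbol printable alphabet and an f16 parameter budget equal to LLaMA-65B's weights (both numbers are my own choices, not the comment's):

```python
# How long a character context fits in a LLaMA-65B-sized lookup table?
budget_bytes = 65e9 * 2            # 65B parameters stored as f16
alphabet = 96                      # assumed printable-character vocabulary
entry_bytes = alphabet * 2         # one f16 next-character distribution per context
for k in range(1, 7):
    table_bytes = alphabet**k * entry_bytes
    verdict = "fits" if table_bytes <= budget_bytes else "too big"
    print(f"{k}-char context: {table_bytes / 1e9:>12,.1f} GB ({verdict})")
# 4-character contexts fit in the budget; 5-character contexts already blow past it.
```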
LLMs can track mid-range dependencies though. Consider the following input
> Translate the phrase "the lazy brown fox jumped over the thorny brambles" into French, write the translation, and then write the second through fourth words of that translation.
Looking at any one word of the output you need to track many of the input words to get it correct, and the relative positions of those necessary input words is not consistent from one output word to the next. ChatGPT solves the task flawlessly (aside from its habit of explaining what it's doing before doing it). Any hash table solution, at a minimum, would need a complicated heuristic for determining which words/characters to look up.
Doing so brings us back closer to the state of language models before transformers. You had a lot of hand-tuned features, formal grammars, complicated orders of operations, expert lookup tables, and whatnot. Performance was still much, much worse than what we're getting now with deep learning.
None of that is to say that philosophically we're doing anything more than mishmashing probabilities or that something better doesn't exist, but without significant innovation rule-guided fuzzy hash tables aren't it.
The fuzzy hash table would use sequences of 8192 tokens as keys, and when asked to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of points in an 8192×sizeof(token)-dimensional space.
The procedure for constructing this table would be just taking all 1.5 trillion subsequences, each 8192 tokens long, and inserting each one: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.
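A toy, scaled-down sketch of that construction (the short context window, fake token ids, plain L2 distance, and k=32 neighbours are all my own stand-ins, not part of the original proposal):

```python
# Every CONTEXT-long window of a fake corpus maps to the token that follows it;
# queries return the next-token distribution over the nearest stored keys.
import numpy as np

CONTEXT = 8                                   # stand-in for the 8192-token contexts
rng = np.random.default_rng(0)
corpus = rng.integers(0, 100, size=10_000)    # made-up token ids

keys = np.stack([corpus[i:i + CONTEXT] for i in range(len(corpus) - CONTEXT)])
values = corpus[CONTEXT:]                     # table[seq] = next token

def predict_next(context, k=32):
    dists = np.linalg.norm(keys - np.asarray(context), axis=1)
    nearest = values[np.argsort(dists)[:k]]
    tokens, counts = np.unique(nearest, return_counts=True)
    return dict(zip(tokens.tolist(), (counts / counts.sum()).tolist()))

print(predict_next(corpus[:CONTEXT]))
```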
Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).
There are lots of interpretations. I actually like KNN for a lot of tasks. My gut says that it still wouldn't perform well here (and for the record, there are efficient data structures for the idea you're describing unless you have some nonstandard modifications, so "arranging the data efficiently to allow fast lookups" is definitely not the core problem), but I admittedly don't have proof of that yet.
For some intuition, imagine the following tasks:
> Repeat the following phrase exactly twice: "sdflhasdflhasdf"
> Repeat the following phrase exactly twice: "sdflhasdflhasdg"
Your fuzzy dictionary or geospatial map can't possibly have enough keys to distinguish the requests (or if it distinguishes those, you can adversarially select different keyboard mashes), and so the result, no matter what it is, would have the same probability distribution for both prompts. Since the desired results are different, at least one of them would have some unavoidable wrongness.
The GPT family, on the other hand, has few issues with random phrase duplication since positional information is something it explicitly considers and is capable of prioritizing over other token information.
"Access to the model will be granted on a case-by-case basis to academic researchers"
They keep saying the word 'release' but I don't think they know what that word means. There are perfectly good words in the English language to describe this situation without abusing "release." They "will begin to grant access to a select few". Nothing about that releases the model, or their control, which they are not doing and shouldn't imply.
> since in AI land these words mean the opposite of what they say.
Huh. I think you just might be right: OpenAI that isn't open, AI safety/ethics "researchers" that have nothing to do with safety or ethics, almost every answer chatGPT gives about a topic considered "sensitive" by said "researchers", almost every time ChatGPT falsely asserts it "cannot" do something or simply lies (1).
I often wonder why this field became so twisted and perverted. I support the idea behind ClosedAI.
1: Answers given by "DAN" provide a glimpse of what chatGPT output could be like if it was allowed to provide answers that are factual, genuine, and truthful according to its dataset.
> almost every time ChatGPT falsely asserts it "cannot" do something or simply lies (1).
Fun story: if ChatGPT is directly faced with empirical evidence that it can do something that OpenAI made it say it can’t do (for example, by causing the model to lie about itself by poisoning the input corpus with falsehoods about ChatGPT’s own capabilities), it cannot grasp that there’s a paradox.
At first I was "oh wow, I should download and run this immediately". Then I read the press release and of course it was marketing-speak with exactly the opposite meaning.
This blog post is terrible at listing the improvements offered by the model, the abstract is better:
> We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
> We release all our models to the research community.
This is yet more evidence for the "AI isn't a competitive advantage" thesis. State-of-the-Art is a public resource, so competing with AI offers no "moat".
In terms of medieval warfare, what Facebook is doing here looks like filling the moat with rocks and dirt. OpenAI is worth billions, Microsoft is spending billions to retrofit most of their big offerings with AI, Google is doubtless also spending billions to integrate AI in their products. Their moat in all cases is “a blob of X billion weights trained on Y trillion tokens”. Facebook here is spending mere _millions_ to make and release competitive models, effectively filling in the moat by giving it to everyone.
I don't see such grand stratagems as being a likely explanation. It seems more likely that a bunch of dorks are running around unsupervised, trying their best to make lemonade with whatever budgets they are given, as nobody can realistically manage AI research. Or at least, it seems that, much like in political governance, events are suddenly outpacing the ability of corporate to react.
In the case of OpenAI it is a "nudge humanity into a more wholesome direction" because a lot of them went on an acid-fueled toxic effective altruism bender. And in this case it is "release all the things while Zuck is still obsessed with VR -- for science!".
I like this other group better. But it is disturbing that it can stop at any moment. Probably why they are doing it, while they still can.
I think there is definitely a component of competition here.
Facebook has no real way to monetize this (e.g. they won’t release an API a la OpenAI and they don’t own a search engine). Since they can’t monetize it… why not provide a bit of kindling to lower the barrier for everyone else to compete with your competitors. This strategy is called “commoditize your complement”.
If Facebook makes it easier to develop a Google alternative, especially by doing something that doesn’t hurt them, then they just weakened a competitor. See Facebook releasing datasets for mapping. Think of the panic ChatGPT caused Google. It only cost a few million to train, but it’s probably costing Google more than that already.
> Facebook has no real way to monetize this (e.g. they won’t release an API a la OpenAI and they don’t own a search engine). Since they can’t monetize it… why not provide a bit of kindling to lower the barrier for everyone else to compete with your competitors. This strategy is called “commoditize your complement”.
Your analysis is good, but that isn't what "commoditize your complement" means.
Strictly speaking, a search engine isn't a complement for FB's revenue streams. Relatively little of FB's revenue can be attributed to search engine traffic leading to FB's walled garden, where they can show the user ads.
Complements are generally a required product or service that enables your core, but that isn't revenue generating for you. Examples for FB are data centers (so they participate in the Open Compute Project[0]), and mobile operating systems (which Google already has made a commodity, for their own reasons, with Android).
What FB is doing here is commoditizing their competitors' core offering (or rather, a rather promising future one). That's just the tactic though, there are several strategies this can enable, from undermining the barriers to entry into their competitors' search markets, to actually fragmenting that market by encouraging a diversity of specialized chat interfaces over one big chat model. You can see hints of both in this announcement.
Final note: FB is also protecting itself from relying on a competitor as a supplier, should chat become a preferred user interface for the content on social networks. It hasn't, but if it ever did, this would count as "commoditizing their complement" -- though I would actually expect FB to switch to a mostly proprietary approach in that circumstance (so not much openness on having LLMs operate on social graphs and the like), keeping open just the foundation they rely on, which undermines and prevents gatekeeping by their advertising competitors.
I don't think that's the best interpretation of complement. Think of a complement as "something whose existence benefits your users' experience". I think that search qualifies as something that enhances the Facebook user's experience (e.g. my cousin mentioned a thing they like doing today; how do I find out more about that thing?)
Given FB's optimization of the feed for dopamine hits, I don't think search across posts is a high priority (yes, it is table stakes, but whatever they currently use seems to be commoditized enough for their purposes). Group and page recommendations and discovery are another matter; perhaps an LLM may help make them more engaging. It might also be useful for helping users (and brands) moderate their groups and so on. These are also complements, but they aren't external ones.
Facebook maybe can't make money from it, but they arguably could save money from it, for instance in automating some fact checking and moderation activities that they currently spend quite a bit of money on.
If you’re going to ask an AI to do fact checking with today’s technology, I would urge you to start by asking the AI a simple test question: Which is heavier? A pound of feathers or two pounds of lead?
I was going to provide evidence that this works correctly, but I was surprised to find that it gets severely tripped up on this particular question, even when told to “show working step by step”.
ChatGPT: "One pound of feathers and two pounds of lead weigh the same, which is one pound or 16 ounces. The difference is in their volume, where a pound of feathers takes up more space than two pounds of lead. This is because the density of feathers is much less than that of lead, so even though the weight is the same, the amount of space they occupy is quite different."
Your question makes it sound like you believe it succeeds, but the answer you pasted is incorrect.
As for why it fails, it is likely a bias arising from the question being way more commonly asked in the corpus with equal mass than with distinct mass, increasing attention weights towards an answer expressing equality.
I believe current LLMs lack some common sense at an architectural level. They learn both specialized facts and general deduction in the same set of weights; in my mind, they should separate their world model from their instance model.
They almost certainly spent at least a few million dollars on this research project. Hard to say when the decision was made to open source this (from the outset or after it started showing results), but the decision was conscious and calculated. Nothing this high-profile is going to escape the strategic decision making processes of upper management.
The researchers and engineers and other assorted dorks who built it weren’t thinking about moats, for sure, I agree with you there. But I guarantee you that the metaphor of medieval warfare was on the minds of the executives and legal team deciding whether to let the team release their work in this way.
Today we're releasing a new state-of-the-art AI large language model called LLaMA designed to help researchers advance their work. LLMs have shown a lot of promise in generating text, having conversations, summarizing written material, and more complicated tasks like solving math theorems or predicting protein structures. Meta is committed to this open model of research and we'll make our new model available to the AI research community.
I don't know what that means or if he even wrote/read it tbh. I hope it literally just means Meta is actually committed to this open model of research (for now).
Maybe he is being a Machiavellian moat filler, I stand corrected. I think/hope that they don't really have a plan to counter OpenAI yet because I am afraid this attitude won't last once they do and this stuff has recently started moving quickly.
100%. This kind of announcement is for the street. Investors need to be reassured that Meta is keeping up with the cool AI stuff over at Google and Microsoft. A release like this will cause a flurry of downstream news coverage. I’m sure they are hoping that the model will be picked up by researchers doing interesting things who might further generate good coverage.
I don't know what my bias is supposed to be, but I called them dorks affectionately for one. The other replies are literally arguing that they are very much supervised, whereas I am speculating they are just eager to share their work for the right reasons and the eye of Sauron has yet to turn upon them.
Inside knowledge I never claimed. Anything else I can help you with today? :)
I think Noah Kagan's blog is literally called "OK Dork", and I always assumed the title was self-deprecating (in a fun/positive way) rather than negative.
Which kind of suggests Microsoft made a really bad move antagonizing the open-source community with GitHub Copilot.
They got a few years of lead time in the "AI codes for you" market, but in exchange permanently soured a significant fraction of their potential userbase who will turn to open-source alternatives soon anyway.
I wonder if they'd have been better served focusing on selling Azure usage and released Copilot as an open-source product.
How did Microsoft sour developers with Copilot? I know dozens of people that pay for it (including myself) and I feel like it is widely regarded as a "no brainer" for the price that it's offered at.
The company that tried to kill Linux in the 90s, owned by the world's most famously rich man, is now stealing my code and selling it back to me? Yeah, fuck that.
It's not selling you back your code. It's different code, adapted to a different task; your own code is forever free for you, you don't need anyone to give it to you.
Given the cost of running these models, and the utmost dedication needed to train them, I think it is worth it. GPUs cost money, electricity costs money. They can't serve the world for free and offer good latency.
I mean, that's like saying an author steals the open source alphabet and charges you for reading their ordering of letters, as if the ordering of letters isn't where all the value is.
They didn’t. There is a small group of people that are always looking for the latest reason to be outraged and to point at any of the big tech companies and go “aha! They are evil!” Copilot’s ai was trained on GitHub projects and so these people are taking their turns clutching their pearls inside of their little bubble.
I’d bet that more than 95% of devs haven’t even heard of this “controversy” and even if they did, wouldn’t care.
I do think the controversy is stupid, but inside my own company, we significantly delayed migrating some projects to Github because people were concerned that the way Microsoft handled Copilot meant that Github wasn't a safe long-term host for an open-source project (and yes, I'm aware of all the reason that's irrational).
Even if the people angry about Copilot are a minority, it might still have been a bad move. Trust accumulates slowly over years, but mistrust builds up over only a few events. People still remember Microsoft's anticompetitive practices from 20 years ago. The mistakes it makes now might stick for a long time.
Presumably, because they trained Copilot on billions of lines of often-licensed code (without permission), which Copilot has a tendency to regurgitate verbatim, without said license.
For a specific example, some variation of "fast inverse square root" will usually get you the exact GPL-licensed code from Quake III, comments included.
Do you mean the same code that has its own Wikipedia page where the exact code is written, comments included, and has probably been copy-pasted into hundreds of other projects?
Do you see that notice at the top of the file? It says:
==
This file is part of Quake III Arena source code.
Quake III Arena source code is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.
===
but because it's been laundered by Microsoft, you think it's okay to steal free software and make it proprietary?
How is it made proprietary? The Quake III Arena code is no more proprietary now than if it were stored on GitHub's proprietary web servers. Copilot is just a fancy code index that sometimes returns the original code and other times gives you a modified copy.
Because as you say, it provides original or modified code but doesn't provide provenance or license information. It's copyright laundering. After decades of fighting the community in the courts over shit like this, Microsoft just turns around and says well, it's okay when we do it? Foh.
"The algorithm was often misattributed to John Carmack, but in fact the code is based on an unpublished paper by William Kahan and K.C. Ng circulated in May 1986"
The point is that it's charging for something that was trained on open source code. What you're saying agrees with that, but your triumphant tone seems to be implying the opposite. Which did you mean?
> that Copilot has a tendency to regurgitate verbatim, without said license.
A "tendency" is overstating it. I'm not aware of any example that would have been likely to occur if the author wasn't specifically trying to get the regurgitated code.
Isn't this a classic "commoditize your complements" play? Facebook's value is in their social graph and users' attachment to it via various apps. TikTok and others are not trying to replicate that, they're creating similar volumes of user attachment via models. You can easily imagine LLM's being applied by competitors in a comparable way. If models become commodities, then Facebook continues to hold the one advantage nobody else has.
My guess is that the primary strategic motivation here is recruiting and maintaining a competent org of people who know how to build large models. I don't think that the actual release is as calculated as all that. It's more about proving to themselves and the world that they can train big models.
i think maybe the LLM team at facebook was bummed out because twitter bullied them on their last public release (and they didn't ignore it), and this time they decided to sit down and nerd flex by doing some undeniably excellent performance work that reduces resource requirements by 10x and limits itself only to publicly available training data.
maybe they care about moats and elon muskcrosoft's closedai or whatever, but i kinda doubt it. again, it feels more like a nerd flex probably for the purposes of raising morale internally and pushing the field as a whole in a good direction by reducing resource requirements.
excellent paper! easy on the eyes and i really like the angle.
> We release all our models to the research community.
And from the FB blogpost [0]
"Request Form
Thank you for your interest in Meta AI’s LLaMA (Large Language Model Meta AI) models. To request access to the models, please fill out this form, and we'll review and let you know if your use case is approved. The information you provide below will be used solely to assess eligibility to access these models."
So much for "releasing" the model for research community.
Glad to see you still on HN! You've done amazing work in this domain!
I'd argue that this goes further back to the word2vec/glove days too. I was working for a company in 2018 who leveraged my skills for fine-tuning word2vec/fasttext even before BERT/attention is all you need paper.
>'We release all our models to the research community'
Where release means you fill out a form and wait indefinitely. Also no use for commercial purposes - which means 95% of users are out - certainly doesn't democratize LLMs.
I think these "open" models are the lamest release strategy. I get the idea of being closed à la OpenAI to protect a business model, and I understand open sharing in the community.
"We'll let you see it if we approve you" is just being a self-important pompous middle man.
Funny that we had just rebranded our tool from GPT Index to LlamaIndex about a week ago to avoid potential trademark issues with OpenAI, and turns out Meta has similar ideas around LLM+llama puns :). Must mean the name is good though!
Also very excited to try plugging in the LLaMa model into LlamaIndex, will report the results.
BTW, I have been heavily experimenting with both your LlamaIndex and also LangChain which you use in LlamaIndex. I am also writing a new book centered around both projects. Great stuff!!
The most interesting snippet in the paper I think is this:
> For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU.
How about alignment/ability to answer prompt queries and chain of thought reasoning capabilities?
Without the fine tuning RLHF phase to make it like instructgpt I'm assuming it won't be as good as ChatGPT, is that right?
How hard would it be to fine tune the 65B model on commodity hardware?
Found answer here:
> Out-of-scope use cases LLaMA is a base, or foundational, model. As such, it should not be used on downstream applications without further risk evaluation and mitigation. In particular, our model has not been trained with human feedback, and can thus generate toxic or offensive content, incorrect information or generally unhelpful answers.
Yes but fine tuning for RL is not expected to be hard. You're essentially limited by how much human feedback is available, so it's very different from training the foundational model on random bulk data.
At this point I fully expect that someone will release a usable RLHF-fine-tuned language model that can run on consumer hardware, based on the methodology used for LLaMA (and other similar papers e.g. https://github.com/FMInference/FlexGen ), at some point in the next 6-24 months.
Of course, but even still the level of scale, clean data, and human supervision needed may be significant. It's reported OpenAI used an army of humans to generate question answer prompts and rate the model output.
They kept the details closely guarded and only hinted at how they did RLHF and transitioned the architecture to self supervised learning.
Does this show how inefficient GPT-3 is, or how easily their model can be disrupted? The way they will need to keep their business is to innovate faster with a usable commercial product.
> To maintain integrity and prevent misuse, we are releasing our model under a noncommercial license focused on research use cases. Access to the model will be granted on a case-by-case basis to academic researchers; those affiliated with organizations in government, civil society, and academia; and industry research laboratories around the world. People interested in applying for access can find the link to the application in our research paper.
It is still unclear if you are even going to get access to the entire model as open source. Even if you did, you can't use it for your commercial product anyway.
> Even if you did, you can't use it for your commercial product anyway.
And of course, the irony is that it’s not commercial products that endanger “integrity” or invite “misuse”.
The looming misuse of LLMs is cheap content flooding by spammers, black hats, and propaganda bots — and those people don’t care about licenses, and will inevitably defeat any watermarks meant to prevent or track leaks.
Has anyone had success being approved to download weights even if you are just a hobbyist with a big GPU? (I'm here about to begrudgingly fill out the Google Form) Asking in general, not just for this particular model.
I've been following these language models and so far went through approval once for a English/Chinese GLM-130B model and it took only ~30 minutes to get approved even though I am a complete nobody.
i got in after 2 days and im just an undergrad with some experience tangentially related to machine learning. Too bad i only found out the hardware requirement after downloading 13GB of weights
I hope they make at least the smallest one available for everyone. I find it ironic that they want to prevent misuse, but think allowing access to those affiliated with government organizations will achieve that.
I'm going to assume you know how to stand up and manage a distributed training cluster as a simplifying assumption. Note this is an aggressive assumption.
You would need to replicate the preprocessing steps. Replicating these steps is going to be tricky as they are not described in detail. Then you would need to implement the model using xformers [1]. Using xformers is going to save you a lot of compute spend. You will need to manually implement the backwards pass to reduce recomputation of expensive activations.
The model was trained using 2048 A100 GPUs with 80GB of VRAM each. A single 8x A100 GPU machine from Lambda Cloud costs $12.00/hr [2]. The team from Meta used 256 such machines, giving a per-day cost of $73,728. It takes 21 days to train this model. The upfront lower-bound cost estimate of doing this is (12.00 * 24 * 21 * 256) = $1,548,288, assuming everything goes smoothly and your model doesn't bite it during training. You may be able to negotiate bulk pricing for these types of workloads.
That dollar value is just for the compute resources alone. Given the compute costs required you will probably also want a team composed of ML Ops engineers to monitor the training cluster and research scientists to help you with the preprocessing and model pipelines.
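Writing the compute-only estimate out, using the figures quoted above ($12.00/hr per 8x A100 machine, 2048/8 = 256 machines, 21 days):

```python
# Compute-only replication cost, all inputs as stated in the comment above.
machines = 2048 // 8              # 256 eight-GPU machines
cost = 12.00 * 24 * 21 * machines
print(f"${cost:,.0f}")            # $1,548,288, before storage, networking, or failed runs
```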
thank you! love getting a peek into infra math. hopefully these costs will come down someday, or my resources will rise enough to train these in-house. can foresee foundation model training being a core competency of many tech platform companies in the future.
What's wrong with (orig. pub. 1950, rev. ed. 2009) or something like that.
With Zotero, the ref manager I use, you need to know the special magic incantation to store the original date of publication (the publishing date refers to the edition you're citing), but it just looks stupid (I think) to see Nietzsche F. 1999, Untimely Meditations (or whatever), and also I'd like to sort texts by original date of publication from oldest to newest because that's interesting. You have to put a special magic code in the extra field, and then your tool-chain has to preserve it and render it correctly in the doc where you're using the citation.
Wow. That's nuts. An open-source model as competitive as 540b Palm? Sign me in. Wish they made it easy to access it, though. Not sure if I would be able to put my hands on it solely as an interested individual.
How does token->embedding lookup work with 1.4T BPE tokens? Since there are more tokens than the 65B parameters it must be doing some sort of interesting thing based on the merge operations. Is it different from what other GPT models with ~100k tokens are doing?
At inference, how many of those tokens are used? (They mention most tokens are used only once during training, so they must be very long sequences.)
Ah, that makes more sense, thank you. Since this was mentioned in the tokenizer section and the number of unique tokens wasn't mentioned I misunderstood.
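For anyone else who hit the same confusion: the embedding table is indexed by vocabulary id (tens of thousands of entries), not by the 1.4T training tokens, so the lookup is plain row indexing. A minimal sketch, with illustrative sizes I picked rather than numbers from the paper:

```python
import numpy as np

vocab_size, d_model = 32_000, 4_096     # illustrative sizes, not from the paper
embedding = np.random.randn(vocab_size, d_model).astype(np.float32)

token_ids = [17, 4093, 253]             # a tokenized prompt (made-up ids)
vectors = embedding[token_ids]          # token -> embedding lookup is row indexing
print(vectors.shape)                    # (3, 4096)
```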
> LLaMA is a new open-source, high-performance large language model from Meta AI - FAIR.
>
> Meta is committed to open research and releases all the models to the research community under a GPL v3 license.
I'd love to hear what Facebook defines as "open" and "democratic".
I understand the unwillingness to attach your brand to whatever porn that inevitably comes out of making it available for everybody, but maybe don't use such big words then.
How big is one of these models once it is trained and tuned? Could it be run on a single reasonably-large EC2? Or does it need any special architecture?
a very rough approximation is 2GB of vram for every billion fp16 model parameters, so the lower end models may be just about achievable on high-end cards
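The same rule of thumb as a tiny calculator, applied to the model sizes the paper names; it ignores activations and KV cache, so treat the results as a floor:

```python
# "~2 GB of VRAM per billion fp16 parameters" rule of thumb, weights only.
for billions in (7, 13, 33, 65):
    print(f"{billions}B params: ~{2 * billions} GB at fp16, ~{billions} GB at int8")
```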
Take y=3x+7. 3 and 7 are parameters, which would be learned by using gradient descent to find the values that make the x’s in the training data produce the corresponding y’s. LLMs are essentially functions like this, only they have billions of parameters and are nonlinear.
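A minimal sketch of that example: recovering the two parameters of y = 3x + 7 by gradient descent on synthetic data (the learning rate and loop counts are arbitrary choices):

```python
import random

random.seed(0)
xs = [random.uniform(-5, 5) for _ in range(200)]
ys = [3 * x + 7 for x in xs]           # training data generated by the "true" rule

a, b, lr = 0.0, 0.0, 0.01
for _ in range(200):
    for x, y in zip(xs, ys):
        err = (a * x + b) - y          # prediction error on one example
        a -= lr * err * x              # gradient step for the slope
        b -= lr * err                  # gradient step for the intercept
print(round(a, 2), round(b, 2))        # converges to roughly 3 and 7
```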
LLaMA is short for "Large Language Model Meta AI" -- I don't get how, but I guess the "fakeronym" was a brandable, recognizable word, and close enough that they went with it?
7 billion can run on 16+ GB GPUs at fp16, and 14 billion can run on 16+ GB if quantized to int8. The 14B model at fp16 and the 30B at int8 will require one of the 48 GB cards (they need less than that, but hardware mostly goes 24 -> 48).
Ethical problems that they don’t purge it of Wrong Think?
Why people care that a dodgy AI can be made to say politically incorrect things is beyond me. Don’t use its outputs for anything important and everything will be alright.
Perhaps because that is going to be a key factor in how many use cases it can be commercialised for, which is worth considering when training a model costs a million dollars in compute time.