* All variants were trained on 1T-1.4T tokens, which is a good amount relative to their sizes by the Chinchilla metric. Code is 4.5% of the training data (similar to others). [Table 2]
* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU-hour rates will vary, but let's take a range of $1 to $4: the 7B model would have cost ~$82-329k and the 65B something in the range of ~$1-4M (see the quick sketch after this list). They also note the total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]
* The 65B model's performance is broadly comparable to PaLM-540B. Not a small feat, and it could also indicate the benefits of a good model-vs-token size ratio [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (massive multitask language understanding) compared to PaLM-540B and Chinchilla-70B is the smaller fraction of books and academic data in their training set.
* Math and code tasks: on math tasks they are substantially worse than Minerva (comparing their 65B to Minerva-62B; they lose hands down against Minerva-540B) [Table 7]. On code tasks they are broadly competitive with PaLM-540B (HumanEval and MBPP evals) [Table 8]
* Surprising that instruction fine tuning takes such a small part of the paper (sec 4, pg. 7)
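For anyone who wants to rerun the cost bracket from the second bullet with their own rates, a quick sketch (GPU-hour counts are from Table 15; the $1-$4/hr bracket is just an assumed range, not a quoted price):

```python
# Back-of-the-envelope version of the GPU-hour cost range above.
gpu_hours = {"7B": 82_432, "65B": 1_022_362}   # from Table 15 of the paper
for model, hours in gpu_hours.items():
    print(f"{model}: ${hours * 1:,.0f} - ${hours * 4:,.0f}")
# 7B:  $82,432 - $329,728
# 65B: $1,022,362 - $4,089,448
```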
I hate when people don't include an approximation of the training done before the final hyperparameters are found, as that is usually the most costly part of the whole process.
It's just "yes, we trained it for this long" etc., but they never speak about the tens or even hundreds of runs before they finalized the model parameters and architecture -.-
> we used 2048 A100-80GB for a period of approximately 5 months
Do we know how much total energy a human consumes from birth to 20 yo? Something like 2000 calories integrated over 20 years. How does it compare to the GPUs above?
Wolfram Alpha:
- human - 17 MWh ((2000 calories per day) over 20 years in MWh)
- GPUs - 3000 MWh ((2048 * 400) W over 5 months in MWh)
We still have the edge.
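Redoing those numbers outside Wolfram Alpha (calories, power draw, and durations as stated above; the unit conversions are mine):

```python
# Rough reproduction of the human-vs-GPU energy comparison above.
KCAL_TO_J = 4184
human_j = 2000 * KCAL_TO_J * 365.25 * 20          # 2000 kcal/day for 20 years
gpu_j = 2048 * 400 * (5 * 30.44 * 24 * 3600)      # 2048 GPUs at ~400 W for ~5 months
print(f"human: {human_j / 3.6e9:.0f} MWh")        # ~17 MWh
print(f"GPUs:  {gpu_j / 3.6e9:.0f} MWh")          # ~2992 MWh
```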
LOL, I'm being downvoted, I wonder why. Some don't like the question.
You have to include our evolutionary history too. A considerable amount of our sophisticated behavior doesn't require specific training, as it is encoded in our genetic and epigenetic systems. We aren't starting from zero.
Then you would need to include our history in the GPU calculation too. GPUs require evolutionary bootstrapping - they didn't materialize alongside the first few hydrogen atoms post-Big-Bang.
Depends on what you're doing. A human is much smarter than one of these models, but the model has approximate knowledge of orders of magnitude more things. And the energy costs per word of output are a lot closer.
A thing to keep in mind is that 1 MWh of raw calories takes much more than 1 MWh to produce (fuel for tractors, inefficiency of meat etc). The GPU energy is also easier to make renewable.
I did an extremely rough calculation recently that the training of GPT-3 is comparable to one transatlantic flight (all passengers combined) in terms of emissions, very depending on the energy mix of course.
That's the entire problem. There's so much more energy that goes into a modern human beyond just what they eat. Beyond physical items you've listed like clothing there's also education and healthcare. Those two institutions are critical in making a modern human and they both have their own dependency chains of physical resource, energy, and the input of even more humans.
I remember one of their old guidebooks describing a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
At CentML, we profiled GPU utilization on a larger AI/ML research institute's cluster: 10% to 45%, mostly in the 10% range. We then offered them software optimizers (which do not affect model accuracy) to get GPU utilization to 90%.
90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large data sets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.
A lot of it appears to be non-streaming approaches to data distribution, resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than what you'd want in order to hide the latency of data moves.
>* The 65B model's performance is broadly comparable to PaLM-540B. Not a small feat, and it could also indicate the benefits of a good model-vs-token size ratio [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (massive multitask language understanding) compared to PaLM-540B and Chinchilla-70B is the smaller fraction of books and academic data in their training set.*
What do you mean by this? The OpenAI papers talk roughly about model performance scaling with parameters. Does this show the other way?
Umm... so does OpenAI. In fact this is an OpenAI discovery from [1]:
> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)
> We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)
This is the old scaling laws paper. The scaling laws in it turned out to be wrong and superseded by the Chinchilla DeepMind paper: https://arxiv.org/abs/2203.15556
By "parameters" they probably mean float32s, and 65B of those is 0.25 TB of data - more than enough to memorize a 1.5T sequence of "tokens" (3 letter triplets?). This begs the question: are these models better than a fuzzy hash table?
Yes and no. Information theoretically, tokens are pretty well compressed, and you can't get another 6x losslessly.
Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models reliably respond to samples crafted not to be in the training set and that they take into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).
That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.
Your question feels like it has a motive though. What are you really asking?
LLMs need a baseline to compare with. I suspect that when they get compared with a fuzzy hash table of a similar size (that returns a range of probabilities), their performance will become unimpressive.
You can just directly calculate what would happen. To respond to novel words (which these demonstrably do) it needs to be equivalent to a character-wise hash table, and to be the same size as LLaMA you can do lookups on around 4 characters (and you have to deal with the data sparsity in constructing many of those tuples). If you want worse output but a better hash table on the output that remains, you could hash words or common words and get contexts of up to a few words rather than a few letters.
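A rough sanity check of that "around 4 characters" figure, assuming a ~96-symbol printable alphabet and an f16 parameter budget equal to LLaMA-65B's weights (both numbers are my own choices, not the comment's):

```python
# How long a character context fits in a LLaMA-65B-sized lookup table?
budget_bytes = 65e9 * 2            # 65B parameters stored as f16
alphabet = 96                      # assumed printable-character vocabulary
entry_bytes = alphabet * 2         # one f16 next-character distribution per context
for k in range(1, 7):
    table_bytes = alphabet**k * entry_bytes
    verdict = "fits" if table_bytes <= budget_bytes else "too big"
    print(f"{k}-char context: {table_bytes / 1e9:>12,.1f} GB ({verdict})")
# 4-character contexts fit in the budget; 5-character contexts already blow past it.
```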
LLMs can track mid-range dependencies though. Consider the following input
> Translate the phrase "the lazy brown fox jumped over the thorny brambles" into French, write the translation, and then write the second through fourth words of that translation.
Looking at any one word of the output you need to track many of the input words to get it correct, and the relative positions of those necessary input words is not consistent from one output word to the next. ChatGPT solves the task flawlessly (aside from its habit of explaining what it's doing before doing it). Any hash table solution, at a minimum, would need a complicated heuristic for determining which words/characters to look up.
Doing so brings us back closer to the state of language models before transformers. You had a lot of hand-tuned features, formal grammars, complicated orders of operations, expert lookup tables, and whatnot. Performance was still much, much worse than what we're getting now with deep learning.
None of that is to say that philosophically we're doing anything more than mishmashing probabilities or that something better doesn't exist, but without significant innovation rule-guided fuzzy hash tables aren't it.
The fuzzy hash table would use sequences of 8192 tokens as keys, and when asked to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of points in an 8192×sizeof(token)-dimensional space.
The procedure for constructing this table would be just taking all 1.5 trillion subsequences, each 8192 tokens long, and inserting each one: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.
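A toy, scaled-down sketch of that construction (the short context window, fake token ids, plain L2 distance, and k=32 neighbours are all my own stand-ins, not part of the original proposal):

```python
# Every CONTEXT-long window of a fake corpus maps to the token that follows it;
# queries return the next-token distribution over the nearest stored keys.
import numpy as np

CONTEXT = 8                                   # stand-in for the 8192-token contexts
rng = np.random.default_rng(0)
corpus = rng.integers(0, 100, size=10_000)    # made-up token ids

keys = np.stack([corpus[i:i + CONTEXT] for i in range(len(corpus) - CONTEXT)])
values = corpus[CONTEXT:]                     # table[seq] = next token

def predict_next(context, k=32):
    dists = np.linalg.norm(keys - np.asarray(context), axis=1)
    nearest = values[np.argsort(dists)[:k]]
    tokens, counts = np.unique(nearest, return_counts=True)
    return dict(zip(tokens.tolist(), (counts / counts.sum()).tolist()))

print(predict_next(corpus[:CONTEXT]))
```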
Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).
There are lots of interpretations. I actually like KNN for a lot of tasks. My gut says that it still wouldn't perform well here (and for the record, there are efficient data structures for the idea you're describing unless you have some nonstandard modifications, so "arranging the data efficiently to allow fast lookups" is definitely not the core problem), but I admittedly don't have proof of that yet.
For some intuition, imagine the following tasks:
> Repeat the following phrase exactly twice: "sdflhasdflhasdf"
> Repeat the following phrase exactly twice: "sdflhasdflhasdg"
Your fuzzy dictionary or geospatial map can't possibly have enough keys to distinguish the requests (or if it distinguishes those, you can adversarially select different keyboard mashes), and so the result, no matter what it is, would have the same probability distribution for both prompts. Since the desired results are different, at least one of them would have some unavoidable wrongness.
The GPT family, on the other hand, has few issues with random phrase duplication since positional information is something it explicitly considers and is capable of prioritizing over other token information.
"Access to the model will be granted on a case-by-case basis to academic researchers"
They keep saying the word 'release' but I don't think they know what that word means. There are perfectly good words in the English language to describe this situation without abusing "release." They "will begin to grant access to a select few". Nothing about that releases the model, or their control, which they are not doing and shouldn't imply.
> since in AI land these words mean the opposite of what they say.
Huh. I think you just might be right: OpenAI that isn't open, AI safety/ethics "researchers" that have nothing to do with safety or ethics, almost every answer chatGPT gives about a topic considered "sensitive" by said "researchers", almost every time ChatGPT falsely asserts it "cannot" do something or simply lies (1).
I often wonder why this field became so twisted and perverted. I support the idea behind ClosedAI.
1: Answers given by "DAN" provide a glimpse of what chatGPT output could be like if it was allowed to provide answers that are factual, genuine, and truthful according to its dataset.
> almost every time ChatGPT falsely asserts it "cannot" do something or simply lies (1).
Fun story: if ChatGPT is directly faced with empirical evidence that it can do something that OpenAI made it say it can’t do (for example, by causing the model to lie about itself by poisoning the input corpus with falsehoods about ChatGPT’s own capabilities), it cannot grasp that there’s a paradox.
At first I was "oh wow, I should download and run this immediately". Then I read the press release and of course it was marketing-speak with exactly the opposite meaning.
This blog post is terrible at listing the improvements offered by the model, the abstract is better:
> We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
> We release all our models to the research community.
This is yet more evidence for the "AI isn't a competitive advantage" thesis. State-of-the-Art is a public resource, so competing with AI offers no "moat".
In terms of medieval warfare, what Facebook is doing here looks like filling the moat with rocks and dirt. OpenAI is worth billions, Microsoft is spending billions to retrofit most of their big offerings with AI, Google is doubtless also spending billions to integrate AI in their products. Their moat in all cases is “a blob of X billion weights trained on Y trillion tokens”. Facebook here is spending mere _millions_ to make and release competitive models, effectively filling in the moat by giving it to everyone.
I don't see such grand stratagems as being a likely explanation. It seems more likely that a bunch of dorks are running around unsupervised, trying their best to make lemonade with whatever budgets they are given, as nobody can realistically manage AI research. Or at least, it seems that, much like in political governance, events are suddenly outpacing the ability of corporate to react.
In the case of OpenAI it is a "nudge humanity into a more wholesome direction" because a lot of them went on an acid-fueled toxic effective altruism bender. And in this case it is "release all the things while Zuck is still obsessed with VR -- for science!".
I like this other group better. But it is disturbing that it can stop at any moment. Probably why they are doing it, while they still can.
I think there is definitely a component of competition here.
Facebook has no real way to monetize this (e.g. they won’t release an API a la OpenAI and they don’t own a search engine). Since they can’t monetize it… why not provide a bit of kindling to lower the barrier for everyone else to compete with your competitors. This strategy is called “commoditize your complement”.
If Facebook makes it easier to develop a Google alternative, especially by doing something that doesn’t hurt them, then they just weakened a competitor. See Facebook releasing datasets for mapping. Think of the panic ChatGPT caused Google. It only cost a few million to train, but it’s probably costing Google more than that already.
> Facebook has no real way to monetize this (e.g. they won’t release an API a la OpenAI and they don’t own a search engine). Since they can’t monetize it… why not provide a bit of kindling to lower the barrier for everyone else to compete with your competitors. This strategy is called “commoditize your complement”.
Your analysis is good, but that isn't what "commoditize your complement" means.
Strictly speaking, a search engine isn't a complement for FB's revenue streams. Relatively little of FB's revenue can be attributed to search engine traffic leading to FB's walled garden, where they can show the user ads.
Complements are generally a required product or service that enables your core, but that isn't revenue generating for you. Examples for FB are data centers (so they participate in the Open Compute Project[0]), and mobile operating systems (which Google already has made a commodity, for their own reasons, with Android).
What FB is doing here is commoditizing their competitors' core offering (or rather, a rather promising future one). That's just the tactic though, there are several strategies this can enable, from undermining the barriers to entry into their competitors' search markets, to actually fragmenting that market by encouraging a diversity of specialized chat interfaces over one big chat model. You can see hints of both in this announcement.
Final note: FB is also protecting itself from relying on a competitor as a supplier, should chat become a preferred user interface for the content on social networks. It hasn't, but if it ever did, this would count as "commoditizing their complement" -- though I would actually expect FB to switch to a mostly proprietary approach in that circumstance (so not much openness on having LLMs operate on social graphs and the like), keeping open just the foundation they rely on, which undermines and prevents gatekeeping by their advertising competitors.
I don't think that's the best interpretation of complement. Think of a complement as "something whose existence benefits your users' experience". I think that search qualifies as something that enhances the Facebook user's experience (e.g. my cousin mentioned a thing they like doing today; how do I find out more about that thing?)
Given FB's optimization of the feed for dopamine hits, I don't think search across posts is a high priority (yes, it is table stakes, but whatever they currently use seems to be commoditized enough for their purposes). Group and page recommendations and discovery are another matter; perhaps an LLM may help make them more engaging. It might also be useful for helping users (and brands) moderate their groups and so on. These are also complements, but they aren't external ones.
Facebook maybe can't make money from it, but they arguably could save money from it, for instance in automating some fact checking and moderation activities that they currently spend quite a bit of money on.
If you’re going to ask an AI to do fact checking with today’s technology, I would urge you to start by asking the AI a simple test question: Which is heavier? A pound of feathers or two pounds of lead?
I was going to provide evidence that this works correctly, but I was surprised to find that it gets severely tripped up on this particular question, even when told to “show working step by step”.
ChatGPT: "One pound of feathers and two pounds of lead weigh the same, which is one pound or 16 ounces. The difference is in their volume, where a pound of feathers takes up more space than two pounds of lead. This is because the density of feathers is much less than that of lead, so even though the weight is the same, the amount of space they occupy is quite different."
Your question makes it sound like you believe it succeeds, but the answer you pasted is incorrect.
As for why it fails, it is likely a bias arising from the question being way more commonly asked in the corpus with equal mass than with distinct mass, increasing attention weights towards an answer expressing equality.
I believe current LLMs lack some common sense at an architectural level. They learn both specialized facts and general deduction in the same set of weights; in my mind, they should separate their world model from their instance model.
They almost certainly spent at least a few million dollars on this research project. Hard to say when the decision was made to open source this (from the outset or after it started showing results), but the decision was conscious and calculated. Nothing this high-profile is going to escape the strategic decision making processes of upper management.
The researchers and engineers and other assorted dorks who built it weren’t thinking about moats, for sure, I agree with you there. But I guarantee you that the metaphor of medieval warfare was on the minds of the executives and legal team deciding whether to let the team release their work in this way.
Today we're releasing a new state-of-the-art AI large language model called LLaMA designed to help researchers advance their work. LLMs have shown a lot of promise in generating text, having conversations, summarizing written material, and more complicated tasks like solving math theorems or predicting protein structures. Meta is committed to this open model of research and we'll make our new model available to the AI research community.
I don't know what that means or if he even wrote/read it tbh. I hope it literally just means Meta is actually committed to this open model of research (for now).
Maybe he is being a Machiavellian moat filler, I stand corrected. I think/hope that they don't really have a plan to counter OpenAI yet because I am afraid this attitude won't last once they do and this stuff has recently started moving quickly.
100%. This kind of announcement is for the street. Investors need to be reassured that Meta is keeping up with the cool AI stuff over at Google and Microsoft. A release like this will cause a flurry of downstream news coverage. I’m sure they are hoping that the model will be picked up by researchers doing interesting things who might further generate good coverage.
I don't know what my bias is supposed to be, but I called them dorks affectionately for one. The other replies are literally arguing that they are very much supervised, whereas I am speculating they are just eager to share their work for the right reasons and the eye of Sauron has yet to turn upon them.
Inside knowledge I never claimed. Anything else I can help you with today? :)
I think Noah Kagan's blog is literally called "OK Dork", and I always assumed the title was self-deprecating (in a fun/positive way) rather than negative.
Which kind of suggests Microsoft made a really bad move antagonizing the open-source community with GitHub Copilot.
They got a few years of lead time in the "AI codes for you" market, but in exchange permanently soured a significant fraction of their potential userbase who will turn to open-source alternatives soon anyway.
I wonder if they'd have been better served focusing on selling Azure usage and released Copilot as an open-source product.
How did Microsoft sour developers with Copilot? I know dozens of people that pay for it (including myself) and I feel like it is widely regarded as a "no brainer" for the price that it's offered at.
The company that tried to kill Linux in the 90s, owned by the world's most famously rich man, is now stealing my code and selling it back to me? Yeah, fuck that.
It's not selling you back your code. It's different code, adapted to a different task; your own code is forever free for you, you don't need anyone to give it to you.
Given the cost of running these models, and the utmost dedication needed to train them, I think it is worth it. GPUs cost money, electricity costs money. They can't serve the world for free and offer good latency.
I mean, that's like saying an author steals the open source alphabet and charges you for reading their ordering of letters, as if the ordering of letters isn't where all the value is.
They didn’t. There is a small group of people that are always looking for the latest reason to be outraged and to point at any of the big tech companies and go “aha! They are evil!” Copilot’s ai was trained on GitHub projects and so these people are taking their turns clutching their pearls inside of their little bubble.
I’d bet that more than 95% of devs haven’t even heard of this “controversy” and even if they did, wouldn’t care.
I do think the controversy is stupid, but inside my own company, we significantly delayed migrating some projects to Github because people were concerned that the way Microsoft handled Copilot meant that Github wasn't a safe long-term host for an open-source project (and yes, I'm aware of all the reason that's irrational).
Even if the people angry about Copilot are a minority, it might still have been a bad move. Trust accumulates slowly over years, but mistrust builds up over only a few events. People still remember Microsoft's anticompetitive practices from 20 years ago. The mistakes it makes now might stick for a long time.
Presumably, because they trained Copilot on billions of lines of often-licensed code (without permission), which Copilot has a tendency to regurgitate verbatim, without said license.
For a specific example, some variation of "fast inverse square root" will usually get you the exact GPL-licensed code from Quake III, comments included.
Do you mean the same code that has its own Wikipedia page where the exact code is written, comments included, and has probably been copy-pasted into hundreds of other projects?
Do you see that notice at the top of the file? It says:
==
This file is part of Quake III Arena source code.
Quake III Arena source code is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.
===
but because it's been laundered by Microsoft, you think it's okay to steal free software and make it proprietary?
How is it made proprietary? The Quake III Arena code is no more proprietary now than if it were stored on GitHub's proprietary web servers. Copilot is just a fancy code index that sometimes returns the original code and other times gives you a modified copy.
Because as you say, it provides original or modified code but doesn't provide provenance or license information. It's copyright laundering. After decades of fighting the community in the courts over shit like this, Microsoft just turns around and says well, it's okay when we do it? Foh.
"The algorithm was often misattributed to John Carmack, but in fact the code is based on an unpublished paper by William Kahan and K.C. Ng circulated in May 1986"
The point is that it's charging for something that was trained on open source code. What you're saying agrees with that, but your triumphant tone seems to be implying the opposite. Which did you mean?
> that Copilot has a tendency to regurgitate verbatim, without said license.
A "tendency" is overstating it. I'm not aware of any example that would have been likely to occur if the author wasn't specifically trying to get the regurgitated code.
Isn't this a classic "commoditize your complements" play? Facebook's value is in their social graph and users' attachment to it via various apps. TikTok and others are not trying to replicate that, they're creating similar volumes of user attachment via models. You can easily imagine LLM's being applied by competitors in a comparable way. If models become commodities, then Facebook continues to hold the one advantage nobody else has.
My guess is that the primary strategic motivation here is recruiting and maintaining a competent org of people who know how to build large models. I don't think that the actual release is as calculated as all that. It's more about proving to themselves and the world that they can train big models.
i think maybe the LLM team at facebook was bummed out because twitter bullied them on their last public release (and they didn't ignore it), and this time they decided to sit down and nerd flex by doing some undeniably excellent performance work that reduces resource requirements by 10x and limits itself only to publicly available training data.
maybe they care about moats and elon muskcrosoft's closedai or whatever, but i kinda doubt it. again, it feels more like a nerd flex probably for the purposes of raising morale internally and pushing the field as a whole in a good direction by reducing resource requirements.
excellent paper! easy on the eyes and i really like the angle.
> We release all our models to the research community.
And from the FB blogpost [0]
"Request Form
Thank you for your interest in Meta AI’s LLaMA (Large Language Model Meta AI) models. To request access to the models, please fill out this form, and we'll review and let you know if your use case is approved. The information you provide below will be used solely to assess eligibility to access these models."
So much for "releasing" the model for research community.
Glad to see you still on HN! You've done amazing work in this domain!
I'd argue that this goes further back to the word2vec/glove days too. I was working for a company in 2018 who leveraged my skills for fine-tuning word2vec/fasttext even before BERT/attention is all you need paper.
>'We release all our models to the research community'
Where release means you fill out a form and wait indefinitely. Also no use for commercial purposes - which means 95% of users are out - certainly doesn't democratize LLMs.
I think these "open" models are the lamest release strategy. I get the idea of being closed à la OpenAI to protect a business model, and I understand open sharing in the community.
"We'll let you see it if we approve you" is just being a self-important pompous middle man.
Funny that we had just rebranded our tool from GPT Index to LlamaIndex about a week ago to avoid potential trademark issues with OpenAI, and turns out Meta has similar ideas around LLM+llama puns :). Must mean the name is good though!
Also very excited to try plugging in the LLaMa model into LlamaIndex, will report the results.
BTW, I have been heavily experimenting with both your LlamaIndex and also LangChain which you use in LlamaIndex. I am also writing a new book centered around both projects. Great stuff!!
The most interesting snippet in the paper I think is this:
> For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU.
How about alignment/ability to answer prompt queries and chain of thought reasoning capabilities?
Without the fine tuning RLHF phase to make it like instructgpt I'm assuming it won't be as good as ChatGPT, is that right?
How hard would it be to fine tune the 65B model on commodity hardware?
Found answer here:
> Out-of-scope use cases LLaMA is a base, or foundational, model. As such, it should not be used on downstream applications without further risk evaluation and mitigation. In particular, our model has not been trained with human feedback, and can thus generate toxic or offensive content, incorrect information or generally unhelpful answers.
Yes but fine tuning for RL is not expected to be hard. You're essentially limited by how much human feedback is available, so it's very different from training the foundational model on random bulk data.
At this point I fully expect that someone will release a usable RLHF-fine-tuned language model that can run on consumer hardware, based on the methodology used for LLaMA (and other similar papers e.g. https://github.com/FMInference/FlexGen ), at some point in the next 6-24 months.
Of course, but even still the level of scale, clean data, and human supervision needed may be significant. It's reported OpenAI used an army of humans to generate question answer prompts and rate the model output.
They kept the details closely guarded and only hinted at how they did RLHF and transitioned the architecture to self supervised learning.
Does this show how inefficient GPT-3 is, or how easily their model can be disrupted? The way they will need to keep their business is to innovate faster with a usable commercial product.
> To maintain integrity and prevent misuse, we are releasing our model under a noncommercial license focused on research use cases. Access to the model will be granted on a case-by-case basis to academic researchers; those affiliated with organizations in government, civil society, and academia; and industry research laboratories around the world. People interested in applying for access can find the link to the application in our research paper.
It is still unclear if you are even going to get access to the entire model as open source. Even if you did, you can't use it for your commercial product anyway.
> Even if you did, you can't use it for your commercial product anyway.
And of course, the irony is that it’s not commercial products that endanger “integrity” or invite “misuse”.
The looming misuse of LLMs is cheap content flooding by spammers, black hats, and propaganda bots — and those people don’t care about licenses, and will inevitably defeat any watermarks meant to prevent or track leaks.
Has anyone had success being approved to download weights even if you are just a hobbyist with a big GPU? (I'm here about to begrudgingly fill out the Google Form) Asking in general, not just for this particular model.
I've been following these language models and so far went through approval once for a English/Chinese GLM-130B model and it took only ~30 minutes to get approved even though I am a complete nobody.
i got in after 2 days and im just an undergrad with some experience tangentially related to machine learning. Too bad i only found out the hardware requirement after downloading 13GB of weights
I hope they make at least the smallest one available for everyone. I find it ironic that they want to prevent misuse, but think allowing access to those affiliated with government organizations will achieve that.
I'm going to assume you know how to stand up and manage a distributed training cluster as a simplifying assumption. Note this is an aggressive assumption.
You would need to replicate the preprocessing steps. Replicating these steps is going to be tricky as they are not described in detail. Then you would need to implement the model using xformers [1]. Using xformers is going to save you a lot of compute spend. You will need to manually implement the backwards pass to reduce recomputation of expensive activations.
The model was trained using 2048 A100 GPUs with 80GB of VRAM each. A single 8x A100 GPU machine from Lambda Cloud costs $12.00/hr [2]. The team from Meta used 256 such machines, giving a per-day cost of $73,728. It takes 21 days to train this model. The upfront lower-bound cost estimate of doing this is (12.00 * 24 * 21 * 256) = $1,548,288, assuming everything goes smoothly and your model doesn't bite it during training. You may be able to negotiate bulk pricing for these types of workloads.
That dollar value is just for the compute resources alone. Given the compute costs required you will probably also want a team composed of ML Ops engineers to monitor the training cluster and research scientists to help you with the preprocessing and model pipelines.
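Writing the compute-only estimate out, using the figures quoted above ($12.00/hr per 8x A100 machine, 2048/8 = 256 machines, 21 days):

```python
# Compute-only replication cost, all inputs as stated in the comment above.
machines = 2048 // 8              # 256 eight-GPU machines
cost = 12.00 * 24 * 21 * machines
print(f"${cost:,.0f}")            # $1,548,288, before storage, networking, or failed runs
```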
thank you! love getting a peek into infra math. hopefully these costs will come down someday, or my resources will rise enough to train these in-house. can foresee foundation model training being a core competency of many tech platform companies in the future.
What's wrong with (orig. pub. 1950, rev. ed. 2009) or something like that.
With Zotero, the ref manager I use, you need to know the special magic incantation to store the original date of publication (the publishing date refers to the edition you're citing), but it just looks stupid (I think) to see Nietzsche F. 1999, Untimely Meditations (or whatever), and also I'd like to sort texts by original date of publication from oldest to newest because that's interesting. You have to put a special magic code in the extra field, and then your tool-chain has to preserve it and render it correctly in the doc where you're using the citation.
Wow. That's nuts. An open-source model as competitive as 540b Palm? Sign me in. Wish they made it easy to access it, though. Not sure if I would be able to put my hands on it solely as an interested individual.
How does token->embedding lookup work with 1.4T BPE tokens? Since there are more tokens than the 65B parameters it must be doing some sort of interesting thing based on the merge operations. Is it different from what other GPT models with ~100k tokens are doing?
At inference, how many of those tokens are used? (They mention most tokens are used only once during training, so they must be very long sequences.)
Ah, that makes more sense, thank you. Since this was mentioned in the tokenizer section and the number of unique tokens wasn't mentioned I misunderstood.
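For anyone else who hit the same confusion: the embedding table is indexed by vocabulary id (tens of thousands of entries), not by the 1.4T training tokens, so the lookup is plain row indexing. A minimal sketch, with illustrative sizes I picked rather than numbers from the paper:

```python
import numpy as np

vocab_size, d_model = 32_000, 4_096     # illustrative sizes, not from the paper
embedding = np.random.randn(vocab_size, d_model).astype(np.float32)

token_ids = [17, 4093, 253]             # a tokenized prompt (made-up ids)
vectors = embedding[token_ids]          # token -> embedding lookup is row indexing
print(vectors.shape)                    # (3, 4096)
```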
> LLaMA is a new open-source, high-performance large language model from Meta AI - FAIR.
>
> Meta is committed to open research and releases all the models to the research community under a GPL v3 license.
I'd love to hear what Facebook defines as "open" and "democratic".
I understand the unwillingness to attach your brand to whatever porn that inevitably comes out of making it available for everybody, but maybe don't use such big words then.
How big is one of these models once it is trained and tuned? Could it be run on a single reasonably-large EC2? Or does it need any special architecture?
a very rough approximation is 2GB of vram for every billion fp16 model parameters, so the lower end models may be just about achievable on high-end cards
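The same rule of thumb as a tiny calculator, applied to the model sizes the paper names; it ignores activations and KV cache, so treat the results as a floor:

```python
# "~2 GB of VRAM per billion fp16 parameters" rule of thumb, weights only.
for billions in (7, 13, 33, 65):
    print(f"{billions}B params: ~{2 * billions} GB at fp16, ~{billions} GB at int8")
```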
Take y=3x+7. 3 and 7 are parameters, which would be learned by using gradient descent to find the values that make the x’s in the training data produce the corresponding y’s. LLMs are essentially functions like this, only they have billions of parameters and are nonlinear.
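A minimal sketch of that example: recovering the two parameters of y = 3x + 7 by gradient descent on synthetic data (the learning rate and loop counts are arbitrary choices):

```python
import random

random.seed(0)
xs = [random.uniform(-5, 5) for _ in range(200)]
ys = [3 * x + 7 for x in xs]           # training data generated by the "true" rule

a, b, lr = 0.0, 0.0, 0.01
for _ in range(200):
    for x, y in zip(xs, ys):
        err = (a * x + b) - y          # prediction error on one example
        a -= lr * err * x              # gradient step for the slope
        b -= lr * err                  # gradient step for the intercept
print(round(a, 2), round(b, 2))        # converges to roughly 3 and 7
```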
LLaMA is short for "Large Language Model Meta AI" -- I don't get how, but I guess the "fakeronym" was a brandable, recognizable word, and close enough that they went with it?
7 billion can run on 16+ GB GPUs at fp16, and 14 billion can run on 16+ GB if quantized to int8. The 14B model at fp16 and the 30B at int8 will require one of the 48 GB cards (they need less than that, but hardware mostly goes 24 -> 48).
Ethical problems that they don’t purge it of Wrong Think?
Why people care that a dodgy AI can be made to say politically incorrect things is beyond me. Don’t use its outputs for anything important and everything will be alright.
Perhaps because that is going to be a key factor in how many use cases it can be commercialised for, which is worth considering when training a model costs a million dollars in compute time.