Finetuning Large Language Models (sebastianraschka.com)
223 points by headalgorithm on April 22, 2023 | 70 comments



I still don't have a really good answer to this question:

If you want to be able to do Q&A against an existing corpus of documentation, can fine-tuning an LLM on that documentation get good results, or is that a waste of time compared to the trick where you search for relevant content and paste that into a prompt along with your question?

I see many people get excited about fine-tuning because they want to solve this problem.

The best answer I've seen so far is in https://github.com/openai/openai-cookbook/blob/main/examples...

> Although fine-tuning can feel like the more natural option—training on data is how GPT learned all of its other knowledge, after all—we generally do not recommend it as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall. [...] In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it’s like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.
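The retrieval trick itself is only a few lines. A minimal sketch with the 2023-era openai Python client; the embedding model, the pre-chunked docs, and the prompt wording here are just illustrative, not a recommended setup:

    import numpy as np
    import openai

    def embed(text):
        # Embed a string with an OpenAI embedding model (any embedding model works).
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
        return np.array(resp["data"][0]["embedding"])

    def answer(question, doc_chunks, doc_vectors, k=3):
        # Rank pre-embedded documentation chunks by cosine similarity to the question.
        q = embed(question)
        scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
        context = "\n\n".join(doc_chunks[i] for i in np.argsort(-scores)[:k])
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer only from the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return resp["choices"][0]["message"]["content"]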


It's amazing how much misinformation and vague information there is on this topic. I tried getting to the bottom of this in the following post in the OpenAI forum:

https://community.openai.com/t/fine-tuning-myths-openai-docu...

Bottom line is that fine-tuning does not seem to be a feasible option for adding new knowledge to a model for question answering.


My research shows otherwise. Tuning via transformer adapters pretty much added new knowledge to QA models or could be used for adversarial QA training. You can throw away learned adapters anytime and retrain from scratch with new information if your adapters become stale. Fine-tuning is cheap and small (e.g. 60kB data in an adapter). You can customize it in production for each individual customer as well by swapping adapters at the time of inference. Embeddings for very short-term facts and adapters for medium-long-term info seems like the best combination.
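A rough sketch of the swap-at-inference idea with the adapter-transformers library; the adapter names and paths are made up, and treat the exact API as approximate:

    from transformers import AutoAdapterModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoAdapterModel.from_pretrained("bert-base-uncased")

    # One small adapter per customer, each only a few hundred kB to a few MB on disk.
    model.load_adapter("./adapters/customer_a", load_as="customer_a")
    model.load_adapter("./adapters/customer_b", load_as="customer_b")

    def run(text, customer):
        # Activate that customer's adapter at inference time; the shared base
        # weights are untouched, so the swap is cheap and reversible.
        model.set_active_adapters(customer)
        return model(**tokenizer(text, return_tensors="pt"))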


You mean like what's described in this blog post, correct?

https://adapterhub.ml/blog/2022/03/adapter-transformers-v3-u...


Yes, those adapters. There are many types now, most recently LLaMA adapter:

https://arxiv.org/abs/2303.16199


Could you link to your research and/or describe the models, libraries, data and tests you used for this?


Have you tried fine-tuning via adapters? If so, what has been your experience, and what was the total cost?


I liked this video:

https://youtu.be/9qq6HTr7Ocw

"OpenAI Q&A: Finetuning GPT-3 vs. Semantic Search - which to use, when, and why?"


I can't determine if this person knows what they are talking about or is an extreme amateur just making things up and speaking confidently. He has other videos that are complete nonsense.


I mean... if you aren't OK with that, how are you ever going to use a large language model? ;P


The search+prompt approach has another benefit, which is that it allows a chat interface to "cite its sources", something you really can't do with just parameters (fine-tuned or not).

Although a lot of people are working on hallucination reduction, I think we're a long way off from that in the general case. So having the ability to point to a real piece of data, outside the model, is important for applications where accuracy matters.


I've done that with the rough Q&A tool I made. I include links to websites or documents not so much for verification, but because I work with government info and the devil's in the details. The Q&A is more like a smart search engine, with the summarized answer being a nice bonus.


I have similar questions for code assistance.

Github Copilot seems to be the most effective code assistant currently. It seems to use many heuristics for figuring out relevant snippets to include in prompts, like computing Jaccard similarity of windows of the last 20 opened files. It also tries some tree-sitter cleverness for some languages, but when I snoop on the HTTP traffic it seems to almost always just give up and only include the rest of a file as context.
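(Aside: the Jaccard heuristic is simple enough to reimplement. A rough sketch of what I assume it does; this is not Copilot's actual code:)

    def jaccard(a, b):
        # Jaccard similarity of two token sets: |intersection| / |union|.
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def best_window(current_tokens, other_file_tokens, window=60):
        # Slide a fixed-size window over a recently opened file and keep the
        # window whose token set overlaps most with the code around the cursor.
        best = (0.0, [])
        for i in range(0, max(1, len(other_file_tokens) - window), window // 2):
            w = other_file_tokens[i:i + window]
            score = jaccard(current_tokens, w)
            if score > best[0]:
                best = (score, w)
        return best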

I have wondered whether a model fine-tuned on my own code would do much better, and be simpler. But perhaps building embeddings and searching them (like in the article you linked) would be superior.

Code assistants need to be super low latency though which maybe complicates things too.


I would really like to have something local, without running into ChatGPT limits or costs. I looked at some text generation models, but I couldn't find any where you can ask a question and it generates the correct code.

Does anybody know if this is already available on Hugging Face or somewhere else?

I'm using TypeScript, so my idea would be to finetune on all used packages, maybe even all of the documentation as well.

I've created a small app where I create a prompt for an ESLint error and let ChatGPT come up with a git diff. I manually feed that back to my application; it runs ESLint and compiles, and if there are any errors, a new prompt is generated. This works quite well.
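(The loop is roughly this, sketched in Python with subprocess calls; the prompt wording is improvised and the patch application is deliberately naive:)

    import subprocess
    import openai

    def lint(path):
        # Run ESLint; an empty string means the file is clean.
        r = subprocess.run(["npx", "eslint", path], capture_output=True, text=True)
        return (r.stdout + r.stderr) if r.returncode else ""

    def apply_patch(diff):
        # Naive: pipe the model's diff straight into `git apply`;
        # real code should validate the diff first.
        subprocess.run(["git", "apply"], input=diff, text=True)

    def fix(path, max_rounds=3):
        for _ in range(max_rounds):
            errors = lint(path)
            if not errors:
                return True
            source = open(path).read()
            resp = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content":
                    "Fix these ESLint errors and reply with a git diff only.\n\n"
                    f"Errors:\n{errors}\n\nFile {path}:\n{source}"}],
            )
            apply_patch(resp["choices"][0]["message"]["content"])
        return False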

GPT-4 is not available for me in the OpenAI API, so it's quite cumbersome.


The best option right now is SuperCOT 30B, a LoRA for LLaMA trained on Chain-of-Thought coding questions and answers. You will need at least 24GB of VRAM to load the 4bit GPTQ quantized LoRA merged model* locally. https://huggingface.co/tsumeone/llama-30b-supercot-4bit-128g...

*The merged model contains the LoRA. Applying the LoRA over a "raw" llama model uses more VRAM and does not allow for full context in 24GB of VRAM.


This is why Tabnine got me super excited, until I ran across the issue where they think their results are better than what the IDE gives you, which is incredibly annoying: https://github.com/codota/tabnine-intellij/issues/18 . I would be happy to pay, but it seems they are convinced their way is best.

I honestly think that if you could have all your private code indexed and accessible, this would be a game changer as it has way better context.


I think so too. I’m thinking about writing a little server you would run as a sidecar process. It would tokenize your code and build a local index in a vector DB. Client plugins in IDEs would ask that server for similar code snippets, with a short deadline for response, perhaps 1ms. Then the plugins could assemble a prompt to send to a configurable LLM backend, and do editor-specific code insertion or suggestion or whatever. I use emacs, so I know how I want that to work for me, but it must be different for others.

My open questions though are:

- does the tokenizer for the index need to be the same as the one used by the LLM?

- is running a local LLM feasible, or should I just assume prompts must be sent over the network? This would affect latency and privacy a lot.

- is fine tuning helpful? Or necessary, even, to ensure the LLM only provides source code, and only in the target language?
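A stripped-down version of that sidecar is mostly just a local embedding index. A sketch assuming a local sentence-transformers model (so nothing leaves the machine) and pre-chunked snippets:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    class SnippetIndex:
        def __init__(self):
            # Small local model, so no code ever leaves the machine.
            self.model = SentenceTransformer("all-MiniLM-L6-v2")
            self.snippets, self.vectors = [], None

        def add(self, snippets):
            # Re-encodes everything on each call; fine for a prototype.
            self.snippets += snippets
            self.vectors = self.model.encode(self.snippets, normalize_embeddings=True)

        def query(self, text, k=5):
            # Return the k snippets most similar to the code around the cursor.
            q = self.model.encode([text], normalize_embeddings=True)[0]
            return [self.snippets[i] for i in np.argsort(-(self.vectors @ q))[:k]]

Encoding the query alone probably blows a 1ms budget on CPU, so the deadline mostly constrains how small the embedding model (or its cache) has to be; the lookup over precomputed vectors is fast.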


What are you using to decrypt the TLS layer in your snooping?


You can disable TLS cert verification when going through a proxy in VSCode. There is an option to disable “strict SSL”. Then I run a man-in-the-middle HTTP proxy and capture the traffic.

Also possible when using the Vim copilot plugin (or its emacs port) which is pretty informative - you can see a lot of the extra tricks the VSCode implementation uses.


Proxyman is a fantastic piece of software for this on Mac; there’s a free version which can snoop on up to 6 apps at a time (it installs a root certificate).


An answer from OpenAI will be very biased. OpenAI will happily take your money to query their models with long prompts. They will also happily take your money to compute embeddings and thus help you search (and lock you in!).

But, as far as I know, OpenAI will not help you fine-tune, they will not run a fine-tuned model for you, and they would probably prefer that you use their models over open models that can be fine tuned.

(None of this is to say that fine tuning is better. I’m just saying that OpenAI has a strong commercial bias.)


What about their fine tuning API? https://platform.openai.com/docs/guides/fine-tuning


Interesting. But not supported for their chat models.

I like how the fine-tuning page talks about how fine tuning is supposedly better, though.


I saw a video tutorial about it, and if I understood correctly, the API will automatically include whatever you "tuned" in the prompt, making it just as costly, I guess. But I could be wrong, as it was late and I was getting sleepy.


> the trick where you search for relevant content and paste that into a prompt

Supabase Clippy was the first docs site to ship this experience to production as far as I can tell: https://supabase.com/blog/chatgpt-supabase-docs

I believe they called it "context injection" and I have been following suit in my own writing on the topic.

I am prototyping experiences like Supabase Clippy and am also very interested in fine-tuning for docs Q&A. But my main question is: what exactly would the fine-tuning inputs and outputs look like for docs Q&A?

Edit: For Q&A the question is the input and the answer is the desired output? Is that right?

A more general comment about fine-tuning for docs from my blog:

> AI is all about prediction. Given this temperature, this wind, this day of the year, what is the chance of rain? Temperature, wind, and date are your inputs. Chance of rain is your desired output. Now, try to apply this same type of thinking towards documentation. What are your inputs? What’s your output? The page title and code block could be your inputs. Whether or not the code builds could be your output. Or maybe the code block should be the output? This is why I keep saying that applying fine-tuning to docs is tricky. What are the inputs and outputs?

https://technicalwriting.tools/posts/ten-principles-response...

(I am an AI n00b and have not looked deeply into how fine-tuning works but it's high on my list to experiment with OpenAI's fine-tuning API. Please LMK if I am getting any fundamentals wrong.)
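(To make the inputs/outputs question concrete: OpenAI's legacy, non-chat fine-tuning endpoint takes JSONL of prompt/completion pairs, so docs Q&A training data would look roughly like this. The separator and stop conventions follow their docs; the content is invented:)

    {"prompt": "How do I enable row-level security?\n\n###\n\n", "completion": " Run `alter table my_table enable row level security;` and add a policy. END"}
    {"prompt": "What port does the local API listen on?\n\n###\n\n", "completion": " Port 8000 by default; override it with the --port flag. END"}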


Which OpenAI fine-tuning API? I don't think fine-tuning is currently available for GPT-4

https://help.openai.com/en/articles/7127982-can-i-fine-tune-...


Langchain wrote about LLMs and SQL, and although it's about SQL, it's still a great read, especially the references. https://blog.langchain.dev/llms-and-sql/

I'm also super keen on the 32k context limit for the OpenAI API; that's going to be great.


> If you want to be able to do Q&A against an existing corpus of documentation, can fine-tuning an LLM on that documentation get good results, or is that a waste of time compared to the trick where you search for relevant content and paste that into a prompt along with your question?

If you want exact detailed recall, then using a framework that provides search and recall (by embeddings or otherwise) is probably always going to beat fine tuning, but also remember, it doesn’t have to be either-or.

I mean, if you want a person to handle Q&A on a corpus, is it better for them to have studied it, or to have direct access to the corpus with an appropriate index? The answer is clearly that it’s better if they’ve trained and have access to the corpus, and while LLMs aren’t the same as people, I think the answer for them is the same here.


As long as the search part of search + prompt is good, the prompt part will emit accurate results or will say it couldn't find the answer. You can also cite the sources this way.

It seems pretty expensive because you may be pasting a lot of context into each query. If you allow the user to have follow up queries and you want to retain context of the conversation that seems expensive too. But it does seem like it should give the best results for Q&A as long as question is directly answered in your data somewhere.


With search, you don’t know if the model/search engine will retrieve the right context. With fine tuning you don’t know if it will forget important info, or learn incorrect things.


With search, you can "tune" the query by using related references to the documents.

Ideally, we include documents that are relevant to the topics at hand and eliminate documents that aren't. Using prompt chaining, or an iterative process of labeling and extracting features, we may be able to increase the efficiency of the final prompt.

Obviously this increases costs, but so does repeated fine-tuning.


We tried this for a while and stopped because it seemed like a dead end.

Our use case was/is a text-to-sql bot that would provide domain-specific output. Our domain is very complex so any sort of off-the-shelf AI SQL helper is a joke to us at best. My thinking goes something like "If your schema is so simple that you don't need to think about FTing the model, why do you need its help writing SQL in the first place? Your answers are probably on google somewhere."

The perspective we have now is that the LLM is a probabilistic text transformation engine that must always be contextualized by some external means. Clearly, if we could just include all of our context & it fits & it's cheap, we should just do this. But, the reason anyone is thinking about FT in the first place is because you can't fit the whole damn business in the prompt (or because it's too expensive to do it at scale).

The approach we are looking into now is a classification front end that attempts to discover the relevant business context being referred to by the user's initial prompt. Once we detect our classes, we look up the boilerplate text and incorporate it into the final prompt.

So, instead of trying to align the Jupiter-scale model to your business needs, leave it alone and build a smaller adapter layer that can work with any unmodified LLM.
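A toy version of that adapter layer, assuming the classification step is just embedding similarity against short per-class descriptions (the class names, thresholds, and boilerplate below are made up):

    import numpy as np

    # Canned schema/context per business-domain class. Class embeddings are
    # computed once from a short description of each class, using whatever
    # embedding model you already use (unit-normalized vectors assumed).
    BOILERPLATE = {
        "billing":   "Tables: invoices(id, customer_id, amount, ...), payments(...)",
        "inventory": "Tables: stock(sku, warehouse_id, qty, ...), warehouses(...)",
    }

    def classify(query_vec, class_vecs, threshold=0.3):
        # Keep every class whose description embedding is close to the user's
        # query embedding; if nothing clears the threshold, the query is too vague.
        hits = [c for c, v in class_vecs.items() if float(np.dot(v, query_vec)) > threshold]
        return hits or None

    def build_prompt(question, classes):
        context = "\n".join(BOILERPLATE[c] for c in classes)
        return f"{context}\n\nWrite SQL for: {question}"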


I just finished augmenting some of the Spider and WikiSQL data on huggingface [1]. I initially intended to train a text-to-sql LLM that would take a natural language question and be provided the CREATE TABLE statements to provide some grounding for the response. So instead of hallucinating the column and table names by using the question alone, I was hoping the CREATE TABLE statement would limit the choices. We'll see if it's useful or not, but funny enough I came across this article after I finished the dataset. [2]

I'd definitely like to see how others are doing it.

[1] https://huggingface.co/datasets/b-mc2/sql-create-context

[2] https://blog.langchain.dev/llms-and-sql/
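For reference, the grounding itself is just prompt assembly; something like this sketch, with the wording improvised:

    def text_to_sql_prompt(question, create_statements):
        # Hand the model the real schema so it can't invent table or column names.
        schema = "\n\n".join(create_statements)
        return (
            "You are given the following database schema:\n\n"
            f"{schema}\n\n"
            "Write a single SQL query that answers the question, "
            "using only the tables and columns defined above.\n\n"
            f"Question: {question}\nSQL:"
        )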


I am currently taking a similar approach where we use an embedding and vector query to create the context relevant to the question the user posed and then throw it to the LLM. It's hit or miss, especially for lookups of relevant code literals.


Been working on the same thing. In my case, I generated synthetic data based on the API’s schema and trained a T5 on that. Basically treated it as a translation problem from human language to API.


Are you documenting or sharing your progress anywhere? This exactly describes something I've been asked to investigate.


I can't share anything much more specific than what I already have, unfortunately.

All I can really add - Binary classification is like a super power once you understand the statistics around it. If you can find clever ways to combine multiple binary classifiers, you can quickly narrow down relevant context. You can also use the statistics to do things like determine if a query is too vague to be serviced by an LLM backend. You can also answer questions like "Why didn't we consider a table or join when writing the user's query?".
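A toy version of the "combine binary classifiers" idea with scikit-learn, where each classifier answers one narrow question about the query. The classifier names, features, and thresholds are placeholders, and each model is assumed to be fit offline on labeled example queries:

    from sklearn.linear_model import LogisticRegression

    # One binary classifier per narrow question: "does this query involve orders?",
    # "does it need a date filter?", etc. Each is fit offline on labeled example
    # queries (features could be embeddings or bag-of-words vectors).
    classifiers = {
        "mentions_orders": LogisticRegression(),
        "mentions_customers": LogisticRegression(),
        "needs_date_filter": LogisticRegression(),
    }

    def relevant_context(query_features, threshold=0.8, vague_cutoff=0.55):
        probs = {name: clf.predict_proba([query_features])[0][1]
                 for name, clf in classifiers.items()}
        if max(probs.values()) < vague_cutoff:
            return None  # too vague to hand to the LLM backend at all
        return [name for name, p in probs.items() if p > threshold]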


I've paid attention to the evolution of LLMs for a while now. I wonder if the need to fine-tune LLMs will become a thing of the past for many (if not most) use-cases.

Take GPT-4, for example. GPT-4's zero-shot prompt performance is excellent. It capably handles tons of tasks—even in specialized domains such as medicine and law. (Note: I'm a doctor and lawyer both.) I can see future LLMs being even more capable.

In fact, I see LLMs as platforms rather than products. Take smartphones: you have just two dominant platforms, iOS and Android. An entire ecosystem of apps runs on them. A software-only example might be web browsers: Chrome (and chromium browsers) and Firefox. These are the platforms on which millions of web applications run. I see LLMs ending up this way: a few platform providers, and a much vaster ecosystem of "apps" built on them.

This also implies something else: the "fine-tuning data" moat you thought your organization might have had might not be a moat at all. "Foundation" LLMs will become so good that everyone else will be able to do what you're doing. You'll have to compete the old-fashioned way: by building a better product, and crushing sales and marketing.


If you have many prompt-completion pairs, then fine-tuning seems to be the better approach, because of the limited context size of prompting. I am experimenting with LLMs as well, and for small examples few-shot prompting works miraculously well for my purposes, even with gpt-3.5-turbo. But I am hitting the 4k context size quickly, and even 32k is not enough for a serious application. That's why I will also look at fine-tuning, mainly to escape the context size limitation.

Edit: This affects every use case where each prompt-completion pair provides new and essential information that the model needs to take into account.


Depends: can you give an example where 32k would be necessary? In my experience, GPT-4 often outputs what you expect, even without any prompt:completion examples. (Unlike GPT-3.5-turbo—which in my opinion is even less capable than text-davinci-003 / 002). If you have complex requirements, you could just chain together multiple calls to GPT-4?

Now, the cost of GPT-4 is an issue. I hope that open-source models such those released by Stability AI reach the same level of quality as (at least) GPT-3.5-turbo soon. That would be a game changer.


Here is a very real example where 32k tokens is not enough: predicting discharge disposition for a long stay inpatient. A 30-day hospital stay can produce over 1,000 clinical notes.


When you say predicting the discharge disposition, do you mean extracting from the document a statement of what the discharge disposition was? Or do you mean, based on extracted clinical characteristics, predicting what it should be?

For extracting what it was, if it's in the document, you can use the indexing approach described in the article. For predicting what it should be, you'd probably be better off using the LLM as an extraction mechanism and then using structured data models to classify discharge disposition.


You can do layered retrieval and summarization/embeddings to fit within the context bounds. Conceptually it's not far from what a clinician would do anyway. By switching from a naive one-shot prompt with a big context to an interactive approach (multiple calls) with summaries, there's no need to retrain.
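The layered version is basically map-reduce over the notes. A rough sketch, assuming some summarize() callable that wraps whatever LLM you use:

    def summarize_notes(notes, summarize, batch=20):
        # Map step: condense each batch of clinical notes into a short summary.
        partials = [summarize("Summarize the key clinical facts:\n" +
                              "\n---\n".join(notes[i:i + batch]))
                    for i in range(0, len(notes), batch)]
        # Reduce step: if the partial summaries still overflow the context
        # window, summarize the summaries again until a single one remains.
        while len(partials) > 1:
            partials = [summarize("Combine these summaries:\n" +
                                  "\n---\n".join(partials[i:i + batch]))
                        for i in range(0, len(partials), batch)]
        return partials[0]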

But yeah, my money is on context windows getting bigger and frameworks smoothing out how to do the above automatically, so it feels like a point-in-time optimization right now that will be tech debt by the end of the year.


The latter, not the former. The former is extraction, not prediction.

A lot of factors go into predicting the post-acute care route that results in the best outcome. You are correct that it would be possible to create features using an LLM, but that is a very difficult problem compared to simply treating it as a sequence classification problem.


Agreed that there are better approaches than LLMs to solve this problem. There are deep learning and sequence models that are focused on this task using extracted structured info. Wouldn't you be better off using an LLM (or any extraction + linking model) to identify medical concepts, and then a model that only understands medical concepts to predict the next medical concepts?


Is this a joke? Medical data should stay the hell away from gpt servers



From that source we find descriptions of two features:

"enhancements to automatically draft message responses"

"Another solution will bring natural language queries to SlicerDicer"

Neither of which needs 32k tokens, nor access to the patient records.

Also, footnote 1:

"users of Azure and Azure OpenAI Service are responsible for ensuring the regulatory compliance of their use"

There is zero claim of actual compliance of these services for handling sensitive or regulated data.

Now, I understand the enthusiasm, but at least don't waste people's time with sources that are, at best, tangential and don't provide any substance to the discussion.


Did you read the thing or what?

"The second use will bring natural language queries and "data analysis" to SlicerDicer, which is Epic's data-exploration tool that allows searches across large numbers of patients to identify trends that could be useful for making new discoveries or for financial reasons. According to Microsoft, that will help "clinical leaders explore data in a conversational and intuitive way.""

Do you not understand what natural language queries on external data, accessed by and enabled by GPT-4, mean and entail?

The only thing worse than being condescending is being condescending and wrong.


Yes, it seems you haven't understood what's written.

That means the model will understand the user's natural-language request and write queries for the SlicerDicer data backend, not that the data is sent to the LLM for analysis.

It writes queries; it doesn't receive data.


OpenAI will sign a business agreement and run an instance with isolated servers, for HIPAA compliance. That being said, you can run LLMs on-prem.


Citation needed, specifically regarding GPT-4 and not the other models, since that was your claim.

> That being said, you can run LLMs on-prem.

How is this relevant at all? It's just dodging the question, since the claim was about GPT-4 with 32k tokens and not the crop of generic models that we now have available on-prem (and which don't support 32k tokens anyway).


For my use case, GPT-3.5 quality seems enough, as it already delivers nearly 100% optimal output, and GPT-4 wouldn't be much of an improvement. On the other hand, an upper bound of 8k or even 32k context is just not acceptable for my use case.

I tried out da-vinci, which also works fine, but models below da-vinci don't work well for my use case.


Would you be willing to say a few words about the nature of your use cases?


>GPT-3.5-turbo—which in my opinion is even less capable than text-davinci-003 / 002

Not just opinion: GPT-3.5 Turbo is demonstrably worse at just about everything except providing so-called "safe" and "aligned" replies to questions.


Suggest also looking at langchain type approaches.


Langchain seems to be just a framework. At this point, I rather use APIs directly.


Spoiler alert: this covers fine tuning encoder-only models (BERT) and not GPT models - which will be done in a later post. The former has been covered by many examples and articles over the past several years, and I was hoping for some insights on LoRA and fine tuning of GPT. Looking forward to reading the next article!


In the article it says that to translate German to English we should use an "in-context learning prompt" like this:

    Translate the following German sentences into English:

    Example 1:
    German: "Ich liebe Eis."
    English: "I love ice cream."

    Example 2:
    German: "Draußen ist es stürmisch und regnerisch"
    English: "It's stormy and rainy outside"

    Translate this sentence:
    German: "Wo ist die naechste Supermarkt?"

Why is this needed when saying "Translate this to German: Hello how are you" just works out of the box?


You’re correct that GPT-3 and later can accomplish such tasks with what is known as a zero-shot approach.

In-context learning with few-shot exemplars is needed for things like encouraging responses that adhere to a certain JSON format.
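For example, a few-shot prompt like the following (entirely made up) nudges the model to keep emitting the same JSON shape for the next item:

    Extract the product and sentiment as JSON.

    Review: "The keyboard feels cheap but it was delivered fast."
    {"product": "keyboard", "sentiment": "mixed"}

    Review: "Absolutely love this monitor, crisp display."
    {"product": "monitor", "sentiment": "positive"}

    Review: "The headphones stopped working after a week."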


Few-shot learning typically works slightly better than zero-shot learning. Look at tables 3 & 4 in this paper: https://arxiv.org/pdf/2302.09210.pdf


Nice overview, the parameter efficient fine tuning space is heating up. It’s been interesting to watch fine tuning go from a binary choice to a spectrum. With the context window also continuing to grow LLMs are entering a whole new era.

I imagine down the road we may learn what to tune into models dynamically and how deep it needs to go. Or maybe the context window becomes a sort of rapidly trained thin NN layer on top like we’ve seen in diffusion models


If you're using C#, I made a library that simplifies fine-tuning when using OpenAI.

https://www.nuget.org/packages/OpenAILib

Instead of having to deal with all of the various endpoints and create files in a certain format (and then use the same conventions when using the model), you can simply create a fine-tune and use it in completion requests.


Can someone chime in on how much VRAM is needed for fine-tuning vs inference? I have a GTX 1080 on which flan-t5-large or deBERTa can fit, but how much memory does it need for training? I suppose you need to keep the gradients somewhere. Also, how many examples are enough for simple classification? Does a multilingual model transfer its knowledge if my fine-tuning data has only some of the languages?


One of the topics he briefly touches on is indexing for vector similarity based retrieval (essentially search).

Has anyone on HN tried replacing systems like Elasticsearch with an LLM based index? Curious.

One of the systems at my startup is an elasticsearch based search of a large corpus of structured data that contains larger text fields.


Yes, there are lots of folks doing this, and currently it looks like combining results from an LLM-based index and a standard text retrieval index (e.g. using BM25) may beat either alone. Note that you can add and search LLM-derived vectors from within Elasticsearch: look up 'dense vector search' in the docs. Check out the Haystack conference next week for lots of discussion of current practices: https://haystackconf.com/2023/
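A minimal hybrid-retrieval sketch with the Elasticsearch Python client (8.x-style keyword arguments), fusing a BM25 match query and a script_score cosine-similarity query over a dense_vector field with reciprocal rank fusion; the index and field names here are made up:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def hybrid_search(query_text, query_vector, k=10):
        # BM25 over the text field.
        bm25 = es.search(index="docs", size=k,
                         query={"match": {"body": query_text}})
        # Cosine similarity over a dense_vector field holding LLM embeddings.
        dense = es.search(index="docs", size=k, query={
            "script_score": {
                "query": {"match_all": {}},
                "script": {"source": "cosineSimilarity(params.qv, 'body_vector') + 1.0",
                           "params": {"qv": query_vector}}}})
        # Reciprocal rank fusion of the two result lists.
        scores = {}
        for hits in (bm25["hits"]["hits"], dense["hits"]["hits"]):
            for rank, h in enumerate(hits):
                scores[h["_id"]] = scores.get(h["_id"], 0) + 1.0 / (60 + rank)
        return sorted(scores, key=scores.get, reverse=True)[:k]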


Thanks!

I've looked at haystack a little bit and was wondering how involved it would be to set up for a research spike.


The LLM embedding approach has given me much better search results than Elasticsearch.

Cost is definitely an issue.


Has anyone solved a real business problem with LLMs that they couldn't solve as efficiently before?


Interesting, but doesn't go into specific applications. Has anyone heard about someone finetuning ChatGPT/GPT4 on sentiment analysis, for example? I feel some benchmarks would be useful when determining whether this is a worthwhile investment.



