I've paid attention to the evolution of LLMs for a while now. I wonder if the need to fine-tune LLMs will become a thing of the past for many (if not most) use-cases.
Take GPT-4, for example. GPT-4's zero-shot prompt performance is excellent. It capably handles tons of tasks—even in specialized domains such as medicine and law. (Note: I'm a doctor and lawyer both.) I can see future LLMs being even more capable.
In fact, I see LLMs as platforms rather than products. Take smartphones: you have just two dominant platforms, iOS and Android. An entire ecosystem of apps runs on them. A software-only example might be web browsers: Chrome (and other Chromium-based browsers) and Firefox. These are the platforms on which millions of web applications run. I see LLMs ending up this way: a few platform providers, and a much vaster ecosystem of "apps" built on them.
This also implies something else: the "fine-tuning data" moat you thought your organization might have had might not be a moat at all. "Foundation" LLMs will become so good that everyone else will be able to do what you're doing. You'll have to compete the old-fashioned way: by building a better product, and crushing sales and marketing.
If you have many prompt-completion pairs, then fine-tuning seems to be the better approach, because of the limited context size of prompting. I am experimenting with LLMs as well, and for small examples few-shot prompting works miraculously well for my purposes, even with gpt-3.5-turbo. But I am hitting the 4k context size quickly, and even 32k is not enough for a serious application. That's why I will also look at fine-tuning, mainly to escape the context size limitation.
Edit: This affects every use case where each prompt-completion pair provides new and essential information that the model needs to take into account.
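Concretely, the few-shot setup I mean looks roughly like this with the OpenAI chat API (the example pairs and the token check are stand-ins, not my real data); every pair appended to the messages eats into the 4k window:

    import openai
    import tiktoken  # pip install tiktoken

    openai.api_key = "sk-..."  # placeholder

    # hypothetical prompt-completion pairs; my real use case has far more
    examples = [
        {"prompt": "Input A", "completion": "Output A"},
        {"prompt": "Input B", "completion": "Output B"},
    ]

    messages = [{"role": "system", "content": "Answer in the style of the examples."}]
    for ex in examples:
        messages.append({"role": "user", "content": ex["prompt"]})
        messages.append({"role": "assistant", "content": ex["completion"]})
    messages.append({"role": "user", "content": "Input C"})

    # rough count of how much of the 4k context the examples already consume
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    used = sum(len(enc.encode(m["content"])) for m in messages)
    print(f"~{used} of 4096 tokens used before the model writes anything")

    reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    print(reply["choices"][0]["message"]["content"])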
Depends: can you give an example where 32k would be necessary? In my experience, GPT-4 often outputs what you expect even without any prompt-completion examples (unlike GPT-3.5-turbo, which in my opinion is even less capable than text-davinci-003/002). If you have complex requirements, you could just chain together multiple calls to GPT-4?
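To make "chain together multiple calls" concrete, here's a minimal sketch (the two-step split and placeholders are made up for illustration): distill first, then reason over the distilled output, so no single call needs a huge context:

    import openai

    def ask(prompt):
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp["choices"][0]["message"]["content"]

    document = "...long source text..."              # placeholder
    question = "What changed between v1 and v2?"     # placeholder

    # call 1: compress the raw material
    facts = ask(f"List only the key facts from this document:\n\n{document}")
    # call 2: reason over the compressed form, not the raw document
    print(ask(f"Given these facts:\n{facts}\n\nAnswer concisely: {question}"))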
Now, the cost of GPT-4 is an issue. I hope that open-source models such as those released by Stability AI reach the same level of quality as (at least) GPT-3.5-turbo soon. That would be a game changer.
Here is a very real example where 32k tokens is not enough: predicting discharge disposition for a long-stay inpatient. A 30-day hospital stay can produce over 1,000 clinical notes; at even a couple hundred tokens per note, that's several times larger than a 32k window.
When you say predicting discharge disposition, do you mean extracting from the document a statement of what the discharge disposition was? Or do you mean predicting what it should be, based on extracted clinical characteristics?
For extracting what it was, if it's in the document, you can use the indexing approach described in the article. For predicting what it should be, you'd probably be better off using the LLM as an extraction mechanism and then using structured data models to classify discharge disposition.
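Rough sketch of that pipeline (the field names, model choice, and toy data are all hypothetical, just to show the shape of it):

    import json
    import numpy as np
    import openai
    from sklearn.linear_model import LogisticRegression

    FIELDS = ["age", "lives_alone", "ambulatory", "icu_days"]  # hypothetical features

    def extract(note_text):
        """LLM turns free text into structured fields; it does no prediction.
        Assumes the model returns clean JSON (a real system would validate)."""
        prompt = ("Return a JSON object with keys " + ", ".join(FIELDS) +
                  " (null if unknown) for this clinical note:\n\n" + note_text)
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return json.loads(resp["choices"][0]["message"]["content"])

    # the classifier only ever sees the structured features, one row per stay
    X_train = np.array([[82, 1, 0, 6], [54, 0, 1, 0]])   # toy data
    y_train = np.array(["SNF", "home"])                  # toy labels
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)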
You can do layered retrieval and summarization/embeddings to fit within the context bounds. Conceptually it's not far from what a clinician would do anyway. By switching from a naive one-shot prompt with a big context to an interactive approach (multiple calls) with summaries, there's no need to retrain.
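A minimal sketch of the layered version (batch size and prompts are arbitrary choices; a real system would also budget tokens per batch):

    import openai

    def summarize(text, focus):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content":
                f"Summarize, keeping only details relevant to {focus}:\n\n{text}"}],
            temperature=0,
        )
        return resp["choices"][0]["message"]["content"]

    def layered_summary(chunks, focus, batch=20):
        """Map: summarize each batch of chunks. Reduce: recurse on the
        partial summaries until a single summary fits in one prompt."""
        partials = [summarize("\n\n".join(chunks[i:i + batch]), focus)
                    for i in range(0, len(chunks), batch)]
        return partials[0] if len(partials) == 1 else layered_summary(partials, focus, batch)

    # notes = load_notes(stay_id)  # e.g. the 1,000 notes from a 30-day stay
    # print(layered_summary(notes, focus="discharge planning"))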
But yeah, my money is on context windows getting bigger and frameworks smoothing out how to do the above automatically. So it feels like a point-in-time optimization right now that will be tech debt by the end of the year.
The latter, not the former. The former is extraction, not prediction.
A lot of factors go into predicting the post-acute care route that results in the best outcome. You are correct that it would be possible to create features using an LLM, but that is a very difficult problem compared to simply treating the problem as sequence classification.
Agreed that there are better approaches than an LLM to solve this problem. There are deep learning and sequence models focused on this task that use extracted structured info. Wouldn't you be better off using an LLM (or any extraction + linking model) to identify medical concepts, and then a model that only understands medical concepts to predict the next medical concepts?
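Roughly like this, sketched in PyTorch (vocabulary size, GRU, and dimensions are all placeholder choices): extraction/linking maps notes to concept IDs upstream, and the sequence model never touches raw text:

    import torch
    import torch.nn as nn

    class ConceptSequenceClassifier(nn.Module):
        """Predicts a class (e.g. discharge disposition) from a sequence of
        medical concept IDs (SNOMED/ICD codes mapped to integers)."""
        def __init__(self, n_concepts, n_classes, dim=64):
            super().__init__()
            self.embed = nn.Embedding(n_concepts, dim, padding_idx=0)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, n_classes)

        def forward(self, concept_ids):           # (batch, seq_len) int64
            x = self.embed(concept_ids)           # (batch, seq_len, dim)
            _, h = self.rnn(x)                    # h: (1, batch, dim)
            return self.head(h[-1])               # (batch, n_classes) logits

    # toy forward pass: 2 stays, 7 concepts each, 4 disposition classes
    model = ConceptSequenceClassifier(n_concepts=10_000, n_classes=4)
    logits = model(torch.randint(1, 10_000, (2, 7)))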
from that source we find descriptions of two features:
"enhancements to automatically draft message responses"
"Another solution will bring natural language queries to SlicerDicer"
neither of which needs 32k tokens, nor to see the patient records
also, footnote 1:
"users of Azure and Azure OpenAI Service are responsible for ensuring the regulatory compliance of their use"
there is zero claim of actual compliance of these services for handling sensitive or regulated data.
now, I understand the enthusiasm, but at least don't waste people's time with sources that are, at best, tangential and don't provide any substance to the discussion
"The second use will bring natural language queries and "data analysis" to SlicerDicer, which is Epic's data-exploration tool that allows searches across large numbers of patients to identify trends that could be useful for making new discoveries or for financial reasons. According to Microsoft, that will help "clinical leaders explore data in a conversational and intuitive way.""
Do you not understand what natural language queries on external data, enabled by GPT-4, actually entail?
The only thing worse than being condescending is being condescending and wrong.
yes, it seems you haven't understood what's written.
that means the LLM will understand the user's request and write queries for the SlicerDicer data backend, not that patient data is sent to the LLM for analysis.
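in other words, something like this (SQL as a stand-in, since SlicerDicer's actual query backend isn't public): the model sees only a schema and the question, and only the generated query crosses the boundary:

    import openai

    SCHEMA = "patients(id, age, dx_code, admit_date, disposition)"  # hypothetical

    def to_query(question):
        """The LLM never sees patient rows, only the schema and the question;
        the query it writes is executed inside the hospital's own backend."""
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content":
                f"Schema: {SCHEMA}\n"
                f"Write one SQL query that answers: {question}\n"
                "Return only the SQL."}],
            temperature=0,
        )
        return resp["choices"][0]["message"]["content"]

    sql = to_query("How many patients over 65 went to skilled nursing?")
    # rows = backend.execute(sql)  # runs on-prem; results never reach the LLM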
citation needed. specifically regarding GPT-4 and not the other models, since that was your claim.
> That being said, you can run LLMs on-prem.
how is this relevant at all? it's just dodging the question, since the claim was about GPT-4 with 32k tokens and not the crop of generic models we now have available on-prem (which don't support 32k tokens anyway)
For my use case, GPT-3.5 quality seems enough, as it already delivers nearly 100% optimal output, and GPT-4 wouldn't be much of an improvement. On the other hand, an upper bound of 8k or even 32k context is just not acceptable for my use case.
I tried out davinci, which also works fine, but models below davinci don't work well for my use case.