I've paid attention to the evolution of LLMs for a while now. I wonder if the need to fine-tune LLMs will become a thing of the past for many (if not most) use-cases.
Take GPT-4, for example. GPT-4's zero-shot prompt performance is excellent. It capably handles tons of tasks—even in specialized domains such as medicine and law. (Note: I'm a doctor and lawyer both.) I can see future LLMs being even more capable.
In fact, I see LLMs as platforms rather than products. Take smartphones: you have just two dominant platforms, iOS and Android. An entire ecosystem of apps runs on them. A software-only example might be web browsers: Chrome (and other Chromium-based browsers) and Firefox. These are the platforms on which millions of web applications run. I see LLMs ending up this way: a few platform providers, and a much vaster ecosystem of "apps" built on them.
This also implies something else: the "fine-tuning data" moat you thought your organization might have had might not be a moat at all. "Foundation" LLMs will become so good that everyone else will be able to do what you're doing. You'll have to compete the old-fashioned way: by building a better product, and crushing sales and marketing.
If you have many prompt-completion pairs, then fine-tuning seems to be the better approach, because of the limited context size of prompting. I am experimenting with LLMs as well, and for small examples few-shot prompting works miraculously well for my purposes, even with gpt-3.5-turbo. But I am hitting the 4k context size quickly, and even 32k is not enough for a serious application. That's why I will also look at fine-tuning, mainly to escape the context size limitation.
Edit: This affects every use case where each prompt-completion pair provides new and essential information that the model needs to take into account.
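Concretely, the few-shot setup I mean looks roughly like this with the OpenAI chat API (the example pairs and the token check are stand-ins, not my real data); every pair appended to the messages eats into the 4k window:

    import openai
    import tiktoken  # pip install tiktoken

    openai.api_key = "sk-..."  # placeholder

    # hypothetical prompt-completion pairs; my real use case has far more
    examples = [
        {"prompt": "Input A", "completion": "Output A"},
        {"prompt": "Input B", "completion": "Output B"},
    ]

    messages = [{"role": "system", "content": "Answer in the style of the examples."}]
    for ex in examples:
        messages.append({"role": "user", "content": ex["prompt"]})
        messages.append({"role": "assistant", "content": ex["completion"]})
    messages.append({"role": "user", "content": "Input C"})

    # rough count of how much of the 4k context the examples already consume
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    used = sum(len(enc.encode(m["content"])) for m in messages)
    print(f"~{used} of 4096 tokens used before the model writes anything")

    reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    print(reply["choices"][0]["message"]["content"])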
Depends: can you give an example where 32k would be necessary? In my experience, GPT-4 often outputs what you expect even without any prompt-completion examples (unlike GPT-3.5-turbo, which in my opinion is even less capable than text-davinci-003/002). If you have complex requirements, you could just chain together multiple calls to GPT-4?
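To make "chain together multiple calls" concrete, here's a minimal sketch (the two-step split and placeholders are made up for illustration): distill first, then reason over the distilled output, so no single call needs a huge context:

    import openai

    def ask(prompt):
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp["choices"][0]["message"]["content"]

    document = "...long source text..."              # placeholder
    question = "What changed between v1 and v2?"     # placeholder

    # call 1: compress the raw material
    facts = ask(f"List only the key facts from this document:\n\n{document}")
    # call 2: reason over the compressed form, not the raw document
    print(ask(f"Given these facts:\n{facts}\n\nAnswer concisely: {question}"))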
Now, the cost of GPT-4 is an issue. I hope that open-source models such as those released by Stability AI reach the same level of quality as (at least) GPT-3.5-turbo soon. That would be a game changer.
Here is a very real example where 32k tokens is not enough: predicting discharge disposition for a long-stay inpatient. A 30-day hospital stay can produce over 1,000 clinical notes; at even a couple hundred tokens per note, that's several times larger than a 32k window.
When you say predicting discharge disposition, do you mean extracting from the document a statement of what the discharge disposition was? Or do you mean predicting what it should be, based on extracted clinical characteristics?
For extracting what it was, if it's in the document, you can use the indexing approach described in the article. For predicting what it should be, you'd probably be better off using the LLM as an extraction mechanism and then using structured data models to classify discharge disposition.
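Rough sketch of that pipeline (the field names, model choice, and toy data are all hypothetical, just to show the shape of it):

    import json
    import numpy as np
    import openai
    from sklearn.linear_model import LogisticRegression

    FIELDS = ["age", "lives_alone", "ambulatory", "icu_days"]  # hypothetical features

    def extract(note_text):
        """LLM turns free text into structured fields; it does no prediction.
        Assumes the model returns clean JSON (a real system would validate)."""
        prompt = ("Return a JSON object with keys " + ", ".join(FIELDS) +
                  " (null if unknown) for this clinical note:\n\n" + note_text)
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return json.loads(resp["choices"][0]["message"]["content"])

    # the classifier only ever sees the structured features, one row per stay
    X_train = np.array([[82, 1, 0, 6], [54, 0, 1, 0]])   # toy data
    y_train = np.array(["SNF", "home"])                  # toy labels
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)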
You can do layered retrieval and summarization/embeddings to fit within the context bounds. Conceptually it's not far from what a clinician would do anyway. By switching from a naive one-shot prompt with a big context to an interactive approach (multiple calls) with summaries, there's no need to retrain.
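A minimal sketch of the layered version (batch size and prompts are arbitrary choices; a real system would also budget tokens per batch):

    import openai

    def summarize(text, focus):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content":
                f"Summarize, keeping only details relevant to {focus}:\n\n{text}"}],
            temperature=0,
        )
        return resp["choices"][0]["message"]["content"]

    def layered_summary(chunks, focus, batch=20):
        """Map: summarize each batch of chunks. Reduce: recurse on the
        partial summaries until a single summary fits in one prompt."""
        partials = [summarize("\n\n".join(chunks[i:i + batch]), focus)
                    for i in range(0, len(chunks), batch)]
        return partials[0] if len(partials) == 1 else layered_summary(partials, focus, batch)

    # notes = load_notes(stay_id)  # e.g. the 1,000 notes from a 30-day stay
    # print(layered_summary(notes, focus="discharge planning"))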
But yeah, my money is on context windows getting bigger and frameworks smoothing out how to do the above automatically. So it feels like a point-in-time optimization right now that will be tech debt by the end of the year.
The latter, not the former. The former is extraction, not prediction.
A lot of factors go into predicting the post-acute care route that results in the best outcome. You are correct that it would be possible to create features using an LLM, but that is a very difficult problem compared to simply treating the problem as sequence classification.
Agreed that there are better approaches than an LLM to solve this problem. There are deep learning and sequence models focused on this task that use extracted structured info. Wouldn't you be better off using an LLM (or any extraction + linking model) to identify medical concepts, and then a model that only understands medical concepts to predict the next medical concepts?
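Roughly like this, sketched in PyTorch (vocabulary size, GRU, and dimensions are all placeholder choices): extraction/linking maps notes to concept IDs upstream, and the sequence model never touches raw text:

    import torch
    import torch.nn as nn

    class ConceptSequenceClassifier(nn.Module):
        """Predicts a class (e.g. discharge disposition) from a sequence of
        medical concept IDs (SNOMED/ICD codes mapped to integers)."""
        def __init__(self, n_concepts, n_classes, dim=64):
            super().__init__()
            self.embed = nn.Embedding(n_concepts, dim, padding_idx=0)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, n_classes)

        def forward(self, concept_ids):           # (batch, seq_len) int64
            x = self.embed(concept_ids)           # (batch, seq_len, dim)
            _, h = self.rnn(x)                    # h: (1, batch, dim)
            return self.head(h[-1])               # (batch, n_classes) logits

    # toy forward pass: 2 stays, 7 concepts each, 4 disposition classes
    model = ConceptSequenceClassifier(n_concepts=10_000, n_classes=4)
    logits = model(torch.randint(1, 10_000, (2, 7)))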
from that source we find descriptions of two features:
"enhancements to automatically draft message responses"
"Another solution will bring natural language queries to SlicerDicer"
neither of which needs 32k tokens, nor to see the patient records
also, footnote 1:
"users of Azure and Azure OpenAI Service are responsible for ensuring the regulatory compliance of their use"
there is zero claim of actual compliance of these services for handling sensitive or regulated data.
now, I understand the enthusiasm, but at least don't waste people's time with sources that are, at best, tangential and don't provide any substance to the discussion
"The second use will bring natural language queries and "data analysis" to SlicerDicer, which is Epic's data-exploration tool that allows searches across large numbers of patients to identify trends that could be useful for making new discoveries or for financial reasons. According to Microsoft, that will help "clinical leaders explore data in a conversational and intuitive way.""
Do you not understand what natural language queries on external data, enabled by GPT-4, actually entail?
The only thing worse than being condescending is being condescending and wrong.
yes, it seems you haven't understood what's written.
that means the LLM will understand the user's request and write queries for the SlicerDicer data backend, not that patient data is sent to the LLM for analysis.
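in other words, something like this (SQL as a stand-in, since SlicerDicer's actual query backend isn't public): the model sees only a schema and the question, and only the generated query crosses the boundary:

    import openai

    SCHEMA = "patients(id, age, dx_code, admit_date, disposition)"  # hypothetical

    def to_query(question):
        """The LLM never sees patient rows, only the schema and the question;
        the query it writes is executed inside the hospital's own backend."""
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content":
                f"Schema: {SCHEMA}\n"
                f"Write one SQL query that answers: {question}\n"
                "Return only the SQL."}],
            temperature=0,
        )
        return resp["choices"][0]["message"]["content"]

    sql = to_query("How many patients over 65 went to skilled nursing?")
    # rows = backend.execute(sql)  # runs on-prem; results never reach the LLM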
citation needed. specifically regarding GPT-4 and not the other models, since that was your claim.
> That being said, you can run LLMs on-prem.
how is this relevant at all? it's just dodging the question, since the claim was about GPT-4 with 32k tokens and not the crop of generic models we now have available on-prem (which don't support 32k tokens anyway)
For my use case, GPT-3.5 quality seems enough, as it already delivers nearly 100% optimal output, and GPT-4 wouldn't be much of an improvement. On the other hand, an upper bound of 8k or even 32k context is just not acceptable for my use case.
I tried out davinci, which also works fine, but models below davinci don't work well for my use case.