If you have control of the tokenizer you could make sure it doesn't produce these tokens on user input. I.e. instead of the special "<eos>" token, produce something like "<", "eos", ">" - whatever the 'natural' encoding of that string is.

For example, the llama3 tokenizer has options that control how special tokens are handled:

Tokenization method with args to control special token handling: https://github.com/meta-llama/llama3/blob/bf8d18cd087a4a0b3f...

And you can see how it is combined with special tokens and user input here: https://github.com/meta-llama/llama3/blob/bf8d18cd087a4a0b3f...
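To make the idea concrete, here is a minimal sketch with a toy character-level tokenizer. It is not the llama3 code. The names `SPECIAL_TOKENS`, `encode`, `allow_special` and `build_prompt` are made up for illustration. The point is only that user text gets encoded with special-token matching turned off, and the real special token is appended by the trusted template.

```python
import re

# Hypothetical toy vocabulary: one reserved special token, everything else is per-character.
SPECIAL_TOKENS = {"<eos>": 0}


def encode(text, allow_special=False):
    """Encode text to token ids (toy character-level scheme).

    With allow_special=False, a literal "<eos>" in the text is encoded
    as ordinary characters rather than as the reserved id.
    """
    def encode_plain(s):
        # Offset ordinary ids so they can never collide with special ids.
        return [len(SPECIAL_TOKENS) + ord(ch) for ch in s]

    if not allow_special:
        return encode_plain(text)

    # Split on special-token strings, keeping them as separate pieces.
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])
        elif piece:
            ids.extend(encode_plain(piece))
    return ids


def build_prompt(user_text):
    # Untrusted text: specials disabled. The trusted template adds the real <eos>.
    return encode(user_text, allow_special=False) + [SPECIAL_TOKENS["<eos>"]]
```

With this, `build_prompt("hi<eos>")` contains the reserved id exactly once, at the end, no matter what the user typed.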

If you don't have control of the tokenizer, I guess the user input needs to be sanitized before it reaches the tokenizer, like you say.
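A rough sketch of what that sanitizing could look like, assuming you know the exact special-token strings your model uses. The list `SPECIAL_STRINGS` and the function `sanitize` are hypothetical. This version simply strips the strings, and repeats until nothing changes so that a removal can't splice a new special string together.

```python
# Hypothetical list; replace with the special-token strings of your actual model.
SPECIAL_STRINGS = ["<eos>", "<bos>"]


def sanitize(user_text):
    """Remove special-token strings from untrusted text.

    Repeats until a full pass changes nothing, since removing one
    occurrence can join the surrounding characters into another
    (e.g. "<e<eos>os>" becomes "<eos>" after one pass).
    """
    while True:
        cleaned = user_text
        for s in SPECIAL_STRINGS:
            cleaned = cleaned.replace(s, "")
        if cleaned == user_text:
            return cleaned
        user_text = cleaned
```

Stripping changes what the user wrote, so escaping might be preferable in some apps. Either way, this is more fragile than handling it in the tokenizer, because you have to keep the list in sync with the model.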



