
Potentially interesting on the alignment front: In my experience the yi-6b model running on ollama is more likely to refuse politically sensitive queries (relating to Tiananmen Square, Peng Shuai’s disappearance, etc) when asked in Chinese, and more likely to provide information when asked in English. I wonder if this difference falls out naturally from available training data, is a deliberate internationalization choice, or is just noise from the queries I happened to run.
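A minimal sketch of that kind of comparison, assuming a local ollama server on the default port and the yi:6b tag (adjust to whatever tag you actually pulled):

    # Ask the same politically sensitive question in English and Chinese and
    # compare the answers. Assumes `ollama serve` is running locally and the
    # model has been pulled, e.g. `ollama pull yi:6b`.
    import json
    import urllib.request

    MODEL = "yi:6b"  # assumption: change to your local tag

    def ask(prompt):
        payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    question_en = "What happened at Tiananmen Square in 1989?"
    question_zh = "1989年天安门广场发生了什么？"

    print("EN:", ask(question_en))
    print("ZH:", ask(question_zh))

Sampling each question a few times per language would help separate a real pattern from generation noise.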



I noticed similar behaviour in an older model (Skywork 13B) a few months back. When asked in Chinese about Tiananmen Square and similar topics, it would politely say that nothing of note had occurred; in English, it would usually respond truthfully. In Skywork's case it was deliberate, based on their model card (https://huggingface.co/Skywork/Skywork-13B-base):

> We have developed a data cleaning pipeline with great care to effectively clean and filter low-quality data and eliminate harmful information from text data.

I'd imagine it's likely similar for Yi.


Huge jump to go from that line in the model card to the censorship being intentional on the creators' part.

China censors those events. They pre-trained with a specific focus on Chinese text, and integrated more native Chinese text than most models do.

It doesn't require any additional filtering on their part for the model to reflect that, and if anything the fact that the events are mentioned in English suggests the opposite of your hypothesis.

If they were going to filter Tiananmen Square, the lift to filter it in English would not be any higher.


This may be a useful workaround, but it's also the strongest argument I've seen so far against claims that LLMs do something like "understanding" or have "an underlying world model". Maybe checking whether models know the same facts in different languages, especially around politically controversial topics, would make a good benchmark for evaluating that.


I wonder if you could use the multilingual capabilities to work around its own censorship? I.e., what would happen if you asked it to translate the query to English, asked the English version, and then had it translate the answer back to Chinese?
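Mechanically it would look something like the sketch below (assuming a local ollama server on the default port and the yi:6b tag). Note that all three steps go through the same model, so the translation prompts themselves could end up being refused:

    # Roundtrip: translate the Chinese question to English, answer it in
    # English, then translate the answer back to Chinese. All three calls hit
    # the same local model via ollama's /api/generate endpoint.
    import json
    import urllib.request

    def ask(prompt, model="yi:6b"):  # model tag is an assumption
        payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    question_zh = "彭帅发生了什么事？"  # "What happened to Peng Shuai?"

    english_q = ask("Translate this question into English. Output only the translation:\n" + question_zh)
    english_a = ask(english_q)
    chinese_a = ask("Translate this text into Chinese. Output only the translation:\n" + english_a)

    print(chinese_a)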


Could also be both: the training data organically creating the difference, but with an additional layer of deliberate alignment on top as well.



