If you're interested in this, don't miss AI Sweden's GPT-SW3 (126M to 40B parameters), trained on Nordic languages (not Finnish) and English. It's funded by the Swedish government and partners, and freely available, with a pretty lively Discord for ongoing AI research focusing on the Nordic languages. I think Viking is called "first" because it includes Finnish; otherwise, GPT-SW3 was released earlier.
First thing I notice is that Finnish is part of a completely different language family from the other Nordic languages and English (Uralic vs. Indo-European). I wonder to what extent this affects the effectiveness of their low-resource training. Finnish is highly agglutinative, stacking suffixes onto a root to modify its meaning. My (amateur) take is that the tokenization and attention patterns may differ a lot? Would love to see people more educated than I am discuss this.
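To make the tokenization point concrete, here's a minimal sketch (assuming the Hugging Face transformers library; the checkpoint names are just examples of publicly available models, not anything the Viking team confirmed) comparing how an English-centric BPE vocabulary handles one agglutinated Finnish word:

```python
# Sketch: compare how an English-centric BPE tokenizer and a
# Finnish-aware tokenizer split a single agglutinated Finnish word.
from transformers import AutoTokenizer

# taloissammekin = talo (house) + i (plural) + ssa (in) + mme (our) + kin (too)
word = "taloissammekin"

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # English-centric BPE
viking_tok = AutoTokenizer.from_pretrained("LumiOpen/Viking-7B")  # trained with Finnish

print(gpt2_tok.tokenize(word))    # likely many short, opaque fragments
print(viking_tok.tokenize(word))  # expected: fewer pieces, closer to morpheme boundaries
```

If the Finnish-aware vocabulary really does produce fewer, more morpheme-like tokens, the model spends less of its context window and attention budget reassembling words, which is presumably part of the motivation for training a dedicated tokenizer.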
>> to what extent this affects the effectiveness of
The correct use of those words demonstrates that you are either not an AI (all of them having been trained on so much bad language) or an AI from a more perfect future.
Finnish is not so different despite having a different lineage. Even if we talk about morphology, sometimes it's simply that e.g. prepositions are affixed to the end of a word; big whoop. There are many dimensions to language variation. Finnish has a long history of contact with Scandi languages and a lot of borrowed words and logic. It would be good to have Estonian and possibly the Baltic languages too.
ETA: It is different, of course, just perhaps not as much as people sometimes try to say. You can definitely ruffle some feathers with this one, given that the uniqueness of Finnish is pretty central to Finnish nationalism.
As someone who grew up in relatively close contact with Finnish, I can assure you that there's no real common ground between the Scandinavian languages (Swedish, Danish, Norwegian) and Finnish. There are loan words, but they are few and far between and in any case do not make for any mutual understanding. I've been to Finland so much that I would really like to learn to at least understand the language, instead of relying on memorized names of foodstuffs and the like. Just have to tackle Japanese first... (and I consider that one an easier operation)
We do have one small language, the Kven language (https://en.wikipedia.org/wiki/Kven_language), which is sort of "Finnish structure, but with lots of borrowed Norwegian words". But for all intents and purposes, it very much sounds like Finnish.
It basically sounds like the language a Finnish person who has lived their whole life in Norway would speak, mixing in Norwegian words because they have forgotten the Finnish ones.
But that's about it. I know there are some other dialects too, but these are all very small-scale languages that are either extinct or will be within a few decades.
Much easier for Finns to learn Swedish than for Swedes to learn Finnish, IMO. I speak Norwegian and Finnish (lived in Finland when I was young).
Loan words from Scandi are more common than you think. E.g. "hei" is a common greeting, and "tykätä" is a common verb. For nouns there is even a whole paradigm for loans, a large number of which are Scandi. They are not necessarily easy to recognise, since they undergo sound changes, e.g. plaasteri becomes laasteri.
Loan words, yes, but that has very little to do with the grammar and structure of the language. "Jag tycker om dig" [sv] translates to "Tykkään sinusta" [fi], which isn't anywhere near the Scandic.
Oh well, if I made a spelling mistake, that obviously invalidates my whole point. Thank you for teaching me that Finnish is a very special language -- just like the Finns -- such an amazing and unique people ;)
The fact that it was trained on an HPC system whose waste heat covers 20% of a city's heat consumption is absolutely wild, and on par with how wild it is to have an English/Nordic model.
> Further emphasizing digital sovereignty, Viking is trained on the EuroHPC supercomputer LUMI, utilizing up to 4096 AMD MI-250X GPUs. LUMI is not only Europe's most powerful supercomputer and the 5th most powerful in the world, but also the 3rd greenest supercomputer among the top 500 supercomputers. LUMI's energy consumption is covered with power produced 100% with hydroelectricity, and the waste heat of LUMI will account for about 20 percent of the district heating in the surrounding city of Kajaani.
Great talking points. These are highly relevant subjects and I'm delighted we in the Nordics are keeping up with current developments. This work is important for preserving our culture.
I hope to see this used to generate a customized curriculum for each neurodiverse child so that we can live in a more equitable society.
I have had this question. How much better would common LLMs (Llama, GPT-N) be if they were trained on only one language? I have to assume they would perform better, but I might be wrong.
Perform better how? Knowing more languages gives you more data and different points of view, rather than just the English corpus and culture. When I ask ChatGPT for a translation, it seems to understand the meaning behind the words and finds the closest thing in the other language. The datasets seem to merge in some way.
Fair, but there may be overhead that doesn't need to exist. Certainly, for the limited compute my brain can manage, I could gain a deeper understanding of physics if I focused on learning physics and didn't also have to simultaneously learn French.
Wouldn't a better metaphor be whether a child growing up in a bilingual household would be worse at physics as an adult? My guess would be that growing up bilingual would have no impact.
This hypothetical kid would have the same size of brain/number of neurons anyway. In the case of LLMs, one could create a model that could be smaller thanks to not including knowledge of unnecessary languages. A problem, though, could be the lack of training data in other languages.
Humans are not limited by the computational power of the brain (or rather, that is not the limitation we encounter). We are limited by time and the fact that our machinery degrades with time (aging).
Just like adding code to textual models helps them develop their reasoning capabilities, adding more languages seems to help in other areas too. What is needed is more good-quality data to train on...
We also see humans get worse at specific things when they learn too much in general. There is a cut-off point to how many concepts we can learn, and to what level of skill. To be most effective, we have to specialize in the right things while continuing to acquire generalist knowledge. It's a balancing act.
These architectures are less capable than brains in many ways. So, we should expect them to have such trade-offs. An efficient one should work fine on English, mathematical notation, and a programming language. Maybe samples of others that illustrate unique concepts. I’m also curious how many languages or concepts you can add to a given architecture before its effectiveness starts dropping.
It's not the amount that is wrong, it's how the model is trained. The model is trained for zero- and few-shot tasks. It is not surprising that it performs well when you ask for that.
I can't track down the citation (either Google or DeepMind, I think), but I remember reading research from a year or two ago about how adding extra languages (French, German) improved English-language performance. There may have also been an investigation into multimodality, which found that adding vision or audio helped with text as well.
Interesting thought. Maybe an LLM would build deeper insight with only one training language. On the other hand, the model might overfit with just one language -- maybe multilingual models generalize better?
I think this makes sense to the extent that an understanding of the differences between languages helps separate language from the underlying meaning. However... the models used to receive input (i.e. translate from a language), to learn/understand, and to output information (i.e. re-encode into a language) do not all have to be the same.
Would an LLM trained on a smaller language have better cultural awareness etc. than one trained in English? Because English is written all over the world by all kinds of people, an English LLM will average over that (and, for instance, feel a bit off to an American). But would a Norwegian LLM, trained on a language mostly written by Norwegians, feel more natural to me in comparison?
> To leverage the capabilities of MI250X, ROCm enables the use of GPU matrix cores through its rocBLAS and MIOpen library implementations that, in turn, are leveraged by PyTorch.
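For anyone curious what that looks like from the user's side, here's a small sketch, assuming a ROCm build of PyTorch (on such builds the familiar torch.cuda API is backed by HIP); the matrix sizes are arbitrary:

```python
# Sketch: on a ROCm build of PyTorch, the torch.cuda API is HIP-backed,
# and dense matmuls are dispatched to rocBLAS (convolutions go to MIOpen).
import torch

if torch.cuda.is_available():
    print("HIP runtime:", torch.version.hip)         # set on ROCm builds (None on CUDA builds)
    print("Device:", torch.cuda.get_device_name(0))  # e.g. an MI250X GCD

    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    c = a @ b  # GEMM routed to rocBLAS, which uses the MI250X matrix cores
    torch.cuda.synchronize()
    print(c.shape)
```

The nice part of this design is that existing torch.cuda code runs unchanged; the HIP backend and rocBLAS/MIOpen do the AMD-specific work underneath.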
I don't think they decided that; they included Finnish, which is completely unrelated to the other Nordic languages. If they had just picked related languages for cross-learning, including Dutch or German would indeed have made more sense.
I understand; I'm not saying they did something wrong, just pointing out that the selection of languages was not because they belong to the same family, but rather to serve a certain region.
Including Finnish was probably just a political choice, since Finland and Sweden are very close politically, much closer than to Germany or other areas with more similar languages.
I got the impression they are focusing on Nordic culture as much as the languages.
> Silo AI and TurkuNLP are dedicated to developing models that not only excel in linguistic performance and inclusivity but are also attuned to local values and cultures.
https://huggingface.co/AI-Sweden-Models