The Most Common Words in PanLex
The PanLex Database contains a large diversity of languages and dialects. This diversity allows us to explore interesting language facts, illuminated by casting PanLex’s wide net across the languages of the world.
One question, originally suggested by our founder and director emeritus, Dr. Jonathan Pool, was:
What’s the most common word in the PanLex Database?
To answer this question, we surveyed each word in the PanLex Database and tallied the number of languages it occurs in, regardless of differences in meaning across languages. Showing up in a grand total of 1,166 languages and dialects is:
ma
This is quite expected: ma (or a similar-sounding word) is an extremely common word for “mother” in many languages around the world, because ma is often the first syllable babies are able to make. (See this Wikipedia article for more information.)
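The tally described above can be sketched in a few lines of Python. This is only an illustration, not the query we actually ran: the (language, word) pair layout, the function name count_languages_per_word, and the placeholder language labels are all assumptions made up for the example.

```python
from collections import defaultdict

def count_languages_per_word(pairs):
    """Map each word to the number of distinct languages it occurs in.

    Meaning is ignored entirely: only the written form is compared.
    """
    languages_by_word = defaultdict(set)
    for language, word in pairs:
        languages_by_word[word].add(language)
    return {word: len(langs) for word, langs in languages_by_word.items()}

# Invented toy data: placeholder language labels, not real PanLex records.
sample_pairs = [
    ("lang-a", "ma"),
    ("lang-b", "ma"),
    ("lang-c", "ma"),
    ("lang-a", "mata"),
    ("lang-b", "mata"),
    ("lang-a", "ma"),  # a repeat within one language is counted only once
]

counts = count_languages_per_word(sample_pairs)
# Most widespread first; ties broken alphabetically.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
```

Collecting languages in a set per word means a word that appears several times in one language still counts as a single language.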
If we decide to look for the most common words that are longer than two letters, we find:
mata (777 languages and dialects), lima (569), susu (519)
These words consist of very common consonants and vowels, arranged into simple consonant-vowel syllables, and thus would be likely to show up in many languages. However, there is one other aspect that makes these specific words so common—the Austronesian language family. The Austronesian family is the world’s second largest language family, containing 1,256 languages spread from Madagascar, throughout Malaysia, Indonesia, and the Philippines, and all the way into Micronesia and Polynesia. In many if not most of the languages in this family, mata means “eye”, lima means “hand” or “five”, and susu means “breast” or “milk”. The sheer size of the Austronesian family makes these three words very common in the PanLex Database.
If we now eliminate all words shorter than 5 letters, we get an interesting new phenomenon:
banda (154), Angola (154), Malta (153), Vanuatu (148), India (146)
Ignoring banda for now, we notice that the rest of the words are geographic names. This is because geographic names are some of the most wander-y of wanderwords, so much so that when a geographic location has many different names across different languages (like Germany, Deutschland, and Alemania), it stands out in our minds as odd. It turns out that most people around the world call Angola something like Angola, Malta something like Malta, etc.
So, let’s try excluding all words with any capital letters, in order to eliminate geographic names and other proper nouns. We now get:
banda (154), bomba (130), manga (115)
Now, we’re getting some interesting loanwords. banda (meaning “band” or “gang”) and bomba (meaning “bomb”) are very common across many languages, especially in Europe. manga, perhaps somewhat surprisingly, also commonly occurs, in its meaning of “Japanese-style comic books”. In addition to their occurrence as loanwords, these three have simple structures with common sounds, making them likely to occur in many languages simply by coincidence as well.
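The length and capitalization filters used in the last two steps could look something like the sketch below. The function filter_words and its parameters are hypothetical names for this example, and "letters" is approximated by Python's string length, which can differ from a reader's letter count for words with combining characters. The counts are the ones quoted in this article.

```python
def filter_words(counts, min_length=1, allow_capitals=True):
    """Keep words with at least min_length characters, optionally dropping
    any word that contains an uppercase letter."""
    kept = {}
    for word, n_languages in counts.items():
        if len(word) < min_length:
            continue
        if not allow_capitals and any(ch.isupper() for ch in word):
            continue
        kept[word] = n_languages
    return kept

# Counts quoted in this article; the dictionary layout is an assumption.
counts = {
    "ma": 1166,
    "mata": 777,
    "banda": 154,
    "Angola": 154,
    "Malta": 153,
    "bomba": 130,
}

# Step 1: drop everything shorter than 5 letters.
five_plus = filter_words(counts, min_length=5)
# Step 2: additionally drop anything containing a capital letter.
lowercase_only = filter_words(counts, min_length=5, allow_capitals=False)

print(sorted(five_plus, key=lambda w: -five_plus[w]))
print(sorted(lowercase_only, key=lambda w: -lowercase_only[w]))
```

Rejecting any uppercase letter is a blunt way to remove proper nouns: it would also discard ordinary words in languages that capitalize all nouns, and it has no effect in scripts without letter case.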
How about words with more than 5 letters (and no capital letters)?
nyanya (73), banana (69), ananas (65)
Nyanya’s presence here is due to the largest language family in the world: the Niger-Congo family, spoken widely throughout Sub-Saharan Africa. In many Niger-Congo languages, especially in the Bantu subgroup, nyanya means “eight” (and/or “tomato”!). Banana is another wanderword: most people around the world call a banana banana. Ananas is also one of those wanderwords, but it happens to be one that English didn’t pick up, going with “pineapple” instead.
The top words with 7, 8, 9, and 10 letters (and again, no capitals) are, respectively:
telefon (58), internet (56), esperanto (54), propaganda (54)
All four are typical wanderwords. Two refer to modern technologies, one refers to a social phenomenon generally associated with the 20th and 21st centuries, and one is the name of the most popular artificial language. The presence of esperanto here is likely due to Dr. Jonathan Pool himself: Dr. Pool is an Esperantist and is fluent in the language. He has also contributed a very large number of Esperanto bilingual dictionaries to PanLex, most of which likely contain an entry for esperanto!
One question that may be arising right now is “what about languages that aren’t written in the Latin script?” As we are not performing any transliteration on the words, if a language has a word like mama or propaganda but does not write it in that fashion, it is not counted in the number of languages and dialects for that word. So, let’s try getting the most common expressions not in the Latin script:
базар (64), дин (64), пиво (63)
As the Cyrillic script is the world’s second most popular writing system, it’s not a surprise that all of these are written in it. The first, базар (or “bazar”), is another wanderword, meaning “bazaar”. The second, дин (or “din”), is also a wanderword, though interestingly it does not occur in this capacity in Russian. It is a borrowing from Arabic دين (“dīn”), meaning “religion” (and for which we have earlier written an entire article). This word has spread with the spread of Islam throughout the Caucasus, a region of incredible linguistic diversity. And, as most of the languages of the Caucasus are written in Cyrillic, дин is thus a common word in the Database.
The third word пиво (“pivo”) is something a bit more secular, the Slavic word for “beer”!
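For the curious, here is one rough way the "not in the Latin script" filter could be approximated in Python. This is a heuristic sketch, not the method we used: Python's standard library does not expose the Unicode script property directly, so the hypothetical helper uses_latin_script simply looks for the word LATIN in each character's Unicode name. The counts are the ones quoted in this article.

```python
import unicodedata

def uses_latin_script(word):
    """Heuristic: True if any alphabetic character has LATIN in its Unicode name."""
    return any(
        ch.isalpha() and "LATIN" in unicodedata.name(ch, "")
        for ch in word
    )

# Counts quoted in this article; the dictionary layout is an assumption.
counts = {
    "telefon": 58,
    "propaganda": 54,
    "базар": 64,
    "дин": 64,
    "пиво": 63,
}

non_latin = {
    word: n_languages
    for word, n_languages in counts.items()
    if not uses_latin_script(word)
}
print(sorted(non_latin, key=lambda w: -non_latin[w]))
```

Because no transliteration is done, a Cyrillic spelling and a Latin spelling of the same borrowed word stay separate entries, which is exactly why the counts for the two scripts are reported separately.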
Overall, this casual experiment showed interesting results—the most common words in the PanLex Database are either extremely widespread loanwords, or words that are decidedly not borrowed across languages, but refer to basic concepts in very large language families. And the most common word is ultimately the most common because it rests at the overlap between the most basic sound humans can make, and the most basic concept of what it means to be human.