Enabling Radically Inclusive Machine Translation (part 3)

November 29, 2018 || David Kamholz

Categories: Machine Translation, PanLex Database

Tags: Android, Egyptian Arabic, Google Translate, Javanese, machine translation, Meadow Mari, Microsoft Windows, multilingual dictionaries, under-served languages, Uyghur, Western Punjabi, Wu Chinese

In the first two posts in this series, we elaborated on our belief that all people should be able to use their native language to exercise their human rights and access opportunity. We showed that machine translation technology currently falls far short of this goal, but that there are realistic ways to make progress. In this third and final installment, we describe in more detail our work at PanLex and how we are uniquely positioned to improve translation support in under-served languages.

We consider under-served languages to be those lacking institutional support from governments or support from major technologies such as Google Translate, Android, or Microsoft Windows. Of the world’s 7,500 languages, 6,900-7,400 are under-served. More than 2 billion people speak under-served languages, including large languages such as Western Punjabi (90M speakers), Javanese (84M), Wu Chinese (80M), Egyptian Arabic (62M), and Uyghur (10M).

Uyghur boys. (Image by OMF)

The PanLex Database draws on 2,500 multilingual dictionaries. Our team works on dictionaries individually, transforming their many different formats into a single common structure. This makes these dictionaries’ words and translations interoperable within the resulting database. This work is quite technical and requires many editorial judgments. The PanLex Database contains 25 million words in 5,700 languages, resulting in 1.3 billion directly-attested translations and billions more inferred translations. No other database has this structure at this scale. It is the world’s largest lexical translation database.
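
To make the distinction between directly-attested and inferred translations concrete, here is a minimal Python sketch. It is not PanLex's actual schema or algorithm: the `Expression` type, the `build_index` and `inferred_translations` functions, and the two toy dictionaries are all hypothetical, chosen only to illustrate the idea. Once several dictionaries share one structure, a translation that no single dictionary attests can be proposed by passing through an expression that two of them have in common.

```python
from collections import defaultdict
from typing import NamedTuple


class Expression(NamedTuple):
    """A word or word-like phrase in one language (hypothetical structure)."""
    lang: str  # language code, e.g. "eng"
    text: str


def build_index(sources):
    """Merge several dictionaries into one symmetric translation index.

    Each source is an iterable of (Expression, Expression) pairs that
    some dictionary attests as translations of each other.
    """
    index = defaultdict(set)
    for entries in sources:
        for a, b in entries:
            index[a].add(b)
            index[b].add(a)
    return index


def inferred_translations(index, expr, target_lang):
    """Return target-language candidates reachable through one pivot
    expression, excluding translations that are already directly attested."""
    neighbours = index.get(expr, set())
    direct = {e for e in neighbours if e.lang == target_lang}
    inferred = set()
    for pivot in neighbours:
        for candidate in index.get(pivot, set()):
            if (candidate.lang == target_lang
                    and candidate != expr
                    and candidate not in direct):
                inferred.add(candidate)
    return inferred


# Two toy dictionaries (assumed data): English-Spanish and English-French.
water = Expression("eng", "water")
agua = Expression("spa", "agua")
eau = Expression("fra", "eau")

dictionary_a = [(water, agua)]
dictionary_b = [(water, eau)]

index = build_index([dictionary_a, dictionary_b])

# Neither dictionary pairs Spanish with French, but the shared English
# expression lets us propose a Spanish-to-French candidate.
spanish_to_french = inferred_translations(index, agua, "fra")
```

Inference of this kind is only a starting point. A pivot word with several senses can link expressions that are not true translations of each other, which is one reason broad coverage and careful editorial judgment both matter.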

PanLex prioritizes under-served language coverage when deciding what to add to our database. The database currently contains 400,000 expressions (words and word-like phrases) in Uyghur, 67,000 in Meadow Mari (a language with 500,000 speakers), and nearly 6,000 in Javanese. Differences in coverage partly result from what dictionaries are available, but also – more significantly – from the labor involved in including them. There are more than 4,500 dictionaries in PanLex’s backlog. Your support goes a long way in allowing us to broaden our language coverage.

Dictionaries like these contain valuable lexical translations that PanLex can use. (Image by Wikimedia)

As explained in the previous post in this series, any method of improving machine translation support in under-served languages realistically must make use of broad lexical data in order to succeed. PanLex has spent the past 12 years building a massive database containing precisely this kind of data.

The PanLex team continues to work through the thousands of dictionaries in our backlog. In the coming months, we intend to partner with machine translation specialists to build a machine translation engine for an under-served language. This will allow sentences and texts – not just words – to be translated. Once we have shown what is possible, we hope to expand to more languages. Please let us know if you are interested in working on this with us!
