Statue of Saxon leader Widukind in Herford, Germany. (Image by M. Kunz)
Every November 5, the United Kingdom celebrates Guy Fawkes Night. Guy Fawkes was an Englishman who attempted to blow up the Houses of Parliament in 1605. The story is fairly well known, but why was this guy named Guy? What kind of a name is that, anyway? As it turns out, it’s kind of a long story!
Proto-Germanic, the reconstructed ancestor language of Germanic languages such as English and German, had a word *widuz ‘wood’—this, in fact, is the source of the English word wood. This root was used in names such as Old Saxon Widukind, literally ‘child of the wood’. These names could be shortened to Wido. The short form was borrowed into Old French as the name Guy and into Italian as Guido. The initial g-sound was added to fit the sound pattern of these languages; neither allowed w at the beginning of a word, and borrowed words originally beginning with w were pronounced with g. (The same process is evident in French guerre and Italian guerra ‘war’, which derive from a Frankish word similar to English war.)
On October 25, 02019, PanLex was honored to present the first keynote speech at WikidataCon in Berlin, Germany. As our representative, I was excited to share with Wikidata’s staff, volunteers, and users PanLex’s ideas about the importance of linguistic diversity and the role of lexical data in helping to preserve it.
The Wikidata audience was wonderfully receptive to PanLex’s mission and work. Many of the talks and workshops at the conference addressed how Wikidata can help underserved, minority, and indigenous language communities, so the ground was fertile for discussions of how our respective missions align.
Sugi Lanus (left), the author, and other contributors to the lontar project at a cafe in Denpasar.
In the previous two updates, we described the Balinese lontar digitization project that PanLex is managing for Internet Archive. The goal is to continue the digitization of the Balinese Digital Library’s scanned lontar (palm-leaf manuscripts) by transcribing them into Unicode text, using the keyboards discussed in the last update. This work has now gotten underway in earnest, with over 2,000 lontar leaves transcribed and available at Palmleaf.org, comprising more than 60 complete works! Our current goal is to transcribe 3,000 leaves by the end of October.
The transcribed lontar are mostly in Kawi (Old Javanese), Balinese, or a mixture of the two, all written in Balinese script. The works cover a wide range of fascinating topics. There are chronicles (babad), medicinal texts (usada), mantras, several genres of poems ranging from high style (kakawin) to colloquial (geguritan), village regulations (awig-awig), horoscopes, classifications of things (carcan), and more. One entertaining example is Carcan Kucing, a “classification of cats” that serves as a guide for choosing a cat. Another is Pangayam-ayam, a cockfighting horoscope; it is a bettor’s guide organized by calendar date, suggesting which cocks are likely to win on each day.
We are honored to announce that PanLex has been asked to give one of the two keynote speeches at WikidataCon 2019 in Berlin, Germany on October 25th. Wikidata, a project of the Wikimedia Foundation, is a collaboratively edited database of structured knowledge. Much in the way that Wikimedia’s best-known project, Wikipedia, is a publicly created and edited encyclopedia, Wikidata is a database of facts that anyone can edit. For example, while the English Wikipedia has a prose article with information on Berlin, Wikidata has an entry on Berlin with directly accessible and updatable facts, such as the area (891.12 km²), population (3,613,495), and inception date (1237 C.E.).
Wikidata’s model of being a central repository of publicly accessible information closely parallels PanLex’s model. So when Wikidata embarked on a project to begin collecting lexical data, and requested that PanLex speak about the importance of lexical data, we jumped at the chance. We are excited not just to present PanLex to a wider audience, but to present PanLex’s model of support for underserved languages to receptive, effective partners.
Of the world’s 7,000 languages, approximately half have some kind of writing system. Enabling digital support for all of these writing systems is a monumental undertaking. The Unicode standard has encoded 151 scripts—alphabets, syllabaries, and so on—as of the latest version. These include everything from common alphabets like Latin and Cyrillic to Han characters (used for Chinese and Japanese languages, among others), Egyptian hieroglyphs, the Cherokee syllabary, Batak (described in a previous post), and emoji. Once encoded in Unicode, these scripts can be used in digital text.
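To make "encoded in Unicode" a little more concrete, here is a minimal Python sketch that looks up the code point and official character name for one sample character from a few of the scripts mentioned above. It is only an illustration: the `describe` helper and the choice of sample characters are our own assumptions, and which characters are recognized depends on the Unicode version bundled with your Python installation.

```python
import unicodedata


def describe(char: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    # unicodedata.name() raises ValueError for unknown characters unless a
    # default is supplied, so we pass a placeholder string instead.
    name = unicodedata.name(char, "<not in this Python's Unicode data>")
    return f"U+{ord(char):04X} {name}"


# One sample character per script; the selection is arbitrary.
samples = {
    "Latin": "\u0041",
    "Cyrillic": "\u0410",
    "Cherokee": "\u13A0",
    "Batak": "\u1BC0",
}

for script, char in samples.items():
    print(f"{script:10} {describe(char)}")
```

Each character has its own code point and name even when it looks similar to a character in another script, which is what lets software store and exchange text in any encoded script. Whether the character actually displays correctly is a separate question that depends on the fonts installed.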
Hanifi Rohingya script written by hand at left and digitally at right in the brand new Noto Sans Hanifi Rohingya font. (Image by Ben Yang.)
Unicode support is only the first step in making it possible to use a script online. In order to read and write using a script, you also need fonts that support it. Have you ever received a message containing text or an emoji that you couldn’t view? These unreadable characters are colloquially known as “tofu”, because they often appear as rectangular white boxes resembling tofu. PanLex has recently made a large number of fonts available that were not previously easy to use on the web, helping solve this tofu problem.