Sugi Lanus (left), the author, and other contributors to the lontar project at a cafe in Denpasar.
In the previous two updates, we described the Balinese lontar digitization project that PanLex is managing for Internet Archive. The goal is to continue the digitization of the Balinese Digital Library’s scanned lontar (palm-leaf manuscripts) by transcribing them into Unicode text, using the keyboards discussed in the last update. This work is now underway in earnest, with over 2,000 lontar leaves transcribed and available at Palmleaf.org, comprising more than 60 complete works! Our current goal is to transcribe 3,000 leaves by the end of October.
The transcribed lontar are mostly in Kawi (Old Javanese), Balinese, or a mixture of the two, all written in Balinese script. The works cover a wide range of fascinating topics. There are chronicles (babad), medicinal texts (usada), mantras, several genres of poems ranging from high style (kakawin) to colloquial (geguritan), village regulations (awig-awig), horoscopes, classifications of things (carcan), and more. One entertaining example is Carcan Kucing, a “classification of cats” that serves as a guide for choosing a cat. Another is Pangayam-ayam, a cockfighting horoscope; it is a bettor’s guide organized by calendar date, suggesting which cocks are likely to win on each day.
We are honored to announce that PanLex has been asked to give one of the two keynote speeches at WikidataCon 2019 in Berlin, Germany on October 25th. Wikidata, a project of the Wikimedia Foundation, is a collaboratively edited database of structured knowledge. Much as Wikimedia’s best-known project, Wikipedia, is a publicly created and edited encyclopedia, Wikidata is a database of facts that anyone can edit. For example, while the English Wikipedia has a prose article with information on Berlin, Wikidata has an entry on Berlin with directly accessible and updateable facts, such as the area (891.12 km²), population (3,613,495), and inception date (1237 C.E.).
Wikidata’s model of being a central repository of publicly accessible information closely parallels PanLex’s model. So when Wikidata embarked on a project to begin collecting lexical data, and requested that PanLex speak about the importance of lexical data, we jumped at the chance. We are excited not just to present PanLex to a wider audience, but also to present PanLex’s model of support for underserved languages to receptive, effective partners.
Of the world’s 7,000 languages, approximately half have some kind of writing system. Enabling digital support for all of these writing systems is a monumental undertaking. The Unicode standard has encoded 151 scripts (alphabets, syllabaries, and so on) as of the latest version. These include everything from common alphabets like Latin and Cyrillic to Han characters (used for Chinese and Japanese, among other languages), Egyptian hieroglyphs, the Cherokee syllabary, Batak (described in a previous post), and emoji. Once encoded in Unicode, these scripts can be used in digital text.
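To make "encoded in Unicode" a little more concrete, here is a minimal Python sketch that looks up the code point and official character name for one character from several of the scripts mentioned above. The sample characters and the helper name describe are our own choices for illustration, and the exact names returned depend on the Unicode version bundled with your Python installation.

```python
import unicodedata

# One sample character per script, written as escapes so the listing
# stays readable even without a font for that script.
# The selection is illustrative only.
SAMPLES = {
    "Latin": "\u0041",
    "Cyrillic": "\u042f",
    "Han": "\u8a9e",
    "Cherokee": "\u13a0",
    "Balinese": "\u1b13",
}


def describe(ch):
    """Return the code point and Unicode character name for one character."""
    # unicodedata.name() returns the default if the character has no name
    # in the Unicode version that this Python build ships with.
    name = unicodedata.name(ch, "<no name in this Unicode version>")
    return f"U+{ord(ch):04X} {name}"


if __name__ == "__main__":
    for script, ch in SAMPLES.items():
        print(f"{script}: {describe(ch)}")
```

Each character has a single, stable code point regardless of the font used to display it, which is what allows text in these scripts to be stored, exchanged, and searched.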
Hanifi Rohingya script written by hand at left and digitally at right in the brand new Noto Sans Hanifi Rohingya font. (Image by Ben Yang.)
Unicode support is only the first step in making it possible to use a script online. In order to read and write using a script, you also need fonts that support it. Have you ever received a message containing text or an emoji that you couldn’t view? These unreadable characters are colloquially known as “tofu”, because they often appear as rectangular white boxes resembling tofu. PanLex has recently made a large number of fonts available that were not previously easy to use on the web, helping solve this tofu problem.
Om swastyastu, a common Balinese greeting. (Image by author.)
In a previous post, we introduced the Balinese Lontar Project that PanLex is managing, in coordination with the Internet Archive and Udayana University. We have some exciting updates from the last two months. The team at Pusat Kajian Lontar at Udayana has given us great feedback, PanLex’s transcription platform is now live at palmleaf.org, and the Kahle/Austin Foundation (run by Internet Archive founder Brewster Kahle and his wife Mary Austin) has agreed to fund the initial phase of work! Over the next few months, we will be working with Udayana and possibly other interested parties in Bali to transcribe complete lontar works.
Finding the right fonts to work with
PanLex has needed to solve several unanticipated but fascinating problems in order to create a viable online transcription platform. In the previous post, we said that “good Balinese fonts have only recently become available”; we meant Google’s Noto Serif Balinese font. However, the experts at Udayana informed us that Noto Serif Balinese was hard to read. They suggested that we instead use Bali Simbar, which is the most popular font currently used in Bali to write Balinese script. That turned out not to be possible, because Bali Simbar does not use Balinese Unicode, and using Balinese Unicode is the only way to make Balinese text readable and searchable on all platforms. In fact, few Balinese fonts are available with Unicode support, and most are incomplete. Since the goal of the Balinese Lontar Project is to make lontar works accessible to all, we had to solve this problem.
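As a rough illustration of why the encoding matters, the Python sketch below measures how much of a string is made up of code points from the Unicode Balinese block (U+1B00 to U+1B7F). The function name balinese_ratio is hypothetical, and the comparison string assumes, purely for illustration, that text typed with a non-Unicode font is stored as ordinary Latin code points; we have not checked how any particular font maps its glyphs.

```python
# The Unicode Balinese block spans U+1B00 through U+1B7F.
BALINESE_BLOCK = range(0x1B00, 0x1B80)


def balinese_ratio(text):
    """Return the fraction of non-whitespace characters in the Balinese block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for c in chars if ord(c) in BALINESE_BLOCK)
    return in_block / len(chars)


if __name__ == "__main__":
    # Two Balinese letters written as escapes (illustrative, not a real word).
    unicode_text = "\u1b05\u1b13"
    # Assumption: non-Unicode text that merely looks Balinese in a special
    # font is stored as Latin letters underneath.
    legacy_text = "abc"
    print(balinese_ratio(unicode_text))  # every character is in the block
    print(balinese_ratio(legacy_text))   # no character is in the block
```

Text of the second kind only looks Balinese while the special font is installed; a search engine or another device sees Latin letters, which is why such text is neither readable nor searchable everywhere.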
The term onomatopœia, derived from the Greek ὀνοματοποιία (ὄνομα (ónoma), “name” + ποιέω (poiéō), “to make, to do, to produce”), refers to words whose phonetic forms originate from the sound of the thing or action the word represents. Common examples from English are “oink”, “beep”, and “hiccup”. Japanese is known for having a very large set of onomatopœias, covering a wider range of topics than the onomatopœia of other languages. For example, どきどき (doki doki) means “with a racing heart”, in imitation of a rapid heartbeat. Some Japanese onomatopœias represent a metaphorical sound, such as the rather amusing しいん (shiin), meaning “the sound of silence”. One fascinating aspect of onomatopœias is that, although they derive from non-linguistic sounds, cross-linguistically they often differ. For example, the English representation of the sound of a pig is “oink”, but in Mandarin it is 哼哼 (hēng hēng), in Swedish it is nöff, and in Thai it is อู๊ด (úut).
Angel in her Berkeley back yard. (Image by Donald Anderson.)