Of the world’s 7,000 languages, approximately half have some kind of writing system. Enabling digital support for all of these writing systems is a monumental undertaking. The Unicode standard has encoded 151 scripts—alphabets, syllabaries, and so on—as of the latest version. These include everything from common alphabets like Latin and Cyrillic to Han characters (used for Chinese and Japanese languages, among others), Egyptian hieroglyphs, the Cherokee syllabary, Batak (described in a previous post), and emoji. Once encoded in Unicode, these scripts can be used in digital text.
Hanifi Rohingya script written by hand at left and digitally at right in the brand new Noto Sans Hanifi Rohingya font. (Image by Ben Yang.)
Unicode support is only the first step in making it possible to use a script online. In order to read and write using a script, you also need fonts that support it. Have you ever received a message containing text or an emoji that you couldn’t view? These unreadable characters are colloquially known as “tofu”, because they often appear as rectangular white boxes resembling tofu. PanLex has recently made a large number of fonts available that were not previously easy to use on the web, helping solve this tofu problem.
Om swastyastu, a common Balinese greeting. (Image by author.)
In a previous post, we introduced the Balinese Lontar Project that PanLex is managing, in coordination with the Internet Archive and Udayana University. We have some exciting updates from the last two months. The team at Pusat Kajian Lontar at Udayana has given us great feedback, PanLex’s transcription platform is now live at palmleaf.org, and the Kahle/Austin Foundation (run by Internet Archive founder Brewster Kahle and his wife Mary Austin) has agreed to fund the initial phase of work! Over the next few months, we will be working with Udayana and possibly other interested parties in Bali to transcribe complete lontar works.
Finding the right fonts to work with
PanLex has needed to solve several unanticipated but fascinating problems in order to create a viable online transcription platform. In the previous post, we said that “good Balinese fonts have only recently become available”; we meant Google’s Noto Serif Balinese font. However, the experts at Udayana informed us that Noto Serif Balinese was hard to read. They suggested that we instead use Bali Simbar, which is the most popular font currently used in Bali to write Balinese script. That turned out not to be possible, as it does not use Balinese Unicode, the only way to make Balinese text readable and searchable on all platforms. In fact, few Balinese fonts are available with Unicode support, and most are incomplete. Since the goal of the Balinese Lontar Project is to make lontar works accessible to all, we had to solve this problem.
The term onomatopœia, derived from the Greek ὀνοματοποιία (ὄνομα (ónoma), “name” + ποιέω (poiéō), “to make, to do, to produce”), refers to words whose phonetic forms originate from the sound of the thing or action the word represents. Common examples from English are “oink”, “beep”, and “hiccup”. Japanese is known for having a very large set of onomatopœias, covering a wider range of topics than the onomatopœia of other languages. For example, どきどき (doki doki) means “with a racing heart”, in imitation of a rapid heartbeat. Some Japanese onomatopœias represent a metaphorical sound, such as the rather amusing しいん (shiin), meaning “the sound of silence”. One fascinating aspect of onomatopœias is that, although they derive from non-linguistic sounds, cross-linguistically they often differ. For example, the English representation of the sound of a pig is “oink”, but in Mandarin it is 哼哼 (hēng hēng), in Swedish it is nöff, and in Thai it is อู๊ด (úut).
Angel in her Berkeley back yard. (Image by Donald Anderson.)
As we reported in our March newsletter, we were honored to contribute the entire PanLex Database to the Arch Mission Foundation’s Lunar Library™, a 30-million-page archive of civilization contained in a long-duration time-capsule that traveled to the Moon last month aboard the SpaceIL Beresheet lunar lander.
In 2011, the Internet Archive photographed nearly the entire collection of Balinese palm-leaf manuscripts (130,000 leaves in all) as part of an effort to bring out of the shadows the lesser-known literatures of the world and to inspire others to do the same.
These traditional Balinese texts were inscribed with a special triangular iron stylus on treated Lontar palm leaves that come from the Borassus fabellifer family of palms. Subjects span a variety of aspects of life, including religious ceremonies, guidelines, and magic; medical, astrological, and astronomical knowledge; epic stories, histories, and genealogies; and the performing arts and illustrations. Many of the texts are centuries old and have been re-copied many times over the years, as the leaves themselves break down over time.
Lontar palm leaf book in Balinese script. (Image by Tropenmuseum.)