Before running a Google search or asking ChatGPT a question in English, Spanish, Mandarin or Russian, no one wonders whether these platforms will understand the language. These languages are overrepresented on the web. More than 99% of online content is in just 35 of the world’s approximately 7,000 languages (more than half of which are purely oral), with English alone accounting for 62%. This leaves thousands of languages with a marginal presence on the internet, or none at all.
For this reason, groups of experts around the world are dedicated to digitally preserving languages. One of these is The Missing Scripts, an initiative that seeks to encode all the world’s writing systems in Unicode, the universal computing standard that allows forms of writing to be identified so that they can be processed by various types of software. Artificial intelligence platforms, for example, rely heavily on Unicode for text processing. If a writing system is not in Unicode, it cannot be used on a computer.
Of the 292 writing systems that exist, 146 are not in Unicode. These include not only ancient scripts, some of which have not yet been fully deciphered, but also a large number of scripts used by minority ethnic groups that still write their own languages today.
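What it means for a character to be “in Unicode” can be seen with a few lines of Python. The following is only a minimal sketch: the helper `describe` is a hypothetical name introduced for illustration, and the result depends on the Unicode version bundled with the Python interpreter. It uses the standard `unicodedata` module to look up the code point and official name of a character; for a code point that has no assigned character, no name comes back, which is roughly the situation of a script that has not yet been encoded.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character.

    `describe` is a hypothetical helper for illustration. If the code
    point has no name in the interpreter's Unicode database, the
    placeholder '<no name>' is used instead.
    """
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {name}"


print(describe("A"))        # U+0041 LATIN CAPITAL LETTER A
print(describe("\u0627"))   # U+0627 ARABIC LETTER ALEF (a letter used in Urdu)
print(describe("\u0378"))   # unassigned at the time of writing, so no name is expected
```

Software can only sort, search or render text reliably once each of its characters has such an entry, which is why the encoding work described here comes before fonts, keyboards and language models.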
“Every culture should be part of Unicode,” says Johannes Bergerhausen, professor of typography at the University of Applied Sciences in Mainz (Germany), by video call. He co-founded The Missing Scripts with Thomas Huot-Marchand, director of the National Typographic Research Workshop (ANRT) in Nancy, France. The project was born from an alliance between their respective study centers and the Script Encoding Initiative of the University of California, Berkeley. The Missing Scripts has also received sponsorship from UNESCO in the context of the Decade of Indigenous Languages (2022-2032).
For UNESCO, the “neglect of the digital industry” towards minority languages represents a “threat of extinction.” For this reason, it considers digitalization and web presence to be “empowerment tools.” However, the sidelining of languages goes beyond minority groups. Take Urdu, the language with the tenth-most native speakers in the world (close to 80 million, mainly in Pakistan and India): its speakers have difficulty typing it on a computer keyboard and must resort to a romanized version through phonetic transliteration. Cases like this endanger the transmission of the language to future generations.
“When the last speaker of a language dies, we lose the culture, we lose all the heritage. That is why it is really important to record these languages and for them to live on the internet and in the digital space, so that they can spread,” explains Huot-Marchand. Experts, however, make an important clarification: languages and writing systems should not be confused. There are around 7,000 languages in the world, but only 292 writing systems in the history of humanity. The Missing Scripts works exclusively in the field of written language.
A meticulous job
The project is a collective effort that transcends the work of the three study centers involved. The founders explain that they cooperate with experts in different fields, from design and typography to linguistics. “But we also have to work with native speakers, with computer scientists, with engineers and even with companies,” says Huot-Marchand, who emphasizes that the results of the work must be open because it is “the only way” to make a contribution. Both experts defend the importance of involving native speakers in the work when it comes to languages that are still alive.
Bergerhausen expects to have all 292 writing systems in Unicode by the year 2047. The academic, however, admits that the goal is “a little naive” because “one or two” new writing systems appear every year. This happens mainly in West Africa, the founders explain, since many languages there were recorded in the Latin alphabet as a result of European colonization, and more and more communities want their own writing system to express their languages.
Unexpected difficulties also arise when registering systems. For example, the researcher working on the writing system of Lampung, a language from the island of the same name in Indonesia, discovered that this language, spoken by approximately 1.5 million people, has a dozen different scripts. “Then you have the difference between handwritten and typeface writing. So you must decide on the shape of the letter you are going to register. It would be like deciding in English or Spanish which is the perfect letter ‘A’ or the letter ‘E’ that should be included in Unicode,” says Huot-Marchand.
Lampung is a living language, with native speakers who can help resolve these issues. But The Missing Scripts is also registering writing systems from dead languages, which can raise similar problems. Noemí Moncunill, professor of Latin Philology at the University of Barcelona, worked with The Missing Scripts to encode the Paleo-Hispanic writing system (used in the Iberian Peninsula between the 7th century BC and the 1st century AD).
This project, however, shows the limitations of Unicode, as Moncunill explains by video call: “Recording in Unicode fell short for us, because when we study historical texts, written by hand, we see a variation of writing that also interests us.” For this reason, according to the academic, her team took a “double path”: creating a standard alphabet that would be encoded in Unicode and be useful for dissemination, while separately registering the sources that represent the full variation of Paleo-Hispanic writing.
“In research you need to be able to express all the variation of writing. But, on the other hand, not having a Unicode encoding is also problematic. So, from our point of view, the ideal is to have a double system,” says Moncunill.
Beyond writing
The Missing Scripts has set the ambitious goal of recording everything related to written language, but other initiatives also want to preserve languages beyond writing. One of these is the Living Tongues Institute for Endangered Languages in the United States, which, in addition to publishing scientific works, produces online multimedia dictionaries to preserve indigenous languages in collaboration with members of the communities that speak them.
Founded in 2005, this project organizes workshops to train “linguistic activists” to record and edit phrases in their language and add them to its “Living Dictionaries,” which contain tens of thousands of words, images and audio recordings from languages around the world.
“Although there are many academics working with endangered languages, they don’t always have the time to really do deep work with communities. So that is one of the main reasons behind our organization: not only to immerse ourselves in the scientific side, but also to try to create resources that can be useful for communities,” explains Anna Luisa Daigneault, director for North and South America of the Living Tongues Institute for Endangered Languages, by video call.
According to Daigneault, this working method avoids “cultural misinterpretations” and helps the result be “more authentic and linguistically rich.” The organization’s work ranges from projects with the Breton-speaking community in northern France to indigenous communities in Bolivia and collaborations with minority language speakers in India.
The expert emphasizes that the work is always supported by “rigorous, well-made and exhaustive documentation” which is then put to practical use by creating, in addition to dictionaries, online courses, books or even subtitles for movies. “It’s something tangible that we can take to the world,” says Daigneault.
Currently, the online platform for multimedia dictionaries has about 1,000 users “dispersed throughout the world” and hosts more than 400 languages. By the end of this year, Daigneault hopes to have “more than 500.”
Recently, the Living Tongues Institute for Endangered Languages has been conducting workshops in the Brazilian Amazon with the Werikyana community, whose members are creating their own multimedia dictionaries. “Our Werikyana collaborators compile lists of words and phrases and discuss them as a group before adding them to the digital dictionary. Native speakers then record their own voices using their devices and upload them to the dictionary,” says Daigneault.
The expert highlights the importance of the multimedia component in cases where there is no standard writing system and several “competing” scripts exist, which is why the platform offers space for multiple writing systems for the same language, with visual and audio support.
Generational learning
According to UNESCO, an indigenous language dies every two weeks. “The definition of danger has several factors, and the most important is whether or not a language is being transmitted to youth and children. A language can have a million speakers, but if it is not being passed on to children, then it is still considered endangered,” explains Daniel Kaufman, founder of The Endangered Languages Project, a nonprofit organization that works with indigenous and migrant communities around the world to “document, describe and promote their languages.”
From New York, this NGO acts as a “collaborative center” dedicated to strengthening languages in danger of disappearing. On the website, collaborators upload language samples in text, audio or video format to the system. They also organize cultural and educational activities that serve to disseminate work with different languages. “We are not trying to create a language museum or an archive that people can look at, the core is to bring the language to children in some way. And that’s something we’re still working on and expanding,” says Kaufman.
According to the expert, since most of the world’s languages are only oral, many of the people they work with have no experience of writing and other techniques are needed: “Writing or blogging is not our first priority because very few feel comfortable with that. For them, that’s not really how they grew up with the language.” For that reason, Kaufman highlights the importance of giving communities the tools to record and spread their language in the way they feel most comfortable. The aim is for the digital world to gradually become a more accurate reflection of the linguistic diversity of the real world.