Since the invention of writing several thousand years ago, humans have come up with myriad scripts that turn the sounds of spoken languages into something visual. Most of these written languages have already been deciphered, from Egyptian hieroglyphics to Maya inscriptions to ancient Chinese writing.
In some cases, linguists have simply gotten lucky when it came to breaking the code of lost languages, as with the Rosetta Stone. Other times, they’ve spent years teasing out subtle patterns in the arrangement of letters within words and words within texts to find the key.
But a few lost languages still trouble epigraphers, the scholars who study ancient inscriptions. For example, the writings of the Olmec and Zapotec are still a mystery, as is the ancient Proto-Elamite script of what is now Iran. The most notable undeciphered writing system may be the script of the Indus Valley civilization, which has seen numerous decoding attempts, none yet successful.
Today, frustrated historians have another tool at their disposal: artificial intelligence. New advances, both in computing and linguistics, are making it possible for algorithms to begin decoding ancient languages. The latest push comes from a team of researchers at MIT’s Computer Science and Artificial Intelligence Laboratory as well as Google Brain, an artificial intelligence project. They’ve devised an algorithm that can begin to match words from unknown languages to related words, or cognates, in languages that share the same root. Though the algorithm, posted to the preprint server arXiv, has yet to tackle a truly undeciphered language, it’s a promising step forward.
Taking on ancient languages with AI does pose some unique problems, though. Machine learning algorithms are usually trained on massive datasets that they mine in order to learn through associations. Most ancient scripts have only a limited number of samples, making it difficult to feed an algorithm enough data for it to learn.
The process of training an algorithm also involves comparing its answers to known values. When a language is entirely undeciphered, however, this is impossible. You can’t tell an algorithm “Yes, that is a bike,” or “No, that word does not mean ‘stop’” if you don’t know what any of it means.
So, the researchers had to devise other methods of learning. They trained their algorithm using a language that shares a root with the undeciphered script, and paired that with theories about how languages evolve over time from linguistics research. The idea was to find words in the known language that were similar, both in terms of the characters they used and their context within the broader script, to words from the unknown language.
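To make the character-similarity half of that idea concrete, here is a minimal Python sketch. It is not the researchers' method: it swaps in plain Levenshtein edit distance as a stand-in for "similar in terms of the characters they used," and it ignores context within the text entirely. The function names and the sample words are invented placeholders, not real transliterations.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_candidates(unknown_word: str, known_vocab: list[str]) -> list[tuple[str, int]]:
    """Rank known-language words by closeness to the unknown word
    (hypothetical helper; closest first, ties broken alphabetically)."""
    scored = [(word, edit_distance(unknown_word, word)) for word in known_vocab]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


# Invented placeholder strings, purely for illustration.
ranking = rank_candidates("kator", ["motel", "rikato", "katar"])
print(ranking[0])
```

A ranking like this only proposes candidates. With no known answers to check against, surface similarity alone cannot confirm that two words are truly related, which is why the researchers also drew on context and on linguistic theories of how languages change.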
The two languages they used for their research, Linear B and Ugaritic, aren’t technically undeciphered, as both have been largely translated. But they’re good test cases, because the algorithm’s guesses can be checked against known answers. The researchers’ algorithm edged past previous efforts, identifying cognates in Ugaritic some 5 percent better than before, and correctly translating over two-thirds of cognates in Linear B.
While the algorithm may not be unlocking Proto-Elamite anytime soon, it is an achievement in one important way. Linear B was used for writing in early Mycenaean Greece beginning around 1450 B.C. It shares no linguistic roots with Ugaritic, which was written in the ancient city of Ugarit, in what is now Syria, during roughly the same period. That means the AI needed to parse completely different language systems using a single approach. That’s a difficult task in linguistics, where most scripts require unique tactics to decipher. Finding a single method that’s generalizable to multiple scripts would make the work much quicker.
Still, the few undeciphered scripts that remain today may not have related languages for a similar algorithm to use as a comparison. That would make this approach difficult to apply in those situations.
It may not be soon, but the few still-mysterious ancient languages out there will certainly be cracked open one day. Whether that’s by human hands or computer circuits is currently an open question.