Finding new words

Universal lexicon
Lexicographa
Language
AI

Systematically discovering words that exist in other languages but not English

Author

Alex Leeds

Published

April 16, 2024

I’m exploring methods for systematically discovering words in other languages for concepts we don’t have in English. These words are both fun and useful, but mostly just fun.

For instance: abbiocco (in Italian, the sleepy feeling after eating a large meal), verschlimmbessern (in German, to make something worse in a well-meaning attempt to make it better), dépaysement (in French, the mix of euphoria and disorientation of being in a foreign country), etc.

I’m calling this mini-project lexicographa. A lexicographer is “someone who creates dictionaries,” so it seems fitting. The goal is to create a new dictionary (and thesaurus) that blurs the boundaries between English and other languages.

There are two ways to go about the problem of finding useful words outside English, roughly corresponding to unsupervised and supervised machine learning.

Far out or far out?

In the unsupervised approach, I’ll begin by finding single words in European languages whose AI text embeddings locate them far from any word in English. This approach will probably fail at first. Among other things, distance metrics in machine learning tend to get noisy as the number of dimensions goes up, and the number of dimensions in the text embeddings I’m using is pretty high, often over 1,500. Being distant in hundreds of small ways doesn’t equate to being intriguingly distant in a few useful ones.
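Here’s a quick numpy sketch of that effect: for random points, the contrast between the nearest and farthest neighbor collapses as the number of dimensions grows, which is exactly what makes “farthest word” rankings unreliable.

```python
# Toy demonstration: relative distance contrast shrinks in high dimensions.
import numpy as np

rng = np.random.default_rng(0)

for dims in (2, 50, 1536):
    points = rng.normal(size=(1000, dims))  # stand-ins for word embeddings
    query = rng.normal(size=dims)           # stand-in for a single word
    dists = np.linalg.norm(points - query, axis=1)
    # How much farther the farthest neighbor is than the nearest one.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"{dims:>4} dims: contrast = {contrast:.2f}")
```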

Distance can also be misleading. If the AI text embeddings are like a map of language, we don’t want to find the middle of the ocean. We want words that are distinctive within the spheres they occupy, more like unusual foreign landmarks.

To use the unsupervised approach, we will definitely have to zero in on useful dimensions for various types of words. We can probably cluster words by theme and reduce them to “important dimensions,” and then look for atypical non-English words in each cluster.
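As a sketch of what that could look like, assuming we already have an embedding matrix and parallel word/language lists (the function name and parameter choices here are mine, not a settled design):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def atypical_foreign_words(embeddings, words, languages,
                           n_clusters=50, n_dims=50):
    """Reduce embeddings to their main dimensions, cluster by theme,
    then rank non-English words by distance from their cluster centroid."""
    reduced = PCA(n_components=n_dims).fit_transform(embeddings)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
    # Each word's distance from the center of its own thematic cluster.
    dists = np.linalg.norm(reduced - km.cluster_centers_[km.labels_], axis=1)
    candidates = [(w, lang, d) for w, lang, d in zip(words, languages, dists)
                  if lang != "en"]
    return sorted(candidates, key=lambda c: -c[2])
```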

Nevertheless, the best place to start is to experiment. I’ll start with code that says “just show me words in French, Spanish, German, or Italian that are distant from my English corpus” and see what we find. Then we can refine how we represent “distance.”
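A minimal version of that first experiment might look like this, using an off-the-shelf multilingual embedding model from sentence-transformers; the tiny word lists are placeholders for real corpora.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english_words = ["sleepy", "drowsy", "homesick", "improve", "worsen", "abroad"]
foreign_words = ["abbiocco", "verschlimmbessern", "dépaysement"]

en_emb = model.encode(english_words)
xx_emb = model.encode(foreign_words)

# For each non-English word, the distance to its closest English neighbor.
nearest = cosine_distances(xx_emb, en_emb).min(axis=1)
for word, dist in sorted(zip(foreign_words, nearest), key=lambda p: -p[1]):
    print(f"{word}: nearest English neighbor at distance {dist:.3f}")
```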

In other forms of natural language processing, it would be typical to use TF-IDF (term frequency, inverse document frequency). The idea of TF-IDF is to find words that appear frequently inside some documents but not across all documents. The word “baseball” will appear many times in articles about baseball but not once in articles about other sports. Across news articles, “baseball” has a high TF-IDF and really is a useful guidepost.
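For reference, here’s the baseball example in a few lines of scikit-learn (the miniature “articles” are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the baseball game went to extra innings and the baseball crowd roared",
    "the election results were announced after the polls closed",
    "the new restaurant serves dinner until midnight",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# Top-scoring terms in the first (baseball) article.
terms = vec.get_feature_names_out()
row = tfidf[0].toarray().ravel()
print(sorted(zip(terms, row), key=lambda p: -p[1])[:3])
```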

This does not work here. The types of words I’m looking for have a low term frequency and low document frequency. An intriguing word might appear just once in a text, and not often in an entire corpus.

By comparison, the supervised method begins with a list of words that fit my objective, analyzes what makes those hundreds of examples interesting, and uses that signal to uncover new ones.
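In code, the supervised framing might look like this sketch: embed a hand-curated seed list, fit a simple classifier, and score new candidates. The words and labels here are illustrative placeholders, not a real training set.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

seed_words = ["abbiocco", "dépaysement", "verschlimmbessern",
              "table", "run", "blue"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = fits the objective, 0 = ordinary

clf = LogisticRegression(max_iter=1000).fit(model.encode(seed_words), labels)

# Score unseen candidates by their predicted probability of being interesting.
candidates = ["sobremesa", "chair"]
probs = clf.predict_proba(model.encode(candidates))[:, 1]
for word, p in zip(candidates, probs):
    print(f"{word}: {p:.2f}")
```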

But I’m skeptical that I’ll find enough examples for supervised learning to capture the nuance we want!

In any case, we need a corpus to analyze.

Corpuses to dissect…

One possibility would be to grab whole books in different languages from Project Gutenberg. It would be easy to expand this project to see how different authors’ vocabularies vary and what words are distinctive not only by language but also by author.
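Grabbing one book’s vocabulary is only a few lines; this sketch assumes Project Gutenberg’s usual plain-text URL pattern, and the book ID is just an example.

```python
import re
import urllib.request

# Plain-text URL pattern for Project Gutenberg; ebook 2000 is an example
# (Cervantes's Don Quijote, in Spanish).
url = "https://www.gutenberg.org/cache/epub/2000/pg2000.txt"
text = urllib.request.urlopen(url).read().decode("utf-8")

# Unicode-aware tokenization: runs of letters only.
vocabulary = {w.lower() for w in re.findall(r"[^\W\d_]+", text)}
print(f"{len(vocabulary):,} distinct words")
```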

But as Hemingway demonstrates, plain words often make for better storytelling. I’m not sure books are the best source.

At first glance, the Leipzig Corpora Collection looks excellent. This database is built for linguistics research, with words and sentences in 291 languages drawn from web and news sources through 2023. It also appears to capture the kinds of colloquial phrases we’d like to consider.
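Loading one of its word lists should be straightforward. The sketch below assumes the tab-separated id/word/frequency layout of the downloadable *-words.txt files, which is worth verifying against an actual download; the filename is illustrative.

```python
import csv

def load_leipzig_words(path):
    """Return {word: frequency} from a Leipzig *-words.txt file,
    assuming tab-separated columns: word id, word, frequency."""
    words = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            words[row[1]] = int(row[2])
    return words

# e.g., from an Italian news download
ita_words = load_leipzig_words("ita_news_2023_10K-words.txt")
print(f"{len(ita_words):,} Italian word types")
```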

There are also historical data dumps from resources like Wikipedia, Reddit, and Twitter. Twitter seems especially promising, but I’ll start with the Leipzig Corpora Collection and grow from there.