From Idea to Interpreter: Building Talea, a Programming Language for the Humanities
- Subhagato Adak
- Jun 25
- 4 min read
In the world of software development, we often build tools for other developers. But what if we could build a tool for a completely different audience? What if we could empower historians, linguists, and literary scholars to perform complex data analysis with the same ease as writing a sentence?
This was the question that led to the birth of Talea, a new programming language designed from the ground up for the humanities. Our goal was ambitious: create a language with a plain-English syntax that could harness the power of sophisticated NLP libraries without requiring a single line of traditional code.
This is the story of how we built it—the design decisions, the technical hurdles, and the "aha!" moments that brought Talea to life.
The Vision: Code That Reads Like Instructions
The core problem for many researchers in the humanities isn't a lack of interest in computational methods, but the steep learning curve of languages like Python or R. The syntax is often unforgiving, the setup is complex, and the focus is on programming constructs, not research questions.
We envisioned a language where the code was self-explanatory. A language where you could write:
# This is Talea code, not just a comment!
load "pride_and_prejudice.txt" as pride
tokenize pride as words
count words in words as word_count
print word_count
To achieve this, we knew Talea couldn't be just another language. It had to be a smart, user-friendly "glue" that connected a simple interface to a powerful backend.
Choosing the Right Tools: Rust + Python = A Perfect Match
Our biggest architectural decision was how to build Talea's core. We needed two things: a rock-solid, high-performance engine for the language itself, and access to the world's best Natural Language Processing (NLP) libraries.
The choice became clear:
The Core Engine: Rust. For the language interpreter, we chose Rust. It's famous for its performance, which is on par with C++, but more important are its compile-time memory-safety guarantees. When you're building a bridge between different programming languages, safety isn't a feature; it's a necessity. Rust's robust tooling and helpful compiler messages were the icing on the cake.
The NLP Backend: Python. The Python ecosystem is home to incredible NLP libraries like spaCy and NLTK. Reinventing these would be impossible. Our goal was not to replace Python, but to provide a new, simpler way to access its power.
The plan was set: build an interpreter in Rust that could call out to Python to perform the heavy lifting of text analysis.
From Zero to REPL: Building the Skeleton
Every programming language starts with two fundamental components: a Lexer and a Parser.
The Lexer scans the source code (load "file.txt" as data) and breaks it down into a stream of tokens, like [Load, String("file.txt"), As, Identifier("data")]. We built a lexer that recognizes a large vocabulary of humanities-centric keywords like lemmatize, concordance, filter, and tag.
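To make that concrete, here is a toy sketch of how such a token stream could be modeled in Rust. The names here are hypothetical and deliberately simplified: Talea's real vocabulary is far larger, and a real lexer would properly handle quoted strings containing spaces, comments, and the rest.

// A toy model of Talea-style tokens (hypothetical names; the real
// token set is much larger).
#[derive(Debug, PartialEq)]
enum Token {
    Load,
    As,
    Tokenize,
    Count,
    Print,
    String(String),
    Identifier(String),
}

// A deliberately naive lexer: split on whitespace and classify each word.
fn lex(source: &str) -> Vec<Token> {
    source
        .split_whitespace()
        .map(|word| match word {
            "load" => Token::Load,
            "as" => Token::As,
            "tokenize" => Token::Tokenize,
            "count" => Token::Count,
            "print" => Token::Print,
            w if w.starts_with('"') => Token::String(w.trim_matches('"').to_string()),
            w => Token::Identifier(w.to_string()),
        })
        .collect()
}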
The Parser takes this stream of tokens and organizes it into a structured representation called an Abstract Syntax Tree (AST). This tree understands the grammar of the language—it knows that a load command needs a file path and a variable name.
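In Rust, those grammar rules fall naturally out of the type system. A hypothetical slice of an AST for the commands above might look like this (a sketch, not Talea's actual definitions):

// Each statement records exactly the pieces its grammar requires.
enum Statement {
    Load { path: String, binding: String },        // load "pride_and_prejudice.txt" as pride
    Tokenize { source: String, binding: String },  // tokenize pride as words
    Count { source: String, binding: String },     // count words in words as word_count
    Print { binding: String },                     // print word_count
}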
With these two pieces in place, we created a simple REPL (Read-Eval-Print Loop), an interactive command line that could read Talea code, parse it into an AST, and print the result. At this stage, Talea could understand our language, but it couldn't do anything with it.
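A REPL skeleton in Rust needs nothing beyond standard I/O. Reusing the toy lex from the sketch above, it might look roughly like this (parsing into the AST would slot in where the comment indicates):

use std::io::{self, BufRead, Write};

fn repl() -> io::Result<()> {
    let stdin = io::stdin();
    loop {
        // Read: prompt for one line of Talea code.
        print!("talea> ");
        io::stdout().flush()?;
        let mut line = String::new();
        if stdin.lock().read_line(&mut line)? == 0 {
            break; // Ctrl-D / end of input
        }
        // Eval: run the front end. A real REPL would parse the tokens
        // into an AST here and then walk it.
        let tokens = lex(line.trim());
        // Print: at this stage, just show what the interpreter understood.
        println!("{:?}", tokens);
    }
    Ok(())
}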
The Bridge: Calling Python from Rust with Pyo3
This was the most exciting and challenging part: making Rust and Python talk to each other. We used the brilliant pyo3 crate, which provides the "Foreign Function Interface" (FFI) between the two worlds.
With pyo3, our Rust interpreter could finally execute a command like tag my_article with ner as entities:
Talea Interpreter (Rust): Identifies the tag command and the variable my_article.
FFI Bridge (pyo3): Takes the text content from the Rust variable and passes it into an embedded Python instance.
Python Backend: Inside the Python instance, our code executes the equivalent of:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text_from_rust)
entities = [(ent.text, ent.label_) for ent in doc.ents]
FFI Bridge (pyo3): The list of entities is passed back to Rust.
Talea Interpreter (Rust): The result is stored in a new Talea variable named entities.
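Put together on the Rust side, the round trip looks roughly like this. It's a minimal sketch against a pyo3 0.20-era API (method names shift between pyo3 versions), and it assumes pyo3's auto-initialize feature so the embedded Python interpreter starts on first use:

use pyo3::prelude::*;

// Hand a piece of text to the embedded Python interpreter, run spaCy's
// NER over it, and bring the (text, label) pairs back into Rust.
fn tag_entities(text: &str) -> PyResult<Vec<(String, String)>> {
    Python::with_gil(|py| {
        // import spacy; nlp = spacy.load("en_core_web_sm")
        let spacy = py.import("spacy")?;
        let nlp = spacy.call_method1("load", ("en_core_web_sm",))?;
        // doc = nlp(text)
        let doc = nlp.call1((text,))?;
        // [(ent.text, ent.label_) for ent in doc.ents]
        let mut entities = Vec::new();
        for ent in doc.getattr("ents")?.iter()? {
            let ent = ent?;
            entities.push((
                ent.getattr("text")?.extract()?,
                ent.getattr("label_")?.extract()?,
            ));
        }
        Ok(entities)
    })
}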
This entire complex process is completely hidden from the user. All they see is one simple, readable command that just works. The placeholder commands from our skeleton stage were now backed by real, powerful logic.
Where We Are Today
Talea is no longer just an idea. It's a working interpreter with a rich, expressive vocabulary. It can read and write files, perform calculations, and, most importantly, leverage Python's spaCy library to perform advanced Named Entity Recognition and lemmatization.
Our journey from a simple concept to a functional, multi-language interpreter has been a powerful lesson in building user-centric tools. It proves that with the right design choices, we can democratize access to powerful technologies and build bridges between different fields of research.
The road ahead is long—there are many more commands to implement and backends to add—but the foundation is set. Talea is ready for its story to be written, one command at a time.