Bzo Devlog #4: Another Parser Simplification, Backend Stuff
It’s been a while since I’ve done a devlog for my programming language, Bzo. Here’s the original post for those who are new:
Progress was a bit slow for a couple of months, though it has sped up over the past few weeks, and I once again wound up reimplementing the parser. This time, however, the parser is much simpler than previous versions; it’s a version I’m much happier with, and one I’ll be able to repurpose relatively easily for other projects.
It would need a bit more polishing up, and there are a few features I’d like to add to it first, but I might end up open-sourcing it in the future. It might not fit perfectly into your projects (it’s written in C), but it might function as some good example code for other parser implementations.
Some Thoughts on Parsers
There is an enormous amount of parsing theory that you are encouraged to study when embarking on efforts to build a compiler. You are bombarded with jargon about context-free grammars, LR and LALR(n) parsers, top-down, bottom-up, and so on. You’re often encouraged to use a parser generator, which often takes the form of a language for describing the syntax of other languages, which can be transpiled into a machine-generated source file that your program can include and make calls into.
Existing parsing theory grew out of computational linguistics efforts to mathematically formalize natural language, and to do so well enough that even computers could understand it. Given that one of the biggest subjects of public discussion over the past couple of months has been LLMs like ChatGPT and Bing Chat, that these technologies are still not perfect, and that they bear zero resemblance whatsoever to the work done by researchers like Chomsky, I think it's very safe to say that such theoretical efforts largely failed at this goal.
If this research actually had succeeded, ChatGPT would have been built decades ago. Instead, it is largely only useful for studying and comparing some of the more obvious structure of language, but many of the more subtle ways that humans compose meaning are largely beyond it.
Much of early parsing theory, particularly that which aimed to be applied in compilers, revolved around problems that were significant in a time of tight computing resources: very little memory (for both programs and code), and very slow computation. A great deal of work was spent on things like minimizing how far a parser has to look ahead into a stack of unprocessed tokens, reducing parsers to mere finite-state machines whenever possible, doing black-magic trickery with parser rules to get them to handle operator precedence implicitly, or making sure that your language is simple enough that parsing it isn’t Turing-complete.
That last part has some merit to it - a Turing-complete parser might be very tricky to implement, very difficult to reason about, and might even get into infinite loops! However, there are many different ways of framing parsing that avoid such complexity, and the ones chosen by researchers in the 60s and 70s in retrospect seem to have less to do with solid theoretical reasons or specific needs of a programming language syntax, and more to do with trying to save a few bytes of memory here or there, even at the cost of significantly complicating the theory.
Most of this research as a result is irrelevant for computers with gigabytes of RAM, but you’re still burdened by it if you wish to apply such research in your compiler.
We humans usually think in terms of our tools, and I figure parsing theory has shaped programming language syntax to a far greater extent than we might expect. Without as much overly-rigid theory, programming language syntax might look very different today, being a lot more open to long-range or contextual structure, which these formal models handle very poorly.
The right tool for the job is a tool that reflects the specific problem. A generic, one-size-fits-all solution often fits no problem particularly well, and in difficult situations may force the problem to be changed to one that better fits the “solution”, when an alternative approach could have solved the original problem with no compromise.
The New Bzo Parser
The core approach to the new parser for Bzo is to focus on situations that involve some form of nesting or obvious grouping. The primary case involves paired tokens, such as parentheses or curly braces.
The parser starts out with a quick first pass to scan over all generated tokens to find matching pairs. It then builds up a data structure consisting of arrays of tokens with special tokens for such pairs. A subexpression inside parentheses becomes a single “parentheses pair token” which points to another array of tokens of its contents. This can form a tree with arbitrary nesting and branching factors.
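To make the pair-matching pass concrete, here’s a minimal sketch of how folding a flat token stream into a tree of arrays might look. The token kinds, struct layout, and function names here are all hypothetical; the real Bzo parser’s internals may well differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Token kinds: a real parser would have many more. */
typedef enum { TK_ID, TK_OPEN_PAREN, TK_CLOSE_PAREN, TK_PAIR } TokenKind;

typedef struct Token {
    TokenKind     kind;
    struct Token *sub;     /* contents of a TK_PAIR token */
    int           sub_len; /* number of tokens in sub */
} Token;

/* Fold a flat token stream into a tree: each "( ... )" group becomes a
   single TK_PAIR token pointing at an array of its contents. Consumes
   tokens from in[*pos..len), writes results to out, and returns how
   many output tokens were produced. */
static int match_pairs(const Token *in, int len, int *pos, Token *out) {
    int n = 0;
    while (*pos < len) {
        Token t = in[(*pos)++];
        if (t.kind == TK_CLOSE_PAREN)
            break;                          /* end of this group */
        if (t.kind == TK_OPEN_PAREN) {
            Token *sub = malloc(sizeof(Token) * (size_t)len);
            out[n].kind    = TK_PAIR;
            out[n].sub_len = match_pairs(in, len, pos, sub);
            out[n].sub     = sub;
            n++;
        } else {
            out[n++] = t;
        }
    }
    return n;
}
```

Given tokens for something like `a (b c) d`, this produces three top-level tokens, the middle one being a pair token whose `sub` array holds the two inner tokens.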
Parsing rules can then look for simple patterns in each array of tokens, matching them with different language constructs. The patterns matched on subexpressions can also be used for extra context, assuming those subexpressions have already been parsed.
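A parsing rule over one of these token arrays can then be little more than a predicate. This is a purely illustrative sketch, not Bzo’s actual rules: the token kinds and the “keyword, identifier, pair” shape are invented for the example.

```c
#include <assert.h>

/* Hypothetical token kinds for the example. */
typedef enum { TK_KEYWORD, TK_ID, TK_PAIR } TokenKind;

typedef struct { TokenKind kind; } Token;

/* Does this token array look like: KEYWORD IDENTIFIER ( ... ) ?
   A rule like this fires without ever looking inside the pair. */
static int matches_def(const Token *toks, int len) {
    return len >= 3
        && toks[0].kind == TK_KEYWORD
        && toks[1].kind == TK_ID
        && toks[2].kind == TK_PAIR;
}
```

Because a whole parenthesized group is a single token at this level, the rule stays flat: no lookahead machinery, no grammar tables, just a few comparisons.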
In addition to this, there is also support for splitting arrays of tokens into subarrays when separated by a specified token such as a comma, semicolon, or assignment operator. I’m planning to eventually add support for keywords being used to open and implicitly close previous pairs, which would be useful for the problem of making a syntax that isn’t forced to look like Lisp.
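The separator-splitting pass can be sketched just as simply. Again, the names and representation here are assumptions made for the example, not the actual Bzo code; splitting is shown as recording start/length slices into a token array rather than copying tokens.

```c
#include <assert.h>

/* Hypothetical token kinds for the example. */
typedef enum { TK_ID, TK_COMMA, TK_SEMICOLON } TokenKind;

typedef struct { TokenKind kind; } Token;

/* Split a token array into subarrays wherever a separator token (e.g.
   a comma) appears. Records the start index and length of each piece;
   returns the number of pieces. starts/lens must hold len+1 entries. */
static int split_on(const Token *toks, int len, TokenKind sep,
                    int *starts, int *lens) {
    int count = 0, begin = 0;
    for (int i = 0; i <= len; i++) {
        if (i == len || toks[i].kind == sep) {
            starts[count] = begin;
            lens[count]   = i - begin;
            count++;
            begin = i + 1;
        }
    }
    return count;
}
```

So a sequence like `a, b c, d` splits into three slices of lengths 1, 2, and 1, each of which can then be handed to the pattern-matching rules independently.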
One nice advantage of this is that you can make many high-level parsing decisions (is this a function or type definition?) much earlier and more easily. Function and type definitions of course are a trivial example where a keyword might make the answer obvious from the beginning, but there are many more niche and more ambiguous cases that can crop up when getting into more specific syntax.
If you want to make a decision based on something on the other side of a pair of parentheses, you don’t have to completely parse everything between them first in order to do so.
Is this parser the most flexible parser in the world? No, not on its own. I expect I’ll probably need an additional pass over certain expressions if I want to support infix operators. For now I’m just using a Lisp-like syntax.
However, is it simple to write, use, and most importantly debug? Absolutely! While the current parser is a bit limited (I’m leaving out a lot of features for now), the entire parser (not including the lexer) is only around 900 lines of code. Previous versions of the compiler had a couple thousand lines of parser code.
Having a few different passes that are largely uncoupled from each other’s structure might also, down the road, allow for syntax decisions that many other languages would shy away from.
The proper way to approach a problem is rarely to build the most general-purpose solution possible, but rather to look at specific needs and to build around solving those well. Code involves a lot of nested expressions. If you focus on making it easy to detect those, you wind up with something much simpler and easier to reason about than if you tried to contort it into whatever shape some arcane mathematics demands.
After the paywall, I’ll be talking a bit about work I’ve been doing on the compiler backend. Progress there has been going well, with much of this happening in just the past few weeks.