Writing a Markdown Parser Is Hard!

February 27, 2023

Parsers are hard to write. More often than not, regular expressions are not enough to properly parse text into a data structure that a computer can easily manipulate.

Take, for instance, the Markdown markup language. It seems pretty easy at first sight. We split our text into lines and look for special characters at the start of each one:

If it begins with one or more #, then it is a header.
If it begins with >, then it is a blockquote.
If it begins with *, +, or -, then it is a list item.
Consecutive non-empty lines of plain text become part of the same paragraph.
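
The rules above can be sketched in a few lines of Python. This is only an illustration of the naive approach, not a real Markdown parser; the names classify_line and parse_blocks are made up for this example.

```python
def classify_line(line: str) -> str:
    """Classify one line using only the naive first-character rules."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("#"):
        return "header"
    if stripped.startswith(">"):
        return "blockquote"
    if stripped[0] in "*+-":
        return "list_item"
    return "paragraph"


def parse_blocks(text: str) -> list[tuple[str, list[str]]]:
    """Group consecutive paragraph lines; every other non-blank line is its own block."""
    blocks: list[tuple[str, list[str]]] = []
    paragraph_open = False
    for line in text.splitlines():
        kind = classify_line(line)
        if kind == "blank":
            # A blank line ends the current paragraph.
            paragraph_open = False
            continue
        if kind == "paragraph" and paragraph_open:
            # Continue the paragraph started on a previous line.
            blocks[-1][1].append(line.strip())
            continue
        blocks.append((kind, [line.strip()]))
        paragraph_open = kind == "paragraph"
    return blocks
```

Even this tiny version already trips over an edge case: a line such as "*emphasis* at line start" begins with * and is therefore classified as a list item, although the author most likely meant a paragraph.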

Well, that wasn’t too hard, but we still have a lot to cover, such as inline markup and links. This is when the edge cases begin and things start to get complicated.

Should the parser allow an inline style to span multiple words? What about styling the middle of a word? How should we handle nested styles? Can you style *links*? What if the link text contains special characters? Then your parser needs a way of escaping characters. On top of that, lists and blockquotes can be nested, and blockquotes can contain additional Markdown elements.
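
To make the escaping question concrete, here is a small Python sketch. It assumes that * marks emphasis and that a backslash escapes the next character; both function names are hypothetical, and neither version handles nesting or escapes inside a styled span.

```python
import re

# Naive approach: anything between two asterisks becomes emphasis.
EMPHASIS = re.compile(r"\*(.+?)\*")

# Escape-aware approach: match either a backslash escape or an emphasis span,
# so an escaped asterisk is consumed before it can open a span.
TOKEN = re.compile(r"\\(.)|\*([^*\\]+)\*")


def naive_emphasis(text: str) -> str:
    return EMPHASIS.sub(r"<em>\1</em>", text)


def emphasis_with_escapes(text: str) -> str:
    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            # Escaped character: emit it literally, without the backslash.
            return match.group(1)
        return f"<em>{match.group(2)}</em>"

    return TOKEN.sub(replace, text)
```

Both functions turn "a *styled phrase* here" into "a <em>styled phrase</em> here", but only the second one leaves the escaped asterisks in "2 \* 3 \* 4" alone. The naive version treats them as emphasis markers and wraps the 3 in an <em> tag.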

There are many questions that need to be answered before writing your parser. Too many questions. Thankfully, there are established Markdown standards with extensive test suites that define how a Markdown document should be translated to HTML. Some standards include extra markup for tables, definition lists, ways of specifying HTML attributes for headers and preformatted text, etcetera.

After all this, you realize that the simple Markdown parser that scans a few tokens and makes some string replacements was a naive approach.

Another feature of Markdown is that you can write it alongside normal HTML, which leads us to the realization that a Markdown parser is also an HTML parser.

That is, your parser should know when something is HTML code and render it as-is; if it’s not HTML, it should escape the special characters.
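
A rough Python sketch of that decision, made one line at a time, could look like this. The regular expression is a crude heuristic invented for this example (a line counts as HTML if it starts with something shaped like a tag), and render_line is a hypothetical name; real specifications use much more elaborate rules to recognize HTML.

```python
import html
import re

# Heuristic: optional whitespace, then an opening or closing tag such as
# <div class="note">, </div> or <br/>.
HTML_LINE = re.compile(r"\s*</?[A-Za-z][A-Za-z0-9-]*(\s[^<>]*)?/?>")


def render_line(line: str) -> str:
    if HTML_LINE.match(line):
        # Looks like HTML: pass it through untouched.
        return line
    # Plain text: escape &, < and > so they show up literally in the output.
    return html.escape(line, quote=False)
```

With this heuristic, '<div class="note">' passes through unchanged, while "if a < b && b > c" and "<3 markdown" are escaped. It is easy to find input that fools it, which is exactly why the HTML side of a Markdown parser takes real work.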

The Markdown spec is a complicated spec for a language that is very simple to write. A look at the CommonMark spec is enough to deter someone from writing a Markdown parser as a weekend project. Thankfully, you will probably never have to write a Markdown parser, as pretty much every language has an implementation of it. Still, that doesn’t mean you shouldn’t try writing one. Writing a parser (of any kind) is a great way of learning about abstract syntax trees, which are fundamental in the design of programming languages.