Highlighting new file formats with Pygments

By Matt Layman on September 3, 2015

I want pretty documentation for my tappy project, and syntax highlighting code samples helps make software documentation pretty. For tappy, code samples can include Python or Test Anything Protocol (TAP) output. Unfortunately, the syntax highlighter for tappy’s documentation, Pygments, did not know how to highlight TAP. That smelled like a fun project to me.

If you read the Pygments documentation, eventually you’ll learn that adding a new filetype means writing a new lexer.

A lexer’s job is to scan source text and break it into tokens. Tokens are the abstract pieces that you might expect in a programming language. Examples within Pygments include Comment, Number, and Operator. Once a lexer breaks the text into tokens, those tokens can be passed to a formatter that styles the output. For instance, a formatter might color every Keyword token green.
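As a rough illustration of tokens and formatters, the sketch below runs the Python lexer that ships with Pygments over a line of code and then styles a second snippet. The `GreenKeywords` class and the sample strings are made up for this example; the color simply mirrors the "green Keyword" idea above.

```python
from pygments import highlight, lex
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer
from pygments.style import Style
from pygments.token import Keyword, Number

# Lexing: break source text into (token type, text) pairs.
tokens = list(lex("x = 42  # answer\n", PythonLexer()))
for token_type, value in tokens:
    print(token_type, repr(value))

# Token types are hierarchical, so Number.Integer counts as a Number.
numbers = [value for token_type, value in tokens if token_type in Number]


class GreenKeywords(Style):
    """Hypothetical style for this sketch: keywords are green, nothing else is styled."""
    styles = {Keyword: "#008000"}


# Formatting: the formatter decides how each token type is rendered.
html = highlight(
    "if x:\n    pass\n",
    PythonLexer(),
    HtmlFormatter(style=GreenKeywords, noclasses=True),
)
print(html)
```

The lexer knows nothing about colors, and the formatter knows nothing about Python. The token types are the only thing they share.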

The primary tool Pygments provides for creating new lexers is its RegexLexer class, which you subclass. In the subclass, you define a series of regular expressions and map each one to a token type. Here is an example for TAP comments:

(r'^#.*\n', Comment),

TAP comments are lines that start with a hash character, #. The regular expression matches that pattern and pairs matching content to a Comment token.
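To show where a rule like that lives, here is a minimal sketch of a RegexLexer subclass. It is not the real TAP lexer: the class name `TapCommentSketchLexer` and the catch-all second rule are assumptions made for illustration, and everything that is not a comment is simply treated as plain text.

```python
from pygments.lexer import RegexLexer
from pygments.token import Comment, Text


class TapCommentSketchLexer(RegexLexer):
    """Illustrative only: recognizes TAP comments and nothing else."""

    name = "TAP comment sketch"

    tokens = {
        "root": [
            # A line starting with a hash character is a comment.
            (r"^#.*\n", Comment),
            # Assumed fallback for this sketch: any other line is plain text.
            (r".*\n", Text),
        ],
    }


tokens = list(TapCommentSketchLexer().get_tokens("# a comment\nok 1\n"))
for token_type, value in tokens:
    print(token_type, repr(value))
```

Rules are tried in order, so the comment rule gets the first chance at each line and the fallback only sees what is left.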

That’s the core concept for the lexer, but there is more fun to have! There is another layer within this parsing process. If everything could be matched with a set of regular expressions, then the job would be over, but languages often have context that changes the meaning of the source data. A contrived example would be the characters if. In one context, if should be a Keyword token. In another context, if may be part of a String token like “if I exercise, then I can stay healthy.”

The RegexLexer allows developers to handle these context changes by providing a stack. When a regular expression matches a certain pattern, it can trigger a context change and push onto the stack. In the new context, the lexer moves to a different set of regex patterns that makes sense for that context. When the context ends, the stack is popped and the lexer goes back to working with the original regex patterns.
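The contrived if example can be sketched with two states. This is a toy lexer, not part of Pygments or tappy: the class name `IfSketchLexer`, the `string` state name, and the rules themselves are assumptions chosen to show the push and pop. A third element in a rule names the state to push, and `'#pop'` returns to the previous state.

```python
from pygments.lexer import RegexLexer
from pygments.token import Keyword, String, Text


class IfSketchLexer(RegexLexer):
    """Illustrative only: 'if' is a keyword outside quotes and string content inside."""

    name = "If sketch"

    tokens = {
        "root": [
            # An opening quote pushes the 'string' state onto the stack.
            (r'"', String, "string"),
            (r"\bif\b", Keyword),
            (r"\s+", Text),
            (r'[^\s"]+', Text),
        ],
        "string": [
            # Inside a string, everything up to the closing quote is String.
            (r'[^"]+', String),
            # The closing quote pops the stack, returning to 'root'.
            (r'"', String, "#pop"),
        ],
    }


tokens = list(IfSketchLexer().get_tokens('if "if I exercise"\n'))
for token_type, value in tokens:
    print(token_type, repr(value))
```

The first if is matched by the keyword rule in `root`. The second one is never offered to that rule, because the lexer is in the `string` state when it reaches it.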

If you’re trying to absorb how this all works, I think you should take a look at the full source of the TAPLexer (Update: this lexer was merged into Pygments, so the code now lives in Pygments itself instead of tappy). I took care to document it well, and you can see the context shifts as the lexer moves from root to plan or root to test.

Now that you’re equipped, go forth and make a new lexer of your own! Also, you can check out the huge array of lexers already defined in the Pygments project if you want to study the work of others.

If you want to chat about this with me, I'm @mblayman on Twitter.


