I want pretty documentation for my tappy project, and syntax highlighting code samples helps make software documentation pretty. For tappy, code samples can include Python or Test Anything Protocol (TAP) output. Unfortunately, the syntax highlighter for tappy’s documentation, Pygments, did not know how to highlight TAP. That smelled like a fun project to me.
A lexer’s job is to parse data
and break it into tokens.
Tokens are the abstract objects that you might expect
in a programming language.
Some examples within Pygments include Keyword, Comment, and String tokens.
Once a lexer breaks data into tokens,
those tokens can be passed to a formatter
for stylizing output.
For example, a formatter might color every
Keyword token green.
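
To make that hand-off concrete, here is a small sketch that uses Pygments' lex and highlight functions with its built-in Python lexer and HTML formatter. The sample source string is made up for illustration; the color itself comes from whatever stylesheet is paired with the generated markup.

```python
from pygments import highlight, lex
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer
from pygments.token import Keyword

# Made-up sample input for illustration.
source = "if ready:\n    print('go')\n"

# Step 1: the lexer breaks the source into (token type, text) pairs.
tokens = list(lex(source, PythonLexer()))
keywords = [text for token_type, text in tokens if token_type in Keyword]
print(keywords)

# Step 2: a formatter turns those tokens into styled output.
# The HTML formatter wraps each token in a span whose CSS class
# identifies the token type, so a stylesheet can color it.
html = highlight(source, PythonLexer(), HtmlFormatter())
print(html)
```

In the HTML output, the keyword ends up in a span with its own CSS class, which is the hook a style uses to color every Keyword token the same way.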
Pygments’ primary tool for creating new lexers
is its RegexLexer class,
which you subclass.
With this lexer,
you define a series of regular expressions
and map them to tokens.
Take TAP comments as an example.
TAP comments are lines that start with a hash character, #.
A regular expression matches that pattern
and pairs the matching content with a Comment token.
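
As a rough sketch of what such a rule can look like, the cut-down lexer below knows only about comment lines and treats everything else as plain text. The class name TapCommentLexer, the fallback rule, and the sample input are assumptions made for illustration; this is not the full TAP lexer.

```python
from pygments.lexer import RegexLexer
from pygments.token import Comment, Text


class TapCommentLexer(RegexLexer):
    """Hypothetical, cut-down lexer that only recognizes TAP comments."""

    name = "TAP comments (sketch)"

    tokens = {
        "root": [
            # A line that starts with a hash is a comment.
            (r"^#.*\n", Comment),
            # Any other line is passed through as plain text.
            (r"^.*\n", Text),
        ],
    }


# Made-up sample input for illustration.
sample = "# Running tests\nok 1 - first test\n"
tokens = list(TapCommentLexer().get_tokens(sample))
for token_type, value in tokens:
    print(token_type, repr(value))
```

Each entry in the tokens dictionary is a list of (regex, token type) pairs, and the lexer tries them in order at the current position, so the more specific comment rule has to come before the catch-all.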
That’s the core concept for the lexer,
but there is more fun to have!
There is another layer within this parsing process.
If everything could be matched with a set of regular expressions,
then the job would be over,
but languages often have context
that changes the meaning of the source data.
A contrived example would be the word if.
In one context,
if should be a Keyword token.
In another context,
if may be part of a
String token like
“if I exercise, then I can stay healthy.”
RegexLexer allows developers to handle these context changes
by providing a stack.
When a regular expression matches a certain pattern,
it can trigger a context change
and push onto the stack.
In the new context,
the lexer moves to a different set of regex patterns
that makes sense for that context.
When the context ends,
the stack is popped
and the lexer goes back to working with the original regex patterns.
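
The push and pop behavior can be sketched with a toy lexer built around the contrived if example. Everything here is hypothetical (the IfLexer class, its two states, and the sample line); it only illustrates how a rule's third element names a state to push and how "#pop" returns to the previous one.

```python
from pygments.lexer import RegexLexer
from pygments.token import Keyword, String, Text


class IfLexer(RegexLexer):
    """Hypothetical toy lexer: `if` is a keyword, except inside quotes."""

    name = "if (sketch)"

    tokens = {
        "root": [
            # An opening quote pushes the "string" state onto the stack.
            (r'"', String, "string"),
            # Outside of quotes, a standalone `if` is a keyword.
            (r"\bif\b", Keyword),
            # Everything else in this context is plain text.
            (r"\s+", Text),
            (r"\w+", Text),
            (r'[^\s\w"]+', Text),
        ],
        "string": [
            # A closing quote pops the stack, returning to "root".
            (r'"', String, "#pop"),
            # Inside quotes, everything (including `if`) is string content.
            (r'[^"]+', String),
        ],
    }


# Made-up sample input for illustration.
sample = 'if ready "if I exercise, then I can stay healthy."\n'
tokens = list(IfLexer().get_tokens(sample))
keyword_count = sum(1 for token_type, _ in tokens if token_type is Keyword)
for token_type, value in tokens:
    print(token_type, repr(value))
```

The first if is reported as a Keyword, while the second one never reaches the keyword rule because the lexer is working from the "string" patterns at that point.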
If you’re trying to absorb how this all works,
I think you should take a look at the full source
(Update: this lexer was merged into Pygments,
so it now lives
in the Pygments code
instead of tappy).
I took care to document it well,
and you can see the context shifts
as the lexer moves from one state to another.
Now that you’re equipped, go forth and make a new lexer of your own! Also, you can check out the huge array of lexers already defined in the Pygments project if you want to study the work of others.
If you want to chat about this with me, I'm @mblayman on Twitter.
Matt is the lead software engineer at Storybird.
Always eager to talk about Python and other technology topics, Matt organizes Python Frederick in Frederick, Maryland (NW of Washington D.C.) and seeks to grow software skills for people in his community.