Markdent is my new event-driven Markdown parser toolkit, but why should you care?
First, let’s talk about Markdown. Markdown is yet another wiki-esque format for marking up plain text. What makes Markdown stand out is its emphasis on usability and “natural” usage. Its syntax is based on things people have been doing to “mark up” plain text email for years.
For example, if you wanted to list some items in a plain text email, you’d write something like:
* List item 1
* List item 2
* List item 3
Well, this is how it works in Markdown too. Want to emphasize some text? Wrap it in *asterisks* or _underscores_.
So why do you need an event-driven parser toolkit for dealing with Markdown? CPAN already has several modules for dealing with Markdown, most notably Text::Markdown.
The problem with Text::Markdown is that all you can do with it is generate HTML, but there’s so much more you could do with a Markdown document.
If you’re using Markdown for an application (like a wiki), you may need to generate slightly different HTML for different users. For example, maybe logged-in users see documents differently.
But what if you want to cache parsing in order to speed things up? If you’re going straight from Markdown to HTML, you’d need to cache the resulting HTML for each type of user (or even for each individual user in the worst case).
With Markdent, you can cache an intermediate representation of the document as a stream of events. You can then replay this stream back to the HTML generator as needed.
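To make the idea concrete, here is a minimal sketch of capture and replay. It is written in Python rather than Perl and does not use Markdent's actual API: the event names, the `HTMLGenerator` class, the `logged_in` flag, and the `replay` helper are all hypothetical, and `pickle` stands in for whatever serializer you use to cache the events. The event list is written by hand where a real parser would produce it.

```python
import html
import pickle

# Hand-written event stream for "Hello *world*" (a parser would emit this).
EVENTS = [
    ("start_paragraph", {}),
    ("text", {"text": "Hello "}),
    ("start_emphasis", {}),
    ("text", {"text": "world"}),
    ("end_emphasis", {}),
    ("end_paragraph", {}),
]

# Hypothetical mapping from event names to HTML tags.
TAGS = {"paragraph": "p", "emphasis": "em"}


class HTMLGenerator:
    """Turns events into HTML; logged_in changes the output per user type."""

    def __init__(self, logged_in=False):
        self.logged_in = logged_in
        self.parts = []

    def handle_event(self, name, attrs):
        if name == "text":
            self.parts.append(html.escape(attrs["text"]))
        elif name == "start_paragraph" and self.logged_in:
            # Illustrative per-user difference: an extra class on paragraphs.
            self.parts.append('<p class="member">')
        elif name.startswith("start_"):
            self.parts.append("<" + TAGS[name[len("start_"):]] + ">")
        elif name.startswith("end_"):
            self.parts.append("</" + TAGS[name[len("end_"):]] + ">")

    def result(self):
        return "".join(self.parts)


def replay(frozen, handler):
    """Thaw a cached event stream and feed each event to the handler."""
    for name, attrs in pickle.loads(frozen):
        handler.handle_event(name, attrs)
    return handler


# Parse once (simulated), cache the serialized events, replay per user type.
cached = pickle.dumps(EVENTS)
anonymous_html = replay(cached, HTMLGenerator()).result()
member_html = replay(cached, HTMLGenerator(logged_in=True)).result()
```

The point of the sketch is that only one thing is cached, the event stream, while each kind of user gets HTML generated on demand from that same cache.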
What’s the Impact of Caching?
Here’s a benchmark comparing three approaches.
- Use Markdent to parse the document and generate HTML from scratch each time.
- Use Text::Markdown to generate HTML from scratch each time.
- Use Markdent to parse the document once, then use Storable to store the event stream. When generating HTML, thaw the event stream and replay it back to the HTML generator.
| | Rate | parse from scratch | Text::Markdown | replay from captured events |
|---|---|---|---|---|
| parse from scratch | 1.07/s | – | -67% | -83% |
| Text::Markdown | 3.22/s | 202% | – | -48% |
| replay from captured events | 6.13/s | 475% | 91% | – |
This benchmark is included in the Markdent distro. One thing to note about this benchmark is that it parses 23 documents from the mdtest test suite, and those documents are mostly pretty short.
If I benchmark just the largest document in mdtest, the numbers change a bit:
| | Rate | parse from scratch | Text::Markdown | replay from captured events |
|---|---|---|---|---|
| parse from scratch | 2.32/s | – | -58% | -84% |
| Text::Markdown | 5.52/s | 138% | – | -63% |
| replay from captured events | 14.8/s | 538% | 168% | – |
Markdent probably fares better on a single large document because every parse requires constructing a number of objects, which is a fixed cost per document. With 23 documents we pay that cost 23 times. When we parse one document, the actual speed of parsing becomes more important, as does the speed of not parsing.
What Else?
But there’s more to Markdent than caching. One feature that a lot of wikis have is “backlinks”: a list of pages that link to the current page. With Markdent, you can write a handler that only looks at links. You can use this to capture all the links and generate your backlink list.
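A links-only handler can be very small. The following is a rough sketch in Python, not Markdent's Perl API: the `start_link` event name, its `uri` attribute, and the `LinkCollector` and `update_backlinks` names are assumptions made for illustration, and the event streams are written by hand instead of coming from a parser.

```python
class LinkCollector:
    """Handler that ignores every event except the start of a link."""

    def __init__(self):
        self.links = []

    def handle_event(self, name, attrs):
        if name == "start_link":
            self.links.append(attrs["uri"])


def update_backlinks(backlinks, page, events):
    """Record, for each link target found in `events`, that `page` links to it."""
    collector = LinkCollector()
    for name, attrs in events:
        collector.handle_event(name, attrs)
    for target in collector.links:
        backlinks.setdefault(target, set()).add(page)


# Hand-written event streams standing in for two parsed wiki pages.
PAGES = {
    "/wiki/Home": [
        ("text", {"text": "See "}),
        ("start_link", {"uri": "/wiki/Markdown"}),
        ("text", {"text": "Markdown"}),
        ("end_link", {}),
    ],
    "/wiki/Tools": [
        ("start_link", {"uri": "/wiki/Markdown"}),
        ("text", {"text": "syntax"}),
        ("end_link", {}),
        ("start_link", {"uri": "/wiki/Home"}),
        ("text", {"text": "home"}),
        ("end_link", {}),
    ],
}

# Map each link target to the set of pages that link to it.
backlinks = {}
for page, events in PAGES.items():
    update_backlinks(backlinks, page, events)
```

Looking up a page in `backlinks` then gives the set of pages that link to it, which is exactly the backlink list.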
How about a full text search engine? Maybe you’d like to give a little more weight to titles than other text. You can write a handler which collects title text and body text separately, then feed that into your full text search tool.
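Here is one way such a handler might look, again as a Python sketch under assumed names and not Markdent's real interface. The `start_header` and `end_header` events, the `SearchTextCollector` class, and the weight of 3 for title words are all made up for the example; a real search tool would have its own way of accepting weighted fields.

```python
class SearchTextCollector:
    """Collects text inside headers separately from all other text."""

    def __init__(self):
        self.title_text = []
        self.body_text = []
        self._header_depth = 0

    def handle_event(self, name, attrs):
        if name == "start_header":
            self._header_depth += 1
        elif name == "end_header":
            self._header_depth -= 1
        elif name == "text":
            # Text seen while a header is open counts as title text.
            target = self.title_text if self._header_depth else self.body_text
            target.append(attrs["text"])


def weighted_terms(collector, title_weight=3, body_weight=1):
    """Build a word -> weight map, counting title words more heavily."""
    weights = {}
    for texts, weight in (
        (collector.title_text, title_weight),
        (collector.body_text, body_weight),
    ):
        for text in texts:
            for word in text.lower().split():
                weights[word] = weights.get(word, 0) + weight
    return weights


# Hand-written events for a header followed by one paragraph.
EVENTS = [
    ("start_header", {"level": 1}),
    ("text", {"text": "Markdent Toolkit"}),
    ("end_header", {}),
    ("start_paragraph", {}),
    ("text", {"text": "Markdent parses Markdown"}),
    ("end_paragraph", {}),
]

collector = SearchTextCollector()
for name, attrs in EVENTS:
    collector.handle_event(name, attrs)

weights = weighted_terms(collector)
```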
There’s a theme here, which is that Markdent makes document analysis much easier.
That’s not all you can do. What about a Markdown-to-Textile converter? How about a Markdown-to-Markdown converter for canonicalization?
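A Markdown-to-Markdown converter is just a handler that writes Markdown back out. In this toy Python sketch (hypothetical event names and a hypothetical `MarkdownEmitter` class, not Markdent's API), I assume the event stream does not record which marker the author typed, so `_first_` and `*first*` produce the same events. Canonicalization then falls out of picking one spelling when emitting.

```python
class MarkdownEmitter:
    """Re-emits events as Markdown using one fixed spelling per construct."""

    def __init__(self):
        self.parts = []

    def handle_event(self, name, attrs):
        if name == "text":
            self.parts.append(attrs["text"])
        elif name in ("start_emphasis", "end_emphasis"):
            self.parts.append("*")   # always asterisks, never underscores
        elif name in ("start_strong", "end_strong"):
            self.parts.append("**")
        elif name == "start_list_item":
            self.parts.append("* ")  # always "*" bullets, never "-" or "+"
        elif name == "end_list_item":
            self.parts.append("\n")

    def result(self):
        return "".join(self.parts)


# Hand-written events for a two-item list; the source might have been
# "- _first_" and "+ __second__", but the events look the same either way.
EVENTS = [
    ("start_list_item", {}),
    ("start_emphasis", {}),
    ("text", {"text": "first"}),
    ("end_emphasis", {}),
    ("end_list_item", {}),
    ("start_list_item", {}),
    ("start_strong", {}),
    ("text", {"text": "second"}),
    ("end_strong", {}),
    ("end_list_item", {}),
]

emitter = MarkdownEmitter()
for name, attrs in EVENTS:
    emitter.handle_event(name, attrs)

canonical = emitter.result()
```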
Because Markdent is modular and pluggable, if you can think of it, you can probably do it.
I haven’t even touched on extending the parser itself. That’s still a much rougher area, but it’s not that hard. The Markdent distro includes an implementation of a dialect called “Theory”, based on some Markdown extension proposals by David Wheeler.
This dialect is implemented by subclassing the Standard dialect parser classes and providing some additional event classes to represent table elements.
I hope that other people will pick up on Markdent and write their own dialects and handlers. Imagine a rich ecosystem of tools for Markdown comparable to what’s available for XML or HTML. This would make an already useful markup language even more useful.