Markdent is my new event-driven Markdown parser toolkit, but why should you care?

First, let’s talk about Markdown. Markdown is yet another wiki-esque format for marking up plain text. What makes Markdown stand out is its emphasis on usability and “natural” usage. Its syntax is based on things people have been doing to “mark up” plain text email for years.

For example, if you wanted to list some items in a plain text email, you’d write something like:

* List item 1
* List item 2
* List item 3

Well, this is how it works in Markdown too. Want to emphasize some text? Wrap it in asterisks or _underscores_.

So why do you need an event-driven parser toolkit for dealing with Markdown? CPAN already has several modules for dealing with Markdown, most notably Text::Markdown.

The problem with Text::Markdown is that all you can do with it is generate HTML, but there’s so much more you could do with a Markdown document.

If you’re using Markdown for an application (like a wiki), you may need to generate slightly different HTML for different users. For example, maybe logged-in users see documents differently.

But what if you want to cache parsing in order to speed things up? If you’re going straight from Markdown to HTML, you’d need to cache the resulting HTML for each type of user (or even for each individual user in the worst case).

With Markdent, you can cache an intermediate representation of the document as a stream of events. You can then replay this stream back to the HTML generator as needed.
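The pattern is simple, even though Markdent’s real event objects are richer than this. Here’s a self-contained sketch, with plain hashes standing in for Markdent’s event classes and a hypothetical tag map standing in for the HTML generator, showing how a frozen event stream can be thawed and replayed:

```perl
use strict;
use warnings;

use Storable qw( freeze thaw );

# A parse result represented as a stream of events. Markdent's events are
# objects; plain hashes keep this sketch self-contained.
my @events = (
    { type => 'start_paragraph' },
    { type => 'text', text => 'Hello, ' },
    { type => 'start_strong' },
    { type => 'text', text => 'world' },
    { type => 'end_strong' },
    { type => 'end_paragraph' },
);

# Cache the stream as a byte string - this could live in a file,
# memcached, a database, etc.
my $frozen = freeze( \@events );

# Later, thaw the stream and replay it against an HTML generator. Here
# the "generator" is just a hash from event types to tags.
my %html_for = (
    start_paragraph => '<p>',
    end_paragraph   => '</p>',
    start_strong    => '<strong>',
    end_strong      => '</strong>',
);

my $html = q{};
for my $event ( @{ thaw($frozen) } ) {
    $html .=
        $event->{type} eq 'text'
        ? $event->{text}
        : $html_for{ $event->{type} };
}

print "$html\n";    # <p>Hello, <strong>world</strong></p>
```

The point is that the expensive step (parsing) happens once, while the cheap step (replaying) happens as often as you like, against whatever handler you like.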

What’s the Impact of Caching?

Here’s a benchmark comparing three approaches.

  1. Use Markdent to parse the document and generate HTML from scratch each time.
  2. Use Text::Markdown to go straight from Markdown to HTML each time.
  3. Use Markdent to parse the document once, then use Storable to store the event stream. When generating HTML, thaw the event stream and replay it back to the HTML generator.

                                  Rate  parse from scratch  Text::Markdown  replay from captured events
    parse from scratch          1.07/s                  --            -67%                         -83%
    Text::Markdown              3.22/s                202%              --                         -48%
    replay from captured events 6.13/s                475%             91%                           --

This benchmark is included in the Markdent distro. One feature to note about this benchmark is that it parses 23 documents from the mdtest test suite. Those documents are mostly pretty short.

If I benchmark just the largest document in mdtest, the numbers change a bit:

                                  Rate  parse from scratch  Text::Markdown  replay from captured events
    parse from scratch          2.32/s                  --            -58%                         -84%
    Text::Markdown              5.52/s                138%              --                         -63%
    replay from captured events 14.8/s                538%            168%                           --

Markdent probably fares better on large documents because each new parse requires constructing a number of objects. With 23 documents, we pay that construction cost 23 times. When we parse one large document, the actual speed of parsing becomes more important, as does the speed of not parsing at all.

What Else?

But there’s more to Markdent than caching. One feature that a lot of wikis have is “backlinks”, a list of the pages that link to the current page. With Markdent, you can write a handler that only looks at links. You can use this to capture all the links and generate your backlink list.
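As a sketch of the idea (the event shapes and class name here are illustrative, not Markdent’s actual API), a link-collecting handler simply ignores every event except links:

```perl
use strict;
use warnings;

# A handler that ignores everything except link events. The event shapes
# and the handle_event method are illustrative, not Markdent's real API.
package LinkCollector;

sub new { return bless { links => [] }, shift }

sub handle_event {
    my ( $self, $event ) = @_;

    push @{ $self->{links} }, $event->{uri}
        if $event->{type} eq 'start_link';
}

sub links { return @{ $_[0]->{links} } }

package main;

my $collector = LinkCollector->new;

# Feed the collector an event stream, as a parser would.
$collector->handle_event($_)
    for { type => 'start_paragraph' },
        { type => 'start_link', uri => '/SomePage' },
        { type => 'text', text => 'some page' },
        { type => 'end_link' },
        { type => 'start_link', uri => '/OtherPage' },
        { type => 'text', text => 'another page' },
        { type => 'end_link' },
        { type => 'end_paragraph' };

print join( ', ', $collector->links ), "\n";    # /SomePage, /OtherPage
```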

How about a full text search engine? Maybe you’d like to give a little more weight to titles than other text. You can write a handler which collects title text and body text separately, then feed that into your full text search tool.

There’s a theme here, which is that Markdent makes document analysis much easier.

That’s not all you can do. What about a Markdown-to-Textile converter? How about a Markdown-to-Markdown converter for canonicalization?

Because Markdent is modular and pluggable, if you can think of it, you can probably do it.

I haven’t even touched on extending the parser itself. That’s still a much rougher area, but it’s not that hard. The Markdent distro includes an implementation of a dialect called “Theory”, based on some Markdown extension proposals by David Wheeler.

This dialect is implemented by subclassing the Standard dialect parser classes, and providing some additional event classes to represent table elements.

I hope that other people will pick up on Markdent and write their own dialects and handlers. Imagine a rich ecosystem of tools for Markdown comparable to what’s available for XML or HTML. This would make an already useful markup language even more useful.

I’ve been working on a new project recently: Markdent, an event-driven Markdown parser toolkit.

Why? Because the existing Perl Markdown tools just aren’t flexible enough. They bundle up Markdown parsing with HTML conversion all in one API, and I need to do more than convert to HTML.

This sort of inflexibility is quite common when I look at CPAN libraries. Looking back at the Perl DateTime Project, one of my big problems with all the other date/time modules on CPAN was their lack of flexibility. If I could have added good time zone handling to an existing project way back then, I probably would have, but I couldn’t, and the Perl DateTime Project was born.

If there is one point I would hammer home to all module authors, it would be “solve small problems”. I think that the failure to do this is what leads to the inflexibility and tight coupling I see in so many CPAN distributions.

For example, I imagine that in the date/time world some people thought “I need a bunch of date math functions” or “I need to parse lots of possible date/time strings”. Those are good problems to solve, but by going straight there you lose any hope of a good API.

Similarly, with Markdown parsers, I imagine that someone thought “I’d like to convert Markdown to HTML”, so they wrote a module that does just that.

I can’t really fault their goal-focused attitudes. Personally, I sometimes find myself getting lost in digressions. For example, I’m currently writing a webapp with the goal of exploring techniques I want to use in another webapp!

But there’s a lot to be said for not going straight to your goal. I’m a big fan of breaking a problem down into smaller pieces and solving each piece separately.

For example, when it comes to Markdown, there are several distinct steps on the way from Markdown to HTML. First, we need to be able to parse Markdown; that is a step of its own. Then we need to take the results of parsing and turn them into HTML.

If we think of the problem as consisting of these pieces, a clear and flexible design emerges. We need a tool for parsing Markdown (a parser). Separately, we need a tool for converting parse results to HTML (a converter or parse result handler).

Now we need a way to connect these pieces. In the case of Markdent, the connection is an event-driven API where each event is an object and the event receiver conforms to a known API.
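Here’s a minimal illustration of that shape (again, Markdent’s real handler API is richer, and these class names are made up): two receivers conforming to the same one-method API, so either can be plugged in without the event sender changing at all:

```perl
use strict;
use warnings;

# Two handlers that conform to the same one-method API: handle_event.
# Because the sender only ever calls handle_event, either handler can be
# swapped in without the sender knowing the difference.
package HtmlGenerator;

sub new { return bless { html => q{} }, shift }

sub handle_event {
    my ( $self, $event ) = @_;

    my %tag = (
        start_paragraph => '<p>',
        end_paragraph   => '</p>',
    );
    $self->{html} .=
        $event->{type} eq 'text' ? $event->{text} : $tag{ $event->{type} };
}

sub result { return $_[0]->{html} }

package WordCounter;

sub new { return bless { count => 0 }, shift }

sub handle_event {
    my ( $self, $event ) = @_;

    # Count whitespace-separated words in text events only.
    $self->{count} += () = $event->{text} =~ /\S+/g
        if $event->{type} eq 'text';
}

sub result { return $_[0]->{count} }

package main;

# One event stream, two interchangeable receivers.
my @events = (
    { type => 'start_paragraph' },
    { type => 'text', text => 'three short words' },
    { type => 'end_paragraph' },
);

for my $handler ( HtmlGenerator->new, WordCounter->new ) {
    $handler->handle_event($_) for @events;
    print $handler->result, "\n";
}
```

The loop at the bottom is the whole point: the code driving the handlers is identical no matter which handler it is given.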

It’s easy to put these two things together and make a nice simple Markdown-to-HTML converter.

But since I took the time to break the problem down, the pieces can do other things too. For example, I can do something else with the parse results, like capture all the links or cache the intermediate result of the parsing (an event stream).

And since the HTML generator is a small piece, I can also reuse that. Having cached the event stream, I can pull it from the cache later and use it to generate HTML without re-parsing the document. In the case of Markdent, using a cached parse result to generate HTML was about six times faster in my benchmarks!

Because Markdent has small pieces, there are all sorts of interesting ways to reuse them. How about a Markdown-to-Textile converter? Or how about adding a filter which doesn’t allow any raw HTML?

We’ve all heard that loose coupling makes good APIs. But just saying that doesn’t really help you understand how to achieve loose coupling. Loose coupling comes from breaking a big problem down into small independent problems.

As you solve each problem, think about how those solutions will communicate. Design a simple API or communications protocol. You’ll know the API is simple enough if you can imagine easily swapping out each piece of the problem with another API-conformant piece. A loosely coupled API is one that makes replacing one end of the API easy.

And best of all, when you break problems down into loosely coupled pieces, you’ll make it much easier for others to contribute to and extend your tools. Moose is a great example of this. Its fancy sugar layer sits on top of loosely coupled units known as the metaclass protocol. By separating the sugar from the underlying pieces, we’ve enabled others to create a huge number of Moose extensions.

The same goes for the Perl DateTime Project. I wrote the core pieces, but there have been many, many great contributions. This wealth of extensions wouldn’t be possible without the loosely coupled core pieces and a well-defined API for communicating between components.