Using and adapting docutils

As described in Welcome, thblog!, I use reStructuredText for the formatted pages of my blog, i.e. most of the content here. To process them, thblog relies on the de facto standard module for publishing reStructuredText documents: docutils, which is used by most tools implementing the format, such as Sphinx or Pelican.

In this post, I will introduce how thblog uses docutils, highlight the needs that are specific to thblog, and show where and how I have intervened to meet these needs.

Parsing and writing documents

The docutils documentation is incomplete, so I completed my research by looking through the following sources:

PEP 258, entitled "Docutils Design Specification", which covers the most important design decisions around docutils;
The docutils Hacker's Guide, which covers a bit more about the structure of the tool;
The docutils source directly;
Usage of docutils made in other projects using reStructuredText, such as Sphinx or Pelican.

At the most abstract level, the responsibility of docutils is to take raw input, from a Python string, byte array, or text or byte stream, and return formatted text or a byte array. As indicated in the docutils project model, this is implemented through what is called a Publisher, which runs the following steps:

Read the content from the source and make it a list of lines;
Pass the content through a Parser, in our case the reStructuredText parser, and return a node tree, with the top-level node of this tree being called a document;
Apply transforms (Transform) to the document, in order to modify the tree once the document has been parsed;
Write the document in the specified output format on the specified output stream, using a Writer.

Every generated site (e.g. thomas.touhey.fr or thomas.touhey.uk) has multiple outputs, and some of the operations above require all documents to be read first for some metadata to be extracted. Therefore, this list of operations is actually split in two:

At content detection / decoding time, read and parse the content, which produces a document tree (or "doctree"), apply transforms, and extract metadata such as the document's title, author, or list of anchors.

This is implemented in a method called Docutils.parse;
At content producing time, copy the document tree, apply post-transforms (which may be output-specific, or require information from other documents), and produce the document in the target format (usually HTML or GMI).

This is implemented in a method called Docutils.write.
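The split described above can be approximated with the stock docutils helpers alone. This is only a minimal sketch, not thblog's actual Docutils.parse and Docutils.write methods: it assumes the standalone reader's default transforms are enough, and the sample source is made up for the illustration.

```python
from docutils.core import publish_doctree, publish_from_doctree

# Made-up sample document for the illustration.
SOURCE = """\
Hello, thblog!
==============

Some *formatted* content.
"""

# Phase 1 (content detection time): read, parse and apply the default
# transforms, then extract metadata from the resulting doctree.
doctree = publish_doctree(SOURCE)
title = doctree.get("title")

# Phase 2 (content producing time): work on a copy so the parsed tree
# stays untouched for other outputs, then hand the copy to a writer.
tree_copy = doctree.deepcopy()
html = publish_from_doctree(
    tree_copy,
    writer_name="html5",
    settings_overrides={"output_encoding": "unicode"},
)
```

Output-specific post-transforms would run on the copy between the two calls of the second phase.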

The building bricks for extending docutils

In order for thblog to make better use of docutils, there are quite a few levels at which we can add or update elements:

We can add node types, as subclasses of docutils.nodes.Node and/or docutils.nodes.Element. Such nodes may last until a writer takes them into account to produce output, or may be replaced by other nodes when transforming at content producing time. Think of them as analogous to HTML tags, e.g. <paragraph>.

For example, we add the <todo> node, which will be displayed as an admonition (like notes or warnings);
We can add or update HTML/GMI translation handlers for given node types, i.e. bind a function to "we are currently visiting this node" and/or "we are currently departing this node" to add contents to the translators associated with the writers;
We can add or update reStructuredText directives.

For example, thblog adds .. post-list:: to include a list of all posts in a page;
We can add or update reStructuredText roles.

For example, thblog adds :post:`2025-03-22-hello-thblog` to reference a post;
We can add transforms;
We can add post-transforms, which are transforms executed at content producing time, i.e. when the site is loaded, rather than right after the document is loaded.

For example, thblog adds OutputFilter which ensures that <output_filter> node contents are only included for certain output, e.g. <output_filter output='html'>.

Unfortunately, some of these operations require modifying global variables defined in docutils; much of the code that allows us to do this comes from Sphinx.

Connecting documents together

By default, docutils can only process documents in isolation; it requires extensions to connect documents together and to the rest of the website.

thblog allows documents (including non-reStructuredText documents) to reference each other using anchors, which are of the namespace:reference format. Namespaces include the following:

default:

The default namespace is used for anchors defined directly in posts and pages, e.g. by doing the following:

.. _my-custom-reference:

An incredible section title!
----------------------------

In this case, the document presents the default:my-custom-reference anchor to link to this section directly.

reStructuredText documents can reference these using the :ref: role, analogous to Sphinx's :ref: role.

post:

The namespace in which posts are referenced using the YYYY-MM-DD-slug format, e.g. post:2025-03-22-hello-thblog.

reStructuredText documents can reference these using the :post: role, or the .. post:: directive for a card such as this one:

Welcome, thblog!

Published on March 22, 2025 by Thomas Touhey.

I'm switching my blog to my own static site generator, migrating from Jekyll. Why?

page:

The namespace in which pages are referenced, e.g. page:projects.

reStructuredText documents can reference these using the :page: role.

file:

The namespace in which files are referenced, e.g. file:myage.rb.

reStructuredText documents can reference these using the :download: role, analogous to Sphinx's :download: role.

static:

The namespace in which static files are referenced, e.g. static:resume.pdf.

reStructuredText documents can reference these using the :static: role.

pep:

The namespace that allows quickly referencing Python Enhancement Proposals (PEP), e.g. using the :pep: role in reStructuredText documents.

Equivalent namespaces and roles exist for RNCP (:rncp:) and RFCs (:rfc:) as well.

In order to implement these, the roles, named "reference roles", actually produce pending_reference or pending_reference_card nodes, which are then transformed by a post-transform called PendingReferenceResolver into reference and reference_card respectively; this behaviour is much like Sphinx's XRefRole producing pending_xref nodes, which are transformed into reference by a post-transform named ReferencesResolver.
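The resolution step can be pictured without docutils at all. The following is a simplified sketch using dataclasses instead of real nodes; the class names, the anchor table and its URLs are assumptions made for the example, not thblog's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingReference:
    """Stand-in for a pending_reference node."""

    anchor: str  # e.g. "post:2025-03-22-hello-thblog"
    text: Optional[str] = None  # explicit link text, if any


@dataclass
class Reference:
    """Stand-in for a resolved reference node."""

    url: str
    text: str


def resolve(pending, anchors):
    """Resolve a pending reference against a 'namespace:reference' table."""
    key = pending.anchor
    if ":" not in key:
        # Anchors without a namespace fall back to the default one.
        key = f"default:{key}"
    url, title = anchors[key]
    # Use the target's title when no explicit text was given.
    return Reference(url=url, text=pending.text or title)


# Hypothetical anchor table, gathered once all documents have been parsed.
ANCHORS = {
    "post:2025-03-22-hello-thblog": ("/posts/hello-thblog/", "Welcome, thblog!"),
    "default:my-custom-reference": ("/about/#my-custom-reference", "An incredible section title!"),
}

resolved = resolve(PendingReference("post:2025-03-22-hello-thblog"), ANCHORS)
```

This also shows why the step has to be a post-transform: the anchor table is only complete once every document has been parsed.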

For the posts page, I actually made another directive called .. post-list::, which resolves in a post-transform into multiple pending_reference_card nodes referencing posts that can then be processed.

Hacking substitutions

As with my previous blog, which used Jekyll, I wanted to be able to display my age in a computed form on my about page. After giving it some thought, I decided I wanted to implement it using reStructuredText substitutions.

The idea was that |Authors.Thomas.Age| could be translated into my age as an integer. In order to do this, I made a SubstitutionDefsMapping class that takes the environment and the original substitution definitions, and replaces doctree.substitution_defs after parsing, but before transforming.

This class returns the substitution if it exists in the original dictionary, or checks if it matches existing patterns, such as Author.<Name>.<Property>; if that is the case, it returns the property using a method.

This method allows me to add as many keys as I want, so |UTC.Date| and |UTC.Year| also exist; after all, why not!
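The lookup logic can be sketched as a read-only mapping. This is an illustration under several assumptions rather than the actual SubstitutionDefsMapping: it returns plain strings instead of substitution definition nodes, takes a table of birthdays and a fixed date instead of the environment, and the author and dates are invented.

```python
import re
from collections.abc import Mapping
from datetime import date


class ComputedSubstitutions(Mapping):
    """Explicit definitions first, then computed keys matching a pattern."""

    _AUTHOR = re.compile(r"Authors\.(?P<name>\w+)\.(?P<prop>\w+)")

    def __init__(self, original, birthdays, today):
        self._original = dict(original)
        self._birthdays = birthdays  # hypothetical: author name -> birth date
        self._today = today

    def __getitem__(self, key):
        if key in self._original:
            return self._original[key]
        match = self._AUTHOR.fullmatch(key)
        if match and match["prop"] == "Age" and match["name"] in self._birthdays:
            born = self._birthdays[match["name"]]
            # Subtract one if this year's birthday has not happened yet.
            not_yet = (self._today.month, self._today.day) < (born.month, born.day)
            return str(self._today.year - born.year - not_yet)
        if key == "UTC.Year":
            return str(self._today.year)
        raise KeyError(key)

    def __iter__(self):
        return iter(self._original)

    def __len__(self):
        return len(self._original)


subs = ComputedSubstitutions(
    {"custom": "explicit"},
    birthdays={"Alice": date(1990, 6, 15)},  # invented author and date
    today=date(2025, 3, 22),
)
```

Since Mapping derives membership tests from __getitem__, a check such as "UTC.Year" in subs also succeeds for computed keys.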

Adding a TODO admonition

When writing Sphinx documentation, I tend to use the .. todo:: directive a lot, and usually configure these to appear in the produced documentation to let the reader know when there is information missing at a given place.

This is a Sphinx extension (in sphinx.ext.todo), and does not exist in base docutils; I had to add it back! Fortunately, with thblog, this is quite easy to do:

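# Node mixins and the admonition base directive come from docutils itself;
# `extension` is thblog's own extension object (its import is not shown here).
from docutils.nodes import Admonition, Element
from docutils.parsers.rst.directives.admonitions import BaseAdmonition


@extension.docutils.node
class todo(Admonition, Element):
    """Node signalling a work in progress."""


@extension.docutils.directive("todo")
class Todo(BaseAdmonition):
    """Directive signalling a work in progress."""

    # BaseAdmonition instantiates this node class when the directive runs.
    node_class = todo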
Adding a GMI writer

One of the challenges I had for thblog was to be able to support multiple outputs for every website, including the HTTP output. One of the possibilities was an output for Gemini, an alternate "slow" web inspired by Gopher. On Gemini, the most common hypertext format is Gemtext (bearing the .gmi file extension).

Gemtext is line-oriented, and does not support inline formatting (no emphasis or text decoration). A line is one of the following:

Simple text, with an empty line counting as such;
A title, from first to third level of importance (# to ###);
A quote (or part of a quote block), with the > prefix;

A bullet point, with the * prefix (continuation lines of the same item keep the indentation), e.g.:

* First element!
* Second element, which has several lines!
  This is line 2 of the second element.
* Third element!

A link to a page or image, using the => <url> <display text> format.

These rules are overridden if we enter a preformatted block, using three backticks, until we exit it.
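The line types listed above translate into a small classifier. This is a sketch written for this post rather than code taken from thblog or from a Gemini client; the function name and the kind labels are made up, and indented continuation lines are simply reported as text.

```python
FENCE = "`" * 3  # three backticks toggle preformatted mode


def classify_gemtext(text):
    """Return one (kind, content) pair per line of a Gemtext document."""
    result = []
    preformatted = False
    for line in text.splitlines():
        if line.startswith(FENCE):
            # Entering or leaving a preformatted block.
            preformatted = not preformatted
            result.append(("toggle", line[3:].strip()))
        elif preformatted:
            # Inside a preformatted block, no other rule applies.
            result.append(("preformatted", line))
        elif line.startswith("=>"):
            parts = line[2:].strip().split(maxsplit=1)
            url = parts[0] if parts else ""
            label = parts[1] if len(parts) > 1 else url
            result.append(("link", (url, label)))
        elif line.startswith("#"):
            level = 3 if line.startswith("###") else 2 if line.startswith("##") else 1
            result.append((f"heading{level}", line[level:].strip()))
        elif line.startswith("* "):
            result.append(("item", line[2:]))
        elif line.startswith(">"):
            result.append(("quote", line[1:].strip()))
        else:
            result.append(("text", line))
    return result


sample = "\n".join([
    "# Title",
    "* First element!",
    "=> gemini://example.org/ Example",
    FENCE,
    "# not a title",
    FENCE,
    "> quoted",
])
kinds = [kind for kind, _ in classify_gemtext(sample)]
```

The preformatted check comes before every other rule except the toggle itself, which is what makes the fifth sample line come out as preformatted text rather than as a title.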

This markup is very limited compared to what the blog is able to produce, so any conversion is "best effort" on thblog's part. The following limitations are known:

Inline markup is lost, with certain exceptions;
Links are gathered, duplicates are removed, and the result is placed at the end of the document;
Reference cards are represented as a link, a text line representing the author and publication date, and a quote containing the description.
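The last two points can be illustrated with two small helpers. This is a sketch under assumptions: links are considered duplicates when they share a URL, the first label seen wins, and the card URL below is invented; the writer's actual rules may differ.

```python
def gather_links(links):
    """Render (url, label) pairs as Gemtext link lines, dropping repeats.

    Assumption: two links are duplicates when they share the same URL.
    """
    seen = set()
    lines = []
    for url, label in links:
        if url not in seen:
            seen.add(url)
            lines.append(f"=> {url} {label}")
    return lines


def render_card(url, title, byline, description):
    """Render a reference card as a link, a text line and a quote."""
    return [f"=> {url} {title}", byline, f"> {description}"]


footer = gather_links([
    ("https://www.silent-tower.net/projects/", "Lephe's projects"),
    ("gemini://example.org/", "Example"),
    ("https://www.silent-tower.net/projects/", "his projects"),
])

card = render_card(
    "gemini://example.org/posts/hello-thblog.gmi",  # invented URL
    "Welcome, thblog!",
    "Published on March 22, 2025 by Thomas Touhey.",
    "I'm switching my blog to my own static site generator, migrating from Jekyll. Why?",
)
```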

Note

For links, the writer tries to use the name of the target rather than the name used to reference it. This means, in this example:

See `his projects <lephe's projects_>`_ for more info!

.. _Lephe's projects: https://www.silent-tower.net/projects/

The name of the link at the bottom of the document will be Lephe's projects, not lephe's projects or his projects.

Also note that the default reStructuredText parser from docutils actually loses the name of the target in the original casing defined at the end of the file by default, and only keeps a normalized version for matching with the corresponding references. I had to hack around the parser to keep it at target-level under the origname attribute, then replace the ExternalTargets transform to copy the origname attribute from the target to the reforigname attribute in the reference node. Quite the mess!

However, the biggest limitation is the absence of depth. In thblog's posts and pages, you can find code blocks in bullet points, bullet lists in bullet points, code blocks in admonitions and block quotes (both represented as quotes), and so on and so forth.

I made the choice to ignore this depth limitation, and represent embedded blocks in a Markdown fashion, e.g.:

* Hello, this is a list:

  * Hello, this is a list embedded in a list!
    Isn't that incredible?
  * It is!
* Another bullet point already?

  ```
  This one has code!
  ```

As this is non-standard, it is not interpreted by any client I know of, but even in raw text form, it reads correctly to people used to Markdown and other such markups.

And voilà, thblog has a Gemini output! It is far from perfect, and some work still needs to go into the link names and image alts, but it is good enough to start using.
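The embedding shown in the example above comes down to a recursive renderer that indents child blocks by two spaces. The tuple-based tree below is a made-up stand-in for the doctree, and the function is only a sketch of the idea, not the writer's actual code.

```python
FENCE = "`" * 3  # preformatted block delimiter


def render_list(items, indent=""):
    """Render (text, children) items as nested, Markdown-like Gemtext lines.

    Each child is either ("list", items) or ("code", lines).
    """
    lines = []
    for text, children in items:
        first, *rest = text.splitlines()
        lines.append(f"{indent}* {first}")
        # Continuation lines are aligned with the text of the bullet.
        lines.extend(f"{indent}  {line}" for line in rest)
        for kind, payload in children:
            lines.append("")
            if kind == "list":
                lines.extend(render_list(payload, indent + "  "))
            else:  # "code"
                lines.append(f"{indent}  {FENCE}")
                lines.extend(f"{indent}  {line}" for line in payload)
                lines.append(f"{indent}  {FENCE}")
    return lines


ITEMS = [
    ("Hello, this is a list:", [
        ("list", [
            ("Hello, this is a list embedded in a list!\nIsn't that incredible?", []),
            ("It is!", []),
        ]),
    ]),
    ("Another bullet point already?", [
        ("code", ["This one has code!"]),
    ]),
]

rendered = "\n".join(render_list(ITEMS))
```

Rendering ITEMS reproduces the example above line for line.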

Conclusions

While docutils is the standard for parsing and rendering reStructuredText in Python, I can safely say it is a mess to use and extend:

The documentation is sparse when it exists (which is quite ironic for a language made for writing documentation!), leading to a lot of exploring code to understand which attributes are used on which elements;
Any class instance is considered to be "single-use": you only instantiate a publisher for one document, then you have to instantiate another one for the next document;
Any extension requires you to modify some global variables at some point (which is problematic if thblog is run in the same context as some other tool using docutils);
Some classes are a straight-up nightmare to override (the reStructuredText parser references functions in dictionary attributes directly, so overriding them is not enough; you also have to override these dictionaries to reference the new methods and other classes!).

However, using it once these are smoothed out by some level of abstraction in thblog allows me to share concepts with Sphinx and other tools, and import some of the extensions they define if I like them. The other way round is also true: I can port docutils elements from thblog to a Sphinx extension, for example!

Using docutils in this project allows me to test out its limits, and the interesting concepts it brings; but at some point, I may find myself building a reStructuredText parser from the ground up, while trying to avoid these pitfalls. For now, however, it works, and I should concentrate more on writing posts and other contents for my blog. :-)