thomas.touhey.uk

Using and adapting docutils

As described in "Welcome, thblog!", I use reStructuredText for the formatted pages of my blog, i.e. most of my content here. For this purpose, thblog relies on docutils, the de facto standard module for publishing reStructuredText documents, which is used by most tools implementing the format, such as Sphinx or Pelican.

In this post, I will introduce how thblog uses docutils, highlight the needs that are specific to thblog, and show where and how I have intervened to meet these needs.

Parsing and writing documents

The docutils documentation is incomplete, so I supplemented it by looking through the following elements:

At the most abstract level, the responsibility of docutils is to take raw input, from a Python string, byte array, or text or byte stream, and return a formatted text or byte array. As indicated in the docutils project model, this is implemented through what is called a Publisher, which runs the following steps:

Every generated site (e.g. thomas.touhey.fr or thomas.touhey.uk) has multiple outputs, and some of the operations above require all documents to be read first for some metadata to be extracted. Therefore, this list of operations is actually split in two:
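As a rough illustration of such a split, here is a minimal sketch using two helpers that docutils itself provides, publish_doctree and publish_from_doctree. It is not thblog's actual code, which may be organised differently; the sample document and the title lookup are assumptions made for the example.

```python
from docutils import nodes
from docutils.core import publish_doctree, publish_from_doctree

# Illustrative document, not taken from the blog.
SOURCE = """\
Hello
=====

Some text.
"""

# Phase 1: read, parse and transform the source into a document tree.
doctree = publish_doctree(SOURCE)

# Metadata can be collected here, before any output is written.
title_node = doctree.next_node(nodes.title)
title = title_node.astext() if title_node is not None else None

# Phase 2: write the tree with a writer (pseudo-XML when none is given).
output = publish_from_doctree(doctree)
text = output.decode("utf-8") if isinstance(output, bytes) else output
```

Keeping the document tree between the two phases means the first phase can run on every document before any of them is written.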

The building bricks for extending docutils

In order for thblog to make better use of docutils, there are quite a few levels at which we can add or update elements:

Unfortunately, some of these operations require modifying global variables defined in docutils; much of the code that allows us to do this comes from Sphinx.

Connecting documents together

By default, docutils can only process documents in isolation; it requires extensions to connect documents together and to the rest of the website.

thblog allows documents (including non-reStructuredText documents) to reference each other using anchors, which are of the namespace:reference format. Namespaces include the following:

default:

The default namespace is used for anchors defined directly in posts and pages, e.g. by doing the following:

.. _my-custom-reference:

An incredible section title!
----------------------------

In this case, the document presents the default:my-custom-reference anchor to link to this section directly.

reStructuredText documents can reference these using the :ref: role, analogous to Sphinx's :ref: role.

post:

The namespace in which posts are referenced using the YYYY-MM-DD-slug format, e.g. post:2025-03-22-hello-thblog.

reStructuredText documents can reference these using the :post: role, or the .. post:: directive for a card such as this one:

page:

The namespace in which pages are referenced, e.g. page:projects.

reStructuredText documents can reference these using the :page: role.

file:

The namespace in which files are referenced, e.g. file:myage.rb.

reStructuredText documents can reference these using the :download: role, analogous to Sphinx's :download: role.

static:

The namespace in which static files are referenced, e.g. static:resume.pdf.

reStructuredText documents can reference these using the :static: role.

pep:

The namespace that allows quickly referencing Python Enhancement Proposals (PEPs), e.g. using the :pep: role in reStructuredText documents.

Equivalent namespaces and roles exist for RNCP (:rncp:) and RFCs (:rfc:) as well.

In order to implement these, the roles, named "reference roles", actually produce pending_reference or pending_reference_card nodes. A post-transform called PendingReferenceResolver then transforms these into reference and reference_card nodes respectively. This behaviour is much like Sphinx's XRefRole producing pending_xref nodes, which are transformed into reference nodes by a post-transform named ReferencesResolver.

For the posts page, I actually made another directive called .. post-list::, which resolves in a post-transform into multiple pending_reference_card nodes referencing posts that can then be processed.
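The pending-then-resolved flow can be sketched without docutils at all. The following is a simplified stand-in rather than thblog's implementation: the dataclasses replace real docutils nodes, the URLs are made up, and treating a bare name as belonging to the default namespace is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PendingReference:
    """Placeholder left in a document until all anchors are known."""

    target: str  # e.g. "post:2025-03-22-hello-thblog"
    text: str


@dataclass
class Reference:
    """Resolved link, ready to be rendered by a writer."""

    url: str
    text: str


def split_anchor(target: str) -> tuple[str, str]:
    """Split "namespace:reference"; a bare name is assumed to be "default"."""
    namespace, sep, reference = target.partition(":")
    if not sep:
        return "default", target
    return namespace, reference


def resolve(
    pending: PendingReference,
    anchors: dict[tuple[str, str], str],
) -> Reference:
    """Turn a pending reference into a reference using collected anchors."""
    key = split_anchor(pending.target)
    if key not in anchors:
        raise KeyError(f"unknown anchor: {pending.target}")
    return Reference(url=anchors[key], text=pending.text)


# Anchors gathered once every document has been read (hypothetical URLs).
ANCHORS = {
    ("post", "2025-03-22-hello-thblog"): "/posts/2025-03-22-hello-thblog/",
    ("default", "my-custom-reference"): "/posts/example/#my-custom-reference",
}

link = resolve(
    PendingReference("post:2025-03-22-hello-thblog", "Welcome, thblog!"),
    ANCHORS,
)
```

Resolution has to wait until every document has been read, which is why the roles can only emit placeholders at parsing time.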

Hacking substitutions

As with my previous blog, which used Jekyll, I wanted to be able to display my age in a computed form on my about page. After giving it some thought, I decided to implement it using reStructuredText substitutions.

The idea was that |Authors.Thomas.Age| could be translated into my age as an integer. In order to do this, I made a SubstitutionDefsMapping class that takes the environment and the original substitution definitions, and replaces doctree.substitution_defs after parsing, but before transforming.

This class returns the substitution if it exists in the original dictionary; otherwise, it checks whether the name matches one of the known patterns, such as Authors.<Name>.<Property>, in which case it returns the property using a method.

This method allows me to add as many keys as I want, so |UTC.Date| and |UTC.Year| also exist; after all, why not!
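The lookup order described above (explicit definitions first, computed keys second) can be sketched as a read-only mapping. This is a simplified stand-in, not thblog's class: values are plain strings rather than docutils substitution_definition nodes, the today parameter replaces the environment, and the author data, including the birth date, is made up.

```python
import re
from collections.abc import Mapping
from datetime import date, datetime, timezone

# Hypothetical author data; the birth date is invented for the example.
AUTHORS = {"Thomas": {"birth_date": date(1990, 1, 1)}}

AUTHOR_PATTERN = re.compile(r"Authors\.(?P<name>\w+)\.(?P<prop>\w+)")


class SubstitutionDefsMapping(Mapping):
    """Read-only mapping: explicit definitions first, computed keys second."""

    def __init__(self, original, today=None):
        self._original = dict(original)
        self._today = today or datetime.now(timezone.utc).date()

    def _computed(self, key):
        if key == "UTC.Date":
            return self._today.isoformat()
        if key == "UTC.Year":
            return str(self._today.year)
        match = AUTHOR_PATTERN.fullmatch(key)
        if match and match["prop"] == "Age" and match["name"] in AUTHORS:
            born = AUTHORS[match["name"]]["birth_date"]
            # Subtract one if this year's birthday has not happened yet.
            today = (self._today.month, self._today.day)
            before_birthday = today < (born.month, born.day)
            return str(self._today.year - born.year - before_birthday)
        raise KeyError(key)

    def __getitem__(self, key):
        if key in self._original:
            return self._original[key]
        return self._computed(key)

    def __iter__(self):
        return iter(self._original)

    def __len__(self):
        return len(self._original)


defs = SubstitutionDefsMapping({"Project": "thblog"}, today=date(2025, 3, 22))
age = defs["Authors.Thomas.Age"]
```

Iteration and length only cover the explicit definitions here, since the set of computed keys is open-ended.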

Adding a TODO admonition

When writing Sphinx documentation, I tend to use the .. todo:: directive a lot, and I usually configure Sphinx so that these appear in the produced documentation, to let the reader know when information is missing at a given place.

This is a Sphinx extension (in sphinx.ext.todo), and does not exist in base docutils; I had to add it back! Fortunately, with thblog, this is quite easy to do:

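# Base classes shipped with docutils.
from docutils.nodes import Admonition, Element
from docutils.parsers.rst.directives.admonitions import BaseAdmonition

# "extension" is thblog's own extension object (not shown here).

@extension.docutils.node
class todo(Admonition, Element):
    """Node signalling a work in progress."""

@extension.docutils.directive("todo")
class Todo(BaseAdmonition):
    """Directive signalling a work in progress."""

    node_class = todo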

Adding a GMI writer

One of the challenges I had with thblog was supporting multiple outputs for every website, including the HTTP output. One of the possibilities was an output for Gemini, an alternate "slow" web inspired by Gopher. On Gemini, the most common hypertext format is Gemtext (bearing the .gmi file extension).

Gemtext is line-oriented, and does not support inline formatting (no emphasis or text decoration). A line is one of the following:

These rules are overridden if we enter a preformatted block, using three backticks, until we exit it.

This markup is very limited compared to what the blog is able to produce, and any such conversion is "best effort" on thblog's part. The following limitations are known:

However, the biggest limitation is the absence of depth. In thblog's posts and pages, you can find code blocks in bullet points, bullet lists in bullet points, code blocks in admonitions and block quotes (both represented as quotes), and so on and so forth.

I made the choice to ignore this depth limitation, and represent embedded blocks in a Markdown fashion, e.g.:

* Hello, this is a list:

  * Hello, this is a list embedded in a list!
    Isn't that incredible?
  * It is!
* Another bullet point already?

  ```
  This one has code!
  ```

As this is non-standard, it is not interpreted by any client I know of, but even in its raw text form, it reads correctly to people used to Markdown and other such markups.
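A minimal sketch of this kind of rendering follows. The (kind, content) block model is hypothetical and much simpler than a docutils document tree; the only point is to show nested blocks being indented under the text of their parent bullet.

```python
def render_blocks(blocks, indent=""):
    """Render blocks as text lines, indenting nested blocks Markdown-style."""
    lines = []
    for kind, content in blocks:
        if kind == "text":
            lines.append(indent + content)
        elif kind == "code":
            lines.append(indent + "```")
            lines.extend(indent + line for line in content.split("\n"))
            lines.append(indent + "```")
        elif kind == "list":
            for text, children in content:
                first, *rest = text.split("\n")
                lines.append(f"{indent}* {first}")
                # Continuation lines align with the text after "* ".
                lines.extend(f"{indent}  {line}" for line in rest)
                if children:
                    lines.append("")
                    lines.extend(render_blocks(children, indent + "  "))
        else:
            raise ValueError(f"unknown block kind: {kind}")
    return lines


# Same structure as the example above, in the hypothetical block model.
DOCUMENT = [
    ("list", [
        ("Hello, this is a list:", [
            ("list", [
                (
                    "Hello, this is a list embedded in a list!\n"
                    "Isn't that incredible?",
                    [],
                ),
                ("It is!", []),
            ]),
        ]),
        ("Another bullet point already?", [("code", "This one has code!")]),
    ]),
]

gemtext = "\n".join(render_blocks(DOCUMENT))
```

Each nesting level adds two spaces, so the indented fence lines no longer start at the beginning of a line, which is one way the result departs from standard Gemtext.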

And voilà, thblog has a Gemini output! It is far from perfect, and the link names and image alts still need some work, but it is good enough to start using.

Conclusions

While docutils is the standard for parsing and rendering reStructuredText in Python, I can safely say it is a mess to use and extend:

However, once these are smoothed out by some level of abstraction in thblog, using docutils allows me to share concepts with Sphinx and other tools, and to import some of the extensions they define if I like them. The other way round also works: I can port docutils elements from thblog to a Sphinx extension, for example!

Using docutils in this project allows me to test out its limits and the interesting concepts it brings; but at some point, I may find myself building a reStructuredText parser from the ground up, while trying to avoid these pitfalls. For now, however, it works, and I should concentrate more on writing posts and other content for my blog. :-)