Design

This page explains the design principles and decisions in the project.

The Minimum Viable Product (MVP) for this project is a service that creates an HTML version of all the manpages of all the packages available in Debian, for all supported suites. Basic whatis(1) functionality is also expected.

apropos(1) functionality is considered an extra that can be implemented later with already existing tools.

Components

The design is split into components, which map to debmans subcommands:

  1. extract: extracts manpages from Debian packages
  2. render: renders manpages into HTML
  3. site: renders the static site into HTML
  4. index: indexes HTML pages for searching (not implemented yet)
  5. search: the search interface (not implemented, but there is a simple “jump” Javascript tool)

There is also a serve command which starts a local webserver to help with development.

See the Remaining work file for details about the missing bits.

Extract

This part fetches all manpages from the archive and stores them on disk, which makes them usable by tools like dman that browse remote webpages.

The layout options for where to store files were:

  • Ubuntu: $DISTRIB_CODENAME/$LOCALE/man$i/$PAGE.$i.gz (see dman)
  • original codebase: "${OUTPUTDIR}/${pooldir}/${packagename}_${version}" (from manpage-extractor.pl)

Ubuntu’s approach was chosen to avoid bitrot and follow more closely the existing filesystem layout. It also happens to be easier to implement.

The extractor uses a cache to avoid re-extracting known manpages. We use the Ubuntu layout there as well ($outputdir/$suite/.cache/$packagename_version), which leads to bitrot, but at least it is constrained to a suite. This will be a problem for unstable, so some garbage collection may eventually be necessary.
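
For illustration, here is a minimal sketch of how the Ubuntu-style output path could be computed for a file found in a package; the function and regular expression are hypothetical and not the actual debmans code:

import os
import re

# hypothetical sketch: map a package member such as
# usr/share/man/fr/man1/foo.1.gz to $outputdir/$suite/fr/man1/foo.1.gz
MANPAGE_RE = re.compile(
    r'^usr/share/man/(?:(?P<locale>[^/]+)/)?man(?P<section>[0-9])/(?P<page>.+)$')

def output_path(outputdir, suite, member_name):
    match = MANPAGE_RE.match(member_name)
    if not match:
        return None  # not a manpage
    locale = match.group('locale') or ''  # empty for the default locale
    return os.path.join(outputdir, suite, locale,
                        'man' + match.group('section'), match.group('page'))

# output_path('/srv/manpages', 'jessie', 'usr/share/man/fr/man1/foo.1.gz')
# -> '/srv/manpages/jessie/fr/man1/foo.1.gz'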

Render

This converts manpages to HTML so they are readable in a web browser. This is mostly about calling one of the Manpage converters and then embedding the output in one of the Templating systems.
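
As a hedged sketch (not the actual debmans code), assuming mandoc as the converter and a hypothetical manpage.html Jinja template that inserts a content variable, the conversion and embedding step could look like this:

import subprocess
import jinja2

env = jinja2.Environment(loader=jinja2.FileSystemLoader('templates'))

def render_manpage(source, target):
    # mandoc -Thtml -Ofragment outputs HTML without <html>/<body> wrappers,
    # which is then embedded in the site template
    html = subprocess.check_output(
        ['mandoc', '-Thtml', '-Ofragment', source]).decode('utf-8', 'replace')
    with open(target, 'w') as out:
        out.write(env.get_template('manpage.html').render(content=html))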

A simple timestamp comparison between the template, source and target files ensures that files which do not need to be refreshed are skipped.
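
In other words, something like this hypothetical check (not the actual debmans code) decides whether a page needs to be re-rendered:

import os

def needs_refresh(target, *sources):
    # rebuild only if the target is missing or older than any of its
    # sources (manpage, template, ...)
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(s) > target_mtime for s in sources)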

Index

This indexes HTML pages in a search engine of some sort.

In the backend, something will need to index the manpages if only to implement apropos(1) functionality, but eventually also full-text search. This should be modular so that new backends can be implemented as needed.
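
One possible shape for such a backend, shown here only as a hypothetical sketch of what the modular interface could look like (debmans does not define this yet):

from abc import ABC, abstractmethod

class SearchBackend(ABC):
    # hypothetical interface for pluggable search backends

    @abstractmethod
    def index(self, url, title, description, text):
        """Add or update one manpage in the index."""

    @abstractmethod
    def search(self, query, limit=10):
        """Return an iterable of matching URLs."""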

For now, we are considering reusing the existing Xapian infrastructure the Debian project already has.

The indexer would:

  1. run omindex on the HTML tree and create a database
  2. process each locale separately so that they are isolated (may be tricky for LANG=C) and the right stemmer is called
  3. need a CGI script (provided by the xapian-omega package) to query the database - the HTML output is generated based on templates, so presumably we could reuse the existing templates.

It is assumed that Xapian can deal with large datasets (10-50GB) considering how it is used in Notmuch (my own mailbox is around 6GB) and lists.debian.org.

We put the command name in the page <title> tag, the short description in <meta description="..."> and use the magic markers (<!--htdig_noindex-->ignored<!--/htdig_noindex-->) to make the indexer ignore redundant bits.
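
For illustration, a hedged sketch of what the relevant template fragment could look like, written here as a Jinja template rendered from Python (the actual debmans template may differ, and the meta tag is written in its standard name/content form):

import jinja2

HEAD = jinja2.Template('''\
<title>{{ command }}({{ section }})</title>
<meta name="description" content="{{ description }}">
<!--htdig_noindex-->
{# navigation, header and other redundant bits the indexer should skip #}
<!--/htdig_noindex-->
''')

print(HEAD.render(command='ls', section='1',
                  description='list directory contents'))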

See the Search engines section below for more information about the various search software evaluated and the web interface.

Software evaluation

This section tries to justify, sometimes after the fact, the choice of certain dependencies and software technologies used in the project.

Manpage converters

Those are the known manpage converters:

  • just the plaintext output of man wrapped in <PRE> tags (current design)
  • man2html is an old C program that ships with a bunch of CGI scripts
  • there’s another man2html that is a perl script, but I couldn’t figure out how to use it correctly.
  • w3m has another Perl script that is used by the Ubuntu site
  • roffit is another perl script; the version in Debian is ancient (2012) and doesn’t display the man(1) synopsis correctly (newer versions from GitHub also fail)
  • pandoc can’t, unfortunately, read manpages (only write)
  • man itself can generate an HTML version with man -Hcat man and the output is fairly decent, although there is no cross-referencing
  • mandoc also has HTML output and is packaged in Debian

The Makefile here tests possible manpage HTML renderers. Each is timed with time(1) to show its performance.

  • roffit: 0.06user 0.00system 0:00.07elapsed 96%CPU (0avgtext+0avgdata 4852maxresident)k
  • w3m: 0.26user 0.01system 0:00.19elapsed 137%CPU (0avgtext+0avgdata 5456maxresident)k
  • man: 1.63user 0.17system 0:01.81elapsed 99%CPU (0avgtext+0avgdata 27268maxresident)k
  • man2html: 0.00user 0.00system 0:00.01elapsed 61%CPU (0avgtext+0avgdata 1568maxresident)k
  • mandoc: 0.00user 0.00system 0:00.01elapsed 57%CPU (0avgtext+0avgdata 2352maxresident)k

Note

Those statistics were created with debmans/test/converters/Makefile in the source tree.

Here is how the actual output compares:

  • roffit: SYNOPSIS fails to display correctly
  • w3m: includes HTTP headers, links to CGI script, all pre-formatted, no TOC
  • man: TOC, no cross-referencing
  • man2html: includes HTTP headers, links to CGI script, index at the end
  • mandoc: customizable links and stylesheets, no index, can avoid <body> tags

man2html was originally chosen because it is the fastest, includes an index and is not too opinionated about how the output is formatted. Unfortunately, it would fail to parse a lot of manpages, like the ones from the gnutls project.

Then w3m was used as a fallback, even though it actually calls man itself to do part of the rendering. It required a bunch of hacks to fix the markup.

So then the mandoc package was used, and it was significantly faster: in the test corpus (452 manpages), mandoc would render all pages in 23 seconds, while w3m was 5 times slower.

Time for w3m:

77.29user 13.02system 1:26.04elapsed 104%CPU (0avgtext+0avgdata 63800maxresident)k

Time for mandoc:

15.70user 4.75system 0:26.40elapsed 77%CPU (0avgtext+0avgdata 55992maxresident)k

Templating systems

Templating is necessary to turn things into HTML consistently. We chose Jinja because it seemed lightweight and simple enough. It is used by Pelican, MoinMoin 2.0, Ansible and Salt. It also seemed to be a good middle ground between arbitrary code execution and static templates in this comparison. Jinja is also used by the Flask framework, which is close to the Click commandline framework we are already using.

Interesting alternatives include Mako, used by Reddit and Pyramid, which is itself part of the larger Pylons project. This all seems like too much to pull in.

Genshi was also evaluated summarily later on but discarded as I found the basic tutorial to be too complicated.

It used to be possible to change the location of the templates and static files, but that was removed during some refactoring, partly because of the lack of interest from Ubuntu to reuse the software for now. If we decide to support themes, it may be simpler to use flask-themes instead of implementing our own theming system, although that seems to be a bit heavy...

Web frameworks

Flask was chosen to build a quick CGI prototype because I wanted to experiment with it, but also because it integrates with the templating system already chosen (Jinja) and the commandline framework (Click). It also competes reasonably with other Python frameworks in the TechEmpower benchmarks. I have found the decorators in Flask to be really intuitive and easy to use, and while I had to bounce around for a while to merge the search engine with the regular static file server, in the end it turned out to be a simple implementation.
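
A minimal sketch of that merge, with illustrative routes and paths that are not necessarily those of debmans:

from flask import Flask, render_template, request, send_from_directory

app = Flask(__name__)
HTML_ROOT = '/srv/manpages/html'  # hypothetical location of the rendered tree

@app.route('/search')
def search():
    # the real implementation would query the chosen search backend here
    query = request.args.get('q', '')
    return render_template('search.html', query=query, results=[])

@app.route('/', defaults={'path': 'index.html'})
@app.route('/<path:path>')
def static_site(path):
    # fall back to serving the pre-rendered static site
    return send_from_directory(HTML_ROOT, path)

if __name__ == '__main__':
    app.run(debug=True)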

Other alternatives that were considered were:

  • builtin HTTPServer in Python: used at first, but only served static files
  • builtin cgi module: wanted to try a framework instead of parsing CGI arguments and dealing with content-types by hand
  • Pyramid: too complicated
  • Bottle: interesting, looks simpler than Flask, if less popular?
  • Django, Plone, etc: too much overhead

Search engines

Various search engines were evaluated:

  • Xapian:
    • used by Notmuch, craigslist, search.debian.org, lists.debian.org, wiki.debian.org, old gmane.org
    • no web API, would need to index directly through the Python API (see the sketch after this list)
    • harder to use
    • no extra server necessary
    • internal knowledge already present in debian.org
    • bound to use their CGI interface (Omega)
    • written in C++
  • Lucene / Solr:
    • requires another server and API communications
    • per-index user/password access control
    • mature solution that has been around for a long time
    • may support indexing HTML?
    • JSON or XML data entry
    • large contributor community
    • usually faster than ES
    • based on Apache Lucene
    • written in Java
    • used by Netflix, Cnet
  • Elasticsearch:
    • requires a server
    • packaged in Debian
    • REST/JSON API
    • documentation may be lacking when compared to Solr, e.g. had to ask on IRC to see if _id can be a string (yes, it can)
    • Python API, but could also be operated with a Javascript library
    • may be easier to scale than Solr
    • performance comparable with Solr
    • was designed to replace Solr
    • supports indexing HTML directly by stripping tags and entities
    • based on Apache Lucene as well
    • written in Java as well
    • requires a CLA for contributing
    • used by Github, Foursquare, presumably new gmane.org
    • created in 2010, ~18 months support lifetime?
  • whoosh: used by moinmoin 2.0
  • sphinx: not well known, ignored
  • mnoGoSearch: considered dead, ignored
  • homegrown:
    • Ubuntu’s tool uses a simple Python CGI to search page names, and Google Search for full-text search
    • codesearch uses Postgresql
    • David’s uses sqlite
    • Readthedocs has a custom-built Javascript-based search engine
    • we could use a simple Flask REST API for searches, but then the extractor (or renderer?) would need to write stuff to some database - SQLite reads fail while a write is in progress, so maybe not a good candidate?
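
As noted in the Xapian entry above, indexing would go through the Python bindings; here is a minimal hedged sketch (the database path, prefixes and texts are illustrative):

import xapian

# hypothetical sketch of indexing one rendered manpage
db = xapian.WritableDatabase('search.db', xapian.DB_CREATE_OR_OPEN)
termgen = xapian.TermGenerator()
termgen.set_stemmer(xapian.Stem('en'))  # would be chosen per locale

doc = xapian.Document()
termgen.set_document(doc)
termgen.index_text('ls(1) - list directory contents', 1, 'S')  # title, "S" prefix
termgen.index_text('ls lists information about the FILEs')     # body text
doc.set_data('/cache/man/man1/ls.1.html')  # what quest/omega would return
db.add_document(doc)
db.commit()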

Xapian examples

Index all documents in html/:

$ time omindex --url / --db search --mime-type=gz:ignore html/
11.03user 0.36system 0:12.27elapsed 92%CPU (0avgtext+0avgdata 29952maxresident)k
18056inputs+15688outputs (0major+7884minor)pagefaults 0swaps

--url is the equivalent of the renderer’s --prefix, --db is the directory where the database will end up, and --mime-type is used to ignore the raw manpages.

Second runs are much faster:

$ time omindex --url / --db search --mime-type=gz:ignore html/
0.01user 0.00system 0:00.02elapsed 88%CPU (0avgtext+0avgdata 4656maxresident)k
0inputs+0outputs (0major+300minor)pagefaults 0swaps

Display information about the search database:

$ delve search.db
UUID = 6fd4d4ab-2529-4d67-bbff-32b88fd888fa
number of documents = 452
average document length = 3976.96
document length lower bound = 78
document length upper bound = 243183
highest document id ever used = 452
has positional information = true

Example searches:

$ quest -d search.db man2html | grep ^url
url=/cache/man/fr/man1/man2html.1.html
url=/cache/man/man1/man2html.1.html
url=/cache/man/ro/man1/man2html.1.html
url=/cache/man/it/man1/man2html.1.html
url=/cache/man/el/man1/man2html.1.html
$ quest -d search.db setreg | grep ^url
url=/cache/man/man1/setreg.1.html
url=/cache/man/man1/mozroots.1.html
url=/cache/man/man1/chktrust.1.html
url=/cache/man/man1/certmgr.1.html

This would search only <title> fields: --prefix=title:S 'title:foo'.

Infrastructure

Debian.org

Note

This may be out of date. The purpose of this section is to explain how this should be deployed on Debian’s infrastructure.

At least the extractor and renderer would run on manziarly. The output would be stored on the static.d.o CDN (see below). The search cluster (part 3) could run on a separate server (or pair of servers?).

In the above setup, manziarly would be a master server for static file servers in the Debian.org infrastructure. Files saved there would be rsync’d to multiple frontend servers. How this is configured is detailed in the static-mirroring DSA documentation, but basically, we would need to ask the DSA team for an extra entry for manpages.d.o there to serve static files.

Gitlab vs Alioth

The project was originally hosted in the Collaborative Maintenance repositories, but those quickly showed their limitations, which included the lack of continuous integration, issue tracking and automatic rendering of Markdown files.

A project was created on Gitlab for this purpose, in anarcat’s personal repositories (for now). On Gitlab, the project “mirrors” the public git URL of the collab-maint repo. On collab-maint, there is a cronjob in my personal account which runs this command to synchronize the changes from Gitlab at the 17th minute of the hour:

git -C /git/collab-maint/debmans.git fetch --quiet gitlab master:master

This was found to be the best compromise in adding the extra Gitlab features while still keeping the access threshold for Debian members low. Do note that there is no conflict resolution whatsoever on collab-maint’s side, and the behavior of Gitlab in case of conflicts isn’t determined yet. This may require manual fixing of merge conflicts.

Other implementations

There were already three known implementations of “man to web” archive generators when this project was started.

After careful consideration of existing alternatives, it was determined it was easier and simpler to write a cleanroom implementation, based in part on the lessons learned from the existing implementations and the more mature debsources project.

Original manpages.d.o codebase

The original codebase is a set of Perl and bash CGI scripts that dynamically generate (and search through) manpages.

The original codebase extracts manpages with dpkg --fsys-tarfile and the tar command. It also creates indexes using man -k for future searches. Manpages are stored in a directory for each package-version pair, so it doesn’t garbage-collect disappeared manpages. It also appears that packages are always extracted, even if they have been parsed before.

The CGI script just calls man and outputs plain text wrapped in <PRE> tags without any cross-referencing or further formatting.

There is also a copy of the Ubuntu scripts in the source code.

Ubuntu

Ubuntu has their own manpage repository at https://manpages.ubuntu.com/. Their codebase is a mix of Python, Perl and Bash.

It looks like there are both a bash and a Python implementation of the same thing. They process the whole archive on the local filesystem and create a timestamp file for every package found, which avoids processing packages repeatedly (but all packages from the Packages listing are stat’d at every run). In the bash version, the manpages are extracted with dpkg -x; the Python version does the same, although it uses the apt Python package to list files. It uses a simple regex (^usr/share/man/.*\.gz$) to find manpages.

It keeps a cache of the md5sum of the package in "$PUBLIC_HTML_DIR/manpages/$dist/.cache/$name" to avoid looking at known packages. The bash version only looks at the timestamp of the file versus the package, and only checks the modification year.

To generate the HTML version of the manpage, both programs use the /usr/lib/w3m/cgi-bin/w3mman2html.cgi shipped with the w3m package.

Search is handled by a custom Python script that looks through manpage filenames or uses Google to do a full-text search.

dgilman codebase

A new codebase written by dgilman is available on GitHub. It is a simple Python script with a sqlite backend. It extracts the tarfile with dpkg --fsys-tarfile then parses it with the Python tarfile library. It uses rather complicated regexes to find manpages and stores apropos data and other metadata about manpages in the sqlite database. All manpages are unconditionally extracted.

OpenBSD

The OpenBSD project has a man.cgi(8) program that powers the whole application behind man.openbsd.org. It uses mandoc(1) to format manpages, as the manual system in OpenBSD has native support for HTML output. This is linked directly in the CGI, which is written in C.

FreeBSD

Wolfram Schneider

The man.freebsd.org site is powered by a perl script written by Wolfram Schneider which parses the output of the man(1) command directly. See the help page for more information.