Design

This page explains the design principles and decisions in the project.

Minimum viable product

The Minimum Viable Product for this project is a service that creates an HTML version of all the manpages of all the packages available in Debian, for all supported suites. Basic whatis(1) functionality is also expected.

apropos(1) functionality is considered an extra that can be implemented later with existing tools.

The design is split into components which map to debmans subcommands:

  1. extract: extracts manpages from Debian packages
  2. render: renders manpages into HTML
  3. site: renders the static site into HTML
  4. index: indexes HTML pages for searching (not implemented yet)
  5. search: the search interface (not implemented, but there is a simple “jump” Javascript tool)

There is also a serve command which starts a local webserver to help with development.
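
As a rough sketch of how the pieces fit together, a full run would chain the subcommands in this order; the options are deliberately omitted here and the invocation below is only illustrative (consult the command-line help for the actual interface):

# hypothetical end-to-end run, options omitted
$ debmans extract   # fetch manpages from the Debian archive
$ debmans render    # convert the extracted manpages to HTML
$ debmans site      # render the static site itself
$ debmans serve     # preview the result locally during development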

See the Remaining work file for details about the missing bits.

Extract

This part fetches all manpages from the archive and stores them on disk. This makes them usable by tools like dman, which browse remote manpages.

The layout options for where to store files were:

  • Ubuntu: $DISTRIB_CODENAME/$LOCALE/man$i/$PAGE.$i.gz (see dman)
  • original codebase: "${OUTPUTDIR}/${pooldir}/${packagename}_${version}" (from manpage-extractor.pl)

Ubuntu’s approach was chosen to avoid bitrot and follow more closely the existing filesystem layout. It also happens to be easier to implement.

The extractor uses a cache to avoid re-extracting known manpages. We use the Ubuntu layout there as well ($outputdir/$suite/.cache/$packagename_version), which leads to bitrot, but at least it’s constrained to a suite. This will be a problem for unstable, so some garbage collection may eventually be necessary.
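
As an illustration, here is where a given page and its cache entry would end up under that layout; the values below are made up, only the path structure is taken from the layouts above:

# hypothetical values, showing only the path structure
$ outputdir=/srv/manpages; suite=jessie; locale=fr; section=1; page=ls
$ echo "$outputdir/$suite/$locale/man$section/$page.$section.gz"
/srv/manpages/jessie/fr/man1/ls.1.gz
$ echo "$outputdir/$suite/.cache/coreutils_8.23-4"
/srv/manpages/jessie/.cache/coreutils_8.23-4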

Render

This converts manpages to HTML so they are readable in a web browser.

Possible options for this implementation (an invocation sketch follows the list):

  • just the plaintext output of man wrapped in <PRE> tags (current design)
  • man2html is an old C program that ships with a bunch of CGI scripts
  • there’s another man2html that is a Perl script, but I couldn’t figure out how to use it correctly
  • w3m has another Perl script that is used by the Ubuntu site
  • roffit is another Perl script; the version in Debian is ancient (2012) and doesn’t display the man(1) synopsis correctly (newer versions from GitHub also fail)
  • pandoc can’t, unfortunately, read manpages (only write)
  • man itself can generate an HTML version with man -Hcat man and the output is fairly decent, although there is no cross-referencing
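
For reference, here is roughly how some of these renderers are invoked on a single page; the input file name is hypothetical and flags may vary between versions:

# assuming an uncompressed troff source in man.1
$ man2html man.1 > man2html.html   # writes HTML (with an HTTP header) to stdout
$ roffit < man.1 > roffit.html     # reads troff on stdin
$ man -Hcat ./man.1 > man.html     # man's own HTML output, "browsed" with cat
# the w3m renderer is the w3mman2html.cgi script shipped with w3m; it is a CGI
# script and is normally driven through CGI environment variables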

The Makefile here tests possible manpage HTML renderers. Each is timed with time(1) to show its performance.

Package Timing
roffit 0.06user 0.00system 0:00.07elapsed 96%CPU (0avgtext+0avgdata 4852maxresident)k
w3m 0.26user 0.01system 0:00.19elapsed 137%CPU (0avgtext+0avgdata 5456maxresident)k
man 1.63user 0.17system 0:01.81elapsed 99%CPU (0avgtext+0avgdata 27268maxresident)k
man2html 0.00user 0.00system 0:00.01elapsed 61%CPU (0avgtext+0avgdata 1568maxresident)k

Note

Those statistics were created with debmans/test/converters/Makefile in the source tree.

Here is how the actual output compares:

Package Correctness
roffit SYNOPSIS fails to display correctly
w3m includes HTTP headers, links to CGI script, all pre-formatted, no TOC
man TOC, no cross-referencing
man2html includes HTTP headers, links to CGI script, index at the end

man2html was originally chosen because it is the fastest, includes an index and is not too opinionated about how the output is formatted. Unfortunately, it would fail to parse a lot of manpages, like the ones from the gnutls project. w3m was used as a fallback, even though it actually calls man itself to do part of the rendering.

Index

This indexes HTML pages in a search engine of some sort.

In the backend, something will need to index the manpages if only to implement apropos(1) functionality, but eventually also full-text search. This should be modular so that new backends can be implemented as needed.

For now, we are considering reusing the existing Xapian infrastructure the Debian project already has.

The indexer would:

  1. run omindex on the HTML tree and create a database
  2. process each locale separately so that they are isolated (may be tricky for LANG=C) and the right stemmer is used (see the sketch after this list)
  3. need a CGI script (provided by the xapian-omega package) to query the database - the HTML output is generated based on templates, so presumably we could reuse the existing templates.
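
A per-locale run could look roughly like the following; the directory layout and the --stemmer values are assumptions, the other options are the same as in the Xapian examples below:

# hypothetical per-locale runs, one database per locale
$ omindex --url /fr/ --db search-fr --stemmer=french --mime-type=gz:ignore html/fr/
$ omindex --url /de/ --db search-de --stemmer=german --mime-type=gz:ignore html/de/
# the default (LANG=C) tree would be indexed with the English stemmer, or none at all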

It is assumed that Xapian can deal with large datasets (10-50GB) considering how it is used in Notmuch (my own mailbox is around 6GB) and lists.debian.org.

We put the command name in the page <title> tag, the short description in <meta description="..."> and use the magic markers (<!--htdig_noindex-->ignored<!--/htdig_noindex-->) to make the indexer ignore redundant bits.
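
To check that a rendered page carries the expected markup, one can simply grep it; the page and description below are hypothetical, but the tags and markers are the ones described above:

$ grep -E -o '<title>.*</title>|<meta [^>]*>|<!--/?htdig_noindex-->' html/cache/man/man1/ls.1.html
<title>ls(1)</title>
<meta description="list directory contents">
<!--htdig_noindex-->
<!--/htdig_noindex-->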

See the Search section for more information about the various search software evaluated and the web interface.

Xapian examples

Index all documents in html/:

$ time omindex --url / --db search --mime-type=gz:ignore html/
11.03user 0.36system 0:12.27elapsed 92%CPU (0avgtext+0avgdata 29952maxresident)k
18056inputs+15688outputs (0major+7884minor)pagefaults 0swaps

--url is the equivalent of the renderer’s --prefix, --db is the directory where the database will end up, and --mime-type tells omindex to ignore the raw (gzipped) manpages.

Second runs are much faster:

$ time omindex --url / --db search --mime-type=gz:ignore html/
0.01user 0.00system 0:00.02elapsed 88%CPU (0avgtext+0avgdata 4656maxresident)k
0inputs+0outputs (0major+300minor)pagefaults 0swaps

Display information about the search database:

$ delve search.db
UUID = 6fd4d4ab-2529-4d67-bbff-32b88fd888fa
number of documents = 452
average document length = 3976.96
document length lower bound = 78
document length upper bound = 243183
highest document id ever used = 452
has positional information = true

Example searches:

$ quest -d search.db man2html | grep ^url
url=/cache/man/fr/man1/man2html.1.html
url=/cache/man/man1/man2html.1.html
url=/cache/man/ro/man1/man2html.1.html
url=/cache/man/it/man1/man2html.1.html
url=/cache/man/el/man1/man2html.1.html
$ quest -d search.db setreg | grep ^url
url=/cache/man/man1/setreg.1.html
url=/cache/man/man1/mozroots.1.html
url=/cache/man/man1/chktrust.1.html
url=/cache/man/man1/certmgr.1.html

To search only <title> fields, pass --prefix=title:S and prefix the query terms with title:, as in 'title:foo'.
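
For example, assuming the page titles were indexed under the S prefix (which is what omindex uses by default), this should return only pages whose title mentions man2html:

$ quest -d search.db --prefix=title:S 'title:man2html' | grep ^url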

Searching

The search interface itself would be a CGI (or, if written in Python, WSGI) tool that hooks into the webserver to perform searches.

Currently, only a browser-based Javascript search tool implements basic whatis(1) functionality. It looks up the manpage using an XMLHttpRequest to see if the requested page exists and redirects appropriately. It does not look at different locales yet.
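
From the command line, the check performed by the jump tool is roughly equivalent to the following; the host and page are hypothetical, the path layout is the one visible in the quest output above:

# does the page exist? the exit status decides whether to "redirect"
$ curl -sfI https://manpages.debian.org/cache/man/man1/ls.1.html > /dev/null \
    && echo "found, redirect there" \
    || echo "not found"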

This should be extended to a full search interface, using omega’s web interface or other pluggable interfaces.

Search software evaluation

Various search engines were evaluated:

  • Xapian:
    • used by Notmuch, craigslist, search.debian.org, lists.debian.org, wiki.debian.org, old gmane.org
    • no web API, would need to index directly through the Python API
    • harder to use
    • no extra server necessary
    • internal knowledge already present in debian.org
    • bound to use their CGI interface (Omega)
    • written in C++
  • Lucene / Solr:
    • requires another server and API communications
    • per-index user/password access control
    • mature solution that has been around for a long time
    • may support indexing HTML?
    • JSON or XML data entry
    • large contributor community
    • usually faster than ES
    • based on Apache Lucene
    • written in Java
    • used by Netflix, Cnet
  • Elasticsearch:
    • requires a server
    • packaged in Debian
    • REST/JSON API
    • documentation may be lacking when compared to Solr, e.g. we had to ask on IRC to see if _id can be a string (yes, it can)
    • Python API, but could also be operated with a Javascript library
    • may be easier to scale than Solr
    • performance comparable with Solr
    • was designed to replace Solr
    • supports indexing HTML directly by stripping tags and entities
    • based on Apache Lucene as well
    • written in Java as well
    • requires a CLA for contributing
    • used by Github, Foursquare, presumably new gmane.org
    • created in 2010, ~18 months support lifetime?
  • sphinx: not well known, ignored
  • mnogosearch: considered dead, ignored
  • homegrown:
    • codesearch uses Postgresql
    • David’s uses sqlite
    • Readthedocs has a custom-built Javascript-based search engine
    • we could use a simple Flask REST API for searches, but then the extractor (or renderer?) would need to write to some database; sqlite reads fail while writing, so maybe not a good candidate?

Infrastructure

At least the extractor and renderer would run on manziarly. The output would be stored on the static.d.o CDN (see below). The search components could run on a separate server (or a pair of servers) hosting the search cluster.

In the above setup, manziarly would be a master server for static file servers in the Debian.org infrastructure. Files saved there would be rsync’d to multiple frontend servers. How this is configured is detailed in the static-mirroring DSA documentation, but basically, we would need to ask the DSA team for an extra entry for manpages.d.o there to serve static files.

Gitlab vs Alioth

The project was originally hosted in the Collaborative Maintenance repositories, but those quickly showed their limitations, which included lack of continuous integration, issue tracking and automatic rendering of markdown files.

A project was created on Gitlab for this purpose, in anarcat’s personal repositories (for now). On Gitlab, the project “mirrors” the public git URL of the collab-maint repo. On collab-maint, there is a cronjob in my personal account which runs this command to synchronize the changes from Gitlab at the 17th minute of the hour:

git -C /git/collab-maint/debmans.git fetch --quiet gitlab master:master
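
In crontab form, that presumably looks something like this (the exact entry is an assumption based on the description above):

# run at minute 17 of every hour
17 * * * * git -C /git/collab-maint/debmans.git fetch --quiet gitlab master:master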

This was found to be the best compromise: it adds the extra Gitlab features while keeping the access threshold for Debian members low. Do note that there is no conflict resolution whatsoever on collab-maint’s side, and the behavior of Gitlab in case of conflicts hasn’t been determined yet. This may require manually fixing merge conflicts.

Other implementations

There were already three known implementations of “man to web” archive generators when this project was started.

After careful consideration of the existing alternatives, it was determined that it was easier and simpler to write a cleanroom implementation, based in part on lessons learned from the existing implementations and the more mature debsources project.

Original manpages.d.o codebase

The original codebase is a set of Perl and bash CGI scripts that dynamically generate (and search through) manpages.

The original codebase extracts manpages with dpkg --fsys-tarfile and the tar command. It also creates indexes using man -k for future searches. Manpages are stored in a directory for each package-version pair, so it doesn’t garbage-collect disappeared manpages. It also appears that packages are always extracted, even if they have been processed before.
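
In shell terms, that extraction step boils down to something like this (the package file name is hypothetical):

# pull only the manpages out of a .deb without installing it
$ mkdir -p output
$ dpkg --fsys-tarfile coreutils_8.23-4_amd64.deb \
    | tar -xf - -C output --wildcards './usr/share/man/*'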

The CGI script just calls man and outputs plain text wrapped in <PRE> tags without any cross-referencing or further formatting.

There is also a copy of the Ubuntu scripts in the source code.

Ubuntu

Ubuntu has their own manpage repository at https://manpages.ubuntu.com/. Their codebase is a mix of Python, Perl and Bash.

It looks like there are both a Bash and a Python implementation of the same thing. They process the whole archive on the local filesystem and create a timestamp file for every package found, which avoids processing packages repeatedly (but all packages from the Packages listing are stat’d at every run). In the Bash version, the manpages are extracted with dpkg -x; the Python version does the same, although it uses the apt Python package to list files. It uses a simple regex (^usr/share/man/.*\.gz$) to find manpages.
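
The equivalent manual steps would be roughly as follows; the package file name is hypothetical and the regex is quoted verbatim from above:

# unpack the package and list its manpages
$ dpkg -x coreutils_8.23-4ubuntu2_amd64.deb tmp/
$ find tmp/usr/share/man -name '*.gz'
# the scripts themselves match package file lists against ^usr/share/man/.*\.gz$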

It keeps a cache of the md5sum of the package in "$PUBLIC_HTML_DIR/manpages/$dist/.cache/$name" to avoid looking at known packages. The Bash version only compares the timestamp of the cache file against the package, and only checks the modification year.

To generate the HTML version of the manpage, both programs use the /usr/lib/w3m/cgi-bin/w3mman2html.cgi shipped with the w3m package.

Search is handled by a custom Python script that looks through manpage filenames or uses Google to do a full-text search.

dgilman codebase

A new codebase written by dgilman is available on GitHub. It is a simple Python script with a sqlite backend. It extracts the tarfile with dpkg --fsys-tarfile, then parses it with the Python tarfile library. It uses rather complicated regexes to find manpages and stores apropos entries and other metadata about manpages in the sqlite database. All manpages are unconditionally extracted.