This page explains the design principles and decisions in the project.
The Minimum Viable Product (MVP) for this project is a service that creates an
HTML version of all the manpages of all the packages available in
Debian, for all supported suites. Basic
whatis(1) functionality is part of the MVP;
apropos(1) functionality is considered an extra that can be implemented
later with already existing tools.
The design is split into the following components, which map to subcommands:
- extract: extracts manpages from Debian packages
- render: renders manpages into HTML
- site: renders the static site pages into HTML
- index: indexes HTML pages for searching (not implemented yet)
There is also a serve command which starts a local webserver to help
with development and testing.
See the Remaining work file for details about the missing bits.
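For illustration, those subcommands map naturally onto a Click command group; this is only a sketch of the shape, not the actual debmans CLI:

import click

@click.group()
def main():
    """Extract, render and serve Debian manpages."""

@main.command()
def extract():
    """Extract manpages from Debian packages."""

@main.command()
def render():
    """Render extracted manpages into HTML."""

@main.command()
def serve():
    """Start a local webserver to preview the result."""

if __name__ == '__main__':
    main()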
This part fetches all manpages from the archive and stores them on disk. This makes them usable for tools like dman that browse remote webpages.
Two layouts were considered for where to store the files: the one used by the original codebase and the one used by Ubuntu’s scripts. Ubuntu’s approach was chosen to avoid bitrot and to follow the existing filesystem layout more closely. It also happens to be easier to implement.
The extractor uses a cache to avoid re-extracting known manpages. We use
the Ubuntu layout there as well
($outputdir/$suite/.cache/$packagename_version), which leads to
bitrot, but at least it’s constrained to a suite. This will be a problem
for unstable, so some garbage collection may be necessary.
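A minimal sketch of such a cache check, with hypothetical helper names (the actual extractor logic differs):

import os

def cache_marker(outputdir, suite, package, version):
    # per-suite marker file, following the Ubuntu layout described above
    return os.path.join(outputdir, suite, '.cache',
                        '%s_%s' % (package, version))

def already_extracted(outputdir, suite, package, version):
    """True if this package/version was already extracted for this suite."""
    return os.path.exists(cache_marker(outputdir, suite, package, version))

def mark_extracted(outputdir, suite, package, version):
    marker = cache_marker(outputdir, suite, package, version)
    os.makedirs(os.path.dirname(marker), exist_ok=True)
    open(marker, 'a').close()  # touch the marker file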
A simple timestamp comparison between the template, source and target files ensures that files which do not need to be refreshed are skipped.
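That skip logic amounts to a make(1)-style freshness check; a sketch, with a hypothetical helper:

import os

def is_fresh(target, *sources):
    """True if target exists and is newer than every source file."""
    if not os.path.exists(target):
        return False
    target_mtime = os.path.getmtime(target)
    return all(os.path.getmtime(src) <= target_mtime for src in sources)

# e.g. the renderer would skip a page when
# is_fresh(html_path, manpage_path, template_path) is true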
This indexes HTML pages in a search engine of some sort.
In the backend, something will need to index the manpages, if only to
provide apropos(1) functionality, but eventually also full-text
search. This should be modular so that new backends can be implemented.
For now, we are considering reusing the Xapian infrastructure the Debian project already has.
The indexer would:
- run omindex on the HTML tree and create a database (see the sketch below)
- process each locale separately so they are isolated (may be tricky with LANG=C) and so that the right stemmer is called
- need a CGI script (provided by the xapian-omega package) to query the database; the HTML output is generated based on templates, so presumably we could reuse the existing templates
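As a rough illustration of the per-locale indexing step (the paths, the locale list and the use of omindex’s --stemmer option are assumptions, not the actual setup):

import os
import subprocess

HTML_ROOT = '/srv/manpages/html'            # hypothetical rendered tree
LOCALES = {'C': 'english', 'fr': 'french'}  # hypothetical locale -> stemmer map

for locale, stemmer in LOCALES.items():
    # one database per locale keeps the indexes isolated and lets us
    # request the right stemmer for each language
    subprocess.check_call([
        'omindex',
        '--url', '/',
        '--db', 'search-%s' % locale,
        '--stemmer', stemmer,
        '--mime-type=gz:ignore',  # skip raw manpages, as in the tests below
        os.path.join(HTML_ROOT, locale),
    ])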
It is assumed that Xapian can deal with large datasets (10-50GB) considering how it is used in Notmuch (my own mailbox is around 6GB) and on lists.debian.org.
We put the command name in the page <title> tag and the short
description in the <meta description="..."> tag, and use the magic markers
(<!--htdig_noindex-->ignored<!--/htdig_noindex-->) to make the
indexer ignore redundant bits.
See the Search engines section for more information about the various search software evaluated and the web interface.
The search interface itself would be a CGI or WSGI program (if written in Python) that would hook into the webserver to perform searches.
A first prototype provided basic whatis(1) functionality. It looked up the
manpage using an XMLHttpRequest to see
if the requested page exists and redirected appropriately. It didn’t
look at different locales yet.
A prototype of a CGI-based approach was then written in Flask. The
search engine looks at the filesystem for a given pattern and,
optionally, section, suite and locale parameters. It doesn’t show all
suites by default and prefers to show all matching manpages, behaving
like a names-only apropos(1).
See the Web frameworks section for more information about the decision to use Flask.
This should be extended to a full search interface, using a
web interface or other pluggable interfaces.
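For illustration, the filesystem lookup in such a prototype could look roughly like this (the route, parameters and directory layout are hypothetical, not the actual debmans code):

import glob
import os

from flask import Flask, abort, redirect, request

app = Flask(__name__)
OUTPUT_DIR = '/srv/manpages'  # hypothetical output directory

@app.route('/jump')
def jump():
    """whatis(1)-style lookup: find a manpage by name and redirect."""
    name = request.args.get('q', '')
    section = request.args.get('section', '*')
    suite = request.args.get('suite', 'unstable')  # not all suites by default
    locale = request.args.get('locale', '*')
    # hypothetical layout: $OUTPUT_DIR/$suite/man/$locale/man$section/$name.html
    pattern = os.path.join(OUTPUT_DIR, suite, 'man', locale,
                           'man%s' % section, '%s.html' % name)
    matches = sorted(glob.glob(pattern))
    if not matches:
        abort(404)
    # a real interface would list multiple matches; the sketch just
    # redirects to the first one
    return redirect(matches[0].replace(OUTPUT_DIR, '', 1))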
This section tries to justify (sometimes after the fact) the choices of dependencies and software technologies used in the project.
Those are the known manpage converters:
- just the plaintext output of man wrapped in <PRE> tags (current design)
- man2html is an old C program that ships with a bunch of CGI scripts
- there’s another man2html that is a Perl script, but I couldn’t figure out how to use it correctly
- w3m has another Perl script that is used by the Ubuntu site
- roffit is another Perl script; the version in Debian is ancient (2012) and doesn’t display the man(1) synopsis correctly (newer versions from GitHub also fail)
- pandoc can’t, unfortunately, read manpages (only write them)
- man itself can generate an HTML version with man -Hcat man and the output is fairly decent, although there is no cross-referencing
- mandoc also has HTML output and is packaged in Debian
A Makefile tests the possible manpage HTML renderers, timing each with
time(1) to show its performance. Those statistics were created with
debmans/test/converters/Makefile in the source tree.
Here is how the actual output compares:

| converter | output notes |
|---|---|
| roffit | SYNOPSIS fails to display correctly |
| w3m | includes HTTP headers, links to CGI script, all pre-formatted, no TOC |
| man | TOC, no cross-referencing |
| man2html | includes HTTP headers, links to CGI script, index at the end |
| mandoc | customizable links and stylesheets, no index, can avoid … |
man2html was originally chosen because it is the fastest, includes
an index and is not too opinionated about how the output is
formatted. Unfortunately, it would fail to parse a lot of manpages.
w3m was used as a fallback, even though it actually calls
man itself to do part of the rendering. It required a bunch of
hacks to fix the markup.
So then the mandoc package was used, and it was significantly
faster. In the test corpus (452 manpages), mandoc would render all
pages in 23 seconds, while w3m was 5 times slower.
Time for w3m:
77.29user 13.02system 1:26.04elapsed 104%CPU (0avgtext+0avgdata 63800maxresident)k
Time for mandoc:
15.70user 4.75system 0:26.40elapsed 77%CPU (0avgtext+0avgdata 55992maxresident)k
Templating is necessary to turn things into HTML consistently. We chose Jinja because it seemed lightweight and simple enough. It is used by Pelican, MoinMoin 2.0, Ansible and Salt. It also seemed to be a good middle ground between arbitrary code execution and static templates in the comparisons we looked at. Jinja is also used by the Flask framework, which is close to the Click commandline framework we are already using.
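A minimal example of the kind of rendering Jinja enables (the template name and variables are made up for illustration):

from jinja2 import Environment, PackageLoader, select_autoescape

env = Environment(
    loader=PackageLoader('debmans', 'templates'),  # hypothetical layout
    autoescape=select_autoescape(['html']),
)
template = env.get_template('manpage.html')        # hypothetical template
page = template.render(title='man2html(1)', body='<pre>...</pre>')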
Genshi was also evaluated summarily later on but discarded as I found the basic tutorial to be too complicated.
It used to be possible to change the location of the templates and static files, but that was removed during some refactoring, partly because of the lack of interest from Ubuntu to reuse the software for now. If we decide to support themes, it may be simpler to use flask-themes instead of implementing our own theming system, although that seems to be a bit heavy...
Flask was chosen to build a quick CGI prototype because I wanted to experiment with it, but also because it integrates with the templating system already chosen (Jinja) and the commandline framework (Click). It also competes reasonably with other Python frameworks in the TechEmpower benchmarks. I have found the decorators in Flask to be really intuitive and easy to use, and while I had to bounce around for a while to merge the search engine with the regular static file server, in the end it turned out to be a simple implementation.
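Merging the two boils down to a catch-all route that falls back to the static tree; a sketch, with a hypothetical directory:

from flask import Flask, send_from_directory

app = Flask(__name__)
OUTPUT_DIR = '/srv/manpages'  # hypothetical rendered-site directory

# anything that is not a dynamic endpoint falls through to the
# pre-rendered static tree
@app.route('/<path:path>')
def static_page(path):
    return send_from_directory(OUTPUT_DIR, path)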
Other alternatives that were considered were:
- builtin HTTPServer in Python: used at first, but only did static files
- cgi module: wanted to try a framework instead of parsing CGI arguments and dealing with content-types by hand
- Pyramid: too complicated
- Bottle: interesting, looks simpler than Flask, if less popular?
- Django, Plone, etc: too much overhead
Various search engines were evaluated:
- Xapian:
  - used by Notmuch, craigslist, search.debian.org, lists.debian.org, wiki.debian.org, old gmane.org
  - no web API, would need to index directly through the Python API
  - harder to use
  - no extra server necessary
  - internal knowledge already present in debian.org
  - bound to use their CGI interface (Omega)
  - written in C++
- Lucene / Solr:
  - requires another server and API communications
  - per-index user/password access control
  - mature solution that has been around for a long time
  - may support indexing HTML?
  - JSON or XML data entry
  - large contributor community
  - usually faster than ES
  - based on Apache Lucene
  - written in Java
  - used by Netflix, Cnet
- Elasticsearch:
  - requires a server
  - packaged in Debian
  - REST/JSON API
  - documentation may be lacking when compared to Solr, e.g. had to ask on IRC to see if _id can be a string (yes, it can)
  - may be easier to scale than Solr
  - performance comparable with Solr
  - was designed to replace Solr
  - supports indexing HTML directly by stripping tags and entities
  - based on Apache Lucene as well
  - written in Java as well
  - requires a CLA for contributing
  - used by GitHub, Foursquare, presumably new gmane.org
  - created in 2010, ~18 months support lifetime?
- whoosh: used by MoinMoin 2.0
- sphinx: not well known, ignored
- mnoGoSearch: considered dead, ignored
- Ubuntu’s tool uses a simple Python CGI to search page names, and Google Search for full-text search
- codesearch uses PostgreSQL
- David’s uses sqlite
- we could use a simple Flask REST API for searches, but then the extractor (or renderer?) would need to write stuff to some database; sqlite reads fail while a write is in progress, so maybe not a good candidate?
Index all documents in the html/ directory:

$ omindex --url / --db search --mime-type=gz:ignore html/
11.03user 0.36system 0:12.27elapsed 92%CPU (0avgtext+0avgdata 29952maxresident)k
18056inputs+15688outputs (0major+7884minor)pagefaults 0swaps
--url is the equivalent of the renderer’s URL prefix, --db is
a directory where the database will end up and --mime-type is to
ignore raw manpages.
Second runs are much faster:

$ time omindex --url / --db search --mime-type=gz:ignore html/
0.01user 0.00system 0:00.02elapsed 88%CPU (0avgtext+0avgdata 4656maxresident)k
0inputs+0outputs (0major+300minor)pagefaults 0swaps
Display information about the search database:

$ delve search.db
UUID = 6fd4d4ab-2529-4d67-bbff-32b88fd888fa
number of documents = 452
average document length = 3976.96
document length lower bound = 78
document length upper bound = 243183
highest document id ever used = 452
has positional information = true

Search the database for terms:

$ quest -d search.db man2html | grep ^url
url=/cache/man/fr/man1/man2html.1.html
url=/cache/man/man1/man2html.1.html
url=/cache/man/ro/man1/man2html.1.html
url=/cache/man/it/man1/man2html.1.html
url=/cache/man/el/man1/man2html.1.html
$ quest -d search.db setreg | grep ^url
url=/cache/man/man1/setreg.1.html
url=/cache/man/man1/mozroots.1.html
url=/cache/man/man1/chktrust.1.html
url=/cache/man/man1/certmgr.1.html
This searches the full text of the pages: the setreg query also matches other manpages that merely mention it.
This may be out of date. The purpose of this section is to explain how this should be deployed on Debian’s infrastructure.
At least the extractor and renderer would run on manziarly, and the
output would be stored on the static.d.o CDN (see below). Part 3 (the
search engine) could be a separate (pair of?) server(s) to run the search cluster.
In the above setup,
manziarly would be a master server for static
file servers in the Debian.org infrastructure. Files saved there would
be rsync’d to multiple frontend servers. How this is configured is
detailed in the
DSA documentation, but basically, we would need to ask the DSA team for
an extra entry for manpages.d.o there to serve static files.
Gitlab vs Alioth
The project was originally hosted in the Collaborative Maintenance repositories, but those quickly showed their limitations, which included lack of continuous integration, issue tracking and automatic rendering of markdown files.
A project was created on Gitlab for this purpose, in anarcat’s personal
repositories (for now). On Gitlab, the project “mirrors” the public git
URL of the
collab-maint repo. On collab-maint, there is a cronjob in my
personal account which runs this command to synchronize the changes
from Gitlab at the 17th minute of the hour:
git -C /git/collab-maint/debmans.git fetch --quiet gitlab master:master
This was found to be the best compromise: it adds the extra Gitlab features while still keeping the access threshold for Debian members low. Do note that there is no conflict resolution whatsoever on collab-maint’s side, and the behavior of Gitlab in case of conflicts isn’t determined yet. This may require manual fixing of merge conflicts.
There were already three known implementations of “man to web” archive generators when this project was started.
After careful consideration of the existing alternatives, it was determined
that it was easier and simpler to write a cleanroom implementation, based in
part on the lessons learned from the existing implementations, reviewed below.
Original manpages.d.o codebase
The original codebase is a set of Perl and bash CGI scripts that dynamically generate (and search through) manpages.
The original codebase extracts manpages with
dpkg --fsys-tarfile and
tar commands. It also creates indexes using
man -k for
future searches. Manpages are stored in a directory for each
package-version pair, so it doesn’t garbage-collect disappeared
manpages. It also appears that packages are always extracted, even if
they had been parsed before.
The CGI script just calls
man and outputs plain text wrapped in
<PRE> tags without any cross-referencing or further formatting.
There is also a copy of the Ubuntu scripts in the source code.
It looks like there are both a bash and a Python
implementation of the same thing. They process the whole archive on the
local filesystem and create a timestamp file for every package found,
which avoids processing packages repeatedly (but all packages from the
Packages listing are
stat‘d at every run). In the bash version,
the manpages are extracted with
dpkg -x, in the Python version as
well, although it uses the
apt python package to list files. It uses
a simple regex (^usr/share/man/.*\.gz$) to find manpages.
It keeps a cache of the md5sum of the package in
$PUBLIC_HTML_DIR/manpages/$dist/.cache/$name to avoid looking at
known packages. The bash version only looks at the timestamp of the file
versus the package, and only checks the modification year.
To generate the HTML version of the manpage, both programs use the
/usr/lib/w3m/cgi-bin/w3mman2html.cgi script shipped with the w3m package.
Search is operated by a custom Python script that looks through manpage filenames or uses Google to do a full-text search.
A new codebase written by dgilman is available on
GitHub. It is a simple Python
script with a sqlite backend. It extracts the tarfile with
dpkg --fsys-tarfile, then parses it with the Python tarfile
library. It uses rather complicated regexes to find manpages and stores
apropos data and various metadata about manpages in the sqlite database. All
manpages are unconditionally extracted.
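A rough sketch of that extraction pipeline (the regex here is a simplified stand-in for the more complicated ones the script actually uses, and the paths are hypothetical):

import io
import re
import subprocess
import tarfile

# simplified stand-in for the script's more complicated regexes
MANPAGE_RE = re.compile(r'^usr/share/man/.*\.gz$')

# stream the package contents without installing it
data = subprocess.check_output(['dpkg', '--fsys-tarfile', 'package.deb'])
with tarfile.open(fileobj=io.BytesIO(data)) as tar:
    for member in tar.getmembers():
        if MANPAGE_RE.match(member.name.lstrip('./')):
            tar.extract(member, path='extracted')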
The OpenBSD project has a man.cgi(8) program that powers the whole application behind man.openbsd.org. It uses mandoc(1) to format manpages, as the manual system in OpenBSD has native support for HTML output; mandoc is linked directly into the CGI, which is written in C.