Design¶
This page explains the design principles and decisions in the project.
The Minimum Viable Product (MVP) for this project is a service that creates an HTML version of all the manpages of all the packages available in Debian, for all supported suites. Basic whatis(1) functionality is also expected. apropos(1) functionality is considered an extra that can be implemented later with existing tools.
Components¶
The design is split into components which map to debmans subcommands:
- extract: extracts manpages from Debian packages
- render: renders manpages into HTML
- site: renders a static site into HTML
- index: indexes HTML pages for searching (not implemented yet)
- search: the search interface (not implemented, but there is a simple “jump” Javascript tool)
There is also a serve
command which starts a local webserver to help
with development.
See the Remaining work file for details about the missing bits.
Extract¶
This part fetches all manpages from the archive and stores them on disk. This also makes them usable by tools like dman, which browse manpages remotely.
The layout options for where to store files were:
- Ubuntu: $DISTRIB_CODENAME/$LOCALE/man$i/$PAGE.$i.gz (see dman)
- original codebase: "${OUTPUTDIR}/${pooldir}/${packagename}_${version}" (from manpage-extractor.pl)
Ubuntu’s approach was chosen to avoid bitrot and to follow the existing filesystem layout more closely. It also happens to be easier to implement.
The extractor uses a cache to avoid re-extracting known manpages. We use the Ubuntu layout there as well ($outputdir/$suite/.cache/$packagename_version), which leads to bitrot, but at least it’s constrained to a suite. This will be a problem for unstable, so some garbage collection may be necessary.
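To make the extraction step concrete, here is a minimal, hypothetical sketch of how it could work: it pipes dpkg --fsys-tarfile into Python’s tarfile module and writes manpages into the Ubuntu-style layout chosen above. The function name, regex and layout details are assumptions for illustration, not the actual debmans code.

```python
import os
import re
import subprocess
import tarfile

# matches paths like ./usr/share/man/fr/man1/foo.1.gz inside the package
MANPAGE_RE = re.compile(
    r'\./usr/share/man/(?:(?P<locale>[^/]+)/)?man(?P<section>[1-9])/.*\.gz$')

def extract_manpages(deb_path, outputdir, suite):
    """Extract manpages from a .deb into $outputdir/$suite/$locale/man$i/."""
    # dpkg --fsys-tarfile streams the data.tar member of the package
    proc = subprocess.Popen(['dpkg', '--fsys-tarfile', deb_path],
                            stdout=subprocess.PIPE)
    with tarfile.open(fileobj=proc.stdout, mode='r|') as tar:
        for member in tar:
            match = MANPAGE_RE.match(member.name)
            if not match or not member.isfile():
                continue
            locale = match.group('locale') or ''  # top level is the C locale
            destdir = os.path.join(outputdir, suite, locale,
                                   'man' + match.group('section'))
            os.makedirs(destdir, exist_ok=True)
            dest = os.path.join(destdir, os.path.basename(member.name))
            with open(dest, 'wb') as out:
                out.write(tar.extractfile(member).read())
    proc.wait()
```

Streaming the tar member by member avoids writing the whole data.tar to disk.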
Render¶
This converts manpages to HTML so they are readable in a web browser. This is mostly about calling one of the Manpage converters and then embedding the output in one of the Templating systems.
A simple timestamp comparison between the template, source and target files makes sure that files which do not need to be refreshed are skipped.
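A minimal sketch of that skip logic, assuming mandoc as the converter (the helper name and direct file write are invented for the example):

```python
import os
import subprocess

def render(source, target, template):
    """Re-render target only if it is older than the source or the template."""
    if os.path.exists(target) and \
       os.path.getmtime(target) > os.path.getmtime(source) and \
       os.path.getmtime(target) > os.path.getmtime(template):
        return  # target is fresh, skip it
    # mandoc -T html converts the manpage to HTML on stdout
    html = subprocess.run(['mandoc', '-T', 'html', source],
                          check=True, stdout=subprocess.PIPE).stdout
    with open(target, 'wb') as out:
        out.write(html)  # the real renderer embeds this in a template instead
```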
Index¶
This indexes HTML pages in a search engine of some sort.
In the backend, something will need to index the manpages if only to
implement apropos(1)
functionality, but eventually also full-text
search. This should be modular so that new backends can be implemented
as needed.
For now, we are considering reusing the existing Xapian infrastructure the Debian project already has.
The indexer would:
- run omindex on the HTML tree and create a database
- process each locale separately so that locales are isolated (may be tricky for LANG=C) and the right stemmer is called (see the sketch after this list)
- need a CGI script (provided by the xapian-omega package) to query the database; its HTML output is generated from templates, so presumably we could reuse the existing templates
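As referenced in the list above, here is a hypothetical sketch of the per-locale indexing step, using the omindex flags shown in the Xapian examples below; the directory layout is an assumption:

```python
import os
import subprocess

html_root = 'html'  # hypothetical rendered tree, one directory per locale
for locale in sorted(os.listdir(html_root)):
    tree = os.path.join(html_root, locale)
    if not os.path.isdir(tree):
        continue
    # one Xapian database per locale keeps the indexes isolated;
    # a per-locale --stemmer argument could also be added here
    subprocess.run(['omindex', '--url', '/' + locale,
                    '--db', 'search-' + locale,
                    '--mime-type=gz:ignore', tree], check=True)
```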
It is assumed that Xapian can deal with large datasets (10-50GB), considering how it is used in Notmuch (my own mailbox is around 6GB) and lists.debian.org.
We put the command name in the page <title> tag, the short description in <meta description="...">, and use the magic markers (<!--htdig_noindex-->ignored<!--/htdig_noindex-->) to make the indexer ignore redundant bits.
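For illustration, a hypothetical sketch of the page skeleton the renderer could produce; debmans actually uses Jinja templates, and the field names here are invented:

```python
PAGE = """<html><head>
<title>{name}</title>
<meta description="{description}">
</head><body>
<!--htdig_noindex-->{navigation}<!--/htdig_noindex-->
{content}
</body></html>"""

def wrap(name, description, navigation, content):
    # everything between the htdig_noindex markers is skipped by the indexer
    return PAGE.format(name=name, description=description,
                       navigation=navigation, content=content)
```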
See the Search engines section for more information about the various search software evaluated and the web interface.
Search¶
The search interface itself would be a CGI or WSGI tool (if written in Python) that hooks into the webserver to perform searches.
Originally, only a browser-based Javascript search tool implemented basic whatis(1) functionality. It looked up the manpage using an XMLHttpRequest to see if the requested page exists and redirected appropriately. It did not yet handle different locales.
A prototype of a CGI-based approach was then written in Flask. The search engine looks at the filesystem for a given pattern and, optionally, section, suite and locale parameters. It doesn’t show all suites by default and prefers to show all matching manpages, behaving like a names-only apropos(1).
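A minimal sketch of what such a Flask endpoint could look like, globbing the filesystem for matching pages; the route, parameter names, defaults and directory layout are assumptions, not the actual debmans implementation:

```python
import glob
import os

from flask import Flask, jsonify, request

app = Flask(__name__)
HTML_ROOT = 'html'  # hypothetical rendered-pages directory

@app.route('/search')
def search():
    pattern = request.args.get('q', '')
    section = request.args.get('section', '*')
    suite = request.args.get('suite', 'unstable')  # not all suites by default
    locale = request.args.get('locale', '')
    # e.g. html/unstable/fr/man1/ls.1.html
    matches = glob.glob(os.path.join(HTML_ROOT, suite, locale,
                                     'man' + section,
                                     '%s.%s.html' % (pattern, section)))
    return jsonify(sorted(os.path.relpath(m, HTML_ROOT) for m in matches))
```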
See the Web frameworks section for more information about the decision to use Flask.
This should be extended to a full search interface, using omega’s web interface or other pluggable interfaces.
Software evaluation¶
This section tries to justify (sometimes after the fact) the choice of certain dependencies and software technologies used in the project.
Manpage converters¶
Those are the known manpage converters:
- just the plaintext output of man wrapped in <PRE> tags (current design)
- man2html is an old C program that ships with a bunch of CGI scripts
- there’s another man2html that is a perl script, but I couldn’t figure out how to use it correctly
- w3m has another Perl script that is used by the Ubuntu site
- roffit is another perl script; the version in Debian is ancient (2012) and doesn’t display the man(1) synopsis correctly (newer versions from github also fail)
- pandoc can’t, unfortunately, read manpages (only write them)
- man itself can generate an HTML version with man -Hcat man and the output is fairly decent, although there is no cross-referencing
- mandoc also has HTML output and is packaged in Debian
The Makefile here tests possible manpage HTML renderers. Each is timed
with time(1)
to show its performance.
| Package | Timing |
|---|---|
| roffit | 0.06user 0.00system 0:00.07elapsed 96%CPU (0avgtext+0avgdata 4852maxresident)k |
| w3m | 0.26user 0.01system 0:00.19elapsed 137%CPU (0avgtext+0avgdata 5456maxresident)k |
| man | 1.63user 0.17system 0:01.81elapsed 99%CPU (0avgtext+0avgdata 27268maxresident)k |
| man2html | 0.00user 0.00system 0:00.01elapsed 61%CPU (0avgtext+0avgdata 1568maxresident)k |
| mandoc | 0.00user 0.00system 0:00.01elapsed 57%CPU (0avgtext+0avgdata 2352maxresident)k |
Note
Those statistics were created with
debmans/test/converters/Makefile
in the source tree.
Here is how the actual output compares:
| Package | Correctness |
|---|---|
| roffit | SYNOPSIS fails to display correctly |
| w3m | includes HTTP headers, links to CGI script, all pre-formatted, no TOC |
| man | TOC, no cross-referencing |
| man2html | includes HTTP headers, links to CGI script, index at the end |
| mandoc | customizable links and stylesheets, no index, can avoid <body> tags |
man2html was originally chosen because it is the fastest, includes an index and is not too opinionated about how the output is formatted. Unfortunately, it would fail to parse a lot of manpages, like the ones from the gnutls project.
Then w3m
was used as a fallback, even though it actually calls
man
itself to do part of the rendering. It required a bunch of
hacks to fix the markup.
So then the mandoc package was used, and it was significantly faster. In the test corpus (452 manpages), mandoc would render all pages in 23 seconds, while w3m was 5 times slower.
Time for w3m:
77.29user 13.02system 1:26.04elapsed 104%CPU (0avgtext+0avgdata 63800maxresident)k
Time for mandoc:
15.70user 4.75system 0:26.40elapsed 77%CPU (0avgtext+0avgdata 55992maxresident)k
Templating systems¶
Templating is necessary to turn things into HTML consistently. We chose Jinja because it seemed lightweight and simple enough. It is used by Pelican, MoinMoin 2.0, Ansible and Salt. It also seemed to be a good middleground between arbitrary code execution and static templates in this comparison. Jinja is also used by the Flask framework which is close to the Click commandline framework we are already using.
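As a quick illustration of how lightweight Jinja is in practice, a minimal, self-contained example (the template and variables are invented, not the actual debmans templates):

```python
from jinja2 import Environment

env = Environment(autoescape=True)
template = env.from_string(
    '<title>{{ name }}({{ section }})</title>\n'
    '<h1>{{ name }}</h1>\n{{ content }}')
print(template.render(name='ls', section=1,
                      content='list directory contents'))
```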
Interesting alternatives include Mako, used by Reddit and Pyramid, which is itself part of the larger Pylons project. This all seems like too much to pull in.
Genshi was also evaluated summarily later on but discarded as I found the basic tutorial to be too complicated.
It used to be possible to change the location of the templates and static files, but that was removed during some refactoring, partly because of the lack of interest from Ubuntu to reuse the software for now. If we decide to support themes, it may be simpler to use flask-themes instead of implementing our own theming system, although that seems to be a bit heavy...
Web frameworks¶
Flask was chosen to build a quick CGI prototype because I wanted to experiment with it, but also because it integrates with the templating system already chosen (Jinja) and the commandline framework tool (Click). It also competes reasonably with other Python frameworks in the techempower benchmarks. I have found the decorators in Flask to be really intuitive and easy to use, and while I had to bounce around for a while to merge the search engine with the regular static file server, in the end it turned out to be a simple implementation.
Other alternatives that were considered were:
- builtin HTTP server in Python: used at first, but only served static files
- builtin cgi module: wanted to try a framework instead of parsing CGI arguments and dealing with content-types by hand
- Pyramid: too complicated
- Bottle: interesting, looks simpler than Flask, if less popular?
- Django, Plone, etc.: too much overhead
Search engines¶
Various search engines were evaluated:
- Xapian:
- used by Notmuch, craigslist, search.debian.org, lists.debian.org, wiki.debian.org, old gmane.org
- no web API, would need to index directly through the Python API
- harder to use
- no extra server necessary
- internal knowledge already present in debian.org
- bound to use their CGI interface (Omega)
- written in C++
- Lucene / Solr:
- requires another server and API communications
- per-index user/password access control
- mature solution that has been around for a long time
- may support indexing HTML?
- JSON or XML data entry
- large contributor community
- usually faster than ES
- based on Apache Lucene
- written in Java
- used by Netflix, Cnet
- Elasticsearch:
- requires a server
- packaged in Debian
- REST/JSON API
- documentation may be lacking when compared to Solr, e.g. had to ask on IRC to see if _id can be a string (yes, it can)
- Python API, but could also be operated with a Javascript library
- may be easier to scale than Solr
- performance comparable with Solr
- was designed to replace Solr
- supports indexing HTML directly by stripping tags and entities
- based on Apache Lucene as well
- written in Java as well
- requires a CLA for contributing
- used by Github, Foursquare, presumably new gmane.org
- created in 2010, ~18 months support lifetime?
- whoosh: used by moinmoin 2.0
- sphinx: not well known, ignored
- mnogosearch: considered dead, ignored
- homegrown:
- Ubuntu’s tool uses a simple Python CGI to search page names, and Google Search for full-text search
- codesearch uses Postgresql
- David’s uses sqlite
- Readthedocs has a custom-built Javascript-based search engine
- we could use a simple Flask REST API for searches, but then the extractor (or renderer?) would need to write stuff to some database
- sqlite reads fail while writing, so maybe not a good candidate?
Xapian examples¶
Index all documents in html/:
$ omindex --url / --db search --mime-type=gz:ignore html/
11.03user 0.36system 0:12.27elapsed 92%CPU (0avgtext+0avgdata 29952maxresident)k
18056inputs+15688outputs (0major+7884minor)pagefaults 0swaps
--url is the equivalent of the renderer’s --prefix, --db is a directory where the database will end up, and --mime-type is used to ignore raw manpages.
Second runs are much faster:
$ time omindex --url / --db search --mime-type=gz:ignore html/
0.01user 0.00system 0:00.02elapsed 88%CPU (0avgtext+0avgdata 4656maxresident)k
0inputs+0outputs (0major+300minor)pagefaults 0swaps
Display information about the search database:
$ delve search.db
UUID = 6fd4d4ab-2529-4d67-bbff-32b88fd888fa
number of documents = 452
average document length = 3976.96
document length lower bound = 78
document length upper bound = 243183
highest document id ever used = 452
has positional information = true
Example searches:
$ quest -d search.db man2html | grep ^url
url=/cache/man/fr/man1/man2html.1.html
url=/cache/man/man1/man2html.1.html
url=/cache/man/ro/man1/man2html.1.html
url=/cache/man/it/man1/man2html.1.html
url=/cache/man/el/man1/man2html.1.html
$ quest -d search.db setreg | grep ^url
url=/cache/man/man1/setreg.1.html
url=/cache/man/man1/mozroots.1.html
url=/cache/man/man1/chktrust.1.html
url=/cache/man/man1/certmgr.1.html
This would search only <title> fields: --prefix=title:S 'title:foo'.
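The same database can also be queried through the Xapian Python bindings instead of quest, which is roughly what a search backend would do; a minimal sketch, assuming the search database created by the omindex run above:

```python
import xapian

db = xapian.Database('search')
parser = xapian.QueryParser()
parser.set_database(db)
parser.set_stemmer(xapian.Stem('english'))
parser.add_prefix('title', 'S')  # omega convention: <title> terms use the S prefix

enquire = xapian.Enquire(db)
enquire.set_query(parser.parse_query('title:man2html'))
for match in enquire.get_mset(0, 10):
    # omindex stores fields like "url=..." in the document data
    print(match.document.get_data().decode().splitlines()[0])
```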
Infrastructure¶
Debian.org¶
Note
This may be out of date. The purpose of this section is to explain how this should be deployed on Debian’s infrastructure.
At least the extractor and renderer would run on manziarly. The output would be stored on the static.d.o CDN (see below). The search part could run on a separate server (or pair of servers?) hosting the search cluster.
In the above setup, manziarly would be a master server for static file servers in the Debian.org infrastructure. Files saved there would be rsync’d to multiple frontend servers. How this is configured is detailed in the static-mirroring DSA documentation, but basically, we would need to ask the DSA team for an extra entry for manpages.d.o there to serve static files.
Gitlab vs Alioth¶
The project was originally hosted in the Collaborative Maintenance repositories, but those quickly showed their limitations, which included the lack of continuous integration, issue tracking and automatic rendering of markdown files.
A project was created on Gitlab for this purpose, in anarcat’s personal repositories (for now). On Gitlab, the project “mirrors” the public git URL of the collab-maint repo. On collab-maint, a cronjob in my personal account runs this command to synchronize the changes from Gitlab at the 17th minute of every hour:
git -C /git/collab-maint/debmans.git fetch --quiet gitlab master:master
This was found to be the best compromise, adding the extra Gitlab features while still keeping the access threshold for Debian members low. Do note that there is no conflict resolution whatsoever on collab-maint’s side, and the behavior of Gitlab in case of conflicts isn’t determined yet. This may require manual fixing of merge conflicts.
Other implementations¶
There were already three known implementations of “man to web” archive generators when this project was started.
After careful consideration of existing alternatives, it was determined
it was easier and simpler to write a cleanroom implementation, based in
part on the lessons learned from the existing implementations and the
more mature debsources
project.
Original manpages.d.o codebase¶
The original codebase is a set of Perl and bash CGI scripts that dynamically generate (and search through) manpages.
The original codebase extracts manpages with dpkg --fsys-tarfile and the tar command. It also creates indexes using man -k for future searches. Manpages are stored in a directory for each package-version pair, so it doesn’t garbage-collect manpages that disappear. It also appears that packages are always extracted, even if they have been parsed before.
The CGI script just calls man
and outputs plain text wrapped in
<PRE>
tags without any cross-referencing or further formatting.
There is also a copy of the Ubuntu scripts in the source code.
Ubuntu¶
Ubuntu has their own manpage repository at https://manpages.ubuntu.com/. Their codebase is a mix of Python, Perl and Bash.
It looks like there are both bash and python implementations of the same thing. They process the whole archive on the local filesystem and create a timestamp file for every package found, which avoids processing packages repeatedly (but all packages from the Packages listing are stat’d at every run). In the bash version, the manpages are extracted with dpkg -x; in the Python version as well, although it uses the apt python package to list files. It uses a simple regex (^usr/share/man/.*\.gz$) to find manpages.
It keeps a cache of the md5sum of the package in "$PUBLIC_HTML_DIR/manpages/$dist/.cache/$name" to avoid looking at known packages. The bash version only looks at the timestamp of the file versus the package, and only checks the modification year.
To generate the HTML version of the manpage, both programs use the
/usr/lib/w3m/cgi-bin/w3mman2html.cgi
shipped with the
w3m package.
Search is operated by a custom Python script that looks through manpage filenames or uses Google to do a full-text search.
dgilman codebase¶
A new codebase written by dgilman is available on github. It is a simple Python script with a sqlite backend. It extracts the tarfile with dpkg --fsys-tarfile then parses it with the Python tarfile library. It uses rather complicated regexes to find manpages and stores various apropos entries and metadata about manpages in the sqlite database. All manpages are unconditionally extracted.
OpenBSD¶
The OpenBSD project has a man.cgi(8) program that powers the whole application behind man.openbsd.org. It uses mandoc(1) to format manpages, as the manual system in OpenBSD has native support for HTML output. This is linked directly in the CGI, which is written in C.
FreeBSD¶
The man.freebsd.org site is powered by a perl script written by Wolfram Schneider which parses the output of the man(1) command directly. See the help page for more information.