Design¶
This page explains the design principles and decisions in the project.
Minimum viable product¶
The Minimum Viable Product for this project is a service that creates an
HTML version of all the manpages of all the packages available in
Debian, for all supported suites. Basic whatis(1) functionality is
also expected. apropos(1) functionality is considered an extra that can
be implemented later with existing tools.
The design is split into components which map to debmans
subcommands:

extract
: extracts manpages from Debian packages

render
: renders manpages into HTML

site
: renders a static site into HTML

index
: indexes HTML pages for searching (not implemented yet)

search
: the search interface (not implemented, but there is a simple “jump” Javascript tool)
There is also a serve
command which starts a local webserver to help
with development.
See the Remaining work file for details about the missing bits.
Extract¶
This part fetches all manpages from the archive and stores them on disk. This makes them usable for tools like dman that browse remote webpages.
The layout options for where to store files were:

- Ubuntu: $DISTRIB_CODENAME/$LOCALE/man$i/$PAGE.$i.gz (see dman)
- original codebase: "${OUTPUTDIR}/${pooldir}/${packagename}_${version}" (from manpage-extractor.pl)
Ubuntu’s approach was chosen to avoid bitrot and follow more closely the existing filesystem layout. It also happens to be easier to implement.
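The chosen layout can be sketched as a small path-building helper; a minimal sketch with hypothetical parameter names, not debmans' actual API:

```python
import os

# Build the Ubuntu-style layout chosen above:
#   $DISTRIB_CODENAME/$LOCALE/man$i/$PAGE.$i.gz
# All names here are illustrative, not debmans' actual API.
def output_path(outputdir, codename, locale, section, page):
    filename = "%s.%s.gz" % (page, section)
    return os.path.join(outputdir, codename, locale,
                        "man%s" % section, filename)
```

For example, `output_path("/srv/man", "jessie", "fr", "1", "ls")` yields `/srv/man/jessie/fr/man1/ls.1.gz`.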
The extractor uses a cache to avoid re-extracting known manpages. We use
the Ubuntu layout there as well
($outputdir/$suite/.cache/$packagename_version), which leads to
bitrot, but at least it’s constrained to a suite. This will be a problem
for unstable, so some garbage collection may be necessary.
Render¶
This converts manpages to HTML so they are readable in a web browser.
Possible options for this implementation:

- just the plaintext output of man wrapped in <PRE> tags (current design)
- man2html is an old C program that ships with a bunch of CGI scripts
- there’s another man2html that is a Perl script, but I couldn’t figure out how to use it correctly
- w3m has another Perl script that is used by the Ubuntu site
- roffit is another Perl script; the version in Debian is ancient (2012) and doesn’t display the man(1) synopsis correctly (newer versions from github also fail)
- pandoc can’t, unfortunately, read manpages (only write)
- man itself can generate an HTML version with man -Hcat man and the output is fairly decent, although there is no cross-referencing
The Makefile here tests possible manpage HTML renderers. Each is timed
with time(1)
to show its performance.
| Package | Timing |
|---|---|
| roffit | 0.06user 0.00system 0:00.07elapsed 96%CPU (0avgtext+0avgdata 4852maxresident)k |
| w3m | 0.26user 0.01system 0:00.19elapsed 137%CPU (0avgtext+0avgdata 5456maxresident)k |
| man | 1.63user 0.17system 0:01.81elapsed 99%CPU (0avgtext+0avgdata 27268maxresident)k |
| man2html | 0.00user 0.00system 0:00.01elapsed 61%CPU (0avgtext+0avgdata 1568maxresident)k |
Note
Those statistics were created with debmans/test/converters/Makefile in the source tree.
Here is how the actual output compares:
| Package | Correctness |
|---|---|
| roffit | SYNOPSIS fails to display correctly |
| w3m | includes HTTP headers, links to CGI script, all pre-formatted, no TOC |
| man | TOC, no cross-referencing |
| man2html | includes HTTP headers, links to CGI script, index at the end |
man2html was originally chosen because it is the fastest, includes
an index and is not too opinionated about how the output is
formatted. Unfortunately, it would fail to parse a lot of manpages,
like the ones from the gnutls project. w3m was used as a
fallback, even though it actually calls man itself to do part of
the rendering.
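The fallback strategy amounts to trying each renderer in turn; a hedged sketch, where the command names and flags are placeholders rather than debmans' actual invocations:

```python
import subprocess

# Try each renderer command in order until one succeeds; the command
# names here are placeholders, not debmans' actual invocations.
def render(path, renderers=(["man2html"], ["w3mman2html.cgi"])):
    for cmd in renderers:
        try:
            return subprocess.check_output(list(cmd) + [path])
        except (subprocess.CalledProcessError, OSError):
            continue  # renderer failed or is missing: try the next one
    raise RuntimeError("no renderer could handle %s" % path)
```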
Index¶
This indexes HTML pages in a search engine of some sort.
In the backend, something will need to index the manpages if only to
implement apropos(1)
functionality, but eventually also full-text
search. This should be modular so that new backends can be implemented
as needed.
For now, we are considering reusing the existing Xapian infrastructure the Debian project already has.
The indexer would:
- run omindex on the HTML tree and create a database
- process each locale separately so they are isolated (may be tricky for LANG=C) and the right stemmer is called
- need a CGI script (provided by the xapian-omega package) to query the database
- the HTML output is generated based on templates, so presumably we could reuse the existing templates
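The per-locale runs could be driven by a small wrapper that builds one omindex command per locale; a sketch assuming each locale lives in its own subdirectory (the flags match the Xapian examples on this page):

```python
import os

# Build one omindex invocation per locale subdirectory, so each locale
# gets its own database and stemmer; the directory layout is an
# assumption, not debmans' actual structure.
def omindex_commands(html_root, locales, db_root="search"):
    return [["omindex", "--url", "/%s/" % locale,
             "--db", os.path.join(db_root, locale),
             "--mime-type=gz:ignore",
             os.path.join(html_root, locale)]
            for locale in locales]
```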
It is assumed that Xapian can deal with large datasets (10-50GB) considering how it is used in Notmuch (my own mailbox is ~6GB) and lists.debian.org.
We put the command name in the page <title> tag, the short
description in <meta description="...">, and use the magic markers
(<!--htdig_noindex-->ignored<!--/htdig_noindex-->) to make the
indexer ignore redundant bits.
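These conventions can be sketched as a tiny template helper; only the title, meta and htdig_noindex markers come from the text above, everything else about the page template is a placeholder:

```python
# Emit the indexing-related markup described above; the content inside
# the noindex markers is illustrative.
def page_head(command, description):
    return ("<title>%s</title>\n"
            '<meta description="%s">\n'
            "<!--htdig_noindex-->navigation, footer, etc."
            "<!--/htdig_noindex-->" % (command, description))
```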
See the Search section for more information about the various search software evaluated and the web interface.
Xapian examples¶
Index all documents in html/:

$ time omindex --url / --db search --mime-type=gz:ignore html/
11.03user 0.36system 0:12.27elapsed 92%CPU (0avgtext+0avgdata 29952maxresident)k
18056inputs+15688outputs (0major+7884minor)pagefaults 0swaps
--url is the equivalent of the renderer’s --prefix, --db is
a directory where the database will end up, and --mime-type is to
ignore raw manpages.
Second runs are much faster:
$ time omindex --url / --db search --mime-type=gz:ignore html/
0.01user 0.00system 0:00.02elapsed 88%CPU (0avgtext+0avgdata 4656maxresident)k
0inputs+0outputs (0major+300minor)pagefaults 0swaps
Display information about the search database:
$ delve search.db
UUID = 6fd4d4ab-2529-4d67-bbff-32b88fd888fa
number of documents = 452
average document length = 3976.96
document length lower bound = 78
document length upper bound = 243183
highest document id ever used = 452
has positional information = true
Example searches:
$ quest -d search.db man2html | grep ^url
url=/cache/man/fr/man1/man2html.1.html
url=/cache/man/man1/man2html.1.html
url=/cache/man/ro/man1/man2html.1.html
url=/cache/man/it/man1/man2html.1.html
url=/cache/man/el/man1/man2html.1.html
$ quest -d search.db setreg | grep ^url
url=/cache/man/man1/setreg.1.html
url=/cache/man/man1/mozroots.1.html
url=/cache/man/man1/chktrust.1.html
url=/cache/man/man1/certmgr.1.html
This would search only <title> fields: --prefix=title:S 'title:foo'.
Searching¶
The search interface itself would be a CGI (or WSGI, if written in Python) tool that would hook into the webserver to perform searches.
Currently only a browser-based Javascript search tool implements basic
whatis(1) functionality. It looks up the manpage using an
XMLHttpRequest to see if the requested page exists and redirects
appropriately. It doesn’t look at different locales yet.
This should be extended to a full search interface, using omega’s
web interface or other pluggable interfaces.
Search software evaluation¶
Various search engines were evaluated:
- Xapian:
- used by Notmuch, craigslist, search.debian.org, lists.debian.org, wiki.debian.org, old gmane.org
- no web API, would need to index directly through the Python API
- harder to use
- no extra server necessary
- internal knowledge already present in debian.org
- bound to use their CGI interface (Omega)
- written in C++
- Lucene / Solr:
- requires another server and API communications
- per-index user/password access control
- mature solution that has been around for a long time
- may support indexing HTML?
- JSON or XML data entry
- large contributor community
- usually faster than ES
- based on Apache Lucene
- written in Java
- used by Netflix, Cnet
- Elasticsearch:
- requires a server
- packaged in Debian
- REST/JSON API
- documentation may be lacking when compared to Solr, e.g. had to ask on IRC to see if _id can be a string (yes, it can)
- Python API, but could also be operated with a Javascript library
- may be easier to scale than Solr
- performance comparable with Solr
- was designed to replace Solr
- supports indexing HTML directly by stripping tags and entities
- based on Apache Lucene as well
- written in Java as well
- requires a CLA for contributing
- used by Github, Foursquare, presumably new gmane.org
- created in 2010, ~18 months support lifetime?
- sphinx: not well known, ignored
- mnogosearch: considered dead, ignored
- homegrown:
- codesearch uses Postgresql
- David’s uses sqlite
- Readthedocs has a custom-built Javascript-based search engine
- we could use a simple Flask REST API for searches, but then the extractor (or renderer?) would need to write stuff to some database
- sqlite reads fail when writing, so maybe not a good candidate?
Sources:
Infrastructure¶
At least the extractor and renderer would run on manziarly. The
output would be stored on the static.d.o CDN (see below). Part 3 (the
search cluster) could run on a separate server (or pair of servers).
In the above setup, manziarly would be a master server for static
file servers in the Debian.org infrastructure. Files saved there would
be rsync’d to multiple frontend servers. How this is configured is
detailed in the static-mirroring DSA documentation, but basically, we
would need to ask the DSA team for an extra entry for manpages.d.o
there to serve static files.
Gitlab vs Alioth¶
The project was originally hosted in the Collaborative Maintenance repositories, but those quickly showed their limitations, which included lack of continuous integration, issue tracking and automatic rendering of markdown files.
A project was created on Gitlab for this purpose, in anarcat’s personal
repositories (for now). On Gitlab, the project “mirrors” the public git
URL of the collab-maint repo. On collab-maint, there is a cronjob in my
personal account which runs this command to synchronize the changes
from Gitlab at the 17th minute of the hour:

git -C /git/collab-maint/debmans.git fetch --quiet gitlab master:master
This was found to be the best compromise in adding the extra Gitlab features while still keeping the access threshold low for Debian members. Do note that there is no conflict resolution whatsoever on collab-maint’s side, and the behavior of Gitlab in case of conflicts isn’t determined yet. This may require manual fixing of merge conflicts.
Other implementations¶
There were already three known implementations of “man to web” archive generators when this project was started.
After careful consideration of existing alternatives, it was determined
it was easier and simpler to write a cleanroom implementation, based in
part on the lessons learned from the existing implementations and the
more mature debsources
project.
Original manpages.d.o codebase¶
The original codebase is a set of Perl and bash CGI scripts that dynamically generate (and search through) manpages.
The original codebase extracts manpages with dpkg --fsys-tarfile and
the tar command. It also creates indexes using man -k for
future searches. Manpages are stored in a directory for each
package-version pair, so it doesn’t garbage-collect disappeared
manpages. It also appears that packages are always extracted, even if
they had been parsed before.
The CGI script just calls man and outputs plain text wrapped in
<PRE> tags without any cross-referencing or further formatting.
There is also a copy of the Ubuntu scripts in the source code.
Ubuntu¶
Ubuntu has their own manpage repository at https://manpages.ubuntu.com/. Their codebase is partly Python, Perl and Bash.
It looks like there’s a bash and a python implementation of the same
thing. They process the whole archive on the local filesystem and
create a timestamp file for every package found, which avoids
processing packages repeatedly (but all packages from the Packages
listing are stat’d at every run). In the bash version, the manpages
are extracted with dpkg -x, in the Python version as well, although it
uses the apt python package to list files. It uses a simple regex
(^usr/share/man/.*\.gz$) to find manpages.
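The regex above, applied to a list of archive member paths, could look like this minimal sketch (the function itself is illustrative, not the Ubuntu scripts' code):

```python
import re

# The manpage-matching regex quoted above; the filtering function is
# a sketch, not the Ubuntu scripts' actual code.
MANPAGE_RE = re.compile(r"^usr/share/man/.*\.gz$")

def find_manpages(members):
    return [m for m in members if MANPAGE_RE.match(m)]
```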
It keeps a cache of the md5sum of the package in
"$PUBLIC_HTML_DIR/manpages/$dist/.cache/$name" to avoid looking at
known packages. The bash version only compares the timestamp of the
file against the package, and only checks the modification year.
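The md5sum cache could be sketched like this; the layout mirrors the quoted .cache path, but the function name and logic are assumptions:

```python
import hashlib
import os

# Skip packages whose md5sum matches the cached digest, writing the
# digest on a miss; an illustrative sketch of the cache described
# above, not the Ubuntu scripts' actual code.
def seen_before(cache_dir, name, deb_path):
    with open(deb_path, "rb") as deb:
        digest = hashlib.md5(deb.read()).hexdigest()
    cache_file = os.path.join(cache_dir, name)
    try:
        with open(cache_file) as cache:
            if cache.read() == digest:
                return True
    except IOError:
        pass
    with open(cache_file, "w") as cache:
        cache.write(digest)
    return False
```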
To generate the HTML version of the manpage, both programs use the
/usr/lib/w3m/cgi-bin/w3mman2html.cgi
shipped with the
w3m package.
Search is operated by a custom Python script that looks through manpage filenames or uses Google to do a full text search.
dgilman codebase¶
A new codebase written by dgilman is available on
github. It is a simple Python
script with a sqlite backend. It extracts the tarfile with
dpkg --fsys-tarfile then parses it with the Python tarfile
library. It uses rather complicated regexes to find manpages and stores
various apropos entries and metadata about manpages in the sqlite
database. All manpages are unconditionally extracted.
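The --fsys-tarfile plus tarfile approach can be sketched as parsing the tar stream (with a simple substring match standing in for dgilman's regexes):

```python
import tarfile

# Parse a data tarball stream for manpages; in practice the stream
# would come from something like:
#   subprocess.Popen(["dpkg", "--fsys-tarfile", "pkg.deb"],
#                    stdout=subprocess.PIPE).stdout
# The substring match stands in for dgilman's regexes.
def list_manpages(tar_stream):
    with tarfile.open(fileobj=tar_stream, mode="r|") as tar:
        return [member.name for member in tar
                if "usr/share/man/" in member.name
                and member.name.endswith(".gz")]
```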