Revision History | | |
---|---|---|
Revision v0.01 | 29 July 2004 | esr |
First draft. | | |
Abstract
I describe a project plan for getting Fedora to a fully hypertexted system-wide documentation database (competitive with, for example, Microsoft Help) through relatively simple changes to the build system and tools, without requiring massive manual conversion.
The goal of Project Paradise is to develop tools and practices that will give Fedora Linux a fully hypertexted system-wide documentation database, indexed, searchable, and addressable through a browser.
The motivation for this goal is that desktop users expect, and deserve, a better documentation infrastructure than Linux currently provides. One of the prerequisites for world domination is the ability to compete with, and beat, Microsoft Help.
This is a difficult problem, because system and application documentation is scattered across several different formats in many different places on a Linux system. Most of it is in manual pages, some is in HTML, some in GNU info files, and some in flat text. Accordingly, just merging all these sources into a single corpus for indexing and cross-referencing is significant work. There are also design issues about what storage formats to use in the corpus, how to do indexing, and so forth.
I'm proposing Project Paradise because I believe enough of the sticky issues have been resolved that we can now take on and solve the central problem. In the remainder of this document I will lay out a concrete series of solution steps.
Some of what would have been major issues in designing Paradise ten or even five years ago have been settled in the last few years. Notably, the right presentation format for such a system is no longer controversial. It must be HTML, because today's users expect to reach documentation through a browser. Even their expectations about searching and indexing are conditioned by experience with hyperlinks and Web search engines.
The basic decision to use HTML as a presentation format carries with it easy first-cut solutions to some associated problems. Full-text search and indexing of HTML trees is a well-understood technology, supported by mature open-source tools like ht://Dig. The choice of HTML also means we can defer issues about rendering for print; users have shown they are willing to live with print from HTML when generation of tuned PostScript or PDF is unavailable.
HTML, while excellent for presentation, is semantically thin. This has both obvious and unobvious consequences. One obvious consequence is that it is hard to support searches that know about the structure and type of items in the documentation. From HTML, for example, you can't automatically generate an index of all names of files referenced from documentation, or all code example listings. To have any hope of ever supporting rich searching above the word-by-word ht://Dig level, the documentation corpus has to be stored in a rich format.
Another obvious consequence: there are kinds of structure that HTML does not express well, like "This should be a concept index entry" or "This is a footnote". Yet another obvious consequence is the quality of print rendering. HTML-to-print is acceptable as a first cut, but it's not really good.
An unobvious consequence is that conversion of old-fashioned presentation-level formats like man pages to HTML is too easy to do badly. Tools like man2html(1) tend to produce poor-quality and inconsistent presentation HTML from a large documentation corpus because they simply copy over presentation-level markup from the source — so you end up (for example) with filenames presented in an italic font or a bold font or a constant-width font or with no highlighting at all.
Thus, behind the scenes, Paradise will need a common storage or master format (distinct from the presentation format) that is rich enough to express annotations like "This should be indexed" or "This is a filename" or "This is a footnote" or "This is a sidebar".
In 2004 this choice is not difficult either. Many open-source projects (including the Linux kernel, the Linux Documentation Project, GNOME, KDE, X.org, and MySQL) have recognized the need for a richer master markup, and have moved or are moving to XML-DocBook. Part of the reason for this choice is that generating good HTML from DocBook is easy.
DocBook is not without problems, most notably that it's verbose and rather heavyweight. But it is less nasty than any alternative that is even nearly as capable.
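To make the payoff of semantic markup concrete, here is a minimal sketch of the kind of typed index that is awkward to build from presentation HTML but straightforward from DocBook. The fragment and the helper name `filenames_in` are hypothetical, written only for this illustration; the sketch assumes the filenames are tagged with DocBook's `<filename>` element and that the document uses no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook fragment, invented for this illustration only.
DOCBOOK_SAMPLE = """\
<refsect1>
  <title>FILES</title>
  <para>Settings are read from <filename>/etc/example.conf</filename>
  and then from <filename>~/.examplerc</filename>.</para>
  <para>See also <command>example</command>.</para>
</refsect1>
"""

def filenames_in(docbook_xml):
    """Return the text of every <filename> element, in document order."""
    root = ET.fromstring(docbook_xml)
    return [element.text for element in root.iter("filename")]

print(filenames_in(DOCBOOK_SAMPLE))
```

The same loop pointed at a different element name would collect commands or code listings, which is the sort of structure-aware query that word-by-word indexing cannot answer.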
The choice of HTML as presentation format and DocBook as a common intermediate format freezes a large number of what would otherwise be variables in the design, and lets us defer others (such as indexing and searching). We're left with three problems:
Conversion: How do we lift old formats like man, Texinfo, plain text, and HTML into the DocBook common intermediate format? Assuming we can do this, what changes to RPM-building procedures are needed to implement it?
Indexing and hyperlinking: What kind of indexes do we want to generate from a system's documentation database? How do we keep them up-to-date as packages are installed and uninstalled?
Backward Compatibility: Do we support old access methods like man(1) and info(1)? If so, how?
Most Linux software is still documented in man pages. There are over 10,000 manual pages in a full Fedora Core 2 installation. The next largest category with a single identifiable source format is info documents, around 550 of them. There are over 21,000 HTML pages under /usr/share, but quick inspection of the filenames reveals that (a) most are derived from several different master formats including DocBook, JavaDoc, and man pages, and (b) that number is inflated by the tendency of HTML generators to break single source documents into multiple small chunks. After paging through the list of names, I estimate that they are all derived from around 300 source documents (some very large). Thus, man pages still create over 90% of the conversion problem.
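The "over 90%" figure can be checked with a few lines of arithmetic. This is only a rough sketch using the approximate counts quoted above; it treats each count as a number of source documents and ignores flat-text documentation, which the paragraph does not count.

```python
# Approximate source-document counts quoted in the paragraph above.
man_pages = 10_000    # "over 10,000 manual pages"
info_docs = 550       # "around 550" info documents
html_sources = 300    # estimated masters behind the HTML pages

total = man_pages + info_docs + html_sources
man_share = man_pages / total

print(f"{man_share:.1%}")  # share of source documents that are man pages
```

With these round numbers the share comes out a little above 92%, consistent with the claim.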
I am making the Project Paradise proposal because I have a solution for lifting man pages to XML-DocBook. It is called doclifter, and it can also lift documents written with the mm, ms, me, and mdoc macro packages, as well as TkMan and pod2man-generated pages. It uses a combination of compiler technology and cliche recognition to analyze the presentation-level markup of troff and turn it into structural markup.
As an example, one of doclifter's rules is “If you see a word in a section named FILES that is set in italic and contains either a slash or a leading dot, replace the italic markup with a <filename> tag pair.” One consequence of this sort of cliche analysis is that doclifter generates better, more consistent HTML than tools that simply copy over presentation-level markup.
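The rule quoted above can be sketched in a few lines. This is not doclifter's actual code: the function names are hypothetical, and the sketch assumes the italic word is written with inline troff font escapes (`\fI` ... `\fR` or `\fP`) rather than with a macro such as `.I`.

```python
import re

# One italic word written with troff font escapes: \fIword\fR or \fIword\fP.
ITALIC_WORD = re.compile(r"\\fI(\S+?)\\f[RP]")

def looks_like_path(word):
    """The rule's test: the word contains a slash or begins with a dot."""
    return "/" in word or word.startswith(".")

def lift_filenames(section_name, line):
    """Apply the FILES-section rule to one line of man-page source."""
    if section_name != "FILES":
        return line  # the rule only fires inside a FILES section

    def replace(match):
        word = match.group(1)
        if looks_like_path(word):
            return f"<filename>{word}</filename>"
        return match.group(0)  # leave other italic words untouched

    return ITALIC_WORD.sub(replace, line)

print(lift_filenames("FILES", r"\fI/etc/passwd\fR holds \fIuser\fR records"))
```

The point of the sketch is that the decision depends on context (the section name) and on the shape of the word, not just on the font change, which is why the result is more consistent than a straight copy of presentation markup.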
On a stock Fedora Core 2 system, doclifter successfully converts over 96% of 10,897 manual pages to validated XML-DocBook. I have about 270 trivial patches for broken markup that push the clean-conversion rate to over 99%, leaving only about 80 pages that cannot be converted. Of these patches, about 40 have been accepted by upstream maintainers.
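As a sanity check, the percentages above can be reproduced from the page counts. The sketch below makes one assumption that is not stated in the text: each of the roughly 270 patches repairs exactly one page.

```python
total_pages = 10_897   # manual pages on the stock system
patches = 270          # "about 270" trivial markup patches
still_failing = 80     # "about 80" pages that cannot be converted

# ASSUMPTION (not from the text): one patch fixes exactly one page.
failing_before = patches + still_failing

rate_before = 1 - failing_before / total_pages
rate_after = 1 - still_failing / total_pages

print(f"before patches: {rate_before:.2%}")
print(f"after patches:  {rate_after:.2%}")
```

Under that assumption the unpatched rate lands just under 97% and the patched rate a little over 99%, which matches "over 96%" and "over 99%".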
I know of no tools for converting info pages to DocBook. But the same makeinfo(1) tool that is used to make info files from their Texinfo masters has options to render Texinfo to either HTML or DocBook.
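For the Texinfo case, the conversion amounts to calling makeinfo with a different output option. The helper below is a hypothetical sketch that only builds the command line; `example.texi` is a placeholder filename. It relies on makeinfo's `--docbook`, `--html`, `--no-split`, and `-o` options.

```python
from pathlib import Path

def makeinfo_command(texi_path, fmt="docbook"):
    """Build the argv list that renders a Texinfo master to DocBook or HTML."""
    suffix = {"docbook": ".xml", "html": ".html"}[fmt]
    output = Path(texi_path).with_suffix(suffix)
    # --no-split asks for a single output file rather than one file per node.
    return ["makeinfo", f"--{fmt}", "--no-split", "-o", str(output), str(texi_path)]

command = makeinfo_command("example.texi")  # placeholder input file
print(" ".join(command))

# To run the conversion for real (needs makeinfo installed and the file present):
#   import subprocess
#   subprocess.run(command, check=True)
```

Keeping the command construction separate from its execution makes it easy to log or inspect what a package build would run before wiring it into RPM scripts.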
I believe these automated conversions are now good enough to make lifting of over 95% of the Fedora Core documentation corpus into DocBook possible without manual intervention. Furthermore, I believe that with a little pressure on upstream maintainers we can push that percentage up to better than 99%, reducing the number of exception cases that require special treatment to a manageable small handful.