Much of what constitutes best practice in the open-source community is a natural adaptation to distributed development; you'll read a lot in the rest of this chapter about behaviors that maintain good communication with other developers. Where Unix conventions are arbitrary (such as the standard names of files that convey metainformation about a source distribution) they often trace back either to Usenet in the early 1980s, or to the conventions and standards of the GNU project.
Most people become involved in open-source software by writing patches for other people's software before releasing projects of their own. Suppose you've written a set of source-code changes for someone else's baseline code. Now put yourself in that person's shoes. How is he to judge whether to include the patch?
It is very difficult to judge the quality of code, so developers tend to evaluate patches by the quality of the submission. They look for clues in the submitter's style and communications behavior instead — indications that the person has been in their shoes and understands what it's like to have to evaluate and merge an incoming patch.
This is actually a rather reliable proxy for code quality. In many years of dealing with patches from many hundreds of strangers, I have only seldom seen a patch that was thoughtfully presented and respectful of my time but technically bogus. On the other hand, experience teaches that patches which look careless or are packaged in a lazy and inconsiderate way are very likely to actually be bogus.
Here are some tips on how to get your patch accepted:
If your change includes a new file that doesn't exist in the code, then of course you have to send the whole file. But if you're modifying an already-existing file, don't send the whole file. Send a diff instead; specifically, send the output of the diff(1) command run to compare the baseline distributed version against your modified version.
The diff(1) command and its dual, patch(1), are the most basic tools of open-source development. Diffs are better than whole files because the developer you're sending a patch to may have changed the baseline version since you got your copy. By sending him a diff you save him the effort of separating your changes from his; you show respect for his time.
It is both counterproductive and rude to send a maintainer patches against the code as it existed several releases ago, and expect him to do all the work of determining which changes duplicate things he has since done, versus which things are actually novel in your patch.
As a patch submitter, it is your responsibility to track the state of the source and send the maintainer a minimal patch that expresses what you want done to the main-line codebase. That means sending a patch against the current version.
Before you send your patch, walk through it and delete any patch bands for files in it that are going to be automatically regenerated once the maintainer applies the patch and remakes. The classic examples of this error are C files generated by Bison or Flex.
These days the most common mistake of this kind is sending a diff with a huge band that is nothing but changebars between your configure script and the maintainer's. This file is generated by autoconf.
This is inconsiderate. It means your recipient is put to the trouble of separating the real content of the patch from a lot of bulky noise. It's a minor error, not as important as some of the things we'll get to further on — but it will count against you.
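As a rough illustration of the cleanup step described above, here is a minimal Python sketch that drops the bands for regenerated files from a patch before you send it. The function name `strip_generated` and the `GENERATED` list are hypothetical, chosen for this example. The sketch assumes plain `diff -u` output in which each band opens with a `---` line immediately followed by a `+++` line. It does not handle context diffs, and it could be fooled by a hunk whose changed text itself happens to produce that two-line pattern.

```python
# Hypothetical list of files that the build regenerates; adjust per project.
GENERATED = {"configure", "y.tab.c", "lex.yy.c"}


def strip_generated(patch_text, generated=GENERATED):
    """Drop every per-file band whose target is a regenerated file.

    Assumes a unified diff in which each band starts with a '--- ' line
    immediately followed by a '+++ ' line.
    """
    kept = []
    keep = True
    lines = patch_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        starts_band = (
            line.startswith("--- ")
            and i + 1 < len(lines)
            and lines[i + 1].startswith("+++ ")
        )
        if starts_band:
            # The '+++' header names the new file; ignore a tab-separated timestamp.
            target = lines[i + 1][4:].split("\t")[0].strip()
            keep = target.rsplit("/", 1)[-1] not in generated
        if keep:
            kept.append(line)
    return "".join(kept)
```

Read the result before mailing it; a filter like this is a convenience, not a substitute for walking through the patch yourself.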
Some people put special tokens in their source files that are expanded by the version-control system when the file is checked in: the $Id$ construct used by RCS and CVS, for example.
If you're using a local version-control system yourself, your changes may alter these tokens. This isn't really harmful, because when your recipient checks his code back in after applying your patch the tokens will be re-expanded in accordance with the maintainer's version-control status. But those extra patch bands are noise. They're distracting. It's more considerate not to send them.
This is another minor error. You'll be forgiven for it if you get the big things right. But you want to avoid it anyway.
The default output format of diff(1) and the ed-script format produced by -e are both very brittle. Neither includes any context, so the patch tool can't cope if any lines have been inserted or deleted in the baseline code since you took the copy you modified. Send context diffs (-c) or unified diffs (-u) instead.
Getting an -e diff is annoying, and suggests that the sender is either an extreme newbie, careless, or clueless. Most such patches get tossed out without a second thought.
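To see what the context buys you, here is a small sketch that produces a unified diff using Python's standard difflib module as a stand-in for `diff -u`. The file names and contents are invented for the example.

```python
import difflib

# Two versions of a tiny source file, as lists of lines.
old = ["int main(void)\n", "{\n", "    return 1;\n", "}\n"]
new = ["int main(void)\n", "{\n", "    return 0;\n", "}\n"]

# unified_diff yields the header lines, the @@ hunk marker, and the
# changed lines surrounded by unchanged context lines.
patch = "".join(
    difflib.unified_diff(old, new, fromfile="foo-1.0/main.c", tofile="foo-1.1/main.c")
)
print(patch)
```

The unchanged lines on either side of the change are what allow patch(1) to locate the right spot even after the baseline has shifted by a few lines.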
This is very important. If your patch makes a user-visible addition or change to the software's features, include changes to the appropriate man pages and other documentation files in your patch. Do not assume that the recipient will be happy to document your code for you, or to have undocumented features lurking in the code.
Documenting your changes well demonstrates some good things. First, it's considerate to the person you are trying to persuade. Second, it shows that you understand the ramifications of your change well enough to explain it to somebody who can't see the code. Third, it demonstrates that you care about the people who will ultimately use the software.
Good documentation is usually the most visible sign of what separates a solid contribution from a quick and dirty hack. If you take the time and care necessary to produce it, you'll find you're already 85% of the way to having your patch accepted by most developers.
Your patch should include cover notes explaining why you think the patch is necessary or useful. This is explanation directed not to the users of the software but to the maintainer to whom you are sending the patch.
The note can be short — in fact, some of the most effective cover notes I've ever seen just said “See the documentation updates in this patch”. But it should show the right attitude.
The right attitude is helpful, respectful of the maintainer's time, quietly confident but unassuming. It's good to display understanding of the code you're patching. It's good to show that you can identify with the maintainer's problems. It's also good to be up front about any risks you perceive in applying the patch. Here are some examples of the sorts of explanatory comments that experienced developers send:
“I've seen two problems with this code, X and Y. I fixed problem X, but I didn't try addressing problem Y because I don't think I understand the part of the code that I believe is involved”.
“Fixed a core dump that can happen when one of the foo inputs is too long. While I was at it, I went looking for similar overflows elsewhere. I found a possible one in blarg.c, near line 666. Are you sure the sender can't generate more than 80 characters per transmission?”
“Have you considered using the Foonly algorithm for this problem? There is a good implementation at <http://www.example.com/~jsmith/foonly.html>”.
“This patch solves the immediate problem, but I realize it complicates the memory allocation in an unpleasant way. Works for me, but you should probably test it under heavy load before shipping”.
“This may be featuritis, but I'm sending it anyway. Maybe you'll know a cleaner way to implement the feature”.
A maintainer will want to have strong confidence that he understands your changes before merging them in. This isn't an invariable rule; if you have a track record of good work with the maintainer, he may just run a casual eye over the changes before checking them in semiautomatically. But everything you can do to help him understand your code and decrease his uncertainty increases your chances that your patch will be accepted.
Good comments in your code help the maintainer understand it. Bad comments don't.
Here's an example of a bad comment:
/* norman newbie fixed this 13 Aug 2001 */
This conveys no information. It's nothing but a muddy territorial bootprint you're planting in the middle of the maintainer's code. If he takes your patch (which you've made less likely) he will almost certainly strip out this comment. If you want a credit, include a patch band for the project NEWS or HISTORY file. He's more likely to take that.
Here's an example of a good comment:
/*
 * This conditional needs to be guarded so that crunch_data() never
 * gets passed a NULL pointer. <norman_newbie@foosite.com>
 */
This comment shows that you understand not only the maintainer's code but also the kind of information he needs in order to have confidence in your changes.
There are lots of reasons a patch can be rejected that don't reflect on you. Remember that most maintainers are under heavy time pressure, and have to be conservative in what they accept lest the project code get broken. Sometimes resubmitting with improvements will help. Sometimes it won't. Life is hard.
As the load on maintainers of archives like ibiblio, SourceForge, and CPAN increases, there is an increasing trend for submissions to be processed partly or wholly by programs (rather than entirely by a human).
This makes it more important for project and archive-file names to fit regular patterns that computer programs can parse and understand.
It's helpful to everybody if your archive files all have GNU-like names — all-lower-case alphanumeric stem prefix, followed by a hyphen, followed by a version number, extension, and other suffixes.
A good general form of name has these parts in order:
project prefix
dash
version number
dot
“src” or “bin” (optional)
dot or dash (dot preferred)
binary type and options (optional)
archiving and compression extensions
Name stems in this style can contain hyphens or underscores to separate syllables; dashes are actually preferred. It is good practice to group related projects by giving the stems a common hyphen-terminated prefix.
Let's suppose you have a project you call ‘foobar’ at major version 1, minor version or release 2, patchlevel 3. If it's got just one archive part (presumably the sources), here's what its names should look like:
foobar-1.2.3.tar.gz: The source archive.
foobar.lsm: The LSM file (assuming you're submitting to ibiblio).
Please don't use names like these:
foobar123.tar.gz: This looks to many programs like an archive for a project called “foobar123” with no version number.
foobar1.2.3.tar.gz: This looks to many programs like an archive for a project called “foobar1” at version 2.3.
foobar-v1.2.3.tar.gz: Many programs think this goes with a project called “foobar-v1”.
foo_bar-1.2.3.tar.gz: The underscore is hard for people to speak, type, and remember.
FooBar-1.2.3.tar.gz: Unless you like looking like a marketing weenie. This is also hard for people to speak, type, and remember.
If you have to differentiate between source and binary archives, or between different kinds of binary, or express some kind of build option in the file name, please treat that as a file extension to go after the version number. That is, please do this:
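foobar-1.2.3.src.tar.gz: Sources.
foobar-1.2.3.bin.tar.gz: Binaries, type not specified.
foobar-1.2.3.bin.i386.tar.gz: i386 binaries.
foobar-1.2.3.bin.i386.static.tar.gz: i386 binaries statically linked.
foobar-1.2.3.bin.SPARC.tar.gz: SPARC binaries.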
Please don't use names like ‘foobar-i386-1.2.3.tar.gz’, because programs have a hard time telling type infixes (like ‘-i386’) from the stem.
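The kind of program that processes submissions might parse names along these lines. This is a minimal sketch, not the code any real archive site runs; the pattern `ARCHIVE_NAME` and the function `parse_archive_name` are invented for the example. It is also stricter than the guidelines on two points, by assumption: it rejects underscores in the stem, and it recognizes only a handful of archive extensions.

```python
import re

# Parts appear in the order given in the text. Every name here is hypothetical.
ARCHIVE_NAME = re.compile(
    r"^(?P<stem>[a-z][a-z0-9]*(?:-[a-z][a-z0-9]*)*)"    # project prefix
    r"-(?P<version>\d+(?:\.\d+)*)"                       # dash, version number
    r"(?:\.(?P<kind>src|bin))?"                          # optional "src" or "bin"
    r"(?P<options>(?:[.-][A-Za-z][A-Za-z0-9]*)*?)"       # optional binary type and options
    r"(?P<ext>\.(?:tar(?:\.(?:gz|bz2|Z))?|tgz|zip))$"    # archiving and compression extensions
)


def parse_archive_name(name):
    """Return the parts of a well-formed archive name as a dict, or None."""
    match = ARCHIVE_NAME.match(name)
    return match.groupdict() if match else None


for candidate in (
    "foobar-1.2.3.tar.gz",
    "foobar-1.2.3.bin.i386.static.tar.gz",
    "foobar123.tar.gz",
    "foobar-i386-1.2.3.tar.gz",
):
    print(candidate, "->", parse_archive_name(candidate))
```

Note what happens to the last candidate: the parser accepts it, but reports the stem as `foobar-i386`. Nothing in the name tells a program that `-i386` was meant as a type infix rather than as part of the project name.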
The convention for distinguishing major from minor release is simple: you increment the patch level for fixes or minor features, the minor version number for compatible new features, and the major version number when you make incompatible changes.
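The rule is mechanical enough to write down as code. In this sketch the function name `bump` and the three change labels are invented for the example. Resetting the lower-order fields to zero is a common practice that the rule above does not itself state, so treat that part as an assumption.

```python
def bump(version, change):
    """Return the next version string for a given kind of change.

    change is one of "fix", "feature", or "incompatible" (labels are
    hypothetical). Lower-order fields are reset to zero by assumption.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "incompatible":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown kind of change: {change!r}")


print(bump("1.2.3", "fix"))           # 1.2.4
print(bump("1.2.3", "feature"))       # 1.3.0
print(bump("1.2.3", "incompatible"))  # 2.0.0
```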
Some projects and communities have well-defined conventions for names and version numbers that aren't necessarily compatible with the above advice. For instance, Apache modules are generally named like mod_foo, and have both their own version number and the version of Apache with which they work. Likewise, Perl modules have version numbers that can be treated as floating point numbers (e.g., you might see 1.303 rather than 1.3.3), and the distributions are generally named Foo-Bar-1.303.tar.gz for version 1.303 of module Foo::Bar. (Perl itself, on the other hand, switched to using the conventions described here in late 1999.)
Look for and respect the conventions of specialized communities and developers; for general use, follow the above guidelines.
The stem prefix should be common to all of a project's files, and it should be easy to read, type, and remember. So please don't use underscores. And don't capitalize or BiCapitalize without extremely good reason — it messes up the natural human-eyeball search order and looks like some marketing weenie trying to be clever.
It confuses people when two different projects have the same stem name. So try to check for collisions before your first release. Two good places to check are the index file of ibiblio and the application index at Freshmeat. Another good place to check is SourceForge; do a name search there.
Here are some of the behaviors that can make the difference between a successful project with lots of contributors and one that stalls out after attracting no interest:
Don't rely on proprietary languages, libraries, or other code. Doing so is risky business at the best of times; in the open-source community, it is considered downright rude. Open-source developers don't trust code for which they can't review the source.
Configuration choices should be made at compile time. A significant advantage of open-source distributions is that they allow the package to adapt at compile-time to the environment it finds. This is critical because it allows the package to run on platforms its developers have never seen, and it allows the software's community of users to do their own ports. Only the largest of development teams can afford to buy all the hardware and hire enough employees to support even a limited number of platforms.
Therefore: Use the GNU autotools to handle portability issues, do system-configuration probes, and tailor your makefiles. People building from sources today expect to be able to type configure; make; make install and get a clean build — and rightly so. There is a good tutorial on these tools.
autoconf and autoheader are mature. automake, as we've previously noted, is still buggy and brittle as of mid-2003; you may have to maintain your own Makefile.in. Fortunately it's the least important of the autotools.
Regardless of your approach to configuration, do not ask the user for system information at compile-time. The user installing the package does not know the answers to your questions, and this approach is doomed from the start. The software must be able to determine for itself any information that it may need at compile- or install-time.
But autoconf should not be regarded as a license for knob-ridden designs. If at all possible, program to standards like POSIX and refrain also from asking the system for configuration information. Keep ifdefs to a minimum — or, better yet, have none at all.
A good test suite allows the team to easily run regression tests before releases. Create a strong, usable test framework so that you can incrementally add tests to your software without having to train developers in the specialized intricacies of the test suite.
Distributing the test suite allows the community of users to test their ports before contributing them back to the group.
Encourage your developers to use a wide variety of platforms as their desktop and test machines, so that code is continuously being tested for portability flaws as part of normal development.
It is good practice, and encourages confidence in your code, when it ships with the test suite you use, and that test suite can be run with make test.
By “sanity check” we mean: use every tool available that has a reasonable chance of catching errors a human would be prone to overlook. The more of these you catch with tools, the fewer your users and you will have to contend with.
If you're writing C/C++ using GCC, test-compile with -Wall and clean up all warning messages before each release. Compile your code with every compiler you can find — different compilers often find different problems. Specifically, compile your software on a true 64-bit machine. Underlying datatypes can change on 64-bit machines, and you will often find new problems there. Find a Unix vendor's system and run the lint utility over your software.
Run tools that look for memory leaks and other runtime errors; Electric Fence and Valgrind are two good ones available in open source.
For Python projects, the PyChecker program can be a useful check. It often catches nontrivial errors.
If you're writing Perl, check your code with perl -c (and maybe -T, if applicable). Use perl -w and 'use strict' religiously. (See the Perl documentation for further discussion.)
Spell-check your documentation, README files and error messages in your software. Sloppy code, code that produces warning messages when compiled, and spelling errors in README files or error messages, all lead users to believe the engineering behind it is also haphazard and sloppy.
If you are writing C, feel free to use the full ANSI features. Specifically, do use function prototypes, which will help you spot cross-module inconsistencies. The old-style K&R compilers are ancient history.
Do not assume compiler-specific features such as the GCC -pipe option or nested functions are available. These will come around and bite you the second somebody ports to a non-Linux, non-GCC system.
Code required for portability should be isolated to a single area and a single set of source files (for example, an os subdirectory). Compiler, library and operating system interfaces with portability issues should be abstracted to files in this directory.
A portability layer is a library (or perhaps just a set of macros in header files) that abstracts away just the parts of an operating system's API your program is interested in. Portability layers make it easier to do new software ports. Often, no member of the development team knows the porting platform (for example, there are literally hundreds of different embedded operating systems, and nobody knows any significant fraction of them). By creating a separate portability layer, it becomes possible for a specialist who knows a platform to port your software without having to understand anything outside the portability layer.
Portability layers also simplify applications. Software rarely needs the full functionality of more complex system calls such as mmap(2) or stat(2), and programmers commonly configure such complex interfaces incorrectly. A portability layer with abstracted interfaces (say, something named __file_exists instead of a call to stat(2)) allows you to import only the limited, necessary functionality from the system, simplifying the code in your application.
Always write your portability layer to select based on a feature, never based on a platform. Trying to create a separate portability layer for each supported platform results in a multiple-update maintenance nightmare. A “platform” is always selected on at least two axes: the compiler and the library/operating system release. In some cases there are three axes, as when Linux vendors select a C library independently of the operating system release. With M vendors, N compilers, and O operating system releases, the number of platforms quickly scales out of reach of any but the largest development teams. On the other hand, by using language and systems standards such as ANSI and POSIX 1003.1, the set of features is relatively constrained.
Portability choices can be made along either lines of code or compiled files. It doesn't make a difference if you select alternate lines of code on a platform, or one of a few different files. A rule of thumb is to move portability code for different platforms into separate files when the implementations diverge significantly (shared memory mapping on Unix vs. Windows), and leave portability code in a single file when the differences are minimal (for example, whether you're using gettimeofday, clock_gettime, ftime or time to find out the current time-of-day).
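The advice above is framed in terms of C, but the shape of a feature-selected portability layer can be sketched in any language. The following transposes it into Python purely for illustration; the names `current_time` and `file_exists` are hypothetical. The selection asks whether a feature is present and never inspects the platform name.

```python
import os
import time

# Select on the feature (is clock_gettime available?), never on the platform.
if hasattr(time, "clock_gettime") and hasattr(time, "CLOCK_REALTIME"):
    def current_time():
        """Seconds since the epoch, from clock_gettime where it exists."""
        return time.clock_gettime(time.CLOCK_REALTIME)
else:
    def current_time():
        """Seconds since the epoch, from the always-available fallback."""
        return time.time()


def file_exists(path):
    """Expose only the narrow question the application asks of stat(2)."""
    try:
        os.stat(path)
    except OSError:
        return False
    return True
```

Application code calls `current_time()` and `file_exists()` and never learns which underlying interface answered. A porter who knows a new platform only has to make these few functions work there.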
For anywhere outside a portability layer, heed this advice:
Use of #ifdef and #if is permissible (if well controlled) within a portability layer. Outside it, try hard to confine these to conditionalizing #includes based on feature symbols.
Never intrude on the namespace of any other part of the system, including filenames, error return values and function names. Where the namespace is shared, document the portion of the namespace that you use.
Choose a coding standard. The debate over the choice of standard can go on forever — regardless, it is too difficult and expensive to maintain software built using multiple coding standards, and so some common style must be chosen. Enforce your coding standard ruthlessly, as consistency and cleanliness of the code are of the highest priority; the details of the coding standard itself are a distant second.
These guidelines describe how your distribution should look when someone downloads, retrieves and unpacks it.
The single most annoying mistake fledgling contributors make is to build tarballs that unpack the files and directories in the distribution into the current directory, potentially overwriting files already located there. Never do this!
Instead, make sure your archive files all have a common directory part named after the project, so they will unpack into a single top-level directory directly beneath the current one. Conventionally, the name of the directory should be the same as the stem of the tarball's name. So, for example, a tarball named foo-0.23.tar.gz is expected to unpack into a subdirectory named foo-0.23.
Example 19.1 shows a makefile trick that, assuming your distribution directory is named “foobar” and SRC contains a list of your distribution files, accomplishes this.
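The same effect can be had outside of make. What follows is not the makefile trick itself but a rough Python equivalent using the standard tarfile module; the function name `make_dist` and its parameters are invented for the example, and `files` plays the role of SRC. It assumes the listed paths are relative to the current directory.

```python
import os
import tarfile


def make_dist(project, version, files, out_dir="."):
    """Write project-version.tar.gz whose members all live under project-version/.

    Assumes every entry in files is a path relative to the current directory.
    """
    top = f"{project}-{version}"
    archive = os.path.join(out_dir, top + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            # arcname rewrites the stored name, so nothing unpacks
            # directly into the recipient's current directory.
            tar.add(path, arcname=f"{top}/{path}")
    return archive
```

Calling `make_dist("foobar", "1.2.3", ["README", "main.c"])` would write foobar-1.2.3.tar.gz, and unpacking it creates a single foobar-1.2.3 directory holding both files.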
Include a file called README that is a roadmap of your source distribution. By ancient convention (originating with Dennis Ritchie himself before 1980, and promulgated on Usenet in the early 1980s), this is the first file intrepid explorers will read after unpacking the source.
README files should be short and easy to read. Make yours an introduction, not an epic. Good things to have in the README include the following:
A brief description of the project.
A pointer to the project website (if it has one).
Notes on the developer's build environment and potential portability problems.
A roadmap describing important files and subdirectories.
Either build/installation instructions or a pointer to a file containing same (usually INSTALL).
Either a maintainers/credits list or a pointer to a file containing same (usually CREDITS).
Either recent project news or a pointer to a file containing same (usually NEWS).
Project mailing list addresses.
At one time this file was commonly READ.ME, but this interacts badly with browsers, which are all too likely to assume that the .ME suffix means it's not textual and can only be downloaded rather than browsed. This usage is deprecated.
Before even looking at the README, your intrepid explorer will have scanned the filenames in the top-level directory of your unpacked distribution. Those names can themselves convey information. By adhering to certain standard naming practices, you can give the explorer valuable clues about where to look next.
Here are some standard top-level file names and what they mean. Not every distribution needs all of these.
README: The roadmap file, to be read first.
INSTALL: Configuration, build, and installation instructions.
AUTHORS: List of project contributors (GNU convention).
NEWS: Recent project news.
HISTORY: Project history.
CHANGES: Log of significant changes between revisions.
COPYING: Project license terms (GNU convention).
LICENSE: Project license terms.
FAQ: Plain-text Frequently-Asked-Questions document for the project.
Note the overall convention that filenames with all-caps names are human-readable metainformation about the package, rather than build components. This elaboration of the README was developed early on at the Free Software Foundation.
Having a FAQ file can save you a lot of grief. When a question about the project comes up often, put it in the FAQ; then direct users to read the FAQ before sending questions or bug reports. A well-nurtured FAQ can decrease the support burden on the project maintainers by an order of magnitude or more.
Having a HISTORY or NEWS file with timestamps in it for each release is valuable. Among other things, it may help establish prior art if you are ever hit with a patent-infringement lawsuit (this hasn't happened to anyone yet, but best to be prepared).
Your software will change over time as you put out new releases. Some of these changes will not be backward-compatible. Accordingly, you should give serious thought to designing your installation layouts so that multiple installed versions of your code can coexist on the same system. This is especially important for libraries — you can't count on all your client programs to upgrade in lockstep with your API changes.
The Emacs, Python, and Qt projects have a good convention for handling this: version-numbered directories (another practice that seems to have been made routine by the FSF). Here's how an installed Qt library hierarchy looks (${ver} is the version number):
/usr/lib/qt
/usr/lib/qt-${ver}
/usr/lib/qt-${ver}/bin       # Where you find moc
/usr/lib/qt-${ver}/lib       # Where you find .so
/usr/lib/qt-${ver}/include   # Where you find header files
With this organization, multiple versions can coexist. Client programs have to specify the library version they want, but that's a small price to pay for not having the interfaces break on them. This good practice avoids the notorious “DLL Hell” failure mode of Windows.
The de facto standard format for installable binary packages under Linux is that used by the Red Hat Package Manager, RPM. It's featured in the most popular Linux distribution, and supported by effectively all other Linux distributions (except Debian and Slackware; and Debian can install from RPMs). Accordingly, it's a good idea for your project site to provide installable RPMs as well as source tarballs.
It's also a good idea for you to include in your source tarball the RPM spec file, with a production that makes RPMs from it in your makefile. The spec file should have the extension .spec; that's how the rpm -t option finds it in a tarball.
For extra style points, generate your spec file with a shellscript that automatically plugs in the correct version number by analyzing the project makefile or a version.h.
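The text suggests a shellscript for this; the same idea is sketched here in Python instead. Everything specific is an assumption made for the example: that version.h defines a string macro named `VERSION`, that the template marks the substitution point with `@VERSION@`, and that the project is called foobar. The template is only a fragment; a real spec file needs considerably more than these four tags.

```python
import re


def read_version(header_text, macro="VERSION"):
    """Find a line like: #define VERSION "1.2.3" (macro name is an assumption)."""
    pattern = r'^\s*#\s*define\s+' + re.escape(macro) + r'\s+"([^"]+)"'
    match = re.search(pattern, header_text, re.MULTILINE)
    if match is None:
        raise ValueError(f"no {macro} definition found")
    return match.group(1)


# A deliberately incomplete spec fragment with a hypothetical placeholder.
SPEC_TEMPLATE = """\
Name: foobar
Version: @VERSION@
Release: 1
Source: foobar-@VERSION@.tar.gz
"""


def render_spec(header_text, template=SPEC_TEMPLATE):
    """Plug the version found in the header into the spec template."""
    return template.replace("@VERSION@", read_version(header_text))


print(render_spec('#define VERSION "1.2.3"\n'))
```

The point of generating the spec this way is that the version number lives in exactly one place, so the tarball name, the spec file, and the program's own idea of its version cannot drift apart.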
Note: If you supply source RPMs, use BuildRoot to make the program be built in /tmp or /var/tmp. If you don't, during the course of running the make install part of your build, the install will install the files in the real final places. This will happen even if there are file collisions, and even if you didn't want to install the package at all. When you're done, the files will have been installed and your system's RPM database will not know about it. Such badly behaved SRPMs are a minefield and should be eschewed.
Provide checksums with your binaries (tarballs, RPMs, etc.). This will allow people to verify that they haven't been corrupted or had Trojan-horse code inserted in them.
While there are several commands you can use for this purpose (such as sum and cksum) it is best to use a cryptographically-secure hash function. The GPG package provides this capability via the --detach-sign option; so does the GNU command md5sum.
For each binary you ship, your project Web page should list the checksum and the command you used to generate it.
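If you would rather compute the checksums from a release script than by hand, something like the following sketch would do. It uses Python's standard hashlib module; the function name `file_digest` is invented for the example, and the choice of SHA-256 as the default is an assumption of this sketch rather than something the text prescribes.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory use flat even for large tarballs.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Whatever you use, publish the name of the algorithm alongside the digest so that people can reproduce the check with their own tools.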
Your software and documentation won't do the world much good if nobody but you knows they exist. Also, developing a visible presence for the project on the Internet will assist you in recruiting users and co-developers. Here are the standard ways to do that.
Announce to Freshmeat. Besides being widely read itself, this site is a major feeder for Web-based technical news channels.
Never assume the audience has been reading your release announcements since the beginning of time. Always include at least a one-line description of what the software does. Bad example: “Announcing the latest release of FooEditor, now with themes and ten times faster”. Good example: “Announcing the latest release of FooEditor, the scriptable editor for touch-typists, now with themes and ten times faster”.
Find a Usenet topic group directly relevant to your application, and announce there as well. Post only where the function of the code is relevant, and exercise restraint.
If (for example) you are releasing a program written in Perl that queries IMAP servers, you should certainly post to comp.mail.imap. But you should probably not post to comp.lang.perl unless the program is also an instructive example of cutting-edge Perl techniques.
Your announcement should include the URL of a project website.
If you intend trying to build any substantial user or developer community around your project, it should have a website. Standard things to have on the website include:
The project charter (why it exists, who the audience is, etc.).
Download links for the project sources.
Instructions on how to join the project mailing list(s).
A FAQ (Frequently Asked Questions) list.
HTMLized versions of the project documentation.
Links to related and/or competing projects.
Refer to the website examples in Chapter 16 for examples of what a well-educated project website looks like.
An easy way to have a website is to put your project on one of the sites that specializes in providing free hosting. In 2003 the two most important of these are SourceForge (which is a demonstration and test site for proprietary collaboration tools) and Savannah (which hosts open-source projects as an ideological statement).
It's standard practice to have a private development list through which project collaborators can communicate and exchange patches. You may also want to have an announcements list for people who want to be kept informed of the project's progress.
If you are running a project named ‘foo’, your developer list might be <foo-dev> or <foo-friends>; your announcement list might be <foo-announce>.
An important decision is just how private the “private” development list is. Wider participation in design discussions is often a good thing, but if the list is relatively open, sooner or later you will get people asking new-user questions on it. Opinions vary on how best to solve this problem. Just having the documentation tell the new users not to ask elementary questions on the development list is not a solution; such a request must be enforced somehow.
An announcements list needs to be tightly controlled. Traffic should be at most a few messages a month; the whole purpose of such a list is to accommodate people who want to know when something important happens, but don't want to hear about day-to-day details. Most such people will quickly unsubscribe if the list starts generating significant clutter in their mailboxes.
See the section Where Should I Look? in Chapter 16 for specifics on the major open-source archive sites. You should release your package to these.
Other important locations include:
The Python Software Activity site (for software written in Python).
The CPAN, the Comprehensive Perl Archive Network (for software written in Perl).