To achieve the CONTRIBUTOR-DRIVEN objective, submissions and updates will normally be done through a Web form with upload capability. Maintaining metadata will be the responsibility of each package's authors and maintainers.
The ENABLING objective implies that at least package resources (if not the metadata) should be directly accessible via FTP or the Web.
The LOCATION-INDEPENDENCE objective implies that all resource pointers in metadata are actually URLs.
The ENABLING and LOCATION-INDEPENDENCE objectives together require that the Trove data architecture have a clean separation between two parts: the catalog, a database holding package metadata, and the archive, a local FTP/Web tree holding some (but not necessarily all) of the resources pointed to by the catalog.
The ENABLING and PERFORMANCE objectives further imply that as much as possible of the catalog view should be available through unmediated Web and FTP access into the archive. This implies making HTML and plaintext versions of package metadata available in the archive, updated automatically when the master copy in the catalog database changes.
To achieve RICH METADATA, we must roughly capture RPM's annotation semantics. See the appendix on importing RPMs.
The NOTIFICATION objective implies that each package's metadata must include a mailing list, and that the interface must support subscription and unsubscription facilities.
The SCALABILITY requirement implies managing the metadata with a real database capable of handling high transaction volumes.
For the ENABLING and EMAIL and CRAWLER objectives, we must define a plain-text tag format for rendering metadata. We'll use this to (1) represent the metadata in FTP-accessible files in the archive, (2) define the required format for email submissions, and (3) define the required format for trusted remote metadata.
The plain-text tag format will come up again, so it needs a name: TRL, for Trove Request Language.
The foregoing objectives make it pretty clear what the general architecture of the system must be. A Trove site will consist of the following parts:
The structure of TRL, with an example, is discussed in the Appendix.
To reason about the design, we need to know what kinds of things will be in the Trove database and how they are named (e.g. what handles they can be retrieved by). Some of this has been touched on in the section on terminology.
There are three different kinds of objects in the Trove universe. These are:
All three kinds of resources are always explicitly created, modified, and deleted, with a notification to interested parties on each action.
The general policy on name validation is that references to unregistered people and packages are not errors. Thus, maintainers of a package need not be in the Person table as long as they have syntactically valid email addresses; and package relations may refer by name to packages that are not registered in Trove.
This implies that every creation of a Package or Person record needs a global check to mark references it suddenly fills, but that is an acceptable price for making the namespace open rather than closed.
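The open-namespace policy can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `Catalog` class, its `register` method, and the in-memory sets are hypothetical stand-ins, not the Trove schema, and the package names are just sample data. It shows a reference to an unregistered package being accepted, then marked as filled when that package is later created.

```python
from collections import defaultdict


class Catalog:
    """Toy in-memory catalog with an open namespace (hypothetical, not the Trove schema)."""

    def __init__(self):
        self.packages = set()             # names of registered packages
        self.dangling = defaultdict(set)  # unregistered name -> packages referring to it

    def register(self, name, related=()):
        """Register a package; return the packages whose references it now fills."""
        self.packages.add(name)
        # References to unregistered packages are accepted and simply remembered.
        for target in related:
            if target not in self.packages:
                self.dangling[target].add(name)
        # The global check: does this new record fill references made earlier?
        return sorted(self.dangling.pop(name, set()))


catalog = Catalog()
first = catalog.register("fetchmail", related=["procmail"])  # procmail not registered yet
second = catalog.register("procmail")                        # fills fetchmail's reference
```

Here `first` is empty because nothing referred to fetchmail beforehand, while `second` lists fetchmail as a package whose dangling reference has just been filled. A real implementation would presumably do the same lookup as a database query at record-creation time.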
Issue: We know that package names will be unique per site. Are they unique across all sites in the Trove ring? If not, how do we do synchronization when rings merge? And how do crawlers know which package they are responsible for?
The catalog will be stored in a database. The schema is available at the Trove website.
To make the rest of this document concrete, we need to specify an organization for the archive part. Here it is:
Each project has a directory. The name of the directory is the name of the project, without a version number (this is so project directories can contain multiple versions). Observe the implication that project names must be unique per Trove site.
Project directories may live directly under a per-site root, or (for performance) under superdirectories which express some kind of hash on the names. It is important for bare-FTP accessibility that this hash be easy for human beings to calculate by inspection. Example: terminfo's scheme of having each terminal type live in a superdirectory named after the first character of the terminal type name. Whether such a scheme is used, and what it is, is per-site policy.
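As a minimal sketch of the terminfo-style layout, the following function maps a project name to its directory. The function name `project_dir`, the `hashed` switch, and the `/pub/trove` root are assumptions made for illustration; the actual scheme is per-site policy.

```python
from pathlib import PurePosixPath


def project_dir(root, project, hashed=True):
    """Return the archive directory for a project name (illustrative layout only)."""
    if not project:
        raise ValueError("project name must be non-empty")
    base = PurePosixPath(root)
    if hashed:
        # terminfo-style hash: superdirectory named after the first character
        return base / project[0] / project
    # No hashing: project directories live directly under the site root.
    return base / project


hashed_path = str(project_dir("/pub/trove", "fetchmail"))
flat_path = str(project_dir("/pub/trove", "fetchmail", hashed=False))
```

With these inputs `hashed_path` is `/pub/trove/f/fetchmail` and `flat_path` is `/pub/trove/fetchmail`. A first-character hash is something a person browsing by bare FTP can work out by inspection, which is the property the text asks for.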
Within each project's directory live all its associated local resources. Other resources may live offsite (the catalog records don't care; they use URIs for everything). The directory will also contain TRL and HTML versions of the package's metadata, as files named %%INDEX.TRL and index.html respectively. The former name is chosen to sort as early as possible in an FTP directory listing without including Unix shell metacharacters; the latter, to be the page automatically displayed by a browser pointed at the directory.
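The regeneration of the two index files might look roughly like the sketch below. Everything except the two file names is an assumption: the function `write_index_files`, the metadata field names, and in particular the plaintext body, which is a placeholder "Key: value" rendering and not real TRL (whose structure is covered in the Appendix).

```python
import html
import tempfile
from pathlib import Path


def write_index_files(project_dir, metadata):
    """Regenerate the plaintext and HTML views of one package's metadata."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder plaintext rendering; this is NOT real TRL syntax.
    text = "".join(f"{key}: {value}\n" for key, value in metadata.items())
    (project_dir / "%%INDEX.TRL").write_text(text, encoding="utf-8")

    # Minimal HTML view; values are escaped so metadata cannot inject markup.
    rows = "".join(
        f"<dt>{html.escape(key)}</dt><dd>{html.escape(value)}</dd>\n"
        for key, value in metadata.items()
    )
    page = f"<html><body><dl>\n{rows}</dl></body></html>\n"
    (project_dir / "index.html").write_text(page, encoding="utf-8")


# Usage: write into a throwaway directory and inspect the result.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "f" / "fetchmail"
    write_index_files(target, {"Name": "fetchmail", "Home": "http://example.org/a&b"})
    written = sorted(p.name for p in target.iterdir())
    html_page = (target / "index.html").read_text(encoding="utf-8")
```

In a real site this function would be triggered whenever the master copy in the catalog database changes, so that the archive views never lag behind the catalog.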
The librarian will be a set of HTML pages and CGIs that mediate between users (including uploaders and maintainers) and the library.
It will be necessary for the librarian to maintain state through multiple-form transactions. For discussion of the librarian design, see the major section on user interface design below.
The email robot and the crawler will be programs that, essentially, translate metadata submissions in TRL into actions on the archive. The only difference between them will be that the email robot waits for input fed to it through a mail alias, while the crawler looks for descriptions in remote locations specified by metadata URIs.
In both cases, a parse error, package name collision, or other exception will generate email to the submitting party and to the contact persons given in both the new and old metadata.
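The notification rule can be sketched as a small helper that works out who should receive the error mail. The function name `exception_recipients` and the `contacts` metadata key are hypothetical; the real field names depend on the catalog schema and on TRL.

```python
def exception_recipients(submitter, new_meta=None, old_meta=None):
    """Collect everyone to notify about a failed submission, without duplicates."""
    recipients = [submitter]
    for meta in (new_meta, old_meta):
        # Either record may be missing: a parse error can leave no usable new
        # metadata, and a first submission has no old metadata.
        if meta:
            recipients.extend(meta.get("contacts", []))
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(recipients))


recipients = exception_recipients(
    "alice@example.org",
    new_meta={"contacts": ["bob@example.org", "alice@example.org"]},
    old_meta={"contacts": ["carol@example.org", "bob@example.org"]},
)
```

Here `recipients` comes out as alice, bob, carol, each listed once. Including the contacts from the old metadata means the existing maintainers hear about a colliding or malformed submission even when it came from someone else.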
What do we use as the database back end? Postgres95? SOLID? MySQL? Something else?