Trove Design Document: Data Formats

6.1 Text Field Rules

Description fields are interpreted according to the following rules:

Text is plain text. Paragraphs are separated by one or more blank lines. No HTML tags are recognized; >, & and < mean themselves. Normal paragraphs are word-filled. Indented text is treated as-is and converted to <PRE>...</PRE> in HTML (tabs should be expanded to spaces here). A single word between *asterisks* is rendered as <b>boldface</b> and a single word in _underscores_ is rendered as <i>italics</i> (even in indented text). Any text that looks sufficiently like a URL (e.g. http://www.python.org) is turned into a hyperlink with an <A...>...</A> tag pair (even in indented text).
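The rules above can be sketched as a small renderer. This is only an illustration, not a specification: the function name render_description, the regular expressions for "a single word" and "looks sufficiently like a URL", the use of <P> for normal paragraphs, and the treatment of a block as indented only when every line in it starts with whitespace are all assumptions that the rules leave open.

```python
import html
import re

# Assumed patterns; the rules only say "a single word" and "looks like a URL".
BOLD = re.compile(r"(?<!\w)\*(\w+)\*(?!\w)")
ITALIC = re.compile(r"(?<!\w)_([^\W_]+)_(?!\w)")
URL = re.compile(r"\b(?:https?|ftp)://[^\s<>\"]+")


def _emphasis(text):
    """Escape >, & and < so they mean themselves, then mark up emphasis."""
    text = html.escape(text, quote=False)
    text = BOLD.sub(r"<b>\1</b>", text)
    return ITALIC.sub(r"<i>\1</i>", text)


def _inline(text):
    """Turn URLs into hyperlinks; apply emphasis to everything else."""
    out, pos = [], 0
    for m in URL.finditer(text):
        # Assumption: trailing punctuation belongs to the sentence, not the URL.
        url = m.group(0).rstrip(".,;:!?)")
        out.append(_emphasis(text[pos:m.start()]))
        href = html.escape(url, quote=True)
        out.append(f'<A HREF="{href}">{href}</A>')
        pos = m.start() + len(url)
    out.append(_emphasis(text[pos:]))
    return "".join(out)


def render_description(text):
    """Render a description field to HTML (hypothetical helper)."""
    blocks, current = [], []
    # Tabs are expanded everywhere; this only matters for indented blocks.
    for line in text.expandtabs().splitlines():
        if line.strip():
            current.append(line)
        elif current:  # one or more blank lines end a paragraph
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    rendered = []
    for lines in blocks:
        if all(line[0] == " " for line in lines):
            # Indented text is kept as-is inside <PRE>.
            body = "\n".join(_inline(line) for line in lines)
            rendered.append(f"<PRE>{body}</PRE>")
        else:
            # Normal paragraphs are word-filled: collapse all whitespace.
            filled = " ".join(" ".join(lines).split())
            rendered.append(f"<P>{_inline(filled)}</P>")
    return "\n".join(rendered)
```

URLs are located before escaping and emphasis are applied, so that underscores or ampersands inside a URL are not mistaken for markup.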

6.2 Mappings between Trove and RPM metadata

It would be highly desirable to be able to automatically import Red Hat RPMs and SRPMs into the Trove scheme. To this end, we compare them here.

Here are the currently defined RPM metadata tags:

Name -- the package name
Version -- the package version
Release -- the RPM release number
Copyright -- the license type of the software
Group -- topic category of the application
Source -- URL pointing to home archive of the sources
URL -- URL pointing to documentation
Distribution -- the distribution this package belongs to
Vendor -- the organization distributing the software
Packager -- contact email of package maker
Summary -- one-line summary description
Description -- multiline description of package.
Icon -- GIF or XPM icon for package.

Going through the Trove schema package fields in order, we see that we can copy the Name, Summary, Description, and Icon fields directly. RPM's URL field is equivalent to our Home-Page field.

We don't need RPMs to fill in the Crawl-To, Remote-Date, or Refresh-Date fields, as those are strictly for the crawler's use.

In the package access record, the Created, Update-Count, Modified, and Locked fields could be created at RPM translation time. The Contributor field could be copied from the RPM Packager field.

Now, proceeding to the relations. Assuming we had some systematic mapping of Group discriminators to Trove discriminators, we could derive exactly one topic discriminator. We could derive `required-by' relations from RPM's Requires header. We could derive the license-type controlled keywords from the Copyright header. There is no way to extract `supercedes', `extends', or `see-also'.

The real problems are with the package-to-person relations. RPM has no discriminators for contact people, authors, or maintainers. Metadata maintainership privileges would default to the contributor, but in the case of RPMs created by (say) Red Hat for distribution this is unlikely to be useful.

The picture is a little brighter with respect to automatically declaring resources. We could declare the RPM itself a resource with a version number composed from the Version and Release fields. The Contributor field could set the maintainer.
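As a rough sketch of the field-level mapping described so far, the translation could look like the following. The function name rpm_to_trove, the representation of both records as plain dictionaries, the Resource-Version key, and the "Version-Release" joining format are assumptions made for illustration; only the field correspondences themselves come from the discussion above.

```python
def rpm_to_trove(rpm):
    """Build a partial Trove package record from RPM tags (hypothetical helper).

    `rpm` is assumed to be a dict mapping RPM tag names to string values.
    """
    record = {}

    # Fields that can be copied directly.
    for field in ("Name", "Summary", "Description", "Icon"):
        if field in rpm:
            record[field] = rpm[field]

    # RPM's URL field is equivalent to Trove's Home-Page field.
    if "URL" in rpm:
        record["Home-Page"] = rpm["URL"]

    # The Contributor field could be copied from the RPM Packager field.
    if "Packager" in rpm:
        record["Contributor"] = rpm["Packager"]

    # Version number for the RPM-as-resource, composed from Version and
    # Release. The key name and the "-" separator are assumptions.
    if "Version" in rpm and "Release" in rpm:
        record["Resource-Version"] = f"{rpm['Version']}-{rpm['Release']}"

    return record
```

Note what the sketch cannot produce: Group is left uncopied because it needs a discriminator mapping we do not have, and no author or maintainer information appears anywhere in the result.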

Conclusion: RPM metadata is not really adequate for generating Trove records. The major problems are (a) it doesn't supply enough keyword/discriminator info, and (b) there is no way to derive a reliable maintainer or author list from it.

6.3 Mappings between Trove and Debian metadata

It would be desirable to be able to automatically import Debian .deb packages into the Trove scheme. To this end, we compare them here.

Here are the Debian metadata tags defined in the Debian Packaging Manual:

Package -- the package name
Version -- the package version
Architecture -- architecture the package is for
Maintainer -- contact email of package maker
Source -- name of corresponding source package
Depends -- declares an absolute dependency
Recommends -- declares a strong but not absolute dependency
Suggests -- recommends other packages to install
Pre-Depends -- declares an installation dependency
Conflicts -- says what this cannot coexist with
Replaces -- declares that this replaces given packages
Provides -- declares `virtual' packages for dependency purposes
Description -- multiline description of package.
Essential -- declares that a package cannot be removed (only replaced).
Priority -- how essential the package is.
Section -- application area of the package
Installed-Size -- installed size of the package
Standards-Version -- applicable version of Debian packaging standards
Distribution -- the distribution this package belongs to
Urgency -- how important it is to get current
Date -- last-modified-date of metadata
Format -- format level for changes file
Changes -- human-readable changelog data
Size -- size of binary package
MD5sum -- MD5 checksum of the package

(A few listed fields which are not package metadata have been omitted. Note that the Maintainer field is the .deb maintainer, analogous to the RPM Packager field, not the person responsible for the software.)

We can copy the Package, Version, and Description fields directly. Our Requires field might be derivable from Debian's Depends. It has been noted that we might be able to derive Crawl-To, Latest-Version, and Last-Stable Version by looking at the Debian FTP site. Otherwise there is little overlap -- not nearly enough to make using Debian metadata reasonable.
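To make the limited overlap concrete, here is a minimal sketch that reads one Debian control stanza and extracts the few fields we can use. The helper names, the mapping of Package onto a Trove Name field, and the reduction of Depends to a flat list of package names (first alternative only, version constraints dropped) are assumptions; whether Requires should be derived this way is still an open question.

```python
import re


def parse_control(stanza):
    """Parse one control stanza into a dict of field name -> value."""
    fields, key = {}, None
    for line in stanza.splitlines():
        if line[:1] in (" ", "\t") and key:
            # Continuation line; a lone "." stands for a blank line.
            text = line.strip()
            fields[key] += "\n" + ("" if text == "." else text)
        elif ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields


def deb_to_trove(fields):
    """Build a partial Trove record from Debian fields (hypothetical helper)."""
    record = {}
    if "Package" in fields:
        record["Name"] = fields["Package"]  # assumed Trove field name
    for field in ("Version", "Description"):
        if field in fields:
            record[field] = fields[field]
    if "Depends" in fields:
        names = []
        for clause in fields["Depends"].split(","):
            first = clause.split("|")[0]  # assumption: keep first alternative
            names.append(re.sub(r"\(.*?\)", "", first).strip())
        record["Requires"] = [name for name in names if name]
    return record
```

Everything else in the stanza (Architecture, Priority, Section, and so on) is simply dropped, which is the "little overlap" noted above.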

6.4 Comparison with Dublin Core metadata

The Dublin Core is a set of 15 metadata items that are meant to be fully general across all kinds of intellectual-property resources. Here is a summary of the Dublin Core fields:

Title: -- the name of the resource
Creator: -- the person who created the intellectual content of the resource
Subject: -- structured keywords
Description: -- free text
Publisher: -- the entity responsible for making the resource available
Contributor: -- secondary provider of content
Date: -- creation or first-availability date of the resource
Type: -- category of work (home page, novel, poem, working paper)
Format: -- data format, intended to identify what is required to present or use the resource
Identifier: -- URL, URN, ISBN, or other unique identifier within category
Source: -- information about a base resource from which this one is derived.
Language: -- language of the intellectual content of the resource
Relation: -- relates this resource to another, via an assertion such as IsVersionOf(), IsBasedOn(), IsPartOf(), etc.
Coverage: -- spatial or temporal characteristics of the intellectual content of the resource.
Rights: -- pointer to license and rights information.

We are certainly not going to be able to use the Dublin Core as a complete set of descriptors. But there are some things we could do to be name-compatible where we're semantically compatible, and avoid name clashes where we cannot be semantically compatible.


Simple renamings:
        Author -> Creator
        Maintainer -> Contributor
        Contributor -> Publisher        
        Discriminators -> Subject

Fieldnames to avoid in our metadata:
        Title      (hard experience that people don't interpret this well)
        Date       (because of creation vs. last-modified ambiguity)
        Type       (incompatible vocabulary with Dublin Core's)
        Format     (incompatible vocabulary with Dublin Core's)
        Identifier (there isn't any natural scheme)
        Source     (doesn't specify mode of derivation well enough)
        Language   (doc-language vs. implementation-language ambiguity)
        Relation   (incompatible vocabulary with Dublin Core's)
        Coverage   (just irrelevant)
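
The two lists above amount to a rename table and a set of reserved names. A small sketch of how they might be applied follows; the function names and the dictionary representation of a record are assumptions, and nothing here has been decided.

```python
# Simple renamings: our field name -> Dublin Core name.
RENAMES = {
    "Author": "Creator",
    "Maintainer": "Contributor",
    "Contributor": "Publisher",
    "Discriminators": "Subject",
}

# Dublin Core names we would avoid using in our own metadata.
AVOID = {
    "Title", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage",
}


def to_dublin_core_names(record):
    """Return a copy of `record` with the simple renamings applied.

    Each key is looked up exactly once, so Maintainer -> Contributor and
    Contributor -> Publisher do not chain into each other.
    """
    return {RENAMES.get(name, name): value for name, value in record.items()}


def clashing_names(field_names):
    """List the field names that collide with Dublin Core names to avoid."""
    return sorted(set(field_names) & AVOID)
```

The one-pass rename matters because Contributor appears on both sides of the table: renaming field by field in sequence would turn our Maintainer into Publisher.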

Finally, if we end up disambiguating package names with a site prefix or other uniquifying prefix (rather than resolving collisions), the "true name" could be designated the Identifier.

No decision has been made on this yet.
