
6. Data Formats

6.1 Text Field Rules

Description fields are interpreted according to the following rules:

Text is plain text. Paragraphs are separated by one or more blank lines. No HTML tags are recognized; >, & and < mean themselves. Normal paragraphs are word-filled. Indented text is treated as-is and converted to <PRE>...</PRE> in HTML (tabs should be expanded to spaces here). A single word between *asterisks* is rendered in bold (<b>...</b>) and a single word in _underscores_ is rendered in italics (<i>...</i>), even in indented text. Any text that looks sufficiently like a URL (e.g. http://www.python.org) is turned into a hyperlink with an <A...>...</A> tag pair (even in indented text).
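The rules above can be sketched in Python as follows. This is only an illustration, not a reference implementation: the function names are hypothetical, "looks sufficiently like a URL" is approximated by an http, https, or ftp prefix, a block counts as indented only if every one of its lines starts with whitespace, and word-filling is left to the browser by joining the lines of a normal paragraph.

```python
import html
import re

# Assumption: a URL is anything starting with http://, https:// or ftp://,
# not ending in sentence punctuation. Emphasis applies to single words only.
TOKEN_RE = re.compile(
    r'(?P<url>(?:https?|ftp)://[^\s<>"]*[^\s<>".,;:!?)])'
    r'|(?<!\w)\*(?P<bold>\w+)\*(?!\w)'
    r'|(?<!\w)_(?P<ital>\w+)_(?!\w)'
)


def render_inline(text):
    """Escape >, & and <, then mark up URLs, *bold* and _italic_ words."""
    out = []
    pos = 0
    for m in TOKEN_RE.finditer(text):
        out.append(html.escape(text[pos:m.start()], quote=False))
        if m.group("url"):
            url = m.group("url")
            out.append('<A HREF="%s">%s</A>'
                       % (html.escape(url), html.escape(url, quote=False)))
        elif m.group("bold"):
            out.append("<b>%s</b>" % m.group("bold"))
        else:
            out.append("<i>%s</i>" % m.group("ital"))
        pos = m.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


def render_description(text):
    """Convert a plain-text description field to HTML."""
    parts = []
    # Paragraphs are separated by one or more blank lines.
    for block in re.split(r"\n(?:[ \t]*\n)+", text.strip("\n")):
        if not block.strip():
            continue
        lines = block.expandtabs(8).split("\n")
        if all(line.startswith(" ") for line in lines):
            # Indented text is kept as-is inside <PRE>.
            parts.append("<PRE>%s</PRE>" % render_inline("\n".join(lines)))
        else:
            # Normal paragraph: join the lines and let HTML fill them.
            joined = " ".join(line.strip() for line in lines)
            parts.append("<P>%s</P>" % render_inline(joined))
    return "\n".join(parts)
```

Inline markup is applied in both branches because the rules state that emphasis and hyperlinks are recognized even in indented text.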

6.2 Mappings between Trove and RPM metadata

It would be highly desirable to be able to automatically import Red Hat RPMs and SRPMs into the Trove scheme. To this end, we compare them here.

Here are the currently defined RPM metadata tags:

Going through the Trove schema package fields in order, we see that we can copy the Name, Summary, Description, and Icon fields directly. RPM's URL field is equivalent to our Home-Page field.

We don't need RPMs to fill in the Crawl-To, Remote-Date, or Refresh-Date fields, as those are strictly for the crawler's use.

In the package access record, the Created, Update-Count, Modified, and Locked fields could be created at RPM translation time. The Contributor field could be copied from the RPM Packager field.
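As a rough sketch of the field mapping described so far, assuming the RPM header has already been read into a plain dictionary keyed by tag name (the dictionary shape, the function name, the date format, and the initial values chosen for the access-record fields are all illustrative assumptions):

```python
from datetime import datetime, timezone

# Fields that carry over under the same name.
DIRECT_COPIES = ("Name", "Summary", "Description", "Icon")
# RPM tag -> Trove field, for the two renamed fields named in the text.
RENAMED = {"URL": "Home-Page", "Packager": "Contributor"}


def rpm_to_trove_fields(rpm, now=None):
    """Build a partial Trove package record from a dict of RPM tags."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d")  # assumed date format
    record = {tag: rpm[tag] for tag in DIRECT_COPIES if tag in rpm}
    for rpm_tag, trove_field in RENAMED.items():
        if rpm_tag in rpm:
            record[trove_field] = rpm[rpm_tag]
    # Access-record fields are created at translation time.
    # The initial values below are assumptions.
    record.update({
        "Created": stamp,
        "Modified": stamp,
        "Update-Count": 0,
        "Locked": False,
    })
    # Crawl-To, Remote-Date and Refresh-Date are left to the crawler.
    return record
```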

Now, proceeding to the relations. Assuming we had some systematic mapping of Group discriminators to Trove discriminators, we could derive exactly one topic discriminator. We could derive `required-by' relations from the Required header. We could derive the license-type controlled keywords from the Copyright header. There is no way to extract `supercedes', `extends', or `see-also' relations.

The real problems are with the package-to-person relations. RPM has no discriminators for contact people, authors, or maintainers. Metadata maintainership privileges would default to the contributor, but in the case of RPMs created by (say) Red Hat for distribution this is unlikely to be useful.

The picture is a little brighter with respect to automatically declaring resources. We could declare the RPM itself a resource with a version number composed from the Version and Release fields. The Contributor field could set the maintainer.
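A minimal sketch of that resource declaration, again assuming the RPM tags are available as a dictionary. The resource field names and the hyphen used to join Version and Release are assumptions; the text only says the version is composed from the two fields.

```python
def rpm_resource(rpm, filename):
    """Declare the RPM file itself as a resource of the package."""
    resource = {
        "Resource": filename,  # hypothetical field name
        # Assumed composition: "<Version>-<Release>".
        "Version": "%s-%s" % (rpm["Version"], rpm["Release"]),
    }
    # The Contributor (copied from Packager) doubles as maintainer.
    if "Packager" in rpm:
        resource["Maintainer"] = rpm["Packager"]
    return resource
```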

Conclusion: RPM metadata is not really adequate for generating Trove records. The major problems are (a) it doesn't supply enough keyword/discriminator information, and (b) there is no way to derive a reliable maintainer or author list from it.

6.3 Mappings between Trove and Debian metadata

It would be desirable to be able to automatically import Debian .deb packages into the Trove scheme. To this end, we compare them here.

Here are the Debian metadata tags defined in the Debian Packaging Manual:

(A few listed fields which are not package metadata have been omitted. Note that the Maintainer field is the .deb maintainer, analogous to the RPM Packager field, not the person responsible for the software.)

We can copy the Package, Version, and Description fields directly. Our Requires field might be derivable from Debian's Depends. It has been noted that we might be able to derive Crawl-To, Latest-Version, and Last-Stable Version by looking at the Debian FTP site. Otherwise there is little overlap -- not nearly enough to make using Debian metadata reasonable.
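To make the small overlap concrete, here is a sketch that reads a Debian control stanza ("Field: value" lines, with continuation lines starting with whitespace) and extracts the fields mentioned above. The function names are hypothetical, and the handling of Depends (dropping version constraints and keeping only the first of any "a | b" alternatives) is one possible simplification, not a settled rule.

```python
import re


def parse_control(text):
    """Parse one control stanza into a dict of field name -> value."""
    fields = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last is not None:
            # Continuation line: belongs to the previous field.
            fields[last] += "\n" + line.strip()
        else:
            name, _, value = line.partition(":")
            last = name.strip()
            fields[last] = value.strip()
    return fields


def deb_to_trove_fields(control):
    """Copy the overlapping fields and derive Requires from Depends."""
    record = {k: control[k]
              for k in ("Package", "Version", "Description")
              if k in control}
    if "Depends" in control:
        names = []
        for dep in control["Depends"].split(","):
            first = dep.split("|")[0]           # assumption: first alternative
            name = re.sub(r"\(.*?\)", "", first).strip()  # drop "(>= 2.0)"
            if name:
                names.append(name)
        record["Requires"] = names
    # Maintainer is deliberately not copied: it names the .deb packager.
    return record
```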

6.4 Comparison with Dublin Core metadata

The Dublin Core is a set of 15 metadata items that are meant to be fully general across all kinds of intellectual-property resources. Here is a summary of the Dublin Core fields:

We are certainly not going to be able to use the Dublin Core as a complete set of descriptors. But there are some things we could do to be name-compatible where we're semantically compatible, and avoid name clashes where we cannot be semantically compatible.

Simple renamings:
        Author -> Creator
        Maintainer -> Contributor
        Contributor -> Publisher        
        Discriminators -> Subject

Fieldnames to avoid in our metadata:
        Title      (hard experience that people don't interpret this well)
        Date       (because of creation vs. last-modified ambiguity)
        Type       (incompatible vocabulary with Dublin Core's)
        Format     (incompatible vocabulary with Dublin Core's)
        Identifier (there isn't any natural scheme)
        Source     (doesn't specify mode of derivation well enough)
        Language   (doc-language vs. implementation-language ambiguity)
        Relation   (incompatible vocabulary with Dublin Core's)
        Coverage   (just irrelevant)
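The two lists above could be applied mechanically, as in the following sketch (the function name and the choice to raise an error on a clashing field name are assumptions). Note that the renaming must happen in a single pass: Maintainer becomes Contributor while the old Contributor becomes Publisher, so applying the renamings one after another would merge the two fields.

```python
# Our field name -> Dublin Core name.
DC_RENAMES = {
    "Author": "Creator",
    "Maintainer": "Contributor",
    "Contributor": "Publisher",
    "Discriminators": "Subject",
}

# Dublin Core names we should not use for our own fields.
DC_AVOID = {
    "Title", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage",
}


def to_dublin_core_names(record):
    """Rename fields to their Dublin Core equivalents in one pass."""
    clashes = sorted(DC_AVOID & set(record))
    if clashes:
        raise ValueError("field names clash with Dublin Core: %s"
                         % ", ".join(clashes))
    # Each key is looked up against the original names only, so the
    # Maintainer/Contributor chain cannot collapse into one field.
    return {DC_RENAMES.get(name, name): value
            for name, value in record.items()}
```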

Finally, if we end up disambiguating package names with a site prefix or other uniquifying prefix (rather than resolving collisions), the "true name" could be designated the Identifier.

No decision has been made on this yet.

