Questions and Answers

Despite all its features, the df format does have some deficiencies. Some of these are justifiable (for the moment), some can be worked around (albeit awkwardly), and some must await future enhancements of the format.

Why was this format created? Is this really supposed to go into competition against the likes of HDF and netCDF?

No. In fact, the authors dream not of the day when the df format overwhelms all others, but of the day when the features they need--and which the df format at present alone supplies--show up in other, more widely accepted data formats.

Until that day comes, however, our research needs dictated a solution that could be used in the present, and so the authors had no choice but to design a new format. Taking a broader view, one can almost say that the df format was created as a means of demonstrating what we meant, and why the features we say we need are indeed important.

The df format specification is too complicated. As a scientist, I want something I can implement in BASIC in five minutes.

The format could have been made simpler, but only at the expense of making it more restrictive. The apparent complexity here is a result of being able to express complicated, pathological data structures; you can do things here you cannot do elsewhere. (If anything, df falls short of being able to represent every conceivable complicated data structure.) The moment we rule out such data structures, telling scientists they cannot write or think of their data in those terms, we lose.

In the short term, the complexity will frighten away some potential users. But in the long term, as a standard library of subroutines to handle the df format takes shape, the complexity will be hidden from the naive users, and they need never see this document. (The more advanced users, of course, will always be able to write their own software to improve performance in their individual environments.)

For now, the only consolation is to realize that in most cases the specification is simpler than it appears. Simple cases can be simply expressed. The dimensional indices, for example, can be as straightforward or as twisted as the user cares to make them.

And, yes, a df format file can be written from a BASIC program!

It seems that a key feature of the df format is the use of numeric codes to represent metadata. This scheme, however, is not only overly restrictive for users creating their own data, but also relies on being able to distribute the definitions of these codes to all users of the format. And that seems unlikely to be successful.

Numeric codes are used instead of text strings for three reasons: (a) to save storage space (this means that we can require that metadata be present in each dataset, even in a massive collection of files), (b) to promote automated handling of the metadata (interpretation, labelling, cataloging, etc.), and (c) to reduce the ambiguity and language problems associated with the use of English text.
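
To make the intended automated handling concrete, here is a minimal sketch, in Python, of translating codes into labels. The code values and the QUANTITY_CODES table are entirely hypothetical; the actual assignments come from the translation mechanism discussed below and in Section B.2.

    # Minimal sketch of translating numeric metadata codes into labels.
    # The code values below are hypothetical; real assignments come from
    # the code-translation mechanism described in Section B.2.
    QUANTITY_CODES = {
        1001: ("temperature", "K"),      # hypothetical assignment
        1002: ("pressure", "hPa"),       # hypothetical assignment
    }

    def label(code):
        """Return a human-readable label for a quantity code, or a
        placeholder when no local translation is available."""
        name, units = QUANTITY_CODES.get(code, ("unknown quantity", ""))
        return "%s [%s]" % (name, units) if units else name

    print(label(1001))   # -> temperature [K]
    print(label(9999))   # -> unknown quantity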

The codes, though, do necessitate implementing a mechanism for translating them (see Section B.2) as well as following the hierarchical scheme for adding new codes to the standard. Current technology is capable of supporting the distribution of the code definitions, whether it be through copying files or through the use of a distributed network server.

If a few small datasets are to be exported to another site, and there is some concern about whether that site is able to translate the codes, then we recommend that the code translations used in those datasets be inserted as COMMENT records into at least the first one. If large numbers of datasets are to be shipped, then it would probably be worthwhile to arrange for distribution of the complete set of code translations.

Note that the capability exists to define certain codes locally. One does not have to wait to write one's datasets until a central authority assigns a code; moreover, codes defined locally according to the standard are structured in such a way that they will not conflict with other sites' local codes. The only code that must be assigned centrally is the site identification code.
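
The standard's actual bit layout is not reproduced here, but a sketch along the following (hypothetical) lines shows why locally minted codes cannot collide across sites: the centrally assigned site identification code is embedded in every local code.

    # Hypothetical packing of a locally defined code.  The real layout
    # is given by the standard; the point is only that embedding the
    # centrally assigned site ID makes cross-site collisions impossible.
    LOCAL_FLAG = 0x80000000      # marks a code as locally defined

    def make_local_code(site_id, local_index):
        """Combine the central site ID with a site-chosen index
        (assumed here to fit in 16 bits); two different sites can
        then never produce the same code."""
        return LOCAL_FLAG | (site_id << 16) | local_index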

It should also be noted that, with the exception of the data format codes, translations are not necessary to read the data or determine its structure, but are merely an aid to interpreting it. If one knows, for instance, that a file contains temperature on latitude-longitude-pressure grids, then one need not rely on the codes to say so. Even the data format codes, in fact, are structured in such a way that a program should be able to parse an unknown code and figure out what data types are intended in most cases, without having access to a list of data format code translations.
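
With a purely illustrative encoding (the genuine one is given in the Numeric Codes section that follows), parsing a structured format code without a translation table might look like this:

    # Purely illustrative: assume the last decimal digit of a data
    # format code selects the element type and the remaining digits
    # give the element count.  The genuine encoding is defined in the
    # Numeric Codes section; this only shows that a structured code
    # can be parsed without access to a translation list.
    ELEMENT_TYPES = {1: "int32", 2: "float32", 3: "float64"}

    def parse_format_code(code):
        kind = ELEMENT_TYPES.get(code % 10, "unknown")
        count = code // 10
        return count, kind

    print(parse_format_code(42))   # -> (4, 'float32')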

The header records require the user to know what will be written before the data are written: the data sizes, the number of comments, the number and type of processing codes, etc. What if we want to write the data first, then figure out what the metadata are?

This is an unavoidable side effect of the requirement that programs written in languages such as Fortran be able to know record lengths before the records are read; that requirement, in turn, forces records to be written in a certain order.
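
For readers unfamiliar with the constraint: most Fortran compilers bracket each record of an unformatted sequential file with a byte count, so a reader must obtain the length before (or while) consuming the payload. A minimal Python sketch of such a reader (little-endian markers assumed; the actual byte order is compiler- and platform-dependent):

    import struct

    def read_fortran_record(f):
        """Read one record from a Fortran-style unformatted sequential
        file, in which each record is bracketed by a 4-byte byte count."""
        head = f.read(4)
        if not head:
            return None                        # end of file
        (nbytes,) = struct.unpack("<i", head)  # leading byte count
        payload = f.read(nbytes)
        (tail,) = struct.unpack("<i", f.read(4))
        assert tail == nbytes, "corrupt record marker"
        return payload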

In the long run, though, as these languages fall out of use and this feature is less in demand, the requirement may be relaxed. Records could then be written in any order (so long as the TEST record comes first). A reader encountering a data record ahead of the metadata would, however, need to know that record's length in order to skip past it to reach the metadata (which is what says how big the data records are); a new record type may then be created to give the length of the next record in bytes.
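
Under the same hypothetical record layout as in the sketch above, such a next-record-length record would let a reader hop over data records without reading them:

    def skip_record(f, nbytes):
        """Skip one record whose payload length (from a hypothetical
        next-record-length record) is already known.  The extra 8
        bytes account for the two 4-byte record markers."""
        f.seek(nbytes + 8, 1)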

Using the df format, I can lump scalar quantities together as components of a single larger quantity. But I cannot do that with vectors or tensors (much less mix scalars, vectors, and tensors) without breaking them up into their components. I want to be able to nest Level 0 dimensions.

At present, a single datum's complexity is limited. While the convention of the Level 0 dimensions provides for scalars, vectors, and tensors of arbitrary sizes and types, it does not allow components of different forms (such as scalars and tensors) to be lumped together to form a single datum. Currently, separate data objects must be defined within a single dataset--a less than satisfactory solution.

What is needed is some completely general way of describing a single datum of arbitrary complexity. This is an area for future expansion.

