
The ``df'' format

The creators of this proposed standard did not lightly reject the other formats. The time and effort involved in designing a standard format are substantial, and they did everything in their power to avoid it. Nevertheless, they have reluctantly concluded that the existing formats they have encountered do not meet the criteria they believe are essential for a truly useful standard format for scientific data. This is unfortunate, since it adds yet another proposed standard to the growing list. Yet, if our needs as scientists are to be met, it seems necessary.

This is not to say that the proposed format, which is called ``df,'' is the final word on the subject. While considerable thought went into implementation and efficiency issues, this standard will undoubtedly have some aspects whose implementation will make computer professionals groan, and certain constructs may not be as elegant or efficient as those designed by computer scientists. This standard should be viewed as an object lesson or testbed for ideas, detailing some of the features scientists need in a file format. If these features are incorporated into whatever format wins out in the end, the effort devoted to this standard will have been worthwhile.

The authors believe that good software and good standards should be derived from a careful consideration of users' needs--not programmers' needs, not programmers' ideas of users' needs, but the needs of the users themselves. Scientists should be consulted from the beginning, not near the end, of the design process, and that consultation should continue through to the end. Their real, perceived needs--in what ways do they think of their data? what sort of premium do they place on disk space?--must be given paramount importance. One does not begin by asking, ``what do you think of this data format?'' One should begin by observing scientists as they collect, manipulate, and analyze their data; from these observations, a clearer picture of their needs can be obtained. (And an afternoon's interview will not suffice!) We believe that flexibility and user-specifiability must be built in from the ground up, not tacked on as an afterthought.

The file format proposed herein strikes a balance between the conflicting demands outlined above. Users and projects may determine their own balance between efficiency and portability, yet a single standard is adhered to at either extreme.

For example, provision is made for portability of data where desired, but where efficiency is preferred, machine-specific representations are used (and are identified as such in the file). This standard also allows for various compression and data-packing methods to conserve storage, again at the discretion of the user. Bad or missing data are provided for, and other sorts of processing notes and flags can be attached to specific subsets of the data. A sort of audit trail is even provided, whereby datasets which are derived from other datasets can have their entire family trees laid out for the inspection of the users.
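The idea of allowing either a portable or a machine-specific representation, with the choice recorded in the file itself, can be illustrated with a minimal Python sketch. The tag bytes, the header layout, and the function names `pack_values` and `unpack_values` are hypothetical and chosen only for illustration; they are not the df encoding, which is defined in the specification that follows.

```python
import struct
import sys

# Hypothetical one-byte tags for illustration only; NOT the actual df encoding.
PORTABLE = b"P"       # fixed big-endian representation
NATIVE_LITTLE = b"L"  # writer's native byte order was little-endian
NATIVE_BIG = b"B"     # writer's native byte order was big-endian

# struct byte-order prefix associated with each tag
_ORDER = {PORTABLE: ">", NATIVE_LITTLE: "<", NATIVE_BIG: ">"}


def pack_values(values, portable=True):
    """Encode floats as: tag byte, 4-byte count, then 8-byte doubles."""
    if portable:
        tag = PORTABLE
    else:
        # Keep the writer's own byte order, but record which one it was.
        tag = NATIVE_LITTLE if sys.byteorder == "little" else NATIVE_BIG
    header = tag + struct.pack(">I", len(values))  # count is always big-endian
    return header + struct.pack(f"{_ORDER[tag]}{len(values)}d", *values)


def unpack_values(blob):
    """Decode a blob written by pack_values, using the tag to pick the order."""
    tag = blob[:1]
    (count,) = struct.unpack(">I", blob[1:5])
    body = blob[5:5 + 8 * count]
    return list(struct.unpack(f"{_ORDER[tag]}{count}d", body))
```

Because the reader consults the tag rather than assuming a byte order, data written in the machine-specific form on one machine can still be decoded on another, which is the property the paragraph above describes.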

Also, and just as importantly, the format is flexible enough to be used for a wide variety of data fields and disciplines.

This document describes the df format, defining the structure of the bits which make up a dataset, explaining the concepts behind the format, giving examples of its use, and discussing implementation issues. While this introduction is intended for a wide audience, the rest of the document is heavily oriented towards computer programmers. (Physical scientists who are used to simply dumping their data arrays onto disk with no descriptive metadata should be forewarned: they will probably find the following chapters somewhat tedious and overcomplicated.)


Eric Nash 2003-09-25