One of the underappreciated benefits of computer technology in scientific research is the exchange of data among different research groups. The ability of relatively small physical devices to store and retrieve vast amounts of information has opened up new methods of collaboration between scientists which were unthinkable in the days of personal journals stuffed with long lists of hand-written numbers. We believe that the cross-fertilization of ideas engendered by these new capabilities can only aid scientific progress.
Too often, however, the technology is not used to best advantage. Typically, a researcher will mail to a colleague a reel of magnetic tape or a diskette with a data file written on it, accompanied by a photostatic copy of a document outlining the file format. If the recipient is fortunate, a sample program to read the data will be included as well. More often than not, however, the document will be obsolete and fail to reflect the actual format of the data. Also, the sample program will be filled with mysterious local conventions and unusual subroutines--crucial to successful operation--which are unknown outside the colleague's site.
If the recipient, through the application of clever graduate students, finally manages to crack the code on this digital Rosetta Stone, requests for the data are likely to follow from other colleagues (to be honored with the permission of the data's originator, of course). By this time, many of the caveats that the first scientist attached to the data will have been forgotten, and so will not be passed along to these third parties, for whom the warnings may have critical relevance.
Add to these difficulties the more computer-specific problems of incompatible media, binary floating-point representation, machine word size, and so forth, and one has a significant obstacle to data exchange.
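As a small illustration of the binary-representation problem, the following Python sketch (not part of the df format, and assuming only the standard `struct` module and IEEE 754 single precision) writes the same number in two byte orders and then reads one of them back with the wrong assumption. The variable names are ours, chosen for illustration.

```python
import struct

value = 1.5

# The same IEEE 754 single-precision number, serialized in two byte orders.
big_endian = struct.pack(">f", value)     # most significant byte first
little_endian = struct.pack("<f", value)  # least significant byte first

print(big_endian.hex())     # 3fc00000
print(little_endian.hex())  # 0000c03f

# A reader that assumes the wrong byte order gets no error, only a wrong number.
misread = struct.unpack("<f", big_endian)[0]
print(misread == value)     # False
```

The misread value raises no exception, which is why a file that does not record its own byte order and number representation can be silently corrupted when it moves between machines.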
The use of a standard file format can reduce or eliminate many of these problems. By definition, a standard format can be read by a diverse group of machines and people. In addition, a properly designed format will contain enough self-documentation that other scientists can not only read the data, but make some sense of it as well.
Several such standards have been promoted in recent years [NCSA Software Tools Group, 1989], [Rew, 1990], addressing varying tradeoffs between the conflicting needs and goals of a standard format. Some have emphasized, for example, portability of data over computer networks to the detriment of all other considerations. Others have insisted that the entire format, especially all metadata, be in what they consider ``human readable'' form. Still others have attempted to impose a particular paradigm upon the data; users are expected to think of their data in the ways that the format designers demand.
Unfortunately, the tradeoffs imposed upon these formats by their creators often violate scientists' needs. The form that the data files take should be driven by the needs of the research, and those needs vary from group to group. Computer programmers are not physical scientists, moreover, and frequently misunderstand what those scientists require. What is needed is a data format which will leave strategic decisions and paradigm choices in the hands of those who will actually be using the data sets.
The file format proposed herein is intended to satisfy this requirement. Its creators are practicing scientists, rather than computer professionals, and thus it was designed at every step to address the actual needs of working scientists, rather than the mere perception of those needs by computer programmers. This format imposes no paradigm, no grand stratagem for data management. Rather, it seeks to provide a common language to allow scientists to express their data in the way they think is best. As much as possible, strategic decisions and tradeoffs are left in the hands of the users, since the wisdom of each such decision depends upon a user's specific application.
We term this format ``df'', and this document describes it. This first chapter discusses some of the issues involved in choosing or designing a standard file format; programmers interested only in implementing the df format itself may wish to skip to the next chapter. Subsequent chapters define the format, give examples to clarify some of the concepts introduced, and discuss how well the df format meets its design criteria. Finally, a set of appendices is included to aid in implementing software to read and write this format. The second and later chapters are written primarily for programmers; scientists and managers may find them uninteresting.