

High Overhead

Another disadvantage associated with standard file formats is the overhead involved in reading and writing them. Data written in a portable binary representation must be converted to a machine's native representation before calculations can be performed. The metadata for self-documentation eats up valuable disk space. As a result, users in search of high storage density and performance end up avoiding standard formats.

The binary representation issue can be dealt with by allowing native machine-dependent formats. In this case, however, the file must still contain information identifying the file as being in the standard format written in a particular native representation, and that information must still be readable from any machine. Recalling our previous example, a Unix system, having retrieved a file from an IBM mainframe, should be able to tell that the file is written in the standard format using Cray native binary representations.
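
To make this concrete, one way such identifying information might be laid out is sketched below in C: a short header whose fields are all single bytes, so that it reads the same on any machine regardless of word size or byte order. The magic string ``SDF1'', the field names, and the representation codes are invented here for illustration; they do not come from any particular standard.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical codes for the native representation the data were
       written in (invented for illustration, not from any real standard). */
    #define REP_IEEE_BIG     1   /* IEEE 754, big-endian */
    #define REP_IEEE_LITTLE  2   /* IEEE 754, little-endian */
    #define REP_CRAY         3   /* Cray native floating point */
    #define REP_IBM_HEX      4   /* IBM System/370 hexadecimal floating point */

    /* Every field is a single byte, so the header reads identically on
       every machine regardless of its word size or byte order. */
    struct fmt_header {
        char          magic[4];  /* e.g. "SDF1": marks the standard format */
        unsigned char version;   /* format version number */
        unsigned char rep_code;  /* which native representation follows */
    };

    /* Return the representation code if fp begins with the standard
       header, or -1 if it does not. */
    static int identify(FILE *fp, struct fmt_header *hdr)
    {
        if (fread(hdr, 1, sizeof *hdr, fp) != sizeof *hdr)
            return -1;                       /* too short to be ours */
        if (memcmp(hdr->magic, "SDF1", 4) != 0)
            return -1;                       /* not the standard format */
        return hdr->rep_code;
    }

A reader on any machine can thus recognize the format and decide whether a conversion step is needed before the data are used.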

It is reasonable, then, not to demand that a standard format exhibit strict binary compatibility across machines: if necessary, one can convert one machine's native binary format to another's. What is important is to ensure that one knows the type and structure of the data one is converting, and that the data types being converted (as opposed to the implementation or representation of those data types) not be proprietary to a single machine. The use of a file format standard satisfies the first condition; using only portable data types satisfies the second. Types such as characters, bytes, signed and unsigned integers, and single and double precision floating-point meet this criterion. Complicated structures and pointers do not.
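
For example, once a quantity is known to be a 32-bit signed integer stored in a particular byte order, converting it to whatever the local machine uses is a purely mechanical operation. The sketch below, which assumes two's-complement integers and an externally specified big-endian layout, assembles the value byte by byte; no comparable rule exists for a pointer, whose value has no meaning outside the machine that wrote it.

    #include <stdint.h>

    /* Assemble a 32-bit signed integer that was stored externally in
       big-endian byte order, independent of the reading machine's own
       byte order.  Assumes two's-complement integers. */
    int32_t read_be_int32(const unsigned char b[4])
    {
        uint32_t u = ((uint32_t)b[0] << 24) |
                     ((uint32_t)b[1] << 16) |
                     ((uint32_t)b[2] <<  8) |
                      (uint32_t)b[3];
        return (int32_t)u;
    }

A reader needs one such routine per portable type and source representation; the header tells it which set to apply, or that none is needed.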

Allowing native binary representations in the dataset format does eliminate the guarantee that a dataset can be read fresh off a network, but with one extra representation-conversion step the overall goal of porting the dataset easily to new machines is still met. Furthermore, for those applications in which true ``out of the box'' readability is required, use of a portable representation should still be allowed.

Another source of computational overhead, aside from data conversion, is requiring the user to use a standard software package. Standard packages are, of course, good and useful in themselves: they encourage the adoption of the standard format by people who do not wish to expend the time or effort to understand it and write their own readers and writers. Nevertheless, no set of subroutines can operate at maximum efficiency in all applications. Users who require high performance should be able to consult the standard specification and write their own I/O software. Documenting only the reading and writing subroutines, while keeping the underlying file format secret, prevents these users from adopting the standard.

In addition to computational overhead, standard file formats are often associated with greatly increased disk storage requirements. Aside from the extra space consumed by certain portable binary representations, information for self-documentation accounts for much of this file-size bloat.

Users manipulating a handful of toy datasets on their personal computers may be able to live with the larger file sizes. On a large supercomputing system, however, with ten years or more of daily global meteorological data, the extra disk space required can be prohibitive.

How can one squeeze the files down?

Metadata written as plain text generally requires much more storage than that written as, say, integer codes. In seeking to reduce this overhead, then, one must examine the question of whether the metadata should be machine-readable or human-readable.

While some may claim that metadata must exist in human-readable form, it should be pointed out that the only storage media which allow humans to read the data directly are punched cards and paper tape, neither of which is used in today's research environment. Most commonly, data are stored as microscopic magnetic or optical patterns in some special material, and a computer is required to read them. Thus the question becomes not ``Shall the metadata be human-readable?'' but rather ``What kind of computer tool shall be used to inspect the metadata?''

The case for keeping metadata in ``human-readable'' form is that a simple text editor can be used to inspect it, and everybody has a text editor. Unfortunately, many editors will choke on the binary data that may be found elsewhere in the file (and if the entire file, including all data, is written as text, many editors will still be unable to hold it in memory). Therefore some specialized tool must be written and distributed to allow one to browse or inspect a dataset. As long as such a tool must be used, however, it really does not matter how human-readable the dataset itself is; what matters is how human-readable the tool's output is.

Furthermore, metadata encoded as text comments embedded in a data file tend to favor a single language, such as English; this can be a disadvantage when dealing with an international community.

The case for machine-readability, on the other hand, can be made by considering the direction in which scientific data processing is moving. In the past, many datasets were relatively small--small enough for one person to comprehend entirely. These were the sorts of datasets found in laboratory notebooks, and storing and processing them on computers was mainly a matter of convenience. With the advent of satellite data and output from sophisticated numerical models, the volume of data has grown so much that computer processing has become necessary rather than merely convenient; no one person could hope to perform calculations by hand on such datasets. In the future, one should expect more of the ``grunt work'' to be taken over by computer tools of increasing sophistication and speed. It would take the advent of true artificial intelligence for machines to do science--no foreseeable machine will be able to detect trends in atmospheric constituents, for example--but more of the routine, mechanical work (such as ``Find the global temperatures on the 50-millibar pressure surface for 12 January 1989, and plot them up for me.'') will be done by computers. It is therefore profitable to make the metadata relatively simple for a computer to understand. Naturally, this has its limits; careful explanations of unusual experimental conditions are best left in human-readable form, since they tend to be unique and difficult to encode. But such common items as data identifications, units, and dimensions should be machine-readable.

Choices made on this issue, of course, affect those made on others. If a special machine-readable code is used for, say, data identification, then one must also decide whether the data ID codes are defined by a central authority or by each user. Additionally, one must decide where to put the machine-to-human translations of the data ID codes. Again, a reasonable compromise is for a basic framework to be set up by a central authority, with the ability for users to fit their own code definitions into that framework. In this way, even if a code definition used in a foreign dataset is missing from a site's own list of code definitions, a clever utility program could make some sense of what the code means.
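
A minimal sketch of what such a framework might look like is given below in C: one range of integer codes is reserved for centrally defined quantities and another for site-local definitions, so a utility meeting an unfamiliar local code can at least report that it is site-defined and fall back on whatever definitions the dataset itself carries. The particular code numbers, names, and split of the code space are invented for illustration.

    #include <stddef.h>

    /* Hypothetical split of the code space: codes below 32768 are
       assigned by the central authority; codes at or above it are
       defined locally by each site. */
    #define LOCAL_CODE_BASE 32768

    struct code_def {
        int         code;
        const char *name;
        const char *units;
    };

    /* Centrally maintained definitions, distributed with the standard. */
    static const struct code_def central_defs[] = {
        { 11, "air temperature",  "K"   },
        { 33, "zonal wind",       "m/s" },
        { 34, "meridional wind",  "m/s" },
    };

    /* Describe a data ID code, falling back gracefully when the code
       is unknown or belongs to the site-defined range. */
    static const char *describe(int code)
    {
        size_t i;
        for (i = 0; i < sizeof central_defs / sizeof central_defs[0]; i++)
            if (central_defs[i].code == code)
                return central_defs[i].name;
        if (code >= LOCAL_CODE_BASE)
            return "site-defined quantity; consult the dataset's own definitions";
        return "unknown centrally defined quantity";
    }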

Another way to make files smaller would be to allow for packing and compression of data. By ``packing,'' we mean the scaling of floating-point data to smaller integers occupying less storage. By ``compression,'' we mean the reduction of redundancy in a bit stream by such algorithms as Lempel-Ziv. The format should probably allow only a few standard methods, and, once the method is specified by the user, the data should be packed and/or compressed transparently by the writer subroutines (and, of course, uncompressed and unpacked transparently by the reader subroutines).
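
As an illustration of packing in this sense, the sketch below maps floating-point values onto 16-bit unsigned integers using a scale and offset computed from the field's own range; the writer stores the scale and offset alongside the packed values so that the reader can recover approximate originals. This is a generic linear-packing technique, not the method of any particular standard.

    #include <stddef.h>
    #include <stdint.h>

    /* Pack n floating-point values (n >= 1) into 16-bit integers using
       a linear scale and offset derived from the data's own minimum and
       maximum.  The offset and scale must be stored with the packed
       values so the reader can unpack. */
    void pack16(const float *x, uint16_t *packed, size_t n,
                float *offset, float *scale)
    {
        float lo = x[0], hi = x[0];
        size_t i;

        for (i = 1; i < n; i++) {
            if (x[i] < lo) lo = x[i];
            if (x[i] > hi) hi = x[i];
        }
        *offset = lo;
        *scale  = (hi > lo) ? (hi - lo) / 65535.0f : 1.0f;

        for (i = 0; i < n; i++)
            packed[i] = (uint16_t)((x[i] - *offset) / *scale + 0.5f);
    }

    /* Unpacking is the inverse linear transform; precision lost in
       packing is not recovered. */
    float unpack16(uint16_t p, float offset, float scale)
    {
        return offset + p * scale;
    }

Each value gives up some precision in exchange for half or less of the storage of a 32-bit floating-point number, which is often an acceptable trade for data whose own accuracy is limited.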

