
Definitions and Concepts

We define a datum to be a specific localized piece of information, consisting of one or more components which may be of any numeric or string type. Generally, both the datum and its individual components are identified with a quantity and physical units. Data are distributed in some sort of coordinate space indexed by one or more dimensional variables, each of which varies independently of the others. This implies the use of a general data array as our basic model of how data are accessed, with the proviso that irregularly distributed data must also be accommodated. The various dimensions which locate a datum, distinguishing it from other data, may then be identified tentatively with indices into an array. These indices, along with those which select components from a datum, tend to fall into a few functional groups, which we refer to as dimensional levels.

We designate indices which select components within a datum as Level 0 dimensions. For example, a vector wind measurement might consist of the three spatial components of the wind (pointing in the direction of longitude, latitude, and pressure altitude); the index or variable which selects one component from these three would comprise the Level 0 dimension of the wind data.

The coordinates of a datum comprise the Level 1 and Level 2 dimensions of the data. Data are most likely to be stored in groups, or records. We therefore define the Level 1 dimensions as those coordinates which vary within a record; they are considered to vary solely within a group of data and are by nature fixed and bounded. Level 2 dimensions are those coordinates that vary across records. They are considered to vary across groups of data and may be unbounded.

Carrying the above example forward, each wind measurement will have a location in longitude, latitude, pressure, and time associated with it. Longitudes and latitudes might be considered fixed and bounded at, say, every 5 degrees; it seems natural to store these data as a series of two-dimensional arrays. The two dimensions, latitude and longitude, would then belong to Level 1. The other coordinates, pressure and time, would index these latitude-longitude arrays, and would therefore belong to Level 2. Note that, if one chooses to store the data as three-dimensional arrays (latitude, longitude, and pressure), then those three dimensions would be Level 1, and only time would belong to Level 2. In other words, whether a dimension belongs to Level 1 or Level 2 is determined by how the data are stored, not by any intrinsic nature of the dimension. The nature of a dimension may, of course, influence how the data are written: time, being unbounded in most instances, will often be used as a Level 2 dimension.
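The storage choice described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names `u`, `v`, `w`, the sample pressures and times, and the use of nested lists and a dictionary are ours and are not part of the format.

```python
# Level 1 dimensions: fixed and bounded, varying within a record.
lons = list(range(0, 360, 5))       # degrees East
lats = list(range(-90, 91, 5))      # degrees, negative for South (assumed convention)

# Level 0 dimension: selects one component of a single datum.
COMPONENTS = ("u", "v", "w")        # hypothetical names for the three wind components

def empty_record():
    """One record: a latitude-longitude array of 3-component wind vectors."""
    return [[[0.0] * len(COMPONENTS) for _ in lons] for _ in lats]

# Level 2 dimensions: vary across records; time could grow without bound.
pressures_hpa = [1000.0, 500.0]             # assumed sample values
times = ["2003-09-01", "2003-09-02"]        # assumed sample values
records = {(p, t): empty_record() for p in pressures_hpa for t in times}

# Locating one component of one datum uses all three levels.
record = records[(500.0, "2003-09-02")]             # Level 2 selects the record
datum = record[lats.index(45)][lons.index(10)]      # Level 1 selects the datum
datum[COMPONENTS.index("v")] = 12.5                 # Level 0 selects the component
```

Storing three-dimensional arrays instead would simply move pressure from the dictionary key into the nested array, which is the sense in which the level of a dimension depends on how the data are stored.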

This division into Level 1 and Level 2 dimensions also allows more flexibility in storing the data: as the Level 2 dimensions change, the range of the Level 1 dimensions may change as well. For example, at different pressure levels, our latitude-longitude arrays of wind measurements may be of different sizes.
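As a rough sketch of this flexibility, the latitude grid below depends on the pressure level. The pressure values and latitude ranges are invented for illustration only.

```python
# Hypothetical latitude coverage (degrees) for each pressure level (hPa).
lat_range_by_pressure = {
    1000.0: (-90, 90),   # full coverage
    10.0: (-60, 60),     # assumed reduced coverage at this level
}
STEP = 5  # grid spacing in degrees

def latitude_grid(pressure):
    """Return the latitude grid points valid at the given pressure level."""
    low, high = lat_range_by_pressure[pressure]
    return list(range(low, high + 1, STEP))

# The size of the Level 1 dimension changes with the Level 2 dimension.
sizes = {p: len(latitude_grid(p)) for p in lat_range_by_pressure}
```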

Finally, we also consider coordinates over which the data have been averaged in some way. While not used as indices into a data array, such dimensions are nonetheless extremely important in understanding the nature of the data; we refer to these as the Level 3 dimensions. Continuing with our example, we may average the wind components over a month. We could indicate this averaging with Level 3 dimensions that identify each day of the month which contributed data to the average.
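One way to picture a Level 3 dimension is to keep, alongside the averaged value, the list of days that contributed to it. The daily values and the dictionary layout below are assumptions made for the sake of the example.

```python
# Assumed daily values of one wind component at a single grid point;
# None marks a day for which no data were available.
daily = {1: 4.0, 2: 6.0, 3: None, 4: 8.0}

# The Level 3 dimension records which days entered the average.
contributing_days = [day for day, value in sorted(daily.items()) if value is not None]
mean = sum(daily[day] for day in contributing_days) / len(contributing_days)

averaged = {"value": mean, "level3_days": contributing_days}
```

The days are not used to index the averaged value; they only document what the average represents.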

By dimensional order, we mean the number of dimensions at a given dimensional level. Using our (non-averaged) wind vector example above, the orders of the Level 0, Level 1, and Level 2 dimensions would be: three (for the East-West, North-South, and vertical components), two (for latitude and longitude), and two (for pressure and time), respectively.

The actual values along each dimension where the data are defined (not the data values themselves) we call the grid points of that dimension. For our wind vectors, the longitude dimension might be defined at every 5 degrees from 0 E to 355 E; there would be 72 grid points along this dimension. The latitudes vary from 90 S to 90 N by 5 degrees; this implies 37 grid points along the latitude dimension.

The rank of a dimension is the number of grid points along that dimension.
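The order, grid point, and rank definitions can be checked against the wind example with a short calculation. The dictionary names are ours, and representing southern latitudes as negative numbers is an assumed convention.

```python
# Dimensions at each level for the (non-averaged) wind example.
level_dimensions = {
    0: ["east-west", "north-south", "vertical"],
    1: ["longitude", "latitude"],
    2: ["pressure", "time"],
}

# Dimensional order: the number of dimensions at each level.
order = {level: len(names) for level, names in level_dimensions.items()}

# Grid points: the coordinate values at which the data are defined.
grid_points = {
    "longitude": list(range(0, 360, 5)),    # 0 E to 355 E
    "latitude": list(range(-90, 91, 5)),    # 90 S to 90 N
}

# Rank: the number of grid points along each dimension.
rank = {name: len(points) for name, points in grid_points.items()}
```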

To fully describe the data with all their components and dimensions, we need other information which we refer to as metadata (data about the data). Metadata will identify the quantities and associated units that the data and dimensions represent. We also allow for the use of metadata in adding comment-type information to the data, as well as an audit trail; information on processing, packing and compression; and information about the physical form in which the data are stored.

In order to reduce the physical storage required by the metadata, to make it more machine-understandable, and to reduce the language dependencies that text comments impose, much of the metadata is stored in the form of integer codes. Some of the codes are centrally defined and uniform across all data sites; these codes are described in Section A.1. Other codes can be locally defined at a given site, and these are described in Section A.2. While the codes and their meanings are defined as part of this data format standard, we do not prescribe the form that the translation mechanism between these codes and their human-readable equivalents must take, although one current implementation based on using lookup tables is outlined in Appendix B.
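A lookup-table translation might look like the following sketch. Every code number and description here is invented for illustration; the real codes are those of Sections A.1 and A.2, and the rule of consulting the central table before the local one is our assumption, not a requirement of the standard.

```python
# Hypothetical centrally defined codes (uniform across all sites).
CENTRAL_CODES = {1: "wind velocity", 2: "pressure"}

# Hypothetical codes defined locally at one site.
LOCAL_CODES = {1001: "site-specific quality flag"}

def describe(code, local_codes=LOCAL_CODES):
    """Translate an integer code into human-readable text."""
    if code in CENTRAL_CODES:
        return CENTRAL_CODES[code]
    return local_codes.get(code, f"unknown code {code}")
```

For instance, `describe(1)` returns the central description, while a code found in neither table is reported as unknown rather than raising an error.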

Processing information is metadata which describes how the data were processed. For example, if a day of our wind data was missing, we might decide to interpolate in time to fill in that day's data. The processing information would then consist of a code indicating time interpolation, together with the starting and ending days of the interpolation.
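The interpolation example could be recorded as in the sketch below. The code value 42, the field names, and the reading of the starting and ending days as the two days that bracket the gap are all assumptions.

```python
from typing import NamedTuple

TIME_INTERPOLATION = 42  # hypothetical integer code for "interpolated in time"

class ProcessingInfo(NamedTuple):
    code: int        # what kind of processing was applied
    start_day: int   # first day used in the interpolation
    end_day: int     # last day used in the interpolation

# Assumed daily values of one wind component; day 15 is missing.
series = {14: 10.0, 15: None, 16: 14.0}

# Fill the gap by linear interpolation between the neighbouring days.
gap = 15
before, after = gap - 1, gap + 1
weight = (gap - before) / (after - before)
series[gap] = series[before] + weight * (series[after] - series[before])

# Tag the affected section of the data with the processing information.
info = ProcessingInfo(code=TIME_INTERPOLATION, start_day=before, end_day=after)
```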

Auxiliary information is information which refers to or supplements the data. While processing information should be thought of as labels or tags applied to sections of the data, auxiliary information is more like a group of secondary variables: they could not stand on their own as data, but by their nature refer to the actual data. Examples would include instrument status words and uncertainties (error bars) in the data.

Packing information provides for the reduction of floating-point data to (usually smaller) integers by some scaling operation. This is different from compression, which uses an algorithm (such as Lempel-Ziv) to reduce the data to a smaller, encoded sequence of bytes. Data can be both packed and compressed. For example, we may take our wind measurements and scale them at each pressure level to more compact integers. We could then use a compression routine to compress this packed data into a byte array whose length is determined by the bit patterns in the data. To read the data, we would first uncompress it and then unpack it using the metadata provided in the specification of the winds.
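The difference between packing and compression, and the order in which they are undone, can be sketched with Python's standard `struct` and `zlib` modules. The wind values, the scale and offset, and the choice of 16-bit integers are assumptions; `zlib` stands in here for whatever compression routine is actually used.

```python
import struct
import zlib

winds = [3.25, -12.5, 0.0, 27.75]   # assumed wind values in m/s

# Packing: scale the floats to 16-bit integers. The scale and offset
# are the packing information that would be kept in the metadata.
scale, offset = 0.01, 0.0
packed = [round((w - offset) / scale) for w in winds]
raw = struct.pack(f"<{len(packed)}h", *packed)   # little-endian int16

# Compression: zlib's DEFLATE combines LZ77 (a Lempel-Ziv method) with
# Huffman coding. For an input this small the result may not be shorter.
compressed = zlib.compress(raw)

# Reading reverses the order: uncompress first, then unpack.
restored_ints = struct.unpack(f"<{len(packed)}h", zlib.decompress(compressed))
restored = [i * scale + offset for i in restored_ints]
```

Packing is lossy to within half the scale factor, whereas the compression step returns the packed bytes exactly.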

The data, the specification of their dimensions, and their metadata are collectively referred to as a data object.

Finally, at the highest level of organization, we define a dataset as a collection of data objects. Note that we avoid the use of the term "file," since we wish the proposed standard to be applicable to data storage and access entities other than simple files on disk.
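As a loose illustration of these last two definitions, a data object can be modelled as a container with three parts, and a dataset as a collection of such containers. The class, its field names, and the sample values are hypothetical and say nothing about the physical layout.

```python
from typing import NamedTuple

class DataObject(NamedTuple):
    data: list          # the data values themselves
    dimensions: dict    # specification of the dimensions, keyed by level
    metadata: dict      # quantities, units, comments, processing, packing

wind = DataObject(
    data=[[1.5, -0.5, 0.0]],   # one assumed 3-component wind vector
    dimensions={
        0: ["east-west", "north-south", "vertical"],
        1: ["longitude", "latitude"],
        2: ["pressure", "time"],
    },
    metadata={"quantity": "wind", "units": "m/s"},
)

temperature = DataObject(
    data=[250.0],              # one assumed scalar value
    dimensions={1: ["longitude", "latitude"], 2: ["pressure", "time"]},
    metadata={"quantity": "temperature", "units": "K"},
)

# A dataset is a collection of data objects, however it is stored.
dataset = [wind, temperature]
```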


Eric Nash 2003-09-25