The task of managing the significantly larger satellite data sets will raise a number of hardware and software issues that must be addressed [sto93]. The most significant hardware issue is how to store the data. In the short term, it would not be economically feasible to build sufficient physical storage utilising magnetic disks; the cost is too high. Instead, a massive tertiary storage device based on tape robots is required, providing at least an order of magnitude cost saving for the same capacity. With existing tape robots it is now possible to construct a hierarchical file system (HFS) using disk and tape storage capable of handling many petabytes of data. Consequently, by the time the new, massive satellite data sets become available over the next few years, the major hardware issues should have been addressed. However, research into the software issues is urgently required, or the potential software applications will not be realised.
The primary aim of the proposed research is to investigate
techniques that will accurately predict and/or describe the data required
by an application manipulating satellite data so that it can be supplied
by a file system in a timely manner. This will involve an investigation
of a HFS that can be explicitly informed of an application's data requirements
and then monitor and adapt to the true resource usage. The result of the
proposed research will be new file system technology that can effectively
manage massive satellite data sets. This significant contribution should
become available to software developers in time to support the full range
of potential uses of earth observation data.
In many applications requiring a HFS the data sets
being manipulated are very large but regular. For example, a HFS used to
store many terabytes of satellite data may have only a few megabytes of
metadata, that is, descriptions of the very large satellite images. Using
the metadata it may be possible to identify the areas of images required by
specific applications and, over time, to generate selective copies of that
data [bos92, fla88]. More importantly, the regularity of the data could
be exposed through the HFS to the user application level. Applications
could then explicitly describe the data they require well in advance, allowing
the HFS time to retrieve exactly what is required.
The proposed research will investigate mechanisms
for predicting the data needs of an application manipulating satellite
data, migrating the necessary data from tape to disk in advance of its
use and adapting to the actual use of the data by the application.
The recorded access patterns will be classified in
terms of file access and in terms of data access. File access patterns
will give some baseline measures that can be used to evaluate experiments
with HFS implementation techniques. Data access patterns will provide some
initial insight on how to specify an application's data requirements in
terms of the satellite data. This requires some prior knowledge of the
format of the satellite data including the compression techniques used,
time ordering of images, rectification techniques used to synthesise images
of interest and the proportion of raw data actually required.
With commercial off-the-shelf applications it will
not be possible to modify the applications, so any explicit requests for
data must be made just prior to execution time. Where application sources
are available, it may be possible to include explicit requests at the start
of major phases of the application. However, if the explicit requests are
not made until the data is required, there may not be enough time for the
HFS to reorganise itself and supply the data in a timely manner.
Finally, explicit requests for data made by an application
may not be accurate. In many cases the data of interest may vary as a user
interacts with an application. In these cases it will be necessary to vary
the requests made or even cancel some of them. Given the scale of the data
involved, it is essential to inform the HFS when data is no longer required
so that it does not waste valuable resources fetching the data.
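The need to vary or cancel requests can be illustrated with a minimal sketch of a pending-request table. The class name RequestQueue, its methods and the string descriptions are hypothetical and are not part of any existing HFS interface; the sketch only assumes that a request can still be revised or withdrawn until the HFS starts fetching it.

```python
class RequestQueue:
    """Hypothetical table of advance requests that can be revised before they are served."""

    def __init__(self):
        self._pending = {}   # request id -> description of the data wanted
        self._next_id = 0

    def submit(self, description):
        """Register a request and return an id the application can use later."""
        request_id = self._next_id
        self._next_id += 1
        self._pending[request_id] = description
        return request_id

    def update(self, request_id, description):
        """Replace the description of a pending request; False if already served."""
        if request_id not in self._pending:
            return False
        self._pending[request_id] = description
        return True

    def cancel(self, request_id):
        """Withdraw a pending request so no resources are spent fetching it."""
        if request_id not in self._pending:
            return False
        del self._pending[request_id]
        return True

    def next_to_fetch(self):
        """Remove and return the oldest pending (id, description), or None."""
        if not self._pending:
            return None
        request_id = next(iter(self._pending))  # dicts keep insertion order
        return request_id, self._pending.pop(request_id)
```

In this sketch a cancelled request simply never reaches next_to_fetch, so no tape mount is scheduled on its behalf. Cancelling a fetch that is already in progress is a separate problem that the sketch does not address.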
This phase of the project will investigate the scope
for other uses of a HFS in those cases where explicit requests for data
are made well in advance of actual use. This situation may arise where
the applications are run in batch mode, for example when a job is submitted
to the system and the results are not required until the next day. A supercomputer
installation supporting both a batch system and interactive users needs
to be able to trade the throughput requirements of the batch system against
the response time needs of the interactive users.
File System Issues
Identifying the data required by an application can
be relatively easy; however, retrieving the data from a HFS in a timely
manner can be very difficult. For example, the data required in generating
a next day weather forecast may be easy to identify but the forecast will
be useless if it takes more than a day to retrieve the data. In the few
existing HFS implementations [coy93, gol95, kim96, lsc97, sto93], the usual
approach to this problem is to employ the traditional file system technique
of prefetching data based on physical locality. That is, if a data block
is read from disk then assume the next data block on disk is also required.
This may not be appropriate in a HFS where the data of interest may be
distributed throughout a time sequenced set of very large satellite images
many of which may be on tape rather than disk [pat97].
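The limitation of prefetching on physical locality can be seen in a minimal sketch of a read-ahead block cache. The class ReadAheadCache and the dictionary standing in for the storage device are hypothetical, and the sketch is not taken from any of the HFS implementations cited above.

```python
from collections import OrderedDict


class ReadAheadCache:
    """Hypothetical block cache that prefetches the physically next block on every read."""

    def __init__(self, backing_store, capacity=4):
        self.backing_store = backing_store  # assumption: dict of block number -> data
        self.capacity = capacity
        self.cache = OrderedDict()          # block number -> data, least recently used first
        self.misses = 0                     # reads that had to wait for the backing store

    def _load(self, block_no):
        # Bring a block into the cache, evicting the least recently used block if full.
        if block_no in self.cache:
            self.cache.move_to_end(block_no)
            return
        if block_no not in self.backing_store:
            return
        self.cache[block_no] = self.backing_store[block_no]
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

    def read(self, block_no):
        # Raises KeyError if the block does not exist in the backing store.
        if block_no not in self.cache:
            self.misses += 1
        self._load(block_no)
        data = self.cache[block_no]
        self._load(block_no + 1)  # physical locality: assume the next block is wanted
        return data
```

With this sketch, reading blocks 0 to 9 in order incurs a single miss, whereas reading every tenth block, as might happen when the same area is extracted from each image in a time sequence, misses on every read.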
Research Plan, methods and techniques
The proposed research will be conducted
in the following steps:
Implementation of a prototype HFS
To provide a prototype HFS with which to conduct experiments,
a simulated Network File System (NFS) server will be used [nfs95, pat98]. This will
permit any existing application to use the HFS without modification.
It can also be used over a network without further work. The initial implementation
will transparently move entire files between a normal file system and a
large tape silo. As it runs, it will record details of application access
patterns, clustering of requests, delays due to retrieving files from the
tape silo, the size of requests, etc. The ideal approach would be to implement
a full HFS and integrate it with a host operating system. However, this
level of sophistication is not essential to the project since by implementing
an NFS server the prototype provides many of the same benefits.
Measurements of access patterns from real applications
Given an implementation of a prototype HFS, a number
of existing applications will be measured.
During this phase of the project, access patterns will be measured to give
a baseline for evaluating the effectiveness of the subsequent experiments.
The simulated NFS front-end to the prototype HFS will provide a convenient
software layer that can be instrumented without unduly interfering with
the applications being measured.
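As an illustration of the kind of instrumentation intended, the following minimal sketch records each read seen by the front-end and derives simple per-file measures. The names AccessRecord and AccessLog, and the fields recorded, are assumptions made for the example rather than a description of the prototype.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AccessRecord:
    path: str        # file requested through the front-end
    offset: int      # byte offset of the read
    size: int        # number of bytes requested
    delay_s: float   # time spent waiting, e.g. for retrieval from the tape silo


class AccessLog:
    """Hypothetical in-memory log of the reads observed by the front-end."""

    def __init__(self):
        self.records = []

    def record(self, path, offset, size, delay_s=0.0):
        self.records.append(AccessRecord(path, offset, size, delay_s))

    def per_file_summary(self):
        """Aggregate request count, bytes read and total delay for each file."""
        summary = defaultdict(lambda: {"requests": 0, "bytes": 0, "delay_s": 0.0})
        for r in self.records:
            entry = summary[r.path]
            entry["requests"] += 1
            entry["bytes"] += r.size
            entry["delay_s"] += r.delay_s
        return dict(summary)

    def sequential_fraction(self, path):
        """Fraction of reads of `path` that begin where the previous read ended."""
        reads = [r for r in self.records if r.path == path]
        if len(reads) < 2:
            return 0.0
        hits = sum(
            1 for prev, cur in zip(reads, reads[1:])
            if cur.offset == prev.offset + prev.size
        )
        return hits / (len(reads) - 1)
```

The per-file summary corresponds to the file access view, while a measure such as the sequential fraction gives a first indication of how an application moves through the data within a file.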
An investigation of automatic adaptation to application
behaviour
Given a basic understanding of applications that manipulate
satellite data, a number of techniques will be investigated to automatically
adapt to an application's behaviour and prefetch data from tertiary storage.
One strategy is to adopt the traditional operating system approach of prefetching
data from tertiary storage based on physical locality. An alternative strategy
may be to identify geographic areas from satellite images and prefetch
the next occurrence of the geographic area in a time sequence. This phase
of the project should identify the major strengths and weaknesses of these
automatic adaptation strategies when dealing with satellite data. This phase
can also introduce special purpose instrumentation into the simulated NFS
front-end of the prototype HFS.
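The alternative strategy can be sketched as follows, assuming that each image has already been divided into tiles labelled with a geographic area and a position in the time sequence. The Tile type, the area labels and the catalogue list are hypothetical; how areas are actually identified is one of the questions the research must answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    area: str        # hypothetical identifier for a geographic area
    timestamp: int   # position of the source image in the time sequence


def next_occurrence(catalogue, tile):
    """Return the earliest tile of the same area later than `tile`, or None."""
    later = [
        t for t in catalogue
        if t.area == tile.area and t.timestamp > tile.timestamp
    ]
    return min(later, key=lambda t: t.timestamp, default=None)


def prefetch_plan(catalogue, accessed):
    """List the tiles to stage from tape, given the tiles accessed so far."""
    plan = []
    for tile in accessed:
        candidate = next_occurrence(catalogue, tile)
        if candidate is None or candidate in accessed or candidate in plan:
            continue  # nothing later, already read, or already planned
        plan.append(candidate)
    return plan
```

For an application that has read one area at times 0 and 1, the plan contains only the tile for that area at time 2, regardless of where that tile is physically stored.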
An investigation of responding to explicit advanced
requests for data
The next phase of the project will look at extending
the HFS interface to include a mechanism for explicitly requesting data
in advance. This will involve modifying the simulated NFS front-end so
that it can respond to additional remote procedure calls. In this way,
the normal NFS service can be maintained whilst providing direct communication
with the HFS implementation. A number of issues must be addressed here
such as:
The description of the required data will necessarily
be based on the regular structure of the satellite data. However, it is
not clear how this should be intimated to the HFS. For example, the HFS
may choose to generate multiple, read-only copies of the satellite data
set, some of which only contain certain areas of interest. This may be
performed in response to a number of previous requests for a particular
geographic area. How these multiple copies of the satellite data are managed
may well impact on how areas of interest are described.
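One possible form for such a description, given purely as a sketch, is a geographic bounding box combined with a time range, resolved against the metadata to find the images involved. The types ImageMeta and DataRequest, the coordinate fields and the resolve function are all assumptions made for illustration; the appropriate form of description remains an open question for this phase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageMeta:
    """Hypothetical metadata record for one satellite image."""
    name: str
    timestamp: int
    west: float
    south: float
    east: float
    north: float


@dataclass(frozen=True)
class DataRequest:
    """Hypothetical advance request: an area of interest over a time range."""
    west: float
    south: float
    east: float
    north: float
    start: int   # first timestamp of interest
    end: int     # last timestamp of interest, inclusive


def overlaps(meta, request):
    """True if the image footprint and the requested area intersect."""
    return (meta.west < request.east and request.west < meta.east
            and meta.south < request.north and request.south < meta.north)


def resolve(request, metadata):
    """Names of the images needed to satisfy the request, in time order."""
    ordered = sorted(metadata, key=lambda m: m.timestamp)
    return [
        m.name for m in ordered
        if request.start <= m.timestamp <= request.end and overlaps(m, request)
    ]
```

Because the request is expressed in terms of area and time rather than file names, the HFS remains free to satisfy it from whichever copy of the data is cheapest to reach, including a selective copy containing only the area of interest.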
Integrating automatic adaptation and explicit requests
for data
This phase of the project will investigate how to monitor
and adapt to an application's behaviour with respect to its explicit requests
for data. If requests for data are made well in advance it may not be appropriate
to immediately act on the advice. Ideally, the HFS will supply data just
in time. Therefore, the ideal HFS should be able to adapt to service other
applications' needs when explicitly requested data is not immediately required.
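Supplying data just in time can be sketched as backward scheduling: each fetch is placed as late as possible while still finishing by the time the data is needed. The function just_in_time_plan, the single tape drive and the known fetch times are all simplifying assumptions made for the example.

```python
def just_in_time_plan(requests, now=0):
    """Schedule fetches as late as possible on a single drive.

    `requests` is an iterable of (name, needed_at, fetch_time) tuples, where
    fetch_time is an estimate assumed to be known in advance.
    Returns (plan, feasible): plan is a list of (name, start, finish) in time
    order, and feasible is False if the first fetch would have to start
    before `now`.
    """
    plan = []
    next_start = float("inf")  # start time of the fetch scheduled just after this one
    # Work backwards from the latest deadline so that fetches never overlap.
    for name, needed_at, fetch_time in sorted(requests, key=lambda r: r[1], reverse=True):
        finish = min(needed_at, next_start)
        start = finish - fetch_time
        plan.append((name, start, finish))
        next_start = start
    plan.reverse()
    feasible = not plan or plan[0][1] >= now
    return plan, feasible
```

The interval between `now` and the start of the first planned fetch is time in which the drive is not needed by these requests and could be used to service other applications.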
References
[bos92] Bosman, O., Fletchar, P. & Tsui, K.
K-Tiling: A Structure to Support Regular Ordering
and Mapping of Image Data. Proc. of the Australian Pattern Recognition
Society Workshop on Two and Three Dimensional Spatial Data: Representation
and Standards, Perth, Australia, 1992.
[coy93] Coyne, R.A., Hulen, H. & Watson, R.W.
The High Performance Storage System. Proc. of SuperComputing’93,
Portland, Oregon, November 1993.
[esa95] European Space Agency.
General Study Earth Observation - Modelling of the
Distributed User Services. Final Report, DUS-MGT-006-TR-1.0, European Space
Agency, 4 December 1995.
[fla88] Flanders, P.M.
DAP Series - Parallel Data Transforms. Active Memory
Technology (formerly ICL DAP Division), 1988.
[gol95] Goldick, J.S., Benninger, K., Kirkby, C.,
Maher, C. & Zumach, B.
Multi-resident AFS: An Adventure in Mass-Storage.
Proc. of USENIX 1995 Winter Technical Conference on Unix and Advanced Computing
Systems, New Orleans, Louisiana, January 1995.
[kim96] Kim, F.
UniTree: A Closer Look at Solving the Data Storage
Problem. UniTree Software Inc. White Paper, 1996.
[lsc97] LSC, Inc.
SAM-FS, 1997, http://www.lsci.com/lsci/products/samfs.htm.
[nfs95] Sun Microsystems.
The NFS Distributed File Service. NFS White Paper,
March 1995, http://www.sun.com/solaris/wp-nfs.
[pat97] Patel, J., Yu, J., Kabra, N., Tufte, K.,
Nag, B., Burger, J., Hall, N., Ramasamy, K., Lueder, R., Ellman, C., Kupsch,
J., Guo, S., Larson, J., DeWitt, D. & Naughton, J.
Building a Scalable Geo-Spatial DBMS: Technology,
Implementation and Evaluation. Department of Computer Sciences, University
of Wisconsin-Madison, USA, 1997.
[pat98] Patten, C.J., Vaughan, F., Hawick, K. &
Brown, A.L.
DWorFS: File System Support for Legacy Applications
in DISCWorld. Proc. of the Fifth IDEA Workshop, Fremantle, Western Australia,
Australia, 7-10 February 1998, pp. 30-33.
[sto93] Stonebraker, M.
Sequoia 2000 - A Reflection on the First Three Years.
Technical Report, Department of Computer Science, University of California
at Berkeley, California, USA, 1993.