[GO-ESSP] gridded data management systems

Jon Blower jdb at mail.nerc-essc.ac.uk
Thu Nov 25 13:55:51 MST 2004


Hi Steve,

Thanks very much for this.  We'll certainly check out the Ferret Data
Server, it looks very interesting.  To answer your questions:

I was using "flat files" to mean "anything but a database".  Of
course, I appreciate that netCDF and HDF are rather more sophisticated than
"raw" IEEE files and currently we are working with file formats such as
these (i.e. netCDF etc).  As I see it, the advantages that databases might
have over netCDF/HDF and their APIs are as follows (I'm happy to be
corrected on this, I don't know all the details):

1)  Caching of frequently-used data chunks, improving performance in many
cases (of course, operating systems and hard disks also have their own
caching strategies outside of the database)
2)  Automatic splitting of the data into tiles so that data subsets can be
retrieved efficiently (did you say that HDF does tiling too?)
3)  Intelligent, automatic storage of data at different resolutions
("pyramid" scheme).  I would expect databases using such a scheme to be much
faster than using the netCDF/HDF APIs at retrieving data at different
resolutions.  We have found that extracting, say, every third data point
from an HDF5 file is very slow.
4)  Better performance for slicing data in all four dimensions.  In our
experience, data sets based on netCDF/HDF tend to have one file per timestep
(this is how they are typically output from models).  This means that to
extract a timeseries of 100 points, we need to open 100 files, taking one
point out of each.  The equivalent operation is much faster with the
database.
5)  Ability to store vector data (e.g. observations at points or along ship
tracks) alongside raster (gridded) data.
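
To make item 3 concrete, here is a minimal pure-Python sketch of a "pyramid" store next to a plain strided read.  It is only an illustration of the general idea, not how the Grid DataBlade, netCDF or HDF5 actually work; the function names (coarsen, build_pyramid, subsample) and the 8x8 test grid are made up for the example, and block averaging is just one possible way to build the coarser levels.

```python
# Illustrative sketch only: hypothetical helper names, toy in-memory grid.

def coarsen(grid, factor):
    """Return a coarser grid by averaging factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [grid[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def build_pyramid(grid, factor=2, levels=3):
    """Precompute successively coarser copies of the grid (level 0 = full)."""
    pyramid = [grid]
    for _ in range(levels - 1):
        pyramid.append(coarsen(pyramid[-1], factor))
    return pyramid


def subsample(grid, step):
    """Strided read: every step-th point, taken from the full-resolution grid."""
    return [row[::step] for row in grid[::step]]


# Toy 8x8 grid whose values encode their own position.
full = [[float(r * 8 + c) for c in range(8)] for r in range(8)]

pyramid = build_pyramid(full, factor=2, levels=3)
shapes = [(len(g), len(g[0])) for g in pyramid]
print(shapes)                  # [(8, 8), (4, 4), (2, 2)]
print(subsample(full, 3)[0])   # [0.0, 3.0, 6.0]
```

The point of the pyramid is that a request for coarse data reads a small precomputed level, whereas the strided read has to walk the full-resolution data every time.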

The main downsides of most database-centred approaches are that the systems
are generally not open-source and are often expensive.

We haven't yet done any formal benchmarking of our database's performance,
but from simply playing with it, it certainly seems like it performs well,
particularly for the cases mentioned above.  As far as network access goes,
our intention would be to provide access via a Web Services interface,
and/or OPeNDAP.
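
As a possible starting point for benchmarking the timeseries case (item 4 above), the following Python sketch writes the same small 3-D field in two layouts: one raw binary file per timestep, and a single file in which time varies fastest.  It then extracts a 100-point timeseries from each and counts file opens.  Everything here is an assumption made for illustration: the raw little-endian doubles stand in for real netCDF/HDF files or database tiles, and the sizes, file names and helper functions are hypothetical.

```python
import os
import struct
import tempfile

NT, NY, NX = 100, 4, 5        # illustrative sizes: 100 timesteps, 4x5 grid
opens = {"count": 0}          # simple counter of read-side file opens


def counted_open(path, mode):
    opens["count"] += 1
    return open(path, mode)


def value(t, y, x):
    """Synthetic data value that encodes its own indices."""
    return float(t * 1000 + y * 10 + x)


def write_per_timestep(dirname):
    """One file per timestep, as typically output by a model run."""
    for t in range(NT):
        with open(os.path.join(dirname, f"step_{t:03d}.bin"), "wb") as f:
            for y in range(NY):
                for x in range(NX):
                    f.write(struct.pack("<d", value(t, y, x)))


def write_time_major(path):
    """Single file in which time varies fastest for each grid point."""
    with open(path, "wb") as f:
        for y in range(NY):
            for x in range(NX):
                for t in range(NT):
                    f.write(struct.pack("<d", value(t, y, x)))


def series_per_timestep(dirname, y, x):
    out = []
    for t in range(NT):
        path = os.path.join(dirname, f"step_{t:03d}.bin")
        with counted_open(path, "rb") as f:
            f.seek((y * NX + x) * 8)          # 8 bytes per double
            out.append(struct.unpack("<d", f.read(8))[0])
    return out


def series_time_major(path, y, x):
    with counted_open(path, "rb") as f:
        f.seek((y * NX + x) * NT * 8)
        return list(struct.unpack(f"<{NT}d", f.read(NT * 8)))


with tempfile.TemporaryDirectory() as tmp:
    write_per_timestep(tmp)
    single = os.path.join(tmp, "time_major.bin")
    write_time_major(single)

    opens["count"] = 0
    a = series_per_timestep(tmp, 2, 3)
    opens_a = opens["count"]

    opens["count"] = 0
    b = series_time_major(single, 2, 3)
    opens_b = opens["count"]

print(opens_a, opens_b, a == b)   # 100 1 True
```

Wrapping the two extraction calls in time.perf_counter() would turn this into a crude timing test, although a meaningful benchmark would need realistic file sizes and would have to account for operating-system caching.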

You mention GIS - my understanding of GIS is that it only really understands
"two and a half" dimensions, i.e. it understands lat-long, and time is not a
true dimension but an attribute of a data segment.  I understand that GIS
struggles with the third dimension (at least, standard OpenGIS does), being
concerned mostly with surface features.  So at the moment we do not intend
to link with a GIS system.  I understand that there are projects that are
attempting to resolve this issue and create GIS systems suitable for fully
4-D data.

In summary, I guess I am trying to find out whether there are any existing
open-source systems that can offer the same or similar functionality and
performance to the database system we have under test.  If not, perhaps
there is a system that can be adapted and developed by the community - I'm
sure that many groups must have considered precisely these problems and
gotten some way to a solution.  There may well be a way of creating a system
that uses many of the tricks of a database, but is not actually itself
connected to any particular DBMS.

Regards,
Jon
  -----Original Message-----
  From: Steve Hankin [mailto:Steven.C.Hankin at noaa.gov]
  Sent: 24 November 2004 17:32
  To: Jon Blower
  Cc: go-essp at ucar.edu; Adit Santokhee; Keith Haines
  Subject: Re: [GO-ESSP] gridded data management systems


  Hi Jon,
  When you refer to "standard flat files" are you including formats like
netCDF and HDF5 under that title?  This is often a source of terminology
confusion as "flat files" sometimes refers to "anything but a database".
Others regard n-dimensional, multi-variate data standards like netCDF and
HDF to be alternatives to "flat" IEEE files.

  The question that you pose is essentially to weigh the pros and cons of
managing your data with a commercial database that has been enhanced to
handle grids, or to handle your data with netCDF (which in the next version
will merge with HDF5 to handle compression, tiles, etc.) and the free netCDF
utilities.  (Presumably from-scratch development with IEEE binary files is
not the way to go.)  You mentioned some down-sides to the commercial
software route (cost, "proprietariness" of software, dependence on a single
supplier,...).   Are the advantages of the database approach sufficient to
outweigh these costs?   You have also not mentioned network access to the
data.  Is it a requirement for the data to be OPeNDAP accessible?  Or
alternatively, is access from enterprise GIS systems at the center of your
bullseye?

    a. Items 1-2 are trivial for either system.  Comparative performance
... do you have any data?  The Barrodale product is new and one-of-a-kind.
It would be interesting to see some benchmarks comparing it to netCDF and
HDF5.
    b. Items 3-4 can be handled with the new FDS (Ferret Data Server) and
probably the GDS server, as well.  Custom code may be required depending
upon the list of projections that is desired, but these are open
environments, where this can be added.   Item 3-4 capabilities are also
available and presumably well supported if your database is embedded in an
enterprise GIS framework.
    c. Item 5 is probably better handled in a database environment, though
it can also be handled (with some effort -- in various ways) in a Web
service environment based on OPeNDAP.
  Just bouncing around the ideas.  This community will be interested to hear
what further you learn.
      - steve

  ====================================

  Jon Blower wrote:

    Hi all,
    As some of you may know, we at the Reading e-Science Centre have been
    investigating some new ways to store and manage data from models of the
    oceans and atmosphere.  We have been looking at storing data in
    databases, rather than standard flat-file systems, and have over the
    last few months been evaluating IBM's Informix database with Barrodale
    Computing Services' Grid DataBlade plug-in (see
    http://www.resc.rdg.ac.uk/projects.php for more details).  Eventually
    this might form the back-end to our own data portal page
    (http://www.nerc-essc.ac.uk/godiva).

    We have found good and bad points about this system and are now
    wondering how to take things forward.  I have been considering the
    feasibility of writing (essentially from scratch) an intelligent
    storage/management application for gridded geospatial data.  The key
    features of this system would include:

    1) Data would be stored in a single format but can be extracted in a
    variety of formats
    2) Data could be sliced and subsetted in all possible ways (e.g.
    extraction of 1-D timeseries, 2-D areas, 3-D volumes/animations, 4-D
    data blocks) and extracted at different spatial and temporal resolutions
    3) Data could be stored on the original grid (including rotated grids)
    but extracted on the grid of the user's choice
    4) The necessary projection and interpolation would happen on the fly
    5) The system would allow complex queries to be made (e.g. "Give me all
    the times and locations at which the sea surface temperature was greater
    than 20 degC in the North Atlantic in June 2003")
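
The example query in item 5 can be sketched in a few lines of Python over an in-memory list of records.  This is only meant to show the shape of the filter (time window, bounding box, threshold); the record values are invented, the query() helper is hypothetical, and the latitude/longitude box used for "North Atlantic" is an assumption, since the post does not define one.

```python
from datetime import date

# Hypothetical records: (day, lat, lon, sst_degC).  Values are made up.
records = [
    (date(2003, 6, 1), 40.0, -30.0, 21.5),
    (date(2003, 6, 1), 60.0, -20.0, 11.0),    # too cold
    (date(2003, 6, 15), 35.0, -50.0, 23.1),
    (date(2003, 7, 1), 40.0, -30.0, 24.0),    # outside June 2003
    (date(2003, 6, 10), -10.0, -25.0, 26.0),  # south of the assumed box
]

# Assumed bounding box for "North Atlantic" (not defined in the post).
LAT_MIN, LAT_MAX = 0.0, 65.0
LON_MIN, LON_MAX = -80.0, 0.0


def query(records, threshold, start, end):
    """Return (day, lat, lon) where SST exceeds threshold in the box and window."""
    return [
        (day, lat, lon)
        for day, lat, lon, sst in records
        if start <= day <= end
        and LAT_MIN <= lat <= LAT_MAX
        and LON_MIN <= lon <= LON_MAX
        and sst > threshold
    ]


hits = query(records, 20.0, date(2003, 6, 1), date(2003, 6, 30))
for day, lat, lon in hits:
    print(day.isoformat(), lat, lon)
```

A real system would push this filter down to the storage layer (tiles or indexes) rather than scanning every record, which is where the design effort would go.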

    The systems we have looked at so far get us part, but not all, of the
    way there.  Furthermore, the system currently under evaluation
    (Informix/Grid DataBlade) is closed-source, commercial software so we
    can't modify it ourselves.  However, such database-based systems have
    some key advantages over standard flat files, notably intelligent tiling
    and caching, giving very fast retrieval of data.

    I was wondering whether this community would welcome an effort to create
    an open-source data management/storage system for geospatial data,
    perhaps as a plug-in to an open-source DBMS such as PostgreSQL.  I
    haven't found an existing project that answers our requirements, but
    please let me know if you know of anything (some packages seem to deal
    with geospatial data, but are not designed for _gridded_ data).  It
    seems that this could be of benefit to the GO-ESSP community,
    considering that any Earth System Portal must be backed by some kind of
    data store! ;-)

    This has been rather a long post, sorry!  Any suggestions or feedback
    would be very much appreciated.

    Best wishes,
    Jon

    --------------------------------------------------------------
    Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
    Technical Director         Tel: +44 118 378 8741 (ESSC)
    Reading e-Science Centre   Fax: +44 118 378 6413
    ESSC                       Email: jdb at mail.nerc-essc.ac.uk
    University of Reading
    3 Earley Gate
    Reading RG6 6AL, UK
    --------------------------------------------------------------

    _______________________________________________
    GO-ESSP mailing list
    GO-ESSP at ucar.edu
    http://mailman.ucar.edu/mailman/listinfo/go-essp

  --

  Steve Hankin, NOAA/PMEL -- Steven.C.Hankin at noaa.gov
  7600 Sand Point Way NE, Seattle, WA 98115-0070
  ph. (206) 526-6080, FAX (206) 526-6744
