<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2800.1476" name=GENERATOR></HEAD>
<BODY>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff size=2>Hi
Steve,</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff size=2>Thanks
very much for this. We'll certainly check out the Ferret Data Server, it
looks very interesting. To answer your questions:</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff size=2>I was
referring to "flat files" to mean "anything but a database". Of course, I
appreciate that netCDF and HDF are rather more sophisticated than "raw" IEEE
files, and we are currently working with formats of this kind. As I see
it, the advantages that databases might have over
netCDF/HDF and their APIs are as follows (I'm happy to be corrected on this, I
don't know all the details):</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2>1) Caching of frequently-used data chunks, improving performance in
many cases (of course, operating systems and hard disks also have their own
caching strategies outside of the database)</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2>2) Automatic splitting of the data into tiles so that data subsets
can be retrieved efficiently (did you say that HDF does tiling
too?)</FONT></SPAN></DIV>
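<DIV><FONT face=Arial color=#0000ff size=2>To make the tiling-plus-caching idea
concrete, here is a minimal sketch in plain Python (the tile size, the key
tuples and the load callback are all hypothetical, not taken from any
particular system): a store splits each axis into fixed-size tiles, works out
which tiles a requested slice touches, and keeps recently used tiles in a
small LRU cache.</FONT></DIV>

```python
from collections import OrderedDict

TILE = 64  # hypothetical tile edge length along one axis


def tiles_for_slice(start, stop):
    """Tile indices along one axis covered by the half-open range [start, stop)."""
    return range(start // TILE, (stop - 1) // TILE + 1)


class TileCache:
    """Tiny LRU cache of decoded tiles, keyed by a tile-index tuple."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._tiles = OrderedDict()

    def get(self, key, load):
        if key in self._tiles:
            self._tiles.move_to_end(key)       # mark as most recently used
            return self._tiles[key]
        tile = load(key)                       # cache miss: read tile from disk
        self._tiles[key] = tile
        if len(self._tiles) > self.capacity:
            self._tiles.popitem(last=False)    # evict least recently used tile
        return tile
```

<DIV><FONT face=Arial color=#0000ff size=2>With TILE = 64, a request for points
100-199 touches only tiles 1-3, so repeated nearby reads hit the cache rather
than the disk.</FONT></DIV>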
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2>3) Intelligent, automatic storage of data at different
resolutions ("pyramid" scheme). I would expect databases using such a
scheme to be much faster than using the netCDF/HDF APIs at retrieving data at
different resolutions. We have found extracting, say, every third data
point from an HDF5 file to be very slow.</FONT></SPAN></DIV>
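<DIV><FONT face=Arial color=#0000ff size=2>The "pyramid" idea can be sketched
in a few lines of plain Python (subsampling stands in for the block-averaging
a real store would do): coarser overview levels are precomputed once, and a
decimated read is then served from the matching level instead of striding
through the full-resolution data.</FONT></DIV>

```python
def build_pyramid(data, levels=3):
    """Precompute coarser overviews; each level keeps every second point.
    (A real store would average blocks; subsampling keeps the sketch short.)"""
    pyramid = [list(data)]
    for _ in range(levels):
        pyramid.append(pyramid[-1][::2])
    return pyramid


def read_decimated(pyramid, stride):
    """Serve data[::stride] from a precomputed level when the stride is a
    power of two; other strides would need resampling from a finer level."""
    level = stride.bit_length() - 1
    assert stride == 2 ** level, "sketch handles power-of-two strides only"
    level = min(level, len(pyramid) - 1)
    return pyramid[level][::stride // 2 ** level]
```

<DIV><FONT face=Arial color=#0000ff size=2>A stride-4 read then touches only a
quarter of the bytes, which is exactly the saving we fail to see when striding
through a full-resolution HDF5 dataset.</FONT></DIV>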
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2>4) Better performance for slicing data in all four
dimensions. In our experience, data sets based on netCDF/HDF tend to have
one file per timestep (this is how they are typically output from models).
This means that to extract a timeseries of 100 points, we need to open 100
files, taking one point out of each. The equivalent operation is much
faster with the database.</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2>5) Ability to store vector data (e.g. observations at points or
along ship tracks) alongside raster (gridded) data.</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=687144910-25112004></SPAN><FONT face=Arial><FONT
color=#0000ff><FONT size=2>T<SPAN class=687144910-25112004>he main downsides of
most database-centred approaches are that the systems are generally not
open-source and often expensive.</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT color=#0000ff><FONT size=2><SPAN
class=687144910-25112004></SPAN></FONT></FONT></FONT> </DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff size=2>We
haven't yet done any formal benchmarking of our database's performance, but in
informal use it certainly seems to perform well, particularly
for the cases mentioned above. As far as network access goes, our
intention would be to provide access via a Web Services interface, and/or
OPeNDAP.</FONT></SPAN></DIV>
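<DIV><FONT face=Arial color=#0000ff size=2>For the OPeNDAP route, subsetting
is expressed in the URL itself via a constraint expression, where each
dimension gets an inclusive [start:stride:stop] hyperslab. A small sketch
(the server, dataset and variable names are invented for
illustration):</FONT></DIV>

```python
def opendap_subset_url(base, variable, *ranges):
    """Build an OPeNDAP (DAP 2) constraint expression selecting a hyperslab.
    Each range is an inclusive (start, stride, stop) triple per dimension."""
    indexes = "".join("[%d:%d:%d]" % (a, s, b) for a, s, b in ranges)
    return "%s.dods?%s%s" % (base, variable, indexes)


# Hypothetical server and variable names, for illustration only:
url = opendap_subset_url(
    "http://example.org/dods/ocean_model", "temperature",
    (0, 1, 99),    # time: first 100 steps
    (5, 1, 5),     # a single depth level
    (40, 2, 80),   # every second latitude row from 40 to 80
    (0, 1, 359),   # all longitudes
)
```

<DIV><FONT face=Arial color=#0000ff size=2>The .dods suffix requests the
binary response; a .asc suffix would return the same subset as
ASCII.</FONT></DIV>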
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff size=2>You
mention GIS - my understanding is that GIS only really handles "two
and a half" dimensions: it understands lat-long, but time is not a true
dimension, merely an attribute of a data segment. I understand that GIS
struggles with the third dimension (at least, standard OpenGIS does), being
concerned mostly with surface features. So at the moment we do not intend
to link with a GIS system. I understand that there are projects that are
attempting to resolve this issue and create GIS systems suitable for fully 4-D
data.</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff size=2>In
summary, I guess I am trying to find out whether there are any existing
open-source systems that can offer the same or similar functionality and
performance to the database system we have under test. If not, perhaps
there is a system that can be adapted and developed by the community - I'm sure
that many groups must have considered precisely these problems and got some
way to a solution. There may well be a way of creating a system that uses
many of the tricks of a database, but is not actually itself connected to any
particular DBMS.</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2>Regards,</FONT></SPAN></DIV>
<DIV><SPAN class=687144910-25112004><FONT face=Arial color=#0000ff
size=2>Jon</FONT></SPAN></DIV>
<BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader dir=ltr align=left><FONT face=Tahoma
size=2>-----Original Message-----<BR><B>From:</B> Steve Hankin
[mailto:Steven.C.Hankin@noaa.gov]<BR><B>Sent:</B> 24 November 2004
17:32<BR><B>To:</B> Jon Blower<BR><B>Cc:</B> go-essp@ucar.edu; Adit Santokhee;
Keith Haines<BR><B>Subject:</B> Re: [GO-ESSP] gridded data management
systems<BR><BR></FONT></DIV>Hi Jon,
<P>When you refer to "standard flat files" are you including formats like
netCDF and HDF5 under that title? This is often a source of terminology
confusion as "flat files" sometimes refers to "anything but a database".
Others regard n-dimensional, multi-variate data standards like netCDF and HDF
as alternatives to "flat" IEEE files.
<P>The question that you pose is essentially to weigh the pros and cons of
managing your data with a commercial database that has been enhanced to handle
grids, or to handle your data with netCDF (which in the next version will
merge with HDF5 to handle compression, tiles, etc.) and the free netCDF
utilities. (Presumably from-scratch development with IEEE binary files
is not the way to go.) You mentioned some down-sides to the commercial
software route (cost, "proprietariness" of software, dependence on a single
supplier,...). Are the advantages of the database approach
sufficient to outweigh these costs? You have also not mentioned
network access to the data. Is it a requirement for the data to be
OPeNDAP accessible? Or alternatively, is access from enterprise GIS
systems at the center of your bullseye?
<UL>
<LI>Items 1-2 are trivial for either system. Comparative performance
... do you have any data? The Barrodale product is new and
one-of-a-kind. It would be interesting to see some benchmarks
comparing it to netCDF and HDF5.
<LI>Items 3-4 can be handled with the new FDS (Ferret Data Server) and
probably the GDS server, as well. Custom code may be required
depending upon the list of projections that is desired, but these are open
environments, where this can be added. Items 3-4 capabilities are
also available and presumably well supported if your database is embedded in
an enterprise GIS framework.
<LI>Item 5 is probably better handled in a database environment, though it
can also be handled (with some effort -- in various ways) in a Web service
environment based on OPeNDAP. </LI></UL>Just bouncing around the ideas.
This community will be interested to hear what further you learn.
<P> - steve
<P>====================================
<P>Jon Blower wrote:
<BLOCKQUOTE TYPE="CITE">Hi all,
<P>As some of you may know, we at the Reading e-Science Centre have been
<BR>investigating some new ways to store and manage data from models of the
<BR>oceans and atmosphere. We have been looking at storing data in
databases, <BR>rather than standard flat-file systems, and have over the
last few months <BR>been evaluating IBM's Informix database with Barrodale
Computing Services' <BR>Grid DataBlade plug-in (see <A
href="http://www.resc.rdg.ac.uk/projects.php">http://www.resc.rdg.ac.uk/projects.php</A>
for more <BR>details). Eventually this might form the back-end to our
own data portal <BR>page (<A
href="http://www.nerc-essc.ac.uk/godiva">http://www.nerc-essc.ac.uk/godiva</A>).
<P>We have found good and bad points about this system and are now wondering
<BR>how to take things forward. I have been considering the
feasibility of <BR>writing (essentially from scratch) an intelligent
storage/management <BR>application for gridded geospatial data. The
key features of this system <BR>would include:
<P>1) Data would be stored in a single format but could be extracted in a
variety <BR>of formats <BR>2) Data could be sliced and subsetted in all
possible ways (e.g. extraction <BR>of 1-D timeseries, 2-D areas, 3-D
volumes/animations, 4-D data blocks) and <BR>extracted at different spatial
and temporal resolutions <BR>3) Data could be stored on the original grid
(including rotated grids) but <BR>extracted on the grid of the user's choice
<BR>4) The necessary projection and interpolation would happen on the fly
<BR>5) The system would allow complex queries to be made (e.g. "Give me all
the <BR>times and locations at which the sea surface temperature was greater
than 20 <BR>degC in the North Atlantic in June 2003")
<P>The systems we have looked at so far get us part, but not all, of the way
<BR>there. Furthermore, the system currently under evaluation
(Informix/Grid <BR>DataBlade) is closed-source, commercial software so we
can't modify it <BR>ourselves. However, such database-based systems
have some key advantages <BR>over standard flat files, notably intelligent
tiling and caching, giving <BR>very fast retrieval of data.
<P>I was wondering whether this community would welcome an effort to create
an <BR>open-source data management/storage system for geospatial data,
perhaps as a <BR>plug-in to an open-source DBMS such as PostgreSQL. I
haven't found an <BR>existing project that answers our requirements, but
please let me know if <BR>you know of anything (some packages seem to deal
with geospatial data, but <BR>are not designed for _gridded_ data). It
seems that this could be of <BR>benefit to the GO-ESSP community,
considering that any Earth System <BR>Portal must be backed by some kind of
data store! ;-)
<P>This has been rather a long post, sorry! Any suggestions or
feedback would <BR>be very much appreciated.
<P>Best wishes, <BR>Jon
<P>-------------------------------------------------------------- <BR>Dr Jon
Blower, Technical Director, Reading e-Science Centre <BR>Tel: +44 118 378 5213
(direct line); +44 118 378 8741 (ESSC); Fax: +44 118 378 6413 <BR>Email:
jdb@mail.nerc-essc.ac.uk <BR>ESSC, University of Reading, 3 Earley Gate,
Reading RG6 6AL, UK
<BR>--------------------------------------------------------------
<P>_______________________________________________ <BR>GO-ESSP mailing list
<BR>GO-ESSP@ucar.edu <BR><A
href="http://mailman.ucar.edu/mailman/listinfo/go-essp">http://mailman.ucar.edu/mailman/listinfo/go-essp</A></P></BLOCKQUOTE>
<P>--
<P>Steve Hankin, NOAA/PMEL -- Steven.C.Hankin@noaa.gov <BR>7600 Sand Point Way
NE, Seattle, WA 98115-0070 <BR>ph. (206) 526-6080, FAX (206) 526-6744
<BR> </P></BLOCKQUOTE></BODY></HTML>