[Go-essp-tech] Configuring esgcet for CMIP5 and the DRS structure

Fri Feb 5 09:55:04 MST 2010

Thanks Bob,

This was a great help in getting our datanode working with CMIP3.  Can I
ask why you map the time frequency "fixed" to "monthly" in your esg.ini?

As you would expect we've done it differently.  We've rearranged the
CMIP3 archive on disk to match the DRS specification.  I attach our
esg.ini that shows you how we've configured the datanode to recognise
DRS.

This morning I published about 100 CMIP3 datasets to the PCMDI gateway.
They are visible under the test project.  The Gateway generated a wget
for me that worked great!  

We've tried running "esgpublish --thredds" on batches of up to 500
datasets and found performance acceptable if you use the "--map" option
to avoid multiple calls to esgpublish.  Generating THREDDS XML for the
entire CMIP5 archive would take days rather than weeks.  I hope to have
more detailed figures soon.

Notifying the gateway is a more serious bottleneck.  We ran "esgpublish
--publish" on about 100 datasets to publish them to PCMDI's gateway.
Each dataset took about 1s to rescan then took about 9s to notify the
gateway.  
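
To put those timings in context, here is a back-of-envelope estimate.
It assumes strictly serial publishing, that the per-dataset times above
stay constant, and the ~28,000 DRS-style datasets Bob mentions in his
message below.  The function name is made up for illustration; none of
this is measured data.

```python
def serial_publish_hours(n_datasets, rescan_s=1.0, notify_s=9.0):
    """Hours needed to publish n_datasets one after another.

    rescan_s and notify_s are the rough per-dataset timings quoted
    above (about 1s to rescan, about 9s to notify the gateway).
    """
    return n_datasets * (rescan_s + notify_s) / 3600.0


n = 28000  # assumption: dataset count taken from Bob's CMIP3 figure
total_h = serial_publish_hours(n)
notify_h = serial_publish_hours(n, rescan_s=0.0)

print("total: %.1f h (%.1f days)" % (total_h, total_h / 24))
print("gateway notification alone: %.1f h" % notify_h)
```

Under these assumptions the gateway notification accounts for roughly
nine tenths of the total time.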

I have a few comments on aspects of the datanode that slightly
complicate using the DRS.  I hope you don't mind me bugging you about
them -- generally I find the datanode works great:

 1. The datanode hard-codes the concept of a run_name as the identifier
    "run?" (in ipcc4_handler.py).  This isn't compatible with the DRS
    syntax of "r?i?p?".
 2. The DRS concept of "component" (i.e. institute, model, experiment,
    etc.) is given multiple different names within the datanode.  In
    esg.ini it is "category", in esgpublish it is "property" and in the
    database schema it is "attribute".  This is confusing but I can see
    why it's done most of the time.
 3. The datanode has an intrinsic concept of "model" but not
    "institute".  Each dataset is associated with a model rather than
    an (institute, model) pair therefore you cannot easily deal with
    multiple institutes running the same model (by which I mean a model
    with the same identifier -- of course each institute will run a
    slightly different configuration).  I think CMIP5 isn't going to
    have this problem as every institute seems to be naming their model
    slightly differently.  Am I right?
 4. Running "esgunpublish --skip-gateway" to delete the THREDDS XML
    without removing the dataset from the database does not create a
    datanode event.  This would be useful in the case that you want to
    remove the THREDDS XML of a dataset that hasn't been published to a
    gateway yet.

#1 and #3 can be worked around in the esg.ini file by defining mappings
between categories.  For instance we use the internal category "run_drs"
to capture the DRS realisation syntax "r?i?p?" from directory paths then
map it to "run_name".
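
As a rough illustration of that workaround, written as standalone
Python rather than as esg.ini syntax: the regular expression, the
example path and the function names below are assumptions made for the
sketch.  They are not taken from esgcet or from the attached esg.ini.

```python
import re

# Assumed form of a DRS realisation string: r<N>i<M>p<L>, e.g. "r1i1p1".
RUN_DRS = re.compile(r"^r\d+i\d+p\d+$")


def find_run_drs(path):
    """Return the first path component that looks like r?i?p?, else None."""
    for part in path.strip("/").split("/"):
        if RUN_DRS.match(part):
            return part
    return None


def map_categories(path):
    """Capture the internal 'run_drs' category and copy it to 'run_name'."""
    categories = {}
    run_drs = find_run_drs(path)
    if run_drs is not None:
        categories["run_drs"] = run_drs
        categories["run_name"] = run_drs  # the category mapping step
    return categories


# Hypothetical DRS-style directory, for illustration only.
example = "/archive/cmip3_drs/UKMO/HadCM3/1pctto2x/mon/atmos/tas/r1i1p1"
print(map_categories(example))
```

A path with no r?i?p? component yields an empty dictionary, so a caller
can tell that the directory does not follow the DRS layout.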

Cheers,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
British Atmospheric Data Centre
Rutherford Appleton Laboratory

-----Original Message-----
From: Bob Drach [mailto:drach at llnl.gov] 
Sent: 01 February 2010 19:24
To: Pascoe, Stephen (STFC,RAL,SSTD)
Cc: go-essp-tech at ucar.edu
Subject: Re: Configuring esgcet for CMIP5 and the DRS structure

Hi Stephen,

Funny you should mention ... I just published a portion of the CMIP3
archive with DRS-style datasets. The project portion of the init file is
attached.

For the attached init file to work you'll need the bleeding edge
repository version of the publisher - there is a tweak to get the
variable_standard_name and variable_long_name fields into the
dataset_name_format options. [I believe that if you remove any mention
of variable_standard_name, Version 2.1 should work, but I haven't
tested it.]

You'll need to change a few things for this to work in your environment:

- parent_id = %(root_id)s.ipcc4.%(model)s assumes that there are
existing intermediate datasets, one for each model.
- directory_format is probably unique to our environment.
- realm is deduced from the directory structure, in our case.
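
The %(name)s placeholders in parent_id follow Python's dictionary-based
string formatting.  Assuming the publisher expands them that way, a
minimal sketch of the substitution looks like this; the root_id and
model values are hypothetical, and expand() is not a publisher
function.

```python
PARENT_ID_FORMAT = "%(root_id)s.ipcc4.%(model)s"


def expand(template, categories):
    """Fill %(name)s placeholders from a dict of category values.

    Raises KeyError if the template names a category that is missing.
    """
    return template % categories


# Hypothetical category values, for illustration only.
parent_id = expand(PARENT_ID_FORMAT,
                   {"root_id": "pcmdi", "model": "ukmo_hadcm3"})
print(parent_id)
```

With these values the result is "pcmdi.ipcc4.ukmo_hadcm3", which is the
identifier of the intermediate per-model dataset that must already
exist.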

Part of the reason for testing this on our end is to evaluate
publication performance with dataset granularity at the level of
DRS-style datasets. One thing that became obvious is that there is a
per-dataset overhead for each web-service call to the gateway - the
final publish step. It's not too significant when all variables for a
run are grouped together as has been the case up to now. But when each
dataset is a single variable, the number of datasets balloons to ~28,000,
and a few seconds per call becomes very significant. The bleeding edge
version has one change that reduces the overhead ~25%, and the gateway
developers claim they have also sped up the publication processing. But
the question still remains what the overhead of the web-service call
itself is. I'll be interested to see what your experience is.

Bob

stephen.pascoe at stfc.ac.uk wrote:
> Hi Bob,
>  
> We now have the CMIP3 archive partially in DRS format -- 1pctto2x and 
> 1pctto4x experiments are done with the rest proceeding.  So I'm now 
> trying to configure esgcet to recognise this structure by defining a 
> new project in esg.ini "[project:cmip3_drs]".
>  
> Do you have a sample [project:cmip5] section you are working on or 
> should I continue to follow my intuition on how one should map DRS 
> components to categories in esg.ini?
>  
> Cheers,
> Stephen.
>  
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>  

-------------- next part --------------
A non-text attachment was scrubbed...
Name: esg.ini
Type: application/octet-stream
Size: 10553 bytes
Desc: esg.ini
Url : http://mailman.ucar.edu/pipermail/go-essp-tech/attachments/20100205/31e074b7/attachment.obj