[Go-essp-tech] Configuring esgcet for CMIP5 and the DRS structure

Mon Feb 8 12:46:18 MST 2010

Hi Stephen,

I'll be interested in the results of your publication testing. The next 
test here will be to publish at the granularity of 'realm'. More 
comments below.

Bob

stephen.pascoe at stfc.ac.uk wrote:
> Thanks Bob,
>
> This was a great help in getting our datanode working with CMIP3.  Can I
> ask why you map the time frequency "fixed" to "monthly" in your esg.ini?
>   
That is an artifact of how data was organized for CMIP3. Fixed fields 
and monthly were in the same CMOR table. That will not be the case for 
CMIP5.
> As you would expect we've done it differently.  We've rearranged the
> CMIP3 archive on disk to match the DRS specification.  I attach our
> esg.ini that shows you how we've configured the datanode to recognise
> DRS.
>
> This morning I published about 100 CMIP3 datasets to the PCMDI gateway.
> They are visible under the test project.  The Gateway generated a wget
> for me that worked great!  
>   
That's very good news.
> We've tried running "esgpublish --thredds" on sets of datasets up to 500
> and found performance acceptable if you use the "--map" option to avoid
> multiple calls to esgpublish.  Generating THREDDS XML for the entire
> CMIP5 archive would definitely take days but not weeks.  I hope to have
> more detailed figures soon.
>
> Notifying the gateway is a more serious bottleneck.  We ran "esgpublish
> --publish" on about 100 datasets to publish them to PCMDI's gateway.
> Each dataset took about 1s to rescan then took about 9s to notify the
> gateway.  
>
>
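As a rough back-of-the-envelope check on those figures, here is a minimal sketch. It assumes the measured costs (about 1s to rescan and about 9s to notify the gateway per dataset) stay constant and that publication is strictly serial. The ~28,000 dataset count is the estimate for single-variable DRS-style datasets from the earlier message quoted further down. The function name is only illustrative.

```python
def publish_hours(n_datasets, rescan_s=1.0, notify_s=9.0):
    """Estimated wall-clock hours to publish n_datasets one after another.

    rescan_s and notify_s are the per-dataset timings reported in the
    thread; treating them as constant is an assumption.
    """
    return n_datasets * (rescan_s + notify_s) / 3600.0


if __name__ == "__main__":
    measured = publish_hours(100)       # the test run described above
    projected = publish_hours(28000)    # single-variable dataset granularity
    notify_share = 9.0 / (1.0 + 9.0)    # fraction of time spent notifying

    print("100 datasets:    %.2f h" % measured)
    print("28,000 datasets: %.1f h (%.1f days)" % (projected, projected / 24))
    print("gateway notification share: %.0f%%" % (notify_share * 100))
```

Under these assumptions the full set would take roughly 78 hours, a little over three days, with about 90% of that spent in the gateway notification step.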
> I've a few comments on the way the datanode works that slightly
> complicates using the DRS.  I hope you don't mind me bugging you about
> them -- generally I find the datanode works great:
>
>  1. The datanode hard-codes the concept of a run_name as the identifier
>     "run?" (in ipcc4_handler.py).  This isn't compatible with the DRS
>     syntax of "r?i?p?".
>   
Yes, I agree. Again, an artifact of CMIP3.
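To make the mismatch concrete, here is a small sketch of a parser that accepts both forms. It is not the code in ipcc4_handler.py. The function name, the returned keys, and the reading of each "?" as one or more digits are assumptions for illustration only.

```python
import re

# DRS-style ensemble member, e.g. "r1i1p1" (assumes each "?" is an integer).
DRS_MEMBER = re.compile(r"^r(\d+)i(\d+)p(\d+)$")
# CMIP3-style run name, e.g. "run1" (same assumption about "?").
CMIP3_RUN = re.compile(r"^run(\d+)$")


def parse_run_name(name):
    """Return the indices encoded in a run identifier.

    Hypothetical helper: handles "r?i?p?" and falls back to "run?".
    """
    m = DRS_MEMBER.match(name)
    if m:
        r, i, p = (int(g) for g in m.groups())
        return {"realization": r, "initialization": i, "physics": p}
    m = CMIP3_RUN.match(name)
    if m:
        # CMIP3 names carry only a run number.
        return {"realization": int(m.group(1)),
                "initialization": None, "physics": None}
    raise ValueError("unrecognised run identifier: %r" % name)


if __name__ == "__main__":
    print(parse_run_name("r1i2p3"))
    print(parse_run_name("run2"))
```

A handler that only knows the "run?" pattern would reject "r1i2p3" outright, which is the incompatibility described above.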
>  2. The DRS concept of "component" (i.e. institute, model, experiment,
>     etc.) is given multiple different names within the datanode.  In
>     esg.ini it is "category", in esgpublish it is "property" and in the
>     database schema it is "attribute".  This is confusing but I can see
>     why it's done most of the time.
>   
It really is confusing - I wish the terminology were more consistent. In 
part this comes from having so many separate bits of software. My take 
on the terminology:

- ESG / publisher: category (== facet when the field is available for
  advanced search).
- THREDDS: property, as contained in the THREDDS catalogs as property
  elements. Not all ESG categories necessarily end up as THREDDS
  properties.
- DRS: component.
>  3. The datanode has an intrinsic concept of "model" but not
>     "institute".  Each dataset is associated with a model rather than
>     an (institute, model) pair therefore you cannot easily deal with
>     multiple institutes running the same model (by which I mean a model
>     with the same identifier -- of course each institute will run a
>     slightly different configuration).  I think CMIP5 isn't going to
>     have this problem as every institute seems to be naming their model
>     slightly differently.  Am I right?
>   
Yes, right. For CMIP5 there will be an enumerated list of 
(model,institution) each with a separate identifier.
>  4. Running "esgunpublish --skip-gateway" to delete the THREDDS XML
>     without removing the dataset from the database does not create a
>     datanode event.  This would be useful in the case that you want to
>     remove the THREDDS XML of a dataset that hasn't been published to a
>     gateway yet.
>   
I'll add a DELETE_THREDDS_CATALOG event.
> #1 and #3 can be worked around in the esg.ini file by defining mappings
> between categories.  For instance we use the internal category "run_drs"
> to capture the DRS realisation syntax "r?i?p?" from directory paths then
> map it to "run_name".
>
> Cheers,
> Stephen.
>
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>
> -----Original Message-----
> From: Bob Drach [mailto:drach at llnl.gov] 
> Sent: 01 February 2010 19:24
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: go-essp-tech at ucar.edu
> Subject: Re: Configuring esgcet for CMIP5 and the DRS structure
>
> Hi Stephen,
>
> Funny you should mention ... I just published a portion of the CMIP3
> archive with DRS-style datasets. The project portion of the init file is
> attached.
>
> For the attached init file to work you'll need the bleeding edge
> repository version of the publisher - there is a tweak to get the
> variable_standard_name and variable_long_name fields into the
> dataset_name_format options. [I believe Version 2.1 should work if you
> remove any mention of variable_standard_name, but I haven't tested
> it.]
>
> You'll need to change a few things for this to work in your environment:
>
> - parent_id = %(root_id)s.ipcc4.%(model)s assumes that there are
> existing intermediate datasets, one for each model.
> - directory_format is probably unique to our environment.
> - realm is deduced from the directory structure, in our case.
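The %(name)s placeholders in an option like parent_id have the same form as Python's mapping-based string formatting. Below is a sketch of how such a template expands, assuming the publisher fills in category values by name. The helper name and the example values for root_id and model are hypothetical.

```python
def expand(template, categories):
    """Fill %(name)s placeholders from a dict of category values.

    Raises KeyError if the template names a category that is missing.
    """
    return template % categories


if __name__ == "__main__":
    template = "%(root_id)s.ipcc4.%(model)s"
    # Hypothetical values, for illustration only.
    values = {"root_id": "pcmdi", "model": "some_model"}
    print(expand(template, values))
```

Each distinct model value yields a different parent_id, so the template only works where an intermediate dataset already exists for every model.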
>
> Part of the reason for testing this on our end is to evaluate
> publication performance with dataset granularity at the level of
> DRS-style datasets. One thing that became obvious is that there is a
> per-dataset overhead for each web-service call to the gateway - the
> final publish step. It's not too significant when all variables for a
> run are grouped together as has been the case up to now. But when each
> dataset is a single variable the number of datasets balloons to ~28,000,
> and a few seconds per call becomes very significant. The bleeding edge
> version has one change that reduces the overhead ~25%, and the gateway
> developers claim they have also sped up the publication processing. But
> the question still remains what the overhead of the web-service call
> itself is. I'll be interested to see what your experience is.
>
> Bob
>
> stephen.pascoe at stfc.ac.uk wrote:
>   
>> Hi Bob,
>>  
>> We now have the CMIP3 archive partially in DRS format -- 1pctto2x and 
>> 1pctto4x experiments are done with the rest proceeding.  So I'm now 
>> trying to configure esgcet to recognise this structure by defining a 
>> new project in esg.ini "[project:cmip3_drs]".
>>  
>> Do you have a sample [project:cmip5] section you are working on or 
>> should I continue to follow my intuition on how one should map DRS 
>> components to categories in esg.ini?
>>  
>> Cheers,
>> Stephen.
>>  
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>  