[Go-essp-tech] Configuring esgcet for CMIP5 and the DRS structure

Tue Feb 9 10:03:03 MST 2010

Hi Bob,

This all sounds promising.  I have 2 other queries.

I have noticed that publishing a dataset generates duplicate
PUBLISH_DATASET events.  I saw this when publishing a batch of datasets
using "esgpublish --map <mapfile> --publish".  Are you aware of this or
should I investigate further?  In one case the duplicate events had
exactly the same timestamp which confused one of my SQL queries.
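
A minimal sketch of one way to guard against exact duplicates on the
consuming side. It assumes each event can be reduced to a (dataset,
event type, timestamp) record; the Event type and its field names are
hypothetical and are not the esgcet schema:

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    # Hypothetical record layout, for illustration only.
    dataset_id: str
    event_type: str
    timestamp: str


def drop_duplicate_events(events: Iterable[Event]) -> List[Event]:
    """Keep the first occurrence of each identical event, preserving order."""
    seen = set()
    unique = []
    for ev in events:
        key = (ev.dataset_id, ev.event_type, ev.timestamp)
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique
```

This only collapses events that agree on all three fields; duplicates
with different timestamps would need a different rule.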

Is the format of mapfiles documented?  The esgscan_directory script says
the format is lines of

"dataset_id | absolute_file_path | size" 

but I notice output from that script is 

"dataset_id | absolute_file_path | size | mod_time=x"

It might be convenient for us to generate mapfiles directly from our
ingest system so we'd like to know whether the final field is optional.
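
To make the question concrete, here is a rough sketch of writing and
reading such lines. It treats everything after the third field as
optional "key=value" items, which is exactly the assumption that needs
confirming; the function names are hypothetical and this is not
publisher code:

```python
def format_mapfile_line(dataset_id, path, size, mod_time=None):
    """Build 'dataset_id | absolute_file_path | size [| mod_time=x]'."""
    fields = [dataset_id, path, str(size)]
    if mod_time is not None:
        # Assumption: the trailing field may be left out entirely.
        fields.append("mod_time=%s" % mod_time)
    return " | ".join(fields)


def parse_mapfile_line(line):
    """Split a mapfile line into (dataset_id, path, size, extras)."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 3:
        raise ValueError("expected at least 3 fields: %r" % line)
    dataset_id, path, size = parts[:3]
    extras = {}
    for item in parts[3:]:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        extras[key.strip()] = value.strip()
    return dataset_id, path, int(size), extras
```

Note that a file path containing "|" would break this simple split.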

Thanks,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
British Atmospheric Data Centre
Rutherford Appleton Laboratory

-----Original Message-----
From: Bob Drach [mailto:drach at llnl.gov] 
Sent: 08 February 2010 19:46
To: Pascoe, Stephen (STFC,RAL,SSTD)
Cc: go-essp-tech at ucar.edu
Subject: Re: Configuring esgcet for CMIP5 and the DRS structure

Hi Stephen,

I'll be interested in the results of your publication testing. The next
test here will be to publish at the granularity of 'realm'. More
comments below.

Bob

stephen.pascoe at stfc.ac.uk wrote:
> Thanks Bob,
>
> This was a great help in getting our datanode working with CMIP3.  Can
> I ask why you map the time frequency "fixed" to "monthly" in your
> esg.ini?
>   
That is an artifact of how data was organized for CMIP3. Fixed fields
and monthly were in the same CMOR table. That will not be the case for
CMIP5.
> As you would expect we've done it differently.  We've rearranged the
> CMIP3 archive on disk to match the DRS specification.  I attach our 
> esg.ini that shows you how we've configured the datanode to recognise 
> DRS.
>
> This morning I published about 100 CMIP3 datasets to the PCMDI
> gateway.  They are visible under the test project.  The Gateway
> generated a wget for me that worked great!
>   
That's very good news.
> We've tried running "esgpublish --thredds" on sets of datasets up to 
> 500 and found performance acceptable if you use the "--map" option to 
> avoid multiple calls to esgpublish.  Generating THREDDS XML for the
> entire CMIP5 archive would definitely take days but not weeks.  I hope
> to have more detailed figures soon.
>
> Notifying the gateway is a more serious bottleneck.  We ran
> "esgpublish --publish" on about 100 datasets to publish them to
> PCMDI's gateway.  Each dataset took about 1s to rescan then took about
> 9s to notify the gateway.
>
>
> I've a few comments on the way the datanode works that slightly
> complicates using the DRS.  I hope you don't mind me bugging you about
> them -- generally I find the datanode works great:
>
>  1. The datanode hard-codes the concept of a run_name as the
>     identifier "run?" (in ipcc4_handler.py).  This isn't compatible
>     with the DRS syntax of "r?i?p?".
>   
Yes, I agree. Again, an artifact of CMIP3.
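
To make the mismatch concrete, a small sketch. It assumes each "?" in
"r?i?p?" stands for a non-negative integer, which is an assumption and
is not spelled out in this thread; the function name is hypothetical
and nothing here is taken from ipcc4_handler.py:

```python
import re

# Assumed shape of a DRS-style member identifier: r<int>i<int>p<int>.
_MEMBER_RE = re.compile(r"r(\d+)i(\d+)p(\d+)")


def parse_member(name):
    """Return the three integers of an 'r?i?p?' name, or None if no match."""
    match = _MEMBER_RE.fullmatch(name)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```

A CMIP3-style name such as "run1" does not match this pattern, so a
handler that only understands "run?" cannot represent the DRS form.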
>  2. The DRS concept of "component" (i.e. institute, model, experiment,
>     etc.) is given multiple different names within the datanode.  In
>     esg.ini it is "category", in esgpublish it is "property" and in the
>     database schema it is "attribute".  This is confusing but I can see
>     why it's done most of the time.
>   
It really is confusing - I wish the terminology were more consistent. In
part this comes from having so many separate bits of software. My take
on the terminology:

- ESG / publisher: category ( == facet when the field is available for
advanced search);
- THREDDS: property - as contained in the thredds catalogs as property
elements. Not all ESG categories necessarily end up as THREDDS
properties.
- DRS: component
>  3. The datanode has an intrinsic concept of "model" but not
>     "institute".  Each dataset is associated with a model rather than
>     an (institute, model) pair therefore you cannot easily deal with
>     multiple institutes running the same model (by which I mean a model
>     with the same identifier -- of course each institute will run a
>     slightly different configuration).  I think CMIP5 isn't going to
>     have this problem as every institute seems to be naming their model
>     slightly differently.  Am I right?
>   
Yes, right. For CMIP5 there will be an enumerated list of
(model,institution) each with a separate identifier.
>  4. Running "esgunpublish --skip-gateway" to delete the THREDDS XML
>     without removing the dataset from the database does not create a
>     datanode event.  This would be useful in the case that you want to
>     remove the THREDDS XML of a dataset that hasn't been published to a
>     gateway yet.
>   
I'll add a DELETE_THREDDS_CATALOG event.
> #1 and #3 can be worked around in the esg.ini file by defining
> mappings between categories.  For instance we use the internal
> category "run_drs" to capture the DRS realisation syntax "r?i?p?" from
> directory paths then map it to "run_name".
>
> Cheers,
> Stephen.
>
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>
> -----Original Message-----
> From: Bob Drach [mailto:drach at llnl.gov]
> Sent: 01 February 2010 19:24
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: go-essp-tech at ucar.edu
> Subject: Re: Configuring esgcet for CMIP5 and the DRS structure
>
> Hi Stephen,
>
> Funny you should mention ... I just published a portion of the CMIP3 
> archive with DRS-style datasets. The project portion of the init file 
> is attached.
>
> For the attached init file to work you'll need the bleeding edge 
> repository version of the publisher - there is a tweak to get the 
> variable_standard_name and variable_long_name fields into the 
> dataset_name_format options. [I believe if you remove any mention of 
> variable_standard_name, that Version 2.1 should work, but I haven't 
> tested it.]
>
> You'll need to change a few things for this to work in your
> environment:
>
> - parent_id = %(root_id)s.ipcc4.%(model)s assumes that there are 
> existing intermediate datasets, one for each model.
> - directory_format is probably unique to our environment.
> - realm is deduced from the directory structure, in our case.
>
> Part of the reason for testing this on our end is to evaluate 
> publication performance with dataset granularity at the level of 
> DRS-style datasets. One thing that became obvious is that there is a 
> per-dataset overhead for each web-service call to the gateway - the 
> final publish step. It's not too significant when all variables for a 
> run are grouped together as has been the case up to now. But when each
> dataset is a single variable the number of datasets balloons to
> ~28,000, and a few seconds per call becomes very significant. The
> bleeding edge version has one change that reduces the overhead ~25%,
> and the gateway developers claim they have also sped up the
> publication processing. But the question still remains what the
> overhead of the web-service call itself is. I'll be interested to see
> what your experience is.
>
> Bob
>
> stephen.pascoe at stfc.ac.uk wrote:
>   
>> Hi Bob,
>>  
>> We now have the CMIP3 archive partially in DRS format -- 1pctto2x and
>> 1pctto4x experiments are done with the rest proceeding.  So I'm now
>> trying to configure esgcet to recognise this structure by defining a 
>> new project in esg.ini "[project:cmip3_drs]".
>>  
>> Do you have a sample [project:cmip5] section you are working on or 
>> should I continue to follow my intuition on how one should map DRS 
>> components to categories in esg.ini?
>>  
>> Cheers,
>> Stephen.
>>  
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>  
