[Go-essp-tech] Configuring esgcet for CMIP5 and the DRS structure

Tue Feb 9 13:31:49 MST 2010

Hi Stephen,

On Feb 9, 2010, at 9:03 AM, <stephen.pascoe at stfc.ac.uk> wrote:

> Hi Bob,
>
> This all sounds promising.  I have 2 other queries.
>
> I have noticed that publishing a dataset generates duplicate
> PUBLISH_DATASET events.  I saw this when publishing a batch of
> datasets using "esgpublish --map <mapfile> --publish".  Are you aware
> of this or should I investigate further?  In one case the duplicate
> events had exactly the same timestamp, which confused one of my SQL
> queries.

I wasn't aware that this could happen - and wouldn't have expected  
it. I'll try to duplicate the problem and correct it.

>
> Is the format of mapfiles documented?  The esgscan_directory script
> says the format is lines of
>
> "dataset_id | absolute_file_path | size"
>
> but I notice output from that script is
>
> "dataset_id | absolute_file_path | size | mod_time=x"
>
> It might be convenient for us to generate mapfiles directly from our
> ingest system so we'd like to know whether the final field is
> optional.

The mapfile format is documented at:

http://www2-pcmdi.llnl.gov/Members/bdrach/.personal/esg-publication-scripts/

The property=value fields are optional, for backward compatibility  
with existing map files.
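
For anyone generating mapfiles directly, reading one line in this format could be sketched in Python roughly as follows. This is only an illustration based on what is stated in this thread: three '|'-separated fields, followed by zero or more optional property=value fields. The function name parse_mapfile_line and the sample values are hypothetical and not part of esgcet, and treating the size as an integer while keeping property values as strings is an assumption.

```python
def parse_mapfile_line(line):
    """Split one mapfile line into (dataset_id, path, size, properties).

    Assumed layout, taken from the thread:
        dataset_id | absolute_file_path | size [| property=value ...]
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 3:
        raise ValueError("expected at least 3 '|'-separated fields: %r" % line)

    dataset_id, path = fields[0], fields[1]
    size = int(fields[2])  # assumption: size is an integer byte count

    # Any remaining fields are optional property=value pairs.
    properties = {}
    for extra in fields[3:]:
        key, sep, value = extra.partition("=")
        if not sep:
            raise ValueError("expected property=value, got %r" % extra)
        properties[key.strip()] = value.strip()

    return dataset_id, path, size, properties


# Hypothetical sample lines, with and without the optional field.
with_props = parse_mapfile_line(
    "cmip3_drs.example | /data/example/tas.nc | 1024 | mod_time=1265700000.0")
without_props = parse_mapfile_line(
    "cmip3_drs.example | /data/example/tas.nc | 1024")
```

Because the trailing fields are collected into a dictionary, a line with no property=value fields simply yields an empty dictionary, which matches the backward-compatible behaviour described above.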

Regards,

Bob

>
> Thanks,
> Stephen.
>
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>
> -----Original Message-----
> From: Bob Drach [mailto:drach at llnl.gov]
> Sent: 08 February 2010 19:46
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: go-essp-tech at ucar.edu
> Subject: Re: Configuring esgcet for CMIP5 and the DRS structure
>
> Hi Stephen,
>
> I'll be interested in the results of your publication testing. The
> next test here will be to publish at the granularity of 'realm'. More
> comments below.
>
> Bob
>
> stephen.pascoe at stfc.ac.uk wrote:
>> Thanks Bob,
>>
>> This was a great help in getting our datanode working with CMIP3.
>> Can I ask why you map the time frequency "fixed" to "monthly" in your
>> esg.ini?
>>
> That is an artifact of how data was organized for CMIP3. Fixed fields
> and monthly were in the same CMOR table. That will not be the case for
> CMIP5.
>> As you would expect we've done it differently.  We've rearranged the
>> CMIP3 archive on disk to match the DRS specification.  I attach our
>> esg.ini that shows you how we've configured the datanode to recognise
>> DRS.
>>
>> This morning I published about 100 CMIP3 datasets to the PCMDI
>> gateway.  They are visible under the test project.  The Gateway
>> generated a wget for me that worked great!
>>
> That's very good news.
>> We've tried running "esgpublish --thredds" on sets of up to 500
>> datasets and found performance acceptable if you use the "--map"
>> option to avoid multiple calls to esgpublish.  Generating THREDDS XML
>> for the entire CMIP5 archive would definitely take days but not
>> weeks.  I hope to have more detailed figures soon.
>>
>> Notifying the gateway is a more serious bottleneck.  We ran
>> "esgpublish --publish" on about 100 datasets to publish them to
>> PCMDI's gateway.  Each dataset took about 1s to rescan and then about
>> 9s to notify the gateway.
>>
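
To put the per-dataset timings above in perspective, here is a rough back-of-the-envelope calculation in Python. It combines the ~1s rescan and ~9s notification figures measured here with the ~28,000 single-variable datasets estimated elsewhere in this thread. Combining the two numbers, and assuming the per-dataset cost stays constant and the work runs serially, are assumptions made only for illustration.

```python
RESCAN_SECONDS = 1.0    # approximate rescan time per dataset (from this thread)
NOTIFY_SECONDS = 9.0    # approximate gateway notification time per dataset
DATASET_COUNT = 28000   # approximate count of single-variable datasets


def total_hours(dataset_count, seconds_per_dataset):
    """Serial wall-clock time in hours, assuming a constant per-dataset cost."""
    return dataset_count * seconds_per_dataset / 3600.0


publish_hours = total_hours(DATASET_COUNT, RESCAN_SECONDS + NOTIFY_SECONDS)
notify_hours = total_hours(DATASET_COUNT, NOTIFY_SECONDS)

print("full publish: %.1f hours (%.1f days)" % (publish_hours, publish_hours / 24))
print("gateway notification alone: %.1f hours" % notify_hours)
```

Under these assumptions the gateway notification accounts for 9 of every 10 seconds, so any reduction in per-call overhead translates almost directly into a shorter total publication time.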
>>
>> I've a few comments on the way the datanode works that slightly
>> complicates using the DRS.  I hope you don't mind me bugging you
>> about them -- generally I find the datanode works great:
>>
>>  1. The datanode hard-codes the concept of a run_name as the
>>     identifier "run?" (in ipcc4_handler.py).  This isn't compatible
>>     with the DRS syntax of "r?i?p?".
>>
> Yes, I agree. Again, an artifact of CMIP3.
>>  2. The DRS concept of "component" (i.e. institute, model,
>>     experiment, etc.) is given multiple different names within the
>>     datanode.  In esg.ini it is "category", in esgpublish it is
>>     "property" and in the database schema it is "attribute".  This is
>>     confusing but I can see why it's done most of the time.
>>
> It really is confusing - I wish the terminology were more
> consistent. In part this comes from having so many separate bits of
> software. My take on the terminology:
>
> - ESG / publisher: category ( == facet when the field is available for
> advanced search);
> - THREDDS: property - as contained in the thredds catalogs as property
> elements. Not all ESG categories necessarily end up as THREDDS
> properties.
> - DRS: component
>>  3. The datanode has an intrinsic concept of "model" but not
>>     "institute".  Each dataset is associated with a model rather than
>>     an (institute, model) pair, therefore you cannot easily deal with
>>     multiple institutes running the same model (by which I mean a
>>     model with the same identifier -- of course each institute will
>>     run a slightly different configuration).  I think CMIP5 isn't
>>     going to have this problem as every institute seems to be naming
>>     their model slightly differently.  Am I right?
>>
> Yes, right. For CMIP5 there will be an enumerated list of
> (model,institution) each with a separate identifier.
>>  4. Running "esgunpublish --skip-gateway" to delete the THREDDS XML
>>     without removing the dataset from the database does not create a
>>     datanode event.  This would be useful in the case that you want
>>     to remove the THREDDS XML of a dataset that hasn't been published
>>     to a gateway yet.
>>
> I'll add a DELETE_THREDDS_CATALOG event.
>> #1 and #3 can be worked around in the esg.ini file by defining
>> mappings between categories.  For instance we use the internal
>> category "run_drs" to capture the DRS realisation syntax "r?i?p?"
>> from directory paths then map it to "run_name".
>>
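
The idea of capturing the DRS realisation from a directory path and mapping it to "run_name" can be sketched outside esg.ini in plain Python. This is not how esgcet implements it; it only illustrates the matching step. The function names and the example paths are hypothetical, and the pattern assumes each "?" in "r?i?p?" stands for one or more digits.

```python
import re

# Assumption: a DRS realisation looks like r<N>i<M>p<L> with integer parts.
DRS_RUN = re.compile(r"r\d+i\d+p\d+")


def find_run_drs(path):
    """Return the first path component that is a DRS realisation, else None."""
    for part in path.split("/"):
        if DRS_RUN.fullmatch(part):
            return part
    return None


def categories_for(path):
    """Map the captured 'run_drs' value onto 'run_name' (illustrative only)."""
    run_drs = find_run_drs(path)
    if run_drs is None:
        return {}
    return {"run_drs": run_drs, "run_name": run_drs}


# Hypothetical paths: one DRS-style, one CMIP3-style "run?" directory.
drs_result = categories_for("/archive/cmip3_drs/atmos/mo/tas/r1i1p1/tas_A1.nc")
old_result = categories_for("/archive/cmip3/atmos/mo/tas/run1/tas_A1.nc")
```

Matching whole path components rather than searching the full string avoids accidental hits inside file names.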
>> Cheers,
>> Stephen.
>>
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>
>> -----Original Message-----
>> From: Bob Drach [mailto:drach at llnl.gov]
>> Sent: 01 February 2010 19:24
>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>> Cc: go-essp-tech at ucar.edu
>> Subject: Re: Configuring esgcet for CMIP5 and the DRS structure
>>
>> Hi Stephen,
>>
>> Funny you should mention ... I just published a portion of the CMIP3
>> archive with DRS-style datasets. The project portion of the init file
>> is attached.
>>
>> For the attached init file to work you'll need the bleeding edge
>> repository version of the publisher - there is a tweak to get the
>> variable_standard_name and variable_long_name fields into the
>> dataset_name_format options. [I believe if you remove any mention of
>> variable_standard_name, that Version 2.1 should work, but I haven't
>> tested it.]
>>
>> You'll need to change a few things for this to work in your
>> environment:
>>
>> - parent_id = %(root_id)s.ipcc4.%(model)s assumes that there are
>> existing intermediate datasets, one for each model.
>> - directory_format is probably unique to our environment.
>> - realm is deduced from the directory structure, in our case.
>>
>> Part of the reason for testing this on our end is to evaluate
>> publication performance with dataset granularity at the level of
>> DRS-style datasets. One thing that became obvious is that there is a
>> per-dataset overhead for each web-service call to the gateway - the
>> final publish step. It's not too significant when all variables for a
>> run are grouped together as has been the case up to now. But when
>> each dataset is a single variable the number of datasets balloons to
>> ~28,000, and a few seconds per call becomes very significant. The
>> bleeding edge version has one change that reduces the overhead by
>> ~25%, and the gateway developers claim they have also sped up the
>> publication processing. But the question still remains what the
>> overhead of the web-service call itself is. I'll be interested to see
>> what your experience is.
>>
>> Bob
>>
>> stephen.pascoe at stfc.ac.uk wrote:
>>
>>> Hi Bob,
>>>
>>> We now have the CMIP3 archive partially in DRS format -- 1pctto2x
>>> and 1pctto4x experiments are done with the rest proceeding.  So I'm
>>> now trying to configure esgcet to recognise this structure by
>>> defining a new project in esg.ini "[project:cmip3_drs]".
>>>
>>> Do you have a sample [project:cmip5] section you are working on or
>>> should I continue to follow my intuition on how one should map DRS
>>> components to categories in esg.ini?
>>>
>>> Cheers,
>>> Stephen.
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> British Atmospheric Data Centre
>>> Rutherford Appleton Laboratory
>>>
>>>
>>> --
>>> Scanned by iCritical.
>>>
>>>
>>>
>>
>>
>>
>