[Go-essp-tech] DRS syntax and TDS identifiers

Martina Stockhause martina.stockhause at zmaw.de
Fri Sep 3 02:30:15 MDT 2010


Hi, Bob,

thanks a lot for clearing things up.

The variable is missing in the netcdf file dataset_id, but the id of an
aggregation (atomic dataset) looks different in our example publication
(http://cmip2.dkrz.de/thredds/esgcet/1/cmip5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.r1i1p1.v1.xml):

<dataset
name="cmip5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.r1i1p1.co2mass.1.aggregation"
ID="cmip5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.r1i1p1.co2mass.1.aggregation"
urlPath="cmip5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.r1i1p1.co2mass.1.aggregation"
restrictAccess="esg-user">

Where do I find the version 'v1'?
What does the '1' mean? Version of realm-ensemble, i.e. equal to 'v1'?
Or number of performed aggregations of netcdf files?
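
For reference, this is roughly how we would take such an aggregation ID apart on the QC side. It is only a minimal Python sketch: the field order is the one observed in our example publication, not taken from any specification, the helper name split_aggregation_id and the key "counter" are made up for illustration, and it assumes that no component contains a '.' itself. The unexplained '1' is carried along unchanged, because its meaning is exactly the open question.

```python
# Field order as observed in the example publication (an assumption,
# not a documented rule).
DRS_FIELDS = (
    "project", "product", "institute", "model", "experiment",
    "frequency", "realm", "ensemble", "variable",
)


def split_aggregation_id(tds_id):
    """Split a TDS aggregation ID into DRS-like components (hypothetical helper)."""
    parts = tds_id.split(".")
    # Expect the nine DRS-like fields, one number, and the literal 'aggregation'.
    if len(parts) != len(DRS_FIELDS) + 2 or parts[-1] != "aggregation":
        raise ValueError("unexpected aggregation ID layout: %r" % tds_id)
    components = dict(zip(DRS_FIELDS, parts))
    # Meaning of this number is unresolved; it is kept as-is under a made-up key.
    components["counter"] = parts[-2]
    return components


example = ("cmip5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos."
           "r1i1p1.co2mass.1.aggregation")
print(split_aggregation_id(example)["variable"])
```

An ID that does not end in '.aggregation', such as the catalog-level ID ending in '.v1', is rejected with a ValueError rather than guessed at.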

----
Version: If I consult the DRS document, I agree that the version is
related to an atomic dataset. But recalling the discussion that the ESG
portal changed from resolving atomic datasets to resolving realms and
the remarks from Stephen, I thought that the version is now connected to
a realm. For the QC it is not important if it is
<realm>.<version>.<ensemble> or <realm>.<ensemble>.<version> as long as
the position of <variable> in the DRS hierarchy does not change. So I
suggest you discuss with Stephen and the portal people, which is
preferable. The decision should result in a new version of the DRS document.
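
To make the two orderings concrete, here is a minimal Python sketch that reads the <property> and <variable> elements of a catalog shaped like the one in Bob's mail below and assembles atomic dataset names in either order. The embedded SAMPLE catalog is a trimmed stand-in, and the names atomic_dataset_ids, "version_first" and "ensemble_first" are made up for illustration. Note that dataset_version is the ensemble-level version as published, so using it at the atomic level is an assumption, not something ESG provides.

```python
import xml.etree.ElementTree as ET

NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Trimmed stand-in for a published catalog, modelled on Bob's example.
SAMPLE = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset ID="cmip5.output.PCMDI.pcmdi-test.historical.fx.atmos.r0i0p0.v1" name="example">
    <property name="dataset_version" value="1"/>
    <property name="project" value="cmip5"/>
    <property name="experiment" value="historical"/>
    <property name="product" value="output"/>
    <property name="model" value="pcmdi-test"/>
    <property name="time_frequency" value="fx"/>
    <property name="realm" value="atmos"/>
    <property name="ensemble" value="r0i0p0"/>
    <property name="institute" value="PCMDI"/>
    <metadata>
      <variables vocabulary="CF-1.0">
        <variable name="sftlf" vocabulary_name="land_area_fraction" units="%">Land Area Fraction</variable>
      </variables>
    </metadata>
  </dataset>
</catalog>"""

PREFIX = ("project", "product", "institute", "model", "experiment", "time_frequency")
# The two candidate orderings under discussion (labels are hypothetical).
ORDERINGS = {
    "version_first": ("realm", "version", "ensemble", "variable"),
    "ensemble_first": ("realm", "ensemble", "version", "variable"),
}


def atomic_dataset_ids(catalog_xml, ordering="ensemble_first"):
    """Build one DRS-like name per variable listed in the catalog."""
    top = ET.fromstring(catalog_xml).find("t:dataset", NS)
    fields = {p.get("name"): p.get("value") for p in top.findall("t:property", NS)}
    # Assumption: reuse the ensemble-level dataset_version for every variable.
    fields["version"] = "v" + fields["dataset_version"]
    ids = []
    for var in top.findall("t:metadata/t:variables/t:variable", NS):
        fields["variable"] = var.get("name")
        ids.append(".".join(fields[key] for key in PREFIX + ORDERINGS[ordering]))
    return ids


print(atomic_dataset_ids(SAMPLE, "version_first"))
print(atomic_dataset_ids(SAMPLE, "ensemble_first"))
```

In both orderings <variable> stays in the last position, which is the only thing the QC depends on.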

Best wishes,
Martina


Robert S. Drach wrote:
> Hi Martina,
>
> As Stephen indicated in an earlier part of the thread, the easiest way
> to map from THREDDS dataset identifiers (as published) to DRS atomic
> dataset identifiers is to look inside the TDS catalog of the dataset.
> All the DRS identification is there, mainly in the <property> tags
> directly under the top-level dataset. For example, a catalog might
> look like (omitting the irrelevant parts):
>
> =================================================
> <catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xmlns:xlink="http://www.w3.org/1999/xlink"
> xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
> name="TDS configuration file"
> xsi:schemaLocation="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0
> http://www.unidata.ucar.edu/schemas/thredds/InvCatalog.1.0.2.xsd">
> ...
>  <property name="catalog_version" value="2"/>
>  <dataset restrictAccess="esg-user"
> ID="cmip5.output.PCMDI.pcmdi-test.historical.fx.atmos.r0i0p0.v1"
> name="project=CMIP5 / IPCC Fifth Assessment Report, model=PCMDI,
> experiment=historical, time_frequency=fx, modeling realm=atmos,
> ensemble=r0i0p0, version=1">
>    <property name="dataset_id"
> value="cmip5.output.PCMDI.pcmdi-test.historical.fx.atmos.r0i0p0"/>
>    <property name="dataset_version" value="1"/>
>    <property name="project" value="cmip5"/>
>    <property name="experiment" value="historical"/>
>    <property name="product" value="output"/>
>    <property name="model" value="pcmdi-test"/>
>    <property name="time_frequency" value="fx"/>
>    <property name="realm" value="atmos"/>
>    <property name="ensemble" value="r0i0p0"/>
>    <property name="institute" value="PCMDI"/>
>    <property name="forcing" value="Nat"/>
>    <property name="title" value="pcmdi-test model output prepared for
> CMIP5 Historical"/>
>    <property name="creation_time" value="2010-09-01 17:28:58"/>
>    <property name="format" value="netCDF, CF-1.4"/>
>    <metadata>
>      <variables vocabulary="CF-1.0">
>        <variable name="sftlf" vocabulary_name="land_area_fraction"
> units="%">Land Area Fraction</variable>
>        <variable name=...</variable>
>      </variables>
>    </metadata>
> ...
> =================================================
>
> All the DRS identifiers are present with the exception of 'variable'.
> But the variables are defined in the <variables> tag, so from this the
> relevant DRS atomic datasets can be inferred.
>
> The one caveat is version. In the above example, the dataset_version
> is the version associated with the 'ensemble level' dataset (what
> you're calling the TDS dataset) as published on the gateway. From
> looking at the DRS document, I *believe* that the DRS version is meant
> to be relative to the atomic dataset (correct me if I'm wrong).
> However, there are no explicit 'atomic level' versions generated or
> stored in ESG.
>
> Best regards,
>
> Bob
>
> Martina Stockhause wrote:
>> Hi Bob, dear all,
>>
>> for the QC a defined and stable DRS syntax and a clear mapping of other
>> IDs (TDS and metafor) to the DRS are required.
>>
>> The granularity of the quality checks is the atomic dataset. QC level 3
>> (STD-DOI) is assigned on the experiment level but includes references to
>> every netcdf file (or chunk) of this experiment. For a correct
>> identification of the files, which belong to the DOI experiment, the DRS
>> definition has to include DRS levels: 'experiment, atomic dataset,
>> netcdf file'. We extract the metadata of the data from the TDS XML. So
>> we have to identify the names of experiment, atomic dataset (aggregation
>> of netcdf files) and netcdf file by the names or identifiers used by the
>> TDS (e.g. with field dataset_ID). Presently, there are differences
>> between them.
>>
>> I attached a figure visualizing the differences in DRS and TDS field
>> 'dataset_ID'.
>>
>> ----------
>> DRS-Syntax: We agreed on moving the <version> from the position behind
>> the atomic dataset to the position behind the realm, i.e.
>>
>> cmip5.<product>.<institute>.<model>.<experiment>.<frequency>.<realm>.<version>.<ensemble>.<variable>.<netcdf>
>>
>>
>> Karl, could you please update the DRS document?
>>
>> Bob, why does the TDS realm version include the ensemble member?
>>
>> cmip5.<product>.<institute>.<model>.<experiment>.<frequency>.<realm>.<ensemble>.<version>
>>
>>
>> So, in the DRS a realm like atmos of an experiment performed with a
>> specific Earth System model has a version, but in the TDS every
>> realization of this experiment (ensemble member) has its own version.
>>
>> ----------
>> Mapping TDS dataset ID to DRS syntax:
>> Down to the realm the TDS ID is identical to the DRS syntax; below
>> that level it is not.
>> Bob, I need a clear mapping direction for the atomic dataset and chunk
>> levels.
>> We figured out the following mapping from our example publication. Can
>> you verify that?
>>
>> atomic dataset ID:
>>
>> cmip5.<product>.<institute>.<model>.<experiment>.<frequency>.<realm>.<ensemble>.<variable>.1.aggregation
>>
>>
>> Question: Is the '1' identical with the realm.ensemble version 'v1'? If
>> not, what does it mean and where do we find the <version>?
>>
>> chunk dataset ID:
>>
>> cmip5.<product>.<institute>.<model>.<experiment>.<frequency>.<realm>.<ensemble>.<version>.<netcdf>
>>
>>
>> Question: The <variable> is left out there and can be extracted as first
>> part of the chunk name split by '_'?
>>
>> By the way, could you explain, why the TDS dataset IDs are not identical
>> with the DRS syntax any longer?
>>
>> If we do not have a clear definition of the DRS syntax and mapping
>> directions between TDS IDs and DRS syntax, we cannot assign DOIs to a
>> distinct set of chunks organized according to the DRS syntax (for which
>> we have the required quality information).
>>
>> Alternatively, we would have to scan the netcdf data headers and build
>> aggregations for the atomic dataset level again. We would like to avoid
>> that additional effort.
>>
>> I need answers and stability in the DRS syntax and TDS IDs soon. I
>> suggest we speak about this on the telco on 14 or 21 September.
>>
>>
>> Best wishes,
>> Martina
>>
>>
>>
>> -------- Original Message --------
>> Subject:     Re: [Go-essp-tech] DRS structure
>> Date:     Thu, 26 Aug 2010 08:18:54 +0200
>> From:     Martina Stockhause <martina.stockhause at zmaw.de>
>> To:     stephen.pascoe at stfc.ac.uk
>> CC:     go-essp-tech at ucar.edu
>> References:     <4C7284B9.6090005 at zmaw.de> <4C7367BF.3060906 at zmaw.de>
>> <EB1E7CB92F5B35459E0B926D2A614DB60BD053A6 at EXCHANGE19.fed.cclrc.ac.uk>
>> <4C73969B.1030409 at zmaw.de>
>> <EB1E7CB92F5B35459E0B926D2A614DB60CE189F1 at EXCHANGE19.fed.cclrc.ac.uk>
>>
>>
>>
>> Good Morning, Stephen, hi, Bob,
>>
>> what about the position of <ensemble> in the DRS syntax?
>>
>> Does it move behind <realm>? Or behind <realm>.<version>?
>> I.e.
>> cmip5.<product>.<institute>.<model>.<experiment>.<frequency>.<realm>.<version>.<ensemble>.<variable>.<netcdf>
>>
>> or
>> cmip5.<product>.<institute>.<model>.<experiment>.<frequency>.<realm>.<ensemble>.<version>.<variable>.<netcdf>
>>
>>
>> Or do we leave it after <variable> as documented in
>> http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf?
>> I.e.
>> cmip5.<product>.<institute>.<model>.<experiment>.<frequency>.<realm>.<version>.<variable>.<ensemble>.<netcdf>
>>
>>
>> It is important for me to know the position of <variable> (atomic
>> dataset) relative to the <experiment> and the <netcdf> (chunks).
>>
>> Thanks a lot,
>> Martina
>>
>>
>> stephen.pascoe at stfc.ac.uk wrote:
>>  
>>> Hi Martina,
>>>
>>> Which TDS server are you working with?  Is it one at DKRZ?  Everything
>>> below is based on what we've been doing at BADC with the CMIP3 dataset
>>> and MOHC's CMIP5 test data.
>>>
>>>      
>>>> Example: I get a QC result for the atomic dataset in the directory
>>>> CMIP5/output/MPI-M/ECHAM6-MPIOM-LR/rcp45/mon/atmos/pr
>>>> How do I find the TDS ID for it?
>>>> ID="CMIP5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.pr.1.aggregation"
>>>> urlPath="CMIP5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.pr.1.aggregation"
>>>> Is this the structure, how it will remain? Then I can cut the last
>>>> two.
>>>
>>> The directory should be part of the dataset with
>>> dataset_id="cmip5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos".  There
>>> will be 1 or more versions of that dataset with THREDDS catalogue names
>>>   cmip5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.v1.xml
>>>   cmip5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.v2.xml
>>>   etc.
>>>
>>> Within each catalogue there is a dataset element for the realm-dataset
>>> containing a dataset element for each file.
>>> I'm not sure you can use the aggregation datasets to represent
>>> atomic-datasets.  To be honest I haven't looked at them in detail.
>>>
>>>      
>>>> Will the directory structure change to move the version behind the
>>>> realm as well? In my example:
>>>> CMIP5/output/MPI-M/ECHAM6-MPIOM-LR/rcp45/mon/atmos/v1/pr
>>>
>>> Yes, the BADC datanode doesn't have this at the moment for CMIP3 data
>>> because it would be time-consuming to change after the fact.  However,
>>> our UKMO test runs are putting the version directory where you say. 
>>> I guess this only answers part of your questions but I hope it helps.
>>>
>>> S.
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> British Atmospheric Data Centre
>>> Rutherford Appleton Laboratory
>>>
>>> -----Original Message-----
>>> From: Martina Stockhause [mailto:martina.stockhause at zmaw.de]
>>> Sent: 24 August 2010 10:54
>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>> Cc: estanislao.gonzalez at zmaw.de; drach at llnl.gov; go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] DRS structure
>>>
>>> Hi, Stephen,
>>>
>>> I'd like to stay with the TDS XML if I can, because there are a lot of
>>> open issues in the QC workflow. Or you convince me that I find more
>>> suitable information in the postgres db.
>>>
>>> Example: I get a QC result for the atomic dataset in the directory
>>> CMIP5/output/MPI-M/ECHAM6-MPIOM-LR/rcp45/mon/atmos/pr
>>> How do I find the TDS ID for it?
>>> ID="CMIP5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.pr.1.aggregation"
>>>
>>> urlPath="CMIP5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.pr.1.aggregation"
>>> Is this the structure, how it will remain? Then I can cut the last two.
>>>
>>> And all information of the datasets belonging to the atomic dataset?
>>> ID="CMIP5.output.MPI-M.ECHAM6-MPIOM-LR.rcp45.mon.atmos.v1.pr_Amon_ECHAM6-MPIOM-LR_rcp45_r1_195501-199412.nc"
>>> urlPath="atmos/CMIP5/output/MPI-M/ECHAM6-MPIOM-LR/rcp45/mon/atmos/pr/r1/pr_Amon_ECHAM6-MPIOM-LR_rcp45_r1_195501-199412.nc"
>>> There I can identify the datasets belonging to this atomic dataset
>>> using the urlPath.
>>>
>>> Will the directory structure change to move the version behind the
>>> realm as well? In my example:
>>> CMIP5/output/MPI-M/ECHAM6-MPIOM-LR/rcp45/mon/atmos/v1/pr
>>>
>>> Thanks a lot in advance for clarifying that.
>>> Best wishes,
>>> Martina
>>>
>>>
>>> stephen.pascoe at stfc.ac.uk wrote:
>>>      
>>>> Hi Martina,
>>>>
>>>> For efficiency reasons we need to publish multiple variables as one
>>>> dataset, therefore the dataset_id won't contain a variable identifier.
>>>> However, the variable names are contained in the THREDDS XML inside
>>>> metadata/variable tags so they are still available.
>>>>
>>>> I would recommend you don't rely on the syntax of dataset_ids.  Either
>>>> get the DRS attributes from THREDDS <property> elements or bypass the
>>>> XML entirely and inspect the publisher database.  The THREDDS
>>>> properties are tied directly to the DRS attributes CMOR creates so they
>>>> will be much less likely to be wrong due to misconfiguration.
>>>>
>>>> I know the DRS document is out of date but the syntax should be stable
>>>> -- We'll sort out the confusion Estani has just pointed out ASAP.
>>>>
>>>      
>>>> Cheers,
>>>> Stephen.
>>>>
>>>> -----Original Message-----
>>>> From: Martina Stockhause [mailto:martina.stockhause at zmaw.de]
>>>> Sent: Tue 8/24/2010 7:33 AM
>>>> To: Estanislao Gonzalez
>>>> Cc: Bob Drach; Pascoe, Stephen (STFC,RAL,SSTD); go-essp-tech at ucar.edu
>>>> Subject: Re: [Go-essp-tech] DRS structure
>>>>  
>>>> Dear all,
>>>>
>>>> we really need to fix the DRS structure and how the DRS syntax is
>>>> reflected in the TDS catalogue.
>>>>
>>>> During the QC, which runs in the file system with DRS syntax, I
>>>> need to have a connection to the TDS to check the consistency of
>>>> data against metadata after the automated checks. Since I don't
>>>> want to touch each dataset again, I take the TDS metadata as
>>>> reference for data content.
>>>>
>>>> Up to now it was possible to take the dataset_id as DRS name out of
>>>> the TDS in the atomic dataset (TDS aggregation = QC result level)
>>>> and the netcdf file level.
>>>>
>>>> Now the <variable> part of the DRS is missing in the dataset_id of the
>>>> netcdf file, so that I am about to take the urlPath instead.
>>>> Is that ok?
>>>>
>>>> Why can't we use the DRS syntax as IDs in the TDS and in metafor? That
>>>> would make things much easier.
>>>>
>>>> The DRS syntax is my connection from the QC checked files to the
>>>> TDS and to metafor. Therefore the DRS syntax should be fixed soon
>>>> and documented in the DRS document. So, that we can start to adapt
>>>> our examples and scripts.
>>>>
>>>> Best wishes,
>>>> Martina
>>>>
>>>>
>>>> Estanislao Gonzalez wrote:
>>>>            
>>>>> Hi all,
>>>>>
>>>>> I've realized we've been moving things from one place to another
>>>>> regarding the DRS components, and the DRS Reference Syntax
>>>>> document (from 7/4/2010) does not reflect these changes.
>>>>>
>>>>> There are two major differences here:
>>>>> 1) versioning: the drslib tool is creating a structure which is,
>>>>> for the time being, not DRS-conformant. I totally agree with the new
>>>>> version-component placement, but should that not be reflected in
>>>>> the DRS syntax document?
>>>>> 2) CMIP5 Best Practices for Data Publication states that the
>>>>> dataset_id should be:
>>>>> cmip5.<product>.<institute>.<model>.<experiment>.<time_frequency>.<realm>.<ensemble>
>>>>> I know the dataset_id is not necessarily required to
>>>>> match any DRS structure. But I personally think we should avoid
>>>>> DRS-like identifiers, as IMHO they increase confusion.
>>>>> I think this solution helps solving some publishing problems, but
>>>>> defines a new dataset level, the "ensemble dataset". And the
>>>>> realm-dataset is not being used anywhere else (or am I missing
>>>>> something?)
>>>>>
>>>>> I'm not aware of the reasons behind the definition of the DRS
>>>>> structure as it currently is. But I think, we should avoid
>>>>> drifting away from that document. In any case the document should
>>>>> be updated first.
>>>      
>>>>> If I try to join all changes and proposals I've heard of, AFAICT
>>>>> the DRS structure we are heading toward looks something like:
>>>>> cmip5.<product>.<institute>.<model>.<experiment>.<time_frequency>.<realm>.<ensemble>.<version>.<variable>
>>>>>
>>>>> Which is different from the original:
>>>>> cmip5.<product>.<institute>.<model>.<experiment>.<time_frequency>.<realm>.<variable>.<ensemble>.<version>
>>>>>
>>>>> Can anyone with more knowledge on the subject comment on this?
>>>>>
>>>>> Thanks,
>>>>> Estani
>>>>>
>>>>>                     
>>>>             


