[Go-essp-tech] drs, cmor, realms, and atomic datasets and components

Tue Sep 22 15:32:57 MDT 2009

Hi Bryan,
I suspect we will need to talk more about this at GO-ESSP; it helps
to be face to face.

As a general comment, I can say that the Gateway does not assume  
anything at all about the structure of the datasets, and it tries to  
ingest anything that is passed to it by the publisher. So, whatever  
the publisher builds as dataset hierarchy, or flags as atomic  
datasets, and whatever properties the generated catalogs contain, the  
Gateway will be able to ingest.
Now, as for your specific questions:

On Sep 22, 2009, at 1:37 PM, Bob Drach wrote:

> Hi Bryan,
>
> I'll answer from the perspective of publication:
>
> The publisher has only three hardwired components (using the DRS  
> terminology):
>
> - project (== activity)
> - experiment
> - model
>
> Each of these must be defined for a given dataset, and there are  
> tables for describing the permitted values. The remainder of the  
> components are configured on a per-project basis. So for the CMIP5/ 
> AR5 project, <institute>, <frequency>, <modelling_realm> etc. are  
> defined specifically for CMIP5. In addition to the static project  
> configuration, there will be a CMIP5 handler - implemented as a  
> Python class - that specifies how to look inside a data file,  
> validate it as a CMIP5 file, and discover any additional metadata  
> not otherwise determined from directory names and command line  
> arguments.
>
> Within the per-project configuration there is a field 'dataset_id',  
> a template for construction of dataset identifiers. If for CMIP5  
> this is defined similarly to the DRS spec then datasets will  
> correspond to the DRS definition.
>
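As a rough illustration of the 'dataset_id' template idea, here is a minimal Python sketch. It assumes a %-style template keyed on DRS component names; the template string, the field names, and the sample values are all hypothetical and are not taken from the actual CMIP5 publisher configuration.

```python
# Hypothetical per-project template; the real CMIP5 configuration may differ.
DATASET_ID_TEMPLATE = (
    "%(project)s.%(institute)s.%(model)s.%(experiment)s."
    "%(frequency)s.%(modelling_realm)s"
)


def build_dataset_id(template, fields):
    """Fill a %-style template from a dict of component values."""
    try:
        return template % fields
    except KeyError as missing:
        # Report which component the template needed but did not get.
        raise ValueError("missing component: %s" % missing.args[0])


# Sample values are made up for illustration only.
fields = {
    "project": "cmip5",
    "institute": "EXAMPLE-INST",
    "model": "example-model",
    "experiment": "historical",
    "frequency": "mon",
    "modelling_realm": "atmos",
}
dataset_id = build_dataset_id(DATASET_ID_TEMPLATE, fields)
print(dataset_id)
```

With a template like this, which components appear in the identifier (and therefore what counts as one dataset) is decided by the project configuration rather than by the publisher code.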
> On Sep 22, 2009, at 9:47 AM, Bryan Lawrence wrote:
>
>>
>> Hi Folks (probably Luca, Bob primarily)
>>
>> I'm about to ask some questions, but in order to be very accurate,  
>> I need some definitions:
>>
>> An atomic dataset defines a variable from a single model run. The  
>> breakdown of components
>> in a CMIP5 DRS-compliant dataset looks like ...
>> <activity>/<institute>/<model>/<experiment>/<frequency>/
>> <modeling realm>/<variable>/<ensemble member>/<version>/[<endpoint>],
>>
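The component breakdown above can be sketched in Python as a small path parser. This is only an illustration under stated assumptions: the function name, the underscore spellings of the component names, and the sample path are hypothetical, and real CMOR output may differ in detail.

```python
# Component order follows the DRS breakdown quoted above.
DRS_COMPONENTS = [
    "activity", "institute", "model", "experiment", "frequency",
    "modelling_realm", "variable", "ensemble_member", "version",
]


def parse_drs_path(path):
    """Split a DRS-style directory path into named components."""
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) < len(DRS_COMPONENTS):
        raise ValueError("path has too few components: %r" % path)
    parsed = dict(zip(DRS_COMPONENTS, parts))
    # Anything after <version> is treated as the optional <endpoint>.
    endpoint = "/".join(parts[len(DRS_COMPONENTS):])
    parsed["endpoint"] = endpoint or None
    return parsed


# Hypothetical path, for illustration only.
example = parse_drs_path(
    "CMIP5/EXAMPLE-INST/example-model/historical/mon/atmos/tas/r1i1p1/v1/tas_example.nc"
)
print(example["modelling_realm"], example["variable"], example["endpoint"])
```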
>> I believe CMOR is writing directory hierarchies that look like  
>> that. For now I'm interested in <modelling_realm> which is a tag  
>> that comes from the *primary* realm associated with a variable in  
>> the CMOR tables.
>>
>> In terms of cataloguing, from Luca's comments, I *think* ESG was  
>> planning on aggregating these up so that a dataset in their  
>> catalogue looks like the aggregation of all variables in a given
>> modelling realm (for a given ensemble member and version), and the
>> idea was that one could browse between datasets and their modelling
>> realms.
>
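The aggregation described here could look roughly like the following Python sketch, which groups atomic datasets (one per variable) by modelling realm, ensemble member and version. The record layout, the function name and the sample values are assumptions made for illustration; they are not the ESG catalogue schema.

```python
from collections import defaultdict


def aggregate_by_realm(atomic_datasets):
    """Group variables by (modelling realm, ensemble member, version)."""
    groups = defaultdict(list)
    for ds in atomic_datasets:
        key = (ds["modelling_realm"], ds["ensemble_member"], ds["version"])
        groups[key].append(ds["variable"])
    # Sort variable names so the result is stable.
    return {key: sorted(names) for key, names in groups.items()}


# Made-up atomic datasets, one per variable.
atomic = [
    {"variable": "tas", "modelling_realm": "atmos",
     "ensemble_member": "r1i1p1", "version": "v1"},
    {"variable": "pr", "modelling_realm": "atmos",
     "ensemble_member": "r1i1p1", "version": "v1"},
    {"variable": "tos", "modelling_realm": "ocean",
     "ensemble_member": "r1i1p1", "version": "v1"},
]
aggregated = aggregate_by_realm(atomic)
for key, variables in sorted(aggregated.items()):
    print(key, variables)
```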
> This is certainly do-able, given the comments above.
>
>>
>> This is because metafor (and curator) also have the concept of  
>> modelling realms, and these are the "top level" components within  
>> the model.
>>
>> I think there was an assumption that these two uses of  
>> modelling_realm were the same. As of today they're not quite. I'll  
>> get back to that.
>>
>> CMOR also has the concept of secondary realms, that is, one can tag  
>> a variable with more than one realm.
>>
>> So the first of my questions:
>> 1) Is ESG using those secondary realms at all in the catalogue (or  
>> planning to do so)?
>
> The plan is to publish all CMOR-generated metadata.
... And we can make the gateway ingest all of that metadata as well,  
as it is encoded in the THREDDS catalogs. The sooner we have examples  
with the full metadata, the better.
We also need to decide which of these metadata will be interesting for  
searching, such that they can be made into facets.

>
>> 2) Do they make it to the catalogue via ESG publisher?
>
> They will make it into THREDDS catalogs as properties. They can
> then be harvested into the gateway database, although I don't  
> believe this is being done at the moment.
The gateway currently has a table called "physical domain" which we  
can rename "realm" if it is more appropriate. I also don't think it is  
harvested though since it was not found in the THREDDS catalogs.
>
>> 3) Is ESG providing wget scripts to get all the data in one of  
>> their aggregated datasets?
Yes... well, the user can select any files in the system and get a
wget script for them. If we wanted, we could make pre-populated wget
scripts for all files in a given dataset. We never thought of it, but
it is possible and probably a good idea.

>> 4) Is ESG providing a way of getting wget scripts for the  
>> individual atomic datasets within an aggregated dataset?
Same answer as before... wget scripts are file-based, so whatever  
files are published to the system, ESG is going to enable the creation  
of a wget script that contains them.
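To make the "file-based" point concrete, here is a minimal Python sketch of generating such a script from a list of file URLs. It is not the Gateway's actual implementation; the function name and the example.org URLs are hypothetical.

```python
import shlex


def make_wget_script(urls):
    """Return the text of a shell script with one wget call per file URL."""
    lines = [
        "#!/bin/sh",
        "# Generated download script: one wget call per file.",
    ]
    for url in urls:
        # -c lets an interrupted download resume; quote the URL for the shell.
        lines.append("wget -c " + shlex.quote(url))
    return "\n".join(lines) + "\n"


# Hypothetical file URLs; any selection of published files would work.
script = make_wget_script([
    "http://example.org/data/tas_example.nc",
    "http://example.org/data/pr_example.nc",
])
print(script)
```

Because the input is just a list of files, the same function serves a hand-picked selection, one atomic dataset, or a whole aggregated dataset.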
>>
>> Getting back to the difference between modelling realms as seen by  
>> CMOR and metafor/curator.
>>
>> 5) Does it matter if there is a slight difference between them? (At
>> the moment curator/metafor has aerosols within atmospheric
>> chemistry, CMOR has them as distinct primary realms). Either CMOR
>> could change or Metafor could change or neither could change, but
>> the balance of choosing between these options depends on the
>> answers to the five questions above, since both CMOR and
>> metafor/curator have valid reasons for the way they have done things.
>
> From the publisher perspective it doesn't matter - the CMOR-generated
> metadata will be published to the gateway. It's not clear to me if  
> it matters from a gateway search perspective.

If the Gateway is to ingest metadata via multiple methods (the
Metafor questionnaire and CMOR metadata in the THREDDS catalogs), we
might need to combine the two sources. This might mean that a single
dataset is flagged with two realms - maybe it's not a problem but it's
more a question for the modelers.
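
As a sketch of what combining the two sources might look like, the Python below merges realm tags for one dataset from both sources and keeps every distinct tag. The function name and the realm strings are hypothetical; whether two tags on one dataset is acceptable is still the open question.

```python
def merge_realms(cmor_realms, metafor_realms):
    """Combine realm tags from both sources, keeping first-seen order."""
    merged = []
    for realm in list(cmor_realms) + list(metafor_realms):
        if realm not in merged:
            merged.append(realm)
    return merged


# Illustrative case: CMOR tags aerosols as their own realm, while the
# questionnaire places them under atmospheric chemistry.
realms = merge_realms(["aerosol"], ["atmospheric chemistry"])
print(realms)
```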

thanks, luca
>
> Best regards,
>
> Bob
>
>>
>> thanks
>> Bryan
>>
>> -- 
>> Bryan Lawrence
>> Director of Environmental Archival and Associated Research
>> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
>> STFC, Rutherford Appleton Laboratory
>> Phone +44 1235 445012; Fax ... 5848;
>> Web: home.badc.rl.ac.uk/lawrence
>