[Go-essp-tech] DRS syntax into ESG

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Thu Nov 5 22:47:11 MST 2009


Hi Folks

This is precisely the issue that I was bringing up last Tuesday, which is the action for the ESG gateway team on page 5 (thanks for getting it going Luca) and which I was discussing with respect to figure 2.

The key issue for me is what is the concept of a dataset in the gateway, and can we have multiple views (aka datasets) onto the physical hierarchy?

What we want is to be able to get, in as few clicks as possible, a download script for (or a view of the metadata for) *at least*:
 - exactly the atomic dataset I want
 - datasets corresponding to all the collections going up the DRS hierarchy
 AND
 - atomic datasets for a specific variable for all simulations carried out by all models in a specific experiment.
 - all atomic datasets in a specific realm for all simulations carried out by all models in a specific experiment.

So the key is that you need to offer multiple views on the data ... which are essentially virtual collections.
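
A minimal sketch of that idea, under stated assumptions: each atomic dataset is a record of its DRS fields (in the order given by the DRS URL quoted further down this thread), and a virtual collection is just a set of constraints on those fields, evaluated on demand. The names AtomicDataset, parse_drs_path and virtual_collection, and all the sample paths and field values, are hypothetical; this is not how the gateway or publisher is actually implemented.

```python
from typing import Iterable, List, NamedTuple


class AtomicDataset(NamedTuple):
    """One atomic dataset, described by the DRS fields (hypothetical record type)."""
    activity: str
    institute: str
    model: str
    experiment: str
    frequency: str
    realm: str
    variable: str
    ensemble: str
    version: str


def parse_drs_path(path: str) -> AtomicDataset:
    """Split a DRS-style path (host and endpoint already removed) into fields."""
    parts = [p for p in path.split("/") if p]
    if len(parts) != len(AtomicDataset._fields):
        raise ValueError(
            f"expected {len(AtomicDataset._fields)} DRS fields, got {len(parts)}"
        )
    return AtomicDataset(*parts)


def virtual_collection(
    datasets: Iterable[AtomicDataset], **constraints: str
) -> List[AtomicDataset]:
    """Return the atomic datasets whose fields match every given constraint."""
    unknown = set(constraints) - set(AtomicDataset._fields)
    if unknown:
        raise ValueError(f"not DRS fields: {sorted(unknown)}")
    return [
        d for d in datasets
        if all(getattr(d, name) == value for name, value in constraints.items())
    ]


# Made-up sample catalogue; none of these values are real CMIP5 vocabulary.
CATALOG = [
    parse_drs_path(p)
    for p in (
        "cmip5/inst_a/model_a/exp1/mon/atmos/tas/r1/v1",
        "cmip5/inst_b/model_b/exp1/mon/atmos/tas/r1/v1",
        "cmip5/inst_a/model_a/exp1/mon/atmos/pr/r1/v1",
        "cmip5/inst_a/model_a/exp2/mon/ocean/tos/r1/v1",
    )
]

# One variable, all models and simulations in one experiment.
tas_in_exp1 = virtual_collection(CATALOG, experiment="exp1", variable="tas")

# One realm, all models and simulations in one experiment.
atmos_in_exp1 = virtual_collection(CATALOG, experiment="exp1", realm="atmos")
```

In this sketch a collection is only a query over the atomic dataset records, so the same physical files can appear in any number of views without being reorganised on disk.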

(With a view to the future, one *wouldn't* version the virtual collections, because at the end of the day, all the virtual collections correspond to version controlled atomic datasets, and so the atomic datasets are the only thing that needs versioning).

Cheers
Bryan


On Thursday 05 November 2009 21:01:39 Bob Drach wrote:
> OK, I'll discuss it with Dean and Karl, and come up with some ideas.  
> Thanks,
> 
> Bob
> 
> On Nov 5, 2009, at 11:59 AM, Luca Cinquini wrote:
> 
> > Hi Bob,
> > 	I think we can do pretty much anything, we just need to be clear  
> > on what the requirements are. I agree that 9 clicks might be too  
> > much, but maybe 3 or 4 can be a good compromise between speed and  
> > overwhelming results. A matrix is possible too, for example see here:
> >
> > http://esg.ucar.edu/browse/viewProject.htm?projectId=ff3949c8-2008-45c8-8e27-5834f54be50f
> >
> > (where now all folders are eventually empty).
> >
> > Maybe, since this is mostly a CMIP5 presentation issue, you guys at  
> > PCMDI can decide on what kind of browsing/clicking you would like  
> > the users to go through, and let us know ?
> >
> > thanks, Luca
> >
> >
> > On Nov 5, 2009, at 12:53 PM, Bob Drach wrote:
> >
> >> Hi Luca,
> >>
> >> On Nov 5, 2009, at 10:57 AM, Luca Cinquini wrote:
> >>
> >>> Hi Bob,
> >>> 	how you build the dataset hierarchy really boils down to how you  
> >>> want users to browse. I was under the impression that you wanted  
> >>> users to browse the catalogs reflecting how the data was stored  
> >>> on disk, but maybe I was wrong.
> >>
> >> The browsing should be organized for user convenience - as few  
> >> clicks as necessary. If the browsing hierarchy is decoupled from  
> >> the organization on disk, then the disk hierarchy can be arranged  
> >> for convenience of publication as well. This is particularly  
> >> useful for publication of legacy data, where you don't necessarily  
> >> have control over the disk organization. So no, I don't think the  
> >> gateway browsing hierarchy should necessarily mirror the disk  
> >> organization at the data node.
> >>
> >>
> >>> You don't think it would be too confusing to have all datasets  
> >>> for a single model/experiment/frequency/realm/variable/ensemble  
> >>> be contained in the very same HTML page ?
> >>
> >> Well yes, I do think that might be confusing. But it would be  
> >> worse to have to click through nine levels of hierarchy to find a  
> >> dataset. Isn't there some intermediate representation that  
> >> balances the depth of hierarchy with information per page?
> >>
> >> For example, the hierarchy might be presented as a table of model  
> >> vs. experiment, with each table cell containing links to datasets  
> >> (or at least to a shallower hierarchy). Would that be difficult to  
> >> do?
> >>
> >> Thanks,
> >>
> >> Bob
> >>
> >>
> >>> I think for searching we all agree that what needs to be done is  
> >>> simply harvest all the fields in the database/triple store and  
> >>> then expose the corresponding facets.
> >>> thanks, Luca
> >>>
> >>> On Nov 5, 2009, at 11:28 AM, Bob Drach wrote:
> >>>
> >>>> Hi Luca,
> >>>>
> >>>> Thanks for raising the issue - I've been wondering about this too.
> >>>>
> >>>> The hierarchy of datasets as presented by the gateway - for  
> >>>> users to browse through - shouldn't necessarily be the same as  
> >>>> the hierarchy introduced by DRS. Users should be able to find  
> >>>> datasets with as few clicks as possible, which is why we just  
> >>>> went through the exercise of 'flattening' the THREDDS catalogs.
> >>>>
> >>>> The publisher already associates properties corresponding to the  
> >>>> DRS fields (model, experiment, etc.) into the catalogs, with the  
> >>>> exception of version numbers (which are coming in the next  
> >>>> release). So here's a way forward:
> >>>>
> >>>> - The publisher is configured such that the categories defined  
> >>>> for the IPCC5/CMIP5 project (activity) include the DRS fields.  
> >>>> As I said, this is already mostly true. The categories are  
> >>>> mandatory - must be resolved before publication.
> >>>> - Each catalog corresponding to a dataset has properties that  
> >>>> define these values. On publication the gateway ingests these  
> >>>> values in searchable fashion.
> >>>> - When the portal receives a DRS request, it parses the URL,  
> >>>> searches on the resulting fields, and resolves to the  
> >>>> corresponding dataset.
> >>>>
> >>>> The main point is that this can be independent of the dataset  
> >>>> hierarchy as generated during publication.
> >>>>
> >>>> Bob
> >>>>
> >>>> On Nov 5, 2009, at 4:50 AM, Luca Cinquini wrote:
> >>>>
> >>>>> Hi,
> >>>>> 	the purpose of this email is to start a conversation, and a  
> >>>>> plan of
> >>>>> action, on how to incorporate the DRS syntax into the ESG system.
> >>>>> As a reminder, the current DRS specification states that a CMIP5
> >>>>> dataset will be uniquely identified by the following URL:
> >>>>>
> >>>>> http://<hostname>/<activity>/<institute>/<model>/<experiment>/
> >>>>> <frequency>/<modeling realm>/<variable>/<ensemble member>/<version>/
> >>>>> [<endpoint>]
> >>>>>
> >>>>> where most of the <...> fields are drawn from controlled
> >>>>> vocabularies, for example:
> >>>>>
> >>>>> http://badc.nerc.ac.uk/activity/institute/model/experiment/
> >>>>> frequency/realm/varname/r1/v1/
> >>>>>
> >>>>> The first question would be what does it mean to capture the  
> >>>>> semantics
> >>>>> of the DRS syntax within ESG ? I can see at least two answers:
> >>>>>
> >>>>> a) The user is able to browse the CMIP5 datasets hierarchically
> >>>>> according to the DRS hierarchy of fields
> >>>>> b) The user is able to search for data based on facets that  
> >>>>> reflect
> >>>>> the DRS syntax: activity, institute, experiment, etc..
> >>>>>
> >>>>> So how do we get there ? A straw-man workflow could be the  
> >>>>> following:
> >>>>>
> >>>>> o) The ESG Data Node publishing client, when building the THREDDS
> >>>>> catalogs, creates a hierarchy of datasets that reflects the  
> >>>>> syntax.
> >>>>> There is probably also a need to mark up these catalogs as  
> >>>>> "DRS" or
> >>>>> "CMIP5".
> >>>>> o) The ESG Gateway, when parsing these catalogs, invokes a  
> >>>>> specific
> >>>>> handler that creates the same datasets hierarchy (this is actually
> >>>>> automatic, I believe), and additionally associates corresponding
> >>>>> objects at each level of the hierarchy. For example, at first  
> >>>>> level
> >>>>> the dataset will be associated with an activity, at second  
> >>>>> level with
> >>>>> an institute, and so on. An alternative way would be to  
> >>>>> associate all
> >>>>> the objects only to the leaf level dataset.
> >>>>> o) When the metadata for the leaf nodes datasets is harvested  
> >>>>> into RDF
> >>>>> triples for searching, the dataset - object associations must be
> >>>>> transferred to the triple store.
> >>>>> o) Specific CMIP5 facets can be configured to search by DRS fields
> >>>>> (perhaps only on the PCMDI Gateway, or perhaps on all gateways).
> >>>>>
> >>>>> As mentioned, this is just a start. I do believe though that  
> >>>>> this is
> >>>>> an extremely important issue that must be tackled as soon as  
> >>>>> possible.
> >>>>>
> >>>>> thanks, Luca
> >>>>>
> >>>>> _______________________________________________
> >>>>> GO-ESSP-TECH mailing list
> >>>>> GO-ESSP-TECH at ucar.edu
> >>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> 



-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence

