[Go-essp-tech] [esg-node-dev] Use of <metadata> element in THREDDS catalogs

stephen.pascoe at stfc.ac.uk stephen.pascoe at stfc.ac.uk
Wed Jun 1 10:06:36 MDT 2011


> Sorry for being confusing: the system already indexes the full content of THREDDS catalogs into records of type Dataset and File. Right now,
> the information is NOT propagated from Dataset to Files. We could do so to make the search for Files more powerful: this will allow you to
> search for Files by employing metadata that is defined at the Dataset level.

Since the mechanism exists and is documented in the THREDDS schema, it seems the correct approach would be to propagate properties in <metadata> elements but not those that are part of the top-level <dataset> element.  However, we then have the problem that existing ESG THREDDS catalogs don't put properties in <metadata> elements.

I can see the practicality, but this exposes a deficiency of Solr.  As soon as you split your index into 2 types you sort of want to join them like you would 2 database tables.  However, that can't be done easily, so you suggest copying content from one into the other.  Relational databases deal with this problem easily.
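
As a rough illustration of the join I have in mind, here is a minimal sketch using SQLite from the Python standard library. The table and column names (dataset, file, experiment, frequency) and the sample rows are made up for the example; they are not the actual ESG or Solr schema.

```python
import sqlite3

# In-memory database with two hypothetical tables, one per record type.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id TEXT PRIMARY KEY, experiment TEXT, frequency TEXT);
CREATE TABLE file (id TEXT PRIMARY KEY, dataset_id TEXT REFERENCES dataset(id));
INSERT INTO dataset VALUES ('ds1', 'X', 'Y'), ('ds2', 'other', 'Y');
INSERT INTO file VALUES ('a.nc', 'ds1'), ('b.nc', 'ds1'), ('c.nc', 'ds2');
""")

# Search for files using constraints that are stored only on the dataset:
# the join does the work, so nothing is copied onto the file rows.
rows = conn.execute(
    """
    SELECT file.id FROM file
    JOIN dataset ON dataset.id = file.dataset_id
    WHERE dataset.experiment = ? AND dataset.frequency = ?
    ORDER BY file.id
    """,
    ("X", "Y"),
).fetchall()
matching_files = [row[0] for row in rows]
print(matching_files)
```

With a flat index and no such join, the alternative is the one under discussion: copy the dataset values onto every file record at indexing time.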

S.

---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK

From: Cinquini, Luca (3880) [mailto:Luca.Cinquini at jpl.nasa.gov]
Sent: 01 June 2011 16:55
To: Pascoe, Stephen (STFC,RAL,RALSP)
Cc: Roland.Schweitzer at noaa.gov; esg-node-dev at lists.llnl.gov; go-essp-tech at ucar.edu
Subject: Re: [esg-node-dev] Use of <metadata> element in THREDDS catalogs

Hi Stephen:

On Jun 1, 2011, at 9:33 AM, <stephen.pascoe at stfc.ac.uk> wrote:


Hi Luca,

I don't think we should rush to implement before we understand what's being suggested.
Well that is how the system already works: index all metadata found in the catalogs. I didn't mean to rush anything :).


From my first reading of your email you seem to contradict yourself: p2p search doesn't index files but Charles wants you to enhance the "search for files".  I clearly don't understand what is being indexed.  From what you say I am guessing that the indexer grabs all <property> elements whether they are children of a top-level <dataset> element, a file-level <dataset> element or a <metadata> element.

Sorry for being confusing: the system already indexes the full content of THREDDS catalogs into records of type Dataset and File. Right now, the information is NOT propagated from Dataset to Files. We could do so to make the search for Files more powerful: this will allow you to search for Files by employing metadata that is defined at the Dataset level.


I'm wary of the approach of propagating all metadata to all levels of the system.  My instinct is we should decide what belongs at the dataset level and stick with it.
That's why we haven't done it so far.... It's a philosophical decision...

thanks, Luca



Cheers,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK

From: Cinquini, Luca (3880) [mailto:Luca.Cinquini at jpl.nasa.gov]
Sent: 01 June 2011 16:18
To: Pascoe, Stephen (STFC,RAL,RALSP)
Cc: Roland.Schweitzer at noaa.gov; esg-node-dev at lists.llnl.gov; go-essp-tech at ucar.edu
Subject: Re: [esg-node-dev] Use of <metadata> element in THREDDS catalogs

Hi Stephen,
            to answer some of your questions...

o The p2p index will harvest all properties in the THREDDS catalogs. In fact, I was able to run a quick job and ingest that catalog in our prototype system - you can search for "cordex" at this URL:

http://esg-datanode.jpl.nasa.gov/esgf-web-fe/

As you can see, I have defined two facets: "CORDEX_domain" and "Frequency" (upper case!) that relate to the metadata in that catalog. As I was mentioning, the metadata just flows through.

o Note that I think some of the metadata property names should really be lower case, instead of upper case... at least that's the CMIP5 convention. Of course we could change the case while parsing the catalogs.

o Your last point about inheriting metadata is exactly what we were discussing with Charles and others in previous days. Charles asked that, in order to make the search for files more powerful, we tag all files that belong to a dataset with the properties that belong to the dataset: this way, you could make a search for files subject to the constraints experiment=X, frequency=Y and model=Z. This is something that is not difficult to do, but we haven't done it yet because it means "interpreting" the catalogs as opposed to just "parsing" them. But it looks like there is enough momentum behind this requirement that we should go ahead and do it...

o Finally, note that so far the p2p search only looks for Datasets - this is to limit the number of results. We could as well look for Files, if we wanted, from the web interface.
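
To make the tagging idea concrete, here is a minimal sketch in plain Python. The record layout (dicts with "id", "type" and "properties" keys), the function names and the sample values are all assumptions for illustration, not the actual p2p index code. It also lower-cases property names while copying, per the point above about case.

```python
def normalise(props):
    """Lower-case property names (the CMIP5 convention mentioned above)."""
    return {name.lower(): value for name, value in props.items()}


def propagate(dataset_record, file_records):
    """Return new File records tagged with the dataset's properties.

    Assumption: on a name clash, the file-level value wins.
    """
    inherited = normalise(dataset_record["properties"])
    tagged = []
    for f in file_records:
        merged = dict(inherited)
        merged.update(normalise(f["properties"]))
        tagged.append({"id": f["id"], "type": "File", "properties": merged})
    return tagged


def search(records, **constraints):
    """Return ids of records whose properties match every constraint."""
    return [
        r["id"]
        for r in records
        if all(r["properties"].get(k) == v for k, v in constraints.items())
    ]


# Hypothetical records as they might look after plain "parsing".
dataset = {
    "id": "ds1",
    "type": "Dataset",
    "properties": {"experiment": "X", "Frequency": "Y", "model": "Z"},
}
files = [
    {"id": "f1.nc", "type": "File", "properties": {"size": "10"}},
    {"id": "f2.nc", "type": "File", "properties": {"size": "20"}},
]

before = search(files, experiment="X", frequency="Y", model="Z")
after = search(propagate(dataset, files), experiment="X", frequency="Y", model="Z")
print(before, after)
```

Before propagation the file search finds nothing, because the constraints live only on the Dataset record; after propagation both files match.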

thanks, Luca


On Jun 1, 2011, at 7:59 AM, <stephen.pascoe at stfc.ac.uk> wrote:



(note I CC'd gonzalez at dkrz.de by mistake -- I meant go-essp)

Hi Roland,

I suppose what I'm getting at is: would the Gateway detect that driving_model_id=ERAINT should result in a facet value for that dataset, or would it just ignore it?  Also I think the P2P index node will index files and datasets separately.  In theory it should therefore add this facet to both the dataset and all files it contains, but will it now and should it in the future?

More generally, do we want to use this inheritance feature for key/value pairs that result in facets in our user interfaces and search APIs?  This gets to an underlying design decision about what information we expose at the file level and what at the dataset level.  It is the case that each CMIP5 file has a model_id but this property isn't exposed in the THREDDS catalogs as a file property, only as a dataset property.

Cheers,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK

From: Roland Schweitzer [mailto:Roland.Schweitzer at noaa.gov]
Sent: 01 June 2011 14:49
To: Pascoe, Stephen (STFC,RAL,RALSP)
Cc: esg-node-dev at lists.llnl.gov; gonzalez at dkrz.de
Subject: Re: [esg-node-dev] Use of <metadata> element in THREDDS catalogs

Hi All,

I agree that we need to formalize an ESG profile.

To that end, the THREDDS XML schema allows for the inheritance of metadata to be controlled by an attribute.  And the schema allows for more than one <metadata> element with different inheritance in a particular <dataset>.  Perhaps all that is needed is to get the inheritance right.
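
As a hedged illustration of that attribute, the sketch below resolves properties for nested datasets with Python's standard xml.etree module. The catalog fragment, dataset IDs and property values are invented for the example, and the resolution rule it implements (a <metadata inherited="true"> block applies to the containing dataset and everything nested in it, while other properties stay on the containing dataset only) is my reading of the THREDDS catalog specification, not what the Gateway or P2P index actually does.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

# Hypothetical catalog fragment; IDs, names and values are made up.
CATALOG = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="cordex.example" ID="cordex.example">
    <property name="project" value="CORDEX"/>
    <metadata inherited="true">
      <property name="driving_model_id" value="ERAINT"/>
    </metadata>
    <dataset name="file1.nc" ID="cordex.example.file1.nc">
      <property name="size" value="1024"/>
    </dataset>
  </dataset>
</catalog>
"""


def metadata_properties(dataset, inherited_only):
    """Collect properties from the dataset's <metadata> children."""
    props = {}
    for md in dataset.findall(NS + "metadata"):
        if inherited_only and md.get("inherited") != "true":
            continue
        for p in md.findall(NS + "property"):
            props[p.get("name")] = p.get("value")
    return props


def resolve(dataset, inherited=None, out=None):
    """Map each dataset ID to the properties that apply to it."""
    out = {} if out is None else out
    inherited = dict(inherited or {})
    # Properties that apply to this dataset: inherited ones, its direct
    # <property> children, and every <metadata> block it contains.
    effective = dict(inherited)
    for p in dataset.findall(NS + "property"):
        effective[p.get("name")] = p.get("value")
    effective.update(metadata_properties(dataset, inherited_only=False))
    out[dataset.get("ID")] = effective
    # Only inherited="true" metadata is passed on to nested datasets.
    inherited.update(metadata_properties(dataset, inherited_only=True))
    for child in dataset.findall(NS + "dataset"):
        resolve(child, inherited, out)
    return out


root = ET.fromstring(CATALOG)
resolved = {}
for top in root.findall(NS + "dataset"):
    resolve(top, out=resolved)
print(resolved)
```

Here the nested file dataset picks up driving_model_id from the inherited block but not project, which is declared directly on the parent.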

But isn't it the case, in the example you sent, that the inheritance is in fact correct?  A variable in this data set has the property driving_model_id=ERAINT, for example.  What are the properties that were added that should not be inherited?

Roland

On 06/01/2011 04:18 AM, stephen.pascoe at stfc.ac.uk wrote:
Hi all,

I've just received an excellent example of why we need to formalise an ESG profile for THREDDS catalogs.  Henrik Wiberg has added some extra THREDDS properties to support the CORDEX project (see the attached email for links).  He's put these properties in a <metadata> element within the top-level dataset element.  This is valid THREDDS but I'm not sure what ESG would do with it.

Putting properties in <metadata> elements implies they apply to all dataset elements contained within the current one.  Now that the new search engine will index properties in files as well as datasets, we need to decide whether we are going to support this feature of THREDDS.  My guess is that the Gateway and P2P index wouldn't process this correctly.

My instinct is that there should be a clear distinction between properties associated with a dataset and those associated with the files it contains -- therefore in this case we'd need to move the properties out of the metadata section.

Cheers,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK



--
Scanned by iCritical.




