[Go-essp-tech] Corrected THREDDS group wiki page link

Fri Feb 24 06:22:13 MST 2012

Hi all,

I think we are drifting from the main point, which was to define a 
subset of the catalog schema tailored for ESG, so we can know what to expect.

The TDS is much more complex than what we use, and its metadata 
inheritance forces a client to carry state while going down the tree, 
which we don't use at all (and probably don't want).
We should already plan for many more datasets in the future. I'm 
already promoting going to the atomic dataset level for CORDEX, which 
will simplify data management but will multiply the number of datasets 
(40x).
Displaying a list of ~6000 datasets, which is what we have now, takes 
some time and currently a lot of memory on the TDS side. So we might 
want to add an intermediate level to address this (assuming this solves 
anything...)
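
To make the inheritance point concrete, here is a minimal Python sketch. 
It assumes the THREDDS convention that a <metadata inherited="true"> 
block applies to all nested datasets, so a client walking the tree has to 
carry the accumulated state down with it. The sample catalog and the 
function names are made up for illustration, and only inherited metadata 
is handled:

```python
import xml.etree.ElementTree as ET

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
NS = "{%s}" % THREDDS_NS

# Hypothetical catalog: two nesting levels, each adding inherited metadata.
SAMPLE = """<catalog xmlns="%s">
  <dataset name="top">
    <metadata inherited="true"><dataType>Grid</dataType></metadata>
    <dataset name="mid">
      <metadata inherited="true"><dataFormat>NetCDF</dataFormat></metadata>
      <dataset name="leaf.nc" urlPath="mid/leaf.nc"/>
    </dataset>
  </dataset>
</catalog>""" % THREDDS_NS


def walk(dataset, inherited):
    """Yield (name, effective metadata) for `dataset` and everything below it."""
    state = dict(inherited)  # copy, so siblings do not see each other's state
    for md in dataset.findall(NS + "metadata"):
        if md.get("inherited") == "true":
            for child in md:
                state[child.tag.replace(NS, "")] = (child.text or "").strip()
    yield dataset.get("name"), state
    for child in dataset.findall(NS + "dataset"):
        yield from walk(child, state)


def effective_metadata(xml_text):
    """Map each dataset name to the metadata it ends up with after inheritance."""
    root = ET.fromstring(xml_text)
    result = {}
    for top in root.findall(NS + "dataset"):
        for name, state in walk(top, {}):
            result[name] = state
    return result


if __name__ == "__main__":
    for name, state in effective_metadata(SAMPLE).items():
        print(name, state)
```

The leaf dataset itself says nothing about its type or format; a reader 
only learns them by remembering what was declared further up.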

But besides this, I don't think we have any compelling reason for 
supporting the whole TDS schema.
IMHO we should define what we now have in an xsd, as a subset of what is 
allowed (but considering the previous point regarding scalability).
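
As a rough illustration of the kind of rules such a subset might encode, 
here is a sketch written in Python rather than xsd. The rules themselves 
(exactly one top-level container dataset, no catalogRef in a leaf 
catalog, file datasets directly below the container) are assumptions 
taken from this thread, not an agreed spec, and check_leaf_catalog is a 
hypothetical name:

```python
import xml.etree.ElementTree as ET

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
NS = "{%s}" % THREDDS_NS

# Hypothetical leaf catalog in the layout the data nodes produce today.
GOOD = """<catalog xmlns="%s">
  <dataset name="example.dataset">
    <dataset name="a.nc" urlPath="a.nc"/>
    <dataset name="b.nc" urlPath="b.nc"/>
  </dataset>
</catalog>""" % THREDDS_NS


def check_leaf_catalog(xml_text):
    """Return a list of violations of the assumed ESG leaf-catalog layout."""
    root = ET.fromstring(xml_text)
    problems = []
    top = root.findall(NS + "dataset")
    if len(top) != 1:
        problems.append("expected exactly 1 top-level dataset, found %d" % len(top))
    if root.findall(".//" + NS + "catalogRef"):
        problems.append("leaf catalog must not contain catalogRef elements")
    for container in top:
        for child in container.findall(NS + "dataset"):
            # Assumed rule: children of the container are not containers themselves.
            if child.findall(NS + "dataset"):
                problems.append("dataset %r has nested datasets" % child.get("name"))
    return problems


if __name__ == "__main__":
    print(check_leaf_catalog(GOOD) or "catalog matches the assumed subset")
```

An xsd would express the same constraints declaratively; the point is 
only that the subset is small enough to state in a handful of rules.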

Furthermore, because the TDS is a metadata "producer" for the ESG, we 
could add some "general" catalogs with other metadata in the same 
structure. This could be the modification time of the catalogs or other 
data not handled directly by the catalogs (e.g. controlled vocabulary 
service endpoints). Just to depict what I mean (I'm not "proposing" 
anything, just presenting some possibilities):

Root/
|-MPI <all MPI catalogs>
|   |-metadata
|      |- last_mod_time
|-MOHC <all MOHC catalogs, for us, replicas>
|   |-metadata
|      |- last_mod_time
|-metadata
    |- controlled vocabularies referenced

Though I'd prefer to rely on a standard (ISO?), or a subset of it, 
instead of the TDS's own metadata format.

Just my 2c,
Estani

On 23.02.2012 18:28, martin.juckes at stfc.ac.uk wrote:
>
> There is a page on the schema here: 
> http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/InvCatalogSpec.html#catalog
>
> They state pretty clearly that datasets at the top level are allowed 
> .....,
>
> Cheers,
>
> Martin
>
> *From:*Pascoe, Stephen (STFC,RAL,RALSP)
> *Sent:* 23 February 2012 17:19
> *To:* 'Cinquini, Luca (3880)'
> *Cc:* Juckes, Martin (STFC,RAL,RALSP); go-essp-tech at ucar.edu
> *Subject:* RE: [Go-essp-tech] Corrected THREDDS group wiki page link
>
> Hi Luca,
>
> This is what I found googling "THREDDS Schema":
>
> <!-- xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx -->
> <!-- Catalog element -->
> <xsd:element name="catalog">
>   <xsd:complexType>
>     <xsd:sequence>
>       <xsd:element ref="service" minOccurs="0" maxOccurs="unbounded"/>
>       <xsd:element ref="datasetRoot" minOccurs="0" maxOccurs="unbounded"/>
>       <xsd:element ref="property" minOccurs="0" maxOccurs="unbounded"/>
>       <xsd:element ref="dataset" minOccurs="1" maxOccurs="unbounded"/>
>     </xsd:sequence>
>     <xsd:attribute name="base" type="xsd:anyURI"/>
>     <xsd:attribute name="name" type="xsd:string"/>
>     <xsd:attribute name="expires" type="dateType"/>
>     <xsd:attribute name="version" type="xsd:token" default="1.0.2"/>
>   </xsd:complexType>
> </xsd:element>
>
> So that says you can have 1..* dataset elements but we know that real 
> THREDDS catalogs can have no dataset elements, just catalogRef 
> elements, so it looks like we should treat the schema with a pinch of 
> salt.
>
> I agree ignoring mid-level properties is probably the right thing to do.
>
> Cheers,
>
> Stephen.
>
> ---
>
> Stephen Pascoe  +44 (0)1235 445980
>
> Centre of Environmental Data Archival
>
> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>
> *From:*Cinquini, Luca (3880) [mailto:Luca.Cinquini at jpl.nasa.gov]
> *Sent:* 23 February 2012 16:46
> *To:* Pascoe, Stephen (STFC,RAL,RALSP)
> *Cc:* Juckes, Martin (STFC,RAL,RALSP); go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Corrected THREDDS group wiki page link
>
> Hi Stephen:
>
> On Feb 23, 2012, at 8:54 AM, <stephen.pascoe at stfc.ac.uk> wrote:
>
> Hi Luca,
>
> > o Each catalog is harvested as a single discoverable dataset - the 
> > reason being that hopefully the data provider thought about how to 
> > generate the catalogs, and decided on what should be the single unit 
> > of discovery
> >
> > o For each catalog, all files are assigned to the top-level dataset 
> > container - so if there were many nested datasets with files, it still 
> > would result in a single discoverable dataset with as many files
>
> Does this mean that each catalog should contain 0 or 1 top-level 
> datasets and any further nesting below that is collapsed down?  That 
> sounds quite sensible.  What happens to any properties in any dataset 
> below the top-level one?
>
> I may be mistaken, but doesn't a THREDDS catalog always contain only one 
> top-level dataset? At least that used to be the case, I believe. At 
> the very least, I don't know of any catalog that has many top-level 
> datasets.
>
> As for the properties - if they are associated with mid-level 
> datasets, they would currently be ignored. This could change, if we 
> had examples to work with.
>
> thanks, L
>
> Cheers,
>
> Stephen.
>
> *From:* Cinquini, Luca (3880) [mailto:Luca.Cinquini at jpl.nasa.gov]
> *Sent:* 23 February 2012 15:02
> *To:* Pascoe, Stephen (STFC,RAL,RALSP)
> *Cc:* Juckes, Martin (STFC,RAL,RALSP); go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Corrected THREDDS group wiki page link
>
> Hi Stephen and Martin,
>
> just for clarification, this is what the P2P harvesting software 
> currently does - this doesn't mean that it cannot be changed if desired:
>
> o Each catalog can contain an arbitrary hierarchy of datasets and 
> catalogRefs
>
> o Each catalog is harvested as a single discoverable dataset - the 
> reason being that hopefully the data provider thought about how to 
> generate the catalogs, and decided on what should be the single unit 
> of discovery
>
> o For each catalog, all files are assigned to the top-level dataset 
> container - so if there were many nested datasets with files, it still 
> would result in a single discoverable dataset with as many files
>
> o And obviously, all catalogRef are followed in harvesting, and 
> generate separate discoverable datasets.
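
The four rules above can be sketched in Python as follows. This is only 
an illustration of the described behaviour, not the actual P2P harvester 
code: the function name, the record layout, the example URLs and the 
in-memory fetch callback are all assumptions, and there is no protection 
against catalogRef cycles.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
NS = "{%s}" % THREDDS_NS
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

ROOT_URL = "http://example.org/thredds/catalog.xml"  # hypothetical data node
CATALOGS = {
    ROOT_URL: """<catalog xmlns="%s" xmlns:xlink="http://www.w3.org/1999/xlink">
  <catalogRef xlink:href="esgcet/1.xml" xlink:title="example.dataset" name=""/>
</catalog>""" % THREDDS_NS,
    "http://example.org/thredds/esgcet/1.xml": """<catalog xmlns="%s">
  <dataset name="example.dataset">
    <dataset name="nested">
      <dataset name="a.nc" urlPath="a.nc"/>
    </dataset>
    <dataset name="b.nc" urlPath="b.nc"/>
  </dataset>
</catalog>""" % THREDDS_NS,
}


def harvest(url, fetch, records=None):
    """Harvest the catalog at `url`; `fetch(url)` must return its XML text."""
    if records is None:
        records = []
    root = ET.fromstring(fetch(url))
    top = root.find(NS + "dataset")
    if top is not None:
        # All files, however deeply nested, are assigned to the top-level dataset.
        files = [d.get("urlPath") for d in top.iter(NS + "dataset") if d.get("urlPath")]
        records.append({"catalog": url, "dataset": top.get("name"), "files": files})
    # Every catalogRef is followed and produces its own discoverable dataset.
    for ref in root.iter(NS + "catalogRef"):
        harvest(urljoin(url, ref.get(XLINK_HREF)), fetch, records)
    return records


if __name__ == "__main__":
    for record in harvest(ROOT_URL, CATALOGS.__getitem__):
        print(record)
```

In this sketch the intermediate "nested" dataset leaves no trace in the 
result, which is also why properties on mid-level datasets would be lost.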
>
> thanks, Luca
>
> On Feb 23, 2012, at 7:33 AM, <stephen.pascoe at stfc.ac.uk> wrote:
>
> Thanks Martin.  There is a catalog_version attribute already, although 
> I don't think there is any documentation on what it means.
>
> On the hierarchy, I personally believe we could allow any number of 
> intermediate catalogues containing <catalogRef> elements in the spec.  
> Datanodes currently only produce 2 levels .../thredds/catalog.xml and 
> .../thredds/esgcet/catalog.xml, but there would be no harm in having 
> deeper nesting.  What I think is less flexible is the constraint that 
> "leaf-catalogs" contain a single container <dataset> element and a set 
> of child <dataset> elements representing files and aggregations.  This 
> design is what LAS and other bits of ESGF rely on. General THREDDS 
> allows you to mix catalogRef, container datasets and "real" datasets 
> throughout the hierarchy.
>
> Anyone, please chip in if you disagree.
>
> Cheers,
>
> Stephen.
>
> *From:* Juckes, Martin (STFC,RAL,RALSP)
> *Sent:* 23 February 2012 12:11
> *To:* Pascoe, Stephen (STFC,RAL,RALSP); go-essp-tech at ucar.edu
> *Subject:* RE: Corrected THREDDS group wiki page link
>
> Hello All,
>
> Sorry I had to leave the telco early -- but it was a useful discussion.
>
> After leaving, I had a couple of thoughts:
>
> (1) The syntax should be versioned, and the version should be 
> indicated in the catalogue somewhere -- whatever we agree on, there is 
> bound to be a need to change it in the future, and changes will be much 
> easier to manage if we have the version in the catalogue. There could 
> be independent syntax versions for the top level catalogue and the 
> "publication unit" catalogue. The cleanest way to do this would be 
> with an xsd document referenced in the schemaLocation attribute. We 
> could set this up initially with a "permissive" xsd schema imposing 
> necessary constraints, but not all the required constraints.
>
> (2) The decision to stick to a 2-level hierarchy of THREDDS documents 
> (a top-level catalogue with a list of "catalogRef"s and a 
> sub-catalogue for each publication unit) is certainly right for now, 
> but may be too restrictive in the medium term. The specification of 
> "catalogRef" means that very little information is in the top level, 
> and at the next level you have to fetch everything. Having a 3rd 
> level -- e.g. for each simulation -- would allow more flexibility in 
> recording changes and pointing to documentation.
>
> Cheers,
>
> Martin
>
> *From:* go-essp-tech-bounces at ucar.edu 
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of* stephen.pascoe at stfc.ac.uk
> *Sent:* 21 February 2012 16:14
> *To:* go-essp-tech at ucar.edu
> *Subject:* [Go-essp-tech] Corrected THREDDS group wiki page link
>
> http://esgf.org/wiki/ESGFInterfaceGroups/ThreddsGroup
>
> --
> Scanned by iCritical.
>
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu>
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech

-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de
