[Go-essp-tech] Grouping files of differing versions

stephen.pascoe at stfc.ac.uk stephen.pascoe at stfc.ac.uk
Fri Jul 22 07:13:22 MDT 2011


Nebojsa,

A specific example would help.  We have had some problems with version number consistency at BADC so the problem might be in our metadata, not the semantics of DRS/THREDDS.

Cheers,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK


-----Original Message-----
From: Nebojsa Balic [mailto:balic at dkrz.de] 
Sent: 22 July 2011 14:04
To: Juckes, Martin (STFC,RAL,RALSP)
Cc: Pascoe, Stephen (STFC,RAL,RALSP); go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Grouping files of differing versions

  Hello Martin,
I am glad that you have also joined this discussion.
The version number of a dataset also contains a date, so I assumed 
that the datasets with the most recent version number are also the 
most recently published ones. I could not establish any relation 
between this date and the mod_time property of files in THREDDS 
catalogs, which would be the relevant criterion for identifying the 
most recent files. Is there one? Can it happen that some files 
belonging to datasets with a higher version number were modified 
before those of datasets with a lower one?
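
For comparing the two values side by side, here is a minimal Python 
sketch that reads the dataset-level version property and the file-level 
mod_time properties from a catalog. The catalog fragment, its IDs and 
its timestamps are invented for illustration, and real ESGF catalogs 
may be laid out differently; only the property names come from this 
thread.

```python
import xml.etree.ElementTree as ET

# THREDDS InvCatalog 1.0 namespace.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hypothetical catalog fragment: IDs, timestamps and layout are made up.
CATALOG = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="blabla" ID="cmip5.output1.blabla.v20110721">
    <property name="version" value="20110721"/>
    <dataset name="foo1.nc" ID="cmip5.output1.blabla.v20110721.foo1.nc">
      <property name="mod_time" value="2011-07-19 08:30:00"/>
    </dataset>
    <dataset name="foo2.nc" ID="cmip5.output1.blabla.v20110721.foo2.nc">
      <property name="mod_time" value="2011-07-21 09:15:00"/>
    </dataset>
  </dataset>
</catalog>
"""


def read_versions(xml_text):
    """Return (dataset version, {file name: mod_time}) for the top dataset."""
    root = ET.fromstring(xml_text)
    top = root.find("t:dataset", NS)
    # The version is a property of the dataset, not of the individual files.
    version = top.find("t:property[@name='version']", NS).get("value")
    mod_times = {}
    for child in top.findall("t:dataset", NS):
        prop = child.find("t:property[@name='mod_time']", NS)
        if prop is not None:
            mod_times[child.get("name")] = prop.get("value")
    return version, mod_times


version, mod_times = read_versions(CATALOG)
```

In this made-up fragment one file carries a mod_time earlier than the 
dataset version date, which is the kind of case the question above is 
about; whether it occurs in practice is for the data nodes to confirm.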
I am trying to find a way to determine which of the files returned by 
a query have the most recent version. According to what has already 
been said, these are the files that belong to the dataset with the 
highest version number. However, a query usually returns files 
belonging to various datasets, some of which can have the same version 
number but different DRS components. For this reason, my idea was to 
group datasets or files by some criterion and then determine the 
dataset with the latest version within each group. 
If I just apply the most recent version number as the criterion, the 
results will be incomplete: files whose publication has been completed 
would not appear, because other files that satisfy the search criteria 
(from another institute or with different DRS components) were 
published more recently and belong to datasets with a higher version 
number.
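
A minimal Python sketch of the difference, assuming that a dataset 
version ID is a dataset ID followed by ".vYYYYMMDD". The IDs in `hits` 
and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical dataset version IDs returned by one query.
hits = [
    "cmip5.output1.blabla.v20110720",
    "cmip5.output1.blabla.v20110721",
    "cmip5.output1.other.v20110601",
]


def split_id(dataset_id):
    """Split 'a.b.c.vYYYYMMDD' into ('a.b.c', YYYYMMDD)."""
    base, _, version = dataset_id.rpartition(".v")
    return base, int(version)


def latest_global(ids):
    """Keep only the IDs that carry the single highest version number."""
    top = max(split_id(i)[1] for i in ids)
    return [i for i in ids if split_id(i)[1] == top]


def latest_per_dataset(ids):
    """Keep the highest version within each group of IDs sharing a base."""
    groups = defaultdict(list)
    for i in ids:
        groups[split_id(i)[0]].append(i)
    return sorted(max(g, key=lambda i: split_id(i)[1]) for g in groups.values())
```

With these IDs, `latest_global(hits)` drops "cmip5.output1.other" 
entirely, while `latest_per_dataset(hits)` keeps its only version.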
That is why I think that some grouping is necessary before determining 
the files with the most recent version. The ideal case would be to 
identify files with the same simulation results, but I am not sure if 
this can be done from the information given in THREDDS catalogs.
The main problem is to find a criterion for assigning the files in 
datasets to groups within which they differ only in version number.
Regards
Nebojsa
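
The distinction Martin draws in the message quoted below, between the 
files of the most recent dataset version and the most recent copy of 
each file, can be sketched in Python as follows. The dictionary layout 
and function names are assumptions made for illustration; the file 
names and version numbers are those of his example.

```python
# Files per dataset version, taken from Martin's example.
versions = {
    20110720: ["foo1.nc", "foo2.nc", "foo3.nc"],
    20110721: ["foo1.nc", "foo2.nc"],
}


def files_in_latest_version(versions):
    """Files that make up the highest-numbered dataset version."""
    return sorted(versions[max(versions)])


def latest_copy_of_each_file(versions):
    """Map every file name ever published to the newest version holding it."""
    newest = {}
    for version in sorted(versions):
        for name in versions[version]:
            newest[name] = version  # later versions overwrite earlier ones
    return newest
```

The first function returns only foo1.nc and foo2.nc. The second also 
returns foo3.nc, taken from v20110720, even though that file is no 
longer part of the current dataset version.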

On 07/22/2011 01:08 PM, martin.juckes at stfc.ac.uk wrote:
> Hello Nebojsa,
>
> Just to add to what Stephen says, we could have the situation where a file is present in an early version of the dataset, but not in a later version:
>
> cmip5.output1.blabla.v20110720: foo1.nc, foo2.nc, foo3.nc
> cmip5.output1.blabla.v20110721: foo1.nc, foo2.nc
>
> The collection of files in the most recent version is not the same as the collection of most recent files. You say you want the latter, but I think it would be better to provide the former (i.e. foo1.nc and foo2.nc) in this case, so it might be worth clarifying the requirements.
>
> Cheers,
> Martin
>
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
>>> bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
>>> Sent: 22 July 2011 11:48
>>> To: balic at dkrz.de
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] Grouping files of differing versions
>>>
>>> Nebojsa,
>>>
>>>> I need a criterion for assigning files differing only in the version
>>> number into separate groups.
>>>
>>> I think the confusion may arise because ESGF versions datasets, not
>>> files.  If you want to find the files of the latest version of a
>>> dataset, search for all versions of that dataset and then list the
>>> files in the latest one.
>>>
>>> If you start with the files I believe there will be problems.  As you
>>> say, neither version nor product is determinable from the filename.
>>> *In theory* any given filename should be in the same product for all
>>> versions.  However, I wouldn't depend on this.  For example, a datanode
>>> could discover that it has got the product of some data wrong and
>>> publish new datasets containing the same files to fix this.
>>> You would then see 2 datasets containing the same files but with a
>>> different product.
>>>
>>> E.g.
>>>
>>> Originally you could have a dataset version:
>>>
>>> cmip5.output1.blabla.v20110720: f1.nc, f2.nc, f3.nc, f4.nc
>>>
>>> The datanode realises f3.nc and f4.nc should be in output2 so
>>> publishes 2 new datasets:
>>>
>>> cmip5.output1.blabla.v20110720: f1.nc, f2.nc, f3.nc, f4.nc
>>> cmip5.output1.blabla.v20110721: f1.nc, f2.nc
>>> cmip5.output2.blabla.v20110721: f3.nc, f4.nc
>>>
>>> We hope this won't happen but it might.  If you start by looking for
>>> all datasets containing f3.nc you will find 2 with different version
>>> and product: cmip5.output1.blabla.v20110720 and
>>> cmip5.output2.blabla.v20110721.
>>>
>>> I believe it is safer to search at the dataset level then drill down
>>> to individual files.
>>>
>>> Cheers,
>>> Stephen.
>>>
>>>
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> Centre of Environmental Data Archival
>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX,
>>> UK
>>>
>>>
>>> -----Original Message-----
>>> From: Nebojsa Balic [mailto:balic at dkrz.de]
>>> Sent: 22 July 2011 11:10
>>> To: Pascoe, Stephen (STFC,RAL,RALSP)
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] Grouping files of differing versions
>>>
>>>   Stephen,
>>> Thank you for the prompt answer.
>>> A new requirement for the search interface in the vERC portal is to
>>> provide the files of the most recent version that satisfy a given set
>>> of search constraints (DRS components, elements of the geospatial and
>>> temporal coverage). In order to determine the files with the latest
>>> version, I need a criterion for assigning files differing only in the
>>> version number into separate groups. The files cannot be grouped by
>>> their IDs because each has a different one. The version number is
>>> also not an option because files of different models, simulations,
>>> experiments etc. can have the same version number. The name seems to
>>> be the best grouping criterion since it contains DRS content but not
>>> the version of a file. But does it mean that files with the same name
>>> vary only in version? The product can be an additional grouping
>>> criterion since the name does not contain this information. The
>>> search is performed on CMIP5 data so the files are all of the same
>>> activity. So if I group files by their names and activity, and for
>>> each of these groups I determine the file with the highest version
>>> number, do I get the files of the latest version?
>>> Thanks
>>> Nebojsa
>>>
>>> On 07/22/2011 11:08 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>> Nebojsa,
>>>>
>>>> Since CMOR is not version-aware, files have no indication of their
>>>> version number.  Version should be explicit in the THREDDS dataset@ID
>>>> attribute and the property[@name="version"] attribute.
>>>>
>>>>> assumption that they all belong to the same product
>>>>
>>>> Can you give us an example as I'm not sure what you mean.  Files
>>>> shouldn't move product between versions unless they were
>>>> misclassified initially.
>>>>
>>>> I would use the property[@name="version"] attribute to distinguish
>>>> between versions as the dataset@ID could be changed in the future.  I
>>>> think of it as an internal THREDDS identifier.
>>>> Cheers,
>>>> Stephen.
>>>>
>>>> ---
>>>> Stephen Pascoe  +44 (0)1235 445980
>>>> Centre of Environmental Data Archival
>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>>>>
>>>> -----Original Message-----
>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Nebojsa Balic
>>>> Sent: 22 July 2011 09:58
>>>> To: go-essp-tech at ucar.edu
>>>> Subject: [Go-essp-tech] Grouping files of differing versions
>>>>
>>>>     Dear All,
>>>> I am trying to group files differing only in the version number in
>>>> order to determine the files of the latest version. I have come to the
>>>> conclusion that files differing only in the version all have the same
>>>> name but a different ID, under the assumption that they all belong to
>>>> the same product. Is this a necessary and sufficient condition for
>>>> grouping files with different versions?
>>>> Thanks
>>>> Nebojsa Balic
>>>> MPI-M
>>>> Hamburg
>>>> _______________________________________________
>>>> GO-ESSP-TECH mailing list
>>>> GO-ESSP-TECH at ucar.edu
>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
