[Go-essp-tech] uniquely identifying a NetCDF CMIP5 file

Estanislao Gonzalez gonzalez at dkrz.de
Wed Feb 15 09:53:26 MST 2012


Hi,

I'm changing the subject so we get back on course :-)

Indeed, URLs *will* change, *but* they contain more information than the 
filename, and you can easily get to the subset of filename + checksum 
(just use basename on the URL on any UNIX machine).
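
For anyone not working in a shell, here is a minimal Python sketch of the
same basename step. The host names, the service roots and the variable
name "tas" are made up for illustration; only the file name template comes
from this thread.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, like `basename` does."""
    # urlsplit() separates the path from host and query string, so only
    # the path part is handed to basename.
    return posixpath.basename(urlsplit(url).path)


# Hypothetical DRS-like path and file name (assumptions for this example).
drs_path = "cmip5/output1/NASA-GMAO/GEOS-5/decadal1960/mon/atmos/Amon/r1i1p1"
file_name = "tas_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc"

# Two made-up servers holding the same file under different roots.
original_url = f"http://original.example.org/thredds/fileServer/{drs_path}/{file_name}"
replica_url = f"https://replica.example.net/data/{drs_path}/{file_name}"

# Both URLs reduce to the same file name, whatever the server is.
print(filename_from_url(original_url))
print(filename_from_url(replica_url))
```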

Furthermore you could *know* from which server you got the data (a 
replica, the original, etc).

That's why I think the simplest approach is to store just the wget script 
or whatever URL list the user got.
As I said, the checksum is enough to identify the file, but people might 
also want to find out what happened to it... and the extra information 
might provide the required clues.

But yes, as Stephen pointed out, there are multiple use cases; I was 
thinking about the user. From the server side you could indeed have 
"dataset + set(file_name, file_checksum)" as a representation of that 
dataset that will be the same for all replicas. I'm missing the point 
though: why would you want to do that? Citation? Storage? A notification 
service?

The URL *should* be:
https?://<host_name>/<service_root>/*DRS_PATH*/*DRS_filename*

so DRS_PATH (=dataset + version) + DRS_filename + checksum is what you 
are proposing. I do think it's enough.
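
A rough Python sketch of such a replica-independent representation,
assuming the input is a list of (url, checksum) pairs like the ones in a
wget script. The URLs and checksum strings below are placeholders, and
hashing the sorted manifest with SHA-256 is just one possible choice, not
something any existing service does.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def manifest(entries):
    """Build a sorted list of (file_name, checksum) pairs from (url, checksum) pairs.

    Host name and service root are dropped, so every replica of the same
    dataset version yields the same manifest.
    """
    pairs = {(posixpath.basename(urlsplit(url).path), checksum)
             for url, checksum in entries}
    return sorted(pairs)


def manifest_digest(entries):
    """Reduce the manifest to one hash so replicas can be compared by a single value."""
    h = hashlib.sha256()
    for name, checksum in manifest(entries):
        h.update(f"{name}\t{checksum}\n".encode("utf-8"))
    return h.hexdigest()


# Placeholder data: same files and checksums, different (made-up) servers.
original = [
    ("http://original.example.org/thredds/fileServer/ds/v1/tas_Amon_x.nc", "aaa111"),
    ("http://original.example.org/thredds/fileServer/ds/v1/pr_Amon_x.nc", "bbb222"),
]
replica = [
    ("https://replica.example.net/data/ds/v1/pr_Amon_x.nc", "bbb222"),
    ("https://replica.example.net/data/ds/v1/tas_Amon_x.nc", "aaa111"),
]
# Same content, but one file renamed: this should count as a different dataset version.
renamed = [
    ("https://replica.example.net/data/ds/v2/pr_Amon_y.nc", "bbb222"),
    ("https://replica.example.net/data/ds/v2/tas_Amon_x.nc", "aaa111"),
]

print(manifest_digest(original) == manifest_digest(replica))
print(manifest_digest(original) == manifest_digest(renamed))
```

Because the file names are part of the manifest, a pure rename changes the
digest even though every checksum stays the same.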

Regards,
Estani



On 15.02.2012 15:45, Kettleborough, Jamie wrote:
> Minor point - but we've seen URLs change too - e.g. pcmdi moved their 
> THREDDS from one server to another, and others seem to sometimes do 
> some sort of data management that means the earlier parts of the paths 
> in the URL change.
> Jamie
>
>     ------------------------------------------------------------------------
>     *From:* stephen.pascoe at stfc.ac.uk [mailto:stephen.pascoe at stfc.ac.uk]
>     *Sent:* 15 February 2012 14:00
>     *To:* gonzalez at dkrz.de
>     *Cc:* Kettleborough, Jamie; go-essp-tech at ucar.edu
>     *Subject:* RE: [Go-essp-tech] Incorrect file names?
>
>     This thread is now officially off-topic but, just to pick Estani
>     up on one point:
>
>     > I would say URL + checksum is enough information (that's more
>     > than the filename). The filename alone is OK, but you will have
>     > to look for the dataset version...
>
>     I think we are talking about different use-cases.  I'm imagining a
>     manifest that describes the dataset's contents at a particular
>     version, independent of its location and the name we've given the
>     version.  URL + checksum contains the dataset's location as well
>     as its contents; in the case of DRS it also contains the
>     version.  Think of the analogy of a git tree object -- it just
>     contains the names and hashes of everything in the tree.  The URLs
>     will be different for each replica and I was talking about a
>     manifest that was the same for all replicas.
>
>     This is an idea for the future really.
>
>     Stephen.
>
>     ---
>
>     Stephen Pascoe  +44 (0)1235 445980
>
>     Centre of Environmental Data Archival
>
>     STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
>     0QX, UK
>
>     *From:*Estanislao Gonzalez [mailto:gonzalez at dkrz.de]
>     *Sent:* 15 February 2012 13:04
>     *To:* Pascoe, Stephen (STFC,RAL,RALSP)
>     *Cc:* jamie.kettleborough at metoffice.gov.uk; go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>     Hi,
>
>     The tracking_id is generated by CMOR, so if the files are
>     re-concatenated using any other tool (e.g. cdo) then it should be
>     left as is. This has the benefit of not altering the checksum and
>     thus marking the file as unchanged. So basically a renaming should
>     trigger a new version but should not alter the files in any other
>     way (i.e. same checksum).
>
>     And indeed Stephen, the checksums alone are "enough", but they are
>     not practical for any other purpose (until we have a service for
>     reverse lookup from checksums to files).
>     I would say URL + checksum is enough information (that's more than
>     the filename). The filename alone is OK, but you will have to
>     look for the dataset version...
>
>     So either dataset_id + version + file_name + checksum or url +
>     checksum, which is the general case of the former.
>
>     Url + checksum is already being stored in the wget script, so that
>     file would be the key for citing/finding files.
>     My advice: store it together with the data (which I know most
>     people are doing already).
>
>     ...I'm already changing the subject of this thread... sorry for that.
>
>     Thanks,
>     Estani
>     On 15.02.2012 11:44, stephen.pascoe at stfc.ac.uk wrote:
>
>     Hi all,
>
>     This subject is full of gray areas :-(.  I would say keeping the
>     tracking_id the same is ok as it is an indication that the
>     contents of the NetCDF file haven't changed.
>
>     Practical matters for CMIP5 aside, I've been thinking about how we
>     could create an unambiguous manifest of a dataset-version, i.e.
>     one containing enough information to uniquely identify its contents
>     without any extraneous information that might change with dataset
>     location, available services, etc.  I came to the conclusion
>     there are 2 possible solutions: either it's a sorted list of
>     (filename, checksum) pairs or it's just a sorted list of
>     checksums.  The difference is whether filenames are "part of the
>     dataset".  My instinct is that you can't decouple filenames from
>     the dataset.  Users expect filenames to be meaningful and in some
>     contexts information inside files could refer to filenames within
>     the dataset (e.g. gridspec files).  This is how every other
>     contents-based versioning/packaging system I know of works: git,
>     BagIt, BitTorrent.
>
>     So, that's a long way of saying a new version would be necessary,
>     on theoretical grounds as well as pragmatic ones.
>
>     Cheers,
>
>     Stephen.
>
>     ---
>
>     Stephen Pascoe  +44 (0)1235 445980
>
>     Centre of Environmental Data Archival
>
>     STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
>     0QX, UK
>
>     *From:*go-essp-tech-bounces at ucar.edu
>     <mailto:go-essp-tech-bounces at ucar.edu>
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of
>     *Kettleborough, Jamie
>     *Sent:* 15 February 2012 10:20
>     *To:* Estanislao Gonzalez; go-essp-tech at ucar.edu
>     <mailto:go-essp-tech at ucar.edu>
>     *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>     Thanks Estani,
>
>     do you have any thoughts on the tracking_id?  Should this be left
>     as is (I think what you say below implies it should).
>
>     Jamie
>
>         ------------------------------------------------------------------------
>
>         *From:*go-essp-tech-bounces at ucar.edu
>         <mailto:go-essp-tech-bounces at ucar.edu>
>         [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of
>         *Estanislao Gonzalez
>         *Sent:* 15 February 2012 10:06
>         *To:* go-essp-tech at ucar.edu <mailto:go-essp-tech at ucar.edu>
>         *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>         Hi,
>
>         A new version needs to be published, so that its
>         publication will signal that something has changed. If not,
>         then the change won't be picked up by other services, e.g.
>         replica services, as they would assume the data hasn't been
>         changed at all.
>         In our particular case (DKRZ) we will see files haven't been
>         changed, provided that the checksums are properly published,
>         and will just link them to the older ones, allowing users to
>         get both versions (i.e. users that are already downloading
>         files will be able to keep downloading them, and not have to
>         start everything anew because of this renaming)
>         (I can't see if the checksums are there because the node is
>         not accessible at this time)
>
>         We could use that information to infer what happened and
>         display it under the history information.
>
>         Maintaining the same version provides no benefit for the
>         publisher at all and creates the same confusion for the user
>         (who will see that files are missing).
>
>         Just my 2c,
>         Estani
>
>         On 15.02.2012 10:29, Kettleborough, Jamie wrote:
>
>         Hello,
>
>         sorry, a bit of a side track, but maybe useful.  I know this
>         is an unusual case - but it is another example of an
>         understandable slip that can be made when producing data.
>         When Laura republishes these, should it be under a new
>         publication dataset version or not?   I think the only thing
>         that is changing is the filename - is that right?  I don't
>         think this warrants a new publication dataset version, but I
>         could be wrong.
>
>         Jamie
>
>         ------------------------------------------------------------------------
>
>         *From:*go-essp-tech-bounces at ucar.edu
>         <mailto:go-essp-tech-bounces at ucar.edu>
>         [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Laura
>         Carriere
>         *Sent:* 14 February 2012 19:43
>         *To:* Jennifer Adams
>         *Cc:* go-essp-tech at ucar.edu <mailto:go-essp-tech at ucar.edu>
>         *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>
>             Ah, now I see your reply.  Since you have a solution to
>             your immediate problem, I will not rush the republish but
>             will have it done as soon as it's convenient.  Thanks.
>
>               Laura.
>
>             On 2/14/2012 2:32 PM, Jennifer Adams wrote:
>
>             Oh dear. Rather than renaming the files, I used a set of
>             symlinks to solve my immediate problem, but I still find
>             this a bit troubling. I will have to check all GEOS-5 data
>             files I grab from now on. I asked Larry Marx to check if
>             any of COLA's CMIP5 data at NASA have 8-digit date stamps,
>             and he found everything to be correct with only YYYYMM
>             date strings.
>
>             --Jennifer
>
>             On Feb 14, 2012, at 1:29 PM, Laura Carriere wrote:
>
>
>
>
>
>             Quick answer on my way to a meeting - CMOR2 was used for
>             this and at least one other dataset that we have (from
>             COLA) that also has the yyyymmdd format.  I'll ask a few
>             other questions after my meeting but that's the short answer.
>
>               Laura.
>
>             On 2/14/2012 1:21 PM, Karl Taylor wrote:
>
>             Dear Novice (with clearly more knowledge than most
>             so-called experts),
>
>             I'm copying a contact for the GEOS-5 model who may be able
>             to provide some information on this.  I can't explain why
>             the monthly file names are inconsistent with what CMOR2
>             puts out.  Maybe CMOR2 wasn't used.  The DRS document
>             doesn't absolutely forbid including more precision than
>             necessary in specifying the time-periods, so I don't think
>             we can force them to rename their files.  That being said,
>             my hope was everyone would use CMOR, so the file names
>             would all follow the same template.
>
>             Karl
>
>             On 2/13/12 9:30 AM, Jennifer Adams wrote:
>
>             Dear Experts,
>
>             Here is a dataset:
>
>             http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NASA-GMAO.GEOS-5.decadal1960.mon.atmos.Amon.r1i1p1.html
>
>             And here is the file name template for all the variables
>             in this dataset:
>
>             <varname>_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc
>
>             My script to generate a GrADS descriptor for this file
>             barked because the MONTHLY data file has time stamps in
>             the YYYYMMDD format.
>
>             If I have read the DRS document correctly, this is not a
>             correct file name.
>
>             Shouldn't I be able to assume that monthly files will have
>             only YYYYMM date strings?
>
>             --Jennifer
>
>             --
>
>             Jennifer M. Adams
>
>             IGES/COLA
>
>             4041 Powder Mill Road, Suite 302
>
>             Calverton, MD 20705
>
>             jma at cola.iges.org <mailto:jma at cola.iges.org>
>
>
>
>
>
>             -- 
>
>               
>
>                Laura Carriere, SAIC             laura.carriere at nasa.gov  <mailto:laura.carriere at nasa.gov>
>
>                NCCS, Code 606.2                 301 614-5064
>
>             _______________________________________________
>             GO-ESSP-TECH mailing list
>             GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu>
>             http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>
>         -- 
>
>         Estanislao Gonzalez
>
>           
>
>         Max-Planck-Institut für Meteorologie (MPI-M)
>
>         Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>
>         Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>
>           
>
>         Phone:   +49 (40) 46 00 94-126
>
>         E-Mail:gonzalez at dkrz.de  <mailto:gonzalez at dkrz.de>  
>
>     -- 
>     Scanned by iCritical.


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de
