[Go-essp-tech] Incorrect file names?

Estanislao Gonzalez gonzalez at dkrz.de
Wed Feb 15 06:03:49 MST 2012


Hi,

the tracking_id is generated by CMOR, so if the files are 
re-concatenated using any other tool (e.g. cdo) it should be left as is. 
Leaving it alone has the benefit of not altering the checksum, which 
marks the file as unchanged. So basically a renaming should trigger a 
new version but should not alter the files in any other way (i.e. same 
checksum).
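
To illustrate the checksum point: a checksum is computed over the bytes
of a file only, so a pure rename leaves it untouched. A minimal sketch,
assuming SHA-256 (use whatever algorithm your node publishes) and
made-up file names (the variable "tas" and the YYYYMM-stamped target
name are placeholders, not the real GEOS-5 files):

```python
import hashlib
import os
import shutil
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file's bytes; the file name is never hashed."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large NetCDF files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo_rename_keeps_checksum():
    """Write placeholder bytes, rename the file, and return both checksums."""
    workdir = tempfile.mkdtemp()
    try:
        # Hypothetical names: only the time stamp part differs.
        old_path = os.path.join(
            workdir, "tas_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc")
        new_path = os.path.join(
            workdir, "tas_Amon_GEOS-5_decadal1960_r1i1p1_196101-197012.nc")
        with open(old_path, "wb") as handle:
            handle.write(b"placeholder bytes standing in for NetCDF content")
        before = file_checksum(old_path)
        os.rename(old_path, new_path)
        after = file_checksum(new_path)
        return before, after
    finally:
        shutil.rmtree(workdir)
```

Calling demo_rename_keeps_checksum() returns two identical digests, which
is why a renamed file can be linked to the older one instead of being
transferred again.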

And indeed Stephen, the checksums alone are "enough", but that's not 
practical for any other purpose (until we have a service to do a reverse 
search from checksum to files).
I would say URL + checksum is enough information (that's more than the 
filename). The filename alone is OK, but you will have to look up the 
dataset version...

So either dataset_id + version + file_name + checksum or url + checksum, 
which is the general case of the former.
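
Stephen's two manifest options (in his mail quoted below) can be
sketched roughly as follows. This is only an illustration under my own
assumptions: the tab-separated line layout, the use of SHA-256 for the
manifest digest, and the short file names and checksums are all made up.

```python
import hashlib


def manifest_digest(entries, include_names=True):
    """Digest of a sorted manifest built from (filename, checksum) pairs.

    With include_names=False only the sorted checksums are hashed, so a
    rename does not change the result.
    """
    if include_names:
        lines = [f"{name}\t{checksum}" for name, checksum in sorted(entries)]
    else:
        lines = sorted(checksum for _, checksum in entries)
    payload = "\n".join(lines).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical dataset before and after a rename; contents are identical.
original = [("a_196101-197012.nc", "1111"), ("b_196101-197012.nc", "2222")]
renamed = [("a_19610116-19701216.nc", "1111"), ("b_19610116-19701216.nc", "2222")]

# Option 1: filenames are part of the dataset, so the rename is a new version.
names_matter = manifest_digest(original) != manifest_digest(renamed)

# Option 2: checksums only, so the rename is invisible.
checksums_only_equal = (
    manifest_digest(original, include_names=False)
    == manifest_digest(renamed, include_names=False)
)
```

Both flags come out True here: with filenames in the manifest the rename
shows up as a change, with checksums alone it does not.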

Url + checksum is already being stored in the wget script, so that file 
would be the key for citing/finding files.
My advice: store it together with the data (which I know most people are 
doing already).
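
One possible way to keep that information next to the data, as a rough
sketch only. The ".source.json" sidecar name, its fields and the
SHA-256 default are my own invention here and are not the format used
by the wget script:

```python
import hashlib
import json


def checksum_of(path, algorithm="sha256", chunk_size=1 << 20):
    """Hex digest of the file contents, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_record(data_path, url, algorithm="sha256"):
    """Store URL + checksum in a hypothetical sidecar file next to the data."""
    record = {
        "url": url,
        "checksum_type": algorithm,
        "checksum": checksum_of(data_path, algorithm),
    }
    record_path = data_path + ".source.json"
    with open(record_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
    return record_path


def verify_record(data_path):
    """Return True if the file still matches the checksum stored beside it."""
    with open(data_path + ".source.json", encoding="utf-8") as handle:
        record = json.load(handle)
    return checksum_of(data_path, record["checksum_type"]) == record["checksum"]
```

write_record() would be called once after download, and verify_record()
whenever the file needs to be cited or compared against a new version.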

...I'm already changing the subject of this thread... sorry for that.

Thanks,
Estani
On 15.02.2012 11:44, stephen.pascoe at stfc.ac.uk wrote:
>
> Hi all,
>
> This subject is full of gray areas :-(.  I would say keeping the 
> tracking_id the same is ok as it is an indication that the contents of 
> the NetCDF haven't changed.
>
> Practical matters for CMIP5 aside, I've been thinking about how we 
> could create an unambiguous manifest of a dataset-version, i.e. one 
> containing enough information to uniquely identify its contents 
> without any extraneous information that might change with dataset 
> location, available services, etc.  I came to the conclusion there 
> are 2 possible solutions: either it's a sorted list of (filename, 
> checksum) pairs or it's just a sorted list of checksums.  The 
> difference is whether filenames are "part of the dataset".  My 
> instinct is that you can't decouple filenames from the dataset.  Users 
> expect filenames to be meaningful and in some contexts information 
> inside files could refer to filenames within the dataset (e.g. 
> gridspec files).  This is how every other contents-based 
> versioning/packaging system I know of works: git, BagIt, BitTorrent.
>
> So, that's a long way of saying a new version would be necessary, on 
> theoretical grounds as well as pragmatic ones.
>
> Cheers,
>
> Stephen.
>
> ---
>
> Stephen Pascoe  +44 (0)1235 445980
>
> Centre of Environmental Data Archival
>
> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>
> *From:*go-essp-tech-bounces at ucar.edu 
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Kettleborough, Jamie
> *Sent:* 15 February 2012 10:20
> *To:* Estanislao Gonzalez; go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
> Thanks Estani,
>
> do you have any thoughts on the tracking_id?  Should this be left as 
> is (I think what you say below implies it should).
>
> Jamie
>
>     ------------------------------------------------------------------------
>
>     *From:*go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Estanislao
>     Gonzalez
>     *Sent:* 15 February 2012 10:06
>     *To:* go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>     Hi,
>
>     a new version needs to be published, so that its
>     publication will signal that something has changed. If not, then
>     the change won't be picked up by other services, e.g. replica
>     services, as they would assume the data hasn't been changed at all.
>     In our particular case (DKRZ) we will see that the files haven't
>     been changed, provided that the checksums are properly published,
>     and will just link them to the older ones, allowing users to get
>     both versions (i.e. users that are already downloading files will
>     be able to keep downloading them, and not have to start everything
>     anew because of this renaming).
>     (I can't see if the checksums are there because the node is not
>     accessible at this time)
>
>     We could use that information to infer what happened and display
>     it under the history information.
>
>     Maintaining the same version provides no benefit for the publisher
>     at all and creates the same confusion for the user (who will see
>     that files are missing).
>
>     Just my 2c,
>     Estani
>
>     On 15.02.2012 10:29, Kettleborough, Jamie wrote:
>
>     Hello,
>
>     sorry, a bit of a side track, but maybe useful.  I know this is an
>     unusual case - but it is another example of an understandable slip
>     that can be made when producing data.  When Laura republishes
>     these should it be under a new publication data set version or
>     not?   I think the only thing that is changing is the filename -
>     is that right?  I don't think this warrants a new publication data
>     set version, but could be wrong.
>
>     Jamie
>
>     ------------------------------------------------------------------------
>
>     *From:*go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Laura Carriere
>     *Sent:* 14 February 2012 19:43
>     *To:* Jennifer Adams
>     *Cc:* go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>
>         Ah, now I see your reply.  Since you have a solution to your
>         immediate problem, I will not rush the republish but will have
>         it done as soon as it's convenient.  Thanks.
>
>           Laura.
>
>         On 2/14/2012 2:32 PM, Jennifer Adams wrote:
>
>         Oh dear. Rather than renaming the files, I used a set of
>         symlinks to solve my immediate problem, but I still find this
>         a bit troubling. I will have to check all GEOS-5 data files I
>         grab from now on. I asked Larry Marx to check if any of COLA's
>         CMIP5 data at NASA have 8-digit date stamps, and he found
>         everything to be correct with only YYYYMM date strings.
>
>         --Jennifer
>
>         On Feb 14, 2012, at 1:29 PM, Laura Carriere wrote:
>
>
>
>
>         Quick answer on my way to a meeting - CMOR2 was used for this
>         and at least one other dataset that we have (from COLA) that
>         also has the yyyymmdd format.  I'll ask a few other questions
>         after my meeting but that's the short answer.
>
>           Laura.
>
>         On 2/14/2012 1:21 PM, Karl Taylor wrote:
>
>         Dear Novice (with clearly more knowledge than most so-called
>         experts),
>
>         I'm copying a contact for the GEOS-5 model who may be able to
>         provide some information on this.  I can't explain why the
>         monthly file names are inconsistent with what CMOR2 puts out. 
>         Maybe CMOR2 wasn't used.  The DRS document doesn't absolutely
>         forbid including more precision than necessary in specifying
>         the time-periods, so I don't think we can force them to rename
>         their files.  That being said, my hope was everyone would use
>         CMOR, so the file names would all follow the same template.
>
>         Karl
>
>         On 2/13/12 9:30 AM, Jennifer Adams wrote:
>
>         Dear Experts,
>
>         Here is a dataset:
>
>         http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NASA-GMAO.GEOS-5.decadal1960.mon.atmos.Amon.r1i1p1.html
>
>         And here is the file name template for all the variables in
>         this dataset:
>
>         <varname>_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc
>
>         My script to generate a GrADS descriptor for this file barked
>         because the MONTHLY data file has time stamps in the YYYYMMDD
>         format.
>
>         If I have read the DRS document correctly, this is not a
>         correct file name.
>
>         Shouldn't I be able to assume that monthly files will have
>         only YYYYMM date strings?
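
A script could test for this before building a descriptor. The sketch
below only checks whether both time stamps in the name have six digits
(YYYYMM); the regular expression and function names are assumptions made
for illustration, "tas" is a placeholder variable name, and as Karl
notes above the DRS may not strictly forbid the longer form.

```python
import re

# Trailing time range of a CMIP5-style file name: _<start>-<end>.nc
TIME_RANGE = re.compile(r"_(\d+)-(\d+)\.nc$")


def time_range_digits(filename):
    """Return the digit counts of the start and end stamps, or None if absent."""
    match = TIME_RANGE.search(filename)
    if match is None:
        return None
    return len(match.group(1)), len(match.group(2))


def has_monthly_stamps(filename):
    """True only if both stamps are exactly six digits (YYYYMM)."""
    return time_range_digits(filename) == (6, 6)


# Placeholder variable name combined with the template from this dataset.
flagged = not has_monthly_stamps(
    "tas_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc")
```

Here flagged is True, because the stamps have eight digits (YYYYMMDD)
rather than six.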
>
>         --Jennifer
>
>         --
>
>         Jennifer M. Adams
>
>         IGES/COLA
>
>         4041 Powder Mill Road, Suite 302
>
>         Calverton, MD 20705
>
>         jma at cola.iges.org
>
>
>
>
>         -- 
>
>           
>
>            Laura Carriere, SAIC             laura.carriere at nasa.gov
>
>            NCCS, Code 606.2                 301 614-5064
>
>         _______________________________________________
>         GO-ESSP-TECH mailing list
>         GO-ESSP-TECH at ucar.edu
>         http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>
>
>
>
>
>     -- 
>
>     Estanislao Gonzalez
>
>       
>
>     Max-Planck-Institut für Meteorologie (MPI-M)
>
>     Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>
>     Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>
>       
>
>     Phone:   +49 (40) 46 00 94-126
>
>     E-Mail:  gonzalez at dkrz.de
>
>


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de
