[Go-essp-tech] Incorrect file names?

stephen.pascoe at stfc.ac.uk stephen.pascoe at stfc.ac.uk
Wed Feb 15 06:59:42 MST 2012


This thread is now officially off-topic but, just to pick Estani up on one point:
> I would say URL + checksum is enough information (that's more than the filename). The filename alone it's ok, but you will have to look for the
> dataset version...

I think we are talking about different use-cases.  I'm imagining a manifest that describes the dataset's contents at a particular version, independent of its location and the name we've given the version.  URL + checksum contains the dataset's location as well as its contents; in the case of DRS it also contains the version.  Think of the analogy of a git tree object -- it just contains the names and hashes of everything in the tree.  The URLs will be different for each replica and I was talking about a manifest that was the same for all replicas.
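
A minimal sketch of that distinction, assuming made-up replica URLs and short placeholder checksums (the hosts, paths, and the helper name manifest_from_urls are hypothetical and not part of any ESGF tool). Dropping the location from each (URL, checksum) pair should leave the same manifest for every replica:

```python
from posixpath import basename
from urllib.parse import urlparse


def manifest_from_urls(entries):
    """Reduce (url, checksum) pairs to sorted (filename, checksum) pairs."""
    return sorted(
        (basename(urlparse(url).path), checksum) for url, checksum in entries
    )


# Two hypothetical replicas of the same dataset version.
replica_a = [
    ("http://node-a.example.org/data/v1/tas_Amon.nc", "aa11"),
    ("http://node-a.example.org/data/v1/pr_Amon.nc", "bb22"),
]
replica_b = [
    ("http://node-b.example.org/mirror/pr_Amon.nc", "bb22"),
    ("http://node-b.example.org/mirror/tas_Amon.nc", "aa11"),
]

# The URL lists differ, the location-free manifests do not.
same_manifest = manifest_from_urls(replica_a) == manifest_from_urls(replica_b)
```

The URL lists themselves compare unequal, which is the point: they describe where the files are, while the reduced list describes only what the dataset version contains.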

This is an idea for the future really.

Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK

From: Estanislao Gonzalez [mailto:gonzalez at dkrz.de]
Sent: 15 February 2012 13:04
To: Pascoe, Stephen (STFC,RAL,RALSP)
Cc: jamie.kettleborough at metoffice.gov.uk; go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Incorrect file names?

Hi,

the tracking_id is generated by cmor, so if the files are re-concatenated using any other tool (e.g. cdo) then it should be left as is. This has the benefit of not altering the checksum and thus marking the file as unchanged. So basically a renaming should trigger a new version but should not alter the files in any other way (i.e. same checksum).

And indeed Stephen, the checksums alone are "enough" but they are not practical for any other purpose (until we have a service to reverse-search from checksums to files).
I would say URL + checksum is enough information (that's more than the filename). The filename alone is OK, but you will have to look up the dataset version...

So either dataset_id + version + file_name + checksum or url + checksum, which is the general case of the former.

Url + checksum is already being stored in the wget script, so that file would be the key for citing/finding files.
My advice: store it together with the data (which I know most people are doing already).

...I'm already changing the subject of this thread... sorry for that.

Thanks,
Estani
On 15.02.2012 11:44, stephen.pascoe at stfc.ac.uk wrote:
Hi all,

This subject is full of gray areas :-(.  I would say keeping the tracking_id the same is ok as it is an indication that the contents of the NetCDF file haven't changed.

Practical matters for CMIP5 aside, I've been thinking about how we could create an unambiguous manifest of a dataset-version, i.e. one containing enough information to uniquely identify its contents without any extraneous information that might change with dataset location, available services, etc.  I came to the conclusion there are 2 possible solutions: either it's a sorted list of (filename, checksum) pairs or it's just a sorted list of checksums.  The difference is whether filenames are "part of the dataset".  My instinct is that you can't decouple filenames from the dataset.  Users expect filenames to be meaningful and in some contexts information inside files could refer to filenames within the dataset (e.g. gridspec files).  This is how every other contents-based versioning/packaging system I know of works: git, BagIt, BitTorrent.
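
The two candidate manifests can be sketched as follows. This is an illustration only: the function names, the tab-separated line format, and the choice of SHA-256 for the overall digest are assumptions, and the checksums are short placeholders rather than real values.

```python
import hashlib


def build_manifest(files, include_names=True):
    """Return the manifest lines for a dataset version.

    files maps filename -> checksum. With include_names=True the manifest
    is a sorted list of (filename, checksum) pairs; otherwise it is just a
    sorted list of checksums.
    """
    if include_names:
        return [f"{name}\t{checksum}" for name, checksum in sorted(files.items())]
    return sorted(files.values())


def manifest_digest(files, include_names=True):
    """Hash the manifest so a dataset version gets a single identifier."""
    payload = "\n".join(build_manifest(files, include_names)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same file contents (same checksums), different file names.
original = {"tas_19610116-19701216.nc": "aa11", "pr_19610116-19701216.nc": "bb22"}
renamed = {"tas_196101-197012.nc": "aa11", "pr_196101-197012.nc": "bb22"}

names_matter = manifest_digest(original) != manifest_digest(renamed)
names_ignored = manifest_digest(original, False) == manifest_digest(renamed, False)
```

With filenames included, a pure rename yields a different digest and therefore a different dataset version; with checksums only, the rename is invisible.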

So, that's a long way of saying a new version would be necessary, on theoretical grounds as well as pragmatic ones.

Cheers,
Stephen.


From: go-essp-tech-bounces at ucar.edu On Behalf Of Kettleborough, Jamie
Sent: 15 February 2012 10:20
To: Estanislao Gonzalez; go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Incorrect file names?

Thanks Estani,

do you have any thoughts on the tracking_id?  Should this be left as is?  (I think what you say below implies it should.)

Jamie

________________________________
From: go-essp-tech-bounces at ucar.edu On Behalf Of Estanislao Gonzalez
Sent: 15 February 2012 10:06
To: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Incorrect file names?
Hi,

a new version needs to be published, so that its publication signals that something has changed. If not, the change won't be picked up by other services, e.g. replica services, as they would assume the data hasn't changed at all.
In our particular case (DKRZ) we will see that the files haven't changed, provided that the checksums are properly published, and will just link them to the older ones, allowing users to get both versions (i.e. users who are already downloading files will be able to keep downloading them, and won't have to start everything anew because of this renaming).
(I can't see if the checksums are there because the node is not accessible at this time)

We could use that information to infer what happened and display it under the history information.
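
A minimal sketch of that inference, assuming each version is available as a filename-to-checksum mapping and that checksums are unique within a version (the function name infer_renames and the sample data are hypothetical, not an existing DKRZ service):

```python
def infer_renames(old_version, new_version):
    """Return (old_name, new_name) pairs whose checksum is unchanged.

    Both arguments map filename -> checksum. A file counts as renamed when
    its checksum appears in both versions under different names.
    """
    old_by_checksum = {checksum: name for name, checksum in old_version.items()}
    renames = []
    for new_name, checksum in sorted(new_version.items()):
        old_name = old_by_checksum.get(checksum)
        if old_name is not None and old_name != new_name:
            renames.append((old_name, new_name))
    return renames


# Placeholder checksums; only the names differ between the two versions.
v1 = {"tas_Amon_19610116-19701216.nc": "aa11", "pr_Amon_19610116-19701216.nc": "bb22"}
v2 = {"tas_Amon_196101-197012.nc": "aa11", "pr_Amon_196101-197012.nc": "bb22"}

history = infer_renames(v1, v2)
```

Each pair in the result could be shown as a "renamed, content unchanged" entry in the history, and the new file could be linked to the existing one instead of being stored again.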

Maintaining the same version provides no benefit for the publisher at all and creates the same confusion for the user (who will see that files are missing).

Just my 2c,
Estani

On 15.02.2012 10:29, Kettleborough, Jamie wrote:
Hello,

sorry, a bit of a side track, but maybe useful.  I know this is an unusual case - but it is another example of an understandable slip that can be made when producing data.  When Laura republishes these should it be under a new publication data set version or not?   I think the only thing that is changing is the filename - is that right?  I don't think this warrants a new publication data set version, but could be wrong.

Jamie

________________________________
From: go-essp-tech-bounces at ucar.edu On Behalf Of Laura Carriere
Sent: 14 February 2012 19:43
To: Jennifer Adams
Cc: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Incorrect file names?

Ah, now I see your reply.  Since you have a solution to your immediate problem, I will not rush the republish but will have it done as soon as it's convenient.  Thanks.

  Laura.

On 2/14/2012 2:32 PM, Jennifer Adams wrote:
Oh dear. Rather than renaming the files, I used a set of symlinks to solve my immediate problem, but I still find this a bit troubling. I will have to check all GEOS-5 data files I grab from now on. I asked Larry Marx to check if any of COLA's CMIP5 data at NASA have 8-digit date stamps, and he found everything to be correct with only YYYYMM date strings.
--Jennifer


On Feb 14, 2012, at 1:29 PM, Laura Carriere wrote:




Quick answer on my way to a meeting - CMOR2 was used for this and at least one other dataset that we have (from COLA) that also has the yyyymmdd format.  I'll ask a few other questions after my meeting but that's the short answer.

  Laura.

On 2/14/2012 1:21 PM, Karl Taylor wrote:
Dear Novice (with clearly more knowledge than most so-called experts),

I'm copying a contact for the GEOS-5 model who may be able to provide some information on this.  I can't explain why the monthly file names are inconsistent with what CMOR2 puts out.  Maybe CMOR2 wasn't used.  The DRS document doesn't absolutely forbid including more precision than necessary in specifying the time-periods, so I don't think we can force them to rename their files.  That being said, my hope was everyone would use CMOR, so the file names would all follow the same template.

Karl

On 2/13/12 9:30 AM, Jennifer Adams wrote:
Dear Experts,

Here is a dataset:
http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NASA-GMAO.GEOS-5.decadal1960.mon.atmos.Amon.r1i1p1.html

And here is the file name template for all the variables in this dataset:
<varname>_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc

My script to generate a GrADS descriptor for this file barked because the MONTHLY data file has time stamps in the YYYYMMDD format.
If I have read the DRS document correctly, this is not a correct file name.
Shouldn't I be able to assume that monthly files will have only YYYYMM date strings?
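
A check along these lines could flag such files before a downstream script trips over them. This is a sketch only: it assumes the time range is the last underscore-separated field before ".nc", the function name has_monthly_stamps is hypothetical, and it tests the YYYYMM convention for monthly files rather than any hard rule.

```python
import re

# Assumption: file names end in "_<start>-<end>.nc" with purely numeric stamps.
TIME_RANGE = re.compile(r"_(\d+)-(\d+)\.nc$")


def has_monthly_stamps(filename):
    """True if both ends of the time range are 6-digit YYYYMM strings."""
    match = TIME_RANGE.search(filename)
    if match is None:
        return False
    return all(len(stamp) == 6 for stamp in match.groups())


as_published = "tas_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc"
as_expected = "tas_Amon_GEOS-5_decadal1960_r1i1p1_196101-197012.nc"

published_ok = has_monthly_stamps(as_published)
expected_ok = has_monthly_stamps(as_expected)
```

Here "tas" stands in for <varname>; the published name fails the check because its stamps have eight digits, while the YYYYMM form passes.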

--Jennifer


--
Jennifer M. Adams
IGES/COLA
4041 Powder Mill Road, Suite 302
Calverton, MD 20705
jma at cola.iges.org

--
  Laura Carriere, SAIC                 laura.carriere at nasa.gov
  NCCS, Code 606.2                 301 614-5064
_______________________________________________
GO-ESSP-TECH mailing list
GO-ESSP-TECH at ucar.edu
http://mailman.ucar.edu/mailman/listinfo/go-essp-tech

--
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de


--
Scanned by iCritical.




