[Go-essp-tech] Incorrect file names?

Gavin M. Bell gavin at llnl.gov
Wed Feb 15 12:04:47 MST 2012


 Hi,

I think the solution you are looking for is to build a Merkle tree for
every dataset.
Standardize the paths based on the DRS (whether or not they are laid out
that way on the filesystem).
Build the Merkle tree and a human-readable manifest file with the DRS
path + checksum + size + etc.
Merkle hash trees have nice properties that enable efficient
anti-entropy: namely, zooming in on exactly which parts differ between
two datasets.  I wrote one a while back... but it would be a good
exercise for someone to build a Merkle tree builder for our datasets.
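
To illustrate the idea, here is a minimal Python sketch of such a
builder (not the one mentioned above).  It assumes each file is given as
a DRS-style path plus an already computed checksum; the names
build_tree and diff, the node layout, and the use of SHA-256 for the
interior nodes are all illustrative assumptions, not part of any
existing ESGF tool.

```python
import hashlib


def _seal(node):
    """Compute and store the hash of every directory node, bottom-up."""
    if "children" not in node:  # leaf: the file checksum is its hash
        return node["hash"]
    lines = [f"{name}:{_seal(child)}"
             for name, child in sorted(node["children"].items())]
    node["hash"] = hashlib.sha256("\n".join(lines).encode()).hexdigest()
    return node["hash"]


def build_tree(files):
    """Build a Merkle tree from {drs_path: checksum}.

    Directory nodes look like {"children": {...}, "hash": ...};
    file nodes look like {"hash": checksum}.
    """
    root = {"children": {}}
    for path, checksum in files.items():
        parts = path.strip("/").split("/")
        node = root
        for part in parts[:-1]:
            node = node["children"].setdefault(part, {"children": {}})
        node["children"][parts[-1]] = {"hash": checksum}
    _seal(root)
    return root


def diff(a, b, prefix=""):
    """Return the file paths that differ or exist on one side only."""
    if a is not None and b is not None and a["hash"] == b["hash"]:
        return []  # identical subtrees: no need to descend
    kids_a = a.get("children", {}) if a else {}
    kids_b = b.get("children", {}) if b else {}
    if not kids_a and not kids_b:
        return [prefix]  # a file that changed or is missing on one side
    changed = []
    for name in sorted(set(kids_a) | set(kids_b)):
        child_path = f"{prefix}/{name}" if prefix else name
        changed += diff(kids_a.get(name), kids_b.get(name), child_path)
    return changed
```

Two sites would first compare only the root hashes; diff descends into a
subtree only when its hash differs, which is what makes the comparison
cheap when almost everything matches.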

Between the manifest and the Merkle tree we would be all good.
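
For the manifest side, a small sketch along the same lines, again only
an illustration: it assumes the manifest is a sorted, tab-separated
listing of DRS path, checksum and size, and that SHA-256 is the checksum
in use.  The function names and the column layout are hypothetical.

```python
import hashlib
import os


def file_record(root, rel_path, algo="sha256"):
    """Return (path, checksum, size) for one file below root."""
    full = os.path.join(root, rel_path)
    digest = hashlib.new(algo)
    with open(full, "rb") as handle:
        # Read in 1 MiB chunks so large NetCDF files are not loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return (rel_path.replace(os.sep, "/"), digest.hexdigest(),
            os.path.getsize(full))


def format_manifest(records):
    """Render records as sorted 'path<TAB>checksum<TAB>size' lines.

    Sorting makes the manifest independent of the order in which files
    were scanned, so identical replicas yield identical manifests.
    """
    return "".join(f"{path}\t{checksum}\t{size}\n"
                   for path, checksum, size in sorted(records))
```

Because nothing location-specific (hostnames, URLs) goes into the
records, the same text should come out at every replica of the same
dataset version.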

On 2/15/12 5:59 AM, stephen.pascoe at stfc.ac.uk wrote:
>
>  
>
> This thread is now officially off-topic, but just to pick Estani up
> on one point:
>
> > I would say URL + checksum is enough information (that's more than
> > the filename). The filename alone is ok, but you will have to look
> > for the dataset version...
>
> I think we are talking about different use-cases.  I'm imagining a
> manifest that describes the dataset's contents at a particular
> version, independent of its location and the name we've given the
> version.  URL + checksum contains the dataset's location as well as
> its contents; in the case of DRS it also contains the version.  Think
> of the analogy of a git tree object -- it just contains the names and
> hashes of everything in the tree.  The URLs will be different for each
> replica and I was talking about a manifest that was the same for all
> replicas.
>
>  
>
> This is an idea for the future really.
>
>  
>
> Stephen.
>
>  
>
> ---
>
> Stephen Pascoe  +44 (0)1235 445980
>
> Centre of Environmental Data Archival
>
> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>
>  
>
> *From:* Estanislao Gonzalez [mailto:gonzalez at dkrz.de]
> *Sent:* 15 February 2012 13:04
> *To:* Pascoe, Stephen (STFC,RAL,RALSP)
> *Cc:* jamie.kettleborough at metoffice.gov.uk; go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>  
>
> Hi,
>
> the tracking_id is generated by CMOR, so if the files are
> re-concatenated using any other tool (e.g. cdo) then it should be left
> as is. This has the benefit of not altering the checksum and thus
> marking the file as unchanged. So basically a renaming should trigger
> a new version but should not alter the files in any other way (i.e.
> same checksum).
>
> And indeed, Stephen, the checksums alone are "enough", but that's not
> practical for any other purposes (until we have a service to
> reverse-search from checksums to files).
> I would say URL + checksum is enough information (that's more than the
> filename). The filename alone is ok, but you will have to look for
> the dataset version...
>
> So either dataset_id + version + file_name + checksum or url +
> checksum, which is the general case of the former.
>
> Url + checksum is already being stored in the wget script, so that
> file would be the key for citing/finding files.
> My advice: store it together with the data (which I know most people
> are doing already).
>
> ...I'm already changing the subject of this thread... sorry for that.
>
> Thanks,
> Estani
> On 15.02.2012 11:44, stephen.pascoe at stfc.ac.uk wrote:
>
> Hi all,
>
>  
>
> This subject is full of gray areas :-(.  I would say keeping the
> tracking_id the same is ok, as it is an indication that the contents of
> the NetCDF file haven't changed.
>
>  
>
> Practical matters for CMIP5 aside, I've been thinking about how we
> could create an unambiguous manifest of a dataset-version, i.e. one
> containing enough information to uniquely identify its contents
> without any extraneous information that might change with dataset
> location, available services, etc.  I came to the conclusion there
> are 2 possible solutions: either it's a sorted list of (filename,
> checksum) pairs or it's just a sorted list of checksums.  The
> difference is whether filenames are "part of the dataset".  My
> instinct is that you can't decouple filenames from the dataset.  Users
> expect filenames to be meaningful, and in some contexts information
> inside files could refer to filenames within the dataset (e.g.
> gridspec files).  This is how every other contents-based
> versioning/packaging system I know of works: git, BagIt, BitTorrent.
>
>  
>
> So, that's a long way of saying a new version would be necessary, on
> theoretical grounds as well as pragmatic ones.
>
>  
>
> Cheers,
>
> Stephen.
>
>  
>
>  
>
> *From:* go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of* Kettleborough, Jamie
> *Sent:* 15 February 2012 10:20
> *To:* Estanislao Gonzalez; go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>  
>
> Thanks Estani,
>
>  
>
> do you have any thoughts on the tracking_id?  Should this be left as
> is?  (I think what you say below implies it should.)
>
>  
>
> Jamie
>
>      
>
>     ------------------------------------------------------------------------
>
>     *From:* go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of* Estanislao
>     Gonzalez
>     *Sent:* 15 February 2012 10:06
>     *To:* go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>     Hi,
>
>     A new version needs to be published, so that its publication
>     signals that something has changed. If not, the change won't be
>     picked up by other services, e.g. replica services, as they would
>     assume the data hasn't changed at all.
>     In our particular case (DKRZ) we will see that the files haven't
>     been changed, provided that the checksums are properly published,
>     and will just link them to the older ones, allowing users to get
>     both versions (i.e. users that are already downloading files will
>     be able to keep downloading them, and will not have to start
>     everything anew because of this renaming).
>     (I can't see if the checksums are there because the node is not
>     accessible at this time.)
>
>     We could use that information to infer what happened and display
>     it under the history information.
>
>     Maintaining the same version provides no benefit for the publisher
>     at all and creates the same confusion for the user (who will see
>     that files are missing).
>
>     Just my 2c,
>     Estani
>
>     On 15.02.2012 10:29, Kettleborough, Jamie wrote:
>
>     Hello,
>
>      
>
>     sorry, a bit of a side track, but maybe useful.  I know this is an
>     unusual case - but it is another example of an understandable slip
>     that can be made when producing data.  When Laura republishes
>     these, should it be under a new publication data set version or
>     not?   I think the only thing that is changing is the filename -
>     is that right?  I don't think this warrants a new publication data
>     set version, but I could be wrong.
>
>      
>
>     Jamie
>
>      
>
>     ------------------------------------------------------------------------
>
>     *From:* go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of* Laura Carriere
>     *Sent:* 14 February 2012 19:43
>     *To:* Jennifer Adams
>     *Cc:* go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>
>         Ah, now I see your reply.  Since you have a solution to your
>         immediate problem, I will not rush the republish but will have
>         it done as soon as it's convenient.  Thanks.
>
>           Laura.
>
>         On 2/14/2012 2:32 PM, Jennifer Adams wrote:
>
>         Oh dear. Rather than renaming the files, I used a set of
>         symlinks to solve my immediate problem, but I still find this
>         a bit troubling. I will have to check all GEOS-5 data files I
>         grab from now on. I asked Larry Marx to check if any of COLA's
>         CMIP5 data at NASA have 8-digit date stamps, and he found
>         everything to be correct with only YYYYMM date strings. 
>
>         --Jennifer
>
>          
>
>          
>
>         On Feb 14, 2012, at 1:29 PM, Laura Carriere wrote:
>
>
>
>
>
>         Quick answer on my way to a meeting - CMOR2 was used for this
>         and at least one other dataset that we have (from COLA) that
>         also has the yyyymmdd format.  I'll ask a few other questions
>         after my meeting but that's the short answer.
>
>           Laura.
>
>         On 2/14/2012 1:21 PM, Karl Taylor wrote:
>
>         Dear Novice (with clearly more knowledge than most so-called
>         experts),
>
>         I'm copying a contact for the GEOS-5 model who may be able to
>         provide some information on this.  I can't explain why the
>         monthly file names are inconsistent with what CMOR2 puts out. 
>         Maybe CMOR2 wasn't used.  The DRS document doesn't absolutely
>         forbid including more precision than necessary in specifying
>         the time-periods, so I don't think we can force them to rename
>         their files.  That being said, my hope was everyone would use
>         CMOR, so the file names would all follow the same template.
>
>         Karl
>
>         On 2/13/12 9:30 AM, Jennifer Adams wrote:
>
>         Dear Experts, 
>
>          
>
>         Here is a dataset:
>
>         http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NASA-GMAO.GEOS-5.decadal1960.mon.atmos.Amon.r1i1p1.html
>
>          
>
>         And here is the file name template for all the variables in
>         this dataset: 
>
>         <varname>_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc 
>
>          
>
>         My script to generate a GrADS descriptor for this file barked
>         because the MONTHLY data file has time stamps in the YYYYMMDD
>         format. 
>
>         If I have read the DRS document correctly, this is not a
>         correct file name. 
>
>         Shouldn't I be able to assume that monthly files will have
>         only YYYYMM date strings? 
>
>          
>
>         --Jennifer
>
>          
>
>          
>
>         --
>
>         Jennifer M. Adams
>
>         IGES/COLA
>
>         4041 Powder Mill Road, Suite 302
>
>         Calverton, MD 20705
>
>         jma at cola.iges.org
>
>          
>
>          
>
>          
>
>
>
>
>
>         -- 
>
>          
>
>           Laura Carriere, SAIC                 laura.carriere at nasa.gov
>
>           NCCS, Code 606.2                 301 614-5064
>
>         _______________________________________________
>         GO-ESSP-TECH mailing list
>         GO-ESSP-TECH at ucar.edu
>         http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>
>
>     -- 
>
>     Estanislao Gonzalez
>
>      
>
>     Max-Planck-Institut für Meteorologie (MPI-M)
>
>     Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>
>     Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>
>      
>
>     Phone:   +49 (40) 46 00 94-126
>
>     E-Mail:  gonzalez at dkrz.de
>
>  
>
> -- 
> Scanned by iCritical.
>
>  
>

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
