[Go-essp-tech] Incorrect file names?
Gavin M. Bell
gavin at llnl.gov
Wed Feb 15 12:04:47 MST 2012
Hi,
I think the solution you are looking for is to build a Merkle trie for
every dataset.
Standardize the paths based on the DRS (whether or not they are laid out
that way on the filesystem).
Build the Merkle trie and a human-readable manifest file with the DRS
path + checksum + size + etc.
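
A minimal sketch of such a manifest, under stated assumptions: one
tab-separated line per file (DRS path, checksum, size), sorted by DRS path,
with SHA-256 as the checksum. The line format, the hash choice, the function
names and the example paths are all hypothetical, not an agreed convention.

```python
import hashlib
import io


def manifest_entry(drs_path, stream):
    """Return (drs_path, sha256 hex digest, size in bytes) for one binary stream."""
    digest = hashlib.sha256()
    size = 0
    # Read in 1 MiB chunks so large NetCDF files are never held in memory.
    for chunk in iter(lambda: stream.read(1 << 20), b""):
        digest.update(chunk)
        size += len(chunk)
    return drs_path, digest.hexdigest(), size


def build_manifest(streams):
    """streams maps DRS path -> open binary stream; returns the manifest text."""
    entries = sorted(manifest_entry(path, s) for path, s in streams.items())
    return "".join(f"{path}\t{checksum}\t{size}\n" for path, checksum, size in entries)


# Hypothetical paths and in-memory contents, for illustration only.
demo = build_manifest({
    "example/dataset/v1/tas.nc": io.BytesIO(b"second"),
    "example/dataset/v1/pr.nc": io.BytesIO(b"first"),
})
```

Because the entries are keyed and sorted by DRS path rather than by URL,
the same text should come out at every replica. For real files, pass
streams from open(path, "rb") instead of io.BytesIO.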
Merkle hash tries have nice properties that enable efficient
anti-entropy: namely, zooming in on exactly what differs between two
datasets. I wrote one a while back... but it would be a good exercise
for someone to build a Merkle trie builder for our datasets.
Between the manifest and the Merkle trie we would be all good.
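
A rough sketch of the idea, not the builder I mentioned above: a hash tree
keyed by DRS path components, plus a comparison that only descends into
subtrees whose hashes differ. The node layout, the function names, the way
interior hashes are combined and the placeholder checksums are all
assumptions made for illustration.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).hexdigest()


def build_tree(checksums):
    """Build a nested hash tree from a {drs_path: file checksum} mapping.

    Every node is {"hash": str, "children": {name: node}}; leaves have no children.
    """
    root = {"children": {}}
    for path, checksum in checksums.items():
        node = root
        for part in path.split("/"):
            node = node["children"].setdefault(part, {"children": {}})
        node["hash"] = checksum
    _seal(root)
    return root


def _seal(node):
    """Fill in interior hashes bottom-up from the sorted (name, child hash) pairs."""
    if node["children"]:
        lines = [f"{name}:{_seal(child)}"
                 for name, child in sorted(node["children"].items())]
        node["hash"] = _h("\n".join(lines).encode())
    else:
        node.setdefault("hash", _h(b""))  # only reached for an empty tree
    return node["hash"]


def diff(a, b, prefix=""):
    """Yield DRS paths that differ, skipping every subtree whose hashes match."""
    if a is not None and b is not None and a["hash"] == b["hash"]:
        return
    a_kids = a["children"] if a else {}
    b_kids = b["children"] if b else {}
    if not a_kids and not b_kids:
        yield prefix
        return
    for name in sorted(set(a_kids) | set(b_kids)):
        child_path = f"{prefix}/{name}" if prefix else name
        yield from diff(a_kids.get(name), b_kids.get(name), child_path)


# Placeholder checksums and paths; real ones would come from the manifest.
site_a = build_tree({"ds/v1/tas.nc": "aaa", "ds/v1/pr.nc": "bbb", "ds/v2/tas.nc": "ccc"})
site_b = build_tree({"ds/v1/tas.nc": "aaa", "ds/v1/pr.nc": "XXX", "ds/v2/tas.nc": "ccc"})
changed = list(diff(site_a, site_b))
```

Two sites only need to exchange root hashes to know whether they agree;
when they do not, the walk touches just the branches on the way down to
the files that differ.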
On 2/15/12 5:59 AM, stephen.pascoe at stfc.ac.uk wrote:
>
>
>
> This thread is now officially off-topic but, just to pick Estani up
> on one point
>
> > I would say URL + checksum is enough information (that's more than
> > the filename). The filename alone is OK, but you will have to look
> > for the dataset version...
>
> I think we are talking about different use-cases. I'm imagining a
> manifest that describes the dataset's contents at a particular
> version, independent of its location and the name we've given the
> version. URL + checksum contains the dataset's location as well as
> its contents; in the case of DRS it also contains the version. Think
> of the analogy of a git tree object -- it just contains the names and
> hashes of everything in the tree. The URLs will be different for each
> replica and I was talking about a manifest that was the same for all
> replicas.
>
>
>
> This is an idea for the future really.
>
>
>
> Stephen.
>
>
>
> ---
>
> Stephen Pascoe +44 (0)1235 445980
>
> Centre of Environmental Data Archival
>
> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>
>
>
> *From:* Estanislao Gonzalez [mailto:gonzalez at dkrz.de]
> *Sent:* 15 February 2012 13:04
> *To:* Pascoe, Stephen (STFC,RAL,RALSP)
> *Cc:* jamie.kettleborough at metoffice.gov.uk; go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>
>
> Hi,
>
> the tracking_id is generated by CMOR, so if the files are
> re-concatenated using any other tool (e.g. cdo) then it should be left
> as is. This has the benefit of not altering the checksum and thus
> marking the file as it is. So basically a renaming should trigger a
> new version but should not alter the files in any other way (i.e. same
> checksum).
>
> And indeed Stephen, the checksums alone are "enough" but that's not
> practical for any other purposes (until we have a service to
> reverse-search for files).
> I would say URL + checksum is enough information (that's more than the
> filename). The filename alone is OK, but you will have to look for
> the dataset version...
>
> So either dataset_id + version + file_name + checksum or url +
> checksum, which is the general case of the former.
>
> Url + checksum is already being stored in the wget script, so that
> file would be the key for citing/finding files.
> My advice: store it together with the data (which I know most people
> are doing already).
>
> ...I'm already changing the subject of this thread... sorry for that.
>
> Thanks,
> Estani
> On 15.02.2012 11:44, stephen.pascoe at stfc.ac.uk wrote:
>
> Hi all,
>
>
>
> This subject is full of gray areas :-(. I would say keeping the
> tracking_id the same is ok as it is an indication that the contents of
> the NetCDF file haven't changed.
>
>
>
> Practical matters for CMIP5 aside, I've been thinking about how we
> could create an unambiguous manifest of a dataset-version, i.e. one
> containing enough information to uniquely identify its contents
> without any extraneous information that might change with dataset
> location, available services, etc. I came to the conclusion there
> are 2 possible solutions: either it's a sorted list of (filename,
> checksum) pairs or it's just a sorted list of checksums. The
> difference is whether filenames are "part of the dataset". My
> instinct is that you can't decouple filenames from the dataset. Users
> expect filenames to be meaningful and in some contexts information
> inside files could refer to filenames within the dataset (e.g.
> gridspec files). This is how every other contents-based
> versioning/packaging system I know of works: git, BagIt, BitTorrent.
>
>
>
> So, that's a long way of saying a new version would be necessary, on
> theoretical grounds as well as pragmatic ones.
>
>
>
> Cheers,
>
> Stephen.
>
>
>
>
> *From:* go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Kettleborough, Jamie
> *Sent:* 15 February 2012 10:20
> *To:* Estanislao Gonzalez; go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>
>
> Thanks Estani,
>
>
>
> do you have any thoughts on the tracking_id? Should this be left as
> is? (I think what you say below implies it should.)
>
>
>
> Jamie
>
>
>
> ------------------------------------------------------------------------
>
> *From:* go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Estanislao
> Gonzalez
> *Sent:* 15 February 2012 10:06
> *To:* go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
> Hi,
>
> a new version must be published, so that its
> publication will signal that something has changed. If not, then
> it won't be picked up by other services, e.g. replica services, as
> they would assume the data hasn't been changed at all.
> In our particular case (DKRZ) we will see files haven't been
> changed, provided that the checksums are properly published, and
> will just link them to the older ones, allowing users to get both
> versions (i.e. users that are already downloading files will be
> able to keep downloading them, and not have to start everything
> anew because of this renaming)
> (I can't see if the checksums are there because the node is not
> accessible at this time)
>
> We could use that information to infer what happened and display
> it under the history information.
>
> Maintaining the same version provides no benefit for the publisher
> at all and creates the same confusion for the user (who will see
> that files are missing).
>
> Just my 2c,
> Estani
>
> On 15.02.2012 10:29, Kettleborough, Jamie wrote:
>
> Hello,
>
>
>
> sorry, a bit of a side track, but maybe useful. I know this is an
> unusual case - but it is another example of an understandable slip
> that can be made when producing data. When Laura republishes
> these should it be under a new publication data set version or
> not? I think the only thing that is changing is the filename -
> is that right? I don't think this warrants a new publication data
> set version, but could be wrong.
>
>
>
> Jamie
>
>
>
> ------------------------------------------------------------------------
>
> *From:* go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Laura Carriere
> *Sent:* 14 February 2012 19:43
> *To:* Jennifer Adams
> *Cc:* go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Incorrect file names?
>
>
> Ah, now I see your reply. Since you have a solution to your
> immediate problem, I will not rush the republish but will have
> it done as soon as it's convenient. Thanks.
>
> Laura.
>
> On 2/14/2012 2:32 PM, Jennifer Adams wrote:
>
> Oh dear. Rather than renaming the files, I used a set of
> symlinks to solve my immediate problem, but I still find this
> a bit troubling. I will have to check all GEOS-5 data files I
> grab from now on. I asked Larry Marx to check if any of COLA's
> CMIP5 data at NASA have 8-digit date stamps, and he found
> everything to be correct with only YYYYMM date strings.
>
> --Jennifer
>
>
>
>
>
> On Feb 14, 2012, at 1:29 PM, Laura Carriere wrote:
>
>
>
>
>
> Quick answer on my way to a meeting - CMOR2 was used for this
> and at least one other dataset that we have (from COLA) that
> also has the yyyymmdd format. I'll ask a few other questions
> after my meeting but that's the short answer.
>
> Laura.
>
> On 2/14/2012 1:21 PM, Karl Taylor wrote:
>
> Dear Novice (with clearly more knowledge than most so-called
> experts),
>
> I'm copying a contact for the GEOS-5 model who may be able to
> provide some information on this. I can't explain why the
> monthly file names are inconsistent with what CMOR2 puts out.
> Maybe CMOR2 wasn't used. The DRS document doesn't absolutely
> forbid including more precision than necessary in specifying
> the time-periods, so I don't think we can force them to rename
> their files. That being said, my hope was everyone would use
> CMOR, so the file names would all follow the same template.
>
> Karl
>
> On 2/13/12 9:30 AM, Jennifer Adams wrote:
>
> Dear Experts,
>
>
>
> Here is a dataset:
>
> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NASA-GMAO.GEOS-5.decadal1960.mon.atmos.Amon.r1i1p1.html
>
>
>
> And here is the file name template for all the variables in
> this dataset:
>
> <varname>_Amon_GEOS-5_decadal1960_r1i1p1_19610116-19701216.nc
>
>
>
> My script to generate a GrADS descriptor for this file barked
> because the MONTHLY data file has time stamps in the YYYYMMDD
> format.
>
> If I have read the DRS document correctly, this is not a
> correct file name.
>
> Shouldn't I be able to assume that monthly files will have
> only YYYYMM date strings?
>
>
>
> --Jennifer
>
>
>
>
>
> --
>
> Jennifer M. Adams
>
> IGES/COLA
>
> 4041 Powder Mill Road, Suite 302
>
> Calverton, MD 20705
>
> jma at cola.iges.org
>
>
>
>
>
>
>
>
>
>
>
> --
>
>
>
> Laura Carriere, SAIC laura.carriere at nasa.gov
>
> NCCS, Code 606.2 301 614-5064
>
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>
> --
>
> Estanislao Gonzalez
>
>
>
> Max-Planck-Institut für Meteorologie (MPI-M)
>
> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>
> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>
>
>
> Phone: +49 (40) 46 00 94-126
>
> E-Mail: gonzalez at dkrz.de
>
>
>
> --
> Scanned by iCritical.
>
--
Gavin M. Bell
Lawrence Livermore National Labs
--
"Never mistake a clear view for a short distance."
-Paul Saffo
(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
A796 CE39 9C31 68A4 52A7 1F6B 66B7 B250 21D5 6D3E