[Go-essp-tech] tracking_id and check sums was... RE: Status of Gateway 2.0 (another use case)

Estanislao Gonzalez gonzalez at dkrz.de
Fri Dec 16 09:58:38 MST 2011


Hi,

Indeed, that's what I think :-)
Changing a file with anything other than CMOR won't regenerate the 
tracking_id, and that is always the case when only the header is changed 
(wrong variable name, wrong grid type, wrong email address). This causes 
different files from different versions to share the same identifier.

BUT!
tracking_id + checksum is unique and should point to the file in 
question (the checksum alone carries no semantics, so it could be 
disastrous if we rely on it and reuse the approach in a similar 
project... we wouldn't be able to tell those files apart even if the 
checksum is unique... but that's just my view)
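
For illustration, a minimal shell sketch of how such a composite key 
could be built (the file name is hypothetical; ncdump and md5sum are 
assumed to be available):

  # Pull the tracking_id out of the NetCDF header (ncdump -h prints only the header)
  tid=$(ncdump -h tas_Amon_HadCM3_historical_r1i1p1.nc | sed -n 's/.*tracking_id = "\([^"]*\)".*/\1/p')
  # Checksum the whole file, so any change to header or data is caught
  sum=$(md5sum tas_Amon_HadCM3_historical_r1i1p1.nc | cut -d' ' -f1)
  # Together the pair identifies both the provenance and the exact bytes
  echo "$tid $sum"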

Well, I wish you all happy holidays!
(I'm leaving tomorrow, that's why the "early" greetings)
Estani

On 16.12.2011 10:55, stephen.pascoe at stfc.ac.uk wrote:
>
> I agree, and this is a debate that has been waiting in the wings for 
> some time.  I believe Estani doubts the tracking_id is of much use.  
> I'm on the fence -- it is a record of what the data was when it passed 
> through CMOR, and it is quicker to check than the md5sum.  It does not 
> guarantee what data you have, though.
>
> One example of where it could be useful is if people want to aggregate 
> their NetCDF files.  The files would change, but the tracking_ids 
> would still tell you where the data came from.
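>
> A sketch of that scenario, using NCO's ncrcat for the aggregation 
> (file names are hypothetical):
>
>   # Concatenate monthly files along the record dimension into one file
>   ncrcat tas_Amon_HadCM3_historical_r1i1p1_*.nc tas_aggregated.nc
>   # The aggregate has a new checksum, but ncrcat copies global
>   # attributes from the first input, so a tracking_id survives
>   ncdump -h tas_aggregated.nc | grep tracking_id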
>
> Stephen.
>
> ---
>
> Stephen Pascoe  +44 (0)1235 445980
>
> Centre of Environmental Data Archival
>
> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>
> *From:*Kettleborough, Jamie [mailto:jamie.kettleborough at metoffice.gov.uk]
> *Sent:* 16 December 2011 09:50
> *To:* Pascoe, Stephen (STFC,RAL,RALSP); jma at cola.iges.org; 
> go-essp-tech at ucar.edu
> *Cc:* Kettleborough, Jamie
> *Subject:* tracking_id and check sums was... RE: [Go-essp-tech] Status 
> of Gateway 2.0 (another use case)
>
> Hello Stephen, Bryan,
>
> I wouldn't rely on tracking_id - there is too high a likelihood that 
> it is not unique. We have seen cases where different files have the 
> same tracking_id. (Though in the cases we have seen, the data has been 
> the same and there were just minor updates to the metadata - hmmm... 
> maybe that statement was a red rag to a bull.)  I think the checksum 
> is the most reliable indicator of the uniqueness of a file, though 
> clearly it's not enough on its own, as it doesn't tell you what has 
> been changed or why, or how changes in one file are related to changes 
> in other files.
>
> We've also seen examples where data providers have tried to be helpful 
> - which is great - and put a version number as an attribute in the 
> netCDF file... but then that attribute has not been updated when the 
> files have been published as a new version...
>
> Karl - where are we with the agreement that all data nodes should 
> provide checksums with the data?  I think it's agreed in principle, 
> but I'm not sure whether and when the implications of that agreement 
> will be followed up.
>
> Jamie
>
>     ------------------------------------------------------------------------
>
>     *From:*go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of
>     *stephen.pascoe at stfc.ac.uk
>     *Sent:* 16 December 2011 09:25
>     *To:* jma at cola.iges.org; go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] Status of Gateway 2.0 (another use case)
>
>     Hi Jennifer,
>
>     I just wanted to add a few more technical specifics to this
>     sub-thread about versions.  Bryan's point that it has all been a
>     compromise is the take-home message.
>
>     > If the version is so important and needs to be preserved, then
>     > it should have been included in the data file name. It's
>     > obviously too late to make that change now.
>
>     Indeed, and it was already too late when we got agreement on the
>     format for version identifiers.  By that point CMOR, the tool that
>     generates the filenames, was already finalised and being run at
>     some modelling centres.  Also, a version has to be assigned much
>     later in the process than when CMOR is run.  Bryan is right that
>     the tracking_id or md5 checksum should provide the link between
>     file and version.  Unfortunately we don't have tools for that yet.
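>
>     As a sketch of what such a tool could look like (assuming the data
>     sits in a DRS tree with the version as a vYYYYMMDD path component;
>     the root path is hypothetical):
>
>       # Walk a DRS tree and emit "checksum  version  path" records
>       find /data/cmip5/output1 -name '*.nc' | while read f; do
>         ver=$(echo "$f" | grep -o 'v[0-9]\{8\}')
>         echo "$(md5sum "$f" | cut -d' ' -f1)  $ver  $f"
>       done > checksum_version_map.txt
>
>     Looking up a downloaded file's md5sum in that table would then
>     recover the version it belongs to.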
>
>     Although the filenames don't contain versions, the wget scripts do
>     *provided datanodes have their data in DRS directory format*.  ESG
>     insiders know this has been a long-term bugbear of mine.
>     Presently IPSL, BADC and DKRZ have this, and maybe some others
>     too, but not all datanodes have implemented it.  Maybe the wget
>     scripts need to include versions in a more explicit way than just
>     the DRS path, which would allow datanodes that can't implement the
>     DRS layout to include versions.  It would be good if wget scripts
>     replicated the DRS directory structure at the client.  That's
>     something I wish we'd implemented by now, but since not every
>     datanode has the DRS structure it's impossible to do
>     federation-wide.
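>
>     For comparison, plain wget can already mirror a server path
>     locally with something like the following (the URL, version and
>     file name are made up; --cut-dirs strips the leading non-DRS path
>     components):
>
>       # -x forces local directories, -nH drops the hostname component
>       wget -x -nH --cut-dirs=2 http://datanode.example.org/thredds/fileServer/cmip5/output1/BCC/bcc-csm1-1/decadal1960/day/land/day/r1i1p1/v20111215/mrsos/mrsos_day_bcc-csm1-1_decadal1960_r1i1p1_19610101-19701231.nc
>
>     The question is how to get the generated wget scripts to do the
>     equivalent for every datanode.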
>
>     Thanks for the great feedback.
>
>     Stephen.
>
>     *From:*go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Jennifer Adams
>     *Sent:* 15 December 2011 19:58
>     *To:* go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] Status of Gateway 2.0 (another use case)
>
>     On Dec 15, 2011, at 2:14 PM, Bryan Lawrence wrote:
>
>     Hi Jennifer
>
>     With due respect, it's completely unrealistic to expect modelling
>     groups not to want to have multiple versions of some datasets ...
>     that's just not how the world (and in particular, modelling
>     workflow) works. It has never been thus. There simply isn't time
>     to look at everything before it is released ... if you have a
>     problem with that, blame the government folk who set the IPCC
>     timetables :-)  (Maybe your comment was somewhat tongue in cheek,
>     but I feel obliged to make this statement anyway :-).
>
>     Fair enough. I was being cheeky; that is why I put the :-). The
>     users suffer the IPCC time constraints too: we have to deliver
>     analyses of data that take an impossibly long time to grab.
>
>
>     Also, with due respect, please don't "replace files with newer
>     versions" ... we absolutely need folks to understand the idea of
>     processing with one particular version of the data, and of
>     understanding its provenance, so that they know that if the data
>     has changed, they may need to re-run the processing.
>
>     If the version is so important and needs to be preserved, then it
>     should have been included in the data file name. It's obviously
>     too late to make that change now. As I mentioned before, the
>     version number is a valuable piece of metadata that is lost in the
>     wget download process. The problem of how to keep track of version
>     numbers and update my copy when necessary remains.
>
>     I'll take this opportunity to point out that the realm and
>     frequency are also missing from the file name. I can't remember
>     where I read this, but the MIP_table value is not always adequate
>     for uniquely determining what the realm and frequency are.
>
>
>     I'm sure this doesn't apply to you, but for too long our community
>     has had a pretty cavalier attitude to data provenance! CMIP3 and
>     AR4 were a "dog's breakfast" in this regard ...
>
>     Looks like CMIP5 hasn't improved the situation.
>
>
>     (And I too am very grateful that you are laying out your
>     requirements in some detail :-)
>
>     I'm glad to hear that.
>
>     --Jennifer
>
>
>     Cheers
>     Bryan
>
>
>         On Dec 15, 2011, at 11:22 AM, Estanislao Gonzalez wrote:
>
>             Hi Jennifer,
>
>             I'll check this more carefully and see what can be done
>             with what we have (or with minimal changes), though
>             multiple versions are something CMIP3 never worried about:
>             files just got changed or deleted. CMIP5 adds a two-figure
>             factor to that, since there are many more institutions and
>             much more data... but it might be possible.
>
>         At the moment, I have no good ideas for how to solve the
>         problem of replacing files in my local CMIP5 collection with
>         newer versions if they are available. My strategy at this
>         point is to get the version that is available now and not look
>         for it again. If any data providers are listening, here is my
>         plea:
>
>         ==> Please don't submit new versions of your CMIP5 data. Get
>         it right the first time! <==
>
>         :-)
>
>             In any case, I just wanted to thank you very much for the
>             detailed description; it is very useful.
>
>         I'm glad you (and Steve Hankin) find my long emails helpful.
>
>         --Jennifer
>
>             Regards,
>
>             Estani
>
>             On 15.12.2011 14:52, Jennifer Adams wrote:
>
>                 Hi, Estanislao --
>
>                 Please see my comments inline.
>
>                 On Dec 15, 2011, at 5:47 AM, Estanislao Gonzalez wrote:
>
>                     Hi Jennifer,
>
>                     I'm still not sure how Luca's change to the API is
>                     going to help you, Jennifer. But perhaps it would
>                     help me to fully understand your requirement, as
>                     well as your use of wget with the FTP protocol.
>
>                     I presume what you want is to crawl the archive
>                     and get files from a specific directory structure?
>
>                     Maybe it would be better if you just describe
>                     briefly the procedure you've been using for
>                     getting the CMIP3 data so we can see what could be
>                     done for CMIP5.
>
>                     How did you find out which data was interesting?
>
>                 COLA scientists ask for a specific
>                 scenario/realm/frequency/variable they need for their
>                 research. Our CMIP3 collection is a shared resource of
>                 about 4 TB of data. For CMIP5, we are working with an
>                 estimate of 4-5 times that data volume to meet our
>                 needs. It's hard to say at this point whether that
>                 will be enough.
>
>                     How did you find out which files were required to
>                     be downloaded?
>
>                 For CMIP3, we often referred to
>                 http://www-pcmdi.llnl.gov/ipcc/data_status_tables.htm
>                 to see what was available.
>
>                 The new version of this chart for CMIP5,
>                 http://cmip-pcmdi.llnl.gov/cmip5/esg_tables/transpose_esg_static_table.html,
>                 is also useful. An improvement I'd like to see on this
>                 page: the numbers inside the blue boxes that show how
>                 many runs there are for a particular experiment/model
>                 should be a link to a list of those runs that has all
>                 the necessary components from the Data Reference
>                 Syntax, so that I can go directly to the URL for that
>                 data set. For example,
>
>                 the BCC-CSM1.1 model shows 45 runs for the decadal1960
>                 experiment. I would like to click on that 45 and get a
>                 list of the 45 URLs for those runs, like this:
>
>                 http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r1i1p1.html
>
>                 http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r2i1p1.html
>
>                 ...
>
>                     How did you tell wget to download those files?
>
>                 For example:
>
>                   wget -nH --retr-symlinks -r -A nc -o log.tas ftp://username@ftp-esg.ucllnl.org/picntrl/atm/mo/tas
>
>                 This would populate a local directory
>                 ./picntrl/atm/mo/tas with all the models and ensemble
>                 members in their proper subdirectories. If I wanted to
>                 update with newer versions or models that had been
>                 added, I simply ran the same one-line wget command
>                 again. This is what I refer to as 'elegant.'
>
>                     We might already have some way of achieving what
>                     you want, if we knew exactly what that is.
>
>                 Wouldn't that be wonderful? I am hopeful that the P2P
>                 will simplify the elaborate and flawed workflow I have
>                 cobbled together to navigate the current system.
>
>                 I have a list of desired
>                 experiment/realm/frequency/MIP_table/variables for
>                 which I need to grab all available models/ensembles.
>                 Is that not enough to describe my needs?
>
>                     I guess my proposal of issuing:
>
>                     bash <(wget -qO - "http://p2pnode/wget?experiment=decadal1960&realm=atmos&time_frequency=month&variable=clt" | grep -v HadCM3)
>
>                 Yes, this would likely achieve the same result as the
>                 '&model=!name' that Luca implemented. However, I
>                 believe the documentation says there is a limit of
>                 1000 on the number of files the p2pnode will put into
>                 a single wget script, so I don't want to populate my
>                 precious 1000 results with entries that I'm going to
>                 grep out afterwards.
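>
>                 (With the negation done server-side, the filtering
>                 would happen before the 1000-file limit applies,
>                 e.g. something like:
>
>                   bash <(wget -qO - "http://p2pnode/wget?experiment=decadal1960&realm=atmos&time_frequency=month&variable=clt&model=!HadCM3")
>
>                 with p2pnode again standing in for a real host name.)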
>
>                 --Jennifer
>
>                     was not acceptable to you. But I still don't know
>                     exactly why.
>
>                     It would really help to know what you meant by
>                     "elegant use of wget".
>
>                     Thanks,
>
>                     Estani
>
>                     On 14.12.2011 18:44, Cinquini, Luca (3880) wrote:
>
>                         So Jennifer, would having the capability of
>                         doing negative searches (model=!CCSM), and
>                         generating the corresponding wget scripts,
>                         help you?
>
>                         thanks, Luca
>
>                         On Dec 14, 2011, at 10:38 AM, Jennifer Adams
>                         wrote:
>
>                             Well, after working from the client side
>                             to get CMIP3 and CMIP5 data, I can say
>                             that wget is a fine tool to rely on at the
>                             core of the workflow. Unfortunately, the
>                             step up in complexity from CMIP3 to CMIP5
>                             and the switch from FTP to HTTP trashed
>                             the elegant use of wget. No amount of
>                             customized wrapper software, browser
>                             interfaces, or pre-packaged tools like DML
>                             fixes that problem.
>
>                             At the moment, the burden on the user is
>                             embarrassingly high. It's so easy to
>                             suggest that the user should "filter to
>                             remove what is not required" from a
>                             downloaded script, but the actual practice
>                             of doing that in a timely, automated, and
>                             distributed way is NOT simple! And if the
>                             solution to my problem of filling in the
>                             gaps in my incomplete collection is to go
>                             back to clicking in my browser and doing
>                             the whole thing over again, but with
>                             smarter filters that look for what has
>                             already been acquired or what has a new
>                             version number... this is unacceptable.
>                             The filtering must be a server-side
>                             responsibility and the interface must be
>                             accessible to automated scripts. Make it
>                             so!
>
>                             By the way, the version number is a piece
>                             of metadata that is not in the downloaded
>                             files or the gateway's search criteria. It
>                             appears in the wget script as part of the
>                             path in the file's HTTP location, but the
>                             path is not preserved after the wget is
>                             complete, so it is effectively lost once
>                             the download is done. I guess the file's
>                             date stamp would be the only way to know
>                             if the version number of the data file in
>                             question has changed, but I'm not going
>                             to write that check into my filtering
>                             scripts.
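>
>                             (A sketch of a workaround, assuming the
>                             script is saved as wget_cmip5.sh and the
>                             URLs follow the DRS
>                             .../<version>/<variable>/<file> layout:
>
>                               # Save "filename version" pairs before the paths are discarded
>                               grep -o "http[^\"' ]*\.nc" wget_cmip5.sh | awk -F/ '{print $NF, $(NF-2)}' > versions.txt
>
>                             would keep the version metadata alongside
>                             the downloaded files.)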
>
>                             --Jennifer
>
>     --
>     Bryan Lawrence
>     University of Reading:  Professor of Weather and Climate Computing.
>     National Centre for Atmospheric Science: Director of Models and Data.
>     STFC: Director of the Centre for Environmental Data Archival.
>     Ph: +44 118 3786507 or 1235 445012; Web: home.badc.rl.ac.uk/lawrence
>
>     --
>
>     Jennifer M. Adams
>
>     IGES/COLA
>
>     4041 Powder Mill Road, Suite 302
>
>     Calverton, MD 20705
>
>     jma at cola.iges.org


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de
