[Go-essp-tech] tracking_id and check sums was... RE: Status of Gateway 2.0 (another use case)
Karl Taylor
taylor13 at llnl.gov
Fri Dec 16 12:19:11 MST 2011
Hi,
A careful preparer of CMIP output will generate a new tracking_id and
record it whenever anything in the header (or elsewhere in the file)
is changed. So I agree with Bryan: the tracking_id is very useful
except when the data generator is lazy and fails to create a new
tracking_id. I'd bet that for well over 99% of the files written, the
tracking_id will be reliable (certainly much better than 80/20). Of
course, a checksum tells you definitively whether two files are
identical, but it has limitations too, as pointed out in earlier
emails. Also, as noted earlier by others, most files that differ but
share a tracking_id contain identical data and differ only in
slightly modified "headers". In most cases this means that no matter
which of the files with identical tracking_ids a user works from,
their results could be reproduced, which is mostly what we care about.
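(As a minimal illustration of the checksum point, with hypothetical
filenames: equal MD5 digests mean the two copies are byte-for-byte
identical.)

    # compare a local copy against a freshly downloaded one;
    # equal digests => the files are identical
    md5sum tas_Amon_local.nc tas_Amon_downloaded.nc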
That being said, perhaps I need to send the modeling groups a reminder
that they should generate a new tracking_id when they modify their files
in any way.
Oh, and I will try to send out an email this weekend to the modeling
groups urging them to add checksums to their TDS catalogs as soon as
possible (i.e., I agree this is essential and high priority). Besides
enabling users and "replicators" to determine whether they've got the
latest version of a file and whether it has been transferred
correctly, are there other important reasons for including the
checksum that I should articulate in my email?
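(Generating the checksums themselves is the easy part; a minimal
sketch, assuming a local dataset tree; the path is hypothetical and
the publication step depends on the local publisher configuration:)

    # compute MD5 checksums for every NetCDF file in a dataset tree,
    # ready to be fed into the catalog publication workflow
    find /data/cmip5/output1 -name '*.nc' -exec md5sum {} \; > checksums.md5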
Best regards,
Karl
On 12/16/11 8:58 AM, Estanislao Gonzalez wrote:
> Hi,
>
> Indeed, that's what I think :-)
> Changing the file with anything other than CMOR won't regenerate the
> tracking_id, and this always happens when only the header is changed
> (wrong variable name, wrong grid type, wrong email address). This
> causes different files from different versions to share the same number.
>
> BUT!
> tracking_id + checksum is unique and should point to the file in
> question (the checksum alone carries no semantics, so it could be
> disastrous in the future if we rely on it and use it in a similar
> project... we won't be able to tell those files apart even if the
> checksum is unique... but that's just my view)
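>
> (A minimal sketch of deriving such a composite identity, assuming
> the standard tracking_id global attribute and the ncdump and md5sum
> tools; the filename is hypothetical:)
>
>     tid=$(ncdump -h file.nc | sed -n 's/.*tracking_id = "\(.*\)".*/\1/p')
>     sum=$(md5sum file.nc | cut -d' ' -f1)
>     echo "${tid}:${sum}"   # unique per file, yet still carries provenance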
>
> Well, I wish you all happy holidays!
> (I'm leaving tomorrow, that's why the "early" greetings)
> Estani
>
> Am 16.12.2011 10:55, schrieb stephen.pascoe at stfc.ac.uk:
>>
>> I agree, and this is a debate that has been waiting in the wings for
>> some time. I believe Estani doubts the tracking_id is much use. I'm
>> on the fence -- it is a record of what the data was when it passed
>> through CMOR, and it is quicker to check than the md5sum. It does not
>> guarantee what data you have, though.
>>
>> One example of where it could be useful is if people want to
>> aggregate their NetCDF files. Their files would change, but the
>> tracking_ids would still tell you where the data came from.
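>>
>> (A minimal sketch of carrying that provenance through an NCO
>> aggregation; the attribute name source_tracking_ids and the
>> filenames are illustrative, not a convention:)
>>
>>     # collect the tracking_id of each input file
>>     ids=$(for f in tas_day_*.nc; do
>>         ncdump -h "$f" | sed -n 's/.*tracking_id = "\(.*\)".*/\1/p'
>>     done)
>>     # concatenate along the record dimension, then record the sources
>>     ncrcat tas_day_*.nc aggregated_tas_day.nc
>>     ncatted -a source_tracking_ids,global,c,c,"$ids" aggregated_tas_day.nc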
>>
>> Stephen.
>>
>> ---
>>
>> Stephen Pascoe +44 (0)1235 445980
>>
>> Centre of Environmental Data Archival
>>
>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>>
>> *From:*Kettleborough, Jamie
>> [mailto:jamie.kettleborough at metoffice.gov.uk]
>> *Sent:* 16 December 2011 09:50
>> *To:* Pascoe, Stephen (STFC,RAL,RALSP); jma at cola.iges.org;
>> go-essp-tech at ucar.edu
>> *Cc:* Kettleborough, Jamie
>> *Subject:* tracking_id and check sums was... RE: [Go-essp-tech]
>> Status of Gateway 2.0 (another use case)
>>
>> Hello Stephen Bryan,
>>
>> I wouldn't rely on the tracking_id - there is too high a likelihood
>> that it is not unique. We have seen cases where different files have
>> the same tracking_id ('though in the cases we have seen, the data
>> has been the same and there are just minor updates to the metadata -
>> hmmm... maybe that statement was a red rag to a bull). I think the
>> checksum is the most reliable indicator of the uniqueness of a
>> file, though clearly it's not enough on its own, as it doesn't tell
>> you what has been changed or why, or how changes in one file are
>> related to changes in other files.
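>>
>> (For two files that turn out to share a tracking_id, a minimal
>> sketch of seeing what actually differs, comparing headers only;
>> filenames hypothetical:)
>>
>>     diff <(ncdump -h fileA.nc) <(ncdump -h fileB.nc)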
>>
>> We've also seen examples where data providers have tried to be
>> helpful - which is great - and put a version number as an attribute
>> in the netCDF file... but then that has not been updated when the
>> files have been published at a new version...
>>
>> Karl - where are we with the agreement that all data nodes should
>> provide checksums with the data? I think it's agreed in principle,
>> but I'm not sure whether and when the implications of that agreement
>> will be followed up.
>>
>> Jamie
>>
>> ------------------------------------------------------------------------
>>
>> *From:*go-essp-tech-bounces at ucar.edu
>> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of
>> *stephen.pascoe at stfc.ac.uk
>> *Sent:* 16 December 2011 09:25
>> *To:* jma at cola.iges.org; go-essp-tech at ucar.edu
>> *Subject:* Re: [Go-essp-tech] Status of Gateway 2.0 (another use
>> case)
>>
>> Hi Jennifer,
>>
>> I just wanted to add a few more technical specifics to this
>> sub-thread about versions. Bryan's point that it has all been a
>> compromise is the take-home message.
>>
>> > If the version is so important and needs to be preserved, then it
>> > should have been included in the data file name. It's obviously
>> > too late to make that change now.
>>
>> Indeed, and it was already too late when we got agreement on the
>> format for version identifiers. By that point CMOR, the tool
>> that generates the filenames, was already finalised and being run
>> at some modelling centres. Also a version has to be assigned
>> much later in the process than when CMOR is run. Bryan is right
>> that the tracking_id or md5 checksum should provide the link
>> between file and version. Unfortunately we don't have tools for
>> that yet.
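>>
>> (A minimal sketch of the kind of tool that's missing, assuming a
>> DRS-style tree with vYYYYMMDD version directories: build a
>> checksum-to-version table so the version of any file you hold can
>> be looked up:)
>>
>>     find /data/cmip5 -name '*.nc' | while read -r f; do
>>         v=$(echo "$f" | sed -n 's#.*/\(v[0-9]\{8\}\)/.*#\1#p')
>>         printf '%s %s %s\n' "$(md5sum "$f" | cut -d' ' -f1)" "$v" "$f"
>>     done > checksum_to_version.txt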
>>
>> Although the filenames don't contain versions, the wget scripts do,
>> *provided datanodes have their data in DRS directory format*.
>> ESG insiders know this has been a long-term bugbear of mine.
>> Presently IPSL, BADC and DKRZ have this, and maybe some others too,
>> but not all datanodes have implemented it. Maybe the wget
>> scripts need to include versions in a more explicit way than just
>> the DRS path, which would allow datanodes that can't implement DRS
>> to include versions. It would be good if wget scripts replicated
>> the DRS directory structure at the client. That's something I
>> wish we'd implemented by now, but since not every datanode has the
>> DRS structure it's impossible to implement federation-wide.
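>>
>> (To illustrate, with a hypothetical host and dataset: in a
>> DRS-style URL the vYYYYMMDD path component is the version, and
>> wget's -x (--force-directories) flag mirrors the server's directory
>> tree locally, preserving the version that the flat filename loses:)
>>
>>     url=http://datanode.example.org/thredds/fileServer/cmip5/output1/MOHC/HadGEM2-ES/historical/mon/atmos/Amon/r1i1p1/v20111215/tas/tas_Amon_HadGEM2-ES_historical_r1i1p1_185912-200511.nc
>>     wget -x "$url"   # recreates .../r1i1p1/v20111215/tas/ on the client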
>>
>> Thanks for the great feedback.
>>
>> Stephen.
>>
>> ---
>>
>> Stephen Pascoe +44 (0)1235 445980
>>
>> Centre of Environmental Data Archival
>>
>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
>> 0QX, UK
>>
>> *From:*go-essp-tech-bounces at ucar.edu
>> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Jennifer Adams
>> *Sent:* 15 December 2011 19:58
>> *To:* go-essp-tech at ucar.edu
>> *Subject:* Re: [Go-essp-tech] Status of Gateway 2.0 (another use
>> case)
>>
>> On Dec 15, 2011, at 2:14 PM, Bryan Lawrence wrote:
>>
>> Hi Jennifer
>>
>> With due respect, it's completely unrealistic to expect modelling
>> groups not to want to have multiple versions of some datasets ...
>> that's just not how the world (and in particular, modelling
>> workflow) works. It has never been thus. There simply isn't time
>> to look at everything before it is released ... if you have a
>> problem with that, blame the government folk who set the IPCC
>> timetables :-) (Maybe your comment was somewhat tongue in cheek,
>> but I feel obliged to make this statement anyway :-).
>>
>> Fair enough. I was being cheeky; that is why I put the :-). The
>> users suffer the IPCC time constraints too: we have to deliver
>> analyses of data that takes an impossibly long time to grab.
>>
>>
>> Also, with due respect, please don't "replace files with newer
>> versions" ... we absolutely need folks to understand the idea of
>> processing with one particular version of the data, and
>> understanding the provenance of that, so that they understand if
>> the data has changed, they may need to re-run the processing.
>>
>> If the version is so important and needs to be preserved, then it
>> should have been included in the data file name. It's obviously
>> too late to make that change now. As I mentioned before, the
>> version number is a valuable piece of metadata that is lost in
>> the wget download process. The problem of how to keep track of
>> version numbers and update my copy when necessary remains.
>>
>> I'll take this opportunity to point out that the realm and
>> frequency are also missing from the file name. I can't remember
>> where I read this, but the MIP_table value is not always adequate
>> for uniquely determining what the realm and frequency are.
>>
>>
>> I'm sure this doesn't apply to you, but for too long our
>> community has had a pretty cavalier attitude to data provenance!
>> CMIP3 and AR4 were a "dog's breakfast" in this regard …
>>
>> Looks like CMIP5 hasn't improved the situation.
>>
>>
>> (And I too am very grateful that you are laying out your
>> requirements in some detail :-)
>>
>> I'm glad to hear that.
>>
>> --Jennifer
>>
>>
>> Cheers
>> Bryan
>>
>>
>> On Dec 15, 2011, at 11:22 AM, Estanislao Gonzalez wrote:
>>
>> Hi Jennifer,
>>
>> I'll check this more carefully and see what can be done
>> with what we have (or minimal changes), though the
>> multiple-versions issue is something CMIP3 didn't worry
>> about; files just got changed or deleted. CMIP5 adds a
>> two-figure factor to that, since there are many more
>> institutions and data... but it might be possible.
>>
>> At the moment, I have no good ideas for how to solve the
>> problem of replacing files in my local CMIP5 collection with
>> newer versions if they are available. My strategy at this
>> point is to get the version that is available now and not
>> look for it again. If any data providers are listening, here
>> is my plea:
>>
>> ==> Please don't submit new versions of your CMIP5 data. Get
>> it right the first time! <==
>>
>> :-)
>>
>> In any case, I just wanted to thank you very much for the
>> detailed description; it is very useful.
>>
>> I'm glad you (and Steve Hankin) find my long emails helpful.
>>
>> --Jennifer
>>
>> Regards,
>>
>> Estani
>>
>> Am 15.12.2011 14:52, schrieb Jennifer Adams:
>>
>> Hi, Estanislao --
>>
>> Please see my comments inline.
>>
>> On Dec 15, 2011, at 5:47 AM, Estanislao Gonzalez wrote:
>>
>> Hi Jennifer,
>>
>> I'm still not sure how Luca's change in the API is
>> going to help you, Jennifer. But perhaps it would
>> help me to fully understand your requirement, as
>> well as your use of wget with the FTP
>> protocol.
>>
>> I presume what you want is to crawl the archive
>> and get files from a specific directory structure?
>>
>> Maybe it would be better if you just describe
>> briefly the procedure you've been using for
>> getting the CMIP3 data so we can see what could
>> be done for CMIP5.
>>
>> How did you find out which data was interesting?
>>
>> COLA scientists ask for a specific
>> scenario/realm/frequency/variable they need for their
>> research. Our CMIP3 collection is a shared resource
>> of about 4 TB of data. For CMIP5, we are working with
>> an estimate of 4-5 times that data volume to meet our
>> needs. It's hard to say at this point whether that
>> will be enough.
>>
>> How did you find out which files were required to
>> be downloaded?
>>
>> For CMIP3, we often referred to
>> http://www-pcmdi.llnl.gov/ipcc/data_status_tables.htm
>> to see what was available.
>>
>> The new version of this chart for CMIP5,
>> http://cmip-pcmdi.llnl.gov/cmip5/esg_tables/transpose_esg_static_table.html,
>> is also useful. An improvement I'd like to see on
>> this page: the numbers inside the blue boxes that
>> show how many runs there are for a particular
>> experiment/model should be a link to a list of those
>> runs that have all the necessary components from the
>> Data Reference Syntax so that I can go directly to
>> the URL for that data set. For example,
>>
>> the BCC-CSM1.1 model shows 45 runs for the
>> decadal1960 experiment. I would like to click on that
>> 45 and get a list of the 45 URLs for those runs, like
>> this:
>>
>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r1i1p1.html
>>
>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r2i1p1.html
>>
>> ...
>>
>> How did you tell wget to download those files?
>>
>> For example: wget -nH --retr-symlinks -r -A nc
>> ftp://username@ftp-esg.ucllnl.org/picntrl/atm/mo/tas
>> -o log.tas
>>
>> This would populate a local directory
>> ./picntrl/atm/mo/tas with all the models and ensemble
>> members in the proper subdirectories. If I wanted to
>> update with newer versions or models that had been
>> added, I simply ran the same 1-line wget command
>> again. This is what I refer to as 'elegant.'
>>
>> We might already have some way of achieving what
>> you want, if we knew exactly what that is.
>>
>> Wouldn't that be wonderful? I am hopeful that the P2P
>> will simplify the elaborate and flawed workflow I
>> have cobbled together to navigate the current system.
>>
>> I have a list of desired
>> experiment/realm/frequency/MIP_table/variables for
>> which I need to grab all available models/ensembles.
>> Is that not enough to describe my needs?
>>
>> I guess my proposal of issuing:
>>
>> bash <(wget "http://p2pnode/wget?experiment=decadal1960&realm=atmos&time_frequency=month&variable=clt" -qO - | grep -v HadCM3)
>>
>> Yes, this would likely achieve the same result as the
>> '&model=!name' that Luca implemented. However, I
>> believe the documentation says that there is a limit
>> of 1000 on the number of files that the p2pnode will
>> put into a single wget script, so I don't want to
>> populate my precious 1000 results with entries that I'm
>> going to grep out afterwards.
>>
>> --Jennifer
>>
>> was not acceptable to you. But I still don't know
>> exactly why.
>>
>> It would really help to know what you meant by
>> "elegant use of wget".
>>
>> Thanks,
>>
>> Estani
>>
>> Am 14.12.2011 18:44, schrieb Cinquini, Luca (3880):
>>
>> So Jennifer, would having the capability of
>> doing negative searches (model=!CCSM), and
>> generating the corresponding wget scripts, help
>> you?
>>
>> thanks, Luca
>>
>> On Dec 14, 2011, at 10:38 AM, Jennifer Adams
>> wrote:
>>
>> Well, after working from the client side
>> to get CMIP3 and CMIP5 data, I can say
>> that wget is a fine tool to rely on at
>> the core of the workflow. Unfortunately,
>> the step up in complexity from CMIP3 to
>> CMIP5 and the switch from FTP to HTTP
>> trashed the elegant use of wget. No
>> amount of customized wrapper software,
>> browser interfaces, or pre-packaged tools
>> like DML fixes that problem.
>>
>> At the moment, the burden on the user is
>> embarrassingly high. It's so easy to
>> suggest that the user should "filter to
>> remove what is not required" from a
>> downloaded script, but the actual practice
>> of doing that in a timely and automated
>> and distributed way is NOT simple! And if
>> the solution to my problem of filling in
>> the gaps in my incomplete collection is
>> to go back to clicking in my browser and
>> do the whole thing over again but make my
>> filters smarter by looking for what's
>> already been acquired or what has a new
>> version number … this is unacceptable.
>> The filtering must be a server-side
>> responsibility and the interface must be
>> accessible by automated scripts. Make it so!
>>
>> By the way, the version number is a piece
>> of metadata that is not in the downloaded
>> files or the gateway's search criteria.
>> It appears in the wget script as part of
>> the path in the file's HTTP location, but
>> the path is not preserved after the wget
>> is complete, so it is effectively lost
>> after the download is done. I guess
>> the file's date stamp would be the only
>> way to know if the version number of the
>> data file in question has been changed,
>> but I'm not going to write that check
>> into my filtering scripts.
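>>
>> (One possible stopgap, sketched under the assumption that the
>> script's file URLs contain a vYYYYMMDD DRS component and that the
>> script was saved as wget-script.sh: pull the versions out before
>> they are lost:)
>>
>>     grep -o 'http[^ "]*\.nc' wget-script.sh \
>>         | sed -n 's#.*/\(v[0-9]\{8\}\)/.*#\1#p' | sort -u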
>>
>> --Jennifer
>>
>>
>>
>>
>>
>>
>>
>> --
>> Bryan Lawrence
>> University of Reading: Professor of Weather and Climate Computing.
>> National Centre for Atmospheric Science: Director of Models and
>> Data.
>> STFC: Director of the Centre for Environmental Data Archival.
>> Ph: +44 118 3786507 or 1235 445012; Web: home.badc.rl.ac.uk/lawrence
>>
>> --
>>
>> Jennifer M. Adams
>>
>> IGES/COLA
>>
>> 4041 Powder Mill Road, Suite 302
>>
>> Calverton, MD 20705
>>
>> jma at cola.iges.org <mailto:jma at cola.iges.org>
>>
>>
>>
>>
>>
>>
>>
>
>
> --
> Estanislao Gonzalez
>
> Max-Planck-Institut für Meteorologie (MPI-M)
> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>
> Phone: +49 (40) 46 00 94-126
> E-Mail: gonzalez at dkrz.de