[Go-essp-tech] tracking_id and check sums was... RE: Status of Gateway 2.0 (another use case)
Karl Taylor
taylor13 at llnl.gov
Fri Dec 16 12:19:11 MST 2011
Hi,
A careful preparer of CMIP output will generate a new tracking_id and
record it whenever anything in the header (or elsewhere in the file)
is changed. So I agree with Bryan: the tracking_id is very useful
except when the data generator is lazy and fails to create a new
tracking_id. I'd bet that for well over 99% of the files written, the
tracking_id will be reliable (certainly much better than 80/20). Of
course, a checksum tells you definitively whether two files are
identical, but it has limitations too, as pointed out in earlier
emails. Also, as noted earlier by others, most files that differ but
share a tracking_id contain identical data and differ only in
slightly modified "headers". In most cases this means that no matter
which of the files with identical tracking_ids a user works from,
their results could be reproduced, which is mostly what we care about.
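(As a minimal illustration of the checksum point, with hypothetical
filenames: equal MD5 digests mean the two copies are byte-for-byte
identical.)

    # compare a local copy against a freshly downloaded one;
    # equal digests => the files are identical
    md5sum tas_Amon_local.nc tas_Amon_downloaded.nc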
That being said, perhaps I need to send the modeling groups a reminder
that they should generate a new tracking_id when they modify their files
in any way.
Oh, and I will try to send out an email this weekend to the modeling
groups urging them to add checksums to their TDS catalogs as soon as
possible (i.e., I agree this is essential and high priority). Besides
enabling users and "replicators" to determine whether they've got the
latest version of a file and whether it has been transferred
correctly, are there other important reasons for including the
checksum that I should articulate in my email?
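(Generating the checksums themselves is the easy part; a minimal
sketch, assuming a local dataset tree; the path is hypothetical and
the publication step depends on the local publisher configuration:)

    # compute MD5 checksums for every NetCDF file in a dataset tree,
    # ready to be fed into the catalog publication workflow
    find /data/cmip5/output1 -name '*.nc' -exec md5sum {} \; > checksums.md5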
Best regards,
Karl
On 12/16/11 8:58 AM, Estanislao Gonzalez wrote:
> Hi,
>
> Indeed, that's what I think :-)
> Changing the file with anything other than CMOR won't regenerate the
> tracking_id, and this always happens when only the header is changed
> (wrong variable name, wrong grid type, wrong email address). This
> causes different files from different versions to share the same number.
>
> BUT!
> tracking_id + checksum is unique and should point to the file in
> question (the checksum alone carries no semantics, so it could be
> disastrous in the future if we rely on it and use it in a similar
> project... we won't be able to tell those files apart even if the
> checksum is unique... but that's just my view)
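>
> (A minimal sketch of deriving such a composite identity, assuming
> the standard tracking_id global attribute and the ncdump and md5sum
> tools; the filename is hypothetical:)
>
>     tid=$(ncdump -h file.nc | sed -n 's/.*tracking_id = "\(.*\)".*/\1/p')
>     sum=$(md5sum file.nc | cut -d' ' -f1)
>     echo "${tid}:${sum}"   # unique per file, yet still carries provenance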
>
> Well, I wish you all happy holidays!
> (I'm leaving tomorrow, that's why the "early" greetings)
> Estani
>
> Am 16.12.2011 10:55, schrieb stephen.pascoe at stfc.ac.uk:
>>
>> I agree, and this is a debate that has been waiting in the wings for
>> some time. I believe Estani doubts the tracking_id is much use. I'm
>> on the fence -- it is a record of what the data was when it passed
>> through CMOR, and it is quicker to check than the md5sum. It does not
>> guarantee what data you have, though.
>>
>> One example of where it could be useful is if people want to
>> aggregate their NetCDF files. Their files would change, but the
>> tracking_ids would still tell you where the data came from.
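>>
>> (A minimal sketch of carrying that provenance through an NCO
>> aggregation; the attribute name source_tracking_ids and the
>> filenames are illustrative, not a convention:)
>>
>>     # collect the tracking_id of each input file
>>     ids=$(for f in tas_day_*.nc; do
>>         ncdump -h "$f" | sed -n 's/.*tracking_id = "\(.*\)".*/\1/p'
>>     done)
>>     # concatenate along the record dimension, then record the sources
>>     ncrcat tas_day_*.nc aggregated_tas_day.nc
>>     ncatted -a source_tracking_ids,global,c,c,"$ids" aggregated_tas_day.nc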
>>
>> Stephen.
>>
>> ---
>>
>> Stephen Pascoe +44 (0)1235 445980
>>
>> Centre of Environmental Data Archival
>>
>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>>
>> *From:*Kettleborough, Jamie
>> [mailto:jamie.kettleborough at metoffice.gov.uk]
>> *Sent:* 16 December 2011 09:50
>> *To:* Pascoe, Stephen (STFC,RAL,RALSP); jma at cola.iges.org;
>> go-essp-tech at ucar.edu
>> *Cc:* Kettleborough, Jamie
>> *Subject:* tracking_id and check sums was... RE: [Go-essp-tech]
>> Status of Gateway 2.0 (another use case)
>>
>> Hello Stephen Bryan,
>>
>> I wouldn't rely on the tracking_id - there is too high a likelihood
>> that it is not unique. We have seen cases where different files have
>> the same tracking_id ('though in the cases we have seen, the data
>> has been the same and there are just minor updates to the metadata -
>> hmmm... maybe that statement was a red rag to a bull). I think the
>> checksum is the most reliable indicator of the uniqueness of a
>> file, though clearly it's not enough on its own, as it doesn't tell
>> you what has been changed or why, or how changes in one file are
>> related to changes in other files.
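>>
>> (For two files that turn out to share a tracking_id, a minimal
>> sketch of seeing what actually differs, comparing headers only;
>> filenames hypothetical:)
>>
>>     diff <(ncdump -h fileA.nc) <(ncdump -h fileB.nc)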
>>
>> We've also seen examples where data providers have tried to be
>> helpful - which is great - and put a version number as an attribute
>> in the netCDF file... but then that has not been updated when the
>> files have been published at a new version...
>>
>> Karl - where are we with the agreement that all data nodes should
>> provide checksums with the data? I think it's agreed in principle,
>> but I'm not sure whether and when the implications of that agreement
>> will be followed up.
>>
>> Jamie
>>
>> ------------------------------------------------------------------------
>>
>> *From:*go-essp-tech-bounces at ucar.edu
>> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of
>> *stephen.pascoe at stfc.ac.uk
>> *Sent:* 16 December 2011 09:25
>> *To:* jma at cola.iges.org; go-essp-tech at ucar.edu
>> *Subject:* Re: [Go-essp-tech] Status of Gateway 2.0 (another use
>> case)
>>
>> Hi Jennifer,
>>
>> I just wanted to add a few more technical specifics to this
>> sub-thread about versions. Bryan's point that it has all been a
>> compromise is the take-home message.
>>
>> > If the version is so important and needs to be preserved, then it
>> > should have been included in the data file name. It's obviously
>> > too late to make that change now.
>>
>> Indeed, and it was already too late when we got agreement on the
>> format for version identifiers. By that point CMOR, the tool
>> that generates the filenames, was already finalised and being run
>> at some modelling centres. Also a version has to be assigned
>> much later in the process than when CMOR is run. Bryan is right
>> that the tracking_id or md5 checksum should provide the link
>> between file and version. Unfortunately we don't have tools for
>> that yet.
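>>
>> (A minimal sketch of the kind of tool that's missing, assuming a
>> DRS-style tree with vYYYYMMDD version directories: build a
>> checksum-to-version table so the version of any file you hold can
>> be looked up:)
>>
>>     find /data/cmip5 -name '*.nc' | while read -r f; do
>>         v=$(echo "$f" | sed -n 's#.*/\(v[0-9]\{8\}\)/.*#\1#p')
>>         printf '%s %s %s\n' "$(md5sum "$f" | cut -d' ' -f1)" "$v" "$f"
>>     done > checksum_to_version.txt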
>>
>> Although the filenames don't contain versions, the wget scripts do,
>> *provided datanodes have their data in DRS directory format*.
>> ESG insiders know this has been a long-term bugbear of mine.
>> Presently IPSL, BADC and DKRZ have this, and maybe some others too,
>> but not all datanodes have implemented it. Maybe the wget
>> scripts need to include versions in a more explicit way than just
>> the DRS path, which would allow datanodes that can't implement DRS
>> to include versions. It would be good if wget scripts replicated
>> the DRS directory structure at the client. That's something I
>> wish we'd implemented by now, but since not every datanode has the
>> DRS structure it's impossible to implement federation-wide.
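>>
>> (To illustrate, with a hypothetical host and dataset: in a
>> DRS-style URL the vYYYYMMDD path component is the version, and
>> wget's -x (--force-directories) flag mirrors the server's directory
>> tree locally, preserving the version that the flat filename loses:)
>>
>>     url=http://datanode.example.org/thredds/fileServer/cmip5/output1/MOHC/HadGEM2-ES/historical/mon/atmos/Amon/r1i1p1/v20111215/tas/tas_Amon_HadGEM2-ES_historical_r1i1p1_185912-200511.nc
>>     wget -x "$url"   # recreates .../r1i1p1/v20111215/tas/ on the client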
>>
>> Thanks for the great feedback.
>>
>> Stephen.
>>
>> ---
>>
>> Stephen Pascoe +44 (0)1235 445980
>>
>> Centre of Environmental Data Archival
>>
>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
>> 0QX, UK
>>
>> *From:*go-essp-tech-bounces at ucar.edu
>> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Jennifer Adams
>> *Sent:* 15 December 2011 19:58
>> *To:* go-essp-tech at ucar.edu
>> *Subject:* Re: [Go-essp-tech] Status of Gateway 2.0 (another use
>> case)
>>
>> On Dec 15, 2011, at 2:14 PM, Bryan Lawrence wrote:
>>
>> Hi Jennifer
>>
>> With due respect, it's completely unrealistic to expect modelling
>> groups not to want to have multiple versions of some datasets ...
>> that's just not how the world (and in particular, modelling
>> workflow) works. It has never been thus. There simply isn't time
>> to look at everything before it is released ... if you have a
>> problem with that, blame the government folk who set the IPCC
>> timetables :-) (Maybe your comment was somewhat tongue in cheek,
>> but I feel obliged to make this statement anyway :-).
>>
>> Fair enough. I was being cheeky; that is why I put the :-). The
>> users suffer the IPCC time constraints too: we have to deliver
>> analyses of data that takes an impossibly long time to grab.
>>
>>
>> Also, with due respect, please don't "replace files with newer
>> versions" ... we absolutely need folks to understand the idea of
>> processing with one particular version of the data, and
>> understanding the provenance of that, so that they understand if
>> the data has changed, they may need to re-run the processing.
>>
>> If the version is so important and needs to be preserved, then it
>> should have been included in the data file name. It's obviously
>> too late to make that change now. As I mentioned before, the
>> version number is a valuable piece of metadata that is lost in
>> the wget download process. The problem of how to keep track of
>> version numbers and update my copy when necessary remains.
>>
>> I'll take this opportunity to point out that the realm and
>> frequency are also missing from the file name. I can't remember
>> where I read this, but the MIP_table value is not always adequate
>> for uniquely determining what the realm and frequency are.
>>
>>
>> I'm sure this doesn't apply to you, but for too long our
>> community has had a pretty cavalier attitude to data provenance!
>> CMIP3 and AR4 were a "dog's breakfast" in this regard …
>>
>> Looks like CMIP5 hasn't improved the situation.
>>
>>
>> (And I too am very grateful that you are laying out your
>> requirements in some detail :-)
>>
>> I'm glad to hear that.
>>
>> --Jennifer
>>
>>
>> Cheers
>> Bryan
>>
>>
>> On Dec 15, 2011, at 11:22 AM, Estanislao Gonzalez wrote:
>>
>> Hi Jennifer,
>>
>> I'll check this more carefully and see what can be done
>> with what we have (or minimal changes), though the
>> multiple-versions issue is something CMIP3 didn't worry
>> about; files just got changed or deleted. CMIP5 adds a
>> two-figure factor to that, since there are many more
>> institutions and data... but it might be possible.
>>
>> At the moment, I have no good ideas for how to solve the
>> problem of replacing files in my local CMIP5 collection with
>> newer versions if they are available. My strategy at this
>> point is to get the version that is available now and not
>> look for it again. If any data providers are listening, here
>> is my plea:
>>
>> ==> Please don't submit new versions of your CMIP5 data. Get
>> it right the first time! <==
>>
>> :-)
>>
>> In any case, I just wanted to thank you very much for the
>> detailed description; it is very useful.
>>
>> I'm glad you (and Steve Hankin) find my long emails helpful.
>>
>> --Jennifer
>>
>> Regards,
>>
>> Estani
>>
>> Am 15.12.2011 14:52, schrieb Jennifer Adams:
>>
>> Hi, Estanislao --
>>
>> Please see my comments inline.
>>
>> On Dec 15, 2011, at 5:47 AM, Estanislao Gonzalez wrote:
>>
>> Hi Jennifer,
>>
>> I'm still not sure how Luca's change in the API is
>> going to help you, Jennifer. But perhaps it would
>> help me to fully understand your requirement, as
>> well as your use of wget with the FTP
>> protocol.
>>
>> I presume what you want is to crawl the archive
>> and get files from a specific directory structure?
>>
>> Maybe it would be better if you just describe
>> briefly the procedure you've been using for
>> getting the CMIP3 data so we can see what could
>> be done for CMIP5.
>>
>> How did you find out which data was interesting?
>>
>> COLA scientists ask for a specific
>> scenario/realm/frequency/variable they need for their
>> research. Our CMIP3 collection is a shared resource
>> of about 4 TB of data. For CMIP5, we are working with
>> an estimate of 4-5 times that data volume to meet our
>> needs. It's hard to say at this point whether that
>> will be enough.
>>
>> How did you find out which files were required to
>> be downloaded?
>>
>> For CMIP3, we often referred to
>> http://www-pcmdi.llnl.gov/ipcc/data_status_tables.htm
>> to see what was available.
>>
>> The new version of this chart for CMIP5,
>> http://cmip-pcmdi.llnl.gov/cmip5/esg_tables/transpose_esg_static_table.html,
>> is also useful. An improvement I'd like to see on
>> this page: the numbers inside the blue boxes that
>> show how many runs there are for a particular
>> experiment/model should be a link to a list of those
>> runs that have all the necessary components from the
>> Data Reference Syntax so that I can go directly to
>> the URL for that data set. For example,
>>
>> the BCC-CSM1.1 model shows 45 runs for the
>> decadal1960 experiment. I would like to click on that
>> 45 and get a list of the 45 URLs for those runs, like
>> this:
>>
>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r1i1p1.html
>>
>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r2i1p1.html
>>
>> ...
>>
>> How did you tell wget to download those files?
>>
>> For example: wget -nH --retr-symlinks -r -A nc
>> ftp://username@ftp-esg.ucllnl.org/picntrl/atm/mo/tas
>> -o log.tas
>>
>> This would populate a local directory
>> ./picntrl/atm/mo/tas with all the models and ensemble
>> members in the proper subdirectories. If I wanted to
>> update with newer versions or models that had been
>> added, I simply ran the same 1-line wget command
>> again. This is what I refer to as 'elegant.'
>>
>> We might already have some way of achieving what
>> you want, if we knew exactly what that is.
>>
>> Wouldn't that be wonderful? I am hopeful that the P2P
>> will simplify the elaborate and flawed workflow I
>> have cobbled together to navigate the current system.
>>
>> I have a list of desired
>> experiment/realm/frequency/MIP_table/variables for
>> which I need to grab all available models/ensembles.
>> Is that not enough to describe my needs?
>>
>> I guess my proposal of issuing:
>>
>> bash <(wget "http://p2pnode/wget?experiment=decadal1960&realm=atmos&time_frequency=month&variable=clt" -qO - | grep -v HadCM3)
>>
>> Yes, this would likely achieve the same result as the
>> '&model=!name' that Luca implemented. However, I
>> believe the documentation says that there is a limit
>> of 1000 on the number of files that the p2pnode will
>> put into a single wget script, so I don't want to
>> populate my precious 1000 results with entries that I'm
>> going to grep out afterwards.
>>
>> --Jennifer
>>
>> was not acceptable to you. But I still don't know
>> exactly why.
>>
>> It would really help to know what you meant by
>> "elegant use of wget".
>>
>> Thanks,
>>
>> Estani
>>
>> Am 14.12.2011 18:44, schrieb Cinquini, Luca (3880):
>>
>> So Jennifer, would having the capability of
>> doing negative searches (model=!CCSM), and
>> generating the corresponding wget scripts, help
>> you?
>>
>> thanks, Luca
>>
>> On Dec 14, 2011, at 10:38 AM, Jennifer Adams
>> wrote:
>>
>> Well, after working from the client side
>> to get CMIP3 and CMIP5 data, I can say
>> that wget is a fine tool to rely on at
>> the core of the workflow. Unfortunately,
>> the step up in complexity from CMIP3 to
>> CMIP5 and the switch from FTP to HTTP
>> trashed the elegant use of wget. No
>> amount of customized wrapper software,
>> browser interfaces, or pre-packaged tools
>> like DML fixes that problem.
>>
>> At the moment, the burden on the user is
>> embarrassingly high. It's so easy to
>> suggest that the user should "filter to
>> remove what is not required" from a
>> downloaded script, but the actual practice
>> of doing that in a timely and automated
>> and distributed way is NOT simple! And if
>> the solution to my problem of filling in
>> the gaps in my incomplete collection is
>> to go back to clicking in my browser and
>> do the whole thing over again but make my
>> filters smarter by looking for what's
>> already been acquired or what has a new
>> version number … this is unacceptable.
>> The filtering must be a server-side
>> responsibility and the interface must be
>> accessible by automated scripts. Make it so!
>>
>> By the way, the version number is a piece
>> of metadata that is not in the downloaded
>> files or the gateway's search criteria.
>> It appears in the wget script as part of
>> the path in the file's HTTP location, but
>> the path is not preserved after the wget
>> is complete, so it is effectively lost
>> after the download is done. I guess
>> the file's date stamp would be the only
>> way to know if the version number of the
>> data file in question has been changed,
>> but I'm not going to write that check
>> into my filtering scripts.
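>>
>> (One possible stopgap, sketched under the assumption that the
>> script's file URLs contain a vYYYYMMDD DRS component and that the
>> script was saved as wget-script.sh: pull the versions out before
>> they are lost:)
>>
>>     grep -o 'http[^ "]*\.nc' wget-script.sh \
>>         | sed -n 's#.*/\(v[0-9]\{8\}\)/.*#\1#p' | sort -u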
>>
>> --Jennifer
>>
>>
>>
>>
>>
>>
>>
>> --
>> Bryan Lawrence
>> University of Reading: Professor of Weather and Climate Computing.
>> National Centre for Atmospheric Science: Director of Models and
>> Data.
>> STFC: Director of the Centre for Environmental Data Archival.
>> Ph: +44 118 3786507 or 1235 445012; Web: home.badc.rl.ac.uk/lawrence
>>
>> --
>>
>> Jennifer M. Adams
>>
>> IGES/COLA
>>
>> 4041 Powder Mill Road, Suite 302
>>
>> Calverton, MD 20705
>>
>> jma at cola.iges.org <mailto:jma at cola.iges.org>
>>
>>
>>
>>
>>
>>
>>
>
>
> --
> Estanislao Gonzalez
>
> Max-Planck-Institut für Meteorologie (MPI-M)
> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>
> Phone: +49 (40) 46 00 94-126
> E-Mail: gonzalez at dkrz.de