[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Kettleborough, Jamie jamie.kettleborough at metoffice.gov.uk
Fri Mar 16 11:20:02 MDT 2012


Hello Bryan,

Yep this is what Karl said:

'We are attempting to replicate at PCMDI and other major data archives a considerable fraction of the model output you have made available to CMIP5 scientists.  To assure only genuine replicas are served, we want to verify that the checksums are the same at all sites.  I am therefore writing to request with some urgency that you publish checksums.  For datasets that you have already published, you can add the checksum following the instructions found at:
 ....
'

I've just had a look at our cut-down copy of the THREDDS catalogues.  I *think* that about 93% (of about 2.8 million files) have checksums.  This covers most of the nodes I know about that hold CMIP5 or TAMIP data, with the exception of http://esg.nccs.nasa.gov and http://bcccsm.cma.gov.cn.  Can anyone verify this figure?  There are several reasons why I could have it wrong.  Presumably a similar analysis can be done on the gateways?
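
For anyone wanting to repeat this kind of count, here is a minimal sketch.  It is not the script behind the figure above.  It assumes that each file appears in the catalogue as a dataset element with a urlPath attribute, and that a checksum, when present, is a child property element named "checksum"; the namespace, the sample document and the function name are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for THREDDS InvCatalog v1.0 documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}


def checksum_coverage(catalog_xml):
    """Return (files_with_checksum, total_files) for one catalogue document."""
    root = ET.fromstring(catalog_xml)
    with_checksum = 0
    total = 0
    for ds in root.iter("{%s}dataset" % NS["t"]):
        # Assumption: only file-level datasets carry a urlPath attribute.
        if "urlPath" not in ds.attrib:
            continue
        total += 1
        props = ds.findall("t:property", NS)
        # Count the file only if a non-empty checksum property is present.
        if any(p.get("name") == "checksum" and p.get("value") for p in props):
            with_checksum += 1
    return with_checksum, total


# Hypothetical two-file catalogue: one file with a checksum, one without.
SAMPLE = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="example.dataset.v1" ID="example.dataset.v1">
    <dataset name="a.nc" urlPath="example/a.nc">
      <property name="checksum" value="d41d8cd98f00b204e9800998ecf8427e"/>
      <property name="checksum_type" value="MD5"/>
    </dataset>
    <dataset name="b.nc" urlPath="example/b.nc"/>
  </dataset>
</catalog>"""

have, total = checksum_coverage(SAMPLE)
print("%d of %d files have checksums" % (have, total))
```

Summing the two numbers over every catalogue from every node would give the overall percentage.  Note this only says a checksum is present, not that it is right.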

Something I don't have a feel for is the proportion of files that have accurate checksums - where the catalogue checksums really reflect what is on disk.  I think this can be an issue as noted by others.  Have logs been kept of this problem during replication?  (We haven't - and I'm regretting it now).
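
A rough sketch of the on-disk check is below, assuming the published checksum is a hex digest and the algorithm is one that Python's hashlib knows by name (MD5 here).  The function names and the temporary demo file are made up for illustration; where the published value comes from is left open.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large model output need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_catalogue(path, published, algorithm="md5"):
    """True if the file on disk hashes to the published (catalogue) value."""
    return file_checksum(path, algorithm) == published.strip().lower()


# Demo on a throwaway file; a real run would loop over catalogue entries.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"not really netCDF")
os.close(fd)
published_value = hashlib.md5(b"not really netCDF").hexdigest()
print(matches_catalogue(demo_path, published_value))
os.remove(demo_path)
```

Logging every mismatch (file, node, expected and actual value) during replication would give the record I wish we had kept.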

Jamie

> -----Original Message-----
> From: go-essp-tech-bounces at ucar.edu 
> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Bryan Lawrence
> Sent: 16 March 2012 13:26
> To: go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] What is the risk that science is 
> done using 'deprecated' data?
> 
> 
> Can someone look back and see what Karl said to the modelling groups.
> I think we already have a mandate to require them.
> 
> B
> > Estani,
> > 
> > I think the issue is wider than just publishing of 
> checksums.  We could do a lot more to help users verify they 
> have the right data, e.g. automatically comparing checksums 
> to the published ones after download.
> > 
> > A new thread on Tuesday's agenda is coming ...
> > 
> > S.
> > 
> > ---
> > Stephen Pascoe  +44 (0)1235 445980
> > Centre of Environmental Data Archival
> > STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot 
> OX11 0QX, 
> > UK
> > 
> > From: go-essp-tech-bounces at ucar.edu 
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Estanislao 
> > Gonzalez
> > Sent: 16 March 2012 13:15
> > To: go-essp-tech at ucar.edu
> > Subject: Re: [Go-essp-tech] What is the risk that science 
> is done using 'deprecated' data?
> > 
> > Hi Stephen,
> > 
> > IMHO there is not much to discuss at all.
> > We could spare some minutes to hear if someone has 
> arguments against publishing the checksums and/or decide what 
> to do with those sites breaking this rule, even partially.
> > 
> > That shouldn't take more than 5'...
> > In preparation for that, I'd say that people not wishing to 
> comply with this request (providing valid and current 
> checksums *as well* as publishing new data *always* under a 
> new version) should start a new thread to discuss it (this 
> one is too long and has already changed subject a couple of times).
> > 
> > My 2c,
> > Estani
> > 
> > On 16.03.2012 14:01, stephen.pascoe at stfc.ac.uk wrote:
> > Jamie,
> > 
> > There will be a telco at 16:00GMT on Tuesday.  We have 
> several candidate topics for discussion at the moment (See 
> http://esgf.org/wiki/Esgf/Cmip5Meetings) but checksums is not 
> one of them.  Let me coordinate a realistic agenda and I'll 
> try and ensure there is some time to discuss this.
> > 
> > Cheers,
> > Stephen.
> > 
> > ---
> > Stephen Pascoe  +44 (0)1235 445980
> > Centre of Environmental Data Archival
> > STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot 
> OX11 0QX, 
> > UK
> > 
> > From: 
> > go-essp-tech-bounces at ucar.edu<mailto:go-essp-tech-bounces at ucar.edu> 
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Kettleborough, 
> > Jamie
> > Sent: 16 March 2012 10:27
> > To: Gavin M. Bell; Barron Jr, Tom O.
> > Cc: go-essp-tech at ucar.edu<mailto:go-essp-tech at ucar.edu>
> > Subject: Re: [Go-essp-tech] What is the risk that science 
> is done using 'deprecated' data?
> > 
> > Hello,
> > 
> > when is the next telco, and is this issue on the agenda?
> > 
> > Thanks,
> > 
> > Jamie
> > 
> > ________________________________
> > From: 
> > go-essp-tech-bounces at ucar.edu<mailto:go-essp-tech-bounces at ucar.edu> 
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Gavin M. Bell
> > Sent: 12 March 2012 22:21
> > To: Barron Jr, Tom O.
> > Cc: go-essp-tech at ucar.edu<mailto:go-essp-tech at ucar.edu>
> > Subject: Re: [Go-essp-tech] What is the risk that science 
> is done using 'deprecated' data?
> > Hi Tom,
> > 
> > I don't envy your (ORNL's et al.) position, but this is 
> what must be done.  This is why Balaji was so adamant about 
> making checksums *required* from the very beginning of this 
> endeavor.  He was right.  Though, to be honest it was always 
> something that was known... this is not a surprise to anyone. 
>  I think that having it be "optional" in the publisher was 
> the sticking point.  Computing it in the publisher is, or IMHO 
> should be, the checksum of last resort.  Folks should have 
> schemes to calculate these things out of band and integrate 
> them back into the publisher... a feature that made its way 
> into the publisher albeit a bit after the bell.  It is no 
> one's fault, just a comedy of errors, but... now we are all 
> enlightened and know *why* we need checksums (hashes) in a 
> distributed system that requires integrity assertions made 
> about said data.
> > 
> > Oh well :-(...
> > 
> > At least we are relatively early in the game... if that is any 
> > consolation :-\
> > 
> > checksums or bust.
> > 
> > P.S.
> > Regarding the catalogs, the topic Stephen has been 
> shepherding, there are cool things we can do with having the 
> constituent files' checksums.  Mmmwwaaahhh aahhh aahhhh.... 
> (evil laugh).
> > 
> > 
> > On 3/12/12 9:46 AM, Barron Jr, Tom O. wrote:
> > 
> > Thanks for the reply, Gavin. I understand what you say.
> > 
> > 
> > 
> > I just wanted to highlight that a significant amount of 
> data has been published without checksums at ORNL on the ESG2 
> gateway. Extracting it all from the HPSS archive for 
> checksumming in preparation for republishing on the ESGF 
> portal will take significant time. I'm not saying we 
> shouldn't do it. Just that we shouldn't expect to get it done quickly.
> > 
> > 
> > 
> > Tom
> > 
> > 
> > 
> > On 2012.0309, at 17:24, Gavin M. Bell wrote:
> > 
> > 
> > 
> > Hi Tom,
> > 
> > 
> > 
> > In the simplest form of the assertions we have made about 
> checksums... If you can't get the checksums then it shouldn't 
> / can't be published, period.  So access must be gotten and 
> checksums computed.  Otherwise you simply can't *trust* the 
> data is "who it says it is".
> > 
> > 
> > 
> > On 3/9/12 11:10 AM, Barron Jr, Tom O. wrote:
> > 
> > How will a requirement for checksums affect the ability to 
> publish offline datasets that are not immediately accessible 
> for computing a checksum?
> > 
> > 
> > 
> > On 2012.0309, at 03:47, Gavin M. Bell wrote:
> > 
> > 
> > 
> > 
> > 
> > With checksums, we can put in client-side sanity checking 
> tools to give users peace of mind.  The other side benefit 
> would be alerting offending sites that something is wrong.  I 
> agree with you, Bryan, checksums are a must.  We can enforce 
> it mechanically in the publisher.  This is worth bringing up 
> at the next call - without spending too much time on it.
> > 
> > 
> > 
> > On 3/9/12 12:20 AM, Bryan Lawrence wrote:
> > 
> > 
> > 
> > Karl has written to modelling centres requiring them to do 
> this, and I think we should start enforcing it.
> > 
> > Bryan
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Hello,
> > 
> > 
> > 
> > If we enforced checksums to be done as a part of publication, then 
> > this would address this issue, right?
> > 
> > 
> > 
> > 
> > 
> > On 3/8/12 8:39 AM, stephen.pascoe at stfc.ac.uk wrote:
> > 
> > 
> > 
> > 
> > 
> > Tobias, sorry I mistyped your name :-)
> > 
> > S.
> > 
> > 
> > 
> > On 8 Mar 2012, at 16:00, stephen.pascoe at stfc.ac.uk wrote:
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Hi Thomas,
> > 
> > 
> > 
> > As you say, it's too late to do much re-engineering of the 
> system now -- we've attempted to put in place various 
> identifier systems and none of them are working particularly 
> well -- however I think there is another perspective to your proposal:
> > 
> > 
> > 
> > 1. ESG/CMIP5 is deployed globally across multiple 
> administrative domains and each domain has the ability to cut 
> corners to get things done, e.g. replacing files silently 
> without changing identifiers.
> > 
> > 
> > 
> > 2. The ESG/CMIP5 system is so complex that no one would blame a 
> sys-admin for doing #1 to get the data to scientists when 
> they need it.  Any system that makes it impossible, or even 
> only difficult, to change the underlying data is going to be 
> more complex and difficult to administer than a system that 
> doesn't, unless that system was very rigorously designed, 
> implemented and tested.
> > 
> > 
> > 
> > Because of #1 I'm convinced that a fit-for-purpose 
> identifier system wouldn't use randomly generated UUIDs but 
> would take the GIT approach of hashing invariants of the 
> dataset so that any changes behind the scenes can be detected.
> > 
> > 
> > 
> > Because of #2 I'm convinced that now is not the time to 
> start building more software to do this.  We have to 
> stabilise the system and learn the lessons of CMIP5 first.
> > 
> > 
> > 
> > Cheers,
> > 
> > Stephen.
> > 
> > 
> > 
> > 
> > 
> > On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Jamie/All,
> > 
> > 
> > 
> > these are important questions I have been wondering about 
> as well; we 
> > just had a small internal meeting yesterday with Estani and 
> Martina, 
> > so I'll try to sum some points up here. I am not too 
> familiar with the 
> > ESG publishing process, so I can only guess that Stephen's #1 has 
> > something to do with the bending of policies that are for pragmatic 
> > reasons not enforced in the CMIP5 process. (My intuition is that 
> > *ideally* it should be impossible to make data available 
> without going 
> > through the whole publication process. Please correct me if I am 
> > misunderstanding this.)
> > 
> > 
> > 
> > Most of what I have been thinking about however concerns 
> point #2. I'd claim that the risk here should not be 
> underestimated; data consumers being unable to find the data 
> they need is bad ("the advanced search issue"), but users 
> relying on deprecated data - most likely without being aware 
> of it - is certainly dangerous for scientific credibility.
> > 
> > My suggestion to address this problem is to use globally 
> persistent identifiers (PIDs) that are automatically assigned 
> to data objects (and metadata etc.) on ESG-publication; data 
> should ideally not be known by its file name or 
> system-internal ID, but via a global identifier that never 
> changes after it has been published. Of course, this sounds 
> like the DOIs, but these are extremely coarse grained and 
> very static. The idea is to attach identifiers to the 
> low-level entities and provide solutions to build up a 
> hierarchical ID system (virtual collections) to account for 
> the various layers used in our data. Such persistent 
> identifiers should then be placed prominently in any user 
> interface dealing with managed data. The important thing is: 
> If data is updated, we don't update the data behind 
> identifier x, but assign a new identifier y and create a 
> typed link between these two (which may be the most 
> challenging part) and perhaps put a small annotation on x 
> that this data is deprecated. A clever user interface should 
> then redirect a user consistently to the latest version of a 
> dataset if a user accesses the old identifier.
> > 
> > This does not make it impossible to use deprecated data, 
> but at least it raises the consumer's awareness of the issue 
> and lowers the barrier to re-retrieve valid data.
> > 
> > 
> > 
> > As for the point in time; I'd be certain that it is too 
> late now, but 
> > it is always a good idea to have plans for future improvement.. :)
> > 
> > 
> > 
> > Best, Tobias
> > 
> > 
> > 
> > On 08.03.2012 13:06, Kettleborough, Jamie wrote:
> > 
> > 
> > 
> > 
> > 
> > Thanks for the replies on this - any other replies are 
> still very welcome.
> > 
> > 
> > 
> > Stephen - being selfish - we aren't too worried about 2 as 
> it's less of an issue for us (we do a daily trawl of THREDDS 
> catalogues for new datasets), but I agree it is a problem 
> more generally.  I don't have a feel for which of the 
> problems 1-3 would minimise the risk most if you solved it.  
> I think making sure new data has a new version is a foundation though.
> > 
> > 
> > 
> > Part of me wonders though whether it's already too late to 
> really do anything with versioning in its current form.  
> *But* I may be overestimating the size of the problem of new 
> datasets appearing without versions being updated.
> > 
> > 
> > 
> > Jamie
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > -----Original Message-----
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
> > Sent: 08 March 2012 10:41
> > To: go-essp-tech at ucar.edu
> > Subject: Re: [Go-essp-tech] What is the risk that science is
> > done using 'deprecated' data?
> > 
> > 
> > 
> > Hi Stephen, let me add a third point:
> > 
> > 3. Users are aware of new versions but can't download files
> > so as to have a coherent set of files.
> > 
> > With respect to that point the p2p transition (especially the
> > attribute caching on the node) will be a major step forward.
> > GFDL just upgraded and we have an amazing success rate of 98%.
> > 
> > And I agree with Ashish.
> > 
> > Regards.
> > Sébastien
> > 
> > 
> > 
> > On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
> > 
> > 
> > 
> > 
> > 
> > Hi Jamie,
> > 
> > I can imagine there is a risk of papers being written on
> > deprecated data in two scenarios:
> > 
> >  1. Data is being updated at datanodes without creating a
> >     new version
> >  2. Users are unaware of new versions available and
> >     therefore using deprecated data
> > 
> > Are you concerned about both of these scenarios?  Your
> > email seems to mainly address #1.
> > 
> > Thanks,
> > Stephen.
> > 
> > 
> > 
> > On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Hello,
> > 
> > Does anyone have a feel for the current level of risk that analysts
> > are doing work (with the intention to publish) on data that has been
> > found to be wrong by the data providers and so deprecated (in some
> > sense)?
> > 
> > My feeling is that versioning isn't working (that may be putting it a
> > bit strongly).  It is too easy for data providers - in their
> > understandable drive to get their data out - to have updated files on
> > disk without publishing a new version.   How big a deal does anyone
> > think this is?
> > 
> > If the risk that papers are being written based on deprecated data is
> > sufficiently large then is there an agreed strategy for coping with
> > this?  Does it have implications for the requirements of the data
> > publishing/delivery system?
> > 
> > Thanks,
> > 
> > Jamie
> > 
> > _______________________________________________
> > 
> > GO-ESSP-TECH mailing list
> > 
> > 
> > 
> > 
> > 
> > GO-ESSP-TECH at ucar.edu<mailto:GO-ESSP-TECH at ucar.edu>
> > 
> > http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> > 
> > --
> > 
> > Sébastien Denvil
> > 
> > IPSL, Pôle de modélisation du climat
> > 
> > UPMC, Case 101, 4 place Jussieu,
> > 
> > 75252 Paris Cedex 5
> > 
> > 
> > 
> > Tour 45-55 2ème étage Bureau 209
> > 
> > Tel: 33 1 44 27 21 10
> > 
> > Fax: 33 1 44 27 39 02
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Department of Data Management
> > 
> > Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
> > 
> > Bundesstr. 45a
> > 
> > 20146 Hamburg
> > 
> > Germany
> > 
> > 
> > 
> > Tel.: +49 40 460094 104
> > 
> > E-Mail:
> > 
> > 
> > 
> > weigel at dkrz.de<mailto:weigel at dkrz.de>
> > 
> > 
> > 
> > 
> > 
> > Website:
> > 
> > 
> > 
> > www.dkrz.de<http://www.dkrz.de>
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Managing Director: Prof. Dr. Thomas Ludwig
> > 
> > 
> > 
> > Sitz der Gesellschaft: Hamburg
> > 
> > Amtsgericht Hamburg HRB 39784
> > 
> > 
> > 
> > 
> > 
> > 
> > --
> > 
> > Bryan Lawrence
> > 
> > University of Reading:  Professor of Weather and Climate Computing.
> > 
> > National Centre for Atmospheric Science: Director of Models 
> and Data.
> > 
> > STFC: Director of the Centre for Environmental Data Archival.
> > 
> > Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
> > 
> > 
> > 
> > 
> > 
> > --
> > 
> > Gavin M. Bell
> > 
> > --
> > 
> > 
> > 
> >  "Never mistake a clear view for a short distance."
> > 
> >                      -Paul Saffo
> > 
> > 
> > 
> > 
> > 
> > 
> > --
> > 
> > Gavin M. Bell
> > 
> > --
> > 
> > 
> > 
> >  "Never mistake a clear view for a short distance."
> > 
> >                      -Paul Saffo
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > --
> > 
> > Gavin M. Bell
> > 
> > Lawrence Livermore National Labs
> > 
> > --
> > 
> > 
> > 
> >  "Never mistake a clear view for a short distance."
> > 
> >                      -Paul Saffo
> > 
> > 
> > 
> > (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
> > 
> > 
> > 
> >  A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > --
> > 
> > Estanislao Gonzalez
> > 
> > 
> > 
> > Max-Planck-Institut für Meteorologie (MPI-M)
> > 
> > Deutsches Klimarechenzentrum (DKRZ) - German Climate 
> Computing Centre
> > 
> > Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> > 
> > 
> > 
> > Phone:   +49 (40) 46 00 94-126
> > 
> > E-Mail:  gonzalez at dkrz.de<mailto:gonzalez at dkrz.de>
> > 
> > 
> 
> --
> Bryan Lawrence
> University of Reading:  Professor of Weather and Climate Computing.
> National Centre for Atmospheric Science: Director of Models and Data. 
> STFC: Director of the Centre for Environmental Data Archival.
> Ph: +44 118 3786507 or 1235 445012; 
> Web:home.badc.rl.ac.uk/lawrence 
> 

