[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Fri Mar 9 01:54:39 MST 2012

There are others setting up digital identification services, so we should not build our own. E.g. EPIC ... could we use them?
Cheers
Bryan

> Martin
> 
> This problem space seems suitable for EXARCH, i.e. setting up a digital identification service.  This service would be very long-term infrastructure and thus would need to scale to several billion identifiers plus associated metadata references.
> 
> Mark
> 
> 
> On 8 Mar 2012, at 17:18, <martin.juckes at stfc.ac.uk> wrote:
> 
> > I agree, particularly on the last point.
> > 
> > There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree a set of requirements before starting development would be a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.
> > 
> > Cheers,
> > Martin
> > 
> >>> -----Original Message-----
> >>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
> >>> bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
> >>> Sent: 08 March 2012 16:01
> >>> To: weigel at dkrz.de
> >>> Cc: go-essp-tech at ucar.edu
> >>> Subject: Re: [Go-essp-tech] What is the risk that science is done
> >>> using 'deprecated' data?
> >>> 
> >>> Hi Thomas,
> >>> 
> >>> As you say, it's too late to do much re-engineering of the system now
> >>> -- we've attempted to put in place various identifier systems and none
> >>> of them are working particularly well -- however I think there is
> >>> another perspective to your proposal:
> >>> 
> >>> 1. ESG/CMIP5 is deployed globally across multiple administrative
> >>> domains and each domain has the ability to cut corners to get things
> >>> done, e.g. replacing files silently without changing identifiers.
> >>> 
> >>> 2. The ESG/CMIP5 system is so complex that who'd blame a sys-admin for
> >>> doing #1 to get the data to scientists when they need it?  Any system
> >>> that makes it impossible, or even only difficult, to change the
> >>> underlying data is going to be more complex and difficult to
> >>> administer than a system that doesn't, unless that system was very
> >>> rigorously designed, implemented and tested.
> >>> 
> >>> Because of #1 I'm convinced that a fit-for-purpose identifier system
> >>> wouldn't use randomly generated UUIDs but would take the Git approach
> >>> of hashing invariants of the dataset so that any changes behind the
> >>> scenes can be detected.
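The Git-style idea above can be illustrated with a minimal sketch. It assumes the "invariants" of a dataset are just its file names and file contents; the function names, the SHA-256 choice and the example file names are hypothetical and not part of any ESG/CMIP5 tool.

```python
import hashlib
from typing import Mapping


def file_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of one file's contents."""
    return hashlib.sha256(data).hexdigest()


def dataset_id(files: Mapping[str, bytes]) -> str:
    """Derive an identifier from file names and contents (assumed invariants).

    Files are processed in sorted name order, so the identifier does not
    depend on the order in which the files are listed.
    """
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator between name and digest
        h.update(file_digest(files[name]).encode("ascii"))
        h.update(b"\n")  # separator between entries
    return h.hexdigest()


# Hypothetical dataset: replacing a file silently changes the identifier.
published = {"tas_day.nc": b"original bytes", "pr_day.nc": b"more bytes"}
replaced = dict(published, **{"tas_day.nc": b"corrected bytes"})
print(dataset_id(published) == dataset_id(replaced))
```

Because the identifier is computed from the content, anyone holding the published identifier can recompute it and notice a file that was swapped behind the scenes, which a randomly generated UUID cannot reveal.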
> >>> 
> >>> Because of #2 I'm convinced that now is not the time to start building
> >>> more software to do this.  We have to stabilise the system and learn
> >>> the lessons of CMIP5 first.
> >>> 
> >>> Cheers,
> >>> Stephen.
> >>> 
> >>> 
> >>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
> >>> 
> >>>> Jamie/All,
> >>>> 
> >>>> these are important questions I have been wondering about as well;
> >>> we just had a small internal meeting yesterday with Estani and
> >>> Martina, so I'll try to sum some points up here. I am not too familiar
> >>> with the ESG publishing process, so I can only guess that Stephen's #1
> >>> has something to do with the bending of policies that are for
> >>> pragmatic reasons not enforced in the CMIP5 process. (My intuition is
> >>> that *ideally* it should be impossible to make data available without
> >>> going through the whole publication process. Please correct me if I am
> >>> misunderstanding this.)
> >>>> 
> >>>> Most of what I have been thinking about however concerns point #2.
> >>> I'd claim that the risk here should not be underestimated; data
> >>> consumers being unable to find the data they need is bad ("the
> >>> advanced search issue"), but users relying on deprecated data - most
> >>> likely without being aware of it - is certainly dangerous for
> >>> scientific credibility.
> >>>> My suggestion to address this problem is to use globally persistent
> >>> identifiers (PIDs) that are automatically assigned to data objects
> >>> (and metadata etc.) on ESG-publication; data should ideally not be
> >>> known by its file name or system-internal ID, but via a global
> >>> identifier that never changes after it has been published. Of course,
> >>> this sounds like the DOIs, but these are extremely coarse grained and
> >>> very static. The idea is to attach identifiers to the low-level
> >>> entities and provide solutions to build up a hierarchical ID system
> >>> (virtual collections) to account for the various layers used in our
> >>> data. Such persistent identifiers should then be placed prominently in
> >>> any user interface dealing with managed data. The important thing is:
> >>> If data is updated, we don't update the data behind identifier x, but
> >>> assign a new identifier y and create a typed link between these two
> >>> (which may be the most challenging part) and perhaps put a small
> >>> annotation on x that this data is deprecated. A clever user interface
> >>> should then redirect a user consistently to the latest version of a
> >>> dataset if a user accesses the old identifier.
> >>>> This does not make it impossible to use deprecated data, but at
> >>> least it raises the consumer's awareness of the issue and lowers the
> >>> barrier to re-retrieve valid data.
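The update rule described above (never change the data behind identifier x; assign a new identifier y, link the two, annotate x as deprecated, redirect to the latest version) could look roughly like the following in-memory sketch. The class names, the `superseded_by` link name and the `pid:N` identifier format are assumptions made for illustration, not an existing PID service API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    pid: str
    location: str                        # where the data object lives
    deprecated: bool = False             # annotation placed on old identifiers
    superseded_by: Optional[str] = None  # typed link to the newer identifier


class PidRegistry:
    """Toy registry: identifiers are never reused or re-pointed."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}
        self._counter = 0

    def publish(self, location: str, replaces: Optional[str] = None) -> str:
        old = None
        if replaces is not None:
            old = self._records[replaces]  # KeyError if the PID is unknown
            if old.superseded_by is not None:
                raise ValueError(f"{replaces} has already been superseded")
        self._counter += 1
        pid = f"pid:{self._counter}"       # hypothetical identifier format
        self._records[pid] = Record(pid, location)
        if old is not None:
            old.deprecated = True
            old.superseded_by = pid
        return pid

    def get(self, pid: str) -> Record:
        return self._records[pid]

    def resolve_latest(self, pid: str) -> Record:
        """Follow superseded_by links until the newest record is reached."""
        record = self._records[pid]
        while record.superseded_by is not None:
            record = self._records[record.superseded_by]
        return record


registry = PidRegistry()
x = registry.publish("tas_v1.nc")              # hypothetical file names
y = registry.publish("tas_v2.nc", replaces=x)
print(registry.get(x).deprecated, registry.resolve_latest(x).pid == y)
```

The old identifier still resolves to the old data, so earlier work stays reproducible, while a user interface built on `resolve_latest` can point users at the current version.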
> >>>> 
> >>>> As for the point in time; I'd be certain that it is too late now,
> >>> but it is always a good idea to have plans for future improvement.. :)
> >>>> 
> >>>> Best, Tobias
> >>>> 
> >>>> Am 08.03.2012 13:06, schrieb Kettleborough, Jamie:
> >>>>> Thanks for the replies on this - any other replies are still very
> >>> welcome.
> >>>>> 
> >>>>> Stephen - being selfish - we aren't too worried about 2 as it's less
> >>> of an issue for us (we do a daily trawl of thredds catalogues for new
> >>> datasets), but I agree it is a problem more generally.  I don't have a
> >>> feel for which of the problems 1-3 would minimise the risk most if you
> >>> solved it.  I think making sure new data has a new version is a
> >>> foundation though.
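One way to check that "new data has a new version" is to compare checksums between two successive catalogue trawls. The sketch below assumes each trawl has already been reduced to a mapping from (dataset, version) to per-file checksums; that snapshot structure and the function name are hypothetical, and no real THREDDS client API is used.

```python
from typing import Dict, List, Tuple

# (dataset id, version) -> {file name: checksum}; an assumed snapshot format
Snapshot = Dict[Tuple[str, str], Dict[str, str]]


def silent_changes(previous: Snapshot, current: Snapshot) -> List[Tuple[str, str, str]]:
    """List files whose checksum changed while dataset and version did not."""
    changed = []
    for key, old_files in previous.items():
        new_files = current.get(key)
        if new_files is None:
            continue  # this dataset version is gone; not treated here
        for name, old_sum in old_files.items():
            new_sum = new_files.get(name)
            if new_sum is not None and new_sum != old_sum:
                changed.append((key[0], key[1], name))
    return sorted(changed)
```

Any entry returned by `silent_changes` is a file that was replaced on disk without a new version being published, which is scenario 1 in the list further down the thread.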
> >>>>> 
> >>>>> Part of me wonders though whether it's already too late to really do
> >>> anything with versioning in its current form.  *But* I may be
> >>> overestimating the size of the problem of new datasets appearing
> >>> without versions being updated.
> >>>>> 
> >>>>> Jamie
> >>>>> 
> >>>>> 
> >>>>>> -----Original Message-----
> >>>>>> From: go-essp-tech-bounces at ucar.edu
> >>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien
> >>> Denvil
> >>>>>> Sent: 08 March 2012 10:41
> >>>>>> To: go-essp-tech at ucar.edu
> >>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
> >>>>>> done using 'deprecated' data?
> >>>>>> 
> >>>>>> Hi Stephen, let me add a third point:
> >>>>>> 
> >>>>>> 3. Users are aware of new versions but can't download files
> >>>>>> so as to have a coherent set of files.
> >>>>>> 
> >>>>>> With respect to that point the p2p transition (especially the
> >>>>>> attribute caching on the node) will be a major step forward.
> >>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
> >>>>>> 
> >>>>>> And I agree with Ashish.
> >>>>>> 
> >>>>>> Regards.
> >>>>>> Sébastien
> >>>>>> 
> >>>>>> Le 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk a écrit :
> >>>>>>> Hi Jamie,
> >>>>>>> 
> >>>>>>> I can imagine there is a risk of papers being written on
> >>>>>> deprecated data in two scenarios:
> >>>>>>>  1. Data is being updated at datanodes without creating a
> >>>>>> new version
> >>>>>>>  2. Users are unaware of new versions available and
> >>>>>> therefore using
> >>>>>>> deprecated data
> >>>>>>> 
> >>>>>>> Are you concerned about both of these scenarios?  Your
> >>>>>> email seems to mainly address #1.
> >>>>>>> Thanks,
> >>>>>>> Stephen.
> >>>>>>> 
> >>>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
> >>>>>>> 
> >>>>>>>> Hello,
> >>>>>>>> 
> >>>>>>>> Does anyone have a feel for the current level of risk that
> >>>>>> analysts
> >>>>>>>> are doing work (with the intention to publish) on data
> >>>>>> that has been
> >>>>>>>> found to be wrong by the data providers and so deprecated (in
> >>> some
> >>>>>>>> sense)?
> >>>>>>>> 
> >>>>>>>> My feeling is that versioning isn't working (that may be
> >>>>>> putting it a
> >>>>>>>> bit strongly).  It is too easy for data providers - in their
> >>>>>>>> understandable drive to get their data out - to have
> >>>>>> updated files on
> >>>>>>>> disk without publishing a new version.   How big a deal does
> >>> anyone
> >>>>>>>> think this is?
> >>>>>>>> 
> >>>>>>>> If the risk that papers are being written based on
> >>>>>> deprecated data is
> >>>>>>>> sufficiently large then is there an agreed strategy for
> >>>>>> coping with
> >>>>>>>> this?  Does it have implications for the requirements of the
> >>> data
> >>>>>>>> publishing/delivery system?
> >>>>>>>> 
> >>>>>>>> Thanks,
> >>>>>>>> 
> >>>>>>>> Jamie
> >>>>>>>> _______________________________________________
> >>>>>>>> GO-ESSP-TECH mailing list
> >>>>>>>> GO-ESSP-TECH at ucar.edu
> >>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>>> 
> >>>>>> --
> >>>>>> Sébastien Denvil
> >>>>>> IPSL, Pôle de modélisation du climat
> >>>>>> UPMC, Case 101, 4 place Jussieu,
> >>>>>> 75252 Paris Cedex 5
> >>>>>> 
> >>>>>> Tour 45-55 2ème étage Bureau 209
> >>>>>> Tel: 33 1 44 27 21 10
> >>>>>> Fax: 33 1 44 27 39 02
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>> 
> >>>> 
> >>>> 
> >>>> --
> >>>> Tobias Weigel
> >>>> 
> >>>> Department of Data Management
> >>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
> >>>> Bundesstr. 45a
> >>>> 20146 Hamburg
> >>>> Germany
> >>>> 
> >>>> Tel.: +49 40 460094 104
> >>>> E-Mail: weigel at dkrz.de
> >>>> Website: www.dkrz.de
> >>>> 
> >>>> Managing Director: Prof. Dr. Thomas Ludwig
> >>>> 
> >>>> Sitz der Gesellschaft: Hamburg
> >>>> Amtsgericht Hamburg HRB 39784
> >>>> 
> >>>> 
> >>> 
> 
> ---------------------------------------------------
> Mark Morgan
> Software Architect / Engineer
> Institut Pierre Simon Laplace (IPSL),
> Université Pierre Marie Curie,
> 4 Place Jussieu,
> Tour 45-55, Salle #207,
> Paris 75005
> France.
> Tel : +33 (0) 1 44 27 49 10
> Email: momipsl at ipsl.jussieu.fr
> ---------------------------------------------------
> 
> 
> 
> 

--
Bryan Lawrence
University of Reading:  Professor of Weather and Climate Computing.
National Centre for Atmospheric Science: Director of Models and Data. 
STFC: Director of the Centre for Environmental Data Archival.
Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence