[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

martin.juckes at stfc.ac.uk
Thu Mar 8 09:18:15 MST 2012


I agree, particularly on the last point.

There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree a set of requirements before starting development would be a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.

Cheers,
Martin

> >-----Original Message-----
> >From: go-essp-tech-bounces at ucar.edu
> >[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
> >Sent: 08 March 2012 16:01
> >To: weigel at dkrz.de
> >Cc: go-essp-tech at ucar.edu
> >Subject: Re: [Go-essp-tech] What is the risk that science is done
> >using 'deprecated' data?
> >
> >Hi Tobias,
> >
> >As you say, it's too late to do much re-engineering of the system now
> >-- we've attempted to put in place various identifier systems and none
> >of them are working particularly well -- however I think there is
> >another perspective to your proposal:
> >
> > 1. ESG/CMIP5 is deployed globally across multiple administrative
> >domains and each domain has the ability to cut corners to get things
> >done, e.g. replacing files silently without changing identifiers.
> >
> > 2. The ESG/CMIP5 system is so complex that who would blame a sys-admin
> >for doing #1 to get the data to scientists when they need it?  Any system
> >that makes it impossible, or even only difficult, to change the
> >underlying data is going to be more complex and difficult to
> >administer than a system that doesn't, unless that system was very
> >rigorously designed, implemented and tested.
> >
> >Because of #1 I'm convinced that a fit-for-purpose identifier system
> >wouldn't use randomly generated UUIDs but would take the Git approach
> >of hashing invariants of the dataset so that any changes behind the
> >scenes can be detected.
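
A minimal sketch of the hashing idea above, not anything ESG actually implements. It assumes the "invariants" of a dataset are its file names and file contents; the function name `dataset_id`, the example file names and the choice of SHA-256 are all illustrative assumptions. Git itself uses a different object format.

```python
from __future__ import annotations

import hashlib


def dataset_id(files: dict[str, bytes]) -> str:
    """Derive an identifier from dataset content (hypothetical scheme).

    `files` maps a file name to that file's bytes. Each file is hashed
    on its own, then the sorted (name, digest) pairs are hashed again,
    so the result does not depend on the order files are listed in.
    """
    outer = hashlib.sha256()
    for name in sorted(files):
        file_digest = hashlib.sha256(files[name]).hexdigest()
        outer.update(f"{name}\0{file_digest}\n".encode("utf-8"))
    return outer.hexdigest()


# Hypothetical dataset with two files.
original = {
    "tas_Amon_r1i1p1.nc": b"payload v1",
    "pr_Amon_r1i1p1.nc": b"payload",
}

# The same dataset after one file was replaced without a new version.
replaced = {**original, "tas_Amon_r1i1p1.nc": b"payload v2"}

id_original = dataset_id(original)
id_replaced = dataset_id(replaced)
print(id_original != id_replaced)
```

Because the identifier is computed from the data, a file replaced behind the scenes yields a different identifier, and anyone holding the old one can detect the mismatch. A randomly generated UUID cannot provide that check.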
> >
> >Because of #2 I'm convinced that now is not the time to start building
> >more software to do this.  We have to stabilise the system and learn
> >the lessons of CMIP5 first.
> >
> >Cheers,
> >Stephen.
> >
> >
> >On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
> >
> >> Jamie/All,
> >>
> >> these are important questions I have been wondering about as well;
> >> we just had a small internal meeting yesterday with Estani and
> >> Martina, so I'll try to sum some points up here. I am not too familiar
> >> with the ESG publishing process, so I can only guess that Stephen's #1
> >> has something to do with the bending of policies that are for
> >> pragmatic reasons not enforced in the CMIP5 process. (My intuition is
> >> that *ideally* it should be impossible to make data available without
> >> going through the whole publication process. Please correct me if I am
> >> misunderstanding this.)
> >>
> >> Most of what I have been thinking about however concerns point #2.
> >> I'd claim that the risk here should not be underestimated; data
> >> consumers being unable to find the data they need is bad ("the
> >> advanced search issue"), but users relying on deprecated data - most
> >> likely without being aware of it - is certainly dangerous for
> >> scientific credibility.
> >> My suggestion to address this problem is to use globally persistent
> >> identifiers (PIDs) that are automatically assigned to data objects
> >> (and metadata etc.) on ESG-publication; data should ideally not be
> >> known by its file name or system-internal ID, but via a global
> >> identifier that never changes after it has been published. Of course,
> >> this sounds like DOIs, but these are extremely coarse-grained and
> >> very static. The idea is to attach identifiers to the low-level
> >> entities and provide solutions to build up a hierarchical ID system
> >> (virtual collections) to account for the various layers used in our
> >> data. Such persistent identifiers should then be placed prominently in
> >> any user interface dealing with managed data. The important thing is:
> >> If data is updated, we don't update the data behind identifier x, but
> >> assign a new identifier y and create a typed link between these two
> >> (which may be the most challenging part) and perhaps put a small
> >> annotation on x that this data is deprecated. A clever user interface
> >> should then redirect a user consistently to the latest version of a
> >> dataset if a user accesses the old identifier.
> >> This does not make it impossible to use deprecated data, but at
> >> least it raises the consumer's awareness of the issue and lowers the
> >> barrier to re-retrieve valid data.
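
The update rule described above (never change the data behind x, issue y, link the two, flag x as deprecated) can be sketched roughly as follows. This is an illustrative in-memory model only: `PidRegistry`, the link type names `superseded_by` and `supersedes`, the `pid:NNNN` format and the example locations are all assumptions, not part of ESG or of any existing PID service.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    location: str                  # where the data object lives
    deprecated: bool = False       # annotation placed on superseded data
    links: dict[str, str] = field(default_factory=dict)  # link type -> PID


class PidRegistry:
    """Toy registry: identifiers never change, updates create new ones."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}
        self._counter = 0

    def publish(self, location: str) -> str:
        self._counter += 1
        pid = f"pid:{self._counter:04d}"   # hypothetical identifier format
        self._records[pid] = Record(location)
        return pid

    def get(self, pid: str) -> Record:
        return self._records[pid]

    def supersede(self, old_pid: str, new_location: str) -> str:
        old = self._records[old_pid]
        if "superseded_by" in old.links:
            raise ValueError(f"{old_pid} has already been superseded")
        new_pid = self.publish(new_location)
        # Typed links in both directions, plus the deprecation note on x.
        old.links["superseded_by"] = new_pid
        old.deprecated = True
        self._records[new_pid].links["supersedes"] = old_pid
        return new_pid

    def resolve_latest(self, pid: str) -> str:
        # Follow "superseded_by" links until the newest version is reached.
        while "superseded_by" in self._records[pid].links:
            pid = self._records[pid].links["superseded_by"]
        return pid


registry = PidRegistry()
x = registry.publish("node-a/tas/v1")       # hypothetical locations
y = registry.supersede(x, "node-a/tas/v2")
latest = registry.resolve_latest(x)
print(x, "->", latest, "deprecated:", registry.get(x).deprecated)
```

A user interface built on such a registry could still serve the data behind x, but would show the deprecation note and offer `resolve_latest(x)` as the default download.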
> >>
> >> As for the timing: I am fairly certain that it is too late now,
> >> but it is always a good idea to have plans for future improvement. :)
> >>
> >> Best, Tobias
> >>
> >> On 08.03.2012 13:06, Kettleborough, Jamie wrote:
> >>> Thanks for the replies on this - any other replies are still very
> >>> welcome.
> >>>
> >>> Stephen - being selfish - we aren't too worried about 2 as it's less
> >>> of an issue for us (we do a daily trawl of THREDDS catalogues for new
> >>> datasets), but I agree it is a problem more generally.  I don't have a
> >>> feel for which of the problems 1-3 would minimise the risk most if you
> >>> solved it.  I think making sure new data has a new version is a
> >>> foundation though.
> >>>
> >>> Part of me wonders though whether it's already too late to really do
> >>> anything with versioning in its current form.  *But* I may be
> >>> overestimating the size of the problem of new datasets appearing
> >>> without versions being updated.
> >>>
> >>> Jamie
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: go-essp-tech-bounces at ucar.edu
> >>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
> >>>> Sent: 08 March 2012 10:41
> >>>> To: go-essp-tech at ucar.edu
> >>>> Subject: Re: [Go-essp-tech] What is the risk that science is
> >>>> done using 'deprecated' data?
> >>>>
> >>>> Hi Stephen, let me add a third point:
> >>>>
> >>>> 3. Users are aware of new versions but can't download files
> >>>> so as to have a coherent set of files.
> >>>>
> >>>> With respect to that point the p2p transition (especially the
> >>>> attribute caching on the node) will be a major step forward.
> >>>> GFDL just upgraded and we have an amazing success rate of 98%.
> >>>>
> >>>> And I agree with Ashish.
> >>>>
> >>>> Regards.
> >>>> Sébastien
> >>>>
> >>>> On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
> >>>>> Hi Jamie,
> >>>>>
> >>>>> I can imagine there is a risk of papers being written on
> >>>>> deprecated data in two scenarios:
> >>>>>   1. Data is being updated at datanodes without creating a
> >>>>>      new version
> >>>>>   2. Users are unaware of new versions available and
> >>>>>      therefore using deprecated data
> >>>>>
> >>>>> Are you concerned about both of these scenarios?  Your
> >>>>> email seems to mainly address #1.
> >>>>> Thanks,
> >>>>> Stephen.
> >>>>>
> >>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
> >>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> Does anyone have a feel for the current level of risk that
> >>>>>> analysts are doing work (with the intention to publish) on data
> >>>>>> that has been found to be wrong by the data providers and so
> >>>>>> deprecated (in some sense)?
> >>>>>>
> >>>>>> My feeling is that versioning isn't working (that may be
> >>>>>> putting it a bit strongly).  It is too easy for data providers - in
> >>>>>> their understandable drive to get their data out - to have
> >>>>>> updated files on disk without publishing a new version.  How big a
> >>>>>> deal does anyone think this is?
> >>>>>>
> >>>>>> If the risk that papers are being written based on
> >>>>>> deprecated data is sufficiently large then is there an agreed
> >>>>>> strategy for coping with this?  Does it have implications for the
> >>>>>> requirements of the data publishing/delivery system?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Jamie
> >>>>>> _______________________________________________
> >>>>>> GO-ESSP-TECH mailing list
> >>>>>> GO-ESSP-TECH at ucar.edu
> >>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>
> >>>> --
> >>>> Sébastien Denvil
> >>>> IPSL, Pôle de modélisation du climat
> >>>> UPMC, Case 101, 4 place Jussieu,
> >>>> 75252 Paris Cedex 5
> >>>>
> >>>> Tour 45-55 2ème étage Bureau 209
> >>>> Tel: 33 1 44 27 21 10
> >>>> Fax: 33 1 44 27 39 02
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >> Tobias Weigel
> >>
> >> Department of Data Management
> >> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
> >> Bundesstr. 45a
> >> 20146 Hamburg
> >> Germany
> >>
> >> Tel.: +49 40 460094 104
> >> E-Mail: weigel at dkrz.de
> >> Website: www.dkrz.de
> >>
> >> Managing Director: Prof. Dr. Thomas Ludwig
> >>
> >> Sitz der Gesellschaft: Hamburg
> >> Amtsgericht Hamburg HRB 39784
> >>
> >>
> >