[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Fri Mar 9 01:41:36 MST 2012

Martin

This problem space seems suitable for EXARCH, i.e. setting up a digital identification service.  Such a service would be very long-term infrastructure and would therefore need to scale to several billion identifiers plus associated metadata references.

Mark

On 8 Mar 2012, at 17:18, <martin.juckes at stfc.ac.uk> wrote:

> I agree, particularly on the last point.
> 
> There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree on a set of requirements before starting development would be a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.
> 
> Cheers,
> Martin
> 
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu
>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
>>> Sent: 08 March 2012 16:01
>>> To: weigel at dkrz.de
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] What is the risk that science is done
>>> using 'deprecated' data?
>>> 
>>> Hi Tobias,
>>> 
>>> As you say, it's too late to do much re-engineering of the system now
>>> -- we've attempted to put in place various identifier systems and none
>>> of them are working particularly well -- however I think there is
>>> another perspective to your proposal:
>>> 
>>> 1. ESG/CMIP5 is deployed globally across multiple administrative
>>> domains and each domain has the ability to cut corners to get things
>>> done, e.g. replacing files silently without changing identifiers.
>>> 
>>> 2. The ESG/CMIP5 system is so complex that nobody would blame a sys-admin
>>> for doing #1 to get the data to scientists when they need it.  Any system
>>> that makes it impossible, or even only difficult, to change the
>>> underlying data is going to be more complex and difficult to
>>> administer than a system that doesn't, unless that system was very
>>> rigorously designed, implemented and tested.
>>> 
>>> Because of #1 I'm convinced that a fit-for-purpose identifier system
>>> wouldn't use randomly generated UUIDs but would take the Git approach
>>> of hashing invariants of the dataset so that any changes behind the
>>> scenes can be detected.
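
A minimal sketch of the hashing idea above, not anything ESG actually implements. It assumes the "invariants" of a dataset are simply its file names and file contents; the function name and the example file names are hypothetical. The identifier is derived from the content, so a silent replacement of a file yields a different identifier.

```python
import hashlib


def dataset_identifier(files):
    """Derive an identifier from a dataset's content.

    `files` maps a file name to that file's bytes (an assumption made
    for this sketch; a real system would stream files from disk).
    """
    digest = hashlib.sha256()
    # Sort by name so the identifier does not depend on listing order.
    for name in sorted(files):
        file_hash = hashlib.sha256(files[name]).hexdigest()
        digest.update(f"{name}\0{file_hash}\n".encode("utf-8"))
    return digest.hexdigest()


# Hypothetical dataset: one file is later replaced "behind the scenes".
original = {"tas_day.nc": b"version one", "pr_day.nc": b"precipitation"}
silently_replaced = {"tas_day.nc": b"version two", "pr_day.nc": b"precipitation"}

published_id = dataset_identifier(original)
print(dataset_identifier(original) == published_id)           # unchanged data
print(dataset_identifier(silently_replaced) == published_id)  # replaced file
```

Anyone holding the published identifier can recompute it from the files they downloaded and compare, without trusting the data node's bookkeeping.
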
>>> 
>>> Because of #2 I'm convinced that now is not the time to start building
>>> more software to do this.  We have to stabilise the system and learn
>>> the lessons of CMIP5 first.
>>> 
>>> Cheers,
>>> Stephen.
>>> 
>>> 
>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>> 
>>>> Jamie/All,
>>>> 
>>>> these are important questions I have been wondering about as well;
>>>> we just had a small internal meeting yesterday with Estani and
>>>> Martina, so I'll try to sum some points up here. I am not too familiar
>>>> with the ESG publishing process, so I can only guess that Stephen's #1
>>>> has something to do with the bending of policies that are for
>>>> pragmatic reasons not enforced in the CMIP5 process. (My intuition is
>>>> that *ideally* it should be impossible to make data available without
>>>> going through the whole publication process. Please correct me if I am
>>>> misunderstanding this.)
>>>> 
>>>> Most of what I have been thinking about however concerns point #2.
>>>> I'd claim that the risk here should not be underestimated; data
>>>> consumers being unable to find the data they need is bad ("the
>>>> advanced search issue"), but users relying on deprecated data - most
>>>> likely without being aware of it - is certainly dangerous for
>>>> scientific credibility.
>>>> My suggestion to address this problem is to use globally persistent
>>>> identifiers (PIDs) that are automatically assigned to data objects
>>>> (and metadata etc.) on ESG publication; data should ideally not be
>>>> known by its file name or system-internal ID, but via a global
>>>> identifier that never changes after it has been published. Of course,
>>>> this sounds like DOIs, but these are extremely coarse-grained and
>>>> very static. The idea is to attach identifiers to the low-level
>>>> entities and provide solutions to build up a hierarchical ID system
>>>> (virtual collections) to account for the various layers used in our
>>>> data. Such persistent identifiers should then be placed prominently in
>>>> any user interface dealing with managed data. The important thing is:
>>>> if data is updated, we don't update the data behind identifier x, but
>>>> assign a new identifier y and create a typed link between these two
>>>> (which may be the most challenging part) and perhaps put a small
>>>> annotation on x that this data is deprecated. A clever user interface
>>>> should then redirect a user consistently to the latest version of a
>>>> dataset if a user accesses the old identifier.
>>>> This does not make it impossible to use deprecated data, but at
>>>> least it raises the consumer's awareness of the issue and lowers the
>>>> barrier to re-retrieve valid data.
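
A rough sketch of the identifier scheme described above, under loud assumptions: the class and method names are hypothetical, the registry lives in memory, and the random UUID is only a placeholder for however identifiers would really be minted. It shows the core rule: an update never changes what identifier x points to; it mints y, links x to y, and marks x as deprecated.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    location: str                      # where the data object lives
    deprecated: bool = False           # the "small annotation" on old data
    replaced_by: Optional[str] = None  # typed link: "is replaced by"


class PidRegistry:
    """Hypothetical in-memory registry; not an existing ESG component."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def publish(self, location: str) -> str:
        pid = str(uuid.uuid4())  # placeholder for a real minting scheme
        self._records[pid] = Record(location)
        return pid

    def publish_update(self, old_pid: str, location: str) -> str:
        old = self._records[old_pid]
        if old.replaced_by is not None:
            raise ValueError("identifier already has a successor")
        new_pid = self.publish(location)
        old.deprecated = True
        old.replaced_by = new_pid
        return new_pid

    def is_deprecated(self, pid: str) -> bool:
        return self._records[pid].deprecated

    def location(self, pid: str) -> str:
        # The data behind an identifier never changes after publication.
        return self._records[pid].location

    def resolve_latest(self, pid: str) -> str:
        # Follow "is replaced by" links to the newest version.
        while self._records[pid].replaced_by is not None:
            pid = self._records[pid].replaced_by
        return pid


registry = PidRegistry()
x = registry.publish("/data/v1/tas_day.nc")            # hypothetical path
y = registry.publish_update(x, "/data/v2/tas_day.nc")  # hypothetical path
print(registry.is_deprecated(x), registry.resolve_latest(x) == y)
```

A user interface built on such a registry could warn whenever `is_deprecated` is true and offer the result of `resolve_latest` instead.
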
>>>> 
>>>> As for the timing: I'm certain that it is too late now, but it is
>>>> always a good idea to have plans for future improvement. :)
>>>> 
>>>> Best, Tobias
>>>> 
>>>> Am 08.03.2012 13:06, schrieb Kettleborough, Jamie:
>>>>> Thanks for the replies on this - any other replies are still very
>>>>> welcome.
>>>>> 
>>>>> Stephen - being selfish - we aren't too worried about 2 as it's less
>>>>> of an issue for us (we do a daily trawl of thredds catalogues for new
>>>>> datasets), but I agree it is a problem more generally.  I don't have a
>>>>> feel for which of the problems 1-3 would minimise the risk most if you
>>>>> solved it.  I think making sure new data has a new version is a
>>>>> foundation though.
>>>>> 
>>>>> Part of me wonders though whether it's already too late to really do
>>>>> anything with versioning in its current form.  *But* I may be
>>>>> overestimating the size of the problem of new datasets appearing
>>>>> without versions being updated.
>>>>> 
>>>>> Jamie
>>>>> 
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
>>>>>> Sent: 08 March 2012 10:41
>>>>>> To: go-essp-tech at ucar.edu
>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>>>>> done using 'deprecated' data?
>>>>>> 
>>>>>> Hi Stephen, let me add a third point:
>>>>>> 
>>>>>> 3. Users are aware of new versions but can't download files
>>>>>> in a way that gives them a coherent set of files.
>>>>>> 
>>>>>> With respect to that point the p2p transition (especially the
>>>>>> attribute caching on the node) will be a major step forward.
>>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
>>>>>> 
>>>>>> And I agree with Ashish.
>>>>>> 
>>>>>> Regards.
>>>>>> Sébastien
>>>>>> 
>>>>>> Le 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk a écrit :
>>>>>>> Hi Jamie,
>>>>>>> 
>>>>>>> I can imagine there is a risk of papers being written on deprecated
>>>>>>> data in two scenarios:
>>>>>>>  1. Data is being updated at datanodes without creating a new version
>>>>>>>  2. Users are unaware of new versions available and therefore using
>>>>>>>     deprecated data
>>>>>>> 
>>>>>>> Are you concerned about both of these scenarios?  Your email seems
>>>>>>> to mainly address #1.
>>>>>>> Thanks,
>>>>>>> Stephen.
>>>>>>> 
>>>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>>>>> 
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> Does anyone have a feel for the current level of risk that analysts
>>>>>>>> are doing work (with the intention to publish) on data that has been
>>>>>>>> found to be wrong by the data providers and so deprecated (in some
>>>>>>>> sense)?
>>>>>>>> 
>>>>>>>> My feeling is that versioning isn't working (that may be putting it a
>>>>>>>> bit strongly).  It is too easy for data providers - in their
>>>>>>>> understandable drive to get their data out - to have updated files on
>>>>>>>> disk without publishing a new version.  How big a deal does anyone
>>>>>>>> think this is?
>>>>>>>> 
>>>>>>>> If the risk that papers are being written based on deprecated data is
>>>>>>>> sufficiently large then is there an agreed strategy for coping with
>>>>>>>> this?  Does it have implications for the requirements of the data
>>>>>>>> publishing/delivery system?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jamie
>>>>>>>> _______________________________________________
>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>> 
>>>>>> --
>>>>>> Sébastien Denvil
>>>>>> IPSL, Pôle de modélisation du climat
>>>>>> UPMC, Case 101, 4 place Jussieu,
>>>>>> 75252 Paris Cedex 5
>>>>>> 
>>>>>> Tour 45-55 2ème étage Bureau 209
>>>>>> Tel: 33 1 44 27 21 10
>>>>>> Fax: 33 1 44 27 39 02
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Tobias Weigel
>>>> 
>>>> Department of Data Management
>>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>>> Bundesstr. 45a
>>>> 20146 Hamburg
>>>> Germany
>>>> 
>>>> Tel.: +49 40 460094 104
>>>> E-Mail: weigel at dkrz.de
>>>> Website: www.dkrz.de
>>>> 
>>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>> 
>>>> Sitz der Gesellschaft: Hamburg
>>>> Amtsgericht Hamburg HRB 39784
>>>> 
>>>> 
>>> 
> 

---------------------------------------------------
Mark Morgan
Software Architect / Engineer
Institut Pierre Simon Laplace (IPSL),
Université Pierre Marie Curie,
4 Place Jussieu,
Tour 45-55, Salle #207,
Paris 75005
France.
Tel : +33 (0) 1 44 27 49 10
Email: momipsl at ipsl.jussieu.fr
---------------------------------------------------
