[Go-essp-tech] What is the risk that science is done using 'deprecated' data?
Mark Morgan
momipsl at ipsl.jussieu.fr
Fri Mar 9 01:41:36 MST 2012
Martin
This problem space seems suitable for EXARCH, i.e. setting up a digital identification service. This service would be very long-term infrastructure and would therefore need to scale to several billion identifiers plus associated metadata references.
Mark
On 8 Mar 2012, at 17:18, <martin.juckes at stfc.ac.uk> <martin.juckes at stfc.ac.uk> wrote:
> I agree, particularly on the last point.
>
> There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree on a set of requirements before starting development would be a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.
>
> Cheers,
> Martin
>
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu
>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
>>> Sent: 08 March 2012 16:01
>>> To: weigel at dkrz.de
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] What is the risk that science is done
>>> using 'deprecated' data?
>>>
>>> Hi Thomas,
>>>
>>> As you say, it's too late to do much re-engineering of the system now
>>> -- we've attempted to put in place various identifier systems and none
>>> of them are working particularly well -- however I think there is
>>> another perspective to your proposal:
>>>
>>> 1. ESG/CMIP5 is deployed globally across multiple administrative
>>> domains and each domain has the ability to cut corners to get things
>>> done, e.g. replacing files silently without changing identifiers.
>>>
>>> 2. ESG/CMIP5 system is so complex that who'd blame a sys-admin for
>>> doing #1 to get the data to scientists when they need it. Any system
>>> that makes it impossible, or even only difficult, to change the
>>> underlying data is going to be more complex and difficult to
>>> administer than a system that doesn't, unless that system was very
>>> rigorously designed, implemented and tested.
>>>
>>> Because of #1 I'm convinced that a fit-for-purpose identifier system
>>> wouldn't use randomly generated UUIDs but would take the Git approach
>>> of hashing invariants of the dataset so that any changes behind the
>>> scenes can be detected.
>>>
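The hashing idea in the paragraph above can be illustrated with a minimal sketch. It assumes the "invariants" of a dataset are simply its file names and file contents, which the thread does not specify. The function name `dataset_identifier` and the sample file names are hypothetical, and SHA-256 stands in for whatever hash a real service would choose.

```python
import hashlib


def dataset_identifier(files):
    """Derive an identifier from dataset content (hypothetical helper).

    `files` maps a file name to that file's bytes. Files are visited in
    sorted order so the identifier does not depend on listing order.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        file_hash = hashlib.sha256(files[name]).hexdigest()
        # Mix both the name and the content hash into the dataset hash.
        digest.update(f"{name}\0{file_hash}\n".encode("utf-8"))
    return digest.hexdigest()


# Illustrative data only; real files would be read from disk.
original = {"tas_day.nc": b"temperature v1", "pr_day.nc": b"precipitation v1"}
published_id = dataset_identifier(original)

# A file replaced "behind the scenes" yields a different identifier.
silently_replaced = {**original, "tas_day.nc": b"temperature v1 (fixed)"}
changed = dataset_identifier(silently_replaced) != published_id
print("change detected:", changed)
```

Recomputing the identifier at the data node and comparing it with the published one is enough to notice a silent replacement. It does not say which version is the correct one.
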
>>> Because of #2 I'm convinced that now is not the time to start building
>>> more software to do this. We have to stabilise the system and learn
>>> the lessons of CMIP5 first.
>>>
>>> Cheers,
>>> Stephen.
>>>
>>>
>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>
>>>> Jamie/All,
>>>>
>>>> these are important questions I have been wondering about as well;
>>>> we just had a small internal meeting yesterday with Estani and
>>>> Martina, so I'll try to sum some points up here. I am not too familiar
>>>> with the ESG publishing process, so I can only guess that Stephen's #1
>>>> has something to do with the bending of policies that are for
>>>> pragmatic reasons not enforced in the CMIP5 process. (My intuition is
>>>> that *ideally* it should be impossible to make data available without
>>>> going through the whole publication process. Please correct me if I am
>>>> misunderstanding this.)
>>>>
>>>> Most of what I have been thinking about however concerns point #2.
>>>> I'd claim that the risk here should not be underestimated; data
>>>> consumers being unable to find the data they need is bad ("the
>>>> advanced search issue"), but users relying on deprecated data - most
>>>> likely without being aware of it - is certainly dangerous for
>>>> scientific credibility.
>>>> My suggestion to address this problem is to use globally persistent
>>>> identifiers (PIDs) that are automatically assigned to data objects
>>>> (and metadata etc.) on ESG-publication; data should ideally not be
>>>> known by its file name or system-internal ID, but via a global
>>>> identifier that never changes after it has been published. Of course,
>>>> this sounds like the DOIs, but these are extremely coarse grained and
>>>> very static. The idea is to attach identifiers to the low-level
>>>> entities and provide solutions to build up a hierarchical ID system
>>>> (virtual collections) to account for the various layers used in our
>>>> data. Such persistent identifiers should then be placed prominently in
>>>> any user interface dealing with managed data. The important thing is:
>>>> If data is updated, we don't update the data behind identifier x, but
>>>> assign a new identifier y and create a typed link between these two
>>>> (which may be the most challenging part) and perhaps put a small
>>>> annotation on x that this data is deprecated. A clever user interface
>>>> should then redirect a user consistently to the latest version of a
>>>> dataset if a user accesses the old identifier.
>>>> This does not make it impossible to use deprecated data, but at
>>>> least it raises the consumer's awareness of the issue and lowers the
>>>> barrier to re-retrieve valid data.
>>>>
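The update rule described above (never change the data behind identifier x; assign a new identifier y, link the two, and mark x as deprecated) can be sketched as a small in-memory registry. Everything here is an assumption made for illustration: the class `PidRegistry`, the link type names `isReplacedBy` and `replaces`, and the `pid-N` identifier format do not come from ESG or from any existing PID service.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Record:
    location: str                 # where the data object lives
    deprecated: bool = False      # annotation set once a successor exists
    links: dict = field(default_factory=dict)  # link type -> target PID


class PidRegistry:
    """Toy registry; a real service would persist and replicate records."""

    def __init__(self):
        self._records = {}
        self._counter = itertools.count(1)

    def publish(self, location):
        pid = f"pid-{next(self._counter)}"  # hypothetical identifier format
        self._records[pid] = Record(location)
        return pid

    def supersede(self, old_pid, new_location):
        """Publish updated data under a new PID and link it to the old one."""
        new_pid = self.publish(new_location)
        old = self._records[old_pid]
        old.deprecated = True
        old.links["isReplacedBy"] = new_pid
        self._records[new_pid].links["replaces"] = old_pid
        return new_pid

    def is_deprecated(self, pid):
        return self._records[pid].deprecated

    def resolve_latest(self, pid):
        """Follow isReplacedBy links until the newest version is reached."""
        while "isReplacedBy" in self._records[pid].links:
            pid = self._records[pid].links["isReplacedBy"]
        return pid, self._records[pid]


registry = PidRegistry()
x = registry.publish("tas_day_v1.nc")
y = registry.supersede(x, "tas_day_v2.nc")
latest_pid, latest = registry.resolve_latest(x)
print(x, "deprecated:", registry.is_deprecated(x), "-> latest:", latest_pid)
```

A user interface built on such a registry could still serve the data behind x on request, while showing the deprecation flag and offering the result of `resolve_latest` by default.
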
>>>> As for the point in time; I'd be certain that it is too late now,
>>>> but it is always a good idea to have plans for future improvement. :)
>>>>
>>>> Best, Tobias
>>>>
>>>> Am 08.03.2012 13:06, schrieb Kettleborough, Jamie:
>>>>> Thanks for the replies on this - any other replies are still very
>>>>> welcome.
>>>>>
>>>>> Stephen - being selfish - we aren't too worried about 2 as it's less
>>>>> of an issue for us (we do a daily trawl of thredds catalogues for new
>>>>> datasets), but I agree it is a problem more generally. I don't have a
>>>>> feel for which of the problems 1-3 would minimise the risk most if you
>>>>> solved it. I think making sure new data has a new version is a
>>>>> foundation though.
>>>>>
>>>>> Part of me wonders though whether it's already too late to really do
>>>>> anything with versioning in its current form. *But* I may be
>>>>> overestimating the size of the problem of new datasets appearing
>>>>> without versions being updated.
>>>>>
>>>>> Jamie
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
>>>>>> Sent: 08 March 2012 10:41
>>>>>> To: go-essp-tech at ucar.edu
>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>>>>> done using 'deprecated' data?
>>>>>>
>>>>>> Hi Stephen, let me add a third point:
>>>>>>
>>>>>> 3. Users are aware of new versions but can't download files
>>>>>> so as to have a coherent set of files.
>>>>>>
>>>>>> With respect to that point the p2p transition (especially the
>>>>>> attribute caching on the node) will be a major step forward.
>>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
>>>>>>
>>>>>> And I agree with Ashish.
>>>>>>
>>>>>> Regards.
>>>>>> Sébastien
>>>>>>
>>>>>> Le 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk a écrit :
>>>>>>> Hi Jamie,
>>>>>>>
>>>>>>> I can imagine there is a risk of papers being written on
>>>>>>> deprecated data in two scenarios:
>>>>>>> 1. Data is being updated at datanodes without creating a
>>>>>>> new version
>>>>>>> 2. Users are unaware of new versions available and
>>>>>>> therefore using deprecated data
>>>>>>>
>>>>>>> Are you concerned about both of these scenarios? Your
>>>>>>> email seems to mainly address #1.
>>>>>>> Thanks,
>>>>>>> Stephen.
>>>>>>>
>>>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Does anyone have a feel for the current level of risk that
>>>>>>>> analysts are doing work (with the intention to publish) on data
>>>>>>>> that has been found to be wrong by the data providers and so
>>>>>>>> deprecated (in some sense)?
>>>>>>>>
>>>>>>>> My feeling is that versioning isn't working (that may be
>>>>>>>> putting it a bit strongly). It is too easy for data providers - in
>>>>>>>> their understandable drive to get their data out - to have
>>>>>>>> updated files on disk without publishing a new version. How big a
>>>>>>>> deal does anyone think this is?
>>>>>>>>
>>>>>>>> If the risk that papers are being written based on
>>>>>>>> deprecated data is sufficiently large then is there an agreed
>>>>>>>> strategy for coping with this? Does it have implications for the
>>>>>>>> requirements of the data publishing/delivery system?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jamie
>>>>>>>> _______________________________________________
>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>
>>>>>> --
>>>>>> Sébastien Denvil
>>>>>> IPSL, Pôle de modélisation du climat
>>>>>> UPMC, Case 101, 4 place Jussieu,
>>>>>> 75252 Paris Cedex 5
>>>>>>
>>>>>> Tour 45-55 2ème étage Bureau 209
>>>>>> Tel: 33 1 44 27 21 10
>>>>>> Fax: 33 1 44 27 39 02
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Tobias Weigel
>>>>
>>>> Department of Data Management
>>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>>> Bundesstr. 45a
>>>> 20146 Hamburg
>>>> Germany
>>>>
>>>> Tel.: +49 40 460094 104
>>>> E-Mail: weigel at dkrz.de
>>>> Website: www.dkrz.de
>>>>
>>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>>
>>>> Sitz der Gesellschaft: Hamburg
>>>> Amtsgericht Hamburg HRB 39784
>>>>
>>>>
>>>
>>> --
>>> Scanned by iCritical.
---------------------------------------------------
Mark Morgan
Software Architect / Engineer
Institut Pierre Simon Laplace (IPSL),
Université Pierre Marie Curie,
4 Place Jussieu,
Tour 45-55, Salle #207,
Paris 75005
France.
Tel : +33 (0) 1 44 27 49 10
Email: momipsl at ipsl.jussieu.fr
---------------------------------------------------