[Go-essp-tech] What is the risk that science is done using 'deprecated' data?
stephen.pascoe at stfc.ac.uk
Fri Mar 9 02:55:53 MST 2012
Hi Tobias,
I'm not familiar with EPIC. Does anyone have a reference? I assume it's based on registration. In my view hashing provides a more powerful mechanism for global identifiers, provided you can create a canonical representation of the entity. Then it's impossible for the entity to change without the identifier changing, and you no longer rely on some trusted authority to enforce the correspondence.
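To make the hashing idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the choice of SHA-256, the JSON canonical form, the function names (canonical_bytes, dataset_id, file_checksum) and the example metadata and file names are all hypothetical, not anything ESG, EPIC or the Handle System actually does.

```python
import hashlib
import json


def canonical_bytes(metadata, file_checksums):
    """Serialise a dataset's invariants in one canonical form.

    Sorted keys and fixed separators mean that logically equal inputs
    always produce byte-identical output (an assumed canonical form).
    """
    record = {"files": file_checksums, "metadata": metadata}
    text = json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=True)
    return text.encode("utf-8")


def dataset_id(metadata, file_checksums):
    """Derive the identifier by hashing the canonical representation."""
    return hashlib.sha256(canonical_bytes(metadata, file_checksums)).hexdigest()


def file_checksum(content):
    """Checksum of one file's bytes (held in memory for this sketch)."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical dataset: the names and contents are made up for illustration.
metadata = {"project": "CMIP5", "variable": "tas", "version": "v1"}
files = {
    "tas_part1.nc": file_checksum(b"original bytes, part 1"),
    "tas_part2.nc": file_checksum(b"original bytes, part 2"),
}
original_id = dataset_id(metadata, files)

# A file is replaced on disk while the declared version stays the same.
replaced = dict(files)
replaced["tas_part1.nc"] = file_checksum(b"corrected bytes, part 1")
changed_id = dataset_id(metadata, replaced)

# The identifier no longer matches, so the silent change is detectable.
print(original_id != changed_id)
```

The point of the design is that anyone holding the data can recompute the identifier and compare it, so no registry has to be trusted to keep identifier and content in step.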
Last time I talked to DOI people it still wasn't completely sorted out whether a dataset can change without changing the DOI. Sure, it shouldn't, but there were examples of landing pages evolving over time. I can foresee a similar problem with any registration system.
Cheers,
Stephen.
On 9 Mar 2012, at 09:11, Tobias Weigel wrote:
> I am pretty much thinking about the EPIC infrastructure, and as long as that is not fully ready yet, at least the basic Handle System. ExArch is a very good candidate to explore some ideas and limitations, as well as the German c3grid project (trace provenance information across workflows).
>
> On 09.03.2012 10:08:04, Bryan Lawrence wrote:
>> There are others setting up digital identification services; we should not. E.g. EPIC ... could we use them?
>> Cheers
>> Bryan
>>
>>> Martin
>>>
>>> This problem space seems suitable for ExArch, i.e. setting up a digital identification service. This service would be very long-term infrastructure and thus would need to scale to several billions of identifiers plus associated metadata references.
>>>
>>> Mark
>>>
>>>
>>> On 8 Mar 2012, at 17:18, <martin.juckes at stfc.ac.uk> wrote:
>>>
>>>> I agree, particularly on the last point.
>>>>
>>>> There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree a set of requirements before starting development would be a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.
>>>>
>>>> Cheers,
>>>> Martin
>>>>
>>>>>> -----Original Message-----
>>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
>>>>>> bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
>>>>>> Sent: 08 March 2012 16:01
>>>>>> To: weigel at dkrz.de
>>>>>> Cc: go-essp-tech at ucar.edu
>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is done
>>>>>> using 'deprecated' data?
>>>>>>
>>>>>> Hi Tobias,
>>>>>>
>>>>>> As you say, it's too late to do much re-engineering of the system now
>>>>>> -- we've attempted to put in place various identifier systems and none
>>>>>> of them are working particularly well -- however I think there is
>>>>>> another perspective to your proposal:
>>>>>>
>>>>>> 1. ESG/CMIP5 is deployed globally across multiple administrative
>>>>>> domains and each domain has the ability to cut corners to get things
>>>>>> done, e.g. replacing files silently without changing identifiers.
>>>>>>
>>>>>> 2. The ESG/CMIP5 system is so complex that nobody could blame a
>>>>>> sys-admin for doing #1 to get the data to scientists when they need
>>>>>> it. Any system that makes it impossible, or even only difficult, to
>>>>>> change the underlying data is going to be more complex and difficult
>>>>>> to administer than a system that doesn't, unless that system was
>>>>>> very rigorously designed, implemented and tested.
>>>>>>
>>>>>> Because of #1 I'm convinced that a fit-for-purpose identifier system
>>>>>> wouldn't use randomly generated UUIDs but would take the Git approach
>>>>>> of hashing invariants of the dataset so that any changes behind the
>>>>>> scenes can be detected.
>>>>>>
>>>>>> Because of #2 I'm convinced that now is not the time to start building
>>>>>> more software to do this. We have to stabilise the system and learn
>>>>>> the lessons of CMIP5 first.
>>>>>>
>>>>>> Cheers,
>>>>>> Stephen.
>>>>>>
>>>>>>
>>>>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>>>>
>>>>>>> Jamie/All,
>>>>>>>
>>>>>>> these are important questions I have been wondering about as well;
>>>>>>> we just had a small internal meeting yesterday with Estani and
>>>>>>> Martina, so I'll try to sum some points up here. I am not too
>>>>>>> familiar with the ESG publishing process, so I can only guess that
>>>>>>> Stephen's #1 has something to do with the bending of policies that
>>>>>>> are for pragmatic reasons not enforced in the CMIP5 process. (My
>>>>>>> intuition is that *ideally* it should be impossible to make data
>>>>>>> available without going through the whole publication process.
>>>>>>> Please correct me if I am misunderstanding this.)
>>>>>>> Most of what I have been thinking about however concerns point #2.
>>>>>>> I'd claim that the risk here should not be underestimated; data
>>>>>>> consumers being unable to find the data they need is bad ("the
>>>>>>> advanced search issue"), but users relying on deprecated data - most
>>>>>>> likely without being aware of it - is certainly dangerous for
>>>>>>> scientific credibility.
>>>>>>> My suggestion to address this problem is to use globally persistent
>>>>>>> identifiers (PIDs) that are automatically assigned to data objects
>>>>>>> (and metadata etc.) on ESG publication; data should ideally not be
>>>>>>> known by its file name or system-internal ID, but via a global
>>>>>>> identifier that never changes after it has been published. Of
>>>>>>> course, this sounds like DOIs, but these are extremely
>>>>>>> coarse-grained and very static. The idea is to attach identifiers to
>>>>>>> the low-level entities and provide solutions to build up a
>>>>>>> hierarchical ID system (virtual collections) to account for the
>>>>>>> various layers used in our data. Such persistent identifiers should
>>>>>>> then be placed prominently in any user interface dealing with
>>>>>>> managed data. The important thing is: if data is updated, we don't
>>>>>>> update the data behind identifier x, but assign a new identifier y,
>>>>>>> create a typed link between the two (which may be the most
>>>>>>> challenging part) and perhaps put a small annotation on x saying
>>>>>>> that this data is deprecated. A clever user interface should then
>>>>>>> consistently redirect a user to the latest version of a dataset
>>>>>>> when the user accesses the old identifier.
>>>>>>> This does not make it impossible to use deprecated data, but at
>>>>>>> least it raises the consumer's awareness of the issue and lowers the
>>>>>>> barrier to re-retrieve valid data.
>>>>>>> As for the point in time; I'd be certain that it is too late now,
>>>>>>> but it is always a good idea to have plans for future improvement.. :)
>>>>>>> Best, Tobias
>>>>>>>
>>>>>>> On 08.03.2012 13:06, Kettleborough, Jamie wrote:
>>>>>>>> Thanks for the replies on this - any other replies are still very
>>>>>>>> welcome.
>>>>>>>> Stephen - being selfish - we aren't too worried about 2 as it's
>>>>>>>> less of an issue for us (we do a daily trawl of THREDDS catalogues
>>>>>>>> for new datasets), but I agree it is a problem more generally. I
>>>>>>>> don't have a feel for which of the problems 1-3 would minimise the
>>>>>>>> risk most if you solved it. I think making sure new data has a new
>>>>>>>> version is a foundation though.
>>>>>>>> Part of me wonders though whether it's already too late to really
>>>>>>>> do anything with versioning in its current form. *But* I may be
>>>>>>>> overestimating the size of the problem of new datasets appearing
>>>>>>>> without versions being updated.
>>>>>>>> Jamie
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
>>>>>>>>> Sent: 08 March 2012 10:41
>>>>>>>>> To: go-essp-tech at ucar.edu
>>>>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>>>>>>>> done using 'deprecated' data?
>>>>>>>>>
>>>>>>>>> Hi Stephen, let me add a third point:
>>>>>>>>>
>>>>>>>>> 3. Users are aware of new versions but can't download files
>>>>>>>>> so as to have a coherent set of files.
>>>>>>>>>
>>>>>>>>> With respect to that point the p2p transition (especially the
>>>>>>>>> attribute caching on the node) will be a major step forward.
>>>>>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
>>>>>>>>>
>>>>>>>>> And I agree with Ashish.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>> Sébastien
>>>>>>>>>
>>>>>>>>> On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>> Hi Jamie,
>>>>>>>>>>
>>>>>>>>>> I can imagine there is a risk of papers being written on
>>>>>>>>>> deprecated data in two scenarios:
>>>>>>>>>> 1. Data is being updated at datanodes without creating a
>>>>>>>>>> new version
>>>>>>>>>> 2. Users are unaware of new versions available and
>>>>>>>>>> therefore using deprecated data
>>>>>>>>>>
>>>>>>>>>> Are you concerned about both of these scenarios? Your
>>>>>>>>>> email seems to mainly address #1.
>>>>>>>>>> Thanks,
>>>>>>>>>> Stephen.
>>>>>>>>>>
>>>>>>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have a feel for the current level of risk that
>>>>>>>>>>> analysts are doing work (with the intention to publish) on
>>>>>>>>>>> data that has been found to be wrong by the data providers
>>>>>>>>>>> and so deprecated (in some sense)?
>>>>>>>>>>>
>>>>>>>>>>> My feeling is that versioning isn't working (that may be
>>>>>>>>>>> putting it a bit strongly). It is too easy for data
>>>>>>>>>>> providers - in their understandable drive to get their data
>>>>>>>>>>> out - to have updated files on disk without publishing a new
>>>>>>>>>>> version. How big a deal does anyone think this is?
>>>>>>>>>>>
>>>>>>>>>>> If the risk that papers are being written based on deprecated
>>>>>>>>>>> data is sufficiently large then is there an agreed strategy
>>>>>>>>>>> for coping with this? Does it have implications for the
>>>>>>>>>>> requirements of the data publishing/delivery system?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jamie
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>> --
>>>>>>>>> Sébastien Denvil
>>>>>>>>> IPSL, Pôle de modélisation du climat
>>>>>>>>> UPMC, Case 101, 4 place Jussieu,
>>>>>>>>> 75252 Paris Cedex 5
>>>>>>>>>
>>>>>>>>> Tour 45-55 2ème étage Bureau 209
>>>>>>>>> Tel: 33 1 44 27 21 10
>>>>>>>>> Fax: 33 1 44 27 39 02
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Tobias Weigel
>>>>>>>
>>>>>>> Department of Data Management
>>>>>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>>>>>> Bundesstr. 45a
>>>>>>> 20146 Hamburg
>>>>>>> Germany
>>>>>>>
>>>>>>> Tel.: +49 40 460094 104
>>>>>>> E-Mail: weigel at dkrz.de
>>>>>>> Website: www.dkrz.de
>>>>>>>
>>>>>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>>>>>
>>>>>>> Sitz der Gesellschaft: Hamburg
>>>>>>> Amtsgericht Hamburg HRB 39784
>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Scanned by iCritical.
>>> ---------------------------------------------------
>>> Mark Morgan
>>> Software Architect / Engineer
>>> Institut Pierre Simon Laplace (IPSL),
>>> Université Pierre Marie Curie,
>>> 4 Place Jussieu,
>>> Tour 45-55, Salle #207,
>>> Paris 75005
>>> France.
>>> Tel : +33 (0) 1 44 27 49 10
>>> Email: momipsl at ipsl.jussieu.fr
>>> ---------------------------------------------------
>>>
>>>
>>>
>>>
>> --
>> Bryan Lawrence
>> University of Reading: Professor of Weather and Climate Computing.
>> National Centre for Atmospheric Science: Director of Models and Data.
>> STFC: Director of the Centre for Environmental Data Archival.
>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
>>
>
>