[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

stephen.pascoe at stfc.ac.uk stephen.pascoe at stfc.ac.uk
Fri Mar 9 02:55:53 MST 2012


Hi Tobias,

I'm not familiar with EPIC.  Does anyone have a reference?  I assume it's based on registration.  In my view hashing provides a more powerful mechanism for global identifiers, provided you can create a canonical representation of the entity: it then becomes impossible for the entity to change without the identifier changing, rather than relying on some trusted authority to enforce the correspondence.
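
To make the idea concrete, here is a minimal sketch in Python of a content-derived identifier, assuming a hypothetical canonicalisation (project metadata plus per-file checksums); none of the names below correspond to any existing ESG component.

    import hashlib
    import json

    def canonical_form(metadata, file_checksums):
        # Hypothetical canonicalisation: metadata plus a sorted list of
        # per-file checksums, serialised deterministically.
        doc = {"metadata": metadata, "files": sorted(file_checksums)}
        return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")

    def dataset_identifier(metadata, file_checksums):
        # Content-derived identifier: any change to the canonical form yields
        # a different identifier, with no registry needed to keep it honest.
        return hashlib.sha256(canonical_form(metadata, file_checksums)).hexdigest()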

Last time I talked to DOI people it still wasn't completely sorted out whether a dataset can change without changing the DOI.  Sure, it shouldn't, but there were examples of landing pages evolving over time.  I can foresee a similar problem with any registration system.

Cheers,
Stephen.

On 9 Mar 2012, at 09:11, Tobias Weigel wrote:

> I am pretty much thinking about the EPIC infrastructure and, as long as that is not fully ready yet, at least the basic Handle System. ExArch is a very good candidate for exploring some ideas and limitations, as is the German c3grid project (tracing provenance information across workflows).
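> 
> As an illustration of what the basic Handle System already provides, the sketch below resolves a handle through the public hdl.handle.net proxy (assuming its JSON interface at /api/handles/; the handle name is made up):
> 
>     import json
>     import urllib.request
> 
>     def resolve_handle(handle):
>         # Assumed endpoint: the public Handle.Net proxy's JSON interface.
>         # Returns the record values stored under the handle.
>         url = "https://hdl.handle.net/api/handles/" + handle
>         with urllib.request.urlopen(url) as resp:
>             return json.load(resp).get("values", [])
> 
>     # Illustrative only: print the typed values of a (made-up) handle.
>     # for value in resolve_handle("21.12345/example-dataset"):
>     #     print(value["type"], value["data"])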
> 
> On  09.03.2012 10:08:04, Bryan Lawrence wrote:
>> There are others setting up digital identification services, so we should not; e.g. EPIC ... could we use them?
>> Cheers
>> Bryan
>> 
>>> Martin
>>> 
>>> This problem space seems suitable for EXARCH, i.e. setting up a digital identification service.  This service would be very long-term infrastructure and thus would need to scale to several billion identifiers plus associated metadata references.
>>> 
>>> Mark
>>> 
>>> 
>>> On 8 Mar 2012, at 17:18, <martin.juckes at stfc.ac.uk> wrote:
>>> 
>>>> I agree, particularly on the last point.
>>>> 
>>>> There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree a set of requirements before starting development would have been a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.
>>>> 
>>>> Cheers,
>>>> Martin
>>>> 
>>>>>> -----Original Message-----
>>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
>>>>>> bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
>>>>>> Sent: 08 March 2012 16:01
>>>>>> To: weigel at dkrz.de
>>>>>> Cc: go-essp-tech at ucar.edu
>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is done
>>>>>> using 'deprecated' data?
>>>>>> 
>>>>>> Hi Tobias,
>>>>>> 
>>>>>> As you say, it's too late to do much re-engineering of the system now
>>>>>> -- we've attempted to put in place various identifier systems and none
>>>>>> of them are working particularly well -- however, I think there is
>>>>>> another perspective on your proposal:
>>>>>> 
>>>>>> 1. ESG/CMIP5 is deployed globally across multiple administrative
>>>>>> domains and each domain has the ability to cut corners to get things
>>>>>> done, e.g. replacing files silently without changing identifiers.
>>>>>> 
>>>>>> 2. The ESG/CMIP5 system is so complex that who could blame a sys-admin
>>>>>> for doing #1 to get the data to scientists when they need it?  Any
>>>>>> system that makes it impossible, or even merely difficult, to change
>>>>>> the underlying data is going to be more complex and harder to
>>>>>> administer than one that doesn't, unless that system is very
>>>>>> rigorously designed, implemented and tested.
>>>>>> 
>>>>>> Because of #1 I'm convinced that a fit-for-purpose identifier system
>>>>>> wouldn't use randomly generated UUIDs but would take the Git approach
>>>>>> of hashing invariants of the dataset, so that any changes behind the
>>>>>> scenes can be detected.
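>>>>>> 
>>>>>> A minimal sketch of that detection (hypothetical helper names; the
>>>>>> invariant here is simply the set of per-file SHA-256 digests):
>>>>>> 
>>>>>>     import hashlib
>>>>>>     import pathlib
>>>>>> 
>>>>>>     def file_digest(path):
>>>>>>         # SHA-256 of one file's bytes.
>>>>>>         return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
>>>>>> 
>>>>>>     def dataset_digest(paths):
>>>>>>         # Order-independent digest over the dataset's files: the
>>>>>>         # invariant a published identifier would be derived from.
>>>>>>         combined = "\n".join(sorted(file_digest(p) for p in paths))
>>>>>>         return hashlib.sha256(combined.encode("utf-8")).hexdigest()
>>>>>> 
>>>>>>     def silently_modified(paths, published_digest):
>>>>>>         # True if the files on disk no longer match the published
>>>>>>         # identifier, i.e. data changed without a new version.
>>>>>>         return dataset_digest(paths) != published_digest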
>>>>>> 
>>>>>> Because of #2 I'm convinced that now is not the time to start building
>>>>>> more software to do this.  We have to stabilise the system and learn
>>>>>> the lessons of CMIP5 first.
>>>>>> 
>>>>>> Cheers,
>>>>>> Stephen.
>>>>>> 
>>>>>> 
>>>>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>>>> 
>>>>>>> Jamie/All,
>>>>>>> 
>>>>>>> these are important questions I have been wondering about as well; we just had a small internal meeting yesterday with Estani and Martina, so I'll try to sum some points up here. I am not too familiar with the ESG publishing process, so I can only guess that Stephen's #1 has something to do with the bending of policies that are for pragmatic reasons not enforced in the CMIP5 process. (My intuition is that *ideally* it should be impossible to make data available without going through the whole publication process. Please correct me if I am misunderstanding this.)
>>>>>>> Most of what I have been thinking about however concerns point #2. I'd claim that the risk here should not be underestimated; data consumers being unable to find the data they need is bad ("the advanced search issue"), but users relying on deprecated data - most likely without being aware of it - is certainly dangerous for scientific credibility.
>>>>>>> My suggestion for addressing this problem is to use globally persistent identifiers (PIDs) that are automatically assigned to data objects (and metadata etc.) on ESG publication; data should ideally not be known by its file name or system-internal ID, but via a global identifier that never changes after it has been published. Of course, this sounds like DOIs, but those are extremely coarse-grained and very static. The idea is to attach identifiers to the low-level entities and provide solutions for building up a hierarchical ID system (virtual collections) to account for the various layers used in our data. Such persistent identifiers should then be placed prominently in any user interface dealing with managed data. The important thing is: if data is updated, we don't update the data behind identifier x, but assign a new identifier y and create a typed link between the two (which may be the most challenging part), and perhaps put a small annotation on x saying that this data is deprecated. A clever user interface should then consistently redirect a user to the latest version of a dataset when they access the old identifier.
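>>>>>>> A minimal sketch of those mechanics (a hypothetical in-memory registry; real PIDs would live in Handle/EPIC records, and every name below is made up): on update we mint a new identifier, record a typed link from the old one, and resolution follows the chain to the latest version.
>>>>>>> 
>>>>>>>     class PidRegistry:
>>>>>>>         # Hypothetical in-memory registry, for illustration only.
>>>>>>>         def __init__(self):
>>>>>>>             self.records = {}   # pid -> {"location": ..., "deprecated_by": ...}
>>>>>>>             self.counter = 0
>>>>>>> 
>>>>>>>         def mint(self, location):
>>>>>>>             # Assign a fresh, never-reused identifier on publication.
>>>>>>>             self.counter += 1
>>>>>>>             pid = "pid:{:06d}".format(self.counter)
>>>>>>>             self.records[pid] = {"location": location, "deprecated_by": None}
>>>>>>>             return pid
>>>>>>> 
>>>>>>>         def supersede(self, old_pid, new_location):
>>>>>>>             # Leave the data behind old_pid untouched: mint a new PID and
>>>>>>>             # record a typed "deprecated by" link on the old record.
>>>>>>>             new_pid = self.mint(new_location)
>>>>>>>             self.records[old_pid]["deprecated_by"] = new_pid
>>>>>>>             return new_pid
>>>>>>> 
>>>>>>>         def resolve_latest(self, pid):
>>>>>>>             # Follow deprecation links so a user interface can redirect
>>>>>>>             # an old identifier to the latest version of the dataset.
>>>>>>>             while self.records[pid]["deprecated_by"] is not None:
>>>>>>>                 pid = self.records[pid]["deprecated_by"]
>>>>>>>             return pid, self.records[pid]["location"]
>>>>>>> 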
>>>>>>> This does not make it impossible to use deprecated data, but at least it raises the consumer's awareness of the issue and lowers the barrier to re-retrieve valid data.
>>>>>>> As for the point in time: I'm fairly certain that it is too late now, but it is always a good idea to have plans for future improvement. :)
>>>>>>> Best, Tobias
>>>>>>> 
>>>>>>> On 08.03.2012 13:06, Kettleborough, Jamie wrote:
>>>>>>>> Thanks for the replies on this - any other replies are still very welcome.
>>>>>>>> Stephen - being selfish - we aren't too worried about 2 as it's less of an issue for us (we do a daily trawl of THREDDS catalogues for new datasets), but I agree it is a problem more generally.  I don't have a feel for which of the problems 1-3 would minimise the risk most if you solved it.  I think making sure new data has a new version is a foundation, though.
>>>>>>>> Part of me wonders, though, whether it's already too late to really do anything with versioning in its current form.  *But* I may be overestimating the size of the problem of new datasets appearing without versions being updated.
>>>>>>>> Jamie
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
>>>>>>>>> Sent: 08 March 2012 10:41
>>>>>>>>> To: go-essp-tech at ucar.edu
>>>>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>>>>>>>> done using 'deprecated' data?
>>>>>>>>> 
>>>>>>>>> Hi Stephen, let me add a third point:
>>>>>>>>> 
>>>>>>>>> 3. Users are aware of new versions but can't download the files
>>>>>>>>> they need to have a coherent set of files.
>>>>>>>>> 
>>>>>>>>> With respect to that point, the p2p transition (especially the
>>>>>>>>> attribute caching on the node) will be a major step forward.
>>>>>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
>>>>>>>>> 
>>>>>>>>> And I agree with Ashish.
>>>>>>>>> 
>>>>>>>>> Regards.
>>>>>>>>> Sébastien
>>>>>>>>> 
>>>>>>>>> On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>> Hi Jamie,
>>>>>>>>>> 
>>>>>>>>>> I can imagine there is a risk of papers being written on deprecated data in two scenarios:
>>>>>>>>>>  1. Data is being updated at datanodes without creating a new version
>>>>>>>>>>  2. Users are unaware of new versions available and therefore using deprecated data
>>>>>>>>>> 
>>>>>>>>>> Are you concerned about both of these scenarios?  Your email seems to mainly address #1.
>>>>>>>>>> Thanks,
>>>>>>>>>> Stephen.
>>>>>>>>>> 
>>>>>>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello,
>>>>>>>>>>> 
>>>>>>>>>>> Does anyone have a feel for the current level of risk that analysts are doing work (with the intention to publish) on data that has been found to be wrong by the data providers and so deprecated (in some sense)?
>>>>>>>>>>> 
>>>>>>>>>>> My feeling is that versioning isn't working (that may be putting it a bit strongly).  It is too easy for data providers - in their understandable drive to get their data out - to have updated files on disk without publishing a new version.  How big a deal does anyone think this is?
>>>>>>>>>>> 
>>>>>>>>>>> If the risk that papers are being written based on deprecated data is sufficiently large then is there an agreed strategy for coping with this?  Does it have implications for the requirements of the data publishing/delivery system?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Jamie
>>>>>>>>> --
>>>>>>>>> Sébastien Denvil
>>>>>>>>> IPSL, Pôle de modélisation du climat
>>>>>>>>> UPMC, Case 101, 4 place Jussieu,
>>>>>>>>> 75252 Paris Cedex 5
>>>>>>>>> 
>>>>>>>>> Tour 45-55 2ème étage Bureau 209
>>>>>>>>> Tel: 33 1 44 27 21 10
>>>>>>>>> Fax: 33 1 44 27 39 02
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Tobias Weigel
>>>>>>> 
>>>>>>> Department of Data Management
>>>>>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>>>>>> Bundesstr. 45a
>>>>>>> 20146 Hamburg
>>>>>>> Germany
>>>>>>> 
>>>>>>> Tel.: +49 40 460094 104
>>>>>>> E-Mail: weigel at dkrz.de
>>>>>>> Website: www.dkrz.de
>>>>>>> 
>>>>>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>>>>> 
>>>>>>> Sitz der Gesellschaft: Hamburg
>>>>>>> Amtsgericht Hamburg HRB 39784
>>>>>>> 
>>>>>>> 
>>> ---------------------------------------------------
>>> Mark Morgan
>>> Software Architect / Engineer
>>> Institut Pierre Simon Laplace (IPSL),
>>> Université Pierre Marie Curie,
>>> 4 Place Jussieu,
>>> Tour 45-55, Salle #207,
>>> Paris 75005
>>> France.
>>> Tel : +33 (0) 1 44 27 49 10
>>> Email: momipsl at ipsl.jussieu.fr
>>> ---------------------------------------------------
>>> 
>>> 
>>> 
>>> 
>> --
>> Bryan Lawrence
>> University of Reading:  Professor of Weather and Climate Computing.
>> National Centre for Atmospheric Science: Director of Models and Data.
>> STFC: Director of the Centre for Environmental Data Archival.
>> Ph: +44 118 3786507 or 1235 445012; Web: home.badc.rl.ac.uk/lawrence
>> 
> 
> 
> -- 
> Tobias Weigel
> 
> Department of Data Management
> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
> Bundesstr. 45a
> 20146 Hamburg
> Germany
> 
> Tel.: +49 40 460094 104
> E-Mail: weigel at dkrz.de
> Website: www.dkrz.de
> 
> Managing Director: Prof. Dr. Thomas Ludwig
> 
> Sitz der Gesellschaft: Hamburg
> Amtsgericht Hamburg HRB 39784
> 
> 


