[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Tobias Weigel weigel at dkrz.de
Thu Mar 8 08:32:12 MST 2012


Jamie/All,

these are important questions that I have been wondering about as well.
We had a small internal meeting yesterday with Estani and Martina, so
I'll try to sum up some points here. I am not too familiar with the ESG
publishing process, so I can only guess that Stephen's #1 has something
to do with policies being bent because, for pragmatic reasons, they are
not enforced in the CMIP5 process. (My intuition is that *ideally* it
should be impossible to make data available without going through the
whole publication process. Please correct me if I am misunderstanding
this.)

Most of what I have been thinking about, however, concerns point #2. I'd
claim that the risk here should not be underestimated: data consumers
being unable to find the data they need is bad ("the advanced search
issue"), but users relying on deprecated data, most likely without
being aware of it, is certainly dangerous for scientific credibility.

My suggestion to address this problem is to use globally persistent
identifiers (PIDs) that are automatically assigned to data objects (and
metadata etc.) on ESG publication. Data should ideally not be known by
its file name or system-internal ID, but by a global identifier that
never changes once it has been published. Of course, this sounds like
DOIs, but those are extremely coarse-grained and very static. The
idea is to attach identifiers to the low-level entities and provide
solutions to build up a hierarchical ID system (virtual collections) to
account for the various layers used in our data. Such persistent
identifiers should then be placed prominently in any user interface
dealing with managed data.

The important thing is: if data is updated, we don't update the data
behind identifier x, but assign a new identifier y, create a typed link
between the two (which may be the most challenging part), and perhaps
put a small annotation on x saying that this data is deprecated. A
clever user interface should then consistently redirect a user to the
latest version of a dataset whenever the user accesses the old
identifier.

This does not make it impossible to use deprecated data, but at least it
raises the consumer's awareness of the issue and lowers the barrier to
re-retrieving valid data.
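
To make the idea a bit more concrete, here is a minimal Python sketch of
the update rule described above. It is only an illustration under my own
assumptions, not an existing ESG or PID service API: the class and method
names, the "superseded_by"/"supersedes" link types, the identifier prefix
and the file paths are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    pid: str                      # persistent identifier, never reassigned
    location: str                 # where the object currently lives
    deprecated: bool = False      # small annotation placed on superseded data
    links: dict = field(default_factory=dict)  # typed links: type -> target PID


class PidRegistry:
    """Toy in-memory registry; a real system would need persistent storage."""

    def __init__(self):
        self._records = {}

    def publish(self, location):
        # Assign a fresh identifier automatically on publication.
        # The prefix is a placeholder, not a registered namespace.
        pid = "hypothetical-prefix/" + uuid.uuid4().hex
        self._records[pid] = Record(pid, location)
        return pid

    def supersede(self, old_pid, new_location):
        # Never change the data behind old_pid: mint a new identifier,
        # link the two records in both directions, and annotate the old one.
        old = self._records[old_pid]
        if old.deprecated:
            raise ValueError(f"{old_pid} has already been superseded")
        new_pid = self.publish(new_location)
        old.links["superseded_by"] = new_pid
        self._records[new_pid].links["supersedes"] = old_pid
        old.deprecated = True
        return new_pid

    def get(self, pid):
        # Direct lookup: deprecated data stays reachable, but is flagged.
        return self._records[pid]

    def resolve_latest(self, pid):
        # Follow "superseded_by" links until the newest version is reached.
        record = self._records[pid]
        while "superseded_by" in record.links:
            record = self._records[record.links["superseded_by"]]
        return record


registry = PidRegistry()
x = registry.publish("/archive/tas_v1.nc")       # hypothetical path
y = registry.supersede(x, "/archive/tas_v2.nc")  # hypothetical path
latest = registry.resolve_latest(x)
print(latest.pid == y, registry.get(x).deprecated)
```

A user interface built on something like this could call resolve_latest()
for every incoming identifier and show the deprecation flag whenever the
result differs from what the user asked for. The hierarchical part
(virtual collections) is left out here; it would presumably need further
link types between collection and member records.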

As for timing: I am quite certain that it is too late now, but it is
always a good idea to have plans for future improvement. :)

Best, Tobias

On 08.03.2012 13:06, Kettleborough, Jamie wrote:
> Thanks for the replies on this - any other replies are still very welcome.
>
> Stephen - being selfish - we aren't too worried about 2 as it's less of an issue for us (we do a daily trawl of THREDDS catalogues for new datasets), but I agree it is a problem more generally.  I don't have a feel for which of the problems 1-3 would minimise the risk most if you solved it.  I think making sure new data has a new version is a foundation though.
>
> Part of me wonders though whether it's already too late to really do anything with versioning in its current form.  *But* I may be overestimating the size of the problem of new datasets appearing without versions being updated.
>
> Jamie
>
>
>> -----Original Message-----
>> From: go-essp-tech-bounces at ucar.edu
>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
>> Sent: 08 March 2012 10:41
>> To: go-essp-tech at ucar.edu
>> Subject: Re: [Go-essp-tech] What is the risk that science is
>> done using 'deprecated' data?
>>
>> Hi Stephen, let me add a third point:
>>
>> 3. Users are aware of new versions but can't download files
>> so as to have a coherent set of files.
>>
>> With respect to that point the p2p transition (especially the
>> attribute caching on the node) will be a major step forward.
>> GFDL just upgraded and we have an amazing success rate of 98%.
>>
>> And I agree with Ashish.
>>
>> Regards.
>> Sébastien
>>
>> On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
>>> Hi Jamie,
>>>
>>> I can imagine there is a risk of papers being written on
>>> deprecated data in two scenarios:
>>>    1. Data is being updated at datanodes without creating a
>>>       new version
>>>    2. Users are unaware of new versions available and
>>>       therefore using deprecated data
>>>
>>> Are you concerned about both of these scenarios?  Your
>>> email seems to mainly address #1.
>>> Thanks,
>>> Stephen.
>>>
>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>
>>>> Hello,
>>>>
>>>> Does anyone have a feel for the current level of risk that analysts
>>>> are doing work (with the intention to publish) on data that has been
>>>> found to be wrong by the data providers and so deprecated (in some
>>>> sense)?
>>>>
>>>> My feeling is that versioning isn't working (that may be putting it a
>>>> bit strongly).  It is too easy for data providers - in their
>>>> understandable drive to get their data out - to have updated files on
>>>> disk without publishing a new version.   How big a deal does anyone
>>>> think this is?
>>>>
>>>> If the risk that papers are being written based on deprecated data is
>>>> sufficiently large then is there an agreed strategy for coping with
>>>> this?  Does it have implications for the requirements of the data
>>>> publishing/delivery system?
>>>>
>>>> Thanks,
>>>>
>>>> Jamie
>>>> _______________________________________________
>>>> GO-ESSP-TECH mailing list
>>>> GO-ESSP-TECH at ucar.edu
>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>
>> --
>> Sébastien Denvil
>> IPSL, Pôle de modélisation du climat
>> UPMC, Case 101, 4 place Jussieu,
>> 75252 Paris Cedex 5
>>
>> Tour 45-55 2ème étage Bureau 209
>> Tel: 33 1 44 27 21 10
>> Fax: 33 1 44 27 39 02
>>
>>
>>


-- 
Tobias Weigel

Department of Data Management
Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
Bundesstr. 45a
20146 Hamburg
Germany

Tel.: +49 40 460094 104
E-Mail: weigel at dkrz.de
Website: www.dkrz.de

Managing Director: Prof. Dr. Thomas Ludwig

Registered office: Hamburg
Amtsgericht Hamburg HRB 39784

