[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Tobias Weigel weigel at dkrz.de
Fri Mar 9 02:11:14 MST 2012


I am thinking mainly of the EPIC infrastructure and, as long as that is
not fully ready, at least the basic Handle System. ExArch is a very good
candidate for exploring some ideas and limitations, as is the German
c3grid project (tracing provenance information across workflows).

On  09.03.2012 10:08:04, Bryan Lawrence wrote:
> There are others setting up digital identification services; we should not. E.g. EPIC ... could we use them?
> Cheers
> Bryan
>
>> Martin
>>
>> This problem space seems suitable for EXARCH, i.e. setting up a digital identification service.  This service would be very long-term infrastructure and thus would need to scale to several billion identifiers plus associated metadata references.
>>
>> Mark
>>
>>
>> On 8 Mar 2012, at 17:18, <martin.juckes at stfc.ac.uk> wrote:
>>
>>> I agree, particularly on the last point.
>>>
>>> There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree on a set of requirements before starting development would be a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.
>>>
>>> Cheers,
>>> Martin
>>>
>>>>> -----Original Message-----
>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
>>>>> bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
>>>>> Sent: 08 March 2012 16:01
>>>>> To: weigel at dkrz.de
>>>>> Cc: go-essp-tech at ucar.edu
>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is done
>>>>> using 'deprecated' data?
>>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> As you say, it's too late to do much re-engineering of the system now
>>>>> -- we've attempted to put in place various identifier systems and none
>>>>> of them are working particularly well -- however I think there is
>>>>> another perspective to your proposal:
>>>>>
>>>>> 1. ESG/CMIP5 is deployed globally across multiple administrative
>>>>> domains and each domain has the ability to cut corners to get things
>>>>> done, e.g. replacing files silently without changing identifiers.
>>>>>
>>>>> 2. ESG/CMIP5 system is so complex that who'd blame a sys-admin for
>>>>> doing #1 to get the data to scientists when they need it.  Any system
>>>>> that makes it impossible, or even only difficult, to change the
>>>>> underlying data is going to be more complex and difficult to
>>>>> administer than a system that doesn't, unless that system was very
>>>>> rigorously designed, implemented and tested.
>>>>>
>>>>> Because of #1 I'm convinced that a fit-for-purpose identifier system
>>>>> wouldn't use randomly generated UUIDs but would take the Git approach
>>>>> of hashing invariants of the dataset so that any changes behind the
>>>>> scenes can be detected.
>>>>>
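To make the hashing idea above concrete, here is a minimal sketch of a content-derived identifier. It is only an illustration under assumptions: the function name `dataset_fingerprint`, the choice of SHA-256, and the file names are all made up here, and "invariants" is taken to mean file names plus file bytes. Nothing in it is part of ESG.

```python
import hashlib


def dataset_fingerprint(files):
    """Derive an identifier from dataset content rather than a random UUID.

    `files` maps each file name to that file's bytes. Names are visited in
    sorted order so the result does not depend on listing order.
    (Hypothetical helper, not an ESG API.)
    """
    outer = hashlib.sha256()
    for name in sorted(files):
        file_digest = hashlib.sha256(files[name]).hexdigest()
        # Hash the name together with the per-file digest, so that renaming
        # a file or replacing its bytes both change the fingerprint.
        outer.update(f"{name}\0{file_digest}\n".encode("utf-8"))
    return outer.hexdigest()


# Made-up file names and contents, for illustration only.
published = {"tas_example.nc": b"original bytes", "pr_example.nc": b"other bytes"}
silently_replaced = {**published, "tas_example.nc": b"corrected bytes"}

# The two fingerprints differ, so the behind-the-scenes change is detectable.
print(dataset_fingerprint(published) == dataset_fingerprint(silently_replaced))
```

With an identifier of this kind, a file swapped on disk no longer matches the identifier it was published under, which is the detection property described above.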
>>>>> Because of #2 I'm convinced that now is not the time to start building
>>>>> more software to do this.  We have to stabilise the system and learn
>>>>> the lessons of CMIP5 first.
>>>>>
>>>>> Cheers,
>>>>> Stephen.
>>>>>
>>>>>
>>>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>>>
>>>>>> Jamie/All,
>>>>>>
>>>>>> these are important questions I have been wondering about as well;
>>>>> we just had a small internal meeting yesterday with Estani and
>>>>> Martina, so I'll try to sum some points up here. I am not too familiar
>>>>> with the ESG publishing process, so I can only guess that Stephen's #1
>>>>> has something to do with the bending of policies that are for
>>>>> pragmatic reasons not enforced in the CMIP5 process. (My intuition is
>>>>> that *ideally* it should be impossible to make data available without
>>>>> going through the whole publication process. Please correct me if I am
>>>>> misunderstanding this.)
>>>>>> Most of what I have been thinking about however concerns point #2.
>>>>> I'd claim that the risk here should not be underestimated; data
>>>>> consumers being unable to find the data they need is bad ("the
>>>>> advanced search issue"), but users relying on deprecated data - most
>>>>> likely without being aware of it - is certainly dangerous for
>>>>> scientific credibility.
>>>>>> My suggestion to address this problem is to use globally persistent
>>>>> identifiers (PIDs) that are automatically assigned to data objects
>>>>> (and metadata etc.) on ESG-publication; data should ideally not be
>>>>> known by its file name or system-internal ID, but via a global
>>>>> identifier that never changes after it has been published. Of course,
>>>>> this sounds like the DOIs, but these are extremely coarse grained and
>>>>> very static. The idea is to attach identifiers to the low-level
>>>>> entities and provide solutions to build up a hierarchical ID system
>>>>> (virtual collections) to account for the various layers used in our
>>>>> data. Such persistent identifiers should then be placed prominently in
>>>>> any user interface dealing with managed data. The important thing is:
>>>>> If data is updated, we don't update the data behind identifier x, but
>>>>> assign a new identifier y and create a typed link between these two
>>>>> (which may be the most challenging part) and perhaps put a small
>>>>> annotation on x that this data is deprecated. A clever user interface
>>>>> should then redirect a user consistently to the latest version of a
>>>>> dataset if a user accesses the old identifier.
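The update rule described above (never change what identifier x points to; mint y, link x to y, mark x deprecated, and let the interface redirect to the latest version) can be sketched roughly as follows. This is a toy in-memory model, not the Handle System or EPIC API; the class `PidRegistry`, the link names `is_replaced_by` and `replaces`, and the `hypothetical-prefix/` identifier format are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    location: str                       # where the data object lives
    links: dict = field(default_factory=dict)  # typed links to other PIDs
    deprecated: bool = False


class PidRegistry:
    """Toy registry: a published PID always keeps its original location."""

    def __init__(self):
        self._records = {}

    def publish(self, location):
        # Opaque identifier; the prefix is a placeholder, not a real one.
        pid = f"hypothetical-prefix/{uuid.uuid4()}"
        self._records[pid] = Record(location)
        return pid

    def supersede(self, old_pid, new_location):
        old = self._records[old_pid]
        if "is_replaced_by" in old.links:
            raise ValueError(f"{old_pid} has already been superseded")
        new_pid = self.publish(new_location)
        # Typed links in both directions, plus the deprecation annotation.
        old.links["is_replaced_by"] = new_pid
        old.deprecated = True
        self._records[new_pid].links["replaces"] = old_pid
        return new_pid

    def location(self, pid):
        return self._records[pid].location

    def is_deprecated(self, pid):
        return self._records[pid].deprecated

    def resolve_latest(self, pid):
        # Follow is_replaced_by links until the newest version is reached.
        while "is_replaced_by" in self._records[pid].links:
            pid = self._records[pid].links["is_replaced_by"]
        return pid, self._records[pid].location


registry = PidRegistry()
x = registry.publish("node-a/dataset/v1")      # made-up locations
y = registry.supersede(x, "node-a/dataset/v2")
latest_pid, latest_location = registry.resolve_latest(x)
print(registry.is_deprecated(x), latest_pid == y, latest_location)
```

The old identifier still resolves to the old data, so earlier work stays reproducible, while a user interface can call something like `resolve_latest` to point users at the current version.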
>>>>>> This does not make it impossible to use deprecated data, but at
>>>>> least it raises the consumer's awareness of the issue and lowers the
>>>>> barrier to re-retrieve valid data.
>>>>>> As for the point in time; I'd be certain that it is too late now,
>>>>> but it is always a good idea to have plans for future improvement.. :)
>>>>>> Best, Tobias
>>>>>>
>>>>>> On 08.03.2012 13:06, Kettleborough, Jamie wrote:
>>>>>>> Thanks for the replies on this - any other replies are still very
>>>>> welcome.
>>>>>>> Stephen - being selfish - we aren't too worried about 2 as it's less
>>>>> of an issue for us (we do a daily trawl of THREDDS catalogues for new
>>>>> datasets), but I agree it is a problem more generally.  I don't have a
>>>>> feel for which of the problems 1-3 would minimise the risk most if you
>>>>> solved it.  I think making sure new data has a new version is a
>>>>> foundation though.
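One way a daily catalogue trawl could also flag data that changed without a new version is to compare checksums between two consecutive snapshots. The sketch below assumes each snapshot has already been reduced to a mapping from (dataset id, version) to per-file checksums; how that mapping is extracted from the catalogues is not shown, and the function name, dataset ids and checksum strings are invented for illustration.

```python
def find_silent_replacements(previous, current):
    """Return [((dataset_id, version), [changed file names]), ...].

    Both arguments map (dataset_id, version) to a dict {file name: checksum}.
    Only files present in both snapshots under the same dataset id and
    version are compared; added or removed files are ignored in this sketch.
    """
    flagged = []
    for key, old_files in previous.items():
        new_files = current.get(key)
        if new_files is None:
            continue  # dataset version no longer listed; nothing to compare
        changed = sorted(
            name
            for name in old_files.keys() & new_files.keys()
            if old_files[name] != new_files[name]
        )
        if changed:
            flagged.append((key, changed))
    return sorted(flagged)


# Invented snapshots from two consecutive trawls.
yesterday = {
    ("example.model.tas", "v1"): {"tas_a.nc": "aaa", "tas_b.nc": "bbb"},
    ("example.model.pr", "v1"): {"pr_a.nc": "ccc"},
}
today = {
    ("example.model.tas", "v1"): {"tas_a.nc": "aaa", "tas_b.nc": "zzz"},
    ("example.model.pr", "v2"): {"pr_a.nc": "ddd"},
}

for (dataset_id, version), names in find_silent_replacements(yesterday, today):
    print(f"{dataset_id} {version}: changed without a new version: {names}")
```

Here the `tas` dataset is flagged because a checksum changed under the same version, while the `pr` dataset is not, because its change came with a new version.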
>>>>>>> Part of me wonders though whether it's already too late to really do
>>>>> anything with versioning in its current form.  *But* I may be
>>>>> overestimating the size of the problem of new datasets appearing
>>>>> without versions being updated.
>>>>>>> Jamie
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien
>>>>> Denvil
>>>>>>>> Sent: 08 March 2012 10:41
>>>>>>>> To: go-essp-tech at ucar.edu
>>>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>>>>>>> done using 'deprecated' data?
>>>>>>>>
>>>>>>>> Hi Stephen, let me add a third point:
>>>>>>>>
>>>>>>>> 3. Users are aware of new versions but can't download files
>>>>>>>> so as to have a coherent set of files.
>>>>>>>>
>>>>>>>> With respect to that point the p2p transition (especially the
>>>>>>>> attribute caching on the node) will be a major step forward.
>>>>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
>>>>>>>>
>>>>>>>> And I agree with Ashish.
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>> Sébastien
>>>>>>>>
>>>>>>>> On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>> Hi Jamie,
>>>>>>>>>
>>>>>>>>> I can imagine there is a risk of papers being written on
>>>>>>>> deprecated data in two scenarios:
>>>>>>>>>   1. Data is being updated at datanodes without creating a
>>>>>>>> new version
>>>>>>>>>   2. Users are unaware of new versions available and
>>>>>>>> therefore using
>>>>>>>>> deprecated data
>>>>>>>>>
>>>>>>>>> Are you concerned about both of these scenarios?  Your
>>>>>>>> email seems to mainly address #1.
>>>>>>>>> Thanks,
>>>>>>>>> Stephen.
>>>>>>>>>
>>>>>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Does anyone have a feel for the current level of risk that
>>>>>>>> analysts
>>>>>>>>>> are doing work (with the intention to publish) on data
>>>>>>>> that has been
>>>>>>>>>> found to be wrong by the data providers and so deprecated (in
>>>>> some
>>>>>>>>>> sense)?
>>>>>>>>>>
>>>>>>>>>> My feeling is that versioning isn't working (that may be
>>>>>>>> putting it a
>>>>>>>>>> bit strongly).  It is too easy for data providers - in their
>>>>>>>>>> understandable drive to get their data out - to have
>>>>>>>> updated files on
>>>>>>>>>> disk without publishing a new version.   How big a deal does
>>>>> anyone
>>>>>>>>>> think this is?
>>>>>>>>>>
>>>>>>>>>> If the risk that papers are being written based on
>>>>>>>> deprecated data is
>>>>>>>>>> sufficiently large then is there an agreed strategy for
>>>>>>>> coping with
>>>>>>>>>> this?  Does it have implications for the requirements of the
>>>>> data
>>>>>>>>>> publishing/delivery system?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jamie
>>>>>>>>>> _______________________________________________
>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>> --
>>>>>>>> Sébastien Denvil
>>>>>>>> IPSL, Pôle de modélisation du climat
>>>>>>>> UPMC, Case 101, 4 place Jussieu,
>>>>>>>> 75252 Paris Cedex 5
>>>>>>>>
>>>>>>>> Tour 45-55 2ème étage Bureau 209
>>>>>>>> Tel: 33 1 44 27 21 10
>>>>>>>> Fax: 33 1 44 27 39 02
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Tobias Weigel
>>>>>>
>>>>>> Department of Data Management
>>>>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>>>>> Bundesstr. 45a
>>>>>> 20146 Hamburg
>>>>>> Germany
>>>>>>
>>>>>> Tel.: +49 40 460094 104
>>>>>> E-Mail: weigel at dkrz.de
>>>>>> Website: www.dkrz.de
>>>>>>
>>>>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>>>>
>>>>>> Sitz der Gesellschaft: Hamburg
>>>>>> Amtsgericht Hamburg HRB 39784
>>>>>>
>>>>>>
>> ---------------------------------------------------
>> Mark Morgan
>> Software Architect / Engineer
>> Institut Pierre Simon Laplace (IPSL),
>> Université Pierre Marie Curie,
>> 4 Place Jussieu,
>> Tour 45-55, Salle #207,
>> Paris 75005
>> France.
>> Tel : +33 (0) 1 44 27 49 10
>> Email: momipsl at ipsl.jussieu.fr
>> ---------------------------------------------------
>>
>>
>>
>>
> --
> Bryan Lawrence
> University of Reading:  Professor of Weather and Climate Computing.
> National Centre for Atmospheric Science: Director of Models and Data.
> STFC: Director of the Centre for Environmental Data Archival.
> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
>



