[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Fri Mar 9 03:05:16 MST 2012

Stephen,

I wholly agree with you.  Let's get as close to the data as possible. 
The hash is not just a unique value / identifier but also has a deep
semantic meaning directly relating to the data.  The solution here is
what Stephen (and I) have been kicking around.  I suggest a clean
envelope type of structure capturing this notion of mutable and
immutable data.  It is a simpler solution and no service is required to
generate it or verify it... you get all of those things for free,
especially in the context of ESGF where these values are a part of the
index itself.  I am with Stephen on this: Occam's razor cuts the best.
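
A minimal Python sketch of such an envelope, to make the idea concrete.
The names Envelope and content_id, the sample payload, and the choice of
SHA-256 are assumptions for illustration only; none of this is an
existing ESGF interface.  The immutable part is what gets hashed, the
mutable part rides along outside the hash.

```python
import hashlib
from dataclasses import dataclass, field


def content_id(payload: bytes) -> str:
    """Identifier derived purely from the immutable bytes (SHA-256 hex digest)."""
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Envelope:
    """Hypothetical envelope: immutable payload plus freely editable metadata."""
    payload: bytes                                # immutable part, covered by the hash
    mutable: dict = field(default_factory=dict)   # annotations, NOT covered by the hash

    @property
    def identifier(self) -> str:
        return content_id(self.payload)

    def verify(self, claimed_id: str) -> bool:
        """Anyone holding the bytes can check the identifier; no service needed."""
        return self.identifier == claimed_id


# Made-up payload standing in for the bytes of a data file.
env = Envelope(payload=b"example data 280.1 280.4 281.0")
published_id = env.identifier

env.mutable["status"] = "deprecated"   # the mutable part changes, the id does not
assert env.verify(published_id)

changed = Envelope(payload=b"example data 280.1 280.4 999.9")
assert not changed.verify(published_id)   # any change to the data changes the id
```

Generation and verification are the same local computation, which is
why no registration service is needed for either.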

On 3/9/12 1:55 AM, stephen.pascoe at stfc.ac.uk wrote:
> Hi Tobias,
>
> I'm not familiar with EPIC.  Does anyone have a reference?  I assume it's based on registration.  In my view hashing provides a more powerful mechanism for global identifiers if you can create a canonical representation of the entity.  Then it's impossible for the entity to change without the identifier changing instead of relying on some trusted authority to enforce the correspondence.
>
> Last time I talked to DOI people it still wasn't completely sorted out whether a dataset can change without changing the DOI.  Sure it shouldn't but there were examples of landing pages evolving over time.  I can foresee a similar problem with any registration system.
>
> Cheers,
> Stephen.
>
> On 9 Mar 2012, at 09:11, Tobias Weigel wrote:
>
>> I am pretty much thinking about the EPIC infrastructure, and as long as that is not fully ready yet, at least the basic Handle System. ExArch is a very good candidate to explore some ideas and limitations, as well as the German c3grid project (trace provenance information across workflows).
>>
>> On  09.03.2012 10:08:04, Bryan Lawrence wrote:
>>> There are others setting up digital identification services; we should not. E.g. EPIC ... could we use them?
>>> Cheers
>>> Bryan
>>>
>>>> Martin
>>>>
>>>> This problem space seems suitable for EXARCH, i.e. setting up a digital identification service.  This service would be very long term infrastructure and thus would need to scale to several billions of identifiers plus associated metadata references.
>>>>
>>>> Mark
>>>>
>>>>
>>>> On 8 Mar 2012, at 17:18,<martin.juckes at stfc.ac.uk>  <martin.juckes at stfc.ac.uk>  wrote:
>>>>
>>>>> I agree, particularly on the last point.
>>>>>
>>>>> There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree a set of requirements before starting development would be a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.
>>>>>
>>>>> Cheers,
>>>>> Martin
>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
>>>>>>> bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
>>>>>>> Sent: 08 March 2012 16:01
>>>>>>> To: weigel at dkrz.de
>>>>>>> Cc: go-essp-tech at ucar.edu
>>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is done
>>>>>>> using 'deprecated' data?
>>>>>>>
>>>>>>> Hi Thomas,
>>>>>>>
>>>>>>> As you say, it's too late to do much re-engineering of the system now
>>>>>>> -- we've attempted to put in place various identifier systems and none
>>>>>>> of them are working particularly well -- however I think there is
>>>>>>> another perspective to your proposal:
>>>>>>>
>>>>>>> 1. ESG/CMIP5 is deployed globally across multiple administrative
>>>>>>> domains and each domain has the ability to cut corners to get things
>>>>>>> done, e.g. replacing files silently without changing identifiers.
>>>>>>>
>>>>>>> 2. ESG/CMIP5 system is so complex that who'd blame a sys-admin for
>>>>>>> doing #1 to get the data to scientists when they need it.  Any system
>>>>>>> that makes it impossible, or even only difficult, to change the
>>>>>>> underlying data is going to be more complex and difficult to
>>>>>>> administer than a system that doesn't, unless that system was very
>>>>>>> rigorously designed, implemented and tested.
>>>>>>>
>>>>>>> Because of #1 I'm convinced that a fit-for-purpose identifier system
>>>>>>> wouldn't use randomly generated UUIDs but would take the GIT approach
>>>>>>> of hashing invariants of the dataset so that any changes behind the
>>>>>>> scenes can be detected.
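
A rough Python sketch of the hashing approach described above, under
stated assumptions: the "invariants" are taken to be each file's name
and the SHA-256 of its bytes, and the canonical form is a manifest
sorted by name.  The function names and file names are made up; a real
scheme would have to agree on what the invariants actually are.

```python
import hashlib
from typing import Mapping


def file_digest(data: bytes) -> str:
    """SHA-256 hex digest of one file's contents."""
    return hashlib.sha256(data).hexdigest()


def dataset_id(files: Mapping[str, bytes]) -> str:
    """Hash a canonical manifest: one 'digest  name' line per file, sorted by name."""
    lines = [f"{file_digest(data)}  {name}\n" for name, data in sorted(files.items())]
    return hashlib.sha256("".join(lines).encode("utf-8")).hexdigest()


# In-memory stand-ins for the files of one published dataset version.
published = {
    "tas_example_185001-189912.nc": b"original bytes A",
    "tas_example_190001-194912.nc": b"original bytes B",
}
recorded_id = dataset_id(published)

# A file replaced on disk under the same name, with no new version published.
on_disk = dict(published)
on_disk["tas_example_190001-194912.nc"] = b"corrected bytes B"

assert dataset_id(published) == recorded_id   # unchanged data reproduces the id
assert dataset_id(on_disk) != recorded_id     # the silent replacement is detectable
```

Sorting the manifest makes the identifier independent of the order in
which files are listed, so two sites holding the same files compute
the same value.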
>>>>>>>
>>>>>>> Because of #2 I'm convinced that now is not the time to start building
>>>>>>> more software to do this.  We have to stabilise the system and learn
>>>>>>> the lessons of CMIP5 first.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephen.
>>>>>>>
>>>>>>>
>>>>>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>>>>>
>>>>>>>> Jamie/All,
>>>>>>>>
>>>>>>>> these are important questions I have been wondering about as well;
>>>>>>> we just had a small internal meeting yesterday with Estani and
>>>>>>> Martina, so I'll try to sum some points up here. I am not too familiar
>>>>>>> with the ESG publishing process, so I can only guess that Stephen's #1
>>>>>>> has something to do with the bending of policies that are for
>>>>>>> pragmatic reasons not enforced in the CMIP5 process. (My intuition is
>>>>>>> that *ideally* it should be impossible to make data available without
>>>>>>> going through the whole publication process. Please correct me if I am
>>>>>>> misunderstanding this.)
>>>>>>>> Most of what I have been thinking about however concerns point #2.
>>>>>>> I'd claim that the risk here should not be underestimated; data
>>>>>>> consumers being unable to find the data they need is bad ("the
>>>>>>> advanced search issue"), but users relying on deprecated data - most
>>>>>>> likely without being aware of it - is certainly dangerous for
>>>>>>> scientific credibility.
>>>>>>>> My suggestion to address this problem is to use globally persistent
>>>>>>> identifiers (PIDs) that are automatically assigned to data objects
>>>>>>> (and metadata etc.) on ESG-publication; data should ideally not be
>>>>>>> known by its file name or system-internal ID, but via a global
>>>>>>> identifier that never changes after it has been published. Of course,
>>>>>>> this sounds like the DOIs, but these are extremely coarse grained and
>>>>>>> very static. The idea is to attach identifiers to the low-level
>>>>>>> entities and provide solutions to build up a hierarchical ID system
>>>>>>> (virtual collections) to account for the various layers used in our
>>>>>>> data. Such persistent identifiers should then be placed prominently in
>>>>>>> any user interface dealing with managed data. The important thing is:
>>>>>>> If data is updated, we don't update the data behind identifier x, but
>>>>>>> assign a new identifier y and create a typed link between these two
>>>>>>> (which may be the most challenging part) and perhaps put a small
>>>>>>> annotation on x that this data is deprecated. A clever user interface
>>>>>>> should then redirect a user consistently to the latest version of a
>>>>>>> dataset if a user accesses the old identifier.
>>>>>>>> This does not make it impossible to use deprecated data, but at
>>>>>>> least it raises the consumer's awareness of the issue and lowers the
>>>>>>> barrier to re-retrieve valid data.
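
The "identifier x, identifier y, typed link" scheme above can be
sketched in a few lines of Python.  Everything here is hypothetical:
PidRegistry, the "pid:N" format, and the paths are invented for
illustration, and an in-memory dict stands in for whatever resolver
service would really hold the records.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    location: str                         # where this exact version lives
    deprecated: bool = False              # annotation placed on a superseded PID
    superseded_by: Optional[str] = None   # typed link from x to its successor y


class PidRegistry:
    """Hypothetical in-memory registry of persistent identifiers."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}
        self._counter = itertools.count(1)

    def register(self, location: str) -> str:
        pid = f"pid:{next(self._counter)}"
        self._records[pid] = Record(location)
        return pid

    def publish_update(self, old_pid: str, new_location: str) -> str:
        """Leave old_pid pointing at the old data; mint a new PID and link to it."""
        old = self._records[old_pid]
        if old.superseded_by is not None:
            raise ValueError(f"{old_pid} already has a successor")
        new_pid = self.register(new_location)
        old.deprecated = True
        old.superseded_by = new_pid
        return new_pid

    def resolve_latest(self, pid: str) -> str:
        """Follow superseded_by links until the newest version is reached."""
        while self._records[pid].superseded_by is not None:
            pid = self._records[pid].superseded_by
        return pid

    def location(self, pid: str) -> str:
        return self._records[pid].location

    def is_deprecated(self, pid: str) -> bool:
        return self._records[pid].deprecated


registry = PidRegistry()
x = registry.register("/archive/v1/example.nc")
y = registry.publish_update(x, "/archive/v2/example.nc")

assert registry.location(x) == "/archive/v1/example.nc"  # x still names the old data
assert registry.is_deprecated(x)                          # but is flagged as deprecated
assert registry.resolve_latest(x) == y                    # and redirects to the update
```

A user interface would call resolve_latest whenever an old identifier
is presented and show the deprecation flag next to it.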
>>>>>>>> As for the point in time; I'd be certain that it is too late now,
>>>>>>> but it is always a good idea to have plans for future improvement.. :)
>>>>>>>> Best, Tobias
>>>>>>>>
>>>>>>>> Am 08.03.2012 13:06, schrieb Kettleborough, Jamie:
>>>>>>>>> Thanks for the replies on this - any other replies are still very
>>>>>>> welcome.
>>>>>>>>> Stephen - being selfish - we aren't too worried about 2 as it's less
>>>>>>> of an issue for us (we do a daily trawl of thredds catalogues for new
>>>>>>> datasets), but I agree it is a problem more generally.  I don't have a
>>>>>>> feel for which of the problems 1-3 would minimise the risk most if you
>>>>>>> solved it.  I think making sure new data has a new version is a
>>>>>>> foundation though.
>>>>>>>>> Part of me wonders though whether it's already too late to really do
>>>>>>> anything with versioning in its current form.  *But* I may be
>>>>>>> overestimating the size of the problem of new datasets appearing
>>>>>>> without versions being updated.
>>>>>>>>> Jamie
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien
>>>>>>> Denvil
>>>>>>>>>> Sent: 08 March 2012 10:41
>>>>>>>>>> To: go-essp-tech at ucar.edu
>>>>>>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>>>>>>>>> done using 'deprecated' data?
>>>>>>>>>>
>>>>>>>>>> Hi Stephen, let me add a third point:
>>>>>>>>>>
>>>>>>>>>> 3. Users are aware of new versions but can't download files
>>>>>>>>>> so as to have a coherent set of files.
>>>>>>>>>>
>>>>>>>>>> With respect to that point the p2p transition (especially the
>>>>>>>>>> attribute caching on the node) will be a major step forward.
>>>>>>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
>>>>>>>>>>
>>>>>>>>>> And I agree with Ashish.
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>> Sébastien
>>>>>>>>>>
>>>>>>>>>> Le 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk a écrit :
>>>>>>>>>>> Hi Jamie,
>>>>>>>>>>>
>>>>>>>>>>> I can imagine there is a risk of papers being written on
>>>>>>>>>> deprecated data in two scenarios:
>>>>>>>>>>>  1. Data is being updated at datanodes without creating a
>>>>>>>>>> new version
>>>>>>>>>>>  2. Users are unaware of new versions available and
>>>>>>>>>> therefore using
>>>>>>>>>>> deprecated data
>>>>>>>>>>>
>>>>>>>>>>> Are you concerned about both of these scenarios?  Your
>>>>>>>>>> email seems to mainly address #1.
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Stephen.
>>>>>>>>>>>
>>>>>>>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have a feel for the current level of risk that
>>>>>>>>>> analysts
>>>>>>>>>>>> are doing work (with the intention to publish) on data
>>>>>>>>>> that has been
>>>>>>>>>>>> found to be wrong by the data providers and so deprecated (in
>>>>>>> some
>>>>>>>>>>>> sense)?
>>>>>>>>>>>>
>>>>>>>>>>>> My feeling is that versioning isn't working (that may be
>>>>>>>>>> putting it a
>>>>>>>>>>>> bit strongly).  It is too easy for data providers - in their
>>>>>>>>>>>> understandable drive to get their data out - to have
>>>>>>>>>> updated files on
>>>>>>>>>>>> disk without publishing a new version.   How big a deal does
>>>>>>> anyone
>>>>>>>>>>>> think this is?
>>>>>>>>>>>>
>>>>>>>>>>>> If the risk that papers are being written based on
>>>>>>>>>> deprecated data is
>>>>>>>>>>>> sufficiently large then is there an agreed strategy for
>>>>>>>>>> coping with
>>>>>>>>>>>> this?  Does it have implications for the requirements of the
>>>>>>> data
>>>>>>>>>>>> publishing/delivery system?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Jamie
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>> --
>>>>>>>>>> Sébastien Denvil
>>>>>>>>>> IPSL, Pôle de modélisation du climat
>>>>>>>>>> UPMC, Case 101, 4 place Jussieu,
>>>>>>>>>> 75252 Paris Cedex 5
>>>>>>>>>>
>>>>>>>>>> Tour 45-55 2ème étage Bureau 209
>>>>>>>>>> Tel: 33 1 44 27 21 10
>>>>>>>>>> Fax: 33 1 44 27 39 02
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Tobias Weigel
>>>>>>>>
>>>>>>>> Department of Data Management
>>>>>>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>>>>>>> Bundesstr. 45a
>>>>>>>> 20146 Hamburg
>>>>>>>> Germany
>>>>>>>>
>>>>>>>> Tel.: +49 40 460094 104
>>>>>>>> E-Mail: weigel at dkrz.de
>>>>>>>> Website: www.dkrz.de
>>>>>>>>
>>>>>>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>>>>>>
>>>>>>>> Sitz der Gesellschaft: Hamburg
>>>>>>>> Amtsgericht Hamburg HRB 39784
>>>>>>>>
>>>>>>>>
>>>> ---------------------------------------------------
>>>> Mark Morgan
>>>> Software Architect / Engineer
>>>> Institut Pierre Simon Laplace (IPSL),
>>>> Université Pierre Marie Curie,
>>>> 4 Place Jussieu,
>>>> Tour 45-55, Salle #207,
>>>> Paris 75005
>>>> France.
>>>> Tel : +33 (0) 1 44 27 49 10
>>>> Email: momipsl at ipsl.jussieu.fr
>>>> ---------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>> --
>>> Bryan Lawrence
>>> University of Reading:  Professor of Weather and Climate Computing.
>>> National Centre for Atmospheric Science: Director of Models and Data.
>>> STFC: Director of the Centre for Environmental Data Archival.
>>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
>>>
>>

-- 
Gavin M. Bell
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo
