[Go-essp-tech] What is the risk that science is done using 'deprecated' data?
Bryan Lawrence
bryan.lawrence at ncas.ac.uk
Fri Mar 9 01:54:39 MST 2012
There are others setting up digital identification services, so we should not. E.g. EPIC ... could we use them?
Cheers
Bryan
> Martin
>
> This problem space seems suitable for EXARCH, i.e. setting up a digital identification service. This service would be very long-term infrastructure and thus would need to scale to several billion identifiers plus associated metadata references.
>
> Mark
>
>
> On 8 Mar 2012, at 17:18, <martin.juckes at stfc.ac.uk> <martin.juckes at stfc.ac.uk> wrote:
>
> > I agree, particularly on the last point.
> >
> > There are a lot of things which could be improved. From a software developer's point of view, getting the data providers and data users to agree on a set of requirements before starting development would be a good idea -- but we obviously missed the chance to do that, if it ever existed, by several years.
> >
> > Cheers,
> > Martin
> >
> >>> -----Original Message-----
> >>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
> >>> bounces at ucar.edu] On Behalf Of stephen.pascoe at stfc.ac.uk
> >>> Sent: 08 March 2012 16:01
> >>> To: weigel at dkrz.de
> >>> Cc: go-essp-tech at ucar.edu
> >>> Subject: Re: [Go-essp-tech] What is the risk that science is done
> >>> using 'deprecated' data?
> >>>
> >>> Hi Thomas,
> >>>
> >>> As you say, it's too late to do much re-engineering of the system now
> >>> -- we've attempted to put in place various identifier systems and none
> >>> of them are working particularly well -- however I think there is
> >>> another perspective to your proposal:
> >>>
> >>> 1. ESG/CMIP5 is deployed globally across multiple administrative
> >>> domains and each domain has the ability to cut corners to get things
> >>> done, e.g. replacing files silently without changing identifiers.
> >>>
> >>> 2. The ESG/CMIP5 system is so complex that who'd blame a sys-admin for
> >>> doing #1 to get the data to scientists when they need it? Any system
> >>> that makes it impossible, or even only difficult, to change the
> >>> underlying data is going to be more complex and difficult to
> >>> administer than a system that doesn't, unless that system was very
> >>> rigorously designed, implemented and tested.
> >>>
> >>> Because of #1 I'm convinced that a fit-for-purpose identifier system
> >>> wouldn't use randomly generated UUIDs but would take the Git approach
> >>> of hashing invariants of the dataset so that any changes behind the
> >>> scenes can be detected.
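A minimal sketch of the hashing idea described above, assuming the "invariants" of a dataset are its file names and per-file checksums. The thread does not say which invariants would be used, and the function names `file_checksum` and `dataset_id` are hypothetical:

```python
import hashlib


def file_checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of one file's contents."""
    return hashlib.sha256(data).hexdigest()


def dataset_id(file_checksums: dict[str, str]) -> str:
    """Derive a dataset identifier from its invariants.

    Assumption: the invariants are the file names and their checksums.
    Entries are hashed in sorted order so the identifier does not depend
    on the order in which files were listed.
    """
    h = hashlib.sha256()
    for name in sorted(file_checksums):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator between name and checksum
        h.update(file_checksums[name].encode("utf-8"))
        h.update(b"\n")  # separator between entries
    return h.hexdigest()


# Usage: a silent file replacement changes the derived identifier.
published = {
    "tas.nc": file_checksum(b"original tas data"),
    "pr.nc": file_checksum(b"original pr data"),
}
on_disk = dict(published, **{"tas.nc": file_checksum(b"silently replaced tas data")})

id_published = dataset_id(published)
id_on_disk = dataset_id(on_disk)
```

Comparing `id_published` with `id_on_disk` is enough to detect that something changed behind the scenes, which a randomly generated UUID cannot do.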
> >>>
> >>> Because of #2 I'm convinced that now is not the time to start building
> >>> more software to do this. We have to stabilise the system and learn
> >>> the lessons of CMIP5 first.
> >>>
> >>> Cheers,
> >>> Stephen.
> >>>
> >>>
> >>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
> >>>
> >>>> Jamie/All,
> >>>>
> >>>> these are important questions I have been wondering about as well;
> >>> we just had a small internal meeting yesterday with Estani and
> >>> Martina, so I'll try to sum some points up here. I am not too familiar
> >>> with the ESG publishing process, so I can only guess that Stephen's #1
> >>> has something to do with the bending of policies that are for
> >>> pragmatic reasons not enforced in the CMIP5 process. (My intuition is
> >>> that *ideally* it should be impossible to make data available without
> >>> going through the whole publication process. Please correct me if I am
> >>> misunderstanding this.)
> >>>>
> >>>> Most of what I have been thinking about however concerns point #2.
> >>> I'd claim that the risk here should not be underestimated; data
> >>> consumers being unable to find the data they need is bad ("the
> >>> advanced search issue"), but users relying on deprecated data - most
> >>> likely without being aware of it - is certainly dangerous for
> >>> scientific credibility.
> >>>> My suggestion to address this problem is to use globally persistent
> >>> identifiers (PIDs) that are automatically assigned to data objects
> >>> (and metadata etc.) on ESG-publication; data should ideally not be
> >>> known by its file name or system-internal ID, but via a global
> >>> identifier that never changes after it has been published. Of course,
> >>> this sounds like the DOIs, but these are extremely coarse grained and
> >>> very static. The idea is to attach identifiers to the low-level
> >>> entities and provide solutions to build up a hierarchical ID system
> >>> (virtual collections) to account for the various layers used in our
> >>> data. Such persistent identifiers should then be placed prominently in
> >>> any user interface dealing with managed data. The important thing is:
> >>> If data is updated, we don't update the data behind identifier x, but
> >>> assign a new identifier y and create a typed link between these two
> >>> (which may be the most challenging part) and perhaps put a small
> >>> annotation on x that this data is deprecated. A clever user interface
> >>> should then redirect a user consistently to the latest version of a
> >>> dataset if a user accesses the old identifier.
> >>>> This does not make it impossible to use deprecated data, but at
> >>> least it raises the consumer's awareness of the issue and lowers the
> >>> barrier to re-retrieve valid data.
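The update rule proposed above (never change the data behind identifier x; assign a new identifier y, add a typed link, and mark x as deprecated) could be sketched roughly as follows. This is an in-memory illustration only: the class names, the link types `is_replaced_by` and `replaces`, and the use of `uuid4` to mint identifiers are all assumptions, not part of any existing ESG or PID service.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    pid: str
    location: str
    deprecated: bool = False
    # Typed links: link type -> target PID (link type names are hypothetical).
    links: dict = field(default_factory=dict)


class PidRegistry:
    """Toy registry: a published record is never overwritten, only superseded."""

    def __init__(self):
        self._records = {}

    def publish(self, location, replaces=None):
        """Assign a new PID; optionally mark an older PID as replaced by it."""
        pid = str(uuid.uuid4())  # how the PID string is minted is an assumption
        record = Record(pid, location)
        self._records[pid] = record
        if replaces is not None:
            old = self._records[replaces]
            old.deprecated = True
            old.links["is_replaced_by"] = pid
            record.links["replaces"] = replaces
        return pid

    def get(self, pid):
        """Return the record exactly as published, even if it is deprecated."""
        return self._records[pid]

    def resolve_latest(self, pid):
        """Follow 'is_replaced_by' links until the newest version is reached."""
        record = self._records[pid]
        while "is_replaced_by" in record.links:
            record = self._records[record.links["is_replaced_by"]]
        return record


# Usage: the old identifier still resolves, but points the user onward.
registry = PidRegistry()
x = registry.publish("/data/tas_v1.nc")
y = registry.publish("/data/tas_v2.nc", replaces=x)
latest = registry.resolve_latest(x)
```

A user interface built on something like `resolve_latest` can keep the old identifier usable while still steering the user to the current version.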
> >>>>
> >>>> As for the point in time; I'd be certain that it is too late now,
> >>> but it is always a good idea to have plans for future improvement.. :)
> >>>>
> >>>> Best, Tobias
> >>>>
> >>>> Am 08.03.2012 13:06, schrieb Kettleborough, Jamie:
> >>>>> Thanks for the replies on this - any other replies are still very
> >>> welcome.
> >>>>>
> >>>>> Stephen - being selfish - we aren't too worried about 2 as it's less
> >>> of an issue for us (we do a daily trawl of THREDDS catalogues for new
> >>> datasets), but I agree it is a problem more generally. I don't have a
> >>> feel for which of the problems 1-3 would minimise the risk most if you
> >>> solved it. I think making sure new data has a new version is a
> >>> foundation though.
> >>>>>
> >>>>> Part of me wonders though whether it's already too late to really do
> >>> anything with versioning in its current form. *But* I may be
> >>> overestimating the size of the problem of new datasets appearing
> >>> without versions being updated.
> >>>>>
> >>>>> Jamie
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: go-essp-tech-bounces at ucar.edu
> >>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien
> >>> Denvil
> >>>>>> Sent: 08 March 2012 10:41
> >>>>>> To: go-essp-tech at ucar.edu
> >>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
> >>>>>> done using 'deprecated' data?
> >>>>>>
> >>>>>> Hi Stephen, let me add a third point:
> >>>>>>
> >>>>>> 3. Users are aware of new versions but can't download files
> >>>>>> so as to have a coherent set of files.
> >>>>>>
> >>>>>> With respect to that point the p2p transition (especially the
> >>>>>> attribute caching on the node) will be a major step forward.
> >>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
> >>>>>>
> >>>>>> And I agree with Ashish.
> >>>>>>
> >>>>>> Regards.
> >>>>>> Sébastien
> >>>>>>
> >>>>>> Le 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk a écrit :
> >>>>>>> Hi Jamie,
> >>>>>>>
> >>>>>>> I can imagine there is a risk of papers being written on
> >>>>>> deprecated data in two scenarios:
> >>>>>>> 1. Data is being updated at datanodes without creating a
> >>>>>> new version
> >>>>>>> 2. Users are unaware of new versions available and
> >>>>>> therefore using
> >>>>>>> deprecated data
> >>>>>>>
> >>>>>>> Are you concerned about both of these scenarios? Your
> >>>>>> email seems to mainly address #1.
> >>>>>>> Thanks,
> >>>>>>> Stephen.
> >>>>>>>
> >>>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> Does anyone have a feel for the current level of risk that
> >>>>>> analysts
> >>>>>>>> are doing work (with the intention to publish) on data
> >>>>>> that has been
> >>>>>>>> found to be wrong by the data providers and so deprecated (in
> >>> some
> >>>>>>>> sense)?
> >>>>>>>>
> >>>>>>>> My feeling is that versioning isn't working (that may be
> >>>>>> putting it a
> >>>>>>>> bit strongly). It is too easy for data providers - in their
> >>>>>>>> understandable drive to get their data out - to have
> >>>>>> updated files on
> >>>>>>>> disk without publishing a new version. How big a deal does
> >>> anyone
> >>>>>>>> think this is?
> >>>>>>>>
> >>>>>>>> If the risk that papers are being written based on
> >>>>>> deprecated data is
> >>>>>>>> sufficiently large then is there an agreed strategy for
> >>>>>> coping with
> >>>>>>>> this? Does it have implications for the requirements of the
> >>> data
> >>>>>>>> publishing/delivery system?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jamie
> >>>>>>>> _______________________________________________
> >>>>>>>> GO-ESSP-TECH mailing list
> >>>>>>>> GO-ESSP-TECH at ucar.edu
> >>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>>>
> >>>>>> --
> >>>>>> Sébastien Denvil
> >>>>>> IPSL, Pôle de modélisation du climat
> >>>>>> UPMC, Case 101, 4 place Jussieu,
> >>>>>> 75252 Paris Cedex 5
> >>>>>>
> >>>>>> Tour 45-55 2ème étage Bureau 209
> >>>>>> Tel: 33 1 44 27 21 10
> >>>>>> Fax: 33 1 44 27 39 02
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Tobias Weigel
> >>>>
> >>>> Department of Data Management
> >>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
> >>>> Bundesstr. 45a
> >>>> 20146 Hamburg
> >>>> Germany
> >>>>
> >>>> Tel.: +49 40 460094 104
> >>>> E-Mail: weigel at dkrz.de
> >>>> Website: www.dkrz.de
> >>>>
> >>>> Managing Director: Prof. Dr. Thomas Ludwig
> >>>>
> >>>> Sitz der Gesellschaft: Hamburg
> >>>> Amtsgericht Hamburg HRB 39784
> >>>>
> >>>>
> >>>
>
> ---------------------------------------------------
> Mark Morgan
> Software Architect / Engineer
> Institut Pierre Simon Laplace (IPSL),
> Université Pierre Marie Curie,
> 4 Place Jussieu,
> Tour 45-55, Salle #207,
> Paris 75005
> France.
> Tel : +33 (0) 1 44 27 49 10
> Email: momipsl at ipsl.jussieu.fr
> ---------------------------------------------------
>
>
>
>
--
Bryan Lawrence
University of Reading: Professor of Weather and Climate Computing.
National Centre for Atmospheric Science: Director of Models and Data.
STFC: Director of the Centre for Environmental Data Archival.
Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence