[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Gavin M. Bell gavin at llnl.gov
Mon Mar 12 10:18:33 MDT 2012


:-)

I smell an O'Reilly book coming together here... you've got very cool
buzz terms... very marketable.
A lot more useless things have been wrapped in nice jargon and sold 
(e.g. REST).
Hmmm....

:-)

On 3/12/12 3:58 AM, Mark Morgan wrote:
> @balaji  i.e. resolution of a dataset's "Atman" requires developing a
> suitable alembic (http://en.wikipedia.org/wiki/Alembic)
>
>
> On 12 Mar 2012, at 11:55, Mark Morgan wrote:
>
>> Hi
>>
>> In relation to CMIP6+ there really needs to be a stronger focus on
>> _process_, i.e. what is the development process for resolving these
>> kinds of problems.  
>>
>> For this particular problem I am thinking particularly of test driven
>> development, i.e. after a _formal definition_ of the problem space,
>> develop a _test framework_ for testing possible solutions prior to
>> trying to implement a solution.  This will ensure that you have
>> understood the problem space correctly whilst guaranteeing the
>> robustness of potential solution(s).
>>
>> Mark  
>>
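
A minimal sketch of the test-first process described above, in Python.
Everything in it is hypothetical and for illustration only: the record
fields, the three requirements and the two candidate functions are
assumptions, not anything agreed on this list. The requirements are
written as executable checks first, and candidate solutions are then run
against them.

```python
import hashlib
import json

# Hypothetical record for one published file (field names are assumptions).
BASE = {"filename": "tas.nc", "checksum": "abc123",
        "url": "http://node-a.example/tas.nc"}


def requirement_failures(identifier_fn):
    """Return the names of the requirements that identifier_fn does not meet."""
    moved = dict(BASE, url="http://node-b.example/tas.nc")  # same data, new location
    replaced = dict(BASE, checksum="def456")                # data silently replaced
    failures = []
    if identifier_fn(BASE) != identifier_fn(dict(BASE)):
        failures.append("deterministic")
    if identifier_fn(BASE) != identifier_fn(moved):
        failures.append("stable_when_only_location_changes")
    if identifier_fn(BASE) == identifier_fn(replaced):
        failures.append("changes_when_data_changes")
    return failures


def naive_identifier(record):
    """Candidate 1: hash the whole record, location included."""
    text = json.dumps(record, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def content_identifier(record):
    """Candidate 2: hash only the filename and the file checksum."""
    text = f"{record['filename']}:{record['checksum']}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print("naive:", requirement_failures(naive_identifier))
    print("content:", requirement_failures(content_identifier))
```

The first candidate fails the location requirement and the second passes
all three, which is the kind of feedback the test framework is meant to
give before anything is implemented for real.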
>>
>> On 12 Mar 2012, at 11:41, Tobias Weigel wrote:
>>
>>> I'd be very much interested in such a discussion in ExArch, not just
>>> because it provides a sane hashing methodology, but also because
>>> this 'dataset essence' has a large overlap with information I would
>>> feel is useful to attach directly to persistent identifiers. Might
>>> even be exactly that, but might be a bit larger.
>>>
>>> Best, Tobias
>>>
>>> On  12.03.2012 11:17:31, V Balaji wrote:
>>>> I like the idea -- in the CMIP6 timeframe, as Estani reminds us:-) --
>>>> of compiling a list of invariants and things about a dataset that can
>>>> change without the underlying data changing. We have discussed in the
>>>> past with Unidata an nc_chksum capability that can hash or sum
>>>> specific data records for comparison, so that we can omit superficial
>>>> changes from a sum. Remik Ziemlinski of GFDL implemented nccmp
>>>> (http://nccmp.sourceforge.net) that allows some of this capability,
>>>> but it properly belongs in the netCDF base libraries.
>>>>
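
A rough Python sketch of a checksum that sums only the data records and
omits superficial changes. It does not use the netCDF library or nccmp;
the dataset is modelled as a plain dict with "attributes" and
"variables" keys, which is an assumption made purely for illustration.

```python
import hashlib
import struct


def data_checksum(dataset):
    """Hash variable names and values only; attributes are ignored."""
    digest = hashlib.sha256()
    for name in sorted(dataset["variables"]):  # fixed order, independent of dict order
        values = dataset["variables"][name]
        digest.update(name.encode("utf-8"))
        digest.update(struct.pack(">I", len(values)))            # record length
        digest.update(struct.pack(f">{len(values)}d", *values))  # big-endian doubles
    return digest.hexdigest()


# Hypothetical datasets: same data with a new history attribute, then changed data.
original = {"attributes": {"history": "created 2012-03-01"},
            "variables": {"tas": [280.1, 281.4]}}
touched = {"attributes": {"history": "re-tagged 2012-03-09"},
           "variables": {"tas": [280.1, 281.4]}}
modified = {"attributes": {"history": "created 2012-03-01"},
            "variables": {"tas": [280.1, 281.5]}}

if __name__ == "__main__":
    print(data_checksum(original) == data_checksum(touched))   # attribute-only change
    print(data_checksum(original) == data_checksum(modified))  # data change
```

A whole-file checksum would differ for all three files here; the
data-only sum differs only where a value actually changed.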
>>>> Happy to discuss this within ExArch as you suggest. It's taking us
>>>> deep into metaphysical territory: a hash representation of the
>>>> Platonic essence, the Atman, the soul of a dataset.
>>>>
>>>> On Fri, Mar 9, 2012 at 2:26 AM, <stephen.pascoe at stfc.ac.uk> wrote:
>>>>> Hi Gavin,
>>>>>
>>>>> That would definitely help but I don't think it's sufficient.  How
>>>>> many of us would notice if a centre republished the same dataset
>>>>> (same dataset_id and facet metadata) with different checksums?
>>>>> Estani would, I expect :-) but the system itself wouldn't.
>>>>>
>>>>> I would like to see a hash of invariants of each dataset used as
>>>>> identifiers.  For that we'd need to strip-out all the information
>>>>> from a THREDDS catalog which might legitimately change without
>>>>> changing the data: URL paths, service endpoints, last-modified,
>>>>> etc., but keeping filenames, checksums and some properties.
>>>>> Canonicalise a serialisation, then generate a hash.
>>>>>
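
The steps above (strip what may legitimately change, canonicalise, hash)
could look roughly like this in Python. This is a sketch under stated
assumptions: the catalog entry is a list of plain dicts rather than real
THREDDS XML, and the field names and the choice of which fields count as
volatile are hypothetical.

```python
import hashlib
import json

# Hypothetical names for fields that may change without the data changing.
VOLATILE_FIELDS = {"url_path", "service_endpoint", "last_modified"}


def invariant_hash(dataset_id, files):
    """Hash only the invariant parts of a dataset's catalog entry."""
    kept = [
        {key: value for key, value in entry.items() if key not in VOLATILE_FIELDS}
        for entry in files
    ]
    kept.sort(key=lambda entry: entry["filename"])  # file order must not matter
    # Canonical serialisation: sorted keys and fixed separators.
    canonical = json.dumps({"dataset_id": dataset_id, "files": kept},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical catalog entries for one dataset.
published = [{"filename": "tas.nc", "checksum": "abc123",
              "url_path": "/thredds/a/tas.nc", "last_modified": "2012-03-01"}]
moved = [{"filename": "tas.nc", "checksum": "abc123",
          "url_path": "/thredds/b/tas.nc", "last_modified": "2012-03-09"}]
replaced = [{"filename": "tas.nc", "checksum": "def456",
             "url_path": "/thredds/a/tas.nc", "last_modified": "2012-03-01"}]

if __name__ == "__main__":
    dataset_id = "example.dataset.v1"  # hypothetical identifier
    print(invariant_hash(dataset_id, published) == invariant_hash(dataset_id, moved))
    print(invariant_hash(dataset_id, published) == invariant_hash(dataset_id, replaced))
```

Moving the file leaves the hash unchanged, while republishing under the
same dataset_id with a different file checksum changes it, so the
system itself could notice the second case.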
>>>>> We'd also need to really keep track of these hashes.  We have
>>>>> checksums and tracking_ids right now and are under-utilising them.
>>>>>
>>>>> Cheers,
>>>>> Stephen.
>>>>>
>>>>> On 9 Mar 2012, at 05:05, Gavin M. Bell wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> If we enforced checksums to be done as a part of publication, then
>>>>> this would address this issue, right?
>>>>>
>>>>>
>>>>> On 3/8/12 8:39 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>>>
>>>>> Tobias, sorry I mistyped your name :-)
>>>>> S.
>>>>>
>>>>> On 8 Mar 2012, at 16:00, <stephen.pascoe at stfc.ac.uk> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> As you say, it's too late to do much re-engineering of the system
>>>>> now -- we've attempted to put in place various identifier systems
>>>>> and none of them are working particularly well -- however I think
>>>>> there is another perspective to your proposal:
>>>>>
>>>>> 1. ESG/CMIP5 is deployed globally across multiple administrative
>>>>> domains and each domain has the ability to cut corners to get
>>>>> things done, e.g. replacing files silently without changing
>>>>> identifiers.
>>>>>
>>>>> 2. The ESG/CMIP5 system is so complex that who'd blame a sys-admin
>>>>> for doing #1 to get the data to scientists when they need it?  Any
>>>>> system that makes it impossible, or even only difficult, to change
>>>>> the underlying data is going to be more complex and difficult to
>>>>> administer than a system that doesn't, unless that system was very
>>>>> rigorously designed, implemented and tested.
>>>>>
>>>>> Because of #1 I'm convinced that a fit-for-purpose identifier
>>>>> system wouldn't use randomly generated UUIDs but would take the
>>>>> Git approach of hashing invariants of the dataset so that any
>>>>> changes behind the scenes can be detected.
>>>>>
>>>>> Because of #2 I'm convinced that now is not the time to start
>>>>> building more software to do this.  We have to stabilise the
>>>>> system and learn the lessons of CMIP5 first.
>>>>>
>>>>> Cheers,
>>>>> Stephen.
>>>>>
>>>>>
>>>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>>>
>>>>>
>>>>>
>>>>> Jamie/All,
>>>>>
>>>>> these are important questions I have been wondering about as well;
>>>>> we just had a small internal meeting yesterday with Estani and
>>>>> Martina, so I'll try to sum some points up here. I am not too
>>>>> familiar with the ESG publishing process, so I can only guess that
>>>>> Stephen's #1 has something to do with the bending of policies that
>>>>> are for pragmatic reasons not enforced in the CMIP5 process. (My
>>>>> intuition is that *ideally* it should be impossible to make data
>>>>> available without going through the whole publication process.
>>>>> Please correct me if I am misunderstanding this.)
>>>>>
>>>>> Most of what I have been thinking about however concerns point #2.
>>>>> I'd claim that the risk here should not be underestimated; data
>>>>> consumers being unable to find the data they need is bad ("the
>>>>> advanced search issue"), but users relying on deprecated data -
>>>>> most likely without being aware of it - is certainly dangerous for
>>>>> scientific credibility.
>>>>> My suggestion to address this problem is to use globally
>>>>> persistent identifiers (PIDs) that are automatically assigned to
>>>>> data objects (and metadata etc.) on ESG-publication; data should
>>>>> ideally not be known by its file name or system-internal ID, but
>>>>> via a global identifier that never changes after it has been
>>>>> published. Of course, this sounds like the DOIs, but these are
>>>>> extremely coarse grained and very static. The idea is to attach
>>>>> identifiers to the low-level entities and provide solutions to
>>>>> build up a hierarchical ID system (virtual collections) to account
>>>>> for the various layers used in our data. Such persistent
>>>>> identifiers should then be placed prominently in any user
>>>>> interface dealing with managed data. The important thing is: If
>>>>> data is updated, we don't update the data behind identifier x, but
>>>>> assign a new identifier y and create a typed link between these
>>>>> two (which may be the most challenging part) and perhaps put a
>>>>> small annotation on x that this data is deprecated.
>>>>> A clever user interface should then redirect a user
>>>>> consistently to the latest version of a dataset if a user accesses
>>>>> the old identifier.
>>>>> This does not make it impossible to use deprecated data, but at
>>>>> least it raises the consumer's awareness of the issue and lowers
>>>>> the barrier to re-retrieve valid data.
>>>>>
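
A minimal Python sketch of the identifier behaviour proposed above:
an update gets a new identifier, a typed link joins old and new, the old
record is annotated as deprecated, and a lookup of the old identifier is
redirected to the latest version. The class, the link names
("isReplacedBy", "replaces") and the counter-based identifier scheme are
hypothetical and only there to make the example run.

```python
class PidRegistry:
    """Toy in-memory registry; a real PID system would be a persistent service."""

    def __init__(self):
        self._records = {}
        self._counter = 0

    def publish(self, location, replaces=None):
        """Assign a new identifier; optionally mark an older one as replaced."""
        self._counter += 1
        pid = f"pid-{self._counter}"  # hypothetical identifier scheme
        self._records[pid] = {"location": location, "deprecated": False, "links": {}}
        if replaces is not None:
            old = self._records[replaces]
            old["deprecated"] = True              # annotation on x
            old["links"]["isReplacedBy"] = pid    # typed link x -> y
            self._records[pid]["links"]["replaces"] = replaces
        return pid

    def is_deprecated(self, pid):
        return self._records[pid]["deprecated"]

    def resolve(self, pid):
        """Return (latest_pid, location, redirected) for any known identifier."""
        current = pid
        while "isReplacedBy" in self._records[current]["links"]:
            current = self._records[current]["links"]["isReplacedBy"]
        return current, self._records[current]["location"], current != pid


if __name__ == "__main__":
    registry = PidRegistry()
    x = registry.publish("tas_v1.nc")
    y = registry.publish("tas_v2.nc", replaces=x)
    print(registry.resolve(x))  # redirected to y, with redirected=True
    print(registry.is_deprecated(x))
```

The old identifier x still resolves, so existing references do not
break, but the redirected flag gives a user interface what it needs to
warn the consumer.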
>>>>> As for the point in time; I'd be certain that it is too late now,
>>>>> but it is always a good idea to have plans for future improvement.. :)
>>>>>
>>>>> Best, Tobias
>>>>>
>>>>> On 08.03.2012 13:06, Kettleborough, Jamie wrote:
>>>>>
>>>>>
>>>>> Thanks for the replies on this - any other replies are still very
>>>>> welcome.
>>>>>
>>>>> Stephen - being selfish - we aren't too worried about 2 as it's
>>>>> less of an issue for us (we do a daily trawl of THREDDS catalogues
>>>>> for new datasets), but I agree it is a problem more generally.  I
>>>>> don't have a feel for which of the problems 1-3 would minimise the
>>>>> risk most if you solved it.  I think making sure new data has a
>>>>> new version is a foundation though.
>>>>>
>>>>> Part of me wonders though whether it's already too late to really
>>>>> do anything with versioning in its current form.  *But* I may be
>>>>> overestimating the size of the problem of new datasets appearing
>>>>> without versions being updated.
>>>>>
>>>>> Jamie
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: go-essp-tech-bounces at ucar.edu On Behalf Of Sébastien Denvil
>>>>> Sent: 08 March 2012 10:41
>>>>> To: go-essp-tech at ucar.edu
>>>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>>>> done using 'deprecated' data?
>>>>>
>>>>> Hi Stephen, let me add a third point:
>>>>>
>>>>> 3. Users are aware of new versions but can't download files
>>>>> so as to have a coherent set of files.
>>>>>
>>>>> With respect to that point the p2p transition (especially the
>>>>> attribute caching on the node) will be a major step forward.
>>>>> GFDL just upgraded and we have an amazing success rate of 98%.
>>>>>
>>>>> And I agree with Ashish.
>>>>>
>>>>> Regards.
>>>>> Sébastien
>>>>>
>>>>> On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
>>>>>
>>>>>
>>>>> Hi Jamie,
>>>>>
>>>>> I can imagine there is a risk of papers being written on
>>>>> deprecated data in two scenarios:
>>>>>
>>>>>  1. Data is being updated at datanodes without creating a new
>>>>>     version
>>>>>  2. Users are unaware of new versions available and therefore
>>>>>     using deprecated data
>>>>>
>>>>> Are you concerned about both of these scenarios?  Your email seems
>>>>> to mainly address #1.
>>>>>
>>>>> Thanks,
>>>>> Stephen.
>>>>>
>>>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> Does anyone have a feel for the current level of risk that analysts
>>>>> are doing work (with the intention to publish) on data that has been
>>>>> found to be wrong by the data providers and so deprecated (in some
>>>>> sense)?
>>>>>
>>>>> My feeling is that versioning isn't working (that may be putting it a
>>>>> bit strongly).  It is too easy for data providers - in their
>>>>> understandable drive to get their data out - to have updated files on
>>>>> disk without publishing a new version.   How big a deal does anyone
>>>>> think this is?
>>>>>
>>>>> If the risk that papers are being written based on deprecated data is
>>>>> sufficiently large then is there an agreed strategy for coping with
>>>>> this?  Does it have implications for the requirements of the data
>>>>> publishing/delivery system?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jamie
>>>>> _______________________________________________
>>>>> GO-ESSP-TECH mailing list
>>>>> GO-ESSP-TECH at ucar.edu
>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>
>>>>>
>>>>> --
>>>>> Sébastien Denvil
>>>>> IPSL, Pôle de modélisation du climat
>>>>> UPMC, Case 101, 4 place Jussieu,
>>>>> 75252 Paris Cedex 5
>>>>>
>>>>> Tour 45-55 2ème étage Bureau 209
>>>>> Tel: 33 1 44 27 21 10
>>>>> Fax: 33 1 44 27 39 02
>>>>>
>>>>>
>>>>> --
>>>>> Tobias Weigel
>>>>>
>>>>> Department of Data Management
>>>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>>>> Bundesstr. 45a
>>>>> 20146 Hamburg
>>>>> Germany
>>>>>
>>>>> Tel.: +49 40 460094 104
>>>>> E-Mail: weigel at dkrz.de
>>>>> Website: www.dkrz.de
>>>>>
>>>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>>>
>>>>> Registered office: Hamburg
>>>>> Registered at Amtsgericht Hamburg, HRB 39784
>>>>>
>>>>> --
>>>>> Gavin M. Bell
>>>>> --
>>>>>
>>>>>  "Never mistake a clear view for a short distance."
>>>>>               -Paul Saffo
>>>>>
>>>>
>>>>
>>>
>>>
>>
>> ---------------------------------------------------
>> Mark Morgan
>> Software Architect / Engineer
>> Institut Pierre Simon Laplace (IPSL),
>> Université Pierre Marie Curie,
>> 4 Place Jussieu,
>> Tour 45-55, Salle #207,
>> Paris 75005
>> France.
>> Tel : +33 (0) 1 44 27 49 10
>> Email: momipsl at ipsl.jussieu.fr
>> ---------------------------------------------------
>>
>

-- 
Gavin M. Bell
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

