[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Mon Mar 12 15:34:02 MDT 2012

 Hi,

IMHO, I think we simply need to discuss a quick cost/benefit analysis of
this endeavor.  Why do we need this grammar? What does it enable? Why is
that important? And then, how would we go about doing it, and what
resources are involved?  Basically, is the juice worth the squeeze?  The
basic thinking I have on this is that it would be a good thing to know
the shape of the catalog in so far as the shape of it will enable new
capabilities.  If the benefit of knowing the shape of this box "only"
makes life easier for those doing machinations at that level... then...
well, that is what we should find out.  Is it on fire? Or is this a
Cassandra moment - as I have been having on this topic for over a year
and a half.  Thus far it has not moved into the realm of "on fire" which
leads me to perhaps believe that this kind of thing is best left to a
select and motivated few that should go off and fix it and then report
back so we all can leverage it.  I don't think there needs to be a lot
of sausage makers here.  This is why I proposed that Stephen start a
working group if one is not already there. 

Oops.. that reminds me, making a working group page fell off my radar
after the F2F... I will do that.  For now I recommend appending this
task to that of the Information Architecture (Data Model) Interface
Group that Stephen is already the lead on. :-) How serendipitous. :-)

http://esgf.org/wiki/ESGFInterfaceGroups
http://esgf.org/wiki/ESGFInterfaceGroups/InformationArchitecture

On 3/12/12 5:41 AM, Kettleborough, Jamie wrote:
> Hello,
>  
> I'm not quite sure how to respond to all the replies to this - I'm not
> sure I understand all the terms used for one thing - but thanks to
> everyone for engaging in this discussion.
>  
> I think there is agreement that anything too new or innovative on this
> should not be done until CMIP6+.  *BUT* I think we still have to ask
> whether we are happy with the current level of risk at CMIP5, and if
> not what can we do about it?  If we leave it all to CMIP6+ then fine,
> but I *think* that is equivalent to an implied 'yes we can live with
> this level of risk at CMIP5'.
>  
> Gavin - I think you offered a slot on the Telco to talk about this.  I
> think that is a good idea - with the focus being on what we can do for
> CMIP5.  This may be more a project management type issue rather than
> some deep technical discussion, but *my* feeling is it's sufficiently
> important that it is worth talking about it.  (But I don't have all
> the information, so can be ignored and I won't be upset).   Anyone agree?
>  
> Thanks again,
>  
> Jamie
>
>     ------------------------------------------------------------------------
>     *From:* go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Mark Morgan
>     *Sent:* 12 March 2012 10:55
>     *To:* Tobias Weigel
>     *Cc:* V Balaji; go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] What is the risk that science is
>     done using 'deprecated' data?
>
>     Hi
>
>     In relation to CMIP6+ there really needs to be stronger focus upon
>     _process_, i.e. what is the development process for resolving
>     these kinds of problems.
>
>     For this particular problem I am thinking particularly of test
>     driven development.  I.E. after a _formal definition_ of the
>     problem space, develop a _test framework_ for testing possible
>     solutions prior to trying to implement a solution.  This will
>     ensure that you have understood the problem space correctly whilst
>     guaranteeing the robustness of potential solution(s).
>
>     Mark  
>
>
>     On 12 Mar 2012, at 11:41, Tobias Weigel wrote:
>
>>     I'd be very much interested in such a discussion in ExArch, not
>>     just because it provides a sane hashing methodology, but also
>>     because this 'dataset essence' has a large overlap with
>>     information I would feel is useful to attach directly to
>>     persistent identifiers. Might even be exactly that, but might be
>>     a bit larger.
>>
>>     Best, Tobias
>>
>>     On  12.03.2012 11:17:31, V Balaji wrote:
>>>     I like the idea -- in the CMIP6 timeframe, as Estani reminds
>>>     us:-) --
>>>     of compiling a list of invariants and things about a dataset
>>>     that can
>>>     change without the underlying data changing. We have discussed
>>>     in the
>>>     past with Unidata an nc_chksum capability that can hash or sum
>>>     specific data records for comparison, so that we can omit
>>>     superficial
>>>     changes from a sum. Remik Ziemlinski of GFDL implemented nccmp
>>>     (http://nccmp.sourceforge.net) that allows some of this capability,
>>>     but it properly belongs in the netCDF base libraries.
>>>
>>>     Happy to discuss this within ExArch as you suggest. It's taking us
>>>     deep into metaphysical territory: a hash representation of the
>>>     Platonic essence, the Atman, the soul of a dataset.
>>>
>>>     On Fri, Mar 9, 2012 at 2:26 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>>     Hi Gavin,
>>>>
>>>>     That would definitely help but I don't think it's sufficient.
>>>>      How many of us would notice if a centre republished the same
>>>>     dataset (same dataset_id and facet metadata) with different
>>>>     checksums?  Estani would I expect :-) but the system itself
>>>>     wouldn't.
>>>>
>>>>     I would like to see a hash of invariants of each dataset used
>>>>     as identifiers.  For that we'd need to strip-out all the
>>>>     information from a THREDDS catalog which might legitimately
>>>>     change without changing the data: URL paths, service endpoints,
>>>>     last-modified, etc., but keeping filenames, checksums and some
>>>>     properties.  Canonicalise a serialisation then generate a hash.
>>>>
>>>>     We'd also need to really keep track of these hashes.  We have
>>>>     checksums and tracking_ids right now and are under-utilising them.
>>>>
>>>>     Cheers,
>>>>     Stephen.
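
A minimal sketch of the hash-of-invariants idea described above. It
assumes a simplified in-memory view of a dataset's catalog entry, not a
real THREDDS document. The field names (dataset_id, files, name,
checksum, checksum_type, tracking_id, url, last_modified) and the
function dataset_fingerprint are hypothetical, and which properties
count as invariant is exactly the open question in this thread.

```python
import hashlib
import json

# Hypothetical choice of per-file fields that identify the data itself.
# URL paths, service endpoints and timestamps are deliberately left out.
INVARIANT_FILE_FIELDS = ("name", "checksum", "checksum_type", "tracking_id")


def dataset_fingerprint(dataset: dict) -> str:
    """Return a SHA-256 hex digest over the invariant parts of a dataset."""
    files = [
        {field: f.get(field) for field in INVARIANT_FILE_FIELDS}
        for f in dataset["files"]
    ]
    # Sort so the fingerprint does not depend on the order of catalog entries.
    files.sort(key=lambda f: f["name"])
    essence = {"dataset_id": dataset["dataset_id"], "files": files}
    # Canonical serialisation: sorted keys, fixed separators, UTF-8 bytes.
    canonical = json.dumps(essence, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under these assumptions, moving a file to a different URL leaves the
fingerprint unchanged, while replacing a file's contents (and hence its
checksum) changes it.
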
>>>>
>>>>     On 9 Mar 2012, at 05:05, Gavin M. Bell wrote:
>>>>
>>>>     Hello,
>>>>
>>>>     If we enforced checksums to be done as a part of publication,
>>>>     then this would address this issue, right?
>>>>
>>>>
>>>>     On 3/8/12 8:39 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>>
>>>>     Tobias, sorry I mistyped your name :-)
>>>>     S.
>>>>
>>>>     On 8 Mar 2012, at 16:00, stephen.pascoe at stfc.ac.uk wrote:
>>>>
>>>>
>>>>
>>>>     Hi Thomas,
>>>>
>>>>     As you say, it's too late to do much re-engineering of the
>>>>     system now -- we've attempted to put in place various
>>>>     identifier systems and none of them are working particularly
>>>>     well -- however I think there is another perspective to your
>>>>     proposal:
>>>>
>>>>     1. ESG/CMIP5 is deployed globally across multiple
>>>>     administrative domains and each domain has the ability to cut
>>>>     corners to get things done, e.g. replacing files silently
>>>>     without changing identifiers.
>>>>
>>>>     2. ESG/CMIP5 system is so complex that who'd blame a sys-admin
>>>>     for doing #1 to get the data to scientists when they need it.
>>>>      Any system that makes it impossible, or even only difficult,
>>>>     to change the underlying data is going to be more complex and
>>>>     difficult to administer than a system that doesn't, unless that
>>>>     system was very rigorously designed, implemented and tested.
>>>>
>>>>     Because of #1 I'm convinced that a fit-for-purpose identifier
>>>>     system wouldn't use randomly generated UUIDs but would take the
>>>>     GIT approach of hashing invariants of the dataset so that any
>>>>     changes behind the scenes can be detected.
>>>>
>>>>     Because of #2 I'm convinced that now is not the time to start
>>>>     building more software to do this.  We have to stabilise the
>>>>     system and learn the lessons of CMIP5 first.
>>>>
>>>>     Cheers,
>>>>     Stephen.
>>>>
>>>>
>>>>     On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>>
>>>>
>>>>
>>>>     Jamie/All,
>>>>
>>>>     these are important questions I have been wondering about as
>>>>     well; we just had a small internal meeting yesterday with
>>>>     Estani and Martina, so I'll try to sum some points up here. I
>>>>     am not too familiar with the ESG publishing process, so I can
>>>>     only guess that Stephen's #1 has something to do with the
>>>>     bending of policies that are for pragmatic reasons not enforced
>>>>     in the CMIP5 process. (My intuition is that *ideally* it should
>>>>     be impossible to make data available without going through the
>>>>     whole publication process. Please correct me if I am
>>>>     misunderstanding this.)
>>>>
>>>>     Most of what I have been thinking about however concerns point
>>>>     #2. I'd claim that the risk here should not be underestimated;
>>>>     data consumers being unable to find the data they need is bad
>>>>     ("the advanced search issue"), but users relying on deprecated
>>>>     data - most likely without being aware of it - is certainly
>>>>     dangerous for scientific credibility.
>>>>     My suggestion to address this problem is to use globally
>>>>     persistent identifiers (PIDs) that are automatically assigned
>>>>     to data objects (and metadata etc.) on ESG-publication; data
>>>>     should ideally not be known by its file name or system-internal
>>>>     ID, but via a global identifier that never changes after it has
>>>>     been published. Of course, this sounds like the DOIs, but these
>>>>     are extremely coarse grained and very static. The idea is to
>>>>     attach identifiers to the low-level entities and provide
>>>>     solutions to build up a hierarchical ID system (virtual
>>>>     collections) to account for the various layers used in our
>>>>     data. Such persistent identifiers should then be placed
>>>>     prominently in any user interface dealing with managed data.
>>>>     The important thing is: If data is updated, we don't update the
>>>>     data behind identifier x, but assign a new identifier y and
>>>>     create a typed link between these two (which may be the most
>>>>     challenging part) and perhaps put a small annotation on x that
>>>>     this data is deprecated. A clever user interface should then redirect a user
>>>>     consistently to the latest version of a dataset if a user
>>>>     accesses the old identifier.
>>>>     This does not make it impossible to use deprecated data, but at
>>>>     least it raises the consumer's awareness of the issue and
>>>>     lowers the barrier to re-retrieve valid data.
>>>>
>>>>     As for the point in time; I'd be certain that it is too late
>>>>     now, but it is always a good idea to have plans for future
>>>>     improvement.. :)
>>>>
>>>>     Best, Tobias
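
A minimal sketch of the identifier scheme outlined above: data behind an
identifier is never replaced, an update gets a new identifier, and the
old one carries a typed link to its successor. The names Record,
PidRegistry and replaced_by are hypothetical, and uuid4 is only a
stand-in for whatever persistent identifier service would actually be
used.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Record:
    pid: str
    location: str
    replaced_by: Optional[str] = None  # typed link to the successor, if any


@dataclass
class PidRegistry:
    records: Dict[str, Record] = field(default_factory=dict)

    def publish(self, location: str, replaces: Optional[str] = None) -> str:
        """Mint a new PID; optionally mark an older PID as superseded."""
        if replaces is not None:
            old = self.records[replaces]  # KeyError if the old PID is unknown
            if old.replaced_by is not None:
                raise ValueError(f"{replaces} already has a successor")
        pid = str(uuid.uuid4())  # placeholder for a real PID service
        self.records[pid] = Record(pid, location)
        if replaces is not None:
            self.records[replaces].replaced_by = pid
        return pid

    def is_deprecated(self, pid: str) -> bool:
        """A record is deprecated once it has a successor."""
        return self.records[pid].replaced_by is not None

    def latest(self, pid: str) -> str:
        """Follow replaced_by links to the newest version."""
        while self.records[pid].replaced_by is not None:
            pid = self.records[pid].replaced_by
        return pid
```

A user interface built on such a registry could call latest() whenever
an old identifier is requested and show a notice if is_deprecated()
returns True, while the old record stays resolvable.
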
>>>>
>>>>     On 08.03.2012 13:06, Kettleborough, Jamie wrote:
>>>>
>>>>
>>>>     Thanks for the replies on this - any other replies are still
>>>>     very welcome.
>>>>
>>>>     Stephen - being selfish - we aren't too worried about 2 as it's
>>>>     less of an issue for us (we do a daily trawl of THREDDS
>>>>     catalogues for new datasets), but I agree it is a problem more
>>>>     generally.  I don't have a feel for which of the problems 1-3
>>>>     would minimise the risk most if you solved it.  I think making
>>>>     sure new data has a new version is a foundation though.
>>>>
>>>>     Part of me wonders though whether it's already too late to
>>>>     really do anything with versioning in its current form.  *But*
>>>>     I may be overestimating the size of the problem of new datasets
>>>>     appearing without versions being updated.
>>>>
>>>>     Jamie
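
One way to estimate how often data changes without a new version, given
a regular catalogue trawl like the one mentioned above, is to compare
checksums between two trawls. This is only a sketch: the Snapshot
layout, keyed by (dataset_id, version, filename), and the function name
silent_updates are assumptions, not part of any existing tool.

```python
from typing import Dict, List, Tuple

# A snapshot maps (dataset_id, version, filename) -> checksum.
Snapshot = Dict[Tuple[str, str, str], str]


def silent_updates(previous: Snapshot, current: Snapshot) -> List[Tuple[str, str, str]]:
    """List files whose checksum changed under an unchanged dataset version."""
    return sorted(
        key
        for key, checksum in current.items()
        if key in previous and previous[key] != checksum
    )
```

Files that appear only under a new version are not reported, since that
is the behaviour versioning is meant to produce.
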
>>>>
>>>>
>>>>
>>>>
>>>>     -----Original Message-----
>>>>     From: go-essp-tech-bounces at ucar.edu
>>>>     [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien
>>>>     Denvil
>>>>     Sent: 08 March 2012 10:41
>>>>     To: go-essp-tech at ucar.edu
>>>>     Subject: Re: [Go-essp-tech] What is the risk that science is
>>>>     done using 'deprecated' data?
>>>>
>>>>     Hi Stephen, let me add a third point:
>>>>
>>>>     3. Users are aware of new versions but can't download files
>>>>     so as to have a coherent set of files.
>>>>
>>>>     With respect to that point the p2p transition (especially the
>>>>     attribute caching on the node) will be a major step forward.
>>>>     GFDL just upgraded and we have an amazing success rate of 98%.
>>>>
>>>>     And I agree with Ashish.
>>>>
>>>>     Regards.
>>>>     Sébastien
>>>>
>>>>     On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
>>>>
>>>>
>>>>     Hi Jamie,
>>>>
>>>>     I can imagine there is a risk of papers being written on
>>>>     deprecated data in two scenarios:
>>>>
>>>>      1. Data is being updated at datanodes without creating a
>>>>         new version
>>>>      2. Users are unaware of new versions available and therefore
>>>>         using deprecated data
>>>>
>>>>     Are you concerned about both of these scenarios?  Your
>>>>     email seems to mainly address #1.
>>>>
>>>>     Thanks,
>>>>     Stephen.
>>>>
>>>>     On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>>
>>>>
>>>>
>>>>     Hello,
>>>>
>>>>     Does anyone have a feel for the current level of risk that
>>>>     analysts are doing work (with the intention to publish) on data
>>>>     that has been found to be wrong by the data providers and so
>>>>     deprecated (in some sense)?
>>>>
>>>>     My feeling is that versioning isn't working (that may be
>>>>     putting it a bit strongly).  It is too easy for data providers -
>>>>     in their understandable drive to get their data out - to have
>>>>     updated files on disk without publishing a new version.  How big
>>>>     a deal does anyone think this is?
>>>>
>>>>     If the risk that papers are being written based on
>>>>     deprecated data is sufficiently large then is there an agreed
>>>>     strategy for coping with this?  Does it have implications for the
>>>>     requirements of the data publishing/delivery system?
>>>>
>>>>     Thanks,
>>>>
>>>>     Jamie
>>>>     _______________________________________________
>>>>     GO-ESSP-TECH mailing list
>>>>     GO-ESSP-TECH at ucar.edu
>>>>     http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>
>>>>
>>>>     --
>>>>     Sébastien Denvil
>>>>     IPSL, Pôle de modélisation du climat
>>>>     UPMC, Case 101, 4 place Jussieu,
>>>>     75252 Paris Cedex 5
>>>>
>>>>     Tour 45-55 2ème étage Bureau 209
>>>>     Tel: 33 1 44 27 21 10
>>>>     Fax: 33 1 44 27 39 02
>>>>
>>>>     --
>>>>     Tobias Weigel
>>>>
>>>>     Department of Data Management
>>>>     Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>>>     Bundesstr. 45a
>>>>     20146 Hamburg
>>>>     Germany
>>>>
>>>>     Tel.: +49 40 460094 104
>>>>     E-Mail: weigel at dkrz.de
>>>>     Website: www.dkrz.de
>>>>
>>>>     Managing Director: Prof. Dr. Thomas Ludwig
>>>>
>>>>     Sitz der Gesellschaft: Hamburg
>>>>     Amtsgericht Hamburg HRB 39784
>>>>
>>>>
>>>>     --
>>>>     Gavin M. Bell
>>>>     --
>>>>
>>>>      "Never mistake a clear view for a short distance."
>>>>                   -Paul Saffo
>>>>
>>>
>>>
>>
>>
>>     -- 
>>     Tobias Weigel
>>
>>     Department of Data Management
>>     Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>     Bundesstr. 45a
>>     20146 Hamburg
>>     Germany
>>
>>     Tel.: +49 40 460094 104
>>     E-Mail: weigel at dkrz.de
>>     Website: www.dkrz.de
>>
>>     Managing Director: Prof. Dr. Thomas Ludwig
>>
>>     Sitz der Gesellschaft: Hamburg
>>     Amtsgericht Hamburg HRB 39784
>>
>>
>
>     ---------------------------------------------------
>     Mark Morgan
>     Software Architect / Engineer
>     Institut Pierre Simon Laplace (IPSL),
>     Université Pierre Marie Curie,
>     4 Place Jussieu,
>     Tour 45-55, Salle #207,
>     Paris 75005
>     France.
>     Tel : +33 (0) 1 44 27 49 10
>     Email: momipsl at ipsl.jussieu.fr
>     ---------------------------------------------------
>
>
>

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
