[Go-essp-tech] What is the risk that science is done using 'deprecated' data?

Tobias Weigel weigel at dkrz.de
Fri Mar 9 02:55:39 MST 2012


Oh, and I am also talking about CMIP6+ here - there is no use in targeting CMIP5 
except for hypothetical 'what-ifs' about lessons to learn.

On 09.03.2012 10:53:51, Tobias Weigel wrote:
> On 09.03.2012 10:31:28, Estanislao Gonzalez wrote:
>> Making a hash that uniquely identifies all the information, like the one
>> you've proposed, Stephen, is certainly appealing. But we will have a
>> lot of hashes, most of them pointing to the same data from the user's
>> perspective. For instance, the user downloads a variable and, from that
>> point onwards, other variables get added and changed, catalogs get
>> republished unintentionally or moved to other machines, errors at all
>> stages get corrected, new access points to the data get inserted into the
>> catalogs, etc. (They do happen; I've done them all.) I'd expect about 20
>> different hashes to come out of that, none of which would be interesting to the user.
>> IMHO we need to find proper versioning units; the publication unit
>> (realm dataset) we use now might not be the best option.
>> AFAICT everything moves in variable units (atomic datasets). My
>> publication tasks are quite varied, but they hardly ever match the
>> publication unit we use.
>
> From the user's perspective, meaningful IDs (like the ones currently 
> visible in the gateway, "cmip5.output1.MPI-M...") are preferable to 
> hashes. However, from what you are writing here I'd think that such 
> IDs can only be applied to very high-level entities and are useless 
> for the actual data management. This could be addressed through 
> collections/aggregations perhaps. In general, I'd be comfortable with 
> hashes as Stephen originally proposed. I've never seen anyone 
> complaining about git in this respect.
>
>> For example, this is how publication looks from my perspective:
>> Normally I get information about a complete ensemble that was created
>> anew. No information on what was changed or not. Just the data (and the
>> computed checksums). I have to find out which datasets are there and how
>> they relate to the ones I've already published (i.e. I have to
>> distinguish between new, changed and deleted).
>> The other common tasks arise, e.g., when a variable was wrongly
>> computed. Then I get something like, "umo,vmo from 1pctCO2 are wrong and
>> will be recalculated". This requires me to extract the variables from
>> the datasets (after finding out which ones), publish a new version without them
>> and, when they are corrected, generate yet another new version to include
>> the corrected variables.
>> This generates two versions without any meaning for those users interested
>> in other variables.
>
> So, just to take a quick shot at this in terms of PIDs/EPIC:
> You'd rather assign identifiers to all the low-level entities and 
> build up a hierarchy through aggregations. If the variables are 
> corrected, you'd publish a new identifier for an extended version of 
> the old collection (some cloning is involved here, but still only on the 
> identifier side) and, where possible, still reference the old variables, 
> as their data has not changed. Done consistently, this decouples 
> the assignment and publication of identifiers from the data 
> layer, and that's one of the strong advantages I can see in a global 
> PID infrastructure.
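
To sketch what I mean on the identifier side only (plain Python; the PID
format, field names and the Handle/EPIC registration step are purely
illustrative assumptions, not a concrete design):

    from dataclasses import dataclass, field
    from typing import Dict, Optional
    from uuid import uuid4

    def mint_pid() -> str:
        # Hypothetical PID string; a real system would register this with
        # a Handle/EPIC service instead of just generating it locally.
        return "hdl:21.TEST/" + uuid4().hex

    @dataclass
    class Variable:
        name: str        # e.g. "umo"
        checksum: str    # checksum of the underlying file(s); data untouched
        pid: str = field(default_factory=mint_pid)

    @dataclass
    class Collection:
        pid: str
        members: Dict[str, str]            # variable name -> variable PID
        predecessor: Optional[str] = None  # typed link to the old collection

    def new_version(old: Collection, corrected: Dict[str, Variable]) -> Collection:
        # Clone only on the identifier side: unchanged variables keep their
        # old PIDs, only the corrected ones get freshly minted identifiers.
        members = dict(old.members)
        for name, var in corrected.items():
            members[name] = var.pid
        return Collection(pid=mint_pid(), members=members, predecessor=old.pid)

So if only umo and vmo are recalculated, the new collection reuses every
other variable's identifier and just records a link back to its predecessor;
no data is copied or republished for the unchanged variables.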
>
> Best, Tobias
>
>> None of the data I get is related to the "realm" datasets (or
>> publication units), and this makes the data management more difficult.
>>
>> I just think we might want to review what we need and work out the
>> best units from three different perspectives: the producer (cmor), the
>> data manager (esg) and the user.
>> Once we know that for sure (and I doubt it will be the same unit for
>> all), then we can think about unique ids and a hashing procedure, which
>> I strongly support.
>>
>> My 2c,
>> Estani
>>
>> On 09.03.2012 08:26, stephen.pascoe at stfc.ac.uk wrote:
>>> Hi Gavin,
>>>
>>> That would definitely help, but I don't think it's sufficient.  How 
>>> many of us would notice if a centre republished the same dataset 
>>> (same dataset_id and facet metadata) with different checksums?  
>>> Estani would, I expect :-) but the system itself wouldn't.
>>>
>>> I would like to see a hash of the invariants of each dataset used as 
>>> its identifier.  For that we'd need to strip out all the information 
>>> from a THREDDS catalog which might legitimately change without 
>>> changing the data: URL paths, service endpoints, last-modified, 
>>> etc., but keep filenames, checksums and some properties.  
>>> Canonicalise a serialisation, then generate a hash.
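
For what it's worth, a minimal sketch of what such an invariant hash could
look like (plain Python; the set of invariant fields and the choice of
SHA-1 are just assumptions for illustration, not a proposed canonical form):

    import hashlib
    import json

    # Catalog fields assumed to be invariant; URL paths, service endpoints,
    # last-modified timestamps etc. are deliberately left out.
    INVARIANT_KEYS = ("filename", "checksum", "size", "tracking_id")

    def dataset_hash(files):
        """files: iterable of dicts, one per file in the dataset."""
        # Keep only the invariant fields and sort by filename so the result
        # does not depend on the order in which the catalog lists the files.
        canonical = sorted(
            ({k: f[k] for k in INVARIANT_KEYS if k in f} for f in files),
            key=lambda d: d.get("filename", ""),
        )
        # Canonical serialisation: sorted keys, no insignificant whitespace.
        blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
        return hashlib.sha1(blob.encode("utf-8")).hexdigest()

Two catalogs describing the same files with the same checksums then yield
the same hash, no matter where they are hosted or which services they
advertise, while a silent file replacement changes the hash and is
immediately visible.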
>>>
>>> We'd also need to really keep track of these hashes.  We have 
>>> checksums and tracking_ids right now and are under-utilising them.
>>>
>>> Cheers,
>>> Stephen.
>>>
>>> On 9 Mar 2012, at 05:05, Gavin M. Bell wrote:
>>>
>>> Hello,
>>>
>>> If we enforced checksums as part of publication, then this would 
>>> address this issue, right?
>>>
>>>
>>> On 3/8/12 8:39 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>
>>> Tobias, sorry I mistyped your name :-)
>>> S.
>>>
>>> On 8 Mar 2012, at 16:00, stephen.pascoe at stfc.ac.uk wrote:
>>>
>>>
>>>
>>> Hi Thomas,
>>>
>>> As you say, it's too late to do much re-engineering of the system 
>>> now -- we've attempted to put in place various identifier systems 
>>> and none of them are working particularly well -- however I think 
>>> there is another perspective to your proposal:
>>>
>>> 1. ESG/CMIP5 is deployed globally across multiple administrative 
>>> domains and each domain has the ability to cut corners to get things 
>>> done, e.g. replacing files silently without changing identifiers.
>>>
>>> 2. The ESG/CMIP5 system is so complex that who could blame a sysadmin for 
>>> doing #1 to get the data to scientists when they need it?  Any 
>>> system that makes it impossible, or even just difficult, to change 
>>> the underlying data is going to be more complex and harder to 
>>> administer than a system that doesn't, unless that system is very 
>>> rigorously designed, implemented and tested.
>>>
>>> Because of #1 I'm convinced that a fit-for-purpose identifier system 
>>> wouldn't use randomly generated UUIDs but would take the GIT 
>>> approach of hashing invariants of the dataset so that any changes 
>>> behind the scenes can be detected.
>>>
>>> Because of #2 I'm convinced that now is not the time to start 
>>> building more software to do this.  We have to stabilise the system 
>>> and learn the lessons of CMIP5 first.
>>>
>>> Cheers,
>>> Stephen.
>>>
>>>
>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>
>>>
>>>
>>> Jamie/All,
>>>
>>> These are important questions I have been wondering about as well; 
>>> we just had a small internal meeting yesterday with Estani and 
>>> Martina, so I'll try to sum up some points here. I am not too 
>>> familiar with the ESG publishing process, so I can only guess that 
>>> Stephen's #1 has something to do with the bending of policies that, 
>>> for pragmatic reasons, are not enforced in the CMIP5 process. (My 
>>> intuition is that *ideally* it should be impossible to make data 
>>> available without going through the whole publication process. 
>>> Please correct me if I am misunderstanding this.)
>>>
>>> Most of what I have been thinking about, however, concerns point #2. 
>>> I'd claim that the risk here should not be underestimated; data 
>>> consumers being unable to find the data they need is bad ("the 
>>> advanced search issue"), but users relying on deprecated data - most 
>>> likely without being aware of it - is certainly dangerous for 
>>> scientific credibility.
>>> My suggestion to address this problem is to use globally persistent 
>>> identifiers (PIDs) that are automatically assigned to data objects 
>>> (and metadata etc.) on ESG publication; data should ideally not be 
>>> known by its file name or a system-internal ID, but via a global 
>>> identifier that never changes after it has been published. Of 
>>> course, this sounds like DOIs, but those are extremely 
>>> coarse-grained and very static. The idea is to attach identifiers to 
>>> the low-level entities and provide solutions to build up a 
>>> hierarchical ID system (virtual collections) to account for the 
>>> various layers used in our data. Such persistent identifiers should 
>>> then be placed prominently in any user interface dealing with 
>>> managed data. The important thing is: if data is updated, we don't 
>>> update the data behind identifier x, but assign a new identifier y, 
>>> create a typed link between the two (which may be the most 
>>> challenging part) and perhaps put a small annotation on x that this 
>>> data is deprecated. A clever user interface should then consistently 
>>> redirect a user to the latest version of a dataset when they access 
>>> the old identifier.
>>> This does not make it impossible to use deprecated data, but at 
>>> least it raises the consumer's awareness of the issue and lowers the 
>>> barrier to re-retrieve valid data.
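
To illustrate the resolution side of this (plain Python; the record layout
and the 'isReplacedBy' link type are placeholders for whatever the PID
service would actually store):

    # Hypothetical local view of a few PID records; a real client would
    # query a Handle/EPIC resolver instead of this dict.
    PID_RECORDS = {
        "hdl:x": {"location": "http://example.org/data/x",
                  "isReplacedBy": "hdl:y"},   # deprecated, superseded by y
        "hdl:y": {"location": "http://example.org/data/y",
                  "isReplacedBy": None},      # current version
    }

    def resolve_latest(pid):
        # Follow 'isReplacedBy' links to the newest version, remembering
        # whether the originally requested identifier was deprecated.
        requested, seen = pid, set()
        while PID_RECORDS[pid]["isReplacedBy"] and pid not in seen:
            seen.add(pid)
            pid = PID_RECORDS[pid]["isReplacedBy"]
        if pid != requested:
            print("Note: %s is deprecated; redirecting to %s" % (requested, pid))
        return PID_RECORDS[pid]["location"]

A user who accesses hdl:x would still get a result, but would be told that
the data has been superseded and pointed at the current version.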
>>>
>>> As for the point in time: I'm fairly certain it is too late now, 
>>> but it is always a good idea to have plans for future improvement. :)
>>>
>>> Best, Tobias
>>>
>>> On 08.03.2012 13:06, Kettleborough, Jamie wrote:
>>>
>>>
>>> Thanks for the replies on this - any other replies are still very 
>>> welcome.
>>>
>>> Stephen - being selfish - we aren't too worried about 2 as it's less 
>>> of an issue for us (we do a daily trawl of THREDDS catalogues for 
>>> new datasets), but I agree it is a problem more generally.  I don't 
>>> have a feel for which of problems 1-3 would reduce the risk 
>>> most if solved.  I think making sure new data gets a new 
>>> version is a foundation though.
>>>
>>> Part of me wonders, though, whether it's already too late to really do 
>>> anything with versioning in its current form.  *But* I may be 
>>> overestimating the size of the problem of new datasets appearing 
>>> without versions being updated.
>>>
>>> Jamie
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu On Behalf Of Sébastien Denvil
>>> Sent: 08 March 2012 10:41
>>> To: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>> done using 'deprecated' data?
>>>
>>> Hi Stephen, let me add a third point:
>>>
>>> 3. Users are aware of a new version but can't download the files
>>> needed to have a coherent set of files.
>>>
>>> With respect to that point, the p2p transition (especially the
>>> attribute caching on the node) will be a major step forward.
>>> GFDL just upgraded and we now see an amazing success rate of 98%.
>>>
>>> And I agree with Ashish.
>>>
>>> Regards.
>>> Sébastien
>>>
>>> On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
>>>
>>>
>>> Hi Jamie,
>>>
>>> I can imagine there is a risk of papers being written on deprecated 
>>> data in two scenarios:
>>>
>>>    1. Data is being updated at datanodes without creating a new version
>>>    2. Users are unaware of new versions available and are therefore 
>>>       using deprecated data
>>>
>>> Are you concerned about both of these scenarios?  Your email seems to 
>>> mainly address #1.
>>>
>>> Thanks,
>>> Stephen.
>>>
>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>
>>>
>>>
>>> Hello,
>>>
>>> Does anyone have a feel for the current level of risk that analysts 
>>> are doing work (with the intention to publish) on data that has been 
>>> found to be wrong by the data providers and so deprecated (in some 
>>> sense)?
>>>
>>> My feeling is that versioning isn't working (that may be putting it a 
>>> bit strongly).  It is too easy for data providers - in their 
>>> understandable drive to get their data out - to have updated files on 
>>> disk without publishing a new version.  How big a deal does anyone 
>>> think this is?
>>>
>>> If the risk that papers are being written based on deprecated data is 
>>> sufficiently large, then is there an agreed strategy for coping with 
>>> this?  Does it have implications for the requirements of the data 
>>> publishing/delivery system?
>>>
>>> Thanks,
>>>
>>> Jamie
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>
>>>
>>> -- 
>>> Sébastien Denvil
>>> IPSL, Pôle de modélisation du climat
>>> UPMC, Case 101, 4 place Jussieu,
>>> 75252 Paris Cedex 5
>>>
>>> Tour 45-55 2ème étage Bureau 209
>>> Tel: 33 1 44 27 21 10
>>> Fax: 33 1 44 27 39 02
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>
>>>
>>>
>>>
>>> -- 
>>> Tobias Weigel
>>>
>>> Department of Data Management
>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>> Bundesstr. 45a
>>> 20146 Hamburg
>>> Germany
>>>
>>> Tel.: +49 40 460094 104
>>> E-Mail: weigel at dkrz.de
>>> Website: www.dkrz.de
>>>
>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>
>>> Sitz der Gesellschaft: Hamburg
>>> Amtsgericht Hamburg HRB 39784
>>>
>>>
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>
>>>
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>
>>>
>>>
>>> -- 
>>> Gavin M. Bell
>>> -- 
>>>
>>>    "Never mistake a clear view for a short distance."
>>>                  -Paul Saffo
>>>
>>>
>>>
>>
>
>
>
>
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech


-- 
Tobias Weigel

Department of Data Management
Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
Bundesstr. 45a
20146 Hamburg
Germany

Tel.: +49 40 460094 104
E-Mail: weigel at dkrz.de
Website: www.dkrz.de

Managing Director: Prof. Dr. Thomas Ludwig

Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784
