[Go-essp-tech] What is the risk that science is done using 'deprecated' data?
Tobias Weigel
weigel at dkrz.de
Fri Mar 9 02:55:39 MST 2012
Oh, and I am also talking about CMIP6+ here - no use in targeting CMIP5
except for hypothetical 'what-ifs' on lessons to learn.
On 09.03.2012 10:53:51, Tobias Weigel wrote:
> On 09.03.2012 10:31:28, Estanislao Gonzalez wrote:
>> Making a hash that uniquely identifies all the information, like the one
>> you've proposed, Stephen, is certainly appealing. We will, however, have
>> a lot of hashes, most of them pointing to the same data from the user's
>> perspective. For instance, a user downloads a variable, and from that
>> point onwards other variables get added and changed, catalogs get
>> republished unintentionally or moved to other machines, errors at all
>> stages get corrected, new access points to the data get inserted into
>> the catalogs, etc. (They do happen; I've done them all.) I'd expect
>> about 20 different hashes to result, none of them interesting to the user.
>> IMHO we need to find proper versioning units; the publication unit
>> (realm dataset) as we use it now might not be the best option.
>> AFAICT everything moves in variable units (atomic datasets). My
>> publication tasks vary widely, but they hardly ever match the
>> publication unit we use.
>
> From the user's perspective, meaningful IDs (like the ones currently
> visible in the gateway, "cmip5.output1.MPI-M...") are preferable to
> hashes. However, from what you are writing here I'd think that such
> IDs can only be applied to very high-level entities and are useless
> for the actual data management. This could be addressed through
> collections/aggregations perhaps. In general, I'd be comfortable with
> hashes as Stephen originally proposed. I've never seen anyone
> complaining about git in this respect.
>
>> For example, this is how publication looks from my perspective:
>> Normally I get information about a complete ensemble that was created
>> anew. No information on what was changed or not, just the data (and the
>> computed checksums). I have to find out which datasets are there and how
>> they relate to the ones I've already published (i.e. I have to
>> distinguish between new, changed and deleted).
>> The other common tasks arise, e.g., when a variable was wrongly
>> computed. So I get something like "umo,vmo from 1pctCO2 are wrong and
>> will be recalculated". This requires me to extract the variables from
>> the datasets (after finding out which ones), publish a new version
>> without them and, when they are corrected, generate yet another new
>> version to include the corrected variables.
>> That produces two versions without any meaning for those users
>> interested in other variables.
>
> So, just to take a quick shot at this in terms of PIDs/EPIC:
> You'd rather assign identifiers to all the low-level entities and
> build up a hierarchy through aggregations. If the variables are
> corrected, you'd publish a new identifier for an extended version of
> the old collection (some cloning involved here, but still only on the
> identifier side), and where possible still reference the old
> variables, as their data has not changed. Done consistently, this
> decouples identifier assignment and publication from the data layer,
> and that's one of the strong advantages I can see in a global PID
> infrastructure.
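The aggregation scheme described above could be sketched roughly like this. Everything here is hypothetical (a toy in-memory registry, UUIDs standing in for Handles/EPIC PIDs), not the real PID infrastructure: variables get identifiers, collections aggregate them, and a correction mints new identifiers only for the changed members while unchanged variables keep theirs.

```python
import uuid

registry = {}  # PID -> record (variable or collection); toy stand-in for a PID service

def mint_variable(name, data_ref):
    """Assign a fresh identifier to a low-level entity (a variable)."""
    pid = str(uuid.uuid4())
    registry[pid] = {"kind": "variable", "name": name, "data": data_ref}
    return pid

def mint_collection(member_pids, predecessor=None):
    """A collection is just an aggregation of member PIDs, itself identified."""
    pid = str(uuid.uuid4())
    registry[pid] = {"kind": "collection", "members": list(member_pids),
                     "predecessor": predecessor}
    return pid

def correct_collection(coll_pid, corrected):
    """Clone the collection on the identifier side only: corrected
    variables (name -> new data reference) get new PIDs, the rest are
    referenced unchanged."""
    members = []
    for m in registry[coll_pid]["members"]:
        rec = registry[m]
        if rec["name"] in corrected:
            members.append(mint_variable(rec["name"], corrected[rec["name"]]))
        else:
            members.append(m)  # data unchanged, so the old PID is reused
    return mint_collection(members, predecessor=coll_pid)

v1 = mint_variable("umo", "file:umo_v1.nc")
v2 = mint_variable("vmo", "file:vmo_v1.nc")
v3 = mint_variable("tas", "file:tas_v1.nc")
c1 = mint_collection([v1, v2, v3])
# "umo,vmo from 1pctCO2 are wrong and will be recalculated":
c2 = correct_collection(c1, {"umo": "file:umo_v2.nc", "vmo": "file:vmo_v2.nc"})
# c2 shares the untouched 'tas' PID with c1; only the collection and
# the two corrected variables received new identifiers.
```

Note the decoupling: the data layer can move files around freely, while the identifier layer records exactly which entities changed between collection versions.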
>
> Best, Tobias
>
>> None of the data I get is related to the "realm" datasets (or
>> publication units), and this makes data management more difficult.
>>
>> I just think we might want to review what we need and identify the
>> best units from three different perspectives: the producer (CMOR), the
>> data manager (ESG) and the user.
>> Once we know that for sure (and I doubt it will be the same unit for
>> all three), then we can think about unique IDs and a hashing
>> procedure, which I strongly support.
>>
>> My 2c,
>> Estani
>>
>> On 09.03.2012 08:26, stephen.pascoe at stfc.ac.uk wrote:
>>> Hi Gavin,
>>>
>>> That would definitely help, but I don't think it's sufficient. How
>>> many of us would notice if a centre republished the same dataset
>>> (same dataset_id and facet metadata) with different checksums?
>>> Estani would, I expect :-) but the system itself wouldn't.
>>>
>>> I would like to see a hash of the invariants of each dataset used
>>> as its identifier. For that we'd need to strip out all the
>>> information from a THREDDS catalog that might legitimately change
>>> without the data changing: URL paths, service endpoints,
>>> last-modified timestamps, etc., while keeping filenames, checksums
>>> and some properties. Canonicalise a serialisation, then generate a
>>> hash.
>>>
>>> We'd also need to really keep track of these hashes. We have
>>> checksums and tracking_ids right now and are under-utilising them.
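The canonicalise-then-hash idea could look something like the sketch below. The field names ("url", "last_modified", etc.) are illustrative placeholders, not the real THREDDS catalog schema: mutable fields are stripped, the remainder is serialised canonically (sorted keys, fixed separators), and the result is hashed.

```python
import hashlib
import json

# Fields that may legitimately change without the data changing
# (hypothetical names, not actual THREDDS elements):
MUTABLE_KEYS = {"url", "service_endpoint", "last_modified"}

def dataset_hash(catalog_entry):
    """Hash only the invariant parts of a catalog entry."""
    invariant = {k: v for k, v in catalog_entry.items()
                 if k not in MUTABLE_KEYS}
    # Canonical serialisation: key order and whitespace are fixed, so
    # the same invariants always produce the same byte string.
    canonical = json.dumps(invariant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"filename": "tas.nc", "checksum": "abc123",
     "url": "http://host1/tas.nc", "last_modified": "2012-03-01"}
b = {"filename": "tas.nc", "checksum": "abc123",
     "url": "http://host2/tas.nc", "last_modified": "2012-03-08"}
c = {"filename": "tas.nc", "checksum": "def456",
     "url": "http://host1/tas.nc", "last_modified": "2012-03-01"}
# a and b hash identically (only mutable fields differ); c hashes
# differently, because the checksum - i.e. the data - changed.
```

This is exactly what would catch a silent republication: same dataset_id and facet metadata, different checksums, different identity hash.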
>>>
>>> Cheers,
>>> Stephen.
>>>
>>> On 9 Mar 2012, at 05:05, Gavin M. Bell wrote:
>>>
>>> Hello,
>>>
>>> If we enforced checksums to be done as a part of publication, then
>>> this would address this issue, right?
>>>
>>>
>>> On 3/8/12 8:39 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>
>>> Tobias, sorry I mistyped your name :-)
>>> S.
>>>
>>> On 8 Mar 2012, at 16:00, stephen.pascoe at stfc.ac.uk wrote:
>>>
>>>
>>>
>>> Hi Thomas,
>>>
>>> As you say, it's too late to do much re-engineering of the system
>>> now -- we've attempted to put in place various identifier systems
>>> and none of them are working particularly well -- however I think
>>> there is another perspective to your proposal:
>>>
>>> 1. ESG/CMIP5 is deployed globally across multiple administrative
>>> domains and each domain has the ability to cut corners to get things
>>> done, e.g. replacing files silently without changing identifiers.
>>>
>>> 2. The ESG/CMIP5 system is so complex that who'd blame a sys-admin
>>> for doing #1 to get the data to scientists when they need it? Any
>>> system that makes it impossible, or even merely difficult, to
>>> change the underlying data is going to be more complex and harder
>>> to administer than one that doesn't, unless that system is very
>>> rigorously designed, implemented and tested.
>>>
>>> Because of #1 I'm convinced that a fit-for-purpose identifier
>>> system wouldn't use randomly generated UUIDs but would take the
>>> Git approach of hashing the invariants of a dataset, so that any
>>> changes behind the scenes can be detected.
>>>
>>> Because of #2 I'm convinced that now is not the time to start
>>> building more software to do this. We have to stabilise the system
>>> and learn the lessons of CMIP5 first.
>>>
>>> Cheers,
>>> Stephen.
>>>
>>>
>>> On 8 Mar 2012, at 15:32, Tobias Weigel wrote:
>>>
>>>
>>>
>>> Jamie/All,
>>>
>>> These are important questions I have been wondering about as well;
>>> we just had a small internal meeting yesterday with Estani and
>>> Martina, so I'll try to sum up some points here. I am not too
>>> familiar with the ESG publishing process, so I can only guess that
>>> Stephen's #1 has something to do with the bending of policies
>>> that, for pragmatic reasons, are not enforced in the CMIP5
>>> process. (My intuition is that *ideally* it should be impossible
>>> to make data available without going through the whole publication
>>> process. Please correct me if I am misunderstanding this.)
>>>
>>> Most of what I have been thinking about however concerns point #2.
>>> I'd claim that the risk here should not be underestimated; data
>>> consumers being unable to find the data they need is bad ("the
>>> advanced search issue"), but users relying on deprecated data - most
>>> likely without being aware of it - is certainly dangerous for
>>> scientific credibility.
>>> My suggestion to address this problem is to use globally persistent
>>> identifiers (PIDs) that are automatically assigned to data objects
>>> (and metadata etc.) on ESG-publication; data should ideally not be
>>> known by its file name or system-internal ID, but via a global
>>> identifier that never changes after it has been published. Of
>>> course, this sounds like DOIs, but those are extremely
>>> coarse-grained and very static. The idea is to attach identifiers
>>> to the low-level entities and provide solutions to build up a
>>> hierarchical ID system (virtual collections) to account for the
>>> various layers used in our data. Such persistent identifiers
>>> should then be placed prominently in any user interface dealing
>>> with managed data. The important thing is: if data is updated, we
>>> don't update the data behind identifier x, but assign a new
>>> identifier y and create a typed link between the two (which may be
>>> the most challenging part), and perhaps put a small annotation on
>>> x that this data is deprecated. A clever user interface should
>>> then consistently redirect a user to the latest version of a
>>> dataset when the old identifier is accessed.
>>> This does not make it impossible to use deprecated data, but at
>>> least it raises the consumer's awareness of the issue and lowers the
>>> barrier to re-retrieve valid data.
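The deprecate-and-redirect behaviour described above can be sketched as follows. The registry, the "hdl:" names and the "supersededBy" link type are all hypothetical stand-ins, not an actual Handle/EPIC API: each PID record carries a deprecation flag and an optional typed link to its successor, and resolution follows the chain to the latest version while reporting that a redirect happened.

```python
# Toy PID registry: each record keeps its data reference, a
# deprecation flag, and a typed link to its successor (or None).
pids = {
    "hdl:x": {"data": "v1", "deprecated": True,  "supersededBy": "hdl:y"},
    "hdl:y": {"data": "v2", "deprecated": True,  "supersededBy": "hdl:z"},
    "hdl:z": {"data": "v3", "deprecated": False, "supersededBy": None},
}

def resolve_latest(pid):
    """Follow supersededBy links to the current version.

    Returns (latest_pid, redirected): redirected is True when the
    requested identifier was deprecated, so a UI can warn the user
    rather than silently serve old data."""
    start = pid
    while pids[pid]["supersededBy"] is not None:
        pid = pids[pid]["supersededBy"]
    return pid, pid != start

latest, redirected = resolve_latest("hdl:x")
# A user holding the old identifier hdl:x is pointed at hdl:z and
# told the data they asked for has been superseded.
```

Note that this doesn't forbid access to the deprecated object (hdl:x still resolves); it just makes the deprecation visible and the valid data one step away, which is exactly the awareness-raising goal stated above.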
>>>
>>> As for the point in time: it is certainly too late now, but it is
>>> always a good idea to have plans for future improvement. :)
>>>
>>> Best, Tobias
>>>
>>> Am 08.03.2012 13:06, schrieb Kettleborough, Jamie:
>>>
>>>
>>> Thanks for the replies on this - any other replies are still very
>>> welcome.
>>>
>>> Stephen - being selfish - we aren't too worried about #2, as it's
>>> less of an issue for us (we do a daily trawl of THREDDS catalogues
>>> for new datasets), but I agree it is a problem more generally. I
>>> don't have a feel for which of problems 1-3, if solved, would
>>> minimise the risk most. I think making sure new data gets a new
>>> version is a foundation, though.
>>>
>>> Part of me wonders, though, whether it's already too late to
>>> really do anything with versioning in its current form. *But* I
>>> may be overestimating the size of the problem of new datasets
>>> appearing without versions being updated.
>>>
>>> Jamie
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu
>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
>>> Sent: 08 March 2012 10:41
>>> To: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] What is the risk that science is
>>> done using 'deprecated' data?
>>>
>>> Hi Stephen, let me add a third point:
>>>
>>> 3. Users are aware of a new version but can't download the files
>>> needed to assemble a coherent set.
>>>
>>> With respect to that point, the P2P transition (especially the
>>> attribute caching on the node) will be a major step forward.
>>> GFDL just upgraded, and we see an amazing success rate of 98%.
>>>
>>> And I agree with Ashish.
>>>
>>> Regards.
>>> Sébastien
>>>
>>> On 08/03/2012 11:34, stephen.pascoe at stfc.ac.uk wrote:
>>>
>>>
>>> Hi Jamie,
>>>
>>> I can imagine there is a risk of papers being written on
>>> deprecated data in two scenarios:
>>>
>>> 1. Data is being updated at datanodes without creating a
>>> new version
>>> 2. Users are unaware of new versions available and are
>>> therefore using deprecated data
>>>
>>> Are you concerned about both of these scenarios? Your
>>> email seems to mainly address #1.
>>>
>>>
>>> Thanks,
>>> Stephen.
>>>
>>> On 8 Mar 2012, at 10:21, Kettleborough, Jamie wrote:
>>>
>>>
>>>
>>> Hello,
>>>
>>> Does anyone have a feel for the current level of risk that
>>> analysts are doing work (with the intention to publish) on data
>>> that has been found to be wrong by the data providers and so
>>> deprecated (in some sense)?
>>>
>>> My feeling is that versioning isn't working (that may be putting
>>> it a bit strongly). It is too easy for data providers - in their
>>> understandable drive to get their data out - to have updated files
>>> on disk without publishing a new version. How big a deal does
>>> anyone think this is?
>>>
>>> If the risk that papers are being written based on deprecated
>>> data is sufficiently large, then is there an agreed strategy for
>>> coping with this? Does it have implications for the requirements
>>> of the data publishing/delivery system?
>>>
>>> Thanks,
>>>
>>> Jamie
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>
>>>
>>> --
>>> Sébastien Denvil
>>> IPSL, Pôle de modélisation du climat
>>> UPMC, Case 101, 4 place Jussieu,
>>> 75252 Paris Cedex 5
>>>
>>> Tour 45-55 2ème étage Bureau 209
>>> Tel: 33 1 44 27 21 10
>>> Fax: 33 1 44 27 39 02
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Tobias Weigel
>>>
>>> Department of Data Management
>>> Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
>>> Bundesstr. 45a
>>> 20146 Hamburg
>>> Germany
>>>
>>> Tel.: +49 40 460094 104
>>> E-Mail: weigel at dkrz.de
>>> Website: www.dkrz.de
>>>
>>> Managing Director: Prof. Dr. Thomas Ludwig
>>>
>>> Sitz der Gesellschaft: Hamburg
>>> Amtsgericht Hamburg HRB 39784
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Gavin M. Bell
>>> --
>>>
>>> "Never mistake a clear view for a short distance."
>>> -Paul Saffo
>>>
>>>
>>>
>>
>
>
>
>
--
Tobias Weigel
Department of Data Management
Deutsches Klimarechenzentrum GmbH (German Climate Computing Center)
Bundesstr. 45a
20146 Hamburg
Germany
Tel.: +49 40 460094 104
E-Mail: weigel at dkrz.de
Website: www.dkrz.de
Managing Director: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784