[Go-essp-tech] Checksums on data nodes

Estanislao Gonzalez gonzalez at dkrz.de
Fri Jul 8 04:35:32 MDT 2011


Hi,

Indeed, this might not be exactly the same case, but anyway:
While moving the CMIP3 archive from PCMDI (~35 TB, 74,000+ files) I got
about ten corrupted files (same size as the originals). I got them via
GridFTP, but using the normal mode without any parallelization, since
that wasn't working in our case. The procedure broke a couple of times
and I always rolled ~1k back before continuing, in case the last blocks
got scrambled up.
I've tested this successfully many times and I'm sure it didn't
cause the file corruption (in which case the size would have been
different anyway). So the error rate I've seen was about 0.01%
(roughly 10 in 74,000 files).

The point here is that there is no real way to guarantee with 100%
certainty that all files are the same unless we calculate a checksum.
And a one-bit difference (literally a _bit_ :-) will be almost
impossible to spot and will probably become visible only when
computing maxima and minima.
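
To illustrate the point about a single bit (a minimal sketch, not
anything used for the actual transfer): the Python snippet below flips
one bit in an in-memory byte string standing in for a file and compares
size and checksum. SHA-256 and the helper names `sha256_hex` and
`flip_bit` are assumptions chosen for illustration; the thread does not
say which checksum algorithm the data nodes publish.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Stand-in for the contents of a transferred file (hypothetical data).
original = bytes(range(256)) * 4
corrupted = flip_bit(original, 1000)

same_size = len(original) == len(corrupted)
same_checksum = sha256_hex(original) == sha256_hex(corrupted)

print("same size:    ", same_size)
print("same checksum:", same_checksum)
```

The size comparison cannot tell the two apart, while the checksum
comparison can.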

Bottom line: it did happen, so it will happen again. As an archive we
must be sure we have the right files, for example before moving them
to tape... a corrupt file now means a re-download; a corrupt file in
3-5 years means the file is lost.
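
A pre-archive check along these lines could look like the sketch below.
It assumes the expected size and checksum are already known from the
source (for instance, published alongside the file); the function names
`file_checksum` and `verify_file`, the default SHA-256 algorithm and the
1 MiB chunk size are illustrative choices, not something prescribed in
this thread. The size comparison comes first only because it is cheap;
a matching size proves nothing by itself.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_size, expected_checksum, algorithm="sha256"):
    """Return "ok", "size mismatch" or "checksum mismatch" for one file."""
    # Cheap test first: a wrong size means the file is certainly bad.
    if os.path.getsize(path) != expected_size:
        return "size mismatch"
    # Same size is not enough, so compare the full-content checksum too.
    if file_checksum(path, algorithm) != expected_checksum.lower():
        return "checksum mismatch"
    return "ok"
```

Anything other than "ok" would mean fetching the file again before it
goes to tape.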

My 2c,
Estani

On 08.07.2011 10:02, martin.juckes at stfc.ac.uk wrote:
> Hi Jamie,
>
> I have been checking file size before checking the checksum -- but I'm afraid I don't have statistics on failure rates.
>
> I believe that Alan Iwi has had experience (on another project) of corrupted files showing up with the correct size. This may have been associated with a parallel FTP client and so not directly relevant to wget transfers, so I think it is best to be cautious.
>
> I have data from CNRM which I transferred before they started publishing checksums -- I need to go through that now and will let you know the results,
>
> Cheers,
> Martin
>
>>> -----Original Message-----
>>> From: Kettleborough, Jamie
>>> [mailto:jamie.kettleborough at metoffice.gov.uk]
>>> Sent: 07 July 2011 14:54
>>> To: Juckes, Martin (STFC,RAL,RALSP); gavin at llnl.gov
>>> Cc: go-essp-tech at ucar.edu; Kettleborough, Jamie
>>> Subject: RE: Checksums on data nodes
>>>
>>> Hello Martin,
>>>
>>> As you have been pulling data back from different nodes how often has
>>> the checksum picked up a corrupt transfer?  How often could this
>>> corruption have been spotted by just checking the file size?
>>>
>>> Thanks,
>>>
>>> Jamie
>>>
>>>> -----Original Message-----
>>>> From: martin.juckes at stfc.ac.uk [mailto:martin.juckes at stfc.ac.uk]
>>>> Sent: 06 July 2011 22:43
>>>> To: Kettleborough, Jamie; gavin at llnl.gov
>>>> Cc: go-essp-tech at ucar.edu
>>>> Subject: Checksums and PKI access control on data nodes
>>>>
>>>> Hi Jamie,
>>>>
>>>> just picking up something on one of your data node
>>>> authorization threads.
>>>>
>>>> I think programmatic access to data requires PKI security --
>>>> I don't see any prospect of adequate data access with the
>>>> http token approach.
>>>>
>>>> I think that checksums are also necessary to guarantee data
>>>> integrity -- these are given in the THREDDS catalogues of
>>>> BADC, IPSL, and CNRM -- and CCCMA is in the process of adding them.
>>>>
>>>> I aim to continue contacting data nodes over the coming weeks
>>>> and hope that there will be steady progress in levelling the
>>>> quality of service upwards,
>>>>
>>>> cheers,
>>>> Martin
>>>>
>>>> ________________________________________
>>>> From: go-essp-tech-bounces at ucar.edu
>>>> [go-essp-tech-bounces at ucar.edu] on behalf of Kettleborough,
>>>> Jamie [jamie.kettleborough at metoffice.gov.uk]
>>>> Sent: 05 July 2011 14:48
>>>> To: Gavin M. Bell
>>>> Cc: go-essp-tech at ucar.edu
>>>> Subject: Re: [Go-essp-tech] Data node authorization
>>>>
>>>> Hello Gavin,
>>>>
>>>> thanks for this.  This looks useful.  Any ideas when any
>>>> live/production data nodes will have this version of the
>>>> service on them? - I couldn't find any (but that's part of
>>>> the problem of course). When available, how up to date will
>>>> the registry be, e.g. are there constraints on it, such as it
>>>> only knowing about data nodes running the same release?
>>>>
>>>> I know you were just answering my tangent.  But I think the
>>>> original question is still only half answered.  As I
>>>> understand it there are two ways this might go:
>>>>
>>>> 1. all data nodes upgrade to the PKI infrastructure
>>>>
>>>> 2. the ESGF continues to support (for some time) both PKI and
>>>> the HTTP query string token (I don't know the right name for
>>>> this, sorry).
>>>>
>>>> (there is a 3rd option of everyone move to just the HTTP
>>>> query string token - but I don't think that is really under
>>>> discussion).
>>>>
>>>> My guess is that 2. is the most likely outcome and data users
>>>> will have to cope with both.  So...
>>>>
>>>> 1. How do you programmatically get data using the HTTP query
>>>> string token (I think Martin is following this up with Bob -
>>>> can we have a summary posted to the list?)
>>>>
>>>> 2. How does a user know which method to use for which nodes?
>>>> (This may be in the data-node registry, when available, but
>>>> it wasn't obvious to me from the sample Luca sent round -
>>>> again I may be missing something though).
>>>>
>>>> Apologies if I'm coming across as over demanding here - I
>>>> realise I'm coming to this discussion relatively late in the
>>>> day.  Just I'm aware that we have scientists who want to get
>>>> data so they can start the analysis and writing of multi
>>>> model papers in time for the 1st draft of the AR5. At the
>>>> moment I'm really uncertain how they can get the data while
>>>> minimising the effort they have to put into finding and fetching it.
>>>>
>>>> Thanks,
>>>>
>>>> Jamie
>>>>
>>>>
>>>> ________________________________
>>>>
>>>>          From: Gavin M. Bell [mailto:gavin at llnl.gov]
>>>>          Sent: 01 July 2011 20:35
>>>>          To: Kettleborough, Jamie
>>>>          Cc: Cinquini, Luca (3880); go-essp-tech at ucar.edu
>>>>          Subject: Re: [Go-essp-tech] Data node authorization
>>>>
>>>>
>>>>          Hello Jamie,
>>>>
>>>>          Allow me to solely indulge your tangent for a moment... :-)
>>>>
>>>>          The issue of knowing who is where etc. is solved by
>>>> using a sufficiently recent version of the  ESGF "data" Node
>>>> (v0.5.1+).
>>>>          The node-manager's registry component will
>>>> automatically generate a continuously updating descriptive
>>>> (xml) document of nodes currently present in the federation
>>>> at a given time.  This would have ameliorated your task
>>>> considerably.
>>>>          If you look at the sites you have collected; go to
>>>> the esgf-node-manager page and look at the bottom left corner
>>>> for the version.
>>>>          They are all earlier than v0.5.1 and hence do not
>>>> have the automatic federation feature in place.
>>>>
>>>>          Ex:
>>>>          http://esgnode1.nci.org.au/esgf-node-manager/  (v0.5.0)
>>>>          http://vesg.ipsl.fr/esgf-node-manager/  (v0.4.0)
>>>>          http://esg.cnrm-game-meteo.fr/esgf-node-manager/  (v0.4.0)
>>>>          http://dap.cccma.uvic.ca/esgf-node-manager/  (v0.5.0)
>>>>          http://cmip-dn.badc.rl.ac.uk/esgf-node-manager/  (v0.4.0)
>>>>
>>>>          (NASA-GISS are not running a node manager at all)
>>>>
>>>>          If you look at more recent node installations
>>>> (version 0.5.1+) you will see that there is a
>>>> registration.xml document that is served under
>>>> esgf-node-manager.  It is an active document that is
>>>> automatically updated by the node manager's registry service
>>>> to always reflect the current state of the federation.
>>>>          This is a feature of the new ESGF Node.  Gateways are
>>>> not running node managers so they are not present in the
>>>> registration.xml document.  However, you can find out about
>>>> gateways indirectly by looking at the ESGF Node's
>>>> registration entry and looking at the attribute "adminPeer";
>>>> this indicates that node's target IDP service, which in older
>>>> ESG parlance indicates a "gateway".  The new ESGF Nodes are
>>>> built on a modular component architecture such that
>>>> sets of components embody functionality; these are what we
>>>> call ESGF Node "types".  There are 4 node types. The node
>>>> type that is currently being installed is the well-known
>>>> "data" type, a.k.a. the "data node"; the other types are not
>>>> mutually exclusive and extend the ESGF Node's functionality to
>>>> include familiar features such as:
>>>>          - User credential management and single sign on support
>>>>          - Attribute management
>>>>          - Enhanced Federation-wide searching (with new search
>>>> front-end)
>>>>
>>>>          As well as recent features since v0.5.1 and pending
>>>> features coming on line such as:
>>>>          - Automatic fail-over and fault tolerance
>>>>          - New administrative front ends
>>>>          - Computation / Visualization tools
>>>>          - and more...
>>>>
>>>>          I would suggest upgrading :-).
>>>>
>>>>          The installation/upgrading process has been
>>>> streamlined to make things more straightforward - and the
>>>> team and I are always glad to help if needed.  There are
>>>> further enhancements in the queue that will further
>>>> streamline the process to make installation/upgrading as
>>>> turn-key as possible.  There are also enhancements to the
>>>> federation protocol and new features as well, that will soon
>>>> be available in an upcoming v0.5.3 release that is currently in
>>>> test.
>>>>          FYI:
>>>>          The current installer installs the ESGF Node at v0.5.1.
>>>>          In staging is v0.5.2
>>>>          In test is v0.5.3.
>>>>
>>>>          Note: The versions listed above are those of the node manager
>>>> component.
>>>> As it is a component of the ESGF Node, the node itself has a
>>>> version, currently ESG Node v1.0.4+ (Stuyvesant release).
>>>>
>>>>          The new ESGF Node augments the data node and is a
>>>> complete solution in and of itself while being compatible
>>>> with the current Gateway.  It should be considered a useful
>>>> tool to help the climate community and an addition to the ESG
>>>> ecosystem of utilities :-).
>>>>
>>>>          Whew... (that was a long email)
>>>>          I hope this was somewhat useful information in the
>>>> context of your tangent. :-)
>>>>
>>>>
>>>>          On 7/1/11 6:49 AM, Kettleborough, Jamie wrote:
>>>>
>>>>                  I created this table by: looking at each
>>>> gateway, figuring out which
>>>>                  modelling institutes contributed to the CMIP5
>>>> project, selecting a
>>>>                  sample data-set, creating a wget script, and
>>>> then inspecting the url in
>>>>                  the script.  (I couldn't get to any NCC data
>>>> as I didn't have access).
>>>>                  I only sampled one dataset.
>>>>
>>>>                  This feels a bit long winded - what is the
>>>> expected way to do this?
>>>>                  Although today I was just gathering
>>>> information on what data nodes are
>>>>                  out there I can imagine this as a part of a
>>>> real life use case (a very
>>>>                  common use case).  If I want to gather a
>>>> diagnostic, such as monthly
>>>>                  mean surface temperature from as many models
>>>> as I can, I think I'd have
>>>>                  to do this sort of trawling.  OK I maybe only
>>>> have to do the initial
>>>>                  mapping of institute to data node once, but I
>>>> think there is still a
>>>>                  trawl needed between gateways to get the
>>>> data.  I may be missing
>>>>                  something - and I took some unnecessary
>>>> steps. Please let me know if
>>>>                  this is the case.  Estani, Martin, Sebastien
>>>> - sounds like you have
>>>>                  already started to do this sort of thing?
>>>>
>>>>                  I also note that not all gateways know about
>>>> all institutes - I think
>>>>                  this is a known problem.  For instance PCMDI
>>>> doesn't know about IPSL,
>>>>                  and only NCI seems to know about CSIRO. Any
>>>> ideas when this might be
>>>>                  resolved?
>>>>
>>>>
>>>>
>>>>
>>>>          --
>>>>          Gavin M. Bell
>>>>          Lawrence Livermore National Labs
>>>>          --
>>>>
>>>>           "Never mistake a clear view for a short distance."
>>>>                         -Paul Saffo
>>>>
>>>>          (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>
>>>>           A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>>>>
>>>>
>>>> _______________________________________________
>>>> GO-ESSP-TECH mailing list
>>>> GO-ESSP-TECH at ucar.edu
>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de


