[Go-essp-tech] Checksums on data nodes

martin.juckes at stfc.ac.uk
Fri Jul 8 10:35:06 MDT 2011


Hi,

I also have a number of files which are larger than the advertised size in the catalogue -- probably as a result of using wget -c to restart broken transfers. I guess I'll end up copying these again.

Cheers,
Martin
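
A minimal sketch of the size-then-checksum check discussed in this thread. It assumes Python, and it assumes the catalogue publishes a hex digest alongside the file size; the thread does not say which algorithm the THREDDS catalogues use, so MD5 is only a default here. The names verify_download, expected_size and expected_checksum are hypothetical and chosen for illustration.

```python
import hashlib
import os


def verify_download(path, expected_size, expected_checksum, algorithm="md5"):
    """Return "ok", "size_mismatch" or "checksum_mismatch" for one file.

    expected_size and expected_checksum are the values advertised in the
    catalogue; algorithm is an assumption and must match what the data
    node actually publishes.
    """
    # Cheap check first: a file that ended up larger (or smaller) than the
    # advertised size is caught without reading its contents.
    if os.path.getsize(path) != expected_size:
        return "size_mismatch"

    # A matching size does not imply matching content, so hash the file
    # in chunks to keep memory use flat for multi-gigabyte files.
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)

    if digest.hexdigest() != expected_checksum.lower():
        return "checksum_mismatch"
    return "ok"
```

A file flagged "size_mismatch" after a restarted transfer would be a candidate for a clean re-download rather than another resume; a "checksum_mismatch" at the correct size is the case that a size check alone cannot catch.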

> >-----Original Message-----
> >From: Estanislao Gonzalez [mailto:gonzalez at dkrz.de]
> >Sent: 08 July 2011 11:36
> >To: Juckes, Martin (STFC,RAL,RALSP)
> >Cc: jamie.kettleborough at metoffice.gov.uk; gavin at llnl.gov;
> >go-essp-tech at ucar.edu
> >Subject: Re: [Go-essp-tech] Checksums on data nodes
> >
> >Hi,
> >
> >indeed this might not be entirely the case here but anyway:
> >While moving the CMIP3 archive from PCMDI (~35T, 74000+ files) I got
> >about ten corrupted files (same size). I got them via GridFTP but using
> >the normal mode, without any parallelization since that wasn't working
> >in our case. The procedure broke a couple of times and I always rolled
> >~1k back before continuing in case the last blocks got scrambled up.
> >I've successfully tested this a lot of times and I'm sure it hasn't
> >caused the file corruption (in which case the size would have been
> >different anyway). So the error rate I've seen was >0.02%.
> >
> >The point here is that there is no real way we can guarantee with 100%
> >certainty that all files are the same unless we calculate a checksum.
> >And a bit difference (literally a _bit_ :-) will be almost impossible
> >to spot and will become visible probably only when performing max and
> >min computations.
> >
> >Bottom line, it did happen so it will happen again. As an archive we must
> >be sure we have the right files, for example, before moving them to
> >tape... a corrupt file now means a redownload; a corrupt file in 3-5
> >years means the file is lost.
> >
> >My 2c,
> >Estani
> >
> >Am 08.07.2011 10:02, schrieb martin.juckes at stfc.ac.uk:
> >> Hi Jamie,
> >>
> >> I have been checking file size before checking the checksum -- but
> >> I'm afraid I don't have statistics on failure rates.
> >>
> >> I believe that Alan Iwi has had experience (on another project) of
> >> corrupted files showing up with the correct size. This may have been
> >> associated with a parallel FTP client and so not directly relevant to
> >> wget transfers, so I think it is best to be cautious.
> >>
> >> I have data from CNRM which I transferred before they started
> >> publishing checksums -- I need to go through that now and will let you
> >> know the results,
> >>
> >> Cheers,
> >> Martin
> >>
> >>>> -----Original Message-----
> >>>> From: Kettleborough, Jamie
> >>>> [mailto:jamie.kettleborough at metoffice.gov.uk]
> >>>> Sent: 07 July 2011 14:54
> >>>> To: Juckes, Martin (STFC,RAL,RALSP); gavin at llnl.gov
> >>>> Cc: go-essp-tech at ucar.edu; Kettleborough, Jamie
> >>>> Subject: RE: Checksums on data nodes
> >>>>
> >>>> Hello Martin,
> >>>>
> >>>> As you have been pulling data back from different nodes how often has
> >>>> the checksum picked up a corrupt transfer?  How often could this
> >>>> corruption have been spotted by just checking the file size?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jamie
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: martin.juckes at stfc.ac.uk [mailto:martin.juckes at stfc.ac.uk]
> >>>>> Sent: 06 July 2011 22:43
> >>>>> To: Kettleborough, Jamie; gavin at llnl.gov
> >>>>> Cc: go-essp-tech at ucar.edu
> >>>>> Subject: Checksums and PKI access control on data nodes
> >>>>>
> >>>>> Hi Jamie,
> >>>>>
> >>>>> just picking up something on one of your data node
> >>>>> authorization threads.
> >>>>>
> >>>>> I think programmatic access to data requires PKI security --
> >>>>> I don't see any prospect of adequate data access with the
> >>>>> http token approach.
> >>>>>
> >>>>> I think that checksums are also necessary to guarantee data
> >>>>> integrity -- these are given in the THREDDS catalogues of
> >>>>> BADC, IPSL, and CNRM -- and CCCMA is in the process of adding
> >>>>> them.
> >>>>>
> >>>>> I aim to continue contacting data nodes over the coming weeks
> >>>>> and hope that there will be steady progress in levelling the
> >>>>> quality of service upwards,
> >>>>>
> >>>>> cheers,
> >>>>> Martin
> >>>>>
> >>>>> ________________________________________
> >>>>> From: go-essp-tech-bounces at ucar.edu
> >>>>> [go-essp-tech-bounces at ucar.edu] on behalf of Kettleborough,
> >>>>> Jamie [jamie.kettleborough at metoffice.gov.uk]
> >>>>> Sent: 05 July 2011 14:48
> >>>>> To: Gavin M. Bell
> >>>>> Cc: go-essp-tech at ucar.edu
> >>>>> Subject: Re: [Go-essp-tech] Data node authorization
> >>>>>
> >>>>> Hello Gavin,
> >>>>>
> >>>>> thanks for this.  This looks useful.  Any ideas when any
> >>>>> live/production data nodes will have this version of the
> >>>>> service on them? - I couldn't find any (but that's part of
> >>>>> the problem of course). When available how up to date will
> >>>>> the registry be, e.g. are there constraints on it like it will
> >>>>> only know about data nodes running the same releases?
> >>>>>
> >>>>> I know you were just answering my tangent.  But I think the
> >>>>> original question is still only half answered.  As I
> >>>>> understand it there are two ways this might go:
> >>>>>
> >>>>> 1. all data nodes upgrade/change to the PKI infrastructure
> >>>>>
> >>>>> 2. the ESGF continues to support (for some time) both PKI and
> >>>>> the HTTP query string token (I don't know the right name for
> >>>>> this, sorry).
> >>>>>
> >>>>> (there is a 3rd option of everyone moving to just the HTTP
> >>>>> query string token - but I don't think that is really under
> >>>>> discussion).
> >>>>>
> >>>>> My guess is that 2. is the most likely outcome and data users
> >>>>> will have to cope with both.  So...
> >>>>>
> >>>>> 1. How do you programmatically get data using the HTTP query
> >>>>> string token (I think Martin is following this up with Bob -
> >>>>> can we have a summary posted to the list?)
> >>>>>
> >>>>> 2. How does a user know which method to use for which nodes?
> >>>>> (This may be in the data-node registry, when available, but
> >>>>> it wasn't obvious to me from the sample Luca sent round -
> >>>>> again I may be missing something though).
> >>>>>
> >>>>> Apologies if I'm coming across as over demanding here - I
> >>>>> realise I'm coming to this discussion relatively late in the
> >>>>> day.  Just I'm aware that we have scientists who want to get
> >>>>> data so they can start the analysis and writing of multi
> >>>>> model papers in time for the 1st draft of the AR5. At the
> >>>>> moment I'm really uncertain on how they can get the data while
> >>>>> minimising the effort they have to put into finding and
> >>>>> fetching it.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jamie
> >>>>>
> >>>>>
> >>>>> ________________________________
> >>>>>
> >>>>>          From: Gavin M. Bell [mailto:gavin at llnl.gov]
> >>>>>          Sent: 01 July 2011 20:35
> >>>>>          To: Kettleborough, Jamie
> >>>>>          Cc: Cinquini, Luca (3880); go-essp-tech at ucar.edu
> >>>>>          Subject: Re: [Go-essp-tech] Data node authorization
> >>>>>
> >>>>>
> >>>>>          Hello Jamie,
> >>>>>
> >>>>>          Allow me to solely indulge your tangent for a moment... :-)
> >>>>>
> >>>>>          The issue of knowing who is where etc. is solved by
> >>>>> using a sufficiently recent version of the  ESGF "data" Node
> >>>>> (v0.5.1+).
> >>>>>          The node-manager's registry component will
> >>>>> automatically generate a continuously updating descriptive
> >>>>> (xml) document of nodes currently present in the federation
> >>>>> at a given time.  This would have ameliorated your task
> >>>>> considerably.
> >>>>>          If you look at the sites you have collected; go to
> >>>>> the esgf-node-manager page and look at the bottom left corner
> >>>>> for the version.
> >>>>>          They are all earlier than v0.5.1 and hence do not
> >>>>> have the automatic federation feature in place.
> >>>>>
> >>>>>          Ex:
> >>>>>          http://esgnode1.nci.org.au/esgf-node-manager/  (v0.5.0)
> >>>>>          http://vesg.ipsl.fr/esgf-node-manager/  (v0.4.0)
> >>>>>          http://esg.cnrm-game-meteo.fr/esgf-node-manager/  (v0.4.0)
> >>>>>          http://dap.cccma.uvic.ca/esgf-node-manager/  (v0.5.0)
> >>>>>          http://cmip-dn.badc.rl.ac.uk/esgf-node-manager/  (v0.4.0)
> >>>>>
> >>>>>          (NASA-GISS are not running a node manager at all)
> >>>>>
> >>>>>          If you look at more recent node installations
> >>>>> (version 0.5.1+) you will see that there is a
> >>>>> registration.xml document that is served under
> >>>>> esgf-node-manager.  It is an active document that is
> >>>>> automatically updated by the node manager's registry service
> >>>>> to always reflect the current state of the federation.
> >>>>>          This is a feature of the new ESGF Node.  Gateways are
> >>>>> not running node managers so they are not present in the
> >>>>> registration.xml document.  However, you can find out about
> >>>>> gateways indirectly by looking at the ESGF Node's
> >>>>> registration entry and looking at the attribute "adminPeer";
> >>>>> this indicates that node's target IDP service, which in older
> >>>>> ESG parlance indicates a "gateway".  The new ESGF Nodes are
> >>>>> built based on a modular component architecture such that
> >>>>> sets of components embody functionality, these are what we
> >>>>> call ESGF Node "types".  There are 4 node types. The node
> >>>>> type that is currently being installed is the well known
> >>>>> "data" type a.k.a the "data node", the other types are not
> >>>>> mutually exclusive and extend the ESGF Nodes functionality to
> >>>>> include familiar features such as:
> >>>>>          - User credential management and single sign on support
> >>>>>          - Attribute management
> >>>>>          - Enhanced Federation-wide searching (with new search
> >>>>> front-end)
> >>>>>
> >>>>>          As well as recent features since v0.5.1 and pending
> >>>>> features coming on line such as:
> >>>>>          - Automatic fail-over and fault tolerance
> >>>>>          - New administrative front ends
> >>>>>          - Computation / Visualization tools
> >>>>>          - and more...
> >>>>>
> >>>>>          I would suggest upgrading :-).
> >>>>>
> >>>>>          The installation/upgrading process has been
> >>>>> streamlined to make things more straightforward - and the
> >>>>> team and I are always glad to help if needed.  There are
> >>>>> further enhancements in the queue that will further
> >>>>> streamline the process to make installation/upgrading as
> >>>>> turn-key as possible.  There are also enhancements to the
> >>>>> federation protocol and new features as well, that will soon
> >>>>> be available in an upcoming v0.5.3 release that is currently in
> >>>>> test.
> >>>>>          FYI:
> >>>>>          The current installer installs the ESGF Node at v0.5.1.
> >>>>>          In staging is v0.5.2
> >>>>>          In test is v0.5.3.
> >>>>>
> >>>>>          Note: The versions listed above are those of the node manager
> >>>>> component.
> >>>>> As it is a component of the ESGF Node, the node itself has a
> >>>>> version, currently ESG Node v1.0.4+ (Stuyvesant release).
> >>>>>
> >>>>>          The new ESGF Node augments the data node and is a
> >>>>> complete solution in and of itself while being compatible
> >>>>> with the current Gateway.  It should be considered a useful
> >>>>> tool to help the climate community and adding to the ESG
> >>>>> ecosystem of utilities :-).
> >>>>>
> >>>>>          Whew... (that was a long email)
> >>>>>          I hope this was somewhat useful information in the
> >>>>> context of your tangent. :-)
> >>>>>
> >>>>>
> >>>>>          On 7/1/11 6:49 AM, Kettleborough, Jamie wrote:
> >>>>>
> >>>>>                  I created this table by: looking at each
> >>>>> gateway, figuring out which
> >>>>>                  modelling institutes contributed to the CMIP5
> >>>>> project, selecting a
> >>>>>                  sample data-set, creating a wget script, and
> >>>>> then inspecting the url in
> >>>>>                  the script.  (I couldn't get to any NCC data
> >>>>> as I didn't have access).
> >>>>>                  I only sampled one dataset.
> >>>>>
> >>>>>                  This feels a bit long winded - what is the
> >>>>> expected way to do this?
> >>>>>                  Although today I was just gathering
> >>>>> information on what data nodes are
> >>>>>                  out there I can imagine this as a part of a
> >>>>> real life use case (a very
> >>>>>                  common use case).  If I want to gather a
> >>>>> diagnostic, such as monthly
> >>>>>                  mean surface temperature from as many models
> >>>>> as I can, I think I'd have
> >>>>>                  to do this sort of trawling.  OK I maybe only
> >>>>> have to do the initial
> >>>>>                  mapping of institute to data node once, but I
> >>>>> think there is still a
> >>>>>                  trawl needed between gateways to get the
> >>>>> data.  I may be missing
> >>>>>                  something - and I took some unnecessary
> >>>>> steps. Please let me know if
> >>>>>                  this is the case.  Estani, Martin, Sebastien
> >>>>> - sounds like you have
> >>>>>                  already started to do this sort of thing?
> >>>>>
> >>>>>                  I also note that not all gateways know about
> >>>>> all institutes - I think
> >>>>>                  this is a known problem.  For instance PCMDI
> >>>>> doesn't know about IPSL,
> >>>>>                  and only NCI seems to know about CSIRO. Any
> >>>>> ideas when this might be
> >>>>>                  resolved?
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>          --
> >>>>>          Gavin M. Bell
> >>>>>          Lawrence Livermore National Labs
> >>>>>          --
> >>>>>
> >>>>>           "Never mistake a clear view for a short distance."
> >>>>>                         -Paul Saffo
> >>>>>
> >>>>>          (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
> >>>>>
> >>>>>           A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> GO-ESSP-TECH mailing list
> >>>>> GO-ESSP-TECH at ucar.edu
> >>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>> --
> >>>>> Scanned by iCritical.
> >>>>>
> >
> >
> >--
> >Estanislao Gonzalez
> >
> >Max-Planck-Institut für Meteorologie (MPI-M)
> >Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >
> >Phone:   +49 (40) 46 00 94-126
> >E-Mail:  gonzalez at dkrz.de
