[Go-essp-tech] Checksums on data nodes
martin.juckes at stfc.ac.uk
Fri Jul 8 10:35:06 MDT 2011
Hi,
I also have a number of files which are larger than the advertised size in the catalogue -- probably as a result of using wget -c to restart broken transfers. I guess I'll end up copying these again.
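
A minimal sketch of the kind of check that would flag these files, assuming the expected size and an MD5 checksum have already been read from the catalogue entry. The name check_download and its return labels are made up for illustration and are not part of any ESG tool; MD5 is also an assumption, since the checksum type is not stated in this thread.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(path, expected_size, expected_md5):
    """Classify a downloaded file against its catalogue entry (hypothetical helper).

    expected_size (bytes) and expected_md5 (hex string) are assumed to come
    from the catalogue; how they are obtained is outside this sketch.
    """
    actual_size = os.path.getsize(path)
    if actual_size < expected_size:
        return "incomplete"   # transfer stopped early; could be resumed
    if actual_size > expected_size:
        return "oversized"    # e.g. a restarted transfer appended extra bytes
    # Only pay for the checksum once the cheap size test has passed.
    if file_md5(path) != expected_md5.lower():
        return "corrupt"      # right size, wrong content
    return "ok"
```

Checking the size first is cheap and catches truncated or oversized files straight away; the checksum is then only computed for files whose size already matches.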
Cheers,
Martin
> >-----Original Message-----
> >From: Estanislao Gonzalez [mailto:gonzalez at dkrz.de]
> >Sent: 08 July 2011 11:36
> >To: Juckes, Martin (STFC,RAL,RALSP)
> >Cc: jamie.kettleborough at metoffice.gov.uk; gavin at llnl.gov; go-essp-tech at ucar.edu
> >Subject: Re: [Go-essp-tech] Checksums on data nodes
> >
> >Hi,
> >
> >indeed this might not be entirely the case here but anyway:
> >While moving the cmip3 Archive from PCMDI (~35T, +74000 Files) I got
> >about ten corrupted files (same size). I got them via gridFTP but using
> >the normal mode, without any parallelization since that wasn't working
> >in our case. The procedure broke a couple of times and I always rolled
> >~1k back before continuing in case the last blocks got scrambled up.
> >I've successfully tested this a lot of times and I'm sure it hasn't
> >caused the file corruption (in which case the size would have been
> >different anyway). So the error rate I've seen was >0.01% (about ten files in 74,000).
> >
> >The point here is that there is no real way we can guarantee with 100%
> >certainty that all files are the same unless we calculate a checksum.
> >And a bit difference (literally a _bit_ :-) will be almost impossible to
> >spot and will become visible probably only when performing max and min
> >computations.
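
To illustrate the point about a single-bit difference, here is a small sketch using in-memory stand-in data rather than a real model file. The helper flip_bit is hypothetical, and MD5 is assumed only as an example checksum. The damaged copy has exactly the same length as the original, so a size check cannot tell them apart, while the checksums differ.

```python
import hashlib


def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


original = bytes(range(256)) * 4096   # 1 MiB of stand-in data
damaged = flip_bit(original, 12345)   # arbitrary position, for illustration

# The size comparison passes even though the content has changed.
same_size = len(original) == len(damaged)

# The checksum comparison is what exposes the damage.
same_checksum = hashlib.md5(original).digest() == hashlib.md5(damaged).digest()

print("same size:", same_size)
print("same checksum:", same_checksum)
```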
> >
> >Bottom line, it did happen so it will happen again. As an archive we must
> >be sure we have the right files, for example, before moving them to
> >tape... a corrupt file now means a redownload; a corrupt file in 3-5
> >years means the file is lost.
> >
> >My 2c,
> >Estani
> >
> >Am 08.07.2011 10:02, schrieb martin.juckes at stfc.ac.uk:
> >> Hi Jamie,
> >>
> >> I have been checking file size before checking the checksum -- but
> >> I'm afraid I don't have statistics on failure rates.
> >>
> >> I believe that Alan Iwi has had experience (on another project) of
> >> corrupted files showing up with the correct size. This may have been
> >> associated with a parallel FTP client and so not directly relevant to
> >> wget transfers, so I think it is best to be cautious.
> >>
> >> I have data from CNRM which I transferred before they started
> >> publishing checksums -- I need to go through that now and will let you
> >> know the results,
> >>
> >> Cheers,
> >> Martin
> >>
> >>>> -----Original Message-----
> >>>> From: Kettleborough, Jamie
> >>>> [mailto:jamie.kettleborough at metoffice.gov.uk]
> >>>> Sent: 07 July 2011 14:54
> >>>> To: Juckes, Martin (STFC,RAL,RALSP); gavin at llnl.gov
> >>>> Cc: go-essp-tech at ucar.edu; Kettleborough, Jamie
> >>>> Subject: RE: Checksums on data nodes
> >>>>
> >>>> Hello Martin,
> >>>>
> >>>> As you have been pulling data back from different nodes how often has
> >>>> the checksum picked up a corrupt transfer? How often could this
> >>>> corruption have been spotted by just checking the file size?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jamie
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: martin.juckes at stfc.ac.uk [mailto:martin.juckes at stfc.ac.uk]
> >>>>> Sent: 06 July 2011 22:43
> >>>>> To: Kettleborough, Jamie; gavin at llnl.gov
> >>>>> Cc: go-essp-tech at ucar.edu
> >>>>> Subject: Checksums and PKI access control on data nodes
> >>>>>
> >>>>> Hi Jamie,
> >>>>>
> >>>>> just picking up something on one of your data node
> >>>>> authorization threads.
> >>>>>
> >>>>> I think programmatic access to data requires PKI security --
> >>>>> I don't see any prospect of adequate data access with the
> >>>>> http token approach.
> >>>>>
> >>>>> I think that checksums are also necessary to guarantee data
> >>>>> integrity -- these are given in the THREDDS catalogues of
> >>>>> BADC, IPSL, and CNRM -- and CCCMA is in the process of adding them.
> >>>>>
> >>>>> I aim to continue contacting data nodes over the coming weeks
> >>>>> and hope that there will be steady progress in levelling the
> >>>>> quality of service upwards,
> >>>>>
> >>>>> cheers,
> >>>>> Martin
> >>>>>
> >>>>> ________________________________________
> >>>>> From: go-essp-tech-bounces at ucar.edu
> >>>>> [go-essp-tech-bounces at ucar.edu] on behalf of Kettleborough,
> >>>>> Jamie [jamie.kettleborough at metoffice.gov.uk]
> >>>>> Sent: 05 July 2011 14:48
> >>>>> To: Gavin M. Bell
> >>>>> Cc: go-essp-tech at ucar.edu
> >>>>> Subject: Re: [Go-essp-tech] Data node authorization
> >>>>>
> >>>>> Hello Gavin,
> >>>>>
> >>>>> thanks for this. This looks useful. Any ideas when any
> >>>>> live/production data nodes will have this version of the
> >>>>> service on them? - I couldn't find any (but that's part of
> >>>>> the problem of course). When available how up to date will
> >>>>> the registry be e.g. are there constraints on it like it will
> >>>>> only know about data nodes running the same releases?
> >>>>>
> >>>>> I know you were just answering my tangent. But I think the
> >>>>> original question is still only half answered. As I
> >>>>> understand it there are two ways this might go:
> >>>>>
> >>>>> 1. all data nodes upgrade to the PKI infrastructure
> >>>>>
> >>>>> 2. the ESGF continues to support (for some time) both PKI and
> >>>>> the HTTP query string token (I don't know the right name for
> >>>>> this, sorry).
> >>>>>
> >>>>> (there is a 3rd option of everyone moving to just the HTTP
> >>>>> query string token - but I don't think that is really under
> >>>>> discussion).
> >>>>>
> >>>>> My guess is that 2. is the most likely outcome and data users
> >>>>> will have to cope with both. So...
> >>>>>
> >>>>> 1. How do you programmatically get data using the HTTP query
> >>>>> string token (I think Martin is following this up with Bob -
> >>>>> can we have a summary posted to the list?)
> >>>>>
> >>>>> 2. How does a user know which method to use for which nodes?
> >>>>> (This may be in the data-node registry, when available, but
> >>>>> it wasn't obvious to me from the sample Luca sent round -
> >>>>> again I may be missing something though).
> >>>>>
> >>>>> Apologies if I'm coming across as over demanding here - I
> >>>>> realise I'm coming to this discussion relatively late in the
> >>>>> day. Just I'm aware that we have scientists who want to get
> >>>>> data so they can start the analysis and writing of multi
> >>>>> model papers in time for the 1st draft of the AR5. At the
> >>>>> moment I'm really uncertain on how they can get the data
> >>>>> while minimising the effort they have to put into finding and
> >>>>> fetching it.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jamie
> >>>>>
> >>>>>
> >>>>> ________________________________
> >>>>>
> >>>>> From: Gavin M. Bell [mailto:gavin at llnl.gov]
> >>>>> Sent: 01 July 2011 20:35
> >>>>> To: Kettleborough, Jamie
> >>>>> Cc: Cinquini, Luca (3880); go-essp-tech at ucar.edu
> >>>>> Subject: Re: [Go-essp-tech] Data node authorization
> >>>>>
> >>>>>
> >>>>> Hello Jamie,
> >>>>>
> >>>>> Allow me to solely indulge your tangent for a moment... :-)
> >>>>>
> >>>>> The issue of knowing who is where etc. is solved by
> >>>>> using a sufficiently recent version of the ESGF "data" Node
> >>>>> (v0.5.1+).
> >>>>> The node-manager's registry component will
> >>>>> automatically generate a continuously updating descriptive
> >>>>> (xml) document of nodes currently present in the federation
> >>>>> at a given time. This would have ameliorated your task
> >>>>> considerably.
> >>>>> If you look at the sites you have collected; go to
> >>>>> the esgf-node-manager page and look at the bottom left corner
> >>>>> for the version.
> >>>>> They are all earlier than v0.5.1 and hence do not
> >>>>> have the automatic federation feature in place.
> >>>>>
> >>>>> Ex:
> >>>>> http://esgnode1.nci.org.au/esgf-node-manager/ (v0.5.0)
> >>>>> http://vesg.ipsl.fr/esgf-node-manager/ (v0.4.0)
> >>>>> http://esg.cnrm-game-meteo.fr/esgf-node-manager/ (v0.4.0)
> >>>>> http://dap.cccma.uvic.ca/esgf-node-manager/ (v0.5.0)
> >>>>> http://cmip-dn.badc.rl.ac.uk/esgf-node-manager/ (v0.4.0)
> >>>>>
> >>>>> (NASA-GISS are not running a node manager at all)
> >>>>>
> >>>>> If you look at more recent node installations
> >>>>> (version 0.5.1+) you will see that there is a
> >>>>> registration.xml document that is served under
> >>>>> esgf-node-manager. It is an active document that is
> >>>>> automatically updated by the node manager's registry service
> >>>>> to always reflect the current state of the federation.
> >>>>> This is a feature of the new ESGF Node. Gateways are
> >>>>> not running node managers so they are not present in the
> >>>>> registration.xml document. However, you can find out about
> >>>>> gateways indirectly by looking at the ESGF Node's
> >>>>> registration entry and looking at the attribute "adminPeer";
> >>>>> this indicates that node's target IDP service, which in older
> >>>>> ESG parlance indicates a "gateway". The new ESGF Nodes are
> >>>>> built based on a modular component architecture such that
> >>>>> sets of components embody functionality; these are what we
> >>>>> call ESGF Node "types". There are 4 node types. The node
> >>>>> type that is currently being installed is the well known
> >>>>> "data" type a.k.a. the "data node"; the other types are not
> >>>>> mutually exclusive and extend the ESGF Node's functionality to
> >>>>> include familiar features such as:
> >>>>> - User credential management and single sign on support
> >>>>> - Attribute management
> >>>>> - Enhanced Federation-wide searching (with new search
> >>>>> front-end)
> >>>>>
> >>>>> As well as recent features since v0.5.1 and pending
> >>>>> features coming on line such as:
> >>>>> - Automatic fail-over and fault tolerance
> >>>>> - New administrative front ends
> >>>>> - Computation / Visualization tools
> >>>>> - and more...
> >>>>>
> >>>>> I would suggest upgrading :-).
> >>>>>
> >>>>> The installation/upgrading process has been
> >>>>> streamlined to make things more straightforward - and the
> >>>>> team and I are always glad to help if needed. There are
> >>>>> further enhancements in the queue that will further
> >>>>> streamline the process to make installation/upgrading as
> >>>>> turn-key as possible. There are also enhancements to the
> >>>>> federation protocol and new features as well, that will soon
> >>>>> be available in an upcoming v0.5.3 release that is currently in
> >>>>> test.
> >>>>> FYI:
> >>>>> The current installer installs the ESGF Node at v0.5.1.
> >>>>> In staging is v0.5.2
> >>>>> In test is v0.5.3.
> >>>>>
> >>>>> Note: The versions listed above are those of the node manager
> >>>>> component.
> >>>>> As it is a component of the ESGF Node, the node itself has its
> >>>>> own version, currently ESG Node v1.0.4+ (Stuyvesant release).
> >>>>>
> >>>>> The new ESGF Node augments the data node and is a
> >>>>> complete solution in and of itself while being compatible
> >>>>> with the current Gateway. It should be considered a useful
> >>>>> tool to help the climate community and an addition to the ESG
> >>>>> ecosystem of utilities :-).
> >>>>>
> >>>>> Whew... (that was a long email)
> >>>>> I hope this was somewhat useful information in the
> >>>>> context of your tangent. :-)
> >>>>>
> >>>>>
> >>>>> On 7/1/11 6:49 AM, Kettleborough, Jamie wrote:
> >>>>>
> >>>>> I created this table by: looking at each
> >>>>> gateway, figuring out which
> >>>>> modelling institutes contributed to the CMIP5
> >>>>> project, selecting a
> >>>>> sample data-set, creating a wget script, and
> >>>>> then inspecting the url in
> >>>>> the script. (I couldn't get to any NCC data
> >>>>> as I didn't have access).
> >>>>> I only sampled one dataset.
> >>>>>
> >>>>> This feels a bit long winded - what is the
> >>>>> expected way to do this?
> >>>>> Although today I was just gathering
> >>>>> information on what data nodes are
> >>>>> out there I can imagine this as a part of a
> >>>>> real life use case (a very
> >>>>> common use case). If I want to gather a
> >>>>> diagnostic, such as monthly
> >>>>> mean surface temperature from as many models
> >>>>> as I can, I think I'd have
> >>>>> to do this sort of trawling. OK I maybe only
> >>>>> have to do the initial
> >>>>> mapping of institute to data node once, but I
> >>>>> think there is still a
> >>>>> trawl needed between gateways to get the
> >>>>> data. I may be missing
> >>>>> something - and I took some unnecessary
> >>>>> steps. Please let me know if
> >>>>> this is the case. Estani, Martin, Sebastien
> >>>>> - sounds like you have
> >>>>> already started to do this sort of thing?
> >>>>>
> >>>>> I also note that not all gateways know about
> >>>>> all institutes - I think
> >>>>> this is a known problem. For instance PCMDI
> >>>>> doesn't know about IPSL,
> >>>>> and only NCI seems to know about CSIRO. Any
> >>>>> ideas when this might be
> >>>>> resolved?
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Gavin M. Bell
> >>>>> Lawrence Livermore National Labs
> >>>>> --
> >>>>>
> >>>>> "Never mistake a clear view for a short distance."
> >>>>> -Paul Saffo
> >>>>>
> >>>>> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
> >>>>>
> >>>>> A796 CE39 9C31 68A4 52A7 1F6B 66B7 B250 21D5 6D3E
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> GO-ESSP-TECH mailing list
> >>>>> GO-ESSP-TECH at ucar.edu
> >>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>> --
> >>>>> Scanned by iCritical.
> >>>>>
> >
> >
> >--
> >Estanislao Gonzalez
> >
> >Max-Planck-Institut für Meteorologie (MPI-M)
> >Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >
> >Phone: +49 (40) 46 00 94-126
> >E-Mail: gonzalez at dkrz.de