[Go-essp-tech] Verifying files

martin.juckes at stfc.ac.uk martin.juckes at stfc.ac.uk
Wed Mar 21 11:08:45 MDT 2012


OK, I used it because I had some existing code that worked -- it may be time to move on (and not too hard, as it is only a handful of lines using the library),

Cheers,
Martin

> >-----Original Message-----
> >From: Pascoe, Stephen (STFC,RAL,RALSP)
> >Sent: 21 March 2012 16:55
> >To: Juckes, Martin (STFC,RAL,RALSP); Estanislao Gonzalez
> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de;
> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov;
> >Drach1 at llnl.gov; go-essp-tech at ucar.edu
> >Subject: RE: Verifying files
> >
> >OK, I must have picked up a deprecated package somewhere.  I'll take a
> >look later.
> >
> >(PyXml is very old though -- I'd recommend using lxml and standard
> >python libraries exclusively)
> >
> >S.
> >
> >---
> >Stephen Pascoe  +44 (0)1235 445980
> >Centre of Environmental Data Archival
> >STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX,
> >UK
> >
> >
> >-----Original Message-----
> >From: Juckes, Martin (STFC,RAL,RALSP)
> >Sent: 21 March 2012 16:52
> >To: Pascoe, Stephen (STFC,RAL,RALSP); Estanislao Gonzalez
> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de;
> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov;
> >Drach1 at llnl.gov; go-essp-tech at ucar.edu
> >Subject: RE: Verifying files
> >
> >Hi,
> >
> >Yes, I have PyXml 0.8.4 to be exact, working with python 2.6 -- this
> >is on cmip-ingest1.badc.rl.ac.uk.
> >
> >I don't think the load on servers will be significant -- catalogues
> >don't have to be accessed very often.
> >
> >Keeping a list of data nodes should not be a problem in the very short
> >term -- though I agree that a cleaner solution implemented through the
> >P2P index node is highly desirable.
> >
> >Cheers,
> >Martin
> >
> >> >-----Original Message-----
> >> >From: Pascoe, Stephen (STFC,RAL,RALSP)
> >> >Sent: 21 March 2012 16:43
> >> >To: Juckes, Martin (STFC,RAL,RALSP); Estanislao Gonzalez
> >> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk;
> >lautenschlager at dkrz.de;
> >> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr;
> >gavin at llnl.gov;
> >> >Drach1 at llnl.gov; go-essp-tech at ucar.edu
> >> >Subject: RE: Verifying files
> >> >
> >> >Hi Martin,
> >> >
> >> >I'm having trouble with this code.  The xml.xpath module isn't in
> >> >lxml or the standard library, so I tried installing pyxml, which I
> >> >assume is what you've installed.  Unfortunately pyxml appears
> >> >incompatible with Python 2.5+, since it uses the reserved token "as"
> >> >as a variable name.
> >> >
> >> >See http://stackoverflow.com/questions/4953600/pyxml-on-ubuntu for
> >> >an explanation.
> >> >
> >> >Therefore this will need refactoring to use lxml's xpath support.
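The refactor away from xml.xpath is small; a minimal sketch of the catalogue query follows. Hedged assumptions: the namespace and the `checksum` property name follow the usual THREDDS InvCatalog 1.0 / ESGF conventions but are not confirmed by the script itself, and the standard library's ElementTree is used here so the snippet needs nothing beyond core Python -- the same limited XPath subset works unchanged with lxml's `findall()` or fuller `xpath()`.

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents (an assumption here;
# check the catalog's root element if in doubt).
NS = {"tds": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

def find_checksum(catalog_xml, file_name):
    """Return the checksum property recorded for file_name, or None."""
    root = ET.fromstring(catalog_xml)
    # One element per step, one predicate per step -- this XPath subset is
    # supported by both ElementTree and lxml.
    for ds in root.findall(".//tds:dataset[@name='%s']" % file_name, NS):
        prop = ds.find("tds:property[@name='checksum']", NS)
        if prop is not None:
            return prop.get("value")
    return None

# Tiny hand-written catalog fragment for illustration (file name invented).
sample = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="tas_Amon_example_rcp45_r1i1p1_200601-210012.nc">
    <property name="checksum" value="d41d8cd98f00b204e9800998ecf8427e"/>
  </dataset>
</catalog>"""

print(find_checksum(sample, "tas_Amon_example_rcp45_r1i1p1_200601-210012.nc"))
# -> d41d8cd98f00b204e9800998ecf8427e
```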
> >> >
> >> >As an immediate fix this approach would help.  The script is going
> >> >to break whenever the list of datanodes changes (one of the key
> >> >criticisms Reto made of IPSL's tool), and widespread use would put
> >> >more load on the datanodes.  That's why I think we need a separate
> >> >catalog doc.  We could then put all catalogs under a single
> >> >lightweight HTTP server and make a script to call that.
> >> >
> >> >Stephen.
> >> >
> >> >---
> >> >Stephen Pascoe  +44 (0)1235 445980
> >> >Centre of Environmental Data Archival
> >> >STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
> >0QX,
> >> >UK
> >> >
> >> >
> >> >-----Original Message-----
> >> >From: Juckes, Martin (STFC,RAL,RALSP)
> >> >Sent: 21 March 2012 15:05
> >> >To: Estanislao Gonzalez
> >> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk;
> >lautenschlager at dkrz.de;
> >> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr;
> >gavin at llnl.gov;
> >> >Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP); go-essp-
> >> >tech at ucar.edu
> >> >Subject: Verifying files
> >> >
> >> >Hi Estani, and others,
> >> >
> >> >Attached is a simple script to find the catalogued checksum, size
> >> >and tid of any given CMIP5 data file (though it probably won't work
> >> >for GFDL gridspec files).
> >> >
> >> >System requirements are the python libxml2 and xml libraries.
> >> >
> >> >usage:
> >> >python check_file.py <file name>
> >> >
> >> >It gets over the problems Estani raises below by searching the
> >> >top-level catalogue for a dataset that matches rather than trying to
> >> >construct it. In some cases it may find two (output1 and output2) and
> >> >then needs to search both for the file specification.
> >> >
> >> >At present it is a proof of concept -- it only finds the checksum
> >> >and does not go on to actually check it against the file.
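The missing verification step is essentially a checksum comparison; a minimal sketch, assuming (as Reto's report below suggests) that the catalogued checksums are hexadecimal MD5 digests:

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in 1 MB chunks so multi-GB CMIP5 files
    don't have to fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            md5.update(block)
    return md5.hexdigest()

def verify(path, catalogued_checksum):
    """True if the local file matches the checksum found in the catalogue.
    Normalises case and whitespace, since catalog entries vary."""
    return md5_of(path) == catalogued_checksum.strip().lower()
```

The chunked read is the only real design point: CMIP5 files can be many gigabytes, so hashing them in one `read()` is not an option.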
> >> >
> >> >Does this look like a useful tool? If people think it is useful,
> >> >there are a few points which need tidying up to get reasonable
> >> >efficiency, and feedback on likely usage patterns would be welcome.
> >> >
> >> >cheers,
> >> >Martin
> >> >
> >> >
> >> >________________________________________
> >> >From: Estanislao Gonzalez [gonzalez at dkrz.de]
> >> >Sent: 21 March 2012 12:26
> >> >To: Juckes, Martin (STFC,RAL,RALSP)
> >> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk;
> >lautenschlager at dkrz.de;
> >> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr;
> >gavin at llnl.gov;
> >> >Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP)
> >> >
> >> >
> >> >Subject: Re: Hawaii CMIP5 meeting report   (was Re: CMIP5
> >Management
> >> >telco)
> >> >
> >> >Hi Martin,
> >> >
> >> >thanks for the feedback.
> >> >
> >> >A short comment regarding the service you described: it's not as
> >> >easy as it sounds, since you have to generate the dataset name from
> >> >the file name, and AFAIK this requires the cmor tables -- probably
> >> >the cmor tables as they were at the time the dataset was created, to
> >> >cover all cases (but 99% of them should be covered anyway, so I
> >> >think it's fine).
> >> >Another problem is the product, which can't be determined precisely
> >> >without information from the models (again 99% of all cases should
> >> >work).
> >> >There are still some issues with the versions and the data node
> >> >serving that dataset, but I guess it's doable.
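For what it's worth, the components that *can* be read straight off a standard CMIP5 filename can be sketched as below. The regex and the example name are illustrative only, and -- exactly as noted above -- the remaining dataset-name components (product, institute, frequency, realm) still need the CMOR tables or a search service:

```python
import re

# CMIP5 filename convention (assumption: standard DRS-style names):
#   <variable>_<mip_table>_<model>_<experiment>_<ensemble>[_<temporal_range>].nc
FILENAME_RE = re.compile(
    r"^(?P<variable>[^_]+)_(?P<table>[^_]+)_(?P<model>[^_]+)_"
    r"(?P<experiment>[^_]+)_(?P<ensemble>r\d+i\d+p\d+)"
    r"(?:_(?P<period>\d+(?:-\d+)?))?\.nc$")

def parse_cmip5_filename(name):
    """Split a CMIP5 filename into its DRS components, or raise ValueError.

    Note what is *not* recoverable here: product, institute, frequency and
    realm have to come from the CMOR tables (or a search service).
    """
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError("not a standard CMIP5 filename: %r" % name)
    return m.groupdict()

print(parse_cmip5_filename("tas_Amon_HadGEM2-ES_rcp45_r1i1p1_200512-203011.nc"))
```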
> >> >
> >> >I still think the best approach for this is to use the P2P search
> >> >capability to reconstruct the dataset from the checksums and then
> >> >resolve back to the latest version as was discussed yesterday.
> >> >But I still think we can't do much more without leaving other
> >things
> >> >behind (at least I can't).
> >> >
> >> >That's why I was hoping to get feedback on a list of "features"
> >that
> >> >we can postpone so we can concentrate on others.
> >> >
> >> >Thanks,
> >> >Estani
> >> >
> >> >On 21.03.2012 13:14, martin.juckes at stfc.ac.uk wrote:
> >> >> Hi Estani,
> >> >>
> >> >> Our user community is dominated by academic researchers, so they
> >> >have no obligation to agree with each other. They are juggling a
> >range
> >> >of priorities, not working to a set of rigid objectives.
> >> >>
> >> >> One of the problems we need to sort out is getting data to people
> >> >quickly. This objective is steadily rising towards the top of the
> >list
> >> >as the deadline for submission of papers for consideration by the
> >IPCC
> >> >approaches.
> >> >>
> >> >> We want to provide quality controlled data, with checksums, with
> >a
> >> >robust and user friendly search interface -- but the community does
> >> >not have the luxury of being able to wait for that to appear.
> >> >>
> >> >> It occurs to me that it would be fairly easy (because security is
> >> >not involved) to write a script to validate a file (or list of
> >files)
> >> >against THREDDS catalogues; this would then allow users who have
> >taken
> >> >shortcuts and used secondary sources to verify their data.
> >> >>
> >> >> cheers,
> >> >> Martin
> >> >> ________________________________
> >> >> From: Estanislao Gonzalez [gonzalez at dkrz.de]
> >> >> Sent: 21 March 2012 10:03
> >> >> To: Karl Taylor
> >> >> Cc: Ben Evans; Williams, Dean N.; Frank Toussaint; Juckes, Martin
> >> >> (STFC,RAL,RALSP); bryan.lawrence at ncas.ac.uk; Michael
> >Lautenschlager;
> >> >> Cinquini, Luca; Stéphane Senesi; Gavin M Bell; Drach, Bob;
> >Pascoe,
> >> >> Stephen (STFC,RAL,RALSP)
> >> >> Subject: Re: Hawaii CMIP5 meeting report (was Re: CMIP5
> >Management
> >> >> telco)
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I have one question/concern regarding sites like the one
> >mentioned
> >> >in Karl's mail. These sites get _some_ data by _unknown_ methods.
> >> >What's the position of the community regarding this?
> >> >>
> >> >> We know we have a lot of problems to sort out, but those sites go
> >> >around the problems by not confronting them at all. For instance,
> >> >AFAICT:
> >> >> 1) The site stores no version information
> >> >> 2) It does not guarantee data is complete or validated against the
> >> >> original sites (sometimes they can't, as the original sites do not
> >> >> provide checksums, or the checksums are wrong)
> >> >> 3) It's not integrated into the search, so people can't get to it
> >by
> >> >> the same means as to other data (new interface)
> >> >> 4) There's no way the user can be notified when something is
> >> >> changed (though there might be a complex architecture behind it)
> >> >> 5) I just wonder if there's any logging of data access at all, so
> >> >> at least it is known who downloaded what and when, for notification
> >> >> and reporting purposes (there might be)
> >> >> 6) And most importantly, it completely bypasses the whole security
> >> >> we have in place (for instance, I see models in there with
> >> >> non-commercial access restrictions... I wonder how those are
> >> >> handled?)
> >> >>
> >> >> Basically, I wonder whether what we are trying to achieve is really
> >> >> what the community wants. I think there's a conflict between what
> >> >> we (my view of _we_ :-) think science should look like and what the
> >> >> scientific community needs.
> >> >> IMHO we have two different ends of the rope here: data quality
> >vs.
> >> >prompt access.
> >> >>
> >> >> Of course we are aiming at both ends at the same time (and this
> >*is*
> >> >> doable), but this slows development at both ends... there's no
> >such
> >> >thing as a free lunch :-) should we keep this development path or
> >> >should we define a new one?
> >> >>
> >> >> I've seen a lot of sites and procedures like this, as well as
> >people
> >> >complaining about not getting the whole CMIP3 archive of CMIP5 data
> >> >(because of the 36TB size) in a snap.
> >> >> I have (almost) no experience in this community; from the very
> >> >> beginning I assumed it was a single community, but it looks like
> >> >> the data producers are not the same as the data consumers (still
> >> >> referring to WG1).
> >> >>
> >> >> Having some information on this would help us (me and others :-) to
> >> >> understand the community as well as to improve development, by
> >> >> guiding it better to satisfy the community's requirements in the
> >> >> order the community expects them to be satisfied.
> >> >>
> >> >> Just my 2c,
> >> >> Estani
> >> >>
> >> >> On 21.03.2012 00:08, Karl Taylor wrote:
> >> >> Hi Ben and all,
> >> >>
> >> >> A short and general report has been prepared summarizing the CMIP5
> >> >> meeting held recently in Hawaii
> >> >> ( http://www.wcrp-climate.org/documents/ezine/WCRPnews_14032012.pdf ).
> >> >> Some of my impressions, which may not be reflected in the official
> >> >> summary, include:
> >> >>
> >> >> 1.  An impressive array of scientific multi-model CMIP5 studies is
> >> >> underway (some in pretty advanced phases), so users seem to be
> >> >> coping with the frustrations of our current ESG.
> >> >>
> >> >> 2.  Two scientists (of about 160) said specifically and publicly
> >> >they could *not* do what they wanted to do because it was so
> >difficult
> >> >and slow to download data.
> >> >>
> >> >> 3.  Privately, several scientists expressed frustrations and
> >asked
> >> >when things would improve.  Nearly everyone understood and
> >appreciated
> >> >the enormity of the challenges and most seem willing to remain
> >patient
> >> >a little longer.
> >> >>
> >> >> 4.  Reto Knutti (a convening? lead author) and Thomas Stocker
> >> >> (co-chair of WG1) are both *counting* on improvements and
> >> >> *concerned* that they will be delayed.  Reto is the one who has put
> >> >> up a website at https://wiki.c2sm.ethz.ch/Wiki/CMIP5 where lots of
> >> >> folks are getting data more easily than through the official ESG
> >> >> sites.  Reto sent me an email summarizing the biggest problems he
> >> >> has getting data to populate his site.  He has identified problems
> >> >> (many of which we're working on), which probably affect all users
> >> >> (summarizing his email copied below):
> >> >> a) incorrect MD5 checksums
> >> >> b) old, incorrect catalog entries at some nodes
> >> >> c) no easy way to report errors (he mentioned "errata websites,
> >> >feeds,
> >> >> email addresses invalid or no responses"; not sure whether he
> >tried
> >> >> the help desk)
> >> >> d) incorrect version numbers (or unclearly labeled)
> >> >> e) gaps in data, overlaps, "strange time coordinates"
> >> >> f) no way to find out when data is found to contain errors
> >> >>
> >> >> 5.  A number of folks volunteered to be first users on the p2p
> >> >system.
> >> >>
> >> >> Obviously in this email the focus is on what needs fixing.  An
> >> >enormous amount has already been accomplished.
> >> >>
> >> >> Please let me know if you are specifically curious about anything
> >> >else that went on at the CMIP5 meeting, and I'll try to respond.
> >> >>
> >> >> Best regards,
> >> >> Karl
> >> >>
> >> >> email from Reto Knutti:
> >> >> Dear Karl,
> >> >>
> >> >> As promised, here's a list of main issues we are encountering
> >with
> >> >CMIP5.
> >> >>
> >> >> - the manual download is slow and unreliable. A tool that is
> >> >scriptable is essential, so that it is easy to find out what is new
> >> >and changed.
> >> >> - the IPSL download tool is reasonably good but fails often due
> >to
> >> >> old and incorrect catalog entries at the different nodes,
> >incorrect
> >> >> MD5 checksums, or missing PKI interface
> >> >> - communication of problems is difficult to impossible (different
> >> >> errata websites, feeds, email addresses invalid or no responses)
> >> >> - no clear data version control, which makes it difficult to find
> >> >out
> >> >> which files are the most recent ones
> >> >>
> >> >> I think the new interface that you are testing could address part
> >of
> >> >the above.
> >> >>
> >> >> But it would also help if PCMDI could communicate clearly to the
> >> >data centers that they need to make sure their catalogues are up to
> >> >date, checksums are correct, and versions are clearly labeled.
> >There
> >> >are further issues that the modeling centers should address (gaps
> >in
> >> >data, overlaps, strange timescales, etc.). I realize that is a lot
> >of
> >> >work for the centers, but if they don't do it they create even more
> >> >work for thousands of people who are trying to analyze the data.
> >> >>
> >> >> Finally, I would like to stress what Thomas mentioned. We need a
> >way
> >> >to find out when data is found to contain errors, but we also need
> >a
> >> >way to give feedback to the modeling groups when we discover
> >issues.
> >> >If you have a list of contact persons from the centers that you
> >could
> >> >provide to us, that could help.
> >> >>
> >> >> Given that scientific papers and the IPCC second order draft are
> >> >written over the next five months, it is important that the above
> >> >points are addressed as quickly as possible. I realize of course
> >the
> >> >constraints that you have at PCMDI, and the fact that some problems
> >> >are not under your control.
> >> >>
> >> >> In any case we appreciate all your efforts and support, and would
> >be
> >> >happy to work with you and help with testing tools etc. if we can.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Reto
> >> >>
> >> >>
> >> >> On 3/12/12 11:36 AM, Ben Evans wrote:
> >> >> Thanks Dean.  That sounds good.
> >> >>
> >> >> Perhaps this is more for Karl and others: it would be helpful to
> >see
> >> >a report of the Hawaii meeting if it were available before then.  I
> >> >will be heading into a local management meeting in two weeks and I
> >> >would like to be up to speed - especially if there are other
> >> >perceptions that came out of the meeting.
> >> >>
> >> >> Best Wishes,
> >> >> Ben
> >> >> --
> >> >> Dr Ben Evans
> >> >> Associate Director (Research Engagement and Initiatives) NCI
> >> >> http://www.nci.org.au/
> >> >> Leonard Huxley Building (#56)
> >> >> The Australian National University
> >> >> Canberra, ACT, 0200 Australia
> >> >> Ph  +61 2 6125 4967
> >> >> Fax: +61 2 6125 8199
> >> >> CRICOS Provider #00120C
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Estanislao Gonzalez
> >> >>
> >> >> Max-Planck-Institut für Meteorologie (MPI-M)
> >> >> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >> >> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >> >>
> >> >> Phone:   +49 (40) 46 00 94-126
> >> >> E-Mail:  gonzalez at dkrz.de
> >> >
> >> >
> >> >--
> >> >Estanislao Gonzalez
> >> >
> >> >Max-Planck-Institut für Meteorologie (MPI-M)
> >> >Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >> >Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >> >
> >> >Phone:   +49 (40) 46 00 94-126
> >> >E-Mail:  gonzalez at dkrz.de



More information about the GO-ESSP-TECH mailing list