[Go-essp-tech] Verifying files

stephen.pascoe at stfc.ac.uk
Wed Mar 21 10:54:55 MDT 2012


OK, I must have picked up a deprecated package somewhere.  I'll take a look later.

(PyXml is very old though -- I'd recommend using lxml and standard python libraries exclusively)
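
For illustration, here is a minimal sketch of reading catalogued file properties with only the standard library (xml.etree.ElementTree and its limited XPath subset) instead of PyXML's xml.xpath. It is not the attached check_file.py. The catalogue fragment, the property names (checksum, size, tracking_id) and the helper name file_properties are assumptions made up for the example; real catalogues may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Namespace of THREDDS InvCatalog 1.0 documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hypothetical, hand-written catalogue fragment, for illustration only.
SAMPLE_CATALOG = """<catalog
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="example.dataset.v1" ID="example.dataset.v1">
    <dataset name="tas_Amon_MODEL_historical_r1i1p1_185001-200512.nc">
      <property name="checksum_type" value="MD5"/>
      <property name="checksum" value="0123456789abcdef0123456789abcdef"/>
      <property name="size" value="1024"/>
      <property name="tracking_id" value="00000000-0000-0000-0000-000000000000"/>
    </dataset>
  </dataset>
</catalog>
"""


def file_properties(catalog_xml, file_name):
    """Return the property name/value pairs catalogued for file_name, or None."""
    root = ET.fromstring(catalog_xml)
    # ElementTree understands simple predicates such as [@name='...'].
    path = ".//t:dataset[@name='%s']" % file_name
    for dataset in root.iterfind(path, NS):
        props = dataset.findall("t:property", NS)
        return {p.get("name"): p.get("value") for p in props}
    return None


if __name__ == "__main__":
    name = "tas_Amon_MODEL_historical_r1i1p1_185001-200512.nc"
    print(file_properties(SAMPLE_CATALOG, name))
```

The same path expression should also work with lxml, by calling xpath() on the element and passing the prefix map through its namespaces argument.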

S.

---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK


-----Original Message-----
From: Juckes, Martin (STFC,RAL,RALSP) 
Sent: 21 March 2012 16:52
To: Pascoe, Stephen (STFC,RAL,RALSP); Estanislao Gonzalez
Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov; toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de; luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov; Drach1 at llnl.gov; go-essp-tech at ucar.edu
Subject: RE: Verifying files

Hi,

Yes, I have PyXml 0.8.4 to be exact, working with python 2.6 -- this is on cmip-ingest1.badc.rl.ac.uk.

I don't think the load on servers will be significant -- catalogues don't have to be accessed very often.

Keeping a list of data nodes should not be a problem in the very short term -- though I agree that a cleaner solution implemented through the P2P index node is highly desirable. 

Cheers,
Martin

> >-----Original Message-----
> >From: Pascoe, Stephen (STFC,RAL,RALSP)
> >Sent: 21 March 2012 16:43
> >To: Juckes, Martin (STFC,RAL,RALSP); Estanislao Gonzalez
> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de;
> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov;
> >Drach1 at llnl.gov; go-essp-tech at ucar.edu
> >Subject: RE: Verifying files
> >
> >Hi Martin,
> >
> >I'm having trouble with this code.  The xml.xpath module isn't in lxml
> >or the standard library so I tried installing pyxml, which I assume is
> >what you've installed.  Unfortunately pyxml appears incompatible with
> >Python2.5+ since it uses the reserved token "as" as a variable.
> >
> >See http://stackoverflow.com/questions/4953600/pyxml-on-ubuntu for an
> >explanation.
> >
> >Therefore this will need refactoring to use lxml's xpath support.
> >
> >As an immediate fix this approach would help.  The script is going to
> >break whenever the list of datanodes changes (one of the key
> >criticisms Reto made of IPSL's tool) and widespread use would put more
> >load on the datanodes.  That's why I think we need a separate catalog
> >doc.  We could then put all catalogs under a single lightweight HTTP
> >server and make a script to call that.
> >
> >Stephen.
> >
> >---
> >Stephen Pascoe  +44 (0)1235 445980
> >Centre of Environmental Data Archival
> >STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX,
> >UK
> >
> >
> >-----Original Message-----
> >From: Juckes, Martin (STFC,RAL,RALSP)
> >Sent: 21 March 2012 15:05
> >To: Estanislao Gonzalez
> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de;
> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov;
> >Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP);
> >go-essp-tech at ucar.edu
> >Subject: Verifying files
> >
> >Hi Estani, and others,
> >
> >attached is a simple script to find the catalogued checksum, size and
> >tid of any given CMIP5 data file (though it probably won't work for
> >GFDL gridspec files).
> >
> >System requirements are the python libxml2 and xml libraries.
> >
> >usage:
> >python check_file.py <file name>
> >
> >It gets over the problems Estani raises below by searching the top
> >level catalogue for a dataset that matches rather than trying to
> >construct it. In some cases it may find 2 (output1 and output2) and
> >then need to search both for the file specification.
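
The matching step described above could be sketched roughly as follows. This is not the attached script. It assumes file names of the form variable_table_model_experiment_ensemble[_period].nc and dot-separated dataset ids that contain the same table, model, experiment and ensemble strings; the ids in the usage example are made up, and the helper names are hypothetical.

```python
def parse_cmip5_filename(file_name):
    """Split a CMIP5-style file name into its first five components."""
    stem = file_name[:-3] if file_name.endswith(".nc") else file_name
    parts = stem.split("_")
    if len(parts) < 5:
        raise ValueError("unexpected file name: %r" % file_name)
    keys = ("variable", "table", "model", "experiment", "ensemble")
    return dict(zip(keys, parts[:5]))


def matching_datasets(file_name, dataset_ids):
    """Return the dataset ids whose components are consistent with file_name."""
    facets = parse_cmip5_filename(file_name)
    wanted = {facets["table"], facets["model"],
              facets["experiment"], facets["ensemble"]}
    # Keep every id that contains all wanted components; more than one
    # hit is possible, e.g. one id per product (output1 and output2).
    return [ds for ds in dataset_ids if wanted.issubset(ds.split("."))]


if __name__ == "__main__":
    # Hypothetical dataset ids, for illustration only.
    ids = [
        "cmip5.output1.INST.MODEL.historical.mon.atmos.Amon.r1i1p1",
        "cmip5.output2.INST.MODEL.historical.mon.atmos.Amon.r1i1p1",
        "cmip5.output1.INST.MODEL.rcp45.mon.atmos.Amon.r1i1p1",
    ]
    name = "tas_Amon_MODEL_historical_r1i1p1_185001-200512.nc"
    for ds in matching_datasets(name, ids):
        print(ds)
```

When both an output1 and an output2 dataset match, each candidate would then have to be searched for the file entry itself.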
> >
> >At present it is a proof of concept -- it only finds the checksum and
> >does not go on to actually check it against the file.
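
The missing final step, checking the catalogued checksum against the local file, might look something like the sketch below. It assumes the catalogue holds a hex-encoded MD5 digest; the function names are hypothetical and not taken from the attached script.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_checksum):
    """Return True if the file's MD5 matches the catalogued checksum."""
    return md5_of_file(path) == expected_checksum.strip().lower()
```

Comparing the file size first would be a cheap way to reject obviously truncated downloads before computing any digest.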
> >
> >Does this look like a useful tool? If people think it is useful, there
> >are a few points which need tidying up to get reasonable efficiency,
> >and feedback on likely usage patterns would be useful.
> >
> >cheers,
> >Martin
> >
> >
> >________________________________________
> >From: Estanislao Gonzalez [gonzalez at dkrz.de]
> >Sent: 21 March 2012 12:26
> >To: Juckes, Martin (STFC,RAL,RALSP)
> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de;
> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov;
> >Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP)
> >
> >
> >Subject: Re: Hawaii CMIP5 meeting report   (was Re: CMIP5 Management
> >telco)
> >
> >Hi Martin,
> >
> >thanks for the feedback.
> >
> >A short comment regarding the service you described: it's not as easy,
> >as you have to generate the dataset name from the file name, and
> >AFAIK, this requires the cmor tables. And probably the cmor tables at
> >the time the dataset was created to cover all cases (but 99% of them
> >should be covered anyway, so I think it's fine).
> >Another problem is the product which can't be determined precisely
> >without information from the models (again 99% of all cases should
> >work).
> >There are still some issues with the versions and the data node serving
> >that dataset, but I guess it's doable.
> >
> >I still think the best approach for this is to use the P2P search
> >capability to reconstruct the dataset, from the checksums and then
> >resolve back to the latest version as was discussed yesterday.
> >But I still think we can't do much more without leaving other things
> >behind (at least I can't).
> >
> >That's why I was hoping to get feedback on a list of "features" that
> >we can postpone so we can concentrate on others.
> >
> >Thanks,
> >Estani
> >
> >Am 21.03.2012 13:14, schrieb martin.juckes at stfc.ac.uk:
> >> Hi Estani,
> >>
> >> Our user community is dominated by academic researchers, so they
> >have no obligation to agree with each other. They are juggling a range
> >of priorities, not working to a set of rigid objectives.
> >>
> >> One of the problems we need to sort out is getting data to people
> >quickly. This objective is steadily rising towards the top of the list
> >as the deadline for submission of papers for consideration by the IPCC
> >approaches.
> >>
> >> We want to provide quality controlled data, with checksums, with a
> >robust and user friendly search interface -- but the community does
> >not have the luxury of being able to wait for that to appear.
> >>
> >> It occurs to me that it would be fairly easy (because security is
> >not involved) to write a script to validate a file (or list of files)
> >against THREDDS catalogues; this would then allow users who have taken
> >shortcuts and used secondary sources to verify their data.
> >>
> >> cheers,
> >> Martin
> >> ________________________________
> >> From: Estanislao Gonzalez [gonzalez at dkrz.de]
> >> Sent: 21 March 2012 10:03
> >> To: Karl Taylor
> >> Cc: Ben Evans; Williams, Dean N.; Frank Toussaint; Juckes, Martin
> >> (STFC,RAL,RALSP); bryan.lawrence at ncas.ac.uk; Michael Lautenschlager;
> >> Cinquini, Luca; Stéphane Senesi; Gavin M Bell; Drach, Bob; Pascoe,
> >> Stephen (STFC,RAL,RALSP)
> >> Subject: Re: Hawaii CMIP5 meeting report (was Re: CMIP5 Management
> >> telco)
> >>
> >> Hi all,
> >>
> >> I have one question/concern regarding sites like the one mentioned
> >in Karl's mail. These sites get _some_ data by _unknown_ methods.
> >What's the position of the community regarding this?
> >>
> >> We know we have a lot of problems to sort out, but those sites go
> >around the problems by not confronting them at all. For instance,
> >AFAICT:
> >> 1) The site stores no version information
> >> 2) It does not guarantee data is complete or validated from the
> >> original sites (sometimes they can't as the original sites do not
> >> provide checksums, or the checksums are wrong)
> >> 3) It's not integrated into the search, so people can't get to it by
> >> the same means as to other data (new interface)
> >> 4) There's no way the user can be notified when something is changed
> >> (though there might be a complex architecture behind it)
> >> 5) I just wonder if there's a logging of data access at all, so at
> >> least it is known who downloaded what and when for notification and
> >> reporting purposes (might be)
> >> 6) And most importantly it completely bypasses the whole security we
> >> have in place (for instance I see models in there with
> >> non-commercial access restrictions... I wonder how those are handled?)
> >>
> >> Basically, I wonder if we are trying to achieve really what the
> >community wants. I think there's a conflict between what we (my view
> >of _we_ :-) think science should look like and what the scientific
> >community needs.
> >> IMHO we have two different ends of the rope here: data quality vs.
> >prompt access.
> >>
> >> Of course we are aiming at both ends at the same time (and this *is*
> >> doable), but this slows development at both ends... there's no such
> >thing as a free lunch :-) should we keep this development path or
> >should we define a new one?
> >>
> >> I've seen a lot of sites and procedures like this, as well as people
> >complaining about not getting the whole CMIP3 archive of CMIP5 data
> >(because of the 36TB size) in a snap.
> >> I have (almost) no experience in this community, so I have thought
> >it from the very beginning to be one, but it looks like the data
> >producers are not the same as the data consumers (still referring to
> >WG1).
> >>
> >> Having some information on this would help us (me and others :-) to
> >understand the community as well as improving development, by guiding
> >it better to satisfy the community requirements in the order the
> >community expects those requirements to be satisfied.
> >>
> >> Just my 2c,
> >> Estani
> >>
> >> Am 21.03.2012 00:08, schrieb Karl Taylor:
> >> Hi Ben and all,
> >>
> >> A short and general report has been prepared summarizing the CMIP5
> >meeting held recently in Hawaii
> >( http://www.wcrp-climate.org/documents/ezine/WCRPnews_14032012.pdf ).  Some of my
> >impressions, which may not be reflected in the official summary,
> >include:
> >>
> >> 1.  An impressive array of scientific multi-model CMIP5 studies are
> >underway (some in pretty advanced phases), so users seem to be coping
> >with the frustrations of our current ESG.
> >>
> >> 2.  Two scientists (of about 160) said specifically and publicly
> >they could *not* do what they wanted to do because it was so difficult
> >and slow to download data.
> >>
> >> 3.  Privately, several scientists expressed frustrations and asked
> >when things would improve.  Nearly everyone understood and appreciated
> >the enormity of the challenges and most seem willing to remain patient
> >a little longer.
> >>
> >> 4.  Reto Knutti (a convening? lead author) and Thomas Stocker
> >(co-chair of WG1) are both *counting* on improvements and *concerned* that
> >they will be delayed.  Reto is the one who has put up a website at
> >https://wiki.c2sm.ethz.ch/Wiki/CMIP5 where lots of folks are getting
> >data more easily than through the official ESG sites.  Reto sent me an
> >email summarizing the biggest problems he has getting data to populate
> >his site.  He has identified problems (many of which we're working
> >on), which probably affect all users (summarizing his email copied
> >below):
> >> a) incorrect MD5 checksums
> >> b) old, incorrect catalog entries at some nodes
> >> c) no easy way to report errors (he mentioned "errata websites, feeds,
> >> email addresses invalid or no responses"; not sure whether he tried
> >> the help desk)
> >> d) incorrect version numbers (or unclearly labeled)
> >> e) gaps in data, overlaps, "strange time coordinates"
> >> f) no way to find out when data is found to contain errors
> >>
> >> 5.  A number of folks volunteered to be first users on the p2p
> >system.
> >>
> >> Obviously in this email the focus is on what needs fixing.  An
> >enormous amount has already been accomplished.
> >>
> >> Please let me know if you are specifically curious about anything
> >else that went on at the CMIP5 meeting, and I'll try to respond.
> >>
> >> Best regards,
> >> Karl
> >>
> >> email from Reto Knutti:
> >> Dear Karl,
> >>
> >> As promised, here's a list of main issues we are encountering with
> >CMIP5.
> >>
> >> - the manual download is slow and unreliable. A tool that is
> >scriptable is essential, so that it is easy to find out what is new
> >and changed.
> >> - the IPSL download  tool is reasonably good but fails often due to
> >> old and incorrect catalog entries at the different nodes, incorrect
> >> MD5 checksums, or missing PKI interface
> >> - communication of problems is difficult to impossible (different
> >> errata websites, feeds, email addresses invalid or no responses)
> >> - no clear data version control, which makes it difficult to find out
> >> which files are the most recent ones
> >>
> >> I think the new interface that you are testing could address part of
> >the above.
> >>
> >> But it would also help if PCMDI could communicate clearly to the
> >data centers that they need to make sure their catalogues are up to
> >date, checksums are correct, and versions are clearly labeled. There
> >are further issues that the modeling centers should address (gaps in
> >data, overlaps, strange timescales, etc.). I realize that is a lot of
> >work for the centers, but if they don't do it they create even more
> >work for thousands of people who are trying to analyze the data.
> >>
> >> Finally, I would like to stress what Thomas mentioned. We need a way
> >to find out when data is found to contain errors, but we also need a
> >way to give feedback to the modeling groups when we discover issues.
> >If you have a list of contact persons from the centers that you could
> >provide to us, that could help.
> >>
> >> Given that scientific papers and the IPCC second order draft are
> >written over the next five months, it is important that the above
> >points are addressed as quickly as possible. I realize of course the
> >constraints that you have at PCMDI, and the fact that some problems
> >are not under your control.
> >>
> >> In any case we appreciate all your efforts and support, and would be
> >happy to work with you and help with testing tools etc. if we can.
> >>
> >> Thanks,
> >>
> >> Reto
> >>
> >>
> >> On 3/12/12 11:36 AM, Ben Evans wrote:
> >> Thanks Dean.  That sounds good.
> >>
> >> Perhaps this is more for Karl and others, it would be helpful to see
> >a report of the Hawaii meeting if it were available before then.  I
> >will be heading into a local management meeting in two weeks and I
> >would like to be up to speed - especially if there are other
> >perceptions that came out of the meeting.
> >>
> >> Best Wishes,
> >> Ben
> >> --
> >> Dr Ben Evans
> >> Associate Director (Research Engagement and Initiatives) NCI
> >> http://www.nci.org.au/<http://nf.nci.org.au/>
> >> Leonard Huxley Building (#56)
> >> The Australian National University
> >> Canberra, ACT, 0200 Australia
> >> Ph  +61 2 6125 4967
> >> Fax: +61 2 6125 8199
> >> CRICOS Provider #00120C
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Estanislao Gonzalez
> >>
> >> Max-Planck-Institut für Meteorologie (MPI-M) Deutsches
> >> Klimarechenzentrum (DKRZ) - German Climate Computing Centre Room 108
> >-
> >> Bundesstrasse 45a, D-20146 Hamburg, Germany
> >>
> >> Phone:   +49 (40) 46 00 94-126
> >> E-Mail:  gonzalez at dkrz.de
> >
> >
> >--
> >Estanislao Gonzalez
> >
> >Max-Planck-Institut für Meteorologie (MPI-M) Deutsches
> >Klimarechenzentrum (DKRZ) - German Climate Computing Centre Room 108 -
> >Bundesstrasse 45a, D-20146 Hamburg, Germany
> >
> >Phone:   +49 (40) 46 00 94-126
> >E-Mail:  gonzalez at dkrz.de
