[Go-essp-tech] Verifying files

stephen.pascoe at stfc.ac.uk stephen.pascoe at stfc.ac.uk
Wed Mar 21 10:42:39 MDT 2012


Hi Martin,

I'm having trouble with this code.  The xml.xpath module isn't in lxml or the standard library, so I tried installing PyXML, which I assume is what you've installed.  Unfortunately, PyXML appears to be incompatible with Python 2.5+ since it uses the reserved keyword "as" as a variable name.

See http://stackoverflow.com/questions/4953600/pyxml-on-ubuntu for an explanation.

Therefore this will need refactoring to use lxml's xpath support.  
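For what it's worth, a minimal sketch of what the refactored lookup could look like with lxml (the catalog snippet, element layout, and checksum value below are illustrative; the namespace URI is the standard THREDDS InvCatalog 1.0 one):

```python
# Sketch: replace xml.xpath calls with lxml's xpath() support.
from lxml import etree

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Illustrative THREDDS catalog fragment; a real catalog would be fetched
# from the datanode.
catalog_xml = b"""<?xml version="1.0"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="cmip5.output1.example">
    <dataset name="file1.nc">
      <property name="checksum" value="d41d8cd98f00b204e9800998ecf8427e"/>
    </dataset>
  </dataset>
</catalog>
"""

root = etree.fromstring(catalog_xml)
# lxml requires a prefix for namespaced XPath, so map one explicitly.
checksums = root.xpath('//t:property[@name="checksum"]/@value',
                       namespaces={"t": THREDDS_NS})
print(checksums)
```

The main difference from pyxml is that lxml refuses default (empty-prefix) namespaces in XPath expressions, so the THREDDS namespace has to be bound to an explicit prefix.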

As an immediate fix, this approach would help.  The script is going to break whenever the list of datanodes changes (one of the key criticisms Reto made of IPSL's tool), and widespread use would put more load on the datanodes.  That's why I think we need a separate catalog doc.  We could then put all catalogs under a single lightweight HTTP server and make a script to query that.

Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK


-----Original Message-----
From: Juckes, Martin (STFC,RAL,RALSP) 
Sent: 21 March 2012 15:05
To: Estanislao Gonzalez
Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov; toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de; luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov; Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP); go-essp-tech at ucar.edu
Subject: Verifying files

Hi Estani, and others,

attached is a simple script to find the catalogued checksum, size and tid of any given CMIP5 data file (though it probably won't work for GFDL gridspec files).

System requirements are the Python libxml2 and xml libraries.

usage:
python check_file.py <file name>

It gets over the problems Estani raises below by searching the top-level catalogue for a dataset that matches, rather than trying to construct the name.  In some cases it may find two (output1 and output2) and then needs to search both for the file specification.

At present it is a proof of concept -- it only finds the checksum and does not go on to actually check it against the file.

Does this look like a useful tool?  If people think it is, there are a few points which need tidying up to get reasonable efficiency, and feedback on likely usage patterns would be useful.

cheers,
Martin


________________________________________
From: Estanislao Gonzalez [gonzalez at dkrz.de]
Sent: 21 March 2012 12:26
To: Juckes, Martin (STFC,RAL,RALSP)
Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov; toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de; luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov; Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP)

Subject: Re: Hawaii CMIP5 meeting report   (was Re: CMIP5 Management telco)

Hi Martin,

thanks for the feedback.

A short comment regarding the service you described: it's not that easy, as you have to generate the dataset name from the file name, and AFAIK this requires the CMOR tables -- probably the CMOR tables as they were at the time the dataset was created, to cover all cases (but 99% of them should be covered anyway, so I think it's fine).
Another problem is the product, which can't be determined precisely without information from the models (again, 99% of all cases should work).
There are still some issues with the versions and with the data node serving that dataset, but I guess it's doable.

I still think the best approach for this is to use the P2P search capability to reconstruct the dataset from the checksums and then resolve back to the latest version, as was discussed yesterday.
But I still think we can't do much more without leaving other things behind (at least I can't).
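The reconstruct-by-checksum idea above could be sketched as a query against the P2P search service; the index node host below is a placeholder, and the parameter names are assumptions for illustration rather than a confirmed API:

```python
from urllib.parse import urlencode

def p2p_search_url(checksum, index_node="esgf-index.example.org"):
    """Build a hypothetical P2P search query for a file with this checksum."""
    params = urlencode({
        "type": "File",      # search for files, not datasets
        "checksum": checksum,
        "latest": "true",    # resolve back to the latest version
        "format": "application/solr+json",
    })
    return "http://%s/esg-search/search?%s" % (index_node, params)

print(p2p_search_url("d41d8cd98f00b204e9800998ecf8427e"))
```

The point being that the checksum query sidesteps dataset-name construction entirely: the search index, not the client, knows which dataset and version the file belongs to.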

That's why I was hoping to get feedback on a list of "features" that we can postpone so we can concentrate on others.

Thanks,
Estani

Am 21.03.2012 13:14, schrieb martin.juckes at stfc.ac.uk:
> Hi Estani,
>
> Our user community is dominated by academic researchers, so they have no obligation to agree with each other. They are juggling a range of priorities, not working to a set of rigid objectives.
>
> One of the problems we need to sort out is getting data to people quickly. This objective is steadily rising towards the top of the list as the deadline for submission of papers for consideration by the IPCC approaches.
>
> We want to provide quality controlled data, with checksums, with a robust and user friendly search interface -- but the community does not have the luxury of being able to wait for that to appear.
>
> It occurs to me that it would be fairly easy (because security is not involved) to write a script to validate a file (or list of files) against THREDDS catalogues; this would then allow users who have taken shortcuts and used secondary sources to verify their data.
>
> cheers,
> Martin
> ________________________________
> From: Estanislao Gonzalez [gonzalez at dkrz.de]
> Sent: 21 March 2012 10:03
> To: Karl Taylor
> Cc: Ben Evans; Williams, Dean N.; Frank Toussaint; Juckes, Martin 
> (STFC,RAL,RALSP); bryan.lawrence at ncas.ac.uk; Michael Lautenschlager; 
> Cinquini, Luca; Stéphane Senesi; Gavin M Bell; Drach, Bob; Pascoe, 
> Stephen (STFC,RAL,RALSP)
> Subject: Re: Hawaii CMIP5 meeting report (was Re: CMIP5 Management 
> telco)
>
> Hi all,
>
> I have one question/concern regarding sites like the one mentioned in Karl's mail. These sites get _some_ data by _unknown_ methods. What's the position of the community regarding this?
>
> We know we have a lot of problems to sort out, but those sites go around the problems by not confronting them at all. For instance, AFAICT:
> 1) The site stores no version information
> 2) It does not guarantee that the data is complete or validated against
> the original sites (sometimes they can't, as the original sites do not
> provide checksums, or the checksums are wrong)
> 3) It's not integrated into the search, so people can't get to it by
> the same means as other data (it's a new interface)
> 4) There's no way the user can be notified when something changes
> (though there might be a complex architecture behind it)
> 5) I just wonder if there's any logging of data access at all, so at
> least it is known who downloaded what and when, for notification and
> reporting purposes (there might be)
> 6) And most importantly, it completely bypasses the whole security we
> have in place (for instance, I see models in there with non-commercial
> access restrictions... I wonder how those are handled?)
>
> Basically, I wonder if we are trying to achieve really what the community wants. I think there's a conflict between what we (my view of _we_ :-) think science should look like and what the scientific community needs.
> IMHO we have two different ends of the rope here: data quality vs. prompt access.
>
> Of course we are aiming at both ends at the same time (and this *is* 
> doable), but this slows development at both ends... there's no such thing as a free lunch :-) should we keep this development path or should we define a new one?
>
> I've seen a lot of sites and procedures like this, as well as people complaining about not being able to grab the whole CMIP5 archive in a snap, as they could with CMIP3's 36TB.
> I have (almost) no experience in this community; from the very beginning I thought of it as a single community, but it looks like the data producers are not the same as the data consumers (still referring to WG1).
>
> Having some information on this would help us (me and others :-) to understand the community, as well as improve development by guiding it better to satisfy the community's requirements in the order the community expects them to be satisfied.
>
> Just my 2c,
> Estani
>
> Am 21.03.2012 00:08, schrieb Karl Taylor:
> Hi Ben and all,
>
> A short and general report has been prepared summarizing the CMIP5 meeting held recently in Hawaii ( http://www.wcrp-climate.org/documents/ezine/WCRPnews_14032012.pdf ).  Some of my impressions, which may not be reflected in the official summary, include:
>
> 1.  An impressive array of scientific multi-model CMIP5 studies is underway (some in pretty advanced phases), so users seem to be coping with the frustrations of our current ESG.
>
> 2.  Two scientists (of about 160) said specifically and publicly that they could *not* do what they wanted to do because it was so difficult and slow to download data.
>
> 3.  Privately, several scientists expressed frustrations and asked when things would improve.  Nearly everyone understood and appreciated the enormity of the challenges and most seem willing to remain patient a little longer.
>
> 4.  Reto Knutti (a convening? lead author) and Thomas Stocker (co-chair of WG1) are both *counting* on improvements and *concerned* that they will be delayed.  Reto is the one who has put up a website at https://wiki.c2sm.ethz.ch/Wiki/CMIP5 where lots of folks are getting data more easily than through the official ESG sites.  Reto sent me an email summarizing the biggest problems he has getting data to populate his site.  He has identified problems (many of which we're working on), which probably affect all users (summarizing his email copied below):
> a) incorrect MD5 checksums
> b) old, incorrect catalog entries at some nodes
> c) no easy way to report errors (he mentioned "errata websites, feeds, 
> email addresses invalid or no responses"; not sure whether he tried 
> the help desk)
> d) incorrect version numbers (or unclearly labeled)
> e) gaps in data, overlaps, "strange time coordinates"
> f) no way to find out when data is found to contain errors
>
> 5.  A number of folks volunteered to be first users on the p2p system.
>
> Obviously in this email the focus is on what needs fixing.  An enormous amount has already been accomplished.
>
> Please let me know if you are specifically curious about anything else that went on at the CMIP5 meeting, and I'll try to respond.
>
> Best regards,
> Karl
>
> email from Reto Knutti:
> Dear Karl,
>
> As promised, here's a list of main issues we are encountering with CMIP5.
>
> - the manual download is slow and unreliable. A tool that is scriptable is essential, so that it is easy to find out what is new and changed.
> - the IPSL download tool is reasonably good but fails often due to
> old and incorrect catalog entries at the different nodes, incorrect
> MD5 checksums, or a missing PKI interface
> - communication of problems is difficult to impossible (different 
> errata websites, feeds, email addresses invalid or no responses)
> - no clear data version control, which makes it difficult to find out 
> which files are the most recent ones
>
> I think the new interface that you are testing could address part of the above.
>
> But it would also help if PCMDI could communicate clearly to the data centers that they need to make sure their catalogues are up to date, checksums are correct, and versions are clearly labeled. There are further issues that the modeling centers should address (gaps in data, overlaps, strange timescales, etc.). I realize that is a lot of work for the centers, but if they don't do it they create even more work for thousands of people who are trying to analyze the data.
>
> Finally, I would like to stress what Thomas mentioned. We need a way to find out when data is found to contain errors, but we also need a way to give feedback to the modeling groups when we discover issues. If you have a list of contact persons from the centers that you could provide to us, that could help.
>
> Given that scientific papers and the IPCC second order draft are written over the next five months, it is important that the above points are addressed as quickly as possible. I realize of course the constraints that you have at PCMDI, and the fact that some problems are not under your control.
>
> In any case we appreciate all your efforts and support, and would be happy to work with you and help with testing tools etc. if we can.
>
> Thanks,
>
> Reto
>
>
> On 3/12/12 11:36 AM, Ben Evans wrote:
> Thanks Dean.  That sounds good.
>
> Perhaps this is more a question for Karl and others, but it would be helpful to see a report of the Hawaii meeting if it were available before then.  I will be heading into a local management meeting in two weeks and I would like to be up to speed - especially if there are other perceptions that came out of the meeting.
>
> Best Wishes,
> Ben
> --
> Dr Ben Evans
> Associate Director (Research Engagement and Initiatives) NCI 
> http://www.nci.org.au/
> Leonard Huxley Building (#56)
> The Australian National University
> Canberra, ACT, 0200 Australia
> Ph  +61 2 6125 4967
> Fax: +61 2 6125 8199
> CRICOS Provider #00120C
>
>
>
>
>
> --
> Estanislao Gonzalez
>
> Max-Planck-Institut für Meteorologie (MPI-M) Deutsches 
> Klimarechenzentrum (DKRZ) - German Climate Computing Centre Room 108 - 
> Bundesstrasse 45a, D-20146 Hamburg, Germany
>
> Phone:   +49 (40) 46 00 94-126
> E-Mail:  gonzalez at dkrz.de


--
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M) Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de

