[Go-essp-tech] Verifying files

Estanislao Gonzalez gonzalez at dkrz.de
Wed Mar 21 10:47:51 MDT 2012


Hi Martin,

this is a great and useful idea. I think the main question is whom we 
are building the tools for. The main problem I had when creating the 
wget script is that users:
1) use all sorts of equipment, from Solaris to Mac, in all flavours 
and versions;
2) have little knowledge of any particular computer language;
3) have no root rights.

But even without going into the details, defining the tool is the most 
valuable thing.
I still have some concerns as to how the tool would know that the snw 
variable, for example, lives in either landIce or land. And I also doubt 
that hard-coding values will scale (pcmdi3 already changed to pcmdi7, and 
we might need to distribute catalogs across multiple nodes if the TDS 
can't cope with all of them).

So I think we should start from your proposal and describe the steps 
required for getting the information. Did I get the core of the 
procedure right? (There's a rough sketch in code right after the list.)
1) Start from the file name -> extract the facets (defined by the 
CMIP5 DRS)
2) Find the data node TDS hosting the catalog from the information 
extracted in 1
3) Select all possible catalog references from the main catalog.xml 
using the data from 1
4) Parse those catalogs and find the matching one according to the 
data from 1 (it might not match all fields, but stripping the temporal 
part should be reliable for most cases). Trickier if the variable got 
deleted...
And then
5) Check whether the file is there (filename and checksum) and return 
the dataset
6) Check the TDS again for those catalogs and see if there are newer 
versions.
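Just to make steps 1-5 concrete, here is a very rough Python (2) 
sketch. Everything in it is a stand-in for illustration: the naive 
substring matching of catalogRef entries, the guess that the checksum 
is published as a <property name="checksum"> element on the file-level 
<dataset>, and all the helper names are my own choices, not Martin's 
script; step 6 (looking for newer versions) is missing entirely.

import hashlib
import os
import urllib2
import urlparse
import xml.etree.ElementTree as ET

THREDDS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"
XLINK = "{http://www.w3.org/1999/xlink}"

def facets_from_filename(filename):
    # Step 1: CMIP5 DRS filenames look like
    # <variable>_<mip_table>_<model>_<experiment>_<ensemble>[_<temporal>].nc
    parts = filename.rsplit(".", 1)[0].split("_")
    facets = dict(zip(["variable", "mip_table", "model",
                       "experiment", "ensemble"], parts))
    facets["temporal"] = parts[5] if len(parts) > 5 else None
    return facets

def candidate_catalogs(top_catalog_url, facets):
    # Steps 2-3: read the data node's top-level catalog.xml and keep the
    # catalogRef entries whose href mentions the model and experiment.
    tree = ET.parse(urllib2.urlopen(top_catalog_url))
    for ref in tree.iter(THREDDS + "catalogRef"):
        href = ref.get(XLINK + "href") or ""
        if facets["model"] in href and facets["experiment"] in href:
            yield urlparse.urljoin(top_catalog_url, href)

def catalogued_checksum(catalog_url, filename):
    # Steps 4-5: parse one per-dataset catalog and look the file up by
    # name; assuming the checksum is exposed as a dataset property.
    tree = ET.parse(urllib2.urlopen(catalog_url))
    for ds in tree.iter(THREDDS + "dataset"):
        if ds.get("name") == filename:
            props = dict((p.get("name"), p.get("value"))
                         for p in ds.findall(THREDDS + "property"))
            return props.get("checksum")
    return None

def md5sum(path):
    # Local MD5, read in 1 MB chunks so large files don't fill memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), ""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, top_catalog_url):
    # Glue for steps 1-5; returns (True/False, catalog) or (None, None).
    filename = os.path.basename(path)
    facets = facets_from_filename(filename)
    for cat_url in candidate_catalogs(top_catalog_url, facets):
        checksum = catalogued_checksum(cat_url, filename)
        if checksum:
            return checksum.lower() == md5sum(path), cat_url
    return None, None   # file not found in any matching catalog

So something like verify("/some/dir/somefile.nc", 
"http://<data-node>/thredds/catalog.xml") would return True/False plus 
the catalog it matched, or (None, None) if nothing matched. I've left 
the data node of step 2 as a parameter, because that's exactly the 
part I wouldn't know how to hard-code.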

We do have the means to simplify some steps by going directly to a 
search service (e.g. the P2P one). But I think it's better to first 
define this in terms of the data sources, and then we can exploit 
services which already do parts of this procedure.
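For instance (and this is just my guess at the interface, not a spec): 
if the P2P index lets us constrain a file-level search by checksum, 
most of steps 2-4 would collapse into a single HTTP query, something 
like:

import json
import urllib
import urllib2

def p2p_lookup(index_node, md5):
    # Ask an index node for file records matching a given checksum and
    # read back the dataset id and version of each hit.
    params = urllib.urlencode({
        "type": "File",                      # file-level search
        "checksum": md5,                     # constrain by the file's MD5
        "distrib": "true",                   # federate across the P2P nodes
        "format": "application/solr+json",   # machine-readable output
        "limit": 10,
    })
    url = "http://%s/esg-search/search?%s" % (index_node, params)
    docs = json.load(urllib2.urlopen(url))["response"]["docs"]
    return [(d.get("dataset_id"), d.get("version")) for d in docs]

The endpoint path and the parameter/field names (checksum, dataset_id, 
version) are assumptions on my side and would need to be checked 
against the actual P2P API, but it shows where a search service could 
shortcut the catalog crawling.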

Any agreement or desire to move forward in this direction?

Thanks,
Estani



On 21.03.2012 16:05, martin.juckes at stfc.ac.uk wrote:
> Hi Estani, and others,
>
> attached is a simple script to find the catalogued checksum, size and tid of any given CMIP5 data file (though it probably won't work for GFDL gridspec files).
>
> System requirements are the python libxml2 and xml libraries.
>
> usage:
> python check_file.py <file name>
>
> It gets over the problems Estani raises below by searching the top-level catalogue for a dataset that matches, rather than trying to construct it. In some cases it may find two (output1 and output2) and then needs to search both for the file specification.
>
> At present it is a proof of concept -- it only finds the checksum and does not go on to actually check it against the file.
>
> Does this look like a useful tool? If people think it is, there are a few points which need tidying up to get reasonable efficiency, and feedback on likely usage patterns would be useful.
>
> cheers,
> Martin
>
>
> ________________________________________
> From: Estanislao Gonzalez [gonzalez at dkrz.de]
> Sent: 21 March 2012 12:26
> To: Juckes, Martin (STFC,RAL,RALSP)
> Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov; toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de; luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov; Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP)
> Subject: Re: Hawaii CMIP5 meeting report   (was Re: CMIP5 Management telco)
>
> Hi Martin,
>
> thanks for the feedback.
>
> A short comment regarding the service you described: it's not that easy,
> as you have to generate the dataset name from the file name, and AFAIK
> this requires the CMOR tables, probably even the CMOR tables as they
> were at the time the dataset was created, to cover all cases (but 99%
> of them should be covered anyway, so I think it's fine).
> Another problem is the product, which can't be determined precisely
> without information from the models (again, 99% of all cases should work).
> There are still some issues with the versions and the data node serving
> that dataset, but I guess it's doable.
>
> I still think the best approach for this is to use the P2P search
> capability to reconstruct the dataset from the checksums and then
> resolve back to the latest version, as was discussed yesterday.
> But I still think we can't do much more without leaving other things
> behind (at least I can't).
>
> That's why I was hoping to get feedback on a list of "features" that we
> can postpone so we can concentrate on others.
>
> Thanks,
> Estani
>
> On 21.03.2012 13:14, martin.juckes at stfc.ac.uk wrote:
>> Hi Estani,
>>
>> Our user community is dominated by academic researchers, so they have no obligation to agree with each other. They are juggling a range of priorities, not working to a set of rigid objectives.
>>
>> One of the problems we need to sort out is getting data to people quickly. This objective is steadily rising towards the top of the list as the deadline for submission of papers for consideration by the IPCC approaches.
>>
>> We want to provide quality controlled data, with checksums, with a robust and user friendly search interface -- but the community does not have the luxury of being able to wait for that to appear.
>>
>> It occurs to me that it would be fairly easy (because security is not involved) to write a script to validate a file (or list of files) against THREDDS catalogues; this would then allow users who have taken shortcuts and used secondary sources to verify their data.
>>
>> cheers,
>> Martin
>> ________________________________
>> From: Estanislao Gonzalez [gonzalez at dkrz.de]
>> Sent: 21 March 2012 10:03
>> To: Karl Taylor
>> Cc: Ben Evans; Williams, Dean N.; Frank Toussaint; Juckes, Martin (STFC,RAL,RALSP); bryan.lawrence at ncas.ac.uk; Michael Lautenschlager; Cinquini, Luca; Stéphane Senesi; Gavin M Bell; Drach, Bob; Pascoe, Stephen (STFC,RAL,RALSP)
>> Subject: Re: Hawaii CMIP5 meeting report (was Re: CMIP5 Management telco)
>>
>> Hi all,
>>
>> I have one question/concern regarding sites like the one mentioned in Karl's mail. These sites get _some_ data by _unknown_ methods. What's the position of the community regarding this?
>>
>> We know we have a lot of problems to sort out, but those sites go around the problems by not confronting them at all. For instance, AFAICT:
>> 1) The site stores no version information
>> 2) It does not guarantee that the data is complete or validated against the original sites (sometimes it can't be, as the original sites do not provide checksums, or the checksums are wrong)
>> 3) It's not integrated into the search, so people can't get to it by the same means as other data (new interface)
>> 4) There's no way the user can be notified when something changes (though there might be a complex architecture behind it)
>> 5) I just wonder whether data access is logged at all, so that at least it is known who downloaded what and when, for notification and reporting purposes (it might be)
>> 6) And most importantly, it completely bypasses the whole security we have in place (for instance, I see models in there with non-commercial access restrictions... I wonder how those are handled?)
>>
>> Basically, I wonder if what we are trying to achieve is really what the community wants. I think there's a conflict between what we (my view of _we_ :-) think science should look like and what the scientific community needs.
>> IMHO we have two different ends of the rope here: data quality vs. prompt access.
>>
>> Of course we are aiming at both ends at the same time (and this *is* doable), but this slows development at both ends... there's no such thing as a free lunch :-)
>> Should we keep this development path or should we define a new one?
>>
>> I've seen a lot of sites and procedures like this, as well as people complaining about not getting the whole CMIP3 archive of CMIP5 data (because of the 36TB size) in a snap.
>> I have (almost) no experience with this community, so from the very beginning I assumed it to be a single one, but it looks like the data producers are not the same as the data consumers (still referring to WG1).
>>
>> Having some information on this would help us (me and others :-) understand the community as well as improve development, by guiding it better to satisfy the community's requirements in the order the community expects them to be satisfied.
>>
>> Just my 2c,
>> Estani
>>
>> On 21.03.2012 00:08, Karl Taylor wrote:
>> Hi Ben and all,
>>
>> A short and general report has been prepared summarizing the CMIP5 meeting held recently in Hawaii ( http://www.wcrp-climate.org/documents/ezine/WCRPnews_14032012.pdf ).  Some of my impressions, which may not be reflected in the official summary, include:
>>
>> 1.  An impressive array of scientific multi-model CMIP5 studies are underway (some in pretty advanced phases), so users seem to be coping with the frustrations of our current ESG.
>>
>> 2.  Two scientists (of about 160) said specifically and publicly that they could *not* do what they wanted to do because it was so difficult and slow to download data.
>>
>> 3.  Privately, several scientists expressed frustrations and asked when things would improve.  Nearly everyone understood and appreciated the enormity of the challenges and most seem willing to remain patient a little longer.
>>
>> 4.  Reto Knutti (a convening? lead author) and Thomas Stocker (co-chair of WG1) are both *counting* on improvements and *concerned* that they will be delayed.  Reto is the one who has put up a website at https://wiki.c2sm.ethz.ch/Wiki/CMIP5 where lots of folks are getting data more easily than through the official ESG sites.  Reto sent me an email summarizing the biggest problems he has getting data to populate his site.  He has identified problems (many of which we're working on), which probably affect all users (summarizing his email copied below):
>> a) incorrect MD5 checksums
>> b) old, incorrect catalog entries at some nodes
>> c) no easy way to report errors (he mentioned "errata websites, feeds, email addresses invalid or no responses"; not sure whether he tried the help desk)
>> d) incorrect version numbers (or unclearly labeled)
>> e) gaps in data, overlaps, "strange time coordinates"
>> f) no way to find out when data is found to contain errors
>>
>> 5.  A number of folks volunteered to be first users on the p2p system.
>>
>> Obviously in this email the focus is on what needs fixing.  An enormous amount has already been accomplished.
>>
>> Please let me know if you are specifically curious about anything else that went on at the CMIP5 meeting, and I'll try to respond.
>>
>> Best regards,
>> Karl
>>
>> email from Reto Knutti:
>> Dear Karl,
>>
>> As promised, here's a list of main issues we are encountering with CMIP5.
>>
>> - the manual download is slow and unreliable. A tool that is scriptable is essential, so that it is easy to find out what is new and changed.
>> - the IPSL download tool is reasonably good but fails often due to old and incorrect catalog entries at the different nodes, incorrect MD5 checksums, or a missing PKI interface
>> - communication of problems is difficult to impossible (different errata websites, feeds, email addresses invalid or no responses)
>> - no clear data version control, which makes it difficult to find out which files are the most recent ones
>>
>> I think the new interface that you are testing could address part of the above.
>>
>> But it would also help if PCMDI could communicate clearly to the data centers that they need to make sure their catalogues are up to date, checksums are correct, and versions are clearly labeled. There are further issues that the modeling centers should address (gaps in data, overlaps, strange timescales, etc.). I realize that is a lot of work for the centers, but if they don't do it they create even more work for thousands of people who are trying to analyze the data.
>>
>> Finally, I would like to stress what Thomas mentioned. We need a way to find out when data is found to contain errors, but we also need a way to give feedback to the modeling groups when we discover issues. If you have a list of contact persons from the centers that you could provide to us, that could help.
>>
>> Given that scientific papers and the IPCC second order draft are written over the next five months, it is important that the above points are addressed as quickly as possible. I realize of course the constraints that you have at PCMDI, and the fact that some problems are not under your control.
>>
>> In any case we appreciate all your efforts and support, and would be happy to work with you and help with testing tools etc. if we can.
>>
>> Thanks,
>>
>> Reto
>>
>>
>> On 3/12/12 11:36 AM, Ben Evans wrote:
>> Thanks Dean.  That sounds good.
>>
>> Perhaps this is more for Karl and others, it would be helpful to see a report of the Hawaii meeting if it were available before then.  I will be heading into a local management meeting in two weeks and I would like to be up to speed - especially if there are other perceptions that came out of the meeting.
>>
>> Best Wishes,
>> Ben
>> --
>> Dr Ben Evans
>> Associate Director (Research Engagement and Initiatives)
>> NCI
>> http://www.nci.org.au/
>> Leonard Huxley Building (#56)
>> The Australian National University
>> Canberra, ACT, 0200 Australia
>> Ph  +61 2 6125 4967
>> Fax: +61 2 6125 8199
>> CRICOS Provider #00120C
>>
>>
>>
>>
>>
>> --
>> Estanislao Gonzalez
>>
>> Max-Planck-Institut für Meteorologie (MPI-M)
>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>
>> Phone:   +49 (40) 46 00 94-126
>> E-Mail:  gonzalez at dkrz.de
>
> --
> Estanislao Gonzalez
>
> Max-Planck-Institut für Meteorologie (MPI-M)
> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>
> Phone:   +49 (40) 46 00 94-126
> E-Mail:  gonzalez at dkrz.de
>
>


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de


