[Go-essp-tech] Verifying files

martin.juckes at stfc.ac.uk martin.juckes at stfc.ac.uk
Wed Mar 21 11:07:30 MDT 2012


Hi Estani,

People will need Python and PyXML installed -- which may not be easy on all systems (there may be better solutions for parsing the XML, but those can be plugged in). Once that is in place, there is no reason why root access should be needed to run the script. If users don't have rights to install PyXML, they should have someone to ask.
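
For illustration, a minimal sketch of how the parser dependency could stay pluggable (this is an assumption about the design, not the script's actual code): prefer libxml2 when its Python bindings are installed, otherwise fall back to the pure-Python standard library, which needs no root install.

    # Sketch only: select whichever XML parser is available.
    # Downstream code would target the selected API (or wrap both
    # behind a single interface).
    try:
        import libxml2                        # fast C parser, system install
        HAVE_LIBXML2 = True
    except ImportError:
        import xml.etree.ElementTree as ET    # ships with Python
        HAVE_LIBXML2 = False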

The script extracts the MIP table, model, experiment and ensemble id from the filename.

It uses a look-up table to identify the data node responsible for publication of data from the given model -- this is the only look-up table used and, as Stephen says, it would be good to sunset this dependency.
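
Both steps are small; here is a sketch with placeholder names (the function, the table contents and the hostnames are invented for illustration, not taken from the script):

    import os

    def extract_facets(path):
        # CMIP5 DRS filenames look like:
        # <variable>_<MIP table>_<model>_<experiment>_<ensemble>[_<temporal>].nc
        parts = os.path.basename(path)[:-len(".nc")].split("_")
        keys = ("variable", "mip_table", "model", "experiment", "ensemble")
        return dict(zip(keys, parts[:5]))

    # Hypothetical excerpt of the model -> data node look-up table
    # (the dependency that should eventually be sunset).
    MODEL_TO_NODE = {
        "HadGEM2-ES": "tds.example-badc.ac.uk",
        "MPI-ESM-LR": "tds.example-dkrz.de",
    }

    facets = extract_facets("tas_Amon_HadGEM2-ES_historical_r1i1p1_185912-200512.nc")
    node = MODEL_TO_NODE[facets["model"]]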

The script then obtains a top-level catalogue (currently set to re-fetch if the local copy is more than 24 hours old; further command-line options would be desirable and easy to add) and searches it for datasets with the same MIP table, model, experiment and ensemble id.
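
The caching could look roughly like this (a Python 3 sketch under assumed names; the script's internals may differ):

    import os
    import time
    import urllib.request

    def fetch_catalog(url, cache_path, max_age=24 * 3600):
        # Re-use the local copy unless it is more than 24 hours old.
        fresh = (os.path.exists(cache_path) and
                 time.time() - os.path.getmtime(cache_path) < max_age)
        if not fresh:
            urllib.request.urlretrieve(url, cache_path)
        with open(cache_path) as f:
            return f.read()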

It may find more than one (e.g. multiple products, multiple realms), in which case it fetches them all (unless it already has a local copy). This ambiguity will be difficult to resolve, but the overhead is not critical.

The script then looks through the catalogues. Currently it only checks the latest version of each publication unit -- it could look through multiple versions if more detailed output is wanted.
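
As a rough sketch of that search (it assumes the publisher records the checksum as a THREDDS property named "checksum" on each file-level dataset -- an assumption worth checking against a real catalogue):

    import xml.etree.ElementTree as ET

    TDS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

    def find_catalogued_checksum(catalog_xml, filename):
        # File-level dataset elements carry the filename in their
        # "name" attribute and (assumed) a checksum property.
        root = ET.fromstring(catalog_xml)
        for ds in root.iter(TDS + "dataset"):
            if ds.get("name") == filename:
                for prop in ds.findall(TDS + "property"):
                    if prop.get("name") == "checksum":
                        return prop.get("value")
        return None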

As you say, some of this can be improved by using new P2P services: P2P won't resolve the ambiguity about which dataset a file comes from, but it will make checking multiple datasets much faster. 

At the same time, we need to make the user-interface side of things match what users need, and think about how this could work flexibly in whatever file-system arrangement they have adopted.

Cheers,
Martin





> >-----Original Message-----
> >From: Estanislao Gonzalez [mailto:gonzalez at dkrz.de]
> >Sent: 21 March 2012 16:48
> >To: Juckes, Martin (STFC,RAL,RALSP)
> >Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de;
> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov;
> >Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP); go-essp-
> >tech at ucar.edu
> >Subject: Re: Verifying files
> >
> >Hi Martin,
> >
> >this is a great and useful idea. I think the main question would be
> >for whom we are building tools. The main problem I had when creating
> >the wget script is that users:
> >1) use all sorts of equipment, from Solaris to Mac, in all flavors
> >and versions
> >2) have little knowledge of any particular programming language
> >3) have no root rights
> >
> >But even without going into the details, defining the tool is the most
> >valuable thing.
> >I still have some concerns as to how the tool would know that the snw
> >variable, for example, is in either landIce or land. And I also doubt
> >that hard-coding values will scale (pcmdi3 changed to pcmdi7, and we
> >might need to distribute catalogs to multiple nodes if the TDS can't
> >cope with all of them).
> >
> >So I think we should start from your proposal and describe the steps
> >required for getting the information. Did I get the core of the
> >procedure right? (sketched after the list below):
> >1) Start from the file name -> extract the facets (defined by the
> >CMIP5 DRS)
> >2) Find the data node TDS hosting the catalog from the information
> >extracted in 1
> >3) Select all possible catalog references from the main catalog.xml
> >using the data from 1
> >4) Parse the catalogs and find the matching one according to the data
> >from 1 (it might not match all fields, but stripping the temporal part
> >should be reliable for most cases). It will be trickier if the
> >variable got deleted...
> >And then:
> >5) Check whether the file is there (filename and checksum) and return
> >the dataset
> >6) Check the TDS again for those catalogs and see if there are newer
> >versions.
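> >
> >A minimal sketch of that flow in Python (every helper name below is a
> >placeholder for one of the numbered steps, not existing code):
> >
> >    def verify_file(path):
> >        facets = extract_facets(path)               # step 1: DRS facets
> >        node = find_data_node(facets["model"])      # step 2: hosting TDS
> >        for catalog in select_catalog_refs(node, facets):   # step 3
> >            entry = match_dataset(catalog, facets)  # step 4
> >            if entry is None:
> >                continue
> >            ok = entry.checksum == local_checksum(path)     # step 5
> >            newer = newer_versions(node, entry.dataset_id)  # step 6
> >            return ok, entry.dataset_id, newer
> >        return None  # file not found in any catalog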
> >
> >We do have the means to simplify some steps by getting directly to a
> >search service (e.g. P2P). But I think it's better to first define
> >this
> >in terms of the data sources, and then we can exploit services which
> >already do parts of this procedure.
> >
> >Any agreement or desire to move forward in this direction?
> >
> >Thanks,
> >Estani
> >
> >
> >
> >On 21.03.2012 16:05, martin.juckes at stfc.ac.uk wrote:
> >> Hi Estani, and others,
> >>
> >> attached is a simple script to find the catalogued checksum, size
> >> and tracking id of any given CMIP5 data file (though it probably
> >> won't work for GFDL gridspec files).
> >>
> >> System requirements are the Python libxml2 and xml libraries.
> >>
> >> usage:
> >> python check_file.py <file name>
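> >>
> >> For example, with a made-up (but DRS-style) file name:
> >>
> >> python check_file.py tas_Amon_HadGEM2-ES_historical_r1i1p1_185912-200512.nc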
> >>
> >> It gets over the problems Estani raises below by searching the top
> >> level catalogue for a dataset that matches, rather than trying to
> >> construct it. In some cases it may find two (output1 and output2)
> >> and then need to search both for the file specification.
> >>
> >> At present it is a proof of concept -- it only finds the checksum
> >and does not go on to actually check it against the file.
> >>
> >> Does this look like a useful tool? If people think it is, there are
> >> a few points which need tidying up to get reasonable efficiency, and
> >> feedback on likely usage patterns would be useful.
> >>
> >> cheers,
> >> Martin
> >>
> >>
> >> ________________________________________
> >> From: Estanislao Gonzalez [gonzalez at dkrz.de]
> >> Sent: 21 March 2012 12:26
> >> To: Juckes, Martin (STFC,RAL,RALSP)
> >> Cc: taylor13 at llnl.gov; Ben.Evans at anu.edu.au; williams13 at llnl.gov;
> >toussaint at dkrz.de; bryan.lawrence at ncas.ac.uk; lautenschlager at dkrz.de;
> >luca.cinquini at jpl.nasa.gov; Stephane.Senesi at meteo.fr; gavin at llnl.gov;
> >Drach1 at llnl.gov; Pascoe, Stephen (STFC,RAL,RALSP)
> >> Subject: Re: Hawaii CMIP5 meeting report   (was Re: CMIP5 Management
> >telco)
> >>
> >> Hi Martin,
> >>
> >> thanks for the feedback.
> >>
> >> A short comment regarding the service you described: it's not that
> >> easy, as you have to generate the dataset name from the file name,
> >> and AFAIK this requires the CMOR tables -- probably the CMOR tables
> >> as they were when the dataset was created, to cover all cases (but
> >> 99% of them should be covered anyway, so I think it's fine).
> >> Another problem is the product, which can't be determined precisely
> >> without information from the models (again, 99% of all cases should
> >> work).
> >> There are still some issues with the versions and the data node
> >> serving that dataset, but I guess it's doable.
> >>
> >> I still think the best approach for this is to use the P2P search
> >> capability to reconstruct the dataset from the checksums, and then
> >> resolve back to the latest version, as was discussed yesterday.
> >> But I still think we can't do much more without leaving other things
> >> behind (at least I can't).
> >>
> >> That's why I was hoping to get feedback on a list of "features" that
> >we
> >> can postpone so we can concentrate on others.
> >>
> >> Thanks,
> >> Estani
> >>
> >> On 21.03.2012 13:14, martin.juckes at stfc.ac.uk wrote:
> >>> Hi Estani,
> >>>
> >>> Our user community is dominated by academic researchers, so they
> >have no obligation to agree with each other. They are juggling a range
> >of priorities, not working to a set of rigid objectives.
> >>>
> >>> One of the problems we need to sort out is getting data to people
> >quickly. This objective is steadily rising towards the top of the list
> >as the deadline for submission of papers for consideration by the IPCC
> >approaches.
> >>>
> >>> We want to provide quality controlled data, with checksums, with a
> >robust and user friendly search interface -- but the community does
> >not have the luxury of being able to wait for that to appear.
> >>>
> >>> It occurs to me that it would be fairly easy (because security is
> >not involved) to write a script to validate a file (or list of files)
> >against THREDDS catalogues; this would then allow users who have taken
> >shortcuts and used secondary sources to verify their data.
> >>>
> >>> cheers,
> >>> Martin
> >>> ________________________________
> >>> From: Estanislao Gonzalez [gonzalez at dkrz.de]
> >>> Sent: 21 March 2012 10:03
> >>> To: Karl Taylor
> >>> Cc: Ben Evans; Williams, Dean N.; Frank Toussaint; Juckes, Martin
> >(STFC,RAL,RALSP); bryan.lawrence at ncas.ac.uk; Michael Lautenschlager;
> >Cinquini, Luca; Stéphane Senesi; Gavin M Bell; Drach, Bob; Pascoe,
> >Stephen (STFC,RAL,RALSP)
> >>> Subject: Re: Hawaii CMIP5 meeting report (was Re: CMIP5 Management
> >telco)
> >>>
> >>> Hi all,
> >>>
> >>> I have one question/concern regarding sites like the one mentioned
> >in Karl's mail. These sites get _some_ data by _unknown_ methods.
> >What's the position of the community regarding this?
> >>>
> >>> We know we have a lot of problems to sort out, but those sites go
> >around the problems by not confronting them at all. For instance,
> >AFAICT:
> >>> 1) The site stores no version information
> >>> 2) It does not guarantee the data is complete or validated against
> >>> the original sites (sometimes it can't be, as the original sites do
> >>> not provide checksums, or the checksums are wrong)
> >>> 3) It's not integrated into the search, so people can't get to it
> >>> by the same means as other data (it is a new interface)
> >>> 4) There's no way the user can be notified when something changes
> >>> (though there might be a complex architecture behind it)
> >>> 5) I just wonder if there's any logging of data access at all, so
> >>> at least it is known who downloaded what and when, for notification
> >>> and reporting purposes (there might be)
> >>> 6) And most importantly, it completely bypasses the whole security
> >>> we have in place (for instance, I see models in there with
> >>> non-commercial access restrictions... I wonder how those are
> >>> handled?)
> >>>
> >>> Basically, I wonder if we are really trying to achieve what the
> >>> community wants. I think there's a conflict between what we (my
> >>> view of _we_ :-) think science should look like and what the
> >>> scientific community needs.
> >>> IMHO we are holding two different ends of the rope here: data
> >>> quality vs. prompt access.
> >>>
> >>> Of course we are aiming at both ends at the same time (and this
> >*is* doable), but this slows development at both ends... there's no
> >such thing as a free lunch :-)
> >>> Should we keep this development path, or should we define a new one?
> >>>
> >>> I've seen a lot of sites and procedures like this, as well as
> >>> people complaining about not getting the whole CMIP3 archive of
> >>> CMIP5 data (because of its 36 TB size) in a snap.
> >>> I have (almost) no experience in this community, and from the very
> >>> beginning I thought of it as a single community, but it looks like
> >>> the data producers are not the same as the data consumers (still
> >>> referring to WG1).
> >>>
> >>> Having some information on this would help us (me and others :-)
> >>> understand the community, as well as improve development by guiding
> >>> it to satisfy the community's requirements in the order the
> >>> community expects them to be satisfied.
> >>>
> >>> Just my 2c,
> >>> Estani
> >>>
> >>> On 21.03.2012 00:08, Karl Taylor wrote:
> >>> Hi Ben and all,
> >>>
> >>> A short and general report has been prepared summarizing the CMIP5
> >>> meeting held recently in Hawaii
> >>> ( http://www.wcrp-climate.org/documents/ezine/WCRPnews_14032012.pdf ).
> >>> Some of my impressions, which may not be reflected in the official
> >>> summary, include:
> >>>
> >>> 1.  An impressive array of scientific multi-model CMIP5 studies is
> >>> underway (some in pretty advanced phases), so users seem to be
> >>> coping with the frustrations of our current ESG.
> >>>
> >>> 2.  Two scientists (of about 160) said specifically and publicly
> >they could *not* do what they wanted to do because it was so difficult
> >and slow to download data.
> >>>
> >>> 3.  Privately, several scientists expressed frustrations and asked
> >when things would improve.  Nearly everyone understood and appreciated
> >the enormity of the challenges and most seem willing to remain patient
> >a little longer.
> >>>
> >>> 4.  Reto Knutti (a convening? lead author) and Thomas Stocker (co-
> >chair of WG1) are both *counting* on improvements and *concerned* that
> >they will be delayed.  Reto is the one who has put up a website at
> >https://wiki.c2sm.ethz.ch/Wiki/CMIP5 where lots of folks are getting
> >data more easily than through the official ESG sites.  Reto sent me an
> >email summarizing the biggest problems he has getting data to populate
> >his site.  He has identified problems (many of which we're working
> >on), which probably affect all users (summarizing his email copied
> >below):
> >>> a) incorrect MD5 checksums
> >>> b) old, incorrect catalog entries at some nodes
> >>> c) no easy way to report errors (he mentioned "errata websites,
> >feeds, email addresses invalid or no responses"; not sure whether he
> >tried the help desk)
> >>> d) incorrect version numbers (or unclearly labeled)
> >>> e) gaps in data, overlaps, "strange time coordinates"
> >>> f) no way to find out when data is found to contain errors
> >>>
> >>> 5.  A number of folks volunteered to be first users on the p2p
> >system.
> >>>
> >>> Obviously in this email the focus is on what needs fixing.  An
> >enormous amount has already been accomplished.
> >>>
> >>> Please let me know if you are specifically curious about anything
> >else that went on at the CMIP5 meeting, and I'll try to respond.
> >>>
> >>> Best regards,
> >>> Karl
> >>>
> >>> email from Reto Knutti:
> >>> Dear Karl,
> >>>
> >>> As promised, here's a list of main issues we are encountering with
> >CMIP5.
> >>>
> >>> - the manual download is slow and unreliable. A tool that is
> >scriptable is essential, so that it is easy to find out what is new
> >and changed.
> >>> - the IPSL download tool is reasonably good but often fails due to
> >>> old and incorrect catalog entries at the different nodes, incorrect
> >>> MD5 checksums, or a missing PKI interface
> >>> - communication of problems is difficult to impossible (different
> >errata websites, feeds, email addresses invalid or no responses)
> >>> - no clear data version control, which makes it difficult to find
> >out which files are the most recent ones
> >>>
> >>> I think the new interface that you are testing could address part
> >of the above.
> >>>
> >>> But it would also help if PCMDI could communicate clearly to the
> >data centers that they need to make sure their catalogues are up to
> >date, checksums are correct, and versions are clearly labeled. There
> >are further issues that the modeling centers should address (gaps in
> >data, overlaps, strange timescales, etc.). I realize that is a lot of
> >work for the centers, but if they don't do it they create even more
> >work for thousands of people who are trying to analyze the data.
> >>>
> >>> Finally, I would like to stress what Thomas mentioned. We need a
> >way to find out when data is found to contain errors, but we also need
> >a way to give feedback to the modeling groups when we discover issues.
> >If you have a list of contact persons from the centers that you could
> >provide to us, that could help.
> >>>
> >>> Given that scientific papers and the IPCC second order draft are
> >written over the next five months, it is important that the above
> >points are addressed as quickly as possible. I realize of course the
> >constraints that you have at PCMDI, and the fact that some problems
> >are not under your control.
> >>>
> >>> In any case we appreciate all your efforts and support, and would
> >be happy to work with you and help with testing tools etc. if we can.
> >>>
> >>> Thanks,
> >>>
> >>> Reto
> >>>
> >>>
> >>> On 3/12/12 11:36 AM, Ben Evans wrote:
> >>> Thanks Dean.  That sounds good.
> >>>
> >>> Perhaps this is more for Karl and others, but it would be helpful
> >>> to see a report of the Hawaii meeting if it were available before
> >>> then. I will be heading into a local management meeting in two
> >>> weeks and I would like to be up to speed - especially if there are
> >>> other perceptions that came out of the meeting.
> >>>
> >>> Best Wishes,
> >>> Ben
> >>> --
> >>> Dr Ben Evans
> >>> Associate Director (Research Engagement and Initiatives)
> >>> NCI
> >>> http://www.nci.org.au/
> >>> Leonard Huxley Building (#56)
> >>> The Australian National University
> >>> Canberra, ACT, 0200 Australia
> >>> Ph  +61 2 6125 4967
> >>> Fax: +61 2 6125 8199
> >>> CRICOS Provider #00120C
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Estanislao Gonzalez
> >>>
> >>> Max-Planck-Institut für Meteorologie (MPI-M)
> >>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing
> >Centre
> >>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >>>
> >>> Phone:   +49 (40) 46 00 94-126
> >>> E-Mail:  gonzalez at dkrz.de
> >>
> >> --
> >> Estanislao Gonzalez
> >>
> >> Max-Planck-Institut für Meteorologie (MPI-M)
> >> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing
> >Centre
> >> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >>
> >> Phone:   +49 (40) 46 00 94-126
> >> E-Mail:  gonzalez at dkrz.de
> >>
> >>
> >
> >
> >--
> >Estanislao Gonzalez
> >
> >Max-Planck-Institut für Meteorologie (MPI-M)
> >Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >
> >Phone:   +49 (40) 46 00 94-126
> >E-Mail:  gonzalez at dkrz.de


