[Go-essp-tech] cmip5

Stéphane Senesi Stephane.Senesi at meteo.fr
Mon Sep 12 07:05:21 MDT 2011


Dear colleagues

I have also written a simple tool (actually a set of bash functions) that 
uses wget and the Thredds catalog structure. It provides a versatile main 
function, "esgfiles", whose documentation is reproduced below, and another 
function, "esgcheck", for monitoring newly published CMIP5 data. It is 
intended to meet the needs of scientists who want to query and retrieve 
CMIP5 data in a scriptable way. It also handles file organizations that do 
not match the recommended DRS structure. It uses undocumented (and maybe 
unreliable) aspects of the Thredds catalog in a very crude way. It 
amounts to some 300 lines of bash (including 100 lines of documentation).

Do you think it could be useful to scientists at this stage, given that 
an API is currently being developed, and taking into account 
Sébastien's recent offer?

Regards

Stéphane


> esgfiles dn_pattern [ action [ base_url [ adn_pattern [ wgetargs [ nmax ]]]]]
>
> Performs ACTION for all dataset entries of the Thredds catalog hosted 
> at BASE_URL which do match DN_PATTERN
>
> DN_PATTERN should be a regular expression. It is matched against 
> dataset names (and not against file names), i.e. against strings like:
> cmip5.output1.CNRM-CERFACS.CNRM-CM5.piControl.day.atmos.day.r1i1p1.v20110701.html
>
> The reference document for dataset names is 
> http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf
> ACTION may be :
>         list : Print short dataset names and number of files; this is 
> the default
>         listlong : Print dataset names and number of files per dataset
>         urls : Print atomic dataset files URLs
>         check : in addition, checks that the atomic dataset files are 
> reachable (by wget --spider)
>         get : in addition, downloads the files (with wget, which will 
> get argument WGETARGS as arguments)
> BASE_URL : either part of a datanode name, or the URL of a data 
> node, or 'all'
>         'all' means all known datanodes (see list in code)
>         defaults to $esg_base_url, whose current value is 
> esg.cnrm-game-meteo.fr (tune in the code or by setting the environment 
> variable).
> ADN_PATTERN is an optional regular expression acting as an additional 
> filter. It can be used to filter according to variable names, because 
> it is matched against atomic dataset names, i.e. against strings like:
> cmip5.output1.CNRM-CERFACS.CNRM-CM5.historicalMisc.mon.landIce.LImon.r1i1p1.v20110722.sbl_LImon_CNRM-CM5_historicalMisc_r1i1p1_185001-189912.nc
>
> WGETARGS applies only when ACTION == get, and accepts arguments to 
> wget.
>         Use it to tune the organization of downloaded files, e.g. with '-r' 
> (see 'man wget')
> NMAX is the maximum number of data files to process. It does not apply 
> to actions : list and listlong.
>         Default is : no limit
>
> For pre-requisites, type  : esgdoc setup
>
> Examples :
>   - esgfiles historicalMisc.fx list "" ""
>   - esgfiles "amip.*6hrLev" check "" vesg.ipsl.fr 5
>
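
The two-level filtering described above (DN_PATTERN on dataset names, then ADN_PATTERN on atomic dataset names, capped by NMAX) can be sketched in Python as follows. This is only an illustration, not the bash implementation: the function name `select_files`, the dictionary `SAMPLE_CATALOG` and the entries marked "hypothetical" are assumptions made for the example, and it assumes unanchored (grep-like) regular expression matching.

```python
import re

# Dataset name -> atomic dataset names. The two dataset names and the "sbl"
# entry come from the esgfiles documentation; entries marked "hypothetical"
# are invented for this example.
SAMPLE_CATALOG = {
    "cmip5.output1.CNRM-CERFACS.CNRM-CM5.piControl.day.atmos.day.r1i1p1.v20110701.html": [
        # hypothetical
        "cmip5.output1.CNRM-CERFACS.CNRM-CM5.piControl.day.atmos.day.r1i1p1"
        ".v20110701.tas_day_CNRM-CM5_piControl_r1i1p1_18500101-18541231.nc",
    ],
    "cmip5.output1.CNRM-CERFACS.CNRM-CM5.historicalMisc.mon.landIce.LImon.r1i1p1.v20110722.html": [
        "cmip5.output1.CNRM-CERFACS.CNRM-CM5.historicalMisc.mon.landIce.LImon.r1i1p1"
        ".v20110722.sbl_LImon_CNRM-CM5_historicalMisc_r1i1p1_185001-189912.nc",
        # hypothetical
        "cmip5.output1.CNRM-CERFACS.CNRM-CM5.historicalMisc.mon.landIce.LImon.r1i1p1"
        ".v20110722.snw_LImon_CNRM-CM5_historicalMisc_r1i1p1_185001-189912.nc",
    ],
}


def select_files(catalog, dn_pattern, adn_pattern="", nmax=None):
    """Return atomic dataset names whose parent dataset name matches
    dn_pattern and which themselves match adn_pattern.

    An empty adn_pattern means "no additional filter"; nmax=None means
    "no limit". Matching is unanchored (assumption: grep-like behaviour).
    """
    dn_re = re.compile(dn_pattern)
    adn_re = re.compile(adn_pattern) if adn_pattern else None
    selected = []
    for dataset_name, atomic_names in catalog.items():
        if not dn_re.search(dataset_name):
            continue  # dataset rejected: none of its files are considered
        for atomic in atomic_names:
            if adn_re is None or adn_re.search(atomic):
                selected.append(atomic)
    return selected if nmax is None else selected[:nmax]
```

For instance, `select_files(SAMPLE_CATALOG, r"historicalMisc\..*LImon", r"\.sbl_")` keeps only the "sbl" file, which is how ADN_PATTERN can act as a variable-name filter.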

Examples of esgfiles runs today (outputs are truncated):
> > esgfiles historical\..*Amon listlong badc
> 2/cmip5.output1.MOHC.HadGEM2-ES.historical.mon.atmos.Amon.r3i1p1.v20110418.html  
> - 0276 entries
> 2/cmip5.output1.MOHC.HadGEM2-ES.historical.mon.atmos.Amon.r2i1p1.v20110418.html  
> - 0278 entries
> ....
> 2/cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.mon.atmos.Amon.r1i1p1.v20110330.html  
> - 0276 entries 

> > esgfiles "historical\..*Amon.*r3" urls badc
> cmip-dn.badc.rl.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadGEM2-ES/historical/mon/atmos/Amon/r3i1p1/v20110418/sci/sci_Amon_HadGEM2-ES_historical_r3i1p1_185912-188411.nc
> cmip-dn.badc.rl.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadGEM2-ES/historical/mon/atmos/Amon/r3i1p1/v20110418/sci/sci_Amon_HadGEM2-ES_historical_r3i1p1_188412-190911.nc 
> .........

> > esgfiles "historical\..*Amon.*r3" urls cnrm
> esg.cnrm-game-meteo.fr/thredds/fileServer/esg_dataroot5/CMIP5/output/CNRM-CERFACS/CNRM-CM5/historical/mon/atmos/rsdt/r3i1p1/rsdt_Amon_CNRM-CM5_historical_r3i1p1_185001-189912.nc
> esg.cnrm-game-meteo.fr/thredds/fileServer/esg_dataroot5/CMIP5/output/CNRM-CERFACS/CNRM-CM5/historical/mon/atmos/rsdt/r3i1p1/rsdt_Amon_CNRM-CM5_historical_r3i1p1_190001-194912.nc
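
The BADC and CNRM URLs above use different directory layouts, but the file names follow the same CMIP5 convention (variable, MIP table, model, experiment, ensemble member, and an optional time range). A minimal Python sketch of extracting these fields from a URL is given below. The names `Cmip5File` and `parse_cmip5_filename` are hypothetical, and the sketch assumes that none of the fields contains an underscore.

```python
import posixpath
from collections import namedtuple

# Fields of a CMIP5 file name, in the order they appear in the name.
Cmip5File = namedtuple(
    "Cmip5File", "variable table model experiment ensemble period"
)


def parse_cmip5_filename(url):
    """Split the last path component of `url` into its CMIP5 name fields.

    `period` is None when the name carries no time range (five fields
    instead of six). Raises ValueError for names that do not fit.
    """
    stem, ext = posixpath.splitext(posixpath.basename(url))
    if ext != ".nc":
        raise ValueError("not a NetCDF file name: %r" % url)
    parts = stem.split("_")
    if len(parts) == 5:
        parts.append(None)  # no time range in the name
    if len(parts) != 6:
        raise ValueError("unexpected number of fields in %r" % stem)
    return Cmip5File(*parts)


# One of the CNRM URLs listed above.
info = parse_cmip5_filename(
    "esg.cnrm-game-meteo.fr/thredds/fileServer/esg_dataroot5/CMIP5/output/"
    "CNRM-CERFACS/CNRM-CM5/historical/mon/atmos/rsdt/r3i1p1/"
    "rsdt_Amon_CNRM-CM5_historical_r3i1p1_185001-189912.nc"
)
print(info.variable, info.model, info.period)
```

Relying on the file name rather than on the directory path is one way to cope with data nodes whose layout does not follow the recommended DRS structure.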



Sébastien Denvil wrote, On 12/09/2011 14:23:
>  Jonathan,
>
> we also felt the need to have a program to download files from the 
> CMIP5 archive in an easy way, for a list of variables and experiments. 
> At IPSL, we have developed a tool to help do this. It's a first 
> version that will be progressively improved (in particular the "user 
> guide"). The program will evolve together with the cmip5 archive 
> backend functionalities.
>
> The user defines one or more templates. Each of them has a list of 
> variables, frequencies and experiments. The user also defines a list of 
> models. Using these templates, the program explores the ESG grid and 
> downloads all the corresponding files that are available (and only for 
> the first ensemble member in the current version). The program may be 
> run regularly to download any new files. Typically each 
> template is associated with an analysis (cfmip template, downscaling 
> template and so on). Create as many TemplateName.txt files as you want in 
> the user_selections folder (following the (trivial) syntax of 
> user_selections/default.txt) and you are done.
>
> Here is the procedure to install the CMIP5 data download program. 
> Except for two dependencies (sqlite), it's a non-root install:
> http://dods.ipsl.jussieu.fr/jripsl/synchro_data/README
>
> The program has the following features:
> * support for myproxy-logon and myproxyclient
> * simple data selection with model,experiment,realm and variable
> * multi threaded downloads (8 tasks by default)
> * manages dataset versions following the new DRS
> * incremental process (download only what's new)
> * download history stored in a db
>
> It has been tested with the following models: HadGEM2-ES, HadGEM2-A, 
> CanESM2, CNRM-CM5, NorESM1-M, CanCM4, CSIRO-Mk3-6-0. IPSL-CM5A-LR will 
> be added shortly ... :-)
>
> Feel free to use it and to ask us if you have any questions, 
> difficulties or suggestions to improve the program.
>
> Enjoy your analysis.
>
> Cheers,
> Sébastien
>
> On 12/09/2011 13:12, Williams, Dean N. wrote:
>> Dear Jonathan and Stephen,
>>
>>     We are also working on other solutions to help alleviate the problems
>> mentioned below, such as replicating most of the archive at various
>> locations around the world. As Stephen mentioned, we are aware of these
>> shortcomings and others and are working "quickly" to address them.
>>
>> Thanks and best regards,
>>     Dean
>>
>> On 9/12/11 4:04 AM, "stephen.pascoe at stfc.ac.uk"
>> <stephen.pascoe at stfc.ac.uk>  wrote:
>>
>>> Dear Jonathan,
>>>
>>> Thanks for taking the time to describe your concerns about the usability
>>> of the CMIP5 archive system.  I am CC'ing this to go-essp-tech at ucar.edu
>>> as I think your feedback is particularly welcome and insightful and
>>> deserves to be seen and discussed widely.
>>>
>>> We are aware of many of the shortcomings you identify; improvements in
>>> software and documentation are in progress that I hope will improve your
>>> experience.  However, our progress has been slower than we'd hoped and we
>>> are now up against significant CMIP5 usage which will inevitably impede
>>> rolling-out improvements.  We would have hoped to have the system more
>>> usable by now but we are pushing hard to improve the system as quickly as
>>> possible.
>>>
>>> You identify several user interface and performance issues with the ESG
>>> Gateway search system.  Our colleagues at NCAR have been developing a new
>>> version of the Gateway with an improved search backend that I believe
>>> solves many of your concerns.  I've seen a test deployment at NCAR and it
>>> is a significant improvement.  We at BADC will be deploying it for
>>> testing in the next couple of days in the hope that it can be rolled-out
>>> quickly for end-users.
>>>
>>> Another point in your feedback is scriptability of downloads and checking
>>> what is available.  We had hoped that the wget script generation feature
>>> of the gateway would produce wget scripts that could be edited to
>>> download different sorts of data by leveraging the Data Reference Syntax
>>> [1].  Unfortunately, although some download URLs contain DRS information
>>> that would help deducing alternative downloads, this isn't practical at
>>> present.  We are working to improve the DRS consistency of the archive,
>>> which we hope will improve download scriptability.
>>>
>>> The other mechanism you could use to programmatically download data and
>>> discover new data is reading the THREDDS catalogs.  Every centre serving
>>> CMIP5 data is running a THREDDS Data Server [2] which lists all download
>>> URLs in a network of THREDDS XML catalogs.  This is intended as an
>>> internal interface so isn't well documented.  However, I think it is no
>>> secret that some users are doing this already.  You can find the THREDDS
>>> source catalog of every dataset in the "History" tab of the Gateway's
>>> dataset page or they can be deduced from download URLs and a little
>>> knowledge of TDS.
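
As an illustration of reading download URLs from a THREDDS XML catalog, here is a minimal Python sketch using only the standard library. The catalog text `SAMPLE_CATALOG_XML`, the host name `data.example.org` and the function `file_urls` are made up for the example. Real ESG catalogs are richer (nested catalogs, compound services, per-file access elements), so this shows the general idea rather than a complete client.

```python
import xml.etree.ElementTree as ET

# XML namespace of THREDDS inventory catalogs (version 1.0).
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# A tiny, invented catalog: one HTTP file service and two files.
SAMPLE_CATALOG_XML = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="sample catalog">
  <service name="HTTPServer" serviceType="HTTPServer" base="/thredds/fileServer/"/>
  <dataset name="sample dataset" ID="sample.dataset.v1">
    <dataset name="tas_sample_185001-189912.nc"
             urlPath="esg_dataroot/sample/tas_sample_185001-189912.nc"/>
    <dataset name="tas_sample_190001-194912.nc"
             urlPath="esg_dataroot/sample/tas_sample_190001-194912.nc"/>
  </dataset>
</catalog>
"""


def file_urls(catalog_xml, host):
    """Return one HTTP download URL per dataset element carrying a urlPath."""
    root = ET.fromstring(catalog_xml)
    ns = {"t": THREDDS_NS}
    # Use the base path of the HTTP file service if the catalog declares one.
    service = root.find(".//t:service[@serviceType='HTTPServer']", ns)
    base = service.get("base") if service is not None else "/thredds/fileServer/"
    return [
        "http://" + host + base + ds.get("urlPath")
        for ds in root.iterfind(".//t:dataset[@urlPath]", ns)
    ]


for url in file_urls(SAMPLE_CATALOG_XML, "data.example.org"):
    print(url)
```

URLs obtained this way can then be handed to wget or a similar downloader, subject to the security configuration of the data node.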
>>>
>>> I should add that downloading data directly from a TDS will only work if
>>> it is configured to use "tokenless" security.  This is the case with only
>>> some datanodes at present but should be fixed in the near term.
>>>
>>> In the medium-term ESGF are planning documented service APIs that would
>>> allow users to query the system programmatically and there is a new P2P
>>> architecture in the works with more focus on scalability [3].
>>>
>>> Regards,
>>> Stephen Pascoe.
>>>
>>> [1] CMIP5 Data Reference Syntax:
>>> http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf
>>> [2] THREDDS Data Server: http://www.unidata.ucar.edu/projects/THREDDS/
>>> [3] ESGF P2P Architecture: http://esgf.org/wiki/ESGF_Index
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> Centre of Environmental Data Archival
>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 
>>> 0QX, UK
>>>
>>>
>>> -----Original Message-----
>>> From: Jonathan Gregory [mailto:j.m.gregory at reading.ac.uk]
>>> Sent: 12 September 2011 11:12
>>> To: esg-support at earthsystemgrid.org
>>> Subject: cmip5
>>>
>>> Dear ESG
>>>
>>> In preparation for working on the 1st draft of the AR5, I have begun to try
>>> to download CMIP5 data. I have to say I am discouraged by the experience.
>>> Using this web interface is slow and inconvenient, and I fear it will be an
>>> obstacle to the work required to be done. The biggest limitation, I would
>>> say, is that there is *only* a web interface. For CMIP3, I used ftp to
>>> download the data, having written my own scripts. That minimised the manual
>>> effort required, and most importantly I could use my script to fetch data I
>>> didn't already have, which it could easily identify. With a web interface,
>>> working out what I don't already have will only be possible by manual
>>> comparison, which will take a lot of time. Is the http protocol that the web
>>> interface uses something that could be employed in a script? If so, could
>>> you document it? Even if the protocol is tricky, I would still much rather
>>> write a script than use a web interface, as in the end it will be more
>>> efficient.
>>>
>>> However, the web interface could be improved in various ways, I think,
>>> which would make it more efficient. As it stands, I find the following
>>> inconvenient:
>>>
>>> * The PCMDI gateway is sometimes slow. This morning (UK time) it is
>>> terribly slow - unusable, in fact.
>>>
>>> * It always searches when you change any of the criteria, so it searches
>>> all of CMIP5 when you select the "Project", for instance. This wastes time.
>>>
>>> * You have to select "all" in order to see the whole list again and make a
>>> new selection, again wasting time with unnecessary searching.
>>>
>>> * There is no way to select more than one thing at a time e.g. more than
>>> one experiment or more than one quantity.
>>>
>>> * All the datasets have to be ticked individually to proceed to download,
>>> which is tedious.
>>>
>>> * If there is more than one page, you can tick only one page at a time,
>>> so you have to start all over again to do the next page, by repeating the
>>> whole search laboriously.
>>>
>>> * I can't (yet) get MRI or MIROC data, as it requires some further
>>> authorisation that I have applied for. In fact I applied several days ago,
>>> and I have not yet been authorised. How can I chase this up?
>>>
>>> * The search facility at the top seems flaky. The "loading, please wait"
>>> never goes away and it crashes with an http error sometimes.
>>>
>>> * Although I would have thought that many users said that CMIP3 would
>>> have been much more convenient if it had been possible to download annual
>>> data rather than monthly - I certainly made this comment - that facility
>>> has not been provided in the CMIP5 interface.
>>>
>>> I am sure many people would be grateful if you could make some
>>> improvements. (And I expect I am not the first to make these suggestions!)
>>>
>>> Best wishes
>>>
>>> Jonathan Gregory
>> _______________________________________________
>> GO-ESSP-TECH mailing list
>> GO-ESSP-TECH at ucar.edu
>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech


-- 
Stéphane Sénési
Ingénieur - équipe Assemblage du Système Terre
Centre National de Recherches Météorologiques
Groupe de Météorologie à Grande Echelle et Climat

CNRM/GMGEC/ASTER
42 Av Coriolis
F-31057 Toulouse Cedex 1

+33.5.61.07.99.31 (Fax :....9610)
