[Go-essp-tech] Status of Gateway 2.0 (another use case) [SEC=UNCLASSIFIED]

Lawson Hanson L.Hanson at bom.gov.au
Thu Dec 15 14:50:11 MST 2011


Hi Jennifer,

Firstly, a big "Thank you!"  I have been enjoying the flow of information that your recent email messages have generated.  Like you, I am an end-user of the CMIP5 data, and like you, I have been wrestling with the painfully slow web-page interface: click to select another dataset parameter, wait an interminably long time for the list to update, select the next parameter, do some more waiting, ... so I completely sympathise with your frustrations.

However, I wanted to pick up on one point that you made about losing the "version" information.  Now certainly, when you run the "wget" script you end up with (what I call) a long descriptive file name, e.g., "pr_Amon_HadCM3_historical_r1i1p1_195912-198411.nc", but what I do is this:

The "wget" scripts contain one line of information for each of the files that is being downloaded, and those lines contain at least two things (some contain other things like MD5 checksums, etc.), but there is the long descriptive file name, but also there is the (usually) "thredds" path to the file that is to be downloaded by the "wget" command.  I used that "thredds" server path to produce a directory path on our (NCI) computer and move the downloaded file into an (almost) identical path on our server so that, for example, the file mentioned above is stored at (something like):

    /path/to/downloaded/cmip5data/cmip-dn.badc.rl.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadCM3/historical/mon/atmos/Amon/r1i1p1/v20110823/pr/pr_Amon_HadCM3_historical_r1i1p1_195912-198411.nc
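
In rough outline, that re-housing step looks something like the following sketch (the script name, download area, and mirror root are just placeholders, and the exact format of the download lines varies a little between data nodes, so treat it as an illustration rather than our production script):

    #!/bin/bash
    # Illustration only: mirror a downloaded CMIP5 file under a local copy
    # of the THREDDS server path.  WGET_SCRIPT, DOWNLOAD_DIR and MIRROR_ROOT
    # are placeholder names, not our real locations.
    WGET_SCRIPT="wget-download.sh"
    DOWNLOAD_DIR="/path/to/incoming"
    MIRROR_ROOT="/path/to/downloaded/cmip5data"

    # Each download line in the script carries the descriptive file name and
    # the full THREDDS URL; pull the URLs out.
    grep -o "http[s]*://[^ ']*thredds[^ ']*\.nc" "$WGET_SCRIPT" |
    while read -r url; do
        relpath="${url#*://}"                            # host/thredds/.../file.nc
        file="$(basename "$relpath")"
        mkdir -p "$MIRROR_ROOT/$(dirname "$relpath")"    # recreate the server path
        mv "$DOWNLOAD_DIR/$file" "$MIRROR_ROOT/$relpath" # move the file into place
    done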

We originally did this so that the ESG-NCI folk in Canberra could have some traceability as to what data had already been downloaded, and where that data had come from, so that if/when an automated replication system eventuates they might be able to target their replication efforts.

The next thing I do is create a symbolic link to the newly added file.  The symbolic links use a simpler hierarchical directory path composed of the components found in the long descriptive file names (a small sketch of the link-building step follows the component list below), e.g.:

    /path/to/cmip5links/historical/Amon/pr/HadCM3/r1i1p1/pr_Amon_HadCM3_historical_r1i1p1_195912-198411.nc

    i.e., {Expt}/{Mrf}/{Var}/{Mdl}/{Rip}/{File}.nc

    where:
        {Expt} ---> CMIP5 Experiment Identifier
                    e.g., 'historical', 'rcp45', 'rcp85', ...

        {Mrf} ----> CMIP5 Modelling Realm and Data Frequency (abbreviation)
                    e.g., '6hrLev', '6hrPlev', 'Amon', 'OImon', 'day', ...

        {Var} ----> CMIP5 Short Variable Identifier
                    e.g., 'hur', 'psl', 'ta', 'ua', 'va', 'zg', ...

        {Mdl} ----> CMIP5 Model Identifier
                    e.g., 'CNRM-CM5', 'CanESM2', 'HadGEM2-ES',
                          'IPSL-CM5A-LR', 'bcc-csm1-1', 'inmcm4', ...

        {Rip} ----> CMIP5 Run/Initialisation/Physics Ensemble Identifier,
                    e.g., 'r1i1p1', 'r2i1p1', 'r3i1p1', ...

        {File} ---> CMIP5 Descriptive File Name
                    e.g., 'hur_Amon_inmcm4_rcp45_r1i1p1_200601-201512.nc'
                    i.e., {Var}_{Mrf}_{Mdl}_{Expt}_{Rip}[_{date-Range}].nc
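
Here is that small sketch of the link-building step (the link root and the function name are placeholders I have made up for this message, not the actual script we run):

    #!/bin/bash
    # Illustration only: build the {Expt}/{Mrf}/{Var}/{Mdl}/{Rip} link
    # hierarchy from a file's long descriptive name.  LINK_ROOT and the
    # function name are placeholders.
    LINK_ROOT="/path/to/cmip5links"

    make_link () {
        local target="$1"                 # full path to the stored .nc file
        local file var mrf mdl expt rip rest
        file="$(basename "$target")"
        # Descriptive name: {Var}_{Mrf}_{Mdl}_{Expt}_{Rip}[_{date-Range}].nc
        IFS='_' read -r var mrf mdl expt rip rest <<< "${file%.nc}"
        mkdir -p "$LINK_ROOT/$expt/$mrf/$var/$mdl/$rip"
        ln -sf "$target" "$LINK_ROOT/$expt/$mrf/$var/$mdl/$rip/$file"
    }

    # Usage: make_link <full path to a stored .nc file>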

So in the end we do retain the "version" number and the other information that was delivered in the "thredds" server path, if it is needed, but we also have a much simpler path to the data files, which I find simplifies the handling of those files in the scripts that I, and others, write to process the thousands of files we need for our climate-related analyses.

One other thing I have done here is to write a script that splits a multiple-file "wget-download.sh" script into separate single-file "wget" scripts, which we then launch as batch jobs so that many downloads can run in parallel.  Australia is a rather remote country compared to where many others in the CMIP5 data user community dwell (i.e., in or closer to the USA, the UK, or Europe, probably with very high-speed access to the CMIP5 data nodes), so I have been driven to try all sorts of things to speed up our data downloads.  So far, over the last three months, I have managed to download 12 TB (terabytes) of CMIP5 data, which is about 10% of what we want (initially).
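
For what it is worth, the splitting idea is roughly the following (again only a sketch with made-up file names; the real ESGF wget scripts also handle checksums, login cookies, and so on, which I have left out):

    #!/bin/bash
    # Illustration only: split a multi-file "wget-download.sh" into one small
    # script per file so the downloads can be submitted as parallel batch jobs.
    WGET_SCRIPT="wget-download.sh"        # placeholder name
    OUT_DIR="split-jobs"                  # placeholder name
    mkdir -p "$OUT_DIR"

    n=0
    grep -o "http[s]*://[^ ']*\.nc" "$WGET_SCRIPT" | while read -r url; do
        n=$((n + 1))
        job="$OUT_DIR/download_$(printf '%04d' "$n").sh"
        printf '#!/bin/bash\nwget -c "%s" -O "%s"\n' "$url" "$(basename "$url")" > "$job"
        chmod +x "$job"
    done

    # The per-file scripts can then be queued with the batch system, or simply:
    #     for f in split-jobs/*.sh; do "$f" & done; wait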

I was _very_ pleased to read about the P2P functionality that is being developed.  Like you, I will find it _extremely_ useful to be able to define a set of dataset-describing parameters and construct the "wget" search request directly, so I can simply obtain the associated "wget" script (instead of the current point, click, wait, wait, wait web-page affair).
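
What I am hoping for is that a single scripted request will hand back the "wget" script directly, along these lines (the host name and parameter names below are purely my guess at what the final P2P interface might look like):

    # Guesswork only: a scripted request that hands back a wget script, with a
    # made-up host and parameter names standing in for whatever the P2P search
    # interface finally provides.
    wget -O wget-pr-HadCM3.sh \
        "http://some-p2p-node.example.org/esg-search/wget?project=CMIP5&experiment=historical&cmor_table=Amon&variable=pr&model=HadCM3&ensemble=r1i1p1"
    bash wget-pr-HadCM3.sh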

Anyway, I have raved on for long enough :-).

Again, many thanks for your messages describing your trials and tribulations in using the current tools.  I thought I was the only one who was going (gone) grey, and tearing my hair out!

Best regards,

Lawson Hanson
-------------
    CAWCR (Centre for Australian Weather and Climate Research),
    Climate Variability and Change group,
    Climate Change Science team,
    Bureau of Meteorology, 700 Collins Street,
    Melbourne Docklands, VIC 3008


From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Jennifer Adams
Sent: Thursday, 15 December 2011 4:39 AM
To: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Status of Gateway 2.0 (another use case)

Well, after working from the client side to get CMIP3 and CMIP5 data, I can say that wget is a fine tool to rely on at the core of the workflow. Unfortunately, the step up in complexity from CMIP3 to CMIP5 and the switch from FTP to HTTP trashed the elegant use of wget. No amount of customized wrapper software, browser interfaces, or pre-packaged tools like DML fixes that problem.

At the moment, the burden on the user is embarrassingly high. It's easy to suggest that the user should "filter to remove what is not required" from a downloaded script, but the actual practice of doing that in a timely, automated, and distributed way is NOT simple! And if the solution to my problem of filling the gaps in my incomplete collection is to go back to clicking in my browser and do the whole thing over again, only with smarter filters that look for what has already been acquired or what has a new version number ... that is unacceptable. The filtering must be a server-side responsibility, and the interface must be accessible to automated scripts. Make it so!

By the way, the version number is a piece of metadata that appears neither in the downloaded files nor in the gateway's search criteria. It does appear in the wget script, as part of the path in the file's http location, but that path is not preserved once the wget completes, so the version is effectively lost after the download. I guess the file's date stamp would be the only way to know whether the version of the data file in question has changed, but I'm not going to write that check into my filtering scripts.
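
If one really wanted to keep it, something like the following rough sketch could scrape the version string out of the wget script before running it (the file names are made up, and it assumes the usual vYYYYMMDD directory appears in each URL):

    #!/bin/bash
    # Rough sketch: save each file's version directory (e.g. v20110823) from
    # the wget script's URLs before that path information is thrown away.
    WGET_SCRIPT="wget-download.sh"        # made-up name
    MANIFEST="version-manifest.txt"       # made-up name

    grep -o "http[s]*://[^ ']*\.nc" "$WGET_SCRIPT" | while read -r url; do
        file="$(basename "$url")"
        version="$(printf '%s\n' "$url" | grep -o '/v[0-9]\{8\}/' | tr -d '/')"
        echo "$file ${version:-unknown}" >> "$MANIFEST"
    done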

--Jennifer


--
Jennifer M. Adams
IGES/COLA
4041 Powder Mill Road, Suite 302
Calverton, MD 20705
jma at cola.iges.org



