[Go-essp-tech] Status of Gateway 2.0 (another use case)

Jennifer Adams jma at cola.iges.org
Thu Dec 15 15:41:22 MST 2011


Dear Sebastien and everyone else,

I am replying to the go-essp-tech forum on this, because I think your strategy should be shared with the others who are reading this thread and getting ideas for how to manage CMIP5 data. 

I did read the README file you pointed to. I must say that the details you describe below make the package even more enticing; thanks for the second reminder. I don't know anything about Python, which is one thing that discouraged me from digging deeper. Plus, it was only yesterday.

I've been so busy reading and writing emails to this thread that I haven't gotten very much work done. My downloads are not making progress! That's okay, it's worth it. The discussion has been incredibly helpful to me. Knowing there are others who have struggled with these issues, and learning about the solutions that have been designed, really cheers me up. The setup you have created does solve my version problem, and I shall strive to implement it, or maybe something similar to what Lawson Hanson has just described. I'm not sure I can convince the COLA scientists to use version control on all their analyses and figures, but I will discuss it with them.
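
To check that I have understood the bookkeeping, here is the kind of shell snippet I imagine using to keep a "latest" symlink pointed at the newest version directory in a local DRS tree. This is only my own sketch of the idea, not Sébastien's tool; the root path is a made-up example, and it assumes version directory names like v20111014 sort correctly with a plain lexical sort.

  #!/bin/sh
  # Sketch only: refresh a "latest" symlink under every versioned dataset
  # directory of a local DRS replica. ROOT is a hypothetical local path.
  ROOT=/data/CMIP5/output1

  # Find each dataset directory holding version subdirectories (v1, v20111014, ...)
  # and point its "latest" link at the lexically greatest version name.
  find "$ROOT" -type d -name 'v[0-9]*' | sed 's|/v[0-9][^/]*$||' | sort -u |
  while read -r dset; do
      newest=$(ls -d "$dset"/v[0-9]* 2>/dev/null | sed 's|.*/||' | sort | tail -1)
      [ -n "$newest" ] || continue
      ln -sfn "$newest" "$dset/latest"    # -n replaces an existing symlink in place
  done

The analysis scripts would then read only through .../latest/..., and the version actually used on a given date could be recovered from the download log, as Sébastien describes below.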

Lawson, your technique of breaking up the wget script into pieces, one file per piece, and running them in batches, is brilliant. I wish I had thought of that. It neatly solves the problem of the authorization tokens that expire in only 2 hours. 
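
For anyone else who wants to try this, here is roughly how I picture the batching. This is my own guess at the mechanics, not Lawson's actual script; the file names, the batch size of 200, the myproxy server, and the credential paths are all placeholders for whatever your site really uses.

  # Sketch: run the download list in small batches so each batch finishes
  # well inside the ~2-hour lifetime of the credential. Assumes the file URLs
  # from the gateway wget script have been extracted into url_list.txt.
  split -l 200 url_list.txt batch_          # 200 URLs per batch file (arbitrary)
  for b in batch_*; do
      # Renew the short-lived credential before each batch; myproxy-logon and
      # these paths are just one example of how a site might handle this.
      myproxy-logon -s pcmdi3.llnl.gov -l $USER -o $HOME/.esg/credentials.pem
      wget -c -i "$b" \
           --certificate=$HOME/.esg/credentials.pem \
           --private-key=$HOME/.esg/credentials.pem \
           --ca-directory=$HOME/.esg/certificates
  done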

Thank you, developers and fellow users, for keeping this thread active and informative! It's been invaluable. 
--Jennifer

 
On Dec 15, 2011, at 4:26 PM, Sébastien Denvil wrote:

> Hi Jennifer,
> 
> A lot has already been said on this thread, and I think that your two queries were addressed by my email yesterday:
> 
> - "At the moment, I have no good ideas for how to solve the problem of replacing files in my local CMIP5 collection with newer versions if they are available."
> 
> Please read this :
> http://dods.ipsl.jussieu.fr/jripsl/synchro_data/README
> 
> - "The problem of how to keep track of version numbers and update my copy when necessary remains."
> 
> Please read this again and ask me any questions you want:
> http://dods.ipsl.jussieu.fr/jripsl/synchro_data/README
> 
> Regarding CMIP5 analysis I apply the following strategy.
> 
> 1) Accept that the latest version is the best one and that you have to use it. We are a data provider running a CMIP5 datanode, so there is no surprise here.
> 
> 2) Decide how many times you plan to redo your figures (adding new models and experiments) to check that your preliminary results remain consistent.
> 
> 3) Replicate your CMIP5 Archive subset using the datanode DRS structure and keep the version information:
> /prodigfs/esg/CMIP5/output1/MPI-M/MPI-ESM-LR/rcp26/day/atmos/day/r1i1p1/v20111014
> /prodigfs/esg/CMIP5/output1/MRI/MRI-CGCM3/rcp26/day/atmos/day/r1i1p1/v1
> /prodigfs/esg/CMIP5/output1/IPSL/IPSL-CM5A-LR/rcp26/day/atmos/day/r3i1p1/v20111119
> 
> 4) Create a "latest" symlink pointing to the latest version:
> /prodigfs/esg/CMIP5/output1/MPI-M/MPI-ESM-LR/rcp26/day/atmos/day/r1i1p1/latest --> v20111014
> /prodigfs/esg/CMIP5/output1/MRI/MRI-CGCM3/rcp26/day/atmos/day/r1i1p1/latest --> v1
> /prodigfs/esg/CMIP5/output1/IPSL/IPSL-CM5A-LR/rcp26/day/atmos/day/r3i1p1/latest --> v20111119
> 
> 5) Build your analysis using the "latest" version *ONLY*.
> 
> 6) Create your figures, keep track of the files you used, and keep track of *the date* the figures were produced: version your figures :-)
> 
> 7) Your replica collection evolves every day, together with the "latest" symlink. Log information in a DB for every single file you download (download date, version, MD5, tracking_id, and so on).
> 
> 8) Update your figures with your new collection as you planned. Again, keep track of the files you used and of *the date* the figures were produced.
> 
> 9) If something changed fundamentally in your figures, ask the local DB what changed between date1 and date2 (you have the list of files you used).
> 
> 10) Isolate the models and/or variables responsible for those fundamental changes; contact the CMIP5 help desk *and* the respective modelling centre directly.
> 
> I know it's kind of heavy, but there is no other way to do it. I also know that this strategy is not compatible with the current gateway/wget features.
> 
> The tool mentioned earlier allows all this!
> 
> Regards.
> Sébastien
> 
> On 15/12/2011 21:12, Bryan Lawrence wrote:
>> Hi Jennifer
>> 
>> Everything is a compromise isn't it :-)
>> 
>> The reality is that folks have decided that the filename isn't the place to push all our metadata. That's why we introduced three other key pieces of metadata: the DRS, the tracking ID in the file, and the metafor metadata. The DRS was a compromise on a compromise ... basically, we know that we can't rely on anyone to preserve either the filename (which already had its own structure) or the directory structure (the DRS), so *relying* on either of those is broken.
>> 
>> However, a) you should be able to unambiguously get provenance from the tracking id in the file ... so in that sense, we are way further forward than CMIP3, and
>> b) in principle you can only get at these files through a version aware interface (yeah, I know that it's not perfect in this regard, but that's not for want of our trying).
>> 
>> But yes, that puts some onus on the consumer to manage versioning, just as the modelling centres *should*. We have tried to help with this by providing you with drslib.
>> 
>> (Can anyone at DKRZ remind me what the url is for the tracking-id service?)
>> 
>> Cheers
>> Bryan
>> 
>> 
>>> On Dec 15, 2011, at 2:14 PM, Bryan Lawrence wrote:
>>> 
>>>> Hi Jennifer
>>>> 
>>>> With due respect, it's completely unrealistic to expect modelling groups not to want to have multiple versions of some datasets ... that's just not how the world (and in particular, modelling workflow) works. It has never been thus. There simply isn't time to look at everything before it is released ... if you have a problem with that, blame the government folk who set the IPCC timetables :-) (Maybe your comment was somewhat tongue in cheek, but I feel obliged to make this statement anyway :-).
>>> Fair enough. I was being cheeky; that is why I put the :-). The users suffer the IPCC time constraints too: we have to deliver analyses of data that take an impossibly long time to grab.
>>> 
>>>> Also, with due respect, please don't "replace files with newer versions" ... we absolutely need folks to understand the idea of processing with one particular version of the data, and understanding the provenance of that, so that they understand that, if the data has changed, they may need to re-run the processing.
>>> If the version is so important and needs to be preserved, then it should have been included in the data file name. It's obviously too late to make that change now. As I mentioned before, the version number is a valuable piece of metadata that is lost in the wget download process. The problem of how to keep track of version numbers and update my copy when necessary remains.
>>> 
>>> I'll take this opportunity to point out that the realm and frequency are also missing from the file name. I can't remember where I read this, but the MIP_table value is not always adequate for uniquely determining what the realm and frequency are.
>>> 
>>>> I'm sure this doesn't apply to you, but for too long our community has had a pretty cavalier attitude to data provenance! CMIP3 and AR4 were a "dog's breakfast" in this regard …
>>> Looks like CMIP5 hasn't improved the situation.
>>> 
>>>> (And I too am very grateful that you are laying out your requirements in some detail :-)
>>> I'm glad to hear that.
>>> --Jennifer
>>> 
>>> 
>>>> Cheers
>>>> Bryan
>>>> 
>>>> 
>>>>> On Dec 15, 2011, at 11:22 AM, Estanislao Gonzalez wrote:
>>>>> 
>>>>>> Hi Jennifer,
>>>>>> 
>>>>>> I'll check this more carefully and see what can be done with what we have (or with minimal changes), though multiple versions is something CMIP3 never worried about: files just got changed or deleted. CMIP5 adds a two-figure factor to that, since there are many more institutions and much more data ... but it might be possible.
>>>>> At the moment, I have no good ideas for how to solve the problem of replacing files in my local CMIP5 collection with newer versions if they are available. My strategy at this point is to get the version that is available now and not look for it again. If any data providers are listening, here is my plea:
>>>>> ==>  Please don't submit new versions of your CMIP5 data. Get it right the first time!<==
>>>>> :-)
>>>>> 
>>>>>> In any case I wanted just to thank you very much for the detailed description, it is very useful.
>>>>> I'm glad you (and Steve Hankin) find my long emails helpful.
>>>>> --Jennifer
>>>>> 
>>>>>> Regards,
>>>>>> Estani
>>>>>> 
>>>>>> Am 15.12.2011 14:52, schrieb Jennifer Adams:
>>>>>>> Hi, Estanislao --
>>>>>>> Please see my comments inline.
>>>>>>> 
>>>>>>> On Dec 15, 2011, at 5:47 AM, Estanislao Gonzalez wrote:
>>>>>>> 
>>>>>>>> Hi Jennifer,
>>>>>>>> 
>>>>>>>> I'm still not sure how Luca's change in the API is going to help you, Jennifer. But perhaps it would help me to fully understand your requirement, as well as your use of wget with the FTP protocol.
>>>>>>>> 
>>>>>>>> I presume what you want is to crawl the archive and get files from a specific directory structure?
>>>>>>>> Maybe it would be better if you just describe briefly the procedure you've been using for getting the CMIP3 data so we can see what could be done for CMIP5.
>>>>>>>> 
>>>>>>>> How did you find out which data was interesting?
>>>>>>> COLA scientists ask for a specific scenario/realm/frequency/variable they need for their research. Our CMIP3 collection is a shared resource of about 4 TB of data. For CMIP5, we are working with an estimate of 4-5 times that data volume to meet our needs. It's hard to say at this point whether that will be enough.
>>>>>>> 
>>>>>>>> How did you find out which files were required to be downloaded?
>>>>>>> For CMIP3, we often referred to http://www-pcmdi.llnl.gov/ipcc/data_status_tables.htm to see what was available.
>>>>>>> 
>>>>>>> The new version of this chart for CMIP5, http://cmip-pcmdi.llnl.gov/cmip5/esg_tables/transpose_esg_static_table.html, is also useful. An improvement I'd like to see on this page: the numbers inside the blue boxes that show how many runs there are for a particular experiment/model should each be a link to a list of those runs containing all the necessary components of the Data Reference Syntax, so that I can go directly to the URL for that data set. For example,
>>>>>>> the BCC-CSM1.1 model shows 45 runs for the decadal1960 experiment. I would like to click on that 45 and get a list of the 45 URLs for those runs, like this:
>>>>>>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r1i1p1.html
>>>>>>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r2i1p1.html
>>>>>>> ...
>>>>>>> 
>>>>>>> 
>>>>>>>> How did you tell wget to download those files?
>>>>>>> For example: wget -nH --retr-symlinks -r -A nc ftp://username@ftp-esg.ucllnl.org/picntrl/atm/mo/tas -o log.tas
>>>>>>> This would populate a local directory ./picntrl/atm/mo/tas with all the models and ensemble members in the proper subdirectories. If I wanted to update with newer versions, or with models that had been added, I simply ran the same one-line wget command again. This is what I refer to as 'elegant.'
>>>>>>> 
>>>>>>> 
>>>>>>>> We might have already some way of achieving what you want, if we knew exactly what that is.
>>>>>>> Wouldn't that be wonderful? I am hopeful that the P2P will simplify the elaborate and flawed workflow I have cobbled together to navigate the current system.
>>>>>>> I have a list of desired experiment/realm/frequency/MIP_table/variables for which I need to grab all available models/ensembles. Is that not enough to describe my needs?
>>>>>>> 
>>>>>>>> I guess my proposal of issuing:
>>>>>>>> bash <(wget "http://p2pnode/wget?experiment=decadal1960&realm=atmos&time_frequency=month&variable=clt" -qO - | grep -v HadCM3)
>>>>>>> Yes, this would likely achieve the same result as the '&model=!name' that Luca implemented. However, I believe the documentation says that there is a limit of 1000 on the number of wgets that the p2pnode will put into a single search request, so I don't want to populate my precious 1000 results with wgets that I'm going to grep out afterwards.
>>>>>>> 
>>>>>>> --Jennifer
>>>>>>> 
>>>>>>> 
>>>>>>>> was not acceptable to you. But I still don't know exactly why.
>>>>>>>> It would really help to know what you meant by "elegant use of wget".
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Estani
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Am 14.12.2011 18:44, schrieb Cinquini, Luca (3880):
>>>>>>>>> So Jennifer, would having the capability of doing negative searches (model=!CCSM) and generating the corresponding wget scripts help you?
>>>>>>>>> thanks, Luca
>>>>>>>>> 
>>>>>>>>> On Dec 14, 2011, at 10:38 AM, Jennifer Adams wrote:
>>>>>>>>> 
>>>>>>>>>> Well, after working from the client side to get CMIP3 and CMIP5 data, I can say that wget is a fine tool to rely on at the core of the workflow. Unfortunately, the step up in complexity from CMIP3 to CMIP5 and the switch from FTP to HTTP trashed the elegant use of wget. No amount of customized wrapper software, browser interfaces, or pre-packaged tools like DML fixes that problem.
>>>>>>>>>> 
>>>>>>>>>> At the moment, the burden on the user is embarrassingly high. It's so easy to suggest that the user should "filter to remove what is not required" from a downloaded script, but the actual practice of doing that in a timely, automated, and distributed way is NOT simple! And if the solution to my problem of filling in the gaps in my incomplete collection is to go back to clicking in my browser and do the whole thing over again, but make my filters smarter by looking for what's already been acquired or what has a new version number … this is unacceptable. The filtering must be a server-side responsibility, and the interface must be accessible by automated scripts. Make it so!
>>>>>>>>>> 
>>>>>>>>>> By the way, the version number is a piece of metadata that is not in the downloaded files or the gateway's search criteria. It appears in the wget script as part of the path in the file's http location, but the path is not preserved after the wget is complete, so it is effectively lost after the download is done. I guess the file's date stamp would be the only way to know if the version number of the data file in question has been changed, but I'm not going to write that check into my filtering scripts.
>>>>>>>>>> 
>>>>>>>>>> --Jennifer
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Jennifer M. Adams
>>>>>>>>>> IGES/COLA
>>>>>>>>>> 4041 Powder Mill Road, Suite 302
>>>>>>>>>> Calverton, MD 20705
>>>>>>>>>> jma at cola.iges.org
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> --
>>>>>>> Jennifer M. Adams
>>>>>>> IGES/COLA
>>>>>>> 4041 Powder Mill Road, Suite 302
>>>>>>> Calverton, MD 20705
>>>>>>> jma at cola.iges.org
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> --
>>>>> Jennifer M. Adams
>>>>> IGES/COLA
>>>>> 4041 Powder Mill Road, Suite 302
>>>>> Calverton, MD 20705
>>>>> jma at cola.iges.org
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> --
>>>> Bryan Lawrence
>>>> University of Reading:  Professor of Weather and Climate Computing.
>>>> National Centre for Atmospheric Science: Director of Models and Data.
>>>> STFC: Director of the Centre for Environmental Data Archival.
>>>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
>>> --
>>> Jennifer M. Adams
>>> IGES/COLA
>>> 4041 Powder Mill Road, Suite 302
>>> Calverton, MD 20705
>>> jma at cola.iges.org
>>> 
>>> 
>>> 
>>> 
>> --
>> Bryan Lawrence
>> University of Reading:  Professor of Weather and Climate Computing.
>> National Centre for Atmospheric Science: Director of Models and Data.
>> STFC: Director of the Centre for Environmental Data Archival.
>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
>> 
> 
> 
> -- 
> Sébastien Denvil
> IPSL, Pôle de modélisation du climat
> UPMC, Case 101, 4 place Jussieu,
> 75252 Paris Cedex 5
> 
> Tour 45-55 2ème étage Bureau 209
> Tel: 33 1 44 27 21 10
> Fax: 33 1 44 27 39 02
> 
> 

--
Jennifer M. Adams
IGES/COLA
4041 Powder Mill Road, Suite 302
Calverton, MD 20705
jma at cola.iges.org


