[Go-essp-tech] cmip5

Thu Sep 15 16:53:18 MDT 2011

Hi Sebastien,

Thank you.  Your input is very useful.

More in-line below.

Best regards,

-Eric

Sébastien Denvil wrote:
>  Hi Eric,
>
> see comments and feedback below.
>
> The purpose of this email is to give an end-user perspective and to 
> recall the "multi-model intercomparison" essence of CMIP5.
>
> I just hope it can be taken into account in the gateway roadmap.
We're focused on identifying the top priorities for improving the 
gateway search and download capabilities for CMIP5.  Your input is truly 
appreciated.

The gateway 2.0 addresses a number of these search shortcomings.  It is 
also a more flexible platform for addressing current and future search 
needs.  We plan to release this version as soon as possible.  Feedback 
is an important part of the process.

>
> On 13/09/2011 22:40, Eric Nienhouse wrote:
>> Dear All,
>>
>> Thanks everyone for your thorough and thoughtful discussion on this
>> issue.  This support request details a number of usability issues
>> related to cross-cutting data discovery and access.  Much can be done
>> now to improve user experience as we all collectively work towards a
>> more robust system which addresses these community needs.
>>
>> Please consider the following:
>>
>> * Install/upgrade to Gateway version 1.3.2.  (This version addresses
>> several frequent end user support issues.)
>> * Enable certificate authorization (token-less) at all Gateways and Data
>> Nodes.
>> * Try out the Gateway 2.0 Beta and provide feedback:
>> (http://search-esg.prototype.ucar.edu)
>
> CMIP5 is a multi-model approach; what end users are looking for is 
> a way to access given quantities (air temperature and liquid 
> precipitation, for example) from a given experiment for all available 
> models. Reaching that point with as few clicks as possible is key to 
> success. Taking that as the primary requirement, the final step of 
> the end-user scenario is:
> - the end user gets a wget script that points to files across datasets.
> - the end user starts a globus-online download that points to files 
> across datasets.
> - and so on.
Enabling a single download script (or globus-online process) based on 
many datasets derived from user criteria is necessary in this context.  
As you and Balaji note, a per-dataset download model is infeasible for many.
>
> Maybe the dataset-centric view doesn't ease that approach, but we 
> think this should be the way to go. I can't log in with my OpenID, so 
> it could be that the download options have been changed in v2, but I 
> can't test it.

The download page allows you to create a single download script based on 
the selected set of datasets returned from search results.  Users can 
choose the number of results per page enabling selection of many 
datasets.  (You can also show > 100 results by adding the page size to 
the URL).  We're hoping this approach allows users to progress quickly 
and effectively to a single download script or method.  Please note 
that there is a "select all" checkbox in the results page, although this 
may not be obvious.

Sorry you can't test the download UI in the search preview.  It is not 
accepting federation OpenIDs yet.  We'll look into options to enable 
this until we have a test federation setup.
>
> See below an example.
>
>> * Install the Gateway 2.0 Beta for federation testing and review.
>>
>> It will also be of great help to our end users to work towards
>> consistent DRS based file URLs, improved user documentation and
>> providing services to support bulk download.  Much of this is already
>> under way.
>>
>> Improving the help documentation for certificate WGet access and adding
>> Globus Online as a download option is work in progress for the next
>> Gateway release 1.3.3.  This should be available very soon.
>
> What is the scenario for globus online? One "download job" per dataset 
> or one "download job" per search criteria?
The scenario for Globus Online is one "download job" per search 
criteria.  So a download job may represent many datasets.
>
> Example:
>
> - I'm interested in the short-term simulations (18 experiments), all 
> ensemble members (let's say 6 members on average across models for each 
> experiment)
>
> experiments="decadal1960 decadal1965 decadal1970 decadal1975 
> decadal1980 decadal1985 decadal1990 decadal1995 decadal2000 
> decadal2001 decadal2002 decadal2003 decadal2004 decadal2005 
> decadal2006 decadal2007 decadal2008 decadal2009"
>
> - I like those variables (6 variables from the same table!)
>
> 2D_variables[atmos][mon]="tas ts pctisccp"
> 3D_variables[atmos][mon]="ta hur clcalipso parasolRefl"
>
> - I want to analyse all the models distributing what's above (let's say 
> 14 models)
>
> ===> So I have an interest in 9072 datasets.
>
> One download per search criteria should trigger one download job using 
> 10 clicks.
> One download per dataset will trigger 9072 download jobs and something 
> like O(100000) clicks. This is not possible.
Agreed - a download per dataset is not reasonable.  We're working 
towards a model where a single "download job" can be created based on 
search criteria.  We may find that the current faceted navigation is not 
the best method for enabling this.  However, this is our starting point.  
It is our desire to support cross-cutting data requests without 
exhausting our users with clicks and backward steps.
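
To make the scale concrete, here is a quick check of the arithmetic in
your example.  It is only a sketch: it uses the counts exactly as stated
in your message, and the clicks-per-job figure for the per-dataset case
is my assumption (the same 10 clicks you quote for a single job).

```python
# Experiments listed in Sébastien's message.
experiments = (
    "decadal1960 decadal1965 decadal1970 decadal1975 "
    "decadal1980 decadal1985 decadal1990 decadal1995 decadal2000 "
    "decadal2001 decadal2002 decadal2003 decadal2004 decadal2005 "
    "decadal2006 decadal2007 decadal2008 decadal2009"
).split()

members_per_experiment = 6  # "6 members on average", as stated
variables = 6               # variable count as stated in the message
models = 14                 # "let's say 14 models"

# One dataset per (experiment, member, variable, model) combination.
datasets = len(experiments) * members_per_experiment * variables * models

# Assumption: a per-dataset job costs about as many clicks as the
# "10 clicks" quoted for a single criteria-based job.
clicks_per_job = 10
clicks_per_dataset_model = datasets * clicks_per_job

print(len(experiments), datasets, clicks_per_dataset_model)
```

With these inputs the per-dataset model lands at roughly the order of
magnitude you describe, against about 10 clicks for one job per search.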

You've identified two key user scenarios.  We're hearing similar 
requests from end users via support emails as well.

1)  I want to download all files from experiment E1, with 
variables/fields V1,V2,V3 from all models.
2)  I want to download all files for variables (V1,V2,V3...) from 
experiments (E1,E2,E3...)

The current, pre-2.0 gateway makes these tasks difficult and 
time-consuming at best.  The 2.0 version is a solid step in the right direction.
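
As a rough illustration of what "one download job per search criteria"
means for the two scenarios above, the sketch below filters a small
in-memory list of file records by experiment and variable and returns
one combined list of URLs across all models.  The record layout, model
names, and URLs are hypothetical; this is not the gateway's data model
or API.

```python
from typing import NamedTuple


class FileRecord(NamedTuple):
    model: str
    experiment: str
    variable: str
    url: str


def select_files(records, experiments, variables):
    """Return URLs of all records matching the criteria, across all models."""
    wanted_experiments = set(experiments)
    wanted_variables = set(variables)
    return [
        r.url
        for r in records
        if r.experiment in wanted_experiments and r.variable in wanted_variables
    ]


# Hypothetical records standing in for search results.
records = [
    FileRecord("modelA", "decadal1960", "tas", "http://node1.example.org/a_tas.nc"),
    FileRecord("modelA", "decadal1960", "hur", "http://node1.example.org/a_hur.nc"),
    FileRecord("modelB", "decadal1960", "tas", "http://node2.example.org/b_tas.nc"),
    FileRecord("modelB", "decadal1965", "tas", "http://node2.example.org/b_tas65.nc"),
]

# Scenario 1: one experiment, chosen variables, all models.
job_urls = select_files(records, ["decadal1960"], ["tas", "ts"])
print(job_urls)
```

The point of the sketch is that the selection is driven by criteria
rather than by dataset, so a single list (and a single job) can span
many datasets and data nodes.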

Thanks again, Sebastien, for sending this out.
>
> Best regards.
> Sébastien
>
>> Thank you for your time and help.
>>
>> Kind regards,
>>
>> -Eric
>>
>>
>> Williams, Dean N. wrote:
>>> Hi Jonathan,
>>>
>>>      I know this can be a pain now, but we are working to improve the
>>> situation.
>>> I was also informed that others on the ESGF team are working to help the
>>> data movement/download situation. This help comes in the form of Globus
>>> Online (GO) and Data Mover-Lite (DML). For example, DML also supports
>>> the list of features listed by IPSL:
>>>          * support for myproxy-logon.
>>>          * simple data selection with model, experiment, realm,
>>>            variable, etc. in a simple tree search.
>>>          * multi-threaded downloads.
>>>            NOTE: dml-webstart only supports downloading small files,
>>>            but the standalone version supports downloading big files
>>>            with multi-threaded support.
>>>          * incremental process (i.e., downloading only non-existing
>>>            files)
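The "incremental process" item can be pictured with a small sketch.
This is not how DML is implemented; it only shows the idea of skipping
files that already exist locally.  The URLs are hypothetical, and the
check is by file name only (no size or checksum verification).

```python
import os
import tempfile


def files_still_needed(urls, target_dir):
    """Return the URLs whose file name is not yet present in target_dir."""
    needed = []
    for url in urls:
        filename = url.rsplit("/", 1)[-1]
        if not os.path.exists(os.path.join(target_dir, filename)):
            needed.append(url)
    return needed


# Hypothetical download URLs.
urls = [
    "http://datanode.example.org/data/tas_example_1.nc",
    "http://datanode.example.org/data/tas_example_2.nc",
]

with tempfile.TemporaryDirectory() as target_dir:
    # Pretend the first file was fetched by an earlier run.
    open(os.path.join(target_dir, "tas_example_1.nc"), "w").close()
    remaining = files_still_needed(urls, target_dir)

print(remaining)
```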
>>>
>>>
>>>      The features DML is working to incorporate are:
>>>          * manage dataset versions following the new DRS
>>>          * download history stored in a database
>>>
>>>
>>>      We will keep you and the community abreast of the new features as
>>> they become available.
>>>
>>> Best regards,
>>>      Dean
>>>
>>> On 9/12/11 10:12 AM, "Jonathan Gregory"<j.m.gregory at reading.ac.uk>
>>> wrote:
>>>
>>>
>>>> Dear Stephen, Dean, Sebastien, Stephane     cc Jamie, Martin
>>>>
>>>> Thank you very much for your emails. I am grateful to Stephen and
>>>> Dean for responding positively and constructively to my email,
>>>> despite its being a list of complaints. I'll certainly try out
>>>> Sebastien's program, and Stephane's too if you are willing to make
>>>> it available; it's very helpful that you have written these.
>>>>
>>>> Best wishes
>>>>
>>>> Jonathan
>>>>
>>>
>>> Dear Jonathan,
>>>
>>> Thanks for taking the time to describe your concerns about the 
>>> usability of the CMIP5 archive system.  I am CC'ing this to 
>>> go-essp-tech at ucar.edu as I think your feedback is particularly 
>>> welcome and insightful and deserves to be seen and discussed widely.
>>>
>>> We are aware of many of the shortcomings you identify; improvements 
>>> in software and documentation are in progress that I hope will 
>>> improve your experience.  However, our progress has been slower than 
>>> we'd hoped and we are now up against significant CMIP5 usage, which 
>>> will inevitably impede rolling out improvements.  We had hoped to 
>>> have the system more usable by now, but we are pushing hard 
>>> to improve the system as quickly as possible.
>>>
>>> You identify several user interface and performance issues with the 
>>> ESG Gateway search system.  Our colleagues at NCAR have been 
>>> developing a new version of the Gateway with an improved search 
>>> backend that I believe solves many of your concerns.  I've seen a 
>>> test deployment at NCAR and it is a significant improvement.  We at 
>>> BADC will be deploying it for testing in the next couple of days in 
>>> the hope that it can be rolled out quickly for end users.
>>>
>>> Another point in your feedback is scriptability of downloads and 
>>> checking what is available.  We had hoped that the wget script 
>>> generation feature of the gateway would produce wget scripts that 
>>> could be edited to download different sorts of data by leveraging 
>>> the Data Reference Syntax [1].  Unfortunately, although some 
>>> download URLs contain DRS information that would help in deducing 
>>> alternative downloads, this isn't practical at present.  We are 
>>> working to improve the DRS consistency of the archive, which we hope 
>>> will improve download scriptability.
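For readers wondering what "deducing alternative downloads" from DRS
information could look like, here is a minimal sketch.  It assumes a
file name that follows the CMIP5 DRS layout
<variable>_<MIP table>_<model>_<experiment>_<ensemble member>[_<temporal subset>].nc;
as Stephen notes, real download URLs are not yet consistent enough to
rely on this.  The example file name is made up for illustration.

```python
from typing import NamedTuple, Optional


class DrsFilename(NamedTuple):
    variable: str
    mip_table: str
    model: str
    experiment: str
    ensemble_member: str
    temporal_subset: Optional[str]


def parse_drs_filename(filename):
    """Split a DRS-style file name into its underscore-separated fields."""
    if not filename.endswith(".nc"):
        raise ValueError("expected a .nc file name: %r" % filename)
    parts = filename[: -len(".nc")].split("_")
    if len(parts) not in (5, 6):
        raise ValueError("not a DRS-style file name: %r" % filename)
    temporal_subset = parts[5] if len(parts) == 6 else None
    return DrsFilename(*parts[:5], temporal_subset)


def build_drs_filename(fields):
    """Rebuild the file name from its fields (inverse of parse_drs_filename)."""
    parts = list(fields[:5])
    if fields.temporal_subset:
        parts.append(fields.temporal_subset)
    return "_".join(parts) + ".nc"


# Illustrative file name, not taken from the archive.
original = parse_drs_filename(
    "tas_Amon_IPSL-CM5A-LR_decadal1960_r1i1p1_196101-199012.nc"
)
# Swap the variable to guess the name of a sibling file.
alternative = build_drs_filename(original._replace(variable="ts"))
print(original.model, original.experiment, alternative)
```

Whether the guessed sibling file actually exists, and under which URL
prefix, still has to be checked against the archive.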
>>>
>>> The other mechanism you could use to programmatically download data 
>>> and discover new data is reading the THREDDS catalogs.  Every centre 
>>> serving CMIP5 data is running a THREDDS Data Server [2] which lists 
>>> all download URLs in a network of THREDDS XML catalogs.  This is 
>>> intended as an internal interface so isn't well documented.  
>>> However, I think it is no secret that some users are doing this 
>>> already.  You can find the THREDDS source catalog of every dataset 
>>> in the "History" tab of the Gateway's dataset page, or it can be 
>>> deduced from download URLs and a little knowledge of TDS.
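A minimal sketch of reading such a catalog follows, using only the
Python standard library.  It assumes the THREDDS InvCatalog 1.0 XML
layout (a service element with a base path, and dataset elements
carrying a urlPath attribute).  The catalog below is hand-written and
the host name is hypothetical; real ESG catalogs are larger and may be
organised differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hypothetical, hand-written catalog for illustration only.
SAMPLE_CATALOG = """<catalog
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <service name="HTTPServer" serviceType="HTTPServer"
           base="/thredds/fileServer/"/>
  <dataset name="example dataset" ID="example.dataset.v1">
    <dataset name="tas_example_1.nc"
             urlPath="esg_dataroot/example/tas_example_1.nc"/>
    <dataset name="tas_example_2.nc"
             urlPath="esg_dataroot/example/tas_example_2.nc"/>
  </dataset>
</catalog>
"""


def list_download_urls(catalog_xml, host):
    """Return HTTP download URLs for every dataset that has a urlPath."""
    root = ET.fromstring(catalog_xml)
    service = root.find(".//t:service[@serviceType='HTTPServer']", NS)
    if service is None:
        raise ValueError("catalog has no HTTPServer service")
    base = service.get("base")
    urls = []
    for dataset in root.iter("{%s}dataset" % NS["t"]):
        url_path = dataset.get("urlPath")
        if url_path:  # container datasets have no urlPath
            urls.append(host.rstrip("/") + base + url_path)
    return urls


download_urls = list_download_urls(SAMPLE_CATALOG, "http://datanode.example.org")
for url in download_urls:
    print(url)
```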
>>>
>>> I should add that downloading data directly from a TDS will only 
>>> work if it is configured to use "tokenless" security.  This is the 
>>> case with only some datanodes at present but should be fixed in the 
>>> near term.
>>>
>>> In the medium-term ESGF are planning documented service APIs that 
>>> would allow users to query the system programmatically and there is 
>>> a new P2P architecture in the works with more focus on scalability [3]
>>>
>>> Regards,
>>> Stephen Pascoe.
>>>
>>> [1] CMIP5 Data Reference Syntax: 
>>> http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf
>>> [2] THREDDS Data Server: http://www.unidata.ucar.edu/projects/THREDDS/
>>> [3] ESGF P2P Architecture: http://esgf.org/wiki/ESGF_Index
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> Centre of Environmental Data Archival
>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 
>>> 0QX, UK
>>>