[Go-essp-tech] search problems with ESG

Don Middleton don at ucar.edu
Fri May 27 23:44:47 MDT 2011


Hi Karl - Thanks for the thorough description of problems, as always. We have several people working on the critical search & related tickets, and will make sure all of these are properly in the system as well. There's been work on this during the week, with traffic on the gateways list - BADC, DKRZ, and NCAR. Monday's a federal holiday for the US folks, so we'll reengage on Tuesday. Have a good holiday weekend, Karl - and all!

cheers - don


On May 27, 2011, at 6:13 PM, Karl Taylor wrote:

> Dear all,
> 
> As you know, we are getting an increasing number of complaints that the ESG search is returning misleading results.  I'm providing one more simple example that illustrates the problem (at least one problem that seems quite robust):
> 
> Go to the BADC, PCMDI, or DKRZ websites and select from the "Search Categories" model = inmcm4 (which is being served from the PCMDI data node).   The search returns 204 datasets, which is the correct number, but if you now look under the "Experiment" search category, you see only 4 experiments listed, whereas the 204 datasets are actually from 12 experiments; 8 experiments fail to appear.  [Note that if you search at the JPL portal, only 4 experiments are missing, and if you search at NCAR, you'll see all the experiments correctly.]
> 
> Most users coming to BADC, PCMDI, or DKRZ would think that output is only available from 4 experiments from the inmcm4 model.  So output that modeling groups have so diligently made available would be missed by the folks who want to analyze it.  
> 
> I can see no other problem at this time that should be given higher priority.  We must make visible to our users all data that is actually in the archive as soon as possible.  We should devote every available resource to fixing this problem. 
> 
> Another problem with the current search capability (of slightly lower priority) is that finding datasets the user is interested in (and only those datasets) is currently difficult because of the way the variables are identified in the search engine and because there is a rather silly mistake in what the search engine is doing.   Considering first the "Search Categories" method (as opposed to the method where you enter "free text"), the search for variables is based on the list of standard names (displayed with the underscores removed).  There are two limitations of listing only the standard_name:
> 
> 1. some variables in the database may not have a standard name attribute, so they won't be listed under the "variable" search category.  
> 
> 2. the same standard_name can apply to multiple variables.  For example, area_fraction is a standard name that can apply to many different variables (e.g., "land cover fraction", "grass fraction", "crop fraction").  Thus, if a user searches on "area fraction", but is only interested in, for example, "grass fraction", they will find lots of extraneous datasets.
> 
> I think the user should be able to designate whether the list of variables under the "variable" search category is displayed as standard_names, long_names, or the names of the variables themselves (i.e., the netCDF variable names).
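
A minimal sketch of the point above, in Python, assuming hypothetical in-memory records rather than the actual ESG catalog. Grouping the same variables by standard_name, by long_name, or by netCDF variable name shows why listing standard_name alone merges distinct variables and hides the ones that have no standard name. The record contents and the facet_counts helper are illustrative assumptions, not ESG code.

```python
from collections import Counter

# Hypothetical variable records; the names and values are illustrative only.
RECORDS = [
    {"name": "grassFrac", "long_name": "grass fraction", "standard_name": "area_fraction"},
    {"name": "cropFrac", "long_name": "crop fraction", "standard_name": "area_fraction"},
    {"name": "landCoverFrac", "long_name": "land cover fraction", "standard_name": "area_fraction"},
    {"name": "myDiag", "long_name": "some diagnostic", "standard_name": None},
]


def facet_counts(records, key):
    """Count records per value of the chosen attribute.

    Records that lack the attribute are grouped under "(none)" so they
    stay visible in the listing instead of silently disappearing.
    """
    counts = Counter()
    for rec in records:
        value = rec.get(key)
        counts[value if value else "(none)"] += 1
    return dict(counts)


by_standard_name = facet_counts(RECORDS, "standard_name")
by_long_name = facet_counts(RECORDS, "long_name")
by_variable_name = facet_counts(RECORDS, "name")

print(by_standard_name)
print(by_long_name)
print(by_variable_name)
```

With these records, the standard_name listing collapses three different variables into a single "area_fraction" entry, while the long_name and variable-name listings keep all four apart.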
> 
> If this is not possible (or is too difficult), then the other method of searching (i.e., by entering a string after "Search: Datasets for: ...." at the top of the esg search page) needs to be greatly improved.  (It really should be improved in any case.)  When searching for a variable it is difficult to limit the results to the actual variable you are looking for.  Here are some problems:
> 
> 1.  If you enter a word like "sometimes" or "the" or "disciplines", the search returns lots of datasets.  This is because the search looks not only through the standard names, the realm names, the model names, etc., but also through all the text in the explanatory description associated with each standard_name it finds.  I think whoever coded this made a silly mistake by searching the text that explains the meaning of the standard_name, when they meant to look through the "long_names" instead.  [The search engine doesn't look at the long names at all!]   You'll note that if you click on any dataset returned by a search, you'll get a page that provides 4 tabs of information ("summary", "geophysical properties", "variables", and "administration").  Click on "variables" and you will find information about each variable.  Among the information is "Description", which contains the long_name, followed by "Units" and "Standard_Name".  Just below the standard_name is *another* "Description", which explains what the standard_name means.  I think it is this description that is being looked at instead of the "Description" that contains the long_name.  This should be fixed immediately!
> 
> 2.  When you enter a string such as "surface air temperature", the search returns all variables that contain any one of these 3 words.  There is no way a user can find surface air temperature and not a bunch of other variables with this search capability because it returns the *union* of individual searches on "surface", "air", and "temperature".   If the user looks for a standard name (e.g., air_temperature), an error is returned because an underscore is not allowed.    Another problem is that if you search by entering the text "temperature",  variables like "precipitation" still appear  under the Search Category "Variables" although they should have been eliminated.
> 
> 3.  If you enter the name of a variable (as it appears in the netCDF file), for example "tas" (surface air temperature), no results are returned.  The search should include the variable names in the text it scans.
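
The behaviour described in items 2 and 3 above can be sketched in Python, assuming a hypothetical in-memory list of variable records rather than the real ESG index. The search function below is an illustrative assumption, not the ESG implementation: it matches query words against the variable name, long_name, and standard_name, treats underscores as word separators, and lets the caller choose between the union ("any") and the intersection ("all") of the per-word matches.

```python
import re

# Hypothetical records, used for illustration only.
RECORDS = [
    {"name": "tas", "long_name": "Near-Surface Air Temperature", "standard_name": "air_temperature"},
    {"name": "pr", "long_name": "Precipitation", "standard_name": "precipitation_flux"},
    {"name": "ts", "long_name": "Surface Temperature", "standard_name": "surface_temperature"},
    {"name": "psl", "long_name": "Sea Level Pressure", "standard_name": "air_pressure_at_sea_level"},
]

SEARCHED_FIELDS = ("name", "long_name", "standard_name")


def tokenize(text):
    """Lower-case the text and split on anything that is not a letter or digit,
    so "air_temperature" and "Near-Surface" each yield two words."""
    return set(re.findall(r"[a-z0-9]+", (text or "").lower()))


def search(records, query, mode="all"):
    """Return the names of records matching the query.

    mode="any" returns the union of the per-word matches (the behaviour
    complained about above); mode="all" requires every query word to match.
    """
    terms = tokenize(query)
    if not terms:
        return []
    hits = []
    for rec in records:
        words = set()
        for field in SEARCHED_FIELDS:
            words |= tokenize(rec.get(field))
        matched = terms & words
        if (mode == "all" and matched == terms) or (mode == "any" and matched):
            hits.append(rec["name"])
    return hits


union_hits = search(RECORDS, "surface air temperature", mode="any")
strict_hits = search(RECORDS, "surface air temperature", mode="all")
by_varname = search(RECORDS, "tas")
by_stdname = search(RECORDS, "air_temperature")

print(union_hits, strict_hits, by_varname, by_stdname)
```

In this sketch the "any" mode also returns surface temperature and sea level pressure for the query "surface air temperature", whereas the "all" mode returns only tas; searching on "tas" or on "air_temperature" finds the variable as well.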
> 
> I am quite embarrassed that hundreds of users are seeing this terribly dysfunctional search capability.  It makes us look bad.  Whoever is responsible for this needs to fix it immediately.
> 
>  If there are things I can do to help, please let me know.
> 
> thanks,
> Karl
> 
> P.S.  Here's a list of additional items which I think would improve the ESG user interface:
> Highest priority:
> 
> 1.     A symbol next to each dataset in the list should indicate which datasets the user can download with current permissions.
> 
> 2.     Whenever a user is asked to click on (select) items on a page, he should always be provided with an option to “select all” or “select all for which I currently have permission to download” (as well as retaining the option to select individual items).  This holds both for the “dataset” selection pages and the file selection pages.
> 
> 3.     The interface pages for subscribing to the “commercial” group or the “research” group for CMIP5 need to be modified to prevent users from joining the wrong group.  We need to make the whole of the terms of use visible to the user without scrolling, if possible.  We need to add text to guide the user to the correct group.  The terms of use should be easily downloadable as a PDF or something.
> 
> 4.     When subscribing to one of the CMIP5 groups, if the user forgets to click on “I accept”, the statement of work should not have to be re-entered by the user (the user will likely enter a much shorter description the second time through, and we don’t want that).  Also, we should explain what kind of information we are seeking in the “statement of work”.
> 
> 5.     I do not yet have any experience with the new method of authentication (i.e., tokenless).  But it sounds like when users run the wget script, they will be required to log in (after presumably logging in to download the wget script itself).  Logging in twice would be a nuisance.
> 
> Lower priority:
> 
> 6.     Eliminate the prompting to join a group every time you attempt to download data that you are not currently authorized to download.  Replace this with something like the following statements:  “You are not authorized to download all of the selected datasets.  You may be able to gain permission to access these datasets by joining the following groups:  group1, group2, … groupN.  To subscribe to a group, click on the “My account” tab at the top of the page.   Click HERE to reach those datasets you are already authorized to download”  [A different wording will be required if the user is not authorized to download any of the datasets.]
> 
> 7.     When datasets are listed, the user should be able to set priorities on which center(s) he prefers to get them from.  Then in the case when a dataset has been replicated, if the user “selects all”, ESG will know which center to get the data from (when there is more than one center with the data).
> 
> 8.     Add “more information” or “help” buttons in several places to assist the users when they become confused.
> 
> 
