[Go-essp-tech] search problems with ESG

Karl Taylor taylor13 at llnl.gov
Fri May 27 18:13:25 MDT 2011


Dear all,

As you know, we are getting an increasing number of complaints that the 
ESG search is returning misleading results.  I'm providing one more 
simple example that illustrates the problem (at least one problem that 
seems quite robust):

Go to the BADC, PCMDI, or DKRZ websites and select from the "Search 
Categories" model = inmcm4 (which is being served from the PCMDI data 
node).   The search returns 204 datasets, which is the correct number, 
but if you now look under the "Experiment" search category, you see only 
4 experiments listed, whereas the 204 datasets are actually from 12 
experiments; 8 experiments fail to appear.  [Note that if you search at 
the JPL portal, only 4 experiments are missing, and if you search at 
NCAR, you'll see all the experiments correctly.]
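
To make the discrepancy concrete, here is a minimal sketch of the kind of consistency check that would catch it: compare the experiments actually attached to the returned datasets with the experiments the portal lists. The records, the experiment labels, and the function name `missing_facet_values` are hypothetical and purely illustrative; this is not the ESG code or its API.

```python
def missing_facet_values(datasets, facet, displayed_values):
    """Return facet values present in the datasets but absent from the displayed list."""
    actual = {d[facet] for d in datasets if facet in d}
    return sorted(actual - set(displayed_values))


# Made-up records standing in for search results (not real archive contents).
datasets = [
    {"model": "inmcm4", "experiment": "exp_a"},
    {"model": "inmcm4", "experiment": "exp_b"},
    {"model": "inmcm4", "experiment": "exp_c"},
]

# What a portal's "Experiment" category might show for the same search.
displayed = ["exp_a"]

print(missing_facet_values(datasets, "experiment", displayed))
```

A non-empty result means the category list is hiding data that the search itself has already found.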

Most users coming to BADC, PCMDI, or DKRZ would think that output is 
only available from 4 experiments from the inmcm4 model.  So output that 
modeling groups have so diligently made available would be missed by the 
folks who want to analyze it.

I can see no other problem at this time that should be given higher 
priority.  We must make visible to our users all data that is actually 
in the archive as soon as possible.  We should devote every available 
resource to fixing this problem.

Another problem with the current search capability (of slightly lower 
priority) is that finding datasets the user is interested in (and only 
those datasets) is currently difficult because of the way the variables 
are identified in the search engine and because there is a rather silly 
mistake in what the search engine is doing.   Considering first the 
"Search Categories" method (as opposed to the method where you enter 
"free text"), the search for variables is based on the list of standard 
names (displayed with the underscores removed).  There are two 
limitations of listing only the standard_name:

1. Some variables in the database may not have a standard_name 
attribute, so they won't be listed under the "variable" search category.

2. The same standard_name can apply to multiple variables.  For example, 
area_fraction is a standard name that can apply to many different 
variables (e.g., "land cover fraction", "grass fraction", "crop 
fraction").  Thus, if a user searches on "area fraction" but is only 
interested in, for example, "grass fraction", they will find lots of 
extraneous datasets.

I think the user should be able to designate whether the list of 
variables under the "variable" search category is displayed as 
standard_names, long_names, or the names of the variables themselves 
(i.e., the netCDF variable names).
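
A rough sketch of what a selectable listing could look like, assuming each variable is held as a simple record with optional `name`, `long_name`, and `standard_name` entries. The records, the variable names `var_a` to `var_c`, and the function `variable_facet` are invented for illustration and do not reflect how ESG stores its metadata.

```python
from collections import Counter


def variable_facet(variables, name_key):
    """Count variables by the chosen attribute; entries lacking it go under a placeholder."""
    return Counter(v.get(name_key) or "(none)" for v in variables)


# Hypothetical variable records (names and long_names are made up).
variables = [
    {"name": "var_a", "long_name": "grass fraction", "standard_name": "area_fraction"},
    {"name": "var_b", "long_name": "crop fraction", "standard_name": "area_fraction"},
    {"name": "var_c", "long_name": "some diagnostic"},  # no standard_name attribute
]

by_standard = variable_facet(variables, "standard_name")
by_long = variable_facet(variables, "long_name")

print(by_standard)  # the two fractions collapse into one entry; var_c has none
print(by_long)      # each variable is listed separately
```

Listing by standard_name merges the two fractions and leaves the third variable without an entry of its own, whereas listing by long_name (or by variable name) keeps all three distinct.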

If this is not possible (or is too difficult), then the other method of 
searching (i.e., by entering a string after "Search: Datasets for: ...." 
at the top of the esg search page) needs to be greatly improved.  (It 
really should be improved in any case.)  When searching for a variable 
it is difficult to limit the results to the actual variable you are 
looking for.  Here are some problems:

1.  If you enter a word like "sometimes" or "the" or "disciplines", the 
search returns lots of datasets.  This is because the search looks not 
only through the standard names, the realm names, the model names, etc., 
but also through all the text in the explanatory description associated 
with each standard_name it finds.  I think whoever coded this made a 
silly mistake by searching the text that explains the meaning of the 
standard_name, when they meant to look through the "long_names" 
instead.  [The search engine doesn't look at the long names at all!]   
You'll note that if you click on any dataset returned by a search, 
you'll get a page that provides 4 tabs of information ("summary", 
"geophysical properties", "variables", and "administration").  Click on 
"variables" and you will find information about each variable.  Among 
the information is "Description", which contains the long_name, followed 
by "Units" and "Standard_Name".  Just below the standard_name is 
*another* "Description", which explains what the standard_name means.  I 
think it is this description that is being looked at instead of the 
"Description" that contains the long_name.  This should be fixed 
immediately!

2.  When you enter a string such as "surface air temperature", the 
search returns all variables that contain any one of these 3 words.  
There is no way a user can find surface air temperature and not a bunch 
of other variables with this search capability because it returns the 
*union* of individual searches on "surface", "air", and "temperature".   
If the user looks for a standard name (e.g., air_temperature), an error 
is returned because an underscore is not allowed.    Another problem is 
that if you search by entering the text "temperature",  variables like 
"precipitation" still appear  under the Search Category "Variables" 
although they should have been eliminated.

3.  If you enter the name of a variable (as it appears in the netCDF 
file), for example "tas" (surface air temperature), no results are 
returned.  The search should include the variable names in the text it 
scans.
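
The behavior I am asking for in the three points above can be sketched as follows. This is only an illustration under stated assumptions: the records are simplified stand-ins, the function names `tokens` and `search` are hypothetical, and the choice to treat an underscore as a word separator is mine, not something the current search engine does. The text explaining each standard_name is deliberately left out of the searched fields.

```python
def tokens(text):
    """Lower-case a string and split it into words, treating '_' as a separator."""
    return text.lower().replace("_", " ").split()


def search(records, query, fields, require_all=True):
    """Return names of records whose chosen fields contain all (or any) query terms."""
    terms = tokens(query)
    if not terms:
        return []
    combine = all if require_all else any
    results = []
    for rec in records:
        words = set()
        for field in fields:
            words.update(tokens(rec.get(field, "")))
        if combine(term in words for term in terms):
            results.append(rec["name"])
    return results


# Searched fields: variable name, long_name, standard_name (assumed set).
FIELDS = ("name", "long_name", "standard_name")

# Simplified illustrative records, not actual archive metadata.
records = [
    {"name": "tas", "long_name": "surface air temperature",
     "standard_name": "air_temperature"},
    {"name": "ts", "long_name": "surface temperature",
     "standard_name": "surface_temperature"},
    {"name": "pr", "long_name": "precipitation",
     "standard_name": "precipitation_flux"},
]

print(search(records, "surface air temperature", FIELDS))                     # all terms
print(search(records, "surface air temperature", FIELDS, require_all=False))  # any term
print(search(records, "air_temperature", FIELDS))                             # underscore accepted
print(search(records, "tas", FIELDS))                                         # netCDF variable name
```

With all terms required, only the surface air temperature record comes back; with any term sufficient (the union behavior described in point 2), unrelated temperature variables come back as well.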

I am quite embarrassed that hundreds of users are seeing this terribly 
dysfunctional search capability.  It makes us look bad.  Whoever is 
responsible for this needs to fix it immediately.

If there are things I can do to help, please let me know.

thanks,
Karl

P.S.  Here's a list of additional items which I think would improve the 
ESG user interface:

Highest priority:

1. A symbol next to each dataset in the list should indicate which 
datasets the user can download with current permissions.

2. Whenever a user is asked to click on (select) items on a page, he 
should always be provided with an option to "select all" or "select all 
for which I currently have permission to download" (as well as retaining 
the option to select individual items).  This holds both for the 
"dataset" selection pages and the file selection pages.

3. The interface pages for subscribing to the "commercial" group or the 
"research" group for CMIP5 need to be modified to prevent users from 
joining the wrong group.  We need to make the whole of the terms of use 
visible to the user without scrolling, if possible.  We need to add text 
to guide the user to the correct group.  The terms of use should be 
easily downloadable as a pdf or something.

4. When subscribing to one of the CMIP5 groups, if the user forgets to 
click on "I accept", the statement of work should not have to be 
re-entered by the user (the user will likely enter a much shorter 
description the second time through, and we don't want that).  Also, we 
should explain what kind of information we are seeking in the "statement 
of work".

5. I do not yet have any experience with the new method of 
authentication (i.e., tokenless).  But it sounds like when users run the 
wget script, they will be required to log in (after presumably logging in 
to download the wget script itself).  Logging in twice would be a nuisance.

Lower priority:

6. Eliminate the prompting to join a group every time you attempt to 
download data that you are not currently authorized to download.  Replace 
this with something like the following statements:  "You are not 
authorized to download all of the selected datasets.  You may be able to 
gain permission to access these datasets by joining the following 
groups: group1, group2, ... groupN.  To subscribe to a group, click on the 
"My account" tab at the top of the page.  Click HERE to reach those 
datasets you are already authorized to download."  [A different wording 
will be required if the user is not authorized to download any of the 
datasets.]

7. When datasets are listed, the user should be able to set priorities on 
which center(s) he prefers to get them from.  Then in the case when a 
dataset has been replicated, if the user "selects all", ESG will know 
which center to get the data from (when there is more than one center 
with the data).

8. Add "more information" or "help" buttons in several places to assist 
the users when they become confused.
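
For item 7, the selection logic I have in mind is simple. A minimal sketch, assuming the portal knows which centers hold a replica of a dataset and the user has supplied an ordered list of preferred centers; the function name `choose_center` and the fallback to the first available replica are my assumptions, not an existing ESG feature.

```python
def choose_center(replicas, preferred):
    """Pick the first preferred center holding a replica, else the first available one."""
    for center in preferred:
        if center in replicas:
            return center
    return replicas[0] if replicas else None


# Hypothetical replica locations and user preferences.
print(choose_center(["PCMDI", "BADC", "DKRZ"], ["DKRZ", "BADC"]))
print(choose_center(["PCMDI"], ["DKRZ"]))
```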

