[Go-essp-tech] search problems with ESG
Karl Taylor
taylor13 at llnl.gov
Fri May 27 18:13:25 MDT 2011
Dear all,
As you know, we are getting an increasing number of complaints that the
ESG search is returning misleading results. I'm providing one more
simple example that illustrates the problem (at least one problem that
seems quite robust):
Go to the BADC, PCMDI, or DKRZ websites and select from the "Search
Categories" model = inmcm4 (which is being served from the PCMDI data
node). The search returns 204 datasets, which is the correct number,
but if you now look under the "Experiment" search category, you see only
4 experiments listed, whereas the 204 datasets are actually from 12
experiments; 8 experiments fail to appear. [Note that if you search at
the JPL portal, only 4 experiments are missing, and if you search at
NCAR, you'll see all the experiments correctly.]
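To make the expected behaviour concrete, here is a minimal Python sketch of how the "Experiment" list could be derived directly from the datasets a search returns. This is not the ESG code; the record layout, the function name experiment_facet, and the experiment names are all made up for illustration. The point is only that every experiment present in the results shows up in the list, and that the counts add up to the number of datasets returned.

```python
from collections import Counter


def experiment_facet(datasets):
    """Count the returned datasets per experiment (hypothetical record layout)."""
    return Counter(d["experiment"] for d in datasets)


# Made-up records standing in for the datasets returned by a model search.
results = [
    {"id": f"inmcm4.{exp}.run{n}", "model": "inmcm4", "experiment": exp}
    for exp in ("experiment_a", "experiment_b", "experiment_c")
    for n in (1, 2)
]

facet = experiment_facet(results)

# The facet counts must account for every dataset in the result list.
assert sum(facet.values()) == len(results)

for name, count in sorted(facet.items()):
    print(name, count)
```

A check of this kind, run against each portal, would flag the case above where 204 datasets are returned but only 4 of the 12 experiments are listed.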
Most users coming to BADC, PCMDI, or DKRZ would think that output is
only available from 4 experiments from the inmcm4 model. So output that
modeling groups have so diligently made available would be missed by the
folks who want to analyze it.
I can see no other problem at this time that should be given higher
priority. We must make visible to our users all data that is actually
in the archive as soon as possible. We should devote every available
resource to fixing this problem.
Another problem with the current search capability (of slightly lower
priority) is that finding datasets the user is interested in (and only
those datasets) is currently difficult because of the way the variables
are identified in the search engine and because there is a rather silly
mistake in what the search engine is doing. Considering first the
"Search Categories" method (as opposed to the method where you enter
"free text"), the search for variables is based on the list of standard
names (displayed with the underscores removed). There are two
limitations of listing only the standard_name:
1. some variables in the database may not have a standard name
attribute, so they won't be listed under the "variable" search category.
2. the same standard_name can apply to multiple variables. For example,
area_fraction is a standard name that can apply to many different
variables (e.g., "land cover fraction", "grass fraction", "crop
fraction"). Thus, if a user searches on "area fraction" but is only
interested in, for example, "grass fraction", they will find lots of
extraneous datasets.
I think the user should be able to designate whether the list of
variables under the "variable" search category is displayed as
standard_names, long_names, or the names of the variables themselves
(i.e., the netCDF variable name).
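As a rough sketch of what I mean (in Python, with made-up variable records; the field names, the function variable_labels, and the short variable names are assumptions, not what ESG stores), the list could be built from whichever attribute the user picks, falling back to the netCDF variable name when that attribute is missing:

```python
def variable_labels(variables, key):
    """Return the sorted, de-duplicated labels to show under "variable".

    key is "standard_name", "long_name", or "name". A variable lacking the
    chosen attribute is listed under its netCDF variable name instead of
    being dropped.
    """
    return sorted({var.get(key) or var["name"] for var in variables})


# Illustrative records only; the short names here are hypothetical.
variables = [
    {"name": "landCoverFrac", "long_name": "land cover fraction",
     "standard_name": "area_fraction"},
    {"name": "grassFrac", "long_name": "grass fraction",
     "standard_name": "area_fraction"},
    {"name": "cropFrac", "long_name": "crop fraction",
     "standard_name": "area_fraction"},
    # A variable with no standard_name attribute at all.
    {"name": "custom_var", "long_name": "some custom diagnostic"},
]

print(variable_labels(variables, "standard_name"))
print(variable_labels(variables, "long_name"))
```

Listed by standard_name, the three fraction variables collapse into the single entry area_fraction, while the long_name listing keeps them apart. The fallback keeps the variable without a standard_name visible in either case.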
If this is not possible (or is too difficult), then the other method of
searching (i.e., by entering a string after "Search: Datasets for: ...."
at the top of the ESG search page) needs to be greatly improved. (It
really should be improved in any case.) When searching for a variable,
it is difficult to limit the results to the actual variable you are
looking for. Here are some problems:
1. If you enter a word like "sometimes" or "the" or "disciplines", the
search returns lots of datasets. This is because the search looks not
only through the standard names, the realm names, the model names, etc.,
but also through all the text in the explanatory description associated
with each standard_name it finds. I think whoever coded this made a
silly mistake by searching the text that explains the meaning of the
standard_name, when they meant to look through the "long_names"
instead. [The search engine doesn't look at the long names at all!]
You'll note that if you click on any dataset returned by a search,
you'll get a page that provides 4 tabs of information ("summary",
"geophysical properties", "variables", and "administration"). Click on
"variables" and you will find information about each variable. Among
the information is "Description", which contains the long_name, followed
by "Units" and "Standard_Name". Just below the standard_name is
*another* "Description", which explains what the standard_name means. I
think it is this description that is being looked at instead of the
"Description" that contains the long_name. This should be fixed
immediately!
2. When you enter a string such as "surface air temperature", the
search returns all variables that contain any one of these 3 words.
There is no way a user can find surface air temperature and not a bunch
of other variables with this search capability because it returns the
*union* of individual searches on "surface", "air", and "temperature".
If the user looks for a standard name (e.g., air_temperature), an error
is returned because an underscore is not allowed. Another problem is
that if you search by entering the text "temperature", variables like
"precipitation" still appear under the Search Category "Variables"
although they should have been eliminated.
3. If you enter the name of a variable (as it appears in the netCDF
file), for example "tas" (surface air temperature), no results are
returned. The search should include the variable names in the text it
scans.
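The behaviour I would expect from the free-text search can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the ESG implementation: the records, field names, and the functions tokens, searchable_terms, and search are hypothetical. It scans the variable name, long_name, and standard_name (not the explanatory text attached to the standard_name), treats an underscore as a word separator rather than an error, and by default requires all of the words entered to match.

```python
import re


def tokens(text):
    """Lower-case the text and split it into words; '_' and '-' act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def searchable_terms(var):
    """Words from the variable name, long_name and standard_name only."""
    fields = (var["name"], var.get("long_name", ""), var.get("standard_name", ""))
    return tokens(" ".join(fields))


def search(variables, query, match_all=True):
    """Return the names of matching variables.

    match_all=True requires every query word (intersection);
    match_all=False reproduces the current union behaviour.
    """
    wanted = tokens(query)
    if not wanted:
        return []
    hits = []
    for var in variables:
        terms = searchable_terms(var)
        matched = wanted <= terms if match_all else bool(wanted & terms)
        if matched:
            hits.append(var["name"])
    return hits


# Illustrative records only.
variables = [
    {"name": "tas", "long_name": "surface air temperature",
     "standard_name": "air_temperature"},
    {"name": "ts", "long_name": "surface temperature",
     "standard_name": "surface_temperature"},
    {"name": "ta", "long_name": "air temperature",
     "standard_name": "air_temperature"},
    {"name": "pr", "long_name": "precipitation",
     "standard_name": "precipitation_flux"},
]

print(search(variables, "surface air temperature"))                   # ['tas']
print(search(variables, "surface air temperature", match_all=False))  # ['tas', 'ts', 'ta']
print(search(variables, "air_temperature"))                           # ['tas', 'ta']
print(search(variables, "tas"))                                       # ['tas']
print(search(variables, "the"))                                       # []
```

With all words required, "surface air temperature" finds only the one variable, a standard name typed with underscores works, the netCDF name "tas" is found, and a word like "the" returns nothing because the explanatory descriptions are never scanned.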
I am quite embarrassed that hundreds of users are seeing this terribly
dysfunctional search capability. It makes us look bad. Whoever is
responsible for this needs to fix it immediately.
If there are things I can do to help, please let me know.
thanks,
Karl
P.S. Here's a list of additional items which I think would improve the
ESG user interface:
Highest priority:
1. A symbol next to each dataset in the list should indicate which
datasets the user can download with current permissions.
2. Whenever a user is asked to click on (select) items on a page, he
should always be provided with an option to "select all" or "select all
for which I currently have permission to download" (as well as retaining
the option to select individual items). This holds both for the "dataset"
selection pages and the file selection pages.
3. The interface pages for subscribing to the "commercial" group or the
"research" group for CMIP5 need to be modified to prevent users from
joining the wrong group. We need to make the whole of the terms of use
visible to the user without scrolling, if possible. We need to add text
to guide the user to the correct group. The terms of use should be easily
downloadable as a pdf or something.
4. When subscribing to one of the CMIP5 groups, if the user forgets to
click on "I accept", the statement of work should not have to be
re-entered by the user (the user will likely enter a much shorter
description the second time through, and we don't want that). Also, we
should explain what kind of information we are seeking in the "statement
of work".
5. I do not yet have any experience with the new method of
authentication (i.e., tokenless). But it sounds like when users run the
wget script, they will be required to log in (after presumably logging in
to download the wget script itself). Logging in twice would be a nuisance.
Lower priority:
6. Eliminate the prompting to join a group every time you attempt to
download data that you are not currently authorized to download. Replace
this with something like the following statements: "You are not
authorized to download all of the selected datasets. You may be able to
gain permission to access these datasets by joining the following
groups: group1, group2, ... groupN. To subscribe to a group, click on the
"My account" tab at the top of the page. Click HERE to reach those
datasets you are already authorized to download." [A different wording
will be required if the user is not authorized to download any of the
datasets.]
7. When datasets are listed, the user should be able to set priorities on
which center(s) he prefers to get them from. Then in the case when a
dataset has been replicated, if the user "selects all", ESG will know
which center to get the data from (when there is more than one center
with the data).
8. Add "more information" or "help" buttons in several places to assist
the users when they become confused.