[Go-essp-tech] search problems with ESG

Wed Jun 1 16:17:49 MDT 2011

Dear Karl, All,

A growing number of ESG search related issues have been identified 
recently.  We're working on addressing each of these problems, 
prioritizing those that have the greatest impact on end users.

One issue in particular is leading to the inconsistent "Search 
Categories" results Karl notes below.  This problem is directly related 
to deleting datasets by invoking the "esgunpublish" command at a Data 
Node.  Dataset unpublish operations have a significant side effect of 
removing important search category values from *many* published datasets 
at a Gateway.  In particular, unpublishing datasets may result in 
deletion of the DRS components "experiment" and "ensemble".  This 
effectively removes the related "Search Categories" values.

We're working on a fix for this problem, which will be available as soon 
as possible.  In the meantime, two things can be done to mitigate this 
issue:

1)  Please suspend all esgunpublish operations which delete datasets.
2)  Please re-publish all affected dataset catalogs.

I realize this may involve significant effort, but it is likely the 
quickest way to minimize the impact of this problem on the operational 
system.

We deeply regret this issue and the inconvenience caused to end users as 
well as data publishers.  We're reviewing options to remedy this issue 
as quickly as possible and will pass on further details as they are known.

Thank you again, Karl, for the detailed description of search related 
problems.

Best regards,

-Eric

Don Middleton wrote:
> Hi Karl - Thanks for the thorough description of problems, as always. 
> We have several people working on the critical search & related 
> tickets, and will make sure all of these are properly in the system as 
> well. There's been work on this during the week, with traffic on the 
> gateways list - BADC, DKRZ, and NCAR. Monday's a federal holiday for 
> the US folks, so we'll reengage on Tuesday. Have a good holiday 
> weekend, Karl - and all!
>
> cheers - don
>
>
> On May 27, 2011, at 6:13 PM, Karl Taylor wrote:
>
>> Dear all,
>>
>> As you know, we are getting an increasing number of complaints that 
>> the ESG search is returning misleading results.  I'm providing one 
>> more simple example that illustrates the problem (at least one 
>> problem that seems quite robust):
>>
>> Go to the BADC, PCMDI, or DKRZ websites and select from the "Search 
>> Categories" model = inmcm4 (which is being served from the PCMDI data 
>> node).   The search returns 204 datasets, which is the correct 
>> number, but if you now look under the "Experiment" search category, 
>> you see only 4 experiments listed, whereas the 204 datasets are 
>> actually from 12 experiments; 8 experiments fail to appear.  [Note 
>> that if you search at the JPL portal, only 4 experiments are missing, 
>> and if you search at NCAR, you'll see all the experiments correctly.]
>>
>> Most users coming to BADC, PCMDI, or DKRZ would think that output is 
>> only available from 4 experiments from the inmcm4 model.  So output 
>> that modeling groups have so diligently made available would be 
>> missed by the folks who want to analyze it. 
>>
>> I can see no other problem at this time that should be given higher 
>> priority.  We must make visible to our users all data that is 
>> actually in the archive as soon as possible.  We should devote every 
>> available resource to fixing this problem.
>>
>> Another problem with the current search capability (of slightly lower 
>> priority) is that finding datasets the user is interested in (and 
>> only those datasets) is currently difficult because of the way the 
>> variables are identified in the search engine and because there is a 
>> rather silly mistake in what the search engine is doing.   
>> Considering first the "Search Categories" method (as opposed to the 
>> method where you enter "free text"), the search for variables is 
>> based on the list of standard names (displayed with the underscores 
>> removed).  There are two limitations of listing only the standard_name:
>>
>> 1. some variables in the database may not have a standard name 
>> attribute, so they won't be listed under the "variable" search 
>> category. 
>>
>> 2. the same standard_name can apply to multiple variables.  For 
>> example, area_fraction is a standard name that can apply to many 
>> different variables (e.g., "land cover fraction", "grass fraction", 
>> "crop fraction").  Thus, if a user searches on "area fraction", but is 
>> only interested in, for example, "grass fraction", they will find 
>> lots of extraneous datasets.
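
The two limitations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the variable names (var_a ... var_d) and the helper functions are hypothetical and are not taken from the ESG catalog or its code. Only the long names and the area_fraction standard name come from the message.

```python
# Hypothetical variable records (illustrative only, not ESG data).
VARIABLES = [
    {"name": "var_a", "long_name": "land cover fraction", "standard_name": "area_fraction"},
    {"name": "var_b", "long_name": "grass fraction", "standard_name": "area_fraction"},
    {"name": "var_c", "long_name": "crop fraction", "standard_name": "area_fraction"},
    # A variable with no standard_name attribute (limitation 1).
    {"name": "var_d", "long_name": "example diagnostic", "standard_name": None},
]


def variable_facet(records, field="standard_name"):
    """Return the distinct values of one field, skipping records that lack it."""
    return sorted({r[field] for r in records if r.get(field)})


def filter_by(records, field, value):
    """Return the records whose field equals the selected category value."""
    return [r for r in records if r.get(field) == value]


# Limitation 1: var_d never shows up in a standard_name-based category list.
facet_by_standard_name = variable_facet(VARIABLES)

# Limitation 2: selecting area_fraction returns all three fraction variables.
hits_by_standard_name = filter_by(VARIABLES, "standard_name", "area_fraction")

# Listing by long_name instead lets the user pick out "grass fraction" alone.
facet_by_long_name = variable_facet(VARIABLES, field="long_name")
hits_by_long_name = filter_by(VARIABLES, "long_name", "grass fraction")
```

Building the category list from long_name (or from the netCDF variable name) rather than standard_name is the choice Karl proposes to leave to the user.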
>>
>> I think the user should be able to designate whether the list of 
>> variables under the "variable" search category is displayed as 
>> standard_names, long_names, or the names of the variables themselves 
>> (i.e., the netCDF variable name).
>>
>> If this is not possible (or is too difficult), then the other method 
>> of searching (i.e., by entering a string after "Search: Datasets for: 
>> ...." at the top of the esg search page) needs to be greatly 
>> improved.  (It really should be improved in any case.)  When 
>> searching for a variable it is difficult to limit the results to the 
>> actual variable you are looking for.  Here are some problems:
>>
>> 1.  If you enter a word like "sometimes" or "the" or "disciplines", 
>> the search returns lots of datasets.  This is because the search 
>> looks not only through the standard names, the realm names, the model 
>> names, etc., but also through all the text in the explanatory 
>> description associated with each standard_name it finds.  I think 
>> whoever coded this made a silly mistake by searching the text that 
>> explains the meaning of the standard_name, when they meant to look 
>> through the "long_names" instead.  [The search engine doesn't look at 
>> the long names at all!]   You'll note that if you click on any 
>> dataset returned by a search, you'll get a page that provides 4 tabs 
>> of information ("summary", "geophysical properties", "variables", and 
>> "administration").  Click on "variables" and you will find 
>> information about each variable.  Among the information is 
>> "Description", which contains the long_name, followed by "Units" and 
>> "Standard_Name".  Just below the standard_name is *another* 
>> "Description", which explains what the standard_name means.  I think 
>> it is this description that is being looked at instead of the 
>> "Description" that contains the long_name.  This should be fixed 
>> immediately!
>>
>> 2.  When you enter a string such as "surface air temperature", the 
>> search returns all variables that contain any one of these 3 words.  
>> There is no way a user can find surface air temperature and not a 
>> bunch of other variables with this search capability because it 
>> returns the *union* of individual searches on "surface", "air", and 
>> "temperature".   If the user looks for a standard name (e.g., 
>> air_temperature), an error is returned because an underscore is not 
>> allowed.    Another problem is that if you search by entering the 
>> text "temperature",  variables like "precipitation" still appear  
>> under the Search Category "Variables" although they should have been 
>> eliminated.
>>
>> 3.  If you enter the name of a variable (as it appears in the netCDF 
>> file), for example "tas" (surface air temperature), no results are 
>> returned.  The search should include the variable names in the text 
>> it scans.
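
The difference between the *union* behaviour described in item 2 and the matching a user would expect can be sketched as follows. This is a minimal illustration, not the ESG search implementation: the function names and the sample long names are assumptions, and treating an underscore as a word separator is just one possible way to accept a query such as air_temperature.

```python
def tokenize(text):
    """Lower-case the text and split it into a set of words.

    Underscores are treated as spaces, so "air_temperature" becomes
    {"air", "temperature"} (an assumed convention, not ESG behaviour).
    """
    return set(text.lower().replace("_", " ").split())


def search_any(query, descriptions):
    """Union semantics: match if at least one query word is present."""
    words = tokenize(query)
    return [d for d in descriptions if words & tokenize(d)]


def search_all(query, descriptions):
    """Intersection semantics: match only if every query word is present."""
    words = tokenize(query)
    return [d for d in descriptions if words <= tokenize(d)]


# Hypothetical long names used only for this example.
LONG_NAMES = [
    "surface air temperature",
    "sea surface temperature",
    "air pressure at sea level",
    "precipitation",
]

union_hits = search_any("surface air temperature", LONG_NAMES)
strict_hits = search_all("surface air temperature", LONG_NAMES)
underscore_hits = search_all("air_temperature", LONG_NAMES)
```

With union semantics the query matches three of the four sample entries; requiring every word narrows the result to the single entry the user asked for.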
>>
>> I am quite embarrassed that hundreds of users are seeing this 
>> terribly dysfunctional search capability.  It makes us look bad.  
>> Whoever is responsible for this needs to fix it immediately.  
>>
>> If there are things I can do to help, please let me know.
>>
>> thanks,
>> Karl
>>
>> P.S.  Here's a list of additional items which I think would improve 
>> the ESG user interface:
>>
>> Highest priority:
>>
>> 1.     A symbol next to each dataset in the list should indicate 
>> which datasets the user can download with current permissions.
>>
>> 2.     Whenever a user is asked to click on (select) items on a page, 
>> he should always be provided with an option to “select all” or 
>> “select all for which I currently have permission to download” (as 
>> well as retaining the option to select individual items).  This holds 
>> both for the “dataset” selection pages and the file selection pages.
>>
>> 3.     The interface pages for subscribing to the “commercial” group 
>> or the “research” group for CMIP5 need to be modified to prevent 
>> users from joining the wrong group.  We need to make the whole of the 
>> terms of use visible to the user without scrolling, if possible.  We 
>> need to add text to guide the user to the correct group.  The terms 
>> of use should be easily downloadable (e.g., as a PDF).  
>>
>> 4.     When subscribing to one of the CMIP5 groups, if the user 
>> forgets to click on “I accept”, the statement of work should not have 
>> to be re-entered by the user (the user will likely enter a much 
>> shorter description the second time through, and we don’t want 
>> that).  Also, we should explain what kind of information we are 
>> seeking in the “statement of work”.
>>
>> 5.    I do not yet have any experience with the new method of 
>> authentication (i.e., tokenless).  But it sounds like when users run 
>> the wget script, they will be required to log in (after presumably 
>> logging in to download the wget script itself).  Logging in twice 
>> would be a nuisance. 
>>
>> Lower priority:
>>
>> 6.     Eliminate the prompting to join a group every time you attempt 
>> to download data that you are not currently authorized to download.  
>> Replace this with something like the following statements:  “You are 
>> not authorized to download all of the selected datasets.  You may be 
>> able to gain permission to access these datasets by joining the 
>> following groups:  group1, group2, … groupN.  To subscribe to a 
>> group, click on the ‘My account’ tab at the top of the page.   Click 
>> HERE to reach those datasets you are already authorized to download.”  
>> [A different wording will be required if the user is not authorized 
>> to download any of the datasets.]
>>
>> 7.     When datasets are listed, the user should be able to set 
>> priorities on which center(s) he prefers to get them from.  Then in 
>> the case when a dataset has been replicated, if the user “selects 
>> all”, ESG will know which center to get the data from (when there is 
>> more than one center with the data).
>>
>> 8.     Add “more information” or “help” buttons in several places to 
>> assist the users when they become confused.
>>
>>
>> _______________________________________________
>> GO-ESSP-TECH mailing list
>> GO-ESSP-TECH at ucar.edu
>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech