[Go-essp-tech] Fwd: Re: [esgf-devel] Bug reports from IS-ENES2 installation sprint

Tue Nov 12 09:12:09 MST 2013

-------- Original Message --------
Subject: 	Re: [esgf-devel] Bug reports from IS-ENES2 installation sprint
Date: 	Mon, 11 Nov 2013 18:42:22 -0800
From: 	Gavin M. Bell <gavin at llnl.gov>
To: 	Stephen Pascoe <stephen.pascoe at lirico.co.uk>
CC: 	esgf-devel at lists.llnl.gov, IS-ENES-2 Data-WPs
<is-enes2-data at lists.enes.org>

Hey All,

Thanks for running the installer through its paces... we are certainly
too resource constrained to test all the configurations ourselves.
The one we run the most is the DATA+INDEX+IDP and another additionally
with COMPUTE.  So I am glad to see the other permutations getting some
exercise.

Some of these issues have been partially dealt with but are in need of a
deeper tissue massage (#19); others are soon to be moot (#18); others
are brief oversights (entry 3).  Some are new dependencies brought in by
updates to components, such as UV-CDAT needing gfortran (entry 5).

There are now only a handful of issues left open... some of which are on
the way to being closed. Others... 'the juice is not [yet] worth the
squeeze'.  Keep in mind that moving forward we are going to move to a
VM(-ish) based solution, where matters of installation at this level
will only ever be witnessed and done by a handful of folks - so I
caution us to be parsimonious with the effort we may initially want to
marshal to this end.

(rest of response is interleaved)

On 11/8/13 6:08 AM, Stephen Pascoe wrote:
> European node managers have been meeting in Paris over the last 3 days
> to test installation of ESG nodes.  Below is a summary of issues we
> have found.  Where appropriate I have raised an equivalent ticket on
> github as indicated below.
>
> Our judgement is that 1.6 requires some work before it is safe to
> upgrade production nodes.
>
> Speaking purely as BADC, we will try to help resolve some of these
> issues by issuing pull requests.
>
>
> *1. Myproxy failing to install correctly [issue #18]*
>
> We spent some time diagnosing an apparent failure to install myproxy.
>  The symptoms included no /etc/init.d/myproxy being installed and
> "esg-node generate-globus-key-and-csr" failing.  In the end we
> discovered that this was because we had answered "N" when asked to
> install globus for a second time.  
>
> The script first installs gridftp and tries to install myproxy.  When
> an existing globus installation is detected it asks whether you want
> to install globus again, defaulting to "N".  If you follow the default
> myproxy is not installed.
>
> We recommend at the minimum the default should be to install globus.
>  Ideally the installer would only query the user to install globus
> once and would know whether to install gridftp and/or myproxy.

[response]
I have paid little attention to the globus script, it quite frankly is a
bit of a mess and will be entirely replaced with an rpm solution.
This is work that will be available in v1.7.0.  For now, just answer
*yes* to all questions being presented when in the globus realm of the
installation - regardless of the defaults.  I worked quite a bit to make
the prompts be 'smart' so you can essentially hit [return] all the way
through, but that didn't happen in the globus script.  The idea there is
to make it through the FULL globus install by saying yes to EVERYTHING
that you are asked for and once you have done it... never do it again
:-). (sort of).  This is why subsequently the install prompts having to
do with globus steer you away from re-entering that install process with
default "N" answers.

Again, this is going away entirely so there is no point in investing any
time in addressing these issues.
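
For reference only, since the globus script is slated for replacement:
the behaviour recommended in the report (ask once, then install whichever
of gridftp and myproxy is missing) might look roughly like the sketch
below.  It is Python rather than the installer's bash, and everything in
it is illustrative: MARKERS, missing_components and plan_globus_install
are made-up names, and the gridftp init-script path is an assumption
(only /etc/init.d/myproxy is mentioned above).

```python
import os

# Files whose presence is taken to mean "component already installed".
# /etc/init.d/myproxy comes from the report; the gridftp path is an
# assumption made for this sketch.
MARKERS = {
    "gridftp": "/etc/init.d/globus-gridftp-server",  # hypothetical path
    "myproxy": "/etc/init.d/myproxy",
}


def missing_components(exists=os.path.exists):
    """Return the names of components whose marker file is absent."""
    return [name for name, path in MARKERS.items() if not exists(path)]


def plan_globus_install(answer_yes, exists=os.path.exists):
    """Decide what to install from a single yes/no answer.

    Nothing is planned when everything is already present, so the user
    would not need to be asked a second time.
    """
    missing = missing_components(exists)
    if not missing or not answer_yes:
        return []
    return missing
```

With gridftp present and myproxy absent, a single "yes" would plan only
myproxy, which is the case that tripped up the sprint.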

>
> *2. THREDDS fails to start with data-only install [issue #19]*
> When not installing all components the directory /esg/content/thredds
> is not created and the script fails.  The solution is to create
> /esg/content/thredds and re-run the script.
>
> When not installing the compute component thredds will not start
> because it is looking for las_servers.xml.  The solution is to comment
> out the ipFilter declaration in thredds' web.xml.
>
> We recommend the ipFilter declaration in web.xml should only be
> included in the compute configuration.
>

[response]
Luca will relax the filter's behavior when it cannot find the
las_servers.xml files.  This was partially addressed in
adedd261ee7404b84cde8b8ec0b8a329dd3109df but indeed it was still in the
context of installing a compute node.
With relaxing the ipFilter this will go away.  *No need to do any edits
to the Thredds web.xml file*.  We don't want any one-off cases of
editing that file... it is not meant to be edited casually (as in during
the course of an installation).
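
The first half of the report (the missing /esg/content/thredds directory)
is the kind of failure an installer can guard against by creating the
directory itself before using it.  A minimal sketch in Python, not the
installer's actual code; ensure_dir is a hypothetical helper name.

```python
import os


def ensure_dir(path):
    """Create path (and any missing parents) if it does not exist yet.

    Safe to call repeatedly: an existing directory is left untouched.
    Returns True if the directory had to be created, False otherwise.
    """
    already_there = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)
    return not already_there


# Intended use (path from the report; needs suitable permissions):
#   ensure_dir("/esg/content/thredds")
```

This is the equivalent of the manual work-around in the report, done up
front instead of after the failure.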

> *3. Recent changes to rainbow prevent installation of git*
>
> On Thursday the installer changed the git version and this version was
> not downloadable from rainbow therefore clean installations stopped
> working.  This demonstrates why we need reproducible installations.
>  In this particular case we should just depend on the Git RPM.

[response]
This was a push that was inadvertent.  The source has been posted to
rainbow, however, the installer has been updated to use git's
distribution server to pull down the source for building.
https://github.com/ESGF/esgf-installer/issues/17

>
> *4. Failing to re-install replica shards causes script failure [issue
> #20]*
> If you accept the default response for "Replica shard entry for port X
> is already present.  Would you like to install it again?" the script
> fails.  The default is N, but you need Y.
>
> *5. Undocumented prerequisites*
> gcc-gfortran is now a dependency but is undocumented.

[response]
Added the following to the wiki section that describes pre-requisites:
(https://github.com/ESGF/esgf.github.io/wiki/ESGFNode%7CFAQ)
/NOTE: There are additional prerequisites from the UV-CDAT tool that is
installed as part of the DATA configuration of the stack.  Please see
them here: https://github.com/UV-CDAT/uvcdat/wiki/System-Requirements,
most notably the need for gfortran. (In newer versions of uv-cdat
gfortran is part of the installation procedure)/

>
> *6. Compute configuration fails with "Argument too long" bash error
> [issue #21]*
>
> This issue has occurred on data/compute and idp/index/data/compute
> nodes installed from scratch. Rerunning the installation does not
> solve the problem. The work-around we found is to remove
> /usr/local/ferret and rerun the installation.
>
>     *******************************
>     Setting up LAS Product Server...
>     *******************************
>
>     Getting LAS...
>     Don't see LAS tar file las-esgf-v8.1.tar.gz Downloading LAS from
>     las-esgf-v8.1.tar.gz -to->
>     /usr/local/src/esgf/workbench/esg/ferret/8.1/las-esgf-v8.1.tar.gz
>     wget -O 'las-esgf-v8.1.tar.gz'
>     'ftp://ftp.pmel.noaa.gov/pub/las/las-esgf-v8.1.tar.gz'
>     /usr/local/bin/esg-product-server: line 426: /usr/bin/wget:
>     Argument list too long
>      ERROR: Could not download LAS:las-esgf-v8.1.tar.gz
>
>
> This would appear to be a low-level bash bug (because the command
> being executed definitely only has 3 arguments).
> One possible work-around would be to try to increase ulimit -s, which is
> 10240 by default. On Linux, the maximum amount of space for command
> arguments is 1/4th of the amount of available stack space.

[response]
Zed hit it on the head in his email 11/8/13 @9:32am PDT.  There are
things that we could perhaps do to try to scrub up behind us regarding
the environment, but nothing that comes to mind.

Hence the solution currently is that you can do DATA+INDEX+IDP in one
pass... and then do COMPUTE in another.  There are other tricks that can
be done... but at the moment. 
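
Some background on the error itself: "Argument list too long" (E2BIG) is
raised by the kernel when the argument strings plus the environment
strings handed to a new process exceed its limit, so a short wget command
can still fail if the environment has grown very large.  On Linux a
single string longer than 128 KiB (with 4 KiB pages) produces the same
error.  The sketch below is one way to check this from Python; it is not
part of the installer, the helper names are made up for illustration, and
the byte count is approximate (pointer overhead is ignored).

```python
import os


def exec_payload_bytes(argv, env):
    """Approximate bytes execve() must copy for argv and env.

    Each argument costs its length plus a NUL terminator; each
    environment entry is stored as "KEY=value" plus a NUL terminator.
    """
    arg_bytes = sum(len(os.fsencode(a)) + 1 for a in argv)
    env_bytes = sum(
        len(os.fsencode(k)) + len(os.fsencode(v)) + 2 for k, v in env.items()
    )
    return arg_bytes + env_bytes


def largest_env_entries(env, top=5):
    """Return (size_in_bytes, name) for the biggest environment entries."""
    sized = [
        (len(os.fsencode(k)) + len(os.fsencode(v)) + 2, k) for k, v in env.items()
    ]
    return sorted(sized, reverse=True)[:top]


if __name__ == "__main__":
    # The wget command from the log above.
    argv = [
        "wget", "-O", "las-esgf-v8.1.tar.gz",
        "ftp://ftp.pmel.noaa.gov/pub/las/las-esgf-v8.1.tar.gz",
    ]
    print("ARG_MAX:", os.sysconf("SC_ARG_MAX"))
    print("payload:", exec_payload_bytes(argv, dict(os.environ)))
    for size, name in largest_env_entries(dict(os.environ)):
        print(size, name)
```

Run from the same shell session as the installer, this would show whether
one or two exported variables account for most of the space.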

>
> *7. Assorted minor issues*
>
> We observed that when Solr shards time out the logs do not show which
> shard failed.  This would be very helpful for diagnosing issues.
>
> If you don't include "--verify" the esgcet/catalog.xml file is not
> created.
>
>
[response]
There are a bunch of shards/search related flags to show the state of
shards... If shards fail, which I haven't seen happen thus far, then
they would be seen as timing out locally.  The --verify flag is pretty
much there to run the test_* functions that sanity check things are
running and on the 'right' port.  To your point, it is not quite as
useful as initially intended... but still marginally functional to
sanity check the install.  I do not use that flag that much in the
wild.  Pretty much --install does what is needed for most occasions...
(same goes for --update, which I believe I took out).  The --install
flag is idempotent and only changes things that are out of version range
or not present, which is what you want.
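
The idempotence described here (act only when a component is missing or
outside its accepted version range) can be illustrated with a small
sketch.  This is Python, not the installer's bash, and parse_version and
needs_install are hypothetical names; the installer's real version checks
may differ.

```python
def parse_version(text, width=4):
    """Turn '1.6.2' into (1, 6, 2, 0) so versions compare numerically.

    Padding with zeros makes '1.6' and '1.6.0' compare as equal.
    Assumes purely numeric, dot-separated versions.
    """
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def needs_install(installed, min_version, max_version=None):
    """Decide whether a component must be (re)installed.

    installed is the version string found on the system, or None if the
    component is not present at all.
    """
    if installed is None:
        return True  # not present
    current = parse_version(installed)
    if current < parse_version(min_version):
        return True  # too old
    if max_version is not None and current > parse_version(max_version):
        return True  # newer than the accepted range
    return False  # in range: leave it alone
```

Because the decision depends only on the current state, running the same
check twice in a row changes nothing the second time.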

P.S.
I see a bunch of work being done on the fork over at badc; thus far, from
what I can see, there are 12 divergent commits - we should have an
install meeting to discuss those plans.

-- 
Senior Computer Scientist / Mathematics Programmer
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
