[NARCCAP-discuss] Scenario for NSF's "Earth Cube"

Seth McGinnis mcginnis at ucar.edu
Fri Sep 2 14:35:01 MDT 2011


Greetings NARCCAP Users,

The NSF recently put out a Dear Colleague Letter about an effort they're
calling "Earth Cube" (http://www.nsf.gov/geo/earthcube/index.jsp) to
develop cyberinfrastructure to improve the earth sciences community's
ability to deal with large data volumes.  NSF has requested input
on what these kinds of systems should look like, and I'm part of a
group at NCAR working on some responses to that request.

I wrote up a little scenario describing the experience of using a
system that incorporates a number of the ideas we've been discussing.
Since my examples were heavily influenced by what we've learned
working with NARCCAP users, we'd be especially interested in your 
reactions.

If this sounds interesting, please give the document below a read.  
(And if not, you can stop reading here.)

Any thoughts you have about parts of the scenario that particularly
resonate with you, or that you find appealing, or things you
don't think would work, or ideas that it sparks, would be greatly
welcomed.  If you would rather not reply to the list, feel free to
send feedback privately to narccap at ucar.edu .

Thanks in advance,

--Seth

------------------------

The central element of the system I'm envisioning is methodology upload.
We already have pretty solid cyberinfrastructure in place for the
sharing of geoscience data.  For example, in NARCCAP, we're publishing
CF-compliant NetCDF data through ESG, and that seems like a generally
effective solution.  It doesn't achieve full interoperability, but it
feels like the core elements all exist and are evolving in the right
direction.  What I propose is the development of a compatible system to
handle the sharing of analysis and visualization, plus an affiliated
layer of opinion and interpretation.

What follows is an example use case: an illustrative story describing
how you would use such a software system.

Example Use Case
================

Our user, Ursula, is working on conservation of an endangered species
of migratory songbird.  She's concerned about the effects of climate
change on the biomes that make up the bird's seasonal habitats, and
wants to know if conservation efforts should prioritize certain
regions over others.

This is cross-disciplinary work; Ursula doesn't know very much about
climate modeling, but she's found an uploaded workflow (a how-to guide
or recipe) to follow.  It was written by another conservationist whom
she doesn't know, but who has a high reputation score among colleagues
that she trusts.  The guide also has a good score from people in the
climate community, and some useful comments attached that discuss the
applicability of the various steps.

She starts with a quick search via a web-based data portal. Finding a
regional climate modeling project whose output will suit her needs,
she registers for an update alert: if new data that meets the search
criteria becomes available before the deadline for the report she's
working on, she'll receive email about it.

The first step of her analysis is to speed things up by subsetting the
data to her region of interest.  (She's going to work through the
analysis using a single exemplar, and then will go back and run it on
the complete set of items she's interested in.)  Her region is defined
by a GIS shapefile.  Neither the region nor blocks of the data are
easily defined by lat-lon bounding boxes, but someone (in this case,
the data portal team) has provided a system extension that can handle
that in a sensible way.  Ursula checks the result via a quick
(ncview-style) visualization of the intermediate NetCDF file, which is
not downloaded to her system, but remains in the cloud.
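
For concreteness, here is a rough sketch of what that subsetting step
might look like in Python with xarray, geopandas, and shapely.  The
file names, variable names, and the assumption of a simple 1-D lat/lon
grid are all just placeholders for illustration:

    import numpy as np
    import xarray as xr
    import geopandas as gpd
    from shapely.geometry import Point

    ds = xr.open_dataset("rcm_output.nc")                  # hypothetical model output
    region = gpd.read_file("habitat_region.shp").unary_union

    # Mark grid cells whose centers fall inside the shapefile region.
    lon2d, lat2d = np.meshgrid(ds["lon"].values, ds["lat"].values)
    inside = np.array([region.contains(Point(x, y))
                       for x, y in zip(lon2d.ravel(), lat2d.ravel())])
    mask = xr.DataArray(inside.reshape(lon2d.shape), dims=("lat", "lon"),
                        coords={"lat": ds["lat"], "lon": ds["lon"]})

    subset = ds.where(mask)          # cells outside the region become missing
    subset.to_netcdf("subset.nc")    # intermediate file for the next step

In the scenario, of course, this logic lives in the portal team's
extension, and Ursula never has to write it herself.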

Ursula then needs to distill the high-resolution timeseries data into
a seasonal climatology.  This step is trickier than it seems, because
the driving GCMs use different and non-standard calendars.  However,
somebody else (the data supplier) has already figured out how to do
this properly.  So Ursula just re-uses their results.  It's
transparent to her whether the system has cached the transformed data
(because it's popular), or whether it's reapplying the transformation to
her intermediate result (because she wants something unusual).  The system
knows enough to be smart about that.
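
As an illustration, the calendar-aware climatology step might boil down
to something like this in Python with xarray (file and variable names
are hypothetical; decoding the time axis with cftime is what lets the
same code handle 360-day and no-leap model calendars):

    import xarray as xr

    ds = xr.open_dataset("subset.nc", use_cftime=True)
    clim = ds["tas"].groupby("time.season").mean("time")   # DJF/MAM/JJA/SON means
    clim.to_netcdf("seasonal_climatology.nc")

The point of the scenario is that Ursula never has to discover the
calendar wrinkle herself; she inherits the solution from the data
supplier's uploaded methodology.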

Next, she needs to bias-correct the results by comparing them with
observations.  Her preferred observational dataset isn't available on
this system, so she adds it, either by uploading the files or by
creating a link to another online source.  She doesn't have time to do
a proper write-up for it, so she restricts access to the new resource
to the members of her research group, who already understand its
limitations.
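
Once the model and observational climatologies are on a common grid
(the interpolation question she tackles next), the correction itself
can be quite simple.  Here is a sketch in Python with xarray, using a
plain mean-bias ("delta") adjustment and hypothetical file and variable
names; a real workflow might use something more sophisticated such as
quantile mapping:

    import xarray as xr

    obs  = xr.open_dataset("obs_clim.nc")["tas"]           # her uploaded observations
    hist = xr.open_dataset("model_hist_clim.nc")["tas"]    # model, historical period
    fut  = xr.open_dataset("model_future_clim.nc")["tas"]  # model, future period

    bias = hist - obs              # per-season model-minus-obs bias
    fut_corrected = fut - bias     # apply the same shift to the future climatology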

The model data and observations are on different grids, so she needs
to interpolate one or both of them to a common set of locations.  Her
recipe has a recommended method, but a discussion in the comments makes
a persuasive case (using links to presentations and the relevant
analysis module) that another method is better, so she decides to try
that one.
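
The regridding step itself might look something like this in Python
with xarray (hypothetical names; linear interpolation stands in for
whichever method the comments recommended, and a conservative
regridding tool would be preferable for precipitation-like fields):

    import xarray as xr

    model = xr.open_dataset("model_clim.nc")["tas"]
    obs   = xr.open_dataset("obs_clim.nc")["tas"]

    # Interpolate the model climatology onto the observational grid.
    model_on_obs_grid = model.interp_like(obs, method="linear")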

When she does the interpolation, the module provides a diagnostic that
shows the results near the edge of her domain aren't very good,
because she didn't include enough data in the margins.  So she backs
her analysis up two steps, expands the subsetting domain, and reruns
it with the larger domain.

Her analysis proceeds through a few more similarly complex steps. At
the end, she edits and saves her entire analysis chain, with
annotations, to her personal library.  (She also cleans up a temporary
save left over from when she stopped at the end of the day and picked
the work back up the next morning.)

Ursula can now apply the entire analysis to the full range of datasets
she's interested in.  Before she submits it for processing, the system
provides her with an estimate of the total computational resources
that will be required.  It's too big to consider running on her
desktop machine (especially considering the data transfer that would
be involved), but it will fit within the allocation she's received
through a partner institution; if it didn't, she would have to
consider purchasing some compute cycles from a cloud provider like
Google or Amazon.  Exactly how long her analysis will take is somewhat
uncertain, but the system will notify her when the results are ready.

The notification arrives earlier than expected -- it turns out there
was an error in half the cases, so they were aborted after the second
step.  Ursula checks the error messages (which are useful) and
realizes that she forgot to adjust the time range for processing the
future climate data.  She makes the change, restarts the analysis, and
this time it runs successfully.
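
For illustration, the adjustment might be no more than changing a time
selection; a sketch in Python with xarray (the period, file name, and
variable names are hypothetical):

    import xarray as xr

    ds = xr.open_dataset("model_future.nc", use_cftime=True)
    fut = ds.sel(time=slice("2041-01-01", "2070-12-31"))   # future period, not historical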

When writing up her report, Ursula is able to get a bibliographic
reference for each step in her analysis, as well as for the recipe she
used, just by checking the "citations" tab.  She also creates a stub
for her report on the system, linking it to the analysis.  After she
uploads the pre-print of the report, she plans to make a simplified
version of it into a "live" document with embedded analysis to be used
for educational purposes.

Shortly before her report is due, she receives an automated email
alert about the interpolation module she used.  The creator has found
a bug: when used on precipitation data, the algorithm will
sometimes generate values below zero.  Luckily the fix is simple, and
has already been implemented.  Ursula finds the analysis in her
library and, with a few clicks, is able to re-run it in its entirety
using the updated interpolation algorithm.  She gets the results in
time to incorporate them into the report, and is able to meet her
deadline with corrected data.
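
The fix itself could be as small as clamping the spurious values; a
sketch in Python with xarray (hypothetical names):

    import xarray as xr

    pr = xr.open_dataset("pr_interpolated.nc")["pr"]
    pr = pr.clip(min=0.0)    # precipitation can't be negative

The interesting part is not the one-line fix, but that the system
propagates it to everyone who used the module.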

What's noteworthy here is that although Ursula is not a climate
specialist, the system has nonetheless enabled her to perform, quickly
and accurately, an analysis that would otherwise require considerable
domain expertise.  The gains for expert users will be
commensurately dramatic, enabling them to use a toolbox of deep and
complex analyses as the basic building blocks of truly far-reaching
inquiries into the vast torrent of incoming geoscience data.
			

