[Go-essp-tech] Publication, versioning and notification

martin.juckes at stfc.ac.uk martin.juckes at stfc.ac.uk
Thu Mar 18 04:17:38 MDT 2010


Hello Karl,

 

I'd just like to clarify a point about your step one: some modelling
centres may want to distribute output from CMIP5 experiments which
cannot easily be made CMOR2 compliant and therefore will not be in our
CMIP5 archive - but they may nevertheless choose to use ESG software
to publish this data (I'm not sure of the ESG terminology here, but it
would be a distinct archive). I discussed this with Bryan yesterday, and
he was adamant that "CMIP5 data" should only refer to  CMOR2 compliant
data.   My concern is that if we don't have a clear and comprehensible
way of referring to "output from CMIP5 experiments which is not CMOR2
compliant" it will be hard to stop modellers from referring to their
data as "CMIP5 data". So, I suggest calling this data "CMIP5 raw data",
and recommending that if it is published through ESG it should be with a
reduced DRS consisting of:

CMIP5_raw/<institute>/<model>/<whatever suits your local group>.

The driver for this concern is the fact that the CMIP5 data request does
not, despite the extensive consultation, include all output for forcing
of regional models that some groups want. So, this additional output
will be (or is likely to be) produced and distributed by at least DKRZ
and IPSL. There are two factors which might block the preferred route of
making it CMOR2 compliant: (1) justifying the data transformation effort
when it is not part of the WCRP request and (2) getting the variables
into tables so that creation of CMOR2 compliant data is possible. 
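For concreteness, the proposed reduced DRS could be assembled as in the sketch below. This is illustration only: the helper name and facet values are invented, and the final path component is deliberately left to each group's local choice.

```python
def raw_drs_path(institute, model, local_subpath):
    """Assemble the reduced DRS suggested above for non-CMOR2-compliant
    output: CMIP5_raw/<institute>/<model>/<local choice>."""
    return "/".join(["CMIP5_raw", institute, model, local_subpath])
```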

If we end up distributing raw model output within the IS-ENES project,
would you object to us calling it "CMIP5_raw"? 

 

Regards,

Martin

 

From: go-essp-tech-bounces at ucar.edu
[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
Sent: 15 March 2010 22:14
To: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Publication, versioning and notification

 

Dear all,

I'll try here to summarize my understanding of how data will get
published and replicated as part of ESG.  If my summary is accurate,
there are a number of items we'll need to address soon, which I'll come
back to at the end of this email.

Consider the following simplified case:

1.  Model A produces precipitation and temperature data for a 100-year
simulation.  This will be considered the entire "output".
2.  CMIP5 requests that only temperature data be archived, and this
temperature data then constitutes the entire "requested" model output.
3.  The ESG federation has agreed that the last 20 years of temperature
data will be replicated at the archival gateways (PCMDI, BADC, DKRZ,
...).

Thus, the "replicated" output is a subset of the "requested" output,
which is a subset of the "output".  Note that in what follows I assume
there is no good reason to separate "requested" from "output".  The
"replicated" output, however, needs to be treated someone separately
because of the issues having to do with quality control, versioning and
replication.

As I understand it, one possible route by which data will appear on ESG
is as follows:

1.  Modeling group A publishes all its "output" on an ESG node.  This
requires writing the output files into directories, determining which
data will be part of the official replicated subset and which will not,
collecting files into ESG datasets, and assigning version numbers to
files and to datasets.  

1a) files are initially placed in directories following the DRS
specifications, without assigning a version number.  Thus, they are
placed directly in the <ensemble member> directory.  Here is the DRS
directory structure:
...../<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable name>/<ensemble member>/
1b) ESG decides whether any of the files are supposed to replace earlier
versions and assigns a version number to each file.  It also moves any
replaced or withdrawn files down one level into directories named by the
version of the files they contain. Thus, both the latest versions of
files and directories containing earlier versions of the files will
appear under <ensemble member>.
1c) ESG decides whether any portion of the data in each file is included
in the officially called for replicated set.  [The code has not yet been
written that can make this decision.]  If it finds any data in the
"replicated" category, ESG creates a parallel directory (under the
"requested" directory as called for by the current DRS, but perhaps this
should be changed to "replicated").  ESG creates a link from the new
directory to the file itself under "output" (i.e., to the file or files
containing temperature data that fall within the specified 20-year
period).
1d) Assuming publication at the "realm" level, the ESG publisher is
executed on the "output" side of the directory tree, yielding a single
dataset containing all the temperature and precipitation data.  Note,
that publication at the "realm" level means all variables from a single
realization of a simulation will be in the same ESG dataset.  [The
different members of an ensemble will appear in separate ESG datasets.]

1e) The ESG publisher will then act on the "links" in the "replicated"
side of the directory tree, and publish a dataset  containing only
temperature data from 20 years (plus perhaps any additional years that
might be included as part of the needed files).  Thus, the replicated
data will be found either under the "replicated" dataset or the original
"output" dataset (along with additional data stored there).

2. Modeling group A sends all of its temperature data to PCMDI (because
PCMDI plans to archive as much of the requested data as it can). 

3.  PCMDI publishes the data it receives, following a similar procedure
as in 1a-e above, but of course the realm dataset will only include
temperature data.  The "replicated" dataset, on the other hand, should be
identical to the one published by Modeling Group A on its own node.

4.  The data is sent to other archival gateways (e.g., BADC and DKRZ)
who might choose only to archive the "replicated" subset.  They might
place their files directly in the "replicated" side of the directory
tree (and omit the "output" side of the tree).  

Questions:

1.  How can we make sure that the version numbers assigned by the
publishers at the different nodes/gateways are the same across the
federation?

2.  Isn't there an easier way to do all of this?

3.  When a user looks for data, he is more likely to find what he needs
by searching "output" not "replicated" (since "output" is more
complete).  In fact I'm not sure the typical user will care what portion
of the data has been replicated.  What are the advantages to the user of
requiring that some defined subset of the data be replicated?  Is it in
the DOI assignment?

I've run out of time for now, but I think we still haven't fully
envisioned how this is going to work end to end.  Also, the requirements of the
"search" capability and the "notification" service still seem quite
vague to me.  It seems to me we need to get the specifications down on
paper soon.

Best regards,
Karl

   


Hi Karl, 
    this is a VERY good use case, and thinking about it can really help
clarify how the system will or should work, even for me. It might be
worth discussing this with the go-essp list just to make sure everybody
is on the same page. I'm cc'ing Eric too because he is working on wget
scripts these days... 

That said, I think the use case is flawed, because, as it stands, it
involves partial replicas of datasets, a thing that we said we wouldn't
support. To be specific, the only way the modeling center can ship just
one file to PCMDI is if the output stream is split into 2
datasets: "requested" and "full" (or whatever), contrary to assumption
4) below. 

So, if we assume that the "full" dataset is composed of two files, and
the "requested" dataset of 1 file, the following happens: 

o The modeling center publishes the full dataset onto its data node and
to the PCMDI gateway 
o The "requested" dataset is replicated to PCMDI, and published to the
PCMDI datanode and the PCMDI gateway 
o The PCMDI gateway exposes both datasets in the search interface. The 2
datasets share all the same DRS facets (model, experiment, time
frequency,...) except perhaps a facet called "product" that has the two
possible values "Full CMIP5 output" and "Core CMIP5 output". To be
distinguishable, the two datasets must come with a name/description that
specifies their time extent, and/or their product type. We could also
harvest the overall time information and display it, if it can be
helpful. 
o So when users 1, 2, 3 below make a search, 2 results will be returned:
by inspecting the results descriptions, they will realize that all the
original data is available from the modeling center, and only a subset
of it from PCMDI. Depending on what they want, they will make their
dataset selection, click a button, and obtain a file listing which
contains all the files for that particular dataset. At this point they
can still presumably deselect any files they don't want (perhaps based
on the total size displayed) before asking for a wget script to be
generated. 
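The faceted search described above might behave roughly like this toy sketch. The catalog entries, facet names, and "product" values are invented for illustration; a real gateway catalog is much richer.

```python
def search(catalog, **facets):
    """Return every dataset whose facets match all the given constraints."""
    return [d for d in catalog
            if all(d.get(k) == v for k, v in facets.items())]

# Two datasets sharing every facet except "product" (toy example):
catalog = [
    {"model": "A", "experiment": "expt1", "product": "Full CMIP5 output",
     "host": "modeling-center", "files": ["pr.nc", "tas.nc"]},
    {"model": "A", "experiment": "expt1", "product": "Core CMIP5 output",
     "host": "PCMDI", "files": ["tas.nc"]},
]
```

A search constrained only on model and experiment returns both datasets, so the user must tell them apart by their product value and description; constraining on product narrows the results to one.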

In summary, I think the system fully supports this use case provided the
two datasets are identified as distinct at the time of publication. 

Also, let me add a few comments. This is a simple use case because there
is only one gateway serving two datanodes. In this case, the gateway
knows exactly which files are present at each data node. If the user (1,
2 or 3) was going to select BOTH datasets in the search results and ask
for the files, a single web page would be presented that contains all
the files from the two datasets. Since some of the files share the same
name, the gateway can either present two options for download, or maybe
make an authoritative decision and present one only. 

More complicated is the case where the modeling center publishes the
"full" dataset to BADC (for example), and the "requested" dataset is
replicated to the PCMDI data node and published to the PCMDI gateway. In
this case, the PCMDI gateway knows about two datasets, but only the
files of its datanode, and similarly the BADC gateway knows about two
datasets, but only the files of the modeling center. In this scenario,
it's very important that the two datasets be accurately described so
that the user can make the proper selection, after which the listing of
files is presented. If the user were to select both datasets, he would
be presented with two sets of files, and two wget scripts to download
them. Probably the worst that can happen in this case is that if the
user doesn't pay attention to the file listing, he'll download the files
twice. 
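In both scenarios the duplicate-file problem reduces to grouping the combined listings by file name, something like this sketch (the function name and URLs are invented):

```python
def merge_listings(*listings):
    """Combine file listings from several selected datasets, grouping by
    file name so duplicates can be flagged (or one copy chosen)."""
    by_name = {}
    for listing in listings:
        for url in listing:
            by_name.setdefault(url.rsplit("/", 1)[-1], []).append(url)
    return by_name
```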

I hope this helps in understanding - Bob, Eric please speak up if you
think I got any of this wrong. 

thanks, Luca 



On Mar 6, 2010, at 11:31 AM, Karl Taylor wrote: 

Hi Bob and Luca, 

I'm trying to get a feel for what to expect from a user's perspective
out of a federated ESG, assuming only what software will be in place at
the time of the first release.  Consider the following simple federated
archive, involving just two partners -- a modeling center hosting a data
node and PCMDI hosting a data node and a portal (i.e., a gateway). 

1.  Suppose the archive is tiny and comprises only two files: one file
with precipitation data for years 1-100 of a single simulation, and the
other with years 101-200 from the same simulation. 

2.  Suppose the modeling center responsible for the simulation publishes
the data (years 1-200) on its node, and then sends a copy of only the
2nd file (years 101-200) to PCMDI, which subsequently publishes it on
the PCMDI node. 

3.  The ESG portal at PCMDI knows about both nodes. 

4.  Suppose that there is no special designation associated with any of
the data (e.g., we have not defined a "requested" or "replicated"
subset). 

I presume the gateway will see 2 different datasets.  Could you please
tell me whether the gateway will be aware of all the information found
in the catalogs at both nodes, or only a subset of the information?
(And will the gateway have to retrieve this information from each node
whenever it is needed by a user, or will the gateway already have a
copy?)   In particular will the gateway be able to access (locally?): 
a) the full list of files at each node? 
b) what time period the data covers in each node? 

Could you also tell me what information/scripts each of the following
users will receive from ESG that will allow him to get the data he
wants? 
User 1: 
This user wants to download all precipitation data available in the
archive.  How will he know he should download his data from the original
node, rather than from PCMDI? 

User 2: 
This user wants to download only years 1-100 of the data.  How will he
know he should download his data from the original node, rather than
from PCMDI? 

User 3: 
This user wants to download only years 101-200 of the data.  How will he
know that he can get his data from either site? 

The answers to these questions may help guide us in setting priorities
beyond the first release. 
thanks, 
Karl 

On 26-Feb-10 5:47 AM, Luca Cinquini wrote: 

Hi Stephen, 

it's good to think of all possible scenarios...

 

It seems to me like in this case:

o) it would make more sense to change the proposed notification system to
operate on datasets, not single files

o) in any case, when the two users compare the plots for variable V1,
the first thing they should do is exchange information about which file
versions they are using - and they would find they have different
versions. If instead they'd rather exchange information about dataset
versions, they can do that too, and they would still find they are using
different versions.

 

thanks, Luca

 

On Feb 26, 2010, at 4:50 AM, <stephen.pascoe at stfc.ac.uk> wrote:





Another issue with changing the publication granularity.

 

Will users be notified about changes to files, atomic-datasets or
realm-datasets?  I think Gavin has said in the past that users will be
emailed when *files* change.  Consider the scenario:

 

 1. A realm-dataset DS1 is published at version v1.

 2. User A downloads variable V1 from DS1.

 3. User B downloads all of DS1.

 4. An error is found in variable V2 of DS1.

 5. The files for V2 are replaced and DS1 is republished as version v2.

 6. User B is notified that some files have changed in DS1.

 7. User A is *not* notified because he never downloaded the files that
changed.

 8. User A & B collaborate discussing the data from DS1 v1.  THEY HAVE
DIFFERENT FILES!

 

If this is how the system is supposed to work it's going to be very
confusing.
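The difference between the two notification granularities in the scenario above can be sketched as follows. This is a toy model for discussion, not the actual notification service:

```python
def notify_file_level(downloads, changed_files):
    """Notify only users who downloaded one of the changed files."""
    return {user for user, files in downloads.items()
            if set(files) & set(changed_files)}

def notify_dataset_level(downloads, file_to_dataset, changed_files):
    """Notify every user who downloaded anything from an affected dataset."""
    changed = {file_to_dataset[f] for f in changed_files}
    return {user for user, files in downloads.items()
            if any(file_to_dataset[f] in changed for f in files)}
```

With User A holding only the V1 file and User B all of DS1, a change to V2 notifies only B at file granularity but both users at dataset granularity.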

 

S.

 

---

Stephen Pascoe  +44 (0)1235 445980

British Atmospheric Data Centre

Rutherford Appleton Laboratory

 

 

-- 
Scanned by iCritical. 

 

_______________________________________________
GO-ESSP-TECH mailing list
GO-ESSP-TECH at ucar.edu
http://mailman.ucar.edu/mailman/listinfo/go-essp-tech

 

 
 

 


