[Go-essp-tech] Publication, versioning and notification
Karl Taylor
taylor13 at llnl.gov
Mon Mar 15 16:14:20 MDT 2010
Dear all,
I'll try here to summarize my understanding of how data will get
published and replicated as part of ESG. If my summary is accurate,
there are a number of items we'll need to address soon, which I'll come
back to at the end of this email.
Consider the following simplified case:
1. Model A produces precipitation and temperature data for a 100-year
simulation. This will be considered the entire "output".
2. CMIP5 requests that only temperature data be archived, and this
temperature data then constitutes the entire "requested" model output.
3. The ESG federation has agreed that the last 20 years of temperature
data will be replicated at the archival gateways (PCMDI, BADC, DKRZ, ...).
Thus, the "replicated" output is a subset of the "requested" output,
which is a subset of the "output". Note that in what follows I assume
there is no good reason to separate "requested" from "output". The
"replicated" output, however, needs to be treated somewhat separately
because of the issues having to do with quality control, versioning and
replication.
As I understand it, one possible route by which data will appear on ESG
is as follows:
1. Modeling group A publishes all its "output" on an ESG node. This
requires writing the output files into directories, determining which
data will be part of the official replicated subset and which will not,
collecting files into ESG datasets, and assigning version numbers to
files and to datasets.
1a) files are initially placed in directories following the DRS
specifications, without assigning a version number. Thus, they are
placed directly in the <ensemble member> directory. Here is the DRS
directory structure:
...../<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable name>/<ensemble member>/
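As a concrete illustration, a DRS path could be assembled from facet values like this (the facet values below are hypothetical placeholders for this example, not an official DRS encoding):

```python
# Hypothetical DRS facet values for the temperature data in this example
facets = {
    "activity": "cmip5",
    "product": "output",
    "institute": "INST-A",
    "model": "ModelA",
    "experiment": "historical",
    "frequency": "mon",
    "modeling_realm": "atmos",
    "variable": "tas",
    "ensemble_member": "r1i1p1",
}

# The order mirrors the directory structure quoted above
order = ["activity", "product", "institute", "model", "experiment",
         "frequency", "modeling_realm", "variable", "ensemble_member"]

drs_path = "/".join(facets[f] for f in order)
print(drs_path)  # cmip5/output/INST-A/ModelA/historical/mon/atmos/tas/r1i1p1
```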
1b) ESG decides whether any of the files are supposed to replace earlier
versions and assigns a version number to each file. It also moves any
replaced or withdrawn files down one level into directories named by the
version of the files they contain. Thus, both the latest versions of
files and directories containing earlier versions of the files will
appear under </ensemble member/>.
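Step 1b might be sketched roughly as follows, assuming integer version numbers and superseded files moving into v<N> subdirectories (a hypothetical helper; the actual ESG publisher logic may differ):

```python
import os
import shutil
import tempfile

def supersede(ensemble_dir, filename, new_version):
    """Move the current copy of `filename` down one level into a
    version-named subdirectory (e.g. v1/) before the replacement file
    is dropped into place. Hypothetical helper, not the real publisher."""
    current = os.path.join(ensemble_dir, filename)
    old_dir = os.path.join(ensemble_dir, "v%d" % (new_version - 1))
    os.makedirs(old_dir, exist_ok=True)
    shutil.move(current, os.path.join(old_dir, filename))

# Demo: a temperature file is published, then superseded by version 2
work = tempfile.mkdtemp()
with open(os.path.join(work, "tas.nc"), "w") as f:
    f.write("original data")
supersede(work, "tas.nc", 2)
moved = os.path.exists(os.path.join(work, "v1", "tas.nc"))
print(moved)  # True
```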
1c) ESG decides whether any portion of the data in each file is included
in the officially called for replicated set. [The code has not yet been
written that can make this decision.] If it finds any data in the
"replicated" category, ESG creates a parallel directory (under the
"requested" directory as called for by the current DRS, but perhaps this
should be changed to "replicated"). ESG creates a link from the new
directory to the file itself under "output" (i.e., to the file or files
containing temperature data that falls within the specified 20-year period).
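Step 1c could amount to creating symbolic links from the "replicated" tree back into the "output" tree. The subset test below (`is_replicated`) is a stand-in for the yet-unwritten selection code, and this sketch assumes whole files can be classified:

```python
import os
import tempfile

def link_replicated(output_dir, replicated_dir, is_replicated):
    """Link each file classified as part of the replicated subset into
    the parallel `replicated` directory tree. `is_replicated` stands in
    for the decision code that has not yet been written."""
    os.makedirs(replicated_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(output_dir)):
        if is_replicated(name):
            src = os.path.abspath(os.path.join(output_dir, name))
            os.symlink(src, os.path.join(replicated_dir, name))
            linked.append(name)
    return linked

# Demo: only the temperature file falls in the replicated subset
root = tempfile.mkdtemp()
out_dir = os.path.join(root, "output")
rep_dir = os.path.join(root, "replicated")
os.makedirs(out_dir)
for name in ("pr_0001-0100.nc", "tas_0081-0100.nc"):
    open(os.path.join(out_dir, name), "w").close()
linked = link_replicated(out_dir, rep_dir, lambda n: n.startswith("tas"))
print(linked)  # ['tas_0081-0100.nc']
```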
1d) Assuming publication at the "realm" level, the ESG publisher is
executed on the "output" side of the directory tree, yielding a single
dataset containing all the temperature and precipitation data. Note,
that publication at the "realm" level means all variables from a single
realization of a simulation will be in the same ESG dataset. [The
different members of an ensemble will appear in separate ESG datasets.]
1e) The ESG publisher will then act on the "links" in the "replicated"
side of the directory tree, and publish a dataset containing only
temperature data from 20 years (plus perhaps any additional years that
might be included as part of the needed files). Thus, the replicated
data will be found either under the "replicated" dataset or the original
"output" dataset (along with additional data stored there).
2. Modeling group A sends all of its temperature data to PCMDI (because
PCMDI plans to archive as much of the requested data as it can).
3. PCMDI publishes the data it receives, following a similar procedure
as in 1a-e above, but of course the realm dataset will only include
temperature data. The "replicated" dataset, on the other hand should be
identical to the one published by Modeling Group A on its own node.
4. The data is sent to other archival gateways (e.g., BADC and DKRZ)
who might choose only to archive the "replicated" subset. They might
place their files directly in the "replicated" side of the directory
tree (and omit the "output" side of the tree).
Questions:
1. How can we make sure that the version numbers assigned by the
publishers at the different nodes/gateways are the same across the
federation?
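One conceivable answer to question 1 (purely a suggestion, not anything the federation has agreed on) is to derive version identifiers from file content, so every node computes the same identifier independently:

```python
import hashlib
import os
import tempfile

def content_version(paths):
    """Derive a short version identifier from the bytes of a dataset's
    files, so nodes publishing identical files agree on the version
    without coordination. Hypothetical scheme, not an ESG convention."""
    h = hashlib.sha256()
    for p in sorted(paths, key=os.path.basename):
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
    return h.hexdigest()[:12]

# Demo: the same bytes on two nodes yield the same version identifier
node_a, node_b = tempfile.mkdtemp(), tempfile.mkdtemp()
for node in (node_a, node_b):
    with open(os.path.join(node, "tas.nc"), "wb") as f:
        f.write(b"identical replica bytes")
v_a = content_version([os.path.join(node_a, "tas.nc")])
v_b = content_version([os.path.join(node_b, "tas.nc")])
print(v_a == v_b)  # True
```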
2. Isn't there an easier way to do all of this?
3. When a user looks for data, he is more likely to find what he needs
by searching "output" not "replicated" (since "output" is more
complete). In fact, I'm not sure the typical user will care what portion
of the data has been replicated. What are the advantages to the user of
requiring that some defined subset of the data be replicated? Is it in
the DOI assignment?
I've run out of time for now, but I think we still have not envisioned how
this is going to work end to end. Also, the requirements of the
"search" capability and the "notification" service still seem quite
vague to me. It seems to me we need to get the specifications down on
paper soon.
Best regards,
Karl
Hi Karl,
this is a VERY good use case, and thinking about it can really help
clarify how the system will or should work, even for me. It might be
worth discussing this with the go-essp list just to make sure everybody
is on the same page. I'm cc'ing Eric too because he is working on wget
scripts these days...
That said, I think the use case is flawed, because, as it stands, it
involves partial replicas of datasets, a thing that we said we wouldn't
support. To be specific, the only way the modeling center can ship just
one file to PCMDI is if the output stream is split into 2
datasets: "requested" and "full" (or whatever), contrary to assumption
4) below.
So, if we assume that the "full" dataset is composed of two files, and
the "requested" dataset of 1 file, the following happens:
o The modeling center publishes the full dataset onto its data node and
to the PCMDI gateway
o The "requested" dataset is replicated to PCMDI, and published to the
PCMDI datanode and the PCMDI gateway
o The PCMDI gateway exposes both datasets in the search interface. The 2
datasets share all the same DRS facets (model, experiment, time
frequency,...) except perhaps a facet called "product" that has the two
possible values "Full CMIP5 output" and "Core CMIP5 output". To be
distinguishable, the two datasets must come with a name/description that
specify their time extent, and/or their product type. We could also
harvest the overall time information and display it, if it can be helpful.
o So when users 1, 2, 3 below make a search, 2 results will be returned:
by inspecting the results descriptions, they will realize that all the
original data is available from the modeling center, and only a subset
of it from PCMDI. Depending on what they want, they will make their
dataset selection, click a button, and obtain a file listing which
contains all the files for that particular dataset. At this point they
can still presumably deselect any files they don't want (perhaps based
on the total size displayed) before asking for a wget script to be
generated.
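In outline, the wget-script step might look like this (the URL layout and script format here are illustrative only, not what the gateway actually emits):

```python
def make_wget_script(file_urls):
    """Render a minimal shell script that fetches each selected file.
    Illustrative only; gateway-generated scripts may look different."""
    lines = ["#!/bin/sh"]
    for url in file_urls:
        lines.append("wget -c '%s'" % url)
    return "\n".join(lines) + "\n"

# Hypothetical URLs for the two files in Karl's example
script = make_wget_script([
    "http://modelnode.example.org/data/pr_0001-0100.nc",
    "http://pcmdi.example.org/data/pr_0101-0200.nc",
])
print(script)
```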
In summary, I think the system fully supports this use case provided the
two datasets are identified as distinct at the time of publication.
Also, let me add a few comments. This is a simple use case because there
is only one gateway serving two datanodes. In this case, the gateway
knows exactly which files are present at each data node. If the user (1,
2 or 3) was going to select BOTH datasets in the search results and ask
for the files, a single web page would be presented that contains all
the files from the two datasets. Since some of the files share the same
name, the gateway can either present two options for download, or maybe
make an authoritative decision and present one only.
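The "authoritative decision" could amount to deduplicating on a checksum, keeping one download entry per distinct file. This sketch assumes the catalogs track a checksum per file, which may not hold in practice:

```python
def merge_listings(listings):
    """Merge per-dataset file listings into one download list,
    collapsing entries whose (name, checksum) pair matches. Checksum
    availability in the catalogs is an assumption of this sketch."""
    seen = {}
    for listing in listings:
        for name, checksum, url in listing:
            seen.setdefault((name, checksum), url)
    return sorted((n, c, u) for (n, c), u in seen.items())

# Demo: the "full" and "requested" datasets share one identical file
full_listing = [
    ("pr_0001-0100.nc", "c1", "http://modelnode.example.org/pr_0001-0100.nc"),
    ("pr_0101-0200.nc", "c2", "http://modelnode.example.org/pr_0101-0200.nc"),
]
requested_listing = [
    ("pr_0101-0200.nc", "c2", "http://pcmdi.example.org/pr_0101-0200.nc"),
]
merged = merge_listings([full_listing, requested_listing])
print(len(merged))  # 2
```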
More complicated is the case where the modeling center publishes the
"full" dataset to BADC (for example), and the "requested" dataset is
replicated to the PCMDI data node and published to the PCMDI gateway. In
this case, the PCMDI gateway knows about two datasets, but only the
files of its datanode, and similarly the BADC gateway knows about two
datasets, but only the files of the modeling center. In this scenario,
it's very important that the two datasets be accurately described so
that the user can make the proper selection, after which the listing of
files is presented. If the user were to select both datasets, he would
be presented with two sets of files, and two wget scripts to download
them. Probably the worst that can happen in this case is that if the
user doesn't pay attention to the file listing, he'll download the files
twice.
I hope this helps in understanding - Bob, Eric please speak up if you
think I got any of this wrong.
thanks, Luca
On Mar 6, 2010, at 11:31 AM, Karl Taylor wrote:
Hi Bob and Luca,
I'm trying to get a feel for what to expect from a user's perspective
out of a federated ESG, assuming only what software will be in place at
the time of the first release. Consider the following simple federated
archive, involving just two partners -- a modeling center hosting a data
node and PCMDI hosting a data node and a portal (i.e., a gateway).
1. Suppose the archive is tiny and comprises only two files: one file
with precipitation data for years 1-100 of a single simulation, and the
other years 101-200 from the same simulation.
2. Suppose the modeling center responsible for the simulation publishes
the data (years 1-200) on its node, and then sends a copy of only the
2nd file (years 101-200) to PCMDI, which subsequently publishes it on
the PCMDI node.
3. The ESG portal at PCMDI knows about both nodes.
4. Suppose that there is no special designation associated with any of
the data (e.g., we have not defined a "requested" or "replicated" subset).
I presume the gateway will see 2 different datasets. Could you please
tell me whether the gateway will be aware of all the information found
in the catalogs at both nodes, or only a subset of the information?
(And will the gateway have to retrieve this information from each node
whenever it is needed by a user, or will the gateway already have a
copy?) In particular will the gateway be able to access (locally?):
a) the full list of files at each node?
b) what time period the data covers in each node?
Could you also tell me what information/scripts each of the following
users will receive from ESG that will allow him to get the data he wants?
User 1:
This user wants to download all precipitation data available in the
archive. How will he know he should download his data from the original
node, rather than from PCMDI?
User 2:
This user wants to download only years 1-100 of the data. How will he
know he should download his data from the original node, rather than
from PCMDI?
User 3:
This user wants to download only years 101-200 of the data. How will he
know that he can get his data from either site?
The answers to these questions may help guide us in setting priorities
beyond the first release.
thanks,
Karl
On 26-Feb-10 5:47 AM, Luca Cinquini wrote:
> Hi Stephen,
> it's good to think of all possible scenarios...
>
> It seems to me like in this case:
> o) it would make more sense to change the proposed notification system
> to operate on datasets, not single files
> o) in any case, when the two users compare the plots for variable V1,
> the first thing they should do is exchange information about which
> file versions they are using - and they would find they have different
> versions. If instead they'd rather exchange information about dataset
> versions, they can do that too, and they would still find they are
> using different versions.
>
> thanks, Luca
>
> On Feb 26, 2010, at 4:50 AM, <stephen.pascoe at stfc.ac.uk> wrote:
>
>> Another issue with changing the publication granularity.
>> Will users be notified about changes to files, atomic-datasets or
>> realm-datasets? I think Gavin has said in the past that users will
>> be emailed when *files* change. Consider the scenario:
>> 1. A realm-dataset DS1 is published at version v1.
>> 2. User A downloads variable V1 from DS1.
>> 3. User B downloads all of DS1.
>> 4. An error is found in variable V2 of DS1.
>> 5. The files for V2 are replaced and DS1 is republished as version v2.
>> 6. User B is notified that some files have changed in DS1.
>> 7. User A is *not* notified because he never downloaded the files
>> that changed.
>> 8. User A & B collaborate discussing the data from DS1 v1. THEY
>> HAVE DIFFERENT FILES!
>> If this is how the system is supposed to work it's going to be very
>> confusing.
>> S.
>> ---
>> Stephen Pascoe +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>
>>
>>
>> _______________________________________________
>> GO-ESSP-TECH mailing list
>> GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu>
>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>
>