[Go-essp-tech] Publication, versioning and notification
Karl Taylor
taylor13 at llnl.gov
Mon Mar 15 16:14:20 MDT 2010
Dear all,
I'll try here to summarize my understanding of how data will get
published and replicated as part of ESG. If my summary is accurate,
there are a number of items we'll need to address soon, which I'll come
back to at the end of this email.
Consider the following simplified case:
1. Model A produces precipitation and temperature data for a 100-year
simulation. This will be considered the entire "output".
2. CMIP5 requests that only temperature data be archived, and this
temperature data then constitutes the entire "requested" model output.
3. The ESG federation has agreed that the last 20 years of temperature
data will be replicated at the archival gateways (PCMDI, BADC, DKRZ, ...).
Thus, the "replicated" output is a subset of the "requested" output,
which is a subset of the "output". Note that in what follows I assume
there is no good reason to separate "requested" from "output". The
"replicated" output, however, needs to be treated somewhat separately
because of the issues having to do with quality control, versioning and
replication.
As I understand it, one possible route by which data will appear on ESG
is as follows:
1. Modeling group A publishes all its "output" on an ESG node. This
requires writing the output files into directories, determining which
data will be part of the official replicated subset and which will not,
collecting files into ESG datasets, and assigning version numbers to
files and to datasets.
1a) files are initially placed in directories following the DRS
specifications, without assigning a version number. Thus, they are
placed directly in the <ensemble member> directory. Here is the DRS
directory structure:
...../<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable name>/<ensemble member>/
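As a concrete illustration, a DRS path could be assembled from facet values like this (the facet values below are hypothetical placeholders for this example, not an official DRS encoding):

```python
# Hypothetical DRS facet values for the temperature data in this example
facets = {
    "activity": "cmip5",
    "product": "output",
    "institute": "INST-A",
    "model": "ModelA",
    "experiment": "historical",
    "frequency": "mon",
    "modeling_realm": "atmos",
    "variable": "tas",
    "ensemble_member": "r1i1p1",
}

# The order mirrors the directory structure quoted above
order = ["activity", "product", "institute", "model", "experiment",
         "frequency", "modeling_realm", "variable", "ensemble_member"]

drs_path = "/".join(facets[f] for f in order)
print(drs_path)  # cmip5/output/INST-A/ModelA/historical/mon/atmos/tas/r1i1p1
```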
1b) ESG decides whether any of the files are supposed to replace earlier
versions and assigns a version number to each file. It also moves any
replaced or withdrawn files down one level into directories named by the
version of the files they contain. Thus, both the latest versions of
files and directories containing earlier versions of the files will
appear under </ensemble member/>.
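Step 1b might be sketched roughly as follows, assuming integer version numbers and superseded files moving into v<N> subdirectories (a hypothetical helper; the actual ESG publisher logic may differ):

```python
import os
import shutil
import tempfile

def supersede(ensemble_dir, filename, new_version):
    """Move the current copy of `filename` down one level into a
    version-named subdirectory (e.g. v1/) before the replacement file
    is dropped into place. Hypothetical helper, not the real publisher."""
    current = os.path.join(ensemble_dir, filename)
    old_dir = os.path.join(ensemble_dir, "v%d" % (new_version - 1))
    os.makedirs(old_dir, exist_ok=True)
    shutil.move(current, os.path.join(old_dir, filename))

# Demo: a temperature file is published, then superseded by version 2
work = tempfile.mkdtemp()
with open(os.path.join(work, "tas.nc"), "w") as f:
    f.write("original data")
supersede(work, "tas.nc", 2)
moved = os.path.exists(os.path.join(work, "v1", "tas.nc"))
print(moved)  # True
```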
1c) ESG decides whether any portion of the data in each file is included
in the officially called for replicated set. [The code has not yet been
written that can make this decision.] If it finds any data in the
"replicated" category, ESG creates a parallel directory (under the
"requested" directory as called for by the current DRS, but perhaps this
should be changed to "replicated"). ESG creates a link from the new
directory to the file itself under "output" (i.e., to the file or files
containing temperature data that falls within the specified 20-year period).
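Step 1c could amount to creating symbolic links from the "replicated" tree back into the "output" tree. The subset test below (`is_replicated`) is a stand-in for the yet-unwritten selection code, and this sketch assumes whole files can be classified:

```python
import os
import tempfile

def link_replicated(output_dir, replicated_dir, is_replicated):
    """Link each file classified as part of the replicated subset into
    the parallel `replicated` directory tree. `is_replicated` stands in
    for the decision code that has not yet been written."""
    os.makedirs(replicated_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(output_dir)):
        if is_replicated(name):
            src = os.path.abspath(os.path.join(output_dir, name))
            os.symlink(src, os.path.join(replicated_dir, name))
            linked.append(name)
    return linked

# Demo: only the temperature file falls in the replicated subset
root = tempfile.mkdtemp()
out_dir = os.path.join(root, "output")
rep_dir = os.path.join(root, "replicated")
os.makedirs(out_dir)
for name in ("pr_0001-0100.nc", "tas_0081-0100.nc"):
    open(os.path.join(out_dir, name), "w").close()
linked = link_replicated(out_dir, rep_dir, lambda n: n.startswith("tas"))
print(linked)  # ['tas_0081-0100.nc']
```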
1d) Assuming publication at the "realm" level, the ESG publisher is
executed on the "output" side of the directory tree, yielding a single
dataset containing all the temperature and precipitation data. Note,
that publication at the "realm" level means all variables from a single
realization of a simulation will be in the same ESG dataset. [The
different members of an ensemble will appear in separate ESG datasets.]
1e) The ESG publisher will then act on the "links" in the "replicated"
side of the directory tree, and publish a dataset containing only
temperature data from 20 years (plus perhaps any additional years that
might be included as part of the needed files). Thus, the replicated
data will be found either under the "replicated" dataset or the original
"output" dataset (along with additional data stored there).
2. Modeling group A sends all of its temperature data to PCMDI (because
PCMDI plans to archive as much of the requested data as it can).
3. PCMDI publishes the data it receives, following a similar procedure
as in 1a-e above, but of course the realm dataset will only include
temperature data. The "replicated" dataset, on the other hand should be
identical to the one published by Modeling Group A on its own node.
4. The data is sent to other archival gateways (e.g., BADC and DKRZ)
who might choose only to archive the "replicated" subset. They might
place their files directly in the "replicated" side of the directory
tree (and omit the "output" side of the tree).
Questions:
1. How can we make sure that the version numbers assigned by the
publishers at the different nodes/gateways are the same across the
federation?
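One conceivable answer to question 1 (purely a suggestion, not anything the federation has agreed on) is to derive version identifiers from file content, so every node computes the same identifier independently:

```python
import hashlib
import os
import tempfile

def content_version(paths):
    """Derive a short version identifier from the bytes of a dataset's
    files, so nodes publishing identical files agree on the version
    without coordination. Hypothetical scheme, not an ESG convention."""
    h = hashlib.sha256()
    for p in sorted(paths, key=os.path.basename):
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
    return h.hexdigest()[:12]

# Demo: the same bytes on two nodes yield the same version identifier
node_a, node_b = tempfile.mkdtemp(), tempfile.mkdtemp()
for node in (node_a, node_b):
    with open(os.path.join(node, "tas.nc"), "wb") as f:
        f.write(b"identical replica bytes")
v_a = content_version([os.path.join(node_a, "tas.nc")])
v_b = content_version([os.path.join(node_b, "tas.nc")])
print(v_a == v_b)  # True
```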
2. Isn't there an easier way to do all of this?
3. When a user looks for data, he is more likely to find what he needs
by searching "output" not "replicated" (since "output" is more
complete). In fact, I'm not sure the typical user will care what portion
of the data has been replicated. What are the advantages to the user of
requiring that some defined subset of the data be replicated? Is it in
the DOI assignment?
I've run out of time for now, but I think we still have not envisioned how
this is going to work end to end. Also, the requirements of the
"search" capability and the "notification" service still seem quite
vague to me. It seems to me we need to get the specifications down on
paper soon.
Best regards,
Karl
Hi Karl,
this is a VERY good use case, and thinking about it can really help
clarify how the system will or should work, even for me. It might be
worth discussing this with the go-essp list just to make sure everybody
is on the same page. I'm cc'ing Eric too because he is working on wget
scripts these days...
That said, I think the use case is flawed, because, as it stands, it
involves partial replicas of datasets, a thing that we said we wouldn't
support. To be specific, the only way the modeling center can ship just
one file to PCMDI is if the output stream is split into 2
datasets: "requested" and "full" (or whatever), contrary to assumption
4) below.
So, if we assume that the "full" dataset is composed of two files, and
the "requested" dataset of 1 file, the following happens:
o The modeling center publishes the full dataset onto its data node and
to the PCMDI gateway
o The "requested" dataset is replicated to PCMDI, and published to the
PCMDI datanode and the PCMDI gateway
o The PCMDI gateway exposes both datasets in the search interface. The 2
datasets share all the same DRS facets (model, experiment, time
frequency,...) except perhaps a facet called "product" that has the two
possible values "Full CMIP5 output" and "Core CMIP5 output". To be
distinguishable, the two datasets must come with a name/description that
specify their time extent, and/or their product type. We could also
harvest the overall time information and display it, if it can be helpful.
o So when users 1, 2, 3 below make a search, 2 results will be returned:
by inspecting the results descriptions, they will realize that all the
original data is available from the modeling center, and only a subset
of it from PCMDI. Depending on what they want, they will make their
dataset selection, click a button, and obtain a file listing which
contains all the files for that particular dataset. At this point they
can still presumably deselect any files they don't want (perhaps based
on the total size displayed) before asking for a wget script to be
generated.
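In outline, the wget-script step might look like this (the URL layout and script format here are illustrative only, not what the gateway actually emits):

```python
def make_wget_script(file_urls):
    """Render a minimal shell script that fetches each selected file.
    Illustrative only; gateway-generated scripts may look different."""
    lines = ["#!/bin/sh"]
    for url in file_urls:
        lines.append("wget -c '%s'" % url)
    return "\n".join(lines) + "\n"

# Hypothetical URLs for the two files in Karl's example
script = make_wget_script([
    "http://modelnode.example.org/data/pr_0001-0100.nc",
    "http://pcmdi.example.org/data/pr_0101-0200.nc",
])
print(script)
```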
In summary, I think the system fully supports this use case provided the
two datasets are identified as distinct at the time of publication.
Also, let me add a few comments. This is a simple use case because there
is only one gateway serving two datanodes. In this case, the gateway
knows exactly which files are present at each data node. If the user (1,
2 or 3) was going to select BOTH datasets in the search results and ask
for the files, a single web page would be presented that contains all
the files from the two datasets. Since some of the files share the same
name, the gateway can either present two options for download, or maybe
make an authoritative decision and present one only.
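The "authoritative decision" could amount to deduplicating on a checksum, keeping one download entry per distinct file. This sketch assumes the catalogs track a checksum per file, which may not hold in practice:

```python
def merge_listings(listings):
    """Merge per-dataset file listings into one download list,
    collapsing entries whose (name, checksum) pair matches. Checksum
    availability in the catalogs is an assumption of this sketch."""
    seen = {}
    for listing in listings:
        for name, checksum, url in listing:
            seen.setdefault((name, checksum), url)
    return sorted((n, c, u) for (n, c), u in seen.items())

# Demo: the "full" and "requested" datasets share one identical file
full_listing = [
    ("pr_0001-0100.nc", "c1", "http://modelnode.example.org/pr_0001-0100.nc"),
    ("pr_0101-0200.nc", "c2", "http://modelnode.example.org/pr_0101-0200.nc"),
]
requested_listing = [
    ("pr_0101-0200.nc", "c2", "http://pcmdi.example.org/pr_0101-0200.nc"),
]
merged = merge_listings([full_listing, requested_listing])
print(len(merged))  # 2
```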
More complicated is the case where the modeling center publishes the
"full" dataset to BADC (for example), and the "requested" dataset is
replicated to the PCMDI data node and published to the PCMDI gateway. In
this case, the PCMDI gateway knows about two datasets, but only the
files of its datanode, and similarly the BADC gateway knows about two
datasets, but only the files of the modeling center. In this scenario,
it's very important that the two datasets be accurately described so
that the user can make the proper selection, after which the listing of
files is presented. If the user were to select both datasets, he would
be presented with two sets of files, and two wget scripts to download
them. Probably the worst that can happen in this case is that if the
user doesn't pay attention to the file listing, he'll download the files
twice.
I hope this helps in understanding - Bob, Eric please speak up if you
think I got any of this wrong.
thanks, Luca
On Mar 6, 2010, at 11:31 AM, Karl Taylor wrote:
Hi Bob and Luca,
I'm trying to get a feel for what to expect from a user's perspective
out of a federated ESG, assuming only what software will be in place at
the time of the first release. Consider the following simple federated
archive, involving just two partners -- a modeling center hosting a data
node and PCMDI hosting a data node and a portal (i.e., a gateway).
1. Suppose the archive is tiny and comprises only two files: one file
with precipitation data for years 1-100 of a single simulation, and the
other years 101-200 from the same simulation.
2. Suppose the modeling center responsible for the simulation publishes
the data (years 1-200) on its node, and then sends a copy of only the
2nd file (years 101-200) to PCMDI, which subsequently publishes it on
the PCMDI node.
3. The ESG portal at PCMDI knows about both nodes.
4. Suppose that there is no special designation associated with any of
the data (e.g., we have not defined a "requested" or "replicated" subset).
I presume the gateway will see 2 different datasets. Could you please
tell me whether the gateway will be aware of all the information found
in the catalogs at both nodes, or only a subset of the information?
(And will the gateway have to retrieve this information from each node
whenever it is needed by a user, or will the gateway already have a
copy?) In particular will the gateway be able to access (locally?):
a) the full list of files at each node?
b) what time period the data covers in each node?
Could you also tell me what information/scripts each of the following
users will receive from ESG that will allow him to get the data he wants?
User 1:
This user wants to download all precipitation data available in the
archive. How will he know he should download his data from the original
node, rather than from PCMDI?
User 2:
This user wants to download only years 1-100 of the data. How will he
know he should download his data from the original node, rather than
from PCMDI?
User 3:
This user wants to download only years 101-200 of the data. How will he
know that he can get his data from either site?
The answers to these questions may help guide us in setting priorities
beyond the first release.
thanks,
Karl
On 26-Feb-10 5:47 AM, Luca Cinquini wrote:
> Hi Stephen,
> it's good to think of all possible scenarios...
>
> It seems to me like in this case:
> o) it would make more sense to change the proposed notification system
> to operate on datasets, not single files
> o) in any case, when the two users compare the plots for variable V1,
> the first thing they should do is exchange information about which
> file versions they are using - and they would find they have different
> versions. If instead they'd rather exchange information about dataset
> versions, they can do that too, and they would still find they are
> using different versions.
>
> thanks, Luca
>
> On Feb 26, 2010, at 4:50 AM, <stephen.pascoe at stfc.ac.uk> wrote:
>
>> Another issue with changing the publication granularity.
>> Will users be notified about changes to files, atomic-datasets or
>> realm-datasets? I think Gavin has said in the past that users will
>> be emailed when *files* change. Consider the scenario:
>> 1. A realm-dataset DS1 is published at version v1.
>> 2. User A downloads variable V1 from DS1.
>> 3. User B downloads all of DS1.
>> 4. An error is found in variable V2 of DS1.
>> 5. The files for V2 are replaced and DS1 is republished as version v2.
>> 6. User B is notified that some files have changed in DS1.
>> 7. User A is *not* notified because he never downloaded the files
>> that changed.
>> 8. User A & B collaborate discussing the data from DS1 v1. THEY
>> HAVE DIFFERENT FILES!
>> If this is how the system is supposed to work it's going to be very
>> confusing.
>> S.
>> ---
>> Stephen Pascoe +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>
>>
>>
>> _______________________________________________
>> GO-ESSP-TECH mailing list
>> GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu>
>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>
>