[Go-essp-tech] Publication, versioning and notification

martin.juckes at stfc.ac.uk
Thu Mar 18 06:16:05 MDT 2010


There are three problems with making the b.c.s (boundary conditions) CMOR2 compliant:

(1) the groups supplying the data are funded to supply it to groups who are happy with the native model format, and are not funded to do the decent thing and make it available in a globally useful format (though GRIB -- which I think DKRZ will produce -- is certainly not useless as far as regional modellers are concerned),
(2) there may be variables which are not in the CMOR2 tables and hence, as I understand it, cannot be put into CMOR2 compliant format. Extending the tables is of course possible, but may not happen in the foreseeable future -- at least my impression is that Karl has enough other priorities,
(3) modelling groups may want the native format for their internal use and not have storage resources to duplicate everything in netcdf (a specific issue raised by IPSL).

Cheers,
Martin

> -----Original Message-----
> From: Bryan Lawrence [mailto:bryan.lawrence at stfc.ac.uk]
> Sent: 18 March 2010 11:32
> To: go-essp-tech at ucar.edu
> Cc: Juckes, Martin (STFC,RAL,SSTD); taylor13 at llnl.gov
> Subject: Re: [Go-essp-tech] Publication, versioning and notification
> 
> On Thursday 18 Mar 2010 10:17:38 martin.juckes at stfc.ac.uk wrote:
> 
> > The driver for this concern is the fact that the CMIP5 data request
> >  does not, despite the extensive consultation, include all output for
> >  forcing of regional models that some groups want. So, this
> >  additional output will (or is likely to be) be produced and
> >  distributed by at least DKRZ and IPSL. There are two factors which
> >  might block the preferred route of making it CMOR2 compliant: (1)
> >  justifying the data transformation effort when it is not part of the
> >  WCRP request and (2) getting the variables into tables so that
> >  creation of CMOR2 compliant data is possible.
> 
> We didn't have time to finish talking about this. I don't understand
> why
> the b.c.s wouldn't be made CMOR2 compliant: whether or not they've been
> requested for CMIP5, if they're going to be used by other groups, the
> original native format is going to be useless.
> 
> So, perhaps the issue here is the distinction between CMOR2 compliant
> and CF compliant?
> 
> I don't see this particular use case as requiring anything new. We
> already expect folk to put into their "CMIP5 output" extra data
> which has not been requested for CMIP5. What we won't do is harvest
> that stuff, but, I say, if it's to have a CMIP5 badge, it has to be
> CMOR2 compliant ...
> 
> ... but I don't speak for CMIP5, so my opinion isn't authoritative :-)
> 
> > If we end up distributing raw model output within the IS-ENES
> >  project, would you object to us calling it "CMIP5_raw"?
> 
> I think that it has the same CMIP5 experiment name in the DRS, but it
> has a different project ... if it's not CMOR2 compliant ...
> 
> ... but I don't speak for CMIP5, so my opinion isn't authoritative :-)
> 
> Happy for Karl to find another way through this :-) Perhaps CF
> compliance
> is enough ...
> 
> Cheers
> Bryan
> 
> >
> >
> >
> > Regards,
> >
> > Martin
> >
> >
> >
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
> > Sent: 15 March 2010 22:14
> > To: go-essp-tech at ucar.edu
> > Subject: Re: [Go-essp-tech] Publication, versioning and notification
> >
> >
> >
> > Dear all,
> >
> > I'll try here to summarize my understanding of how data will get
> > published and replicated as part of ESG.  If my summary is accurate,
> > there are a number of items we'll need to address soon, which I'll
> >  come back to at the end of this email.
> >
> > Consider the following simplified case:
> >
> > 1.  Model A produces precipitation and temperature data for a
> >  100-year simulation.  This will be considered the entire "output".
> > 2.  CMIP5 requests that only temperature data be archived, and this
> > temperature data then constitutes the entire "requested" model
> >  output. 3.  The ESG federation has agreed that the last 20 years of
> >  temperature data will be replicated at the archival gateways (PCMDI,
> >  BADC, DKRZ, ...).
> >
> > Thus, the "replicated" output is a subset of the "requested" output,
> > which is a subset of the "output".  Note that in what follows I
> >  assume there is no good reason to separate "requested" from
> >  "output".  The "replicated" output, however, needs to be treated
> >  someone separately because of the issues having to do with quality
> >  control, versioning and replication.
> >
> > As I understand it, one possible route by which data will appear on
> >  ESG is as follows:
> >
> > 1.  Modeling group A publishes all its "output" on an ESG node.  This
> > requires writing the output files into directories, determining which
> > data will be part of the official replicated subset and which will
> >  not, collecting files into ESG datasets, and assigning version
> >  numbers to files and to datasets.
> >
> > 1a) files are initially placed in directories following the DRS
> > specifications, without assigning a version number.  Thus, they are
> > placed directly in the <ensemble member> directory.  Here is the DRS
> > directory structure:  ...../
> > <activity>/<product>/<institute>/<model>/<experiment>/<frequency>/
> > <modeling realm>/<variable name>/<ensemble member>/
> > 1b) ESG decides whether any of the files are supposed to replace
> >  earlier versions and assigns a version number to each file.  It also
> >  moves any replaced or withdrawn files down one level into
> >  directories named by the version of the files they contain. Thus,
> >  both the latest versions of files and directories containing earlier
> >  versions of the files will appear under <ensemble member>.
> > 1c) ESG decides whether any portion of the data in each file is
> >  included in the officially called for replicated set.  [The code has
> >  not yet been written that can make this decision.]  If it finds any
> >  data in the "replicated" category, ESG creates a parallel directory
> >  (under the "requested" directory as called for by the current DRS,
> >  but perhaps this should be changed to "replicated").  ESG creates a
> >  link from the new directory to the file itself under "output" (i.e.,
> >  to the file or files containing temperature data that fall within
> >  the specified 20-year period).
> > 1d) Assuming publication at the "realm" level, the ESG publisher is
> > executed on the "output" side of the directory tree, yielding a
> >  single dataset containing all the temperature and precipitation
> >  data.  Note, that publication at the "realm" level means all
> >  variables from a single realization of a simulation will be in the
> >  same ESG dataset.  [The different members of an ensemble will appear
> >  in separate ESG datasets.]
> >
> > 1e) The ESG publisher will then act on the "links" in the
> >  "replicated" side of the directory tree, and publish a dataset
> >  containing only temperature data from 20 years (plus perhaps any
> >  additional years that might be included as part of the needed
> >  files).  Thus, the replicated data will be found either under the
> >  "replicated" dataset or the original "output" dataset (along with
> >  additional data stored there).
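Steps 1a-1c above can be sketched in a few lines of Python. This is a minimal illustration only: the facet values, file names, version directory name, and the use of symlinks are my assumptions, not the actual ESG publisher code.

```python
import os
import shutil
import tempfile

# Hypothetical DRS facet values; real values come from the file metadata.
facets = {
    "activity": "cmip5", "institute": "INST", "model": "ModelA",
    "experiment": "historical", "frequency": "mon", "realm": "atmos",
    "variable": "tas", "ensemble": "r1i1p1",
}

def drs_dir(root, product, f):
    """Build the DRS directory of 1a, down to <ensemble member> (no version)."""
    return os.path.join(root, f["activity"], product, f["institute"],
                        f["model"], f["experiment"], f["frequency"],
                        f["realm"], f["variable"], f["ensemble"])

root = tempfile.mkdtemp()
out_dir = drs_dir(root, "output", facets)
os.makedirs(out_dir)

# 1a: new files land directly in the <ensemble member> directory.
old = "tas_old.nc"
open(os.path.join(out_dir, old), "w").close()

# 1b: when a file is superseded, the replaced copy moves down one level
# into a directory named for the version it belongs to.
v1_dir = os.path.join(out_dir, "v1")
os.makedirs(v1_dir)
shutil.move(os.path.join(out_dir, old), v1_dir)

# The replacement file now sits at the top level as the latest version.
latest = "tas_new.nc"
open(os.path.join(out_dir, latest), "w").close()

# 1c: files belonging to the replicated subset are linked from a parallel
# "requested" tree back to the file itself under "output".
req_dir = drs_dir(root, "requested", facets)
os.makedirs(req_dir)
os.symlink(os.path.join(out_dir, latest), os.path.join(req_dir, latest))
```

The point of the symlink in 1c is that the replicated subset is published without duplicating any data on disk.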
> >
> > 2. Modeling group A sends all of its temperature data to PCMDI
> >  (because PCMDI plans to archive as much of the requested data as it
> >  can).
> >
> > 3.  PCMDI publishes the data it receives, following a similar
> >  procedure as in 1a-e above, but of course the realm dataset will
> >  only include temperature data.  The "replicated" dataset, on the
> >  other hand, should be identical to the one published by Modeling
> >  Group A on its own node.
> >
> > 4.  The data is sent to other archival gateways (e.g., BADC and DKRZ)
> > who might choose only to archive the "replicated" subset.  They might
> > place their files directly in the "replicated" side of the directory
> > tree (and omit the "output" side of the tree).
> >
> > Questions:
> >
> > 1.  How we can make sure that the version numbers assigned by the
> > publishers at the different nodes/gateways are the same across the
> > federation?
> >
> > 2.  Isn't there an easier way to do all of this?
> >
> > 3.  When a user looks for data, he is more likely to find what he
> >  needs by searching "output" not "replicated" (since "output" is more
> >  complete).  In fact I'm not sure the typical user will care what
> >  portion of the data has been replicated.  What are the advantages to
> >  the user of requiring that some defined subset of the data be
> >  replicated?  Is it in the DOI assignment?
> >
> > I've run out of time for now, but I think we still haven't envisioned
> >  how this is going to work end to end.  Also, the requirements of the
> >  "search" capability and the "notification" service still seem quite
> >  vague to me.  It seems to me we need to get the specifications down
> >  on paper soon.
> >
> > Best regards,
> > Karl
> >
> >
> >
> >
> > Hi Karl,
> >     this is a VERY good use case, and thinking about it can really
> >  help clarify how the system will or should work, even for me. It
> >  might be worth discussing this with the go-essp list just to make
> >  sure everybody is on the same page. I'm cc'ing Eric too because he
> >  is working on wget scripts these days...
> >
> > That said, I think the use case is flawed, because, as it stands, it
> > involves partial replicas of datasets, a thing that we said we
> >  wouldn't support. To be specific, the only way that the modeling
> >  center can ship just one file to PCMDI is if the output stream is
> >  split into 2 datasets: "requested" and "full" (or whatever),
> >  contrary to assumption 4) below.
> >
> > So, if we assume that the "full" dataset is composed of two files,
> >  and the "requested" dataset of 1 file, the following happens:
> >
> > o The modeling center publishes the full dataset onto its data node
> >  and to the PCMDI gateway
> > o The "requested" dataset is replicated to PCMDI, and published to
> >  the PCMDI datanode and the PCMDI gateway
> > o The PCMDI gateway exposes both datasets in the search interface.
> >  The 2 datasets share all the same DRS facets (model, experiment,
> >  time frequency,...) except perhaps a facet called "product" that has
> >  the two possible values "Full CMIP5 output" and "Core CMIP5 output".
> >  To be distinguishable, the two datasets must come with a
> >  name/description that specify their time extent, and/or their
> >  product type. We could also harvest the overall time information and
> >  display it, if it can be helpful.
> > o So when users 1, 2, 3 below make a search, 2 results will be
> >  returned: by inspecting the results descriptions, they will realize
> >  that all the original data is available from the modeling center,
> >  and only a subset of it from PCMDI. Depending on what they want,
> >  they will make their dataset selection, click a button, and obtain a
> >  file listing which contains all the files for that particular
> >  dataset. At this point they can still presumably deselect any files
> >  they don't want (perhaps based on the total size displayed) before
> >  asking for a wget script to be generated.
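The search behaviour Luca describes can be mocked up as a simple facet filter. The catalog records, facet names, and values below are hypothetical stand-ins for what a gateway would harvest, not the real ESG search API.

```python
# Two hypothetical dataset records sharing all DRS facets except
# "product", as in Luca's example; "time_extent" and "host" stand in
# for the harvested description that lets users tell them apart.
datasets = [
    {"id": "ModelA.historical.full", "model": "ModelA",
     "experiment": "historical", "product": "Full CMIP5 output",
     "time_extent": "years 1-200", "host": "modeling-center"},
    {"id": "ModelA.historical.core", "model": "ModelA",
     "experiment": "historical", "product": "Core CMIP5 output",
     "time_extent": "years 101-200", "host": "PCMDI"},
]

def search(catalog, **facets):
    """Return every dataset whose facets all match the requested values."""
    return [d for d in catalog
            if all(d.get(k) == v for k, v in facets.items())]

# A search on the shared facets returns both records; the user then
# distinguishes them by product and time extent in the result listing.
hits = search(datasets, model="ModelA", experiment="historical")
for d in hits:
    print(d["product"], "/", d["time_extent"], "/", d["host"])
```

This also shows why the two datasets must be published as distinct entities with informative descriptions: the facet filter alone cannot separate them except through the "product" value.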
> >
> > In summary, I think the system fully supports this use case provided
> >  the two datasets are identified as distinct at the time of
> >  publication.
> >
> > Also, let me add a few comments. This is a simple use case because
> >  there is only one gateway serving two datanodes. In this case, the
> >  gateway knows exactly which files are present at each data node. If
> >  the user (1, 2 or 3) was going to select BOTH datasets in the search
> >  results and ask for the files, a single web page would be presented
> >  that contains all the files from the two datasets. Since some of the
> >  files share the same name, the gateway can either present two
> >  options for download, or maybe make an authoritative decision and
> >  present one only.
> >
> > More complicated is the case where the modeling center publishes the
> > "full" dataset to BADC (for example), and the "requested" dataset is
> > replicated to the PCMDI data node and published to the PCMDI gateway.
> >  In this case, the PCMDI gateway knows about two datasets, but only
> >  the files of its datanode, and similarly the BADC gateway knows
> >  about two datasets, but only the files of the modeling center. In
> >  this scenario, it's very important that the two datasets be
> >  accurately described so that the user can make the proper selection,
> >  after which the listing of files is presented. If the user were to
> >  select both datasets, he would be presented with two sets of files,
> >  and two wget scripts to download them. Probably the worst that can
> >  happen in this case is that if the user doesn't pay attention to the
> >  file listing, he'll download the files twice.
> >
> > I hope this helps in understanding - Bob, Eric please speak up if you
> > think I got any of this wrong.
> >
> > thanks, Luca
> >
> >
> >
> > On Mar 6, 2010, at 11:31 AM, Karl Taylor wrote:
> >
> > Hi Bob and Luca,
> >
> > I'm trying to get a feel for what to expect from a user's perspective
> > out of a federated ESG, assuming only what software will be in place
> >  at the time of the first release.  Consider the following simple
> >  federated archive, involving just two partners -- a modeling center
> >  hosting a data node and PCMDI hosting a data node and a portal
> >  (i.e., a gateway).
> >
> > 1.  Suppose the archive is tiny and comprises only two files: one
> >  file with precipitation data for years 1-100 of a single simulation,
> >  and the other with years 101-200 from the same simulation.
> >
> > 2.  Suppose the modeling center responsible for the simulation
> >  publishes the data (years 1-200) on its node, and then sends a copy
> >  of only the 2nd file (years 101-200) to PCMDI, which subsequently
> >  publishes it on the PCMDI node.
> >
> > 3.  The ESG portal at PCMDI knows about both nodes.
> >
> > 4.  Suppose that there is no special designation associated with any
> >  of the data (e.g., we have not defined a "requested" or "replicated"
> >  subset).
> >
> > I presume the gateway will see 2 different datasets.  Could you
> >  please tell me whether the gateway will be aware of all the
> >  information found in the catalogs at both nodes, or only a subset of
> >  the information? (And will the gateway have to retrieve this
> >  information from each node whenever it is needed by a user, or will
> >  the gateway already have a copy?)   In particular will the gateway
> >  be able to access (locally?): a) the full list of files at each
> >  node?
> > b) what time period the data covers in each node?
> >
> > Could you also tell me what information/scripts each of the following
> > users will receive from ESG that will allow him to get the data he
> > wants?
> > User 1:
> > This user wants to download all precipitation data available in the
> > archive.  How will he know he should download his data from the
> >  original node, rather than from PCMDI?
> >
> > User 2:
> > This user wants to download only years 1-100 of the data.  How will
> >  he know he should download his data from the original node, rather
> >  than from PCMDI?
> >
> > User 3:
> > This user wants to download only years 101-200 of the data.  How will
> >  he know that he can get his data from either site?
> >
> > The answers to these questions may help guide us in setting
> >  priorities beyond the first release.
> > thanks,
> > Karl
> >
> > On 26-Feb-10 5:47 AM, Luca Cinquini wrote:
> >
> > Hi Stephen,
> >
> > it's good to think of all possible scenarios...
> >
> >
> >
> > It seems to me like in this case:
> >
> > o) it would make more sense to change the proposed notification system
> >  to operate on datasets, not single files
> >
> > o) in any case, when the two users compare the plots for variable V1,
> > the first thing they should do is exchange information about which
> >  file versions they are using - and they would find they have
> >  different versions. If instead they'd rather exchange information
> >  about dataset versions, they can do that too, and they would still
> >  find they are using different versions.
> >
> >
> >
> > thanks, Luca
> >
> >
> >
> > On Feb 26, 2010, at 4:50 AM, <stephen.pascoe at stfc.ac.uk>
> > <stephen.pascoe at stfc.ac.uk> wrote:
> >
> >
> >
> >
> >
> > Another issue with changing the publication granularity.
> >
> >
> >
> > Will users be notified about changes to files, atomic-datasets or
> > realm-datasets?  I think Gavin has said in the past that users will
> >  be emailed when *files* change.  Consider the scenario:
> >
> >
> >
> >  1. A realm-dataset DS1 is published at version v1.
> >
> >  2. User A downloads variable V1 from DS1.
> >
> >  3. User B downloads all of DS1.
> >
> >  4. An error is found in variable V2 of DS1.
> >
> >  5. The files for V2 are replaced and DS1 is republished as version
> >  v2.
> >
> >  6. User B is notified that some files have changed in DS1.
> >
> >  7. User A is *not* notified because he never downloaded the files
> >  that changed.
> >
> >  8. User A & B collaborate discussing the data from DS1 v1.  THEY
> >  HAVE DIFFERENT FILES!
> >
> >
> >
> > If this is how the system is supposed to work, it's going to be very
> > confusing.
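The gap between the two notification policies in Stephen's scenario can be sketched as follows. The user names, file names, and data structures are made up for illustration; only the logic of steps 2-8 comes from the scenario itself.

```python
# Step 2: user A downloads only variable V1; step 3: user B takes all of DS1.
downloads = {
    "userA": {("DS1", "V1_file.nc")},
    "userB": {("DS1", "V1_file.nc"), ("DS1", "V2_file.nc")},
}

# Step 5: the files for V2 are replaced when DS1 goes to version v2.
changed_files = {("DS1", "V2_file.nc")}
changed_datasets = {"DS1"}

def notify_by_file(downloads, changed):
    """Steps 6-7: only users holding a changed file are notified."""
    return {u for u, files in downloads.items() if files & changed}

def notify_by_dataset(downloads, changed):
    """Luca's alternative: anyone who downloaded from a changed dataset
    is notified, regardless of which files they took."""
    return {u for u, files in downloads.items()
            if {ds for ds, _ in files} & changed}

print(notify_by_file(downloads, changed_files))        # user B only
print(notify_by_dataset(downloads, changed_datasets))  # both users
```

File-level notification leaves user A unaware that DS1 has moved to v2 (step 8), while dataset-level notification reaches both collaborators.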
> >
> >
> >
> > S.
> >
> >
> >
> > ---
> >
> > Stephen Pascoe  +44 (0)1235 445980
> >
> > British Atmospheric Data Centre
> >
> > Rutherford Appleton Laboratory
> >
> 
> --
> Bryan Lawrence
> Director of Environmental Archival and Associated Research
> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
> STFC, Rutherford Appleton Laboratory
> Phone +44 1235 445012; Fax ... 5848;
> Web: home.badc.rl.ac.uk/lawrence

