[Go-essp-tech] DRS corrections and extensions

Thu Jun 14 15:16:08 MDT 2012

So, two questions.

Should I go ahead with using product=modified for our fudged RH data?

If we do our own regridding or decimation of hi-res or curvilinear
coordinate data, should I also put that in product=modified with no
change in filename? (of course the DRS path will change; and we'll be
explaining in the file what was done, as precisely as netCDF
attributes allow).

That's what I'd like to do.

Thanks,

On Thu, Jun 14, 2012 at 2:52 PM, Karl Taylor <taylor13 at llnl.gov> wrote:
> I agree that we shouldn't try to describe everything in the filename.  I was
> hoping to provide a standard way of naming files generated "on-the-fly" on
> the server data, when a user requests subsets of CMIP5 files (e.g., for a
> region) or averages (e.g., climatologies).  These operations reduce the
> amount of data that needs to be downloaded, and are therefore likely to be
> popular.  Regridding and other processing do not buy you as much in data
> reduction, so I don't expect we'll try to standardize the way that's done in
> file names.
>
> Karl
>
>
> On 6/14/12 11:02 AM, V Balaji wrote:
>
> I've got a bit lost in this discussion, and admit I'm a little
> sceptical that you could sufficiently describe using short strings how
> data was processed, gridding, averaging, etc. CF has attributes to
> describe these things (cell_measures, cell_methods) and DRS could put
> them all in a general product of processed or modified data.
>
> There are other use cases for a product category called "modified" or
> "processed". (Karl and I in an offline conversation used "modified"
> some days ago, but if "processed" is the consensus, that's fine).
>
> One use case I have for that is that we are working with downstream
> software from the impacts world that fails (aborts with no output) if
> the relative humidity exceeds 100%. This can happen in model output
> for legitimate physical reasons, fog, etc.
>
> Having no control over the downstream software, we've chosen to
> "modify" our RH field to lie between 0 and 100%. As in RH=min(RH,100).
>
> Karl and I believe that this field should be placed in the DRS with
> product=modified. Variable name is unchanged, and the processing is
> also described in a global netCDF "comment" or "history" attribute.
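A minimal sketch of the RH adjustment described above, assuming the intent is to limit values to the range 0 to 100 (so the upper limit is applied with a minimum, the lower with a maximum). The function name `clamp_rh` and the wording of the attribute note are hypothetical, not taken from any existing tool.

```python
def clamp_rh(values, lower=0.0, upper=100.0):
    """Return a copy of values with every entry limited to [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Illustrative relative-humidity values in percent; two exceed 100.
rh = [35.2, 99.8, 100.7, 104.3]
modified = clamp_rh(rh)

# Hypothetical text for a global netCDF "comment" or "history" attribute.
history_note = "RH limited to the range 0-100% for downstream impacts software"

print(modified)
```

The variable name stays the same; only the product facet and the attribute text would tell a user that the field was altered.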
>
> Is this acceptable?
>
> On Thu, Jun 14, 2012 at 3:29 AM,  <martin.juckes at stfc.ac.uk> wrote:
>
> Hello Karl,
>
> that looks good to me. I would add the specification of what form "XXXX"
> should take if the field is global in the sentence where XXXX is
> introduced (e.g. "global" -- or it could be "", as in "g--areaavg" if you
> want brevity). Would it be possible to specify that "XXXX" and "YYYY" should
> not include a hyphen, and replace "global-ocn" with "globalOcn"?
>
> regards,
> Martin
> ________________________________
> From: Karl Taylor [taylor13 at llnl.gov]
> Sent: 14 June 2012 01:07
> To: Juckes, Martin (STFC,RAL,RALSP)
> Cc: jamie.kettleborough at metoffice.gov.uk; v.balaji at noaa.gov;
> Steven.C.Hankin at noaa.gov; Lawrence, Bryan (STFC,RAL,RALSP); Pascoe, Stephen
> (STFC,RAL,RALSP); go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] DRS corrections and extensions
>
> Hi Stephen, Martin, and all,
>
> Thanks very much for thinking carefully about this.  I've responded to your
> input below:
>
> Stephen:
> Extending the use of the "-suffix" part of temporal subset to include
> averaging looks reasonable.  The geographic subset section is rather complex
> and I worry that it will be difficult to implement unambiguous parsers for
> it.  This may not matter provided we can always interpret it as an opaque
> string in filenames of the form:
> "c1_c2_...cn_[temporal-subset]_[geospatial-info].nc".  My specific concerns
> about parsing are below.
> Also, more generally, I wonder whether we are repeating too much information
> from the CF metadata in the filename.  I think the temporal subset  is
> already pushing to the limit what can be effectively represented in a filename
> and this could push it too far.  Filenames within a dataset should be unique
> but maybe we could let data providers decide how they are labelled?
>
> Karl:  Yes, this is an option.  Including a uniform way of embedding the
> time in the filename was essential since we wanted to be able to split
> time-series across files.  The motivation for treating simple spatial
> subsetting and averaging in a standard way is that we hope to return to
> users requested regional datasets, extracting the data on the server side.
> Shouldn't a user expect the files to be named similarly, even if they were
> created from different ESG nodes?
> Stephen:
> If we continue to add detailed syntax to the filename it would greatly help
> to have a formal grammar in BNF notation
> (http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form).
> Karl:  I hope someone familiar with BNF might do this if it's deemed
> important.
>
> Stephen:
> Section 2.4 Geographic subsets
> As described the format is "g[-XXXX][-YYYY]" where both XXXX and YYYY are
> optional and YYYY = "[yyy][-zzz]". XXXX can be omitted when YYYY is present
> as in the example "g-ocn-areaavg".
> I foresee problems in writing parsers that disambiguate the case "g-XXXX"
> from "g-YYYY" particularly in the case where XXXX is a named region.  If we
> wanted to extend the valid vocabulary of YYYY we would have to check for
> clashes with all named regions used in XXXX.  This would seem like a hostage
> to fortune, particularly if users start defining their own regions.
>  Karl:  Perhaps to simplify things we should require XXXX (and prohibit
> hyphens within XXXX).
> Stephen:
> Similarly how do we disambiguate these cases:
> g-XXXX-yyy
> g-yyy-zzz
> With a sufficiently complex parser we can differentiate these because yyy
> and zzz are from controlled vocabularies but writing a generic parser that
> foresees extensions to these vocabularies will be tricky and error-prone.
>  Karl:  requiring XXXX would eliminate this problem.
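To illustrate why requiring XXXX simplifies parsing, here is a rough sketch in which the region is identified purely by position. The function `parse_geo`, the dictionary keys, and the two-entry `SURFACE_TYPES` vocabulary ("ocn", "lnd") are assumptions made for this example. Note that telling yyy from zzz still relies on that small vocabulary.

```python
SURFACE_TYPES = {"ocn", "lnd"}  # assumed controlled vocabulary for yyy


def parse_geo(token):
    """Parse 'g-XXXX[-yyy][-zzz]' where XXXX is required and has no hyphen."""
    parts = token.split("-")
    if len(parts) < 2 or parts[0] != "g" or not parts[1]:
        raise ValueError(f"not a geographic component: {token!r}")
    region = parts[1]          # always the second field, never looked up
    rest = parts[2:]
    surface = None
    operation = None
    if rest and rest[0] in SURFACE_TYPES:
        surface = rest.pop(0)
    if rest:
        operation = rest.pop(0)
    if rest:
        raise ValueError(f"unexpected trailing fields in {token!r}")
    return {"region": region, "surface": surface, "operation": operation}


print(parse_geo("g-global-ocn-areaavg"))
print(parse_geo("g-lat20S20Nlon170W130W"))
```

Because the region is never compared against a list of names, users could define new regions without risking a clash with the other vocabularies.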
> Martin:
> (1) Like Stephen, I’m concerned about the complexity of the “XXXX” section.
> My first suggestion would be to drop the first hyphen and “lat” and “lon”,
> changing “g-lat20S20Nlon170p5W130p5W” to “g20S20N170p5W130p5W”. I’d also be
> tempted to drop the “p5” terms: for some grids (e.g. Gaussian) the exact
> limits will have many decimal places and so there will need to be some
> specification of the level of truncation expected, and I think the most
> convenient would be to round to the nearest integer.
>
> Karl:  I like the suggestion to round to the nearest integer.  I'd like to
> hear others weigh in on whether to eliminate "lat" and "lon".   I guess this
> would be o.k.
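A small sketch of the rounding idea, assuming bounds are given as signed degrees (south and west negative) and that a magnitude ending in .5 rounds upward. The thread does not settle either of those points, and the helper names are hypothetical. The `keep_prefixes` flag shows both the form with "lat" and "lon" and the shorter form Martin suggests.

```python
import math


def round_half_up(x):
    """Round a non-negative number to the nearest integer, .5 going up."""
    return int(math.floor(x + 0.5))


def bound(value, negative_letter, positive_letter):
    """Format one bound, e.g. -20 -> '20S' or 20 -> '20N'."""
    letter = negative_letter if value < 0 else positive_letter
    return f"{round_half_up(abs(value))}{letter}"


def region_label(lat_south, lat_north, lon_west, lon_east, keep_prefixes=True):
    lat = bound(lat_south, "S", "N") + bound(lat_north, "S", "N")
    lon = bound(lon_west, "W", "E") + bound(lon_east, "W", "E")
    if keep_prefixes:
        return f"g-lat{lat}lon{lon}"
    return f"g{lat}{lon}"


print(region_label(-20, 20, -170.5, -130.5))
print(region_label(-20, 20, -170.5, -130.5, keep_prefixes=False))
```

Python's built-in `round` uses round-half-to-even, so it would give a different result for bounds such as 170.5; whichever rule is chosen would need to be stated in the DRS.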
>
> (2) To make parsing of the overall file name easier, you could use
> c1_c2_..... [_<time range>][.<spatial info>].nc – using a “.” instead of “_”
> makes life easier for file parsers. Technically this is not necessary, as
> the “g” already makes it unambiguous, but parsers have to deal with the
> special case of gridspec files and adding more variants make life more
> complicated. Using “.” will make it easier to separate the parsing of the
> existing components from the new ones.
> Karl:  I don't find this argument compelling.  I think it's pretty easy to
> write a parser that can deal with the two optional suffixes (i.e., temporal
> subset and geographical info.)  The first consists of only numerals (and a
> hyphen), whereas the second begins with "g-".  I think some software doesn't
> like "." in filenames except to separate the final "file-type" suffix (e.g.,
> ".nc").
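A rough sketch of the kind of parser Karl describes, peeling the two optional suffixes off the end of an underscore-separated filename. The regular expression is an assumption: it accepts digits with an optional second run of digits, plus an optional trailing alphabetic "-suffix". The function name `split_filename` and the second example filename are made up for illustration.

```python
import re

# Assumed pattern: numerals, optional "-numerals", optional "-suffix".
TEMPORAL = re.compile(r"^\d+(-\d+)?(-[A-Za-z]+)?$")


def split_filename(filename):
    """Return (leading components, temporal subset, geographic info)."""
    stem = filename[:-3] if filename.endswith(".nc") else filename
    parts = stem.split("_")
    geo = None
    temporal = None
    if parts[-1].startswith("g-"):
        geo = parts.pop()
    if parts and TEMPORAL.match(parts[-1]):
        temporal = parts.pop()
    return parts, temporal, geo


print(split_filename("vo_Omon_CNRM-CM5_rcp45_r1i1p1_203601-204512.nc"))
print(split_filename("vo_Omon_CNRM-CM5_rcp45_r1i1p1_203601-204512_g-global-areaavg.nc"))
```

The geographic part is checked first because it is the last component when present; the temporal test is then applied to whatever component is left at the end.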
> (3) As Stephen points out, in the present form, in a string “g-aaa-bbb” the
> term “aaa” could be either a region from a gazetteer or a designation of a
> type of surface (“ocn” or “lnd”). Having to look through multiple
> vocabularies is a problem for file name interpretation, even if one of them
> only has two elements. To get over this, I’d suggest something like:
> “.....[_<time range>][.gXXXX_pYYY-ZZZ].nc”, where the “gXXXX” and “pYYY-ZZZ”
> terms are both optional and the underscore is only present if both are
> present. This approach will only work if you accept the use of “.” suggested
> in (2) to make a clear break between the first part of the name and the new.
> This would give us a first section of the file name in which components are
> identified by position and a 2nd section in which components are identified
> by the first letter of the component.
> Karl:  I prefer simply requiring XXXX if you want to include YYYY.
>
>
> I've made the changes inspired by your input in the attached file.  Further
> comments/suggestions are welcome.
>
> Best regards,
> Karl
>
> Regards,
> Martin
>
> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> Sent: 06 June 2012 21:58
> To: Kettleborough, Jamie; V. Balaji; Steve Hankin; Juckes, Martin
> (STFC,RAL,RALSP); Lawrence, Bryan (STFC,RAL,RALSP); Pascoe, Stephen
> (STFC,RAL,RALSP); go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] DRS corrections and extensions
>
> Dear all,
>
> In February I asked for comments on my proposal to extend the DRS to
> include information about spatio-temporal subsets or means.  I heard from
> Jamie, but no one else.  I respond to Jamie below, but I also would like
> your input specifically about:
>
> 1.  Is this method of describing spatio-temporal subsets acceptable?
> 2.  Is it worth taking this step if we don't say anything about other
> "processed" output?   For example how to describe "regridded" data or
> multi-model means.
>
> I've attached the proposed version of the DRS, which differs from the one I
> sent in January only in a couple mods made in response to Jamie.
>
> Best regards,
> Karl
>
> On 2/13/12 6:47 AM, Kettleborough, Jamie wrote:
> Hello Karl,
>
> this will be terse as I have time to review, but not to necessarily get the
> words right - hope I don't say anything too bad because of this.
>
> 1. section 2.3,  Not sure 'output' should be mentioned under 'product'.  I
> don't think 'output' ever makes it to publication level, so does not need to
> appear in a publication level id.  I know cmor produces it, but I think
> that's kind of historical isn't it, rather than necessary?  Maybe it's too
> late for details like this?
> It's true that in the end the CMIP5 output should not remain as "output",
> but be assigned to "output1" or "output2".  Nevertheless, I don't think
> there is any harm in keeping it in the DRS.
>
>
> 2. section 2.3 version number: to be consistent with what we really have in
> CMIP5 I think you need to note that v1, v2 are also present, though any
> *new* versions should use vYYYYMMDD.
> I have modified the text to indicate that software cannot rely on the
> version number reflecting a date.
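As a sketch of what "cannot rely on the version number reflecting a date" might mean for software, the function below accepts both the short forms (v1, v2) and vYYYYMMDD, and only reports a date when the digits happen to form a valid one. The name `parse_version` and the return shape are assumptions for this example.

```python
import re
from datetime import datetime


def parse_version(version):
    """Return (number, date or None) for a string such as 'v2' or 'v20120614'."""
    match = re.fullmatch(r"v(\d+)", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    digits = match.group(1)
    date = None
    if len(digits) == 8:
        try:
            date = datetime.strptime(digits, "%Y%m%d").date()
        except ValueError:
            date = None  # eight digits that do not form a calendar date
    return int(digits), date


print(parse_version("v2"))
print(parse_version("v20120614"))
```

Comparing the integer part orders v1 and v2 before any date-style version, but that ordering is only a convenience and not something the thread specifies.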
>
>
> 3. section 2.3 version:  I wonder if you need to say more (maybe not here,
> but if not where?) about what triggers a new version.  I think it's
>  a. anything that changes the content of a file already published and
>  b. the addition or deletion of files from any publication data set.
>  Pure 'data management' meta data changes (addition of checksums, move to
> new URL's) need not trigger a new version.
>  Do you also need to say there is no guarantee that old versions will be
> kept (unless they have a DOI).
> I've added some of this information now to the document.
>
>
> 4. section 2.4 Temporal Subsets or means: I don't understand the 'avg'
> example, or if I do I don't know if it's right (but the point is relatively
> minor).  I think the example you quote has one 6 month mean field in it.
> This is based on 1 day means.  I think its a little anomalous to keep the
> frequency as 'day' in this case.  That's not quite consistent with the
> definition (and I think all other uses) of frequency.  Strictly speaking
> frequency should be 6mon no?  (I may have misunderstood).
> I think you're right.  I'm not sure why I thought this was the right way to
> do it.  I've changed the example.
>
>
> 5. section 3.5.  Does this need clarifying? I think the current wording is
> potentially confusing,  I think it should say something like:
>
> 'URLs referencing the data files will have a site dependent prefix (that may
> change due to site-specific data management tasks) followed by the directory
> structure.  This directory structure should (but may not) follow the
> recommendations of section 3.3'
>
> I've modified the text as suggested.
>
> 6. I've noticed that the thredds catalogs also expose a thing called the
> file_id, e.g.
>
> <property name="file_id"
> value="cmip5.output1.CNRM-CERFACS.CNRM-CM5.rcp45.mon.ocean.Omon.r1i1p1.vo_Omon_CNRM-CM5_rcp45_r1i1p1_203601-204512.nc"/>
>
> I don't know if they need a mention as being anything important (we don't
> use them as they don't give any version info).
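For illustration, the file_id quoted above can be split into its dot-separated leading fields and the trailing filename. The facet names in `FACETS` are my reading of the example, not something stated in this thread, and `split_file_id` is a hypothetical helper.

```python
# Assumed meaning of the nine leading fields in the example file_id.
FACETS = [
    "activity", "product", "institute", "model", "experiment",
    "frequency", "realm", "mip_table", "ensemble",
]


def split_file_id(file_id):
    """Return (facet dictionary, filename) for a THREDDS file_id value."""
    pieces = file_id.split(".", len(FACETS))
    if len(pieces) != len(FACETS) + 1:
        raise ValueError(f"too few fields in {file_id!r}")
    return dict(zip(FACETS, pieces)), pieces[-1]


example = (
    "cmip5.output1.CNRM-CERFACS.CNRM-CM5.rcp45.mon.ocean.Omon.r1i1p1."
    "vo_Omon_CNRM-CM5_rcp45_r1i1p1_203601-204512.nc"
)
facets, filename = split_file_id(example)
print(facets["product"], facets["model"], filename)
```

Limiting the number of splits keeps the ".nc" extension attached to the filename. As Jamie notes, nothing in the string carries a version.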
>
> We've already given 5 use cases, which I think is enough.  The DRS is used
> in a number of other ways.
>
> Hope this is useful,
> Yes thanks very much!
> Karl
>
>
> Jamie
>
> ________________________________
> From: go-essp-tech-bounces at ucar.edu On Behalf Of Karl Taylor
> Sent: 10 February 2012 01:32
> To: V. Balaji; Steve Hankin; Martin Juckes; Bryan Lawrence; Stephen Pascoe;
> go-essp-tech at ucar.edu
> Subject: [Go-essp-tech] DRS corrections and extensions
> Dear all,
>
> Attached is my attempt to make the DRS consistent with CMIP5 (in describing
> the precision of "time instants"), but primarily to extend it to a more
> complete treatment of spatio-temporal subsets or means.  I've also corrected
> a few typos.
>
> Comments most welcome.  In particular could someone recheck sections 3.3-3.5
> (which haven't been changed by me) to see if they remain consistent with
> CMIP5?
>
> thanks and best regards,
> Karl
>
>
> --
> Scanned by iCritical.
>
>
>
> --
>
> V. Balaji                               Office:  +1-609-452-6516
> Head, Modeling Systems Group, GFDL      Home:    +1-212-643-2089
> Princeton University                    Email: v.balaji at noaa.gov
>
>
>

-- 

V. Balaji                               Office:  +1-609-452-6516
Head, Modeling Systems Group, GFDL      Home:    +1-212-643-2089
Princeton University                    Email: v.balaji at noaa.gov