[Go-essp-tech] DRS corrections and extensions

V Balaji v.balaji at noaa.gov
Thu Jun 14 12:02:37 MDT 2012


I've got a bit lost in this discussion, and admit I'm a little
sceptical that you could sufficiently describe using short strings how
data was processed, gridding, averaging, etc. CF has attributes to
describe these things (cell_measures, cell_methods) and DRS could put
them all in a general product of processed or modified data.

There are other use cases for a product category called "modified" or
"processed". (Karl and I in an offline conversation used "modified"
some days ago, but if "processed" is the consensus, that's fine).

One use case I have for that is that we are working with downstream
software from the impacts world that fails (aborts with no output) if
the relative humidity exceeds 100%. This can happen in model output
for legitimate physical reasons, fog, etc.

Having no control over the downstream software, we've chosen to
"modify" our RH field to lie between 0 and 100%, as in
RH=min(max(RH,0),100).

Karl and I believe that this field should be placed in the DRS with
product=modified. Variable name is unchanged, and the processing is
also described in a global netCDF "comment" or "history" attribute.
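
The clamping described above can be sketched as follows. This is only a
minimal illustration in plain Python: the function name clamp_rh and the
wording of the history note are assumptions, not part of any existing tool
or of the DRS.

```python
def clamp_rh(values, lower=0.0, upper=100.0):
    """Return a copy of values with every entry limited to [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Sample relative humidity values in percent, including supersaturation.
rh = [-0.5, 35.0, 99.9, 100.0, 103.2]
clamped = clamp_rh(rh)

# Hypothetical text for the global "history" (or "comment") attribute.
history_note = "RH limited to the range 0-100%: RH=min(max(RH,0),100)"

print(clamped)
print(history_note)
```

The variable name stays the same; only the product (and the history or
comment text) would tell a user that the field was altered.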

Is this acceptable?

On Thu, Jun 14, 2012 at 3:29 AM,  <martin.juckes at stfc.ac.uk> wrote:
> Hello Karl,
>
> that looks good to me. I would add the specification of what form "XXXX" should take if the field is global in the sentence where XXXX is introduced (e.g. "global" -- or it could be "", as in "g--areaavg" if you want brevity). Would it be possible to specify that "XXXX" and "YYYY" should not include a hyphen, and replace "global-ocn" with "globalOcn"?
>
> regards,
> Martin
> ________________________________
> From: Karl Taylor [taylor13 at llnl.gov]
> Sent: 14 June 2012 01:07
> To: Juckes, Martin (STFC,RAL,RALSP)
> Cc: jamie.kettleborough at metoffice.gov.uk; v.balaji at noaa.gov; Steven.C.Hankin at noaa.gov; Lawrence, Bryan (STFC,RAL,RALSP); Pascoe, Stephen (STFC,RAL,RALSP); go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] DRS corrections and extensions
>
> Hi Stephen, Martin, and all,
>
> Thanks very much for thinking carefully about this.  I've responded to your input below:
>
> Stephen:
> Extending the use of the "-suffix" part of temporal subset to include averaging looks reasonable.  The geographic subset section is rather complex and I worry that it will be difficult to implement unambiguous parsers for it.  This may not matter provided we can always interpret it as an opaque string in filenames of the form: "c1_c2_...cn_[temporal-subset]_[geospatial-info].nc".  My specific concerns about parsing are below.
> Also, more generally, I wonder whether we are repeating too much information from the CF metadata in the filename.  I think the temporal subset is already pushing to the limit what can be effectively represented in a filename and this could push it too far.  Filenames within a dataset should be unique but maybe we could let data providers decide how they are labelled?
>
> Karl:  Yes, this is an option.  Including a uniform way of embedding the time in the filename was essential since we wanted to be able to split time-series across files.  The motivation for treating simple spatial subsetting and averaging in a standard way is that we hope to return to users requested regional datasets, extracting the data on the server side.  Shouldn't a user expect the files to be named similarly, even if they were created from different ESG nodes?
> Stephen:
> If we continue to add detailed syntax to the filename it would greatly help to have a formal grammar in BNF notation (http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form).
> Karl:  I hope someone familiar with BNF might do this if it's deemed important.
>
> Stephen:
> Section 2.4 Geographic subsets
> As described the format is "g[-XXXX][-YYYY]" where both XXXX and YYYY are optional and YYYY = "[yyy][-zzz]". XXXX can be omitted when YYYY is present as in the example "g-ocn-areaavg".
> I foresee problems in writing parsers that disambiguate the case "g-XXXX" from "g-YYYY", particularly in the case where XXXX is a named region.  If we wanted to extend the valid vocabulary of YYYY we would have to check for clashes with all named regions used in XXXX.  This would seem like a hostage to fortune, particularly if users start defining their own regions.
>  Karl:  Perhaps to simplify things we should require XXXX (and prohibit hyphens within XXXX).
> Stephen:
> Similarly how do we disambiguate these cases:
> g-XXXX-yyy
> g-yyy-zzz
> With a sufficiently complex parser we can differentiate these because yyy and zzz are from controlled vocabularies, but writing a generic parser that foresees extensions to these vocabularies will be tricky and error-prone.
>  Karl:  requiring XXXX would eliminate this problem.
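
If XXXX is required and may not contain a hyphen, the geographic label can
be split purely by position. The sketch below is a hypothetical helper
(parse_geo_subset is not an existing function) and it assumes at most two
qualifiers after the region, as in "[yyy][-zzz]". It does not check the
qualifiers against any controlled vocabulary.

```python
def parse_geo_subset(label):
    """Split 'g-XXXX[-yyy][-zzz]' into a region and its trailing qualifiers."""
    parts = label.split("-")
    if parts[0] != "g":
        raise ValueError("geographic subset must start with 'g'")
    if len(parts) < 2:
        raise ValueError("the region (XXXX) is required")
    if any(p == "" for p in parts):
        raise ValueError("empty component in " + repr(label))
    if len(parts) > 4:
        raise ValueError("too many components in " + repr(label))
    # Position alone identifies the region; the rest are qualifiers.
    return {"region": parts[1], "qualifiers": parts[2:]}


print(parse_geo_subset("g-global-ocn-areaavg"))
print(parse_geo_subset("g-lat20S20Nlon170p5W130p5W"))
```

With this rule "g-ocn-areaavg" would be read as region "ocn", so an
omitted region is no longer expressible, which is the point of the
requirement.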
> Martin:
> (1) Like Stephen, I’m concerned about the complexity of the “XXXX” section. My first suggestion would be to drop the first hyphen and “lat” and “lon”, changing “g-lat20S20Nlon170p5W130p5W” to “g20S20N170p5W130p5W”. I’d also be tempted to drop the “p5” terms: for some grids (e.g. Gaussian) the exact limits will have many decimal places and so there will need to be some specification of the level of truncation expected, and I think the most convenient would be to round to the nearest integer.
>
> Karl:  I like the suggestion to round to the nearest integer.  I'd like to hear others weigh in on whether to eliminate "lat" and "lon".   I guess this would be o.k.
>
> (2) To make parsing of the overall file name easier, you could use c1_c2_..... [_<time range>][.<spatial info>].nc – using a “.” instead of “_” makes life easier for file parsers. Technically this is not necessary, as the “g” already makes it unambiguous, but parsers have to deal with the special case of gridspec files and adding more variants makes life more complicated. Using “.” will make it easier to separate the parsing of the existing components from the new ones.
> Karl:  I don't find this argument compelling.  I think it's pretty easy to write a parser that can deal with the two optional suffixes (i.e., temporal subset and geographical info).  The first consists of only numerals (and a hyphen), whereas the second begins with "g-".  I think some software doesn't like "." in filenames except to separate the final "file-type" suffix (e.g., ".nc").
> (3) As Stephen points out, in the present form, in a string “g-aaa-bbb” the term “aaa” could be either a region from a gazetteer or a designation of a type of surface (“ocn” or “lnd”). Having to look through multiple vocabularies is a problem for file name interpretation, even if one of them only has two elements. To get over this, I’d suggest something like: “.....[_<time range>][.gXXXX_pYYY-ZZZ].nc”, where the “gXXXX” and “pYYY-ZZZ” terms are both optional and the underscore is only present if both are present. This approach will only work if you accept the use of “.” suggested in (2) to make a clear break between the first part of the name and the new. This would give us a first section of the file name in which components are identified by position and a 2nd section in which components are identified by the first letter of the component.
> Karl:  I prefer simply requiring XXXX if you want to include YYYY.
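
The two optional suffixes can be separated along the lines Karl describes:
the temporal subset is numerals with an optional hyphen, and the
geographical info begins with "g-". The sketch below is an assumption-laden
illustration: split_suffixes is a hypothetical name, the example geographic
label is made up, and any averaging suffix appended to the temporal subset
is not handled.

```python
import re

# Temporal subset as described: numerals, optionally two runs joined by a hyphen.
TIME_RE = re.compile(r"^\d+(-\d+)?$")


def split_suffixes(filename):
    """Return (leading components, temporal subset or None, geo info or None)."""
    if not filename.endswith(".nc"):
        raise ValueError("expected a .nc file name")
    parts = filename[:-3].split("_")
    # The geographical info, if present, is the last component and starts with "g-".
    geo = parts.pop() if parts[-1].startswith("g-") else None
    # The temporal subset, if present, is then the last remaining component.
    time = parts.pop() if parts and TIME_RE.match(parts[-1]) else None
    return parts, time, geo


name = "vo_Omon_CNRM-CM5_rcp45_r1i1p1_203601-204512_g-global-ocn-areaavg.nc"
print(split_suffixes(name))
print(split_suffixes("vo_Omon_CNRM-CM5_rcp45_r1i1p1_203601-204512.nc"))
```

Both suffixes are recognised from their own form, so the leading
components keep their positional meaning whether or not a suffix is
present.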
>
>
> I've made the changes inspired by your input in the attached file.  Further comments/suggestions are welcome.
>
> Best regards,
> Karl
>
> Regards,
> Martin
>
> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> Sent: 06 June 2012 21:58
> To: Kettleborough, Jamie; V. Balaji; Steve Hankin; Juckes, Martin (STFC,RAL,RALSP); Lawrence, Bryan (STFC,RAL,RALSP); Pascoe, Stephen (STFC,RAL,RALSP); go-essp-tech at ucar.edu<mailto:go-essp-tech at ucar.edu>
> Subject: Re: [Go-essp-tech] DRS corrections and extensions
>
> Dear all,
>
> In February I asked for comments on my proposal to extend the DRS to  include information about spatio-temporal subsets or means.  I heard from Jamie, but no one else.  I respond to Jamie below, but I also would like your input specifically about:
>
> 1.  Is this method of describing spatio-temporal subsets acceptable?
> 2.  Is it worth taking this step if we don't say anything about other "processed" output?   For example how to describe "regridded" data or multi-model means.
>
> I've attached the proposed version of the DRS, which differs from the one I sent in January only in a couple mods made in response to Jamie.
>
> Best regards,
> Karl
>
> On 2/13/12 6:47 AM, Kettleborough, Jamie wrote:
> Hello Karl,
>
> this will be terse as I have time to review, but not to necessarily get the words right - hope I don't say anything too bad because of this.
>
> 1. section 2.3,  Not sure 'output' should be mentioned under 'product'.  I don't think 'output' ever makes it to publication level, so does not need to appear in a publication level id.  I know cmor produces it, but I think that's kind of historical isn't it, rather than necessary?  Maybe it's too late for details like this?
> It's true that in the end the CMIP5 output should not remain as "output", but be assigned to "output1" or "output2".  Nevertheless, I don't think there is any harm in keeping it in the DRS.
>
>
> 2. section 2.3 version number: to be consistent with what we really have in CMIP5 I think you need to note that v1, v2 are also present, though any *new* versions should use vYYYYMMDD.
> I have modified the text to indicate that software cannot rely on the version number reflecting a date.
>
>
> 3. section 2.3 version:  I wonder if you need to say more (maybe not here, but if not where?) about what triggers a new version.  I think it's
>  a. anything that changes the content of a file already published and
>  b. the addition or deletion of files from any publication data set.
>  Pure 'data management' metadata changes (addition of checksums, move to new URLs) need not trigger a new version.
>  Do you also need to say there is no guarantee that old versions will be kept (unless they have a DOI)?
> I've added some of this information now to the document.
>
>
> 4. section 2.4 Temporal Subsets or means: I don't understand the 'avg' example, or if I do I don't know if it's right (but the point is relatively minor).  I think the example you quote has one 6-month mean field in it.  This is based on 1-day means.  I think it's a little anomalous to keep the frequency as 'day' in this case.  That's not quite consistent with the definition (and I think all other uses) of frequency.  Strictly speaking, frequency should be 6mon, no?  (I may have misunderstood).
> I think you're right.  I'm not sure why I thought this was the right way to do it.  I've changed the example.
>
>
> 5. section 3.5.  Does this need clarifying? I think the current wording is potentially confusing,  I think it should say something like:
>
> 'URLs referencing the data files will have a site dependent prefix (that may change due to site-specific data management tasks) followed by the directory structure.  This directory structure should (but may not) follow the recommendations of section 3.3'
>
> I've modified the text as suggested.
>
> 6. I've noticed that the thredds catalogs also expose a thing called the file_id, e.g
>
> <property name="file_id" value="cmip5.output1.CNRM-CERFACS.CNRM-CM5.rcp45.mon.ocean.Omon.r1i1p1.vo_Omon_CNRM-CM5_rcp45_r1i1p1_203601-204512.nc"/>
>
> I don't know if they need a mention as being anything important (we don't use them as they don't give any version info).
>
> We've already given 5 use cases, which I think is enough.  The DRS is used in a number of other ways.
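
For reference, a file_id like the one quoted in item 6 can be taken apart
by position. This is a rough sketch only: parse_file_id is a hypothetical
helper, the field labels are my own shorthand for the DRS components, and
it assumes no component before the file name contains a ".".

```python
# Shorthand labels (assumed) for the dot-separated components before the file name.
FIELDS = (
    "activity", "product", "institute", "model", "experiment",
    "frequency", "realm", "mip_table", "ensemble",
)


def parse_file_id(file_id):
    """Split a THREDDS file_id into labelled components plus the file name."""
    # Limit the split so the dots inside the file name itself are left alone.
    parts = file_id.split(".", len(FIELDS))
    if len(parts) != len(FIELDS) + 1:
        raise ValueError("unexpected file_id layout: " + repr(file_id))
    record = dict(zip(FIELDS, parts))
    record["filename"] = parts[-1]
    return record


file_id = ("cmip5.output1.CNRM-CERFACS.CNRM-CM5.rcp45.mon.ocean.Omon.r1i1p1."
           "vo_Omon_CNRM-CM5_rcp45_r1i1p1_203601-204512.nc")
record = parse_file_id(file_id)
print(record["product"], record["ensemble"], record["filename"])
```

As Jamie notes, nothing in this identifier carries version information.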
>
> Hope this is useful,
> Yes thanks very much!
> Karl
>
>
> Jamie
>
> ________________________________
> From: go-essp-tech-bounces at ucar.edu<mailto:go-essp-tech-bounces at ucar.edu> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
> Sent: 10 February 2012 01:32
> To: V. Balaji; Steve Hankin; Martin Juckes; Bryan Lawrence; Stephen Pascoe; go-essp-tech at ucar.edu<mailto:go-essp-tech at ucar.edu>
> Subject: [Go-essp-tech] DRS corrections and extensions
> Dear all,
>
> Attached is my attempt to make the DRS consistent with CMIP5 (in describing the precision of "time instants"), but primarily to extend it to a more complete treatment of spatio-temporal subsets or means.  I've also corrected a few typos.
>
> Comments most welcome.  In particular could someone recheck sections 3.3-3.5 (which haven't been changed by me) to see if they remain consistent with CMIP5?
>
> thanks and best regards,
> Karl
>
>
> --
> Scanned by iCritical.
>



-- 

V. Balaji                               Office:  +1-609-452-6516
Head, Modeling Systems Group, GFDL      Home:    +1-212-643-2089
Princeton University                    Email: v.balaji at noaa.gov

