[Go-essp-tech] [Ncpp_tech] Directory structure proposal and a doodle poll

Wed Mar 13 21:41:02 MDT 2013

Thanks for the comments, Karl.

Yes, when we said "directory structure" we meant "DRS structure",
which could be implemented as unix directories, but needn't be... the
important thing is that these terms can be converted into searchable
facets.
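
A rough sketch of that conversion, in Python, assuming the element order
and the example path from Galia's proposal quoted below. The names
DRS_ELEMENTS and drs_to_facets are made up for illustration and are not
part of ESGF or its publisher; "predictand" stands in for
"predictand (variableName)".

```python
# Element names in the order given in the proposal quoted below.
DRS_ELEMENTS = [
    "projectID", "sub-project", "product", "institution",
    "predictorModel", "experimentID", "frequency", "realm", "MIPtable",
    "Predictor_experiment_rip", "predictorversion",
    "downscalingMethod", "predictand", "region", "DownscaledDataversion",
    "file_name",
]


def drs_to_facets(path):
    """Split a DRS-style path into a {facet name: value} dictionary."""
    parts = path.strip("/").split("/")
    if len(parts) != len(DRS_ELEMENTS):
        raise ValueError(
            "expected %d elements, got %d" % (len(DRS_ELEMENTS), len(parts))
        )
    return dict(zip(DRS_ELEMENTS, parts))


# The example path from the proposal.
EXAMPLE = (
    "/ncpp2013/perfectModel/downscaled/NOAA-GFDL/GFDL-HIRAM-C360-coarsened"
    "/amip/day/atmos/day/r1i1p1/v20121024/GFDL-ARRMv1/tasmax/US48/v20120227"
    "/tasmax_day_amip_r1i1p1_downscaled_US48_GFDLARRMv1_19790101-19831231.nc"
)

facets = drs_to_facets(EXAMPLE)
for name in DRS_ELEMENTS:
    print("%-26s %s" % (name, facets[name]))
```

Whether the terms live in unix directories or somewhere else, the same
name/value pairs are what a search index would need.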

On Wed, Mar 13, 2013 at 8:52 PM, Karl Taylor <taylor13 at llnl.gov> wrote:
> Dear Galia and all,
>
> I'm glad thought is being given to unifying directory structures across
> projects.
>
> That being said it may not be known by everyone that:
>
> 1) From the user's perspective this isn't really that necessary, since the
> directory structure is hidden by ESGF (but see below for further
> discussion).
> 2) The ESGF publishing software doesn't care much at all about the directory
> structure.
> 3)  In particular the ESGF search categories (facets) are not tied to the
> directory structure.
>
> So there is quite a bit of flexibility allowed by ESGF, and users can
> normally access data blissfully unaware of how it's been organized on various
> data nodes.
>
> Nevertheless, I agree that at least for some "projects" organizing data in
> unified directory structures is useful.  Node managers who often want to
> access data outside of the ESGF api can more easily find what they're
> looking for with a standardized directory structure.  If the directory
> structure is rigorously enforced, users could take wget scripts created to
> download one variable and easily modify it to download a different variable
> (although it's hard to be sure of success).
>
> Concerning your proposed directory structure, I have two comments:
>
> 1)  I would recommend omitting the "institution" subdirectory.  For CMIP5 I
> think it was a mistake to include this (and also, I wouldn't include it as a
> search category).  The institution can be recorded in the metadata of each
> file.  If the same model was run at more than 1 institution, then I'd like
> to see all the simulations under a single "model" subdirectory, not split
> across two directories under different institutions.
>
> 2) When possible, I would stick with the terminology established in the
> "CMIP5 Data Reference Syntax (DRS) and Controlled Vocabularies" document
> available at:
> http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf
>
> Best regards,
> Karl
>
>
>
>
> On 3/12/13 10:14 AM, Galia Guentchev wrote:
>
> Hi everybody,
>
> Several groups have expressed interest in publishing downscaled climate
> datasets on ESGF. A standardized solution to publishing (directory structure
> elements) would contribute to the prompt identification of datasets. To
> discuss needs and options for directory structure elements we had an initial
> teleconference about a month ago. With this email we are expanding our reach
> to other groups, such as the go-essp group, in order to have a wider
> discussion of these elements.
>
> As agreed during our first teleconference, Aparna and Galia worked on a
> proposal for a Directory Structure for publishing downscaled datasets on
> ESGF. We would like to focus our next teleconference on discussing this
> proposal. Below please find a doodle poll for a potential next
> teleconference.
>
> http://doodle.com/hrwthqs2g5pgsyv6
>
>
> **********************************************************************
> Details of each element of the proposed directory structure:
>
> Proposed elements -
> /projectID/sub-project/product/institution/predictorModel/experimentID/frequency/realm/MIPtable/Predictor_experiment_rip/predictorversion/downscalingMethod/predictand (variableName)/region/DownscaledDataversion/file_name.nc
>
> Example:
>
> /ncpp2013/perfectModel/downscaled/NOAA-GFDL/GFDL-HIRAM-C360-coarsened/amip/day/atmos/day/r1i1p1/v20121024/GFDL-ARRMv1/tasmax/US48/v20120227/tasmax_day_amip_r1i1p1_downscaled_US48_GFDLARRMv1_19790101-19831231.nc
>
> The new element sub-project (in blue above) gives the opportunity to
> indicate to users that in the one case the method was trained on
> observations (standard setting), and in the other on a model that was
> considered to be the truth (perfect model setting);
> The options there could be: PerfectModel or Standard - where possibly there
> could be a different name instead of 'standard' for the standard downscaling
> setting.
>
> For NASA datasets some of the directories could be:
>
> project = NEX
> product = downscaled
> institution = NASA-Ames
> predictorModel - original model value
> experimentID = historical
> frequency = mon
> realm = atmos
> Predictor_experiment_rip - original model value
> variable = precipitation or temperature
> region = CONUS
>
> DownscalingMethod will also be included as a directory to allow for search
> on method.
>
> **********************
> There is a set of sub-directories that refer to the PredictorModel -
> presented in bold -
> /predictorModel/experimentID/frequency/realm/MIPtable/Predictor_experiment_rip/predictorversion
>
> Where:
>
> predictor model - is the specific GCM which is the source of the predictor
> data set - GFDL-HIRAM-C360-coarsened - in the above example
> experimentID - the specific experiment - amip in this case
> frequency - refers to the temporal scale of the predictor fields - daily
> realm - the realm of the predictors - in this case atmos(phere)
> MIPtable - name of the model intercomparison table - daily in this example,
> could be amon - for atm monthly data;
> Predictor_experiment_rip - follows the standard notation from CMIP5
> version - the version date of the global model that provided the predictor
> dataset
>
> The elements above follow quite closely the structure for CMIP5 model output
> directory elements.
>
> There is a set of sub-directories that refer to the Downscaling method -
> presented in italics -
> downscalingMethod/predictand (variableName)/region/DownscaledDataversion
>
> Where:
>
> downscalingMethod - is the downscaling method abbreviation - in this case
> GFDL-ARRMv1 - the GFDL in the name indicates that this is a setting applied
> by GFDL where there were two sets of predictors, based on the ARRM method of
> K.Hayhoe; also v.1 indicates which version of the ARRM method was used (the
> original version) - more details about the method are given in the global
> attributes of the file;
> Predictand (variableName) - the specific predictand variable that was
> downscaled; tasmax in this case;
> region - indicates that the method was applied to the US48
> DownscaledDataversion - the version of the downscaled dataset
>
> For the purposes of standardization there are two directions to consider:
>
> 1) One is to have one standard directory structure that will be used by all
> - for example, following the example of GFDL to have the details of the
> predictor model first and then the downscaling method details:
>
> ProjectID - sub-project - product - Institution - Predictor dataset details
> - Downscaling method details - Filename
>
> Having a standardized approach would help any automated service/web service
> to detect the directory path for a particular dataset.
>
> 2) During our last teleconference there was a proposal to follow the
> downscaling practice and describe the downscaling method first and then the
> predictor model. This leads to two paths:
>
>         • ProjectID - Standard or Perfect Model sub-project facet - product
> - Institution -  then see below:
>                -  (if Perfect model setting) Predictor dataset details -
> Downscaling method details,
>                -  (if Standard setting) - Downscaling method details -
> Predictor dataset details
>
>
> The NCPP Core team accepts that it may be reasonable to have a directory
> structure - where the method description is first; and another directory
> structure - where the predictor description is first and then the methods
> that are applied are described; NCPP will support either approach (one
> overall directory structure, or two separate pathways) and if the second
> approach is chosen (with two different sub-directory sequences) - we would
> like to promote and to support the standardization of these different
> directory pathways - meaning - we will support two standardized directory
> structures to accommodate two common practices.
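
To make the two pathways concrete, here is a small Python sketch of how
a tool might pick the element order from the sub-project value. It
assumes the sub-project is spelled "perfectModel" or "standard"
(compared case-insensitively); the proposal leaves the name of the
standard setting open. The group names COMMON, PREDICTOR and METHOD and
both functions are hypothetical, for illustration only.

```python
# Element groups, named as in the proposal quoted above.
COMMON = ["projectID", "sub-project", "product", "institution"]
PREDICTOR = ["predictorModel", "experimentID", "frequency", "realm",
             "MIPtable", "Predictor_experiment_rip", "predictorversion"]
METHOD = ["downscalingMethod", "predictand", "region",
          "DownscaledDataversion"]


def element_order(sub_project):
    """Return the directory element order for a given sub-project."""
    key = sub_project.lower()
    if key == "perfectmodel":
        # Perfect model setting: predictor details first, then method.
        return COMMON + PREDICTOR + METHOD
    if key == "standard":
        # Standard setting: method details first, then predictor.
        return COMMON + METHOD + PREDICTOR
    raise ValueError("unknown sub-project: %r" % sub_project)


def build_path(facets, file_name):
    """Join facet values into a directory path in the chosen order."""
    order = element_order(facets["sub-project"])
    return "/" + "/".join(facets[name] for name in order) + "/" + file_name


FILE_NAME = ("tasmax_day_amip_r1i1p1_downscaled_US48_GFDLARRMv1_"
             "19790101-19831231.nc")
facets = {
    "projectID": "ncpp2013", "sub-project": "perfectModel",
    "product": "downscaled", "institution": "NOAA-GFDL",
    "predictorModel": "GFDL-HIRAM-C360-coarsened", "experimentID": "amip",
    "frequency": "day", "realm": "atmos", "MIPtable": "day",
    "Predictor_experiment_rip": "r1i1p1", "predictorversion": "v20121024",
    "downscalingMethod": "GFDL-ARRMv1", "predictand": "tasmax",
    "region": "US48", "DownscaledDataversion": "v20120227",
}

perfect_path = build_path(facets, FILE_NAME)
standard_path = build_path({**facets, "sub-project": "standard"}, FILE_NAME)
print(perfect_path)
print(standard_path)
```

Either way the set of facets is the same; only the order of the
directory levels changes, so a service that reads facets rather than
positions would not need to care which pathway was used.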
>
>
> ******************
> Additional details:
>
> Variable level attributes-
> The published dataset should also conform to CF-standards.
> e.g.:
>
>                 tasmax:long_name = "Downscaled Daily Maximum Near-Surface
> Air Temperature" ;
>                 tasmax:units = "K" ;
>                 tasmax:missing_value = 1.e+20f ;
>                 tasmax:_FillValue = 1.e+20f ;
>                 tasmax:standard_name = "air_temperature" ;
>                 tasmax:original_units = "K" ;
>                 tasmax:downscaling_method = "GFDL-ARRMv1" ;
>
> Global attributes- listing a few here, several CMIP-style attributes will be
> inherited.
>
> "predictorModel" will replace "model_id"
>   For the 'downscaling model', as agreed with Luca on the call it would be
> 'downscalingMethod'
>
>                 :Conventions = "CF-1.4" ;
>                 :references = "info about model, training datasets etc will
> be provided here" ;
>                 :info = "additional info about the downscaling method" ;
>                 :creation_date = "2011-08-19T21:57:06Z" ;
>                 :institution = "NOAA GFDL(201 Forrestal Rd, Princeton, NJ,
> 08540)" ;
>                 :history = "info on file processing, e.g. processed by toolX." ;
>                 :projectID = "ncpp2013" ;
>                 :subprojectID = "perfectModel" ;
>                 :product = "downscaled" ;
>                 :institution = "NOAA-GFDL" ;
>                 :predictorModel = "GFDL-HIRAM-C360-coarsened" ;
>                 :experimentID = "amip" ;
>                 :frequency = "day" ;
>                 :modeling_realm = "atmos" ;
>                 :Predictor_experiment_rip = "r1i1p1" ;
>                 :region = "US48" ;
>                 :table_id = "day" ;
>                 :version = "v20120227" ;
>                 :downscalingMethod = "GFDL-ARRMv1" ;
> **************************************************
>
> Best regards,
> Galia and Aparna
>
> --
> Galia Guentchev, PhD
> Project Scientist
> National Climate Predictions and Projections Platform (NCPP)
> NCAR RAL CSAP
> FL2 3103
> 3450 Mitchell Lane
> Boulder, CO, 80301
> phone: 303 497 2743
>
>
>

--
Balaji                     Office: +1-609-452-6516
Princeton University       Home:   +1-212-643-2089