[Go-essp-tech] Comments on Tuesday telco on QC and DOI

Martina Stockhause martina.stockhause at zmaw.de
Fri Mar 19 08:51:57 MDT 2010


Hi Bob,

thanks a lot for your explanation of the ESG conformance checks. I am
about to update our QC paper and therefore I have a couple of questions
for clarification:

1. Do your DRS checks or axes checks involve the data itself?
E.g. for the DRS component "frequency": do you check only the DRS name
"6hr", or do you also check whether the data are described on a time
axis with a timestep of 6 hours?
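To illustrate the distinction I mean, here is a minimal sketch of the two kinds of check (the frequency table below is a made-up, much smaller stand-in for the real DRS vocabulary, and monthly steps are omitted because their length varies):

```python
from datetime import datetime, timedelta

# Tiny stand-in for the DRS frequency vocabulary (assumption: the real
# controlled vocabulary is larger; variable-length steps are skipped).
DRS_FREQUENCIES = {"6hr": timedelta(hours=6), "day": timedelta(days=1)}

def drs_name_check(frequency):
    """First kind of check: is the DRS token itself a valid name?"""
    return frequency in DRS_FREQUENCIES

def data_timestep_check(frequency, timepoints):
    """Second kind of check: do the data actually follow that timestep?"""
    expected = DRS_FREQUENCIES[frequency]
    steps = [b - a for a, b in zip(timepoints, timepoints[1:])]
    return all(step == expected for step in steps)

# A series with a genuine 6-hourly time axis passes both checks.
times = [datetime(2010, 1, 1) + i * timedelta(hours=6) for i in range(4)]
print(drs_name_check("6hr"))              # True
print(data_timestep_check("6hr", times))  # True
```

Only the second check would catch a file whose name says "6hr" but whose time axis is, say, daily.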

2. CF check: You mentioned a warning for non-valid standard names. Is
the data published in spite of this warning? Where is the warning
written? For the final QC checks for DOI publication (L3) it would be
helpful to have these warnings. Is a log file written, and is the
warning introduced into the CIM/Metafor metadata?
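The behaviour I understand from your description would look roughly like this sketch (the two-entry name set is only an illustrative excerpt, not the real CF standard name table, and the log file name is my assumption):

```python
import logging

# Illustrative excerpt; the publisher would consult the full CF
# standard name table (assumption on my part).
CF_STANDARD_NAMES = {"air_temperature", "precipitation_flux"}

# Hypothetical log file name; warnings are recorded, not fatal.
logging.basicConfig(filename="publisher_qc.log", level=logging.WARNING)
log = logging.getLogger("esg.publisher")

def check_standard_name(name):
    """Warn on an unrecognized standard_name but do not block
    publication; return True when the name is valid."""
    if name not in CF_STANDARD_NAMES:
        log.warning("non-valid standard_name: %s", name)
        return False
    return True

check_standard_name("air_temperature")  # valid, no warning
check_standard_name("surface_wibble")   # warning is written to the log
```

If the warnings end up in a log file like this, QC L3 could simply harvest that file.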

3. This is more a question for Charles: when CMOR2 conformance is
checked in the publisher, will a log file be written? Are there any
cases in which data pass the checks with warnings or errors?

Your questions about (a) incompleteness and (b) a small number of
erroneous files in a simulation are good ones, because errors can occur
at the Atomic Dataset level, while searching is done at the Model Realm
level and the STD-DOI is assigned at the Experiment level.

For (a) incompleteness, I cannot see an option for us to prevent this:
certain modeling groups will not provide all requested parameters, and
some will add the more difficult parameters, or the last years, at a
later point. We do not know what they plan to deliver, at least not
before we reach the QC level 2 checks, where we can check against the
questionnaire data. Even then, they might have changed their minds
since making the metadata entry. Therefore we have to publish the
delivered parts.

For the erroneous data: in my opinion, as long as the data is at QC
level 1, or before it is replicated (Bryan's "preprint" stage), it can
be exchanged and updated without any action by the ESG federation. If
the data has passed QC L1 and has been replicated, we need a quick
notification within the ESGF to put the QC procedure for this dataset
on hold. Users who have downloaded the "preprint" data have done so at
their own risk, so I don't think we have to notify them. If these users
want to cite the data, they have to go back to the portal to get the
citation (DOI), and may then discover a revised data version.
Since we do not expect to discover errors in QC L2 data during the QC
L3 (STD-DOI) data publication process, the question of partly erroneous
DOI data remains. In principle, there are the following possibilities:
1. add an erratum (for a minor error in an atomic dataset, such as a
missing record), 2. publish a new version of the DOI (e.g. for
additional data), or 3. assign a new DOI (for major changes). We have
just started the internal discussion of which criteria to apply in
these different cases.

Finally, regarding access to erroneous DOI data: WDCC follows the
principle that a normal user finds all data in a data search, both
erroneous and corrected. An erroneous file is marked by appending an
error flag "_err" to the original name; before download, the user is
notified of the corrected version and has to ask for special permission
to download the erroneous file. We have to discuss how easy or
difficult it should be to find and download erroneous data versions in
the ESGF.
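The WDCC naming convention is simple to state in code; this sketch (the example file name is invented) just appends the "_err" flag before the extension:

```python
from pathlib import PurePosixPath

def mark_erroneous(filename):
    """Append the "_err" error flag to a file's base name, keeping the
    extension, following the WDCC convention described above."""
    p = PurePosixPath(filename)
    return str(p.with_name(p.stem + "_err" + p.suffix))

# Hypothetical file name, for illustration only.
print(mark_erroneous("tas_6hr_model_run1.nc"))  # tas_6hr_model_run1_err.nc
```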

I agree with you that prioritizing the data to be quality controlled is
a good idea. We would need a fixed list based on your AR4 experience
for that, which could later be replaced by an up-to-date list based on
download rates.

Best wishes,
Martina


Bob Drach wrote:
> Hi Michael,
>
> It was mentioned in today's telco that the ESG publisher currently  
> does some QC checks automatically. To be specific, the publisher checks:
>
> - Discovery data - especially DRS fields - are identifiable and have  
> correct values. If any mandatory fields are missing or invalid, an  
> error is raised and the data cannot be published.
> - Standard names are valid. A warning is issued if the standard name  
> is missing or unrecognized.
> - Coordinate axes are recognizable - particularly time. A calendar is  
> defined.
> - Time values are monotonic and do not overlap between files. This is  
> checked when aggregations are generated. It is not considered an  
> error if timepoints are missing.
>
> There seems to be reasonable consensus that the quality control flag  
> will be created and updated by the publisher, will be associated with  
> the publication-level dataset and displayed with that dataset on the  
> gateway. The question remains how to deal with datasets for which  
> either (a) some of the variables in the dataset were not generated by  
> the modeling group, or (b) a small number of variables (whatever that  
> means) did not pass quality control:
>
> (a) Experience suggests that some groups will not submit all  
> variables for an experiment, or will not generate and submit them at  
> the same time. When the publishing group is not the same as the  
> modelling group (e.g. where a center has submitted data to one of the  
> core nodes for publication and archival)  it is not always obvious  
> when a dataset is 'complete'. Should the publisher wait until the  
> modellers say 'the dataset is complete', or publish partial datasets  
> with the idea that only QC level 3 would require the 'complete' dataset?
>
> (b) If one or a few variables in a dataset are found to be in error,  
> there may be a considerable delay before the modelling center can  
> replace the erroneous data. Again, should the remaining valid data be  
> published, with the idea that some users will not care about the  
> missing variables but want prompt access to the remaining variables?
>
> One last comment: in AR4 some datasets ( and some variables within  
> those datasets) were much more heavily subscribed than others. In  
> particular, the 20th century historical runs were downloaded with  
> greater frequency. If it were possible to anticipate which datasets  
> would be of greatest interest, it would be a good idea to prioritize  
> the associated QC and publication.
>
> Best regards,
>
> Bob
>
> On Mar 15, 2010, at 8:06 AM, Michael Lautenschlager wrote:
>
>   
>> Dear all,
>>
>> as Stephen just announced I merged the contributions into two  
>> documents for our tomorrow's telco. The QC document contains the  
>> complete set of flow charts and highlights open issues and points  
>> of discussion.
>>
>> Best wishes, Michael
>>
>> -- 
>> ---------------
>> Dr. Michael Lautenschlager
>>
>> German Climate Computing Centre (DKRZ)
>> World Data Center Climate (WDCC)
>> ADDRESS: Bundesstrasse 45a, D-20146 Hamburg, Germany
>> PHONE:   +4940-460094-118
>> E-Mail:  lautenschlager at dkrz.de
>>
>> URL:    http://www.dkrz.de/
>>         http://www.wdc-climate.de/
>>
>> (attachments: data-citations-100311-mil-bnl.pdf,
>> CMIP5-AR5-QualityControl-20100315.pdf)
>> _______________________________________________
>> GO-ESSP-TECH mailing list
>> GO-ESSP-TECH at ucar.edu
>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>     
>
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>   

-- 
----------- DKRZ / Data Management -----------

Martina Stockhause
Deutsches Klimarechenzentrum
Bundesstr. 45a
D-20146 Hamburg
Germany

phone:	+49-40-460094-122
FAX:	+49-40-460094-106
e-mail:	martina.stockhause at zmaw.de

----------------------------------------------


