[Go-essp-tech] Access control for data with different QC Level

Tue Jul 20 11:09:43 MDT 2010

Dear all,

A few corrections of Martin's write-up:

1. "expt:{all decadal – except 
}/frequency:6hr/realm:atmos/variables:{all of table 3hr}" should read 
"/3hr" not "/6hr"
2. "preindustrial control" 3hr data *is* included in *both* requested 
and replicated categories. There is no inconsistency. [Martin, where did 
you get the idea it wasn't requested?]
3. The "historical" 3hr data is only requested for years 1960-2005, and 
all of that is to be replicated. [Again, I'm not sure why you thought 56 
years were requested, but only 46 years replicated.]

Best regards,
Karl

On 7/20/10 9:13 AM, martin.juckes at stfc.ac.uk wrote:
> Hello,
>
> Attached is an outline, based on the "achive_size" spreadsheet Karl
> produced.
>
> Cheers,
> Martin
>
>    
>> -----Original Message-----
>> From: Bryan Lawrence [mailto:bryan.lawrence at stfc.ac.uk]
>> Sent: 20 July 2010 16:52
>> To: Juckes, Martin (STFC,RAL,SSTD)
>> Cc: go-essp-tech at ucar.edu; Karl E.Taylor; Cinquini, Luca (3880)
>> Subject: Re: [Go-essp-tech] Access control for data with different QC
>> Level
>>
>> Hi Martin
>>
>> On Tuesday 20 July 2010 14:58:08 Juckes, Martin (STFC,RAL,SSTD) wrote:
>>      
>>> If the un-replicated (and hence less quality controlled) data is to
>>> be less widely available, then I think we have to re-consider what
>>> gets replicated. In particular, the 3-hourly, 2d fields have been
>>> requested by TGICA for the impacts community (and when I mentioned
>>> this at a recent meeting with hydrologists they were indeed very
>>> keen on this data). The current definition of "replicated" excludes
>>> around 200Tb of 3 hourly data from the decadal projections.
>>>        
>>
>> Hmm. I don't know when that happened. Last time I looked it was in ...
>> I certainly think it needs to be. I know a lot of folk who will be
>> looking to use that.  Perhaps it's worth reminding me what data is not
>> being replicated (of the requested). I had thought it was the ocean 3d
>> fields + (can't remember, but didn't think it was the tgica data).
>>
>>      
>>> It may be that the last point (which I hadn't noticed before) will
>>> force us to reconsider the replication issue. TGICA may well want to
>>> have the data that falls under their request included in the data
>>> which is migrated/tagged into the IPCC DDC: and this would mean that
>>> it all would have to be quality controlled all the way to level 3.
>>>        
>> By hook or crook we will need this data to make it to level 3.
>>
>> Thanks for picking up on this.
>> Bryan
>>
>>
>>      
>>> Regards,
>>> Martin
>>>
>>>        
>>>> -----Original Message-----
>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
>>>> bounces at ucar.edu] On Behalf Of Bryan Lawrence
>>>> Sent: 20 July 2010 14:19
>>>> To: go-essp-tech at ucar.edu; Karl E.Taylor
>>>> Cc: Cinquini, Luca (3880)
>>>> Subject: Re: [Go-essp-tech] Access control for data with different
>>>> QC Level
>>>>
>>>> Hi Martina
>>>>
>>>> I think answering the political and technical in one shot might
>>>> help here, so I'm going to try. This was my understanding of all
>>>> the various conversations we have had.
>>>>
>>>> Karl, Balaji, Martin: can I trouble you to read through? I'll
>>>> highlight the bits where you need to pay attention!
>>>>
>>>>          
>>>>> 1. My understanding was that at QC L1 the CMIP5 modelling centres,
>>>>> at QC L2 non-commercial researchers and at QC L3 every registered
>>>>> user can access the data.
>>>>>            
>>>> The issue of commercial v non-commercial is a decision for the
>>>> modelling centres. Given the Met Office is now allowing commercial,
>>>> it may be that it has gone away. But that's a licensing decision.
>>>>
>>>> So functionally:
>>>>
>>>> access_token_required (dataset) =
>>>>
>>>>       f(qc_level(dataset), license_type(dataset))
>>>>
>>>> Currently we expect PCMDI to allocate tokens to users.
>>>>
>>>> I expect the following three classes of tokens:
>>>>
>>>> unrestricted_use*
>>>> noncommercial_use
>>>> testing
>>>>
>>>> (* unrestricted still requires citation, I'll get to what I mean by
>>>> citation below.)
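
The mapping access_token_required = f(qc_level, license_type) could be
sketched roughly as below. This is only an illustration: the rule that
QC level 1 data always needs the testing token, and the license_type
strings, are assumptions made for the example and are not agreed
anywhere in this thread.

```python
from enum import Enum


class Token(Enum):
    """The three token classes named in the message above."""
    UNRESTRICTED_USE = "unrestricted_use"
    NONCOMMERCIAL_USE = "noncommercial_use"
    TESTING = "testing"


def access_token_required(qc_level: int, license_type: str) -> Token:
    """Return the token a user needs for a dataset.

    ASSUMPTION (not from the thread): QC level 1 data always needs the
    testing token, and levels 2 and 3 are treated alike, with the
    license deciding between the other two tokens.
    license_type values "noncommercial" / "unrestricted" are hypothetical.
    """
    if qc_level not in (1, 2, 3):
        raise ValueError(f"unknown QC level: {qc_level}")
    if license_type not in ("noncommercial", "unrestricted"):
        raise ValueError(f"unknown license type: {license_type}")
    if qc_level == 1:
        return Token.TESTING
    if license_type == "noncommercial":
        return Token.NONCOMMERCIAL_USE
    return Token.UNRESTRICTED_USE
```

Keeping the rule in one small function means the policy can change
(for example once the wording is agreed with WGCM) without touching
the code that allocates tokens to users.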
>>>>
>>>> And we need machinery to allocate access_token_required to specific
>>>> datasets.
>>>>
>>>> The questions then become:
>>>>
>>>> a) on what grounds and how does PCMDI allocate the tokens?
>>>>
>>>> How should be easy, being part of the user management tooling. I
>>>> don't know what the state of generic ESG tooling for that is, but
>>>> we have this
>>>> sort of tooling available  as part of our normal infrastructure.
>>>>
>>>> On what grounds is more interesting. I'll postpone that a moment.
>>>>
>>>> b) when, where and how, do we set up the "table" which maps
>>>> access_token_required onto the dataset.
>>>>
>>>> Step 1). we need to record qc information.
>>>>
>>>>   - the plan is to build a tool for that in September, to be
>>>>     complete by the end of September, and it will export CIM quality
>>>>     documents via an atom feed. It will be independent of the
>>>>     questionnaire, and folks could deploy it anywhere, even on a
>>>>     data node.
>>>>
>>>>   (Martina: that can cover the DOI information too.)
>>>>   (I have someone in mind to do the work.)
>>>>   - we get qc level one for free (the data can't be published by a
>>>>     data node without being qc level one).
>>>>   - given that qc level 2 can only be dealt with at DKRZ, PCMDI and
>>>>     BADC since it will apply to replicated data, then we only need to
>>>>     deploy the tool there, and gateways will only need to harvest
>>>>     from three places
>>>>   - qc level 3 information can only be made available at DKRZ
>>>>
>>>> Step 2) the qc information needs to be harvested.
>>>> - this is a gateway issue,  need only to harvest from the three
>>>> above, plus the QC one at DKRZ.
>>>>
>>>>   - It needs to be mapped onto each replica.
>>>>
>>>> Step 3) this information needs to propagate into the PDP (or
>>>> whatever we
>>>> are calling the policy decision point, I've lost track of the
>>>> names).
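
Steps 1 to 3 above might look something like the following once the QC
records have been pulled off the atom feeds. It is a sketch under
assumptions: the QCRecord fields, the function names and the shape of
the replica table are invented for illustration, and the feed parsing
itself is left out.

```python
from dataclasses import dataclass

# The three sites named above as the only places QC records come from.
QC_SOURCES = ("DKRZ", "PCMDI", "BADC")


@dataclass(frozen=True)
class QCRecord:
    """One harvested QC statement (hypothetical structure)."""
    dataset_id: str
    qc_level: int
    source: str


def merge_qc_records(records):
    """Step 2: keep the highest QC level seen per dataset.

    Records from sites outside QC_SOURCES are ignored.
    """
    levels = {}
    for rec in records:
        if rec.source not in QC_SOURCES:
            continue
        levels[rec.dataset_id] = max(levels.get(rec.dataset_id, 1), rec.qc_level)
    return levels


def map_onto_replicas(levels, replicas):
    """Step 2 (cont.): attach the QC level to every replica of a dataset.

    replicas maps dataset_id -> list of node names. A dataset with no
    harvested record defaults to level 1, since publication implies QC L1.
    The returned table is what would be handed to the PDP in step 3.
    """
    table = {}
    for dataset_id, nodes in replicas.items():
        level = levels.get(dataset_id, 1)
        for node in nodes:
            table[(dataset_id, node)] = level
    return table
```

For example, a level 2 record from BADC for one dataset would give
every listed replica of that dataset level 2, while any dataset with
no record stays at level 1.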
>>>>
>>>> Karl, Balaji, Martin:
>>>>
>>>> At this point, we should recognise that we have the ability to
>>>> discriminate  between
>>>> QC  L1
>>>> QC  L2 *only for replicants*
>>>> QC  L3 *only for replicants*
>>>>
>>>> We will assign DOIs *only for replicants*.  (At least in the first
>>>> instance).
>>>>
>>>> This brings me back to on what grounds should we allocate access
>>>> tokens,
>>>> and what licenses should be associated with them.
>>>>
>>>> I thought we had agreed on something like (abbreviated, the exact
>>>> wording needs agreement as per Balaji's email):
>>>>
>>>> testing: you can use this data to exercise this software, and
>>>> report issues with the data to the originator. you may not publish
>>>> science with
>>>> this data, without the express permission of the data originator.
>>>>
>>>> unrestricted: you can do anything you like with the data but you
>>>> must include citations in publications.
>>>>
>>>> non-commercial: there are some restrictions on use, and you must
>>>> include
>>>> citations in publications ...
>>>>
>>>> Before continuing, this brings me to a point of disagreement with
>>>> Karl.
>>>> Users *should* absolutely care about the distinction between
>>>> replicated and non-replicated data. It's a quality thing. They can
>>>> use the qc stuff with more confidence.
>>>>
>>>> However, as it stands, we can't give a DOI to output which is not
>>>> replicated, but people will need to use it. I *do* think it's ok to
>>>> restrict this to modellers (despite Martin's point about what PCMDI
>>>> are advertising). I think most of the non-modelling community will
>>>> be happy with the replicated data ...
>>>>
>>>> ... and I think WGCM will buy that argument.
>>>>
>>>> But for the modellers using the L1 data which cannot be qc'd, then
>>>> we need a form of words for an old style acknowledgement or a
>>>> citation into the data equivalent of the "grey literature".
>>>> (probably ok to give a url.)
>>>>
>>>> So, now the criteria.
>>>>
>>>> testing should be given to modellers as required by the originators;
>>>> the other two in the normal way by default to anyone for replicated
>>>> data, only for special people who sign up to the restriction above
>>>> for the non-replicated data.
>>>>
>>>> (nb: nothing in the above precludes downloading and using
>>>> replicated data from other than DKRZ, BADC, PCMDI ... if you have
>>>> the tokens),
>>>>
>>>> So that's how I thought we'd agreed it all, but I concede it had
>>>> never been written down in one place.
>>>>
>>>> Cheers
>>>> Bryan
>>>>
>>>>          
>>>>> 2. Bryan please correct me: There is QC L1 as in 1. and after QC
>>>>> L2 and QC L3 all registered users have access to the core data.
>>>>> Maybe only non-commercial researchers are granted access to the
>>>>> non-core data.
>>>>>
>>>>> This is more a political issue.
>>>>>
>>>>> In either case the QC Level has to be communicated to the ESG.
>>>>> Luca suggests that the portal uses the AtomFeed of the
>>>>> questionnaire to harvest the QC Flag. And after QC L3 the DOI
>>>>> link as well. QC and DOI are information about data, so the right
>>>>> place in metafor CIM would be the dataObject on the hierarchy
>>>>> level "DRS experiment". Which parts of CIM do you harvest?
>>>>>
>>>>> My biggest question at the moment is how to deliver the QC
>>>>> information to CIM. For the DOI target page a few additional
>>>>> pieces of information are needed on citation and contact.
>>>>> Stephen suggested typing them into the questionnaire. This would
>>>>> slow the publication process down and is error-prone. We need an
>>>>> automated CIM update there. The metafor people were against that
>>>>> solution as well because the questionnaire is meant for an initial
>>>>> metadata ingest by the modeling centers. Bryan, how do we get the
>>>>> information into the questionnaire, so that it can be harvested by
>>>>> the ESG?
>>>>> What would be the alternatives to the AtomFeed/questionnaire as
>>>>> harvesting source for the quality level and DOI information?
>>>>>
>>>>> My second biggest question is where to put the information in the
>>>>> CIM. I sent my interpretation / suggestion to the metafor list,
>>>>> but it didn't start a discussion. Examples for a simulationRun
>>>>> object, on how the dataObjects are referenced and on how the
>>>>> dataObject hierarchies are built, would be of great help. Or
>>>>> metafor just defines how I should send the quality information
>>>>> to them.
>>>>>
>>>>> I moved away from the technical issues, but solving these things
>>>>> is the precondition for the technical solution in the ESG.
>>>>>
>>>>> Thanks a lot,
>>>>> Martina
>>>>>
>>>>> V. Balaji wrote:
>>>>>            
>>>>>> I know we discussed this at the Princeton workshop. I didn't
>>>>>> register some of the implications then.
>>>>>>
>>>>>> I agree that in a technical sense, yes a dataset is "available"
>>>>>> to registered users as soon as it is passed by the publisher.
>>>>>> (QCL1-D). At that point, however, it's incompletely
>>>>>> documented, so I'm not sure it can be declared fully
>>>>>> compliant.
>>>>>>
>>>>>> My understanding is that while users are free to begin working
>>>>>> with the data, they can publish results from the data only when the
>>>>>> dataset is citable, which means it has undergone more rigorous
>>>>>> QC. What they downloaded before QC-L2 is certainly
>>>>>> use-at-your-own-risk because L2's when the "semantic QC" kicks
>>>>>> in. And without QC-L3 it isn't citable.
>>>>>>
>>>>>> I think there is a pretty strong feeling that the modeling
>>>>>> centers' data were used too often without citation or
>>>>>> acknowledgment last time, which is what some of the more formal
>>>>>> QC levels this time, e.g. DOIs tied to data publication, are
>>>>>> trying to avoid. Assuming the QC document is adopted by the WGCM,
>>>>>> it will be a requirement for downstream users to cite datasets.
>>>>>>
>>>>>> So, QC-L1D data are "available" in the sense that the 1s and 0s
>>>>>> may be downloaded, but they're not licensed yet for "do whatever
>>>>>> you like with them"... perhaps?
>>>>>>
>>>>>> It's pretty important that we come up with language that is
>>>>>> clear what one can and cannot do with data at various levels
>>>>>> of QC. I've talked with Karl and Ron and others about making
>>>>>> WGCM the authority for this, so whatever words we use have to be
>>>>>> run by them.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Cinquini, Luca (3880) writes:
>>>>>>              
>>>>>>> Hi Estani,
>>>>>>>
>>>>>>> I concur with what Eric said, and to reiterate, my understanding
>>>>>>> is that as soon as the data is published with QCL1, it will be
>>>>>>> available to registered users. Maybe Bob, Dean or Karl can
>>>>>>> comment if my understanding is correct or not.
>>>>>>> thanks, Luca
>>>>>>>
>>>>>>> On Jul 19, 2010, at 2:52 PM, Eric Nienhouse wrote:
>>>>>>>                
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We've had a number of discussions on the topic of QC level
>>>>>>>> and data access.  However, I feel we don't yet have a formal
>>>>>>>> definition of the requirements relating to this area.
>>>>>>>>
>>>>>>>> I believe it is important to clarify and define the following two
>>>>>>>> QC related areas:
>>>>>>>>
>>>>>>>> 1)  Who is the authoritative source of the QC level and how
>>>>>>>> this information is propagated through the system?
>>>>>>>>
>>>>>>>> 2)  How does QC level apply to data access policy (e.g. access
>>>>>>>> control)?
>>>>>>>>
>>>>>>>> I would propose discussing this as a future GO-ESSP telco
>>>>>>>> agenda topic, with the intention we document the outcome.
>>>>>>>>
>>>>>>>> Perhaps we can discuss this further via email and work
>>>>>>>> towards capturing the system requirements and related
>>>>>>>> policies in the meanwhile.
>>>>>>>>
>>>>>>>> Please note that there are plans to expose the QC Level
>>>>>>>> within the Gateway UI once the data flow is identified.
>>>>>>>> However, data access control is based upon the group (e.g.
>>>>>>>> role) auth-z attribute (such as "CMIP5 Research") and does
>>>>>>>> not currently rely on the QC Level explicitly.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -Eric
>>>>>>>>
>>>>>>>> Estanislao Gonzalez wrote:
>>>>>>>>                  
>>>>>>>>> Hi Luca,
>>>>>>>>>
>>>>>>>>> to sum things up (and correct me Martina/Bryan if I'm
>>>>>>>>> wrong):
>>>>>>>>>
>>>>>>>>> 1) Published data have QC L1-Data "per se",  and will be
>>>>>>>>> available to a very selected group only (which doesn't seem
>>>>>>>>> to be the group you mention, but I might be wrong).
>>>>>>>>> 2) When acquiring QC L2 the data should be accessible to a
>>>>>>>>> broader although still confined group. This check will be
>>>>>>>>> performed by DKRZ and BADC and the information stored
>>>>>>>>> somewhere (not sure where though). Neither BADC nor DKRZ has
>>>>>>>>> access to all data-nodes, so the information will
>>>>>>>>> definitely be stored on some "neutral grounds" (CIM DB?).
>>>>>>>>> 3) QC L3 == DOI acquired == publication. At this stage data
>>>>>>>>> will be available to any registered user.
>>>>>>>>>
>>>>>>>>> If I'm correct, then the security service must check
>>>>>>>>> "somehow" the QC level of the file in order to proceed with
>>>>>>>>> the authorization as it is currently implemented (thus
>>>>>>>>> comparing roles).
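
The check described above (look up the QC level, then compare roles as
the authorization does today) could be sketched like this. The role
names are hypothetical, chosen to mirror the three audiences mentioned
in this thread, and nothing here reflects how the ESG security service
is actually implemented.

```python
# Roles allowed per QC level. None means "any registered user".
# ASSUMPTION: role names are made up for illustration only.
ROLES_BY_QC_LEVEL = {
    1: {"cmip5_modelling_centre"},
    2: {"cmip5_modelling_centre", "noncommercial_research"},
    3: None,
}


def is_authorized(user_roles, qc_level, registered=True):
    """Grant access if the user holds a role permitted at this QC level."""
    if not registered:
        return False
    if qc_level not in ROLES_BY_QC_LEVEL:
        raise ValueError(f"unknown QC level: {qc_level}")
    allowed = ROLES_BY_QC_LEVEL[qc_level]
    if allowed is None:
        return True
    # Role comparison: any overlap between held and allowed roles suffices.
    return bool(allowed & set(user_roles))
```

The only new piece compared with a pure role check is the lookup keyed
on QC level, which is why the level has to reach the security service
from wherever it is stored.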
>>>>>>>>>
>>>>>>>>> Any comments anyone?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Estani
>>>>>>>>>
>>>>>>>>> Cinquini, Luca (3880) wrote:
>>>>>>>>>                    
>>>>>>>>>> Hi Bryan, Martina,
>>>>>>>>>> I agree that these issues need to be discussed better, but
>>>>>>>>>> here are some considerations, which may in some cases only
>>>>>>>>>> reflect my understanding:
>>>>>>>>>>
>>>>>>>>>> 1) we talked about the QC flag for Levels 2 and 3 to be set
>>>>>>>>>> in the metafor questionnaire, and be propagated through
>>>>>>>>>> the atom feed to the gateways
>>>>>>>>>>
>>>>>>>>>> 2) I thought that in order not to delay data distribution,
>>>>>>>>>> as soon as the data has QC level 1 (i.e. it has been
>>>>>>>>>> processed by the publisher), it will be available to
>>>>>>>>>> registered users of the CMIP5 research and commercial
>>>>>>>>>> groups
>>>>>>>>>>
>>>>>>>>>> 3) At this time there is nothing in the ESG access control
>>>>>>>>>> model that ties the access attributes to the QC flags.
>>>>>>>>>>
>>>>>>>>>> Thanks, luca
>>>>>>>>>>
>>>>>>>>>> On Jul 19, 2010, at 7:39 AM, Bryan Lawrence <bryan.lawrence at stfc.ac.uk> wrote:
>>>>          
>>>>>>>>>>> Hi Martina
>>>>>>>>>>>
>>>>>>>>>>> We definitely need to formalise some of this, so thanks
>>>>>>>>>>> for bringing it up.
>>>>>>>>>>>
>>>>>>>>>>> What I had thought we were proposing was that L2 and L3
>>>>>>>>>>> data have effectively the same restrictions ...
>>>>>>>>>>>
>>>>>>>>>>> ... but your fundamental point (I think) is how do we
>>>>>>>>>>> assign the QC, and how does the security software get
>>>>>>>>>>> that information? Ie what is the workflow that needs to
>>>>>>>>>>> exist. We do need to bottom that out.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Bryan
>>>>>>>>>>>
>>>>>>>>>>> On Monday 19 July 2010 13:43:59 Martina Stockhause wrote:
>>>>>>>>>>>                        
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I had a little discussion with Estani about how the
>>>>>>>>>>>> different and changing access constraints on the data
>>>>>>>>>>>> depending on their QC levels are realized. It came out
>>>>>>>>>>>> that we don't really know.
>>>>>>>>>>>>
>>>>>>>>>>>> We have on the one hand the user with a special role, e.g.
>>>>>>>>>>>> "scientific, non-commercial user", who has access to data
>>>>>>>>>>>> on QC L3 like every registered user and QC L2 because of
>>>>>>>>>>>> his role. On the other hand, the data has a quality
>>>>>>>>>>>> attribute (QC Level or QC Flag), which defines the
>>>>>>>>>>>> access restriction of the data. For data access a
>>>>>>>>>>>> mechanism has to check user role and data attribute,
>>>>>>>>>>>> before access is granted or denied.
>>>>>>>>>>>>
>>>>>>>>>>>> How does the data get this quality attribute?
>>>>>>>>>>>> How is the user role checked against this quality
>>>>>>>>>>>> attribute?
>>>>>>>>>>>>
>>>>>>>>>>>> For QC L3 we don't need that mechanism, because every
>>>>>>>>>>>> registered user has access to all CMIP5 data, but for QC
>>>>>>>>>>>> L1 and L2 such access restrictions exist.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>> Martina
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>>> http://*mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>>>                          
>>>>>>>>>>> --
>>>>>>>>>>> Bryan Lawrence
>>>>>>>>>>> Director of Environmental Archival and Associated Research
>>>>>>>>>>> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
>>>>>>>>>>> STFC, Rutherford Appleton Laboratory
>>>>>>>>>>> Phone +44 1235 445012; Fax ... 5848;
>>>>>>>>>>> Web: home.badc.rl.ac.uk/lawrence