[Go-essp-tech] Access control for data with different QC Level

Tue Jul 20 07:19:21 MDT 2010

Hi Martina

I think answering the political and technical in one shot might help 
here, so I'm going to try. This was my understanding of all the various 
conversations we have had. 

Karl, Balaji, Martin: can I trouble you to read through? I'll highlight 
the bits where you need to pay attention!

> 1. My understanding was that at QC L1 the CMIP5 modelling centres, at
> QC L2 non-commercial researchers and at QC L3 every registered user
> can access the data.

The issue of commercial v non-commercial is a decision for the modelling 
centres. Given the Met Office is now allowing commercial use, it may be 
that it has gone away. But that's a licensing decision.

So functionally:

access_token_required (dataset) = 
     f(qc_level(dataset), license_type(dataset))

Currently we expect PCMDI to allocate tokens to users.

I expect the following three classes of tokens:

unrestricted_use*
noncommercial_use
testing

(* unrestricted still requires citation; I'll get to what I mean by 
citation below.)

And we need machinery to allocate access_token_required to specific 
datasets.
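
As a rough sketch only, the function above might look something like this 
in Python. The actual mapping is an assumption made for illustration (QC 
level 1 data needing the testing token, QC level 2 and 3 data following 
the licence); the names LicenseType and Token are hypothetical, and the 
real rules still need agreement.

```python
from enum import Enum


class Token(Enum):
    # The three token classes listed above.
    UNRESTRICTED_USE = "unrestricted_use"
    NONCOMMERCIAL_USE = "noncommercial_use"
    TESTING = "testing"


class LicenseType(Enum):
    # Hypothetical licence labels; the modelling centres decide the real ones.
    UNRESTRICTED = "unrestricted"
    NONCOMMERCIAL = "noncommercial"


def access_token_required(qc_level: int, license_type: LicenseType) -> Token:
    """Return the token class a user needs for a dataset.

    Assumed rule (not agreed policy): QC level 1 data needs the testing
    token; QC level 2 and 3 data need the token matching the licence.
    """
    if qc_level not in (1, 2, 3):
        raise ValueError(f"unknown QC level: {qc_level}")
    if qc_level == 1:
        return Token.TESTING
    if license_type is LicenseType.NONCOMMERCIAL:
        return Token.NONCOMMERCIAL_USE
    return Token.UNRESTRICTED_USE
```

Keeping the rule in one small function means the policy can change 
without touching whatever harvests the QC level or checks the tokens.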

The questions then become: 

a) on what grounds and how does PCMDI allocate the tokens? 

How should be easy, being part of the user management tooling. I don't 
know what the state of generic ESG tooling for that is, but we have this 
sort of tooling available  as part of our normal infrastructure.

On what grounds is more interesting. I'll postpone that a moment.

b) when, where and how, do we set up the "table" which maps 
access_token_required onto the dataset.

Step 1). we need to record qc information.
 - the plan is to build a tool for that in September, to be complete by 
the end of September, and it will export CIM quality documents via an 
atom feed. It will  be independent of the questionnaire, and folks could 
deploy it anywhere, even on a data node.
 (Martina: that can cover the DOI information too.)
 (I have someone in mind to do the work.)
 - we get qc level one for free (the data can't be published by a data 
node without being qc level one).
 - given that qc level 2 can only be dealt with at DKRZ, PCMDI and BADC, 
since it will apply to replicated data, we only need to deploy the 
tool there, and gateways will only need to harvest from three places.
 - qc level 3 information can only be made available at DKRZ.

Step 2) the qc information needs to be harvested.
 - this is a gateway issue; gateways need only harvest from the three 
above, plus the QC one at DKRZ. 
 - It needs to be mapped onto each replica.

Step 3) this information needs to propagate into the PDP (or whatever we 
are calling the policy decision point, I've lost track of the names).
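
To make steps 2 and 3 a little more concrete, here is a minimal Python 
sketch of the "table" a gateway could build once the QC records have been 
harvested. It assumes the feeds have already been parsed into simple 
records; QCRecord, the dataset identifiers and the node names are all 
hypothetical, and nothing here reflects an existing ESG or CIM interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QCRecord:
    dataset_id: str   # hypothetical identifier shared by all replicas
    qc_level: int     # 1, 2 or 3
    source: str       # site that published the record, e.g. "DKRZ"


def highest_qc_levels(records):
    """Collapse harvested records to the highest QC level per dataset."""
    levels = {}
    for rec in records:
        levels[rec.dataset_id] = max(levels.get(rec.dataset_id, 1), rec.qc_level)
    return levels


def qc_level_per_replica(levels, replicas):
    """Map each (dataset_id, node) replica onto its dataset's QC level.

    Datasets with no harvested record default to level 1, since data
    cannot be published by a data node without being QC level 1.
    """
    return {(ds, node): levels.get(ds, 1) for ds, node in replicas}


# Example: two records for the same dataset, harvested from two sites.
records = [QCRecord("ds-a", 2, "BADC"), QCRecord("ds-a", 3, "DKRZ")]
replicas = [("ds-a", "PCMDI"), ("ds-a", "BADC"), ("ds-b", "node-x")]
table = qc_level_per_replica(highest_qc_levels(records), replicas)
```

The resulting table is what would then need to reach the policy decision 
point, however that propagation ends up being done.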

Karl, Balaji, Martin:

At this point, we should recognise that we have the ability to 
discriminate  between
QC  L1
QC  L2 *only for replicants*
QC  L3 *only for replicants*

We will assign DOIs *only for replicants*.  (At least in the first 
instance).

This brings me back to on what grounds should we allocate access tokens, 
and what licenses should be associated with them.

I thought we had agreed on something like (abbreviated, the exact 
wording needs agreement as per Balaji's email):

testing: you can use this data to exercise this software, and report 
issues with the data to the originator. You may not publish science with 
this data without the express permission of the data originator.

unrestricted: you can do anything you like with the data but you must 
include citations in publications.

non-commercial: there are some restrictions on use, and you must include 
citations in publications ...

Before continuing, this brings me to a point of disagreement with Karl. 
Users *should* absolutely care about the distinction between replicated 
and non-replicated data. It's a quality thing. They can use the qc stuff 
with more confidence.

However, as it stands, we can't give a DOI to output which is not 
replicated, but people will need to use it. I *do* think it's ok to 
restrict this to modellers (despite Martin's point about what PCMDI are 
advertising). I think most of the non-modelling community will be happy 
with the replicated data ...

... and I think WGCM will buy that argument.

But for the modellers using the L1 data which cannot be qc'd, we 
need a form of words for an old style acknowledgement or a citation into 
the data equivalent of the "grey literature" (probably ok to give a 
url).

So, now the criteria.

testing should be given to modellers as required by the originators;
the other two in the normal way, by default to anyone for replicated 
data, and only to special people who sign up to the restriction above for 
the non-replicated data.
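
Written as a rule, those criteria might be sketched as below. This is 
only an illustration of my reading of them: the function name may_grant 
and its parameters are hypothetical, and "signed_restriction" stands for 
whatever form of sign-up we eventually agree on.

```python
def may_grant(token, *, replicated, is_modeller=False,
              originator_requires=False, signed_restriction=False):
    """Decide whether a token may be allocated to a user for some data.

    Assumed reading of the criteria:
      - testing: modellers only, and only when the originator requires it
      - unrestricted_use / noncommercial_use: anyone for replicated data;
        for non-replicated data, only users who signed the restriction
    """
    if token == "testing":
        return is_modeller and originator_requires
    if token in ("unrestricted_use", "noncommercial_use"):
        return replicated or signed_restriction
    raise ValueError(f"unknown token class: {token}")


# Example: an ordinary registered user asking for replicated data.
ok = may_grant("unrestricted_use", replicated=True)
```

Whether the non-replicated case should also require the user to be a 
modeller is left open here.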

(nb: nothing in the above precludes downloading and using replicated 
data from other than DKRZ, BADC, PCMDI ... if you have the tokens.)

So that's how I thought we'd agreed it all, but I concede it had never 
been written down in one place.

Cheers
Bryan

> 2. Bryan please correct me: There is QC L1 as in 1. and after QC L2
> and QC L3 all registered users have access to the core data. Maybe
> only non-commercial researchers are granted access to the non-core
> data.
> 
> This is more a political issue.
> 
> In either case the QC Level has to be communicated to the ESG.
> Luca suggests that the portal uses the AtomFeed of the questionnaire
> to harvest the QC Flag. And after QC L3 the DOI link as well. QC and
> DOI are information about the data, so the right place in metafor CIM
> would be the dataObject on the hierarchy level "DRS experiment".
> Which parts of CIM do you harvest?
> 
> My biggest question at the moment is how to deliver the QC
> information to CIM. For the DOI target page there are a few
> additional information pieces needed on citation and contact.
> Stephen suggested to type them into the questionnaire. This would
> slow the publication process down and is error-prone. We need an
> automated CIM update there. The metafor people were against that
> solution as well because the questionnaire is meant for an initial
> metadata ingest by the modeling centers. Bryan, how do we get the
> information in the questionnaire, so that it can be harvested by the
> ESG?
> Which would be the alternatives to the AtomFeed/questionnaire as
> harvesting source for the quality level and DOI information?
> 
> My second biggest question is where to put the information in the
> CIM. I sent my interpretation / suggestion to the metafor list, but
> it didn't start a discussion. Examples for a simulationRun object,
> on how the dataObjects are referenced and on how the dataObject
> hierarchies are built, would be of great help. Or metafor just
> defines how I should send the quality information to them.
> 
> I moved away from the technical issues, but to solve these things is
> the precondition for the technical solution in the ESG.
> 
> Thanks a lot,
> Martina
> 
> V. Balaji wrote:
> > I know we discussed this at the Princeton workshop. I didn't
> > register some of the implications then.
> > 
> > I agree that in a technical sense, yes a dataset is "available" to
> > registered users as soon as it is passed by the publisher.
> > (QCL1-D). At that point, however, it's incompletely documented, so
> > I'm not sure it can be declared fully compliant.
> > 
> > My understanding is that while users are free to begin working with
> > the data, they can publish results from the data only when the
> > dataset is citable, which means it has undergone more rigorous QC.
> > What they downloaded before QC-L2 is certainly
> > use-at-your-own-risk because L2's when the "semantic QC" kicks in.
> > And without QC-L3 it isn't citable.
> > 
> > I think there is a pretty strong feeling that the modeling centers'
> > data were used too often without citation or acknowledgment last
> > time, which is what some of the more formal QC levels this time,
> > e.g. DOIs tied to data publication, are trying to avoid. Assuming
> > the QC document is adopted by the WGCM, it will be a requirement
> > for downstream users to cite datasets.
> > 
> > So, QC-L1D data are "available" in the sense that the 1s and 0s may
> > be downloaded, but they're not licensed yet for "do whatever you
> > like with them"... perhaps?
> > 
> > It's pretty important that we come up with language that is clear
> > what one can and cannot do with data at various levels of QC. I've
> > talked with Karl and Ron and others about making WGCM the authority
> > for this, so whatever words we use have to be run by them.
> > 
> > Thanks,
> > 
> > Cinquini, Luca (3880) writes:
> >> Hi Estani,
> >> 
> >> I concur with what Eric said, and to reiterate, my understanding is
> >> that as soon as the data is published with QCL1,
> >> 
> >> it will be available to registered users. Maybe Bob, Dean or Karl
> >> can comment if my understanding is correct or not. thanks, Luca
> >> 
> >> On Jul 19, 2010, at 2:52 PM, Eric Nienhouse wrote:
> >>> Hi All,
> >>> 
> >>> We've had a number of discussions on the topic of QC level and
> >>> data access.  However, I feel we don't yet have a formal
> >>> definition of the requirements relating to this area.
> >>> 
> >>> I believe it is important to clarify and define the following two
> >>> QC related areas:
> >>> 
> >>> 1)  Who is the authoritative source of the QC level and how this
> >>> information is propagated through the system?
> >>> 
> >>> 2)  How does QC level apply to data access policy (eg. access
> >>> control)?
> >>> 
> >>> I would propose discussing this as a future GO-ESSP telco agenda
> >>> topic, with the intention we document the outcome.
> >>> 
> >>> Perhaps we can discuss this further via email and work towards
> >>> capturing the system requirements and related policies in the
> >>> meanwhile.
> >>> 
> >>> Please note that there are plans to expose the QC Level within
> >>> the Gateway UI once the data flow is identified.  However, data
> >>> access control is based upon the group (eg. role) auth-z
> >>> attribute (such as "CMIP5 Research") and does not currently rely
> >>> on the QC Level explicitly.
> >>> 
> >>> Thanks,
> >>> 
> >>> -Eric
> >>> 
> >>> Estanislao Gonzalez wrote:
> >>>> Hi Luca,
> >>>> 
> >>>> to sum things up (and correct me Martina/Bryan if I'm wrong):
> >>>> 
> >>>> 1) Published data have QC L1-Data "per se",  and will be
> >>>> available to a very selected group only (which doesn't seem to
> >>>> be the group you mention, but I might be wrong).
> >>>> 2) When acquiring QC L2 the data should be accessible to a
> >>>> broader although still confined group. This check will be
> >>>> performed by DKRZ and BADC and the information stored somewhere
> >>>> (not sure where though). Neither BADC nor DKRZ has access to all
> >>>> data-nodes, so the information will definitely be stored on
> >>>> some "neutral grounds" (CIM DB?).
> >>>> 3) QC L3 == DOI acquired == publication. At this stage data will
> >>>> be available to any registered user.
> >>>> 
> >>>> If I'm correct, then the security service must check "somehow"
> >>>> the QC level of the file in order to proceed with the
> >>>> authorization as it is currently implemented (thus comparing
> >>>> roles).
> >>>> 
> >>>> Any comments anyone?
> >>>> 
> >>>> Thanks,
> >>>> Estani
> >>>> 
> >>>> Cinquini, Luca (3880) wrote:
> >>>>> Hi Bryan, Martina,
> >>>>> I agree that these issues need to be discussed better, but here
> >>>>> are some considerations, which may in some cases only reflect
> >>>>> my understanding:
> >>>>> 
> >>>>> 1) we talked about the QC flag for Levels 2 and 3 to be set in
> >>>>> the metafor questionnaire, and be propagated through the atom
> >>>>> feed to the gateways
> >>>>> 
> >>>>> 2) I thought that in order not to delay data distribution, as
> >>>>> soon as the data has QC level 1 (i.e. it has been processed by
> >>>>> the publisher), it will be available to registered users of the
> >>>>> CMIP5 research and commercial groups
> >>>>> 
> >>>>> 3) At this time there is nothing in the ESG access control
> >>>>> model that ties the access attributes to the QC flags.
> >>>>> 
> >>>>> Thanks, luca
> >>>>> 
> >>>>> On Jul 19, 2010, at 7:39 AM, Bryan Lawrence
> >>>>> <bryan.lawrence at stfc.ac.uk> wrote:
> >>>>>> Hi Martina
> >>>>>> 
> >>>>>> We definitely need to formalise some of this, so thanks for
> >>>>>> bringing it up.
> >>>>>> 
> >>>>>> What I had thought we were proposing was that L2 and L3 data
> >>>>>> have effectively the same restrictions ...
> >>>>>> 
> >>>>>> ... but your fundamental point (I think) is how do we assign
> >>>>>> the QC, and how does the security software get that
> >>>>>> information? Ie what is the workflow that needs to exist. We
> >>>>>> do need to bottom that out.
> >>>>>> 
> >>>>>> Thanks
> >>>>>> Bryan
> >>>>>> 
> >>>>>> On Monday 19 July 2010 13:43:59 Martina Stockhause wrote:
> >>>>>>> Hi all,
> >>>>>>> 
> >>>>>>> I had a little discussion with Estani about how the different
> >>>>>>> and changing access constraints on the data depending on
> >>>>>>> their QC levels are realized. It came out that we don't
> >>>>>>> really know.
> >>>>>>> 
> >>>>>>> We have on the one hand the user with a special role e.g.
> >>>>>>> "scientific, non-commercial user", who has access to data on
> >>>>>>> QC L3 like every registered user and QC L2 because of his
> >>>>>>> role. On the other hand, the data has a quality attribute
> >>>>>>> (QC Level or QC Flag), which defines the access restriction
> >>>>>>> of the data. For data access a mechanism has to check user
> >>>>>>> role and data attribute, before access is granted or denied.
> >>>>>>> 
> >>>>>>> How does the data get this quality attribute?
> >>>>>>> How is the user role checked against this quality attribute?
> >>>>>>> 
> >>>>>>> For QC L3 we don't need that mechanism, because every
> >>>>>>> registered user has access to all CMIP5 data, but for QC L1
> >>>>>>> and L2 such access restrictions exist.
> >>>>>>> 
> >>>>>>> Thanks a lot,
> >>>>>>> Martina
> >>>>>>> 
> >>>>>>> 
> >>>>>>> _______________________________________________
> >>>>>>> GO-ESSP-TECH mailing list
> >>>>>>> GO-ESSP-TECH at ucar.edu
> >>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>>> 
> >>>>>> --
> >>>>>> Bryan Lawrence
> >>>>>> Director of Environmental Archival and Associated Research
> >>>>>> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
> >>>>>> STFC, Rutherford Appleton Laboratory
> >>>>>> Phone +44 1235 445012; Fax ... 5848;
> >>>>>> Web: home.badc.rl.ac.uk/lawrence

-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence