[Go-essp-tech] PKI access control on data nodes

Tue Jul 12 08:34:40 MDT 2011

  Hello Jamie,

On 12/07/2011 16:07, Kettleborough, Jamie wrote:
> Hello Estani, Phil, Martin,
>
> Thanks for your replies on this.  I think the summary of what you are telling me is: 'it's harder to provide an API that will give a data user the http token than it is to get the nodes to all use the same PKI method'.
>
> As I understand it, Martin and the BADC are contacting the data node owners to encourage them to support the PKI method.  So I guess I'll wait for the outcome of that.
>
> In the meantime, I'm not sure I've been clear on what our data use is - sorry.   I think it's fairly typical for a multi-model analysis.
>
> The highish level context is something like this (at the risk of teaching grandmothers to suck eggs)... There are a number of scientists, at the Met Office,  hoping to write papers based on the analysis of the CMIP5 multi model ensemble.  Before they can do the analysis to contribute to the paper they'll need to gather the same set of diagnostics from a number of experiments: say piControl, historical, and a couple of rcp simulations.  They'll probably want the initial condition ensembles to help take into account internal variability.  They'll want the data from as many models as are available so they can start to understand/quantify/take into account the possible role of model errors.  Obviously they'd like the data gathering bit to have as little intrusion or interruption in their normal working schedule as possible.  As they want the papers to contribute to the IPCC AR5 they need to have submitted the paper by July 2012 - which means they have about a year to get data, do some analysis, write up and submit - this is a pretty tight timeline, so we want to start gathering data now, even if only for a few models.
>
> We're also on a slow network and have security constraints that make things a bit harder for us.  And some users will want the same data sets, so we want to minimise the amount of 'double fetch'  - so we will provide a local replication of a subset of the CMIP5 archive to MO scientists. But I'm not sure how typical these things are elsewhere.
>

I think this is typical behaviour for a group wanting to contribute to 
the IPCC AR5: a distributed archive, but with multi-centralized/multi-polar 
analysis.

> I thought the least risky way we can do this is to go straight to the data nodes to list what data is available, and to fetch the data from the nodes directly.   The data volumes, and number of publication level datasets we are likely to want, makes doing this via the web based gateways impractical.  I'm also guessing that the data nodes have the most definitive list of what they hold, so it is the most reliable source for listing data.  Clearly though to do this we need 1) a list of nodes to talk to, 2) a way of listing the data on the data node (I think the thredds catalogue gives us this capability) 3) a way of doing the authenticated get from each node.  (1 and 3 are why I started asking the questions I did).  Obviously we want to also be able to protect against corruption during download, and have some way of understanding when and what to do when existing data is corrected, BUT the ability to get data is more important in this first instance.

We follow exactly the same steps. We have code that does this; it is 
still evolving, of course.
If you would like to use it, why not? Contact me.
If there is interest in putting that code on ESGF, why not? Contact me.
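
To make the listing and corruption-check steps concrete, here is a minimal Python sketch. It is not the IPSL code mentioned above, only an illustration. It assumes the node's THREDDS catalogue exposes each file as a dataset element with a urlPath attribute, and that checksums appear as properties named "checksum" and "checksum_type"; those property names, the sample catalogue and the function names are assumptions made for this example and should be checked against a real catalogue.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace of THREDDS InvCatalog 1.0 documents.
THREDDS_NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

# Hypothetical catalogue fragment, for illustration only.
# The checksum shown is the MD5 of the five bytes b"hello".
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="example.dataset" ID="example.dataset">
    <dataset name="tas_example.nc" urlPath="example/tas_example.nc">
      <property name="checksum" value="5d41402abc4b2a76b9719d911017c592"/>
      <property name="checksum_type" value="MD5"/>
    </dataset>
  </dataset>
</catalog>
"""


def list_files(catalog_xml):
    """Return (urlPath, checksum_type, checksum) for each file in a catalogue.

    Datasets without a urlPath are treated as containers and skipped.
    The property names are an assumption (see the note above the code).
    """
    root = ET.fromstring(catalog_xml)
    files = []
    for ds in root.iter(THREDDS_NS + "dataset"):
        url_path = ds.get("urlPath")
        if url_path is None:
            continue
        # findall() only looks at direct children, so properties of
        # nested datasets are not mixed in.
        props = {p.get("name"): p.get("value")
                 for p in ds.findall(THREDDS_NS + "property")}
        files.append((url_path, props.get("checksum_type"), props.get("checksum")))
    return files


def checksum_matches(path, checksum_type, expected, chunk_size=1 << 20):
    """Recompute the digest of a downloaded file and compare with the catalogue."""
    digest = hashlib.new(checksum_type.lower())
    with open(path, "rb") as fh:
        # Read in chunks so that large netCDF files do not fill memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

In use, one would fetch each node's catalogue, call list_files() on it, download every urlPath with the certificate-based method, and call checksum_matches() on the result before accepting the file into the local replica. Files whose catalogue entry has no checksum would need a separate policy.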

Cheers.
Sébastien

> I hope this gives a bit more context to why I've suddenly started chipping in...
>
> Jamie
>
>
>
>
>> -----Original Message-----
>> From: Estanislao Gonzalez [mailto:gonzalez at dkrz.de]
>> Sent: 07 July 2011 15:26
>> To: Kettleborough, Jamie
>> Cc: martin.juckes at stfc.ac.uk; gavin at llnl.gov; go-essp-tech at ucar.edu
>> Subject: Re: [Go-essp-tech] PKI access control on data nodes
>>
>> Hi,
>>
>> There are several problems:
>>
>> 1) Tokens have a short life, and thus urls are disposable.
>> The tokenless method separates the security (which is
>> short-lived) from the url. Thus you can pass the wget script
>> around and it will never expire. It will only work, though,
>> together with a valid certificate.
>> 2) Tokens cannot be inferred, that's the point. So it's
>> impossible to create one (e.g. to download the same data for
>> another model). Only the gateway can do that.
>> 3) Regarding how to get a token, only the Gateway can provide
>> it and there's no API for that (it would make no sense,
>> because you'd have to authenticate anyway for this). This
>> is what Phil was saying.
>>
>> So, if you only plan to download a couple of files
>> immediately at a Gateway, the token security might
>> suit you well.
>> In all other cases you need the tokenless method.
>>
>> Thanks,
>> Estani
>>
>> Am 07.07.2011 15:46, schrieb Kettleborough, Jamie:
>>> Hello Martin,
>>>
>>> I agree it's better to have one method of authorization rather than
>>> many - it's less to code, debug, maintain, support.  But at the moment
>>> we have 2, and I don't really have any information on when the http
>>> token method will be retired.  I know there is an intent to retire it,
>>> but no indication of the timescales.  So I was going for plan B: which
>>> was to look into the option of supporting both.
>>>
>>> A programmatic interface to the http token authorization is possible
>>> surely - how else do the gateways generate those wget scripts?  (Or am
>>> I missing something obvious).  What did you mean by 'adequate data
>>> access'?
>>>
>>> Jamie
>>>
>>>> -----Original Message-----
>>>> From: martin.juckes at stfc.ac.uk [mailto:martin.juckes at stfc.ac.uk]
>>>> Sent: 06 July 2011 22:43
>>>> To: Kettleborough, Jamie; gavin at llnl.gov
>>>> Cc: go-essp-tech at ucar.edu
>>>> Subject: Checksums and PKI access control on data nodes
>>>>
>>>> Hi Jamie,
>>>>
>>>> just picking up something on one of your data node authorization
>>>> threads.
>>>>
>>>> I think programmatic access to data requires PKI security -- I don't
>>>> see any prospect of adequate data access with the http token
>>>> approach.
>>>>
>>>> I think that checksums are also necessary to guarantee data integrity
>>>> -- these are given in the THREDDS catalogues of BADC, IPSL, and CNRM
>>>> -- and CCCMA is in the process of adding them.
>>>>
>>>> I aim to continue contacting data nodes over the coming weeks and
>>>> hope that there will be steady progress in levelling the quality of
>>>> service upwards,
>>>>
>>>> cheers,
>>>> Martin
>>>>
>>>> ________________________________________
>>>> From: go-essp-tech-bounces at ucar.edu
>>>> [go-essp-tech-bounces at ucar.edu] on behalf of Kettleborough, Jamie
>>>> [jamie.kettleborough at metoffice.gov.uk]
>>>> Sent: 05 July 2011 14:48
>>>> To: Gavin M. Bell
>>>> Cc: go-essp-tech at ucar.edu
>>>> Subject: Re: [Go-essp-tech] Data node authorization
>>>>
>>>> Hello Gavin,
>>>>
>>>> thanks for this.  This looks useful.  Any ideas when any
>>>> live/production data nodes will have this version of the service on
>>>> them? - I couldn't find any (but that's part of the problem of
>>>> course). When available, how up to date will the registry be, e.g. are
>>>> there constraints on it, like it will only know about data nodes
>>>> running the same releases?
>>>>
>>>> I know you were just answering my tangent.  But I think the original
>>>> question is still only half answered.  As I understand it there are
>>>> two ways this might go:
>>>>
>>>> 1. all data nodes upgrade to the PKI infrastructure
>>>>
>>>> 2. the ESGF continues to support (for some time) both PKI and the
>>>> HTTP query string token (I don't know the right name for this,
>>>> sorry).
>>>>
>>>> (there is a 3rd option of everyone moving to just the HTTP query
>>>> string token - but I don't think that is really under discussion).
>>>>
>>>> My guess is that 2. is the most likely outcome and data users will
>>>> have to cope with both.  So...
>>>>
>>>> 1. How do you programmatically get data using the HTTP query string
>>>> token? (I think Martin is following this up with Bob - can we have a
>>>> summary posted to the list?)
>>>>
>>>> 2. How does a user know which method to use for which nodes?
>>>> (This may be in the data-node registry, when available, but it
>>>> wasn't obvious to me from the sample Luca sent round - again I may
>>>> be missing something though).
>>>>
>>>> Apologies if I'm coming across as over demanding here - I realise I'm
>>>> coming to this discussion relatively late in the day.  Just I'm aware
>>>> that we have scientists who want to get data so they can start the
>>>> analysis and writing of multi model papers in time for the 1st draft
>>>> of the AR5. At the moment I'm really uncertain on how they can get
>>>> the data minimising the effort they have to put into finding and
>>>> fetching it.
>>>>
>>>> Thanks,
>>>>
>>>> Jamie
>>>>
>>>>
>>>> ________________________________
>>>>
>>>>           From: Gavin M. Bell [mailto:gavin at llnl.gov]
>>>>           Sent: 01 July 2011 20:35
>>>>           To: Kettleborough, Jamie
>>>>           Cc: Cinquini, Luca (3880); go-essp-tech at ucar.edu
>>>>           Subject: Re: [Go-essp-tech] Data node authorization
>>>>
>>>>
>>>>           Hello Jamie,
>>>>
>>>>           Allow me to solely indulge your tangent for a moment... :-)
>>>>           The issue of knowing who is where etc. is solved by using a
>>>>           sufficiently recent version of the ESGF "data" Node (v0.5.1+).
>>>>           The node-manager's registry component will automatically
>>>>           generate a continuously updating descriptive (xml) document
>>>>           of nodes currently present in the federation at a given
>>>>           time.  This would have ameliorated your task considerably.
>>>>
>>>>           For each of the sites you have collected, go to the
>>>>           esgf-node-manager page and look at the bottom left corner for
>>>>           the version.
>>>>           They are all earlier than v0.5.1 and hence do not have the
>>>>           automatic federation feature in place.
>>>>
>>>>           Ex:
>>>>           http://esgnode1.nci.org.au/esgf-node-manager/  (v0.5.0)
>>>>           http://vesg.ipsl.fr/esgf-node-manager/  (v0.4.0)
>>>>           http://esg.cnrm-game-meteo.fr/esgf-node-manager/  (v0.4.0)
>>>>           http://dap.cccma.uvic.ca/esgf-node-manager/  (v0.5.0)
>>>>           http://cmip-dn.badc.rl.ac.uk/esgf-node-manager/  (v0.4.0)
>>>>
>>>>           (NASA-GISS are not running a node manager at all)
>>>>
>>>>           If you look at more recent node installations (version
>>>>           0.5.1+) you will see that there is a registration.xml
>>>>           document that is served under esgf-node-manager.  It is an
>>>>           active document that is automatically updated by the node
>>>>           manager's registry service to always reflect the current
>>>>           state of the federation.
>>>>           This is a feature of the new ESGF Node.  Gateways are not
>>>>           running node managers so they are not present in the
>>>>           registration.xml document.  However, you can find out about
>>>>           gateways indirectly by looking at the ESGF Node's
>>>>           registration entry and looking at the attribute "adminPeer":
>>>>           this indicates that node's target IDP service, which in older
>>>>           ESG parlance indicates a "gateway".  The new ESGF Nodes are
>>>>           built on a modular component architecture such that sets of
>>>>           components embody functionality; these are what we call ESGF
>>>>           Node "types".  There are 4 node types. The node type that is
>>>>           currently being installed is the well known "data" type,
>>>>           a.k.a. the "data node"; the other types are not mutually
>>>>           exclusive and extend the ESGF Node's functionality to include
>>>>           familiar features such as:
>>>>           - User credential management and single sign on support
>>>>           - Attribute management
>>>>           - Enhanced Federation-wide searching (with new search
>>>>             front-end)
>>>>
>>>>           As well as recent features since v0.5.1 and pending features
>>>>           coming on line such as:
>>>>           - Automatic fail-over and fault tolerance
>>>>           - New administrative front ends
>>>>           - Computation / Visualization tools
>>>>           - and more...
>>>>
>>>>           I would suggest upgrading :-).
>>>>
>>>>           The installation/upgrading process has been streamlined to
>>>>           make things more straightforward - and the team and I are
>>>>           always glad to help if needed.  There are further
>>>>           enhancements in the queue that will further streamline the
>>>>           process to make installation/upgrading as turn-key as
>>>>           possible.  There are also enhancements to the federation
>>>>           protocol and new features as well, that will soon be
>>>>           available in an upcoming v0.5.3 release that is currently in
>>>>           test.
>>>>
>>>>           FYI:
>>>>           The current installer installs the ESGF Node at v0.5.1.
>>>>           In staging is v0.5.2
>>>>           In test is v0.5.3.
>>>>
>>>>           Note: The versions listed above are those of the node
>>>>           manager component.  As it is a component of the ESGF Node,
>>>>           the node itself has a version, currently ESG Node v1.0.4+
>>>>           (Stuyvesant release).
>>>>
>>>>           The new ESGF Node augments the data node and is a complete
>>>>           solution in and of itself while being compatible with the
>>>>           current Gateway.  It should be considered a useful tool to
>>>>           help the climate community and an addition to the ESG
>>>>           ecosystem of utilities :-).
>>>>
>>>>           Whew... (that was a long email)
>>>>           I hope this was somewhat useful information in the context
>>>>           of your tangent. :-)
>>>>
>>>>
>>>>           On 7/1/11 6:49 AM, Kettleborough, Jamie wrote:
>>>>
>>>>                   I created this table by: looking at each gateway,
>>>>                   figuring out which modelling institutes contributed
>>>>                   to the CMIP5 project, selecting a sample data-set,
>>>>                   creating a wget script, and then inspecting the url
>>>>                   in the script.  (I couldn't get to any NCC data as I
>>>>                   didn't have access).  I only sampled one dataset.
>>>>
>>>>                   This feels a bit long winded - what is the expected
>>>>                   way to do this?  Although today I was just gathering
>>>>                   information on what data nodes are out there I can
>>>>                   imagine this as a part of a real life use case (a
>>>>                   very common use case).  If I want to gather a
>>>>                   diagnostic, such as monthly mean surface temperature
>>>>                   from as many models as I can, I think I'd have to do
>>>>                   this sort of trawling.  OK I maybe only have to do
>>>>                   the initial mapping of institute to data node once,
>>>>                   but I think there is still a trawl needed between
>>>>                   gateways to get the data.  I may be missing
>>>>                   something - and I took some unnecessary steps.
>>>>                   Please let me know if this is the case.  Estani,
>>>>                   Martin, Sebastien - sounds like you have already
>>>>                   started to do this sort of thing?
>>>>
>>>>                   I also note that not all gateways know about all
>>>>                   institutes - I think this is a known problem.  For
>>>>                   instance PCMDI doesn't know about IPSL, and only NCI
>>>>                   seems to know about CSIRO. Any ideas when this might
>>>>                   be resolved?
>>>>
>>>>
>>>>
>>>>
>>>>           --
>>>>           Gavin M. Bell
>>>>           Lawrence Livermore National Labs
>>>>           --
>>>>
>>>>            "Never mistake a clear view for a short distance."
>>>>                          -Paul Saffo
>>>>
>>>>           (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>
>>>>            A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>>>>
>>>>
>>>> _______________________________________________
>>>> GO-ESSP-TECH mailing list
>>>> GO-ESSP-TECH at ucar.edu
>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>
>>
>> --
>> Estanislao Gonzalez
>>
>> Max-Planck-Institut für Meteorologie (MPI-M) Deutsches
>> Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>
>> Phone:   +49 (40) 46 00 94-126
>> E-Mail:  gonzalez at dkrz.de
>>
>>
>

-- 
Sébastien Denvil
IPSL, Pôle de modélisation du climat
UPMC, Case 101, 4 place Jussieu,
75252 Paris Cedex 5

Tour 45-55 2ème étage Bureau 209
Tel: 33 1 44 27 21 10
Fax: 33 1 44 27 39 02
