[Go-essp-tech] DRS, version number & Co

Sébastien Denvil sebastien.denvil at ipsl.jussieu.fr
Fri Jul 8 06:43:18 MDT 2011


  Dear all,

To share the way we manage versions and downloads, I have written down
a few key points.

The main one could be: at the scale of CMIP5, "the easiest way to
science" will be achieved with a distributed archive but with
multi-centralized/multi-polar analysis. Just a thought.

*- Versioning policy.*

We have used drslib since our first publication and follow the DRS exactly.
We decided that any change in a dataset (new, removed or updated
file(s)) triggers a new dataset version, even before QC takes place.
Successive dataset versions are managed by drslib (obvious but still...).
We are in the process of publishing new dataset versions due to new,
removed and updated file(s), so we will let you know our findings.
The document describing the "Versioning policy" that I was looking for
could be this one:
*CMIP5 Dataset Version Directory Structure (Stephen Pascoe, version 0.5,
2010-06-04)*. I have a printed copy of it but cannot find it in my
mailbox.
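
To make the rule above concrete, here is a minimal sketch of the decision "any added, removed or updated file triggers a new dataset version". It does not use drslib and is not its API; the function names and the "file name -> checksum" mapping are assumptions made only for illustration.

```python
from typing import Dict, List


def classify_changes(published: Dict[str, str],
                     candidate: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two file sets, each given as {file name: checksum}.

    The mapping layout is hypothetical; any stable per-file fingerprint
    would do in place of the checksum.
    """
    common = published.keys() & candidate.keys()
    return {
        "added": sorted(candidate.keys() - published.keys()),
        "removed": sorted(published.keys() - candidate.keys()),
        "updated": sorted(n for n in common if published[n] != candidate[n]),
    }


def needs_new_version(published: Dict[str, str],
                      candidate: Dict[str, str]) -> bool:
    """True as soon as one file was added, removed or updated."""
    return any(classify_changes(published, candidate).values())
```

Under these assumptions, a dataset whose files are unchanged keeps its version, while a single replaced file is enough to require a new one.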

*- Download strategy.*

We download a predefined subset of the CMIP5 archive to sustain the
analysis activity of scientists (100+ from the WG1 family) on a
dedicated cluster (economy of scale).
We download from PKI-enabled data nodes only.
We rely on THREDDS catalogue information to decide what to download
(dataset_version, checksums, tracking_id).
We rely on THREDDS catalogue information and on drslib to reproduce the
DRS structure locally alongside IPSL data (scientists love a homogeneous
directory structure).
We manually maintain a list of the data nodes we try to download from.
     - first pass --> download the subset (store as much info as you can
       in a db at the file level: transfer_rate, date, versions,
       tracking_id, data size per time step ....)
                      check checksums when available, otherwise check size
     - periodically
         - update the subset definition
         - from the db, compute the new subset size ---> yes/no, it can
           fit on our storage
         - from the db, compute the time it should take to download
         - next pass --> if dataset_version changed
                              download the file if its checksum changed
                              (if no checksum: download the file if its
                              tracking_id changed) and update the db.
                              check checksums when available, otherwise
                              check size
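
The "next pass" logic above can be sketched roughly as follows. This is only an illustration under stated assumptions, not our actual scripts: the FileRecord layout, the function names and the choice of MD5 as the checksum algorithm are all hypothetical, and the size/time planning simply divides the subset size by a mean transfer rate taken from the db.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FileRecord:
    """Per-file information kept in the db (hypothetical layout)."""
    dataset_version: str
    tracking_id: str
    size: int
    checksum: Optional[str] = None  # not every catalogue provides one


def needs_download(local: Optional[FileRecord], remote: FileRecord) -> bool:
    """Decide whether a file has to be fetched on this pass."""
    if local is None:
        return True  # first pass: nothing stored yet
    if local.dataset_version == remote.dataset_version:
        return False  # dataset version unchanged: nothing to do
    if local.checksum is not None and remote.checksum is not None:
        return local.checksum != remote.checksum
    # No checksum to compare: fall back to the tracking_id.
    return local.tracking_id != remote.tracking_id


def verify(path: str, remote: FileRecord, algorithm: str = "md5") -> bool:
    """Check the checksum when available, otherwise check the size."""
    if remote.checksum is None:
        return os.path.getsize(path) == remote.size
    digest = hashlib.new(algorithm)  # algorithm is an assumption
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == remote.checksum


def plan_pass(subset_bytes: int, free_bytes: int,
              mean_rate: float) -> Tuple[bool, float]:
    """Return (fits on storage?, estimated download time in seconds).

    mean_rate is a mean transfer rate in bytes per second from the db.
    """
    return subset_bytes <= free_bytes, subset_bytes / mean_rate
```

In such a scheme the db is updated only after verify() succeeds, so a failed transfer is simply retried on the next pass.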

That's how it is: imperfect, but it fulfils the main use cases.
Regards.
Sébastien

On 08/07/2011 12:48, Estanislao Gonzalez wrote:
> Hi Sebastien,
>
> The only part which was not resolved (or agreed upon) was the first 
> level after /fileServer/ for the file access capabilities of the TDS, 
> so every node has something different (we will be publishing to 
> /fileServer/cmip/out... though).
>
> What drslib does is threefold:
> 1) It separates output into output1 and output2, since only output1 is 
> of interest for replication (that's how I got the IPSL aqua4K 
> experiment: I just skipped everything that was output2..)
> 2) It versions the files (and thus inserts the version into the DRS 
> structure). This helps in finding the version and is the only way I 
> know of in which it can be gathered from the Gateway.
> 3) It recreates the DRS structure, ensuring it is a valid one. For 
> reasons I'm not aware of, it misses the activity part, so you can still 
> end up with a non-valid DRS... In CNRM's case it means it will not 
> validate the CMIP5, which should have been cmip5 (sadly computers are 
> worse than the worst bureaucrats :-)
>
> drslib is quite well described (in my opinion) here: 
> http://esgf.org/esgf-drslib-site/
> All documentation regarding the data node and all the tools around it 
> can be found here: http://esgf.org/wiki/ESGF_Node
>
> Hope this helps,
> Estani
>
> On 08.07.2011 12:13, Stéphane Senesi wrote:
>> Hi all,
>>
>> martin.juckes at stfc.ac.uk wrote, On 08/07/2011 11:09:
>>> Hi Estani,
>>>
>>> I agree with you that this is an important issue and that we want to have a clean implementation.
>>>
>>> Unfortunately, given where we are now, I don't think there is going to be any support for withdrawing data nodes which don't meet this implementation standard -- so enforcement by the gateway won't work. So I think the only way forward is to work on simplifying the installation and then persuade the node managers to adopt the standard. Making it the default would, as you suggest, be a huge help.
>>>
>>> I keep telling our users in the UK that the archive is currently at a very early stage, with a significant chance that data will be replaced. The same applies to the level of service. I think we need to work on demonstrating best practice as far as data node deployment goes.
>>>
>>> At the moment I see the PKI security as a higher priority, since most of our users want scripting access rather than clicking through the gateways, and this only works when the PKI security is enabled.
>>>
>>> For the versioning implementation, it would help to have a step-by-step guide on esgf.org (or, if it is already there, it would help me to understand the issues if I knew where it is) -- but I guess this will have to wait until Stephen has worked through some other priorities.
>>>    
>>
>> Regarding the CNRM data node, what prevented us from adopting the 
>> "recommended" directory structure (it is not called "standard" in 
>> CMIP5 documents) was the lack of such a guide.
>>
>> I agree with Martin that it is important to ease OPeNDAP-enabled 
>> scripted access for data users; if it turns out that this version 
>> issue is the only obstacle to computing data file addresses (leaving 
>> aside the issue of the data node name), then we can consider changing 
>> the directory structure (assuming we have the guide).
>>
>> Also, I note that the first part of HTTPServer URLs contains a 
>> component which may vary from one data node to another, and even by 
>> experiment or realm (such as the boldface part in 
>> http://esg.cnrm-game-meteo.fr/thredds/fileServer/*esg_dataroot1*/CMIP5/output/CNRM-CERFACS/CNRM-CM5/historicalGHG/mon/land/evspsblsoi/r1i1p1/evspsblsoi_Lmon_CNRM-CM5_historicalGHG_r1i1p1_190001-194912.nc). 
>> Would this be cured by applying drslib?
>>
>> On a closely related subject, may I quote Sébastien Denvil (9 June 
>> 2011), with whom I agree:
>>
>>> I would like to remind us all that having a clear add/remove/update 
>>> procedure is a requirement, together with the impact of 
>>> add/remove/update on versions (dataset version, file version).
>>>
>>> It's clear we do our best to publish the right dataset. It's clear 
>>> too that the QC process and the scientific process will spot issues 
>>> (acceptable or not) and will trigger add/remove/update actions.
>>>
>>> I can't remember if a clear document describing the publish/unpublish 
>>> procedure exists. It should describe, from both perspectives (data 
>>> provider/those in charge of replication), how to:
>>>
>>> - add file(s) within existing datasets
>>> - remove file(s) from existing datasets
>>> - update file(s) in existing datasets (is that just add/remove? 
>>> Not if we want an easy life for replication. Yes if we want an easy 
>>> life for the data provider)
>>>
>>> If such a document doesn't exist yet, I think it is a priority (given 
>>> where we are) to produce one.
>>>
>>> Can someone point me to that document?
>>
>> Regards
>>
>> Stéphane
>>
>>> It should be possible to get all this fixed in time, but I think people are working through a large number of issues in parallel at present.
>>>
>>> Cheers,
>>> Martin
>>>
>>>    
>>>>> -----Original Message-----
>>>>> From:go-essp-tech-bounces at ucar.edu  [mailto:go-essp-tech-
>>>>> bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
>>>>> Sent: 07 July 2011 16:46
>>>>> To:go-essp-tech at ucar.edu
>>>>> Subject: [Go-essp-tech] DRS, version number & Co
>>>>>
>>>>> Hi,
>>>>>
>>>>> What's the current position regarding DRS and dataset version number?
>>>>>
>>>>> I've seen too many data nodes with too many different configurations:
>>>>> from invalid dataset names to invalid DRS structures, names, missing
>>>>> version numbers, etc.
>>>>> The version number is a particularly interesting one, since in some
>>>>> cases the only way to find it is by parsing the TDS catalogs
>>>>> themselves: the Gateway is not providing this info (AFAICT) and, if
>>>>> the DRS is not followed, it cannot be inferred from the directory
>>>>> structure of the files either.
>>>>>
>>>>> Obviously neither the publisher nor the Gateway is enforcing those
>>>>> constraints. I think this should be changed ASAP.
>>>>> Both Node and Gateway publishing steps should enforce this when
>>>>> publishing for cmip5. I think it is the most direct way to reach the
>>>>> publisher at the right time.
>>>>>
>>>>> If we keep drifting away from what we already agreed on, we won't be
>>>>> able to do anything useful with the data at all, since we won't be
>>>>> able
>>>>> to handle it properly.
>>>>>
>>>>> I'll urge the data node managers to check DRS compliance.
>>>>>
>>>>> I've only seen BADC publishing according to the DRS structure. I know
>>>>> PCMDI, BCC, CNRM and NCCS are not. I haven't checked others.
>>>>>
>>>>> Thanks,
>>>>> Estani
>>>>>
>>>>> --
>>>>> Estanislao Gonzalez
>>>>>
>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>
>>>>> Phone:   +49 (40) 46 00 94-126
>>>>> E-Mail:gonzalez at dkrz.de
>>>>>
>>>>> _______________________________________________
>>>>> GO-ESSP-TECH mailing list
>>>>> GO-ESSP-TECH at ucar.edu
>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>        
>>
>>
>> -- 
>> Stéphane Sénési
>> Ingénieur - équipe Assemblage du Système Terre
>> Centre National de Recherches Météorologiques
>> Groupe de Météorologie à Grande Echelle et Climat
>>
>> CNRM/GMGEC/ASTER
>> 42 Av Coriolis
>> F-31057 Toulouse Cedex 1
>>
>> +33.5.61.07.99.31 (Fax :....9610)


-- 
Sébastien Denvil
IPSL, Pôle de modélisation du climat
UPMC, Case 101, 4 place Jussieu,
75252 Paris Cedex 5

Tour 45-55 2ème étage Bureau 209
Tel: 33 1 44 27 21 10
Fax: 33 1 44 27 39 02
