[Go-essp-tech] Status of Gateway 2.0 (another use case)

Estanislao Gonzalez gonzalez at dkrz.de
Thu Dec 15 03:34:19 MST 2011


Hi Jamie,

to be honest, replicas are not fully supported. There are too many 
caveats at the moment and I'm waiting for Eric's green light to 
completely remove them from our end.

But the idea up to now was to let the user decide which replica was 
going to be downloaded, just the same way the user decides which 
dataset to download (see the NCC datasets that are fully replicated at 
our end).

I honestly doubt this is what the user needs, although it might be what 
they want. I would presume the client tool is better placed to take care 
of those decisions and even to parallelize downloads from multiple 
endpoints. The wget script will not handle all this complexity.
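
To illustrate what I mean by leaving this to the client, a rough sketch 
(nothing more than that; urls.txt and its layout are made up for the 
example) could be as simple as:

  #!/bin/bash
  # urls.txt is hypothetical: one line per file, listing that file's replica
  # URLs separated by spaces, e.g. "http://node-a/...nc http://node-b/...nc".
  # Take the first replica of each file and fetch four files in parallel.
  awk '{print $1}' urls.txt | xargs -n 1 -P 4 wget -c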

So the user will just select something, and whatever is selected will be 
wrapped by the wget script. There will be no concept of replica at this 
stage.

Hope this answers your question.

Regards,
Estani

Am 14.12.2011 19:33, schrieb Kettleborough, Jamie:
> Hello Estani,
> thanks for this information - very useful.  One question, then some 
> follow-up inline.
> How does the wget generator (or search) deal with replicas - what 
> determines which replica the user will download or get returned from 
> the search?
> Jamie
>
>     **Well, the P2P has the /search endpoint, which returns an XML
>     response that the user could parse, and the wget script has the -w
>     flag (the current one has it too) that outputs this list.
>     The way the wget script is designed, it's pretty simple to extract
>     this list anyway, as the whole wget script is little more than a
>     "bash" decoration around it (with all the intelligence). That's e.g.
>     what the DML uses for ingesting the files to be retrieved.
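>     In practice that means something like the following (the script name
>     is made up, and I'm assuming here that -w prints one URL per line;
>     treat it as a sketch, not a spec):
>
>       bash wget-script.sh -w > file_list.txt   # dump the embedded URL list
>       wc -l file_list.txt                      # ...and process it as you like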
>     This is good to know.
>
>     The replication system is much more complicated, because you
>     handle many more files at the same time, so a simple script won't
>     be able to manage 200,000 URLs for replicating a couple of
>     experiments (3-4 URLs per file in our case). Furthermore there are
>     many other requirements that normal users don't have, including
>     publication. But at the very bottom there are many similarities
>     indeed.
>     Happy to be corrected on my naive view of the replication problem
>     - though I still think it is useful to recognise what is common. 
>     Picking up a couple of comments from your previous e-mail: how
>     modular is the replication system, and how much work would be
>     involved in using those modules that deal with that common stuff
>     'at the very bottom' to write an 'intelligent' user client?
>
>     Regarding a client wget script generator, it's the other way
>     around from how it works now: you get the wget script and, from it,
>     the list of files. It already checks for downloaded files, so you
>     don't need to do that yourself and create a new wget script; it
>     will do it for you.
>     Doesn't this mean the wget script generator has to know the
>     directory structure the user is using for their local replica of
>     the archive - and this may differ (hey, as we know it's not even the
>     same from node to node)?  Your bash is way ahead of mine - so I
>     could be wrong in what follows, but from what *I* could tell from
>     the sample wget script (generated from an example on
>     http://www.esgf.org/wiki/ESGF_scripting) it simply uses the file
>     name, assuming the file is in the local directory.
>     Have you considered copying to the DRS directory structure as the
>     default?  This has a nice side effect of helping users know what
>     version they have downloaded. (Though you'd need to get the version,
>     and other DRS elements missing from the filename, into your script
>     I think.)    I know my suggestion would force people to use the
>     DRS locally... but I don't know that that's a *bad* thing.  It
>     would also have to either be run from the root of their local
>     copy, or have a compulsory argument that is the root of their
>     local copy.
>     The wget script *could* then also get really clever and not do the
>     remote copy of data that has not changed from one version to the
>     next (based on checksum), but just put in a hard link or something
>     like that... that's probably quite a way towards getting an
>     'intelligent' user client?
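>     Roughly what I'm imagining is something like this (completely
>     untested; the paths, the checksums.txt listing and the variable
>     names are all made up for illustration):
>
>       #!/bin/bash
>       # checksums.txt is a hypothetical "filename md5sum" listing for the new
>       # version; $drs_root, $old_version, $new_version and $base_url are made up.
>       mkdir -p "$drs_root/$new_version"
>       while read fname new_md5; do
>           old="$drs_root/$old_version/$fname"
>           new="$drs_root/$new_version/$fname"
>           if [ -f "$old" ] && [ "$(md5sum "$old" | cut -d' ' -f1)" = "$new_md5" ]; then
>               ln "$old" "$new"                      # unchanged since last version: hard link
>           else
>               wget -q -O "$new" "$base_url/$fname"  # changed or new: fetch it
>           fi
>       done < checksums.txt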
>     Having just written this I realised it doesn't really work for us
>     at the MO, as our 'download server' - where we run wget - cannot
>     see the disk where we keep our local replicas of the CMIP5 and
>     TAMIP archives because of security constraints.  Thankfully the
>     machines that do the list/search before we fetch can see the local
>     replica, so we can filter the list returned.  It *may* work for
>     others though?
>
>      If a download manager or any other "better" application
>     communicates with the node, it should use the XML semantics. If a
>     simple "one-way" script is required, the wget script is a good way
>     of getting it: dump the list (the overhead is really minimal) and
>     process as usual...
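>     For the XML route I mean something along these lines (the endpoint
>     path, parameters and XPath below are only placeholders to show the
>     idea, not the actual interface):
>
>       # ask the index node for file-level records and keep the XML response
>       curl -s "http://${index_node}/esg-search/search?type=File&limit=100" -o response.xml
>       # pull the url fields out of the response (XPath is illustrative)
>       xmllint --xpath '//arr[@name="url"]/str/text()' response.xml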
>
>     I think these two options are just fine, but correct me if you
>     don't share my view.
>     I think your options work - though I can't see how to get the
>     checksum from either the XML or the wget -w option (and I'm not
>     sure the wget -w ever would?).  *BUT* that may be just because the
>     sample datasets I'm using don't make the checksums available?
>     Thanks again for your reply.
>
>     Thanks,
>     Estani
>
>     Am 14.12.2011 17:43, schrieb Kettleborough, Jamie:
>>
>>         So, the way I picture this is:
>>         1) get the list of files to be downloaded (in the wget script
>>         or by any other means)
>>         2) filter that to remove what is not required
>>
>>     This is basically what we do at the MO - we create a list of files
>>     to download, then compare it with our local file system, and we
>>     filter out any we already have.  I think the replication system
>>     would have to do this too, wouldn't it?  For what it's worth I
>>     think *every* user has their own version of the replication
>>     problem - just the set of files they are trying to replicate is
>>     different and they might be using a different protocol to fetch
>>     the data.
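>>     In shell terms the filtering step is roughly this (the paths and
>>     file names are made up, and it assumes one URL per line in the
>>     list we build):
>>
>>       # names of everything we already hold under our local archive root
>>       find /our/local/archive -type f -printf '%f\n' | sort -u > have.txt
>>       # keep only the URLs whose file name we don't already have
>>       awk -F/ 'NR==FNR {have[$0]=1; next} !($NF in have)' have.txt file_list.txt > to_fetch.txt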
>>     If you accept this way of working as valid/acceptable/encouraged,
>>     then does it have implications for the (scriptable) interfaces to
>>     either the P2P and/or Gateway 2?   I think it means there 'should'
>>     be an interface that returns a list of files (not wrapped in a wget
>>     script) and then maybe a service (either client side or server
>>     side) that will take a list of URLs and generate the wget
>>     scripts.  If you only have an interface that returns wget scripts
>>     then users will have to parse these to enable them to filter out
>>     the files they already have copies of.
>>     Jamie
>>     (Sebastien - I'm aware this sort of touches on a set
>>     of unanswered questions you asked a while ago related to what we
>>     do at the MO... I've not forgotten that I want to answer this in
>>     more detail; apologies for being so rubbish at answering so far).
>
>
>     -- 
>     Estanislao Gonzalez
>
>     Max-Planck-Institut für Meteorologie (MPI-M)
>     Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>     Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>
>     Phone:   +49 (40) 46 00 94-126
>     E-Mail:gonzalez at dkrz.de  
>


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de
