[Go-essp-tech] Non-DRS File structure at data nodes

Gavin M. Bell gavin at llnl.gov
Mon Sep 5 23:47:14 MDT 2011


Hi Balaji,

I have read through almost all of the THREDDS source code.  I know
exactly where this bit of code would be placed and thank goodness it has
nothing to do with aggregations! :-).  The hyperslab configuration and
selection is orthogonal to this particular issue.  There are no issues
from a query or aggregation standpoint that will be ameliorated or
worsened by this bit of translation.

Trust me. :-)
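
To make the idea concrete, here is a minimal sketch of the kind of
translation being discussed, not the actual THREDDS or ESGF code.  It
assumes a catalog built by the scan step that maps a DRS-style
identifier to whatever local path the site uses.  The identifier
layout, the paths, and the names CATALOG, drs_id_from_url_path and
resolve are all hypothetical.

```python
# Hypothetical catalog produced by a scan/publication step.
# Keys are DRS-style identifiers; values are site-specific local paths.
# Both are invented for illustration only.
CATALOG = {
    "project.institute.model.experiment.variable.v1":
        "/archive/vol07/exp_a/variable_run1.nc",
}


def drs_id_from_url_path(url_path):
    """Turn a request path like '/a/b/c' into a dotted identifier 'a.b.c'."""
    parts = [p for p in url_path.split("/") if p]
    return ".".join(parts)


def resolve(url_path, catalog=CATALOG):
    """Resolve the apparent (DRS) path in a request to the actual local path."""
    drs_id = drs_id_from_url_path(url_path)
    try:
        return catalog[drs_id]
    except KeyError:
        # The query names something this node has not scanned/published.
        raise LookupError("no resource registered for DRS id %r" % drs_id) from None
```

The point of the sketch is only that the lookup is a plain key-to-path
mapping, independent of how aggregations or hyperslabs are handled
further down the stack.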


On 9/5/11 7:55 PM, V. Balaji wrote:
> It would be great if the URL-to-path translation were a simple hash,
> but in actual fact it's a THREDDS aggregation layer:-).
>
> I could be wrong, but I've not seen enough to convince me that this
> can scale up to the demands likely to be placed upon it for CMIP5. Recall
> that our data volumes are such that people are likely to make a lot
> of server-side queries before they can configure a minimal hyperslab
> to download for each of their science problems.
>
> Gavin M. Bell writes:
>
>> Hi Balaji,
>>
>> Indeed you have good points.  The only thing I am suggesting is that it
>> is an equally daunting task to *impose* anything on a group of people.
>> I am a big fan of the benevolent dictatorship, but as history tells us,
>> they don't last.  I guess I would contend that having a transparent,
>> open, easy to grok algorithm would be incumbent upon any institution
>> that decides to take advantage of such an indirection mechanism.  It is
>> optional.  It may very well be for a community as disciplined as the
>> climate community (I am quite serious; many other communities look to the
>> climate community as a model of organization) that having a single
>> structure would suffice.  But from a system admin perspective filesystem
>> requirements may be prohibitive.  As we can see just from this
>> discussion, many are already looking for ways to perform this
>> indirection.  I propose that we allow this indirection in a regimented
>> way so we can all understand / embrace the mechanism by which this is
>> done.  Think of it like a well known hashing algorithm that we know how
>> to plug into to get out what we want.  As long as the transformation
>> machinery is clear then the particular transform becomes a simpler, more
>> circumscribed task.  I only propose that we allow for this at the
>> highest ingress level... ESGF.  This ameliorates the burden on ESGF.
>> Let's not forget ESGF belongs to all of us, *we* are *them*.
>>
>> As for torrents... well, in my mind's eye that is what we are building
>> to some extent, but with better security that avoids some of the
>> byzantine attacks torrents are prone to.  ESGF should have the best
>> features from that community... at least that's part of the goal of the
>> design.  Oh, and ESGF won't preclude using torrents... as a matter of
>> fact you could create a back-end that is a true torrent in/egress that
>> plugs into ESGF... Oh... but then you will need a transformation layer
>> to allow that to happen, it would be nice to have one available to plug
>> into, right? ;-).
>>
>> On 9/2/11 11:53 AM, V. Balaji wrote:
>>> I've been a proponent since the beginning of having a file layout
>>> (DRS) agreed by convention and _imposed_ (rather than recommended)
>>> on participant nodes. While this may be old-fashioned thinking,
>>> our finding is that predictable paths are the most useful thing for
>>> building the software, and I continue to believe that it's not so
>>> difficult to agree upon a file layout. I think the difficulties here
>>> arose from a discrepancy in the way DRSlib and CMOR handled versioning
>>> rather than people digging their heels in about a conventionally
>>> agreed file and directory layout.
>>>
>>> Regarding Gavin's larger point, having an indirection layer in the
>>> middleware separating the apparent path in the query from the actual
>>> path in the resource introduces a huge dependency on that indirection
>>> layer: pretty much nothing can function without it. I'm not sure ESGF
>>> should take upon itself such a huge burden.
>>>
>>> With DRS being an imposed convention, you could undertake many tasks
>>> following software paths for which we aren't responsible. There are
>>> many tasks -- e.g. data movement, replication -- which are shared by
>>> communities much larger than ESGF and shouldn't require specialized
>>> middleware. One of my big disappointments is that we don't use torrents
>>> for anything:-).
>>>
>>> Gavin M. Bell writes:
>>>
>>>> Hi Estani and colleagues, :-)
>>>>
>>>> Okay, so let me jump in for a minute.  There are two notions that are
>>>> being conflated in this discussion.  Everyone is used to using paths and
>>>> such to find things on the filesystem.  Also people are used to using
>>>> tried and true mechanisms that use the filesystem to get to information
>>>> remotely by further qualifying the filesystem path with the host.  This
>>>> is all well and good for the scope of these tools.
>>>>
>>>> Now we are in a distributed world as we build this ESG*F* (Federation)
>>>> that will unify and sew together disparate organizations' data into a
>>>> seamless 'dataspace'.  The goal of building such a thing is to make it
>>>> easy for all interested in the data to get to data and post data and in
>>>> so doing share data in an environment that is fluid.
>>>>
>>>> ESGF is providing a mechanism/platform/infrastructure... that
>>>> simultaneously addresses the need for everyone to share data while
>>>> maintaining sovereign custody over their data assets.  ESGF has already
>>>> met this challenge in many ways.  However, to continue to make the
>>>> system simple, easy, and a joy to use we should alleviate the
>>>> requirement of filesystem structure.  This is a particular case where
>>>> 'some' is good but 'too much' hurts.
>>>>
>>>> So now, cutting to the chase.  More than anecdotal evidence (the length
>>>> of this discussion) clearly suggests that strict filesystem adherence is
>>>> not in accord with the sovereignty we would like organizations to
>>>> enjoy.  It would behoove us to operate the federation such that
>>>> descriptors in the context of the federation are divorced from
>>>> filesystem structure itself.  This can be achieved rather directly.
>>>>
>>>> Going back to what I initially said, the two notions being conflated
>>>> here are the *query* and the *resource*.  A URL, even the filesystem
>>>> path itself, is nothing more than a query to the network/operating
>>>> system to locate bits on a platter (clearly I am dating myself).  We
>>>> should use the DRS as the Federation's canonical locator for resources.
>>>> The DRS is the *query* (in the same spirit as above).  The ESGF system,
>>>> just like the filesystem, would resolve the query (DRS) to the
>>>> resource.  This, by the way, bears fruit in quite a few places in the
>>>> system, making quite a few things more efficient.
>>>>
>>>> I have thought about this particular filesystem problem and have come up
>>>> with a solution... the solution would allow us to still use tools like
>>>> wget/curl right out of the box and with a little bit of tweaking gridftp
>>>> and globus.  As a matter of fact the solution would lend itself to being
>>>> used by any tool old or new.  To more directly address Estani's
>>>> questions about *relying* on things.... I don't think that the tone of
>>>> that should be so pejorative.  You *use* a tool because it helps you. I
>>>> feel that using the ESGF infrastructure is useful to the community and
>>>> the community's goals.  I don't think that it is too much skin in the
>>>> game to ask for.  If things go horribly wrong, your organization has
>>>> its own filesystem structure that fits its needs and that it can rely
>>>> on in order to make sense of things as it sees them.  So, fundamentally
>>>> the act of scanning the data is what provides the cohesion between the
>>>> DRS and filesystem structure.  The job of scanning is certainly not
>>>> terribly laborious.  So there is quite literally very *little* cost to
>>>> "relying" on a system/infrastructure/set of tools that is ESGF,
>>>> especially compared to the benefit of what ESGF can bring to this
>>>> community. I find it hard to conjure a cogent argument against creating
>>>> a flexible system, especially given the nature of this
>>>> multi-organization, international effort. We must make it easy for
>>>> organizations to be independent and not push a myopic view (IMHO) of a
>>>> certain state of the world on everyone.
>>>>
>>>> Thank you for reading this rather lengthy email... I need an in-house
>>>> editor perhaps... I tend to get garrulous but I wanted to be as clear as
>>>> I could.
>>>>
>>>> If there isn't already a working group on this I would like to propose
>>>> one, we can set it up on the ESGF wiki and talk more about this.  :-)
>>>>
>>>> P.S.
>>>> In 10 years ESGF will have morphed into something even more lovely...
>>>> because it is built by all of us and nurtured on our wisdom :-).
>>>> It will be the tool people count on and rely on, as you alluded to with
>>>> ftp, et al.  There is no tomorrow without today (modulo the quantum
>>>> mechanics fridge).
>>>>
>>>> On 9/2/11 2:55 AM, Estanislao Gonzalez wrote:
>>>>> I know the main idea is to create a middleware layer that would make
>>>>> file structures obsolete. But then, we will have to write all tools
>>>>> again in order to interact with this intermediate level or at least
>>>>> patch them somehow. gridFTP, as well as ftp, are only useful as
>>>>> transmission protocols, you can't write your own script to use them,
>>>>> you have to rely on either the gateway or the datanode to find what
>>>>> you are looking for.
>>>>> In my opinion, we will be relying too much on the ESG infrastructure.
>>>>> What would happen if we lose the publisher database? How would we
>>>>> tell apart one version from another, if this is not represented in the
>>>>> directory structure?
>>>>> My fear is that if we keep separating the metadata from the data
>>>>> itself, we add a new weak link in the chain. Now if we lose the
>>>>> metadata the data will also be useless (this would be indeed the worst
>>>>> case scenario). In 10 years we will have no idea what these interfaces
>>>>> were like, probably both data node and gateways will be superseded by
>>>>> newer versions that can't translate our old requirements. But as I
>>>>> said, that's a problem for LTAs only. In any case, we need the
>>>>> middleware to provide some services and speed things up, but I don't
>>>>> think we should rely blindly on it.
>>

-- 
Gavin M. Bell
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

