[Go-essp-tech] Finding new datasets? Was RE: [esg-gateway-dev] Catalog Browsing broken at WDCC

Estanislao Gonzalez gonzalez at dkrz.de
Wed Mar 14 09:06:27 MDT 2012


Hi Jamie,

I won't start my _typical_ philosophical discussion that... simulation 
data is per-se deprecated. It's a matter of time to find it out :-)

But leaving things as they are right now: the procedure this user
follows (checking the Gateway catalogs for new versions) does not scale
well, though it might indeed work for some, especially if only one model
is being used. Otherwise it's impossible to find a single entry among
the 34000 datasets we have distributed over 26 nodes and 7 Gateways (the
numbers are from the top of my head).

So I'm diverting your question to CMIP5 Operation: How can a user find 
out if something changed for files he/she has already downloaded?
(Just to keep threads apart)

The first procedure is the search. If you search for data, you could
"see" if something is new (that's what the user *does* with the
catalogs: just looking around for the datasets he/she is interested in).
Not very fancy though.
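
As a rough illustration of what such a manual check amounts to, here is
a minimal Python sketch that compares a local inventory with a listing
obtained from a search. The dataset IDs, the integer version numbers and
the name find_changes are assumptions made for illustration; how the
remote listing is actually obtained is left open.

```python
def find_changes(local, remote):
    """Compare two {dataset_id: version} mappings.

    Versions are assumed to be comparable integers (hypothetical scheme).
    Returns (new, updated): IDs missing locally, and IDs whose remote
    version is higher than the local one.
    """
    new = sorted(ds for ds in remote if ds not in local)
    updated = sorted(
        ds for ds in remote if ds in local and remote[ds] > local[ds]
    )
    return new, updated


# Hypothetical inventory of what was already downloaded.
local = {"modelA.historical.Amon.tas": 1, "modelB.piControl.Amon.tas": 2}
# Hypothetical result of a fresh search.
remote = {
    "modelA.historical.Amon.tas": 2,
    "modelB.piControl.Amon.tas": 2,
    "modelC.rcp45.Amon.tas": 1,
}
new, updated = find_changes(local, remote)
```

With 34000 datasets the comparison itself is cheap; the hard part is
getting a reliable remote listing in the first place.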

Then there's the Atom feed. For example, the current Gateways allow
querying the *local* publications.
Everything published from March 1st onwards (for PCMDI and our Gateway):
http://pcmdi3.llnl.gov/esgcet/project/cmip5/dataset/feed.atom?updated-min=20120301
http://ipcc-ar5.dkrz.de/project/CMIP5/dataset/feed.atom?updated-min=20120301
(not centralized, the cmip5 casing might be a problem, the version is
not displayed in the Atom feed...)
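
A minimal Python sketch of how such a feed could be polled: it builds
the query URL with the updated-min parameter used above and pulls the
title and update time out of each entry. It assumes the Gateways return
standard Atom 1.0 (namespace http://www.w3.org/2005/Atom); the function
names are hypothetical, and since the version is not in the feed it
would have to be looked up elsewhere.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Assumption: the feed uses the standard Atom 1.0 namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def feed_url(base, updated_min):
    """Append the updated-min parameter (YYYYMMDD) to a feed URL."""
    return base + "?" + urlencode({"updated-min": updated_min})


def parse_entries(atom_xml):
    """Return (title, updated) pairs for every entry in an Atom document."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        entries.append((title.strip(), updated.strip()))
    return entries


def fetch_entries(base, updated_min):
    """Download one feed and parse it (needs network access)."""
    with urlopen(feed_url(base, updated_min)) as response:
        return parse_entries(response.read())
```

Because the feeds are per Gateway, a caller would have to loop over all
Gateway base URLs (with the right cmip5/CMIP5 casing for each) and merge
the results.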

For the P2P system there's a document about the RSS feeds here:
http://www.esgf.org/wiki/ESGF_RSS
For example, these are all CMCC CMIP5 datasets:
http://adm07.cmcc.it/esg-search/feed/node.rss
(this should be centralized, but I haven't figured out how to do it; as
far as I can tell the feed can't be restricted by date)
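
If the feed cannot be restricted by date on the server side, the
filtering could be done on the client instead. The sketch below assumes
the feed is plain RSS 2.0 whose items carry a pubDate in the usual
RFC 822 format; I have not verified that against the ESGF feed, and the
function name items_since is made up.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET


def items_since(rss_xml, cutoff):
    """Return (title, published) for RSS items published at or after cutoff.

    cutoff must be a timezone-aware datetime. Items without a parsable
    pubDate are skipped because their age cannot be determined.
    """
    root = ET.fromstring(rss_xml)
    recent = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        if not pub:
            continue
        try:
            published = parsedate_to_datetime(pub)
        except (TypeError, ValueError):
            continue  # malformed date string
        if published.tzinfo is None:
            # Assumption: dates without a usable offset are UTC.
            published = published.replace(tzinfo=timezone.utc)
        if published >= cutoff:
            recent.append((item.findtext("title", default=""), published))
    return recent
```

For the March 1st example above the cutoff would be
datetime(2012, 3, 1, tzinfo=timezone.utc). The whole feed is still
downloaded every time, so this only trims the output, not the transfer.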

Now going back to notification: this is something we could use, as it is
already there. But it's not properly linked and requires too much
work... We have notified users about important changes, but not about
every single new version (I would still be at it...). It takes a few
steps and will cost hours per notification (new versions of multiple
datasets), but it is possible to do it right now.

IMHO what we need is a system to let users help other users. We can't do
everything at once, and most users could be helping us out (I find
myself writing the same emails at the help desk all the time). If they
could share their knowledge about how they interact with the system,
they would get things done faster and we would have the feedback we need
to know *how* the system is *really* being used.
The case you are citing is a nice example of the difference between
how the system was used vs. how it was intended to be used.

My 2c,
Estani

Am 14.03.2012 15:21, schrieb Kettleborough, Jamie:
> Hello Estani,
>
> Thanks for following this problem up.
>
> This is related to the thread on goessp-tech on 'risks that science is being done with deprecated data'.  The reason the MOHC user (one of our scientists) is going to the catalogue is to try and find if any new datasets have been published - either completely new, or new versions of datasets he already has.  I don't know if there is a better way for a user (or even a replication system) to answer this question - I think it's the 'notification' problem, isn't it?  I think the scientist's basic question is: 'is there any new Amon tas data from any model, any ensemble member, for any piControl run, any historical type run, or any rcp run?'.  My guess is that kind of question is very common.
>
> Jamie
>
>> -----Original Message-----
>> From: esg-gateway-dev-bounces at mailman.earthsystemgrid.org
>> [mailto:esg-gateway-dev-bounces at mailman.earthsystemgrid.org]
>> On Behalf Of Estanislao Gonzalez
>> Sent: 14 March 2012 11:23
>> To: stephen.pascoe at stfc.ac.uk
>> Cc: esg-gateway-dev at earthsystemgrid.org
>> Subject: Re: [esg-gateway-dev] Catalog Browsing broken at WDCC
>>
>> Hi Stephen,
>>
>> the thing is that I only have NCC replicas (~400) and those
>> are already under a different top-level collection...
>> According to the CMIP5 view we have ~3100 datasets.... hmmm
>> it's a little more than you have though... but I can't even
>> display the replicas... although... again, I do have +3000
>> replicas there, though they are marked as "retracted",
>> perhaps that's why they are not working...
>>
>> Should I completely remove them? I was afraid I was going to
>> lose the metrics, but I guess I already did, since there
>> seem to be no metrics stored at all at the Gateway...
>>
>> This was a request from someone at MOHC who is apparently
>> using the catalogs.
>> I tried listing the MRI/MIROC top-level collections, these are much
>> more than what we have... I'll leave this running for a while
>> and see if something pops up... but I guess this page needs
>> pagination...
>>
>> Thanks,
>> Estani
>> Am 14.03.2012 12:12, schrieb stephen.pascoe at stfc.ac.uk:
>>> It's a while since I've tried the browser but I had noticed
>>> responsiveness pretty low as the number of datasets in the
>>> single "cmip5" container increased.  Maybe you have more
>>> locally-published datasets since you've published replicas?
>>> My guess is that it's a straight-forward scalability problem.
>>> You could try moving some datasets to separate top-level
>>> collections through some SQL but I'd try it out on a test
>>> Gateway first.
>>> Stephen.
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> Centre of Environmental Data Archival
>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot
>>> OX11 0QX, UK
>>>
>>> -----Original Message-----
>>> From: esg-gateway-dev-bounces at mailman.earthsystemgrid.org
>>> [mailto:esg-gateway-dev-bounces at mailman.earthsystemgrid.org]
>>> On Behalf Of Estanislao Gonzalez
>>> Sent: 14 March 2012 11:08
>>> To: esg-gateway-dev at earthsystemgrid.org
>>> Subject: [esg-gateway-dev] Catalog Browsing broken at WDCC
>>>
>>> Hi,
>>>
>>> for some reason I don't know, it's impossible to browse the catalogs
>>> at our Gateway: http://ipcc-ar5.dkrz.de/project/CMIP5.html
>>> I've increased the memory... but it does nothing... the browser just
>>> waits forever, no exception is thrown, no problem is being displayed.
>>> I noticed I had ~10 postgres threads running at 30%. After restarting
>>> the server and waiting for a while I had 2 threads running at 95%.
>>> I've restarted it once more and I have a single thread running at 100%.
>>> I don't see this happening in other Gateways... any idea what it might
>>> be? Any idea on how to solve/debug it?
>>>
>>> Thanks,
>>> Estani
>>>
>>
>> --
>> Estanislao Gonzalez
>>
>> Max-Planck-Institut für Meteorologie (MPI-M)
>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>
>> Phone:   +49 (40) 46 00 94-126
>> E-Mail:  gonzalez at dkrz.de
>>
>> _______________________________________________
>> esg-gateway-dev mailing list
>> esg-gateway-dev at mailman.earthsystemgrid.org
>> http://mailman.earthsystemgrid.org/mailman/listinfo/esg-gateway-dev
>>


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de


