[Go-essp-tech] Fwd: Re: Visibility of old versions, was... RE: Fwd: Re: Publishing dataset with option --update
Martina Stockhause
stockhause at dkrz.de
Wed Jan 11 07:20:14 MST 2012
Hello Luca,
please access our atom feed with the CIM quality documents at:
http://cera-www.dkrz.de/WDCC/CMIP5/feed/
Every time an assignment of QC level 2 or QC level 3 is done, a new
entry is added to the feed.
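A client could poll that feed and pick out new QC entries mechanically. A minimal sketch; the sample entry below is invented for illustration (a real client would fetch the live feed with e.g. `curl -s http://cera-www.dkrz.de/WDCC/CMIP5/feed/`):

```shell
#!/bin/sh
# Sketch: extract entry titles from an Atom feed.
# A hypothetical sample feed is written locally so the script is
# self-contained; the dataset name in it is invented.
cat > feed.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>WDCC CMIP5 QC feed</title>
  <entry>
    <title>QC L3 assigned: cmip5.output1.MPI-M.MPI-ESM-LR.historical</title>
    <link href="http://cera-www.dkrz.de/WDCC/CMIP5/example"/>
  </entry>
</feed>
EOF

# Print one line per entry title (crude XML handling, fine for a sketch);
# tail skips the feed-level <title>.
grep -o '<title>[^<]*</title>' feed.xml | sed -e 's/<[^>]*>//g' | tail -n +2
```

A real poller would remember the last entry seen and act only on newer ones.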
Best wishes,
Martina
On 11.01.2012 14:10, Michael Lautenschlager wrote:
> Hello Martina,
> could you please provide Lucca with the requested information with
> copy to the go-essp-tech list.
> Thanks, Michael
>
>
> -------- Original Message --------
> Subject: Re: [Go-essp-tech] Visibility of old versions, was... RE:
> Fwd: Re: Publishing dataset with option --update
> Date: Wed, 11 Jan 2012 04:18:38 -0800
> From: Cinquini, Luca (3880) <Luca.Cinquini at jpl.nasa.gov>
> To: Michael Lautenschlager <lautenschlager at dkrz.de>
> CC: Karl Taylor <taylor13 at llnl.gov>,
> "go-essp-tech at ucar.edu" <go-essp-tech at ucar.edu>, "Drach, Bob"
> <drach1 at llnl.gov>, "serguei.nikonov at noaa.gov" <serguei.nikonov at noaa.gov>
>
>
>
> Hi Michael,
> sorry if I should know this already, but how can we access the
> DOI information for a given dataset? The goal is, of course, to
> enable search on DOIs in the P2P system.
> thanks, Luca
>
> On Jan 11, 2012, at 1:37 AM, Michael Lautenschlager wrote:
>
>> Hi Karl,
>> even with respect to the IPCC DDC I think we have to keep at least the
>> most recent version of CMIP5 and those versions which went through QC-L3
>> with assignment of a DOI and citation reference. At least we at
>> WDCC/DKRZ are under contract with DataCite to keep these
>> DataCite-published data entries forever, in the sense of common library
>> time scales. The number and location of replicas is a CMIP5-internal
>> decision and does not concern DataCite, but we have to ensure identical
>> copies if we link to them from the DOI landing page.
>>
>> These DataCite-published CMIP5 data entities may form the GCM data
>> basis
>> of the IPCC DDC because they are stable, quality-checked, accessible at
>> any time and have a citation reference. So these data entities can be
>> traced back in the scientific literature, provided the citation
>> references are used there. But I agree we have to discuss this with the
>> IPCC DDC people.
>>
>> Best wishes, Michael
>>
>> ---------------
>> Dr. Michael Lautenschlager
>> Head of DKRZ Department Data Management
>> Director World Data Center Climate
>>
>> German Climate Computing Centre (DKRZ)
>> ADDRESS: Bundesstrasse 45a, D-20146 Hamburg, Germany
>> PHONE: +4940-460094-118
>> E-Mail: lautenschlager at dkrz.de
>>
>> URL: http://www.dkrz.de/
>> http://www.wdc-climate.de/
>>
>>
>> Managing Director: Prof. Dr. Thomas Ludwig
>> Registered office: Hamburg
>> Hamburg District Court HRB 39784
>>
>>
>> On 10.01.2012 16:40, Karl Taylor wrote:
>>> Hi all,
>>>
>>> thanks for the good discussion. Some good arguments have been made for
>>> keeping all versions. I'll not make a policy decision immediately, but
>>> am tending toward strong encouragement to keep all versions. I'll
>>> distribute a draft statement about this for your input and comment
>>> before posting. Of course, I'll also consult directly with other IPCC
>>> DDC folks.
>>>
>>> Best regards,
>>> Karl
>>>
>>> On 1/10/12 5:31 AM, Estanislao Gonzalez wrote:
>>>> Hi Jamie,
>>>>
>>>> Indeed, DOIs are not going to solve everything. A DOI is analogous
>>>> to the ISBN of a book, and citing the whole book is not always what
>>>> you want. To continue the analogy, users are indeed working
>>>> with pre-prints which get corrected all the time (i.e. the CMIP5
>>>> archive is in flux). People are writing papers citing a pre-print. Of
>>>> course this makes no sense, but they are not doing so willingly; they
>>>> have to, as the deadline approaches but the computing groups are not
>>>> ready.
>>>>
>>>> So what do we have now? Some archives with a strong commitment for
>>>> preserving data.
>>>> If the DRS were honored, the URL would be enough for citing any
>>>> file, as it has the version in it. Indeed, citing 1000+ URLs is not
>>>> practical, but a redirection could be added so that the scientist
>>>> cites one URL under which all file URLs are listed (there's no
>>>> implementation for this AFAIK). But at least the URLs of
>>>> DRS-committed sites could be safely cited, and if the checksum is
>>>> attached to the citation, it is certain that the correct file is
>>>> always being cited (and it could even be found if moved).
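A checksum-backed citation can be verified mechanically. A minimal sketch; the file name and contents below are stand-ins for a real DRS-named NetCDF file and its published checksum:

```shell
#!/bin/sh
# Sketch: verify a downloaded file against the checksum carried in a
# citation. File name and contents are invented for illustration.
printf 'example data\n' > tas_v1.nc
cited_md5=$(md5sum tas_v1.nc | cut -d' ' -f1)   # value the citation would carry

# Later, someone re-downloads the file and checks it against the citation:
actual_md5=$(md5sum tas_v1.nc | cut -d' ' -f1)
[ "$cited_md5" = "$actual_md5" ] && echo "tas_v1.nc: checksum OK"

# A silently modified file is caught immediately:
printf 'corrected data\n' > tas_v1.nc
actual_md5=$(md5sum tas_v1.nc | cut -d' ' -f1)
[ "$cited_md5" != "$actual_md5" ] && echo "tas_v1.nc: checksum MISMATCH"
```

This is exactly the guarantee a checksum in the citation buys: the reader can prove they hold the cited bytes, wherever the file now lives.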
>>>>
>>>> I don't know how citations are being done now, nor do I know how
>>>> they were done before, when everyone was citing data that was
>>>> almost impossible to get. DOIs are the very first step in the right
>>>> direction, not the last one.
>>>> IMHO the community should come up with some best practices to
>>>> overcome the problem we are facing: how to cite something that's
>>>> permanently changing. Sharing this will certainly help everyone.
>>>> Before jumping away from this subject I'd also like to add that I
>>>> don't see any proper communication mechanism in the community. All
>>>> (or at least most) questions regarding CMIP5 are AFAICT directed to
>>>> the help-desk, so mostly developers are trying to help the community
>>>> instead of the community trying to help itself. I think we might be
>>>> missing some kind of platform for doing this. We don't have the means
>>>> to support the growing community (and the new communities which we are
>>>> now serving); we need them to help with the "helping". Just a
>>>> thought...
>>>>
>>>> And last, and probably least: the only way to get the latest version
>>>> of any dataset is by re-issuing the search. Especially since multiple
>>>> datasets are referred to in a wget script, finding the latest
>>>> version of each of them "by hand" would be more time-consuming than
>>>> issuing the search query again.
>>>>
>>>> Thanks,
>>>> Estani
>>>>
>>>>> On 10.01.2012 13:00, Kettleborough, Jamie wrote:
>>>>> Hello,
>>>>> I'm not sure how to say this, but I'm not sure it's just down to
>>>>> DOIs to determine whether a dataset should always be visible. I
>>>>> think data needs to be visible where it's sufficiently important
>>>>> that a user might want to download it, e.g. they want to check or
>>>>> extend someone else's study (and I think there are other reasons).
>>>>> It's not clear to me that all data of this kind will have a DOI; for
>>>>> instance, how many of the datasets referenced in papers being
>>>>> written now for the summer deadline of AR5 have (or will have in
>>>>> time) DOIs? I know it's tempting to say that any dataset referenced
>>>>> in a paper should have a DOI, but I think you need to be realistic
>>>>> about the prospects of this happening on the right timescales.
>>>>> If the DOI is used as the determinant of whether data is always
>>>>> visible, then should users be made aware of the risk they are
>>>>> carrying now? For instance, so they know to keep local backups of
>>>>> data that is really important to them (with the possible
>>>>> implication that they may also need to be prepared to 're-share'
>>>>> this data with others).
>>>>> For what it's worth, my personal preference is for the BADC/DKRZ
>>>>> (and I'm sure others') philosophy of keeping all versions, though I
>>>>> realise there are costs in doing this: getting DRSlib
>>>>> sufficiently bug-free, getting it to work in all the contexts it
>>>>> needs to (hard links/soft links), getting it deployed, getting the
>>>>> update mechanism in place for when new bugs are found, etc. If you
>>>>> used DRSlib, doesn't Estani's use case that caused user grief become
>>>>> easier too? The wget scripts would not need regenerating; you should
>>>>> instead be able to replace the version strings in the URLs (though I
>>>>> may be assuming things about load balancing etc. in saying this).
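Rewriting the version strings in an existing wget script can be sketched as follows; the host name, file name, and both version numbers below are invented for illustration:

```shell
#!/bin/sh
# Sketch: point an old wget script at a newer dataset version by
# rewriting the DRS version component (a path element of the form
# vYYYYMMDD) in each URL. All names and versions here are invented.
cat > wget_cmip5.sh <<'EOF'
wget http://example-node.org/thredds/fileServer/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/aerosol/aero/r1i1p1/v20110601/od550aer_aero_HadGEM2-ES_rcp85_r1i1p1_200512-203011.nc
EOF

# Rewrite v20110601 -> v20111206 in every URL (portable: write to a
# new file rather than relying on GNU sed's -i).
sed -e 's|/v20110601/|/v20111206/|g' wget_cmip5.sh > wget_cmip5_new.sh
grep -c 'v20111206' wget_cmip5_new.sh
```

This only works if every site honors the DRS path layout, which is exactly the caveat raised above.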
>>>>> Jamie
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> *From:* go-essp-tech-bounces at ucar.edu
>>>>> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Estanislao
>>>>> Gonzalez
>>>>> *Sent:* 10 January 2012 10:21
>>>>> *To:* Karl Taylor
>>>>> *Cc:* Drach, Bob; go-essp-tech at ucar.edu; serguei.nikonov at noaa.gov
>>>>> *Subject:* Re: [Go-essp-tech] Fwd: Re: Publishing dataset with
>>>>> option --update
>>>>>
>>>>> Well, to be honest, I do agree this is a decision each institution
>>>>> has to make, but for us I'd prefer offering everything we have
>>>>> and letting the systems decide what to do with this information.
>>>>> For example, I've used it to generate some comments (I might
>>>>> have already shown you this); just go here:
>>>>>
>>>>> http://ipcc-ar5.dkrz.de/dataset/cmip5.output1.NCC.NorESM1-M.sstClim.mon.land.Lmon.r1i1p1.html
>>>>> and click on "history".
>>>>> That information could be generated only because we store the
>>>>> metadata of the previous version.
>>>>>
>>>>> By the way, the only way of inhibiting the user from getting an
>>>>> older version, if that's what is wanted, is by either removing
>>>>> the files from the TDS-served directory or changing the access
>>>>> restriction at the Gateway. Because of a well-known TDS bug (or
>>>>> feature), files present in that directory but not found in any
>>>>> catalog are served without any restriction (AFAIK no certificate
>>>>> is required for this). So, normally the wget script would work
>>>>> even if the files were unpublished.
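One way to spot such orphaned files is to compare the served directory against the file URLs referenced in the TDS catalog. A minimal sketch with an invented directory layout and a toy catalog (real esgcet catalogs are richer, but the idea is the same):

```shell
#!/bin/sh
# Sketch: list files present in the data directory but absent from the
# catalog -- per the bug described above, these would be served without
# access control. All names below are invented.
mkdir -p dataroot
touch dataroot/tas_v1.nc dataroot/pr_v1.nc dataroot/orphan_v0.nc

cat > catalog.xml <<'EOF'
<catalog>
  <dataset urlPath="dataroot/tas_v1.nc"/>
  <dataset urlPath="dataroot/pr_v1.nc"/>
</catalog>
EOF

# Files on disk, sorted:
ls dataroot | sort > on_disk.txt
# File names the catalog knows about, sorted:
grep -o 'urlPath="[^"]*"' catalog.xml | sed -e 's|.*/||' -e 's|"$||' | sort > in_catalog.txt
# Present on disk but not in the catalog:
comm -23 on_disk.txt in_catalog.txt
```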
>>>>>
>>>>> It really depends on the use-case... but e.g. I had to explain
>>>>> all this to a couple of people at the help-desk, since the wget
>>>>> script they had downloaded wasn't working anymore (the files had
>>>>> been removed). They weren't thrilled to learn they had to re-issue
>>>>> the search (there's no workaround for this), and they wanted to
>>>>> know what had changed in the new version; and that's where we
>>>>> can't help our users any more, since we don't have that
>>>>> information...
>>>>>
>>>>> I don't know what our users prefer, but I think they have more
>>>>> important problems to cope with at this time... if they could
>>>>> reliably get one version they could start worrying about others.
>>>>> From my perspective as a data manager, it's worth the tiny
>>>>> additional effort, if there's any.
>>>>>
>>>>> Cheers,
>>>>> Estani
>>>>>
>>>>>> On 09.01.2012 20:05, Karl Taylor wrote:
>>>>>> Hi Estani,
>>>>>>
>>>>>> I agree that a new version number should (I'd say must) be
>>>>>> assigned when any changes are made. However, except for DOI
>>>>>> datasets, most groups will not want older versions to be
>>>>>> visible or downloadable.
>>>>>>
>>>>>> Do you agree?
>>>>>>
>>>>>> cheers,
>>>>>> Karl
>>>>>>
>>>>>> On 1/9/12 10:37 AM, Estanislao Gonzalez wrote:
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> It is indeed a good point, but I must add that we are not
>>>>>>> talking about preserving a version (although we do that here at
>>>>>>> DKRZ) but about signaling that a version has changed. So the
>>>>>>> version is a key to finding a specific dataset which changes
>>>>>>> over time.
>>>>>>>
>>>>>>> Even before a DOI assignment I'd encourage all to create a new
>>>>>>> version every time the dataset changes in any way.
>>>>>>> Institutions have the right to preserve whatever versions they
>>>>>>> want (they may even delete DOI-assigned versions; archives, on
>>>>>>> the other hand, can't, as that's what archives are for).
>>>>>>> But altering the dataset while preserving the version just brings
>>>>>>> chaos for the users, and for us at the help-desk, as we have to
>>>>>>> explain why something has changed (or rather answer that we
>>>>>>> don't know why...). It means that the same key now points to a
>>>>>>> different dataset.
>>>>>>>
>>>>>>> The only benefits I can see in preserving the same version are
>>>>>>> that publishing under the same version seems to be easier for
>>>>>>> some (for our workflow it's not, it's exactly the same) and
>>>>>>> that, if only new files are added, this seems to work fine for
>>>>>>> publication at both the data-node and the gateway, as it's
>>>>>>> properly supported.
>>>>>>> If anything else changes, this does not work as expected
>>>>>>> (wrong checksums, ghost files at the gateway, etc.). And
>>>>>>> changing a version's contents makes no sense to the user IMHO
>>>>>>> (e.g. it's as if you might sometimes get more files from a
>>>>>>> tarred file... how often should you extract it to be sure you
>>>>>>> got "all of them"?)
>>>>>>>
>>>>>>> If old versions were preserved (which takes almost no extra
>>>>>>> space if using hardlinks), a simple comparison would show that
>>>>>>> the only changes were the addition of some specific files.
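The hardlink scheme can be sketched as follows; the directory and file names are invented (real DRS trees use vYYYYMMDD version directories), and `stat -c` is the GNU form:

```shell
#!/bin/sh
# Sketch: keep two dataset versions side by side, hardlinking unchanged
# files so the old version costs almost no extra space. Names invented.
mkdir -p ds/v1 ds/v2
printf 'tas data\n' > ds/v1/tas.nc
printf 'pr data\n'  > ds/v1/pr.nc

# v2 reuses v1's files via hardlinks and adds one new file.
ln ds/v1/tas.nc ds/v2/tas.nc
ln ds/v1/pr.nc  ds/v2/pr.nc
printf 'psl data\n' > ds/v2/psl.nc

# The hardlinked file has link count 2 -- one inode, two names (GNU stat):
[ "$(stat -c %h ds/v1/tas.nc)" = "2" ] && echo "tas.nc shared between versions"

# A simple comparison shows the only change is the added file
# (diff exits non-zero when it finds differences, hence the || true):
diff -rq ds/v1 ds/v2 || true
```

The `diff -rq` line prints only "Only in ds/v2: psl.nc", which is exactly the "simple comparison" described above.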
>>>>>>>
>>>>>>> Basically, reusing the version ends in a non-recoverable loss
>>>>>>> of information. That's why I discourage it.
>>>>>>>
>>>>>>> My 2c,
>>>>>>> Estani
>>>>>>>
>>>>>>>> On 09.01.2012 17:25, Karl Taylor wrote:
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> I do not have time to read this thoroughly, so perhaps what
>>>>>>>> I'll mention here is irrelevant. There may be some
>>>>>>>> miscommunication about what is meant by "version". There are
>>>>>>>> two cases to consider:
>>>>>>>>
>>>>>>>> 1. Before a dataset has become official (i.e., assigned a
>>>>>>>> DOI), a group may choose to remove all record of it from the
>>>>>>>> database and publish a replacement version.
>>>>>>>>
>>>>>>>> 2. Alternatively, if a group wants to preserve a previous
>>>>>>>> version (as is required after a DOI has been assigned), then
>>>>>>>> the new version will not "replace" the previous version, but
>>>>>>>> simply be added to the archive.
>>>>>>>>
>>>>>>>> It is possible that different publication procedures will
>>>>>>>> apply in these different cases.
>>>>>>>>
>>>>>>>> best,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On 1/9/12 4:26 AM, Estanislao Gonzalez wrote:
>>>>>>>>> Just to mention that we do the same thing. We use
>>>>>>>>> --new-version directly and a map file containing all files
>>>>>>>>> for the new version,
>>>>>>>>> but we do create hard-links to the files being reused, so
>>>>>>>>> they are
>>>>>>>>> indeed all "new", as their paths always differ from those
>>>>>>>>> of previous
>>>>>>>>> versions. (In any case, for the publisher they are the same,
>>>>>>>>> and thus
>>>>>>>>> it encodes them with the nc_0 name, if I recall correctly.)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Estani
>>>>>>>>> On 09.01.2012 12:15, stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>> Hi Bob,
>>>>>>>>>>
>>>>>>>>>> This "unpublish first" requirement is news to me. We've
>>>>>>>>>> been publishing new versions without doing this for some
>>>>>>>>>> time. Now, we have come across difficulties with a few
>>>>>>>>>> datasets but it's generally worked.
>>>>>>>>>>
>>>>>>>>>> We don't use the --update option though. Each time we
>>>>>>>>>> publish a new version we provide a mapfile of all files in
>>>>>>>>>> the dataset(s). I'd recommend Sergey try doing this before
>>>>>>>>>> removing a previous version.
>>>>>>>>>>
>>>>>>>>>> If you unpublish from the Gateway first you'll lose the
>>>>>>>>>> information in the "History" tab. For instance,
>>>>>>>>>> http://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output2.MOHC.HadGEM2-ES.rcp85.mon.aerosol.aero.r1i1p1.html
>>>>>>>>>> shows 2 versions.
>>>>>>>>>>
>>>>>>>>>> Stephen.
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> Stephen Pascoe +44 (0)1235 445980
>>>>>>>>>> Centre of Environmental Data Archival
>>>>>>>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford,
>>>>>>>>>> Didcot OX11 0QX, UK
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Drach, Bob
>>>>>>>>>> Sent: 06 January 2012 20:53
>>>>>>>>>> To: Serguei Nikonov; Eric Nienhouse
>>>>>>>>>> Cc: go-essp-tech at ucar.edu
>>>>>>>>>> Subject: Re: [Go-essp-tech] Fwd: Re: Publishing dataset
>>>>>>>>>> with option --update
>>>>>>>>>>
>>>>>>>>>> Hi Sergey,
>>>>>>>>>>
>>>>>>>>>> When updating a dataset, it's also important to unpublish
>>>>>>>>>> it before publishing the new version. E.g., first run
>>>>>>>>>>
>>>>>>>>>> esgunpublish <dataset_id>
>>>>>>>>>>
>>>>>>>>>> The reason is that, when you publish to the gateway, the
>>>>>>>>>> gateway software tries to *add* the new information to the
>>>>>>>>>> existing dataset entry, rather than replace it.
>>>>>>>>>>
>>>>>>>>>> --Bob
>>>>>>>>>> ________________________________________
>>>>>>>>>> From: Serguei Nikonov [serguei.nikonov at noaa.gov]
>>>>>>>>>> Sent: Friday, January 06, 2012 10:45 AM
>>>>>>>>>> To: Eric Nienhouse
>>>>>>>>>> Cc: Bob Drach; go-essp-tech at ucar.edu
>>>>>>>>>> Subject: Re: [Go-essp-tech] Fwd: Re: Publishing dataset
>>>>>>>>>> with option --update
>>>>>>>>>>
>>>>>>>>>> Hi Eric,
>>>>>>>>>>
>>>>>>>>>> thanks for your help. I have no objections to whatever
>>>>>>>>>> versioning policy is
>>>>>>>>>> adopted. What I need is to know how to apply it. The ways
>>>>>>>>>> I tried did not work for
>>>>>>>>>> me. Hopefully the reason is the bad entries in THREDDS and
>>>>>>>>>> the database that you pointed
>>>>>>>>>> out. I am cleaning them up right now, then we will see...
>>>>>>>>>>
>>>>>>>>>> Just for clarification: if I need to update a dataset (with
>>>>>>>>>> a version change), I
>>>>>>>>>> create a map file containing the full set of files (old and new
>>>>>>>>>> ones) and then use
>>>>>>>>>> this map file in the esgpublish script with the option --update;
>>>>>>>>>> is that correct? Will it
>>>>>>>>>> be enough to create a dataset of the new version? BTW, there
>>>>>>>>>> is nothing about
>>>>>>>>>> versions for the option 'update' in the esgpublish help.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Sergey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 01/04/2012 04:27 PM, Eric Nienhouse wrote:
>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>
>>>>>>>>>>> Following are a few more suggestions to diagnose this
>>>>>>>>>>> publishing issue. I agree
>>>>>>>>>>> with others on this thread that adding new files (or
>>>>>>>>>>> changing existing ones)
>>>>>>>>>>> should always trigger a new dataset version.
>>>>>>>>>>>
>>>>>>>>>>> It does not appear you are receiving a final "SUCCESS"
>>>>>>>>>>> or failure message when
>>>>>>>>>>> publishing to the Gateway (with esgpublish --publish).
>>>>>>>>>>> Please try increasing
>>>>>>>>>>> your "polling" levels in your $ESGINI file. E.g.:
>>>>>>>>>>>
>>>>>>>>>>> hessian_service_polling_delay = 10
>>>>>>>>>>> hessian_service_polling_iterations = 500
>>>>>>>>>>>
>>>>>>>>>>> You should see a final "SUCCESS" or "ERROR" with Java
>>>>>>>>>>> trace output at the
>>>>>>>>>>> termination of the command.
>>>>>>>>>>>
>>>>>>>>>>> I've reviewed the Thredds catalog for the dataset you
>>>>>>>>>>> note below:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> http://esgdata.gfdl.noaa.gov/thredds/esgcet/1/cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.v2.xml
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> There appear to be multiple instances of certain files
>>>>>>>>>>> within the catalog, which
>>>>>>>>>>> is a problem. The Gateway publish will fail if a
>>>>>>>>>>> particular file (URL) is
>>>>>>>>>>> referenced multiple times with differing metadata. An
>>>>>>>>>>> example is:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> */gfdl_dataroot/NOAA-GFDL/GFDL-CM3/historical/mon/atmos/Amon/r1i1p1/v20110601/rtmt/rtmt_Amon_GFDL-CM3_historical_r1i1p1_186001-186412.nc
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This file appears as two separate file versions in the
>>>>>>>>>>> Thredds catalog (one with
>>>>>>>>>>> id ending in ".nc" and another with ".nc_0"). There
>>>>>>>>>>> should be only one reference
>>>>>>>>>>> to this file URL in the catalog.
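Duplicate references of this kind can be found mechanically by looking for file URLs that occur more than once in the catalog. A sketch over a toy catalog (the IDs and paths below are invented stand-ins for the real esgcet entries):

```shell
#!/bin/sh
# Sketch: flag any file URL referenced more than once in a THREDDS
# catalog (e.g. once as ".nc" and once as ".nc_0"). Toy catalog content.
cat > catalog.xml <<'EOF'
<catalog>
  <dataset id="rtmt.nc"   urlPath="gfdl/rtmt_186001-186412.nc"/>
  <dataset id="rtmt.nc_0" urlPath="gfdl/rtmt_186001-186412.nc"/>
  <dataset id="tas.nc"    urlPath="gfdl/tas_186001-186412.nc"/>
</catalog>
EOF

# Any urlPath occurring more than once is a duplicate reference:
grep -o 'urlPath="[^"]*"' catalog.xml | sort | uniq -d
```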
>>>>>>>>>>>
>>>>>>>>>>> The previous version of the dataset in the
>>>>>>>>>>> publisher/node database may be
>>>>>>>>>>> leading to this issue. You may need to add
>>>>>>>>>>> "--database-delete" to your
>>>>>>>>>>> esgunpublish command to clean things up. Bob can advise
>>>>>>>>>>> on this. Note that the
>>>>>>>>>>> original esgpublish command shown in this email thread
>>>>>>>>>>> included "--keep-version".
>>>>>>>>>>>
>>>>>>>>>>> After publishing to the Gateway successfully, you can
>>>>>>>>>>> check the dataset details
>>>>>>>>>>> by URL with the published dataset identifier. For example:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.html
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> -Eric
>>>>>>>>>>>
>>>>>>>>>>> Serguei Nikonov wrote:
>>>>>>>>>>>> Hi Bob,
>>>>>>>>>>>>
>>>>>>>>>>>> I still cannot do anything about updating datasets.
>>>>>>>>>>>> The commands you
>>>>>>>>>>>> suggested executed successfully, but the datasets did not
>>>>>>>>>>>> appear on the gateway. I
>>>>>>>>>>>> tried it several times for different datasets, but the
>>>>>>>>>>>> result is the same.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have any idea what to do in such a situation?
>>>>>>>>>>>>
>>>>>>>>>>>> Here are some details about what I tried.
>>>>>>>>>>>> I needed to add file to dataset
>>>>>>>>>>>>
>>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.
>>>>>>>>>>>>
>>>>>>>>>>>> As you advised I unpublished it (esgunpublish
>>>>>>>>>>>>
>>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1)
>>>>>>>>>>>> and then
>>>>>>>>>>>> created a full mapfile (with the additional file) and then
>>>>>>>>>>>> published it:
>>>>>>>>>>>> esgpublish --read-files --map new_mapfile --project
>>>>>>>>>>>> cmip5 --thredds --publish
>>>>>>>>>>>>
>>>>>>>>>>>> As I said, there were no errors. The dataset is in the
>>>>>>>>>>>> database and in THREDDS but
>>>>>>>>>>>> not in the gateway.
>>>>>>>>>>>>
>>>>>>>>>>>> The second way I tried was using a mapfile containing only
>>>>>>>>>>>> the files to update. I
>>>>>>>>>>>> needed to substitute new files for several existing files
>>>>>>>>>>>> in the dataset. I created a
>>>>>>>>>>>> mapfile with only the files to be substituted:
>>>>>>>>>>>> esgscan_directory --read-files --project cmip5 -o
>>>>>>>>>>>> mapfile.txt
>>>>>>>>>>>>
>>>>>>>>>>>> /data/CMIP5/output1/NOAA-GFDL/GFDL-ESM2M/historical/mon/ocean/Omon/r1i1p1/v20111206
>>>>>>>>>>>>
>>>>>>>>>>>> and then published it with the update option:
>>>>>>>>>>>> esgpublish --update --map mapfile.txt --project cmip5
>>>>>>>>>>>> --thredds --publish
>>>>>>>>>>>>
>>>>>>>>>>>> The result is the same as in the previous case - all
>>>>>>>>>>>> things are fine locally but
>>>>>>>>>>>> nothing happened on the gateway.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Sergey
>>>>>>>>>>>>
>>>>>>>>>>>> -------- Original Message --------
>>>>>>>>>>>> Subject: Re: [Go-essp-tech] Publishing dataset with
>>>>>>>>>>>> option --update
>>>>>>>>>>>> Date: Thu, 29 Dec 2011 11:02:05 -0500
>>>>>>>>>>>> From: Serguei Nikonov<Serguei.Nikonov at noaa.gov>
>>>>>>>>>>>> Organization: GFDL
>>>>>>>>>>>> To: Drach, Bob<drach1 at llnl.gov>
>>>>>>>>>>>> CC: Nathan Wilhelmi<wilhelmi at ucar.edu>, "Ganzberger,
>>>>>>>>>>>> Michael"
>>>>>>>>>>>> <Ganzberger1 at llnl.gov>,"go-essp-tech at ucar.edu"<go-essp-tech at ucar.edu>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Bob,
>>>>>>>>>>>>
>>>>>>>>>>>> I tried the first way you suggested and it worked
>>>>>>>>>>>> partially - the dataset was
>>>>>>>>>>>> created on the datanode with version 2 but it did not
>>>>>>>>>>>> show up on the gateway. To make
>>>>>>>>>>>> sure that this was not a one-off result I repeated it with
>>>>>>>>>>>> other datasets, with
>>>>>>>>>>>> the same result.
>>>>>>>>>>>> Now I have 2 datasets on the datanode (visible in the THREDDS
>>>>>>>>>>>> server) but they are
>>>>>>>>>>>> absent on the gateway:
>>>>>>>>>>>>
>>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.v2
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r2i1p1.v2.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Does it make sense to repeat esgpublish with the 'publish'
>>>>>>>>>>>> option?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks and Happy New Year,
>>>>>>>>>>>> Sergey
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/21/2011 08:41 PM, Drach, Bob wrote:
>>>>>>>>>>>>> Hi Sergey,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The way I would recommend adding new files to an
>>>>>>>>>>>>> existing dataset is as
>>>>>>>>>>>>> follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Unpublish the previous dataset from the gateway and
>>>>>>>>>>>>> thredds
>>>>>>>>>>>>>
>>>>>>>>>>>>> % esgunpublish
>>>>>>>>>>>>>
>>>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Add the new files to the existing mapfile for the
>>>>>>>>>>>>> dataset they are being
>>>>>>>>>>>>> added to.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Republish with the expanded mapfile:
>>>>>>>>>>>>>
>>>>>>>>>>>>> % esgpublish --read-files --map newmap.txt --project
>>>>>>>>>>>>> cmip5 --thredds
>>>>>>>>>>>>> --publish
>>>>>>>>>>>>>
>>>>>>>>>>>>> The publisher will:
>>>>>>>>>>>>> - not rescan existing files, only the new files
>>>>>>>>>>>>> - create a new version to reflect the additional files
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alternatively you can create a mapfile with *only* the
>>>>>>>>>>>>> new files (using
>>>>>>>>>>>>> esgscan_directory), then republish using the --update
>>>>>>>>>>>>> option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --Bob
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/21/11 8:40 AM, "Serguei
>>>>>>>>>>>>> Nikonov"<serguei.nikonov at noaa.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Nate,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> unfortunately this is not the only dataset I have a
>>>>>>>>>>>>>> problem with - there are at
>>>>>>>>>>>>>> least
>>>>>>>>>>>>>> 5 more. Should I unpublish them locally (db, thredds)
>>>>>>>>>>>>>> and then create a new
>>>>>>>>>>>>>> version containing the full set of files? What is the
>>>>>>>>>>>>>> official way to update a
>>>>>>>>>>>>>> dataset?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Sergey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12/20/2011 07:06 PM, Nathan Wilhelmi wrote:
>>>>>>>>>>>>>>> Hi Bob/Mike,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I believe the problem is that when files were added
>>>>>>>>>>>>>>> the timestamp on the
>>>>>>>>>>>>>>> dataset
>>>>>>>>>>>>>>> wasn't updated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The triple store will only harvest datasets that
>>>>>>>>>>>>>>> have files and an updated
>>>>>>>>>>>>>>> timestamp after the last harvest.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So what likely happened is the dataset was created
>>>>>>>>>>>>>>> without files, so it
>>>>>>>>>>>>>>> wasn't
>>>>>>>>>>>>>>> initially harvested. Files were subsequently added,
>>>>>>>>>>>>>>> but the timestamp wasn't
>>>>>>>>>>>>>>> updated, so it was still not a candidate for
>>>>>>>>>>>>>>> harvesting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you update the date_updated timestamp for the
>>>>>>>>>>>>>>> dataset in question and
>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>> trigger the RDF harvesting? I believe the dataset
>>>>>>>>>>>>>>> will show up then.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> -Nate
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/20/2011 11:49 AM, Serguei Nikonov wrote:
>>>>>>>>>>>>>>>> Hi Mike,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am a member of the data publishers group. I have been
>>>>>>>>>>>>>>>> publishing a considerable
>>>>>>>>>>>>>>>> amount of data without this kind of trouble, but
>>>>>>>>>>>>>>>> this issue occurred only when
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> tried to add some files to an existing dataset.
>>>>>>>>>>>>>>>> Publishing from scratch works
>>>>>>>>>>>>>>>> fine
>>>>>>>>>>>>>>>> for me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Sergey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 12/20/2011 01:29 PM, Ganzberger, Michael wrote:
>>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That task is on a scheduler and will re-run every
>>>>>>>>>>>>>>>>> 10 minutes. If your data
>>>>>>>>>>>>>>>>> does not appear after that time then perhaps there
>>>>>>>>>>>>>>>>> is another issue. One
>>>>>>>>>>>>>>>>> issue could be that publishing to the gateway
>>>>>>>>>>>>>>>>> requires that you have the
>>>>>>>>>>>>>>>>> role
>>>>>>>>>>>>>>>>> of "Data Publisher";
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> "check that the account is member of the proper
>>>>>>>>>>>>>>>>> group and has the special
>>>>>>>>>>>>>>>>> role of Data Publisher"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> http://esgf.org/wiki/ESGFNode/FAQ
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mike
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Serguei Nikonov
>>>>>>>>>>>>>>>>> [mailto:serguei.nikonov at noaa.gov]
>>>>>>>>>>>>>>>>> Sent: Tuesday, December 20, 2011 10:12 AM
>>>>>>>>>>>>>>>>> To: Ganzberger, Michael
>>>>>>>>>>>>>>>>> Cc: Stéphane Senesi; Drach, Bob; go-essp-tech at ucar.edu
>>>>>>>>>>>>>>>>> Subject: Re: [Go-essp-tech] Publishing dataset
>>>>>>>>>>>>>>>>> with option --update
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Mike,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks for the suggestion, but I don't have any
>>>>>>>>>>>>>>>>> privileges to do anything on the
>>>>>>>>>>>>>>>>> gateway.
>>>>>>>>>>>>>>>>> I am just publishing data on the GFDL data node.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Sergey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 12/20/2011 01:05 PM, Ganzberger, Michael wrote:
>>>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'd like to suggest this, which may help you, from
>>>>>>>>>>>>>>>>>> http://esgf.org/wiki/Cmip5Gateway/FAQ
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "The search does not reflect the latest DB
>>>>>>>>>>>>>>>>>> changes I've made
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You have to manually trigger the 3store
>>>>>>>>>>>>>>>>>> harvesting. Log in as root, go
>>>>>>>>>>>>>>>>>> to Admin->"Gateway Scheduled Tasks"->"Run tasks"
>>>>>>>>>>>>>>>>>> and restart the job named
>>>>>>>>>>>>>>>>>> RDFSynchronizationJobDetail"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Mike Ganzberger
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>>>>>>>>>>>>>> [mailto:go-essp-tech-bounces at ucar.edu]
>>>>>>>>>>>>>>>>>> On Behalf Of Stéphane Senesi
>>>>>>>>>>>>>>>>>> Sent: Tuesday, December 20, 2011 9:42 AM
>>>>>>>>>>>>>>>>>> To: Serguei Nikonov
>>>>>>>>>>>>>>>>>> Cc: Drach, Bob; go-essp-tech at ucar.edu
>>>>>>>>>>>>>>>>>> Subject: Re: [Go-essp-tech] Publishing dataset
>>>>>>>>>>>>>>>>>> with option --update
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Serguei
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We have for some time now experienced similar
>>>>>>>>>>>>>>>>>> problems when publishing
>>>>>>>>>>>>>>>>>> to the PCMDI gateway, i.e. not getting a
>>>>>>>>>>>>>>>>>> "SUCCESS" message when
>>>>>>>>>>>>>>>>>> publishing. Sometimes the files are actually
>>>>>>>>>>>>>>>>>> published (or at least
>>>>>>>>>>>>>>>>>> accessible through the gateway, their status
>>>>>>>>>>>>>>>>>> actually being
>>>>>>>>>>>>>>>>>> "START_PUBLISHING" according to the esg_list_datasets
>>>>>>>>>>>>>>>>>> report), sometimes not. One
>>>>>>>>>>>>>>>>>> hypothesis is that the load on the PCMDI Gateway
>>>>>>>>>>>>>>>>>> generates the problem. We
>>>>>>>>>>>>>>>>>> haven't yet had confirmation from Bob.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In contrast to your case, this happens when
>>>>>>>>>>>>>>>>>> publishing a dataset from
>>>>>>>>>>>>>>>>>> scratch (I mean, not an update)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best regards (do not expect any feedback from me
>>>>>>>>>>>>>>>>>> until early January)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> S
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Serguei Nikonov wrote, On 20/12/2011 18:11:
>>>>>>>>>>>>>>>>>>> Hi Bob,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I needed to add some missing variables to an
>>>>>>>>>>>>>>>>>>> existing dataset, and I found in the
>>>>>>>>>>>>>>>>>>> esgpublish command an option --update. When I
>>>>>>>>>>>>>>>>>>> tried it I got a normal
>>>>>>>>>>>>>>>>>>> message like
>>>>>>>>>>>>>>>>>>> INFO 2011-12-20 11:21:00,893 Publishing:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1,
>>>>>>>>>>>>>>>>>>> parent
>>>>>>>>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>>> pcmdi.GFDL
>>>>>>>>>>>>>>>>>>> INFO 2011-12-20 11:21:07,564 Result: PROCESSING
>>>>>>>>>>>>>>>>>>> INFO 2011-12-20 11:21:11,209 Result: PROCESSING
>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> but nothing happened on gateway - new variables
>>>>>>>>>>>>>>>>>>> are not there. The files
>>>>>>>>>>>>>>>>>>> corresponding to these variables are in database
>>>>>>>>>>>>>>>>>>> and in THREDDS catalog
>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>> apparently were not published on gateway.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I used command line
>>>>>>>>>>>>>>>>>>> esgpublish --update --keep-version
>>>>>>>>>>>>>>>>>>> --map<map_file> --project cmip5
>>>>>>>>>>>>>>>>>>> --noscan
>>>>>>>>>>>>>>>>>>> --publish.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Should map file be of some specific format to
>>>>>>>>>>>>>>>>>>> make it works in mode I
>>>>>>>>>>>>>>>>>>> need?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Sergey Nikonov
>>>>>>>>>>>>>>>>>>> GFDL
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
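[The mapfile Serguei asks about can be sketched as below. The pipe-separated `dataset_id | file_path | size` layout shown here is an assumption based on the CMIP5-era ESG publisher; the file path and size are invented for illustration, so check the mapfile section of your own esg.ini/publisher docs before relying on it.]

```shell
# Minimal sketch of an ESG publisher mapfile entry (ASSUMED format:
# pipe-separated dataset_id | absolute_file_path | size_in_bytes;
# path and size below are illustrative placeholders).
cat > /tmp/sample.map <<'EOF'
cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1 | /data/ta_Amon_GFDL-CM3_historical_r1i1p1_185001-185412.nc | 123456
EOF

# Sanity-check that each entry has three pipe-separated fields:
awk -F'|' '{print NF}' /tmp/sample.map
```

With `--noscan`, the publisher takes its file list from such a mapfile rather than rescanning the data directories, so a malformed entry can silently drop files from the update.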
>>>>>>>>> --
>>>>>>>>> Estanislao Gonzalez
>>>>>>>>>
>>>>>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>>>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate
>>>>>>>>> Computing Centre
>>>>>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>>>>
>>>>>>>>> Phone: +49 (40) 46 00 94-126
>>>>>>>>> E-Mail: gonzalez at dkrz.de
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>
>
>
--
------------------ DKRZ / Data Management ------------------
Martina Stockhause
Deutsches Klimarechenzentrum phone: +49-40-460094-122
Bundesstr. 45a FAX: +49-40-460094-106
D-20146 Hamburg, Germany e-mail: stockhause at dkrz.de
------------------------------------------------------------