[Go-essp-tech] Visibility of old versions, was... RE: Fwd: Re: Publishing dataset with option --update
Cinquini, Luca (3880)
Luca.Cinquini at jpl.nasa.gov
Wed Jan 11 05:18:38 MST 2012
Hi Michael,
sorry if I should know this already, but how can we access the DOI information for a given dataset? The goal is, of course, to enable search on DOIs in the P2P system.
thanks, Luca
On Jan 11, 2012, at 1:37 AM, Michael Lautenschlager wrote:
> Hi Karl,
> even with respect to IPCC DDC I think we have to keep at least the most
> recent version of CMIP5 and those versions which ran through QC-L3 with
> assignment of DOI and citation reference. At least we at WDCC/DKRZ are under
> contract with DataCite to keep these DataCite published data entries
> forever in the sense of common library time scales. The number and
> location of replicas can be decided within CMIP5 and does not matter to
> DataCite but we have to ensure identical copies if we link to them from
> the DOI landing page.
>
> These DataCite published CMIP5 data entities may form the GCM data basis
> of the IPCC DDC because they are stable, quality proofed, accessible at
> any time and have a citation reference. So these data entities can be
> traced back in the scientific literature providing the citation
> references are used there. But I agree we have to discuss this with the
> IPCC DDC people.
>
> Best wishes, Michael
>
> ---------------
> Dr. Michael Lautenschlager
> Head of DKRZ Department Data Management
> Director World Data Center Climate
>
> German Climate Computing Centre (DKRZ)
> ADDRESS: Bundesstrasse 45a, D-20146 Hamburg, Germany
> PHONE: +4940-460094-118
> E-Mail: lautenschlager at dkrz.de
>
> URL: http://www.dkrz.de/
> http://www.wdc-climate.de/
>
>
> Managing Director: Prof. Dr. Thomas Ludwig
> Registered office: Hamburg
> Hamburg District Court, HRB 39784
>
>
> On 10.01.2012 16:40, Karl Taylor wrote:
>> Hi all,
>>
>> thanks for the good discussion. Some good arguments have been made for
>> keeping all versions. I'll not make a policy decision immediately, but
>> am tending toward strong encouragement to keep all versions. I'll
>> distribute a draft statement about this for your input and comment
>> before posting. I'll, of course, also consult directly with other IPCC
>> DDC folks too.
>>
>> Best regards,
>> Karl
>>
>> On 1/10/12 5:31 AM, Estanislao Gonzalez wrote:
>>> Hi Jamie,
>>>
>>> Indeed, DOIs are not going to solve everything. The DOI is analogous
>>> to the ISBN of a book, citing the whole book is not always what you
>>> want in any case. To continue the analogy, users are indeed working
>>> with pre-prints which get corrected all the time (i.e. the CMIP5
>>> archive is in flux). People are writing papers citing a pre-print. Of
>>> course this makes no sense, but they are not doing so willingly; they
>>> have to, as the deadline approaches but the computing groups are not
>>> ready.
>>>
>>> So what do we have now? Some archives with a strong commitment for
>>> preserving data.
>>> If the DRS were honored, the URL would be enough for citing any file
>>> as it has the version in it. Indeed citing 1000+ URLs is not
>>> practical, but a redirection could be added so that the scientist
>>> cites one URL in which all file URLs are listed (there's no
>>> implementation for this AFAIK). But at least the URL of DRS committed
>>> sites could be safely cited, and if the checksum is attached to the
>>> citation, it is sure that the correct file is always being cited (and
>>> it could even be found if moved).
>>>
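The idea of attaching a checksum to a cited URL could be sketched roughly as follows. This is only an illustration: the function names, the citation layout, and the choice of SHA-256 are assumptions made here, not an agreed CMIP5 or DRS convention.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cite(url, path, algorithm="sha256"):
    """Format a hypothetical citation string: versioned URL plus checksum."""
    return f"{url} ({algorithm}:{file_checksum(path, algorithm)})"
```

A reader holding the citation can recompute the digest of a downloaded copy and compare it, which works even if the file has moved to another server.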
>>> I don't know how citations are being done now, nor do I know how they
>>> were done before when everyone was citing data that was almost
>>> impossible to get. DOIs are the very first step in the right
>>> direction, not the last one.
>>> IMHO the community should come up with some best practices to
>>> overcome the problem we are facing: how to cite something that's
>>> permanently changing. Sharing this will certainly help everyone.
>>> Before jumping away from this subject I'd also like to add that I
>>> don't see any proper communication mechanism in the community. All
>>> (or at least most) questions regarding CMIP5 are AFAICT directed to
>>> the help-desk, so mostly developers are trying to help the community
>>> instead of the community trying to help itself. I think we might be
>>> missing some kind of platform for doing this. We don't have the means
>>> to support the growing community (and new communities which we are
>>> now serving), we need them to help with the "helping". Just a thought....
>>>
>>> And last, and probably least, the only way to get the latest version
>>> of any dataset is by re-issuing the search. Especially since multiple
>>> datasets are referred to in a wget script, finding the latest
>>> versions of each of them "by hand" will be more time-consuming than
>>> issuing the search query again.
>>>
>>> Thanks,
>>> Estani
>>>
>>> On 10.01.2012 13:00, Kettleborough, Jamie wrote:
>>>> Hello,
>>>> I'm not sure how to say this, but I'm not sure it's just down to
>>>> DOIs to determine whether a data set should always be visible. I
>>>> think data needs to be visible where it's sufficiently important that
>>>> a user might want to download it, e.g. they want to check or extend
>>>> someone else's study (and I think there are other reasons). It's not
>>>> clear to me that all data of this kind will have a DOI - for
>>>> instance how many of the datasets referenced in papers being written
>>>> now for the summer deadline of AR5 have (or will have in time) DOIs?
>>>> I know it's tempting to say - any dataset referenced in a paper
>>>> should have a DOI. But I think you need to be realistic about the
>>>> prospects of this happening on the right timescales.
>>>> If the DOI is used as the determinant of whether data is always
>>>> visible then should users be made aware of the risk they are
>>>> carrying now? For instance, so they know to have local backups of
>>>> data that is really important to them. (With the possible
>>>> implication too that they may need to be prepared to 'reshare' this
>>>> data with others.)
>>>> For what it's worth my personal preference is with the BADC/DKRZ (and
>>>> I'm sure others) philosophy of keeping all versions - though I
>>>> realise there are costs in doing this, like getting DRSlib
>>>> sufficiently bug free and getting it to work in all the contexts it
>>>> needs to (hard links/soft links), getting it deployed, getting the
>>>> update mechanism in place for when new bugs are found etc. If you
>>>> used DRSlib doesn't Estani's use case that caused user grief become
>>>> easier too - the wget scripts do not need regenerating, you should
>>>> instead be able to replace the version strings in the URL (though I
>>>> may be assuming things about load balancing etc in saying this).
>>>> Jamie
>>>>
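Replacing the version strings in the URLs of an existing wget script might look like the sketch below. It assumes that the version appears as its own path segment of the form /v<digits>/ (as in the v20110601 paths quoted elsewhere in this thread) and that the rest of the path is unchanged between versions; the function names are made up for this example.

```python
import re

# Assumption: the DRS version is a path segment such as /v20110601/.
VERSION_SEGMENT = re.compile(r"/v\d+/")


def retarget_version(url, new_version):
    """Swap the first /v<digits>/ segment of a URL for new_version."""
    return VERSION_SEGMENT.sub(f"/{new_version}/", url, count=1)


def retarget_script(script_text, new_version):
    """Apply the same substitution to every URL in a wget script."""
    return "\n".join(
        retarget_version(line, new_version) for line in script_text.split("\n")
    )
```

This only helps where the server keeps old and new versions under parallel directory trees, which is the point being argued above.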
>>>> ------------------------------------------------------------------------
>>>> *From:* go-essp-tech-bounces at ucar.edu
>>>> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Estanislao
>>>> Gonzalez
>>>> *Sent:* 10 January 2012 10:21
>>>> *To:* Karl Taylor
>>>> *Cc:* Drach, Bob; go-essp-tech at ucar.edu; serguei.nikonov at noaa.gov
>>>> *Subject:* Re: [Go-essp-tech] Fwd: Re: Publishing dataset with
>>>> option --update
>>>>
>>>> Well to be honest I do agree this is a decision each institution
>>>> has to make, but for us I'd prefer offering everything we have
>>>> and let the systems decide what to do with this information.
>>>> I.e. I've used it to generate some comments (I might have
>>>> already shown you this), just go here:
>>>> http://ipcc-ar5.dkrz.de/dataset/cmip5.output1.NCC.NorESM1-M.sstClim.mon.land.Lmon.r1i1p1.html
>>>> and click on history.
>>>> That information could be generated only because we store the
>>>> metadata to the previous version.
>>>>
>>>> By the way, The only way of inhibiting the user from getting an
>>>> older version, if that's what it's wanted, is by either removing
>>>> the files from the TDS served directory, or changing the access
>>>> restriction at the Gateway. Because of a well-known TDS bug (or
>>>> feature) files present at that directory and not found in any
>>>> catalog are served without any restriction (AFAIK no certificate
>>>> is required for this). So, normally the wget script would work
>>>> even if the files were unpublished.
>>>>
>>>> It really depends on the use-case... but e.g. I had to explain
>>>> all this to a couple of people in the help-desk since the wget
>>>> script they'd downloaded wasn't working anymore (files were
>>>> removed). They weren't thrilled to know they had to reissue the
>>>> search (there's no workaround for this) and they wanted to
>>>> know what was changed in the new version, and that's where we
>>>> can't help our users any more since we don't have that
>>>> information...
>>>>
>>>> I don't know what our users prefer, but I think they have more
>>>> important problems to cope with at this time... if they could
>>>> reliably get one version they could start worrying about others.
>>>> From my perspective as a data manager, it's worth the tiny
>>>> additional effort, if there's any.
>>>>
>>>> Cheers,
>>>> Estani
>>>>
>>>> On 09.01.2012 20:05, Karl Taylor wrote:
>>>>> Hi Estani,
>>>>>
>>>>> I agree that a new version number should (I'd say must) be
>>>>> assigned when any changes are made. However, except for DOI
>>>>> datasets, most groups will not want older versions to be
>>>>> visible or downloadable.
>>>>>
>>>>> Do you agree?
>>>>>
>>>>> cheers,
>>>>> Karl
>>>>>
>>>>> On 1/9/12 10:37 AM, Estanislao Gonzalez wrote:
>>>>>> Hi Karl,
>>>>>>
>>>>>> It is indeed a good point, but I must add that we are not
>>>>>> talking about preserving a version (although we do it here at
>>>>>> DKRZ) but of signaling that a version has been changed. So the
>>>>>> version is a key to find a specific dataset which changes in time.
>>>>>>
>>>>>> Even before a DOI assignment I'd encourage all to create a new
>>>>>> version every time the dataset changes in any way.
>>>>>> Institutions have the right to preserve whatever version they
>>>>>> want (they may even delete DOI-assigned versions; archives, on
>>>>>> the other hand, can't, that's what archives are for).
>>>>>> But altering the dataset while preserving the version just brings
>>>>>> chaos for the users and for us at the help-desk as we have to
>>>>>> explain why something has changed (or rather answer that we
>>>>>> don't know why...). It means that the same key now points to a
>>>>>> different dataset.
>>>>>>
>>>>>> The only benefits I can see for preserving the same version are
>>>>>> that publishing using the same version seems to be easier to
>>>>>> some (for our workflow it's not, it's exactly the same) and
>>>>>> that if only new files are added this seems to work fine for
>>>>>> publication at both the data-node and the gateway as it's
>>>>>> properly supported.
>>>>>> If anything else changes, this does not work as expected
>>>>>> (wrong checksums, ghost files at the gateway, etc). And
>>>>>> changing a version's contents makes no sense to the user IMHO
>>>>>> (e.g. it's as if you might sometimes get more files from a
>>>>>> tarred file... how often should you extract it to be sure you
>>>>>> got "all of them"?)
>>>>>>
>>>>>> If old versions were preserved (which take almost no resources
>>>>>> if using hardlinks), a simple comparison would tell that the
>>>>>> only changes were the addition of some specific files.
>>>>>>
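A rough sketch of the hard-link approach and of the "simple comparison" mentioned above, assuming one flat directory per version on the same filesystem. The directory layout and function names are illustrative only and are not how DRSlib or the publisher actually do it.

```python
import os
import shutil


def link_new_version(old_dir, new_dir, added_files=()):
    """Create new_dir: hard-link reused files from old_dir, copy in new ones."""
    os.makedirs(new_dir)
    for name in sorted(os.listdir(old_dir)):
        # A hard link shares the data blocks, so it costs almost no disk space.
        os.link(os.path.join(old_dir, name), os.path.join(new_dir, name))
    for path in added_files:
        shutil.copy2(path, os.path.join(new_dir, os.path.basename(path)))


def compare_versions(old_dir, new_dir):
    """Report which file names were added, removed, or replaced."""
    old, new = set(os.listdir(old_dir)), set(os.listdir(new_dir))
    replaced = sorted(
        name for name in old & new
        if not os.path.samefile(os.path.join(old_dir, name),
                                os.path.join(new_dir, name))
    )
    return {"added": sorted(new - old),
            "removed": sorted(old - new),
            "replaced": replaced}
```

Here a file counts as "replaced" when the two versions no longer point at the same inode; comparing checksums would be the stricter alternative.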
>>>>>> Basically, reusing the version ends in a non-recoverable loss
>>>>>> of information. That's why I discourage it.
>>>>>>
>>>>>> My 2c,
>>>>>> Estani
>>>>>>
>>>>>> On 09.01.2012 17:25, Karl Taylor wrote:
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I do not have time to read this thoroughly, so perhaps what
>>>>>>> I'll mention here is irrelevant. There may be some
>>>>>>> miscommunication about what is meant by "version". There are
>>>>>>> two cases to consider:
>>>>>>>
>>>>>>> 1. Before a dataset has become official (i.e., assigned a
>>>>>>> DOI), a group may choose to remove all record of it from the
>>>>>>> database and publish a replacement version.
>>>>>>>
>>>>>>> 2. Alternatively, if a group wants to preserve a previous
>>>>>>> version (as is required after a DOI has been assigned), then
>>>>>>> the new version will not "replace" the previous version, but
>>>>>>> simply be added to the archive.
>>>>>>>
>>>>>>> It is possible that different publication procedures will
>>>>>>> apply in these different cases.
>>>>>>>
>>>>>>> best,
>>>>>>> Karl
>>>>>>>
>>>>>>> On 1/9/12 4:26 AM, Estanislao Gonzalez wrote:
>>>>>>>> Just to mention that we do the same thing. We directly use
>>>>>>>> --new-version and a map file containing all files for the new version,
>>>>>>>> but we do create hard-links to the files being reused, so they are
>>>>>>>> indeed all "new" as their paths always differ from those of previous
>>>>>>>> versions. (In any case for the publisher they are the same and thus
>>>>>>>> it encodes them with the nc_0 name if I recall correctly)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Estani
>>>>>>>> On 09.01.2012 12:15, stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>> Hi Bob,
>>>>>>>>>
>>>>>>>>> This "unpublish first" requirement is news to me. We've been publishing new versions without doing this for some time. Now, we have come across difficulties with a few datasets but it's generally worked.
>>>>>>>>>
>>>>>>>>> We don't use the --update option though. Each time we publish a new version we provide a mapfile of all files in the dataset(s). I'd recommend Sergey try doing this before removing a previous version.
>>>>>>>>>
>>>>>>>>> If you unpublish from the Gateway first you'll lose the information in the "History" tab. For instance http://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output2.MOHC.HadGEM2-ES.rcp85.mon.aerosol.aero.r1i1p1.html shows 2 versions.
>>>>>>>>>
>>>>>>>>> Stephen.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Stephen Pascoe +44 (0)1235 445980
>>>>>>>>> Centre of Environmental Data Archival
>>>>>>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From:go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Drach, Bob
>>>>>>>>> Sent: 06 January 2012 20:53
>>>>>>>>> To: Serguei Nikonov; Eric Nienhouse
>>>>>>>>> Cc:go-essp-tech at ucar.edu
>>>>>>>>> Subject: Re: [Go-essp-tech] Fwd: Re: Publishing dataset with option --update
>>>>>>>>>
>>>>>>>>> Hi Sergey,
>>>>>>>>>
>>>>>>>>> When updating a dataset, it's also important to unpublish it before publishing the new version. E.g., first run
>>>>>>>>>
>>>>>>>>> esgunpublish <dataset_id>
>>>>>>>>>
>>>>>>>>> The reason is that, when you publish to the gateway, the gateway software tries to *add* the new information to the existing dataset entry, rather than replace it.
>>>>>>>>>
>>>>>>>>> --Bob
>>>>>>>>> ________________________________________
>>>>>>>>> From: Serguei Nikonov [serguei.nikonov at noaa.gov]
>>>>>>>>> Sent: Friday, January 06, 2012 10:45 AM
>>>>>>>>> To: Eric Nienhouse
>>>>>>>>> Cc: Bob Drach;go-essp-tech at ucar.edu
>>>>>>>>> Subject: Re: [Go-essp-tech] Fwd: Re: Publishing dataset with option --update
>>>>>>>>>
>>>>>>>>> Hi Eric,
>>>>>>>>>
>>>>>>>>> thanks for your help. I have no objections to any adopted versioning
>>>>>>>>> policy. What I need is to know how to apply it. The ways I used did not work for
>>>>>>>>> me. Hopefully, the reason is the bad entries in thredds and the database that you pointed
>>>>>>>>> out. I am cleaning them right now, then will see...
>>>>>>>>>
>>>>>>>>> Just for clarification: if I need to update a dataset (changing its version), I
>>>>>>>>> create a map file containing the full set of files (old and new ones) and then use
>>>>>>>>> this map file in the esgpublish script with option --update, is that correct? Will it
>>>>>>>>> be enough for creating a dataset of the new version? BTW, there is nothing about
>>>>>>>>> version for option 'update' in esgpublish help.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sergey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 01/04/2012 04:27 PM, Eric Nienhouse wrote:
>>>>>>>>>> Hi Serguei,
>>>>>>>>>>
>>>>>>>>>> Following are a few more suggestions to diagnose this publishing issue. I agree
>>>>>>>>>> with others on this thread that adding new files (or changing existing ones)
>>>>>>>>>> should always trigger a new dataset version.
>>>>>>>>>>
>>>>>>>>>> It does not appear you are receiving a final "SUCCESS" or failure message when
>>>>>>>>>> publishing to the Gateway (with esgpublish --publish). Please try increasing
>>>>>>>>>> your "polling" levels in your $ESGINI file. Eg:
>>>>>>>>>>
>>>>>>>>>> hessian_service_polling_delay = 10
>>>>>>>>>> hessian_service_polling_iterations = 500
>>>>>>>>>>
>>>>>>>>>> You should see a final "SUCCESS" or "ERROR" with Java trace output at the
>>>>>>>>>> termination of the command.
>>>>>>>>>>
>>>>>>>>>> I've reviewed the Thredds catalog for the dataset you note below:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://esgdata.gfdl.noaa.gov/thredds/esgcet/1/cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.v2.xml
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There appear to be multiple instances of certain files within the catalog which
>>>>>>>>>> is a problem. The Gateway publish will fail if a particular file (URL) is
>>>>>>>>>> referenced multiple times with differing metadata. An example is:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> */gfdl_dataroot/NOAA-GFDL/GFDL-CM3/historical/mon/atmos/Amon/r1i1p1/v20110601/rtmt/rtmt_Amon_GFDL-CM3_historical_r1i1p1_186001-186412.nc
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This file appears as two separate file versions in the Thredds catalog (one with
>>>>>>>>>> id ending in ".nc" and another with ".nc_0"). There should be only one reference
>>>>>>>>>> to this file URL in the catalog.
>>>>>>>>>>
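A quick way to look for the kind of duplicate reference described above is to scan the catalog XML for repeated file URLs. The sketch below assumes that each file entry carries its URL in a urlPath attribute, as THREDDS catalogs generally do; the function name is made up here, and the check should be confirmed against the real catalog before relying on it.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_url_paths(catalog_xml):
    """Return the urlPath values that occur more than once in a catalog."""
    root = ET.fromstring(catalog_xml)
    # Walk every element regardless of namespace and count urlPath attributes.
    counts = Counter(
        element.get("urlPath")
        for element in root.iter()
        if element.get("urlPath")
    )
    return sorted(path for path, count in counts.items() if count > 1)
```

An empty result means every file URL is referenced exactly once; any path returned points at entries such as the ".nc" / ".nc_0" pair mentioned above.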
>>>>>>>>>> The previous version of the dataset in the publisher/node database may be
>>>>>>>>>> leading to this issue. You may need to add "--database-delete" to your
>>>>>>>>>> esgunpublish command to clean things up. Bob can advise on this. Note that the
>>>>>>>>>> original esgpublish command shown in this email thread included "--keep-version".
>>>>>>>>>>
>>>>>>>>>> After publishing to the Gateway successfully, you can check the dataset details
>>>>>>>>>> by URL with the published dataset identifier. For example:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope this helps.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> -Eric
>>>>>>>>>>
>>>>>>>>>> Serguei Nikonov wrote:
>>>>>>>>>>> Hi Bob,
>>>>>>>>>>>
>>>>>>>>>>> I still cannot do anything about updating datasets. The commands you
>>>>>>>>>>> suggested executed successfully but the datasets did not appear on the gateway. I
>>>>>>>>>>> tried it several times for different datasets but the result is the same.
>>>>>>>>>>>
>>>>>>>>>>> Do you have any idea what to do in such a situation?
>>>>>>>>>>>
>>>>>>>>>>> Here are some details about what I tried.
>>>>>>>>>>> I needed to add a file to dataset
>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.
>>>>>>>>>>> As you advised I unpublished it (esgunpublish
>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1) and then
>>>>>>>>>>> created a full mapfile (with the additional file) and then published it:
>>>>>>>>>>> esgpublish --read-files --map new_mapfile --project cmip5 --thredd --publish
>>>>>>>>>>>
>>>>>>>>>>> As I said, there were no errors. The dataset is in the database and in thredds but
>>>>>>>>>>> not in the gateway.
>>>>>>>>>>>
>>>>>>>>>>> The second way I tried is using a mapfile containing only the files to update. I
>>>>>>>>>>> needed to replace several existing files in a dataset with new ones. I created
>>>>>>>>>>> a mapfile with only the files to be replaced:
>>>>>>>>>>> esgscan_directory --read-files --project cmip5 -o mapfile.txt
>>>>>>>>>>> /data/CMIP5/output1/NOAA-GFDL/GFDL-ESM2M/historical/mon/ocean/Omon/r1i1p1/v20111206
>>>>>>>>>>>
>>>>>>>>>>> and then published it with the update option:
>>>>>>>>>>> esgpublish --update --map mapfile.txt --project cmip5 --thredd --publish.
>>>>>>>>>>>
>>>>>>>>>>> The result is the same as in the previous case - all things are fine locally but
>>>>>>>>>>> nothing happened on the gateway.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sergey
>>>>>>>>>>>
>>>>>>>>>>> -------- Original Message --------
>>>>>>>>>>> Subject: Re: [Go-essp-tech] Publishing dataset with option --update
>>>>>>>>>>> Date: Thu, 29 Dec 2011 11:02:05 -0500
>>>>>>>>>>> From: Serguei Nikonov<Serguei.Nikonov at noaa.gov>
>>>>>>>>>>> Organization: GFDL
>>>>>>>>>>> To: Drach, Bob<drach1 at llnl.gov>
>>>>>>>>>>> CC: Nathan Wilhelmi<wilhelmi at ucar.edu>, "Ganzberger, Michael"
>>>>>>>>>>> <Ganzberger1 at llnl.gov>,"go-essp-tech at ucar.edu"<go-essp-tech at ucar.edu>
>>>>>>>>>>>
>>>>>>>>>>> Hi Bob,
>>>>>>>>>>>
>>>>>>>>>>> I tried the 1st way you suggested and it worked partially - the dataset was
>>>>>>>>>>> created on the datanode with version 2 but it did not show up on the gateway. To make
>>>>>>>>>>> sure that it's not an accidental result I repeated it with another dataset, with
>>>>>>>>>>> the same result.
>>>>>>>>>>> Now I have 2 datasets on the datanode (visible in the thredds server) but they are
>>>>>>>>>>> absent on the gateway:
>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.v2
>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r2i1p1.v2.
>>>>>>>>>>>
>>>>>>>>>>> Does it make sense to repeat esgpublish with 'publish' option?
>>>>>>>>>>>
>>>>>>>>>>> Thanks and Happy New Year,
>>>>>>>>>>> Sergey
>>>>>>>>>>>
>>>>>>>>>>> On 12/21/2011 08:41 PM, Drach, Bob wrote:
>>>>>>>>>>>> Hi Sergey,
>>>>>>>>>>>>
>>>>>>>>>>>> The way I would recommend adding new files to an existing dataset is as
>>>>>>>>>>>> follows:
>>>>>>>>>>>>
>>>>>>>>>>>> - Unpublish the previous dataset from the gateway and thredds
>>>>>>>>>>>>
>>>>>>>>>>>> % esgunpublish
>>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1
>>>>>>>>>>>>
>>>>>>>>>>>> - Add the new files to the existing mapfile for the dataset they are being
>>>>>>>>>>>> added to.
>>>>>>>>>>>>
>>>>>>>>>>>> - Republish with the expanded mapfile:
>>>>>>>>>>>>
>>>>>>>>>>>> % esgpublish --read-files --map newmap.txt --project cmip5 --thredds
>>>>>>>>>>>> --publish
>>>>>>>>>>>>
>>>>>>>>>>>> The publisher will:
>>>>>>>>>>>> - not rescan existing files, only the new files
>>>>>>>>>>>> - create a new version to reflect the additional files
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Alternatively you can create a mapfile with *only* the new files (Using
>>>>>>>>>>>> esgscan_directory), then republish using the --update command.
>>>>>>>>>>>>
>>>>>>>>>>>> --Bob
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/21/11 8:40 AM, "Serguei Nikonov"<serguei.nikonov at noaa.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Nate,
>>>>>>>>>>>>>
>>>>>>>>>>>>> unfortunately this is not the only dataset I have a problem with - there are at
>>>>>>>>>>>>> least 5 more. Should I unpublish them locally (db, thredds) and then create a new
>>>>>>>>>>>>> version containing the full set of files? What is the official way to update
>>>>>>>>>>>>> a dataset?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Sergey
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/20/2011 07:06 PM, Nathan Wilhelmi wrote:
>>>>>>>>>>>>>> Hi Bob/Mike,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I believe the problem is that when files were added the timestamp on the
>>>>>>>>>>>>>> dataset wasn't updated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The triple store will only harvest datasets that have files and an updated
>>>>>>>>>>>>>> timestamp after the last harvest.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So what likely happened is the dataset was created without files, so it
>>>>>>>>>>>>>> wasn't initially harvested. Files were subsequently added, but the timestamp wasn't
>>>>>>>>>>>>>> updated, so it was still not a candidate for harvesting.
>>>>>>>>>>>>>>
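The harvesting rule described above can be written down as a small predicate. This is only a model of the stated rule, not the Gateway's actual code: the function and parameter names are invented here, with date_updated borrowed from the field named in this message.

```python
from datetime import datetime


def is_harvest_candidate(file_count, date_updated, last_harvest):
    """A dataset is harvested only if it has files AND changed since last harvest."""
    return file_count > 0 and date_updated > last_harvest


# Replaying the suspected sequence of events:
created = datetime(2011, 12, 20, 11, 0)
first_harvest = datetime(2011, 12, 20, 11, 10)
second_harvest = datetime(2011, 12, 20, 11, 20)

# 1. Dataset created without files: skipped because it is empty.
skipped_empty = not is_harvest_candidate(0, created, datetime(2011, 12, 20, 10, 50))
# 2. Files added later, but date_updated left unchanged: skipped as "not updated".
skipped_stale = not is_harvest_candidate(12, created, first_harvest)
# 3. After bumping date_updated, the dataset qualifies.
picked_up = is_harvest_candidate(12, datetime(2011, 12, 20, 11, 15), first_harvest)
```

Under this model the dataset never qualifies until its timestamp is touched, which matches the suggested fix.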
>>>>>>>>>>>>>> Can you update the date_updated timestamp for the dataset in question and
>>>>>>>>>>>>>> then trigger the RDF harvesting? I believe the dataset will show up then.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> -Nate
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12/20/2011 11:49 AM, Serguei Nikonov wrote:
>>>>>>>>>>>>>>> Hi Mike,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am a member of the data publishers group. I have been publishing a considerable
>>>>>>>>>>>>>>> amount of data without this kind of trouble; this one occurred only when I
>>>>>>>>>>>>>>> tried to add some files to an existing dataset. Publishing from scratch works
>>>>>>>>>>>>>>> fine for me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Sergey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/20/2011 01:29 PM, Ganzberger, Michael wrote:
>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That task is on a scheduler and will re-run every 10 minutes. If your data
>>>>>>>>>>>>>>>> does not appear after that time then perhaps there is another issue. One
>>>>>>>>>>>>>>>> issue could be that publishing to the gateway requires that you have the
>>>>>>>>>>>>>>>> role
>>>>>>>>>>>>>>>> of "Data Publisher";
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> "check that the account is member of the proper group and has the special
>>>>>>>>>>>>>>>> role of Data Publisher"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://esgf.org/wiki/ESGFNode/FAQ
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mike
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Serguei Nikonov [mailto:serguei.nikonov at noaa.gov]
>>>>>>>>>>>>>>>> Sent: Tuesday, December 20, 2011 10:12 AM
>>>>>>>>>>>>>>>> To: Ganzberger, Michael
>>>>>>>>>>>>>>>> Cc: Stéphane Senesi; Drach, Bob; go-essp-tech at ucar.edu
>>>>>>>>>>>>>>>> Subject: Re: [Go-essp-tech] Publishing dataset with option --update
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Mike,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> thanks for the suggestion but I don't have any privileges to do anything on
>>>>>>>>>>>>>>>> the gateway. I am just publishing data on the GFDL data node.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Sergey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 12/20/2011 01:05 PM, Ganzberger, Michael wrote:
>>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd like to suggest this that may help you from
>>>>>>>>>>>>>>>>> http://esgf.org/wiki/Cmip5Gateway/FAQ
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> "The search does not reflect the latest DB changes I've made
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You have to manually trigger the 3store harvesting. Logging as root and go
>>>>>>>>>>>>>>>>> to Admin->"Gateway Scheduled Tasks"->"Run tasks" and restart the job named
>>>>>>>>>>>>>>>>> RDFSynchronizationJobDetail"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mike Ganzberger
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu]
>>>>>>>>>>>>>>>>> On Behalf Of Stéphane Senesi
>>>>>>>>>>>>>>>>> Sent: Tuesday, December 20, 2011 9:42 AM
>>>>>>>>>>>>>>>>> To: Serguei Nikonov
>>>>>>>>>>>>>>>>> Cc: Drach, Bob;go-essp-tech at ucar.edu
>>>>>>>>>>>>>>>>> Subject: Re: [Go-essp-tech] Publishing dataset with option --update
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Serguei
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We have for some time now experienced similar problems when publishing
>>>>>>>>>>>>>>>>> to the PCMDI gateway, i.e. not getting a "SUCCESS" message when
>>>>>>>>>>>>>>>>> publishing. Sometimes, files are actually published (or at least
>>>>>>>>>>>>>>>>> accessible through the gateway, their status being actually
>>>>>>>>>>>>>>>>> "START_PUBLISHING", according to the esg_list_datasets report), sometimes not. A
>>>>>>>>>>>>>>>>> hypothesis is that the PCMDI Gateway load causes the problem. We
>>>>>>>>>>>>>>>>> haven't yet got a confirmation from Bob.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In contrast to your case, this happens when publishing a dataset from
>>>>>>>>>>>>>>>>> scratch (I mean, not an update)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best regards (do not expect any feedback from me until early January, though)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> S
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Serguei Nikonov wrote, On 20/12/2011 18:11:
>>>>>>>>>>>>>>>>>> Hi Bob,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I needed to add some missing variables to an existing dataset and I found an
>>>>>>>>>>>>>>>>>> --update option in the esgpublish command. When I tried it I got normal
>>>>>>>>>>>>>>>>>> messages like
>>>>>>>>>>>>>>>>>> INFO 2011-12-20 11:21:00,893 Publishing:
>>>>>>>>>>>>>>>>>> cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1, parent = pcmdi.GFDL
>>>>>>>>>>>>>>>>>> INFO 2011-12-20 11:21:07,564 Result: PROCESSING
>>>>>>>>>>>>>>>>>> INFO 2011-12-20 11:21:11,209 Result: PROCESSING
>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> but nothing happened on the gateway - the new variables are not there. The files
>>>>>>>>>>>>>>>>>> corresponding to these variables are in the database and in the THREDDS catalog
>>>>>>>>>>>>>>>>>> but apparently were not published on the gateway.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I used the command line
>>>>>>>>>>>>>>>>>> esgpublish --update --keep-version --map <map_file> --project cmip5 --noscan --publish.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Should the map file be of some specific format to make it work in the mode I
>>>>>>>>>>>>>>>>>> need?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Sergey Nikonov
>>>>>>>>>>>>>>>>>> GFDL
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>> --
>>>>>>>> Estanislao Gonzalez
>>>>>>>>
>>>>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>>>
>>>>>>>> Phone: +49 (40) 46 00 94-126
>>>>>>>> E-Mail:gonzalez at dkrz.de