[Go-essp-tech] Fwd: Re: Publishing dataset with option --update

martin.juckes at stfc.ac.uk
Tue Jan 10 01:49:12 MST 2012


Hi Estani, Karl,

I had one pertinent comment from a user: “if you don’t keep the data, why do you call it an archive?” As far as MOHC data is concerned, we are trying to preserve versions. So far this involves minimal storage overhead, as versions tend to differ by relatively modest numbers of files. It gives users the advantage that they can go back and look at data they may have used earlier, or data that was used in a result they are trying to replicate. I understand that most groups don’t like the overhead of keeping multiple versions, but the users definitely do like it.

Cheers,
Martin

From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
Sent: 09 January 2012 19:06
To: Estanislao Gonzalez
Cc: Drach, Bob; go-essp-tech at ucar.edu; serguei.nikonov at noaa.gov
Subject: Re: [Go-essp-tech] Fwd: Re: Publishing dataset with option --update

 Hi Estani,

I agree that a new version number should (I'd say must) be assigned when any changes are made.  However, except for DOI datasets, most groups will not want older versions to be visible or downloadable.

Do you agree?

cheers,
Karl

On 1/9/12 10:37 AM, Estanislao Gonzalez wrote:
Hi Karl,

It is indeed a good point, but I must add that we are not talking about preserving a version (although we do that here at DKRZ) but about signaling that a version has changed. So the version is a key for finding a specific state of a dataset that changes over time.

Even before a DOI assignment I'd encourage everyone to create a new version every time the dataset changes in any way. Institutions have the right to preserve whatever versions they want (they may even delete DOI-assigned versions; archives, on the other hand, can't, as that's what archives are for).
But altering a dataset while preserving the version number just brings chaos for the users and for us at the help-desk, as we have to explain why something has changed (or rather answer that we don't know why...). It means that the same key now points to a different dataset.

The only benefits I can see in preserving the same version are that publishing with the same version seems easier to some (for our workflow it's not; it's exactly the same), and that if only new files are added this seems to work fine for publication at both the data node and the gateway, as it's properly supported.
If anything else changes, this does not work as expected (wrong checksums, ghost files at the gateway, etc.). And changing a version's contents makes no sense to the user, IMHO (e.g. it's as if you might sometimes get more files from a tarred file... how often should you extract it to be sure you got "all of them"?)

If old versions were preserved (which takes almost no resources if using hardlinks), a simple comparison would show that the only changes were the addition of some specific files.

Basically, reusing the version ends in a non-recoverable loss of information. That's why I discourage it.
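The hard-link approach Estani mentions can be sketched in a few commands. This is a minimal illustration, assuming a GNU system (`cp -al`) and a per-version directory layout; the paths and file names are invented for the example:

```shell
# Each version is a directory; unchanged files are hard-linked, so a new
# version costs almost no extra disk space.
mkdir -p archive/v20120101
echo "data-A" > archive/v20120101/tas_Amon.nc

# New version: hard-link everything from the previous version (GNU cp),
# then add or replace only what changed.
cp -al archive/v20120101 archive/v20120109
echo "data-B" > archive/v20120109/pr_Amon.nc

# The shared file occupies disk space once (same inode in both versions),
# and comparing the two directories shows exactly what changed.
ls -i archive/v20120101 archive/v20120109
```

Because both version directories remain browsable, a user can always reconstruct which files a given version contained, which is exactly the information that is lost when a version number is reused.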

My 2c,
Estani

On 09.01.2012 17:25, Karl Taylor wrote:
Dear all,

I do not have time to read this thoroughly, so perhaps what I'll mention here is irrelevant.  There may be some miscommunication about what is meant by "version".  There are two cases to consider:

1.  Before a dataset has become official (i.e., assigned a DOI), a group may choose to remove all record of it from the database and publish a replacement version.

2.  Alternatively, if a group wants to preserve a previous version (as  is required after a DOI has been assigned), then the new version will not "replace" the previous version, but simply be added to the archive.

It is possible that different publication procedures will apply in these different cases.

best,
Karl

On 1/9/12 4:26 AM, Estanislao Gonzalez wrote:

Just to mention that we do the same thing. We use --new-version directly and a map file containing all files for the new version, but we do create hard-links to the files being reused, so they are indeed all "new", as their paths always differ from those of previous versions. (In any case, for the publisher they are the same and are thus encoded with the nc_0 name, if I recall correctly.)

Thanks,
Estani

On 09.01.2012 12:15, stephen.pascoe at stfc.ac.uk wrote:

Hi Bob,



This "unpublish first" requirement is news to me. We've been publishing new versions without doing this for some time. We have come across difficulties with a few datasets, but it has generally worked.

We don't use the --update option, though. Each time we publish a new version we provide a mapfile of all files in the dataset(s). I'd recommend Sergey try doing this before removing a previous version.

If you unpublish from the Gateway first you'll lose the information in the "History" tab. For instance, http://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output2.MOHC.HadGEM2-ES.rcp85.mon.aerosol.aero.r1i1p1.html shows 2 versions.
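For readers unfamiliar with mapfiles: a mapfile is a plain-text listing with one file per line. The exact fields vary between publisher versions, so treat the layout below as an illustration rather than a specification (the dataset id echoes one from this thread; the path, size, and mod_time values are invented):

```shell
# Illustrative mapfile: "dataset_id | file path | size | optional metadata".
# The field layout is an assumption based on typical esgcet mapfiles;
# check your publisher's documentation for the authoritative format.
cat > mapfile.txt <<'EOF'
cmip5.output2.MOHC.HadGEM2-ES.rcp85.mon.aerosol.aero.r1i1p1 | /data/example/aero/rtmt_example.nc | 123456 | mod_time=1325678901.0
EOF

# Count the pipe-separated fields per line (4 in this sketch).
awk -F'|' '{print NF}' mapfile.txt
```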



Stephen.



---
Stephen Pascoe  +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK





-----Original Message-----

From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Drach, Bob
Sent: 06 January 2012 20:53
To: Serguei Nikonov; Eric Nienhouse
Cc: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Fwd: Re: Publishing dataset with option --update



Hi Sergey,



When updating a dataset, it's also important to unpublish it before publishing the new version. E.g., first run:

esgunpublish <dataset_id>

The reason is that, when you publish to the gateway, the gateway software tries to *add* the new information to the existing dataset entry, rather than replace it.

--Bob

________________________________________

From: Serguei Nikonov [serguei.nikonov at noaa.gov]
Sent: Friday, January 06, 2012 10:45 AM
To: Eric Nienhouse
Cc: Bob Drach; go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Fwd: Re: Publishing dataset with option --update



Hi Eric,



thanks for your help. I have no objections to the adopted versioning policy; what I need is to know how to apply it. The ways I used did not work for me. Hopefully the reason is the bad entries in thredds and the database that you pointed out. I am cleaning them up right now, then will see...

Just for clarification: if I need to update a dataset (with a version change), I create a map file containing the full set of files (old and new ones) and then use this map file with esgpublish and the --update option. Is that correct? Will it be enough to create a new version of the dataset? BTW, there is nothing about versions under the 'update' option in the esgpublish help.

Thanks,
Sergey







On 01/04/2012 04:27 PM, Eric Nienhouse wrote:

Hi Serguei,



Following are a few more suggestions to diagnose this publishing issue. I agree with others on this thread that adding new files (or changing existing ones) should always trigger a new dataset version.

It does not appear you are receiving a final "SUCCESS" or failure message when publishing to the Gateway (with esgpublish --publish). Please try increasing your "polling" levels in your $ESGINI file. E.g.:

hessian_service_polling_delay = 10
hessian_service_polling_iterations = 500

You should see a final "SUCCESS" or "ERROR" with Java trace output at the termination of the command.



I've reviewed the Thredds catalog for the dataset you note below:

http://esgdata.gfdl.noaa.gov/thredds/esgcet/1/cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.v2.xml

There appear to be multiple instances of certain files within the catalog, which is a problem. The Gateway publish will fail if a particular file (URL) is referenced multiple times with differing metadata. An example is:

*/gfdl_dataroot/NOAA-GFDL/GFDL-CM3/historical/mon/atmos/Amon/r1i1p1/v20110601/rtmt/rtmt_Amon_GFDL-CM3_historical_r1i1p1_186001-186412.nc

This file appears as two separate file versions in the Thredds catalog (one with an id ending in ".nc" and another with ".nc_0"). There should be only one reference to this file URL in the catalog.
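One quick way to spot such duplicates is to list every file URL in the catalog and keep only the repeated ones. A sketch, assuming the catalog references files via a urlPath attribute as THREDDS catalogs normally do (catalog.xml is a placeholder name for the downloaded catalog):

```shell
# Print any urlPath value that occurs more than once in a THREDDS catalog.
grep -o 'urlPath="[^"]*"' catalog.xml | sort | uniq -d
```

An empty output means every file URL is referenced exactly once, which is the state the Gateway publish expects.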



The previous version of the dataset in the publisher/node database may be leading to this issue. You may need to add "--database-delete" to your esgunpublish command to clean things up. Bob can advise on this. Note that the original esgpublish command shown in this email thread included "--keep-version".

After publishing to the Gateway successfully, you can check the dataset details by URL with the published dataset identifier. For example:

http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.html

I hope this helps.

Regards,

-Eric



Serguei Nikonov wrote:

Hi Bob,



I still cannot do anything about updating datasets. The commands you suggested executed successfully, but the datasets did not appear on the gateway. I tried several times for different datasets, but the result is the same.

Do you have any idea what to try in this situation?

Here are some details about what I tried. I needed to add a file to dataset cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1. As you advised, I unpublished it (esgunpublish cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1), then created a full mapfile (with the additional file) and published it:

esgpublish --read-files --map new_mapfile --project cmip5 --thredds --publish

As I said, there were no errors. The dataset is in the database and in thredds, but not on the gateway.

The second way I tried was using a mapfile containing only the files to update. I needed to substitute new files for several existing files in the dataset, so I created a mapfile with only the files to be substituted:

esgscan_directory --read-files --project cmip5 -o mapfile.txt /data/CMIP5/output1/NOAA-GFDL/GFDL-ESM2M/historical/mon/ocean/Omon/r1i1p1/v20111206

and then published it with the update option:

esgpublish --update --map mapfile.txt --project cmip5 --thredds --publish

The result is the same as in the previous case - everything is fine locally, but nothing happened on the gateway.

Thanks,
Sergey



-------- Original Message --------
Subject: Re: [Go-essp-tech] Publishing dataset with option --update
Date: Thu, 29 Dec 2011 11:02:05 -0500
From: Serguei Nikonov <Serguei.Nikonov at noaa.gov>
Organization: GFDL
To: Drach, Bob <drach1 at llnl.gov>
CC: Nathan Wilhelmi <wilhelmi at ucar.edu>, "Ganzberger, Michael" <Ganzberger1 at llnl.gov>, "go-essp-tech at ucar.edu" <go-essp-tech at ucar.edu>



Hi Bob,



I tried the first way you suggested and it worked partially - the dataset was created on the data node with version 2, but it did not show up on the gateway. To make sure this was not a one-off result, I repeated it with other datasets, with the same outcome.
Now I have 2 datasets on the data node (visible in the THREDDS server) that are absent on the gateway:
cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.v2
cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r2i1p1.v2

Does it make sense to repeat esgpublish with the 'publish' option?

Thanks and Happy New Year,
Sergey



On 12/21/2011 08:41 PM, Drach, Bob wrote:

Hi Sergey,



The way I would recommend adding new files to an existing dataset is as follows:

- Unpublish the previous dataset from the gateway and thredds:

% esgunpublish cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1

- Add the new files to the existing mapfile for the dataset they are being added to.

- Republish with the expanded mapfile:

% esgpublish --read-files --map newmap.txt --project cmip5 --thredds --publish

The publisher will:
- not rescan existing files, only the new files
- create a new version to reflect the additional files

Alternatively, you can create a mapfile with *only* the new files (using esgscan_directory), then republish using the --update option.

--Bob





On 12/21/11 8:40 AM, "Serguei Nikonov" <serguei.nikonov at noaa.gov> wrote:



Hi Nate,



unfortunately this is not the only dataset I have a problem with - there are at least 5 more. Should I unpublish them locally (db, thredds) and then create a new version containing the full set of files? What is the official way to update a dataset?

Thanks,
Sergey





On 12/20/2011 07:06 PM, Nathan Wilhelmi wrote:

Hi Bob/Mike,



I believe the problem is that when the files were added, the timestamp on the dataset wasn't updated.

The triple store will only harvest datasets that have files and an updated timestamp after the last harvest.

So what likely happened is that the dataset was created without files, so it wasn't initially harvested. Files were subsequently added, but the timestamp wasn't updated, so it was still not a candidate for harvesting.

Can you update the date_updated timestamp for the dataset in question and then trigger the RDF harvesting? I believe the dataset will show up then.

Thanks!
-Nate
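In database terms, the fix Nate describes might look like the fragment below. This is only a sketch: the table and column names (dataset, date_updated, name) are guesses for illustration, not the actual gateway schema, so check the schema before running anything like this:

```sql
-- Hypothetical: bump the dataset's update timestamp so the next RDF
-- harvest run considers it again (all identifiers here are assumed).
UPDATE dataset
   SET date_updated = NOW()
 WHERE name = 'cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1';
```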



On 12/20/2011 11:49 AM, Serguei Nikonov wrote:

Hi Mike,



I am a member of the data publishers group. I have been publishing a considerable amount of data without this kind of trouble, but it occurred only when I tried to add some files to an existing dataset. Publishing from scratch works fine for me.

Thanks,
Sergey



On 12/20/2011 01:29 PM, Ganzberger, Michael wrote:

Hi Serguei,



That task is on a scheduler and will re-run every 10 minutes. If your data does not appear after that time, then perhaps there is another issue. One possibility is that publishing to the gateway requires that you have the role of "Data Publisher":

"check that the account is member of the proper group and has the special role of Data Publisher"

http://esgf.org/wiki/ESGFNode/FAQ



Mike





-----Original Message-----

From: Serguei Nikonov [mailto:serguei.nikonov at noaa.gov]
Sent: Tuesday, December 20, 2011 10:12 AM
To: Ganzberger, Michael
Cc: Stéphane Senesi; Drach, Bob; go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Publishing dataset with option --update



Hi Mike,



thanks for the suggestion, but I don't have any privileges to do anything on the gateway. I am just publishing data on the GFDL data node.



Regards,

Sergey



On 12/20/2011 01:05 PM, Ganzberger, Michael wrote:

Hi Serguei,



I'd like to suggest this, which may help you, from http://esgf.org/wiki/Cmip5Gateway/FAQ:

"The search does not reflect the latest DB changes I've made

You have to manually trigger the 3store harvesting. Log in as root, go to Admin -> "Gateway Scheduled Tasks" -> "Run tasks", and restart the job named RDFSynchronizationJobDetail"



Mike Ganzberger











-----Original Message-----
From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Stéphane Senesi
Sent: Tuesday, December 20, 2011 9:42 AM
To: Serguei Nikonov
Cc: Drach, Bob; go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Publishing dataset with option --update



Serguei



We have for some time now experienced similar problems when publishing to the PCMDI gateway, i.e. not getting a "SUCCESS" message when publishing. Sometimes the files are actually published (or at least accessible through the gateway, their status being "START_PUBLISHING" according to an esg_list_datasets report), sometimes not. One hypothesis is that the load on the PCMDI Gateway generates the problem. We haven't yet had confirmation from Bob.

In contrast to your case, this happens when publishing a dataset from scratch (I mean, not an update).

Best regards (do not expect any feedback from me before early January, though)

S





Serguei Nikonov wrote, On 20/12/2011 18:11:

Hi Bob,

I needed to add some missing variables to an existing dataset, and I found the --update option of the esgpublish command. When I tried it I got normal messages like:

INFO 2011-12-20 11:21:00,893 Publishing: cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1, parent = pcmdi.GFDL
INFO 2011-12-20 11:21:07,564 Result: PROCESSING
INFO 2011-12-20 11:21:11,209 Result: PROCESSING
....

but nothing happened on the gateway - the new variables are not there. The files corresponding to these variables are in the database and in the THREDDS catalog, but apparently were not published on the gateway.

I used the command line:

esgpublish --update --keep-version --map <map_file> --project cmip5 --noscan --publish

Should the map file be of some specific format to make it work in the mode I need?

Thanks,
Sergey Nikonov
GFDL





_______________________________________________
GO-ESSP-TECH mailing list
GO-ESSP-TECH at ucar.edu
http://mailman.ucar.edu/mailman/listinfo/go-essp-tech






--
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de






