[Go-essp-tech] Visibility of old versions, was... RE: Fwd: Re: Publishing dataset with option --update

Cinquini, Luca (3880) Luca.Cinquini at jpl.nasa.gov
Wed Jan 11 05:18:38 MST 2012


Hi Michael,
        sorry if I should know this already, but how can we access the DOI information for a given dataset? The goal is, of course, to enable search on DOIs in the P2P system.
thanks, Luca

On Jan 11, 2012, at 1:37 AM, Michael Lautenschlager wrote:

> Hi Karl,
> even with respect to the IPCC DDC, I think we have to keep at least the
> most recent version of CMIP5 and those versions which ran through QC-L3
> with assignment of a DOI and citation reference. At least we at WDCC/DKRZ
> are under contract with DataCite to keep these published data entries
> available forever, in the sense of common library time scales. The number
> and location of replicas can be decided within CMIP5 and does not matter
> to DataCite, but we have to ensure identical copies if we link to them
> from the DOI landing page.
>
> These DataCite-published CMIP5 data entities may form the GCM data basis
> of the IPCC DDC because they are stable, quality-checked, accessible at
> any time, and have a citation reference. So these data entities can be
> traced back in the scientific literature, provided the citation
> references are used there. But I agree we have to discuss this with the
> IPCC DDC people.
>
> Best wishes, Michael
>
> ---------------
> Dr. Michael Lautenschlager
> Head of DKRZ Department Data Management
> Director World Data Center Climate
>
> German Climate Computing Centre (DKRZ)
> ADDRESS: Bundesstrasse 45a, D-20146 Hamburg, Germany
> PHONE:   +4940-460094-118
> E-Mail:  lautenschlager at dkrz.de
>
> URL:    http://www.dkrz.de/
>         http://www.wdc-climate.de/
>
>
> Geschäftsführer: Prof. Dr. Thomas Ludwig
> Sitz der Gesellschaft: Hamburg
> Amtsgericht Hamburg HRB 39784
>
>
> On 10.01.2012 16:40, Karl Taylor wrote:
>> Hi all,
>>
>> thanks for the good discussion. Some good arguments have been made for
>> keeping all versions. I'll not make a policy decision immediately, but
>> am tending toward strong encouragement to keep all versions. I'll
>> distribute a draft statement about this for your input and comment
>> before posting. I'll, of course, also consult directly with other IPCC
>> DDC folks.
>>
>> Best regards,
>> Karl
>>
>> On 1/10/12 5:31 AM, Estanislao Gonzalez wrote:
>>> Hi Jamie,
>>>
>>> Indeed, DOIs are not going to solve everything. A DOI is analogous
>>> to the ISBN of a book, and citing the whole book is not always what
>>> you want. To continue the analogy, users are indeed working with
>>> pre-prints which get corrected all the time (i.e. the CMIP5 archive
>>> is in flux). People are writing papers citing a pre-print. Of course
>>> this makes no sense, but they are not doing so willingly; they have
>>> to, as the deadline approaches and the computing groups are not
>>> ready.
>>>
>>> So what do we have now? Some archives with a strong commitment to
>>> preserving data.
>>> If the DRS were honored, the URL would be enough for citing any file,
>>> as it has the version in it. Citing 1000+ URLs is indeed not
>>> practical, but a redirection could be added so that the scientist
>>> cites one URL under which all file URLs are listed (there's no
>>> implementation for this AFAIK). But at least the URLs of
>>> DRS-committed sites could be safely cited, and if the checksum is
>>> attached to the citation, it is certain that the correct file is
>>> always being cited (and it could even be found if moved).
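>>>
>>> Such a checksum can be computed with a few lines of Python (a minimal
>>> sketch; the archive's own tooling may differ):

```python
import hashlib

def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Hex digest of a file, streamed in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()
```

>>> Attaching the digest to the cited URL lets a reader verify that the
>>> file they retrieved is exactly the file that was cited.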
>>>
>>> I don't know how citations are being done now, nor do I know how they
>>> were done before, when everyone was citing data that was almost
>>> impossible to get. DOIs are the very first step in the right
>>> direction, not the last one.
>>> IMHO the community should come up with some best practices to
>>> overcome the problem we are facing: how to cite something that is
>>> permanently changing. Sharing this will certainly help everyone.
>>> Before leaving this subject I'd also like to add that I don't see any
>>> proper communication mechanism in the community. All (or at least
>>> most) questions regarding CMIP5 are AFAICT directed to the help-desk,
>>> so mostly developers are trying to help the community instead of the
>>> community trying to help itself. I think we might be missing some
>>> kind of platform for this. We don't have the means to support the
>>> growing community (and the new communities we are now serving); we
>>> need them to help with the "helping". Just a thought...
>>>
>>> And last, and probably least, the only way to get the latest version
>>> of any dataset is by re-issuing the search. Especially since multiple
>>> datasets are referred to in a wget script, finding the latest version
>>> of each of them by hand would be more time-consuming than issuing the
>>> search query again.
>>>
>>> Thanks,
>>> Estani
>>>
>>>> On 10.01.2012 13:00, Kettleborough, Jamie wrote:
>>>> Hello,
>>>> I'm not sure how to say this, but I'm not sure it's just down to
>>>> DOIs to determine whether a data set should always be visible. I
>>>> think data needs to be visible where it's sufficiently important
>>>> that a user might want to download it, e.g. they want to check or
>>>> extend someone else's study (and I think there are other reasons).
>>>> It's not clear to me that all data of this kind will have a DOI;
>>>> for instance, how many of the datasets referenced in papers being
>>>> written now for the summer deadline of AR5 have (or will have in
>>>> time) DOIs? I know it's tempting to say that any dataset referenced
>>>> in a paper should have a DOI, but I think you need to be realistic
>>>> about the prospects of this happening on the right timescales.
>>>> If the DOI is used as the determinant of whether data is always
>>>> visible, then should users be made aware of the risk they are
>>>> carrying now? For instance, so they know to keep local backups of
>>>> data that is really important to them (with the possible
>>>> implication, too, that they may need to be prepared to 're-share'
>>>> this data with others).
>>>> For what it's worth, my personal preference is for the BADC/DKRZ
>>>> (and I'm sure others') philosophy of keeping all versions, though I
>>>> realise there are costs in doing this, like getting DRSlib
>>>> sufficiently bug-free, getting it to work in all the contexts it
>>>> needs to (hard links/soft links), getting it deployed, getting the
>>>> update mechanism in place for when new bugs are found, etc. If you
>>>> used DRSlib, wouldn't Estani's use case that caused user grief
>>>> become easier too? The wget scripts would not need regenerating;
>>>> you should instead be able to replace the version strings in the
>>>> URLs (though I may be assuming things about load balancing etc. in
>>>> saying this).
>>>> Jamie
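>>>>
>>>> Replacing those version strings could be sketched in a few lines of
>>>> Python (hypothetical: the pattern assumes DRS-style URLs with a
>>>> "/vNNN/" path component, and the example URL is made up):

```python
import re

# DRS paths embed the dataset version as a path component such as
# "/v20110601/" or "/v2/"; this pattern is an assumption about the URLs.
VERSION_RE = re.compile(r"/v\d+/")

def bump_version(url, new_version):
    """Point a DRS-style download URL at a different dataset version."""
    return VERSION_RE.sub("/" + new_version + "/", url)

url = ("http://datanode.example.org/thredds/fileServer/data/CMIP5/output1/"
       "NOAA-GFDL/GFDL-CM3/historical/mon/atmos/Amon/r1i1p1/v20110601/"
       "rtmt/rtmt_Amon_GFDL-CM3_historical_r1i1p1_186001-186412.nc")
print(bump_version(url, "v20111206"))
```

>>>> This only works while every site keeps old and new versions at
>>>> predictable DRS paths, which is exactly the policy under discussion.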
>>>>
>>>>    ------------------------------------------------------------------------
>>>>    *From:* go-essp-tech-bounces at ucar.edu
>>>>    [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Estanislao
>>>>    Gonzalez
>>>>    *Sent:* 10 January 2012 10:21
>>>>    *To:* Karl Taylor
>>>>    *Cc:* Drach, Bob; go-essp-tech at ucar.edu; serguei.nikonov at noaa.gov
>>>>    *Subject:* Re: [Go-essp-tech] Fwd: Re: Publishing dataset with
>>>>    option --update
>>>>
>>>>    Well, to be honest, I do agree this is a decision each
>>>>    institution has to make, but for us I'd prefer offering
>>>>    everything we have and letting the systems decide what to do
>>>>    with this information. E.g. I've used it to generate some
>>>>    comments (I might have already shown you this): just go here:
>>>>    http://ipcc-ar5.dkrz.de/dataset/cmip5.output1.NCC.NorESM1-M.sstClim.mon.land.Lmon.r1i1p1.html
>>>>    and click on history.
>>>>    That information could be generated only because we store the
>>>>    metadata of the previous version.
>>>>
>>>>    By the way, the only way of preventing the user from getting an
>>>>    older version, if that is what is wanted, is by either removing
>>>>    the files from the TDS-served directory or changing the access
>>>>    restriction at the Gateway. Because of a well-known TDS bug (or
>>>>    feature), files present in that directory but not found in any
>>>>    catalog are served without any restriction (AFAIK no certificate
>>>>    is required for this). So, normally, the wget script would work
>>>>    even if the files were unpublished.
>>>>
>>>>    It really depends on the use-case... but e.g. I had to explain
>>>>    all this to a couple of people at the help-desk because the wget
>>>>    script they had downloaded wasn't working anymore (files were
>>>>    removed). They weren't thrilled to learn they had to re-issue
>>>>    the search (there's no workaround for this), and they wanted to
>>>>    know what had changed in the new version; that's where we can't
>>>>    help our users any more, since we don't have that
>>>>    information...
>>>>
>>>>    I don't know what our users prefer, but I think they have more
>>>>    important problems to cope with at this time... if they could
>>>>    reliably get one version they could start worrying about others.
>>>>    From my perspective as a data manager, it's worth the tiny
>>>>    additional effort, if there's any.
>>>>
>>>>    Cheers,
>>>>    Estani
>>>>
>>>>    On 09.01.2012 20:05, Karl Taylor wrote:
>>>>>    Hi Estani,
>>>>>
>>>>>    I agree that a new version number should (I'd say must) be
>>>>>    assigned when any changes are made. However, except for DOI
>>>>>    datasets, most groups will not want older versions to be
>>>>>    visible or downloadable.
>>>>>
>>>>>    Do you agree?
>>>>>
>>>>>    cheers,
>>>>>    Karl
>>>>>
>>>>>    On 1/9/12 10:37 AM, Estanislao Gonzalez wrote:
>>>>>>    Hi Karl,
>>>>>>
>>>>>>    It is indeed a good point, but I must add that we are not
>>>>>>    talking about preserving a version (although we do that here
>>>>>>    at DKRZ) but about signaling that a version has changed. So
>>>>>>    the version is a key to finding a specific dataset which
>>>>>>    changes over time.
>>>>>>
>>>>>>    Even before a DOI assignment I'd encourage everyone to create
>>>>>>    a new version every time the dataset changes in any way.
>>>>>>    Institutions have the right to preserve whatever versions they
>>>>>>    want (they may even delete DOI-assigned versions; archives, on
>>>>>>    the other hand, can't, which is what archives are for).
>>>>>>    But altering the dataset while preserving the version number
>>>>>>    just brings chaos for the users and for us at the help-desk,
>>>>>>    as we have to explain why something has changed (or rather
>>>>>>    answer that we don't know why...). It means that the same key
>>>>>>    now points to a different dataset.
>>>>>>
>>>>>>    The only benefits I can see in preserving the same version are
>>>>>>    that publishing under the same version seems easier to some
>>>>>>    (for our workflow it's not; it's exactly the same), and that
>>>>>>    if only new files are added, this seems to work fine for
>>>>>>    publication at both the data-node and the gateway, as it's
>>>>>>    properly supported.
>>>>>>    If anything else changes, this does not work as expected
>>>>>>    (wrong checksums, ghost files at the gateway, etc.), and
>>>>>>    changing a version's contents makes no sense to the user IMHO
>>>>>>    (e.g. it's as if you might sometimes get more files from a
>>>>>>    tarred file... how often should you extract it to be sure you
>>>>>>    got "all of them"?).
>>>>>>
>>>>>>    If old versions were preserved (which takes almost no
>>>>>>    resources if using hard links), a simple comparison would show
>>>>>>    that the only changes were the addition of some specific
>>>>>>    files.
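>>>>>>
>>>>>>    Creating such a hard-linked version directory is cheap; a
>>>>>>    minimal Python sketch (hypothetical paths, and assuming a flat
>>>>>>    directory of files) might look like:

```python
import os

def snapshot_version(old_dir, new_dir):
    """Create new_dir whose files are hard links to those in old_dir.

    Both directories then share the same inodes, so the snapshot uses
    almost no extra disk space; changed files can later be replaced in
    new_dir without touching the old version. Assumes old_dir contains
    only regular files (no subdirectories).
    """
    os.makedirs(new_dir)
    for name in os.listdir(old_dir):
        os.link(os.path.join(old_dir, name), os.path.join(new_dir, name))
```

>>>>>>    A plain diff (or checksum comparison) of the two directories
>>>>>>    then shows exactly which files were added or replaced.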
>>>>>>
>>>>>>    Basically, reusing the version ends in a non-recoverable loss
>>>>>>    of information. That's why I discourage it.
>>>>>>
>>>>>>    My 2c,
>>>>>>    Estani
>>>>>>
>>>>>>    On 09.01.2012 17:25, Karl Taylor wrote:
>>>>>>>    Dear all,
>>>>>>>
>>>>>>>    I do not have time to read this thoroughly, so perhaps what
>>>>>>>    I'll mention here is irrelevant. There may be some
>>>>>>>    miscommunication about what is meant by "version". There are
>>>>>>>    two cases to consider:
>>>>>>>
>>>>>>>    1. Before a dataset has become official (i.e., assigned a
>>>>>>>    DOI), a group may choose to remove all record of it from the
>>>>>>>    database and publish a replacement version.
>>>>>>>
>>>>>>>    2. Alternatively, if a group wants to preserve a previous
>>>>>>>    version (as is required after a DOI has been assigned), then
>>>>>>>    the new version will not "replace" the previous version, but
>>>>>>>    simply be added to the archive.
>>>>>>>
>>>>>>>    It is possible that different publication procedures will
>>>>>>>    apply in these different cases.
>>>>>>>
>>>>>>>    best,
>>>>>>>    Karl
>>>>>>>
>>>>>>>    On 1/9/12 4:26 AM, Estanislao Gonzalez wrote:
>>>>>>>>    Just to mention that we do the same thing. We use
>>>>>>>>    --new-version directly and a map file containing all files for the
>>>>>>>>    new version, but we do create hard links to the files being reused,
>>>>>>>>    so they are indeed all "new", as their paths always differ from
>>>>>>>>    those of previous versions. (In any case, to the publisher they are
>>>>>>>>    the same, and it thus encodes them with the nc_0 name, if I recall
>>>>>>>>    correctly.)
>>>>>>>>
>>>>>>>>    Thanks,
>>>>>>>>    Estani
>>>>>>>>    On 09.01.2012 12:15, stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>    Hi Bob,
>>>>>>>>>
>>>>>>>>>    This "unpublish first" requirement is news to me.  We've been publishing new versions without doing this for some time.  Now, we have come across difficulties with a few datasets but it's generally worked.
>>>>>>>>>
>>>>>>>>>    We don't use the --update option though.  Each time we publish a new version we provide a mapfile of all files in the dataset(s).  I'd recommend Sergey try doing this before removing a previous version.
>>>>>>>>>
>>>>>>>>>    If you unpublish from the Gateway first you'll loose the information in the "History" tab.  For instancehttp://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output2.MOHC.HadGEM2-ES.rcp85.mon.aerosol.aero.r1i1p1.html  shows 2 versions.
>>>>>>>>>
>>>>>>>>>    Stephen.
>>>>>>>>>
>>>>>>>>>    ---
>>>>>>>>>    Stephen Pascoe  +44 (0)1235 445980
>>>>>>>>>    Centre of Environmental Data Archival
>>>>>>>>>    STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    -----Original Message-----
>>>>>>>>>    From: go-essp-tech-bounces at ucar.edu  [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Drach, Bob
>>>>>>>>>    Sent: 06 January 2012 20:53
>>>>>>>>>    To: Serguei Nikonov; Eric Nienhouse
>>>>>>>>>    Cc: go-essp-tech at ucar.edu
>>>>>>>>>    Subject: Re: [Go-essp-tech] Fwd: Re: Publishing dataset with option --update
>>>>>>>>>
>>>>>>>>>    Hi Sergey,
>>>>>>>>>
>>>>>>>>>    When updating a dataset, it's also important to unpublish it before publishing the new version. E.g., first run
>>>>>>>>>
>>>>>>>>>    esgunpublish <dataset_id>
>>>>>>>>>
>>>>>>>>>    The reason is that, when you publish to the gateway, the gateway software tries to *add* the new information to the existing dataset entry, rather than replace it.
>>>>>>>>>
>>>>>>>>>    --Bob
>>>>>>>>>    ________________________________________
>>>>>>>>>    From: Serguei Nikonov [serguei.nikonov at noaa.gov]
>>>>>>>>>    Sent: Friday, January 06, 2012 10:45 AM
>>>>>>>>>    To: Eric Nienhouse
>>>>>>>>>    Cc: Bob Drach; go-essp-tech at ucar.edu
>>>>>>>>>    Subject: Re: [Go-essp-tech] Fwd: Re:  Publishing dataset with option --update
>>>>>>>>>
>>>>>>>>>    Hi Eric,
>>>>>>>>>
>>>>>>>>>    thanks for your help. I have no objections to whatever versioning
>>>>>>>>>    policy is adopted; what I need is to know how to apply it. The ways I
>>>>>>>>>    tried did not work for me. Hopefully the reason is the problems in
>>>>>>>>>    THREDDS and the database that you pointed out. I am cleaning them up
>>>>>>>>>    right now, then we will see...
>>>>>>>>>
>>>>>>>>>    Just for clarification: if I need to update a dataset (with a version
>>>>>>>>>    change), I create a map file containing the full set of files (old and
>>>>>>>>>    new ones) and then use this map file with esgpublish and the option
>>>>>>>>>    --update. Is that correct? Will it be enough to create the new version
>>>>>>>>>    of the dataset? BTW, there is nothing about versions for the option
>>>>>>>>>    --update in the esgpublish help.
>>>>>>>>>
>>>>>>>>>    Thanks,
>>>>>>>>>    Sergey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    On 01/04/2012 04:27 PM, Eric Nienhouse wrote:
>>>>>>>>>>    Hi Serguei,
>>>>>>>>>>
>>>>>>>>>>    Following are a few more suggestions to diagnose this publishing issue. I agree
>>>>>>>>>>    with others on this thread that adding new files (or changing existing ones)
>>>>>>>>>>    should always trigger a new dataset version.
>>>>>>>>>>
>>>>>>>>>>    It does not appear you are receiving a final "SUCCESS" or failure message when
>>>>>>>>>>    publishing to the Gateway (with esgpublish --publish). Please try increasing
>>>>>>>>>>    your "polling" levels in your $ESGINI file. E.g.:
>>>>>>>>>>
>>>>>>>>>>    hessian_service_polling_delay = 10
>>>>>>>>>>    hessian_service_polling_iterations = 500
>>>>>>>>>>
>>>>>>>>>>    You should see a final "SUCCESS" or "ERROR" with Java trace output at the
>>>>>>>>>>    termination of the command.
>>>>>>>>>>
>>>>>>>>>>    I've reviewed the Thredds catalog for the dataset you note below:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    http://esgdata.gfdl.noaa.gov/thredds/esgcet/1/cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.v2.xml
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    There appear to be multiple instances of certain files within the
>>>>>>>>>>    catalog, which is a problem. The Gateway publish will fail if a
>>>>>>>>>>    particular file (URL) is referenced multiple times with differing
>>>>>>>>>>    metadata. An example is:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    */gfdl_dataroot/NOAA-GFDL/GFDL-CM3/historical/mon/atmos/Amon/r1i1p1/v20110601/rtmt/rtmt_Amon_GFDL-CM3_historical_r1i1p1_186001-186412.nc
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    This file appears as two separate file versions in the Thredds catalog (one with
>>>>>>>>>>    id ending in ".nc" and another with ".nc_0"). There should be only one reference
>>>>>>>>>>    to this file URL in the catalog.
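>>>>>>>>>>
>>>>>>>>>>    A quick way to spot such duplicates is to count the urlPath
>>>>>>>>>>    attributes in the catalog (a rough sketch; it assumes file
>>>>>>>>>>    references carry a urlPath attribute, as in publisher-generated
>>>>>>>>>>    catalogs, and real catalogs may use namespaces or nesting):

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_url_paths(catalog_xml):
    """Return urlPath values referenced more than once in a THREDDS catalog."""
    root = ET.fromstring(catalog_xml)
    counts = Counter(e.get("urlPath") for e in root.iter()
                     if e.get("urlPath") is not None)
    return sorted(path for path, n in counts.items() if n > 1)
```

>>>>>>>>>>    Any path this reports should be reduced to a single reference
>>>>>>>>>>    before publishing to the Gateway.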
>>>>>>>>>>
>>>>>>>>>>    The previous version of the dataset in the publisher/node database may be
>>>>>>>>>>    leading to this issue. You may need to add "--database-delete" to your
>>>>>>>>>>    esgunpublish command to clean things up. Bob can advise on this. Note that the
>>>>>>>>>>    original esgpublish command shown in this email thread included "--keep-version".
>>>>>>>>>>
>>>>>>>>>>    After publishing to the Gateway successfully, you can check the dataset details
>>>>>>>>>>    by URL with the published dataset identifier. For example:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    I hope this helps.
>>>>>>>>>>
>>>>>>>>>>    Regards,
>>>>>>>>>>
>>>>>>>>>>    -Eric
>>>>>>>>>>
>>>>>>>>>>    Serguei Nikonov wrote:
>>>>>>>>>>>    Hi Bob,
>>>>>>>>>>>
>>>>>>>>>>>    I still cannot do anything about updating datasets. The commands you
>>>>>>>>>>>    suggested executed successfully, but the datasets did not appear on the
>>>>>>>>>>>    gateway. I tried several times with different datasets, but the result
>>>>>>>>>>>    is the same.
>>>>>>>>>>>
>>>>>>>>>>>    Do you have any idea what to do in this situation?
>>>>>>>>>>>
>>>>>>>>>>>    Here are some details about what I tried.
>>>>>>>>>>>    I needed to add a file to dataset
>>>>>>>>>>>    cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.
>>>>>>>>>>>    As you advised, I unpublished it (esgunpublish
>>>>>>>>>>>    cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1), then
>>>>>>>>>>>    created a full mapfile (with the additional file) and published it:
>>>>>>>>>>>    esgpublish --read-files --map new_mapfile --project cmip5 --thredds --publish
>>>>>>>>>>>
>>>>>>>>>>>    As I said, there were no errors. The dataset is in the database and in
>>>>>>>>>>>    THREDDS, but not on the gateway.
>>>>>>>>>>>
>>>>>>>>>>>    The second way I tried was using a mapfile containing only the files to
>>>>>>>>>>>    update. I needed to substitute new files for several existing files in a
>>>>>>>>>>>    dataset, so I created a mapfile with only the files to be substituted:
>>>>>>>>>>>    esgscan_directory --read-files --project cmip5 -o mapfile.txt
>>>>>>>>>>>    /data/CMIP5/output1/NOAA-GFDL/GFDL-ESM2M/historical/mon/ocean/Omon/r1i1p1/v20111206
>>>>>>>>>>>
>>>>>>>>>>>    and then published it with the update option:
>>>>>>>>>>>    esgpublish --update --map mapfile.txt --project cmip5 --thredds --publish
>>>>>>>>>>>
>>>>>>>>>>>    The result is the same as in the previous case: everything is fine
>>>>>>>>>>>    locally, but nothing happened on the gateway.
>>>>>>>>>>>
>>>>>>>>>>>    Thanks,
>>>>>>>>>>>    Sergey
>>>>>>>>>>>
>>>>>>>>>>>    -------- Original Message --------
>>>>>>>>>>>    Subject: Re: [Go-essp-tech] Publishing dataset with option --update
>>>>>>>>>>>    Date: Thu, 29 Dec 2011 11:02:05 -0500
>>>>>>>>>>>    From: Serguei Nikonov<Serguei.Nikonov at noaa.gov>
>>>>>>>>>>>    Organization: GFDL
>>>>>>>>>>>    To: Drach, Bob<drach1 at llnl.gov>
>>>>>>>>>>>    CC: Nathan Wilhelmi<wilhelmi at ucar.edu>, "Ganzberger, Michael"
>>>>>>>>>>>    <Ganzberger1 at llnl.gov>,"go-essp-tech at ucar.edu"<go-essp-tech at ucar.edu>
>>>>>>>>>>>
>>>>>>>>>>>    Hi Bob,
>>>>>>>>>>>
>>>>>>>>>>>    I tried the first way you suggested and it worked partially: the dataset
>>>>>>>>>>>    was created on the datanode with version 2, but it did not show up on
>>>>>>>>>>>    the gateway. To make sure this was not an occasional result, I repeated
>>>>>>>>>>>    it with other datasets, with the same outcome.
>>>>>>>>>>>    Now I have 2 datasets on the datanode (visible in the THREDDS server)
>>>>>>>>>>>    that are absent on the gateway:
>>>>>>>>>>>    cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1.v2
>>>>>>>>>>>    cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r2i1p1.v2
>>>>>>>>>>>
>>>>>>>>>>>    Does it make sense to repeat esgpublish with the --publish option?
>>>>>>>>>>>
>>>>>>>>>>>    Thanks and Happy New Year,
>>>>>>>>>>>    Sergey
>>>>>>>>>>>
>>>>>>>>>>>    On 12/21/2011 08:41 PM, Drach, Bob wrote:
>>>>>>>>>>>>    Hi Sergey,
>>>>>>>>>>>>
>>>>>>>>>>>>    The way I would recommend adding new files to an existing dataset is as
>>>>>>>>>>>>    follows:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Unpublish the previous dataset from the gateway and thredds
>>>>>>>>>>>>
>>>>>>>>>>>>    % esgunpublish
>>>>>>>>>>>>    cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1
>>>>>>>>>>>>
>>>>>>>>>>>>    - Add the new files to the existing mapfile for the dataset they are being
>>>>>>>>>>>>    added to.
>>>>>>>>>>>>
>>>>>>>>>>>>    - Republish with the expanded mapfile:
>>>>>>>>>>>>
>>>>>>>>>>>>    % esgpublish --read-files --map newmap.txt --project cmip5 --thredds
>>>>>>>>>>>>    --publish
>>>>>>>>>>>>
>>>>>>>>>>>>    The publisher will:
>>>>>>>>>>>>    - not rescan existing files, only the new files
>>>>>>>>>>>>    - create a new version to reflect the additional files
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    Alternatively you can create a mapfile with *only* the new files (Using
>>>>>>>>>>>>    esgscan_directory), then republish using the --update command.
>>>>>>>>>>>>
>>>>>>>>>>>>    --Bob
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    On 12/21/11 8:40 AM, "Serguei Nikonov"<serguei.nikonov at noaa.gov>   wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>    Hi Nate,
>>>>>>>>>>>>>
>>>>>>>>>>>>>    unfortunately this is not the only dataset I have a problem with; there
>>>>>>>>>>>>>    are at least 5 more. Should I unpublish them locally (db, thredds) and
>>>>>>>>>>>>>    then create a new version containing the full set of files? What is the
>>>>>>>>>>>>>    official way to update a dataset?
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Thanks,
>>>>>>>>>>>>>    Sergey
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    On 12/20/2011 07:06 PM, Nathan Wilhelmi wrote:
>>>>>>>>>>>>>>    Hi Bob/Mike,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    I believe the problem is that when the files were added, the timestamp
>>>>>>>>>>>>>>    on the dataset wasn't updated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    The triple store will only harvest datasets that have files and a
>>>>>>>>>>>>>>    timestamp updated after the last harvest.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    So what likely happened is that the dataset was created without files,
>>>>>>>>>>>>>>    so it wasn't initially harvested. Files were subsequently added, but
>>>>>>>>>>>>>>    the timestamp wasn't updated, so it was still not a candidate for
>>>>>>>>>>>>>>    harvesting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    Can you update the date_updated timestamp for the dataset in question
>>>>>>>>>>>>>>    and then trigger the RDF harvesting? I believe the dataset will show
>>>>>>>>>>>>>>    up then.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    Thanks!
>>>>>>>>>>>>>>    -Nate
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    On 12/20/2011 11:49 AM, Serguei Nikonov wrote:
>>>>>>>>>>>>>>>    Hi Mike,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    I am a member of the data publishers group. I have been publishing a
>>>>>>>>>>>>>>>    considerable amount of data without this kind of trouble; it occurred
>>>>>>>>>>>>>>>    only when I tried to add some files to an existing dataset. Publishing
>>>>>>>>>>>>>>>    from scratch works fine for me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    Thanks,
>>>>>>>>>>>>>>>    Sergey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    On 12/20/2011 01:29 PM, Ganzberger, Michael wrote:
>>>>>>>>>>>>>>>>    Hi Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    That task is on a scheduler and will re-run every 10 minutes. If your data
>>>>>>>>>>>>>>>>    does not appear after that time then perhaps there is another issue. One
>>>>>>>>>>>>>>>>    issue could be that publishing to the gateway requires that you have the
>>>>>>>>>>>>>>>>    role
>>>>>>>>>>>>>>>>    of "Data Publisher";
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    "check that the account is member of the proper group and has the special
>>>>>>>>>>>>>>>>    role of Data Publisher"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    http://esgf.org/wiki/ESGFNode/FAQ
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    Mike
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    -----Original Message-----
>>>>>>>>>>>>>>>>    From: Serguei Nikonov [mailto:serguei.nikonov at noaa.gov]
>>>>>>>>>>>>>>>>    Sent: Tuesday, December 20, 2011 10:12 AM
>>>>>>>>>>>>>>>>    To: Ganzberger, Michael
>>>>>>>>>>>>>>>>    Cc: StИphane Senesi; Drach, Bob;go-essp-tech at ucar.edu
>>>>>>>>>>>>>>>>    Subject: Re: [Go-essp-tech] Publishing dataset with option --update
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    Hi Mike,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    thanks for the suggestion, but I don't have any privileges to do
>>>>>>>>>>>>>>>>    anything on the gateway. I am just publishing data on the GFDL data
>>>>>>>>>>>>>>>>    node.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    Regards,
>>>>>>>>>>>>>>>>    Sergey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    On 12/20/2011 01:05 PM, Ganzberger, Michael wrote:
>>>>>>>>>>>>>>>>>    Hi Serguei,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    I'd like to suggest something that may help you, from
>>>>>>>>>>>>>>>>>    http://esgf.org/wiki/Cmip5Gateway/FAQ
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    "The search does not reflect the latest DB changes I've made
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    You have to manually trigger the 3store harvesting. Log in as root, go
>>>>>>>>>>>>>>>>>    to Admin -> "Gateway Scheduled Tasks" -> "Run tasks" and restart the
>>>>>>>>>>>>>>>>>    job named RDFSynchronizationJobDetail"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    Mike Ganzberger
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    -----Original Message-----
>>>>>>>>>>>>>>>>>    From: go-essp-tech-bounces at ucar.edu  [mailto:go-essp-tech-bounces at ucar.edu]
>>>>>>>>>>>>>>>>>    On Behalf Of Stéphane Senesi
>>>>>>>>>>>>>>>>>    Sent: Tuesday, December 20, 2011 9:42 AM
>>>>>>>>>>>>>>>>>    To: Serguei Nikonov
>>>>>>>>>>>>>>>>>    Cc: Drach, Bob;go-essp-tech at ucar.edu
>>>>>>>>>>>>>>>>>    Subject: Re: [Go-essp-tech] Publishing dataset with option --update
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    Serguei
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    We have for some time now experienced similar problems when publishing
>>>>>>>>>>>>>>>>>    to the PCMDI gateway, i.e. not getting a "SUCCESS" message when
>>>>>>>>>>>>>>>>>    publishing. Sometimes the files are actually published (or at least
>>>>>>>>>>>>>>>>>    accessible through the gateway, their status actually being
>>>>>>>>>>>>>>>>>    "START_PUBLISHING" according to the esg_list_datasets report),
>>>>>>>>>>>>>>>>>    sometimes not. One hypothesis is that the load on the PCMDI Gateway
>>>>>>>>>>>>>>>>>    generates the problem. We haven't yet got confirmation from Bob.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    In contrast to your case, this happens when publishing a dataset from
>>>>>>>>>>>>>>>>>    scratch (I mean, not an update).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    Best regards (but do not expect any feedback from me before early
>>>>>>>>>>>>>>>>>    January)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    S
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    Serguei Nikonov wrote, On 20/12/2011 18:11:
>>>>>>>>>>>>>>>>>>    Hi Bob,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    I needed to add some missing variables to an existing dataset, and I
>>>>>>>>>>>>>>>>>>    found the option --update in the esgpublish command. When I tried it I
>>>>>>>>>>>>>>>>>>    got normal messages like
>>>>>>>>>>>>>>>>>>    INFO 2011-12-20 11:21:00,893 Publishing:
>>>>>>>>>>>>>>>>>>    cmip5.output1.NOAA-GFDL.GFDL-CM3.historical.mon.atmos.Amon.r1i1p1,
>>>>>>>>>>>>>>>>>>    parent = pcmdi.GFDL
>>>>>>>>>>>>>>>>>>    INFO 2011-12-20 11:21:07,564 Result: PROCESSING
>>>>>>>>>>>>>>>>>>    INFO 2011-12-20 11:21:11,209 Result: PROCESSING
>>>>>>>>>>>>>>>>>>    ....
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    but nothing happened on the gateway; the new variables are not there.
>>>>>>>>>>>>>>>>>>    The files corresponding to these variables are in the database and in
>>>>>>>>>>>>>>>>>>    the THREDDS catalog, but apparently were not published to the gateway.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    I used the command line
>>>>>>>>>>>>>>>>>>    esgpublish --update --keep-version --map <map_file> --project cmip5
>>>>>>>>>>>>>>>>>>    --noscan --publish
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    Should the map file be in some specific format to make it work in the
>>>>>>>>>>>>>>>>>>    mode I need?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    Thanks,
>>>>>>>>>>>>>>>>>>    Sergey Nikonov
>>>>>>>>>>>>>>>>>>    GFDL
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    _______________________________________________
>>>>>>>>>>>>>>>>>>    GO-ESSP-TECH mailing list
>>>>>>>>>>>>>>>>>>    GO-ESSP-TECH at ucar.edu
>>>>>>>>>>>>>>>>>>    http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>    --
>>>>>>>>    Estanislao Gonzalez
>>>>>>>>
>>>>>>>>    Max-Planck-Institut für Meteorologie (MPI-M)
>>>>>>>>    Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>>>>>>    Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>>>
>>>>>>>>    Phone:   +49 (40) 46 00 94-126
>>>>>>>>    E-Mail: gonzalez at dkrz.de
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>


