[Go-essp-tech] tracking_id and check sums was... RE: Status of Gateway 2.0 (another use case)

Kettleborough, Jamie jamie.kettleborough at metoffice.gov.uk
Fri Dec 16 02:49:30 MST 2011


Hello Stephen, Bryan,
 
I wouldn't rely on tracking_id - there is too high a likelihood that it
is not unique. We have seen cases where different files have the same
tracking_id (though in the cases we have seen the data has been the
same, there are just minor updates to the metadata - hmmm... maybe that
statement was a red rag to a bull).  I think the checksum is the most
reliable indicator of the uniqueness of a file, though clearly it's not
enough on its own, as it doesn't tell you what has been changed or why,
or how changes in one file are related to changes in other files.
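
To make that concrete, here is a rough sketch of the kind of comparison
I mean (the file names are made up; md5sum and ncdump are just the
obvious tools, and tracking_id is an ordinary global attribute in the
file):

  # identical checksums mean, to all practical purposes, identical bytes
  md5sum old/tas_Amon_HadCM3_historical_r1i1p1_185912-200512.nc \
         new/tas_Amon_HadCM3_historical_r1i1p1_185912-200512.nc

  # the tracking_id is easy to read, but two different files can share one
  ncdump -h new/tas_Amon_HadCM3_historical_r1i1p1_185912-200512.nc | grep tracking_id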
 
We've also seen examples where data providers have tried to be helpful -
which is great - and put the version number as an attribute in the
netCDF file... but then that has not been updated when the files have
been published as a new version...
 
Karl - where are we with the agreement that all data nodes should
provide checksums with the data?  I think it's agreed in principle, but
I'm not sure whether and when the implications of that agreement will be
followed up.
 
Jamie
 


________________________________

	From: go-essp-tech-bounces at ucar.edu
[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of
stephen.pascoe at stfc.ac.uk
	Sent: 16 December 2011 09:25
	To: jma at cola.iges.org; go-essp-tech at ucar.edu
	Subject: Re: [Go-essp-tech] Status of Gateway 2.0 (another use
case)
	
	

	Hi Jennifer,

	 

	I just wanted to add a few more technical specifics to this
sub-thread about versions.  Bryan's point that it has all been a
compromise is the take-home message.

	 

	> If the version is so important and needs to be preserved, then it
	> should have been included in the data file name. It's obviously too
	> late to make that change now.

	Indeed, and it was already too late when we got agreement on the
format for version identifiers.  By that point CMOR, the tool that
generates the filenames, was already finalised and being run at some
modelling centres.  Also a version has to be assigned much later in the
process than when CMOR is run.  Bryan is right that the tracking_id or
md5 checksum should provide the link between file and version.
Unfortunately we don't have tools for that yet.

	 

	Although the filenames don't contain versions, the wget scripts
do, provided datanodes have their data in DRS directory format.  ESG
insiders know this has been a long-term bugbear of mine.  Presently
IPSL, BADC and DKRZ have this, and maybe some others too, but not all
datanodes have implemented it.  Maybe the wget scripts need to include
versions in a more explicit way than just the DRS path, which would
allow datanodes that can't implement DRS to include versions.  It would
be good if wget scripts replicated the DRS directory structure at the
client.  That's something I wish we'd implemented by now, but since not
every datanode has the DRS structure it's impossible to implement
federation-wide.
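
	To illustrate what I mean by replicating the structure at the
client (the host and path below are invented, and the --cut-dirs count
depends on each data node's URL layout), something along these lines
recreates the DRS directories, version included, rather than leaving a
flat pile of files:

	  wget -x -nH --cut-dirs=2 \
	    http://datanode.example.org/thredds/fileServer/cmip5/output1/MOHC/HadCM3/historical/mon/atmos/Amon/r1i1p1/v20111006/tas/tas_Amon_HadCM3_historical_r1i1p1_185912-200512.nc

	Here -x creates the directory hierarchy even for a single file,
-nH drops the hostname, and --cut-dirs strips the leading non-DRS path
components.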

	 

	Thanks for the great feedback.

	Stephen.

	 

	---

	Stephen Pascoe  +44 (0)1235 445980

	Centre of Environmental Data Archival

	STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
0QX, UK

	 

	From: go-essp-tech-bounces at ucar.edu
[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Jennifer Adams
	Sent: 15 December 2011 19:58
	To: go-essp-tech at ucar.edu
	Subject: Re: [Go-essp-tech] Status of Gateway 2.0 (another use
case)

	 

	 

	On Dec 15, 2011, at 2:14 PM, Bryan Lawrence wrote:

	
	
	

	Hi Jennifer
	
	With due respect, it's completely unrealistic to expect
modelling groups not to want to have multiple versions of some datasets
... that's just not how the world (and in particular, modelling
workflow) works. It has never been thus. There simply isn't time to
look at everything before it is released ... if you have a problem with
that, blame the government folk who set the IPCC timetables :-)  (Maybe
your comment was somewhat tongue in cheek, but I feel obliged to make
this statement anyway :-).

	Fair enough. I was being cheeky, that is why I put the :-). The
users suffer the IPCC time constraints too; we have to deliver analyses
of data that take an impossibly long time to grab. 

	
	
	

	
	Also, with due respect, please don't "replace files with newer
versions" ... we absolutely need folks to understand the idea of
processing with one particular version of the data, and to understand
its provenance, so that if the data has changed they know they may need
to re-run the processing. 

	 

	If the version is so important and needs to be preserved, then
it should have been included in the data file name. It's obviously too
late to make that change now. As I mentioned before, the version number
is a valuable piece of metadata that is lost in the wget download
process. The problem of how to keep track of version numbers and update
my copy when necessary remains. 

	 

	I'll take this opportunity to point out that the realm and
frequency are also missing from the file name. I can't remember where I
read this, but the MIP_table value is not always adequate for uniquely
determining the realm and frequency. 

	
	
	

	
	I'm sure this doesn't apply to you, but for too long our
community has had a pretty cavalier attitude to data provenance! CMIP3
and AR4 were a "dog's breakfast" in this regard ...

	Looks like CMIP5 hasn't improved the situation. 
	
	
	

	
	(And I too am very grateful that you are laying out your
requirements in some detail :-)

	I'm glad to hear that. 

	--Jennifer

	 

	
	
	

	
	Cheers
	Bryan
	
	
	
	

	 

		On Dec 15, 2011, at 11:22 AM, Estanislao Gonzalez wrote:

		 

			Hi Jennifer,

			 

			I'll check this more carefully and see what can
be done with what we have (or minimal changes), though multiple
versions are something CMIP3 didn't have to worry about - files just
got changed or deleted - and CMIP5 adds a two-figure factor to that,
since there are many more institutions and data... but it might be
possible.

		At the moment, I have no good ideas for how to solve the
problem of replacing files in my local CMIP5 collection with newer
versions if they are available. My strategy at this point is to get the
version that is available now and not look for it again. If any data
providers are listening, here is my plea: 

		==> Please don't submit new versions of your CMIP5 data.
Get it right the first time! <==

		:-)

		 

			 

			In any case I wanted just to thank you very much
for the detailed description, it is very useful.

		I'm glad you (and Steve Hankin) find my long emails
helpful. 

		--Jennifer

		 

			 

			Regards,

			Estani

			 

			Am 15.12.2011 14:52, schrieb Jennifer Adams:

				 

				Hi, Estanislao -- 

				Please see my comments inline.

				 

				On Dec 15, 2011, at 5:47 AM, Estanislao
Gonzalez wrote:

				 

				Hi Jennifer,

				 

				I'm still not sure how Luca's change
in the API is going to help you, Jennifer. But perhaps it would help me
to fully understand your requirement, as well as your use of wget when
using the FTP protocol.

				 

				I presume what you want is to crawl the
archive and get files from a specific directory structure?

				Maybe it would be better if you just
describe briefly the procedure you've been using for getting the CMIP3
data so we can see what could be done for CMIP5.

				 

				How did you find out which data was
interesting?

				COLA scientists ask for a specific
scenario/realm/frequency/variable they need for their research. Our
CMIP3 collection is a shared resource of about 4Tb of data. For CMIP5,
we are working with an estimate of 4-5 times that data volume to meet
our needs. It's hard to say at this point whether that will be enough. 

				 

				How did you find out which files were
required to be downloaded?

				For CMIP3, we often referred to
http://www-pcmdi.llnl.gov/ipcc/data_status_tables.htm to see what was
available. 

				 

				The new version of this chart for CMIP5,
http://cmip-pcmdi.llnl.gov/cmip5/esg_tables/transpose_esg_static_table.html,
is also useful. An improvement I'd like to see on this page: the
numbers inside the blue boxes that show how many runs there are for a
particular experiment/model should be a link to a list of those runs
that has all the necessary components from the Data Reference Syntax so
that I can go directly to the URL for that data set. For example, 

				the BCC-CSM1.1 model shows 45 runs for
the decadal1960 experiment. I would like to click on that 45 and get a
list of the 45 URLs for those runs, like this:

	
http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r1i1p1.html

	
http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.BCC.bcc-csm1-1.decadal1960.day.land.day.r2i1p1.html

				...

				 

				 

				How did you tell wget to download those
files?

				For example: wget -nH --retr-symlinks -r -A nc ftp://username@ftp-esg.ucllnl.org/picntrl/atm/mo/tas -o log.tas

				This would populate a local directory
./picntrl/atm/mo/tas with all the models and ensemble members in the
proper subdirectory. If I wanted to update with newer versions or models
that had been added, I simply ran the same 1-line wget command again.
This is what I refer to as 'elegant.'
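
				(As an aside, and purely as a sketch of the FTP case: adding wget's -N
timestamping option to that same command makes re-runs skip any file
whose remote copy is not newer than the local one, e.g.

				wget -nH --retr-symlinks -r -N -A nc ftp://username@ftp-esg.ucllnl.org/picntrl/atm/mo/tas -o log.tas

				so repeated updates stay cheap.)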

				 

				 

				We might already have some way of
achieving what you want, if we knew exactly what that is.

				Wouldn't that be wonderful? I am hopeful
that the P2P will simplify the elaborate and flawed workflow I have
cobbled together to navigate the current system.

				I have a list of desired
experiment/realm/frequency/MIP_table/variables for which I need to grab
all available models/ensembles. Is that not enough to describe my needs?


				 

				 

				I guess my proposal of issuing:

				bash <(wget 'http://p2pnode/wget?experiment=decadal1960&realm=atmos&time_frequency=month&variable=clt' -qO - | grep -v HadCM3)

				Yes, this would likely achieve the same
result as the '&model=!name' that Luca implemented. However, I believe
the documentation says there is a limit of 1000 on the number of wget
entries that the P2P node will put into a single search request, so I
don't want to populate my precious 1000 results with entries that I'm
going to grep out afterwards. 

				 

				--Jennifer

				 

				 

				 

				was not acceptable to you. But I still
don't know exactly why. 

				It would really help to know what you
meant by "elegant use of wget".

				 

				Thanks,

				Estani

				 

				 

				Am 14.12.2011 18:44, schrieb Cinquini,
Luca (3880):

				 

				So Jennifer, would having the capability
of doing negative searches (model=!CCSM), and generate the corresponding
wget scripts, help you ?

				thanks, Luca

				 

				On Dec 14, 2011, at 10:38 AM, Jennifer
Adams wrote:

				 

				Well, after working from the client side
to get CMIP3 and CMIP5 data, I can say that wget is a fine tool to rely
on at the core of the workflow. Unfortunately, the step up in complexity
from CMIP3 to CMIP5 and the switch from FTP to HTTP trashed the elegant
use of wget. No amount of customized wrapper software, browser
interfaces, or pre-packaged tools like DML fixes that problem. 

				 

				At the moment, the burden on the user is
embarrassingly high. It's so easy to suggest that the user should
"filter to remove what is not required" from a downloaded script, but
the actual practice of doing that in a timely, automated, and
distributed way is NOT simple! And if the solution to my problem of
filling in the gaps in my incomplete collection is to go back to
clicking in my browser and do the whole thing over again but make my
filters smarter by looking for what's already been acquired or what has
a new version number ... this is unacceptable. The filtering must be a
server-side responsibility and the interface must be accessible by
automated scripts. Make it so! 

				 

				By the way, the version number is a
piece of metadata that is not in the downloaded files or the gateway's
search criteria. It appears in the wget script as part of the path in
the file's http location, but the path is not preserved after the wget
is complete, so it is effectively lost after the download is
done. I guess the file's date stamp would be the only way to know if the
version number of the data file in question has been changed, but I'm
not going to write that check into my filtering scripts. 
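
				(The least-bad workaround I can imagine, assuming the script really does
carry DRS-style paths with a vYYYYMMDD directory in them, would be to
harvest the versions from the script before running it; the script name
and pattern here are purely illustrative:

				grep -oE 'http[^ "]*\.nc' wget-cmip5.sh |
				  awk -F/ '{f=$NF; v="?"; for(i=1;i<=NF;i++) if($i ~ /^v[0-9]{8}$/) v=$i; print f, v}' > file_versions.txt

				That at least leaves a record of which version each downloaded file came
from, even though the path itself is thrown away.)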

				 

				--Jennifer

				 

				 

				--

				Jennifer M. Adams

				IGES/COLA

				4041 Powder Mill Road, Suite 302

				Calverton, MD 20705

				jma at cola.iges.org

				 

				 

				 

	

				 

				 

				 

	

				 

				 

				 

				--

				Jennifer M. Adams

				IGES/COLA

				4041 Powder Mill Road, Suite 302

				Calverton, MD 20705

				jma at cola.iges.org

				 

				 

				 

				 

				 

	

			 

			 

		 

		--

		Jennifer M. Adams

		IGES/COLA

		4041 Powder Mill Road, Suite 302

		Calverton, MD 20705

		jma at cola.iges.org

		 

		 

		 

		 

	
	--
	Bryan Lawrence
	University of Reading:  Professor of Weather and Climate
Computing.
	National Centre for Atmospheric Science: Director of Models and
Data. 
	STFC: Director of the Centre for Environmental Data Archival.
	Ph: +44 118 3786507 or 1235 445012;
Web:home.badc.rl.ac.uk/lawrence

	 

	--

	Jennifer M. Adams

	IGES/COLA

	4041 Powder Mill Road, Suite 302

	Calverton, MD 20705

	jma at cola.iges.org

	 

	
	
	

	 





