[Go-essp-tech] CMIP5 data archive size estimate
stephen.pascoe at stfc.ac.uk
Fri Dec 11 08:16:15 MST 2009
I have adapted Karl's spreadsheet to give us an estimate of the number of atomic datasets in the system.
I hacked the sheet of each modelling centre to set all grid sizes and model years to 1, then set each "bytes needed for single field, ..." to 1e9. This causes the numbers in the "archive size" sheet to be in units of datasets/1000.
The results are: requested datasets, 2.8 million; replicated datasets, 2.7 million.
I interpret the small difference as arising because the replicated subset is mainly the same variables over a shorter time period.
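A minimal sketch of the arithmetic behind the trick above, for anyone who wants to check it. It assumes the summary sheet reports sizes in terabytes (1e12 bytes), which is what would make a 1e9-byte field show up as datasets/1000; that unit is an inference, not something stated in the spreadsheet notes. The sheet values 2800 and 2700 are simply the values implied by the 2.8 and 2.7 million results, and the function names are hypothetical.

```python
# Sketch of the counting trick: with grid size and model years forced to 1,
# every field contributes exactly BYTES_PER_FIELD bytes to the total.
BYTES_PER_FIELD = 1e9      # value set in each modelling centre's sheet
BYTES_PER_TERABYTE = 1e12  # ASSUMPTION: the summary sheet reports terabytes


def reported_size_tb(n_datasets):
    """Size the sheet would report for n_datasets unit-sized fields."""
    return n_datasets * BYTES_PER_FIELD / BYTES_PER_TERABYTE


def datasets_from_reported(size_tb):
    """Invert the trick: a reported value is datasets/1000."""
    return size_tb * BYTES_PER_TERABYTE / BYTES_PER_FIELD


# Implied sheet values (not copied from the spreadsheet itself).
requested = datasets_from_reported(2800.0)
replicated = datasets_from_reported(2700.0)

print(f"requested:  {requested:,.0f} datasets")
print(f"replicated: {replicated:,.0f} datasets")
```

If the sheet actually reports a different unit, only BYTES_PER_TERABYTE needs to change; the counting logic is the same.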
This tells us we could be dealing with ~10 million datasets including all replicas. I think the RDF triple store is going to need work to make it scale to this level. A quick survey of the web tells me Sesame can be configured to deal with more than 1 billion triples, but it might take a special backend or hardware configuration.
See:
http://www.openrdf.org/forum/mvnforum/viewthread?thread=2189
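A rough sketch of how the dataset count translates into triple-store load. The number of triples needed to describe one dataset is not given anywhere in this thread, so the per-dataset values below are purely hypothetical; only the ~10 million dataset figure and the 1 billion triple figure come from the text above.

```python
# Rough triple-count estimate for the RDF store.
N_DATASETS = 10_000_000        # ~10 million datasets including replicas
SESAME_TARGET = 1_000_000_000  # the 1 billion triple figure quoted above


def total_triples(n_datasets, triples_per_dataset):
    """Total triples if every dataset is described by the same number."""
    return n_datasets * triples_per_dataset


# HYPOTHETICAL metadata sizes per dataset; these are not measured values.
for per_dataset in (10, 50, 100, 200):
    total = total_triples(N_DATASETS, per_dataset)
    status = "over" if total > SESAME_TARGET else "within"
    print(f"{per_dataset:>4} triples/dataset -> {total:,} triples ({status} 1 billion)")

# Triples available per dataset before the store exceeds 1 billion.
budget = SESAME_TARGET // N_DATASETS
print(f"budget: {budget} triples per dataset")
```

Under these assumptions the store stays within 1 billion triples only if each dataset needs no more than the printed budget, which is why the per-dataset metadata size matters as much as the dataset count.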
Cheers,
Stephen.
-----Original Message-----
From: Karl Taylor [mailto:taylor13 at llnl.gov]
Sent: Wed 12/9/2009 4:08 PM
To: Pascoe, Stephen (STFC,RAL,SSTD)
Cc: Lawrence, Bryan (STFC,RAL,SSTD); luca at ucar.edu; go-essp-tech at ucar.edu
Subject: Re: CMIP5 data archive size estimate
Hi Stephen,
Nice trick! (or perhaps that's a forbidden term now). Amazing how MS
can expand by nearly an order of magnitude the storage needed for the
same information.
By the way, I just noticed the "requested" data would occupy 2.2
petabytes, not "almost 2 petabytes" as stated in point 10 of my previous
email.
cheers,
Karl
stephen.pascoe at stfc.ac.uk wrote:
>
> Thanks Karl,
>
> Converting it to *.xls then to *.ods with OpenOffice calc makes it much
> smaller (attached).
>
> S.
>
> ---
> Stephen Pascoe +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>
> -----Original Message-----
> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> Sent: 09 December 2009 08:06
> To: Lawrence, Bryan (STFC,RAL,SSTD)
> Cc: luca at ucar.edu; Pascoe, Stephen (STFC,RAL,SSTD);
> go-essp-tech at ucar.edu
> Subject: CMIP5 data archive size estimate
>
> Dear all,
>
> I promised to send these spreadsheets to you today, but I don't have
> time to explain them. Here are some quick notes:
>
> 0. I've only attached the .xlsx version. The .xls version is 40
> megabytes, so I can't send it by email. I'll try to find another way to
> get it to you tomorrow.
>
> 1. Estimates are based on input from modeling groups collected more
> than a year ago.
>
> 2. I think only about 2/3 of the models are included in the estimate.
>
> 3. Estimate is based on assuming that all experiments designated by the
> group as 66% likely to be performed or better will actually be run.
> (This perhaps approximately offsets the fact that not all groups have
> provided input yet.)
>
> 4. You can't rely on a single piece of information in the spreadsheet
> (it's all completely unofficial), but the estimate of archive size under
> the stated assumptions is probably correct.
>
> 5. There are no estimates of the number of "atomic datasets" or the
> number of files per atomic dataset.
>
> 6. I think in at least one place, gigabytes should have read bytes, but
> that should be obvious.
>
> 7. There are estimates for size at the end of 2010 and at the end of
> 2014, but I didn't ask groups for their timelines, so these estimates
> are identical.
>
> 8. There are estimates for "requested output" volume and "replicated"
> output volume.
>
> 9. The tables of variables that are referred to in the spreadsheets can
> be found at:
> http://cmip-pcmdi.llnl.gov/cmip5/data_description.html?submenuheader=1
>
> 10. Bottom line: about 1 petabyte of data will be replicated of the
> almost 2 petabytes requested.
>
> Best regards,
> Karl
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CMIP5_archive_dataset_count.ods
Type: application/vnd.oasis.opendocument.spreadsheet
Size: 5836119 bytes
Desc: CMIP5_archive_dataset_count.ods
Url : http://mailman.ucar.edu/pipermail/go-essp-tech/attachments/20091211/89acfee4/attachment-0001.ods