[Go-essp-tech] CMIP5 data archive size estimate

stephen.pascoe at stfc.ac.uk stephen.pascoe at stfc.ac.uk
Fri Dec 11 08:16:15 MST 2009


I have adapted Karl's spreadsheet to give us an estimate of the number of atomic datasets in the system.

I hacked each modelling centre's sheet to set all grid sizes and model years to 1, then set each "bytes needed for single field, ..." to 1e9.  As a result, the numbers in the "archive size" sheet are in units of datasets/1000.

The results are: Requested datasets: 2.8 million; Replicated datasets: 2.7 million.
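
The unit trick above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet's actual formula: it assumes the summary sheet reports sizes in terabytes (1e12 bytes), which is what would make a 1e9-byte field show up as datasets/1000. The function names are made up for the example.

```python
# Assumption: the summary sheet reports terabytes (1e12 bytes).
# This is inferred from the "datasets/1000" units, not confirmed.
BYTES_PER_TERABYTE = 10**12

# Value forced into every "bytes needed for single field" cell.
FORCED_BYTES_PER_FIELD = 10**9


def sheet_value_from_count(n_datasets):
    """Value the sheet would show once grid size and model years are 1."""
    total_bytes = n_datasets * FORCED_BYTES_PER_FIELD
    return total_bytes / BYTES_PER_TERABYTE


def count_from_sheet_value(sheet_value):
    """Invert the trick: recover the dataset count from the sheet value."""
    return sheet_value * BYTES_PER_TERABYTE / FORCED_BYTES_PER_FIELD


if __name__ == "__main__":
    # A sheet value of 2800 corresponds to 2.8 million datasets.
    print(count_from_sheet_value(2800))
```

With every dataset contributing exactly 1e9 bytes, the reported "size" is just a scaled count, so multiplying the sheet value by 1000 gives the number of datasets.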

I interpret the small difference as arising because the replicated subset is mainly the same variables over a smaller time period.

This tells us we could be dealing with ~10 million datasets including all replicas.  I think the RDF triple store is going to need work to make it scale to this level.  A quick survey of the web tells me Sesame can be configured to deal with over 1 billion triples, but it might take a special backend or hardware configuration.

See:
http://www.openrdf.org/forum/mvnforum/viewthread?thread=2189
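
As a rough sanity check on the scaling concern, the two figures above (~10 million datasets, ~1 billion triples) imply a per-dataset triple budget. The sketch below only does that division; the number of triples actually needed per dataset is not known here, and the function name is hypothetical.

```python
def triples_budget_per_dataset(max_triples, n_datasets):
    """Whole number of triples available per dataset under a store limit."""
    if n_datasets <= 0:
        raise ValueError("n_datasets must be positive")
    return max_triples // n_datasets


if __name__ == "__main__":
    # Figures quoted in this message: ~1 billion triples, ~10 million datasets.
    print(triples_budget_per_dataset(1_000_000_000, 10_000_000))
```

If the metadata for one dataset needs more triples than this budget, the store would have to go beyond the 1 billion mark.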

Cheers,
Stephen.

-----Original Message-----
From: Karl Taylor [mailto:taylor13 at llnl.gov]
Sent: Wed 12/9/2009 4:08 PM
To: Pascoe, Stephen (STFC,RAL,SSTD)
Cc: Lawrence, Bryan (STFC,RAL,SSTD); luca at ucar.edu; go-essp-tech at ucar.edu
Subject: Re: CMIP5 data archive size estimate
 
Hi Stephen,

Nice trick!  (or perhaps that's a forbidden term now).  Amazing how MS 
can expand by nearly an order of magnitude the storage needed for the 
same information.

By the way, I just noticed the "requested" data would occupy 2.2 
petabytes, not "almost 2 petabytes" as stated in point 10 of my previous 
email.

cheers,
Karl

stephen.pascoe at stfc.ac.uk wrote:
>  
> Thanks Karl,
>
> Converting it to *.xls then to *.ods with OpenOffice calc makes it much
> smaller (attached).
>
> S.
>
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>
> -----Original Message-----
> From: Karl Taylor [mailto:taylor13 at llnl.gov] 
> Sent: 09 December 2009 08:06
> To: Lawrence, Bryan (STFC,RAL,SSTD)
> Cc: luca at ucar.edu; Pascoe, Stephen (STFC,RAL,SSTD);
> go-essp-tech at ucar.edu
> Subject: CMIP5 data archive size estimate
>
> Dear all,
>
> I promised to send these spreadsheets to you today, but I don't have
> time to explain them.  Here are some quick notes:
>
> 0.  I've only attached the .xlsx version.  The .xls version is 40
> megabytes, so I can't send it by email.  I'll try to find another way to
> get it to you tomorrow.
>
> 1.  Estimates are based on input from modeling groups collected more
> than a year ago.
>
> 2.  I think only about 2/3 of the models are included in the estimate.
>
> 3.  Estimate is based on assuming that all experiments designated by the
> group as 66% likely to be performed or better will actually be run.  
> (This perhaps approximately offsets the fact that not all groups have
> provided input yet.)
>
> 4.  You can't rely on a single piece of information in the spreadsheet
> (it's all completely unofficial), but the estimate of archive size under
> the stated assumptions is probably correct.
>
> 5.  There are no estimates of the number of "atomic datasets" or the
> number of files per atomic dataset.
>
> 6.  I think in at least one place "gigabytes" should have read "bytes",
> but that should be obvious.
>
> 7.  There are estimates for size at the end of 2010 and at the end of
> 2014, but I didn't ask groups for their timelines, so these estimates
> are identical.
>
> 8.  There are estimates for "requested output" volume and "replicated" 
> output volume.  
>
> 9.  The tables of variables that are referred to in the spreadsheets can
> be found at: 
> http://cmip-pcmdi.llnl.gov/cmip5/data_description.html?submenuheader=1
>
> 10.  Bottom line:  about 1 petabyte of data will be replicated of the
> almost 2 petabytes requested.
>
> Best regards,
> Karl
>
>   








-------------- next part --------------
A non-text attachment was scrubbed...
Name: CMIP5_archive_dataset_count.ods
Type: application/vnd.oasis.opendocument.spreadsheet
Size: 5836119 bytes
Desc: CMIP5_archive_dataset_count.ods
Url : http://mailman.ucar.edu/pipermail/go-essp-tech/attachments/20091211/89acfee4/attachment-0001.ods 

