[Go-essp-tech] CMIP5 data archive size estimate

Karl Taylor taylor13 at llnl.gov
Mon Dec 14 19:05:04 MST 2009


Dear Stephen and all,

I found a number of mistakes in your spreadsheets.  I've attached 
spreadsheets in which I've made the following changes:

on the "requested" and the "replicated" pages
1)  set samples/yr to 1 for all data categories (monthly, daily, 
3-hourly, etc.)
2)  under number of years requested per simulation set to 1 all values 
not blank (i.e., 0) and not 1000

on the individual modeling pages:
1)  set all values in years/realization to 1, except for the RCP 
extensions into the 22nd and 23rd centuries, which I set to 0 (because 
these are in the same atomic dataset as the earlier portion of those runs)
2)  set "bytes needed for a single field, single sample" to 1
3)  effectively replaced "/1000000000" with 1 (wherever it occurred)

This, I think, gives the correct estimate of the number of atomic 
datasets (if all time samples for a single run are considered to be part 
of the same atomic dataset).
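
To make the counting concrete, here is a minimal sketch (in Python, and 
only a toy stand-in for the real spreadsheet formulas; the names and the 
per-row structure are assumptions for illustration) of why forcing the 
per-sample size, samples/yr, and years/realization factors to 1, and 
dropping the /1000000000 scaling, turns the archive-size total into a 
count of atomic datasets:

# Toy stand-in for one spreadsheet row: the volume contributed by one
# (variable, experiment) entry; the archive size is the sum over rows.
def row_volume(bytes_per_sample, samples_per_year, years_per_realization,
               realizations):
    return (bytes_per_sample * samples_per_year
            * years_per_realization * realizations)

rows = [
    # (bytes_per_sample, samples_per_year, years_per_realization, realizations)
    (1, 1, 1, 3),   # e.g. a monthly variable from an experiment with 3 realizations
    (1, 1, 1, 1),   # e.g. a daily variable from an experiment with 1 realization
]

# With the first three factors set to 1 (and the /1e9 scaling removed),
# each row contributes just its number of realizations, so the grand total
# is a count of atomic datasets rather than a volume in gigabytes.
print(sum(row_volume(*r) for r in rows))   # -> 4 atomic datasets in this toy case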

As shown on the attached "archive size" sheet, the numbers of atomic 
datasets are 564,000 and 527,000 for the "requested" and "replicated" 
categories of data (about 1/5 of the numbers estimated by Stephen).  
I don't know whether this matters for performance.  Also, this is for a 
single copy of the archive, not counting replicated copies.

Best regards,
Karl






stephen.pascoe at stfc.ac.uk wrote:
> I have adapted Karl's spreadsheet to give us an estimate of the number of atomic datasets in the system.
>
> I hacked the sheet of each modelling centre to set all grid sizes and model years to 1, then set each "bytes needed for single field, ..." to 1e9.  This makes the numbers on the "archive size" sheet read in units of datasets/1000.
>
> The results are Requested datasets: 2.8 million, Replicated datasets: 2.7 million.
>
> I interpret the small difference as arising because the replicated subset is mainly the same variables over a smaller time period.
>
> This tells us we could be dealing with ~10 million datasets including all replicas.  I think the RDF triple store is going to need work to make it scale to this level.  A quick survey of the web tells me Sesame can be configured to handle more than 1 billion triples, but it might take a special backend or hardware configuration.
>
> See:
> http://www.openrdf.org/forum/mvnforum/viewthread?thread=2189
>
> Cheers,
> Stephen.
>
> -----Original Message-----
> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> Sent: Wed 12/9/2009 4:08 PM
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: Lawrence, Bryan (STFC,RAL,SSTD); luca at ucar.edu; go-essp-tech at ucar.edu
> Subject: Re: CMIP5 data archive size estimate
>  
> Hi Stephen,
>
> Nice trick!  (Or perhaps that's a forbidden term now.)  Amazing how MS 
> can expand the storage needed for the same information by nearly an 
> order of magnitude.
>
> By the way, I just noticed the "requested" data would occupy 2.2 
> petabytes, not "almost 2 petabytes" as stated in point 10 of my previous 
> email.
>
> cheers,
> Karl
>
> stephen.pascoe at stfc.ac.uk wrote:
>   
>>  
>> Thanks Karl,
>>
>> Converting it to *.xls then to *.ods with OpenOffice calc makes it much
>> smaller (attached).
>>
>> S.
>>
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>
>> -----Original Message-----
>> From: Karl Taylor [mailto:taylor13 at llnl.gov] 
>> Sent: 09 December 2009 08:06
>> To: Lawrence, Bryan (STFC,RAL,SSTD)
>> Cc: luca at ucar.edu; Pascoe, Stephen (STFC,RAL,SSTD);
>> go-essp-tech at ucar.edu
>> Subject: CMIP5 data archive size estimate
>>
>> Dear all,
>>
>> I promised to send these spreadsheets to you today, but I don't have
>> time to explain them.  Here are some quick notes:
>>
>> 0.  I've only attached the .xlsx version.  The .xls version is 40
>> megabytes, so I can't send it by email.  I'll try to find another way to
>> get it to you tomorrow.
>>
>> 1.  Estimates are based on input from modeling groups collected more
>> than a year ago.
>>
>> 2.  I think only about 2/3 of the models are included in the estimate.
>>
>> 3.  The estimate assumes that all experiments designated by each group
>> as at least 66% likely to be performed will actually be run.
>> (This perhaps approximately offsets the fact that not all groups have
>> provided input yet.)
>>
>> 4.  You can't rely on any single piece of information in the spreadsheet
>> (it's all completely unofficial), but the estimate of archive size under
>> the stated assumptions is probably correct.
>>
>> 5.  There are no estimates of the number of "atomic datasets" or the
>> number of files per atomic dataset.
>>
>> 6.  I think in at least one place, "gigabytes" should have read "bytes",
>> but that should be obvious.
>>
>> 7.  There are estimates for size at the end of 2010 and at the end of
>> 2014, but I didn't ask groups for their timelines, so these estimates
>> are identical.
>>
>> 8.  There are estimates for "requested output" volume and "replicated" 
>> output volume.  
>>
>> 9.  The tables of variables that are referred to in the spreadsheets can
>> be found at: 
>> http://cmip-pcmdi.llnl.gov/cmip5/data_description.html?submenuheader=1
>>
>> 10.  Bottom line:  about 1 petabyte of the almost 2 petabytes of
>> requested data will be replicated.
>>
>> Best regards,
>> Karl
>>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: CMIP5_atomic_numbers.xlsx
Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Size: 9341640 bytes
Desc: not available
Url : http://mailman.ucar.edu/pipermail/go-essp-tech/attachments/20091214/0dfff245/attachment-0001.bin 

