[Go-essp-tech] CMIP5 data archive size estimate

Luca Cinquini luca at ucar.edu
Fri Dec 11 08:59:02 MST 2009


Hi Stephen,
	thanks for the estimate; it is indeed very useful. I am in the
process of evaluating the performance of the RDF query services,
and the first thing I realized is that we need to drop the inferencing
engine, i.e. some of the "intelligence" of the triple store, since it
slows query performance considerably for large holdings
(>10,000 datasets). This will require some software changes, but it
should be possible.
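
For concreteness, here is a minimal sketch of what dropping the
inferencer layer could look like, assuming the store is Sesame 2.x
(as Stephen's survey below suggests) with a NativeStore backend; the
data directory and backend choice are illustrative assumptions, not
details from this thread:

    // Minimal sketch (Sesame 2.x openRDF API): the same repository
    // built with and without the RDFS inferencer layer.
    import java.io.File;

    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryException;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer;
    import org.openrdf.sail.nativerdf.NativeStore;

    public class TripleStoreSetup {
        public static void main(String[] args) throws RepositoryException {
            File dataDir = new File("/data/triplestore"); // placeholder path

            // With inferencing: the forward-chaining RDFS inferencer
            // materializes entailed statements, which inflates the store
            // and slows queries as the holdings grow.
            Repository withInference = new SailRepository(
                new ForwardChainingRDFSInferencer(new NativeStore(dataDir)));

            // Without inferencing: queries run over asserted triples only.
            Repository plain = new SailRepository(new NativeStore(dataDir));
            plain.initialize();
        }
    }

Dropping the inferencer trades RDFS entailment (e.g. subclass
reasoning at query time) for query speed, which is the change
described above.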
thanks, Luca

On Dec 11, 2009, at 8:16 AM, <stephen.pascoe at stfc.ac.uk> wrote:

>
> I have adapted Karl's spreadsheet to give us an estimate of the  
> number of atomic datasets in the system.
>
> I hacked the sheet of each modelling centre to set all grid sizes
> and model years to 1, then set each "bytes needed for single
> field, ..." to 1e9.  This makes the numbers in the "archive size"
> sheet read in units of datasets/1000.
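>
> As a sanity check on the units (this assumes the "archive size"
> sheet reports its totals in terabytes):
>
>   $\text{total}\ [\mathrm{TB}] = N \times 10^{9}\,\mathrm{bytes} \,/\, 10^{12}\,\mathrm{bytes/TB} = N / 1000$
>
> where $N$ is the number of atomic datasets, each now counted as
> exactly 1e9 bytes.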
>
> The results are Requested datasets: 2.8 million, Replicated  
> datasets: 2.7 million.
>
> I interpret the small difference as being because the replicated
> subset is mostly the same variables over a shorter time period.
>
> This tells us we could be dealing with ~10 million datasets
> including all replicas.  I think the RDF triple store is going to
> need work to make it scale to this level.  A quick survey of the web
> tells me Sesame can be configured to handle over 1 billion triples,
> but it might take a special backend or hardware configuration.
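>
> For scale (the triples-per-dataset figure here is an assumption, not
> something measured), a catalogue of ~10 million datasets at ~100
> triples each already reaches that reported limit:
>
>   $10^{7}\ \text{datasets} \times 10^{2}\ \text{triples/dataset} = 10^{9}\ \text{triples}$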
>
> See:
> http://www.openrdf.org/forum/mvnforum/viewthread?thread=2189
>
> Cheers,
> Stephen.
>
> -----Original Message-----
> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> Sent: Wed 12/9/2009 4:08 PM
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: Lawrence, Bryan (STFC,RAL,SSTD); luca at ucar.edu; go-essp-tech at ucar.edu
> Subject: Re: CMIP5 data archive size estimate
>
> Hi Stephen,
>
> Nice trick!  (or perhaps that's a forbidden term now).  Amazing how MS
> can expand the storage needed for the same information by nearly an
> order of magnitude.
>
> By the way, I just noticed the "requested" data would occupy 2.2
> petabytes, not "almost 2 petabytes" as stated in point 10 of my
> previous email.
>
> cheers,
> Karl
>
> stephen.pascoe at stfc.ac.uk wrote:
>>
>> Thanks Karl,
>>
>> Converting it to *.xls and then to *.ods with OpenOffice Calc makes
>> it much smaller (attached).
>>
>> S.
>>
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>
>> -----Original Message-----
>> From: Karl Taylor [mailto:taylor13 at llnl.gov]
>> Sent: 09 December 2009 08:06
>> To: Lawrence, Bryan (STFC,RAL,SSTD)
>> Cc: luca at ucar.edu; Pascoe, Stephen (STFC,RAL,SSTD);
>> go-essp-tech at ucar.edu
>> Subject: CMIP5 data archive size estimate
>>
>> Dear all,
>>
>> I promised to send these spreadsheets to you today, but I don't have
>> time to explain them.  Here are some quick notes:
>>
>> 0.  I've only attached the .xlsx version.  The .xls version is 40
>> megabytes, so I can't send it by email.  I'll try to find another
>> way to get it to you tomorrow.
>>
>> 1.  Estimates are based on input from modeling groups collected more
>> than a year ago.
>>
>> 2.  I think only about 2/3 of the models are included in the  
>> estimate.
>>
>> 3.  The estimate assumes that all experiments a group designated as
>> at least 66% likely to be performed will actually be run.
>> (This perhaps approximately offsets the fact that not all groups have
>> provided input yet.)
>>
>> 4.  You can't rely on any single piece of information in the
>> spreadsheet (it's all completely unofficial), but the estimate of
>> archive size under the stated assumptions is probably correct.
>>
>> 5.  There are no estimates of the number of "atomic datasets" or the
>> number of files per atomic dataset.
>>
>> 6.  I think that in at least one place "gigabytes" should have read
>> "bytes", but that should be obvious.
>>
>> 7.  There are estimates for size at the end of 2010 and at the end of
>> 2014, but I didn't ask groups for their timelines, so these estimates
>> are identical.
>>
>> 8.  There are estimates for "requested output" volume and  
>> "replicated"
>> output volume.
>>
>> 9.  The tables of variables that are referred to in the spreadsheets
>> can be found at:
>> http://cmip-pcmdi.llnl.gov/cmip5/data_description.html?submenuheader=1
>>
>> 10.  Bottom line:  about 1 petabyte of the almost 2 petabytes
>> requested will be replicated.
>>
>> Best regards,
>> Karl
>>
>>
>
>
> <CMIP5_archive_dataset_count.ods>


