[Go-essp-tech] CMIP5 data archive size estimate

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Fri Dec 11 08:55:29 MST 2009


hi Stephen

Hmm. When we spoke just now I suggested I thought you were wrong, but I did my calculation again, and it was probably me who was wrong.

I think the number of different outputs requested is of order 500 (*).
I think the number of experiments is of order 50.
The number of modelling centres is of order 20.
The number of ensembles is of order 3.
Number of atomic datasets = 500 x 50 x 20 x 3 = 1.5E6.

So, what's a factor of two between friends :-)

But this also implies 1 PB / 2 million = 0.5 GB per atomic dataset. We know/think that GridFTP doesn't like small files ... is this big enough? Does the BDM aggregate things to make transfers faster?
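(Purely for concreteness, here are the same back-of-envelope sums as a few lines of Python; the counts are only my order-of-magnitude guesses above, not official figures:)

  # rough CMIP5 atomic-dataset estimate -- order-of-magnitude guesses only
  outputs     = 500   # distinct requested output variables
  experiments = 50
  centres     = 20    # modelling centres
  ensembles   = 3

  n_datasets = outputs * experiments * centres * ensembles
  print("atomic datasets: %.1e" % n_datasets)               # ~1.5e6

  # assume ~2 million datasets sharing roughly 1 PB of replicated data
  mean_gb = 1e15 / 2e6 / 1e9
  print("mean size per atomic dataset: %.1f GB" % mean_gb)  # 0.5 GB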

(*) What's the deal with datasets for the CFMIP experiments, where we need to think about the suborbital things? How are we aggregating them? Can someone point me to where this stuff is described?

Cheers
Bryan



On Friday 11 December 2009 15:16:15 Pascoe, Stephen (STFC,RAL,SSTD) wrote:
> 
> I have adapted Karl's spreadsheet to give us an estimate of the number of atomic datasets in the system.
> 
> I hacked the sheet for each modelling centre to set all grid sizes and model years to 1, then set each "bytes needed for single field, ..." to 1e9.  This causes the numbers in the "archive size" sheet to be in units of datasets/1000.
> 
> The results are: requested datasets, 2.8 million; replicated datasets, 2.7 million.
> 
> I interpret the small difference as being because the replicated subset is mainly the same variables over a shorter time period.
> 
> This tells us we could be dealing with ~10 million datasets including all replicas.  I think the RDF triple store is going to need work to make it scale to this level.  A quick survey of the web tells me Sesame can be configured to handle over 1 billion triples, but it might take a special backend or hardware configuration.
> 
> See:
> http://www.openrdf.org/forum/mvnforum/viewthread?thread=2189
> 
> Cheers,
> Stephen.
> 
> -----Original Message-----
> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> Sent: Wed 12/9/2009 4:08 PM
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: Lawrence, Bryan (STFC,RAL,SSTD); luca at ucar.edu; go-essp-tech at ucar.edu
> Subject: Re: CMIP5 data archive size estimate
>  
> Hi Stephen,
> 
> Nice trick!  (or perhaps that's a forbidden term now).  Amazing how MS
> can expand the storage needed for the same information by nearly an
> order of magnitude.
> 
> By the way, I just noticed the "requested" data would occupy 2.2 
> petabytes, not "almost 2 petabytes" as stated in point 10 of my previous 
> email.
> 
> cheers,
> Karl
> 
> stephen.pascoe at stfc.ac.uk wrote:
> >  
> > Thanks Karl,
> >
> > Converting it to *.xls then to *.ods with OpenOffice calc makes it much
> > smaller (attached).
> >
> > S.
> >
> > ---
> > Stephen Pascoe  +44 (0)1235 445980
> > British Atmospheric Data Centre
> > Rutherford Appleton Laboratory
> >
> > -----Original Message-----
> > From: Karl Taylor [mailto:taylor13 at llnl.gov] 
> > Sent: 09 December 2009 08:06
> > To: Lawrence, Bryan (STFC,RAL,SSTD)
> > Cc: luca at ucar.edu; Pascoe, Stephen (STFC,RAL,SSTD);
> > go-essp-tech at ucar.edu
> > Subject: CMIP5 data archive size estimate
> >
> > Dear all,
> >
> > I promised to send these spreadsheets to you today, but I don't have
> > time to explain them.  Here are some quick notes:
> >
> > 0.  I've only attached the .xlsx version.  The .xls version is 40
> > megabytes, so I can't send it by email.  I'll try to find another way to
> > get it to you tomorrow.
> >
> > 1.  Estimates are based on input from modeling groups collected more
> > than a year ago.
> >
> > 2.  I think only about 2/3 of the models are included in the estimate.
> >
> > 3.  The estimate assumes that all experiments designated by a group as
> > at least 66% likely to be performed will actually be run.  
> > (This perhaps approximately offsets the fact that not all groups have
> > provided input yet.)
> >
> > 4.  You can't rely on a single piece of information in the spreadsheet
> > (it's all completely unofficial), but the estimate of archive size under
> > the stated assumptions is probably correct.
> >
> > 5.  There are no estimates of the number of "atomic datasets" or the
> > number of files per atomic dataset.
> >
> > 6.  I think in at least one place, "gigabytes" should have read "bytes", but
> > that should be obvious.
> >
> > 7.  There are estimates for size at the end of 2010 and at the end of
> > 2014, but I didn't ask groups for their timelines, so these estimates
> > are identical.
> >
> > 8.  There are estimates for "requested output" volume and "replicated" 
> > output volume.  
> >
> > 9.  The tables of variables that are referred to in the spreadsheets can
> > be found at: 
> > http://cmip-pcmdi.llnl.gov/cmip5/data_description.html?submenuheader=1
> >
> > 10.  Bottom line:  about 1 petabyte of data will be replicated of the
> > almost 2 petabytes requested.
> >
> > Best regards,
> > Karl
> >
> >   
> 



-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence

