[ncl-talk] netcdf file size question

Dave Allured - NOAA Affiliate dave.allured at noaa.gov
Tue Jun 26 13:53:48 MDT 2018


Hauss,

Thanks to Kevin for a thorough description of scale/offset packing and its
issues.  NCL can also easily create compressed NetCDF-4 files from scratch.
I recommend this more modern approach over scale/offset packing because it
is simpler for data users and avoids possible numeric problems.

NetCDF-4 compression requires a few extra set-up steps when you first
create the file in NCL.  Set the file format to "NetCDF4Classic", and start
with a low compression level (1-3).  For large arrays, set moderate chunk
sizes in the range of 10,000 to 500,000 bytes per chunk.  For low- and
medium-resolution gridded data, one grid per chunk is a common scheme.
More details and examples of writing NetCDF-4 are found here:

   https://www.ncl.ucar.edu/Applications/write_netcdf.shtml
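
As a quick illustration, here is a minimal sketch of those set-up steps.
The variable name "t2m", the dimension names, and the sizes nlat/nlon are
placeholders, not anything from your particular file:

   setfileoption("nc", "Format", "NetCDF4Classic")   ; use the netCDF-4 classic model
   setfileoption("nc", "CompressionLevel", 1)        ; low deflate level; 1-3 is usually enough

   fout = addfile("compressed_out.nc", "c")          ; create the new file

   filedimdef(fout, (/"time","lat","lon"/), (/-1, nlat, nlon/), (/True, False, False/))
   filevardef(fout, "t2m", "float", (/"time","lat","lon"/))
   filevarchunkdef(fout, "t2m", (/1, nlat, nlon/))   ; one grid (time step) per chunk

   fout->t2m = t2m                                   ; write the data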

NetCDF-4 compression works well on its own, and it also responds to a
reduced number of decimal places in the input data, resulting in slightly
better compression.
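
If you want to try that, one hedged sketch (decimalPlaces is in NCL's
contributed.ncl library; the variable name "x" and the choice of 2 decimal
places are only examples) is:

   load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/contributed.ncl"
   x_rounded = decimalPlaces(x, 2, True)   ; keep about 2 decimal places, returned as float
   ; then write x_rounded to the compressed file in place of x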

--Dave


On Tue, Jun 26, 2018 at 12:53 PM, Kevin Hallock <hallock at ucar.edu> wrote:

> Hi Hauss,
>
> That is correct: NCL uses a single data type for all elements of an
> array, without considering whether individual values in the array actually
> require that level of precision. If an array contained a mix of data types,
> then in order to access a specific index it would be necessary to determine
> the size of every element before it; in a single-data-type array (let’s say
> “float”), the memory address of a particular index is easily determined as
> an offset from the beginning of the array equal to “sizeof(float) * index”,
> where the size of a float variable is 4 bytes. Just determining the memory
> address of an index in a mixed-type array is difficult enough, so trying to
> perform an actual computation across a multi-dimensional array of mixed
> types would likely be very slow compared to a homogeneous array.
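> For example, with 4-byte floats, element 2500 of the array begins at byte
> offset 4 * 2500 = 10000 from the start of the array, with no need to
> examine any of the preceding elements.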
>
>
> If you’re certain that losing several decimal places of precision is
> alright, then you could try using NCL’s pack_values
> <http://www.ncl.ucar.edu/Document/Functions/Contributed/pack_values.shtml> function.
> pack_values can “pack” a float (4 bytes per value) or double (8 bytes per
> value) type array into either “short” (2 bytes) or “byte” (1 byte) arrays,
> using a multiplier and an offset value to “unpack” an approximation of the
> original float/double data.
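> In generic scale/offset terms (the exact constants pack_values picks are
> an implementation detail), the idea is roughly:
> packed = round((x - add_offset) / scale_factor)
> unpacked = (packed * scale_factor) + add_offset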
>
> Please note that “packing” data into a smaller data type is a form of
> “lossy” compression, meaning it may not be possible to recover the exact
> original data from the compressed data.
>
> If you have a float array “a_float” that you want to compress by a factor
> of 2, you could pack_values() it into a short array:
> a_short = pack_values(a_float, "short", False)
> a_unpacked = short2flt(a_short) ; This is essentially the same as
> "(a_short * a_short@scale_factor) + a_short@add_offset"
>
> You will likely want to compare your original array with the new
> packed-then-unpacked array to evaluate whether the lost precision is
> acceptable for your use case.
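> One simple check, reusing the variable names from the example above, is
> the maximum absolute difference:
> max_err = max(abs(a_float - a_unpacked))
> print("maximum packing error = " + max_err)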
>
> It is also possible to pack values into a “byte” array (4 bytes to 1 byte
> compression in this case), although the loss of precision will be even more
> apparent:
> a_byte = pack_values(a_float, "byte", False)
> a_unpacked = byte2flt(a_byte)
>
> Alternatively, there is a way to do this outside of NCL for a netcdf file
> that already exists using a software package called NetCDF Operators
> <http://nco.sourceforge.net/>. In particular, the ncpdq
> <http://nco.sourceforge.net/nco.html#ncpdq> operator can be used to pack
> data as follows:
> ncpdq infile.nc outfile.nc
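> By default ncpdq packs floating-point variables into type "short" (see the
> NCO documentation for the packing map and policy options). If you later
> need the values unpacked again, ncpdq can reverse the operation:
> ncpdq -U outfile.nc unpacked.nc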
>
> I hope this helps,
> Kevin
>
> On Jun 25, 2018, at 7:47 PM, Hauss Reinbold <Hauss.Reinbold at dri.edu>
> wrote:
>
> Hi all,
>
> I’m creating a large netcdf dataset via NCL and I was looking to reduce
> the file size by reducing the number of decimal places the float values
> were holding, but it doesn’t look like it worked. In looking into it
> further, it seems like NCL allocates space in the file by data type,
> regardless of what value each individual index of an array might have. Is
> that correct?
>
> I did some looking and couldn’t see a way to reduce file size explicitly
> other than by changing data type, which I don’t think I can do. Is there a
> way to reduce the file size of the netcdf file by limiting the number of
> decimal places? Or is compression or changing the data type my only
> alternative here?
>
> Thanks for any help on this.
>
> Hauss Reinbold
>
>