Working with NetCDF files in R

Introduction

Network Common Data Form (NetCDF) is a widely used format for storing array–based data as variables. NetCDF are developed and maintained by Unidata was originally developed for storing and distributiing climate data , such as those generatd by climate simulation or reanalysis models. It has also been adopted in other fields, particularly in oceanography, where large mutidimensional arrays of data are generatted from satellite observation systems. The NetCDF format is a platform-independent because can be transeerred among servers and coputers that are running different operating systems, without a need to convert the file that fit a particular sytem. The NetCDF is also self-describing—contains metadata that describe the what is contained in the file, like the dimensions of longitude and latitude of the grid, the names and units of variables in the dataset, and attributes that describe thingks like themissing values codesr, or offsets and scale facators that may have been used to compress the data.

Types of netCDF file

There are two version of netCDF. The first is netCDF version 3 (netCDF3), which is widely used, but has some size and performance limitations. The other is netCDF version 4 (netCDF4), which supports larger dataset and includes addtional capabilities like to compress the file and reduce the file size.

Rading NetCDF with ncdf4 package

The ncdf4 package allows reading, writing, and creation of netCDF files, either netCDF version 3 or (optionally) netCDF version 4. Let’s first load the packages we need. Note that the ncddf4 package must already be installed on your machine.

require(ncdf4)
require(magrittr)
require(tidyverse)

Once you have loaded the package, then use a function nc_open() to read an existing netCDF file

nc.file = nc_open("../data/wio_ssh_july_2015.nc")

Once we have opened the file, we can print it to have a glimpse of some basic information about the file

nc.file
File ../data/wio_ssh_july_2015.nc (NC_FORMAT_CLASSIC):

     2 variables (excluding dimension variables):
        int crs[]   
            comment: This is a container variable that describes the grid_mapping used by the data in this file. This variable does not contain any data; only information about the geographic coordinate system.
            grid_mapping_name: latitude_longitude
            inverse_flattening: 298.257
            semi_major_axis: 6378136.3
            _CoordinateTransformType: Projection
            _CoordinateAxisTypes: GeoX GeoY
        int adt[lon,lat,time]   
            _CoordinateAxes: lon lat time lat lon 
            _FillValue: -2147483647
            coordinates: lon lat
            grid_mapping: crs
            long_name: Absolute Dynamic Topography
            scale_factor: 1e-04
            standard_name: sea_surface_height_above_geoid
            units: m

     3 dimensions:
        time  Size:31
            axis: T
            long_name: Time
            standard_name: time
            units: days since 1950-01-01 00:00:00
            _CoordinateAxisType: Time
        lat  Size:420
            axis: Y
            bounds: lat_bnds
            long_name: Latitude
            standard_name: latitude
            units: degrees_north
            _CoordinateAxisType: Lat
        lon  Size:401
            axis: X
            bounds: lon_bnds
            long_name: Longitude
            standard_name: longitude
            units: degrees_east
            _CoordinateAxisType: Lon

    13 global attributes:
        title: NRT merged all satellites Global Ocean Gridded Absolute Dynamic Topography L4 product
        institution: CNES, CLS
        references: http://www.aviso.altimetry.fr
        source: Altimetry measurements
        Conventions: CF-1.0
        history: Data extracted from dataset http://opendap.aviso.altimetry.fr/thredds/dodsC/dataset-duacs-nrt-over30d-global-allsat-madt-h
        time_min: 23922
        time_max: 23952
        julian_day_unit: days since 1950-01-01 00:00:00
        latitude_min: -74.875
        latitude_max: 29.875
        longitude_min: 19.875
        longitude_max: 119.875

By simply printing this nc.file file which is the returned object of nc_open() gave us a lot of information that is ovewhelming to grasp. This information describe about the file and is very important to understand the basic information embedded in it as we will require them later for extraction of the varibles. In summary, the metadata tells us about the;

  1. filename, which is a character string holding the name of the file; for our case is File ./data/ml_depth__0.2deg_mean_grid_all_global_locean.nc
  2. ndims, which is an integer holding the number of dimensions in the file; in our file there are three dimension of time',latandlon`
  3. nvars, which is an integer holding the number of the variables in the file that are NOT coordinate variables (aka dimensional variables); the printed file showed there are seven variables krig_std_dev, mask, med_dev, mld, mld_raw, mld_smth, n_profiles
  4. natts, which is an integer holding the number of global attributes;
  5. unlimdimid, which is an integer holding the dimension id of the unlimited dimension, or -1 if there is none; 6 dim, which is a list of objects of class ncdim4;
  6. var, which is a list of objects of class ncvar4;
  7. writable, which is TRUE or FALSE, depending on whether the file was opened with write=TRUE or write=FALSE.

Extracting the dimensions

Based on the metadata, I can customize it to print only the information I need. I used the for loop to customize only the variable information I need shown. Here is the code that will print the dimensions in the files together with additional information

for (j in 1:nc.file$ndims){
  
  a = nc.file$dim[[j]]
  print(paste(" Here is information on variable number",j))
  print(paste(" Name: ",a$name))
    print(paste(" Units:",a$units))
    print(paste(" Length:",a$len))
    print(paste(" Values:"))
    print(a$vals)
  
}
[1] " Here is information on variable number 1"
[1] " Name:  time"
[1] " Units: days since 1950-01-01 00:00:00"
[1] " Length: 31"
[1] " Values:"
 [1] 23922 23923 23924 23925 23926 23927 23928 23929 23930 23931 23932 23933
[13] 23934 23935 23936 23937 23938 23939 23940 23941 23942 23943 23944 23945
[25] 23946 23947 23948 23949 23950 23951 23952
[1] " Here is information on variable number 2"
[1] " Name:  lat"
[1] " Units: degrees_north"
[1] " Length: 420"
[1] " Values:"
  [1] -74.875 -74.625 -74.375 -74.125 -73.875 -73.625 -73.375 -73.125 -72.875
 [10] -72.625 -72.375 -72.125 -71.875 -71.625 -71.375 -71.125 -70.875 -70.625
 [19] -70.375 -70.125 -69.875 -69.625 -69.375 -69.125 -68.875 -68.625 -68.375
 [28] -68.125 -67.875 -67.625 -67.375 -67.125 -66.875 -66.625 -66.375 -66.125
 [37] -65.875 -65.625 -65.375 -65.125 -64.875 -64.625 -64.375 -64.125 -63.875
 [46] -63.625 -63.375 -63.125 -62.875 -62.625 -62.375 -62.125 -61.875 -61.625
 [55] -61.375 -61.125 -60.875 -60.625 -60.375 -60.125 -59.875 -59.625 -59.375
 [64] -59.125 -58.875 -58.625 -58.375 -58.125 -57.875 -57.625 -57.375 -57.125
 [73] -56.875 -56.625 -56.375 -56.125 -55.875 -55.625 -55.375 -55.125 -54.875
 [82] -54.625 -54.375 -54.125 -53.875 -53.625 -53.375 -53.125 -52.875 -52.625
 [91] -52.375 -52.125 -51.875 -51.625 -51.375 -51.125 -50.875 -50.625 -50.375
[100] -50.125 -49.875 -49.625 -49.375 -49.125 -48.875 -48.625 -48.375 -48.125
[109] -47.875 -47.625 -47.375 -47.125 -46.875 -46.625 -46.375 -46.125 -45.875
[118] -45.625 -45.375 -45.125 -44.875 -44.625 -44.375 -44.125 -43.875 -43.625
[127] -43.375 -43.125 -42.875 -42.625 -42.375 -42.125 -41.875 -41.625 -41.375
[136] -41.125 -40.875 -40.625 -40.375 -40.125 -39.875 -39.625 -39.375 -39.125
[145] -38.875 -38.625 -38.375 -38.125 -37.875 -37.625 -37.375 -37.125 -36.875
[154] -36.625 -36.375 -36.125 -35.875 -35.625 -35.375 -35.125 -34.875 -34.625
[163] -34.375 -34.125 -33.875 -33.625 -33.375 -33.125 -32.875 -32.625 -32.375
[172] -32.125 -31.875 -31.625 -31.375 -31.125 -30.875 -30.625 -30.375 -30.125
[181] -29.875 -29.625 -29.375 -29.125 -28.875 -28.625 -28.375 -28.125 -27.875
[190] -27.625 -27.375 -27.125 -26.875 -26.625 -26.375 -26.125 -25.875 -25.625
[199] -25.375 -25.125 -24.875 -24.625 -24.375 -24.125 -23.875 -23.625 -23.375
[208] -23.125 -22.875 -22.625 -22.375 -22.125 -21.875 -21.625 -21.375 -21.125
[217] -20.875 -20.625 -20.375 -20.125 -19.875 -19.625 -19.375 -19.125 -18.875
[226] -18.625 -18.375 -18.125 -17.875 -17.625 -17.375 -17.125 -16.875 -16.625
[235] -16.375 -16.125 -15.875 -15.625 -15.375 -15.125 -14.875 -14.625 -14.375
[244] -14.125 -13.875 -13.625 -13.375 -13.125 -12.875 -12.625 -12.375 -12.125
[253] -11.875 -11.625 -11.375 -11.125 -10.875 -10.625 -10.375 -10.125  -9.875
[262]  -9.625  -9.375  -9.125  -8.875  -8.625  -8.375  -8.125  -7.875  -7.625
[271]  -7.375  -7.125  -6.875  -6.625  -6.375  -6.125  -5.875  -5.625  -5.375
[280]  -5.125  -4.875  -4.625  -4.375  -4.125  -3.875  -3.625  -3.375  -3.125
[289]  -2.875  -2.625  -2.375  -2.125  -1.875  -1.625  -1.375  -1.125  -0.875
[298]  -0.625  -0.375  -0.125   0.125   0.375   0.625   0.875   1.125   1.375
[307]   1.625   1.875   2.125   2.375   2.625   2.875   3.125   3.375   3.625
[316]   3.875   4.125   4.375   4.625   4.875   5.125   5.375   5.625   5.875
[325]   6.125   6.375   6.625   6.875   7.125   7.375   7.625   7.875   8.125
[334]   8.375   8.625   8.875   9.125   9.375   9.625   9.875  10.125  10.375
[343]  10.625  10.875  11.125  11.375  11.625  11.875  12.125  12.375  12.625
[352]  12.875  13.125  13.375  13.625  13.875  14.125  14.375  14.625  14.875
[361]  15.125  15.375  15.625  15.875  16.125  16.375  16.625  16.875  17.125
[370]  17.375  17.625  17.875  18.125  18.375  18.625  18.875  19.125  19.375
[379]  19.625  19.875  20.125  20.375  20.625  20.875  21.125  21.375  21.625
[388]  21.875  22.125  22.375  22.625  22.875  23.125  23.375  23.625  23.875
[397]  24.125  24.375  24.625  24.875  25.125  25.375  25.625  25.875  26.125
[406]  26.375  26.625  26.875  27.125  27.375  27.625  27.875  28.125  28.375
[415]  28.625  28.875  29.125  29.375  29.625  29.875
[1] " Here is information on variable number 3"
[1] " Name:  lon"
[1] " Units: degrees_east"
[1] " Length: 401"
[1] " Values:"
  [1]  19.875  20.125  20.375  20.625  20.875  21.125  21.375  21.625  21.875
 [10]  22.125  22.375  22.625  22.875  23.125  23.375  23.625  23.875  24.125
 [19]  24.375  24.625  24.875  25.125  25.375  25.625  25.875  26.125  26.375
 [28]  26.625  26.875  27.125  27.375  27.625  27.875  28.125  28.375  28.625
 [37]  28.875  29.125  29.375  29.625  29.875  30.125  30.375  30.625  30.875
 [46]  31.125  31.375  31.625  31.875  32.125  32.375  32.625  32.875  33.125
 [55]  33.375  33.625  33.875  34.125  34.375  34.625  34.875  35.125  35.375
 [64]  35.625  35.875  36.125  36.375  36.625  36.875  37.125  37.375  37.625
 [73]  37.875  38.125  38.375  38.625  38.875  39.125  39.375  39.625  39.875
 [82]  40.125  40.375  40.625  40.875  41.125  41.375  41.625  41.875  42.125
 [91]  42.375  42.625  42.875  43.125  43.375  43.625  43.875  44.125  44.375
[100]  44.625  44.875  45.125  45.375  45.625  45.875  46.125  46.375  46.625
[109]  46.875  47.125  47.375  47.625  47.875  48.125  48.375  48.625  48.875
[118]  49.125  49.375  49.625  49.875  50.125  50.375  50.625  50.875  51.125
[127]  51.375  51.625  51.875  52.125  52.375  52.625  52.875  53.125  53.375
[136]  53.625  53.875  54.125  54.375  54.625  54.875  55.125  55.375  55.625
[145]  55.875  56.125  56.375  56.625  56.875  57.125  57.375  57.625  57.875
[154]  58.125  58.375  58.625  58.875  59.125  59.375  59.625  59.875  60.125
[163]  60.375  60.625  60.875  61.125  61.375  61.625  61.875  62.125  62.375
[172]  62.625  62.875  63.125  63.375  63.625  63.875  64.125  64.375  64.625
[181]  64.875  65.125  65.375  65.625  65.875  66.125  66.375  66.625  66.875
[190]  67.125  67.375  67.625  67.875  68.125  68.375  68.625  68.875  69.125
[199]  69.375  69.625  69.875  70.125  70.375  70.625  70.875  71.125  71.375
[208]  71.625  71.875  72.125  72.375  72.625  72.875  73.125  73.375  73.625
[217]  73.875  74.125  74.375  74.625  74.875  75.125  75.375  75.625  75.875
[226]  76.125  76.375  76.625  76.875  77.125  77.375  77.625  77.875  78.125
[235]  78.375  78.625  78.875  79.125  79.375  79.625  79.875  80.125  80.375
[244]  80.625  80.875  81.125  81.375  81.625  81.875  82.125  82.375  82.625
[253]  82.875  83.125  83.375  83.625  83.875  84.125  84.375  84.625  84.875
[262]  85.125  85.375  85.625  85.875  86.125  86.375  86.625  86.875  87.125
[271]  87.375  87.625  87.875  88.125  88.375  88.625  88.875  89.125  89.375
[280]  89.625  89.875  90.125  90.375  90.625  90.875  91.125  91.375  91.625
[289]  91.875  92.125  92.375  92.625  92.875  93.125  93.375  93.625  93.875
[298]  94.125  94.375  94.625  94.875  95.125  95.375  95.625  95.875  96.125
[307]  96.375  96.625  96.875  97.125  97.375  97.625  97.875  98.125  98.375
[316]  98.625  98.875  99.125  99.375  99.625  99.875 100.125 100.375 100.625
[325] 100.875 101.125 101.375 101.625 101.875 102.125 102.375 102.625 102.875
[334] 103.125 103.375 103.625 103.875 104.125 104.375 104.625 104.875 105.125
[343] 105.375 105.625 105.875 106.125 106.375 106.625 106.875 107.125 107.375
[352] 107.625 107.875 108.125 108.375 108.625 108.875 109.125 109.375 109.625
[361] 109.875 110.125 110.375 110.625 110.875 111.125 111.375 111.625 111.875
[370] 112.125 112.375 112.625 112.875 113.125 113.375 113.625 113.875 114.125
[379] 114.375 114.625 114.875 115.125 115.375 115.625 115.875 116.125 116.375
[388] 116.625 116.875 117.125 117.375 117.625 117.875 118.125 118.375 118.625
[397] 118.875 119.125 119.375 119.625 119.875

The output now give us the basic information of the variable name, the units, the length of the variable and the values. This is very simplified version and make more sense to grasp easily the information contained in the file. we can now extract the dimensions.

Extracting time

The dimensions are extracted with the ncvar_get() function and parse along the arguments required. But before we extract let’s run the loop to have a glimpse again of the structure of the time dimension

[1] " Here is information on variable number 1"
[1] " Name:  time"
[1] " Units: days since 1950-01-01 00:00:00"
[1] " Length: 31"
[1] " Values:"
 [1] 23922 23923 23924 23925 23926 23927 23928 23929 23930 23931 23932 23933
[13] 23934 23935 23936 23937 23938 23939 23940 23941 23942 23943 23944 23945
[25] 23946 23947 23948 23949 23950 23951 23952
time = ncvar_get(nc = nc.file, varid = "time")

We notice that time is in numerical number—julian days. Using the original time 1950-01-01 00:00:00, we can transform julian date into gregorian calender.Since the time we are given is in julian, We need to convert the original time to julian to ensure its in the same format with time. Then to get the real time, we add up the original time to time and convert them from julian to gregorian. We used JD function from insol package for conversion of date between Julian and Gregorian [@insol]

to = insol::JD(lubridate::ymd_hms("1950-01-01 00:00:00", tz = ""))
jd = time + to
date = insol::JD(x = jd, inverse = TRUE) 

Extracting Latitude

Similary, we can have a glimpse of latidude with the loop code as;

[1] " Here is information on variable number 2"
[1] " Name:  lat"
[1] " Units: degrees_north"
[1] " Length: 420"

We notice the latitude values span from 88°S to 89.5°N.We extract with the ncget_var() function;

lat = ncvar_get(nc = nc.file, varid = "lat")

Extracting Longitude

[1] " Here is information on variable number 3"
[1] " Name:  lon"
[1] " Units: degrees_east"
[1] " Length: 401"

Unlike the convention range of longitude which span from 180°W to 180°, here we see that longitud range from 0 to 360. Let’s extract the longitude values with the ncvar_get() function.

lon = ncvar_get(nc = nc.file, varid = "lon")

Extracting the variable

As we have seen, obtaining the information from the metadata printout more information that we need, we can customize how we want the information to be printed with a loop. For instance the code of lines in the chunk below show how to printout information about the variables embbed in the file. Here is the code

for( i in 1:nc.file$nvars ) {
  
    v <- nc.file$var[[i]]
    
    print(paste(" Here is information on variable number",i))
    print(paste(" Name: ",v$name))
    print(paste(" Units:",v$units))
    print(paste(" Missing value:",v$missval))
    print(paste("# dimensions :",v$ndims))
    print(paste(" Variable size:",v$varsize))
    
    }
[1] " Here is information on variable number 1"
[1] " Name:  crs"
[1] " Units: "
[1] " Missing value: NA"
[1] "# dimensions : 0"
[1] " Variable size: "
[1] " Here is information on variable number 2"
[1] " Name:  adt"
[1] " Units: m"
[1] " Missing value: -2147483647"
[1] "# dimensions : 3"
[1] " Variable size: 401" " Variable size: 420" " Variable size: 31" 

We notice also there are two variables. here we are only interested with the variable 2, which is the sea surface height in meters. Let’s extract the variable

adt.array = ncvar_get(nc = nc.file, varid = "adt")

Once we have extracted the time, lat ,lon and mld, we need to verify their dimension. The dimension of atomic vector lon,lat and time must correspond to the dimension of the mld array. We can check the length of the lon, lat, time with the length()

length(lon);length(lat);length(time)
[1] 401
[1] 420
[1] 31

The output display that there are 180 longitude, 90 latitude and 12 months. To check for dimension of the array we use the dim() function instead of the length()

dim(adt.array)
[1] 401 420  31

That’s perfect, the length of the dimensions matches the dimension of the mld array

Replace FillValue with NA

In a netCDF file, values of a variable that are either missing or simply not available (i.e. ocean grid points in a terrestrial data set) are flagged using specific “fill values” (_FillValue) or missing values (missing_value), the values of which are set as attributes of a variable. In R, such unavailable data are indicated using the “NA” value. We can explore the value of the mld with the hist()

adt.array %>% hist()

The histogram display only two bars, which indicated the skewness of the data. Looking back on the metadata, we observed that there the value 1e+09 was used to mask the land. Therefore, in R, we need to set all pixel with the 1e+09 values to NA. The following code fragment illustrates how to replace the netCDF variable’s fill values with R NA’s .

adt.array[adt.array == v$missval] = NA

We can explore again the value of the mld with the hist() functions and the values now look good

adt.array %>% hist()

Obtain a slice

NetCDF variables are read and written as one-dimensional vectors (e.g. longitudes), two-dimensional arrays or matrices (raster “slices”), or multi-dimensional arrays (raster “bricks”). In such data structures, the coordinate values for each grid point are implicit, inferred from the marginal values of, for example, longitude, latitude and time. In contrast, in R, the principal data structure for a variable is the data.frame.For instance, the adt.array data file is the multidimensional array consist of longitude, latitude and 12 columns of long-term means for each month, with the full data set thus consisting of 5221020 rows (401 by length(lat)) and 12 columns.

adt.array %>% dim()
[1] 401 420  31
180*90
[1] 16200

In the kinds of data sets usually stored as netCDF files, each row in the data frame will contain the data for an individual grid point, with each column representing a particular variable, including explicit values for longitude and latitude (and perhaps time). This particular structure of this data set can be illustrated by selecting a single slice from the adt.array “brick”. Therefore, you need to convert extract matrices from array by indexing. For instance, we can extract the january matrix by typing

first = adt.array[,,1]

once we have the mld for january, we can map the spatial pattern of the mixed layer depth with the imagep() function from oce package. Note that the values were normalized with the `inverse_hyperbolic(), function.

oce::imagep(x = lon,y = lat, 
            z = first, 
            xlim = c(37,52), ylim = c(-17,2), 
            zlim = c(0.55,1.3), zclip = TRUE,
            filledContour = FALSE,
            col = oce::oceColors9A(120), 
            # at = seq(3.5,6.5,.5), 
            # labels = seq(20,500, length.out = 7),  
            xlab = "Longitude", ylab = "Latitude", zlab = "Sea Surface Height Anomaly (m)", zlabPosition = "side")

## expand the longitude and latitude
lon.lat = expand.grid(lon,lat) 

## convert the january matrix to a vector
first.vector = first %>% as.vector() 

## combine the expanded lon.lat with vectorized matrix of january
first.df = data.frame(lon.lat, first.vector) %>% 
  tibble::as_tibble() %>%
  dplyr::rename(lon = 1, lat = 2, ssha= 3)
wio = spData::world %>% sf::st_crop(xmin =  20, ymin = -80,xmax = 130, ymax =30)
ssha.eacc = first.df %>% filter(lon > 30 & lon < 53 & lat > -19 & lat < 3)

ggplot()+
  geom_sf(data =wio) +
  metR::geom_contour_fill(data = ssha.eacc, 
                          aes(x = lon, y = lat, z =ssha),
                          bins = 28, na.fill = TRUE) +
  scale_fill_gradientn(colors = odv_color)+
  geom_sf(data =wio) +
  coord_sf(xlim = c(37,51), ylim = c(-17,2), datum = sf::st_crs(4326)) +
  metR::scale_x_latitude(ticks = 5) + 
  metR::scale_x_longitude(ticks = 4)+
  labs(x = NULL, y = NULL, subtitle = paste("Sea surface height of ", date[1]))+
  theme_bw() %+%
  theme(axis.text = element_text(size = 11)) +
  guides(fill = guide_colorbar(barheight = 15, 
                               barwidth = .85, 
                               raster = FALSE, 
                               title = "Sea surface height (m)",
                               title.position = "right", 
                               title.hjust = .5, 
                               title.theme = element_text(angle = 90)))

## expand the longitude and latitude
lon.lat = expand.grid(lon,lat) 

## preallocate
ssha = list()

for (j in 1:length(date)){
  ## chop the blick
  data = adt.array[,,j]
  
  ## create a data frame
  ssha[[j]] = data.frame(lon.lat, first.vector) %>% 
    tibble::as_tibble() %>%
    dplyr::rename(lon = 1, lat = 2, ssha= 3) %>% 
    dplyr::mutate(date = date[j])
  
}

ssha = ssha %>% 
  dplyr::bind_rows() %>% 
  dplyr::filter(lon > 30 & lon < 53 & lat > -19 & lat < 3) %>% 
  dplyr::mutate(day = lubridate::day(date),
                jd = lubridate::yday(date) %>% as.integer(),
                week = lubridate::week(date))