Importing data in R

You can lean R with the dataset it comes with when you install it in your machine. But sometimes you want to use the real data you or someone gathered already. One of critical steps for data processing is to import data with special format into R workspace.Data import refers to read data from the working directory into the workspace. In this chapter you will learn how to import common files into R. We will only focus on two common types of tabular data storage format—The comma-seprated .csv and excell spreadsheet (.xlsx). In later chapter we will explain how to read other types of data into R.

Comma-Separated (.csv)

The most commonly format that R like is the comma-separated files. Although Base R provides various functions like read.table(), read.csv(), read.table() and read.csv2() to import data from the local directories into R workspace, for this book we use an read_csv() function from readr. Before we import the data, we need to load the packages that we will use their functions in this chapeter

require(dplyr)
require(readr)
require(lubridate)
require(readxl)
require(haven)
require(ggplot2)
require(kableExtra)
require(magrittr)
Consider a tabular data stored in my working directory in the .csv format in figure 1.
A screenshot of the sample dataset

Figure 1: A screenshot of the sample dataset

We can import it with the read_csv() functions as:

algoa.ctd = read_csv("algoa_ctd.csv") 

-- Column specification --------------------------------------------------------
cols(
  station = col_character(),
  time = col_datetime(format = ""),
  lon = col_double(),
  lat = col_double(),
  pressure = col_double(),
  temperature = col_double(),
  salinity = col_double(),
  oxygen = col_double(),
  fluorescence = col_double(),
  spar = col_double(),
  par = col_double(),
  density = col_double()
)

When read_csv() has imported the data into R workspace, it prints out the name and type of of data for each variable.

By simply glimpse the dataset, we see the format of the data is as expected. It has six variables(columns) and 177 observations (rows) similar to figure 1. Table 1 show sample of imported dataset.

Table 1: CTD profiles
Station Time Lon Lat Pressure Temperature Salinity Oxygen Fluorescence
st1 2004-08-18 40.61 -10.54 5 25.17 33.92 3.93 0.56
st1 2004-08-18 40.61 -10.54 10 25.13 34.86 4.49 0.60
st1 2004-08-18 40.61 -10.54 15 25.11 34.86 4.50 0.65
st1 2004-08-18 40.61 -10.54 20 25.04 34.86 4.51 0.68
st1 2004-08-18 40.61 -10.54 25 24.95 34.86 4.51 0.76
st1 2004-08-18 40.61 -10.54 30 24.91 34.86 4.50 0.73
st1 2004-08-18 40.61 -10.54 35 24.88 34.87 4.49 0.74
st1 2004-08-18 40.61 -10.54 40 24.85 34.87 4.48 0.69
st1 2004-08-18 40.61 -10.54 45 24.80 34.88 4.46 0.70
st1 2004-08-18 40.61 -10.54 50 24.61 34.89 4.44 0.75

Microsoft Excel(.xlsx)

Commonly our data is stored as a MS Excel file. we can import the file with read_xlsx() function of readxl package. The readxl package provides a function read_exel() that allows us to specify which sheet within the Excel file to read and what character specifies missing data (it assumes a blank cell is missing data if you don’t specifying anything). The function automatically convert the worksheet into a .csv file and read it. Let’s us import the the data in first sheet of the primary_productivity.xlsx. The dataset contain primary productivity value. We will use this file to illustrate how to import the excel file into R workspace with readxl package [@readxl].

sheet1 = readxl::read_xlsx("primary_productivity.xlsx", sheet = 1)
sheet1 %>% sample_n(5)
FALSE # A tibble: 5 x 7
FALSE   date                 year value              month   day site  variable
FALSE   <dttm>              <dbl> <chr>              <dbl> <dbl> <chr> <chr>   
FALSE 1 2011-07-15 00:00:00  2011 863.95789473684204     7    15 Pemba pp      
FALSE 2 2013-01-15 00:00:00  2013 589.247983870967       1    15 Pemba pp      
FALSE 3 2016-03-15 00:00:00  2016 635.401123046875       3    15 Pemba pp      
FALSE 4 2010-12-15 00:00:00  2010 575.15543619791595    12    15 Pemba pp      
FALSE 5 2003-05-15 00:00:00  2003 720.907401315789       5    15 Pemba pp

By printing the sheet1, we notice that the sheet contains monthly average value of primary productivity from the Pemba channel.

sheet2 = readxl::read_xlsx("primary_productivity.xlsx", sheet = 2)
sheet2 %>% sample_n(5)
FALSE # A tibble: 5 x 7
FALSE   date                 year value              month   day site     variable
FALSE   <dttm>              <dbl> <chr>              <dbl> <dbl> <chr>    <chr>   
FALSE 1 2015-06-15 00:00:00  2015 822.32483057228899     6    15 Zanzibar pp      
FALSE 2 2004-02-15 00:00:00  2004 706.53722734899304     2    15 Zanzibar pp      
FALSE 3 2006-03-15 00:00:00  2006 681.77963709677397     3    15 Zanzibar pp      
FALSE 4 2014-02-15 00:00:00  2014 618.70003939075605     2    15 Zanzibar pp      
FALSE 5 2006-06-15 00:00:00  2006 987.41377314814804     6    15 Zanzibar pp

sheet2 contains monthly average value of primary productivity from the Zanzibar channel.

sheet3 = readxl::read_xlsx("primary_productivity.xlsx", sheet = 3)
sheet3 %>% sample_n(5)
FALSE # A tibble: 5 x 7
FALSE   date                 year value              month   day site  variable
FALSE   <dttm>              <dbl> <chr>              <dbl> <dbl> <chr> <chr>   
FALSE 1 2013-06-15 00:00:00  2013 1278.17880794701       6    15 Mafia pp      
FALSE 2 2010-01-15 00:00:00  2010 1129.8221318493099     1    15 Mafia pp      
FALSE 3 2004-05-15 00:00:00  2004 1216.4666526845599     5    15 Mafia pp      
FALSE 4 2004-06-15 00:00:00  2004 1307.9894453642301     6    15 Mafia pp      
FALSE 5 2012-09-15 00:00:00  2012 1254.1508246527701     9    15 Mafia pp

sheet3 contains monthly average value of primary productivity from the Mafia channel.

We look on the internal structure of the sheet3 file with the glimpse() function. You can interact with the table that show all variables and observations (Table ??)

sheet3%>%glimpse()
FALSE Rows: 192
FALSE Columns: 7
FALSE $ date     <dttm> 2003-01-15, 2003-02-15, 2003-03-15, 2003-04-15, 2003-05-15, ~
FALSE $ year     <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2~
FALSE $ value    <chr> "1311.50104865771", "1211.3158482142801", "1302.45291940789",~
FALSE $ month    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8~
FALSE $ day      <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
FALSE $ site     <chr> "Mafia", "Mafia", "Mafia", "Mafia", "Mafia", "Mafia", "Mafia"~
FALSE $ variable <chr> "pp", "pp", "pp", "pp", "pp", "pp", "pp", "pp", "pp", "pp", "~
sheet2 %>% DT::datatable(rownames = FALSE, caption = "An Interactive table of primary productivity in the Zanzibar channel")

Writing t a File

Sometimes you work in the document and you want to export to a file. readr has write_csv() and write_tsv() functions that allows to export data frames from workspace to working directory

write_csv(x = sheet1, path = "./data/Primary_productivity_Pemba.csv")

@r4d recomment the use of write_excel_csv() function when you want to export a data frame to Excel. readr has other tools that export files to other software like SAS, SPSS and more …

write_excel_csv(x = sheet1, path = "./data/Primary_productivity_Pemba.csv")

Basic Data Manipulation

In this section, we brifely introduce some basic data handling and manipulation techniques, which are mostly associated with data frame. A data frame is a a tabular shaped contains columns and rows of equal length. In general a data frame structure with rows representing observations or measurements and with columns containing variables.

Explore the Data Frame

We can visualize the table by simply run the name of the data flights

octopus = read_csv("octopus_data.csv")

we can use class() to check if the data is data frame

octopus %>% class()
FALSE [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

We can use names() to extract the variable names

octopus %>% names()
FALSE  [1] "date"    "village" "port"    "ground"  "sex"     "dml"     "tl"     
FALSE  [8] "weight"  "lat"     "lon"

We can explore the internal structure of flights object with a dplyr()’s function glimpse()

octopus %>% glimpse()
FALSE Rows: 1,079
FALSE Columns: 10
FALSE $ date    <date> 2018-02-12, 2018-01-30, 2018-02-01, 2018-01-21, 2018-03-03, 2~
FALSE $ village <chr> "Somanga", "Bwejuu", "Somanga", "Somanga", "Somanga", "Somanga~
FALSE $ port    <chr> "Mbuyuni", "Kusini", "Mbuyuni", "Mbuyuni", "Mbuyuni", "Mbuyuni~
FALSE $ ground  <chr> "CHAMBA CHA MACHANGE", "NYAMALILE", "BANIANI", "CHAMBA CHA SEL~
FALSE $ sex     <chr> "F", "M", "M", "M", "M", "F", "M", "M", "M", "F", "F", "F", "M~
FALSE $ dml     <dbl> 14.0, 14.5, 17.0, 20.0, 12.0, 16.0, 15.0, 17.0, 12.0, 12.0, 9.~
FALSE $ tl      <dbl> 110.0, 115.0, 115.0, 130.0, 68.0, 90.0, 96.0, 110.0, 79.0, 84.~
FALSE $ weight  <dbl> 1.385, 1.750, 1.000, 2.601, 0.670, 0.870, 1.020, 1.990, 0.730,~
FALSE $ lat     <dbl> -8.397838, -7.915809, -8.392644, -8.391614, -8.391146, -8.3881~
FALSE $ lon     <dbl> 39.28079, 39.65424, 39.28153, 39.28089, 39.28251, 39.28196, 39~

We can check how rows (observations/measurements) and columns (variables/fields) are in the data

octopus %>% dim()
FALSE [1] 1079   10

The number of rows (observation) can be obtained using nrow() function

octopus %>% nrow()
FALSE [1] 1079

The number of columns (variables) can be obtained using ncol() function

octopus %>% ncol()
FALSE [1] 10

The length of the data frame is given by

octopus %>% length()
FALSE [1] 10

Count the number of sample at each sex of octopus

octopus %$% table(sex) 
FALSE sex
FALSE   F   M 
FALSE 581 498

Count the number and compute the proportion of sample at each sex of octopus

octopus %$% table(sex) %>% prop.table() %>% round(digits = 2)
FALSE sex
FALSE    F    M 
FALSE 0.54 0.46

simmple summary statistics

The most helpful function for for summarizing rows and columns is summary(), which gives a collection of basim cummary statistics. The first method is to calculate some basic summary statistics (minimum, 25th, 50th, 75th percentiles, maximum and mean) of each column. If a column is categorical, the summary function will return the number of observations in each category.

octopus %>% 
  summary()
FALSE       date              village              port              ground         
FALSE  Min.   :2017-12-18   Length:1079        Length:1079        Length:1079       
FALSE  1st Qu.:2018-01-14   Class :character   Class :character   Class :character  
FALSE  Median :2018-01-20   Mode  :character   Mode  :character   Mode  :character  
FALSE  Mean   :2018-01-26                                                           
FALSE  3rd Qu.:2018-02-15                                                           
FALSE  Max.   :2018-03-12                                                           
FALSE      sex                 dml             tl             weight     
FALSE  Length:1079        Min.   : 6.0   Min.   : 11.00   Min.   :0.055  
FALSE  Class :character   1st Qu.:10.0   1st Qu.: 68.00   1st Qu.:0.600  
FALSE  Mode  :character   Median :12.0   Median : 82.00   Median :0.915  
FALSE                     Mean   :12.8   Mean   : 86.01   Mean   :1.232  
FALSE                     3rd Qu.:15.0   3rd Qu.:100.00   3rd Qu.:1.577  
FALSE                     Max.   :24.0   Max.   :180.00   Max.   :5.210  
FALSE       lat              lon       
FALSE  Min.   :-8.904   Min.   : 0.00  
FALSE  1st Qu.:-8.523   1st Qu.:39.28  
FALSE  Median :-8.392   Median :39.50  
FALSE  Mean   :-8.069   Mean   :38.69  
FALSE  3rd Qu.:-7.973   3rd Qu.:39.67  
FALSE  Max.   : 0.000   Max.   :39.75

You noticed that the summary() function provide the common metric for central tendency and measure of dispersion. We will look at them later. The next post takes us to our favorite package dplyr.