Data types in R

R is a flexible language that can work with many kinds of data [@bradley]. These include integer, numeric, character, complex, date and logical values. The default type for numbers in R is double-precision numeric. In a nutshell, R sorts all data into five categories, but we deal with only four in this book. Before proceeding, we need to clear the workspace by typing rm(list = ls()) at the prompt in the console.

We can also clear the console by pressing Ctrl+L. Clearing the workspace is always recommended before working on a new R project, to avoid name conflicts with previous projects. We can also close all figures using the graphics.off() function. It is good coding practice for a new R project to start with the code in the chunk below:

rm(list = ls())
graphics.off()
  1. Integers: Integer values do not have decimal places. They are commonly used for counting or indexing.
aa = c(20,68,78,50)

You can check whether the data are integer with is.integer() and convert a numeric value to an integer with as.integer().

is.integer(aa)
FALSE [1] FALSE

You can query the class of an object with the class() function.

class(aa)
FALSE [1] "numeric"

Although the values in aa are whole numbers, is.integer() returns FALSE and class() reports the object as numeric. This is because the default type for numbers in R is numeric. However, you can use the as.integer() function to convert numeric values to integers.

class(as.integer(aa))
FALSE [1] "integer"
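As a small sketch of the point above, integers can also be created directly by adding the L suffix to a number, rather than converting afterwards. The object names here are only illustrative and are not used elsewhere in the chapter.

```r
# Whole numbers typed without a suffix are stored as numeric (double)
counts_numeric = c(20, 68, 78, 50)

# The L suffix asks R to store the values as integers
counts_integer = c(20L, 68L, 78L, 50L)

class(counts_numeric)       # "numeric"
class(counts_integer)       # "integer"
is.integer(counts_integer)  # TRUE

# as.integer() drops the decimal part rather than rounding
truncated = as.integer(c(2.9, -2.9))
truncated                   # 2 -2
```

Note that as.integer() truncates towards zero, so it should not be used as a substitute for rounding.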
  2. Numeric: The numeric class holds the set of real numbers, that is, numbers with decimal places. The numeric class is more general than the integer class and includes the integers.

These could be any number (whole or decimal). You can check whether the data are numeric with is.numeric().

bb = c(12.5, 45.68, 2.65)
class(bb)
FALSE [1] "numeric"
is.numeric(bb)
FALSE [1] TRUE
  3. Strings: In programming terms, text is usually called a string. Strings are often text data such as names.
countries = c("Kenya", "Uganda", "Rwanda", "Tanzania")
class(countries)
FALSE [1] "character"

We can confirm that an object is a string with is.character(), or check its class with class().

  4. Factor: These are strings from a finite set of values. For example, we might wish to store a variable that records the gender of people. You can check whether the data are a factor with is.factor() and use as.factor() to convert a string to a factor.
sex = c("Male", "Female", "Male", "Male", "Female")
sex = as.factor(sex)
class(sex)
FALSE [1] "factor"

Often we need to know the possible groups in factor data. This can be achieved with the levels() function.

levels(sex)
FALSE [1] "Female" "Male"
levels(countries)
FALSE NULL
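The output above shows the levels in alphabetical order, and NULL for a plain character vector. The sketch below, which uses the illustrative name sex_ordered, shows one way to set the order of the levels explicitly with the levels argument of factor().

```r
sex = c("Male", "Female", "Male", "Male", "Female")

# By default the levels are sorted alphabetically
levels(factor(sex))                      # "Female" "Male"

# The levels argument of factor() sets the order explicitly
sex_ordered = factor(sex, levels = c("Male", "Female"))
levels(sex_ordered)                      # "Male" "Female"

# table() counts the observations in each level
table(sex_ordered)

# A plain character vector has no levels, so levels() returns NULL
is.null(levels(sex))                     # TRUE
```

The order of the levels matters later on, because tables and plots present the groups in that order.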

Often we wish to take a continuous numerical vector and transform it into a factor. The function cut() takes a vector of numerical data and creates a factor based on the cut-points you give. Let us make fictional incomes for 508 people with the rnorm() function.

income = rnorm(n = 508, mean = 500, sd = 80)
hist(income, col = "green", main = "", las = 1, xlab = "Individual Income")

Figure 1: Income distribution

#mosaic::plotDist(dist = "norm", mean = 500, sd = 80)

We can now break the distribution into groups and make a simple plot, as shown in Figure 2: about 50 people have an income below 400, about 200 have an income between 400 and 500, and about 250 receive an income above 500.

group = cut(income, breaks = c(300,400,500,800),
            labels = c("Below 400", "400-500", "Above 500"))
is.factor(group)
FALSE [1] TRUE
levels(group)
FALSE [1] "Below 400" "400-500"   "Above 500"
barplot(table(group), las = 1, horiz = FALSE, col = c("blue", "red", "blue"), ylab = "Frequency", xlab = "Group of Income")

Figure 2: Barplot of grouped income

data = data.frame(group, income)
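One detail of cut() is worth a hedged illustration: any value that falls outside the outermost breaks is returned as NA. The sketch below assumes open-ended breaks (-Inf and Inf) are acceptable for the grouping, and uses an arbitrary seed so the result is repeatable.

```r
set.seed(1)  # arbitrary seed, chosen only so the example is repeatable
income = rnorm(n = 508, mean = 500, sd = 80)

# Open-ended breaks make sure no value falls outside the groups
group = cut(income,
            breaks = c(-Inf, 400, 500, Inf),
            labels = c("Below 400", "400-500", "Above 500"))

# Number of people in each group
table(group)

# No income is left unclassified
sum(is.na(group))   # 0
```

With fixed outer breaks such as 300 and 800, an unusually low or high simulated income would silently drop out of the bar plot.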
  5. Logicals: Logical values can only be TRUE or FALSE. They are conceptually similar to a two-level factor, although logical is its own type in R. R is case-sensitive, so TRUE and FALSE must always be written in capital letters.

  6. Date and time
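The last two types in the list can be sketched briefly as follows. The vector temperature and the date used here are made up for illustration; as.Date() is the base R function for creating dates from text.

```r
temperature = c(24.5, 26.1, 27.8)

# A comparison returns a logical vector
warm = temperature > 25
warm                  # FALSE TRUE TRUE
class(warm)           # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(warm)             # 2

# as.Date() turns text in year-month-day form into a Date object
sampling_day = as.Date("2018-11-20")
class(sampling_day)   # "Date"

# Dates support arithmetic in units of days
next_week = sampling_day + 7
next_week             # "2018-11-27"
```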

Vectors

Often we want to store a set of numbers in one place. One way to do this is with vectors. A vector stores several values in one container. Let us look at the example below.

id = c(1,2,3,4,5)
people = c(158,659,782,659,759)
street = c("Dege", "Mchikichini", "Mwembe Mdogo", "Mwongozo",  "Cheka")

Notice that the c() function, which is short for concatenate, wraps the list of numbers. The c() function combines all the numbers into one container. Notice also that the individual numbers are separated by commas. The comma is referred to as an item delimiter. It allows R to hold each of the numbers separately. This is vital because, without the item delimiter, R cannot tell where one number ends and the next begins.

Indexing the element

One advantage of a vector is that you can extract an individual element by indexing, which is done with square brackets as illustrated below.

id[5]
FALSE [1] 5
people[5]
FALSE [1] 759
street[5]
FALSE [1] "Cheka"

Apart from extracting a single element, indexing allows you to extract a range of elements from a vector. This is extremely important because it lets you subset a portion of the data in a vector. The colon operator is used to extract a range of data.

street[2:4]
FALSE [1] "Mchikichini"  "Mwembe Mdogo" "Mwongozo"
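Indexing is not limited to single positions and ranges. As a minimal sketch using the same vectors, the square brackets also accept a vector of positions, a negative position, or a logical condition.

```r
people = c(158, 659, 782, 659, 759)
street = c("Dege", "Mchikichini", "Mwembe Mdogo", "Mwongozo", "Cheka")

# A vector of positions picks several elements that are not adjacent
first_and_last = street[c(1, 5)]
first_and_last           # "Dege" "Cheka"

# A negative index drops that element and keeps the rest
without_first = street[-1]

# A logical condition keeps the elements where it is TRUE
busy_streets = street[people > 700]
busy_streets             # "Mwembe Mdogo" "Cheka"
```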

Adding and Replacing an element in a vector

It is possible to add elements to an existing vector. Here is an example.

id[6] = 6
people[6] = 578
street[6] = "Mwongozo"

Sometimes you may need to replace an element in a vector. This can also be achieved with indexing.

people[1] = 750
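Besides assigning to a position, there are other base R ways to add and replace elements. The sketch below starts from the original five values and is only an illustration; the numbers 400 and 500 are made up.

```r
people = c(158, 659, 782, 659, 759)

# c() joins the existing vector with the new value
people = c(people, 578)

# append() does the same and can also insert at a chosen position
people = append(people, 400, after = 2)
people                    # 158 659 400 782 659 759 578

# Replace every value that meets a condition
people[people < 500] = 500
people                    # 500 659 500 782 659 759 578
```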

Number of elements in a vector

Sometimes you may have a long vector and want to know the number of elements in it. R has the length() function, which queries the vector and prints the answer.

length(people)
FALSE [1] 6

Generating sequences of numbers

There are a few R operators and functions designed for creating vectors of non-random numbers. They provide multiple ways of generating sequences of numbers.

The colon operator (:) generates a regular sequence between the specified lower and upper bounds in steps of one. For example, to generate the numbers between 0 and 10, we simply write:

vector.seq = 0:10
vector.seq
FALSE  [1]  0  1  2  3  4  5  6  7  8  9 10

However, if you want to generate a sequence with a specified interval, say the numbers between 0 and 10 with an interval of 2, then the seq() function is used.

regular.vector = seq(from = 0,to = 10, by = 2)
regular.vector
FALSE [1]  0  2  4  6  8 10
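The seq() function can also be told how many values to produce rather than the step size. The following is a short sketch with illustrative object names.

```r
# Ask for a fixed number of evenly spaced values instead of a step size
five_values = seq(from = 0, to = 10, length.out = 5)
five_values     # 0.0 2.5 5.0 7.5 10.0

# A negative step counts downwards
countdown = seq(from = 10, to = 0, by = -5)
countdown       # 10 5 0

# seq_len() gives 1, 2, ..., n and is handy for making ids
ids = seq_len(5)
ids             # 1 2 3 4 5
```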

Unlike the seq() function and the : operator, which work only with numbers, the rep() function repeats numbers or strings to create a vector.

id = rep(x = 3, each = 4)
station = rep(x = "Station1", each = 4)
id;station
FALSE [1] 3 3 3 3
FALSE [1] "Station1" "Station1" "Station1" "Station1"

The rep() function accepts the each and times arguments. The each argument creates a vector that repeats each element of the input the specified number of times.

sampled.months = c("January", "March", "May")
rep(x = sampled.months, each = 3)
FALSE [1] "January" "January" "January" "March"   "March"   "March"   "May"    
FALSE [8] "May"     "May"

The times argument, by contrast, repeats the whole vector the specified number of times.

rep(x = sampled.months, times = 3)
FALSE [1] "January" "March"   "May"     "January" "March"   "May"     "January"
FALSE [8] "March"   "May"
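The two arguments can be combined, and times also accepts a vector with one count per element. This is a minimal sketch; the object names both and uneven are only illustrative.

```r
sampled.months = c("January", "March", "May")

# each and times together: every element twice, then the whole set twice
both = rep(x = sampled.months, each = 2, times = 2)
length(both)    # 12

# A vector of times repeats each element a different number of times
uneven = rep(x = sampled.months, times = c(1, 2, 3))
uneven          # "January" "March" "March" "May" "May" "May"
```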

Generating vector of normal distribution

The normal distribution, which the central limit theorem places at the heart of statistics, is well known to statisticians. R has the rnorm() function, which makes a vector of normally distributed values. For example, to generate a vector of 40 sea surface temperature values from a normal distribution with a mean of 25 and a standard deviation of 1.58, we simply type this expression in the console:

sst = rnorm(n = 40, mean = 25,sd = 1.58)
sst
FALSE  [1] 24.03922 23.80315 23.47413 26.27077 23.42572 23.75998 23.61258 23.35309
FALSE  [9] 24.69700 22.76580 24.71677 23.02477 27.00590 23.95165 26.52619 26.29226
FALSE [17] 25.35510 24.18597 27.37901 24.34999 24.38044 26.92839 21.37074 25.54579
FALSE [25] 26.55608 26.61256 25.71027 29.16311 25.19961 24.32504 26.28006 25.61089
FALSE [33] 24.85534 24.80107 25.56271 27.84438 24.39485 27.66808 25.80105 24.16359
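Because rnorm() draws random values, running the code again gives different numbers from those printed above. A common way to make the draw repeatable is to call set.seed() first, as in this sketch (the seed value 123 is arbitrary).

```r
# Fix the seed so the same numbers are drawn each time
set.seed(123)
sst_first = rnorm(n = 40, mean = 25, sd = 1.58)

set.seed(123)
sst_second = rnorm(n = 40, mean = 25, sd = 1.58)

# Both draws are identical because they started from the same seed
identical(sst_first, sst_second)   # TRUE

# The sample mean and standard deviation are close to,
# but not exactly, the requested 25 and 1.58
mean(sst_first)
sd(sst_first)
```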

Rounding off numbers

There are many ways of rounding numbers to the nearest integer or to a specified number of decimal places. The code block below illustrates a common way to round off:

require(magrittr)
chl = rnorm(n = 20, mean = .55, sd = .2)
chl %>% round(digits = 2)
FALSE  [1] 0.43 0.58 0.18 0.55 0.85 0.38 0.38 0.40 0.60 0.20 0.74 0.44 0.75 0.37 0.72
FALSE [16] 0.83 0.54 0.53 0.61 0.51
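Other base R functions from the same family behave slightly differently. The sketch below compares them on three made-up values.

```r
chl = c(0.4349, 0.5751, 1.8562)   # made-up values for illustration

round(chl, digits = 2)    # 0.43 0.58 1.86  (two decimal places)
round(chl)                # 0 1 2           (nearest whole number)
floor(chl)                # 0 0 1           (always down)
ceiling(chl)              # 1 1 2           (always up)
signif(chl, digits = 2)   # 0.43 0.58 1.90  (two significant digits)
```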

Data Frame

A data frame is very much like a simple Excel spreadsheet, where each column represents a variable and each row represents an observation. A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. A data frame is a list of equal-length vectors with rows as records and columns as variables. This makes data frames especially useful for storing data, as each column can hold a different class of object (i.e. numeric, character, factor, logical, etc.). In this section, we will create data frames and add attributes to them.

Creating data frames

Perhaps the easiest way to create a data frame is to pass vectors to the data.frame() function. For instance, in this case we create a simple data frame dt and assess its internal structure.

# create vectors
Name  = c('Bob','Jeff','Mary')
Score = c(90, 75, 92)
Grade = c("A", "B", "A")

## use the vectors to make a data frame
dt = data.frame(Name, Score, Grade)

## assess the internal structure
str(dt)
FALSE 'data.frame': 3 obs. of  3 variables:
FALSE  $ Name : chr  "Bob" "Jeff" "Mary"
FALSE  $ Score: num  90 75 92
FALSE  $ Grade: chr  "A" "B" "A"

In versions of R before 4.0.0, data.frame() converted character columns to factors by default, so Name and Grade would have appeared as factors. Since R 4.0.0 the default is stringsAsFactors = FALSE, which is why the output above shows them as chr. The argument can still be set explicitly:

## use the vectors to make a data frame
df = data.frame(Name, Score, Grade, stringsAsFactors = FALSE)
df %>% str()
FALSE 'data.frame': 3 obs. of  3 variables:
FALSE  $ Name : chr  "Bob" "Jeff" "Mary"
FALSE  $ Score: num  90 75 92
FALSE  $ Grade: chr  "A" "B" "A"

Now the variable Name is of character class in the data frame. The long-standing tendency of data frames to convert character columns into factors is avoided by an enhanced data frame called a tibble, which provides stricter checking and better formatting than the traditional data.frame.

## load the tibble package, which provides tibble() and glimpse()
library(tibble)
## use the vectors to make a tibble
tb = tibble(Name, Score, Grade) 
## check the internal structure of the tibble
tb%>% glimpse()
FALSE Rows: 3
FALSE Columns: 3
FALSE $ Name  <chr> "Bob", "Jeff", "Mary"
FALSE $ Score <dbl> 90, 75, 92
FALSE $ Grade <chr> "A", "B", "A"

Table 1 shows the data frame created by combining the three vectors.

Table 1: Variables in the data frame
Name Score Grade
Bob 90 A
Jeff 75 B
Mary 92 A

Because the columns have meaning and we have given them names, it is desirable to access an element by the name of its column rather than by the column number. In large Excel spreadsheets I often get annoyed trying to remember which column something was in. The $ sign and [] are used in R to select a variable from a data frame.

dt$Name
FALSE [1] "Bob"  "Jeff" "Mary"
dt[,1]
FALSE [1] "Bob"  "Jeff" "Mary"
dt$Score
FALSE [1] 90 75 92
dt[,2]
FALSE [1] 90 75 92
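The square brackets can also take column names and row conditions. The sketch below rebuilds dt so that it runs on its own; the object names scores, second_name and high_scores are only illustrative.

```r
Name  = c("Bob", "Jeff", "Mary")
Score = c(90, 75, 92)
Grade = c("A", "B", "A")
dt = data.frame(Name, Score, Grade)

# Select a column by its name inside the brackets
scores = dt[, "Score"]
dt[["Score"]]                 # 90 75 92, same as dt$Score

# Select a single cell: row 2 of the Name column
second_name = dt[2, "Name"]
second_name                   # "Jeff"

# Keep only the rows that meet a condition (note the trailing comma)
high_scores = dt[dt$Score > 80, ]
high_scores$Name              # "Bob" "Mary"
```

Using names rather than positions keeps the code correct even if the order of the columns changes later.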

R has built-in datasets that we can use for illustration. For example, @longley created the longley dataset, a data frame with 7 economic variables observed every year from 1947 to 1962 (Table 2). We can add the data to the workspace with the data() function.

data(longley)

library(knitr)       # provides kable()
library(kableExtra)  # provides column_spec()

longley %>% 
  kable(caption = "Longley's economic dataset", align = "c", row.names = FALSE) %>%
  column_spec(1:7, width = "3cm")
Table 2: Longley's economic dataset
GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
83.0 234.289 235.6 159.0 107.608 1947 60.323
88.5 259.426 232.5 145.6 108.632 1948 61.122
88.2 258.054 368.2 161.6 109.773 1949 60.171
89.5 284.599 335.1 165.0 110.929 1950 61.187
96.2 328.975 209.9 309.9 112.075 1951 63.221
98.1 346.999 193.2 359.4 113.270 1952 63.639
99.0 365.385 187.0 354.7 115.094 1953 64.989
100.0 363.112 357.8 335.0 116.219 1954 63.761
101.2 397.469 290.4 304.8 117.388 1955 66.019
104.6 419.180 282.2 285.7 118.734 1956 67.857
108.4 442.769 293.6 279.8 120.445 1957 68.169
110.8 444.546 468.1 263.7 121.950 1958 66.513
112.6 482.704 381.3 255.2 123.366 1959 68.655
114.2 502.601 393.1 251.4 125.368 1960 69.564
115.7 518.173 480.6 257.2 127.852 1961 69.331
116.9 554.894 400.7 282.7 130.081 1962 70.551
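Before printing a whole dataset, it often helps to inspect it with a few base R functions. This is a minimal sketch using the same built-in longley data.

```r
data(longley)

# Number of rows and columns
dim(longley)             # 16 7

# Column names
names(longley)

# The first three rows only
head(longley, n = 3)

# Minimum, quartiles, mean and maximum of one variable
summary(longley$GNP)
```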

Sometimes you may need to create sets of values and store them in vectors, then combine the vectors into a data frame. Let us see how this can be done. First create three vectors. The first contains the id for ten individuals, the second holds the time each individual signed the attendance book, and the third is the distance of each individual from the office. We can concatenate the sets of values to make vectors.

library(lubridate)  # provides ymd_hms(), dmy() and month()

id  = c(1,2,3,4,5,6,7,8,9,10)

time = ymd_hms(c("2018-11-20 06:35:25 EAT", "2018-11-20 06:52:05 EAT", 
                 "2018-11-20 07:08:45 EAT", "2018-11-20 07:25:25 EAT", 
                 "2018-11-20 07:42:05 EAT", "2018-11-20 07:58:45 EAT", 
                 "2018-11-20 08:15:25 EAT", "2018-11-20 08:32:05 EAT", 
                 "2018-11-20 08:48:45 EAT", "2018-11-20 09:05:25 EAT"), tz = "")

distance = c(20, 85, 45, 69, 42,  52, 6, 45, 36, 7)

Once we have vectors of the same length, we can use the data.frame() function to combine the three vectors into one data frame, shown in Table 3.

arrival = data.frame(id, time, distance)
Table 3: The time employees enter into the office with the distance from their residential areas to the office
IDs Time Distance
1 2018-11-20 06:35:25 20
2 2018-11-20 06:52:05 85
3 2018-11-20 07:08:45 45
4 2018-11-20 07:25:25 69
5 2018-11-20 07:42:05 42
6 2018-11-20 07:58:45 52
7 2018-11-20 08:15:25 6
8 2018-11-20 08:32:05 45
9 2018-11-20 08:48:45 36
10 2018-11-20 09:05:25 7
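Once a data frame exists, new columns can be added by assigning to a name that does not exist yet. The sketch below uses only the first four records and, as an assumption, builds the times with the base R function as.POSIXct() and the time zone name "Africa/Nairobi" instead of ymd_hms(). The column names far and minutes_after_first are hypothetical.

```r
id = 1:4
# as.POSIXct() is a base R way to create date-times; the time zone is stated explicitly
time = as.POSIXct(c("2018-11-20 06:35:25", "2018-11-20 06:52:05",
                    "2018-11-20 07:08:45", "2018-11-20 07:25:25"),
                  tz = "Africa/Nairobi")
distance = c(20, 85, 45, 69)

arrival = data.frame(id, time, distance)

# Flag the people who live more than 50 distance units away
arrival$far = arrival$distance > 50

# Minutes between each arrival and the first one
arrival$minutes_after_first = as.numeric(
  difftime(arrival$time, arrival$time[1], units = "mins"))

str(arrival)
```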

Matrix

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. R is strict about how a matrix is built: all columns must be of the same length. Unlike a data frame or a list, which can store numeric, character and other types in different columns, all elements of a matrix must be of the same type, for example all numeric or all character.

Creating Matrices

Base R has a matrix() function that constructs matrices column-wise. In other words, elements are entered starting from the upper left corner and running down the columns. You should therefore pay close attention to the values used to fill a matrix and to the number of rows and columns when using the matrix() function. For example, in the code block below we create imaginary monthly sst values for five years, which gives an atomic vector of 60 observations.

sst = rnorm(n = 60, mean = 25, sd = 3)

Once we have the atomic vector of sst values, we can convert it to a matrix with the matrix() function. We put the months as rows and the years as columns. We therefore have 12 rows and 5 columns, and the product of the number of months and years is 60, which equals the length of the sst vector we just created.

sst.matrix = matrix(data = sst, nrow = 12, ncol = 5)

We then check whether we have a matrix with the is.matrix() function.

is.matrix(sst);is.matrix(sst.matrix)
FALSE [1] FALSE
FALSE [1] TRUE
sst
FALSE  [1] 25.08327 24.09097 26.34485 24.14146 24.64941 26.09739 23.91218 26.91260
FALSE  [9] 22.69753 24.84250 24.89610 21.91391 30.20031 22.91143 26.33853 24.99629
FALSE [17] 23.69466 30.10935 25.78983 25.05814 26.59829 26.77820 26.60457 27.57653
FALSE [25] 24.65990 26.95125 29.67351 20.43879 30.49668 25.97459 24.28109 25.72646
FALSE [33] 21.91946 21.91512 24.87088 27.31100 28.64027 26.58810 27.33460 21.98717
FALSE [41] 29.04679 25.85037 22.84274 22.30848 28.18201 25.51643 29.05196 24.12477
FALSE [49] 19.98520 30.13661 26.37739 25.49994 22.78731 25.80078 23.99339 23.64162
FALSE [57] 26.40151 29.94575 26.80928 23.84312

We can check whether the dimensions we defined while creating this matrix are correct. This is done with the dim() function from base R.

dim(sst.matrix)
FALSE [1] 12  5

If you have a large vector and want the matrix() function to work out the number of columns, you simply define nrow and let the function fill the elements column-wise rather than by rows. That is done by passing the argument byrow = FALSE (the default) to the matrix() function.

sst.matrixby = sst %>% matrix(nrow = 12, byrow = FALSE)
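The effect of the byrow argument is easiest to see on a tiny example. This sketch fills a matrix with the numbers 1 to 6 both ways; the object names are illustrative.

```r
values = 1:6

# Default (byrow = FALSE): fill down the columns
by_column = matrix(data = values, nrow = 2)
by_column
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: fill across the rows instead
by_row = matrix(data = values, nrow = 2, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```

In both cases the number of columns (3) is worked out from the length of the vector and nrow.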

Adding attributes to Matrices

Often you may need to add attributes to a matrix: observation names, variable names and comments.

We can add column names, which are the years from 2014 to 2018.

years = 2014:2018
colnames(sst.matrix) = years
sst.matrix
FALSE           2014     2015     2016     2017     2018
FALSE  [1,] 25.08327 30.20031 24.65990 28.64027 19.98520
FALSE  [2,] 24.09097 22.91143 26.95125 26.58810 30.13661
FALSE  [3,] 26.34485 26.33853 29.67351 27.33460 26.37739
FALSE  [4,] 24.14146 24.99629 20.43879 21.98717 25.49994
FALSE  [5,] 24.64941 23.69466 30.49668 29.04679 22.78731
FALSE  [6,] 26.09739 30.10935 25.97459 25.85037 25.80078
FALSE  [7,] 23.91218 25.78983 24.28109 22.84274 23.99339
FALSE  [8,] 26.91260 25.05814 25.72646 22.30848 23.64162
FALSE  [9,] 22.69753 26.59829 21.91946 28.18201 26.40151
FALSE [10,] 24.84250 26.77820 21.91512 25.51643 29.94575
FALSE [11,] 24.89610 26.60457 24.87088 29.05196 26.80928
FALSE [12,] 21.91391 27.57653 27.31100 24.12477 23.84312

We then add the months, January to December, as row names. Now the matrix has names for the rows (records) and for the columns (variables).

months = seq(from = dmy("01-01-2015"), to = dmy("31-12-2015"), 
             by = "month") %>% month(abbr = TRUE, 
                                     label = TRUE)
rownames(sst.matrix) = months
sst.matrix
FALSE         2014     2015     2016     2017     2018
FALSE Jan 25.08327 30.20031 24.65990 28.64027 19.98520
FALSE Feb 24.09097 22.91143 26.95125 26.58810 30.13661
FALSE Mar 26.34485 26.33853 29.67351 27.33460 26.37739
FALSE Apr 24.14146 24.99629 20.43879 21.98717 25.49994
FALSE May 24.64941 23.69466 30.49668 29.04679 22.78731
FALSE Jun 26.09739 30.10935 25.97459 25.85037 25.80078
FALSE Jul 23.91218 25.78983 24.28109 22.84274 23.99339
FALSE Aug 26.91260 25.05814 25.72646 22.30848 23.64162
FALSE Sep 22.69753 26.59829 21.91946 28.18201 26.40151
FALSE Oct 24.84250 26.77820 21.91512 25.51643 29.94575
FALSE Nov 24.89610 26.60457 24.87088 29.05196 26.80928
FALSE Dec 21.91391 27.57653 27.31100 24.12477 23.84312
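Once a matrix has row and column names, they can be used for indexing in place of positions. The sketch below rebuilds a comparable matrix so that it runs on its own; it assumes the built-in constant month.abb for the month names and an arbitrary seed, so the values differ from those printed above.

```r
set.seed(2)  # arbitrary seed so the example is repeatable
sst.matrix = matrix(data = rnorm(n = 60, mean = 25, sd = 3), nrow = 12, ncol = 5)
colnames(sst.matrix) = 2014:2018
rownames(sst.matrix) = month.abb   # built-in vector "Jan", "Feb", ..., "Dec"

# Names are given as text inside the brackets
jan_2014 = sst.matrix["Jan", "2014"]         # one value
mid_year = sst.matrix[c("Jun", "Jul"), ]     # two months, all years

# Column means give the average temperature of each year
yearly_mean = colMeans(sst.matrix)
yearly_mean
```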

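Arrays

An array extends the matrix to more than two dimensions. Here the 60 sst values are arranged into 3 rows, 5 columns and 4 layers.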

array(data = sst, dim = c(3,5,4))
FALSE , , 1
FALSE 
FALSE          [,1]     [,2]     [,3]     [,4]     [,5]
FALSE [1,] 25.08327 24.14146 23.91218 24.84250 30.20031
FALSE [2,] 24.09097 24.64941 26.91260 24.89610 22.91143
FALSE [3,] 26.34485 26.09739 22.69753 21.91391 26.33853
FALSE 
FALSE , , 2
FALSE 
FALSE          [,1]     [,2]     [,3]     [,4]     [,5]
FALSE [1,] 24.99629 25.78983 26.77820 24.65990 20.43879
FALSE [2,] 23.69466 25.05814 26.60457 26.95125 30.49668
FALSE [3,] 30.10935 26.59829 27.57653 29.67351 25.97459
FALSE 
FALSE , , 3
FALSE 
FALSE          [,1]     [,2]     [,3]     [,4]     [,5]
FALSE [1,] 24.28109 21.91512 28.64027 21.98717 22.84274
FALSE [2,] 25.72646 24.87088 26.58810 29.04679 22.30848
FALSE [3,] 21.91946 27.31100 27.33460 25.85037 28.18201
FALSE 
FALSE , , 4
FALSE 
FALSE          [,1]     [,2]     [,3]     [,4]     [,5]
FALSE [1,] 25.51643 19.98520 25.49994 23.99339 29.94575
FALSE [2,] 29.05196 30.13661 22.78731 23.64162 26.80928
FALSE [3,] 24.12477 26.37739 25.80078 26.40151 23.84312
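Elements of an array are indexed with one position per dimension. This sketch uses the numbers 1 to 60 instead of the sst values so that the positions are easy to follow.

```r
sst.array = array(data = 1:60, dim = c(3, 5, 4))

dim(sst.array)         # 3 5 4

# One element: row 2, column 1, layer 3
one_value = sst.array[2, 1, 3]
one_value              # 32

# Leaving a position empty keeps everything along that dimension,
# so this returns the whole first layer as a 3 x 5 matrix
first_layer = sst.array[, , 1]
dim(first_layer)       # 3 5
```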

Elements of a matrix can be extracted with indexing. For example, the sst.matrix we just created has twelve rows representing monthly averages and five columns representing years. To extract all twelve months of the fifth year (2018), we index the rows and the column:

sst.matrix[1:12,5]
FALSE      Jan      Feb      Mar      Apr      May      Jun      Jul      Aug 
FALSE 19.98520 30.13661 26.37739 25.49994 22.78731 25.80078 23.99339 23.64162 
FALSE      Sep      Oct      Nov      Dec 
FALSE 26.40151 29.94575 26.80928 23.84312
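If data for a sixth year become available, one way to add them to the matrix is with cbind(). This is a hedged sketch: the values for 2019 are simulated, the seed is arbitrary, and sst.2019 and sst.extended are hypothetical names.

```r
set.seed(3)  # arbitrary seed so the example is repeatable
sst.matrix = matrix(data = rnorm(n = 60, mean = 25, sd = 3), nrow = 12, ncol = 5)
colnames(sst.matrix) = 2014:2018

# Hypothetical values for a sixth year
sst.2019 = rnorm(n = 12, mean = 25, sd = 3)

# cbind() attaches the new vector as an extra column named 2019
sst.extended = cbind(sst.matrix, "2019" = sst.2019)
dim(sst.extended)           # 12 6
colnames(sst.extended)
```

The new vector must have as many elements as the matrix has rows, otherwise cbind() recycles or warns.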

Dealing with Missing Values

Just as we can assign numbers, strings or lists to a variable, we can also assign nothing, an empty value, to an object. In R, an empty object is defined with NULL. Assigning NULL to an object is one way to reset it to an empty state. You might do this when you want to pre-allocate an object without any value, especially when you iterate a process and want the outputs to be stored in the empty object.

sst.container = NULL

You can check whether an object is empty with the is.null() function, which returns a logical output of TRUE or FALSE.

is.null(sst.container)
FALSE [1] TRUE

You can also check for NULL in an if statement, as highlighted in the following example:

if (is.null(sst.container)){
  print("The object is empty and hence you can use to store looped outputs!!!")
}
FALSE [1] "The object is empty and hence you can use to store looped outputs!!!"
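The idea of storing looped outputs in an empty object can be sketched as follows. The simulated monthly means are made up for illustration, and monthly.mean is a hypothetical name.

```r
sst.container = NULL

# Each pass of the loop appends one simulated monthly mean to the container
for (month in 1:12) {
  monthly.mean = mean(rnorm(n = 30, mean = 25, sd = 1.58))
  sst.container = c(sst.container, monthly.mean)
}

is.null(sst.container)   # FALSE, the object now holds values
length(sst.container)    # 12
```

For long loops, creating a vector of the final length in advance (for example with numeric(12)) and filling it by position is usually faster than growing it with c().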

An empty element (value) in an object is represented with NA in R; it marks the absence of a value in an object or variable.

sst.sample = c(26.78, 25.98,NA, 24.58, NA)
sst.sample
FALSE [1] 26.78 25.98    NA 24.58    NA

To identify missing values in a vector, use the is.na() function, which returns a logical vector with TRUE for each element that is missing.

is.na(sst.sample)
FALSE [1] FALSE FALSE  TRUE FALSE  TRUE

Computing statistics on a variable that contains NA gives NA as the output.

mean(sst.sample); sd(sst.sample);range(sst.sample)
FALSE [1] NA
FALSE [1] NA
FALSE [1] NA NA

However, we can exclude missing values from these calculations by passing the na.rm = TRUE argument.

mean(sst.sample, na.rm = TRUE);sd(sst.sample, na.rm = TRUE);range(sst.sample, na.rm = TRUE)
FALSE [1] 25.78
FALSE [1] 1.113553
FALSE [1] 24.58 26.78

You can also exclude the elements with NA values using the na.omit() function.

sst.sample %>% na.omit()
FALSE [1] 26.78 25.98 24.58
FALSE attr(,"na.action")
FALSE [1] 3 5
FALSE attr(,"class")
FALSE [1] "omit"
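A few more base R idioms are useful when working with missing values. The sketch below counts them, locates them, and, as one possible treatment among many, replaces them with the mean of the observed values; sst.filled is an illustrative name.

```r
sst.sample = c(26.78, 25.98, NA, 24.58, NA)

# Count the missing values and find their positions
sum(is.na(sst.sample))      # 2
which(is.na(sst.sample))    # 3 5

# Keep only the observed values with logical indexing
observed = sst.sample[!is.na(sst.sample)]
observed                    # 26.78 25.98 24.58

# Replace the missing values with the mean of the observed ones
sst.filled = sst.sample
sst.filled[is.na(sst.filled)] = mean(sst.sample, na.rm = TRUE)
sst.filled                  # 26.78 25.98 25.78 24.58 25.78
```

Unlike na.omit(), logical indexing returns a plain vector without the extra na.action attribute.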

Finally, there is NaN (not a number), which is closely related to NA and marks the result of a numerical operation that is not defined. For example, suppose we have sea surface temperature anomalies and want to use the sqrt() function to reduce the variability of the dataset.

sst.anomaly = c(2.3,1.25,.8,.31,0,-.21)
sqrt(sst.anomaly)
FALSE [1] 1.5165751 1.1180340 0.8944272 0.5567764 0.0000000       NaN

Notice that the square root of -0.21 gives NaN, because the square root of a negative number is not defined for real numbers.
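The difference between NA and NaN can be checked with is.na() and is.nan(). In this sketch the warning that sqrt() raises for a negative input is silenced with suppressWarnings(); roots is an illustrative name.

```r
sst.anomaly = c(2.3, 1.25, 0.8, 0.31, 0, -0.21)

# sqrt() of a negative number warns and returns NaN
roots = suppressWarnings(sqrt(sst.anomaly))

is.nan(roots)        # FALSE FALSE FALSE FALSE FALSE TRUE

# is.na() is TRUE for both NA and NaN, but is.nan() is TRUE only for NaN
is.na(c(NA, NaN))    # TRUE TRUE
is.nan(c(NA, NaN))   # FALSE TRUE

# Dividing zero by zero is another operation that gives NaN
0 / 0                # NaN
```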