Background Poll

1. Have you taken any statistics course before?

2. Do you use the following languages for data analysis?

  • MATLAB
  • Python
  • R
  • IDL
  • Other
  • No, I don’t

3. What do you use for data visualization?

  • MATLAB
  • Python
  • R
  • IDL
  • Origin
  • Other (but not Excel or similar spreadsheet software)
  • I use Excel, sometimes PowerPoint

4. On a scale of 0 (none) to 10 (sophisticated), how would you describe your programming experience?

5. What is your laptop’s operating system (OS)?

  • Windows
  • macOS
  • Ubuntu
  • Other

6. Name one thing you want to achieve at the end of the semester (open question)


Statistical topics to be covered

  • Review of statistical basics
    • Probability density function, distributions, statistical hypothesis, random draw, confidence interval, p-value
  • Characteristics of environmental data sets
    • Types of environmental data sets, format of environmental data sets, normal distribution, log normal distribution, log transformation, detection limit, missing values
  • Checking data sets: Quick summaries
    • Mean, median, quantile, standard deviation, variance, outlier
  • Checking data sets: Quick plots
    • Histogram, barplot, boxplot, scatterplot, time series plot, image plot, surface maps
  • Comparisons between two groups
    • t distribution, assumptions of t-test, comparing means of two groups, rank-sum test, permutation test, Welch t-test, sign test, signed-rank test
  • Comparisons among several groups
    • One-way ANOVA, F-test, two-way ANOVA, linear combinations of group means, multiple comparison procedures
  • Correlation tests
    • Pearson’s test, Spearman’s test, Kendall’s test
  • Simple linear regression
    • Least squares regression estimation, robustness of least squares inferences, model assessment, fit assessment
  • Multiple linear regression
    • Least squares estimates, model assessment, fit assessment
  • Over-fitting and variable selection
    • Over-fitting, AIC, BIC, backward selection, forward selection, step-wise selection
  • Logistic regression
    • Binary responses, binomial responses, Poisson responses, building logistic regression model
  • Time series analysis
    • MA and AR model, SARIMA model fitting and prediction
  • Cluster analysis
    • Hierarchical clustering, K-Means clustering

Take the course or not?

This course is designed for ESE undergraduates with little or no data analysis background, with an emphasis on R.

Good reasons for not taking the course:

  • After browsing the syllabus and schedule, there is nothing new for me
  • I am already able to analyze and visualize data sets through programming
  • I want to learn other programming languages (C, C++, Java, MATLAB, Python, IDL, etc.)
  • I want to take an “easy” class to meet the graduation requirement

Bad reasons for not taking the course:

  • Data analysis is intimidating; perhaps I can never learn it
  • I don’t have coding/programming experience
  • My project does not require programming efforts

The R language

R is a leading tool for statistics, data analysis, and machine learning. It is more than a statistical package; it’s a programming language, so you can create your own objects, functions, and packages.

Academics and statisticians have developed R over two decades. R now has one of the richest ecosystems for data analysis. There are around 16000 packages available on CRAN (the Comprehensive R Archive Network), so it is usually possible to find a library for whatever analysis you want to perform. This rich variety of libraries makes R a first choice for statistics and data analysis. R also makes it easy to communicate findings as a presentation, document, or website.

The following figure summarizes important reasons for learning R:

[Figure: important reasons for learning R]

Figure source


First look at RStudio

Follow the instructions to install RStudio.

RStudio is the most popular integrated development environment (IDE) for R. It allows you to write, run, and debug your R code.

[Figure: the RStudio IDE]

Check this cheat sheet (may need VPN) for more features and shortcuts of the RStudio IDE.


The notes below are modified from the excellent online R tutorial freely available on the Software Carpentry website.


Quick R tutorial

Using R as a calculator

The simplest thing you can do with R is arithmetic. Let’s try this in the Console window:

1 + 2
## [1] 3

R prints the answer preceded by ## [1] (in these notes) or [1] (in your console). Don’t worry about this for now; we will explain it later. For now, think of it as indicating output.

You will find that spaces have no impact on the result.

1+2
## [1] 3

When using R as a calculator, the order of operations is the same as you would have learned back in school. From highest to lowest precedence:

  • Parentheses: (, )
  • Exponents: ^ or **
  • Multiply and divide: *, / (equal precedence, evaluated left to right)
  • Add and subtract: +, - (equal precedence, evaluated left to right)

Let’s try

1 + 2 * 3
## [1] 7
(1 + 2) * 3
## [1] 9
(1 + 2) ^ 3
## [1] 27

Always make your intentions clear, as others may later read your code. Notes that explain your intentions are called comments. Anything that follows the hash symbol # is ignored by R when it executes code.

# Get the square root of 100
100 ^ 0.5
## [1] 10

Really small or large numbers are printed in scientific notation:

2 / 10000
## [1] 2e-04

You can write numbers in scientific notation too:

5e+5 * 1e+5
## [1] 5e+10
5.2E+5 + 4.8E+6
## [1] 5320000

Mathematical functions

R has many built-in mathematical functions. To call a function, we can type its name, followed by open and closing parentheses (). Anything we type inside the parentheses is called the function’s arguments:

# Trigonometry functions
sin(1)
## [1] 0.841471
sin(0.5*pi)
## [1] 1
# Natural logarithm
log(10)
## [1] 2.302585
# Base-10 logarithm
log10(10)
## [1] 1
# e^(1/2)
exp(0.5)
## [1] 1.648721
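
As a small additional sketch of how arguments work, the calls below pass a second argument by name. The particular values are chosen only for illustration:

```r
# Square root of 100
sqrt(100)
## [1] 10
# Logarithm of 100 in base 10; the base is passed as a named argument
log(100, base = 10)
## [1] 2
# Round pi to 2 decimal places
round(pi, digits = 2)
## [1] 3.14
```

Naming an argument (base = 10, digits = 2) makes the call easier to read than relying on the position of the argument alone.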

Don’t worry about trying to remember every function in R. You can look them up online, or if you can remember the start of the function’s name, use Tab completion in RStudio.

This is one advantage that RStudio has over R on its own; it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take. In fact, auto-completion is not limited to functions; it also works for variables.

Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the Help window. The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples that illustrate command usage. We will go back to how to get help later in this section.
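
For instance, the following lines are a minimal sketch of asking for help on log(); any other function name could be used in its place:

```r
# Open the help page for the log() function
?log
# The same request, written as a function call
help("log")
# Show only the argument list of log()
args(log)
```

In RStudio the first two lines open the Help window, while args() prints the arguments directly in the Console.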

Comparing things

We can also do comparisons in R:

# Equality (note two equals signs, read as "is equal to")
1 == 1
## [1] TRUE
# Inequality (read as "is not equal to")
3 != 2  
## [1] TRUE
# Less than
100 < 101  
## [1] TRUE
# Less than or equal to
1e3 <= 2e3
## [1] TRUE
# Greater than
1/3 > 1/5
## [1] TRUE
# Greater than or equal to
-100 >= -200
## [1] TRUE

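A word of warning about comparing numbers: you should never use == to compare two numbers unless they are integers (a data type which can specifically represent only whole numbers). Decimal numbers are stored with limited precision, so two values that should be equal may differ by a tiny rounding error.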
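
The short sketch below illustrates the problem with decimal numbers and one common workaround, the base R function all.equal(), which compares values within a small tolerance. The numbers 0.1, 0.2, and 0.3 are chosen only for illustration:

```r
# 0.1 + 0.2 is not stored as exactly 0.3
0.1 + 0.2 == 0.3
## [1] FALSE
# all.equal() compares the two numbers within a small tolerance
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```

Wrapping the call in isTRUE() gives a plain TRUE or FALSE, because all.equal() returns a description of the difference, not FALSE, when the values differ.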

Variables and assignment

We can store values in variables using the assignment operator <-, like this:

# Assignment
x <- 1/100

Notice that the assignment does not print a value. Instead, we stored it for later in something called a variable. x now contains the value 0.01:

# Print x
x
## [1] 0.01

Remember, you can always print a variable (in fact, anything) using the function print():

print(x)
## [1] 0.01
print("Hello World")
## [1] "Hello World"

You can also assign a character string to a variable:

MyName <- "SUSTech"
print(MyName)
## [1] "SUSTech"

Look for the Environment window from the top right panel of RStudio, and you will see that x and its value have appeared. Our variable x can be used in place of a number in any calculation that expects a number:

# Use x in a calculation
log10(x)
## [1] -2

Notice also that variables can be reassigned:

# Assignment
x <- 100

x used to contain the value 0.01, and now it has the value 100.

Assignment values can contain the variable being assigned to:

# Notice how RStudio updates its description of x on the top right tab
x <- x + 1 
y <- x * 2

The right-hand side of the assignment can be any valid R expression. The right-hand side is fully evaluated before the assignment occurs.

Variable names can contain letters, numbers, underscores, and periods, but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number or an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names. Which one you use is up to you, but be consistent.
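
A few example names are sketched below. All of the names and values are hypothetical and serve only to illustrate the rules:

```r
# Underscores and periods are both allowed in names
min_height <- 1.5
max.height <- 2.0
# Names are case-sensitive: n and N are two different variables
n <- 10
N <- 1000
# A name starting with a period is a hidden variable
.hidden_value <- 42
```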

Vectorization

One thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R is an ordered set of values of the same data type. For example:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
2 ^ (1:10)
##  [1]    2    4    8   16   32   64  128  256  512 1024
x <- 1:10
2 ^ x
##  [1]    2    4    8   16   32   64  128  256  512 1024

This is incredibly powerful; we will discuss this further in an upcoming section.
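
As a further sketch, arithmetic and comparison operators are also applied to every element of a vector. The vector name v is hypothetical:

```r
# A vector of the integers 1 to 5
v <- 1:5
# Arithmetic is applied to every element
v * 2
## [1]  2  4  6  8 10
v ^ 2
## [1]  1  4  9 16 25
# A comparison returns one TRUE or FALSE per element
v > 3
## [1] FALSE FALSE FALSE  TRUE  TRUE
```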

Managing your environment

There are a few useful commands you can use to interact with the R session.

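ls() will list all of the variables and functions stored in the global environment (your working R session). The output below comes from a session in which many other objects had already been defined, so your own list will be much shorter: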

ls()
##   [1] "ACF"                 "AQI"                 "AQI1"               
##   [4] "AQI2"                "average"             "Baseline"           
##   [7] "Beta0_hat"           "Beta1_hat"           "cd_data"            
##  [10] "Cd_data"             "cd_data_tbl"         "Check_Air_Quality"  
##  [13] "co2"                 "co2_components"      "Control"            
##  [16] "Daily_T"             "Day"                 "density"            
##  [19] "density1"            "density2"            "density3"           
##  [22] "density4"            "df"                  "df_B"               
##  [25] "df_error"            "df_regression"       "df_W"               
##  [28] "df1"                 "df2"                 "df3"                
##  [31] "df4"                 "Diff"                "difference"         
##  [34] "F_ratio"             "Forecast_List"       "full_model"         
##  [37] "Groupings_A"         "GZ_n"                "GZ_PM2.5"           
##  [40] "GZ_sigma"            "Hour"                "Hourly_T"           
##  [43] "i"                   "k"                   "Keeling_Data"       
##  [46] "Keeling_Data_tbl"    "lambda"              "logistic"           
##  [49] "M"                   "mean_A"              "mean_B"             
##  [52] "Month_CO2"           "MSB"                 "MSE"                
##  [55] "MSR"                 "MSW"                 "MyName"             
##  [58] "n"                   "N"                   "n1"                 
##  [61] "N1"                  "n2"                  "N2"                 
##  [64] "n3"                  "n4"                  "Obs_A"              
##  [67] "Obs_all"             "Obs_B"               "Obs_difference"     
##  [70] "output"              "Output_List"         "Output_Matrix"      
##  [73] "Output_Matrix2"      "Ozone_data"          "p_2.5th"            
##  [76] "p_97.5th"            "p_value"             "P_value"            
##  [79] "P1"                  "P2"                  "P3"                 
##  [82] "PACF"                "Pesticide_data"      "Pesticide_data2"    
##  [85] "PM"                  "PM_data"             "PM2.5_2019"         
##  [88] "PM2.5_2020"          "PM2.5_data"          "Pop"                
##  [91] "pop1"                "pop2"                "Pred_band"          
##  [94] "Prediction"          "r"                   "R2"                 
##  [97] "R2_adj"              "Rainfall_data"       "reg"                
## [100] "res_aov"             "residual"            "s1"                 
## [103] "s2"                  "Samle_size"          "sample"             
## [106] "Sample"              "Sample_mean"         "sample1"            
## [109] "Sample1"             "sample2"             "Sample2"            
## [112] "sample3"             "sample4"             "sample5"            
## [115] "sample6"             "sd"                  "SD"                 
## [118] "sd1"                 "sd2"                 "SE"                 
## [121] "SE_beta1_hat"        "SE_W"                "Seeded"             
## [124] "Simulations"         "Soil_conc"           "SSB"                
## [127] "SSE"                 "SSR"                 "SST"                
## [130] "SSW"                 "SZ_n"                "SZ_PM2.5"           
## [133] "SZ_sigma"            "t"                   "t_beta1"            
## [136] "Temp_Value"          "TOC"                 "Total_simulations"  
## [139] "Treat"               "trModel"             "two_way_additive"   
## [142] "two_way_interaction" "Unseeded"            "Uptaken_amount"     
## [145] "UV"                  "walsh.test"          "x"                  
## [148] "x1"                  "x2"                  "y"                  
## [151] "y_new"               "y1"                  "y2"                 
## [154] "z"                   "Z"

Note here that we didn’t give any arguments to ls(), but we still needed to give the parentheses () to tell R to call the function.

You can use rm() to delete objects you no longer need:

rm(x)
ls()
##   [1] "ACF"                 "AQI"                 "AQI1"               
##   [4] "AQI2"                "average"             "Baseline"           
##   [7] "Beta0_hat"           "Beta1_hat"           "cd_data"            
##  [10] "Cd_data"             "cd_data_tbl"         "Check_Air_Quality"  
##  [13] "co2"                 "co2_components"      "Control"            
##  [16] "Daily_T"             "Day"                 "density"            
##  [19] "density1"            "density2"            "density3"           
##  [22] "density4"            "df"                  "df_B"               
##  [25] "df_error"            "df_regression"       "df_W"               
##  [28] "df1"                 "df2"                 "df3"                
##  [31] "df4"                 "Diff"                "difference"         
##  [34] "F_ratio"             "Forecast_List"       "full_model"         
##  [37] "Groupings_A"         "GZ_n"                "GZ_PM2.5"           
##  [40] "GZ_sigma"            "Hour"                "Hourly_T"           
##  [43] "i"                   "k"                   "Keeling_Data"       
##  [46] "Keeling_Data_tbl"    "lambda"              "logistic"           
##  [49] "M"                   "mean_A"              "mean_B"             
##  [52] "Month_CO2"           "MSB"                 "MSE"                
##  [55] "MSR"                 "MSW"                 "MyName"             
##  [58] "n"                   "N"                   "n1"                 
##  [61] "N1"                  "n2"                  "N2"                 
##  [64] "n3"                  "n4"                  "Obs_A"              
##  [67] "Obs_all"             "Obs_B"               "Obs_difference"     
##  [70] "output"              "Output_List"         "Output_Matrix"      
##  [73] "Output_Matrix2"      "Ozone_data"          "p_2.5th"            
##  [76] "p_97.5th"            "p_value"             "P_value"            
##  [79] "P1"                  "P2"                  "P3"                 
##  [82] "PACF"                "Pesticide_data"      "Pesticide_data2"    
##  [85] "PM"                  "PM_data"             "PM2.5_2019"         
##  [88] "PM2.5_2020"          "PM2.5_data"          "Pop"                
##  [91] "pop1"                "pop2"                "Pred_band"          
##  [94] "Prediction"          "r"                   "R2"                 
##  [97] "R2_adj"              "Rainfall_data"       "reg"                
## [100] "res_aov"             "residual"            "s1"                 
## [103] "s2"                  "Samle_size"          "sample"             
## [106] "Sample"              "Sample_mean"         "sample1"            
## [109] "Sample1"             "sample2"             "Sample2"            
## [112] "sample3"             "sample4"             "sample5"            
## [115] "sample6"             "sd"                  "SD"                 
## [118] "sd1"                 "sd2"                 "SE"                 
## [121] "SE_beta1_hat"        "SE_W"                "Seeded"             
## [124] "Simulations"         "Soil_conc"           "SSB"                
## [127] "SSE"                 "SSR"                 "SST"                
## [130] "SSW"                 "SZ_n"                "SZ_PM2.5"           
## [133] "SZ_sigma"            "t"                   "t_beta1"            
## [136] "Temp_Value"          "TOC"                 "Total_simulations"  
## [139] "Treat"               "trModel"             "two_way_additive"   
## [142] "two_way_interaction" "Unseeded"            "Uptaken_amount"     
## [145] "UV"                  "walsh.test"          "x1"                 
## [148] "x2"                  "y"                   "y_new"              
## [151] "y1"                  "y2"                  "z"                  
## [154] "Z"
rm(MyName, y)
ls()
##   [1] "ACF"                 "AQI"                 "AQI1"               
##   [4] "AQI2"                "average"             "Baseline"           
##   [7] "Beta0_hat"           "Beta1_hat"           "cd_data"            
##  [10] "Cd_data"             "cd_data_tbl"         "Check_Air_Quality"  
##  [13] "co2"                 "co2_components"      "Control"            
##  [16] "Daily_T"             "Day"                 "density"            
##  [19] "density1"            "density2"            "density3"           
##  [22] "density4"            "df"                  "df_B"               
##  [25] "df_error"            "df_regression"       "df_W"               
##  [28] "df1"                 "df2"                 "df3"                
##  [31] "df4"                 "Diff"                "difference"         
##  [34] "F_ratio"             "Forecast_List"       "full_model"         
##  [37] "Groupings_A"         "GZ_n"                "GZ_PM2.5"           
##  [40] "GZ_sigma"            "Hour"                "Hourly_T"           
##  [43] "i"                   "k"                   "Keeling_Data"       
##  [46] "Keeling_Data_tbl"    "lambda"              "logistic"           
##  [49] "M"                   "mean_A"              "mean_B"             
##  [52] "Month_CO2"           "MSB"                 "MSE"                
##  [55] "MSR"                 "MSW"                 "n"                  
##  [58] "N"                   "n1"                  "N1"                 
##  [61] "n2"                  "N2"                  "n3"                 
##  [64] "n4"                  "Obs_A"               "Obs_all"            
##  [67] "Obs_B"               "Obs_difference"      "output"             
##  [70] "Output_List"         "Output_Matrix"       "Output_Matrix2"     
##  [73] "Ozone_data"          "p_2.5th"             "p_97.5th"           
##  [76] "p_value"             "P_value"             "P1"                 
##  [79] "P2"                  "P3"                  "PACF"               
##  [82] "Pesticide_data"      "Pesticide_data2"     "PM"                 
##  [85] "PM_data"             "PM2.5_2019"          "PM2.5_2020"         
##  [88] "PM2.5_data"          "Pop"                 "pop1"               
##  [91] "pop2"                "Pred_band"           "Prediction"         
##  [94] "r"                   "R2"                  "R2_adj"             
##  [97] "Rainfall_data"       "reg"                 "res_aov"            
## [100] "residual"            "s1"                  "s2"                 
## [103] "Samle_size"          "sample"              "Sample"             
## [106] "Sample_mean"         "sample1"             "Sample1"            
## [109] "sample2"             "Sample2"             "sample3"            
## [112] "sample4"             "sample5"             "sample6"            
## [115] "sd"                  "SD"                  "sd1"                
## [118] "sd2"                 "SE"                  "SE_beta1_hat"       
## [121] "SE_W"                "Seeded"              "Simulations"        
## [124] "Soil_conc"           "SSB"                 "SSE"                
## [127] "SSR"                 "SST"                 "SSW"                
## [130] "SZ_n"                "SZ_PM2.5"            "SZ_sigma"           
## [133] "t"                   "t_beta1"             "Temp_Value"         
## [136] "TOC"                 "Total_simulations"   "Treat"              
## [139] "trModel"             "two_way_additive"    "two_way_interaction"
## [142] "Unseeded"            "Uptaken_amount"      "UV"                 
## [145] "walsh.test"          "x1"                  "x2"                 
## [148] "y_new"               "y1"                  "y2"                 
## [151] "z"                   "Z"
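
To check the effect of rm() on a single object without reading through the whole listing, one option is the base R function exists(). The sketch below uses a hypothetical object named tmp_value:

```r
# Hypothetical object used only for this demonstration
tmp_value <- 42
exists("tmp_value")
## [1] TRUE
# Delete it and check again
rm(tmp_value)
exists("tmp_value")
## [1] FALSE
# To clear the whole workspace in one step (use with care):
# rm(list = ls())
```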

Conditional statements

Often when we are coding, we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions is met. Alternatively, we can also set an action to occur a particular number of times.

There are several ways you can control flow in R. For conditional statements, the most commonly used approaches are the if and else constructs.

Given today’s AQI (Air Quality Index) value, suppose we want to write a piece of code to check whether the Air Quality is excellent (AQI <= 50) or not.

AQI <- 69

Open a new R script (File -> New File -> R Script), and you should see a new panel in your RStudio. Type the following lines in the script, and save it. Select the lines you want to run; there are two ways to run them:

  • Press Ctrl+Enter
  • Click Run, then Run Selected Line(s), from the top right of the Script window.

AQI <- 69
# If this condition is TRUE
if (AQI <= 50) { 
  # Do the following
  print("Air Quality is Excellent")
}

The message is not printed in the console because AQI is larger than 50. To print a different message for values larger than 50, we can add an else statement.

# If this condition is TRUE
if (AQI <= 50) {
  # Do the following
  print("Air Quality is Excellent")
# If this condition is FALSE  
} else {
  print("Air Quality is NOT Excellent")
}
## [1] "Air Quality is NOT Excellent"

You can also test multiple conditions by using else if.

if (AQI <= 50) {
  print("Air Quality is Excellent")
} else if (AQI <= 100) {
  print("Air Quality is GOOD")
} else {
  print("Air Pollution!")
}
## [1] "Air Quality is GOOD"

Change AQI to 40, 80, and 120, and check the output.

Important: when R evaluates the condition inside if() statements, it is looking for a logical element, i.e., TRUE or FALSE. This can cause some headaches for beginners. For example:

x  <-  4 == 3
if (x) {
  "4 equals 3"
} else {
  "4 does not equal 3"          
}
## [1] "4 does not equal 3"

We can use the logical AND (&&) and OR (||) operators to combine more than one condition:

if (AQI > 50 && AQI <= 100) {
  print("Air Quality is Good")
}
## [1] "Air Quality is Good"

Change AQI to 40, 80, and 120, and check the output.

AQI1 <- 69
AQI2 <- 140

if (AQI1 <= 100 || AQI2 <= 100) {
  print("There is at least 1 site with a GOOD air quality")
}
## [1] "There is at least 1 site with a GOOD air quality"

Change AQI1 to 40, 80, and 120, and check the output.

Defining a Function

You probably have realized it’s really tedious to change the AQI variables. It would be really helpful to define a function that handles different inputs automatically.

A function is a set of statements organized together to perform a specific task. R has a large number of built-in functions, and users can create their own. In R, a function is an object, so the R interpreter can pass control to the function, along with any arguments the function needs to do its work. The function in turn performs its task and returns control to the interpreter, together with any result, which may be stored in other objects.

An R function is created by using the keyword function. The basic syntax of an R function definition is as follows:

function_name <- function(argument_1, argument_2, ...) {
  # Function body
}
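
As a minimal sketch of this syntax, the hypothetical function below converts a temperature from degrees Celsius to kelvin. The names Celsius_To_Kelvin, Temp_C, and Temp_K are illustrative assumptions:

```r
# Hypothetical example: convert a temperature from Celsius to kelvin
Celsius_To_Kelvin <- function(Temp_C) {
  Temp_K <- Temp_C + 273.15
  # Hand the result back to the caller
  return(Temp_K)
}

# Call the function with the argument 25
Celsius_To_Kelvin(25)
## [1] 298.15
```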

In the above AQI example, we can define a function named Check_Air_Quality as:

Check_Air_Quality <- function(AQI) {
  #  Excellent 
  if (AQI <= 50) {
    print("Air Quality is Excellent")
  }
  # Good
  if (AQI > 50 && AQI <= 100) {
    print("Air Quality is Good")
  }
  # Polluted, Level I
  if (AQI > 100 && AQI <= 150) {
    print("Air pollution, level I")
  }  
  # Polluted, Level II
  if (AQI > 150 && AQI <= 200) {
    print("Air pollution, level II")
  }  
  # Polluted, Level III
  if (AQI > 200 && AQI <= 300) {
    print("Air pollution, level III")
  }  
  # Polluted, Level IV
  if (AQI > 300) {
    print("Air pollution, level IV")
  }  
}

Call Check_Air_Quality with various AQI values: 40, 80, 120, 160, 240, and 500:

Check_Air_Quality(40)
## [1] "Air Quality is Excellent"
Check_Air_Quality(80)
## [1] "Air Quality is Good"
Check_Air_Quality(120)
## [1] "Air pollution, level I"
Check_Air_Quality(160)
## [1] "Air pollution, level II"
Check_Air_Quality(240)
## [1] "Air pollution, level III"
Check_Air_Quality(500)
## [1] "Air pollution, level IV"
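
A possible variant, sketched here under the assumption that we want to reuse the result later, returns the label instead of printing it and chains the conditions with else if. The name Get_Air_Quality and the shortened labels are hypothetical:

```r
# Hypothetical variant: return the label so it can be stored in a variable
Get_Air_Quality <- function(AQI) {
  if (AQI <= 50) {
    Label <- "Excellent"
  } else if (AQI <= 100) {
    Label <- "Good"
  } else if (AQI <= 150) {
    Label <- "Pollution, level I"
  } else if (AQI <= 200) {
    Label <- "Pollution, level II"
  } else if (AQI <= 300) {
    Label <- "Pollution, level III"
  } else {
    Label <- "Pollution, level IV"
  }
  return(Label)
}

# Store the result, then print it
Today <- Get_Air_Quality(120)
print(Today)
## [1] "Pollution, level I"
```

Because each else if is only reached when the earlier conditions are FALSE, the lower bound of each range no longer needs to be tested explicitly.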

Repeating operations

If you want to iterate over a set of values, when the order of iteration is important, and perform the same operation on each, a for() loop will do the job. This is the most flexible of looping operations, but therefore also the hardest to use correctly.

In general, the advice of many R users would be to learn about for() loops but to avoid using for() loops unless the order of iteration is important: i.e., the calculation at each iteration depends on the results of previous iterations.

Let’s define a vector Forecast_List, which contains daily mean temperature forecasts for 5 days in Shenzhen:

Forecast_List <- c(28, 27, 28, 26, 27)

Here c() means “combine”:

print(Forecast_List)
## [1] 28 27 28 26 27

Now loop over each element in Forecast_List:

for (Daily_T in Forecast_List) { # For each element of Forecast_List
  # Do following
  print(Daily_T)
} # End of the for loop
## [1] 28
## [1] 27
## [1] 28
## [1] 26
## [1] 27
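
If the position of each element is also needed, one option is to loop over the positions and look up the value with square brackets. This is a sketch only; seq_along() and the [ ] indexing are base R features not introduced above:

```r
Forecast_List <- c(28, 27, 28, 26, 27)
# seq_along() gives the positions 1, 2, ..., 5
for (Day in seq_along(Forecast_List)) {
  # Forecast_List[Day] is the element at position Day
  print(paste("Day", Day, ":", Forecast_List[Day]))
}
## [1] "Day 1 : 28"
## [1] "Day 2 : 27"
## [1] "Day 3 : 28"
## [1] "Day 4 : 26"
## [1] "Day 5 : 27"
```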

We can use a for() loop nested within another for() loop to iterate over two things at once.

for (Daily_T in Forecast_List) {
  for (Hour in 1:24) {
    Hourly_T <- rnorm(1,Daily_T,5)
    print(paste(Daily_T,Hourly_T))
  }
}
## [1] "28 19.6988198575309"
## [1] "28 34.7176875258543"
## [1] "28 23.5407589873969"
## [1] "28 26.4972068379152"
## [1] "28 24.0353814622211"
## [1] "28 27.8044412113289"
## [1] "28 26.8376399783147"
## [1] "28 26.2494242029103"
## [1] "28 35.0038013334599"
## [1] "28 19.8961014740756"
## [1] "28 23.5221516919203"
## [1] "28 29.0170639776425"
## [1] "28 28.5408933749185"
## [1] "28 32.7267269315267"
## [1] "28 23.738164034384"
## [1] "28 33.0162911134238"
## [1] "28 29.0843599670089"
## [1] "28 24.9010772610372"
## [1] "28 32.6856925841404"
## [1] "28 22.4521922981223"
## [1] "28 30.8799870205304"
## [1] "28 31.1638784123406"
## [1] "28 33.3237129416972"
## [1] "28 23.2423690100525"
## [1] "27 33.067796470596"
## [1] "27 28.0434706486471"
## [1] "27 22.4278226103209"
## [1] "27 28.7324888606353"
## [1] "27 24.5433812627168"
## [1] "27 21.1776601757421"
## [1] "27 33.4751359364445"
## [1] "27 21.2891796107356"
## [1] "27 29.2440198838129"
## [1] "27 16.5133748069372"
## [1] "27 31.6960865967774"
## [1] "27 23.5479845558448"
## [1] "27 26.1302350609059"
## [1] "27 32.0065574083835"
## [1] "27 33.2156719018574"
## [1] "27 28.2888722820638"
## [1] "27 28.712446134501"
## [1] "27 23.6026068705502"
## [1] "27 23.3059926199877"
## [1] "27 19.0911923473211"
## [1] "27 30.5098035332685"
## [1] "27 25.8789238911282"
## [1] "27 26.3008489020255"
## [1] "27 22.9905866381837"
## [1] "28 15.735857850233"
## [1] "28 34.250303719434"
## [1] "28 22.4128506736347"
## [1] "28 32.0568428033343"
## [1] "28 28.3986702684282"
## [1] "28 32.6361004778093"
## [1] "28 32.3124297261998"
## [1] "28 22.5724360013808"
## [1] "28 29.8913759351521"
## [1] "28 27.9724374100295"
## [1] "28 21.1919810213329"
## [1] "28 23.415237621939"
## [1] "28 26.953394038341"
## [1] "28 23.1273598405587"
## [1] "28 36.263039474969"
## [1] "28 25.4106381185103"
## [1] "28 28.7029588626408"
## [1] "28 28.802008575414"
## [1] "28 26.3817221158694"
## [1] "28 31.7968910679295"
## [1] "28 32.8854447059878"
## [1] "28 35.2197893538432"
## [1] "28 34.3856479705603"
## [1] "28 25.1906784670751"
## [1] "26 33.2783576652864"
## [1] "26 30.0219848549018"
## [1] "26 25.6269960264162"
## [1] "26 21.4946391736449"
## [1] "26 31.7813058832885"
## [1] "26 28.6364913536878"
## [1] "26 29.4537690753901"
## [1] "26 25.8050517274399"
## [1] "26 32.8428844976266"
## [1] "26 35.772279307081"
## [1] "26 24.9684710433982"
## [1] "26 34.6628015183426"
## [1] "26 26.7727056740464"
## [1] "26 18.4514659761618"
## [1] "26 18.8653313678537"
## [1] "26 18.7603415428819"
## [1] "26 33.5814552670635"
## [1] "26 27.0758802184479"
## [1] "26 23.3376806048711"
## [1] "26 30.9772206025228"
## [1] "26 16.0742669232243"
## [1] "26 17.809131218727"
## [1] "26 17.9379408659617"
## [1] "26 23.0329227542311"
## [1] "27 27.2546368033119"
## [1] "27 31.9152111483565"
## [1] "27 22.991993604737"
## [1] "27 21.8161463614006"
## [1] "27 24.7413711351016"
## [1] "27 25.4472712665952"
## [1] "27 22.0527681335378"
## [1] "27 25.8074165527213"
## [1] "27 32.8639623021498"
## [1] "27 26.4347776027182"
## [1] "27 33.5706520016112"
## [1] "27 29.5574969842314"
## [1] "27 39.5216758229394"
## [1] "27 32.2939253960163"
## [1] "27 28.7496304991599"
## [1] "27 28.8871506122696"
## [1] "27 31.0169254569996"
## [1] "27 22.7558703839084"
## [1] "27 29.5390932002974"
## [1] "27 23.997574222305"
## [1] "27 28.3833489760957"
## [1] "27 42.7900797871735"
## [1] "27 33.5623015274267"
## [1] "27 29.0069543395444"

Here at each Hour we use the rnorm() function to generate 1 random Gaussian sample with a mean of Daily_T and a standard deviation of 5.

We notice in the output that when the first index (Daily_T) is set to 28, the second index (Hour) iterates through its full set of indices. Once the indices of Hour have been iterated through, then Daily_T moves to the next one (i.e., 27). This process continues until the last index has been used for each for() loop.
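
Because rnorm() draws random numbers, the values change every time the loop is run. If reproducible values are needed, a common approach is to fix the random seed with set.seed() before drawing. In this sketch the seed 123 and the names First_Draw and Second_Draw are arbitrary choices:

```r
# Fix the seed (123 is an arbitrary choice), then draw one value
set.seed(123)
First_Draw <- rnorm(1, mean = 28, sd = 5)
# Reset the same seed and draw again
set.seed(123)
Second_Draw <- rnorm(1, mean = 28, sd = 5)
# Both draws are the same number
identical(First_Draw, Second_Draw)
## [1] TRUE
```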

Rather than printing the results, we could write the loop output to a new object:

Output_List <- c()
for (Daily_T in Forecast_List) {
  for (Hour in 1:24) {
    Hourly_T    <- rnorm(1,Daily_T,5)
    Temp_Value  <- paste(Daily_T,Hourly_T)
    Output_List <- c(Output_List, Temp_Value)
  }
}

print(Output_List)
##   [1] "28 35.9560042085396" "28 27.4272321076041" "28 29.0967258831916"
##   [4] "28 30.4752616740164" "28 36.1726091378138" "28 34.852719258776" 
##   [7] "28 27.5099473898483" "28 25.5868872630691" "28 30.7304034893396"
##  [10] "28 28.0676834011107" "28 35.5388858156731" "28 18.4827433829708"
##  [13] "28 23.107520379023"  "28 29.8585795132678" "28 29.8416052268138"
##  [16] "28 19.4950787783915" "28 24.6443832783154" "28 33.7857903517879"
##  [19] "28 23.5571998885049" "28 32.9516961234688" "28 33.2000764971939"
##  [22] "28 32.5878567722075" "28 19.0920337448699" "28 30.5607463109605"
##  [25] "27 41.6524269071484" "27 27.2887279013432" "27 29.5599858832661"
##  [28] "27 30.0745652247668" "27 36.8358500616671" "27 22.4290573012545"
##  [31] "27 25.6666552383642" "27 32.1015866140773" "27 18.2778405207192"
##  [34] "27 27.3459273206051" "27 33.6016339721711" "27 27.569786685234" 
##  [37] "27 30.734406245852"  "27 34.4594862759856" "27 26.9636995224397"
##  [40] "27 18.0031483273811" "27 25.3174189375982" "27 24.3363541745574"
##  [43] "27 24.6443221276294" "27 28.0800782791775" "27 25.2067710667407"
##  [46] "27 30.9165568356724" "27 32.5055101868929" "27 26.6020319795304"
##  [49] "28 30.0145927893302" "28 27.2766862851854" "28 29.1376145659962"
##  [52] "28 32.0763265844605" "28 35.6570993923308" "28 23.9399161241214"
##  [55] "28 26.0969214051809" "28 24.492687739191"  "28 24.8581290694942"
##  [58] "28 36.8088214161083" "28 29.4057186498681" "28 27.0268944287513"
##  [61] "28 32.8577831458928" "28 17.9406479868824" "28 30.5004452817609"
##  [64] "28 25.8513227282157" "28 24.3141164320583" "28 25.1764661720114"
##  [67] "28 32.762336786253"  "28 25.9386649468695" "28 30.2358179956051"
##  [70] "28 31.9685948804035" "28 26.6873899941407" "28 23.7554264937126"
##  [73] "26 26.320143490558"  "26 27.6532633682481" "26 18.5952507116159"
##  [76] "26 26.2167540500831" "26 24.3026724220006" "26 29.5992086624888"
##  [79] "26 14.2471015857349" "26 23.2016638337446" "26 24.882388614599" 
##  [82] "26 26.6135833033579" "26 16.9920524507822" "26 25.9634496006038"
##  [85] "26 20.8722287693698" "26 25.3630309023992" "26 21.8112212637651"
##  [88] "26 26.3771978909162" "26 23.8746316859169" "26 16.006996579869" 
##  [91] "26 25.0501289856061" "26 35.7539509551053" "26 28.9895432668958"
##  [94] "26 35.355198104932"  "26 25.4481171285791" "26 31.6930562949691"
##  [97] "27 31.0421785801952" "27 26.5346038326342" "27 31.7288080667447"
## [100] "27 20.4279570192446" "27 24.3543721122051" "27 31.6355582519865"
## [103] "27 19.9225431448492" "27 22.0326139769271" "27 21.7708938950088"
## [106] "27 32.6849968513473" "27 33.0895198305758" "27 27.3403150040361"
## [109] "27 33.7424907390228" "27 23.2080630002514" "27 14.9996319126961"
## [112] "27 18.0193665292616" "27 25.0710046757951" "27 34.1420572635162"
## [115] "27 29.4695974875797" "27 26.9924035991331" "27 24.5048544622555"
## [118] "27 28.5821180604296" "27 24.1977737347385" "27 22.1153183728972"

This approach can be useful, but “growing your results” (building the result object incrementally) is computationally inefficient, so avoid it when you are iterating through a lot of values.

Important: One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as the for loop progresses. Each time the object grows, R has to copy it into a new, larger block of memory, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of the appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a matrix, create an empty matrix with 5 rows and 24 columns, then at each iteration store the result in the appropriate location.

Output_Matrix    <- matrix(nrow=5, ncol=24)
for (Day in 1:5) {
  Daily_T        <- Forecast_List[Day]
  for (Hour in 1:24) {
    Hourly_T     <- rnorm(1,Daily_T,5)
    Temp_Value   <- paste(Daily_T,Hourly_T)
    Output_Matrix[Day, Hour] <- Temp_Value
  }
}
Output_Matrix2 <- as.vector(Output_Matrix)
Output_Matrix2
##   [1] "28 26.0845783118647" "27 30.9971670126357" "28 30.2024321073016"
##   [4] "26 28.5930062916697" "27 22.7836459822549" "28 26.5269507731028"
##   [7] "27 22.7441622881206" "28 24.1387290053255" "26 33.6866272282495"
##  [10] "27 23.9130524746896" "28 26.4645657396432" "27 27.4305253513199"
##  [13] "28 33.9879209647618" "26 31.8129803810144" "27 26.2004294345087"
##  [16] "28 27.9469867000322" "27 28.5593788990881" "28 32.1687986202537"
##  [19] "26 24.6866855221858" "27 33.087738577415"  "28 31.1530402254051"
##  [22] "27 29.0417207045254" "28 26.1942994688406" "26 24.005121865571" 
##  [25] "27 24.839183622239"  "28 26.3600589008015" "27 27.5734994136344"
##  [28] "28 31.9552835899299" "26 28.8322318215254" "27 25.532406732028" 
##  [31] "28 31.1384057308229" "27 34.0455374013156" "28 32.7200882129987"
##  [34] "26 21.7580040984604" "27 22.8360016448912" "28 27.5358222390532"
##  [37] "27 18.1974761418415" "28 30.6710308487102" "26 25.0150455604761"
##  [40] "27 29.9893562558979" "28 36.8578432425402" "27 22.8739224568164"
##  [43] "28 19.5848984529932" "26 11.0334683137787" "27 21.2193383870309"
##  [46] "28 22.9061964617472" "27 30.572537401687"  "28 30.7589024170426"
##  [49] "26 29.015241138442"  "27 29.1001022492544" "28 32.4859117352971"
##  [52] "27 27.7501393372524" "28 28.353088197033"  "26 20.1940194777277"
##  [55] "27 32.0955850936003" "28 31.9173206741632" "27 21.05035263"     
##  [58] "28 22.8414525114609" "26 23.055214065725"  "27 20.8541543319491"
##  [61] "28 29.0565829237837" "27 35.0911975344304" "28 26.4803097803627"
##  [64] "26 32.824762150798"  "27 32.9525858373154" "28 26.8376567031671"
##  [67] "27 29.021681582474"  "28 27.8387289428645" "26 23.2209309708009"
##  [70] "27 19.2906657259738" "28 24.5502164432897" "27 26.8944335953995"
##  [73] "28 36.0634250352685" "26 23.3602159884947" "27 27.6883910453084"
##  [76] "28 33.3728271123548" "27 24.9480229269942" "28 28.8492044383811"
##  [79] "26 24.6228595422683" "27 38.8947415179612" "28 31.082843522355" 
##  [82] "27 23.3140324792684" "28 24.012135328906"  "26 34.1668303625732"
##  [85] "27 19.9589924307211" "28 24.0144878765282" "27 26.6934385183864"
##  [88] "28 20.7879687490384" "26 26.9735608122812" "27 27.0611597205979"
##  [91] "28 39.6410891903129" "27 11.930122888901"  "28 27.4283647716058"
##  [94] "26 29.003561682171"  "27 25.2240237334685" "28 34.1320308389913"
##  [97] "27 26.5852479141149" "28 31.3692281152596" "26 27.5169848749315"
## [100] "27 19.0180762510365" "28 27.6957467120799" "27 29.100771430287" 
## [103] "28 24.5440466031329" "26 25.2523651254085" "27 26.1700421718757"
## [106] "28 27.5144916108336" "27 24.1139394268434" "28 31.6814524882512"
## [109] "26 25.26498800478"   "27 21.0856661630376" "28 29.7718857721345"
## [112] "27 30.607926265359"  "28 19.7190536483737" "26 29.1504043927915"
## [115] "27 29.4731219933199" "28 22.9425083930788" "27 27.1916605338197"
## [118] "28 29.6801550621288" "26 20.4351148619303" "27 24.5776336751132"

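Here we use matrix() to create an empty matrix (filled with NA) with 5 rows and 24 columns, and as.vector() to convert the 5x24 matrix into a vector of length 120. Note that as.vector() reads the matrix column by column, so the output cycles through the five days for the first hour, then the five days for the second hour, and so on, rather than listing all 24 hours of the first day first.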
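To get a rough feel for the cost of growing a result object, the sketch below times both approaches on a simple task (storing the squares of 1 to N). The value of N and the object names are hypothetical, and the timings depend on your machine, so treat the numbers as indicative only.

```r
N <- 20000   # hypothetical number of iterations

# Approach 1: grow the result vector at every iteration
Grow_Time <- system.time({
  Grown <- c()
  for (i in 1:N) {
    Grown <- c(Grown, i^2)   # the whole vector is copied each time
  }
})

# Approach 2: pre-allocate a vector of the final length, then fill it in
Fill_Time <- system.time({
  Filled <- numeric(N)
  for (i in 1:N) {
    Filled[i] <- i^2
  }
})

identical(Grown, Filled)   # both approaches give the same result
Grow_Time["elapsed"]       # elapsed seconds when growing
Fill_Time["elapsed"]       # elapsed seconds when pre-allocating
```

The gap between the two timings usually widens quickly as N increases.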

Sometimes you will find yourself needing to repeat an operation as long as a certain condition is met. You can do this with a while() loop.

z <- 0
while(z <= 5){ # While this condition is TRUE
  # Do following
  z <- z + 1
  print(z)
} # End of the while loop
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

OK, can you figure out why the number 6 is also printed?

while() loops will not always be appropriate. You have to be particularly careful that you don’t end up stuck in an infinite loop because your condition is always met, and hence the while statement never terminates.
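One common safeguard is to cap the number of iterations and leave the loop with break once the cap is reached. The sketch below is illustrative only: Max_Iter and Iter are hypothetical names, and the condition is deliberately written so that it would otherwise never become FALSE.

```r
z        <- 0
Iter     <- 0
Max_Iter <- 100   # hypothetical safety cap

while (z != 5) {        # z takes the values 2, 4, 6, ... and never equals 5
  z    <- z + 2
  Iter <- Iter + 1
  if (Iter >= Max_Iter) {
    message("Stopping: reached the iteration cap")
    break               # leave the loop even though the condition is still TRUE
  }
}

Iter
## [1] 100
```

If a loop does get stuck in an interactive session, it can usually be interrupted with the Esc key or the stop button in the RStudio Console.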

R Packages

It is possible to add functions to R by writing a package or by obtaining a package written by someone else. As of this writing, there are over 16,000 packages available on CRAN (the Comprehensive R Archive Network). RStudio has functionality for managing packages. You can:

  • See what packages are installed by typing installed.packages()
  • Install packages by typing install.packages("packagename"), where packagename is the package name, in quotes.
  • Update installed packages by typing update.packages()
  • Remove a package with remove.packages("packagename")
  • Make a package available for use with library(packagename), no quotes.
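The commands above can be combined into a short check-then-load pattern. This is a sketch under the assumption that the package name is held in a string; Pkg is a hypothetical variable, and "stats" is used only because it ships with R, so the example runs without installing anything.

```r
Pkg <- "stats"   # ships with R; substitute the package you need

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
if (requireNamespace(Pkg, quietly = TRUE)) {
  # character.only = TRUE lets library() accept the name as a string
  library(Pkg, character.only = TRUE)
} else {
  message("Package not installed; try install.packages(\"", Pkg, "\")")
}

# Confirm that the package is now attached to the search path
Is_Attached <- paste0("package:", Pkg) %in% search()
Is_Attached
```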

Packages can also be viewed, loaded, and detached in the Packages tab of the lower-right panel in RStudio.

Clicking on this tab will display all of the installed packages with a checkbox next to them. If the box next to a package name is checked, the package is loaded, and if it is empty, the package is not loaded. Click an empty box to load that package and click a checked box to detach that package.

Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab.

Working directory

If you want to read files from, or write files to, a specific location, you should set the working directory in R. This is done by passing the path to the setwd() function. First, let’s check the current working directory using getwd().

# Get current working directory in R
getwd()

You can change the working directory using setwd():

# Set working directory
setwd("D:/ese335")

Alternatively, you can use the RStudio GUI: click Session, then Set Working Directory, followed by Choose Directory.
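When changing the working directory inside a script, it can help to check that the folder exists first and to restore the original directory afterwards. The sketch below uses a temporary folder so that it runs on any operating system; Demo_Dir, Old_WD, and the folder name ese335_demo are hypothetical names chosen for illustration.

```r
Old_WD   <- getwd()                                # remember where we started
Demo_Dir <- file.path(tempdir(), "ese335_demo")    # hypothetical folder

# setwd() stops with an error if the folder does not exist, so create it first
if (!dir.exists(Demo_Dir)) {
  dir.create(Demo_Dir)
}

setwd(Demo_Dir)
Current_Name <- basename(getwd())
Current_Name
## [1] "ese335_demo"

setwd(Old_WD)                                      # restore the original directory
```

file.path() builds the path with the correct separator for the current operating system, which avoids typing slashes by hand.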

Seeking Help

Reading Help files

R, and every package, provide help files for functions. To look up help on a specific function from a package that is loaded into your namespace (your interactive R session), the general syntax is:

?function_name or help(function_name)

This will load up a help page in RStudio.
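A few related base R helpers are sketched below, using rnorm as the example function. The object names Help_Page and Matches are hypothetical; the help page is assigned to an object here only so that the sketch runs quietly in a script.

```r
# help() with a quoted topic is equivalent to ?rnorm.
# Typing help("rnorm") on its own at the Console opens the page.
Help_Page <- help("rnorm")

# args() shows the arguments of a function and their defaults
args(rnorm)

# apropos() lists objects whose names match a pattern,
# which is handy when you only remember part of a name
Matches <- apropos("^rnorm")
Matches

# example("mean") would run the examples from the help page of mean()
```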

Special Operators

To seek help on special operators, use quotes: ?">="

When you have no idea where to begin

If you don’t know what function or package you need to use, CRAN Task Views is a good starting point. It is a specially maintained list of packages grouped into fields.

When your code doesn’t work: seeking help from your peers

If you are having trouble using a function, 9 times out of 10 the question you are asking has already been answered on Stack Overflow. You can search using the [r] tag.
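If you end up posting a question yourself, it usually helps to include a small data set that reproduces the problem, along with details of your R setup. The sketch below shows two base R functions often used for this; Demo_Data is a hypothetical data frame made up for illustration.

```r
# Small hypothetical data set that reproduces the problem
Demo_Data <- data.frame(Day = 1:3, Temp = c(28, 27, 26))

# dput() prints R code that recreates the object,
# so others can paste it straight into their own session
dput(Demo_Data)

# sessionInfo() reports the R version, operating system, and loaded packages
sessionInfo()
```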


In-class exercises

Exercise #1

  • Create a folder named ese335
    • Windows: In C:\ or D:\ disk
    • macOS: In your home folder (/Users/your_username/)
  • Change the RStudio working directory to the above folder

Exercise #2

X1  <- 50
X2  <- 120
X3  <- X2 * 2.0
X4  <- X1 - 20
X5  <- X1 > X2

What will be the value of each variable after each statement in the program?

  • Now type the above code in the Console and check your results.
  • Write a command to compare X3 to X4. Which one is larger?
  • Clean up your working environment by deleting the X1, X2, and X3 variables.