Programming in R for Data Science

R is an open-source language and environment for statistical computing and graphics, supported by thousands of built-in and user-contributed packages. R has evolved from being used in a plain text editor to interactive environments such as RStudio and Jupyter Notebook, which has attracted many data scientists. This progress is the result of efforts by R users across the world. R's packages keep becoming more powerful and useful over time. Packages such as dplyr, readr, SparkR and ggplot2 make it possible to manipulate, compute on and visualise data quickly.

Why R?

1). R is open source and free to download and use.

2). It is widely used for data science and machine learning.

3). R runs on multiple platforms such as Windows, macOS and Linux.

4). Thousands of packages are available for R to simplify your work.

How to install R/RStudio?

RStudio needs R itself, so first install R from CRAN (https://cran.r-project.org). Then download RStudio from https://www.rstudio.com/products/rstudio/download, select the RStudio installer for your operating system, and follow the installer steps (Next, Next, Finish).

The RStudio interface includes:

1). R Console: The console shows the output of the code you run. You can also type code directly into the console. The drawback is that code typed there is not saved in a file for later editing or re-use. The R script solves this problem.

2). R script: The script pane gives you space to write your code. To execute it, press Ctrl + Enter or click the 'Run' button at the top right corner of the script pane.

3). R environment: The environment pane lists the objects you have created or loaded. These can be variables, functions, data sets etc. Use this pane to check that your data has been loaded into R.

4). Graphical Output: This pane displays the graphs created during data analysis. Besides graphs, it also gives access to the installed packages and to R's help documentation.

Installing R Packages:

The main power of R lies in its packages. Most operations, such as data handling tasks, can be done in two ways in R:

1). By using R packages
2). With R base functions

To install an R package, type in the console:

install.packages("package_name")
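A common pattern is to install a package only when it is missing and then load it. The sketch below assumes a hypothetical helper called ensure_package (it is not part of R) and is demonstrated with the built-in stats package so that it runs without a network connection.

```r
# ensure_package() is a hypothetical helper for illustration, not a base R function.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # load the package whose name is stored in a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is downloaded here
loaded <- ensure_package("stats")
```

For a real project you would pass a package name such as "dplyr" instead.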

Basic Features:

R has five basic classes of objects:

1). Numeric- real numbers
2). Character
3). Integer- whole numbers
4). Logical
5). Complex

These objects can have attributes such as:

1). Dimensions
2). Class
3). Names
4). Length

These attributes can be accessed with the attributes() function.
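As a small illustration of inspecting attributes, the sketch below uses two made-up objects, x and m, and only base R functions.

```r
# A named numeric vector carries a "names" attribute
x <- c(a = 1.5, b = 2.5, c = 3.5)
attributes(x)   # a list containing $names
class(x)        # "numeric"
length(x)       # 3

# A matrix is a vector with a "dim" attribute
m <- matrix(1:6, nrow = 2)
attributes(m)   # a list containing $dim
dim(m)          # 2 3
```

Note that class() and length() have their own accessor functions and are not always listed in the output of attributes().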

Data Types in R:

R has various data types, including vectors (numeric, integer, etc.), lists, matrices and data frames.

Vector:

A vector holds objects of the same class. If you mix objects of different classes, R coerces them to a common class.

You can convert the class of a vector with the "as." family of functions.

> deg <- 0.8
> class(deg)
[1] "numeric"
> as.character(deg)
[1] "0.8"
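When objects of different classes are combined in one vector, R converts them to a common class. A minimal sketch, with variable names chosen only for illustration:

```r
# numeric + character + logical: everything becomes character
mixed_chr <- c(1, "a", TRUE)
class(mixed_chr)        # "character"

# numeric + logical: TRUE becomes 1
mixed_num <- c(1.5, TRUE)
class(mixed_num)        # "numeric"

# explicit conversion with the as.* functions
deg <- 0.8
as.integer(deg)         # 0, the fractional part is dropped
as.character(deg)       # "0.8"
```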

List:
A list can contain elements of different data types. The symbol [[1]] represents the first element of the list, and so on.

E.g., > hello_list <- list(11, "ew", TRUE, 1+4i)
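The following sketch shows how list elements can be accessed; sub_list is a name introduced here for illustration.

```r
hello_list <- list(11, "ew", TRUE, 1+4i)

# double brackets return the element itself
hello_list[[2]]              # "ew"
class(hello_list[[4]])       # "complex"

# each element keeps its own class
sapply(hello_list, class)

# single brackets return a (shorter) list
sub_list <- hello_list[1:2]
```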

Matrices:

A vector with a dimension attribute, i.e., rows and columns, is known as a matrix. It is a two-dimensional data structure whose elements are all of the same class.

E.g., >first_matrix<-matrix(1:8, nrow=2, ncol=4)
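By default matrix() fills the values column by column. The sketch below shows this and the usual row/column indexing; by_row is an extra example that is not in the text above.

```r
first_matrix <- matrix(1:8, nrow = 2, ncol = 4)
dim(first_matrix)        # 2 4
first_matrix[2, 3]       # 6: row 2, column 3
first_matrix[, 1]        # first column: 1 2

# fill row by row instead
by_row <- matrix(1:8, nrow = 2, byrow = TRUE)
by_row[2, 3]             # 7
```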

Data Frame:

A data frame is used to store tabular data. It differs from a matrix only in that its columns can have different classes. Internally, a data frame is a list in which every column is a vector of the same length.

> df <- data.frame(emname = c("reli", "jessy", "polli", "marcus"), marks = c(23, 45, 65, 67))
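A short sketch of inspecting and filtering this data frame follows. The stringsAsFactors argument is set explicitly because its default differs between R versions; the name top is introduced only for this example.

```r
df <- data.frame(
  emname = c("reli", "jessy", "polli", "marcus"),
  marks  = c(23, 45, 65, 67),
  stringsAsFactors = FALSE
)

str(df)                    # structure: 4 observations of 2 variables
sapply(df, class)          # each column has its own class
mean(df$marks)             # 50

# rows where marks are above 50
top <- df[df$marks > 50, ]
```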

Variables can be divided into two categories:

1). Continuous variables: They can take any numeric value, such as 2, 5, 6, 7, 8.99 etc.

2). Categorical variables: They take only a fixed set of discrete values (levels), such as 1, 2, 5, 6, 7, 88 etc.
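In R, continuous variables are usually stored as numeric vectors and categorical variables as factors. A minimal sketch, with item_weight and fat_content as hypothetical variable names:

```r
# continuous: any numeric value
item_weight <- c(2, 5, 6, 7, 8.99)
is.numeric(item_weight)        # TRUE
summary(item_weight)           # min, quartiles, mean, max

# categorical: a fixed set of levels
fat_content <- factor(c("Low Fat", "Regular", "Low Fat", "Regular", "Regular"))
levels(fat_content)            # "Low Fat" "Regular"
table(fat_content)             # count per level
```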

Control Structures:

Control structures direct the flow of code in a program or inside a function. (A function is a named block of code written to reduce the effort of repetitive tasks.) The control structures in R are:

if, else: used to test conditions.

if (<condition>) {
  # statement
} else {
  # statement
}
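A concrete instance of the if/else template, wrapped in a hypothetical function named grade with an assumed pass mark of 33:

```r
# grade() and pass_mark are illustrative names, not part of R
grade <- function(marks, pass_mark = 33) {
  if (marks >= pass_mark) {
    "pass"
  } else {
    "fail"
  }
}

grade(45)   # "pass"
grade(20)   # "fail"
```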

for: used to execute a loop a fixed number of times, once for each element of a sequence.

for (<variable> in <sequence>) {
  # statement
}
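As a worked example of the for loop, the sketch below adds up a vector of marks one element at a time; the variable names are illustrative.

```r
marks <- c(23, 45, 65, 67)

# loop over the values
total <- 0
for (m in marks) {
  total <- total + m
}
total                      # 200

# loop over the positions when the index is needed
for (i in seq_along(marks)) {
  cat("Student", i, "scored", marks[i], "\n")
}
```

In practice sum(marks) gives the same total; the loop is shown only to illustrate the syntax.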

while: tests a condition and repeats the statements for as long as the condition remains true.

# condition initialisation
Marks <- 20

# repeat while Marks is less than 33
while (Marks < 33) {
  print(Marks)
  Marks <- Marks + 1
}

R Packages:

R offers thousands of packages that you can use in your programs. Here I discuss a few packages that are useful for modelling:

1). Importing data: R provides packages for reading many formats such as .txt, .json, .sql etc. To import large files quickly, you can install packages like data.table, sqldf, jsonlite, RMySQL or readr.

2). Data manipulation: R packages for data manipulation let you perform simple to advanced computational tasks easily and quickly. You can install and use packages like dplyr, tidyr, stringr, plyr etc.

3). Data visualisation: R has built-in commands for plotting graphs, but creating advanced graphics with them becomes complex. For that you can install and use a package like ggplot2.

4). Modelling or machine learning: The caret package is powerful enough to cover most modelling needs and lets you develop machine learning models. Besides it, depending on the algorithm you need, you can use packages like randomForest, gbm, rpart etc.

Data Analysis in R:

Data exploration is an important stage of predictive modelling. You can create a good model only if you know how to explore the data.

You can download a data set for practice from the various sources available on the internet. Let's understand a few terms first:

Dependent or Response Variable: The response variable is the variable in the data set for which predictions are to be made.

Independent or Predictor Variable: Predictor variables are the ones used to predict the response variable.

Train Data: The predictive model is built on the train data set. The easiest way to recognise the train data is that it always includes the response variable.

Test Data: Once you have created the model, you need to check its accuracy. This is where the test data is used. It usually has fewer observations than the train data set and does not include the response variable.
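If you only have a single data set, you can create train and test sets yourself. The sketch below is one possible way, assuming the built-in mtcars data, mpg as the response variable, and a 70/30 split; none of these choices come from the tutorial's own data.

```r
set.seed(42)                       # make the random split reproducible

n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

# train data keeps the response variable (mpg)
train <- mtcars[train_idx, ]

# test data: remaining rows, response variable removed
test <- mtcars[-train_idx, setdiff(names(mtcars), "mpg")]

dim(train)
dim(test)
```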

Now let's import the data set and explore it.

# set working directory path
path <- ".../data/downloaded"
setwd(path)

Keep the train and test files inside your working directory; this avoids unnecessary trouble later. Once the directory is set, you can import the .csv files with:

train <- read.csv("Train_7abcbfh67u.csv")

test <- read.csv("Test_jikl45hrjkl.csv")

To check whether the import succeeded, look at the R environment pane, or check the dimensions:

# check dimensions
> dim(train)
> dim(test)

This shows the number of rows and columns in the train and test data sets. To check the variables and their types, use:

> str(train)
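The import and the checks above can be tried end to end without the real files. The sketch below writes a tiny made-up table to a temporary CSV file and reads it back; sample_data, csv_path and train_demo are illustrative names.

```r
# create a small CSV file in a temporary location
csv_path <- tempfile(fileext = ".csv")
sample_data <- data.frame(
  id    = 1:3,
  item  = c("a", "b", "c"),
  price = c(1.5, 2.0, 3.25)
)
write.csv(sample_data, csv_path, row.names = FALSE)

# import it and inspect the result
train_demo <- read.csv(csv_path)
dim(train_demo)    # 3 3
str(train_demo)    # variable names and types
```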

Now that you have checked the variables and their data types, it's time for a quick data exploration. Start by checking for missing values:

> table(is.na(train))

This shows how many values are missing in the data set in total (the count under TRUE). Next, check which variables these missing values belong to, and get summary statistics for every variable. It is strongly recommended to check for missing values during data exploration.

> colSums(is.na(train))
> summary(train)
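To see what these commands return, here is a minimal sketch on a small made-up data frame (small_df and its columns are hypothetical):

```r
small_df <- data.frame(
  weight = c(9.3, NA, 17.5, NA),
  size   = c("Small", "Medium", NA, "High"),
  price  = c(249.8, 48.3, 141.6, 182.1)
)

table(is.na(small_df))        # FALSE: 9, TRUE: 3
colSums(is.na(small_df))      # weight 2, size 1, price 0

# names of the variables that contain missing values
cols_with_na <- names(which(colSums(is.na(small_df)) > 0))
```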

These findings will help you treat each variable accurately.

Graphical Representation:

To present these variables visually, you need graphs. The data can be analysed in two ways:

1). Univariate Analysis: performed on a single variable.

2). Bivariate Analysis: performed on two variables.

To uncover hidden insights in the data set, perform bivariate analysis.

Install and load the ggplot2 package to visualise the data:

> library(ggplot2)
> ggplot(train, aes(x = first_parameter, y = second_parameter)) + geom_point(size = 3.5, color = "blue") + xlab("first_parameter") + ylab("second_parameter") + ggtitle("visibility of the data")
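If ggplot2 is not installed yet, the difference between univariate and bivariate plots can also be sketched with base R graphics. This is an alternative to the ggplot2 call, not the same method, and it assumes the built-in mtcars data rather than the tutorial's data set.

```r
# Univariate: distribution of a single variable
h <- hist(mtcars$mpg,
          main = "Distribution of mpg",
          xlab = "mpg")

# Bivariate: relationship between two variables
plot(mtcars$wt, mtcars$mpg,
     pch = 19, col = "blue",
     xlab = "wt", ylab = "mpg",
     main = "mpg against wt")
```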

Dealing with Continuous or Categorical Variables:

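These variables need special attention during the analysis. Find the missing values and impute them using the median. In the example below, combi is assumed to be a data frame combining the train and test sets, and a value of 0 is treated as missing.

> combi$parameter <- ifelse(combi$parameter == 0, median(combi$parameter), combi$parameter)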
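A worked sketch of median imputation on made-up vectors follows. As a variation on the expression above, the median here is computed from the non-missing values only, so the placeholder zeros do not pull it down; the variable names are hypothetical.

```r
# case 1: zeros mark missing values
visibility <- c(12, 0, 15, 9, 0, 20)
med <- median(visibility[visibility != 0])        # 13.5
imputed <- ifelse(visibility == 0, med, visibility)

# case 2: missing values stored as NA
weight <- c(4, NA, 10, 6)
weight_imputed <- ifelse(is.na(weight),
                         median(weight, na.rm = TRUE),   # 6
                         weight)
```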

Now move to the categorical variables and correct the mismatched level names found during data exploration.

> levels(combi$parameter)[1] <- "Other"
> library(plyr)
> combi$parameter <- revalue(combi$parameter, c("LF" = "Low Fat", "reg" = "Regular"))
> combi$parameter <- revalue(combi$parameter, c("low fat" = "Low Fat"))
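The same kind of clean-up can be done without plyr. The sketch below is a base R alternative to revalue(), applied to a made-up factor f: assigning the same name to several levels merges them.

```r
f <- factor(c("LF", "Low Fat", "reg", "Regular", "low fat"))

# merge the inconsistent spellings into one level each
levels(f)[levels(f) %in% c("LF", "low fat")] <- "Low Fat"
levels(f)[levels(f) == "reg"] <- "Regular"

levels(f)    # "Low Fat" "Regular"
table(f)     # Low Fat: 3, Regular: 2
```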

Data Manipulation:

Create new variables to extract and present new information to the model, so that it can make more accurate predictions. Now is the time to find the variables that can affect your model. This is part of feature engineering.

> library(dplyr)

Label Encoding: Label encoding means replacing the levels of a categorical variable with numbers. In the example above, our parameter has two levels, Low Fat and Regular. We can encode Regular as 1 and Low Fat as 0 with the expression:

> combi$parameter <- ifelse(combi$parameter == "Regular", 1, 0)

One Hot Encoding: It splits a categorical variable into its unique levels, creating one binary column per level, and removes the original variable from the data set.

> sample <- select(combi, parameter)
> demo <- data.frame(model.matrix(~ . - 1, sample))
> head(demo)

Here model.matrix creates a matrix with the encoded variables.

~ . - 1 tells R to encode all the variables in the data frame; the -1 drops the intercept column.
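Both encodings can be compared on a tiny example. The sketch below assumes a made-up data frame d with a single factor column fat and uses only base R.

```r
d <- data.frame(fat = factor(c("Low Fat", "Regular", "Low Fat")))

# label encoding: one numeric column
label <- ifelse(d$fat == "Regular", 1, 0)       # 0 1 0

# one hot encoding: one 0/1 column per level
onehot <- model.matrix(~ . - 1, data = d)
colnames(onehot)      # "fatLow Fat" "fatRegular"
onehot
```

Each row of onehot contains exactly one 1, in the column of that row's level.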

Predictive Modeling with Machine Learning:

Before starting, use the dplyr package to remove identifier variables and columns that have already been converted into other variables:

> combi <-  select(combi, -c(parameter1, parameter2, parameter3, parameter4, parameter5))
> str(combi)

a). Linear (Multiple) Regression:

Multiple regression is used when there are several predictors and the response variable is continuous. It relies on the following assumptions:

1). There is a linear relationship between the response and the predictor variables.

2). The predictor variables should not be correlated with each other; otherwise the model suffers from multicollinearity.

3). The error terms should have constant variance and be uncorrelated with each other.

Now create a regression model on the data set with the lm() function:

> linear_model <- lm(parameter ~ . , data = new_train)

> summary(linear_model)

Check the correlation between the variables (cor() requires all columns to be numeric):

> cor(new_train)
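Since new_train is not defined in this tutorial, here is a runnable sketch of the same steps on the built-in mtcars data. The choice of mpg as the response and wt and hp as predictors is an assumption made for illustration.

```r
# fit a multiple regression with two predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)                 # coefficients, R-squared, p-values
coef(fit)                    # intercept and one slope per predictor

# root mean squared error on the training data
rmse <- sqrt(mean(residuals(fit)^2))

# correlation between the variables used in the model
cor(mtcars[, c("mpg", "wt", "hp")])
```

A high correlation between two predictors in the last table would be a hint of multicollinearity.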

b). Decision Trees:

Use the rpart package to implement the decision tree algorithm, and the caret package to validate it with cross-validation. Decision trees in rpart have a complexity parameter (cp) that represents the trade-off between accuracy and model complexity. A smaller cp results in a bigger tree, which may overfit; a larger cp results in a smaller tree, which may underfit.

# load useful libraries
> library(rpart)
> library(rpart.plot)
> library(caret)
# set tree control parameters: 5-fold cross-validation
> fitControl <- trainControl(method = "cv", number = 5)
# cp value(s) to try; pass a vector to search over several values
> cartGrid <- expand.grid(cp = (2.50) * 0.02)

# decision tree

> tree_model <- train(parameter ~ ., data = new_train, method = "rpart", trControl = fitControl, tuneGrid = cartGrid)
> print(tree_model)

Now let's build a decision tree with a chosen complexity parameter. Replace x with the cp value selected by the cross-validation above, for example 0.01 or 0.02.

> main_tree <- rpart(parameter ~ . , data = new_train, control = rpart.control(cp = x))
> prp(main_tree)

This plots the final tree; with a well-chosen cp, the model should give an improved RMSE.
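The effect of cp on tree size can be checked directly. The sketch below assumes the built-in mtcars data with mpg as the response, compares two arbitrary cp values, and uses two small helper functions (n_leaves and train_rmse) that are defined here and are not part of rpart.

```r
library(rpart)

# same data and formula, two different complexity parameters
small_cp <- rpart(mpg ~ ., data = mtcars, control = rpart.control(cp = 0.01))
large_cp <- rpart(mpg ~ ., data = mtcars, control = rpart.control(cp = 0.9))

# helper: number of terminal nodes in a fitted tree
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

# helper: RMSE on the training data
train_rmse <- function(tree) {
  sqrt(mean((predict(tree, mtcars) - mtcars$mpg)^2))
}

n_leaves(small_cp); n_leaves(large_cp)
train_rmse(small_cp); train_rmse(large_cp)
```

The RMSE here is measured on the training data, so it always favours the bigger tree; cross-validation, as done with caret above, is what tells you whether the extra splits actually help.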

Final Words:

In this tutorial I've covered model building concepts from basic to advanced. Note that the decision tree obtained doesn't use the encoded variables. The main aim was to help beginners get started with predictive modelling in R.
