This post explains how to install R in a Windows environment and how to work through a simple machine learning project in R using a built-in dataset.
First, download the latest version of R from https://cran.r-project.org/bin/windows/base/.
1. Once the download completes, install it on your machine. This is like any other software installation; no special instructions are required.
2. After a successful installation, set up the path: right-click My Computer, open Environment Variables, and under System variables add the R bin directory to PATH:
C:\Program Files\R\R-3.5.1\bin
3. After setting up the path, we can start R.
4. Open a command prompt and type R.
5. You should now see the R terminal.
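If you want a quick sanity check that the install worked, you can type the following two lines (a minimal sketch, not part of the installation steps) at the R prompt:

# Print the installed R version string, e.g. "R version 3.5.1 ..."
R.version.string
# A trivial expression; the terminal should print [1] 2
1 + 1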
6. Next, let us understand what machine learning is and what a dataset is.
7. When we apply machine learning to our own datasets, we are working on a project.
The process of a machine learning project may not be linear, but there are a number of well-known steps:
Define Problem.
Prepare Data.
Evaluate Algorithms.
Improve Results.
Present Results.
8. The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end, covering the key steps: loading data, summarizing it, evaluating algorithms, and making predictions.
Machine Learning using R Step By Step
Now it is time to work through a simple machine learning program in R, using the built-in dataset called iris.
We have already installed and started R.
Packages are third-party add-ons or libraries that we can use in R. Install the ones we need with the following syntax.
install.packages("caret") //While installing the package, after typing the above command , it will ask us for select mirror, you can select default one. install.packages(“caret”,dependencies=c(“Depends”,”Suggests”)) install.packages(“ellipse”) //Load the package, which we are going to use. libray(caret)Load the data from inbuilt data and rename the same using following syntax.
# Attach the iris dataset to the current environment
data(iris)
# Rename the iris dataset to dataset
dataset <- iris

The iris data is now loaded in R and accessible through the variable called dataset. Next we will create a validation dataset. We will split the loaded dataset in two: 80% of it will be used to train our models, and 20% will be held back as a validation dataset.
# Create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# Select 20% of the data for validation
validation <- dataset[-validation_index,]
# Use the remaining 80% of the data to train and test the models
dataset <- dataset[validation_index,]

We now have training data in the dataset variable and a validation set we will use later in the validation variable. Note that we replaced our dataset variable with the 80% sample of the dataset.

1. The dim function
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the dim function.
dim(dataset)

2. Attribute types
Knowing the types is important, as it gives us an idea of how to better summarize the data we have and the types of transforms we might need to apply before modeling.
sapply(dataset, class)

3. The head function
The head function displays the first six rows of the data.
head(dataset)

4. The class variable
The class variable is a factor: a variable that has multiple class labels, or levels.
levels(dataset$Species)

5. Class distribution
Let us now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count and as a percentage, as shown in the sketch below.
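One way to produce both views, sketched here with base R's table and prop.table functions (the exact code may differ from the original post), is:

# Absolute counts of each class, and the same counts as percentages
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)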
6. Summary of each attribute

summary(dataset)

Visualize the Dataset
We have now seen the basic details about the data. We need to extend that with some visualizations. We are going to look at two types of plots:
1. Univariate plots to better understand each attribute.
2. Multivariate plots to better understand the relationships between attributes.
First we will look at the univariate plots, one for each individual variable. We start by splitting the data into the input attributes (x) and the output attribute (y).
# Split input and output
x <- dataset[,1:4]
y <- dataset[,5]

Given that the input variables are numeric, we can create box-and-whisker plots of each.
# Box-and-whisker plot for each input attribute
par(mfrow=c(1,4))
for(i in 1:4) {
  boxplot(x[,i], main=names(iris)[i])
}

We can also create a barplot of the Species class variable to get a graphical representation of the class distribution (generally uninteresting in this case because the classes are even).
plot(y)

This confirms what we learned in the last section: the instances are evenly distributed across the three classes.

Multivariate Plots
First let us look at scatterplots of all pairs of attributes, with the points colored by class. Because the scatterplots show that points for each class are generally separate, we can also draw ellipses around them.
featurePlot(x=x, y=y, plot="ellipse")

We can also look at box-and-whisker plots of each input variable again, but this time broken down into separate plots for each class. This can help tease out obvious linear separations between the classes.
featurePlot(x=x, y=y, plot="box")

Next we can get an idea of the distribution of each attribute, again broken down by class value, as with the box-and-whisker plots. Sometimes histograms are good for this, but in this case we will use probability density plots to give nice smooth lines for each distribution.
# Density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)

Evaluating the Algorithms
Set up the test harness to use 10-fold cross-validation: we split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits. We use the metric "Accuracy" to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use the metric variable when we build and evaluate each model next.
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

Build five different models to predict species from the flower measurements:
Linear Discriminant Analysis (LDA)
Classification and Regression Trees (CART)
k-Nearest Neighbors (kNN)
Support Vector Machines (SVM) with a radial kernel
Random Forest (RF)
# a) linear algorithms
# LDA
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)

We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits; this makes the results directly comparable.

Select the Best Model
We now have five models and an accuracy estimate for each. We need to compare the models to each other and select the most accurate. We can report on the accuracy of each model by first creating a list of the fitted models and then using the summary function.
# Summarize accuracy of models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).
dotplot(results)

The results for the best model can also be summarized on their own. This gives a nice summary of what was used to train the model and the mean and standard deviation (SD) accuracy achieved, specifically 97.5% accuracy +/- 4%.
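One way to display that summary is to print the fitted LDA model; this sketch assumes the fit.lda object created earlier:

# Print the training setup and cross-validation accuracy of the best model
print(fit.lda)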
Making Predictions Using predict and the Confusion Matrix
LDA was the most accurate model. Now we want to get an idea of its accuracy on our validation set. This gives us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak; both would result in an overly optimistic result. We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.

predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
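As an optional usage example, the same predict call can classify a single new flower; the measurements below are hypothetical values invented for illustration:

# A hypothetical new flower (column names must match the training data)
new_flower <- data.frame(Sepal.Length=5.1, Sepal.Width=3.5,
                         Petal.Length=1.4, Petal.Width=0.2)
predict(fit.lda, new_flower)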