Essentials of Machine Learning Algorithms

  • Linear Regression
    • Type
    • Dataset: Auto Insurance
    • Result: Prediction of TotalPayment
    • Summary
  • Multiple Linear Regression
    • Type
    • Dataset: Iris dataset
    • Result: Prediction of Sepal.Length
    • Summary
  • Logistic Regression
    • Type
    • Dataset: Airline dataset
    • Result: Prediction of Arrival Delay (IsArrDelayed)
    • Summary
  • Time Series Analysis
  • K-Means Clustering
    • Type
    • Dataset: Crime dataset
    • Result: Classification of crime state-wise
    • Summary

 

1           Linear Regression

Linear regression is used when we have only two variables: one dependent variable and one independent variable. The linear equation between the two variables is given by: Y = a X + b. Examples: If we want to calculate the cost of a house, total sales year-wise, or temperature year-wise, then linear regression helps us model the prediction based on the independent variable.

 

1.1          Type

This algorithm can be used for both forecasting as well as for prediction (only for numerical values).

 

1.2         Dataset: Auto Insurance

The sample auto insurance (Sweden) dataset is shown in Table 1 below. There are two features in the dataset, i.e. NoofClaims and TotalPayment. In this dataset, TotalPayment is the total payment for all the claims, in thousands of Swedish kronor, for geographical zones in Sweden.

 

 

NoofClaims TotalPayment
108 392.5
19 46.2
13 15.7
124 422.2
40 119.4
57 170.9

 

Table 1: Sample auto insurance dataset
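
As a rough illustration of fitting Y = a X + b, the sketch below applies ordinary least squares to the six sample rows of Table 1 in plain Python. It is not the H2O model used in this article, and because it sees only six rows, its coefficients will not match that model. The function name fit_line is an assumption made for this example.

```python
# Six sample rows from Table 1 (not the full dataset).
claims = [108, 19, 13, 124, 40, 57]
payment = [392.5, 46.2, 15.7, 422.2, 119.4, 170.9]


def fit_line(xs, ys):
    """Return (a, b) that minimise the squared error of ys ~ a * xs + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


a, b = fit_line(claims, payment)
residuals = [y - (a * x + b) for x, y in zip(claims, payment)]
print(f"slope a = {a:.3f}, intercept b = {b:.3f}")
```

A least-squares line with an intercept always leaves residuals that sum to zero, which is a quick sanity check on any implementation.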

 

 

1.3         Result: Prediction of TotalPayment

Let’s predict the TotalPayment variable using linear regression. The prediction of TotalPayment depends on the feature NoofClaims. To do so, we divide the insurance dataset into a training set (70% of the data) and a testing set (the remaining 30%). We built the model on the training set, and the summary of the model is shown below:

H2ORegressionModel: glm
Reported on training data.

MSE: 1270.742
RMSE: 35.64747
Mean Residual Deviance: 1270.742
R^2: 0.8576389

Variable Importances: (Extract with h2o.varimp)
Standardized Coefficient Magnitudes: standardized coefficient magnitudes
  names coefficients sign
1 NoofClaims 81.428357 POS

The above summary reports the model's goodness of fit, error, and coefficients. From the summary, the fit of the prediction model is 85.76% (R^2, the share of the variance in TotalPayment that the model explains), and the error is 35.65 (RMSE, in thousands of Swedish kronor). Surprisingly, the fit is high and so is the error. This suggests that either TotalPayment depends on parameters other than NoofClaims, or the training set does not capture all the variability of NoofClaims vs TotalPayment. The actual vs predicted values for the testing set are shown in Table 2 below. The table also shows the high degree of error: the differences between actual and predicted values are large.

 

Predicted Value Actual Value
69.82 15.7
212.98 170.9
43.79 20.9
102.35 39.6
50.30 48.8
105.61 134.9

Table 2: TotalPayment: Actual values vs Predicted values
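
To make the error figures concrete, the sketch below computes RMSE, mean absolute error, and R^2 for the six rows of Table 2 in plain Python. The helper names are assumptions for this example, and since only six test rows are used, the numbers are indicative rather than a full evaluation of the model.

```python
import math

# Rows of Table 2: predicted vs actual TotalPayment on the testing set.
predicted = [69.82, 212.98, 43.79, 102.35, 50.30, 105.61]
actual = [15.7, 170.9, 20.9, 39.6, 48.8, 134.9]


def rmse(pred, act):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, act)) / len(act))


def mae(pred, act):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(p - a) for p, a in zip(pred, act)) / len(act)


def r_squared(pred, act):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_act = sum(act) / len(act)
    ss_res = sum((a - p) ** 2 for p, a in zip(pred, act))
    ss_tot = sum((a - mean_act) ** 2 for a in act)
    return 1.0 - ss_res / ss_tot


test_rmse = rmse(predicted, actual)
test_mae = mae(predicted, actual)
test_r2 = r_squared(predicted, actual)
print(f"RMSE = {test_rmse:.2f}, MAE = {test_mae:.2f}, R^2 = {test_r2:.3f}")
```

On these six rows the RMSE comes out at roughly 40.9, in line with the large gaps between actual and predicted values visible in the table.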

 

 

1.4         Summary

  • The linear regression algorithm can be used for both forecasting as well as for prediction (only for numerical values).
  • Linearity in the dataset can be handled using this algorithm.
  • For applying the linear algorithm, we need to make sure that the data is highly linear and the variables are highly correlated with each other. Otherwise, the model shows a high fit along with a high degree of error.

 

2          Multiple Linear Regression

Multiple linear regression extends linear regression to more than one independent variable. It helps when a single variable cannot explain the output on its own and we want to predict a value from several inputs together. For two independent variables X1 and X2, the equation is Y = a*X1 + b*X2 + c. Examples: Predicting rainfall from minimum temperature, maximum temperature, humidity, and other parameters, or estimating annual sales from multiple parameters.
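
A minimal sketch of how such a model can be fitted, assuming two inputs and writing the model as y = a*x1 + b*x2 + c. It solves the least-squares normal equations with a small hand-written elimination routine. The data is synthetic, generated from a known plane purely for illustration, and the function names solve and fit_multiple are assumptions for this example, not part of H2O.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_multiple(X, y):
    """Least-squares coefficients, intercept last, via (X^T X) w = X^T y."""
    rows = [list(x) + [1.0] for x in X]  # constant column for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data (made up for this sketch) from y = 2*x1 + 3*x2 + 1.
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in X]

a, b, c = fit_multiple(X, y)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```

Because the synthetic data has no noise, the fit recovers the generating coefficients up to floating-point error. On real data the same equations give the best-fitting plane instead.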

 

2.1         Type

This algorithm can be used for both forecasting as well as for prediction (only for numerical values).

 

2.2        Dataset: Iris dataset

The sample iris dataset (built into R) is shown in Table 3 below. There are five features in the dataset. The variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width are measured in cm, and the variable Species has three classes, i.e. setosa, versicolor, and virginica (three related species of iris flower).

 

 

 

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.1 1.8 virginica

 

Table  3:  Sample iris dataset

 

2.3        Result: Prediction of Sepal.Length

Let’s predict the Sepal.Length variable using multiple linear regression. The prediction of Sepal.Length depends on the features Sepal.Width and Species. To do so, we divide the iris dataset into a training set (70% of the data) and a testing set (the remaining 30%). We built the model on the training set, and the summary of the model is shown below:

H2ORegressionModel: glm
Reported on training data.

MSE: 0.1790401
RMSE: 0.4231313
Mean Residual Deviance: 0.1790401
R^2: 0.6948533

Variable Importances: (Extract with h2o.varimp)
Standardized Coefficient Magnitudes: standardized coefficient magnitudes
  names coefficients sign
1 setosa 1.266123 NEG
2 virginica 0.617879 POS
3 Width 0.333738 POS
4 versicolor 0.161686 POS

The above summary reports the model's goodness of fit, error, and coefficients. In the coefficient table, setosa, virginica, and versicolor are the levels of Species, and Width refers to Sepal.Width. From the summary, the fit of the prediction model is 69.49% (R^2), and the error is 0.4231 (RMSE). The actual vs predicted values for the testing set are shown in Table 4 below:

 

Predicted Values Actual Values
4.793432 4.6
5.402796 5.4
5.021944 4.6
5.021944 5.0
4.641091 4.4
5.021944 4.8

Table 4: Sepal.Length: Actual values vs Predicted values
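
The 70/30 split used for the models in this article can be sketched in plain Python as follows. This is an illustration only, not the H2O call behind the results above. The function name train_test_split, the fixed seed, and the integer stand-in for the 150 iris rows are assumptions for this example.

```python
import random


def train_test_split(rows, train_percent=70, seed=42):
    """Shuffle a copy of rows, then cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(150))  # stand-in for the 150 rows of the iris dataset
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters for iris in particular: the rows are ordered by species, so taking the first 70% without shuffling would leave most of one species out of the training set.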

 

 

 

2.4        Summary

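  • The multiple linear regression algorithm can be used for both forecasting as well as for prediction (only for numerical values).
  • Datasets that a single-variable linear fit cannot describe can be handled using this algorithm, as long as the output is roughly linear in the combined inputs.
  • The output variable should be numerical, whereas there is no limitation for input variables. The input variables can be numeric as well as categorical.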
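
Categorical inputs such as Species are usually turned into indicator (one-hot) columns before a regression model can use them. The sketch below shows one way to do that for the two features used in this section. The level order and the function name encode_row are assumptions for this example, and H2O performs its own encoding internally.

```python
# Levels of the categorical feature Species (order chosen for this sketch).
SPECIES_LEVELS = ["setosa", "versicolor", "virginica"]


def encode_row(sepal_width, species):
    """Return a numeric feature vector: Sepal.Width plus one indicator per level."""
    indicators = [1.0 if species == level else 0.0 for level in SPECIES_LEVELS]
    return [sepal_width] + indicators


print(encode_row(3.5, "setosa"))     # [3.5, 1.0, 0.0, 0.0]
print(encode_row(3.0, "virginica"))  # [3.0, 0.0, 0.0, 1.0]
```

Each indicator column then gets its own coefficient, which is why the model summary above lists one coefficient per species rather than a single coefficient for Species.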

 

3          Logistic Regression

Logistic regression is used when the outcome takes discrete values (binary values like 0/1, yes/no, true/false, etc.). It calculates the probability of the outcome, which lies between 0 and 1.

Examples: If we want to predict a result such as pass/fail, whether an email is spam or not, whether it is raining or not, or whether an image contains a face or not, then logistic regression helps to model the binary prediction based on the independent variable(s).
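
The mapping to a probability between 0 and 1 is done by the logistic (sigmoid) function applied to a linear combination of the inputs. A minimal sketch follows. The weights, bias, and feature values are hypothetical numbers chosen only for illustration, not coefficients from any model in this article.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one row of numeric features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Turn a probability into a Yes/No label using a cut-off."""
    return "Yes" if probability >= threshold else "No"


# Hypothetical coefficients and input, for illustration only.
weights = [0.8, -0.3]
bias = -1.0
p = predict_probability([2.0, 1.0], weights, bias)
print(f"probability = {p:.3f}, label = {classify(p)}")
```

The threshold of 0.5 is a common default. A different cut-off can be chosen when one kind of misclassification is more costly than the other.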

 

3.1         Type

This algorithm can be used for forecasting, prediction, and classification of categorical values.

 

3.2        Dataset: Airline dataset

The sample airline dataset is shown in Table 5 below. There are more than 10 features in the dataset, but for the sake of simplicity only the required features are shown in Table 5. The dataset provides information about each flight: flight number, source, destination, arrival delay, departure delay, the distance between source and destination, whether there is any arrival or departure delay, etc. The goal is to predict from this dataset whether a flight has an arrival delay or not. This kind of categorical value can be predicted using logistic regression.

 

FlightNum Origin Dest ArrDelay DepDelay Distance IsArrDelayed IsDepDelayed
942 SYR BWI 23 17 273 Yes Yes
942 SYR BWI 7 8 273 Yes Yes
943 LGA SYR -9 0 198 No No
943 SYR BUF 28 14 134 Yes Yes
944 JFK UCA 12 3 191 Yes Yes
944 JFK UCA 8 0 191 Yes No

Table 5: Sample airline dataset

 

 

3.3        Result: Prediction of Arrival Delay (IsArrDelayed)

Let’s predict the IsArrDelayed categorical variable using logistic regression. The prediction of IsArrDelayed depends on the features ArrDelay, DepDelay, and IsDepDelayed. To do so, we divide the airline dataset into a training set (70% of the data) and a testing set (the remaining 30%). We built the model on the training set, and the summary of the model is shown below:

H2ORegressionModel: glm
Reported on training data.

MSE: 0.02115256
RMSE: 0.1454392
R^2: 0.9153332

Confusion Matrix for F1-optimal threshold:
        NO       YES      Error     Rate
NO      2094894  16       0.000008  =16/2094910
YES     0        1989301  0.000000  =0/1989301
Totals  2094894  1989317  0.000004  =16/4084211

Variable Importances: (Extract with h2o.varimp)
Standardized Coefficient Magnitudes: standardized coefficient magnitudes
  names coefficients sign
1 ArrDelay 279527 POS
2 NO 0.200458 NEG
3 YES 0.176553 POS
4 DepDelay 047439 POS

The above summary reports the model's goodness of fit, error, and coefficients. In the coefficient table, NO and YES are the levels of IsDepDelayed. From the summary, the fit of the prediction model is 91.53% (R^2), and the error is 0.1454 (RMSE). Logistic regression also provides a confusion matrix in the summary, a tabular representation of correctly vs incorrectly classified classes, from which we can read the classification error: here only 16 of 4,084,211 training rows are misclassified. Such a low error is expected, because ArrDelay, one of the input features, directly determines whether the arrival was delayed. The actual vs predicted values for the testing set are shown in Table 6 below:

 

Predicted Values Actual Values
Yes Yes
No No
Yes Yes
No No
Yes Yes
No No

Table 6: Arrival Delay: Actual values vs Predicted values
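
The error rates in the confusion matrix of the model summary can be reproduced by hand. The sketch below does so in plain Python, assuming that rows of the matrix are actual classes and columns are predicted classes, as the per-row error rates suggest. The variable names are chosen for this example.

```python
# Counts from the confusion matrix in the model summary.
# Assumption: rows are actual classes, columns are predicted classes.
tn, fp = 2094894, 16    # actual NO:  predicted NO, predicted YES
fn, tp = 0, 1989301     # actual YES: predicted NO, predicted YES

total = tn + fp + fn + tp
error_rate = (fp + fn) / total          # share of misclassified rows
accuracy = (tp + tn) / total            # share of correctly classified rows
precision = tp / (tp + fp)              # predicted YES that are truly YES
recall = tp / (tp + fn)                 # actual YES that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"total = {total}, error rate = {error_rate:.6f}")
print(f"precision = {precision:.6f}, recall = {recall:.6f}, F1 = {f1:.6f}")
```

The overall error rate matches the Totals row of the matrix (16/4084211), and recall is exactly 1 because no delayed flight was predicted as not delayed.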

 

 

3.4        Summary

  • The logistic regression algorithm can be used for forecasting, prediction, and classification of categorical values.
  • Non-linearity of categorical values in the dataset can be handled using this algorithm.
  • The output variable should be categorical, whereas there is no limitation for input variables. The input variables can be numeric as well as categorical.

 

4         Time Series Analysis

When the goal is to predict values or build a model based on timing parameters (years, days, hours, minutes, etc.), this kind of analysis comes into the picture.

Examples: Typical examples of time series analysis are forecasting the sales for the next year and predicting temperature from time series data.

 

 

 

Currently, H2O does not provide any support for time-series analysis. Development of the algorithm is in progress, and this is an open issue in the H2O community [1].

 

5          K-Means Clustering

When we want to group/classify values based on the grouping pattern of the data, then K-Means helps us to build such a model on top of the grouped data.

Examples: Grouping sales country-wise, quarter-wise, or product-wise is a typical use of K-Means. Another example: if we want to group people based on their nature, age, gender, and locality, then K-Means plays an important role in the model development.

 

5.1         Type

This algorithm can be used for classification of group data, that is, for clustering similar records without predefined labels.

 

5.2        Dataset: Crime dataset

The sample crime dataset is shown in Table 7 below. There are 5 features in the dataset, and for the sake of simplicity only a few tuples are shown in Table 7. The dataset provides information about each state with its murder, assault, urbanpop, and rape rate. The goal is to classify the states based on the crime rate in terms of murder, assault, urbanpop, and rape rate. This classification of group data can be done using the K-means clustering algorithm.

 

State Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9 276 91 40.6
Colorado 7.9 204 78 38.7

Table 7:  Sample crime dataset

 

 

5.3        Result:  Classification of crime state-wise

Let’s classify the crime rate using the numerical variables Murder, Assault, UrbanPop, and Rape with K-means clustering. K-means clustering is an unsupervised learning algorithm, hence it does not involve a target/dependent variable. We built the model using the crime dataset, and the summary of the model is shown below:

H2OClusteringMetrics: kmeans
Total Within SS: 38.14257
Between SS: 97.85743
Total SS: 136

Centroid Statistics:
  centroid size within cluster sum of squares
1 1 8.00000 9.13784
2 2 8.00000 6.76423
3 3 12.0000 16.08226
4 4 7.00000 6.15824

[1] https://0xdata.atlassian.net/browse/PUBDEV-2590

The above summary reports the error, the number of clusters, the cluster centroids, etc. From the summary, the total sum of squares is 136, of which 38.14 is the error within clusters and 97.86 is the error between clusters. K-means clustering also reports the error, centroid, and size of each cluster. From the output it is clear that cluster 3 is the largest, with 12 members. We can map the output back to the states to group them by crime rate, as shown in Table 8 below. From the table it is visible that Alaska, California, and Colorado have similar crime rates.

 

State Cluster Index
Alaska 3
California 3
Colorado 3
Iowa 1
Kansas 2
Kentucky 1

Table 8: Group of state with similar crime rate
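
For readers who want to see the mechanics, the following is a minimal sketch of K-means (Lloyd's algorithm) in plain Python, run on the six sample rows of Table 7. It is not the H2O implementation. The choice of k = 2, the standardization step, and the naive initialization from the first k points are assumptions for this example, so the cluster indices will not match Table 8.

```python
# Six sample rows from Table 7: Murder, Assault, UrbanPop, Rape.
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"]
rows = [
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
    [8.8, 190, 50, 19.5],
    [9.0, 276, 91, 40.6],
    [7.9, 204, 78, 38.7],
]


def standardize(data):
    """Scale every column to mean 0 and sample standard deviation 1."""
    columns = []
    for col in zip(*data):
        m = sum(col) / len(col)
        sd = (sum((v - m) ** 2 for v in col) / (len(col) - 1)) ** 0.5
        columns.append([(v - m) / sd for v in col])
    return [list(point) for point in zip(*columns)]


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid_of(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(c) / len(points) for c in zip(*points)]


def kmeans(points, k, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = [list(points[i]) for i in range(k)]  # naive init: first k points
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sqdist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = centroid_of(members)
    return labels, centroids


points = standardize(rows)
labels, centroids = kmeans(points, k=2)

grand_mean = centroid_of(points)
total_ss = sum(sqdist(p, grand_mean) for p in points)
within_ss = sum(sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
between_ss = sum(
    labels.count(j) * sqdist(centroids[j], grand_mean) for j in range(len(centroids))
)

for state, lab in zip(states, labels):
    print(state, lab + 1)
print(f"within = {within_ss:.3f}, between = {between_ss:.3f}, total = {total_ss:.3f}")
```

The within and between sums of squares always add up to the total. The same identity holds in the H2O summary above: 38.14257 + 97.85743 = 136.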

 

 

5.4        Summary

  • The K-means clustering algorithm is used for classification of group data (clustering).
  • It is an unsupervised learning algorithm, in which there is no target variable.
  • The outcome can be numerical as well as categorical values, whereas there is no limitation for input variables.
