- Linear Regression
  - Type
  - Dataset: Auto Insurance
  - Result: Prediction of TotalPayment
  - Summary
- Multiple Linear Regression
  - Type
  - Dataset: Iris dataset
  - Result: Prediction of Sepal.Length
  - Summary
- Logistic Regression
  - Type
  - Dataset: Airline dataset
  - Result: Prediction of Arrival Delay (IsArrDelayed)
  - Summary
- Time Series Analysis
- K-Means Clustering
  - Type
  - Dataset: Crime dataset
  - Result: Classification of crime state-wise
  - Summary

# 1 Linear Regression

Linear regression is used when we have only two variables: one is the dependent variable and the other is the independent variable. The linear equation between the two variables is given by: Y = a*X + b. Examples: if we want to calculate the cost of a house, total sales year-wise, or temperature values year-wise, then linear regression helps us to model the prediction based on the independent variable.

## 1.1 Type

This algorithm can be used for both forecasting and prediction (numerical values only).

## 1.2 Dataset: Auto Insurance

The sample auto insurance (Sweden) dataset is shown in the following Table 1. There are two features in the dataset, as shown in Table 1: NoofClaims and TotalPayment. In this dataset, TotalPayment is the total payment for all claims, in thousands of Swedish kronor, for geographical zones in Sweden.

| NoofClaims | TotalPayment |
|---|---|
| 108 | 392.5 |
| 19 | 46.2 |
| 13 | 15.7 |
| 124 | 422.2 |
| 40 | 119.4 |
| 57 | 170.9 |

Table 1: Sample auto insurance dataset

## 1.3 Result: Prediction of TotalPayment

Let’s predict the TotalPayment variable using linear regression. The prediction of TotalPayment depends on the feature NoofClaims. To predict the TotalPayment value using linear regression, we need to divide the insurance dataset into a training and a testing set: 70% of the data falls into the training set and the remaining 30% into the testing set. We built the model using the training set, and the summary of the model is shown below:

H2ORegressionModel: glm
Reported on training data.

MSE: 1270.742
RMSE: 35.64747
Mean Residual Deviance: 1270.742
R^2: 0.8576389

Variable Importances: (Extract with h2o.varimp)

Standardized Coefficient Magnitudes:
names coefficients sign
1 NoofClaims 81.428357 POS

The above summary provides information about the accuracy/efficiency, the error, the coefficients, etc. From the summary we can see that the accuracy of the prediction model is 85.76% (R^2) and the error is 35.64 (RMSE). The surprising thing here is that the efficiency of the model is high and yet the error is also high. So the prediction of TotalPayment does not depend only on NoofClaims but also on other parameters, or the training set does not capture all of the variability of NoofClaims vs TotalPayment. The actual values vs the predicted values for the testing set are shown in the following Table 2. The table also captures the high degree of error, as the differences between the actual values and the predicted values are large.

| Predicted Value | Actual Value |
|---|---|
| 69.82 | 15.7 |
| 212.98 | 170.9 |
| 43.79 | 20.9 |
| 102.35 | 39.6 |
| 50.30 | 48.8 |
| 105.61 | 134.9 |

Table 2: TotalPayment: Actual values vs Predicted values
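As a language-neutral illustration of the fitting step (the report itself fits the model with H2O's GLM, not with the code below), the least-squares line Y = a*X + b can be computed directly from the Table 1 sample. This is a minimal Python sketch; the 70/30 split is omitted and the whole sample is used for fitting:

```python
# Minimal sketch of simple linear regression (ordinary least squares)
# on the Table 1 auto insurance sample. Illustrative only: the report
# uses H2O's GLM, not this hand-rolled fit.

data = [(108, 392.5), (19, 46.2), (13, 15.7),
        (124, 422.2), (40, 119.4), (57, 170.9)]  # (NoofClaims, TotalPayment)

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Slope a and intercept b of Y = a*X + b via least squares.
a = sum((x - mean_x) * (y - mean_y) for x, y in data) / \
    sum((x - mean_x) ** 2 for x, _ in data)
b = mean_y - a * mean_x

def predict(no_of_claims):
    """Predicted TotalPayment (thousands of kronor) for a claim count."""
    return a * no_of_claims + b
```

On this six-row sample the fitted slope is roughly 3.75, i.e. each additional claim adds about 3.75 thousand kronor of predicted total payment.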

## 1.4 Summary

- The linear regression algorithm can be used for both forecasting and prediction.
- Linearity in the dataset can be handled using this algorithm.
- Before applying linear regression, we need to make sure that the data is highly linear and the variables are highly correlated with each other. Otherwise, the model reports high efficiency along with a high degree of error.

# 2 Multiple Linear Regression

It is the extension of linear regression used when we have more than one independent variable; if we want to predict the value of the outcome, then multiple linear regression helps us to model the prediction. The multiple linear equation is given by: Y = a*X1 + b*X2 + c. Examples: calculation of rainfall values, which depend on min. temperature, max. temperature, humidity, and other parameters; or annual sales calculated from multiple parameters using multiple linear regression.

## 2.1 Type

This algorithm can be used for both forecasting and prediction (numerical values only).

## 2.2 Dataset: Iris dataset

The sample iris dataset (a built-in R dataset) is shown in the following Table 3. There are five features in the dataset, as shown in Table 3. The unit for the variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width is cm, and there are three classes for the variable Species, i.e. setosa, versicolor, and virginica (three related species of iris flower).

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 6.7 | 3.0 | 5.1 | 1.8 | virginica |

Table 3: Sample iris dataset

## 2.3 Result: Prediction of Sepal.Length

Let’s predict the Sepal.Length variable using multiple linear regression. The prediction of Sepal.Length depends on the features Sepal.Width and Species. To predict the Sepal.Length value using multiple linear regression, we need to divide the iris dataset into a training and a testing set: 70% of the data falls into the training set and the remaining 30% into the testing set. We built the model using the training set, and the summary of the model is shown below:

H2ORegressionModel: glm
Reported on training data.

MSE: 0.1790401
RMSE: 0.4231313
Mean Residual Deviance: 0.1790401
R^2: 0.6948533

Variable Importances: (Extract with h2o.varimp)

Standardized Coefficient Magnitudes:
names coefficients sign
1 setosa 1.266123 NEG
2 virginica 0.617879 POS
3 Sepal.Width 0.333738 POS
4 versicolor 0.161686 POS

The above summary provides information about the accuracy/efficiency, the error, the coefficients, etc. From the summary we can see that the accuracy of the prediction model is 69.48% (R^2) and the error is 0.4231 (RMSE). The actual values vs the predicted values for the testing set are shown in the following Table 4:

| Predicted Values | Actual Values |
|---|---|
| 4.793432 | 4.6 |
| 5.402796 | 5.4 |
| 5.021944 | 4.6 |
| 5.021944 | 5.0 |
| 4.641091 | 4.4 |
| 5.021944 | 4.8 |

Table 4: Sepal.Length: Actual values vs Predicted values
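To make the "Sepal.Width plus Species" design concrete, the sketch below fits the same kind of model on the Table 3 sample in plain Python: Species is one-hot encoded (setosa as the baseline, a dummy for virginica; versicolor has no rows in this sample) and the coefficients are obtained by solving the normal equations. This is an illustrative reconstruction, not the H2O GLM from the report:

```python
# Multiple linear regression sketch: Sepal.Length ~ Sepal.Width + Species,
# fit on the Table 3 sample by solving the normal equations (X^T X) beta = X^T y
# with Gauss-Jordan elimination. Illustrative only; the report uses H2O's GLM.

rows = [  # (Sepal.Length, Sepal.Width, Species)
    (5.1, 3.5, "setosa"), (4.9, 3.0, "setosa"), (4.7, 3.2, "setosa"),
    (4.6, 3.1, "setosa"), (5.0, 3.6, "setosa"),
    (6.7, 3.3, "virginica"), (6.7, 3.0, "virginica"),
]

def encode(width, species):
    # Intercept, Sepal.Width, and a dummy for virginica
    # (setosa is the baseline level in this tiny sample).
    return [1.0, width, 1.0 if species == "virginica" else 0.0]

X = [encode(w, s) for _, w, s in rows]
y = [l for l, _, _ in rows]

# Build the normal equations and solve by Gauss-Jordan elimination.
k = len(X[0])
A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
for i in range(k):
    p = A[i][i]
    A[i] = [v / p for v in A[i]]
    b[i] /= p
    for j in range(k):
        if j != i:
            f = A[j][i]
            A[j] = [vj - f * vi for vj, vi in zip(A[j], A[i])]
            b[j] -= f * b[i]
beta = b  # [intercept, Sepal.Width coefficient, virginica coefficient]

def predict(width, species):
    return sum(c * x for c, x in zip(beta, encode(width, species)))
```

The fitted virginica dummy comes out strongly positive, matching the intuition (and the H2O coefficient signs above) that virginica flowers have longer sepals than setosa at the same sepal width.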

## 2.4 Summary

- The multiple linear regression algorithm can be used for both forecasting and prediction.
- Non-linearity in the dataset can be handled using this algorithm.
- The output variable should be numerical, whereas there is no limitation on the input variables: the input variables can be numeric as well as categorical.

# 3 Logistic Regression

It is used when the outcome takes discrete values (binary values like 0/1, yes/no, true/false, etc.). It calculates the probability of the outcome and maps it to a value between 0 and 1.

Examples: if we want to predict an outcome such as pass/fail, whether an email is spam or not, whether it is raining or not, or whether an image contains a face or not, then logistic regression helps us to model the binary prediction based on the independent variable(s).

## 3.1 Type

This algorithm can be used for forecasting, prediction, and classification of categorical values.

## 3.2 Dataset: Airline dataset

The sample airline dataset is shown in the following Table 5. There are more than 10 features in the dataset, but for the sake of simplicity only the required features are shown in Table 5. The dataset provides information about the airline with flight number, source, destination, arrival delay, departure delay, the distance between source and destination, whether there was any arrival or departure delay, etc. The goal is to predict whether there is any arrival delay for a flight using this dataset. Such a categorical value can be predicted using logistic regression.

| FlightNum | Origin | Dest | ArrDelay | DepDelay | Distance | IsArrDelayed | IsDepDelayed |
|---|---|---|---|---|---|---|---|
| 942 | SYR | BWI | 23 | 17 | 273 | Yes | Yes |
| 942 | SYR | BWI | 7 | 8 | 273 | Yes | Yes |
| 943 | LGA | SYR | -9 | 0 | 198 | No | No |
| 943 | SYR | BUF | 28 | 14 | 134 | Yes | Yes |
| 944 | JFK | UCA | 12 | 3 | 191 | Yes | Yes |
| 944 | JFK | UCA | 8 | 0 | 191 | Yes | No |

Table 5: Sample airline dataset

## 3.3 Result: Prediction of Arrival Delay (IsArrDelayed)

Let’s predict the IsArrDelayed categorical variable using logistic regression. The prediction of IsArrDelayed depends on the features ArrDelay, DepDelay, and IsDepDelayed. To predict the IsArrDelayed value using logistic regression, we need to divide the airline dataset into a training and a testing set: 70% of the data falls into the training set and the remaining 30% into the testing set. We built the model using the training set, and the summary of the model is shown below:

H2ORegressionModel: glm
Reported on training data.

MSE: 0.02115256
RMSE: 0.1454392
R^2: 0.9153332

Confusion Matrix for F1-optimal threshold:

| | NO | YES | Error | Rate |
|---|---|---|---|---|
| NO | 2094894 | 16 | 0.000008 | =16/2094910 |
| YES | 0 | 1989301 | 0.000000 | =0/1989301 |
| Totals | 2094894 | 1989317 | 0.000004 | =16/4084211 |

Variable Importances: (Extract with h2o.varimp)

Standardized Coefficient Magnitudes:
names coefficients sign
1 ArrDelay 0.279527 POS
2 NO 0.200458 NEG
3 YES 0.176553 POS
4 DepDelay 0.047439 POS
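The per-class and total error rates shown in the confusion matrix can be reproduced from its raw counts; this short Python check uses the counts exactly as they appear in the H2O summary:

```python
# Recompute the confusion matrix error rates from the raw counts
# reported in the H2O summary above.
no_no, no_yes = 2094894, 16      # rows whose actual class is NO
yes_no, yes_yes = 0, 1989301     # rows whose actual class is YES

no_err = no_yes / (no_no + no_yes)                      # NO-row error rate
yes_err = yes_no / (yes_no + yes_yes)                   # YES-row error rate
total_err = (no_yes + yes_no) / (no_no + no_yes + yes_no + yes_yes)
```

Each rate is simply (misclassified rows) / (total rows) for that slice of the matrix, matching the "=16/2094910"-style fractions in the Rate column.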

The above summary provides information about the accuracy/efficiency, the error, the coefficients, etc. From the summary we can see that the accuracy of the prediction model is 91.53% (R^2) and the error is 0.1454 (RMSE). Logistic regression also provides the confusion matrix in the summary, which is a tabular representation of correctly classified vs incorrectly classified classes. We can also identify the (classification) error using this confusion matrix. The actual values vs the predicted values for the testing set are shown in the following Table 6:

| Predicted Values | Actual Values |
|---|---|
| Yes | Yes |
| No | No |
| Yes | Yes |
| No | No |
| Yes | Yes |
| No | No |

Table 6: Arrival Delay: Actual values vs Predicted values
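The mechanics of logistic regression (the sigmoid mapping of a linear score to a probability between 0 and 1) can be sketched in a few lines of plain Python. This toy fit uses only DepDelay from the Table 5 sample as a single predictor and plain gradient descent; the learning rate and iteration count are arbitrary choices, and the report's actual model is H2O's GLM with three predictors:

```python
# Toy logistic regression: P(IsArrDelayed = Yes | DepDelay), fit by
# gradient descent on the Table 5 sample. Illustrative only; the
# report's model is H2O's GLM with ArrDelay, DepDelay, IsDepDelayed.
import math

rows = [(17, 1), (8, 1), (0, 0), (14, 1), (3, 1), (0, 1)]  # (DepDelay, delayed?)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.05  # weight, bias, learning rate (assumed values)
for _ in range(5000):
    gw = gb = 0.0
    for x, y in rows:
        p = sigmoid(w * x + b)
        gw += (p - y) * x   # gradient of log-loss w.r.t. w
        gb += (p - y)       # gradient of log-loss w.r.t. b
    w -= lr * gw / len(rows)
    b -= lr * gb / len(rows)

def prob_delayed(dep_delay):
    """Probability that the arrival is delayed, given the departure delay."""
    return sigmoid(w * dep_delay + b)
```

The fitted probability increases monotonically with DepDelay, which is consistent with the positive DepDelay coefficient in the H2O summary.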

## 3.4 Summary

- The logistic regression algorithm can be used for forecasting, prediction, and classification.
- Non-linearity of the categorical value in the dataset can be handled using this algorithm.
- The output variable should be categorical, whereas there is no limitation on the input variables: the input variables can be numeric as well as categorical.

# 4 Time Series Analysis

When the goal is to predict values or build a model based on timing parameters (years, days, hours, minutes, etc.), then this kind of analysis comes into the picture.

Examples: the best examples of time series analysis are analyzing the sales for the next year, predicting temperature from time series data, etc.

**Note:** Currently, H2O doesn’t provide any support for performing time series analysis. The development of the algorithm is in progress, and this is an open issue in the H2O community^1.


# 5 K-Means Clustering

When we want to group/classify values based on grouping patterns in the data, then K-Means helps us to build such a model on top of the grouped data.

Examples: the classification of sales country-wise, quarter-wise, or product-wise are the best examples of K-Means. Other examples: if we want to group persons based on their nature, age, gender, and locality, then K-Means plays an important role in the model development.

## 5.1 Type

This algorithm can be used for classification of the group data.

## 5.2 Dataset: Crime dataset

The sample crime dataset is shown in the following Table 7. There are 5 features in the dataset, and for the sake of simplicity only a few tuples are shown in Table 7. The dataset provides information about each state with its murder, assault, urban population, and rape rates. The goal is to classify the states based on their crime rates in terms of murder, assault, urban population, and rape. Such classification of grouped data can be done using the K-means clustering algorithm.

| State | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9 | 276 | 91 | 40.6 |
| Colorado | 7.9 | 204 | 78 | 38.7 |

Table 7: Sample crime dataset

## 5.3 Result: Classification of crime state-wise

Let’s classify the crime rate using the Murder, Assault, UrbanPop, and Rape variables with K-means clustering. The K-means clustering algorithm is an unsupervised learning algorithm, hence it doesn’t involve a target/dependent variable. We built the model using the crime dataset, and the summary of the model is shown below:

H2OClusteringMetrics: kmeans

Total Within SS: 38.14257
Between SS: 97.85743
Total SS: 136

Centroid Statistics:

| centroid | size | within cluster sum of squares |
|---|---|---|
| 1 | 8.00000 | 9.13784 |
| 2 | 8.00000 | 6.76423 |
| 3 | 12.0000 | 16.08226 |
| 4 | 7.00000 | 6.15824 |

^1 https://0xdata.atlassian.net/browse/PUBDEV-2590

The above summary provides information about the error, the number of clusters, the centroids of the clusters, etc. From the summary we can see that the total error is 136, the error within clusters is 38.142, and the error between clusters is 97.85. K-means clustering also provides the error of each cluster, the centroid of each cluster, and the size of each cluster. From the output it is clear that cluster 3 is the largest, with a size of 12. We can map the output back to the states to group them by crime rate, as shown in Table 8. The state values vs the cluster index are shown in the following Table 8. From the table it is visible that the Alaska, California, and Colorado states have similar crime rates.

| State | Cluster Index |
|---|---|
| Alaska | 3 |
| California | 3 |
| Colorado | 3 |
| Iowa | 1 |
| Kansas | 2 |
| Kentucky | 1 |

Table 8: Group of state with similar crime rate
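The clustering step itself can be sketched in plain Python on the Table 7 sample. This is an illustrative reconstruction, not H2O's implementation: features are min-max scaled (an assumed preprocessing choice, so that the large Assault counts don't dominate the distances), k = 2 is chosen to suit the six-row sample, and the initial centroids are deterministically taken from the first two rows rather than chosen at random:

```python
# Pure-Python k-means sketch on the Table 7 sample. Illustrative only:
# the report uses H2O's k-means; scaling, k=2, and the deterministic
# initialization are assumptions made for this tiny example.

rows = {
    "Alabama":    [13.2, 236, 58, 21.2],
    "Alaska":     [10.0, 263, 48, 44.5],
    "Arizona":    [8.1, 294, 80, 31.0],
    "Arkansas":   [8.8, 190, 50, 19.5],
    "California": [9.0, 276, 91, 40.6],
    "Colorado":   [7.9, 204, 78, 38.7],
}

# Min-max scale each feature column to [0, 1].
cols = list(zip(*rows.values()))
lo = [min(c) for c in cols]
hi = [max(c) for c in cols]
scaled = {s: [(v - l) / (h - l) for v, l, h in zip(vec, lo, hi)]
          for s, vec in rows.items()}

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    cents = points[:k]  # deterministic init from the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid
            clusters[min(range(k), key=lambda i: dist2(p, cents[i]))].append(p)
        # update step: each centroid moves to the mean of its cluster
        cents = [[sum(d) / len(c) for d in zip(*c)] if c else cents[i]
                 for i, c in enumerate(clusters)]
    return cents

cents = kmeans(list(scaled.values()), k=2)
assign = {s: min(range(2), key=lambda i: dist2(p, cents[i]))
          for s, p in scaled.items()}
```

Even on this six-row sample the grouping echoes Table 8: Alaska, California, and Colorado (high Rape/UrbanPop profiles) land in one cluster, away from Alabama and Arkansas.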

## 5.4 Summary

- The K-means clustering algorithm is used for classification of grouped data.
- It is an unsupervised learning algorithm, in which no target variable exists.
- The outcome can be numerical as well as categorical values, and there is no limitation on the input variables.