Linear Regression in Machine Learning: part02
Welcome to the second part of the Linear Regression post. In the first post we looked at operations on the dataset; in this post we will perform the linear regression itself. Let’s start:
Training a Linear Regression Model
Let’s now begin to train our regression model. We first need to split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Price column. We will toss out the Address column because it only has text info that the linear regression model cannot use.
X and y array
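The code below is a minimal sketch of this step, assuming the DataFrame from the first post is named df and using the column names that appear in the coefficient table later in this post:
# Sketch: assumes the DataFrame from part 1 is loaded as `df`
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
        'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']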
Train Test Split
Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
Model Evaluation
Let’s evaluate the model by checking out its coefficients and seeing how we can interpret them.
# print the intercept
print(lm.intercept_)
-2640159.79685
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
| | Coefficient |
| --- | --- |
| Avg. Area Income | 21.528276 |
| Avg. Area House Age | 164883.282027 |
| Avg. Area Number of Rooms | 122368.678027 |
| Avg. Area Number of Bedrooms | 2233.801864 |
| Area Population | 15.150420 |
Interpreting the coefficients (a quick sanity check in code follows this list):
- Holding all other features fixed, a 1 unit increase in **Avg. Area Income** is associated with an **increase of $21.52**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area House Age** is associated with an **increase of $164883.28**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Rooms** is associated with an **increase of $122368.67**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Bedrooms** is associated with an **increase of $2233.80**.
- Holding all other features fixed, a 1 unit increase in **Area Population** is associated with an **increase of $15.15**.
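To sanity-check one of these statements, here is a small sketch: take a row from the test set, bump Avg. Area House Age by one unit, and the prediction should move by roughly the coefficient shown above (the column name is an assumption based on the table; adjust it if your DataFrame uses different names).
# Sketch: a 1-unit bump in one feature shifts a linear model's prediction by its coefficient.
# The column name 'Avg. Area House Age' is assumed from the coefficient table above.
sample = X_test.iloc[[0]].copy()
bumped = sample.copy()
bumped['Avg. Area House Age'] += 1
print(lm.predict(bumped) - lm.predict(sample))  # roughly 164883.28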
Does this make sense? Probably not, because I made up this data. If you want real data to repeat this sort of analysis, check out the [boston dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) (note that load_boston is deprecated and has been removed in recent versions of scikit-learn):
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
Predictions from the Model
Let’s grab predictions from our test set and see how well the model did:
predictions = lm.predict(X_test)
import matplotlib.pyplot as plt

# Scatter plot of actual vs. predicted prices
plt.scatter(y_test, predictions)
<matplotlib.collections.PathCollection at 0x142622c88>
Residual Histogram
import seaborn as sns

# Histogram of the residuals (actual minus predicted prices)
sns.distplot(y_test - predictions);
Regression Evaluation Metrics
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute values of the errors:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Comparing these metrics:
- MAE is the easiest to understand, because it is the average error.
- MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
- RMSE is more popular than MSE, because RMSE is interpretable in the “y” units.

All of these are loss functions, because we want to minimize them.
import numpy as np
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
MAE: 82288.2225191
MSE: 10460958907.2
RMSE: 102278.829223
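If you want to verify these numbers without scikit-learn, here is a minimal NumPy sketch that computes the same three formulas by hand, using the y_test and predictions arrays from above:
import numpy as np

# Hand-rolled versions of the three metrics defined above
errors = np.asarray(y_test) - np.asarray(predictions)
mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error
print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)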
This was your Machine Learning Project!
For the first part of this project, click here.
Best of luck!!!