Predict Data with Linear Regression Analysis
Machine Learning is a branch of Artificial Intelligence based on the idea that systems can learn from data, identify hidden patterns, and make decisions with minimal human intervention.
Linear regression is one of the most widely used predictive modeling techniques. We identify one variable as the dependent variable (Y) and one or more other variables as independent variables (X). The fitted model consists of coefficients and an intercept; the number of coefficients depends on the number of independent variables.
In this post, let's see how to use a linear regression model to predict apparent temperature given the other related features present in the dataset.
The dataset that I have chosen for this example is the weather dataset of the city of Szeged, Hungary. It covers 10 years, 2006–2016, with hourly entries of weather-related features.
Data Set — https://www.kaggle.com/budincsevity/szeged-weather
Here I used Google Colab as my runtime environment. For better readability, I have divided this post into 3 sections.
- Data Preprocessing
- Feature Engineering
- Linear Regression Model applying
Section 1- Data Preprocessing
As the first step, we need to import the libraries that help us with preprocessing, modeling, and data visualization. The following libraries are used.
# All the library imports
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
import numpy as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn import linear_model
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.decomposition import PCA
We have a number of options for loading a dataset into Google Colab. Here I used pandas' read_csv. After that, to confirm the dataset had loaded, I printed the last 10 rows of the dataset.
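Since the original code cells are screenshots, here is a minimal sketch of this step. A small inline CSV sample stands in for the Kaggle file (in Colab you would pass the path of the downloaded CSV to read_csv instead).

```python
import io

import pandas as pd

# Inline stand-in for the weather CSV; in Colab, replace io.StringIO(...)
# with the path to the downloaded Kaggle file.
csv_data = io.StringIO(
    "Formatted Date,Precip Type,Temperature (C),Humidity\n"
    "2006-04-01 00:00:00.000 +0200,rain,9.47,0.89\n"
    "2006-04-01 01:00:00.000 +0200,rain,9.36,0.86\n"
)
df = pd.read_csv(csv_data)

# Print the last 10 rows to confirm the dataset loaded correctly
print(df.tail(10))
```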
Let’s check some information about our dataset. Here the output gives us all the column names, non-null count, and data type.
In the previous screenshot, we could see that Precip Type has a lower non-null count relative to the other columns. This gives us a clue that there are null values in this column. First, let's drop any duplicate records. We already know that Formatted Date must be unique, since there can't be two records for the same time period. So we can drop duplicates as follows.
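A minimal sketch of the duplicate-removal step, using a tiny synthetic DataFrame with one repeated hourly record in place of the real dataset:

```python
import pandas as pd

# Synthetic stand-in: the first two rows are exact duplicates
df = pd.DataFrame({
    "Formatted Date": ["2006-04-01 00:00", "2006-04-01 00:00", "2006-04-01 01:00"],
    "Temperature (C)": [9.47, 9.47, 9.36],
})

print("Records before:", df.shape[0])
df = df.drop_duplicates()  # removes fully duplicated rows
print("Records after:", df.shape[0])
```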
In the above result, we can clearly see that the numbers of records before and after duplicate removal are not the same: the existing duplicates have been removed from our dataset. In the same code cell, I have also removed the Loud Cover column from the dataset, as it has null values in all records. Next, let's check which columns have null values and what percentage each has.
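The two operations above can be sketched as follows, again on synthetic data with the assumed column names (Loud Cover all-null, Precip Type partially null):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Precip Type": ["rain", None, "snow", "rain"],
    "Loud Cover": [np.nan] * 4,   # null on every record
    "Humidity": [0.89, 0.86, 0.83, 0.83],
})

# Drop the column that is null on all records
df = df.drop(columns=["Loud Cover"])

# Percentage of missing values per column
null_pct = df.isnull().mean() * 100
print(null_pct)
```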
From the above output, we can clearly see that Precip Type has some missing values, which we have to deal with. There are many ways to handle missing values in a dataset. Here I am going to drop the records that have null values; the reason for choosing this method is that the percentage of null values is lower than 1%.
Let's check the percentage of missing values again to confirm that we have handled all of them.
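A sketch of dropping the null records and re-checking, on the same kind of synthetic data:

```python
import pandas as pd

df = pd.DataFrame({
    "Precip Type": ["rain", None, "snow"],
    "Humidity": [0.89, 0.86, 0.83],
})

# Drop the (rare) rows where Precip Type is null
df = df.dropna(subset=["Precip Type"])

# Re-check: every column should now report 0% missing
print(df.isnull().mean() * 100)
```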
Let’s plot the box plot for each column and see whether there are any outliers associated.
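A minimal box-plot sketch, using random stand-in data with one physically impossible zero injected into each of the Humidity and Pressure columns (the column names follow the Kaggle dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Humidity": np.append(rng.uniform(0.4, 1.0, 99), 0.0),            # one bogus zero
    "Pressure (millibars)": np.append(rng.normal(1016, 7, 99), 0.0),  # one bogus zero
})

# One box plot per column to eyeball outliers
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, col in zip(axes, df.columns):
    ax.boxplot(df[col])
    ax.set_title(col)
fig.savefig("boxplots.png")
```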
Looking at the box plots, we can clearly see that Pressure and Humidity have zero values. In a practical scenario, we know that neither pressure nor humidity can be zero, so these outliers must be handled to get a good result. In the other columns, many data points lie outside the whiskers of the box plot; because there are so many of them, we don't treat them as outliers.
To handle both Humidity and Pressure, we can drop those outlier records. The code cells below show the handling of outliers in these two columns.
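One way to sketch this filtering step (dropping rows where either column is zero):

```python
import pandas as pd

df = pd.DataFrame({
    "Humidity": [0.89, 0.0, 0.83],
    "Pressure (millibars)": [1015.1, 1014.9, 0.0],
})

# Physically impossible zeros are treated as outliers and dropped
df = df[(df["Humidity"] != 0) & (df["Pressure (millibars)"] != 0)]
print(df.shape[0])
```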
So now we have completed both missing value and outlier handling. Next, we are going to explore Q-Q plots and histograms to check whether we need to apply any transformations to the dataset.
The above code only generates a Q-Q plot and histogram for the Temperature (C) column. In the same way, we can plot all the columns by replacing the current_feature variable.
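A sketch of the Q-Q plot and histogram for one feature, with normally distributed random data standing in for Temperature (C); swapping what current_feature points at would plot any other column:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
temperature = rng.normal(12.0, 9.0, 500)  # stand-in for Temperature (C)

current_feature = temperature
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(current_feature, dist="norm", plot=ax1)  # Q-Q plot vs normal
ax2.hist(current_feature, bins=30)
ax2.set_title("Histogram")
fig.savefig("qq_hist.png")
```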
By looking at these plots we can see some features need a transformation. The reason is the distribution is not symmetrical. So the skewness is present. To eliminate this issue we can perform transformations like logarithm transformation, square root transformation, or exponential transformation.
The trick here is that if the distribution is right-skewed we apply a logarithm or square root transformation, whereas for a left-skewed distribution we apply an exponential (power) transformation.
We know that the logarithm of 0 is undefined, and some of our features contain zeros, so we have to use the square root transformation. The screenshot below shows applying the transformation and plotting the histograms again.
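A sketch of the square root transformation on a right-skewed stand-in feature (exponentially distributed, like wind speed, and containing values near zero where log would be problematic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Right-skewed stand-in for a feature such as Wind Speed (km/h)
wind = pd.Series(rng.exponential(10.0, 1000))

# Square root transformation reduces the right skew; unlike log, it is
# defined at 0
wind_sqrt = np.sqrt(wind)
print("skew before:", wind.skew(), "skew after:", wind_sqrt.skew())
```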
Next, we are going to apply an encoding technique. Take Precip Type: it has two string values, rain and snow. What we are going to do here is replace the string values with numeric values. The reason for encoding is that models only work with numerical data; if we want to use string data like this, we have to convert it using an encoding technique. Here we can use LabelEncoder(). The encoded features are Summary, Daily Summary, and Precip Type.
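A minimal sketch of label encoding for Precip Type (LabelEncoder assigns integers to the sorted class labels, so rain becomes 0 and snow becomes 1; the same pattern applies to Summary and Daily Summary):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Precip Type": ["rain", "snow", "rain"]})

le = LabelEncoder()
df["Precip Type"] = le.fit_transform(df["Precip Type"])
print(df["Precip Type"].tolist())  # [0, 1, 0]
```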
Next, we are going to standardize the features. The reason for standardization is that variables measured at different scales do not contribute equally to the model fit and the learned function, and might end up creating a bias. To deal with this potential problem, feature-wise standardization (μ=0, σ=1) is usually applied prior to model fitting. Before standardizing, we need to remove all categorical features.
I have used StandardScaler() from the sklearn.preprocessing library.
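A sketch of the standardization step on two toy numeric columns; after fit_transform, each column has mean 0 and (population) standard deviation 1:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Temperature (C)": [5.0, 10.0, 15.0],
    "Humidity": [0.9, 0.8, 0.7],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.mean().round(6))        # ~0 for every column
print(scaled.std(ddof=0).round(6))   # 1 for every column
```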
Now we have finished the preprocessing part. There are many alternative methods that achieve the same preprocessing in different ways.
Section 2- Feature Engineering
In this section, we’ll see how to identify significant and independent features and also apply PCA (Principal Component Analysis) for feature reduction.
Let's first discover which features we can use to fit our linear regression model. First, we can eliminate the non-numerical columns, namely Precip Type, Summary, Daily Summary, and Formatted Date. Then, among the remaining features, we have to look at the correlations between them; for this I used a correlation matrix.
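A sketch of the correlation matrix as a seaborn heatmap, on synthetic columns built so that temperature and apparent temperature are strongly correlated while wind speed is nearly independent:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
temp = rng.normal(12, 9, 200)
df = pd.DataFrame({
    "Temperature (C)": temp,
    "Apparent Temperature (C)": temp + rng.normal(0, 1, 200),
    "Humidity": -0.02 * temp + rng.normal(0.8, 0.05, 200),
    "Wind Speed (km/h)": rng.exponential(10, 200),
})

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.savefig("corr_matrix.png")
```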
Looking at this correlation matrix, we can clearly see the correlation between Apparent Temperature (C) and the other features. Wind Speed (km/h) and Wind Bearing (degrees) have the lowest correlations, which are nearly zero; the other features show good correlations in comparison. As all features have some correlation with Apparent Temperature (C), I will use all 6 features.
Next, I will separate the dependent variable, Apparent Temperature (C), from our preprocessed dataset. Let's refer to the remaining feature matrix as X from here onwards.
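This separation can be sketched as follows:

```python
import pandas as pd

df = pd.DataFrame({
    "Apparent Temperature (C)": [7.2, 9.4],
    "Temperature (C)": [9.5, 11.2],
    "Humidity": [0.89, 0.86],
})

# y is the target; X keeps only the predictor columns
y = df["Apparent Temperature (C)"]
X = df.drop(columns=["Apparent Temperature (C)"])
print(X.columns.tolist())
```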
The next step is feature reduction: reducing the number of features to a smaller number to make training the model easier. As an example, a dataset with 15 features can be reduced to 3 features, which makes training the model more efficient and less error-prone. Here I used PCA for feature reduction; SVD is another option for the same purpose. Let's apply PCA to our dataset.
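A sketch of PCA reducing the 6 standardized features to 2 principal components (random data stands in for the standardized feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))  # stand-in for the 6 standardized features

pca = PCA(n_components=2)      # keep the 2 strongest principal components
X_pca = pca.fit_transform(X)
print(X_pca.shape)
```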
Now we see that our feature dimension has been reduced to two, so the complexity of the dataset is also reduced. Next, we split the dataset into training and testing sets; for this we can use sklearn.model_selection.
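A minimal train/test split sketch (the 80/20 ratio and random_state here are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)
```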
Now we have completed the feature engineering section. Next, we have to train the linear regression model on our dataset and see the accuracy of the model we built.
Section 3- Linear Regression Model applying
First, we have to create an instance of LinearRegression(). Then we fit the x_train and y_train values to the linear regression model, after which we can obtain the y_hat values. These operations can be performed as follows.
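A sketch of fitting the model and predicting, using the linear_model import from the beginning of the post; the training data here is a synthetic two-feature stand-in generated from a known linear relationship:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(5)
# Two-feature stand-in for the PCA-reduced training data, with a known
# linear target: y = 3*x1 - 2*x2 + 1
x_train = rng.normal(size=(100, 2))
y_train = 3.0 * x_train[:, 0] - 2.0 * x_train[:, 1] + 1.0

model = linear_model.LinearRegression()
model.fit(x_train, y_train)
y_hat = model.predict(x_train)

print("coefficients:", model.coef_)   # recovers ~[3, -2]
print("intercept:", model.intercept_)  # recovers ~1
```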
So we now have the y_hat values and can test our model. First I will show the mean squared error, then the percentage of explained variance of the predictions.
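These two metrics can be sketched with sklearn.metrics on small illustrative prediction values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative test targets and predictions
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 7.3, 8.9])

mse = mean_squared_error(y_test, y_hat)  # average squared prediction error
r2 = r2_score(y_test, y_hat)             # fraction of variance explained
print("MSE:", mse)
print("R^2:", r2)
```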
If we want to see the parameters of our model, we can print them as shown here. As we reduced the number of features to two, we have two coefficients; the intercept is also shown.
Finally, let's plot the graph of the testing data against the predicted (hat) data.
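One common way to sketch this comparison is a scatter of y_test against y_hat with the ideal y = x line for reference (the values here are illustrative):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 7.3, 8.9])

plt.scatter(y_test, y_hat, label="predictions")
# Points on this dashed line would be perfect predictions
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         "r--", label="ideal")
plt.xlabel("y_test")
plt.ylabel("y_hat")
plt.legend()
plt.savefig("test_vs_hat.png")
```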
The complete code can be found in this GitHub repo
In conclusion, we can see that there is a relationship between humidity and temperature, and a relationship between humidity and apparent temperature also exists. As there is a direct correlation between apparent temperature and humidity, we can predict the apparent temperature given the humidity.