Predict Data with Linear Regression Analysis

Nishshanka Jayasinghe
9 min read · May 27, 2021

Machine Learning is a branch of Artificial Intelligence based on the idea that systems can learn from data, identify hidden patterns, and make decisions with minimal human intervention.

Linear regression is one of the most widely used predictive modeling techniques. Here we identify one variable as the dependent variable (Y) and one or more other variables as independent variables (X). As shown in the following formula, we have coefficients and an intercept. The number of coefficients depends on the number of independent variables.

General formula for linear regression: Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε, where β₀ is the intercept, β₁ … βₙ are the coefficients, and ε is the error term.

In this post, let’s see how to use a linear regression model to predict the apparent temperature given other related features present in the dataset.

The dataset that I have chosen for this example is the weather dataset of the city of Szeged, Hungary. It covers 10 years, from 2006 to 2016, with hourly entries of weather-related features.

Data Set — https://www.kaggle.com/budincsevity/szeged-weather

Here I used Google Colab as my runtime environment. For better readability, I have divided this post into 3 sections.

  1. Data Preprocessing
  2. Feature Engineering
  3. Applying the Linear Regression Model

Section 1- Data Preprocessing

As the first step, we need to import the libraries that help us with preprocessing, modeling, and data visualization. The following libraries are used.

# All the library imports
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
import numpy as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler ,LabelEncoder
from sklearn import linear_model
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.decomposition import PCA

We have a number of options for loading a dataset into Google Colab. Here I used pandas’ read_csv. After that, to verify that the dataset loaded, I printed its last 10 rows.
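A sketch of what that cell might look like (the file name weatherHistory.csv is an assumption based on the Kaggle download):

# Load the CSV into a DataFrame and peek at the end of it
df = pd.read_csv('weatherHistory.csv')
df.tail(10)  # last 10 rows confirm the dataset loaded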

Last 10 rows of the loaded dataset

Let’s check some information about our dataset. Here the output gives us all the column names, non-null count, and data type.
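That check is a single call:

df.info()  # prints column names, non-null counts, and dtypes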

In the previous screenshot, we could see that Precip Type has a lower count than the other columns. This gives us a clue that there are null values in this column. First, let’s drop any duplicate records. We already know that Formatted Date must be unique, as there can’t be two records for the same timestamp. So we can drop duplicates as follows.
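A sketch of that cell (it also drops the Loud Cover column discussed below; whether the original keyed duplicates on the full row or only on Formatted Date isn’t visible, so I key on Formatted Date as the text suggests):

print('Records before:', len(df))
df = df.drop_duplicates(subset='Formatted Date')  # one record per timestamp
print('Records after :', len(df))
df = df.drop(columns=['Loud Cover'])  # null on every record, so it carries no information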

Remove duplicates and null column

In the above result, we can clearly see that the numbers of records before and after duplicate removal are not the same: the existing duplicates have been removed from our dataset. In the same code cell, I have also removed the Loud Cover column from the dataset, as it has null values in all records. Next, let’s check which columns have null values and what percentage each has.
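One common way to compute those percentages (a sketch, not necessarily the original cell):

# Percentage of null values per column
null_pct = df.isnull().sum() / len(df) * 100
print(null_pct)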

From the above output, we can clearly see that Precip Type has some missing values, and we have to deal with them. There are many ways to handle missing values in a dataset. Here I am going to drop the records that have null values, because the percentage of null values is lower than 1%.
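Dropping those records is a single call (a sketch; Precip Type is the only column with nulls at this point):

df = df.dropna(subset=['Precip Type'])  # drop rows whose Precip Type is null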

Removing null valued records

Let’s check the percentage of missing values again to confirm we have handled all the missing fields.

Let’s plot the box plot for each column and see whether there are any outliers associated.
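One way to draw those boxplots with seaborn (a sketch; iterating over the numeric columns):

# One boxplot per numeric column
for col in df.select_dtypes(include=np.number).columns:
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()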

Boxplot for each feature

By looking at the boxplots, we can clearly see that Pressure and Humidity have zero values. In a practical scenario, we know that neither pressure nor humidity can be zero, so these outliers must be handled to get a good result. In the other columns, many data points lie outside the whiskers of the box plot; since there are so many of them, we don’t treat them as outliers.

To handle both Humidity and Pressure, we can simply drop those outlier values. The code cells below show the handling of outliers in these two columns.
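A minimal sketch of those two cells, treating the zero readings as invalid and dropping them (column names as in the Kaggle dataset):

df = df[df['Humidity'] > 0]              # humidity can't physically be zero
df = df[df['Pressure (millibars)'] > 0]  # neither can atmospheric pressure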

Handling Humidity outliers
Humidity before and after outlier handling
Handling Pressure outliers
Pressure before and after outlier handling

So now we have completed both missing value and outlier handling. Next, we are going to explore Q-Q plots and histograms to check whether we need to apply any transformations to the dataset.

Code cell for Q-Q plot and Histogram
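A sketch of what that cell might contain, with current_feature holding the column to inspect:

current_feature = 'Temperature (C)'
# Q-Q plot against a normal distribution
stats.probplot(df[current_feature], dist='norm', plot=plt)
plt.show()
# Histogram of the same feature
df[current_feature].hist(bins=50)
plt.show()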

The above code only generates a Q-Q plot and histogram for the Temperature (C) column. In the same way, we can plot all the columns by changing the current_feature variable.

Plotting Histograms
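That cell can be as simple as pandas’ built-in hist (a sketch):

df.hist(figsize=(12, 10), bins=50)  # one histogram per numeric column
plt.tight_layout()
plt.show()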
Plotted Histograms

Looking at these plots, we can see that some features need a transformation: their distributions are not symmetrical, so skewness is present. To eliminate this issue we can perform transformations such as a logarithm, square root, or exponential transformation.
The trick here is that if the distribution is right-skewed we apply a logarithm or square root transformation, whereas for a left-skewed distribution we apply an exponential transformation.
We know that the logarithm of 0 is undefined, so we have to use the square root transformation. The screenshots below show applying the transformation and plotting the histograms again.
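The transformations themselves are one line per feature (a sketch; the post applies them to Humidity and Wind Speed (km/h)):

df['Humidity'] = np.sqrt(df['Humidity'])
df['Wind Speed (km/h)'] = np.sqrt(df['Wind Speed (km/h)'])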

Humidity transformation
Wind Speed transformation
After transformations are applied

Next, we are going to apply encoding techniques. Here I take Precip Type as the example, because it has two string values, rain and snow. What we do here is replace the string values with numeric values. The reason for encoding is that models only work with numerical data; if we want to use string data like this, the option is to convert it using an encoding technique. Here we can use LabelEncoder(). The encoded features are Summary, Daily Summary, and Precip Type.
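A sketch of the encoding cell, fitting a LabelEncoder on each of the three string columns:

le = LabelEncoder()
for col in ['Precip Type', 'Summary', 'Daily Summary']:
    df[col] = le.fit_transform(df[col])  # e.g. rain -> 0, snow -> 1 for Precip Type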

Applying Label encoder
Dataset after label encoding

Next, we are going to standardize the features. The reason for standardization is that variables measured at different scales do not contribute equally to the model fitting and the learned function, and may end up creating a bias. Thus, to deal with this potential problem, feature-wise standardization (μ=0, σ=1) is usually applied prior to model fitting. Before standardization, we need to remove all categorical features.

I have used StandardScaler() from the sklearn.preprocessing library.
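A sketch of that cell; exactly which categorical columns were excluded isn’t visible in the screenshot, so I drop the four named in Section 2:

features = df.drop(columns=['Formatted Date', 'Summary', 'Daily Summary', 'Precip Type'])
scaler = StandardScaler()
# Rescale each feature to zero mean and unit variance
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)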

Applying StandardScaler

Now we have finished the preprocessing part. There are many alternative methods that accomplish the same preprocessing in different ways.

Section 2- Feature Engineering

In this section, we’ll see how to identify significant and independent features and also apply PCA (Principal Component Analysis) for feature reduction.

Let’s first discover which features we can use to fit our linear regression model. First, we can eliminate the non-numerical columns, namely Precip Type, Summary, Daily Summary, and Formatted Date. Next, among the remaining features, we have to examine the correlations between them; for this I used a correlation matrix.
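A sketch of the correlation matrix and heatmap cells:

corr = scaled.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()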

Correlation Matrix
Heatmap

By looking at this correlation matrix, we can clearly see the correlation between Apparent Temperature (C) and the other features. Here Wind Speed (km/h) and Wind Bearing (degrees) have the lowest correlation, nearly zero. The other features show good correlations in comparison. As all the features show some correlation with Apparent Temperature (C), I will use all 6 features.

Next, I will separate the dependent variable, Apparent Temperature (C), from our preprocessed dataset. Let’s refer to the remaining feature matrix as X from here onwards.
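A sketch of that step:

y = scaled['Apparent Temperature (C)']                 # target
X = scaled.drop(columns=['Apparent Temperature (C)'])  # the 6 remaining features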

The next step is feature reduction: reducing the number of features to a lower number to make training the model easier. As an example, a dataset with 15 features can be reduced to 3 features, which makes training more efficient and less error-prone. Here I used PCA (Principal Component Analysis) for feature reduction; SVD is another technique for the same purpose. Let’s apply PCA to our dataset.
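A sketch of the PCA cell; n_components=2 matches the two-dimensional result shown below:

pca = PCA(n_components=2)
X = pd.DataFrame(pca.fit_transform(X), columns=['pc1', 'pc2'])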

Applying PCA
Dataset after PCA applied

Now we see that our feature dimension has been reduced to two, so the complexity of the dataset is also reduced. Next we are going to split our dataset into training and testing sets; for this we can use sklearn.model_selection.
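A sketch of that split; the 80/20 ratio and random_state are assumptions:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)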

Dataset splitting

Now we have completed the feature engineering section. Next we have to train the linear regression model with our dataset and see the accuracy of the model we built.

Section 3- Applying the Linear Regression Model

For this, we first have to create an instance of LinearRegression(), then fit the x_train and y_train values to the linear regression model. After that, we can obtain the predicted (y_hat) values. These operations can be conducted as follows.
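A sketch of those operations:

model = linear_model.LinearRegression()
model.fit(x_train, y_train)    # learn coefficients and intercept
y_hat = model.predict(x_test)  # predicted apparent temperatures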

So we now have the y_hat values and can test our model. First I will show the Mean Squared Error.
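A sketch using sklearn’s metric:

from sklearn.metrics import mean_squared_error
print('MSE:', mean_squared_error(y_test, y_hat))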

MSE

Next, let’s look at the percentage of explained variance of the predictions.
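A sketch; the original cell may instead have used model.score or r2_score:

from sklearn.metrics import explained_variance_score
print('Explained variance:', explained_variance_score(y_test, y_hat))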

The variance of the prediction

If we want to see the parameters of the model, we can print them as shown here.
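A sketch of that cell:

print('Intercept   :', model.intercept_)
print('Coefficients:', model.coef_)  # one per principal component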

Intercept and other parameters

As we reduced the number of features to two, we have two coefficients; the intercept is also shown.

Finally, let’s plot the testing data against the predicted (y_hat) values.
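A sketch of that plot, comparing actual against predicted values:

plt.scatter(y_test, y_hat, s=2)
plt.xlabel('Actual Apparent Temperature (C)')
plt.ylabel('Predicted Apparent Temperature (C)')
plt.show()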

Actual vs Predicted Results

The complete code can be found in this GitHub repo

In conclusion, we can see that there is a relationship between humidity and temperature, and a relationship between humidity and apparent temperature also exists. As there is a direct correlation between apparent temperature and humidity, we can predict the apparent temperature given the humidity.
