In this article, we will talk about the simplest model we can apply to our data: linear regression, available in scikit-learn as LinearRegression(). Remember, we talk about regression when our target variable is continuous, so we are not predicting categories as we would with a classifier such as LogisticRegression().
Loading the dataset
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris Data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
target_df = pd.DataFrame(data=iris.target, columns=['species'])
target_df['species'] = target_df['species'].map({0: "setosa", 1: "versicolor", 2: "virginica"})  # map converter (dict); alternatively: apply(function)
iris_df = pd.concat([iris_df, target_df], axis=1)  # concatenate the DataFrames
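As the comment above mentions, .apply() with a function is an alternative to .map(); a minimal sketch of the same conversion (species_names is just an illustrative name, and you would use either this or .map(), not both):
# Equivalent to the .map() call above, using .apply() with a lookup function
species_names = {0: "setosa", 1: "versicolor", 2: "virginica"}
target_df['species'] = target_df['species'].apply(lambda code: species_names[code])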
We can now get a brief overview of the columns.
iris_df.describe()
.describe() generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
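By default, .describe() only summarizes the numeric columns; if you also want counts and frequencies for the string species column, pandas accepts an include argument:
iris_df.describe(include='all')  # adds count/unique/top/freq rows for non-numeric columns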
iris_df.info()
.info() prints a concise summary of a DataFrame.
import seaborn as sns
sns.pairplot(iris_df, hue='species')
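If you are running this as a plain Python script rather than in a notebook, you will also need matplotlib to render the figure:
import matplotlib.pyplot as plt
plt.show()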
Problem Statement: Predict the sepal length (cm) of the iris flowers
Now, our objective is to predict the target sepal length (cm) using the rest of the features. To do so, we convert the species column back into its numeric codes, since LinearRegression() cannot work with string labels directly. Remember, not every model forces us into this kind of encoding; a common alternative is sketched right after the code below.
# Converting Objects to Numerical dtype
iris_df.drop('species', axis=1, inplace=True)
target_df = pd.DataFrame(columns=['species'], data=iris.target)
iris_df = pd.concat([iris_df, target_df], axis=1)

# Variables
X = iris_df.drop(labels='sepal length (cm)', axis=1)
y = iris_df['sepal length (cm)']

# Splitting the Dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
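As an aside, encoding the classes as the integers 0, 1 and 2 implies an ordering between species that does not really exist. A common alternative is one-hot encoding; a minimal sketch with pandas (illustrative only, the rest of the article keeps the integer codes):
# One-hot encode species instead of using raw integer codes (illustrative)
species_dummies = pd.get_dummies(target_df['species'], prefix='species')
iris_df_onehot = pd.concat([iris_df.drop('species', axis=1), species_dummies], axis=1)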
# Instantiating LinearRegression() Model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Training/Fitting the Model
lr.fit(X_train, y_train)
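After fitting, the learned parameters are exposed as standard scikit-learn attributes: one coefficient per feature, plus an intercept:
# Inspect the fitted parameters
print('Coefficients:', lr.coef_)
print('Intercept:', lr.intercept_)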
# Making Predictions
pred = lr.predict(X_test)
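A quick sanity check is to place the predictions next to the true values (the comparison name is just illustrative):
# Actual vs. predicted sepal lengths, side by side
comparison = pd.DataFrame({'actual': y_test, 'predicted': pred})
print(comparison.head())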
# Evaluating Model's Performance
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('Mean Absolute Error:', mean_absolute_error(y_test, pred))
print('Mean Squared Error:', mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, pred)))
The results are:
Mean Absolute Error: 0.26498350887555133
Mean Squared Error: 0.10652500975036944
Root Mean Squared Error: 0.3263816933444176
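Note that the root mean squared error is simply the square root of the MSE (√0.1065 ≈ 0.3264, matching the values above), which puts the error back in the original units, centimeters. Another common regression metric, not printed above, is the coefficient of determination, which scikit-learn exposes directly on the fitted model:
print('R^2 Score:', lr.score(X_test, y_test))  # coefficient of determination on the test set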