Create a Linear Regression Table with Average_Pulse and Duration as Explanatory Variables:
import pandas as pd
import statsmodels.formula.api as smf
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
model = smf.ols('Calorie_Burnage ~ Average_Pulse + Duration', data = full_health_data)
results = model.fit()
print(results.summary())
The linear regression function can be rewritten mathematically as:
Calorie_Burnage = Average_Pulse * 3.1695 + Duration * 5.8424 - 334.5194
Rounded to two decimals:
Calorie_Burnage = Average_Pulse * 3.17 + Duration * 5.84 - 334.52
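To see how much the rounding matters, the two versions of the function can be compared on one example. This is a minimal sketch: the coefficients are copied from the equations above, while the `predict` helper, the dictionary names, and the example session (pulse 110, duration 60) are assumptions made only for illustration.

```python
# Coefficients copied from the full-precision equation above.
FULL = {"Average_Pulse": 3.1695, "Duration": 5.8424, "Intercept": -334.5194}
# The same coefficients rounded to two decimals.
ROUNDED = {name: round(value, 2) for name, value in FULL.items()}

def predict(coefs, average_pulse, duration):
    """Evaluate the linear regression function for one training session."""
    return (coefs["Average_Pulse"] * average_pulse
            + coefs["Duration"] * duration
            + coefs["Intercept"])

# Hypothetical example session, chosen only for illustration.
full = predict(FULL, 110, 60)
rounded = predict(ROUNDED, 110, 60)
print(f"full precision: {full:.2f}")
print(f"rounded:        {rounded:.2f}")
print(f"difference:     {abs(full - rounded):.2f}")
```

For this example the two predictions differ by less than 0.1, so the rounded equation is fine for reading the model, while the full-precision coefficients are the safer choice for actual predictions.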
Define the linear regression function in Python to perform predictions.
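What is Calorie_Burnage if:
Average pulse is 110 and duration is 60?
Average pulse is 140 and duration is 45?
Average pulse is 175 and duration is 20?
def Predict_Calorie_Burnage(Average_Pulse, Duration):
    # Coefficients taken from the regression table above
    return 3.1695 * Average_Pulse + 5.8424 * Duration - 334.5194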
print(Predict_Calorie_Burnage(110,60))
print(Predict_Calorie_Burnage(140,45))
print(Predict_Calorie_Burnage(175,20))
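The Answers (rounded to whole numbers):
Average pulse 110 and duration 60: Calorie_Burnage is about 365.
Average pulse 140 and duration 45: Calorie_Burnage is about 372.
Average pulse 175 and duration 20: Calorie_Burnage is about 337.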
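Look at the coefficients:
Calorie_Burnage increases by 3.17 if Average_Pulse increases by one and Duration stays the same.
Calorie_Burnage increases by 5.84 if Duration increases by one and Average_Pulse stays the same.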
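The meaning of a coefficient can be checked numerically: change one explanatory variable by one unit, keep the other fixed, and compare the two predictions. The sketch below assumes the coefficients from the regression equation in this section; the function name `predict_calorie_burnage` and the starting values (110, 60) are illustrative choices.

```python
def predict_calorie_burnage(average_pulse, duration):
    """Linear regression function with the coefficients from the equation above."""
    return 3.1695 * average_pulse + 5.8424 * duration - 334.5194

# Illustrative starting point; any other values give the same differences.
base = predict_calorie_burnage(110, 60)

# Raise one variable by one unit while holding the other fixed.
pulse_effect = predict_calorie_burnage(111, 60) - base
duration_effect = predict_calorie_burnage(110, 61) - base

print(round(pulse_effect, 4))     # change per unit of Average_Pulse
print(round(duration_effect, 4))  # change per unit of Duration
```

Because the function is linear, each difference equals the corresponding coefficient no matter which starting values are used.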
Look at the P-value for each coefficient.
So here we can conclude that Average_Pulse and Duration both have a relationship with Calorie_Burnage.
There is a problem with R-squared if we have more than one explanatory variable.
R-squared will almost always increase if we add more variables, and will never decrease.
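This is because every added variable gives the least-squares fit one more coefficient to adjust. The fit can always set that coefficient to zero and reproduce the smaller model, so the unexplained variation can only stay the same or shrink.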
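This behaviour can be reproduced on synthetic data. The sketch below is an illustration only: it does not use the health data set, it fits ordinary least squares through the normal equations in plain Python instead of statsmodels, and the helper names `solve` and `r_squared` as well as the generated data are assumptions.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(columns, y):
    """Fit least squares with an intercept and return R-squared."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic data: y depends on x only; noise_var is unrelated to y.
random.seed(0)
n = 50
x = [random.uniform(0, 10) for _ in range(n)]
noise_var = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + 1.0 + random.gauss(0, 2) for xi in x]

r2_one = r_squared([x], y)
r2_two = r_squared([x, noise_var], y)
print(f"R-squared with x only:      {r2_one:.4f}")
print(f"R-squared with x and noise: {r2_two:.4f}")
```

The second value is never smaller than the first, even though `noise_var` carries no information about `y`.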
If we add random variables that do not affect Calorie_Burnage, we risk falsely concluding that the linear regression function is a good fit. Adjusted R-squared corrects for this problem by penalizing each additional explanatory variable.
It is therefore better to look at the adjusted R-squared value if we have more than one explanatory variable.
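The usual formula for the adjustment is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The sketch below implements that formula; the function name `adjusted_r_squared` and the input values are hypothetical and are not taken from the regression table.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / degrees_of_freedom

# Hypothetical inputs: same R-squared and sample size, different model sizes.
print(round(adjusted_r_squared(0.80, 100, 2), 4))
print(round(adjusted_r_squared(0.80, 100, 10), 4))
```

With the same R-squared, the model with more explanatory variables gets the lower adjusted value. In statsmodels the fitted results object exposes both numbers as `results.rsquared` and `results.rsquared_adj`, so the formula does not have to be coded by hand.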
The Adjusted R-squared is 0.814.
The value of R-squared is always between 0 and 1 (0% to 100%).
Conclusion: The model fits the data points well!
Congratulations! You have now finished the final module of the data science library.