Mastering Machine Learning with R
上QQ阅读APP看书,第一时间看更新

Building a univariate model

Our first case focuses on the goal of predicting the water yield (in inches) of the Snake River Watershed in Wyoming, USA, as a function of the water content of the year's snowfall. This forecast will be useful in managing the water flow and reservoir levels, as the Snake River provides much-needed irrigation water for the farms and ranches of several western states. The snake dataset is available in the alr3 package (note that alr stands for applied linear regression):

> install.packages("alr3")
> library(alr3)
> data(snake)
> dim(snake)
[1] 17 2
> head(snake)
X Y
1 23.1 10.5
2 32.8 16.7
3 31.8 18.2
4 32.0 17.0
5 30.4 16.3
6 24.0 10.5

Now that we have 17 observations, data exploration can begin. But first, let's change X and Y to meaningful variable names, as follows:

> names(snake) <- c("content", "yield")
> attach(snake) # attach data with new names
> head(snake)

content yield
1 23.1 10.5
2 32.8 16.7
3 31.8 18.2
4 32.0 17.0
5 30.4 16.3
6 24.0 10.5

> plot(content,
yield, main = "Scatterplot of Snow vs. Yield",
xlab = "water content of snow",
ylab = "water yield")

The output of the preceding code is as follows:

This is an intriguing plot as the data is linear and has a slight curvilinear shape driven by two potential outliers at both ends of the extreme. 

To perform a linear regression in R, we use the lm() function to create a model in the standard form of fit = lm(Y ~ X). You can then test your assumptions using various functions on your fitted model by using the following code:

> yield_fit <- lm(yield ~ content)

> summary(yield_fit)

Call:
lm(formula = yield ~ content)

Residuals:
Min 1Q Median 3Q Max
-2.1793 -1.5149 -0.3624 1.6276 3.1973

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.72538 1.54882 0.468 0.646
content 0.49808 0.04952 10.058 4.63e-08
***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
'.' 0.1 ' ' 1

Residual standard error: 1.743 on 15 degrees of
freedom
Multiple R-squared: 0.8709, Adjusted R-squared:
0.8623
F-statistic: 101.2 on 1 and 15 DF, p-value:
4.632e-08

With the summary() function, we can examine some items, including the model specification, descriptive statistics about the residuals, the coefficients, codes to model significance, and a summary of the model error and fit. Right now, let's focus on the parameter coefficient estimates, and see whether our predictor variable has a significant p-value and whether the overall model F-test has a significant p-value. Looking at the parameter estimates, the model tells us that yield is equal to 0.72538 plus 0.49808 times content. We can state that for every one unit change in the content, the yield will increase by 0.49808 units. F-statistic is used to test the null hypothesis that the model coefficients are all zero.

Since p-value is highly significant, we can reject the null and move on to the t-test for content, which tests the null hypothesis that it's zero. Again, we can reject the null. Additionally, we can see the Multiple R-squared and Adjusted R-squared values. Adjusted R-squared will be covered under the multivariate regression topic, so let's zero in on Multiple R-squared; here, we see that it's 0.8709. In theory, it can range from zero to one and is a measure of the strength of the association between X and Y. The interpretation, in this case, is that the water content of snow can explain 87 percent of the variation in the water yield. On a side note, R-squared is nothing more than the correlation coefficient of [X, Y] squared.

We can recall our scatter plot and now add the best fit line produced by our model using the following code:

> plot(content, yield)

> abline(yield_fit, lwd = 3, col = "red")

The output of the preceding code is as follows: