Announcement

### Student's t Test for Independent Samples Is a Special Case of Simple Linear Regression

Student's t test for independent samples is equivalent to the linear regression of the response variable on the grouping variable, where the grouping variable is recoded to have numerical values, if necessary.

Here's an example involving glucose levels in two strains of rats, A and B. First, the data are displayed in a dot plot. Then, Glucose is plotted against A0B1, where A0B1 is created by setting it equal to 0 for strain A and 1 for strain B.

Student's t test for independent samples yields

```
Variable: GLU

STRAIN       N         Mean      Std Dev    Std Error
-----------------------------------------------------
A           10  80.40000000  29.20502240   9.23543899
B           12  99.66666667  19.95601223   5.76080452

Variances        T       DF    Prob>|T|
---------------------------------------
Unequal    -1.7700     15.5      0.0965
Equal      -1.8327     20.0      0.0818
```
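As a check, the equal-variances t statistic can be recomputed directly from the summary statistics in the output above. Here is a minimal pure-Python sketch (the numbers are copied from the table):

```python
import math

# Summary statistics copied from the t test output above
n_a, mean_a, sd_a = 10, 80.40000000, 29.20502240
n_b, mean_b, sd_b = 12, 99.66666667, 19.95601223

# Pooled variance: degrees-of-freedom-weighted average of the two sample variances
df = n_a + n_b - 2
sp2 = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df

# Standard error of the difference between the two sample means
se = math.sqrt(sp2 * (1 / n_a + 1 / n_b))

# Equal-variances t statistic (strain A minus strain B, as in the output)
t = (mean_a - mean_b) / se

print(round(t, 4), df)  # -1.8327 20
```

Note that `se` here, about 10.513, is the same as the standard error reported for the A0B1 coefficient in the regression output: with 0/1 coding, the standard error of the slope estimate equals the standard error of the difference between the two means.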

The linear regression of glucose on A0B1 gives the equation GLU = b0 + bA0B1*A0B1.

```
Dependent Variable: GLU
Parameter Estimates

                  Parameter      Standard     T for H0:
Variable  DF       Estimate         Error   Parameter=0    Prob > |T|

INTERCEPT  1     80.400000    7.76436303        10.355        0.0001
A0B1       1     19.266667   10.51299725         1.833        0.0818
```

The P value for the Equal Variances version of the t test is equal to the P value for the regression coefficient of the grouping variable A0B1 (P = 0.0818), and the corresponding t statistics are equal in magnitude (|t| = 1.833). This is not a coincidence: statistical theory says the two P values must be equal and the t statistics must be equal in magnitude. The signs of the t statistics will be the same if the t statistic for Student's t test is calculated by subtracting the mean of group 0 from the mean of group 1.
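This equivalence is easy to verify numerically with any data set. The sketch below uses a small hypothetical sample (made-up numbers, not the glucose data) and only the closed-form textbook formulas: it computes the equal-variances t statistic, then the t statistic for the slope of the regression on a 0/1 code, and confirms they agree.

```python
# Hypothetical data for two groups; any numbers work
y0 = [72.0, 85.0, 90.0, 78.0, 95.0]    # "strain A", coded 0
y1 = [88.0, 102.0, 97.0, 110.0, 93.0]  # "strain B", coded 1

def mean(v):
    return sum(v) / len(v)

# --- Equal-variances t test ---
n0, n1 = len(y0), len(y1)
m0, m1 = mean(y0), mean(y1)
ss0 = sum((v - m0) ** 2 for v in y0)
ss1 = sum((v - m1) ** 2 for v in y1)
sp2 = (ss0 + ss1) / (n0 + n1 - 2)              # pooled variance
t_test = (m1 - m0) / (sp2 * (1 / n0 + 1 / n1)) ** 0.5

# --- Simple linear regression of y on the 0/1 group code ---
x = [0.0] * n0 + [1.0] * n1
y = y0 + y1
mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = my - slope * mx
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
se_slope = (sse / (len(x) - 2) / sxx) ** 0.5
t_slope = slope / se_slope

print(t_test, t_slope)  # identical up to floating-point rounding
```

With 0/1 coding the intercept is the group-0 mean and the slope is the difference between the two group means, just as in the glucose example.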

The equal variances version of Student's t test is used to test the hypothesis of the equality of μA and μB, the means of two normally distributed populations with equal population variances (H0: μA = μB). The population means can be reexpressed as μA = μ and μB = μ + δ, where δ = μB − μA (that is, data from strain A are normally distributed with mean μ and standard deviation σ, while data from strain B are normally distributed with mean μ + δ and standard deviation σ), and the hypothesis can be rewritten as H0: δ = 0.

The linear regression model says data are normally distributed about the regression line with constant standard deviation σ. The predictor variable A0B1 (the grouping variable) takes on only two values. Therefore, there are only two locations along the regression line where there are data (see the display). "Homoscedastic (constant spread about the regression line) normally distributed values about the regression line" is equivalent to "two normally distributed populations with equal variances".

Matching the two descriptions parameter by parameter:
• β0 is equal to μA,
• β0 + βA0B1 is equal to μB, and
• βA0B1 is equal to δ.
Thus, the hypothesis of equal means (H0: δ = 0) is equivalent to the hypothesis that the regression coefficient of A0B1 is 0 (H0: βA0B1 = 0). That is, the population means are equal if and only if the regression line is horizontal.

Since the probability structure is the same for the two problems (homoscedastic, normally distributed data), test statistics and P values will be the same, too. The numbers confirm this.

• For strain A, the predicted value b0+bA0B1*0 is 80.400000 + 19.266667*0 = 80.40, the mean of strain A.
• For strain B, b0+bA0B1*1 is 80.400000 + 19.266667*1 = 99.67, the mean of strain B.
Had the numerical codes for strain been different from 0 and 1, the intercept and regression coefficient would change so that the two predicted values would continue to be the sample means, but the t statistic and P value for the regression coefficient would not change. The best fitting line passes through the two points whose X-values are the coded Strain values and whose Y-values are the corresponding sample means, because this minimizes the sum of squared differences between observed and predicted Y-values. Since two points determine a straight line, the regression line will always have the slope and intercept necessary to pass through them. The Y-values are the sample means; the X-values are determined by the coding scheme. Whatever the X-values, the slope and intercept of the regression line will be those of the line through the two points.
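This invariance can be demonstrated with a short sketch. Using the same made-up data as before (hypothetical values, not the glucose data), the code below refits the regression under three different coding schemes; the intercept and slope change, but the t statistic for the slope does not, and the fitted values at the two codes are always the two group means.

```python
def ols(x, y):
    """Simple linear regression by the closed-form least-squares formulas.
    Returns (intercept, slope, t statistic for the slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = (sse / (n - 2) / sxx) ** 0.5
    return intercept, slope, slope / se_slope

# Hypothetical responses: first five in group "A", last five in group "B"
y = [72.0, 85.0, 90.0, 78.0, 95.0, 88.0, 102.0, 97.0, 110.0, 93.0]

# Fit with three different numeric codes for the two groups
results = {}
for code_a, code_b in [(0, 1), (1, 2), (-1, 1)]:
    x = [code_a] * 5 + [code_b] * 5
    results[(code_a, code_b)] = ols(x, y)

for codes, (b0, b1, t) in results.items():
    print(codes, round(b0, 3), round(b1, 3), round(t, 4))
```

For codes (0, 1) the slope is the difference between the group means; for codes two units apart, such as (−1, 1), it is half that difference, yet the t statistic is identical in every case.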