STAT 3360 Notes

1 \(\chi^2\) Tests

1.1 Goodness-of-Fit Test of One Random Variable

1.1.1 Introduction

  • In the section Discrete Probability Distribution, we learned about the probability mass function, which describes the distribution of a discrete random variable. In reality, however, the true distribution is usually unknown. Just as we did for the population proportion, we can use hypothesis testing to check whether a hypothesized probability mass function is true (i.e., whether the hypothesized distribution matches the real distribution of the random variable).
  • Since the test is about "matching", it is named the "Goodness-of-Fit" test.
  • This test appears under the section "\(\chi^2\) Tests" because its critical value is based on a \(\chi^2\) distribution.
  • There are actually many types of Goodness-of-Fit tests. The \(\chi^2\) Goodness-of-Fit test presented here is mainly used for testing discrete distributions. For continuous distributions, there are other Goodness-of-Fit procedures, which are not covered in this course.

1.1.2 Notations

  • Suppose
    • \(X\) is a discrete random variable;
    • \(X\) has \(K\) different possible values;
    • \(x_1, x_2, \dots, x_K\) are the \(K\) possible values of \(X\);
    • A sample of size \(n\) about \(X\) is obtained.

1.1.3 Hypothesized Distribution

  • Remark
    • The Goodness-of-Fit test determines whether an assumed (hypothesized) distribution matches the reality. Before applying the test, we should first clarify the hypothesized distribution.
  • The hypothesized distribution of the discrete random variable \(X\) can be represented by a table, where \(p_1, p_2, \dots, p_K\) are the hypothesized proportions for values \(x_1, x_2, \dots, x_K\), respectively.

    \(X\) \(x_1\) \(x_2\) \(\cdots\) \(x_K\) Total
    Hypothetical Probability \(p_1\) \(p_2\) \(\cdots\) \(p_K\) \(100\%\)

1.1.4 Observed Counts and Expected Counts

  • The number of times each value appears in the sample is called the Observed Count for that value.
  • For each value of \(X\), \(\boxed{\text{Expected Count} = \text{Total Count} \times \text{Corresponding Hypothesized Probability}}\).
  • We denote
    • by \(O_i\) the observed count of \(x_i\);
    • by \(E_i\) the expected count of \(x_i\).
  • The above information can be summarized in a table

    \(X\) \(x_1\) \(x_2\) \(\cdots\) \(x_K\) Total
    Hypothetical Probability \(p_1\) \(p_2\) \(\cdots\) \(p_K\) \(100\%\)
    Observed Count \(O_1\) \(O_2\) \(\cdots\) \(O_K\) \(n\)
    Expected Count \(E_1=np_1\) \(E_2=np_2\) \(\cdots\) \(E_K=np_K\) \(n\)

    where \(\boxed{E_i = n \cdot p_i}\) for \(i=1,2,\dots,K\).
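
  • As a quick check of the formula \(E_i = n \cdot p_i\), here is a minimal Python sketch (the probabilities and sample size below are illustrative, not from the notes):

      import numpy as np

      p = np.array([0.2, 0.3, 0.5])  # hypothesized probabilities p_1, ..., p_K
      n = 200                        # total count (sample size)

      E = n * p                      # E_i = n * p_i for each value of X
      print(E)                       # [ 40.  60. 100.]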

1.1.5 Test Procedure

  • At significance level \(\alpha\), to test \(\boxed{H_0: \text{the real distribution is the same as hypothesized}}\) vs \(\boxed{H_A: \text{the real distribution is NOT the same as hypothesized}}\),
    • Test Statistic: \(\boxed{T = \sum\limits_{i=1}^{K} \frac{(E_i - O_i)^2}{E_i}}\)
    • Degrees of Freedom for the \(\chi^2\) statistic: \(K-1\)
    • Critical Value: \(\boxed{\chi^2_{\alpha, K-1}}\)
    • Rejection Rule: Reject \(H_0\) if \(\boxed{T > \chi^2_{\alpha, K-1}}\)
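
  • The whole procedure can be sketched in a few lines of Python. The helper below is illustrative (the name gof_chi2 is mine, not from any library); it assumes the observed counts and the hypothesized probabilities are listed in the same order:

      import numpy as np
      from scipy import stats

      def gof_chi2(observed, p, alpha):
          """Chi-squared Goodness-of-Fit test: returns (T, critical value, reject?)."""
          O = np.asarray(observed, dtype=float)
          E = O.sum() * np.asarray(p)                   # E_i = n * p_i
          T = ((E - O) ** 2 / E).sum()                  # test statistic
          crit = stats.chi2.ppf(1 - alpha, len(p) - 1)  # chi^2_{alpha, K-1}
          return T, crit, T > crit                      # reject H_0 iff T > crit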

1.1.6 Example

  • According to his experience, the teacher of STAT 3360 believes that among all students the proportions for grades A, B, C, D, F and W are \(15\%, 20\%, 40\%, 10\%, 10\%\) and \(5\%\), respectively. Suppose now \(400\) students are randomly sampled, among which \(66\) get grade A, \(84\) get B, \(148\) get C, \(48\) get D, \(46\) get F and \(8\) get W.
  • Q1: At \(5\%\) significance level, do we have enough evidence that the teacher's assumption is incorrect?
    • We want to test
      • \(H_0:\) the real distribution of grade is the same as the teacher's assumption
      • \(H_A:\) the real distribution of grade is NOT the same as the teacher's assumption
    • Let's summarize the information by a table using \(E_i = n \cdot p_i\), where \(n=400\) is the sample size

      Grade A B C D F W Total
      Hypothesized Proportion \(p_1=15\%\) \(p_2=20\%\) \(p_3=40\%\) \(p_4=10\%\) \(p_5=10\%\) \(p_6=5\%\) \(100\%\)
      Observed Count \(O_1=66\) \(O_2=84\) \(O_3=148\) \(O_4=48\) \(O_5=46\) \(O_6=8\) \(n=400\)
      Expected Count \(E_1=400 \cdot 15\% = 60\) \(E_2=400 \cdot 20\% = 80\) \(E_3=400 \cdot 40\% = 160\) \(E_4=400 \cdot 10\% = 40\) \(E_5=400 \cdot 10\% = 40\) \(E_6=400 \cdot 5\% = 20\) \(n=400\)
    • Significance Level: \(\alpha = 5\% = 0.05\)
    • Degrees of Freedom for the \(\chi^2\) statistic
      • There are \(K = 6\) different values (categories) of the Grade
      • Thus the degrees of freedom for the \(\chi^2\) statistic is \(K-1=6-1=5\)
    • Critical Value: \(\chi^2_{\alpha, K-1} = \chi^2_{0.05, 5} = 11.1\)
    • Test Statistic: \(T = \sum\limits_{i=1}^{6} \frac{(E_i - O_i)^2}{E_i} = \frac{(60-66)^2}{60} + \frac{(80-84)^2}{80} + \frac{(160-148)^2}{160} + \frac{(40-48)^2}{40} + \frac{(40-46)^2}{40} + \frac{(20-8)^2}{20} = 11.4\)
    • Rejection Rule: Reject \(H_0\) if \(T > \chi^2_{\alpha, K-1}\)
    • Decision: Since \(T = 11.4 > 11.1 = \chi^2_{\alpha, K-1}\), we reject \(H_0\). That is, there is sufficient evidence that the teacher's assumption is not true.
  • Q2: At \(1\%\) significance level, do we have enough evidence that the teacher's assumption is incorrect?
    • We want to test
      • \(H_0:\) the real distribution of grade is the same as the teacher's assumption
      • \(H_A:\) the real distribution of grade is NOT the same as the teacher's assumption
    • Significance Level: \(\alpha = 1\% = 0.01\)
    • Degrees of Freedom for the \(\chi^2\) statistic
      • There are \(K = 6\) different values (categories) of Grade, the same as in Q1
      • Thus the degrees of freedom for the \(\chi^2\) statistic is \(K-1=6-1=5\), the same as in Q1
    • Critical Value: \(\chi^2_{\alpha, K-1} = \chi^2_{0.01, 5} = 15.1\)
    • Test Statistic: \(T = 11.4\), the same as in Q1
    • Rejection Rule: Reject \(H_0\) if \(T > \chi^2_{\alpha, K-1}\)
    • Decision: Since \(T = 11.4 < 15.1 = \chi^2_{\alpha, K-1}\), we do not reject \(H_0\). That is, there is not sufficient evidence that the teacher's assumption is incorrect.
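
  • Both answers can be verified with scipy.stats.chisquare, which computes the same statistic \(T\) but reports a p-value rather than a critical value (a sketch, assuming scipy is installed):

      from scipy import stats

      O = [66, 84, 148, 48, 46, 8]   # observed counts from the sample
      E = [60, 80, 160, 40, 40, 20]  # expected counts n * p_i from the table

      T, p_value = stats.chisquare(f_obs=O, f_exp=E)
      print(T)        # 11.4
      print(p_value)  # about 0.044: reject H_0 at alpha = 0.05 but not at 0.01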

1.2 Independence (Homogeneity) Test of Two Random Variables

1.2.1 Introduction

  • Given two random variables, we are naturally interested in whether they are independent. If they are independent, then there is no need to consider the two random variables together; otherwise, we may want to investigate the relationship between them further.
  • For numerical random variables, we use linear regression to study their relationship.
  • For categorical random variables, we can use the so-called Independence Test or Homogeneity Test to detect dependence between them. The two terms "independence" and "homogeneity" actually have the same meaning here. Think about two variables, Gender and Age. If Gender and Age are independent, then Gender has no impact on the distribution of Age. That is, the distribution of Age within the Male group and the distribution of Age within the Female group are the same, which means "homogeneity" between the two distributions.
  • As with the Goodness-of-Fit test, this test appears under the section "\(\chi^2\) Tests" because its critical value is based on a \(\chi^2\) distribution.
  • In the Goodness-of-Fit test, a hypothesized distribution is given, and we would like to determine whether this hypothesized distribution matches (fits) the reality.
  • However, in the independence (homogeneity) test, no hypothesized distribution is needed, and we test the independence between the two random variables using only observed data.
  • Therefore, the expected counts cannot be calculated by "total \(\times\) hypothesized proportion". Instead, the "hypothesized proportion" is replaced by a theoretical proportion based on the independence assumption and sample proportions.

1.2.2 Notations

  • Suppose
    • \(X\) and \(Y\) are two discrete random variables;
    • \(X\) has \(K\) different possible values;
    • \(Y\) has \(L\) different possible values;
    • \(x_1, x_2, \dots, x_K\) represent the \(K\) different possible values of \(X\);
    • \(y_1, y_2, \dots, y_L\) represent the \(L\) different possible values of \(Y\);
    • A sample of size \(n\) (grand total) is obtained, which is simultaneously about the two random variables \(X\) and \(Y\).

1.2.3 Observed Counts

  • We may then summarize the information in the sample by a table as follows, where

    • \(O_{ij}\) represents the Observed Counts corresponding to the event [\(X = x_i\) AND \(Y = y_j\)];
    • \(R_i\) represents the \(i\)-th row total, \(i=1,2,\dots,K\);
    • \(C_j\) represents the \(j\)-th column total, \(j=1,2,\dots,L\).
    Observed Count \(y_1\) \(y_2\) \(\cdots\) \(y_L\) Total
    \(x_1\) \(O_{11}\) \(O_{12}\) \(\cdots\) \(O_{1L}\) \(R_1=\sum\limits_{j=1}^{L}O_{1j}\)
    \(x_2\) \(O_{21}\) \(O_{22}\) \(\cdots\) \(O_{2L}\) \(R_2=\sum\limits_{j=1}^{L}O_{2j}\)
    \(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\) \(\vdots\)
    \(x_K\) \(O_{K1}\) \(O_{K2}\) \(\cdots\) \(O_{KL}\) \(R_K=\sum\limits_{j=1}^{L}O_{Kj}\)
    Total \(C_1=\sum\limits_{i=1}^{K}O_{i1}\) \(C_2=\sum\limits_{i=1}^{K}O_{i2}\) \(\cdots\) \(C_L=\sum\limits_{i=1}^{K}O_{iL}\) Grand Total \(=n\)
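
  • In code, the row totals, the column totals, and the grand total are just sums of the observed-count matrix along each axis. A minimal NumPy sketch with illustrative counts (\(K = 2\), \(L = 3\)):

      import numpy as np

      O = np.array([[20, 30, 50],   # observed counts O_ij: rows are values of X,
                    [40, 30, 30]])  # columns are values of Y (illustrative numbers)

      R = O.sum(axis=1)  # row totals R_1, ..., R_K
      C = O.sum(axis=0)  # column totals C_1, ..., C_L
      n = O.sum()        # grand total n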

1.2.4 Expected Counts

  • Let \(E_{ij}\) represent the Expected Count corresponding to [\(X = x_i\) AND \(Y = y_j\)].
  • When \(X\) and \(Y\) are independent, by the definition of independence, we have

    \(\boxed{E_{ij}} = \text{Expected Count for } [X = x_i \text{ AND } Y = y_j]\)

    \(= \text{Grand Total} \times \text{Theoretical Proportion of the Event } [X = x_i \text{ AND } Y = y_j]\)

    \(= \text{Grand Total} \times P(X = x_i \text{ AND } Y = y_j)\)

    \(= \text{Grand Total} \times \underbrace{P(X = x_i)}_{\text{population proportion, estimated by sample proportion}} \times \underbrace{P(Y = y_j)}_{\text{population proportion, estimated by sample proportion}}\)

    \(\approx \text{Grand Total} \times \underbrace{\frac{\text{Total of }x_i}{\text{Grand Total}}}_{\text{sample proportion of } x_i} \times \underbrace{\frac{\text{Total of }y_j}{\text{Grand Total}}}_{\text{sample proportion of }y_j}\)

    \(= n \times \frac{\text{Total of }x_i}{n} \times \frac{\text{Total of }y_j}{n}\)

    \(= \boxed{ \frac{(\text{Total of }x_i) \times (\text{Total of }y_j)}{n} }\)

  • If the information is shown in a table as follows, then we have

    \(\boxed{ E_{ij} } = \frac{(i\text{-th row total}) \times (j\text{-th column total})}{n} = \boxed{ \frac{R_i C_j}{n} }\)

    Expected Count \(y_1\) \(y_2\) \(\cdots\) \(y_L\) Total
    \(x_1\) \(E_{11}=\frac{R_1 C_1}{n}\) \(E_{12}=\frac{R_1 C_2}{n}\) \(\cdots\) \(E_{1L}=\frac{R_1 C_L}{n}\) \(R_1\)
    \(x_2\) \(E_{21}=\frac{R_2 C_1}{n}\) \(E_{22}=\frac{R_2 C_2}{n}\) \(\cdots\) \(E_{2L}=\frac{R_2 C_L}{n}\) \(R_2\)
    \(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\) \(\vdots\)
    \(x_K\) \(E_{K1}=\frac{R_K C_1}{n}\) \(E_{K2}=\frac{R_K C_2}{n}\) \(\cdots\) \(E_{KL}=\frac{R_K C_L}{n}\) \(R_K\)
    Total \(C_1\) \(C_2\) \(\cdots\) \(C_L\) Grand Total \(=n\)
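
  • In NumPy, the whole table of expected counts is the outer product of the row totals and the column totals divided by the grand total (continuing the illustrative counts from 1.2.3):

      import numpy as np

      O = np.array([[20, 30, 50],  # illustrative observed counts, as in 1.2.3
                    [40, 30, 30]])

      R, C, n = O.sum(axis=1), O.sum(axis=0), O.sum()
      E = np.outer(R, C) / n       # E[i, j] = R_i * C_j / n for every cell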

1.2.5 Test Procedure

  • Remark
    • Similar to the Goodness-of-Fit test, the independence test also measures the deviation between the assumption and the reality via the deviation between the expected count and the observed count. The test statistic also follows a \(\chi^2\) distribution.
    • Since the test uses both expected and observed counts, we may put them together into a single table as follows.

      Expected \(\vert\) Observed \(y_1\) \(y_2\) \(\cdots\) \(y_L\) Total
      \(x_1\) \(E_{11} \vert O_{11}\) \(E_{12} \vert O_{12}\) \(\cdots\) \(E_{1L} \vert O_{1L}\) \(R_1\)
      \(x_2\) \(E_{21} \vert O_{21}\) \(E_{22} \vert O_{22}\) \(\cdots\) \(E_{2L} \vert O_{2L}\) \(R_2\)
      \(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\) \(\vdots\)
      \(x_K\) \(E_{K1} \vert O_{K1}\) \(E_{K2} \vert O_{K2}\) \(\cdots\) \(E_{KL} \vert O_{KL}\) \(R_K\)
      Total \(C_1\) \(C_2\) \(\cdots\) \(C_L\) Grand Total \(=n\)
  • At significance level \(\alpha\), to test
    • \(\boxed{H_0: \text{Random variables } X \text{ and } Y \text{ are INDEPENDENT}}\) (or the distribution of \(X\) is homogeneous across all values of \(Y\)) vs
    • \(\boxed{H_A: \text{the two random variables are NOT INDEPENDENT}}\) (or the distribution of \(X\) is NOT homogeneous across all values of \(Y\)),
    • Test Statistic: \(\boxed{T = \sum\limits_{j=1}^{L} \sum\limits_{i=1}^{K} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}}\)
    • Degrees of Freedom for the \(\chi^2\) statistic: \((K-1)(L-1)\)
    • Critical Value: \(\boxed{\chi^2_{\alpha, (K-1)(L-1)}}\)
    • Rejection Rule: Reject \(H_0\) if \(\boxed{T > \chi^2_{\alpha, (K-1)(L-1)}}\)
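
  • scipy bundles this whole procedure into scipy.stats.chi2_contingency, which takes the table of observed counts and returns the statistic, the p-value, the degrees of freedom \((K-1)(L-1)\), and the expected counts (a sketch; correction=False disables a continuity correction that scipy would otherwise apply to \(2 \times 2\) tables, so the statistic matches the formula above):

      import numpy as np
      from scipy import stats

      O = np.array([[20, 30, 50],  # illustrative observed counts, as in 1.2.3
                    [40, 30, 30]])

      T, p_value, dof, E = stats.chi2_contingency(O, correction=False)
      # Reject H_0 if T exceeds the critical value, or equivalently if p_value < alpha.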

Yunfei Wang

2016-04-20 Wed 23:33