STAT 3360 Notes
1 \(\chi^2\) Tests
1.1 Goodness-of-Fit Test of One Random Variable
1.1.1 Introduction
- In the section Discrete Probability Distribution, we learned about the probability mass function, which describes the distribution of a discrete random variable. In reality, however, the true distribution is usually unknown. Just as we did for the population proportion, we may use hypothesis testing to test whether a hypothesized probability mass function is true (i.e., whether the hypothesized distribution matches the real distribution of the random variable).
- Since the test is about "matching", it is named the "Goodness-of-Fit" test.
- The reason why we put this test under the section "\(\chi^2\) test" is that the critical value for this test is based on a \(\chi^2\) distribution.
- Actually, there are many types of Goodness-of-Fit tests. The particular \(\chi^2\) Goodness-of-Fit test here is mainly used for testing discrete distributions. For continuous distributions, there are other Goodness-of-Fit test procedures, which are not covered in this course.
1.1.2 Notations
- Suppose
- \(X\) is a discrete random variable;
- \(X\) has \(K\) different possible values;
- \(x_1, x_2, \dots, x_K\) are the \(K\) possible values of \(X\);
- A sample of size \(n\) about \(X\) is obtained.
1.1.3 Hypothesized Distribution
- Remark
- The Goodness-of-Fit test determines whether an assumed (hypothesized) distribution matches the reality. Before applying the test, we should first clarify the hypothesized distribution.
The hypothesized distribution of the discrete random variable \(X\) can be represented by a table, where \(p_1, p_2, \dots, p_K\) are the hypothesized proportions for the values \(x_1, x_2, \dots, x_K\), respectively:

| \(X\) | \(x_1\) | \(x_2\) | … | \(x_K\) | Total |
| --- | --- | --- | --- | --- | --- |
| Hypothesized Probability | \(p_1\) | \(p_2\) | … | \(p_K\) | \(100\%\) |
1.1.4 Observed Counts and Expected Counts
- For each value, the number of times it appears in the sample is called the Observed Count of that value.
- For each value of \(X\), \(\boxed{\text{Expected Count} = \text{Total Count} \times \text{Corresponding Hypothesized Probability}}\).
- We denote
- by \(O_i\) the observed count of \(x_i\);
- by \(E_i\) the expected count of \(x_i\).
The above information can be summarized in a table:

| \(X\) | \(x_1\) | \(x_2\) | … | \(x_K\) | Total |
| --- | --- | --- | --- | --- | --- |
| Hypothesized Probability | \(p_1\) | \(p_2\) | … | \(p_K\) | \(100\%\) |
| Observed Count | \(O_1\) | \(O_2\) | … | \(O_K\) | \(n\) |
| Expected Count | \(E_1=np_1\) | \(E_2=np_2\) | … | \(E_K=np_K\) | \(n\) |

where \(\boxed{E_i = n \cdot p_i}\) for \(i=1,2,\dots,K\).
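The expected counts follow directly from \(E_i = n \cdot p_i\). A minimal sketch in Python (the total count and probabilities below are made-up numbers for illustration, not from the notes):

```python
import numpy as np

n = 200                        # total count (made-up for illustration)
p = np.array([0.5, 0.3, 0.2])  # hypothesized probabilities p_i
E = n * p                      # expected counts E_i = n * p_i -> [100., 60., 40.]
```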
1.1.5 Test Procedure
- At significance level \(\alpha\), to test \(\boxed{H_0: \text{the real distribution is the same as hypothesized}}\) vs \(\boxed{H_A: \text{the real distribution is NOT the same as hypothesized}}\),
- Test Statistic: \(\boxed{T = \sum\limits_{i=1}^{K} \frac{(E_i - O_i)^2}{E_i}}\)
- Degrees of Freedom for the \(\chi^2\) statistic: \(K-1\)
- Critical Value: \(\boxed{\chi^2_{\alpha, K-1}}\)
- Rejection Rule: Reject \(H_0\) if \(\boxed{T > \chi^2_{\alpha, K-1}}\)
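This procedure can be collected into a short Python sketch using scipy.stats; the function name `gof_test` and its arguments are our own for illustration, not a library API:

```python
import numpy as np
from scipy import stats

def gof_test(observed, hypothesized_probs, alpha):
    """Chi-squared Goodness-of-Fit test at significance level alpha."""
    O = np.asarray(observed, dtype=float)
    p = np.asarray(hypothesized_probs, dtype=float)
    n = O.sum()                                      # total count n
    E = n * p                                        # expected counts E_i = n * p_i
    T = ((E - O) ** 2 / E).sum()                     # test statistic
    K = len(O)                                       # number of possible values
    critical = stats.chi2.ppf(1 - alpha, df=K - 1)   # chi^2_{alpha, K-1}
    return T, critical, T > critical                 # reject H0 if T > critical
```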
1.1.6 Example
- According to his experience, the teacher of STAT 3360 believes that among all students the proportions for grades A, B, C, D, F and W are \(15\%, 20\%, 40\%, 10\%, 10\%\) and \(5\%\), respectively. Suppose now \(400\) students are randomly sampled, among which \(66\) get grade A, \(84\) get B, \(148\) get C, \(48\) get D, \(46\) get F and \(8\) get W.
- Q1: At \(5\%\) significance level, do we have enough evidence that the teacher's assumption is incorrect?
- We want to test
- \(H_0:\) the real distribution of grade is the same as the teacher's assumption
- \(H_A:\) the real distribution of grade is NOT the same as the teacher's assumption
Let's summarize the information in a table using \(E_i = n \cdot p_i\), where \(n=400\) is the sample size:

| Grade | A | B | C | D | F | W | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Hypothesized Proportion | \(p_1=15\%\) | \(p_2=20\%\) | \(p_3=40\%\) | \(p_4=10\%\) | \(p_5=10\%\) | \(p_6=5\%\) | \(100\%\) |
| Observed Count | \(O_1=66\) | \(O_2=84\) | \(O_3=148\) | \(O_4=48\) | \(O_5=46\) | \(O_6=8\) | \(n=400\) |
| Expected Count | \(E_1=400 \cdot 15\% = 60\) | \(E_2=400 \cdot 20\% = 80\) | \(E_3=400 \cdot 40\% = 160\) | \(E_4=400 \cdot 10\% = 40\) | \(E_5=400 \cdot 10\% = 40\) | \(E_6=400 \cdot 5\% = 20\) | \(n=400\) |

- Significance Level: \(\alpha = 5\% = 0.05\)
- Degrees of Freedom for the \(\chi^2\) statistic
- There are \(K = 6\) different values (categories) of the Grade
- Thus the degrees of freedom for the \(\chi^2\) statistic is \(K-1=6-1=5\)
- Critical Value: \(\chi^2_{\alpha, K-1} = \chi^2_{0.05, 5} = 11.1\)
- Test Statistic: \(T = \sum\limits_{i=1}^{6} \frac{(E_i - O_i)^2}{E_i} = \frac{(60-66)^2}{60} + \frac{(80-84)^2}{80} + \frac{(160-148)^2}{160} + \frac{(40-48)^2}{40} + \frac{(40-46)^2}{40} + \frac{(20-8)^2}{20} = 11.4\)
- Rejection Rule: Reject \(H_0\) if \(T > \chi^2_{\alpha, K-1}\)
- Decision: Since \(T = 11.4 > 11.1 = \chi^2_{\alpha, K-1}\), we reject \(H_0\). That is, there is sufficient evidence that the teacher's assumption is not true.
- Q2: At \(1\%\) significance level, do we have enough evidence that the teacher's assumption is incorrect?
- We want to test
- \(H_0:\) the real distribution of grade is the same as the teacher's assumption
- \(H_A:\) the real distribution of grade is NOT the same as the teacher's assumption
- Significance Level: \(\alpha = 1\% = 0.01\)
- Degrees of Freedom for the \(\chi^2\) statistic
- There are \(K = 6\) different values (categories) of Grade, the same as in Q1
- Thus the degrees of freedom for the \(\chi^2\) statistic is \(K-1=6-1=5\), the same as in Q1
- Critical Value: \(\chi^2_{\alpha, K-1} = \chi^2_{0.01, 5} = 15.1\)
- Test Statistic: \(T = 11.4\), the same as in Q1
- Rejection Rule: Reject \(H_0\) if \(T > \chi^2_{\alpha, K-1}\)
- Decision: Since \(T = 11.4 < 15.1 = \chi^2_{\alpha, K-1}\), we do not reject \(H_0\). That is, there is not sufficient evidence that the teacher's assumption is incorrect.
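The calculations in Q1 and Q2 can be verified with a few lines of Python (a sketch using the numbers from this example):

```python
import numpy as np
from scipy import stats

p = np.array([0.15, 0.20, 0.40, 0.10, 0.10, 0.05])  # hypothesized proportions
O = np.array([66, 84, 148, 48, 46, 8])              # observed counts, n = 400
E = O.sum() * p                                     # [60, 80, 160, 40, 40, 20]

T = ((E - O) ** 2 / E).sum()
print(T)                           # 11.4
print(stats.chi2.ppf(0.95, df=5))  # 11.07 -> T > critical, reject H0 at alpha = 0.05
print(stats.chi2.ppf(0.99, df=5))  # 15.09 -> T < critical, do not reject H0 at alpha = 0.01
```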
1.2 Independence (Homogeneity) Test of Two R.V.
1.2.1 Introduction
- Given two random variables, we are naturally interested in whether they are independent. If they are independent, then there is no need to consider the two random variables together; otherwise, we may want further investigation of the relationship between them.
- For numerical random variables, we use linear regression to study their relationship.
- For categorical random variables, we can use the so-called Independence Test or Homogeneity Test to detect dependence between them. The two words "independence" and "homogeneity" actually have the same meaning here. Think about two variables, Gender and Age. If Gender and Age are independent, then Gender has no impact on the distribution of Age. That is, the distribution of Age within the Male group and the distribution of Age within the Female group are the same, which means "homogeneity" between the two distributions.
- The reason why we put this test under the section "\(\chi^2\)" test is that the critical value for this test is based on a \(\chi^2\) distribution.
- In the Goodness-of-Fit test, a hypothesized distribution is given, and we would like to determine whether this hypothesized distribution matches (fit) the reality.
- However, in the independence (homogeneity) test, no hypothesized distribution is needed, and we test the independence between the two random variables using only observed data.
- Therefore, the expected counts cannot be calculated by "total \(\times\) hypothesized proportion". Instead, the "hypothesized proportion" is replaced by a theoretical proportion based on the independence assumption and sample proportions.
1.2.2 Notations
- Suppose
- \(X\) and \(Y\) are two discrete random variables;
- \(X\) has \(K\) different possible values;
- \(Y\) has \(L\) different possible values;
- \(x_1, x_2, \dots, x_K\) represent the \(K\) different possible values of \(X\);
- \(y_1, y_2, \dots, y_L\) represent the \(L\) different possible values of \(Y\);
- A sample of size \(n\) (grand total) is obtained, which is simultaneously about the two random variables \(X\) and \(Y\).
1.2.3 Observed Counts
We may then summarize the information in the sample by a table as follows, where
- \(O_{ij}\) represents the Observed Counts corresponding to the event [\(X = x_i\) AND \(Y = y_j\)];
- \(R_i\) represents the \(i\)-th row total, \(i=1,2,\dots,K\);
- \(C_j\) represents the \(j\)-th column total, \(j=1,2,\dots,L\).
| Observed Count | \(y_1\) | \(y_2\) | … | \(y_L\) | Total |
| --- | --- | --- | --- | --- | --- |
| \(x_1\) | \(O_{11}\) | \(O_{12}\) | … | \(O_{1L}\) | \(R_1=\sum\limits_{j=1}^{L}O_{1j}\) |
| \(x_2\) | \(O_{21}\) | \(O_{22}\) | … | \(O_{2L}\) | \(R_2=\sum\limits_{j=1}^{L}O_{2j}\) |
| … | … | … | … | … | … |
| \(x_K\) | \(O_{K1}\) | \(O_{K2}\) | … | \(O_{KL}\) | \(R_K=\sum\limits_{j=1}^{L}O_{Kj}\) |
| Total | \(C_1=\sum\limits_{i=1}^{K}O_{i1}\) | \(C_2=\sum\limits_{i=1}^{K}O_{i2}\) | … | \(C_L=\sum\limits_{i=1}^{K}O_{iL}\) | Grand Total \(=n\) |
1.2.4 Expected Counts
- Let \(E_{ij}\) represent the Expected Count corresponding to [\(X = x_i\) AND \(Y = y_j\)].
When \(X\) and \(Y\) are independent, by definition of independence, we have
\(\boxed{E_{ij}} = \text{Expected Count for } [X = x_i \text{ AND } Y = y_j]\)
\(= \text{Grand Total} \times \text{Theoretical Proportion of the Event } [X = x_i \text{ AND } Y = y_j]\)
\(= \text{Grand Total} \times P(X = x_i \text{ AND } Y = y_j)\)
\(= \text{Grand Total} \times \underbrace{P(X = x_i)}_{\text{population proportion, estimated by sample proportion}} \times \underbrace{P(Y = y_j)}_{\text{population proportion estimated by sample proportion}}\)
\(\approx \text{Grand Total} \times \underbrace{\frac{\text{Total of }x_i}{\text{Grand Total}}}_{\text{sample proportion of } x_i} \times \underbrace{\frac{\text{Total of }y_j}{\text{Grand Total}}}_{\text{sample proportion of }y_j}\)
\(= n \times \frac{\text{Total of }x_i}{n} \times \frac{\text{Total of }y_j}{n}\)
\(= \boxed{ \frac{(\text{Total of }x_i) \times (\text{Total of }y_j)}{n} }\)
If the information is shown in a table as follows, then we have
\(\boxed{ E_{ij} } = \frac{(i\text{-th row total}) \times (j\text{-th column total})}{n} = \boxed{ \frac{R_i C_j}{n} }\)
| Expected Count | \(y_1\) | \(y_2\) | … | \(y_L\) | Total |
| --- | --- | --- | --- | --- | --- |
| \(x_1\) | \(E_{11}=\frac{R_1 C_1}{n}\) | \(E_{12}=\frac{R_1 C_2}{n}\) | … | \(E_{1L}=\frac{R_1 C_L}{n}\) | \(R_1\) |
| \(x_2\) | \(E_{21}=\frac{R_2 C_1}{n}\) | \(E_{22}=\frac{R_2 C_2}{n}\) | … | \(E_{2L}=\frac{R_2 C_L}{n}\) | \(R_2\) |
| … | … | … | … | … | … |
| \(x_K\) | \(E_{K1}=\frac{R_K C_1}{n}\) | \(E_{K2}=\frac{R_K C_2}{n}\) | … | \(E_{KL}=\frac{R_K C_L}{n}\) | \(R_K\) |
| Total | \(C_1\) | \(C_2\) | … | \(C_L\) | Grand Total \(=n\) |
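In code, the whole table of expected counts is the outer product of the row totals and the column totals, divided by \(n\). A minimal sketch in Python (the observed table is a made-up \(2 \times 3\) example):

```python
import numpy as np

O = np.array([[20, 30, 50],   # made-up observed counts O_ij
              [30, 20, 50]])
R = O.sum(axis=1)             # row totals R_i    -> [100, 100]
C = O.sum(axis=0)             # column totals C_j -> [ 50,  50, 100]
n = O.sum()                   # grand total n = 200
E = np.outer(R, C) / n        # E_ij = R_i * C_j / n
# E -> [[25., 25., 50.],
#       [25., 25., 50.]]
```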
1.2.5 Test Procedure
- Remark
- Similar to the Goodness-of-Fit test, the independence test also measures the deviation between the assumption and the reality via the deviation between the expected count and the observed count. The test statistic also follows a \(\chi^2\) distribution.
Since the test uses both expected and observed counts, we may put them together into a single table as follows.
| Expected \(\vert\) Observed | \(y_1\) | \(y_2\) | … | \(y_L\) | Total |
| --- | --- | --- | --- | --- | --- |
| \(x_1\) | \(E_{11} \vert O_{11}\) | \(E_{12} \vert O_{12}\) | … | \(E_{1L} \vert O_{1L}\) | \(R_1\) |
| \(x_2\) | \(E_{21} \vert O_{21}\) | \(E_{22} \vert O_{22}\) | … | \(E_{2L} \vert O_{2L}\) | \(R_2\) |
| … | … | … | … | … | … |
| \(x_K\) | \(E_{K1} \vert O_{K1}\) | \(E_{K2} \vert O_{K2}\) | … | \(E_{KL} \vert O_{KL}\) | \(R_K\) |
| Total | \(C_1\) | \(C_2\) | … | \(C_L\) | Grand Total \(=n\) |
- At significance level \(\alpha\), to test
- \(\boxed{H_0: \text{Random variables } X \text{ and } Y \text{ are INDEPENDENT}}\) (or the distribution of \(X\) is homogeneous across all values of \(Y\)) vs
- \(\boxed{H_A: \text{the two random variables are NOT INDEPENDENT}}\) (or the distribution of \(X\) is NOT homogeneous across all values of \(Y\)),
- Test Statistic: \(\boxed{T = \sum\limits_{j=1}^{L} \sum\limits_{i=1}^{K} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}}\)
- Degrees of Freedom for the \(\chi^2\) statistic: \((K-1)(L-1)\)
- Critical Value: \(\boxed{\chi^2_{\alpha, (K-1)(L-1)}}\)
- Rejection Rule: Reject \(H_0\) if \(\boxed{T > \chi^2_{\alpha, (K-1)(L-1)}}\)
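A minimal sketch of the full procedure in Python: scipy.stats.chi2_contingency computes the same statistic, degrees of freedom, and expected counts (passing correction=False turns off the Yates continuity correction so the statistic matches the formula above; the observed table reuses the made-up example from 1.2.4):

```python
import numpy as np
from scipy import stats

O = np.array([[20, 30, 50],   # made-up observed counts O_ij
              [30, 20, 50]])
alpha = 0.05

T, p_value, dof, E = stats.chi2_contingency(O, correction=False)
critical = stats.chi2.ppf(1 - alpha, df=dof)   # chi^2_{alpha, (K-1)(L-1)}

print(T, dof)        # 4.0, 2
print(T > critical)  # False: 4.0 < 5.99, do not reject H0
```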