STAT 3360 Notes
1 Estimation
1.1 Introduction
- As seen before, once the type (eg, Binomial, Poisson, Normal) and parameters (eg, \(n\), \(p\), \(\mu\), \(\sigma\)) of a distribution are given, the probability mass/density function is determined and thus the probability of any event can be calculated.
- However, in reality, the values of the parameters are usually unknown. For example, we may believe the test scores of students follow a Normal distribution with population mean \(\mu\) and standard deviation \(\sigma\), but usually we don't know the exact value of \(\mu\) or \(\sigma\). Similarly, in simple linear regression, we believe there is a linear relationship between the height and weight of a student, but we don't know the true values of the slope and intercept of the regression line.
- When the true value of a population parameter is unknown, we can estimate it from a given sample; this process is called Estimation. For example, in simple linear regression, we may use the Least Squares approach to get estimates of the slope and intercept of the regression line.
1.2 Types of Estimation
- There are two types of estimation: point estimate and confidence interval.
- A Point Estimate uses a single value as the estimate of the population parameter.
- For example, if the population mean is unknown, then we may use the sample mean as its estimate. That is, if the sample mean is \(3\), then we may just use the single number \(3\) as our point estimate for the population mean.
- However, there is a drawback of the point estimate. Since we use only a single number as the estimate, it is quite likely that our estimation is incorrect. For example, while we use the sample mean \(3\) as the point estimate for the population mean, the true value of the population mean may be \(2.97\), \(3.12\) or any other value. As long as the true value is not exactly \(3\), our point estimate is incorrect.
- A Confidence Interval gives a whole interval as the estimate, in the hope that the interval contains the true value of the population parameter.
- For example, suppose the true value of the population mean is \(3.1\) and the sample mean we have is \(3\). Instead of using the sample mean \(3\), we use an interval \([3 - 0.2, 3 + 0.2] = [2.8, 3.2]\) to estimate the population mean. While the sample mean \(3\) doesn't correctly hit the true value \(3.1\) of the population mean, the interval \([2.8, 3.2]\) does so.
- What kind of confidence interval do we prefer? Suppose we are interested in the population mean of all students' test scores, say \(\mu\).
- We may use \([0, 100]\) as our confidence interval, and then we are \(100\%\) sure that it covers the true value of \(\mu\). However, does the interval \([0, 100]\) make any sense? No, we still don't know whether the average score of the students is high or low.
- In contrast, we may use \([85, 95]\) as our confidence interval for \(\mu\), which narrows down the scope of the estimation and thus is more precise. However, using \([85, 95]\), we are no longer \(100\%\) sure that the true value of \(\mu\) will be covered by the interval.
- As we see,
- the wider the confidence interval is, the more likely it is to cover the true value of the population parameter, but the less precise the estimate is.
- the shorter the confidence interval is, the less likely it is to cover the true value of the population parameter, but the more precise the estimate is.
- In reality, we usually first determine the level of coverage we want (called confidence level), and then find the shortest confidence interval at this level.
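The trade-off between width and coverage can be checked with a small simulation. The Python sketch below is only an illustration: the population (Normal with mean \(90\) and standard deviation \(10\)), the sample size \(25\), the half-widths, and the helper name `coverage` are all made-up assumptions, not values from these notes.

```python
import random
import statistics

def coverage(half_width, mu=90.0, sigma=10.0, n=25, trials=5000, seed=0):
    """Estimate how often [xbar - half_width, xbar + half_width] covers mu.

    mu, sigma, n and trials are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed so every half-width sees the same samples
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials

for h in (1.0, 2.0, 4.0):
    print(f"width {2 * h}: estimated coverage {coverage(h):.3f}")
```

Because the same simulated samples are reused for every half-width, the wider interval always contains the shorter one, so the estimated coverage can only go up as the width grows.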
2 Estimating Population Mean
2.1 Point Estimate
2.1.1 Construction
- Usually, we use the sample mean \(\bar{x}\) as the point estimate for the population mean.
2.1.2 Example
- Suppose the students' scores (decimal values) follow a Normal distribution with population mean \(\mu\) and population variance \(9\).
- Q: If we randomly select \(16\) students and find their average score is \(79\), what is a point estimate of \(\mu\)?
- We just use the sample mean \(\bar{X} = 79\) as the point estimate for the population mean \(\mu\).
2.2 Confidence Interval
2.2.1 Construction
- Suppose
- \(X\sim N(\mu, \sigma^2)\) where \(\mu\) is unknown and \(\sigma\) is known
- \(X_1, X_2, \dots, X_n\) are random observations of \(X\)
- Then
- the \((1-\alpha)\) level confidence interval for \(\mu\) is
\[ \boxed{ CI = \left[ \bar{X} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \ , \ \bar{X} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \right] } \]
- where
- \((1-\alpha)\) is called the Confidence Level, which means the probability that the confidence interval covers the true value of \(\mu\) is \((1 - \alpha)\), ie,
\[ P(\mu \in CI) = P\left( \bar{X} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha \]
The proof is omitted here.
- the sample mean \(\bar{X}\) is called the Mid-Point
- \(M = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\) is called the Margin of Error, in which
- \(z_t\), called the Critical Value at \(t\), is the number such that for standard Normal \(Z\) the right tail probability \(P(Z > z_t) = t\)
- For example, if \(\alpha = 0.1\), then \(z_{\alpha/2} = z_{0.05} \approx 1.64\) (more precisely \(1.645\)) because the right tail probability \(P(Z > 1.64) \approx 0.05\)
- For example, if \(\alpha = 0.05\), then \(z_{\alpha/2} = z_{0.025} = 1.96\) because the right tail probability \(P(Z > 1.96) = 0.025\)
- \(\frac{\sigma}{\sqrt{n}}\) is the standard deviation of \(\bar{X}\) (ie, \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\), see section "Sampling Distribution of Sample Mean")
- to summarize, we have
\[ \boxed{ \text{Margin of Error} = \text{Critical Value} \times \text{Standard Deviation of } \bar{X} } \]
\[ \boxed{ CI = [\text{Mid Point} - \text{Margin of Error }, \text{ Mid Point} + \text{Margin of Error}] }\]
- (Mid Point - Margin of Error) = \(\bar{X} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\), the left endpoint of the CI, is called the Lower Confidence Limit (LCL) of the CI
- (Mid Point + Margin of Error) = \(\bar{X} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\), the right endpoint of the CI, is called the Upper Confidence Limit (UCL) of the CI
- \((\text{UCL} - \text{LCL}) = 2 \times (\text{Margin of Error})\) is the Width of the confidence interval.
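The construction above can be sketched in Python using the standard library's `statistics.NormalDist`. The function names `critical_value` and `mean_confidence_interval`, and the numbers in the usage lines (\(\bar{X} = 50\), \(\sigma = 4\), \(n = 64\), \(\alpha = 0.05\)), are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def critical_value(t):
    """Return z_t, the number with right-tail probability P(Z > z_t) = t."""
    return NormalDist().inv_cdf(1 - t)

def mean_confidence_interval(xbar, sigma, n, alpha):
    """(1 - alpha) level CI for mu when sigma is known; returns (LCL, UCL)."""
    margin = critical_value(alpha / 2) * sigma / sqrt(n)  # margin of error M
    return xbar - margin, xbar + margin

# Critical values for the two most common confidence levels.
print(round(critical_value(0.05), 3))   # z_{0.05}, used for a 90% CI
print(round(critical_value(0.025), 3))  # z_{0.025}, used for a 95% CI

# Hypothetical data: xbar = 50, sigma = 4, n = 64, 95% confidence level.
lcl, ucl = mean_confidence_interval(xbar=50, sigma=4, n=64, alpha=0.05)
print(round(lcl, 2), round(ucl, 2), "width:", round(ucl - lcl, 2))
```

Note that `inv_cdf` works with left-tail probabilities, so the right-tail critical value \(z_t\) is obtained as the \((1 - t)\) quantile.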
2.2.2 Sample Size
- We know the margin of error is \(M = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\), so
\[ \boxed{ n = \left( \frac{z_{\alpha/2} \sigma}{M} \right)^2 } \]
which can be used for calculating the sample size needed for achieving a given margin of error.
- By the formula, the smaller \(M\) is, the larger \(n\) is. That is, to achieve a smaller margin of error and thus a shorter confidence interval, a larger sample size is needed. Therefore, to achieve a margin of error no larger than \(M\), the sample size should be at least \(n = \left( \frac{z_{\alpha/2} \sigma}{M} \right)^2\). If the value of \(n\) calculated by the formula is not an integer, then we should round it up to ensure the actual margin of error is no larger than \(M\).
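The sample size formula and the round-up rule can be sketched as follows. The function name `required_sample_size` and the inputs in the usage lines (\(M = 2\), \(\sigma = 10\), \(\alpha = 0.05\)) are hypothetical, picked only to show the rounding step.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_sample_size(margin, sigma, alpha):
    """Smallest integer n whose margin of error is no larger than `margin`."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    return ceil((z * sigma / margin) ** 2)   # round up, never down

# Hypothetical inputs: M = 2, sigma = 10, 95% confidence level.
n = required_sample_size(margin=2, sigma=10, alpha=0.05)
z = NormalDist().inv_cdf(0.975)
print("n =", n)
print("margin with n:    ", round(z * 10 / sqrt(n), 4))      # at most 2
print("margin with n - 1:", round(z * 10 / sqrt(n - 1), 4))  # larger than 2
```

Rounding down would give a sample that is slightly too small, so the achieved margin of error would exceed the target.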
2.2.3 Example
- Suppose the students' scores (decimal values) follow a Normal distribution with population mean \(\mu\) and population variance \(9\).
- Q1: If we randomly select \(16\) students and find their average score is \(79\), what is the \(90\%\) confidence interval for the population mean \(\mu\)?
- First, write down the information in mathematical language.
- Let \(X\) be the score of a student, then we know \(X \sim N(\mu, \sigma^2 = 9)\) and thus the standard deviation \(\sigma = \sqrt{9} = 3\)
- There are \(16\) observations of \(X\), so \(n = 16\) and \(\bar{X} = 79\) is the average of the \(16\) observations
- We are asked about the confidence interval of \(\mu\) at confidence level of \(90\%\).
- By the formulas,
- \(1 - \alpha = 90\%\) is the confidence level, so \(\alpha = 0.1\),
- the midpoint of the confidence interval is just \(\bar{X} = 79\),
- the margin of error of the confidence interval is \(M = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\), where
- \(z_{\alpha/2} = z_{0.1/2} = z_{0.05} \approx 1.64\) because the right tail probability \(P(Z > 1.64) \approx 0.05\)
- \(\frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{16}} = 0.75\)
- so \(M = 1.64 \cdot 0.75 = 1.23\)
- the confidence interval is just \([79 - 1.23, 79 + 1.23] = [77.77, 80.23]\)
- Q2: If we want a confidence interval with width \(2\) at the same \(90\%\) confidence level, what is the smallest sample size we should have? Is \(24\) enough?
- We know \(\text{Width} = 2 \times \text{Margin of Error}\), so when \(\text{Width} = 2\), we have \(\text{Margin of Error} = M = 1\).
- By the formula, \(n = \left( \frac{z_{\alpha/2} \sigma}{M} \right)^2 = \left( \frac{1.64 \cdot 3}{1} \right)^2 \approx 24.21\),
- \(n = 24.21\) is not an integer, so we should round it up and thus the smallest sample size is \(25\),
- therefore, to get a CI with width \(2\), the sample size should be at least \(25\) and thus \(24\) is not enough.
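As a check on Q1 and Q2, the arithmetic can be reproduced in Python. This is only a sketch: the helper names `q1_interval` and `q2_min_sample_size` are made up for illustration, and the unrounded critical value from `statistics.NormalDist` is included alongside the rounded table value \(1.64\) to show that the rounding does not change either answer here.

```python
from math import ceil, sqrt
from statistics import NormalDist

SIGMA = 3.0   # population standard deviation, sqrt(9)
ALPHA = 0.10  # 90% confidence level

def q1_interval(z, xbar=79.0, n=16):
    """Q1: margin of error and confidence interval for a critical value z."""
    margin = z * SIGMA / sqrt(n)
    return margin, (xbar - margin, xbar + margin)

def q2_min_sample_size(z, width=2.0):
    """Q2: smallest n whose interval width is at most `width`."""
    margin = width / 2
    return ceil((z * SIGMA / margin) ** 2)

z_table = 1.64                                 # rounded value used in the notes
z_exact = NormalDist().inv_cdf(1 - ALPHA / 2)  # unrounded critical value

for label, z in (("table", z_table), ("exact", z_exact)):
    margin, (lcl, ucl) = q1_interval(z)
    print(f"{label}: M = {margin:.2f}, CI = [{lcl:.2f}, {ucl:.2f}], "
          f"min n = {q2_min_sample_size(z)}, "
          f"width at n = 24: {2 * z * SIGMA / sqrt(24):.3f}")
```

With either critical value, the width at \(n = 24\) is slightly above \(2\), which is why \(24\) is not enough and \(25\) is needed.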
3 References
- Keller, Gerald. (2015). Statistics for Management and Economics, 10th Edition. Stamford: Cengage Learning.
Yunfei Wang
2016-03-02 Wed 06:09