Normal Distribution & Central Limit Theorem
September 19, 2017
Notation:
- \(\mu\) is the population average
- \(\sigma\) is the standard deviation
- \(\sigma^2\) is the variance
- \(X_1, \cdots, X_M\) or \(Y_1, \cdots, Y_N\) denote the individual observations
- \(\bar{X}\) or \(\bar{Y}\) is the sample average
- \(SE\) refers to standard error
Formulaic representations:
\[\mu = \frac{1}{m} \displaystyle\sum_{i=1}^{m} x_i\]
\[\bar{X} = \frac{1}{M} \displaystyle \sum_{i=1}^{M}X_i\]
\[\sigma^2 = \frac{1}{m} \displaystyle\sum_{i=1}^{m} (x_i - \mu)^2\]
\[SE(\bar{X}) = \sigma / \sqrt{M}\]
Normal distribution
This is represented by the following formula:
\[Pr(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx \]
As you can see, the only parameters required to specify this distribution are the average, \(\mu\), and the standard deviation, \(\sigma\).
| \(\pm 1\sigma\) | \(\pm 2\sigma\) | \(\pm 2.5\sigma\) | \(\pm 3\sigma\) |
|---|---|---|---|
| 68% | 95% | 99% | 99.7% |
68-95-99.7 rule
Standardized Units
If your data follows a normal distribution, you can convert it to “standard units.” This is done using the following formula:
\[Z_i = \frac{X_i - \bar{X}}{s_X}\]
Converting to Z-scores (which are unitless) tells us exactly how many standard deviations an observation lies from the mean.
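A quick sketch of this conversion in R (the data here are made up for illustration):

```r
# Hypothetical sample of five measurements
x <- c(61, 64, 67, 70, 73)

# Manual z-scores using the sample mean and standard deviation
z <- (x - mean(x)) / sd(x)

# scale() performs the same centering and scaling
all.equal(as.vector(scale(x)), z)
```

`scale()` is a convenient built-in that does the centering and division by `sd(x)` for you.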
Central Limit Theorem
\[\bar{X}\sim N\left(\mu_X,\dfrac{\sigma^2}{n}\right)\]
Here, \(\sigma_X\) is the population standard deviation, and \(\sigma_X^2\) (below) is the population variance. What's important to notice in the Central Limit Theorem (CLT) is the effect of sample size on the spread of the distribution. Because the sample size, \(n\), is the denominator of \(\frac{\sigma^2}{n}\), as \(n\) gets bigger, our spread gets smaller.
\[\sigma_X^2 = \frac{1}{m} \displaystyle \sum_{i=1}^{m}(x_i - \mu_X)^2\]
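A small simulation makes the shrinking spread concrete (the parameter values are arbitrary choices):

```r
set.seed(1)
mu <- 5; sigma <- 2

# Draw many samples of size n and record each sample mean
sim_means <- function(n, reps = 10000) {
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

# The sd of the sample means should track sigma / sqrt(n)
sd(sim_means(10))   # roughly sigma / sqrt(10), about 0.63
sd(sim_means(100))  # roughly sigma / sqrt(100), about 0.2
```

Quadrupling \(n\) halves the standard error, exactly as \(\frac{\sigma^2}{n}\) predicts.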
Use qqnorm() and qqline() to compare observational data against the theoretical normal distribution.
T-test
t.test() is a function in R that allows us to calculate the p-value, t-statistic, confidence interval, etc. very simply.
For example:
t.test(treatment, control)
# To see just the p-value, you can use $
t.test(treatment, control)$p.value
The one-sample \(t\)-test statistic is calculated as:
\[t = \frac{\bar{X}-\mu}{s_{\bar{X}}}\]
where
\[s_{\bar{X}} = \frac{s_X}{\sqrt{n}}\]
- \(s_X\) is the sample standard deviation
- \(s_{\bar{X}}\) is the estimated standard error of the mean
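Computing the statistic by hand and checking it against t.test() is a good sanity check (the sample and hypothesized mean below are made up):

```r
set.seed(2)
x <- rnorm(20, mean = 5.5, sd = 1)   # hypothetical sample
mu0 <- 5                             # hypothesized population mean

se <- sd(x) / sqrt(length(x))        # estimated standard error of the mean
t_manual <- (mean(x) - mu0) / se

# t.test() reports the same statistic
t_builtin <- t.test(x, mu = mu0)$statistic
all.equal(unname(t_builtin), t_manual)
```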
Welch’s \(t\)-test is used when the 2 population variances are assumed to not be equal.
\[ t = \frac{\bar{X} - \bar{Y}}{s_\Delta}\]
where \(s_\Delta\) is given by the following:
\[s_\Delta = \sqrt{\frac{s^2_X}{M} + \frac{s^2_Y}{N}}\]
\(s^2_X\) and \(s^2_Y\) are the unbiased estimators of the variance of each of the 2 samples.
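The same manual-vs-built-in check works for Welch's test; note that t.test() uses the Welch form by default (var.equal = FALSE). The treatment and control data below are simulated for illustration:

```r
set.seed(3)
treatment <- rnorm(12, mean = 10, sd = 2)   # M = 12
control   <- rnorm(15, mean = 9,  sd = 3)   # N = 15

# s_delta combines the two estimated variances
s_delta  <- sqrt(var(treatment) / 12 + var(control) / 15)
t_manual <- (mean(treatment) - mean(control)) / s_delta

# Matches t.test()'s default Welch statistic
all.equal(unname(t.test(treatment, control)$statistic), t_manual)
```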
Sean Kross has an excellent post reviewing pnorm, qnorm, etc. I'm not sure if he's written one specifically for t-distributions, but there is a succinct answer regarding this on Quora.
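As a quick refresher on those functions (these are standard-normal facts, not new data):

```r
pnorm(1.96)           # ~0.975: area under the curve to the left of z = 1.96
qnorm(0.975)          # ~1.96: the inverse of pnorm

# Probability of landing within 2 sd of the mean,
# i.e. the "95%" in the 68-95-99.7 rule
pnorm(2) - pnorm(-2)  # ~0.954
```

pnorm maps a z-value to a cumulative probability; qnorm goes the other way.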
Type I and Type II error
Type I error, also known as the False Positive, is the probability of rejecting the null hypothesis when, in fact, the null hypothesis is true. Typically, a significance level of the test, \(\alpha\), is defined. \(\alpha\) is often set as being \(0.05\). The null hypothesis is then rejected if the p-value is less than \(\alpha\). Therefore, the probability that we mistakenly reject the null hypothesis (when it is in reality true) is equal to \(\alpha\).
\[\text{Type I error} = \alpha\]
Type II error, aka False Negative, is the probability of not rejecting the null hypothesis when, in fact, the null hypothesis is false. Type II error isn't easily computed because it requires knowing the true population mean under the alternative, which we often don't. However, it can be calculated by determining the power of the test.
\[\text{Type II error} = 1 - \text{power}\]
Type I – rejecting the null when we should not.
Type II – failing to reject the null when we should.
Power
Power is the probability of rejecting the null hypothesis when, in fact, the alternative hypothesis is true. Power increases with the sample size, \(N\).
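R's built-in power.t.test() makes the relationship between \(N\), power, and Type II error easy to see (the effect size and sd below are arbitrary choices for illustration):

```r
# Two-sample t-test, assumed effect size delta = 0.5 and sd = 1
pw_small <- power.t.test(n = 20, delta = 0.5, sd = 1)$power
pw_large <- power.t.test(n = 80, delta = 0.5, sd = 1)$power

pw_small < pw_large   # TRUE: power grows with n

# Type II error for each design, via 1 - power
1 - pw_small
1 - pw_large
```

Holding the effect size fixed, the larger sample has much higher power, and correspondingly a much smaller Type II error.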