Normal Distribution & Central Limit Theorem

September 19, 2017
notes review study


Notation:

  • \(\mu\) is the population average
  • \(\sigma\) is the standard deviation
  • \(\sigma^2\) is the variance
  • \(X_1, \cdots, X_M\) or \(Y_1, \cdots, Y_N\) represents each of the observations
  • \(\bar{X}\) or \(\bar{Y}\) is the sample average
  • \(SE\) refers to standard error

Formulaic representations:

\[\mu = \frac{1}{m} \displaystyle\sum_{i=1}^{m} x_i\]

\[\bar{X} = \frac{1}{M} \displaystyle \sum_{i=1}^{M}X_i\]

\[\sigma^2 = \frac{1}{m} \displaystyle\sum_{i=1}^{m} (x_i - \mu)^2\]

\[SE(\bar{X}) = \sigma / \sqrt{M}\]



Normal distribution

This is represented by the following formula:

\[Pr(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx \]

As you can see, the only quantities required to specify this distribution are the mean, \(\mu\), and the standard deviation, \(\sigma\).

The 68-95-99.7 rule:

  • \(1\sigma\): 68%
  • \(2\sigma\): 95%
  • \(2.5\sigma\): 99%
  • \(3\sigma\): 99.7%
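These coverage percentages can be checked numerically from the normal CDF. A quick sketch (in Python with scipy rather than R, which the rest of this post uses):

```python
from scipy.stats import norm

# Probability that a normal variable falls within k standard deviations
# of its mean: Pr(|X - mu| < k*sigma) = Phi(k) - Phi(-k) for a standard normal.
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 2.5, 3)}
for k, p in coverage.items():
    print(f"within {k} sigma: {p:.4f}")
```

Rounding the results recovers the 68-95-99.7 rule (2.5\(\sigma\) gives roughly 99%).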

Standardized Units

If your data follows a normal distribution, you can convert it to “standard units.” This is done using the following formula:

\[Z_i = \frac{X_i - \bar{X}}{s_X}\]

Converting to Z-scores makes the data unitless and tells us exactly how many standard deviations each observation lies from the mean.
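The standardization formula above can be sketched in a few lines (Python here rather than R; the data and parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=170, scale=10, size=500)  # e.g. heights in cm (made-up data)

# Standardize: subtract the sample mean, divide by the sample SD.
z = (x - x.mean()) / x.std(ddof=1)

# Z-scores are unitless, with mean 0 and SD 1 by construction.
print(round(z.mean(), 6), round(z.std(ddof=1), 6))
```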


Central Limit Theorem

\[\bar{X}\sim N\left(\mu_X,\dfrac{\sigma^2}{n}\right)\]

Here, \(\mu_X\) is the population mean and \(\sigma^2\) is the population variance (written \(\sigma_X^2\) below). What's important to notice in the Central Limit Theorem (CLT) is the effect of sample size on the spread of the distribution. Because the sample size, \(n\), is the denominator of \(\frac{\sigma^2}{n}\), as \(n\) gets bigger, the spread gets smaller.

\[\sigma_X^2 = \frac{1}{m} \displaystyle \sum_{i=1}^{m}(x_i - \mu_X)^2\]
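The shrinking spread is easy to see by simulation. A sketch (Python/numpy here; the exponential population and sample sizes are arbitrary choices for illustration) comparing the observed SD of sample means against \(\sigma/\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0  # an exponential with scale 2 has SD 2

# Draw many samples from a decidedly non-normal population (exponential)
# and look at the spread of the sample means as n grows.
spread = {}
for n in (10, 100, 1000):
    means = rng.exponential(scale=sigma, size=(10_000, n)).mean(axis=1)
    spread[n] = means.std()
    print(n, round(spread[n], 3), round(sigma / np.sqrt(n), 3))
```

The observed spread tracks \(\sigma/\sqrt{n}\) closely even though the population itself is skewed, which is the point of the CLT.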

Use qqnorm() to plot observational data against the theoretical normal quantiles, and qqline() to add a reference line for comparison.


T-test

t.test() is a function in R that allows us to calculate the p-value, t-statistic, confidence interval, etc. very simply.

For example:

# Two-sample t-test (Welch's, by default, in R)
t.test(treatment, control)

# To see just the p-value, you can use $
t.test(treatment, control)$p.value

One-sample \(t\)-test is calculated as:

\[t = \frac{\bar{X}-\mu}{s_{\overline{X}}}\]

where

\[s_{\overline{X}} = \frac{s_X}{\sqrt{n}}\]

  • \(s_X\) is the sample standard deviation
  • \(s_{\overline{X}}\) is the estimated standard error of the mean
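Computing the one-sample statistic by hand and checking it against a library routine is a good sanity check. A sketch (Python with scipy standing in for R's t.test; the sample and \(\mu_0\) are made up):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(2)
x = rng.normal(loc=5.3, scale=1.0, size=30)  # made-up sample
mu0 = 5.0  # hypothesized population mean

# t = (xbar - mu0) / (s_X / sqrt(n))
s_xbar = x.std(ddof=1) / np.sqrt(len(x))
t_manual = (x.mean() - mu0) / s_xbar

t_scipy = ttest_1samp(x, mu0).statistic
print(round(t_manual, 4), round(t_scipy, 4))
```

The two values agree to floating-point precision.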

Welch’s \(t\)-test is used when the two population variances are not assumed to be equal.

\[ t = \frac{\bar{X} - \bar{Y}}{s_\Delta}\]

where \(s_\Delta\) is given by the following:

\[s_\Delta = \sqrt{\frac{s^2_X}{M} + \frac{s^2_Y}{N}}\]

\(s^2_i\) is the unbiased estimator of the variance of each of the two samples.
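As with the one-sample case, the Welch statistic can be computed by hand and compared to a library routine. A sketch (Python/scipy rather than R's t.test; group sizes and parameters are made up, with deliberately unequal variances):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
treatment = rng.normal(10.5, 2.0, size=40)  # made-up groups with
control = rng.normal(10.0, 1.0, size=25)    # unequal variances

# s_delta = sqrt(s_X^2 / M + s_Y^2 / N)  -- note the plus sign
M, N = len(treatment), len(control)
s_delta = np.sqrt(treatment.var(ddof=1) / M + control.var(ddof=1) / N)
t_manual = (treatment.mean() - control.mean()) / s_delta

# equal_var=False requests Welch's t-test
t_scipy = ttest_ind(treatment, control, equal_var=False).statistic
print(round(t_manual, 4), round(t_scipy, 4))
```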

Sean Kross has an excellent post reviewing pnorm, qnorm, etc. I’m not sure if he’s written one specifically for t-distributions, but there is a succinct answer regarding this on Quora.


Type I and Type II error

Type I error, also known as the False Positive, is the probability of rejecting the null hypothesis when, in fact, the null hypothesis is true. Typically, a significance level of the test, \(\alpha\), is defined. \(\alpha\) is often set as being \(0.05\). The null hypothesis is then rejected if the p-value is less than \(\alpha\). Therefore, the probability that we mistakenly reject the null hypothesis (when it is in reality true) is equal to \(\alpha\).

\[\text{Type I error} = \alpha\]

Type II error, aka False Negative, is the probability of not rejecting the null hypothesis when, in fact, the null hypothesis is false. Type II error isn’t easily computed because it requires knowledge of the population mean, which we often don’t know. However, it can be calculated by determining the power of the test.

\[\text{Type II error} = 1 - \text{power}\]

Type I – rejecting the null when we should not.
Type II – failing to reject the null when we should.
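The claim that the Type I error rate equals \(\alpha\) can be checked by simulating data where the null is true. A sketch (Python/scipy; sample sizes and trial counts are arbitrary choices):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
alpha = 0.05
trials = 2000

# Both groups come from the SAME population, so the null is true;
# the fraction of (mistaken) rejections estimates the Type I error rate.
rejections = 0
for _ in range(trials):
    a = rng.normal(0, 1, size=30)
    b = rng.normal(0, 1, size=30)
    if ttest_ind(a, b).pvalue < alpha:
        rejections += 1

type1_rate = rejections / trials
print(round(type1_rate, 3))  # close to alpha = 0.05
```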


Power

Power is the probability of rejecting the null hypothesis when, in fact, the alternative hypothesis is true. Power increases with the sample size, \(N\): holding the effect size and \(\alpha\) fixed, a larger \(N\) gives a more powerful test.
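The relationship between power and sample size can also be seen by simulation, this time with the alternative true. A sketch (Python/scipy; the 0.5-SD effect size and sample sizes are made-up choices):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)
alpha = 0.05
effect = 0.5  # true difference in means, in SD units (made up)

# The alternative is true here, so the rejection rate estimates power.
power = {}
for n in (20, 50, 100):
    rejections = sum(
        ttest_ind(rng.normal(effect, 1, n), rng.normal(0, 1, n)).pvalue < alpha
        for _ in range(1000)
    )
    power[n] = rejections / 1000
    print(n, power[n])
```

Power climbs steadily as \(n\) grows, exactly as the discussion above predicts.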
