Website snapshot

2025-11-28 07:40:25 +00:00 · 2020-01-15 21:51:49 -05:00 · 2020-01-15 21:51:49 -05:00 · 50ec3688a5
commit 50ec3688a5
parent ee0ab66d73
281 changed files with 21066 additions and 0 deletions
--- a/content/notes/bayesianstatistics/week1.md
+++ b/content/notes/bayesianstatistics/week1.md
@ -0,0 +1,413 @@
+# Bayesian Statistics
+
+## Rules of Probability
+
+Probabilities must be between zero and one, i.e., $0 \le P(A) \le 1$ for any event A.
+
+Probabilities add to one, i.e., $\sum{P(X_i)} = 1$ when summed over all possible outcomes $X_i$
+
+The complement of an event, $A^c$, denotes that the event did not happen. Since probabilities must add to one, $P(A^c) = 1 - P(A)$
+
+If A and B are two events, the probability that A or B happens (this is an inclusive or) is the probability of the union of the events:
+$$
+P(A \cup B) = P(A) + P(B) - P(A\cap B)
+$$
+where $\cup$ represents union ("or") and $\cap$ represents intersection ("and"). If a set of events $A_i$ are mutually exclusive (only one event may happen), then
+$$
+P(\cup_{i=1}^n{A_i}) = \sum_{i=1}^n{P(A_i)}
+$$
+
+## Odds
+
+The odds for event A, denoted $\mathcal{O}(A)$, are defined as $\mathcal{O}(A) = P(A)/P(A^c)$.
+
+This is the probability for the event divided by the probability against it.
+
+Starting from the odds, we can solve for the probability:
+$$
+\frac{P(A)}{P(A^c)} = \mathcal{O}(A)
+$$
+
+$$
+\frac{P(A)}{1-P(A)} = \mathcal{O}(A)
+$$
+
+$$
+\frac{1 -P(A)}{P(A)} = \frac{1}{\mathcal{O}(A)}
+$$
+
+$$
+\frac{1}{P(A)} - 1 = \frac{1}{\mathcal{O}(A)}
+$$
+
+$$
+\frac{1}{P(A)} = \frac{1}{\mathcal{O}(A)} + 1
+$$
+
+$$
+\frac{1}{P(A)} = \frac{1 + \mathcal{O}(A)}{\mathcal{O}(A)}
+$$
+
+$$
+P(A) = \frac{\mathcal{O}(A)}{1 + \mathcal{O}(A)}
+$$
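+
+As a quick worked check of the last formula (the 3-to-1 figure is an illustrative assumption, not from the lecture), odds of 3 to 1 in favor of $A$ give
+$$
+\mathcal{O}(A) = 3 \implies P(A) = \frac{3}{1 + 3} = \frac{3}{4}
+$$
+and converting back, $P(A)/P(A^c) = \frac{3/4}{1/4} = 3$ recovers the original odds.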
+
+## Expectation
+
+The expected value of a random variable X is a weighted average of values X can take, with weights given by the probabilities of those values.
+$$
+E(X) = \sum_{i=1}^n{x_i \cdot P(X=x_i)}
+$$
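+
+For instance, taking a fair six-sided die as an illustrative assumption (it is not part of the lecture), each face has probability $1/6$, so
+$$
+E(X) = \sum_{i=1}^6{i \cdot \frac{1}{6}} = \frac{21}{6} = 3.5
+$$
+Note that the expected value need not be a value that $X$ can actually take.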
+
+## Frameworks of probability
+
+Classical -- Outcomes that are equally likely have equal probabilities
+
+Frequentist -- Probability is the relative frequency of an outcome in a hypothetical infinite sequence of events
+
+Bayesian -- Personal perspective (your own measure of uncertainty)
+
+In betting, one must make sure that all the rules of probability are followed, i.e., that the probabilities assigned to the events are "coherent". Otherwise, someone could construct a series of bets on which you're guaranteed to lose money. This is referred to as a Dutch book.
+
+## Conditional probability
+
+$$
+P(A|B) = \frac{P(A\cap B)}{P(B)}
+$$
+
+Where $A|B$ denotes "A given B"
+
+Example from lecture:
+
+Suppose there are 30 students, 9 of whom are female. Of the 30 students, 12 are computer science majors, and 4 of those 12 computer science majors are female.
+$$
+P(Female) = \frac{9}{30} = \frac{3}{10}
+$$
+
+$$
+P(CS) = \frac{12}{30} = \frac{2}{5}
+$$
+
+$$
+P(F\cap CS) = \frac{4}{30} = \frac{2}{15}
+$$
+
+$$
+P(F|CS) = \frac{P(F \cap CS)}{P(CS)} = \frac{2/15}{2/5} = \frac{1}{3}
+$$
+
+An intuitive way to think about a conditional probability is that we're looking at a subsegment of the original population and asking a probability question within that segment. For example, among the 18 students who are not computer science majors, 5 are female:
+$$
+P(F|CS^c) = \frac{P(F\cap CS^c)}{P(CS^c)} = \frac{5/30}{18/30} = \frac{5}{18}
+$$
+Two events are independent when the occurrence of one does not change the probability of the other.
+$$
+P(A|B) = P(A)
+$$
+It doesn't matter that B occurred.
+
+If two events are independent then the following is true
+$$
+P(A\cap B) = P(A)P(B)
+$$
+This can be derived from the conditional probability equation.
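+
+A short sketch of that derivation: the definition of conditional probability gives $P(A\cap B) = P(A|B)P(B)$ whenever $P(B) > 0$, and substituting the independence condition $P(A|B) = P(A)$ yields
+$$
+P(A\cap B) = P(A|B)P(B) = P(A)P(B)
+$$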
+
+## Conditional Probabilities in terms of other conditional
+
+Suppose we don't know what $P(A|B)$ is but we do know what $P(B|A)$ is. We can then rewrite $P(A|B)$ in terms of $P(B|A)$ using Bayes' theorem:
+$$
+P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)}
+$$
+Let's look at an example of an early test for HIV antibodies known as the ELISA test.
+$$
+P(+ | HIV) = 0.977
+$$
+
+$$
+P(- | NO\_HIV) = 0.926
+$$
+
+As you can see, the test gives the correct result over 90% of the time, whether or not the person has HIV.
+
+The probability of someone in North America having this disease was $P(HIV) = .0026$
+
+Now let's consider the following problem: what is the probability that someone has the disease given that they tested positive, $P(HIV | +)$?
+$$
+P(HIV|+) = \frac{P(+|HIV)P(HIV)}{P(+|HIV)P(HIV) + P(+|NO\_HIV){P(NO\_HIV)}}
+$$
+
+$$
+P(HIV|+) = \frac{(.977)(.0026)}{(.977)(.0026) + (1-.926)(1-.0026)}
+$$
+
+$$
+P(HIV|+) = 0.033
+$$
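+
+The arithmetic above can be checked with a short R sketch. It assumes the false positive rate is $P(+|NO\_HIV) = 1 - 0.926$, the complement of the stated $P(-|NO\_HIV)$. The helper name `bayes_posterior` and the variable names are hypothetical and not from the lecture.
+
+```R
+# Inputs taken from the notes above
+sensitivity <- 0.977   # P(+ | HIV)
+specificity <- 0.926   # P(- | NO_HIV)
+prevalence  <- 0.0026  # P(HIV)
+
+# Bayes' theorem for two events (bayes_posterior is a hypothetical helper)
+bayes_posterior <- function(sens, spec, prior) {
+  true_pos  <- sens * prior              # P(+ | HIV) P(HIV)
+  false_pos <- (1 - spec) * (1 - prior)  # P(+ | NO_HIV) P(NO_HIV)
+  true_pos / (true_pos + false_pos)
+}
+
+p_hiv_given_pos <- bayes_posterior(sensitivity, specificity, prevalence)
+round(p_hiv_given_pos, 3)
+# [1] 0.033
+```
+
+Even with a fairly accurate test, the posterior probability stays small because the disease is rare, so false positives outnumber true positives.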
+
+This example looked at Bayes' theorem for the two-event case. We can generalize it to $n$ events $A_1, \dots, A_n$ that are mutually exclusive and together cover all possibilities through the following formula
+$$
+P(A_1|B) = \frac{P(B|A_1)P(A_1)}{\sum_{i=1}^{n}{P(B|A_i)}P(A_i)}
+$$
+
+
+
+## Bernoulli Distribution
+
+~ means 'is distributed as'
+
+We'll first study the Bernoulli distribution. It applies when an event has two outcomes, commonly referred to as a success and a failure. The probability of success is $p$, which means the probability of failure is $(1-p)$.
+$$
+X \sim B(p)
+$$
+
+$$
+P(X = 1) = p
+$$
+
+$$
+P(X = 0) = 1-p
+$$
+
+The probability of a random variable $X$ taking some value $x$ given $p$ is
+$$
+f(X = x | p) = f(x|p) = p^x(1-p)^{1 - x}I_{\{x \in \{0, 1\}\}}
+$$
+where $I$ is the indicator function, which equals 1 when its condition holds and 0 otherwise.
+
+Recall the expected value
+$$
+E(X) = \sum_{x_i}{x_iP(X=x_i)} = (1)p + (0)(1-p) = p
+$$
+We can also define the variance of Bernoulli
+$$
+Var(X) = p(1-p)
+$$
+
+## Binomial Distribution
+
+The binomial distribution describes the sum of $n$ *independent* Bernoulli trials that share the same success probability $p$
+$$
+X \sim Bin(n, p)
+$$
+
+$$
+P(X=x|p) = f(x|p) =  {n \choose x} p^x (1-p)^{n-x}
+$$
+
+$n\choose x$ is the combinatoric term which is defined as
+$$
+{n \choose x} = \frac{n!}{x! (n - x)!}
+$$
+
+$$
+E(X) = np
+$$
+
+$$
+Var(X) = np(1-p)
+$$
+
+## Uniform distribution
+
+Let's say X is uniformly distributed
+$$
+X \sim U[0,1]
+$$
+
+$$
+f(x) = \left\{
+     \begin{array}{lr}
+       1 & : x \in  [0,1]\\
+       0 & : otherwise
+     \end{array}
+   \right.
+$$
+
+$$
+P(0 < x < \frac{1}{2}) = \int_0^\frac{1}{2}{f(x)dx} = \int_0^\frac{1}{2}{dx} = \frac{1}{2}
+$$
+
+$$
+P(0 \leq x \leq \frac{1}{2}) = \int_0^\frac{1}{2}{f(x)dx} = \int_0^\frac{1}{2}{dx} = \frac{1}{2}
+$$
+
+$$
+P(x = \frac{1}{2}) = 0
+$$
+
+## Rules of probability density functions
+
+$$
+\int_{-\infty}^\infty{f(x)dx} = 1
+$$
+
+$$
+f(x) \ge 0
+$$
+
+$$
+E(X) = \int_{-\infty}^\infty{xf(x)dx}
+$$
+
+$$
+E(g(X))  = \int{g(x)f(x)dx}
+$$
+
+$$
+E(aX) = aE(X)
+$$
+
+$$
+E(X + Y) = E(X) + E(Y)
+$$
+
+If X & Y are independent
+$$
+E(XY) = E(X)E(Y)
+$$
+
+## Exponential Distribution
+
+$$
+X \sim Exp(\lambda)
+$$
+
+where $\lambda$ is the rate at which events occur, so that $1/\lambda$ is the average waiting time between observations. For $x \ge 0$ the density is
+$$
+f(x|\lambda) = \lambda e^{-\lambda x}
+$$
+
+$$
+E(X) = \frac{1}{\lambda}
+$$
+
+$$
+Var(X) = \frac{1}{\lambda^2}
+$$
+
+## Uniform (Continuous) Distribution
+
+$$
+X \sim U[\theta_1, \theta_2]
+$$
+
+$$
+f(x|\theta_1,\theta_2) = \frac{1}{\theta_2 - \theta_1}I_{\theta_1 \le x \le \theta_2}
+$$
+
+## Normal Distribution
+
+$$
+X \sim N(\mu, \sigma^2)
+$$
+
+$$
+f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}
+$$
+
+$$
+E(X) = \mu
+$$
+
+$$
+Var(X) = \sigma^2
+$$
+
+## Variance
+
+Variance is the expected squared distance from the mean
+$$
+Var(X) = \int_{-\infty}^\infty {(x - \mu)^2f(x)dx}
+$$
+
+## Geometric Distribution (Discrete)
+
+The geometric distribution is the distribution of the number of trials needed to get the first success, i.e., the number of Bernoulli events until a success is observed.
+$$
+X \sim Geo(p)
+$$
+
+$$
+P(X = x|p) = p(1-p)^{x-1}
+$$
+
+$$
+E(X) = \frac{1}{p}
+$$
+
+## Multinomial Distribution (Discrete)
+
+Multinomial is like a binomial when there are more than two possible outcomes.
+
+ 
+$$
+f(x_1,...,x_k|p_1,...,p_k) = \frac{n!}{x_1! ... x_k!}p_1^{x_1}...p_k^{x_k}
+$$
+
+## Poisson Distribution (Discrete)
+
+The Poisson distribution is used for counts. The parameter $\lambda > 0$ is the rate at which we expect to observe the thing we are counting.
+$$
+X \sim Pois(\lambda)
+$$
+
+$$
+P(X=x|\lambda) = \frac{\lambda^xe^{-\lambda}}{x!}
+$$
+
+$$
+E(X) = \lambda
+$$
+
+$$
+Var(X) = \lambda
+$$
+
+## Gamma Distribution (Continuous)
+
+If $X_1, X_2, ..., X_n$ are independent and identically distributed exponentials with rate $\lambda$, each representing the waiting time between successive events, then the total waiting time for all $n$ events to occur will follow a gamma distribution with shape parameter $\alpha = n$ and rate parameter $\beta = \lambda$
+$$
+Y \sim Gamma(\alpha, \beta)
+$$
+
+$$
+f(y|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}y^{\alpha-1}e^{-\beta y}I_{\{y\ge0\}}
+$$
+
+$$
+E(Y) = \frac{\alpha}{\beta}
+$$
+
+$$
+Var(Y) = \frac{\alpha}{\beta^2}
+$$
+
+Where $\Gamma(x)$ is the gamma function. The exponential distribution is a special case of the gamma distribution with $\alpha = 1$. As $\alpha$ increases, the gamma distribution more closely resembles the normal distribution.
+
+## Beta Distribution (Continuous)
+
+The beta distribution is used for random variables which take on values between 0 and 1. For this reason, the beta distribution is commonly used to model probabilities.
+$$
+X \sim Beta(\alpha, \beta)
+$$
+
+$$
+f(x|\alpha,\beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha - 1}(1 - x)^{\beta - 1}I_{\{0 < x < 1\}}
+$$
+
+$$
+E(X) = \frac{\alpha}{\alpha + \beta}
+$$
+
+$$
+Var(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha+\beta+1)}
+$$
+
+The standard uniform distribution is a special case of the beta distribution with $\alpha = \beta = 1$
+
+## Bayes Theorem for continuous distribution
+
+$$
+f(\theta|y) = \frac{f(y|\theta)f(\theta)}{\int{f(y|\theta)f(\theta)d\theta}}
+$$
+
--- a/content/notes/bayesianstatistics/week2.md
+++ b/content/notes/bayesianstatistics/week2.md
@ -0,0 +1,537 @@
+Under the frequentist paradigm, you view the data as a random sample from some larger, potentially hypothetical population. We can then make probability statements, i.e., long-run frequency statements, based on this larger population.
+
+## Coin Flip Example (Central Limit Theorem)
+
+Let's suppose we flip a coin 100 times and we get 44 heads and 56 tails.
+$$
+n = 100
+$$
+We can view these 100 flips as a random sample from a much larger infinite hypothetical population of flips from this coin.
+
+Let's say that each flip follows a Bernoulli distribution with some probability $p$. In this case $p$ is unknown, but we're assuming it's fixed because we have a particular physical coin.
+
+We can ask what's our best estimate of the probability of getting a head, or an estimate of $p$. We can also ask about how confident we are about that estimate.
+
+Let's start by applying the Central Limit Theorem. The Central Limit Theorem states that the sum of 100 flips will follow approximately a Normal distribution with mean 100p and variance 100p(1-p)
+$$
+\sum^n_{i=1}{x_i} \sim N(np, np(1-p))
+$$
+
+$$
+\sum_{i = 1}^{100}{x_i} \sim N(100p, 100p(1-p))
+$$
+
+By the properties of a Normal distribution, 95% of the time we'll get a result within 1.96 standard deviations of the mean. Our estimate is $100\hat{p}$ and our error is 1.96 times the standard deviation.
+$$
+n\hat{p} \pm 1.96\sqrt{n\hat{p}(1-\hat{p})}
+$$
+
+$$
+100\hat{p} \pm 1.96\sqrt{100\hat{p}(1-\hat{p})}
+$$
+
+This is referred to as a confidence interval, commonly abbreviated as CI. In our example $\hat{p} = \frac{44}{n} = \frac{44}{100}$. Therefore, the 95% confidence interval for the expected number of heads in 100 flips is:
+$$
+100(.44) \pm 1.96\sqrt{100(.44)(1-.44)}
+$$
+
+$$
+44 \pm 1.96\sqrt{44(.56)}
+$$
+
+$$
+44\pm 1.96\sqrt{24.64}
+$$
+
+$$
+(34.27, 53.73)
+$$
+
+We can divide this by 100 to get the 95% Confidence Interval for $p$
+$$
+(0.34, 0.54)
+$$
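+
+The interval can be reproduced with a few lines of R. This is a minimal sketch of the normal-approximation calculation above; the variable names are hypothetical, and `qnorm(0.975)` is used in place of the rounded value 1.96.
+
+```R
+n <- 100       # number of flips
+heads <- 44    # observed number of heads
+p_hat <- heads / n
+
+# Approximate 95% interval for the number of heads (normal approximation)
+z <- qnorm(0.975)  # about 1.96
+half_width <- z * sqrt(n * p_hat * (1 - p_hat))
+ci_count <- c(heads - half_width, heads + half_width)
+
+# Divide by n to get the interval for p
+ci_p <- ci_count / n
+round(ci_count, 2)
+# [1] 34.27 53.73
+round(ci_p, 2)
+# [1] 0.34 0.54
+```
+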
+Let's step back and ask, what does it mean when I say we're 95% confident?
+
+Under the frequentist paradigm, we have to think back to our infinite hypothetical sequence of events. Suppose we were to repeat this trial an infinite number of times, or an arbitrarily large number of times, and each time we create a confidence interval based on the data we observe. Then on average 95% of the intervals we make will contain the true value of $p$.
+
+On the other hand, we might want to know something about this particular interval. Does this interval contain the true $p$? What's the probability that this interval contains the true $p$? Well, we don't know for this particular interval. But under the frequentist paradigm, we're assuming that there is a fixed answer for $p$. Either $p$ is in that interval or it's not in that interval. The probability that $p$ is in that interval is either 0 or 1.
+
+## Example: Heart Attack Patients (Maximum Likelihood)
+
+Consider a hospital where 400 patients are admitted over a month for heart attacks, and a month later 72 of them have died and 328 of them have survived.
+
+We can ask, what's our estimate of the mortality rate?
+
+Under the frequentist paradigm, we must first establish our reference population. What do we think our reference population is here? One possibility is we could think about heart attack patients in the region.
+
+Another reference population we can think about is heart attack patients that are admitted to this hospital, but over a longer period of time.
+
+Both of these might be reasonable attempts, but in this case our actual data are not a random sample from either of those populations. We could sort of pretend they are and move on, or we could also try to think harder about what a random sample situation might be. We can think about all the people in the region who might possibly have a heart attack and might possibly get admitted to this hospital.
+
+It's a bit of an odd hypothetical situation, and so there are some philosophical issues with the setup of this whole problem under the frequentist paradigm. In any case, let's forge forward and think about how we might do some estimation.
+
+Moving on, we can say each patient comes from a Bernoulli distribution with an unknown parameter $\theta$. 
+$$
+Y_i \sim B(\theta)
+$$
+
+$$
+P(Y_i = 1) = \theta
+$$
+
+In this case, let's call the "success" a mortality. 
+
+We can write the probability mass function for the entire set of data in vector form: the probability that all the $Y$'s take some values $y$, given a value of $\theta$.
+$$
+P(Y = y | \theta) = P(Y_1 = y_1, Y_2 = y_2,\dots, Y_n=y_n|\theta)
+$$
+*Since we're viewing these as independent events*, then the probability of each of these individual ones we can write in product notation.
+$$
+P(Y = y | \theta) = P(Y_1 = y_1|\theta)\dots P(Y_n = y_n | \theta)
+$$
+
+$$
+P(Y = y | \theta) = \prod_{i = 1}^n{P(Y_i =y_i | \theta)} = \prod_{i = 1}^n{(\theta^{y_i}(1-\theta)^{1-y_i})}
+$$
+
+This is the probability of observing the actual data that we collected, conditioned on a value of the parameter $\theta$. We can now think about this expression as a function of $\theta$. This is the concept of a likelihood.
+$$
+L(\theta|y) = \prod_{i = 1}^n{(\theta^{y_i}(1-\theta)^{1-y_i})}
+$$
+It looks like the same function, but above it was a function of $y$ given $\theta$, and now we're thinking of it as a function of $\theta$ given $y$.
+
+This is not a probability distribution anymore, but it is still a function of $\theta$.
+
+One way to estimate $\theta$ is to choose the $\theta$ that gives us the largest value of the likelihood, i.e., the value that makes the data we observed most likely to occur.
+
+This is referred to as the maximum likelihood estimate (MLE).
+
+We're trying to find the $\theta$ that maximizes the likelihood.
+
+In practice, it's usually easier to maximize the natural logarithm of the likelihood, commonly referred to as the log likelihood.
+$$
+\mathcal{L}(\theta) = \log{L(\theta|y)}
+$$
+Since the logarithm is a monotonically increasing function, if we maximize the logarithm of the function, we also maximize the original function.
+$$
+\mathcal{L}(\theta) = \log{\prod_{i = 1}^n{(\theta^{y_i}(1-\theta)^{1-y_i})}}
+$$
+
+$$
+\mathcal{L}(\theta) = \sum_{i = 1}^n{\log{(\theta^{y_i}(1-\theta)^{1-y_i})}}
+$$
+
+$$
+\mathcal{L}(\theta) = \sum_{i = 1}^n{(\log{(\theta^{y_i})} + \log{((1-\theta)^{1-y_i})})}
+$$
+
+$$
+\mathcal{L}(\theta) = \sum_{i = 1}^n{(y_i\log{\theta} + (1 - y_i)\log{(1-\theta)})}
+$$
+
+$$
+\mathcal{L}(\theta) = \log{\theta}\sum_{i  = 1}^n{y_i} + \log{(1-\theta)}\sum_{i = 1}^n{(1-y_i)}
+$$
+
+How do we find the theta that maximizes this function? Recall from calculus that we can maximize a function by taking the derivative and setting it equal to 0.
+$$
+\mathcal{L}^\prime(\theta) = \frac{1}{\theta}\sum{y_i} - \frac{1}{1-\theta}\sum{(1 - y_i)}
+$$
+Now we need to set it equal to zero and solve for $\hat{\theta}$
+$$
+\frac{\sum{y_i}}{\hat{\theta}} = \frac{\sum{(1-y_i)}}{1-\hat{\theta}}
+$$
+
+$$
+\hat{\theta}\sum{(1-y_i)} = (1-\hat{\theta})\sum{y_i}
+$$
+
+$$
+\hat{\theta}\sum{(1-y_i)} + \hat{\theta}\sum{y_i} = \sum{y_i}
+$$
+
+$$
+\hat{\theta}(\sum^n{(1 - y_i + y_i)}) = \sum{y_i}
+$$
+
+$$
+\hat{\theta} = \frac{1}{n}\sum{y_i} = \hat{p} = \frac{72}{400} = 0.18
+$$
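+
+As a sanity check on the closed-form result, we can maximize the log likelihood numerically. This is a sketch using base R's `optimize`; the function name `bernoulli_loglike` and the search interval are assumptions made for illustration.
+
+```R
+# Bernoulli log likelihood for y deaths out of n patients
+# (bernoulli_loglike is a hypothetical helper name)
+bernoulli_loglike <- function(theta, n, y) {
+  y * log(theta) + (n - y) * log(1 - theta)
+}
+
+# Numerically maximize over theta in (0, 1); n and y are passed through
+fit <- optimize(bernoulli_loglike, interval = c(0.001, 0.999),
+                n = 400, y = 72, maximum = TRUE)
+theta_hat <- fit$maximum
+
+# Closed-form MLE from the derivation above
+theta_closed <- 72 / 400
+c(numerical = theta_hat, closed_form = theta_closed)
+```
+
+The two values should agree up to the tolerance of the optimizer, which is a useful habit when the derivative cannot be solved by hand.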
+
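+Maximum likelihood estimators have many desirable mathematical properties. Under regularity conditions they're consistent and asymptotically unbiased (though not always unbiased in finite samples), and they're invariant under reparameterization.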
+
+In general, under certain regularity conditions, we can say that the MLE is approximately normally distributed with mean at the true value of $\theta$ and variance $\frac{1}{I(\hat{\theta})}$, where $I(\hat{\theta})$ is the Fisher information of the sample evaluated at $\hat{\theta}$. The Fisher information is a measure of how much information about $\theta$ the data carry; for $n$ independent and identically distributed data points it is $n$ times the information in a single data point.
+$$
+\hat{\theta} \sim N(\theta, \frac{1}{I(\hat{\theta})})
+$$
+For a single Bernoulli observation, the Fisher information turns out to be
+$$
+I(\theta) = \frac{1}{\theta(1-\theta)}
+$$
+So the information is larger when $\theta$ is near zero or near one, and it's smallest when $\theta$ is near one half.
+
+This makes sense, because if you're flipping a coin, and you're getting a mix of heads and tails, that tells you a little bit less than if you're getting nearly all heads or nearly all tails.
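+
+A sketch of where the Bernoulli expression comes from, using the standard definition of Fisher information as the negative expected second derivative of the log density (that definition is not stated in these notes). For one observation $\log f(y|\theta) = y\log{\theta} + (1-y)\log{(1-\theta)}$, and since $E(Y) = \theta$,
+$$
+I(\theta) = -E\left(\frac{\partial^2}{\partial\theta^2}\log f(Y|\theta)\right) = E\left(\frac{Y}{\theta^2} + \frac{1-Y}{(1-\theta)^2}\right) = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}
+$$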
+
+## Exponential Likelihood Example
+
+Let's say the $X_i$ are exponentially distributed
+$$
+X_i \sim Exp(\lambda)
+$$
+Let's say the data are independent and identically distributed, which makes the joint density function
+$$
+f(x|\lambda) = \prod_{i = 1}^n{\lambda e^{-\lambda x_i}}
+$$
+
+$$
+f(x|\lambda) = \lambda^ne^{-\lambda \sum{x_i}}
+$$
+
+Now the likelihood function is
+$$
+L(\lambda|x) = \lambda^n e^{-\lambda \sum{x_i}}
+$$
+
+$$
+\mathcal{L}(\lambda) = n\ln{\lambda} - \lambda\sum{x_i}
+$$
+
+Taking the derivative
+$$
+\mathcal{L}^\prime(\lambda) = \frac{n}{\lambda} - \sum{x_i}
+$$
+Setting this equal to zero
+$$
+\frac{n}{\hat{\lambda}} =\sum{x_i}
+$$
+
+$$
+\hat{\lambda} = \frac{n}{\sum{x_i}} = \frac{1}{\bar{x}}
+$$
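+
+To see the result in action, here is a small R sketch that compares the closed-form estimate with a numerical maximization of the log likelihood. The data vector `x` is made up for illustration, and `exp_loglike` is a hypothetical helper name.
+
+```R
+# Hypothetical waiting times (made up for illustration)
+x <- c(0.5, 1.2, 0.3, 2.0, 1.0)
+
+# Closed-form MLE: n / sum(x) = 1 / mean(x)
+lambda_hat <- 1 / mean(x)
+
+# Log likelihood n * log(lambda) - lambda * sum(x), maximized numerically
+exp_loglike <- function(lambda) length(x) * log(lambda) - lambda * sum(x)
+fit <- optimize(exp_loglike, interval = c(0.01, 10), maximum = TRUE)
+
+c(closed_form = lambda_hat, numerical = fit$maximum)
+```
+
+Here the sample mean is 1, so both approaches should give an estimate close to $\hat{\lambda} = 1$.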
+
+## Uniform Distribution
+
+$$
+X_i \sim U[0, \theta]
+$$
+
+$$
+f(x|\theta) = \prod_{i = 1}^n{\frac{1}{\theta}I_{0 \le x_i \le \theta}}
+$$
+
+Combining all the indicator functions: for the product to be 1, each indicator has to be 1. This happens when all the observations are at least 0, i.e., the minimum of the $x_i$ is greater than or equal to 0, and the maximum of the $x_i$ is less than or equal to $\theta$.
+$$
+L(\theta|x) = \theta^{-n} I_{0\le min(x_i) \le max(x_i) \le \theta}
+$$
+
+$$
+L^\prime(\theta) = -n\theta^{-(n + 1)}I_{0 \le min(x_i) \le max(x_i)\le \theta}
+$$
+
+So now we can ask, can we set this equal to zero and solve for $\theta$? It turns out this is not equal to zero for any positive value of $\theta$, and we need $\theta$ to be strictly larger than zero.
+
+However, we can also note that for positive $\theta$ the derivative is always negative, which says the likelihood is a decreasing function of $\theta$. So this function will be maximized when we pick $\theta$ as small as possible. What's the smallest possible value of $\theta$ we can pick? We need $\theta$ to be at least as large as all of the $x_i$, since otherwise the indicator, and hence the likelihood, is zero. And so, the maximum likelihood estimate is the maximum of the $x_i$
+$$
+\hat{\theta} = max(x_i)
+$$
+
+## Products of Indicator Functions
+
+Because 0 * 1 = 0, the product of indicator functions can be combined into a single indicator function with a modified condition.
+
+**Example**: $I_{x < 5} \cdot I_{x \ge 0} = I_{0 \le x < 5}$
+
+**Example**: $\prod_{i = 1}^n{I_{x_i < 2}} = I_{x_i < 2 \text{ for all } i} = I_{max(x_i)  < 2}$
+
+## Introduction to R
+
+R has some nice functions that one can use for analysis
+
+`mean(z)` gives the mean of a numeric vector $z$
+
+`var(z)` reports the sample variance of a numeric vector
+
+`sqrt(var(z))` gives the sample standard deviation of a numeric vector (equivalent to `sd(z)`)
+
+`seq(from=0.1, to = 0.9, by = 0.1)` creates a vector that starts from $0.1$ and goes to $0.9$ incrementing by $0.1$
+
+```R
+seq(from = 0.1, to = 0.9, by = 0.1)
+# [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
+seq(1, 10)
+# [1]  1  2  3  4  5  6  7  8  9 10
+```
+
+`names(x)` gives the names of all the columns in a data frame `x`.
+
+```R
+names(trees)  # trees is an example dataset that ships with R
+# [1] "Girth"  "Height" "Volume"
+```
+
+`hist(x)` provides a histogram based on a vector
+
+The more general `plot` function tries to guess at which type of plot to make. Feeding it two numerical vectors will make a scatter plot.
+
+The R function `pairs` takes in a data frame and tries to make all possible pairwise scatterplots for the dataset.
+
+The `summary` command gives the five/six number summary (minimum, first quartile, median, mean,  third quartile, maximum)
+
+## Plotting the likelihood function in R
+
+Going back to the hospital example
+
+```R
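+## Likelihood of theta for y successes (deaths) out of n Bernoulli trials
+likelihood = function(n, y, theta) {
+  return(theta^y * (1 - theta)^(n - y))
+}
+## Grid of candidate theta values, avoiding the endpoints 0 and 1
+theta = seq(from = 0.01, to = 0.99, by = 0.01)
+plot(theta, likelihood(400, 72, theta))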
+```
+
+You can also do this with log likelihoods, which are typically more numerically stable to compute.
+
+```R
+loglike = function(n, y, theta) {
+  return(y * log(theta) + (n - y) * log(1 - theta))
+}
+plot(theta, loglike(400, 72, theta))
+```
+
+Having these plotted as points makes the curve difficult to see, so let's plot it as a line.
+
+```R
+plot(theta, loglike(400, 72, theta), type = "l")
+```
+
+
+
+## Cumulative Distribution Function
+
+The cumulative distribution function (CDF) exists for every distribution. We define it as $F(x) = P(X \le x)$ for random variable $X$. 
+
+If $X$ is discrete-valued, then the CDF is computed with summation $F(x) = \sum_{t = -\infty}^x {f(t)}$, where $f(t) = P(X = t)$ is the probability mass function (PMF) which we've already seen.
+
+If $X$ is continuous, the CDF is computed with an integral $F(x) = \int_{-\infty}^x{f(t)dt}$
+
+The CDF is convenient for calculating probabilities of intervals. Let $a$ and $b$ be any real numbers with $a < b$. Then the probability that $X$ falls between $a$ and $b$ is equal to $P(a < X \le b) = P(X \le b) - P(X \le a) = F(b) - F(a)$. For a continuous $X$, including or excluding the endpoints makes no difference.
+
+
+
+## Quantile Function
+
+The CDF takes a value for a random variable and returns a probability. Suppose instead we start with a number between $0$ and $1$, which we call $p$, and we wish to find a value $x$ so that $P(X \le x) = p$. The value $x$ which satisfies this equation is called the $p$ quantile (or $100p$ percentile) of the distribution of $X$.
+
+
+
+## Probability Distributions in R
+
+Each of the distributions introduced in Lesson 3 has convenient functions in R which allow you to evaluate the PDF/PMF, CDF, and quantile functions, as well as generate random samples from the distribution. To illustrate, Table 1 lists these functions for the normal distribution.
+
+| Function             | What it does                             |
+| -------------------- | ---------------------------------------- |
+| `dnorm(x, mean, sd)` | Evaluate the PDF at $x$ (mean = $\mu$ and sd = $\sqrt{\sigma^2}$) |
+| `pnorm(q, mean, sd)` | Evaluate the CDF at $q$                  |
+| `qnorm(p, mean, sd)` | Evaluate the quantile function at $p$    |
+| `rnorm(n, mean, sd)` | Generate $n$ pseudo-random samples from the normal distribution |
+
+These four functions exist for each distribution where `d...` function evaluates the density/mass, `p...` evaluates the CDF, `q...` evaluates the quantile, and `r...` generates a sample. Table 2 lists the `d...` functions for some of the most popular distributions. The `d` can be replaced with `p`, `q`, or `r` for any of the distributions, depending on what you want to calculate.
+
+For details, enter `?dnorm` to view R's documentation page for the normal distribution. As usual, replace the `norm` with any distribution to read the documentation for that distribution.
+
+| Distribution           | Function                   | Parameters                           |
+| ---------------------- | -------------------------- | ------------------------------------ |
+| $Binomial(n,p)$        | `dbinom(x, size, prob)`    | size = $n$, prob = $p$               |
+| $Poisson(\lambda)$     | `dpois(x, lambda)`         | lambda = $\lambda$                   |
+| $Exp(\lambda)$         | `dexp(x, rate)`            | rate = $\lambda$                     |
+| $Gamma(\alpha, \beta)$ | `dgamma(x, shape, rate)`   | shape = $\alpha$, rate = $\beta$     |
+| $Uniform(a, b)$        | `dunif(x, min, max)`       | min = $a$, max = $b$                 |
+| $Beta(\alpha, \beta)$  | `dbeta(x, shape1, shape2)` | shape1 = $\alpha$, shape2 = $\beta$  |
+| $N(\mu, \sigma^2)$     | `dnorm(x, mean, sd)`       | mean = $\mu$, sd = $\sqrt{\sigma^2}$ |
+| $t_v$                  | `dt(x, df)`                | df = $v$                             |
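+
+The four function families can be tried out directly. The following is a minimal sketch for the standard normal distribution; the variable names and the seed are arbitrary choices made for illustration.
+
+```R
+# Standard normal: mean = 0, sd = 1
+dens  <- dnorm(0, mean = 0, sd = 1)       # PDF at 0, equal to 1 / sqrt(2 * pi)
+prob  <- pnorm(1.96, mean = 0, sd = 1)    # CDF at 1.96, about 0.975
+quant <- qnorm(0.975, mean = 0, sd = 1)   # 0.975 quantile, about 1.96
+
+# Interval probability via the CDF: F(1.96) - F(-1.96), about 0.95
+interval_prob <- pnorm(1.96) - pnorm(-1.96)
+
+# The quantile function inverts the CDF
+roundtrip <- qnorm(pnorm(0.5))
+
+set.seed(1)                               # arbitrary seed for reproducibility
+draws <- rnorm(5, mean = 0, sd = 1)       # five pseudo-random samples
+```
+
+Swapping `norm` for another suffix such as `binom` or `beta` (with that distribution's parameters) gives the same four operations for that distribution.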
+
+## Two Coin Example
+
+Suppose your brother has a coin which you know to be loaded so that it comes up heads 70% of the time. He then comes to you with some coin, you're not sure which one, and he wants to make a bet with you, betting money that it's going to come up heads.
+
+You're not sure if it's the loaded coin or if it's just a fair one. So he gives you a chance to flip it 5 times to check it out.
+
+You flip it five times and get 2 heads and 3 tails. Which coin do you think it is and how sure are you about that?
+
+We'll start by defining the unknown parameter $\theta$: either the coin is fair or it's loaded.
+$$
+\theta \in \{fair ,loaded\}
+$$
+
+$$
+X \sim Bin(5, ?)
+$$
+
+$$
+f(x|\theta) = \begin{cases} 
+      {5 \choose x}(\frac{1}{2})^5            & \theta = fair \\
+      {5 \choose x} (.7)^x (.3)^{5 - x}       & \theta = loaded\\
+   \end{cases}
+$$
+
+We can also rewrite $f(x|\theta)$ with indicator functions
+$$
+f(x|\theta) = {5\choose x}(.5)^5I_{\{\theta= fair\}} + {5 \choose x}(.7)^x(.3)^{5 - x}I_{\{\theta = loaded\}}
+$$
+In this case, we observed that $x = 2$, so the likelihood is
+$$
+f(x = 2 | \theta) = \begin{cases} 
+	0.3125 & \theta = fair \\
+	0.1323 & \theta = loaded
+\end{cases}
+$$
+Since the likelihood is larger for the fair coin, the MLE is $\hat{\theta} = fair$.
+
+That's a good point estimate, but then how do we answer the question, how sure are you?
+
+This is not a question that's easily answered in the frequentist paradigm. Another question we might like to answer is: what is the probability that $\theta$ equals fair, given that we observed two heads?
+$$
+P(\theta = fair|x = 2) = ?
+$$
+In the frequentist paradigm, the coin is a physical quantity. It's a fixed coin, and therefore it has a fixed probability of coming up heads. It is either the fair coin, or it's the loaded coin.
+$$
+P(\theta = fair) \in \{0,1\}
+$$
+
+### Bayesian Approach to the Problem
+
+An advantage of the Bayesian approach is that it allows you to easily incorporate prior information, when you know something in advance of looking at the data. This is difficult to do under the frequentist paradigm.
+
+In this case, we're talking about your brother. You probably know him pretty well. So suppose you think that before you've looked at the coin, there's a 60% probability that this is the loaded coin.
+
+In this case, we put this into our prior. Our prior is that the probability the coin is loaded is 0.6. We can update our prior with the data to get our posterior beliefs, and we can do this using Bayes' theorem.
+
+Prior: $P(loaded) = 0.6$
+$$
+f(\theta|x) = \frac{f(x|\theta)f(\theta)}{\sum_\theta{f(x|\theta)f(\theta)}}
+$$
+
+$$
+f(\theta|x) = \frac{{5\choose x} [(\frac{1}{2})^5(.4)I_{\{\theta = fair\}} + (.7)^x (.3)^{5-x}(.6)I_{\{\theta = loaded\}}  ] }
+{{5\choose x} [(\frac{1}{2})^5(.4) + (.7)^x (.3)^{5-x}(0.6)  ] }
+$$
+
+$$
+f(\theta|x=2)= \frac{0.0125I_{\{\theta=fair\}}  + 0.0079I_{\{\theta=loaded\}} }{0.0125+0.0079}
+$$
+
+$$
+f(\theta|x=2) = 0.612I_{\{\theta=fair\}} + 0.388I_{\{\theta = loaded\}}
+$$
+
+As you can see in the calculation here, we have the likelihood times the prior in the numerator, and in the denominator we have a normalizing constant, so that when we divide by it, we get answers that add up to one. These numbers match exactly in this case, because it's a very simple problem. But this is a concept that carries over: what's in the denominator is always a normalizing constant.
+$$
+P(\theta = loaded | x = 2) = 0.388
+$$
+This here updates our beliefs after seeing some data about what the probability might be.
+
+We can also examine what would happen under different choices of prior.
+$$
+P(\theta = loaded) = \frac{1}{2} \implies P(\theta = loaded | x = 2) = 0.297
+$$
+
+$$
+P(\theta = loaded) = 0.9 \implies P(\theta = loaded | x = 2) = 0.792
+$$
+
+In this case, the Bayesian approach is inherently subjective. It represents your own personal perspective, and this is an important part of the paradigm. If you have a different perspective, you will get different answers, and that's okay. It's all done in a mathematically rigorous framework, and it's all mathematically consistent and coherent.
+
+And in the end, we get results that are interpretable.
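+
+The posterior probabilities for the three priors considered above can be reproduced in R. This is a minimal sketch: `posterior_loaded` is a hypothetical helper name, and the likelihoods come from `dbinom` with 2 heads in 5 flips.
+
+```R
+# Likelihood of x = 2 heads in 5 flips under each coin
+lik_fair   <- dbinom(2, size = 5, prob = 0.5)  # 0.3125
+lik_loaded <- dbinom(2, size = 5, prob = 0.7)  # 0.1323
+
+# Posterior probability that the coin is loaded, given the prior
+# (posterior_loaded is a hypothetical helper)
+posterior_loaded <- function(prior_loaded) {
+  num <- lik_loaded * prior_loaded
+  num / (num + lik_fair * (1 - prior_loaded))
+}
+
+priors <- c(0.5, 0.6, 0.9)
+round(posterior_loaded(priors), 3)
+# [1] 0.297 0.388 0.792
+```
+
+Because the function is vectorized over the prior, it is easy to pass a whole grid of priors and see how sensitive the conclusion is to that choice.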
+
+## Continuous Bayes
+
+$$
+f(\theta | y) = \frac{f(y | \theta)f(\theta)}{f(y)} = \frac{f(y|\theta)f(\theta)}{\int{f(y|\theta)f(\theta)d\theta}} = \frac{likelihood \times prior}{normalization} \propto likelihood \times prior
+$$
+
+In practice, sometimes this integral can be a pain to compute. So we may instead work with the statement that the posterior is proportional to the likelihood times the prior. If we can figure out what this looks like and just put the appropriate normalizing constant on at the end, we don't necessarily have to compute this integral.
+
+
+
+So for example, suppose we're looking at a coin and it has unknown probability $\theta$ of coming up heads. Suppose we express ignorance about the value of $\theta$ by assigning it a uniform distribution.
+$$
+\theta \sim U[0, 1]
+$$
+
+$$
+f(\theta) = I_{\{0 \le \theta\le 1\}}
+$$
+
+$$
+f(\theta | y = 1) = \frac{\theta^1(1-\theta)^0I_{\{0 \le \theta\le1\}}}{\int_{-\infty}^\infty{\theta^1(1-\theta)^0I_{\{0\le \theta \le 1\}}d\theta}}
+$$
+
+$$
+f(\theta | y = 1) = \frac{\theta I_{\{0\le\theta\le1\}}}{\int_0^1{\theta d\theta}} = 2\theta I_{\{0\le\theta\le1\}}
+$$
+
+Now if we didn't want to take the integral we could've done this approach
+$$
+f(\theta | y) \propto f(y|\theta)f(\theta) \propto \theta I_{\{0\le\theta\le1\}}
+$$
+We then need to find the constant such that it's a proper PDF, i.e., one that integrates to 1. In this case, it's $2$.
+
+Since it's a proper PDF, we can compute interval probabilities as well. These are called posterior interval estimates.
+$$
+P(0.025 < \theta < 0.975) = \int_{0.025}^{0.975}{2\theta d \theta} = (0.975)^2 - (0.025)^2 = 0.95
+$$
+
+$$
+P(\theta > 0.05) = 1 - (0.05)^2 = 0.9975
+$$
+
+These are intervals that we might pick in advance (for example, based on the prior) and then ask what their posterior probability is.
+
+In other cases, we may want to ask, what is the posterior interval of interest? What's an interval that contains 95% of the posterior probability in some meaningful way? This plays a role analogous to a frequentist confidence interval. We can do this in several different ways; the two main kinds of Bayesian posterior intervals, or credible intervals, are equal-tailed intervals and highest posterior density intervals.
+
+## Equal-tailed Interval
+
+In the case of an equal-tailed interval, we put the equal amount of probability in each tail. So to make a 95% interval we'll put 0.025 in each tail. 
+
+To be able to do this, we need to figure out what the quantiles are. We need the values $q$ at which the posterior CDF equals 0.025 and 0.975, where the CDF is
+$$
+P(\theta < q | Y = 1) = \int_0^q{2\theta d\theta} = q^2
+$$
+
+$$
+P(\sqrt{0.025} < \theta < \sqrt{0.975}) = P(0.158 < \theta < 0.987) = 0.95
+$$
+
+This is an equal-tailed interval in that the probability that $\theta$ is less than 0.158 is the same as the probability that $\theta$ is greater than 0.987. We can say that under the posterior, there's a 95% probability that $\theta$ is in this interval.
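+
+The same quantiles can be read off numerically. This sketch relies on the fact that the density $2\theta$ on $[0, 1]$ is a $Beta(2, 1)$ density, so base R's `qbeta` and `pbeta` apply; the variable names are illustrative.
+
+```R
+# The posterior 2 * theta on [0, 1] is a Beta(2, 1) density
+lower = qbeta(0.025, 2, 1)
+upper = qbeta(0.975, 2, 1)
+c(lower, upper)                           # compare with sqrt(0.025) and sqrt(0.975)
+# Posterior probability between the two quantiles
+pbeta(upper, 2, 1) - pbeta(lower, 2, 1)
+```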
+
+## Highest Posterior Density (HPD)
+
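+Here we ask where the density function is highest. Theoretically, this gives the shortest possible interval that contains the given probability, in this case 95%. Because the posterior density $2\theta$ is increasing in $\theta$, the highest-density region sits against the upper end point $1$.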
+$$
+P(\theta > \sqrt{0.05} | Y = 1) = P(\theta > 0.224 | Y = 1) = 0.95
+$$
+This is the shortest possible interval that has probability 0.95 under the posterior: $\theta$ going from 0.224 up to 1.
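+
+To see that this interval really is shorter than the equal-tailed one, here is a small sketch comparing the two widths. It again treats the posterior as $Beta(2, 1)$; `et` and `hpd` are illustrative names.
+
+```R
+# Equal-tailed 95% interval for the Beta(2, 1) posterior
+et = qbeta(c(0.025, 0.975), 2, 1)
+# HPD interval: the density 2 * theta is increasing, so all 5% of the
+# excluded probability goes in the lower tail and the interval ends at 1
+hpd = c(qbeta(0.05, 2, 1), 1)
+diff(et)    # width of the equal-tailed interval
+diff(hpd)   # width of the HPD interval, which is smaller
+```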
+
+
+
+The posterior distribution describes our understanding of our uncertainty, combining our prior beliefs and the data. It does this with a probability density function, so at the end of the day, we can make intervals and talk about the probability of the parameter being in an interval.
+
+This is different from the frequentist approach, where we get confidence intervals but can't say a whole lot about the actual parameter relative to the confidence interval. We can only make long-run frequency statements about hypothetical intervals.
+
+In this case, we can legitimately say that the posterior probability that $\theta$ is bigger than 0.05 is $0.9975$. We can also say that we believe there's a 95% probability that $\theta$ is in between 0.158 and 0.987.
+
+
+
+Bayesians represent uncertainty with probabilities. The coin itself is a physical object, and it may have a particular, fixed value of $\theta$.
+
+That value may be fixed, but because we don't know it, we represent our uncertainty about it with a distribution. At the end of the day, we can combine that uncertainty with the data, get a posterior distribution, and make intuitive statements.
+
+
+
+
+Frequentist confidence intervals have the interpretation that "If you were to repeat many times the process of collecting data and computing a 95% confidence interval, then on average about 95% of those intervals would contain the true parameter value; however, once you observe data and compute an interval the true value is either in the interval or it is not, but you can't tell which."
+
+Bayesian credible intervals have the interpretation that "Your posterior probability that the parameter is in a 95% credible interval is 95%." 
--- a/content/notes/bayesianstatistics/week3.md
+++ b/content/notes/bayesianstatistics/week3.md
@ -0,0 +1,409 @@
+How do we choose a prior?
+
+Our prior needs to represent our personal perspective, beliefs, and our uncertainties.
+
+Theoretically, we're defining a cumulative distribution function for the parameter
+$$
+\begin{cases}
+P(\theta \le c)  & c \in \mathbb{R}
+\end{cases}
+$$
+This would have to be specified for infinitely many values of $c$. That isn't practical, and it would be very difficult to do coherently so that all the probabilities were consistent.
+
+In practice, we work with a convenient family that's sufficiently flexible that some member of the family represents our beliefs.
+
+Generally, if one has enough data, the information in the data will overwhelm the information in the prior. The prior is then not particularly important in terms of what you get for the posterior: any reasonable choice of prior will lead to approximately the same posterior. However, there are some things that can go wrong.
+
+## Example of Bad Prior
+
+Suppose we chose a prior that says $P(\theta = \frac{1}{2}) = 1$.
+
+And thus, the probability of $\theta$ equaling any other value is $0$. If we do this, our data won't make a difference since we only put a probability of $1$ at a single point.
+$$
+f(\theta|y) \propto f(y|\theta)f(\theta) \propto \delta(\theta - \frac{1}{2})
+$$
+
+In the basic context, events with prior probability of zero have a posterior probability of zero. Events with a prior probability of one, have a posterior probability of one.
+
+Thus a good Bayesian will not assign a probability of zero or one to any event, unless it has already occurred or is already known not to occur.
+
+## Calibration
+
+A useful concept in terms of choosing priors is that of the calibration of predictive intervals. 
+
+Suppose we make an interval and predict that 95% of new data points will fall in it. It would be good if, in reality, 95% of new data points actually did fall in that interval.
+
+How do we calibrate to reality? This is actually a frequentist concept, but it matters for practical statistical purposes that our results reflect reality.
+
+We can compute a predictive interval: an interval such that 95% of new observations are expected to fall into it. It's an interval for the **data** rather than an interval for $\theta$.
+$$
+f(y) = \int{f(y|\theta)f(\theta)d\theta} = \int{f(y, \theta)d\theta}
+$$
+Where $f(y,\theta)$ is the joint density of Y and $\theta$.
+
+This is the prior predictive before any data is observed.
+
+**Side Note:** From this you can say that $f(y, \theta) = f(y|\theta)f(\theta)$
+
+## Binomial Example
+
+Suppose we're going to flip a coin ten times and count the number of heads we see. We're thinking about this in advance of actually doing it, so we're interested in the predictive distribution. How many heads do we predict we're going to see?
+$$
+X = \sum_{i = 1}^{10}{Y_i}
+$$
+Where $Y_i$ is each individual coin flip.
+
+If we think that all possible coins or all possible probabilities are equally likely, then we can put a prior for $\theta$ that's flat over the interval from 0 to 1.
+$$
+f(\theta) = I_{\{0 \le \theta \le 1\}}
+$$
+
+$$
+f(x) = \int{f(x|\theta)f(\theta)d\theta} = \int_0^1{\frac{10!}{x!(10-x)!}\theta^x(1-\theta)^{10 -x}(1)d\theta}
+$$
+
+Note that because we're interested in $X$ at the end, it's important that we distinguish between a binomial density and a Bernoulli density. Here we just care about the total count rather than the exact ordering of heads and tails, which is what a product of Bernoullis would describe.
+
+For most of the analyses we're doing, where we're interested in $\theta$ rather than x, the binomial and the Bernoulli are interchangeable because the part in here that depends on $\theta$ is the same.
+
+To solve this integral let us recall some facts
+$$
+n! = \Gamma(n + 1)
+$$
+
+$$
+Z \sim Beta(\alpha, \beta)
+$$
+
+$$
+f(z) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}z^{\alpha - 1}(1-z)^{\beta - 1}
+$$
+
+Let us rewrite $f(x)$
+$$
+f(x) = \int_0^1{\frac{\Gamma(11)}{\Gamma(x + 1)\Gamma(11 - x)}\theta^{(x + 1)-1}(1-\theta)^{(11-x)-1}d\theta}
+$$
+
+$$
+f(x) = \frac{\Gamma(11)}{\Gamma(12)}\int_0^1{\frac{\Gamma(12)}{\Gamma(x + 1)\Gamma(11 - x)}\theta^{(x + 1)-1}(1-\theta)^{(11-x)-1}d\theta}
+$$
+
+The integrand above is a beta density, and every valid beta density integrates to one.
+$$
+f(x) = \frac{\Gamma(11)}{\Gamma(12)} = \frac{10!}{11!} = \frac{1}{11}
+$$
+For $x \in \{0, 1, 2, \dots, 10\}$
+
+Thus we see that if we start with a uniform prior, we then end up with a discrete uniform predictive density for $X$. If all possible probabilities are equally likely, then all possible $X$ outcomes are equally likely.
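+
+The $\frac{1}{11}$ result can be checked numerically. This is a minimal sketch that integrates the binomial likelihood against the flat prior for each possible count; `prior_predictive` is an illustrative name.
+
+```R
+# Prior predictive probability of x heads in 10 flips under a Uniform(0, 1) prior:
+# integrate the binomial likelihood over theta (the prior density is 1 on [0, 1])
+prior_predictive = sapply(0:10, function(x) {
+  integrate(function(theta) dbinom(x, size = 10, prob = theta), 0, 1)$value
+})
+round(prior_predictive, 4)   # each entry should be close to 1/11
+sum(prior_predictive)        # should be close to 1
+```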
+
+## Posterior Predictive Distribution
+
+What about after we've observed data? What's our posterior predictive distribution?
+
+Going from the previous example, let us observe after one flip that we got a head. We want to ask, what's our predictive distribution for the second flip, given we saw a head on the first flip.
+$$
+f(y_2|y_1) = \int{f(y_2|\theta,y_1)f(\theta|y_1)}d\theta
+$$
+We're going to assume that $Y_2$ is independent of $Y_1$ once $\theta$ is given (conditional independence). Therefore,
+$$
+f(y_2 |y_1) = \int{f(y_2|\theta)f(\theta|y_1)d\theta}
+$$
+Suppose we're thinking of a uniform distribution for $\theta$ and we observe the first flip is a heads. What do we predict for the second flip?
+
+This is no longer going to be a uniform distribution like it was before, because we have some data. We now think a second head is more likely: having observed a head, $\theta$ is likely to be at least $\frac{1}{2}$, possibly larger.
+$$
+f(y_2 | Y_1 = 1) = \int_0^1{\theta^{y_2}(1-\theta)^{1-y_2}2\theta d\theta}
+$$
+
+$$
+f(y_2|Y_1 = 1) = \int_0^1{2\theta^{y_2 + 1}(1-\theta)^{1-y_2}d\theta}
+$$
+
+We could work this out in a more general form, but in this case, $Y_2$ has to take the value $0$ or $1$. The next flip is either going to be heads or tails so it's easier to just plop in a particular example.
+$$
+P(Y_2 = 1|Y_1 = 1) = \int_0^1{2\theta^2d\theta} = \frac{2}{3}
+$$
+
+$$
+P(Y_2 = 0 | Y_1 = 1) = 1 - P(Y_2 = 1 | Y_1 = 1) = 1 - \frac{2}{3} = \frac{1}{3}
+$$
+
+We can see here that the posterior is a combination of the information in the prior and the information in the data. In this case, our prior is like having two data points, one head and one tail. 
+
+Saying we have a uniform prior for $\theta$ is equivalent in an information sense as saying we have observed one head and one tail.
+
+So then when we observe one head, it's like we now have seen two heads and one tail. So our predictive distribution for the second flip says if we have two heads and one tail, then we have a $\frac{2}{3}$ probability of getting another head and a $\frac{1}{3}$ probability of getting another tail.
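+
+The $\frac{2}{3}$ figure can be reproduced by averaging $\theta$ over the posterior $2\theta$. This is a sketch using base R's `integrate`; `posterior` and `p_head` are illustrative names.
+
+```R
+# Posterior density after one head under a uniform prior
+posterior = function(theta) 2 * theta
+# P(Y2 = 1 | Y1 = 1) is the posterior expectation of theta
+p_head = integrate(function(theta) theta * posterior(theta), 0, 1)$value
+p_head       # close to 2/3
+1 - p_head   # close to 1/3
+```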
+
+## Binomial Likelihood with Uniform Prior
+
+Likelihood of y given theta is
+$$
+f(y|\theta) = \theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}
+$$
+
+Our prior for theta is just a uniform distribution
+$$
+f(\theta) = I_{\{0 \le \theta \le 1\}}
+$$
+Thus our posterior for $\theta$ is
+$$
+f(\theta | y) = \frac{f(y|\theta)f(\theta)}{\int{f(y|\theta)f(\theta)d\theta}} = \frac{\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}  I_{\{0 \le \theta \le 1\}}}{\int_0^1{\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}  I_{\{0 \le \theta \le 1\}}d\theta}}
+$$
+Recalling the form of the beta distribution we can rewrite our posterior as
+$$
+f(\theta | y) = \frac{\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}  I_{\{0 \le \theta \le 1\}}}{\frac{\Gamma(\sum{y_i} + 1)\Gamma(n - \sum{y_i} + 1)}{\Gamma(n + 2)}\int_0^1{\frac{\Gamma(n + 2)}{\Gamma(\sum{y_i} + 1)\Gamma(n - \sum{y_i} + 1)}\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}d\theta}}
+$$
+Since the beta density integrates to $1$, we can simplify this as
+$$
+f(\theta | y) = \frac{\Gamma(n + 2)}{\Gamma(\sum{y_i}+ 1)\Gamma(n - \sum{y_i}+ 1)}\theta^{\sum{y_i}}(1-\theta)^{n-\sum{y_i}}I_{\{0 \le \theta \le 1\}}
+$$
+From here we can see that the posterior follows a beta distribution
+$$
+\theta | y \sim Beta(\sum{y_i} + 1, n - \sum{y_i} + 1)
+$$
+
+## Conjugate Priors
+
+The uniform distribution is $Beta(1, 1)$ 
+
+Any beta distribution is conjugate for the Bernoulli distribution. Any beta prior will give a beta posterior.
+$$
+f(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1-\theta)^{\beta -1}I_{\{0 \le \theta \le 1\}}
+$$
+
+$$
+f(\theta | y) \propto f(y|\theta)f(\theta) = \theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}I_{\{0 \le \theta \le 1\}}
+$$
+
+$$
+f(y|\theta)f(\theta) \propto \theta^{\alpha + \sum{y_i}-1}(1-\theta)^{\beta + n - \sum{y_i} - 1}
+$$
+
+Thus we see that this is a beta distribution
+$$
+\theta | y \sim Beta(\alpha + \sum{y_i}, \beta + n - \sum{y_i})
+$$
+When $\alpha$ and $\beta$ are both one, as in the uniform distribution, we get the same result as earlier.
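+
+The update rule amounts to adding successes to $\alpha$ and failures to $\beta$. Here is a minimal sketch; the helper `beta_update` and the data vector are hypothetical, not part of the course material.
+
+```R
+# Hypothetical helper: update a Beta(alpha, beta) prior with Bernoulli data y
+beta_update = function(alpha, beta, y) {
+  c(alpha = alpha + sum(y), beta = beta + length(y) - sum(y))
+}
+y = c(1, 0, 1, 1, 0)     # made-up flips: 3 heads, 2 tails
+beta_update(1, 1, y)     # uniform prior gives Beta(4, 3)
+beta_update(2, 2, y)     # Beta(2, 2) prior gives Beta(5, 4)
+```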
+
+
+
+This whole process where we choose a particular form of prior that works with a likelihood is called using a conjugate family.
+
+A family of distributions is referred to as conjugate if when you use a member of that family as a prior, you get another member of that family as your posterior.
+
+The beta distribution is conjugate for the Bernoulli distribution. It's also conjugate for the binomial distribution. The only difference in the binomial likelihood is that there is a combinatoric term. Since that does not depend on $\theta$, we get the same posterior.
+
+We often use conjugate priors because they make life much simpler: sticking to conjugate families allows us to get closed-form solutions easily.
+
+If the family is flexible enough, then you can find a member of that family that closely represents your beliefs.
+
+## Posterior Mean and Effective Sample Size
+
+Returning to the beta posterior model, it is clear how both the prior and the data contribute to the posterior.
+
+We can say that the effective sample size of the prior is $\alpha + \beta$.
+
+Recall that the expected value or mean of a beta distribution is $\frac{\alpha}{\alpha + \beta}$
+
+Therefore we can derive the posterior mean as
+$$
+posterior_{mean} = \frac{\alpha + \sum{y_i}}{\alpha + \sum{y_i}+\beta + n - \sum{y_i}}= \frac{\alpha+\sum{y_i}}{\alpha + \beta + n}
+$$
+We can further decompose this as
+$$
+posterior_{mean} = \frac{\alpha + \beta}{\alpha + \beta + n}\frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n}\frac{\sum{y_i}}{n}
+$$
+We can describe this as the (prior weight * prior mean) + (data weight * data mean)
+
+The posterior mean is a weighted average of the prior mean and the data mean.
+
+This effective sample size gives you an idea of how much data you would need to make sure that your prior doesn't have much influence on your posterior.
+
+If $\alpha + \beta$ is small compared to $n$ then the posterior will largely just be driven by the data. If $\alpha + \beta$ is large relative to $n$ then the posterior will be largely driven by the prior.
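+
+A short numeric sketch of the weighted-average decomposition, assuming a $Beta(8, 4)$ prior and made-up data of 33 successes in 40 trials (all variable names are illustrative):
+
+```R
+alpha = 8; beta = 4          # assumed prior, effective sample size 12
+n = 40; successes = 33       # made-up data
+prior_weight = (alpha + beta) / (alpha + beta + n)
+data_weight  = n / (alpha + beta + n)
+# Weighted average of the prior mean and the data mean
+weighted = prior_weight * alpha / (alpha + beta) + data_weight * successes / n
+weighted
+(alpha + successes) / (alpha + beta + n)   # direct posterior mean, the same value
+```
+
+Here the data weight is $\frac{40}{52}$, so the posterior mean sits much closer to the data mean than to the prior mean.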
+
+We can make a 95% credible interval using our posterior distribution for $\theta$. We can find an interval that actually has 95% probability of containing $\theta$.
+
+Using Bayesian statistics, we can chain updates together, doing a sequential update every time we get new data. We just use our previous posterior as the prior for another update using Bayes' theorem, which gives us a new posterior.
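+
+A minimal sketch of sequential updating for the beta-Bernoulli model, using made-up flips, shows that updating one observation at a time lands on the same posterior as a single batch update:
+
+```R
+y = c(1, 0, 1, 1, 0, 1)   # made-up flips
+# Batch update from a Beta(1, 1) prior
+batch = c(1 + sum(y), 1 + length(y) - sum(y))
+# Sequential update: each posterior becomes the prior for the next flip
+params = c(1, 1)
+for (flip in y) {
+  params = params + c(flip, 1 - flip)
+}
+batch
+params   # same parameters as the batch result
+```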
+
+## Data Analysis Example in R
+
+Suppose we're giving two students a multiple-choice exam with 40 questions, where each question has four choices. We don't know how much the students have studied for this exam, but we think that they'll do better than just guessing randomly.
+
+1) What are the parameters of interest?
+
+The parameters of interest are $\theta_1$, the true probability that the first student will answer a question correctly, and $\theta_2$, the true probability that the second student will answer a question correctly.
+
+2) What is our likelihood?
+
+The likelihood is $Binomial(40, \theta)$, if we assume that each question is independent and that the probability a student gets each question right is the same for all questions for that student.
+
+3) What prior should we use?
+
+The conjugate prior is a beta prior. We can plot the density with `dbeta`
+
+```R
+# Fine grid so the plotted density curves look smooth
+theta = seq(from = 0, to = 1, by = 0.01)
+# Uniform
+plot(theta, dbeta(theta, 1, 1), type = 'l')
+# Prior mean 2/3
+plot(theta, dbeta(theta, 4, 2), type = 'l')
+# Prior mean 2/3 but higher effective sample size (more concentrated at mean)
+plot(theta, dbeta(theta, 8, 4), type = 'l')
+```
+
+4) What are the prior probabilities $P(\theta > 0.25)$? $P(\theta > 0.5)$? $P(\theta > 0.8)$?
+
+```R
+1 - pbeta(0.25, 8, 4)
+#[1] 0.9988117
+1 - pbeta(0.5, 8, 4)
+#[1] 0.8867188
+1 - pbeta(0.8, 8, 4)
+#[1] 0.1611392
+```
+
+
+
+5) Suppose the first student gets 33 questions right. What is the posterior distribution for $\theta_1$?  $P(\theta > 0.25)$? $P(\theta > 0.5)$? $P(\theta > 0.8)$? What is the 95% posterior credible interval for $\theta_1$?
+$$
+Posterior \sim Beta(8 + 33, 4 + 40 - 33) = Beta(41, 11)
+$$
+With a posterior mean of $\frac{41}{41+11} = \frac{41}{52}$
+
+We can plot the posterior distribution with the prior
+
+```R
+plot(theta, dbeta(theta, 41, 11), type = 'l')
+lines(theta, dbeta(theta, 8 ,4), lty = 2) #Dashed line for prior
+```
+
+Posterior probabilities
+
+```R
+1 - pbeta(0.25, 41, 11)
+#[1] 1
+1 - pbeta(0.5, 41, 11)
+#[1] 0.9999926
+1 - pbeta(0.8, 41, 11)
+#[1] 0.4444044
+```
+
+Equal tailed 95% credible interval
+
+```R
+qbeta(0.025, 41, 11)
+#[1] 0.6688426
+qbeta(0.975, 41, 11)
+#[1] 0.8871094
+```
+
+There is a 95% posterior probability that $\theta_1$ is between 0.67 and 0.89.
+
+
+
+6) Suppose the second student gets 24 questions right. What is the posterior distribution for $\theta_2$?  $P(\theta > 0.25)$? $P(\theta > 0.5)$? $P(\theta > 0.8)$? What is the 95% posterior credible interval for $\theta_2$?
+$$
+Posterior \sim Beta(8 + 24, 4 + 40 - 24) = Beta(32, 20)
+$$
+With a posterior mean of $\frac{32}{32+20} = \frac{32}{52}$
+
+We can plot the posterior distribution with the prior
+
+```R
+plot(theta, dbeta(theta, 32, 20), type = 'l')
+lines(theta, dbeta(theta, 8 ,4), lty = 2) #Dashed line for prior
+```
+
+Posterior probabilities
+
+```R
+1 - pbeta(0.25, 32, 20)
+#[1] 1
+1 - pbeta(0.5, 32, 20)
+#[1] 0.9540427
+1 - pbeta(0.8, 32, 20)
+#[1] 0.00124819
+```
+
+Equal tailed 95% credible interval
+
+```R
+qbeta(0.025, 32, 20)
+#[1] 0.4808022
+qbeta(0.975, 32, 20)
+#[1] 0.7415564
+```
+
+There is a 95% posterior probability that $\theta_2$ is between 0.48 and 0.74.
+
+
+
+7) What is the posterior probability that $\theta_1 > \theta_2$? i.e., that the first student has a better chance of getting a question right than the second student?
+
+Estimate by simulation: draw 100,000 samples from each posterior and see how often we observe $\theta_1 > \theta_2$
+
+```R
+theta1 = rbeta(100000, 41, 11)
+theta2 = rbeta(100000, 32, 20)
+mean(theta1 > theta2)
+#[1] 0.975
+```
+
+## Poisson Data (Chocolate Chip Cookie Example)
+
+Mass-produced chocolate chip cookies start as a large batch of dough. A large number of chips are mixed in really well, and then the dough is chunked out into individual cookies. In this process, the number of chips per cookie approximately follows a Poisson distribution.
+
+If we were to assume that chips have no volume, then this would be exactly a Poisson process and the counts would follow exactly a Poisson distribution. In practice, chips aren't that big, so the number of chips per cookie follows approximately a Poisson distribution.
+$$
+Y_i \sim Poisson(\lambda)
+$$
+
+$$
+f(y|\lambda) = \frac{\lambda^{\sum{y_i}}e^{-n\lambda}}{\prod_{i = 1}^n{y_i!}}
+$$
+
+This is for $\lambda > 0$
+
+What type of prior should we put on $\lambda$? It would be convenient if we could put a conjugate prior. What distribution looks like lambda raised to a power and e raised to a negative power?
+
+For this, we're going to use a Gamma prior.
+$$
+\lambda \sim \Gamma(\alpha, \beta)
+$$
+
+$$
+f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha - 1}e^{-\beta\lambda}
+$$
+
+$$
+f(\lambda | y) \propto f(y|\lambda)f(\lambda) \propto \lambda^{\sum{y_i}}e^{-n\lambda}\lambda^{\alpha - 1}e^{-\beta \lambda}
+$$
+
+$$
+f(\lambda | y) \propto \lambda^{\alpha + \sum{y_i} - 1}e^{-(\beta + n)\lambda}
+$$
+
+Thus we can see that the posterior is a Gamma Distribution
+$$
+\lambda|y \sim \Gamma(\alpha + \sum{y_i}, \beta + n)
+$$
+The mean of Gamma under this parameterization is $\frac{\alpha}{\beta}$
+
+The posterior mean is going to be
+$$
+posterior_{mean} = \frac{\alpha + \sum{y_i}}{\beta + n} = \frac{\beta}{\beta + n}\frac{\alpha}{\beta} + \frac{n}{\beta + n}\frac{\sum{y_i}}{n}
+$$
+As you can see here the posterior mean of the Gamma distribution is also the weighted average of the prior mean and the data mean.
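+
+As a sketch of the Gamma-Poisson update in R, assume a $\Gamma(8, 1)$ prior and some made-up chip counts; neither comes from the course material.
+
+```R
+chips = c(12, 9, 14, 11, 10)       # made-up chip counts for 5 cookies
+alpha = 8; beta = 1                # assumed Gamma prior with mean 8
+alpha_post = alpha + sum(chips)    # shape: alpha + sum of counts
+beta_post  = beta + length(chips)  # rate: beta + number of cookies
+alpha_post / beta_post             # posterior mean
+# 95% equal-tailed credible interval for lambda
+qgamma(c(0.025, 0.975), shape = alpha_post, rate = beta_post)
+```
+
+The posterior mean lands between the prior mean (8) and the data mean (11.2), closer to the data because $n = 5$ outweighs $\beta = 1$.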
+
+Let us present two strategies on how to choose our hyper parameters $\alpha$ and $\beta$
+
+1. Think about the prior mean. For example, what do you think the number of chips per cookie on average is?
+
+After this, we need some other piece of knowledge to pin point both parameters. Here are some options.
+
+- What is your uncertainty about the number of chips per cookie? In other words, what do you think the standard deviation is? Under the Gamma prior the standard deviation is $\frac{\sqrt{\alpha}}{\beta}$.
+- What is the effective sample size $\beta$? How many units of information do you think we have in our prior versus our data points?
+
+2. In Bayesian Statistics, a vague prior refers to one that's relatively flat across much of the space. For a Gamma prior we can choose $\Gamma(\epsilon, \epsilon)$ where $\epsilon$ is small and strictly positive.
+
+This would create a distribution with a mean of 1 and a huge standard deviation across the whole space. Hence the posterior will be largely driven by the data and very little by the prior.
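+
+A small sketch of how little such a vague prior matters, using made-up chip counts and an assumed $\epsilon = 0.001$:
+
+```R
+chips = c(12, 9, 14, 11, 10)   # made-up chip counts
+eps = 0.001                    # assumed small, strictly positive value
+# Posterior is Gamma(eps + sum(chips), eps + n)
+post_mean = (eps + sum(chips)) / (eps + length(chips))
+post_mean                      # posterior mean
+mean(chips)                    # data mean, nearly identical
+```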
--- a/content/notes/bayesianstatistics/week4.md
+++ b/content/notes/bayesianstatistics/week4.md
@ -0,0 +1,617 @@
+## Exponential Data
+
+Suppose you're waiting for a bus that you think comes on average once every 10 minutes, but you're not sure exactly how often it comes.
+$$
+Y \sim Exp(\lambda)
+$$
+Your waiting time has a prior expectation of $\frac{1}{\lambda}$
+
+
+
+It turns out the gamma distribution is conjugate for an exponential likelihood. We need to specify a prior, or a particular gamma in this case. If we think that the buses come on average every ten minutes, that's a rate of one over ten.
+$$
+prior_{mean} = \frac{1}{10}
+$$
+Thus, we'll want to specify a gamma distribution so that the first parameter divided by the second parameter is $\frac{1}{10}$
+
+We can now think about our variability. Perhaps you specify
+$$
+\Gamma(100, 1000)
+$$
+This will indeed have a prior mean of $\frac{1}{10}$ and it'll have a standard deviation of $\frac{1}{100}$. If you want to have a rough estimate of our mean plus or minus two standard deviations then we have the following
+$$
+0.1 \pm 0.02
+$$
+Suppose that we wait for 12 minutes and a bus arrives. Now you want to update your posterior for $\lambda$ about how often this bus will arrive. 
+$$
+f(\lambda | y) \propto f(y|\lambda)f(\lambda)
+$$
+
+$$
+f(\lambda | y) \propto \lambda e^{-\lambda y}\lambda^{\alpha - 1}e^{-\beta \lambda}
+$$
+
+$$
+f(\lambda | y)  \propto \lambda^{(\alpha + 1) - 1}e^{-(\beta + y)\lambda}
+$$
+
+$$
+\lambda | y \sim \Gamma(\alpha + 1, \beta + y)
+$$
+
+Plugging in our particular prior gives us a posterior for $\lambda$ which is
+$$
+\lambda | y \sim \Gamma(101, 1012)
+$$
+Thus our posterior mean is going to be $\frac{101}{1012}$, which is approximately 0.0998.
+
+
+
+This one observation doesn't contain a lot of information under this likelihood. When the bus takes 12 minutes instead of 10, it barely shifts our posterior mean for the rate down, from 0.1 to about 0.0998. One data point doesn't have a big impact here.
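+
+The bus update can be reproduced in a few lines. This sketch uses the $\Gamma(100, 1000)$ prior and the 12-minute wait from above; the variable names are illustrative.
+
+```R
+alpha = 100; beta = 1000    # prior Gamma(100, 1000)
+y = 12                      # observed waiting time in minutes
+alpha_post = alpha + 1      # one observation adds 1 to the shape
+beta_post  = beta + y       # and the waiting time to the rate
+alpha_post / beta_post      # posterior mean of the rate, 101/1012
+# 95% equal-tailed credible interval for the rate
+qgamma(c(0.025, 0.975), shape = alpha_post, rate = beta_post)
+```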
+
+
+
+## Normal/Gaussian Data
+
+Let's suppose the variance $\sigma_0^2$ is known and we're only interested in learning about the mean. This situation often arises in monitoring industrial production processes.
+$$
+X_i \sim N(\mu, \sigma_0^2)
+$$
+It turns out that the Normal distribution is conjugate for itself when looking for the mean parameter.
+
+Prior
+$$
+\mu \sim N(m_0,s_0^2)
+$$
+
+$$
+f(\mu |x ) \propto f(x|\mu)f(\mu)
+$$
+
+$$
+\mu | x \sim N(\frac{n\bar{x}/\sigma_0^2 + m_0/s_0^2}{n/\sigma_0^2 + 1/s_0^2}, \frac{1}{n/\sigma_0^2 + 1/s_0^2})
+$$
+
+Let's look at the posterior mean
+$$
+posterior_{mean} = \frac{n/\sigma_0^2}{n/\sigma_0^2 + 1/s_0^2}\bar{x} + \frac{1/s_0^2}{n/\sigma_0^2 + 1/s_0^2}m_0
+$$
+
+$$
+posterior_{mean} = \frac{n}{n + \sigma_0^2/s_0^2}\bar{x} + \frac{\sigma_0^2/s_0^2}{n + \sigma_0^2/s_0^2}m_0
+$$
+
+Thus we see that the posterior mean is a weighted average of the prior mean and the data mean. The effective sample size for this prior is $\sigma_0^2/s_0^2$, the ratio of the variance of the data to the variance of the prior.
+
+This makes sense, because the larger the variance of the prior, the less information that's in it.
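+
+A minimal sketch of this update in R. The helper `normal_update`, its argument names, and the measurements are hypothetical and chosen only for illustration.
+
+```R
+# Hypothetical helper: posterior for mu with known data variance sigma0_sq
+# and a N(m0, s0_sq) prior on mu
+normal_update = function(x, sigma0_sq, m0, s0_sq) {
+  n = length(x)
+  precision = n / sigma0_sq + 1 / s0_sq   # posterior precision
+  c(mean = (sum(x) / sigma0_sq + m0 / s0_sq) / precision,
+    var  = 1 / precision)
+}
+x = c(94.6, 95.4, 96.2, 94.9, 95.9)       # made-up measurements
+post = normal_update(x, sigma0_sq = 0.25, m0 = 100, s0_sq = 0.25)
+post
+```
+
+With `sigma0_sq` equal to `s0_sq`, the prior has an effective sample size of one, so the posterior mean is the average of the five observations and the prior mean counted once.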
+
+The marginal (prior predictive) distribution for a single observation $X_i$ is
+$$
+N(m_0, s_0^2 + \sigma_0^2)
+$$
+
+### When $\mu$ and $\sigma^2$ are unknown
+
+$$
+X_i | \mu, \sigma^2 \sim N(\mu, \sigma^2)
+$$
+
+A prior for $\mu$ conditional on the value of $\sigma^2$
+$$
+\mu | \sigma^2 \sim N(m, \frac{\sigma^2}{w})
+$$
+$w$ is the ratio of $\sigma^2$ to the prior variance of $\mu$. This is the effective sample size of the prior.
+
+Finally, the last step is to specify a prior for $\sigma^2$. The conjugate prior here is an inverse gamma distribution with parameters $\alpha$ and $\beta$.
+$$
+\sigma^2 \sim \Gamma^{-1}(\alpha, \beta)
+$$
+After many calculations... we get the posterior distribution
+$$
+\sigma^2 | x \sim \Gamma^{-1}\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2}\sum_{i = 1}^n{(x_i-\bar{x})^2} + \frac{nw}{2(n+w)}(\bar{x} - m)^2\right)
+$$
+
+$$
+\mu | \sigma^2,x \sim N(\frac{n\bar{x}+wm}{n+w}, \frac{\sigma^2}{n + w})
+$$
+
+Where the posterior mean can be written as the weighted average of the prior mean and the data mean.
+$$
+\frac{n\bar{x}+wm}{n+w} = \frac{w}{n + w}m + \frac{n}{n + w}\bar{x}
+$$
+In some cases, we really only care about $\mu$. We want some inference on $\mu$ that doesn't depend on $\sigma^2$. We can marginalize over $\sigma^2$ by integrating it out. The posterior for $\mu$ marginally follows a $t$ distribution.
+$$
+\mu | x \sim t
+$$
+Similarly the posterior predictive distribution also is a $t$ distribution.
+
+Finally, note that we can extend this in various directions. It can be extended to the multivariate normal case, which requires matrix-vector notation, and it can be extended in a hierarchical fashion if we want to specify priors for $m$, $w$ and $\beta$.
+
+## Non Informative Priors
+
+We've seen examples of choosing priors that contain a significant amount of information. We've also seen some examples of choosing priors where we're attempting to not put too much information in to keep them vague.
+
+Another approach is referred to as objective Bayesian statistics or inference where we explicitly try to minimize the amount of information that goes into the prior.
+
+This is an attempt to have the data have maximum influence on the posterior
+
+Let's go back to coin flipping
+$$
+Y_i \sim B(\theta)
+$$
+How do we minimize our prior information in $\theta$? One obvious intuitive approach is to say that all values of $\theta$ are equally likely. So we could have a prior for $\theta$ which follows a uniform distribution on the interval $[0, 1]$.
+
+Saying all values of $\theta$ are equally likely **seems** like it would have no information in it.
+
+Recall however, that a $Uniform(0, 1)$ is the same as $Beta(1, 1)$ 
+
+The effective sample size of a beta prior is the sum of its two parameters. So in this case, it has an effective sample size of 2. This is equivalent to data, with one head and one tail already in it.
+
+So this is not a completely non informative prior.
+
+We could think about a prior that has less information. For example $Beta(\frac{1}{2}, \frac{1}{2})$, this would have half as much information with an effective sample size of one.
+
+We can take this even further. Think about something like $Beta(0.001, 0.001)$ This would have much less information, with the effective sample size fairly close to zero. In this case, the data would determine the posterior and there would be very little influence from the prior.
+
+### Improper priors
+
+Can we go even further? We can think of the limiting case. Let's think of $Beta(0,0)$, what would that look like? 
+$$
+f(\theta) \propto \theta^{-1}(1-\theta)^{-1}
+$$
+This is not a proper density. If you integrate this over $(0,1)$, you'll get an infinite integral, so it's not a true density in the sense of it not integrating to 1.
+
+There's no way to normalize it, since it has an infinite integral. This is what we refer to as an improper prior.
+
+It's improper in the sense that it doesn't have a proper density. But it's not necessarily improper in the sense that we can't use it. If we collect data and use this prior, then as long as we observe **at least one success and one failure** (one head and one tail), we get a proper posterior
+$$
+f(\theta|y) \propto \theta^{y-1}(1-\theta)^{n-y-1}, \qquad \theta | y \sim Beta(y, n-y)
+$$
+With a posterior mean of $\frac{y}{n} =\hat{\theta}$, which you should recognize as the maximum likelihood estimate. So by using this improper prior, we get a posterior which gives us point estimates exactly the same as the frequentist approach.
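+
+A short sketch with made-up data (7 heads in 10 flips) shows the posterior mean under the $Beta(0, 0)$ prior matching the maximum likelihood estimate, while still giving a full posterior to work with:
+
+```R
+n = 10; y = 7                       # made-up data: 7 heads in 10 flips
+mle = y / n                         # maximum likelihood estimate
+post_mean = y / (y + (n - y))       # mean of the Beta(y, n - y) posterior
+c(mle, post_mean)                   # the two values agree
+qbeta(c(0.025, 0.975), y, n - y)    # 95% equal-tailed credible interval
+```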
+
+But in this case, we also have a full posterior. From this, we can make interval statements and probability statements, and we can actually find an interval and say that there's a 95% probability that $\theta$ is in this interval. That is not something you can do under the frequentist approach, even though we may get the exact same interval.
+
+### Statements about improper priors
+
+Improper priors are okay as long as the posterior itself is proper. There may be some mathematical things that need to be checked and you may need to have certain restrictions on the data. In this case, we needed to make sure that we observed at least one head and one tail to get a proper posterior.
+
+But as long as the posterior is proper, we can go forwards and do Bayesian inference even with an improper prior.
+
+The second point is that for many problems there does exist a prior, typically an improper prior, that will lead to the same point estimates as you would get under the frequentist paradigm. So we can get very similar results, results that are fully dependent on the data, under the Bayesian approach.
+
+But in this case, we also continue to have a posterior, so we can make posterior interval estimates and talk about posterior probabilities of the parameter.
+
+### Normal Case
+
+Another example is thinking about the normal case.
+$$
+Y_i \stackrel{iid}\sim N(\mu, \sigma^2)
+$$
+Let's start off by assuming that $\sigma^2$ is known and we'll just focus on the mean $\mu$.
+
+We can think about a vague prior like before and say that 
+$$
+\mu \sim N(0, 1000000^2)
+$$
+This would just spread things out across the real line. That would be a fairly non informative prior covering a lot of possibilities. We can then think about taking the limit, what happens if we let the variance go to $\infty$. In that case, we're basically spreading out this distribution across the entire real number line. We can say that the density is just a constant across the whole real line.
+$$
+f(\mu) \propto 1
+$$
+This is an improper prior because if you integrate it over the real line you get an infinite answer. However, if we go ahead and find the posterior
+$$
+f(\mu|y) \propto f(y|\mu)f(\mu) \propto exp(-\frac{1}{2\sigma^2}\sum{(y_i - \mu)^2})(1)
+$$
+
+$$
+f(\mu | y) \propto exp(-\frac{1}{2\sigma^2/n}(\mu - \bar{y})^2)
+$$
+
+$$
+\mu | y \sim N(\bar{y}, \frac{\sigma^2}{n})
+$$
+
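+The posterior mean $\bar{y}$ is exactly the maximum likelihood estimate, and the posterior variance $\frac{\sigma^2}{n}$ matches the sampling variance of $\bar{y}$.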
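+
+As a sketch of what this means in practice, the 95% equal-tailed credible interval under the flat prior has the same end points as the usual frequentist $z$-interval. The data and the known standard deviation below are made up for illustration.
+
+```R
+y = c(4.1, 5.3, 4.8, 5.9, 5.0)   # made-up data
+sigma = 1                        # assumed known standard deviation
+n = length(y)
+# Posterior is N(ybar, sigma^2 / n): equal-tailed 95% credible interval
+credible = qnorm(c(0.025, 0.975), mean = mean(y), sd = sigma / sqrt(n))
+# Frequentist z-interval for the mean
+confidence = mean(y) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)
+credible
+confidence   # same end points, different interpretation
+```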
+
+### Unknown Variance
+
+In the case that $\sigma^2$ is unknown, the standard non informative prior is
+$$
+f(\sigma^2) \propto \frac{1}{\sigma^2}
+$$
+
+$$
+\sigma^2 \sim \Gamma^{-1}(0,0)
+$$
+
+This is an improper prior and it's uniform on the log scale of $\sigma^2$.
+
+In this case, we'll end up with a posterior for $\sigma^2$
+$$
+\sigma^2|y \sim \Gamma^{-1}(\frac{n-1}{2}, \frac{1}{2}\sum{(y_i - \bar{y})^2})
+$$
+This should also look reminiscent of quantities we get as frequentists, for example the sample variance.
+
+## Jeffreys Prior
+
+Choosing a uniform prior depends upon the particular parameterization. 
+
+Suppose I used a prior which is uniform on the log scale for $\sigma^2$
+$$
+f(\sigma^2) \propto \frac{1}{\sigma^2}
+$$
+Suppose somebody else decides, that they just want to put a uniform prior on $\sigma^2$ itself.
+$$
+f(\sigma^2) \propto 1
+$$
+These are both uniform on certain scales or certain parameterizations, but they are different priors. So when we compute the posteriors, they will be different as well.
+
+The key thing is that uniform priors are not invariant with respect to transformation. Depending on how you parameterize the problem, you can get different answers by using a uniform prior.
+
+One attempt to get around this is to use the Jeffreys prior.
+
+The Jeffreys prior is defined as
+$$
+f(\theta) \propto \sqrt{\mathcal{I}(\theta)}
+$$
+where $\mathcal{I}(\theta)$ is the Fisher information of $\theta$. In most cases, this will be an improper prior.
+
+### Normal Data
+
+For the example of Normal Data
+$$
+Y_i \sim N(\mu, \sigma^2)
+$$
+
+$$
+f(\mu) \propto 1
+$$
+
+$$
+f(\sigma^2) \propto \frac{1}{\sigma^2}
+$$
+
+Where $\mu$ is uniform and $\sigma^2$ is uniform on the log scale.
+
+This prior will then be transformation invariant. We will end up putting the same information into the prior even if we use a different parameterization for the Normal.
+
+### Binomial
+
+$$
+Y_i \sim B(\theta)
+$$
+
+$$
+f(\theta) \propto \theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}}, \quad \text{i.e. } \theta \sim Beta(\frac{1}{2},\frac{1}{2})
+$$
+
+This is a rare example of where the Jeffreys prior turns out to be a proper prior.
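+
+As a sketch of where this comes from, take a single Bernoulli observation with $f(y|\theta) = \theta^y(1-\theta)^{1-y}$ and use the squared-score definition of the Fisher information:
+$$
+\frac{d}{d\theta}\log{f(y|\theta)} = \frac{y}{\theta} - \frac{1-y}{1-\theta} = \frac{y - \theta}{\theta(1-\theta)}
+$$
+
+$$
+\mathcal{I}(\theta) = \frac{E[(Y-\theta)^2]}{\theta^2(1-\theta)^2} = \frac{1}{\theta(1-\theta)}
+$$
+Taking the square root gives $\theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}}$.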
+
+You'll note that this prior actually does have some information in it. It's equivalent to an effective sample size of one data point. However, this information will be the same regardless of the parameterization we use.
+
+In this case, we have $\theta$ as a probability, but an alternative that is sometimes used is to model things on the logistic (log-odds) scale.
+
+By using the Jeffreys prior, we'll maintain the exact same information.
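+
+As a small illustration of using this prior, the sketch below updates $Beta(\frac{1}{2},\frac{1}{2})$ with made-up data (7 successes in 10 trials; these numbers are hypothetical and not from the notes). By Beta-Bernoulli conjugacy the posterior is $Beta(y + \frac{1}{2}, n - y + \frac{1}{2})$.
+
+```R
+# Hypothetical data: 7 successes out of 10 Bernoulli trials
+n_trials <- 10
+successes <- 7
+
+# Jeffreys prior Beta(1/2, 1/2) updated by conjugacy
+post_a <- successes + 0.5
+post_b <- n_trials - successes + 0.5
+
+# Posterior mean and an equal-tailed 95% posterior interval
+post_mean <- post_a / (post_a + post_b)
+post_interval <- qbeta(c(0.025, 0.975), post_a, post_b)
+post_mean
+post_interval
+```
+
+The prior parameters sum to one, which is the "effective sample size of one" mentioned above; after ten observations the posterior parameters sum to eleven.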
+
+### Closing information about priors
+
+Other possible approaches to objective Bayesian inference include reference priors and maximum entropy priors.
+
+A related concept is empirical Bayesian analysis. The idea in empirical Bayes is that you use the data to help inform your prior, for example by using the mean of the data to set the mean of the prior distribution. This approach often leads to reasonable point estimates in your posterior. However, it's sort of cheating since you're using your data twice, and as a result it may lead to unreliable (typically overconfident) uncertainty estimates.
+
+## Fisher Information
+
+The Fisher information (for one parameter) is defined as
+$$
+\mathcal{I}(\theta) = E\left[\left(\frac{d}{d\theta}\log{(f(X|\theta))}\right)^2\right]
+$$
+where the expectation is taken with respect to $X$, which has PDF $f(x|\theta)$. This quantity is useful in obtaining estimators for $\theta$ with good properties, such as low variance. It is also the basis for the Jeffreys prior.
+
+**Example:** Let $X | \theta \sim N(\theta, 1)$. Then we have
+$$
+f(x|\theta) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}(x-\theta)^2\right]
+$$
+
+$$
+\log{(f(x|\theta))} = -\frac{1}{2}\log{(2\pi)}-\frac{1}{2}(x-\theta)^2
+$$
+
+$$
+\left(\frac{d}{d\theta}\log{(f(x|\theta))}\right)^2 = (x-\theta)^2
+$$
+
+and so $\mathcal{I}(\theta) = E[(X - \theta)^2] = Var(X) = 1$
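+
+The result can be checked numerically. The sketch below is a Monte Carlo approximation under an arbitrary illustrative value of $\theta$ (the value 2 is an assumption, any value should give the same answer): it simulates from $N(\theta, 1)$ and averages the squared score.
+
+```R
+# Monte Carlo check of the Fisher information for X | theta ~ N(theta, 1)
+set.seed(123)
+theta <- 2                      # arbitrary illustrative value
+x <- rnorm(1e5, mean = theta, sd = 1)
+
+# Score function: d/dtheta log f(x | theta) = x - theta
+score <- x - theta
+
+# Average squared score approximates the Fisher information (close to 1)
+fisher_mc <- mean(score^2)
+fisher_mc
+```
+
+Because the Fisher information here does not depend on $\theta$, the Jeffreys prior for this model is $f(\theta) \propto 1$.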
+
+## Linear Regression
+
+### Brief Review of Regression
+
+Recall that linear regression is a model for predicting a response or dependent variable ($Y$, also called an output) from one or more covariates or independent variables ($X$, also called explanatory variables, inputs, or features). For a given value of a single $x$, the expected value of $y$ is
+$$
+E[y] = \beta_0 + \beta_1x
+$$
+or we could say that $Y \sim N(\beta_0 + \beta_1x, \sigma^2)$. For data $(x_1, y_1), \dots , (x_n, y_n)$, the fitted values for the coefficients, $\hat{\beta_0}$ and $\hat{\beta_1}$ are those that minimize the sum of squared errors $\sum_{i = 1}^n{(y_i - \hat{y_i})^2}$, where the predicted values for the response are $\hat{y} = \hat{\beta_0}  + \hat{\beta_1}x$. We can get these values from R. These fitted coefficients give the least-squares line for the data.
+
+This model extends to multiple covariates, with one $\beta_j$ for each of the $k$ covariates
+$$
+E[y_i] = \beta_0 + \beta_1x_{i1} + \dots + \beta_kx_{ik}
+$$
+Optionally, we can represent the multivariate case using vector-matrix notation.
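+
+A sketch of that notation, using the design matrix $X$ (with a leading column of $1$'s for the intercept) and an error vector $\epsilon$, a symbol introduced here for illustration:
+$$
+y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2I)
+$$
+so that $E[y] = X\beta$, with $y$ of length $n$, $X$ of dimension $n \times (k+1)$, and $\beta$ of length $k+1$.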
+
+### Conjugate Modeling
+
+In the Bayesian framework, we treat the $\beta$ parameters as unknown, put a prior on them, and then find the posterior. We might treat $\sigma^2$ as fixed and known, or we might treat it as an unknown and also put a prior on it. Because the underlying assumption of a regression model is that the errors are independent and identically normally distributed with mean $0$ and variance $\sigma^2$, this defines a normal likelihood.
+
+#### $\sigma^2$ known
+
+Sometimes we may know the value of the error variance $\sigma^2$. This simplifies calculations. The conjugate prior for the $\beta$'s is a normal prior. In practice, people typically use a non-informative prior, i.e., the limit as the variance of the normal prior goes to infinity, which gives a posterior mean equal to the standard least-squares estimates. If we are only estimating $\beta$ and treating $\sigma^2$ as known, then the posterior for $\beta$ is a (multivariate) normal distribution. If we just have a single covariate, then the posterior for the slope is
+$$
+\beta_1 | y \sim N(\frac{\sum_{i = 1}^n{(x_i-\bar{x})(y_i - \bar{y})}}{\sum_{i=1}^n{(x_i-\bar{x})^2}}, \frac{\sigma^2}{\sum_{i=1}^n{(x_i - \bar{x})^2}})
+$$
+If we have multiple covariates, then using a matrix-vector notation, the posterior for the vector of coefficients is
+$$
+\beta | y \sim N((X^tX)^{-1}X^ty, (X^tX)^{-1}\sigma^2)
+$$
+where $X$ denotes the design matrix and $X^t$ is the transpose of $X$. The intercept is typically included in $X$ as a column of $1$'s. Using an improper prior requires us to have at least as many data points as we have parameters to ensure that the posterior is proper.
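+
+The matrix formula can be sketched directly in R. The example below uses simulated data and an assumed known $\sigma$ (all values are hypothetical), builds the design matrix by hand, and compares the posterior mean with the coefficients from `lm`.
+
+```R
+# Simulated data with two covariates and a known error sd (hypothetical)
+set.seed(42)
+n <- 50
+x1 <- rnorm(n)
+x2 <- rnorm(n)
+sigma <- 1.5
+y <- 2 + 0.5 * x1 - 1 * x2 + rnorm(n, sd = sigma)
+
+# Design matrix with a column of 1's for the intercept
+X <- cbind(1, x1, x2)
+XtX_inv <- solve(t(X) %*% X)
+
+# Posterior mean and covariance under the flat prior with sigma^2 known
+post_mean <- XtX_inv %*% t(X) %*% y
+post_cov <- sigma^2 * XtX_inv
+
+# The posterior mean matches the least-squares fit
+fit <- lm(y ~ x1 + x2)
+cbind(post_mean, coef(fit))
+```
+
+Forming the inverse of $X^tX$ explicitly keeps the code close to the formula; for larger or ill-conditioned problems a QR-based solve, which is what `lm` uses internally, is the more stable choice.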
+
+#### $\sigma^2$ Unknown
+
+If we treat both $\beta$ and $\sigma^2$ as unknown, the standard prior is the non-informative Jeffreys prior, $f(\beta, \sigma^2) \propto \frac{1}{\sigma^2}$. Again, the posterior mean for $\beta$ will be the same as the standard least-squares estimates. The posterior for $\beta$ conditional on $\sigma^2$ is the same normal distribution as when $\sigma^2$ is known, but the marginal posterior distribution for $\beta$, with $\sigma^2$ integrated out, is a $t$ distribution with $n - k - 1$ degrees of freedom, analogous to the $t$ tests for significance in standard linear regression. The posterior $t$ distribution has mean $(X^tX)^{-1}X^ty$ and scale matrix (related to the variance matrix) $s^2(X^tX)^{-1}$, where $s^2 = \sum_{i = 1}^n{(y_i - \hat{y_i})^2/(n - k - 1)}$. The posterior distribution for $\sigma^2$ is an inverse gamma distribution
+$$
+\sigma^2 | y \sim \Gamma^{-1}(\frac{n - k - 1}{2}, \frac{n - k - 1}{2}s^2)
+$$
+In the simple linear regression case (single variable), the marginal posterior for the slope $\beta_1$ is a $t$ distribution with mean $\frac{\sum_{i = 1}^n{(x_i-\bar{x})(y_i - \bar{y})}}{\sum_{i=1}^n{(x_i-\bar{x})^2}}$ and squared scale $\frac{s^2}{\sum_{i=1}^n{(x_i - \bar{x})^2}}$. If we are trying to predict a new observation at a specified input $x^*$, that predicted value has a marginal posterior predictive distribution that is a $t$ distribution, with mean $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x^*$ and scale $se_r\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n - 1)s_x^2}}$. Here $se_r$ is the residual standard error of the regression, which can be found easily in R, and $s_x^2$ is the sample variance of $x$. Recall that the predictive distribution for a new observation has more variability than the posterior distribution for the mean response $\hat{y}$, because individual observations are more variable than the mean.
+
+## Linear Regression
+
+### Single Variable Regression
+
+We'll be looking at the Challenger dataset. It contains 23 past launches, recording for each the temperature on the day of launch and the O-ring damage index.
+
+http://www.randomservices.org/random/data/Challenger2.txt
+Read in the data
+
+```R
+oring=read.table("http://www.randomservices.org/random/data/Challenger2.txt",
+                 header=TRUE)
+# Note that attaching this masks T (the shorthand for TRUE) with the
+# temperature column, so spell out TRUE in any code that runs afterwards
+attach(oring)
+```
+
+Now we'll plot the data
+
+```R
+plot(T, I)
+```
+
+![Challenger](files/courses/BayesianStatistics/Challenger.png)
+
+Fit a linear model
+
+```R
+oring.lm=lm(I~T)
+summary(oring.lm)
+```
+
+Output of the summary
+
+```
+Call:
+lm(formula = I ~ T)
+
+Residuals:
+    Min      1Q  Median      3Q     Max 
+-2.3025 -1.4507 -0.4928  0.7397  5.5337 
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)
+(Intercept) 18.36508    4.43859   4.138 0.000468
+T           -0.24337    0.06349  -3.833 0.000968
+               
+(Intercept) ***
+T           ***
+---
+Signif. codes:  
+0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+
+Residual standard error: 2.102 on 21 degrees of freedom
+Multiple R-squared:  0.4116,	Adjusted R-squared:  0.3836 
+F-statistic: 14.69 on 1 and 21 DF,  p-value: 0.0009677
+```
+
+Add the fitted line into the scatterplot
+
+```R
+lines(T,fitted(oring.lm))     
+```
+
+![challengerfitted](files/courses/BayesianStatistics/challengerfitted.png)
+
+Create a 95% posterior interval for the slope
+
+```R
+-0.24337 - 0.06349*qt(.975,21)
+# [1] -0.3754047
+-0.24337 + 0.06349*qt(.975,21)
+# [1] -0.1113353
+```
+
+**Note:** These are the same as the frequentist confidence intervals
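+
+One way to see the agreement is to compare the manual calculation with `confint`. The sketch below uses simulated data rather than the Challenger data so that it runs without a network connection; the names `sim_x`, `sim_y`, and `sim_fit` and all simulated values are made up for illustration.
+
+```R
+# Simulated stand-in for a 23-launch dataset (hypothetical values)
+set.seed(2020)
+sim_x <- runif(23, min = 50, max = 85)
+sim_y <- 18 - 0.24 * sim_x + rnorm(23, sd = 2)
+sim_fit <- lm(sim_y ~ sim_x)
+
+# Slope estimate and standard error taken from the summary table
+est <- coef(summary(sim_fit))["sim_x", "Estimate"]
+se <- coef(summary(sim_fit))["sim_x", "Std. Error"]
+
+# Manual 95% interval using the t distribution with n - 2 degrees of freedom
+manual_interval <- est + c(-1, 1) * se * qt(0.975, df.residual(sim_fit))
+
+# Built-in frequentist confidence interval for the same coefficient
+builtin_interval <- confint(sim_fit, "sim_x", level = 0.95)
+manual_interval
+builtin_interval
+```
+
+Pulling the estimate and standard error from the fitted object also avoids the small rounding error that comes from retyping them from the printed summary.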
+
+If the Challenger launch was at 31 degrees Fahrenheit, how much O-ring damage would we predict?
+
+```R
+coef(oring.lm)[1] + coef(oring.lm)[2]*31  
+# [1] 10.82052 
+```
+
+Let's make our posterior prediction interval
+
+```R
+predict(oring.lm,data.frame(T=31),interval="predict")
+```
+
+Output of `predict`
+
+```
+       fit      lwr      upr
+1 10.82052 4.048269 17.59276
+```
+
+We can calculate the lower bound through the following formula
+
+```R
+10.82052-2.102*qt(.975,21)*sqrt(1+1/23+((31-mean(T))^2/22/var(T)))
+```
+
+What's the posterior probability that the damage index is greater than zero?
+
+```R
+1-pt((0-10.82052)/(2.102*sqrt(1+1/23+((31-mean(T))^2/22/var(T)))),21)
+```
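+
+The same two calculations can be written without retyping numbers from the output. This is a sketch on simulated data (the `sim_` names and values are hypothetical, chosen to resemble the setup above), showing that the manual scale formula reproduces the interval from `predict`.
+
+```R
+# Simulated stand-in data (hypothetical values)
+set.seed(7)
+sim_x <- runif(23, min = 50, max = 85)
+sim_y <- 18 - 0.24 * sim_x + rnorm(23, sd = 2)
+sim_fit <- lm(sim_y ~ sim_x)
+
+x_new <- 31
+n <- length(sim_x)
+se_r <- summary(sim_fit)$sigma   # residual standard error
+
+# Predictive mean and scale of the posterior predictive t distribution
+pred_mean <- unname(coef(sim_fit)[1] + coef(sim_fit)[2] * x_new)
+pred_scale <- se_r * sqrt(1 + 1 / n +
+                          (x_new - mean(sim_x))^2 / ((n - 1) * var(sim_x)))
+
+# Manual 95% prediction interval versus predict()
+manual <- pred_mean + c(-1, 1) * qt(0.975, n - 2) * pred_scale
+builtin <- predict(sim_fit, data.frame(sim_x = x_new), interval = "prediction")
+
+# Posterior predictive probability that the new response exceeds zero
+p_positive <- 1 - pt((0 - pred_mean) / pred_scale, df = n - 2)
+manual
+builtin
+p_positive
+```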
+
+### Multivariate Regression
+
+We're looking at Galton's seminal data predicting the height of children from the height of the parents
+
+http://www.randomservices.org/random/data/Galton.txt
+Read in the data
+
+```R
+# TRUE is spelled out because T is masked if the oring data is still attached
+heights=read.table("http://www.randomservices.org/random/data/Galton.txt",
+                   header=TRUE)
+attach(heights)
+```
+
+What are the columns in the dataset?
+
+```R
+names(heights)
+# [1] "Family" "Father" "Mother" "Gender" "Height" "Kids"  
+```
+
+Let's look at the relationship between the different variables
+
+```R
+pairs(heights)
+```
+
+![heightpairs](files/courses/BayesianStatistics/heightpairs.png)
+
+First let's start by creating a linear model taking all of the columns into account
+
+```R
+summary(lm(Height~Father+Mother+Gender+Kids))
+```
+
+Output of `summary`
+
+```
+Call:
+lm(formula = Height ~ Father + Mother + Gender + Kids)
+
+Residuals:
+    Min      1Q  Median      3Q     Max 
+-9.4748 -1.4500  0.0889  1.4716  9.1656 
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)
+(Intercept) 16.18771    2.79387   5.794 9.52e-09
+Father       0.39831    0.02957  13.472  < 2e-16
+Mother       0.32096    0.03126  10.269  < 2e-16
+GenderM      5.20995    0.14422  36.125  < 2e-16
+Kids        -0.04382    0.02718  -1.612    0.107
+               
+(Intercept) ***
+Father      ***
+Mother      ***
+GenderM     ***
+Kids           
+---
+Signif. codes:  
+0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+
+Residual standard error: 2.152 on 893 degrees of freedom
+Multiple R-squared:  0.6407,	Adjusted R-squared:  0.6391 
+F-statistic: 398.1 on 4 and 893 DF,  p-value: < 2.2e-16
+```
+
+As you can see here, the `Kids` column is not significant. Let's look at a model with it removed.
+
+```R
+summary(lm(Height~Father+Mother+Gender))
+```
+
+Output of `summary`
+
+```
+Call:
+lm(formula = Height ~ Father + Mother + Gender)
+
+Residuals:
+   Min     1Q Median     3Q    Max 
+-9.523 -1.440  0.117  1.473  9.114 
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)
+(Intercept) 15.34476    2.74696   5.586 3.08e-08
+Father       0.40598    0.02921  13.900  < 2e-16
+Mother       0.32150    0.03128  10.277  < 2e-16
+GenderM      5.22595    0.14401  36.289  < 2e-16
+               
+(Intercept) ***
+Father      ***
+Mother      ***
+GenderM     ***
+---
+Signif. codes:  
+0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+
+Residual standard error: 2.154 on 894 degrees of freedom
+Multiple R-squared:  0.6397,	Adjusted R-squared:  0.6385 
+F-statistic:   529 on 3 and 894 DF,  p-value: < 2.2e-16
+```
+
+This model looks good, so let's go ahead and save it to a variable
+
+```R
+heights.lm=lm(Height~Father+Mother+Gender)
+```
+
+From this we can tell that each extra inch of height in a father is associated with an extra 0.4 inches in the height of the child, holding the other variables fixed.
+
+We can also tell that each extra inch of height in a mother is associated with an extra 0.3 inches in the height of the child.
+
+A male child is on average 5.2 inches taller than a female child whose parents have the same heights.
+
+Let's create a 95% posterior interval for the difference in height by gender
+
+```R
+5.226 - 0.144*qt(.975,894)
+# [1] 4.943383 
+5.226 + 0.144*qt(.975,894)
+# [1] 5.508617
+```
+
+Let's make a posterior prediction interval for a male and a female child with a father who is 68 inches tall and a mother who is 64 inches tall.
+
+```R
+predict(heights.lm,data.frame(Father=68,Mother=64,Gender="M"),
+        interval="predict")
+
+#       fit      lwr     upr
+# 1 68.75291 64.51971 72.9861
+```
+
+```R
+predict(heights.lm,data.frame(Father=68,Mother=64,Gender="F"),
+        interval="predict")
+
+#       fit      lwr      upr
+# 1 63.52695 59.29329 67.76062
+```
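+
+As a quick sanity check on these two predictions, the gap between the fitted values should equal the `GenderM` coefficient, since the parents' heights are identical in both cases. The numbers below are copied from the two `predict` outputs above.
+
+```R
+# Fitted values copied from the predict() outputs for the male and female child
+fit_male <- 68.75291
+fit_female <- 63.52695
+
+# Difference between the two predictions (about 5.226, the GenderM estimate)
+fit_gap <- fit_male - fit_female
+fit_gap
+```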
+
+
+
+## What's next?
+
+This concludes this course. If you want to go further with Bayesian statistics, the next topics to explore would be hierarchical modeling and fitting non-conjugate models with Markov chain Monte Carlo (MCMC).