Website snapshot

Brandon Rozek 2020-01-15 21:51:49 -05:00
parent ee0ab66d73
commit 50ec3688a5
281 changed files with 21066 additions and 0 deletions

# Abstract Algebra 2 Definitions
## Chapter 17
By a **ring** we mean a set $A$ with operations called addition and multiplication which satisfy the following axioms:
- $A$ with addition alone is an abelian group
- Multiplication is associative
- Multiplication is distributive over addition
Since $\langle A, + \rangle$ is an abelian group, there is in $A$ a neutral element for addition. This is called the **zero** element. Also, every element has an additive inverse called its **negative**.
If multiplication in a ring $A$ is commutative then we say that $A$ is a **commutative ring**.
If a ring $A$ has a neutral element for multiplication then we say that the neutral element is the **unity** of $A$.
If $A$ is a ring with unity, there may be elements in $A$ which have a multiplicative inverse. Such elements are said to be **invertible**.
If $A$ is a commutative ring with unity in which every nonzero element is invertible, then $A$ is called a **field**.
In any ring, a nonzero element $a$ is called a **divisor of zero** if there is a nonzero element $b$ in the ring such that the product $ab$ or $ba$ is equal to zero.
An **integral domain** is defined to be a commutative ring with unity that has the cancellation property. Another way of saying this is that an integral domain is a commutative ring with unity that has no zero divisors.
## Chapter 18
Let $A$ be a ring, and $B$ be a nonempty subset of $A$. $B$ is called a **subring** if it satisfies the following properties:
- Closed with respect to addition
- Closed with respect to negatives
- Closed with respect to multiplication
Let $B$ be a subring of $A$. We call $B$ an **ideal** of $A$ if $xb, bx \in B$ for all $b \in B$ and $x \in A$.
The **principal ideal generated by $a$**, denoted $\langle a \rangle$, is the ideal obtained by fixing an element $a$ of $A$ and multiplying every element of $A$ by $a$:
$$
\langle a \rangle = \{ xa : x \in A \}
$$
A **homomorphism** from a ring $A$ to a ring $B$ is a function $f : A \to B$ satisfying the identities:
$$
f(x_1 + x_2 ) = f(x_1) + f(x_2) \\
f(x_1x_2) = f(x_1)f(x_2)
$$
If there is a homomorphism from $A$ onto $B$, we call $B$ a **homomorphic image** of $A$.
If $f$ is a homomorphism from a ring $A$ to a ring $B$, the **kernel** of $f$ is the set of all the elements of $A$ which are carried by $f$ onto the zero element of $B$. In symbols, the kernel of $f$ is the set
$$
K = \{x \in A: f(x) = 0\}
$$
If $A$ and $B$ are rings, an **isomorphism** from $A$ to $B$ is a homomorphism which is a one-to-one correspondence from $A$ to $B$. In other words, it is an injective and surjective (bijective) homomorphism.
If there is an isomorphism from $A$ to $B$ we say that $A$ is **isomorphic** to $B$, and this fact is expressed by writing $A \cong B$.
## Chapter 19
Let $A$ be a ring, and $J$ an ideal of $A$. For any element $a \in A$, the symbol $J + a$ denotes the set of all sums $j + a$ as $a$ remains fixed and $j$ ranges over $J$. That is,
$$
J + a = \{j + a : j \in J\}
$$
$J + a$ is called a **coset** of $J$ in $A$.
Now think of the set which contains all cosets of $J$ in $A$. This set is conventionally denoted by $A / J$ and is read "$A$ mod $J$". Then $A / J$ with coset addition and multiplication is a ring.
An ideal $J$ of a commutative ring is said to be a **prime ideal** if for any two elements $a$ and $b$ in the ring,
$$
ab \in J \implies a \in J \text{ or } b \in J
$$
<u>Theorem:</u> If $J$ is a prime ideal of a commutative ring with unity $A$, then the quotient ring $A / J$ is an integral domain.
An ideal $J$ of $A$ with $J \ne A$ is called a **maximal ideal** if there exists no proper ideal $K$ of $A$ such that $J \subseteq K$ with $J \ne K$.
<u>Theorem:</u> If $A$ is a commutative ring with unity, then $J$ is a maximal ideal of $A$ iff $A/J$ is a field.

---
title: Algorithms Book Study
---
# Algorithms Book Study
A couple of my friends and I decided to start a book club following "Algorithms" by Jeff Erickson. One bonus is that he gives it away for free on [his website](http://jeffe.cs.illinois.edu/teaching/algorithms/)!
Of course you should totally check his book out rather than reading my notes. There are tons of witty and fun things in his textbook; it's not a dry read, I promise. These notes are here mostly for archival purposes.
## Notes
[Chapter 1](recursion)
[Chapter 2](backtracking)
[Chapter 3](dynamic)
[Chapter 4](greedy)

# Backtracking
This algorithm tries to construct a solution to a problem one piece at a time. Whenever the algorithm needs to decide between multiple alternatives for the next piece of the solution, it *recursively* evaluates every option and chooses the best one.
## How to Win
To beat any *non-random perfect information* game you can define a Backtracking algorithm that only needs to know the following.
- A game state is good if either the current player has already won or if the current player can move to a bad state for the opposing player.
- A game state is bad if either the current player has already lost or if every available move leads to a good state for the opposing player.
```
PlayAnyGame(X, player)
    if player has already won in state X
        return GOOD
    if player has lost in state X
        return BAD
    for all legal moves X -> Y
        if PlayAnyGame(Y, other player) = BAD
            return GOOD
    return BAD
```
In practice, most problems have an enormous number of states, making it impossible to traverse the entire game tree.
## Subset Sum
For a given set, can you find a subset that sums to a certain value?
```
SubsetSum(X, T):
    if T = 0
        return True
    else if T < 0 or X is empty
        return False
    else
        x = any element of X
        with = SubsetSum(X \ {x}, T - x)
        without = SubsetSum(X \ {x}, T)
        return (with or without)
```
X \ {x} denotes set subtraction. It means X without x.
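As a concrete illustration, here's a minimal R translation of that recursion (the function and argument names are my own, not from the book):
```R
# A minimal R translation of the SubsetSum recursion
subset_sum <- function(X, target) {
  if (target == 0) return(TRUE)
  if (target < 0 || length(X) == 0) return(FALSE)
  x <- X[1]
  with_x    <- subset_sum(X[-1], target - x)  # include x in the subset
  without_x <- subset_sum(X[-1], target)      # leave x out
  with_x || without_x
}
subset_sum(c(8, 6, 7, 5, 3), 15)   # TRUE (8 + 7, for example)
```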
```
ConstructSubset(X, i, T):
    if T = 0
        return empty set
    if T < 0 or i = 0
        return None
    Y = ConstructSubset(X, i - 1, T)
    if Y does not equal None
        return Y
    Y = ConstructSubset(X, i - 1, T - X[i])
    if Y does not equal None
        return Y with X[i]
    return None
```
## Big Idea
Backtracking algorithms are used to make a *sequence of decisions*.
When we design a new recursive backtracking algorithm, we must figure out in advance what information we will need about past decisions in the middle of the algorithm.

# Dynamic Programming
The book first goes into talking about the complexity of the Fibonacci algorithm
```
RecFibo(n):
    if n = 0
        return 0
    else if n = 1
        return 1
    else
        return RecFibo(n - 1) + RecFibo(n - 2)
```
It talks about how the complexity of this is exponential.
"A single call to `RecFibo(n)` results in one recursive call to `RecFibo(n1)`, two recursive calls to `RecFibo(n2)`, three recursive calls to `RecFibo(n3)`, five recursive calls to `RecFibo(n4)`"
Now consider the memoized version of this algorithm...
```
MemFibo(n):
    if n = 0
        return 0
    else if n = 1
        return 1
    else
        if F[n] is undefined:
            F[n] <- MemFibo(n - 1) + MemFibo(n - 2)
        return F[n]
```
This actually makes the algorithm run in linear time!
Dynamic programming makes use of this fact and just intentionally fills up an array with the values of $F$.
```
IterFibo(n):
    F[0] <- 0
    F[1] <- 1
    for i <- 2 to n
        F[i] <- F[i - 1] + F[i - 2]
    return F[n]
```
Here the linear complexity becomes super apparent!
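For comparison, here's a direct R version of the iterative algorithm (my own translation, not from the book; R vectors are 1-based, so `fib[i + 1]` stores $F(i)$):
```R
# Iterative (dynamic programming) Fibonacci in R
iter_fibo <- function(n) {
  if (n == 0) return(0)
  fib <- numeric(n + 1)
  fib[1] <- 0   # F(0)
  fib[2] <- 1   # F(1)
  if (n >= 2) for (i in 2:n) fib[i + 1] <- fib[i] + fib[i - 1]
  fib[n + 1]
}
iter_fibo(10)   # 55
```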
<u>Interesting snippet</u>
"We had a very interesting gentleman in Washington named Wilson. He was secretary of Defense, and he actually had a pathological fear and hatred of the word *research*. Im not using the term lightly; Im using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term *research* in his presence. You can imagine how he felt, then, about the term *mathematical*.... I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose?"
Dynamic programming is essentially smarter recursion. It's about not repeating the same work.
These algorithms are best developed in two distinct stages.
(1) Formulate the problem recursively.
(a) Specification: What problem are you trying to solve?
(b) Solution: How can the answer to the whole problem be expressed in terms of the answers to smaller instances of exactly the same problem?
(2) Find solutions to your recurrence from the bottom up
(a) Identify the subproblems
(b) Choose a memoization data structure
(c) Identify dependencies
(d) Find a good evaluation order
(e) Analyze space and running time
(f) Write down the algorithm
## Greedy Algorithms
If we're lucky, we can just make decisions directly instead of solving any recursive subproblems. The problem is that greedy algorithms almost never work.

# Greedy Algorithms
Greedy Algorithms are about making the best local choice and then blindly plowing ahead.
An example of this is documents on a tape. Let's say to read any given document on a tape, you first need to read all the documents that come before it.
What is the best way to have the documents laid out in order to decrease the expected read time?
Answer: Putting the smaller documents first.
The book then generalizes it further by adding access frequencies. The answer there is to sort the documents in order of decreasing ratio of access frequency to length.
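A quick R check of the unweighted case (shortest documents first), with made-up lengths and each document assumed equally likely to be requested:
```R
# Expected read time = average of the prefix sums of the lengths (made-up data)
L <- c(5, 2, 9, 1)
expected_cost <- function(lengths) mean(cumsum(lengths))
expected_cost(L)        # 11.25 in the original order
expected_cost(sort(L))  # 7.25 with the shortest documents first
```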
## Scheduling Classes
To maximize the number of classes one can take, one greedy algorithm is to always select the class that ends first.
That will produce one maximal conflict-free schedule.
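Here's a small R sketch of that earliest-ending-class rule, on made-up start/end times:
```R
# Greedy class scheduling: always take the class that ends first (made-up times)
start <- c(1, 2, 4, 6, 8)
end   <- c(3, 5, 7, 9, 10)
chosen <- c(); last_end <- -Inf
for (i in order(end)) {            # consider classes by increasing end time
  if (start[i] >= last_end) {      # no conflict with the last class chosen
    chosen <- c(chosen, i)
    last_end <- end[i]
  }
}
chosen                             # 1 3 5: one maximal conflict-free schedule
```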
## Structure of Correctness Proofs for Greedy Algorithms
- Assume that there is an optimal solution that is different from the greedy solution.
- Find the first "difference" between the two solutions.
- Argue that we can exchange the optimal choice for the greedy choice without making the solution worse.
## Hospital-Doctor Matching
Let's say you have a list of doctors that need internships and hospitals accepting interns. We need to make it so that every doctor has a job and every hospital has a doctor.
Assuming we have the same number of doctors and hospitals and that each hospital offers only one internship, how do we make an algorithm to ensure that no unstable matchings exist?
An unstable match is when
- doctor a is matched with some other hospital A even though she prefers hospital B, and
- hospital B is matched with some other doctor b even though it prefers doctor a.
The Gale-Shapley algorithm is a great greedy fit. It goes like this (a rough R sketch follows the steps):
1. An arbitrary unmatched hospital A offers its position to the best doctor a who has not already rejected it.
2. If a is unmatched, she tentatively accepts A's offer. If a already had a match but prefers A, she rejects her current match and tentatively accepts the new offer from A. Otherwise a rejects the new offer.
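Here's a rough R sketch of that proposal loop. The preference-matrix layout and the names `hosp_pref` and `doc_rank` are my own assumptions, not from the book:
```R
# Gale-Shapley with hospitals proposing.
# Assumptions: hosp_pref[h, ] lists doctors in hospital h's preference order;
# doc_rank[d, h] is doctor d's rank of hospital h (lower = better).
gale_shapley <- function(hosp_pref, doc_rank) {
  n <- nrow(hosp_pref)
  next_offer <- rep(1, n)            # index of the next doctor each hospital will offer to
  doc_match  <- rep(NA_integer_, n)  # doc_match[d] = hospital currently holding doctor d
  free <- rep(TRUE, n)               # which hospitals are currently unmatched
  while (any(free)) {
    h <- which(free)[1]                      # an arbitrary unmatched hospital
    d <- hosp_pref[h, next_offer[h]]         # best doctor h hasn't offered to yet
    next_offer[h] <- next_offer[h] + 1
    cur <- doc_match[d]
    if (is.na(cur)) {                        # doctor is unmatched: tentatively accept
      doc_match[d] <- h; free[h] <- FALSE
    } else if (doc_rank[d, h] < doc_rank[d, cur]) {  # doctor prefers the new hospital
      doc_match[d] <- h; free[h] <- FALSE; free[cur] <- TRUE
    }                                        # otherwise the doctor rejects the offer
  }
  doc_match                                  # doc_match[d] is doctor d's hospital
}
```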

# Recursion
## Reductions
Quote: *Reduction* is the single most common technique used in designing algorithms. Reduce one problem $X$ to another problem $Y$.
The running time of the algorithm can be affected by $Y$, but $Y$ does not affect the correctness of the algorithm. So it is often useful to not be concerned with the inner workings of $Y$.
## Simplify and Delegate
Quote: *Recursion* is a particularly powerful kind of reduction, which can be described loosely as follows:
- If the given instance of the problem can be solved directly, then do so.
- Otherwise, reduce it to one or more **simpler instances of the same problem.**
The book likes to call the delegation of simpler tasks "giving it to the recursion fairy."
Your own task as an algorithm writer is to simplify the original problem and solve base cases. The recursion fairy will handle the rest.
Tying this to mathematics, this is known as the **Induction Hypothesis**.
The only caveat is that simplifying the tasks must eventually lead to the **base case**; otherwise, the algorithm might run infinitely!
#### Example: Tower of Hanoi
Assuming you know how the game works, we will describe how to solve it.
Quote: We can't move it at all at the beginning, because all the other disks are in the way. So first we have to move those $n - 1$ smaller disks to the spare peg. Once that's done we can move the largest disk directly to its destination. Finally, to finish the puzzle, we have to move the $n - 1$ disks from the spare peg to the destination.
**That's it.**
Since the problem was reduced to a base case and a $(n - 1)$ problem, we're good. The book has a funny quote "Our job is finished. If we didn't trust the junior monks, we wouldn't have hired them; let them do their job in peace."
```
Hanoi(n, src, dst, tmp):
    if n > 0
        Hanoi(n - 1, src, tmp, dst)
        move disk n from src to dst
        Hanoi(n - 1, tmp, dst, src)
```
## Sorting Algorithms
## Merge Sort
```
MergeSort(A[1..n]):
    if n > 1
        m = floor(n / 2)
        MergeSort(A[1..m])
        MergeSort(A[m + 1..n])
        Merge(A[1..n], m)
```
```
Merge(A[1..n], m):
    i = 1
    j = m + 1
    for k = 1 to n
        if j > n
            B[k] = A[i]
            i++
        else if i > m
            B[k] = A[j]
            j++
        else if A[i] < A[j]
            B[k] = A[i]
            i++
        else
            B[k] = A[j]
            j++
    Copy B to A
```
I think an important part to recall here is that the algorithm breaks the array down to the lowest level, an array with one element, and then slowly works its way back up.
That means we can always assume that each subarray we're merging is already sorted, which is why the merge algorithm is written the way it is.
## The Pattern
This section is quoted verbatim.
1. **Divide** the given instance of the problem into several *independent smaller* instances of *exactly* the same problem.
2. **Delegate** each smaller instance to the Recursion Fairy.
3. **Combine** the solutions for the smaller instances into the final solution for the given instance.

---
title: Bayesian Statistics
showthedate: false
---
# Bayesian Statistics: From Concept to Data Analysis
In the Winter of 2017, I took a course on Bayesian Statistics on Coursera offered by Dr. Herbert Lee.
Below are the notes for each of the four weeks.
[Week 1](week1)
[Week 2](week2)
[Week 3](week3)
[Week 4](week4)

# Bayesian Statistics
## Rules of Probability
Probabilities must be between zero and one, i.e., $0 \le P(A) \le 1$ for any event $A$.
Probabilities add to one, i.e., $\sum{P(X_i)} = 1$
The complement of an event, $A^c$, denotes that the event did not happen. Since probabilities must add to one, $P(A^c) = 1 - P(A)$
If A and B are two events, the probability that A or B happens (this is an inclusive or) is the probability of the union of the events:
$$
P(A \cup B) = P(A) + P(B) - P(A\cap B)
$$
where $\cup$ represents union ("or") and $\cap$ represents intersection ("and"). If a set of events $A_i$ are mutually exclusive (only one event may happen), then
$$
P(\cup_{i=1}^n{A_i}) = \sum_{i=1}^n{P(A_i)}
$$
## Odds
The odds for event A, denoted $\mathcal{O}(A)$, is defined as $\mathcal{O}(A) = P(A)/P(A^c)$
This is the probability for the event divided by the probability against the event.
From odds, we can also recover the probability:
$$
\frac{P(A)}{P(A^c)} = \mathcal{O}(A)
$$
$$
\frac{P(A)}{1-P(A)} = \mathcal{O}(A)
$$
$$
\frac{1 -P(A)}{P(A)} = \frac{1}{\mathcal{O}(A)}
$$
$$
\frac{1}{P(A)} - 1 = \frac{1}{\mathcal{O}(A)}
$$
$$
\frac{1}{P(A)} = \frac{1}{\mathcal{O}(A)} + 1
$$
$$
\frac{1}{P(A)} = \frac{1 + \mathcal{O}(A)}{\mathcal{O}(A)}
$$
$$
P(A) = \frac{\mathcal{O}(A)}{1 + \mathcal{O}(A)}
$$
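A one-line sanity check of these conversions in R, with a made-up probability:
```R
p <- 0.8                 # made-up P(A)
odds <- p / (1 - p)      # O(A) = P(A) / P(A^c) = 4
odds / (1 + odds)        # recovers P(A) = 0.8
```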
## Expectation
The expected value of a random variable X is a weighted average of values X can take, with weights given by the probabilities of those values.
$$
E(X) = \sum_{i=1}^n{x_i * P(X=x_i)}
$$
## Frameworks of probability
Classical -- Outcomes that are equally likely have equal probabilities
Frequentist -- In an infinite sequence of events, what is the relative frequency
Bayesian -- Personal perspective (your own measure of uncertainty)
In betting, one must make sure that all the rules of probability are followed and that the events are "coherent"; otherwise, one might construct a series of bets where you're guaranteed to lose money. This is referred to as a Dutch book.
## Conditional probability
$$
P(A|B) = \frac{P(A\cap B)}{P(B)}
$$
Where $A|B$ denotes "A given B"
Example from lecture:
Suppose there are 30 students, 9 of which are female. From the 30 students, 12 are computer science majors. 4 of those 12 computer science majors are female
$$
P(Female) = \frac{9}{30} = \frac{3}{10}
$$
$$
P(CS) = \frac{12}{30} = \frac{2}{5}
$$
$$
P(F\cap CS) = \frac{4}{30} = \frac{2}{15}
$$
$$
P(F|CS) = \frac{P(F \cap CS)}{P(CS)} = \frac{2/15}{2/5} = \frac{1}{3}
$$
An intuitive way to think about a conditional probability is that we're looking at a subsegment of the original population, and asking a probability question within that segment
$$
P(F|CS^c) = \frac{P(F\cap CS^c)}{P(CS^c)} = \frac{5/30}{18/30} = \frac{5}{18}
$$
The concept of independence is when one event does not depend on another.
$$
P(A|B) = P(A)
$$
It doesn't matter that B occurred.
If two events are independent then the following is true
$$
P(A\cap B) = P(A)P(B)
$$
This can be derived from the conditional probability equation.
## Conditional Probabilities in terms of other conditional
Suppose we don't know what $P(A|B)$ is but we do know what $P(B|A)$ is. We can then rewrite $P(A|B)$ in terms of $P(B|A)$
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)}
$$
Let's look at an example of an early test for HIV antibodies known as the ELISA test.
$$
P(+ | HIV) = 0.977
$$
$$
P(- | NO\_HIV) = 0.926
$$
As you can see, this test was accurate over 90% of the time.
The probability of someone in North America having this disease was $P(HIV) = .0026$
Now let's consider the following problem: the probability of having the disease given that they tested positive $P(HIV | +)$
$$
P(HIV|+) = \frac{P(+|HIV)P(HIV)}{P(+|HIV)P(HIV) + P(+|NO\_HIV){P(NO\_HIV)}}
$$
$$
P(HIV|+) = \frac{(.977)(.0026)}{(.977)(.0026) + (1-.926)(1-.0026)}
$$
$$
P(HIV|+) = 0.033
$$
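The same arithmetic is easy to reproduce in R (note that $P(+|\text{no HIV}) = 1 - 0.926$):
```R
# Bayes' theorem for the ELISA example
p_pos_hiv <- 0.977                 # P(+ | HIV)
p_neg_no  <- 0.926                 # P(- | no HIV)
p_hiv     <- 0.0026                # prevalence P(HIV)
p_pos_no  <- 1 - p_neg_no          # P(+ | no HIV)
(p_pos_hiv * p_hiv) / (p_pos_hiv * p_hiv + p_pos_no * (1 - p_hiv))  # about 0.033
```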
This example looked at Bayes' theorem for the two-event case. We can generalize it to $n$ events through the following formula:
$$
P(A_1|B) = \frac{P(B|A_1)P(A_1)}{\sum_{i=1}^{n}{P(B|A_i)P(A_i)}}
$$
## Bernoulli Distribution
~ means 'is distributed as'
We'll be first studying the Bernoulli Distribution. This is when your event has two outcomes, which is commonly referred to as a success outcome and a failure outcome. The probability of success is $p$ which means the probability of failure is $(1-p)$
$$
X \sim B(p)
$$
$$
P(X = 1) = p
$$
$$
P(X = 0) = 1-p
$$
The probability of a random variable $X$ taking some value $x$ given $p$ is
$$
f(X = x | p) = f(x|p) = p^x(1-p)^{1 - x}I_{\{x \in \{0,1\}\}}
$$
Where $I_{\{x \in \{0,1\}\}}$ is the indicator function restricting $x$ to the values $0$ and $1$.
Recall the expected value
$$
E(X) = \sum_{x_i}{x_iP(X=x_i)} = (1)p + (0)(1-p) = p
$$
We can also define the variance of Bernoulli
$$
Var(X) = p(1-p)
$$
## Binomial Distribution
The binomial distribution is the distribution of the sum of $n$ *independent* Bernoulli trials.
$$
X \sim Bin(n, p)
$$
$$
P(X=x|p) = f(x|p) = {n \choose x} p^x (1-p)^{n-x}
$$
${n \choose x}$ is the combinatorial term, defined as
$$
{n \choose x} = \frac{n!}{x! (n - x)!}
$$
$$
E(X) = np
$$
$$
Var(X) = np(1-p)
$$
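R has these quantities built in; for example, with a made-up $n = 10$ and $p = 0.3$:
```R
n <- 10; p <- 0.3
dbinom(3, size = n, prob = p)  # P(X = 3) = choose(10, 3) * 0.3^3 * 0.7^7
n * p                          # E(X)   = 3
n * p * (1 - p)                # Var(X) = 2.1
```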
## Uniform distribution
Let's say X is uniformly distributed
$$
X \sim U[0,1]
$$
$$
f(x) = \left\{
\begin{array}{lr}
1 & : x \in [0,1]\\
0 & : otherwise
\end{array}
\right.
$$
$$
P(0 < x < \frac{1}{2}) = \int_0^\frac{1}{2}{f(x)dx} = \int_0^\frac{1}{2}{dx} = \frac{1}{2}
$$
$$
P(0 \leq x \leq \frac{1}{2}) = \int_0^\frac{1}{2}{f(x)dx} = \int_0^\frac{1}{2}{dx} = \frac{1}{2}
$$
$$
P(x = \frac{1}{2}) = 0
$$
## Rules of probability density functions
$$
\int_{-\infty}^\infty{f(x)dx} = 1
$$
$$
f(x) \ge 0
$$
$$
E(X) = \int_{-\infty}^\infty{xf(x)dx}
$$
$$
E(g(X)) = \int{g(x)f(x)dx}
$$
$$
E(aX) = aE(X)
$$
$$
E(X + Y) = E(X) + E(Y)
$$
If X & Y are independent
$$
E(XY) = E(X)E(Y)
$$
## Exponential Distribution
$$
X \sim Exp(\lambda)
$$
Where $\lambda$ is the rate parameter; the average waiting time between observations is $1/\lambda$.
$$
f(x|\lambda) = \lambda e^{-\lambda x}
$$
$$
E(X) = \frac{1}{\lambda}
$$
$$
Var(X) = \frac{1}{\lambda^2}
$$
## Uniform (Continuous) Distribution
$$
X \sim U[\theta_1, \theta_2]
$$
$$
f(x|\theta_1,\theta_2) = \frac{1}{\theta_2 - \theta_1}I_{\theta_1 \le x \le \theta_2}
$$
## Normal Distribution
$$
X \sim N(\mu, \sigma^2)
$$
$$
f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}
$$
$$
E(X) = \mu
$$
$$
Var(X) = \sigma^2
$$
## Variance
Variance is the expected squared distance from the mean.
$$
Var(X) = \int_{-\infty}^\infty {(x - \mu)^2f(x)dx}
$$
## Geometric Distribution (Discrete)
The geometric distribution is the number of trials needed to get the first success, i.e., the number of Bernoulli events until a success is observed.
$$
X \sim Geo(p)
$$
$$
P(X = x|p) = p(1-p)^{x-1}
$$
$$
E(X) = \frac{1}{p}
$$
## Multinomial Distribution (Discrete)
Multinomial is like a binomial when there are more than two possible outcomes.
$$
f(x_1,...,x_k|p_1,...,p_k) = \frac{n!}{x_1! ... x_k!}p_1^{x_1}...p_k^{x_k}
$$
## Poisson Distribution (Discrete)
The Poisson distribution is used for counts. The parameter $\lambda > 0$ is the rate at which we expect to observe the thing we are counting.
$$
X \sim Pois(\lambda)
$$
$$
P(X=x|\lambda) = \frac{\lambda^xe^{-\lambda}}{x!}
$$
$$
E(X) = \lambda
$$
$$
Var(X) = \lambda
$$
## Gamma Distribution (Continuous)
If $X_1, X_2, ..., X_n$ are independent and identically distributed Exponential random variables (the waiting times between successive events), then the total waiting time for all $n$ events to occur will follow a gamma distribution with shape parameter $\alpha = n$ and rate parameter $\beta = \lambda$.
$$
Y \sim Gamma(\alpha, \beta)
$$
$$
f(y|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}y^{\alpha-1}e^{-\beta y}I_{y\ge0}(y)
$$
$$
E(Y) = \frac{\alpha}{\beta}
$$
$$
Var(Y) = \frac{\alpha}{\beta^2}
$$
Where $\Gamma(x)$ is the gamma function. The exponential distribution is a special case of the gamma distribution with $\alpha = 1$. As $\alpha$ increases, the gamma distribution more closely resembles the normal distribution.
## Beta Distribution (Continuous)
The beta distribution is used for random variables which take on values between 0 and 1. For this reason, the beta distribution is commonly used to model probabilities.
$$
X \sim Beta(\alpha, \beta)
$$
$$
f(x|\alpha,\beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha -1}(1 - x)^{\beta - 1}I_{\{0 < x < 1\}}
$$
$$
E(X) = \frac{\alpha}{\alpha + \beta}
$$
$$
Var(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha+\beta+1)}
$$
The standard uniform distribution is a special case of the beta distribution with $\alpha = \beta = 1$
## Bayes Theorem for continuous distribution
$$
f(\theta|y) = \frac{f(y|\theta)f(\theta)}{\int{f(y|\theta)f(\theta)d\theta}}
$$

Under the frequentist paradigm, you view the data as a random sample from some larger, potentially hypothetical population. We can then make probability statements, i.e., long-run frequency statements, based on this larger population.
## Coin Flip Example (Central Limit Theorem)
Let's suppose we flip a coin 100 times and we get 44 heads and 56 tails.
$$
n = 100
$$
We can view these 100 flips as a random sample from a much larger infinite hypothetical population of flips from this coin.
Let's say that each flip follows a Bernoulli distribution with some probability $p$. In this case $p$ is unknown, but we're assuming it's fixed because we have a particular physical coin.
We can ask what's our best estimate of the probability of getting a head, or an estimate of $p$. We can also ask about how confident we are about that estimate.
Let's start by applying the Central Limit Theorem. The Central Limit Theorem states that the sum of 100 flips will follow approximately a Normal distribution with mean 100p and variance 100p(1-p)
$$
\sum^n_{i=1}{x_i} \sim N(np, np(1-p))
$$
$$
\sum_{i = 1}^{100}{x_i} \sim N(100p, 100p(1-p))
$$
By the properties of a Normal distribution, 95% of the time we'll get a result within 1.96 standard deviations of the mean. Our estimate is $100\hat{p}$ and our error is 1.96 times the standard deviation.
$$
n\hat{p} \pm 1.96\sqrt{n\hat{p}(1-\hat{p})}
$$
$$
100\hat{p} \pm 1.96\sqrt{100\hat{p}(1-\hat{p})}
$$
This is referred to as a Confidence Interval. Confidence Intervals are commonly abbreviated as CI. In our example $\hat{p} = \frac{44}{n} = \frac{44}{100}$. Therefore, the 95% Confidence Interval for the true number of heads after flipping a coin 100 times is:
$$
100(.44) \pm 1.96\sqrt{100(.44)(1-.44)}
$$
$$
44 \pm 1.96\sqrt{44(.56)}
$$
$$
44\pm 1.96\sqrt{24.64}
$$
$$
(34.27, 53.73)
$$
We can divide this by 100 to get the 95% Confidence Interval for $p$
$$
(0.34, 0.53)
$$
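Reproducing that interval in R:
```R
n <- 100; heads <- 44
p_hat <- heads / n
half_width <- 1.96 * sqrt(n * p_hat * (1 - p_hat))
n * p_hat + c(-1, 1) * half_width   # about (34.27, 53.73); divide by 100 for the interval for p
```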
Let's step back and ask, what does it mean when I say we're 95% confident?
Under the frequentist paradigm, what this means is we have to think back to our infinite hypothetical sequence of events. If we were to repeat this trial an infinite number of times (or an arbitrarily large number of times), each time creating a confidence interval based on the data we observe, then on average 95% of the intervals we make will contain the true value of $p$.
On the other hand, we might want to know something about this particular interval. Does this interval contain the true $p$? What's the probability that this interval contains the true $p$? Well, we don't know for this particular interval. But under the frequentist paradigm, we're assuming that there is a fixed answer for $p$. Either $p$ is in that interval or it's not in that interval. The probability that $p$ is in that interval is either 0 or 1.
## Example: Heart Attack Patients (Maximum Likelihood)
Consider a hospital where 400 patients are admitted over a month for heart attacks, and a month later 72 of them have died and 328 of them have survived.
We can ask, what's our estimate of the mortality rate?
Under the frequentist paradigm, we must first establish our reference population. What do we think our reference population is here? One possibility is that we could think about heart attack patients in the region.
Another reference population we can think about is heart attack patients that are admitted to this hospital, but over a longer period of time.
Both of these might be reasonable attempts, but in this case our actual data are not a random sample from either of those populations. We could sort of pretend they are and move on, or we could try to think harder about what a random sample situation might be. We can think about all the people in the region who might possibly have a heart attack and might possibly get admitted to this hospital.
It's a bit of an odd hypothetical situation, and so there are some philosophical issues with the setup of this whole problem under the frequentist paradigm. In any case, let's forge forward and think about how we might do some estimation.
Moving on, we can say each patient comes from a Bernoulli distribution with an unknown parameter $\theta$.
$$
Y_i \sim B(\theta)
$$
$$
P(Y_i = 1) = \theta
$$
In this case, let's call the "success" a mortality.
We can write the probability density function for the entire set of data in vector form: the probability that all the $Y_i$ take some values $y_i$, given a value of $\theta$.
$$
P(Y = y | \theta) = P(Y_1 = y_1, Y_2, = y_2,\dots, Y_n=y_n|\theta)
$$
*Since we're viewing these as independent events*, then the probability of each of these individual ones we can write in product notation.
$$
P(Y = y | \theta) = P(Y_1 = y_1|\theta)\dots P(Y_n = y_n | \theta)
$$
$$
P(Y = y | \theta) = \prod_{i = 1}^n{P(Y_i =y_i | \theta)} = \prod_{i = 1}^n{(\theta^{y_i}(1-\theta)^{1-y_i})}
$$
This is the probability of observing the actual data that we collected, conditioned on a value of the parameter $\theta$. We can now think about this expression as a function of $\theta$. This is the concept of a likelihood.
$$
L(\theta|y) = \prod_{i = 1}^n{(\theta^{y_i}(1-\theta)^{1-y_i})}
$$
It looks like the same function, but there it was a function of $y$ given $\theta$, and now we're thinking of it as a function of $\theta$ given $y$.
This is not a probability distribution anymore, but it is still a function of $\theta$.
One way to estimate $\theta$ is to choose the $\theta$ that gives us the largest value of the likelihood, i.e., the value that makes the data we observed most likely to occur.
This is referred to as the maximum likelihood estimate (MLE).
We're trying to find the $\theta$ that maximizes the likelihood.
In practice, it's usually easier to maximize the natural logarithm of the likelihood, commonly referred to as the log likelihood.
$$
\mathcal{L}(\theta) = \log{L(\theta|y)}
$$
Since the logarithm is a monotonic function, if we maximize the logarithm of the function, we also maximize the original function.
$$
\mathcal{L(\theta)} = \log{\prod_{i = 1}^n{(\theta^{y_i}(1-\theta)^{1-y_i})}}
$$
$$
\mathcal{L}(\theta) = \sum_{i = 1}^n{\log{(\theta^{y_i}(1-\theta)^{1-y_i})}}
$$
$$
\mathcal{L}(\theta) = \sum_{i = 1}^n{(\log{(\theta^{y_i}}) + \log{(1-\theta)^{1-y_i}})}
$$
$$
\mathcal{L}(\theta) = \sum_{i = 1}^n{(y_i\log{\theta} + (1 - y_i)\log{(1-\theta)})}
$$
$$
\mathcal{L}(\theta) = \log{\theta}\sum_{i = 1}^n{y_i} + \log{(1-\theta)}\sum_{i = 1}^n{(1-y_i)}
$$
How do we find the theta that maximizes this function? Recall from calculus that we can maximize a function by taking the derivative and setting it equal to 0.
$$
\mathcal{L}^\prime(\theta) = \frac{1}{\theta}\sum{y_i} - \frac{1}{1-\theta}\sum{(1 - y_i)}
$$
Now we need to set it equal to zero and solve for $\hat{\theta}$
$$
\frac{\sum{y_i}}{\hat{\theta}} = \frac{\sum{(1-y_i)}}{1-\hat{\theta}}
$$
$$
\hat{\theta}\sum{(1-y_i)} = (1-\hat{\theta})\sum{y_i}
$$
$$
\hat{\theta}\sum{(1-y_i)} + \hat{\theta}\sum{y_i} = \sum{y_i}
$$
$$
\hat{\theta}(\sum^n{(1 - y_i + y_i)}) = \sum{y_i}
$$
$$
\hat{\theta} = \frac{1}{n}\sum{y_i} = \hat{p} = \frac{72}{400} = 0.18
$$
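We can confirm this numerically in R by maximizing the log likelihood directly:
```R
# Numerical check of the MLE with R's optimize()
n <- 400; deaths <- 72
loglike <- function(theta) deaths * log(theta) + (n - deaths) * log(1 - theta)
optimize(loglike, interval = c(0.001, 0.999), maximum = TRUE)$maximum  # about 0.18
```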
Maximum likelihood estimators have many desirable mathematical properties. They're unbiased, they're consistent, and they're invariant.
In general, under certain regularity conditions, we can say that the MLE is approximately normally distributed with mean at the true value of $\theta$ and the variance $\frac{1}{I(\hat{\theta})}$ where $I(\hat{\theta})$ is the Fisher information at the value of $\hat{\theta}$. The Fisher information is a measure of how much information about $\theta$ is in each data point.
$$
\hat{\theta} \sim N(\theta, \frac{1}{I(\hat{\theta})})
$$
For a Bernoulli random variable, the Fisher information turns out to be
$$
I(\theta) = \frac{1}{\theta(1-\theta)}
$$
So the information is larger, when theta is near zero or near one, and it's the smallest when theta is near one half.
This makes sense, because if you're flipping a coin, and you're getting a mix of heads and tails, that tells you a little bit less than if you're getting nearly all heads or nearly all tails.
## Exponential Likelihood Example
Let's say $X_i$ are distributed so
$$
X_i \sim Exp(\lambda)
$$
Let's say the data is independent and identically distributed, therefore making the overall density function
$$
f(x|\lambda) = \prod_{i = 1}^n{\lambda e^{-\lambda x_i}}
$$
$$
f(x|\lambda) = \lambda^ne^{-\lambda \sum{x_i}}
$$
Now the likelihood function is
$$
L(\lambda|x) = \lambda^n e^{-\lambda \sum{x_i}}
$$
$$
\mathcal{L}(\lambda) = n\ln{\lambda} - \lambda\sum{x_i}
$$
Taking the derivative
$$
\mathcal{L}^\prime(\lambda) = \frac{n}{\lambda} - \sum{x_i}
$$
Setting this equal to zero
$$
\frac{n}{\hat{\lambda}} =\sum{x_i}
$$
$$
\hat{\lambda} = \frac{n}{\sum{x_i}} = \frac{1}{\bar{x}}
$$
## Uniform Distribution
$$
X_i \sim U[0, \theta]
$$
$$
f(x|\theta) = \prod_{i = 1}^n{\frac{1}{\theta}I_{0 \le x_i \le \theta}}
$$
Combining all the indicator functions: for the product to equal 1, each indicator has to equal 1. That happens exactly when all the observations are at least 0 (the minimum of the $x_i$ is greater than or equal to 0) and the maximum of the $x_i$ is less than or equal to $\theta$.
$$
L(\theta|x) = \theta^{-n} I_{0\le min(x_i) \le max(x_i) \le \theta}
$$
$$
L^\prime(\theta) = -n\theta^{-(n + 1)}I_{0 \le min(x_i) \le max(x_i)\le \theta}
$$
So now we can ask, can we set this equal to zero and solve for $\theta$? Well, it turns out this is not equal to zero for any positive value of $\theta$, and we need $\theta$ to be strictly larger than zero.
However, we can also note that for positive $\theta$, this derivative will always be negative. A negative derivative means this is a decreasing function, so the function will be maximized when we pick $\theta$ as small as possible. What's the smallest possible value of $\theta$ we can pick? Well, we need in particular for $\theta$ to be larger than all of the $x_i$. And so, the maximum likelihood estimate is the maximum of the $x_i$:
$$
\hat{\theta} = max(x_i)
$$
## Products of Indicator Functions
Because 0 * 1 = 0, the product of indicator functions can be combined into a single indicator function with a modified condition.
**Example**: $I_{x < 5} * I_{x \ge 0} = I_{0 \le x < 5}$
**Example**: $\prod_{i = 1}^n{I_{\{x_i < 2\}}} = I_{\{x_i < 2 \text{ for all } i\}} = I_{\{\max(x_i) < 2\}}$
## Introduction to R
R has some nice functions that one can use for analysis
`mean(z)` gives the mean of some row vector $z$
`var(z)` reports the variance of some row vector
`sqrt(var(z))` gives the standard deviation of some row vector
`seq(from=0.1, to = 0.9, by = 0.1)` creates a vector that starts from $0.1$ and goes to $0.9$ incrementing by $0.1$
```R
seq(from=0.1, to = 0.9, by = 0.1)
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
```
`names(x)` gives the names of all the columns in the dataset.
```R
names(trees)
[1] "Girth" "Height" "Volume"
```
`hist(x)` provides a histogram based on a vector
The more general `plot` function tries to guess at which type of plot to make. Feeding it two numerical vectors will make a scatter plot.
The R function `pairs` takes in a data frame and tries to make all possible Pairwise scatterplots for the dataset.
The `summary` command gives the five/six number summary (minimum, first quartile, median, mean, third quartile, maximum)
## Plotting the likelihood function in R
Going back to the hospital example
```R
## Likelihood function
likelihood = function(n, y, theta) {
  return(theta^y * (1 - theta)^(n - y))
}
theta = seq(from = 0.01, to = 0.99, by = 0.01)
plot(theta, likelihood(400, 72, theta))
```
You can also do this with log likelihoods. This is typically more numerically stable to compute
```R
loglike = function(n, y, theta) {
  return(y * log(theta) + (n - y) * log(1 - theta))
}
plot(theta, loglike(400, 72, theta))
```
Having these plotted as points makes it difficult to see, let's plot it as lines
```R
plot(theta, loglike(400, 72, theta), type = "l")
```
## Cumulative Distribution Function
The cumulative distribution function (CDF) exists for every distribution. We define it as $F(x) = P(X \le x)$ for random variable $X$.
If $X$ is discrete-valued, then the CDF is computed with summation $F(x) = \sum_{t = -\infty}^x {f(t)}$. where $f(t) = P(X = t)$ is the probability mass function (PMF) which we've already seen.
If $X$ is continuous, the CDF is computed with an integral $F(x) = \int_{-\infty}^x{f(t)dt}$
The CDF is convenient for calculating probabilities of intervals. Let $a$ and $b$ be any real numbers with $a < b$. Then the probability that $X$ falls between $a$ and $b$ is equal to $P(a < X < b) = P(X \le b) - P(X \le a) = F(b) - F(a)$
## Quantile Function
The CDF takes a value for a random variable and returns a probability. Suppose instead we start with a number between $0$ and $1$, which we call $p$, and we wish to find a value $x$ so that $P(X \le x) = p$. The value $x$ which satisfies this equation is called the $p$ quantile (or $100p$ percentile) of the distribution of $X$.
## Probability Distributions in R
Each of the distributions introduced in Lesson 3 has convenient functions in R which allow you to evaluate the PDF/PMF, CDF, and quantile functions, as well as generate random samples from the distribution. To illustrate, Table 1 lists these functions for the normal distribution.
| Function | What it does |
| -------------------- | ---------------------------------------- |
| `dnorm(x, mean, sd)` | Evaluate the PDF at $x$ (mean = $\mu$ and sd = $\sqrt{\sigma^2}$) |
| `pnorm(q, mean, sd)` | Evaluate the CDF at $q$ |
| `qnorm(p, mean, sd)` | Evaluate the quantile function at $p$ |
| `rnorm(n, mean, sd)` | Generate $n$ pseudo-random samples from the normal distribution |
These four functions exist for each distribution where `d...` function evaluates the density/mass, `p...` evaluates the CDF, `q...` evaluates the quantile, and `r...` generates a sample. Table 2 lists the `d...` functions for some of the most popular distributions. The `d` can be replaced with `p`, `q`, or `r` for any of the distributions, depending on what you want to calculate.
For details enter `?dnorm` to view R's documentation page for the Normal distribution. As usual, replace the `norm` with any distribution to read the documentation for that distribution.
| Distribution | Function | Parameters |
| ---------------------- | -------------------------- | ------------------------------------ |
| $Binomial(n,p)$ | `dbinom(x, size, prob)` | size = $n$, prob = $p$ |
| $Poisson(\lambda)$ | `dpois(x, lambda)` | lambda = $\lambda$ |
| $Exp(\lambda)$ | `dexp(x, rate)` | rate = $\lambda$ |
| $Gamma(\alpha, \beta)$ | `dgamma(x, shape, rate)` | shape = $\alpha$, rate = $\beta$ |
| $Uniform(a, b)$ | `dunif(x, min, max)` | min = $a$, max = $b$ |
| $Beta(\alpha, \beta)$ | `dbeta(x, shape1, shape2)` | shape1 = $\alpha$, shape2 = $\beta$ |
| $N(\mu, \sigma^2)$ | `dnorm(x, mean, sd)` | mean = $\mu$, sd = $\sqrt{\sigma^2}$ |
| $t_v$ | `dt(x, df)` | df = $v$ |
## Two Coin Example
Suppose your brother has a coin which you know to be loaded so that it comes up heads 70% of the time. He then comes to you with some coin (you're not sure which one) and he wants to make a bet with you, betting money that it's going to come up heads.
You're not sure if it's the loaded coin or if it's just a fair one. So he gives you a chance to flip it 5 times to check it out.
You flip it five times and get 2 heads and 3 tails. Which coin do you think it is and how sure are you about that?
We'll start by defining the unknown parameter $\theta$, this is either that the coin is fair or it's a loaded coin.
$$
\theta = \{fair ,loaded\}
$$
$$
X \sim Bin(5, ?)
$$
$$
f(x|\theta) = \begin{cases}
{5 \choose x}(\frac{1}{2})^5 & \theta = fair \\
{5 \choose x} (.7)^x (.3)^{5 - x} & \theta = loaded\\
\end{cases}
$$
We can also rewrite $f(x|\theta)$ with indicator functions
$$
f(x|\theta) = {5\choose x}(.5)^5I_{\{\theta= fair\}} + {5 \choose x}(.7)^x(.3)^{5 - x}I_{\{\theta = loaded\}}
$$
In this case, we observed that $x = 2$
$$
f(x = 2 | \theta) = \begin{cases}
0.3125 & \theta = fair \\
0.1323 & \theta = loaded
\end{cases}
$$
MLE $\hat{\theta} = fair$
That's a good point estimate, but then how do we answer the question, how sure are you?
This is not a question that's easily answered under the frequentist paradigm. Another question we might like to ask is: what is the probability that $\theta$ equals fair, given that we observed two heads?
$$
P(\theta = fair|x = 2) = ?
$$
In the frequentist paradigm, the coin is a physical quantity. It's a fixed coin, and therefore it has a fixed probability of coming up heads. It is either the fair coin, or it's the loaded coin.
$$
P(\theta = fair) \in \{0,1\}
$$
### Bayesian Approach to the Problem
An advantage of the Bayesian approach is that it allows you to easily incorporate prior information, when you know something in advance of looking at the data. This is difficult to do under the frequentist paradigm.
In this case, we're talking about your brother. You probably know him pretty well. So suppose you think that before you've looked at the coin, there's a 60% probability that this is the loaded coin.
In this case, we put this into our prior. Our prior is that the probability the coin is loaded is 0.6. We can update our prior with the data to get our posterior beliefs, and we can do this using Bayes' theorem.
Prior: $P(loaded) = 0.6$
$$
f(\theta|x) = \frac{f(x|\theta)f(\theta)}{\sum_\theta{f(x|\theta)f(\theta)}}
$$
$$
f(\theta|x) = \frac{{5\choose x} [(\frac{1}{2})^5(.4)I_{\{\theta = fair\}} + (.7)^x (.3)^{5-x}(.6)I_{\{\theta = loaded\}} ] }
{{5\choose x} [(\frac{1}{2})^5(.4) + (.7)^x (.3)^{5-x}(0.6) ] }
$$
$$
f(\theta|x=2)= \frac{0.0125I_{\{\theta=fair\}} + 0.0079I_{\{\theta=loaded\}} }{0.0125+0.0079}
$$
$$
f(\theta|x=2) = 0.612I_{\{\theta=fair\}} + 0.388I_{\{\theta = loaded\}}
$$
As you can see in the calculation here, we have the likelihood times the prior in the numerator, and in the denominator we have a normalizing constant, so that when we divide by it, we get answers that add up to one. The numbers work out exactly in this case because it's a very simple problem, but the concept carries over: what's in the denominator is always a normalizing constant.
$$
P(\theta = loaded | x = 2) = 0.388
$$
This here updates our beliefs after seeing some data about what the probability might be.
We can also examine what would happen under different choices of prior.
$$
P(\theta = loaded) = \frac{1}{2} \implies P(\theta = loaded | x = 2) = 0.297
$$
$$
P(\theta = loaded) = 0.9 \implies P(\theta = loaded | x = 2) = 0.792
$$
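These posterior probabilities are easy to reproduce in R as a function of the prior:
```R
# Posterior probability the coin is loaded, given 2 heads in 5 flips
posterior_loaded <- function(prior_loaded, x = 2, n = 5) {
  like_fair   <- dbinom(x, n, 0.5)
  like_loaded <- dbinom(x, n, 0.7)
  like_loaded * prior_loaded /
    (like_loaded * prior_loaded + like_fair * (1 - prior_loaded))
}
posterior_loaded(0.6)   # about 0.388
posterior_loaded(0.5)   # about 0.297
posterior_loaded(0.9)   # about 0.792
```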
In this case, the Bayesian approach is inherently subjective. It represents your own personal perspective, and this is an important part of the paradigm. If you have a different perspective, you will get different answers, and that's okay. It's all done in a mathematically rigorous framework, and it's all mathematically consistent and coherent.
And in the end, we get results that are interpretable.
## Continuous Bayes
$$
f(\theta | y) = \frac{f(y | \theta)f(\theta)}{f(y)} = \frac{f(y|\theta)f(\theta)}{\int{f(y|\theta)f(\theta)d\theta}} = \frac{likelihood * prior}{normalization} \propto likelihood * prior
$$
In practice, sometimes this integral can be a pain to compute. And so, we may instead work with the fact that the posterior is proportional to the likelihood times the prior. If we can figure out what this looks like and just put the appropriate normalizing constant on at the end, we don't necessarily have to compute the integral.
So for example, suppose we're looking at a coin and it has unknown probability $\theta$ of coming up heads. Suppose we express ignorance about the value of $\theta$ by assigning it a uniform distribution.
$$
\theta \sim U[0, 1]
$$
$$
f(\theta) = I_{\{0 \le \theta\le 1\}}
$$
$$
f(\theta | y = 1) = \frac{\theta^1(1-\theta)^0I_{\{0 \le \theta\le1\}}}{\int_{-\infty}^\infty{\theta^1(1-\theta)^0I_{\{0\le \theta \le 1\}}d\theta}}
$$
$$
f(\theta | y = 1) = \frac{\theta I_{\{0\le\theta\le1\}}}{\int_0^1{\theta d\theta}} = 2\theta I_{\{0\le\theta\le1\}}
$$
Now if we didn't want to take the integral we could've done this approach
$$
f(\theta | y) \propto f(y|\theta)f(\theta) \propto \theta I_{\{0\le\theta\le1\}}
$$
We then need to find the constant such that it's a proper PDF. In this case, it's $2$.
Since it's a proper PDF, we can compute interval probabilities as well. These are called posterior interval estimates.
$$
P(0.025 < \theta < 0.975) = \int_{0.025}^{0.975}{2\theta d \theta} = (0.975)^2 - (0.025)^2 = 0.95
$$
$$
P(\theta > 0.05) = 1 - (0.05)^2 = 0.9975
$$
These are the sort of intervals we would get from the prior and asking about what their posterior probability is.
In other cases, we may want to ask, what is the posterior interval of interest? What's an interval that contains 95% of the posterior probability in some meaningful way? This would be equivalent to a frequentist confidence interval. We can do this in several different ways; the two main ways we make Bayesian posterior intervals (or credible intervals) are equal-tailed intervals and highest posterior density intervals.
## Equal-tailed Interval
In the case of an equal-tailed interval, we put the equal amount of probability in each tail. So to make a 95% interval we'll put 0.025 in each tail.
To be able to do this, we're going to have to figure out what the quantiles are. So we're going to need some value, $q$, so that
$$
P(\theta < q | Y = 1) = \int_0^q{2\theta d\theta} = q^2
$$
$$
P(\sqrt{0.025} < \theta < \sqrt{0.975}) = P(0.158 < \theta < 0.987) = 0.95
$$
This is an equal-tailed interval in that the probability that $\theta$ is less than 0.158 is the same as the probability that $\theta$ is greater than 0.987. We can say that under the posterior, there's a 95% probability that $\theta$ is in this interval.
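Since the posterior $f(\theta|Y=1) = 2\theta$ on $[0,1]$ is a $Beta(2,1)$ density, the same interval comes straight from `qbeta`:
```R
qbeta(c(0.025, 0.975), shape1 = 2, shape2 = 1)   # about (0.158, 0.987)
```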
## Highest Posterior Density (HPD)
Here we want to ask where in the density function is it highest? Theoretically this will be the shortest possible interval that contains the given probability, in this case a 95% probability.
$$
P(\theta > \sqrt{0.05} | Y = 1) = P(\theta > 0.224 | Y = 1) = 0.95
$$
This is the shortest possible interval that, under the posterior, has a probability of 0.95: it's $\theta$ going from 0.224 up to 1.
The posterior distribution describes our understanding of our uncertainty, combining our prior beliefs and the data. It does this with a probability density function, so at the end of the day, we can make intervals and talk about the probability of the parameter being in the interval.
This is different from the frequentist approach, where we get confidence intervals. But we can't say a whole lot about the actual parameter relative to the confidence interval. We can only make long-run frequency statements about hypothetical intervals.
In this case, we can legitimately say that the posterior probability that $\theta$ is bigger than 0.05 is $0.9975$. We can also say that we believe there's a 95% probability that $\theta$ is in between 0.158 and 0.987.
Bayesians represent uncertainty with probabilities, so that the coin itself is a physical quantity. It may have a particular value for $\theta$.
It may be fixed, but because we don't know what that value is, we represent our uncertainty about that value with a distribution. And at the end of the day, we can represent our uncertainty, collect it with the data, and get a posterior distribution and make intuitive statements.
#### Interpreting Intervals
Frequentist confidence intervals have the interpretation that "If you were to repeat many times the process of collecting data and computing a 95% confidence interval, then on average about 95% of those intervals would contain the true parameter value; however, once you observe data and compute an interval the true value is either in the interval or it is not, but you can't tell which."
Bayesian credible intervals have the interpretation that "Your posterior probability that the parameter is in a 95% credible interval is 95%."

How do we choose a prior?
Our prior needs to represent our personal perspective, beliefs, and our uncertainties.
Theoretically, we're defining a cumulative distribution function for the parameter
$$
P(\theta \le c) \quad \text{for all } c \in \mathbb{R}
$$
This would have to hold for an infinite number of possible values of $c$. That isn't practical to do, and it would be very difficult to do coherently so that all the probabilities were consistent.
In practice, we work with a convenient family that's sufficiently flexible such that a member of the family represents our beliefs.
Generally, if one has enough data, the information in the data will overwhelm the information in the prior, and so the prior is not particularly important in terms of what you get for the posterior. Any reasonable choice of prior will lead to approximately the same posterior. However, there are some things that can go wrong.
## Example of Bad Prior
Suppose we chose a prior that says $P(\theta = \frac{1}{2}) = 1$.
And thus, the probability of $\theta$ equaling any other value is $0$. If we do this, our data won't make a difference since we only put a probability of $1$ at a single point.
$$
f(\theta|y) \propto f(y|\theta)f(\theta) \propto f(\theta)
$$
In general, events with a prior probability of zero have a posterior probability of zero, and events with a prior probability of one have a posterior probability of one.
Thus a good Bayesian will not assign a probability of zero or one to any event unless it has already occurred or is otherwise known to be impossible.
## Calibration
A useful concept in terms of choosing priors is that of the calibration of predictive intervals.
If we make an interval where we predict that 95% of new data points will occur in it, it would be good if, in reality, 95% of new data points actually did fall in that interval.
How do we calibrate to reality? This is actually a frequentist concept, but it is important for practical statistical purposes that our results reflect reality.
We can compute a predictive interval: an interval such that 95% of new observations are expected to fall into it. It's an interval for the **data** rather than an interval for $\theta$.
$$
f(y) = \int{f(y|\theta)f(\theta)d\theta} = \int{f(y, \theta)d\theta}
$$
Where $f(y,\theta)$ is the joint density of Y and $\theta$.
This is the prior predictive before any data is observed.
**Side Note:** From this you can say that $f(y, \theta) = f(y|\theta)f(\theta)$
## Binomial Example
Suppose we're going to flip a coin ten times and count the number of heads we see. We're thinking about this in advance of actually doing it, so we're interested in the predictive distribution. How many heads do we predict we're going to see?
$$
X = \sum_{i = 1}^{10}{Y_i}
$$
Where $Y_i$ is each individual coin flip.
If we think that all possible coins or all possible probabilities are equally likely, then we can put a prior for $\theta$ that's flat over the interval from 0 to 1.
$$
f(\theta) = I_{\{0 \le \theta \le 1\}}
$$
$$
f(x) = \int{f(x|\theta)f(\theta)d\theta} = \int_0^1{\frac{10!}{x!(10-x)!}\theta^x(1-\theta)^{10 -x}(1)d\theta}
$$
Note that because we're interested in $X$ at the end, it's important that we distinguish between a binomial density and a Bernoulli density. Here we just care about the total count rather than the exact ordering, which is what the individual Bernoullis would give.
For most of the analyses we're doing, where we're interested in $\theta$ rather than $X$, the binomial and the Bernoulli are interchangeable because the part that depends on $\theta$ is the same.
To solve this integral let us recall some facts
$$
n! = \Gamma(n + 1)
$$
$$
Z \sim Beta(\alpha, \beta)
$$
$$
f(z) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}z^{\alpha - 1}(1-z)^{\beta - 1}
$$
Let us rewrite $f(x)$
$$
f(x) = \int_0^1{\frac{\Gamma(11)}{\Gamma(x + 1)\Gamma(11 - x)}\theta^{(x + 1)-1}(1-\theta)^{(11-x)-1}d\theta}
$$
$$
f(x) = \frac{\Gamma(11)}{\Gamma(12)}\int_0^1{\frac{\Gamma(12)}{\Gamma(x + 1)\Gamma(11 - x)}\theta^{(x + 1)-1}(1-\theta)^{(11-x)-1}d\theta}
$$
The integrand above is a beta density, and any valid beta density integrates to one.
$$
f(x) = \frac{\Gamma(11)}{\Gamma(12)} = \frac{10!}{11!} = \frac{1}{11}
$$
For $x \in \{0, 1, 2, \dots, 10\}$
Thus we see that if we start with a uniform prior, we then end up with a discrete uniform predictive density for $X$. If all possible probabilities are equally likely, then all possible $X$ outcomes are equally likely.
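A quick numerical check of this in R, integrating the binomial likelihood against the uniform prior:
```R
# Prior predictive P(X = x) under a uniform prior on theta
prior_pred <- sapply(0:10, function(x)
  integrate(function(theta) dbinom(x, 10, theta), 0, 1)$value)
round(prior_pred, 4)   # every value is about 0.0909 = 1/11
```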
## Posterior Predictive Distribution
What about after we've observed data? What's our posterior predictive distribution?
Going from the previous example, let us observe after one flip that we got a head. We want to ask, what's our predictive distribution for the second flip, given we saw a head on the first flip.
$$
f(y_2|y_1) = \int{f(y_2|\theta,y_1)f(\theta|y_1)}d\theta
$$
We're going to assume that, given $\theta$, $Y_2$ is independent of $Y_1$. Therefore,
$$
f(y_2 |y_1) = \int{f(y_2|\theta)f(\theta|y_1)d\theta}
$$
Suppose we're thinking of a uniform distribution for $\theta$ and we observe the first flip is a heads. What do we predict for the second flip?
This is no longer going to be a uniform distribution like it was before, because we have some data. We're going to think it's more likely that we're going to get a second head. We think this because since we observed a head $\theta$ is now likely to be at least $\frac{1}{2}$ possibly larger.
$$
f(y_2 | Y_1 = 1) = \int_0^1{\theta^{y_2}(1-\theta)^{1-y_2}2\theta d\theta}
$$
$$
f(y_2|Y_1 = 1) = \int_0^1{2\theta^{y_2 + 1}(1-\theta)^{1-y_2}d\theta}
$$
We could work this out in a more general form, but in this case, $Y_2$ has to take the value $0$ or $1$. The next flip is either going to be heads or tails so it's easier to just plop in a particular example.
$$
P(Y_2 = 1|Y_1 = 1) = \int_0^1{2\theta^2d\theta} = \frac{2}{3}
$$
$$
P(Y_2 = 0 | Y_1 = 1) = 1 - P(Y_2 = 1 | Y_1 = 1) = 1 - \frac{2}{3} = \frac{1}{3}
$$
We can see here that the posterior is a combination of the information in the prior and the information in the data. In this case, our prior is like having two data points, one head and one tail.
Saying we have a uniform prior for $\theta$ is equivalent in an information sense as saying we have observed one head and one tail.
So then when we observe one head, it's like we now have seen two heads and one tail. So our predictive distribution for the second flip says if we have two heads and one tail, then we have a $\frac{2}{3}$ probability of getting another head and a $\frac{1}{3}$ probability of getting another tail.
## Binomial Likelihood with Uniform Prior
Likelihood of y given theta is
$$
f(y|\theta) = \theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}
$$
Our prior for theta is just a uniform distribution
$$
f(\theta) = I_{\{0 \le \theta \le 1\}}
$$
Thus our posterior for $\theta$ is
$$
f(\theta | y) = \frac{f(y|\theta)f(\theta)}{\int{f(y|\theta)f(\theta)d\theta}} = \frac{\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}} I_{\{0 \le \theta \le 1\}}}{\int_0^1{\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}} I_{\{0 \le \theta \le 1\}}d\theta}}
$$
Recalling the form of the beta distribution we can rewrite our posterior as
$$
f(\theta | y) = \frac{\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}} I_{\{0 \le \theta \le 1\}}}{\frac{\Gamma(\sum{y_i} + 1)\Gamma(n - \sum{y_i} + 1)}{\Gamma(n + 2)}\int_0^1{\frac{\Gamma(n + 2)}{\Gamma(\sum{y_i} + 1)\Gamma(n - \sum{y_i} + 1)}\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}d\theta}}
$$
Since the beta density integrates to $1$, we can simplify this as
$$
f(\theta | y) = \frac{\Gamma(n + 2)}{\Gamma(\sum{y_i}+ 1)\Gamma(n - \sum{y_i}+ 1)}\theta^{\sum{y_i}}(1-\theta)^{n-\sum{y_i}}I_{\{0 \le \theta \le 1\}}
$$
From here we can see that the posterior follows a beta distribution
$$
\theta | y \sim Beta(\sum{y_i} + 1, n - \sum{y_i} + 1)
$$
## Conjugate Priors
The uniform distribution is $Beta(1, 1)$
Any beta distribution is conjugate for the Bernoulli distribution. Any beta prior will give a beta posterior.
$$
f(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1-\theta)^{\beta -1}I_{\{0 \le \theta \le 1\}}
$$
$$
f(\theta | y) \propto f(y|\theta)f(\theta) = \theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}I_{\{0 \le \theta \le 1\}}
$$
$$
f(y|\theta)f(\theta) \propto \theta^{\alpha + \sum{y_i}-1}(1-\theta)^{\beta + n - \sum{y_i} - 1}
$$
Thus we see that this is a beta distribution
$$
\theta | y \sim Beta(\alpha + \sum{y_i}, \beta + n - \sum{y_i})
$$
When $\alpha$ and $\beta$ are both one, as in the uniform distribution, we get the same result as earlier.
This whole process where we choose a particular form of prior that works with a likelihood is called using a conjugate family.
A family of distributions is referred to as conjugate if when you use a member of that family as a prior, you get another member of that family as your posterior.
The beta distribution is conjugate for the Bernoulli distribution. It's also conjugate for the binomial distribution. The only difference in the binomial likelihood is that there is a combinatoric term. Since that does not depend on $\theta$, we get the same posterior.
We often use conjugate priors because they make life much simpler; sticking to conjugate families allows us to get closed-form solutions easily.
If the family is flexible enough, then you can find a member of that family that closely represents your beliefs.
## Posterior Mean and Effect Size
Returning to the beta posterior model, it is clear how both the prior and the data contribute to the posterior.
We can say that the effective sample size of the prior is $\alpha + \beta$.
Recall that the expected value or mean of a beta distribution is $\frac{\alpha}{\alpha + \beta}$
Therefore we can derive the posterior mean as
$$
posterior_{mean} = \frac{\alpha + \sum{y_i}}{\alpha + \sum{y_i}+\beta + n - \sum{y_i}}= \frac{\alpha+\sum{y_i}}{\alpha + \beta + n}
$$
We can further decompose this as
$$
posterior_{mean} = \frac{\alpha + \beta}{\alpha + \beta + n}\frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n}\frac{\sum{y_i}}{n}
$$
We can describe this as the (prior weight * prior mean) + (data weight * data mean)
The posterior mean is a weighted average of the prior mean and the data mean.
This effective sample size gives you an idea of how much data you would need to make sure that your prior doesn't have much influence on your posterior.
If $\alpha + \beta$ is small compared to $n$ then the posterior will largely just be driven by the data. If $\alpha + \beta$ is large relative to $n$ then the posterior will be largely driven by the prior.
We can make a 95% credible interval using our posterior distribution for $\theta$. We can find an interval that actually has 95% probability of containing $\theta$.
Using Bayesian statistics we can chain together sequential updates every time we get new data: we use our previous posterior as the prior and apply Bayes' theorem again to get a new posterior.
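As a quick check (a minimal sketch with made-up coin-flip data, not from the lecture), the snippet below verifies that two sequential updates of a beta prior give the same posterior as one update with all the data, and that the posterior mean equals the weighted average described above.
```R
a <- 1; b <- 1                   # Beta(1, 1), i.e. a uniform prior
batch1 <- c(1, 0, 1, 1)          # hypothetical flips: 1 = head, 0 = tail
batch2 <- c(0, 1, 1, 0, 1)

# Sequential update: the posterior from batch 1 becomes the prior for batch 2
a1 <- a + sum(batch1);  b1 <- b + length(batch1) - sum(batch1)
a2 <- a1 + sum(batch2); b2 <- b1 + length(batch2) - sum(batch2)

# Batch update: all the data at once
y <- c(batch1, batch2)
a_all <- a + sum(y); b_all <- b + length(y) - sum(y)
c(a2, b2) == c(a_all, b_all)     # TRUE TRUE: same posterior either way

# Posterior mean = (prior weight * prior mean) + (data weight * data mean)
n <- length(y)
(a + b) / (a + b + n) * a / (a + b) + n / (a + b + n) * mean(y)
a_all / (a_all + b_all)          # identical value
```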
## Data Analysis Example in R
Suppose we're giving two students a multiple-choice exam with 40 questions, where each question has four choices. We don't know how much the students have studied for this exam, but we think that they'll do better than just guessing randomly.
1) What are the parameters of interest?
The parameters of interest are $\theta_1$, the true probability that the first student will answer a question correctly, and $\theta_2$, the true probability that the second student will answer a question correctly.
2) What is our likelihood?
The likelihood is $Binomial(40, \theta)$, if we assume that each question is independent and that the probability a student gets each question right is the same for all questions for that student.
3) What prior should we use?
The conjugate prior is a beta prior. We can plot the density with `dbeta`
```R
theta = seq(from = 0, to = 1, by = 0.1)
# Uniform
plot(theta, dbeta(theta, 1, 1), type = 'l')
# Prior mean 2/3
plot(theta, dbeta(theta, 4, 2), type = 'l')
# Prior mean 2/3 but higher effect size (more concentrated at mean)
plot(theta, dbeta(theta, 8, 4), type = 'l')
```
4) What are the prior probabilities $P(\theta > 0.25)$? $P(\theta > 0.5)$? $P(\theta > 0.8)$?
```R
1 - pbeta(0.25, 8, 4)
#[1] 0.9988117
1 - pbeta(0.5, 8, 4)
#[1] 0.8867188
1 - pbeta(0.8, 8, 4)
#[1] 0.1611392
```
5) Suppose the first student gets 33 questions right. What is the posterior distribution for $\theta_1$? $P(\theta > 0.25)$? $P(\theta > 0.5)$? $P(\theta > 0.8)$? What is the 95% posterior credible interval for $\theta_1$?
$$
Posterior \sim Beta(8 + 33, 4 + 40 - 33) = Beta(41, 11)
$$
With a posterior mean of $\frac{41}{41+11} = \frac{41}{52}$
We can plot the posterior distribution with the prior
```R
plot(theta, dbeta(theta, 41, 11), type = 'l')
lines(theta, dbeta(theta, 8 ,4), lty = 2) #Dashed line for prior
```
Posterior probabilities
```R
1 - pbeta(0.25, 41, 11)
#[1] 1
1 - pbeta(0.5, 41, 11)
#[1] 0.9999926
1 - pbeta(0.8, 41, 11)
#[1] 0.4444044
```
Equal tailed 95% credible interval
```R
qbeta(0.025, 41, 11)
#[1] 0.6688426
qbeta(0.975, 41, 11)
#[1] 0.8871094
```
95% confidence that $\theta_1$ is between 0.67 and 0.89
6) Suppose the second student gets 24 questions right. What is the posterior distribution for $\theta_2$? $P(\theta > 0.25)$? $P(\theta > 0.5)$? $P(\theta > 0.8)$? What is the 95% posterior credible interval for $\theta_2$?
$$
Posterior \sim Beta(8 + 24, 4 + 40 - 24) = Beta(32, 20)
$$
With a posterior mean of $\frac{32}{32+20} = \frac{32}{52}$
We can plot the posterior distribution with the prior
```R
plot(theta, dbeta(theta, 32, 20), type = 'l')
lines(theta, dbeta(theta, 8 ,4), lty = 2) #Dashed line for prior
```
Posterior probabilities
```R
1 - pbeta(0.25, 32, 20)
#[1] 1
1 - pbeta(0.5, 32, 20)
#[1] 0.9540427
1 - pbeta(0.8, 32, 20)
#[1] 0.00124819
```
Equal tailed 95% credible interval
```R
qbeta(0.025, 32, 20)
#[1] 0.4808022
qbeta(0.975, 32, 20)
#[1] 0.7415564
```
95% confidence that $\theta_2$ is between 0.48 and 0.74
7) What is the posterior probability that $\theta_1 > \theta_2$? i.e., that the first student has a better chance of getting a question right than the second student?
Estimate by simulation: draw a large number of samples (here 100,000) from each posterior and see how often we observe $\theta_1 > \theta_2$
```R
theta1 = rbeta(100000, 41, 11)
theta2 = rbeta(100000, 32, 20)
mean(theta1 > theta2)
#[1] 0.975
```
## Poisson Data (Chocolate Chip Cookie Example)
In mass produced chocolate chip cookies, they make a large amount of dough. They mix in a large number of chips, mix it up really well and then chunk out individual cookies. In this process, the number of chips per cookie approximately follow a Poisson distribution.
If we were to assume that chips have no volume, then this would be exactly a Poisson process and the counts would follow exactly a Poisson distribution. In practice chips do take up a little space, but they are small enough that the number of chips per cookie approximately follows a Poisson distribution.
$$
Y_i \sim Poisson(\lambda)
$$
$$
f(y|\lambda) = \frac{\lambda^{\sum{y_i}}e^{-n\lambda}}{\prod_{i = 1}^n{y_i!}}
$$
This is for $\lambda > 0$
What type of prior should we put on $\lambda$? It would be convenient if we could use a conjugate prior. What distribution looks like $\lambda$ raised to a power times $e$ raised to a negative multiple of $\lambda$?
For this, we're going to use a Gamma prior.
$$
\lambda \sim \Gamma(\alpha, \beta)
$$
$$
f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha - 1}e^{-\beta\lambda}
$$
$$
f(\lambda | y) \propto f(y|\lambda)f(\lambda) \propto \lambda^{\sum{y_i}}e^{-n\lambda}\lambda^{\alpha - 1}e^{-\beta \lambda}
$$
$$
f(\lambda | y) \propto \lambda^{\alpha + \sum{y_i} - 1}e^{-(\beta + n)\lambda}
$$
Thus we can see that the posterior is a Gamma Distribution
$$
\lambda|y \sim \Gamma(\alpha + \sum{y_i}, \beta + n)
$$
The mean of Gamma under this parameterization is $\frac{\alpha}{\beta}$
The posterior mean is going to be
$$
posterior_{mean} = \frac{\alpha + \sum{y_i}}{\beta + n} = \frac{\beta}{\beta + n}\frac{\alpha}{\beta} + \frac{n}{\beta + n}\frac{\sum{y_i}}{n}
$$
As you can see here the posterior mean of the Gamma distribution is also the weighted average of the prior mean and the data mean.
Let us present two strategies for choosing our hyperparameters $\alpha$ and $\beta$.
1. Think about the prior mean. For example, what do you think the number of chips per cookie on average is?
After this, we need some other piece of knowledge to pin point both parameters. Here are some options.
- What is your uncertainty about the number of chips per cookie? In other words, what do you think the standard deviation is? Under the Gamma prior the standard deviation is $\frac{\sqrt{\alpha}}{\beta}$
- What is the effective sample size $\beta$? How many units of information do you think we have in our prior versus in our data points?
2. In Bayesian Statistics, a vague prior refers to one that's relatively flat across much of the space. For a Gamma prior we can choose $\Gamma(\epsilon, \epsilon)$ where $\epsilon$ is small and strictly positive.
This would create a distribution with a mean of 1 and a huge standard deviation across the whole space. Hence the posterior will be largely driven by the data and very little by the prior.
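As an illustration of the first strategy (hypothetical numbers, not from the lecture), suppose we believe cookies have about 12 chips on average and we give that belief the weight of roughly two cookies' worth of data. The sketch below turns that into a gamma prior and updates it with some made-up chip counts.
```R
# Prior elicitation (assumed values): prior mean 12, effective sample size 2
prior_mean <- 12
beta0  <- 2                       # effective sample size
alpha0 <- prior_mean * beta0      # so that alpha0 / beta0 = 12

# Hypothetical observed chip counts
y <- c(10, 14, 9, 12, 13)
n <- length(y)

# Conjugate update: lambda | y ~ Gamma(alpha0 + sum(y), beta0 + n)
alpha_post <- alpha0 + sum(y)
beta_post  <- beta0 + n

alpha_post / beta_post                            # posterior mean
qgamma(c(0.025, 0.975), alpha_post, beta_post)    # 95% credible interval
```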
## Exponential Data
Suppose you're waiting for a bus that you think comes on average once every 10 minutes, but you're not sure exactly how often it comes.
$$
Y \sim Exp(\lambda)
$$
Your waiting time has a prior expectation of $\frac{1}{\lambda}$
It turns out the gamma distribution is conjugate for an exponential likelihood. We need to specify a prior, or a particular gamma in this case. If we think that the buses come on average every ten minutes, that's a rate of one over ten.
$$
prior_{mean} = \frac{1}{10}
$$
Thus, we'll want to specify a gamma distribution so that the first parameter divided by the second parameter is $\frac{1}{10}$
We can now think about our variability. Perhaps you specify
$$
\Gamma(100, 1000)
$$
This will indeed have a prior mean of $\frac{1}{10}$ and a standard deviation of $\frac{1}{100}$. A rough estimate of our mean plus or minus two standard deviations is then
$$
0.1 \pm 0.02
$$
Suppose that we wait for 12 minutes and a bus arrives. Now you want to update your posterior for $\lambda$ about how often this bus will arrive.
$$
f(\lambda | y) \propto f(y|\lambda)f(\lambda)
$$
$$
f(\lambda | y) \propto \lambda e^{-\lambda y}\lambda^{\alpha - 1}e^{-\beta \lambda}
$$
$$
f(\lambda | y) \propto \lambda^{(\alpha + 1) - 1}e^{-(\beta + y)\lambda}
$$
$$
\lambda | y \sim \Gamma(\alpha + 1, \beta + y)
$$
Plugging in our particular prior gives us a posterior for $\lambda$ which is
$$
\lambda | y \sim \Gamma(101, 1012)
$$
Thus our posterior mean is going to be $\frac{101}{1012}$, which is approximately $0.0998$.
This one observation doesn't contain a lot of data under this likelihood. When the bus comes and it takes 12 minutes instead of 10, it barely shifts our posterior mean up. One data point doesn't have a big impact here.
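The sketch below just replays these numbers in R so the small shift is visible; the $\Gamma(100, 1000)$ prior and the single 12-minute observation are the ones used above.
```R
alpha0 <- 100; beta0 <- 1000          # prior: Gamma(100, 1000)
alpha0 / beta0                        # prior mean rate: 0.1 buses per minute

y <- 12                               # one observed waiting time (minutes)
alpha_post <- alpha0 + 1              # posterior: Gamma(101, 1012)
beta_post  <- beta0 + y
alpha_post / beta_post                # 0.0998..., barely moved

# Prior (dashed) and posterior densities for lambda
lambda <- seq(0.05, 0.15, by = 0.001)
plot(lambda, dgamma(lambda, alpha_post, beta_post), type = "l")
lines(lambda, dgamma(lambda, alpha0, beta0), lty = 2)
```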
## Normal/Gaussian Data
Let's suppose the standard deviation or variance is known and we're only interested in learning about the mean. This is the situation that often arises in monitoring industrial production processes.
$$
X_i \sim N(\mu, \sigma^2)
$$
It turns out that the Normal distribution is conjugate for itself when looking for the mean parameter
Prior
$$
\mu \sim N(m_0, s_0^2)
$$
$$
f(\mu |x ) \propto f(x|\mu)f(\mu)
$$
$$
\mu | x \sim N(\frac{n\bar{x}/\sigma^2 + m_0/s_0^2}{n/\sigma^2 + 1/s_0^2}, \frac{1}{n/\sigma^2 + 1/s_0^2})
$$
Let's look at the posterior mean
$$
posterior_{mean} = \frac{n/\sigma^2}{n/\sigma^2 + 1/s_0^2}\bar{x} + \frac{1/s_0^2}{n/\sigma^2 + 1/s_0^2}m_0
$$
$$
posterior_{mean} = \frac{n}{n + \sigma^2/s_0^2}\bar{x} + \frac{\sigma^2/s_0^2}{n + \sigma^2/s_0^2}m_0
$$
Thus we see that the posterior mean is a weighted average of the prior mean and the data mean, and indeed the effective sample size for this prior is the ratio of the variance of the data to the variance of the prior.
This makes sense, because the larger the variance of the prior, the less information that's in it.
The marginal distribution for Y is
$$
N(m_0, s_0^2 + \sigma^2)
$$
### When $\mu$ and $\sigma^2$ are unknown
$$
X_i | \mu, \sigma^2 \sim N(\mu, \sigma^2)
$$
We specify a prior for $\mu$ conditional on the value of $\sigma^2$
$$
\mu | \sigma^2 \sim N(m, \frac{\sigma^2}{w})
$$
$w$ is the ratio of $\sigma^2$ to the variance of the normal prior on $\mu$; it plays the role of the effective sample size of the prior.
Finally, the last step is to specify a prior for $\sigma^2$. The conjugate prior here is an inverse gamma distribution with parameters $\alpha$ and $\beta$.
$$
\sigma^2 \sim \Gamma^{-1}(\alpha, \beta)
$$
After many calculations... we get the posterior distribution
$$
\sigma^2 | x \sim \Gamma^{-1}(\alpha + \frac{n}{2}, \beta + \frac{1}{2}\sum_{i = 1}^n{(x_i-\bar{x})^2} + \frac{nw}{2(n+w)}(\bar{x} - m)^2)
$$
$$
\mu | \sigma^2,x \sim N(\frac{n\bar{x}+wm}{n+w}, \frac{\sigma^2}{n + w})
$$
Where the posterior mean can be written as the weighted average of the prior mean and the data mean.
$$
\frac{n\bar{x}+wm}{n+w} = \frac{w}{n + w}m + \frac{n}{n + w}\bar{x}
$$
In some cases, we really only care about $\mu$. We want some inference on $\mu$, and we may want it not to depend on $\sigma^2$. We can marginalize over $\sigma^2$ by integrating it out. The posterior for $\mu$ marginally follows a $t$ distribution.
$$
\mu | x \sim t
$$
Similarly the posterior predictive distribution also is a $t$ distribution.
Finally, note that we can extend this in various directions: it can be extended to the multivariate normal case, which requires matrix-vector notation, and it can be extended in a hierarchical fashion if we want to specify priors for $m$, $w$, and $\beta$.
## Non Informative Priors
We've seen examples of choosing priors that contain a significant amount of information. We've also seen some examples of choosing priors where we're attempting to not put too much information in to keep them vague.
Another approach is referred to as objective Bayesian statistics or inference where we explicitly try to minimize the amount of information that goes into the prior.
This is an attempt to have the data have maximum influence on the posterior
Let's go back to coin flipping
$$
Y_i \sim B(\theta)
$$
How do we minimize our prior information in $\theta$? One obvious intuitive approach is to say that all values of $\theta$ are equally likely. So we could have a prior for $\theta$ which follows a uniform distribution on the interval $[0, 1]$
Saying all values of $\theta$ are equally likely **seems** like it would have no information in it.
Recall however, that a $Uniform(0, 1)$ is the same as $Beta(1, 1)$
The effective sample size of a beta prior is the sum of its two parameters. So in this case, it has an effective sample size of 2. This is equivalent to data, with one head and one tail already in it.
So this is not a completely non informative prior.
We could think about a prior that has less information. For example $Beta(\frac{1}{2}, \frac{1}{2})$, this would have half as much information with an effective sample size of one.
We can take this even further. Think about something like $Beta(0.001, 0.001)$ This would have much less information, with the effective sample size fairly close to zero. In this case, the data would determine the posterior and there would be very little influence from the prior.
### Improper priors
Can we go even further? We can think of the limiting case. Let's think of $Beta(0,0)$, what would that look like?
$$
f(\theta) \propto \theta^{-1}(1-\theta)^{-1}
$$
This is not a proper density. If you integrate this over $(0,1)$, you'll get an infinite integral, so it's not a true density in the sense of it not integrating to 1.
There's no way to normalize it, since it has an infinite integral. This is what we refer to as an improper prior.
It's improper in the sense that it doesn't have a proper density. But it's not necessarily improper in the sense that we can't use it. If we collect data and use this prior, then as long as we observe **at least one success and one failure** (one head and one tail), we can get a posterior
$$
f(\theta|y) \propto \theta^{y-1}(1-\theta)^{n-y-1} \sim Beta(y, n-y)
$$
With a posterior mean of $\frac{y}{n} = \hat{\theta}$, which you should recognize as the maximum likelihood estimate. So by using this improper prior, we get a posterior which gives us point estimates exactly the same as the frequentist approach.
But in this case, we can also think of having a full posterior. From this, we can make interval statements and probability statements, and we can actually find an interval and say that there's 95% probability that $\theta$ is in this interval, which is not something you can do under the frequentist approach even though we may get the exact same interval.
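A short sketch with made-up counts (assuming at least one success and one failure): under the improper $Beta(0, 0)$ prior the posterior mean reproduces the MLE $y/n$, and we can still read off a 95% posterior interval from the $Beta(y, n - y)$ posterior.
```R
n <- 20; y <- 13                         # hypothetical data: 13 successes in 20 trials

# Posterior under the improper Beta(0, 0) prior: Beta(y, n - y)
y / n                                    # MLE, equal to the posterior mean
qbeta(c(0.025, 0.975), y, n - y)         # 95% posterior credible interval
```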
### Statements about improper priors
Improper priors are okay as long as the posterior itself is proper. There may be some mathematical things that need to be checked and you may need to have certain restrictions on the data. In this case, we needed to make sure that we observed at least one head and one tail to get a proper posterior.
But as long as the posterior is proper, we can go forwards and do Bayesian inference even with an improper prior.
The second point is that for many problems there does exist a prior, typically an improper prior, that will lead to the same point estimates as you would get under the frequentist paradigm. So we can get very similar results, results that are fully dependent on the data, under the Bayesian approach.
But in this case, we can also continue to have a posterior, make posterior interval estimates, and talk about posterior probabilities of the parameter.
### Normal Case
Another example is thinking about the normal case.
$$
Y_i \stackrel{iid}\sim N(\mu, \sigma^2)
$$
Let's start off by assuming that $\sigma^2$ is known and we'll just focus on the mean $\mu$.
We can think about a vague prior like before and say that
$$
\mu \sim N(0, 1000000^2)
$$
This would just spread things out across the real line. That would be a fairly non informative prior covering a lot of possibilities. We can then think about taking the limit, what happens if we let the variance go to $\infty$. In that case, we're basically spreading out this distribution across the entire real number line. We can say that the density is just a constant across the whole real line.
$$
f(\mu) \propto 1
$$
This is an improper prior because if you integrate it over the real line you get an infinite answer. However, if we go ahead and find the posterior
$$
f(\mu|y) \propto f(y|\mu)f(\mu) \propto exp(-\frac{1}{2\sigma^2}\sum{(y_i - \mu)^2})(1)
$$
$$
f(\mu | y) \propto exp(-\frac{1}{2\sigma^2/n}(\mu - \bar{y})^2)
$$
$$
\mu | y \sim N(\bar{y}, \frac{\sigma^2}{n})
$$
This should look just like the maximum likelihood estimate.
### Unknown Variance
In the case that $\sigma^2$ is unknown, the standard non informative prior is
$$
f(\sigma^2) \propto \frac{1}{\sigma^2}
$$
$$
\sigma^2 \sim \Gamma^{-1}(0,0)
$$
This is an improper prior and it's uniform on the log scale of $\sigma^2$.
In this case, we'll end up with a posterior for $\sigma^2$
$$
\sigma^2|y \sim \Gamma^{-1}(\frac{n-1}{2}, \frac{1}{2}\sum{(y_i - \bar{y})^2})
$$
This should also look reminiscent of quantities we get as frequentists, for example, the sample standard deviation.
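R has no built-in inverse-gamma functions, but we can sample $\sigma^2$ by drawing from the corresponding gamma and inverting. A sketch with simulated data, under the flat-on-the-log-scale prior from above:
```R
set.seed(1)
y <- rnorm(30, mean = 5, sd = 2)          # simulated data, true sd = 2
n <- length(y)

# Posterior: sigma^2 | y ~ InverseGamma((n - 1)/2, sum((y - ybar)^2)/2)
shape <- (n - 1) / 2
rate  <- sum((y - mean(y))^2) / 2

sigma2_draws <- 1 / rgamma(100000, shape, rate)   # invert gamma draws
mean(sigma2_draws)                                # close to...
var(y)                                            # ...the sample variance
```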
## Jeffreys Prior
Choosing a uniform prior depends upon the particular parameterization.
Suppose I used a prior which is uniform on the log scale for $\sigma^2$
$$
f(\sigma^2) \propto \frac{1}{\sigma^2}
$$
Suppose somebody else decides, that they just want to put a uniform prior on $\sigma^2$ itself.
$$
f(\sigma^2) \propto 1
$$
These are both uniform on certain scales or certain parameterizations, but they are different priors. So when we compute the posteriors, they will be different as well.
The key thing is that uniform priors are not invariant with respect to transformation. Depending on how you parameterize the problem, you can get different answers by using a uniform prior
One attempt to get around this is to use the Jeffreys prior.
The Jeffreys prior is defined as the following
$$
f(\theta) \propto \sqrt{\mathcal{I}(\theta)}
$$
Where $\mathcal{I}(\theta)$ is the Fisher information of $\theta$. In most cases, this will be an improper prior.
### Normal Data
For the example of Normal Data
$$
Y_i \sim N(\mu, \sigma^2)
$$
$$
f(\mu) \propto 1
$$
$$
f(\sigma^2) \propto \frac{1}{\sigma^2}
$$
Where $\mu$ is uniform and $\sigma^2$ is uniform on the log scale.
This prior will then be transformation invariant. We will end up putting the same information into the prior even if we use a different parameterization for the Normal.
### Binomial
$$
Y_i \sim B(\theta)
$$
$$
f(\theta) \propto \theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}} \sim Beta(\frac{1}{2},\frac{1}{2})
$$
This is a rare example of where the Jeffreys prior turns out to be a proper prior.
You'll note that this prior actually does have some information in it. It's equivalent to an effective sample size of one data point. However, this information will be the same, not depending on the parameterization we use.
In this case, we have $\theta$ as a probability, but another alternative which is sometimes used is to model things on the logistic (log-odds) scale.
By using the Jeffreys prior, we'll maintain the exact same information.
### Closing information about priors
Other possible approaches to objective Bayesian inference includes priors such as reference priors and maximum entropy priors.
A related concept to this is called empirical Bayesian analysis. The idea in empirical Bayes is that you use the data to help inform your prior, such as by using the mean of the data to set the mean of the prior distribution. This approach often leads to reasonable point estimates in your posterior. However, it's sort of cheating since you're using your data twice, and as a result it may lead to improper uncertainty estimates.
## Fisher Information
The Fisher information (for one parameter) is defined as
$$
\mathcal{I}(\theta) = E[(\frac{d}{d\theta}log{(f(X|\theta))})^2]
$$
Where the expectation is taken with respect to $X$ which has PDF $f(x|\theta)$. This quantity is useful in obtaining estimators for $\theta$ with good properties, such as low variance. It is also the basis for Jeffreys prior.
**Example:** Let $X | \theta \sim N(\theta, 1)$. Then we have
$$
f(x|\theta) = \frac{1}{\sqrt{2\pi}}exp[-\frac{1}{2}(x-\theta)^2]
$$
$$
\log{(f(x|\theta))} = -\frac{1}{2}\log{(2\pi)}-\frac{1}{2}(x-\theta)^2
$$
$$
(\frac{d}{d\theta}log{(f(x|\theta))})^2 = (x-\theta)^2
$$
and so $\mathcal{I}(\theta) = E[(X - \theta)^2] = Var(X) = 1$
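A quick Monte Carlo sanity check of this example (a sketch, with an arbitrary value of $\theta$): the average of $(X - \theta)^2$ over simulated draws should come out close to 1.
```R
set.seed(7)
theta <- 2.5                      # arbitrary choice of theta
x <- rnorm(100000, mean = theta, sd = 1)
mean((x - theta)^2)               # approximately 1, the Fisher information
```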
## Linear Regression
### Brief Review of Regression
Recall that linear regression is a model for predicting a response or dependent variable ($Y$, also called an output) from one or more covariates or independent variables ($X$, also called explanatory variables, inputs, or features). For a given value of a single $x$, the expected value of $y$ is
$$
E[y] = \beta_0 + \beta_1x
$$
or we could say that $Y \sim N(\beta_0 + \beta_1x, \sigma^2)$. For data $(x_1, y_1), \dots , (x_n, y_n)$, the fitted values for the coefficients, $\hat{\beta_0}$ and $\hat{\beta_1}$ are those that minimize the sum of squared errors $\sum_{i = 1}^n{(y_i - \hat{y_i})^2}$, where the predicted values for the response are $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x$. We can get these values from R. These fitted coefficients give the least-squares line for the data.
This model extends to multiple covariates, with one coefficient $\beta_j$ for each of the $k$ covariates
$$
E[y_i] = \beta_0 + \beta_1x_{i1} + \dots + \beta_kx_{ik}
$$
Optionally, we can represent the multivariate case using vector-matrix notation.
### Conjugate Modeling
In the Bayesian framework, we treat the $\beta$ parameters as unknown, put a prior on them, and then find the posterior. We might treat $\sigma^2$ as fixed and known, or we might treat it as an unknown and also put a prior on it. Because the underlying assumption of a regression model is that the errors are independent and identically normally distributed with mean $0$ and variance $\sigma^2$, this defines a normal likelihood.
#### $\sigma^2$ known
Sometimes we may know the value of the error variance $\sigma^2$. This simplifies calculations. The conjugate prior for the $\beta$'s is a normal prior. In practice, people typically use a non-informative prior, i.e., the limit as the variance of the normal prior goes to infinity, which has the same mean as the standard least-squares estimates. If we are only estimating $\beta$ and treating $\sigma^2$ as known, then the posterior for $\beta$ is a (multivariate) normal distribution. If we just have a single covariate, then the posterior for the slope is
$$
\beta_1 | y \sim N(\frac{\sum_{i = 1}^n{(x_i-\bar{x})(y_i - \bar{y})}}{\sum_{i=1}^n{(x_i-\bar{x})^2}}, \frac{\sigma^2}{\sum_{i=1}^n{(x_i - \bar{x})^2}})
$$
If we have multiple covariates, then using a matrix-vector notation, the posterior for the vector of coefficients is
$$
\beta | y \sim N((X^tX)^{-1}X^ty, (X^tX)^{-1}\sigma^2)
$$
where $X$ denotes the design matrix and $X^t$ is the transpose of $X$. The intercept is typically included in $X$ as a column of $1$'s. Using an improper prior requires us to have at least as many data points as we have parameters to ensure that the posterior is proper.
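As a sketch with simulated data (hypothetical covariates and coefficients), the posterior mean under this non-informative prior is just the least-squares solution $(X^tX)^{-1}X^ty$, which is also what `lm` returns.
```R
set.seed(123)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 1.5)   # simulated responses

X <- cbind(1, x1, x2)                      # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^t X)^{-1} X^t y

cbind(posterior_mean = c(beta_hat), lm = coef(lm(y ~ x1 + x2)))
```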
#### $\sigma^2$ Unknown
If we treat both $\beta$ and $\sigma^2$ as unknown, the standard prior is the non-informative Jeffreys prior, $f(\beta, \sigma^2) \propto \frac{1}{\sigma^2}$. Again, the posterior mean for $\beta$ will be the same as the standard least-squares estimates. The posterior for $\beta$ conditional on $\sigma^2$ is the same normal distributions as when $\sigma^2$ is known, but the marginal posterior distribution for $\beta$, with $\sigma^2$ integrated out is a $t$ distribution, analogous to the $t$ tests for significance in standard linear regression. The posterior $t$ distribution has mean $(X^tX)^{-1}X^ty$ and scale matrix (related to the variance matrix) $s^2(X^tX)^{-1}$, where $s^2 = \sum_{i = 1}^n{(y_i - \hat{y_i})^2/(n - k - 1)}$. The posterior distribution for $\sigma^2$ is an inverse gamma distribution
$$
\sigma^2 | y \sim \Gamma^{-1}(\frac{n - k - 1}{2}, \frac{n - k - 1}{2}s^2)
$$
In the simple linear regression case (single variable), the marginal posterior for $\beta$ is a $t$ distribution with mean $\frac{\sum_{i = 1}^n{(x_i-\bar{x})(y_i - \bar{y})}}{\sum_{i=1}^n{(x_i-\bar{x})^2}}$ and scale $\frac{s^2}{\sum_{i=1}^n{(x_i - \bar{x})^2}}$. If we are trying to predict a new observation at a specified input $x^*$, that predicted value has a marginal posterior predictive distribution that is a $t$ distribution, with mean $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x^*$ and scale $se_r\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n - 1)s_x^2}}$. $se_r$ is the residual standard error of the regression, which can be found easily in R. $s_x^2$ is the sample variance of $x$. Recall that the predictive distribution for a new observation has more variability than the posterior distribution for $\hat{y}$, because individual observations are more variable than the mean.
## Linear Regression in R
### Single Variable Regression
We'll be looking at the Challenger dataset. It contains 23 past shuttle launches, with the temperature on the day of launch and the O-ring damage index.
http://www.randomservices.org/random/data/Challenger2.txt
Read in the data
```R
oring=read.table("http://www.randomservices.org/random/data/Challenger2.txt",
header=T)
# Note that attaching this masks T which is originally TRUE
attach(oring)
```
Now we'll see the plot
```R
plot(T, I)
```
![Challenger](files/courses/BayesianStatistics/Challenger.png)
Fit a linear model
```R
oring.lm=lm(I~T)
summary(oring.lm)
```
Output of the summary
```
Call:
lm(formula = I ~ T)
Residuals:
Min 1Q Median 3Q Max
-2.3025 -1.4507 -0.4928 0.7397 5.5337
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.36508 4.43859 4.138 0.000468
T -0.24337 0.06349 -3.833 0.000968
(Intercept) ***
T ***
---
Signif. codes:
0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 2.102 on 21 degrees of freedom
Multiple R-squared: 0.4116, Adjusted R-squared: 0.3836
F-statistic: 14.69 on 1 and 21 DF, p-value: 0.0009677
```
Add the fitted line into the scatterplot
```R
lines(T,fitted(oring.lm))
```
![challengerfitted](files/courses/BayesianStatistics/challengerfitted.png)
Create a 95% posterior interval for the slope
```R
-0.24337 - 0.06349*qt(.975,21)
# [1] -0.3754047
-0.24337 + 0.06349*qt(.975,21)
# [1] -0.1113353
```
**Note:** These are the same as the frequentist confidence intervals
If the Challenger launch was at 31 degrees Fahrenheit, how much O-ring damage would we predict?
```R
coef(oring.lm)[1] + coef(oring.lm)[2]*31
# [1] 10.82052
```
Let's make our posterior prediction interval
```R
predict(oring.lm,data.frame(T=31),interval="predict")
```
Output of `predict`
```
fit lwr upr
1 10.82052 4.048269 17.59276
```
We can calculate the lower bound through the following formula
```R
10.82052-2.102*qt(.975,21)*sqrt(1+1/23+((31-mean(T))^2/22/var(T)))
```
What's the posterior probability that the damage index is greater than zero?
```R
1-pt((0-10.82052)/(2.102*sqrt(1+1/23+((31-mean(T))^2/22/var(T)))),21)
```
### Multivariate Regression
We're looking at Galton's seminal data predicting the height of children from the height of the parents
http://www.randomservices.org/random/data/Galton.txt
Read in the data
```R
heights=read.table("http://www.randomservices.org/random/data/Galton.txt",
header=T)
attach(heights)
```
What are the columns in the dataset?
```R
names(heights)
# [1] "Family" "Father" "Mother" "Gender" "Height" "Kids"
```
Let's look at the relationship between the different variables
```R
pairs(heights)
```
![heightpairs](files/courses/BayesianStatistics/heightpairs.png)
First let's start by creating a linear model taking all of the columns into account
```R
summary(lm(Height~Father+Mother+Gender+Kids))
```
Output of `summary`
```
Call:
lm(formula = Height ~ Father + Mother + Gender + Kids)
Residuals:
Min 1Q Median 3Q Max
-9.4748 -1.4500 0.0889 1.4716 9.1656
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.18771 2.79387 5.794 9.52e-09
Father 0.39831 0.02957 13.472 < 2e-16
Mother 0.32096 0.03126 10.269 < 2e-16
GenderM 5.20995 0.14422 36.125 < 2e-16
Kids -0.04382 0.02718 -1.612 0.107
(Intercept) ***
Father ***
Mother ***
GenderM ***
Kids
---
Signif. codes:
0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 2.152 on 893 degrees of freedom
Multiple R-squared: 0.6407, Adjusted R-squared: 0.6391
F-statistic: 398.1 on 4 and 893 DF, p-value: < 2.2e-16
```
As you can see here, the `Kids` column is not significant. Let's look at a model with it removed.
```R
summary(lm(Height~Father+Mother+Gender))
```
Output of `summary`
```
Call:
lm(formula = Height ~ Father + Mother + Gender)
Residuals:
Min 1Q Median 3Q Max
-9.523 -1.440 0.117 1.473 9.114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.34476 2.74696 5.586 3.08e-08
Father 0.40598 0.02921 13.900 < 2e-16
Mother 0.32150 0.03128 10.277 < 2e-16
GenderM 5.22595 0.14401 36.289 < 2e-16
(Intercept) ***
Father ***
Mother ***
GenderM ***
---
Signif. codes:
0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 2.154 on 894 degrees of freedom
Multiple R-squared: 0.6397, Adjusted R-squared: 0.6385
F-statistic: 529 on 3 and 894 DF, p-value: < 2.2e-16
```
This model looks good, let's go ahead and save it to a variable
```R
heights.lm=lm(Height~Father+Mother+Gender)
```
From this we can tell that each extra inch of height in the father is associated with an extra 0.4 inches in the height of the child.
We can also tell that each extra inch of height in the mother is associated with an extra 0.3 inches in the height of the child.
A male child is on average 5.2 inches taller than a female child.
Let's create a 95% posterior interval for the difference in height by gender
```R
5.226 - 0.144*qt(.975,894)
# [1] 4.943383
5.226 + 0.144*qt(.975,894)
# [1] 5.508617
```
Let's make posterior prediction intervals for a male and a female child with a father who is 68 inches tall and a mother who is 64 inches tall.
```R
predict(heights.lm,data.frame(Father=68,Mother=64,Gender="M"),
interval="predict")
# fit lwr upr
# 1 68.75291 64.51971 72.9861
```
```R
predict(heights.lm,data.frame(Father=68,Mother=64,Gender="F"),
interval="predict")
# fit lwr upr
# 1 63.52695 59.29329 67.76062
```
## What's next?
This concludes this course. If you want to go further with Bayesian statistics, the next topics to explore would be hierarchical modeling and fitting non-conjugate models with Markov chain Monte Carlo (MCMC).
# Handy Quadratic Congruences Facts
## Number of Solutions
<u>For congruences mod 2</u>
**Proposition 16.1**. Let $f(x) = ax^2 + bx + c$ with $a$ odd, and let $\Delta = b^2 - 4ac$ be the discriminant of $f(x)$. Then,
1. If $\Delta \equiv 1$ (mod $8$), so that $b$ is odd and $c$ is even, then $f(x) \equiv 0$ (mod $2$) has **two** solutions
2. If $\Delta \equiv 5$ (mod $8$), so that $b$ and $c$ are odd, then $f(x) \equiv 0$ (mod $2$) has **no** solutions
3. If $4 | \Delta$ , so that $b$ is even, then $f(x) \equiv 0$ (mod $2$) has exactly **one** solution.
**Proposition 16.2.** Let $p$ be an odd prime and let $a$ be an integer. Then,
1. If $p$ does not divide $a$, then the congruence $x^2 \equiv a$ (mod $p$) has either two solutions or no solutions.
2. If $p$ divides $a$, then $x^2 \equiv a$ (mod $p$) has exactly one solution, namely $x = 0$.
**Legendre symbol definition**. Let $p$ be an odd prime and $a$ any integer. Then the *Legendre symbol*, written as $(\frac{a}{p})$ is defined as
$$
(\frac{a}{p}) = \begin{cases}
1, & \text{if $x^2 \equiv a$ (mod $p$) has exactly two solutions,} \\
0, & \text{if $x^2 \equiv a$ (mod $p$) has exactly one solution,} \\
-1, & \text{if $x^2 \equiv a$ (mod $p$) has no solutions.}
\end{cases}
$$
**Properties of Legendre symbol**.
- $(\frac{a}{p}) = 0 \iff p$ divides $a$
- $(\frac{1}{p}) = 1$ for every odd prime $p$
- $a \equiv b$ (mod $p$) $\implies$ $(\frac{a}{p}) = (\frac{b}{p})$
- $(\frac{ab}{p}) = (\frac{a}{p})(\frac{b}{p})$
- If $p$ is an odd prime then,
- $$
(\frac{-1}{p}) =
\begin{cases}
1, & \text{if $p \equiv 1$ (mod $4$)} \\
-1, & \text{ if $p \equiv 3$ (mod $4$)}
\end{cases}
$$
- If $p$ is an odd prime then,
- $$
(\frac{2}{p}) =
\begin{cases}
1, & \text{if $p \equiv 1$ (mod $8$) or $p \equiv 7$ (mod $8$)} \\
-1, & \text{ if $p \equiv 3$ (mod $8$) or $p \equiv 5$ (mod $8$)}
\end{cases}
$$
- **Quadratic Reciprocity Theorem**. Let $p$ and $q$ be distinct odd primes. Then,
- $$
(\frac{q}{p}) =
\begin{cases}
(\frac{p}{q}), & \text{if $p \equiv 1$ (mod $4$) or $q \equiv 1$ (mod $4$)} \\
-(\frac{p}{q}), & \text{ if $p \equiv 3$ (mod $4$) and $q \equiv 3$ (mod $4$)}
\end{cases}
$$
## Procedure
When $p$ is an odd prime with $p \nmid a$, a quadratic congruence $ax^2 + bx + c \equiv 0$ (mod $p$) can be transformed into a specialized form by completing the square.
\begin{align*}
ax^2 + bx + c \equiv 0 \text{ (mod $p$)} &\iff 4a(ax^2 + bx + c) \equiv 0 \text{ (mod $p$)} \\\\
&\iff 4a^2x^2 + 4abx + 4ac \equiv 0 \text{ (mod $p$)} \\\\
&\iff 4a^2x^2 + 4abx + 4ac + (b^2 - 4ac) \equiv b^2 - 4ac \text{ (mod $p$)} \\\\
&\iff 4a^2x^2 + 4abx + b^2 \equiv b^2 - 4ac \text{ (mod $p$)} \\\\
&\iff (2ax+b)^2 \equiv b^2 - 4ac \text{ (mod $p$)}
\end{align*}
## Quadratic Congruences Modulo a prime power
Let $a$ be the solution to $f(x) \equiv 0$ (mod $p$) where $p$ is an odd prime. Consider $b = pt + a$. Then, $f(b) \equiv 0$ (mod $p^2$) if $f^\prime(a)t \equiv -\frac{f(a)}{p}$ (mod $p$).
In general, let $a$ be the solution to $f(x) \equiv 0$ (mod $p^n$) where $p$ is an odd prime. Consider $b = pt + a$. Then, $f(b) \equiv 0$ (mod $p^{n + 1}$) if $f^\prime(a)t \equiv -\frac{f(a)}{p^n}$ (mod $p$)
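For instance (a worked example added for illustration), take $f(x) = x^2 - 2$ and $p = 7$, so $a = 3$ solves $f(x) \equiv 0$ (mod $7$). Then
$$
f(3) = 7, \quad f^\prime(3) = 6, \quad 6t \equiv -\frac{7}{7} \equiv -1 \text{ (mod $7$)} \implies t \equiv 1 \text{ (mod $7$)},
$$
$$
b = 3 + 7 \cdot 1 = 10, \quad f(10) = 98 = 2 \cdot 49 \equiv 0 \text{ (mod $49$)}.
$$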
# Real Analysis Sheet
**Fact:** $\forall a, b \ge 0$, $\sqrt{ab} \le \frac{1}{2}(a + b)$.
**Bernoulli's Inequality:** If $x > -1$, then $(1 + x)^n \ge 1 + nx$, $\forall n \in \mathbb{N}$.
**Triangle Inequality:** If $a,b \in \mathbb{R}$, then $|a + b| \le |a| + |b|$.
**Epsilon Neighborhood Definition:**
Let $a \in \mathbb{R}$ and $\epsilon > 0$. The $\epsilon$-neighborhood of $a$ is the set $V_\epsilon(a) = \{ x \in \mathbb{R}, |x - a| < \epsilon\}$.
**Archimedean Property:** If $x \in \mathbb{R}$, then $\exists n_x \in \mathbb{N}$ such that $x \le n_x$.
**Convergence Definition:**
$X = (x_n)$ converges to $x \in \mathbb{R}$ if $\forall \epsilon > 0, \exists k \in \mathbb{N}$ such that $\forall n \ge k, |x_n - x| < \epsilon$.
**Fact:** A sequence in $\mathbb{R}$ can have at most one limit.
**Theorem 3.1.9:**
If $(a_n)$ is a sequence in $\mathbb{R}$ such that $a_n \rightarrow 0$, and for some constant $c > 0, m \in \mathbb{N}$,
$|x_n - x| \le ca_n, \forall n \ge m$, then $x_n \rightarrow x$.
**Theorem 3.2.7:**
(a) If $x_n \rightarrow x$, then $|x_n| \rightarrow |x|$.
(b) If $x_n \rightarrow x$ and $x_n \ge 0$, then $\sqrt{x_n} \rightarrow \sqrt{x}$.
**Ratio Test:**
Let $\{x_n\} \subseteq \mathbb{R}^+$ such that $L = \lim{(\frac{x_{n + 1}}{x_n})}$. If $L < 1$, then $x_n \rightarrow 0$.
**Theorem 3.2.2:** Every convergent sequence is bounded. (The converse is not necessarily true)
**Squeeze Theorem:**
If $x_n \rightarrow x, y_n \rightarrow y,$ and $z_n \rightarrow x$ such that $x_n \le y_n \le z_n, \forall n \in \mathbb{N}$ then $y = x$.
**Monotone Convergence Theorem**
Let $X = (x_n)$ be a sequence in $\mathbb{R}$.
(a) If $X$ is monotonically increasing and bounded above then $\lim{x_n} = sup\{x_n : n \in \mathbb{N}\}$
(b) If $X$ is monotonically decreasing and bounded below then $\lim{x_n} = inf\{x_n : n \in \mathbb{N}\}$
**Useful Fact:** If $\lim{x_n} = a \in \mathbb{R}$, then $\lim{x_{n + 1}} = a$.
**Interesting Application of MCT:**
Let $a > 0$, let $s_1 > 0$ be arbitrary, and define $s_{n + 1} = \frac{1}{2}(s_n + \frac{a}{s_n})$. Then $s_n \rightarrow \sqrt{a}$.
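A quick numerical illustration in R (with the arbitrary choices $a = 2$ and $s_1 = 1$): the iterates settle on $\sqrt{2}$ within a few steps.
```R
a <- 2; s <- 1                  # arbitrary a > 0 and starting value s1 > 0
for (i in 1:6) {
  s <- 0.5 * (s + a / s)        # s_{n+1} = (s_n + a/s_n)/2
  print(s)
}
sqrt(2)                         # the limit of the sequence
```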
**Euler's Number:** Consider the sequence $e_n = (1 + \frac{1}{n})^n$. We know $e_n \rightarrow e$.
**Theorem 3.4.2:** If $X = (x_n) \subseteq \mathbb{R}$ converges to $x$, then every subsequence converges to $x$.
**Corollary:** If $X = (x_n) \subseteq \mathbb{R}$ has a subsequence that diverges then $X$ diverges.
**Monotone Sequence Theorem**: If $X = (x_n) \subseteq \mathbb{R}$, then it contains a <u>monotone subsequence</u>.
**Bolzano-Weierstrass Theorem:** Every bounded sequence in $\mathbb{R}$ has a convergent subsequence.
**Theorem 3.4.12:** A bounded $(x_n) \subseteq \mathbb{R}$ is convergent iff $\liminf(x_n) = \limsup(x_n)$.
**Cauchy Criterion for Convergence:** Let $X = (x_n) \subseteq \mathbb{R}$. We say that $X$ is Cauchy if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall n,m \ge N, |x_n - x_m| < \epsilon$. We know that a sequence converges iff it is Cauchy.
**P-Series:** Series of the form $s_n = \sum_{i = 1}^n{\frac{1}{i^p}}$ is convergent for $p > 1$.
**Geometric Series:**
Series of the form $s_n = \sum_{i = 0}^{n - 1}{ar^i}$ has the partial sum $a\frac{1 - r^n}{1 - r}$ (for $r \ne 1$) and converges to $\frac{a}{1 - r}$ if $|r| < 1$.
**Comparison Test:**
Let $(x_n), (y_n) \subseteq \mathbb{R}$. Suppose that for some $k \in \mathbb{N}, 0 \le x_n \le y_n,$ for $n \ge k$.
(a) If $\sum{y_n} < \infty$, then $\sum{x_n} < \infty$.
(b) If $\sum{x_n} = \infty$, then $\sum{y_n} = \infty$.
**Limit Comparison Test:**
Let $(x_n), (y_n)$ be strictly positive sequence of real numbers. Suppose $r = \lim{\frac{x_n}{y_n}}$.
(a) If $r \ne 0$, then $\sum{x_n} < \infty \iff \sum{y_n} < \infty$.
(b) If ($r = 0$ and $\sum{y_n} < \infty$), then $\sum{x_n} < \infty$.
---
title: Reproducible Research
showthedate: false
---
# Reproducible Research
In the winter of 2017, I took a Coursera course on Reproducible Research taught by Dr. Peng from Johns Hopkins University.
Below are my notes for each of the four weeks.
[Week 1](week1)
[Week 2](week2)
[Week 3](week3)
[Week 4](week4)
# Reproducible Research Week 1
## Replication
The ultimate standard for strengthening scientific evidence is replication of findings by conducting studies with independent
- Investigators
- Data
- Analytical Methods
- Laboratories
- Instruments
Replication is particularly important in studies that can impact broad policy or regulatory decisions
### What's wrong with replication?
Some studies cannot be replicated
- No time, opportunistic
- No money
- Unique
*Reproducible Research:* Make analytic data and code available so that others may reproduce findings
Reproducibility bridges the gap between replication (which is awesome) and doing nothing.
## Why do we need reproducible research?
New technologies are increasing data collection throughput; data are more complex and extremely high dimensional
Existing databases can be merged into new "megadatabases"
Computing power is greatly increased, allowing more sophisticated analyses
For every field "X" there is a field "Computational X"
## Research Pipeline
Measured Data -> Analytic Data -> Computational Results -> Figures/Tables/Numeric Summaries -> Articles -> Text
Data/metadata used to develop the test should be made publicly available
The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available
"Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps. All aspects of the analysis needs to be transparently reported" -- IOM Report
### What do we need for reproducible research?
- Analytic data are available
- Analytic code are available
- Documentation of code and data
- Standard means of distribution
### Who is the audience for reproducible research?
Authors:
- Want to make their research reproducible
- Want tools for reproducible research to make their lives easier (or at least not much harder)
Readers:
- Want to reproduce (and perhaps expand upon) interesting findings
- Want tools for reproducible research to make their lives easier.
### Challenges for reproducible research
- Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)
- Readers must download data/results individually and piece together which data go with which code sections, etc.
- Readers may not have the same resources as authors
- Few tools to help authors/readers
### What happens in reality
Authors:
- Just put stuff on the web
- (Infamous for disorganization) Journal supplementary materials
- There are some central databases for various fields (e.g biology, ICPSR)
Readers:
- Just download the data and (try to) figure it out
- Piece together the software and run it
## Literate (Statistical) Programming
An article is a stream of text and code
Analysis code is divided into text and code "chunks"
Each code chunk loads data and computes results
Presentation code formats results (tables, figures, etc.)
Article text explains what is going on
Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable documents
Literate programming is a general concept that requires
1. A documentation language (human readable)
2. A programming language (machine readable)
knitr is an R package that supports a variety of documentation languages, such as LaTeX, Markdown, and HTML
### Quick summary so far
Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate
Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available
There is a growing number of tools for creating reproducible documents
**Golden Rule of Reproducibility: Script Everything**
## Steps in a Data Analysis
1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modeling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
11. Create reproducible code
"Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?" -- Dan Myer
Defining a question is the most powerful dimension-reduction tool you can ever employ.
### An Example for #1
**Start with a general question**
Can I automatically detect emails that are SPAM or not?
**Make it concrete**
Can I use quantitative characteristics of emails to classify them as SPAM?
### Define the ideal data set
The data set may depend on your goal
- Descriptive goal -- a whole population
- Exploratory goal -- a random sample with many variables measured
- Inferential goal -- The right population, randomly sampled
- Predictive goal -- a training and test data set from the same population
- Causal goal -- data from a randomized study
- Mechanistic goal -- data about all components of the system
### Determine what data you can access
Sometimes you can find data free on the web
Other times you may need to buy the data
Be sure to respect the terms of use
If the data don't exist, you may need to generate it yourself.
### Obtain the data
Try to obtain the raw data
Be sure to reference the source
Polite emails go a long way
If you load the data from an Internet source, record the URL and time accessed
### Clean the data
Raw data often needs to be processed
If it is pre-processed, make sure you understand how
Understand the source of the data (census, sample, convenience sample, etc)
May need reformatting, subsampling -- record these steps
**Determine if the data are good enough** -- If not, quit or change data
### Exploratory Data Analysis
Look at summaries of the data
Check for missing data
-> Why is there missing data?
Look for outliers
Create exploratory plots
Perform exploratory analyses such as clustering
If your plot is hard to read because the data are all bunched up, consider taking the log base 10 of an axis
`plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)`
### Statistical prediction/modeling
Should be informed by the results of your exploratory analysis
Exact methods depend on the question of interest
Transformations/processing should be accounted for when necessary
Measures of uncertainty should be reported.
### Interpret Results
Use the appropriate language
- Describes
- Correlates with/associated with
- Leads to/Causes
- Predicts
Gives an explanation
Interpret Coefficients
Interpret measures of uncertainty
### Challenge Results
Challenge all steps:
- Question
- Data Source
- Processing
- Analysis
- Conclusions
Challenge measures of uncertainty
Challenge choices of terms to include in models
Think of potential alternative analyses
### Synthesize/Write-up Results
Lead with the question
Summarize the analyses into the story
Don't include every analysis; include an analysis only
- If it is needed for the story
- If it is needed to address a challenge
- Order analyses according to the story, rather than chronologically
- Include "pretty" figures that contribute to the story
### In the lecture example...
Lead with the question
Can I use quantitative characteristics of the emails to classify them as SPAM?
Describe the approach
Collected data from UCI -> created training/test sets
Explored Relationships
Choose logistic model on training set by cross validation
Applied to test, 78% test set accuracy
Interpret results
Number of dollar signs seem reasonable, e.g. "Make more money with Viagra $ $ $ $"
Challenge Results
78% isn't that great
Could use more variables
Why use logistic regression?
## Data Analysis Files
Data
- Raw Data
- Processed Data
Figures
- Exploratory Figures
- Final Figures
R Code
- Raw/Unused Scripts
- Final Scripts
- R Markdown Files
Text
- README files
- Text of Analysis/Report
### Raw Data
Should be stored in the analysis folder
If accessed from the web, include URL, description, and date accessed in README
### Processed Data
Processed data should be named so it is easy to see which script generated the data
The processing script -- processed data mapping should occur in the README
Processed data should be tidy
### Exploratory Figures
Figures made during the course of your analysis, not necessarily part of your final report
They do not need to be "pretty"
### Final Figures
Usually a small subset of the original figures
Axes/Colors set to make the figure clear
Possibly multiple panels
### Raw Scripts
May be less commented (but comments help you!)
May be multiple versions
May include analyses that are later discarded
### Final Scripts
Clearly commented
- Small comments liberally - what, when, why, how
- Bigger commented blocks for whole sections
Include processing details
Only analyses that appear in the final write-up
### R Markdown Files
R Markdown files can be used to generate reproducible reports
Text and R code are integrated
Very easy to create in RStudio
### Readme Files
Not necessary if you use R Markdown
Should contain step-by-step instructions for analysis
### Text of the document
It should contain a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)
It should tell a story
It should not include every analysis you performed
References should be included for statistical methods
## Coding Standards for R
1. Always use text files/text editor
2. Indent your code
3. Limit the width of your code (80 columns?)
4. Author suggests indentation of 4 spaces at minimum
5. Limit the length of individual functions
## What is Markdown?
Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it structurally to valid XHTML/HTML
## Markdown Syntax
`*This text will appear italicized!*`
*This text will appear italicized!*
`**This text will appear bold!**`
**This text will appear bold**
`## This is a secondary heading`
`### This is a tertiary heading`
## This is a secondary heading
### This is a tertiary heading
Unordered Lists
`- first item in list`
`- second item in list`
- first item in list
- second item in list
Ordered lists
`1. first item in list`
`2. second item in list`
`3. third item in list`
1. first item in list
2. second item in list
3. third item in list
Create links
`[Download R](http://www.r-project.org/)`
[Download R](http://www.r-project.org/)
Advanced linking
`I spent so much time reading [R bloggers][1] and [Simply Statistics][2]!`
`[1]: http://www.r-bloggers.com/ "R bloggers"`
`[2]: http://simplystatistics.org/ "Simply Statistics"`
I spent so much time reading [R bloggers][1] and [Simply Statistics][2]!
[1]: http://www.r-bloggers.com/ "R bloggers"
[2]: http://simplystatistics.org/ "Simply Statistics"
Manual line breaks require two spaces at the end of a line
## What is Markdown?
Created by John Gruber and Aaron Swartz. It is a simplified version of "markup" languages. It allows one to focus on writing as opposed to formatting. Markdown provides a simple, minimal, and intuitive way of formatting elements.
You can easily convert Markdown to valid HTML (and other formats) using existing tools.
## What is R Markdown?
R Markdown is the integration of R code with markdown. It allows one to create documents containing "live" R code. R code is evaluated as part of the processing of the markdown and its results are inserted into the Markdown document. R Markdown is a core tool in **literate statistical programming**
R Markdown can be converted to standard Markdown using the `knitr` package in R. Markdown can then be converted to HTML using the `markdown` package in R. This workflow can be easily managed using RStudio. One can create PowerPoint-like slides using the `slidify` package.
## Problems, Problems
- Authors must undertake considerable effort to put data/results on the web
- Readers must download data/results individually and piece together which data go with which code sections, etc.
- Authors/readers must manually interact with websites
- There is no single document to integrate data analysis with textual representations; i.e., data, code, and text are not linked
One of the ways to resolve this is to simply put the data and code together in the same document so that people can execute the code in the right order, and the data are read at the right times. You can have a single document that integrates the data analysis with all the textual representations.
## Literate Statistical Programming
- Original idea comes from Don Knuth
- An article is a stream of **text** and **code**
- Analysis code is divided into text and code "chunks"
- Presentation code formats results (tables, figures, etc.)
- Article text explains what is going on
- Literate programs are weaved to produce human-readable documents and tangled to produce machine-readable documents.
## Literate Statistical Programming
- Literate programming is a general concept. We need
- A documentation language
- A programming language
- `knitr` supports a variety of documentation languages
## How Do I Make My Work Reproducible?
- Decide to do it (ideally from the start)
- Keep track of everything, hopefully through a version control system
- Use software in which operations can be coded
- Don't save output
- Save data in non-proprietary formats
## Literate Programming: Pros
- Text and code all in one place, logical order
- Data, results automatically updated to reflect external changes
- Code is live -- automatic "regression test" when building a document
## Literate Programming: Cons
- Text and code are all in one place; can make documents difficult to read, especially if there is a lot of code
- Can substantially slow down processing of documents (although there are tools to help)
## What is Knitr Good For?
- Manuals
- Short/Medium-Length technical documents
- Tutorials
- Reports (Especially if generated periodically)
- Data Preprocessing documents/summaries
## What is knitr NOT good for?
- Very long research articles
- Complex time-consuming computations
- Documents that require precise formatting
## Non-GUI Way of Creating R Markdown documents
```R
library(knitr)
setwd("<working directory>")  # replace with your working directory
knit2html('document.Rmd')
browseURL('document.html')
```
## A few notes about knitr
- knitr will fill a new document with filler text; delete it
- Code chunks begin with ` ```{r}` and end with ` ``` `
- All R code goes in between these markers
- Code chunks can have names, which is useful when we start making graphics
` ```{r firstchunk}`
`## R code goes here`
` ``` `
- By default, code in a code chunk is echoed, as are the results of the computation (if there are results to print)
## Processing of knitr documents
- You write RMarkdown document (.Rmd)
- knitr produces a Markdown document (.md)
- knitr converts the Markdown document into HTML (by default)
- .Rmd -> .md -> .html
- You should NOT edit (or save) the .md or .html documents until you are finished
## Inline Text Computations
You can reference variable in RMarkdown through the following
```
`The current time is `r time`. My favorite random number is `r rand`
```
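For the inline references above to work, `time` and `rand` have to be computed in an earlier chunk. A minimal sketch (the variable names are simply the ones used in the example; the chunk name is arbitrary):
```{r computetime, echo=FALSE}
time <- format(Sys.time(), "%a %b %d %X %Y")   # current time as text
rand <- rnorm(1)                               # one random number
```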
## Setting Global Options
- Sometimes we want to set options for every code chunk that are different from the defaults
- For example, we may want to suppress all code echoing and results output
- We have to write some code to set these global options
Example for suppressing all code chunks
```{r setoptions, echo=FALSE}
opts_chunk$set(echo = FALSE, results = "hide")
```
## Some Common Options
- Output
- Results: "axis", "hide"
- echo: TRUE, FALSE
- Figures
- fig.height: numeric
- fig.width: numeric
## Caching Computations
- What if one chunk takes a long time to run?
- All chunks have to be re-computed every time you re-knit the file
- The `cache=TRUE` option can be set on a chunk-by-chunk basis to store results of computation
- After the first run, results are loaded from cache
## Caching Caveats
- If the data or code (or anything external) changes, you need to re-run the cache code chunks
- Dependencies are not checked explicitly!!!!
- Chunks with significant *side effects* may not be cacheable
## Summary of knitr
- Literate statistical programming can be a useful way to put text, code, data, output all in one document
- knitr is a powerful tool for iterating code and text in a simple document format
## tl;dr
People are busy, especially managers and leaders. Results of data analyses are sometimes presented in oral form, but often the first cut is presented via email.
It is often useful, therefore, to break down the results of an analysis into different levels of granularity/detail
## Hierarchy of Information: Research Paper
- Title / Author List
- Speaks about what the paper is about
- Hopefully interesting
- No detail
- Abstract
- Motivation of the problem
- Bottom Line Results
- Body / Results
- Methods
- More detailed results
- Sensitivity Analysis
- Implication of Results
- Supplementary Materials / Gory Details
- Details on what was done
- Code / Data / Really Gory Details
- For reproducibility
## Hierarchy of Information: Email Presentation
- Subject Line / Subject Info
- At a minimum: include one
- Can you summarize findings in one sentence?
- Email Body
- A brief description of the problem / context: recall what was proposed and executed; summarize findings / results. (Total of 1-2 paragraphs)
- If action is needed to be taken as a result of this presentation, suggest some options and make them as concrete as possible
- If questions need to be addressed, try to make them yes / no
- Attachment(s)
- R Markdown file
- knitr report
- Stay Concise: Don't spit out pages of code
- Links to Supplementary Materials
- Code / Software / Data
- Github Repository / Project Website
## DO: Start with Good Science
- Remember: garbage in, garbage out
- Find a coherent focused question. This helps solve many problems
- Working with good collaborators reinforces good practices
- Something that's interesting to you will hopefully motivate good habits
## DON'T: Do Things By Hand
- Editing spreadsheets of data to "clean it up"
- Removing outliers
- QA / QC
- Validating
- Editing tables or figures (e.g rounding, formatting)
- Downloading data from a website
- Moving data around your computer, splitting, or reformatting files.
Things done by hand need to be precisely documented (this is harder than it sounds!)
## DON'T: Point and Click
- Many data processing / statistical analysis packages have graphical user interfaces (GUIs)
- GUIs are convenient / intuitive but the actions you take with a GUI can be difficult for others to reproduce
- Some GUIs produce a log file or script which includes equivalent commands; these can be saved for later examination
- In general, be careful with data analysis software that is highly interactive; ease of use can sometimes lead to non-reproducible analyses.
- Other interactive software, such as text editors, is usually fine.
## DO: Teach a Computer
If something needs to be done as part of your analysis / investigation, try to teach your computer to do it (even if you only need to do it once)
In order to give your computer instructions, you need to write down exactly what you mean to do and how it should be done. Teaching a computer almost guarantees reproducibility
For example, by hand you can:
1. Go to the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/
2. Download the Bike Sharing Dataset
Or you can teach your computer to do it using R
```R
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip", "ProjectData/Bike-Sharing-Dataset.zip")
```
Notice here that:
- The full URL to the dataset file is specified
- The name of the file saved to your local computer is specified
- The directory to which the file was saved is specified ("ProjectData")
- Code can always be executed in R (as long as the link is available); a slightly fuller sketch follows below
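A slightly fuller sketch of the same idea (the directory check, the unzip step, and the download-date bookkeeping are additions for illustration, not part of the original example):
```R
# Create the data directory if it does not already exist
if (!dir.exists("ProjectData")) {
  dir.create("ProjectData")
}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip"
download.file(fileUrl, destfile = "ProjectData/Bike-Sharing-Dataset.zip")
unzip("ProjectData/Bike-Sharing-Dataset.zip", exdir = "ProjectData")
dateDownloaded <- date()  # record when the data were retrieved
```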
## DO: Use Some Version Control
It helps you slow things down by committing changes in small chunks (don't just do one massive commit). It allows you to track / tag snapshots so that you can revert to older versions of the project. Services like Github / Bitbucket / SourceForge make it easy to publish results.
## DO: Keep Track of Your Software Environment
If you work on a complex project involving many tools / datasets, the software and computing environment can be critical for reproducing your analysis.
**Computer Architecture**: CPU (Intel, AMD, ARM), CPU Architecture, GPUs
**Operating System**: Windows, Mac OS, Linux / Unix
**Software Toolchain**: Compilers, interpreters, command shell, programming language (C, Perl, Python, etc.), database backends, data analysis software
**Supporting software / infrastructure**: Libraries, R packages, dependencies
**External dependencies**: Websites, data repositories, remote databases, software repositories
**Version Numbers:** Ideally, for everything (if available)
This function in R reports a variety of information about the software environment:
```R
sessionInfo()
```
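One way (among others) to keep this information with the project is to write it to a text file next to the analysis; the file name here is just an illustrative choice:
```R
# Save the session information alongside the analysis for later reference
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```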
## DON'T: Save Output
Avoid saving data analysis output (tables, figures, summaries, processed data, etc.), except perhaps temporarily for efficiency purposes.
If a stray output file cannot be easily connected with the means by which it was created, then it is not reproducible
Save the data + code that generated the output, rather than the output itself.
Intermediate files are okay as long as there is clear documentation of how they were created.
## DO: Set Your Seed
Random number generators generate pseudo-random numbers based on an initial seed (usually a number or set of numbers)
In R, you can use the `set.seed()` function to set the seed and to specify the random number generator to use
Setting the seed allows for the stream of random numbers to be exactly reproducible
Whenever you generate random numbers for a non-trivial purpose, **always set the seed**.
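A minimal sketch (the seed value 42 is arbitrary): setting the same seed before generating random numbers reproduces the same stream, and `set.seed()` can also name the generator to use.
```R
set.seed(42)   # fix the seed
rnorm(3)       # some "random" numbers
set.seed(42)   # same seed again...
rnorm(3)       # ...gives exactly the same numbers
set.seed(42, kind = "Mersenne-Twister")  # the generator can be specified as well
```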
## DO: Think About the Entire Pipeline
- Data analysis is a lengthy process; it is not just tables / figures / reports
- Raw data -> processed data -> analysis -> report
- How you got to the end is just as important as the end itself
- The more of the data analysis pipeline you can make reproducible, the better for everyone
## Summary: Checklist
- Are we doing good science?
- Is this interesting or worth doing?
- Was any part of this analysis done by hand?
- If so, are those parts precisely documented?
- Does the documentation match reality?
- Have we taught a computer to do as much as possible (i.e. coded)?
- Are we using a version control system?
- Have we documented our software environment?
- Have we saved any output that we cannot reconstruct from original data + code?
- How far back in the analysis pipeline can we go before our results are no longer (automatically) reproducible?
## Replication and Reproducibility
Replication
- Focuses on the validity of the scientific claim
- Is this claim true?
- The ultimate standard for strengthening scientific evidence
- New investigators, data, analytical methods, laboratories, instruments, etc.
- Particularly important in studies that can impact broad policy or regulatory decisions.
Reproducibility
- Focuses on the validity of the data analysis
- Can we trust this analysis?
- Arguably a minimum standard for any scientific study
- New investigators, same data, same methods
- Important when replication is impossible
## Background and Underlying Trends
- Some studies cannot be replicated: No time, no money, or just plain unique / opportunistic
- Technology is increasing data collection throughput; data are more complex and high-dimensional
- Existing databases can be merged to become bigger databases (but data are used off-label)
- Computing power allows more sophisticated analyses, even on "small" data
- For every field "X", there is a "Computational X"
## The Result?
- Even basic analyses are difficult to describe
- Heavy computational requirements are thrust upon people without adequate training in statistics and computing
- Errors are more easily introduced into long analysis pipelines
- Knowledge transfer is inhibited
- Results are difficult to replicate or reproduce
- Complicated analyses cannot be trusted
## What Problem Does Reproducibility Solve?
What we get:
- Transparency
- Data Availability
- Software / Methods Availability
- Improved Transfer of Knowledge
What we do NOT get:
- Validity / Correctness of the analysis
An analysis can be reproducible and still be wrong
We want to know: "Can we trust this analysis?"
Does requiring reproducibility deter bad analysis?
## Problems with Reproducibility
The premise of reproducible research is that with data/code available, people can check each other and the whole system is self-correcting
- Addresses the most "downstream" aspect of the research process -- Post-publication
- Assumes everyone plays by the same rules and wants to achieve the same goals (i.e. scientific discovery)
## Who Reproduces Research?
- For reproducibility to be effective as a means to check validity, someone needs to do something
- Re-run the analysis; check results match
- Check the code for bugs/errors
- Try alternate approaches; check sensitivity
- The need for someone to do something is inherited from the traditional notion of replication
- Who is "someone" and what are their goals?
## The Story So Far
- Reproducibility brings transparency (wrt code+data) and increased transfer of knowledge
- A lot of discussion about how to get people to share data
- The key question of "Can we trust this analysis?" is not addressed by reproducibility
- Reproducibility addresses potential problems long after they've occurred ("downstream")
- Secondary analyses are inevitably colored by the interests/motivations of others.
## Evidence-based Data Analysis
- Most data analyses involve stringing together many different tools and methods
- Some methods may be standard for a given field, but others are often applied ad hoc
- We should apply thoroughly studied (via statistical research), mutually agreed upon methods to analyze data whenever possible
- There should be evidence to justify the application of a given method
## Evidence-based Data Analysis
- Create analytic pipelines from evidence-based components - standardize it
- A deterministic statistical machine
- Once an evidence-based analytic pipeline is established, we shouldn't mess with it
- Analysis with a "transparent box"
- Reduce the "research degrees of freedom"
- Analogous to a pre-specified clinical trial protocol
## Case Study: Estimating Acute Effects of Ambient Air Pollution Exposure
- Acute / Short-term effects typically estimated via panel studies or time series studies
- Work originated in late 1970s early 1980s
- Key question "Are short-term changes in pollution associated with short-term changes in a population health outcome?"
- Studies are usually conducted at a community level
- Long history of statistical research investigating proper methods of analysis
## Case Study: Estimating Acute Effects of Ambient Air Pollution Exposure
- Can we encode everything that we have found in statistical / epidemiological research into a single package?
- Time series studies do not have a huge range of variation; they typically involve similar types of data and similar questions
- Can we create a deterministic statistical machine for this area?
## DSM Modules for Time Series Studies of Air Pollution and Health
1. Check for outliers, high leverage, overdispersion
2. Fill in missing data? No!
3. Model selection: Estimate degrees of freedom to adjust for unmeasured confounders
- Other aspects of model not as critical
4. Multiple lag analysis
5. Sensitivity analysis wrt
- Unmeasured confounder adjustment
- Influential points
## Where to Go From Here?
- One DSM is not enough, we need many!
- Different problems warrant different approaches and expertise
- A curated library of machines providing state-of-the-art analysis pipelines
- A CRAN/CPAN/CTAN/... for data analysis
- Or a "Cochrane Collaboration" for data analysis
## A Curated Library of Data Analysis
- Provide packages that encode data analysis pipelines for given problems, technologies, questions
- Curated by experts knowledgeable in the field
- Documentation / references given supporting each module in the pipeline
- Changes introduced after passing relevant benchmarks/unit tests
## Summary
- Reproducible research is important, but does not necessarily solve the critical question of whether a data analysis is trustworthy
- Reproducible research focuses on the most "downstream" aspect of research documentation
- Evidence-based data analysis would provide standardized best practices for given scientific areas and questions
- Gives reviewers an important tool without dramatically increasing the burden on them
- More effort should be put into improving the quality of "upstream" aspects of scientific research

## The `cacher` Package for R
- Add-on package for R
- Evaluates code written in files and stores intermediate results in a key-value database
- R expressions are given SHA-1 hash values so that changes can be tracked and code reevaluated if necessary
- "Cacher packages" can be built for distribution
- Others can "clone" an analysis and evaluate subsets of code or inspect data objects
The value of this is that other people can clone the analysis and look at subsets of the code, or more specifically at the data objects. People who want to run your code may not necessarily have the resources that you have. Because of that, they may not want to run the entire Markov chain Monte Carlo simulation that you did to get the posterior distribution or the histogram that you got at the end.
But the idea is that you peel the onion a little bit rather than just go straight to the core.
## Using `cacher` as an Author
1. Parse the R source file; create the necessary cache directories and subdirectories
2. Cycle through each expression in the source file
- If an expression has never been evaluated, evaluate it and store any resulting R objects in the cache database
- If any cached results exist, lazy-load the results from the cache database and move to the next expression
- If an expression does not create any R objects (i.e., there is nothing to cache), add the expression to the list of expressions where evaluation needs to be forced
- Write out metadata for this expression to the metadata file
- The `cachepackage` function creates a `cacher` package storing
- Source File
- Cached data objects
- Metadata
- Package file is zipped and can be distributed
- Readers can unzip the file and immediately investigate its contents via the `cacher` package
## Using `cacher` as a Reader
A journal article says
"... the code and data for this analysis can be found in the cacher package 092dcc7dda4b93e42f23e038a60e1d44dbec7b3f"
```R
library(cacher)
clonecache(id = "092dcc7dda4b93e42f23e038a60e1d44dbec7b3f")
clonecache(id = "092d") ## Same as above
# Created cache directory `.cache`
showfiles()
# [1] "top20.R"
sourcefile("top20.R")
```
## Cloning an Analysis
- Local directories are created
- Source code files and metadata are downloaded
- Data objects are *not* downloaded by default (may be really large)
- References to data objects are loaded and corresponding data can be lazy-loaded on demand
`graphcode()` gives a node graph representing the code
## Running Code
- The `runcode` function executes code in the source file
- By default, expressions that result in an object being created are not run, and the resulting objects are lazy-loaded into the workspace
- Expressions not resulting in objects are evaluated
## Checking Code and Objects
- The `checkcode` function evaluates all expressions from scratch (no lazy-loading)
- Results of evaluation are checked against stored results to see if the results are the same as what the author calculated
- Setting RNG seeds is critical for this to work
- The integrity of data objects can be verified with the `checkobjects` function to check for possible corruption of data perhaps during transit.
You can inspect data objects with `loadcache`. This loads in pointers to each of the data objects into the workspace. Once you access the object, it will transfer it from the cache.
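As a sketch of that workflow (assuming an analysis has already been cloned as above; the commented-out object name is hypothetical, not taken from the original analysis):
```R
loadcache()   # load pointers to the cached data objects into the workspace
ls()          # the cached object names now appear in the workspace
# Touching an object (e.g. printing it) transfers it from the cache on demand:
# top20.data  # hypothetical object name
```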
## `cacher` Summary
- The `cacher` package can be used by authors to create cache packages from data analyses for distribution
- Readers can use the `cacher` package to inspect others' data analyses by examining cached computations
- `cacher` is mindful of readers' resources and efficiently loads only those data objects that are needed.
# Case Study: Air Pollution
Particulate Matter -- PM
When doing air pollution studies you're looking at particulate matter pollution. The dust is not just one monolithic piece of dirt or soot but it's actually composed of many different chemical constituents.
These include metals, inert things like salts, and other kinds of components, so there is a possibility that a subset of those constituents are the really harmful elements.
PM is composed of many different chemical constituents and it's important to understand that the Environmental Protection Agency (EPA) monitors the chemical constituents of particulate matter and has been doing so since 1999 or 2000 on a national basis.
## What causes PM to be Toxic?
- PM is composed of many different chemical elements
- Some components of PM may be more harmful than others
- Some sources of PM may be more dangerous than others
- Identifying harmful chemical constituents may lead us to strategies for controlling sources of PM
## NMMAPS
- The National Morbidity, Mortality, and Air Pollution Study (NMMAPS) was a national study of the short-term health effects of ambient air pollution
- Focused primarily on particulate matter ($PM_{10}$) and Ozone ($O_3$)
- Health outcomes include mortality from all causes and hospitalizations for cardiovascular and respiratory diseases
- Key publications
- http://www.ncbi.nlm.nih.gov/pubmed/11098531
- http://www.ncbi.nlm.nih.gov/pubmed/11354823
- Funded by the Health Effects Institute
- Roger Peng currently serves on the Health Effects Institute Health Review Committee
## NMMAPS and Reproducibility
- Data made available at the Internet-based Health and Air Pollution Surveillance System (http://www.ihapss.jhsph.edu)
- Research and software also available at iHAPSS
- Many studies (over 67 published) have been conducted based on the public data http://www.ncbi.nlm.nih.gov/pubmed/22475833
- Has served as an important test bed for methodological development
## What Causes Particulate Matter to be Toxic?
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1665439
- Lippmann et al. found strong evidence that Ni modified the short-term effect of $PM_{10}$ across 60 US communities
- No other PM chemical constituent seemed to have the same modifying effect
- Too simple to be true?
## A Reanalysis of the Lippmann et al. Study
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2137127
- Re-examine the data from NMMAPS and link with PM chemical constituent data
- Are the findings sensitive to levels of nickel in New York City?
## Does Nickel Make PM Toxic?
- Long-term average nickel concentrations appear correlated with PM risk
- There appear to be some outliers on the right-hand side (New York City)
## Does Nickel Make PM Toxic?
One of the most important things about those three points to the right is that they are high leverage points, and the regression line can be very sensitive to high leverage points. Removing those three points from the dataset brings the regression line's slope down a little bit, which produces a line that is no longer statistically significant (p-value about 0.31).
## What Have We Learned?
- New York does have very high levels of nickel and vanadium, much higher than any other US community
- There is evidence of a positive relationship between Ni concentrations and $PM_{10}$ risk
- The strength of this relationship is highly sensitive to the observations from New York City
- Most of the information in the data is derived from just 3 observations
## Lessons Learned
- Reproducibility of NMMAPS allowed for a secondary analysis (and linking with PM chemical constituent data) investigating a novel hypothesis (Lippmann et al.)
- Reproducibility also allowed for a critique of that new analysis and some additional new analysis (Dominici et al.)
- Original hypothesis not necessarily invalidated, but evidence not as strong as originally suggested (more work should be done)
- Reproducibility allows for the scientific discussion to occur in a timely and informed manner
- This is how science works.

---
title: Probability and Statistical Inference
showthedate: false
---
# Probability and Statistical Inference
In the Fall of 2017, I took the course STAT 381 with Dr. Debra Hydorn. Below I included the interesting labs we worked on in the class.
*Please note that these reports were not formatted for this site. So equations and images may not show up.*
[Random Walk](randomwalk)
[Random Number Generation](randomnumber)
[Central Limit Theorem](centrallimit)
[Confidence Interval](confint)

# Central Limit Theorem Lab
**Brandon Rozek**
## Introduction
The Central Limit Theorem tells us that if the sample size is large, then the distribution of sample means approaches the Normal Distribution. For distributions that are more skewed, a larger sample size is needed, since that lowers the impact of extreme values on the sample mean.
### Skewness
Skewness can be determined by the following formula
$$
Sk = E((\frac{X - \mu}{\sigma})^3) = \frac{E((X - \mu)^3)}{\sigma^3}
$$
Uniform distributions have a skewness of zero. Poisson distributions however have a skewness of $\lambda^{-\frac{1}{2}}$.
In this lab, we are interested in the sample size needed to obtain a distribution of sample means that is approximately normal.
### Shapiro-Wilk Test
In this lab, we will test for normality using the Shapiro-Wilk test. The null hypothesis of this test is that the data is normally distributed. The alternative hypothesis is that the data is not normally distributed. This test is known to favor the alternative hypothesis for a large number of sample means. To circumvent this, we will test normality starting with a small sample size $n$ and steadily increase it until we obtain a distribution of sample means that has a p-value greater than 0.05 in the Shapiro-Wilk test.
This tells us that with a false positive rate of 5%, there is no evidence to suggest that the distribution of sample means doesn't follow the normal distribution.
We will use this test to look at the distribution of sample means of both the Uniform and Poisson distribution in this lab.
### Properties of the distribution of sample means
The distribution of sample means from the Uniform distribution has a mean of $0.5$ and a standard deviation of $\frac{1}{\sqrt{12n}}$, while for the Poisson distribution it has a mean of $\lambda$ and a standard deviation of $\sqrt{\frac{\lambda}{n}}$.
## Methods
For the first part of the lab, we will sample means from a Uniform distribution and a Poisson distribution of $\lambda = 1$ both with a sample size $n = 5$.
Doing so shows us how the Uniform distribution is roughly symmetric while the Poisson distribution is highly skewed. This raises the question: what sample size $(n)$ do I need for the Poisson distribution to be approximately normal?
### Sampling the means
The maximum number of observations that the Shapiro-Wilk test allows is 5000. Therefore, we will obtain `n` observations from the Uniform or Poisson distribution, calculate their mean, and repeat that process 5000 times.
The mean is calculated in the following way
$$
Mean = \frac{\sum x_i}{n}
$$
Where $x_i$ is the observation obtained from the Uniform or Poisson distribution
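A compact sketch of this sampling scheme (using `replicate` here instead of the explicit loops in the R code appendix):
```R
n <- 5
xbars_unif <- replicate(5000, mean(runif(n)))              # 5000 sample means from Uniform(0, 1)
xbars_pois <- replicate(5000, mean(rpois(n, lambda = 1)))  # 5000 sample means from Poisson(1)
shapiro.test(xbars_unif)$p.value                           # normality check on the sample means
```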
### Iterating with the Shapiro-Wilk Test
A given sample size doesn't always guarantee that the resulting distribution of sample means fails to reject the Shapiro-Wilk test. Therefore, it is useful to run the test multiple times so that we can report the 95th percentile of the sample sizes that first fail to reject the Shapiro-Wilk test.
The issue with this is that lower lambda values result in higher skewness, as described by the skewness formula. If a distribution has a high degree of skewness, then it will take a larger sample size $n$ to make the sample mean distribution approximately normal.
Large values of $n$ result in longer computation times. Therefore, the code takes this into account by starting at a larger value of $n$ and/or incrementing by a larger value of $n$ each iteration. Incrementing by a larger value of $n$ decreases the precision, though that is the compromise I'm willing to take in order to achieve faster results.
Finding just the first value of $n$ that generates the sample means that fails to reject the Shapiro-Wilk test doesn't tell us much in terms of the sample size needed for the distribution of sample means to be approximately normal. Instead, it is better to run this process many times, finding the values of n that satisfy this condition multiple times. That way we can look at the distribution of sample sizes required and return back the 95th percentile.
Returning the 95th percentile tells us that 95% of the time, it was the sample size returned or lower that first failed to reject the Shapiro-Wilk test. One must be careful, because it can be wrongly interpreted as the sample size needed to fail to reject the Shapiro-Wilk test 95% of the time. Using that logic requires additional mathematics outside the scope of this paper. Returning the 95th percentile of the first sample size that failed to reject the Shapiro-Wilk test, however, will give us a good enough estimate for a sample size needed.
### Plots
Once a value for `n` is determined, we sample the means of the particular distribution (Uniform/Poisson) and create histograms and Q-Q plots for each of the parameters we're interested in. We're looking to verify that the histogram looks symmetric and that the points on the Q-Q plot fit closely to the Q-Q line, with one end of the scatter of points falling on the opposite side of the line from the other.
## Results
### Part I
Sampling the mean of the uniform distribution with $n = 5$ results in a mean of $\bar{x} = 0.498$ and standard deviation of $sd = 0.1288$. The histogram and Q-Q Plot can be seen in Figure I and Figure II respectively.
$\bar{x}$ isn't far from the theoretical 0.5 and the standard deviation is also close to
$$
\frac{1}{\sqrt{12(5)}} \approx 0.129
$$
Looking at the histogram and Q-Q plot, the data appear approximately normal. Therefore we can conclude that a sample size of 5 is sufficient for the distribution of sample means coming from the uniform distribution to be approximately normal.
Sampling the mean of the Poisson distribution with $n = 5$ and $\lambda = 1$ results in a mean of $\bar{x} = 0.9918$ and a standard deviation of $sd = 0.443$. The histogram and Q-Q Plot can be seen in Figures III and IV respectively.
$\bar{x}$ is not too far from the theoretical $\lambda = 1$, the standard deviation is a bit farther from the theoretical
$$
\sqrt{\frac{\lambda}{n}} = \sqrt{\frac{1}{5}} = 0.447
$$
Looking at the figures, however, shows us that the data do not appear normal. Therefore, we cannot conclude that 5 is a big enough sample size for the distribution of sample means from the Poisson distribution with $\lambda = 1$ to be approximately normal.
### Part II
Running the algorithm I described, I produced the following table
| $\lambda$ | Skewness | Sample Size Needed | Shapiro-Wilk P-Value | Average of Sample Means | Standard Deviation of Sample Means | Theoretical Standard Deviation of Sample Means |
| --------- | -------- | ------------------ | -------------------- | ----------------------- | ---------------------------------- | ---------------------------------------- |
| 0.1 | 3.16228 | 2710 | 0.05778 | 0.099 | 0.0060 | 0.0061 |
| 0.5 | 1.41421 | 802 | 0.16840 | 0.499 | 0.0250 | 0.0249 |
| 1 | 1.00000 | 215 | 0.06479 | 1.000 | 0.0675 | 0.0682 |
| 5 | 0.44721 | 53 | 0.12550 | 4.997 | 0.3060 | 0.3071 |
| 10 | 0.31622 | 31 | 0.14120 | 9.999 | 0.5617 | 0.5679 |
| 50 | 0.14142 | 10 | 0.48440 | 50.03 | 2.2461 | 2.2361 |
| 100 | 0.10000 | 6 | 0.47230 | 100.0027 | 4.1245 | 4.0824 |
The skewness was derived from the formula in the first section while the sample size was obtained by looking at the .95 blue quantile line in Figures XVIII-XXIV. The rest of the columns are obtained from the output of the R code function `show_results`.
Looking at the histograms and Q-Q plots produced by the algorithm, the distributions of sample means are all roughly symmetric. The sample means are also tightly clustered around the Q-Q line, showing that the normal distribution is a good fit. This allows us to be confident that using these values of `n` as the sample size would result in the distribution of sample means of the Uniform or Poisson (with a given lambda) being approximately normal.
All the values of the average sample means are very close to the theoretical averages of sample means. The standard deviation of sample means slightly increases as the value of $\lambda$ increases, but it still is quite low.
## Conclusion
The table in the results section clearly shows that as the skewness increases, so does the sample size needed to make the distribution of sample means approximately normal. This shows the central limit theorem in action: no matter the skewness, if you obtain a large enough sample, the distribution of sample means will be approximately normal.
These conclusions pave the way for more interesting applications such as hypothesis testing and confidence intervals.
## Appendix
### Figures
#### Figure I, Histogram of Sample Means coming from a Uniform Distribution with sample size of 5
![part1histunif](/home/rozek/Pictures/stat381lab3/part1histunif.png)
#### Figure II, Q-Q Plot of Sample Means coming from a Uniform Distribution with sample size of 5
![part1qunif](/home/rozek/Pictures/stat381lab3/part1qunif.png)
#### Figure III, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 1$ and sample size of 5
![part1histpoisson](/home/rozek/Pictures/stat381lab3/part1histpoisson.png)
#### Figure IV, Q-Q Plot of Sample Means coming from Poisson Distribution with $\lambda = 1$ and sample size of 5
![part1qpoisson](/home/rozek/Pictures/stat381lab3/part1qpoisson.png)
#### Figure V, Histogram of Sample Means coming from Poisson Distribution with $\lambda = 0.1$ and sample size of 2710
![histpoisson01](/home/rozek/Pictures/stat381lab3/histpoisson01.png)
#### Figure VI, Q-Q Plot of Sample Means coming from Poisson Distribution with $\lambda = 0.1$ and sample size of 2710
![qpoisson01](/home/rozek/Pictures/stat381lab3/qpoisson01.png)
#### Figure VII, Histogram of Sample Means coming from Poisson Distribution with $\lambda = 0.5$ and sample size of 516
![histpoisson05](/home/rozek/Pictures/stat381lab3/histpoisson05.png)
#### Figure VII, Q-Q Plot of Sample Means coming from Poisson Distribution with $\lambda = 0.5$ and sample size of 516
![qpoisson05](/home/rozek/Pictures/stat381lab3/qpoisson05.png)
#### Figure VIII, Histogram of Sample Means coming from Poisson Distribution with $\lambda = 1$ and sample size of 215
![histpoisson1](/home/rozek/Pictures/stat381lab3/histpoisson1.png)
#### Figure IX, Q-Q Plot of Sample Means coming from Poisson Distribution with $\lambda = 1$ and sample size of 215
![qpoisson1](/home/rozek/Pictures/stat381lab3/qpoisson1.png)
#### Figure X, Histogram of Sample Means coming from Poisson Distribution of $\lambda = 5$ and sample size of 53
![histpoisson5](/home/rozek/Pictures/stat381lab3/histpoisson5.png)
#### Figure XI, Q-Q Plot of Sample Means coming from Poisson Distribution of $\lambda = 5$ and sample size of 53
![qpoisson5](/home/rozek/Pictures/stat381lab3/qpoisson5.png)
#### Figure XII, Histogram of Sample Means coming from Poisson Distribution of $\lambda = 10$ and sample size of 31
![histpoisson10](/home/rozek/Pictures/stat381lab3/histpoisson10.png)
#### Figure XIII, Q-Q Plot of Sample Means coming from Poisson Distribution of $\lambda = 10$ and sample size of 31
![qpoisson10](/home/rozek/Pictures/stat381lab3/qpoisson10.png)
#### Figure XIV, Histogram of Sample Means coming from Poisson Distribution of $\lambda = 50$ and sample size of 10
![histpoisson50](/home/rozek/Pictures/stat381lab3/histpoisson50.png)
#### Figure XV, Q-Q Plot of Sample Means coming from Poisson Distribution of $\lambda = 50$ and sample size of 10
![qpoisson50](/home/rozek/Pictures/stat381lab3/qpoisson50.png)
#### Figure XVI, Histogram of Sample Means coming from Poisson Distribution of $\lambda = 100$ and sample size of 6
![histpoisson100](/home/rozek/Pictures/stat381lab3/histpoisson100.png)
#### Figure XVII, Q-Q Plot of Sample Means coming from Poisson Distribution of $\lambda = 100$ and sample size of 6
![qpoisson100](/home/rozek/Pictures/stat381lab3/qpoisson100.png)
#### Figure XVIII, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 0.1$
![histl01](/home/rozek/Pictures/stat381lab3/histl01.png)
#### Figure XIX, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 0.5$
![histl05](/home/rozek/Pictures/stat381lab3/histl05.png)
#### Figure XX, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 1$
![histl1.0](/home/rozek/Pictures/stat381lab3/histl1.0.png)
#### Figure XXI, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 5$
![histl5](/home/rozek/Pictures/stat381lab3/histl5.png)
#### Figure XXII, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 10$
![histl10](/home/rozek/Pictures/stat381lab3/histl10.png)
#### Figure XXIII, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 50$
![histl50](/home/rozek/Pictures/stat381lab3/histl50.png)
#### Figure XXIV, Histogram of sample size needed to fail to reject the normality test for Poisson Distribution of $\lambda = 100$
![histl100](/home/rozek/Pictures/stat381lab3/histl100.png)
### R Code
```R
rm(list=ls())
library(ggplot2)
sample_mean_uniform = function(n) {
  xbarsunif = numeric(5000)
  for (i in 1:5000) {
    sumunif = 0
    for (j in 1:n) {
      sumunif = sumunif + runif(1, 0, 1)
    }
    xbarsunif[i] = sumunif / n
  }
  xbarsunif
}

sample_mean_poisson = function(n, lambda) {
  xbarspois = numeric(5000)
  for (i in 1:5000) {
    sumpois = 0
    for (j in 1:n) {
      sumpois = sumpois + rpois(1, lambda)
    }
    xbarspois[i] = sumpois / n
  }
  xbarspois
}

poisson_n_to_approx_normal = function(lambda) {
  print(paste("Looking at Lambda =", lambda))
  ns = c()
  # Speed up computation of lower lambda values by starting at a different sample size
  # and/or lowering the precision by increasing the delta sample size
  # and/or lowering the number of sample sizes we obtain from the shapiro loop
  increaseBy = 1
  iter = 3
  startingValue = 2
  if (lambda == 0.1) {
    startingValue = 2000
    iter = 3
    increaseBy = 50
  } else if (lambda == 0.5) {
    startingValue = 200
    iter = 5
    increaseBy = 10
  } else if (lambda == 1) {
    startingValue = 150
    iter = 25
  } else if (lambda == 5) {
    startingValue = 20
    iter = 50
    startingValue = 10
  } else if (lambda == 10) {
    iter = 100
  } else {
    iter = 500
  }
  progressIter = 1
  for (i in 1:iter) {
    # Include a progress indicator for personal sanity
    if (i / iter > .1 * progressIter) {
      print(paste("Progress", i / iter * 100, "% complete"))
      progressIter = progressIter + 1
    }
    n = startingValue
    dist = sample_mean_poisson(n, lambda)
    p.value = shapiro.test(dist)$p.value
    while (p.value < 0.05) {
      n = n + increaseBy
      dist = sample_mean_poisson(n, lambda)
      p.value = shapiro.test(dist)$p.value
      # More sanity checks
      if (n %% 10 == 0) {
        print(paste("N =", n, " p.value =", p.value))
      }
    }
    ns = c(ns, n)
  }
  print(ggplot(data.frame(ns), aes(x = ns)) +
          geom_histogram(fill = 'white', color = 'black', bins = 10) +
          geom_vline(xintercept = ceiling(quantile(ns, .95)), col = '#0000AA') +
          ggtitle(paste("Histogram of N needed for Poisson distribution of lambda =", lambda)) +
          xlab("N") +
          ylab("Count") +
          theme_bw())
  ceiling(quantile(ns, .95)) # 95% of the time, this value of n will give you a sampling distribution that is approximately normal
}

uniform_n_to_approx_normal = function() {
  ns = c()
  progressIter = 1
  for (i in 1:500) {
    # Include a progress indicator for personal sanity
    if (i / 500 > .1 * progressIter) {
      print(paste("Progress", i / 5, "% complete"))
      progressIter = progressIter + 1
    }
    n = 2
    dist = sample_mean_uniform(n)
    p.value = shapiro.test(dist)$p.value
    while (p.value < 0.05) {
      n = n + 1
      dist = sample_mean_uniform(n)
      p.value = shapiro.test(dist)$p.value
      if (n %% 10 == 0) {
        print(paste("N =", n, " p.value =", p.value))
      }
    }
    ns = c(ns, n)
  }
  print(ggplot(data.frame(ns), aes(x = ns)) +
          geom_histogram(fill = 'white', color = 'black', bins = 10) +
          geom_vline(xintercept = ceiling(quantile(ns, .95)), col = '#0000AA') +
          ggtitle("Histogram of N needed for Uniform Distribution") +
          xlab("N") +
          ylab("Count") +
          theme_bw())
  ceiling(quantile(ns, .95)) # 95% of the time, this value of n will give you a sampling distribution that is approximately normal
}

show_results = function(dist) {
  print(paste("The mean of the sample mean distribution is:", mean(dist)))
  print(paste("The standard deviation of the sample mean distribution is:", sd(dist)))
  print(shapiro.test(dist))
  print(ggplot(data.frame(dist), aes(x = dist)) +
          geom_histogram(fill = 'white', color = 'black', bins = 20) +
          ggtitle("Histogram of Sample Means") +
          xlab("Mean") +
          ylab("Count") +
          theme_bw())
  qqnorm(dist, pch = 1, col = '#001155', main = "QQ Plot", xlab = "Sample Data", ylab = "Theoretical Data")
  qqline(dist, col = "#AA0000", lty = 2)
}
## PART I
uniform_mean_dist = sample_mean_uniform(n = 5)
poisson_mean_dist = sample_mean_poisson(n = 5, lambda = 1)
show_results(uniform_mean_dist)
show_results(poisson_mean_dist)
## PART II
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 0.1");
n.01 = poisson_n_to_approx_normal(0.1)
show_results(sample_mean_poisson(n.01, 0.1))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 0.5");
n.05 = poisson_n_to_approx_normal(0.5)
show_results(sample_mean_poisson(n.05, 0.5))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 1");
n.1 = poisson_n_to_approx_normal(1)
show_results(sample_mean_poisson(n.1, 1))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 5");
n.5 = poisson_n_to_approx_normal(5)
show_results(sample_mean_poisson(n.5, 5))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 10");
n.10 = poisson_n_to_approx_normal(10)
show_results(sample_mean_poisson(n.10, 10))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 50");
n.50 = poisson_n_to_approx_normal(50)
show_results(sample_mean_poisson(n.50, 50))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 100");
n.100 = poisson_n_to_approx_normal(100)
show_results(sample_mean_poisson(n.100, 100))
print("Starting Algorithm to Find Sample Size Needed for the Uniform Distribution")
n.uniform = uniform_n_to_approx_normal()
show_results(sample_mean_uniform(n.uniform))
```

# Confidence Interval Lab
**Written by Brandon Rozek**
## Introduction
Confidence intervals expand the concept of point estimation by giving a margin of error such that one can be confident that, a certain percentage of the time, the true parameter falls within that interval.
In this lab, we will look at confidence intervals for a mean. This lab focuses on a certain method of confidence intervals that depends on the distribution of sample means being Normal. We will show how the violation of this assumption impacts the probability that the true parameter falls within the interval.
## Methods
The observed level of confidence tells us the proportion of times the true mean falls within a confidence interval. To show how the violation of the Normality assumption affects this, we will sample from a Normal distribution, a T distribution, and an exponential distribution with different sample sizes.
The Normal and T distributions are sampled with a mean of 5 and a standard deviation of 2. The exponential distribution is sampled with a lambda of 2, or a mean of 0.5.
From the samples, we obtain the mean and the upper/lower bounds of the confidence interval. This is performed 100,000 times. That way we obtain a distribution of these statistics.
We know that a confidence interval is valid if the lower bound is no more than the true mean and the upper bound is no less than the true mean. From this definition, we can compute the observed proportion of valid confidence intervals from the simulations.
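A minimal sketch of checking a single interval under the Normal case described above (the sample size, mean, and standard deviation are chosen to match that case):
```R
n  <- 20
mu <- 5
x  <- rnorm(n, mean = mu, sd = 2)                     # one sample from the Normal case
ME <- 1.96 * 2 / sqrt(n)                              # margin of error with known sigma = 2
covered <- (mean(x) - ME < mu) & (mean(x) + ME > mu)  # does the interval contain mu?
covered
```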
### Visualizations
From the distributions of statistics, we can create visualizations to support the understanding of confidence intervals.
The first one is a scatterplot of lower bounds vs upper bounds. This plot demonstrates the valid confidence intervals in blue and the invalid ones in red. It demonstrates how confidence intervals that are invalid are not located inside the box.
The second visualization is a histogram of all the sample means collected. The sample means that didn't belong to a valid confidence interval are shaded in red. This graphic helps demonstrate the type I errors on both sides of the distribution.
In this lab, we're interested in seeing how our observed level of confidence differs from our theoretical level of confidence (95%) when different sample sizes and distributions are applied.
## Results
We can see from the table section in the Appendix that sampling from a Normal or t distribution does not adversely affect our observed level of confidence. The observed level of confidence varies slightly from the theoretical level of confidence of 0.95.
When sampling from the exponential distribution, however, the observed level of confidence highly depends upon the sample size.
Looking at Table III, we can see that for a sample size of 10, the observed level of confidence is at a meager 90%. This is 5% off from our theoretical level of confidence. This shows how the normality assumption is vital to the precision of our estimate.
This comes from the fact that using this type of confidence interval on a mean from a non-normal distribution requires a large sample size for the central limit theorem to take effect.
The central limit theorem states that if the sample size is "large", the distribution of sample means approaches the normal distribution. You can see how in Figure XVIII, the distribution of sample means is skewed, though as the sample size increases, the distribution of sample means becomes more symmetric (Figure XX).
## Conclusion
From this, we can conclude that violating the underlying assumption of normality decreases the observed level of confidence. We can mitigate the decrease of the observed level of confidence when sampling means from a non-normal distribution by having a larger sample size. This is due to the central limit theorem.
## Appendix
### Tables
#### Table I. Sampling from Normal
| Sample Size | Proportion of Means Within CI |
| ----------- | ----------------------------- |
| 10 | 0.94849 |
| 20 | 0.94913 |
| 50 | 0.95045 |
| 100 | 0.94955 |
#### Table II. Sampling from T Distribution
| Sample Size | Proportion of Means Within CI |
| ----------- | ----------------------------- |
| 10 | 0.94966 |
| 20 | 0.94983 |
| 50 | 0.94932 |
| 100 | 0.94999 |
#### Table III. Sampling from Exponential Distribution
| Sample Size | Proportion of Means Within CI |
| ----------- | ----------------------------- |
| 10 | 0.89934 |
| 20 | 0.91829 |
| 50 | 0.93505 |
| 100 | 0.94172 |
### Figures
#### Normal Distribution
##### Figure I. Scatterplot of Bounds for Normal Distribution of Sample Size 10
![normal10scatter](/home/rozek/Pictures/statlab4/normal10scatter.png)
##### Figure II. Histogram of Sample Means for Normal Distribution of Sample Size 10
![normal10hist](/home/rozek/Pictures/statlab4/normal10hist.png)
##### Figure III. Scatterplot of Bounds for Normal Distribution of Sample Size 20
![normal20scatterplot](/home/rozek/Pictures/statlab4/normal20scatterplot.png)
##### Figure IV. Histogram of Sample Means for Normal Distribution of Sample Size 20
![normal20hist](/home/rozek/Pictures/statlab4/normal20hist.png)
##### Figure V. Scatterplot of Bounds for Normal Distribution of Sample Size 50
![normal50scatterplot](/home/rozek/Pictures/statlab4/normal50scatterplot.png)
##### Figure VI. Histogram of Sample Means for Normal Distribution of Sample Size 50
![normal50hist](/home/rozek/Pictures/statlab4/normal50hist.png)
##### Figure VII. Scatterplot of Bounds for Normal Distribution of Sample Size 100
![normal100scatterplot](/home/rozek/Pictures/statlab4/normal100scatterplot.png)
##### Figure VIII. Histogram of Sample Means for Normal Distribution of Sample Size 100
![normal100hist](/home/rozek/Pictures/statlab4/normal100hist.png)
#### T Distribution
##### Figure IX. Scatterplot of Bounds for T Distribution of Sample Size 10
![t10scatterplot](/home/rozek/Pictures/statlab4/t10scatterplot.png)
##### Figure X. Histogram of Sample Means for T Distribution of Sample Size 10
![t10hist](/home/rozek/Pictures/statlab4/t10hist.png)
##### Figure XI. Scatterplot of Bounds for T Distribution of Sample Size 20
![t20scatterplot](/home/rozek/Pictures/statlab4/t20scatterplot.png)
##### Figure XII. Histogram of Sample Means for T Distribution of Sample Size 20
![t20hist](/home/rozek/Pictures/statlab4/t20hist.png)
##### Figure XIII. Scatterplot of Bounds for T Distribution of Sample Size 50
![t50scatter](/home/rozek/Pictures/statlab4/t50scatter.png)
##### Figure XIV. Histogram of Sample Means for T Distribution of Sample Size 50
![t50hist](/home/rozek/Pictures/statlab4/t50hist.png)
##### Figure XV. Scatterplot of Bounds for T Distribution of Sample Size 100
![t100scatter](/home/rozek/Pictures/statlab4/t100scatter.png)
##### Figure XVI. Histogram of Sample Means for T Distribution of Sample Size 100
![t100hist](/home/rozek/Pictures/statlab4/t100hist.png)
#### Exponential Distribution
##### Figure XVII. Scatterplot of Bounds for Exponential Distribution of Sample Size 10
![exp10scatter](/home/rozek/Pictures/statlab4/exp10scatter.png)
##### Figure XVIII. Histogram of Sample Means for Exponential Distribution of Sample Size 10
![exp10hist](/home/rozek/Pictures/statlab4/exp10hist.png)
##### Figure XIX. Scatterplot of Bounds for Exponential Distribution of Sample Size 20
![exp20scatter](/home/rozek/Pictures/statlab4/exp20scatter.png)
##### Figure XX. Histogram of Sample Means for Exponential Distribution of Sample Size 20
![exp20hist](/home/rozek/Pictures/statlab4/exp20hist.png)
##### Figure XXI. Scatterplot of Bounds for Exponential Distribution of Sample Size 50
![exp50scatter](/home/rozek/Pictures/statlab4/exp50scatter.png)
##### Figure XXII. Histogram of Sample Means for Exponential Distribution of Sample Size 50
![exp50hist](/home/rozek/Pictures/statlab4/exp50hist.png)
##### Figure XXIII. Scatterplot of Bounds for Exponential Distribution of Sample Size 100
![exp100scatter](/home/rozek/Pictures/statlab4/exp100scatter.png)
##### Figure XXIV. Histogram of Sample Means for Exponential Distribution of Sample Size 100
![exp100hist](/home/rozek/Pictures/statlab4/exp100hist.png)
### R Code
```R
rm(list=ls())
library(ggplot2)
library(functional) # For function currying
proportion_in_CI = function(n, mu, dist) {
  # Preallocate vectors
  lower_bound = numeric(100000)
  upper_bound = numeric(100000)
  means = numeric(100000)
  number_within_CI = 0
  ME = 1.96 * 2 / sqrt(n) ## Normal Margin of Error
  for (i in 1:100000) {
    x = numeric(n)
    # Sample from distribution
    if (dist == "Normal" | dist == "t") {
      x = rnorm(n, mu, 2)
    } else if (dist == "Exponential") {
      x = rexp(n, 1 / mu)
    }
    ## Correct ME if non-normal
    if (dist != "Normal") {
      ME = qt(0.975, n - 1) * sd(x) / sqrt(n)
    }
    ## Store statistics
    means[i] = mean(x)
    lower_bound[i] = mean(x) - ME
    upper_bound[i] = mean(x) + ME
    # Is Confidence Interval Valid?
    if (lower_bound[i] < mu & upper_bound[i] > mu) {
      number_within_CI = number_within_CI + 1
    }
  }
  # Prepare for plotting
  lbub = data.frame(lower_bound, upper_bound, means)
  lbub$col = ifelse(lbub$lower_bound < mu & lbub$upper_bound > mu, 'Within CI', 'Outside CI')
  print(ggplot(lbub, aes(x = lower_bound, y = upper_bound, col = col)) +
          geom_point(pch = 1) +
          geom_hline(yintercept = mu, col = '#000055') +
          geom_vline(xintercept = mu, col = '#000055') +
          ggtitle(paste("Plot of Lower Bounds vs Upper Bounds with Sample Size of ", n)) +
          xlab("Lower Bound") +
          ylab("Upper Bounds") +
          theme_bw()
  )
  print(ggplot(lbub, aes(x = means, fill = col)) +
          geom_histogram(color = 'black') +
          ggtitle(paste("Histogram of Sample Means with Sample Size of ", n)) +
          xlab("Sample Mean") +
          ylab("Count") +
          theme_bw()
  )
  # Return proportion within CI
  number_within_CI / 100000
}

sample_sizes = c(10, 20, 50, 100)

### PART I
proportion_in_CI_Normal = Curry(proportion_in_CI, dist = "Normal", mu = 5)
p_norm = sapply(sample_sizes, proportion_in_CI_Normal)
sapply(p_norm, function(x) {
  cat("The observed proportion of intervals containing mu is", x, "\n")
  invisible(x)
})

### PART II
proportion_in_CI_T = Curry(proportion_in_CI, dist = "t", mu = 5)
p_t = sapply(sample_sizes, proportion_in_CI_T)
sapply(p_t, function(x) {
  cat("The observed proportion of intervals containing mu is", x, "\n")
  invisible(x)
})

### PART III
proportion_in_CI_Exp = Curry(proportion_in_CI, dist = "Exponential", mu = 0.5)
p_exp = sapply(sample_sizes, proportion_in_CI_Exp)
sapply(p_exp, function(x) {
  cat("The observed proportion of intervals containing mu is", x, "\n")
  invisible(x)
})
```

# Random Number Generation
## Introduction
The generation of random numbers has a variety of applications, including but not limited to the modeling of stochastic processes. It is important, therefore, to be able to generate random numbers following any given distribution. One of the many ways to achieve this is to transform numbers sampled from a random uniform distribution.
In this lab, we will compare the effectiveness of the linear congruential method (LCM), `runif`, and `rexp` for generating random numbers. The programming language R will be used, and different plots and summary statistics are drawn upon to analyze the effectiveness of the methods.
## Methods
### The Linear Congruential Method
The linear congruential method (LCM) is an algorithm that produces a sequence of pseudo-random numbers using the following recursive function
$$
X_{n + 1} = (aX_n + c) \mod m
$$
The R code we use has a `c` value of 0 which is a special case known as the multiplicative congruential generator (MCG).
There are several conditions for using an MCG. First, the seed value must be co-prime to `m`. In other words, the greatest common divisor of the two must be 1. A function was written in R to check this. Secondly, `m` must be a prime number or a power of a prime number.
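A minimal sketch of an MCG (the multiplier `a = 16807` is an illustrative choice, not necessarily the one used in the lab; `m = 2^31` matches the value discussed below):
```R
# Multiplicative congruential generator: x_{n+1} = (a * x_n) mod m,
# scaled to [0, 1) by dividing by m
mcg <- function(n, seed, a = 16807, m = 2^31) {
  x <- numeric(n)
  x_prev <- seed
  for (i in 1:n) {
    x_prev <- (a * x_prev) %% m
    x[i] <- x_prev / m
  }
  x
}

u <- mcg(100, seed = 55555)  # 100 pseudo-uniform values from one of the seeds used below
```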
#### Periodicity
An element of periodicity comes into play when dealing with pseudo-random number generators. Once the generator has produced a certain number of terms, it is shown to loop back to the first term of the sequence. It is advantageous, therefore, to keep the period high.
The period in a MCG is at most `m - 1`. In this lab, `m` has a value of $2^{31}$ and only 100 numbers were sampled from the LCM. This allows us to avoid the issue of periodicity entirely in our analysis.
### Mersenne-Twister Method
The default pseudo-random number generator (pseudo RNG) in R is the Mersenne-Twister, with the default seed value derived from the current time and the process id of the current R instance. The Mersenne-Twister is part of a class of pseudo RNGs called generalized feedback shift registers. This class of pseudo RNGs provides a very long period (VLP), or in other words, a long cycle before the values start repeating. VLPs are often useful when executing large simulations in R.
Since this method is already coded in the `base` R-package, we will leave the implementation for the curious to research.
### The Uniform Distribution
#### Probability Density Function
The uniform distribution $Unif(\theta_1, \theta_2)$ has a probability density function (PDF) of the following.
$$
f(x) = \frac{1}{\theta_2 - \theta_1}
$$
Where x is valid between [$\theta_1$, $\theta_2$].
In our case, we are producing numbers from [0,1]. We can therefore reduce the probability mass function to the following
$$
f(x) = 1
$$
#### Expected Value
The expected value can be defined as
$$
E(X) = \int_a^b xf(x)dx
$$
For the uniform distribution we're using that becomes
$$
E(X) = \int_0^1 xdx = \frac{1}{2}[x^2]_0^1 = \frac{1}{2}
$$
We should, therefore, expect the mean of the generated random number sequence to roughly equal $\frac{1}{2}$.
#### Median
To find the median of the distribution, we need to find the point at which the cumulative distribution function (CDF) is equal to .5.
$$
\int_0^{Median(X)} f(x)dx = \frac{1}{2}
$$
$$
\int_0^{Median(X)} dx = \frac{1}{2}
$$
$$
[X]_0^{Median(X)} = \frac{1}{2}
$$
$$
Median(X) = \frac{1}{2}
$$
This shows us that the median of the distribution should be roughly equal to $\frac{1}{2}$ as well.
#### Analysis of Uniform Distribution Fit
The histogram of a uniform distribution of data should look rectangular in shape. This means that the counts of all the sub-intervals should be about the same.
Another way to test whether or not the distribution is a good fit is to use what is called a Quantile-Quantile plot (Q-Q Plot). This is a probability plot that compares the quantiles from the theoretical distribution, this is the uniform distribution, to that of the sample.
For a precise Q-Q plot, we need a lot of quantiles to compare. In this lab, we will compare 100 different quantiles. The quantiles from the theoretical distribution can be easily derived from the fact that all value ranges are equally probable. The theoretical quantiles in this case are 0.01, 0.02, ..., 0.98, 0.99, 1. The sample quantiles are obtained by drawing 100 observations from `runif` or the LCM.
In the Q-Q plot, a good fit is shown when the points hug closely to the Q-Q line. It is also important to have symmetry in the points, meaning that points end up on both sides of the Q-Q line.
For sake of exploration, we will use 5 different seed values for the linear congruential method (while making sure that the seed values are co-prime). We will then use the results of these and compare LCM to how the standard `runif` method generates random numbers.
### Exponential Distribution
#### Probability Density Function and the Cumulative Distribution Function
The exponential distribution is a type of continuous distribution that is defined by the following PDF
$$
f(x) = \lambda e^{-\lambda t}
$$
We can find the CDF by taking the integral of the PDF.
$$
F(x) = \int f(x)dx = \lambda \int e^{-\lambda x} dx = \lambda * (\frac{-1}{\lambda}) e^{-\lambda x} + C = -e^{-\lambda x} + C
$$
One of the conditions for the cumulative distribution function is that as $x \to \infty$, $F(x) \to 1$.
$$
1 = \lim_{x \to \infty} (-e^{-\lambda x} + C) = \lim_{x \to \infty} (-e^{-\lambda x}) + \lim_{x \to \infty} (C) = \lim_{x \to \infty}(-e^{-\lambda x}) + C
$$
This shows that $C = 1$, since $\lim_{x \to \infty} (-e^{-\lambda x})$ is equal to 0.
Substituting $C$ gives us the following.
$$
F(x) = 1 - e^{-\lambda x}
$$
#### Expected Value
We can find the expected value using the formula described in the previous Expected Value section. Reflected in the bounds of integration is the domain of the exponential distribution $[0, \infty)$.
$$
E(X) = \lim_{a \to \infty}\int_0^a x \lambda e^{-\lambda x} dx
$$
The integral above features a product of two functions. So as an aside, we will derive a formula so we can take the integral above.
The total derivative is defined as
$$
d(uv) = udv + vdu
$$
After taking the integral of both sides, we can solve for a formula that gives the following
$$
\int d(uv) = \int udv + \int vdu
$$
$$
\int udv = uv - \int vdu
$$
The formula above is called *Integration by Parts*. We will make use of this formula by defining $u = \lambda x$ and $dv = e^{-\lambda x} dx$
This implies that $du = \lambda dx$ and $v= -\frac{1}{\lambda}e^{-\lambda x}$
$$
E(X) = -\lim_{a \to \infty}[\lambda x \frac{1}{\lambda}e^{-\lambda x}]_0^a - \lim_{b \to \infty}\int_0^b -\frac{1}{\lambda}e^{-\lambda x}\lambda dx
$$
$$
E(X) = -\lim_{a \to \infty} [xe^{-\lambda x}]_0^a - \lim_{b \to \infty}\int_0^b -e^{-\lambda x}dx
$$
$$
E(X) = -\lim_{a \to \infty}(ae^{-\lambda a}) - \lim_{b \to \infty}[\frac{1}{\lambda}e^{-\lambda x}]_0^b
$$
$$
E(X) = 0 - \frac{1}{\lambda}[\lim_{b \to \infty}(e^{-\lambda b}) - e^{-0\lambda}]
$$
$$
E(X) = -\frac{1}{\lambda}(-1) = \frac{1}{\lambda}
$$
For the purposes of this lab, we will make the rate ($\lambda$) equal to 3. Which means we should expect our mean to be roughly equal to $\frac{1}{3}$.
#### Median
The median can be found by setting the CDF equal to $\frac{1}{2}$
$$
1- e^{-\lambda Median(X)} = \frac{1}{2}
$$
$$
e^{-\lambda Median(X)} = \frac{1}{2}
$$
$$
-\lambda Median(X) = \ln(\frac{1}{2})
$$
$$
Median(X) = \frac{\ln(2)}{\lambda}
$$
This shows that we should expect our median to be around $\frac{\ln(2)}{3} \approx 0.231$.
### Inverse Transform Sampling
Once we have a uniform distribution of values, we can then use these values to map to an exponential distribution. This is done through a technique called inverse transform sampling, the technique works as follows:
1. Generate a random number u from the standard uniform distribution
2. Compute the value of X such that F(X) = u
3. The value of X belongs to the distribution of F
Using these steps, we'll derive the exponential transform below.
$$
F(X) = 1 - e^{-\lambda x} = u
$$
Then proceeding to solve for X, we obtain the following.
$$
e^{-\lambda X} = 1 - u
$$
$$
-\lambda X = \ln(1 - u)
$$
$$
X = \frac{-\ln(1 - u)}{\lambda}
$$
With this formula, we can now find values for X which is a random variable that follows an exponential distribution given random uniform data $u$.
In this lab, we are looking at the exponential distribution with a rate of 3. Therefore, the probability density function, the cumulative distribution function, and the exponential transform can be redefined as follows, respectively.
$$
f(x) = 3e^{-3x}
$$
$$
F(x) = 1 - e^{-3x}
$$
$$
X = \frac{-\ln(1 - u)}{3}
$$
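A minimal sketch of applying this transform with rate 3 (using `runif` as the source of uniform values; any of the uniform generators above could stand in):
```R
u <- runif(100)         # uniform draws on [0, 1]
x <- -log(1 - u) / 3    # inverse-transformed to Exponential(rate = 3)
mean(x)                 # should be near 1/3
median(x)               # should be near log(2)/3 (about 0.231)
```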
### R Code
The R code makes use of the concepts above. The purpose of this code is to output the summary statistics, histograms, and Q-Q plots used to compare the different methods.
First the LCM is executed four times with three different seed values. The LCM is used to create a uniform distribution of data that is then compared to the standard `runif` function in the R language.
Then, the exponential transforms of the LCM and `runif` data are computed and compared with the standard `rexp` function to create exponential distributions of data with $\lambda = 3$.
## Results
### Uniform Distribution
For a small sample of 100 values, the data seem evenly distributed for all seeds and methods used. The peaks and troughs are more pronounced in the LCM methods, suggesting that the `runif` command creates more evenly distributed data than the LCM.
Deviations from the theoretical mean and median are lower for the LCM than for the standard `runif` command, with the exception of the LCM run with seed 93463.
The Q-Q plots for all of the methods follow the Q-Q Line tightly and appear symmetric.
### Exponential Distribution
A bin size of 20 is used to make the histograms easily comparable to each other. One interesting thing to note is that in Figure XI there is an observation located far from the rest of the data. Since the purpose of a histogram is to show the shape of the distribution, this single extreme observation obscures the rest of the picture. Therefore, the next figure, Figure XII, shows the histogram of the remaining data with that single outlier removed.
All of the histograms appear exponential in shape. Looking at the Q-Q plots, the LCM transformation and the `rexp` distribution of data fit closely to the Q-Q line. In all of the Q-Q plots, the points drift further from the line at the higher percentiles. The `runif` transformation quantiles diverge from the line after about the 50th percentile.
Deviations from the mean and median were about the same for both transformed techniques (`runif` and LCM). The `rexp` command did better when it came to the deviation from the mean, but performed worse when it came to deviation from the median.
## Conclusion
The linear congruential method performed better at simulating the distributions than the standard R functions. It more accurately captured the median for both the standard uniform data and the exponential data. Against the `runif` command, it also performed better at simulating the mean.
This can especially be seen when comparing the two transform techniques. The transformed LCM distribution of data followed the Q-Q line more tightly than the transformed `runif`.
I speculate that this is due to the seed values used. The Mersenne Twister performs better when the seed value does not contain many zeros. Since its seed is determined by the system time and process ID, we do not know for sure whether it is a proper input for the Mersenne Twister. For the LCM, the seeds were tested to make sure they were co-prime with the modulus $m = 2^{31}$, a condition that allowed them to work well with the LCM.
Further research could standardize the seed values used across all of the different pseudo-random number generators and examine the six other pseudo-random number generators that come built into R. Changing the seed and the random number generator can be achieved through the `set.seed` function.
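As a sketch of what that might look like (the seed value and generator below are arbitrary choices for illustration):
```R
# Sketch: switching R's built-in pseudo-random number generator and seed.
set.seed(12345, kind = "Wichmann-Hill")    # illustrative seed and generator
runif(3)                                   # draws from the selected generator
RNGkind()                                  # report the generators currently in use
set.seed(12345, kind = "Mersenne-Twister") # restore R's default generator
```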
## Appendix
### Figures
#### Figure I, Histogram of LCM Random Data with seed 55555
![hist55555](/home/rozek/Pictures/stat381lab2/hist55555.png)
#### Figure II, Q-Q Plot of LCM Random Data with seed 55555
![qqplot55555](/home/rozek/Pictures/stat381lab2/qqplot55555.png)
#### Figure III, Histogram of LCM Random Data with seed 93463
![hist93463](/home/rozek/Pictures/stat381lab2/hist93463.png)
#### Figure IV, Q-Q Plot of LCM Random Data with seed 93463
![q93463](/home/rozek/Pictures/stat381lab2/q93463.png)
#### Figure V, Histogram of LCM Random Data with seed 29345
![hist29345](/home/rozek/Pictures/stat381lab2/hist29345.png)
#### Figure VI, Q-Q Plot of LCM Random Data with seed 29345
![q29345](/home/rozek/Pictures/stat381lab2/q29345.png)
#### Figure VII, Histogram of LCM Random Data with seed 68237
![hist68237](/home/rozek/Pictures/stat381lab2/hist68237.png)
#### Figure VIII, Q-Q Plot of LCM Random Data with seed 68237
![q68237](/home/rozek/Pictures/stat381lab2/q68237.png)
#### Figure IX, Histogram of R Random Uniform Data
![histunif](/home/rozek/Pictures/stat381lab2/histunif.png)
#### Figure X, Q-Q Plot of R Random Uniform Data
![qunif](/home/rozek/Pictures/stat381lab2/qunif.png)
#### Figure XI, Histogram of Exponential Data from LCM Random
![histexplcm](/home/rozek/Pictures/stat381lab2/histexplcm.png)
#### Figure XII, Histogram of Exponential Data from LCM Random without Outlier Above 2
![histexplcm_nooutlier](/home/rozek/Pictures/stat381lab2/histexplcm_nooutlier.png)
#### Figure XIII, Q-Q Plot of Exponential Data from LCM Random
![qexplcm](/home/rozek/Pictures/stat381lab2/qexplcm.png)
#### Figure XIV, Histogram of Exponential Data from R Random Uniform
![histexpunif](/home/rozek/Pictures/stat381lab2/histexpunif.png)
#### Figure XV, Q-Q Plot of Exponential Data from R Random Uniform
![qexpunif](/home/rozek/Pictures/stat381lab2/qexpunif.png)
#### Figure XVI, Histogram of R Generated Exponential Data
![histexp](/home/rozek/Pictures/stat381lab2/histexp.png)
#### Figure XVII, Q-Q Plot of R Generated Exponential Data
![qexp](/home/rozek/Pictures/stat381lab2/qexp.png)
### Tables
#### Table I, Uniform Distribution Sample Data
| Method | Mean ($\bar{x}$) | Median ($\tilde{x}$) | $\mu - \bar{x}$ | $m - \tilde{x}$ |
| ------------------- | ---------------- | -------------------- | --------------- | --------------- |
| LCM with seed 55555 | 0.505 | 0.521 | -0.005 | -0.021 |
| LCM with seed 93463 | 0.452 | 0.402 | 0.048 | 0.098 |
| LCM with seed 29345 | 0.520 | 0.502 | -0.020 | -0.002 |
| LCM with seed 68237 | 0.489 | 0.517 | 0.011 | -0.017 |
| R Random Uniform | 0.480 | 0.471 | 0.020 | 0.029 |
#### Table II, Exponential Distribution Sample Data
| Method | Mean ($\bar{x}$) | Median ($\tilde{x}$) | $\mu - \bar{x}$ | $m - \tilde{x}$ |
| --------------- | ----- | ------ | --------------- | --------------- |
| LCM Transform | 0.380 | 0.246 | -0.047 | -0.015 |
| RUnif Transform | 0.283 | 0.218 | 0.050 | 0.013 |
| R Exponential | 0.363 | 0.278 | -0.030 | -0.047 |
### R Code
```R
library(ggplot2)

# Linear congruential method (LCM): generates 100 pseudo-random values
# from a given seed using x[i] = (a * x[i-1] + c) mod m.
linear_congruential = function(seed) {
  a = 69069
  c = 0
  m = 2^31
  x = numeric(100)
  x[1] = seed
  for (i in 2:100) {
    x[i] = (a * x[i-1] + c) %% m
  }
  # Scale the sequence into [0, 1)
  xnew = x / (max(x) + 1)
  xnew
}

# Greatest common divisor via the Euclidean algorithm
gcd = function(x, y) {
  r = x %% y
  return(ifelse(r, gcd(y, r), y))
}

# A seed is considered proper for this LCM if it is co-prime with the modulus 2^31
check_seed = function(seed) {
  if (gcd(seed, 2^31) == 1) {
    print(paste("The seed value of", seed, "is co-prime"))
  } else {
    print(paste("The seed value of", seed, "is NOT co-prime"))
  }
}

# Print summary statistics, then draw a histogram and a Q-Q plot comparing
# the data against the requested theoretical distribution
check_data = function(data, distribution, title) {
  print(paste("Currently looking at", title))
  print(summary(data))
  print(ggplot(data.frame(data), aes(x = data)) +
          geom_histogram(fill = 'white', color = 'black', bins = 10) +
          xlab("Data") +
          ylab("Frequency") +
          ggtitle(paste("Histogram of", title)) +
          theme_bw())
  # Evenly spaced probabilities, used to form the theoretical quantiles
  uqs = (1:100) / 100
  if (distribution == "Uniform") {
    qqplot(data, uqs, pch = 1, col = '#001155',
           main = paste("QQ Plot of", title),
           xlab = "Sample Data", ylab = "Theoretical Data")
    qqline(uqs, distribution = qunif, col = "red", lty = 2)
  } else if (distribution == "Exponential") {
    # Theoretical exponential (rate = 3) quantiles via the inverse CDF
    eqs = -log(1 - uqs) / 3
    qqplot(data, eqs, pch = 1, col = '#001155',
           main = paste("QQ Plot of", title),
           xlab = "Sample Data", ylab = "Theoretical Data")
    qqline(eqs, distribution = function(p) { qexp(p, rate = 3) },
           col = "#AA0000", lty = 2)
  }
}

seed1 = 55555
seed2 = 93463
seed3 = 29345
seed4 = 68237

check_seed(seed1)
lin_cong = linear_congruential(seed1)
check_data(lin_cong, "Uniform", paste("LCM Random Data with seed", seed1))

check_seed(seed2)
lin_cong2 = linear_congruential(seed2)
check_data(lin_cong2, "Uniform", paste("LCM Random Data with seed", seed2))

check_seed(seed3)
lin_cong3 = linear_congruential(seed3)
check_data(lin_cong3, "Uniform", paste("LCM Random Data with seed", seed3))

check_seed(seed4)
lin_cong4 = linear_congruential(seed4)
check_data(lin_cong4, "Uniform", paste("LCM Random Data with seed", seed4))

# Using the standard built-in R function
unif = runif(100, 0, 1)
check_data(unif, "Uniform", "R Random Uniform Data")

# Building exponential from the linear congruential method
exp1 = -log(1 - lin_cong) / 3
check_data(exp1, "Exponential", "Exponential Data from LCM Random")

# Building exponential from runif
exp2 = -log(1 - unif) / 3
check_data(exp2, "Exponential", "Exponential Data from R Random Uniform")

# Building exponential from rexp
exp3 = rexp(100, rate = 3)
check_data(exp3, "Exponential", "R Generated Exponential Data")
```