# Central Limit Theorem Lab

**Brandon Rozek**

## Introduction

The Central Limit Theorem tells us that if the sample size is large, then the distribution of sample means approaches the Normal distribution. For distributions that are more skewed, a larger sample size is needed, since extreme values then have less impact on the sample mean.
### Skewness

Skewness can be determined by the following formula:
$$
Sk = E\left(\left(\frac{X - \mu}{\sigma}\right)^3\right) = \frac{E((X - \mu)^3)}{\sigma^3}
$$
Uniform distributions have a skewness of zero. Poisson distributions, however, have a skewness of $\lambda^{-\frac{1}{2}}$.
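As a quick sanity check, the sample skewness of simulated Poisson draws can be compared against $\lambda^{-\frac{1}{2}}$. This is an illustrative sketch, not part of the lab's code; the choice $\lambda = 4$ is arbitrary.

```R
# Sample skewness of Poisson draws should approach lambda^(-1/2)
# (lambda = 4 is an arbitrary illustrative choice)
set.seed(42)
lambda <- 4
x <- rpois(100000, lambda)
sample_skewness <- mean(((x - mean(x)) / sd(x))^3)
sample_skewness  # close to 1/sqrt(4) = 0.5
```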
In this lab, we are interested in the sample size needed to obtain a distribution of sample means that is approximately normal.

### Shapiro-Wilk Test

In this lab, we will test for normality using the Shapiro-Wilk test. The null hypothesis of this test is that the data are normally distributed; the alternative hypothesis is that they are not. This test is known to favor the alternative hypothesis for a large number of sample means. To circumvent this, we test for normality starting with a small sample size $n$ and steadily increase it until we obtain a distribution of sample means with a p-value greater than 0.05 in the Shapiro-Wilk test.

This tells us that, at a false positive rate of 5%, there is no evidence to suggest that the distribution of sample means doesn't follow the normal distribution.

We will use this test to look at the distribution of sample means of both the Uniform and Poisson distributions in this lab.
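For readers unfamiliar with the test, `shapiro.test` in base R returns a p-value directly. A quick illustration, not part of the lab's code; exact p-values vary with the seed:

```R
# Normal data: p-value is typically well above 0.05
set.seed(1)
shapiro.test(rnorm(500))$p.value
# Heavily skewed data: p-value is effectively zero
shapiro.test(rexp(500))$p.value
```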
### Properties of the distribution of sample means

For samples of size $n$, the distribution of sample means from the Uniform$(0, 1)$ distribution has a mean of $0.5$ and a standard deviation of $\frac{1}{\sqrt{12n}}$, and the distribution of sample means from the Poisson distribution has a mean of $\lambda$ and a standard deviation of $\sqrt{\frac{\lambda}{n}}$.
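These theoretical values can be checked by simulation (an illustrative sketch; results vary slightly from run to run):

```R
# Simulated standard deviations of 5000 sample means with n = 5
n <- 5
unif_means <- replicate(5000, mean(runif(n)))
sd(unif_means)  # close to 1/sqrt(12 * 5), i.e. about 0.129
pois_means <- replicate(5000, mean(rpois(n, lambda = 1)))
sd(pois_means)  # close to sqrt(1/5), i.e. about 0.447
```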
## Methods

For the first part of the lab, we sample means from a Uniform distribution and from a Poisson distribution with $\lambda = 1$, both with a sample size of $n = 5$.

Doing so shows that the distribution of sample means from the Uniform distribution is roughly symmetric, while that from the Poisson distribution is highly skewed. This begs the question: what sample size $n$ do we need for the Poisson distribution's sample means to be approximately normal?

### Sampling the means

The maximum number of observations that the Shapiro-Wilk test allows is 5000. Therefore, we obtain $n$ observations from either the Uniform or the Poisson distribution and calculate their mean, repeating that process 5000 times.
The mean is calculated in the following way:
$$
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
$$
where $x_i$ is an observation obtained from the Uniform or Poisson distribution.
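The appendix implements this sampling with explicit loops; the same idea can be sketched more compactly with `replicate`, shown here for the Poisson case with $\lambda = 1$ and $n = 5$:

```R
# 5000 means, each from n = 5 Poisson(1) observations
sample_means <- replicate(5000, mean(rpois(5, lambda = 1)))
length(sample_means)  # 5000
mean(sample_means)    # close to the theoretical lambda = 1
```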
### Iterating with the Shapiro-Wilk Test

A given sample size doesn't always guarantee that the resulting distribution of sample means will fail to reject the Shapiro-Wilk test. Therefore, it is useful to run the test multiple times so that we can report the 95th percentile of the sample sizes that first fail to reject the Shapiro-Wilk test.

The issue with this is that lower lambda values result in higher skewness, as described by the skewness formula. If a distribution has a high degree of skewness, then it takes a larger sample size $n$ to make the sample mean distribution approximately normal.

Large values of $n$ result in longer computation times. Therefore, the code takes this into account by starting at a larger value of $n$ and/or incrementing $n$ by a larger step each iteration. Incrementing by a larger step decreases the precision, though that is the compromise I'm willing to make in order to achieve faster results.

Finding just the first value of $n$ that generates sample means that fail to reject the Shapiro-Wilk test doesn't tell us much about the sample size needed for the distribution of sample means to be approximately normal. Instead, it is better to run this process many times, finding the values of $n$ that satisfy this condition repeatedly. That way we can look at the distribution of sample sizes required and return the 95th percentile.

Returning the 95th percentile tells us that 95% of the time, it was that sample size or lower that first failed to reject the Shapiro-Wilk test. One must be careful, because this can be wrongly interpreted as the sample size needed to fail to reject the Shapiro-Wilk test 95% of the time; that interpretation requires additional mathematics outside the scope of this paper. The 95th percentile of the first sample size that failed to reject the Shapiro-Wilk test will, however, give us a good enough estimate of the sample size needed.
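As a small illustration of the percentile step, with some hypothetical run results (these five values are made up for demonstration):

```R
# Hypothetical first-failing sample sizes from five runs of the search
ns <- c(190, 215, 180, 205, 230)
ceiling(quantile(ns, .95))  # the reported sample size
```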
### Plots

Once a value for $n$ is determined, we sample the means of the particular distribution (Uniform/Poisson) and create histograms and Q-Q plots for each of the parameters we're interested in. We're looking to verify that the histogram looks symmetric and that the points on the Q-Q plot fit closely to the Q-Q line, with one end of the scatter of points on the opposite side of the line from the other.
## Results

### Part I

Sampling the means of the Uniform distribution with $n = 5$ results in a mean of $\bar{x} = 0.498$ and a standard deviation of $sd = 0.1288$. The histogram and Q-Q plot can be seen in Figures I and II respectively.

$\bar{x}$ isn't far from the theoretical 0.5, and the standard deviation is also close to the theoretical
$$
\frac{1}{\sqrt{12(5)}} \approx 0.129
$$
The histogram and Q-Q plot suggest that the data are approximately normal. Therefore we can conclude that a sample size of 5 is sufficient for the distribution of sample means coming from the Uniform distribution to be approximately normal.

Sampling the means of the Poisson distribution with $n = 5$ and $\lambda = 1$ results in a mean of $\bar{x} = 0.9918$ and a standard deviation of $sd = 0.443$. The histogram and Q-Q plot can be seen in Figures III and IV respectively.

$\bar{x}$ is not far from the theoretical $\lambda = 1$; the standard deviation is a bit farther from the theoretical
$$
\sqrt{\frac{\lambda}{n}} = \sqrt{\frac{1}{5}} \approx 0.447
$$
The figures, however, show that the data do not appear normal. Therefore, we cannot conclude that 5 is a big enough sample size for the distribution of sample means of the Poisson distribution with $\lambda = 1$ to be approximately normal.
### Part II

Running the algorithm I described produced the following table:

| $\lambda$ | Skewness | Sample Size Needed | Shapiro-Wilk P-Value | Average of Sample Means | Standard Deviation of Sample Means | Theoretical Standard Deviation of Sample Means |
| --------- | -------- | ------------------ | -------------------- | ----------------------- | ---------------------------------- | ---------------------------------------------- |
| 0.1       | 3.16228  | 2710               | 0.05778              | 0.099                   | 0.0060                             | 0.0061                                         |
| 0.5       | 1.41421  | 802                | 0.16840              | 0.499                   | 0.0250                             | 0.0249                                         |
| 1         | 1.00000  | 215                | 0.06479              | 1.000                   | 0.0675                             | 0.0682                                         |
| 5         | 0.44721  | 53                 | 0.12550              | 4.997                   | 0.3060                             | 0.3071                                         |
| 10        | 0.31622  | 31                 | 0.14120              | 9.999                   | 0.5617                             | 0.5679                                         |
| 50        | 0.14142  | 10                 | 0.48440              | 50.03                   | 2.2461                             | 2.2361                                         |
| 100       | 0.10000  | 6                  | 0.47230              | 100.0027                | 4.1245                             | 4.0824                                         |

The skewness was derived from the formula in the first section, while the sample size was obtained by looking at the .95 blue quantile line in Figures XVIII-XXIV. The rest of the columns are obtained from the output of the R function `show_results`.

Looking at the histograms and Q-Q plots produced by the algorithm, the distributions of sample means are all roughly symmetric. The sample means are also tightly clustered around the Q-Q line, showing that the normal distribution is a good fit. This allows us to be confident that using these values of $n$ as the sample size would result in the distribution of sample means of the Uniform or Poisson (with a given lambda) distribution being approximately normal.

All the values of the average sample means are close to their theoretical counterparts of $\lambda$. The standard deviation of sample means increases slightly as $\lambda$ increases, but its difference from the theoretical value remains quite small.
## Conclusion

The table in the results section clearly shows that as the skewness increases, so does the sample size needed to make the distribution of sample means approximately normal. This shows the central limit theorem in action: no matter the skewness, if you obtain a large enough sample, the distribution of sample means will be approximately normal.

These conclusions pave the way for more interesting applications such as hypothesis testing and confidence intervals.
## Appendix

### Figures

#### Figure I, Histogram of Sample Means coming from a Uniform Distribution with sample size of 5



#### Figure II, Q-Q Plot of Sample Means coming from a Uniform Distribution with sample size of 5



#### Figure III, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 1$ and sample size of 5



#### Figure IV, Q-Q Plot of Sample Means coming from a Poisson Distribution with $\lambda = 1$ and sample size of 5



#### Figure V, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 0.1$ and sample size of 2710



#### Figure VI, Q-Q Plot of Sample Means coming from a Poisson Distribution with $\lambda = 0.1$ and sample size of 2710



#### Figure VIIa, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 0.5$ and sample size of 516



#### Figure VIIb, Q-Q Plot of Sample Means coming from a Poisson Distribution with $\lambda = 0.5$ and sample size of 516



#### Figure VIII, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 1$ and sample size of 215



#### Figure IX, Q-Q Plot of Sample Means coming from a Poisson Distribution with $\lambda = 1$ and sample size of 215



#### Figure X, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 5$ and sample size of 53



#### Figure XI, Q-Q Plot of Sample Means coming from a Poisson Distribution with $\lambda = 5$ and sample size of 53



#### Figure XII, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 10$ and sample size of 31



#### Figure XIII, Q-Q Plot of Sample Means coming from a Poisson Distribution with $\lambda = 10$ and sample size of 31



#### Figure XIV, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 50$ and sample size of 10



#### Figure XV, Q-Q Plot of Sample Means coming from a Poisson Distribution with $\lambda = 50$ and sample size of 10



#### Figure XVI, Histogram of Sample Means coming from a Poisson Distribution with $\lambda = 100$ and sample size of 6



#### Figure XVII, Q-Q Plot of Sample Means coming from a Poisson Distribution with $\lambda = 100$ and sample size of 6



#### Figure XVIII, Histogram of sample size needed to fail to reject the normality test for a Poisson Distribution with $\lambda = 0.1$



#### Figure XIX, Histogram of sample size needed to fail to reject the normality test for a Poisson Distribution with $\lambda = 0.5$



#### Figure XX, Histogram of sample size needed to fail to reject the normality test for a Poisson Distribution with $\lambda = 1$



#### Figure XXI, Histogram of sample size needed to fail to reject the normality test for a Poisson Distribution with $\lambda = 5$



#### Figure XXII, Histogram of sample size needed to fail to reject the normality test for a Poisson Distribution with $\lambda = 10$



#### Figure XXIII, Histogram of sample size needed to fail to reject the normality test for a Poisson Distribution with $\lambda = 50$



#### Figure XXIV, Histogram of sample size needed to fail to reject the normality test for a Poisson Distribution with $\lambda = 100$


### R Code

```R
rm(list=ls())
library(ggplot2)

sample_mean_uniform = function(n) {
  xbarsunif = numeric(5000)
  for (i in 1:5000) {
    sumunif = 0
    for (j in 1:n) {
      sumunif = sumunif + runif(1, 0, 1)
    }
    xbarsunif[i] = sumunif / n
  }
  xbarsunif
}

sample_mean_poisson = function(n, lambda) {
  xbarspois = numeric(5000)
  for (i in 1:5000) {
    sumpois = 0
    for (j in 1:n) {
      sumpois = sumpois + rpois(1, lambda)
    }
    xbarspois[i] = sumpois / n
  }
  xbarspois
}

poisson_n_to_approx_normal = function(lambda) {
  print(paste("Looking at Lambda =", lambda))
  ns = c()

  # Speed up computation of lower lambda values by starting at a different sample size
  # and/or lowering the precision by increasing the delta sample size
  # and/or lowering the number of sample sizes we obtain from the shapiro loop
  increaseBy = 1
  iter = 3
  startingValue = 2
  if (lambda == 0.1) {
    startingValue = 2000
    iter = 3
    increaseBy = 50
  } else if (lambda == 0.5) {
    startingValue = 200
    iter = 5
    increaseBy = 10
  } else if (lambda == 1) {
    startingValue = 150
    iter = 25
  } else if (lambda == 5) {
    startingValue = 10
    iter = 50
  } else if (lambda == 10) {
    iter = 100
  } else {
    iter = 500
  }

  progressIter = 1
  for (i in 1:iter) {

    # Include a progress indicator for personal sanity
    if (i / iter > .1 * progressIter) {
      print(paste("Progress", i / iter * 100, "% complete"))
      progressIter = progressIter + 1
    }

    n = startingValue

    dist = sample_mean_poisson(n, lambda)
    p.value = shapiro.test(dist)$p.value
    while (p.value < 0.05) {
      n = n + increaseBy
      dist = sample_mean_poisson(n, lambda)
      p.value = shapiro.test(dist)$p.value

      # More sanity checks
      if (n %% 10 == 0) {
        print(paste("N =", n, " p.value =", p.value))
      }
    }
    ns = c(ns, n)
  }

  print(ggplot(data.frame(ns), aes(x = ns)) +
    geom_histogram(fill = 'white', color = 'black', bins = 10) +
    geom_vline(xintercept = ceiling(quantile(ns, .95)), col = '#0000AA') +
    ggtitle(paste("Histogram of N needed for Poisson distribution of lambda =", lambda)) +
    xlab("N") +
    ylab("Count") +
    theme_bw())

  ceiling(quantile(ns, .95)) # 95% of the time, this value of n will give you a sampling distribution that is approximately normal
}

uniform_n_to_approx_normal = function() {
  ns = c()
  progressIter = 1

  for (i in 1:500) {

    # Include a progress indicator for personal sanity
    if (i / 500 > .1 * progressIter) {
      print(paste("Progress", i / 5, "% complete"))
      progressIter = progressIter + 1
    }

    n = 2
    dist = sample_mean_uniform(n)
    p.value = shapiro.test(dist)$p.value
    while (p.value < 0.05) {
      n = n + 1
      dist = sample_mean_uniform(n)
      p.value = shapiro.test(dist)$p.value

      if (n %% 10 == 0) {
        print(paste("N =", n, " p.value =", p.value))
      }
    }

    ns = c(ns, n)
  }

  print(ggplot(data.frame(ns), aes(x = ns)) +
    geom_histogram(fill = 'white', color = 'black', bins = 10) +
    geom_vline(xintercept = ceiling(quantile(ns, .95)), col = '#0000AA') +
    ggtitle("Histogram of N needed for Uniform Distribution") +
    xlab("N") +
    ylab("Count") +
    theme_bw())

  ceiling(quantile(ns, .95)) # 95% of the time, this value of n will give you a sampling distribution that is approximately normal
}

show_results = function(dist) {
  print(paste("The mean of the sample mean distribution is:", mean(dist)))
  print(paste("The standard deviation of the sample mean distribution is:", sd(dist)))
  print(shapiro.test(dist))
  print(ggplot(data.frame(dist), aes(x = dist)) +
    geom_histogram(fill = 'white', color = 'black', bins = 20) +
    ggtitle("Histogram of Sample Means") +
    xlab("Mean") +
    ylab("Count") +
    theme_bw())
  qqnorm(dist, pch = 1, col = '#001155', main = "QQ Plot", xlab = "Sample Data", ylab = "Theoretical Data")
  qqline(dist, col = "#AA0000", lty = 2)
}

## PART I
uniform_mean_dist = sample_mean_uniform(n = 5)
poisson_mean_dist = sample_mean_poisson(n = 5, lambda = 1)

show_results(uniform_mean_dist)
show_results(poisson_mean_dist)

## PART II

print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 0.1")
n.01 = poisson_n_to_approx_normal(0.1)
show_results(sample_mean_poisson(n.01, 0.1))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 0.5")
n.05 = poisson_n_to_approx_normal(0.5)
show_results(sample_mean_poisson(n.05, 0.5))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 1")
n.1 = poisson_n_to_approx_normal(1)
show_results(sample_mean_poisson(n.1, 1))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 5")
n.5 = poisson_n_to_approx_normal(5)
show_results(sample_mean_poisson(n.5, 5))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 10")
n.10 = poisson_n_to_approx_normal(10)
show_results(sample_mean_poisson(n.10, 10))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 50")
n.50 = poisson_n_to_approx_normal(50)
show_results(sample_mean_poisson(n.50, 50))
print("Starting Algorithm to Find Sample Size Needed for the Poisson Distribution of Lambda = 100")
n.100 = poisson_n_to_approx_normal(100)
show_results(sample_mean_poisson(n.100, 100))

print("Starting Algorithm to Find Sample Size Needed for the Uniform Distribution")
n.uniform = uniform_n_to_approx_normal()
show_results(sample_mean_uniform(n.uniform))
```
# Confidence Interval Lab

**Written by Brandon Rozek**

## Introduction

Confidence intervals expand the concept of a point estimate by giving a margin of error such that one can be confident that a certain percentage of the time the true parameter falls within that interval.

In this lab, we will look at confidence intervals for a mean. This lab focuses on a method of confidence intervals that depends on the distribution of sample means being Normal. We will show how violating this assumption impacts the probability that the true parameter falls within the interval.
## Methods

The observed level of confidence tells us the proportion of times the true mean falls within a confidence interval. To show how violating the Normality assumption affects this, we will sample from a Normal distribution, a T distribution, and an Exponential distribution with different sample sizes.

The Normal and T distributions are sampled with a mean of 5 and a standard deviation of 2. The Exponential distribution is sampled with a lambda of 2, or a mean of 0.5.

From each sample, we obtain the mean and the upper/lower bounds of the confidence interval. This is performed 100,000 times so that we obtain a distribution of these statistics.

A confidence interval is valid if the lower bound is no more than the true mean and the upper bound is no less than the true mean. From this definition, we can compute the observed proportion of valid confidence intervals from the simulations.
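A single iteration of this check can be sketched as follows (a t-based interval, as used for the non-normal cases in the appendix code; the seed and values here are illustrative):

```R
# One simulated t-based interval; the lab repeats this 100,000 times
set.seed(7)  # illustrative seed
n <- 10
mu <- 5
x <- rnorm(n, mu, 2)
ME <- qt(0.975, n - 1) * sd(x) / sqrt(n)  # t margin of error
# Valid when the interval brackets the true mean
valid <- (mean(x) - ME < mu) & (mean(x) + ME > mu)
valid
```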
### Visualizations

From the distributions of statistics, we can create visualizations to support the understanding of confidence intervals.

The first is a scatterplot of lower bounds vs. upper bounds. This plot shows the valid confidence intervals in blue and the invalid ones in red, and demonstrates how the invalid confidence intervals fall outside the box formed by the lines through the true mean.

The second visualization is a histogram of all the sample means collected. The sample means that didn't belong to a valid confidence interval are shaded in red. This graphic helps demonstrate the type I errors on both sides of the distribution.

In this lab, we're interested in seeing how our observed level of confidence differs from our theoretical level of confidence (95%) when different sample sizes and distributions are used.
## Results

We can see from the table section in the Appendix that sampling from a Normal or T distribution does not adversely affect our observed level of confidence. The observed level of confidence varies only slightly from the theoretical level of confidence of 0.95.

When sampling from the Exponential distribution, however, the observed level of confidence depends heavily upon the sample size.

Looking at Table III, we can see that for a sample size of 10, the observed level of confidence is a meager 90%. This is 5% off from our theoretical level of confidence, which shows how vital the normality assumption is to the precision of our estimate.

This comes from the fact that using this type of confidence interval on a mean from a non-normal distribution requires a large sample size for the central limit theorem to take effect.

The central limit theorem states that if the sample size is "large", the distribution of sample means approaches the normal distribution. You can see in Figure XVIII how the distribution of sample means is skewed, though as the sample size increases, the distribution of sample means becomes more symmetric (Figures XX, XXII, and XXIV).

## Conclusion

From this, we can conclude that violating the underlying assumption of normality decreases the observed level of confidence. We can mitigate this decrease when sampling means from a non-normal distribution by using a larger sample size. This is due to the central limit theorem.
## Appendix

### Tables

#### Table I. Sampling from Normal Distribution

| Sample Size | Proportion of Means Within CI |
| ----------- | ----------------------------- |
| 10          | 0.94849                       |
| 20          | 0.94913                       |
| 50          | 0.95045                       |
| 100         | 0.94955                       |

#### Table II. Sampling from T Distribution

| Sample Size | Proportion of Means Within CI |
| ----------- | ----------------------------- |
| 10          | 0.94966                       |
| 20          | 0.94983                       |
| 50          | 0.94932                       |
| 100         | 0.94999                       |

#### Table III. Sampling from Exponential Distribution

| Sample Size | Proportion of Means Within CI |
| ----------- | ----------------------------- |
| 10          | 0.89934                       |
| 20          | 0.91829                       |
| 50          | 0.93505                       |
| 100         | 0.94172                       |
### Figures

#### Normal Distribution

##### Figure I. Scatterplot of Bounds for Normal Distribution of Sample Size 10



##### Figure II. Histogram of Sample Means for Normal Distribution of Sample Size 10



##### Figure III. Scatterplot of Bounds for Normal Distribution of Sample Size 20



##### Figure IV. Histogram of Sample Means for Normal Distribution of Sample Size 20



##### Figure V. Scatterplot of Bounds for Normal Distribution of Sample Size 50



##### Figure VI. Histogram of Sample Means for Normal Distribution of Sample Size 50



##### Figure VII. Scatterplot of Bounds for Normal Distribution of Sample Size 100



##### Figure VIII. Histogram of Sample Means for Normal Distribution of Sample Size 100



#### T Distribution

##### Figure IX. Scatterplot of Bounds for T Distribution of Sample Size 10



##### Figure X. Histogram of Sample Means for T Distribution of Sample Size 10



##### Figure XI. Scatterplot of Bounds for T Distribution of Sample Size 20



##### Figure XII. Histogram of Sample Means for T Distribution of Sample Size 20



##### Figure XIII. Scatterplot of Bounds for T Distribution of Sample Size 50



##### Figure XIV. Histogram of Sample Means for T Distribution of Sample Size 50



##### Figure XV. Scatterplot of Bounds for T Distribution of Sample Size 100



##### Figure XVI. Histogram of Sample Means for T Distribution of Sample Size 100



#### Exponential Distribution

##### Figure XVII. Scatterplot of Bounds for Exponential Distribution of Sample Size 10



##### Figure XVIII. Histogram of Sample Means for Exponential Distribution of Sample Size 10



##### Figure XIX. Scatterplot of Bounds for Exponential Distribution of Sample Size 20



##### Figure XX. Histogram of Sample Means for Exponential Distribution of Sample Size 20



##### Figure XXI. Scatterplot of Bounds for Exponential Distribution of Sample Size 50



##### Figure XXII. Histogram of Sample Means for Exponential Distribution of Sample Size 50



##### Figure XXIII. Scatterplot of Bounds for Exponential Distribution of Sample Size 100



##### Figure XXIV. Histogram of Sample Means for Exponential Distribution of Sample Size 100


### R Code

```R
rm(list=ls())
library(ggplot2)
library(functional) # For function currying

proportion_in_CI = function(n, mu, dist) {

  # Preallocate vectors
  lower_bound = numeric(100000)
  upper_bound = numeric(100000)
  means = numeric(100000)

  number_within_CI = 0

  ME = 1.96 * 2 / sqrt(n) ## Normal margin of error

  for (i in 1:100000) {
    x = numeric(n)

    # Sample from distribution
    if (dist == "Normal" | dist == "t") {
      x = rnorm(n, mu, 2)
    } else if (dist == "Exponential") {
      x = rexp(n, 1 / mu)
    }

    ## Correct ME if non-normal
    if (dist != "Normal") {
      ME = qt(0.975, n - 1) * sd(x) / sqrt(n)
    }

    ## Store statistics
    means[i] = mean(x)
    lower_bound[i] = mean(x) - ME
    upper_bound[i] = mean(x) + ME

    # Is the confidence interval valid?
    if (lower_bound[i] < mu & upper_bound[i] > mu) {
      number_within_CI = number_within_CI + 1
    }
  }

  # Prepare for plotting
  lbub = data.frame(lower_bound, upper_bound, means)
  lbub$col = ifelse(lbub$lower_bound < mu & lbub$upper_bound > mu, 'Within CI', 'Outside CI')
  print(ggplot(lbub, aes(x = lower_bound, y = upper_bound, col = col)) +
    geom_point(pch = 1) +
    geom_hline(yintercept = mu, col = '#000055') +
    geom_vline(xintercept = mu, col = '#000055') +
    ggtitle(paste("Plot of Lower Bounds vs Upper Bounds with Sample Size of ", n)) +
    xlab("Lower Bound") +
    ylab("Upper Bound") +
    theme_bw()
  )
  print(ggplot(lbub, aes(x = means, fill = col)) +
    geom_histogram(color = 'black') +
    ggtitle(paste("Histogram of Sample Means with Sample Size of ", n)) +
    xlab("Sample Mean") +
    ylab("Count") +
    theme_bw()
  )

  # Return proportion within CI
  number_within_CI / 100000
}

sample_sizes = c(10, 20, 50, 100)

### PART I
proportion_in_CI_Normal = Curry(proportion_in_CI, dist = "Normal", mu = 5)
p_norm = sapply(sample_sizes, proportion_in_CI_Normal)
sapply(p_norm, function(x) {
  cat("The observed proportion of intervals containing mu is", x, "\n")
  invisible(x)
})

### PART II
proportion_in_CI_T = Curry(proportion_in_CI, dist = "t", mu = 5)
p_t = sapply(sample_sizes, proportion_in_CI_T)
sapply(p_t, function(x) {
  cat("The observed proportion of intervals containing mu is", x, "\n")
  invisible(x)
})

### PART III
proportion_in_CI_Exp = Curry(proportion_in_CI, dist = "Exponential", mu = 0.5)
p_exp = sapply(sample_sizes, proportion_in_CI_Exp)
sapply(p_exp, function(x) {
  cat("The observed proportion of intervals containing mu is", x, "\n")
  invisible(x)
})
```
# Random Number Generation

## Introduction

The generation of random numbers has a variety of applications, including but not limited to the modeling of stochastic processes. It is important, therefore, to be able to generate random numbers following any given distribution. One of the many ways to achieve this is to transform numbers sampled from a random uniform distribution.

In this lab, we will compare the effectiveness of the linear congruential method (LCM), `runif`, and `rexp` for generating random numbers. The programming language R is used, and different plots and summary statistics are drawn upon to analyze the effectiveness of the methods.

## Methods

### The Linear Congruential Method

The linear congruential method (LCM) is an algorithm that produces a sequence of pseudo-random numbers using the following recursive function
$$
X_{n + 1} = (aX_n + c) \mod m
$$
The R code we use has a `c` value of 0, which is a special case known as the multiplicative congruential generator (MCG).

There are several conditions for using an MCG. First, the seed value must be co-prime to `m`; in other words, the greatest common divisor of the two must be 1. A function was written in R to check this. Second, `m` must be a prime number or a power of a prime number.
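The co-primality condition can be checked directly with the Euclidean algorithm. A minimal sketch (this iterative `gcd` helper is just illustrative; the appendix code uses a recursive version):

```r
# Euclidean algorithm for the greatest common divisor
gcd <- function(x, y) {
  while (y != 0) {
    r <- x %% y
    x <- y
    y <- r
  }
  x
}

m <- 2^31
gcd(55555, m)  # 1: an odd seed is co-prime to a power of 2
gcd(55554, m)  # 2: an even seed fails the condition
```

Since `m` here is a power of 2, the condition reduces to requiring an odd seed.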

#### Periodicity

An element of periodicity comes into play when dealing with pseudo-random number generators. Once the generator has produced a certain number of terms, it loops back to the first term of the sequence. It is advantageous, therefore, to keep the period high.

The period of an MCG is at most `m - 1`. In this lab, `m` has a value of $2^{31}$ and only 100 numbers were sampled from the LCM. This allows us to avoid the issue of periodicity entirely in our analysis.

### Mersenne-Twister Method

The default pseudo-random number generator (pseudo RNG) in R is the Mersenne-Twister, seeded by default from the current time and the process id of the current R instance. The Mersenne-Twister belongs to a class of pseudo RNGs called generalized feedback shift registers. This class of pseudo RNGs provides a very long period (VLP), that is, a long cycle before the values start repeating. VLPs are often useful when executing large simulations in R.

Since this method is already coded in the `base` R package, we will leave the implementation for the curious to research.
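Although we leave the algorithm's internals aside, R exposes the generator choice and the seed through `RNGkind` and `set.seed`. A quick sketch showing that a fixed seed makes the Mersenne-Twister stream reproducible:

```r
RNGkind("Mersenne-Twister")  # the default generator in R

set.seed(42)
a <- runif(3)
set.seed(42)
b <- runif(3)
identical(a, b)  # TRUE: the same seed reproduces the same stream
```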

### The Uniform Distribution

#### Probability Density Function

The uniform distribution $Unif(\theta_1, \theta_2)$ has the following probability density function (PDF).
$$
f(x) = \frac{1}{\theta_2 - \theta_1}
$$
where $x$ is valid on $[\theta_1, \theta_2]$.

In our case, we are producing numbers on $[0, 1]$. We can therefore reduce the probability density function to the following
$$
f(x) = 1
$$
#### Expected Value

The expected value can be defined as
$$
E(X) = \int_a^b xf(x)dx
$$
For the uniform distribution we're using, that becomes
$$
E(X) = \int_0^1 xdx = \frac{1}{2}[x^2]_0^1 = \frac{1}{2}
$$
We should, therefore, expect the mean of the generated random number sequence to roughly equal $\frac{1}{2}$.

#### Median

To find the median of the distribution, we need to find the point at which the cumulative distribution function (CDF) is equal to $\frac{1}{2}$.
$$
\int_0^{Median(X)} f(x)dx = \frac{1}{2}
$$

$$
\int_0^{Median(X)} dx = \frac{1}{2}
$$

$$
[x]_0^{Median(X)} = \frac{1}{2}
$$

$$
Median(X) = \frac{1}{2}
$$

This shows us that the median of the distribution should be roughly equal to $\frac{1}{2}$ as well.
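Both results can be checked empirically with a large `runif` sample (the sample size, seed, and 0.01 tolerance below are arbitrary choices for illustration):

```r
set.seed(1)               # fixed seed, for reproducibility
u <- runif(1e5)           # a large Unif(0, 1) sample

mean(u)                   # close to the theoretical mean 1/2
median(u)                 # close to the theoretical median 1/2
abs(mean(u) - 0.5) < 0.01 # TRUE for a sample this large
```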

#### Analysis of Uniform Distribution Fit

The histogram of uniformly distributed data should look rectangular in shape. This means that the counts of all the sub-intervals should be about the same.

Another way to test whether or not the distribution is a good fit is to use what is called a Quantile-Quantile plot (Q-Q plot). This is a probability plot that compares the quantiles from the theoretical distribution (here, the uniform distribution) to those of the sample.

For a precise Q-Q plot, we need many quantiles to compare. In this lab, we will compare 100 different quantiles. The quantiles of the theoretical distribution can be easily derived from the fact that all value ranges are equally probable; the theoretical quantiles in this case are 0.01, 0.02, ..., 0.98, 0.99, 1. The sample quantiles are obtained by drawing 100 observations from `runif` or the LCM.

In a Q-Q plot, a good fit is shown when the points hug closely to the Q-Q line. It is also important to have symmetry in the points, meaning that the points land on opposite sides of the Q-Q line about equally often.

For the sake of exploration, we will use four different seed values for the linear congruential method (while making sure that the seed values are co-prime with `m`). We will then use these results to compare the LCM to how the standard `runif` method generates random numbers.

### Exponential Distribution

#### Probability Density Function and the Cumulative Distribution Function

The exponential distribution is a continuous distribution defined by the following PDF
$$
f(x) = \lambda e^{-\lambda x}
$$
We can find the CDF by taking the integral of the PDF.
$$
F(x) = \int f(x)dx = \lambda \int e^{-\lambda x} dx = \lambda \left(\frac{-1}{\lambda}\right) e^{-\lambda x} + C = -e^{-\lambda x} + C
$$
One of the conditions for a cumulative distribution function is that as $x \to \infty$, $F(x) \to 1$.
$$
1 = \lim_{x \to \infty} (-e^{-\lambda x} + C) = \lim_{x \to \infty} (-e^{-\lambda x}) + C
$$
This shows that $C = 1$, since $\lim_{x \to \infty} (-e^{-\lambda x})$ is equal to 0.

Substituting $C$ gives us the following.
$$
F(x) = 1 - e^{-\lambda x}
$$
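The derived CDF agrees with R's built-in `pexp`; a quick check at a few arbitrary points with the rate used later in this lab:

```r
lambda <- 3
x <- c(0.1, 0.5, 1)

F_derived <- 1 - exp(-lambda * x)         # CDF derived above
all.equal(F_derived, pexp(x, rate = lambda))  # TRUE
```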

#### Expected Value

We can find the expected value using the formula described in the previous Expected Value section. Reflected in the bounds of integration is the domain of the exponential distribution, $[0, \infty)$.
$$
E(X) = \lim_{a \to \infty}\int_0^a x \lambda e^{-\lambda x} dx
$$
The integral above features a product of two functions, so as an aside, we will derive a formula that lets us evaluate it.

The total derivative is defined as
$$
d(uv) = u\,dv + v\,du
$$

After taking the integral of both sides, we can solve for a formula that gives the following
$$
\int d(uv) = \int u\,dv + \int v\,du
$$

$$
\int u\,dv = uv - \int v\,du
$$

The formula above is called *integration by parts*. We will make use of this formula by defining $u = \lambda x$ and $dv = e^{-\lambda x} dx$.

This implies that $du = \lambda\, dx$ and $v = -\frac{1}{\lambda}e^{-\lambda x}$.
$$
E(X) = -\lim_{a \to \infty}\left[\lambda x \frac{1}{\lambda}e^{-\lambda x}\right]_0^a - \lim_{b \to \infty}\int_0^b -\frac{1}{\lambda}e^{-\lambda x}\lambda\, dx
$$

$$
E(X) = -\lim_{a \to \infty} \left[xe^{-\lambda x}\right]_0^a - \lim_{b \to \infty}\int_0^b -e^{-\lambda x}dx
$$

$$
E(X) = -\lim_{a \to \infty}(ae^{-\lambda a}) - \lim_{b \to \infty}\left[\frac{1}{\lambda}e^{-\lambda x}\right]_0^b
$$

$$
E(X) = 0 - \frac{1}{\lambda}\left[\lim_{b \to \infty}(e^{-\lambda b}) - e^{0}\right]
$$

$$
E(X) = -\frac{1}{\lambda}(-1) = \frac{1}{\lambda}
$$

For the purposes of this lab, we will make the rate ($\lambda$) equal to 3, which means we should expect our mean to be roughly equal to $\frac{1}{3}$.

#### Median

The median can be found by setting the CDF equal to $\frac{1}{2}$
$$
1 - e^{-\lambda Median(X)} = \frac{1}{2}
$$

$$
e^{-\lambda Median(X)} = \frac{1}{2}
$$

$$
-\lambda Median(X) = \ln\left(\frac{1}{2}\right)
$$

$$
Median(X) = \frac{\ln(2)}{\lambda}
$$

This shows that we should expect our median to be around $\frac{\ln(2)}{3} \approx 0.231$.
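This agrees with R's quantile function for the exponential distribution:

```r
lambda <- 3
qexp(0.5, rate = lambda)  # about 0.231
log(2) / lambda           # the derived median ln(2)/3
```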

### Inverse Transform Sampling

Once we have a uniform distribution of values, we can use these values to map to an exponential distribution. This is done through a technique called inverse transform sampling, which works as follows:

1. Generate a random number $u$ from the standard uniform distribution
2. Compute the value of $X$ such that $F(X) = u$
3. The value of $X$ belongs to the distribution of $F$

Using these steps, we'll derive the exponential transform below.
$$
F(X) = 1 - e^{-\lambda X} = u
$$
Proceeding to solve for $X$, we obtain the following.
$$
e^{-\lambda X} = 1 - u
$$

$$
-\lambda X = \ln(1 - u)
$$

$$
X = \frac{-\ln(1 - u)}{\lambda}
$$

With this formula, we can now find values of $X$, a random variable following an exponential distribution, given random uniform data $u$.

In this lab, we are looking at the exponential distribution with a rate of 3. The probability density function, the cumulative distribution function, and the exponential transform can therefore be redefined as, respectively,
$$
f(x) = 3e^{-3x}
$$

$$
F(x) = 1 - e^{-3x}
$$

$$
X = \frac{-\ln(1 - u)}{3}
$$
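A minimal sketch of the transform in action, feeding `runif` output through the formula above and comparing sample moments against the target distribution (the sample size, seed, and tolerance are arbitrary):

```r
set.seed(381)
u <- runif(1e5)         # step 1: standard uniform draws
x <- -log(1 - u) / 3    # steps 2-3: invert F to get Exp(rate = 3) draws

mean(x)                 # near the theoretical mean 1/3
median(x)               # near the theoretical median log(2)/3
```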

### R Code

The R code makes use of the concepts above. It outputs the summary statistics, histograms, and Q-Q plots used to compare the different methods.

First, the LCM is executed four times with four different seed values. The LCM is used to create a uniform distribution of data that is then compared to the standard `runif` function in the R language.

Then, transformations of the LCM output and of `runif` are executed and compared with the standard `rexp` to create an exponential distribution of data with $\lambda = 3$.

## Results

### Uniform Distribution

For a small sample of 100 values, the data seems evenly distributed for all seeds and methods used. The peaks and troughs are more pronounced for the LCM, suggesting that the `runif` command creates more evenly distributed data than the LCM.

Deviations from the mean and median are lowest for the LCM rather than the standard `runif` command, with the exception of the LCM with the seed 93463.

The Q-Q plots for all of the methods follow the Q-Q line tightly and appear symmetric.

### Exponential Distribution

A bin size of 20 is used to make the histograms easily comparable to each other. One interesting thing to note is that in Figure XI, there is an observation located far from the rest of the data. Since the purpose of a histogram is to show the shape of the distribution, this one far observation skews what we can see. Therefore the next figure, Figure XII, has that single outlier removed and shows the histogram of the rest of the data.

All of the histograms appear exponential in shape. Looking at the Q-Q plots, the LCM transformation and the `rexp` distribution of data fit closely to the Q-Q line. In all of the Q-Q plots, the points move further from the line in the higher percentiles. The `runif` transformation quantiles diverge from the line after about the 50th percentile.

Deviations from the mean and median were about the same for both transformed techniques (`runif` and LCM). The `rexp` command did better when it came to the deviation from the mean, but performed worse when it came to deviation from the median.

## Conclusion

The linear congruential method performed better at simulating the distributions than the standard R functions. It more accurately captured the median for both the standard uniform data and the exponential data. Against the `runif` command, it also performed better at simulating the mean.

This can especially be seen when comparing the two transform techniques. The transformed LCM distribution of data followed the Q-Q line more tightly than the transformed `runif`.

I speculate that this is due to the seed value used. The Mersenne-Twister method performs better when the seed value doesn't have many zeros in it. Since the seed value is determined by the system time and process id, we don't know for sure whether it's a proper input for the Mersenne-Twister. For the LCM, seeds were properly tested to make sure that they were co-prime with one of its parameters. This condition allowed proper seeds to work well with the LCM.

Further research can be done on standardizing the seed values used across all the different pseudo-random number generators and looking at the six other pseudo RNGs that come built into R. Changing the seed and random number generator can be achieved through the `set.seed` function.

## Appendix

### Figures

#### Figure I, Histogram of LCM Random Data with seed 55555



#### Figure II, Q-Q Plot of LCM Random Data with seed 55555



#### Figure III, Histogram of LCM Random Data with seed 93463



#### Figure IV, Q-Q Plot of LCM Random Data with seed 93463



#### Figure V, Histogram of LCM Random Data with seed 29345



#### Figure VI, Q-Q Plot of LCM Random Data with seed 29345



#### Figure VII, Histogram of LCM Random Data with seed 68237



#### Figure VIII, Q-Q Plot of LCM Random Data with seed 68237



#### Figure IX, Histogram of R Random Uniform Data



#### Figure X, Q-Q Plot of R Random Uniform Data



#### Figure XI, Histogram of Exponential Data from LCM Random



#### Figure XII, Histogram of Exponential Data from LCM Random without Outlier Above 2



#### Figure XIII, Q-Q Plot of Exponential Data from LCM Random



#### Figure XIV, Histogram of Exponential Data from R Random Uniform



#### Figure XV, Q-Q Plot of Exponential Data from R Random Uniform



#### Figure XVI, Histogram of R Generated Exponential Data



#### Figure XVII, Q-Q Plot of R Generated Exponential Data



### Tables

#### Table I, Uniform Distribution Sample Data

| Method              | Mean ($\bar{x}$) | Median ($\tilde{x}$) | $\mu - \bar{x}$ | $m - \tilde{x}$ |
| ------------------- | ---------------- | -------------------- | --------------- | --------------- |
| LCM with seed 55555 | 0.505            | 0.521                | -0.005          | -0.021          |
| LCM with seed 93463 | 0.452            | 0.402                | 0.048           | 0.098           |
| LCM with seed 29345 | 0.520            | 0.502                | -0.020          | -0.002          |
| LCM with seed 68237 | 0.489            | 0.517                | 0.011           | -0.017          |
| R Random Uniform    | 0.480            | 0.471                | 0.020           | 0.029           |

#### Table II, Exponential Distribution Sample Data

| Method          | Mean  | Median | $\mu - \bar{x}$ | $m - \tilde{x}$ |
| --------------- | ----- | ------ | --------------- | --------------- |
| LCM Transform   | 0.380 | 0.246  | -0.047          | -0.015          |
| RUnif Transform | 0.283 | 0.218  | 0.050           | 0.013           |
| R Exponential   | 0.363 | 0.278  | -0.030          | -0.047          |

### R Code

```R
library(ggplot2)

# Multiplicative congruential generator: 100 values scaled into [0, 1)
linear_congruential = function(seed) {
  a = 69069
  c = 0
  m = 2^31
  x = numeric(100)
  x[1] = seed
  for (i in 2:100) {
    x[i] = (a * x[i-1] + c) %% m
  }
  xnew = x / (max(x) + 1)
  xnew
}

gcd = function(x, y) {
  r = x %% y
  return(ifelse(r, gcd(y, r), y))
}

check_seed = function(seed) {
  if (gcd(seed, 2^31) == 1) {
    print(paste("The seed value of", seed, "is co-prime"))
  } else {
    print(paste("The seed value of", seed, "is NOT co-prime"))
  }
}

# Print summary statistics, then draw a histogram and Q-Q plot of the data
check_data = function(data, distribution, title) {
  print(paste("Currently looking at", title))
  print(summary(data))
  print(ggplot(data.frame(data), aes(x = data)) +
    geom_histogram(fill = 'white', color = 'black', bins = 10) +
    xlab("Data") +
    ylab("Frequency") +
    ggtitle(paste("Histogram of", title)) +
    theme_bw())
  uqs = (1:100) / 100
  if (distribution == "Uniform") {
    qqplot(data, uqs, pch = 1, col = '#001155', main = paste("QQ Plot of", title), xlab = "Sample Data", ylab = "Theoretical Data")
    qqline(uqs, distribution = qunif, col = "red", lty = 2)
  } else if (distribution == "Exponential") {
    eqs = -log(1 - uqs) / 3
    qqplot(data, eqs, pch = 1, col = '#001155', main = paste("QQ Plot of", title), xlab = "Sample Data", ylab = "Theoretical Data")
    qqline(eqs, distribution = function(p) { qexp(p, rate = 3) }, col = "#AA0000", lty = 2)
  }
}

seed1 = 55555
seed2 = 93463
seed3 = 29345
seed4 = 68237

check_seed(seed1)
lin_cong = linear_congruential(seed1)
check_data(lin_cong, "Uniform", paste("LCM Random Data with seed", seed1))

check_seed(seed2)
lin_cong2 = linear_congruential(seed2)
check_data(lin_cong2, "Uniform", paste("LCM Random Data with seed", seed2))

check_seed(seed3)
lin_cong3 = linear_congruential(seed3)
check_data(lin_cong3, "Uniform", paste("LCM Random Data with seed", seed3))

check_seed(seed4)
lin_cong4 = linear_congruential(seed4)
check_data(lin_cong4, "Uniform", paste("LCM Random Data with seed", seed4))

# Using the standard built-in R function
unif = runif(100, 0, 1)
check_data(unif, "Uniform", "R Random Uniform Data")

# Building exponential from linear congruential method
exp1 = -log(1 - lin_cong) / 3
check_data(exp1, "Exponential", "Exponential Data from LCM Random")

# Building exponential from runif
exp2 = -log(1 - unif) / 3
check_data(exp2, "Exponential", "Exponential Data from R Random Uniform")

# Building exponential from rexp
exp3 = rexp(100, rate = 3)
check_data(exp3, "Exponential", "R Generated Exponential Data")
```

`content/notes/stat381/randomwalk.md` (551 lines; file diff suppressed because one or more lines are too long)