mirror of
https://github.com/Brandon-Rozek/website.git
synced 2024-11-12 12:00:35 -05:00
Cleanup filename
parent
ba5040453a
commit
8e175e60e4
1 changed files with 179 additions and 156 deletions
|
@ -35,13 +35,13 @@ For the rest of this article, we will use R for analysis. This article will focu
|
|||
|
||||
Read the CSV file into R
|
||||
|
||||
<pre class="language-R"><code class="language-R">rm(list=ls())
|
||||
|
||||
```R
|
||||
rm(list=ls())
|
||||
# Read in file
|
||||
LifeExpectancy = read.csv("~/LifeExpectancy.csv")
|
||||
maleExpectancy = LifeExpectancy$Life.Expectancy.Male
|
||||
femaleExpectancy = LifeExpectancy$Life.Expectancy.Female
|
||||
</code></pre>
|
||||
```
|
||||
|
||||
## Summary Statistics
|
||||
|
||||
|
@ -49,21 +49,23 @@ Before we begin our inferential statistics, it is a good idea to look at what we
|
|||
|
||||
We’re interested in the minimum, mean, maximum, and interquartile range (IQR) of each group.
|
||||
|
||||
<pre class="language-R"><code class="language-R">
|
||||
```R
|
||||
# Summary statistics
|
||||
male_row = c(min(maleExpectancy), mean(maleExpectancy), max(maleExpectancy), IQR(maleExpectancy))
|
||||
female_row = c(min(femaleExpectancy), mean(femaleExpectancy), max(femaleExpectancy), IQR(femaleExpectancy))
|
||||
summary = rbind(male_row, female_row)
|
||||
colnames(summary) = c("Min", "Mean", "Max", "IQR")
|
||||
rownames(summary) = c("Male", "Female")
|
||||
</code></pre>
|
||||
```
|
||||
|
||||
Looking at the table below, we can see that the average male in our sample lives to be about 75 years old, while the average female lives to be about 80 years old. One interesting thing to note is how little life expectancy varies across the counties we sampled: the IQR is under 3 years for both groups.
|
||||
|
||||
<pre class="language-R"><code class="language-R">summary
|
||||
```R
|
||||
summary
|
||||
## Min Mean Max IQR
|
||||
## Male 69.0 74.952 80.9 2.775
|
||||
## Female 76.1 80.416 84.1 2.350</code></pre>
|
||||
```
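As a side note, the same kind of table can be built in one step. The sketch below is only an illustration: it assumes a small made-up data frame (`toy`, with invented values) in place of `LifeExpectancy.csv`, and a hypothetical helper named `describe()`. It also stores the result as `stats_table` rather than `summary`, so that base R's `summary()` function is not masked.

```R
# Toy stand-in for the CSV; these values are invented for illustration
toy = data.frame(
  Life.Expectancy.Male   = c(72.1, 74.3, 75.0, 76.2, 77.4),
  Life.Expectancy.Female = c(78.0, 79.5, 80.2, 81.1, 82.6)
)

# Hypothetical helper: the four statistics used in this article
describe = function(x) c(Min = min(x), Mean = mean(x), Max = max(x), IQR = IQR(x))

# sapply returns one column per variable, so transpose to get one row per group
stats_table = t(sapply(toy, describe))
rownames(stats_table) = c("Male", "Female")
stats_table
```

Naming each statistic inside `describe()` means the column labels come along automatically, so a separate `colnames()` call is not needed.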
|
||||
|
||||
## Inferential Statistics
|
||||
|
||||
|
@ -83,21 +85,26 @@ Performing a t-test comes with several assumptions we need to check before confi
|
|||
|
||||
The male life expectancy distribution appears to be unimodal and symmetric.
|
||||
|
||||
<pre class="language-R"><code class='language-R'># Check for normality
|
||||
hist(maleExpectancy, main = "Male Life Expectancy", xlab = "Age")</code></pre>
|
||||
```R
|
||||
# Check for normality
|
||||
hist(maleExpectancy, main = "Male Life Expectancy", xlab = "Age")
|
||||
```
|
||||
|
||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/maleLifeExpectancyHist.png" width="672" />
|
||||
|
||||
The same is true of the female life expectancy distribution.
|
||||
|
||||
<pre class="language-R"><code class="language-R">hist(femaleExpectancy, main = "Female Life Expectancy", xlab = "Age")</code></pre>
|
||||
```R
|
||||
hist(femaleExpectancy, main = "Female Life Expectancy", xlab = "Age")
|
||||
```
|
||||
|
||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/femaleLifeExpectancyHist.png" width="672" />
|
||||
|
||||
Looking at the boxplot, we can see that the box (the interquartile range) for female life expectancy sits higher than the one for males. The hypothesis test will show us whether this difference is significant. On the male side there are two outliers. This violates the Nearly Normal Condition, so we must proceed with caution in our test.
|
||||
|
||||
<pre class="language-R"><code class="language-R">boxplot(maleExpectancy, femaleExpectancy, names = c("Male Life Expectancy", "Female Life
|
||||
Expectancy"), ylab = "Age")</code></pre>
|
||||
```R
|
||||
boxplot(maleExpectancy, femaleExpectancy, names = c("Male Life Expectancy", "Female Life Expectancy"), ylab = "Age")
|
||||
```
|
||||
|
||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/LifeExpectancyBoxplot.png" width="672" />
|
||||
|
||||
|
@ -109,10 +116,12 @@ Let us conduct a two sample t-test with the alternative hypothesis being that th
|
|||
|
||||
Running the test below shows us a p-value of less than 0.001. This tells us that, if the null hypothesis were true, the probability of obtaining a sample at least as extreme as the one observed would be close to zero. Therefore, at a significance level of 5%, we reject the null hypothesis and state that there is strong evidence to suggest that females have a greater life expectancy than males.
|
||||
|
||||
<pre class="language-R"><code class="language-R"># Test alternative hypothesis
|
||||
t.test(femaleExpectancy, maleExpectancy, alternative='g')</code></pre>
|
||||
```R
|
||||
# Test alternative hypothesis
|
||||
t.test(femaleExpectancy, maleExpectancy, alternative='g')
|
||||
```
|
||||
|
||||
<pre class="language-R"><code class="language-R">##
|
||||
```R
|
||||
## Welch Two Sample t-test
|
||||
##
|
||||
## data: femaleExpectancy and maleExpectancy
|
||||
|
@ -122,14 +131,17 @@ t.test(femaleExpectancy, maleExpectancy, alternative='g')</code></pre>
|
|||
## 4.984992 Inf
|
||||
## sample estimates:
|
||||
## mean of x mean of y
|
||||
## 80.416 74.952</code></pre>
|
||||
## 80.416 74.952
|
||||
```
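The printed report is convenient to read, but `t.test` also returns an object whose components can be used directly. The sketch below assumes simulated samples (`female_sim` and `male_sim`, generated with `rnorm`, not the article's data) and spells out `alternative = "greater"`, which is what the abbreviated `'g'` resolves to through partial matching.

```R
set.seed(1)
# Simulated stand-ins for the two samples (not the article's data)
female_sim = rnorm(50, mean = 80, sd = 2)
male_sim   = rnorm(50, mean = 75, sd = 2)

# One-sided Welch test: the alternative is mean(female_sim) > mean(male_sim)
res = t.test(female_sim, male_sim, alternative = "greater")

res$p.value          # one-sided p-value
res$p.value < 0.05   # TRUE means we reject the null at the 5% level
res$conf.int         # one-sided interval, so the upper bound is Inf
```

Pulling out `res$p.value` this way makes the decision rule explicit and avoids reading numbers off the printed output.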
|
||||
|
||||
In fact, we are 95% confident that the difference between the average female life expectancy and the average male life expectancy in the United States is between about 4.9 and 6.0 years. In other words, females live on average roughly 5 to 6 years longer than males in the United States.
|
||||
|
||||
<code class="language-R"># Find confidence interval<br />
|
||||
t.test(femaleExpectancy, maleExpectancy)</code>
|
||||
```R
|
||||
# Find confidence interval
|
||||
t.test(femaleExpectancy, maleExpectancy)
|
||||
```
|
||||
|
||||
<pre class="language-R"><code class="language-R">##
|
||||
```R
|
||||
## Welch Two Sample t-test
|
||||
##
|
||||
## data: femaleExpectancy and maleExpectancy
|
||||
|
@ -139,7 +151,8 @@ t.test(femaleExpectancy, maleExpectancy)</code>
|
|||
## 4.892333 6.035667
|
||||
## sample estimates:
|
||||
## mean of x mean of y
|
||||
## 80.416 74.952</code></pre>
|
||||
## 80.416 74.952
|
||||
```
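To see where the interval comes from, the Welch confidence interval can be reproduced by hand. This is a sketch under stated assumptions: `x` and `y` are simulated samples rather than the article's data, and `manual_ci` is a name introduced here for illustration.

```R
set.seed(1)
x = rnorm(50, mean = 80, sd = 2)   # simulated "female" sample
y = rnorm(50, mean = 75, sd = 2)   # simulated "male" sample

# Squared standard error of each sample mean
vx = var(x) / length(x)
vy = var(y) / length(y)
se = sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df = (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# 95% two-sided interval for the difference in means
manual_ci = (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * se

manual_ci
t.test(x, y)$conf.int   # should agree with manual_ci
```

The degrees of freedom are generally not a whole number, which is why the Welch output reports a fractional `df`.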
|
||||
|
||||
### Outlier Analysis
|
||||
|
||||
|
@ -147,23 +160,27 @@ We cannot forget that we had outliers in our dataset. This might affect the resu
|
|||
|
||||
First, let us remove the outliers in R.
|
||||
|
||||
<pre class="language-R"><code class="language-R"># Remove outliers
|
||||
```R
|
||||
# Remove outliers
|
||||
maleExpectancy2 = maleExpectancy[!maleExpectancy %in% boxplot.stats(maleExpectancy)$out]
|
||||
</code></pre>
|
||||
```
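For reference, `boxplot.stats(x)$out` returns the points that lie more than 1.5 box lengths beyond the hinges (the default `coef = 1.5`), which is the same rule the boxplot uses to draw outliers. A minimal sketch on a made-up vector (the values and the names `out` and `x_clean` are invented for illustration):

```R
# Made-up values with one obvious outlier
x = c(70, 71, 72, 73, 74, 75, 76, 95)

# Points more than 1.5 box lengths beyond the hinges (the default coef)
out = boxplot.stats(x)$out
out

# Keep only the values that were not flagged
x_clean = x[!x %in% out]
x_clean
```

Because the filter matches on values with `%in%`, every observation equal to a flagged value is dropped, not just one occurrence.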
|
||||
|
||||
Then let us check the histogram and boxplots to see if the nearly normal condition is now met.
|
||||
|
||||
The boxplot shows that no outliers remain.
|
||||
|
||||
<pre class="language-R"><code class="language-R">
|
||||
```R
|
||||
# Check graphs again
|
||||
boxplot(maleExpectancy2, ylab = "Age", main = "Male Life Expectancy w/o Outliers")</code></pre>
|
||||
boxplot(maleExpectancy2, ylab = "Age", main = "Male Life Expectancy w/o Outliers")
|
||||
```
|
||||
|
||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/MLifeExpectBoxplotNoOutliers.png" width="672" />
|
||||
|
||||
The histogram still appears to be unimodal and symmetric
|
||||
|
||||
<pre class="language-R"><code class="language-R">hist(maleExpectancy2, xlab = "Age", main = "Male Life Expectancy w/o Outliers")</code></pre>
|
||||
```R
|
||||
hist(maleExpectancy2, xlab = "Age", main = "Male Life Expectancy w/o Outliers")
|
||||
```
|
||||
|
||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/MLifeExpectHistNoOutliers.png" width="672" />
|
||||
|
||||
|
@ -171,10 +188,12 @@ Without the outliers present, the nearly normal condition is now met. We can per
|
|||
|
||||
We can see that the hypothesis test returns the same result as before. This tells us that the outliers did not have a significant impact on our test results.
|
||||
|
||||
<pre class="language-R"><code class="language-R"># Test new alternative
|
||||
t.test(femaleExpectancy, maleExpectancy2, alternative='g')</code></pre>
|
||||
```R
|
||||
# Test new alternative
|
||||
t.test(femaleExpectancy, maleExpectancy2, alternative='g')
|
||||
```
|
||||
|
||||
<pre class="language-R"><code class="language-R">##
|
||||
```R
|
||||
## Welch Two Sample t-test
|
||||
##
|
||||
## data: femaleExpectancy and maleExpectancy2
|
||||
|
@ -184,14 +203,17 @@ t.test(femaleExpectancy, maleExpectancy2, alternative='g')</code></pre>
|
|||
## 5.000048 Inf
|
||||
## sample estimates:
|
||||
## mean of x mean of y
|
||||
## 80.41600 74.95204</code></pre>
|
||||
## 80.41600 74.95204
|
||||
```
|
||||
|
||||
Redoing the confidence interval, we can see that it did not change greatly.
|
||||
|
||||
<pre class="language-R"><code class="language-R"># Find new confidence interval
|
||||
t.test(femaleExpectancy, maleExpectancy2)</code></pre>
|
||||
```R
|
||||
# Find new confidence interval
|
||||
t.test(femaleExpectancy, maleExpectancy2)
|
||||
```
|
||||
|
||||
<pre class="language-R"><code class="language-R">##
|
||||
```R
|
||||
## Welch Two Sample t-test
|
||||
##
|
||||
## data: femaleExpectancy and maleExpectancy2
|
||||
|
@ -201,7 +223,8 @@ t.test(femaleExpectancy, maleExpectancy2)</code></pre>
|
|||
## 4.910317 6.017601
|
||||
## sample estimates:
|
||||
## mean of x mean of y
|
||||
## 80.41600 74.95204</code></pre>
|
||||
## 80.41600 74.95204
|
||||
```
|
||||
|
||||
## Conclusion
|
||||
|