mirror of
https://github.com/Brandon-Rozek/website.git
synced 2025-05-18 16:03:37 +00:00
Cleanup filename
parent
ba5040453a
commit
8e175e60e4
1 changed files with 179 additions and 156 deletions
|
@ -1,118 +1,127 @@
|
||||||
---
|
---
|
||||||
id: 2169
|
id: 2169
|
||||||
title: Male vs Female Life Expectancy
|
title: Male vs Female Life Expectancy
|
||||||
date: 2017-03-16T14:12:40+00:00
|
date: 2017-03-16T14:12:40+00:00
|
||||||
author: rozek_admin
|
author: rozek_admin
|
||||||
layout: revision
|
layout: revision
|
||||||
guid: https://brandonrozek.com/2017/03/2052-revision-v1/
|
guid: https://brandonrozek.com/2017/03/2052-revision-v1/
|
||||||
permalink: /2017/03/2052-revision-v1/
|
permalink: /2017/03/2052-revision-v1/
|
||||||
---
|
---
|
||||||

|

|
||||||
|
|
||||||
## Do females live longer than males?
|
|
||||||
|
|
||||||
It is well known that females live longer than males, but does that statement hold statistically? Matthew Martinez and I set out to find out.
|
|
||||||
|
|
||||||
<!--more-->
|
|
||||||
|
|
||||||
## Population and the hypothesis
|
|
||||||
|
|
||||||
Our population of concern is citizens of the United States. We found a dataset on [WorldLifeExpectancy](http://www.worldlifeexpectancy.com/) listing, by county, the average life expectancy for both males and females. With this we form our null and alternative hypotheses:
|
|
||||||
|
|
||||||
H0: The average life expectancy of males and females is the same in the United States
|
|
||||||
|
|
||||||
HA: The average female life expectancy is higher than the average male life expectancy in the United States
|
|
||||||
|
|
||||||
## Data preparation
|
|
||||||
|
|
||||||
Since the website gives us an overview of all of the counties in the United States, we want to take a small sample of them so we can perform statistics. Using the entire dataset would mean looking at population parameters, which doesn’t leave room for inference.
|
|
||||||
|
|
||||||
A random number was chosen to pick the state and then the county. This was done a total of 101 times. The CSV file is located [here](https://brandonrozek.com/wp-content/uploads/2017/03/LifeExpectancy.csv) for convenience.
|
|
||||||
|
|
||||||
## R Programming
|
|
||||||
|
|
||||||
For the rest of this article, we will use R for analysis. This article will focus more on the analysis, however, than the R code.
|
|
||||||
|
|
||||||
Read the CSV file into R
|
|
||||||
|
|
||||||
<pre class="language-R"><code class="language-R">rm(list=ls())
|
|
||||||
|
|
||||||
|
## Do females live longer than males?
|
||||||
|
|
||||||
|
It is well known that females live longer than males, but does that statement hold statistically? Matthew Martinez and I set out to find out.
|
||||||
|
|
||||||
|
<!--more-->
|
||||||
|
|
||||||
|
## Population and the hypothesis
|
||||||
|
|
||||||
|
Our population of concern is citizens of the United States. We found a dataset on [WorldLifeExpectancy](http://www.worldlifeexpectancy.com/) listing, by county, the average life expectancy for both males and females. With this we form our null and alternative hypotheses:
|
||||||
|
|
||||||
|
H0: The average life expectancy of males and females is the same in the United States
|
||||||
|
|
||||||
|
HA: The average female life expectancy is higher than the average male life expectancy in the United States
|
||||||
|
|
||||||
|
## Data preparation
|
||||||
|
|
||||||
|
Since the website gives us an overview of all of the counties in the United States, we want to take a small sample of them so we can perform statistics. Using the entire dataset would mean looking at population parameters, which doesn’t leave room for inference.
|
||||||
|
|
||||||
|
A random number was chosen to pick the state and then the county. This was done a total of 101 times. The CSV file is located [here](https://brandonrozek.com/wp-content/uploads/2017/03/LifeExpectancy.csv) for convenience.
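
The two-stage draw described above (pick a state, then a county within it) could be sketched roughly as follows. This is only an illustration: the `counties` table, its `State` and `County` columns, and the `draw_county` helper are hypothetical and are not part of the original analysis.

```R
# Hypothetical toy table of counties; the real list came from WorldLifeExpectancy
counties = data.frame(
  State  = c("VA", "VA", "MD", "MD", "MD", "PA"),
  County = c("Stafford", "Fairfax", "Howard", "Calvert", "Talbot", "Berks"),
  stringsAsFactors = FALSE
)

set.seed(42)  # make the draw reproducible

# Draw one county: pick a state uniformly, then a county within that state
draw_county = function(df) {
  state = sample(unique(df$State), 1)
  rows = which(df$State == state)
  # Index into rows explicitly so a state with a single county is handled correctly
  df[rows[sample.int(length(rows), 1)], ]
}

# Repeat the draw n times (the article used 101 draws)
n = 5
drawn = do.call(rbind, lapply(seq_len(n), function(i) draw_county(counties)))
print(drawn)
```

Note that under this scheme a county in a state with few counties is more likely to be picked than a county in a state with many, and the same county can be drawn more than once.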
|
||||||
|
|
||||||
|
## R Programming
|
||||||
|
|
||||||
|
For the rest of this article, we will use R for analysis. This article will focus more on the analysis, however, than the R code.
|
||||||
|
|
||||||
|
Read the CSV file into R
|
||||||
|
|
||||||
|
```R
|
||||||
|
rm(list=ls())
|
||||||
# Read in file
|
# Read in file
|
||||||
LifeExpectancy = read.csv("~/LifeExpectancy.csv")
|
LifeExpectancy = read.csv("~/LifeExpectancy.csv")
|
||||||
maleExpectancy = LifeExpectancy$Life.Expectancy.Male
|
maleExpectancy = LifeExpectancy$Life.Expectancy.Male
|
||||||
femaleExpectancy = LifeExpectancy$Life.Expectancy.Female
|
femaleExpectancy = LifeExpectancy$Life.Expectancy.Female
|
||||||
</code></pre>
|
```
|
||||||
|
|
||||||
## Summary Statistics
|
## Summary Statistics
|
||||||
|
|
||||||
Before we begin our inferential statistics, it is a good idea to look at what we have in our sample. It will give us a good feeling for what we’re working with and help us answer some questions involving the assumptions in parametric tests.
|
Before we begin our inferential statistics, it is a good idea to look at what we have in our sample. It will give us a good feeling for what we’re working with and help us answer some questions involving the assumptions in parametric tests.
|
||||||
|
|
||||||
We’re interested in the minimum, mean, maximum, and interquartile range of the data
|
We’re interested in the minimum, mean, maximum, and interquartile range of the data
|
||||||
|
|
||||||
<pre class="language-R"><code class="language-R">
|
```R
|
||||||
# Summary statistics
|
# Summary statistics
|
||||||
male_row = c(min(maleExpectancy), mean(maleExpectancy), max(maleExpectancy), IQR(maleExpectancy))
|
male_row = c(min(maleExpectancy), mean(maleExpectancy), max(maleExpectancy), IQR(maleExpectancy))
|
||||||
female_row = c(min(femaleExpectancy), mean(femaleExpectancy), max(femaleExpectancy), IQR(femaleExpectancy))
|
female_row = c(min(femaleExpectancy), mean(femaleExpectancy), max(femaleExpectancy), IQR(femaleExpectancy))
|
||||||
summary = rbind(male_row, female_row)
|
summary = rbind(male_row, female_row)
|
||||||
colnames(summary) = c("Min", "Mean", "Max", "IQR")
|
colnames(summary) = c("Min", "Mean", "Max", "IQR")
|
||||||
rownames(summary) = c("Male", "Female")
|
rownames(summary) = c("Male", "Female")
|
||||||
</code></pre>
|
```
|
||||||
|
|
||||||
Looking at the table below, we can see that the average male life expectancy in our sample is around 75 years, while the average female life expectancy is about 80 years. One interesting thing to note is how small the variation is across the life expectancies of the counties we sampled.
|
Looking at the table below, we can see that the average male life expectancy in our sample is around 75 years, while the average female life expectancy is about 80 years. One interesting thing to note is how small the variation is across the life expectancies of the counties we sampled.
|
||||||
|
|
||||||
<pre class="language-R"><code class="language-R">summary
|
```R
|
||||||
|
summary
|
||||||
## Min Mean Max IQR
|
## Min Mean Max IQR
|
||||||
## Male 69.0 74.952 80.9 2.775
|
## Male 69.0 74.952 80.9 2.775
|
||||||
## Female 76.1 80.416 84.1 2.350</code></pre>
|
## Female 76.1 80.416 84.1 2.350
|
||||||
|
```
|
||||||
## Inferential Statistics
|
|
||||||
|
## Inferential Statistics
|
||||||
From here on out, we will perform a hypothesis test on the two hypotheses stated earlier in the text.
|
|
||||||
|
From here on out, we will perform a hypothesis test on the two hypotheses stated earlier in the text.
|
||||||
Since our data is quantitative in nature, we will attempt to perform a two sample t-test
|
|
||||||
|
Since our data is quantitative in nature, we will attempt to perform a two sample t-test
|
||||||
### Check for Assumptions
|
|
||||||
|
### Check for Assumptions
|
||||||
Performing a t-test comes with several assumptions we need to check before confidently reporting our results.
|
|
||||||
|
Performing a t-test comes with several assumptions we need to check before confidently reporting our results.
|
||||||
<u>Independence Condition:</u> One county’s life span does not affect the lifespan of another.
|
|
||||||
|
<u>Independence Condition:</u> One county’s life span does not affect the lifespan of another.
|
||||||
<u>Independent groups assumption:</u> The lifespan of a male does not directly impact a lifespan of a female.
|
|
||||||
|
<u>Independent groups assumption:</u> The lifespan of a male does not directly impact a lifespan of a female.
|
||||||
<u>Nearly Normal Condition:</u> We need to check the histograms to see if they’re unimodal and symmetric and check to see if any outliers exist
|
|
||||||
|
<u>Nearly Normal Condition:</u> We need to check the histograms to see if they’re unimodal and symmetric and check to see if any outliers exist
|
||||||
The male life expectancy distribution appears to be unimodal and symmetric.
|
|
||||||
|
The male life expectancy distribution appears to be unimodal and symmetric.
|
||||||
<pre class="language-R"><code class='language-R'># Check for normality
|
|
||||||
hist(maleExpectancy, main = "Male Life Expectancy", xlab = "Age")</code></pre>
|
```R
|
||||||
|
# Check for normality
|
||||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/maleLifeExpectancyHist.png" width="672" />
|
hist(maleExpectancy, main = "Male Life Expectancy", xlab = "Age")
|
||||||
|
```
|
||||||
Same with the female life expectancy distribution
|
|
||||||
|
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/maleLifeExpectancyHist.png" width="672" />
|
||||||
<pre class="language-R"><code class="language-R">hist(femaleExpectancy, main = "Female Life Expectancy", xlab = "Age")</code></pre>
|
|
||||||
|
Same with the female life expectancy distribution
|
||||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/femaleLifeExpectancyHist.png" width="672" />
|
|
||||||
|
```R
|
||||||
Looking at the boxplot, we can see that the female life expectancies sit higher than the male ones, even though the female IQR (2.350) is slightly smaller than the male IQR (2.775). The hypothesis test will show us whether the difference is significant. On the male’s side there are two outliers. This violates the Nearly Normal Condition, so we must proceed with caution in our test.
|
hist(femaleExpectancy, main = "Female Life Expectancy", xlab = "Age")
|
||||||
|
```
|
||||||
<pre class="language-R"><code class="language-R">boxplot(maleExpectancy, femaleExpectancy, names = c("Male Life Expectancy", "Female Life
|
|
||||||
Expectancy"), ylab = "Age")</code></pre>
|
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/femaleLifeExpectancyHist.png" width="672" />
|
||||||
|
|
||||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/LifeExpectancyBoxplot.png" width="672" />
|
Looking at the boxplot, we can see that the female life expectancies sit higher than the male ones, even though the female IQR (2.350) is slightly smaller than the male IQR (2.775). The hypothesis test will show us whether the difference is significant. On the male’s side there are two outliers. This violates the Nearly Normal Condition, so we must proceed with caution in our test.
|
||||||
|
|
||||||
Since the nearly normal condition was not met, we do not meet the assumptions necessary to perform a t-test. However, since the condition was violated only by outliers, let us perform the t-test with and without the outliers and compare the results.
|
```R
|
||||||
|
boxplot(maleExpectancy, femaleExpectancy, names = c("Male Life Expectancy", "Female Life Expectancy"), ylab = "Age")
|
||||||
### Calculate the Test Statistic
|
```
|
||||||
|
|
||||||
Let us conduct a two sample t-test with the alternative hypothesis being that the female average life expectancy is greater than that of the males
|
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/LifeExpectancyBoxplot.png" width="672" />
|
||||||
|
|
||||||
Running the test below shows us a p-value of less than 0.001. This tells us that, if the null hypothesis were true, the probability of obtaining a sample at least as extreme as the one obtained would be close to zero. Therefore, at a significance level of 5%, we reject the null hypothesis and state that there is strong evidence to suggest that females have a greater life expectancy than males.
|
Since the nearly normal condition was not met, we do not meet the assumptions necessary to perform a t-test. However, since the condition was violated only by outliers, let us perform the t-test with and without the outliers and compare the results.
|
||||||
|
|
||||||
<pre class="language-R"><code class="language-R"># Test alternative hypothesis
|
### Calculate the Test Statistic
|
||||||
t.test(femaleExpectancy, maleExpectancy, alternative='g')</code></pre>
|
|
||||||
|
Let us conduct a two sample t-test with the alternative hypothesis being that the female average life expectancy is greater than that of the males
|
||||||
<pre class="language-R"><code class="language-R">##
|
|
||||||
|
Running the test below shows us a p-value of less than 0.001. This tells us that, if the null hypothesis were true, the probability of obtaining a sample at least as extreme as the one obtained would be close to zero. Therefore, at a significance level of 5%, we reject the null hypothesis and state that there is strong evidence to suggest that females have a greater life expectancy than males.
|
||||||
|
|
||||||
|
```R
|
||||||
|
# Test alternative hypothesis
|
||||||
|
t.test(femaleExpectancy, maleExpectancy, alternative='g')
|
||||||
|
```
|
||||||
|
|
||||||
|
```R
|
||||||
## Welch Two Sample t-test
|
## Welch Two Sample t-test
|
||||||
##
|
##
|
||||||
## data: femaleExpectancy and maleExpectancy
|
## data: femaleExpectancy and maleExpectancy
|
||||||
|
@ -122,14 +131,17 @@ t.test(femaleExpectancy, maleExpectancy, alternative='g')</code></pre>
|
||||||
## 4.984992 Inf
|
## 4.984992 Inf
|
||||||
## sample estimates:
|
## sample estimates:
|
||||||
## mean of x mean of y
|
## mean of x mean of y
|
||||||
## 80.416 74.952</code></pre>
|
## 80.416 74.952
|
||||||
|
```
|
||||||
In fact, we are 95% confident that the difference between the average female life expectancy and the average male life expectancy in the United States is between 5 and 6 years. Females live on average 5-6 years longer than males in the United States.
|
|
||||||
|
In fact, we are 95% confident that the difference between the average female life expectancy and the average male life expectancy in the United States is between 5 and 6 years. Females live on average 5-6 years longer than males in the United States.
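
To make the Welch test and the confidence interval above less of a black box, here is a minimal sketch of the arithmetic that `t.test` performs when the variances are not pooled. The vectors `x` and `y` are small made-up samples, not the article's data, so the numbers they produce say nothing about life expectancy.

```R
# Hypothetical toy samples, not the article's data
x = c(80.1, 81.3, 79.8, 82.0, 80.6, 78.9)
y = c(74.2, 76.5, 75.1, 73.8, 77.0, 74.9, 75.6)

# Per-sample variance of the mean (variances are not pooled in Welch's test)
vx = var(x) / length(x)
vy = var(y) / length(y)

# Standard error of the difference in means
se = sqrt(vx + vy)

# Welch t statistic
t_stat = (mean(x) - mean(y)) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df = (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-sided p-value for the alternative "mean of x is greater than mean of y"
p_value = pt(t_stat, df, lower.tail = FALSE)

# Two-sided 95% confidence interval for the difference in means
ci = (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * se

# The built-in calls should agree with the manual values
builtin_test = t.test(x, y, alternative = "greater")
builtin_ci = t.test(x, y)$conf.int
```

The one-sided call is what produces the `Inf` upper bound in the first output, while the default two-sided call gives the interval used for the 95% confidence statement.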
|
||||||
<code class="language-R"># Find confidence interval<br />
|
|
||||||
t.test(femaleExpectancy, maleExpectancy)</code>
|
```R
|
||||||
|
# Find confidence interval
|
||||||
<pre class="language-R"><code class="language-R">##
|
t.test(femaleExpectancy, maleExpectancy)
|
||||||
|
````
|
||||||
|
|
||||||
|
```R
|
||||||
## Welch Two Sample t-test
|
## Welch Two Sample t-test
|
||||||
##
|
##
|
||||||
## data: femaleExpectancy and maleExpectancy
|
## data: femaleExpectancy and maleExpectancy
|
||||||
|
@ -139,42 +151,49 @@ t.test(femaleExpectancy, maleExpectancy)</code>
|
||||||
## 4.892333 6.035667
|
## 4.892333 6.035667
|
||||||
## sample estimates:
|
## sample estimates:
|
||||||
## mean of x mean of y
|
## mean of x mean of y
|
||||||
## 80.416 74.952</code></pre>
|
## 80.416 74.952
|
||||||
|
```
|
||||||
### Outlier Analysis
|
|
||||||
|
### Outlier Analysis
|
||||||
We cannot forget that we had outliers in our dataset. This might affect the results of our test. The point of outlier analysis is to see if such changes are significant.
|
|
||||||
|
We cannot forget that we had outliers in our dataset. This might affect the results of our test. The point of outlier analysis is to see if such changes are significant.
|
||||||
First let us remove the outliers in R
|
|
||||||
|
First let us remove the outliers in R
|
||||||
<pre class="language-R"><code class="language-R"># Remove outliers
|
|
||||||
|
```R
|
||||||
|
# Remove outliers
|
||||||
maleExpectancy2 = maleExpectancy[!maleExpectancy %in% boxplot.stats(maleExpectancy)$out]
|
maleExpectancy2 = maleExpectancy[!maleExpectancy %in% boxplot.stats(maleExpectancy)$out]
|
||||||
</code></pre>
|
```
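
For readers wondering what `boxplot.stats(...)$out` flags, the sketch below reproduces its default rule by hand: a value is an outlier when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. The vector `v` is a made-up example, not the sampled county data.

```R
# Hypothetical toy values, not the article's data
v = c(69.0, 73.5, 74.1, 74.8, 75.0, 75.3, 75.9, 76.4, 77.0, 80.9)

# Lower and upper hinges, as computed by Tukey's five-number summary
hinges = fivenum(v)[c(2, 4)]

# Distance allowed beyond each hinge (boxplot.stats uses coef = 1.5 by default)
fence = 1.5 * diff(hinges)

# Flag values outside the fences
is_out = v < hinges[1] - fence | v > hinges[2] + fence

manual_out = v[is_out]    # values treated as outliers
kept = v[!is_out]         # values that remain after removal

# The built-in should flag the same values
builtin_out = boxplot.stats(v)$out
```

Because the rule depends on the hinges of whatever vector it is given, removing outliers and recomputing can occasionally flag new points, which is why the boxplot is checked again after the removal.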
|
||||||
|
|
||||||
Then let us check the histogram and boxplots to see if the nearly normal condition is now met.
|
Then let us check the histogram and boxplots to see if the nearly normal condition is now met.
|
||||||
|
|
||||||
Looking at the boxplot, there are no more outliers present
|
Looking at the boxplot, there are no more outliers present
|
||||||
|
|
||||||
<pre class="language-R"><code class="language-R">
|
```R
|
||||||
# Check graphs again
|
# Check graphs again
|
||||||
boxplot(maleExpectancy2, ylab = "Age", main = "Male Life Expectancy w/o Outliers")</code></pre>
|
boxplot(maleExpectancy2, ylab = "Age", main = "Male Life Expectancy w/o Outliers")
|
||||||
|
```
|
||||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/MLifeExpectBoxplotNoOutliers.png" width="672" />
|
|
||||||
|
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/MLifeExpectBoxplotNoOutliers.png" width="672" />
|
||||||
The histogram still appears to be unimodal and symmetric
|
|
||||||
|
The histogram still appears to be unimodal and symmetric
|
||||||
<pre class="language-R"><code class="language-R">hist(maleExpectancy2, xlab = "Age", main = "Male Life Expectancy w/o Outliers")</code></pre>
|
|
||||||
|
```R
|
||||||
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/MLifeExpectHistNoOutliers.png" width="672" />
|
hist(maleExpectancy2, xlab = "Age", main = "Male Life Expectancy w/o Outliers")
|
||||||
|
```
|
||||||
Without the outliers present, the nearly normal condition is now met. We can perform the t-test.
|
|
||||||
|
<img src="https://brandonrozek.com/wp-content/uploads/2017/03/MLifeExpectHistNoOutliers.png" width="672" />
|
||||||
We can see that the hypothesis test returns essentially the same results as before. This tells us that the outliers did not have a significant impact on our test results.
|
|
||||||
|
Without the outliers present, the nearly normal condition is now met. We can perform the t-test.
|
||||||
<pre class="language-R"><code class="language-R"># Test new alternative
|
|
||||||
t.test(femaleExpectancy, maleExpectancy2, alternative='g')</code></pre>
|
We can see that the hypothesis test returns essentially the same results as before. This tells us that the outliers did not have a significant impact on our test results.
|
||||||
|
|
||||||
<pre class="language-R"><code class="language-R">##
|
```R
|
||||||
|
# Test new alternative
|
||||||
|
t.test(femaleExpectancy, maleExpectancy2, alternative='g')
|
||||||
|
```
|
||||||
|
|
||||||
|
```R
|
||||||
## Welch Two Sample t-test
|
## Welch Two Sample t-test
|
||||||
##
|
##
|
||||||
## data: femaleExpectancy and maleExpectancy2
|
## data: femaleExpectancy and maleExpectancy2
|
||||||
|
@ -184,14 +203,17 @@ t.test(femaleExpectancy, maleExpectancy2, alternative='g')</code></pre>
|
||||||
## 5.000048 Inf
|
## 5.000048 Inf
|
||||||
## sample estimates:
|
## sample estimates:
|
||||||
## mean of x mean of y
|
## mean of x mean of y
|
||||||
## 80.41600 74.95204</code></pre>
|
## 80.41600 74.95204
|
||||||
|
```
|
||||||
Redoing the confidence intervals, we can see that it did not change greatly
|
|
||||||
|
Redoing the confidence intervals, we can see that it did not change greatly
|
||||||
<pre class="language-R"><code class="language-R"># Find new confidence interval
|
|
||||||
t.test(femaleExpectancy, maleExpectancy2)</code></pre>
|
```R
|
||||||
|
# Find new confidence interval
|
||||||
<pre class="language-R"><code class="language-R">##
|
t.test(femaleExpectancy, maleExpectancy2)
|
||||||
|
```
|
||||||
|
|
||||||
|
```R
|
||||||
## Welch Two Sample t-test
|
## Welch Two Sample t-test
|
||||||
##
|
##
|
||||||
## data: femaleExpectancy and maleExpectancy2
|
## data: femaleExpectancy and maleExpectancy2
|
||||||
|
@ -201,8 +223,9 @@ t.test(femaleExpectancy, maleExpectancy2)</code></pre>
|
||||||
## 4.910317 6.017601
|
## 4.910317 6.017601
|
||||||
## sample estimates:
|
## sample estimates:
|
||||||
## mean of x mean of y
|
## mean of x mean of y
|
||||||
## 80.41600 74.95204</code></pre>
|
## 80.41600 74.95204
|
||||||
|
```
|
||||||
## Conclusion
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
By running the tests and checking the effects of the outliers in the dataset and seeing that the results did not change, we can safely conclude that our interpretations stated before are correct. There is enough evidence to suggest that females in the United States live on average longer than males. We are 95% confident that they live longer than males by 5 to 6 years.
|
By running the tests and checking the effects of the outliers in the dataset and seeing that the results did not change, we can safely conclude that our interpretations stated before are correct. There is enough evidence to suggest that females in the United States live on average longer than males. We are 95% confident that they live longer than males by 5 to 6 years.
|