mirror of
https://github.com/Brandon-Rozek/website.git
synced 2024-11-21 07:46:30 -05:00
New Posts
This commit is contained in:
parent
0817a10dc7
commit
7b0fdf2fc9
5 changed files with 236 additions and 0 deletions
204
content/blog/discgolfpymc.md
Normal file
204
content/blog/discgolfpymc.md
Normal file
|
@ -0,0 +1,204 @@
|
|||
---
|
||||
title: "Disc Golf and PyMC3"
|
||||
date: 2020-03-28T22:08:19-04:00
|
||||
draft: false
|
||||
tags: ["python", "stats"]
|
||||
---
|
||||
|
||||
I've been following along with [Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/tree/master/) and I wanted to try using PyMC3 with my own small dataset.
|
||||
|
||||
A week ago, a couple friends and I went out and played Disc Golf at a local park. In case you don't know what Disc Golf is, the goal is for a player to throw a disc at a target with as few throws as possible. Throughout each of the *holes*, I recorded the number of tosses required until we completed that hole.
|
||||
|
||||
| Brandon | Chris | Clare |
|
||||
| ------- | ----- | ----- |
|
||||
| 6 | 8 | 10 |
|
||||
| 7 | 9 | 10 |
|
||||
| 7 | 8 | 9 |
|
||||
| 7 | 7 | 8 |
|
||||
| 5 | 4 | 9 |
|
||||
| 5 | 5 | 10 |
|
||||
| 4 | 4 | 7 |
|
||||
| 5 | 6 | 9 |
|
||||
| 6 | 5 | 7 |
|
||||
| 7 | 7 | 8 |
|
||||
| 5 | 6 | 8 |
|
||||
| 6 | 5 | 7 |
|
||||
| 6 | 6 | 8 |
|
||||
| 5 | 4 | 6 |
|
||||
|
||||
You can also [download the CSV](/data/discgolf03242020.csv).
|
||||
|
||||
What I want to know is the distribution of the number of tosses for each player.
|
||||
|
||||
## PyMC3
|
||||
|
||||
Let's try answering this with Bayesian Statistics + PyMC3.
|
||||
|
||||
First we'll need to import the relevant packages
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
import pymc3 as pm
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
from scipy.stats import poisson
|
||||
```
|
||||
|
||||
Load in the data
|
||||
|
||||
```python
|
||||
DISC_DATA = pd.read_csv("data/discgolf03242020.csv")
|
||||
PEOPLE = ["Brandon", "Chris", "Clare"]
|
||||
```
|
||||
|
||||
Since the number of tosses are count data, we are going to use the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) for the estimation. This means that we are going to characterize each player's number of tosses with this distribution. The Poisson distribution has a parameter $\lambda$ which also serves as the [expected value](https://en.wikipedia.org/wiki/Expected_value). The shape of the distribution is dependent on this parameter, so we'll need to estimate what this parameter is. Since the expected value must be positive, the exponential distribution is a good candidate.
|
||||
$$
|
||||
toss \sim Poisson(\lambda) \\
|
||||
\lambda \sim Exp(\alpha)
|
||||
$$
|
||||
The exponential distribution also has a parameter $\alpha$. The expected value of an exponential distribution with respect to $\alpha$ is $\frac{1}{\alpha}$. At this point we can give an estimate of what we believe $\alpha$ could be. Given the relationships with the expected values, the mean score of each of the players is a great choice.
|
||||
|
||||
```python
|
||||
with pm.Model() as model:
|
||||
# Random Variables
|
||||
ALPHA = [1. / DISC_DATA[person].mean() for person in PEOPLE]
|
||||
LAMBDAS = [pm.Exponential("lambda_" + person, alpha) for person, alpha in zip(PEOPLE, ALPHAS)]
|
||||
```
|
||||
|
||||
Now to show how easy the library is, we will provide the data and run Monte Carlo simulations to see the distribution that the $\lambda$s live in.
|
||||
|
||||
```python
|
||||
with model:
|
||||
OBS = [
|
||||
pm.Poisson("obs_" + person, lambda_, observed=DISC_DATA[person])
|
||||
for person, lambda_ in zip(PEOPLE, LAMBDAS)
|
||||
]
|
||||
TRACES = pm.sample(10000, tune=5000)
|
||||
```
|
||||
|
||||
We can then grab the distribution of $\lambda$s from the trace.
|
||||
|
||||
```python
|
||||
LAMBDA_SAMPLES = [TRACE["lambda_" + person] for person in PEOPLE]
|
||||
```
|
||||
|
||||
## Visualization
|
||||
|
||||
First let's check out the average disc tosses for each player by looking at how the $\lambda$s are distributed.
|
||||
|
||||
```python
|
||||
plt.figure("Distribution of Average Disc Tosses")
|
||||
for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
|
||||
plt.hist(lambda_sample,
|
||||
histtype='stepfilled', bins=30, alpha=0.85,
|
||||
label=r"posterior of $\lambda_{" + person + "}$",
|
||||
color=color, density=True
|
||||
)
|
||||
PARAMETER_TITLE = r"\;".join(
|
||||
[r"\lambda_{" + person + "}" for person in PEOPLE]
|
||||
)
|
||||
plt.title(f"""Posterior distributions of the variables
|
||||
${PARAMETER_TITLE}$""")
|
||||
plt.xlim([
|
||||
DISC_DATA[PEOPLE].min().min() - 2,
|
||||
DISC_DATA[PEOPLE].max().max() + 2
|
||||
])
|
||||
plt.legend()
|
||||
plt.xlabel(r"$\lambda$")
|
||||
plt.show()
|
||||
```
|
||||
|
||||
![](/files/images/2020032901.png)
|
||||
|
||||
Now let's look at the distribution of the number of tosses for each player
|
||||
|
||||
```python
|
||||
plt.figure("Distribution of Disc Tosses")
|
||||
for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
|
||||
tosses = np.arange(15)
|
||||
lambda_ = lambda_sample.mean()
|
||||
plt.bar(tosses, poisson.pmf(tosses, lambda_), color=color,
|
||||
label=r"$\lambda_{" + person + "}$ = " + "{:0.1f}".format(lambda_),
|
||||
alpha=0.60, edgecolor=color, lw="3"
|
||||
)
|
||||
plt.legend()
|
||||
plt.xlabel("Number of Tosses")
|
||||
plt.title("Poisson Distributions")
|
||||
```
|
||||
|
||||
![](/files/images/2020032902.png)
|
||||
|
||||
## Conclusion
|
||||
|
||||
We can see in the first graphic that the average number of tosses required for Brandon and Chris is about $6$, while for Clare it's $8$. Looking at the second graphic though, shows us that Clare has a wide range of possible tosses.
|
||||
|
||||
With PyMC3, we got to see the larger trends in the data by analyzing distributions that are more likely given the data.
|
||||
|
||||
For easy reference, here is the entire script.
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
import pymc3 as pm
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
from scipy.stats import poisson
|
||||
|
||||
DISC_DATA = pd.read_csv("data/discgolf03242020.csv")
|
||||
PEOPLE = ["Brandon", "Chris", "Clare"]
|
||||
COLORS = ["#A60628", "#7A68A6", "#4c9e65"]
|
||||
|
||||
assert len(PEOPLE) == len(COLORS)
|
||||
|
||||
with pm.Model() as model:
|
||||
# Random Variables
|
||||
ALPHAS = [1. / DISC_DATA[person].mean() for person in PEOPLE]
|
||||
LAMBDAS = [pm.Exponential("lambda_" + person, alpha) for person, alpha in zip(PEOPLE, ALPHAS)]
|
||||
|
||||
# Observations
|
||||
OBS = [
|
||||
pm.Poisson("obs_" + person, lambda_, observed=DISC_DATA[person])
|
||||
for person, lambda_ in zip(PEOPLE, LAMBDAS)
|
||||
]
|
||||
|
||||
# Monte-Carlo
|
||||
TRACE = pm.sample(10000, tune=5000)
|
||||
|
||||
LAMBDA_SAMPLES = [TRACE["lambda_" + person] for person in PEOPLE]
|
||||
|
||||
# Graph histogram of samples
|
||||
plt.figure("Distribution of Average Disc Tosses")
|
||||
for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
|
||||
plt.hist(lambda_sample,
|
||||
histtype='stepfilled', bins=30, alpha=0.85,
|
||||
label=r"posterior of $\lambda_{" + person + "}$",
|
||||
color=color, density=True
|
||||
)
|
||||
PARAMETER_TITLE = r"\;".join(
|
||||
[r"\lambda_{" + person + "}" for person in PEOPLE]
|
||||
)
|
||||
plt.title(f"""Posterior distributions of the variables
|
||||
${PARAMETER_TITLE}$""")
|
||||
plt.xlim([
|
||||
DISC_DATA[PEOPLE].min().min() - 2,
|
||||
DISC_DATA[PEOPLE].max().max() + 2
|
||||
])
|
||||
plt.legend()
|
||||
plt.xlabel(r"$\lambda$")
|
||||
|
||||
# Graph Poisson Distributions
|
||||
plt.figure("Distribution of Disc Tosses")
|
||||
for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
|
||||
tosses = np.arange(15)
|
||||
lambda_ = lambda_sample.mean()
|
||||
plt.bar(tosses, poisson.pmf(tosses, lambda_), color=color,
|
||||
label=r"$\lambda_{" + person + "}$ = " + "{:0.1f}".format(lambda_),
|
||||
alpha=0.60, edgecolor=color, lw="3"
|
||||
)
|
||||
plt.legend()
|
||||
plt.xlabel("Number of Tosses")
|
||||
plt.title("Poisson Distributions")
|
||||
|
||||
plt.show()
|
||||
|
||||
```
|
||||
|
17
content/blog/pyleniterables.md
Normal file
17
content/blog/pyleniterables.md
Normal file
|
@ -0,0 +1,17 @@
|
|||
---
|
||||
title: "Quick Python: Length of Iterables"
|
||||
date: 2020-03-25T18:28:09-04:00
|
||||
draft: false
|
||||
tags: ["python"]
|
||||
---
|
||||
|
||||
I wanted to find the length of what I know is a finite iterable. Normally you would think of using the `len` function but it does not work in this case. [Al Hoo](https://stackoverflow.com/a/44351664) on StackOverflow shared a quick snippet to calculate this.
|
||||
|
||||
```python
|
||||
from functools import reduce
|
||||
|
||||
def ilen(iterable):
|
||||
return reduce(lambda sum, element: sum + 1, iterable, 0)
|
||||
```
|
||||
|
||||
This also turns out to be memory efficient since we are only loading in one object into memory from the iterable at a time.
|
15
static/data/discgolf03242020.csv
Normal file
15
static/data/discgolf03242020.csv
Normal file
|
@ -0,0 +1,15 @@
|
|||
Hole,Brandon,Chris,Clare
|
||||
1,6,8,10
|
||||
2,7,9,10
|
||||
3,7,8,9
|
||||
4,7,7,8
|
||||
5,5,4,9
|
||||
6,5,5,10
|
||||
7,4,4,7
|
||||
8,5,6,9
|
||||
9,6,5,7
|
||||
10,7,7,8
|
||||
11,5,6,8
|
||||
12,6,5,7
|
||||
13,6,6,8
|
||||
14,5,4,6
|
|
BIN
static/files/images/2020032901.png
Normal file
BIN
static/files/images/2020032901.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 23 KiB |
BIN
static/files/images/2020032902.png
Normal file
BIN
static/files/images/2020032902.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 25 KiB |
Loading…
Reference in a new issue