diff --git a/content/blog/discgolfpymc.md b/content/blog/discgolfpymc.md
new file mode 100644
index 0000000..c6d5845
--- /dev/null
+++ b/content/blog/discgolfpymc.md
@@ -0,0 +1,204 @@
+---
+title: "Disc Golf and PyMC3"
+date: 2020-03-28T22:08:19-04:00
+draft: false
+tags: ["python", "stats"]
+---
+
+I've been following along with [Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/tree/master/) and wanted to try using PyMC3 on a small dataset of my own.
+
+A week ago, a couple of friends and I went out and played disc golf at a local park. In case you don't know the game, the goal is to throw a disc into a target in as few throws as possible. For each *hole*, I recorded the number of tosses each of us needed to complete it.
+
+| Brandon | Chris | Clare |
+| ------- | ----- | ----- |
+| 6 | 8 | 10 |
+| 7 | 9 | 10 |
+| 7 | 8 | 9 |
+| 7 | 7 | 8 |
+| 5 | 4 | 9 |
+| 5 | 5 | 10 |
+| 4 | 4 | 7 |
+| 5 | 6 | 9 |
+| 6 | 5 | 7 |
+| 7 | 7 | 8 |
+| 5 | 6 | 8 |
+| 6 | 5 | 7 |
+| 6 | 6 | 8 |
+| 5 | 4 | 6 |
+
+You can also [download the CSV](/data/discgolf03242020.csv).
+
+What I want to know is how the number of tosses is distributed for each player.
+
+## PyMC3
+
+Let's try answering this with Bayesian statistics + PyMC3.
+
+First we'll need to import the relevant packages.
+
+```python
+import pandas as pd
+import pymc3 as pm
+import matplotlib.pyplot as plt
+import numpy as np
+from scipy.stats import poisson
+```
+
+Then load in the data. The colors are used for the plots later on.
+
+```python
+DISC_DATA = pd.read_csv("data/discgolf03242020.csv")
+PEOPLE = ["Brandon", "Chris", "Clare"]
+COLORS = ["#A60628", "#7A68A6", "#4c9e65"]
+```
+
+Since the number of tosses is count data, we'll model it with the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution): each player's toss count per hole is characterized by their own Poisson. The Poisson distribution has a single parameter $\lambda$, which is also its [expected value](https://en.wikipedia.org/wiki/Expected_value). The shape of the distribution depends entirely on this parameter, so $\lambda$ is what we need to estimate. Since $\lambda$ must be positive, the exponential distribution is a good candidate for its prior.
+$$
+toss \sim Poisson(\lambda) \\
+\lambda \sim Exp(\alpha)
+$$
+The exponential distribution has its own parameter $\alpha$, and its expected value is $\frac{1}{\alpha}$. We still have to choose $\alpha$. Setting $\alpha$ to the inverse of each player's mean toss count makes the prior's expected value of $\lambda$ equal the observed average, which is a reasonable starting point.
+
+```python
+with pm.Model() as model:
+    # Random Variables: one exponential prior on lambda per player
+    ALPHAS = [1. / DISC_DATA[person].mean() for person in PEOPLE]
+    LAMBDAS = [pm.Exponential("lambda_" + person, alpha) for person, alpha in zip(PEOPLE, ALPHAS)]
+```
+
+Now we tie the model to the observed data and let PyMC3 run Markov chain Monte Carlo to sample from the posterior distribution of each $\lambda$.
+
+```python
+with model:
+    # Observations: each player's toss counts, modeled as Poisson draws
+    OBS = [
+        pm.Poisson("obs_" + person, lambda_, observed=DISC_DATA[person])
+        for person, lambda_ in zip(PEOPLE, LAMBDAS)
+    ]
+    TRACE = pm.sample(10000, tune=5000)
+```
+
+We can then grab the posterior samples of the $\lambda$s from the trace.
+
+```python
+LAMBDA_SAMPLES = [TRACE["lambda_" + person] for person in PEOPLE]
+```
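+
+As a quick sanity check (a minimal sketch, not part of the modeling itself), we can compare the posterior mean of each $\lambda$ against the raw average from the table above; because the priors were centered on the observed means, the two should land close together.
+
+```python
+# Sanity check: posterior mean of each lambda vs. the raw sample mean.
+for person, lambda_sample in zip(PEOPLE, LAMBDA_SAMPLES):
+    print(f"{person}: posterior mean = {lambda_sample.mean():.2f}, "
+          f"data mean = {DISC_DATA[person].mean():.2f}")
+```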
+
+## Visualization
+
+First let's check out the average number of disc tosses for each player by looking at how the $\lambda$s are distributed.
+
+```python
+plt.figure("Distribution of Average Disc Tosses")
+for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
+    plt.hist(lambda_sample,
+             histtype='stepfilled', bins=30, alpha=0.85,
+             label=r"posterior of $\lambda_{" + person + "}$",
+             color=color, density=True
+             )
+PARAMETER_TITLE = r"\;".join(
+    [r"\lambda_{" + person + "}" for person in PEOPLE]
+)
+plt.title(f"""Posterior distributions of the variables
+    ${PARAMETER_TITLE}$""")
+plt.xlim([
+    DISC_DATA[PEOPLE].min().min() - 2,
+    DISC_DATA[PEOPLE].max().max() + 2
+])
+plt.legend()
+plt.xlabel(r"$\lambda$")
+plt.show()
+```
+
+![Posterior distributions of each player's average toss count](/files/images/2020032901.png)
+
+Now let's look at the distribution of the number of tosses for each player, plugging the posterior mean of each $\lambda$ into a Poisson distribution.
+
+```python
+plt.figure("Distribution of Disc Tosses")
+for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
+    tosses = np.arange(15)
+    lambda_ = lambda_sample.mean()
+    plt.bar(tosses, poisson.pmf(tosses, lambda_), color=color,
+            label=r"$\lambda_{" + person + "}$ = " + "{:0.1f}".format(lambda_),
+            alpha=0.60, edgecolor=color, lw=3
+            )
+plt.legend()
+plt.xlabel("Number of Tosses")
+plt.title("Poisson Distributions")
+plt.show()
+```
+
+![Poisson distributions of toss counts using each player's posterior mean](/files/images/2020032902.png)
+
+## Conclusion
+
+The first graphic shows that the average number of tosses for Brandon and Chris is about $6$, while for Clare it is about $8$. The second graphic, though, shows that Clare's toss counts are spread over a wider range of values.
+
+With PyMC3, we got to see the larger trends in this small dataset by reasoning about which distributions are most plausible given the data.
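+
+The second graphic uses only the posterior mean of each $\lambda$, which throws away our remaining uncertainty about $\lambda$. As a rough optional extension (just a sketch built from the samples we already have, not something the analysis above depends on), we can fold that uncertainty back in by drawing one simulated toss count per posterior sample.
+
+```python
+# Rough posterior predictive draws: one simulated toss count per lambda sample.
+# The spread of these simulations reflects both the Poisson noise and our
+# remaining uncertainty about lambda itself.
+SIMULATED = {
+    person: np.random.poisson(lam=lambda_sample)
+    for person, lambda_sample in zip(PEOPLE, LAMBDA_SAMPLES)
+}
+for person, sims in SIMULATED.items():
+    low, high = np.percentile(sims, [5, 95])
+    print(f"{person}: 90% of simulated holes take between {low:.0f} and {high:.0f} tosses")
+```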
+
+For easy reference, here is the entire script.
+
+```python
+import pandas as pd
+import pymc3 as pm
+import matplotlib.pyplot as plt
+import numpy as np
+from scipy.stats import poisson
+
+DISC_DATA = pd.read_csv("data/discgolf03242020.csv")
+PEOPLE = ["Brandon", "Chris", "Clare"]
+COLORS = ["#A60628", "#7A68A6", "#4c9e65"]
+
+assert len(PEOPLE) == len(COLORS)
+
+with pm.Model() as model:
+    # Random Variables
+    ALPHAS = [1. / DISC_DATA[person].mean() for person in PEOPLE]
+    LAMBDAS = [pm.Exponential("lambda_" + person, alpha) for person, alpha in zip(PEOPLE, ALPHAS)]
+
+    # Observations
+    OBS = [
+        pm.Poisson("obs_" + person, lambda_, observed=DISC_DATA[person])
+        for person, lambda_ in zip(PEOPLE, LAMBDAS)
+    ]
+
+    # Monte-Carlo
+    TRACE = pm.sample(10000, tune=5000)
+
+LAMBDA_SAMPLES = [TRACE["lambda_" + person] for person in PEOPLE]
+
+# Graph histogram of posterior samples
+plt.figure("Distribution of Average Disc Tosses")
+for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
+    plt.hist(lambda_sample,
+             histtype='stepfilled', bins=30, alpha=0.85,
+             label=r"posterior of $\lambda_{" + person + "}$",
+             color=color, density=True
+             )
+PARAMETER_TITLE = r"\;".join(
+    [r"\lambda_{" + person + "}" for person in PEOPLE]
+)
+plt.title(f"""Posterior distributions of the variables
+    ${PARAMETER_TITLE}$""")
+plt.xlim([
+    DISC_DATA[PEOPLE].min().min() - 2,
+    DISC_DATA[PEOPLE].max().max() + 2
+])
+plt.legend()
+plt.xlabel(r"$\lambda$")
+
+# Graph Poisson distributions using the posterior mean of each lambda
+plt.figure("Distribution of Disc Tosses")
+for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
+    tosses = np.arange(15)
+    lambda_ = lambda_sample.mean()
+    plt.bar(tosses, poisson.pmf(tosses, lambda_), color=color,
+            label=r"$\lambda_{" + person + "}$ = " + "{:0.1f}".format(lambda_),
+            alpha=0.60, edgecolor=color, lw=3
+            )
+plt.legend()
+plt.xlabel("Number of Tosses")
+plt.title("Poisson Distributions")
+
+plt.show()
+```
diff --git a/content/blog/pyleniterables.md b/content/blog/pyleniterables.md
new file mode 100644
index 0000000..e8f11db
--- /dev/null
+++ b/content/blog/pyleniterables.md
@@ -0,0 +1,17 @@
+---
+title: "Quick Python: Length of Iterables"
+date: 2020-03-25T18:28:09-04:00
+draft: false
+tags: ["python"]
+---
+
+I wanted to find the length of what I know is a finite iterable. Normally you would reach for the `len` function, but it does not work on iterators and generators. [Al Hoo](https://stackoverflow.com/a/44351664) on StackOverflow shared a quick snippet to calculate it.
+
+```python
+from functools import reduce
+
+def ilen(iterable):
+    return reduce(lambda sum, element: sum + 1, iterable, 0)
+```
+
+This also turns out to be memory efficient, since only one element from the iterable is held in memory at a time.
\ No newline at end of file
diff --git a/static/data/discgolf03242020.csv b/static/data/discgolf03242020.csv
new file mode 100644
index 0000000..590c98a
--- /dev/null
+++ b/static/data/discgolf03242020.csv
@@ -0,0 +1,15 @@
+Hole,Brandon,Chris,Clare
+1,6,8,10
+2,7,9,10
+3,7,8,9
+4,7,7,8
+5,5,4,9
+6,5,5,10
+7,4,4,7
+8,5,6,9
+9,6,5,7
+10,7,7,8
+11,5,6,8
+12,6,5,7
+13,6,6,8
+14,5,4,6
\ No newline at end of file
diff --git a/static/files/images/2020032901.png b/static/files/images/2020032901.png
new file mode 100644
index 0000000..b55cb12
Binary files /dev/null and b/static/files/images/2020032901.png differ
diff --git a/static/files/images/2020032902.png b/static/files/images/2020032902.png
new file mode 100644
index 0000000..4ffdfed
Binary files /dev/null and b/static/files/images/2020032902.png differ