New Posts

2026-01-11 16:20:24 +00:00 · 2020-03-29 00:35:21 -04:00 · 2020-03-29 00:35:21 -04:00 · 7b0fdf2fc9
commit 7b0fdf2fc9
parent 0817a10dc7
5 changed files with 236 additions and 0 deletions
--- a/content/blog/discgolfpymc.md
+++ b/content/blog/discgolfpymc.md
@ -0,0 +1,204 @@
 ---
 title: "Disc Golf and PyMC3"
 date: 2020-03-28T22:08:19-04:00
 draft: false
 tags: ["python", "stats"]
 ---
 I've been following along with [Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/tree/master/) and I wanted to try using PyMC3 with my own small dataset.
 A week ago, a couple friends and I went out and played Disc Golf at a local park. In case you don't know what Disc Golf is, the goal is for a player to throw a disc at a target with as few throws as possible. Throughout each of the *holes*, I recorded the number of tosses required until we completed that hole.
 | Brandon | Chris | Clare |
 | ------- | ----- | ----- |
 |    6    |   8   |  10   |
 |    7    |   9   |  10   |
 |    7    |   8   |  9    |
 |    7    |   7   |  8    |
 |    5    |   4   |  9    |
 |    5    |   5   |  10   |
 |    4    |   4   |  7    |
 |    5    |   6   |  9    |
 |    6    |   5   |  7    |
 |    7    |   7   |  8    |
 |    5    |   6   |  8    |
 |    6    |   5   |  7    |
 |    6    |   6   |  8    |
 |    5    |   4   |  6    |
 You can also [download the CSV](/data/discgolf03242020.csv).
 What I want to know is the distribution of the number of tosses for each player.
 ## PyMC3
 Let's try answering this with Bayesian Statistics + PyMC3.
 First we'll need to import the relevant packages
 ```python
 import pandas as pd
 import pymc3 as pm
 import matplotlib.pyplot as plt
 import numpy as np
 from scipy.stats import poisson
 ```
 Load in the data
 ```python
 DISC_DATA = pd.read_csv("data/discgolf03242020.csv")
 PEOPLE = ["Brandon", "Chris", "Clare"]
 ```
 Since the number of tosses are count data, we are going to use the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) for the estimation. This means that we are going to characterize each player's number of tosses with this distribution. The Poisson distribution has a  parameter $\lambda$ which also serves as the [expected value](https://en.wikipedia.org/wiki/Expected_value). The shape of the distribution is dependent on this parameter, so we'll need to estimate what this parameter is. Since the expected value must be positive, the exponential distribution is a good candidate.
 $$
 toss \sim Poisson(\lambda) \\
 \lambda \sim Exp(\alpha)
 $$
 The exponential distribution also has a parameter  $\alpha$.  The expected value of an exponential distribution with respect to $\alpha$ is $\frac{1}{\alpha}$. At this point we can give an estimate of what we believe $\alpha$ could be. Given the relationships with the expected values, the mean score of each of the players is a great choice.
 ```python
 with pm.Model() as model:
    # Random Variables
    ALPHA = [1. / DISC_DATA[person].mean() for person in PEOPLE]
    LAMBDAS = [pm.Exponential("lambda_" + person, alpha) for person, alpha in zip(PEOPLE, ALPHAS)]
 ```
 Now to show how easy the library is, we will provide the data and run Monte Carlo simulations to see the distribution that the $\lambda$s live in.
 ```python
 with model:
    OBS = [
        pm.Poisson("obs_" + person, lambda_, observed=DISC_DATA[person])
        for person, lambda_ in zip(PEOPLE, LAMBDAS)
    ]
    TRACES = pm.sample(10000, tune=5000)
 ```
 We can then grab the distribution of $\lambda$s from the trace.
 ```python
 LAMBDA_SAMPLES = [TRACE["lambda_" + person] for person in PEOPLE]
 ```
 ## Visualization
 First let's check out the average disc tosses for each player by looking at how the $\lambda$s are distributed.
 ```python
 plt.figure("Distribution of Average Disc Tosses")
 for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
    plt.hist(lambda_sample,
        histtype='stepfilled', bins=30, alpha=0.85,
        label=r"posterior of $\lambda_{" + person + "}$",
        color=color, density=True
    )
 PARAMETER_TITLE = r"\;".join(
    [r"\lambda_{" + person + "}" for person in PEOPLE]
 )
 plt.title(f"""Posterior distributions of the variables
    ${PARAMETER_TITLE}$""")
 plt.xlim([
    DISC_DATA[PEOPLE].min().min() - 2, 
    DISC_DATA[PEOPLE].max().max() + 2
 ])
 plt.legend()
 plt.xlabel(r"$\lambda$")
 plt.show()
 ```
 ![](/files/images/2020032901.png)
 Now let's look at the distribution of the number of tosses for each player
 ```python
 plt.figure("Distribution of Disc Tosses")
 for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
    tosses = np.arange(15)
    lambda_ = lambda_sample.mean()
    plt.bar(tosses, poisson.pmf(tosses, lambda_), color=color,
            label=r"$\lambda_{" + person + "}$ = " + "{:0.1f}".format(lambda_),
            alpha=0.60, edgecolor=color, lw="3"
    )
 plt.legend()
 plt.xlabel("Number of Tosses")
 plt.title("Poisson Distributions")
 ```
 ![](/files/images/2020032902.png)
 ## Conclusion
 We can see in the first graphic that the average number of tosses required for Brandon and Chris is about $6$, while for Clare it's $8$. Looking at the second graphic though, shows us that Clare has a wide range of possible tosses. 
 With PyMC3, we got to see the larger trends in the data by analyzing distributions that are more likely given the data.
 For easy reference, here is the entire script.
 ```python
 import pandas as pd
 import pymc3 as pm
 import matplotlib.pyplot as plt
 import numpy as np
 from scipy.stats import poisson
 DISC_DATA = pd.read_csv("data/discgolf03242020.csv")
 PEOPLE = ["Brandon", "Chris", "Clare"]
 COLORS = ["#A60628", "#7A68A6", "#4c9e65"]
 assert len(PEOPLE) == len(COLORS)
 with pm.Model() as model:
    # Random Variables
    ALPHAS = [1. / DISC_DATA[person].mean() for person in PEOPLE]
    LAMBDAS = [pm.Exponential("lambda_" + person, alpha) for person, alpha in zip(PEOPLE, ALPHAS)]
    # Observations
    OBS = [
        pm.Poisson("obs_" + person, lambda_, observed=DISC_DATA[person]) 
        for person, lambda_ in zip(PEOPLE, LAMBDAS)
    ]
    # Monte-Carlo
    TRACE = pm.sample(10000, tune=5000)
 LAMBDA_SAMPLES = [TRACE["lambda_" + person] for person in PEOPLE]
 # Graph histogram of samples
 plt.figure("Distribution of Average Disc Tosses")
 for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
    plt.hist(lambda_sample,
        histtype='stepfilled', bins=30, alpha=0.85,
        label=r"posterior of $\lambda_{" + person + "}$",
        color=color, density=True
    )
 PARAMETER_TITLE = r"\;".join(
    [r"\lambda_{" + person + "}" for person in PEOPLE]
 )
 plt.title(f"""Posterior distributions of the variables
    ${PARAMETER_TITLE}$""")
 plt.xlim([
    DISC_DATA[PEOPLE].min().min() - 2, 
    DISC_DATA[PEOPLE].max().max() + 2
 ])
 plt.legend()
 plt.xlabel(r"$\lambda$")
 # Graph Poisson Distributions
 plt.figure("Distribution of Disc Tosses")
 for person, lambda_sample, color in zip(PEOPLE, LAMBDA_SAMPLES, COLORS):
    tosses = np.arange(15)
    lambda_ = lambda_sample.mean()
    plt.bar(tosses, poisson.pmf(tosses, lambda_), color=color,
            label=r"$\lambda_{" + person + "}$ = " + "{:0.1f}".format(lambda_),
            alpha=0.60, edgecolor=color, lw="3"
    )
 plt.legend()
 plt.xlabel("Number of Tosses")
 plt.title("Poisson Distributions")
 plt.show()
 ```
--- a/content/blog/pyleniterables.md
+++ b/content/blog/pyleniterables.md
@ -0,0 +1,17 @@
 ---
 title: "Quick Python: Length of Iterables"
 date: 2020-03-25T18:28:09-04:00
 draft: false
 tags: ["python"]
 ---
 I wanted to find the length of what I know is a finite iterable. Normally you would think of using the `len` function but it does not work in this case. [Al Hoo](https://stackoverflow.com/a/44351664) on StackOverflow shared a quick snippet to calculate this.
 ```python
 from functools import reduce
 def ilen(iterable):
    return reduce(lambda sum, element: sum + 1, iterable, 0)
 ```
 This also turns out to be memory efficient since we are only loading in one object into memory from the iterable at a time.
--- a/static/data/discgolf03242020.csv
+++ b/static/data/discgolf03242020.csv
@ -0,0 +1,15 @@
 Hole,Brandon,Chris,Clare
 1,6,8,10
 2,7,9,10
 3,7,8,9
 4,7,7,8
 5,5,4,9
 6,5,5,10
 7,4,4,7
 8,5,6,9
 9,6,5,7
 10,7,7,8
 11,5,6,8
 12,6,5,7
 13,6,6,8
 14,5,4,6
--- a/static/files/images/2020032901.png
+++ b/static/files/images/2020032901.png
--- a/static/files/images/2020032902.png
+++ b/static/files/images/2020032902.png