Exploring the World through Data

Board Game Bonanza: EDA

2023-12-13T00:00:00+00:00

In a previous post, I showed how we can collect data to find what makes a board game popular. In this post, we’ll use Python to explore that data.

As a reminder, all data comes from BoardGameGeek!

What is in our data?

As you might remember from the previous post, this data consists of the top 1000 rated games on BoardGameGeek. The data also includes different variables about each board game, including year, age recommendations, estimated time, ratings, average price, etc.

Because this is only the top 1000 board games and not all board games (due to time constraints), we can’t draw firm conclusions about what makes a board game highly rated, since this is a biased sample. However, we can still make some interesting observations.

Distributions of Variables

What are the distributional breakdowns of each factor in our data? What outliers are interesting?

To answer these questions, let’s take a look at the distributions of our data. This first figure shows the histograms for our three categorical variables: group size, length of time, and age rating.

The vast majority of these top 1000 games are in the center of these 3 groups: small to large groups (but not too small or too large), short to long (but not quick or too long), and rated for preteens and teens (not too young but not too adult either). Whether this is because there are simply more games with these characteristics in general, or because such games receive higher ratings, is a question that can only be answered by looking at more data.

We will also take a look at the distributions of our numerical data. Unfortunately, some of these graphs are impacted by outliers:

I adjusted these graphs to account for outliers:

I changed the year published distribution to leave out the 6 games that are pre-1950 (one game, Go, was from 2000 B.C.E. which made it difficult to see the graph).
I removed the 4 games that have max players of 99-100 players.
For the distribution of playing times, I removed two games with a playing time of 20 hours.
The Average Price distribution has “Magic: The Gathering” left out, because it has an average price in the $7000s (probably due to the rarity of some cards).
There are some other outliers I left in; for example, there are 11 games with over 80,000 ratings. But I removed the outliers that made the graphs difficult to see, and I wanted to note them because they are different than the vast majority of the games.
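
The adjustments above amount to a handful of row filters. Here is a minimal sketch, assuming the data frame uses the column names from the data collection post and that playing time is stored in minutes. The helper name `trim_for_plot` and the exact bounds are assumptions for illustration, and the toy rows are made up rather than taken from the real data.

```python
import pandas as pd

INF = float("inf")

# Illustrative bounds (assumptions): keep rows where low <= value < high.
BOUNDS = {
    "Year Published": (1950, INF),       # drop games published before 1950
    "Max Players": (-INF, 99),           # drop games listing 99-100 players
    "Playing Time": (-INF, 1200),        # drop games at 20 hours (1200 minutes)
    "Average USD Price": (-INF, 7000),   # drop average prices in the $7000s
}

def trim_for_plot(df: pd.DataFrame, column: str) -> pd.Series:
    """Return one column with missing values and outliers removed, for plotting only."""
    low, high = BOUNDS[column]
    values = df[column].dropna()
    return values[(values >= low) & (values < high)]

# Made-up rows, just to exercise the helper.
toy = pd.DataFrame({
    "Year Published": [-2000, 1995, 2017],
    "Max Players": [2, 4, 100],
    "Playing Time": [30, 1200, 90],
    "Average USD Price": [25.0, 7200.0, None],
})
print(trim_for_plot(toy, "Year Published").tolist())
```

Filtering one column at a time keeps a game in the other histograms even when it is an outlier in one of them.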

Some of my takeaways for these distributions and what they tell us about the most popular 1000 games:

The most popular games are all pretty recent; the median age is 8 years old, or around 2015.
The most common player range appears to be 2-4 players.
The median playing time is an hour and 15 minutes and the median age minimum is 12 years old.
While these are the top 1000 games as sorted by Bayes Rating, a better measure of popularity is the number of ratings, since it is the best available indicator of how many people have played a game.

Correlation of Variables

How are different variables related? And which variable relationships should we ignore?

To answer these questions, let’s take a look at how the variables are all related through a correlation plot.

My biggest takeaways from this figure:

First is what to ignore: Bayes and Average Rating are highly correlated, which makes sense because they are the same except for the additional 30 average ratings added to the Bayes rating. When performing analysis, it is best to drop one of these variables to avoid multicollinearity.
As well, ignore the correlation between Age and Year Published; because one is a function of the other, these will always be perfectly negatively correlated.
“Number of Ratings” and “Number of Accessories” are pretty moderately correlated. Perhaps the more expansions and versions of a game, the more opportunities a person has to play the game and the more likely they are to rate the game.
There is a moderately high correlation between the variance of ratings and the amount of playing time a game requires; since a higher variance indicates a greater spread of opinion, this seems to show that the longer a game takes, the more polarizing it becomes.
There’s an interesting slight negative correlation between number of ratings and average price. My guess is that the more popular a game is, the more copies there are, which lowers the price. It’s the games that are a little rarer but still have a devoted fanbase that are likely more expensive, which is backed up by the fact that price and average rating are more highly correlated than price and Bayes rating (fewer ratings means the generic ratings in a Bayes Rating are weighted more heavily).
The minimum number of players has a very small correlation with the maximum number of players. This is very interesting because I assumed that if there is a smaller minimum, there will likely be a small maximum as well. There could just be noise throwing off the correlation in this case, but whatever the cause, the correlation is near 0.
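
A correlation plot like this one is built from a pairwise correlation matrix. Here is a minimal sketch, assuming column names like those in the cleaned data frame from the data collection post. The toy rows are made up, and `correlation_matrix` is a hypothetical helper name. It also shows why the Age and Year Published pair can be ignored: since Age is 2023 minus Year Published, their Pearson correlation is -1 by construction.

```python
import pandas as pd

def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlations between every pair of numeric columns."""
    return df.select_dtypes(include="number").corr()

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year Published": [1995, 2008, 2015, 2020],
    "Number of Ratings": [90000, 40000, 25000, 8000],
    "Average USD Price": [30.0, 45.0, 60.0, 55.0],
})
toy["Age (Years)"] = 2023 - toy["Year Published"]

corr = correlation_matrix(toy)
# -1 up to floating-point rounding, because one column is a linear function of the other
print(corr.loc["Age (Years)", "Year Published"])
```

Passing the matrix to seaborn’s heatmap, for example `sns.heatmap(corr, annot=True)`, is one common way to draw it.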

Distribution of Popularity

How is the number of ratings distributed across different factors? Is there any significant difference between their means?

When it comes to our categorical factors, I was interested to see if any of them had an impact on a game’s popularity. I took a look using violin plots:

These plots roughly parallel our distributions. That is, the more “central” values of these factors have the highest median number of ratings and the highest maximums (though it doesn’t look like any of these are significantly different from one another). However, there are more popular games in the quick and young categories than one would expect given their number in the dataset. Perhaps this is because people rate the games they have played, and many people played a lot of games when they were kids. Playing time and age minimum are moderately correlated, likely because younger kids have shorter attention spans.
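
The medians and maximums that the violin plots display can also be tabulated directly. Here is a minimal sketch, assuming the “Time Category” and “Number of Ratings” columns from the data collection post. The rows are made up, and `ratings_by_category` is a hypothetical helper.

```python
import pandas as pd

def ratings_by_category(df: pd.DataFrame, category: str) -> pd.DataFrame:
    """Count, median, and max of 'Number of Ratings' within each category."""
    return (
        df.groupby(category, observed=True)["Number of Ratings"]
          .agg(["count", "median", "max"])
    )

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Time Category": ["Quick", "Quick", "Short", "Short", "Short", "Long"],
    "Number of Ratings": [50000, 10000, 30000, 20000, 40000, 5000],
})
summary = ratings_by_category(toy, "Time Category")
print(summary)
```

Reading the count column next to the median makes it easier to spot categories that look popular only because they contain very few games.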

Popularity by Age of Game

How does year published impact popularity? In looking at these top 1000 games, it’s interesting to take a look at how when the game was published impacts the game’s popularity (as measured by number of ratings).

The most popular games were published between 5 and 20 years ago (with one of the most popular being almost 30 years old). On one hand, we shouldn’t expect to see any extremely recent games with a lot of ratings, since ratings accumulate over time. But part of this could also be attributed to who is giving ratings and the response bias this creates.

Conclusion

Unfortunately, this data does not contain all board games, just the top 1000 rated games. Because of this, we have a biased sample and can’t see the composition of lower-rated games. However, we can still make some tentative observations. These graphs indicate that most highly rated games are similar in basic features. They mainly seem to sit in a Goldilocks zone of sorts: not too long but not too short, not too many people but not too few, etc.

Take a look at the dashboard I built to explore the data further and see if there are any trends you notice. Send me an email with any further suggestions to explore!

Visit my repository for data and jupyter notebooks to create these graphs.

Board Game Bonanza: Data Collection and Wrangling

2023-12-12T00:00:00+00:00

Everyone loves a good board game. Board games bring people together, offer intellectual stimulation, and provide a form of entertainment that is not tied to a screen. But what makes a board game popular?

To explore this question, first we need to find data on popular board games. We’ll use Python to collect, explore, and clean this data.

Data Source

One great source of board game data is BoardGameGeek. This website is an online database and community for those who love board games. Not only does it have information about hundreds and thousands of board and card games, but it also allows users to rate and sell board games on the website. Because of this, we can find data about each board game’s requirements, its popularity, and its value.

BoardGameGeek offers an API that allows the user to access data on a game-by-game basis. Before accessing the API, it’s best to check the Terms of Service. The good news is that we are able to use this data for non-commercial purposes. We just need to credit BoardGameGeek by including its logo in public-facing uses of the API.

So all credit for this data is given to BoardGameGeek. Here is their logo:

Accessing the API

BoardGameGeek has an explanation of their API.

To access the data for each board game, we need the base URL, the endpoint, and the parameters (the id of the game, plus flags for whether to show stats and marketplace data).

baseurl = "https://boardgamegeek.com/xmlapi2"
endpoint = "/thing?id="
parameter1 = str(ids[0])
parameter2 = "&stats=1&marketplace=1"
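
One way to keep these pieces together is a small helper that assembles the full request URL. This is only a sketch: `build_thing_url` is a hypothetical name, and the game id in the example is arbitrary.

```python
BASE_URL = "https://boardgamegeek.com/xmlapi2"
ENDPOINT = "/thing?id="
OPTIONS = "&stats=1&marketplace=1"  # ask for rating stats and marketplace listings

def build_thing_url(game_id) -> str:
    """Assemble the request URL for a single game id."""
    return BASE_URL + ENDPOINT + str(game_id) + OPTIONS

print(build_thing_url(13))
```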

Each API pull will result in an XML file, that looks a little like this:
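
As a rough stand-in, the fragment below is hand-written rather than real API output. It only assumes the element names and `value` attributes that the parsing code later in this post reads (`name`, `yearpublished`, `minplayers`, `usersrated`, and so on), and the values are made up. The sketch uses the standard library’s `xml.etree.ElementTree` instead of Beautiful Soup so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hand-written, made-up fragment (NOT real API output); the structure is assumed
# from the tags that the scraping loop reads.
SAMPLE = """
<items>
  <item type="boardgame" id="1">
    <name type="primary" value="Example Game"/>
    <yearpublished value="2015"/>
    <minplayers value="2"/>
    <maxplayers value="4"/>
    <statistics>
      <ratings>
        <usersrated value="1234"/>
        <average value="7.5"/>
      </ratings>
    </statistics>
  </item>
</items>
"""

def first_value(root: ET.Element, tag: str):
    """Return the 'value' attribute of the first descendant named tag, or None."""
    node = root.find(f".//{tag}")
    return None if node is None else node.get("value")

root = ET.fromstring(SAMPLE.strip())
tags = ["name", "yearpublished", "minplayers", "maxplayers", "usersrated"]
record = {tag: first_value(root, tag) for tag in tags}
print(record)
```

Every value comes back as a string, which is why the cleaning step converts the columns to numbers afterwards.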

Prepping the API pull

Because each API pull returns the data for only one board game, we need to set up a process to pull data for all the board games we want. To do this, we need the IDs for each board game we want to pull. Luckily, on the API explanation page, BGG provides a csv file with data on every board game. Unfortunately, this does not include much of the data we want, but we can use this csv for the IDs of the board games to pull.

To download the .csv file, you need to make an account with BGG. Once you have an account, you can download this file from the API explanation page.

Import the following packages:

import pandas as pd
import numpy as np
import requests
import re
import urllib.parse
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
import time

Next, read in the board games .csv file and select the IDs you want to use. I used the first 1000 IDs because it is a time consuming process:

## Get Board game IDs from csv file provided by boardgamegeek.com
ids = pd.read_csv("boardgames_ranks.csv")
#Select top 1000
ids = ids["id"]
ids = ids[0:1000]

Then set up portions of the API url using the code referenced above:

baseurl = "https://boardgamegeek.com/xmlapi2"
endpoint = "/thing?id="
parameter1 = str(ids[0])
parameter2 = "&stats=1&marketplace=1"

Initialize the data fields

Then initialize the data fields we want. These include the board game name, the year it was released, minimum number of players, maximum number of players, the estimated playing time, the minimum age, the number of accessories, the number of users who gave a rating, the average rating, the Bayes rating (which is the average rating with 30 average ratings tacked on to prevent new board games with high ratings from taking over rankings), the standard deviation of the ratings, and the average USD price of games selling in the marketplace as of November 12.

name = []
yearpublished = []
minplayers = []
maxplayers = []
playingtime = []
minage = []
accessory_num = []
users_rated = []
averagescore = []
bayesaveragescore = []
stddev = []
avg_USD_price = []
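
The Bayes rating described above can be sketched as a weighted average. This is an illustration of the idea, not BoardGameGeek’s actual formula: the prior value of 5.5 is an assumption, `bayes_average` is a hypothetical helper, and only the count of 30 added ratings comes from the description in this post.

```python
def bayes_average(ratings, prior_mean=5.5, prior_count=30):
    """Average of the real ratings plus prior_count dummy ratings at prior_mean.

    prior_mean=5.5 is an assumed value chosen for illustration.
    """
    total = sum(ratings) + prior_mean * prior_count
    return total / (len(ratings) + prior_count)

# A new game with five perfect scores is pulled strongly toward the prior...
print(bayes_average([10] * 5))
# ...while a game with thousands of ratings barely moves.
print(bayes_average([8] * 5000))
```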

Pulling and scraping the data

Finally, we need to loop through our list of IDs to pull the XML file for each board game, and pull the data we want from the XML file. A for loop is not the most time-efficient approach, but it was the best way I could think of to call the API for each ID. Notice that there is a request delay of 10 seconds so requests don’t get throttled. For the most part, this code uses Beautiful Soup to pull the data. The USD price is calculated by summing all the listings priced in US dollars and then dividing by the number of listings. Warning: This takes a long time to run. I only pulled the top 1000 rated games because of the time it took to pull this data.

for x in range(0,len(ids)):
    parameter1 = str(ids[x])
    time.sleep(10) #so that requests don't get throttled
    url = baseurl + endpoint + parameter1 + parameter2
    r = requests.get(url)
    if r.status_code == 429: #to prevent going on with the code when request is denied for too much traffic
        time.sleep(10)
        r = requests.get(url)
    if r.status_code == 429:
        time.sleep(10)
        r = requests.get(url)
    soup = BeautifulSoup(r.content, "xml")
    name.append(soup.find("name")["value"])
    yearpublished.append(soup.find("yearpublished")["value"])
    minplayers.append(soup.find("minplayers")["value"])
    maxplayers.append(soup.find("maxplayers")["value"])
    playingtime.append(soup.find("playingtime")["value"])
    minage.append(soup.find("minage")["value"])
    accessory_num.append(len(soup.find_all("link", type="boardgameaccessory")))
    users_rated.append(soup.find("usersrated")["value"])
    averagescore.append(soup.find("average")["value"])
    stddev.append(soup.find("stddev")["value"])
    bayesaveragescore.append(soup.find("bayesaverage")["value"])
    usd_prices = [float(item["value"]) for item in soup.find_all("price", currency = "USD")]
    if len(usd_prices) == 0:
        avg_USD_price.append(None)
    else:
        avg_USD_price.append(sum(usd_prices)/len(usd_prices))
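
The two repeated 429 checks in the loop can be folded into one helper. This is a sketch under assumptions: `fetch_with_retry`, `max_retries`, and `wait_seconds` are hypothetical names, and the `fetch` argument is any function that returns a response object with a `status_code` attribute (for example `requests.get`), which keeps the helper testable without a network connection.

```python
import time

def fetch_with_retry(fetch, url, max_retries=2, wait_seconds=10, sleep=time.sleep):
    """Call fetch(url); while the response is a 429, wait and try again.

    Returns the last response, which may still be a 429 if every retry failed.
    """
    response = fetch(url)
    for _ in range(max_retries):
        if response.status_code != 429:
            break
        sleep(wait_seconds)
        response = fetch(url)
    return response
```

In the loop, `r = fetch_with_retry(requests.get, url)` would stand in for the request and both `if` blocks. Checking `r.status_code` before parsing would then make a request that never succeeded visible, instead of surfacing as an error in the parsing lines.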

Finally, we can create a data frame from all these lists.

boardGames = pd.DataFrame({"Title": name, "Year Published": yearpublished, "Min Players": minplayers, "Max Players": maxplayers, "Playing Time": playingtime, "Age Minimum": minage, "Number of Accessories": accessory_num, "Number of Ratings": users_rated, "Average Rating": averagescore, "Bayes Rating": bayesaveragescore, "Standard Deviation": stddev, "Average USD Price": avg_USD_price})
boardGames.to_csv("boardgamesdata.csv")

Data Cleaning and Engineering

The data now needs to be cleaned and engineered.

Create the new variable of age of game (by subtracting the year published from 2023, the year the data was pulled), change the playing time variable to be an integer, and set the other numeric columns to be numeric.

boardGames["Age (Years)"] = 2023 - boardGames["Year Published"].astype(int)
boardGames["Playing Time"] = boardGames["Playing Time"].astype(int)
num_cols = ["Year Published", "Min Players", "Max Players", "Age Minimum", "Number of Accessories", "Number of Ratings", "Average Rating", "Bayes Rating", "Standard Deviation", "Average USD Price"]
boardGames[num_cols] = boardGames[num_cols].apply(pd.to_numeric)

Create categories for estimated playing time, recommended minimum age, and maximum playing group. We can do this with the pandas cut function, which assigns numeric values to labelled bins that you define.

boardGames["Time Category"] = pd.cut(boardGames["Playing Time"], bins=[0,31,61,91,181,301,1501], labels = ["Quick", "Short", "Moderate", "Long", "Very Long", "Marathon"])
boardGames["AgeRating"] = pd.cut(boardGames["Age Minimum"], bins=[0,5,8,12,16,24], labels = ["Any", "Young", "PreTeen", "Teen", "Adult"])
boardGames["GroupSize"] = pd.cut(boardGames["Max Players"], bins=[0, 1, 4, 8, 101], labels = ["Individual", "Small", "Large", "Massive"])
boardGames.to_csv("boardgamesdata.csv", index = False)
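
Because `pd.cut` treats each bin as open on the left and closed on the right by default, the edges above decide which label a boundary value receives. Here is a small sketch with made-up playing times (in minutes, which is an assumption about the units):

```python
import pandas as pd

bins = [0, 31, 61, 91, 181, 301, 1501]
labels = ["Quick", "Short", "Moderate", "Long", "Very Long", "Marathon"]

# Made-up playing times chosen to sit on or near the bin edges.
times = pd.Series([0, 30, 31, 60, 90, 120, 1200])
categories = pd.cut(times, bins=bins, labels=labels)
print(categories.tolist())
```

A playing time of exactly 0 falls outside the first bin and comes back as missing. Passing `include_lowest=True` would place it in “Quick” instead.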

The first five rows of the resulting data frame should look something like this:

Congratulations! We now have data for the top 1000 board games that we can now explore. In the next article, we’ll explore this data to find what features these popular board games share.

Visit my repository for the data and jupyter notebook to replicate this process.

Predictions in Sports using Poisson Distributions

2023-10-12T00:00:00+00:00

When the improbable victory occurs, it feels magical. When the unlikely defeat happens, it feels tragic. The randomness of games and sports is part of what makes them enjoyable. Understanding this randomness and getting a sense as to how likely or unlikely something is to occur can help one better appreciate the feat one’s witnessed. Using simple probability distributions, we can quickly determine the probability of a sporting event outcome. In this article, we will discuss the use of the Poisson distribution to determine probabilities in sports like basketball, baseball, soccer, and football.

What is the Poisson distribution?

The Poisson distribution is a statistical distribution that uses the average number of occurrences in a given amount of time or space to give the probability of an event occurring a certain number of times in that same amount of time or space. In the context of sports, the occurrence can be points scored per minute, runs scored per inning, etc. The Poisson distribution lets us find the probability that a team will score a given number of points/runs/goals in the time remaining. Then, by comparing this to the distribution of the other team, we can quickly find the probability of one team outscoring the other by the end of the game.
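
Concretely, if events happen at an average rate of mu per interval, the probability of seeing exactly k events is mu^k * e^(-mu) / k!. Here is a minimal sketch of that formula in plain Python. `poisson_pmf` is a hypothetical helper written for illustration, and the rate of 2 is arbitrary.

```python
from math import exp, factorial

def poisson_pmf(k: int, mu: float) -> float:
    """Probability of exactly k events when the average is mu."""
    return mu ** k * exp(-mu) / factorial(k)

mu = 2.0  # arbitrary average number of events per interval
probabilities = [poisson_pmf(k, mu) for k in range(0, 50)]
print(sum(probabilities))  # very close to 1, as a probability distribution should be
```

`scipy.stats.poisson.pmf(k, mu)` computes the same quantity.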

Putting it into Practice

Probability for 1 Team: 1988 Kirk Gibson Home Run

For this example, we’ll look at baseball. In baseball, the unit of time is an out. Three outs make up a team’s turn at bat in an inning, and once a team completes 9 innings, it no longer has an opportunity to score runs. This is an imperfect unit of time since the length of time between outs can differ, but for a simple example it works. Because this is a simple model, we’ll ignore effects like opponent and pitcher, and assume independence between each occurrence.

On October 15, 1988, the Los Angeles Dodgers were playing the Oakland Athletics in the World Series. The Dodgers were down 4-3 in the final inning. This was the Dodgers’ final opportunity to win the first game of the Series. What were their odds of winning the game? We can use Python to find out. First, we need to import the correct packages. In this case, we’ll use poisson from scipy.stats.

from scipy.stats import poisson

Second, we need to find the average number of runs the Dodgers scored in 1988. According to Baseball Reference, the Dodgers scored 3.88 runs per game (or 27 outs) in 1988. This means for every out, the Dodgers scored on average .144 runs. With three outs left, they were expected to score .431 runs. poisson.cdf requires the parameters k (the outcome we are finding the probability for) and mu (the average outcome in that amount of time). It then returns the probability of getting k or less. The Dodgers needed at least 2 runs to win, so we want the probability of scoring more than 1 run: k is 1 and mu is .431. Because poisson.cdf returns the probability of scoring 1 run or fewer, we subtract it from 1 to get the probability of scoring 2 or more.

1 - poisson.cdf(k=1, mu=.431)

In this case, the probability comes out to be about 7.0%. What actually happened was just as incredible. The Dodgers spent two outs and got one runner on base. They turned to their injured star, Kirk Gibson, to attempt to win the game. Literally hobbling up to bat against the Athletics’ best pitcher, Kirk Gibson hit a home run and won the game for the Dodgers, who went on to win the World Series. Watching Kirk Gibson’s at bat unfold, the 7% chance of the Dodgers winning feels like magic.
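
The arithmetic behind mu, and how strongly the answer depends on the threshold k, can be checked without scipy. Here is a sketch in plain Python, using only the 3.88 runs per game figure quoted above. `prob_more_than` is a hypothetical helper equal to 1 minus the CDF.

```python
from math import exp, factorial

RUNS_PER_GAME = 3.88   # 1988 Dodgers, as quoted above
OUTS_PER_GAME = 27
runs_per_out = RUNS_PER_GAME / OUTS_PER_GAME   # about 0.144
mu = 3 * runs_per_out                          # expected runs in three outs, about 0.431

def prob_more_than(k: int, mu: float) -> float:
    """P(X > k) for a Poisson variable with mean mu, i.e. 1 - CDF(k)."""
    cdf = sum(mu ** i * exp(-mu) / factorial(i) for i in range(k + 1))
    return 1 - cdf

for k in range(0, 4):
    print(f"P(more than {k} runs) = {prob_more_than(k, mu):.4f}")
```

Each step up in k shrinks the probability sharply, so it matters whether the question is “more than k runs” or “at least k runs”.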

Probability for 2 Teams: 1996 Utah Jazz Comeback

The Dodgers example only involves one team; how do we approach comparing two teams? To solve this, we’ll look at the greatest comeback in the National Basketball Association. The Utah Jazz were down 36 points halfway through a game in November of 1996. If you were a fan (or even alive) at this time, you probably knew it was extremely unlikely the Jazz would come back to win the game. But how unlikely?

Using Basketball Reference’s 1996-1997 team data, we find that the Utah Jazz averaged 103.1 points per game that season and the Denver Nuggets averaged 97.8 points per game. For simplicity’s sake, we’ll ignore team defense (like most NBA teams ;) ).

First, we’ll generate a range of values for the Nuggets that captures all values they could reasonably score in half of a game. We’ll do the same for the Jazz, except with a minimum and maximum 36 points higher than the Nuggets’.

#Utah Jazz averaged 103.1 points per game
#Denver Nuggets averaged 97.8 points per game

#Generate a list of points from 0 to 100 for the Nuggets and a list of each of
#those numbers plus 36 for the Jazz (the totals the Jazz must exceed to win).
point_values_Jazz = list(range(0+36,101+36))
point_values_Nuggets = list(range(0, 101))

We’ll use these ranges of values and poisson.pmf to generate the probability the Nuggets will score each value in their range. We’ll use 1 - poisson.cdf to find the probability the Utah Jazz will score more than each value in their range.

#Find the probability of the Nuggets scoring each of those point values.
#Find the probability of the Jazz scoring more than each of their point values.
UtahJazz = 1 - poisson.cdf(k = point_values_Jazz, mu =103.1/2) #divide mu by 2 so it's average for the half
DenverNuggets = poisson.pmf(k = point_values_Nuggets, mu =97.8/2)

Because the Utah Jazz’s list of probabilities is for values 36 points higher than the Nuggets’, multiplying the two lists together element by element gives us the probability of the Nuggets scoring that number of points AND the Jazz outscoring them by more than 36 points (assuming independence for simplicity). By summing these probabilities, we find the probability of any of these outcomes occurring.

#Multiply the two probabilities together to find the probability of both occurring (assuming independence)
Deficitprobabilities = UtahJazz * DenverNuggets

#Sum up those probabilities and convert the total to a percentage
Deficitprobabilities.sum()*100

According to our simple model, there was a .03% chance of the Jazz winning. Talk about improbable!
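
The same calculation can be wrapped in a reusable function. This is a sketch in plain Python under the same assumptions as above (independent Poisson scoring, defense ignored). `comeback_probability` and its parameter names are hypothetical, and the sum over the leading team’s score is cut off at `max_points`.

```python
from math import exp, lgamma, log

def poisson_pmf(k: int, mu: float) -> float:
    """Poisson probability of exactly k, computed in log space to avoid overflow."""
    return exp(k * log(mu) - mu - lgamma(k + 1))

def comeback_probability(mu_trailing, mu_leading, deficit, max_points=300):
    """P(trailing team outscores the leader by more than `deficit` points)."""
    total = 0.0
    for lead_pts in range(max_points + 1):
        # The trailing team must score strictly more than lead_pts + deficit.
        needed = lead_pts + deficit
        cdf = sum(poisson_pmf(j, mu_trailing) for j in range(needed + 1))
        p_trailing = max(1 - cdf, 0.0)  # guard against tiny negative rounding error
        total += poisson_pmf(lead_pts, mu_leading) * p_trailing
    return total

# Second-half averages: season points per game divided by 2.
p = comeback_probability(mu_trailing=103.1 / 2, mu_leading=97.8 / 2, deficit=36)
print(f"{p * 100:.3f}%")
```

Setting `deficit=0` gives the probability that one team simply outscores the other over the same stretch, which is a quick way to see who is favored.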

Conclusion

To build more complex models, you can take into account other factors, such as the defense of the opposing team, scoring averages under similar conditions, etc. However, a simple Poisson model works well in a pinch! Next time you watch a sporting event, try a simple Poisson model to see who is favored to win, or to see the probability a team holds on to their lead throughout the game. Because this model applies to any sport where scoring events occur over an amount of time, it works for soccer, baseball, basketball, football, hockey, and more. As you put this into practice, you can quantify the highs and lows of following sports by capturing how improbable each outcome was!

First Post

2023-09-26T00:00:00+00:00

Hello World!

How to create a blog post

2022-08-01T00:00:00+00:00

Steps for creating a new post.

Create a new file in the _posts folder called YYYY-MM-DD-post-name.md, where YYYY is the year (2023), MM numeric month (01-12), and DD is the numeric day of the month (01-31). The post-name is a short name for the new post with - between words. You must use this name convention for all new posts.

Make the YML heading. All pages in the site need to start with a YML heading. For posts you should use the following header:

---
layout: post
title:  "Post Name"
author: Your name
description: Short yet informative description
image: /assets/images/blog-image.jpg
---

For this theme, the layout should stay as post. All the other fields should be updated with the information for your particular blog post. The blog image should be a .jpg or .png file that you should add to the folder assets/images. Don’t make it too large or the page will take longer to load (500-800 KB is a good size). Leave the file path as /assets/images/ in the header area.
Write the body of the blog using markdown. There are a lot of references for markdown available. I like the Markdown Guide because many of the examples show both the markdown and the html code. There are separate pages for basic syntax, extended syntax, and a cheatsheet for quick reference.
You can also use html code snippets along with the markdown. Often, using html will give you a little more control and flexibility as demonstrated below.
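
The naming convention from the first step can be sketched in Python. `post_filename` is a hypothetical helper for illustration; it is not part of Jekyll or the theme.

```python
import re
from datetime import date

def post_filename(title: str, day: date) -> str:
    """Build a YYYY-MM-DD-post-name.md filename from a title and a date."""
    # Lowercase the title and collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{day.isoformat()}-{slug}.md"

print(post_filename("Board Game Bonanza: EDA", date(2023, 12, 13)))
```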

Links

To create a link (internal or external), enclose the link text in brackets (e.g., [Statistics Department]) and then follow it immediately with the URL in parentheses (e.g., (https://statistics.byu.edu)).

For example:

My favorite department at BYU is the [Statistics Department](https://statistics.byu.edu).

My favorite department at BYU is the Statistics Department

If you want external links to open in a separate window, you will need to use html code with target="_blank" inside the a tag.

For example:

My favorite department at BYU is the <a href="https://statistics.byu.edu" target="_blank">Statistics Department</a>

My favorite department at BYU is the Statistics Department

Internal Links and Files

If you want to have a link that points to another location on your site or if you want to include a file (such as an image or video), you must use the site.url and site.baseurl variables when making the link reference. For example, this link pointing to the About page is coded as:

[About]({{site.url}}/{{site.baseurl}}/about)

Paths to files should also be referenced with the site.url and site.baseurl variables (see the section on Adding Images).

Adding Images

In the examples below, if your image ends with .png or .JPEG, use the appropriate extension instead of .jpg.

Images for the blog will generally be put into the assets/images folder. (You can also create a subfolder for images, but you will need to include the subfolder name in the reference link.)

Markdown syntax for including images is ![Fig Name](path/to/image). For example:

![Figure]({{site.url}}/{{site.baseurl}}/assets/images/image_name.jpg)

Resizing images

The image I added in the previous section seems a bit large for this post. Unfortunately, there isn’t a good way to resize images with markdown, so if you need to resize an image, use html instead of markdown and specify the width in the style parameter as follows:

<img src="{{site.url}}/{{site.baseurl}}/assets/images/image_name.jpg" alt="" style="width:300px;"/>

(Example with width set to 300 pixels)

(Example with width set to 100 pixels)

Troubleshooting

Here are some things to keep in mind if your blog appearance isn’t going as you planned:

Problem: The blog post that I created isn’t appearing

Possible Solutions:

Check your date. GitHub Pages won’t display blog posts with future dates.
Check the YAML header. If there are any special characters in any of the fields, you need to use quotes around the entire field entry. The most common culprit is the description. If you’re having trouble, try putting quotes around the entire description.

Problem: I know that I made changes to a blog post but the changes aren’t appearing

Possible Solution:

Check the header. If there are any special characters in any of the fields, you need to use quotes around the entire field entry. The most common culprit is the description. If you’re having trouble, try putting quotes around the entire description.

Problem: My entire blog has weird formatting

Possible Solution:

Usually this is an address problem. Double check your url and baseurl in the _config file