Alex Boyce

Watching baseball better in 2026

2025-11-20T00:00:00+00:00

Even though the 2025 World Series ended just a few weeks ago, I’m already looking ahead to the 2026 season. I watch a lot of baseball. But I don’t have any particular team loyalty. So when I turn on mlb.tv, I often default to Big Inning (which is great), the Yankees (I interned with them for a season), the Dodgers (Shohei), or some random good pitching matchup. My mlb.tv wrapped covered every team.

For 2026 I thought I’d do something different. With 181 days of at least 4 games, I’m perfectly set up to spend 6 days at a time with each team. I can immerse myself in their home broadcasts and hopefully get to know their main characters a bit more.

To get my schedule I threw together a very loose optimizer in Python: 6 days per team, a game every day, minimize the standard deviation of the number of games for the “other” team. It’s pretty rough, but it got the job done.
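As a rough illustration of that objective (not the actual optimizer), the quantity being minimized can be scored like this. The function name `opponent_spread`, the four placeholder teams, and both toy schedules are made up for the example.

```python
from collections import Counter
from statistics import pstdev


def opponent_spread(schedule, teams):
    """Population std dev of how often each team appears as the "other" team.

    schedule: list of (featured_team, opponent) pairs, one per day watched.
    teams: every team in the league, so teams never seen count as zero.
    """
    counts = Counter(opponent for _, opponent in schedule)
    return pstdev([counts[team] for team in teams])


# Hypothetical 4-team league, 3 days per featured team.
teams = ["A", "B", "C", "D"]
balanced = [("A", "B"), ("A", "C"), ("A", "D"),
            ("B", "A"), ("B", "C"), ("B", "D")]
lopsided = [("A", "B"), ("A", "B"), ("A", "B"),
            ("B", "A"), ("B", "A"), ("B", "A")]

# A lower spread means the opponents are spread more evenly.
print(opponent_spread(balanced, teams))
print(opponent_spread(lopsided, teams))
```

An optimizer would then search over which 6-day window goes to which team, keeping the assignment with the lowest spread.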

I get to watch the single Opening Night game (Yankees vs Giants), then a choice of 3 games on Monday 9/21 (or maybe a day off?), and complete freedom for Game 162 to close the regular season. Otherwise, my games are set.

The Schedule

I’ll watch the Pirates 12 times outside of their “feature” set, and the Marlins 11x. Probably not teams I would normally be tuning in for so that’s a plus.

I get 5 “repeat” matchups:

Nats-Pirates in April and July
Jays-Red Sox in April and August
Angels-A’s in June (with just a 5 day gap)
Braves-Mets in July
Mets-Marlins in August and September

Matchups that caught my eye:

Dodgers-Giants is my first big rivalry series
I follow the A’s across California, playing the Giants, Angels, and Dodgers.
Pirates-Phillies all-PA matchup (rip POOP)
Yankees-Red Sox in August
Finishing with the Marlins and White Sox in September will be interesting
Padres-Dodgers in the 2nd to last series may have some big playoff consequences

Who Will Win Season 25 of The Bachelor?

2021-01-01T00:00:00+00:00

Tomorrow night marks the premiere of the quarantine season of The Bachelor, and the first with an African American lead, Matt James. After the premiere, fans (including myself) will anxiously discuss who could and should win. But super fans have already been introduced to this year’s contestants through the official bios on the ABC website. When released, podcasts like Bachelor Party dissected them in incredible depth.

And it all made me wonder whether I could predict who will go far on the show before seeing any of the season. After all, these bios are probably written by people who have knowledge of the season’s outcome. While they won’t outright tell us who wins, they might be slightly guiding us to view some contestants more favorably than others.

The work

So that brings us to today, where I’ve analyzed the words used in bios of the previous 4 seasons (it’s increasingly challenging to find these bios the further back we go) and built a model to predict future contestant performance.

Of course many caveats apply:

In two seasons (Arie & Nick), bios were in the form of explicit Q&A. In more recent seasons (Colton & Peter), these questions likely still exist but are now used to write narrative bios with more flair.
Bios have more than doubled in length over the previous two seasons - until Peter’s season they were 400-500 words. With Peter and Matt, they are now over 1,000
As the first African American bachelor, Matt’s contestants are considerably more diverse than past seasons. This possibly means different word usage as pointed out on that Bachelor Party pod (for example, soul has been used in 5 bios this year vs just one in the previous 4 seasons)
This sample size is very small and leads to an overfit model! This was a bad idea!

All that said, here’s what I did:

Found all words that appeared at least 5 times in the previous seasons’ bios. For the sake of this exercise, different forms of a word, such as “travel”, “travels”, and “traveling”, are considered the same word.
For each word, calculate the average week reached (10 weeks per season) by the contestants with that word in their bio.
For each contestant, average the scores of each word they have in their bio.
There is some additional info we have on the women: age and job title. Those are incorporated into regression models along with the biography score calculated in the previous step. (While we also know the race of contestants, and that this franchise has a spotty history when it comes to race, I did not include that as a variable here).
Using this framework, predict elimination week and whether a contestant will reach hometowns (the final 4 contestants)
Rescale scores of both models, average them together, rank, pick the final 4 (…, profit?)
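The word-scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the code behind the post: the suffix-stripping `stem` function is a crude stand-in for real stemming, the example counts a word once per bio rather than once per occurrence, the toy bios are invented, and `min_count` is lowered to 2 so the tiny sample produces scores.

```python
from collections import defaultdict
from statistics import mean


def stem(word):
    """Crude stand-in for real stemming: strip a couple of common suffixes."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(bio):
    """Lowercase the bio and return the set of stemmed alphabetic words."""
    words = (w.strip(".,!?\"'()").lower() for w in bio.split())
    return {stem(w) for w in words if w.isalpha()}


def word_scores(past, min_count=5):
    """Average week reached by past contestants whose bio uses each word."""
    weeks_by_word = defaultdict(list)
    for bio, week in past:
        for word in tokenize(bio):
            weeks_by_word[word].append(week)
    return {w: mean(ws) for w, ws in weeks_by_word.items() if len(ws) >= min_count}


def bio_score(bio, scores):
    """Average the scores of the known words in a new contestant's bio."""
    known = [scores[w] for w in tokenize(bio) if w in scores]
    return mean(known) if known else None


# Invented (bio, week reached) pairs for past contestants.
past = [
    ("She loves traveling and her sister", 9),
    ("Travels often with her sister", 7),
    ("Her brother loves hiking", 2),
    ("Hiking with her brother", 1),
]
scores = word_scores(past, min_count=2)
print(bio_score("Traveling with her sister", scores))
print(bio_score("Hiking with her brother", scores))
```

The resulting bio score is what would then be fed into the regression models alongside age and job title.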

The Results

For past seasons, this model does ok! It picks about 2 of the 4 correctly for Arie, Nick, and Colton, including the winners (and 2 runners-up).

But when the bios got longer in Peter’s season, things got really good. This predicts 3 of the final 4, with the “villain” of the season as the only miss. Even better, the mis-pick in the top 4, Kelley, is the one Peter eventually ended up dating months after the season ended. So this gives me hope for Matt James’ predictions!

There are some pretty bad misses though. The model really liked Liz’s bio on Nick’s season, where she talked about family and her passion for her job. And then womp she flopped in week 2. More egregious was Tracy on Colton’s season, who talked about her career, sister, AND travel. The trifecta.

Before we get to the main event, let’s take a quick diversion to the top and bottom words:

From this, it seems like women who are described as career-oriented, intelligent, loyal, like to travel, and have sisters have gone far in recent seasons. On the other side, those whose profiles reference more hobbies, talk about marriage, or have brothers bow out early. Of course we shouldn’t assume that these words reflect reality either. Or maybe it’s just that women with brothers scare the bachelors away because they’re afraid of getting beat up.

Finally, my predictions for Matt’s final 4. Spoiler alert(?)

After reviewing the actual bios of everyone, these feel like good picks! But Matt James is a complete unknown, having never been on any Bachelor shows, so it’s hard to say who he’ll go for. If I had to pick a winner from this group, I’d go with Khaylah - Magi is a bit too old (very rare for Bachelors to pick older women), Serena C is a bit too young, and Khaylah’s job in healthcare advocacy feels like a better fit with Matt’s work starting a non-profit food organization than Chelsea’s modeling career.

Is any of this right? Was it just a colossal waste of time? Probably yes to both! I’ll try to remember to post an update here when the season ends in March.

Analyzing Board Game Data: Background & Association Rules

2018-11-01T00:00:00+00:00

Now that the fall quarter is over and I have some free time, I’ve decided to dive in on some data from boardgamegeek.com. Board Game Geek is a site for all things board games (shock!) and is generally regarded as a great resource for reviews and rankings of games. I found a dataset from kaggle.com, which is basically an export of all games and data associated with them.

Board games are fun, but they are also a $10 billion industry. There is a large market here for future game design and sales, which makes it an interesting dataset to examine.

Initially, this dataset covered 90k games. But I want to look at games with significant volume, so I excluded anything with under 200 user ratings. This was a massive portion of the titles and I was left with 6.1k games. After removing games released prior to 1970 (including backgammon and marbles, both “released” in 3000 BC), games that are actually expansions or other versions of another game, and card games I was down to about 3.1k titles.

There are two main variables of interest throughout that measure “quality”: Average Score and Number of Ratings. Ratings are highly skewed, with a few games getting a ton of users rating them, but the vast majority are down under 1000 or so. Input variables will be mentioned throughout the models, so I won’t bore you by talking about all of them here.

Association Rules

My first models will be association rules - think of this as Amazon’s “frequently bought together” recommendations. They are “if…then…” type statements, where the “if” is known as the antecedent and the “then” is the consequent. Lift indicates how much more likely the consequent is to happen, given the antecedent.

The easiest way to understand is by example: If a game is high weight (difficulty level) and recommended for one player (antecedents), then it is 2.28 times more likely (lift) to have a score of at least 7 (consequent)
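For reference, lift is the probability of seeing the antecedent and consequent together divided by what that probability would be if the two were independent. Here is a minimal sketch of the calculation on invented data. The tag names mirror the rule above, but the ten toy games and the resulting number are made up, so it will not reproduce the 2.28 figure.

```python
def lift(rows, antecedent, consequent):
    """Lift = P(antecedent and consequent) / (P(antecedent) * P(consequent)).

    rows: list of sets of tags, one set per game.
    Assumes both the antecedent and the consequent occur at least once.
    """
    n = len(rows)
    n_a = sum(1 for r in rows if antecedent <= r)
    n_c = sum(1 for r in rows if consequent <= r)
    n_both = sum(1 for r in rows if antecedent <= r and consequent <= r)
    # Same ratio as above, written with counts to avoid rounding noise.
    return (n_both * n) / (n_a * n_c)


# Ten invented games described by tags.
games = (
    [{"HighWeight", "SoloRec", "Score7plus"}] * 2
    + [{"HighWeight", "SoloRec"}]
    + [{"Score7plus"}] * 2
    + [{"LowWeight"}] * 5
)
rule_lift = lift(games, {"HighWeight", "SoloRec"}, {"Score7plus"})
print(round(rule_lift, 2))
```

A lift above 1 means the consequent is more common among games matching the antecedent than in the dataset as a whole; below 1 means it is less common.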

From these top rules with a consequent of Score7plus, it looks like high difficulty, newer, older ages, and recommendations for single players are indicators of higher scores.

The list above was filtered just to look at Score7plus with high lift, but it could also be useful to look at rules with the lowest lift. In the example of the second rule, games with a hand management mechanic (this includes games like Catan and Ticket to Ride) are less likely (lift below 1) to have been released in the 2010s. This could point to a shift in game design in recent years.

Two other rules that look interesting are that one-time designers are less likely to have their game score above 7 and are less likely to create high difficulty games. This could just be a matter of perception, but it does suggest an uphill battle for first-time designers.

Finally, let’s look at rules with high lift and both Score7plus and HighNumRatings as consequents. Games that are newer, for an older audience, and complex (in some combination) are more likely to have a high score from a lot of ratings.

So great, I ran some algorithm and it told me a few things about the data. Now what? For this dataset, not a ton. This isn’t customer level data. I can’t tell a game store which games to put next to each other or tell Amazon which to recommend once you’ve bought Ticket to Ride.

However, it does give a better sense of what characteristics drive ratings volume and ratings score, which will help when getting on to the prediction phase.

If I were talking to a game designer and had done no other work, I’d tell them that if they want a hugely popular and well-regarded game, they should create something that is complex and for an older crowd.

The limitations are important here though - this is about probabilities; not every new game with a minimum player age above 11 and high weight will be a winner. These rules are helpful in understanding relationships between variables, but it’s the predictive models that will be the real drivers of decision making.

Stay tuned next week for some more data!

Has ‘Every Kid in a Park’ been successful?

2018-09-01T00:00:00+00:00

President Obama started the “Every Kid in a Park” program on 9/1/2015. This program allows all 4th graders and their families to get into National Parks (and Monuments, Forests, etc) for free. After a recent trip to Wind Cave National Park in South Dakota where we saw several families get tours for free under this program, Amanda suggested I find some data to learn about the impact of the program. I don’t expect to see any attendance growth from this program until summer 2016 when kids are out of school, since many do not live close enough to parks to visit outside of breaks and 2/3 of NP visits happen between May & September.

To measure the impact, I looked at “recreation visits” from 1979-2017 from the NPS site here. Unfortunately, this appears to be the most granular form of visit reported (nothing on annual passes, senior passes, 4th graders, etc), so my analysis is fairly high-level and directional.

Spoiler alert: I didn’t find anything, but still found some cool things in the data.

Annual visits

While annual visits did increase by 9% from 2015-2016, this appears to be part of a trend that started a few years earlier. In fact, 2014 increased +7% vs 2013, the largest YoY percentage increase in this dataset (since 1979-80) at that point in time. 2014-16 have been the largest single YoY growth years in this timeframe and the first time that there have been three consecutive years of growth since the mid-1990s. The annual data does not provide any information that suggests the program contributed to recent visit growth.

Monthly

Drilling down to the monthly level, all months in 2015 and 2016 had more visits vs the same month in the respective previous year. This supports the idea of a longer-term trend not attributable to this program. The longer trend dating back to 2011 shows that visits in May and August have increased every year – these are months that I would expect to see increased 4th grader visits under this program.

Parks/Regions

Finally, I looked at the park and region level with the thought that some parks may benefit from this program more than others. For this analysis I calculated the average of the 2014-13 and 2015-14 annual changes and the 2016-15 and 2017-16 annual changes and took the difference. A positive difference indicates higher average growth in the post-EKIAP years than in the two years preceding the program. Across all parks, the pre-growth averaged 8% vs 5% post, for a net of -3%. Half of the 60 National Parks have a net growth above the aggregate.
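The net growth calculation can be sketched as below. This is a minimal illustration of the arithmetic described above; the function names and the visit counts for the example park are made up.

```python
def yoy_change(visits, year):
    """Fractional change in visits versus the previous year."""
    return visits[year] / visits[year - 1] - 1


def net_growth(visits, program_year=2015):
    """Average growth in the two post-program years minus the two pre-program years."""
    pre = (yoy_change(visits, program_year - 1) + yoy_change(visits, program_year)) / 2
    post = (yoy_change(visits, program_year + 1) + yoy_change(visits, program_year + 2)) / 2
    return post - pre


# Invented annual recreation visits for one hypothetical park.
example_park = {2013: 1000, 2014: 1100, 2015: 1210, 2016: 1331, 2017: 1331}
print(round(net_growth(example_park), 3))
```

With these numbers the park grew 10% in each pre-program year and an average of 5% afterward, so its net is negative even though attendance never fell.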

Parks in the Pacific West and Midwest regions over-index as above average net growth parks, suggesting that they may be seeing greater increases in attendance via this program. On the flip side, the Intermountain and Southeast regions under-index and may not be attracting as many 4th grade families as they could.

Below are the top- and bottom-five parks for net change in attendance in recent years. It’s interesting that three of the bottom-five parks are in the Pacific West region, despite this region being the most over-performing in the above average group.

One other interesting note at the park level: the top-six parks by attendance in 2017 (Great Smoky Mountains, Grand Canyon, Zion, Rocky Mountain, Yosemite, and Yellowstone) all saw net change at or below the overall NP level.

Conclusion

There is no clear evidence that the EKIAP has itself led to more National Park visits overall, though there are some signs that West Coast and Midwest parks have benefitted. Obviously, there are plenty of other possible reasons for this growth; the important thing for the NPS is that visits are still at a record high in 2017 and continue to see strong growth.

Next Steps

Next steps for investigating this question would be accessing any data specific to park attendance via this program. While I was able to find a few statistics on number of passes downloaded, I couldn’t find any specific park visit data for 4th graders and their families. Hopefully this is being collected somewhere by the NPS to assess effectiveness of the program.

Another step could be to bring in additional data that could explain yearly and monthly fluctuations in attendance. A broader analysis could help strip out other factors that have driven up NP attendance and provide a (slightly) clearer picture of the impact of this program.

Analyzing images in Python

2018-08-22T00:00:00+00:00

As I’ve continued to learn about Python, I’ve been really interested in expanding the scope of what I do beyond just traditional data analysis. One thing that interested me was what I could do with images. After a lot of searching around I found some really helpful blog resources. That last link is where I grabbed the base underlying code for my next project.

Here is the final result upfront. I took a logo for each MLB team and clustered them into 3 colors. Can you guess which is which? The answers are at the bottom of the post.

I used a method called K-means clustering. Essentially, this loops through the pixels of the image (or data points in any kind of data set - this is definitely not an image-only methodology) and groups them into n groups based on their similarity to each other.
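To make the idea concrete, here is a bare-bones K-means written in plain Python. It is only a sketch and not the code used for the logos: pixels are hard-coded (r, g, b) tuples rather than loaded from an image, and the starting centers are chosen with a simple farthest-point rule so the result is repeatable. In practice a library implementation such as scikit-learn’s `KMeans` would be the usual choice.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two colors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(pixels, k=3, iterations=10):
    """Group (r, g, b) tuples into k clusters; returns (centers, clusters)."""
    # Deterministic start: first pixel, then repeatedly the pixel farthest
    # from the centers chosen so far (an assumption made for this sketch).
    centers = [pixels[0]]
    while len(centers) < k:
        centers.append(max(pixels, key=lambda p: min(sq_dist(p, c) for c in centers)))

    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean color of its members.
        centers = [
            tuple(sum(ch) / len(members) for ch in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


# Invented "logo": mostly reds, some blues, a little white.
pixels = (
    [(250, 0, 0)] * 5 + [(255, 10, 5)] * 5
    + [(0, 0, 250)] * 4 + [(10, 5, 255)] * 4
    + [(255, 255, 255)] * 2 + [(250, 250, 250)] * 2
)
centers, clusters = kmeans(pixels, k=3)
for center, members in zip(centers, clusters):
    print(center, len(members))
```

Each center is the average color of its group, and the group sizes give the share of the logo that each color takes up.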

The heavy lifting of the first two functions was done for me:

Now I just needed to make things look pretty. I wrote a function that took an original logo in RGBA format so that I could strip out the transparent background. This prevented extra white pixels from appearing - I just want to see what the logos’ colors look like. From there this logo is converted back to RGB and run through the clustering function. Lastly, I displayed the original logos and the new clusters side by side, to make sure that nothing weird had happened and to see which images appeared to have more clusters.

The downside to my setup is that each image has the same number of clusters. If you look at the Marlins and Pirates logos below, you’ll see that their clusters clearly don’t line up to their logos - they have more than 3 (and in the Marlins’ case at least 5) colors that really define their logos. My next steps would be to create the option to have different numbers of clusters for different images.

How’d you do?

Position Players Pitching

2018-08-01T00:00:00+00:00

Baseball people won’t shut up about position players pitching. The opinions range from “this is fun and weird” to “please just stop.” The light bulb for me was when multiple games had position players throwing multiple innings. I’ve seen these articles and the graphs within, but I haven’t actually found an official source that tracks this stat so I made my own.

Using Fangraphs’ defensive leaderboards, I was able to pull all defensive positioning data going back to 1961 (a mostly arbitrary cutoff). From there I just narrowed the list of everyone down to those who had pitching and non-pitching records, then removed the pitchers who had played the field (maybe this is phase 2 since I just saw the Rays’ closer play third in a 1-run game yesterday), and voila!
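That filter can be sketched roughly like this. The post does not say how pitchers who played the field were identified, so this example assumes a simple rule: a player counts as a position player if they logged more innings in the field than on the mound that season. The record layout, player names, and innings are all invented.

```python
from collections import defaultdict


def position_players_pitching(records):
    """Count players per season with pitching innings but mostly fielding innings.

    records: (player, season, position, innings) tuples, one per position played.
    """
    innings = defaultdict(lambda: {"P": 0.0, "field": 0.0})
    for player, season, position, inn in records:
        key = "P" if position == "P" else "field"
        innings[(player, season)][key] += inn

    counts = defaultdict(int)
    for (player, season), inn in innings.items():
        # Assumed rule: pitched at least once, but spent more time in the field.
        if inn["P"] > 0 and inn["field"] > inn["P"]:
            counts[season] += 1
    return dict(counts)


# Invented records for illustration.
records = [
    ("Catcher A", 2018, "C", 700.0), ("Catcher A", 2018, "P", 1.0),
    ("Infielder B", 2018, "3B", 400.0), ("Infielder B", 2018, "SS", 150.0),
    ("Infielder B", 2018, "P", 2.0),
    ("Reliever C", 2018, "P", 60.0), ("Reliever C", 2018, "3B", 1.0),
    ("Outfielder D", 2017, "LF", 900.0),
    ("Utility E", 2017, "2B", 300.0), ("Utility E", 2017, "P", 1.0),
]
counts_by_season = position_players_pitching(records)
print(counts_by_season)
```

Under this rule the reliever who briefly played third is excluded, while the catcher and infielder who each took the mound are counted.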

A mostly flat line, consistently under 15, and then 2014 hits and it starts climbing. 2015 broke past 20 and the purists were probably sweating it pretty hard, but then things leveled off (phew) until BAM we’re past 35 different position players throwing in 2018 with 40% of the season left.

As starters continue to throw fewer innings (avg 5.5 IP/start this season vs 6.1 in 1998) bullpens have been increasingly tapped to make up the difference. Whether this strategy is coming from the front office or the dugout is unclear, but there is definitely a greater willingness to preserve the full-time pitchers in games that are well out of hand. The only question left is this - where does it stop?

Analyzing the top 250 beers from BeerAdvocate.com

2018-07-20T00:00:00+00:00

In my Python class last quarter, our final project involved using BeautifulSoup to pull in data from a site and analyze it using methods we’d learned throughout the quarter. I chose to look at data on the top-250 rated beers on BeerAdvocate, which is probably the best-known beer review/rating site on the internet.

I started by scraping the HTML data, which involved a combination of easy table locations, text parsing, and character removal to give me the best data for analysis. (This code is at the bottom of the post because it’s less fun than pretty graphs)

Once I had clean data compiled, the fun could begin:

As a good analyst, I started with some descriptive statistics and correlations. Correlations are pretty weak among our numeric variables, except for rank and score which are obviously negatively correlated. But it was interesting to see that this correlation was not -1. There are apparently other factors that influence ranking besides just the score.

Then I built a couple of tools to get back information of interest from the data. You can find out the information of a beer at a given ranking, or return all beers from two states for comparison. I like these as easy ways to give an end user access to this data without having to dig for it on their own.

And what analysis would be complete without those graphs I mentioned earlier?

A bar graph by style is unsurprising for beer folks - the top styles are imperial IPAs and imperial stouts; together these two styles represent nearly half of the top-250!

Planning your next beer-cation? Head for the coasts - California and Massachusetts/Vermont

That Massachusetts and Vermont dominance is led by three breweries: Hill Farmstead in VT and Trillium and Tree House in MA. Each has over 15 beers in the top-250, while no other brewery even has 10. So while the Northeast might be a little top-heavy, California is more balanced. Excuse me while I book a flight.

Scraping Code Breakdown

Is the home run revolution still happening?

2018-07-01T00:00:00+00:00

All baseball has been talking about for the past few years has been the rise of home runs.

So I wanted to see for myself what the pace of home runs looks like this year. I found this code for working with BB Ref box scores and made some changes to better suit my needs.

I thought that working with box scores would give me the most flexibility in analyzing all kinds of trends throughout the year. For this project, I aggregated home runs for every game through the All-Star break.

Using Retrosheet’s game log files, I trended home runs through the same number of games in each season going back to 2010:

At this point in the season, it seems clear that we’re on the decline from Peak HRs™. This season is pacing right in line with 2016 (which is still a massive HR year in the scheme of things) and well below the 2017 season. Halfway through the season the home run total is down 9% vs last year.

Early in the year there were theories that bad weather in the Northeast and Midwest (exemplified by the large number of postponed games) was to blame for suppressing home runs. This season started earlier than any other in the past decade, so maybe we should drop those early months and look at a slightly more “fair” comparison from May through the All-Star Break.

The ASB fell more than a week later this year, so while the gap between this year and last is smaller in absolute terms, on a per game basis this year is still lagging.
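The two comparisons used here, cumulative home runs through the same number of games and the per-game rate, can be sketched as follows. The function names and the per-game home run counts are invented for the example.

```python
from itertools import accumulate


def pace_through(hr_by_game, n_games):
    """Running home run total after each of the first n_games games."""
    return list(accumulate(hr_by_game[:n_games]))


def per_game_change(hr_a, games_a, hr_b, games_b):
    """Fractional change in home runs per game, season A versus season B."""
    return (hr_a / games_a) / (hr_b / games_b) - 1


# Invented home run totals per game, in date order.
this_year = [2, 1, 3, 0, 2]
last_year = [3, 2, 2, 1, 3, 4]

# Compare both seasons through the same number of games.
n = min(len(this_year), len(last_year))
this_pace = pace_through(this_year, n)
last_pace = pace_through(last_year, n)
print(this_pace, last_pace)
print(round(per_game_change(this_pace[-1], n, last_pace[-1], n), 3))
```

Dividing by games played is what keeps the comparison fair when one season’s window, such as May through a later All-Star break, contains more games than the other.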

So weather is no longer enough to explain this difference. Are pitchers adapting? Is the ball changing again? Sounds like a topic for next time.

Code breakdown

Packages used:

To pull from each game individually, we need the URL to the box score and the list of tables to reference. Luckily, baseball reference has a consistent syntax for their URLs so I was able to build out the logic in Excel with limited effort.
Then we get to the actual pulling of the data. As I mentioned, most of this code was adapted from Ben Kite’s work on GitHub. Basically we are using BeautifulSoup to find the underlying HTML code, then locating the code for the two tables referenced in the dictionary. From there it’s just a matter of picking out the right columns. I opted for using the pitching tables because they provide an easy total of home runs. Home runs are not included in the hitting table itself but appear as notes below it, which requires more parsing to get the total. At the end I am basically just reformatting everything so that I get one row per game and I can see which team was charged with which runs/HRs.
I readily admit that this could be cleaned up quite a bit, but it gets the job done in a relatively short amount of time and gives me the ability to run analyses at different cuts of time, team, etc.