Is Grand Rapids ready to take the plunge?

A Novel Attempt at Market Research through Machine Learning

Michael Minzey - April 23, 2024

Email: minzey.michael@gmail.com

GitHub repo for this project

Chilling Out: The Cool Appeal of Cold Plunge Therapy

Cold plunge therapy, or taking a dip in super chilled water, has been making waves as a cool health trend. It isn’t just about bracing yourself for a shock of cold but reaping some real benefits: quicker muscle recovery, a mood boost from those happy endorphin kicks, improved dopamine regulation, better circulation, and even a stronger immune system.

It's not exactly new, but it's caught a fresh wave of attention thanks to big names like podcast star Joe Rogan and brain expert Dr. Andrew Huberman. They've shared how diving into cold water can really rev up your body and mind, helping to push this frosty therapy into the limelight and making it a “hot” topic in wellness circles.

Should we dive in?

Over the past two years, a couple of friends and I have become quite interested in the health benefits of cold plunging. We are considering a plan to start a business featuring cold plunge tubs and red light/infrared saunas, with a format very similar to tanning salons.

As with any business plan, market research is critical, and so I wanted to develop a method for modeling the potential success of a cold plunge business in Grand Rapids, Michigan. Ideally, this tool could be adapted to suit any business market, but for this project, the scope would be limited to cold plunge businesses.

To achieve this goal, I planned to acquire data on existing cold plunge businesses across the United States and develop a model based on the socioeconomic characteristics of the cities where those businesses exist. For example, a model like this would indicate a meager chance of a successful snowplow business in Phoenix, Arizona. 

Let the collection begin!

To develop a model predicting whether Grand Rapids is a good place to start a cold plunge business, I was going to need data. Specifically, I knew I would need a healthy list of all the cold plunge therapy businesses in the U.S., along with appropriate demographics for each of the cities where those businesses are located. I could then build a model using the cities’ metrics to predict whether a given city was likely to have one or more cold plunge businesses.

For business data, I elected to use the Google Places API, which houses much of the data that you get when using Google Maps to find a nearby restaurant. The Google Places API provides a ton of details for businesses, such as location, hours, and official websites. The last bit became crucial later on, which I will discuss in my first side quest.

For socioeconomic data, the U.S. Census Bureau is the gold standard. Not only does it provide a tremendous amount of data, its website caters to data collection and analysis, perfect for my needs. Having determined the data sources for my experiment, I set about harvesting what I needed.

Location, location, location

I started my data collection by focusing on the cities, since I would need socioeconomic data at the city level, which is pretty granular. After some digging, I came across the American Community Survey (ACS). To say that this would suffice for my needs would be a dramatic understatement: the ACS has almost 30,000 variables of socioeconomic data for American cities! Not only is there a wealth of data, it’s quite current; the data I used is from 2022. How fortunate for my modeling! While I am no sociologist, I spent some time combing through the list of dimensions and picked out the indicators I felt would be most relevant. I settled on the following:

  • B01001_001E: Total Population

  • B01001_026E: Female population (as percentage of total population)

  • B01002_001E: Median age

  • B17001_001E: Poverty status in the past 12 months (as percentage of total population)

  • B27001_001E: Health insurance coverage status for all people (as percentage of total population)

  • B25064_001E: Median gross rent (housing can impact health outcomes)

  • B25077_001E: Median value of owner-occupied housing units

  • B25035_001E: Median year housing units were built

  • B19013_001E: Median household income in the past 12 months (inflation-adjusted dollars)

Fortunately, the US Census API wasn’t terribly complicated, and I was able to pull down that data for all American cities in a simple Python script I wrote. After making a few slight transformations and filtering by population > 20,000, I exported the data into an Excel spreadsheet. You can find all of my code in the GitHub repo linked above. Unfortunately, the next piece would not be as straightforward.
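For anyone curious what that pull looks like, here is a minimal Python sketch of the idea (the API key, output filename, and exact percentage transformations are placeholders; the real script lives in the GitHub repo linked above):

```python
# Minimal sketch: pull the ACS 5-year variables for every Census "place",
# keep places with population > 20,000, and write the result to Excel.
# YOUR_CENSUS_KEY is a placeholder; variable codes match the list above.
import pandas as pd
import requests

VARIABLES = [
    "B01001_001E",  # total population
    "B01001_026E",  # female population
    "B01002_001E",  # median age
    "B17001_001E",  # poverty status, past 12 months
    "B27001_001E",  # health insurance coverage
    "B25064_001E",  # median gross rent
    "B25077_001E",  # median value of owner-occupied housing
    "B25035_001E",  # median year housing units built
    "B19013_001E",  # median household income
]

url = "https://api.census.gov/data/2022/acs/acs5"
params = {
    "get": "NAME," + ",".join(VARIABLES),
    "for": "place:*",
    "in": "state:*",
    "key": "YOUR_CENSUS_KEY",
}
rows = requests.get(url, params=params, timeout=60).json()

# The API returns a list of lists; the first row holds the column names.
df = pd.DataFrame(rows[1:], columns=rows[0])
df[VARIABLES] = df[VARIABLES].apply(pd.to_numeric, errors="coerce")

# Express a few counts as percentages of total population, as described above.
for col in ["B01001_026E", "B17001_001E", "B27001_001E"]:
    df[col] = df[col] / df["B01001_001E"] * 100

df = df[df["B01001_001E"] > 20000]
df.to_excel("acs_cities.xlsx", index=False)
```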

My First Side Quest

I have some experience using the Google Places API, so I was able to pull up an old Python script of mine and dust it off. To get a Places API key, I had to start a new project in Google Cloud, which did require setting up a billing account. I then imported my list of cities with populations greater than 20,000 and executed a keyword search for “cold plunge” in each city. This worked quite well, and I began collecting the list of businesses. During testing, though, I noticed that the keyword search results were not what I was expecting. This leads me to my first unique find: Google is heavily biased toward returning results that are adjacent to what you are searching for, but not necessarily “on target”.
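The search mechanics themselves were simple. A rough sketch of the per-city call, using the Places Text Search endpoint over plain HTTP (the API key and the city/state values are placeholders), looks like this:

```python
# Minimal sketch of the per-city keyword search against the Places Text
# Search endpoint. YOUR_PLACES_KEY and the example city are placeholders.
import requests

PLACES_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"
API_KEY = "YOUR_PLACES_KEY"

def search_cold_plunge(city, state):
    """Return the raw Text Search results for 'cold plunge' in one city."""
    params = {"query": f"cold plunge in {city}, {state}", "key": API_KEY}
    resp = requests.get(PLACES_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("results", [])

for place in search_cold_plunge("Grand Rapids", "MI"):
    print(place["name"], place.get("business_status"))
```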

In my case, I found that the search results for “cold plunge” included saunas, health spas, beauty salons, and other businesses that I knew would not have cold plunges. This was a bit disheartening, as I knew that accurately identifying these businesses would be crucial to the quality of my data modeling. I verified the problem by checking the websites for my Grand Rapids results and determined that I would need to apply a lot more scrutiny in identifying America’s cold plunge businesses.

Knowing that I would need a stronger method to determine which businesses in my Places API results were truly offering cold plunges (removing the false positives), I set about with the following logic.

Using Google Places, I pulled the website for each business result. I then ran a quick check to see whether that website was still online (a proxy for the business still being operational) and used the BeautifulSoup Python library to search the site for keywords that should appear on a cold plunge business’s website (e.g., “cold plunge”, “ice bath”). If a website failed either of those checks, the business was omitted from my list.
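A minimal sketch of that filter, assuming the Place Details endpoint is used to fetch the website field (the keywords and timeouts here are illustrative, not my exact settings):

```python
# Sketch of the false-positive filter: look up each result's website through
# Place Details, confirm the site still responds, then scan its text for
# cold-plunge keywords with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"
KEYWORDS = ("cold plunge", "ice bath")

def get_website(place_id, api_key):
    """Fetch the business website, if any, for a given place_id."""
    params = {"place_id": place_id, "fields": "website", "key": api_key}
    data = requests.get(DETAILS_URL, params=params, timeout=30).json()
    return data.get("result", {}).get("website")

def looks_like_cold_plunge(url):
    """True only if the site is reachable and mentions a cold-plunge keyword."""
    try:
        resp = requests.get(url, timeout=15)
    except requests.RequestException:
        return False          # site unreachable -> treat as not operational
    if resp.status_code != 200:
        return False
    text = BeautifulSoup(resp.text, "html.parser").get_text(" ").lower()
    return any(kw in text for kw in KEYWORDS)
```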

The results are in!

There are approximately 2,000 cities in America with a population over 20,000. Using the Google Places API with some additional error checking, I was able to find 4,200 operational cold plunge businesses across all 50 states. Finding the businesses was as lengthy and tricky as it was costly, since there was quite a high error rate in the Places search results. Over the course of the script, which took about 18 hours to run, I had to make 27,000 API calls, which works out to a 16% success rate. For those interested in attempting this strategy, beware: the Google API usage landed me a $478 bill. Fortunately, I had some credits that knocked the price down.

Exploration and Preparation

Before I began modeling, I needed to take a look at my results to see what I was working with. I loaded both my census and business data sets into RStudio and combined them into a single data set, where each row represented a city, its count of cold plunge businesses, and its demographic statistics. Fortunately, I had a complete set: no missing values. The charts below show some interesting patterns. First, there is a huge number of cities with no cold plunge businesses; the data is very right-skewed, which will have implications for the modeling. Geographically, there isn’t a significant pattern to density across the continental US, save for the popularity in California. Notably, Michigan is listed in the top 10 states for cold plunge businesses, which was encouraging. Again, you can see in the third chart that California far outpaces any other state.
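For anyone following along in code rather than RStudio, the merge amounts to something like this pandas analogue (the file and column names here are illustrative, not my actual ones):

```python
# Pandas sketch of the merge step: count businesses per city, join the counts
# onto the census table, and treat cities with no matches as having zero
# cold plunge businesses.
import pandas as pd

census = pd.read_excel("acs_cities.xlsx")                  # one row per city
businesses = pd.read_excel("cold_plunge_businesses.xlsx")  # one row per business

counts = (businesses
          .groupby(["city", "state"])
          .size()
          .rename("business_count")
          .reset_index())

cities = census.merge(counts, on=["city", "state"], how="left")
cities["business_count"] = cities["business_count"].fillna(0).astype(int)
```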

Now that I had a handle on the data, there were some further steps needed to prepare for a functional prediction model. The first was to remove outliers from the business count and the population: there were some really big outliers in those dimensions, and removing them would aid the effectiveness of my model. I removed the cities that fell outside of the first and third quartiles for those variables.
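My original R snippet isn’t reproduced here, but an interquartile-range filter along these lines captures the idea (a Python sketch, assuming the common 1.5 × IQR fences and the illustrative column names from the merge sketch above):

```python
# Sketch of the outlier filter: keep only cities whose business count and
# population fall within the 1.5 x IQR fences.
def iqr_filter(df, col):
    """Keep rows whose value for `col` lies within the 1.5 x IQR fences."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    return df[(df[col] >= q1 - fence) & (df[col] <= q3 + fence)]

cities_trimmed = iqr_filter(iqr_filter(cities, "business_count"), "B01001_001E")
```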

The next step was to check for multicollinearity, which occurs when two or more predictor variables have a near-exact linear relationship. This would help me understand whether there were any variables I should remove before the modeling step.

To check this, I used the variance inflation factor (VIF) method, which measures how much the variance of an estimated regression coefficient increases when predictors are correlated. A good rule of thumb is that any value over 10 indicates high multicollinearity, with 5 as a much stricter threshold. Among my predictors, I only saw a notable correlation between median gross rent and median value of owner-occupied housing, and it wasn’t high enough to merit omitting either variable, so I left them in. You can see my results below.
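For the curious, here is a Python analogue of that VIF check using statsmodels (I ran mine in R; the predictor names follow the ACS variable codes listed earlier, and the data frame continues from the sketches above):

```python
# Compute a VIF for each predictor; the constant is added so the VIFs are
# interpretable, then skipped in the output.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ["B01001_001E", "B01001_026E", "B01002_001E", "B17001_001E",
              "B27001_001E", "B25064_001E", "B25077_001E", "B25035_001E",
              "B19013_001E"]

X = sm.add_constant(cities_trimmed[predictors].astype(float))
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=predictors,
)
print(vif.sort_values(ascending=False))  # values above 10 (or a stricter 5) flag trouble
```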

Time for Modeling

Now that I was comfortable that my dataset was ready to model, I had to determine which type of model to use. When it comes to predicting numerical values, regression modeling is the most common approach, and it comes in a few different flavors depending on what you are trying to predict. Because I am trying to estimate the count of cold plunge businesses a given city would have, a Poisson regression seemed the most appropriate: it is the typical approach for predicting how many times an event occurs, in my case the number of cold plunge businesses in a given city.

While researching this modeling strategy, I concluded that the high number of zeros might be something to consider, so I needed to determine whether I should use a zero-inflated model. As seen in the graph above, I had a high number of cities with no cold plunge businesses. I ran a test for excess zeros on my data and found that, once I had removed the outliers, I did not need a zero-inflated model.
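I won’t reproduce the exact test here, but one common way to check for excess zeros, sketched in Python, is to fit a plain Poisson model and compare how many zero-count cities it expects with how many the data actually contain:

```python
# Fit a plain Poisson GLM, then compare observed zeros with the number the
# model itself expects. Under a Poisson model, P(Y = 0) = exp(-mu).
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

predictors = ["B01001_001E", "B01001_026E", "B01002_001E", "B17001_001E",
              "B27001_001E", "B25064_001E", "B25077_001E", "B25035_001E",
              "B19013_001E"]
formula = "business_count ~ " + " + ".join(predictors)

poisson_fit = smf.glm(formula, data=cities_trimmed,
                      family=sm.families.Poisson()).fit()

mu = poisson_fit.fittedvalues                    # expected count per city
expected_zeros = np.exp(-mu).sum()
observed_zeros = (cities_trimmed["business_count"] == 0).sum()
print(observed_zeros, round(expected_zeros, 1))  # similar values suggest no zero inflation
```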

With nice, clean data and a regression model chosen, I was ready to get to machine learning. I split the dataset randomly, with 80% going into training and 20% held out to evaluate the strength of the model. In R, I fit the model on the training set and then tested it against the held-out data.
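In Python terms (my actual workflow was in R), the split-and-fit step looks roughly like this:

```python
# 80/20 split, Poisson GLM fit on the training cities, then predictions on
# the held-out 20%.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

predictors = ["B01001_001E", "B01001_026E", "B01002_001E", "B17001_001E",
              "B27001_001E", "B25064_001E", "B25077_001E", "B25035_001E",
              "B19013_001E"]
formula = "business_count ~ " + " + ".join(predictors)

train, test = train_test_split(cities_trimmed, test_size=0.2, random_state=42)

model = smf.glm(formula, data=train, family=sm.families.Poisson()).fit()
pred = model.predict(test)                      # expected business counts

print(model.summary())
print("Test MAE:", mean_absolute_error(test["business_count"], pred))
```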

How’d we do?

They say that a picture is worth a thousand words, so I have a couple that show how well the model performed. The first is a residual plot, and the second is a screenshot of a random sampling of predicted values alongside the actual values for the test data. A residual plot shows the difference between each predicted value and its actual value, called a residual. A good model will have a low level of heteroscedasticity, a big word meaning the spread of the residuals stays roughly constant, with no obvious patterns and the points sitting near the zero line. As you can see below, not only is there a pattern, but for some values there is a pretty big difference. This shows that the model does not perform well at accurately predicting the number of cold plunge businesses, given the parameters that I applied. [Sigh]
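For completeness, a residual plot like the one described can be produced in a few lines (again a Python sketch of what I did in R, continuing from the fit above):

```python
# Plot raw residuals (actual minus predicted count) against the predicted
# values for the held-out cities.
import matplotlib.pyplot as plt

residuals = test["business_count"] - pred

plt.scatter(pred, residuals, alpha=0.5)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Predicted cold plunge business count")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residuals on the test set")
plt.show()
```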

Is Grand Rapids plunge-worthy?

Getting back to the impetus behind this little endeavor, the question still remains: is Grand Rapids a suitable candidate for a cold plunge business? My model predicts that Grand Rapids should have 5.3 businesses, so I think that’s a fairly good sign. If you recall from the chart above, Michigan is in the top ten for cold plunge businesses, so I think it’s fair to conclude that there could be room for one more in the market.

Final Thoughts

Data analytics in the marketing field is in very high demand, and I have a much better understanding of why. It is no small matter to develop models that can accurately predict customers and businesses across an economy as diverse as America’s. Given my strategy, there could be hundreds or thousands of predictors needed to develop a strong model, and the scale of computation and time required could be tremendous. Alas, my dream of selling a cold plunge startup crystal ball will not likely come to fruition.

As a very green data scientist, I found this a very worthwhile exercise in the end, save for the API bill. I’m currently pursuing my Master’s in Data Science at Grand Valley State University and wanted to stretch myself with a real-world application. One thing I found to be true, which I have heard from many seasoned data scientists, is that in this field you never stop learning, and you often have to teach yourself whatever the goal at hand requires. Before this experiment, I had no experience with Poisson regression, nor had I heard of zero-inflated models, so it was fulfilling to research how best to apply those strategies.

If you made it this far, thank you for reading and I hope you enjoyed the experiment. I have linked my code for the project above if you have any interest in adapting this strategy to your own machine learning models.

-Michael