Analysing your Strava workouts with Python



Introduction

If you just care about analysing your own Strava data, skip ahead to the next section.

Despite what Nike and Adidas adverts would have you believe, amateur long-distance running is far from the most action-packed sport. In fact, a lot of the time, it’s actually quite boring. While meandering around the countryside and sending all my personal data to Strava, I’ve had a lot of time to think about what possible relationships there might be between different aspects of my runs.

There are some questions I want answered because they might actually be useful to know, like whether I can come up with useful measures of how my fitness has changed over time. But there are also some things I want to know just for curiosity’s sake, like how often I start running at different times during the day.

I don’t particularly want to spend £54.99 per year for Strava Premium just so that I can answer these questions, but luckily, there are several great open-source tools like strava-offline, pandas and matplotlib which let you do all your own home-grown analysis.

In this post, I’ll walk through how you can get your own Strava data into a Pandas dataframe, ready for inspecting with Python. Then I’ll go over how I answered some of my own long-standing questions using different parts of the PyData ecosystem.

Getting your Strava data into a dataframe

strava-offline is a Python program that keeps a mirror of your Strava activities in a local SQLite database. While it would be possible to use the Strava API directly, using strava-offline avoids the need to mess around with access tokens, and it’s always nice to have a local copy of your own data in case Strava loses it.

There are installation instructions in the README, but it should be as simple as:

$ pip install strava-offline

Then you can run the following command:

$ strava-offline sqlite

This will open Strava in a browser and ask for permissions. Once you’ve authenticated, it will pause for a while as it puts all your data in a SQLite database. By default, the database is stored at:

  • Mac: ~/Library/Application Support/strava_offline/strava.sqlite
  • Linux: ~/.local/share/strava_offline/strava.sqlite
  • Windows: C:\Documents and Settings\<User>\Application Data\Local Settings\strava_offline\strava_offline\strava.sqlite

All activity data is stored in a table called activity. Now getting this data into a dataframe should be as simple as:

import json
import os
import sqlite3

import pandas as pd
import platformdirs

DATABASE_FILE = os.path.join(platformdirs.user_data_path("strava_offline"), "strava.sqlite")
con = sqlite3.connect(DATABASE_FILE)
df = pd.read_sql_query("SELECT * from activity", con)

df['json_parsed'] = df['json'].apply(json.loads) # Parse json into a new column
json_columns = pd.json_normalize(df['json_parsed'].tolist()) # Flatten parsed json into columns
df = pd.concat([df, json_columns], axis=1) # Add parsed json to dataframe
df = df.drop(columns=['json_parsed']) # Remove json_parsed column
df = df.loc[:,~df.columns.duplicated()].copy() # Remove duplicate columns
df["datetime"] = pd.to_datetime(df["start_date_local"]) # Parse start_date_local
df = df.set_index("datetime", drop=False) # Index by parsed date, keeping the column for later use
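To see what the flattening step does without touching your real database, here is a minimal, self-contained sketch. It builds an in-memory SQLite table that only mimics the activity table (the two rows and their values are made up, and the real table has more columns) and flattens the json column with pd.json_normalize, which is one way to end up with dotted column names such as athlete.id:

```python
import json
import sqlite3

import pandas as pd

# Made-up rows standing in for strava-offline's `activity` table (assumption:
# the real table has more columns, but it does include a `json` text column).
rows = [
    (1, json.dumps({"id": 1, "sport_type": "Run", "distance": 5012.3,
                    "athlete": {"id": 42, "resource_state": 1}})),
    (2, json.dumps({"id": 2, "sport_type": "Walk", "distance": 2100.0,
                    "athlete": {"id": 42, "resource_state": 1}})),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, json TEXT)")
con.executemany("INSERT INTO activity VALUES (?, ?)", rows)

demo = pd.read_sql_query("SELECT * FROM activity", con)
con.close()

# Parse each JSON string into a dict, then flatten nested keys into columns;
# nested objects become dotted names such as "athlete.id".
flat = pd.json_normalize(demo["json"].apply(json.loads).tolist())

# Join the flattened columns on and drop names that appear twice (here `id`).
demo = pd.concat([demo, flat], axis=1)
demo = demo.loc[:, ~demo.columns.duplicated()].copy()

print(sorted(demo.columns))
```

Running something like this on a couple of rows first is a cheap way to check which column names you will get before loading years of activities.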

Now you should have a dataframe with the columns corresponding to a DetailedActivity as per the Strava API. Here they are sorted in rough order of interesting-ness:

Attribute Explanation
sport_type Type of activity, e.g., “Run”, “Walk”, “Ride”. Details here.
distance Distance (metres).
elapsed_time Total workout time (seconds).
moving_time Moving time (seconds).
average_speed Average speed (metres/second).
max_speed Max speed (metres/second).
start_date_local Starting local time of your workout, in ISO 8601 standard.
start_date Starting time of your workout, in ISO 8601 standard.
datetime Parsed start_date_local from above.
average_heartrate Average heart rate of athlete.
max_heartrate Max heart rate of athlete.
elev_low Lowest elevation (metres).
elev_high Highest elevation (metres).
total_elevation_gain Elevation gain (metres).
average_cadence Average cadence of the activity.
name Name of your workout.
kudos_count Total kudos for the activity.
commute Whether the activity was marked as a commute (0 or 1).
trainer Whether the activity was recorded on a training machine (0 or 1).
comment_count Total comments on the activity.
athlete_count Number of athletes involved if this activity was a group activity.
achievement_count Number of achievements for the activity.
device_watts For rides only, whether wattage was reported by a recording device (0/1).
kilojoules For rides only, kilojoules of energy burned during the workout.
average_watts For rides only, the average wattage of the activity.

And some of the more arcane columns:

Attribute Explanation
pr_count Not mentioned in the Strava docs, maybe the number of personal records set on segments?
has_heartrate Whether the activity has heart rate data (True/False).
timezone Timezone, e.g., (GMT+00:00) Europe/London.
has_location_data Whether the activity has location data (0 or 1).
utc_offset UTC offset for local starting time.
photo_count Number of photos.
manual Whether you manually added the workout or tracked it somehow (True/False).
private Whether the activity was private (True/False).
visibility Visibility status of the workout, e.g., everyone.
flagged Whether the activity is flagged (True/False).
total_photo_count Total number of photos for this activity, including those linked to Instagram.
has_kudoed Whether the athlete who requested this activity from the API has kudoed the activity (0/1).
external_id Name of your workout as it was uploaded.
gear_id ID used by Strava if you’ve added your gear to Strava.
athlete.id Internal ID for the athlete that completed this activity.
athlete.resource_state Not mentioned in the Strava docs.
upload_id Internal ID used by Strava for activities.
type Deprecated alternative to sport_type.
upload_id_str String version of the upload ID?
heartrate_opt_out Whether the athlete has opted out of having their heart rate data stored (True/False).
display_hide_heartrate_option Not mentioned in the Strava docs.
from_accepted_tag Not mentioned in the Strava docs.
location_city Not documented, but null in all responses.
location_state Not documented, but null in all responses.
location_country Not documented, but null in all responses.
json Full response from the Strava API as stored by strava-offline.
resource_state Mystery attribute not documented by Strava API.
workout_type Mystery attribute documented by Strava API as “the activity’s workout type”.
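The raw units in the tables above (metres, seconds, metres per second) are not always the most readable. Here is a small sketch of converting them into kilometres and a min/km pace; the helper name add_readable_units and the new column names are my own inventions for illustration, not part of the Strava API:

```python
import pandas as pd


def add_readable_units(activities: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with distance in km, time in minutes and pace in min/km."""
    out = activities.copy()
    out["distance_km"] = out["distance"] / 1000
    out["moving_minutes"] = out["moving_time"] / 60
    # Pace is the inverse of speed: minutes taken to cover one kilometre.
    out["pace_min_per_km"] = out["moving_minutes"] / out["distance_km"]
    return out


# Made-up example: 5 km in 25 minutes and 10 km in 55 minutes.
example = pd.DataFrame({"distance": [5000.0, 10000.0], "moving_time": [1500, 3300]})
readable = add_readable_units(example)
print(readable)
```

Activities with a distance of zero (some manual entries, for instance) would come out with an infinite or undefined pace, so you may want to filter those out first.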

Calculating calories

If you’ve carefully studied the list above, you might’ve seen that each activity has an associated kilojoules attribute, but this is only available for rides (even if your other activities have heart rate data). If you want to explore the relationship between calorie counts and other factors, you’ll need to approximate the calorie counts in terms of other variables.

One way to do this is to first estimate your VO2 max and then use this formula for calories in terms of workout duration, average heart rate, weight and age. This is quite straightforward:

# Personal stats
max_heartrate = 200      # beats per minute
resting_heartrate = 55   # beats per minute
weight = 75              # kilograms
age = 20                 # years

# Approximate formula
vo2_max = 15.3 * (max_heartrate / resting_heartrate)

def calories(time, average_heartrate):
    # time is in minutes; the result is in kcal
    return time * (0.634*average_heartrate + 0.404*vo2_max + 0.394*weight + 0.271*age - 95.7735) / 4.184

# moving_time is in seconds, so convert it to minutes first
df["estimated_calories"] = df.apply(lambda row: calories(row["moving_time"] / 60, row["average_heartrate"]), axis=1)

# Optionally, forget workouts without heartrate data
df = df.dropna(subset=["estimated_calories"])

But beware! This formula is only an approximation, and it doesn’t account for things like how your weight or age has changed over time. Moreover, if you don’t have heart rate data for your workouts, then you’ll have to use an approximation that relies only on the workout duration. This isn’t ideal, because it treats all workouts as equally difficult, rather than accounting for the fact that some workouts are harder than others.
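As a quick sanity check of the formula, here is a self-contained sketch that applies it to a small made-up dataframe using the same example stats as above. It uses pandas column arithmetic instead of a row-by-row apply, and shows that a workout without heart rate data simply comes out as NaN:

```python
import numpy as np
import pandas as pd

# Same example stats as above (weight in kg, age in years).
max_heartrate, resting_heartrate, weight, age = 200, 55, 75, 20
vo2_max = 15.3 * (max_heartrate / resting_heartrate)


def calories(minutes, average_heartrate):
    """Estimated kcal for a workout lasting `minutes` at `average_heartrate` bpm."""
    return minutes * (0.634 * average_heartrate + 0.404 * vo2_max
                      + 0.394 * weight + 0.271 * age - 95.7735) / 4.184


# Made-up workouts; the second one has no heart rate data.
workouts = pd.DataFrame({
    "moving_time": [1800, 1800, 3600],           # seconds
    "average_heartrate": [150.0, np.nan, 150.0],
})

# The formula works on whole columns at once, so no DataFrame.apply is needed.
workouts["estimated_calories"] = calories(
    workouts["moving_time"] / 60, workouts["average_heartrate"]
)
print(workouts.round(1))
```

Because the estimate scales linearly with time, the hour-long workout comes out at exactly twice the half-hour one at the same heart rate, which is a handy check that the units are right.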

Question 1: How does the relationship between calorie count and distance change with type of workout?

I’m going to be using a Jupyter notebook. All the answers to these questions are going to be subjective and one relationship or correlation that appears for me might not appear for everyone. I also doubt that the answers are very interesting – but I hope that these might serve as a useful example that you can use to kick-start your own analysis.

I’ve wondered for a long time how the calorie count of a workout varies with distance. It won’t be surprising that increasing the distance will increase the calories burned, but I’m more interested in how the relationship changes with mode of workout: is it more efficient to walk or run the same distance?

Or, put in slightly sillier terms: if you’re starving in the desert and have just one cookie to fuel your journey to an oasis, would it be a better idea to eat the cookie and then run or walk?

The easiest way to visualise this is a scatter plot with distance on the $x$-axis and energy on the $y$-axis:

import matplotlib.pyplot as plt

is_run = df["sport_type"] == "Run"
is_walk = df["sport_type"] == "Walk"

plt.scatter(df[is_run]["distance"], df[is_run]["estimated_calories"].to_numpy(), color='black', label="Run")
plt.scatter(df[is_walk]["distance"], df[is_walk]["estimated_calories"].to_numpy(), color='blue', label="Walk")

plt.ylabel("Energy (kcal)")
plt.xlabel("Distance (m)")

plt.legend()
plt.title("Distance vs Energy when Running or Walking")

plt.axvline(x=5000, color='red', linestyle='--', linewidth=2)    # 5K
plt.axvline(x=10000, color='red', linestyle='--', linewidth=2)   # 10K
plt.axvline(x=21097, color='red', linestyle='--', linewidth=2)   # Half marathon
plt.axvline(x=21097*2, color='red', linestyle='--', linewidth=2) # Full marathon

plt.show()

This produces the following plot:

Eyeballing the relationship, it looks like running is slightly less efficient. Things are more extreme when you account for the calories you would be burning anyway on account of your basal metabolic rate.

# Calories burned per second assuming 70bpm resting
background_calories_per_second = calories(1, 70) / 60

x_run, y_run = df[is_run]["distance"], df[is_run]["estimated_calories"] - df[is_run]["elapsed_time"] * background_calories_per_second
x_walk, y_walk = df[is_walk]["distance"], df[is_walk]["estimated_calories"] - df[is_walk]["elapsed_time"] * background_calories_per_second

plt.scatter(x_run, y_run, color='black', label="Run")
plt.scatter(x_walk, y_walk, color='blue', label="Walk")

# ...plot the same as above...

There are several ways of adding two lines of best fit; one way is to use sklearn.linear_model:

from sklearn import linear_model

plt.scatter(x_run, y_run, color='black', label="Run")
plt.scatter(x_walk, y_walk, color='blue', label="Walk")

plt.plot(x_run, linear_model.LinearRegression().fit(x_run.to_numpy()[:, None], y_run).predict(x_run.to_numpy()[:, None]), "grey")
plt.plot(x_walk, linear_model.LinearRegression().fit(x_walk.to_numpy()[:, None], y_walk).predict(x_walk.to_numpy()[:, None]), "lightblue")
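If you would rather not pull in scikit-learn, NumPy's polyfit gives the same kind of least-squares straight line. The sketch below uses made-up points (the numbers are illustrative, not my data) and compares the slopes, which can be read as kcal per metre for each activity type:

```python
import numpy as np

# Made-up (distance in m, energy in kcal) points for illustration only.
x_run_demo = np.array([3000.0, 5000.0, 8000.0, 10000.0])
y_run_demo = np.array([210.0, 350.0, 560.0, 700.0])   # lies on 0.07 kcal/m
x_walk_demo = np.array([1000.0, 2000.0, 4000.0, 6000.0])
y_walk_demo = np.array([50.0, 100.0, 200.0, 300.0])   # lies on 0.05 kcal/m

# A degree-1 polynomial fit returns (slope, intercept) of the least-squares line.
run_slope, run_intercept = np.polyfit(x_run_demo, y_run_demo, 1)
walk_slope, walk_intercept = np.polyfit(x_walk_demo, y_walk_demo, 1)

print(f"run:  {run_slope * 1000:.1f} kcal/km")
print(f"walk: {walk_slope * 1000:.1f} kcal/km")
```

With real data, drop rows with missing values first, since polyfit does not ignore NaN.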

Question 2: What are the most common distances, altitudes, durations and start times?

I’ve also wondered for a long time what my modal distances, altitudes, durations and start times are. These questions can be answered quite straightforwardly by just plotting a histogram on each of the different columns. Here it is for distance:

plt.hist(df["distance"], bins=80) # 80 equal-width bins

I’ve also labelled 5K, 10K, a half marathon and a full marathon. You can see there are small peaks at the half marathon and full marathon distances, since those distances make good stopping points for a run.
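The histogram shows the mode visually; to read it off as a number, np.histogram computes the same kind of per-bin counts without drawing anything. A small sketch with made-up distances (the bin count of 20 is arbitrary):

```python
import numpy as np

# Made-up distances in metres, clustered around 5 km.
distances = np.array([4900, 5000, 5050, 5100, 5200, 9900, 10100, 21100, 2000, 5020])

# np.histogram returns the count per bin and the bin edges (one more edge than bins).
counts, edges = np.histogram(distances, bins=20)

busiest = np.argmax(counts)
print(f"Most common distance: {edges[busiest]:.0f} to {edges[busiest + 1]:.0f} m "
      f"({counts[busiest]} activities)")
```

The answer depends on the bin width, so it is worth trying a few values of bins before reading too much into the result.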

For elevation, it’s almost identical except using the elev_low attribute:

plt.hist(df["elev_low"], bins=150)

plt.xlabel("Lowest elevation (m)")
plt.ylabel("Frequency")

plt.title("What are my most frequent elevations?")

There’s quite a bit of range! I was curious about the runs apparently below sea level and around 400 metres above sea level, and these can be identified by:

df[df["elev_low"] < 0] # or,
df[df["elev_low"] > 400]

and then you can look them up on Strava to see where they were. For me, it turns out that the runs measured as below sea level were runs at the beach, and there was nothing particularly special about the 400m run apart from that it was in a very hilly part of the country.

There are also clearly two peaks; these correspond to the runs that I do while at home and while at university.
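To look up the unusual activities mentioned above on Strava, it can help to print a link for each one. This sketch makes two assumptions that you should check against your own data: that the dataframe has an id column holding the Strava activity ID, and that activity pages live at https://www.strava.com/activities/<id>. The rows are made up:

```python
import pandas as pd

# Made-up activities; `id` is assumed to hold the Strava activity ID.
activities = pd.DataFrame({
    "id": [111, 222, 333],
    "name": ["Beach run", "Hill run", "Park run"],
    "elev_low": [-3.0, 410.0, 35.0],
})

# Keep only the activities below sea level or above 400 m.
unusual = activities[(activities["elev_low"] < 0) | (activities["elev_low"] > 400)].copy()

# Assumption: activity pages live at https://www.strava.com/activities/<id>.
unusual["url"] = "https://www.strava.com/activities/" + unusual["id"].astype(str)
print(unusual[["name", "elev_low", "url"]].to_string(index=False))
```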

Finally, this is the distribution for workout duration:

# Oops, I've left Strava on a few times by accident
not_elapsed_time_outlier = df["elapsed_time"] < 2500*60

plt.hist(df[not_elapsed_time_outlier]["elapsed_time"] / 60, bins=50)

Since workout duration is going to be roughly proportional to distance, this is very similar to the distance distribution.
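One rough way to check the "roughly proportional" claim is the correlation between the two columns. A minimal sketch on made-up numbers, using pandas' default Pearson correlation:

```python
import pandas as pd

# Made-up runs at a fairly steady pace of about 3 m/s.
durations = pd.DataFrame({
    "distance": [3000.0, 5000.0, 8000.0, 10000.0, 21097.0],  # metres
    "elapsed_time": [1000, 1600, 2700, 3300, 7400],           # seconds
})

# Series.corr computes the Pearson correlation coefficient by default.
r = durations["distance"].corr(durations["elapsed_time"])
print(f"Correlation between distance and duration: {r:.3f}")
```

A value close to 1 means the two histograms will look alike; accidental long recordings, like the outliers filtered out above, would pull it down.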

Question 3: What is the relationship between heart rate and pace?

Another question I’ve wondered about is how the average heart rate affects the average speed of a run. Again, it won’t be a surprise if a higher average heart rate increases the average speed, but it would be interesting to see the exact nature of this relationship.

# Drop rows without heart rate data once, so x and y stay the same length
runs = df[is_run].dropna(subset=["average_heartrate"])
walks = df[is_walk].dropna(subset=["average_heartrate"])

plt.scatter(runs["average_heartrate"], runs["average_speed"], color='black', label="Run")
plt.scatter(walks["average_heartrate"], walks["average_speed"], color='blue', label="Walk")

plt.xlabel("Average heart rate (BPM)")
plt.ylabel("Average speed (m/s)")

plt.legend()
plt.title("How does average heart rate affect average speed?")
plt.show()

There is, as you would expect, a roughly linear relationship between the two variables. Runs well above the diagonal are the more efficient ones (faster for a given heart rate), and runs well below it are the less efficient ones. What about max heart rate and max speed?

There’s a clear outlier at 20mph, apparently on a walk where I think I must’ve got in a car.
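Outliers like this can be filtered out before plotting. The sketch below converts max_speed from metres/second to mph and drops walks above a cut-off; the 10 mph limit is a hypothetical threshold chosen for illustration, and the rows are made up:

```python
import pandas as pd

MPS_TO_MPH = 2.23694  # 1 m/s is about 2.237 mph

activities = pd.DataFrame({
    "sport_type": ["Walk", "Walk", "Run"],
    "max_speed": [1.9, 8.94, 5.2],  # metres/second; 8.94 m/s is about 20 mph
})
activities["max_speed_mph"] = activities["max_speed"] * MPS_TO_MPH

# Hypothetical cut-off: a walk that tops 10 mph was probably not a walk.
WALK_LIMIT_MPH = 10
suspicious = (activities["sport_type"] == "Walk") & (activities["max_speed_mph"] > WALK_LIMIT_MPH)

cleaned = activities[~suspicious]
print(cleaned)
```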

Question 4: How has my fitness changed over time?

There are several ways that this could be measured. Maybe the simplest is just to plot how my average speed has changed over time. Here I’m taking the rolling average over 10 runs since otherwise the plot is quite spiky:

plt.plot(df["datetime"], df["average_speed"].rolling(10).mean())

The horizontal and diagonal lines are times I took a break from running or from recording workouts on Strava. It’s also interesting to look at max_speed:

plt.plot(df["datetime"], df["max_speed"].rolling(10).mean())

Another way of measuring fitness over time is by measuring how many metres I can run per calorie burned.

plt.plot(df[is_run]["datetime"], (df[is_run]["distance"] / df[is_run]["estimated_calories"]).rolling(30).mean())
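A rolling window over a fixed number of activities treats a week off and a year off the same. One alternative, sketched here on made-up data, is a time-based window: sort by date, index by it, and average over the trailing 30 days. The 30-day window is an arbitrary choice for illustration:

```python
import pandas as pd

# Made-up runs, deliberately out of chronological order.
runs = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-03-01", "2023-01-01", "2023-01-10", "2023-01-20"]),
    "average_speed": [3.4, 3.0, 3.2, 3.1],
})

# Time-based windows need a sorted DatetimeIndex.
by_date = runs.sort_values("datetime").set_index("datetime")

# Mean over each run and everything recorded in the 30 days before it.
smoothed = by_date["average_speed"].rolling("30D").mean()
print(smoothed)
```

Here the March run is averaged on its own, because nothing else falls within its 30-day window, rather than being blended with runs from two months earlier.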

Question 5: Are morning or evening runs faster?

Answering this question is a little bit more difficult since it’s necessary to regroup the data by time of day rather than datetime. First, I’ll just plot the distribution of runs by time of day. This can be done like so:

import numpy as np

window_size = 20 # in minutes
num_labels = 6

chunks = (df["datetime"].dt.hour * 60 + df["datetime"].dt.minute) // window_size
chunk_counts = df.groupby(chunks).size().to_frame(name="count")

ticks = np.arange(0, (1440//window_size)+1, (1440//(window_size * (num_labels))))
labels = []
for i in ticks:
    hour = i * window_size // 60
    minute = (i * window_size) % 60
    labels.append(f"{hour:02d}:{minute:02d}")

plt.bar(chunk_counts.index, chunk_counts["count"], width=1.0)
plt.xticks(ticks, labels)
plt.show()

(It’s also necessary to be careful here to use start_date_local rather than start_date, since start_date is always given in UTC, i.e. GMT+0.)
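To make the difference concrete, here is a tiny sketch with one made-up activity recorded in a UTC+1 timezone. It assumes both timestamps are ISO 8601 strings formatted the same way, which is worth checking against your own data:

```python
import pandas as pd

# Made-up activity recorded in a UTC+1 timezone (e.g. British Summer Time).
activity = pd.DataFrame({
    "start_date": ["2023-07-01T06:30:00Z"],        # UTC
    "start_date_local": ["2023-07-01T07:30:00Z"],  # local wall-clock time
})

utc_hour = pd.to_datetime(activity["start_date"]).dt.hour
local_hour = pd.to_datetime(activity["start_date_local"]).dt.hour

print(f"Hour from start_date: {utc_hour.iloc[0]}, from start_date_local: {local_hour.iloc[0]}")
```

Grouping by the first of these would shift every summer run an hour earlier than it felt at the time.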

To plot speed, I can take the mean of each group:

chunk_speeds = df.groupby(chunks)["max_speed"].mean().to_frame(name="max_speed")
plt.bar(chunk_speeds.index, chunk_speeds["max_speed"], width=1.0)
plt.xticks(ticks, labels)

plt.show()

It looks like max speed is relatively constant – this is a little surprising since I always feel like I run faster when it’s dark.
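To answer the headline question with two numbers rather than a bar chart, the activities can be split into just two groups. A minimal sketch on made-up runs; treating everything before noon as "morning" is a hypothetical split, so adjust it to your own habits:

```python
import pandas as pd

# Made-up runs with local start times and max speeds in metres/second.
runs = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-05-01 07:15", "2023-05-03 08:40", "2023-05-04 18:30",
        "2023-05-06 19:05", "2023-05-08 20:45",
    ]),
    "max_speed": [4.8, 5.0, 5.1, 4.9, 5.3],
})

# Hypothetical split: before noon counts as morning, everything else as evening.
part_of_day = runs["datetime"].dt.hour.map(lambda h: "morning" if h < 12 else "evening")

# Mean max speed and number of runs in each group.
summary = runs.groupby(part_of_day)["max_speed"].agg(["mean", "count"])
print(summary)
```

Keeping the count next to the mean is deliberate: a difference based on a handful of runs in one group is not worth much.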

Conclusion

I hope that you have fun analysing your Strava data and answering some of your own questions. The full code I used to produce all the plots above can be found in this gist.