Introduction

You can find several use cases involving time series forecasting in business these days. Be it forecasting sales, website traffic, number of customers or even the weather, most of the prediction problems we come across have a time dimension to them.

Learning how to solve business problems involving a time dimension is a quite useful skill for data scientists. When I started my career, my go-to approach for time series problems was classical methods like ARIMA and SARIMA. However, over the years I have shifted towards non-linear models like XGBoost because of several benefits, which I’ll explain throughout this post.

Note: The case study given in this post is based on a hypothetical data set. For business context, consider it as coming from a ride sharing app.

Table of Contents

  1. Business Problem
  2. Hypothesis
  3. Data Exploration
  4. Feature Engineering
  5. Model Training and Evaluation
  6. Improvement Ideas
  7. Summary

1. Business Problem

For a ride sharing app, the most critical aspect of the business is matching riders with drivers efficiently. Efficient here means that:

  • The riders don’t wait long to find a driver
  • The drivers are able to complete many successful rides daily

With this context, the problem we are going to solve is: How can we better guide drivers towards areas with higher expected demand at a given time and location?

2. Hypothesis

Defining a hypothesis early in a data science project is quite useful. It helps us list the factors (or, say, model features) we could use in solving the problem. Also, since we haven’t looked at the data yet, it avoids availability bias in our thinking.

Let’s think of some features which can help in solving our business problem:

  • historical riders demand
  • temporal information (date, time, weekday etc)
  • promotional/discount offer
  • weather conditions
  • rider location

Now we know what data we need to start modeling the problem.

But let’s take a step back. There’s one more thing to do. We can’t directly model the business problem as given. We must reframe it into a measurable data science problem.

Based on my business understanding of the given problem, one way to create a data science problem statement is: Predict the rider booking demand for a given location at a certain hour.

Now, this sounds like a doable time series (regression) problem that we can model using data and ML algorithms.

3. Data Exploration

Let’s check what data is given to us:

start_time                     start_latitude  start_longitude  end_latitude  end_longitude  ride_value
2022-03-26T06:37:29.854982000  59.434512       24.775047        59.432783     24.745426      0.413750
2022-03-02T22:16:34.715015000  59.398647       24.682734        59.446545     24.740912      1.555000
2022-03-23T18:18:16.737000000  59.426320       24.738518        59.421385     24.731641      0.162448
2022-03-08T07:00:44.215596000  59.417252       24.743914        59.431809     24.743137      0.404000
2022-03-09T07:45:51.067463000  59.411823       24.737616        59.445297     24.697369      1.083750
Table showing the data set given for model training

Following is the description of the information given to us in the data set:

  • start_time: start time of the ride
  • start_latitude: start latitude of the ride
  • start_longitude: start longitude of the ride
  • end_latitude: end latitude of the ride
  • end_longitude: end longitude of the ride
  • ride_value: value of the ride

We find the data has only six columns. But where is the target (dependent variable)? Also, how can we predict for a certain location? We need to do some additional work to prepare the data, such as:

  • Convert the ride coordinates into grid (each grid can be considered an area)
  • Create a target variable (number of rides) by aggregating on the grids

Let’s start coding!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopy.distance
from h3 import h3
import catboost
from sklearn import metrics
import shap
import seaborn as sns
import folium

# read data set
df = pd.read_csv('dscasestudy.csv')
print(df.shape)
# (511324, 6)

# derive date and hour from the ride start timestamp
df['start_time'] = pd.to_datetime(df['start_time'])
df['start_date'] = df['start_time'].dt.date
df['hour'] = df['start_time'].dt.hour

# Create location grid (H3 hexagon index at resolution 7;
# each hexagon can be considered an area)
resolution = 7
df['location_grid'] = df.apply(
    lambda x: h3.geo_to_h3(x['start_latitude'], x['start_longitude'], resolution),
    axis=1)

# Create target variable: number of rides per grid, date and hour
df_agg = (df.groupby(['location_grid', 'start_date', 'hour'])
            .agg(rider_count=('location_grid', 'size'))
            .reset_index())

First of all, let’s analyse the target variable.

# target variable distribution
sns.displot(x='rider_count', data=df_agg, aspect=3, height=4)

We see the following:

  • The target variable is right skewed.
  • There are lots of grids with very low rider_count values (<= 5).

Since the target variable is skewed, we apply a log transformation to reduce the skewness and variance.
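A minimal sketch of the transform, using the df_agg frame built above (we use np.log1p so zero counts are handled safely; predictions are inverted later with np.expm1):

# log1p = log(1 + x): compresses the right tail and is defined at x = 0
df_agg['rider_count_log'] = np.log1p(df_agg['rider_count'])
sns.displot(x='rider_count_log', data=df_agg, aspect=3, height=4)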

Let’s look at some temporal distribution of the target variable rider_count and find out if there are any weekly or daily patterns.
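The original post showed these patterns as charts. A minimal sketch to reproduce the daily view, assuming the df_agg frame built above:

# total rides per day across all grids; weekly dips should stand out
daily = df_agg.groupby('start_date')['rider_count'].sum()
daily.plot(figsize=(12, 4), marker='o', title='Total rides per day')
plt.show()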

We see a strong weekly pattern with a clear drop in the number of rides on days 7, 14 and 21. Also, the peak number of rides tends to increase throughout the month.

I am curious to find out which part of the city most of these rides come from. The reason: if, say, 90% of the rides come from the city center, we might build two different models, one for the city center and another for the rest. Let’s check!
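The map itself appeared as an image in the original post. A rough sketch of how one might reproduce it with folium and h3 (both imported earlier); the circle sizing by log ride count is my own assumption:

# total rides per grid, plotted at each hexagon's center
grid_counts = df_agg.groupby('location_grid')['rider_count'].sum().reset_index()
m = folium.Map(location=[df['start_latitude'].mean(), df['start_longitude'].mean()], zoom_start=11)
for _, row in grid_counts.iterrows():
    lat, lng = h3.h3_to_geo(row['location_grid'])  # hexagon center (lat, lng)
    folium.CircleMarker(
        location=[lat, lng],
        radius=float(max(2, np.log1p(row['rider_count']))),
        fill=True,
    ).add_to(m)
m  # renders inline in a notebook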

We do see that lots of rides happen in the center of the city, with a few spread across the rest of it.

4. Feature Engineering

In a time series problem, creating features can be a bit challenging, especially when the data provided doesn’t have continuous dates. On further investigation, we find the data set given to us has missing dates. To handle this, we’ll do the following:

  • Create a master data set containing all dates
  • Do a left join and fill the missing dates with zero (we assume no rides were booked on those dates)
  • Using the master data set, calculate the historical features
# Creating Master Data to make the dates continuous in order to create lag and rolling median features
from itertools import product

# Create a list of locations
locations = list(df_agg.location_grid.unique())

# Create a list of dates from 01/03/2022 to 28/03/2022
dates = pd.date_range(start='2022-03-01', end='2022-03-28')

## Create master data: each grid has 28 days and each day has 24 hours
hours = list(range(24))

# Create a DataFrame with all possible combinations of locations, dates, and hours
all_combinations = list(product(locations, dates, hours))
df_master = pd.DataFrame(all_combinations, columns=['location_grid', 'start_date', 'hour'])

# Do left join
df_agg['start_date'] = pd.to_datetime(df_agg['start_date'])
df_complete = df_master.merge(df_agg, on=['location_grid', 'start_date', 'hour'], how='left')
df_complete = df_complete.fillna(0)
df_complete.head()
location_grid    start_date           hour  rider_count
87089b014ffffff  2022-03-01 00:00:00  0     0
87089b014ffffff  2022-03-01 00:00:00  1     0
87089b014ffffff  2022-03-01 00:00:00  2     0
87089b014ffffff  2022-03-01 00:00:00  3     0
87089b014ffffff  2022-03-01 00:00:00  4     0
Table showing the generated master data set

As you can see, in the master data set above, we have generated rows for every location_grid, date and hour combination, including the previously missing ones. Filling in these gaps ensures that the historical feature computation below is accurate.

# Adding day of week column
df_complete['day'] = df_complete['start_date'].dt.weekday

# Calculate features: lag features and rolling median of rider count for each location
from tqdm import tqdm

shift_range = [1, 3, 6, 24, 48]
rolling_window = [3, 5, 10, 15, 24, 48]
df_complete = df_complete.sort_values(by=['location_grid', 'start_date', 'hour'])

def create_ts_features(df: pd.DataFrame) -> pd.DataFrame:
    """Calculates lag and rolling median features per location grid."""
    for shift in tqdm(shift_range):
        df[f"shift_lag_{shift}"] = df.groupby('location_grid')['rider_count'].transform(
            lambda x: x.shift(shift))

    for win in rolling_window:
        # shift(1) so the current hour is excluded from its own rolling window
        df[f'rolling_median_{win}'] = df.groupby('location_grid')['rider_count'].transform(
            lambda x: x.shift(1).rolling(win).median())

    return df

df_complete = create_ts_features(df_complete)

In the code above, we sort the data by location_grid, start_date and hour to ensure the lag and rolling functions work correctly. We calculate historical features such as:

  • Rider count 1 hour ago, 3 hours ago, 6 hours ago and so on…
  • Median rider count in the last 3 hours, last 5 hours, last 10 hours and so on…

Some more technical details:

  • We need to remove the samples where the lag and rolling features are all empty.
  • The samples with rider_count = 0 that were artificially created during the merge should be removed before training.
# Remove rows where all listed lag/rolling columns are NA, then fill remaining NAs with 0
df_model = df_complete.dropna(how='all', subset=['shift_lag_24', 'shift_lag_48',
                                                 'rolling_median_3', 'rolling_median_5',
                                                 'rolling_median_10'])
df_model = df_model.fillna(0)

# Remove rows where rider count is 0; these came from merging with the master data
# (added only to make dates continuous for the lag and rolling features)
df_model = df_model[df_model['rider_count'] != 0]

5. Model Training and Evaluation

Before model training, we should create separate data splits to ensure model evaluation is done properly. For a time series problem, a time-based split usually works well.

# Create train, validation and test splits
df_model['start_date'] = pd.to_datetime(df_model['start_date'])
cutoff = df_model['start_date'].max() - pd.to_timedelta(14, unit='D')
print(f'{cutoff=}')
test_cutoff = df_model['start_date'].max() - pd.to_timedelta(7, unit='D')
print(f'{test_cutoff=}')

# data starts on 2022-03-01; the last 14 days are held out for validation and test
xtrain = df_model[df_model['start_date'].between('2022-03-01', cutoff)].copy()
xvalid = df_model[(df_model['start_date'] > cutoff) & (df_model['start_date'] < test_cutoff)].copy()
xtest = df_model[df_model['start_date'] >= test_cutoff].copy()
xtrain.shape, xvalid.shape, xtest.shape

It is usually good practice to start with a baseline model. In our case, we’ll use the previous hour’s rider_count value (shift_lag_1) as the baseline. Having a baseline helps us understand how much improvement the machine learning model brings.

def calculate_adj_rsquared(y_true, y_pred, df):
    """Adjusted R^2; the predictor count is taken from the passed frame's columns."""
    r_squared = metrics.r2_score(y_true, y_pred)
    adj_r = 1 - (1 - r_squared) * (len(y_true) - 1) / (len(y_true) - df.shape[1] - 1)
    return f"Adjusted R^2: {adj_r:.4f}"

pred = xvalid['shift_lag_1']
print(f'MAPE error: {metrics.mean_absolute_percentage_error(xvalid["rider_count"], pred)*100:.2f}%')
print(f"Model Test MAE: {metrics.mean_absolute_error(xvalid['rider_count'], pred)}")
calculate_adj_rsquared(xvalid['rider_count'], pred, xvalid)

MAPE error: 60.44%
Model Test MAE: 5.119927395306625

Adjusted R^2: 0.9393

Wow! The baseline already seems quite strong. One feature alone is able to explain 93% of the variance in the dependent variable. Anyway, this is expected: in our normal lives too, what we did in the last hour usually influences our next hour’s activity. Let’s train a model now. We’ll use the CatBoost library.

# select the features to use
num_features =['hour','shift_lag_1', 'shift_lag_3',
       'shift_lag_6','shift_lag_24','shift_lag_48','rolling_median_3',
       'rolling_median_5', 'rolling_median_10', 'rolling_median_15',
       'rolling_median_24', 'rolling_median_48']
cat_features=['location_grid','day']
features = num_features + cat_features

CatBoost requires separate lists of numerical and categorical features. One of the advantages of using CatBoost is that it can handle raw categorical features internally.

# train the model on log transformed target variable
cat_train = catboost.Pool(
            data=xtrain[features],
            label=np.log1p(xtrain['rider_count']),
            cat_features=cat_features)
cat_test = catboost.Pool(
            data=xvalid[features],
            label=np.log1p(xvalid['rider_count']),
            cat_features=cat_features)

model = catboost.CatBoostRegressor(
    iterations=10000,
    learning_rate=0.01,
    colsample_bylevel=0.7,
    subsample=0.7,
    depth=4,
    random_seed=1,
    eval_metric='RMSE',
)

model.fit(cat_train,
          eval_set=[cat_test],
          use_best_model=True,
          verbose_eval=500,
          early_stopping_rounds=50)

0:	learn: 1.1837873	test: 1.2289138	best: 1.2289138 (0)	total: 55.5ms	remaining: 9m 14s
500:	learn: 0.3374208	test: 0.3386316	best: 0.3386316 (500)	total: 3.35s	remaining: 1m 3s
1000:	learn: 0.3236093	test: 0.3259324	best: 0.3259324 (1000)	total: 6.74s	remaining: 1m
1500:	learn: 0.3163504	test: 0.3198157	best: 0.3198157 (1500)	total: 15s	remaining: 1m 25s
2000:	learn: 0.3122722	test: 0.3172888	best: 0.3172888 (2000)	total: 18.8s	remaining: 1m 15s
2500:	learn: 0.3094581	test: 0.3157580	best: 0.3157580 (2500)	total: 23.5s	remaining: 1m 10s
3000:	learn: 0.3072566	test: 0.3147646	best: 0.3147646 (3000)	total: 28s	remaining: 1m 5s
3500:	learn: 0.3054573	test: 0.3140353	best: 0.3140353 (3500)	total: 31.5s	remaining: 58.5s
4000:	learn: 0.3038844	test: 0.3135471	best: 0.3135471 (4000)	total: 38.2s	remaining: 57.2s
4500:	learn: 0.3025143	test: 0.3131117	best: 0.3131117 (4500)	total: 43.7s	remaining: 53.3s
5000:	learn: 0.3012333	test: 0.3127479	best: 0.3127479 (5000)	total: 49.8s	remaining: 49.8s
5500:	learn: 0.3000466	test: 0.3124487	best: 0.3124487 (5497)	total: 53.1s	remaining: 43.5s
6000:	learn: 0.2989646	test: 0.3122281	best: 0.3122273 (5984)	total: 56.4s	remaining: 37.6s
6500:	learn: 0.2978673	test: 0.3120065	best: 0.3120020 (6489)	total: 1m	remaining: 32.7s
7000:	learn: 0.2969017	test: 0.3117524	best: 0.3117523 (6998)	total: 1m 4s	remaining: 27.7s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 0.3116317148
bestIteration = 7356

Let’s check the model performance on validation set.

pred = np.expm1(model.predict(xvalid[features]))

print(f'MAPE error: {metrics.mean_absolute_percentage_error(xvalid["rider_count"], pred)*100:.2f}%')
print(f"Model Test MAE: {metrics.mean_absolute_error(xvalid['rider_count'], pred)}")
calculate_adj_rsquared(xvalid['rider_count'], pred, xvalid)

MAPE error: 34.32%
Model Test MAE: 3.0848148214835884

Adjusted R^2: 0.9771

We see the CatBoost model provides a significant improvement over the baseline: MAPE dropped from about 60% to 34% (roughly a 43% relative reduction) and adjusted R² improved from 0.94 to 0.98.

Let’s check the feature importance plot using SHAP. SHAP uses Shapley values from game theory to attribute each feature’s contribution to the predictions.

## check feature importance
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(xvalid[features])
shap.summary_plot(shap_values, xvalid[features], max_display=10)

We gather the following insights from the shap importance plot:

  • Higher values of the shift_lag_1 feature lead to higher values of rider_count
  • Higher values of the shift_lag_24 feature lead to higher values of rider_count
  • Similarly, for all other features we see a monotonic relationship between the input feature and the target variable

In regression problems, it is usually good practice to plot the errors and inspect their distribution. Let’s do it.

metrics.PredictionErrorDisplay.from_predictions(
    xvalid['rider_count'],
    pred,
    kind="actual_vs_predicted",
    scatter_kwargs={"alpha": 0.5},
)

I am not surprised looking at this plot. Since we already saw the model performing well, we expect the actual and predicted values to correlate strongly.

One last thing to check: how does our model do on the test set? The test set was not touched during the entire modeling process. Usually, the performance we see on the test set is taken as a benchmark for the model’s performance in the real world (post deployment).

# Prediction on Test
pred = np.expm1(model.predict(xtest[features]))
print(f'MAPE error: {metrics.mean_absolute_percentage_error(xtest["rider_count"], pred)*100:.2f}%')
print(f"Model Test MAE: {metrics.mean_absolute_error(xtest['rider_count'], pred)}")
calculate_adj_rsquared(xtest['rider_count'], pred, xtest)

MAPE error: 34.28%
Model Test MAE: 3.25369541049924

Adjusted R^2: 0.9714

We should be happy to find the model’s performance stable across the validation and test sets. The choice of metrics in a regression problem depends on the data scientist, but in my experience I almost always include the following:

  • Mean Absolute Error (MAE): it is less affected by outliers and is expressed in the same units as the target variable. Since our target distribution is skewed, MAE fits our needs well.
  • Mean Absolute Percentage Error (MAPE): it is easier to communicate to stakeholders in the company, since it can be framed in terms of accuracy. For example, our model above is roughly 66% accurate in its predictions (see the sketch below).
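As a quick sanity check on that accuracy framing (an informal convention, accuracy ≈ 100% − MAPE, not a formal metric), using the test predictions from above:

# informal stakeholder-friendly framing: accuracy ~= 100% - MAPE
mape = metrics.mean_absolute_percentage_error(xtest['rider_count'], pred) * 100
print(f"Approximate accuracy: {100 - mape:.0f}%")  # ~66% given MAPE ~= 34.28%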

6. Improvement Ideas

Until this point, we have trained and evaluated the model. All good so far! But what if you were asked what additional data you would need to further improve the model?

I can think of the following data sources to use:

  • Holiday information
  • Weather information
  • Distance to the nearest point of attraction
  • Distance from the grid to the sea
  • Population information
  • Competitor coverage information

7. Summary

Solving a machine learning problem is a time-consuming process. There are several steps a data scientist should keep in mind to make sure the model being delivered is performant.

In this post, we worked through a case study on a time series problem that was handed to us as a business problem. Having worked as a data scientist at several product companies, I can confirm that situations like this are quite common: you rarely get a neatly structured data science problem to solve.

In case you have questions or suggestions, feel free to write them below in the comments section. Stay tuned for more posts :)

By Manish Saraswat

Experienced professional with over 9 years of experience building scalable microservices and data products powered by recommendation, search and ranking models across different markets in LATAM, APAC and Europe. Loves sharing knowledge and helping people learn something new about AI.
