Model Step 2 - Model Card

Background

A model card provides brief, transparent, responsible reporting for a trained machine learning model. Expand the code chunk below to inspect the code that loads the model, training data, and evaluation data.

Code

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.0.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(vetiver)
library(pins)
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## The following object is masked from 'package:readr':
## 
##     spec
library(glue)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows


board <- pins::board_rsconnect()
## Connecting to RSC 2024.02.0 at <https://colorado.posit.co/rsc>
v <- vetiver_pin_read(board, params$name, version = params$version)
v_meta <- pin_meta(board, params$name)

con <- odbc::dbConnect(odbc::odbc(), "Content DB", timeout = 10)
bike_model_data <- tbl(con, "bike_model_data")


train_start_date <- lubridate::as_date(v$metadata$user$train_dates[1])
train_end_date <- lubridate::as_date(v$metadata$user$train_dates[2])
test_start_date <- lubridate::as_date(v$metadata$user$test_dates[1])
test_end_date <- lubridate::as_date(v$metadata$user$test_dates[2])

data <- bike_model_data %>%
  filter(date >= train_start_date) %>%
  collect() %>%
  mutate(
    data_type = case_when(
      date <= train_end_date ~ "train",
      date <= test_end_date ~ "test",
      TRUE ~ "latest"
    )
  )

train_data <- data %>%
  filter(data_type == "train")

test_data <- data %>%
  filter(data_type == "test")

Model details

Developed by Sam Edwardes (RStudio - Solutions Engineer)
A ranger regression modeling workflow using 7 features
This model was developed to predict how many capital bikeshare bikes will be available for a given station based on the time of day, and day of the week.
The data used to train the model was obtained from the capital bikeshare API.
Version 112432 of this model was published at 2024-04-08 08:02:00
If you have questions about this model, please create an issue on our GitHub repo: https://github.com/sol-eng/bike_predict/issues.

Intended use

The primary intended use of this model is to demonstrate an end-to-end data science workflow using RStudio Connect.

Metrics

The metrics used to evaluate this model are:
- Root Mean Squared Error (RMSE),
- R Squared (RSQ), and
- Mean Absolute Error (MAE).
We chose these metrics because because they are the most common metrics for assessing the performance of regression models, and that they are well understood by the data science community.

Training data & evaluation data

The training data set for this model was was obtained from the capital bikeshare API. The Model is retrained on a daily basis using refreshed data.

Training date range: 2023-07-23 to 2023-08-02.
Evaluation date range: 2023-08-03 to 2023-08-04.

The data set for this model has the “prototype” or signature:

Code

glimpse(v$ptype)

Rows: 1
Columns: 7
$ id    <chr> "101"
$ hour  <dbl> 0
$ date  <date> 2023-07-23
$ month <dbl> 7
$ dow   <chr> "Sunday"
$ lat   <dbl> 38.89483
$ lon   <dbl> -76.98763

Below is a summary of the training data:

Code

train_data %>%
  select(-n_bikes) %>%
  skimr::skim()

Data summary

Name	Piped data
Number of rows	35110
Number of columns	8
_______________________
Column type frequency:
character	3
Date	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
id	1	2	3	266
dow	1	6	9	7
data_type	1	5	5	1

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2023-07-23	2023-08-02	2023-07-28	11

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hour	1	11.00	6.90	0.0	4.50	10.00	17.50	22.00	▇▅▅▅▇
month	1	7.18	0.39	7.0	7.00	7.00	7.00	8.00	▇▁▁▁▂
lat	1	38.92	0.06	38.8	38.89	38.91	38.95	39.12	▂▇▃▁▁
lon	1	-77.04	0.05	-77.2	-77.06	-77.03	-77.01	-76.91	▁▂▇▇▁

Below is a summary of the evaluation data:

Code

test_data %>%
  select(-n_bikes) %>%
  skimr::skim()

Data summary

Name	Piped data
Number of rows	5586
Number of columns	8
_______________________
Column type frequency:
character	3
Date	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
id	1	2	3	266
dow	1	6	8	2
data_type	1	4	4	1

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2023-08-03	2023-08-04	2023-08-03	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hour	1	9.71	6.39	0.0	4.00	10.00	14.00	22.00	▇▅▅▅▃
month	1	8.00	0.00	8.0	8.00	8.00	8.00	8.00	▁▁▇▁▁
lat	1	38.92	0.06	38.8	38.89	38.91	38.95	39.12	▂▇▃▁▁
lon	1	-77.04	0.05	-77.2	-77.06	-77.03	-77.01	-76.91	▁▂▇▇▁

Quantitative analyses

Overall model performance against the model evaluation data.

Code

# Explicitly define packages used for model metrics by vetiver so that Connect 
# is able to redeploy.
library(slider)

latest_metrics %>%
  kable() %>%
  kable_material()

.index	.n	.metric	.estimator	.estimate
2023-08-03	3192	rmse	standard	4.2824583
2023-08-03	3192	rsq	standard	0.3313428
2023-08-03	3192	mae	standard	3.3823794
2023-08-04	2394	rmse	standard	4.1438817
2023-08-04	2394	rsq	standard	0.2726845
2023-08-04	2394	mae	standard	3.3005535

A comparison of the model prediction vs. the ground truth for the model evaluation data.

Code

preds %>%
  ggplot(aes(n_bikes, .pred)) +
  geom_abline(slope = 1, lty = 2, color = "gray60", size = 1.2) +
  geom_jitter(alpha = 0.5, width = 0.4) +
  labs(
    title = "Model Performance by Prediction",
    subtitle = glue("Using the evaluation data from {test_start_date} to {test_end_date}"),
    x = "Acutal Number of Bikes",
    y = "Predicted Number of Bikes"
  )

A visual assessment of model performance by location.

Code

preds %>%
  mutate(.resid = n_bikes - .pred) %>%
  ggplot(aes(lon, lat, color = .resid)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient2() +
  coord_fixed() + 
  labs(
    title = "Model Performance by Location",
    subtitle = glue("Using the evaluation data from {test_start_date} to {test_end_date}"),
    x = "Longitude",
    y = "Latitude",
    color = "Residual"
  )

Ethical considerations

We considered the ethical impact the use of the model may have. We assessed that the overall risk due to an incorrect model prediction is relatively low.

Caveats & recommendations

This model does not guarantee that a bike will be available at a station.
We recommend using this model for demonstration purposes only.
To see the raw source code that produced this document visit https://github.com/sol-eng/bike_predict/blob/main/content/02-model/02-model-metrics/document.qmd.