{targets}
Department of Health Sciences
The Roux Institute
Northeastern University
2023-05-31
What is {targets}?
“a Make-like pipeline tool for statistics and data science in R”
01-data.R
library(tidyverse)
data <- read_csv("data.csv", col_types = cols()) %>%
filter(!is.na(Ozone))
write_rds(data, "data.rds")

02-model.R
library(tidyverse)
data <- read_rds("data.rds")
model <- lm(Ozone ~ Temp, data) %>%
coefficients()
write_rds(model, "model.rds")

03-plot.R
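A plausible sketch of 03-plot.R (not shown in full here), reconstructed from the plot_model() function below:

library(tidyverse)
data <- read_rds("data.rds")
model <- read_rds("model.rds")
ggplot(data) +
  geom_point(aes(x = Temp, y = Ozone)) +
  geom_abline(intercept = model[1], slope = model[2])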
{targets}: The basics

{targets} workflow

R/functions.R
get_data <- function(file) {
read_csv(file, col_types = cols()) %>%
filter(!is.na(Ozone))
}
fit_model <- function(data) {
lm(Ozone ~ Temp, data) %>%
coefficients()
}
plot_model <- function(model, data) {
ggplot(data) +
geom_point(aes(x = Temp, y = Ozone)) +
geom_abline(intercept = model[1], slope = model[2])
}

{targets} workflow

_targets.R
library(targets)
tar_source()
tar_option_set(packages = c("tidyverse"))
list(
tar_target(file, "data.csv", format = "file"),
tar_target(data, get_data(file)),
tar_target(model, fit_model(data)),
tar_target(plot, plot_model(model, data))
)

Run tar_make() to run the pipeline.
Tip
use_targets() will generate a _targets.R script for you to fill in.
{targets} workflow

Targets are “hidden” away where you don’t need to manage them:
├── _targets.R
├── data.csv
├── R/
│   └── functions.R
└── _targets/
    └── objects/
        ├── data
        ├── model
        └── plot
Tip
You can of course have multiple files in R/; tar_source() will source them all.
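For example (the "code" directory name is just an illustration):

tar_source()        # sources every R script under R/ by default
tar_source("code")  # or point it at a different directory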
How I use {targets}

1. Write functions in R/.
2. Run use_targets() and edit _targets.R accordingly, so that I list the data file as a target and clean_data as the output of the cleaning function.
3. Run tar_make().
4. Run tar_load(clean_data) so that I can work on the next step of my workflow.

Tip
I usually include library(targets) in my project .Rprofile so that I can always call tar_load() on the fly
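A minimal sketch of such a project .Rprofile:

# .Rprofile at the project root
library(targets)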
_targets.R tips and tricks

list(
tar_target(
data_file,
"data/raw_data.csv",
format = "file"
),
tar_target(
raw_data,
read.csv(data_file)
),
tar_target(
clean_data,
clean_data_function(raw_data)
)
)

Tip
I like to pair my functions/targets by name so that the workflow is clear to me
_targets.R tips and tricks

preparation <- list(
...,
tar_target(
clean_data,
clean_data_function(raw_data)
)
)
modeling <- list(
tar_target(
linear_model,
linear_model_function(clean_data)
),
...
)
list(
preparation,
modeling
)

Tip
By grouping the targets into lists, I can easily comment out chunks of the pipeline to not run the whole thing
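For example, to skip the modeling targets temporarily (a sketch):

list(
  preparation
  # modeling  # commented out so tar_make() only runs the preparation targets
)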
_targets.R tips and tricks
Tip
In big projects, I comment my _targets.R file so that I can use the RStudio outline pane to navigate the pipeline.
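The outline pane picks up comments ending in four or more dashes, so a commented pipeline might look like this (a sketch):

list(
  # Data preparation ----
  tar_target(clean_data, clean_data_function(raw_data)),

  # Modeling ----
  tar_target(linear_model, linear_model_function(clean_data))
)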
{targets} functions

- use_targets() gets you started with a _targets.R script to fill in
- tar_make() runs the pipeline and saves the results in _targets/objects/
- tar_make_future() runs the pipeline in parallel
- tar_load() loads the results of a target into the global environment (e.g., tar_load(clean_data))
- tar_read() returns the results of a target so you can assign them (e.g., dat <- tar_read(clean_data))
- tar_visnetwork() creates a network diagram of the pipeline
- tar_outdated() checks which targets need to be updated
- tar_prune() deletes stored results of targets that are no longer in _targets.R
- tar_destroy() deletes the _targets/ directory if you need to burn everything down and start again
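A typical interactive session with these functions (a sketch, using targets from the earlier pipeline):

tar_visnetwork()  # inspect the dependency graph
tar_outdated()    # see which targets would rerun
tar_make()        # run only what is outdated
tar_read(model)   # pull a result without attaching it to the global environment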
{tarchetypes}: reports

Render documents that depend on targets loaded with tar_load() or tar_read():

- tar_render() renders an R Markdown document
- tar_quarto() renders a Quarto document (or project)

Warning
It can’t detect dependencies like tar_load(ends_with("plot"))
What does report.qmd look like?

---
title: "My report"
---
```{r}
library(targets)
tar_load(results)
tar_load(plots)
```
There were `r results$n` observations with a mean age of `r results$mean_age`.
```{r}
library(ggplot2)
plots$age_plot
```

Because report.qmd depends on results and plots, it will only be re-rendered if either of those targets changes.
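To include the report in the pipeline, the corresponding target might look like this (a sketch, assuming report.qmd sits at the project root):

list(
  # ... the targets that produce results and plots ...
  tarchetypes::tar_quarto(report, "report.qmd")
)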
Tip
The extra_files = argument can be used to force it to depend on additional non-target files
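For example (a sketch; style.css is a hypothetical non-target file):

tarchetypes::tar_quarto(
  report,
  "report.qmd",
  extra_files = "style.css"  # rerender the report when this file changes
)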
{tarchetypes}: branching

Using data from the National Longitudinal Survey of Youth, we want to investigate the relationship between age at first birth and hours of sleep on weekdays and weekends, among moms and dads separately.
Create (and name) a separate target for each combination of sleep variable ("sleep_wkdy", "sleep_wknd") and sex (male: 1, female: 2):
targets_1 <- list(
tar_target(
model_1,
model_function(outcome_var = "sleep_wkdy", sex_val = 1, dat = dat)
),
tar_target(
coef_1,
coef_function(model_1)
)
)

… and so on for the other three combinations.
Result for the first combination:
[1] 0.00734859
Use tarchetypes::tar_map() to map over the combinations for you (static branching):
targets_2 <- tar_map(
values = tidyr::crossing(
outcome = c("sleep_wkdy", "sleep_wknd"),
sex = 1:2
),
tar_target(
model_2,
model_function(outcome_var = outcome, sex_val = sex, dat = dat)
),
tar_target(
coef_2,
coef_function(model_2)
)
)
tar_load(starts_with("coef_2"))
c(coef_2_sleep_wkdy_1, coef_2_sleep_wkdy_2, coef_2_sleep_wknd_1, coef_2_sleep_wknd_2)
[1] 0.00734859 0.01901772 0.02595109 0.01422970
Use tarchetypes::tar_combine() to combine the results of a call to tar_map():
combined <- tar_combine(
combined_coefs_2,
targets_2[["coef_2"]],
command = vctrs::vec_c(!!!.x)
)
tar_read(combined_coefs_2)
coef_2_sleep_wkdy_1 coef_2_sleep_wkdy_2 coef_2_sleep_wknd_1 coef_2_sleep_wknd_2
0.00734859 0.01901772 0.02595109 0.01422970
command = vctrs::vec_c(!!!.x) is the default, but you can supply your own function to combine the results
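For instance, to keep each branch's result as an element of a named list instead (a sketch):

combined <- tar_combine(
  combined_coefs_2,
  targets_2[["coef_2"]],
  command = list(!!!.x)  # each branch becomes one list element
)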
Use the pattern = argument of tar_target() (dynamic branching):
targets_3 <- list(
tar_target(
outcome_target,
c("sleep_wkdy", "sleep_wknd")
),
tar_target(
sex_target,
1:2
),
tar_target(
model_3,
model_function(outcome_var = outcome_target, sex_val = sex_target, dat = dat),
pattern = cross(outcome_target, sex_target)
),
tar_target(
coef_3,
coef_function(model_3),
pattern = map(model_3)
)
)
tar_read(coef_3)
coef_3_85bbb1b6 coef_3_c47db1e2 coef_3_5ba8b6ec coef_3_19c76a86
0.00734859 0.01901772 0.02595109 0.01422970
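Individual branches can still be read by position, which helps with the cryptic names:

tar_read(coef_3, branches = 1)  # just the first branch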
| Dynamic | Static |
|---|---|
| Pipeline creates new targets at runtime. | All targets defined in advance. |
| Cryptic target names. | Friendly target names. |
| Scales to hundreds of branches. | Does not scale as easily for tar_visnetwork() etc. |
| No metaprogramming required. | Familiarity with metaprogramming is helpful. |
The two can be combined: tar_map(values = ..., tar_target(..., pattern = map(...)))

{tarchetypes}: repetition

tar_rep() repeats a target multiple times with the same arguments:
targets_4 <- list(
tar_rep(
bootstrap_coefs,
dat |>
dplyr::slice_sample(prop = 1, replace = TRUE) |>
model_function(outcome_var = "sleep_wkdy", sex_val = 1, dat = _) |>
coef_function(),
batches = 10,
reps = 10
)
)

The pipeline gets split into batches × reps chunks, each with its own random seed.
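The aggregated results can then be summarized as usual; for example (a sketch, assuming bootstrap_coefs aggregates to a numeric vector of coefficients):

quantile(tar_read(bootstrap_coefs), c(0.025, 0.975))  # 95% bootstrap interval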
{tarchetypes}: mapping over iterations

sensitivity_scenarios <- tibble::tibble(
error = c("small", "medium", "large"),
mean = c(1, 2, 3),
sd = c(0.5, 0.75, 1)
)

tar_map_rep() repeats a target multiple times with different arguments.
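The call that produced the output below might look like this (a sketch; simulate_coef_function() is a hypothetical helper that returns a one-row data frame with a coef column):

targets_5 <- list(
  tarchetypes::tar_map_rep(
    sensitivity_analysis,
    command = simulate_coef_function(dat, mean = mean, sd = sd),  # hypothetical helper
    values = sensitivity_scenarios,
    batches = 10,
    reps = 10
  )
)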
{tarchetypes}: mapping over iterations

          coef error mean  sd tar_batch tar_rep tar_seed tar_group
1 0.0061384611 small 1 0.5 1 1 -1018279263 2
2 -0.0005346553 small 1 0.5 1 2 -720048594 2
3 0.0073674844 small 1 0.5 1 3 -1478913096 2
4 0.0039254289 small 1 0.5 1 4 -1181272269 2
5 0.0108489430 small 1 0.5 1 5 135877686 2
6 0.0029473286 small 1 0.5 1 6 -564559689 2
Ideal for sensitivity analyses that require multiple iterations of the same pipeline with different parameters
tar_read(sensitivity_analysis) |>
dplyr::group_by(error) |>
dplyr::summarize(q25 = quantile(coef, .25),
median = median(coef),
                   q75 = quantile(coef, .75))

  error         q25      median         q75
1 large 0.001427986 0.007318120 0.011399772
2 medium 0.004158480 0.007770285 0.011367160
3 small 0.004058926 0.006614599 0.009004322
{targets} is a great tool for managing complex workflows.

{tarchetypes} makes it even more powerful.

Thanks!