By Jonas Kristoffer Lindeløv (blog, profile).
Updated: 23 September, 2019 (source, changes).
Note: This is the non-interactive version. Click here to go to the interactive version.
All too often, researchers use the normal distribution to model reaction times. Hopefully, this document lowers the burden of shifting to better distributions. Because the normal distribution is such a poor fit for RTs (as you will see), the reward of shifting is great!
The cheat sheet below gives an overview (get it in PNG or PDF). You can also skip straight to the interactive part or to the applied code example.
For reaction times, I’ll argue that there are three useful parameters to describe their distribution: Difficulty [1], Onset [2], and Scale [3]. In the figure below, the blue curves show the effect of increasing each of these parameters. Try manipulating them yourself here.
Difficulty: A dominant feature of RTs is that the standard deviation increases approximately linearly with the mean (Wagenmakers et al., 2007). Reaction times are more dispersed in conditions with longer reaction times (e.g., incongruent trials) and more condensed in conditions with shorter reaction times (e.g., congruent trials). For interpretability, it is convenient to have one parameter that changes both, rather than having to regress on the mean and SD separately and always interpret their interaction. Ideally, the value of the difficulty parameter should sit somewhere around the center of the distribution (the highest-density region, e.g., the median or modal RT).
Onset: Moves the whole distribution towards the right (longer RTs) or left (shorter RTs) without altering its shape.
Scale: Stretches or squeezes the distribution over and above difficulty, without (severely) changing the location. This can be used to tune the distribution to population-specific or experiment-specific characteristics. For example, you would expect wider distributions in patient samples and for dual tasks.
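As a rough stand-in for the interactive figure, here is a small simulation sketch using the shifted log-normal, where meanlog plays the role of difficulty, the added shift plays the role of onset, and sdlog plays the role of scale (all values below are arbitrary):

set.seed(1)
n = 1e5
baseline = 0.2 + rlnorm(n, meanlog = -1.0, sdlog = 0.3)
harder   = 0.2 + rlnorm(n, meanlog = -0.6, sdlog = 0.3)  # higher difficulty: longer AND more dispersed
later    = 0.3 + rlnorm(n, meanlog = -1.0, sdlog = 0.3)  # later onset: same shape, shifted right
wider    = 0.2 + rlnorm(n, meanlog = -1.0, sdlog = 0.5)  # larger scale: stretched, similar location

plot(density(baseline), xlim = c(0, 1.5), main = 'Shifted log-normal RTs (sketch)')
lines(density(harder), col = 'blue')
lines(density(later), col = 'red')
lines(density(wider), col = 'green')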
It is nice when each parameter does only one thing. If they do several things at once, or if multiple parameters do the same thing, I call them messy, because you will need to tune many parameters to achieve one of the above effects. And that’s a mess for interpretability. However, if this behavior at the surface level is justified by an underlying theory that gives meaning to each parameter, I call them mechanism parameters.
With this terminology on the table, here are our three commandments:
- Bad: Distributions with messy parameters.
- Good: Distributions with an onset, a difficulty, AND a scale parameter.
- Good: Distributions with mechanism parameters.
According to these criteria, DDM, LBA, the shifted log-normal, and the inverse Gaussian are the winners on this list. If we also consider how easy it is to interpret the parameter values, I’d say that DDM and the shifted log-normal come out on top, but this is up for discussion.
All of the following datasets are from a single participant in a single condition.
Here are some fun challenges:
One-parameter champ: For the speed/accuracy and MRT data, try (1) fitting the parameters to one dataset, (2) changing the dataset, and (3) seeing how close you can get by changing only one parameter.
Fit champ: There is a special reward if you can fit the data so well that the mean squared error (MSE) drops below 40. Use the keyboard arrows to fine-tune parameters. The MSE in the plots is multiplied by 10,000 to give easy-to-read numbers.
Optimal champ: Use brms to find the optimal fits as shown in the code snippets. The empirical datasets are in the rtdists package. Use library(rtdists), data('rr98'), and data('speed_acc') to load them. Download the MRT data here. You can also try the (frequentist) gamlss package, though its results are harder to interpret.
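Put together, loading the two empirical datasets looks like this (the MRT data is a separate download, as noted above):

library(rtdists)
data('rr98')        # see ?rr98 for a description
data('speed_acc')   # see ?speed_acc for a description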
Models responses as coming from a single mean with symmetrical dispersion towards faster and slower RTs. In most cases, the normal distribution is the worst distribution for fitting reaction times. For example, for typical reaction times, your fitted model will predict that more than 5% of the reaction times are faster than the fastest RT ever recorded!
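As a back-of-the-envelope illustration (with made-up but typical numbers, not from any dataset in this post): a Gaussian with a mean of 600 ms and an SD of 250 ms puts more than 5% of its mass below 200 ms.

# Hypothetical but typical values: mean RT = 0.6 s, SD = 0.25 s
pnorm(0.200, mean = 0.600, sd = 0.250)  # ~0.055: >5% of predicted RTs faster than 200 ms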
Yet, the Gaussian is by far the most frequently used model of reaction times because it is the default in most statistical tests, packages, and software: t-tests, ANOVA, linear regression, lm, lme4::lmer, and brms::brm. Also, if your pre-processing includes computing means for each participant or condition and you use those means as data, you’ve subscribed to the Gaussian model (and thrown away a lot of wonderful data).
Parameters:
It is easy to infer these parameters from data. Here, we model the mean (μ) by condition. The standard deviation (σ) is fixed across conditions. See notes on modeling and the applied example for more details.
library(brms)
fit = brm(rt ~ condition, rt_data, family=gaussian())
If you believe that responses are generated by the sum of two independent processes, (1) a Gaussian and (2) a decaying exponential, the ex-Gaussian is exactly that. Most people don’t believe this, but they use the ex-Gaussian in a purely descriptive way because it usually fits RTs excellently. See Wikipedia.
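That additive story is easy to simulate; a minimal sketch with arbitrary values:

# An ex-Gaussian RT is the sum of a Gaussian and an independent exponential component
set.seed(1)
n = 1e4
rt = rnorm(n, mean = 0.4, sd = 0.05) + rexp(n, rate = 1 / 0.2)  # exponential mean (tau) = 0.2 s
hist(rt, breaks = 60, main = 'Simulated ex-Gaussian RTs (sketch)')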
Parameters:
It is easy to infer these parameters from data. Here, we model μ by condition. σ and β are fixed across conditions. See notes on modeling and the applied example for more details.
library(brms)
fit = brm(rt ~ condition, rt_data, family=exgaussian())
Adds a skew to the normal (Gaussian) distribution via an extra skewness parameter. In my experience, it usually fits reaction times more poorly than the other options here, though it is still superior to the Gaussian.
Parameters:
It is easy to infer these parameters from data. Here, we model μ by condition. σ and α are fixed across conditions. See notes on modeling and the applied example for more details.
library(brms)
fit = brm(rt ~ condition, rt_data, family=skew_normal())
When using logarithms, you model percentage increases or decreases instead of absolute differences. That is, rather than saying “RTs are 40 ms slower in incongruent trials”, your parameter(s) say that “RTs are 7% slower in incongruent trials”. This embodies the intuition that a difference of 40 ms can be huge for simple reaction times but negligible for RTs around 10 seconds. Mathematically, it is about multiplicative rather than additive relationships, which makes some sense when talking about RTs, so I’d say that the shifted log-normal is somewhere between descriptive and mechanistic. In general, log-like models make a lot of sense for most positive-only variables like RT.
It’s called “log-normal” because the parameters are the mean and SD of the log-transformed RTs, which are assumed to follow a normal (Gaussian) distribution. For example, for simple RTs you will often see an intercept near exp(-1.5) = 0.22 seconds.
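To make the percentage interpretation concrete, here is a small sketch with hypothetical numbers (an intercept of -1.5 and an effect of 0.07 on the log scale):

exp(-1.5)         # ~0.22 s median RT (ignoring the shift)
exp(-1.5 + 0.07)  # ~0.24 s
exp(0.07) - 1     # ~0.073, i.e. roughly a 7% slowdown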
Parameters:
It is easy to infer these parameters from data. Here, we model μ (the difficulty) by condition. It is on a log scale, so use exp() to back-transform. σ (the scale) and ndt (the onset) are fixed across conditions. See notes on modeling and the applied example for more details.
library(brms)
fit = brm(rt ~ condition, rt_data, family=shifted_lognormal())
… also known as the (Shifted) Wald. It is not as “inverse” of the Gaussian as it sounds. It is closely related to the Wiener / diffusion decision model and the LBA model in the following way (quote from Wikipedia):
The name can be misleading: it is an “inverse” only in that, while the Gaussian describes a Brownian motion’s level at a fixed time, the inverse Gaussian describes the distribution of the time a Brownian motion with positive drift takes to reach a fixed positive level.
In other words, if you have time on the x-axis and start a lot of walkers at y = 0, the Gaussian describes the distribution of their vertical location at a fixed time (t_i), while the inverse Gaussian describes the distribution of the times at which they first cross a fixed threshold on the y-axis. Like the Wiener and LBA models, the inverse Gaussian can be thought of as modeling a decision process, though without incorporating the correctness of the response.
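A minimal simulation sketch of this idea (arbitrary values, discrete-time approximation of the Brownian motion): start many walkers at y = 0 with positive drift, record when each first crosses the threshold, and the resulting first-passage times are approximately inverse-Gaussian distributed.

set.seed(1)
n_walkers = 2000
dt = 0.001        # time step in seconds
drift = 2         # upward drift per second
threshold = 1     # the fixed positive level

first_passage = replicate(n_walkers, {
  y = 0; t = 0
  while (y < threshold) {
    y = y + drift * dt + sqrt(dt) * rnorm(1)  # Brownian motion with drift
    t = t + dt
  }
  t
})
hist(first_passage, breaks = 60, main = 'Simulated first-passage times (sketch)')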
Parameters:
It is easy to infer the parameters of the (non-shifted) inverse Gaussian / Wald from data. Here, we model μ by condition. The shape parameter is fixed across conditions. See notes on modeling and the applied example for more details.
library(brms)
fit = brm(rt ~ condition, rt_data, family=inverse.gaussian())
Also called the Diffusion Decision Model (DDM), this is probably the most popular mechanism-focused model. It applies to two-alternative forced-choice (2AFC) scenarios and requires accuracy data as well (responses scored as correct/wrong). It models RTs as a decision being made when a random walker crosses one of two thresholds (boundaries). Its clear mechanistic interpretability is a strength, but it requires some effort to understand and use. Read Ratcliff & McKoon (2008) for a motivation of the model. I like this visualization.
The diffusion idea also underlies the inverse Gaussian and the LBA model.
Parameters:
All of these are mechanism-like.
See this three-part brms tutorial by Henrik Singmann on how to fit it using brms::brm and do regression on these parameters. It usually takes days or weeks to fit, but Wagenmakers (2007) made the EZ diffusion calculator, which approximates the parameters in the simplest scenario. The HDDM Python package can do it as well.
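A minimal sketch of what such a model can look like in brms (the data frame and column names here are hypothetical; see Singmann's tutorial for sensible priors, initial values, and diagnostics):

library(brms)
# Hypothetical data frame 'rt_data' with columns rt (in seconds), dec (0 = lower
# boundary, 1 = upper boundary, e.g. error/correct), and condition.
formula = bf(rt | dec(dec) ~ condition,   # drift rate varies by condition
             bs ~ 1, ndt ~ 1, bias ~ 1)   # boundary separation, non-decision time, bias
fit = brm(formula, data = rt_data, family = wiener())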
This model resembles the Wiener diffusion model in that it models latent “movements” towards decision thresholds, and it is very much focused on mechanisms. It likewise applies to two-alternative forced-choice (2AFC) scenarios and requires accuracy data as well (correct/incorrect responses). Although it involves a lot of parameters, the model itself is (at least supposed to be) simpler than the diffusion model, as explained in Brown & Heathcote (2008). A nice visualization from that paper escaped the paywall here.
Parameters:
All of these are mechanism-like.
Fit it using glba::lba(). There are also some helpful LBA distribution functions in the rtdists package.
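For example, rtdists can simulate from a two-accumulator LBA (all values below are arbitrary):

library(rtdists)
# A = start-point range, b = threshold, t0 = non-decision time (onset),
# mean_v / sd_v = drift-rate mean and SD for each accumulator.
sim = rLBA(1000, A = 0.5, b = 1, t0 = 0.2, mean_v = c(2.4, 1.2), sd_v = c(1, 1))
head(sim)  # a data frame with rt and response (which accumulator won)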
The gamma distribution is so important in statistics that it has probably spilled over to RT research as well. Although it can fit the data, its parameters are very hard to interpret in relation to RTs, so I have little more to say about it. One could easily make a shifted gamma, but I haven’t seen it used for the analysis of reaction times.
Parameters:
Both parameters are difficulty-like in isolation, so their interaction determines scale and location. I think this makes them messy.
It is easy to infer these parameters from data. Here, we model μ by condition. The shape parameter is fixed across conditions. See notes on modeling and the applied example for more details.
library(brms)
fit = brm(rt ~ condition, rt_data, family=Gamma())
… also called the three-parameter Weibull when a shift parameter is included. For RTs, it can be understood as modeling an underlying rate with which underlying non-response processes are “turned into” responses over time. If the shape parameter equals 1, this rate is constant; if the shape is below 1, it decreases over time. The parameters do not correspond to anything intuitive, though.
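To see what that rate looks like, here is a small sketch of the Weibull hazard function h(t) = (shape/scale) * (t/scale)^(shape - 1) (values are arbitrary):

hazard = function(t, shape, scale = 1) (shape / scale) * (t / scale)^(shape - 1)
curve(hazard(x, shape = 1),   from = 0.01, to = 2, ylab = 'rate', ylim = c(0, 3))  # constant rate
curve(hazard(x, shape = 0.8), from = 0.01, to = 2, add = TRUE, lty = 2)            # rate decreases over time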
Parameters:
It is easy to infer the parameters of the (non-shifted) Weibull from data. Here, we model μ by condition. It is on a log scale, so use exp() to back-transform. The shape parameter is fixed across conditions. See notes on modeling and the applied example for more details.
library(brms)
fit = brm(rt ~ condition, rt_data, family=weibull())
We load the Wagenmakers & Brown (2007) data and pre-process it:
# Load data
library(rtdists)
data(speed_acc)
# Remove bad trials
library(tidyverse)
data = speed_acc %>%
filter(rt > 0.18, rt < 3) %>% # Between 180 and 3000 ms
mutate_at(c('stim_cat', 'response'), as.character) %>% # from factor to char
filter(response != 'error', stim_cat == response) # Only correct responses
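A quick sanity check of what is left after trimming (the exact counts depend on the filters above):

nrow(data)       # roughly 28,000 trials remain
range(data$rt)   # should lie within the 0.18-3 s trim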
Now we use the shifted log-normal to ask: “Does the median RT differ by condition?” In other words, we model the difficulty (μ) here, leaving the scale (σ) and the onset (ndt) fixed. We include a random intercept per participant, but see my notes on formulas for RT models. In this example, it takes some time to run because we have around 28,000 data points.
library(brms)
fit = brm(formula = rt ~ condition + (1|id),
          data = data,
family = shifted_lognormal(),
file = 'fit_slog') # Save all that hard work.
Done!
Much like the interactive applets above, we can see how the parameters predict the actual data:
#plot(fit) # Check convergence and see posteriors
pp_check(fit) # Posterior predictive check
This is much better than a Gaussian, but there’s room for improvement, mostly because the estimated onset (ndt) is close to the earliest RT in the whole dataset, which for one participant (id #8) in one condition was 0.18 s - just above the trim. The mean shortest RT was 0.28 seconds, so that participant is not representative. Setting one ndt per participant would make more sense:
formula = bf(rt ~ condition + (1|id),
ndt ~ id)
Also, the shortest N RTs could be removed for each participant, if you can argue that they were accidental responses:
data %>%
group_by(id) %>%
arrange(rt) %>%
slice(4:n()) # Remove three shortest RTs
Leaving that aside, let’s see what we learned about the parameters of the shifted log-normal model. Intercept and conditionspeed together define μ (the difficulty). You can see sigma (the scale) and ndt (the onset) too, so we have posteriors for all three of our parameters!
summary(fit) # See parameters
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.69 0.04 -0.76 -0.61 1.01 508 672
conditionspeed -0.40 0.00 -0.41 -0.39 1.00 2184 2143
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.37 0.00 0.36 0.37 1.00 1839 2088
ndt 0.17 0.00 0.17 0.18 1.00 1576 1724
Because μ and σ describe the log-normal part of the distribution, they are on a log scale. This means that we expect a median RT of exp(-0.69) = 0.50 seconds (on top of the 0.17-second onset) for condition “accuracy” (the intercept), and a median RT of exp(-0.69 - 0.40) = 0.34 seconds (plus onset) for condition “speed”. If you get confused about these transformations, worry not: brms has you covered, no matter the distribution:
marginal_effects(fit) # Back-transformed parameter estimates
marginal_effects(fit, method='predict') # Same, but for responses
We are very certain that they respond faster for “speed” because the upper bound of the credible interval is much smaller than zero. We could test this more directly:
hypothesis(fit, 'conditionspeed < 0') # Numerically
I cut corners here for the sake of brevity. Instead of going with the default priors, I would have specified something sensible and run brm(..., prior = prior), mostly as regularization. There is so much data that most priors will be overwhelmed.
# See which priors you can set for this model
get_prior(rt ~ condition + (1|id), data, family = shifted_lognormal())
# Define them
prior = c(
set_prior('normal(-1, 0.5)', class = 'Intercept'), # around exp(-1) = 0.36 secs
set_prior('normal(0.4, 0.3)', class = 'sigma'), # SD of individual rts in log-units
  set_prior('normal(0, 0.3)', class = 'b'),  # exp(-1 + 0.3) - exp(-1) ≈ 0.13, i.e. roughly a 100-130 ms effect in either direction
set_prior('normal(0.3, 0.1)', class = 'sd') # some variability between participants
)
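These priors would then go into the same model as before; a sketch (the file name is just a placeholder):

fit = brm(rt ~ condition + (1|id),
          data = data,
          family = shifted_lognormal(),
          prior = prior,
          file = 'fit_slog_priors')  # placeholder file name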
I would probably also have checked that the pre-processing worked by plotting the response distributions of a few participants:
data_plot = filter(data, id %in% c(1, 8, 15))
ggplot(data_plot, aes(x=rt)) +
geom_histogram(aes(fill=condition), alpha=0.5, bins=60) +
facet_grid(~id) + # One panel per id
coord_cartesian(xlim=c(0, 1.6)) # Zoom in
Throughout this document, I have used a simple model, rt ~ condition, to demo code. For real-world data, you probably know a lot more about the data-generating process. A more typical recipe looks like this:
rt ~ condition + covariates + (condition | id) + (1|nuisance)
Here is an example of this recipe in use:
rt ~ congruent + trial_no + bad_lag1 + (congruent|id) + (1|counterbalancing)
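Plugged into brm(), this recipe could look like the sketch below (the data frame and family are placeholders; any of the RT families above would work):

fit = brm(rt ~ congruent + trial_no + bad_lag1 + (congruent | id) + (1 | counterbalancing),
          data = rt_data,  # hypothetical dataset containing these columns
          family = shifted_lognormal())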
Let’s unpack this:
condition is our contrast of interest. This can be a variable coded as congruent/incongruent, the number of items in a visual search, or something else. It can also be an interaction. I just did an analysis where I had a congruent:condition term to model a congruency effect per condition.
covariates could be, e.g., trial_no + bad_lag1, i.e., the effect of time (people typically get faster with practice) and of longer RTs on the first trial in a block (you need to code this variable). It is often a good idea to scale these to help the model fit, and to center them (possible for continuous and binary predictors) so that the effect of condition represents the mean of the other conditions rather than a particular combination. You can use scale() to scale and center.
(condition | id) models the expected individual differences in overall rt and allows the effect of condition to vary between participants. The practical effect of specifying a random effect rather than a fixed condition:id effect is that you get shrinkage in the former, which naturally imposes a bit of control for outliers. Unfortunately, complicated random-effects structures can increase computation time considerably.
(1 | nuisance) is a way to let the model “understand” a structure in the data without having it affect the fixed parameter estimates. If you have a factorial design, the otherwise non-modeled design variables would generally go here.
[1] “Difficulty” is a term borrowed from Wagenmakers & Brown (2007), but it is not standard.
[2] Some distributions are parameterized using a rate, which is the inverse of the location/center.
[3] Many distributions include a shape parameter, but since all RT distributions have approximately the same shape, I’d rather pick a distribution that inherently has the “right” shape than model this using a parameter.