vignettes/packages.Rmd
packages.Rmd
There are a lot of change point packages out there already, so why mcp
? Here are my (probably biased) thoughts about this. I compiled some tables, summarising change point packages (.xlsx file here). I will demonstrate each of these packages in an applied example below to discuss their merits and shortcomings.
I think segmented
, and EnvCpt
are good if the data fits what they can model, and mcp
is a more capable and general-purpose package at the cost of speed. You can see the immediate roadmap for mcp
at the GitHub issues tracker.
How much flexibility does the package allow for modeling the system, you’re studying? A clear difference here is between packages that allow you to specify the number of change points, and packages which infer the number of change points automatically (using some criteria), captured in the N
column below.
What can you learn, once the model has fitted? While all packages return the estimated Change Points, few return an index of uncertainty around that estimate (CP CI
: confidence intervals, highest-density intervals, or posterior densities). Sometimes, it may not even be the change point that is of interest, but rather the parameters of the segments in between (params
, params CI
). Finally, you may want to compare models, e.g., testing whether a change point is present or not, or testing the nature of the segments between change points (hypothesis tests
).
Note that EnvCpt
returns one log-likelihood per model, which one then could do testing on. segmented
returns AIC and deviance, so ditto here. However, this is something you would have to do yourself. mcp
explicitly supports and recommends workflows for detailed model comparisons.
This is a list of other functions, which I find useful on other statistics packages, including simulating data, getting fitted and predicted values given the model, and an assessment of speed. The option to include prior knowledge could be important in some situations.
mcp
.mcp
.sigma()
(read more here) and ar()
(read more here) terms, also allowing for more advanced change points involving, e.g., a change in autocorrelation strength.bcp
is the only other Bayesian change point package I know of. It contains a few high-level prior parameters, but not for specific parameters such as where change points are expected to occur etc. Read more about priors in mcp for more.plot.mcpfit
is inherently a visual predictive check and can be applied to priors (mcp(..., sample = "prior")
and posteriors. Of course inspecting Gelman-Rubin statistics and effective sample sizes in summary.mcpfit
, as well as trace plots (plot_pars(fit, type = "trace")
), also go a long way.loo
and hypothesis
, including testing the existence of a change point. Most other packages stop at estimation, though some provide p-values. Read more about hypothesis testing and model comparison using mcp
.Let me go all-out on my ego-serving bias and say that mcp
is the best package unless:
mcp
is reasonably fast for typical problems, MCMC sampling is slower than analytical and specialized solutions. For datasets > 20.000 points and a complicated model, it may take mcp
hours (but not days) to fit. See tips and tricks how to speed up mcp
. Briefly, run in parallel and lower the number of iterations: (mcp(..., cores = 4, iter = 1000, adapt = 400
).mcp
tries to fill. But beware that many of the automatic procedures did not capture a change point in the worked examples below. If recommend segmented
for the models it “understands” followed by EnvCpt
and perhaps bcp
.mcp
only does univariate regression, both for \(y\) and \(x\). This will be implemented if there is a popular request for it. Don’t hesitate to raise an issue on GitHub. For now, bcp
seems like the best option to me.As a simple example, we simulate some intercept-only data with change points at 30 (from mean=2 to mean=0) and 70 (to mean=1) and a residual of 1 SD.
# Simulate
set.seed(42) # I always use 42; no fiddling
df = data.frame(
x = 1:100,
y = c(rnorm(30, 2), rnorm(40, 0), rnorm(30, 1))
)
# Plot it
plot(df)
abline(v = c(30, 70), col="red")
mcp
needs no further introduction. We fit the three-plateaus model with default priors:.
library(mcp)
model = list(y~1, 1~1, 1~1) # three intercept-only segments
fit_mcp = mcp(model, data = df, par_x = "x")
summary(fit_mcp)
## Family: gaussian(link = 'identity')
## Iterations: 9000 from 3 chains.
## Segments:
## 1: y ~ 1
## 2: y ~ 1 ~ 1
## 3: y ~ 1 ~ 1
##
## Population-level parameters:
## name mean lower upper Rhat n.eff
## cp_1 30.73 27.16 33.99 1 1819
## cp_2 63.66 46.87 76.65 1 982
## int_1 2.04 1.63 2.43 1 4352
## int_2 -0.08 -0.52 0.38 1 1990
## int_3 0.92 0.51 1.30 1 1784
## sigma_1 1.06 0.91 1.22 1 4755
The summary shows good parameter recovery, though the second change point is detected a bit early. This is understandable if you look at the data, and the true change point is still within the highest density interval.
Plotting the posterior distributions of the change points reveal that they are not well represented by a Gaussian or other known distributions. Therefore, confidence intervals are likely to be meaningless for this problem.
library(patchwork)
plot(fit_mcp) + plot_pars(fit_mcp, pars = c("cp_1", "cp_2"), type = "dens_overlay")
mcp
takes around 6 seconds to fit this example, which is the slowest of all the packages considered here. You can bring that down to 2.5 seconds by running in parallel, using mcp(..., cores = 3)
or for the full session using options(mc.cores = 3)
.
EnvCpt
can detect change points in mean and variance (not separately), slopes (“trends”), and AR(1)/AR(2), as well as conveniently fitting various models without change points. It automatically infers the number of change points. Unless otherwise instructed (through models
argument), EnvCpt
fits all models to the data, allowing you to pick one. It allows
library(EnvCpt)
fit_envcpt = envcpt(df$y) # Fit all models at once
fit_envcpt$summary # Show log-likelihoods
This can be used to maximize the log-likelihood. Of interest here is meancpt
. The documentation says that it calls changepoint::cpt.meanvar
, modeling a simultaneous change in mean and variance. Curiously, changepoint::cpt.meanvar(df$y)
only finds one change point when called directly (see below), but the output matches changepoint.np::cpt.np(df$y)@cpts
, so I suspect that this function is used instead. EnvCpt
has a nice plot of all the models:
plot(fit_envcpt)
Digging into the meancpt
, we get maximum-likelihood estimates of the change points and the parameters of each segment.
## [1] 30 65 100
## $mean
## [1] 2.06858683 -0.07385687 0.96511049
##
## $variance
## [1] 1.5225923 1.0064793 0.6442151
I think the change point at x = 100
should just be ignored. We see that it approximately identifies the change points, though without intervals.
segmented
seems to be the most popular package for change point analysis. It has a very shallow learning curve combined with great modeling flexibility. You simply specify your model in lm
, glm
, Arima
, and also work for e.g. coxph
(Cox proportional Hazard). Supply it to segmented
which then segment your data along the x-axis and applies the linear model in each segment. The trick is identifying the locations where this split works the best. The positive consequence is that you get great modeling flexibility with GLM, AR(1) models, etc.
The downside is that you can only have one kind of segments and (for some reason) it ignores the intercepts on anything but the first segment. Only (joined) slopes are supported in segment 2+. Unfortunately, this means that segmented
fails for the present changing-intercept data. In my limited testing, it also ignores quadratic and other terms, so even though the design philosophy is excellent, you have to carefully double-check whether it actually models what you asked it to.
segmented
is otherwise well developed with prediction functions, (frequentist) intervals, plots, etc. so if you have large datasets with impermissible long run times in mcp
and it matches what segmented
can model, it is a good option.
Enough talk:
library(segmented)
fit_lm = lm(y ~ 1 + x, data = df) # intercept-only model
fit_segmented = segmented(fit_lm, seg.Z = ~x, npsi = 2) # Two change points along x
summary(fit_segmented)
##
## ***Regression Model with Segmented Relationship(s)***
##
## Call:
## segmented.lm(obj = fit_lm, seg.Z = ~x, npsi = 2)
##
## Estimated Break-Point(s):
## Est. St.Err
## psi1.x 40.217 3.659
## psi2.x 91.000 4.989
##
## Meaningful coefficients of the linear terms:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.28673 0.33873 9.703 7.86e-16 ***
## x -0.08916 0.01440 -6.192 1.55e-08 ***
## U1.x 0.12164 0.01770 6.871 NA
## U2.x -0.13783 0.11618 -1.186 NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.051 on 94 degrees of freedom
## Multiple R-Squared: 0.4154, Adjusted R-squared: 0.3843
##
## Convergence attained in 1 iter. (rel. change 9.9261e-06)
As expected, the change points are off (psi1.x
and psi2.x
). The default plot is sparse, but you can quickly add more info:
plot(fit_segmented)
points(df)
lines.segmented(fit_segmented)
points.segmented(fit_segmented)
Because the fitted objects returned by segmented
contains aic
, deviance
, etc. it is in principle possible to do model comparison. Here is a BIC-based Bayes Factor testing whether there is one or two change points in the data:
fit_segmented_1 = segmented(fit_lm, seg.Z = ~x, npsi = 1)
BF = exp((BIC(fit_segmented) - BIC(fit_segmented_1))/2) # From Wagenmakers (2007)
BF
## [1] 21.87528
We can do the same using mcp
:
model_1 = list(y~1, 1~1)
fit_mcp_1 = mcp(model_1, data = df, par_x = "x")
fit_mcp_1$loo = loo(fit_mcp_1)
fit_mcp$loo = loo(fit_mcp)
loo::loo_compare(fit_mcp$loo, fit_mcp_1$loo)
## elpd_diff se_diff
## model1 0.0 0.0
## model2 -5.5 2.8
While segmented
prefers the one-change-point model, mcp
prefers the (true) two-change-point model. Again, the models compared here are very different since segmented
does inference on a slope-only model.
strucchange::breakpoints()
is a lot like segmented
. The difference is that (1) it is limited to gaussian residuals, and (2) it actually models intercepts, quadratic terms, etc. It scans through fits with 1, 2, 3,… N break points and determine where the optimal break points between this number of segments would lie. For each model it computes the Residual Sum of Squares (RSS; monotonically smaller with increasing N) and the Bayesian Information Criterion (BIC; has a minimum) for each N. It selects the model with the smallest BIC.
For an intercept-only model (y ~ 1
) it correctly detects the two change points with estimates similar to mcp
. As with most other packages, there are no intervals on the change points nor estimates of the parameters of the segments in between.
library(strucchange)
## Loading required package: sandwich
fit_bp = breakpoints(y ~ 1, data = df, breaks = 2)
summary(fit_bp)
##
## Optimal (m+1)-segment partition:
##
## Call:
## breakpoints.formula(formula = y ~ 1, breaks = 2, data = df)
##
## Breakpoints at observation number:
##
## m = 1 30
## m = 2 30 65
##
## Corresponding to breakdates:
##
## m = 1 0.3
## m = 2 0.3 0.65
##
## Fit:
##
## m 0 1 2
## RSS 177.7 122.3 103.5
## BIC 350.5 322.4 314.8
Reassuringly, for y ~ 1 + x
it “recommends” the same single change point that segmented
found for this model.
cpm
is an intercept-only (in mean and variance) package, so it cannot model slopes. It can detect single change points via detectChangePoint
and multiple change points via processStream
. processStream
is an automatic change point detection, using a p-value threshold to determine if a candidate should be marked as a hit. Multiple ways of computing p-values (cpmType
) are available. In my limited testing, they return similar change points, though not identical.
library(cpm)
#fit_cpm = detectChangePoint(df$y, cpmType = "Student") # a single change point
#fit_cpm = processStream(df$y, cpmType = "Mann-Whitney") # Detects three
fit_cpm = processStream(df$y, cpmType = "Student") # Multiple change points
fit_cpm$changePoints
## [1] 30 62
It detects two change points, which is good. No intervals are returned and no plot functions are provided. The second change point is estimated a bit further away from the true value than mcp
.
changepoint
is focused on intercept-only changes. It can estimate changes in means (cpt.mean
), variance (cpt.var
), or both (cpt.meanvar
). It is semi-automatic in that you can set the number of change points using parameter Q
and this defaults to five. It can recover ML estimates of the intercepts. It does not estimate uncertainty, nor model checking. It only takes a response variable, so the change point is the data index, not the point on an x-axis. Make sure that your data is ordered. In our case it is ordered and we have 1 data point at each x
, so it is optimal for changepoint
.
It detects the first change point, but not the second.
library(changepoint)
fit_changepoint = cpt.mean(df$y)
# Return estimates
c(ints = param.est(fit_changepoint)$mean,
cp = cpts(fit_changepoint))
## ints1 ints2 cp
## 2.0685868 0.4456268 30.0000000
Plot:
plot(fit_changepoint)
The package changepoint.np
extends changepoint
by providing a non-parametric version. It is unclear which of the changepoint
functions are extended, but it manages to find both change points to a good precision. As with many other packages, no intervals are provided. As mentioned earlier, I suspect that EnvCpt
uses this function under the hood.
## [1] 29 65 100
bcp
is the only other Bayesian package in the game. It automatically detects change points and segment types, though you can use the parameter d
to increase the prior probability of intercept-only models. It provides estimates of means and probability of change point at each x-coordinate. It has little additional functionality. The summary method conveys the same as the plot, so let’s stick to the plot.
We see that it vaguely captures the change point at x = 30
, and has some smeared-out probability around x = 60
(mcp
detected a single change point at x = 64
here). bcp
has some “false alarms” at x < 20
.
library(bcp)
fit_bcp = bcp(df$y, d = 1000)
plot(fit_bcp)
ecp
contains six functions to detect change points. It is clearly built for multivariate data, but can take univariate too. Of the six functions, I have not managed to get e.agglo
working for df
, but here I demonstrate the other five. The resulting information is sparse. It detects one of the two change points, but not both (with the exception of e.divisive
, which returns four),
df_ecp = as.matrix(df$y)
fit_ecp1 = ecp::e.cp3o(df_ecp, K = 2) # maximum 2 change points
fit_ecp2 = ecp::e.cp3o_delta(df_ecp, K = 2) # maximum 2 change points
fit_ecp3 = ecp::e.divisive(df_ecp, k = 2) # 2 change points. Ignored???
fit_ecp4 = ecp::ks.cp3o(df_ecp, K = 2) # maximum 2 change points
fit_ecp5 = ecp::ks.cp3o_delta(df_ecp, K = 2) # maximum 2 change points
# Show the change point estimates
str(list(
e.cp3o = fit_ecp1$estimates,
e.cp3o_delta = fit_ecp2$estimates,
e.divisive = fit_ecp3$estimates,
ks.cp30 = fit_ecp4$estimates,
ks.cp3o_delta = fit_ecp5$estimates
))
## List of 5
## $ e.cp3o : int 31
## $ e.cp3o_delta : int 60
## $ e.divisive : num [1:4] 1 31 63 101
## $ ks.cp30 : int 31
## $ ks.cp3o_delta: int 31
Short for “Time-Series Multiple Change Point”. The output is a single number (!). For the problem at hand, changing method
makes no difference. For many other c
(c = 0.3
, c = 4
), it fails to find any change points at all. There is also TSMCP::cpvnts()
to model AR(N), but I have failed to make it find any change points in the present data set.
It finds the change point at x = 30
, but not the second change point at x = 70
.
## [1] 30
The wbsts
package is focused at time-series data. I must admit that I have not taken the time to understand the underlying model(s) or inference methods, so I am in a poor position to evaluate whether I am using it correctly. It returns no change points for the present data set using default settings:
## $cp.bef
## NULL
##
## $cp.aft
## NULL
robts
is about robust time-series regression. It is not on CRAN, so it has to be installed using install.packages("robts", repos="http://R-Forge.R-project.org")
. I fail to install the dependencies, and development of the package seems to have stopped around 2014. It won’t be covered further here.
strucchange::Fstats
: Returns the estimated change point (one number), and nothing else. strucchange::Fstats(y ~ 1, data = df)
find the change point at 30 in the present data.SiZer::piecewise.linear
: Returns a change point and parameter estimates, optionally with an interval. Call like piecewise.linear(df$x, df$y, CI = TRUE)
. Only joined slopes are supported which makes it misestimate the change point in the present data 39 [36.5, 47], but it is consistent with segmented
which uses a similar model.easyreg::bl
.lm.br::lm.br
. Choose between line-line, line-plateau, or plateau-line. Yields confidence intervals. All of these are non-suitable for the present plateau-only model. lm.br
stands out with the following features: (1) it can take multiple predictors, and it uniquely only models change point over just one of them, keeping the others constant. In comparison, segmented
models changes in everything, as far as I understand it. (2) It can take known variance. (3) It can take data weights.As explained in [../docs/articles/formulas.html], fixed change points can be implemented in almost all regression functions. You simply have an indicator variable saying whether there has been as shift at the corresponding x-value and use it like lm(y ~ 1 + x * I(x > 30), data = df)
. Or if you don’t want a carry-over effect from the first segment, do lm(y ~ 1 * I(x < 30) + x * I(x >= 30), data = df)
. This will work for most regression packages in R, including survival
and lme4
.
Some packages include this in a way where you don’t have to code your indicators yourself, including:
scan::plm
and scan::hplm
. To get started here, make sure to transform your data to a scan::scdf
object.I have yet to test these:
TSIS
may be coerced to do change point analysis, though it is mean for much more complicated switch point models in gene expression analysis. It’s primary interface is a Shiny App.plrs::plrs
. Piecewise Linear Regression, targeted at DNA and gene expression. May also be coerced into simpler problems.