LinearSegmentation
This is a small package that performs linear segmented regression: fitting
piecewise linear functions to data, and simultaneously estimating the best
breakpoints. Three algorithm are implemented, sliding_window
, top_down
, and
graph_segmentation
.
Interface
using LinearSegmentation
segments, glm_fits = segmentation_function(
x_values,
y_values;
min_segment_length = minimum_segment_x_length,
max_rmse = maximum_root_mean_squared_error,
)
Where each segment
in segments
is a type of LinearSegmentation.Segment
,
which contains all the indices of x_values
used in the segment
.
Corresponding to each segment
is a GLM.LinearModel
(see
GLM.jl), which is the fitted linear
model to the segment
. Minimum segment lengths are specified with
min_segment_length
and maximum allowed root mean square error for a segment is
set with max_rmse
. Both of these kwargs have large effects on the segmentation
--- play around to see what works best for your data.
Generate some data
N = 100
xs = collect(range(0, 3 * pi, length = N)) .+ 0.1 .* randn(N)
ys = sin.(xs) .+ 0.1 .* randn(N)
Sliding window
Starting from first(xs)
a "window" slides along the x-axis, with data points
added to this segment until the fit error reaches max_rmse
. Then a new segment
is created, and the process repeats until the dataset is finished. This
algorithm is the cheapest to run, but may generate worse fits due to its
simplicity (notice the overshoots in the picture).
segs, fits = sliding_window(xs, ys; min_segment_length=1.2, max_rmse=0.15)
Top down
This algorithm recursively splits the data into two parts, attempting to find segments that are both long enough, and have a good enough fit (set via the kwargs). In my experience, this is the most robust algorithm, but ymmv.
segs, fits = top_down(xs, ys; min_segment_length=1.2, max_rmse=0.15)
Graph segmentation
This algorithm is my take on the dynamic programming approaches used by the R
packages listed below. In essence, a weighted directional graph is constructed,
where each node corresponds to an index of xs
, and the weight corresponds to
the rmse of the fit between two nodes (segment length restrictions and maximum
error are both incorporated). The shortest path that spans xs
is the found
with Graphs.a_star
(see
Graphs.jl), and should correspond to
the best segmentation.
segs, fits = graph_segmentation(xs, ys; min_segment_length=1.2, max_rmse=0.15)
Other useful resources
- https://cran.r-project.org/web/packages/dpseg/vignettes/dpseg.html
- https://winvector.github.io/RcppDynProg/
- E. Keogh, S. Chu, D. Hart and M. Pazzani, "An online algorithm for segmenting time series," Proceedings 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 2001, pp. 289-296, doi: 10.1109/ICDM.2001.989531.