A fast Trie-based parser for a collection of literal Strings.
Matching any one of many String
s (e.g. names) is slow in PCRE with |
alternatives.
julia> using Random
julia> s = [ randstring(10) for _ in 1:1000 ];
julia> re = Regex(join(s,"|"));
julia> using BenchmarkTools
julia> @benchmark match(re,s[end])
BenchmarkTools.Trial: 10000 samples with 9 evaluations. Range (min … max): 2.709 μs … 8.185 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.914 μs ┊ GC (median): 0.00% Time (mean ± σ): 2.941 μs ± 173.223 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▁▅▅██▇▆▄▃▁ ▁▂▂▂▃▄▅▄▅▇▇██████████▇▆▅▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▁▁▁▁▁▁▁▁▁ ▃ 2.71 μs Histogram: frequency by time 3.47 μs < Memory estimate: 240 bytes, allocs estimate: 4.
A optimized state-machine parser will perform much better.
TODO: Benchmark Automa.jl here
CombinedParsers
does not use state machines, but provides fast _iterate
implementation for Either{Trie{Char}}
, that is constructed with Either(::Vector{<:AbstractString})
julia> using CombinedParsers
julia> pc = Either(s);
julia> @benchmark match(pc,s[end])
BenchmarkTools.Trial: 10000 samples with 10 evaluations. Range (min … max): 1.187 μs … 4.909 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 1.504 μs ┊ GC (median): 0.00% Time (mean ± σ): 1.547 μs ± 267.748 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▁▄▅▇▇▇█▇▇▆▅▄▄▅▄▅▃▃▃▃▁▁▂ ▁▆▇████████████████████████▇▇▅▅▄▄▄▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▄ 1.19 μs Histogram: frequency by time 2.42 μs < Memory estimate: 416 bytes, allocs estimate: 7.
Note that pc
does not capture the result by default with parse
.
julia> parse(!pc,s[end])
"TPiZTwAyw6"
For large word collections, PCRE fails
julia> s = [ randstring(10) for _ in 1:10000 ];
julia> re = Regex(join(s,"|"));
ERROR: PCRE compilation error: regular expression is too large at offset 109999
CombinedParsers
displays good timings also .
julia> pc = Either(s);
julia> @benchmark Either(s);
julia> @benchmark match(pc,s[end])
BenchmarkTools.Trial: 10000 samples with 10 evaluations. Range (min … max): 1.116 μs … 9.438 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 1.547 μs ┊ GC (median): 0.00% Time (mean ± σ): 1.588 μs ± 305.689 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▄▆███████▇▇▆▆▃▃▁ ▂▂▂███████████████████▇▆▆▅▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▁▂▂▂▂▂ ▄ 1.12 μs Histogram: frequency by time 3.05 μs < Memory estimate: 416 bytes, allocs estimate: 7.
Either{Trie{Char}}
can be used to perform fast text search in Julia.
Aho-Corasick algorithm?
The implementation could be expanded to implement Aho-Corasick algorithm.
This page was generated using Literate.jl.