A fast Trie-based parser for a collection of literal Strings.

Matching any one of many Strings (e.g. names) is slow in PCRE with | alternatives.

julia> using Random
julia> s = [ randstring(10) for _ in 1:1000 ];
julia> re = Regex(join(s,"|"));
julia> using BenchmarkTools
julia> @benchmark match(re,s[end])BenchmarkTools.Trial: 10000 samples with 9 evaluations. Range (min … max): 2.709 μs … 8.185 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.914 μs ┊ GC (median): 0.00% Time (mean ± σ): 2.941 μs ± 173.223 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▁▅▅██▇▆▄▃▁ ▁▂▂▂▃▄▅▄▅▇▇██████████▇▆▅▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▁▁▁▁▁▁▁▁▁ ▃ 2.71 μs Histogram: frequency by time 3.47 μs < Memory estimate: 240 bytes, allocs estimate: 4.

A optimized state-machine parser will perform much better.

TODO: Benchmark Automa.jl here

CombinedParsers does not use state machines, but provides fast _iterate implementation for Either{Trie{Char}}, that is constructed with Either(::Vector{<:AbstractString})

julia> using CombinedParsers
julia> pc = Either(s);
julia> @benchmark match(pc,s[end])BenchmarkTools.Trial: 10000 samples with 10 evaluations. Range (min … max): 1.187 μs … 4.909 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 1.504 μs ┊ GC (median): 0.00% Time (mean ± σ): 1.547 μs ± 267.748 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▁▄▅▇▇▇█▇▇▆▅▄▄▅▄▅▃▃▃▃▁▁▂ ▁▆▇████████████████████████▇▇▅▅▄▄▄▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▄ 1.19 μs Histogram: frequency by time 2.42 μs < Memory estimate: 416 bytes, allocs estimate: 7.

Note that pc does not capture the result by default with parse.

julia> parse(!pc,s[end])"TPiZTwAyw6"

For large word collections, PCRE fails

julia> s = [ randstring(10) for _ in 1:10000 ];
julia> re = Regex(join(s,"|"));ERROR: PCRE compilation error: regular expression is too large at offset 109999

CombinedParsers displays good timings also .

julia> pc = Either(s);
julia> @benchmark Either(s);
julia> @benchmark match(pc,s[end])BenchmarkTools.Trial: 10000 samples with 10 evaluations. Range (min … max): 1.116 μs … 9.438 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 1.547 μs ┊ GC (median): 0.00% Time (mean ± σ): 1.588 μs ± 305.689 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▄▆███████▇▇▆▆▃▃▁ ▂▂▂███████████████████▇▆▆▅▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▁▂▂▂▂▂ ▄ 1.12 μs Histogram: frequency by time 3.05 μs < Memory estimate: 416 bytes, allocs estimate: 7.

Either{Trie{Char}} can be used to perform fast text search in Julia.

Aho-Corasick algorithm?

The implementation could be expanded to implement Aho-Corasick algorithm.


This page was generated using Literate.jl.