Overview
ParseMatch
CombinedParsers.jl provides the @re_str macro as a plug-in replacement for the Base Julia @r_str macro.
Base Julia PCRE regular expressions:
julia> pattern = r"(?<a>a|B)+c"
r"(?<a>a|B)+c"
julia> mr = match(pattern,"aBc")
RegexMatch("aBc", a="B")
CombinedParsers.Regexp regular expression:
julia> pattern = re"(?<a>a|B)+c"
🗄 Sequence |> regular expression combinator with 1 capturing groups
├─ (?<a>|🗄)+ Either |> Capture 1 |> with_name(:a) |> Repeat
│  ├─ a
│  └─ B
└─ c
::Tuple{Vector{Char}, Char}
julia> mre = match(pattern,"aBc")
ParseMatch("aBc", a="B")
The ParseMatch type has getproperty and getindex methods, so it can be handled like a RegexMatch.
julia> mre.match
"aBc"
julia> mre.captures
1-element Vector{SubString{String}}:
 "B"
julia> mre[1]
"B"
julia> mre[:a]
"B"
CombinedParsers.jl is tested and benchmarked against the PCRE C library testset, see compliance report.
Parsing
match searches for the first match of the Regex in the String and returns a RegexMatch/ParseMatch object containing the match and captures, or nothing if the match failed. If a capture group matches repeatedly, only the last match is captured.
julia> match(pattern,"aBBac")
ParseMatch("aBBac", a="a")
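The two rules above (nothing on failure, last repetition captured) can be checked against Base Julia's Regex on its own. This is a minimal sketch that uses only Base and makes no CombinedParsers calls; the variable names no_match and m are illustrative.

```julia
# Base Julia only: the RegexMatch behaviour that ParseMatch is described as mirroring.
pattern = r"(?<a>a|B)+c"

# A failed match returns `nothing` rather than throwing an error.
no_match = match(pattern, "xyz")
@assert no_match === nothing

# The group (?<a>...) repeats four times here; only the last repetition is kept.
m = match(pattern, "aBBac")
@assert m.match == "aBBac"
@assert m[:a] == "a"              # access by group name
@assert m[1] == "a"               # access by group index
@assert length(m.captures) == 1   # one capture group, one stored capture
```

Because a failed match yields nothing, calling code usually checks the result with isnothing(...) before accessing captures.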
Base.parse methods parse a String into a Julia type. A CombinedParser p will parse into an instance of result_type(p). For parsers defined with the @re_str macro, the result_types are nested Tuples and Vectors of SubString, Char and Missing.
julia> parse(pattern,"aBBac")
(['a', 'B', 'B', 'a'], 'c')
Iterating
If a parsing is not uniquely defined, the different parsings can be iterated lazily with parse_all, which conforms to Julia's iterate interface.
for p in parse_all(re"^(a|ab|b)+$","abab")
    println(p)
end
(re"^", Union{Char, Tuple{Char, Char}}['a', 'b', 'a', 'b'], re"$")
(re"^", Union{Char, Tuple{Char, Char}}['a', 'b', ('a', 'b')], re"$")
(re"^", Union{Char, Tuple{Char, Char}}[('a', 'b'), 'a', 'b'], re"$")
(re"^", Union{Char, Tuple{Char, Char}}[('a', 'b'), ('a', 'b')], re"$")
Performance
CombinedParsers aims to be fast by using parametric types and generated functions, which the Julia compiler can specialize. The benchmarks below compare it with Base.Regex (the PCRE C implementation):
using BenchmarkTools
pattern = r"[aB]+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 492 evaluations.
Range (min … max): 215.211 ns … 46.446 μs ┊ GC (min … max): 0.00% … 99.28%
Time (median): 259.238 ns ┊ GC (median): 0.00%
Time (mean ± σ): 333.037 ns ± 1.621 μs ┊ GC (mean ± σ): 17.54% ± 3.58%
215 ns Histogram: log(frequency) by time 421 ns <
Memory estimate: 224 bytes, allocs estimate: 3.
In this run, however, CombinedParsers is slower than PCRE for the same pattern (median about 994 ns versus about 259 ns):
pattern = re"[aB]+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 74 evaluations.
Range (min … max): 889.243 ns … 308.311 μs ┊ GC (min … max): 0.00% … 99.49%
Time (median): 994.189 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.193 μs ± 6.717 μs ┊ GC (mean ± σ): 12.57% ± 2.22%
889 ns Histogram: frequency by time 1.43 μs <
Memory estimate: 544 bytes, allocs estimate: 5.
Regex-style capture groups are supported for compatibility:
pattern = r"([aB])+c"
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 294 evaluations.
Range (min … max): 294.299 ns … 88.584 μs ┊ GC (min … max): 0.00% … 99.52%
Time (median): 347.509 ns ┊ GC (median): 0.00%
Time (mean ± σ): 466.993 ns ± 2.770 μs ┊ GC (mean ± σ): 18.74% ± 3.14%
294 ns Histogram: log(frequency) by time 631 ns <
Memory estimate: 288 bytes, allocs estimate: 4.
CombinedParsers.Regexp.Captures are slow compared with PCRE (median about 2.05 μs versus about 348 ns in this run):
pattern = re"([aB])+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.392 μs … 2.162 ms ┊ GC (min … max): 0.00% … 99.74%
Time (median): 2.052 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.371 μs ± 21.602 μs ┊ GC (mean ± σ): 9.09% ± 1.00%
1.39 μs Histogram: frequency by time 4.06 μs <
Memory estimate: 1.28 KiB, allocs estimate: 10.
With CombinedParsers, however, results can be captured more flexibly with transformations, and the parsed result of a ParseMatch is available through get:
julia> pattern = re"[aB]+c";
julia> @btime (mre = match(pattern,"aBaBc"))
  892.962 ns (5 allocations: 544 bytes)
ParseMatch("aBaBc")
julia> @btime get(mre)
  179.596 ns (2 allocations: 128 bytes)
(['a', 'B'], 'c')
Transformations
Transform the result of a parsing with map. The result_type is inferred automatically using Julia's type inference.
julia> p = map(length,re"(ab)*")
(ab)* Sequence |> Capture 1 |> Repeat |> map(length) |> regular expression combinator with 1 capturing groups
::Int64
julia> parse(p,"abababab")
4
Conveniently, getindex(::CombinedParser,::Integer) and map(::Integer,::CombinedParser) each create a transforming parser that selects an element from the result of the parsing.
julia> parse(map(IndexAt(2),re"abc"),"abc")
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
julia> parse(re"abc"[2],"abc")
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
Next: The User guide provides a summary of CombinedParsers types.