Overview

ParseMatch

CombinedParsers.jl provides the @re_str macro as a drop-in replacement for Base Julia's @r_str macro. A Base Julia PCRE regular expression:

julia> pattern = r"(?<a>a|B)+c"
r"(?<a>a|B)+c"
julia> mr = match(pattern,"aBc")
RegexMatch("aBc", a="B")

CombinedParsers.Regexp regular expression:

julia> pattern = re"(?<a>a|B)+c"
🗄 Sequence |> regular expression combinator with 1 capturing groups
├─ (?<a>|🗄)+ Either |> Capture 1 |> with_name(:a) |> Repeat
│  ├─ a
│  └─ B
└─ c
::Tuple{Vector{Char}, Char}
julia> mre = match(pattern,"aBc")
ParseMatch("aBc", a="B")

The ParseMatch type has getproperty and getindex methods, so it can be handled like a RegexMatch.

julia> mre.match
"aBc"
julia> mre.captures
1-element Vector{SubString{String}}:
 "B"
julia> mre[1]
"B"
julia> mre[:a]
"B"
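
For comparison, the following sketch uses only Base Julia and exercises the RegexMatch accessors that ParseMatch mirrors. The variable names (whole, groups, by_index, by_name) are illustrative and not part of either API.

```julia
# Base Julia only: the RegexMatch accessors that ParseMatch mirrors.
m = match(r"(?<a>a|B)+c", "aBc")

whole    = m.match      # the matched substring
groups   = m.captures   # one entry per capturing group
by_index = m[1]         # capture selected by position
by_name  = m[:a]        # capture selected by name

println((whole, groups, by_index, by_name))
```

Code written against these four accessors should, per the description above, work unchanged when the pattern is built with @re_str instead of @r_str.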
Note

CombinedParsers.jl is tested and benchmarked against the PCRE C library test set; see the compliance report.

Parsing

match searches for the first match of the Regex in the String and returns a RegexMatch/ParseMatch object containing the match and captures, or nothing if the match failed. If a capture group matches repeatedly, only the last match is captured.

julia> match(pattern,"aBBac")
ParseMatch("aBBac", a="a")
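
Both behaviours described above (nothing on failure, last repetition wins) can be checked with Base Julia alone. This is a minimal sketch using Base.Regex rather than CombinedParsers; the input "xyz" and the variable names are made up for illustration.

```julia
# Base Julia only: `match` returns `nothing` on failure, and a repeated
# capture group keeps only its last repetition.
pattern = r"(?<a>a|B)+c"

hit = match(pattern, "aBBac")
last_capture = hit === nothing ? nothing : hit[:a]   # group a matched a, B, B, a

miss = match(pattern, "xyz")                          # no match in this input

println(last_capture)
println(miss === nothing)
```

Checking the result against nothing before indexing avoids a MethodError when the pattern does not match.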

Base.parse methods parse a String into a Julia type. A CombinedParser p will parse into an instance of result_type(p). For parsers defined with the @re_str macro, the result types are nested Tuples and Vectors of SubString, Char and Missing.

julia> parse(pattern,"aBBac")
(['a', 'B', 'B', 'a'], 'c')

Iterating

If a parsing is not uniquely defined, the different parsings can be iterated lazily, conforming to Julia's iterate interface.

for p in parse_all(re"^(a|ab|b)+$","abab")
	println(p)
end
(re"^", Union{Char, Tuple{Char, Char}}['a', 'b', 'a', 'b'], re"$")
(re"^", Union{Char, Tuple{Char, Char}}['a', 'b', ('a', 'b')], re"$")
(re"^", Union{Char, Tuple{Char, Char}}[('a', 'b'), 'a', 'b'], re"$")
(re"^", Union{Char, Tuple{Char, Char}}[('a', 'b'), ('a', 'b')], re"$")
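
To make the ambiguity concrete, here is a plain-Julia sketch that enumerates every way of splitting "abab" into the alternatives a, ab and b. It does not use CombinedParsers: segmentations is a hypothetical helper written for this illustration, it assumes non-empty tokens, and it builds all results eagerly rather than lazily.

```julia
# Hypothetical helper, not part of CombinedParsers: list every way to split
# `s` into the given non-empty tokens, trying the tokens in the order given.
function segmentations(s::AbstractString, tokens)
    isempty(s) && return [String[]]            # exactly one way to split ""
    results = Vector{String}[]
    for t in tokens
        startswith(s, t) || continue
        rest = SubString(s, ncodeunits(t) + 1)  # remainder after the prefix t
        for tail in segmentations(rest, tokens)
            push!(results, vcat(String[t], tail))
        end
    end
    return results
end

for parts in segmentations("abab", ["a", "ab", "b"])
    println(parts)
end
```

The four splits come out in the same order as the four parsings listed above, because the alternatives are tried left to right at each position.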

Performance

CombinedParsers are fast, utilizing parametric types and generated functions in the Julia compiler.

For comparison, a benchmark of Base.Regex (the PCRE C implementation):

using BenchmarkTools
pattern = r"[aB]+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 492 evaluations.
 Range (min … max):  215.211 ns … 46.446 μs  ┊ GC (min … max):  0.00% … 99.28%
 Time  (median):     259.238 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   333.037 ns ±  1.621 μs  ┊ GC (mean ± σ):  17.54% ±  3.58%

 [Histogram: log(frequency) by time, 215 ns to 421 ns]

 Memory estimate: 224 bytes, allocs estimate: 3.

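In the run recorded here, the equivalent CombinedParsers pattern is slower than PCRE (median about 994 ns versus about 259 ns), although CombinedParsers is reported to be slightly faster for many other tested parsers.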

pattern = re"[aB]+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 74 evaluations.
 Range (min … max):  889.243 ns … 308.311 μs  ┊ GC (min … max):  0.00% … 99.49%
 Time  (median):     994.189 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.193 μs ±   6.717 μs  ┊ GC (mean ± σ):  12.57% ±  2.22%

 [Histogram: frequency by time, 889 ns to 1.43 μs]

 Memory estimate: 544 bytes, allocs estimate: 5.

Regex-style capture groups are supported for compatibility:

pattern = r"([aB])+c"
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 294 evaluations.
 Range (min … max):  294.299 ns … 88.584 μs  ┊ GC (min … max):  0.00% … 99.52%
 Time  (median):     347.509 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   466.993 ns ±  2.770 μs  ┊ GC (mean ± σ):  18.74% ±  3.14%

 [Histogram: log(frequency) by time, 294 ns to 631 ns]

 Memory estimate: 288 bytes, allocs estimate: 4.

CombinedParsers.Regexp.Captures are slow compared with PCRE,

pattern = re"([aB])+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.392 μs …  2.162 ms  ┊ GC (min … max): 0.00% … 99.74%
 Time  (median):     2.052 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.371 μs ± 21.602 μs  ┊ GC (mean ± σ):  9.09% ±  1.00%

 [Histogram: frequency by time, 1.39 μs to 4.06 μs]

 Memory estimate: 1.28 KiB, allocs estimate: 10.

But with CombinedParsers you capture more flexibly with transformations anyway.

julia> pattern = re"[aB]+c";
julia> @btime (mre = match(pattern,"aBaBc"))
  892.962 ns (5 allocations: 544 bytes)
ParseMatch("aBaBc")
julia> @btime get(mre)
  179.596 ns (2 allocations: 128 bytes)
(['a', 'B'], 'c')

Transformations

Transform the result of a parsing with map. The result_type is inferred automatically using Julia's type inference.

julia> p = map(length,re"(ab)*")
(ab)* Sequence |> Capture 1 |> Repeat |> map(length) |> regular expression combinator with 1 capturing groups
::Int64
julia> parse(p,"abababab")
4

Conveniently, calling getindex(::CombinedParser,::Integer) and map(::Integer,::CombinedParser) create a transforming parser selecting from the result of the parsing.

julia> parse(map(IndexAt(2),re"abc"),"abc")
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
julia> parse(re"abc"[2],"abc")
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

Next: The User guide provides a summary of CombinedParsers types.