AhoCorasickAutomatons.ACMatchType

ACMatch has 3 fields:

1. s : start of match
2. t : stop of match, using text[s, t] to get matched patterns
3. i : index of the key in *obj*, which is the original insertion order of keys to *obj*

The field i may be use as index of external property arrays, i.e., the AhoCorasickAutomaton can act as a Map{String, Any}.

See also

eachmatch, getindex, setindex!

AhoCorasickAutomatons.AhoCorasickAutomatonType
AhoCorasickAutomaton{T}(keys::Vector{Pair{String, Ti}}; sort = false) where {T, Ti}
AhoCorasickAutomaton{T}(keys::AbstractDict{String, Ti}; sort = false) where {T, Ti}
AhoCorasickAutomaton{T}(keys::Vector{String}; sort = false) where T
AhoCorasickAutomaton(keys::Vector{String}; sort = false) = AhoCorasickAutomaton{UInt32}(keys; sort = sort)
AhoCorasickAutomaton(keys::Vector{Pair{String, Ti}}; sort = false) where Ti = AhoCorasickAutomaton{UInt32}(keys; sort = sort)

T s.t. typemax(T) >= length(keys)

A 2-Array implementation of Aho–Corasick automaton(ACA)[1], Most used as an engine to mine a long text string, to get all occurences of substrings interested in the text. ACA is also adequate for unicode strings as a Set{String} or Dict{String, Unsigned} (similar as Trie, but with a far more less space usage).

The ACA acts as a key data structrue in Aho–Corasick algorithm for multiple consecutive string patterns finding[2]. This particular implementation use no-decreasing base, i.e.,

1. base[node] >= node
2. children[node] > node

This choice make it suitable for large-size key set of not-very-long keys, with faster insertion speed and moderate space usage.

Notes

  1. Maybe of slower speed than a oridinary tree-based ACA, specially for random generated keys.
  2. When inserting duplicated keys, only the last one will make sense.
  3. Inputing Sorted keys will be of (much) faster speed. Just turn the sort option on!

Examples

julia> v = ["A", "B"]; obj1 = AhoCorasickAutomaton{UInt8}(v; sort = true) # constructor
type  AhoCorasickAutomaton{UInt8}
––––– –––––––––––––––––––––––––––
keys                            2
nodes                          67
nnz                             3
size                    623 bytes

julia> for key in v @assert key ∈ obj1 end

julia> @assert sort!(keys(obj1)) == v

julia> d = Dict("A" => 1, "B" => 2); obj2 = AhoCorasickAutomaton{UInt8}(d; sort = true); obj1 == obj2 # constructor
true

julia> obj3 = AhoCorasickAutomaton{UInt8}(collect(d); sort = true); obj2 == obj3 # constructor
true

julia> s = "AABDB"; matches = eachmatch(obj1, s) # eachmatch
4-element Array{ACMatch,1}:
 ACMatch(1, 1, 1)
 ACMatch(2, 2, 1)
 ACMatch(3, 3, 2)
 ACMatch(5, 5, 2)

julia> matches == eachmatch(obj1, codeunits(s)) # eachmatch
true

julia> map(x -> s[x], matches) # getindex
4-element Array{String,1}:
 "A"
 "A"
 "B"
 "B"

julia> ss = collect(codeunits(s)); foreach(x -> ss[x] = ss[x], matches); String(ss) == s # setindex!
true

julia> io = IOBuffer(); write(io, obj1); seek(io, 0); obj11 = read(io, AhoCorasickAutomaton); obj1 == obj11 # read & write
true

For more : using Pkg; less(joinpath(Pkg.dir("AhoCorasickAutomatons"), "test", "runtests.jl"))

See also

ACMatch, eachmatch, getindex, setindex!, in, get, length, read, write, collect, keys, values.

References

Base.eachmatchMethod
eachmatch(obj::AhoCorasickAutomaton{T}, text::AbstractString)::Vector{ACMatch} where T
eachmatch(obj::AhoCorasickAutomaton{T}, text::DenseVector{T2})::Vector{ACMatch} where T where T2

Search for all matches of a AhoCorasickAutomaton obj in text and return a vector of the matches. Each match is represented as a ACMatch type.

See also

ACMatch

Base.getindexMethod
getindex(xs::AbstractString, pos::ACMatch)

Retrieve a substring of xs specified by pos.

See also

ACMatch

Base.getindexMethod
getindex(xs::DenseVector{UInt8}, pos::ACMatch)

Retrieve a sub-codeunits of xs specified by pos.

See also

ACMatch

Base.setindex!Method
setindex!(xs::DenseVector{UInt8}, sub::DenseVector{UInt8}, pos::ACMatch)

Set values of region specified by pos of xs to sub.

Note

@assert length(sub) == length(pos)

See also

ACMatch

  • 1Jun‐Ichi Aoe, Katsushi Morimoto and Takashi Sato 'An Efficient Implementation of Trie Structures', September 1992.
  • 2Aho, Alfred V.; Corasick, Margaret J. (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM. 18 (6): 333&ndash, 340. doi:10.1145/360825.360855. MR 0371172.