AhoCorasickAutomatons.ACMatch
— TypeACMatch has 3 fields:
1. s : start of match
2. t : stop of match, using text[s, t] to get matched patterns
3. i : index of the key in *obj*, which is the original insertion order of keys to *obj*
The field i may be use as index of external property arrays, i.e., the AhoCorasickAutomaton can act as a Map{String, Any}
.
See also
eachmatch, getindex, setindex!
AhoCorasickAutomatons.AhoCorasickAutomaton
— TypeAhoCorasickAutomaton{T}(keys::Vector{Pair{String, Ti}}; sort = false) where {T, Ti}
AhoCorasickAutomaton{T}(keys::AbstractDict{String, Ti}; sort = false) where {T, Ti}
AhoCorasickAutomaton{T}(keys::Vector{String}; sort = false) where T
AhoCorasickAutomaton(keys::Vector{String}; sort = false) = AhoCorasickAutomaton{UInt32}(keys; sort = sort)
AhoCorasickAutomaton(keys::Vector{Pair{String, Ti}}; sort = false) where Ti = AhoCorasickAutomaton{UInt32}(keys; sort = sort)
T s.t. typemax(T) >= length(keys)
A 2-Array implementation of Aho–Corasick automaton(ACA)[1], Most used as an engine to mine a long text string, to get all occurences of substrings interested in the text. ACA is also adequate for unicode strings as a Set{String}
or Dict{String, Unsigned}
(similar as Trie, but with a far more less space usage).
The ACA acts as a key data structrue in Aho–Corasick algorithm for multiple consecutive string patterns finding[2]. This particular implementation use no-decreasing base
, i.e.,
1. base[node] >= node
2. children[node] > node
This choice make it suitable for large-size key set of not-very-long keys, with faster insertion speed and moderate space usage.
Notes
- Maybe of slower speed than a oridinary tree-based ACA, specially for random generated keys.
- When inserting duplicated keys, only the last one will make sense.
- Inputing Sorted keys will be of (much) faster speed. Just turn the sort option on!
Examples
julia> v = ["A", "B"]; obj1 = AhoCorasickAutomaton{UInt8}(v; sort = true) # constructor
type AhoCorasickAutomaton{UInt8}
––––– –––––––––––––––––––––––––––
keys 2
nodes 67
nnz 3
size 623 bytes
julia> for key in v @assert key ∈ obj1 end
julia> @assert sort!(keys(obj1)) == v
julia> d = Dict("A" => 1, "B" => 2); obj2 = AhoCorasickAutomaton{UInt8}(d; sort = true); obj1 == obj2 # constructor
true
julia> obj3 = AhoCorasickAutomaton{UInt8}(collect(d); sort = true); obj2 == obj3 # constructor
true
julia> s = "AABDB"; matches = eachmatch(obj1, s) # eachmatch
4-element Array{ACMatch,1}:
ACMatch(1, 1, 1)
ACMatch(2, 2, 1)
ACMatch(3, 3, 2)
ACMatch(5, 5, 2)
julia> matches == eachmatch(obj1, codeunits(s)) # eachmatch
true
julia> map(x -> s[x], matches) # getindex
4-element Array{String,1}:
"A"
"A"
"B"
"B"
julia> ss = collect(codeunits(s)); foreach(x -> ss[x] = ss[x], matches); String(ss) == s # setindex!
true
julia> io = IOBuffer(); write(io, obj1); seek(io, 0); obj11 = read(io, AhoCorasickAutomaton); obj1 == obj11 # read & write
true
For more : using Pkg; less(joinpath(Pkg.dir("AhoCorasickAutomatons"), "test", "runtests.jl"))
See also
ACMatch, eachmatch, getindex, setindex!, in, get, length, read, write, collect, keys, values.
References
Base.eachmatch
— Methodeachmatch(obj::AhoCorasickAutomaton{T}, text::AbstractString)::Vector{ACMatch} where T
eachmatch(obj::AhoCorasickAutomaton{T}, text::DenseVector{T2})::Vector{ACMatch} where T where T2
Search for all matches of a AhoCorasickAutomaton obj in text and return a vector of the matches. Each match is represented as a ACMatch
type.
See also
ACMatch
Base.getindex
— Methodgetindex(xs::AbstractString, pos::ACMatch)
Retrieve a substring of xs specified by pos.
See also
ACMatch
Base.getindex
— Methodgetindex(xs::DenseVector{UInt8}, pos::ACMatch)
Retrieve a sub-codeunits of xs specified by pos.
See also
ACMatch
Base.setindex!
— Methodsetindex!(xs::DenseVector{UInt8}, sub::DenseVector{UInt8}, pos::ACMatch)
Set values of region specified by pos of xs to sub.
Note
@assert length(sub) == length(pos)
See also
ACMatch
- 1Jun‐Ichi Aoe, Katsushi Morimoto and Takashi Sato 'An Efficient Implementation of Trie Structures', September 1992.
- 2Aho, Alfred V.; Corasick, Margaret J. (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM. 18 (6): 333&ndash, 340. doi:10.1145/360825.360855. MR 0371172.