Text validators

The simplest use of Automa is to match a regex. You are unlikely to want Automa for this instead of Julia's built-in regex engine PCRE, unless you need the extra performance Automa brings over PCRE. Nonetheless, it serves as a good starting point for introducing Automa.

Suppose we have the FASTA regex from the regex page:

julia> fasta_regex = let
           header = re"[a-z]+"
           seqline = re"[ACGT]+"
           record = '>' * header * '\n' * rep1(seqline * '\n')
           rep(record)
       end;

Buffer validator

Automa comes with a convenience function generate_buffer_validator. Given a regex (RE) like the one above, we can do:

julia> eval(generate_buffer_validator(:validate_fasta, fasta_regex));

julia> validate_fasta
validate_fasta (generic function with 1 method)

And we now have a function that checks if some data matches the regex:

julia> validate_fasta(">hello\nTAGAGA\nTAGAG") # missing trailing newline; unexpected EOF
0

julia> validate_fasta(">helloXXX") # error at byte index 7
7

julia> validate_fasta(">hello\nTAGAGA\nTAGAG\n") # matches; returns nothing

IO validators

For large files, reading all the data into a buffer for validation may not be feasible. Automa also supports creating IO validators with the generate_io_validator function:

This works very similarly to generate_buffer_validator, but the generated function takes an IO and has a different return value:

  • If the data matches, the function still returns nothing.
  • Otherwise, it returns (byte, (line, column)), where byte is the first errant byte and (line, column) is the position of that byte. If the errant byte is a newline, column is 0. If the input reaches unexpected EOF, byte is nothing, and (line, column) points to the last line and column in the IO:
julia> eval(generate_io_validator(:validate_io, fasta_regex));

julia> validate_io(IOBuffer(">hello\nTAGAGA\n"))

julia> validate_io(IOBuffer(">helX"))
(0x58, (1, 5))

julia> validate_io(IOBuffer(">hello\n\n"))
(0x0a, (3, 0))

julia> validate_io(IOBuffer(">hello\nAC"))
(nothing, (2, 2))
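To validate a file on disk without loading it into memory, the generated function can be passed an open IO handle. The following is a minimal sketch, assuming Automa and TranscodingStreams are loaded and using a hypothetical file path:

```julia
using Automa
using TranscodingStreams  # required for generate_io_validator

# The FASTA regex from above.
fasta_regex = let
    header = re"[a-z]+"
    seqline = re"[ACGT]+"
    record = '>' * header * '\n' * rep1(seqline * '\n')
    rep(record)
end

eval(generate_io_validator(:validate_fasta_io, fasta_regex))

# `open(f, path)` calls f on the file's IO handle, then closes it.
# "seqs.fasta" is a hypothetical path for illustration.
result = open(validate_fasta_io, "seqs.fasta")
if result === nothing
    println("valid FASTA")
else
    byte, (line, col) = result
    println("invalid byte $byte at line $line, column $col")
end
```

The `open(f, path)` form ensures the file is closed even if validation throws.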

Reference

generate_buffer_validator(name::Symbol, regexp::RE; goto=true, docstring=true)

Generate code that, when evaluated, defines a function named name, which takes a single argument data, interpreted as a sequence of bytes. The function returns nothing if data matches the regex, else the index of the first invalid byte. If the machine reaches unexpected EOF, it returns 0.

If goto, the function uses the faster but more complicated :goto code.
If docstring, automatically create a docstring for the generated function.

See also: generate_io_validator

generate_io_validator(funcname::Symbol, regex::RE; goto::Bool=false)

NOTE: This method requires TranscodingStreams to be loaded

Create code that, when evaluated, defines a function named funcname. This function takes an IO and checks if the data in the input conforms to the regex, without executing any actions. If the input conforms, it returns nothing. Else, it returns (byte, (line, col)), where byte is the first invalid byte and (line, col) the 1-indexed position of that byte. If the invalid byte is a \n byte, col is 0 and the line number is incremented. If the input errors due to unexpected EOF, byte is nothing, and (line, col) points to the last byte in the file.

If goto, the function uses the faster but more complicated :goto code.

See also: generate_buffer_validator

compile(re::RE; optimize::Bool=true, unambiguous::Bool=true)::Machine

Compile a finite state machine (FSM) from re. If optimize, attempt to minimize the number of states in the FSM. If unambiguous, disallow creation of FSM where the actions are not deterministic.


machine = let
    name = re"[A-Z][a-z]+"
    first_last = name * re" " * name
    last_first = name * re", " * name
    compile(first_last | last_first)
end

compile(tokens::Vector{RE}; unambiguous=false)::TokenizerMachine

Compile the regex tokens to a tokenizer machine. The machine can be passed to make_tokenizer.

The keyword unambiguous decides which of multiple matching tokens is emitted: If false (the default), the longest token is emitted. If multiple tokens have the same length, the one with the highest index is emitted. If true, make_tokenizer will error if any possible input text can be broken down ambiguously into tokens.

See also: Tokenizer, make_tokenizer, tokenize
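As a minimal sketch of how a compiled tokenizer machine is used with make_tokenizer, assuming Automa is loaded (the two token patterns here are made up for illustration):

```julia
using Automa

# Two hypothetical token patterns: runs of lowercase letters, runs of digits.
tokens = [re"[a-z]+", re"[0-9]+"]

# Compile to a TokenizerMachine, then generate and evaluate tokenizing code.
machine = compile(tokens)
make_tokenizer(machine) |> eval

# Tokens are emitted as (position, length, token_index) triples,
# where token index 0 marks input that matched no pattern.
for (pos, len, tok) in tokenize(UInt32, "abc123")
    println("token $tok at $pos, length $len")
end
```

With the input "abc123", the first triple covers "abc" (pattern 1) and the second covers "123" (pattern 2).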