BytePairEncoding.BPELearner - Type
BPELearner(tokenization::AbstractTokenization; min_freq = 10, endsym = "</w>", sepsym = nothing)

Construct a learner from a tokenization that contains BPETokenization and NoBPE.

(bper::BPELearner)(word_counts, n_merge)

Calling the learner on a word_counts dictionary (created by count_words) generates a new tokenization in which NoBPE is replaced with the learned BPE. n_merge is the number of merges to learn.
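To make the role of word_counts, n_merge, min_freq, and endsym concrete, here is a minimal pure-Julia sketch of the general BPE merge-learning loop. It is not the code used by BPELearner: the function name learn_merges is hypothetical, and appending endsym to the last symbol of each word is an assumption borrowed from the common BPE convention.

```julia
# Toy BPE merge learner. `learn_merges` is a hypothetical helper for illustration,
# not part of BytePairEncoding.jl.
function learn_merges(word_counts::Dict{String,Int}, n_merge::Int;
                      min_freq::Int = 2, endsym::String = "</w>")
    # Split each word into single-character symbols and mark the word end
    # (assumed convention: endsym is appended to the last symbol).
    vocab = Dict{Vector{String},Int}()
    for (word, count) in word_counts
        symbols = String[string(c) for c in word]
        isempty(symbols) && continue
        symbols[end] *= endsym
        vocab[symbols] = get(vocab, symbols, 0) + count
    end

    merges = Tuple{String,String}[]
    for _ in 1:n_merge
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Dict{Tuple{String,String},Int}()
        for (symbols, count) in vocab
            for i in 1:length(symbols)-1
                pair = (symbols[i], symbols[i+1])
                pair_counts[pair] = get(pair_counts, pair, 0) + count
            end
        end
        isempty(pair_counts) && break

        # Pick the most frequent pair; stop once it falls below min_freq.
        best_count, best_pair = findmax(pair_counts)
        best_count < min_freq && break
        push!(merges, best_pair)

        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = Dict{Vector{String},Int}()
        for (symbols, count) in vocab
            merged = String[]
            i = 1
            while i <= length(symbols)
                if i < length(symbols) && (symbols[i], symbols[i+1]) == best_pair
                    push!(merged, symbols[i] * symbols[i+1])
                    i += 2
                else
                    push!(merged, symbols[i])
                    i += 1
                end
            end
            new_vocab[merged] = get(new_vocab, merged, 0) + count
        end
        vocab = new_vocab
    end
    return merges
end

# Ask for up to 10 merges; learning stops early when no pair reaches min_freq.
merges = learn_merges(Dict("hug" => 10, "pug" => 5, "hugs" => 3), 10; min_freq = 4)
```

In this sketch, min_freq acts as an early-stopping threshold, so fewer than n_merge merges may be returned. Ties between equally frequent pairs are broken arbitrarily here.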

BytePairEncoding.bbpe2tiktoken - Function
bbpe2tiktoken(tkr)

Convert a gpt2-like byte-level tokenizer (with bpe::BPE) to a tiktoken tokenizer (with bpe::TikToken). If the tokenizer contains a CodeNormalizer, it is removed accordingly.

see also: tiktoken2bbpe

BytePairEncoding.count_words - Method
count_words(bper::BPELearner, files::AbstractVector)

Given a list of files, where each line of a file is treated as a (multi-sentence) document, tokenize those files and count the occurrences of each word token.
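The shape of the resulting word-count dictionary can be illustrated with a small pure-Julia sketch. This is not the implementation of count_words: the helper name count_whitespace_words is hypothetical, and plain whitespace splitting stands in for the learner's actual tokenization.

```julia
# Hypothetical helper for illustration only: counts whitespace-separated words,
# treating every line of every file as one document.
function count_whitespace_words(files::AbstractVector{<:AbstractString})
    counts = Dict{String,Int}()
    for file in files
        for line in eachline(file)      # one line = one document
            for word in split(line)     # stand-in for the real tokenization
                key = String(word)
                counts[key] = get(counts, key, 0) + 1
            end
        end
    end
    return counts
end

# Write a two-document temporary file and count its words.
path, io = mktemp()
write(io, "the cat sat\nthe cat ran\n")
close(io)
counts = count_whitespace_words([path])
```

A dictionary of this form, mapping each word token to its frequency, is what the learner consumes as word_counts.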

BytePairEncoding.load_tiktoken - Method
load_tiktoken(name)

Load a tiktoken tokenizer. name can be "cl100k_base", "p50k_base", "r50k_base", or "gpt2".
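A short usage sketch follows. It assumes the package is installed and that the vocabulary data can be fetched on first use; calling the returned tokenizer directly on a string follows the package README rather than anything stated in this section.

```julia
using BytePairEncoding

# Load the tokenizer by name (may download the vocabulary on first use).
tkr = BytePairEncoding.load_tiktoken("cl100k_base")

# Assumption from the package README: the tokenizer is callable on a string
# and returns a vector of token strings.
tokens = tkr("hello world aaaaaaaaaaaa")
```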

BytePairEncoding.tiktoken2bbpe - Function
tiktoken2bbpe(tkr, codemap::Union{CodeMap, Nothing} = nothing)

Convert a tiktoken tokenizer (with bpe::TikToken) to a gpt2-like byte-level tokenizer (with bpe::BPE). If codemap is provided, the corresponding CodeNormalizer is added to the tokenizer.

see also: bbpe2tiktoken
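The two conversion functions can be chained as in the sketch below. It uses only the signatures documented above, with the default codemap of nothing. Whether the round trip reproduces the original tokenizer exactly is an assumption and has not been verified here.

```julia
using BytePairEncoding

# Start from a tiktoken tokenizer (may download the vocabulary on first use).
tkr = BytePairEncoding.load_tiktoken("cl100k_base")

# Convert to a gpt2-like byte-level tokenizer; no codemap is passed,
# so no CodeNormalizer is added.
bbpe_tkr = BytePairEncoding.tiktoken2bbpe(tkr)

# Convert back to a tiktoken tokenizer.
tkr_again = BytePairEncoding.bbpe2tiktoken(bbpe_tkr)
```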