BSON

BSON is a efficient compressed json-like data encoding. This BSON specification is copied verbatim from http://bsonspec.org/spec.html. The original document uses pseudo-BNF syntax, this document uses rearranged but otherwise quite similar syntax

using CombinedParsers

The Notes contain information for reassembling the parsed part of the byte array into a JSON-like structure. This document adds the julia functions to the parser. The result document is a pure julia BSON parser.

Specification Version 1.1

BSON is a binary format in which zero or more ordered key/value pairs are stored as a single entity. We call this entity a document.

The following grammar specifies version 1.1 of the BSON standard. We've written the grammar using a CombinedParsers syntax. Valid BSON data is represented by the document non-terminal.

Basic Types

The following basic types are used as terminals in the rest of the grammar. Each type must be serialized in little-endian format.

byte = Bytes(1,UInt8)         #  (8-bits)
int32 = Bytes(4,Int32)        # (32-bit signed integer, two's complement)
int64 = Bytes(8,Int64)        # (64-bit signed integer, two's complement)
uint64 = Bytes(8,UInt64)      # (64-bit unsigned integer)
double = Bytes(8,Float64)     # (64-bit IEEE 754-2008 binary floating point)
using DecFP
decimal128 = Bytes(16,Dec128) # (128-bit IEEE 754-2008 decimal floating point)
16 TypedBytes::DecFP.Dec128
::DecFP.Dec128

Non-terminals

The following specifies the rest of the BSON grammar. Note that quoted strings represent terminals, and should be interpreted with C semantics (e.g. "\x01" represents the byte 0000 0001).

@syntax t = map(ValueIn,Sequence(2, "\\x", integer_base(16,2,2)))

element = Either{Any}(Any[])
e_list = Repeat(element)

Also note that we use the * operator as shorthand for repetition (e.g. ("\x01"*2) is "\x01\x01"). When used as a unary operator, * means that the repetition can occur 0 or more times.

Note

* here is replaced with Repeat.

BSON Document. int32 is the total number of bytes comprising the document.

@syntax document = Sequence(int32, e_list, t"\x00")[2]
๐Ÿ—„ Sequence[2] |> with_name(:document)
โ”œโ”€ 4 TypedBytes::Int32
โ”œโ”€ |* Either |> Repeat
โ””โ”€ [0] ValueIn
::Vector{Any}

Zero or more modified UTF-8 encoded characters followed by '\x00'. The (byte*) MUST NOT contain '\x00', hence it is not full UTF-8.

@syntax cstring = map(String,Repeat_until(byte, t"\x00"))
@syntax e_name = map(Symbol,cstring)            # Key name
๐Ÿ—„ Sequence[1] |> map(String) |> map(Symbol) |> with_name(:cstring) |> with_name(:e_name)
โ”œโ”€ (?>๐Ÿ—„*) Sequence[2] |> Repeat |> Atomic
โ”‚  โ”œโ”€ (?![0]) ValueIn |> NegativeLookahead
โ”‚  โ””โ”€ 1 TypedBytes::UInt8
โ””โ”€ [0] ValueIn
::Symbol

String - The int32 is the number bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.

@inline string_until_before(v) = Bytes(v-1,String)
@syntax lstring = (after(string_until_before, String, int32) * t"\x00")[1]
๐Ÿ—„ Sequence[1] |> with_name(:lstring)
โ”œโ”€ ๐Ÿ—„ FlatMap
โ”‚  โ”œโ”€ 4 TypedBytes::Int32
โ”‚  โ””โ”€ string_until_before
โ””โ”€ [0] ValueIn
::String

Binary - The int32 is the number of bytes in the (byte*).

@syntax subtype = Either((
    t"\x00",                                  # Generic binary subtype
    t"\x01",                                  # Function
    t"\x02",                                  # Binary (Old)
    t"\x03",                                  # UUID (Old)
    t"\x04",                                  # UUID
    t"\x05",                                  # MD5
    t"\x06",                                  # Encrypted BSON value
    t"\x80",                                  # User defined
))
@syntax binary = after(Vector{UInt8}, int32) do v
    subtype * Bytes(v,Vector{UInt8})
end
๐Ÿ—„ FlatMap |> with_name(:binary)
โ”œโ”€ 4 TypedBytes::Int32
โ””โ”€ #1
::Vector{UInt8}
@syntax code_w_s = int32 * lstring * document # Code w/ scope โ€” Deprecated
for e in [
    :float64 => t"\x01" * e_name * double,               # 64-bit binary floating point
    :string => t"\x02" * e_name * lstring,                # UTF-8 string
    :embedded => t"\x03" * e_name * document,            # Embedded document
    :array => t"\x04" * e_name * document,               # Array
    t"\x05" * e_name * binary,                           # Binary data
    t"\x06" * e_name * (Always() => missing),            # Undefined (value) โ€” Deprecated
    t"\x07" * e_name * Bytes(12,Vector{UInt8}),          # ObjectId
    :boolean => t"\x08" * e_name * Either(
        t"\x00" => false,                                # Boolean "false"
        t"\x01" => true),                                # Boolean "true"
    :UTC => t"\x09" * e_name * map(int64) do v
    v |> Millisecond |> Dates.UTInstant |> DateTime
    end,                                                 # UTC datetime
    t"\x0A" * e_name * (Always() => nothing),            # Null value
    :regex => t"\x0B" * e_name * map(v->Regex(v...),cstring * cstring),
    # Regular expression - The first cstring is the regex pattern,
    # the second is the regex options string.
    # Options are identified by characters, which must be stored in alphabetical order.
    t"\x0C" * e_name * lstring * Bytes(12,Vector{UInt8}), # DBPointer โ€” Deprecated
    t"\x0D" * e_name * lstring,                           # JavaScript code
    t"\x0E" * e_name * lstring,                           # Symbol. โ€” Deprecated
    t"\x0F" * e_name * code_w_s,                         # JavaScript code w/ scope โ€” Deprecated
    t"\x10" * e_name * int32,                            # 32-bit integer
    t"\x11" * e_name * uint64,                           # Timestamp
    :int => t"\x12" * e_name * int64,                    # 64-bit integer
    t"\x13" * e_name * decimal128,                       # 128-bit decimal floating point
    t"\xFF" * e_name,                                    # Min key
    t"\x7F" * e_name,                                    # Max key
]
    pairp = parser(e)
    push!(element,
          map(if result_type(pairp) <: Tuple && fieldcount(result_type(pairp))==3
              v->v[2]=>v[3]
              else
              v->v[2]=>v[3:end]
              end,
              pairp))
end

Notes

  • Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
  • UTC datetime - The int64 is UTC milliseconds since the Unix epoch.
  • Timestamp - Special internal type used by MongoDB replication and sharding. First 4 bytes are an increment, second 4 are a timestamp.
  • Min key - Special type which compares lower than all other possible BSON element values.
  • Max key - Special type which compares higher than all other possible BSON element values.
  • Generic binary subtype - This is the most commonly used binary subtype and should be the 'default' for drivers and tools.
  • The BSON "binary" or "BinData" datatype is used to represent arrays of bytes. It is somewhat analogous to the Java notion of a ByteArray. BSON binary values have a subtype. This is used to indicate what kind of data is in the byte array. Subtypes from zero to 127 are predefined or reserved. Subtypes from 128-255 are user-defined.
    • \x02 Binary (Old) - This used to be the default subtype, but was deprecated in favor of \x00. Drivers and tools should be sure to handle \x02 appropriately. The structure of the binary data (the byte* array in the binary non-terminal) must be an int32 followed by a (byte*). The int32 is the number of bytes in the repetition.
    • \x03 UUID (Old) - This used to be the UUID subtype, but was deprecated in favor of \x04. Drivers and tools for languages with a native UUID type should handle \x03 appropriately.
    • \x80-\xFF "User defined" subtypes. The binary data can be anything.
  • Code w/ scope - Deprecated. The int32 is the length in bytes of the entire codews value. The string is JavaScript code. The document is a mapping from identifiers to values, representing the scope in which the string should be evaluated.

Testing examples

using Dates
testdata = Dict(:a => 1, :b => 3.0, :c => "yeay", :d => true)
Dict{Symbol, Any} with 4 entries:
  :a => 1
  :b => 3.0
  :d => true
  :c => "yeay"

The BSON.jl package provides an optimized and feature-rich BSON serializer and parser.

using BSON
bson("test.bson", testdata)

s = read("test.bson")
parse(document,s, log=true)
4-element Vector{Any}:
 :a => 1
 :b => 3.0
 :d => true
 :c => "yeay"

Aim of this page is demonstrating the straightforward implementation of a BSON parser. The parser is untested and not feature-complete regarding subtypes etc. Parsing logic can be plugged into the elements with map.

Benchmarking

using BenchmarkTools

BSON has an optimized parser, used for comparison

@benchmark BSON.parse(IOBuffer(s))
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min โ€ฆ max):  2.390 ฮผs โ€ฆ  1.972 ms  โ”Š GC (min โ€ฆ max):  0.00% โ€ฆ 99.49%
 Time  (median):     3.654 ฮผs              โ”Š GC (median):     0.00%
 Time  (mean ยฑ ฯƒ):   4.289 ฮผs ยฑ 33.623 ฮผs  โ”Š GC (mean ยฑ ฯƒ):  13.53% ยฑ  1.72%

             โ–โ–ƒโ–„โ–‡โ–‡โ–ˆโ–‡โ–‡โ–‡โ–‡โ–ˆโ–ˆโ–‡โ–‡โ–‡โ–†โ–„โ–‚โ–                              
  โ–‚โ–โ–‚โ–ƒโ–„โ–„โ–…โ–…โ–†โ–†โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–†โ–…โ–…โ–…โ–…โ–„โ–„โ–„โ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚ โ–…
  2.39 ฮผs        Histogram: frequency by time        5.89 ฮผs <

 Memory estimate: 1.50 KiB, allocs estimate: 19.

The CombinedParser matching is slower.

@benchmark match(document,s)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min โ€ฆ max):  35.301 ฮผs โ€ฆ  12.463 ms  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 99.27%
 Time  (median):     43.526 ฮผs               โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   48.182 ฮผs ยฑ 214.722 ฮผs  โ”Š GC (mean ยฑ ฯƒ):  7.69% ยฑ  1.72%

             โ–‚โ–‚โ–…โ–‡โ–ˆโ–ˆโ–‡โ–„โ–                                          
  โ–โ–โ–‚โ–‚โ–ƒโ–„โ–„โ–†โ–‡โ–‡โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–…โ–…โ–…โ–„โ–„โ–„โ–„โ–„โ–ƒโ–ƒโ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ– โ–ƒ
  35.3 ฮผs         Histogram: frequency by time         65.3 ฮผs <

 Memory estimate: 15.67 KiB, allocs estimate: 122.

Transformation is doing too much work currently - lazy access optimizations are planned.

@benchmark parse(document,s)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min โ€ฆ max):  69.432 ฮผs โ€ฆ  16.130 ms  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 98.93%
 Time  (median):     88.237 ฮผs               โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   98.705 ฮผs ยฑ 386.843 ฮผs  โ”Š GC (mean ยฑ ฯƒ):  9.54% ยฑ  2.42%

                      โ–‚โ–‚โ–…โ–†โ–ˆโ–ˆโ–†โ–…โ–…โ–ƒโ–                               
  โ–‚โ–‚โ–โ–‚โ–‚โ–‚โ–‚โ–‚โ–ƒโ–ƒโ–ƒโ–ƒโ–„โ–„โ–…โ–…โ–…โ–†โ–†โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–†โ–†โ–„โ–„โ–„โ–„โ–„โ–„โ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚ โ–„
  69.4 ฮผs         Histogram: frequency by time          113 ฮผs <

 Memory estimate: 27.14 KiB, allocs estimate: 376.

Profiling

Profiling reveals that most time is lost due to the Either state type.

# using Profile
# Profile.clear()
# f(n,a...) = for _ in 1:n; match(a...); end
# @profile f(100000,document,s);
# using ProfileView
# ProfileView.view()

This is looked into, I have a strong hunch that the allocation can be removed so CombinedParser can come close to BSON.jl.

Dates

Seems, reading a document written with BSON works. BSON.jl has a nonstandard way with Dates...

testdata = Dict(:a => 1, :b => 3.0, :c => "yeay", :d => true, :e => now())
bson("test.bson", testdata)
s = read("test.bson")
parse(document,s, log=true)
5-element Vector{Any}:
 :a => 1
 :b => 3.0
 :d => true
 :e => Any[:tag => "struct", :type => Any[:tag => "datatype", :params => Any[], :name => Any[Symbol("0") => "Dates", Symbol("1") => "DateTime"]], :data => Any[Symbol("0") => Any[:tag => "struct", :type => Any[:tag => "datatype", :params => Any[Symbol("0") => Any[:tag => "datatype", :params => Any[], :name => Any[Symbol("0") => "Dates", Symbol("1") => "Millisecond"]]], :name => Any[Symbol("0") => "Dates", Symbol("1") => "UTInstant"]], :data => Any[Symbol("0") => Any[:tag => "struct", :type => Any[:tag => "datatype", :params => Any[], :name => Any[Symbol("0") => "Dates", Symbol("1") => "Millisecond"]], :data => Any[Symbol("0") => 63766179376419]]]]]]
 :c => "yeay"

This parser does not do any of the magic in BSON.jl to reassemble into julia types. BSON dates work though:

let d = now()
    Any[:e => d] ==
        parse(document,UInt8[0x10,0x0,0x0,0x0, # we actually ignore the 4 byte length of the vector
                             0x09,0x65,0x0,reinterpret(UInt8,[Dates.value(d)])...,0x0], log=true)
end
true

This page was generated using Literate.jl.