ChunkedCSV.BitSetMatrix — Type

    BitSetMatrix <: AbstractMatrix{Bool}

A matrix representing the missingness of values in the result buffer. The number of rows in the matrix is equal to the number of rows with at least one missing value in the result buffer. The number of columns in the matrix is equal to the number of columns in the result buffer.
When consuming a `TaskResultBuffer`, it is recommended to iterate it from start to finish and check each row's `RowStatus` for the `HasColumnIndicators` flag, which signals that the row contains missing values. Using `ColumnIterator`s is the easiest way to do this. For example:
```julia
# The first column has type T
for (value, isinvalidrow, ismissingvalue) in ColumnIterator{T}(result_buffer, 1)
    if isinvalidrow
        # The row didn't match the schema, so we'd better discard it
        continue
    end
    if ismissingvalue
        # The value is missing, so we can't use it
        continue
    end
    # Use the value
end
```
Indexing

- `bs[r, c]`: Get the value at row `r` and column `c` of the matrix.
- `bs[r, :]`: Get the values in row `r` of the matrix.
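For illustration, here is a hedged sketch of using this indexing while walking a `TaskResultBuffer` (the helper name `missing_columns_per_row` is hypothetical; the field and flag names follow the `TaskResultBuffer` and `RowStatus` documentation below):

```julia
using ChunkedCSV

# Hypothetical helper: report which columns are missing for each flagged row.
# `column_indicators` only stores a bitset row for rows whose status carries
# the HasColumnIndicators flag, so we keep a separate running index `ci_row`.
function missing_columns_per_row(buf::ChunkedCSV.TaskResultBuffer)
    RS = ChunkedCSV.RowStatus
    ci_row = 0
    for (row, status) in enumerate(buf.row_statuses)
        (status & RS.HasColumnIndicators) == RS.Ok && continue
        ci_row += 1
        # `buf.column_indicators[ci_row, :]` yields one Bool per column
        println("row $row is missing columns ", findall(buf.column_indicators[ci_row, :]))
    end
end
```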
See also:
ChunkedCSV.ColumnIterator — Type

    ColumnIterator{T}

Iterate over a column of a `TaskResultBuffer`. The iterator yields values of type `ParsedField{T}`, which is a struct containing the parsed value, a flag indicating whether the row was invalid, and a flag indicating whether the value was missing.
ChunkedCSV.DebugContext — Type

    DebugContext(error_only::Bool=true, n::Int=3, err_len::Int=255, show_values::Bool=false)

A consume context that prints out simple debug information about the parsed chunks. We print the first `err_len` bytes of the first `n` rows with an error `RowStatus` in each chunk, and optionally the parsed values for those rows.
Arguments:

- `error_only`: Set to `false` to also see the first `n` parsed values for each column in each chunk.
- `n`: Number of rows to print for each chunk.
- `err_len`: Number of bytes to print for each errored row.
- `show_values`: Set to `true` to also see the parsed values for each errored row.
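A hedged usage sketch (the CSV content is made up; the positional arguments follow the signature above):

```julia
using ChunkedCSV

# Two Int columns expected; the second data row won't parse as Int.
io = IOBuffer("a,b\n1,2\nxxx,4\n")
debug_ctx = ChunkedCSV.DebugContext(false, 3, 255, true)
ChunkedCSV.parse_file(io, [Int, Int], debug_ctx)
```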
ChunkedCSV.GuessDateTime — Type

    GuessDateTime

A type that implements `Parsers.typeparser` to parse various ISO 8601-like formats into a `DateTime`. If the input timestamp has timezone information, we always convert it to UTC.
It will parse the following formats:

```
yyyy-mm-dd
yyyy-mm-dd HH:MM:SS
yyyy-mm-dd HH:MM:SS.s    # where `s` is 1-3 digits, but we also support rounding to milliseconds
yyyy-mm-dd HH:MM:SSZ     # where `Z` is any valid timezone
yyyy-mm-dd HH:MM:SS.sZ
yyyy-mm-ddTHH:MM:SS
yyyy-mm-ddTHH:MM:SS.s
yyyy-mm-ddTHH:MM:SSZ
yyyy-mm-ddTHH:MM:SS.sZ
```
Examples:

```julia
julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=10, val=2014-01-01T00:00:00)

julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01 12:34:56")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=19, val=2014-01-01T12:34:56)

julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01 12:34:56.789")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=23, val=2014-01-01T12:34:56.789)

julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01 12:34:56Z")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=20, val=2014-01-01T12:34:56)

julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01 12:34:56.789Z")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=24, val=2014-01-01T12:34:56.789)
```
ChunkedCSV.TaskResultBuffer — Type

    TaskResultBuffer

Holds the parsed results in columnar buffers.
Fields

- `id::Int`: The unique identifier of the buffer object, in the range of 1 to two times the `nworkers` argument to the `parse_file` function.
- `cols::Vector{BufferedVector}`: A vector of vectors, each corresponding to a column in the CSV file. Note this field is abstractly typed.
- `row_statuses::BufferedVector{RowStatus.T}`: Contains a `UInt8` status flag for each row.
- `column_indicators::BitSetMatrix`: A special type of `BitMatrix` where each row is a bitset signalling missing column values. The number of rows corresponds to the number of row statuses where the `HasColumnIndicators` flag is set.

Notes

- Each column in the `cols` field is a `BufferedVector` of the same type as the corresponding column in the `ParsingContext` schema.
- The `row_statuses` vector has the same length as each of the `cols` vectors.
- Strings are stored lazily as `Parsers.PosLen31` pointers into the underlying byte buffer (available in the `bytes` field of `ParsingContext`); see the sketch below this list.
- When the file was parsed with `ignoreemptyrows=true` and/or a non-default `comment` argument, the `row_statuses` field might contain `SkippedRow` flags for all rows that were skipped.
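A hedged sketch of materializing one such lazily-stored string. The property names `pos` and `len` on `Parsers.PosLen31` are assumptions (they mirror `Parsers.PosLen`), and escaped fields would additionally need proper unescaping, e.g. via `Parsers.getstring`:

```julia
using Parsers

# Hypothetical helper: copy the raw bytes of cell (row, col) out of the chunk.
# `bytes` comes from the ParsingContext; `buf` is a TaskResultBuffer whose
# column `col` has element type Parsers.PosLen31.
function get_cell_string(bytes::Vector{UInt8}, buf, row::Int, col::Int)
    poslen = buf.cols[col][row]::Parsers.PosLen31
    return String(bytes[poslen.pos:poslen.pos + poslen.len - 1])
end
```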
Example:

The following shows the structure of a `TaskResultBuffer` storing results for a messy CSV file which we parsed expecting 3 `Int` columns while skipping over comments:

```
+-------------------------+-------------------------------------------------------------------------------+
| INPUT CSV               | TASK_RESULT_BUFFER                                                            |
+-------------------------+---------------------------+--------------------+----------+---------+---------+
| head,er,row             | row_statuses              | column_indicators  | cols[1]  | cols[2] | cols[3] |
+-------------------------+---------------------------+--------------------+----------+---------+---------+
| 1,1,1                   | Ok                        | No value           | 1        | 1       | 1       |
| 2,,2                    | HasCI                     | 0 1 0 #=[1,:]=#    | 2        | undef   | 2       |
| 2,,                     | HasCI                     | 0 1 1 #=[2,:]=#    | 2        | undef   | undef   |
| 3,3                     | HasCI | TooFewColumns    | 0 0 1 #=[3,:]=#    | 3        | 3       | undef   |
| 3                       | HasCI | TooFewColumns    | 0 1 1 #=[4,:]=#    | 3        | undef   | undef   |
| 4,4,4,4                 | TooManyColumns            | No value           | 4        | 4       | 4       |
| 4,4,4,4,4               | TooManyColumns            | No value           | 4        | 4       | 4       |
| garbage,garbage,garbage | HasCI | ValueParsingError | 1 1 1 #=[5,:]=#    | undef    | undef   | undef   |
| garbage,5,garbage       | HasCI | ValueParsingError | 1 0 1 #=[6,:]=#    | undef    | 5       | undef   |
| garbage,,garbage        | HasCI | ValueParsingError | 1 1 1 #=[7,:]=#    | undef    | undef   | undef   |
| # comment               | HasCI | SkippedRow        | 1 1 1 #=[8,:]=#    | undef    | undef   | undef   |
+-------------------------+---------------------------+--------------------+----------+---------+---------+
HasCI = HasColumnIndicators
```
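A minimal sketch of consuming a buffer like the one above, keeping only the fully-valid rows (the function name is hypothetical; the flag names follow the `RowStatus` module documented below):

```julia
using ChunkedCSV

function collect_ok_rows(buf::ChunkedCSV.TaskResultBuffer)
    RS = ChunkedCSV.RowStatus
    out = NTuple{3,Int}[]
    for (row, status) in enumerate(buf.row_statuses)
        status == RS.Ok || continue  # any problem flag disqualifies the row
        # Safe to read: every cell in an Ok row is defined
        push!(out, (buf.cols[1][row], buf.cols[2][row], buf.cols[3][row]))
    end
    return out
end
```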
ChunkedCSV.parse_file — Function

    parse_file(input, schema=nothing, consume_ctx::C; kwargs...) where {C<:AbstractConsumeContext} -> Nothing
    parse_file(
        should_close::Bool,
        parsing_ctx::ParsingContext,
        consume_ctx::C,
        chunking_ctx::ChunkingContext,
        lexer::Lexer;
        force::Symbol=:default,
    ) where {C<:AbstractConsumeContext} -> Nothing
Parse a CSV input in chunks of size `buffersize` and process them in parallel using `nworkers` tasks.

Before calling this function, you should define a custom `consume_ctx::C`, a subtype of `AbstractConsumeContext`, and implement a `consume!(::C, ::ParsedPayload)` method. The `consume_ctx` is then used to consume the parsed data by internally dispatching to `consume!(::C, ::ParsedPayload)`, which is also called in parallel. The parsed results can be found in the `results` field of `ParsedPayload`; see `TaskResultBuffer` for more information about the format in which the results are stored.
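For example, a minimal sketch of such a context (the type name `RowCountingContext` is made up; `AbstractConsumeContext` and `consume!` originate in ChunkedBase.jl, and the `results` field access follows the description above):

```julia
using ChunkedCSV
import ChunkedCSV: AbstractConsumeContext, consume!, ParsedPayload

# Hypothetical context that counts parsed rows across all worker tasks.
struct RowCountingContext <: AbstractConsumeContext
    nrows::Threads.Atomic{Int}  # atomic, since consume! runs in parallel
end
RowCountingContext() = RowCountingContext(Threads.Atomic{Int}(0))

function consume!(ctx::RowCountingContext, payload::ParsedPayload)
    # Counts all rows seen, including errored and skipped ones.
    Threads.atomic_add!(ctx.nrows, length(payload.results.row_statuses))
    return nothing
end

ctx = RowCountingContext()
ChunkedCSV.parse_file(IOBuffer("a,b\n1,2\n3,4\n"), [Int, Int], ctx)
@show ctx.nrows[]  # 2
```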
If you need to know the header and/or the schema which will be used to parse the file before creating your `consume_ctx`, you can call `setup_parser` and inspect the returned `ParsingContext`, then call `parse_file` with the other objects returned by `setup_parser`.
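A hedged sketch of that two-step workflow (the path `"data.csv"` is made up, and the field names `header` and `schema` on `ParsingContext` are assumptions; the return values and the second `parse_file` method follow the signatures documented here):

```julia
using ChunkedCSV

should_close, parsing_ctx, chunking_ctx, lexer = ChunkedCSV.setup_parser("data.csv")
# Inspect the header/schema before deciding how to consume the results.
@info "about to parse" parsing_ctx.header parsing_ctx.schema
consume_ctx = ChunkedCSV.DebugContext()  # substitute your own AbstractConsumeContext
ChunkedCSV.parse_file(should_close, parsing_ctx, consume_ctx, chunking_ctx, lexer)
```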
Arguments

- `input`: The input source to parse. Can be a `String` file path or an `IO` object.
- `schema`: An optional schema for the CSV file; if omitted, all columns will be parsed as `String`s (see the sketch after this list). It can be:
  - a single `DataType`, in which case it will be used for all columns in the file,
  - a `Vector{DataType}`, in which case each element of the vector corresponds to a single column,
  - a `Dict{Symbol,DataType}`, which maps types to columns by name,
  - a `Dict{Int,DataType}`, which maps types to columns by position, or
  - a `Base.Callable`, which will be called with the column index and name and should return a `DataType`.

  For the `Dict` variants (see also the `validate_type_map` keyword, which, unless set to `false`, validates the provided keys against the file), columns that are not present in the mapping will be parsed as `String`s.
- `consume_ctx`: A user-defined `<:AbstractConsumeContext` object which will be used to dispatch on `consume!(::C, ::ParsedPayload)` to consume the parsed data on each of the `nworkers` tasks in parallel.
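To illustrate the accepted schema forms (a hypothetical 3-column file with columns `a`, `b`, `c`):

```julia
# All of these request Int for the first column; unmapped columns stay String.
schema_all    = Int                                 # every column is Int
schema_vector = [Int, String, String]               # one entry per column
schema_byname = Dict(:a => Int)                     # map by column name
schema_bypos  = Dict(1 => Int)                      # map by column position
schema_func   = (i, name) -> i == 1 ? Int : String  # Base.Callable
```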
Keyword arguments

- `header`: How the column names should be determined (see the examples after this list). They can be given explicitly as a `Vector{Symbol}` or a `Vector{String}`, which must match the number of columns in the input. Alternatively, a positive `Integer` can set the row number from which the header should be parsed. This number is relative to the first row that wasn't empty and/or commented if you set `ignoreemptyrows` and/or `comment`. A value of `0` or `false` indicates that no header is present in the CSV file. You can use `skipto` to skip over headers that fail to parse for whatever reason.
- `skipto`: The number of rows to skip before parsing the CSV file. Defaults to `0` (no skipping). This number is relative to the first row that wasn't empty and/or commented if you set `ignoreemptyrows` and/or `comment`.
- `limit`: The maximum number of rows to parse. Defaults to `0` (no limit). Used in `ChunkedBase.ChunkingContext`.
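For instance (hypothetical inputs; `io` and `ctx` stand for your input and consume context):

```julia
# Explicit column names; the file itself has no header row.
ChunkedCSV.parse_file(io, nothing, ctx; header=["id", "name"])

# No header at all: columns are named COL_1, COL_2, ... (see default_colname_prefix).
ChunkedCSV.parse_file(io, nothing, ctx; header=false)

# Header on row 3, then parse at most 100 rows.
ChunkedCSV.parse_file(io, nothing, ctx; header=3, limit=100)
```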
Parsing-related options

- `delim`: The delimiter character used in the CSV file. Defaults to `','`. Only single-byte characters are supported. `nothing` indicates that the delimiter should be inferred from the first chunk of data. Used in `Parsers.Options`.
- `openquotechar`: The character used to open quoted fields in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `closequotechar`: The character used to close quoted fields in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `escapechar`: The character used to escape special characters in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `newlinechar`: The character used to represent newlines in the CSV file. Defaults to `'\n'`. Only single-byte characters are supported. `nothing` indicates that the newline character should be inferred from the first chunk of data. Used in `Parsers.Options`.
- `sentinel`: A sentinel value used to indicate missing values in the CSV file. Multiple sentinels might be provided as a `Vector{String}`. Defaults to `missing`, meaning that empty fields (two consecutive `delim`s) will be treated as missing values. Used in `Parsers.Options`.
- `groupmark`: The character used to group digits in numbers in the CSV file. Defaults to `nothing` (group marks are not expected). Used in `Parsers.Options`.
- `stripwhitespace`: Whether to strip whitespace from fields in the CSV file. Defaults to `false`. Used in `Parsers.Options`.
- `ignorerepeated`: Whether to ignore repeated delimiters in the CSV file. Defaults to `false`. Used in `Parsers.Options`.
- `truestrings`: A vector of strings representing `true` values in the CSV file. Defaults to `["true", "True", "TRUE", "1", "t", "T"]`. Used in `Parsers.Options`.
- `falsestrings`: A vector of strings representing `false` values in the CSV file. Defaults to `["false", "False", "FALSE", "0", "f", "F"]`. Used in `Parsers.Options`.
- `dateformat`: The date format used in the CSV file. Defaults to `nothing`, in which case `Parsers.default_format` is used. Consider using `GuessDateTime` as a schema type instead. Used in `Parsers.Options`.
- `quoted`: Whether fields in the CSV file are quoted. Defaults to `true`. Used in `Parsers.Options`.
- `decimal`: The character used as the decimal separator in numbers in the CSV file. Defaults to `'.'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `ignoreemptyrows`: Whether to ignore empty rows in the CSV file. Defaults to `true`. Used in `Parsers.Options`.
- `rounding`: The rounding mode used for `FixedDecimal` and `DateTime` values in the CSV file. Defaults to `RoundNearest`. Used in `Parsers.Options`.
- `validate_type_map`: Whether to validate the keys in provided schema dictionaries against the column names / the number of columns in the CSV file. Defaults to `true`.
- `comment`: The string or byte prefix used to indicate comments in the CSV file. Defaults to `nothing`, which means no comment skipping will be performed. Used in `ChunkedBase.ChunkingContext`.
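A hedged example combining several of these options for a hypothetical European-style file (`io` and `ctx` stand for your input and consume context):

```julia
# Semicolon-delimited, decimal commas, '.' as the thousands separator,
# and "NA" marking missing values.
ChunkedCSV.parse_file(io, [Float64, Float64], ctx;
    delim=';', decimal=',', groupmark='.', sentinel=["NA"])
```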
Chunking and parallelism

- `nworkers`: The number of worker threads to use for parsing the CSV file. Defaults to `max(1, Threads.nthreads() - 1)`. Used in `ChunkedBase.ChunkingContext`.
- `buffersize`: The size of the buffer used for parsing the CSV file, in bytes. Defaults to `nworkers * 1024 * 1024`. Must be larger than any single row in the input and smaller than 2GiB. If the input is larger than `buffersize` and `nworkers > 1`, a secondary buffer will be allocated internally to facilitate double-buffering. Used in `ChunkedBase.ChunkingContext`.
Misc

- `default_colname_prefix`: The default prefix to use for generated column names, used when some of the names are missing or when the file is missing the header altogether. Defaults to `"COL_"`.
- `use_mmap`: Whether to use memory-mapped I/O for parsing the CSV file when the `input` is a `String` path. Defaults to `false`.
- `no_quoted_newlines`: Assert that all newline characters in the file are record delimiters and never part of string field data. This allows the lexer to find newlines more efficiently, but will lead to parsing errors if there are quoted newlines in the file. Defaults to `false`.
- `deduplicate_names`: Whether to deduplicate column names in the CSV file by enumerating the colliding names with `_1`, `_2`, ... suffixes. Defaults to `true`. Column names are not significant during parsing, but they might be significant for the user when consuming the parsed data.
- `force`: Force parallel or serial parsing regardless of input size or `nworkers`. One of `:default`, `:serial` or `:parallel`. Defaults to `:default`, which won't parallelize small files or use the parallel code path with `nworkers == 1`. Useful for debugging.
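For example (hypothetical path and context):

```julia
# Memory-map the input file and force the serial code path while debugging.
ChunkedCSV.parse_file("data.csv", nothing, ctx; use_mmap=true, force=:serial)
```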
Returns

`Nothing`

See also: `TaskResultBuffer`, `ParsedPayload`, `AbstractConsumeContext`, `consume!`, and ChunkedBase.jl for more information about the `ChunkedBase` API.
ChunkedCSV.setup_parser — Function

    setup_parser(input, schema=nothing; kwargs...) -> (Bool, ParsingContext, ChunkingContext, Lexer)

For when you need to know the header and/or the schema which will be used to parse the file before creating your `consume_ctx`.

`setup_parser` will validate user input, ingest enough data chunks to reach the first valid row in the input, and then examine the first row to ensure we have a valid header and schema. You can then inspect the returned `ParsingContext` to see what the header and schema will be, and then call `parse_file` with the other objects returned by `setup_parser`.
Arguments

- `input`: The input source to parse. Can be a `String` file path or an `IO` object.
- `schema`: An optional schema for the CSV file; if omitted, all columns will be parsed as `String`s. It can be:
  - a single `DataType`, in which case it will be used for all columns in the file,
  - a `Vector{DataType}`, in which case each element of the vector corresponds to a single column,
  - a `Dict{Symbol,DataType}`, which maps types to columns by name,
  - a `Dict{Int,DataType}`, which maps types to columns by position, or
  - a `Base.Callable`, which will be called with the column index and name and should return a `DataType`.

  For the `Dict` variants (see also the `validate_type_map` keyword, which, unless set to `false`, validates the provided keys against the file), columns that are not present in the mapping will be parsed as `String`s.
Keyword arguments

- `header`: How the column names should be determined. They can be given explicitly as a `Vector{Symbol}` or a `Vector{String}`, which must match the number of columns in the input. Alternatively, a positive `Integer` can set the row number from which the header should be parsed. This number is relative to the first row that wasn't empty and/or commented if you set `ignoreemptyrows` and/or `comment`. A value of `0` or `false` indicates that no header is present in the CSV file. You can use `skipto` to skip over headers that fail to parse for whatever reason.
- `skipto`: The number of rows to skip before parsing the CSV file. Defaults to `0` (no skipping). This number is relative to the first row that wasn't empty and/or commented if you set `ignoreemptyrows` and/or `comment`.
- `limit`: The maximum number of rows to parse. Defaults to `0` (no limit). Used in `ChunkedBase.ChunkingContext`.
Parsing-related options

- `delim`: The delimiter character used in the CSV file. Defaults to `','`. Only single-byte characters are supported. `nothing` indicates that the delimiter should be inferred from the first chunk of data. Used in `Parsers.Options`.
- `openquotechar`: The character used to open quoted fields in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `closequotechar`: The character used to close quoted fields in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `escapechar`: The character used to escape special characters in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `newlinechar`: The character used to represent newlines in the CSV file. Defaults to `'\n'`. Only single-byte characters are supported. `nothing` indicates that the newline character should be inferred from the first chunk of data. Used in `Parsers.Options`.
- `sentinel`: A sentinel value used to indicate missing values in the CSV file. Multiple sentinels might be provided as a `Vector{String}`. Defaults to `missing`, meaning that empty fields (two consecutive `delim`s) will be treated as missing values. Used in `Parsers.Options`.
- `groupmark`: The character used to group digits in numbers in the CSV file. Defaults to `nothing` (group marks are not expected). Used in `Parsers.Options`.
- `stripwhitespace`: Whether to strip whitespace from fields in the CSV file. Defaults to `false`. Used in `Parsers.Options`.
- `ignorerepeated`: Whether to ignore repeated delimiters in the CSV file. Defaults to `false`. Used in `Parsers.Options`.
- `truestrings`: A vector of strings representing `true` values in the CSV file. Defaults to `["true", "True", "TRUE", "1", "t", "T"]`. Used in `Parsers.Options`.
- `falsestrings`: A vector of strings representing `false` values in the CSV file. Defaults to `["false", "False", "FALSE", "0", "f", "F"]`. Used in `Parsers.Options`.
- `dateformat`: The date format used in the CSV file. Defaults to `nothing`, in which case `Parsers.default_format` is used. Consider using `GuessDateTime` as a schema type instead. Used in `Parsers.Options`.
- `quoted`: Whether fields in the CSV file are quoted. Defaults to `true`. Used in `Parsers.Options`.
- `decimal`: The character used as the decimal separator in numbers in the CSV file. Defaults to `'.'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `ignoreemptyrows`: Whether to ignore empty rows in the CSV file. Defaults to `true`. Used in `Parsers.Options`.
- `rounding`: The rounding mode used for `FixedDecimal` and `DateTime` values in the CSV file. Defaults to `RoundNearest`. Used in `Parsers.Options`.
- `validate_type_map`: Whether to validate the keys in provided schema dictionaries against the column names / the number of columns in the CSV file. Defaults to `true`.
- `comment`: The string or byte prefix used to indicate comments in the CSV file. Defaults to `nothing`, which means no comment skipping will be performed. Used in `ChunkedBase.ChunkingContext`.
Chunking and parallelism

- `nworkers`: The number of worker threads to use for parsing the CSV file. Defaults to `max(1, Threads.nthreads() - 1)`. Used in `ChunkedBase.ChunkingContext`.
- `buffersize`: The size of the buffer used for parsing the CSV file, in bytes. Defaults to `nworkers * 1024 * 1024`. Must be larger than any single row in the input and smaller than 2GiB. If the input is larger than `buffersize` and `nworkers > 1`, a secondary buffer will be allocated internally to facilitate double-buffering. Used in `ChunkedBase.ChunkingContext`.
Misc

- `default_colname_prefix`: The default prefix to use for generated column names, used when some of the names are missing or when the file is missing the header altogether. Defaults to `"COL_"`.
- `use_mmap`: Whether to use memory-mapped I/O for parsing the CSV file when the `input` is a `String` path. Defaults to `false`.
- `no_quoted_newlines`: Assert that all newline characters in the file are record delimiters and never part of string field data. This allows the lexer to find newlines more efficiently, but will lead to parsing errors if there are quoted newlines in the file. Defaults to `false`.
- `deduplicate_names`: Whether to deduplicate column names in the CSV file by enumerating the colliding names with `_1`, `_2`, ... suffixes. Defaults to `true`. Column names are not significant during parsing, but they might be significant for the user when consuming the parsed data.
- `force`: Force parallel or serial parsing regardless of input size or `nworkers`. One of `:default`, `:serial` or `:parallel`. Defaults to `:default`, which won't parallelize small files or use the parallel code path with `nworkers == 1`. Useful for debugging.
Returns

- `should_close::Bool`: `true` if we opened an `IO` object and should close it later.
- `parsing_ctx::ChunkedCSV.ParsingContext`: Internal object which contains the header, schema and settings needed for `Parsers.jl`.
- `chunking_ctx::ChunkedBase.ChunkingContext`: Internal object which holds the ingested data, newline positions and other things required internally by `ChunkedBase.jl`.
- `lexer::NewlineLexers.Lexer`: Internal object which is used to find newlines in the ingested chunks and which is also needed by `ChunkedBase.jl`.
See also: `TaskResultBuffer`, `ParsedPayload`, `AbstractConsumeContext`, `consume!`, and ChunkedBase.jl for more information about the `ChunkedBase` API.
ChunkedCSV.RowStatus — Module

    RowStatus

A module implementing a bitflag type used to indicate the status of a row in a `TaskResultBuffer`.
- `0x00` – `Ok`: All fields were parsed successfully.
- `0x01` – `HasColumnIndicators`: Some fields have missing values.
- `0x02` – `TooFewColumns`: The row has fewer fields than expected according to the schema. Implies `HasColumnIndicators`.
- `0x04` – `TooManyColumns`: The row has more fields than expected according to the schema.
- `0x08` – `ValueParsingError`: Some fields could not be parsed due to an unknown instance of a particular type. Implies `HasColumnIndicators`.
- `0x10` – `SkippedRow`: The row contains no valid values, e.g. it was a comment. Implies `HasColumnIndicators`.
Multiple flags can be set at the same time, e.g. `HasColumnIndicators | TooFewColumns` means that at least one column in the row does not have a known value and that there were not enough fields in this row. If a row has the `HasColumnIndicators` flag set, then the `column_indicators` field of the `TaskResultBuffer` will contain a bitset indicating which columns have missing values.
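A minimal sketch of testing these flags (assuming they combine bitwise as the `UInt8` values listed above):

```julia
RS = ChunkedCSV.RowStatus
status = RS.HasColumnIndicators | RS.TooFewColumns  # 0x03

(status & RS.HasColumnIndicators) != RS.Ok  # true: row has missing values
(status & RS.TooManyColumns) != RS.Ok       # false
status == RS.Ok                             # false: not a clean row
```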
Distinguishing which values are missing (i.e. successfully parsed `sentinel` values) from those that failed to parse is currently unsupported, as we assume the integrity of the entire row is required.
See also: `TaskResultBuffer`