ChunkedCSV.BitSetMatrix — Type

    BitSetMatrix <: AbstractMatrix{Bool}

A matrix representing the missingness of values in the result buffer. The number of rows in the matrix is equal to the number of rows with at least one missing value in the result buffer. The number of columns in the matrix is equal to the number of columns in the result buffer.
When consuming a `TaskResultBuffer`, it is recommended to iterate it from start to finish and check each row's `RowStatus` for the `HasColumnIndicators` flag, which signals that the row contains missing values. Using `ColumnIterator`s is the easiest way to do this. For example:
```julia
# The first column has type T
for (value, isinvalidrow, ismissingvalue) in ColumnIterator{T}(result_buffer, 1)
    if isinvalidrow
        # The row didn't match the schema, so we'd better discard it
        continue
    end
    if ismissingvalue
        # The value is missing, so we can't use it
        continue
    end
    # Use the value
end
```
Indexing

- `bs[r, c]`: Get the value at row `r` and column `c` of the matrix.
- `bs[r, :]`: Get the values in row `r` of the matrix.
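For illustration, here is a hedged sketch of using this indexing while walking a `TaskResultBuffer` (the helper name `missing_columns_per_row` is hypothetical; the field and flag names follow the `TaskResultBuffer` and `RowStatus` documentation below):

```julia
using ChunkedCSV

# Hypothetical helper: report which columns are missing for each flagged row.
# `column_indicators` only stores a bitset row for rows whose status carries
# the HasColumnIndicators flag, so we keep a separate running index `ci_row`.
function missing_columns_per_row(buf::ChunkedCSV.TaskResultBuffer)
    RS = ChunkedCSV.RowStatus
    ci_row = 0
    for (row, status) in enumerate(buf.row_statuses)
        (status & RS.HasColumnIndicators) == RS.Ok && continue
        ci_row += 1
        # `buf.column_indicators[ci_row, :]` yields one Bool per column
        println("row $row is missing columns ", findall(buf.column_indicators[ci_row, :]))
    end
end
```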
See also:
ChunkedCSV.ColumnIterator — Type

    ColumnIterator{T}

Iterate over a column of a `TaskResultBuffer`. The iterator yields values of type `ParsedField{T}`, which is a struct containing the parsed value, a flag indicating whether the row was invalid, and a flag indicating whether the value was missing.
ChunkedCSV.DebugContext — Type

    DebugContext(error_only::Bool=true, n::Int=3, err_len::Int=255, show_values::Bool=false)

A consume context that prints out simple debug information about the parsed chunks. We print the first `err_len` bytes of the first `n` rows with an error `RowStatus` in each chunk, and optionally the parsed values for those rows.
Arguments:

- `error_only`: Set to `false` to also see the first `n` parsed values for each column in each chunk.
- `n`: Number of rows to print for each chunk.
- `err_len`: Number of bytes to print for each errored row.
- `show_values`: Set to `true` to also see the parsed values for each errored row.
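A hedged usage sketch (the CSV content is made up; the positional arguments follow the signature above):

```julia
using ChunkedCSV

# Two Int columns expected; the second data row won't parse as Int.
io = IOBuffer("a,b\n1,2\nxxx,4\n")
debug_ctx = ChunkedCSV.DebugContext(false, 3, 255, true)
ChunkedCSV.parse_file(io, [Int, Int], debug_ctx)
```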
ChunkedCSV.GuessDateTime — Type

    GuessDateTime

A type that implements `Parsers.typeparser` to parse various ISO 8601-like formats into a `DateTime`. If the input timestamp has timezone information, we always convert it to UTC.
It will parse the following formats:

```
yyyy-mm-dd
yyyy-mm-dd HH:MM:SS
yyyy-mm-dd HH:MM:SS.s    # where `s` is 1-3 digits, but we also support rounding to milliseconds
yyyy-mm-dd HH:MM:SSZ     # where `Z` is any valid timezone
yyyy-mm-dd HH:MM:SS.sZ
yyyy-mm-ddTHH:MM:SS
yyyy-mm-ddTHH:MM:SS.s
yyyy-mm-ddTHH:MM:SSZ
yyyy-mm-ddTHH:MM:SS.sZ
```
Examples:

```julia
julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=10, val=2014-01-01T00:00:00)

julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01 12:34:56")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=19, val=2014-01-01T12:34:56)

julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01 12:34:56.789")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=23, val=2014-01-01T12:34:56.789)

julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01 12:34:56Z")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=20, val=2014-01-01T12:34:56)

julia> Parsers.xparse(ChunkedCSV.GuessDateTime, "2014-01-01 12:34:56.789Z")
Parsers.Result{Dates.DateTime}(code=`SUCCESS: OK | EOF `, tlen=24, val=2014-01-01T12:34:56.789)
```
ChunkedCSV.TaskResultBuffer — Type

    TaskResultBuffer

Holds the parsed results in columnar buffers.
Fields

- `id::Int`: The unique identifier of the buffer object, in the range of 1 to two times the `nworkers` argument to the `parse_file` function.
- `cols::Vector{BufferedVector}`: A vector of vectors, each corresponding to a column in the CSV file. Note this field is abstractly typed.
- `row_statuses::BufferedVector{RowStatus.T}`: Contains a `UInt8` status flag for each row.
- `column_indicators::BitSetMatrix`: A special type of `BitMatrix` where each row is a bitset signalling missing column values. The number of rows corresponds to the number of row statuses where the `HasColumnIndicators` flag is set.

Notes

- Each column in the `cols` field is a `BufferedVector` of the same type as the corresponding column in the `ParsingContext` schema.
- The `row_statuses` vector has the same length as each of the `cols` vectors.
- Strings are stored lazily as `Parsers.PosLen31` pointers into the underlying byte buffer (available in the `bytes` field of `ParsingContext`); see the sketch below this list.
- When the file was parsed with `ignoreemptyrows=true` and/or a non-default `comment` argument, the `row_statuses` field might contain `SkippedRow` flags for all rows that were skipped.
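A hedged sketch of materializing one such lazily-stored string. The property names `pos` and `len` on `Parsers.PosLen31` are assumptions (they mirror `Parsers.PosLen`), and escaped fields would additionally need proper unescaping, e.g. via `Parsers.getstring`:

```julia
using Parsers

# Hypothetical helper: copy the raw bytes of cell (row, col) out of the chunk.
# `bytes` comes from the ParsingContext; `buf` is a TaskResultBuffer whose
# column `col` has element type Parsers.PosLen31.
function get_cell_string(bytes::Vector{UInt8}, buf, row::Int, col::Int)
    poslen = buf.cols[col][row]::Parsers.PosLen31
    return String(bytes[poslen.pos:poslen.pos + poslen.len - 1])
end
```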
Example:

The following shows the structure of a `TaskResultBuffer` storing results for a messy CSV file which we parsed expecting 3 `Int` columns while skipping over comments:

```
+-------------------------+-------------------------------------------------------------------------------+
| INPUT CSV               | TASK_RESULT_BUFFER                                                            |
+-------------------------+---------------------------+--------------------+----------+---------+---------+
| head,er,row             | row_statuses              | column_indicators  | cols[1]  | cols[2] | cols[3] |
+-------------------------+---------------------------+--------------------+----------+---------+---------+
| 1,1,1                   | Ok                        | No value           | 1        | 1       | 1       |
| 2,,2                    | HasCI                     | 0 1 0 #=[1,:]=#    | 2        | undef   | 2       |
| 2,,                     | HasCI                     | 0 1 1 #=[2,:]=#    | 2        | undef   | undef   |
| 3,3                     | HasCI | TooFewColumns    | 0 0 1 #=[3,:]=#    | 3        | 3       | undef   |
| 3                       | HasCI | TooFewColumns    | 0 1 1 #=[4,:]=#    | 3        | undef   | undef   |
| 4,4,4,4                 | TooManyColumns            | No value           | 4        | 4       | 4       |
| 4,4,4,4,4               | TooManyColumns            | No value           | 4        | 4       | 4       |
| garbage,garbage,garbage | HasCI | ValueParsingError | 1 1 1 #=[5,:]=#    | undef    | undef   | undef   |
| garbage,5,garbage       | HasCI | ValueParsingError | 1 0 1 #=[6,:]=#    | undef    | 5       | undef   |
| garbage,,garbage        | HasCI | ValueParsingError | 1 1 1 #=[7,:]=#    | undef    | undef   | undef   |
| # comment               | HasCI | SkippedRow        | 1 1 1 #=[8,:]=#    | undef    | undef   | undef   |
+-------------------------+---------------------------+--------------------+----------+---------+---------+
HasCI = HasColumnIndicators
```
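A minimal sketch of consuming a buffer like the one above, keeping only the fully-valid rows (the function name is hypothetical; the flag names follow the `RowStatus` module documented below):

```julia
using ChunkedCSV

function collect_ok_rows(buf::ChunkedCSV.TaskResultBuffer)
    RS = ChunkedCSV.RowStatus
    out = NTuple{3,Int}[]
    for (row, status) in enumerate(buf.row_statuses)
        status == RS.Ok || continue  # any problem flag disqualifies the row
        # Safe to read: every cell in an Ok row is defined
        push!(out, (buf.cols[1][row], buf.cols[2][row], buf.cols[3][row]))
    end
    return out
end
```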
ChunkedCSV.parse_file — Function

    parse_file(input, schema=nothing, consume_ctx::C; kwargs...) where {C<:AbstractConsumeContext} -> Nothing
    parse_file(
        should_close::Bool,
        parsing_ctx::ParsingContext,
        consume_ctx::C,
        chunking_ctx::ChunkingContext,
        lexer::Lexer;
        force::Symbol=:default,
    ) where {C<:AbstractConsumeContext} -> Nothing
Parse a CSV input in chunks of size `buffersize` and process them in parallel using `nworkers` tasks.

Before calling this function, you should define a custom `consume_ctx::C`, a subtype of `AbstractConsumeContext`, and implement a `consume!(::C, ::ParsedPayload)` method. The `consume_ctx` is then used to consume the parsed data by internally dispatching to `consume!(::C, ::ParsedPayload)`, which is also called in parallel. The parsed results can be found in the `results` field of `ParsedPayload`; see `TaskResultBuffer` for more information about the format in which the results are stored.
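For example, a minimal sketch of such a context (the type name `RowCountingContext` is made up; `AbstractConsumeContext` and `consume!` originate in ChunkedBase.jl, and the `results` field access follows the description above):

```julia
using ChunkedCSV
import ChunkedCSV: AbstractConsumeContext, consume!, ParsedPayload

# Hypothetical context that counts parsed rows across all worker tasks.
struct RowCountingContext <: AbstractConsumeContext
    nrows::Threads.Atomic{Int}  # atomic, since consume! runs in parallel
end
RowCountingContext() = RowCountingContext(Threads.Atomic{Int}(0))

function consume!(ctx::RowCountingContext, payload::ParsedPayload)
    # Counts all rows seen, including errored and skipped ones.
    Threads.atomic_add!(ctx.nrows, length(payload.results.row_statuses))
    return nothing
end

ctx = RowCountingContext()
ChunkedCSV.parse_file(IOBuffer("a,b\n1,2\n3,4\n"), [Int, Int], ctx)
@show ctx.nrows[]  # 2
```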
If you need to know the header and/or the schema which will be used to parse the file before creating your `consume_ctx`, you can call `setup_parser` and inspect the returned `ParsingContext`, then call `parse_file` with the other objects returned by `setup_parser`.
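A hedged sketch of that two-step workflow (the path `"data.csv"` is made up, and the field names `header` and `schema` on `ParsingContext` are assumptions; the return values and the second `parse_file` method follow the signatures documented here):

```julia
using ChunkedCSV

should_close, parsing_ctx, chunking_ctx, lexer = ChunkedCSV.setup_parser("data.csv")
# Inspect the header/schema before deciding how to consume the results.
@info "about to parse" parsing_ctx.header parsing_ctx.schema
consume_ctx = ChunkedCSV.DebugContext()  # substitute your own AbstractConsumeContext
ChunkedCSV.parse_file(should_close, parsing_ctx, consume_ctx, chunking_ctx, lexer)
```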
Arguments

- `input`: The input source to parse. Can be a `String` file path or an `IO` object.
- `schema`: An optional schema for the CSV file; if omitted, all columns will be parsed as `String`s (see the sketch after this list). It can be:
  - a single `DataType`, in which case it will be used for all columns in the file,
  - a `Vector{DataType}`, in which case each element of the vector corresponds to a single column,
  - a `Dict{Symbol,DataType}`, which maps types to columns by name,
  - a `Dict{Int,DataType}`, which maps types to columns by position, or
  - a `Base.Callable`, which will be called with the column index and name and should return a `DataType`.

  For the `Dict` variants (see also the `validate_type_map` keyword, which, unless set to `false`, validates the provided keys against the file), columns that are not present in the mapping will be parsed as `String`s.
- `consume_ctx`: A user-defined `<:AbstractConsumeContext` object which will be used to dispatch on `consume!(::C, ::ParsedPayload)` to consume the parsed data on each of the `nworkers` tasks in parallel.
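To illustrate the accepted schema forms (a hypothetical 3-column file with columns `a`, `b`, `c`):

```julia
# All of these request Int for the first column; unmapped columns stay String.
schema_all    = Int                                 # every column is Int
schema_vector = [Int, String, String]               # one entry per column
schema_byname = Dict(:a => Int)                     # map by column name
schema_bypos  = Dict(1 => Int)                      # map by column position
schema_func   = (i, name) -> i == 1 ? Int : String  # Base.Callable
```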
Keyword arguments

- `header`: How the column names should be determined (see the examples after this list). They can be given explicitly as a `Vector{Symbol}` or a `Vector{String}`, which must match the number of columns in the input. Alternatively, a positive `Integer` can set the row number from which the header should be parsed. This number is relative to the first row that wasn't empty and/or commented if you set `ignoreemptyrows` and/or `comment`. A value of `0` or `false` indicates that no header is present in the CSV file. You can use `skipto` to skip over headers that fail to parse for whatever reason.
- `skipto`: The number of rows to skip before parsing the CSV file. Defaults to `0` (no skipping). This number is relative to the first row that wasn't empty and/or commented if you set `ignoreemptyrows` and/or `comment`.
- `limit`: The maximum number of rows to parse. Defaults to `0` (no limit). Used in `ChunkedBase.ChunkingContext`.
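For instance (hypothetical inputs; `io` and `ctx` stand for your input and consume context):

```julia
# Explicit column names; the file itself has no header row.
ChunkedCSV.parse_file(io, nothing, ctx; header=["id", "name"])

# No header at all: columns are named COL_1, COL_2, ... (see default_colname_prefix).
ChunkedCSV.parse_file(io, nothing, ctx; header=false)

# Header on row 3, then parse at most 100 rows.
ChunkedCSV.parse_file(io, nothing, ctx; header=3, limit=100)
```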
Parsing-related options

- `delim`: The delimiter character used in the CSV file. Defaults to `','`. Only single-byte characters are supported. `nothing` indicates that the delimiter should be inferred from the first chunk of data. Used in `Parsers.Options`.
- `openquotechar`: The character used to open quoted fields in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `closequotechar`: The character used to close quoted fields in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `escapechar`: The character used to escape special characters in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `newlinechar`: The character used to represent newlines in the CSV file. Defaults to `'\n'`. Only single-byte characters are supported. `nothing` indicates that the newline character should be inferred from the first chunk of data. Used in `Parsers.Options`.
- `sentinel`: A sentinel value used to indicate missing values in the CSV file. Multiple sentinels might be provided as a `Vector{String}`. Defaults to `missing`, meaning that empty fields (two consecutive `delim`s) will be treated as missing values. Used in `Parsers.Options`.
- `groupmark`: The character used to group digits in numbers in the CSV file. Defaults to `nothing` (group marks are not expected). Used in `Parsers.Options`.
- `stripwhitespace`: Whether to strip whitespace from fields in the CSV file. Defaults to `false`. Used in `Parsers.Options`.
- `ignorerepeated`: Whether to ignore repeated delimiters in the CSV file. Defaults to `false`. Used in `Parsers.Options`.
- `truestrings`: A vector of strings representing `true` values in the CSV file. Defaults to `["true", "True", "TRUE", "1", "t", "T"]`. Used in `Parsers.Options`.
- `falsestrings`: A vector of strings representing `false` values in the CSV file. Defaults to `["false", "False", "FALSE", "0", "f", "F"]`. Used in `Parsers.Options`.
- `dateformat`: The date format used in the CSV file. Defaults to `nothing`, in which case `Parsers.default_format` is used. Consider using `GuessDateTime` as a schema type instead. Used in `Parsers.Options`.
- `quoted`: Whether fields in the CSV file are quoted. Defaults to `true`. Used in `Parsers.Options`.
- `decimal`: The character used as the decimal separator in numbers in the CSV file. Defaults to `'.'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `ignoreemptyrows`: Whether to ignore empty rows in the CSV file. Defaults to `true`. Used in `Parsers.Options`.
- `rounding`: The rounding mode used for `FixedDecimal` and `DateTime` values in the CSV file. Defaults to `RoundNearest`. Used in `Parsers.Options`.
- `validate_type_map`: Whether to validate the keys in provided schema dictionaries against the column names / the number of columns in the CSV file. Defaults to `true`.
- `comment`: The string or byte prefix used to indicate comments in the CSV file. Defaults to `nothing`, which means no comment skipping will be performed. Used in `ChunkedBase.ChunkingContext`.
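A hedged example combining several of these options for a hypothetical European-style file (`io` and `ctx` stand for your input and consume context):

```julia
# Semicolon-delimited, decimal commas, '.' as the thousands separator,
# and "NA" marking missing values.
ChunkedCSV.parse_file(io, [Float64, Float64], ctx;
    delim=';', decimal=',', groupmark='.', sentinel=["NA"])
```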
Chunking and parallelism

- `nworkers`: The number of worker threads to use for parsing the CSV file. Defaults to `max(1, Threads.nthreads() - 1)`. Used in `ChunkedBase.ChunkingContext`.
- `buffersize`: The size of the buffer used for parsing the CSV file, in bytes. Defaults to `nworkers * 1024 * 1024`. Must be larger than any single row in the input and smaller than 2GiB. If the input is larger than `buffersize` and `nworkers > 1`, a secondary buffer will be allocated internally to facilitate double-buffering. Used in `ChunkedBase.ChunkingContext`.
Misc

- `default_colname_prefix`: The default prefix to use for generated column names, used when some of the names are missing or when the file is missing the header altogether. Defaults to `"COL_"`.
- `use_mmap`: Whether to use memory-mapped I/O for parsing the CSV file when the `input` is a `String` path. Defaults to `false`.
- `no_quoted_newlines`: Assert that all newline characters in the file are record delimiters and never part of string field data. This allows the lexer to find newlines more efficiently, but will lead to parsing errors if there are quoted newlines in the file. Defaults to `false`.
- `deduplicate_names`: Whether to deduplicate column names in the CSV file by enumerating the colliding names with `_1`, `_2`, ... suffixes. Defaults to `true`. Column names are not significant during parsing, but they might be significant for the user when consuming the parsed data.
- `force`: Force parallel or serial parsing regardless of input size or `nworkers`. One of `:default`, `:serial` or `:parallel`. Defaults to `:default`, which won't parallelize small files or use the parallel code path with `nworkers == 1`. Useful for debugging.
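For example (hypothetical path and context):

```julia
# Memory-map the input file and force the serial code path while debugging.
ChunkedCSV.parse_file("data.csv", nothing, ctx; use_mmap=true, force=:serial)
```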
Returns

`Nothing`

See also: `TaskResultBuffer`, `ParsedPayload`, `AbstractConsumeContext`, `consume!`, and ChunkedBase.jl for more information about the `ChunkedBase` API.
ChunkedCSV.setup_parser — Function

    setup_parser(input, schema=nothing; kwargs...) -> (Bool, ParsingContext, ChunkingContext, Lexer)

For when you need to know the header and/or the schema which will be used to parse the file before creating your `consume_ctx`.

`setup_parser` will validate user input, ingest enough data chunks to reach the first valid row in the input, and then examine the first row to ensure we have a valid header and schema. You can then inspect the returned `ParsingContext` to see what the header and schema will be, and then call `parse_file` with the other objects returned by `setup_parser`.
Arguments

- `input`: The input source to parse. Can be a `String` file path or an `IO` object.
- `schema`: An optional schema for the CSV file; if omitted, all columns will be parsed as `String`s. It can be:
  - a single `DataType`, in which case it will be used for all columns in the file,
  - a `Vector{DataType}`, in which case each element of the vector corresponds to a single column,
  - a `Dict{Symbol,DataType}`, which maps types to columns by name,
  - a `Dict{Int,DataType}`, which maps types to columns by position, or
  - a `Base.Callable`, which will be called with the column index and name and should return a `DataType`.

  For the `Dict` variants (see also the `validate_type_map` keyword, which, unless set to `false`, validates the provided keys against the file), columns that are not present in the mapping will be parsed as `String`s.
Keyword arguments

- `header`: How the column names should be determined. They can be given explicitly as a `Vector{Symbol}` or a `Vector{String}`, which must match the number of columns in the input. Alternatively, a positive `Integer` can set the row number from which the header should be parsed. This number is relative to the first row that wasn't empty and/or commented if you set `ignoreemptyrows` and/or `comment`. A value of `0` or `false` indicates that no header is present in the CSV file. You can use `skipto` to skip over headers that fail to parse for whatever reason.
- `skipto`: The number of rows to skip before parsing the CSV file. Defaults to `0` (no skipping). This number is relative to the first row that wasn't empty and/or commented if you set `ignoreemptyrows` and/or `comment`.
- `limit`: The maximum number of rows to parse. Defaults to `0` (no limit). Used in `ChunkedBase.ChunkingContext`.
Parsing-related options

- `delim`: The delimiter character used in the CSV file. Defaults to `','`. Only single-byte characters are supported. `nothing` indicates that the delimiter should be inferred from the first chunk of data. Used in `Parsers.Options`.
- `openquotechar`: The character used to open quoted fields in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `closequotechar`: The character used to close quoted fields in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `escapechar`: The character used to escape special characters in the CSV file. Defaults to `'"'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `newlinechar`: The character used to represent newlines in the CSV file. Defaults to `'\n'`. Only single-byte characters are supported. `nothing` indicates that the newline character should be inferred from the first chunk of data. Used in `Parsers.Options`.
- `sentinel`: A sentinel value used to indicate missing values in the CSV file. Multiple sentinels might be provided as a `Vector{String}`. Defaults to `missing`, meaning that empty fields (two consecutive `delim`s) will be treated as missing values. Used in `Parsers.Options`.
- `groupmark`: The character used to group digits in numbers in the CSV file. Defaults to `nothing` (group marks are not expected). Used in `Parsers.Options`.
- `stripwhitespace`: Whether to strip whitespace from fields in the CSV file. Defaults to `false`. Used in `Parsers.Options`.
- `ignorerepeated`: Whether to ignore repeated delimiters in the CSV file. Defaults to `false`. Used in `Parsers.Options`.
- `truestrings`: A vector of strings representing `true` values in the CSV file. Defaults to `["true", "True", "TRUE", "1", "t", "T"]`. Used in `Parsers.Options`.
- `falsestrings`: A vector of strings representing `false` values in the CSV file. Defaults to `["false", "False", "FALSE", "0", "f", "F"]`. Used in `Parsers.Options`.
- `dateformat`: The date format used in the CSV file. Defaults to `nothing`, in which case `Parsers.default_format` is used. Consider using `GuessDateTime` as a schema type instead. Used in `Parsers.Options`.
- `quoted`: Whether fields in the CSV file are quoted. Defaults to `true`. Used in `Parsers.Options`.
- `decimal`: The character used as the decimal separator in numbers in the CSV file. Defaults to `'.'`. Only single-byte characters are supported. Used in `Parsers.Options`.
- `ignoreemptyrows`: Whether to ignore empty rows in the CSV file. Defaults to `true`. Used in `Parsers.Options`.
- `rounding`: The rounding mode used for `FixedDecimal` and `DateTime` values in the CSV file. Defaults to `RoundNearest`. Used in `Parsers.Options`.
- `validate_type_map`: Whether to validate the keys in provided schema dictionaries against the column names / the number of columns in the CSV file. Defaults to `true`.
- `comment`: The string or byte prefix used to indicate comments in the CSV file. Defaults to `nothing`, which means no comment skipping will be performed. Used in `ChunkedBase.ChunkingContext`.
Chunking and parallelism

- `nworkers`: The number of worker threads to use for parsing the CSV file. Defaults to `max(1, Threads.nthreads() - 1)`. Used in `ChunkedBase.ChunkingContext`.
- `buffersize`: The size of the buffer used for parsing the CSV file, in bytes. Defaults to `nworkers * 1024 * 1024`. Must be larger than any single row in the input and smaller than 2GiB. If the input is larger than `buffersize` and `nworkers > 1`, a secondary buffer will be allocated internally to facilitate double-buffering. Used in `ChunkedBase.ChunkingContext`.
Misc

- `default_colname_prefix`: The default prefix to use for generated column names, used when some of the names are missing or when the file is missing the header altogether. Defaults to `"COL_"`.
- `use_mmap`: Whether to use memory-mapped I/O for parsing the CSV file when the `input` is a `String` path. Defaults to `false`.
- `no_quoted_newlines`: Assert that all newline characters in the file are record delimiters and never part of string field data. This allows the lexer to find newlines more efficiently, but will lead to parsing errors if there are quoted newlines in the file. Defaults to `false`.
- `deduplicate_names`: Whether to deduplicate column names in the CSV file by enumerating the colliding names with `_1`, `_2`, ... suffixes. Defaults to `true`. Column names are not significant during parsing, but they might be significant for the user when consuming the parsed data.
- `force`: Force parallel or serial parsing regardless of input size or `nworkers`. One of `:default`, `:serial` or `:parallel`. Defaults to `:default`, which won't parallelize small files or use the parallel code path with `nworkers == 1`. Useful for debugging.
Returns

- `should_close::Bool`: `true` if we opened an `IO` object and should close it later.
- `parsing_ctx::ChunkedCSV.ParsingContext`: Internal object which contains the header, schema and settings needed for `Parsers.jl`.
- `chunking_ctx::ChunkedBase.ChunkingContext`: Internal object which holds the ingested data, newline positions and other things required internally by `ChunkedBase.jl`.
- `lexer::NewlineLexers.Lexer`: Internal object which is used to find newlines in the ingested chunks and which is also needed by `ChunkedBase.jl`.
See also: `TaskResultBuffer`, `ParsedPayload`, `AbstractConsumeContext`, `consume!`, and ChunkedBase.jl for more information about the `ChunkedBase` API.
ChunkedCSV.RowStatus — Module

    RowStatus

A module implementing a bitflag type used to indicate the status of a row in a `TaskResultBuffer`.
- `0x00` – `Ok`: All fields were parsed successfully.
- `0x01` – `HasColumnIndicators`: Some fields have missing values.
- `0x02` – `TooFewColumns`: The row has fewer fields than expected according to the schema. Implies `HasColumnIndicators`.
- `0x04` – `TooManyColumns`: The row has more fields than expected according to the schema.
- `0x08` – `ValueParsingError`: Some fields could not be parsed due to an unknown instance of a particular type. Implies `HasColumnIndicators`.
- `0x10` – `SkippedRow`: The row contains no valid values, e.g. it was a comment. Implies `HasColumnIndicators`.
Multiple flags can be set at the same time, e.g. `HasColumnIndicators | TooFewColumns` means that at least one column in the row does not have a known value and that there were not enough fields in this row. If a row has the `HasColumnIndicators` flag set, then the `column_indicators` field of the `TaskResultBuffer` will contain a bitset indicating which columns have missing values.
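A minimal sketch of testing these flags (assuming they combine bitwise as the `UInt8` values listed above):

```julia
RS = ChunkedCSV.RowStatus
status = RS.HasColumnIndicators | RS.TooFewColumns  # 0x03

(status & RS.HasColumnIndicators) != RS.Ok  # true: row has missing values
(status & RS.TooManyColumns) != RS.Ok       # false
status == RS.Ok                             # false: not a clean row
```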
Distinguishing which values are missing (i.e. successfully parsed `sentinel` values) from those that failed to parse is currently unsupported, as we assume the integrity of the entire row is required.
See also: `TaskResultBuffer`