CSV.jl Documentation

CSV.jl is built to be a fast and flexible pure-Julia library for handling delimited text files.

Getting Started
Key Functions
Examples

Basic
Auto-Delimiter Detection
String Delimiter
No Header
Normalize Column Names
Datarow
Reading Chunks
Transposed Data
Commented Rows
Missing Strings
Fixed Width Files
Quoted & Escaped Fields
DateFormat
Custom Decimal Separator
Custom Bool Strings
Matrix-like Data
Providing Types
Typemap
Pooled Values
Select/Drop Columns From File
Non-UTF-8 character encodings
Reading CSV from gzip (.gz) and zip files

Getting Started

CSV.jl provides a number of utilities for working with delimited files. CSV.File provides a way to read files into columns of data, detecting column types. CSV.Rows provides a row iterator for looping over rows in a file. Inputs to either should be filenames as Strings or FilePaths, or byte vectors (AbstractVector{UInt8}). To read other IO inputs, just call read(io) and pass the bytes directly to CSV.File or CSV.Rows.
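
As a rough illustration of the two entry points described above, the following sketch reads the same bytes with CSV.File and CSV.Rows. The data and variable names are made up for the example; the bytes are obtained with read(io), as suggested for generic IO inputs.

```julia
using CSV

# Hypothetical in-memory data standing in for a file on disk.
data = """
a,b,c
1,2.5,x
2,3.5,y
"""

# Turn an arbitrary IO into bytes with read(io), as the text suggests.
bytes = read(IOBuffer(data))

# CSV.File parses everything up front and detects column types.
f = CSV.File(bytes)
println(f.a)         # whole-column access
println(length(f))   # number of parsed rows

# CSV.Rows iterates one row at a time without type inference.
for row in CSV.Rows(bytes)
    println(row.a, " | ", row.b, " | ", row.c)
end
```

CSV.File is usually the better fit when the whole table is needed in memory, while CSV.Rows suits a single streaming pass.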

If Julia is started with multiple threads (e.g. julia -t 4, or with the JULIA_NUM_THREADS environment variable set), CSV.File will by default use those threads to parse sufficiently large files. A few keyword arguments control multithreaded parsing:

threaded=false: turn off multithreaded parsing; the file will be read sequentially on a single thread
tasks=N: control how many tasks/chunks are used to break up a file; by default, Threads.nthreads() will be used
lines_to_check=M: when a file is split into chunks, the parser must find valid row starts/ends in each chunk; this keyword argument controls how many lines are checked to ensure valid rows are found; files with very large quoted text fields may require a higher number here (10, 30, etc.)
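
A minimal sketch of turning threading off, assuming the threaded keyword exactly as listed above (keyword names may differ in other CSV.jl releases). The generated data is hypothetical; combining threaded=false with limit is one case where sequential parsing matters, since it lets the row limit be applied exactly.

```julia
using CSV

# Hypothetical data: a header plus 100 numbered rows.
data = "id,value\n" * join(("$i,$(2i)" for i in 1:100), "\n") * "\n"

# threaded=false forces sequential parsing, so limit is applied exactly.
f = CSV.File(IOBuffer(data); threaded=false, limit=10)
println(length(f))
println(f.id)
```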

Key Functions

CSV.File — Type

CSV.File(source; kwargs...) => CSV.File

Read a UTF-8 CSV input and return a CSV.File object.

The source argument can be one of:

filename given as a string or FilePaths.jl type
an AbstractVector{UInt8} like a byte buffer or codeunits(string)
an IOBuffer

To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:

f = CSV.File(HTTP.get(url).body)

For other IO or Cmd inputs, you can pass them like: f = CSV.File(read(obj)).

Opens the file and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the types keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a Vector for all columns, or specified per column via name or index in a Dict).

For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, source, enc"ISO-8859-1")).

The returned CSV.File object supports the Tables.jl interface and can iterate CSV.Row objects. CSV.Row supports propertynames and getproperty to access individual row values. CSV.File also supports whole-column access, like a DataFrame, via direct property access on the file object, e.g. f = CSV.File(file); f.col1. Note that duplicate column names will be detected and adjusted to ensure uniqueness (a duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:

for row in CSV.File(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
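
Column access and the handling of duplicate names described above can be tried with a small sketch like the following. The input is hypothetical and deliberately repeats the column name a.

```julia
using CSV

# Hypothetical input where the column name `a` appears twice.
data = "a,b,a\n1,2,3\n4,5,6\n"
f = CSV.File(IOBuffer(data))

# Column names after de-duplication; per the text above, the repeat becomes a_1.
println(propertynames(f))

# Whole-column access via property syntax.
println(f.a)
println(f.b)
```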

By supporting the Tables.jl interface, a CSV.File can also be used as a table input to any other table sink function. For example:

# materialize a csv file as a DataFrame, without copying columns from CSV.File
df = CSV.File(file) |> DataFrame

# load a csv file directly into an sqlite database table
db = SQLite.DB()
tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")

Supported keyword arguments include:

File layout options:
- header=1: the header argument can be an Int, indicating the row to parse for column names; or a Range, indicating a span of rows to be concatenated together as column names; or an entire Vector{Symbol} or Vector{String} to use as column names; if a file doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptylines are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- normalizenames=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
- datarow: an Int argument to specify the row where the data starts in the csv file; by default, the next row after the header row is used. If header=0, then the 1st row is assumed to be the start of data; providing a datarow or skipto argument does not affect the header argument. Note that if a row number datarow and comment or ignoreemptylines are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- skipto::Int: identical to datarow, specifies the number of rows to skip before starting to read data
- footerskip::Int: number of rows at the end of a file to skip parsing. Note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser
- limit: an Int to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note that for large files parsed with multiple threads, the limit argument may not result in an exact number of rows parsed; use threaded=false to ensure an exact limit if necessary
- transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
- comment: rows that begin with this String will be skipped while parsing. Note that if a row number header or datarow and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- ignoreemptylines::Bool=true: whether empty rows/lines in a file should be ignored (if false, each column will be assigned missing for that empty row)
- threaded::Bool: whether parsing should utilize multiple threads; by default threads are used on large enough files, but multithreaded parsing is not allowed when transpose=true; only available in Julia 1.3+
- tasks::Integer=Threads.nthreads(): for multithreaded parsing, this controls the number of tasks spawned to read a file in chunks concurrently; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable)
- lines_to_check::Integer=5: for multithreaded parsing, a file is split up into tasks # of equal chunks, then lines_to_check # of lines are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, lines_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
- select: an AbstractVector of Int, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
- drop: inverse of select; an AbstractVector of Int, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
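
A few of the file layout options above can be combined as in the sketch below. The data, the column names, and the variable names are hypothetical; delim is passed explicitly only to keep the example independent of delimiter detection.

```julia
using CSV

# Hypothetical file contents: no header row, one commented line.
data = """
1,alice,9.5
# this line is skipped because it starts with the comment string
2,bob,7.0
3,carol,8.25
"""

colnames = ["id", "name", "score"]

# Supply column names manually and skip commented rows.
f = CSV.File(IOBuffer(data); header=colnames, comment="#", delim=',')
println(f.id)

# Keep only some columns with select ...
g = CSV.File(IOBuffer(data); header=colnames, comment="#", delim=',',
             select=[:id, :score])
println(propertynames(g))

# ... or exclude columns with drop.
h = CSV.File(IOBuffer(data); header=colnames, comment="#", delim=',',
             drop=["name"])
println(propertynames(h))
```
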
Parsing options:
- missingstrings, missingstring: either a String, or Vector{String} to use as sentinel values that will be parsed as missing; by default, only an empty field (two consecutive delimiters) is considered missing
- delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- ignorerepeated::Bool=false: whether repeated (consecutive) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
- escapechar='"': the Char used to escape quote characters in a quoted field
- dateformat::Union{String, Dates.DateFormat, Nothing}: a date format string to indicate how Date/DateTime columns are formatted for the entire file
- dateformats::Union{AbstractDict, Nothing}: a Dict of date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
- decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
- truestrings, falsestrings: Vectors of Strings that indicate how true or false values are represented; by default only true and false are treated as Bool
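
The parsing options above are easiest to see on a file that uses European-style conventions. The following is a sketch with made-up data: semicolon delimiters, comma decimals, day-first dates, yes/no booleans, and NA as the missing sentinel.

```julia
using CSV, Dates

# Hypothetical data using non-default conventions.
data = """
when;price;ok;note
01/02/2020;3,14;yes;NA
15/03/2020;2,50;no;fine
"""

f = CSV.File(IOBuffer(data);
             delim=';',                 # column delimiter
             decimal=',',               # 3,14 is parsed as 3.14
             dateformat="dd/mm/yyyy",   # day first
             truestrings=["yes"],
             falsestrings=["no"],
             missingstring="NA")

println(f.when)
println(f.price)
println(f.ok)
println(f.note)
```
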
Column Type Options:
- type: a single type to use for parsing an entire file; i.e. all columns will be treated as the same type; useful for matrix-like data files
- types: a Vector or Dict of types to be used for column types; a Dict can map column index Int, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header
- typemap::Dict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
- pool::Union{Bool, Float64}=0.1: if true, all columns detected as String will be internally pooled; alternatively, the proportion of unique values below which String columns should be pooled (by default 0.1, meaning that if the # of unique strings in a column is under 10%, it will be pooled)
- lazystrings::Bool=false: avoid allocating full strings in string columns; returns a custom LazyStringVector array type that does not support mutating operations (e.g. push!, append!, or even setindex!). Calling copy(x) will materialize a full Vector{String}. Also note that each LazyStringVector holds a reference to the full input file buffer, so the buffer won't be closed after parsing, and trying to delete or modify the file may result in errors (particularly on Windows) and generally has undefined behavior. Despite these caveats, this setting can help avoid many string allocations in large files and lead to faster parsing times.
- strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
- silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
- maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
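
To illustrate the column type options, the sketch below provides types per column by name and shows the difference between the default strict=false and strict=true. The data is hypothetical; the score column deliberately contains an invalid value.

```julia
using CSV

# Hypothetical data: `code` has leading zeros, `score` has one invalid value.
data = "id,code,score\n1,007,9.5\n2,042,abc\n"

# Force `code` to stay a String and `score` to be Float64.
# With strict=false (the default), the invalid "abc" becomes missing.
f = CSV.File(IOBuffer(data);
             types=Dict(:code => String, :score => Float64),
             silencewarnings=true)
println(f.code)
println(f.score)

# With strict=true the same invalid value is expected to raise an error instead.
try
    CSV.File(IOBuffer(data); types=Dict(:score => Float64), strict=true)
catch err
    println("parsing failed: ", typeof(err))
end
```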

CSV.Chunks — Type

CSV.Chunks(source; tasks::Integer=Threads.nthreads(), kwargs...) => CSV.Chunks

Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as CSV.File, see those docs for explanations of each keyword argument.

The tasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.

Each iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine # of columns, column names, etc), each iteration does independent type inference on columns. This is significant as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.

This functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.

CSV.Rows — Type

CSV.Rows(source; kwargs...) => CSV.Rows

Read a csv input returning a CSV.Rows object.

The source argument can be one of:

filename given as a string or FilePaths.jl type
an AbstractVector{UInt8} like a byte buffer or codeunits(string)
an IOBuffer

To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:

f = CSV.Rows(HTTP.get(url).body)

For other IO or Cmd inputs, you can pass them like: f = CSV.Rows(read(obj)).

While similar to CSV.File, CSV.Rows provides a slightly different interface, the tradeoffs including:

Very minimal memory footprint; while iterating, only the current row values are buffered
Only provides row access via iteration; to access columns, one can stream the rows into a table type
Performs no type inference; each column/cell is essentially treated as Union{String, Missing}. Users can utilize the performant Parsers.parse(T, str) to convert values to a more specific type if needed, or pass types upon construction using the type or types keyword arguments

Opens the file and uses passed arguments to detect the number of columns, but not column types (column types default to String unless otherwise manually provided). The returned CSV.Rows object supports the Tables.jl interface and can iterate rows. Each row object supports propertynames, getproperty, and getindex to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:

for row in CSV.Rows(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
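
Because CSV.Rows does no type inference, values usually need an explicit conversion. The sketch below uses Base's parse as a stand-in for Parsers.parse so that no extra package is required; the data is hypothetical.

```julia
using CSV

# Hypothetical two-column input.
data = "a,b\n1,x\n2,y\n3,z\n"

# Values arrive as strings (or missing), so convert explicitly where needed.
total = sum(parse(Int, row.a) for row in CSV.Rows(IOBuffer(data)))
println(total)

# Copy string values out of each row into an ordinary Vector{String}.
labels = [string(row.b) for row in CSV.Rows(IOBuffer(data))]
println(labels)
```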

Supported keyword arguments include:

File layout options:
- header=1: the header argument can be an Int, indicating the row to parse for column names; or a Range, indicating a span of rows to be concatenated together as column names; or an entire Vector{Symbol} or Vector{String} to use as column names; if a file doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptylines are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- normalizenames=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
- datarow: an Int argument to specify the row where the data starts in the csv file; by default, the next row after the header row is used. If header=0, then the 1st row is assumed to be the start of data; providing a datarow or skipto argument does not affect the header argument. Note that if a row number datarow and comment or ignoreemptylines are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- skipto::Int: similar to datarow, specifies the number of rows to skip before starting to read data
- limit: an Int to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file
- transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
- comment: rows that begin with this String will be skipped while parsing. Note that if a row number header or datarow and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- ignoreemptylines::Bool=true: whether empty rows/lines in a file should be ignored (if false, each column will be assigned missing for that empty row)
Parsing options:
- missingstrings, missingstring: either a String, or Vector{String} to use as sentinel values that will be parsed as missing; by default, only an empty field (two consecutive delimiters) is considered missing
- delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- ignorerepeated::Bool=false: whether repeated (consecutive) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
- escapechar='"': the Char used to escape quote characters in a quoted field
- dateformat::Union{String, Dates.DateFormat, Nothing}: a date format string to indicate how Date/DateTime columns are formatted for the entire file
- decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
- truestrings, falsestrings: Vectors of Strings that indicate how true or false values are represented; by default only true and false are treated as Bool
Column Type Options:
- type: a single type to use for parsing an entire file; i.e. all columns will be treated as the same type; useful for matrix-like data files
- types: a Vector or Dict of types to be used for column types; a Dict can map column index Int, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header
- typemap::Dict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String
- lazystrings::Bool=true: avoid allocating full strings while parsing; accessing a string column will materialize the full String
- strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
- silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
- maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default
Iteration options:
- reusebuffer=false: while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
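
A small sketch of a usage pattern that should be compatible with reusebuffer=true: every row is consumed immediately and nothing from the row object is kept around after the iteration step. The data is hypothetical.

```julia
using CSV

# Hypothetical single-column input.
data = "x\n1.5\n2.5\n4.0\n"

# Safe with a reused buffer: each value is converted to a Float64 right away,
# and the row object itself is never stored.
rows = CSV.Rows(IOBuffer(data); reusebuffer=true)
total = sum(parse(Float64, row.x) for row in rows)
println(total)
```

By contrast, collecting the row objects themselves would leave every stored row pointing at the same reused buffer, which is the situation the option's description warns about.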

CSV.write — Function

CSV.write(file, table; kwargs...) => file
table |> CSV.write(file; kwargs...) => file

Write a Tables.jl interface input to a csv file, given as an IO argument or String/FilePaths.jl type representing the file name to write to. Alternatively, CSV.RowWriter creates a row iterator, producing a csv-formatted string for each row in an input table.

Supported keyword arguments include:

bufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown
delim::Union{Char, String}=',': a character or string to print out as the file's delimiter
quotechar::Char='"': ascii character to use for quoting text fields that may contain delimiters or newlines
openquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters
escapechar::Char='"': ascii character used to escape quote characters in a text field
missingstring::String="": string to print for missing values
dateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns
append=false: whether to append writing to an existing file/IO; if true, it will not write column names by default
writeheader=!append: whether to write an initial row of delimited column names, not written by default if appending
header: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table
newline='\n': character or string to use to separate rows (lines in the csv file)
quotestrings=false: whether to force all strings to be quoted or not
decimal='.': character to use as the decimal point when writing floating point numbers
transform=(col,val)->val: a function that is applied to every cell e.g. we can transform all nothing values to missing using (col, val) -> something(val, missing)
bom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not
partition::Bool=false: by passing true, the table argument is expected to implement Tables.partitions and the file argument can either be an indexable collection of IO, file Strings, or a single file String that will have an index appended to the name
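
As a minimal sketch of CSV.write, the example below writes a small table to an in-memory IOBuffer instead of a file. The table is a hypothetical NamedTuple of vectors, which is one of the simplest inputs that satisfies the Tables.jl interface.

```julia
using CSV

# Hypothetical table: a NamedTuple of equal-length column vectors.
table = (id = [1, 2], name = ["alice", missing])

# Write to an IO with a custom delimiter and a sentinel for missing values.
io = IOBuffer()
CSV.write(io, table; delim=';', missingstring="NA")
out = String(take!(io))
print(out)
```

Passing a file name instead of io writes the same content to disk.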

CSV.RowWriter — Type

CSV.RowWriter(table; kwargs...)

Creates an iterator that produces csv-formatted strings for each row in the input table.

Supported keyword arguments include:

bufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown
delim::Union{Char, String}=',': a character or string to print out as the file's delimiter
quotechar::Char='"': ascii character to use for quoting text fields that may contain delimiters or newlines
openquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters
escapechar::Char='"': ascii character used to escape quote characters in a text field
missingstring::String="": string to print for missing values
dateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns
header: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table
newline='\n': character or string to use to separate rows (lines in the csv file)
quotestrings=false: whether to force all strings to be quoted or not
decimal='.': character to use as the decimal point when writing floating point numbers
transform=(col,val)->val: a function that is applied to every cell e.g. we can transform all nothing values to missing using (col, val) -> something(val, missing)
bom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not
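
A short sketch of CSV.RowWriter, which can be useful when csv-formatted rows need to be sent somewhere one at a time (for example over a socket) rather than written to a single file. The table is hypothetical, and the strings are simply gathered into a vector for inspection.

```julia
using CSV

# Hypothetical table: a NamedTuple of equal-length column vectors.
table = (id = [1, 2], name = ["alice", "bob"])

# Gather each csv-formatted string produced by the iterator.
lines = String[]
for line in CSV.RowWriter(table)
    push!(lines, line)
end

foreach(print, lines)
```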