CSV.CSV — Module

The CSV module provides a fast, flexible reader and writer for delimited text files in various formats.
Reading:

- CSV.File reads delimited data and returns a CSV.File object, which allows dot-access to columns and iterating rows.
- CSV.read is similar to CSV.File but is used when the input will be passed directly to a sink function such as a DataFrame.
- CSV.Rows reads delimited data and returns a CSV.Rows object, which allows "streaming" the data by iterating rows and thereby has a lower memory footprint than CSV.File.
- CSV.Chunks allows processing extremely large files in "batches" or "chunks".

Writing:

- CSV.write writes a Tables.jl interface input such as a DataFrame to a csv file or an in-memory IOBuffer.
- CSV.RowWriter creates an iterator that produces csv-formatted strings for each row in the input table.
Here is an example of reading a csv file into a DataFrame:
using CSV, DataFrames
ExampleInputDF = CSV.read("ExampleInputFile.csv", DataFrame)
Here is an example of writing out a DataFrame to a csv file:
using CSV, DataFrames
ExampleOutputDF = DataFrame(rand(10,10), :auto)
CSV.write("ExampleOutputFile.csv", ExampleOutputDF)
CSV.Chunks — Method

CSV.Chunks(source; ntasks::Integer=Threads.nthreads(), kwargs...) => CSV.Chunks

Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as CSV.File; see those docs for explanations of each keyword argument.

The ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. the JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.

Each iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine the # of columns, column names, etc.), each iteration does independent type inference on columns. This is significant, as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.

This functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.
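As a minimal sketch of chunked iteration (assuming CSV.jl is installed; the file and its contents are made up for illustration):

```julia
using CSV

# Write a small demo file (hypothetical data, for illustration only).
path, io = mktemp()
write(io, "a,b\n" * join(("$i,$(i * 2)" for i in 1:100), "\n"))
close(io)

# Iterate the file in up to 4 chunks; each chunk behaves like a CSV.File.
nrows = 0
for chunk in CSV.Chunks(path; ntasks=4)
    nrows += length(chunk)  # number of rows parsed in this chunk
end
# nrows now equals the total number of data rows (100)
```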
Arguments

File layout options:

- header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match the # of columns in the dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
- skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser
- transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
- comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
- select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
- drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
- limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note that for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
- buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
- ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
- rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
- source: [only applicable for a vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
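A small sketch combining a few of these layout options (assuming CSV.jl is installed; the file contents are hypothetical):

```julia
using CSV

# Hypothetical file: a commented first line, then a header and two data rows.
path, io = mktemp()
write(io, "# generated file\nx,y,z\n1,2,3\n4,5,6\n")
close(io)

# Skip commented lines, and keep only the :x and :z columns.
f = CSV.File(path; comment="#", select=[:x, :z])
# f.x holds [1, 4]; the dropped :y column is not accessible
```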
Parsing options:

- missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
- delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
- quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
- escapechar='"': the Char used to escape quote characters in a quoted field
- dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String, to the format string for that column.
- decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
- groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
- truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to the Int64 column type unless explicitly requested to be Bool via the types keyword argument
- stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
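For instance, a sketch using delim, missingstring, and dateformat together (assuming CSV.jl is installed; the file contents are hypothetical):

```julia
using CSV, Dates

# Hypothetical semicolon-delimited file with "NA" as the missing sentinel.
path, io = mktemp()
write(io, "name;when;score\nalice;2021-01-02;NA\nbob;2021-03-04;7\n")
close(io)

f = CSV.File(path; delim=';', missingstring="NA", dateformat="yyyy-mm-dd")
# f.when is parsed as a Date column; f.score has missing in its first row
```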
Column Type Options:

- types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String, to the type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
- typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
- pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
- downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
- stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
- strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
- silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
- maxwarnings::Int=100: if more than maxwarnings warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
- debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
- validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false, no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
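A brief sketch of overriding column types (assuming CSV.jl is installed; the file is hypothetical):

```julia
using CSV

path, io = mktemp()
write(io, "id,code\n1,a\n2,b\n1,a\n")
close(io)

# Force :id to Float64 and request plain String columns instead of InlineString.
f = CSV.File(path; types=Dict(:id => Float64), stringtype=String)
# f.id is a Float64 column even though the values look like integers
```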
Iteration options:

- reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
CSV.Column — Type

Internal structure used to track information for a single column in a delimited file.

Fields:

- type: always a single, concrete type; no Union{T, Missing}; missingness is tracked in the anymissing field; this field is mutable; it may start as one type and get "promoted" to another while parsing; two special types exist: NeedsTypeDetection, which specifies that we need to try and detect what type this column's values are, and HardMissing, which means the column type is definitely Missing and we don't need to detect anything; to get the "final type" of a column after parsing, call CSV.coltype(col), which takes into account anymissing
- anymissing: whether any missing values have been encountered while parsing; if a user provided a type like Union{Int, Missing}, we'll set this to true, or when missing values are encountered while parsing
- userprovidedtype: whether the column type was provided by the user or not; this affects whether we'll promote a column's type while parsing, or emit a warning/error depending on the strict keyword arg
- willdrop: whether we'll drop this column from the final columnset; computed from the select/drop keyword arguments; this will result in a column type of HardMissing while parsing, where an efficient parser is used to "skip" a field w/o allocating any parsed value
- pool: computed from the pool keyword argument; true is 1.0, false is 0.0, everything else is Float64(pool); once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected
- columnspecificpool: if pool was provided via Vector or Dict by the user, then true, otherwise false; if false, then only string column types will attempt pooling
- column: the actual column vector to hold parsed values; this field is typed as AbstractVector, and while parsing, we switch on col.type to assert the column type to make code concretely typed
- lock: in multithreaded parsing, we have a top-level set of Vector{Column}, then each threaded parsing task makes its own copy to parse its own chunk; when synchronizing column types/pooled refs, the task-local Column will lock(col.lock) to make changes to the parent Column; each task-local Column shares the same lock of the top-level Column
- position: for transposed reading, the current column position
- endposition: for transposed reading, the expected ending position for this column
CSV.File — Type

CSV.File(input; kwargs...) => CSV.File

Read a UTF-8 CSV input and return a CSV.File object, which is like a lightweight table/dataframe, allowing dot-access to columns and iterating rows. Satisfies the Tables.jl interface, so it can be passed to any valid sink; to avoid unnecessary copies of data, use CSV.read(input, sink; kwargs...) instead if the CSV.File intermediate object isn't needed.
The input argument can be one of:

- a filename given as a string or FilePaths.jl type
- a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
- a CodeUnits object, which wraps a String, like codeunits(str)
- a csv-formatted string, which can also be passed like IOBuffer(str)
- a Cmd or other IO
- a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
- a Vector of any of the above, which will parse and vertically concatenate each source, returning a single, "long" CSV.File
To read a csv file from a url, use the Downloads.jl stdlib or HTTP.jl package, where the resulting downloaded tempfile or HTTP.Response body can be passed like:
using Downloads, CSV
f = CSV.File(Downloads.download(url))
# or
using HTTP, CSV
f = CSV.File(HTTP.get(url).body)
Opens the file or files and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the types keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a Vector for all columns, or specified per column via name or index in a Dict).

When a Vector of inputs is provided, the column names and types of each separate file/input must match in order to be vertically concatenated. A separate thread will be used to parse each input, each parsing its input using just a single thread. The results of all threads are then vertically concatenated using ChainedVectors to lazily concatenate each thread's columns.

For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc"ISO-8859-1")).
The returned CSV.File object supports the Tables.jl interface and can iterate rows as CSV.Row objects. CSV.Row supports propertynames and getproperty to access individual row values. CSV.File also supports entire column access like a DataFrame via direct property access on the file object, like f = CSV.File(file); f.col1, or by getindex access with column names, like f[:col1] or f["col1"]. The returned columns are AbstractArray subtypes, including: SentinelVector (for integers), regular Vector, PooledVector for pooled columns, MissingVector for columns of all missing values, PosLenStringVector when stringtype=PosLenString is passed, and ChainedVector, which will chain one of the previous array types together for data inputs that use multiple threads to parse (each thread parses a single "chain" of the input). Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:
for row in CSV.File(file)
println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
By supporting the Tables.jl interface, a CSV.File can also be a table input to any other table sink function, like:
# materialize a csv file as a DataFrame, copying columns from CSV.File
df = CSV.File(file) |> DataFrame
# to avoid making a copy of parsed columns, use CSV.read
df = CSV.read(file, DataFrame)
# load a csv file directly into an sqlite database table
db = SQLite.DB()
tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")
Arguments

CSV.File accepts the same file layout, parsing, column type, and iteration keyword arguments documented in the Arguments section above (under CSV.Chunks).
CSV.RowWriter — Type

CSV.RowWriter(table; kwargs...)
Creates an iterator that produces csv-formatted strings for each row in the input table.
Supported keyword arguments include:
bufsize::Int=2^22
: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than thebufsize
an error is throwndelim::Union{Char, String}=','
: a character or string to print out as the file's delimiterquotechar::Char='"'
: ascii character to use for quoting text fields that may contain delimiters or newlinesopenquotechar::Char
: instead ofquotechar
, useopenquotechar
andclosequotechar
to support different starting and ending quote charactersescapechar::Char='"'
: ascii character used to escape quote characters in a text fieldmissingstring::String=""
: string to print formissing
valuesdateformat=Dates.default_format(T)
: the date format string to use for printing outDate
&DateTime
columnsheader
: pass a list of column names (Symbols or Strings) to use instead of the column names of the input tablenewline='\n'
: character or string to use to separate rows (lines in the csv file)quotestrings=false
: whether to force all strings to be quoted or notdecimal='.'
: character to use as the decimal point when writing floating point numberstransform=(col,val)->val
: a function that is applied to every cell e.g. we can transform allnothing
values tomissing
using(col, val) -> something(val, missing)
bom=false
: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not
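A minimal sketch of iterating `CSV.RowWriter` output; the table here is a stand-in NamedTuple of vectors:

```julia
using CSV

tbl = (a = [1, 2], b = ["x", "y"])
for rowstring in CSV.RowWriter(tbl)
    # the header row is produced first by default, then one string per data row
    print(rowstring)
end
```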
CSV.Rows — Method
CSV.Rows(source; kwargs...) => CSV.Rows
Read a csv input returning a `CSV.Rows` object.
The `source` argument can be one of:
- filename given as a string or FilePaths.jl type
- a `Vector{UInt8}` or `SubArray{UInt8, 1, Vector{UInt8}}` byte buffer
- a `CodeUnits` object, which wraps a `String`, like `codeunits(str)`
- a csv-formatted string can also be passed like `IOBuffer(str)`
- a `Cmd` or other `IO`
- a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
To read a csv file from a url, use the HTTP.jl package, where the `HTTP.Response` body can be passed like: f = CSV.Rows(HTTP.get(url).body)
For other `IO` or `Cmd` inputs, you can pass them like: f = CSV.Rows(read(obj)).
While similar to `CSV.File`, `CSV.Rows` provides a slightly different interface, with the following tradeoffs:
- Very minimal memory footprint; while iterating, only the current row values are buffered
- Only provides row access via iteration; to access columns, one can stream the rows into a table type
- Performs no type inference; each column/cell is essentially treated as `Union{String, Missing}`; users can utilize the performant `Parsers.parse(T, str)` to convert values to a more specific type if needed, or pass types upon construction using the `type` or `types` keyword arguments
Opens the file and uses passed arguments to detect the number of columns, ***but not*** column types (column types default to `String` unless otherwise manually provided). The returned `CSV.Rows` object supports the Tables.jl interface and can iterate rows. Each row object supports `propertynames`, `getproperty`, and `getindex` to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name `a` will become `a_1`). For example, one could iterate over a csv file with column names `a`, `b`, and `c` by doing:
for row in CSV.Rows(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
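Since `CSV.Rows` performs no type inference, converting cells with `Parsers.parse` as described above might look like this sketch (assuming the file has an integer column `a`):

```julia
using CSV, Parsers

for row in CSV.Rows(file)
    a = Parsers.parse(Int, row.a)  # row.a is a String; convert explicitly
    # ... use the typed value ...
end
```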
Arguments
File layout options:
- `header=1`: how column names should be determined; if given as an `Integer`, indicates the row to parse for column names; as an `AbstractVector{<:Integer}`, indicates a set of rows to be concatenated together as column names; `Vector{Symbol}` or `Vector{String}` give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a `Vector`, or set `header=0` or `header=false` and column names will be auto-generated (`Column1`, `Column2`, etc.). Note that if a row number header and `comment` or `ignoreemptyrows` are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- `normalizenames::Bool=false`: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the `tbl.col1` `getproperty` syntax or iterating rows and accessing column values of a row via `getproperty` (e.g. `row.col1`)
- `skipto::Integer`: specifies the row where the data starts in the csv file; by default, the next row after the `header` row(s) is used. If `header=0`, then the 1st row is assumed to be the start of data; providing a `skipto` argument does not affect the `header` argument. Note that if a row number `skipto` and `comment` or `ignoreemptyrows` are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- `footerskip::Integer`: number of rows at the end of a file to skip parsing. Do note that commented rows (see the `comment` keyword argument) do not count towards the row number provided for `footerskip`; they are completely ignored by the parser
- `transpose::Bool`: read a csv file "transposed", i.e. each column is parsed as a row
- `comment::String`: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or `skipto` and `comment` are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- `ignoreemptyrows::Bool=true`: whether empty rows in a file should be ignored (if `false`, each column will be assigned `missing` for that empty row)
- `select`: an `AbstractVector` of `Integer`, `Symbol`, `String`, or `Bool`, or a "selector" function of the form `(i, name) -> keep::Bool`; only columns in the collection or for which the selector function returns `true` will be parsed and accessible in the resulting `CSV.File`. Invalid values in `select` are ignored.
- `drop`: inverse of `select`; an `AbstractVector` of `Integer`, `Symbol`, `String`, or `Bool`, or a "drop" function of the form `(i, name) -> drop::Bool`; columns in the collection or for which the drop function returns `true` will be ignored in the resulting `CSV.File`. Invalid values in `drop` are ignored.
- `limit`: an `Integer` to indicate a limited number of rows to parse in a csv file; use in combination with `skipto` to read a specific, contiguous chunk within a file; note that for large files when multiple threads are used for parsing, the `limit` argument may not result in an exact # of rows parsed; use `ntasks=1` to ensure an exact limit if necessary
- `buffer_in_memory`: a `Bool`, default `false`, which controls whether a `Cmd`, `IO`, or gzipped source will be read/decompressed in memory vs. using a temporary file.
- `ntasks::Integer=Threads.nthreads()`: [not applicable to `CSV.Rows`] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the `JULIA_NUM_THREADS` environment variable or `julia -t N`); setting `ntasks=1` will avoid any calls to `Threads.@spawn` and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
- `rows_to_check::Integer=30`: [not applicable to `CSV.Rows`] a multithreaded parsed file will be split up into `ntasks` # of equal chunks; `rows_to_check` controls the # of rows checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, `rows_to_check` may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
- `source`: [only applicable for a vector of inputs to `CSV.File`] a `Symbol`, `String`, or `Pair` of `Symbol` or `String` to `Vector`. As a single `Symbol` or `String`, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a `Pair`, the 2nd part of the pair should be a `Vector` of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
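As a sketch of combining several of the layout options above (file name and column names are hypothetical):

```julia
using CSV, DataFrames

# File with no header row: supply the names explicitly, skip lines
# starting with "#", and ignore the last 2 rows of the file
df = CSV.read("data_noheader.csv", DataFrame;
              header=["id", "value"], comment="#", footerskip=2)
```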
Parsing options:
- `missingstring`: either `nothing`, a `String`, or a `Vector{String}` to use as sentinel values that will be parsed as `missing`; if `nothing` is passed, no sentinel/missing values will be parsed; by default, `missingstring=""`, which means only an empty field (two consecutive delimiters) is considered `missing`
- `delim=','`: a `Char` or `String` that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- `ignorerepeated::Bool=false`: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- `quoted::Bool=true`: whether parsing should check for `quotechar` at the start/end of cells
- `quotechar='"'`, `openquotechar`, `closequotechar`: a `Char` (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
- `escapechar='"'`: the `Char` used to escape quote characters in a quoted field
- `dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}`: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an `AbstractDict`, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map a column index `Int`, or name `Symbol` or `String`, to the format string for that column.
- `decimal='.'`: a `Char` indicating how decimals are separated in floats, i.e. `3.14` uses `'.'`, or `3,14` uses a comma `','`
- `groupmark=nothing`: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (`1,000.00`)
- `truestrings`, `falsestrings`: `Vector{String}`s that indicate how `true` or `false` values are represented; by default `"true", "True", "TRUE", "T", "1"` are used to detect `true` and `"false", "False", "FALSE", "F", "0"` are used to detect `false`; note that columns with only `1` and `0` values will default to `Int64` column type unless explicitly requested to be `Bool` via the `types` keyword argument
- `stripwhitespace=false`: if true, leading and trailing whitespace are stripped from string values, including column names
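For example, a European-style file might combine several of these parsing options (file name hypothetical):

```julia
using CSV, DataFrames

# Semicolon-delimited file using comma as the decimal mark,
# "NA" as the missing-value sentinel, and day/month/year dates
df = CSV.read("data_eu.csv", DataFrame;
              delim=';', decimal=',', missingstring="NA",
              dateformat="dd/mm/yyyy")
```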
Column Type Options:
- `types`: a single `Type`, `AbstractVector` or `AbstractDict` of types, or a function of the form `(i, name) -> Union{T, Nothing}` to be used for column types; if a single `Type` is provided, all columns will be parsed with that single type; an `AbstractDict` can map a column index `Integer`, or name `Symbol` or `String`, to the type for a column, i.e. `Dict(1=>Float64)` will set the first column as a `Float64`, `Dict(:column1=>Float64)` will set the column named `column1` to `Float64`, and `Dict("column1"=>Float64)` will set the `column1` column to `Float64`; if a `Vector` is provided, it must match the # of columns provided or detected in `header`. If a function is provided, it takes a column index and name as arguments and should return the desired column type for the column, or `nothing` to signal the column's type should be detected while parsing.
- `typemap::IdDict{Type, Type}`: a mapping of a type that should be replaced in every instance with another type, i.e. `IdDict(Float64=>String)` would change every detected `Float64` column to be parsed as `String`; only "standard" types are allowed to be mapped to another type, i.e. `Int64`, `Float64`, `Date`, `DateTime`, `Time`, and `Bool`. If a column of one of those types is "detected", it will be mapped to the specified type.
- `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500)`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`; if `true`, all columns detected as `String` will be pooled; alternatively, the proportion of unique values below which `String` columns should be pooled (e.g. with `pool=0.25`, a column will be pooled if the # of unique strings is under 25%). If provided as a `Tuple{Float64, Int}` like `(0.2, 500)`, it represents the percent cardinality threshold as the 1st tuple element (`0.2`), and an upper limit for the # of unique values (`500`), under which the column will be pooled; this is the default (`pool=(0.2, 500)`). If an `AbstractVector`, each element should be `Bool`, `Real`, or `Tuple{Float64, Int}` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool`, `Real`, or `Tuple{Float64, Int}` value can be provided for individual columns where the dict key is given as a column index `Integer`, or a column name as `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Real`, `Tuple{Float64, Int}`, or `nothing` for each column.
- `downcast::Bool=false`: controls whether columns detected as `Int64` will be "downcast" to the smallest possible integer type like `Int8`, `Int16`, `Int32`, etc.
- `stringtype=InlineStrings.InlineString`: controls how detected string columns will ultimately be returned; the default is `InlineString`, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to `String`. If `String` is passed, all string columns will just be normal `String` values. If `PosLenString` is passed, string columns will be returned as `PosLenStringVector`, which is a special "lazy" `AbstractVector` that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of `PosLenStringVector` makes it read-only, so operations like `push!`, `append!`, or `setindex!` are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
- `strict::Bool=false`: whether invalid values should throw a parsing error or be replaced with `missing`
- `silencewarnings::Bool=false`: if `strict=false`, whether invalid value warnings should be silenced
- `maxwarnings::Int=100`: if more than `maxwarnings` warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to `maxwarnings`
- `debug::Bool=false`: passing `true` will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
- `validate::Bool=true`: whether or not to validate that columns specified in the `types`, `dateformat` and `pool` keywords are actually found in the data. If `false`, no validation is done, meaning no error will be thrown if `types`/`dateformat`/`pool` specify settings for columns not actually found in the data.
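Combining the column type options above might look like this sketch (file and column names hypothetical):

```julia
using CSV, DataFrames

df = CSV.read("data.csv", DataFrame;
              types=Dict(:id => Int64, :price => Float64),  # explicit types for two columns
              pool=0.1,            # pool String columns with under 10% unique values
              stringtype=String)   # return plain String columns instead of InlineStrings
```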
Iteration options:
- `reusebuffer=false`: [only supported by `CSV.Rows`] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing `collect(CSV.Rows(file))` because only the current iterated row is "valid")
CSV.checkvaliddelim — Method
checkvaliddelim(delim)
Checks whether a character or string is valid for use as a delimiter. If `delim` is `nothing`, it is assumed that the delimiter will be auto-selected. Throws an error if `delim` is invalid.
CSV.detect — Function
CSV.detect(str::String)
Use the same logic used by `CSV.File` to detect column types, to parse a value from a plain string. This can be useful in conjunction with the `CSV.Rows` type, which returns each cell of a file as a `String`. The order of types attempted is: `Int`, `Float64`, `Date`, `DateTime`, `Bool`, and if all fail, the input `String` is returned. No errors are thrown. For advanced usage, you can pass your own `Parsers.Options` type as a keyword argument `option=ops` for sentinel value detection.
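A quick sketch of `CSV.detect` on a few inputs, following the detection order described above:

```julia
using CSV

CSV.detect("101")     # detected as an Int
CSV.detect("3.14")    # detected as a Float64
CSV.detect("true")    # detected as a Bool
CSV.detect("hello")   # no type matches; the input String is returned
```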
CSV.isvaliddelim — Method
isvaliddelim(delim)
Whether a character or string is valid for use as a delimiter.
CSV.read — Function
CSV.read(source, sink::T; kwargs...) => T
Reads and parses a delimited file or files, materializing directly using the `sink` function. Allows avoiding excessive copies of columns for certain sinks like `DataFrame`.
Example
julia> using CSV, DataFrames
julia> path = tempname();
julia> write(path, "a,b,c\n1,2,3");
julia> CSV.read(path, DataFrame)
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> CSV.read(path, DataFrame; header=false)
2×3 DataFrame
Row │ Column1 Column2 Column3
│ String1 String1 String1
─────┼───────────────────────────
1 │ a b c
2 │ 1 2 3
Arguments
File layout options:
- `header=1`: how column names should be determined; if given as an `Integer`, indicates the row to parse for column names; as an `AbstractVector{<:Integer}`, indicates a set of rows to be concatenated together as column names; `Vector{Symbol}` or `Vector{String}` give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a `Vector`, or set `header=0` or `header=false` and column names will be auto-generated (`Column1`, `Column2`, etc.). Note that if a row number header and `comment` or `ignoreemptyrows` are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- `normalizenames::Bool=false`: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the `tbl.col1` `getproperty` syntax or iterating rows and accessing column values of a row via `getproperty` (e.g. `row.col1`)
- `skipto::Integer`: specifies the row where the data starts in the csv file; by default, the next row after the `header` row(s) is used. If `header=0`, then the 1st row is assumed to be the start of data; providing a `skipto` argument does not affect the `header` argument. Note that if a row number `skipto` and `comment` or `ignoreemptyrows` are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- `footerskip::Integer`: number of rows at the end of a file to skip parsing. Do note that commented rows (see the `comment` keyword argument) do not count towards the row number provided for `footerskip`; they are completely ignored by the parser
- `transpose::Bool`: read a csv file "transposed", i.e. each column is parsed as a row
- `comment::String`: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or `skipto` and `comment` are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- `ignoreemptyrows::Bool=true`: whether empty rows in a file should be ignored (if `false`, each column will be assigned `missing` for that empty row)
- `select`: an `AbstractVector` of `Integer`, `Symbol`, `String`, or `Bool`, or a "selector" function of the form `(i, name) -> keep::Bool`; only columns in the collection or for which the selector function returns `true` will be parsed and accessible in the resulting `CSV.File`. Invalid values in `select` are ignored.
- `drop`: inverse of `select`; an `AbstractVector` of `Integer`, `Symbol`, `String`, or `Bool`, or a "drop" function of the form `(i, name) -> drop::Bool`; columns in the collection or for which the drop function returns `true` will be ignored in the resulting `CSV.File`. Invalid values in `drop` are ignored.
- `limit`: an `Integer` to indicate a limited number of rows to parse in a csv file; use in combination with `skipto` to read a specific, contiguous chunk within a file; note that for large files when multiple threads are used for parsing, the `limit` argument may not result in an exact # of rows parsed; use `ntasks=1` to ensure an exact limit if necessary
- `buffer_in_memory`: a `Bool`, default `false`, which controls whether a `Cmd`, `IO`, or gzipped source will be read/decompressed in memory vs. using a temporary file.
- `ntasks::Integer=Threads.nthreads()`: [not applicable to `CSV.Rows`] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the `JULIA_NUM_THREADS` environment variable or `julia -t N`); setting `ntasks=1` will avoid any calls to `Threads.@spawn` and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
- `rows_to_check::Integer=30`: [not applicable to `CSV.Rows`] a multithreaded parsed file will be split up into `ntasks` # of equal chunks; `rows_to_check` controls the # of rows checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, `rows_to_check` may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
- `source`: [only applicable for a vector of inputs to `CSV.File`] a `Symbol`, `String`, or `Pair` of `Symbol` or `String` to `Vector`. As a single `Symbol` or `String`, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a `Pair`, the 2nd part of the pair should be a `Vector` of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
- `missingstring`: either `nothing`, a `String`, or a `Vector{String}` to use as sentinel values that will be parsed as `missing`; if `nothing` is passed, no sentinel/missing values will be parsed; by default, `missingstring=""`, which means only an empty field (two consecutive delimiters) is considered `missing`
- `delim=','`: a `Char` or `String` that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- `ignorerepeated::Bool=false`: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- `quoted::Bool=true`: whether parsing should check for `quotechar` at the start/end of cells
- `quotechar='"'`, `openquotechar`, `closequotechar`: a `Char` (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
- `escapechar='"'`: the `Char` used to escape quote characters in a quoted field
- `dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}`: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an `AbstractDict`, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map a column index `Int`, or name `Symbol` or `String`, to the format string for that column.
- `decimal='.'`: a `Char` indicating how decimals are separated in floats, i.e. `3.14` uses `'.'`, or `3,14` uses a comma `','`
- `groupmark=nothing`: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (`1,000.00`)
- `truestrings`, `falsestrings`: `Vector{String}`s that indicate how `true` or `false` values are represented; by default `"true", "True", "TRUE", "T", "1"` are used to detect `true` and `"false", "False", "FALSE", "F", "0"` are used to detect `false`; note that columns with only `1` and `0` values will default to `Int64` column type unless explicitly requested to be `Bool` via the `types` keyword argument
- `stripwhitespace=false`: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
- `types`: a single `Type`, `AbstractVector` or `AbstractDict` of types, or a function of the form `(i, name) -> Union{T, Nothing}` to be used for column types; if a single `Type` is provided, all columns will be parsed with that single type; an `AbstractDict` can map a column index `Integer`, or name `Symbol` or `String`, to the type for a column, i.e. `Dict(1=>Float64)` will set the first column as a `Float64`, `Dict(:column1=>Float64)` will set the column named `column1` to `Float64`, and `Dict("column1"=>Float64)` will set the `column1` column to `Float64`; if a `Vector` is provided, it must match the # of columns provided or detected in `header`. If a function is provided, it takes a column index and name as arguments and should return the desired column type for the column, or `nothing` to signal the column's type should be detected while parsing.
- `typemap::IdDict{Type, Type}`: a mapping of a type that should be replaced in every instance with another type, i.e. `IdDict(Float64=>String)` would change every detected `Float64` column to be parsed as `String`; only "standard" types are allowed to be mapped to another type, i.e. `Int64`, `Float64`, `Date`, `DateTime`, `Time`, and `Bool`. If a column of one of those types is "detected", it will be mapped to the specified type.
- `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500)`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`; if `true`, all columns detected as `String` will be pooled; alternatively, the proportion of unique values below which `String` columns should be pooled (e.g. with `pool=0.25`, a column will be pooled if the # of unique strings is under 25%). If provided as a `Tuple{Float64, Int}` like `(0.2, 500)`, it represents the percent cardinality threshold as the 1st tuple element (`0.2`), and an upper limit for the # of unique values (`500`), under which the column will be pooled; this is the default (`pool=(0.2, 500)`). If an `AbstractVector`, each element should be `Bool`, `Real`, or `Tuple{Float64, Int}` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool`, `Real`, or `Tuple{Float64, Int}` value can be provided for individual columns where the dict key is given as a column index `Integer`, or a column name as `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Real`, `Tuple{Float64, Int}`, or `nothing` for each column.
- `downcast::Bool=false`: controls whether columns detected as `Int64` will be "downcast" to the smallest possible integer type like `Int8`, `Int16`, `Int32`, etc.
- `stringtype=InlineStrings.InlineString`: controls how detected string columns will ultimately be returned; the default is `InlineString`, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to `String`. If `String` is passed, all string columns will just be normal `String` values. If `PosLenString` is passed, string columns will be returned as `PosLenStringVector`, which is a special "lazy" `AbstractVector` that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of `PosLenStringVector` makes it read-only, so operations like `push!`, `append!`, or `setindex!` are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
- `strict::Bool=false`: whether invalid values should throw a parsing error or be replaced with `missing`
- `silencewarnings::Bool=false`: if `strict=false`, whether invalid value warnings should be silenced
- `maxwarnings::Int=100`: if more than `maxwarnings` warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to `maxwarnings`
- `debug::Bool=false`: passing `true` will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
- `validate::Bool=true`: whether or not to validate that columns specified in the `types`, `dateformat` and `pool` keywords are actually found in the data. If `false`, no validation is done, meaning no error will be thrown if `types`/`dateformat`/`pool` specify settings for columns not actually found in the data.
Iteration options:
- `reusebuffer=false`: [only supported by `CSV.Rows`] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing `collect(CSV.Rows(file))` because only the current iterated row is "valid")
CSV.write — Function
CSV.write(file, table; kwargs...) => file
table |> CSV.write(file; kwargs...) => file
Write a Tables.jl interface input to a csv file, given as an `IO` argument or a `String`/FilePaths.jl type representing the file name to write to. Alternatively, `CSV.RowWriter` creates a row iterator, producing a csv-formatted string for each row in an input table.
Supported keyword arguments include:
- `bufsize::Int=2^22`: the length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the `bufsize`, an error is thrown
- `delim::Union{Char, String}=','`: a character or string to print out as the file's delimiter
- `quotechar::Char='"'`: ASCII character to use for quoting text fields that may contain delimiters or newlines
- `openquotechar::Char`: instead of `quotechar`, use `openquotechar` and `closequotechar` to support different starting and ending quote characters
- `escapechar::Char='"'`: ASCII character used to escape quote characters in a text field
- `missingstring::String=""`: string to print for `missing` values
- `dateformat=Dates.default_format(T)`: the date format string to use for printing out `Date` & `DateTime` columns
- `append=false`: whether to append writing to an existing file/IO; if `true`, it will not write column names by default
- `compress=false`: compress the written output using standard gzip compression (provided by the CodecZlib.jl package); note that a compression stream can always be provided as the first "file" argument to support other forms of compression; passing `compress=true` is just for convenience to avoid needing to manually set up a GzipCompressorStream
- `writeheader=!append`: whether to write an initial row of delimited column names; not written by default if appending
- `header`: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table
- `newline='\n'`: character or string to use to separate rows (lines in the csv file)
- `quotestrings=false`: whether to force all strings to be quoted or not
- `decimal='.'`: character to use as the decimal point when writing floating point numbers
- `transform=(col,val)->val`: a function that is applied to every cell, e.g. we can transform all `nothing` values to `missing` using `(col, val) -> something(val, missing)`
- `bom=false`: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not
- `partition::Bool=false`: by passing `true`, the `table` argument is expected to implement `Tables.partitions` and the `file` argument can either be an indexable collection of `IO`, file `String`s, or a single file `String` that will have an index appended to the name
Examples
using CSV, Tables, DataFrames
# write out a DataFrame to csv file
df = DataFrame(rand(10, 10), :auto)
CSV.write("data.csv", df)
# write a matrix to an in-memory IOBuffer
io = IOBuffer()
mat = rand(10, 10)
CSV.write(io, Tables.table(mat))
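A few more hedged sketches of the keyword arguments above (file names hypothetical):

```julia
using CSV, DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

# gzip-compress the output via compress=true
CSV.write("data.csv.gz", df; compress=true)

# append additional rows to an existing file without rewriting the header
CSV.write("data.csv", df)
CSV.write("data.csv", df; append=true)
```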