CSV.jl Documentation
GitHub Repo: https://github.com/JuliaData/CSV.jl
Welcome to CSV.jl! A pure-Julia package for handling delimited text data, be it comma-delimited (csv), tab-delimited (tsv), or otherwise.
Installation
You can install CSV by typing the following in the Julia REPL:

```julia
] add CSV
```

followed by

```julia
using CSV
```

to load the package.
Overview
To start out, let's discuss the high-level functionality provided by the package, which hopefully will help direct you to more specific documentation for your use-case:
`CSV.File`
: the most commonly used function for ingesting delimited data; it reads an entire data input (or vector of data inputs), detecting the number of columns and rows, along with the type of data for each column. Returns a `CSV.File` object, which is like a lightweight table/DataFrame. Assuming `file` is a variable holding a `CSV.File` object, individual columns can be accessed like `file.col1`, `file[:col1]`, or `file["col1"]`. You can see the parsed column names via `file.names`. A `CSV.File` can also be iterated, where a `CSV.Row` is produced on each iteration, which allows access to each value in the row via `row.col1`, `row[:col1]`, or `row[1]`. You can also index a `CSV.File` directly, like `file[1]`, to return the entire `CSV.Row` at the provided index/row number. Multiple threads will be used while parsing the input data if the input is large enough, and full return column buffers will be allocated to hold the parsed data. `CSV.File` satisfies the Tables.jl "source" interface, and so can be passed to valid sink functions like `DataFrame`, `SQLite.load!`, `Arrow.write`, etc. Supports a number of keyword arguments to control parsing, column types, and other file metadata options.

`CSV.read`
: a convenience function identical to `CSV.File`, but used when a `CSV.File` will be passed directly to a sink function, like a `DataFrame`. In some cases, sinks may make copies of incoming data for their own safety; by calling `CSV.read(file, DataFrame)`, no copies of the parsed `CSV.File` will be made, and the `DataFrame` will take direct ownership of the `CSV.File`'s columns, which is more efficient than doing `CSV.File(file) |> DataFrame`, which will result in an extra copy of each column being made. Keyword arguments are identical to `CSV.File`. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like `CSV.File`, a vector of data inputs can be passed as the 1st argument, which will result in a single "long" table of all the inputs vertically concatenated. Each input must have an identical schema (column names and types).

`CSV.Rows`
: an alternative approach for consuming delimited data, where the input is consumed only one row at a time, which allows "streaming" the data with a lower memory footprint than `CSV.File`. Supports many of the same options as `CSV.File`, except column type handling is a little different. By default, every column type will be essentially `Union{Missing, String}`, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a `CSV.Rows` object, rows can be "streamed" by iterating, where each iteration produces a `CSV.Row2` object, which operates similarly to `CSV.File`'s `CSV.Row` type: individual row values can be accessed via `row.col1`, `row[:col1]`, or `row[1]`. If each row is processed individually, additional memory can be saved by passing `reusebuffer=true`, which means a single buffer will be allocated to hold only the values of the currently iterated row. `CSV.Rows` also supports the Tables.jl interface and can likewise be passed to valid sink functions.

`CSV.Chunks`
: similar to `CSV.File`, but allows passing an `ntasks::Integer` keyword argument, which will cause the input file to be "chunked" up into `ntasks` number of chunks. After constructing a `CSV.Chunks` object, each iteration of the object will return a `CSV.File` of the next parsed chunk. Useful for processing extremely large files in "chunks". Because each iterated element is a valid Tables.jl "source", `CSV.Chunks` satisfies the `Tables.partitions` interface, so sinks that can process input partitions can operate by passing `CSV.Chunks` as the "source".

`CSV.write`
: a valid Tables.jl "sink" function for writing any valid input table out in a delimited text format. Supports many options for controlling the output, like the delimiter, quote characters, etc. Writes data to an internal buffer, which is flushed out when full; the buffer size is configurable. Also supports writing out partitioned inputs as separate output files, one file per input partition. To write out a `DataFrame`, for example, it's simply `CSV.write("data.csv", df)`; to write out a matrix, it's `using Tables; CSV.write("data.csv", Tables.table(mat))`.
`CSV.RowWriter`
: an alternative way to produce csv output; takes any valid Tables.jl input, and on each iteration produces a single csv-formatted string for one row of the input table.
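The chunked-processing pattern described above can be sketched as follows; `data.csv` and `ntasks=8` are purely illustrative:

```julia
using CSV

# Process an input too large to comfortably hold in memory at once by
# iterating a CSV.Chunks object; each iterated chunk is itself a CSV.File.
total_rows = 0
for chunk in CSV.Chunks("data.csv"; ntasks=8)
    # each `chunk` satisfies the Tables.jl "source" interface, so it can
    # be passed to any sink, aggregated, filtered, etc.
    total_rows += length(chunk)
end
@show total_rows
```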
That's quite a bit! Let's boil it down to a TL;DR:

- Just want to read a delimited file or collection of files and do basic stuff with the data? Use `CSV.File(file)` or `CSV.read(file, DataFrame)`.
- Don't need the data as a whole, or want to stream through a large file row-by-row? Use `CSV.Rows`.
- Want to process a large file in "batches"/chunks? Use `CSV.Chunks`.
- Need to produce a csv? Use `CSV.write`.
- Want to iterate an input table and produce a single csv string per row? Use `CSV.RowWriter`.
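As a minimal end-to-end sketch of the two most common calls (the file name and column names here are made up for illustration):

```julia
using CSV, DataFrames

# Write a small table out to disk, then read it back in.
df = DataFrame(name=["Ada", "Grace"], age=[36, 45])
CSV.write("people.csv", df)

# CSV.read hands column ownership directly to the DataFrame (no extra copy).
df2 = CSV.read("people.csv", DataFrame)
@assert df2.name == ["Ada", "Grace"]
@assert df2.age == [36, 45]
```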
For the rest of the manual, we're going to have two big sections, Reading and Writing, where we'll walk through the various options to `CSV.File`/`CSV.read`/`CSV.Rows`/`CSV.Chunks` and `CSV.write`/`CSV.RowWriter`.
- Reading
  - `input`
  - `header`
  - `normalizenames`
  - `skipto`
  - `footerskip`
  - `transpose`
  - `comment`
  - `ignoreemptyrows`
  - `select`/`drop`
  - `limit`
  - `ntasks`
  - `rows_to_check`
  - `source`
  - `missingstring`
  - `delim`
  - `ignorerepeated`
  - `quoted`
  - `quotechar`/`openquotechar`/`closequotechar`
  - `escapechar`
  - `dateformat`
  - `decimal`
  - `groupmark`/thousands separator
  - `truestrings`/`falsestrings`
  - `types`
  - `typemap`
  - `pool`
  - `downcast`
  - `stringtype`
  - `strict`/`silencewarnings`/`maxwarnings`
  - `debug`
- API Reference
- Common terms
- Writing
- Examples
  - Non-UTF-8 character encodings
  - Concatenate multiple inputs at once
  - Gzipped input
  - Delimited data in a string
  - Data from the web/a url
  - Reading from a zip file
  - Column names on 2nd row
  - No column names in data
  - Manually provide column names
  - Multi-row column names
  - Normalizing column names
  - Skip to specific row where data starts
  - Skipping trailing useless rows
  - Reading transposed data
  - Ignoring commented rows
  - Ignoring empty rows
  - Including/excluding columns
  - Limiting number of rows from data
  - Specifying custom missing strings
  - String delimiter
  - Fixed width files
  - Turning off quoted cell parsing
  - Quoted & escaped fields
  - DateFormat
  - Custom decimal separator
  - Thousands separator
- In some contexts, separators other than thousands separators group digits in a number. `groupmark` supports ignoring them, as long as the separator character is ASCII.
- By default, parsing only considers the string values `true` and `false` as valid `Bool` values. To consider alternative values, we can pass a `Vector{String}` to the `truestrings` and `falsestrings` keyword arguments.
- This file contains a 3x3 identity matrix of `Float64`. By default, parsing will detect the delimiter and type, but we can also explicitly pass `delim=' '` and `types=Float64`, which tells parsing to explicitly treat each column as `Float64`, without having to guess the type on its own.
  - As a last step, if you want to convert this to a `Matrix`, this can be done by reading in first as a `DataFrame` and then function chaining to a `Matrix`.
  - Another alternative is to simply use `Tables.matrix`.
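A sketch of the identity-matrix case above, using an inline data string in place of the file (the original file contents are assumed to look like this):

```julia
using CSV, Tables

data = """
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
"""

# Explicitly treat every column as Float64 with a space delimiter,
# then convert the parsed file straight to a Matrix via Tables.matrix.
mat = CSV.File(IOBuffer(data); delim=' ', types=Float64, header=false) |> Tables.matrix
@assert mat == [1.0 0.0 0.0; 0.0 1.0 0.0; 0.0 0.0 1.0]
```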
- In this file, our 3rd column has an invalid value on the 2nd row, `invalid`. Let's imagine we'd still like to treat it as an `Int` column and ignore the `invalid` value. The syntax examples provide several ways we can tell parsing to treat the 3rd column as `Int`: by referring to column index `3`, or to the column name with a `Symbol` or `String`. We can also provide an entire `Vector` of types, one for each column (which needs to match the number of columns in the file). There are two additional keyword arguments that control parsing behavior; in the first 4 syntax examples, we would see a warning printed like `"warning: invalid Int64 value on row 2, column 3"`. In the fifth example, passing `silencewarnings=true` will suppress this warning printing. In the last syntax example, passing `strict=true` will result in an error being thrown during parsing.
- In this file we have lots of columns, and we would like to specify the same type for all columns except one, which should have a different type. We can do this by providing a function that takes the column index and column name and uses these to decide the type. Alternatively, we can provide the exact name for the first column and a `Regex` to match the rest. Note that an exact column name always takes precedence over a regular expression.
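The two scenarios above can be sketched as follows; the small inline inputs and column names stand in for the files being described:

```julia
using CSV

# An invalid value in the 3rd column: force the column to Int and
# silence the resulting warning; the bad cell parses as `missing`.
data = """
a,b,c
1,2,3
4,5,invalid
"""
f = CSV.File(IOBuffer(data); types=Dict(3 => Int), silencewarnings=true)
@assert f.c[2] === missing

# Lots of columns: decide each column's type from its index and name.
data2 = """
x1,x2,label
1,2,yes
3,4,no
"""
f2 = CSV.File(IOBuffer(data2); types=(i, name) -> name == :label ? String : Int)
@assert f2.x1 == [1, 3]
```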
- In this file, we have U.S. zipcodes in the first column that we'd rather not treat as `Int`, but parsing will detect them as such. In the first syntax example, we pass `typemap=IdDict(Int => String)`, which tells parsing to treat any detected `Int` columns as `String` instead. In the second syntax example, we alternatively set the `zipcode` column type manually.
- In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default, `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
- In this data, our `category` column is an integer type, but represents a limited set of values that could benefit from pooling. Indeed, we may want to do various DataFrame grouping/joining operations on the column, which can be more efficient if the column type is a `PooledVector`. By default, passing `pool=true` will only pool string column types; if we pass a vector or dict, however, we can specify how specific, non-string-type columns should be pooled.
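A sketch of the `typemap` zipcode case above, with inline data in place of the file (the zipcode values are illustrative):

```julia
using CSV

data = """
zipcode,city
01002,Amherst
90210,Beverly Hills
"""

# Keep zipcodes as strings (preserving the leading zero) instead of
# letting parsing detect the column as Int.
f = CSV.File(IOBuffer(data); typemap=IdDict(Int => String))
@assert f.zipcode[1] == "01002"
```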
- In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default, `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide greater control: `pool=(0.5, 2)` means that a column will be pooled if it has 2 or fewer unique values and the number of unique values is less than 50% of all values.
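A sketch of the tuple form of `pool` described above; this assumes PooledArrays.jl is available (it is the package CSV.jl uses for pooled columns), and the inline data stands in for the file:

```julia
using CSV, PooledArrays

data = """
id,code
1,A
2,A
3,B
4,B
5,A
"""

# With pool=(0.5, 2): `code` has 2 unique values out of 5 rows
# (40% < 50%, and 2 <= 2), so it is pooled; `id` is all-unique, so it is not.
f = CSV.File(IOBuffer(data); pool=(0.5, 2))
@assert f.code isa PooledArrays.PooledArray
```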