One of the most common data science tasks is loading structured, tabular data. For “small data”, the Comma-Separated Values (CSV) format is popular, since nearly everybody can read and write it correctly. For so-called “big data”, one usually needs a database, preferably a SQL database like Postgres, and if the dataset gets very large, NoSQL often becomes the only practical choice. Well, what do you do when you have “medium data”? Data that is still small enough to fit on a single computer, but too large to load into memory?
Parquet is pretty good, but I also rather like the feather format. It’s considerably faster than CSV, and can be read easily from Pandas.
| Format  | Read Time [ms] | Write Time [ms] | File Size [bytes] |
|---------|----------------|-----------------|-------------------|
| CSV.GZ  | 2148           | 15075           | 15,738,200        |
| CSV     | 1842           | 7274            | 67,625,407        |
| FEATHER | 120            | 821             | 70,229,376        |
Here’s a little snippet of code that I used to test out the formats. On my MacBook Pro, reading a feather file is about 18x faster than reading a gzipped CSV for a particular timeseries data set that I was working on. Your mileage may vary, depending on your machine and your data set, of course.
import pandas as pd
import datetime as dt
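# Note: pandas' feather support is provided by the pyarrow package,
# so you may need to run `pip install pyarrow` first.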
# Uncomment to load the uncompressed CSV
# t_start = dt.datetime.now()
# df = pd.read_csv('data/timeseries.csv')
# t_end = dt.datetime.now()
# print(t_end - t_start)
# Uncomment to save the gzipped CSV
# t_start = dt.datetime.now()
# df.to_csv('data/timeseries.csv.gz', compression='gzip')
# t_end = dt.datetime.now()
# print(t_end - t_start)
# Uncomment to load the gzipped CSV
# t_start = dt.datetime.now()
# df = pd.read_csv('data/timeseries.csv.gz')
# t_end = dt.datetime.now()
# print(t_end - t_start)
# Uncomment to save the feather
# t_start = dt.datetime.now()
# df.to_feather('data/timeseries.feather')
# t_end = dt.datetime.now()
# print(t_end - t_start)
# Load the feather file
t_start = dt.datetime.now()
df2 = pd.read_feather('data/timeseries.feather')
t_end = dt.datetime.now()
print(t_end - t_start)
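If you want to time all three formats in one go, a small helper cuts down on the copy-and-paste. The sketch below is just that, a sketch: it assumes the same data/timeseries.* paths as above, uses time.perf_counter for the timing, and resets the index before writing feather, since feather can’t store a non-default index.

import time
import pandas as pd

def timed(label, fn):
    # Run fn(), print how long it took, and return its result.
    start = time.perf_counter()
    result = fn()
    print(f'{label}: {time.perf_counter() - start:.3f} s')
    return result

# Assumed paths -- adjust these to wherever your data actually lives.
df = timed('read csv', lambda: pd.read_csv('data/timeseries.csv'))
timed('write csv.gz', lambda: df.to_csv('data/timeseries.csv.gz', compression='gzip'))
timed('read csv.gz', lambda: pd.read_csv('data/timeseries.csv.gz'))
# Feather can't serialize a non-default index, so drop it before writing.
timed('write feather', lambda: df.reset_index(drop=True).to_feather('data/timeseries.feather'))
timed('read feather', lambda: pd.read_feather('data/timeseries.feather'))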