Python CSV Load Times and the Feather Format

By Ivar Thorson

One of the most common data science tasks is loading structured data in tabular form. For “small data”, a popular format is comma-separated values (CSV), which nearly everybody can read and write correctly. For so-called “big data”, one often needs a database, preferably a SQL database like Postgres; if the dataset gets very large, NoSQL often becomes the only practical choice. But what do you do when you have “medium data”? Data that is still small enough to fit on a single computer, but is too large to load into memory?

Parquet is pretty good, but I also rather like the Feather format. In my test it was considerably faster than CSV for both reading and writing, and it can be read easily from pandas.
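As a rough illustration of what "read easily from pandas" means, here is a minimal sketch of a Feather round trip. It assumes the pyarrow package is installed, since pandas relies on it for Feather support. The example data, the column names, and the file name `example.feather` are made up for illustration and are not from my data set.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=5, freq="D"),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})


def feather_round_trip(frame, path):
    """Write frame to a Feather file, then read back only the 'value' column."""
    # pandas documents to_feather as requiring a default index, so reset it first.
    frame.reset_index(drop=True).to_feather(path)
    # The columns argument limits the read to the listed columns.
    return pd.read_feather(path, columns=["value"])


# pandas needs pyarrow for Feather, so only run the demo if it is available.
if importlib.util.find_spec("pyarrow") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        subset = feather_round_trip(df, os.path.join(tmp, "example.feather"))
        print(subset)
```

Reading only the columns you need is one way to keep memory use down when the full table is large.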

Format     Read Time [ms]   Write Time [ms]   File Size [bytes]
CSV.GZ               2148             15075          15,738,200
CSV                  1842              7274          67,625,407
FEATHER               120               821          70,229,376
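To put the table in relative terms, the following small sketch computes how each CSV variant compares to Feather. The numbers are copied from the table; the dictionary layout and the function name `ratios_vs_feather` are just my choices for this example.

```python
# Timings (ms) and file sizes (bytes) copied from the table above.
results = {
    "CSV.GZ": {"read_ms": 2148, "write_ms": 15075, "size_bytes": 15_738_200},
    "CSV": {"read_ms": 1842, "write_ms": 7274, "size_bytes": 67_625_407},
    "FEATHER": {"read_ms": 120, "write_ms": 821, "size_bytes": 70_229_376},
}


def ratios_vs_feather(table):
    """Return read/write speedups of Feather over each other format,
    plus how many times larger the Feather file is."""
    base = table["FEATHER"]
    out = {}
    for name, row in table.items():
        if name == "FEATHER":
            continue
        out[name] = {
            "read_speedup": row["read_ms"] / base["read_ms"],
            "write_speedup": row["write_ms"] / base["write_ms"],
            "size_ratio": base["size_bytes"] / row["size_bytes"],
        }
    return out


for name, r in ratios_vs_feather(results).items():
    print(
        f"Feather vs {name}: read {r['read_speedup']:.1f}x faster, "
        f"write {r['write_speedup']:.1f}x faster, "
        f"file {r['size_ratio']:.2f}x the size"
    )
```

The trade-off is visible in the last column: the Feather file is slightly larger than the plain CSV and several times larger than the gzipped one, so the speed is paid for in disk space.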

Below is the little snippet of code that I used to test out the formats. On my MacBook Pro, reading a Feather file was about 18x faster than reading the gzipped CSV (120 ms versus 2148 ms) for a particular time-series data set that I was working on. Your mileage may vary, of course, depending on your machine and your data set.

import pandas as pd
import datetime as dt

# Uncomment to load uncompressed CSV
# t_start = dt.datetime.now()
# df = pd.read_csv('data/timeseries.csv')
# t_end = dt.datetime.now()
# print(t_end - t_start)

# Uncomment to save compressed CSV
# t_start = dt.datetime.now()
# df.to_csv('data/timeseries.csv.gz', compression='gzip')
# t_end = dt.datetime.now()
# print(t_end - t_start)

# Uncomment to load compressed CSV
# t_start = dt.datetime.now()
# df = pd.read_csv('data/timeseries.csv.gz')
# t_end = dt.datetime.now()
# print(t_end - t_start)

# Uncomment to save the feather
# t_start = dt.datetime.now()
# df.to_feather('data/timeseries.feather')
# t_end = dt.datetime.now()
# print(t_end - t_start)

# Load the feather file (the one test left uncommented here)
t_start = dt.datetime.now()
df2 = pd.read_feather('data/timeseries.feather')
t_end = dt.datetime.now()
print(t_end - t_start)
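
If you would rather not comment and uncomment sections by hand, the same measurements can be wrapped in a small helper. This is only a sketch under a few assumptions, not the script that produced the table: it uses synthetic random data instead of my time-series file, takes the best of a few runs with `time.perf_counter`, and only includes Feather when pyarrow is installed. The helper names `time_ms` and `benchmark_formats` are made up for this example.

```python
import importlib.util
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_ms(fn, repeats=3):
    """Return the best wall-clock time of fn() in milliseconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def benchmark_formats(df, directory, repeats=3):
    """Time writing and reading df in each format.

    Returns {format: {"write_ms", "read_ms", "size_bytes"}}.
    """
    formats = {
        "CSV": ("data.csv",
                lambda p: df.to_csv(p, index=False),
                lambda p: pd.read_csv(p)),
        "CSV.GZ": ("data.csv.gz",
                   lambda p: df.to_csv(p, index=False, compression="gzip"),
                   lambda p: pd.read_csv(p, compression="gzip")),
    }
    # pandas needs pyarrow for Feather; skip that format if it is missing.
    if importlib.util.find_spec("pyarrow") is not None:
        formats["FEATHER"] = ("data.feather",
                              lambda p: df.to_feather(p),
                              lambda p: pd.read_feather(p))

    results = {}
    for name, (filename, write, read) in formats.items():
        path = os.path.join(directory, filename)
        write_ms = time_ms(lambda: write(path), repeats)
        read_ms = time_ms(lambda: read(path), repeats)
        results[name] = {
            "write_ms": write_ms,
            "read_ms": read_ms,
            "size_bytes": os.path.getsize(path),
        }
    return results


if __name__ == "__main__":
    # Synthetic stand-in for a time-series table (an assumption, not my data).
    n = 100_000
    rng = np.random.default_rng(0)
    demo = pd.DataFrame({
        "timestamp": pd.date_range("2020-01-01", periods=n, freq="min"),
        "a": rng.standard_normal(n),
        "b": rng.standard_normal(n),
    })
    with tempfile.TemporaryDirectory() as tmp:
        for fmt, row in benchmark_formats(demo, tmp).items():
            print(f"{fmt:8s} read {row['read_ms']:8.1f} ms  "
                  f"write {row['write_ms']:8.1f} ms  {row['size_bytes']:>12,d} bytes")
```

Taking the best of several runs reduces noise from disk caching and other processes, but the absolute numbers will still depend heavily on the machine and on the shape of the data.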