One of the most common data science tasks is loading structured, tabular data. For “small data”, the Comma-Separated Values (CSV) format is popular, since nearly everybody can read and write it correctly. For so-called “big data”, one usually needs a database, preferably a SQL database like Postgres, and if the dataset gets very large, NoSQL often becomes the only practical choice. Well, what do you do when you have “medium data”? Data that is still small enough to fit on a single computer, but too large to load into memory?
Parquet is pretty good, but I also rather like the feather format. It’s considerably faster than CSV, and can be read easily from Pandas.
| Format  | Read Time [ms] | Write Time [ms] | File Size [bytes] |
|---------|----------------|-----------------|-------------------|
| CSV.GZ  | 2148           | 15075           | 15,738,200        |
| CSV     | 1842           | 7274            | 67,625,407        |
| FEATHER | 120            | 821             | 70,229,376        |
Here’s a little snippet of code that I used to test out the formats. On my MacBook Pro, reading a feather file is about 18x faster than reading a gzipped CSV for a particular timeseries data set that I was working on. Your mileage may vary, depending on your machine and your data set, of course.
import pandas as pd
import datetime as dt
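# Note: pandas' feather support is provided by the pyarrow package,
# so you may need to run `pip install pyarrow` first.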
# Uncomment to load the uncompressed CSV
# t_start = dt.datetime.now()
# df = pd.read_csv('data/timeseries.csv')
# t_end = dt.datetime.now()
# print(t_end - t_start)
# Uncomment to save the gzipped CSV
# t_start = dt.datetime.now()
# df.to_csv('data/timeseries.csv.gz', compression='gzip')
# t_end = dt.datetime.now()
# print(t_end - t_start)
# Uncomment to load the gzipped CSV
# t_start = dt.datetime.now()
# df = pd.read_csv('data/timeseries.csv.gz')
# t_end = dt.datetime.now()
# print(t_end - t_start)
# Uncomment to save the feather
# t_start = dt.datetime.now()
# df.to_feather('data/timeseries.feather')
# t_end = dt.datetime.now()
# print(t_end - t_start)
# Load the feather file
t_start = dt.datetime.now()
df2 = pd.read_feather('data/timeseries.feather')
t_end = dt.datetime.now()
print(t_end - t_start)
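If you want to time all three formats in one go, a small helper cuts down on the copy-and-paste. The sketch below is just that, a sketch: it assumes the same data/timeseries.* paths as above, uses time.perf_counter for the timing, and resets the index before writing feather, since feather can’t store a non-default index.

import time
import pandas as pd

def timed(label, fn):
    # Run fn(), print how long it took, and return its result.
    start = time.perf_counter()
    result = fn()
    print(f'{label}: {time.perf_counter() - start:.3f} s')
    return result

# Assumed paths -- adjust these to wherever your data actually lives.
df = timed('read csv', lambda: pd.read_csv('data/timeseries.csv'))
timed('write csv.gz', lambda: df.to_csv('data/timeseries.csv.gz', compression='gzip'))
timed('read csv.gz', lambda: pd.read_csv('data/timeseries.csv.gz'))
# Feather can't serialize a non-default index, so drop it before writing.
timed('write feather', lambda: df.reset_index(drop=True).to_feather('data/timeseries.feather'))
timed('read feather', lambda: pd.read_feather('data/timeseries.feather'))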