Currently there are two ways to read/write dataframes to files: csv and json. While developing the new DataSet library, I benchmarked the reading speeds of the different methods. The following are the results on my machine, in milliseconds:

| Dataset                | csv    | json (split) | json (records) | json (index) | json (columns) | json (values) |
|------------------------|--------|--------------|----------------|--------------|----------------|---------------|
| Iris                   | 4.57   | 2.95         | 5.19           | 18.08        | 8.92           | 2.69          |
| Boston                 | 42.7   | 18.59        | 43.61          | 21.11        | 93             | 18.59         |
| Digits                 | 663.5  | 277.96       | 571            | 889.28       | 606.5          | 257.9         |
| MNIST test (10k rows)  | 49490  | 13491        | 49019          | NA           | NA             | 13360         |
| MNIST train (60k rows) | 303505 | 106291       | NA             | NA           | NA             | 102917        |
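The numbers above were measured with the DataSet library itself. As a rough, non-authoritative sketch of how per-format read timings like these can be collected, the snippet below uses pandas and time.perf_counter; pandas is assumed here only because it defines the same orient names, and the file paths are placeholders, not the actual benchmark files.

```python
import time
import pandas as pd

def time_read(label, fn):
    # Time a single read and report it in milliseconds.
    start = time.perf_counter()
    fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {elapsed_ms:.2f} ms")

# Hypothetical file names; substitute the real benchmark datasets.
time_read("csv", lambda: pd.read_csv("iris.csv"))
for orient in ("split", "records", "index", "columns", "values"):
    time_read(f"json ({orient})",
              lambda o=orient: pd.read_json(f"iris_{o}.json", orient=o))
```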

As you can see, the fastest method is json with orient='split', since that format is close to the object's internal representation (orient='values' is comparably fast, but in the pandas convention it stores only the raw data without column names). Next fastest are csv and json with orient='records'; the others are significantly slower.
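To make the "close to the internal representation" point concrete, here is what the two fastest JSON layouts look like in the pandas orient convention, whose names the benchmark follows. This is purely an illustration with pandas, not the DataSet library's own API:

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

# orient='split' stores the column names, the index, and the data matrix as
# three separate arrays, which maps almost directly onto a columnar
# dataframe's internal layout.
print(df.to_json(orient="split"))

# orient='records' emits one object per row, repeating every column name for
# every row, so there is more text to parse and more per-row work to do.
print(df.to_json(orient="records"))
```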

CSV reading profile

The primary reason for the slowdown when reading csv is the detectTypesAndConvert method of DataFrameTypeDetector. It runs through every element of every column to determine that column's type, and it may run multiple times on the same column, which causes a significant slowdown for large dataframes. On large datasets it accounts for at least 85% of the read time. Modifying this method to inspect only a small sample of elements would give a significant performance boost; a sketch of that idea follows.
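A minimal sketch of sampling-based detection, assuming string-valued cells as they come out of a csv parser. The name infer_converter and the sample size are hypothetical, not the library's actual API:

```python
import random

def infer_converter(values, sample_size=20):
    """Guess a converter for a column by probing a small random sample
    instead of scanning every element (a hypothetical stand-in for a
    full detectTypesAndConvert-style pass)."""
    sample = values if len(values) <= sample_size else random.sample(values, sample_size)

    def all_parse(cast):
        for v in sample:
            try:
                cast(v)
            except (ValueError, TypeError):
                return False
        return True

    if all_parse(int):
        return int
    if all_parse(float):
        return float
    return str

# Usage: pick the converter from a sample, then convert the full column once.
raw_column = ["5.1", "4.9", "4.7", "4.6"]
convert = infer_converter(raw_column)
typed_column = [convert(v) for v in raw_column]  # -> [5.1, 4.9, 4.7, 4.6]
```

The obvious trade-off is that a rare odd value outside the sample can be misclassified, so a fallback path (for example, re-detecting when conversion of the full column fails) would still be needed.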

JSON reading profile

Similarly, when reading json with the fastest available method (orient='split'), the same detectTypesAndConvert method causes the slowdown.
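For completeness, here is a hedged sketch of how the same sampling idea would look on the split-format path. read_split_json is a hypothetical illustration, not the library's reader, and it assumes the split layout shown earlier:

```python
import json

def read_split_json(path, sample_size=20):
    """Hypothetical split-format reader: column construction is cheap because
    the payload already separates column names from row data; the remaining
    per-column cost is type detection, which sampling keeps small."""
    with open(path) as f:
        payload = json.load(f)

    columns, types = {}, {}
    for i, name in enumerate(payload["columns"]):
        values = [row[i] for row in payload["data"]]
        # JSON values arrive already typed, so inspecting a small sample
        # (rather than every element) is enough to pick the column type.
        sample = [v for v in values[:sample_size] if v is not None]
        kinds = {type(v) for v in sample}
        types[name] = kinds.pop() if len(kinds) == 1 else object
        columns[name] = values
    return columns, types
```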


TL;DR: Improving DataFrameTypeDetector will massively increase reading speeds, by 50% to 90% for large files.