DataFrame IO - current performance
There are currently two ways to read and write dataframes to files: csv and json. While developing the new DataSet library, I benchmarked the reading speeds of the different methods. The results on my machine, in milliseconds, are as follows:
|MNIST test (10k rows)||49490||13491||49019||NA||NA||13360|
|MNIST train (60k rows)||303505||106291||NA||NA||NA||102917|
As you can see, the fastest method is json with orient='split', since that format is close to the object’s internal representation. Next fastest are csv and json with orient='records'. The others are significantly slower.
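To make the difference between the two json layouts concrete, here is a minimal Python sketch. It assumes that orient='split' and orient='records' mean the same as in pandas (column names stored once versus repeated in every row); the helper names `read_split` and `read_records` are hypothetical and are not part of the DataFrame library.

```python
import json

# A tiny column-oriented table, a stand-in for a dataframe's internal state.
columns = ["pixel0", "pixel1", "label"]
rows = [[0, 255, 7], [12, 0, 3]]

# orient='split' (assumed layout): column names stored once, rows as plain lists.
split_doc = json.dumps({"columns": columns, "data": rows})

# orient='records' (assumed layout): every row repeats every column name as a key.
records_doc = json.dumps([dict(zip(columns, row)) for row in rows])


def read_split(text):
    """Return (columns, rows); the parsed document already has that shape."""
    doc = json.loads(text)
    return doc["columns"], doc["data"]


def read_records(text):
    """Return (columns, rows); each row has to be rebuilt from a dict."""
    records = json.loads(text)
    cols = list(records[0]) if records else []
    return cols, [[rec[c] for c in cols] for rec in records]
```

Both readers return the same table, but the 'split' reader hands the parsed lists over as they are, while the 'records' reader has to walk every row and look up every key.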
The primary reason for the slowdown in reading csv is the DataFrameTypeDetector detectTypesAndConvert method. It runs through all elements in all columns to determine the type of each column, and it may run multiple times on a column, which causes a significant slowdown for large dataframes. On large datasets, the slowdown is at least 85%. Modifying this method to inspect only a few elements should give a significant performance boost.
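The idea of inspecting only a few elements can be sketched in Python as follows. This is not the DataFrameTypeDetector implementation: the function names, the `sample_size` parameter, and the fallback to a full scan when the sample guesses wrong are all assumptions made for illustration.

```python
from itertools import islice


def parse_as(text, kind):
    """Return True if the string `text` can be read as `kind`."""
    try:
        kind(text)
        return True
    except ValueError:
        return False


def detect_type(values, sample_size=None):
    """Pick the narrowest of int, float, str that fits the inspected values.

    With sample_size=None every value is inspected (the full scan).
    Otherwise only the first `sample_size` values are inspected.
    """
    inspected = list(values if sample_size is None else islice(values, sample_size))
    for kind in (int, float):
        if all(parse_as(v, kind) for v in inspected):
            return kind
    return str


def convert_column(values, sample_size=None):
    """Convert a column of strings using the detected type.

    Assumption: if the sampled guess fails on a later value,
    fall back to one full scan instead of returning wrong data.
    """
    kind = detect_type(values, sample_size)
    try:
        return [kind(v) for v in values]
    except ValueError:
        kind = detect_type(values)
        return [kind(v) for v in values]
```

For a uniform column such as MNIST pixel values, `convert_column(pixels, sample_size=10)` checks ten values instead of all of them before converting. The cost of a wrong guess is one extra full scan, which only occurs on columns whose early values are not representative.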
Similarly, when reading json using the fastest method available, the detectTypesAndConvert method causes a slowdown. Speeding up DataFrameTypeDetector should therefore massively increase reading speeds, by 50% to 90% for large files.
2019-08-03 10:02 -07:00