GSoC 2019 Phase 2 progress
Phase 2 of GSoC is complete. Here are the project updates:
DataFrame library
PR#111 and other PRs (102, 103, 104 and 107) have been modified and merged. This means that all issues related to missing data have been fixed! 🎉
In addition, two major features have been added: join support and JSON I/O.
DataFrame joins
SQL-type joins are now possible between DataFrames. There are two APIs for this. For joins using rowNames, the following is used:
|
|
And for joins using arbitrary columns, the API is:
|
|
Joining using rowNames (the first API) is faster than joining using arbitrary columns.
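To make the idea of an SQL-type join on row names concrete, here is a minimal sketch that uses plain Pharo Dictionaries as stand-ins for two single-column tables. It is not the DataFrame join API: the variable names and the data are hypothetical and only illustrate which rows an inner join and a left outer join keep.

```smalltalk
"Illustration only: Dictionaries keyed by row name stand in for DataFrames."
| left right innerJoin leftJoin |
left := Dictionary new.
left at: 'r1' put: 10; at: 'r2' put: 20; at: 'r3' put: 30.
right := Dictionary new.
right at: 'r2' put: 200; at: 'r3' put: 300; at: 'r4' put: 400.

"Inner join: keep only the row names present on both sides."
innerJoin := Dictionary new.
left keysAndValuesDo: [ :rowName :leftValue |
	right at: rowName ifPresent: [ :rightValue |
		innerJoin at: rowName put: { leftValue. rightValue } ] ].

"Left outer join: keep every row of the left side, filling gaps with nil."
leftJoin := Dictionary new.
left keysAndValuesDo: [ :rowName :leftValue |
	leftJoin
		at: rowName
		put: { leftValue. (right at: rowName ifAbsent: [ nil ]) } ].
```

Here innerJoin ends up with the rows r2 and r3, while leftJoin keeps r1, r2 and r3, with nil in the second position for r1.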
JSON I/O
The library can read and write DataFrames from/to .json files. Under the hood, it uses the NeoJSON library to read/write JSON, and then converts it to DataFrames.
There are five JSON formats supported for reading and writing:
columns, in which the format is {column:{rowName:data, ...}, ...}
index, which has the format {rowName:{column:data, ...}, ...}
values, which is an array of arrays (rows)
split, in which the format is {columns:[columnNames], index:[rowNames], data:[data]}
records, which is an array of rows [{col1: data1, ...}, ...]
Out of these, split is the fastest format to read and write, since it stores a DataFrame in a form similar to its internal representation.
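To show how these layouts relate to each other, the following sketch parses a small table written in the split layout and rebuilds it in the records and index layouts. It uses only NeoJSON and plain collections, not the DataFrame reader or writer, and the table contents are hypothetical.

```smalltalk
"Illustration only: converts between JSON layouts with plain collections."
| splitJson parsed columns records index |
splitJson := '{"columns":["x","y"],"index":["r1","r2"],"data":[[1,2],[3,4]]}'.

"NeoJSON parses JSON objects into Dictionaries and JSON arrays into Arrays."
parsed := NeoJSONReader fromString: splitJson.
columns := parsed at: 'columns'.

"records layout: one {column: value} map per row."
records := (parsed at: 'data') collect: [ :row |
	| record |
	record := Dictionary new.
	columns withIndexDo: [ :name :i | record at: name put: (row at: i) ].
	record ].

"index layout: the same per-row maps, keyed by row name."
index := Dictionary new.
(parsed at: 'index') with: records do: [ :rowName :record |
	index at: rowName put: record ].

"Serialize both layouts back to JSON strings."
NeoJSONWriter toString: records.
NeoJSONWriter toString: index.
```

The values layout is simply the `data` array of the split layout on its own, so it loses the column and row names.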
There are three ways to write to JSON:
df writeTo: aFileRef using: (DataFrameJsonWriter new).
DataFrameJsonWriter new write: df to: aFileRef.
DataFrameJsonWriter new writeAsString: df.
Similarly, you can read JSON into a DataFrame object in either of these ways:
df := DataFrame readFrom: aFileRef using: (DataFrameJsonReader new).
df := DataFrameJsonReader new readFrom: aFileRef.
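Putting the write and read calls together, a round trip might look like the sketch below. The writer and reader calls are the ones listed above. The DataFrame construction (`withRows:`, `columnNames:`, `rowNames:`), the file name and the data are assumptions made for the example.

```smalltalk
"A minimal round-trip sketch; construction calls and file name are assumptions."
| df file jsonString restored |
df := DataFrame withRows: #( #( 1 2 ) #( 3 4 ) ).
df columnNames: #( 'x' 'y' ).
df rowNames: #( 'r1' 'r2' ).

"Hypothetical file in the image's working directory; start from a clean file."
file := 'example.json' asFileReference.
file ensureDelete.

"Write the DataFrame to disk, then read it back into a new object."
df writeTo: file using: DataFrameJsonWriter new.
restored := DataFrame readFrom: file using: DataFrameJsonReader new.

"Alternatively, get the JSON as a String without touching the file system."
jsonString := DataFrameJsonWriter new writeAsString: df.
```

Inspecting `restored` and `jsonString` in a Playground is a quick way to check which of the five layouts the writer produced.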