GSoC 2019 Phase 2 progress

Phase 2 of GSoC is complete. Here are the project updates:

DataFrame library

PR#111 and other PRs (102, 103, 104 and 107) have been modified and merged. This means that all issues related to missing data have been fixed! 🎉

In addition, two major features have been added: join support and JSON I/O.

DataFrame joins

SQL-type joins are now possible between DataFrames. There are two APIs for this. For joins on rowNames, use the following:

df innerJoin: df2.
df outerJoin: df2.
df leftJoin: df2.
df rightJoin: df2.

For joins on arbitrary columns, the API is:

df innerJoin: df2 onLeft: 'LeftCol' onRight: 'RightCol'.
df outerJoin: df2 onLeft: 'LeftCol' onRight: 'RightCol'.
df leftJoin: df2 onLeft: 'LeftCol' onRight: 'RightCol'.
df rightJoin: df2 onLeft: 'LeftCol' onRight: 'RightCol'.

Joining on rowNames (the first API) is faster than joining on arbitrary columns.
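
To make the two APIs concrete, here is a minimal sketch on two tiny DataFrames. The join messages are the ones listed above; the construction messages (`withRows:`, `columnNames:`, `rowNames:`) and all the sample data are assumptions made for illustration, so check them against the version of the library you have loaded.

```smalltalk
"Assumption: DataFrame withRows:, columnNames: and rowNames: are available
 in the loaded version of the library. All data below is made up."
| people cities byRowName byColumn |
people := DataFrame withRows: #( ('Ann' 1) ('Bob' 2) ).
people columnNames: #('Name' 'CityId').
people rowNames: #('r1' 'r2').

cities := DataFrame withRows: #( (2 'Paris') (3 'Rome') ).
cities columnNames: #('Id' 'City').
cities rowNames: #('r2' 'r3').

"First API: rows are matched by rowName ('r2' is the only shared name)."
byRowName := people innerJoin: cities.

"Second API: rows are matched where people CityId equals cities Id."
byColumn := people leftJoin: cities onLeft: 'CityId' onRight: 'Id'.
```

With SQL semantics, the inner join should keep only the row named in both DataFrames, and the left join should keep every row of `people`. How cells without a match are filled is not covered here, so inspect the results rather than relying on this sketch.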

JSON I/O

The library can read DataFrames from and write them to .json files. Under the hood, it uses the NeoJSON library to read and write JSON, and converts between the parsed JSON and DataFrames.

Five JSON formats are supported for reading and writing:

columns: an object of the form {column:{rowName:data, ...}, ...}
index: an object of the form {rowName:{column:data, ...}, ...}
values: an array of arrays (rows)
split: an object of the form {columns:[columnNames], index:[rowNames], data:[data]}
records: an array of rows [{col1: data1, ...}, ...]

Of these, split is the fastest format to read and write, since it stores the DataFrame in a form similar to its internal representation.
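
To see how the five shapes differ, it can help to write one small table in each of them by hand. The sketch below does this for a made-up table with rows 'a' and 'b' and columns 'x' and 'y', then parses each string with NeoJSON only to confirm that it is well-formed JSON. The sample data is an assumption for illustration, and the snippet does not go through the DataFrame reader at all.

```smalltalk
"One made-up table (rows 'a', 'b'; columns 'x', 'y') in each of the five shapes."
| samples |
samples := {
    'columns' -> '{"x":{"a":1,"b":3},"y":{"a":2,"b":4}}'.
    'index'   -> '{"a":{"x":1,"y":2},"b":{"x":3,"y":4}}'.
    'values'  -> '[[1,2],[3,4]]'.
    'split'   -> '{"columns":["x","y"],"index":["a","b"],"data":[[1,2],[3,4]]}'.
    'records' -> '[{"x":1,"y":2},{"x":3,"y":4}]' }.

"Parse each sample with NeoJSON and print the resulting Smalltalk objects."
samples do: [ :each |
    Transcript
        show: each key;
        show: ' -> ';
        show: (NeoJSONReader fromString: each value) printString;
        cr ].
```

Written this way, the trade-offs are visible in the text itself: values carries neither rowNames nor column names, records carries no rowNames, and split names each column and row exactly once instead of repeating them for every cell.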

There are three ways to write to JSON:

df writeTo: aFileRef using: (DataFrameJsonWriter new).

DataFrameJsonWriter new write: df to: aFileRef.

DataFrameJsonWriter new writeAsString: df.

Similarly, there are two ways to read JSON into a DataFrame object:

df := DataFrame readFrom: aFileRef using: (DataFrameJsonReader new).

df := DataFrameJsonReader new readFrom: aFileRef.
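
Putting the writer and reader together, a round trip might look like the sketch below. The read and write messages are the ones shown above; the construction messages (`withRows:`, `columnNames:`, `rowNames:`) and the file name `example.json` are assumptions for illustration.

```smalltalk
"Assumptions: DataFrame withRows:, columnNames: and rowNames: exist in the
 loaded library; example.json is a hypothetical file in the working directory."
| df file restored |
df := DataFrame withRows: #( (1 2) (3 4) ).
df columnNames: #('x' 'y').
df rowNames: #('a' 'b').

file := 'example.json' asFileReference.

"Write the DataFrame to disk, then read it back into a new object."
df writeTo: file using: DataFrameJsonWriter new.
restored := DataFrame readFrom: file using: DataFrameJsonReader new.

"Look at the JSON text itself without writing a file."
Transcript show: (DataFrameJsonWriter new writeAsString: df); cr.
```

Inspecting `restored` next to `df` is a quick way to check that rowNames and column names survive the format you are using.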