4 minutes
Joins in Pharo’s DataFrame library
One of the newest feature added to the DataFrame library is the support for joins. This post covers joining and appending dataframes.
Joining DataFrames
The library supports SQL-type joins: inner, outer, left, right.
Joining using rowNames
Joining dataframes using rowNames is faster vs an arbitary column, since rowNames are unique leading to faster construction of output dataframe.
Consider two dataframes with course names as rowName:
|
|
Location |
Timing |
|
---|---|---|
CS-10 | RTH-01 | 11:00 - 01:00 |
CS-11 | RTH-02 | 04:00 - 05:50 |
CS-13 | THH-02 | 09:00 - 10:30 |
Instructor |
Capacity |
|
---|---|---|
CS-11 | Prof. Foo |
50 |
CS-12 | Prof. Bar |
45 |
Inner join combines two dataframes based on their intersection (in this case, the rowNames). Doing df1 innerJoin: df2
would output intersecting rows between dataframes (CS-11) with columns combined.
Location |
Timing |
Instructor | Capacity | |
---|---|---|---|---|
CS-11 | RTH-02 |
04:00 - 05:50 |
Prof. Foo |
50 |
Outer join includes all the rows in the dataframe, and adds nil
to the missing columns.
Location |
Timing |
Instructor | Capacity | |
---|---|---|---|---|
CS-10 | RTH-01 |
11:00 - 01:00 | nil |
nil |
CS-11 |
RTH-02 | 04:00 - 05:50 |
Prof. Foo |
50 |
CS-12 | nil |
nil | Prof. Bar |
45 |
CS-13 | THH-02 | 09:00 - 10:30 |
nil |
nil |
Left join includes all the rows present in the left dataframe, while adding nil
in places with missing data. CS-10, CS-11 and CS13 will be present in the output of df1 leftJoin: df2
, with nil
in column Instructor and Capacity for rows CS-10 and CS-13.
Location |
Timing |
Instructor | Capacity | |
---|---|---|---|---|
CS-10 | RTH-01 |
11:00 - 01:00 | nil |
nil |
CS-11 |
RTH-02 | 04:00 - 05:50 |
Prof. Foo |
50 |
CS-13 | THH-02 | 09:00 - 10:30 |
nil |
nil |
Right join is similar to left join, except it preserves the rows in the right dataframe.
Location |
Timing |
Instructor | Capacity | |
---|---|---|---|---|
CS-11 |
RTH-02 | 04:00 - 05:50 |
Prof. Foo |
50 |
CS-12 | nil |
nil | Prof. Bar |
45 |
Joining using arbitary column
The API for joining arbitary column is:
|
|
Consider similar dataframes as before, except with course names being it’s own column:
|
|
Joining using left join would lead to:
Course |
Location |
Timing | CourseName |
Instructor | Capacity |
|
---|---|---|---|---|---|---|
1 |
CS-10 |
RTH-01 |
11:00 - 01:00 |
nil | nil |
nil |
2 | CS-11 | RTH-02 | 04:00 - 05:50 |
CS-11 |
Prof. Foo |
50 |
3 | CS-13 |
THH-02 | 09:00 - 10:30 |
nil |
nil | nil |
Joining datafames with same column names
If the two dataframes have conflicting column names, it will be replaced by _x
and _y
. For instance, in the previous example, if both dataframes has column names Course
, output dataframe would be:
Course_x |
Location |
Timing | Course_y |
Instructor | Capacity |
|
---|---|---|---|---|---|---|
1 |
CS-10 |
RTH-01 |
11:00 - 01:00 |
nil | nil |
nil |
2 | CS-11 | RTH-02 | 04:00 - 05:50 |
CS-11 |
Prof. Foo |
50 |
3 | CS-13 |
THH-02 | 09:00 - 10:30 |
nil |
nil | nil |
Appending dataframes
Consider two dataframes:
Course | Location |
Timing | Instructor | Capacity | |
---|---|---|---|---|---|
1 |
CS-10 | RTH-01 | 11:00 - 01:00 | Prof. Foo |
50 |
2 | CS-11 | RTH-02 | 04:00 - 05:50 |
Prof. Bar |
50 |
3 | CS-12 | THH-01 | TBA | TBA | 45 |
Course | Location |
Timing | Instructor | Capacity | |
---|---|---|---|---|---|
4 | CS-13 | THH-02 | 09:00 - 10:30 |
Prof. Baz |
30 |
You can append/concatenate them by:
|
|
There are a few caveats:
- Row names should be unique. Else, the concatenation operator will not work. You can use
df1 rowNames: (1 to: df1 size)
anddf2 rowNames: (df1 size + 1 to: df1 size + df2 size)
followed bydf1, df2
. - Order of columns should be the same. If it is not, reconstruct the dataframe.