One of the newest feature added to the DataFrame library is the support for joins. This post covers joining and appending dataframes.

Joining DataFrames

The library supports SQL-type joins: inner, outer, left, right.

Joining using rowNames

Joining dataframes using rowNames is faster vs an arbitary column, since rowNames are unique leading to faster construction of output dataframe.

Consider two dataframes with course names as rowName:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
df1 := DataFrame withRows: #(
	('RTH-01', '11:00 - 01:00')
	('RTH-02', '04:00 - 05:50')
	('THH-02', '09:00 - 10:30')
	)
	rowNames: #('CS-10' 'CS-11' 'CS-13')
	columnNames: #(Location Timing).

df2 := DataFrame withRows: #(
	('Prof. Foo' 50)
	('Prof. Bar' 45)
	)
	rowNames: #('CS-11' 'CS-12')
	columnNames: #(Instructor Capacity).

df1

Location
Timing
 CS-10  RTH-01  11:00 - 01:00
 CS-11  RTH-02  04:00 - 05:50
 CS-13  THH-02  09:00 - 10:30

df2

Instructor
Capacity
 CS-11  Prof. Foo
 50
 CS-12  Prof. Bar
 45

Inner join combines two dataframes based on their intersection (in this case, the rowNames). Doing df1 innerJoin: df2 would output intersecting rows between dataframes (CS-11) with columns combined.

df1 innerJoin: df2

Location
Timing
Instructor Capacity
 CS-11  RTH-02
 04:00 - 05:50
 Prof. Foo
 50

Outer join includes all the rows in the dataframe, and adds nil to the missing columns.

df1 outerJoin: df2

Location
Timing
Instructor Capacity
 CS-10  RTH-01
 11:00 - 01:00  nil
 nil
CS-11
  RTH-02 04:00 - 05:50
 Prof. Foo
 50
 CS-12   nil
 nil Prof. Bar
 45
 CS-13   THH-02  09:00 - 10:30
nil
 nil

Left join includes all the rows present in the left dataframe, while adding nil in places with missing data. CS-10, CS-11 and CS13 will be present in the output of df1 leftJoin: df2, with nil in column Instructor and Capacity for rows CS-10 and CS-13.

df1 leftJoin: df2

Location
Timing
Instructor Capacity
 CS-10  RTH-01
 11:00 - 01:00  nil
 nil
CS-11
  RTH-02 04:00 - 05:50
 Prof. Foo
 50
 CS-13   THH-02  09:00 - 10:30
nil
 nil

Right join is similar to left join, except it preserves the rows in the right dataframe.

df1 rightJoin: df2

Location
Timing
Instructor Capacity
CS-11
  RTH-02 04:00 - 05:50
 Prof. Foo
 50
 CS-12   nil
 nil Prof. Bar
 45

Joining using arbitary column

The API for joining arbitary column is:

1
2
3
4
df innerJoin: df2 onLeft: 'LeftCol' onRight: 'RightCol'.
df outerJoin: df2 onLeft: 'LeftCol' onRight: 'RightCol'.
df leftJoin: df2 onLeft: 'LeftCol' onRight: 'RightCol'.
df rightJoin: df2 onLeft: 'LeftCol' onRight: 'RightCol'.

Consider similar dataframes as before, except with course names being it’s own column:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
df1 := DataFrame withRows: #(
	('CS-10' 'RTH-01', '11:00 - 01:00')
	('CS-11' 'RTH-02', '04:00 - 05:50')
	('CS-13' 'THH-02', '09:00 - 10:30')
	)
	columnNames: #(Course Location Timing).

df2 := DataFrame withRows: #(
	('CS-11' 'Prof. Foo' 50)
	('CS-12' 'Prof. Bar' 45)
	)
	columnNames: #(Course Instructor Capacity).

Joining using left join would lead to:

df1 leftJoin: df2 onLeft: 'Course' onRight: 'CourseName'

Course
Location
Timing CourseName
Instructor Capacity
 1
CS-10
RTH-01
 11:00 - 01:00
 nil nil
nil
 2 CS-11  RTH-02 04:00 - 05:50
CS-11
 Prof. Foo
 50
 3 CS-13
 THH-02 09:00 - 10:30
nil
 nil  nil

Joining datafames with same column names

If the two dataframes have conflicting column names, it will be replaced by _x and _y. For instance, in the previous example, if both dataframes has column names Course, output dataframe would be:

df1 leftJoin: df2 onLeft: 'Course' onRight: 'Course'

Course_x
Location
Timing Course_y
Instructor Capacity
 1
CS-10
RTH-01
 11:00 - 01:00
 nil nil
nil
 2 CS-11  RTH-02 04:00 - 05:50
CS-11
 Prof. Foo
 50
 3 CS-13
 THH-02 09:00 - 10:30
nil
 nil  nil

Appending dataframes

Consider two dataframes:

df1
Course Location
Timing Instructor Capacity
 1
 CS-10  RTH-01  11:00 - 01:00  Prof. Foo
 50
 2  CS-11  RTH-02  04:00 - 05:50
 Prof. Bar
 50
 3  CS-12  THH-01  TBA  TBA  45


df2
Course Location
Timing Instructor Capacity
 4  CS-13  THH-02  09:00 - 10:30
 Prof. Baz
 30

You can append/concatenate them by:

1
appenDf := df1, df2.

There are a few caveats:

  1. Row names should be unique. Else, the concatenation operator will not work. You can use df1 rowNames: (1 to: df1 size) and df2 rowNames: (df1 size + 1 to: df1 size + df2 size) followed by df1, df2.
  2. Order of columns should be the same. If it is not, reconstruct the dataframe.