GSoC Community Bonding period: Exploring DataFrame, PolyMath and Roassal

I spent the last two weeks exploring PolyMath, DataFrame and Roassal. These three libraries were developed independently and serve different goals: PolyMath for scientific computing, DataFrame for data analysis, and Roassal for visualization. However, they work together cohesively, thanks to their class structure.

To demonstrate this, here is a piece of code utilising all three libraries:


"Iris plotting in Roassal"
"Load the iris data set from a CSV file into a DataFrame."
reader := DataFrameCsvReader new.
fileRef := 'iris.csv' asFileReference.
df := DataFrame readFrom: fileRef using: reader.

"Keep only the four numeric feature columns for clustering and PCA."
df_x := df columns: #('Sepal length' 'Sepal width' 'Petal length' 'Petal width').

"Cluster the rows into three groups with PolyMath's cluster finder."
dataServer := PMMemoryBasedDataServer new.
dataServer data: df_x.

finder := PMClusterFinder new: 3 server: dataServer type: PMEuclideanCluster.
finder minimumRelativeClusterSize: 0.01.
clusters := finder evaluate.

"Store the index of the nearest cluster for every row in a new 'Output' column."
y := DataSeries new name: 'Output'.
df withIndexDo: [ :row :index |
	y add: index->(finder indexOfNearestCluster: ({row at: 'Sepal length'. row at: 'Sepal width'. row at: 'Petal length'. row at: 'Petal width'} asPMVector)).
	].

df addColumn: y.

"Standardize the features, then project them onto the first two principal components."
m := PMMatrix rows: df_x.
m := (PMStandardizationScaler new) fitAndTransform: m.
pca := PMPrincipalComponentAnalyserJacobiTransformation new componentsNumber: 2.
pca fit: m.
reduced := pca transform: m.

"Translate the cluster indices in 'Output' into species names."
transformOutput := Dictionary newFrom: {
	3->'Iris-setosa' .
	2->'Iris-versicolor' .
	1->'Iris-virginica' .
}.
df column: 'Output' transform: [ :column |
	column collect: [ :number |
		transformOutput at: number.
	].
].

"Attach the two principal components as the plot coordinates."
df addColumn: (reduced atColumn: 1) named: 'PCA x'.
df addColumn: (reduced atColumn: 2) named: 'PCA y'.

"Plot one data set per correctly clustered species, plus one for the misclassified rows."
b := RTGrapher new.

ds_setosa := RTData new.
ds_setosa label: 'Iris setosa'.
ds_setosa dotShape circle color: Color red trans.
ds_setosa points: (df select: [:row | ((row at: #Type) = 'Iris-setosa') & ((row at: #Type) = (row at: #Output))]).
ds_setosa interaction popupText: [ :row | 'Actual: ', (row at: #Type) asString, '. Predicted: ', (row at: #Output) asString ].
ds_setosa x: [ :row | row at: 'PCA x' ].
ds_setosa y: [ :row | row at: 'PCA y' ].
b add: ds_setosa.


ds_versicolor := RTData new.
ds_versicolor label: 'Iris versicolor'.
ds_versicolor dotShape circle color: Color blue trans.
ds_versicolor points: (df select: [:row | ((row at: #Type) = 'Iris-versicolor') & ((row at: #Type) = (row at: #Output))]).
ds_versicolor interaction popupText: [ :row | 'Actual: ', (row at: #Type) asString, '. Predicted: ', (row at: #Output) asString ].
ds_versicolor x: [ :row | row at: 'PCA x' ].
ds_versicolor y: [ :row | row at: 'PCA y' ].
b add: ds_versicolor.


ds_virginica := RTData new.
ds_virginica label: 'Iris virginica'.
ds_virginica dotShape circle color: Color green trans.
ds_virginica points: (df select: [:row | ((row at: #Type) = 'Iris-virginica') & ((row at: #Type) = (row at: #Output))]).
ds_virginica interaction popupText: [ :row | 'Actual: ', (row at: #Type) asString, '. Predicted: ', (row at: #Output) asString ].
ds_virginica x: [ :row | row at: 'PCA x' ].
ds_virginica y: [ :row | row at: 'PCA y' ].
b add: ds_virginica.

ds_misclassified := RTData new.
ds_misclassified label: 'Misclassified'.
ds_misclassified dotShape circle color: Color black trans.
ds_misclassified points: (df select: [:row | (row at: #Type) ~= (row at: #Output)]).
ds_misclassified interaction popupText: [ :row | 'Actual: ', (row at: #Type) asString, '. Predicted: ', (row at: #Output) asString ].
ds_misclassified x: [ :row | row at: 'PCA x' ].
ds_misclassified y: [ :row | row at: 'PCA y' ].
b add: ds_misclassified.

b.

Here it is, exported from Roassal as HTML (go on, hover over the points!):

Note: I clustered first, then applied PCA. Exercise for the reader: try the opposite order!
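For anyone who wants a starting point, here is a rough sketch of the reverse order: reduce to two principal components first, then cluster the reduced points. It reuses `df_x` and the same PolyMath calls as the listing above. Two things are my assumptions rather than something tested here: that `PMMatrix` answers `numberOfRows` and `rowAt:` (with each row being a `PMVector`), and that the data server is happy with a plain collection of vectors. The names `points` and `labels` are made up for this sketch.

```smalltalk
"Sketch only: PCA first, then clustering. Assumes df_x from the listing above."
m := PMMatrix rows: df_x.
m := PMStandardizationScaler new fitAndTransform: m.
pca := PMPrincipalComponentAnalyserJacobiTransformation new componentsNumber: 2.
pca fit: m.
reduced := pca transform: m.

"Assumption: PMMatrix understands numberOfRows and rowAt:, and rowAt: answers a PMVector."
points := (1 to: reduced numberOfRows) collect: [ :i | reduced rowAt: i ].

"Assumption: the data server accepts any indexable collection of vectors."
dataServer := PMMemoryBasedDataServer new.
dataServer data: points.

"Same clustering setup as before, now on the 2-D points."
finder := PMClusterFinder new: 3 server: dataServer type: PMEuclideanCluster.
finder minimumRelativeClusterSize: 0.01.
finder evaluate.

"One cluster index per reduced point, in row order."
labels := points collect: [ :p | finder indexOfNearestCluster: p ].
```

From there, `labels` could be added to `df` as a column and plotted the same way as `Output`. The cluster indices may not line up with the same species as before, so the index-to-species dictionary would need checking.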

In the last week, I began implementing t-SNE for PolyMath. It is exciting to dissect the paper, working through the nitty-gritty details and translating them into code. By the end of next week, it will be the first paper I have implemented! I am also writing an accompanying post, which will explain the math behind the algorithm. That is one of the reasons this week’s post is short!

You can track the progress of the implementation at this link: t-SNE project board.