Today (June 24th) marks the completion of phase 1 of GSoC. Here is a summary of what I have been working on for the past few weeks:

PolyMath library

I spent most of my community bonding period, as well as the first week of phase 1, on the PolyMath library: exploring the codebase, fixing bugs, and adding features. My favorite part was adding the t-SNE implementation! Here is a peek at a couple of visualizations:

The visualizations are made with Roassal3; be sure to check it out at https://github.com/ObjectProfile/Roassal3.

Here is the list of issues I created:

  1. Math-TSNE is incomplete
  2. PMVector > < operators modify in-place
  3. PMVector sum is extremely slow
  4. PMStandardScalar fails when scale = 0

And here is the list of PRs that solved them:

  1. Implementing the t-SNE algorithm
  2. Fix PMVector comparison operators
  3. Removed PMVector sum for speedup
  4. Removed == method from PMVector
  5. Refactored PMTSNE to include steps
  6. Added visualization examples for PMTSNE
  7. Fixed ZeroDivideError in PMStandardizationScaler
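
To illustrate the zero-scale problem behind the last issue and PR, here is a minimal sketch in Python, not the PolyMath code. Standardizing a constant column gives a standard deviation of 0, so dividing by it fails. The function name `standardize` and the choice to replace a zero scale with 1 are assumptions made for illustration (scikit-learn's StandardScaler follows the same convention); the actual fix in PolyMath may differ.

```python
import math


def standardize(values):
    """Rescale values to zero mean and unit variance.

    Hypothetical helper for illustration only; not the PolyMath API.
    """
    n = len(values)
    mean = sum(values) / n
    # Population variance (divide by n).
    variance = sum((v - mean) ** 2 for v in values) / n
    scale = math.sqrt(variance)
    # A constant column has scale 0. Dividing by it would raise
    # ZeroDivisionError, so fall back to 1 (an assumed convention),
    # which maps every value of that column to 0.
    if scale == 0:
        scale = 1.0
    return [(v - mean) / scale for v in values]
```

With this guard, a constant column such as `[3, 3, 3]` comes back as all zeros instead of raising an error.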

DataFrame library

The main focus for phase 1 was getting the library to work with missing data: initializing with missing values, reading files, and providing methods to fill in the data.

Here is the list of issues I created:

  1. Handling missing data - collection of all issues related to missing data
  2. DataFrame select: fails when no rows are selected
  3. DataSeries does not support boolean operators with scalars
  4. Add JSON read/write support
  5. DataFrameInternal - Using OrderedCollection over Array2D
  6. DataFrame addRow does not consider key order

And here are the PRs that solve some of these issues:

  1. Boolean operators for DataSeries
  2. Added support for DataFrame init with missing values
  3. Added ability to remove nil from DataFrame and Series
  4. Added DataSeries fillNilsWith method
  5. Added method to convert missing values from files
  6. DataFrameTypeDetector now works with nil
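
The three kinds of missing-data handling these PRs cover (converting missing values read from files, removing them, and filling them) can be sketched in a few lines of Python. This is only an illustration of the idea, not the DataFrame library's API: the function names, the use of `None` in place of Pharo's `nil`, and the set of missing-value markers are all assumptions.

```python
def parse_missing(cells, missing_markers=("", "NA", "nil")):
    """Turn raw text cells read from a file into values, mapping
    missing-value markers to None. The marker set is an assumption."""
    return [None if cell.strip() in missing_markers else cell for cell in cells]


def remove_nils(series):
    """Return a copy of the series without missing values."""
    return [value for value in series if value is not None]


def fill_nils_with(series, replacement):
    """Return a copy of the series with missing values replaced."""
    return [replacement if value is None else value for value in series]


# Usage: one column of a CSV file with two missing entries.
raw = ["1.5", "NA", "2.0", ""]
column = parse_missing(raw)            # ["1.5", None, "2.0", None]
dropped = remove_nils(column)          # ["1.5", "2.0"]
filled = fill_nils_with(column, "0")   # ["1.5", "0", "2.0", "0"]
```

All three helpers return new lists rather than modifying their input, so the original column stays available for comparison.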

Next steps

Some of the features planned for phase 2 include implementing joins, adding JSON support, and a new DataSet library. I will be creating a detailed post after the first evaluation (first week of July). Be sure to star and follow DataFrame and PolyMath on GitHub for updates on the progress! :)