I have been accepted into Google Summer of Code by Pharo to work on the DataFrame library. Over the summer, I will add various features and documentation to the library. The mentors for this project are Oleksandr Zaitsev, Serge Stinckwich and Konrad Hinsen.

About Pharo

Pharo is an easy-to-learn, pure object-oriented programming language based on Smalltalk. It comes with its own integrated IDE, which makes browsing code, documentation and examples easier, and its inspector and built-in debugger allow for faster development. It also has various libraries such as Seaside (web development), PolyMath (scientific computing) and Roassal (visualization). Overall, it is a great language both for learning object-oriented programming and for building enterprise applications.

About the DataFrame library

The DataFrame library was first created as a GSoC 2017 project by Oleksandr, which introduced much of the functionality that makes it fit for data analysis. The DataFrame Booklet documents the current API. I have used this library as a base for the Whatsapp Analyzer project.

The Plan

1. Handling missing data

Advanced missing-data functionality will be added, such as detecting different representations of missing data (NA, nil, ?), replacing them, and reading files with incomplete data. This should also make it possible to detect the column types of a dataframe.
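To make the idea concrete, here is a minimal sketch of missing-value detection, replacement and column-type guessing. It is written in plain Python rather than Pharo, and it is not the DataFrame API: the token list and the names `MISSING_TOKENS`, `is_missing`, `replace_missing` and `infer_column_type` are assumptions made up for this illustration.

```python
# Hypothetical set of strings treated as "missing"; the real list may differ.
MISSING_TOKENS = {"", "NA", "nil", "?"}


def is_missing(value):
    """Return True if a raw cell should be treated as missing."""
    return value is None or str(value).strip() in MISSING_TOKENS


def replace_missing(column, replacement):
    """Return a copy of the column with every missing cell replaced."""
    return [replacement if is_missing(v) else v for v in column]


def infer_column_type(column):
    """Guess a column type from its non-missing cells only."""
    present = [str(v).strip() for v in column if not is_missing(v)]
    if not present:
        return "unknown"
    # Try the strictest parser first; fall back to text if none fits.
    for name, parser in (("int", int), ("float", float)):
        try:
            for cell in present:
                parser(cell)
            return name
        except ValueError:
            continue
    return "str"


# A column as it might arrive from a CSV file with gaps in it.
ages = ["23", "NA", "31", "?", None]
cleaned = replace_missing(ages, "0")
age_type = infer_column_type(ages)
```

Ignoring the missing cells before guessing the type is what lets a column like `ages` be recognised as numeric even though it contains "NA" and "?".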

2. Joins between dataframes

Merging dataframes using joins (left, right, inner) is crucial and will be supported by the end of the summer.
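For readers who have not used joins before, the three variants can be sketched in a few lines of plain Python. This is only an illustration of the semantics, not the planned Pharo implementation: tables are modelled as lists of dictionaries, and the function name `join` and its `how` parameter are assumptions borrowed from common dataframe tools.

```python
def join(left, right, key, how="inner"):
    """Join two tables (lists of dicts) on the column `key`.

    `how` is 'inner', 'left' or 'right'. If both tables share a non-key
    column name, the value from the second operand wins in this sketch.
    """
    if how == "right":
        # A right join is a left join with the operands swapped.
        return join(right, left, key, how="left")
    if how not in ("inner", "left"):
        raise ValueError("unsupported join type: " + how)

    # Index the right-hand rows by key so each lookup is cheap.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_columns = {c for row in right for c in row if c != key}

    result = []
    for row in left:
        matches = index.get(row[key], [])
        for match in matches:
            result.append({**row, **match})
        if not matches and how == "left":
            # Keep the unmatched row and fill the other side with None.
            result.append({**row, **{c: None for c in right_columns}})
    return result


people = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
cities = [{"id": 2, "city": "Lyon"}, {"id": 3, "city": "Kyiv"}]

inner = join(people, cities, "id", how="inner")
left = join(people, cities, "id", how="left")
right = join(people, cities, "id", how="right")
```

An inner join keeps only the keys present in both tables, while left and right joins keep every row of one side and fill the gaps on the other, which is also where the missing-data handling from the first item comes in.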

3. JSON import/export

JSON is frequently used for transmitting data over the network, and JSON import/export will make it possible to create and consume dataset API endpoints over the internet.
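As a rough picture of what import/export involves, the sketch below converts a column-oriented table to a JSON array of row objects and back, using Python's standard `json` module. The row-per-object layout and the helper names `to_json` and `from_json` are assumptions for this example; the format the library will actually use is not decided here.

```python
import json


def to_json(columns):
    """Serialize a table given as {column name: list of values} to JSON."""
    names = list(columns)
    rows = [dict(zip(names, values)) for values in zip(*columns.values())]
    return json.dumps(rows)


def from_json(text):
    """Parse a JSON array of row objects back into {column name: values}."""
    rows = json.loads(text)
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    # A key absent from a row becomes None, i.e. a missing value.
    return {name: [row.get(name) for row in rows] for name in names}


table = {"name": ["Ana", "Ben"], "age": [23, None]}
text = to_json(table)
restored = from_json(text)
```

Note that a missing value survives the round trip here as JSON `null`.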

4. Mathematical operations

Different operations on DataFrame and DataSeries are planned, such as correlation, covariance, cumulative operations (min, max, product, sum) and clip, along with operators such as >, <, >=, <=, mod and pow. This will bring flexibility to the library.
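A few of these operations are easy to pin down with small reference versions. The following is a plain-Python sketch on lists of numbers, not the Pharo messages that will be added; the function names are assumptions, and the covariance shown is the sample covariance (dividing by n - 1), which is one of two common conventions.

```python
from itertools import accumulate
from math import sqrt
import operator


def cumulative(values, op):
    """Running result of a binary operation, e.g. cumulative sum or min."""
    return list(accumulate(values, op))


def clip(values, lower, upper):
    """Limit every value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def covariance(xs, ys):
    """Sample covariance of two equally long series (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


running_sum = cumulative([1, 2, 3, 4], operator.add)
running_product = cumulative([1, 2, 3, 4], operator.mul)
running_min = cumulative([3, 1, 2], min)
clipped = clip([-2, 5, 12], 0, 10)
r = correlation([1, 2, 3], [2, 4, 6])
```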

5. Dataset fetcher

A new dataset fetcher will also be added to fetch popular datasets such as Iris, Boston and MNIST, making it easier to experiment with the library.

6. Documentation and tests

A good amount of documentation will be added, ranging from examples of the API to comments on the messages, along with additional tests. The aim is to make the library easy to use.

Detailed proposal

The detailed proposal has been uploaded here; it contains the timeline and the messages that will be implemented.

Next steps

The community bonding period lasts until May 27th. During this time, I am going to explore the PolyMath library and fix some issues, as well as refine the tasks and schedule for this summer.


You can track the progress of this project by clicking on the tag “GSoC progress” below, or view the monthly posts by clicking on “GSoC summary”.