GSoC 2019: Extending DataFrame library for Pharo Consortium
I have been accepted into Google Summer of Code by Pharo to work on the DataFrame library. Over the summer, I will add various features and documentation to the library. The mentors for this project are Oleksandr Zaitsev, Serge Stinckwich and Konrad Hinsen.
About Pharo
Pharo is an easy-to-learn, pure object-oriented programming language based on Smalltalk. It comes with its own integrated IDE, which makes browsing code, documentation and examples easier, and its inspector and built-in debugger allow for faster development. It also has various libraries such as Seaside (web development), PolyMath (scientific computing) and Roassal (visualization). Overall, it is a great language for learning object-oriented programming as well as for building enterprise applications.
About DataFrame library
The DataFrame library was first created as a GSoC 2017 project by Oleksandr, which introduced multiple features that make it fit for data analysis. The DataFrame Booklet documents the current API. I have used this library as a base for the WhatsApp Analyzer project.
The Plan
1. Handling missing data
Advanced missing data functionality will be added, such as detecting different representations of missing data (NA, nil, ?), replacing them, and reading files with incomplete data. This should also enable column-type detection for dataframes.
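To make the idea concrete, here is a minimal sketch of missing-value detection, replacement and column-type guessing. It is written in plain Python purely as an illustration and is not the DataFrame library's API: the marker set and the names `is_missing`, `replace_missing` and `infer_column_type` are all assumptions made for this example, and values are assumed to be strings as they would come out of a text file.

```python
# Hypothetical set of textual markers that a data file might use for "no value".
MISSING_MARKERS = {"NA", "nil", "?", ""}


def is_missing(value):
    """Return True if value is None or one of the textual missing markers."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip() in MISSING_MARKERS


def replace_missing(column, replacement):
    """Return a copy of the column with every missing entry replaced."""
    return [replacement if is_missing(v) else v for v in column]


def infer_column_type(column):
    """Guess 'int', 'float' or 'str' from the non-missing string values."""
    present = [v for v in column if not is_missing(v)]
    if not present:
        return "str"  # nothing to inspect, fall back to the most general type
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in present:
                caster(v)
            return name
        except (TypeError, ValueError):
            continue  # at least one value did not parse, try the next type
    return "str"
```

Ignoring missing entries before guessing the type is what lets a column such as `["1", "NA", "3"]` still be recognised as numeric, which is why the two features are connected.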
2. Joins between dataframes
Merging dataframes using joins (left, right, inner) is crucial and will be supported by the end of the summer.
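As a rough illustration of what the three join types mean, the following plain-Python sketch joins two tables represented as lists of row dictionaries. It is not how the DataFrame library implements or names joins: the function `join`, its `how` parameter and the sample data in the usage note are assumptions for this example, and non-key column names are assumed not to overlap.

```python
def join(left, right, key, how="inner"):
    """Join two lists of row dicts on `key`; `how` is 'inner', 'left' or 'right'."""
    if how not in ("inner", "left", "right"):
        raise ValueError("how must be 'inner', 'left' or 'right'")
    if how == "right":
        # A right join is a left join with the two tables swapped.
        left, right = right, left
        how = "left"

    # Index the second table by key so each lookup is a dictionary access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    other_columns = {c for row in right for c in row if c != key}

    result = []
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            for match in matches:
                result.append({**match, **row})
        elif how == "left":
            # Keep the unmatched row and fill the other table's columns with None.
            padded = {c: None for c in other_columns}
            padded.update(row)
            result.append(padded)
    return result
```

With `people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]` and `cities = [{"id": 2, "city": "Oslo"}, {"id": 3, "city": "Rome"}]`, an inner join on `"id"` keeps only Bob, a left join also keeps Ann with an empty city, and a right join also keeps Rome with an empty name.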
3. JSON import/export
JSON is frequently used for transmitting data over the network, and JSON import/export functionality will make it possible to create and consume dataset API endpoints over the internet.
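A small sketch of what such a round trip could look like, using Python's standard `json` module on a column-oriented table. The record-per-row layout and the helper names `columns_to_json` and `json_to_columns` are assumptions chosen for illustration; the format the library will actually use is not decided here. Every record is assumed to carry the same keys.

```python
import json


def columns_to_json(columns):
    """Serialise a table given as {column name: list of values} into JSON records."""
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    rows = [{name: columns[name][i] for name in names} for i in range(n_rows)]
    return json.dumps(rows)


def json_to_columns(text):
    """Parse a JSON array of records back into {column name: list of values}."""
    columns = {}
    for row in json.loads(text):
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns
```

Missing values survive the round trip because `None` is written as JSON `null` and read back as `None`.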
4. Mathematical operations
Different operations on DataFrame and DataSeries are planned, such as correlation, covariance, cumulative operations (min, max, product, sum), clip, and operators such as >, <, >=, <=, mod and pow. This will bring flexibility to the library.
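For readers unfamiliar with some of these operations, here is a minimal plain-Python sketch of cumulative operations, clipping, covariance and correlation on a list of numbers. The function names and signatures are assumptions for this illustration and do not describe the messages that will be added to DataSeries; covariance is the sample covariance (dividing by n - 1), which is one common convention.

```python
import math
import operator
from itertools import accumulate


def cumulative(values, kind="sum"):
    """Running sum, product, min or max of a sequence."""
    ops = {"sum": operator.add, "product": operator.mul, "min": min, "max": max}
    return list(accumulate(values, ops[kind]))


def clip(values, lower, upper):
    """Limit every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))
```

For example, `cumulative([3, 1, 4, 1, 5], "max")` gives `[3, 3, 4, 4, 5]`, and `clip([-2, 0, 7], 0, 5)` gives `[0, 0, 5]`.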
5. Dataset fetcher
A new dataset fetcher will also be added to fetch popular datasets such as Iris, Boston and MNIST, making it easier to experiment with the library.
6. Documentation and tests
A good amount of documentation will be added, ranging from API examples to comments on individual messages, along with additional tests. The aim is to make the library easy to use.
Detailed proposal
The detailed proposal has been uploaded here; it contains the timeline and the messages that will be implemented.
Next steps
The community bonding period runs until May 27th. During that time, I am going to explore the PolyMath library and fix some issues, as well as refine the tasks and schedule for this summer.
You can track the progress of this project by clicking on the tag “GSoC progress” below, or view the monthly posts by clicking on “GSoC summary”.