As part of GSoC, I have added a new library for fetching popular datasets in Pharo: https://github.com/AtharvaKhare/DataSet. It currently includes the following datasets:

  1. Boston house prices dataset
  2. Breast cancer Wisconsin (diagnostic) dataset
  3. Diabetes dataset
  4. Optical recognition of handwritten digits dataset
  5. Iris plants dataset
  6. MNIST testing data
  7. Wine recognition dataset

Requests for additional datasets can be posted on this issue.

Structure and Working of the library

The DataSetFiles repo holds the datasets in CSV format, and the DataSet library contains the code that fetches them and converts them into DataFrame objects.

To load a dataset, use DataSet loadXYZ, where XYZ is the name of the dataset. For example:

df := DataSet loadBoston.

This checks the filesystem to see whether the dataset is already available and, if it is not, downloads it using the corresponding DataSet downloadXYZ method. Downloaded files are stored in the data folder at the root of the repository's local copy on the filesystem.
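The check-then-download behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not the library's actual code: the directory, the file name, the ensureDownloaded block, and the example URL in the comment are all assumptions, and an in-memory filesystem stands in for the real data folder so the snippet can be run in a Playground without touching the disk or the network.

```smalltalk
"Sketch only: the directory, file name, and contents below are placeholders,
 not the paths or data the DataSet library actually uses."
| dataDir file ensureDownloaded |
dataDir := FileSystem memory root / 'data'.   "in-memory, so nothing is left on disk"
dataDir ensureCreateDirectory.
file := dataDir / 'boston.csv'.

"Run the fetch block only when the file is missing, then answer the file."
ensureDownloaded := [ :aFile :fetchBlock |
    aFile exists ifFalse: [ fetchBlock value: aFile ].
    aFile ].

"A real fetch block might use Zinc instead, for example:
 [ :f | ZnClient new url: 'https://example.com/boston.csv'; downloadTo: f ]
 (the URL here is hypothetical)."
ensureDownloaded
    value: file
    value: [ :f | f writeStreamDo: [ :stream | stream nextPutAll: 'crim,zn,indus' ] ].
```

Calling ensureDownloaded a second time with the same file skips the fetch block, which is the same caching behaviour that makes repeated DataSet loadXYZ calls cheap.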

To download all the datasets in advance, use:

DataSet downloadAll.

This downloads the datasets that have not yet been downloaded.

Future work

A possible improvement to the library would be to store a list of datasets in DataSetFiles and modify the DataSet library so that users can dynamically fetch the list of available datasets and download whichever they need. If you have thoughts or suggestions regarding this or any other improvements, email me or create an issue on GitHub!
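To make the idea a little more concrete, here is a minimal sketch of how such a list might be consumed. Everything in it is hypothetical: no index file exists in DataSetFiles yet, and the "name,fileName" line format and the variable names are assumptions made purely for illustration.

```smalltalk
"Hypothetical index format: one 'name,fileName' pair per line.
 In practice this text would be fetched from DataSetFiles."
| indexText available |
indexText := 'boston,boston.csv
iris,iris.csv
wine,wine.csv'.

"Build a lookup from dataset name to the file that should be downloaded."
available := Dictionary new.
indexText lines do: [ :line |
    | parts |
    parts := line substrings: ','.
    available at: parts first put: parts second ].

"The names a user could choose from, in alphabetical order."
available keys sorted.
```

With a lookup like this, a single generic load method could replace the per-dataset loadXYZ methods, at the cost of one extra request to fetch the index.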