As part of GSoC, I have added a new library for fetching popular datasets in Pharo: https://github.com/AtharvaKhare/DataSet. It currently includes the following datasets:
- Boston house prices dataset
- Breast Cancer Wisconsin (Diagnostic) dataset
- Diabetes dataset
- Optical recognition of handwritten digits dataset
- Iris plants dataset
- MNIST testing data
- Wine recognition dataset
Requests for additional datasets can be posted on this issue.
Structure and Working of the library
The DataSetFiles repo holds the datasets in CSV format, and the DataSet library contains the code that fetches them and converts them into DataFrame objects.
To load a dataset, use DataSet loadXYZ, where XYZ stands for the name of the dataset.
This checks the filesystem to see whether the dataset is already available and, if it isn’t, downloads it using the DataSet downloadXYZ method. The downloaded files are stored in the data folder at the root of the repo's local copy on the filesystem.
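The check-then-download behaviour described above can be sketched in a Pharo playground as follows. This is only a minimal illustration, not the library's actual code: the ensureLocal block, the file name iris.csv, the URL, and the use of the working directory as the location of the data folder are all assumptions made for the example.

```smalltalk
"A sketch only: names, paths and the URL below are placeholders,
 not the DataSet library's real implementation."
| dataDir ensureLocal file |

"Assumed location of the data folder (here: inside the working directory)."
dataDir := FileSystem workingDirectory / 'data'.
dataDir ensureCreateDirectory.

"Answer the local file, running downloadBlock only when the file is missing."
ensureLocal := [ :fileName :downloadBlock |
	| target |
	target := dataDir / fileName.
	target exists ifFalse: [ downloadBlock value: target ].
	target ].

file := ensureLocal
	value: 'iris.csv'
	value: [ :target |
		"Placeholder URL; substitute the real location of the CSV file."
		ZnClient new
			url: 'https://example.com/iris.csv';
			downloadTo: target ].
```

Evaluating the last expression a second time skips the download, because the file already exists in the data folder.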
All the datasets can also be downloaded in advance; doing so fetches only the datasets that have not yet been downloaded.
A possible improvement to the library would be to store a list of the available datasets in DataSetFiles and to modify the DataSet library so that users can fetch that list dynamically and download whichever datasets they need. If you have thoughts or suggestions about this or any other improvement, email me or create an issue on GitHub!
2019-08-05 20:13 -07:00