This package represents a community effort to provide a common
interface for accessing common Machine Learning (ML) datasets. In
contrast to other data-related Julia packages, the focus of
MLDatasets.jl is specifically on downloading, unpacking, and
accessing benchmark datasets. Functionality for data processing
or visualization is provided only to the extent that it is
specific to a particular dataset.
This package is a part of the JuliaML ecosystem. Its
functionality is built on top of the package DataDeps.jl.
MLDatasets.jl is organized so that each dataset has its own
dedicated sub-module. Where possible, those sub-modules share
a common interface for interacting with the datasets. For example,
you can load the training set and the test set of the MNIST
database of handwritten digits using the following commands:
using MLDatasets
train_x, train_y = MNIST.traindata()
test_x, test_y = MNIST.testdata()
To load the data, the package looks for the necessary files in
various locations (see DataDeps.jl for more information on how
to configure such defaults). If the data can't be found in any of
those locations, the package will trigger a download dialog and
save the files to ~/.julia/datadeps/MNIST. To override this on a
case-by-case basis, you can specify a data directory directly via
traindata(dir = <directory>) and testdata(dir = <directory>).
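As a rough sketch of how the two options above might be combined in a script: the snippet below sets the `DATADEPS_ALWAYS_ACCEPT` environment variable, which DataDeps.jl reads to skip the interactive download prompt, and builds a directory path that could be passed as `dir`. The path `~/datasets/MNIST` and the variable name `mnist_dir` are assumptions chosen for illustration. The calls that need MLDatasets.jl and the data files are left as comments so the snippet runs on its own.

```julia
# Accept DataDeps download prompts automatically (useful in scripts and CI).
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

# Hypothetical location for a local copy of the MNIST files.
mnist_dir = joinpath(homedir(), "datasets", "MNIST")

# With MLDatasets installed, the directory can then be passed explicitly:
# using MLDatasets
# train_x, train_y = MNIST.traindata(dir = mnist_dir)
# test_x,  test_y  = MNIST.testdata(dir = mnist_dir)
```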
Check out the latest documentation
Additionally, you can make use of Julia's native docsystem.
The following example shows how to get additional information
on MNIST.traintensor within Julia's REPL:
?MNIST.traintensor
Each dataset has its own dedicated sub-module, so the documentation is organized the same way. Find below a list of available datasets and links to their documentation.
This package provides a variety of common benchmark datasets for the purpose of image classification.
Dataset | Classes | traintensor | trainlabels | testtensor | testlabels |
---|---|---|---|---|---|
MNIST | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
FashionMNIST | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
CIFAR-10 | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 |
CIFAR-100 | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2) |
SVHN-2 (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032 |
(*) Note that the SVHN-2 dataset provides an additional 531131 observations aside from the training and test sets.
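Many models expect one feature vector per observation rather than an image tensor. The following is a minimal sketch of flattening a tensor with the layouts listed in the table above into a feature matrix. To stay self-contained it uses a randomly generated stand-in array instead of `MNIST.traintensor()`; the names `fake_tensor` and `flatten_images` are hypothetical and not part of MLDatasets.jl.

```julia
# Stand-in for a small batch with the MNIST layout (28 x 28 x observations).
# With the package installed, MNIST.traintensor() would provide the real data.
fake_tensor = rand(Float32, 28, 28, 5)

# Collapse all leading dimensions into one, keeping the last dimension
# (the observation index) intact.
flatten_images(x::AbstractArray) = reshape(x, :, size(x, ndims(x)))

features = flatten_images(fake_tensor)
@show size(features)
```

Because the helper only relies on the observation index being the last dimension, the same call should also work for the four-dimensional CIFAR and SVHN-2 layouts.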
The PTBLM dataset consists of Penn Treebank sentences for
language modeling, available from tomsercu/lstm. Unknown
words are replaced with <unk> so that the total vocabulary size
becomes 10000.
This is the first sentence of the PTBLM dataset.
x, y = PTBLM.traindata()
x[1]
> ["no", "it", "was", "n't", "black", "monday"]
y[1]
> ["it", "was", "n't", "black", "monday", "<eos>"]
where MLDatasets appends the special word <eos> to the end of
each sequence in y.
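The relationship between `x` and `y` shown above can be sketched in plain Julia: each target sequence is the input shifted by one word, with `<eos>` appended. The helper `shift_targets` is a hypothetical name used only for illustration and is not claimed to be how the package builds `y` internally.

```julia
# Hypothetical helper, not part of MLDatasets.jl: build the target sequence
# by dropping the first word and appending the end-of-sentence marker.
shift_targets(x::Vector{String}) = vcat(x[2:end], "<eos>")

x1 = ["no", "it", "was", "n't", "black", "monday"]
y1 = shift_targets(x1)
@show y1
```

For a language model this means the word at position `i` of `y1` is the prediction target for the word at position `i` of `x1`.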
The UD_English dataset (Universal Dependencies English Web Treebank) is an annotated corpus of morphological features, POS tags, and syntactic trees. The dataset follows the CoNLL-style format.
traindata = UD_English.traindata()
devdata = UD_English.devdata()
testdata = UD_English.testdata()
Dataset | Train x | Train y | Test x | Test y |
---|---|---|---|---|
PTBLM | 42068 | 42068 | 3761 | 3761 |
UD_English | 12543 | - | 2077 | - |
The docsystem works the same way for the helper functions of each
sub-module. For example, to get additional information on
MNIST.convert2image within Julia's REPL:
?MNIST.convert2image
convert2image(array) -> Array{Gray}
Convert the given MNIST horizontal-major tensor (or feature matrix) to a vertical-major Colorant array. The values are also color corrected according to
the website's description, which means that the digits are black on a white background.
julia> MNIST.convert2image(MNIST.traintensor()) # full training dataset
28×28×60000 Array{Gray{N0f8},3}:
[...]
julia> MNIST.convert2image(MNIST.traintensor(1)) # first training image
28×28 Array{Gray{N0f8},2}:
[...]
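To make the two steps in the docstring concrete, here is a rough sketch of the underlying array operations on plain numbers: swapping the first two dimensions (horizontal-major to vertical-major) and inverting the intensities. It works on ordinary floating-point arrays instead of `Gray` values, and the helper `to_vertical_major_inverted` is a hypothetical name; this is an illustration of the idea, not the package's implementation.

```julia
# Hypothetical helper: swap the first two dimensions (horizontal-major to
# vertical-major) and invert intensities so that bright pixels become dark.
function to_vertical_major_inverted(x::AbstractArray{<:Real,3})
    return 1 .- permutedims(x, (2, 1, 3))
end

# Tiny 2x3 "image" with a single bright pixel, one observation.
x = zeros(Float32, 2, 3, 1)
x[1, 3, 1] = 1.0f0

img = to_vertical_major_inverted(x)
@show size(img)      # dimensions 1 and 2 are swapped
@show img[3, 1, 1]   # the formerly bright pixel after inversion
```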
To install MLDatasets.jl, start up Julia and type the following
code snippet into the REPL. It makes use of the native Julia
package manager.
import Pkg
Pkg.add("MLDatasets")
This code is free to use under the terms of the MIT license.