DataSet Reader
This page documents the DataSet class in MaTEx TensorFlow. As of MaTEx 0.6, datasets.py has been integrated into TensorFlow, so the class is available as tf.DataSet.
```python
data = tf.DataSet(data_name,
                  train_file=None,
                  validation_file=None,
                  test_file=None,
                  train_batch_size=None,
                  test_batch_size=None,
                  normalize=1.0,
                  valid_pct=0.0,
                  test_pct=0.0,
                  sep=',',
                  label_col=True)
```

The DataSet class reads MNIST, CIFAR, CSV, HDF5, and PNetCDF data in parallel. It takes the following arguments:
- data_name: "MNIST", "CIFAR10", "CIFAR100", "CSV", "HDF5", or "PNETCDF".
- train_file: Required for CSV, HDF5, or PNETCDF; the location of a file containing the training data.
- validation_file (optional): Location of a file containing validation data for CSV, HDF5, or PNETCDF.
- test_file (optional): Location of a file containing testing data for CSV, HDF5, or PNETCDF.
- train_batch_size (optional): Size of a training batch, for use with the next_train_batch method.
- test_batch_size (optional): Size of a testing batch, for use with the next_validation_batch and next_test_batch methods.
- normalize (optional): Float to divide all data entries by (default 1.0).
- valid_pct (optional): Float. For CIFAR or CSV data, places this fraction of the loaded data into a validation set.
- test_pct (optional): Float. For CIFAR or CSV data, places this fraction of the loaded data into a testing set.
- sep (optional): String. Separator for the CSV reader; usually ',' (the default) or '\t' for TSV files.
- label_col (optional): Bool. If False, the CSV reader will not look for a label column.
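As an aside, the normalize argument simply divides every data entry by the given float. The snippet below is a plain-Python illustration of that arithmetic, not MaTEx code:

```python
# Illustration only: the effect of normalize=255.0 on raw pixel values.
# MaTEx performs the equivalent division internally when loading data.
raw_pixels = [0, 51, 102, 255]
normalize = 255.0
scaled = [v / normalize for v in raw_pixels]
print(scaled)  # entries now lie in [0.0, 1.0]
```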
The DataSet class provides the following methods:

```python
[data, labels] = data.next_train_batch()
[data, labels] = data.next_validation_batch()
[data, labels] = data.next_test_batch()
```

None of these takes any arguments; each returns the next batch (of the size given to the DataSet constructor) of both data and labels as a list.
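To make the [data, labels] contract concrete, here is a minimal plain-Python stand-in for the batch methods. The sequential slicing and the wrap-around at the end of the data are assumptions of this sketch, not guarantees of the MaTEx implementation:

```python
# Illustration only: a stand-in for the next_train_batch contract.
class BatchSketch:
    def __init__(self, data, labels, batch_size):
        self.data, self.labels = data, labels
        self.batch_size = batch_size
        self._pos = 0

    def next_train_batch(self):
        # Return the next batch_size examples and their labels as a pair.
        # Wrapping back to the start of the data is an assumption here.
        start = self._pos
        end = start + self.batch_size
        self._pos = end % len(self.data)
        return self.data[start:end], self.labels[start:end]

ds = BatchSketch(list(range(6)), list("abcdef"), batch_size=2)
batch_x, batch_y = ds.next_train_batch()
print(batch_x, batch_y)  # [0, 1] ['a', 'b']
```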
Note that for CSV files, all rows must have the same number of elements, with the first element the label.
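That expected layout (label first, then a fixed number of features per row) can be illustrated with Python's standard csv module; the two-row sample below is a made-up example, not MaTEx code:

```python
import csv
import io

# Illustration only: every row has the same number of elements,
# with the label in the first column and features after it.
rows = "3,0.1,0.2,0.7\n8,0.9,0.0,0.1\n"
labels, features = [], []
for row in csv.reader(io.StringIO(rows)):
    labels.append(int(row[0]))                     # first element is the label
    features.append([float(v) for v in row[1:]])   # remaining elements are features
print(labels)  # [3, 8]
```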
Below we include several examples of use of our parallel dataset reader.
```python
import tensorflow as tf
```

Load MNIST with all entries divided by 255.0:

```python
mnist = tf.DataSet("MNIST", normalize=255.0)
```

Load CIFAR-10 data:

```python
cifar10 = tf.DataSet("CIFAR10")
```

Load CSV, HDF5, or PNetCDF data from files:

```python
csv = tf.DataSet("CSV", train_file='train.csv', test_file='test.csv')
hdf5 = tf.DataSet("HDF5", train_file='train.h5', test_file='test.h5')
pnetcdf = tf.DataSet("PNETCDF", train_file='train.nc', test_file='test.nc')
```

Load MNIST with batch sizes and fetch a training batch:

```python
mnist = tf.DataSet("MNIST", normalize=255.0, train_batch_size=64, test_batch_size=100)
batch_x, batch_y = mnist.next_train_batch()
```

The interface below applies to MaTEx releases before 0.6, where the DataSet class was imported from datasets.py:

```python
from datasets import DataSet
```
```python
data = DataSet(data_name,
               train_batch_size=None,
               test_batch_size=None,
               normalize=1.0,
               file1=None,
               file2=None,
               valid_pct=0.0,
               test_pct=0.0)
```

datasets.py provides the DataSet class, which reads MNIST, CIFAR, CSV, and PNetCDF data in parallel. It takes the following arguments:
- data_name: "MNIST", "CIFAR10", "CIFAR100", "CSV", or "PNETCDF".
- train_batch_size (optional): Size of a training batch, for use with the next_train_batch method.
- test_batch_size (optional): Size of a testing batch, for use with the next_validation_batch and next_test_batch methods.
- normalize (optional): Float to divide all data entries by (default 1.0).
- file1: Required for CSV or PNETCDF; the location of a file containing the training data.
- file2 (optional): Location of a file containing testing or validation data for PNETCDF.
- valid_pct (optional): Float. For data_name other than MNIST or PNETCDF, places this fraction of the loaded data into a validation set.
- test_pct (optional): Float. For data_name other than MNIST or PNETCDF, places this fraction of the loaded data into a testing set.
The DataSet class provides the following methods:

```python
[data, labels] = data.next_train_batch()
[data, labels] = data.next_validation_batch()
[data, labels] = data.next_test_batch()
```

None of these takes any arguments; each returns the next batch (of the size given to the DataSet constructor) of both data and labels as a list.
Note that for CSV files, all rows must have the same number of elements, with the first element the label.
Below we include several examples of use of our parallel dataset reader.
```python
from datasets import DataSet
```

Load MNIST with all entries divided by 255.0:

```python
mnist = DataSet("MNIST", normalize=255.0)
```

Load CIFAR-10 data:

```python
cifar10 = DataSet("CIFAR10")
```

Load CSV or PNetCDF data from files:

```python
csv = DataSet("CSV", file1='train.csv', file2='test.csv')
pnetcdf = DataSet("PNETCDF", file1='train.nc', file2='test.nc')
```

Load MNIST with batch sizes and fetch a training batch:

```python
mnist = DataSet("MNIST", normalize=255.0, train_batch_size=64, test_batch_size=100)
batch_x, batch_y = mnist.next_train_batch()
```