DataSet Reader
This page documents the DataSet class in MaTEx TensorFlow. As of MaTEx 0.6, datasets.py has been integrated into TensorFlow, so the class is available as tf.DataSet.
```python
data = tf.DataSet(data_name,
                  train_file=None,
                  validation_file=None,
                  test_file=None,
                  train_batch_size=None,
                  test_batch_size=None,
                  normalize=1.0,
                  valid_pct=0.0,
                  test_pct=0.0,
                  sep=',',
                  label_col=True)
```

The DataSet class reads MNIST, CIFAR, CSV, HDF5, and PNetCDF data in parallel. It takes the following arguments:
- data_name: "MNIST", "CIFAR10", "CIFAR100", "CSV", "HDF5", or "PNETCDF".
- train_file: Required for CSV, HDF5, or PNETCDF; the location of a file containing the training data.
- validation_file (optional): Location of a file containing validation data for CSV, HDF5, or PNETCDF.
- test_file (optional): Location of a file containing testing data for CSV, HDF5, or PNETCDF.
- train_batch_size (optional): Size of a training batch, for use with the next_train_batch method.
- test_batch_size (optional): Size of a testing batch, for use with the next_validation_batch and next_test_batch methods.
- normalize (optional): Float to divide all data entries by (default 1.0).
- valid_pct (optional): Float. For CIFAR or CSV data, places this fraction of the loaded data into a validation set.
- test_pct (optional): Float. For CIFAR or CSV data, places this fraction of the loaded data into a testing set.
- sep (optional): String. Separator for the CSV reader; usually ',' (the default) or '\t' for TSV files.
- label_col (optional): Bool. If False, the CSV reader will not look for a label column.
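As an aside, the normalize argument simply divides every data entry by the given float. The snippet below is a plain-Python illustration of that arithmetic, not MaTEx code:

```python
# Illustration only: the effect of normalize=255.0 on raw pixel values.
# MaTEx performs the equivalent division internally when loading data.
raw_pixels = [0, 51, 102, 255]
normalize = 255.0
scaled = [v / normalize for v in raw_pixels]
print(scaled)  # entries now lie in [0.0, 1.0]
```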
The DataSet class provides the following methods:

```python
[data, labels] = data.next_train_batch()
[data, labels] = data.next_validation_batch()
[data, labels] = data.next_test_batch()
```

None of these takes any arguments; each returns the next batch (of the size given to the DataSet constructor) of both data and labels as a list.
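To make the [data, labels] contract concrete, here is a minimal plain-Python stand-in for the batch methods. The sequential slicing and the wrap-around at the end of the data are assumptions of this sketch, not guarantees of the MaTEx implementation:

```python
# Illustration only: a stand-in for the next_train_batch contract.
class BatchSketch:
    def __init__(self, data, labels, batch_size):
        self.data, self.labels = data, labels
        self.batch_size = batch_size
        self._pos = 0

    def next_train_batch(self):
        # Return the next batch_size examples and their labels as a pair.
        # Wrapping back to the start of the data is an assumption here.
        start = self._pos
        end = start + self.batch_size
        self._pos = end % len(self.data)
        return self.data[start:end], self.labels[start:end]

ds = BatchSketch(list(range(6)), list("abcdef"), batch_size=2)
batch_x, batch_y = ds.next_train_batch()
print(batch_x, batch_y)  # [0, 1] ['a', 'b']
```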
Note that for CSV files, all rows must have the same number of elements, with the first element the label.
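That expected layout (label first, then a fixed number of features per row) can be illustrated with Python's standard csv module; the two-row sample below is a made-up example, not MaTEx code:

```python
import csv
import io

# Illustration only: every row has the same number of elements,
# with the label in the first column and features after it.
rows = "3,0.1,0.2,0.7\n8,0.9,0.0,0.1\n"
labels, features = [], []
for row in csv.reader(io.StringIO(rows)):
    labels.append(int(row[0]))                     # first element is the label
    features.append([float(v) for v in row[1:]])   # remaining elements are features
print(labels)  # [3, 8]
```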
Below we include several examples of use of our parallel dataset reader.
```python
import tensorflow as tf
```

Load MNIST with all entries divided by 255.0:

```python
mnist = tf.DataSet("MNIST", normalize=255.0)
```

Load CIFAR-10 data:

```python
cifar10 = tf.DataSet("CIFAR10")
```

Load CSV, HDF5, or PNetCDF data from files:

```python
csv = tf.DataSet("CSV", train_file='train.csv', test_file='test.csv')
hdf5 = tf.DataSet("HDF5", train_file='train.h5', test_file='test.h5')
pnetcdf = tf.DataSet("PNETCDF", train_file='train.nc', test_file='test.nc')
```

Load MNIST with batch sizes and fetch a training batch:

```python
mnist = tf.DataSet("MNIST", normalize=255.0, train_batch_size=64, test_batch_size=100)
batch_x, batch_y = mnist.next_train_batch()
```

The interface below applies to MaTEx releases before 0.6, where the DataSet class was imported from datasets.py:

```python
from datasets import DataSet
```
```python
data = DataSet(data_name,
               train_batch_size=None,
               test_batch_size=None,
               normalize=1.0,
               file1=None,
               file2=None,
               valid_pct=0.0,
               test_pct=0.0)
```

datasets.py provides the DataSet class, which reads MNIST, CIFAR, CSV, and PNetCDF data in parallel. It takes the following arguments:
- data_name: "MNIST", "CIFAR10", "CIFAR100", "CSV", or "PNETCDF".
- train_batch_size (optional): Size of a training batch, for use with the next_train_batch method.
- test_batch_size (optional): Size of a testing batch, for use with the next_validation_batch and next_test_batch methods.
- normalize (optional): Float to divide all data entries by (default 1.0).
- file1: Required for CSV or PNETCDF; the location of a file containing the training data.
- file2 (optional): Location of a file containing testing or validation data for PNETCDF.
- valid_pct (optional): Float. For data_name other than MNIST or PNETCDF, places this fraction of the loaded data into a validation set.
- test_pct (optional): Float. For data_name other than MNIST or PNETCDF, places this fraction of the loaded data into a testing set.
The DataSet class provides the following methods:

```python
[data, labels] = data.next_train_batch()
[data, labels] = data.next_validation_batch()
[data, labels] = data.next_test_batch()
```

None of these takes any arguments; each returns the next batch (of the size given to the DataSet constructor) of both data and labels as a list.
Note that for CSV files, all rows must have the same number of elements, with the first element the label.
Below we include several examples of use of our parallel dataset reader.
```python
from datasets import DataSet
```

Load MNIST with all entries divided by 255.0:

```python
mnist = DataSet("MNIST", normalize=255.0)
```

Load CIFAR-10 data:

```python
cifar10 = DataSet("CIFAR10")
```

Load CSV or PNetCDF data from files:

```python
csv = DataSet("CSV", file1='train.csv', file2='test.csv')
pnetcdf = DataSet("PNETCDF", file1='train.nc', file2='test.nc')
```

Load MNIST with batch sizes and fetch a training batch:

```python
mnist = DataSet("MNIST", normalize=255.0, train_batch_size=64, test_batch_size=100)
batch_x, batch_y = mnist.next_train_batch()
```