Fast Neural Network (FastNN) is a distributed neural network library based on the PAISoar framework. FastNN includes common neural networks, such as Inception, Residual Networks (ResNet), and Visual Geometry Group (VGG), and plans to release more advanced models in the future. FastNN is integrated into the Machine Learning Designer module of Platform for AI (PAI). You can use FastNN in the PAI console.
GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.
Prepare datasets
To help you get started with FastNN in the PAI console, the CIFAR-10, MNIST, and flowers datasets have been downloaded, converted into TFRecord files, and stored in Object Storage Service (OSS). You can access the datasets by using the Read Table or OSS Data Synchronization component of PAI. The following table describes the datasets and their OSS storage paths.
| Dataset | Number of classes in the dataset | Number of samples in the training dataset | Number of samples in the test dataset | Storage path |
| --- | --- | --- | --- | --- |
| MNIST | 10 | 60000 | 10000 | |
| CIFAR-10 | 10 | 50000 | 10000 | |
| flowers | 5 | 3320 | 350 | |
FastNN can read data that is stored in TFRecord files. You can use the TFRecordDataset class to build dataset pipelines for model training, which reduces the time required for data preprocessing. However, FastNN does not support fine-grained data partitioning. To ensure that data is evenly distributed among workers, we recommend that you follow these rules (see the sketch after this list):
Each TFRecord file contains an equal number of samples.
Each worker processes an equal number of TFRecord files.
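For example, the following sketch (an illustration, not part of FastNN) shows one way to write a dataset into TFRecord shards that each contain the same number of samples, so that every worker can be assigned the same number of files:

import tensorflow as tf

# A minimal sketch (not part of FastNN): write num_shards TFRecord files that
# each contain exactly the same number of samples.
def write_even_shards(examples, num_shards, prefix):
    # examples is assumed to be a list of tf.train.Example protos.
    # Drop the remainder so that every shard holds the same sample count.
    samples_per_shard = len(examples) // num_shards
    for shard in range(num_shards):
        path = '%s_%05d.tfrecord' % (prefix, shard)
        with tf.python_io.TFRecordWriter(path) as writer:
            start = shard * samples_per_shard
            for example in examples[start:start + samples_per_shard]:
                writer.write(example.SerializeToString())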
If your dataset is stored in TFRecord files, you can download the FastNN code and use the sample files in the datasets directory, such as cifar10.py, mnist.py, and flowers.py, to build dataset pipelines. The following example uses the CIFAR-10 dataset.
The features in the CIFAR-10 dataset are in the following format:
features={
'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
'image/format': tf.FixedLenFeature((), tf.string, default_value='png'),
'image/class/label': tf.FixedLenFeature(
[], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
}
In the datasets directory, create a file named cifar10.py for data parsing and edit the file.
"""Provides data for the Cifar10 dataset. The dataset scripts used to create the dataset can be found at: datasets/download_and_covert_data/download_and_convert_cifar10.py """ from __future__ import division from __future__ import print_function import tensorflow as tf """Expect func_name is 'parse_fn' """ def parse_fn(example): with tf.device("/cpu:0"): features = tf.parse_single_example( example, features={ 'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''), 'image/format': tf.FixedLenFeature((), tf.string, default_value='png'), 'image/class/label': tf.FixedLenFeature( [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)), } ) image = tf.image.decode_jpeg(features['image/encoded'], channels=3) label = features['image/class/label'] return image, label
In the datasets directory, open the dataset_factory.py file and register the dataset in the datasets_map dictionary.
from datasets import cifar10

datasets_map = {
    'cifar10': cifar10,
}
When you run a training job, add dataset_name=cifar10 and train_files=cifar10_train.tfrecord to the command to use the CIFAR-10 dataset for model training.
To read datasets in other formats, refer to the utils/dataset_utils.py file to build a dataset pipeline.
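For illustration, a parse_fn for a hypothetical TFRecord schema might look like the following sketch. The feature keys 'data/feature' and 'data/label' are assumptions and must match the keys that were used when your TFRecord files were written.

import tensorflow as tf

# Hypothetical schema: a fixed-length float feature vector plus an int64 label.
# The keys below are assumptions for illustration only.
def parse_fn(example):
    with tf.device("/cpu:0"):
        features = tf.parse_single_example(
            example,
            features={
                'data/feature': tf.FixedLenFeature([128], tf.float32),
                'data/label': tf.FixedLenFeature([], tf.int64),
            })
        return features['data/feature'], features['data/label']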
Prepare a hyperparameter file
FastNN supports the following types of hyperparameters:
Dataset hyperparameters: the basic attributes of training datasets. For example, the dataset_dir hyperparameter specifies the storage path of a training dataset.
Data preprocessing hyperparameters: data preprocessing functions and dataset pipeline parameters.
Model hyperparameters: the basic parameters for model training, including model_name and batch_size.
Learning rate hyperparameters: learning rate parameters and tuning parameters.
Optimizer hyperparameters: parameters related to the optimizer.
Log hyperparameters: parameters related to the output log.
Performance tuning hyperparameters: tuning parameters, such as mixed precision.
The following example shows the format of a hyperparameter file:
enable_paisoar=True
batch_size=128
use_fp16=True
dataset_name=flowers
dataset_dir=oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/flowers/
model_name=inception_resnet_v2
optimizer=sgd
num_classes=5
job_name=worker
Dataset hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| dataset_name | string | The name of the input dataset that you want to parse. Valid values: mock, cifar10, mnist, and flowers. For more information, see the dataset_factory.py file in the image_models/datasets directory. Default value: mock. |
| dataset_dir | string | The absolute path of the input dataset. Default value: None. |
| num_sample_per_epoch | integer | The total number of samples in the dataset. This value is used to adjust the learning rate decay. |
| num_classes | integer | The number of classes in the dataset. Default value: 100. |
| train_files | string | The names of the files that contain all training data. Separate multiple names with commas (,). Example: 0.tfrecord,1.tfrecord. |
Data preprocessing hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| preprocessing_name | string | Used together with the model_name parameter to specify the name of the data preprocessing function. For the valid values, see the preprocessing_factory.py file in the image_models/preprocessing directory. Default value: None, which specifies that no preprocessing is performed. |
| shuffle_buffer_size | integer | The size of the buffer pool that is used for sample-level shuffling when the data pipeline is created. Default value: 1024. |
| num_parallel_batches | integer | The number of parallel threads that are used to parse samples. This value multiplied by batch_size determines the parallelism of the map_and_batch operation. Default value: 8. |
| prefetch_buffer_size | integer | The number of data batches that the data pipeline prefetches. Default value: 32. |
| num_preprocessing_threads | integer | The number of threads that the data pipeline uses to prefetch data in parallel. Default value: 16. |
| datasets_use_caching | bool | Specifies whether to cache the compressed input data in memory. Default value: False, which specifies that caching is disabled. |
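The following sketch shows how these preprocessing hyperparameters typically map onto a TensorFlow 1.x tf.data pipeline. It is an assumption for illustration, not the FastNN source code.

import tensorflow as tf

# A minimal sketch (not FastNN source) of a tf.data pipeline driven by the
# preprocessing hyperparameters described above.
def build_pipeline(train_files, parse_fn, batch_size=32,
                   shuffle_buffer_size=1024, num_parallel_batches=8,
                   prefetch_buffer_size=32, datasets_use_caching=False):
    dataset = tf.data.TFRecordDataset(train_files)
    if datasets_use_caching:
        dataset = dataset.cache()  # keep the raw records in memory
    dataset = dataset.shuffle(shuffle_buffer_size).repeat()
    # Parsing parallelism is batch_size * num_parallel_batches.
    dataset = dataset.apply(
        tf.data.experimental.map_and_batch(
            parse_fn, batch_size,
            num_parallel_batches=num_parallel_batches))
    dataset = dataset.prefetch(prefetch_buffer_size)
    return dataset.make_one_shot_iterator()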
Model hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| task_type | string | The type of the task. Valid values: pretrain (model pre-training, the default value) and finetune (model fine-tuning). |
| model_name | string | The name of the model that you want to train. Valid values include all models in the image_models/models directory. Configure this parameter based on the models defined in the image_models/models/model_factory.py file. Default value: inception_resnet_v2. |
| num_epochs | integer | The number of training epochs over the training dataset. Default value: 100. |
| weight_decay | float | The weight decay factor during model training. Default value: 0.00004. |
| max_gradient_norm | float | The threshold for gradient clipping by global norm. Default value: None, which specifies that gradient clipping is not performed. |
| batch_size | integer | The amount of data that a single GPU processes in each iteration. Default value: 32. |
| model_dir | string | The path of the checkpoint from which the model is reloaded. Default value: None, which specifies that fine-tuning is not performed. |
| ckpt_file_name | string | The name of the checkpoint file from which the model is reloaded. Default value: None. |
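For reference, max_gradient_norm corresponds to gradient clipping by global norm. The following sketch shows how such clipping is commonly implemented in plain TensorFlow 1.x; it is an illustration, not the PAISoar implementation, and assumes that optimizer, loss, global_step, and max_gradient_norm are already defined.

import tensorflow as tf

# A minimal sketch (not the PAISoar implementation) of gradient clipping by
# global norm, which is what max_gradient_norm controls.
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, max_gradient_norm)
train_op = optimizer.apply_gradients(list(zip(clipped_grads, variables)),
                                     global_step=global_step)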
Learning rate hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| warmup_steps | integer | The number of warm-up iterations during which the learning rate follows inverse decay. Default value: 0. |
| warmup_scheme | string | The inverse decay scheme of the learning rate. Set the value to t2t (Tensor2Tensor), which initializes the learning rate at 1/100 of the specified learning rate and then follows inverse exponential decay until the specified learning rate is reached. |
| decay_scheme | string | The decay scheme of the learning rate. Valid values: luong234 (start a four-step decay after two-thirds of the total iterations are completed, halving the learning rate at each step), luong5 (start a five-step decay after half of the total iterations are completed, halving the learning rate at each step), and luong10 (start a ten-step decay after half of the total iterations are completed, halving the learning rate at each step). |
| learning_rate_decay_factor | float | The learning rate decay factor. Default value: 0.94. |
| learning_rate_decay_type | string | The type of learning rate decay. Valid values: fixed, exponential, and polynomial. Default value: exponential. |
| learning_rate | float | The initial learning rate. Default value: 0.01. |
| end_learning_rate | float | The minimum learning rate during decay. Default value: 0.0001. |
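The following sketch illustrates how the learning rate hyperparameters could be combined in TensorFlow 1.x, including a t2t-style warm-up. It is an assumption for illustration, not the FastNN source code; the decay_steps argument is a value that you derive from the dataset size and batch size.

import tensorflow as tf

# A minimal sketch (not FastNN source) of combining learning_rate,
# learning_rate_decay_type, learning_rate_decay_factor, end_learning_rate,
# and warmup_steps.
def build_learning_rate(global_step, decay_steps,
                        learning_rate=0.01,
                        learning_rate_decay_type='exponential',
                        learning_rate_decay_factor=0.94,
                        end_learning_rate=0.0001,
                        warmup_steps=0):
    if learning_rate_decay_type == 'fixed':
        lr = tf.constant(learning_rate, tf.float32)
    elif learning_rate_decay_type == 'polynomial':
        lr = tf.train.polynomial_decay(learning_rate, global_step, decay_steps,
                                       end_learning_rate=end_learning_rate)
    else:  # 'exponential'
        lr = tf.train.exponential_decay(learning_rate, global_step, decay_steps,
                                        learning_rate_decay_factor,
                                        staircase=True)
    if warmup_steps > 0:
        # t2t-style warm-up: start at 1/100 of the target rate and grow
        # exponentially until warmup_steps is reached.
        warmup_factor = tf.exp(tf.log(0.01) / warmup_steps)
        inv_decay = warmup_factor ** tf.cast(warmup_steps - global_step,
                                             tf.float32)
        lr = tf.cond(global_step < warmup_steps,
                     lambda: inv_decay * lr, lambda: lr)
    return lr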
Optimizer hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| optimizer | string | The name of the optimizer. Valid values: adadelta, adagrad, adam, ftrl, momentum, sgd, rmsprop, and adamweightdecay. Default value: rmsprop. |
| adadelta_rho | float | The decay factor of the Adadelta optimizer. Default value: 0.95. This parameter is valid only if you set the optimizer parameter to adadelta. |
| adagrad_initial_accumulator_value | float | The initial value of the Adagrad accumulator. Default value: 0.1. This parameter is valid only if you set the optimizer parameter to adagrad. |
| adam_beta1 | float | The exponential decay rate of the first-moment estimate. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to adam. |
| adam_beta2 | float | The exponential decay rate of the second-moment estimate. Default value: 0.999. This parameter is valid only if you set the optimizer parameter to adam. |
| opt_epsilon | float | The epsilon offset of the optimizer. Default value: 1.0. This parameter is valid only if you set the optimizer parameter to adam. |
| ftrl_learning_rate_power | float | The power to which the learning rate is raised in the FTRL optimizer. Default value: -0.5. This parameter is valid only if you set the optimizer parameter to ftrl. |
| ftrl_initial_accumulator_value | float | The initial value of the FTRL accumulator. Default value: 0.1. This parameter is valid only if you set the optimizer parameter to ftrl. |
| ftrl_l1 | float | The L1 regularization strength of the FTRL optimizer. Default value: 0.0. This parameter is valid only if you set the optimizer parameter to ftrl. |
| ftrl_l2 | float | The L2 regularization strength of the FTRL optimizer. Default value: 0.0. This parameter is valid only if you set the optimizer parameter to ftrl. |
| momentum | float | The momentum of the Momentum optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to momentum. |
| rmsprop_momentum | float | The momentum of the RMSProp optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to rmsprop. |
| rmsprop_decay | float | The decay factor of the RMSProp optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to rmsprop. |
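The following sketch shows how the optimizer hyperparameters typically map to TensorFlow 1.x optimizers. It is an illustration based on the standard tf.train optimizers, not the FastNN source code; adamweightdecay is omitted because TensorFlow 1.x does not provide it out of the box.

import tensorflow as tf

# A minimal sketch (not FastNN source): map the optimizer hyperparameter to a
# TensorFlow 1.x optimizer. flags is an object holding the parsed hyperparameters.
def configure_optimizer(learning_rate, flags):
    if flags.optimizer == 'adadelta':
        return tf.train.AdadeltaOptimizer(learning_rate, rho=flags.adadelta_rho,
                                          epsilon=flags.opt_epsilon)
    if flags.optimizer == 'adagrad':
        return tf.train.AdagradOptimizer(
            learning_rate,
            initial_accumulator_value=flags.adagrad_initial_accumulator_value)
    if flags.optimizer == 'adam':
        return tf.train.AdamOptimizer(learning_rate, beta1=flags.adam_beta1,
                                      beta2=flags.adam_beta2,
                                      epsilon=flags.opt_epsilon)
    if flags.optimizer == 'ftrl':
        return tf.train.FtrlOptimizer(
            learning_rate,
            learning_rate_power=flags.ftrl_learning_rate_power,
            initial_accumulator_value=flags.ftrl_initial_accumulator_value,
            l1_regularization_strength=flags.ftrl_l1,
            l2_regularization_strength=flags.ftrl_l2)
    if flags.optimizer == 'momentum':
        return tf.train.MomentumOptimizer(learning_rate, momentum=flags.momentum)
    if flags.optimizer == 'rmsprop':
        return tf.train.RMSPropOptimizer(learning_rate, decay=flags.rmsprop_decay,
                                         momentum=flags.rmsprop_momentum,
                                         epsilon=flags.opt_epsilon)
    if flags.optimizer == 'sgd':
        return tf.train.GradientDescentOptimizer(learning_rate)
    raise ValueError('Unknown optimizer: %s' % flags.optimizer)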
Log hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| stop_at_step | integer | The total number of training steps after which training stops. Default value: 100. |
| log_loss_every_n_iters | integer | The number of iterations between outputs of loss information. Default value: 10. |
| profile_every_n_iters | integer | The number of iterations between outputs of the timeline. Default value: 0. |
| profile_at_task | integer | The index of the machine that generates the timeline. Default value: 0, which corresponds to the chief worker. |
| log_device_placement | bool | Specifies whether to print device placement information. Default value: False. |
| print_model_statistics | bool | Specifies whether to print information about trainable variables. Default value: False. |
| hooks | string | The training hooks. Default value: StopAtStepHook,ProfilerHook,LoggingTensorHook,CheckpointSaverHook. |
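The following sketch shows how the log hyperparameters map onto standard tf.train session hooks. It is an illustration, not the FastNN source code, and assumes that loss and train_op are already defined by the model.

import tensorflow as tf

# A minimal sketch (not FastNN source): drive training with hooks configured
# from the log hyperparameters. The values mirror the defaults in the table.
stop_at_step = 100
log_loss_every_n_iters = 10
profile_every_n_iters = 0

hooks = [
    tf.train.StopAtStepHook(last_step=stop_at_step),
    tf.train.LoggingTensorHook({'loss': loss},
                               every_n_iter=log_loss_every_n_iters),
]
if profile_every_n_iters > 0:
    hooks.append(tf.train.ProfilerHook(save_steps=profile_every_n_iters))

with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)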
Performance tuning hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| use_fp16 | bool | Specifies whether to perform half-precision (FP16) training. Default value: True. |
| loss_scale | float | The factor by which the loss is scaled during training. Default value: 1.0. |
| enable_paisoar | bool | Specifies whether to use the PAISoar framework. Default value: True. |
| protocol | string | The communication protocol of the cluster. Default value: grpc.rdma, which specifies that the cluster uses gRPC over Remote Direct Memory Access (RDMA) to improve data access efficiency. |
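For reference, the following sketch illustrates what loss_scale does during FP16 training: the loss is multiplied by the scale before gradients are computed, and the gradients are divided by the same scale afterwards. This is an illustration in plain TensorFlow 1.x with dense gradients, not the PAISoar implementation, and it assumes that optimizer, loss, and global_step are already defined.

# A minimal sketch (not the PAISoar implementation) of static loss scaling.
loss_scale = 128.0  # example value; the FastNN default is 1.0
scaled_grads_and_vars = optimizer.compute_gradients(loss * loss_scale)
grads_and_vars = [(g / loss_scale, v)
                  for g, v in scaled_grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)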
Develop a main file
If the existing FastNN models cannot meet your requirements, you can use the dataset, model, and preprocessing APIs for further development. Before development, make sure that you are familiar with the basic logic of a FastNN model. If you download FastNN code, you can view the basic logic of an image classification model in the train_image_classifiers.py entry file. Sample code:
# Initialize the model based on the model_name parameter to create the network_fn function. The expected input image size (train_image_size) may also be returned.
network_fn = nets_factory.get_network_fn(
FLAGS.model_name,
num_classes=FLAGS.num_classes,
weight_decay=FLAGS.weight_decay,
is_training=(FLAGS.task_type in ['pretrain', 'finetune']))
# Initialize the preprocess_fn function by using the model_name or preprocessing_name parameter.
preprocessing_fn = preprocessing_factory.get_preprocessing(
FLAGS.model_name or FLAGS.preprocessing_name,
is_training=(FLAGS.task_type in ['pretrain', 'finetune']))
# Select the TFRecord parsing logic based on the dataset_name parameter, call the preprocessing_fn function to parse the dataset, and obtain the dataset_iterator object.
dataset_iterator = dataset_factory.get_dataset_iterator(FLAGS.dataset_name,
                                                        train_image_size,
                                                        preprocessing_fn,
                                                        data_sources)
# Call the network_fn and dataset_iterator.get_next functions to define the loss_fn function that is used to calculate the loss.
def loss_fn():
with tf.device('/cpu:0'):
images, labels = dataset_iterator.get_next()
logits, end_points = network_fn(images)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=tf.cast(logits, tf.float32), weights=1.0)
if 'AuxLogits' in end_points:
loss += tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=tf.cast(end_points['AuxLogits'], tf.float32), weights=0.4)
return loss
# Call the PAISoar API to encapsulate the native TensorFlow optimizer and the loss_fn function.
opt = paisoar.ReplicatedVarsOptimizer(optimizer, clip_norm=FLAGS.max_gradient_norm)
loss = optimizer.compute_loss(loss_fn, loss_scale=FLAGS.loss_scale)
# Define the training tensor based on the opt and loss objects.
train_op = opt.minimize(loss, global_step=global_step)