Fast Neural Network (FastNN) is a distributed neural network library based on the PAISoar framework. FastNN includes common neural networks, such as Inception, Residual Networks (ResNet), and Visual Geometry Group (VGG), and plans to release more advanced models in the future. FastNN is integrated into the Machine Learning Designer module of Platform for AI (PAI). You can use FastNN in the PAI console.
GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.
Prepare datasets
To help you get started with FastNN in the PAI console, the CIFAR-10, MNIST, and flowers datasets have been downloaded, converted into TFRecord files, and stored in Object Storage Service (OSS). You can access the datasets by using the Read Table or OSS Data Synchronization component of PAI. The following table describes the datasets and their OSS storage paths.
| Dataset | Number of classes in the dataset | Number of samples in the training dataset | Number of samples in the test dataset | Storage path |
| --- | --- | --- | --- | --- |
| MNIST | 10 | 60000 | 10000 | |
| CIFAR-10 | 10 | 50000 | 10000 | |
| flowers | 5 | 3320 | 350 | |
FastNN can read data that is stored in TFRecord files. You can use the TFRecordDataset class to build dataset pipelines for model training, which reduces the time required for data preprocessing. However, FastNN does not support fine-grained data partitioning. To ensure that data is evenly distributed among workers, we recommend that you follow these rules (see the sketch after this list):
Each TFRecord file contains an equal number of samples.
Each worker processes an equal number of TFRecord files.
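For example, the following sketch (an illustration, not part of FastNN) shows one way to write a dataset into TFRecord shards that each contain the same number of samples, so that every worker can be assigned the same number of files:

import tensorflow as tf

# A minimal sketch (not part of FastNN): write num_shards TFRecord files that
# each contain exactly the same number of samples.
def write_even_shards(examples, num_shards, prefix):
    # examples is assumed to be a list of tf.train.Example protos.
    # Drop the remainder so that every shard holds the same sample count.
    samples_per_shard = len(examples) // num_shards
    for shard in range(num_shards):
        path = '%s_%05d.tfrecord' % (prefix, shard)
        with tf.python_io.TFRecordWriter(path) as writer:
            start = shard * samples_per_shard
            for example in examples[start:start + samples_per_shard]:
                writer.write(example.SerializeToString())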
If your dataset is stored in TFRecord files, you can download the FastNN code and use the sample files in the datasets directory, such as cifar10.py, mnist.py, and flowers.py, to build dataset pipelines. The following example uses the CIFAR-10 dataset.
The features in the CIFAR-10 dataset are in the following format:
features={
'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
'image/format': tf.FixedLenFeature((), tf.string, default_value='png'),
'image/class/label': tf.FixedLenFeature(
[], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
}
In the datasets directory, create a file named cifar10.py for data parsing and edit the file.
"""Provides data for the Cifar10 dataset. The dataset scripts used to create the dataset can be found at: datasets/download_and_covert_data/download_and_convert_cifar10.py """ from __future__ import division from __future__ import print_function import tensorflow as tf """Expect func_name is 'parse_fn' """ def parse_fn(example): with tf.device("/cpu:0"): features = tf.parse_single_example( example, features={ 'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''), 'image/format': tf.FixedLenFeature((), tf.string, default_value='png'), 'image/class/label': tf.FixedLenFeature( [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)), } ) image = tf.image.decode_jpeg(features['image/encoded'], channels=3) label = features['image/class/label'] return image, label
In the datasets directory, open the dataset_factory.py file and register the dataset in the datasets_map dictionary.
from datasets import cifar10

datasets_map = {
    'cifar10': cifar10,
}
When you run a training job, add dataset_name=cifar10 and train_files=cifar10_train.tfrecord to the command to use the CIFAR-10 dataset for model training.
To read datasets in other formats, refer to the utils/dataset_utils.py file to build a dataset pipeline.
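For illustration, a parse_fn for a hypothetical TFRecord schema might look like the following sketch. The feature keys 'data/feature' and 'data/label' are assumptions and must match the keys that were used when your TFRecord files were written.

import tensorflow as tf

# Hypothetical schema: a fixed-length float feature vector plus an int64 label.
# The keys below are assumptions for illustration only.
def parse_fn(example):
    with tf.device("/cpu:0"):
        features = tf.parse_single_example(
            example,
            features={
                'data/feature': tf.FixedLenFeature([128], tf.float32),
                'data/label': tf.FixedLenFeature([], tf.int64),
            })
        return features['data/feature'], features['data/label']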
Prepare a hyperparameter file
FastNN supports the following types of hyperparameters:
Dataset hyperparameters: the basic attributes of training datasets. For example, the dataset_dir hyperparameter specifies the storage path of a training dataset.
Data preprocessing hyperparameters: data preprocessing functions and dataset pipeline parameters.
Model hyperparameters: the basic parameters for model training, including model_name and batch_size.
Learning rate hyperparameters: learning rate parameters and tuning parameters.
Optimizer hyperparameters: parameters related to the optimizer.
Log hyperparameters: parameters related to the output log.
Performance tuning hyperparameters: tuning parameters, such as mixed precision.
The following example shows the format of a hyperparameter file:
enable_paisoar=True
batch_size=128
use_fp16=True
dataset_name=flowers
dataset_dir=oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/flowers/
model_name=inception_resnet_v2
optimizer=sgd
num_classes=5
job_name=worker
Dataset hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| dataset_name | string | The name of the input dataset that you want to parse. Valid values: mock, cifar10, mnist, and flowers. For more information, see the dataset_factory.py file in the image_models/datasets directory. Default value: mock. |
| dataset_dir | string | The absolute path of the input dataset. Default value: None. |
| num_sample_per_epoch | integer | The total number of samples in the dataset. This value is used to adjust the learning rate decay. |
| num_classes | integer | The number of classes in the dataset. Default value: 100. |
| train_files | string | The names of the files that contain all training data. Separate multiple names with commas (,). Example: 0.tfrecord,1.tfrecord. |
Data preprocessing hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| preprocessing_name | string | Used together with the model_name parameter to specify the name of the data preprocessing function. For the valid values, see the preprocessing_factory.py file in the image_models/preprocessing directory. Default value: None, which specifies that no preprocessing is performed. |
| shuffle_buffer_size | integer | The size of the buffer pool that is used for sample-level shuffling when the data pipeline is created. Default value: 1024. |
| num_parallel_batches | integer | The number of parallel threads that are used to parse samples. This value multiplied by batch_size determines the parallelism of the map_and_batch operation. Default value: 8. |
| prefetch_buffer_size | integer | The number of data batches that the data pipeline prefetches. Default value: 32. |
| num_preprocessing_threads | integer | The number of threads that the data pipeline uses to prefetch data in parallel. Default value: 16. |
| datasets_use_caching | bool | Specifies whether to cache the compressed input data in memory. Default value: False, which specifies that caching is disabled. |
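The following sketch shows how these preprocessing hyperparameters typically map onto a TensorFlow 1.x tf.data pipeline. It is an assumption for illustration, not the FastNN source code.

import tensorflow as tf

# A minimal sketch (not FastNN source) of a tf.data pipeline driven by the
# preprocessing hyperparameters described above.
def build_pipeline(train_files, parse_fn, batch_size=32,
                   shuffle_buffer_size=1024, num_parallel_batches=8,
                   prefetch_buffer_size=32, datasets_use_caching=False):
    dataset = tf.data.TFRecordDataset(train_files)
    if datasets_use_caching:
        dataset = dataset.cache()  # keep the raw records in memory
    dataset = dataset.shuffle(shuffle_buffer_size).repeat()
    # Parsing parallelism is batch_size * num_parallel_batches.
    dataset = dataset.apply(
        tf.data.experimental.map_and_batch(
            parse_fn, batch_size,
            num_parallel_batches=num_parallel_batches))
    dataset = dataset.prefetch(prefetch_buffer_size)
    return dataset.make_one_shot_iterator()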
Model hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| task_type | string | The type of the task. Valid values: pretrain (model pre-training, the default value) and finetune (model fine-tuning). |
| model_name | string | The name of the model that you want to train. Valid values include all models in the image_models/models directory. Configure this parameter based on the models defined in the image_models/models/model_factory.py file. Default value: inception_resnet_v2. |
| num_epochs | integer | The number of training epochs over the training dataset. Default value: 100. |
| weight_decay | float | The weight decay factor during model training. Default value: 0.00004. |
| max_gradient_norm | float | The threshold for gradient clipping by global norm. Default value: None, which specifies that gradient clipping is not performed. |
| batch_size | integer | The amount of data that a single GPU processes in each iteration. Default value: 32. |
| model_dir | string | The path of the checkpoint from which the model is reloaded. Default value: None, which specifies that fine-tuning is not performed. |
| ckpt_file_name | string | The name of the checkpoint file from which the model is reloaded. Default value: None. |
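For reference, max_gradient_norm corresponds to gradient clipping by global norm. The following sketch shows how such clipping is commonly implemented in plain TensorFlow 1.x; it is an illustration, not the PAISoar implementation, and assumes that optimizer, loss, global_step, and max_gradient_norm are already defined.

import tensorflow as tf

# A minimal sketch (not the PAISoar implementation) of gradient clipping by
# global norm, which is what max_gradient_norm controls.
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, max_gradient_norm)
train_op = optimizer.apply_gradients(list(zip(clipped_grads, variables)),
                                     global_step=global_step)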
Learning rate hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| warmup_steps | integer | The number of warm-up iterations during which the learning rate follows inverse decay. Default value: 0. |
| warmup_scheme | string | The inverse decay scheme of the learning rate. Set the value to t2t (Tensor2Tensor), which initializes the learning rate at 1/100 of the specified learning rate and then follows inverse exponential decay until the specified learning rate is reached. |
| decay_scheme | string | The decay scheme of the learning rate. Valid values: luong234 (start a four-step decay after two-thirds of the total iterations are completed, halving the learning rate at each step), luong5 (start a five-step decay after half of the total iterations are completed, halving the learning rate at each step), and luong10 (start a ten-step decay after half of the total iterations are completed, halving the learning rate at each step). |
| learning_rate_decay_factor | float | The learning rate decay factor. Default value: 0.94. |
| learning_rate_decay_type | string | The type of learning rate decay. Valid values: fixed, exponential, and polynomial. Default value: exponential. |
| learning_rate | float | The initial learning rate. Default value: 0.01. |
| end_learning_rate | float | The minimum learning rate during decay. Default value: 0.0001. |
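The following sketch illustrates how the learning rate hyperparameters could be combined in TensorFlow 1.x, including a t2t-style warm-up. It is an assumption for illustration, not the FastNN source code; the decay_steps argument is a value that you derive from the dataset size and batch size.

import tensorflow as tf

# A minimal sketch (not FastNN source) of combining learning_rate,
# learning_rate_decay_type, learning_rate_decay_factor, end_learning_rate,
# and warmup_steps.
def build_learning_rate(global_step, decay_steps,
                        learning_rate=0.01,
                        learning_rate_decay_type='exponential',
                        learning_rate_decay_factor=0.94,
                        end_learning_rate=0.0001,
                        warmup_steps=0):
    if learning_rate_decay_type == 'fixed':
        lr = tf.constant(learning_rate, tf.float32)
    elif learning_rate_decay_type == 'polynomial':
        lr = tf.train.polynomial_decay(learning_rate, global_step, decay_steps,
                                       end_learning_rate=end_learning_rate)
    else:  # 'exponential'
        lr = tf.train.exponential_decay(learning_rate, global_step, decay_steps,
                                        learning_rate_decay_factor,
                                        staircase=True)
    if warmup_steps > 0:
        # t2t-style warm-up: start at 1/100 of the target rate and grow
        # exponentially until warmup_steps is reached.
        warmup_factor = tf.exp(tf.log(0.01) / warmup_steps)
        inv_decay = warmup_factor ** tf.cast(warmup_steps - global_step,
                                             tf.float32)
        lr = tf.cond(global_step < warmup_steps,
                     lambda: inv_decay * lr, lambda: lr)
    return lr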
Optimizer hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| optimizer | string | The name of the optimizer. Valid values: adadelta, adagrad, adam, ftrl, momentum, sgd, rmsprop, and adamweightdecay. Default value: rmsprop. |
| adadelta_rho | float | The decay factor of the Adadelta optimizer. Default value: 0.95. This parameter is valid only if you set the optimizer parameter to adadelta. |
| adagrad_initial_accumulator_value | float | The initial value of the Adagrad accumulator. Default value: 0.1. This parameter is valid only if you set the optimizer parameter to adagrad. |
| adam_beta1 | float | The exponential decay rate of the first-moment estimate. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to adam. |
| adam_beta2 | float | The exponential decay rate of the second-moment estimate. Default value: 0.999. This parameter is valid only if you set the optimizer parameter to adam. |
| opt_epsilon | float | The epsilon offset of the optimizer. Default value: 1.0. This parameter is valid only if you set the optimizer parameter to adam. |
| ftrl_learning_rate_power | float | The power to which the learning rate is raised in the FTRL optimizer. Default value: -0.5. This parameter is valid only if you set the optimizer parameter to ftrl. |
| ftrl_initial_accumulator_value | float | The initial value of the FTRL accumulator. Default value: 0.1. This parameter is valid only if you set the optimizer parameter to ftrl. |
| ftrl_l1 | float | The L1 regularization strength of the FTRL optimizer. Default value: 0.0. This parameter is valid only if you set the optimizer parameter to ftrl. |
| ftrl_l2 | float | The L2 regularization strength of the FTRL optimizer. Default value: 0.0. This parameter is valid only if you set the optimizer parameter to ftrl. |
| momentum | float | The momentum of the Momentum optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to momentum. |
| rmsprop_momentum | float | The momentum of the RMSProp optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to rmsprop. |
| rmsprop_decay | float | The decay factor of the RMSProp optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to rmsprop. |
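The following sketch shows how the optimizer hyperparameters typically map to TensorFlow 1.x optimizers. It is an illustration based on the standard tf.train optimizers, not the FastNN source code; adamweightdecay is omitted because TensorFlow 1.x does not provide it out of the box.

import tensorflow as tf

# A minimal sketch (not FastNN source): map the optimizer hyperparameter to a
# TensorFlow 1.x optimizer. flags is an object holding the parsed hyperparameters.
def configure_optimizer(learning_rate, flags):
    if flags.optimizer == 'adadelta':
        return tf.train.AdadeltaOptimizer(learning_rate, rho=flags.adadelta_rho,
                                          epsilon=flags.opt_epsilon)
    if flags.optimizer == 'adagrad':
        return tf.train.AdagradOptimizer(
            learning_rate,
            initial_accumulator_value=flags.adagrad_initial_accumulator_value)
    if flags.optimizer == 'adam':
        return tf.train.AdamOptimizer(learning_rate, beta1=flags.adam_beta1,
                                      beta2=flags.adam_beta2,
                                      epsilon=flags.opt_epsilon)
    if flags.optimizer == 'ftrl':
        return tf.train.FtrlOptimizer(
            learning_rate,
            learning_rate_power=flags.ftrl_learning_rate_power,
            initial_accumulator_value=flags.ftrl_initial_accumulator_value,
            l1_regularization_strength=flags.ftrl_l1,
            l2_regularization_strength=flags.ftrl_l2)
    if flags.optimizer == 'momentum':
        return tf.train.MomentumOptimizer(learning_rate, momentum=flags.momentum)
    if flags.optimizer == 'rmsprop':
        return tf.train.RMSPropOptimizer(learning_rate, decay=flags.rmsprop_decay,
                                         momentum=flags.rmsprop_momentum,
                                         epsilon=flags.opt_epsilon)
    if flags.optimizer == 'sgd':
        return tf.train.GradientDescentOptimizer(learning_rate)
    raise ValueError('Unknown optimizer: %s' % flags.optimizer)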
Log hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| stop_at_step | integer | The total number of training steps after which training stops. Default value: 100. |
| log_loss_every_n_iters | integer | The number of iterations between outputs of loss information. Default value: 10. |
| profile_every_n_iters | integer | The number of iterations between outputs of the timeline. Default value: 0. |
| profile_at_task | integer | The index of the machine that generates the timeline. Default value: 0, which corresponds to the chief worker. |
| log_device_placement | bool | Specifies whether to print device placement information. Default value: False. |
| print_model_statistics | bool | Specifies whether to print information about trainable variables. Default value: False. |
| hooks | string | The training hooks. Default value: StopAtStepHook,ProfilerHook,LoggingTensorHook,CheckpointSaverHook. |
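The following sketch shows how the log hyperparameters map onto standard tf.train session hooks. It is an illustration, not the FastNN source code, and assumes that loss and train_op are already defined by the model.

import tensorflow as tf

# A minimal sketch (not FastNN source): drive training with hooks configured
# from the log hyperparameters. The values mirror the defaults in the table.
stop_at_step = 100
log_loss_every_n_iters = 10
profile_every_n_iters = 0

hooks = [
    tf.train.StopAtStepHook(last_step=stop_at_step),
    tf.train.LoggingTensorHook({'loss': loss},
                               every_n_iter=log_loss_every_n_iters),
]
if profile_every_n_iters > 0:
    hooks.append(tf.train.ProfilerHook(save_steps=profile_every_n_iters))

with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)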
Performance tuning hyperparameters
| Parameter | Type | Description |
| --- | --- | --- |
| use_fp16 | bool | Specifies whether to perform half-precision (FP16) training. Default value: True. |
| loss_scale | float | The factor by which the loss is scaled during training. Default value: 1.0. |
| enable_paisoar | bool | Specifies whether to use the PAISoar framework. Default value: True. |
| protocol | string | The communication protocol of the cluster. Default value: grpc.rdma, which specifies that the cluster uses gRPC over Remote Direct Memory Access (RDMA) to improve data access efficiency. |
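For reference, the following sketch illustrates what loss_scale does during FP16 training: the loss is multiplied by the scale before gradients are computed, and the gradients are divided by the same scale afterwards. This is an illustration in plain TensorFlow 1.x with dense gradients, not the PAISoar implementation, and it assumes that optimizer, loss, and global_step are already defined.

# A minimal sketch (not the PAISoar implementation) of static loss scaling.
loss_scale = 128.0  # example value; the FastNN default is 1.0
scaled_grads_and_vars = optimizer.compute_gradients(loss * loss_scale)
grads_and_vars = [(g / loss_scale, v)
                  for g, v in scaled_grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)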
Develop a main file
If the existing FastNN models cannot meet your requirements, you can use the dataset, model, and preprocessing APIs for further development. Before development, make sure that you are familiar with the basic logic of a FastNN model. If you download FastNN code, you can view the basic logic of an image classification model in the train_image_classifiers.py entry file. Sample code:
# Initialize the model based on the model_name parameter to create the network_fn function. The expected input image size (train_image_size) may also be returned.
network_fn = nets_factory.get_network_fn(
FLAGS.model_name,
num_classes=FLAGS.num_classes,
weight_decay=FLAGS.weight_decay,
is_training=(FLAGS.task_type in ['pretrain', 'finetune']))
# Initialize the preprocess_fn function by using the model_name or preprocessing_name parameter.
preprocessing_fn = preprocessing_factory.get_preprocessing(
FLAGS.model_name or FLAGS.preprocessing_name,
is_training=(FLAGS.task_type in ['pretrain', 'finetune']))
# Select the TFRecord parsing logic based on the dataset_name parameter, call the preprocessing_fn function to parse the dataset, and obtain the dataset_iterator object.
dataset_iterator = dataset_factory.get_dataset_iterator(FLAGS.dataset_name,
                                                        train_image_size,
                                                        preprocessing_fn,
                                                        data_sources)
# Call the network_fn and dataset_iterator.get_next functions to define the loss_fn function that is used to calculate the loss.
def loss_fn():
with tf.device('/cpu:0'):
images, labels = dataset_iterator.get_next()
logits, end_points = network_fn(images)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=tf.cast(logits, tf.float32), weights=1.0)
if 'AuxLogits' in end_points:
loss += tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=tf.cast(end_points['AuxLogits'], tf.float32), weights=0.4)
return loss
# Call the PAISoar API to encapsulate the native TensorFlow optimizer and the loss_fn function.
opt = paisoar.ReplicatedVarsOptimizer(optimizer, clip_norm=FLAGS.max_gradient_norm)
loss = optimizer.compute_loss(loss_fn, loss_scale=FLAGS.loss_scale)
# Define the training tensor based on the opt and loss objects.
train_op = opt.minimize(loss, global_step=global_step)