How to use the FastNN model library - Platform for AI (PAI)

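The PAI model library FastNN (Fast Neural Networks) is a distributed neural network library built on PAISoar. FastNN supports classic algorithms such as Inception, Resnet, and VGG, and more advanced models will be released over time. FastNN is already built into the Designer platform and can be used there directly.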

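Warning

GPU servers in the public cloud are about to reach the end of their warranty and be taken offline. You can still submit CPU-version TensorFlow jobs. To train models on GPUs, submit your jobs in DLC. For details, see Create a training job.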

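Prepare the data source

To make it easy to try FastNN in the PAI console, the cifar10, mnist, and flowers datasets have been downloaded, converted to tfrecord format, and stored in public OSS buckets. You can access them through the Read Table or OSS Data Synchronization component in PAI. The OSS storage paths are as follows.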

Dataset	Number of classes	Training set size	Test set size	Storage path
mnist	10	60000	10000	Beijing: oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/mnist/ Shanghai: oss://pai-online.oss-cn-shanghai-internal.aliyuncs.com/fastnn-data/mnist/
cifar10	10	50000	10000	Beijing: oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/cifar10/ Shanghai: oss://pai-online.oss-cn-shanghai-internal.aliyuncs.com/fastnn-data/cifar10/
flowers	5	3320	350	Beijing: oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/flowers/ Shanghai: oss://pai-online.oss-cn-shanghai-internal.aliyuncs.com/fastnn-data/flowers/

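The FastNN library reads data in tfrecord format and builds its dataset pipeline on the TFRecordDataset API for use in model training, which hides almost all of the data preprocessing time. Because data sharding in FastNN is currently coarse, prepare your data so that it is distributed as evenly as possible across machines, that is:

Each tfrecord file contains roughly the same number of samples.
Each worker processes roughly the same number of tfrecord files.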
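
The second condition can be checked before a job is submitted. The following is a minimal sketch, not FastNN's own sharding logic: `assign_files_to_workers` and `is_balanced` are hypothetical helper names, and round-robin assignment is assumed only for illustration.

```python
def assign_files_to_workers(files, num_workers):
    """Round-robin assignment so per-worker file counts differ by at most one."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    shards = [[] for _ in range(num_workers)]
    for index, name in enumerate(sorted(files)):
        shards[index % num_workers].append(name)
    return shards


def is_balanced(shards):
    """True when no worker holds more than one file beyond any other worker."""
    counts = [len(shard) for shard in shards]
    return max(counts) - min(counts) <= 1


# Eight hypothetical tfrecord files spread over four workers: two files each.
files = ["%d.tfrecord" % i for i in range(8)]
shards = assign_files_to_workers(files, num_workers=4)
```

Choosing a file count that is a multiple of the worker count keeps every shard the same size, which matters more when each file also holds a similar number of samples.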

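If your data is also in tfrecord format, you can implement the dataset pipeline by following the cifar10, mnist, and flowers files in the datasets directory. The cifar10 data is used as an example below.

Assume that the key_to_features format of the cifar10 data is as follows.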

features={
        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
        'image/format': tf.FixedLenFeature((), tf.string, default_value='png'),
        'image/class/label': tf.FixedLenFeature(
          [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
}

Create the data parsing file cifar10.py in the datasets directory and edit it as follows.

"""Provides data for the Cifar10 dataset.
The dataset scripts used to create the dataset can be found at:
datasets/download_and_covert_data/download_and_convert_cifar10.py
"""
from __future__ import division
from __future__ import print_function
import tensorflow as tf
# The parsing function is expected to be named 'parse_fn'.
def parse_fn(example):
  with tf.device("/cpu:0"):
    features = tf.parse_single_example(
      example,
      features={
        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
        'image/format': tf.FixedLenFeature((), tf.string, default_value='png'),
        'image/class/label': tf.FixedLenFeature(
          [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
      }
    )
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
    label = features['image/class/label']
    return image, label

Add the new dataset to datasets_map in datasets/dataset_factory.py.

from datasets import cifar10
datasets_map = {
    'cifar10': cifar10,
}

When you run the job script, set dataset_name=cifar10 and train_files=cifar10_train.tfrecord to train the model on the cifar10 data.
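
The source does not show how dataset_factory.py resolves dataset_name, so the following is only a sketch of one plausible lookup. `get_parse_fn` is a hypothetical helper, and a `SimpleNamespace` stands in for the real `datasets.cifar10` module so that the example runs without TensorFlow.

```python
from types import SimpleNamespace


def _parse_cifar10(example):
    # Stand-in for parse_fn in datasets/cifar10.py, which would decode a tfrecord example.
    return ("cifar10", example)


# Hypothetical stand-in for the module imported with `from datasets import cifar10`.
cifar10 = SimpleNamespace(parse_fn=_parse_cifar10)

datasets_map = {
    "cifar10": cifar10,
}


def get_parse_fn(dataset_name):
    """Return the parse_fn registered under dataset_name."""
    if dataset_name not in datasets_map:
        raise ValueError("unknown dataset_name: %r (known: %s)"
                         % (dataset_name, ", ".join(sorted(datasets_map))))
    parse_fn = getattr(datasets_map[dataset_name], "parse_fn", None)
    if not callable(parse_fn):
        raise AttributeError("%s must define a callable parse_fn" % dataset_name)
    return parse_fn
```

Failing early on an unregistered name or a missing parse_fn turns a misconfigured dataset_name into a clear error at startup rather than a failure inside the input pipeline.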

Note

To read data in other formats, you must implement the dataset pipeline construction logic yourself (see utils/dataset_utils.py).

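Hyperparameter file

PAI-FastNN supports the following types of hyperparameters:

Dataset parameters: basic properties of the training set, such as its storage path dataset_dir.
Data preprocessing parameters: the data preprocessing function and parameters related to the dataset pipeline.
Model parameters: basic model training parameters, including model_name and batch_size.
Learning rate parameters: the learning rate and its tuning parameters.
Optimizer parameters: the optimizer and its parameters.
Log parameters: parameters that control log output.
Performance tuning parameters: mixed precision and other tuning parameters.

The hyperparameter file has the following format.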

enable_paisoar=True
batch_size=128
use_fp16=True
dataset_name=flowers
dataset_dir=oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/flowers/
model_name=inception_resnet_v2
optimizer=sgd
num_classes=5
job_name=worker
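
The file above is a flat list of key=value lines. The following sketch shows one way such a file could be read into a dictionary for inspection or validation. It is not FastNN's own parser: `parse_hyperparams` is a hypothetical helper, and the handling of blank lines and `#` comments is an assumption.

```python
import ast


def parse_hyperparams(text):
    """Parse key=value lines into a dict, converting Python literals where possible."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # Assumption: blank lines and '#' comments are ignored.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected key=value, got: %r" % raw_line)
        key, value = key.strip(), value.strip()
        try:
            # True, 128, 0.94, and None become bool, int, float, and None.
            params[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Anything else (names, OSS paths, file lists) stays a string.
            params[key] = value
    return params


sample = """enable_paisoar=True
batch_size=128
dataset_name=flowers
dataset_dir=oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/flowers/
"""
params = parse_hyperparams(sample)
```

Parsing the file this way makes it easy to catch a misspelled key before submitting a job, for example by comparing `params.keys()` against the parameter names in the tables that follow.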

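Dataset parameters

Name	Type	Description
dataset_name	string	Name of the parsing file for the input data. Valid values: mock, cifar10, mnist, and flowers. For details, see the data parsing files in the images/datasets directory. Default: mock (simulated data).
dataset_dir	string	Absolute path of the input dataset. Default: None.
num_sample_per_epoch	integer	Total number of samples in the dataset. Typically used together with learning rate decay.
num_classes	integer	Number of sample classes. Default: 100.
train_files	string	Names of all training data files, separated by commas, for example 0.tfrecord,1.tfrecord.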

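Data preprocessing parameters

Name	Type	Description
preprocessing_name	string	Together with model_name, specifies the name of the data preprocessing method. For valid values, see the preprocessing_factory file in the images/preprocessing directory. Default: None, which means no data preprocessing.
shuffle_buffer_size	integer	Size of the sample-level shuffle buffer used when the data pipeline is built. Default: 1024.
num_parallel_batches	integer	Multiplied by batch_size, gives the number of parallel threads for map_and_batch. Controls the parallelism of sample parsing. Default: 8.
prefetch_buffer_size	integer	Number of batches that the data pipeline prefetches. Default: 32.
num_preprocessing_threads	integer	Number of threads that the data pipeline uses for parallel prefetching. Default: 16.
datasets_use_caching	bool	Whether to cache the compressed input data in memory, at the cost of memory usage. Default: False (disabled).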

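Model parameters

Name	Type	Description
task_type	string	Task type. Valid values: pretrain (model pre-training, the default) and finetune (model fine-tuning).
model_name	string	Model to train. Valid values include all models under images/models. Set model_name to any model defined in the images/models/model_factory file. Default: inception_resnet_v2.
num_epochs	integer	Number of training epochs over the training set. Default: 100.
weight_decay	float	Weight decay coefficient used during model training. Default: 0.00004.
max_gradient_norm	float	Global norm threshold for gradient clipping. Default: None, which means no gradient clipping.
batch_size	integer	Number of samples that a single GPU processes in one iteration. Default: 32.
model_dir	string	Path from which a checkpoint is reloaded. Default: None, which means no fine-tuning.
ckpt_file_name	string	File name of the checkpoint to reload. Default: None.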
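
Because batch_size is per GPU, the number of iterations in one epoch depends on how many GPUs take part in training. The arithmetic can be sketched as follows. The source does not state how FastNN relates num_epochs, num_sample_per_epoch, and the step count, so treat this as a planning aid under that assumption. The helper names and the `num_gpus` argument are hypothetical.

```python
import math


def steps_per_epoch(num_sample_per_epoch, batch_size, num_gpus=1):
    """Iterations needed to see every sample once, given a per-GPU batch size."""
    global_batch = batch_size * num_gpus
    return int(math.ceil(num_sample_per_epoch / float(global_batch)))


def total_steps(num_sample_per_epoch, batch_size, num_epochs, num_gpus=1):
    """Iterations needed to train for num_epochs full passes over the data."""
    return steps_per_epoch(num_sample_per_epoch, batch_size, num_gpus) * num_epochs


# cifar10 training set (50000 samples), batch_size=128 on 8 GPUs:
# global batch = 1024, so ceil(50000 / 1024) = 49 steps per epoch.
cifar10_steps = steps_per_epoch(50000, batch_size=128, num_gpus=8)
```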

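Learning rate parameters

Name	Type	Description
warmup_steps	integer	Number of iterations over which the learning rate is warmed up (inverse decay). Default: 0.
warmup_scheme	string	Learning rate warmup method. The value t2t (Tensor2Tensor) starts at 1/100 of the specified learning rate and increases it exponentially until the specified learning rate is reached.
decay_scheme	string	Learning rate decay method. Valid values: luong234 (after 2/3 of the total iterations, decay 4 times by a factor of 1/2), luong5 (after 1/2 of the total iterations, decay 5 times by a factor of 1/2), and luong10 (after 1/2 of the total iterations, decay 10 times by a factor of 1/2).
learning_rate_decay_factor	float	Learning rate decay factor. Default: 0.94.
learning_rate_decay_type	string	Learning rate decay type. Valid values: fixed, exponential (default), and polynomial.
learning_rate	float	Initial learning rate. Default: 0.01.
end_learning_rate	float	Lower bound of the learning rate during decay. Default: 0.0001.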
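
The t2t warmup and the luong decay schemes described above can be sketched in plain Python. This follows the table descriptions only. FastNN's actual schedule may differ in details such as whether the decay is applied in discrete steps, which is what this sketch assumes. All function names here are hypothetical.

```python
import math

# scheme -> ((numerator, denominator) of the start fraction, number of halvings)
_LUONG_SCHEMES = {
    "luong234": ((2, 3), 4),
    "luong5": ((1, 2), 5),
    "luong10": ((1, 2), 10),
}


def warmup_factor_t2t(step, warmup_steps):
    """Start at 1/100 of the target rate and grow exponentially to 1.0."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 1.0
    return math.exp(math.log(0.01) * (warmup_steps - step) / warmup_steps)


def decay_factor_luong(step, total_steps, scheme):
    """Halve the rate a fixed number of times, spread evenly over the final phase."""
    (numerator, denominator), decay_times = _LUONG_SCHEMES[scheme]
    start_step = total_steps * numerator // denominator
    if step < start_step:
        return 1.0
    decay_every = max(1, (total_steps - start_step) // decay_times)
    halvings = min(decay_times, (step - start_step) // decay_every)
    return 0.5 ** halvings


def learning_rate_at(step, base_lr, total_steps, warmup_steps=0, scheme="luong234"):
    """Combine warmup and decay into the learning rate used at a given step."""
    return (base_lr
            * warmup_factor_t2t(step, warmup_steps)
            * decay_factor_luong(step, total_steps, scheme))
```

With total_steps=1200 and luong234, the rate stays at its base value until step 800, is halved at steps 900, 1000, 1100, and 1200, and ends at 1/16 of the base value.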

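Optimizer parameters

Name	Type	Description
optimizer	string	Name of the optimizer. Valid values: adadelta, adagrad, adam, ftrl, momentum, sgd, rmsprop, and adamweightdecay. Default: rmsprop.
adadelta_rho	float	Decay rate of Adadelta. Default: 0.95.
adagrad_initial_accumulator_value	float	Initial value of the AdaGrad accumulators. Default: 0.1. Specific to the AdaGrad optimizer.
adam_beta1	float	Exponential decay rate for the first-moment estimates. Default: 0.9. Specific to the Adam optimizer.
adam_beta2	float	Exponential decay rate for the second-moment estimates. Default: 0.999. Specific to the Adam optimizer.
opt_epsilon	float	Epsilon (offset) value of the optimizer. Default: 1.0. Specific to the Adam optimizer.
ftrl_learning_rate_power	float	Power of the learning rate parameter. Default: -0.5. Specific to the Ftrl optimizer.
ftrl_initial_accumulator_value	float	Initial value of the FTRL accumulators. Default: 0.1. Specific to the Ftrl optimizer.
ftrl_l1	float	FTRL l1 regularization term. Default: 0.0. Specific to the Ftrl optimizer.
ftrl_l2	float	FTRL l2 regularization term. Default: 0.0. Specific to the Ftrl optimizer.
momentum	float	Momentum of MomentumOptimizer. Default: 0.9. Specific to the Momentum optimizer.
rmsprop_momentum	float	Momentum of RMSPropOptimizer. Default: 0.9.
rmsprop_decay	float	Decay rate of RMSProp. Default: 0.9.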

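Log parameters

Name	Type	Description
stop_at_step	integer	Total number of training iterations. Default: 100.
log_loss_every_n_iters	integer	Frequency, in iterations, at which loss information is printed. Default: 10.
profile_every_n_iters	integer	Frequency, in iterations, at which the timeline is output. Default: 0.
profile_at_task	integer	Index of the machine that outputs the timeline. Default: 0, which corresponds to the chief worker.
log_device_placement	bool	Whether to output device placement information. Default: False.
print_model_statistics	bool	Whether to output information about trainable variables. Default: False.
hooks	string	Training hooks. Default: StopAtStepHook,ProfilerHook,LoggingTensorHook,CheckpointSaverHook.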

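Performance tuning parameters

Name	Type	Description
use_fp16	bool	Whether to train in half precision. Default: True.
loss_scale	float	Scaling factor applied to the loss value during training. Default: 1.0.
enable_paisoar	bool	Whether to use the PAISoar framework. Default: True.
protocol	string	Communication protocol. Default: grpc. On RDMA clusters, grpc+verbs can be used to improve data access efficiency.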
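
To illustrate why loss_scale matters when use_fp16 is enabled: scaling the loss scales every gradient by the same factor, which can keep very small gradients from rounding to zero in half precision. The sketch below simulates this with the half-precision format in Python's struct module. It is a generic demonstration of loss scaling, not FastNN or PAISoar code, and `to_fp16` and `fp16_gradient` are hypothetical names.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def fp16_gradient(gradient, loss_scale=1.0):
    """Simulate storing a scaled gradient in fp16, then unscaling it in higher precision."""
    scaled = to_fp16(gradient * loss_scale)
    return scaled / loss_scale


tiny_gradient = 1e-8
unscaled = fp16_gradient(tiny_gradient, loss_scale=1.0)    # underflows to 0.0
rescued = fp16_gradient(tiny_gradient, loss_scale=1024.0)  # survives the round trip
```

A scale that is too large can instead push large gradients past the fp16 maximum, so the value is usually chosen by experiment.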

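Develop the main file

If the existing models do not meet your needs, you can develop further by extending the dataset, models, and preprocessing interfaces. Before doing so, you need to understand the basic flow of the FastNN library. Taking images as an example, the entry file is train_image_classifiers.py, and the overall code flow is as follows.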

# Initialize the matching model in models from model_name to get network_fn. This may also return the input parameter train_image_size.
    network_fn = nets_factory.get_network_fn(
            FLAGS.model_name,
            num_classes=FLAGS.num_classes,
            weight_decay=FLAGS.weight_decay,
            is_training=(FLAGS.task_type in ['pretrain', 'finetune']))
# Initialize the matching data preprocessing function from model_name or preprocessing_name to get preprocessing_fn.
    preprocessing_fn = preprocessing_factory.get_preprocessing(
                FLAGS.model_name or FLAGS.preprocessing_name,
                is_training=(FLAGS.task_type in ['pretrain', 'finetune']))
# Select the correct tfrecord format from dataset_name and call preprocessing_fn to parse the dataset, which yields dataset_iterator.
    dataset_iterator = dataset_factory.get_dataset_iterator(FLAGS.dataset_name,
                                                            train_image_size,
                                                            preprocessing_fn,
                                                            data_sources,
    )
# Use network_fn and dataset_iterator to define loss_fn, the function that computes the loss.
    def loss_fn():
      # Pin only the input pipeline to the CPU. The model is built outside this device scope.
      with tf.device('/cpu:0'):
        images, labels = dataset_iterator.get_next()
      logits, end_points = network_fn(images)
      loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=tf.cast(logits, tf.float32), weights=1.0)
      if 'AuxLogits' in end_points:
        loss += tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=tf.cast(end_points['AuxLogits'], tf.float32), weights=0.4)
      return loss
# Wrap loss_fn and the native TensorFlow optimizer with the PAISoar API.
    opt = paisoar.ReplicatedVarsOptimizer(optimizer, clip_norm=FLAGS.max_gradient_norm)
    loss = opt.compute_loss(loss_fn, loss_scale=FLAGS.loss_scale)
# Define the training op from opt and loss.
    train_op = opt.minimize(loss, global_step=global_step)