Training and Inference with the TensorFlow Object Detection API

2024-02-15 19:57:08

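Overall workflow (using PASCAL VOC as an example)

1. Download the PASCAL VOC2012 dataset and convert it to TFRecord format.

2. Choose and download a pre-trained model.

3. Write the pipeline configuration file (all training parameters are set through this file).

4. Train the model.

5. Use TensorBoard to inspect curves such as loss and accuracy during training.

6. Freeze the model parameters.

7. Load the frozen .pb file to run prediction.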

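File layout

First create the directory structure below, then copy the label map file from models/research/object_detection/data into your own data directory.

label_map.pbtxt: defines the mapping between class ids and class names.

The directory structure is as follows: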

.
├── data/
│   ├── eval-00000-of-00001.tfrecord    # file
│   ├── label_map.pbtxt                 # file
│   ├── train-00000-of-00002.tfrecord   # file
│   └── train-00001-of-00002.tfrecord   # file
└── models/
    └── my_model_dir/
        ├── eval/                   # Created by evaluation job.
        ├── my_model.config         # pipeline config
        ├── model_ckpt-100-data@1   #
        ├── model_ckpt-100-index    # Created by training job.
        └── checkpoint              #

Copy the label map into place (using PASCAL VOC2012 as an example):

cp /xxx/models/research/object_detection/data/pascal_label_map.pbtxt ./data/
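
For a custom dataset there is no ready-made label map to copy, so one has to be written by hand or generated. The sketch below produces text in the same `item { id ... name ... }` layout that pascal_label_map.pbtxt uses. The function name `build_label_map` and the class names are illustrative assumptions, not part of the API; ids start at 1 because 0 is reserved for the background class.

```python
def build_label_map(class_names):
    """Return label-map text for the given class names, with ids starting at 1."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        # One "item" block per class, mirroring pascal_label_map.pbtxt.
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(items)


if __name__ == "__main__":
    # Hypothetical class list; replace with the classes of your own dataset.
    print(build_label_map(["aeroplane", "bicycle", "bird"]))
```

The returned string can be written to ./data/label_map.pbtxt and referenced from label_map_path in the pipeline config.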

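Preparing the input data

The TensorFlow Object Detection API reads data in TFRecord format. It provides two scripts, create_pascal_tf_record.py and create_pet_tf_record.py, for converting the PASCAL VOC and Pet datasets to TFRecord format.

Generating the PASCAL VOC TFRecord files

If the dataset is not available locally, download and unpack it with the following commands:

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xvf VOCtrainval_11-May-2012.tar

Use the following commands to convert PASCAL VOC to TFRecord format.

Example (the paths below point at a local VOC2007 copy; change data_dir and year to match your own dataset):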

# From tensorflow/models/research/
python object_detection/dataset_tools/create_pascal_tf_record.py \
    --label_map_path=/root/data/pascal_label_map.pbtxt \
    --data_dir=/data2/VOC2007/VOCdevkit --year=VOC2007 --set=train \
    --output_path=/root/data/pascal_train.record
python object_detection/dataset_tools/create_pascal_tf_record.py \
    --label_map_path=/root/data/pascal_label_map.pbtxt \
    --data_dir=/data2/VOC2007/VOCdevkit --year=VOC2007 --set=val \
    --output_path=/root/data/pascal_val.record

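data_dir: path to the PASCAL VOC dataset
output_path: where to save the generated TFRecord file

After running the commands above, pascal_train.record and pascal_val.record appear at the locations given by output_path (/root/data in this example).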
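
To make the conversion less of a black box, here is a minimal sketch of the core step: reading one PASCAL VOC annotation file and turning its pixel boxes into coordinates normalized by the image size. It uses only the standard library; `SAMPLE_XML` and `normalized_boxes` are made-up names for illustration, and the real script additionally encodes the image and writes tf.train.Example records.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written annotation in the PASCAL VOC XML layout (illustrative only).
SAMPLE_XML = """<annotation>
  <size><width>500</width><height>400</height></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>50</xmin><ymin>100</ymin><xmax>250</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>"""


def normalized_boxes(xml_text):
    """Parse a VOC annotation and return boxes scaled to the [0, 1] range."""
    root = ET.fromstring(xml_text)
    width = float(root.find("size/width").text)
    height = float(root.find("size/height").text)
    boxes = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        boxes.append({
            "name": obj.find("name").text,
            "xmin": float(box.find("xmin").text) / width,
            "ymin": float(box.find("ymin").text) / height,
            "xmax": float(box.find("xmax").text) / width,
            "ymax": float(box.find("ymax").text) / height,
        })
    return boxes


if __name__ == "__main__":
    for entry in normalized_boxes(SAMPLE_XML):
        print(entry)
```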

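Generating the COCO TFRecord files

The COCO dataset can be downloaded from the official COCO website.
Use the following command to convert COCO to TFRecord format.

Example (change the paths to your own):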

# From tensorflow/models/research/
python object_detection/dataset_tools/create_coco_tf_record.py --logtostderr \
  --train_image_dir=/data2/datasets/coco/train2017 \
  --val_image_dir=/data2/datasets/coco/val2017 \
  --test_image_dir=/data2/datasets/coco/unlabeled2017 \
  --train_annotations_file=/data2/datasets/coco/annotations/instances_train2017.json \
  --val_annotations_file=/data2/datasets/coco/annotations/instances_val2017.json \
  --testdev_annotations_file=/data2/datasets/coco/annotations/image_info_test-dev2017.json \
  --output_dir=/root/data

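After running the command above, a number of files whose names start with coco appear under output_dir (/root/data in this example).

Also copy the COCO label map (.pbtxt) into output_dir.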
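
COCO annotations store each box as [x, y, width, height] in pixels, while the detection pipeline works with normalized corner coordinates. The sketch below shows that conversion in isolation; the function name `coco_bbox_to_normalized` and the sample numbers are assumptions for illustration, not code taken from create_coco_tf_record.py.

```python
def coco_bbox_to_normalized(bbox, image_width, image_height):
    """Convert a COCO [x, y, width, height] pixel box to (ymin, xmin, ymax, xmax) in [0, 1]."""
    x, y, box_width, box_height = bbox
    ymin = y / image_height
    xmin = x / image_width
    ymax = (y + box_height) / image_height
    xmax = (x + box_width) / image_width
    return ymin, xmin, ymax, xmax


if __name__ == "__main__":
    # Hypothetical box on a 640x480 image.
    print(coco_bbox_to_normalized([64, 48, 128, 96], 640, 480))
```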

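Training and inference with TensorFlow 1

Configuring the training pipeline

The TensorFlow Object Detection API uses protobuf files to configure training and evaluation. The schema of the pipeline configuration is defined in object_detection/protos/pipeline.proto, and the object_detection/samples/configs folder provides sample configurations that can be used directly.

The contents of the configuration are described below.

The configuration file is split into five parts:

model: the type of model to train
train_config: the parameters used for training
eval_config: the metrics reported during evaluation
train_input_reader: the dataset the model is trained on
eval_input_reader: the dataset the model is evaluated on

The overall structure is as follows: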

model {
(... Add model config here...)
}

train_config : {
(... Add train_config here...)
}

train_input_reader: {
(... Add train_input configuration here...)
}

eval_config: {
}

eval_input_reader: {
(... Add eval_input configuration here...)
}

Choosing model parameters

Remember to change num_classes to match your own task.
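
When starting from a sample config, the num_classes value still reflects the dataset the sample was written for. The following sketch patches it with a plain-text substitution. It treats the config as a string rather than parsing the protobuf, which is a simplifying assumption; `set_num_classes` and `SAMPLE_CONFIG` are hypothetical names used only here.

```python
import re

# A cut-down config fragment for illustration; real configs contain many more fields.
SAMPLE_CONFIG = """model {
  faster_rcnn {
    num_classes: 90
  }
}
"""


def set_num_classes(config_text, num_classes):
    """Return config_text with every num_classes value replaced by num_classes."""
    new_text, count = re.subn(
        r"(num_classes\s*:\s*)\d+", r"\g<1>%d" % num_classes, config_text)
    if count == 0:
        raise ValueError("no num_classes field found in the config text")
    return new_text


if __name__ == "__main__":
    # PASCAL VOC has 20 object classes.
    print(set_num_classes(SAMPLE_CONFIG, 20))
```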

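Defining inputs

Inputs in TFRecord format are supported. You need to specify the locations of the training and evaluation files and of the label map. The training and evaluation datasets should use the same label map.

Example: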

tf_record_input_reader {
  input_path: "/usr/home/username/data/train.record"
}
label_map_path: "/usr/home/username/data/label_map.pbtxt"
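
The data/ directory shown earlier holds sharded files such as train-00000-of-00002.tfrecord. Sample configs shipped with the API write input_path as a wildcard pattern (for example with ????? in place of the shard index), so it appears that a pattern can stand in for the full list of shards; treat that as an assumption and verify it against your version. The sketch below only checks, in plain Python, which file names a given pattern would match. `match_shards` is a hypothetical helper.

```python
import fnmatch

# File names taken from the data/ directory layout described in this article.
DATA_FILES = [
    "eval-00000-of-00001.tfrecord",
    "label_map.pbtxt",
    "train-00000-of-00002.tfrecord",
    "train-00001-of-00002.tfrecord",
]


def match_shards(file_names, pattern):
    """Return the sorted file names that match a shell-style wildcard pattern."""
    return sorted(fnmatch.filter(file_names, pattern))


if __name__ == "__main__":
    print(match_shards(DATA_FILES, "train-?????-of-00002.tfrecord"))
```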

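Configuring the trainer

train_config defines three parts of the training process:

Model parameter initialization
Input preprocessing (optional)
SGD parameters

Example: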

batch_size: 1
optimizer {
  momentum_optimizer: {
    learning_rate: {
      manual_step_learning_rate {
        initial_learning_rate: 0.0002
        schedule {
          step: 0
          learning_rate: .0002
        }
        schedule {
          step: 900000
          learning_rate: .00002
        }
        schedule {
          step: 1200000
          learning_rate: .000002
        }
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
fine_tune_checkpoint: "/usr/home/username/tmp/model.ckpt-#####"
from_detection_checkpoint: true
load_all_detection_checkpoint_vars: true
gradient_clipping_by_norm: 10.0
data_augmentation_options {
  random_horizontal_flip {
  }
}
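
The manual_step_learning_rate block in the config above describes a piecewise-constant schedule. As a rough sketch of how such a schedule can be read, the function below returns the rate in effect at a given step. It assumes a new rate takes effect once the global step reaches the listed step (inclusive); that boundary behaviour and the function name `learning_rate_at` are assumptions, not taken from the API.

```python
# (step, learning_rate) pairs copied from the example config.
SCHEDULE = [(0, 0.0002), (900000, 0.00002), (1200000, 0.000002)]


def learning_rate_at(global_step, initial_learning_rate, schedule):
    """Return the learning rate in effect at global_step for a stepwise schedule."""
    rate = initial_learning_rate
    for boundary, learning_rate in sorted(schedule):
        # Each schedule entry overrides the rate once its step has been reached.
        if global_step >= boundary:
            rate = learning_rate
    return rate


if __name__ == "__main__":
    for step in (0, 899999, 900000, 1500000):
        print(step, learning_rate_at(step, 0.0002, SCHEDULE))
```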

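Configuring the evaluator

The main settings in eval_config are num_examples and metrics_set.

num_examples: the number of batches used in one evaluation cycle
metrics_set: which metrics to compute during evaluation

Model Parameter Initialization

On using checkpoints: the train_config section of the configuration file provides two fields for pointing at an existing checkpoint:

fine_tune_checkpoint: a path prefix to the checkpoint (e.g. "/usr/home/username/checkpoint/model.ckpt-#####").
fine_tune_checkpoint_type: classification or detection

A list of classification checkpoints can be found in the TF-Slim pre-trained models list.

A list of detection checkpoints can be found in the TensorFlow detection model zoo.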
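
Because fine_tune_checkpoint is a path prefix rather than a file, a typo in it is easy to miss until training starts. The sketch below is a small pre-flight check that assumes the usual TensorFlow checkpoint naming, where a prefix is accompanied by a `.index` file and one or more `.data-*` files; `checkpoint_prefix_exists` is a hypothetical helper, not an API function.

```python
import glob
import os


def checkpoint_prefix_exists(prefix):
    """Return True if both the .index file and a .data-* shard exist for prefix."""
    has_index = os.path.isfile(prefix + ".index")
    # glob.escape keeps special characters in the prefix from acting as wildcards.
    has_data = bool(glob.glob(glob.escape(prefix) + ".data-*"))
    return has_index and has_data


if __name__ == "__main__":
    # Hypothetical prefix; replace with the value used in your pipeline config.
    print(checkpoint_prefix_exists("/usr/home/username/checkpoint/model.ckpt"))
```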

Training

Single machine, single GPU

Template:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
NUM_TRAIN_STEPS=50000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --num_train_steps=${NUM_TRAIN_STEPS} \
    --sample_1_of_n_eval_examples=${SAMPLE_1_OF_N_EVAL_EXAMPLES} \
    --alsologtostderr

Example:

python object_detection/model_main.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --model_dir=/root/my_models/checkpoint \
    --num_train_steps=1

${PIPELINE_CONFIG_PATH}: path to the pipeline config
${MODEL_DIR}: directory where the checkpoints produced by training are saved
num_train_steps: number of training steps
num_workers (a flag of model_main_tf2.py rather than model_main.py):
- = 1: MirroredStrategy
- > 1: MultiWorkerMirroredStrategy

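Single machine, multiple GPUs

Multi-GPU training on a single machine uses a different launcher script from single-GPU training.

Example:

CUDA_VISIBLE_DEVICES=0,1 python object_detection/legacy/train.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --train_dir=/root/my_models/checkpoint \
    --num_clones=2 \
    --ps_tasks=1

train_dir: directory where the checkpoints produced by training are saved
num_clones: usually equal to the number of GPUs
ps_tasks: number of parameter servers. Default: 0 (no parameter server is used)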
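
A batch_size smaller than num_clones is a common cause of failures in multi-GPU training. The sketch below is a pre-flight check under the assumption that the legacy trainer splits the configured batch across clones with integer division, so each clone must receive at least one example. The function name `per_clone_batch_size` is made up for illustration.

```python
def per_clone_batch_size(batch_size, num_clones):
    """Return the batch size each clone would get, or raise if a clone gets nothing."""
    if num_clones < 1:
        raise ValueError("num_clones must be at least 1")
    per_clone = batch_size // num_clones
    if per_clone < 1:
        raise ValueError(
            "batch_size (%d) must be at least num_clones (%d)" % (batch_size, num_clones))
    return per_clone


if __name__ == "__main__":
    print(per_clone_batch_size(2, 2))
```

Running this check against the values in the pipeline config and the --num_clones flag before launching train.py catches the mismatch early.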

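Multiple machines, multiple GPUs

No official instructions are given for multi-machine, multi-GPU training. One approach found through Google implements distributed training on a Hadoop cluster.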

Evaluation

Single machine, single GPU

Template:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
CHECKPOINT_DIR=${MODEL_DIR}
python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --alsologtostderr

Example:

python object_detection/model_main.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --model_dir=/root/my_models \
    --checkpoint_dir=/root/my_models/checkpoint

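${CHECKPOINT_DIR}: directory containing the checkpoints produced by training. When this flag is set, the script runs in eval-only mode and the evaluation metrics are written under model_dir.
${MODEL_DIR}/eval: directory where the evaluation events are written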

Single machine, multiple GPUs

Example:

CUDA_VISIBLE_DEVICES=0,1 python object_detection/legacy/eval.py \
    --checkpoint_dir=/root/my_models/checkpoint \
    --eval_dir=/root/my_models/eval \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config

Training and inference with TensorFlow 2

Training

Template:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
python object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --alsologtostderr

Example:

python object_detection/model_main_tf2.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --model_dir=/root/my_models/checkpoint

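${PIPELINE_CONFIG_PATH}: path to the pipeline config
${MODEL_DIR}: directory where the checkpoints produced by training are saved

Note: under TF2 the script uses MirroredStrategy() by default, which trains on all GPUs of the current machine. To use only some of them, specify the devices explicitly, e.g. strategy = tf.compat.v2.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"]), which uses GPUs 0 and 1.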
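
An alternative that avoids editing the training script is to restrict which GPUs the process can see, in the same way the TensorFlow 1 multi-GPU commands in this article set CUDA_VISIBLE_DEVICES. The sketch below only assembles the command line and environment; `build_training_command` is a hypothetical helper, and the paths are the ones used in the example above.

```python
import os


def build_training_command(pipeline_config_path, model_dir, gpu_ids):
    """Return (argv, env) for launching model_main_tf2.py on the given GPUs only."""
    env = dict(os.environ)
    # Only the listed GPUs are visible to the child process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    argv = [
        "python",
        "object_detection/model_main_tf2.py",
        "--pipeline_config_path=" + pipeline_config_path,
        "--model_dir=" + model_dir,
        "--alsologtostderr",
    ]
    return argv, env


if __name__ == "__main__":
    argv, env = build_training_command(
        "/root/my_models/faster_rcnn_resnet101_voc07.config",
        "/root/my_models/checkpoint",
        [0, 1],
    )
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

To actually start training, the pair can be passed to subprocess.run(argv, env=env, check=True) from the tensorflow/models/research/ directory.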

Evaluation

Template:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
CHECKPOINT_DIR=${MODEL_DIR}
python object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --alsologtostderr

Example:

python object_detection/model_main_tf2.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --model_dir=/root/my_models/checkpoint \
    --checkpoint_dir=/root/my_models/checkpoint

${CHECKPOINT_DIR}: directory containing the checkpoints produced by training
${MODEL_DIR}/eval: directory where the evaluation events are saved

Multiple machines, multiple GPUs

See the multi-machine, multi-GPU part of the TensorFlow 1.X section.

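Common problems

Multi-GPU training on a single machine fails with: ValueError: not enough values to unpack (expected 7, got 0)

batch_size in the configuration file was set to 1. It needs to be set to the same value as num_clones.

Using a Faster R-CNN model under TensorFlow 2.X fails with: RuntimeError: Groundtruth tensor boxes has not been provided

This is a bug introduced by an update to the TensorFlow Object Detection API some time after February 2021. Check out an older commit (31e86e8) and then reinstall the Object Detection API.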
