Tech Notes · June 18, 2024

A Tutorial on Optimizing YOLOv10 with Intel® OpenVINO™

Real-time object detection aims to accurately predict object categories and positions in images with low latency. The YOLO series has been at the forefront of this research thanks to its balance between performance and efficiency. However, reliance on non-maximum suppression (NMS) and architectural inefficiencies have hindered optimal performance. YOLOv10 addresses these issues by introducing consistent dual assignments for NMS-free training and a holistic efficiency-accuracy driven model design strategy.

YOLOv10, developed by researchers at Tsinghua University on top of the Ultralytics Python package, introduces a new approach to real-time object detection that fixes the post-processing and model-architecture shortcomings of previous YOLO versions. By eliminating NMS and optimizing various model components, YOLOv10 achieves state-of-the-art performance while significantly reducing computational overhead. Extensive experiments demonstrate excellent accuracy-latency trade-offs across multiple model scales.

Tutorial Overview

This tutorial provides step-by-step instructions for running and optimizing the PyTorch YOLOv10 model with OpenVINO.

The tutorial consists of the following steps:

  • Prepare the PyTorch model
  • Convert the PyTorch model to OpenVINO IR
  • Run model inference with OpenVINO
  • Prepare and run an optimization pipeline with NNCF
  • Compare the performance of the FP16 and quantized models
  • Run optimized model inference on video
  • Launch an interactive Gradio demo

Usage Notes

This is a self-contained example that relies entirely on its own code.
For a development environment with the latest OpenVINO release and Jupyter Notebook, see the article on this site linked below; it provides a Docker image that bundles the latest OpenVINO 2024.2 release with a Jupyter Notebook environment and can be run directly.

https://blog.yanghong.dev/intel-openvino-2024-2-release/

Optimization Steps

1. Prerequisites

import os

# Work around Git's clone-protection check so pip can install packages directly from GitHub
os.environ["GIT_CLONE_PROTECTION_ACTIVE"] = "false"

%pip install -q "nncf>=2.11.0"
%pip install --pre -Uq openvino --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
%pip install -q "git+https://github.com/THU-MIG/yolov10.git" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "torch>=2.1" "torchvision>=0.16" tqdm opencv-python "gradio>=4.19" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install ipywidgets

If you are using the prebuilt Docker image from the blog post above, comment out the OpenVINO nightly install command (%pip install --pre -Uq openvino --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly) and do not run it; otherwise you will hit a version-conflict error, because OpenVINO 2024.2 is already installed in the Docker image.


Note: after this step completes, restart the Python kernel, or simply restart the OpenVINO container.

from pathlib import Path

# Fetch `notebook_utils` module
import requests

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)

open("notebook_utils.py", "w").write(r.text)

from notebook_utils import download_file, VideoPlayer

2. Download the PyTorch Model

The model authors provide several versions of the YOLOv10 model, each with a different trade-off between parameter count, performance, and accuracy. For demonstration purposes we will use yolov10n, but the same steps apply to the other models in the YOLOv10 family (a variant sketch follows the download code below).

models_dir = Path("./models")
models_dir.mkdir(exist_ok=True)
model_weights_url = "https://github.com/jameslahm/yolov10/releases/download/v1.0/yolov10n.pt"
file_name = model_weights_url.split("/")[-1]
model_name = file_name.replace(".pt", "")

download_file(model_weights_url, directory=models_dir)
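
If you want to try a different variant, only the checkpoint file name changes. A minimal sketch, assuming the variant names published in the upstream v1.0 release assets:

# Checkpoints published alongside yolov10n in the same v1.0 release; swap the
# variant name to repeat the tutorial with a larger (or smaller) model.
YOLOV10_VARIANTS = ["yolov10n", "yolov10s", "yolov10m", "yolov10b", "yolov10l", "yolov10x"]
variant = "yolov10s"  # illustrative choice
model_weights_url = f"https://github.com/jameslahm/yolov10/releases/download/v1.0/{variant}.pt"
download_file(model_weights_url, directory=models_dir)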

3. Export the PyTorch Model to OpenVINO IR Format

As mentioned before, the YOLOv10 code is built on top of the Ultralytics library and has an interface similar to YOLOv8 (you can check the YOLOv8 notebooks for more detailed instructions on using the Ultralytics API). Ultralytics supports exporting OpenVINO models via the export method of the model class, and we can additionally specify the parameters responsible for the target input size, static or dynamic input shape, and model precision (FP32/FP16/INT8). INT8 quantization can also be performed at the export stage, but to keep the approach more flexible we will instead perform quantization with NNCF (a sketch of the export-time alternative follows the export code below).

import types
from ultralytics.utils import ops, yaml_load, yaml_save
from ultralytics import YOLOv10
import torch

detection_labels = {
    0: "person",
    1: "bicycle",
    2: "car",
    3: "motorcycle",
    4: "airplane",
    5: "bus",
    6: "train",
    7: "truck",
    8: "boat",
    9: "traffic light",
    10: "fire hydrant",
    11: "stop sign",
    12: "parking meter",
    13: "bench",
    14: "bird",
    15: "cat",
    16: "dog",
    17: "horse",
    18: "sheep",
    19: "cow",
    20: "elephant",
    21: "bear",
    22: "zebra",
    23: "giraffe",
    24: "backpack",
    25: "umbrella",
    26: "handbag",
    27: "tie",
    28: "suitcase",
    29: "frisbee",
    30: "skis",
    31: "snowboard",
    32: "sports ball",
    33: "kite",
    34: "baseball bat",
    35: "baseball glove",
    36: "skateboard",
    37: "surfboard",
    38: "tennis racket",
    39: "bottle",
    40: "wine glass",
    41: "cup",
    42: "fork",
    43: "knife",
    44: "spoon",
    45: "bowl",
    46: "banana",
    47: "apple",
    48: "sandwich",
    49: "orange",
    50: "broccoli",
    51: "carrot",
    52: "hot dog",
    53: "pizza",
    54: "donut",
    55: "cake",
    56: "chair",
    57: "couch",
    58: "potted plant",
    59: "bed",
    60: "dining table",
    61: "toilet",
    62: "tv",
    63: "laptop",
    64: "mouse",
    65: "remote",
    66: "keyboard",
    67: "cell phone",
    68: "microwave",
    69: "oven",
    70: "toaster",
    71: "sink",
    72: "refrigerator",
    73: "book",
    74: "clock",
    75: "vase",
    76: "scissors",
    77: "teddy bear",
    78: "hair drier",
    79: "toothbrush",
}


def v10_det_head_forward(self, x):
    one2one = self.forward_feat([xi.detach() for xi in x], self.one2one_cv2, self.one2one_cv3)
    if not self.export:
        one2many = super().forward(x)

    if not self.training:
        one2one = self.inference(one2one)
        if not self.export:
            return {"one2many": one2many, "one2one": one2one}
        else:
            assert self.max_det != -1
            boxes, scores, labels = ops.v10postprocess(one2one.permute(0, 2, 1), self.max_det, self.nc)
            return torch.cat(
                [boxes, scores.unsqueeze(-1), labels.unsqueeze(-1).to(boxes.dtype)],
                dim=-1,
            )
    else:
        return {"one2many": one2many, "one2one": one2one}


ov_model_path = models_dir / f"{model_name}_openvino_model/{model_name}.xml"
if not ov_model_path.exists():
    model = YOLOv10(models_dir / file_name)
    model.model.model[-1].forward = types.MethodType(v10_det_head_forward, model.model.model[-1])
    model.export(format="openvino", dynamic=True, half=True)
    config = yaml_load(ov_model_path.parent / "metadata.yaml")
    config["names"] = detection_labels
    yaml_save(ov_model_path.parent / "metadata.yaml", config)
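
For reference, Ultralytics can also quantize during export. A hedged sketch of that path (the int8 and data arguments follow the Ultralytics export API; whether the YOLOv10 fork exposes them identically is an assumption, and this tutorial does not use this path):

EXPORT_TIME_INT8 = False  # flip to try Ultralytics' built-in quantization path

if EXPORT_TIME_INT8:
    model = YOLOv10(models_dir / file_name)
    # int8=True asks the exporter to quantize the model, calibrating on the
    # dataset named by `data`; below we quantize explicitly with NNCF instead.
    model.export(format="openvino", dynamic=True, int8=True, data="coco128.yaml")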

4. Run OpenVINO Inference on the AUTO Device Using the Ultralytics API

Now that we have exported the model to OpenVINO, we can load it directly into the YOLOv10 class, where the automatic inference backend provides an easy-to-use experience for running the OpenVINO YOLOv10 model in much the same way as the original PyTorch model. The code below demonstrates how to run inference on a single image with the exported OpenVINO model using the Ultralytics API. The AUTO device is used to launch the model.

ov_yolo_model = YOLOv10(ov_model_path.parent, task="detect")
from PIL import Image

IMAGE_PATH = Path("./data/coco_bike.jpg")
download_file(
    url="https://storage.openvinotoolkit.org/repositories/openvino_notebooks/data/data/image/coco_bike.jpg",
    filename=IMAGE_PATH.name,
    directory=IMAGE_PATH.parent,
)
res = ov_yolo_model(IMAGE_PATH, iou=0.45, conf=0.2)
Image.fromarray(res[0].plot()[:, :, ::-1])
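
Besides plotting, the detections can also be inspected programmatically. A small sketch using the Ultralytics Results API (attribute names follow ultralytics.engine.results):

# Print each detected box with its class name and confidence.
boxes = res[0].boxes
for xyxy, conf, cls_id in zip(boxes.xyxy, boxes.conf, boxes.cls):
    print(f"{detection_labels[int(cls_id)]}: {float(conf):.2f} at {[round(float(v), 1) for v in xyxy]}")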

5. Run OpenVINO Inference on a Selected Device Using the Ultralytics API

In this part, you can select an inference device for running the model, so the results can be compared with the AUTO device.

import openvino as ov

import ipywidgets as widgets

core = ov.Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="CPU",
    description="Device:",
    disabled=False,
)

device

As you can see, my machine supports three device options: CPU, GPU (integrated graphics), and AUTO. On a 14th-generation (Meteor Lake) platform, an NPU option should also be available.
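
Optionally, you can print the full name of each detected device. A small sketch using the standard OpenVINO FULL_DEVICE_NAME property:

# Query a human-readable name for every device OpenVINO can see.
for dev in core.available_devices:
    print(f"{dev}: {core.get_property(dev, 'FULL_DEVICE_NAME')}")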

ov_model = core.read_model(ov_model_path)

# load model on selected device
if "GPU" in device.value or "NPU" in device.value:
    ov_model.reshape({0: [1, 3, 640, 640]})
ov_config = {}
if "GPU" in device.value:
    ov_config = {"GPU_DISABLE_WINOGRAD_CONVOLUTION": "YES"}
det_compiled_model = core.compile_model(ov_model, device.value, ov_config)
ov_yolo_model.predictor.model.ov_compiled_model = det_compiled_model
res = ov_yolo_model(IMAGE_PATH, iou=0.45, conf=0.2)
Image.fromarray(res[0].plot()[:, :, ::-1])

6. Optimize the Model with the NNCF Post-Training Quantization API

NNCF provides a suite of advanced algorithms for optimizing neural network inference in OpenVINO with minimal accuracy drop. We will use 8-bit quantization in post-training mode (without a fine-tuning pipeline) to optimize YOLOv10.

The optimization process contains the following steps:

  • Create a dataset for quantization.
  • Run nncf.quantize to obtain an optimized model.
  • Serialize the OpenVINO IR model using the openvino.save_model function.

Quantization is a time- and memory-consuming process; you can skip it with the checkbox below:

import ipywidgets as widgets

int8_model_det_path = models_dir / "int8" / f"{model_name}_openvino_model/{model_name}.xml"
ov_yolo_int8_model = None

to_quantize = widgets.Checkbox(
    value=True,
    description="Quantization",
    disabled=False,
)

to_quantize
# Fetch skip_kernel_extension module
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py",
)
open("skip_kernel_extension.py", "w").write(r.text)

%load_ext skip_kernel_extension

7. Prepare the Quantization Dataset

To start quantization, we need to prepare a dataset. We will use the validation subset of the MS COCO dataset for model quantization, and the Ultralytics validation data loader to prepare the input data.

%%skip not $to_quantize.value

from zipfile import ZipFile

from ultralytics.data.utils import DATASETS_DIR

if not int8_model_det_path.exists():

    DATA_URL = "http://images.cocodataset.org/zips/val2017.zip"
    LABELS_URL = "https://github.com/ultralytics/yolov5/releases/download/v1.0/coco2017labels-segments.zip"
    CFG_URL = "https://raw.githubusercontent.com/ultralytics/ultralytics/v8.1.0/ultralytics/cfg/datasets/coco.yaml"
    
    OUT_DIR = DATASETS_DIR
    
    DATA_PATH = OUT_DIR / "val2017.zip"
    LABELS_PATH = OUT_DIR / "coco2017labels-segments.zip"
    CFG_PATH = OUT_DIR / "coco.yaml"
    
    download_file(DATA_URL, DATA_PATH.name, DATA_PATH.parent)
    download_file(LABELS_URL, LABELS_PATH.name, LABELS_PATH.parent)
    download_file(CFG_URL, CFG_PATH.name, CFG_PATH.parent)
    
    if not (OUT_DIR / "coco/labels").exists():
        with ZipFile(LABELS_PATH, "r") as zip_ref:
            zip_ref.extractall(OUT_DIR)
        with ZipFile(DATA_PATH, "r") as zip_ref:
            zip_ref.extractall(OUT_DIR / "coco/images")

The warnings above do not affect usage. If they bother you, you can adjust the corresponding settings in ~/jupyter_notebook_config.py.

%%skip not $to_quantize.value

from ultralytics.utils import DEFAULT_CFG
from ultralytics.cfg import get_cfg
from ultralytics.data.converter import coco80_to_coco91_class
from ultralytics.data.utils import check_det_dataset

if not int8_model_det_path.exists():
    args = get_cfg(cfg=DEFAULT_CFG)
    args.data = str(CFG_PATH)
    det_validator = ov_yolo_model.task_map[ov_yolo_model.task]["validator"](args=args)
    
    det_validator.data = check_det_dataset(args.data)
    det_validator.stride = 32
    det_data_loader = det_validator.get_dataloader(OUT_DIR / "coco", 1)

NNCF provides the nncf.Dataset wrapper for using native framework dataloaders in the quantization pipeline. Additionally, we specify a transform function that is responsible for preparing the input data in the format expected by the model.

%%skip not $to_quantize.value

import nncf
from typing import Dict


def transform_fn(data_item: Dict):
    """
    Quantization transform function. Extracts and preprocesses input data from a dataloader item for quantization.
    Parameters:
       data_item: Dict with data item produced by DataLoader during iteration
    Returns:
        input_tensor: Input data for quantization
    """
    input_tensor = det_validator.preprocess(data_item)["img"].numpy()
    return input_tensor

if not int8_model_det_path.exists():
    quantization_dataset = nncf.Dataset(det_data_loader, transform_fn)

8. Quantize and Save the INT8 Model

The nncf.quantize function provides an interface for model quantization. It requires an instance of the OpenVINO model and a quantization dataset. Optionally, some additional parameters for configuring the quantization process can be provided (number of quantization samples, preset, ignored scope, etc.). The YOLOv10 model contains non-ReLU activation functions, which require asymmetric quantization of the activations. For better results, we will use the mixed quantization preset, which gives symmetric quantization of the weights and asymmetric quantization of the activations.

Note: post-training model quantization is a time-consuming process. Please be patient; it may take several minutes depending on your hardware.

%%skip not $to_quantize.value

import shutil

if not int8_model_det_path.exists():
    quantized_det_model = nncf.quantize(
        ov_model,
        quantization_dataset,
        preset=nncf.QuantizationPreset.MIXED,
    )

    ov.save_model(quantized_det_model, int8_model_det_path)
    shutil.copy(ov_model_path.parent / "metadata.yaml", int8_model_det_path.parent / "metadata.yaml")
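
As mentioned above, nncf.quantize accepts additional tuning parameters. A hedged sketch of what that can look like (subset_size and IgnoredScope follow the NNCF API; the values here are illustrative and not used in this tutorial):

# Illustrative only: calibrate on 300 samples and keep selected op types in
# floating point if quantizing them hurts accuracy too much.
quantized_det_model = nncf.quantize(
    ov_model,
    quantization_dataset,
    preset=nncf.QuantizationPreset.MIXED,
    subset_size=300,
    ignored_scope=nncf.IgnoredScope(types=["Sigmoid"]),
)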

9. Run Inference with the Optimized Model

The INT8 quantized model is used in the same way as the model before quantization. Let's check the quantized model's inference result on a single image.

9.1 Run the Optimized Model on the AUTO Device

%%skip not $to_quantize.value
ov_yolo_int8_model = YOLOv10(int8_model_det_path.parent, task="detect")
%%skip not $to_quantize.value
res = ov_yolo_int8_model(IMAGE_PATH, iou=0.45, conf=0.2)
Image.fromarray(res[0].plot()[:, :, ::-1])

9.2 Run the Optimized Model on a Selected Device

%%skip not $to_quantize.value

device
%%skip not $to_quantize.value

quantized_det_model = core.read_model(int8_model_det_path)
# GPU and NPU plugins need a static input shape, so reshape the quantized model
if "GPU" in device.value or "NPU" in device.value:
    quantized_det_model.reshape({0: [1, 3, 640, 640]})
ov_config = {}
if "GPU" in device.value:
    ov_config = {"GPU_DISABLE_WINOGRAD_CONVOLUTION": "YES"}

quantized_det_compiled_model = core.compile_model(quantized_det_model, device.value, ov_config)

ov_yolo_int8_model.predictor.model.ov_compiled_model = quantized_det_compiled_model

res = ov_yolo_int8_model(IMAGE_PATH, iou=0.45, conf=0.2)
Image.fromarray(res[0].plot()[:, :, ::-1])

10. Compare the Original and Optimized Models

10.1 Model Size

ov_model_weights = ov_model_path.with_suffix(".bin")
print(f"Size of FP16 model is {ov_model_weights.stat().st_size / 1024 / 1024:.2f} MB")
if int8_model_det_path.exists():
    ov_int8_weights = int8_model_det_path.with_suffix(".bin")
    print(f"Size of model with INT8 compressed weights is {ov_int8_weights.stat().st_size / 1024 / 1024:.2f} MB")
    print(f"Compression rate for INT8 model: {ov_model_weights.stat().st_size / ov_int8_weights.stat().st_size:.3f}")

10.2 Performance

10.2.1 FP16 Performance

!benchmark_app -m $ov_model_path -d $device.value -api async -shape "[1,3,640,640]" -t 15

10.2.2 INT8 Performance

if int8_model_det_path.exists():
    !benchmark_app -m $int8_model_det_path -d $device.value -api async -shape "[1,3,640,640]" -t 15

As you can see, the INT8 model delivers roughly twice the throughput of the FP16 model.

11. Video Demo

The following code runs model inference on a video:

import collections
import time
from IPython import display
import cv2
import numpy as np


# Main processing function to run object detection.
def run_object_detection(
    source=0,
    flip=False,
    use_popup=False,
    skip_first_frames=0,
    det_model=ov_yolo_int8_model,
    device=device.value,
):
    player = None
    try:
        # Create a video player to play with target fps.
        player = VideoPlayer(source=source, flip=flip, fps=30, skip_first_frames=skip_first_frames)
        # Start capturing.
        player.start()
        if use_popup:
            title = "Press ESC to Exit"
            cv2.namedWindow(winname=title, flags=cv2.WINDOW_GUI_NORMAL | cv2.WINDOW_AUTOSIZE)

        processing_times = collections.deque()
        while True:
            # Grab the frame.
            frame = player.next()
            if frame is None:
                print("Source ended")
                break
            # If the frame is larger than full HD, reduce size to improve the performance.
            scale = 1280 / max(frame.shape)
            if scale < 1:
                frame = cv2.resize(
                    src=frame,
                    dsize=None,
                    fx=scale,
                    fy=scale,
                    interpolation=cv2.INTER_AREA,
                )
            # Get the results.
            input_image = np.array(frame)

            start_time = time.time()
            detections = det_model(input_image, iou=0.45, conf=0.2, verbose=False)
            stop_time = time.time()
            frame = detections[0].plot()

            processing_times.append(stop_time - start_time)
            # Use processing times from last 200 frames.
            if len(processing_times) > 200:
                processing_times.popleft()

            _, f_width = frame.shape[:2]
            # Mean processing time [ms].
            processing_time = np.mean(processing_times) * 1000
            fps = 1000 / processing_time
            cv2.putText(
                img=frame,
                text=f"Inference time: {processing_time:.1f}ms ({fps:.1f} FPS)",
                org=(20, 40),
                fontFace=cv2.FONT_HERSHEY_COMPLEX,
                fontScale=f_width / 1000,
                color=(0, 0, 255),
                thickness=1,
                lineType=cv2.LINE_AA,
            )
            # Use this workaround if there is flickering.
            if use_popup:
                cv2.imshow(winname=title, mat=frame)
                key = cv2.waitKey(1)
                # escape = 27
                if key == 27:
                    break
            else:
                # Encode numpy array to jpg.
                _, encoded_img = cv2.imencode(ext=".jpg", img=frame, params=[cv2.IMWRITE_JPEG_QUALITY, 100])
                # Create an IPython image.
                i = display.Image(data=encoded_img)
                # Display the image in this notebook.
                display.clear_output(wait=True)
                display.display(i)
    # ctrl-c
    except KeyboardInterrupt:
        print("Interrupted")
    # any different error
    except RuntimeError as e:
        print(e)
    finally:
        if player is not None:
            # Stop capturing.
            player.stop()
        if use_popup:
            cv2.destroyAllWindows()
use_int8 = widgets.Checkbox(
    value=ov_yolo_int8_model is not None,
    description="Use int8 model",
    disabled=ov_yolo_int8_model is None,
)

use_int8
WEBCAM_INFERENCE = False

if WEBCAM_INFERENCE:
    VIDEO_SOURCE = 0  # Webcam
else:
    download_file(
        "https://storage.openvinotoolkit.org/repositories/openvino_notebooks/data/data/video/people.mp4",
        directory="data",
    )
    VIDEO_SOURCE = "data/people.mp4"
run_object_detection(
    det_model=ov_yolo_model if not use_int8.value else ov_yolo_int8_model,
    source=VIDEO_SOURCE,
    flip=True,
    use_popup=False,
)

12. Interactive Gradio Demo

import gradio as gr


def yolov10_inference(image, int8, conf_threshold, iou_threshold):
    model = ov_yolo_model if not int8 else ov_yolo_int8_model
    results = model(source=image, iou=iou_threshold, conf=conf_threshold, verbose=False)[0]
    annotated_image = Image.fromarray(results.plot())

    return annotated_image


with gr.Blocks() as demo:
    gr.HTML(
        """
    <h1 style='text-align: center'>
    YOLOv10: Real-Time End-to-End Object Detection using OpenVINO
    </h1>
    """
    )
    with gr.Row():
        with gr.Column():
            image = gr.Image(type="numpy", label="Image")
            conf_threshold = gr.Slider(
                label="Confidence Threshold",
                minimum=0.1,
                maximum=1.0,
                step=0.1,
                value=0.2,
            )
            iou_threshold = gr.Slider(
                label="IoU Threshold",
                minimum=0.1,
                maximum=1.0,
                step=0.1,
                value=0.45,
            )
            use_int8 = gr.Checkbox(
                value=ov_yolo_int8_model is not None,
                visible=ov_yolo_int8_model is not None,
                label="Use INT8 model",
            )
            yolov10_infer = gr.Button(value="Detect Objects")

        with gr.Column():
            output_image = gr.Image(type="pil", label="Annotated Image")

        yolov10_infer.click(
            fn=yolov10_inference,
            inputs=[
                image,
                use_int8,
                conf_threshold,
                iou_threshold,
            ],
            outputs=[output_image],
        )
    examples = gr.Examples(
        [
            "data/coco_bike.jpg",
        ],
        inputs=[
            image,
        ],
    )


try:
    demo.launch(debug=True)
except Exception:
    demo.launch(debug=True, share=True)