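Intel Officially Releases OpenVINO™ 2023.3

On January 24, 2024, Intel officially released OpenVINO™ 2023.3 (Release Notes for Intel Distribution of OpenVINO Toolkit 2023.3). OpenVINO™ is a deep learning toolkit that Intel develops for its own hardware platforms. It includes an inference library, model optimization, and other features for deploying deep learning models. It is a comprehensive toolkit for quickly building applications and solutions for tasks such as emulating human vision, automatic speech recognition, natural language processing, and recommendation systems. Built on the latest generation of artificial neural networks, including convolutional neural networks (CNNs) and recurrent and attention-based networks, the toolkit extends computer vision and non-vision workloads across Intel® hardware to maximize performance. It accelerates applications with high-performance AI and deep learning inference from edge to cloud.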

Major new features and improvements in OpenVINO toolkit 2023.3 LTS

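More generative AI coverage and framework integrations to minimize code changes.
- Introduced the OpenVINO Gen AI project on GitHub, which demonstrates native C and C++ pipeline samples for large language models (LLMs). String tensors are now supported as inputs and in tokenizers, to reduce overhead and ease production.
- Newly validated models: Mistral, Zephyr, Qwen, ChatGLM3, and Baichuan.
- New Jupyter Notebooks for Latent Consistency Models (LCM) and Distil-Whisper. The LLM Chatbot notebook has been updated to include LangChain, Neural Chat, TinyLlama, ChatGLM3, Qwen, Notus, and Youri models.
- Torch.compile is now fully integrated with OpenVINO and includes a hardware ‘options’ parameter, which allows seamless inference hardware selection by leveraging the plugin architecture in OpenVINO.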
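
As a minimal sketch of the torch.compile integration described above: the helper names `openvino_compile_kwargs` and `compile_with_openvino` are hypothetical, and the sketch assumes that PyTorch and OpenVINO are installed and that importing `openvino.torch` registers the "openvino" backend. The device name is passed through the `options` dictionary.

```python
def openvino_compile_kwargs(device: str = "CPU") -> dict:
    """Keyword arguments for torch.compile that target the OpenVINO backend.

    The device string (for example "CPU" or "GPU") is forwarded through the
    'options' parameter mentioned in the release notes.
    """
    return {"backend": "openvino", "options": {"device": device}}


def compile_with_openvino(model, device: str = "CPU"):
    """Hypothetical wrapper: compile a torch.nn.Module for an OpenVINO device."""
    # Imported lazily so this file can be loaded without PyTorch or OpenVINO.
    import torch
    import openvino.torch  # noqa: F401  (assumed to register the "openvino" backend)

    return torch.compile(model, **openvino_compile_kwargs(device))
```

Calling `compile_with_openvino(my_model, "GPU")` would then return a compiled module whose first invocation triggers the actual compilation. `my_model` stands for any PyTorch module you already have.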

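Broader large language model (LLM) support and more model compression techniques.
- As part of the Neural Network Compression Framework (NNCF), the INT4 weight compression model format is now fully supported on Intel® Xeon® CPUs in addition to Intel® Core™ and iGPU, bringing higher performance, lower memory usage, and more accuracy options when running LLMs.
- Improved performance of transformer-based LLMs on CPU and GPU by using the stateful model technique to increase memory efficiency, where internal states are shared among multiple iterations of inference.
- Tokenizer and TorchVision transform support is now available in the OpenVINO runtime (via a new API), requiring less preprocessing code and enhancing performance by automatically handling this model setup. For more details on Tokenizer support, see the Ecosystem section.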
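
The memory savings from weight compression can be estimated with simple arithmetic. The sketch below is a rough estimate only: it assumes a model of about 7 billion parameters, one 16-bit scale per group of 128 weights for INT4, and no other overhead. These figures are illustrative assumptions, not NNCF's exact storage layout.

```python
def weight_footprint_gib(num_params: int, bits_per_weight: int,
                         group_size: int = 0, scale_bits: int = 16) -> float:
    """Approximate weight storage in GiB.

    When group_size > 0, one scale of `scale_bits` bits is added per group
    of weights (an assumed layout used only for this estimate).
    """
    total_bits = num_params * bits_per_weight
    if group_size > 0:
        total_bits += (num_params // group_size) * scale_bits
    return total_bits / 8 / 2**30


params = 7_000_000_000  # roughly a 7B-parameter LLM
fp16 = weight_footprint_gib(params, 16)
int8 = weight_footprint_gib(params, 8)
int4 = weight_footprint_gib(params, 4, group_size=128)
print(f"FP16 {fp16:.2f} GiB, INT8 {int8:.2f} GiB, INT4 {int4:.2f} GiB")
```

Under these assumptions the INT4 weights take roughly a quarter of the FP16 footprint, which is what makes 7B-class models practical on client hardware.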

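Support change and deprecation notices
- The OpenVINO™ Development Tools package (pip install openvino-dev) is deprecated and will be removed from installation options and distribution channels beginning with the 2025.0 release. For more details, refer to the OpenVINO Legacy Features and Components page.
- Ubuntu 18.04 support is discontinued in the 2023.3 LTS release. The recommended version of Ubuntu is 22.04.
- Starting with 2023.3, OpenVINO no longer supports Python 3.7 because the Python community has discontinued support for it. Update to a newer version (currently 3.8-3.11) to avoid interruptions.
- All ONNX Frontend legacy API (known as ONNX_IMPORTER_API) will no longer be available in the 2024.0 release.
- The ‘PerformanceMode.UNDEFINED’ property, part of the OpenVINO Python API, will be discontinued in the 2024.0 release.
- Tools:
  - Deployment Manager is deprecated and will be supported for two years according to the LTS policy. Visit the selector tool to see package distribution options, or see the deployment guide documentation.
  - Accuracy Checker is deprecated and will be discontinued with 2024.0.
  - Post-Training Optimization Tool (POT) has been deprecated, and 2023.3 LTS is the last release that supports the tool. Developers are encouraged to use the Neural Network Compression Framework (NNCF) for this feature.
  - Model Optimizer is deprecated but will be fully supported until the 2025.0 release. We encourage developers to perform model conversion through OpenVINO Model Converter (API call: OVC). Follow the model conversion transition guide for more details.
  - Support for a git patch for NNCF integration with huggingface/transformers is deprecated. The recommended approach is to use huggingface/optimum-intel to apply NNCF optimization on top of models from Hugging Face.
  - Support for Apache MXNet, Caffe, and Kaldi model formats is deprecated and will be discontinued with the 2024.0 release.
- Runtime:
  - Intel® Gaussian & Neural Accelerator (Intel® GNA) will be deprecated in a future release. We encourage developers to use the Neural Processing Unit (NPU) for low-powered systems like Intel® Core™ Ultra or 14th generation and beyond.
  - OpenVINO C++/C/Python 1.0 APIs are deprecated and will be discontinued in the 2024.0 release. Please use API 2.0 in your applications to avoid disruption.
  - The OpenVINO property Affinity API will be deprecated from 2024.0 and discontinued in 2025.0. It will be replaced with CPU binding configurations (ov::hint::enable_cpu_pinning).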

Details of the improvements and deprecations in OpenVINO™ toolkit 2023.3 LTS are as follows:

Support Change and Deprecation Notices

The OpenVINO™ Development Tools package (pip install openvino-dev) is deprecated and will be removed from installation options and distribution channels beginning with the 2025.0 release. For more details, refer to the OpenVINO Legacy Features and Components page.
Ubuntu 18.04 support is discontinued in the 2023.3 LTS release. The recommended version of Ubuntu is 22.04.
Starting with 2023.3, OpenVINO no longer supports Python 3.7 because the Python community has discontinued support for it. Update to a newer version (currently 3.8-3.11) to avoid interruptions.
All ONNX Frontend legacy API (known as ONNX_IMPORTER_API) will no longer be available in the 2024.0 release.
The ‘PerformanceMode.UNDEFINED’ property, part of the OpenVINO Python API, will be discontinued in the 2024.0 release.
Tools:
- Deployment Manager is deprecated and will be supported for two years according to the LTS policy. Visit the selector tool to see package distribution options or the deployment guide documentation.
- Accuracy Checker is deprecated and will be discontinued with 2024.0.  
- Post-Training Optimization Tool (POT) has been deprecated and the 2023.3 LTS is the last release that supports the tool. Developers are encouraged to use the Neural Network Compression Framework (NNCF) for this feature.
- Model Optimizer is deprecated and will be fully supported until the 2025.0 release. We encourage developers to perform model conversion through OpenVINO Model Converter (API call: OVC). Follow the model conversion transition guide for more details.
- Deprecated support for a git patch for NNCF integration with huggingface/transformers. The recommended approach is to use huggingface/optimum-intel for applying NNCF optimization on top of models from Hugging Face.
- Support for Apache MXNet, Caffe, and Kaldi model formats is deprecated and will be discontinued with the 2024.0 release.
Runtime:
- Intel® Gaussian & Neural Accelerator (Intel® GNA) will be deprecated in a future release. We encourage developers to use the Neural Processing Unit (NPU) for low-powered systems like Intel® Core™ Ultra or 14th generation and beyond.
- OpenVINO C++/C/Python 1.0 APIs are deprecated and will be discontinued in the 2024.0 release. Please use API 2.0 in your applications going forward to avoid disruption.
- OpenVINO property Affinity API will be deprecated from 2024.0 and will be discontinued in 2025.0. It will be replaced with CPU binding configurations (ov::hint::enable_cpu_pinning).
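
Since Python 3.7 is no longer supported, a deployment script can check the interpreter version before importing OpenVINO. The sketch below uses only the standard library. The function name `is_supported_python` is hypothetical, and the 3.8-3.11 range is the one quoted in the notes above, which may change in later releases.

```python
import sys

# Supported range quoted in the 2023.3 notes: Python 3.8 through 3.11.
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 11)


def is_supported_python(version=None) -> bool:
    """Return True if the (major, minor) version falls in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if not is_supported_python():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} is outside "
        "the 3.8-3.11 range listed for OpenVINO 2023.3."
    )
```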

OpenVINO™ Development Tools

List of components and their changes:
- Neural Network Compression Framework (NNCF)
  - Weight compression API, nncf.compress_weights(), has been extended with the following:
    - The ‘all_layers’ parameter compresses the whole model, including embeddings and final layers, to the 4-bit format. This makes the model footprint smaller and improves performance, but it might impact model accuracy. By default, this parameter is disabled, and the backup precision (INT8) is assigned to the embeddings and last layers.
    - The INT8_SYM compression mode gives better performance for the compressed model in 8-bit weight compression, but it might impact model accuracy. Therefore, INT8_ASYM remains the default mode to better balance performance and accuracy.
    - A 4-bit data-aware weight compression feature has been implemented, introducing the optional ‘dataset’ parameter in nncf.compress_weights(). This parameter can be used to mitigate accuracy loss in compressed models. Note that enabling this option extends the compression time.
  - Post-training Quantization with Accuracy Control, nncf.quantize_with_accuracy_control(), has been extended with the optional ‘restore_mode’ parameter, which reverts weights to INT8 instead of the original precision. This parameter helps to reduce the size of the quantized model and improves its performance. By default, it is disabled and model weights are reverted to the original precision in nncf.quantize_with_accuracy_control().
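
To illustrate why a symmetric 8-bit mode can cost accuracy compared with an asymmetric one, here is a pure-Python sketch of the two textbook quantization schemes. It is not NNCF's implementation. The function names, the per-tensor granularity, and the sample weights are assumptions chosen for illustration. The sample weights are all positive, a case where symmetric quantization leaves half of its range unused.

```python
def quantize_sym(weights):
    """Symmetric INT8: zero maps to zero and the integer range is [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return [v * scale for v in q], scale


def quantize_asym(weights):
    """Asymmetric INT8: a zero point shifts the [0, 255] range onto [min, max]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return [(v - zero_point) * scale for v in q], scale


def max_abs_error(original, restored):
    """Largest absolute difference between original and dequantized values."""
    return max(abs(a - b) for a, b in zip(original, restored))


weights = [0.02, 0.11, 0.35, 0.58, 0.77, 0.93]  # skewed: no negative values
sym_restored, sym_scale = quantize_sym(weights)
asym_restored, asym_scale = quantize_asym(weights)
print("symmetric step:", sym_scale, "error:", max_abs_error(weights, sym_restored))
print("asymmetric step:", asym_scale, "error:", max_abs_error(weights, asym_restored))
```

The asymmetric scheme has a smaller step size on this data, so its worst-case rounding error is smaller. The symmetric scheme avoids zero-point arithmetic, which is the usual reason it runs faster.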

OpenVINO™ Runtime

Model Import Updates
- TensorFlow Framework Support
  - Added support for TF1 While control-flow constructs without TensorArray* operations #20800
  - Support for complex tensors has been added #20860, #21477
  - Provided fixes for the following:
    - Accept any model file extension for frozen protobuf format #21508
    - Correct ArgMin/ArgMax translators for repeating elements case #21364
    - Correct PartitionedCall translator when numbers of external and internal body inputs mismatch #20825
- PyTorch Framework Support
  - Added support of nested dictionaries and lists as example input
  - Disabled torch.jit.freeze in default model tracing scenario and improved support for models without freezing, extending model coverage and improving accuracy for some models
- ONNX Framework Support
  - Switched to ONNX 1.15.0 as a supported version of original framework #20929

CPU
- Full support for 5th Gen Intel® Xeon® Scalable processors (codename Emerald Rapids) with sub-NUMA clustering (SNC) and efficient core resource scheduling to improve performance.
- Further optimized performance on Intel® Core™ Ultra (codename Meteor Lake) CPU with latency hint, by leveraging both P-cores and E-cores.
- Further improved performance of LLMs in INT4 weight compression, especially on 1st token latency and on 4th and 5th Gen Intel® Xeon® platforms (codename Sapphire Rapids and Emerald Rapids) with AMX capabilities.
- Improved performance of transformer-based LLM using stateful model technique to increase memory efficiency where internal states (KV cache) are shared among multiple iterations of inference. The stateful model implementation supports both greedy search and beam search (preview) for LLMs. This technique also reduces the memory footprint of LLMs, where Intel Core and Ultra platforms like Raptor Lake and Meteor Lake can run INT4 models, such as Llama v2 7B.
- Improved performance on ARM platforms with throughput hint, by increasing efficiency in usage of the CPU cores and memory bandwidth.
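
The idea behind the stateful model technique can be shown with a toy sketch. The two classes below are hypothetical and do not use OpenVINO. `sum` stands in for attention over the cached tokens, and `reset_state` only loosely mirrors the notion of clearing a model's internal state between sequences.

```python
class StatelessDecoder:
    """The caller owns the cache and passes it in and out on every call."""

    def step(self, token, past):
        new_past = past + [token]  # a fresh list is built and returned each time
        return sum(new_past), new_past


class StatefulDecoder:
    """The cache lives inside the object and persists between calls."""

    def __init__(self):
        self._cache = []

    def step(self, token):
        self._cache.append(token)  # updated in place, nothing is copied out
        return sum(self._cache)

    def reset_state(self):
        """Clear the internal cache before starting a new sequence."""
        self._cache.clear()


tokens = [3, 1, 4]

stateless, past, stateless_out = StatelessDecoder(), [], []
for t in tokens:
    out, past = stateless.step(t, past)
    stateless_out.append(out)

stateful = StatefulDecoder()
stateful_out = [stateful.step(t) for t in tokens]
```

Both variants produce the same outputs. The difference is that the stateful one never moves the cache across the call boundary, which is where the memory-efficiency gain comes from.
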
GPU
- Full support for Intel® Core™ Ultra (codename Meteor Lake) integrated graphics.
- For LLMs, the first inference latency for INT8 and INT4 weight-compressed models has been improved on iGPU thanks to more efficient context processing. Overall average token latency for INT8 and INT4 has also been enhanced on iGPU with graph compilation optimization, various host overhead optimization, and dynamic padding support for GEMM.
- Stateful model is functionally supported for LLMs.
- Model caching for dynamically shaped models is now supported. Model loading time is improved for these models, including LLMs.
- API for switching between size mode (model caching) and speed mode (kernel caching) is introduced.
- The model cache file name is changed to be independent of GPU driver versions. The GPU will not generate separate model cache files when the driver is updated.
- Compilation time for Stable Diffusion models has been improved.
NPU
- The NPU plugin is available as part of OpenVINO. With the Intel® Core™ Ultra NPU driver installed, inference can run on the NPU device.
AUTO device plug-in (AUTO)
- Introduced the round-robin policy to AUTO cumulative throughput hint, which dispatches inference requests to multiple devices (such as multiple GPU devices) in the round-robin sequence, instead of in the device priority sequence. The device priority sequence remains as the default configuration.
- AUTO loads stateful models to GPU or CPU per device priority, since GPU now supports stateful model inference.
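
The difference between the two dispatch orders can be sketched with a toy scheduler. This is not the AUTO plugin's implementation. It assumes that "device priority" means filling the highest-priority device first until its request slots are used up, and it does not model requests completing. The function names and the `slots_per_device` parameter are hypothetical.

```python
from itertools import cycle


def dispatch_round_robin(devices, num_requests):
    """Assign requests to devices in turn, regardless of priority."""
    order = cycle(devices)
    return [next(order) for _ in range(num_requests)]


def dispatch_by_priority(devices, num_requests, slots_per_device):
    """Assign each request to the first listed device that still has a free slot."""
    free = {d: slots_per_device for d in devices}
    assigned = []
    for _ in range(num_requests):
        target = next((d for d in devices if free[d] > 0), None)
        if target is None:
            break  # every device is saturated; a real scheduler would wait
        free[target] -= 1
        assigned.append(target)
    return assigned


devices = ["GPU.0", "GPU.1"]
print(dispatch_round_robin(devices, 4))
print(dispatch_by_priority(devices, 3, slots_per_device=2))
```

Round-robin spreads load evenly from the first request, while priority order keeps the lower-priority device idle until the first one is full.
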
OpenVINO Common
- Enhanced support of String tensors has been implemented, enabling the use of operators and models that rely on string tensors. This update also enhances the capability in the torchvision preprocessing (#21244).
- A new feature has been added that enables the selection of P-cores for model compilation on CPU devices with hybrid architecture (i.e. Intel® Core™ 12th Gen and beyond). This reduces compilation time compared to the previous implementation, where P-cores and E-cores were used randomly by OS scheduling.
OpenVINO JavaScript API (preview feature)
- We’ve introduced a preview version of the JS API for OpenVINO runtime in this release. We hope that you will try this feature and provide your feedback through GitHub issues.
- Known limitations:
  - Only supported in manylinux and x86 (Windows, ARM, ARM64, and macOS have not been tested)
  - Node.js version >= 18.16
  - CMake version < 3.14 is not supported
  - gcc compiler version < 7 is not supported
OpenVINO Python API
- Introducing string tensor support for Python API.
- Added support for the following:
  - Create ov.Tensor from Python lists
  - Create ov.Tensor from empty numpy arrays.
  - Constants from empty numpy arrays.
  - Autogenerated get/set methods for Node attributes.
  - Inference functions (InferRequest.infer/start_async, CompiledModel.__call__ etc.) support OVDict as the input.
  - PILLOW interpolation modes bindings. (external contribution: @meetpatel0963 #21188)
- Torchvision to OpenVINO preprocessing converter documentation has been added to OpenVINO docs.

OpenVINO Ecosystem

OpenVINO Tokenizer (Preview)

OpenVINO Tokenizer adds text processing operations to OpenVINO:

- Text pre- and post-processing without third-party dependencies.
- Convert a HuggingFace tokenizer into an OpenVINO tokenizer model and detokenizer model using a CLI tool or Python API.
- Connect a tokenizer and a model to get a single model with text input.

OpenVINO Tokenizer models work only on the CPU device.

Supported platforms: Linux (x86 and ARM), Windows and Mac (x86 and ARM)

OpenVINO Model Server

- Added support for serving pipelines with custom nodes implemented as Python code. This greatly simplifies exposing GenAI algorithms based on Hugging Face and Optimum libraries. It can also be applied to arbitrary pre- and post-processing in model serving pipelines.
- Included a new set of model serving demos that use custom nodes with Python code. These include LLM text generation, Stable Diffusion, and seq2seq translation.
- Improved the video stream analysis demo. A simple client example can now process the video stream from a local camera, video file, or RTSP stream.
- Learn more about these changes on GitHub.

Jupyter Notebook Tutorials

The following notebooks have been updated or newly added: