Triton Client Inference Example Code

Created on: November 4, 2024

Answered using O1 Preview by Chat01

Question

OS: Ubuntu 22.04 with Docker installed. The Triton image was installed via docker pull nvcr.io/nvidia/tritonserver:23.08-py3, and the image contains an algorithm model people_detector; the overall background is as shown in the numbered items below. Requirement: write Triton client code on the host machine that connects to the Triton Server and tests the model's inference service, including sending an image, returning the detection results, drawing bounding boxes, and saving the image. Provide the complete client test source code.

(1) yolov8n.pt was downloaded from the official ultralytics website, unmodified.
(2) yolo export model=./models/yolov8n.pt format=onnx produces yolov8n.onnx.
(3) trtexec --onnx=./models/yolov8n.onnx --saveEngine=people_detector.engine --fp16 produces people_detector.engine.
(4) Move the model into the models directory inside the Docker environment: mv people_detector.engine /models/people_detector/1/model.plan
(5) Directory structure of models inside the Docker environment:
root@08dd250b903b:/workspace# tree /models
/models
|-- car_detector
|   |-- 1
|   |   `-- model.plan
|   `-- config.pbtxt
`-- people_detector
    |-- 1
    |   `-- model.plan
    `-- config.pbtxt

4 directories, 4 files
root@08dd250b903b:/workspace#
(6) Create a config.pbtxt for each model, specifying the model's inputs, outputs, and data types.
config.pbtxt for people_detector:
nano /models/people_detector/config.pbtxt
Contents:
name: "people_detector"
platform: "tensorrt_plan"
max_batch_size: 1
input [
{
name: "images"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [3, 640, 640]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [84, 8400]
}
]
instance_group [
{
kind: KIND_GPU
}
]
(7) When Triton Server is started with the model repository, the output is as follows:
root@08dd250b903b:/workspace# tritonserver --model-repository=/models
I1104 05:55:25.323236 1225 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1104 05:55:25.323261 1225 libtorch.cc:2517] Triton TRITONBACKEND API version: 1.15
I1104 05:55:25.323263 1225 libtorch.cc:2523] 'pytorch' TRITONBACKEND API version: 1.15
I1104 05:55:25.408497 1225 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fe236000000' with size 268435456
I1104 05:55:25.408587 1225 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1104 05:55:25.409377 1225 model_lifecycle.cc:462] loading: car_detector:1
I1104 05:55:25.409388 1225 model_lifecycle.cc:462] loading: people_detector:1
I1104 05:55:25.409711 1225 tensorrt.cc:65] TRITONBACKEND_Initialize: tensorrt
I1104 05:55:25.409718 1225 tensorrt.cc:75] Triton TRITONBACKEND API version: 1.15
I1104 05:55:25.409720 1225 tensorrt.cc:81] 'tensorrt' TRITONBACKEND API version: 1.15
I1104 05:55:25.409722 1225 tensorrt.cc:105] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I1104 05:55:25.410435 1225 tensorrt.cc:222] TRITONBACKEND_ModelInitialize: people_detector (version 1)
I1104 05:55:25.410533 1225 tensorrt.cc:222] TRITONBACKEND_ModelInitialize: car_detector (version 1)
I1104 05:55:25.410860 1225 logging.cc:46] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
I1104 05:55:25.418639 1225 logging.cc:46] Loaded engine size: 8 MiB
I1104 05:55:25.418872 1225 logging.cc:46] Loaded engine size: 8 MiB
I1104 05:55:25.436701 1225 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12, now: CPU 0, GPU 12 (MiB)
I1104 05:55:25.438990 1225 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +6, now: CPU 0, GPU 6 (MiB)
I1104 05:55:25.442070 1225 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: people_detector_0_0 (GPU device 0)
I1104 05:55:25.444352 1225 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: car_detector_0_0 (GPU device 0)
I1104 05:55:25.447806 1225 logging.cc:46] Loaded engine size: 8 MiB
I1104 05:55:25.455944 1225 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +6, now: CPU 0, GPU 6 (MiB)
I1104 05:55:25.456093 1225 logging.cc:46] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
I1104 05:55:25.456259 1225 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +11, now: CPU 0, GPU 17 (MiB)
I1104 05:55:25.456390 1225 instance_state.cc:188] Created instance people_detector_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I1104 05:55:25.456467 1225 model_lifecycle.cc:819] successfully loaded 'people_detector'
I1104 05:55:25.461242 1225 logging.cc:46] Loaded engine size: 8 MiB
I1104 05:55:25.470004 1225 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +6, now: CPU 0, GPU 23 (MiB)
I1104 05:55:25.470312 1225 logging.cc:46] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +12, now: CPU 1, GPU 35 (MiB)
I1104 05:55:25.470440 1225 instance_state.cc:188] Created instance car_detector_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I1104 05:55:25.470778 1225 model_lifecycle.cc:819] successfully loaded 'car_detector'
I1104 05:55:25.470883 1225 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1104 05:55:25.470928 1225 server.cc:631]
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorrt | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-c |
| | | ompute-capability":"6.000000","default-max-batch-size":"4"}} |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------+

I1104 05:55:25.470944 1225 server.cc:674]
+-----------------+---------+--------+
| Model | Version | Status |
+-----------------+---------+--------+
| car_detector | 1 | READY |
| people_detector | 1 | READY |
+-----------------+---------+--------+

I1104 05:55:25.495944 1225 metrics.cc:810] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090
I1104 05:55:25.496040 1225 metrics.cc:703] Collecting CPU metrics
I1104 05:55:25.496123 1225 tritonserver.cc:2435]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.37.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory |
| | cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+

I1104 05:55:25.496855 1225 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I1104 05:55:25.496963 1225 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I1104 05:55:25.538731 1225 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
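
With the services above running, a quick sanity check from the host that the server and model are ready can be done with the tritonclient Python package; a minimal sketch, assuming the default HTTP port 8000 shown in the log:
python
import tritonclient.http as httpclient

# Check the HTTP endpoint started above (0.0.0.0:8000)
client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_ready())                  # True once startup completes
print(client.is_model_ready("people_detector"))  # True when the model shows READY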

Thought

Connecting to the server

The user is running a Docker container on Ubuntu 22.04; inside it is a model named "people_detector", a TensorRT engine built from YOLOv8. The goal is to connect to the server and test the model's inference service.

Building the connection code

I'm starting on the Triton client code: connect to the Triton Server and test the model's inference service, including sending an image, returning the detection results, drawing bounding boxes, and saving the image.

Preparing the environment

The server is being set up; in the Docker environment the model people_detector.engine is already in place. I'm considering how to tune the person-detection model's configuration file to speed up loading.

Designing the client code

I'm designing Triton client code for the YOLOv8 model. The code will preprocess the image, send it to the server, receive the detection results, process them, then draw bounding boxes on the image and save it.

Loading and processing the image

First we load the image in code using the Triton client library, then preprocess it, including resizing and normalization. Next, the image is sent to the server through the client library and the output is received: an array of shape [84, 8400].

Pinning down the output format

With YOLOv8, the output dimensions are [batch, num_classes + 5, num_anchors], where num_classes + 5 covers the class probabilities and the bounding-box coordinates.

Loading and processing the image

First, I extract the input and output information from the config file. Then I load the image, resize it to 640x640, convert it to NCHW format and float32, and send it to the server.

Building the client code

I'm developing a client that handles preprocessing, sending the request, receiving the response, postprocessing the output, drawing bounding boxes, and saving the image.

Image preprocessing and box drawing

This code first reads the image, resizes it and converts its color space, then turns it into a numeric array and normalizes it. It then runs inference through the Triton client and extracts the output data. Finally it decodes the detections, draws the detection boxes, and saves the processed image.

Verifying each step

For the YOLO model, the input image needs to be normalized to [0, 1] or zero-centered. That means: resizing to (640, 640), converting to float32, normalizing pixel values (dividing by 255), and possibly mean/std normalization. The output needs a sigmoid activation.

Fine-tuning the code details

Keeping YOLOv8's preprocessing and postprocessing steps in mind, I plan to implement the scaling, normalization, and decoding exactly, to make sure they match YOLOv8.

Processing and resizing the image

First, read the image, resize it to 640x640, and add padding. Then normalize the pixel values to [0, 1], convert to float32, and reorder the channels to CHW. That is the initial data preprocessing.

Handling the image transforms

I'm describing how to resize the image, add borders, recompute the scale ratio, and handle the different scaling and padding strategies.

Image preprocessing

Read, convert, resize, normalize, and transpose the image to prepare the model input, computing the aspect ratio and the padding amounts; a small worked example of the letterbox arithmetic follows.
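
A minimal sketch of that arithmetic, assuming a hypothetical 1280x720 input letterboxed into a 640x640 canvas:
python
# Letterbox a 1280x720 (width x height) image into a 640x640 canvas
ratio = min(640 / 720, 640 / 1280)                   # 0.5, limited by the wider side
new_w, new_h = int(1280 * ratio), int(720 * ratio)   # resized to 640 x 360
dw, dh = 640 - new_w, 640 - new_h                    # 0 and 280 pixels of padding
pad_lr, pad_tb = dw / 2, dh / 2                      # 0 left/right, 140 top/bottom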

Reshaping the output structure

I'm reshaping the model's output into the format we need. First, transpose the output from [1, 84, 8400] to [8400, 84], so that each row represents one prediction.

Reviewing the output format

I'm checking the shape of the model output; it might be, say, [1, 8400, 85] or [1, 8400, 84]. The latter suggests the objectness score may be merged into the class scores.

Organizing the data shape

I'm rearranging the output data into shape [8400, 84]. We now have 8400 predictions; the first 4 elements of each are the bbox coordinates and the remaining 80 are class probabilities. Next, a sigmoid activation may be needed on the scores, and possibly on the bbox coordinates as well.
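
A minimal NumPy sketch of that reshape and split, using a dummy array of the stated shape:
python
import numpy as np

output = np.zeros((1, 84, 8400), dtype=np.float32)  # stand-in for the server output
preds = output[0].transpose(1, 0)                   # [1, 84, 8400] -> [8400, 84]
boxes = preds[:, :4]    # per prediction: [x_center, y_center, width, height]
scores = preds[:, 4:]   # per prediction: 80 class scores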

Emphasizing code flexibility

Given the time constraints, we suggest the user adapt the code to match their model's specific requirements.

Adjusting and refining the code

Here is the adjusted code, covering image resizing, padding, preprocessing, non-maximum suppression, and postprocessing. Together these steps improve the accuracy of the inference results.

Answer

Below is a complete Python client that connects to the Triton Server, sends an image for inference, receives the detection results, draws the bounding boxes, and saves the image. Make sure you have installed the necessary Python libraries: tritonclient, numpy, and opencv-python.

python
import cv2
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException


def letterbox(image, new_shape=(640, 640), color=(114, 114, 114)):
    """Resize image with unchanged aspect ratio using padding."""
    shape = image.shape[:2]  # current shape [height, width]
    ratio = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    new_unpad = (int(round(shape[1] * ratio)), int(round(shape[0] * ratio)))
    dw = new_shape[1] - new_unpad[0]  # width padding
    dh = new_shape[0] - new_unpad[1]  # height padding
    dw /= 2  # divide padding between the two sides
    dh /= 2
    image = cv2.resize(image, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    image = cv2.copyMakeBorder(image, top, bottom, left, right,
                               cv2.BORDER_CONSTANT, value=color)
    return image, ratio, (dw, dh)


def preprocess_image(image_path):
    """Preprocess the image for model input."""
    image = cv2.imread(image_path)
    original_image = image.copy()
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image, ratio, (dw, dh) = letterbox(image, new_shape=(640, 640))
    image = image.astype(np.float32) / 255.0
    image = np.transpose(image, (2, 0, 1))  # HWC to CHW
    image = np.expand_dims(image, axis=0)   # add batch dimension
    return image, original_image, ratio, (dw, dh)


def postprocess_output(output_data, ratio, dwdh, conf_threshold=0.5):
    """Postprocess the model output to get detections."""
    output_data = output_data[0]                     # remove batch dim: [84, 8400]
    output_data = np.transpose(output_data, (1, 0))  # -> [8400, 84]

    # Split bbox coordinates and class scores. The YOLOv8 ONNX export already
    # applies sigmoid to the class scores, so no extra activation is needed here.
    boxes = output_data[:, :4]   # [x_center, y_center, width, height]
    scores = output_data[:, 4:]

    # Max score and corresponding class index for each prediction
    class_scores = np.max(scores, axis=1)
    class_ids = np.argmax(scores, axis=1)

    # Confidence thresholding
    idxs = np.where(class_scores > conf_threshold)
    boxes = boxes[idxs]
    class_scores = class_scores[idxs]
    class_ids = class_ids[idxs]

    # Convert from [x_center, y_center, width, height] to [x1, y1, x2, y2]
    boxes_xyxy = np.zeros_like(boxes)
    boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2  # x1
    boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2  # y1
    boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2  # x2
    boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2  # y2

    # Undo the letterbox padding and scaling to map back to the original image
    boxes_xyxy[:, [0, 2]] -= dwdh[0]
    boxes_xyxy[:, [1, 3]] -= dwdh[1]
    boxes_xyxy /= ratio

    detections = []
    for i in range(len(boxes_xyxy)):
        detections.append({
            'box': boxes_xyxy[i],
            'score': class_scores[i],
            'class_id': class_ids[i],
        })
    return detections


def draw_boxes(image, detections):
    """Draw bounding boxes on the image."""
    for det in detections:
        x1, y1, x2, y2 = det['box'].astype(int)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        label = f"Class {det['class_id']}: {det['score']:.2f}"
        cv2.putText(image, label, (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (36, 255, 12), 2)
    return image


def main():
    # Configuration
    image_path = 'input.jpg'          # path to your input image
    output_image_path = 'output.jpg'  # path to save the output image
    model_name = 'people_detector'
    server_url = 'localhost:8000'     # Triton HTTP endpoint

    # Preprocess the image
    input_image, original_image, ratio, dwdh = preprocess_image(image_path)

    # Create the Triton client and send the request
    try:
        client = httpclient.InferenceServerClient(url=server_url)
        inputs = [httpclient.InferInput('images', list(input_image.shape), "FP32")]
        inputs[0].set_data_from_numpy(input_image)
        outputs = [httpclient.InferRequestedOutput('output0')]
        response = client.infer(model_name, inputs, outputs=outputs)
        output_data = response.as_numpy('output0')

        # Postprocess the output
        detections = postprocess_output(output_data, ratio, dwdh)

        # Draw the bounding boxes and save the image
        image_with_boxes = draw_boxes(original_image, detections)
        cv2.imwrite(output_image_path, image_with_boxes)
        print(f"Output saved to {output_image_path}")
    except InferenceServerException as e:
        print("Inference failed: " + str(e))


if __name__ == '__main__':
    main()

Notes:

  1. Install the required libraries:

    • You need the Triton client library, NumPy, and OpenCV:
      bash
      pip install tritonclient[http] numpy opencv-python
    • The server also exposes a gRPC endpoint on port 8001; a gRPC client sketch follows these notes.
  2. Code walkthrough:

    • The preprocess_image function reads and preprocesses the input image, resizing it (with letterbox padding) and normalizing it.
    • The postprocess_output function processes the model output and extracts the detections.
    • The draw_boxes function draws the detected bounding boxes on the original image.
    • The main function connects to the Triton Server, sends the inference request, receives and processes the result, and saves the image with the detection boxes.
  3. Model output handling:

    • Since the model output has shape [1, 84, 8400], it is transposed and split to extract the bounding boxes and class information.
    • The YOLOv8 ONNX export already applies sigmoid to the class scores, so the code thresholds them directly instead of activating them again.
  4. Running the code:

    • Replace 'input.jpg' in the image_path variable with the path to your input image.
    • Make sure the Triton Server is running and the model has loaded correctly (see the readiness check after the server log in the question).
    • Run the script:
      bash
      python client.py
  5. Result:

    • The script saves an image with detection boxes drawn to the specified output path.
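
As mentioned in note 1, the same request can be sent over gRPC; a minimal sketch, assuming tritonclient[grpc] is installed and the gRPC port 8001 from the server log:
python
import tritonclient.grpc as grpcclient

# Same inference call as the HTTP client, but over the gRPC endpoint (port 8001)
client = grpcclient.InferenceServerClient(url="localhost:8001")
inputs = [grpcclient.InferInput("images", [1, 3, 640, 640], "FP32")]
inputs[0].set_data_from_numpy(input_image)  # input_image from preprocess_image above
outputs = [grpcclient.InferRequestedOutput("output0")]
response = client.infer("people_detector", inputs, outputs=outputs)
output_data = response.as_numpy("output0")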

Tips:

  • If your model requires special preprocessing or postprocessing of its inputs and outputs, adjust the code accordingly.
  • If you need non-maximum suppression (NMS), you can use an existing implementation such as the NMS function in torchvision, or implement it yourself; a minimal sketch follows below.
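
A minimal pure-NumPy NMS sketch that could be applied to the boxes and scores returned by postprocess_output above (the 0.45 IoU threshold is an assumed typical value):
python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS over boxes in [x1, y1, x2, y2] format; returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep only boxes that overlap the top box less than the threshold
        order = order[1:][iou < iou_threshold]
    return keep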

I hope this helps you write and test your Triton client. If you have any questions, feel free to ask.
