Vitis AI

Getting Started

https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html

System Requirements

https://docs.xilinx.com/r/en-US/ug1414-vitis-ai
https://xilinx.github.io/Vitis-AI/3.5/html/docs/reference/system_requirements.html (3.0)

正式にサポートしているFPGAボードは限られているため注意
開発側はUbuntu 20.04(Debianは不可)にCUDA 11.3 + Container ToolkitおよびDockerを用意する必要がある (2023年時点)
便宜上CUDA 11.4を使用するケースもあったが問題は無さそう
開発側スペックは少なくともメモリを8GB程度は用意しないと処理中にOoM Killerで停止する > GCPなら最低でもn1-standard-2相当が必要

IP Core / Bitstream

https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Deep-Learning-Processor-Unit
https://docs.xilinx.com/r/en-US/pg338-dpu (JP)
https://github.com/Xilinx/Vitis-AI/tree/master/dpu
https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-system-integration (3.0)

深層学習用の処理機能が詰まったDPUという名前のFPGA回路情報(IPコア)が提供されている
あらかじめFPGAにDPU回路を書き込んでおいてXRTという名前のドライバで制御する
1. Vivado/Vitisでビットストリームを生成
2. ビットストリームをターゲットボードにダウンロード
3. 関連するドライバ(および依存関係のあるライブラリ)をインストール
例えばZynq UltraScale+ MPSoCの場合はDPUCZDX8GというIPコア(Bitstream)が使用可能
アーキテクチャ(ISA含む)やVitis AIのバージョンも合わせる必要があるので注意: B512/B800/B1024/B1152/B1600/B2304/B3136/B4096
[IMPORTANT] Vitis AIは回路合成までは自動で行わないため適切なDPUを予めFPGAにダウンロードしておく必要がある !!!
- Kira KV260でDPU調整済みのOSイメージ(Option C)を使用すれば適切なDPUが自動でダウンロードされる
- OSイメージにUbuntuなどを選んだ場合は自分で設定するかapt経由(Kira KV260参照)でも可能 > バージョンの整合性に注意

OS Image

https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Flashing-the-OS-Image-to-the-SD-Card

DPUのために調整されたVitis AI用のPetaLinuxを使用すれば余計な設定は不要 (Kira KV260参照)
ライセンスに記載のあるBenchmarkingとはビジネス用語でありFPGAの性能試験のことではない
公式HPのUbuntuなどを使用する場合はDPUの調整が別途必要 (Kira KV260参照)

Model Zoo

https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-model-zoo.html (3.0)
https://xilinx.github.io/Vitis-AI/3.5/html/docs/reference/ModelZoo_Github_web.htm (3.0)
https://github.com/Xilinx/Vitis-AI/tree/master/model_zoo

用意されているモデルは限られており、かつ商用不可ライセンスの物も多いため要注意
ダウンロード先のURLはVitis-AI/model_zooにあるYAMLファイルで参照可能
再トレーニング用のプログラムもダウンロードしたzipファイルに含まれている
対応するDPUがFPGAにダウンロードされていないとFingerprint Failureで実行不可 > 部分的な回避策は下部に記載

Evaluation Boards

https://www.xilinx.com/products/boards-and-kits/see-all-evaluation-boards.html
https://www.xilinx.com/products/som/kria/kv260-vision-starter-kit.html

Kria KV260が最安値

AI Optimizer (Optional)

https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-model-development.html#model-optimization (3.0)

Vitis AI Optimizer is an optional tool that can significantly enhance performance in many applications
Vitis AI Optimizer requires the developer to purchase a license

Host Setup on Ubuntu

https://xilinx.github.io/Vitis-AI/3.5/html/docs/install/install.html (3.0)

The pre-built cpu container should only be used when a GPU is not available on the host machine
The docker_build process may take several hours to complete
Often simply re-running the build script will result in success

git clone https://github.com/Xilinx/Vitis-AI
#git clone -b 3.0 https://github.com/Xilinx/Vitis-AI

PyTorch (CPU Only)

cd ~/Vitis-AI

docker pull xilinx/vitis-ai-pytorch-cpu:latest

./docker_run.sh xilinx/vitis-ai-pytorch-cpu:latest

latest: Pulling from xilinx/vitis-ai-pytorch-cpu
Digest: sha256:f55bd069ffd56c6358cae29df19e6085f2bcf8ea5e045744aa412fd72db521ed
Status: Image is up to date for xilinx/vitis-ai-pytorch-cpu:latest
docker.io/xilinx/vitis-ai-pytorch-cpu:latest
Setting up user 's environment in the Docker container...
Running as vitis-ai-user with ID 0 and group 0
==========================================
__      ___ _   _                   _____
\ \    / (_) | (_)            /\   |_   _|
 \ \  / / _| |_ _ ___ ______ /  \    | |
  \ \/ / | | __| / __|______/ /\ \   | |
   \  /  | | |_| \__ \     / ____ \ _| |_
    \/   |_|\__|_|___/    /_/    \_\_____|
==========================================
Docker Image Version: ubuntu2004-3.0.0.106   (CPU)
Vitis AI Git Hash: d4ec26f
Build Date: 2023-01-08
WorkFlow: pytorch

PyTorch (GPU Support)

cd ~/Vitis-AI/docker

## License is necessary to build opt_pytorch
./docker_build.sh -t gpu -f pytorch
#./docker_build.sh -t gpu -f opt_pytorch

Validating Arguments...
Your inputs: Docker-Type:gpu, FrameWork:pytorch
...
=> => writing image sha256:f555d3cf57c1562de00d29987b768f08836018fba6052a189bd1365c292d54b9
=> => naming to docker.io/xilinx/vitis-ai-pytorch-gpu:3.5.0.001-bbd45838d

The list of Nvidia CUDA and cuDNN images of Docker Hub is available below:
https://hub.docker.com/r/nvidia/cuda/tags

#docker run --gpus all nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04 nvidia-smi
docker run --gpus all nvidia/cuda:11.4.3-cudnn8-runtime-ubuntu20.04 nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8    10W /  70W |    105MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

cd ~/Vitis-AI

docker images

REPOSITORY                        TAG                                                          IMAGE ID       CREATED        SIZE
xilinx/vitis-ai-pytorch-gpu       3.5.0.001-bbd45838d                                          f555d3cf57c1   2 hours ago    31GB

## The latest tag is not working in vitis-ai version 3.5 (as of July 2023)
#./docker_run.sh xilinx/vitis-ai-pytorch-gpu:latest
./docker_run.sh xilinx/vitis-ai-pytorch-gpu:3.5.0.001-bbd45838d

Setting up user 's environment in the Docker container...
Running as vitis-ai-user with ID 0 and group 0
==========================================
__      ___ _   _                   _____
\ \    / (_) | (_)            /\   |_   _|
 \ \  / / _| |_ _ ___ ______ /  \    | |
  \ \/ / | | __| / __|______/ /\ \   | |
   \  /  | | |_| \__ \     / ____ \ _| |_
    \/   |_|\__|_|___/    /_/    \_\_____|
==========================================
Docker Image Version: 3.5.0.001-bbd45838d   (GPU)
Vitis AI Git Hash: bbd45838d
Build Date: 2023-07-20
WorkFlow: pytorch

Troubleshooting - Exit Code 137

https://stackoverflow.com/questions/31297616/what-is-the-authoritative-list-of-docker-run-exit-codes
https://komodor.com/learn/exit-codes-in-containers-and-kubernetes-the-complete-guide/

## ERROR: failed to solve: process "..." did not complete successfully: exit code: 137

Exit code 137 indicates that container was immediately terminated by the operating system via SIGKILL signal
The host machine needs more system memory; see Docker and OoM Killer for more details
Another workaround is preparing additional swap memory: see Swap for instructions

Troubleshooting - Docker Image Not Found

https://github.com/Xilinx/Vitis-AI/pull/1296

The 'latest' tag is not working in vitis-ai version 3.5 (as of July 2023)

## Unable to find image 'xilinx/vitis-ai-pytorch-gpu:latest' locally
## docker: Error response from daemon: pull access denied for xilinx/vitis-ai-pytorch-gpu,
## repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

docker images

REPOSITORY                        TAG                                                          IMAGE ID       CREATED        SIZE
xilinx/vitis-ai-pytorch-gpu       3.5.0.001-bbd45838d                                          f555d3cf57c1   2 hours ago    31GB

docker tag xilinx/vitis-ai-pytorch-gpu:3.5.0.001-bbd45838d xilinx/vitis-ai-pytorch-gpu:latest

docker images

REPOSITORY                        TAG                                                          IMAGE ID       CREATED        SIZE
xilinx/vitis-ai-pytorch-gpu       3.5.0.001-bbd45838d                                          f555d3cf57c1   2 hours ago    31GB
xilinx/vitis-ai-pytorch-gpu       latest                                                       f555d3cf57c1   2 hours ago    31GB

Jupyter Notebook in Docker Image

See Jupyter for details

jupyter notebook --port=8888 --ip=0.0.0.0

http://127.0.0.1:8888/?token=abcdefg

Target Board Setup (Zynq UltraScale+ MPSoC - DPUCZDX8G)

https://xilinx.github.io/Vitis-AI/3.0/html/docs/quickstart/mpsoc.html

Cross-Compiler on Host Machine

## Exit or detach from the vitis-ai docker container before installing the cross compiler

cd ~/Vitis-AI/board_setup/mpsoc
chmod 755 ./host_cross_compiler_setup.sh
./host_cross_compiler_setup.sh

#unset LD_LIBRARY_PATH
source ~/petalinux_sdk_2022.2/environment-setup-cortexa72-cortexa53-xilinx-linux

## Run the vitis-ai docker container after installing the cross compiler
cd ~/Vitis-AI
./docker_run.sh xilinx/vitis-ai-pytorch-cpu:latest

## Activate the conda environment in the docker container
conda activate vitis-ai-pytorch

## Cross compile an example
#cd ~/Vitis-AI/examples/vai_runtime/resnet50
cd /workspace/examples/vai_runtime/resnet50
#bash –x build.sh
bash build.sh

Troubleshooting - LSB Modules

https://postgresweb.com/ubuntu-no-lsb-modules-are-available

## No LSB modules are available.

sudo apt install lsb-core

lsb_release -a

OS Installation

See Kira KV260 for the installation

Checking Device

xbutil examine

System Configuration
  OS Name              : Linux
  Release              : 5.15.36-xilinx-v2022.2
  Version              : #1 SMP Mon Oct 3 07:50:07 UTC 2022
  Machine              : aarch64
  CPU Cores            : 4
  Memory               : 3929 MB
  Distribution         : PetaLinux 2022.2_release_S10071807 (honister)
  GLIBC                : 2.34
  Model                : ZynqMP SMK-K26 Rev1/B/A

XRT
  Version              : 2.14.0
  Branch               : 2022.2
  Hash                 : 43926231f7183688add2dccfd391b36a1f000bea
  Hash Date            : 2022-10-07 05:12:02
  ZOCL                 : 2.14.0, 43926231f7183688add2dccfd391b36a1f000bea

Devices present
BDF             :  Shell  Platform UUID  Device ID     Device Ready*
----------------------------------------------------------------------
[0000:00:00.0]  :  edge   0x0            user(inst=0)  Yes

* Devices that are not ready will have reduced functionality when using XRT tools

Checking DPU Availability

export DEBUG_DPU_CONTROLLER=1
show_dpu

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 08:24:27.245379  1928 dpu_controller_dnndk.cpp:279] cancel register the dnndk dpu controller, because /dev/dpu is not opened
I0606 08:24:27.245836  1928 dpu_controller.cpp:42] add factory method 02_xrt
I0606 08:24:27.245878  1928 dpu_control_xrt.cpp:113] register the xrt edge dpu controller
I0606 08:24:27.258949  1928 dpu_control_xrt.cpp:53] xrt dpu cu  is detected, kernel = DPUCZDX8G
I0606 08:24:27.259016  1928 dpu_control_xrt.cpp:82] create DpuControllerXrtEdge for DPUCZDX8G
I0606 08:24:27.259049  1928 dpu_control_xrt_edge.cpp:53] creating dpu controller:  this=0xaaab013e8e10
I0606 08:24:27.259078  1928 dpu_controller.cpp:57] create dpu controller via 02_xrt ret= 0xaaab013e8e10
device_core_id=0 device= 0 core = 0 fingerprint = 0x101000056010407 batch = 1 full_cu_name=DPUCZDX8G:DPUCZDX8G_1

I0606 08:24:27.259140  1928 dpu_control_xrt_edge.cpp:60] destroying dpu controller:  this=0xaaab013e8e10

xdputil query

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 08:26:45.284555  2048 dpu_controller_dnndk.cpp:279] cancel register the dnndk dpu controller, because /dev/dpu is not opened
I0606 08:26:45.285107  2048 dpu_controller.cpp:42] add factory method 02_xrt
I0606 08:26:45.285149  2048 dpu_control_xrt.cpp:113] register the xrt edge dpu controller
{
    "DPU IP Spec":{
        "DPU Core Count":1,
        "IP version":"v4.1.0",
        "generation timestamp":"2022-11-30 19-15-00",
        "git commit id":"ce8dd1",
        "git commit time":2022113019,
        "regmap":"1to1 version"
    },
    "VAI Version":{
        "libvaip-core.so":"Xilinx vaip Version: 1.0.0-a176db67b19f94b0a31f9d24ef80322efe4494ad  2022-12-27-01:24:22 ",
        "libvart-runner.so":"Xilinx vart-runner Version: 3.0.0-2efa5fe1e56c2b2c8a7e71e9fc1636242dd50a9f  2022-12-27-00:47:05 ",
        "libvitis_ai_library-dpu_task.so":"Xilinx vitis_ai_library dpu_task Version: 3.0.0-1cccff04dc341c4a6287226828f90aed56005f4f  2022-12-20 10:29:01 [UTC] ",
        "libxir.so":"Xilinx xir Version: xir-9204ac72103092a7b253a0c23ec7471481656940 2022-12-27-00:46:16",
        "target_factory":"target-factory.3.0.0 860ed0499ab009084e2df3004eeb9ae710c26351"
    },
    "kernels":[
        {
            "DPU Arch":"DPUCZDX8G_ISA1_B4096",
            "DPU Frequency (MHz)":300,
            "IP Type":"DPU",
            "Load Parallel":2,
            "Load augmentation":"enable",
            "Load minus mean":"disable",
            "Save Parallel":2,
            "XRT Frequency (MHz)":300,
            "cu_addr":"0xa0010000",
            "cu_handle":"0xaaaaf9957c70",
            "cu_idx":0,
            "cu_mask":1,
            "cu_name":"DPUCZDX8G:DPUCZDX8G_1",
            "device_id":0,
            "fingerprint":"0x101000056010407",
            "name":"DPU Core 0"
        }
    ]
}

ResNet50 (Image Classification)

## Root Login (as needed)
sudo su -l root

## Download Model Zoo (as needed)
#wget https://www.xilinx.com/bin/public/openDownload?filename=resnet50-zcu102_zcu104_kv260-r3.0.0.tar.gz -O resnet50-zcu102_zcu104_kv260-r3.0.0.tar.gz
#tar -xzvf resnet50-zcu102_zcu104_kv260-r3.0.0.tar.gz
#cp resnet50 /usr/share/vitis_ai_library/models -r

## Download Sample Images from Xilinx > Extract
cd
tar -xzvf vitis_ai_runtime_r3.0.0_image_video.tar.gz -C Vitis-AI/examples/vai_runtime

## Build Application (as needed)
#cd ~/Vitis-AI/examples/vai_runtime/resnet50
#chmod 755 ./build.sh
#./build.sh

## Run Example
cd ~/Vitis-AI/examples/vai_runtime/resnet50
./resnet50 /usr/share/vitis_ai_library/models/resnet50/resnet50.xmodel

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 09:46:05.968466  6374 main.cc:292] create running for subgraph: subgraph_conv1
I0606 09:46:05.985018  6374 dpu_controller_dnndk.cpp:279] cancel register the dnndk dpu controller, because /dev/dpu is not opened
I0606 09:46:05.985324  6374 dpu_controller.cpp:42] add factory method 02_xrt
I0606 09:46:05.985356  6374 dpu_control_xrt.cpp:113] register the xrt edge dpu controller
I0606 09:46:05.998131  6374 dpu_control_xrt.cpp:53] xrt dpu cu  is detected, kernel = DPUCZDX8G
I0606 09:46:05.998200  6374 dpu_control_xrt.cpp:82] create DpuControllerXrtEdge for DPUCZDX8G
I0606 09:46:05.998237  6374 dpu_control_xrt_edge.cpp:53] creating dpu controller:  this=0xaaaaf72b4960
I0606 09:46:05.998266  6374 dpu_controller.cpp:57] create dpu controller via 02_xrt ret= 0xaaaaf72b4960
I0606 09:46:06.282402  6374 dpu_control_xrt_edge.cpp:115] code 0x19000000 core_idx 0 gen_reg:  0x19100000 0x1aa00000 ...

Image : vitis-ai_gorilla_market.jpg
top[0] prob = 0.xxxxxx  name = Gorilla
I0606 09:46:26.501123  6374 dpu_control_xrt_edge.cpp:60] destroying dpu controller:  this=0xaaaaf72b4960

Troubleshooting - File Descriptor

## F0606 09:12:00.986861  4629 file_lock_lnx.cpp:28] Check failed: fd >= 0 (-1 vs. 0) cannot open file: /tmp/DPU_0

## Reset file descriptor
rm /tmp/DPU_0

Troubleshooting - OpenCV

https://github.com/opencv/opencv/issues/18461

## terminate called after throwing an instance of 'cv::Exception'
##   what():  OpenCV(4.5.2) /usr/src/debug/opencv/4.5.2-r0/git/modules/highgui/src/window_gtk.cpp:624: error: (-2:Unspecified error) Can't initialize GTK backend in function 'cvInitSystem'

## Disable DISPLAY if running the application via SSH or non-GUI environment
export DISPLAY=:0.0

Troubleshooting - Fingerprint Failure

https://support.xilinx.com/s/question/0D54U00006wDmkzSAC/info-post-about-dpu-fingerprint

DPU fingerprint is a unique identifier used in Vitis AI to characterize different DPU targets
DPU fingerprint has a feature code depending on:
1. IP Core: DPUCZDX8G/DPUCVDX8G/…
2. Unique Architecture: B512/B800/B1024/B1152/B1600/B2304/B3136/B4096
3. Instruction Set Architecture (ISA): 0x01
4. Vitis AI Version: 2.5/3.0/…

## W0608 07:27:35.154747 83456 dpu_runner_base_imp.cpp:676] CHECK fingerprint fail ! model_fingerprint 0x101000056010407 dpu_fingerprint 0x101000016010406
## F0608 07:27:35.154840 83456 dpu_runner_base_imp.cpp:648] fingerprint check failure.

## Check if DPU architecture and fingerprint are same as that for the compiled model
xdputil query
cat resnet50/meta.json

## Workaround with disabling fingerprint check
env XLNX_ENABLE_FINGERPRINT_CHECK=0 ./resnet50 resnet50.xmodel

YOLOvX (Object Detection)

## Root Login (as needed)
sudo su -l root

cd ~/Vitis-AI/examples/vai_library/samples/yolovx
./test_jpeg_yolovx /usr/share/vitis_ai_library/models/yolox_nano_pt/yolox_nano_pt.xmodel vitis-ai_gorilla_market.jpg

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 10:08:46.059096  7557 demo.hpp:1193] batch: 0     image: vitis-ai_gorilla_market.jpg
I0606 10:08:46.059296  7557 process_result.hpp:32] RESULT: 16   78.75   25.94   502.17  505.75  0.469689

Custom Model Development (PyTorch)

https://github.com/Xilinx/Vitis-AI-Tutorials/blob/1.4/Design_Tutorials/09-mnist_pyt/README.md
https://www.paltek.co.jp/techblog/techinfo/211115_01

Basic training flow for Vitis AI:
1. Training: floating-point_model.pth
2. Quantization: floating-point_model.pth > int_model.xmodel
3. Compile: int_model.xmodel > int_model_kv260.xmodel

Step 0 - Setting-up Workspace

cd ~/Vitis-AI
./docker_run.sh xilinx/vitis-ai-pytorch-gpu:latest

conda activate vitis-ai-pytorch

git clone -b 1.4 https://github.com/Xilinx/Vitis-AI-Tutorials.git
cd Vitis-AI-Tutorials/Design_Tutorials/09-mnist_pyt/files/

Step 1 - Training

export BUILD=./build
export LOG=${BUILD}/logs
mkdir -p ${LOG}

vi train.py

## Remove following lines from train.py
================================
> torchvision.datasets.MNIST.resources = [
>     ('https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz', 'f68b3c2dcbeaaa9fbdd348bbdeb94873'),
>     ('https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz', 'd53e105ee54ea40749a09fcbcd1e9432'),
>     ('https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz', '9fb629c4189551a2d022fa330f9573f3'),
>     ('https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz', 'ec29112dd5afa0611ce80d1b7f02629c')
> ]
================================

## Training
python -u train.py -d ${BUILD} 2>&1 | tee ${LOG}/train.log

PyTorch version :  1.12.1
3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
[GCC 9.4.0]
-----------------------------------------
Command line options:
--build_dir    :  ./build
--batchsize    :  100
--learnrate    :  0.001
--epochs       :  3
-----------------------------------------
You have 1 CUDA devices available
Device 0 :  Tesla T4
Selecting device 0..
...
Epoch 1
Test set: Accuracy: 9814/10000 (98.14%)
Epoch 2
Test set: Accuracy: 9866/10000 (98.66%)
Epoch 3
Test set: Accuracy: 9898/10000 (98.98%)
Trained model written to ./build/float_model/f_model.pth

Step 2 - Quantization

## Quantize
python -u quantize.py -d ${BUILD} --quant_mode calib 2>&1 | tee ${LOG}/quant_calib.log

[VAIQ_NOTE]: Loading NNDCT kernels...
-----------------------------------------
PyTorch version :  1.12.1
3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
[GCC 9.4.0]
-----------------------------------------
Command line options:
--build_dir    :  ./build
--quant_mode   :  calib
--batchsize    :  100
-----------------------------------------
You have 1 CUDA devices available
Device 0 :  Tesla T4
Selecting device 0..
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: GPU information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization calibration process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cuda'.
[VAIQ_NOTE]: =>Parsing CNN...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model CNN is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(./build/quant_model/CNN.py)
[VAIQ_NOTE]: =>Get module with quantization.
Test set: Accuracy: 9904/10000 (99.04%)
[VAIQ_NOTE]: =>Exporting quant config.(./build/quant_model/quant_info.json)

## Evaluate Quantized Model
python -u quantize.py -d ${BUILD} --quant_mode test 2>&1 | tee ${LOG}/quant_test.log

[VAIQ_NOTE]: Loading NNDCT kernels...
-----------------------------------------
PyTorch version :  1.12.1
3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
[GCC 9.4.0]
-----------------------------------------
Command line options:
--build_dir    :  ./build
--quant_mode   :  test
--batchsize    :  100
-----------------------------------------
You have 1 CUDA devices available
Device 0 :  Tesla T4
Selecting device 0..
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: GPU information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization test process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cuda'.
[VAIQ_NOTE]: =>Parsing CNN...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model CNN is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(./build/quant_model/CNN.py)
[VAIQ_NOTE]: =>Get module with quantization.
Test set: Accuracy: 9901/10000 (99.01%)
[VAIQ_NOTE]: =>Converting to xmodel ...
[VAIQ_NOTE]: =>Successfully convert 'CNN' to xmodel.(./build/quant_model/CNN_int.xmodel)

Step 3 - Compile

vi compile.sh

## Add following lines to compile.sh if using KV260
================================
elif [ $1 = kv260 ]; then
      ARCH=/opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json
      TARGET=kv260
      echo "-----------------------------------------"
      echo "COMPILING MODEL FOR KV260.."
      echo "-----------------------------------------"
================================

## Compile
source compile.sh kv260 ${BUILD} ${LOG}

COMPILING MODEL FOR KV260..
-----------------------------------------
[UNILOG][INFO] Compile mode: dpu
[UNILOG][INFO] Debug mode: null
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA1_B4096
[UNILOG][INFO] Graph name: CNN, with op num: 33
[UNILOG][INFO] Begin to compile...
[UNILOG][INFO] Total device subgraph number 3, DPU subgraph number 1
[UNILOG][INFO] Compile done.
[UNILOG][INFO] The meta json is saved to "./build/compiled_model/meta.json"
[UNILOG][INFO] The compiled xmodel is saved to "./build/compiled_model/CNN_kv260.xmodel"
[UNILOG][INFO] The compiled xmodel's md5sum is ed77..., and has been saved to "./build/compiled_model/md5sum.txt"
**************************************************
* VITIS_AI Compilation - Xilinx Inc.
**************************************************
-----------------------------------------
MODEL COMPILED
-----------------------------------------

Step 4 - Application

vi target.py

## Modify the following line of target.py if using KV260
================================
ap.add_argument('-t', '--target', type=str, default='zcu102', choices=['zcu102','zcu104','u50','vck190','kv260'], help='Target board type')
================================

## Preparing files for the target board
python -u target.py --target kv260 -d ${BUILD} 2>&1 | tee ${LOG}/target_kv260.log

3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
[GCC 9.4.0]
------------------------------------
Command line options:
--build_dir    :  ./build
--target       :  kv260
--num_images   :  10000
--app_dir      :  application
------------------------------------
Copying application code from application ...
Copying compiled model from ./build/compiled_model/CNN_kv260.xmodel ...

Step 5 - Running on Target

## Copy resulting files to the target board
cd build
tar cvfz target_kv260.tar.gz target_kv260
scp target_kv260.tar.gz root@kv260:

## Extract the resulting tar file on the target board
tar xvfz target_kv260.tar.gz
cd target_kv260

## Installing OpenCV (as needed)
sudo pip3 install opencv-python

## Loading DPU (as needed)
sudo xmutil listapps
sudo xmutil unloadapp
sudo xmutil loadapp kv260-benchmark-b4096

## Disabling fingerprint check (as needed)
export XLNX_ENABLE_FINGERPRINT_CHECK=0

## Running the application on the target board
python3 app_mt.py -m CNN_kv260.xmodel

Command line options:
--image_dir :  images
--threads   :  1
--model     :  CNN_kv260.xmodel
-------------------------------
Pre-processing 10000 images...
-------------------------------
Starting 1 threads...
-------------------------------
Throughput=4641.32 fps, total frames = 10000, time=2.1546 seconds
Correct:9886, Wrong:114, Accuracy:0.9886
-------------------------------

Custom Model Development via Model Zoo (PyTorch)

The host machine for training needs at least around 16GB memory, plus many and powerful GPUs
Training with 300 epochs (default) of the whole MS-COCO 2017 dataset on an NVIDIA Tesla T4 x 1 GPU board would take of order around 100 days

Step 0 - Setting-up Workspace

cd ~/Vitis-AI
./docker_run.sh xilinx/vitis-ai-pytorch-gpu:latest

conda activate vitis-ai-pytorch

## Download Model Zoo
wget https://www.xilinx.com/bin/public/openDownload?filename=pt_yolox-nano_coco_416_416_1G_3.0.zip -O pt_yolox-nano_coco_416_416_1G_3.0.zip
unzip pt_yolox-nano_coco_416_416_1G_3.0.zip
cd pt_yolox-nano_coco_416_416_1G_3.0

## Installing Python Modules
pip install --user -r requirements.txt
cd code
pip install --user -v -e .
cd ..

## Preparing Dataset of MS-COCO (as needed)
cd data/COCO
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
unzip annotations_trainval2017.zip
unzip train2017.zip
unzip val2017.zip
unzip test2017.zip
cd ../../

Step 1 - Evaluation

## Evaluation
bash code/run_eval.sh

Conducting test...
2023-06-27 09:09:19 | INFO | __main__:139 - Args: Namespace(batch_size=32, ckpt='float/yolox_nano.pth', ...)
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-27 09:09:32 | INFO | __main__:149 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-27 09:09:32 | INFO | __main__:150 - Model Structure:
YOLOX(
...
)
2023-06-27 09:09:32 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-27 09:09:34 | INFO | yolox.data.datasets.coco:64 - Done (t=1.79s)
2023-06-27 09:09:34 | INFO | pycocotools.coco:86 - creating index...
2023-06-27 09:09:34 | INFO | pycocotools.coco:86 - index created!
2023-06-27 09:09:56 | INFO | __main__:165 - loading checkpoint from float/yolox_nano.pth
2023-06-27 09:09:57 | INFO | __main__:169 - loaded checkpoint done.
100%|##########| 157/157 [04:24<00:00,  1.68s/it]
2023-06-28 08:16:47 | INFO | yolox.evaluators.coco_evaluator:256 - Evaluate in main process...
2023-06-28 08:17:10 | INFO | yolox.evaluators.coco_evaluator:289 - Loading and preparing results...
2023-06-28 08:17:17 | INFO | yolox.evaluators.coco_evaluator:289 - DONE (t=6.73s)
2023-06-28 08:17:17 | INFO | pycocotools.coco:366 - creating index...
2023-06-28 08:17:18 | INFO | pycocotools.coco:366 - index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 18.23 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 2.92 seconds.
2023-06-28 08:17:43 | INFO | __main__:196 -
Average forward time: 5.14 ms, Average NMS time: 0.83 ms, Average inference time: 5.97 ms
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.220
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.365
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.226
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.062
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.225
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.357
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.218
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.351
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.384
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.130
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.428
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.586
per class AP:
| class         | AP     | class        | AP     | class          | AP     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 35.267 | bicycle      | 13.570 | car            | 18.173 |
| motorcycle    | 25.904 | airplane     | 43.217 | bus            | 45.145 |
| train         | 49.695 | truck        | 15.139 | boat           | 10.732 |
| traffic light | 11.231 | fire hydrant | 40.946 | stop sign      | 48.689 |
| parking meter | 24.575 | bench        | 10.999 | bird           | 13.384 |
| cat           | 37.786 | dog          | 34.574 | horse          | 32.929 |
| sheep         | 24.128 | cow          | 27.846 | elephant       | 44.797 |
| bear          | 45.493 | zebra        | 48.477 | giraffe        | 52.491 |
| backpack      | 3.131  | umbrella     | 21.010 | handbag        | 2.292  |
| tie           | 14.239 | suitcase     | 11.637 | frisbee        | 35.369 |
| skis          | 8.520  | snowboard    | 7.550  | sports ball    | 18.980 |
| kite          | 23.806 | baseball bat | 9.997  | baseball glove | 15.666 |
| skateboard    | 22.507 | surfboard    | 14.978 | tennis racket  | 21.614 |
| bottle        | 11.882 | wine glass   | 10.445 | cup            | 15.231 |
| fork          | 9.913  | knife        | 3.421  | spoon          | 1.997  |
| bowl          | 23.385 | banana       | 12.635 | apple          | 7.757  |
| sandwich      | 21.507 | orange       | 18.874 | broccoli       | 13.849 |
| carrot        | 9.995  | hot dog      | 14.199 | pizza          | 35.764 |
| donut         | 23.774 | cake         | 16.139 | chair          | 11.971 |
| couch         | 31.531 | potted plant | 10.725 | bed            | 32.832 |
| dining table  | 23.782 | toilet       | 47.803 | tv             | 42.598 |
| laptop        | 36.937 | mouse        | 31.544 | remote         | 3.964  |
| keyboard      | 30.560 | cell phone   | 15.958 | microwave      | 34.743 |
| oven          | 21.415 | toaster      | 0.446  | sink           | 21.817 |
| refrigerator  | 35.881 | book         | 5.182  | clock          | 29.555 |
| vase          | 13.550 | scissors     | 8.278  | teddy bear     | 23.986 |
| hair drier    | 0.000  | toothbrush   | 4.453  |                |        |
per class AR:
| class         | AR     | class        | AR     | class          | AR     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 46.651 | bicycle      | 27.834 | car            | 32.623 |
| motorcycle    | 41.853 | airplane     | 56.084 | bus            | 53.145 |
| train         | 61.421 | truck        | 41.232 | boat           | 27.783 |
| traffic light | 25.315 | fire hydrant | 50.891 | stop sign      | 54.400 |
| parking meter | 42.333 | bench        | 28.273 | bird           | 25.340 |
| cat           | 57.970 | dog          | 53.853 | horse          | 48.934 |
| sheep         | 42.994 | cow          | 44.247 | elephant       | 61.429 |
| bear          | 55.634 | zebra        | 59.361 | giraffe        | 62.888 |
| backpack      | 19.057 | umbrella     | 38.894 | handbag        | 18.593 |
| tie           | 27.540 | suitcase     | 35.418 | frisbee        | 47.130 |
| skis          | 27.593 | snowboard    | 20.290 | sports ball    | 27.308 |
| kite          | 38.012 | baseball bat | 24.690 | baseball glove | 31.892 |
| skateboard    | 41.061 | surfboard    | 30.637 | tennis racket  | 36.089 |
| bottle        | 29.891 | wine glass   | 20.469 | cup            | 33.017 |
| fork          | 24.558 | knife        | 16.062 | spoon          | 11.107 |
| bowl          | 45.120 | banana       | 34.459 | apple          | 32.076 |
| sandwich      | 47.797 | orange       | 42.807 | broccoli       | 42.276 |
| carrot        | 33.041 | hot dog      | 28.080 | pizza          | 51.514 |
| donut         | 40.427 | cake         | 36.968 | chair          | 35.178 |
| couch         | 57.203 | potted plant | 35.351 | bed            | 54.356 |
| dining table  | 47.094 | toilet       | 62.179 | tv             | 57.951 |
| laptop        | 51.688 | mouse        | 51.509 | remote         | 22.544 |
| keyboard      | 49.477 | cell phone   | 31.565 | microwave      | 58.000 |
| oven          | 46.503 | toaster      | 6.667  | sink           | 42.756 |
| refrigerator  | 56.825 | book         | 20.399 | clock          | 43.258 |
| vase          | 31.861 | scissors     | 21.667 | teddy bear     | 41.263 |
| hair drier    | 0.000  | toothbrush   | 13.860 |                |        |

Troubleshooting - Number of Workers

## UserWarning: This DataLoader will create 4 worker processes in total.
## Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create.
## Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.

## Reduce the number of workers
vi ./code/yolox/exp/yolox_base.py

## Modify the following line of yolox_base.py depending on the system (as needed)
================================
self.data_num_workers = 1
================================

Step 2 - Training

## Update the number of output classes
vi ./code/yolox/exp/yolox_base.py

## Modify the following line of yolox_base.py depending on the dataset and desired model (as needed)
================================
self.num_classes = 80
================================

## Training
bash code/run_train.sh

Conducting training...
2023-06-28 08:38:14 | INFO | yolox.core.trainer:130 - args: Namespace(batch_size=4, ...)
2023-06-28 08:38:14 | INFO | yolox.core.trainer:131 - exp value:
╒═══════════════════╤════════════════════════════╕
│ keys              │ values                     │
╞═══════════════════╪════════════════════════════╡
│ seed              │ None                       │
├───────────────────┼────────────────────────────┤
│ output_dir        │ './YOLOX_outputs'          │
├───────────────────┼────────────────────────────┤
│ print_interval    │ 10                         │
├───────────────────┼────────────────────────────┤
│ eval_interval     │ 10                         │
├───────────────────┼────────────────────────────┤
│ num_classes       │ 80                         │
├───────────────────┼────────────────────────────┤
│ depth             │ 0.33                       │
├───────────────────┼────────────────────────────┤
│ width             │ 0.25                       │
├───────────────────┼────────────────────────────┤
│ act               │ 'relu'                     │
├───────────────────┼────────────────────────────┤
│ data_num_workers  │ 2                          │
├───────────────────┼────────────────────────────┤
│ input_size        │ (416, 416)                 │
├───────────────────┼────────────────────────────┤
│ multiscale_range  │ 5                          │
├───────────────────┼────────────────────────────┤
│ data_dir          │ 'data/COCO'                │
├───────────────────┼────────────────────────────┤
│ train_ann         │ 'instances_train2017.json' │
├───────────────────┼────────────────────────────┤
│ val_ann           │ 'instances_val2017.json'   │
├───────────────────┼────────────────────────────┤
│ test_ann          │ 'instances_test2017.json'  │
├───────────────────┼────────────────────────────┤
│ mosaic_prob       │ 0.5                        │
├───────────────────┼────────────────────────────┤
│ mixup_prob        │ 1.0                        │
├───────────────────┼────────────────────────────┤
│ hsv_prob          │ 1.0                        │
├───────────────────┼────────────────────────────┤
│ flip_prob         │ 0.5                        │
├───────────────────┼────────────────────────────┤
│ degrees           │ 10.0                       │
├───────────────────┼────────────────────────────┤
│ translate         │ 0.1                        │
├───────────────────┼────────────────────────────┤
│ mosaic_scale      │ (0.5, 1.5)                 │
├───────────────────┼────────────────────────────┤
│ enable_mixup      │ False                      │
├───────────────────┼────────────────────────────┤
│ mixup_scale       │ (0.5, 1.5)                 │
├───────────────────┼────────────────────────────┤
│ shear             │ 2.0                        │
├───────────────────┼────────────────────────────┤
│ warmup_epochs     │ 5                          │
├───────────────────┼────────────────────────────┤
│ max_epoch         │ 300                        │
├───────────────────┼────────────────────────────┤
│ warmup_lr         │ 0                          │
├───────────────────┼────────────────────────────┤
│ min_lr_ratio      │ 0.05                       │
├───────────────────┼────────────────────────────┤
│ basic_lr_per_img  │ 0.00015625                 │
├───────────────────┼────────────────────────────┤
│ scheduler         │ 'yoloxwarmcos'             │
├───────────────────┼────────────────────────────┤
│ no_aug_epochs     │ 15                         │
├───────────────────┼────────────────────────────┤
│ ema               │ True                       │
├───────────────────┼────────────────────────────┤
│ weight_decay      │ 0.0005                     │
├───────────────────┼────────────────────────────┤
│ momentum          │ 0.9                        │
├───────────────────┼────────────────────────────┤
│ save_history_ckpt │ True                       │
├───────────────────┼────────────────────────────┤
│ exp_name          │ 'yolox_nano_deploy_relu'   │
├───────────────────┼────────────────────────────┤
│ test_size         │ (416, 416)                 │
├───────────────────┼────────────────────────────┤
│ test_conf         │ 0.01                       │
├───────────────────┼────────────────────────────┤
│ nmsthre           │ 0.65                       │
├───────────────────┼────────────────────────────┤
│ random_size       │ (10, 20)                   │
╘═══════════════════╧════════════════════════════╛
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-28 08:38:17 | INFO | yolox.core.trainer:137 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-28 08:38:21 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 08:38:44 | INFO | yolox.data.datasets.coco:64 - Done (t=23.58s)
2023-06-28 08:38:44 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 08:38:45 | INFO | pycocotools.coco:86 - index created!
2023-06-28 08:39:22 | INFO | yolox.core.trainer:155 - init prefetcher, this might take one minute or less...
2023-06-28 08:40:16 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 08:40:17 | INFO | yolox.data.datasets.coco:64 - Done (t=1.01s)
2023-06-28 08:40:17 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 08:40:17 | INFO | pycocotools.coco:86 - index created!
2023-06-28 08:40:19 | INFO | yolox.core.trainer:191 - Training start...
2023-06-28 08:40:19 | INFO | yolox.core.trainer:192 -
YOLOX(
...
)
2023-06-28 08:40:19 | INFO | yolox.core.trainer:203 - ---> start train epoch1
2023-06-28 08:40:40 | INFO | yolox.core.trainer:261 - epoch: 1/300, iter: 10/29572, mem: 13006Mb, iter_time: 2.057s, data_time: 0.062s, total_loss: 19.7, ...
2023-06-28 08:40:43 | INFO | yolox.core.trainer:261 - epoch: 1/300, iter: 20/29572, mem: 13006Mb, iter_time: 0.346s, data_time: 0.200s, total_loss: 14.5, ...
...
(Estimated time of arrival with NVIDIA Tesla T4 x 1 would be around 100 days...)
...
2023-06-28 21:43:02 | INFO | yolox.core.trainer:356 - Save weights to ./YOLOX_outputs/yolox_nano_deploy_relu
100%|##########| 1250/1250 [01:54<00:00, 10.92it/s]
2023-06-28 21:44:56 | INFO | yolox.evaluators.coco_evaluator:256 - Evaluate in main process...
2023-06-28 21:44:56 | INFO | yolox.evaluators.coco_evaluator:289 - Loading and preparing results...
2023-06-28 21:44:56 | INFO | yolox.evaluators.coco_evaluator:289 - DONE (t=0.02s)
2023-06-28 21:44:56 | INFO | pycocotools.coco:366 - creating index...
2023-06-28 21:44:57 | INFO | pycocotools.coco:366 - index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 11.71 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.66 seconds.
2023-06-28 21:45:10 | INFO | yolox.core.trainer:346 -
Average forward time: 4.55 ms, Average NMS time: 0.35 ms, Average inference time: 4.90 ms
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
2023-06-28 21:45:10 | INFO | yolox.core.trainer:356 - Save weights to ./YOLOX_outputs/yolox_nano_deploy_relu
2023-06-28 21:45:10 | INFO | yolox.core.trainer:356 - Save weights to ./YOLOX_outputs/yolox_nano_deploy_relu
2023-06-28 21:45:10 | INFO | yolox.core.trainer:196 - Training of experiment is done and the best AP is 0.00

Troubleshooting - Number of GPUs

## RuntimeError: NCCL error in: ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
## ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

## Reduce the number of GPUs
vi ./code/run_train.sh

## Modify the following lines of run_train.sh depending on the system (as needed)
================================
export CUDA_VISIBLE_DEVICES=0
GPU_NUM=1
================================

Troubleshooting - GPU/CUDA Out of Memory

## RuntimeError: CUDA out of memory.
## Tried to allocate 170.00 MiB (GPU 0; 14.76 GiB total capacity; 13.63 GiB already allocated; 72.81 MiB free; 13.63 GiB reserved in total by PyTorch)
## If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
## See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

## Reduce the batch size (or upgrading GPUs would be preferable if possible)
vi ./code/run_train.sh

## Modify the following line of run_train.sh depending on the system (as needed)
================================
BATCH=4
================================

Step 3 - Quantization

## Quantization and xmodel Dumping
bash code/run_quant.sh

[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-28 19:56:48 | INFO | __main__:148 - Args: Namespace(batch_size=32, ckpt='float/yolox_nano.pth', ...)
2023-06-28 19:56:48 | INFO | __main__:163 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-28 19:56:48 | INFO | __main__:164 - Model Structure:
YOLOX(
...
)
2023-06-28 19:56:48 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 19:56:49 | INFO | yolox.data.datasets.coco:64 - Done (t=0.64s)
2023-06-28 19:56:49 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 19:56:49 | INFO | pycocotools.coco:86 - index created!
2023-06-28 19:56:52 | INFO | __main__:181 - loading checkpoint from float/yolox_nano.pth
2023-06-28 19:56:52 | INFO | __main__:188 - loaded checkpoint done.
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: GPU information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization calibration process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cuda'.
[VAIQ_NOTE]: =>Parsing YOLOX...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model YOLOX is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(quantize_result/YOLOX.py)
[VAIQ_NOTE]: =>Get module with quantization.
100%|##########| 157/157 [01:51<00:00,  1.41it/s]
2023-06-28 19:58:51 | INFO | yolox.evaluators.coco_evaluator_q:270 - Evaluate in main process...
2023-06-28 19:59:11 | INFO | yolox.evaluators.coco_evaluator_q:303 - Loading and preparing results...
2023-06-28 19:59:16 | INFO | yolox.evaluators.coco_evaluator_q:303 - DONE (t=5.47s)
2023-06-28 19:59:16 | INFO | pycocotools.coco:366 - creating index...
2023-06-28 19:59:17 | INFO | pycocotools.coco:366 - index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 16.24 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 2.56 seconds.
2023-06-28 19:59:38 | INFO | __main__:236 -
Average forward time: 17.37 ms, Average NMS time: 0.69 ms, Average inference time: 18.06 ms
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.137
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.262
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.132
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.043
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.153
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.228
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.158
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.267
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.300
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.091
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.333
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.476
per class AP:
...
[VAIQ_NOTE]: =>Exporting quant config.(quantize_result/quant_info.json)
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-28 19:59:45 | INFO | __main__:148 - Args: Namespace(batch_size=32, ckpt='float/yolox_nano.pth', ...)
2023-06-28 19:59:46 | INFO | __main__:163 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-28 19:59:46 | INFO | __main__:164 - Model Structure:
YOLOX(
...
)
2023-06-28 19:59:46 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 19:59:46 | INFO | yolox.data.datasets.coco:64 - Done (t=0.65s)
2023-06-28 19:59:46 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 19:59:46 | INFO | pycocotools.coco:86 - index created!
2023-06-28 19:59:50 | INFO | __main__:181 - loading checkpoint from float/yolox_nano.pth
2023-06-28 19:59:50 | INFO | __main__:188 - loaded checkpoint done.
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: GPU information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization test process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cuda'.
[VAIQ_NOTE]: =>Parsing YOLOX...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model YOLOX is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(quantize_result/YOLOX.py)
[VAIQ_NOTE]: =>Get module with quantization.
100%|##########| 157/157 [00:54<00:00,  2.89it/s]
2023-06-28 20:00:51 | INFO | yolox.evaluators.coco_evaluator_q:270 - Evaluate in main process...
2023-06-28 20:01:12 | INFO | yolox.evaluators.coco_evaluator_q:303 - Loading and preparing results...
2023-06-28 20:01:17 | INFO | yolox.evaluators.coco_evaluator_q:303 - DONE (t=5.45s)
2023-06-28 20:01:17 | INFO | pycocotools.coco:366 - creating index...
2023-06-28 20:01:18 | INFO | pycocotools.coco:366 - index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 15.82 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 2.59 seconds.
2023-06-28 20:01:38 | INFO | __main__:236 -
Average forward time: 5.94 ms, Average NMS time: 0.78 ms, Average inference time: 6.71 ms
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.136
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.264
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.132
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.041
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.155
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.226
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.156
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.265
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.298
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.093
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.471
per class AP:
...
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-28 20:01:48 | INFO | __main__:148 - Args: Namespace(batch_size=32, ckpt='float/yolox_nano.pth', ...)
2023-06-28 20:01:48 | INFO | __main__:163 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-28 20:01:48 | INFO | __main__:164 - Model Structure:
YOLOX(
...
)
2023-06-28 20:01:48 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 20:01:50 | INFO | yolox.data.datasets.coco:64 - Done (t=1.81s)
2023-06-28 20:01:50 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 20:01:50 | INFO | pycocotools.coco:86 - index created!
2023-06-28 20:01:52 | INFO | __main__:181 - loading checkpoint from float/yolox_nano.pth
2023-06-28 20:01:54 | INFO | __main__:188 - loaded checkpoint done.
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization test process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cpu'.
[VAIQ_NOTE]: =>Parsing YOLOX...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model YOLOX is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(quantize_result/YOLOX.py)
[VAIQ_NOTE]: =>Get module with quantization.
  0%|          | 0/5000 [00:00<?, ?it/s]
2023-06-28 20:02:01 | INFO | __main__:236 -
[VAIQ_NOTE]: =>Converting to xmodel ...
[VAIQ_NOTE]: =>Dumping 'YOLOX_0'' checking data...
[VAIQ_NOTE]: =>Finsh dumping data.(quantize_result/deploy_check_data_int/YOLOX_0)
[VAIQ_NOTE]: =>Successfully convert 'YOLOX_0' to xmodel.(quantize_result/YOLOX_0_int.xmodel)

Step Ex. - Quantization-Aware Training (as needed)

## Quantization-Aware Training, Model Converting, and xmodel Dumping
bash code/run_qat.sh

References

https://misoji-engineer.com/archives/vitis-ai-how-to.html
https://misoji-engineer.com/archives/vitis-ai-3-0.html
https://misoji-engineer.com/archives/build-vitis-ai-gpu.html
https://www.paltek.co.jp/techblog/tag/ai
https://www.pixela.co.jp/products/pickup/dev/ai/vitisai_ai_3_model_zoo.html
https://misoji-engineer.com/archives/vitis-ai-model-zoo.html
https://tomosoft.jp/design/?p=44403
https://www.pixela.co.jp/products/pickup/dev/
https://www.paltek.co.jp/techblog/techinfo/220121_01

Table of Contents

Vitis AI

Getting Started

System Requirements

IP Core / Bitstream

OS Image

Model Zoo

Evaluation Boards

AI Optimizer (Optional)

Host Setup on Ubuntu

PyTorch (CPU Only)

PyTorch (GPU Support)

Troubleshooting - Exit Code 137

Troubleshooting - Docker Image Not Found

Jupyter Notebook in Docker Image

Target Board Setup (Zynq UltraScale+ MPSoC - DPUCZDX8G)

Cross-Compiler on Host Machine

Troubleshooting - LSB Modules

OS Installation

Checking Device

Checking DPU Availability

ResNet50 (Image Classification)

Troubleshooting - File Descriptor

Troubleshooting - OpenCV

Troubleshooting - Fingerprint Failure

YOLOvX (Object Detection)

Custom Model Development (PyTorch)

Step 0 - Setting-up Workspace

Step 1 - Training

Step 2 - Quantization

Step 3 - Compile

Step 4 - Application

Step 5 - Running on Target

Custom Model Development via Model Zoo (PyTorch)

Step 0 - Setting-up Workspace

Step 1 - Evaluation

Troubleshooting - Number of Workers

Step 2 - Training

Troubleshooting - Number of GPUs

Troubleshooting - GPU/CUDA Out of Memory

Step 3 - Quantization

Step Ex. - Quantization-Aware Training (as needed)

References