Vitis AI
Getting Started
System Requirements
https://docs.xilinx.com/r/en-US/ug1414-vitis-ai
https://xilinx.github.io/Vitis-AI/3.5/html/docs/reference/system_requirements.html (3.0)
- Officially supported FPGA boards are limited, so check the list carefully
- The host side requires Ubuntu 20.04 (Debian is not supported) with CUDA 11.3, the NVIDIA Container Toolkit, and Docker (as of 2023)
- CUDA 11.4 has also been used for convenience, apparently without problems
- The host machine needs at least about 8GB of memory, or the OoM Killer terminates the process mid-run > on GCP this means at least an n1-standard-2 equivalent
IP Core / Bitstream
https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Deep-Learning-Processor-Unit
https://docs.xilinx.com/r/en-US/pg338-dpu (JP)
https://github.com/Xilinx/Vitis-AI/tree/master/dpu
https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-system-integration (3.0)
- An IP core named DPU, an FPGA circuit packed with deep-learning processing functions, is provided
- The DPU circuit is written to the FPGA in advance and controlled through a driver named XRT
- Generate a bitstream with Vivado/Vitis
- Download the bitstream to the target board
- Install the related driver (and its dependent libraries)
- For example, on Zynq UltraScale+ MPSoC the DPUCZDX8G IP core (bitstream) is available
- Note that the architecture (including the ISA) and the Vitis AI version must also match: B512/B800/B1024/B1152/B1600/B2304/B3136/B4096
- [IMPORTANT] Vitis AI does not perform circuit synthesis automatically, so the appropriate DPU must be downloaded to the FPGA beforehand !!!
- On the Kria KV260, using the DPU-ready OS image (Option C) downloads the appropriate DPU automatically
- If you choose Ubuntu or another OS image, configure the DPU yourself or install it via apt (see Kria KV260) > watch out for version consistency
OS Image
https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Flashing-the-OS-Image-to-the-SD-Card
- Using the Vitis AI PetaLinux image tuned for the DPU requires no extra configuration (see Kria KV260)
- The "Benchmarking" mentioned in the license is a business term and does not refer to FPGA performance testing
- If you use Ubuntu or another image from the official site, the DPU must be configured separately (see Kria KV260)
Model Zoo
https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-model-zoo.html (3.0)
https://xilinx.github.io/Vitis-AI/3.5/html/docs/reference/ModelZoo_Github_web.htm (3.0)
https://github.com/Xilinx/Vitis-AI/tree/master/model_zoo
- The available models are limited, and many come under non-commercial licenses, so check carefully
- The download URLs can be looked up in the YAML files under Vitis-AI/model_zoo; a lookup sketch follows this list
- The downloaded zip file also contains the programs for retraining
- If the matching DPU has not been downloaded to the FPGA, execution fails with a Fingerprint Failure > a partial workaround is described below
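The URL lookup can also be scripted. Below is a minimal sketch, assuming the model-list layout (model_zoo/model-list/<model>/model.yaml) and a "files" section with "download link" keys; the exact schema may differ between Model Zoo versions, so verify against your checkout.
================================
# Minimal sketch: print the download links recorded in a Model Zoo YAML file.
import yaml  # PyYAML

def list_download_links(yaml_path):
    with open(yaml_path) as f:
        meta = yaml.safe_load(f)
    for entry in meta.get("files", []):  # assumed schema; verify against your checkout
        print(entry.get("name"), "->", entry.get("download link"))

list_download_links("model_zoo/model-list/pt_yolox-nano_coco_416_416_1G_3.0/model.yaml")
================================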
Evaluation Boards
https://www.xilinx.com/products/boards-and-kits/see-all-evaluation-boards.html
https://www.xilinx.com/products/som/kria/kv260-vision-starter-kit.html
- The Kria KV260 is the least expensive option
AI Optimizer (Optional)
https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-model-development.html#model-optimization (3.0)
- Vitis AI Optimizer is an optional tool that can significantly enhance performance in many applications
- Vitis AI Optimizer requires the developer to purchase a license
Host Setup on Ubuntu
https://xilinx.github.io/Vitis-AI/3.5/html/docs/install/install.html (3.0)
- The pre-built cpu container should only be used when a GPU is not available on the host machine
- The docker_build process may take several hours to complete
- Often simply re-running the build script will result in success
git clone https://github.com/Xilinx/Vitis-AI
#git clone -b 3.0 https://github.com/Xilinx/Vitis-AI
PyTorch (CPU Only)
cd ~/Vitis-AI
docker pull xilinx/vitis-ai-pytorch-cpu:latest
./docker_run.sh xilinx/vitis-ai-pytorch-cpu:latest
latest: Pulling from xilinx/vitis-ai-pytorch-cpu
Digest: sha256:f55bd069ffd56c6358cae29df19e6085f2bcf8ea5e045744aa412fd72db521ed
Status: Image is up to date for xilinx/vitis-ai-pytorch-cpu:latest
docker.io/xilinx/vitis-ai-pytorch-cpu:latest
Setting up user's environment in the Docker container...
Running as vitis-ai-user with ID 0 and group 0
==========================================
(Vitis AI ASCII-art banner)
==========================================
Docker Image Version: ubuntu2004-3.0.0.106 (CPU)
Vitis AI Git Hash: d4ec26f
Build Date: 2023-01-08
WorkFlow: pytorch
PyTorch (GPU Support)
cd ~/Vitis-AI/docker
## License is necessary to build opt_pytorch
./docker_build.sh -t gpu -f pytorch
#./docker_build.sh -t gpu -f opt_pytorch
Validating Arguments...
Your inputs: Docker-Type:gpu, FrameWork:pytorch
...
 => => writing image sha256:f555d3cf57c1562de00d29987b768f08836018fba6052a189bd1365c292d54b9
 => => naming to docker.io/xilinx/vitis-ai-pytorch-gpu:3.5.0.001-bbd45838d
The list of NVIDIA CUDA and cuDNN images on Docker Hub is available below:
https://hub.docker.com/r/nvidia/cuda/tags
#docker run --gpus all nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04 nvidia-smi
docker run --gpus all nvidia/cuda:11.4.3-cudnn8-runtime-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8    10W /  70W |    105MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
cd ~/Vitis-AI
docker images
REPOSITORY                    TAG                   IMAGE ID       CREATED       SIZE
xilinx/vitis-ai-pytorch-gpu   3.5.0.001-bbd45838d   f555d3cf57c1   2 hours ago   31GB
## The 'latest' tag does not work in Vitis AI 3.5 (as of July 2023)
#./docker_run.sh xilinx/vitis-ai-pytorch-gpu:latest
./docker_run.sh xilinx/vitis-ai-pytorch-gpu:3.5.0.001-bbd45838d
Setting up user's environment in the Docker container...
Running as vitis-ai-user with ID 0 and group 0
==========================================
(Vitis AI ASCII-art banner)
==========================================
Docker Image Version: 3.5.0.001-bbd45838d (GPU)
Vitis AI Git Hash: bbd45838d
Build Date: 2023-07-20
WorkFlow: pytorch
Troubleshooting - Exit Code 137
https://stackoverflow.com/questions/31297616/what-is-the-authoritative-list-of-docker-run-exit-codes
https://komodor.com/learn/exit-codes-in-containers-and-kubernetes-the-complete-guide/
## ERROR: failed to solve: process "..." did not complete successfully: exit code: 137
Exit code 137 indicates that the container was immediately terminated by the operating system via a SIGKILL signal
The host machine needs more system memory; see Docker and OoM Killer for more details
Another workaround is to prepare additional swap space; see Swap for instructions
Troubleshooting - Docker Image Not Found
https://github.com/Xilinx/Vitis-AI/pull/1296
The 'latest' tag does not work in Vitis AI 3.5 (as of July 2023)
## Unable to find image 'xilinx/vitis-ai-pytorch-gpu:latest' locally
## docker: Error response from daemon: pull access denied for xilinx/vitis-ai-pytorch-gpu,
## repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
docker images
REPOSITORY                    TAG                   IMAGE ID       CREATED       SIZE
xilinx/vitis-ai-pytorch-gpu   3.5.0.001-bbd45838d   f555d3cf57c1   2 hours ago   31GB
docker tag xilinx/vitis-ai-pytorch-gpu:3.5.0.001-bbd45838d xilinx/vitis-ai-pytorch-gpu:latest
docker images
REPOSITORY                    TAG                   IMAGE ID       CREATED       SIZE
xilinx/vitis-ai-pytorch-gpu   3.5.0.001-bbd45838d   f555d3cf57c1   2 hours ago   31GB
xilinx/vitis-ai-pytorch-gpu   latest                f555d3cf57c1   2 hours ago   31GB
Jupyter Notebook in Docker Image
See Jupyter for details
jupyter notebook --port=8888 --ip=0.0.0.0
Target Board Setup (Zynq UltraScale+ MPSoC - DPUCZDX8G)
Cross-Compiler on Host Machine
## Exit or detach from the vitis-ai docker container before installing the cross compiler
cd ~/Vitis-AI/board_setup/mpsoc
chmod 755 ./host_cross_compiler_setup.sh
./host_cross_compiler_setup.sh
#unset LD_LIBRARY_PATH
source ~/petalinux_sdk_2022.2/environment-setup-cortexa72-cortexa53-xilinx-linux
## Run the vitis-ai docker container after installing the cross compiler
cd ~/Vitis-AI
./docker_run.sh xilinx/vitis-ai-pytorch-cpu:latest
## Activate the conda environment in the docker container
conda activate vitis-ai-pytorch
## Cross compile an example
#cd ~/Vitis-AI/examples/vai_runtime/resnet50
cd /workspace/examples/vai_runtime/resnet50
#bash -x build.sh
bash build.sh
Troubleshooting - LSB Modules
https://postgresweb.com/ubuntu-no-lsb-modules-are-available
## No LSB modules are available.
sudo apt install lsb-core
lsb_release -a
OS Installation
See Kria KV260 for the installation
Checking Device
xbutil examine
System Configuration
  OS Name              : Linux
  Release              : 5.15.36-xilinx-v2022.2
  Version              : #1 SMP Mon Oct 3 07:50:07 UTC 2022
  Machine              : aarch64
  CPU Cores            : 4
  Memory               : 3929 MB
  Distribution         : PetaLinux 2022.2_release_S10071807 (honister)
  GLIBC                : 2.34
  Model                : ZynqMP SMK-K26 Rev1/B/A

XRT
  Version              : 2.14.0
  Branch               : 2022.2
  Hash                 : 43926231f7183688add2dccfd391b36a1f000bea
  Hash Date            : 2022-10-07 05:12:02
  ZOCL                 : 2.14.0, 43926231f7183688add2dccfd391b36a1f000bea

Devices present
BDF             :  Shell  Platform UUID  Device ID     Device Ready*
----------------------------------------------------------------------
[0000:00:00.0]  :  edge   0x0            user(inst=0)  Yes

* Devices that are not ready will have reduced functionality when using XRT tools
Checking DPU Availability
export DEBUG_DPU_CONTROLLER=1
show_dpu
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 08:24:27.245379  1928 dpu_controller_dnndk.cpp:279] cancel register the dnndk dpu controller, because /dev/dpu is not opened
I0606 08:24:27.245836  1928 dpu_controller.cpp:42] add factory method 02_xrt
I0606 08:24:27.245878  1928 dpu_control_xrt.cpp:113] register the xrt edge dpu controller
I0606 08:24:27.258949  1928 dpu_control_xrt.cpp:53] xrt dpu cu is detected, kernel = DPUCZDX8G
I0606 08:24:27.259016  1928 dpu_control_xrt.cpp:82] create DpuControllerXrtEdge for DPUCZDX8G
I0606 08:24:27.259049  1928 dpu_control_xrt_edge.cpp:53] creating dpu controller: this=0xaaab013e8e10
I0606 08:24:27.259078  1928 dpu_controller.cpp:57] create dpu controller via 02_xrt ret= 0xaaab013e8e10
device_core_id=0 device= 0 core = 0 fingerprint = 0x101000056010407 batch = 1 full_cu_name=DPUCZDX8G:DPUCZDX8G_1
I0606 08:24:27.259140  1928 dpu_control_xrt_edge.cpp:60] destroying dpu controller: this=0xaaab013e8e10
xdputil query
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 08:26:45.284555  2048 dpu_controller_dnndk.cpp:279] cancel register the dnndk dpu controller, because /dev/dpu is not opened
I0606 08:26:45.285107  2048 dpu_controller.cpp:42] add factory method 02_xrt
I0606 08:26:45.285149  2048 dpu_control_xrt.cpp:113] register the xrt edge dpu controller
{
  "DPU IP Spec":{
    "DPU Core Count":1,
    "IP version":"v4.1.0",
    "generation timestamp":"2022-11-30 19-15-00",
    "git commit id":"ce8dd1",
    "git commit time":2022113019,
    "regmap":"1to1 version"
  },
  "VAI Version":{
    "libvaip-core.so":"Xilinx vaip Version: 1.0.0-a176db67b19f94b0a31f9d24ef80322efe4494ad 2022-12-27-01:24:22 ",
    "libvart-runner.so":"Xilinx vart-runner Version: 3.0.0-2efa5fe1e56c2b2c8a7e71e9fc1636242dd50a9f 2022-12-27-00:47:05 ",
    "libvitis_ai_library-dpu_task.so":"Xilinx vitis_ai_library dpu_task Version: 3.0.0-1cccff04dc341c4a6287226828f90aed56005f4f 2022-12-20 10:29:01 [UTC] ",
    "libxir.so":"Xilinx xir Version: xir-9204ac72103092a7b253a0c23ec7471481656940 2022-12-27-00:46:16",
    "target_factory":"target-factory.3.0.0 860ed0499ab009084e2df3004eeb9ae710c26351"
  },
  "kernels":[
    {
      "DPU Arch":"DPUCZDX8G_ISA1_B4096",
      "DPU Frequency (MHz)":300,
      "IP Type":"DPU",
      "Load Parallel":2,
      "Load augmentation":"enable",
      "Load minus mean":"disable",
      "Save Parallel":2,
      "XRT Frequency (MHz)":300,
      "cu_addr":"0xa0010000",
      "cu_handle":"0xaaaaf9957c70",
      "cu_idx":0,
      "cu_mask":1,
      "cu_name":"DPUCZDX8G:DPUCZDX8G_1",
      "device_id":0,
      "fingerprint":"0x101000056010407",
      "name":"DPU Core 0"
    }
  ]
}
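The "DPU Arch" and "fingerprint" fields above are what compiled models are checked against at load time. A minimal sketch for extracting them programmatically (it defensively skips anything printed before the JSON, since the glog lines normally go to stderr):
================================
# Minimal sketch: parse `xdputil query` output for the DPU arch and fingerprint.
import json
import subprocess

out = subprocess.run(["xdputil", "query"], capture_output=True, text=True).stdout
info = json.loads(out[out.index("{"):])  # skip any log lines before the JSON
for kernel in info.get("kernels", []):
    print(kernel.get("DPU Arch"), kernel.get("fingerprint"))
================================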
ResNet50 (Image Classification)
## Root Login (as needed)
sudo su -l root
## Download Model Zoo (as needed)
#wget https://www.xilinx.com/bin/public/openDownload?filename=resnet50-zcu102_zcu104_kv260-r3.0.0.tar.gz -O resnet50-zcu102_zcu104_kv260-r3.0.0.tar.gz
#tar -xzvf resnet50-zcu102_zcu104_kv260-r3.0.0.tar.gz
#cp resnet50 /usr/share/vitis_ai_library/models -r
## Download the sample images from Xilinx, then extract
cd
tar -xzvf vitis_ai_runtime_r3.0.0_image_video.tar.gz -C Vitis-AI/examples/vai_runtime
## Build Application (as needed)
#cd ~/Vitis-AI/examples/vai_runtime/resnet50
#chmod 755 ./build.sh
#./build.sh
## Run Example
cd ~/Vitis-AI/examples/vai_runtime/resnet50
./resnet50 /usr/share/vitis_ai_library/models/resnet50/resnet50.xmodel
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 09:46:05.968466  6374 main.cc:292] create running for subgraph: subgraph_conv1
I0606 09:46:05.985018  6374 dpu_controller_dnndk.cpp:279] cancel register the dnndk dpu controller, because /dev/dpu is not opened
I0606 09:46:05.985324  6374 dpu_controller.cpp:42] add factory method 02_xrt
I0606 09:46:05.985356  6374 dpu_control_xrt.cpp:113] register the xrt edge dpu controller
I0606 09:46:05.998131  6374 dpu_control_xrt.cpp:53] xrt dpu cu is detected, kernel = DPUCZDX8G
I0606 09:46:05.998200  6374 dpu_control_xrt.cpp:82] create DpuControllerXrtEdge for DPUCZDX8G
I0606 09:46:05.998237  6374 dpu_control_xrt_edge.cpp:53] creating dpu controller: this=0xaaaaf72b4960
I0606 09:46:05.998266  6374 dpu_controller.cpp:57] create dpu controller via 02_xrt ret= 0xaaaaf72b4960
I0606 09:46:06.282402  6374 dpu_control_xrt_edge.cpp:115] code 0x19000000 core_idx 0 gen_reg: 0x19100000 0x1aa00000
...
Image : vitis-ai_gorilla_market.jpg
top[0] prob = 0.xxxxxx  name = Gorilla
I0606 09:46:26.501123  6374 dpu_control_xrt_edge.cpp:60] destroying dpu controller: this=0xaaaaf72b4960
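The C++ sample above is driven by the VART runtime; the same xmodel can also be run from Python. Below is a minimal sketch with the VART/XIR Python APIs, omitting the int8 pre/post-processing for brevity (see the fix-point scaling sketch later on this page):
================================
# Minimal sketch: run resnet50.xmodel through the VART Python API.
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("/usr/share/vitis_ai_library/models/resnet50/resnet50.xmodel")
# Pick the DPU subgraph; CPU subgraphs handle operators the DPU does not support.
subgraph = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
            if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(subgraph, "run")

in_t = runner.get_input_tensors()[0]
out_t = runner.get_output_tensors()[0]
input_data = [np.zeros(tuple(in_t.dims), dtype=np.int8)]   # put a preprocessed image here
output_data = [np.empty(tuple(out_t.dims), dtype=np.int8)]
job_id = runner.execute_async(input_data, output_data)
runner.wait(job_id)
print("output shape:", output_data[0].shape)
================================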
Troubleshooting - File Descriptor
## F0606 09:12:00.986861 4629 file_lock_lnx.cpp:28] Check failed: fd >= 0 (-1 vs. 0) cannot open file: /tmp/DPU_0
## Reset file descriptor
rm /tmp/DPU_0
Troubleshooting - OpenCV
https://github.com/opencv/opencv/issues/18461
## terminate called after throwing an instance of 'cv::Exception'
## what(): OpenCV(4.5.2) /usr/src/debug/opencv/4.5.2-r0/git/modules/highgui/src/window_gtk.cpp:624: error: (-2:Unspecified error) Can't initialize GTK backend in function 'cvInitSystem'
## Point DISPLAY at the board's local display when running the application via SSH or in a non-GUI environment
export DISPLAY=:0.0
Troubleshooting - Fingerprint Failure
https://support.xilinx.com/s/question/0D54U00006wDmkzSAC/info-post-about-dpu-fingerprint
- The DPU fingerprint is a unique identifier used in Vitis AI to characterize different DPU targets
- The fingerprint encodes a feature code that depends on:
- IP Core: DPUCZDX8G/DPUCVDX8G/…
- Unique Architecture: B512/B800/B1024/B1152/B1600/B2304/B3136/B4096
- Instruction Set Architecture (ISA): 0x01
- Vitis AI Version: 2.5/3.0/…
## W0608 07:27:35.154747 83456 dpu_runner_base_imp.cpp:676] CHECK fingerprint fail ! model_fingerprint 0x101000056010407 dpu_fingerprint 0x101000016010406
## F0608 07:27:35.154840 83456 dpu_runner_base_imp.cpp:648] fingerprint check failure.
## Check that the DPU architecture and fingerprint match those of the compiled model
xdputil query
cat resnet50/meta.json
## Workaround: disable the fingerprint check
env XLNX_ENABLE_FINGERPRINT_CHECK=0 ./resnet50 resnet50.xmodel
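Before disabling the check, it is worth confirming the mismatch programmatically. A minimal sketch comparing the board's DPU arch against the model's meta.json, assuming meta.json carries a "target" field naming the arch (e.g. DPUCZDX8G_ISA1_B4096, matching the compile log later on this page); check your meta.json first:
================================
# Minimal sketch: compare the board's DPU arch with the model's compile target.
import json
import subprocess

out = subprocess.run(["xdputil", "query"], capture_output=True, text=True).stdout
board_archs = {k["DPU Arch"] for k in json.loads(out[out.index("{"):]).get("kernels", [])}

with open("resnet50/meta.json") as f:
    model_target = json.load(f).get("target")  # assumed field; verify in your meta.json

print("board:", board_archs, "model:", model_target)
print("match" if model_target in board_archs else "MISMATCH: expect a fingerprint failure")
================================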
YOLOvX (Object Detection)
## Root Login (as needed)
sudo su -l root
cd ~/Vitis-AI/examples/vai_library/samples/yolovx
./test_jpeg_yolovx /usr/share/vitis_ai_library/models/yolox_nano_pt/yolox_nano_pt.xmodel vitis-ai_gorilla_market.jpg
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 10:08:46.059096  7557 demo.hpp:1193] batch: 0  image: vitis-ai_gorilla_market.jpg
I0606 10:08:46.059296  7557 process_result.hpp:32] RESULT: 16  78.75  25.94  502.17  505.75  0.469689
Custom Model Development (PyTorch)
https://github.com/Xilinx/Vitis-AI-Tutorials/blob/1.4/Design_Tutorials/09-mnist_pyt/README.md
https://www.paltek.co.jp/techblog/techinfo/211115_01
- Basic training flow for Vitis AI (a quantization sketch follows this list):
- Training: floating-point_model.pth
- Quantization: floating-point_model.pth > int_model.xmodel
- Compile: int_model.xmodel > int_model_kv260.xmodel
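For reference, here is a minimal sketch of the quantization step using the Vitis AI PyTorch flow (pytorch_nndct); the tutorial's quantize.py follows the same pattern. The `common` import and the MNIST input shape are hypothetical stand-ins, not the tutorial's exact code.
================================
# Minimal sketch, not the tutorial's exact quantize.py.
import torch
from pytorch_nndct.apis import torch_quantizer
from common import CNN  # hypothetical module defining the float model class

model = CNN()
model.load_state_dict(torch.load("build/float_model/f_model.pth", map_location="cpu"))
dummy_input = torch.randn(1, 1, 28, 28)  # assumed MNIST input shape

# quant_mode="calib": forward representative batches through quant_model to calibrate
quantizer = torch_quantizer("calib", model, (dummy_input,), output_dir="build/quant_model")
quant_model = quantizer.quant_model
quant_model(dummy_input)  # replace with a loop over real calibration data
quantizer.export_quant_config()

# quant_mode="test": re-create the quantizer, evaluate, then dump the xmodel
quantizer = torch_quantizer("test", model, (dummy_input,), output_dir="build/quant_model")
quant_model = quantizer.quant_model
quant_model(dummy_input)  # at least one forward pass is required before export
quantizer.export_xmodel(output_dir="build/quant_model", deploy_check=False)
================================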
Step 0 - Setting Up the Workspace
cd ~/Vitis-AI
./docker_run.sh xilinx/vitis-ai-pytorch-gpu:latest
conda activate vitis-ai-pytorch
git clone -b 1.4 https://github.com/Xilinx/Vitis-AI-Tutorials.git
cd Vitis-AI-Tutorials/Design_Tutorials/09-mnist_pyt/files/
Step 1 - Training
export BUILD=./build
export LOG=${BUILD}/logs
mkdir -p ${LOG}
vi train.py
## Remove the following lines from train.py
================================
torchvision.datasets.MNIST.resources = [
    ('https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz', 'f68b3c2dcbeaaa9fbdd348bbdeb94873'),
    ('https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz', 'd53e105ee54ea40749a09fcbcd1e9432'),
    ('https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz', '9fb629c4189551a2d022fa330f9573f3'),
    ('https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz', 'ec29112dd5afa0611ce80d1b7f02629c')
]
================================
## Training
python -u train.py -d ${BUILD} 2>&1 | tee ${LOG}/train.log
PyTorch version : 1.12.1
3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
-----------------------------------------
Command line options:
--build_dir  : ./build
--batchsize  : 100
--learnrate  : 0.001
--epochs     : 3
-----------------------------------------
You have 1 CUDA devices available
Device 0 : Tesla T4
Selecting device 0..
...
Epoch 1
Test set: Accuracy: 9814/10000 (98.14%)
Epoch 2
Test set: Accuracy: 9866/10000 (98.66%)
Epoch 3
Test set: Accuracy: 9898/10000 (98.98%)
Trained model written to ./build/float_model/f_model.pth
Step 2 - Quantization
## Quantize
python -u quantize.py -d ${BUILD} --quant_mode calib 2>&1 | tee ${LOG}/quant_calib.log
[VAIQ_NOTE]: Loading NNDCT kernels...
-----------------------------------------
PyTorch version : 1.12.1
3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
-----------------------------------------
Command line options:
--build_dir  : ./build
--quant_mode : calib
--batchsize  : 100
-----------------------------------------
You have 1 CUDA devices available
Device 0 : Tesla T4
Selecting device 0..
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: GPU information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization calibration process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cuda'.
[VAIQ_NOTE]: =>Parsing CNN...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model CNN is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(./build/quant_model/CNN.py)
[VAIQ_NOTE]: =>Get module with quantization.
Test set: Accuracy: 9904/10000 (99.04%)
[VAIQ_NOTE]: =>Exporting quant config.(./build/quant_model/quant_info.json)
## Evaluate Quantized Model
python -u quantize.py -d ${BUILD} --quant_mode test 2>&1 | tee ${LOG}/quant_test.log
[VAIQ_NOTE]: Loading NNDCT kernels...
-----------------------------------------
PyTorch version : 1.12.1
3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
-----------------------------------------
Command line options:
--build_dir  : ./build
--quant_mode : test
--batchsize  : 100
-----------------------------------------
You have 1 CUDA devices available
Device 0 : Tesla T4
Selecting device 0..
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: GPU information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization test process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cuda'.
[VAIQ_NOTE]: =>Parsing CNN...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model CNN is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(./build/quant_model/CNN.py)
[VAIQ_NOTE]: =>Get module with quantization.
Test set: Accuracy: 9901/10000 (99.01%)
[VAIQ_NOTE]: =>Converting to xmodel ...
[VAIQ_NOTE]: =>Successfully convert 'CNN' to xmodel.(./build/quant_model/CNN_int.xmodel)
Step 3 - Compile
vi compile.sh
## Add the following lines to compile.sh if using the KV260
================================
elif [ $1 = kv260 ]; then
      ARCH=/opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json
      TARGET=kv260
      echo "-----------------------------------------"
      echo "COMPILING MODEL FOR KV260.."
      echo "-----------------------------------------"
================================
## Compile
source compile.sh kv260 ${BUILD} ${LOG}
COMPILING MODEL FOR KV260..
-----------------------------------------
[UNILOG][INFO] Compile mode: dpu
[UNILOG][INFO] Debug mode: null
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA1_B4096
[UNILOG][INFO] Graph name: CNN, with op num: 33
[UNILOG][INFO] Begin to compile...
[UNILOG][INFO] Total device subgraph number 3, DPU subgraph number 1
[UNILOG][INFO] Compile done.
[UNILOG][INFO] The meta json is saved to "./build/compiled_model/meta.json"
[UNILOG][INFO] The compiled xmodel is saved to "./build/compiled_model/CNN_kv260.xmodel"
[UNILOG][INFO] The compiled xmodel's md5sum is ed77..., and has been saved to "./build/compiled_model/md5sum.txt"
**************************************************
* VITIS_AI Compilation - Xilinx Inc.
**************************************************
-----------------------------------------
MODEL COMPILED
-----------------------------------------
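The compile log above reports three device subgraphs with one DPU subgraph; operators the DPU cannot execute stay on the CPU. A minimal sketch for inspecting that partitioning with the XIR Python bindings available inside the Vitis AI container:
================================
# Minimal sketch: list the device assignment of each subgraph in the compiled xmodel.
import xir

graph = xir.Graph.deserialize("build/compiled_model/CNN_kv260.xmodel")
for sg in graph.get_root_subgraph().toposort_child_subgraph():
    device = sg.get_attr("device") if sg.has_attr("device") else "(none)"
    print(f"{sg.get_name()}: device={device}")
================================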
Step 4 - Application
vi target.py
## Modify the following line of target.py if using the KV260
================================
ap.add_argument('-t', '--target', type=str, default='zcu102', choices=['zcu102','zcu104','u50','vck190','kv260'], help='Target board type')
================================
## Preparing files for the target board
python -u target.py --target kv260 -d ${BUILD} 2>&1 | tee ${LOG}/target_kv260.log
3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
------------------------------------
Command line options:
--build_dir  : ./build
--target     : kv260
--num_images : 10000
--app_dir    : application
------------------------------------
Copying application code from application ...
Copying compiled model from ./build/compiled_model/CNN_kv260.xmodel ...
Step 5 - Running on Target
## Copy resulting files to the target board
cd build
tar cvfz target_kv260.tar.gz target_kv260
scp target_kv260.tar.gz root@kv260:
## Extract the resulting tar file on the target board
tar xvfz target_kv260.tar.gz
cd target_kv260
## Installing OpenCV (as needed)
sudo pip3 install opencv-python
## Loading DPU (as needed)
sudo xmutil listapps
sudo xmutil unloadapp
sudo xmutil loadapp kv260-benchmark-b4096
## Disabling fingerprint check (as needed)
export XLNX_ENABLE_FINGERPRINT_CHECK=0
## Running the application on the target board
python3 app_mt.py -m CNN_kv260.xmodel
Command line options:
--image_dir : images
--threads   : 1
--model     : CNN_kv260.xmodel
-------------------------------
Pre-processing 10000 images...
-------------------------------
Starting 1 threads...
-------------------------------
Throughput=4641.32 fps, total frames = 10000, time=2.1546 seconds
Correct:9886, Wrong:114, Accuracy:0.9886
-------------------------------
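The tutorial's app_mt.py converts the float images to int8 using the fix-point position of the DPU input tensor. A minimal sketch of that scaling with the VART Python API (the random image below is a stand-in for real preprocessed MNIST data):
================================
# Minimal sketch of int8 input scaling: the DPU input tensor carries a
# "fix_point" attribute, and float pixels in [0, 1) are scaled by 2**fix_point.
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("CNN_kv260.xmodel")
subgraph = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
            if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(subgraph, "run")

fix_point = runner.get_input_tensors()[0].get_attr("fix_point")
scale = 2 ** fix_point
image = np.random.rand(28, 28, 1).astype(np.float32)  # stand-in for a real image
quantized = (image * scale).astype(np.int8)
print("fix_point:", fix_point, "scaled dtype:", quantized.dtype)
================================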
Custom Model Development via Model Zoo (PyTorch)
- The host machine for training needs at least around 16GB of memory, plus multiple powerful GPUs
- Training the whole MS-COCO 2017 dataset for 300 epochs (the default) on a single NVIDIA Tesla T4 would take on the order of 100 days
Step 0 - Setting Up the Workspace
cd ~/Vitis-AI
./docker_run.sh xilinx/vitis-ai-pytorch-gpu:latest
conda activate vitis-ai-pytorch
## Download Model Zoo
wget https://www.xilinx.com/bin/public/openDownload?filename=pt_yolox-nano_coco_416_416_1G_3.0.zip -O pt_yolox-nano_coco_416_416_1G_3.0.zip
unzip pt_yolox-nano_coco_416_416_1G_3.0.zip
cd pt_yolox-nano_coco_416_416_1G_3.0
## Installing Python Modules
pip install --user -r requirements.txt
cd code
pip install --user -v -e .
cd ..
## Preparing Dataset of MS-COCO (as needed)
cd data/COCO
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
unzip annotations_trainval2017.zip
unzip train2017.zip
unzip val2017.zip
unzip test2017.zip
cd ../../
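Before committing to a multi-day run, it is worth sanity-checking the extracted COCO layout. A minimal sketch with pycocotools (already a dependency of the YOLOX code); the annotation paths are the standard COCO ones under data/COCO:
================================
# Minimal sketch: confirm the COCO annotations load and count images/categories.
from pycocotools.coco import COCO

for ann in ("instances_train2017.json", "instances_val2017.json"):
    coco = COCO(f"data/COCO/annotations/{ann}")
    print(ann, "images:", len(coco.imgs), "categories:", len(coco.cats))
================================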
Step 1 - Evaluation
## Evaluation
bash code/run_eval.sh
Conducting test...
2023-06-27 09:09:19 | INFO | __main__:139 - Args: Namespace(batch_size=32, ckpt='float/yolox_nano.pth', ...)
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-27 09:09:32 | INFO | __main__:149 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-27 09:09:32 | INFO | __main__:150 - Model Structure: YOLOX( ... )
2023-06-27 09:09:32 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-27 09:09:34 | INFO | yolox.data.datasets.coco:64 - Done (t=1.79s)
2023-06-27 09:09:34 | INFO | pycocotools.coco:86 - creating index...
2023-06-27 09:09:34 | INFO | pycocotools.coco:86 - index created!
2023-06-27 09:09:56 | INFO | __main__:165 - loading checkpoint from float/yolox_nano.pth
2023-06-27 09:09:57 | INFO | __main__:169 - loaded checkpoint done.
100%|##########| 157/157 [04:24<00:00, 1.68s/it]
2023-06-28 08:16:47 | INFO | yolox.evaluators.coco_evaluator:256 - Evaluate in main process...
2023-06-28 08:17:10 | INFO | yolox.evaluators.coco_evaluator:289 - Loading and preparing results...
2023-06-28 08:17:17 | INFO | yolox.evaluators.coco_evaluator:289 - DONE (t=6.73s)
2023-06-28 08:17:17 | INFO | pycocotools.coco:366 - creating index...
2023-06-28 08:17:18 | INFO | pycocotools.coco:366 - index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 18.23 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 2.92 seconds.
2023-06-28 08:17:43 | INFO | __main__:196 - Average forward time: 5.14 ms, Average NMS time: 0.83 ms, Average inference time: 5.97 ms
Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.220
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.365
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.226
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.062
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.225
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.357
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.218
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.351
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.384
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.130
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.428
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.586
per class AP:
| class         | AP     | class        | AP     | class          | AP     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 35.267 | bicycle      | 13.570 | car            | 18.173 |
| motorcycle    | 25.904 | airplane     | 43.217 | bus            | 45.145 |
| train         | 49.695 | truck        | 15.139 | boat           | 10.732 |
| traffic light | 11.231 | fire hydrant | 40.946 | stop sign      | 48.689 |
| parking meter | 24.575 | bench        | 10.999 | bird           | 13.384 |
| cat           | 37.786 | dog          | 34.574 | horse          | 32.929 |
| sheep         | 24.128 | cow          | 27.846 | elephant       | 44.797 |
| bear          | 45.493 | zebra        | 48.477 | giraffe        | 52.491 |
| backpack      | 3.131  | umbrella     | 21.010 | handbag        | 2.292  |
| tie           | 14.239 | suitcase     | 11.637 | frisbee        | 35.369 |
| skis          | 8.520  | snowboard    | 7.550  | sports ball    | 18.980 |
| kite          | 23.806 | baseball bat | 9.997  | baseball glove | 15.666 |
| skateboard    | 22.507 | surfboard    | 14.978 | tennis racket  | 21.614 |
| bottle        | 11.882 | wine glass   | 10.445 | cup            | 15.231 |
| fork          | 9.913  | knife        | 3.421  | spoon          | 1.997  |
| bowl          | 23.385 | banana       | 12.635 | apple          | 7.757  |
| sandwich      | 21.507 | orange       | 18.874 | broccoli       | 13.849 |
| carrot        | 9.995  | hot dog      | 14.199 | pizza          | 35.764 |
| donut         | 23.774 | cake         | 16.139 | chair          | 11.971 |
| couch         | 31.531 | potted plant | 10.725 | bed            | 32.832 |
| dining table  | 23.782 | toilet       | 47.803 | tv             | 42.598 |
| laptop        | 36.937 | mouse        | 31.544 | remote         | 3.964  |
| keyboard      | 30.560 | cell phone   | 15.958 | microwave      | 34.743 |
| oven          | 21.415 | toaster      | 0.446  | sink           | 21.817 |
| refrigerator  | 35.881 | book         | 5.182  | clock          | 29.555 |
| vase          | 13.550 | scissors     | 8.278  | teddy bear     | 23.986 |
| hair drier    | 0.000  | toothbrush   | 4.453  |                |        |
per class AR:
| class         | AR     | class        | AR     | class          | AR     |
|:--------------|:-------|:-------------|:-------|:---------------|:-------|
| person        | 46.651 | bicycle      | 27.834 | car            | 32.623 |
| motorcycle    | 41.853 | airplane     | 56.084 | bus            | 53.145 |
| train         | 61.421 | truck        | 41.232 | boat           | 27.783 |
| traffic light | 25.315 | fire hydrant | 50.891 | stop sign      | 54.400 |
| parking meter | 42.333 | bench        | 28.273 | bird           | 25.340 |
| cat           | 57.970 | dog          | 53.853 | horse          | 48.934 |
| sheep         | 42.994 | cow          | 44.247 | elephant       | 61.429 |
| bear          | 55.634 | zebra        | 59.361 | giraffe        | 62.888 |
| backpack      | 19.057 | umbrella     | 38.894 | handbag        | 18.593 |
| tie           | 27.540 | suitcase     | 35.418 | frisbee        | 47.130 |
| skis          | 27.593 | snowboard    | 20.290 | sports ball    | 27.308 |
| kite          | 38.012 | baseball bat | 24.690 | baseball glove | 31.892 |
| skateboard    | 41.061 | surfboard    | 30.637 | tennis racket  | 36.089 |
| bottle        | 29.891 | wine glass   | 20.469 | cup            | 33.017 |
| fork          | 24.558 | knife        | 16.062 | spoon          | 11.107 |
| bowl          | 45.120 | banana       | 34.459 | apple          | 32.076 |
| sandwich      | 47.797 | orange       | 42.807 | broccoli       | 42.276 |
| carrot        | 33.041 | hot dog      | 28.080 | pizza          | 51.514 |
| donut         | 40.427 | cake         | 36.968 | chair          | 35.178 |
| couch         | 57.203 | potted plant | 35.351 | bed            | 54.356 |
| dining table  | 47.094 | toilet       | 62.179 | tv             | 57.951 |
| laptop        | 51.688 | mouse        | 51.509 | remote         | 22.544 |
| keyboard      | 49.477 | cell phone   | 31.565 | microwave      | 58.000 |
| oven          | 46.503 | toaster      | 6.667  | sink           | 42.756 |
| refrigerator  | 56.825 | book         | 20.399 | clock          | 43.258 |
| vase          | 31.861 | scissors     | 21.667 | teddy bear     | 41.263 |
| hair drier    | 0.000  | toothbrush   | 13.860 |                |        |
Troubleshooting - Number of Workers
## UserWarning: This DataLoader will create 4 worker processes in total.
## Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create.
## Please be aware that excessive worker creation might get DataLoader running slow or even freeze,
## lower the worker number to avoid potential slowness/freeze if necessary.
## Reduce the number of workers
vi ./code/yolox/exp/yolox_base.py
## Modify the following line of yolox_base.py depending on the system (as needed)
================================
self.data_num_workers = 1
================================
Step 2 - Training
## Update the number of output classes
vi ./code/yolox/exp/yolox_base.py
## Modify the following line of yolox_base.py depending on the dataset and desired model (as needed)
================================
self.num_classes = 80
================================
## Training
bash code/run_train.sh
Conducting training...
2023-06-28 08:38:14 | INFO | yolox.core.trainer:130 - args: Namespace(batch_size=4, ...)
2023-06-28 08:38:14 | INFO | yolox.core.trainer:131 - exp value:
╒═══════════════════╤════════════════════════════╕
│ keys              │ values                     │
╞═══════════════════╪════════════════════════════╡
│ seed              │ None                       │
├───────────────────┼────────────────────────────┤
│ output_dir        │ './YOLOX_outputs'          │
├───────────────────┼────────────────────────────┤
│ print_interval    │ 10                         │
├───────────────────┼────────────────────────────┤
│ eval_interval     │ 10                         │
├───────────────────┼────────────────────────────┤
│ num_classes       │ 80                         │
├───────────────────┼────────────────────────────┤
│ depth             │ 0.33                       │
├───────────────────┼────────────────────────────┤
│ width             │ 0.25                       │
├───────────────────┼────────────────────────────┤
│ act               │ 'relu'                     │
├───────────────────┼────────────────────────────┤
│ data_num_workers  │ 2                          │
├───────────────────┼────────────────────────────┤
│ input_size        │ (416, 416)                 │
├───────────────────┼────────────────────────────┤
│ multiscale_range  │ 5                          │
├───────────────────┼────────────────────────────┤
│ data_dir          │ 'data/COCO'                │
├───────────────────┼────────────────────────────┤
│ train_ann         │ 'instances_train2017.json' │
├───────────────────┼────────────────────────────┤
│ val_ann           │ 'instances_val2017.json'   │
├───────────────────┼────────────────────────────┤
│ test_ann          │ 'instances_test2017.json'  │
├───────────────────┼────────────────────────────┤
│ mosaic_prob       │ 0.5                        │
├───────────────────┼────────────────────────────┤
│ mixup_prob        │ 1.0                        │
├───────────────────┼────────────────────────────┤
│ hsv_prob          │ 1.0                        │
├───────────────────┼────────────────────────────┤
│ flip_prob         │ 0.5                        │
├───────────────────┼────────────────────────────┤
│ degrees           │ 10.0                       │
├───────────────────┼────────────────────────────┤
│ translate         │ 0.1                        │
├───────────────────┼────────────────────────────┤
│ mosaic_scale      │ (0.5, 1.5)                 │
├───────────────────┼────────────────────────────┤
│ enable_mixup      │ False                      │
├───────────────────┼────────────────────────────┤
│ mixup_scale       │ (0.5, 1.5)                 │
├───────────────────┼────────────────────────────┤
│ shear             │ 2.0                        │
├───────────────────┼────────────────────────────┤
│ warmup_epochs     │ 5                          │
├───────────────────┼────────────────────────────┤
│ max_epoch         │ 300                        │
├───────────────────┼────────────────────────────┤
│ warmup_lr         │ 0                          │
├───────────────────┼────────────────────────────┤
│ min_lr_ratio      │ 0.05                       │
├───────────────────┼────────────────────────────┤
│ basic_lr_per_img  │ 0.00015625                 │
├───────────────────┼────────────────────────────┤
│ scheduler         │ 'yoloxwarmcos'             │
├───────────────────┼────────────────────────────┤
│ no_aug_epochs     │ 15                         │
├───────────────────┼────────────────────────────┤
│ ema               │ True                       │
├───────────────────┼────────────────────────────┤
│ weight_decay      │ 0.0005                     │
├───────────────────┼────────────────────────────┤
│ momentum          │ 0.9                        │
├───────────────────┼────────────────────────────┤
│ save_history_ckpt │ True                       │
├───────────────────┼────────────────────────────┤
│ exp_name          │ 'yolox_nano_deploy_relu'   │
├───────────────────┼────────────────────────────┤
│ test_size         │ (416, 416)                 │
├───────────────────┼────────────────────────────┤
│ test_conf         │ 0.01                       │
├───────────────────┼────────────────────────────┤
│ nmsthre           │ 0.65                       │
├───────────────────┼────────────────────────────┤
│ random_size       │ (10, 20)                   │
╘═══════════════════╧════════════════════════════╛
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-28 08:38:17 | INFO | yolox.core.trainer:137 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-28 08:38:21 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 08:38:44 | INFO | yolox.data.datasets.coco:64 - Done (t=23.58s)
2023-06-28 08:38:44 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 08:38:45 | INFO | pycocotools.coco:86 - index created!
2023-06-28 08:39:22 | INFO | yolox.core.trainer:155 - init prefetcher, this might take one minute or less...
2023-06-28 08:40:16 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 08:40:17 | INFO | yolox.data.datasets.coco:64 - Done (t=1.01s)
2023-06-28 08:40:17 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 08:40:17 | INFO | pycocotools.coco:86 - index created!
2023-06-28 08:40:19 | INFO | yolox.core.trainer:191 - Training start...
2023-06-28 08:40:19 | INFO | yolox.core.trainer:192 - YOLOX( ... )
2023-06-28 08:40:19 | INFO | yolox.core.trainer:203 - ---> start train epoch1
2023-06-28 08:40:40 | INFO | yolox.core.trainer:261 - epoch: 1/300, iter: 10/29572, mem: 13006Mb, iter_time: 2.057s, data_time: 0.062s, total_loss: 19.7, ...
2023-06-28 08:40:43 | INFO | yolox.core.trainer:261 - epoch: 1/300, iter: 20/29572, mem: 13006Mb, iter_time: 0.346s, data_time: 0.200s, total_loss: 14.5, ...
...
(Estimated completion time with a single NVIDIA Tesla T4 would be around 100 days...)
...
2023-06-28 21:43:02 | INFO | yolox.core.trainer:356 - Save weights to ./YOLOX_outputs/yolox_nano_deploy_relu
100%|##########| 1250/1250 [01:54<00:00, 10.92it/s]
2023-06-28 21:44:56 | INFO | yolox.evaluators.coco_evaluator:256 - Evaluate in main process...
2023-06-28 21:44:56 | INFO | yolox.evaluators.coco_evaluator:289 - Loading and preparing results...
2023-06-28 21:44:56 | INFO | yolox.evaluators.coco_evaluator:289 - DONE (t=0.02s)
2023-06-28 21:44:56 | INFO | pycocotools.coco:366 - creating index...
2023-06-28 21:44:57 | INFO | pycocotools.coco:366 - index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 11.71 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.66 seconds.
2023-06-28 21:45:10 | INFO | yolox.core.trainer:346 - Average forward time: 4.55 ms, Average NMS time: 0.35 ms, Average inference time: 4.90 ms
Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
2023-06-28 21:45:10 | INFO | yolox.core.trainer:356 - Save weights to ./YOLOX_outputs/yolox_nano_deploy_relu
2023-06-28 21:45:10 | INFO | yolox.core.trainer:356 - Save weights to ./YOLOX_outputs/yolox_nano_deploy_relu
2023-06-28 21:45:10 | INFO | yolox.core.trainer:196 - Training of experiment is done and the best AP is 0.00
Troubleshooting - Number of GPUs
## RuntimeError: NCCL error in: ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
## ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
## Reduce the number of GPUs
vi ./code/run_train.sh
## Modify the following lines of run_train.sh depending on the system (as needed)
================================
export CUDA_VISIBLE_DEVICES=0
GPU_NUM=1
================================
Troubleshooting - GPU/CUDA Out of Memory
## RuntimeError: CUDA out of memory.
## Tried to allocate 170.00 MiB (GPU 0; 14.76 GiB total capacity; 13.63 GiB already allocated; 72.81 MiB free; 13.63 GiB reserved in total by PyTorch)
## If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
## See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
## Reduce the batch size (upgrading the GPUs would be preferable if possible)
## Alternatively, setting PYTORCH_CUDA_ALLOC_CONF (max_split_size_mb) may mitigate fragmentation, as the error message suggests
vi ./code/run_train.sh
## Modify the following line of run_train.sh depending on the system (as needed)
================================
BATCH=4
================================
Step 3 - Quantization
## Quantization and xmodel Dumping
bash code/run_quant.sh
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-28 19:56:48 | INFO | __main__:148 - Args: Namespace(batch_size=32, ckpt='float/yolox_nano.pth', ...)
2023-06-28 19:56:48 | INFO | __main__:163 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-28 19:56:48 | INFO | __main__:164 - Model Structure: YOLOX( ... )
2023-06-28 19:56:48 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 19:56:49 | INFO | yolox.data.datasets.coco:64 - Done (t=0.64s)
2023-06-28 19:56:49 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 19:56:49 | INFO | pycocotools.coco:86 - index created!
2023-06-28 19:56:52 | INFO | __main__:181 - loading checkpoint from float/yolox_nano.pth
2023-06-28 19:56:52 | INFO | __main__:188 - loaded checkpoint done.
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: GPU information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization calibration process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cuda'.
[VAIQ_NOTE]: =>Parsing YOLOX...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model YOLOX is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(quantize_result/YOLOX.py)
[VAIQ_NOTE]: =>Get module with quantization.
100%|##########| 157/157 [01:51<00:00, 1.41it/s]
2023-06-28 19:58:51 | INFO | yolox.evaluators.coco_evaluator_q:270 - Evaluate in main process...
2023-06-28 19:59:11 | INFO | yolox.evaluators.coco_evaluator_q:303 - Loading and preparing results...
2023-06-28 19:59:16 | INFO | yolox.evaluators.coco_evaluator_q:303 - DONE (t=5.47s)
2023-06-28 19:59:16 | INFO | pycocotools.coco:366 - creating index...
2023-06-28 19:59:17 | INFO | pycocotools.coco:366 - index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 16.24 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 2.56 seconds.
2023-06-28 19:59:38 | INFO | __main__:236 - Average forward time: 17.37 ms, Average NMS time: 0.69 ms, Average inference time: 18.06 ms
Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.137
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.262
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.132
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.043
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.153
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.228
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.158
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.267
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.300
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.091
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.333
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.476
per class AP:
...
[VAIQ_NOTE]: =>Exporting quant config.(quantize_result/quant_info.json)
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-28 19:59:45 | INFO | __main__:148 - Args: Namespace(batch_size=32, ckpt='float/yolox_nano.pth', ...)
2023-06-28 19:59:46 | INFO | __main__:163 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-28 19:59:46 | INFO | __main__:164 - Model Structure: YOLOX( ... )
2023-06-28 19:59:46 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 19:59:46 | INFO | yolox.data.datasets.coco:64 - Done (t=0.65s)
2023-06-28 19:59:46 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 19:59:46 | INFO | pycocotools.coco:86 - index created!
2023-06-28 19:59:50 | INFO | __main__:181 - loading checkpoint from float/yolox_nano.pth
2023-06-28 19:59:50 | INFO | __main__:188 - loaded checkpoint done.
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: GPU information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization test process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cuda'.
[VAIQ_NOTE]: =>Parsing YOLOX...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model YOLOX is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(quantize_result/YOLOX.py)
[VAIQ_NOTE]: =>Get module with quantization.
100%|##########| 157/157 [00:54<00:00, 2.89it/s]
2023-06-28 20:00:51 | INFO | yolox.evaluators.coco_evaluator_q:270 - Evaluate in main process...
2023-06-28 20:01:12 | INFO | yolox.evaluators.coco_evaluator_q:303 - Loading and preparing results...
2023-06-28 20:01:17 | INFO | yolox.evaluators.coco_evaluator_q:303 - DONE (t=5.45s)
2023-06-28 20:01:17 | INFO | pycocotools.coco:366 - creating index...
2023-06-28 20:01:18 | INFO | pycocotools.coco:366 - index created!
Running per image evaluation...
Evaluate annotation type *bbox*
COCOeval_opt.evaluate() finished in 15.82 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 2.59 seconds.
2023-06-28 20:01:38 | INFO | __main__:236 - Average forward time: 5.94 ms, Average NMS time: 0.78 ms, Average inference time: 6.71 ms
Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.136
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.264
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.132
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.041
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.155
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.226
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.156
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.265
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.298
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.093
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.471
per class AP:
...
[VAIQ_NOTE]: Loading NNDCT kernels...
2023-06-28 20:01:48 | INFO | __main__:148 - Args: Namespace(batch_size=32, ckpt='float/yolox_nano.pth', ...)
2023-06-28 20:01:48 | INFO | __main__:163 - Model Summary: Params: 0.91M, Gflops: 1.00
2023-06-28 20:01:48 | INFO | __main__:164 - Model Structure: YOLOX( ... )
2023-06-28 20:01:48 | INFO | yolox.data.datasets.coco:64 - loading annotations into memory...
2023-06-28 20:01:50 | INFO | yolox.data.datasets.coco:64 - Done (t=1.81s)
2023-06-28 20:01:50 | INFO | pycocotools.coco:86 - creating index...
2023-06-28 20:01:50 | INFO | pycocotools.coco:86 - index created!
2023-06-28 20:01:52 | INFO | __main__:181 - loading checkpoint from float/yolox_nano.pth
2023-06-28 20:01:54 | INFO | __main__:188 - loaded checkpoint done.
[VAIQ_NOTE]: OS and CPU information:
[VAIQ_NOTE]: Tools version information:
[VAIQ_NOTE]: Quant config file is empty, use default quant configuration
[VAIQ_NOTE]: Quantization test process start up...
[VAIQ_NOTE]: =>Quant Module is in 'cpu'.
[VAIQ_NOTE]: =>Parsing YOLOX...
[VAIQ_NOTE]: Start to trace and freeze model...
[VAIQ_NOTE]: The input model YOLOX is torch.nn.Module.
[VAIQ_NOTE]: Finish tracing.
[VAIQ_NOTE]: Processing ops...
[VAIQ_NOTE]: =>Doing weights equalization...
[VAIQ_NOTE]: =>Quantizable module is generated.(quantize_result/YOLOX.py)
[VAIQ_NOTE]: =>Get module with quantization.
0%| | 0/5000 [00:00<?, ?it/s]
2023-06-28 20:02:01 | INFO | __main__:236 -
[VAIQ_NOTE]: =>Converting to xmodel ...
[VAIQ_NOTE]: =>Dumping 'YOLOX_0'' checking data...
[VAIQ_NOTE]: =>Finsh dumping data.(quantize_result/deploy_check_data_int/YOLOX_0)
[VAIQ_NOTE]: =>Successfully convert 'YOLOX_0' to xmodel.(quantize_result/YOLOX_0_int.xmodel)
Step Ex. - Quantization-Aware Training (as needed)
## Quantization-Aware Training, Model Converting, and xmodel Dumping
bash code/run_qat.sh
References
https://misoji-engineer.com/archives/vitis-ai-how-to.html
https://misoji-engineer.com/archives/vitis-ai-3-0.html
https://misoji-engineer.com/archives/build-vitis-ai-gpu.html
https://www.paltek.co.jp/techblog/tag/ai
https://www.pixela.co.jp/products/pickup/dev/ai/vitisai_ai_3_model_zoo.html
https://misoji-engineer.com/archives/vitis-ai-model-zoo.html
https://tomosoft.jp/design/?p=44403
https://www.pixela.co.jp/products/pickup/dev/
https://www.paltek.co.jp/techblog/techinfo/220121_01
Acknowledgments
Daiphys is a professional-service company for research and development of leading-edge technologies in science and engineering.
Get started accelerating your business through our deep expertise in R&D with AI, quantum computing, and space development; please get in touch with Daiphys today!
Daiphys Technologies LLC - https://www.daiphys.com/