DiffusionDet Inference

In this blog post I describe the steps I followed to run inference with the DiffusionDet model for object detection. This is based on the

I created a conda environment

conda create --name diffusionDetection python=3.7

The Detectron2 repo requires Python >= 3.7.

Activate the conda environment

conda activate diffusionDetection
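
Once the environment is active, a quick way to confirm that its interpreter satisfies the Python >= 3.7 requirement is a small check like the one below. This is only a sketch; the helper name meets_minimum is my own and is not part of Detectron2 or DiffusionDet.

```python
import sys

# Minimum version taken from the requirement stated above (Python >= 3.7).
MIN_PYTHON = (3, 7)


def meets_minimum(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old"
    print("Python", sys.version.split()[0], status)
```

Running it inside the activated environment shows which interpreter conda actually picked up.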

Clone the DiffusionDet repo

git clone https://github.com/ShoufaChen/DiffusionDet.git

The DiffusionDet repo does not have a requirements.txt file, so I had to install the pip packages individually and set up the environment myself.

pip install numpy
pip install opencv-python
pip install tqdm
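
Since there is no requirements.txt to check against, a short script can report which of these packages are still missing. The following is a sketch: the helper missing_modules is hypothetical, and it relies on the fact that opencv-python is imported under the name cv2.

```python
import importlib.util

# Import names for the pip packages installed above
# (opencv-python is imported as cv2).
REQUIRED_MODULES = ["numpy", "cv2", "tqdm"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The same list can be extended with torch and detectron2 after the later installation steps.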

I wanted to try inference on a CPU-only machine, so I installed PyTorch without CUDA support. PyTorch is a dependency of Detectron2.

conda install pytorch cpuonly -c pytorch
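
To confirm that the installed PyTorch build really runs without CUDA, a check along these lines can help. It is a minimal sketch: describe_torch is a hypothetical helper, and it only uses torch.__version__ and torch.cuda.is_available().

```python
def describe_torch():
    """Return a one-line summary of the installed PyTorch build."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    # is_available() is False on CPU-only builds and on machines without a usable GPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"PyTorch {torch.__version__}, device: {device}"


if __name__ == "__main__":
    print(describe_torch())
```

On a CPU-only install I would expect the reported device to be cpu.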

Install Detectron2

python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

You can find alternative installation options at https://github.com/facebookresearch/detectron2/blob/main/INSTALL.md#installation

Manually download the pretrained model (I used diffdet_coco_res50_300boxes.pth) and place it in the same folder as demo.py.

Edit the config file Base-DiffusionDet inside the configs folder to set MODEL.DEVICE to "cpu", since inference is performed on a CPU-only machine without a GPU. The only change I made was adding the line DEVICE: "cpu".

MODEL:
  META_ARCHITECTURE: "DiffusionDet"
  DEVICE: "cpu"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  BACKBONE:
    NAME: "build_resnet_fpn_backbone"
  RESNETS:
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
  FPN:
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
  ROI_HEADS:
    IN_FEATURES: ["p2", "p3", "p4", "p5"]
  ROI_BOX_HEAD:
    POOLER_TYPE: "ROIAlignV2"
    POOLER_RESOLUTION: 7
    POOLER_SAMPLING_RATIO: 2
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.000025
  STEPS: (210000, 250000)
  MAX_ITER: 270000
  WARMUP_FACTOR: 0.01
  WARMUP_ITERS: 1000
  WEIGHT_DECAY: 0.0001
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 1.0  # keep same with BASE_LR.
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 1.0
    NORM_TYPE: 2.0
SEED: 40244023
INPUT:
  MIN_SIZE_TRAIN: (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800)
  CROP:
    ENABLED: False
    TYPE: "absolute_range"
    SIZE: (384, 600)
  FORMAT: "RGB"
TEST:
  EVAL_PERIOD: 7330
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: False
  NUM_WORKERS: 4
VERSION: 2
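
If the same change has to be repeated (for example after a fresh clone), the edit could be scripted. The sketch below works on the file's text directly and assumes the layout shown above: a top-level MODEL: line with two-space-indented children. The function add_model_device is hypothetical. It inserts the new line directly under MODEL:, which is equivalent because key order does not matter in a YAML mapping.

```python
def add_model_device(yaml_text, device="cpu"):
    """Insert `DEVICE: "<device>"` under the top-level MODEL: key if absent."""
    lines = yaml_text.splitlines()
    if "MODEL:" not in lines:
        raise ValueError("no top-level MODEL: section found")
    start = lines.index("MODEL:")
    # Scan the children of MODEL for an existing two-space-indented DEVICE key.
    for line in lines[start + 1:]:
        if line and not line.startswith(" "):
            break  # reached the next top-level section
        if line.startswith("  DEVICE:"):
            return yaml_text  # already set; leave the text untouched
    lines.insert(start + 1, f'  DEVICE: "{device}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = 'MODEL:\n  META_ARCHITECTURE: "DiffusionDet"\nSOLVER:\n  IMS_PER_BATCH: 16\n'
    print(add_model_device(sample))
```

Calling it a second time on its own output returns the text unchanged, so it is safe to rerun.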

Execution command

python demo.py --config-file configs/diffdet.coco.res50.yaml --input image.jpg --opts MODEL.WEIGHTS .\diffdet_coco_res50_300boxes.pth
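
The same command can also be assembled from Python, which avoids the shell-specific .\ path prefix. Because demo.py accepts KEY VALUE pairs after --opts (they show up in the Arguments line of the log), this sketch also passes MODEL.DEVICE cpu there. I am assuming that demo.py merges these overrides into the config the way Detectron2's own demo does; if so, this would be an alternative to editing the config file. The helper build_demo_command is hypothetical.

```python
import sys
from pathlib import Path


def build_demo_command(config, image, weights, device="cpu"):
    """Build the argument list for demo.py with config overrides after --opts."""
    return [
        sys.executable, "demo.py",
        "--config-file", str(config),
        "--input", str(image),
        "--opts",
        "MODEL.WEIGHTS", str(weights),
        # Assumption: demo.py merges --opts into the config, as Detectron2's demo does.
        "MODEL.DEVICE", device,
    ]


if __name__ == "__main__":
    cmd = build_demo_command(
        Path("configs") / "diffdet.coco.res50.yaml",
        Path("image.jpg"),
        Path("diffdet_coco_res50_300boxes.pth"),
    )
    print(" ".join(cmd))
```

To actually launch the demo, the list can be passed to subprocess.run(cmd, check=True) from the root of the DiffusionDet repo.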

Output

(diffusionDetection2) PS C:\Users\karth\OneDrive\Documents\Other_Resources\ProfileProject\DiffusionDet> python demo.py --config-file configs/diffdet.coco.res50.yaml --input image.jpg --opts MODEL.WEIGHTS .\diffdet_coco_res50_300boxes.pth

[11/25 00:30:07 detectron2]: Arguments: Namespace(confidence_threshold=0.5, config_file='configs/diffdet.coco.res50.yaml', input=['image.jpg'], opts=['MODEL.WEIGHTS', '.\\diffdet_coco_res50_300boxes.pth'], output=None, video_input=None, webcam=False)
cpu

[11/25 00:30:08 fvcore.common.checkpoint]: [Checkpointer] Loading from .\diffdet_coco_res50_300boxes.pth ...

[11/25 00:30:25 d2.checkpoint.c2_model_loading]: Following weights matched with model:
| Names in Model                                   | Names in Checkpoint                                                                                  | Shapes                                          |
|:-------------------------------------------------|:-----------------------------------------------------------------------------------------------------|:------------------------------------------------|
| alphas_cumprod                                   | alphas_cumprod                                                                                       | (1000,)                                         |
| alphas_cumprod_prev                              | alphas_cumprod_prev                                                                                  | (1000,)                                         |
| backbone.bottom_up.res2.0.conv1.*                | backbone.bottom_up.res2.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,1,1)             |
| backbone.bottom_up.res2.0.conv2.*                | backbone.bottom_up.res2.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| backbone.bottom_up.res2.0.conv3.*                | backbone.bottom_up.res2.0.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| backbone.bottom_up.res2.0.shortcut.*             | backbone.bottom_up.res2.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| backbone.bottom_up.res2.1.conv1.*                | backbone.bottom_up.res2.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,256,1,1)            |
| backbone.bottom_up.res2.1.conv2.*                | backbone.bottom_up.res2.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| backbone.bottom_up.res2.1.conv3.*                | backbone.bottom_up.res2.1.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| backbone.bottom_up.res2.2.conv1.*                | backbone.bottom_up.res2.2.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,256,1,1)            |
| backbone.bottom_up.res2.2.conv2.*                | backbone.bottom_up.res2.2.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| backbone.bottom_up.res2.2.conv3.*                | backbone.bottom_up.res2.2.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| backbone.bottom_up.res3.0.conv1.*                | backbone.bottom_up.res3.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,256,1,1)       |
| backbone.bottom_up.res3.0.conv2.*                | backbone.bottom_up.res3.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| backbone.bottom_up.res3.0.conv3.*                | backbone.bottom_up.res3.0.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| backbone.bottom_up.res3.0.shortcut.*             | backbone.bottom_up.res3.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (512,) (512,) (512,) (512,) (512,256,1,1)       |
| backbone.bottom_up.res3.1.conv1.*                | backbone.bottom_up.res3.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| backbone.bottom_up.res3.1.conv2.*                | backbone.bottom_up.res3.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| backbone.bottom_up.res3.1.conv3.*                | backbone.bottom_up.res3.1.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| backbone.bottom_up.res3.2.conv1.*                | backbone.bottom_up.res3.2.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| backbone.bottom_up.res3.2.conv2.*                | backbone.bottom_up.res3.2.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| backbone.bottom_up.res3.2.conv3.*                | backbone.bottom_up.res3.2.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| backbone.bottom_up.res3.3.conv1.*                | backbone.bottom_up.res3.3.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| backbone.bottom_up.res3.3.conv2.*                | backbone.bottom_up.res3.3.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| backbone.bottom_up.res3.3.conv3.*                | backbone.bottom_up.res3.3.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| backbone.bottom_up.res4.0.conv1.*                | backbone.bottom_up.res4.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,512,1,1)       |
| backbone.bottom_up.res4.0.conv2.*                | backbone.bottom_up.res4.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| backbone.bottom_up.res4.0.conv3.*                | backbone.bottom_up.res4.0.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| backbone.bottom_up.res4.0.shortcut.*             | backbone.bottom_up.res4.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (1024,) (1024,) (1024,) (1024,) (1024,512,1,1)  |
| backbone.bottom_up.res4.1.conv1.*                | backbone.bottom_up.res4.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| backbone.bottom_up.res4.1.conv2.*                | backbone.bottom_up.res4.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| backbone.bottom_up.res4.1.conv3.*                | backbone.bottom_up.res4.1.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| backbone.bottom_up.res4.2.conv1.*                | backbone.bottom_up.res4.2.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| backbone.bottom_up.res4.2.conv2.*                | backbone.bottom_up.res4.2.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| backbone.bottom_up.res4.2.conv3.*                | backbone.bottom_up.res4.2.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| backbone.bottom_up.res4.3.conv1.*                | backbone.bottom_up.res4.3.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| backbone.bottom_up.res4.3.conv2.*                | backbone.bottom_up.res4.3.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| backbone.bottom_up.res4.3.conv3.*                | backbone.bottom_up.res4.3.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| backbone.bottom_up.res4.4.conv1.*                | backbone.bottom_up.res4.4.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| backbone.bottom_up.res4.4.conv2.*                | backbone.bottom_up.res4.4.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| backbone.bottom_up.res4.4.conv3.*                | backbone.bottom_up.res4.4.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| backbone.bottom_up.res4.5.conv1.*                | backbone.bottom_up.res4.5.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| backbone.bottom_up.res4.5.conv2.*                | backbone.bottom_up.res4.5.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| backbone.bottom_up.res4.5.conv3.*                | backbone.bottom_up.res4.5.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| backbone.bottom_up.res5.0.conv1.*                | backbone.bottom_up.res5.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,1024,1,1)      |
| backbone.bottom_up.res5.0.conv2.*                | backbone.bottom_up.res5.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3)       |
| backbone.bottom_up.res5.0.conv3.*                | backbone.bottom_up.res5.0.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)  |
| backbone.bottom_up.res5.0.shortcut.*             | backbone.bottom_up.res5.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (2048,) (2048,) (2048,) (2048,) (2048,1024,1,1) |
| backbone.bottom_up.res5.1.conv1.*                | backbone.bottom_up.res5.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,2048,1,1)      |
| backbone.bottom_up.res5.1.conv2.*                | backbone.bottom_up.res5.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3)       |
| backbone.bottom_up.res5.1.conv3.*                | backbone.bottom_up.res5.1.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)  |
| backbone.bottom_up.res5.2.conv1.*                | backbone.bottom_up.res5.2.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,2048,1,1)      |
| backbone.bottom_up.res5.2.conv2.*                | backbone.bottom_up.res5.2.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3)       |
| backbone.bottom_up.res5.2.conv3.*                | backbone.bottom_up.res5.2.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)  |
| backbone.bottom_up.stem.conv1.*                  | backbone.bottom_up.stem.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (64,) (64,) (64,) (64,) (64,3,7,7)              |
| backbone.fpn_lateral2.*                          | backbone.fpn_lateral2.{bias,weight}                                                                  | (256,) (256,256,1,1)                            |
| backbone.fpn_lateral3.*                          | backbone.fpn_lateral3.{bias,weight}                                                                  | (256,) (256,512,1,1)                            |
| backbone.fpn_lateral4.*                          | backbone.fpn_lateral4.{bias,weight}                                                                  | (256,) (256,1024,1,1)                           |
| backbone.fpn_lateral5.*                          | backbone.fpn_lateral5.{bias,weight}                                                                  | (256,) (256,2048,1,1)                           |
| backbone.fpn_output2.*                           | backbone.fpn_output2.{bias,weight}                                                                   | (256,) (256,256,3,3)                            |
| backbone.fpn_output3.*                           | backbone.fpn_output3.{bias,weight}                                                                   | (256,) (256,256,3,3)                            |
| backbone.fpn_output4.*                           | backbone.fpn_output4.{bias,weight}                                                                   | (256,) (256,256,3,3)                            |
| backbone.fpn_output5.*                           | backbone.fpn_output5.{bias,weight}                                                                   | (256,) (256,256,3,3)                            |
| betas                                            | betas                                                                                                | (1000,)                                         |
| head.head_series.0.bboxes_delta.*                | head.head_series.0.bboxes_delta.{bias,weight}                                                        | (4,) (4,256)                                    |
| head.head_series.0.block_time_mlp.1.*            | head.head_series.0.block_time_mlp.1.{bias,weight}                                                    | (512,) (512,1024)                               |
| head.head_series.0.class_logits.*                | head.head_series.0.class_logits.{bias,weight}                                                        | (80,) (80,256)                                  |
| head.head_series.0.cls_module.0.weight           | head.head_series.0.cls_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.0.cls_module.1.*                | head.head_series.0.cls_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.0.inst_interact.dynamic_layer.* | head.head_series.0.inst_interact.dynamic_layer.{bias,weight}                                         | (32768,) (32768,256)                            |
| head.head_series.0.inst_interact.norm1.*         | head.head_series.0.inst_interact.norm1.{bias,weight}                                                 | (64,) (64,)                                     |
| head.head_series.0.inst_interact.norm2.*         | head.head_series.0.inst_interact.norm2.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.0.inst_interact.norm3.*         | head.head_series.0.inst_interact.norm3.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.0.inst_interact.out_layer.*     | head.head_series.0.inst_interact.out_layer.{bias,weight}                                             | (256,) (256,12544)                              |
| head.head_series.0.linear1.*                     | head.head_series.0.linear1.{bias,weight}                                                             | (2048,) (2048,256)                              |
| head.head_series.0.linear2.*                     | head.head_series.0.linear2.{bias,weight}                                                             | (256,) (256,2048)                               |
| head.head_series.0.norm1.*                       | head.head_series.0.norm1.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.0.norm2.*                       | head.head_series.0.norm2.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.0.norm3.*                       | head.head_series.0.norm3.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.0.reg_module.0.weight           | head.head_series.0.reg_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.0.reg_module.1.*                | head.head_series.0.reg_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.0.reg_module.3.weight           | head.head_series.0.reg_module.3.weight                                                               | (256, 256)                                      |
| head.head_series.0.reg_module.4.*                | head.head_series.0.reg_module.4.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.0.reg_module.6.weight           | head.head_series.0.reg_module.6.weight                                                               | (256, 256)                                      |
| head.head_series.0.reg_module.7.*                | head.head_series.0.reg_module.7.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.0.self_attn.*                   | head.head_series.0.self_attn.{in_proj_bias,in_proj_weight,out_proj.bias,out_proj.weight}             | (768,) (768,256) (256,) (256,256)               |
| head.head_series.1.bboxes_delta.*                | head.head_series.1.bboxes_delta.{bias,weight}                                                        | (4,) (4,256)                                    |
| head.head_series.1.block_time_mlp.1.*            | head.head_series.1.block_time_mlp.1.{bias,weight}                                                    | (512,) (512,1024)                               |
| head.head_series.1.class_logits.*                | head.head_series.1.class_logits.{bias,weight}                                                        | (80,) (80,256)                                  |
| head.head_series.1.cls_module.0.weight           | head.head_series.1.cls_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.1.cls_module.1.*                | head.head_series.1.cls_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.1.inst_interact.dynamic_layer.* | head.head_series.1.inst_interact.dynamic_layer.{bias,weight}                                         | (32768,) (32768,256)                            |
| head.head_series.1.inst_interact.norm1.*         | head.head_series.1.inst_interact.norm1.{bias,weight}                                                 | (64,) (64,)                                     |
| head.head_series.1.inst_interact.norm2.*         | head.head_series.1.inst_interact.norm2.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.1.inst_interact.norm3.*         | head.head_series.1.inst_interact.norm3.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.1.inst_interact.out_layer.*     | head.head_series.1.inst_interact.out_layer.{bias,weight}                                             | (256,) (256,12544)                              |
| head.head_series.1.linear1.*                     | head.head_series.1.linear1.{bias,weight}                                                             | (2048,) (2048,256)                              |
| head.head_series.1.linear2.*                     | head.head_series.1.linear2.{bias,weight}                                                             | (256,) (256,2048)                               |
| head.head_series.1.norm1.*                       | head.head_series.1.norm1.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.1.norm2.*                       | head.head_series.1.norm2.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.1.norm3.*                       | head.head_series.1.norm3.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.1.reg_module.0.weight           | head.head_series.1.reg_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.1.reg_module.1.*                | head.head_series.1.reg_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.1.reg_module.3.weight           | head.head_series.1.reg_module.3.weight                                                               | (256, 256)                                      |
| head.head_series.1.reg_module.4.*                | head.head_series.1.reg_module.4.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.1.reg_module.6.weight           | head.head_series.1.reg_module.6.weight                                                               | (256, 256)                                      |
| head.head_series.1.reg_module.7.*                | head.head_series.1.reg_module.7.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.1.self_attn.*                   | head.head_series.1.self_attn.{in_proj_bias,in_proj_weight,out_proj.bias,out_proj.weight}             | (768,) (768,256) (256,) (256,256)               |
| head.head_series.2.bboxes_delta.*                | head.head_series.2.bboxes_delta.{bias,weight}                                                        | (4,) (4,256)                                    |
| head.head_series.2.block_time_mlp.1.*            | head.head_series.2.block_time_mlp.1.{bias,weight}                                                    | (512,) (512,1024)                               |
| head.head_series.2.class_logits.*                | head.head_series.2.class_logits.{bias,weight}                                                        | (80,) (80,256)                                  |
| head.head_series.2.cls_module.0.weight           | head.head_series.2.cls_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.2.cls_module.1.*                | head.head_series.2.cls_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.2.inst_interact.dynamic_layer.* | head.head_series.2.inst_interact.dynamic_layer.{bias,weight}                                         | (32768,) (32768,256)                            |
| head.head_series.2.inst_interact.norm1.*         | head.head_series.2.inst_interact.norm1.{bias,weight}                                                 | (64,) (64,)                                     |
| head.head_series.2.inst_interact.norm2.*         | head.head_series.2.inst_interact.norm2.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.2.inst_interact.norm3.*         | head.head_series.2.inst_interact.norm3.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.2.inst_interact.out_layer.*     | head.head_series.2.inst_interact.out_layer.{bias,weight}                                             | (256,) (256,12544)                              |
| head.head_series.2.linear1.*                     | head.head_series.2.linear1.{bias,weight}                                                             | (2048,) (2048,256)                              |
| head.head_series.2.linear2.*                     | head.head_series.2.linear2.{bias,weight}                                                             | (256,) (256,2048)                               |
| head.head_series.2.norm1.*                       | head.head_series.2.norm1.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.2.norm2.*                       | head.head_series.2.norm2.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.2.norm3.*                       | head.head_series.2.norm3.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.2.reg_module.0.weight           | head.head_series.2.reg_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.2.reg_module.1.*                | head.head_series.2.reg_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.2.reg_module.3.weight           | head.head_series.2.reg_module.3.weight                                                               | (256, 256)                                      |
| head.head_series.2.reg_module.4.*                | head.head_series.2.reg_module.4.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.2.reg_module.6.weight           | head.head_series.2.reg_module.6.weight                                                               | (256, 256)                                      |
| head.head_series.2.reg_module.7.*                | head.head_series.2.reg_module.7.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.2.self_attn.*                   | head.head_series.2.self_attn.{in_proj_bias,in_proj_weight,out_proj.bias,out_proj.weight}             | (768,) (768,256) (256,) (256,256)               |
| head.head_series.3.bboxes_delta.*                | head.head_series.3.bboxes_delta.{bias,weight}                                                        | (4,) (4,256)                                    |
| head.head_series.3.block_time_mlp.1.*            | head.head_series.3.block_time_mlp.1.{bias,weight}                                                    | (512,) (512,1024)                               |
| head.head_series.3.class_logits.*                | head.head_series.3.class_logits.{bias,weight}                                                        | (80,) (80,256)                                  |
| head.head_series.3.cls_module.0.weight           | head.head_series.3.cls_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.3.cls_module.1.*                | head.head_series.3.cls_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.3.inst_interact.dynamic_layer.* | head.head_series.3.inst_interact.dynamic_layer.{bias,weight}                                         | (32768,) (32768,256)                            |
| head.head_series.3.inst_interact.norm1.*         | head.head_series.3.inst_interact.norm1.{bias,weight}                                                 | (64,) (64,)                                     |
| head.head_series.3.inst_interact.norm2.*         | head.head_series.3.inst_interact.norm2.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.3.inst_interact.norm3.*         | head.head_series.3.inst_interact.norm3.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.3.inst_interact.out_layer.*     | head.head_series.3.inst_interact.out_layer.{bias,weight}                                             | (256,) (256,12544)                              |
| head.head_series.3.linear1.*                     | head.head_series.3.linear1.{bias,weight}                                                             | (2048,) (2048,256)                              |
| head.head_series.3.linear2.*                     | head.head_series.3.linear2.{bias,weight}                                                             | (256,) (256,2048)                               |
| head.head_series.3.norm1.*                       | head.head_series.3.norm1.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.3.norm2.*                       | head.head_series.3.norm2.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.3.norm3.*                       | head.head_series.3.norm3.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.3.reg_module.0.weight           | head.head_series.3.reg_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.3.reg_module.1.*                | head.head_series.3.reg_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.3.reg_module.3.weight           | head.head_series.3.reg_module.3.weight                                                               | (256, 256)                                      |
| head.head_series.3.reg_module.4.*                | head.head_series.3.reg_module.4.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.3.reg_module.6.weight           | head.head_series.3.reg_module.6.weight                                                               | (256, 256)                                      |
| head.head_series.3.reg_module.7.*                | head.head_series.3.reg_module.7.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.3.self_attn.*                   | head.head_series.3.self_attn.{in_proj_bias,in_proj_weight,out_proj.bias,out_proj.weight}             | (768,) (768,256) (256,) (256,256)               |
| head.head_series.4.bboxes_delta.*                | head.head_series.4.bboxes_delta.{bias,weight}                                                        | (4,) (4,256)                                    |
| head.head_series.4.block_time_mlp.1.*            | head.head_series.4.block_time_mlp.1.{bias,weight}                                                    | (512,) (512,1024)                               |
| head.head_series.4.class_logits.*                | head.head_series.4.class_logits.{bias,weight}                                                        | (80,) (80,256)                                  |
| head.head_series.4.cls_module.0.weight           | head.head_series.4.cls_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.4.cls_module.1.*                | head.head_series.4.cls_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.4.inst_interact.dynamic_layer.* | head.head_series.4.inst_interact.dynamic_layer.{bias,weight}                                         | (32768,) (32768,256)                            |
| head.head_series.4.inst_interact.norm1.*         | head.head_series.4.inst_interact.norm1.{bias,weight}                                                 | (64,) (64,)                                     |
| head.head_series.4.inst_interact.norm2.*         | head.head_series.4.inst_interact.norm2.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.4.inst_interact.norm3.*         | head.head_series.4.inst_interact.norm3.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.4.inst_interact.out_layer.*     | head.head_series.4.inst_interact.out_layer.{bias,weight}                                             | (256,) (256,12544)                              |
| head.head_series.4.linear1.*                     | head.head_series.4.linear1.{bias,weight}                                                             | (2048,) (2048,256)                              |
| head.head_series.4.linear2.*                     | head.head_series.4.linear2.{bias,weight}                                                             | (256,) (256,2048)                               |
| head.head_series.4.norm1.*                       | head.head_series.4.norm1.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.4.norm2.*                       | head.head_series.4.norm2.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.4.norm3.*                       | head.head_series.4.norm3.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.4.reg_module.0.weight           | head.head_series.4.reg_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.4.reg_module.1.*                | head.head_series.4.reg_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.4.reg_module.3.weight           | head.head_series.4.reg_module.3.weight                                                               | (256, 256)                                      |
| head.head_series.4.reg_module.4.*                | head.head_series.4.reg_module.4.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.4.reg_module.6.weight           | head.head_series.4.reg_module.6.weight                                                               | (256, 256)                                      |
| head.head_series.4.reg_module.7.*                | head.head_series.4.reg_module.7.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.4.self_attn.*                   | head.head_series.4.self_attn.{in_proj_bias,in_proj_weight,out_proj.bias,out_proj.weight}             | (768,) (768,256) (256,) (256,256)               |
| head.head_series.5.bboxes_delta.*                | head.head_series.5.bboxes_delta.{bias,weight}                                                        | (4,) (4,256)                                    |
| head.head_series.5.block_time_mlp.1.*            | head.head_series.5.block_time_mlp.1.{bias,weight}                                                    | (512,) (512,1024)                               |
| head.head_series.5.class_logits.*                | head.head_series.5.class_logits.{bias,weight}                                                        | (80,) (80,256)                                  |
| head.head_series.5.cls_module.0.weight           | head.head_series.5.cls_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.5.cls_module.1.*                | head.head_series.5.cls_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.5.inst_interact.dynamic_layer.* | head.head_series.5.inst_interact.dynamic_layer.{bias,weight}                                         | (32768,) (32768,256)                            |
| head.head_series.5.inst_interact.norm1.*         | head.head_series.5.inst_interact.norm1.{bias,weight}                                                 | (64,) (64,)                                     |
| head.head_series.5.inst_interact.norm2.*         | head.head_series.5.inst_interact.norm2.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.5.inst_interact.norm3.*         | head.head_series.5.inst_interact.norm3.{bias,weight}                                                 | (256,) (256,)                                   |
| head.head_series.5.inst_interact.out_layer.*     | head.head_series.5.inst_interact.out_layer.{bias,weight}                                             | (256,) (256,12544)                              |
| head.head_series.5.linear1.*                     | head.head_series.5.linear1.{bias,weight}                                                             | (2048,) (2048,256)                              |
| head.head_series.5.linear2.*                     | head.head_series.5.linear2.{bias,weight}                                                             | (256,) (256,2048)                               |
| head.head_series.5.norm1.*                       | head.head_series.5.norm1.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.5.norm2.*                       | head.head_series.5.norm2.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.5.norm3.*                       | head.head_series.5.norm3.{bias,weight}                                                               | (256,) (256,)                                   |
| head.head_series.5.reg_module.0.weight           | head.head_series.5.reg_module.0.weight                                                               | (256, 256)                                      |
| head.head_series.5.reg_module.1.*                | head.head_series.5.reg_module.1.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.5.reg_module.3.weight           | head.head_series.5.reg_module.3.weight                                                               | (256, 256)                                      |
| head.head_series.5.reg_module.4.*                | head.head_series.5.reg_module.4.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.5.reg_module.6.weight           | head.head_series.5.reg_module.6.weight                                                               | (256, 256)                                      |
| head.head_series.5.reg_module.7.*                | head.head_series.5.reg_module.7.{bias,weight}                                                        | (256,) (256,)                                   |
| head.head_series.5.self_attn.*                   | head.head_series.5.self_attn.{in_proj_bias,in_proj_weight,out_proj.bias,out_proj.weight}             | (768,) (768,256) (256,) (256,256)               |
| head.time_mlp.1.*                                | head.time_mlp.1.{bias,weight}                                                                        | (1024,) (1024,256)                              |
| head.time_mlp.3.*                                | head.time_mlp.3.{bias,weight}                                                                        | (1024,) (1024,1024)                             |
| log_one_minus_alphas_cumprod                     | log_one_minus_alphas_cumprod                                                                         | (1000,)                                         |
| posterior_log_variance_clipped                   | posterior_log_variance_clipped                                                                       | (1000,)                                         |
| posterior_mean_coef1                             | posterior_mean_coef1                                                                                 | (1000,)                                         |
| posterior_mean_coef2                             | posterior_mean_coef2                                                                                 | (1000,)                                         |
| posterior_variance                               | posterior_variance                                                                                   | (1000,)                                         |
| sqrt_alphas_cumprod                              | sqrt_alphas_cumprod                                                                                  | (1000,)                                         |
| sqrt_one_minus_alphas_cumprod                    | sqrt_one_minus_alphas_cumprod                                                                        | (1000,)                                         |
| sqrt_recip_alphas_cumprod                        | sqrt_recip_alphas_cumprod                                                                            | (1000,)                                         |
| sqrt_recipm1_alphas_cumprod                      | sqrt_recipm1_alphas_cumprod                                                                          | (1000,)                                         |
[11/25 00:30:34 detectron2]: image.jpg: detected 1 instances in 8.20s
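
If you run the demo on several images, it can be handy to pull the instance count and the CPU inference time out of the final log line. Below is a minimal sketch that assumes the log line has the same shape as the one above. The pattern and the helper name `parse_detection_line` are my own for illustration; they are not part of DiffusionDet or Detectron2.

```python
import re

# Pattern for the demo's per-image summary line, for example:
# "[11/25 00:30:34 detectron2]: image.jpg: detected 1 instances in 8.20s"
# Assumption: the line format matches the log shown in this post.
LOG_PATTERN = re.compile(
    r"\]: (?P<image>.+): detected (?P<count>\d+) instances in (?P<seconds>[\d.]+)s"
)


def parse_detection_line(line):
    """Return (image, instance_count, seconds), or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return match["image"], int(match["count"]), float(match["seconds"])


line = "[11/25 00:30:34 detectron2]: image.jpg: detected 1 instances in 8.20s"
print(parse_detection_line(line))
```

Lines from the checkpoint table do not match the pattern, so the helper returns None for them and they can be skipped when scanning a whole log file.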

An OpenCV window then pops up showing the output image with the detected bounding box drawn on it.

Thanks for your interest in learning something new. Please reach out if you face any difficulties; I am always happy to help.

Written on November 25th, 2022 by Karthik
