Spaces: bijonguha/yolov3-voc-era1

Bijon Guha committed on Aug 12, 2023

Commit 97933dd (1 parent: 1e9b4a1)

file upload

Files changed (22)

README.md +57 -3
app.py +77 -4
config.py +58 -0
examples/1.jpg +0 -0
examples/10.jpg +0 -0
examples/11.jpg +0 -0
examples/12.jpg +0 -0
examples/13.jpg +0 -0
examples/14.jpg +0 -0
examples/15.jpg +0 -0
examples/2.jpg +0 -0
examples/3.jpg +0 -0
examples/4.jpg +0 -0
examples/5.jpg +0 -0
examples/6.jpg +0 -0
examples/7.jpg +0 -0
examples/8.jpg +0 -0
examples/9.jpg +0 -0
requirements.txt +9 -0
utils.py +265 -0
yolov3.pth +3 -0
yolov3.py +171 -0

README.md CHANGED

@@ -1,7 +1,7 @@
 ---
-title: Yolov3 Voc Era1
-emoji: 🏃
-colorFrom: purple
 colorTo: yellow
 sdk: gradio
 sdk_version: 3.40.1
@@ -9,5 +9,59 @@ app_file: app.py
 pinned: false
 license: mit
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: YoloV3 GradCam
+emoji: 🦀
+colorFrom: pink
 colorTo: yellow
 sdk: gradio
 sdk_version: 3.40.1
 pinned: false
 license: mit
 ---
+# YOLO V2 & V3 and Object Detection Techniques
+## How to Use the App
+<br>
+1. The app has two tabs:
+   - **YoloV3 Object Detection** : In this tab, you can upload your own image of dimensions 416 x 416 pixels or choose one of the provided example images to classify and visualize the Class Activation Maps using GradCAM.
+   You can adjust the number of top predicted classes, select multiple target layers from the trained model, control the transparency of the overlay, and show or hide the GradCAM visualizations.
+   - **GradCam Visualization** : In this tab, you can view a gallery of misclassified images from the PASCAL VOC test dataset. You can control the transparency of the overlay, select a target layer, control the number of misclassified examples shown, and show or hide the GradCAM overlay.
+<br>
+2. **YoloV3 Object Detection**
+   - **Input Image**    : Upload your own image of dimensions 416 x 416 pixels or select one of the example images given below.
+   - **Enable GradCAM** : Shows the GradCAM overlay on the input image. Unchecking it shows the original image.
+   - **Network Layers** : Select the target layers for GradCAM visualization. Values range from -4 to -1; the defaults are -2 and -1.
+   - **Transparency**   : Control the transparency of the GradCAM overlay. The default value is 0.5.
+   - **Threshold**      : Control the threshold for the boxes plotted on the images.
+<br>
+3. **GradCam Visualization**
+   - **Input Image**    : Upload your own image of dimensions 416 x 416 pixels or select one of the example images given below.
+   - **Network Layer**  : Adjust the target layer for GradCAM visualization in the model's layers. The default value is -2.
+   - **Transparency**   : Control the transparency of the GradCAM overlay. The default value is 0.5.
+   - **Enable GradCAM** : Shows the GradCAM overlay on the misclassified images. Unchecking it shows the original images.
+   - **Threshold**      : Control the threshold for the boxes plotted on the images.
+<br>
+4. After adjusting the parameters, click the `Submit` button to see the results.
+<br>
+5. To reset the parameters to their defaults, click the `Clear` button.
+## Training code
+The PyTorch Lightning code used to train and validate the model can be viewed at https://github.com/TharunSivamani/ERA-V1/blob/main/Session%2013/S13.ipynb
+## License
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+## Credits
+- This app is built using the Gradio library ([https://www.gradio.app/](https://www.gradio.app/)) for interactive model interfaces.
+- The PASCAL VOC dataset ([https://www.kaggle.com/datasets/aladdinpersson/pascal-voc-dataset-used-in-yolov3-video](https://www.kaggle.com/datasets/aladdinpersson/pascal-voc-dataset-used-in-yolov3-video)) is used for training and evaluation.
+- The PyTorch library ([https://pytorch.org/](https://pytorch.org/)) is used for the deep learning model and GradCAM visualization.
+- The PyTorch Lightning framework ([https://lightning.ai/](https://lightning.ai/)) is used for training and validation.
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
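
Note on the 416 x 416 requirement in the README: the `transforms` pipeline added in config.py (`LongestMaxSize` followed by `PadIfNeeded`) resizes and pads other image sizes as well. A minimal sketch of that resize-and-pad arithmetic in plain Python follows. `letterbox_dims` is a hypothetical helper, and the rounding and centred padding are assumptions, not a copy of the albumentations implementation.

```python
def letterbox_dims(height, width, size=416):
    """Return the resized (h, w) and the (top, bottom, left, right) padding.

    Assumption: the longest side is scaled to `size`, the other side is
    scaled by the same factor, and the remainder is padded symmetrically.
    """
    longest = max(height, width)
    new_h = round(height * size / longest)
    new_w = round(width * size / longest)
    pad_h = size - new_h
    pad_w = size - new_w
    top, left = pad_h // 2, pad_w // 2
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


# A 500 x 375 (height x width) image, a common PASCAL VOC size.
resized, padding = letterbox_dims(500, 375)
print(resized, padding)
```

Boxes predicted on the padded square are relative to the padded image, so drawing them on the unpadded original may need the padding subtracted first.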

app.py CHANGED

@@ -1,7 +1,80 @@
 import gradio as gr
-def greet(name):
-    return "Hello " + name + "!!"
-iface = gr.Interface(fn=greet, inputs="text", outputs="text")
-iface.launch()

 import gradio as gr
+import cv2
+import numpy as np
+import torch
+import config
+from utils import YoloCAM, cells_to_bboxes, draw_bounding_boxes, non_max_suppression
+from pytorch_grad_cam.utils.image import show_cam_on_image
+from yolov3 import YOLOv3LightningModel
+ex1 = [[f'examples/{i}.jpg'] for i in range(1,8)]
+ex2 = [[f'examples/{i}.jpg'] for i in range(8,16)]
+scaled_anchors = config.scaled_anchors
+model = YOLOv3LightningModel()
+model.load_state_dict(torch.load("yolov3.pth", map_location="cpu"), strict=False)
+model.eval()
+@torch.inference_mode()
+def YoloV3_classifier(image, thresh=0.5, iou_thresh=0.5):
+    transformed_image = config.transforms(image=image)["image"].unsqueeze(0)
+    output = model(transformed_image)
+    bboxes = [[] for _ in range(1)]
+    for i in range(3):
+        batch_size, A, S, _, _ = output[i].shape
+        anchor = scaled_anchors[i]
+        boxes_scale_i = cells_to_bboxes(
+            output[i], anchor, S=S, is_preds=True
+        )
+        for idx, (box) in enumerate(boxes_scale_i):
+            bboxes[idx] += box
+    nms_boxes = non_max_suppression(
+        bboxes[0], iou_threshold=iou_thresh, threshold=thresh, box_format="midpoint",
+    )
+    plot_img = draw_bounding_boxes(image.copy(), nms_boxes, class_labels=config.PASCAL_CLASSES)
+    return plot_img
+window1 = gr.Interface(
+    YoloV3_classifier,
+    inputs=[
+        gr.Image(label="Input Image"),
+        gr.Slider(0, 1, value=0.5, step=0.1, label="Threshold", info="Set Threshold value"),
+        gr.Slider(0, 1, value=0.5, step=0.1, label="IOU Threshold", info="Set IOU Threshold value"),
+    ],
+    outputs=[
+        gr.Image(label="YoloV3 Object Detection"),
+    ],
+    examples=ex1,
+)
+def visualize_gradCam(image, target_layer=-5, show_cam=True, transparency=0.5):
+    if show_cam:
+        cam = YoloCAM(model=model, target_layers=[model.layers[target_layer]], use_cuda=False)
+        transformed_image = config.transforms(image=image)["image"].unsqueeze(0)
+        grayscale_cam = cam(transformed_image, scaled_anchors)[0, :, :]
+        img = cv2.resize(image, (416, 416))
+        img = np.float32(img) / 255
+        cam_image = show_cam_on_image(img, grayscale_cam, use_rgb=True, image_weight=transparency)
+    else:
+        cam_image = image
+    return cam_image
+window2 = gr.Interface(
+    visualize_gradCam,
+    inputs=[
+        gr.Image(label="Input Image"),
+        gr.Slider(-5, -2, value=-3, step=1, label="Network Layer", info="GRAD-CAM Layer to visualize?"),
+        gr.Checkbox(label="GradCAM", value=True, info="Visualize Class Activation Maps ??"),
+        gr.Slider(0, 1, value=0.5, step=0.1, label="Transparency", info="Set Transparency of GRAD-CAMs"),
+    ],
+    outputs=[
+        gr.Image(label="Grad-CAM Visualization"),
+    ],
+    examples=ex2,
+)
+app = gr.TabbedInterface([window1, window2], ["YOLO V3 Detection", "GradCAM Visualization"])
+app.launch()
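
Note on the "Transparency" slider: its value is passed to `show_cam_on_image` as `image_weight`, so a higher value shows more of the original image and less of the heatmap. The sketch below illustrates that blend in plain Python on flat lists of values in [0, 1]. It assumes the blend is a convex combination followed by rescaling to the maximum; `blend_pixel` and `overlay` are hypothetical names, not part of pytorch-grad-cam.

```python
def blend_pixel(img_value, heat_value, image_weight=0.5):
    """Blend one image value with one heatmap value, both in [0, 1]."""
    return image_weight * img_value + (1.0 - image_weight) * heat_value


def overlay(img, heat, image_weight=0.5):
    """Blend two equally sized flat lists, then rescale so the peak is 255."""
    blended = [blend_pixel(i, h, image_weight) for i, h in zip(img, heat)]
    peak = max(blended) or 1.0  # avoid dividing by zero on all-black input
    return [int(255 * v / peak) for v in blended]


img = [0.25, 0.5, 1.0]
heat = [1.0, 0.0, 0.5]
print(overlay(img, heat, image_weight=1.0))  # image only
print(overlay(img, heat, image_weight=0.0))  # heatmap only
```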

config.py ADDED

	@@ -0,0 +1,58 @@

+import albumentations as A
+import cv2
+import torch
+from albumentations.pytorch import ToTensorV2
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+IMAGE_SIZE = 416
+transforms = A.Compose(
+    [
+        A.LongestMaxSize(max_size=IMAGE_SIZE),
+        A.PadIfNeeded(
+            min_height=IMAGE_SIZE, min_width=IMAGE_SIZE, border_mode=cv2.BORDER_CONSTANT
+        ),
+        A.Normalize(mean=[0, 0, 0], std=[1, 1, 1], max_pixel_value=255,),
+        ToTensorV2(),
+    ],
+)
+ANCHORS = [
+    [(0.28, 0.22), (0.38, 0.48), (0.9, 0.78)],
+    [(0.07, 0.15), (0.15, 0.11), (0.14, 0.29)],
+    [(0.02, 0.03), (0.04, 0.07), (0.08, 0.06)],
+]  # Note these have been rescaled to be between [0, 1]
+S = [IMAGE_SIZE // 32, IMAGE_SIZE // 16, IMAGE_SIZE // 8]
+scaled_anchors = (
+    torch.tensor(ANCHORS)
+    * torch.tensor(S).unsqueeze(1).unsqueeze(1).repeat(1, 3, 2)
+).to(DEVICE)
+PASCAL_CLASSES = [
+    "aeroplane",
+    "bicycle",
+    "bird",
+    "boat",
+    "bottle",
+    "bus",
+    "car",
+    "cat",
+    "chair",
+    "cow",
+    "diningtable",
+    "dog",
+    "horse",
+    "motorbike",
+    "person",
+    "pottedplant",
+    "sheep",
+    "sofa",
+    "train",
+    "tvmonitor"
+]
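
Note on `scaled_anchors`: the anchors in `ANCHORS` are fractions of the image size, and config.py multiplies each group by the grid size of its prediction scale (13, 26 and 52 cells for a 416-pixel input). The same arithmetic in plain Python, as an illustrative sketch that leaves out torch and the device transfer:

```python
IMAGE_SIZE = 416

# Anchor (width, height) pairs as fractions of the image, one group per scale.
ANCHORS = [
    [(0.28, 0.22), (0.38, 0.48), (0.9, 0.78)],
    [(0.07, 0.15), (0.15, 0.11), (0.14, 0.29)],
    [(0.02, 0.03), (0.04, 0.07), (0.08, 0.06)],
]

# Grid sizes for strides 32, 16 and 8.
S = [IMAGE_SIZE // 32, IMAGE_SIZE // 16, IMAGE_SIZE // 8]

# Express every anchor in grid-cell units of its own scale.
scaled_anchors = [
    [(w * s, h * s) for (w, h) in group]
    for group, s in zip(ANCHORS, S)
]

for s, group in zip(S, scaled_anchors):
    print(s, [(round(w, 2), round(h, 2)) for w, h in group])
```

The largest anchors are paired with the coarsest grid (13 x 13), so the first group is responsible for the largest objects.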

examples/1.jpg ADDED

examples/10.jpg ADDED

examples/11.jpg ADDED

examples/12.jpg ADDED

examples/13.jpg ADDED

examples/14.jpg ADDED

examples/15.jpg ADDED

examples/2.jpg ADDED

examples/3.jpg ADDED

examples/4.jpg ADDED

examples/5.jpg ADDED

examples/6.jpg ADDED

examples/7.jpg ADDED

examples/8.jpg ADDED

examples/9.jpg ADDED

requirements.txt ADDED

	@@ -0,0 +1,9 @@

+torch
+torchvision
+torch_lr_finder
+gradio
+grad-cam
+pillow
+opencv-python
+albumentations
+pytorch-lightning

utils.py ADDED

	@@ -0,0 +1,265 @@

+from typing import List
+import torch
+import numpy as np
+import cv2
+import random
+from pytorch_grad_cam.base_cam import BaseCAM
+from pytorch_grad_cam.utils.svd_on_activations import get_2d_projection
+from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
+def cells_to_bboxes(predictions, anchors, S, is_preds=True):
+    """
+    Scales the predictions coming from the model to be relative to the
+    entire image, so that they can, for example, later be plotted.
+    INPUT:
+    predictions: tensor of size (N, 3, S, S, num_classes+5)
+    anchors: the anchors used for the predictions
+    S: the number of cells the image is divided in on the width (and height)
+    is_preds: whether the input is predictions or the true bounding boxes
+    OUTPUT:
+    converted_bboxes: nested list of shape (N, num_anchors * S * S, 6), where each box is
+                      [class index, object score, x, y, width, height]
+    """
+    BATCH_SIZE = predictions.shape[0]
+    num_anchors = len(anchors)
+    box_predictions = predictions[..., 1:5]
+    if is_preds:
+        anchors = anchors.reshape(1, len(anchors), 1, 1, 2)
+        box_predictions[..., 0:2] = torch.sigmoid(box_predictions[..., 0:2])
+        box_predictions[..., 2:] = torch.exp(box_predictions[..., 2:]) * anchors
+        scores = torch.sigmoid(predictions[..., 0:1])
+        best_class = torch.argmax(predictions[..., 5:], dim=-1).unsqueeze(-1)
+    else:
+        scores = predictions[..., 0:1]
+        best_class = predictions[..., 5:6]
+    cell_indices = (
+        torch.arange(S)
+        .repeat(predictions.shape[0], 3, S, 1)
+        .unsqueeze(-1)
+        .to(predictions.device)
+    )
+    x = 1 / S * (box_predictions[..., 0:1] + cell_indices)
+    y = 1 / S * (box_predictions[..., 1:2] + cell_indices.permute(0, 1, 3, 2, 4))
+    w_h = 1 / S * box_predictions[..., 2:4]
+    converted_bboxes = torch.cat((best_class, scores, x, y, w_h), dim=-1).reshape(BATCH_SIZE, num_anchors * S * S, 6)
+    return converted_bboxes.tolist()
+def intersection_over_union(boxes_preds, boxes_labels, box_format="midpoint"):
+    """
+    Video explanation of this function:
+    https://youtu.be/XXYG5ZWtjj0
+    This function calculates intersection over union (iou) given pred boxes
+    and target boxes.
+    Parameters:
+        boxes_preds (tensor): Predictions of Bounding Boxes (BATCH_SIZE, 4)
+        boxes_labels (tensor): Correct labels of Bounding Boxes (BATCH_SIZE, 4)
+        box_format (str): midpoint/corners, if boxes (x,y,w,h) or (x1,y1,x2,y2)
+    Returns:
+        tensor: Intersection over union for all examples
+    """
+    if box_format == "midpoint":
+        box1_x1 = boxes_preds[..., 0:1] - boxes_preds[..., 2:3] / 2
+        box1_y1 = boxes_preds[..., 1:2] - boxes_preds[..., 3:4] / 2
+        box1_x2 = boxes_preds[..., 0:1] + boxes_preds[..., 2:3] / 2
+        box1_y2 = boxes_preds[..., 1:2] + boxes_preds[..., 3:4] / 2
+        box2_x1 = boxes_labels[..., 0:1] - boxes_labels[..., 2:3] / 2
+        box2_y1 = boxes_labels[..., 1:2] - boxes_labels[..., 3:4] / 2
+        box2_x2 = boxes_labels[..., 0:1] + boxes_labels[..., 2:3] / 2
+        box2_y2 = boxes_labels[..., 1:2] + boxes_labels[..., 3:4] / 2
+    if box_format == "corners":
+        box1_x1 = boxes_preds[..., 0:1]
+        box1_y1 = boxes_preds[..., 1:2]
+        box1_x2 = boxes_preds[..., 2:3]
+        box1_y2 = boxes_preds[..., 3:4]
+        box2_x1 = boxes_labels[..., 0:1]
+        box2_y1 = boxes_labels[..., 1:2]
+        box2_x2 = boxes_labels[..., 2:3]
+        box2_y2 = boxes_labels[..., 3:4]
+    x1 = torch.max(box1_x1, box2_x1)
+    y1 = torch.max(box1_y1, box2_y1)
+    x2 = torch.min(box1_x2, box2_x2)
+    y2 = torch.min(box1_y2, box2_y2)
+    intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
+    box1_area = abs((box1_x2 - box1_x1) * (box1_y2 - box1_y1))
+    box2_area = abs((box2_x2 - box2_x1) * (box2_y2 - box2_y1))
+    return intersection / (box1_area + box2_area - intersection + 1e-6)
+def non_max_suppression(bboxes, iou_threshold, threshold, box_format="corners"):
+    """
+    Video explanation of this function:
+    https://youtu.be/YDkjWEN8jNA
+    Does Non Max Suppression given bboxes
+    Parameters:
+        bboxes (list): list of lists containing all bboxes with each bboxes
+        specified as [class_pred, prob_score, x1, y1, x2, y2]
+        iou_threshold (float): threshold where predicted bboxes is correct
+        threshold (float): threshold to remove predicted bboxes (independent of IoU)
+        box_format (str): "midpoint" or "corners" used to specify bboxes
+    Returns:
+        list: bboxes after performing NMS given a specific IoU threshold
+    """
+    assert isinstance(bboxes, list)
+    bboxes = [box for box in bboxes if box[1] > threshold]
+    bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)
+    bboxes_after_nms = []
+    while bboxes:
+        chosen_box = bboxes.pop(0)
+        bboxes = [
+            box
+            for box in bboxes
+            if box[0] != chosen_box[0]
+            or intersection_over_union(
+                torch.tensor(chosen_box[2:]),
+                torch.tensor(box[2:]),
+                box_format=box_format,
+            )
+            < iou_threshold
+        ]
+        bboxes_after_nms.append(chosen_box)
+    return bboxes_after_nms
+def draw_bounding_boxes(image, boxes, class_labels):
+    colors = [[random.randint(0, 255) for _ in range(3)] for name in class_labels]
+    im = np.array(image)
+    height, width, _ = im.shape
+    bbox_thick = int(0.6 * (height + width) / 600)
+    # Create a Rectangle patch
+    for box in boxes:
+        assert len(box) == 6, "box should contain class pred, confidence, x, y, width, height"
+        class_pred = box[0]
+        conf = box[1]
+        box = box[2:]
+        upper_left_x = box[0] - box[2] / 2
+        upper_left_y = box[1] - box[3] / 2
+        x1  = int(upper_left_x * width)
+        y1 = int(upper_left_y * height)
+        x2 = x1 + int(box[2] * width)
+        y2 = y1 + int(box[3] * height)
+        cv2.rectangle(
+            image,
+            (x1, y1), (x2, y2),
+            color=colors[int(class_pred)],
+            thickness=bbox_thick
+        )
+        text = f"{class_labels[int(class_pred)]}: {conf:.2f}"
+        t_size = cv2.getTextSize(text, 0, 0.7, thickness=bbox_thick // 2)[0]
+        c3 = (x1 + t_size[0], y1 - t_size[1] - 3)
+        cv2.rectangle(image, (x1, y1), c3, colors[int(class_pred)], -1)
+        cv2.putText(
+            image,
+            text,
+            (x1, y1 - 2),
+            cv2.FONT_HERSHEY_SIMPLEX,
+            0.7,
+            (0, 0, 0),
+            bbox_thick // 2,
+            lineType=cv2.LINE_AA,
+        )
+    return image
+class YoloCAM(BaseCAM):
+    def __init__(self, model, target_layers, use_cuda=False,
+                 reshape_transform=None):
+        super(YoloCAM, self).__init__(model,
+                                       target_layers,
+                                       use_cuda,
+                                       reshape_transform,
+                                       uses_gradients=False)
+    def forward(self,
+                input_tensor: torch.Tensor,
+                scaled_anchors: torch.Tensor,
+                targets: List[torch.nn.Module],
+                eigen_smooth: bool = False) -> np.ndarray:
+        if self.cuda:
+            input_tensor = input_tensor.cuda()
+        if self.compute_input_gradient:
+            input_tensor = torch.autograd.Variable(input_tensor,
+                                                   requires_grad=True)
+        outputs = self.activations_and_grads(input_tensor)
+        if targets is None:
+            bboxes = [[] for _ in range(1)]
+            for i in range(3):
+                batch_size, A, S, _, _ = outputs[i].shape
+                anchor = scaled_anchors[i]
+                boxes_scale_i = cells_to_bboxes(
+                    outputs[i], anchor, S=S, is_preds=True
+                )
+                for idx, (box) in enumerate(boxes_scale_i):
+                    bboxes[idx] += box
+            nms_boxes = non_max_suppression(
+                bboxes[0], iou_threshold=0.5, threshold=0.4, box_format="midpoint",
+            )
+            # target_categories = np.argmax(outputs.cpu().data.numpy(), axis=-1)
+            target_categories = [box[0] for box in nms_boxes]
+            targets = [ClassifierOutputTarget(
+                category) for category in target_categories]
+        if self.uses_gradients:
+            self.model.zero_grad()
+            loss = sum([target(output)
+                       for target, output in zip(targets, outputs)])
+            loss.backward(retain_graph=True)
+        # In most of the saliency attribution papers, the saliency is
+        # computed with a single target layer.
+        # Commonly it is the last convolutional layer.
+        # Here we support passing a list with multiple target layers.
+        # It will compute the saliency image for every image,
+        # and then aggregate them (with a default mean aggregation).
+        # This gives you more flexibility in case you just want to
+        # use all conv layers for example, all Batchnorm layers,
+        # or something else.
+        cam_per_layer = self.compute_cam_per_layer(input_tensor,
+                                                   targets,
+                                                   eigen_smooth)
+        return self.aggregate_multi_layers(cam_per_layer)
+    def get_cam_image(self,
+                      input_tensor,
+                      target_layer,
+                      target_category,
+                      activations,
+                      grads,
+                      eigen_smooth):
+        return get_2d_projection(activations)
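
Note on `intersection_over_union` and `non_max_suppression`: the two functions implement greedy, per-class non-max suppression over boxes of the form `[class, score, x, y, w, h]`. The sketch below restates the same idea in plain Python for the midpoint box format only; `iou_midpoint` and `nms` are hypothetical names used for illustration and are not part of utils.py.

```python
def iou_midpoint(a, b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / (union + 1e-6)


def nms(boxes, iou_threshold=0.5, threshold=0.4):
    """Greedy per-class NMS over [class, score, x, y, w, h] boxes."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    boxes = sorted((b for b in boxes if b[1] > threshold),
                   key=lambda b: b[1], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)
        # Keep a box if it is another class or does not overlap `best` much.
        boxes = [b for b in boxes
                 if b[0] != best[0]
                 or iou_midpoint(best[2:], b[2:]) < iou_threshold]
        kept.append(best)
    return kept


boxes = [
    [11, 0.9, 0.50, 0.50, 0.40, 0.40],
    [11, 0.8, 0.52, 0.50, 0.40, 0.40],  # same class, heavy overlap: suppressed
    [14, 0.7, 0.52, 0.50, 0.40, 0.40],  # different class: kept
    [11, 0.3, 0.10, 0.10, 0.10, 0.10],  # below the score threshold: dropped
]
for box in nms(boxes):
    print(box)
```

Because suppression only compares boxes of the same class, overlapping detections of different classes both survive, which matches the `box[0] != chosen_box[0]` test in `non_max_suppression`.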

yolov3.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8816c44660d8b8f77225422081adf109deb727f9a84fe897b7f2726074308252
+size 246877637

yolov3.py ADDED

	@@ -0,0 +1,171 @@

+import torch
+import torch.nn as nn
+import pytorch_lightning as pl
+import config as cfg
+"""
+Information about architecture config:
+Tuple is structured by (filters, kernel_size, stride)
+Every conv is a same convolution.
+List is structured by "B" indicating a residual block followed by the number of repeats
+"S" is for scale prediction block and computing the yolo loss
+"U" is for upsampling the feature map and concatenating with a previous layer
+"""
+config = [
+    (32, 3, 1),
+    (64, 3, 2),
+    ["B", 1],
+    (128, 3, 2),
+    ["B", 2],
+    (256, 3, 2),
+    ["B", 8],
+    (512, 3, 2),
+    ["B", 8],
+    (1024, 3, 2),
+    ["B", 4],  # To this point is Darknet-53
+    (512, 1, 1),
+    (1024, 3, 1),
+    "S",
+    (256, 1, 1),
+    "U",
+    (256, 1, 1),
+    (512, 3, 1),
+    "S",
+    (128, 1, 1),
+    "U",
+    (128, 1, 1),
+    (256, 3, 1),
+    "S",
+]
+class CNNBlock(nn.Module):
+    def __init__(self, in_channels, out_channels, bn_act=True, **kwargs):
+        super().__init__()
+        self.conv = nn.Conv2d(in_channels, out_channels, bias=not bn_act, **kwargs)
+        self.bn = nn.BatchNorm2d(out_channels)
+        self.leaky = nn.LeakyReLU(0.1)
+        self.use_bn_act = bn_act
+    def forward(self, x):
+        if self.use_bn_act:
+            return self.leaky(self.bn(self.conv(x)))
+        else:
+            return self.conv(x)
+class ResidualBlock(nn.Module):
+    def __init__(self, channels, use_residual=True, num_repeats=1):
+        super().__init__()
+        self.layers = nn.ModuleList()
+        for repeat in range(num_repeats):
+            self.layers += [
+                nn.Sequential(
+                    CNNBlock(channels, channels // 2, kernel_size=1),
+                    CNNBlock(channels // 2, channels, kernel_size=3, padding=1),
+                )
+            ]
+        self.use_residual = use_residual
+        self.num_repeats = num_repeats
+    def forward(self, x):
+        for layer in self.layers:
+            if self.use_residual:
+                x = x + layer(x)
+            else:
+                x = layer(x)
+        return x
+class ScalePrediction(nn.Module):
+    def __init__(self, in_channels, num_classes):
+        super().__init__()
+        self.pred = nn.Sequential(
+            CNNBlock(in_channels, 2 * in_channels, kernel_size=3, padding=1),
+            CNNBlock(
+                2 * in_channels, (num_classes + 5) * 3, bn_act=False, kernel_size=1
+            ),
+        )
+        self.num_classes = num_classes
+    def forward(self, x):
+        return (
+            self.pred(x)
+            .reshape(x.shape[0], 3, self.num_classes + 5, x.shape[2], x.shape[3])
+            .permute(0, 1, 3, 4, 2)
+        )
+class YOLOv3LightningModel(pl.LightningModule):
+    def __init__(self, in_channels=3, num_classes=20):
+        super().__init__()
+        self.num_classes = num_classes
+        self.in_channels = in_channels
+        self.layers = self._create_conv_layers()
+    def forward(self, x):
+        outputs = []  # for each scale
+        route_connections = []
+        for layer in self.layers:
+            if isinstance(layer, ScalePrediction):
+                outputs.append(layer(x))
+                continue
+            x = layer(x)
+            if isinstance(layer, ResidualBlock) and layer.num_repeats == 8:
+                route_connections.append(x)
+            elif isinstance(layer, nn.Upsample):
+                x = torch.cat([x, route_connections[-1]], dim=1)
+                route_connections.pop()
+        return outputs
+    def _create_conv_layers(self):
+        layers = nn.ModuleList()
+        in_channels = self.in_channels
+        for module in config:
+            if isinstance(module, tuple):
+                out_channels, kernel_size, stride = module
+                layers.append(
+                    CNNBlock(
+                        in_channels,
+                        out_channels,
+                        kernel_size=kernel_size,
+                        stride=stride,
+                        padding=1 if kernel_size == 3 else 0,
+                    )
+                )
+                in_channels = out_channels
+            elif isinstance(module, list):
+                num_repeats = module[1]
+                layers.append(ResidualBlock(in_channels, num_repeats=num_repeats,))
+            elif isinstance(module, str):
+                if module == "S":
+                    layers += [
+                        ResidualBlock(in_channels, use_residual=False, num_repeats=1),
+                        CNNBlock(in_channels, in_channels // 2, kernel_size=1),
+                        ScalePrediction(in_channels // 2, num_classes=self.num_classes),
+                    ]
+                    in_channels = in_channels // 2
+                elif module == "U":
+                    layers.append(nn.Upsample(scale_factor=2),)
+                    in_channels = in_channels * 3
+        return layers
+def sanity_check(model):
+    # Push a dummy batch through the model and check the three output scales.
+    x = torch.randn((2, 3, cfg.IMAGE_SIZE, cfg.IMAGE_SIZE))
+    out = model(x)
+    # config.py defines no NUM_CLASSES, so read the class count from the model.
+    num_classes = model.num_classes
+    assert out[0].shape == (2, 3, cfg.IMAGE_SIZE // 32, cfg.IMAGE_SIZE // 32, num_classes + 5)
+    assert out[1].shape == (2, 3, cfg.IMAGE_SIZE // 16, cfg.IMAGE_SIZE // 16, num_classes + 5)
+    assert out[2].shape == (2, 3, cfg.IMAGE_SIZE // 8, cfg.IMAGE_SIZE // 8, num_classes + 5)
+    print("Success!")
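
Note on the output shapes checked by `sanity_check`: each `ScalePrediction` head emits `(num_classes + 5) * 3` channels, which `forward` reshapes to `(N, 3, S, S, num_classes + 5)` for grid sizes at strides 32, 16 and 8. The arithmetic can be checked without torch. The helper names below are hypothetical and only mirror the numbers used in yolov3.py.

```python
def head_channels(num_classes=20, anchors_per_scale=3):
    """Channels produced by the last 1x1 convolution of a prediction head."""
    # 5 = objectness score + 4 box values (x, y, w, h).
    return (num_classes + 5) * anchors_per_scale


def expected_output_shapes(batch_size, image_size=416, num_classes=20,
                           anchors_per_scale=3):
    """Shapes of the three prediction tensors, coarsest grid first."""
    strides = (32, 16, 8)
    return [
        (batch_size, anchors_per_scale, image_size // s, image_size // s,
         num_classes + 5)
        for s in strides
    ]


print(head_channels())
for shape in expected_output_shapes(batch_size=2):
    print(shape)
```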