Bounding box prediction from scratch using PyTorch


This post was originally published by Aakanksha NS at Towards Data Science

Object detection is a very popular task in Computer Vision, where, given an image, you predict (usually rectangular) boxes around objects present in the image and also recognize the types of objects. There could be multiple objects in your image and there are various state-of-the-art techniques and architectures to tackle this problem like Faster-RCNN and YOLO v3.

This article talks about the case when there is only one object of interest present in an image. The focus here is more on how to read an image and its bounding box, resize and perform augmentations correctly, rather than on the model itself. The goal is to have a good grasp of the fundamental ideas behind object detection, which you can extend to get a better understanding of the more complex techniques.

Here’s a link to the notebook containing all the code I’ve used for this article:

If you’re new to Deep Learning or PyTorch, or just need a refresher, this might interest you:

The dataset consists of images of road signs, and there are four distinct classes these signs could belong to:

  • Traffic Light
  • Stop
  • Speed Limit
  • Crosswalk

This is called a multi-task learning problem, as it involves performing two tasks: 1) regression to find the bounding box coordinates, and 2) classification to identify the type of road sign.

Sample images from the dataset

The Road Sign Detection dataset consists of 877 images belonging to the four classes above. It’s a pretty imbalanced dataset, with most images belonging to the speed limit class, but since we’re more focused on the bounding box prediction, we can ignore the imbalance.

Reading the annotations into a dataframe involves the following steps:

  • Walk through the training directory to get a list of all the .xml files.
  • Parse each .xml file using xml.etree.ElementTree.
  • Create a dictionary consisting of the filepath, width, height, bounding box coordinates (xmin, xmax, ymin, ymax) and class for each image, and append the dictionary to a list.
  • Create a pandas dataframe from the list of dictionaries of image stats.
  • Label encode the class column.
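The steps above can be sketched roughly as follows (the function names are my own, and the tag names assume the Pascal VOC annotation format this dataset ships with):

```python
import xml.etree.ElementTree as ET
import pandas as pd

def parse_annotation(xml_path):
    """Parse one VOC-style .xml annotation into a flat dictionary."""
    root = ET.parse(xml_path).getroot()
    bbox = root.find("./object/bndbox")
    return {
        "filename": root.find("./filename").text,
        "width": int(root.find("./size/width").text),
        "height": int(root.find("./size/height").text),
        "class": root.find("./object/name").text,
        "xmin": int(bbox.find("xmin").text),
        "ymin": int(bbox.find("ymin").text),
        "xmax": int(bbox.find("xmax").text),
        "ymax": int(bbox.find("ymax").text),
    }

def build_df(xml_paths):
    """Build a dataframe of image stats and label-encode the class column."""
    df = pd.DataFrame([parse_annotation(p) for p in xml_paths])
    df["class"] = df["class"].astype("category").cat.codes
    return df
```

Note this keeps only the first `object` per file, which is fine here since each image contains a single sign of interest.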

Here’s how resizing a bounding box works:

  • Convert the bounding box into an image (called mask) of the same size as the image it corresponds to. This mask would just have 0 for background and 1 for the area covered by the bounding box.

Original Image

Mask of the bounding box

  • Resize the mask to the required dimensions.
  • Extract bounding box coordinates from the resized mask.

Helper functions to create a mask from a bounding box, and to extract bounding box coordinates from a mask
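The two helpers might look like the following sketch. The (ymin, xmin, ymax, xmax) coordinate order and the function names are assumptions for illustration; note that extracting from a mask returns the indices of the last covered row and column, so a box round-trips to an off-by-one on the max edges.

```python
import numpy as np

def create_mask(bb, shape):
    """Turn a bounding box (ymin, xmin, ymax, xmax) into a 0/1 mask of the given (rows, cols)."""
    y = np.zeros(shape, dtype=np.float32)
    ymin, xmin, ymax, xmax = map(int, bb)
    y[ymin:ymax, xmin:xmax] = 1.0
    return y

def mask_to_bb(mask):
    """Recover (ymin, xmin, ymax, xmax) from a mask; zeros if the mask is empty.
    The max values are the indices of the last covered row/column."""
    rows, cols = np.nonzero(mask)
    if len(rows) == 0:
        return np.zeros(4, dtype=np.float32)
    return np.array([rows.min(), cols.min(), rows.max(), cols.max()],
                    dtype=np.float32)
```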

Function to resize an image, write to a new path, and get resized bounding box coordinates
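To show just the bounding-box part of the round trip without an image library, here is a dependency-free sketch that resizes the mask with nearest-neighbour indexing in plain NumPy and then re-extracts the coordinates (in practice the image and mask would be resized with the same library call, e.g. OpenCV's `cv2.resize`):

```python
import numpy as np

def resize_mask(mask, new_rows, new_cols):
    """Nearest-neighbour resize of a 2-D mask."""
    r_idx = (np.arange(new_rows) * mask.shape[0] / new_rows).astype(int)
    c_idx = (np.arange(new_cols) * mask.shape[1] / new_cols).astype(int)
    return mask[r_idx][:, c_idx]

def resized_bb(bb, old_shape, new_shape):
    """Resize a (ymin, xmin, ymax, xmax) box by resizing its mask and re-extracting."""
    ymin, xmin, ymax, xmax = map(int, bb)
    mask = np.zeros(old_shape, dtype=np.float32)
    mask[ymin:ymax, xmin:xmax] = 1.0
    resized = resize_mask(mask, *new_shape)
    rows, cols = np.nonzero(resized)
    return np.array([rows.min(), cols.min(), rows.max(), cols.max()],
                    dtype=np.float32)
```

Going through the mask means the box scales correctly even when the two image dimensions are resized by different factors.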

For this problem, I’ve used flip, rotation, center crop and random crop. I’ve talked about various data augmentation techniques in this article:

Image Processing Techniques for Computer Vision


The only thing to remember here is to ensure that the bounding box is transformed the same way as the image. To do this, we follow the same approach as resizing: convert the bounding box to a mask, apply the same transformations to the mask as to the original image, and extract the bounding box coordinates from the transformed mask.
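As a small sketch of that idea (helper names are mine), a random flip and a center crop are applied with the same decision and the same window to both the image and its box mask:

```python
import numpy as np

def center_crop(arr, crop_r, crop_c):
    """Crop the central crop_r x crop_c window (along the first two axes)."""
    r0 = (arr.shape[0] - crop_r) // 2
    c0 = (arr.shape[1] - crop_c) // 2
    return arr[r0:r0 + crop_r, c0:c0 + crop_c]

def transform_pair(img, mask, crop=(200, 200), rng=None):
    """Apply the same random flip and center crop to an image and its box mask."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < 0.5:  # one random draw drives BOTH flips
        img, mask = img[:, ::-1], mask[:, ::-1]
    return center_crop(img, *crop), center_crop(mask, *crop)
```

The key point is that the random draw happens once, so the image and the mask can never disagree about whether they were flipped.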

  • Helper functions to center crop and random crop an image
  • Transforming image and mask
  • Displaying bounding box
  • Train-validation split
  • Creating train and valid datasets
  • Setting the batch size and creating data loaders

It’ll be a fun exercise to take a real photo with your phone and test the model on it. Another interesting experiment would be to train the model without any data augmentation and compare the two models.
