How to detect objects more effectively with YOLO

This article presents the fundamentals of YOLOv3, an object detection algorithm recognized for both its precision and its speed. YOLOv3 relies on anchors (predetermined bounding boxes) to improve detection; these anchors are computed automatically with a k-means clustering algorithm based on Intersection over Union (IoU).


But say Jamy, what is an anchor?

A preliminary analysis of the objects we are trying to find shows that most of their bounding boxes share certain ratios between height and width. So, instead of directly predicting a bounding box, YOLOv2 (and v3) predict corrections to a set of boxes with particular height-width ratios – these predetermined boxes are the anchors.

 

 

Find the anchors

There are several ways to define these anchors, the most naive is to define them by hand. Instead of choosing them by hand, it is better to apply a k-means clustering algorithm to find them automatically.

 

K-means with a particular distance: the IoU

IoU – or Intersection over Union – is a way of measuring the quality of an object detection by comparing, on a training dataset, the known position of the object in the image with the position predicted by the algorithm. The IoU is the ratio between the area of the intersection of the two bounding boxes and the area of their union.

 

 

The IoU ranges from 0 (for a completely missed detection) to 1 (for a perfect detection). Generally, the object is considered detected when the IoU is greater than 0.5.
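To make this concrete, here is a minimal sketch of the IoU computation for two axis-aligned boxes; the (x1, y1, x2, y2) box format and the function name are illustrative assumptions, not taken from the article.

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2); start with the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)  # intersection area / union area

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.14: some overlap, but below the usual 0.5 threshold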

 

 

Calculate anchors

If we use standard k-means, that is to say with the Euclidean distance, large bounding boxes generate more error than small ones. Moreover, what we really want are anchors that lead to high IoU scores. So we replace the Euclidean distance with:

 

dist(box, centroid) = 1 − IoU(box, centroid)

 

IoU calculations are made under the assumption that the upper-left corners of the bounding boxes are located at the same place. The calculation is thus simplified and depends only on the height and width of the boxes. Each bounding box can therefore be represented by a point in the (width, height) plane, and we apply the clustering algorithm to these points.
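Because the boxes are assumed to share their top-left corner, the IoU used during clustering depends only on widths and heights. A minimal sketch of this simplified IoU and of the resulting k-means distance (function names are illustrative):

def iou_wh(wh_a, wh_b):
    # both boxes share the same top-left corner, so only (width, height) matters
    w_a, h_a = wh_a
    w_b, h_b = wh_b
    inter = min(w_a, w_b) * min(h_a, h_b)
    return inter / (w_a * h_a + w_b * h_b - inter)

def kmeans_distance(box, centroid):
    return 1.0 - iou_wh(box, centroid)  # dist(box, centroid) = 1 - IoU(box, centroid)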

 

 

 

Detect a face

Let us take the example of face detection, using the Wider Face database:

 

We take all the faces from this database and, for each face bounding box, we place a point at coordinates (x, y), where x is the width of the bounding box and y its height, both relative to the total size of the image:

 

 

 

 

We then apply the k-means algorithm with 9 centroids, which allows us to determine the dimensions of the 9 anchors for our face detection model. Note that the borders between the different clusters are not straight lines: this is because we use a distance derived from the IoU rather than the Euclidean distance.
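As an illustration, here is a minimal sketch of this clustering step on the normalized (width, height) pairs; loading the Wider Face annotations into the wh array is assumed to have been done elsewhere, and the median update of the centroids is one common choice, not necessarily the one used for the figures above.

import numpy as np

def kmeans_anchors(wh, k=9, iterations=100):
    # wh: (N, 2) array of normalized (width, height) pairs, one per face bounding box
    wh = np.asarray(wh, dtype=float)
    centroids = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iterations):
        # IoU between every box and every centroid, using only widths and heights
        inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0])
                 * np.minimum(wh[:, None, 1], centroids[None, :, 1]))
        union = wh[:, None, 0] * wh[:, None, 1] + centroids[None, :, 0] * centroids[None, :, 1] - inter
        assignment = (1.0 - inter / union).argmin(axis=1)  # nearest centroid for every box
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = np.median(wh[assignment == j], axis=0)
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]  # sorted by area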

 

 

 

 

So, for an image of size 416×416 pixels, the nine anchor sizes in pixels are obtained by multiplying the normalized widths and heights of the nine centroids by 416.
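In code, assuming normalized_anchors holds the nine (width, height) centroids returned by the sketch above, the conversion is a simple rescaling:

def anchors_in_pixels(normalized_anchors, image_size=416):
    # scale the normalized (width, height) centroids to the network input size
    return np.round(np.asarray(normalized_anchors) * image_size).astype(int)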

 

 

 

 

To illustrate the importance of anchors, let's take the example of license plates. In green, a neural network was trained with the default anchors; in red, with anchors specific to the problem. For each detection, we can see the impact of the anchors on the predictions: with the default anchors the bounding boxes are too tall and too narrow, while with the specific anchors the detections fit the plates perfectly.

 

 

 

From image to prediction

YOLOv3 is a so-called fully convolutional neural network, and it produces what we call feature maps as output. What you need to remember for YOLO is that since there is no constraint on the size of the feature maps, we can give it images of different sizes!

 

Example of a 3-layer neural network, with intermediate feature maps: lines and edges, then elements of a face, then structure of the face.

YOLO version 1 directly predicts the bounding box coordinates using a dense layer. Instead of predicting these coordinates directly, YOLOv3 uses priors on the shape of the objects to detect; each prior is expressed as a width and a height in pixels, since the boxes are assumed to be rectangular. These priors are the anchors of YOLOv3, whose calculation was described above.
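To make the role of these priors concrete, here is a sketch of the box decoding introduced with YOLOv2 and kept in YOLOv3: the network predicts offsets (tx, ty, tw, th) for a grid cell (cx, cy) and an anchor of size (pw, ph) in pixels, and the anchor is only shifted and rescaled (variable names are illustrative).

import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    # (cx, cy): indices of the grid cell; (pw, ph): anchor width and height in pixels
    bx = (1.0 / (1.0 + np.exp(-tx)) + cx) * stride  # sigmoid keeps the centre inside its cell
    by = (1.0 / (1.0 + np.exp(-ty)) + cy) * stride
    bw = pw * np.exp(tw)  # the anchor is stretched or shrunk, never replaced
    bh = ph * np.exp(th)
    return bx, by, bw, bh  # centre and size of the predicted box, in pixels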

 

 

The detailed network (for the most experienced 👩🏽‍💻)

YOLOv3 has no dense layer; it is composed only of convolutions. Each convolution is followed by a batch normalization layer and then the LeakyReLU activation function. Batch normalization is beneficial: convergence is faster and vanishing or exploding gradients are limited. With these batch normalizations, dropout can be removed without overfitting problems.
In addition, some convolutions have a stride of two, to subsample the images and reduce the first two dimensions of the feature maps. The following figure represents two convolutions, the first with a stride of 1, the second with a stride of 2.
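As an illustration, a minimal sketch of this elementary block, assuming a PyTorch implementation (the function name and kernel size are illustrative choices):

import torch.nn as nn

def conv_block(in_channels, out_channels, stride=1):
    # convolution -> batch normalization -> LeakyReLU, repeated throughout YOLOv3
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.LeakyReLU(0.1),
    )

# stride=1 keeps the spatial size of the feature map, stride=2 halves it (subsampling without pooling)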

The network has three outputs to be able to detect objects at three different scales.
Its complete architecture is described by the following figure.

 

 

 

The image

YOLOv3 reduces image size by a factor of 32, called the network stride. The first version of YOLO took images of size 448×448, so the output feature map was of size 14×14.
It is common for the objects we want to detect to be in the center of the image. However, a 14×14 grid does not have a single central cell. In practice, it is therefore preferable for the output to have an odd size. To remove this ambiguity, the images are resized to 416×416, which produces a feature map of size 13×13 with a single central cell.

 

The network output

To fully understand YOLO, you need to understand its outputs. YOLOv3 has three final layers: the first has its spatial dimensions divided by 32 compared to the initial image, the second by 16 and the third by 8. So, starting from an image of size 416×416 pixels, the three feature maps output by the network have respective sizes of 13×13, 26×26 and 52×52 cells. It is in this sense that YOLOv3 predicts at three levels of detail, to detect large, medium and small objects respectively.

Starting from an image of size 416×416 pixels, the same pixel can be “tracked” through the network and ends up in three cells, one per scale. For each cell, three bounding boxes are predicted, which makes a total of 9 boxes coming from the 9 anchors. For each bounding box, an objectness score and class membership scores are also predicted. In total, the network proposes 52×52×3 + 26×26×3 + 13×13×3 = 10,647 bounding boxes.
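A quick sanity check of this count (variable names are illustrative):

image_size = 416
strides = [8, 16, 32]        # fine, medium and coarse output scales
boxes_per_cell = 3           # one box per anchor assigned to each scale
grid_sizes = [image_size // s for s in strides]            # [52, 26, 13]
total = sum(g * g * boxes_per_cell for g in grid_sizes)    # 8112 + 2028 + 507
print(total)                                               # 10647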

 

 

Now that we have all these predictions, we need to select the right ones.

 

 

 

Boxes whose objectness score is too low are discarded first; among the remaining boxes, those that overlap strongly are assumed to cover the same object and only the highest-scoring one is kept. This is Non-Maximum Suppression (NMS): for each detected object, only the best proposal is retained.
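A minimal sketch of greedy non-maximum suppression, assuming boxes stored as (x1, y1, x2, y2) rows of a NumPy array with one confidence score per box:

import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidence of each box
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]          # best proposals first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # IoU between the kept box and every remaining box
        xx1 = np.maximum(x1[best], x1[rest])
        yy1 = np.maximum(y1[best], y1[rest])
        xx2 = np.minimum(x2[best], x2[rest])
        yy2 = np.minimum(y2[best], y2[rest])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou < iou_threshold]     # boxes overlapping too much are duplicates of the same object
    return keep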

 

Conclusion

Once our algorithm is trained, it can be used and even combined with other algorithms. Face detection can, for example, be coupled with a check that safety equipment is being worn, here a mask:

So, to use YOLOv3, you first need training data. But that is not enough: you also have to prepare to “aim” correctly at the objects you want to detect, by using the right anchors!

 
