Page 34 - Informatics, July 2021

Technology Update



              Object Detection Technologies


              A simplified explanation of YOLO class of algorithms




Edited by Dr. DIBAKAR RAY

It is of the highest importance in the art of detection to be able to recognise out of a number of facts which are incidental and which are vital
- Arthur Conan Doyle

Dr. A.K. Hota, Dy. Director General & SIO, ak.hota@nic.in
A.K. Somasekhar, Sr. Technical Director, som@nic.in
Shom C. Abraham, Scientific Assistant - A, shom.abraham@nic.in

Object detection is the task of localizing as well as classifying objects of interest from an image or video. It covers a wide range of techniques, including image processing, pattern recognition, artificial intelligence, and machine learning. Object detection has a variety of uses, some of which are surveillance and security, traffic monitoring, video communication, image annotation, activity detection, face recognition, robot vision and animation.

Classes of Algorithms
The three prominently used techniques in Object Detection are:
•  R-CNN and its variations like Fast R-CNN, Faster R-CNN, Mask R-CNN etc.
•  Single Shot Detectors
•  YOLO

R-CNN
Girshick et al. first proposed R-CNN in 2013. In it, the system makes region proposals, and these regions are then passed to a CNN, which classifies them and outputs bounding boxes. The problem with this approach is that it is painstakingly slow. A second version, Fast R-CNN, was published by Girshick in 2015; it used a convolutional implementation of sliding windows to process all the proposed regions. However, it was still slow. It wasn't until the third paper, Faster R-CNN, that this technique was used in practical applications. It replaced the external region proposal algorithm, such as Selective Search, with a CNN that proposes the regions.

YOLO
YOLO is the acronym for "You Only Look Once". Its first version was published in 2016 by Redmon et al. Unlike previous approaches, the image is passed through the network only once, rather than through a pipeline of region proposal, classification etc., and the network simultaneously predicts the co-ordinates of the bounding box and the class of the object. This substantially increased detection speed. Subsequently, there have been many versions of it, namely YOLO v2, YOLO v3, YOLO v4 and YOLO v5, with the most recent one being YOLO v5, published in 2021. The concepts of YOLO v3 form the basis for all subsequent works.

YOLO v3
YOLO v3 uses only convolutional layers; the pooling layers are also replaced by convolutional layers that downsample the feature map. The training network's input is of the form (n, X, X, 3), with n denoting the number of images, X denoting the width and the height, and 3 denoting the number of channels. The number X is chosen such that it is divisible by 32. YOLO v3 has 106 layers, with 53 CNN layers (Darknet-53) stacked on top of each other. The predictions are made at three different layers, corresponding to strides 32, 16 and 8. For each grid cell, we predict 3 bounding boxes at every scale. The bounding boxes are predicted as offsets to the prior boxes, also known as anchors.

Common Objects in Context (COCO) is a dataset containing 80 classes of commonly occurring real-life objects and is the standard dataset for testing object detection algorithms. For the COCO dataset, YOLO v3 produces, for each grid cell, a tensor of the shape 3 * (4 + 1 + 80), where 3 is the number of bounding boxes, 4 is for the offset location of the bounding box, 1 is for the objectness score and 80 is for the confidence probabilities of the classes. The offsets are given by t_x, t_y, t_w and t_h, where t_x and t_y are the center co-ordinates and t_w and t_h represent the width and height. The objectness score represents the IOU between the predicted box and any ground truth box.

[Figure] Image courtesy - http://medium.com

Each grid cell also predicts 80 conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. At test time we multiply the conditional class probabilities and the individual box confidence predictions, which gives a class-specific confidence score for each box.
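The test-time multiplication of conditional class probabilities by the box confidence can be illustrated with a minimal sketch. The function names and the sample numbers are hypothetical and chosen only for illustration; a real detector would apply this to every predicted box.

```python
def class_confidences(box_confidence, class_probs):
    """Multiply each conditional class probability Pr(Class_i | Object)
    by the box confidence to get class-specific confidence scores."""
    return [box_confidence * p for p in class_probs]


def best_class(box_confidence, class_probs):
    """Return (class index, score) for the highest class-specific score."""
    scores = class_confidences(box_confidence, class_probs)
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]


# Hypothetical box with confidence 0.5 and two classes.
index, score = best_class(0.5, [0.2, 0.8])
print(index, score)
```

A box with low confidence therefore gets low scores for every class, even if one conditional class probability is high.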

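The output sizes implied by the strides 32, 16 and 8 and by the 3 * (4 + 1 + 80) shape can be worked out with a short sketch. The input size of 416 is an assumption used only as an example (any X divisible by 32 would do), and the function name is hypothetical.

```python
NUM_BOXES = 3          # bounding boxes predicted per grid cell at each scale
BOX_OFFSETS = 4        # t_x, t_y, t_w, t_h
OBJECTNESS = 1         # one objectness score per box
STRIDES = (32, 16, 8)  # the three prediction scales


def output_shapes(input_size, num_classes=80):
    """Return (grid, grid, depth) for each of the three scales."""
    if input_size % 32 != 0:
        raise ValueError("input size must be divisible by 32")
    depth = NUM_BOXES * (BOX_OFFSETS + OBJECTNESS + num_classes)
    return [(input_size // s, input_size // s, depth) for s in STRIDES]


# Assumed example: a 416 x 416 input with the 80 COCO classes.
for shape in output_shapes(416):
    print(shape)
```

With 80 classes, the depth per grid cell is 3 * 85 = 255 at every scale; only the grid size changes with the stride.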

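How the offsets t_x, t_y, t_w and t_h relate to an anchor can be sketched as below. The article does not give the decoding formulas; this sketch assumes the formulation of the YOLO v3 paper, in which the center is a sigmoid offset inside the grid cell and the size is an exponential scaling of the anchor. The function and parameter names are hypothetical, and the anchor size in the usage line is only an example value.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Turn raw offsets into a box (center_x, center_y, width, height) in pixels.

    cell_x, cell_y: column and row index of the grid cell.
    anchor_w, anchor_h: anchor (prior box) size in pixels.
    stride: pixels per grid cell at this scale (32, 16 or 8).
    """
    center_x = (sigmoid(t_x) + cell_x) * stride  # center stays inside the cell
    center_y = (sigmoid(t_y) + cell_y) * stride
    width = anchor_w * math.exp(t_w)             # anchor scaled by e^t_w
    height = anchor_h * math.exp(t_h)
    return center_x, center_y, width, height


# Zero offsets give a box centered in cell (6, 6) with exactly the anchor's size.
print(decode_box(0.0, 0.0, 0.0, 0.0, 6, 6, 116, 90, 32))
```

The sigmoid keeps the predicted center within its own grid cell, which is why each cell is responsible only for objects centered in it.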

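The IOU (intersection over union) that the objectness score represents can be computed as in the following sketch. It assumes boxes are given as corner co-ordinates (x_min, y_min, x_max, y_max); that representation and the function name are choices made for this illustration, not something the article specifies.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as
    (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IOU of 1 means the predicted box matches the ground truth box exactly, and 0 means the two boxes do not overlap at all.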
              34  informatics.nic.in  July 2021