Summary of Contributions

This paper presents YOLO, a new method for object detection, with the following contributions:

YOLO reasons globally about the image before making predictions. Many previous works use a local sliding-window approach followed by classification, and therefore lack global reasoning.
The method is fast enough to run in real time. Object detection must be fast if it is to be deployed in real-time use cases, and this work provides a promising direction.
The method also generalizes well to other domains, whereas previous methods do not, as the authors demonstrate empirically.

Detailed Comments

This paper presents a new method for object detection based on one-shot inference, with improvements in average precision and runtime performance. In this method, YOLO, each image can be thought of as partitioned into a relatively coarse grid, where each grid cell is responsible for detecting the object that falls in that cell. This allows the authors to pass the input directly into a CNN and output three quantities per grid cell: K sets of bounding box parameters, an object confidence, and class confidences. In my opinion the design choices are simple yet effective, and the grid assumption is not too restrictive, as the grid size can be changed depending on the scale of object detection needed. Among the K predicted bounding boxes, the one with the maximum IOU with the ground-truth object is assigned as the regression output for that label. The authors use a squared-error loss to regress the bounding box and confidence parameters.
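
To make the assignment rule described above concrete, here is a minimal sketch in Python. It is an illustration, not the authors' implementation. It assumes boxes are given as (x_min, y_min, x_max, y_max) corners, and it assumes a grid cell is "responsible" for an object when the center of the object's box falls inside that cell. The names iou, responsible_cell, and responsible_box are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(truth: Box, image_w: float, image_h: float,
                     grid_size: int) -> Tuple[int, int]:
    """(row, col) of the grid cell containing the center of the box.

    Assumption: "falls in a cell" is read as "box center lies in the cell".
    """
    cx = (truth[0] + truth[2]) / 2.0
    cy = (truth[1] + truth[3]) / 2.0
    # Clamp so a center exactly on the right or bottom edge stays in the grid.
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


def responsible_box(predictions: List[Box], truth: Box) -> int:
    """Index of the predicted box (out of K) with the highest IOU with truth."""
    return max(range(len(predictions)), key=lambda k: iou(predictions[k], truth))


if __name__ == "__main__":
    truth = (100.0, 100.0, 200.0, 200.0)
    candidates = [(0.0, 0.0, 50.0, 50.0), (90.0, 110.0, 210.0, 190.0)]
    print(responsible_cell(truth, 448.0, 448.0, grid_size=7))
    print(responsible_box(candidates, truth))
```

Only the box selected by responsible_box would receive the coordinate regression target under this reading; the remaining K - 1 predictions in that cell would not be regressed toward the object.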

The authors present a comparison of their method on the Pascal VOC 2007 dataset, showing that YOLO runs faster than previous state-of-the-art methods while achieving higher mAP than other real-time detectors. They also analyze which kinds of errors YOLO handles better, showing that YOLO produces far fewer false positives on background than Fast R-CNN, and they attribute this to YOLO's global reasoning. They observe that YOLO does not achieve the highest performance when detecting small objects on the VOC 2012 dataset, but show that their method can be combined with Fast R-CNN to beat the state of the art. Finally, they conclude with a generalization experiment, training on VOC 2007 and testing on artwork images, and show that their method generalizes better because YOLO better captures the spatial properties of objects.

I appreciate that the authors do a good job of comparing their method to previous related work, explaining the deficiencies in prior work and demonstrating a marked improvement over it. They also discuss the deficiencies of their own method, which are well explained. In brief, their method struggles to detect small objects because of the spatial constraint imposed on the bounding boxes, and because particular grid areas become specialized to detect particular things based on the dataset and hence lack spatial generalization. They also mention that their loss function is only an approximation of the true objective of detection. Overall, I think this work presents good results that push forward the field of object detection.