Paper Link

Summary of Contributions

This paper presents a new method for 6-DoF object pose estimation trained purely on synthetic data:

  1. The paper presents a new architecture for inferring the 3D pose of objects from a single RGB image.

  2. The paper shows that using a combination of photorealistic and domain-randomized (non-photorealistic) data can achieve performance comparable to SOTA methods trained on real data.

  3. The authors present a detailed analysis of their method, showing the necessity of each module and demonstrating a real-time object manipulation task using their pose detector.

Detailed Comments

In this paper the authors present a new approach to 6-DoF object pose estimation from a single RGB camera. Their one-shot pose estimator takes an RGB image and outputs two things: belief maps and vector fields. The belief maps are heatmaps over the projected vertices of each object's 3D bounding cuboid, along with its centroid; the vector fields point from each vertex toward the corresponding centroid, which lets the network group vertices into individual object instances. This yields, for each detected object, the 2D image-plane projection of its 3D bounding cuboid. The 6-DoF pose is then recovered via PnP (Perspective-n-Point), which takes the camera intrinsics, the object's known dimensions, and the detected 2D vertices, and solves for the object's 3D rotation and translation. Their network architecture consists of multiple stages, which the authors claim allows the network to resolve ambiguities in object detection.

A crucial component of their method is the dataset generation. They rely on two sources of data: (1) domain-randomized, non-photorealistic data and (2) photorealistic data. Domain randomization is performed by varying the distractors, backgrounds, textures, object poses, and lighting. The photorealistic images are obtained via a data collection procedure in which objects are dropped into a photorealistic simulated environment, so that the resulting scenes respect physical constraints.
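To make the inference pipeline concrete, below is a minimal sketch of the steps described above: extracting vertex peaks from the belief maps, grouping them to a centroid via the vector fields, and recovering the pose with PnP. This is not the authors' released code; the array shapes, thresholds, and helper names (`find_peaks`, `points_to_centroid`, `solve_pose`) are my own illustrative assumptions.

```python
# Sketch of the described pipeline: belief-map peaks -> vector-field grouping
# -> PnP. NOT the authors' implementation; shapes and thresholds are assumed.

import numpy as np
import cv2
from scipy.ndimage import maximum_filter


def find_peaks(belief, thresh=0.1):
    """Return (x, y) locations of local maxima in one HxW belief map."""
    local_max = (maximum_filter(belief, size=5) == belief) & (belief > thresh)
    ys, xs = np.nonzero(local_max)
    return list(zip(xs, ys))


def points_to_centroid(vertex_xy, centroid_xy, vector_field):
    """True if the 2xHxW vector field at a vertex points toward a centroid."""
    vx, vy = vector_field[:, vertex_xy[1], vertex_xy[0]]  # predicted direction
    d = np.array(centroid_xy, float) - np.array(vertex_xy, float)
    d /= np.linalg.norm(d) + 1e-8
    return vx * d[0] + vy * d[1] > 0.9  # cosine-similarity threshold


def solve_pose(image_points, object_points, K):
    """Recover rotation/translation from 2D-3D correspondences via PnP."""
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(object_points, np.float64),  # cuboid corners, object frame
        np.asarray(image_points, np.float64),   # matching 2D detections
        K,                                      # 3x3 camera intrinsics
        None,                                   # assume no lens distortion
    )
    return (rvec, tvec) if ok else (None, None)
```

Given the intrinsics `K` and the cuboid corners in the object's own frame, `cv2.solvePnP` returns the object's rotation and translation relative to the camera, which is why no extrinsics are needed as input: that transform is precisely what PnP solves for.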

The authors evaluate their method on two datasets: the YCB-Video dataset and a dataset with extreme lighting conditions used to test generalization. They show their method to be competitive with PoseCNN when both are trained on domain-randomized and photorealistic data, while enjoying a much simpler architecture. It is also worth noting that PoseCNN was trained on data with statistics similar to the testing environment, which makes the authors' results more significant. On the extreme-lighting dataset the evaluation is qualitative, and their method demonstrates better results. The authors ablate their design decisions, showing that both data sources play an important role and that using more than one stage in the architecture improves accuracy. They also demonstrate a working robotic manipulation setup using their pose detector, achieving a high success rate on the five objects tested. However, it remains unclear how the stages in the architecture help disambiguate the belief maps for vertex detection; I encourage the authors to add more analysis to support this claim. A quantitative comparison on the extreme-lighting dataset would also help validate the claim of generalization. Overall, I believe the experiments are comprehensive and the paper is clearly written, while demonstrating real-time pose detection for robot manipulation.
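On the suggested quantitative comparison: a natural choice would be the ADD metric (average distance between model points under the estimated and ground-truth poses), the standard accuracy measure on YCB-Video. The metric itself is standard; the sketch below and its names are my own.

```python
# Minimal sketch of the standard ADD metric for 6-DoF pose accuracy,
# included only to make the suggested quantitative comparison concrete.
# Names and shapes are illustrative assumptions.

import numpy as np

def add_metric(R_est, t_est, R_gt, t_gt, model_points):
    """Mean distance between model points transformed by the estimated
    and ground-truth poses; model_points is (N, 3) in the object frame."""
    pts_est = model_points @ R_est.T + t_est
    pts_gt = model_points @ R_gt.T + t_gt
    return np.linalg.norm(pts_est - pts_gt, axis=1).mean()

# Convention: a pose is counted correct if ADD falls below a threshold,
# commonly 10% of the object's diameter.
```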