Monocular Dynamic Object SLAM in Autonomous Driving

This post was originally published by Patrick Langechuan Liu at Towards Data Science

Object Proposal

ClusterVO uses YOLOv3 as the 2D object detector to propose semantic 2D bounding boxes for objects in each frame. Unlike CubeSLAM or S3DOT, it does not assume that objects can be described by cuboids, so there is no cuboid generation stage.

Data Association

ClusterVO performs a fairly complex data association scheme. It can be viewed as a two-step process: a Multi-level Probabilistic Association that associates observed keypoints with tracked landmarks and bounding boxes with tracked clusters, followed by a heterogeneous CRF that associates the tracked landmarks with tracked clusters.

  • point-point matching: this belongs to the Multi-level Probabilistic Association. Nearest-neighbor and descriptor matching may work well for static keypoints but not for dynamic points. ClusterVO therefore first predicts the location of each cluster from its velocity (linear only, no rotation). The probability of associating keypoint observation k with landmark i is proportional to the descriptor similarity, provided observation k lies close to the projection of the motion-predicted landmark i.
  • object-object matching: this also belongs to the Multi-level Probabilistic Association. If a bounding box m contains the largest number of projected points from a cluster q, then it is associated with that cluster (object).
  • point-object matching: this is the most complex part, and a heterogeneous conditional random field (CRF) is used. It determines whether a landmark i is associated with a cluster q, and it has multiple energy terms. The unary energy terms include a 2D energy (a point inside the bounding box associated with a cluster has a high probability of belonging to that cluster; a point inside multiple bounding boxes can be assigned to multiple clusters), a 3D energy (a point has a higher probability of being associated with a cluster if it is close to the center of the cluster, modulated by the size of the cluster) and a motion energy (the projection of the landmark can be explained by the motion of the cluster). The pairwise label-smoothness energy term penalizes nearby landmarks that are associated with different clusters.
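
The first two matching steps can be sketched in a few lines. This is a simplified illustration, not ClusterVO's implementation: the function names, the cosine similarity between descriptors, the 20-pixel gating radius and the row normalization are all assumptions made for the example. The box matching is read here as "for a given box, pick the cluster with the most projected points inside it".

```python
import numpy as np

def predict_landmark(p_world, cluster_velocity, dt):
    """Constant linear velocity prediction (no rotation) for one landmark."""
    return p_world + cluster_velocity * dt

def project(K, p_cam):
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def association_probs(obs_px, obs_desc, lm_px, lm_desc, radius=20.0):
    """Score observation k against landmark i.

    obs_px: (K, 2) observed keypoints, obs_desc: (K, D) their descriptors.
    lm_px: (N, 2) projections of motion-predicted landmarks, lm_desc: (N, D).
    Returns a (K, N) matrix; each row sums to 1, or is all zeros if no
    landmark projects within `radius` pixels of the observation.
    """
    dist = np.linalg.norm(obs_px[:, None, :] - lm_px[None, :, :], axis=2)
    a = obs_desc / np.linalg.norm(obs_desc, axis=1, keepdims=True)
    b = lm_desc / np.linalg.norm(lm_desc, axis=1, keepdims=True)
    sim = np.clip(a @ b.T, 0.0, None)        # assumed similarity measure
    score = np.where(dist < radius, sim, 0.0)  # proximity gate
    total = score.sum(axis=1, keepdims=True)
    return np.divide(score, total, out=np.zeros_like(score), where=total > 0)

def match_box_to_cluster(box, cluster_points_px):
    """Return the id of the cluster with the most projected points in box.

    box: (x1, y1, x2, y2); cluster_points_px: dict id -> (M, 2) pixel array.
    Returns None if no cluster has a point inside the box.
    """
    x1, y1, x2, y2 = box
    best_id, best_count = None, 0
    for cid, pts in cluster_points_px.items():
        inside = ((pts[:, 0] >= x1) & (pts[:, 0] <= x2) &
                  (pts[:, 1] >= y1) & (pts[:, 1] <= y2))
        count = int(inside.sum())
        if count > best_count:
            best_id, best_count = cid, count
    return best_id
```

The CRF step is left out on purpose: it needs the unary and pairwise energies described above plus an inference routine, which does not fit in a short sketch.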

Object-aware Bundle Adjustment

After the probabilistic data association, we can formulate the BA for static scenes and dynamic clusters. It uses a sliding window with a specially designed double-track frame management to manage keyframes.

  • Camera-point error: for static scenes, ClusterVO jointly optimizes the camera pose and the locations of static keypoints, in a similar way to ORB-SLAM. It is also augmented by an additional marginalization term, as ClusterVO uses sliding-window state estimation. This marginalization term captures the contribution from observations that would otherwise be removed due to the limited width of the sliding window.
  • Motion error: temporally predicted pose should be consistent with the inferred 3D measurement from a single frame. A motion model with acceleration sampled from a Gaussian process is adopted. ClusterVO only considers the translational part.
  • Dynamic point error: ClusterVO also has a dynamic point error, similar to CubeSLAM and S3DOT. If a point is on a dynamic object, its position relative to that object is fixed over time.
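
As a rough illustration of the last two terms, the residuals could look like the following. This is a sketch under assumptions, not ClusterVO's cost function: the names and the 4x4 homogeneous pose convention are hypothetical, and the motion term is reduced to a constant-velocity prediction on translation, with the acceleration noise (which would set the weight of this residual) left out.

```python
import numpy as np

def project(K, p_cam):
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def dynamic_point_residual(K, T_cam_world, T_world_obj, p_obj, obs_px):
    """Reprojection residual for a point rigidly attached to a moving object.

    T_cam_world: 4x4 transform from world to camera at this frame.
    T_world_obj: 4x4 object pose (object to world) at this frame.
    p_obj: (3,) point in the object frame; the same value is reused in
           every frame, which is what "fixed over time" means here.
    obs_px: (2,) observed pixel location.
    """
    p_h = np.append(p_obj, 1.0)
    p_cam = (T_cam_world @ T_world_obj @ p_h)[:3]
    return project(K, p_cam) - obs_px

def motion_residual(t_prev, v_prev, t_curr, dt):
    """Difference between the current translation estimate and the
    translation predicted from the previous frame (translation only)."""
    return t_curr - (t_prev + v_prev * dt)
```

In a bundle adjustment these residuals would be stacked over all frames in the sliding window and minimized jointly over camera poses, object poses and object-frame points.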

Verdict

ClusterVO is a more general approach to dynamic object SLAM. Judging from the results on KITTI dynamic scenes, the cuboid quality is not as good, because the cuboids are estimated from the landmark clusters in a post-processing step. For autonomous driving, CubeSLAM and S3DOT seem more practical. Note that object constraints are considered when optimizing odometry (camera pose).

The notion of “multibody mono SLAM” seems to come from “multibody SfM”, but it essentially has the same meaning as dynamic object SLAM.

Object Proposal

MoMoSLAM uses a fairly heavy but accurate 3D object proposal pipeline. It uses shape priors and keypoints to lift 2D detections to 3D shapes. It first detects k=36 ordered keypoints on distinguishable features of a vehicle, as well as deformation coefficients for a list of basis shapes. The 2D detections are then lifted to 3D by minimizing the reprojection error to obtain the 6 DoF pose and the shape parameters. The authors used this approach in a previous publication, The Earth ain’t Flat (IROS 2018), and it is very similar to RoI-10D (CVPR 2019).

From 2D keypoints to 3D shape (source: The Earth ain’t Flat, IROS 2018)

Data Association

  • point-point matching: keypoint matching based on descriptor features, similar to ORB-SLAM.
  • object-object matching: this is not explicitly mentioned in the paper, but it is clearly needed. Any 2D object tracking method would work.
  • point-object matching: not used. This is implicitly and partially done through the detection of semantic key points of each object in each frame.

Object-aware Bundle Adjustment

Camera-object pose graph and cycle consistency (source)

MoMoSLAM uses a different formulation of the optimization. Instead of specifying each error term and minimizing them, MoMoSLAM enforces consistency for each cycle created in the pose graph, as shown above. In essence, though, this should be equivalent to least-squares error minimization.

  • Camera-point error: same as ORB-SLAM. This odometry is up to scale due to the scale ambiguity in monocular images. Then semantic segmentation of the ground area and point matching in this region are used to estimate 3D depth using inverse perspective mapping (IPM). This fixes the scale factor and leads to odometry in the metric scale.
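
To make the scale recovery concrete, here is a minimal sketch of the IPM idea. It rests on assumptions that go beyond what is stated above: a flat ground plane, a camera with zero pitch and roll, and a known camera height above the ground. Under those assumptions a ground pixel at image row v has depth Z = fy * h / (v - cy). The function names and the median-based scale estimate are hypothetical choices for the example.

```python
import numpy as np

def ground_depth_ipm(v, fy, cy, cam_height):
    """Metric depth of a ground pixel at image row v.

    Assumes flat ground and a level camera (zero pitch and roll) mounted
    cam_height meters above the ground; fy and cy are in pixels.
    """
    if v <= cy:
        raise ValueError("pixel must lie below the horizon (v > cy)")
    return fy * cam_height / (v - cy)

def metric_scale(rows, upto_scale_depths, fy, cy, cam_height):
    """Scale factor that maps up-to-scale depths to meters.

    rows: image rows of matched ground points.
    upto_scale_depths: depths of the same points from monocular odometry.
    Returns the median ratio between IPM depth and up-to-scale depth.
    """
    metric = np.array([ground_depth_ipm(v, fy, cy, cam_height) for v in rows])
    return float(np.median(metric / np.asarray(upto_scale_depths, dtype=float)))
```

Multiplying the up-to-scale camera translation by this factor would give odometry in meters. The median is used only to make the toy estimate less sensitive to a few bad ground matches.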

Metric Odometry Estimation of MoMoSLAM (source)

  • Cycle consistency in the multi-object pose graph: Nodes in the pose graph are estimates, and edges in the pose graph are measurements. Camera-camera edges are constrained via metric-scale odometry. Camera-vehicle edges are constrained via single frame 2D to 3D lifting. Vehicle-vehicle edges are constrained via two different 3D depth estimation methods (IPM vs 2D-to-3D lifting). Note that there is no explicit motion model.
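
A cycle constraint of this kind can be illustrated with plain pose composition. The sketch below is a generic consistency check on one camera-vehicle cycle over two frames, not MoMoSLAM's actual cost. The convention that T_a_b is the 4x4 pose of frame b expressed in frame a, and all function names, are assumptions made for the example.

```python
import numpy as np

def make_pose(yaw, t):
    """4x4 pose with a rotation of `yaw` radians about z and translation t."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

def cycle_error(T_c1_c2, T_c2_v2, T_c1_v1, T_v1_v2):
    """Inconsistency of the cycle c1 -> c2 -> v2 versus c1 -> v1 -> v2.

    Both paths give the pose of the vehicle at frame 2 in camera frame 1.
    Returns (translation error, rotation error in radians); both are zero
    when the four edges agree.
    """
    path_a = T_c1_c2 @ T_c2_v2   # via camera odometry, then lifting at frame 2
    path_b = T_c1_v1 @ T_v1_v2   # via lifting at frame 1, then vehicle motion
    delta = np.linalg.inv(path_b) @ path_a
    trans_err = float(np.linalg.norm(delta[:3, 3]))
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return trans_err, float(np.arccos(cos_angle))
```

Summing squared errors of this form over all cycles gives a least-squares objective.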

I feel the cycle consistency is a bit contrived, especially the vehicle-vehicle edge. It would be much more straightforward to add an error term encouraging consistency between the distance estimates from IPM and from 2D-to-3D lifting.

Verdict

The method used by MoMoSLAM to fix the metric scale for the monocular setup is quite useful. Note that object constraints are not considered when calculating odometry (camera pose).

https://www.bilibili.com/video/av90800325/
