1 Introduction
†First two authors contributed equally to this work.
Recent work has made substantial progress in fully automatic, 3D feature-based point cloud registration. At first glance, benchmarks like 3DMatch [48] appear to be saturated, with multiple state-of-the-art (SOTA) methods [15, 7, 3] reaching nearly 95% feature matching recall and successfully registering 80% of all scan pairs. One may get the impression that the registration problem is solved, but this is actually not the case. We argue that the high success rates are a consequence of lenient evaluation protocols. We have been making our task too easy: existing literature and benchmarks [4, 48, 20] consider only pairs of point clouds with at least 30% overlap to measure performance. Yet, the low-overlap regime is very relevant for practical applications. On the one hand, it may be difficult to ensure high overlap, for instance when moving along narrow corridors, or when closing loops in the presence of occlusions (densely built-up areas, forests, etc.). On the other hand, data acquisition is often costly, so practitioners aim for a low number of scans with only the necessary overlap.
Driven by the evaluation protocol, the high-overlap scenario became the focus of research, whereas the more challenging low-overlap examples were largely neglected (cf. Fig. 1). As a consequence, the registration performance of even the best known methods deteriorates rapidly when the overlap between the two point clouds falls below 30%, see Fig. 2. Human operators, in contrast, can still register such low-overlap point clouds without much effort.
This discrepancy is the starting point of the present work. To study its reasons, we have constructed a low-overlap dataset, 3DLoMatch (Sec. 4.1), from scans of the popular 3DMatch benchmark, and have analysed the individual modules/steps of the registration pipeline (Fig. 2). It turns out that the effective receptive field of modern (fully convolutional) feature point descriptors [7, 3] is sufficiently local, and the descriptors are hardly corrupted by non-overlapping parts of the scans. Rather than coming up with yet another way to learn better descriptors, the key to registering low-overlap point clouds is learning where to look for feature points (Fig. 2, right). A large performance boost can be achieved if the feature points are predominantly sampled from the overlapping portions of the scans.
We follow this path and introduce Predator, a neural architecture for pairwise 3D point cloud registration that learns to (implicitly) detect the overlap region between two unregistered scans, and to focus on that region when extracting salient feature points. The main contributions of our work are:


- an analysis of why existing registration architectures break down in the low-overlap regime;
- a novel overlap attention block that allows for early information exchange between the two point clouds and focuses the subsequent steps on the overlap region;
- a scheme to refine the feature point descriptors by conditioning them also on the respective other point cloud;
- a novel loss function to train matchability scores, which help to sample better and more repeatable interest points.
Moreover, we make available the 3DLoMatch dataset, containing the previously ignored scan pairs of 3DMatch that have low (10–30%) overlap. In our experiments, Predator greatly outperforms existing methods in the low-overlap regime, increasing registration recall by more than 10 percentage points. It also sets a new state of the art for the conventional 3DMatch benchmark, reaching a registration recall of 89%.
2 Related work
Local 3D feature descriptors: Early local descriptors for point clouds [19, 29, 28, 36, 35] aimed to characterise the local geometry using handcrafted features. While often lacking robustness against clutter and occlusions, they have long been a default choice for downstream tasks because they naturally generalise across datasets [17]. In recent years, learned 3D feature descriptors have taken over and now routinely outperform their handcrafted counterparts.
The pioneering 3DMatch method [48] is based on a Siamese 3D CNN that extracts local feature descriptors from a signed distance function embedding. Others [20, 16] first extract handcrafted features, then map them to a compact representation using multilayer perceptrons. PPFNet [10], and its self-supervised version PPF-FoldNet [9], combine point pair features with a PointNet [26] architecture to extract descriptors that are aware of the global context. To alleviate artefacts caused by noise and voxelisation, [15] proposed to use a smoothed density voxel grid as input to a 3D CNN. These early works achieved strong performance, but still operate on individual local patches, which greatly increases the computational cost and limits the receptive field to a predefined size.
Fully convolutional architectures [22] that enable dense feature computation over the whole input in a single forward pass [11, 12, 27] have been adopted to design faster 3D feature descriptors. Building on sparse convolutions [6], FCGF [7] achieves performance similar to the best patch-based descriptors [15], while being orders of magnitude more efficient. D3Feat [3] complements a fully convolutional feature descriptor with an interest point detector trained to detect salient points.
Contextual information: In the traditional pipeline, feature extraction is done independently per point cloud. Information is only mixed when computing pairwise similarities, although aggregating contextual information at an earlier stage could provide additional cues to robustify the descriptors and guide the matching step.
In 2D feature learning, [43] use an attention mechanism in the bottleneck of an encoder-decoder scheme to aggregate contextual information, which is later used to condition the output of the decoder on the second image. SuperGlue [30] infuses contextual information into the learned descriptors with a whole series of self- and cross-attention layers, built upon a message-passing GNN. Early information mixing was previously also explored in the field of deep point cloud registration, where [40, 41] use a transformer module to extract task-specific 3D features that are reinforced with contextual information.
Interest point sampling: The classic principle to sample salient rather than random points has also found its way into learned 2D [11, 12, 27, 43] and 3D [46, 3] local feature extraction. All these methods implicitly assume that the saliency of a point fully determines its utility for downstream tasks. Here, we take a step back and argue that, while saliency is desirable for an interest point, it is not sufficient on its own. Indeed, in order to contribute to registration a point should not only be salient, but must also lie in the region where the two point clouds overlap—an essential property that, surprisingly, has largely been neglected thus far.
Deep point cloud registration: Instead of combining learned feature descriptors with some off-the-shelf robust optimization at inference time, a parallel stream of work aims to embed the (differentiable) estimation of the transformation parameters into the learning pipeline. PointNetLK [1] combines a PointNet-based global feature descriptor [26] with a Lucas/Kanade-like optimization algorithm [23] and estimates the relative transformation in an iterative fashion. DCP [40] uses a DGCNN network [42] to extract local features and computes soft correspondences before using the Kabsch algorithm to estimate the transformation parameters. To relax the need for strict one-to-one correspondences, DCP was later extended to PRNet [41], which includes a keypoint detection step and allows for partial correspondence. Instead of simply using soft correspondences, [47] estimate the similarity matrix with a differentiable Sinkhorn layer [32]. As in other methods, the weighted Kabsch algorithm [2] is used in [47] to estimate the transformation parameters. Finally, [14, 5] complement a learned feature descriptor with an outlier filtering network, which infers the points' influence weights for later use in the weighted Kabsch algorithm.
3 Method
Predator is a two-stream encoder-decoder network. Our implementation uses residual blocks with KPConv-style point convolutions [34], but the architecture is agnostic w.r.t. the backbone and could also be implemented with other formulations of 3D convolutions, such as sparse voxel convolutions [6]. The architecture of Predator can be decomposed into three main modules:


- encoding of the two point clouds into smaller sets of superpoints and associated latent feature encodings, with shared weights (Sec. 3.2);
- the overlap attention module (in the bottleneck) that extracts co-contextual information between the feature encodings of the two point clouds, and assigns each superpoint two overlap scores that quantify how likely the superpoint itself and its soft correspondence are located in the overlap between the two inputs (Sec. 3.3);
- decoding of the mutually conditioned bottleneck representations to point-wise descriptors as well as refined per-point overlap and matchability scores (Sec. 3.4).
Before diving into each component we lay out the basic problem setting and notation in Sec. 3.1.
3.1 Problem setting
Consider two point clouds $\mathbf{P} = \{p_i \in \mathbb{R}^3\}_{i=1}^{N}$ and $\mathbf{Q} = \{q_j \in \mathbb{R}^3\}_{j=1}^{M}$. Our goal is to recover a rigid transformation $\mathbf{T}$ with parameters $\mathbf{R} \in SO(3)$ and $\mathbf{t} \in \mathbb{R}^3$ that aligns $\mathbf{P}$ to $\mathbf{Q}$. By a slight abuse of notation we use the same symbols for sets of points and for their corresponding matrices $\mathbf{P} \in \mathbb{R}^{N \times 3}$ and $\mathbf{Q} \in \mathbb{R}^{M \times 3}$.
Obviously, $\mathbf{T}$ can only ever be determined from the data if $\mathbf{P}$ and $\mathbf{Q}$ have sufficient overlap, meaning that after applying the ground-truth transformation $\mathbf{T}_{\mathrm{gt}}$ the overlap ratio

$$\frac{1}{|\mathbf{P}|}\,\Big|\big\{\, p \in \mathbf{P} \;:\; \big\|\mathrm{NN}\big(\mathbf{T}_{\mathrm{gt}}(p), \mathbf{Q}\big) - \mathbf{T}_{\mathrm{gt}}(p)\big\| \le v \,\big\}\Big| \qquad (1)$$

exceeds a threshold $\tau$, where $\mathrm{NN}(\cdot,\cdot)$ denotes the nearest-neighbour operator w.r.t. its second argument, $\|\cdot\|$ is the Euclidean norm, $|\cdot|$ is the set cardinality, and $v$ is a tolerance that depends on the point density. (For efficiency, the overlap is in practice determined after voxel-grid downsampling of the two point clouds.) Contrary to previous work [48, 20], where the threshold to even attempt the alignment is typically $\tau > 0.3$, we are interested in low-overlap point clouds with $0.1 \le \tau \le 0.3$.
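The overlap ratio of Eq. (1) can be sketched in a few lines; the function name, the brute-force nearest-neighbour search, and the tolerance default below are illustrative choices, not the benchmark's implementation (which downsamples with a voxel grid first for efficiency):

```python
import numpy as np

def overlap_ratio(P, Q, R, t, tol=0.1):
    """Eq. (1): fraction of points in P that, after the ground-truth rigid
    transform (R, t), have a nearest neighbour in Q within tolerance `tol`."""
    P_aligned = P @ R.T + t  # apply T_gt to every point of P
    # brute-force nearest-neighbour distance from each aligned point to Q
    d_nn = np.linalg.norm(P_aligned[:, None, :] - Q[None, :, :], axis=-1).min(axis=1)
    return float(np.mean(d_nn <= tol))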
3.2 Encoder
We follow [34] and first pre-process the raw point clouds with grid-based subsampling, such that $\mathbf{P}$ and $\mathbf{Q}$ have reasonably uniform point density. In the shared encoder, a series of ResNet-like blocks and strided convolutions aggregate the raw points into superpoints $\mathbf{P}'$ and $\mathbf{Q}'$ with associated features $\mathbf{X}^{\mathbf{P}'}$ and $\mathbf{X}^{\mathbf{Q}'}$. Note that superpoints correspond to a fixed receptive field, so their number depends on the spatial extent of the input point cloud and may differ between the two inputs.
3.3 Overlap attention module
So far, the features $\mathbf{X}^{\mathbf{P}'}$ and $\mathbf{X}^{\mathbf{Q}'}$ in the bottleneck encode the geometry and context of the two point clouds, but $\mathbf{X}^{\mathbf{P}'}$ has no knowledge of point cloud $\mathbf{Q}$ and vice versa. In order to reason about their respective overlap regions, some cross-talk is necessary. We argue that it makes sense to add that cross-talk at the level of superpoints in the bottleneck, just as a human operator first gets a rough overview of the overall shape to determine likely overlap regions, and only then identifies precise feature points in those regions.
Graph convolutional neural network: Before connecting the two feature encodings, we first further aggregate and strengthen their contextual relations individually with a graph neural network (GNN) [42]. In the following, we describe the GNN for point cloud $\mathbf{P}'$; the GNN for $\mathbf{Q}'$ is the same. First, the superpoints in $\mathbf{P}'$ are linked into a graph in Euclidean space with the $k$-NN method. Let $x_i$ denote the feature encoding of superpoint $p_i$, and $(i,j) \in \mathcal{E}$ the graph edge between superpoints $p_i$ and $p_j$. The encoder features are then iteratively updated as

$$x_i^{(k+1)} = \max_{(i,j) \in \mathcal{E}} h_{\theta}\big(\mathrm{cat}\big[x_i^{(k)},\; x_j^{(k)} - x_i^{(k)}\big]\big), \qquad (2)$$

where $h_{\theta}(\cdot)$ denotes a linear layer followed by instance normalization [37] and a LeakyReLU activation [24], $\max(\cdot)$ denotes element-/channel-wise max-pooling, and $\mathrm{cat}[\cdot,\cdot]$ means concatenation. This update is performed twice with separate (not shared) parameters $\theta_1, \theta_2$, and the final GNN features $x_i^{\mathrm{GNN}}$ are obtained as

$$x_i^{\mathrm{GNN}} = \mathrm{cat}\big[x_i^{(0)},\; x_i^{(1)},\; x_i^{(2)}\big]. \qquad (3)$$
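The graph construction and one update of Eq. (2) can be sketched as follows; a plain ReLU stands in for the linear layer + instance normalization + LeakyReLU of the paper, and all names are illustrative:

```python
import numpy as np

def knn_graph(pts, k):
    """Connect each superpoint to its k nearest neighbours in Euclidean space."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # no self-loops
    return np.argsort(d, axis=1)[:, :k]       # (N, k) neighbour indices

def gnn_update(x, nbrs, W):
    """One EdgeConv-style update (cf. Eq. 2): a shared linear map W acts on
    cat[x_i, x_j - x_i] for every edge, then messages are max-pooled per node."""
    k = nbrs.shape[1]
    x_i = np.repeat(x[:, None, :], k, axis=1)  # (N, k, d) receiver features
    x_j = x[nbrs]                              # (N, k, d) neighbour features
    msg = np.concatenate([x_i, x_j - x_i], axis=-1) @ W
    return np.maximum(msg, 0.0).max(axis=1)    # ReLU, then channel-wise max-pool
```

Running the update twice with separate weight matrices and concatenating the intermediate features yields the GNN features of Eq. (3).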
Cross-attention block: Knowledge about potential overlap regions can only be gained by mixing information about both point clouds. To this end we adopt a cross-attention block [30] based on the message-passing formulation [13]. First, each superpoint in $\mathbf{P}'$ is connected to all superpoints in $\mathbf{Q}'$ to form a bipartite graph. Inspired by the Transformer architecture [39], vector-valued keys $k_j$ and queries $q_i$ are learned for each superpoint and used to retrieve (also learned) values $v_j$. The messages are then computed as weighted averages of the values,

$$m_i = \sum_{j : (i,j) \in \mathcal{E}} a_{ij}\, v_j, \qquad (4)$$

with attention weights $a_{ij} = \operatorname{softmax}_j\big(q_i^{\top} k_j / \sqrt{d}\big)$ [30]. I.e., to update a superpoint $p_i$, one combines that point's query $q_i$ with the keys and values of all superpoints of $\mathbf{Q}'$. The queries, keys, and values are linear projections of the corresponding features $x_i^{\mathrm{GNN}}$. In line with the literature, in practice we use a multi-attention layer with four parallel attention heads [39]. The co-contextual features are computed as

$$x_i^{\mathrm{CA}} = x_i^{\mathrm{GNN}} + \mathrm{MLP}\big(\mathrm{cat}\big[x_i^{\mathrm{GNN}},\; m_i\big]\big), \qquad (5)$$

with MLP denoting a three-layer fully connected network with instance normalization [37] and ReLU [25] activations after the first two layers. The same cross-attention block is also applied in the reverse direction, so that information flows in both directions, $\mathbf{P}' \to \mathbf{Q}'$ and $\mathbf{Q}' \to \mathbf{P}'$.
Overlap scores of the bottleneck points: The above update with co-contextual information is done for each superpoint in isolation, without considering the local context within each point cloud. We therefore explicitly update the local context after the cross-attention block, using another GNN that has the same architecture and underlying graph (within-point-cloud links) as above, but separate parameters $\theta_3$. This yields the final latent feature encodings $\mathbf{F}^{\mathbf{P}'}$ and $\mathbf{F}^{\mathbf{Q}'}$, which are now conditioned on the features of the respective other point cloud. Those features are linearly projected to overlap scores $o_i^{\mathbf{P}'}$ and $o_j^{\mathbf{Q}'}$, which can be interpreted as probabilities that a given superpoint lies in the overlap region. Additionally, one can compute soft correspondences between superpoints and, from the correspondence weights, predict the cross-overlap score $\tilde{o}_i^{\mathbf{P}'}$ of a superpoint $p_i$, i.e., the probability that its correspondence in $\mathbf{Q}'$ lies in the overlap region:

$$\tilde{o}_i^{\mathbf{P}'} = \sum_j w_{ij}\, o_j^{\mathbf{Q}'}, \qquad w_{ij} = \operatorname{softmax}_j\big(\langle f_i,\, f_j \rangle / t\big), \qquad (6)$$

where $\langle \cdot, \cdot \rangle$ is the inner product and $t$ is a temperature parameter that controls the soft assignment. In the limit $t \to 0$, Eq. (6) converges to hard nearest-neighbour assignment.
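A single-head sketch of the cross-attention messages (the paper uses four heads) and of the soft-assignment weights may help; the scaling by the feature dimension and all names are assumptions of this sketch, not the released implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_src, x_tgt, Wq, Wk, Wv):
    """Single-head version of Eq. (4): every superpoint of the source queries
    all superpoints of the target and receives a weighted average of values."""
    q, k, v = x_src @ Wq, x_tgt @ Wk, x_tgt @ Wv
    a = softmax(q @ k.T / np.sqrt(k.shape[1]))  # (|P'|, |Q'|) attention weights
    return a @ v                                # messages m_i

def soft_correspondence(f_src, f_tgt, temp=0.02):
    """Soft-assignment weights of Eq. (6); as temp -> 0 this approaches a hard
    nearest-neighbour assignment in feature space."""
    return softmax(f_src @ f_tgt.T / temp)
```

The cross-overlap score of Eq. (6) is then simply `soft_correspondence(f_src, f_tgt) @ o_tgt`.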
3.4 Decoder
Our decoder starts from the conditioned features, concatenates them with the overlap scores $o_i$ and cross-overlap scores $\tilde{o}_i$, and outputs per-point feature descriptors $f_i$ as well as refined per-point overlap scores $o_i$ and matchability scores $m_i$. The matchability can be seen as a "conditional saliency" that quantifies how likely a point is to be matched correctly, given the points (resp. features) in the other point cloud.
The decoder architecture combines NN-upsampling with 4 PointNet-style MLP layers [26], and includes skip connections from the corresponding encoder layers. We deliberately keep the overlap score and the matchability separate, to disentangle the reasons why a point is a good or bad candidate for matching: in principle, a point can be unambiguously matchable but lie outside the overlap region, or it can lie in the overlap but have an ambiguous descriptor. Empirically, we find that the network learns to predict high matchability mostly for points in the overlap, probably reflecting the fact that the ground-truth correspondences used for training naturally always lie in the overlap. For further details about the architecture, please refer to Sec. A.3 and the source code.
3.5 Loss function and training
Predator is trained endtoend, using three losses w.r.t. ground truth correspondences as supervision.
Circle loss: To supervise the point-wise feature descriptors we follow [3] and use the circle loss [33], a variant of the more common triplet loss (in [3], the circle loss was added to the repository after publication and is not mentioned in the paper). Consider again a pair of overlapping point clouds $\mathbf{P}$ and $\mathbf{Q}$, this time aligned with the ground-truth transformation. We start by extracting the points $p_i \in \mathbf{P}$ that have at least one (possibly multiple) correspondence in $\mathbf{Q}$, where the set of correspondences $\mathcal{E}_p(p_i)$ is defined as the points of $\mathbf{Q}$ that lie within a radius $r_p$ around $p_i$. Similarly, all points of $\mathbf{Q}$ outside a (larger) radius $r_s$ form the set of negatives $\mathcal{E}_n(p_i)$. The circle loss is then computed from $n_p$ points sampled randomly from $\mathbf{P}$:

$$\mathcal{L}_c^{\mathbf{P}} = \frac{1}{n_p} \sum_{i=1}^{n_p} \log\Big[1 + \sum_{j \in \mathcal{E}_p} e^{\beta_p^j \left(d_i^j - \Delta_p\right)} \cdot \sum_{k \in \mathcal{E}_n} e^{\beta_n^k \left(\Delta_n - d_i^k\right)}\Big], \qquad (7)$$

where $d_i^j = \|f_{p_i} - f_{q_j}\|$ denotes distance in feature space, and $\Delta_n, \Delta_p$ are negative and positive margins, respectively. The weights $\beta_p^j$ and $\beta_n^k$ are determined individually for each positive and negative example, using the empirical margins $\beta_p^j = \gamma(d_i^j - \Delta_p)$ and $\beta_n^k = \gamma(\Delta_n - d_i^k)$ with hyperparameter $\gamma$. The reverse loss $\mathcal{L}_c^{\mathbf{Q}}$ is computed in the same way, for a total circle loss $\mathcal{L}_c = \frac{1}{2}(\mathcal{L}_c^{\mathbf{P}} + \mathcal{L}_c^{\mathbf{Q}})$.
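For one anchor point, the loss can be sketched as below; the margin and $\gamma$ defaults are illustrative, not the trained configuration, and the weights are clamped at zero so already-satisfied examples do not contribute:

```python
import numpy as np

def circle_loss_anchor(d_pos, d_neg, m_pos=0.1, m_neg=1.4, gamma=10.0):
    """Circle loss of a single anchor (cf. Eq. 7). d_pos / d_neg are
    feature-space distances to its positives / negatives; each example
    gets its own weight beta, re-weighting hard examples more strongly."""
    beta_p = gamma * np.maximum(d_pos - m_pos, 0.0)   # per-positive weights
    beta_n = gamma * np.maximum(m_neg - d_neg, 0.0)   # per-negative weights
    pos_term = np.sum(np.exp(beta_p * (d_pos - m_pos)))
    neg_term = np.sum(np.exp(beta_n * (m_neg - d_neg)))
    return float(np.log1p(pos_term * neg_term))
```

The total loss averages this over the sampled anchors in both directions.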
Overlap loss: The estimation of the overlap probability is cast as binary classification and supervised using the overlap loss $\mathcal{L}_o = \frac{1}{2}(\mathcal{L}_o^{\mathbf{P}} + \mathcal{L}_o^{\mathbf{Q}})$, where

$$\mathcal{L}_o^{\mathbf{P}} = \frac{1}{|\mathbf{P}|} \sum_{i=1}^{|\mathbf{P}|} -\Big(\bar{o}_{p_i} \log o_{p_i} + \big(1 - \bar{o}_{p_i}\big) \log\big(1 - o_{p_i}\big)\Big). \qquad (8)$$

The ground-truth label $\bar{o}_{p_i}$ of point $p_i$ is defined as

$$\bar{o}_{p_i} = \begin{cases} 1, & \big\|\mathbf{T}_{\mathrm{gt}}(p_i) - \mathrm{NN}\big(\mathbf{T}_{\mathrm{gt}}(p_i), \mathbf{Q}\big)\big\| < r_o \\ 0, & \text{otherwise}, \end{cases} \qquad (9)$$

with overlap threshold $r_o$. The reverse loss $\mathcal{L}_o^{\mathbf{Q}}$ is computed in the same way. The contributions from positive and negative examples are balanced with weights inversely proportional to their relative frequencies.
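Weighting by inverse class frequency amounts to averaging the loss per class; a minimal sketch of this class-balanced binary cross-entropy (names and clipping constant are illustrative):

```python
import numpy as np

def balanced_bce(pred, label, eps=1e-7):
    """Binary cross-entropy with positives and negatives re-weighted inversely
    to their relative frequencies, i.e. the mean of the two per-class means."""
    pred = np.clip(pred, eps, 1.0 - eps)       # avoid log(0)
    pos, neg = label == 1, label == 0
    loss_pos = -np.log(pred[pos]).mean() if pos.any() else 0.0
    loss_neg = -np.log(1.0 - pred[neg]).mean() if neg.any() else 0.0
    return 0.5 * (loss_pos + loss_neg)
```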
Matchability loss: Supervising the matchability scores is more difficult, as it is not clear in advance which points should be taken into account during correspondence search. We follow a simple intuition: good keypoints are those that can be matched successfully at a given point during training, with the current feature descriptors. Hence, we cast the prediction as binary classification and generate the ground-truth labels on the fly. Again, we sum the two symmetric losses, $\mathcal{L}_m = \frac{1}{2}(\mathcal{L}_m^{\mathbf{P}} + \mathcal{L}_m^{\mathbf{Q}})$, with

$$\mathcal{L}_m^{\mathbf{P}} = \frac{1}{|\mathbf{P}|} \sum_{i=1}^{|\mathbf{P}|} -\Big(\bar{m}_{p_i} \log m_{p_i} + \big(1 - \bar{m}_{p_i}\big) \log\big(1 - m_{p_i}\big)\Big), \qquad (10)$$

where the ground-truth labels $\bar{m}_{p_i}$ are computed on the fly via nearest-neighbour search in feature space:

$$\bar{m}_{p_i} = \begin{cases} 1, & \big\|\mathbf{T}_{\mathrm{gt}}(p_i) - \mathrm{NN}_{\mathcal{F}}(p_i, \mathbf{Q})\big\| < r_m \\ 0, & \text{otherwise}, \end{cases} \qquad (11)$$

with $\mathrm{NN}_{\mathcal{F}}(p_i, \mathbf{Q})$ the point of $\mathbf{Q}$ whose descriptor is the nearest neighbour of $f_{p_i}$ in feature space, and $r_m$ a distance threshold.
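The on-the-fly label generation can be sketched as follows; the function name, the brute-force search, and the radius default are assumptions of this sketch:

```python
import numpy as np

def matchability_labels(f_src, f_tgt, src_aligned, tgt, radius=0.05):
    """On-the-fly ground truth in the spirit of Eq. (11): a point is labelled
    matchable iff its nearest neighbour in *feature* space is a true
    correspondence in *Euclidean* space (closer than `radius` after GT alignment)."""
    # nearest neighbour of every source descriptor among the target descriptors
    nn = np.linalg.norm(f_src[:, None, :] - f_tgt[None, :, :], axis=-1).argmin(axis=1)
    d = np.linalg.norm(src_aligned - tgt[nn], axis=-1)
    return (d < radius).astype(np.float64)
```

As the descriptors improve during training, more points become matchable and the labels change accordingly.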
Implementation and training: Predator is implemented in PyTorch. For the 3DMatch dataset, we train for 30 epochs using SGD with momentum and weight decay; the learning rate is exponentially decayed by 0.05 after each epoch. Due to memory constraints we use a small batch size in all experiments and sample a bounded number of positive pairs per scan pair for the circle loss. Data augmentation includes random rotations around all three axes and Gaussian noise added independently to each coordinate. At the start of training we supervise Predator only with the circle and overlap losses; the matchability loss is added only after a few epochs, when the point-wise features are already meaningful (i.e., 30% of the points can be matched correctly). The three loss terms are weighted equally. The hyperparameters are set relative to the voxel size of the initial grid subsampling (respectively, the average point-to-point distance); the overlap and matchability radii are set in accordance with the distance threshold for a valid correspondence in the subsequent RANSAC pose estimation. The training settings for the ModelNet dataset are given in Sec. A.4.
4 Experiments
We evaluate Predator and justify our design choices on real scan data, using 3DMatch and 3DLoMatch. Additionally, we compare Predator to direct registration methods on the synthetic, object-centric ModelNet40.
4.1 Datasets and preprocessing
3DMatch/3DLoMatch: The official 3DMatch dataset [48] considers only scan pairs with >30% overlap. Here, we add its counterpart that considers only scan pairs with overlaps between 10% and 30%, and call this collection 3DLoMatch. (Due to a bug in the official implementation of the overlap computation for 3DMatch, a few (<7%) scan pairs are included in both datasets.) For both datasets we stick to the accepted split into 54 training and 8 test scenes.
ModelNet40: [44] contains 12,311 CAD models of man-made objects from 40 different categories. We follow [47] and use 5,112 samples for training and 1,202 samples for validation, from the first 20 categories. We then test on 1,266 samples from the other 20 categories. Like [47], we randomly sample planes that cut away 30% of the points, to obtain point clouds with 70% completeness and, on average, 73.5% pairwise overlap. For our purposes we additionally run a version where we cut away 50% of the points, to obtain a second test set with 53.6% average overlap, which we call ModelLoNet (lower overlap is not meaningful, due to the low number of points per model).
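The plane-based partiality simulation can be sketched as follows; the quantile-based implementation and all names are assumptions of this sketch, not the protocol of [47]:

```python
import numpy as np

def plane_crop(points, keep=0.7, rng=None):
    """Cut a point cloud with a random plane, keeping the `keep` fraction of
    points on one side, to simulate a partial scan."""
    rng = np.random.default_rng(rng)
    normal = rng.normal(size=3)
    normal /= np.linalg.norm(normal)          # random plane orientation
    proj = points @ normal                    # signed distance along the normal
    return points[proj >= np.quantile(proj, 1.0 - keep)]
```

Applying independent crops to two copies of the same model yields a pair with partial overlap.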
4.2 Evaluation metrics
We use the standard metrics of 3DMatch to assess the performance of Predator and to compare it to three state-of-the-art methods: 3DSN [15], FCGF [7], and D3Feat [3]. Our main metric, corresponding to the actual aim of point cloud registration, is Registration Recall (RR), i.e., the fraction of scan pairs for which the correct transformation parameters are found with RANSAC. Following the literature, we also report Feature Match Recall (FMR), defined as the fraction of pairs that have >5% "inlier" matches with <10 cm residual under the ground-truth transformation (without checking whether the transformation can actually be recovered from those matches), and Inlier Ratio (IR), the fraction of correct correspondences among the putative matches.
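The two match-level metrics can be sketched directly (RR additionally requires RANSAC and is omitted); matches are assumed to be given as index-aligned arrays, an illustrative convention:

```python
import numpy as np

def inlier_ratio(src, tgt, R, t, tau=0.1):
    """IR: fraction of putative matches (src[i] <-> tgt[i]) whose residual
    under the ground-truth transform (R, t) is below tau (10 cm on 3DMatch)."""
    residual = np.linalg.norm(src @ R.T + t - tgt, axis=1)
    return float(np.mean(residual < tau))

def feature_match_recall(ratios, min_ratio=0.05):
    """FMR: fraction of scan pairs with more than 5% inlier matches."""
    return float(np.mean(np.asarray(ratios) > min_ratio))
```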
For ModelNet40 we follow [47] and measure performance using the Relative Rotation Error (RRE), the geodesic distance between estimated and ground-truth rotation matrices; the Relative Translation Error (RTE), the Euclidean distance between the estimated and ground-truth translations; and the Chamfer distance (CD) between the two scans after applying the estimated transformation. For more details please see Sec. A.1.
4.3 3DMatch
Relative overlap ratio: We first evaluate whether Predator achieves its goal of focusing on the overlap. We discard points with a low predicted overlap score, compute the overlap ratio, and compare it to that of the original scans. Fig. 4 shows that more than half of the low-overlap pairs are pushed over the 30% threshold that prior works considered the lower limit for registration. On average, discarding points with low overlap scores almost doubles the overlap in 3DLoMatch. Notably, it also increases the overlap in standard 3DMatch by, on average, >35%.
Interest point sampling:
3DMatch  3DLoMatch  

# Samples  5000  2500  1000  500  250  5000  2500  1000  500  250
Inlier ratio (%)  
rand  43.6  41.5  36.7  31.9  25.9  15.7  14.7  12.8  10.9  8.7 
top-k (o·m)  61.0  67.3  71.8  73.1  73.1  26.0  31.4  36.0  37.4  38.0
filt. (o) + prob. (o·m)  55.2  55.2  53.7  51.3  46.4  24.7  25.0  24.8  24.0  22.7
prob. (o·m)  49.9  50.3  49.2  46.3  41.8  20.0  20.8  21.0  20.2  19.0
Registration Recall (%)  
rand  83.9  82.9  81.5  79.9  69.9  39.3  38.8  36.9  30.3  23.2 
top-k (o·m)  81.6  84.3  80.2  72.6  60.3  54.6  52.4  45.7  38.1  28.9
filt. (o) + prob. (o·m)  85.7  84.4  86.6  86.3  83.3  51.3  53.3  54.6  54.4  52.0
prob. (o·m)  88.3  88.3  89.0  88.4  84.7  54.2  55.8  56.7  56.1  50.7
Predator significantly increases the effective overlap, but does that improve downstream registration performance? To test this, we use the overlap scores $o$ and matchability scores $m$ to bias interest point sampling. We compare three variants: top-k (o·m), where we multiply the two scores and pick the top $k$ points according to the combined score; prob. (o·m), where we instead sample points with probability proportional to the combined score; and filt. (o) + prob. (o·m), where we discard points with a low overlap score and then sample from the remaining ones with probability proportional to the combined score.
Tab. 1 shows that any of the informed sampling strategies greatly increases the inlier ratio, and as a consequence also the registration recall. The gains are larger when fewer points are sampled. In the low-overlap regime the inlier ratios more than triple for up to 1000 points. We observe that, as expected, a high inlier ratio does not necessarily imply a high registration recall: our scores are apparently well calibrated, so that top-k (o·m) indeed finds the most inliers, but these are often clustered and too close to each other to reliably estimate the transformation parameters. We thus use the more robust prob. (o·m) sampling, which yields the best registration recall. It may be possible to achieve even higher registration recall by combining top-k (o·m) sampling with non-maxima suppression; we leave this for future work.
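The two main sampling variants can be sketched as follows; names and the without-replacement convention are illustrative choices of this sketch:

```python
import numpy as np

def sample_interest_points(o, m, n, mode="prob", rng=None):
    """Bias interest-point sampling with overlap (o) and matchability (m)
    scores: 'topk' picks the n highest o*m, 'prob' draws n points without
    replacement with probability proportional to o*m."""
    w = o * m
    if mode == "topk":
        return np.argsort(-w)[:n]
    p = w / w.sum()
    rng = np.random.default_rng(rng)
    return rng.choice(len(w), size=min(n, np.count_nonzero(w)), replace=False, p=p)
```

The filt. (o) + prob. variant additionally zeroes out (or removes) points with low overlap score before the probabilistic draw.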
3DMatch  3DLoMatch  

# Samples  5000  2500  1000  500  250  5000  2500  1000  500  250 
Feature Match Recall (%)  
3DSN [15]  95.0  94.3  92.9  90.1  82.9  63.6  61.7  53.6  45.2  34.2 
FCGF [7]  97.4  97.3  97.0  96.7  96.6  76.6  75.4  74.2  71.7  67.3 
D3Feat [3]  95.6  95.4  94.5  94.1  93.1  67.3  66.7  67.0  66.7  66.5 
Ours  96.6  96.2  96.2  95.8  95.5  71.7  73.8  73.8  72.9  72.3 
Inlier ratio (%)  
3DSN [15]  36.0  32.5  26.4  21.5  16.4  11.4  10.1  8.0  6.4  4.8 
FCGF [7]  56.8  54.1  48.7  42.5  34.1  21.4  20.0  17.2  14.8  11.6 
D3Feat [3]  39.0  38.8  40.4  41.5  41.8  13.2  13.1  14.0  14.6  15.0 
Ours  49.9  50.3  49.2  46.3  41.8  20.0  20.8  21.0  20.2  19.0 
Registration Recall (%)  
3DSN [15]  78.4  76.2  71.4  67.6  50.8  33.0  29.0  23.3  17.0  11.0 
FCGF [7]  85.1  84.7  83.3  81.6  71.4  40.1  41.7  38.2  35.4  26.8 
D3Feat [3]  81.6  84.5  83.4  82.4  77.9  37.2  42.7  46.9  43.8  39.1 
Ours  88.3  88.3  89.0  88.4  84.7  54.2  55.8  56.7  56.1  50.7 
Comparison to feature-based methods: We compare Predator to recent feature-based registration methods (Tab. 2). For a more comprehensive assessment we follow [3] and report performance with different numbers of sampled interest points. Qualitative results are shown in Fig. 6. Predator greatly outperforms existing methods on the low-overlap 3DLoMatch dataset, improving registration recall by 10–20 percentage points (pp) over the closest competitor, variously FCGF or D3Feat. Moreover, it also consistently reaches the highest registration recall on standard 3DMatch, showing that its attention to the overlap pays off even for scans with moderately large overlap. In line with our motivation, what matters is not so much the choice of descriptors, but finding interest points that lie in the overlap region, especially if that region is small.
The results also support our claim that one should evaluate the complete registration pipeline: FCGF slightly beats Predator in terms of FMR, except in the low-overlap, small-sample regime. But Predator mostly compensates for that deficit when looking at the inlier ratio, i.e., a higher number of potentially matchable point pairs does not always translate into more usable matches (see Tab. 1, where top-k (o·m) sampling has an even higher inlier ratio than FCGF, yet lower registration performance). Even in cases where the inlier ratio remains a bit below that of FCGF, our method achieves higher registration recall.
Comparison to direct registration methods: We tried to compare Predator also to recent methods for direct registration of partial point clouds. Unfortunately, for both PRNet [41] and RPMNet [47], training on 3DMatch failed to converge to reasonable results, as already observed in [5]. It appears that their feature extraction is specifically tuned to synthetic, objectcentric point clouds. Thus, in a further attempt we replaced the feature extractor of RPMNet with FCGF. This brought the registration recall on 3DMatch to 54.9%, still far from the 85.1% that FCGF features achieve with RANSAC. We conclude that direct pairwise registration is at this point only suitable for geometrically simple objects in controlled settings like ModelNet40.
Ablation study:
3DMatch  3DLoMatch  

conditioned  cross-overlap score  FMR  IR  RR  FMR  IR  RR
96.2  47.2  86.9  71.8  18.0  50.9  
✓  95.5  46.4  87.1  73.0  17.6  54.4  
✓  96.1  47.8  87.3  69.5  15.8  48.4  
✓  ✓  96.6  49.9  88.3  71.7  20.0  54.2 
We ablate our point scoring functions in Tab. 3. By conditioning the decoder input on the respective other point cloud, registration recall increases only by 0.2 pp for 3DMatch, but by 3.5 pp for 3DLoMatch. Adding also the cross-overlap score brings a bigger gain of 1.2 pp for 3DMatch, but no further gain for 3DLoMatch. For more ablation studies, please see Sec. A.6.
4.4 ModelNet40
Relative overlap ratio:
We check whether Predator focuses on the overlap region. We extract 8,862 test pairs by varying the completeness of the input point clouds from 70% to 40%. As above, we then discard points with a low predicted overlap score, compute the overlap ratio, and compare it to that of the original scans. Fig. 7 shows that Predator greatly increases the relative overlap and reduces the number of pairs with overlap <70% by more than 40 pp.
ModelNet  ModelLoNet  

Methods  RRE  RTE  CD  RRE  RTE  CD
DCPv2 [40]  11.975  0.171  0.0117  16.501  0.300  0.0268 
RPMNet [47]  1.712  0.018  0.00085  7.342  0.124  0.0050 
Ours (rand)  2.923  0.034  0.00122  11.585  0.181  0.0104 
Ours (prob. (o·m))  1.856  0.019  0.00088  5.462  0.133  0.0079
Comparison to direct registration methods: To be able to compare Predator to RPMNet [47] and DCP [40], we resort to the synthetic, object-centric dataset they were designed for. We failed to train PRNet [41] due to random crashes of the original code (also observed in [5]).
Remarkably, Predator can compete with methods specifically tuned for ModelNet, and in the low-overlap regime outperforms them in terms of RRE, see Tab. 4. Moreover, we observe a large boost from sampling points with overlap attention (prob. (o·m)) rather than randomly (rand). Fig. 7 (right) further underlines the importance of sampling in the overlap: Predator is a lot more robust in the low-overlap regime (8× lower RRE at completeness 0.4).
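The error metrics used in Tab. 4 can be sketched as follows; whether the Chamfer distance is reported as a sum or mean of the two directed terms is an assumption of this sketch:

```python
import numpy as np

def rre_deg(R_est, R_gt):
    """Relative Rotation Error: geodesic distance between rotations, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def rte(t_est, t_gt):
    """Relative Translation Error: Euclidean distance between translations."""
    return float(np.linalg.norm(t_est - t_gt))

def chamfer(P, Q):
    """Symmetric Chamfer distance between two point sets."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```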
5 Conclusion
We have introduced Predator, a deep model designed for pairwise registration of low-overlap point clouds. The core of the model is an overlap attention module that enables early information exchange between the point clouds' latent encodings, in order to infer which of their points are likely to lie in their overlap region.
There are a number of directions in which Predator could be extended. At present, it is tightly coupled to fully convolutional point cloud encoders, and relies on having a reasonable number of superpoints in the bottleneck. Moreover, it builds on the prevalent definition of the overlap region, which counts the fraction of points with a feasible correspondence; this might be a limitation in scenarios where the point density is very uneven. Finally, in future work it would be interesting to explore how our overlap attention module can be integrated into direct point cloud registration methods, and into other neural architectures that have to handle two or more datasets with low overlap.
References
 [1] (2019) PointNetLK: robust & efficient point cloud registration using PointNet. In CVPR, Cited by: §2.
 [2] (1987) Leastsquares fitting of two 3d point sets. IEEE TPAMI 9 (5), pp. 698–700. External Links: Document Cited by: §2.
 [3] (2020) D3Feat: joint learning of dense detection and description of 3d local features. In CVPR, Cited by: §A.7, Table 6, Table 9, §1, §1, §2, §2, §3.5, Figure 6, §4.2, §4.3, Table 2.
 [4] (2015) Robust reconstruction of indoor scenes. In CVPR, Cited by: §A.1, §1.
 [5] (2020) Deep global registration. In CVPR, Cited by: §2, §4.3, §4.4.
 [6] (2019) 4D spatiotemporal convnets: minkowski convolutional neural networks. In CVPR, Cited by: §2, §3.
 [7] (2019) Fully convolutional geometric features. In ICCV, Cited by: §A.7, Table 6, Table 9, §1, §1, Figure 2, §2, Figure 6, §4.2, Table 2.
 [8] (1996) A volumetric method for building complex models from range images. In ACM SIGGRAPH, Cited by: §A.2.

 [9] (2018) PPF-FoldNet: unsupervised learning of rotation invariant 3d local descriptors. In ECCV, Cited by: §2.
 [10] (2018) PPFNet: global context aware local features for robust 3d point matching. In CVPR, Cited by: §A.1, §2.
 [11] (2018) Superpoint: selfsupervised interest point detection and description. In CVPR Workshops, Cited by: §2, §2.
 [12] (2019) D2Net: a trainable CNN for joint detection and description of local features. In CVPR, Cited by: §2, §2.
 [13] (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §3.3.
 [14] (2020) Learning multiview 3d point cloud registration. In CVPR, Cited by: §2.
 [15] (2019) The perfect match: 3d point cloud matching with smoothed densities. In CVPR, Cited by: Table 6, §1, §2, §2, §4.2, Table 2.
 [16] (2018) Learned compact local feature descriptor for TLS-based geodetic monitoring of natural outdoor scenes. In ISPRS Annals, Cited by: §2.
 [17] (2014) Performance evaluation of 3D local feature descriptors. In ACCV, Cited by: §2.
 [18] (2016) Structured global registration of RGBD scans in indoor environments. arXiv preprint arXiv:1607.08539. Cited by: §A.2.
 [19] (1999) Using spin images for efficient object recognition in cluttered 3d scenes. IEEE TPAMI 21, pp. 433–449. Cited by: §2.
 [20] (2017) Learning compact geometric features. In ICCV, Cited by: §1, §2, §3.1.
 [21] (2014) Unsupervised feature learning for 3d scene labeling. In ICRA, Cited by: §A.2.
 [22] (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §2.
 [23] (1981) An iterative image registration technique with an application to stereo vision. In IJCAI, Cited by: §2.
 [24] (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML, Cited by: §3.3.
 [25] (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §3.3.
 [26] (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In CVPR, Cited by: §2, §2, §3.4.
 [27] (2019) R2D2: repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195. Cited by: §2, §2.
 [28] (2009) Fast point feature histograms (FPFH) for 3D registration. In ICRA, Cited by: §2.
 [29] (2008) Aligning point cloud views using persistent feature histograms. In IROS, Cited by: §2.
 [30] (2020) Superglue: learning feature matching with graph neural networks. In CVPR, Cited by: §2, §3.3.
 [31] (2013) Scene coordinate regression forests for camera relocalization in RGBD images. In CVPR, Cited by: §A.2.
 [32] (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics 35 (2), pp. 876–879. Cited by: §2.
 [33] (2020) Circle loss: a unified perspective of pair similarity optimization. In CVPR, Cited by: §3.5.
 [34] (2019) KPconv: flexible and deformable convolution for point clouds. In CVPR, Cited by: §3.2, §3.
 [35] (2010) Unique shape context for 3D data description. In ACM Workshop on 3D Object Retrieval, Cited by: §2.
 [36] (2010) Unique signatures of histograms for local surface description. In ECCV, Cited by: §2.
 [37] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §3.3, §3.3.
 [38] (2016) Learning to navigate the energy landscape. In 3DV, Cited by: §A.2.
 [39] (2017) Attention is all you need. In NeurIPS, Cited by: §3.3.
 [40] (2019) Deep closest point: learning representations for point cloud registration. In ICCV, Cited by: §2, §2, §4.4, Table 4.
 [41] (2019) PRNet: self-supervised learning for partial-to-partial registration. In NeurIPS, Cited by: §2, §2, §4.3, §4.4.
 [42] (2019) Dynamic graph CNN for learning on point clouds. ACM TOG 38 (5). Cited by: §2, §3.3.
 [43] (2020) D2D: learning to find good correspondences for image matching and manipulation. arXiv preprint arXiv:2007.08480. Cited by: §2, §2.
 [44] (2015) 3d shapenets: a deep representation for volumetric shapes. In CVPR, Cited by: §4.1.
 [45] (2013) Sun3d: a database of big spaces reconstructed using sfm and object labels. In ICCV, Cited by: §A.2.
 [46] (2018) 3DFeat-Net: weakly supervised local 3d features for point cloud registration. In ECCV, pp. 630–646. Cited by: §2.
 [47] (2020) RPMNet: robust point matching using learned features. In CVPR, Cited by: §A.1, §A.2, §2, §4.1, §4.2, §4.3, §4.4, Table 4.
 [48] (2017) 3DMatch: learning local geometric descriptors from RGBD reconstructions. In CVPR, Cited by: §1, §2, §3.1, §4.1, footnote 6.
A Supplementary material
In this supplementary material, we first provide rigorous definitions of the evaluation metrics (Sec. A.1), then describe the data preprocessing (Sec. A.2), network architectures (Sec. A.3), and training on ModelNet40 (Sec. A.4) in more detail. We further provide additional results (Sec. A.5), ablation studies (Sec. A.6), as well as a runtime analysis (Sec. A.7). Finally, we show more visualisations on the 3DLoMatch and ModelLoNet benchmarks (Sec. A.8).
A.1 Evaluation metrics
The evaluation metrics, which we use to assess model performance in Sec. 4 of the main paper and Sec. A.5 of this supplementary material, are formally defined as follows:
Inlier ratio looks at the set of putative correspondences $\mathcal{K}$ found by reciprocal matching^{6} in feature space, and measures what fraction of them is "correct", in the sense that the matched points lie within a distance threshold $\tau_1$ after registering the two scans with the ground truth transformation $\bar{T}_{P \to Q}$:

(12)  $\mathrm{IR} = \frac{1}{|\mathcal{K}|} \sum_{(p,q) \in \mathcal{K}} \big[\, \lVert \bar{T}_{P \to Q}(p) - q \rVert_2 < \tau_1 \,\big]$

with $[\cdot]$ the Iverson bracket.
^{6}We follow 3DMatch [48] and apply reciprocal matching as a pre-filtering step.
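As a concrete sketch, the inlier ratio can be computed as follows with NumPy (the array layout, function name, and the 0.1 m default for the threshold are our own illustration, not the official evaluation code):

```python
import numpy as np

def inlier_ratio(src, tgt, corr, T_gt, tau1=0.1):
    """Fraction of putative correspondences whose points lie within tau1
    (metres) after warping the source cloud with the ground-truth 4x4
    transformation T_gt. corr is an (N, 2) array of (src, tgt) indices."""
    src_h = np.hstack([src[corr[:, 0]], np.ones((len(corr), 1))])
    warped = (T_gt @ src_h.T).T[:, :3]   # apply T_gt to matched source points
    dists = np.linalg.norm(warped - tgt[corr[:, 1]], axis=1)
    return float(np.mean(dists < tau1))  # mean of the Iverson brackets
```

With a perfect correspondence set and the true transformation, the ratio is 1; every mismatched pair lowers it by 1/N.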
Feature Match Recall (FMR) [10] measures the fraction of point cloud pairs for which, based on the number of inlier correspondences, it is likely that accurate transformation parameters can be recovered with a robust estimator such as RANSAC. Note that FMR only checks whether the inlier ratio is above a threshold $\tau_2$. It does not test whether the transformation can actually be determined from those correspondences, which in practice is not always the case, since their geometric configuration may be (nearly) degenerate, e.g., they might lie very close together or along a straight edge. A single pair of point clouds counts as suitable for registration if

(13)  $\mathrm{IR} > \tau_2 .$
Registration recall [4] is the most reliable metric, as it measures end-to-end performance on the actual task of point cloud registration. Specifically, it looks at the set of ground truth correspondences $\mathcal{H}^{\ast}$ after applying the estimated transformation $T_{P \to Q}$, computes their root mean square error,

(14)  $\mathrm{RMSE} = \sqrt{ \tfrac{1}{|\mathcal{H}^{\ast}|} \sum_{(p^{\ast}, q^{\ast}) \in \mathcal{H}^{\ast}} \lVert T_{P \to Q}(p^{\ast}) - q^{\ast} \rVert_2^2 }\,,$

and checks for what fraction of all point cloud pairs the error stays below a threshold, $\mathrm{RMSE} < 0.2\,\mathrm{m}$. In keeping with the original evaluation script of 3DMatch, immediately adjacent point clouds are excluded, since they have very high overlap by construction.
Chamfer distance measures the quality of registration on synthetic data. We follow [47] and use the modified Chamfer distance metric:

(15)  $\tilde{C}_D = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y^{\mathrm{raw}}} \lVert x - y \rVert_2^2 \;+\; \frac{1}{|Y|} \sum_{y \in Y} \min_{x \in X^{\mathrm{raw}}} \lVert y - x \rVert_2^2 \,,$

where $X^{\mathrm{raw}}$ and $Y^{\mathrm{raw}}$ are the raw source and target point clouds, and $X$ and $Y$ are the input source and target point clouds.
Relative translation and rotation errors (RTE/RRE) measure the deviations from the ground truth pose:

(16)  $\mathrm{RTE} = \lVert \hat{t} - \bar{t} \rVert_2 \,, \qquad \mathrm{RRE} = \arccos\!\left( \frac{\mathrm{trace}\!\left( \hat{R}^{\top} \bar{R} \right) - 1}{2} \right),$

where $\hat{R}$ and $\hat{t}$ denote the estimated rotation matrix and translation vector, and $\bar{R}$ and $\bar{t}$ the corresponding ground truth values.
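These two errors can be computed directly from 4×4 pose matrices; the following is a sketch (names are ours), with the cosine clipped to guard against numerical round-off:

```python
import numpy as np

def rte_rre(T_est, T_gt):
    """Relative translation error (same units as the input) and relative
    rotation error (degrees) between two 4x4 rigid transformations."""
    R_est, t_est = T_est[:3, :3], T_est[:3, 3]
    R_gt, t_gt = T_gt[:3, :3], T_gt[:3, 3]
    rte = np.linalg.norm(t_est - t_gt)
    cos = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    rre = np.degrees(np.arccos(cos))   # geodesic distance on SO(3)
    return rte, rre
```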
A.2 Dataset preprocessing
3DMatch: This is a collection of 62 scenes, combining earlier data from Analysis-by-Synthesis [38], 7-Scenes [31], SUN3D [45], RGB-D Scenes v.2 [21], and Halber et al. [18]. The official split assigns 54 scenes to training and 8 to testing. Individual scenes are not only captured in different indoor spaces (e.g., bedrooms, offices, living rooms, restrooms) but also with different depth sensors (e.g., Microsoft Kinect, Structure Sensor, Asus Xtion Pro Live, and Intel RealSense). 3DMatch thus provides great diversity and allows our model to generalise across different indoor spaces. Individual scenes are split into point cloud fragments, which are generated by fusing 50 consecutive depth frames using TSDF volumetric fusion [8]. As a preprocessing step, we apply voxel-grid downsampling to all point clouds; if multiple points fall into the same voxel, we randomly pick one.
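The voxel-grid downsampling with random tie-breaking can be sketched as follows (a NumPy illustration with our own parameter names, not the actual preprocessing code):

```python
import numpy as np

def voxel_downsample(points, voxel_size, rng=None):
    """Keep one randomly chosen point per occupied voxel."""
    if rng is None:
        rng = np.random.default_rng()
    # integer voxel coordinates serve as grouping keys
    keys = np.floor(points / voxel_size).astype(np.int64)
    order = rng.permutation(len(points))   # shuffle so the kept point is random
    _, first = np.unique(keys[order], axis=0, return_index=True)
    return points[order[first]]
```

Shuffling before the `np.unique` call makes the "first point per voxel" a uniformly random choice within each voxel.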
ModelNet40: For each CAD model, 2048 points are first generated by uniform sampling and scaled to fit into a unit sphere. We then follow [47] to produce partial scans: for the source point cloud, we uniformly sample a plane through the origin that splits the unit sphere into two half-spaces, shift that plane along its normal until the desired number of points lies on one side, and discard the points on the other side; the target point cloud is generated in the same manner. The two resulting partial point clouds are then randomly rotated, translated, and jittered with Gaussian noise. For the rotation, we sample a random axis and a random angle < 45°; the translation is sampled uniformly within a fixed range, and Gaussian noise is applied independently per coordinate. Finally, 717 points are randomly sampled from each partial point cloud.
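The half-space cropping step can be sketched as follows (the `keep_ratio` parameter is our assumption for illustration; the procedure above shifts the plane until a fixed point count remains, which is equivalent to keeping the top-k points along the plane normal):

```python
import numpy as np

def halfspace_crop(points, keep_ratio=0.7, rng=None):
    """Crop a point cloud with a random plane through the origin, keeping
    the keep_ratio fraction of points on one side of the plane."""
    if rng is None:
        rng = np.random.default_rng()
    normal = rng.normal(size=3)
    normal /= np.linalg.norm(normal)       # random unit normal of the plane
    proj = points @ normal                 # signed distance to the plane
    n_keep = int(keep_ratio * len(points))
    keep = np.argsort(-proj)[:n_keep]      # shifting the plane = taking top-k
    return points[keep]
```

Applying the same operation with an independently drawn plane to the second copy of the model yields two partial scans with only partial overlap.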
A.3 Network architecture
The detailed network architecture of Predator is depicted in Fig. 9. Our model is built on the KPConv implementation from the D3Feat repository.^{7}^{7}7https://github.com/XuyangBai/D3Feat.pytorch We complement each KPConv layer with instance normalisation and Leaky ReLU activations. Each strided convolution is applied to a point cloud downsampled with a correspondingly larger voxel size. Upsampling in the decoder is performed by querying the associated feature of the closest point from the previous layer.
With 20k points after voxel-grid downsampling, the point clouds in 3DMatch are much denser than those of ModelNet40, which contain only 717 points. Moreover, they also have a larger spatial extent, with bounding boxes up to several metres across, while ModelNet40 point clouds are normalised to fit into a unit sphere. To account for these large differences, we slightly adapt the encoder and decoder per dataset, but keep the same overlap attention model. Differences in network hyperparameters are shown in Tab. 5.

  # strided convolutions  convolution radius  first conv. feature dim.  final feature dim.
3DMatch  3  2.5  64  32
ModelNet  2  2.75  256  96
A.4 Implementation and training
This section complements Sec. 3.5 of the main paper, where implementation and training details are described only for 3DMatch. Here, we provide those details for ModelNet40.
We train Predator on ModelNet40 for 200 epochs, using SGD with initial learning rate 0.01, momentum 0.98, and a small weight decay. The learning rate is exponentially decayed by a factor of 0.95 after each epoch. We use batch size 1, but accumulate gradients over 4 steps. As for 3DMatch, the matchability loss is added once >30% of the points in the overlap region can be matched correctly. Due to the sparsity of ModelNet, the input point clouds are not voxel-grid downsampled before the first convolution layer. In the strided convolutions, the voxel size is set to 0.06. For the circle loss, the positive radius is set to 0.018 and the safe radius to 0.06. For the overlap and matchability losses, the corresponding radii are both set to 0.04. RANSAC is run for 50,000 iterations with a fixed distance threshold.
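The gradient-accumulation scheme (batch size 1, gradients accumulated over 4 steps) can be illustrated with a toy scalar parameter; this is a sketch of the idea, not the training code, and the averaging of the accumulated gradient is our own convention:

```python
def train_with_accumulation(grads, lr=0.01, accum_steps=4):
    """Apply a parameter update only every accum_steps mini-batches, using
    the averaged accumulated gradient. This mimics an effective batch size
    of accum_steps at the memory cost of batch size 1."""
    w, buf, updates = 0.0, 0.0, []
    for i, g in enumerate(grads, start=1):
        buf += g                           # accumulate per-batch gradients
        if i % accum_steps == 0:
            w -= lr * buf / accum_steps    # one step with the averaged gradient
            buf = 0.0
            updates.append(w)
    return w, updates
```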
A.5 Additional results
Detailed registration results: We report detailed per-scene Registration Recall (RR), Relative Rotation Error (RRE) and Relative Translation Error (RTE) in Tab. 6. RRE and RTE are averaged only over successfully registered pairs for each scene, so that the numbers are not dominated by gross errors from complete registration failures. We obtain the highest RR and the lowest or second-lowest RTE and RRE for almost all scenes; this further shows that our overlap attention module, together with probabilistic sampling, supports not only robust but also accurate registration.
3DMatch  3DLoMatch  
Kitchen  Home 1  Home 2  Hotel 1  Hotel 2  Hotel 3  Study  MIT Lab  Avg.  STD  Kitchen  Home 1  Home 2  Hotel 1  Hotel 2  Hotel 3  Study  MIT Lab  Avg.  STD  
# Sample  
449  106  159  182  78  26  234  45  160  128  524  283  222  210  138  42  237  70  191  154  
Registration Recall (%)  
3DSN [15]  90.6  90.6  65.4  89.6  82.1  80.8  68.4  60.0  78.4  11.5  51.4  25.9  44.1  41.1  30.7  36.6  14.0  20.3  33.0  11.8 
FCGF [7]  98.0  94.3  68.6  96.7  91.0  84.6  76.1  71.1  85.1  11.0  60.8  42.2  53.6  53.1  38.0  26.8  16.1  30.4  40.1  14.3 
D3Feat [3]  96.0  86.8  67.3  90.7  88.5  80.8  78.2  64.4  81.6  10.5  49.7  37.2  47.3  47.8  36.5  31.7  15.7  31.9  37.2  10.6 
Ours  97.1  96.2  73.6  96.7  94.9  84.6  85.9  77.8  88.3  8.7  66.3  58.9  55.0  71.8  57.7  46.3  39.8  37.7  54.2  11.4 
Relative Rotation Error (°)  
3DSN [15]  1.926  1.843  2.324  2.041  1.952  2.908  2.296  2.301  2.199  0.321  3.020  3.898  3.427  3.196  3.217  3.328  4.325  3.814  3.528  0.414 
FCGF [7]  1.767  1.849  2.210  1.867  1.667  2.417  2.024  1.792  1.949  0.236  2.904  3.229  3.277  2.768  2.801  2.822  3.372  4.006  3.147  0.394 
D3Feat [3]  2.016  2.029  2.425  1.990  1.967  2.400  2.346  2.115  2.161  0.183  3.226  3.492  3.373  3.330  3.165  2.972  3.708  3.619  3.361  0.227 
Ours  1.859  1.808  2.373  1.816  1.825  2.315  2.047  1.926  1.996  0.214  3.225  3.017  3.183  3.013  3.165  3.421  3.446  2.873  3.168  0.186 
Relative Translation Error (m)  
3DSN [15]  0.059  0.070  0.079  0.065  0.074  0.062  0.093  0.065  0.071  0.010  0.082  0.098  0.096  0.101  0.080  0.089  0.158  0.120  0.103  0.024 
FCGF [7]  0.053  0.056  0.071  0.062  0.061  0.055  0.082  0.090  0.066  0.013  0.084  0.097  0.076  0.101  0.084  0.077  0.144  0.140  0.100  0.025 
D3Feat [3]  0.055  0.065  0.080  0.064  0.078  0.049  0.083  0.064  0.067  0.011  0.088  0.101  0.086  0.099  0.092  0.075  0.146  0.135  0.103  0.023 
Ours  0.051  0.062  0.072  0.059  0.062  0.049  0.078  0.079  0.064  0.011  0.081  0.091  0.075  0.093  0.098  0.091  0.114  0.087  0.091  0.011 
Feature match recall: Finally, Fig. 8 shows that our descriptors are robust and perform well over a wide range of thresholds for the allowable inlier distance and the minimum inlier ratio. Notably, Predator consistently outperforms D3Feat, which uses a similar KPConv backbone.
A.6 Additional ablation studies
overlap attention  3DMatch  3DLoMatch  
ov.  ov.  cond.  FMR  IR  RR  FMR  IR  RR 
96.4  39.6  82.6  72.2  14.5  38.9  
✓  96.2  47.2  86.9  71.8  18.0  50.9  
✓  ✓  96.1  47.8  87.3  69.5  15.8  48.4  
✓  ✓  95.5  46.4  87.1  73.0  17.6  54.4  
✓  ✓  ✓  96.6  49.9  88.3  71.7  20.0  54.2 
Ablations of the overlap attention module: We compare Predator with a baseline model, a plain encoder-decoder architecture based on KPConv without the proposed overlap attention module. It outputs 32-dimensional features, but no overlap or matchability scores. In the absence of those scores, we randomly sample 5,000 points and pass them to RANSAC for registration. As shown in Tab. 7, this baseline achieves the second-highest FMR on both benchmarks, but only reaches 82.6% and 38.9% RR on 3DMatch and 3DLoMatch, respectively; much worse than the four other variants that include (at least) the overlap score. This experiment again confirms that a high FMR or IR does not imply a high RR, and thus does not guarantee good registration performance.
3DMatch  3DLoMatch  
matchability  overlap  FMR  IR  RR  FMR  IR  RR 
96.0  43.6  83.9  69.3  15.7  39.3  
✓  96.3  48.4  87.8  72.2  19.4  50.8  
✓  96.1  46.2  88.0  71.3  16.9  49.3  
✓  ✓  96.6  49.9  88.3  71.7  20.0  54.2 
Ablations of the matchability score: We find that probabilistic sampling guided by the product of the overlap and matchability scores attains the highest RR. Here we further analyse the impact of each individual component. We first construct a baseline which applies random sampling (rand) over conditioned features; we then sample points with probability proportional to the overlap scores (prob. (o)), to the matchability scores (prob. (m)), and to the product of the two scores (prob. (om)). As shown in Tab. 8, rand fares clearly worse on all metrics. Both prob. (o) and prob. (m) achieve results comparable to prob. (om) on 3DMatch, but the gap widens on the more challenging 3DLoMatch benchmark, where prob. (om) is around 4 percentage points better in terms of RR.
A.7 Timings

  encoder  decoder  overall  
FCGF [7]  206  414  —  25  445  
D3Feat [3]  200  411  —  63  274  
Ours  191  419  70  61  271 
We compare the runtime of Predator with FCGF^{8}^{8}8All experiments were done with MinkowskiEngine v0.4.2. [7] and D3Feat^{9}^{9}9We use its PyTorch implementation. [3] on 3DMatch. For all three methods we use the same voxel size and batch size 1. The test is run on a single GeForce GTX 1080 Ti with an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz and 32 GB RAM. The most time-consuming step of our model, and also of D3Feat, is the data loader, as the neighborhood indices must be precomputed before the forward pass. With its smaller encoder and decoder, but the additional overlap attention module, Predator is still marginally faster than D3Feat. FCGF has a more efficient data loader that relies on sparse convolutions and queries neighbors during the forward pass. See Tab. 9.