BMVC 2022
Render-and-Compare allows for precise CAD model alignments. However, traditionally it is very slow and requires a good initialisation. It is slow because every refinement step renders the CAD model in full and compares the entire rendered image against the input, and many such steps are needed to converge.
The input to the pose prediction network comes from three different sources: the 2D image, the 3D CAD model and extra information.
2D Image Information: We sample a set of pixels in the image and use their RGB values as well as the estimated depth and surface normals as input to the network. This information is stacked channel-wise, together with the pixel coordinates (u, v) and a token τ that tells the network this information comes from the 2D image.
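To make the input layout concrete, here is a minimal sketch of how the per-pixel features could be stacked channel-wise; the channel ordering, the helper name build_pixel_features and the scalar encoding of the token τ are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def build_pixel_features(rgb, depth, normals, pixel_uv, source_token):
        # Stack per-pixel inputs channel-wise: RGB (3) + depth (1) + normals (3)
        # + pixel coordinates (2) + source token tau (1) = 10 channels per pixel.
        # pixel_uv holds integer (u, v) coordinates of the sampled pixels.
        u, v = pixel_uv[:, 0], pixel_uv[:, 1]
        return np.concatenate([
            rgb[v, u],                                       # (N, 3) RGB values
            depth[v, u, None],                               # (N, 1) estimated depth
            normals[v, u],                                   # (N, 3) surface normals
            pixel_uv.astype(np.float32),                     # (N, 2) coordinates (u, v)
            np.full((len(u), 1), source_token, np.float32),  # (N, 1) token tau
        ], axis=1)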
3D CAD Model Information: We sample a set of points (between 100 and 1000) and their corresponding surface normals from the CAD model and use the current CAD model pose to reproject them into the image plane. As for the 2D image, we stack all available information together with the pixel coordinates and a different token τ that tells the network this information comes from the CAD model.
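The reprojection itself can be sketched as a standard pinhole projection under the current pose; the decomposition into rotation R, translation t, scale s and intrinsics K is a common parametrisation assumed here for illustration.

    import numpy as np

    def project_cad_points(points, normals, R, t, s, K):
        # Transform sampled CAD points from model to camera coordinates under
        # the current pose estimate, then project with the camera intrinsics K.
        pts_cam = (s * points) @ R.T + t      # (N, 3) scale, rotate, translate
        nrm_cam = normals @ R.T               # rotate normals along with the pose
        proj = pts_cam @ K.T                  # pinhole projection
        uv = proj[:, :2] / proj[:, 2:3]       # perspective divide -> (u, v)
        return uv, pts_cam[:, 2], nrm_cam     # pixel coords, depth, normals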
Extra Information: Additionally, we explicitly provide the predicted bounding box, the current CAD model pose and the CAD model's ID in the database as input to the network.
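One plausible way to feed this extra information to the network is to flatten it into an additional input vector; the one-hot ID encoding below is an assumption made for illustration, not the paper's scheme.

    import numpy as np

    def encode_extra_info(bbox, q, t, s, cad_id, num_ids):
        # Flatten the predicted bounding box, current pose (quaternion q,
        # translation t, scale s) and CAD model database ID into one vector.
        id_onehot = np.zeros(num_ids, dtype=np.float32)
        id_onehot[cad_id] = 1.0
        return np.concatenate([
            np.asarray(bbox, np.float32),          # (4,) predicted bounding box
            np.asarray(q, np.float32),             # (4,) current rotation quaternion
            np.asarray(t, np.float32),             # (3,) current translation
            np.atleast_1d(s).astype(np.float32),   # (1,) or (3,) current scale
            id_onehot,                             # (num_ids,) CAD model ID
        ])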
This combined information is the input to the pose update prediction network, which predicts refinements ΔQ, ΔT and ΔS that update the rotation, translation and scale of the CAD model. (In the figure above, c is a classification score indicating how likely the current rotation is to be within 45° of the correct rotation; it is used to choose which rotation to initialise the pose from.) We use the Perceiver architecture, whose cross-attention allows it to process large inputs efficiently.
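Applying the predicted refinements amounts to composing ΔQ with the current rotation and updating the translation and scale; the sketch below assumes a quaternion rotation, an additive ΔT and a multiplicative ΔS, which may differ from the paper's exact parametrisation.

    import numpy as np

    def quat_multiply(q1, q2):
        # Hamilton product of two quaternions in (w, x, y, z) order.
        w1, x1, y1, z1 = q1
        w2, x2, y2, z2 = q2
        return np.array([
            w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2,
        ])

    def apply_pose_update(q, t, s, dq, dt, ds):
        # Compose the predicted Delta Q with the current rotation, shift the
        # translation by Delta T and rescale by Delta S.
        q_new = quat_multiply(dq, q)
        q_new /= np.linalg.norm(q_new)   # renormalise to a unit quaternion
        return q_new, t + dt, s * ds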
We update the CAD model's pose with the predicted refinements and sample new inputs for the next iteration. This process is repeated three times.
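Putting the pieces together, the refinement loop can be sketched as below, reusing apply_pose_update from the previous sketch; sample_inputs and pose_network stand in for the input sampling described above and the Perceiver-based update network, and their interfaces are hypothetical.

    NUM_ITERATIONS = 3  # the pose update is predicted and applied three times

    def refine_pose(pose, sample_inputs, pose_network):
        # Iteratively: sample sparse inputs under the current pose estimate,
        # predict the refinements (Delta Q, Delta T, Delta S), apply them,
        # and repeat with the updated pose.
        q, t, s = pose
        for _ in range(NUM_ITERATIONS):
            inputs = sample_inputs(q, t, s)      # re-sample under the new pose
            dq, dt, ds = pose_network(inputs)    # predicted refinements
            q, t, s = apply_pose_update(q, t, s, dq, dt, ds)
        return q, t, s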
We train and evaluate SPARC on the ScanNet dataset. Compared visually to the previous state of the art, ROCA, SPARC produces more accurate alignments, mostly owing to better translation and scale predictions.
Quantitatively, SPARC outperforms all competitors by a large margin on both the class average and the instance average.
@inproceedings{sparc,
  author    = {Langer, F. and Bae, G. and Budvytis, I. and Cipolla, R.},
  title     = {SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image},
  booktitle = {Proc. British Machine Vision Conference},
  month     = {November},
  year      = {2022},
  address   = {London}
}