Experiments with LightGlue: geometry representation and initialization

Keep it simple, stupid
Published

December 3, 2023

Train your own matcher!

The release of the LightGlue training code (LightGlue is a faster, better and open-source SuperGlue) got me excited for many reasons. One of them is that I can now answer some small research questions about learned feature matching and share the results.

If you don’t know what LightGlue is: it is a transformer-based architecture for local feature matching, in which the original descriptors (SuperPoint, SIFT, etc.) are fused with the local feature geometry (keypoint coordinates, scale, orientation, etc.) to produce new, context-aware descriptors.

One such small question is: “Does the parameterization of the local feature geometry matter for the positional encoding input or not?”

Geometry encoding in LightGlue

Unlike modern keypoint detectors such as SuperPoint or DISK, a SIFT local feature has an orientation and a scale in addition to its position.

SIFT keypoint geometry. Image from VLFeat documentation

Should we just concatenate the scale and orientation to the (normalized) keypoint center, as the LightGlue code does, making a heterogeneous vector, or should we try to preprocess the input somehow? Or maybe geometry does not help at all?

To answer this question, I trained 5 LightGlue SIFT models (homography pretraining only), which differ only in the input feature geometry representation.
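To make the question concrete, here is a minimal sketch of how a geometry vector might enter a Fourier-feature positional encoding. It is an illustration only, not LightGlue's code: the function names are hypothetical, and a fixed random projection stands in for weights that would normally be learned. The encoder simply sees a vector of length 2 (center only) or 4 (center plus scale and orientation), so the units of each element directly affect the projected values.

```python
import math
import random


def make_projection(input_dim, num_frequencies, seed=0):
    """Random projection matrix; a stand-in (assumption) for learned weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(num_frequencies)]


def fourier_encode(geometry, projection):
    """Project the geometry vector, then take cos and sin of each projection."""
    projected = [sum(w * g for w, g in zip(row, geometry)) for row in projection]
    return [math.cos(p) for p in projected] + [math.sin(p) for p in projected]


# Hypothetical inputs: normalized center only vs. center + angle + radius.
center_only = (0.25, -0.10)
with_scale_ori = (0.25, -0.10, 30.0, 8.0)

for geom in (center_only, with_scale_ori):
    weights = make_projection(input_dim=len(geom), num_frequencies=4)
    embedding = fourier_encode(geom, weights)
    print(len(geom), "->", len(embedding))
```

Only the input dimension of the projection changes between the two cases; the embedding size stays the same.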

Possible geometry representations

There are 4 possible representations of the SIFT keypoint geometry:

  • The simplest is (x,y, angle, radius). Intuitively it makes the least sense, because the vector elements are heterogeneous: x, y and radius are in pixels, while the angle is in degrees. Moreover, when comparing scales, it would make more sense to use log(radius). We call it sift_scaori in the graph.

  • So the next version is (x,y, angle, log(radius)). We call it sift_logscaori in the graph.

  • The most homogeneous representation is 2 points, the center and one on the border: (x,y, x2, y2). We call it sift_laf2 in the graph.

  • Finally, we can use the LAF (local affine frame) representation, as in the affine transformation matrix used in, for example, kornia: (x,y, x2-x, y2-y) == (x,y, R * cos(angle), R * sin(angle)). We call it sift_laf in the graph (sift_laf1 in the results table).

SIFT keypoint representations

And the baseline is to not use the keypoint geometry at all, only the center, as for SuperPoint: (x,y). We call it sift_clean in the graph.
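The encodings listed above can be written down in a few lines. This is a minimal sketch, not the training code: the function name is hypothetical, the angle is assumed to be in degrees and the radius in pixels, and normalization of the keypoint center is left out.

```python
import math


def sift_geometry_encodings(x, y, angle_deg, radius):
    """Return the five encodings discussed above for a single SIFT keypoint."""
    theta = math.radians(angle_deg)
    dx = radius * math.cos(theta)  # offset from the center to the border point
    dy = radius * math.sin(theta)
    return {
        "sift_clean": (x, y),
        "sift_scaori": (x, y, angle_deg, radius),
        "sift_logscaori": (x, y, angle_deg, math.log(radius)),
        "sift_laf2": (x, y, x + dx, y + dy),  # center + point on the border
        "sift_laf": (x, y, dx, dy),           # center + offset to that point
    }


encodings = sift_geometry_encodings(x=100.0, y=50.0, angle_deg=90.0, radius=8.0)
for name, vector in encodings.items():
    print(name, [round(v, 3) for v in vector])
```

Note that only sift_scaori and sift_logscaori mix units (pixels and degrees) in one vector; the two LAF variants keep everything in pixels.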

Results

At the beginning of training, there seems to be a considerable difference between the keypoint geometry representations: (x,y, x2, y2) and (x,y, angle, log(radius)) are clearly better than the others.

Match recall and precision for homography pretraining at the beginning. {x,y, x2, y2} and {x,y, angle, logradius} are among the leaders

However, by the end of training, the simplest and the most stupid one, (x,y, angle, radius), takes first place.

Match recall and precision for homography pretraining at the end. {x,y, angle, radius} is the best

The difference is even more pronounced on MegaDepth-1500. More importantly, the ranking by match recall/precision on validation differs from the ranking on MegaDepth-1500. Do NOT compare these results to the SIFT results in the LightGlue paper, as we haven’t done the full training, only the homography pretraining.

Name           | Encoding                   | Pose mAA (%)
sift_clean     | (x, y)                     | 47.0
sift_scaori    | (x, y, angle, radius)      | 50.0
sift_logscaori | (x, y, angle, log(radius)) | 47.3
sift_laf1      | (x, y, x2-x, y2-y)         | 48.1
sift_laf2      | (x, y, x2, y2)             | 48.2

So the conclusion is, as is often the case with deep learning: keep it simple. The second conclusion is that additional geometry does help, but not by much.
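For readers unfamiliar with the metric in the table above, here is a rough sketch of how a pose mAA (mean average accuracy) number can be computed. The thresholds are an assumption (1 to 10 degrees, a common choice in image matching benchmarks), since the exact evaluation settings are not given here, and the function name is hypothetical.

```python
def pose_maa(pose_errors_deg, thresholds_deg=tuple(range(1, 11))):
    """Mean over thresholds of the fraction of image pairs with error <= threshold.

    pose_errors_deg: one pose error (in degrees) per image pair.
    thresholds_deg: assumed thresholds; the real evaluation may use others.
    """
    if not pose_errors_deg:
        raise ValueError("need at least one pose error")
    accuracies = []
    for threshold in thresholds_deg:
        hits = sum(1 for error in pose_errors_deg if error <= threshold)
        accuracies.append(hits / len(pose_errors_deg))
    return 100.0 * sum(accuracies) / len(accuracies)


# Made-up errors for four image pairs, for illustration only.
errors = [0.5, 2.5, 7.0, 45.0]
print(f"{pose_maa(errors):.1f}")  # 55.0
```

Because the metric depends on the final relative pose rather than on individual matches, it does not have to rank models the same way as match recall/precision.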

Can we initialize LightGlue for features X from LightGlue trained for features Y?

Recently I accidentally ran ALIKED-LightGlue with DISK features. To my surprise, the result was quite good.

For easy image pairs, LightGlue trained for ALIKED features works for DISK features as well.

Unfortunately, that works only for simple image pairs, not harder ones.

For hard image pairs, LightGlue trained for ALIKED features does not work for DISK features

On second thought, that kind of makes sense: once we have moved far from the original descriptors (in the initial layers), the attention and positional encoding are kind of similar.

This raises the question: can we initialize LightGlue for new local features from a model trained on other ones, and train faster or better?

I initialized LightGlue from the SuperPoint LightGlue weights and started training for DeDoDe features. as_sift here stands for the hyperparameter setup; ignore it.
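One generic way to do such an initialization in PyTorch is to copy every weight whose name and shape match and leave the rest at their fresh initialization. The sketch below illustrates that idea under stated assumptions; it is not necessarily how the experiment here was run. `init_from_checkpoint` and `toy_matcher` are hypothetical names, and the tiny `nn.Sequential` models stand in for real matchers with different descriptor dimensions.

```python
import torch
from torch import nn


def init_from_checkpoint(model, source_state):
    """Copy tensors whose name and shape match; keep the rest as initialized."""
    target_state = model.state_dict()
    compatible = {
        name: tensor
        for name, tensor in source_state.items()
        if name in target_state and tensor.shape == target_state[name].shape
    }
    # strict=False lets us load a partial state dict.
    model.load_state_dict(compatible, strict=False)
    skipped = sorted(set(target_state) - set(compatible))
    return sorted(compatible), skipped


def toy_matcher(descriptor_dim, width=8):
    """Toy stand-in: an input projection followed by a shared layer."""
    return nn.Sequential(
        nn.Linear(descriptor_dim, width),
        nn.ReLU(),
        nn.Linear(width, width),
    )


source = toy_matcher(descriptor_dim=16)  # "trained" on features Y
target = toy_matcher(descriptor_dim=32)  # to be trained on features X
loaded, skipped = init_from_checkpoint(target, source.state_dict())
print("loaded:", loaded)
print("skipped:", skipped)
```

In this toy case only the input projection weight is skipped, because it is the only tensor whose shape depends on the descriptor dimension.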

Initializing DeDoDe LightGlue with SuperPoint weights seemingly helps significantly in pretraining.

The advantage of initializing from a matcher pretrained on other features seemingly holds for MegaDepth training

Unfortunately, one cannot rely on the match recall/precision validation metric. When I evaluated on MegaDepth-1500, the model initialized from SuperPoint was worse.

Name                             | Pose mAA (%)
dedode_homo                      | 60.4
dedode_homo_from_sp              | 56.5
dedode_homo_ft_megadepth         | 65.9
dedode_homo_from_sp_ft_megadepth | 65.6

Summary

It was fun to train a couple of LightGlue models to check some hypotheses. Kudos to Philipp Lindenberger and Paul-Edouard Sarlin for releasing such an amazing package and paper.

I also think that the story of initializing learned matchers from other features, or even of universal, feature-agnostic matchers, is not over.

Acknowledgements.

This blog post is supported by the CTU in Prague RCI computing cluster, from the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 “Research Center for Informatics”.
