Train your own matcher!
The release of the LightGlue training code (LightGlue being a faster, better, and open-source SuperGlue) made me excited for many reasons. One of them is that I can now answer some small research questions about learned feature matching and share the results.
If you don’t know what LightGlue is: it is a transformer-based architecture for local feature matching, in which the original descriptors (SuperPoint, SIFT, etc.) are fused with local feature geometry (keypoint coordinates, scale, orientation, etc.) to produce new, context-aware descriptors.
One such small question is: “Does the parameterization of the local feature geometry fed to the positional encoding matter or not?”
Geometry encoding in LightGlue
Unlike modern keypoint detectors such as SuperPoint or DISK, SIFT local features also have an orientation and a scale.
Should we just concatenate scale and orientation to the (normalized) keypoint center, as the LightGlue code does, making a heterogeneous vector, or should we try to preprocess the input somehow? Or maybe geometry does not help at all?
To answer this question, I trained 5 LightGlue SIFT models (homography pretraining only), which differ only in the input feature geometry representation.
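To make the question concrete, here is a minimal sketch of how an M-dimensional geometry vector could enter a Fourier-style positional encoding. It is written in the spirit of such encodings, not copied from LightGlue: the random projection weights stand in for learned ones, and the names fourier_encode and random_weights are hypothetical.

```python
import math
import random


def fourier_encode(geometry, weights):
    """Project an M-dim geometry vector with `weights` (F x M), then take cos and sin."""
    projected = [sum(w * g for w, g in zip(row, geometry)) for row in weights]
    # Output is F cosines followed by F sines, so its size does not depend on M.
    return [math.cos(p) for p in projected] + [math.sin(p) for p in projected]


def random_weights(num_freqs, input_dim, seed=0):
    """Random stand-in for a learned projection matrix (assumption for this sketch)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(input_dim)] for _ in range(num_freqs)]


# Center only, as for SuperPoint: M = 2.
xy_only = fourier_encode((0.1, -0.3), random_weights(8, 2))

# Center plus raw orientation (degrees) and radius (pixels): M = 4.
with_geometry = fourier_encode((0.1, -0.3, 45.0, 3.0), random_weights(8, 4))

print(len(xy_only), len(with_geometry))
```

Only the input dimension changes between the variants; the encoding size stays the same. Note that raw degrees and pixel radii give much larger arguments to cos and sin than normalized coordinates do, which is one reason the parameterization could plausibly matter.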
Possible geometry representations
There are 4 possible representations of the SIFT keypoint geometry:
1. The simplest is (x, y, angle, radius). Intuitively it makes the least sense, because the vector elements are heterogeneous: x, y, and radius are in pixels, while angle is in degrees. Moreover, if we compare scales, it would make more sense to use log(radius). We call it sift_scaori in the graph.
2. So the next version is (x, y, angle, log(radius)). We call it sift_logscaori in the graph.
3. The most homogeneous representation is as 2 points, the center and one on the border: (x, y, x2, y2). We call it sift_laf2 in the graph.
4. Finally, we can use the LAF representation, as in an affine transformation matrix, used in, for example, kornia: (x, y, x2-x, y2-y) == (x, y, R * cos(angle), R * sin(angle)). We call it sift_laf in the graph.
And the baseline is to not use keypoint geometry at all, only the center as for SuperPoint: (x, y). We call it sift_clean in the graph.
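The encodings above can be written down as a small function. This is a minimal sketch for a single keypoint, assuming the angle is given in degrees and the radius in pixels; encode_keypoint is a hypothetical helper, not part of the LightGlue or kornia API, and the mode names simply follow the list above.

```python
import math


def encode_keypoint(x, y, angle_deg, radius, mode):
    """Return the geometry vector of one SIFT keypoint under the chosen encoding."""
    # Offset from the center to the border point in the direction of the orientation.
    theta = math.radians(angle_deg)
    dx, dy = radius * math.cos(theta), radius * math.sin(theta)

    if mode == "sift_clean":      # center only
        return (x, y)
    if mode == "sift_scaori":     # raw angle and radius
        return (x, y, angle_deg, radius)
    if mode == "sift_logscaori":  # raw angle, log of the radius
        return (x, y, angle_deg, math.log(radius))
    if mode == "sift_laf2":       # two points: center and border
        return (x, y, x + dx, y + dy)
    if mode == "sift_laf":        # center and offset to the border point
        return (x, y, dx, dy)
    raise ValueError(f"unknown encoding: {mode}")


for mode in ("sift_clean", "sift_scaori", "sift_logscaori", "sift_laf2", "sift_laf"):
    print(mode, encode_keypoint(10.0, 20.0, 90.0, 4.0, mode))
```

All four non-baseline encodings carry the same information; they differ only in units and in how scale and orientation are mixed.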
Results
At the beginning of training there seems to be a considerable difference between the keypoint geometry representations: (x, y, x2, y2) and (x, y, angle, log(radius)) are clearly better than the others.
However, by the end of training, the simplest and most naive one, (x, y, angle, radius), takes first place.
The difference is even more pronounced on MegaDepth-1500. More importantly, the ranking by match recall/precision on validation differs from the ranking on MegaDepth-1500. Do NOT compare these results to the SIFT results in the LightGlue paper, as we haven’t done the full training, only the homography pretraining.
Name | Encoding | Pose mAA (%) |
---|---|---|
sift_clean | (x, y) | 47.0 |
sift_scaori | (x, y, angle, radius) | 50.0 |
sift_logscaori | (x, y, angle, log(radius)) | 47.3 |
sift_laf1 | (x, y, x2-x, y2-y) | 48.1 |
sift_laf2 | (x, y, x2, y2) | 48.2 |
So the conclusion is, as is often the case with deep learning: keep it simple. The second conclusion is that additional geometry does help, but not by much: the best encoding gains 3.0 points of pose mAA over the center-only baseline (50.0 vs. 47.0).
Can we initialize LightGlue for features X from LightGlue trained for features Y?
Recently I accidentally ran ALIKED-LightGlue with DISK features. To my surprise, the result was quite good.
Unfortunately, that works only for simple image pairs, not for harder ones.
On second thought, that kind of makes sense: once we have moved far from the original descriptors (in the initial layers), the attention and positional encoding are kind of similar.
This raises the question: can we initialize LightGlue for new local features from a model trained for other features, and train faster or better?
I initialized LightGlue with the SuperPoint LightGlue weights and started training for DeDoDe features. as_sift here stands for the hyperparameter setup; ignore it.
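One generic way to do such an initialization is to copy only the parameters whose names and shapes match between the two models, and leave the rest at their fresh initialization. The sketch below illustrates that idea; it is an assumption about how it could be done, not a description of the training code used here. The helper select_compatible_weights, the FakeTensor stand-in, and all parameter names and shapes are hypothetical.

```python
from collections import namedtuple

# Stand-in for a tensor: only the shape matters for this sketch.
FakeTensor = namedtuple("FakeTensor", ["shape"])


def select_compatible_weights(source_state, target_state):
    """Keep source entries whose name and shape both match the target model."""
    compatible, skipped = {}, []
    for name, value in source_state.items():
        target = target_state.get(name)
        if target is not None and tuple(target.shape) == tuple(value.shape):
            compatible[name] = value
        else:
            skipped.append(name)
    return compatible, skipped


# Hypothetical parameter names and shapes, for illustration only.
source_state = {
    "input_proj.weight": FakeTensor((256, 256)),
    "posenc.weight": FakeTensor((32, 2)),
    "layers.0.attn.weight": FakeTensor((768, 256)),
}
target_state = {
    "input_proj.weight": FakeTensor((256, 512)),  # different descriptor size
    "posenc.weight": FakeTensor((32, 2)),
    "layers.0.attn.weight": FakeTensor((768, 256)),
}

compatible, skipped = select_compatible_weights(source_state, target_state)
print(sorted(compatible))
print(skipped)
```

With real PyTorch modules, the same filter works on the dictionaries returned by state_dict(), and the result can be loaded with model.load_state_dict(compatible, strict=False) so that the skipped parameters keep their random initialization.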
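Unfortunately, one cannot rely on the match recall/precision validation metric. When I evaluated on MegaDepth-1500, the model initialized from SuperPoint was worse: 56.5 vs. 60.4 pose mAA after homography pretraining, with the gap nearly closing (65.6 vs. 65.9) after fine-tuning on MegaDepth.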
Name | Pose mAA (%) |
---|---|
dedode_homo | 60.4 |
dedode_homo_from_sp | 56.5 |
— | — |
dedode_homo_ft_megadepth | 65.9 |
dedode_homo_from_sp_ft_megadepth | 65.6 |
Summary
It was fun to train a couple of LightGlue models to check some hypotheses. Kudos to Philipp Lindenberger and Paul-Edouard Sarlin for releasing such an amazing package and paper.
I also think that the story of initializing learned matchers from other features, or even of universal, feature-agnostic matchers, is not over.
Acknowledgements.
This blog post is supported by the CTU in Prague RCI computing cluster, from the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 “Research Center for Informatics” grant.