Robust Visual Place Recognition via modern Hopfield networks and foundation models

Takanori Hashimoto; Teijiro Isokawa; Masaki Kobayashi; Naotake Kamiura

doi:10.1587/nolta.17.932

Abstract

Visual Place Recognition (VPR) under severe environmental changes remains a fundamental challenge for autonomous robotics in real-world environments. This task can be interpreted as associative memory retrieval from noisy queries, but classical models suffer from limited capacity and sensitivity to pixel-level variations. We address this by integrating Modern Hopfield Networks with DINOv3, a self-supervised Vision Transformer that provides robust semantic representations. The primary aim of this study is not to maximize VPR accuracy itself, but to investigate whether an energy-based associative memory can be realized on the latent space of a foundation model, using VPR as a challenging real-world testbed. Place recognition is formulated as energy minimization in a semantic latent space, where stored scenes act as attractors. Experiments on the Transient Attributes Database across four seasons show that the proposed method significantly outperforms pixel-based baselines, even under extreme domain shifts. We further analyze the retrieval dynamics and the effect of the inverse temperature parameter β on attractor stability.
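
To make the retrieval idea concrete, the following is a minimal sketch of the standard modern Hopfield update, in which a query ξ is repeatedly replaced by a softmax-weighted combination of the stored patterns, ξ ← Σ_i softmax(β⟨x_i, ξ⟩)_i x_i. It is not the authors' implementation: the abstract does not specify their exact formulation, and the random unit vectors below are an assumed stand-in for DINOv3 embeddings of stored scenes. The names `hopfield_retrieve`, `patterns`, and `query`, the dimension, and the noise level are all hypothetical choices made for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(patterns, query, beta, steps=3):
    """Run `steps` modern Hopfield updates and return (state, weights).

    Each update computes attention-like weights softmax(beta * <x_i, xi>)
    over the stored patterns and replaces xi by their weighted sum.
    """
    xi = list(query)
    weights = []
    for _ in range(steps):
        scores = [beta * sum(a * b for a, b in zip(p, xi)) for p in patterns]
        weights = softmax(scores)
        xi = [sum(w * p[k] for w, p in zip(weights, patterns))
              for k in range(len(xi))]
    return xi, weights


# Assumed stand-in for scene embeddings: random unit vectors, one per place.
rng = random.Random(0)
dim, n_places = 64, 5
patterns = [normalize([rng.gauss(0, 1) for _ in range(dim)])
            for _ in range(n_places)]

# A noisy query derived from place 2, mimicking a changed appearance.
target = 2
query = normalize([x + 0.05 * rng.gauss(0, 1) for x in patterns[target]])

# Large beta: sharp weights, the state settles near one stored pattern.
_, weights_sharp = hopfield_retrieve(patterns, query, beta=50.0)
# Small beta: nearly uniform weights, the state is a blend of all patterns.
_, weights_flat = hopfield_retrieve(patterns, query, beta=0.1)

predicted_place = max(range(n_places), key=lambda i: weights_sharp[i])
print("predicted place:", predicted_place)
print("largest weight at beta=0.1:", max(weights_flat))
```

In this sketch the role of β is visible in the weights: a large β concentrates almost all weight on one stored pattern, so that pattern behaves as an attractor, while a small β yields a mixture of patterns rather than a single retrieved place.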

This article is licensed under a Creative Commons [Attribution-NonCommercial-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nc-nd/4.0/
