2026 Volume 17 Issue 3 Pages 932-944
Visual Place Recognition (VPR) under severe environmental changes remains a fundamental challenge for autonomous robotics in real-world environments. This task can be interpreted as associative memory retrieval from noisy queries, but classical associative memory models suffer from limited storage capacity and sensitivity to pixel-level variations. We address this by integrating Modern Hopfield Networks with DINOv3, a self-supervised Vision Transformer that provides robust semantic representations. The primary aim of this study is not to maximize VPR accuracy itself, but to investigate whether an energy-based associative memory can be realized on the latent space of a foundation model, using VPR as a challenging real-world testbed. Place recognition is formulated as energy minimization in a semantic latent space, where stored scenes act as attractors. Experiments on the Transient Attributes Database across four seasons show that the proposed method significantly outperforms pixel-based baselines, even under extreme domain shifts. We further analyze the retrieval dynamics and the effect of the inverse temperature parameter β on attractor stability.
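To make the retrieval formulation concrete, the following is a minimal sketch of associative recall with the standard softmax update rule of modern Hopfield networks, in which the state is repeatedly replaced by a softmax-weighted average of the stored patterns. It is an illustration under stated assumptions, not the implementation used in this study: the three-dimensional vectors are toy stand-ins for DINOv3 embeddings, and the function names (`hopfield_step`, `retrieve`) and the β values are hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hopfield_step(patterns, state, beta):
    """One update: state <- sum_i softmax(beta * <x_i, state>)_i * x_i."""
    scores = [beta * sum(p * s for p, s in zip(pat, state)) for pat in patterns]
    weights = softmax(scores)
    new_state = [
        sum(w * pat[d] for w, pat in zip(weights, patterns))
        for d in range(len(state))
    ]
    return new_state, weights


def retrieve(patterns, query, beta, steps=3):
    """Iterate the update from a noisy query; return the best index and weights."""
    state = normalize(query)
    weights = None
    for _ in range(steps):
        state, weights = hopfield_step(patterns, state, beta)
    best = max(range(len(patterns)), key=lambda i: weights[i])
    return best, weights


# Toy stand-ins for stored place embeddings (assumption: not real DINOv3 features).
places = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]
noisy_query = [0.2, 0.9, 0.3]  # a corrupted view of the second place

idx_sharp, w_sharp = retrieve(places, noisy_query, beta=20.0)  # large beta
idx_soft, w_soft = retrieve(places, noisy_query, beta=0.1)     # small beta
```

In this toy setting, a large β concentrates almost all of the softmax weight on a single stored pattern, so the state settles on one attractor, whereas a small β leaves the weights nearly uniform and the state drifts toward a mixture of all stored patterns. This is the kind of dependence on β that the attractor-stability analysis refers to.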