Article ID: 2025DVL0006
This letter describes a training-free 3D semantic segmentation method using virtual cameras and a 2D foundation model guided by language prompts. Aggregating multi-view predictions via weighted voting achieves accuracy comparable to supervised methods and supports open-vocabulary recognition without requiring annotated 3D data or paired RGB images.