This study investigates the relative impact of multimodal pre‑training and Transformer depth on the accuracy–cost balance of lightweight vision‑language models trained exclusively on public data. A unified framework toggles pre‑training and varies the number of encoder and decoder layers (encoder 0–8; decoder 12 or 24) across three tasks: human‑object interaction (HOI) recognition on V‑COCO, and clothing categorization and localization on DeepFashion2. Results show that costly pre‑training changes recognition and classification accuracy by at most 1 percentage point and yields only marginal localization gains, despite requiring more than 200 GPU‑hours. Removing the encoder leaves HOI accuracy unchanged but lowers localization by 13 percent; a single encoder layer restores performance at trivial cost. Doubling decoder depth brings no benefit while adding 70–95 training hours. Consequently, an enc2‑dec12 configuration without pre‑training provides the best accuracy–cost trade‑off unless fine‑grained localization is paramount.
This study describes a new method for achieving high-precision end milling of rounded corners. Machining error in end milling is known to arise from variation in cutting force and from tool deflection caused by insufficient tool rigidity. We therefore show, from geometric considerations, that the variation in cutting force can be controlled by adjusting the axial and radial depths of cut so that the Number of Simultaneously Engaged Edges and the Length of Simultaneously Engaged Edges remain constant. Furthermore, we propose machining with a tool whose radius equals the corner radius, which maximizes the tool diameter while still controlling the variation in cutting force. Combining the high rigidity of a large-diameter tool with controlled cutting-force variation makes high-precision end milling of corners possible.