This study analyzes architectural explanation videos published on architecture information websites to clarify their visual expression techniques in terms of the structure of subject depiction. A main subject was assigned to each shot, and N-gram analysis was used to extract recurring patterns in the sequence of depicted subjects. Correspondence analysis and multinomial logistic regression were then applied to examine the relationship between visuals and spoken content. The results show that meaning is formed through sequences of different subjects, and that depictions of interior spaces, exterior spaces, building exteriors, and parts play key roles. A consistent alignment between spoken content and visuals was also observed.
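
As an illustration of the N-gram step, the following minimal sketch counts runs of consecutive shot subjects in a single video. It is not the study's actual implementation: the function name `subject_ngrams`, the label strings, and the example shot sequence are all hypothetical and chosen only to show the idea.

```python
from collections import Counter


def subject_ngrams(shots, n):
    """Count every run of n consecutive shot labels in one video.

    `shots` is a list with one main-subject label per shot, in playback order.
    Returns a Counter mapping each n-gram (as a tuple) to its frequency.
    """
    return Counter(tuple(shots[i:i + n]) for i in range(len(shots) - n + 1))


# Hypothetical label sequence (assumed for illustration, not data from the study).
shots = [
    "exterior",
    "interior_space",
    "part",
    "interior_space",
    "part",
    "exterior_space",
]

bigrams = subject_ngrams(shots, 2)

# List the bigrams from most to least frequent.
for pattern, count in bigrams.most_common():
    print(" -> ".join(pattern), count)
```

In this toy sequence the pair `interior_space -> part` occurs twice, which is the kind of recurring depiction pattern such a count is meant to surface. Pooling counts across many videos would require summing the per-video counters, so that no N-gram spans the boundary between two videos.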