2019 Volume 27 Pages 536-544
Many current attacks target legitimate domain names. In homograph attacks, attackers exploit human visual misrecognition to lead users to fake sites. These attacks generate new domain names that appear similar to an existing legitimate domain name by replacing several of its characters with visually similar ones. In particular, internationalized domain names (IDNs), which may contain non-ASCII characters, allow attackers to generate and register many look-alike names (homograph IDNs) for use as phishing sites. A conventional method of detecting such homograph IDNs uses a predefined mapping between ASCII characters and similar non-ASCII characters. However, this approach has two major limitations: (1) it cannot detect homograph IDNs containing characters that are not defined in the mapping, and (2) the mapping must be updated manually. Herein, we propose a new method for detecting homograph IDNs using optical character recognition (OCR). Because homograph IDNs are visually similar to legitimate domain names, we leverage OCR techniques to recognize such similarities automatically. We compare our approach with a conventional method in evaluations employing 3.19 million real (registered) IDNs and 10,000 malicious IDNs. The results reveal that our method can automatically detect homograph IDNs that the conventional approach cannot detect.
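To make the conventional mapping-based approach and its first limitation concrete, the following is a minimal sketch and not the implementation evaluated in the paper. The mapping `HOMOGLYPHS`, the function names, and the example domains are all hypothetical and chosen only for illustration; a real mapping would be far larger.

```python
from typing import Optional, Set

# Hypothetical, deliberately tiny mapping from non-ASCII look-alikes to ASCII.
# A real deployment would need a much larger, manually maintained table.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0440": "p",  # Cyrillic small letter er
}


def to_ascii_skeleton(domain: str) -> Optional[str]:
    """Replace mapped look-alikes with ASCII; return None if any
    non-ASCII character is missing from the mapping."""
    out = []
    for ch in domain.lower():
        if ch.isascii():
            out.append(ch)
        elif ch in HOMOGLYPHS:
            out.append(HOMOGLYPHS[ch])
        else:
            return None  # unmapped character: the method cannot decide
    return "".join(out)


def find_homograph_target(idn: str, legitimate: Set[str]) -> Optional[str]:
    """Return the legitimate domain that `idn` imitates, or None."""
    if idn.isascii():
        return None  # not an IDN, so not a homograph IDN
    skeleton = to_ascii_skeleton(idn)
    return skeleton if skeleton in legitimate else None


legit = {"example.com"}
# Cyrillic "а" (U+0430) is in the mapping, so this IDN is flagged.
flagged = find_homograph_target("ex\u0430mple.com", legit)
# Latin alpha "ɑ" (U+0251) also resembles "a" but is not in the mapping,
# so the same check misses it: limitation (1) in the text above.
missed = find_homograph_target("ex\u0251mple.com", legit)
```

Closing the gap for the second domain requires someone to add U+0251 to the table by hand, which is limitation (2).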