Transactions of the Japanese Society for Artificial Intelligence
Online ISSN : 1346-8030
Print ISSN : 1346-0714
ISSN-L : 1346-0714
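Original Paper
Legality Classification of Japanese Web Advertising Documents by Complex-Valued SVM Using Discrete Fourier Transform Document Features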
河本 哲, 秋光 淳生, 浅井 紀久夫
Journal, free access

2023, Vol. 38, No. 3, p. D-M51_1-14

Abstract

In Internet advertising, text is added to make an ad more appealing to viewers. However, some advertising documents contain inappropriate expressions. Wording that exaggerates the efficacy of a product, or that presents a product as recommended by a medical professional, may violate the Pharmaceutical Affairs Law and the Act against Unjustifiable Premiums and Misleading Representations. A system that can detect problematic advertisements effectively and quickly is therefore required. Some advertisements cannot be classified correctly from word statistics alone, so information other than word statistics must be embedded in the document vector. The advertising documents targeted in this study exhibit characteristics such as “biases in the positions of specific words” and “periodic occurrence of specific words.” Words that appear frequently in problematic documents (especially cosmetics advertisements) are strongly biased in position, so that their positions of occurrence follow a complex multimodal distribution. Embedding word-order and word-period information in document vectors is therefore considered very effective for identifying problematic advertising documents.

In recent years, the effectiveness of BERT has been recognized in various natural language processing tasks. However, application to Internet advertising also requires faster models. To achieve both inference speed and discrimination performance, we propose a document feature based on the discrete Fourier transform (DFT) of word vectors weighted by an index that was proposed in an earlier study on categorizing Chinese Internet advertisements. As the discriminative model, we employ complex-valued support vector machines, which can handle complex numbers and generalize well even from small amounts of data.
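The idea of a DFT-based document feature can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the weighting index, so the weights are simply passed in, and the function name `dft_document_feature`, the parameter `n_freq`, the zero-padding of short documents, and the division by document length are all assumptions made for illustration.

```python
import numpy as np


def dft_document_feature(word_vectors, weights, n_freq=4):
    """Return a complex document vector from position-ordered word vectors.

    word_vectors: array of shape (T, d), one row per word position.
    weights:      array of shape (T,), one weight per word position
                  (stand-in for the index mentioned in the abstract).
    n_freq:       number of low-frequency DFT coefficients to keep
                  (hypothetical parameter, not from the paper).
    """
    X = np.asarray(word_vectors, dtype=float)
    w = np.asarray(weights, dtype=float)
    if X.ndim != 2 or w.shape != (X.shape[0],):
        raise ValueError("expected word_vectors (T, d) and weights (T,)")

    T = X.shape[0]
    weighted = X * w[:, None]  # scale each word vector by its weight

    # DFT along the word-position axis; zero-pad documents shorter than n_freq.
    spectrum = np.fft.fft(weighted, n=max(T, n_freq), axis=0)

    # Keep the first n_freq coefficients and normalise by document length
    # (assumption), then flatten to a vector of length n_freq * d.
    return (spectrum[:n_freq] / T).ravel()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(10, 3))  # toy document: 10 words, 3 dimensions
    feature = dft_document_feature(vectors, np.ones(10), n_freq=4)
    print(feature.shape, feature.dtype)
```

In this sketch the zero-frequency coefficient reduces to the weighted average of the word vectors, while the higher-frequency coefficients carry information about where and how periodically words occur. The resulting vector is complex-valued, which is why a classifier that accepts complex inputs is needed downstream.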

Although the discrimination performance of the proposed model is somewhat inferior to that of ALBERT and BERT, it is higher than that of DistilBERT, XGBoost, and LightGBM. The inference speed of the proposed model is somewhat slower than that of XGBoost and LightGBM and needs improvement, but it is faster than that of DistilBERT. These results indicate that the proposed model is promising for application on the Internet. In addition, we found that when the index from the earlier study on Chinese advertisements was applied to Japanese advertisements, it emphasized the word vectors of specific nouns and verbs.

© 2023 The Japanese Society for Artificial Intelligence