IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Special Section on Intelligent Information Processing to Solve Social Issues
A Methodology on Converting 10-K Filings into a Machine Learning Dataset and Its Applications
Mustafa SAMI KACARSemih YUMUSAKHalife KODAZ
Author information
JOURNAL FREE ACCESS

2023 Volume E106.D Issue 4 Pages 477-487

Details
Abstract

Companies listed on the stock exchange are required to share their annual reports with the U.S. Securities and Exchange Commission (SEC) within the first three months following the fiscal year. These reports, namely 10-K Filings, are presented to public interest by the SEC through an Electronic Data Gathering, Analysis, and Retrieval database. 10-K Filings use standard file formats (xbrl, html, pdf) to publish the financial reports of the companies. Although the file formats propose a standard structure, the content and the meta-data of the financial reports (e.g. tag names) is not strictly bound to a pre-defined schema. This study proposes a data collection and data preprocessing method to semantify the financial reports and use the collected data for further analysis (i.e. machine learning). The analysis of eight different datasets, which were created during the study, are presented using the proposed data transformation methods. As a use case, based on the datasets, five different machine learning algorithms were utilized to predict the existence of the corresponding company in the S&P 500 index. According to the strong machine learning results, the dataset generation methodology is successful and the datasets are ready for further use.

Content from these authors
© 2023 The Institute of Electronics, Information and Communication Engineers
Previous article Next article
feedback
Top