Journal of the Japanese Association for Digital Humanities
Online ISSN : 2188-7276
Articles
The Temporal and Spatial Characteristics of the Administrative Document Catalog of the Government-General of Taiwan
Hajime Murai, Toshio Kawashima
Journal, open access

2022, Volume 6, Issue 1, pp. 3-10

Abstract

The official documents of the Government-General of Taiwan are from the official Taiwan document collection of the Japanese government from before World War II. The documents have been preserved in good condition, and although the official documents of the Japanese government about Japan from that time remain sealed, the Taiwan documents have been published openly. Thus, the documents that are available for historical study are scarce and valuable. Although these documents are being digitized continuously by researchers at Chukyo University, we used the already digitized catalog of the documents from the Meiji era for a quantitative analysis in order to obtain an overview of the enormous number of administrative documents. Through this analysis we were able to detect the chronological shifts in topics of the administrative documents from that time. In addition, we added geographical information based on the place names, and, depending on the regions, extracted characteristics related to the themes of the documents. The goal of this research is to investigate the usefulness of the digitized text of official documents for historical and social studies. The methods used for the analysis in this research can in the future be applied to all of the digitized texts in this collection for more precise and comprehensive quantitative analyses.

Introduction

Currently, there are numerous research projects utilizing digitally archived historical documents in various academic fields, especially in the digital humanities. From these digital archives, a variety of data sets have been combined with geographical information to extract historical characteristics: for example, demographic data about the working population (Parker 2010) and data sets for taxes and financial affairs (Carrion et al. 2016). These geographically combined data sets are not limited to numerical data. Various other resources, such as cultural heritage (Wilson 2010) and historical texts (Fleet 2005), have been combined with geographical data to enable the visualization and discovery of new relationships. As these research projects have demonstrated, digital archiving and the combination of various types of historical resources have become a worldwide trend because they provide additional and useful methods for historical research.

In Japan, digital archiving has also been widely utilized, especially in cultural fields, for example, with digitized classic documents (National Diet Library, http://dl.ndl.go.jp/) and digitized cultural properties and works of art (Agency for Cultural Affairs 2008). Moreover, Japan Search (Intellectual Property Strategy Headquarters, https://jpsearch.go.jp/) has been developed for cross-searching a variety of digital archives that have been built by several organizations, and this has enabled advanced search methods based on metadata and RDF.

The necessity for the digitization and publication of government documents is claimed to be a part of the progress of e-government (Koga 2005); however, the digitization and publication of Japanese historical government documents are not highly advanced. One of the reasons for this is that natural disasters occur frequently in Japan. Moreover, various precious documents were lost during World War II because of air raids, evacuations, and disposal at the end of the war (National Archives of Japan 2006). Under these circumstances, however, the government documents of the Government-General of Taiwan and the Government-General of Korea were preserved in exceptionally good condition (Kato 2002). The documents of the Government-General of Taiwan are especially comprehensive administrative documents that depict the activities of public organizations in prewar Japan, and these documents are valuable not only in the field of history but also in modern Japanese philology (Hiyama 2003). Moreover, unlike general Japanese official documents, the documents in this collection have been published in full, and digitization in image format has been completed (Higashiyama 2017). Therefore, they are very valuable as research materials. However, only a partial catalog of the documents has been digitized in text format, and a research project for automatic character recognition is in progress (Takahashi et al. 2018).

Among historical digital archives related to Taiwan, a data set of historical place names (Chen 2014; Hsiang et al. 2012) and a digital archive of historical documents from before Japanese rule have already been developed (Chen et al. 2007). Moreover, a project is currently underway to develop a historical database of all Chinese people (CBDB 2017). In addition to these digital data sets related to Taiwan and China, if a sufficient number of the administrative documents of the Government-General of Taiwan are digitized, they can become a foundation for carrying out quantitative historical and geographical research not only in Japan but also in the larger region of East Asia.

In this research, in order to obtain an overview of the enormous number of administrative documents of the Government-General of Taiwan, we performed a quantitative analysis, using information technology, of the part of the document catalog that had already been digitized in text format. In addition, by combining geographical information with the text data set, it became possible to extract the chronological and regional characteristics quantitatively.

Although there are various limitations to catalog analyses, this method clarifies the value and potential of the digitized full text data of all of the administrative documents of the Government-General of Taiwan. In other words, though the catalog data cannot be used to prove historical hypotheses, the results of data analysis would help historians to notice interesting trends in the official documents. Accordingly, the goal of this research is to investigate the usefulness of digitally archived full text data of official documents for historical and social studies.

Content to be Analyzed

The data source for this research was the Catalog Database for the Administrative Documents of the Government-General of Taiwan (ISSCU 2015), which is based on the Catalog for the Administrative Documents of the Government-General of Taiwan (台湾総督府文書目録, CCCADGT 1994); these are historical materials owned by Taiwan Historica that have been surveyed and compiled by the Institute for Social Sciences, Chukyo University, for academic research. However, not all of the catalog data were completed at this stage. Permanent preservation documents in the Collection of the Official Documents of the Government-General of Taiwan (台湾総督府公文類纂) between 1895 (Treaty of Shimonoseki) and 1914 have been digitized in text data format (CSV); the data for the last year (1914) are partial. These text data of the catalog are based on the Catalog for the Administrative Documents of the Government-General of Taiwan, volumes 1 to 29. The digitized texts of the catalog include the title of each administrative document and its date of promulgation.

The database was divided into five parts, each corresponding to a time period, as depicted in table 1. Except for the fifth part, each part includes a similar number of titles (of official documents).

Dictionary for Document Categories

Before the quantitative content analysis, a dictionary was developed for extracting the words from the titles of the official documents that are included in the catalog (Murai and Kawashima 2018). With an ordinary dictionary for natural language processing, it is difficult to obtain appropriate results from morphological analysis of historical documents, because such documents include archaic names of people and places as well as historical and cultural terms. Therefore, in this research, the names of people were omitted, and frequently appearing place names and historical or cultural terms were registered in the dictionary for morphological analysis, because the goals were to obtain an overview of the contents and to extract the chronological and geographical characteristics.

We used the Java-based Japanese morphological analysis library Kuromoji (Atilika 2018) for morphological analysis of the catalog. The results of the morphological analysis, based on the existing ordinary dictionary, were manually examined to extract word fragments and mistakenly combined word pairs, and the apparent location of these words in the original documents was verified to register the correct words. In each of the five parts of the catalog text data, we investigated the 500 most frequently appearing words. In addition, we registered the necessary topic words and place names for the characteristic analysis (described below). As a result, 136 words related to place names and 61 historical or cultural words were included in the dictionary. Some examples of these words are depicted in table 2.
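The frequency survey described above can be illustrated with a minimal Python sketch. It assumes the titles have already been tokenized by a morphological analyzer (the study used Kuromoji, which is not called here); the function name `top_words` and the sample tokens are hypothetical and are not taken from the catalog.

```python
from collections import Counter

def top_words(tokenized_titles, n=500):
    """Return the n most frequent tokens across all titles as (token, count) pairs."""
    counts = Counter(token for title in tokenized_titles for token in title)
    return counts.most_common(n)

# Hypothetical tokenizer output for three titles (invented, not real catalog data).
sample_titles = [
    ["病院", "設置", "認可"],
    ["病院", "医師", "採用"],
    ["学校", "設置"],
]

# Candidate words to inspect manually for fragments or wrongly merged pairs.
candidates = top_words(sample_titles, n=2)
```

In such a workflow, the resulting list would only be a starting point: each candidate still has to be checked by hand against the original titles before it is registered in the dictionary.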

Table 1. Summary of the Catalog Database for the Administrative Documents of the Government-General of Taiwan

| Part | Period | Titles included | Book numbers |
| --- | --- | --- | --- |
| The first part | 1895–1900 | 24,985 | 1 to 579 |
| The second part | 1901–1904 | 24,510 | 580 to 1,050 |
| The third part | 1905–1908 | 23,752 | 1,051 to 1,451 |
| The fourth part | 1909–1912 | 24,782 | 1,453 to 2,087 |
| The fifth part | 1913–1914 | 8,859 | 2,088 to 2,340 |

Table 2. Examples of the words registered in the dictionary for morphological analysis

| Name of the historical organization | Historical Japanese words | Historical words related to Taiwan |
| --- | --- | --- |
| 撫墾署 (Indigenous office) | 勅詔 (Command paper) | 隘勇 (Defense line) |
| 弁務署 (Police) | 聖詔 (Command paper) | 土匪 (Hostile indigenous) |
| 警保 (Security, Police) | 命免 (Appointments and dismissal) | 蕃務 (Indigenous operations) |
| 郵便電信局 (Post office) | 紀元節 (Old holiday) | 樟脳 (Camphor) |

In order to extract an overview of the contents of the catalog of the administrative documents, we developed topic categories for the contents of the administrative documents of the Government-General of Taiwan. As many documents are related to several topic categories, documents were assigned to all of their related categories in the relational database for the catalog of the administrative documents. In the categorization process, we created a word list for each topic category and assigned the documents whose titles include those words to the corresponding topic categories.

In creating the word list for each topic category, first, we randomly selected titles from the catalog and categorized them manually based on the topics of each document. This process is based on the content analysis method (Stemler 2000), which is often utilized in sociology and psychology. In addition, typical words in the manually categorized titles were listed for each corresponding topic. For example, some titles that are related to the topic of hygiene often include words such as “病院” (hospital), “医師” (doctor), and “患者” (patient). Therefore, these words were added to the word list for the category “Hygiene.” We expanded the word list for the topic categories by repeating this procedure of random selection and word categorization. Then, the categorization of all the titles in the target catalog was automatically completed based on the word list. If a title did not include any word within the list, the title was categorized as “Other.” The expansion of the word list was repeated until the “Other” titles became, on average, less than 5 percent of the total. If one title included several topic words that belonged to different categories, that title was assigned to all of the corresponding categories. Finally, historians with expertise in the administrative documents of the Government-General of Taiwan were asked to verify the final word list for the topic categories. Table 3 shows the final ratio of titles assigned to a category other than “Other,” and table 4 shows the numbers of words that were categorized for the final word list, with examples of the words.
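The word-list categorization described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the word lists are heavily abbreviated, the titles are invented, and matching is done by simple substring search, whereas the study may have matched against the output of the morphological analysis. The names `TOPIC_WORDS`, `categorize`, and `other_ratio` are hypothetical.

```python
# Hypothetical, heavily abbreviated word lists; the study's lists are far larger.
TOPIC_WORDS = {
    "Hygiene": ["病院", "医師", "患者"],
    "Permission": ["認可", "許可"],
    "Education": ["学校", "生徒"],
}

def categorize(title, topic_words=TOPIC_WORDS):
    """Return every category whose word list matches the title, else ["Other"]."""
    matched = [topic for topic, words in topic_words.items()
               if any(word in title for word in words)]
    return matched or ["Other"]

def other_ratio(titles, topic_words=TOPIC_WORDS):
    """Share of titles that match no word list (the stopping criterion)."""
    others = sum(1 for t in titles if categorize(t, topic_words) == ["Other"])
    return others / len(titles)

# Invented example titles, not taken from the catalog.
titles = ["病院設置認可ノ件", "学校生徒募集", "雑件"]
labels = [categorize(t) for t in titles]
```

Under this sketch, the word lists would be extended and `other_ratio` recomputed until the share of "Other" titles falls below the 5 percent threshold mentioned in the text.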

The number of words for the topic categories is relatively large for “Personnel Shift,” “Hygiene,” “Military,” and “Judgment.” “Personnel Shift” includes various terms related to work, employment, and resignation. “Hygiene” includes many disease names with many different notations, and “Military” includes various names related to military ranks. “Judgment” includes various criminal names and law names. “Residents,” “Permission,” and “Fisheries” include a relatively small number of words. The documents related to “Residents” and “Permission” are mostly composed of standardized phrases, and there are only a few documents related to “Fisheries.”

Table 3. Ratio of the categorized titles

| Part | Categorized | Not categorized | Total | Ratio |
| --- | --- | --- | --- | --- |
| The first part (1895–1900) | 23,116 | 1,869 | 24,985 | 93% |
| The second part (1901–1904) | 23,655 | 855 | 24,510 | 97% |
| The third part (1905–1908) | 23,054 | 696 | 23,750 | 97% |
| The fourth part (1909–1912) | 24,467 | 313 | 24,780 | 99% |
| The fifth part (1913–1914) | 8,691 | 168 | 8,859 | 98% |
| Total | 102,983 | 3,901 | 106,884 | 96% |

Table 4. Topic categories and examples of categorized words

| Category | Words | Examples of categorized words |
| --- | --- | --- |
| Personnel Shift | 119 | 任官 (Commissioner), 採用 (Recruitment), 辞令 (Resignation), 人員 (Personnel), 雇員 (Employer), 職員 (Employee), 技師 (Engineer), 官吏 (Official) |
| Publicity | 29 | 周知 (Publicity), 掲載 (Publication), 公布 (Promulgation), 告示 (Announcement), 訓示 (Notice), 訓諭 (Reminder) |
| Rules | 29 | 官制 (Government System), 制定 (Establishment of Law), 法律 (Law), 法令 (Law), 律令 (Statute), 条例 (Ordinance), 勅令 (Decree) |
| Office Work | 20 | 事務 (Office Work), 書式 (Document Form), 署名 (Signature), 印章 (Seal), 届出 (Notification), 統計 (Statistics) |
| Permission | 8 | 認可 (Authorization), 許可 (Permission), 免状 (Diploma), 免許 (License), 特許 (Patent) |
| Hygiene | 91 | 衛生 (Hygiene), 病院 (Hospital), 医師 (Doctor), 製薬所 (Pharmacy), 薬剤師 (Pharmacist), 病人 (Sick Person), 患者 (Patient) |
| Livestock | 18 | 家畜 (Livestock), 牧畜 (Pasture), 牛 (Cattle), 豚 (Pigs), 獣疫 (Veterinary Diseases), 疫牛 (Plague Cattle), 疫豚 (Plague Swine) |
| Economy and Tax | 36 | 会計 (Accounting), 予算 (Budget), 税 (Tax), 税法 (Tax Law), 地租 (Land Tax), 納税 (Tax Payment), 関税 (Customs) |
| Judgment | 53 | 犯罪 (Crime), 非行 (Delinquency), 強盗 (Robbery), 密輸 (Smuggling), 懲罰 (Punishment), 死刑 (Death Penalty) |
| Infrastructure | 21 | 水道 (Water Supply), 電気 (Electricity), 瓦斯 (Gas), 土木 (Civil Engineering), 官有地 (Government-owned Land), 地籍 (Land Ownership) |
| Police | 16 | 警察 (Police), 警察署 (Police Station), 派出所 (Dispatch Station), 警部 (Police Officer), 警保 (Security) |
| Correspondence | 14 | 郵便局 (Post Office), 電信 (Telegraph), 郵便及電信局 (Postal and Telegraph Office), 送金 (Remittance), 為替 (Exchange) |
| Education | 26 | 日本語 (Japanese Language), 中国語 (Chinese Language), 生徒 (Students), 学校 (School), 大学校 (University), 教育 (Education) |
| Monopoly | 16 | 専売 (Monopoly), 塩田 (Salt Field), 煙草 (Tobacco), 樟脳 (Camphor), 阿片 (Opium) |
| Establishment and Closure | 13 | 開設 (Opening), 設置 (Installation), 移転 (Relocation), 開庁 (Opening Agency), 閉庁 (Closed Agency) |
| Traffic | 35 | 交通 (Traffic), 運搬 (Transportation), 輸出入 (Import/Export), 海事 (Maritime), 港湾 (Harbor), 灯台 (Lighthouse), 浮標 (Buoy) |
| Disaster and Welfare | 22 | 救恤 (Rescue), 扶助 (Assistance), 自殺 (Suicide), 年金 (Pension), 罹災 (Disaster), 台風 (Typhoon), 水害 (Flood) |
| Mining and Engineering | 12 | 鉱山 (Mine), 鉱業 (Mining), 石炭 (Coal), 硫黄 (Sulfur), 金銅 (Gold Copper), 石油 (Petroleum), 工場 (Factory), 工業 (Industry) |
| Rebellion and Indigenous | 22 | 土匪 (Indigenous), 賊匪 (Bandit), 凶徒 (Villain), 生蕃 (Rebellious Indigenous), 熟蕃 (Cooperative Indigenous), 蕃務 (Indigenous Operations) |
| Military | 56 | 大本営 (Headquarters), 海軍 (Navy), 陸軍 (Army), 軍隊 (Military), 軍務 (Military Affairs), 軍艦 (Warship), 陸軍大臣 (Army Minister) |
| Agriculture | 33 | 糖業 (Sugar Industry), 甘蔗 (Sugarcane), 果樹 (Fruit Tree), 茶樹 (Tea Tree), 綿花 (Cotton), 護謨 (Rubber), 椰子 (Coconut) |
| International | 25 | 国際 (International), 公法 (Public Law), 条約 (Treaties), 外国 (Foreign Countries), 欧米 (Europe and the U.S.), 領事 (Consulate) |
| Residents | 7 | 住人 (Residents), 戸口調査 (Household Survey), 戸口簿 (Family Register) |
| Imperial Family and Rituals | 28 | 天皇 (Emperor), 明治天皇 (Emperor Meiji), 皇太子 (Crown Prince), 皇后 (Empress), 殿下 (His Highness), 祭典 (Celebrations) |
| Religion | 29 | 宗教 (Religion), 布教 (Mission), 神社 (Shrine), 仏教 (Buddhism), 本願寺 (Honganji), 延暦寺 (Enryakuji), 護国寺 (Gokokuji) |
| Fisheries | 7 | 水産業 (Fisheries), 水産物 (Fishery Products), 漁船 (Fishing Boats), 漁夫 (Fishermen) |

Time Series Change by Topic Categories and Locations

In order to extract the chronological shifts in the contents of the documents of the Government-General of Taiwan, we counted the annual number of titles for each topic category; figure 1 depicts these shifts. Figure 1 illustrates the increase in the total number of documents according to the development of the Government-General of Taiwan, but the differences between all twenty-six topics were difficult to see in this figure. Therefore, we separated the topics into three groups to depict the chronological shift in the ratio of the number of titles for each topic category to the number of all of the documents in figures 2–4.
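The annual counts and ratios described above can be sketched as follows. This is a minimal illustration, assuming that each title has already been assigned a year and a list of topic categories, and that the ratio for a topic in a given year is taken over all titles of that same year (the text does not state the denominator explicitly). The function name `yearly_topic_ratios` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict

def yearly_topic_ratios(records):
    """records: iterable of (year, [categories]). Returns {year: {topic: ratio}}."""
    totals = Counter()                 # number of titles per year
    per_topic = defaultdict(Counter)   # number of titles per year and topic
    for year, categories in records:
        totals[year] += 1
        for topic in set(categories):  # count each title once per topic
            per_topic[year][topic] += 1
    return {year: {topic: n / totals[year] for topic, n in per_topic[year].items()}
            for year in totals}

# Invented records for illustration only.
records = [
    (1897, ["Personnel Shift"]),
    (1897, ["Personnel Shift", "Rules"]),
    (1909, ["Hygiene", "Livestock"]),
    (1909, ["Rules"]),
]
ratios = yearly_topic_ratios(records)
```

Because one title can belong to several categories, the ratios within a year need not sum to 1.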

In the topic categories, the ratios of “Personnel Shift,” “Publicity,” and “Office Work” decrease throughout the period when the documents under study were produced. Inversely, the ratios of “Rules,” “Permission,” “Hygiene,” “Livestock,” “Infrastructure,” “Education,” “Monopoly,” “Agriculture,” and “Disaster and Welfare” increase throughout the period. During the period of the establishment of the Government-General of Taiwan, personnel affairs and office-based work were the largest part of the work. Then, the promotion of industry through the development of laws and expansion of infrastructure gradually became more important. The chronological shift suggests that the Government-General of Taiwan had a goal of economic development, which it pursued by introducing a monopoly system because the main industries were dependent on agricultural and livestock products. These characteristics match the results of previous historical research about agricultural aspects of the Government-General of Taiwan (Huang and Asamoto 2002; Huang and Asamoto 2006; Nakajima 2006).

However, several topics peaked in specific years, including “Rebellion and Indigenous” (1897), “Correspondence” (1898–1901), “Judgment” (1901–2), “Economy and Tax” (1904–5), and “Traffic” (1905–8). Although the absolute value is small, the category “Rebellion and Indigenous” quickly increased by a factor of 4, then quickly decreased by half compared to the peak. These shifts of peaks would suggest that there was a transition between important issues managed by the Government-General of Taiwan, from dealing with local resistance movements to the development of communication methods, legal systems, economic systems, and transportation networks.

Figure 1. Chronological shifts in the numbers of the categorized titles
Figure 2. Chronological shifts in the percentages of the categorized titles, group 1
Figure 3. Chronological shifts in the percentages of the categorized titles, group 2
Figure 4. Chronological shifts in the percentages of the categorized titles, group 3

Relationships of the Topic Categories

In order to extract the relationships between the topic categories, we investigated the co-occurrence of several topic words in one title (where one title is assigned to multiple topic categories). Based on the number of co-occurrences, we calculated the relationships between the topic categories using the Jaccard similarity coefficient, which is used as a general indicator in network analysis. A Jaccard similarity coefficient between A and B is defined as the number of co-occurrences of A and B divided by the total number of appearances of A or B. It is generally regarded that a larger Jaccard similarity coefficient signifies stronger relationships. Figure 5 depicts the relationships between the topic categories as a network based on the frequency of appearance and co-occurrence of the topic categories. Figure 5 visualizes only the edges corresponding to the larger Jaccard similarity coefficients, and the font size of each node corresponds to the frequency of appearance of each topic category.
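The Jaccard similarity coefficient defined above can be computed from the category assignments as in the following sketch. It reads "the total number of appearances of A or B" as the number of titles assigned to A or B (or both), that is, count(A) + count(B) - count(A and B); this reading is an assumption. The function name `jaccard_between_topics` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def jaccard_between_topics(title_categories):
    """title_categories: iterable of category lists, one list per title.

    Returns {(topic_a, topic_b): jaccard} for every pair that co-occurs at least once.
    """
    appear = Counter()    # titles per topic
    together = Counter()  # titles per unordered topic pair
    for categories in title_categories:
        topics = sorted(set(categories))
        appear.update(topics)
        together.update(combinations(topics, 2))
    return {pair: n / (appear[pair[0]] + appear[pair[1]] - n)
            for pair, n in together.items()}

# Invented assignments for four titles.
sample = [
    ["Livestock", "Hygiene"],
    ["Livestock", "Traffic"],
    ["Hygiene"],
    ["Livestock", "Hygiene"],
]
similarities = jaccard_between_topics(sample)
```

Here "Hygiene" and "Livestock" each appear in three titles and co-occur in two, giving 2 / (3 + 3 - 2) = 0.5. A network such as the one in figure 5 could then be drawn by keeping only the pairs whose coefficient exceeds a chosen threshold.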

Figure 5. Visualization of the network relationships between topic categories
Figure 6. Visualization of the network relationships between thematic topic categories

Figure 6 also shows only the edges corresponding to the larger Jaccard similarity coefficients; the nodes related to office work (“Publicity,” “Rules,” “Office Work,” “Permission,” and “Personnel”) were excluded for the sake of readability.

In the center of the network in figure 5, “Personnel Shift” (“Personnel” in fig. 5), “Publicity,” “Rules,” and “Office Work” (“Office” in fig. 5) are tightly coupled. These topic categories correspond to the foundation of the work of the Government-General and are connected to other topic categories. “Permission” became a hub for “Economy and Tax” (“Economy” in fig. 5), “Monopoly,” “Education,” “Infrastructure,” “Agriculture,” and “Mining and Engineering” (“Mining” in fig. 5). These topic categories would be strongly reliant on the permission of the government.

As illustrated in figures 5 and 6, “Livestock” was connected to “Hygiene” and “Traffic.” Outbreaks of infectious diseases in livestock were a serious problem; therefore, it became necessary to block traffic when outbreaks occurred to reduce the spread of disease. Also, the connection between “Rebellion and Indigenous” (“Rebellion” in fig. 5), “Judgment,” and “Police” would correspond to government actions to cope with independence movements. These characteristics of the network structure correspond to historical facts and events, and it would be possible to visualize the chronological shift in the network of the topic categories by selecting documents from a specific period. Moreover, it would be possible to visualize the geographical characteristics by combining the geographical data described below.

Topic Category Changes Related to Location

To analyze the geographical characteristics of these data, it is necessary to define specific regions. However, the administrative divisions changed repeatedly in the period of the Government-General of Taiwan. Therefore, it is difficult to identify the corresponding former place names and their precise geographic locations. In this research, we analyzed the place names of larger regional divisions such as county (縣) and city (市), and the lower categories of rural township, urban township, and district (鄕, 鎭, 區), based on the current administrative divisions in Taiwan. Among these place names, the corresponding earlier place names that appeared more than several times in the titles were registered in the place-name dictionary. In addition, we added to the place-name dictionary the place names that appeared frequently in the dictionary that we developed using the morphological analysis. These place names were grouped at the levels of county (縣) and city (市), as listed in table 5. However, as Taipei County became New Taipei City in 2010, Taipei and New Taipei City were combined and treated as Taipei. In addition, Hsinchu City was integrated into Hsinchu County, and Chiayi City was integrated into Chiayi County. Kinmen and Lienchiang were excluded because they were not under Japanese rule during the period under study. The overlapping place names of Xingang (新港), Sihu (四湖), Dahu (大湖), Neipu (内埔), Zhenke (針珂), and Shuzijiao (樹仔脚) were omitted because it is difficult to identify these only by the title. Table 5 shows the number of place names that are included in the corresponding county and city regions, and the number of titles that include those place names. Figure 7 depicts the chronological shifts of these place names in the target titles.
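The grouping of place names into regions can be sketched as a simple lookup. The mapping below is a hypothetical, abbreviated stand-in for the place-name dictionary (whether these particular names were registered in the study is an assumption), the titles are invented, and matching is again done by substring search. Ambiguous names such as 新港 are simply left out of the mapping, mirroring their omission in the text.

```python
# Hypothetical excerpt of a place-name dictionary: registered name -> region.
PLACE_TO_REGION = {
    "台北": "Taipei",
    "淡水": "Taipei",     # present-day New Taipei, merged into Taipei
    "新竹": "Hsinchu",    # city and county treated as one region
    "嘉義": "Chiayi",     # city and county treated as one region
    "打狗": "Kaohsiung",  # earlier name of Kaohsiung
}

def regions_in_title(title, place_to_region=PLACE_TO_REGION):
    """Return the set of regions whose registered place names occur in the title."""
    return {region for place, region in place_to_region.items() if place in title}

# Invented example titles.
r1 = regions_in_title("淡水税関出張所設置")
r2 = regions_in_title("台北嘉義間電信線架設")
r3 = regions_in_title("新港公学校設置")  # ambiguous name, not registered
```

Returning a set means that a title mentioning two places in the same region is counted once for that region, which is one possible reading of the frequencies reported in table 5.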

Table 5. The regions and place names included

| Region | Place names included | Frequency of appearance |
| --- | --- | --- |
| 台北 (Taipei) | 32 | 34,142 |
| 台中 (Taichung) | 26 | 17,894 |
| 台南 (Tainan) | 29 | 24,689 |
| 桃園 (Taoyuan) | 7 | 6,717 |
| 高雄 (Kaohsiung) | 30 | 4,774 |
| 基隆 (Keelung) | 3 | 9,947 |
| 新竹 (Hsinchu) | 8 | 2,563 |
| 苗栗 (Miaoli) | 13 | 6,453 |
| 彰化 (Changhua) | 25 | 7,812 |
| 南投 (Nantou) | 11 | 6,431 |
| 雲林 (Yunlin) | 15 | 7,110 |
| 嘉義 (Chiayi) | 18 | 12,096 |
| 屏東 (Pingtung) | 16 | 9,332 |
| 宜蘭 (Yilan) | 9 | 8,095 |
| 花蓮 (Hualien) | 10 | 5,759 |
| 台東 (Taitung) | 15 | 5,920 |
| 澎湖 (Penghu) | 7 | 5,966 |

Figure 7 depicts the characteristics of each division based on whole documents; therefore, it is difficult to interpret the meaning of graph shapes which are composed of various topics. However, it becomes possible to extract chronological and geographical shifts in specific topics when we combine the geographic information based on the place names with the topics. For example, figure 8 depicts the chronological and geographical shifts in the topic “Hygiene.” Unlike most of the other place names (fig. 7), Tainan peaked in 1909. Moreover, shifts in the peaks from Tainan and Yunlin to Taichung were observed, and these would correspond to the outbreak of pig cholera at that time.
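Combining the place-name information with the topic categories amounts to counting titles per region and year for a chosen topic. The following sketch assumes each title already carries a year, a list of regions, and a list of categories; the function name `topic_counts_by_region_year` and the records are hypothetical and invented for illustration.

```python
from collections import Counter

def topic_counts_by_region_year(records, topic):
    """records: iterable of (year, regions, categories).

    Returns a Counter keyed by (region, year) for titles assigned to `topic`.
    """
    counts = Counter()
    for year, regions, categories in records:
        if topic in categories:
            for region in set(regions):  # count each title once per region
                counts[(region, year)] += 1
    return counts

# Invented records, not real catalog data.
records = [
    (1909, ["Tainan"], ["Hygiene", "Livestock"]),
    (1909, ["Tainan", "Yunlin"], ["Hygiene"]),
    (1910, ["Taichung"], ["Hygiene"]),
    (1910, ["Taichung"], ["Rules"]),
]
hygiene = topic_counts_by_region_year(records, "Hygiene")
```

Plotting such counts by year, with one line per region, would give a chart of the kind described for figure 8.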

Figure 7. The local characteristics of the whole collection of documents
Figure 8. The local characteristics of the topic “Hygiene”

Conclusions and Future Work

In this research, based on the digitized text of the titles and promulgation dates in the Catalog Database for the Administrative Documents of the Government-General of Taiwan, we undertook a quantitative analysis of the characteristics. First, we developed a historical and cultural word dictionary for a morphological analysis. Then, we used random sampling and manual coding to develop the topic categories of the documents and a list of characteristic words for these categories. We extracted the overview of the chronological characteristics of the administrative documents of the Government-General of Taiwan quantitatively according to the list of words in the topics. We then visualized the relationships between these topics as a network. Moreover, by combining geographical information with the other data we had, we clarified that it is possible to extract chronological and geographical characteristics.

Although there are various limitations to using a catalog analysis, the results of this research clarify the value and potential of the digitized full-text data of the administrative documents of the Government-General of Taiwan. Therefore, we conclude that digital archiving in full-text format for the administrative documents of the Government-General of Taiwan would be useful for historical and social studies.

Moreover, these methods can also be applied to other historical digital archives in order to comprehend the temporal and spatial characteristics of the contents of the documents.

© Japanese Association for Digital Humanities

This article is licensed under a Creative Commons Attribution 4.0 International license.
https://creativecommons.org/licenses/by/4.0/deed.ja