Metric-Type Identification for Multilevel Header Numerical Tables in Scientific Papers

Numerical tables are widely used to present experimental results in scientific papers. For table understanding, the metric-type is essential for discriminating the numbers in a table. Herein, we introduce a new information extraction task, i.e., metric-type identification from multilevel header numerical tables, and provide a dataset extracted from scientific papers comprising multilevel header tables, captions, and metric-types. We propose joint-learning neural classification and generation schemes featuring pointer-generator-based and pretrained-based models. Our results show that the joint models can manage both in-header and out-of-header metric-type identification problems. Furthermore, transfer learning using fine-tuned pretrained models successfully improves the performance. The domain-specific BERT-based model, SciBERT, achieves the best performance. Results achieved by a fine-tuned T5-based model are comparable to those obtained using our BERT-based model under a multitask setting.


Introduction
Tables are an effective tool for presenting data efficiently in rows and columns. In scientific papers, numerical tables are typically used to present experimental results to facilitate data analysis. Examples of numerical tables presented in scientific papers are shown in Figure 1.
Multiple categories can be represented in table headers by incorporating several header sets in a hierarchical view; such tables are known as multilevel header tables. The presentation of tables in scientific papers must adhere to strict guidelines; for example, similar types of text should be written at the same header level. Figure 1a shows a multilevel header example in the column section, with the task (Task 1 and Task 2) presented in the first header level and the metric (Prec and Rec) in the second. The table also contains a row header specifying the model (Model A, Model B, Model C, and Model D). The header information is typically limited because the table schema is unknown. However, we assume that tables presented in scientific papers adhere to the rule whereby similar types of header names are categorized at the same header level. To understand the numbers in a table, metric-types are essential for discriminating them: numbers are comparable only within the same metric-type across different categories. For the table shown in Figure 1a, we cannot compare the number 60 for Model A in the first column with 60 in the second column because they represent different metric-types, Prec and Rec. Computing over numbers of different metric-types will result in inaccurate analyses.
Header names may be written differently in different tables, e.g., using abbreviations such as p, pre, or prec to refer to precision. Due to the lexical diversity of header names, metric-type identification becomes more challenging. Using rule-based metric-type tagging or a limited set of metric-types in a dictionary is insufficient to encompass the diversity of metric-types. As tables presented in scientific papers are typically provided with logical captions and a logical categorization of the header level, we introduce a metric-type identification task that locates the metric-type in the headers using the caption and header names as inputs. For the example shown in Figure 1a, the metric-type is indicated at the second level of the column header.
Furthermore, we consider tables that do not include metric-types in their headers (out-of-header), as shown in Figure 1b. In such cases, the metric-types are provided in the caption.
To handle tables with metric-types located both in and outside the headers, we propose a joint framework of metric-type location prediction and metric-type token generation for the metric-type identification task in multilevel header tables. We adopted a pointer-generator (See et al. 2017) to generate metric-type tokens, combined with a softmax layer to predict the metric-type location.
We collected tables from scientific papers and hired workers to identify the metric-types.
Because our annotated dataset is limited, we propose transfer learning by fine-tuning pretrained models trained on a large corpus for our task. We used BERT, a pretrained model with bidirectional transformer encoders, to take advantage of its ability to capture context from both directions. To perform the prediction and generation tasks required to solve our metric-type identification problem, we fine-tuned a pretrained encoder-decoder, T5, which has successfully solved multitask NLP problems with its unified framework.
Our contributions are as follows:
• We introduce a metric-type identification task for multilevel header tables and propose joint location prediction and generation models to solve the task.
• We provide a dataset comprising multilevel header numerical tables, captions, and metric-types extracted from scientific papers. Our dataset is publicly available. 1
• We introduce a multilevel header table encoder mechanism to obtain table header representations and propose a pointer-generator-based model to cover out-of-header cases in the metric-type identification task.
• We fine-tune a general pretrained encoder (BERT) and a domain-specific encoder (SciBERT) for our task and present the experimental results. We show that models incorporating the pretrained encoders yield significant performance gains, particularly the domain-specific encoder.
• We fine-tune a pretrained encoder-decoder, T5, for our task in a multitask setting and present the experimental results.


Related Studies
Table information extraction is beneficial for covering unknown table schemas and understanding table contents. Milosevic et al. (2019) proposed a framework for table information extraction in biomedical domains by defining rules for all possible variables. Specifically, for numerical variables, they retrieved metric-types by searching a set of possible tokens in their dictionary. Focusing on numerical tables, Nourbakhsh et al. (2020) extracted metric-types in earnings reports by using similarity scores between stored metric-types and the corresponding non-numeric text of the leftmost cells. They investigated only header texts to identify metric-types from a limited set of tokens in a vocabulary. To deal with out-of-vocabulary issues, we elaborate the caption information as additional input to our proposed framework.
The study that is most similar to ours is that of Hou et al. (2019), who used tables from the experimental-results section, combined with the title and abstract, as document representations. Pretrained models can be transferred to downstream NLP tasks, thereby obviating the necessity to train a new model from scratch. Due to the effectiveness of transfer learning for a limited dataset, we propose fine-tuning pretrained models and utilizing universal language representations trained on a large corpus. A pretrained model that appropriately facilitates our framework for understanding the context of our table representation is BERT. We used the original BERT model (Devlin et al. 2019) trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

1 Dataset is available at https://github.com/titech-nlp/metrictable
To improve contextualized representations in the scientific domain, Beltagy et al. (2019) introduced a domain-specific BERT model, i.e., SciBERT, which was trained on 1.14M papers from Semantic Scholar. Friedrich et al. (2020) used SciBERT on their models to solve the information extraction task in the same domain and achieved significant performances. Similarly, we fine-tuned SciBERT in our proposed BERT-based model as we used scientific papers as our dataset sources.
To encompass the prediction and generation tasks of our proposed framework, we fine-tuned a pretrained encoder-decoder, T5, which can be easily adapted to multitask settings. Raffel et al. (2020) introduced T5 as a unified framework that converts NLP problems into text-to-text tasks using the same loss function and decoding procedure. They demonstrated that their approach can be successfully applied to various tasks such as summarization, question-answering, and natural language inference.

Dataset
We automatically extracted tables from PDF files of scientific papers in the computational linguistics domain using PDFMiner and Tabula as extraction tools, and used only numerical tables associated with experimental results, identified using the keywords evaluation, result, comparison, and performance. We used papers from the ACL and EMNLP conferences (2016 to 2019) on the ACL Anthology website as data sources.
In tables presented in scientific papers, information regarding table semantics is rarely provided. Based on the manner by which information is "read" from a table, Hurst (2000) separated functional table areas into access cells and data cells. Access cells comprise column headers and/or row headers. We define the data structure based on their functional areas: table caption (capt), row headers (rh), column headers (ch), and cells. The headers in the row and column have several levels, and we assume that header names at the same level are of the same type. Figure 2 shows the structure of the table.
We asked several qualified workers in the computer science field to manually verify the extracted table structure to ensure the separation of row headers, column headers, and cells, as shown in Figure 3. Based on the table structure, a header-level was defined as the location order of a group of header names in the same column of row headers, or in the same row of column headers. Subsequently, the workers annotated the metric-types, which are a unit of measurement for numbers in the table, by prioritizing the location of the metric-type in a specific header-level.
Detailed annotation instructions are provided in Figure 7 in the Appendix. The annotators successfully identified the metric-types of approximately 70% of the tables in their headers, and they determined the metric-types of the remainder based on information provided in the table captions. When no metric-type was provided in the headers, we assumed the metric-type was the same for all values in the table. The structure of the example shown in Figure 3 is capt: "model comparison in task 1 and 2"; rh level 1: (models, models, models, models); rh level 2: (model a, model b, model c, model d); ch level 1: (task 1, task 1, task 2, task 2); ch level 2: (prec, rec, prec, rec); and metric-type: (prec, rec, prec, rec).
We double annotated 10% of our corpus and obtained near-perfect inter-rater agreement (0.813) using Krippendorff's alpha (Krippendorff and Craggs 2016) in identifying whether the metric-type was located in the row header, the column header, or neither. A substantial agreement (0.762) was achieved in determining metric-type tokens using the caption information for not-in-header metric-type cases.
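The annotated structure described above can be sketched as a plain record; field names (capt, rh, ch, metric_type) mirror the structure defined in this section, but the exact serialization of the released dataset is not specified here, so this layout is illustrative.

```python
# One annotated record, using the Figure 3 example from the text.
record = {
    "capt": "model comparison in task 1 and 2",
    "rh": [  # row headers, one list per header level
        ["models", "models", "models", "models"],       # level 1
        ["model a", "model b", "model c", "model d"],   # level 2
    ],
    "ch": [  # column headers, one list per header level
        ["task 1", "task 1", "task 2", "task 2"],       # level 1
        ["prec", "rec", "prec", "rec"],                 # level 2
    ],
    "metric_type": ["prec", "rec", "prec", "rec"],
}

# In this example the annotated metric-type coincides with column-header level 2.
assert record["metric_type"] == record["ch"][1]
```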
The statistics of our dataset are provided in Table 1.

Problem Definition
Let Table = (capt, rh, ch, cells) denote an n_r × n_c table with u levels of rh and v levels of ch. The task is to identify a tuple of metric-type tokens, denoted by m̂. When the tuple is in the k-th level of the row header, we extract rh_k as m̂. When the tuple is in the l-th level of the column header, we extract ch_l as m̂. If neither the row headers nor the column headers contain a tuple of metric-type tokens, we generate m̂ using information from the table caption (capt). The formulation for metric-type identification is as follows:

    m̂ = rh_k,                     if the metric-type is at row-header level k,
    m̂ = ch_l,                     if the metric-type is at column-header level l,    (1)
    m̂ = (w^m_1, ..., w^m_{n_c}),  with w^m_i ∈ W_m ∪ capt, otherwise,

where W_m is the set of metric-types in the vocabulary. If the table headers do not contain any metric-types, the table is likely to exhibit the values of a single metric-type. In other words, when m̂ is not located in header rh or ch, we find a metric-type token w_m from W_m or capt and copy w_m to all columns from 1 to n_c, such that w^m_i is the metric-type token at position i.
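The case split above can be sketched as a small function; the function name, argument names, and location labels are illustrative, not the paper's API.

```python
def identify_metric_type(rh, ch, loc, level=None, generated=None, n_cols=None):
    """Sketch of Eq. (1): extract a header level as the metric-type tuple,
    or copy a generated token to every column when it is out-of-header.
    `loc` is one of "rh", "ch", or "capt" (not-in-header)."""
    if loc == "rh":
        return rh[level]           # metric-type at row-header level k
    if loc == "ch":
        return ch[level]           # metric-type at column-header level l
    return [generated] * n_cols    # one metric-type shared by all columns

rh = [["models"] * 4, ["model a", "model b", "model c", "model d"]]
ch = [["task 1", "task 1", "task 2", "task 2"], ["prec", "rec", "prec", "rec"]]
assert identify_metric_type(rh, ch, "ch", level=1) == ["prec", "rec", "prec", "rec"]
assert identify_metric_type(rh, ch, "capt", generated="f1", n_cols=4) == ["f1"] * 4
```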

Models
We propose neural models to identify the metric-type for multilevel header tables using a joint model that enables metric-type location prediction and metric-type token generation.

Pointer-Generator Supervised Attention Model
We obtained the representations of captions and header levels using a bidirectional long short-term memory (BiLSTM) encoder and then captured the header-level weights using supervised attention between the header-level encoder and the metric-type header-location outputs. In the generation scheme, we adopted the pointer-generator network to consider captions as source texts and the metric-type vocabulary in the metric-type generation gate. The architecture of the proposed model is shown in Figure 4.
Header-level encoders Let E_rh_k denote the average vector of the initial vector representations of the row header tokens at level k. Similarly, let E_ch_l denote the average vector of the initial vector representations of the column header tokens at level l. We used a BiLSTM as our encoder, which learns bidirectional long-term dependencies between time steps of sequence data. We input a sequence of row header vectors E_rh_1:u to the BiLSTM encoder to obtain the representation of the k-th level, h_rh_k, while considering both the entire history E_rh_1:k and the entire future E_rh_k:u, as follows:

    h_rh_k = [LSTM_f(E_rh_1:k); LSTM_b(E_rh_k:u)]

Similar to the row headers, for a sequence of column header vectors E_ch_1:v, we used the BiLSTM encoder to obtain the representation of the l-th level, h_ch_l, while considering both the entire history E_ch_1:l and the entire future E_ch_l:v, as follows:

    h_ch_l = [LSTM_f(E_ch_1:l); LSTM_b(E_ch_l:v)]

To obtain the contexts of the sequence of row header levels, C_rh, and of column header levels, C_ch, we incorporated the dot attention mechanism proposed by Luong et al. (2015). We selected the hidden state of the last level and combined it with the weighted hidden states as follows:

    C_rh = [h_rh_u; Σ_k a_rh_k h_rh_k],    C_ch = [h_ch_v; Σ_l a_ch_l h_ch_l]

where the attention weights a_rh_k and a_ch_l are derived by comparing the final encoder outputs h_rh_u and h_ch_v with each source hidden state E_rh_k and E_ch_l, respectively, as follows:

    a_rh_k = softmax_k(h_rh_u · E_rh_k),    a_ch_l = softmax_l(h_ch_v · E_ch_l)

Note that [x; y] denotes the concatenation of vectors x and y.
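The dot-attention step can be illustrated with plain lists standing in for BiLSTM hidden vectors; this is a toy sketch of the Luong-style scoring described above, not the model's actual implementation, and for brevity it scores against the hidden states themselves rather than the initial embeddings.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(last_hidden, states):
    """Dot attention: score each header-level state against the final
    encoder output, then return the attention weights and the
    concatenation [last_hidden; weighted sum of states]."""
    scores = [sum(h * s for h, s in zip(last_hidden, st)) for st in states]
    weights = softmax(scores)
    dim = len(last_hidden)
    weighted = [sum(w * st[d] for w, st in zip(weights, states)) for d in range(dim)]
    return weights, last_hidden + weighted  # list concatenation = [x; y]

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per header level
weights, context = attend(states[-1], states)
assert abs(sum(weights) - 1.0) < 1e-9          # a proper distribution
assert len(context) == 2 * len(states[-1])     # concatenated context
```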
Caption encoder Let E_capt_x denote the initial vector representation of the caption token at position x. Similar to the headers, we used the BiLSTM encoder with attention a_capt_x to compute the context vector of a caption sequence, C_capt, of length t, while considering both the entire history E_capt_1:x and the entire future E_capt_x:t, as follows:

    h_capt_x = [LSTM_f(E_capt_1:x); LSTM_b(E_capt_x:t)],    C_capt = Σ_x a_capt_x h_capt_x

Metric-type header-location gates We input the concatenation of the row and column header contexts to the softmax layer with linear transformation to obtain the metric-type header-location probability, as follows:

    [p_rh, p_ch, p_capt] = softmax(W_hloc [C_rh; C_ch] + b_hloc)

which includes the probabilities of the metric-type being located in the row headers (p_rh) or the column headers (p_ch), or not being located in the headers (p_capt), where p_rh + p_ch + p_capt = 1.

Metric-type header-level gates
Since the attention scores a_rh_k and a_ch_l capture the relevant header-level information in rows and columns, these attention scores are used as header-level weights, as follows:

    p_hlvl_i = a_rh_i        if i ≤ u,
    p_hlvl_i = a_ch_{i−u}    if i > u,

where i ∈ {1, ..., u, (u + 1), ..., (u + v)} is a header-level index.

Metric-type generation gates
In our pointer-generator network, we used the sigmoid layer with linear transformation to obtain a switch copy probability, as follows:

    p_copy = σ(W_copy [C_capt; C_rh; C_ch] + b_copy)

which allows us to select between copying a word w_capt from the table caption and generating a word w_m from the metric-type vocabulary, where p_copy ∈ [0, 1]. We used a softmax function with linear transformation to compute the probability distribution over the metric-type vocabulary:

    P_vocab = softmax(W_v C_capt + b_v)

Subsequently, we obtained the probability distribution over the extended vocabulary, as follows:

    P(w_i) = (1 − p_copy) P_vocab(w_i) + p_copy Σ_{x : w_capt_x = w_i} a_capt_x

where i is the index of metric-type tokens in the vocabulary.
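The extended-vocabulary mixing in the style of See et al. (2017) can be sketched over plain dicts; the token values and probabilities below are toy inputs, not from the paper.

```python
def extended_vocab_dist(p_copy, p_vocab, caption_tokens, caption_attn):
    """Pointer-generator mixing: with probability p_copy, probability mass
    comes from the caption attention; otherwise from the metric-type
    vocabulary distribution. Inputs: dict token -> prob, parallel lists."""
    dist = {w: (1.0 - p_copy) * p for w, p in p_vocab.items()}
    for tok, a in zip(caption_tokens, caption_attn):
        # attention mass for repeated caption tokens accumulates
        dist[tok] = dist.get(tok, 0.0) + p_copy * a
    return dist

p_vocab = {"accuracy": 0.7, "f1": 0.3}   # metric-type vocabulary distribution
caption = ["results", "in", "bleu"]
attn = [0.2, 0.1, 0.7]                   # caption attention weights
dist = extended_vocab_dist(0.5, p_vocab, caption, attn)
assert abs(sum(dist.values()) - 1.0) < 1e-9   # still a distribution
```

An out-of-vocabulary caption token such as "bleu" receives probability only through the copy term, which is exactly how the model covers metric-types absent from its vocabulary.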
Learning objective For training, we used the negative log-likelihood objective as the loss function. In addition, we adopted supervised attention (Liu et al. 2016) to jointly supervise the row and column header-level attention to obtain the metric-type header-level. We combined all loss functions of the location classification and token generation models, with α as the weight, as follows:

    L = − Σ_c z_hloc_c log p_c − Σ_i z_hlvl_i log p_hlvl_i − α Σ_j log P(w_j)

where c ∈ {capt, rh, ch} is the metric-type header-location class, z_hloc_c is the binary indicator (0 or 1) of each corresponding class, and z_hlvl_i is the binary indicator over the header levels: the caption, row header levels 1 to u, and column header levels 1 to v.
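The combined objective can be sketched numerically; the function signature and the toy probabilities are illustrative assumptions, but the structure (two classification NLL terms plus an α-weighted generation NLL) follows the description above.

```python
import math

def joint_loss(p_loc, z_loc, p_lvl, z_lvl, p_tokens, alpha=0.5):
    """Combined objective: NLL of the gold header-location class, NLL of
    the gold header-level (supervised attention), and the token-generation
    NLL weighted by alpha."""
    loss_loc = -sum(z * math.log(p) for p, z in zip(p_loc, z_loc))
    loss_lvl = -sum(z * math.log(p) for p, z in zip(p_lvl, z_lvl))
    loss_gen = -sum(math.log(p) for p in p_tokens)
    return loss_loc + loss_lvl + alpha * loss_gen

loss = joint_loss([0.8, 0.1, 0.1], [1, 0, 0],   # header-location probs / gold
                  [0.6, 0.4], [0, 1],           # header-level probs / gold
                  [0.9, 0.9], alpha=0.5)        # generated-token probs
assert loss > 0.0
```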

Fine-tuning BERT-based Model
After preprocessing, the input text is denoted as a sequence of tokens X = (x_1, x_2, ..., x_n).
Three types of embedding are assigned to each x_i: token embeddings representing the meaning of each token, segment embeddings indicating the segment boundaries of a sequence of tokens, and position embeddings indicating the token position within the sequence. Because BERT includes only two segment types in its input, we assigned odd-numbered segments to segment A and even-numbered segments to segment B. The sum of these three embeddings was input to the bidirectional transformer layers of BERT.
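The alternating segment assignment can be sketched as follows; placing a [CLS] token before every segment reflects the per-segment context vectors C_i described next, while details such as [SEP] handling are simplified and assumed.

```python
def build_segment_ids(segments):
    """Assign BERT segment ids to a multi-segment input: odd-numbered
    segments (1st, 3rd, ...) map to segment A (id 0) and even-numbered
    segments to segment B (id 1), as described in the text."""
    tokens, seg_ids = [], []
    for i, seg in enumerate(segments):
        tokens += ["[CLS]"] + seg
        seg_ids += [i % 2] * (len(seg) + 1)  # id covers [CLS] and the segment
    return tokens, seg_ids

# caption, then one segment per header level (toy token values)
segs = [["model", "comparison"], ["task", "1"], ["prec"]]
tokens, seg_ids = build_segment_ids(segs)
assert tokens[0] == "[CLS]"
assert seg_ids == [0, 0, 0, 1, 1, 1, 0, 0]  # A, B, A, ...
```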
We used the token representations from the top hidden layers of the pretrained transformer as context embeddings. We assumed that the context vector of each [CLS] token can represent its segment sequence effectively. As shown in Figure 5, we labeled the input embedding as E, the final hidden vector of the [CLS] token for the i-th input segment as C_i ∈ R^H, and the final hidden vector of the j-th input token as T_j ∈ R^H.
We used a metric-type header-location gate and a metric-type header-level gate for metric-type location classification, and a metric-type generation gate to generate metric-type tokens from a vocabulary encompassing out-of-header metric-types. The proposed BERT-based model architecture is shown in Figure 5.

Metric-type header-location gates
We input the first segment context C_1 to the softmax layer via linear transformation to obtain the metric-type header-location probability, as follows:

    [p_rh, p_ch, p_capt] = softmax(W_hloc C_1 + b_hloc)

Metric-type header-level gates
In our task, segments represent the table sections most closely related to the metric-type. We input each segment context C_i to the sigmoid layer via linear transformation to obtain the probability that the metric-type is located at a specific header level, as follows:

    q_i = σ(w_hlvl · C_i + b_hlvl)

Subsequently, the probabilities were normalized over all segments to give the weight score of each header level, as follows:

    p_hlvl_i = q_i / Σ_j q_j

Metric-type generation gates
We used a softmax function with linear transformation based on the first segment context C_1 to compute the probability distribution over the metric-type vocabulary, as follows:

    P_vocab = softmax(W_v C_1 + b_v)

Learning objective
We combined the loss functions of the metric-type header-location, metric-type header-level, and metric-type generation gates, as follows:

    L = L_hloc + L_hlvl + α L_gen

where α is the weight of the metric-type generation loss.
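The sigmoid-then-normalize header-level gate can be sketched in a few lines; the logit values are toy inputs, and the weight vector and bias of the linear transformation are folded into them for brevity.

```python
import math

def header_level_weights(segment_logits):
    """BERT-based header-level gate sketch: a sigmoid gives each segment an
    independent probability of holding the metric-type; the probabilities
    are then normalized across segments into weight scores."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in segment_logits]
    total = sum(probs)
    return [p / total for p in probs]

weights = header_level_weights([-2.0, 0.5, 3.0])
assert abs(sum(weights) - 1.0) < 1e-9
assert max(weights) == weights[2]  # highest-logit segment gets the most weight
```

Unlike a plain softmax over logits, the intermediate sigmoid keeps each segment's score an independent probability, which matches the per-segment gating described above.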

Fine-tuning T5-based Model
The input text for the fine-tuned T5-based model was preprocessed by inserting several specific tokens to discriminate tokens in different locations: level tokens were inserted before the row-name tokens at level i and the column-name tokens at level j, respectively, and each name was separated by the [SEP] token. We fine-tuned the T5 model to perform multiple tasks: metric-type header-location classification and metric-type token generation. For each task, we appended a prefix string to the input of the model, i.e., "identify location:" for the first task and "identify metric:" for the second task. The architecture of the proposed T5-based model is shown in Figure 6.
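Constructing the two multitask inputs can be sketched as string building. The task prefixes and the [SEP] separator come from the text above; the exact level-marker tokens are not specified there, so the plain "rh1:"/"ch1:" markers below are purely illustrative.

```python
def build_t5_inputs(capt, rh_levels, ch_levels):
    """Build the two T5 inputs by prepending a task prefix to a shared,
    linearized table representation (caption + header levels)."""
    parts = [capt]
    for i, names in enumerate(rh_levels, 1):
        parts.append(f"rh{i}: " + " [SEP] ".join(names))
    for j, names in enumerate(ch_levels, 1):
        parts.append(f"ch{j}: " + " [SEP] ".join(names))
    body = " ".join(parts)
    return "identify location: " + body, "identify metric: " + body

loc_in, met_in = build_t5_inputs(
    "model comparison in task 1 and 2",
    [["model a", "model b"]],
    [["prec", "rec"]],
)
assert loc_in.startswith("identify location:")
assert met_in.startswith("identify metric:")
```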

Baseline Model
Because our tasks pertain primarily to location classification and token prediction from metric-type vocabularies, we selected the SVM as a baseline owing to its effectiveness in high-dimensional spaces, particularly in text classification. We used two SVM classification models as baselines: a metric-type location prediction model and a metric-type token prediction model over the vocabulary of metric-types. We used the tf.idf of the concatenated header-name tokens for all levels as the input representation of the first model and the tf.idf of the caption tokens for the second. We performed a grid search to tune the hyperparameters of the SVM on the development set over the search space c ∈ {0.1, 1, 10, 100, 1000} and gamma ∈ {0.001, 0.0001}; subsequently, we selected the c and gamma parameters that yielded the best accuracy.
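The tf.idf input representation can be sketched with the standard library; this is a minimal textbook variant, whereas the actual experiments would typically rely on a library vectorizer (e.g., scikit-learn's TfidfVectorizer) with its own smoothing conventions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain tf.idf vectors like those fed to the SVM baselines.
    tf = term frequency within a document; idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return vectors

docs = [["prec", "rec", "prec"], ["bleu", "rec"]]
vecs = tfidf(docs)
assert vecs[0]["prec"] > 0.0   # term unique to one document gets weight
assert vecs[0]["rec"] == 0.0   # term in every document: idf = log(1) = 0
```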

Evaluation Metrics
The accuracy metric was used to evaluate the metric-type location and the generated metric-type tokens.

Metric-type location accuracy
The target of the metric-type location prediction model is whether the metric-type is located in the row headers, the column headers, or neither. The accuracy of the header-location (acc_hloc) is the rate of correct header-location predictions.
Details regarding the metric-type location at the header level are required to identify metric-type token lists. We computed the accuracy of the metric-type header-level (acc_hlvl) as the ratio of correct header-level predictions to the total number of predictions.
Metric-type token accuracy Let m̂ = (ŵ_m1, ..., ŵ_mn) denote the sequence of predicted metric-type tokens for n_r rows or n_c columns (depending on the header-location prediction), and let m = (w_m1, ..., w_mn) denote the target sequence, e.g., m̂ = (f1, f1, f1) and m = (f-1, f-1, f-1). We calculate the metric-type token accuracy using exact string matching of the whole token lists, i.e.,

    acc_list = 1[m̂ = m]

and string matching of each token pair ŵ_mi and w_mi in the token lists, i.e.,

    acc_pair = (1/n) Σ_i 1[ŵ_mi = w_mi]

To account for token predictions involving an abbreviation, we compute the metric-type token accuracy based on ordered character matching, as follows:

    acc_char = d / n

where d is the number of predicted tokens ŵ_mi whose characters appear in w_mi in the same order. For example, the predicted token RG1 is regarded as correct when the reference token is ROUGE-1.
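The ordered character matching is a subsequence test, sketched below; case and punctuation handling are assumptions, since the text does not specify them beyond the RG1/ROUGE-1 example.

```python
def chars_in_order(pred, target):
    """True if every character of `pred` appears in `target` in the same
    order (a subsequence test), e.g., "rg1" inside "rouge-1"."""
    it = iter(target.lower())
    # `c in it` advances the iterator until c is found, enforcing order
    return all(c in it for c in pred.lower())

def token_accuracy(pred_tokens, gold_tokens, match=chars_in_order):
    """Fraction of position-wise matches between predicted and gold tokens."""
    pairs = list(zip(pred_tokens, gold_tokens))
    return sum(match(p, g) for p, g in pairs) / len(pairs)

assert chars_in_order("rg1", "rouge-1")       # abbreviation accepted
assert not chars_in_order("f2", "f-1")        # wrong characters rejected
assert token_accuracy(["f1", "f1"], ["f-1", "f-1"]) == 1.0
```

Plain `str.__eq__` can be passed as `match` to recover the exact-string accuracies above.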

Implementation Details
We implemented our models using the AllenNLP library (Gardner et al. 2018). In our pointer-generator-based model, we used pretrained word embeddings for initialization and two-layer BiLSTMs with a hidden size of 256 in both the caption and header-level encoders. We added dropout (Srivastava et al. 2014) with probability p = 0.1 to our header and caption encoders. We evaluated our models using k-fold cross-validation. Because our metric-type location classes are imbalanced, with row-header metric-type instances below 2%, we used k = 5 to increase the probability of minority-class instances being evaluated.
To perform optimization in the training phase, we used Adam as the optimizer with a batch size of 10 and learning rates of 3 × 10^−3 and 3 × 10^−5 in the pointer-generator-based and BERT-based models, respectively, with a slanted triangular schedule (Howard and Ruder 2018). We trained each model for a maximum of 20 epochs with early stopping on the validation set (patience of 10), and we set α to 0.5. We used the original BERT and the domain-specific SciBERT uncased models to fine-tune our BERT-based model. For our T5-based model, we fine-tuned the model using the Adafactor optimizer with a constant learning rate of 0.001 (Raffel et al. 2020).
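The slanted triangular schedule can be sketched as a short linear warm-up followed by a long linear decay, in the style of Howard and Ruder (2018); cut_frac and ratio are that paper's defaults, and this is a sketch rather than AllenNLP's exact implementation.

```python
def slanted_triangular_lr(step, total_steps, lr_max, cut_frac=0.1, ratio=32):
    """Learning rate at `step`: rise linearly to lr_max over the first
    cut_frac of training, then decay linearly to lr_max / ratio."""
    cut = int(total_steps * cut_frac)
    if step < cut:
        p = step / cut                                   # warm-up fraction
    else:
        p = 1.0 - (step - cut) / max(1, total_steps - cut)  # decay fraction
    return lr_max * (1.0 + p * (ratio - 1.0)) / ratio

lrs = [slanted_triangular_lr(t, 100, 3e-5) for t in range(100)]
assert max(lrs) == lrs[10]   # peak right after the warm-up
assert lrs[0] < lrs[10]      # warm-up rises
assert lrs[99] < lrs[10]     # decay falls
```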

Model comparison
The performances of the proposed and baseline models are shown in Table 2. As shown, the pointer-generator supervised attention model initialized with GloVe embeddings outperformed the baseline (SVM) in predicting the metric-type location. The accuracy of this model for metric-type generation was also better than that of the baseline. The slight difference between the pointer-generator supervised attention model and the SVM implies that the deep neural network architecture afforded minimal improvement on our limited dataset; adding more data is suggested when training a deep learning model from scratch. Furthermore, we demonstrated that using only the pretrained embeddings of a large pretrained model decreased the accuracy: the performance of our pointer-generator-based model deteriorated significantly when the input was initialized using BERT or SciBERT embeddings.
We fine-tuned the large pretrained BERT, SciBERT, and T5 models to exploit deep neural models trained on a larger corpus. For our fine-tuned BERT-based models, we used the context representations of their transformer encoders and added gates to solve our tasks. The accuracy of our fine-tuned BERT-based models was significantly better than that of the pointer-generator-based models trained on our limited corpus, achieving header-location and header-level prediction accuracies exceeding 88% and a generation accuracy improvement exceeding 3 percentage points.
The fine-tuned BERT-based model using the domain-specific SciBERT led to significant improvements in all metrics because its corpus is similar to ours.
The performance of our fine-tuned T5 with an encoder-decoder architecture is comparable to that of the fine-tuned encoder-only BERT model. The unified framework of T5 was easily adapted to our tasks by inserting different prefixes in the input. Unlike the fine-tuned BERT-based models, no additional layer was used in the fine-tuning of T5.

Table 3: Accuracy scores (%) of the ablation test of our pointer-generator-based model obtained using five-fold cross-validation. Scores of the "without copy" and "without copy and generation" models differed significantly from the proposed pointer-generator model under a paired bootstrap resampling test (p < 0.05).

Effect of copy mechanism
We evaluated our pointer-generator-based model using an ablation test, as presented in Table 3. As shown, the performance of our generation model without a copy mechanism decreased, which shows that incorporating the copy mechanism is beneficial for metric-type token generation. Our model demonstrated the worst accuracy when it was executed without a pointer-generator network because the location prediction model alone failed to manage out-of-header metric-types.
Table 4 shows the effect of segment embeddings in our BERT-based model. The accuracies of the fine-tuned BERT and SciBERT models without segment embeddings both decreased. This implies that segment embeddings successfully discriminated the header-level boundaries in the input representation of the BERT-based models.

Qualitative Analysis
As introduced in the model formulation (Eq. 1), our models cover row-header, column-header, and not-in-header metric-types based on the metric-type location. We evaluated the performance of our models for each location category based on the precision (P), recall (R), and F1-score (F) of the metric-type location prediction task, as shown in Table 5.

Table 5: Performance scores (%) of metric-type location prediction for each header-location class using five-fold cross-validation. Bold indicates the best score.
As shown in Table 5, F-SciBERT demonstrated the best performance for all metric-type location cases. Due to the small number of instances in the dataset, the overall performance of the row-header metric-type category was worse than that of the others. Applying a specific procedure to handle the imbalanced class problems is left for future work.
The pointer-generator-based and fine-tuned BERT-based models outperformed the baseline in predicting row-header and column-header metric-type cases. Specifically, a significant margin was observed in predicting the row-header metric-type, the minority class. This indicates that combining the metric-type header-location and header-level gates improved the models' ability to determine the metric-type location in all cases. The text-to-text mechanism of the fine-tuned T5 failed to perform the metric-type location prediction for the minority cases.
For the not-in-header metric-type cases, the models successfully generated metric-type tokens from the vocabulary. Because we added the copying ability to our pointer-generator-based models, we present both the generating and copying performances in Table 5. As shown, our proposed model with copying performed better than the purely generating model.
Additionally, we investigated the errors in the predicted metric-type tokens. We discovered that the models sometimes generated overly generic metric-types; for example, they extracted score as a prediction for the target accuracy. In other cases, our models generated terms similar to the ground-truth metric, such as the metric-type pearson's for the target r. Examples of table captions, headers, and metric-types for each metric-type location are shown in Table 6.

Conclusion
In this study, we extracted multilevel header numerical tables