EARLY ONLINE RELEASE

Context.— The terminology used by pathologists to describe and grade dysplasia and premalignant changes of the cervical epithelium has evolved over time. Unfortunately, coexistence of different classification systems combined with nonstandardized interpretive text has created multiple layers of interpretive ambiguity.

Results.— The NLP algorithms yielded a precision of 0.957, a recall of 0.925, and an F score of 0.94. Additionally, we estimated that the time to evaluate each monthly biopsy file was significantly reduced, from 30 hours to 0.5 hours.
Conclusions.— A set of validated NLP algorithms applied to pathology reports can rapidly and efficiently assign a discrete, actionable diagnosis using CIN classification to assist with clinical management of cervical pathology and disease. Moreover, discrete diagnostic data encoded as CIN terminology can enhance the efficiency of clinical research.
(Arch Pathol Lab Med. doi: 10.5858/arpa.2021-0410-OA)

Population-based cervical screening programs have led to substantial reduction of cervical cancer incidence and mortality worldwide. Cervical cancer screening aims at detecting cervical precancers that can be treated before invasion occurs. Evaluation of screen-positive women with colposcopy and cervical biopsy is a cornerstone of cervical screening in most settings, since histology results determine clinical management, including excisional treatment, surveillance, or return to screening. Given that cervical screening is recommended for a large proportion of the population (currently from age 21 to 65 years in the United States1) and that about 5% of screened women may undergo colposcopy, cervical biopsies are one of the most commonly performed pathology services. This high volume, combined with the need for electronic medical records to support cervical screening programs, underscores the importance of uniform classification of cervical precancers in pathology reports.
The terminology used by pathologists to describe and grade dysplasia and premalignant changes of the cervical epithelium has evolved over time.[2][3][4][5] Two classification systems are currently used to report histologic cervical precancers: Lower Anogenital Squamous Terminology (LAST)6 and cervical intraepithelial neoplasia (CIN). Management guidelines rely mostly on the CIN nomenclature; however, the utilization of both squamous intraepithelial lesion and CIN nomenclatures within a single pathology report may create ambiguity. Additional ambiguity may arise from the use of narrative free text, including variations in terminology, syntax, or modifiers, that can confound clinical interpretation and challenge the ability to capture discrete diagnoses.
Synoptic reporting, as defined by the College of American Pathologists (CAP),7 has been proposed as a solution to minimize ambiguity in a pathology report; however, it has not been widely implemented in clinical settings.
Kaiser Permanente Northern California (KPNC), an integrated health care delivery system with more than 4 million members, does not use synoptic reporting for cervical biopsies. Because the final diagnosis is unstructured, a manual review of each biopsy contained in the final pathology report is required to determine the most severe diagnosis for patient management. To improve reporting of cervical pathology results, we developed and applied a series of natural language processing (NLP) algorithms, a process often applied to extract specific outcomes from free text within the electronic medical record (EMR),[8][9][10][11][12] to unstructured cervical pathology text to identify the most severe diagnosis. The accuracy of the NLP results was compared to results from a manual review completed by cytotechnologists and pathologists; discordant manual interpretations were adjudicated by a cytopathologist. The algorithms used these diagnoses and their rank to assign a discrete outcome to each histology result and subsequently the single worst outcome to the pathology report. NLP results were compared to manually abstracted results by using individual categories to determine the exact match. Additionally, biopsy diagnoses were dichotomized as high risk (CIN2-3 or higher) and low risk (<CIN2-3) to reflect a clinical treatment threshold and a relevant cutoff used in many research studies.

Development and Validation of NLP Algorithms
Although several companies offer NLP software, we used I2E software, version 5.4.1 (Linguamatics NLP Platform, Linguamatics, an IQVIA company, Cambridge, United Kingdom), which enables text mining of unstructured text with user-created, rule-based algorithms to identify biopsy diagnoses.13 I2E is used by many top global pharmaceutical companies and health organizations and offers a graphic interface (see Supplemental Figure 4) rather than coding syntax for developing algorithms.
More than 20 algorithms were created, mostly defined on the basis of CIN classifications. Squamous intraepithelial lesion classifications were considered when CIN classifications were not specified. The algorithms were iteratively created, tested, and validated on a smaller representative training sample (N = 2213) of pathology reports from before 2019, with biopsy diagnoses determined by a cytotechnologist and adjudicated by a pathologist. Additionally, this iterative process enabled rules to be created within the algorithms to evaluate pathology report phrases that accompany the diagnosis (see Supplemental Table 2) and that could otherwise misclassify an outcome.
The outcomes from the developing NLP algorithms were compared to the outcomes of this gold standard sample. Every algorithm was modified and reevaluated until greater than 90% exact agreement, an a priori threshold, was achieved with the known outcome. Upon the assignment of a diagnostic result for each histology result (see Supplemental Figure 4) within the pathology report, the most severe diagnosis was assigned to the report (see Supplemental Figures 5 and 6). We also defined a patient as high risk if the most severe pathology report diagnosis was CIN2-CIN3 or higher.
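As a rough illustration of this kind of rule-based matching, the sketch below maps report text to a handful of CIN categories. The regular expressions, category names, and exclusion phrase are illustrative stand-ins, not the actual I2E queries or the full rule set.

```python
import re

# Illustrative rule patterns (NOT the actual I2E queries): each maps a
# regex over the report text to a discrete diagnostic category.
# Order matters: the combined CIN2-CIN3 rule must fire before CIN2 or CIN3.
RULES = [
    (re.compile(r"\bcin\s*2-?3\b|\bcin\s*ii-?iii\b", re.I), "CIN2-CIN3"),
    (re.compile(r"\bcin\s*3\b|\bcin\s*iii\b", re.I), "CIN3"),
    (re.compile(r"\bcin\s*2\b|\bcin\s*ii\b", re.I), "CIN2"),
    (re.compile(r"\bcin\s*1\b|\bcin\s*i\b", re.I), "CIN1"),
]

# Phrases that should block a match (cf. Supplemental Table 2); illustrative.
EXCLUSIONS = [re.compile(r"\bnegative for\b", re.I)]

def classify(text: str) -> str:
    """Return the first matching category, or 'Review' when no rule fires."""
    if any(p.search(text) for p in EXCLUSIONS):
        return "Review"
    for pattern, category in RULES:
        if pattern.search(text):
            return category
    return "Review"
```

In a sketch like this, the iterative development described above corresponds to adding rules and exclusion phrases until agreement with the gold standard exceeds the 90% threshold.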
The final algorithms developed with the representative training sample were applied to the study sample, a standard practice for measuring NLP accuracy.[14][15][16][17][18] To validate the NLP algorithms, a database of the study sample with a user interface was created and populated with the pathology text and the final diagnosis from the algorithms. This information was presented, unblinded to the result from the NLP algorithms, for manual cytotechnologist review and pathologist adjudication in order to assign the most severe final diagnosis to each pathology report. This validation process enabled the comparison of the final diagnostic outcome between the NLP algorithms and expert review.

Statistical Analysis
The goal was to evaluate the performance of the NLP algorithms in the study sample by measuring, for each pathology report, the level of concordance in CIN diagnosis between the algorithms and the cytotechnologist review, including expert pathologist adjudication of equivocal diagnoses.
The diagnostic result between the cytotechnologist review and the NLP algorithms was compared for each pathology report and given one of the following match criteria: "Exact Match" (same diagnostic grade), "Same Risk Match" (same risk level, ie, high risk), "Risk Mismatch" (differing risk category), and "Review" (unassigned NLP diagnosis). When comparing risk match level, high risk was defined as CIN2-CIN3 or higher. We measured the performance of the NLP algorithms by calculating the precision, recall, and F score for the exact CIN match categories and risk categories. These parameters were calculated as precision = true positives / (true positives + false positives), recall = true positives / (true positives + false negatives), and F score = 2 × (precision × recall) / (precision + recall).
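For reference, the precision, recall, and F score used here follow the standard definitions, which can be expressed as small helper functions:

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of NLP-assigned diagnoses that agree with expert review.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Fraction of expert-reviewed diagnoses that the NLP recovered.
    return tp / (tp + fn)

def f_score(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)
```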

Study Sample
Unstructured cervical pathology reports (N = 35 847) were identified by selecting reports by their unique accession number from KPNC's laboratory information system, based on cervical, uterine, and endocervical samples among patients of the KPNC health care delivery system between August 2019 and July 2020 (see Supplemental Table 1, supplemental digital content). This period represents the most recent complete year of pathology reports to which the developed NLP algorithms were applied. Each pathology report is composed of 1 or more tissue samples per patient. A narrative result, which may include histologic nomenclature and unstructured text, is given for each tissue sample (see Supplemental Figure 1). A file with a single row of data for each accession number and its corresponding histology results was created as the input for the algorithms by eliminating extraneous spaces and carriage returns (see Supplemental Figure 2). No manual manipulation was made to modify the original text of the unstructured biopsy result.
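A minimal sketch of this preprocessing step is shown below; the accession and field layout and the delimiter are assumptions, since the actual input format appears only in Supplemental Figure 2.

```python
import re

def to_single_row(accession: str, histology_results: list[str]) -> str:
    """Collapse a report's histology results into one input row:
    carriage returns/newlines and runs of whitespace become single spaces;
    the original wording is otherwise left untouched."""
    joined = " | ".join(histology_results)          # one row per accession
    cleaned = re.sub(r"\s+", " ", joined).strip()   # drop extra spaces/CRs
    return f"{accession}\t{cleaned}"
```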

Biopsy Diagnoses
To evaluate the unstructured histology results within a cervical pathology report as they relate to cervical cancer risk, we defined cervical biopsy diagnoses with discrete categorical labels and assigned a severity rank score. The diagnoses ranged from "Review" (see Supplemental Figure 3), an outcome where NLP was unable to assign any diagnosis (lowest rank), to a diagnosis of cervical cancer (highest rank) (Table 1).
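The rank-and-select logic described here can be sketched as a lookup table plus a maximum; the rank values and category labels below are illustrative placeholders rather than the actual scores in Table 1.

```python
# Illustrative severity ranks (placeholders; see Table 1 for the real scores).
SEVERITY = {
    "Review": 0,
    "Benign": 1,
    "CIN1": 2,
    "CIN2": 3,
    "CIN2-CIN3": 4,
    "CIN3": 5,
    "Adenocarcinoma in situ": 6,
    "Cervical cancer": 7,
}

def worst_diagnosis(histology_diagnoses: list[str]) -> str:
    """Assign the single most severe diagnosis to the pathology report."""
    return max(histology_diagnoses, key=SEVERITY.__getitem__)

def is_high_risk(report_diagnosis: str) -> bool:
    # High risk = CIN2-CIN3 or higher, the clinical treatment threshold.
    return SEVERITY[report_diagnosis] >= SEVERITY["CIN2-CIN3"]
```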

RESULTS
Among all the pathology reports, 32 823 of 35 847 (91.6%) final NLP-assigned diagnoses matched exactly or were classified as the same risk match with the manual validation (Table 2). A total of 2594 records (7.2%) were not assigned a diagnosis by the NLP algorithms and were categorized as "Review." Of these, the cytotechnologist review assigned 2578 (99.4%) as low-risk and 16 (0.6%) as high-risk diagnoses. The 16 high-risk diagnoses were as follows: adenocarcinoma in situ (1), CIN2-CIN3 (7), CIN3 (2), endocervical adenocarcinoma (2), microinvasive (1), and squamous carcinoma (3). The inability of the algorithms to assign a diagnostic outcome was likely due to a lack of specificity in the interpretive text or to difficulty in identifying specific combinations of CIN and non-CIN terms to assign a definitive outcome.
Among the pathology reports that were assigned a histologic diagnosis by the algorithms (excluding reports assigned "Review"), 31 815 of 33 253 (95.7%) matched exactly the diagnoses assigned at manual review by the cytotechnologist. Among the results with a "Risk Mismatch" (N = 430), 230 (53.5%) were identified by the algorithms as high risk whereas manual review determined them low risk, and 192 (44.6%) were identified by the algorithms as low risk as opposed to a high-risk determination by review (Table 3). The distribution of individual diagnoses coded from the NLP algorithms and final review among risk mismatch reports is shown in Table 4.
Using the 3 match criteria, the accuracy of the NLP algorithms was assessed by calculating precision, recall, and accuracy (F score). When defining true positives as "Exact Matches" (N = 31 815) and false positives as "Same Risk Match" and "Risk Mismatch," the precision, recall, and F score were 0.96, 0.93, and 0.94, respectively (Table 5). Accuracy improved slightly when we combined "Exact Matches" and "Same Risk Match" as true positives, with precision, recall, and F score of 0.99, 0.93, and 0.96, respectively. When records that could not be assigned a diagnosis were excluded, the highest precision, recall, and accuracy were obtained: 0.96, 1.0, and 0.98, respectively. In addition to the high level of concordance between the NLP algorithms and the manual validation, the amount of time to assign a diagnostic outcome was drastically reduced. We estimate the time to manually evaluate a pathology report to be approximately 2 minutes per report, or approximately 30 hours to evaluate 3000 pathology reports per month. The final NLP algorithms are able to evaluate the same number of reports within 30 minutes, a 98% reduction in time.
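The headline figures can be reproduced from the counts given in the text (33 253 reports assigned a diagnosis, 31 815 exact matches, 2594 "Review" reports) under the stated definitions, with exact matches as true positives, all other assigned reports as false positives, and unassigned reports as false negatives:

```python
# Counts reported in the text (cf. Tables 2 and 5):
exact_match = 31_815   # true positives: exact CIN match
assigned    = 33_253   # reports given any diagnosis by the NLP algorithms
review      = 2_594    # unassigned ("Review") reports, as false negatives

false_pos = assigned - exact_match        # "Same Risk Match" + "Risk Mismatch"
precision = exact_match / (exact_match + false_pos)
recall    = exact_match / (exact_match + review)
f_score   = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f_score, 2))
# → 0.957 0.925 0.94, matching the reported values
```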

DISCUSSION
Although cervical pathology reports provide an anatomical diagnosis, failure to include a CIN grade and/or ambiguity within the narrative text can lead to incorrect interpretation by the clinician and to inappropriate patient management. Furthermore, unstructured text challenges the ability to manually identify and extract the most severe discrete diagnosis to support screening, surveillance, and research efforts. In our study, we created and applied NLP algorithms to more than 35 000 unstructured cervical pathology reports in a large health care organization. The NLP algorithms produced greater than 0.95 precision, recall, and accuracy for both specific CIN and high-risk categories and drastically decreased the evaluation time for pathology reports.
The development of the algorithms and their performance was evaluated individually and collectively. This approach allowed us to focus on specific CIN categories, such as CIN2-CIN3, the most ambiguous in the textual interpretation of reports. Through the process of developing an algorithm for each diagnostic outcome, we gained performance insights by iteratively applying revisions to pathology reports with a previously assigned diagnosis from manual review. Upon finalizing each algorithm and applying them to the study sample, we established a high level of concordance.
Unlike other projects evaluating the accuracy of NLP algorithms, in which a training or derivation data set is used to create the algorithms that are then applied to a validation or study data set, we developed our algorithms iteratively by using a training data set that contained representative samples of diagnostic outcomes. This iterative process was necessary to account for reports that contained both histologic nomenclature and narrative free text. During this process, our goal was to increase the level of agreement without increasing the risk of misclassification, ensuring that modifications to a single algorithm would not affect the diagnostic outcome from a different algorithm.
Our NLP approach has important implications for both clinical decision-making and research. The new American Society for Colposcopy and Cervical Pathology (ASCCP) Risk-Based Management Consensus Guidelines incorporate current test results as well as previous screening and biopsy results into clinical decision-making. Our algorithms can facilitate the incorporation of these results in the EMR to allow for more efficient and accurate risk estimation. This process is meant to supplement and clarify any ambiguous pathology result and assist with physician-to-physician communication. With respect to clinical research, adoption of the EMR has provided opportunities to generate large amounts of clinical data that can be leveraged to address important research questions with real-world evidence. However, a major challenge is the accuracy and timeliness of extracting the actionable diagnoses contained within narrative text. Our NLP algorithms address this challenge with respect to cervical histology outcomes, achieving greater than 95% accuracy compared to a gold standard, manual review. A current limitation of our NLP algorithms was the inability to identify 0.6% of all pathology reports as high risk for cervical cancer, along with misclassifications within the low-risk group. Our goal was to minimize or eliminate these misclassifications and to provide the clinician with the most severe discrete diagnosis, similar to a discrete clinical laboratory result. Although the NLP misclassifications in Table 4 would not be acceptable for clinical management, particularly the benign diagnoses assigned by NLP, the algorithms were designed to minimize such misclassifications (Table 6). An example of a misclassification in which NLP assigned a benign outcome while manual review assigned adenocarcinoma in situ is provided in Supplemental Table 3. The NLP algorithms will never be perfect, in that some misclassifications will occur.
Our algorithms are currently not intended to supplant manual review of cervical pathology outcomes but are to be used to expedite the coding of pathology reports with the severest diagnosis. At the moment, clinical management decisions are not based on the NLP algorithms. Our approach demonstrates that NLP can be applied within health care systems that do not use synoptic reporting as an additional tool to review pathology reports and could be extended to additional applications in pathology, as well as other unstructured text in the EMR.

CONCLUSIONS
The lack of clarity of cervical biopsy interpretation potentially impacts patient management, including follow-up and treatment. Our study demonstrates that NLP algorithms can codify an unstructured cervical biopsy report. Until standardized terminology (eg, synoptic reporting) for cervical biopsies is accepted and implemented, NLP algorithms can assist the clinical management of cervical pathology and disease by rapidly identifying a single severity-based discrete outcome when more than 1 biopsy is present and can enhance the efficiency of clinical research.