Conditional Relative Odds Ratio and Comparison of Accuracy of Diagnostic Tests Based on 2×2 Tables

In order to evaluate the accuracy of diagnostic tests based on 2×2 tables, a number of indices were used, some of which are occasionally used inappropriately. This paper demonstrates the characteristics and problems with those indices, and introduces several methods to compare the accuracy of two diagnostic tests. The author summarizes existing indices based on 2×2 tables, agreement rate, kappa (κ), and odds ratio, and reviews their characteristics to find better indices by which to compare two diagnostic tests using hypothetical examples. Because only the odds ratio is not affected by prevalence, the relative odds ratio is the most appropriate index for comparing diagnostic accuracy. In order to decrease selection bias, giving the two tests to the same individuals is preferred. However, no standard method has been established to obtain the standard error of relative odds ratios. In this case, using the newly proposed conditional relative odds ratio (CROR), based on McNemar’s odds ratio, the standard error is available. The CROR is a less biased index when the two tests were given to the same individuals, and it is also preferable in light of its ethical and economic advantages. However, a large base population is required for the two tests to be highly accurate and produce few discordant results.

Diagnostic accuracy is commonly measured by sensitivity and specificity of which trade-off relationship can be presented in the form of a receiver-operating characteristic (ROC) curve. One summary index of diagnostic test accuracy is based on the area under the ROC curve [1][2][3] representing for an integrated discriminative ability of a diagnostic test over cut-off points. Others are based on a single 2 2 table of a specific cut-off point. [4][5][6][7][8][9] In this paper, the author first selects several summary statistics belonging to the latter category, then focuses on the comparison of diagnostic accuracy using the odds ratio. 9,[10][11][12] Lastly, ways to summarize diagnostic accuracy using the meta-analytic method are introduced.
As described in Table 1, diagnostic test accuracy based on a 2 2 table is most commonly presented by two trade-off indices, sensi-tivity= a/D, and specificity= d/ND. When we need to describe diagnostic accuracy by a single index, there are at least three options, i.e., agreement rate (AR), kappa 4 ( ), and odds ratio (OR). The author presents these statistics along with their strengths and weaknesses, and then focuses on the odds ratio to compare the two diagnostic tests administered to the different or same subjects. In the last approach, two meta-analytic methods for a comparison of diagnostic accuracy are reviewed. For each approach, the author provides a hypothetical example to show the actual computational steps. All analyses were re-performed using the SAS release 8.2 (SAS Institute Inc., Cary, NC). 13 The code is presented in the appendix.

Approach to Evaluation of Diagnostic Test Accuracy
One of the widely used indices for diagnostic accuracy is AR, alternatively percent agreement. It is calculated as the number of a valuable feature when comparing accuracy. In this section, the author demonstrates how to compare the diagnostic accuracy of two tests using the OR among both different and the same subjects.

2-1. Comparison of Diagnostic Accuracy of Two Tests Given to Different Subjects Indices Based on 2 2 Tables by Test
When we compare the diagnostic accuracy of two tests, X and Y, applied to different subjects, relative odds ratio (ROR), the ratio of the two ORs, is available. 9 As shown in Table 2, the index is calculated as follows: Because the variance of the logOR is calculated as (1/a)+(1/b)+(1/c)+(1/d) , the variance of the difference between the logORs is var(logORX)+var(logORY) under the assumption of independence. Thus, we obtain the ROR, reflecting the relative diagnostic accuracy of test X to test Y, with a confidence interval (CI). Table 2 could be reconstructed to test results versus diagnostic tests by disease status as shown in Table 3. In this form, we can compare the sensitivities of the two tests as well as their specificities, applying the 2 test for independence or a comparison of two proportions. These tests are mathematically equivalent. correctly categorized subjects over the total number; AR=(a+d)/T in Table 1. AR is computationally simple and intuitively interpretable. It could be manipulated to p + (1-p) , and this is interpreted as the weighted mean of sensitivity and specificity by prevalence.

Indices based on 2 2 tables by disease status
Among several statistics 5 proposed for 2 2 table data to improve AR with regard to removing chance agreement, 4 has frequently received high marks. As shown in Table 1, the index is calculated by subtracting the expected number of correctly diagnosed individuals from both the numerator and denominator of AR. Prevalence remains in the formula as follows: The OR, frequently used in causality studies, is also used to evaluate diagnostic accuracy. [7][8][9] In causality studies, the ORs stand for the strength of the relationship between exposure and disease. This is easily interpreted as the relationship between test results and the presence of the disease. The OR is also interpreted as the ratio of true-positive to false-positive odds. The index is manipulated to 1/{(1/ -1)(1/ -1)}, in which prevalence is cancelled out.

Approach to Comparison of Diagnostic Test Accuracy
Among the above three indices for the evaluation of diagnostic test accuracy, only the OR is not affected by prevalence, which is Diagnostic Accuracy Based on 2 2 Tables   Table 1. Indices of diagnostic accuracy based on 2 2 tables.

2-2. Comparison of Diagnostic Accuracy of Two Tests Given to Same Individuals
When we compare the diagnostic accuracy of two tests given to different subjects, we should take into account the comparability of the subject groups to which each test was administered. Selection bias might invalidate the results on accuracy. 14-15 Thus, we may give two diagnostic tests to the same individual, and try calculating the ROR in the same way. However, an ROR based on 2 2 tables by test requires the independence of both, which is not sufficient for a test with the same subjects. In that case, the ROR based on McNemar's OR by disease status is available. As shown in Table 4, each number of the four cells in the ordinary 2 2 table (Table 3) moves to a marginal number in McNemar's 2 2 table, and the result of test X with that of test Y of each individual is counted and classified into four cells. As McNemar's table has more information than an ordinary 2 2 table, we can reconstruct the latter from the former, but not the other way around.
Another index for a comparison of the sensitivities and specificities of two tests is the OR, which is the ratio of the positive odds of test X to that of test Y in diseased or nondiseased subjects. As a positive result among the diseased subjects denotes a true positive, the OR among the diseased group (ORD= ab'/a'b) is the true positive odds ratio of test X against that of test Y, indicating the relative sensitivity of one to the other. CI of the ORD is calculated using the variance of logORD, which is (1/a)+(1/b)+(1/a') Table 3. If ORD is significantly greater than 1, the sensitivity of test X is higher than that of test Y. Similarly, a positive result in the nondiseased group is a false positive, and the ORND being cd' /c'd, denotes a false positive odds ratio of test X to test Y. This index could also be used for the comparison of specificities. If ORND is smaller than 1, the specificity of test X is higher than that of test Y. The ratio of a true-positive to a false-positive odds ratio, i.e., adb'c'/ bca'd' is identical to the ROR calculated in Table 2. The variance and CI of the ratio are also identical to those in Table 2.  [19][20] can be used. A relative summary OR is calculated by dividing summary ORX by summary ORY. CI is computed using var(log relative summary OR)=var(log summary ORX)+var(log summary ORY). This method is used when test X was given to different subjects than those who took test Y.

Summarizing CROR
We summarize the extracted CROR of test X to test Y using the same method as when summarizing the OR. This method is used when test X was given to the same individuals who took test Y.
To evaluate diagnostic accuracy, AR is commonly used for its simplicity and ease of interpretation. However, a number of papers have reported its pitfalls. [4][5][6]21 As AR, which is p +(1-p) , is the weighted mean of sensitivity and specificity, when prevalence is low, the sensitivity is almost neglected. In that case, AR does not convey the diagnostic accuracy of the test. An Because McNemar's true-positive odds ratio is / , and falsepositive odds ratio is '/ ', the ROR using McNemar's OR, the conditional relative odds ratio 16 (CROR), is '/ ' . The CI of the newly proposed index is calculated from var(logCROR) (1/ )+(1/ )+(1/ ')+(1/ '). The CROR requires the number of individuals having discordant results on the two tests, and no concordant results are needed.

Approach to Meta-Analysis of Comparison of Diagnostic Test
Accuracy There are two ways to compare diagnostic test accuracy using meta-analysis, i.e., a comparison of two summary ORs of tests X and Y by extracting each OR from the original studies, and summarizing the CROR extracted from each. The SAS program for meta-analysis is provided elsewhere. [16][17]

Comparison of two summary ORs
Extracting the OR of each test from the original studies enables us to calculate summary ORs of tests X and Y with their variances. In order to summarize ORs, a proper model such as a fixed Table 5. Comparison of two tests using several indices. (meaning that the OR is McNemar's), there is no methodological problem. However, use of the ordinal ROR may be problematic, in particular a t-test of logROR that ignores intra-study variations, which should be avoided since it leads to incorrectly low p-values.
The CROR is a new index, and at present can be extracted when raw data of discordant individuals are provided in an original study. In future studies of the comparative diagnostic accuracy of tests, the CROR should be presented if raw data of discordant individuals can not be presented for meta-analysis.
The author wishes to express his deepest gratitude to Drs. Takeo Moro-oka and Niteesh K. Choudhry, who were the co-authors of the CROR paper, and to Dr. Nan Laird for her helpful comments on the OR and the ROR from the statistical viewpoint. The author is also grateful to the Japan Epidemiological Association for the Young Investigator Award. extreme example is shown in Table 5, showing a higher AR in test Y despite the fact that test Y has no diagnostic ability. In such a case, care should be taken to avoid accuracy comparison based on ARs of two groups.
In spite of the improvement of AR with regard to removing chance agreement, attention should be paid to evaluating diagnostic accuracy using . As shown in Table 6, diminishes under fixed sensitivity and specificity when the prevalence is closer to one or zero. Even with the same prevalence, would be changed when the sensitivity and specificity are switched. These are examples of some undesirable features of for evaluating diagnostic accuracy.
The OR was originally used as an index representing the strength of a relationship between exposure and disease. It is essentially identical to the relationship between test results and the presence of disease. The remarkable feature of OR is its independence from prevalence and symmetry in terms of sensitivity and specificity. Moreover its variance is given by a simple formula. Therefore, the OR is widely used to evaluate and compare diagnostic accuracy, including use of the meta-analytic technique. [7][8][9][22][23][24][25] However, as this index is based on odds, very small differences may sometimes be exaggerated. For example, a test with =0.9 and =0.9 has the same OR of 81 as another test with =0.99 and =0.45.
The ROR is used if two different tests to be compared are given to different subjects. Although the index is statistically correct, we should be careful of selection bias based on subject differences. To remove the bias, it would be preferable to give two tests to the same individuals. However, ordinal ROR assumes independence of two groups. In that case, the CROR, the ratio of a truepositive to a false-positive McNemar's odds ratio, yields the correct answer to the question. The CROR is identical to the ROR when and only when each cell of McNemar's 2 2 table is identical to the expected number from the margin. The CROR has the following characteristics: less biased index with CI considering the correlation of individual level of two tests, and economically profitable and ethically less problematic, because no diagnosis of disease is needed for subjects with negative results from both diagnostic tests. In addition, the CROR could be used in a comparison between the strength of association of two exposures to a disease. 26 On the other hand, the CROR tends to have a broad CI because of the sparseness of McNemar's tables. This phenomenon is quite serious when diagnostic tests are accurate, and consequently they are highly concordant. Finally, we should generally pay attention to the differences in the characteristics of two tests when we summarize sensitivity and specificity, being especially aware of any loss of information when summarizing indices.