2024 Volume 10 Issue 26 Pages 949-953
Several geotechnical earthquake engineering problems require predicting the probability of a binary (Yes or No) outcome, typically using logistic regression or similar models. Two relevant examples are liquefaction triggering and surface fault rupture. The datasets used to develop these models are often imbalanced in their Yes/No class ratio: the Yes data points can outnumber the No data points by a large margin. This is because finding true No data points is often very hard and requires careful investigation, whereas Yes data points are obvious and attractive to document and measure. Modelers are often concerned that this class imbalance might lead to biased or skewed results. However, they usually do not explicitly distinguish between class imbalance (the observed Yes/No ratio) and sampling bias (a mechanism that causes potentially observable data to be excluded from the observations). This paper examines the problem of sampling bias versus class imbalance and makes recommendations for when it needs to be addressed and when it does not influence the predictive capacity of the models.