Class imbalance occurs frequently in drug discovery data sets. In published oral absorption data sets, highly absorbed compounds considerably outnumber poorly absorbed ones. Models trained on such data are biased toward highly absorbed compounds and generalize poorly to industry settings, where more early-stage drug candidates are poorly absorbed. This paper presents two strategies for coping with unbalanced class data sets: undersampling the majority (high absorption) class and applying misclassification costs, both using classification decision trees. The data set published by Hou et al. [J. Chem. Inf. Model. 2007, 47, 208-218], which contains percentage human intestinal absorption values for 645 drug and drug-like compounds, was used to develop and validate classification trees by classification and regression tree (C&RT) analysis. The results indicate that undersampling the majority class of highly absorbed compounds to give a balanced (50:50) training set yields better accuracy for poorly absorbed compounds, whereas the biased training set yields higher accuracy for highly absorbed compounds. Applying misclassification costs to reduce either false positives or false negatives improved class predictions. Moreover, the classical overall accuracy measure used in many publications was shown to be particularly misleading for unbalanced data sets; the more appropriate measures presented here allow a more realistic assessment of classification model performance. These strategies thus offer improvements for handling unbalanced class data sets and for obtaining classification models applicable in industry.
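
As an illustration of two points made above, the following minimal Python sketch shows random undersampling of the majority class to a 50:50 distribution and why overall accuracy can mislead on unbalanced data. It is not the authors' C&RT workflow: the function names `undersample` and `class_metrics`, the toy 90:10 class split, and the choice of sensitivity, specificity, and balanced accuracy as alternative measures are all assumptions made for this example, and the abstract does not state which measures the paper itself uses.

```python
import random


def undersample(rows, labels, majority, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Hypothetical helper; assumes two classes and that the majority class
    is at least as large as the minority class.
    """
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y != majority]
    # Keep every minority row plus an equally sized random majority subset.
    kept = sorted(minor_idx + rng.sample(major_idx, len(minor_idx)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def class_metrics(y_true, y_pred, positive):
    """Overall accuracy alongside per-class measures for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy data (assumed, not from the paper): 90 "high" and 10 "low" compounds.
labels = ["high"] * 90 + ["low"] * 10
rows = list(range(len(labels)))  # stand-ins for descriptor vectors

# A trivial model that always predicts the majority class.
always_high = ["high"] * len(labels)
metrics = class_metrics(labels, always_high, positive="high")
print(metrics)

# Balanced (50:50) training set obtained by undersampling the majority class.
balanced_rows, balanced_labels = undersample(rows, labels, majority="high")
print(balanced_labels.count("high"), balanced_labels.count("low"))
```

In this toy case the always-"high" model scores well on overall accuracy while never identifying a poorly absorbed compound, which is the failure mode the per-class measures expose.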


