J Chem Inf Model - Coping with unbalanced class data sets in oral absorption models.


{ model(2656) set(1616) predict(1553) }
{ compound(1573) activ(1297) structur(1058) }
{ featur(3375) classif(2383) classifi(1994) }
{ data(3963) clinic(1234) research(1004) }
{ high(1669) rate(1365) level(1280) }
{ perform(1367) use(1326) method(1137) }
{ cost(1906) reduc(1198) effect(832) }
{ error(1145) method(1030) estim(1020) }
{ model(3480) simul(1196) paramet(876) }
{ drug(1928) target(777) effect(648) }
{ estim(2440) model(1874) function(577) }
{ measur(2081) correl(1212) valu(896) }
{ general(901) number(790) one(736) }
{ search(2224) databas(1162) retriev(909) }
{ cancer(2502) breast(956) screen(824) }
{ implement(1333) system(1263) develop(1122) }
{ detect(2391) sensit(1101) algorithm(908) }
{ take(945) account(800) differ(722) }
{ chang(1828) time(1643) increas(1301) }
{ research(1085) discuss(1038) issu(1018) }
{ visual(1396) interact(850) tool(830) }
{ can(774) often(719) complex(702) }
{ treatment(1704) effect(941) patient(846) }
{ framework(1458) process(801) describ(734) }
{ problem(2511) optim(1539) algorithm(950) }
{ algorithm(1844) comput(1787) effici(935) }
{ control(1307) perform(991) simul(935) }
{ case(1353) use(1143) diagnosi(1136) }
{ howev(809) still(633) remain(590) }
{ perform(999) metric(946) measur(919) }
{ studi(1119) effect(1106) posit(819) }
{ record(1888) medic(1808) patient(1693) }
{ state(1844) use(1261) util(961) }
{ age(1611) year(1155) adult(843) }
{ use(2086) technolog(871) perceiv(783) }
{ decis(3086) make(1611) patient(1517) }
{ model(3404) distribut(989) bayesian(671) }
{ imag(1947) propos(1133) code(1026) }
{ data(1737) use(1416) pattern(1282) }
{ inform(2794) health(2639) internet(1427) }
{ system(1976) rule(880) can(841) }
{ imag(1057) registr(996) error(939) }
{ bind(1733) structur(1185) ligand(1036) }
{ sequenc(1873) structur(1644) protein(1328) }
{ method(1219) similar(1157) match(930) }
{ imag(2830) propos(1344) filter(1198) }
{ network(2748) neural(1063) input(814) }
{ imag(2675) segment(2577) method(1081) }
{ patient(2315) diseas(1263) diabet(1191) }
{ studi(2440) review(1878) systemat(933) }
{ motion(1329) object(1292) video(1091) }
{ assess(1506) score(1403) qualiti(1306) }
{ surgeri(1148) surgic(1085) robot(1054) }
{ learn(2355) train(1041) set(1003) }
{ concept(1167) ontolog(924) domain(897) }
{ clinic(1479) use(1117) guidelin(835) }
{ extract(1171) text(1153) clinic(932) }
{ method(1557) propos(1049) approach(1037) }
{ data(1714) softwar(1251) tool(1186) }
{ design(1359) user(1324) use(1319) }
{ model(2220) cell(1177) simul(1124) }
{ care(1570) inform(1187) nurs(1089) }
{ method(984) reconstruct(947) comput(926) }
{ featur(1941) imag(1645) propos(1176) }
{ studi(1410) differ(1259) use(1210) }
{ risk(3053) factor(974) diseas(938) }
{ system(1050) medic(1026) inform(1018) }
{ import(1318) role(1303) understand(862) }
{ model(2341) predict(2261) use(1141) }
{ blood(1257) pressur(1144) flow(957) }
{ spatial(1525) area(1432) region(1030) }
{ health(3367) inform(1360) care(1135) }
{ monitor(1329) mobil(1314) devic(1160) }
{ ehr(2073) health(1662) electron(1139) }
{ research(1218) medic(880) student(794) }
{ patient(2837) hospit(1953) medic(668) }
{ data(2317) use(1299) case(1017) }
{ medic(1828) order(1363) alert(1069) }
{ signal(2180) analysi(812) frequenc(800) }
{ group(2977) signific(1463) compar(1072) }
{ sampl(1606) size(1419) use(1276) }
{ gene(2352) biolog(1181) express(1162) }
{ data(3008) multipl(1320) sourc(1022) }
{ first(2504) two(1366) second(1323) }
{ intervent(3218) particip(2042) group(1664) }
{ activ(1138) subject(705) human(624) }
{ time(1939) patient(1703) rate(768) }
{ patient(1821) servic(1111) care(1106) }
{ can(981) present(881) function(850) }
{ analysi(2126) use(1163) compon(1037) }
{ health(1844) social(1437) communiti(874) }
{ structur(1116) can(940) graph(676) }
{ use(976) code(926) identifi(902) }
{ use(1733) differ(960) four(931) }
{ result(1111) use(1088) new(759) }
{ survey(1388) particip(1329) question(1065) }
{ process(1125) use(805) approach(778) }
{ activ(1452) weight(1219) physic(1104) }
{ method(1969) cluster(1462) data(1082) }
{ method(2212) result(1239) propos(1039) }


Class imbalance occurs frequently in drug discovery data sets. In published oral absorption data sets, highly absorbed compounds considerably outnumber poorly absorbed ones. Models trained on such data are biased toward highly absorbed compounds and generalize poorly to industry settings, where more early-stage drug candidates are poorly absorbed. This paper presents two strategies for coping with unbalanced class data sets: undersampling the majority (high absorption) class and applying misclassification costs in classification decision trees. The data set published by Hou et al. [J. Chem. Inf. Model. 2007, 47, 208-218], which contains percentage human intestinal absorption values for 645 drugs and drug-like compounds, was used to develop and validate classification trees using classification and regression tree (C&RT) analysis. The results indicate that undersampling the majority class of highly absorbed compounds to give a balanced (50:50) training set achieves better accuracy for poorly absorbed compounds, whereas the biased training set achieves higher accuracy for highly absorbed compounds. Applying misclassification costs to reduce false positives or false negatives improved class predictions. Moreover, the classical overall accuracy measure used in many publications was shown to be particularly misleading for unbalanced data sets; the more appropriate measures presented here may be used for a more realistic assessment of classification model performance. These strategies thus help in coping with unbalanced class data sets and in obtaining classification models applicable in industry.
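
The ideas in this abstract can be illustrated with a minimal Python sketch. It is not the authors' C&RT workflow. The labels are a made-up 90:10 toy set rather than the Hou et al. data, and the function names (undersample_majority, class_metrics, total_cost) are hypothetical. The abstract does not name its "more appropriate measures", so sensitivity and specificity are assumed here as common per-class alternatives to overall accuracy. The cost function only scores finished predictions, whereas the paper applies misclassification costs while the tree is being built.

```python
import random


def undersample_majority(labels, seed=0):
    """Return indices of a class-balanced subset of `labels`.

    Every class is randomly cut down to the size of the smallest class,
    which gives a 50:50 split for a two-class problem.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    n_keep = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def class_metrics(y_true, y_pred, positive="high"):
    """Compute overall accuracy plus per-class rates for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Fraction of truly highly absorbed compounds predicted as such.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Fraction of truly poorly absorbed compounds predicted as such.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "false_positives": fp,
        "false_negatives": fn,
    }


def total_cost(metrics, cost_fp=1.0, cost_fn=1.0):
    """Weight each error type by an assumed misclassification cost."""
    return (cost_fp * metrics["false_positives"]
            + cost_fn * metrics["false_negatives"])


# Toy labels (assumption): 90 highly absorbed, 10 poorly absorbed compounds.
y_true = ["high"] * 90 + ["low"] * 10
# A degenerate model that always predicts the majority class.
y_pred = ["high"] * 100

metrics = class_metrics(y_true, y_pred)
balanced_idx = undersample_majority(y_true)

print(metrics)
print(len(balanced_idx))
print(total_cost(metrics, cost_fp=5.0))
```

On the toy labels the always-"high" model looks strong on overall accuracy while identifying none of the poorly absorbed compounds, which is the kind of misleading result the abstract warns about. Raising cost_fp makes those missed poorly absorbed compounds count more heavily when comparing models.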

Clean Abstract

class imbal occur frequent drug discoveri data set oral absorpt data set literatur consider high absorb compound compar poor absorb compound produc model bias toward high absorb compound lack general industri set earli stage drug candid poor absorb paper present two strategi cope unbalanc class data set undersampl major high absorpt class misclassif cost use classif decis tree publish data set hou et al j chem inf model contain percentag human intestin absorpt drug druglik compound use develop valid classif tree use classif regress tree crt analysi result indic undersampl major class high absorb compound lead balanc distribut train set can achiev better accuraci poor absorb compound wherea bias train set achiev higher accuraci high absorb compound use misclassif cost result improv class predict appli reduc fals posit fals negat moreov shown classic overal accuraci measur use mani public particular mislead case unbalanc data set appropri measur present may use realist assess classif model perform thus strategi offer improv cope unbalanc class data set obtain classif model applic industri

Similar Abstracts

J Chem Inf Model - Binary classification of a large collection of environmental chemicals from estrogen receptor assays by quantitative structure-activity relationship and machine learning methods. ( 0,872079309533794 )
J Chem Inf Model - Predictions of BuChE inhibitors using support vector machine and naive Bayesian classification techniques in drug discovery. ( 0,849035646401883 )
J Chem Inf Model - Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods. ( 0,808588915567336 )
J Chem Inf Model - Structure based model for the prediction of phospholipidosis induction potential of small molecules. ( 0,7989231241081 )
J Chem Inf Model - In silico prediction of chemical Ames mutagenicity. ( 0,792728524316112 )
J Chem Inf Model - Hsp90 inhibitors, part 1: definition of 3-D QSAutogrid/R models as a tool for virtual screening. ( 0,78179208000945 )
J Chem Inf Model - Design and synthesis of new antioxidants predicted by the model developed on a set of pulvinic acid derivatives. ( 0,781407623590819 )
J Chem Inf Model - In silico prediction of aqueous solubility using simple QSPR models: the importance of phenol and phenol-like moieties. ( 0,776111862650717 )
J Chem Inf Model - Jointly handling potency and toxicity of antimicrobial peptidomimetics by simple rules from desirability theory and chemoinformatics. ( 0,773520620105699 )
J Chem Inf Model - In silico prediction of total human plasma clearance. ( 0,773087518328165 )
J Chem Inf Model - Statistical analysis and compound selection of combinatorial libraries for soluble epoxide hydrolase. ( 0,772674110052236 )
J Chem Inf Model - Classification of compounds with distinct or overlapping multi-target activities and diverse molecular mechanisms using emerging chemical patterns. ( 0,76915532506399 )
J Chem Inf Model - Quantitative structure-activity relationship models for ready biodegradability of chemicals. ( 0,751464882219798 )
J Chem Inf Model - Profile-QSAR and Surrogate AutoShim protein-family modeling of proteases. ( 0,748097724360193 )
J Chem Inf Model - Construction and use of fragment-augmented molecular Hasse diagrams. ( 0,740859417835254 )
J Am Med Inform Assoc - Drug repurposing: mining protozoan proteomes for targets of known bioactive compounds. ( 0,740782821847982 )
J Chem Inf Model - Discovery of novel antimalarial compounds enabled by QSAR-based virtual screening. ( 0,740192307518969 )
J Chem Inf Model - Profile-QSAR: a novel meta-QSAR method that combines activities across the kinase family to accurately predict affinity, selectivity, and cellular activity. ( 0,737010943329622 )
J Chem Inf Model - Binary classification of aqueous solubility using support vector machines with reduction and recombination feature selection. ( 0,731464860050562 )
J Chem Inf Model - Discovering new agents active against methicillin-resistant Staphylococcus aureus with ligand-based approaches. ( 0,726239585068251 )
J Chem Inf Model - Automated building of organometallic complexes from 3D fragments. ( 0,721468320675819 )
BMC Med Inform Decis Mak - Regression tree construction by bootstrap: model search for DRG-systems applied to Austrian health-data. ( 0,719658834146089 )
J Chem Inf Model - A comparison of different QSAR approaches to modeling CYP450 1A2 inhibition. ( 0,716501209695357 )
J Chem Inf Model - Kinome-wide activity modeling from diverse public high-quality data sets. ( 0,71516325788988 )
J Chem Inf Model - Experimental and computational prediction of glass transition temperature of drugs. ( 0,7100082616135 )
J Chem Inf Model - Revisiting the general solubility equation: in silico prediction of aqueous solubility incorporating the effect of topographical polar surface area. ( 0,707914286260064 )
J Chem Inf Model - A new approach to radial basis function approximation and its application to QSAR. ( 0,707653090645529 )
J Chem Inf Model - A new protocol for predicting novel GSK-3β ATP competitive inhibitors. ( 0,700633737303918 )
J Chem Inf Model - Prediction of linear cationic antimicrobial peptides based on characteristics responsible for their interaction with the membranes. ( 0,700259265233478 )
J Chem Inf Model - Three useful dimensions for domain applicability in QSAR models using random forest. ( 0,699834769577295 )
J Chem Inf Model - Pharmacophore assessment through 3-D QSAR: evaluation of the predictive ability on new derivatives by the application on a series of antitubercular agents. ( 0,699073807281173 )
J Chem Inf Model - How accurately can we predict the melting points of drug-like compounds? ( 0,698739735024642 )
J Chem Inf Model - Predicting myelosuppression of drugs from in silico models. ( 0,698440756636287 )
J Chem Inf Model - How experimental errors influence drug metabolism and pharmacokinetic QSAR/QSPR models. ( 0,697460640489337 )
J Chem Inf Model - Time-split cross-validation as a method for estimating the goodness of prospective prediction. ( 0,694238624835049 )
J Chem Inf Model - Hsp90 inhibitors, part 2: combining ligand-based and structure-based approaches for virtual screening application. ( 0,693200237322223 )
J Chem Inf Model - Predictive models for cytochrome p450 isozymes based on quantitative high throughput screening data. ( 0,689538876814132 )
J Chem Inf Model - Analysis and study of molecule data sets using snowflake diagrams of weighted maximum common subgraph trees. ( 0,686317741293831 )
J Chem Inf Model - Application of quantitative structure-activity relationship models of 5-HT1A receptor binding to virtual screening identifies novel and potent 5-HT1A ligands. ( 0,685325129467357 )
J Chem Inf Model - Pre-processing feature selection for improved C&RT models for oral absorption. ( 0,683770444677676 )
J Chem Inf Model - Beyond the scope of Free-Wilson analysis: building interpretable QSAR models with machine learning algorithms. ( 0,676976318029155 )
J Chem Inf Model - QSAR classification model for antibacterial compounds and its use in virtual screening. ( 0,672306870161528 )
J Chem Inf Model - Applicability Domain ANalysis (ADAN): a robust method for assessing the reliability of drug property predictions. ( 0,666626300990298 )
J Chem Inf Model - Generalized workflow for generating highly predictive in silico off-target activity models. ( 0,662199219764744 )
J Chem Inf Model - Discovery and design of tricyclic scaffolds as protein kinase CK2 (CK2) inhibitors through a combination of shape-based virtual screening and structure-based molecular modification. ( 0,659790831463857 )
J Chem Inf Model - Fighting high molecular weight in bioactive molecules with sub-pharmacophore-based virtual screening. ( 0,65714025332669 )
J Chem Inf Model - GA(M)E-QSAR: a novel, fully automatic genetic-algorithm-(meta)-ensembles approach for binary classification in ligand-based drug design. ( 0,655671421802934 )
AMIA Annu Symp Proc - Predicting the dengue incidence in Singapore using univariate time series models. ( 0,655135024540345 )
J Chem Inf Model - QSAR modeling of imbalanced high-throughput screening data in PubChem. ( 0,653699987340355 )
J Chem Inf Model - iLOGP: a simple, robust, and efficient description of n-octanol/water partition coefficient for drug design using the GB/SA approach. ( 0,653130052794444 )
J Chem Inf Model - Development of novel 3D-QSAR combination approach for screening and optimizing B-Raf inhibitors in silico. ( 0,651919138857247 )
J Chem Inf Model - Predicting pK(a) values of substituted phenols from atomic charges: comparison of different quantum mechanical methods and charge distribution schemes. ( 0,651192913207384 )
J Chem Inf Model - Maximum-score diversity selection for early drug discovery. ( 0,648358442114812 )
J Chem Inf Model - Compound set enrichment: a novel approach to analysis of primary HTS data. ( 0,647296321285539 )
J Chem Inf Model - Design of novel FLT-3 inhibitors based on dual-layer 3D-QSAR model and fragment-based compounds in silico. ( 0,645258407802702 )
AMIA Annu Symp Proc - Effect of data combination on predictive modeling: a study using gene expression data. ( 0,639258597407997 )
J Chem Inf Model - Quantitative structure-activity relationship models of clinical pharmacokinetics: clearance and volume of distribution. ( 0,633015166918026 )
Comput. Biol. Med. - A prediction model of substrates and non-substrates of breast cancer resistance protein (BCRP) developed by GA-CG-SVM method. ( 0,632401999105299 )
J Chem Inf Model - Molecular modeling of the 3D structure of 5-HT(1A)R: discovery of novel 5-HT(1A)R agonists via dynamic pharmacophore-based virtual screening. ( 0,63199463550218 )
J Chem Inf Model - Synthesis, bioassay, and molecular field topology analysis of diverse vasodilatory heterocycles. ( 0,631322744207098 )
Artif Intell Med - Training artificial neural networks directly on the concordance index for censored data using genetic algorithms. ( 0,631193068867193 )
J Chem Inf Model - Four-dimensional structure-activity relationship model to predict HIV-1 integrase strand transfer inhibition using LQTA-QSAR methodology. ( 0,630448588950108 )
J Integr Bioinform - Database supported candidate search for metabolite identification. ( 0,629830016255994 )
J Chem Inf Model - Modeling drug-induced anorexia by molecular topology. ( 0,624861170611105 )
J Chem Inf Model - Knowledge-based libraries for predicting the geometric preferences of druglike molecules. ( 0,624182008456952 )
J Chem Inf Model - Optimizing predictive performance of CASE Ultra expert system models using the applicability domains of individual toxicity alerts. ( 0,615995487456485 )
J Chem Inf Model - Freely available conformer generation methods: how good are they? ( 0,6149077553287 )
Comput. Biol. Med. - In silico prediction of spleen tyrosine kinase inhibitors using machine learning approaches and an optimized molecular descriptor subset generated by recursive feature elimination method. ( 0,614652366705842 )
J Chem Inf Model - CSAR data set release 2012: ligands, affinities, complexes, and docking decoys. ( 0,61378246072077 )
J Chem Inf Model - Combined 3D-QSAR, molecular docking, and molecular dynamics study on piperazinyl-glutamate-pyridines/pyrimidines as potent P2Y12 antagonists for inhibition of platelet aggregation. ( 0,61258630833036 )
J Chem Inf Model - Kinase-kernel models: accurate in silico screening of 4 million compounds across the entire human kinome. ( 0,610189498477175 )
J Chem Inf Model - Algorithm for reaction classification. ( 0,609989244973792 )
J Chem Inf Model - A critical assessment of combined ligand- and structure-based approaches to HERG channel blocker modeling. ( 0,6093500320616 )
J Chem Inf Model - Rationalization of the pKa values of alcohols and thiols using atomic charge descriptors and its application to the prediction of amino acid pKa's. ( 0,609336653536267 )
J Chem Inf Model - Application of the 4D fingerprint method with a robust scoring function for scaffold-hopping and drug repurposing strategies. ( 0,606905936917685 )
J Chem Inf Model - A multivariate chemical similarity approach to search for drugs of potential environmental concern. ( 0,606105816604736 )
J Chem Inf Model - PLS-optimal: a stepwise D-optimal design based on latent variables. ( 0,605320984592064 )
J Chem Inf Model - Development of a minimal kinase ensemble receptor (MKER) for surrogate AutoShim. ( 0,604704240507667 )
J Chem Inf Model - Combined receptor and ligand-based approach to the universal pharmacophore model development for studies of drug blockade to the hERG1 pore domain. ( 0,603473968387852 )
J Chem Inf Model - On the value of homology models for virtual screening: discovering hCXCR3 antagonists by pharmacophore-based and structure-based approaches. ( 0,602827259514759 )
J Chem Inf Model - Exploring uncharted territories: predicting activity cliffs in structure-activity landscapes. ( 0,602376788134717 )
J Chem Inf Model - Exploring the biologically relevant chemical space for drug discovery. ( 0,60174426199806 )
J Chem Inf Model - Study of chromatographic retention of natural terpenoids by chemoinformatic tools. ( 0,601158157558982 )
J Chem Inf Model - Combining horizontal and vertical substructure relationships in scaffold hierarchies for activity prediction. ( 0,600577539482417 )
J Chem Inf Model - Structural similarity based kriging for quantitative structure activity and property relationship modeling. ( 0,600557662314086 )
J Chem Inf Model - In silico assessment of chemical biodegradability. ( 0,599815774863098 )
J Chem Inf Model - Prediction of compound potency changes in matched molecular pairs using support vector regression. ( 0,598447748428062 )
J Chem Inf Model - Best of both worlds: combining pharma data and state of the art modeling technology to improve in Silico pKa prediction. ( 0,597417594396654 )
J Chem Inf Model - Benchmarking study of parameter variation when using signature fingerprints together with support vector machines. ( 0,59586037647688 )
J. Comput. Biol. - The complexity of the dirichlet model for multiple alignment data. ( 0,593589539839454 )
BMC Med Inform Decis Mak - Concordance and predictive value of two adverse drug event data sets. ( 0,591473631097129 )
J Chem Inf Model - How do 2D fingerprints detect structurally diverse active compounds? Revealing compound subset-specific fingerprint features through systematic selection. ( 0,589852171275622 )
J Chem Inf Model - Comparative studies on some metrics for external validation of QSPR models. ( 0,588734885821078 )
J Chem Inf Model - GRID-based three-dimensional pharmacophores II: PharmBench, a benchmark data set for evaluating pharmacophore elucidation methods. ( 0,588657677729039 )
J Chem Inf Model - Target-independent prediction of drug synergies using only drug lipophilicity. ( 0,588106986473031 )
Med Biol Eng Comput - Application of the RIMARC algorithm to a large data set of action potentials and clinical parameters for risk prediction of atrial fibrillation. ( 0,588101941865391 )
J Chem Inf Model - Chemical data visualization and analysis with incremental generative topographic mapping: big data challenge. ( 0,587269623424573 )
J Chem Inf Model - BioSM: metabolomics tool for identifying endogenous mammalian biochemical structures in chemical structure space. ( 0,587046091909773 )
J Chem Inf Model - Classifier ensemble based on feature selection and diversity measures for predicting the affinity of A(2B) adenosine receptor antagonists. ( 0,585882931949616 )
J Chem Inf Model - Does rational selection of training and test sets improve the outcome of QSAR modeling? ( 0,585469020291503 )