In this study, we extended our AdaSampling method to address the positive-unlabeled learning problem and learning with label noise. The utility of the method was demonstrated using two novel bioinformatics applications.

IEEE Transactions on Cybernetics is one of the top-ranked journals in Artificial Intelligence, with an impact factor of 7.4 (#1 out of 22 in Computer Science, Cybernetics, according to Thomson Reuters).

Title

AdaSampling for Positive-Unlabeled and Label Noise Learning with Bioinformatics Applications

Abstract

Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled whereas the remaining instances are unlabeled, positive-unlabeled learning is required to model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise. Here we propose AdaSampling, a framework for both positive-unlabeled learning and learning with class label noise. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalisable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utility of the proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for positive-unlabeled learning and/or learning with label noise. We then introduce two novel bioinformatics applications where AdaSampling is used to (1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and (2) predict transcription factor target genes by integrating various next-generation sequencing data.
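
The iterative idea described in the abstract can be illustrated with a small sketch for the positive-unlabeled case. This is only a rough illustration under stated assumptions, not the authors' implementation: the function names (`ada_sampling_sketch`, `fit_gaussian`, `posterior_positive`) are hypothetical, the data are one-dimensional, and a simple Gaussian class model with equal priors stands in for whatever base classifier the paper uses. In each round, unlabeled instances are sampled as negatives with weights given by the current estimate that they are truly negative, a model is fitted, and the estimates are updated.

```python
import math
import random


def fit_gaussian(xs):
    """Return (mean, std) of a 1-D sample; std is floored for numerical stability."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, max(math.sqrt(var), 1e-3)


def posterior_positive(x, pos_model, neg_model):
    """P(positive | x), assuming equal class priors and Gaussian class models."""
    def log_pdf(value, mu, sd):
        return -0.5 * ((value - mu) / sd) ** 2 - math.log(sd)

    lp = log_pdf(x, *pos_model)
    ln = log_pdf(x, *neg_model)
    m = max(lp, ln)  # subtract the max before exponentiating to avoid underflow
    ep, en = math.exp(lp - m), math.exp(ln - m)
    return ep / (ep + en)


def ada_sampling_sketch(positives, unlabeled, n_iter=10, seed=0):
    """Return an estimated P(positive) for each unlabeled instance.

    Hypothetical sketch: the unlabeled set is repeatedly resampled as the
    negative class, with weights equal to the current estimated probability
    that each unlabeled instance is truly negative.
    """
    rng = random.Random(seed)
    # Initially every unlabeled instance is treated as equally likely negative.
    p_neg = [1.0] * len(unlabeled)
    pos_model = fit_gaussian(positives)
    for _ in range(n_iter):
        # Adaptive sampling step: draw negatives with replacement, favouring
        # instances currently believed to be negative.
        sampled = rng.choices(unlabeled, weights=p_neg, k=len(unlabeled))
        neg_model = fit_gaussian(sampled)
        # Update step: re-estimate how likely each unlabeled instance is negative.
        p_neg = [
            max(1.0 - posterior_positive(x, pos_model, neg_model), 1e-6)
            for x in unlabeled
        ]
    return [1.0 - p for p in p_neg]


if __name__ == "__main__":
    # Simulated data (an assumption for illustration): positives centred at +2,
    # negatives at -2; half of the positives are hidden in the unlabeled set.
    data_rng = random.Random(1)
    labeled_pos = [data_rng.gauss(2.0, 1.0) for _ in range(50)]
    hidden_pos = [data_rng.gauss(2.0, 1.0) for _ in range(50)]
    true_neg = [data_rng.gauss(-2.0, 1.0) for _ in range(100)]
    scores = ada_sampling_sketch(labeled_pos, hidden_pos + true_neg)
    print("mean score, hidden positives:", sum(scores[:50]) / 50)
    print("mean score, true negatives:", sum(scores[50:]) / 100)
```

In this toy setting the hidden positives are down-weighted as candidate negatives over successive rounds, so the fitted negative model becomes less contaminated. The same loop could in principle be adapted to label noise by sampling from both classes according to their estimated mislabeling probabilities, though the details here are assumptions rather than the published procedure.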