Interactive Machine Learning for Document Classification

Introduction
Each year, around 2 to 2.5 million papers are published [12], and scientific output doubles roughly every nine years [2]. With this volume of publications, finding relevant papers is considerably difficult and costs valuable time and energy. While some queries require only a single result, e.g. finding the most recent paper by a specific author, reviews and exploratory searches demand the accumulation of a vast selection of studies. For the user, the relevance of search results in an exploratory search depends strongly on their preferences. Finding papers that fulfill the user's needs early in the search process facilitates the search [11] and improves the user experience. It is therefore desirable to build a model that is geared towards user preferences. To efficiently identify all relevant studies and to achieve personalization, tools are required that support the screening of documents. The goal of this thesis is to develop such tools by means of natural language processing methods.

Figure 1: Active learning process, taken from [17]
Background
Interactive machine learning
Since labeled data is scarce and labeling is costly, supervised learning methods are rarely applicable in this domain. Instead, an interactive approach comes into play in which a machine interacts with a domain expert and optimizes its learning behavior [4]. The advantage of this semi-supervised approach is that the exponential search space can be reduced; the aim is to achieve higher accuracy with less annotated data. The human is thus included in a way that enables higher performance than either human or machine would achieve independently. The interactive machine learning approach that has been successfully applied to systematic reviews is pool-based active learning [7, 9, 10, 16]. From a collection of unlabeled papers, an active learner selects a small subset. A human reviewer then manually annotates the papers in this subset as either relevant or irrelevant. With this set, a classifier is trained that labels all remaining citations and, based on the query strategy of the algorithm, selects the next sample to be annotated by the reviewer. This is repeated until a certain threshold is reached, e.g. until 95% of the relevant studies have been identified [3, 7]. Figure 1 depicts the typical learning process. The greatest benefit of active learning is the reduction of workload. For example, Wallace et al. applied active learning in the medical domain and reduced the number of studies that need to be screened by 50% [16].
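To make the loop concrete, the following is a minimal Python sketch of the pool-based process described above. It assumes a scikit-learn SVM, a generic query_strategy callable, and reviewer decisions that are simulated by a pre-labeled array (the retrospective setting); the seed size, batch size, and the 95% stopping rule are illustrative placeholders rather than the exact setup of the cited studies.

    import numpy as np
    from sklearn.svm import LinearSVC

    def active_learning_loop(X, oracle_labels, query_strategy,
                             seed_size=10, batch_size=25, target_recall=0.95):
        # X             : feature matrix of all citations (n_samples x n_features)
        # oracle_labels : reviewer decisions; simulated here with a pre-labeled array
        # query_strategy: callable(clf, X, unlabeled_idx, batch_size) -> indices to query
        rng = np.random.default_rng(0)
        n = X.shape[0]
        labeled = list(rng.choice(n, size=seed_size, replace=False))   # random seed sample
        unlabeled = [i for i in range(n) if i not in set(labeled)]
        total_relevant = int(np.sum(oracle_labels == 1))

        clf = LinearSVC()
        while unlabeled:
            clf.fit(X[labeled], oracle_labels[labeled])                 # train on labeled pool
            queried = query_strategy(clf, X, np.array(unlabeled), batch_size)
            labeled.extend(int(i) for i in queried)                     # reviewer annotates batch
            unlabeled = [i for i in unlabeled if i not in set(queried)]
            # stop once e.g. 95% of the relevant studies have been found
            if np.sum(oracle_labels[labeled] == 1) >= target_recall * total_relevant:
                break
        return clf, labeled

A concrete query_strategy, such as the uncertainty or density sampling functions sketched in the next section, can simply be passed in as the callable.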
Active learning variants
The crucial component of active learning is the candidate selection strategy where instances are selected for the user to label. The process revolves around querying the instances that contribute to the learning progress the most. The following list outlines common strategies (see [14]):
- Certainty and uncertainty sampling: Select the citations on which the learner has the highest or the lowest certainty, respectively.
- Query-by-committee: Multiple classifiers vote on each item to decide whether it is likely to be relevant or not [16].
- Decision-theoretic: The aim is to choose candidates which change the underlying model the most (expected model change) or those that reduce the generalization error (expected error reduction and variance reduction).
- Data-driven: A diversity-driven approach fosters the selection of dissimilar instances (i.e. informative ones), while a density-driven approach focuses on selecting candidates from dense areas (i.e. representative ones). A minimal sketch of the uncertainty- and density-based variants follows this list.
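As an illustration, the two simplest families, uncertainty sampling and density-driven selection, can be sketched as follows. This is a simplified sketch that assumes an SVM-like classifier exposing decision_function and a feature matrix X; it is not the formulation of Huang et al., whose algorithm combines both criteria in a single optimization [6].

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def uncertainty_sampling(clf, X, unlabeled_idx, batch_size=25):
        # smallest absolute distance to the SVM hyperplane = least certain
        margins = np.abs(clf.decision_function(X[unlabeled_idx]))
        return unlabeled_idx[np.argsort(margins)[:batch_size]]

    def density_sampling(clf, X, unlabeled_idx, batch_size=25):
        # prefer candidates from dense regions of the unlabeled pool
        # (representative instances), measured by mean cosine similarity
        sims = cosine_similarity(X[unlabeled_idx])
        return unlabeled_idx[np.argsort(-sims.mean(axis=1))[:batch_size]]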
On top of that, weighting can be employed to deal with imbalanced datasets, which are common in systematic reviews [9]. Other methods are listed in a review by O'Mara-Eves et al. [11] and are taken into account for further analysis. The most promising approach, however, is the one proposed by Huang et al., which provides a systematic way to combine the informativeness and representativeness of an instance [6]. Unlike previous work, their algorithm considers the cluster structure of unlabeled instances as well as the class assignments of labeled examples for the representativeness measure.
Features and classifier
In the domain of systematic reviews, the most commonly applied classifiers are Support Vector Machines (SVMs), which have been shown to work especially well for text classification [15]. Consequently, SVMs are also adopted in this thesis. Since the text data extracted from scientific literature is rarely labeled, topic modeling has been used to reveal structure in the data. For instance, Miwa et al. found that Latent Dirichlet Allocation (LDA) boosts active learning performance [9], and LDA is the most common topic modeling approach in this domain [11]. However, Hashimoto et al. found that a novel topic modeling algorithm using paragraph vectors [8] combined with k-means clustering outperformed the LDA model [3]. It is promising to investigate whether these results can be replicated and how the approach performs on additional datasets.
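As a rough sketch of this feature pipeline, paragraph vectors can be trained with Gensim, clustered with k-means, and the resulting features fed to an SVM. The corpus variable documents (tokenized, e.g. with spaCy), the labeled_idx and labels arrays, and all hyperparameters below are illustrative assumptions, not the exact configuration of Hashimoto et al. [3].

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    # documents: list of token lists, one per citation (assumed to exist)
    tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(documents)]
    d2v = Doc2Vec(tagged, vector_size=300, epochs=20, min_count=2)
    doc_vecs = np.vstack([d2v.dv[i] for i in range(len(documents))])

    # k-means on paragraph vectors yields topic-like features
    # (here: distances to the cluster centroids)
    kmeans = KMeans(n_clusters=50, n_init=10).fit(doc_vecs)
    topic_feats = kmeans.transform(doc_vecs)
    X = np.hstack([doc_vecs, topic_feats])        # combined feature representation

    # the SVM classifier used inside the active learning loop
    clf = LinearSVC().fit(X[labeled_idx], labels[labeled_idx])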
Limitations of previous studies
Model performance
A drawback of the paragraph vector model proposed by Hashimoto et al. is a drop in performance in the early active learning iterations [3], which is attributed to the low number of labeled citations at the beginning. Two studies have addressed this problem. The first is a follow-up study that alleviated the cold start problem by applying label propagation [7]: unlabeled citations that are similar to a manually labeled citation adopt its label. The second study employs density-based selection of candidates at the beginning of training [1], which seems promising for initiating the process. Both are possible strategies to circumvent a cold start.
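For the first strategy, scikit-learn ships an off-the-shelf label propagation variant that illustrates the idea; the snippet below is only a minimal sketch, not a re-implementation of the approach in [7]. The feature matrix X, the index array labeled_idx, and the reviewer_labels vector are assumed; -1 marks citations the reviewer has not screened yet.

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    # y contains 0 (irrelevant) or 1 (relevant) for reviewed citations, -1 otherwise
    y = np.full(X.shape[0], -1)
    y[labeled_idx] = reviewer_labels

    # propagate labels from reviewed citations to similar unlabeled ones
    propagator = LabelSpreading(kernel="knn", n_neighbors=10)
    propagator.fit(X, y)
    pseudo_labels = propagator.transduction_   # labels for the whole pool

    # pseudo-labeled citations can then be used to warm-start the classifier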
Interaction
All of the aforementioned studies on systematic reviews are retrospective: already labeled datasets are used and human feedback is only simulated. It is therefore necessary to investigate whether active learning performs as hypothesized with human reviewers. One challenge here is the high number of annotations needed; in one case over 3000 annotations were necessary to identify 95% of the relevant candidates [3]. In a natural setting this poses a serious challenge, so appropriate evaluation methods for human feedback have to be determined. Only one recent study has systematically investigated human feedback on text collections [13]. Its evaluation is based on 22 reference collections of public health topics. The authors were able to verify that screening effort could be saved, and they also included decision time as a measure. The study is a good starting point for examining the effect of human interaction, but it lacks qualitative measures and the investigation of collections from other domains.

Another challenge concerns the type of feedback. In the typical active learning scenario, feedback is obtained only through the labeling process. This is a very basic form of interaction that is geared towards the classifier. Additional feedback from the user could be taken into account to achieve a more sophisticated interaction. For example, a user may indicate that papers written by a specific author are especially relevant. With such additional expert input, higher accuracy is expected. In particular, Hope and Shahaf demonstrated that proportion constraints on groups lead to high accuracy with only few labels [5]. The additional input is therefore expected to be another driver for reducing the number of needed annotations. Bernard et al. built on the idea of additional feedback and identified various user labeling strategies [1]. They distinguish between a model-based approach (active learning) and a user-based approach (visual-interactive labeling). These approaches are not mutually exclusive; by combining both, their work can be taken a step further and more elaborate interaction can be achieved.
Objective and planned work
Timetable

The aim of this thesis is to implement and assess an active learning approach that incorporates information from paragraph vectors using an SVM, following [3]. Taking the limitations of previous studies into account, the goal is to develop a model that solves the cold start problem, tackles the high number of needed annotations, and offers visualization and feedback possibilities. In the first month I plan to get familiar with the literature I have gathered, focusing on active learning strategies and evaluation methods. At the same time I will start to implement the model proposed by Hashimoto et al. [3]. The implementation phase is generously dimensioned to ensure that all additional components are well thought out. After the first month I want to begin with first visualization trials. In parallel to Bernard et al. [1], the documents will be visualized in feature space by taking the document vectors, applying a dimension reduction technique such as PCA, and plotting the results. Visualization techniques from a previous approach are shown in Figure 2 to illustrate how the result may look. The visualization allows the model to provide feedback to the user regarding model decisions and also to take user feedback into account. Within the second month I plan to evaluate the model with labeled datasets, comparing active learning strategies on benchmark datasets. Based on the results of those benchmark tests and the compatibility of the active learning strategy with additional user feedback, an active learner is selected. Subsequently, the model will be evaluated with computer science experts through qualitative and quantitative measures. Thereby, metrics like decision time, perceived usefulness, and workload can serve as measures for improving the design of screening tools. Other challenges arise through the interaction, namely appropriate evaluation metrics and stopping criteria, which will be addressed thoroughly during the literature review and implementation phase.

Figure 2: Visualization techniques, taken from [1]
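A first visualization trial along these lines only requires projecting the document vectors onto two components and coloring them by annotation status. The sketch below assumes the doc_vecs matrix and the partially filled label vector y from the sketches above; the color scheme and the choice of PCA are illustrative.

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # project the paragraph vectors onto two principal components
    coords = PCA(n_components=2).fit_transform(doc_vecs)

    # color points by annotation status: relevant, irrelevant, unlabeled
    colors = {1: "tab:green", 0: "tab:red", -1: "lightgrey"}
    plt.scatter(coords[:, 0], coords[:, 1], c=[colors[l] for l in y], s=10)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("Document collection in feature space")
    plt.show()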
Hypotheses
- The model needs fewer annotations from the reviewer than previous approaches
- Model performance is higher compared to an LDA-based approach
- Screening time is reduced compared to previous approaches
- Users rate the feedback-based model as more useful than a model without feedback
Methodology
The model will be implemented in Python using the libraries Gensim, spaCy, and scikit-learn. Benchmark testing will be done on datasets that are available online and have been used previously, see [9, 16]. Possible datasets are: Drug Effectiveness Review Project (DERP), Chronic Obstructive Pulmonary Disease (COPD), Proton beam, and Micro nutrients. In addition, two datasets exist that provide means to investigate recommendations: Dataset 2 from the ACM Digital Library and the Related-Article Recommendation Dataset (RARD). Further testing with human feedback will include computer science papers from arXiv and a subset of the Semantic Scholar corpus. The model will be evaluated with the following methods: utility and coverage [7], work saved over sampling at 95% recall [3], and human evaluation.
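Work saved over sampling at 95% recall can be computed directly from the order in which the model would have the reviewer screen the citations. The helper below is a minimal sketch following the usual definition of the measure; variable names are illustrative.

    import numpy as np

    def wss_at_recall(screening_order, labels, recall=0.95):
        # screening_order : indices of citations in the order they are screened
        # labels          : 1 = relevant, 0 = irrelevant, for the whole collection
        n = len(labels)
        target = int(np.ceil(recall * np.sum(labels == 1)))

        found = 0
        for screened, idx in enumerate(screening_order, start=1):
            found += int(labels[idx] == 1)
            if found >= target:
                # fraction of the collection spared from manual screening, minus
                # the (1 - recall) that random screening would also save
                return (n - screened) / n - (1.0 - recall)
        return 0.0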
References
- Jürgen Bernard, Marco Hutter, Matthias Zeppelzauer, Dieter Fellner, and Michael Sedlmair. Comparing Visual-Interactive Labeling with Active Learning: An Experimental Study. IEEE Transactions on Visualization and Computer Graphics, 24(1):298–308, 2018.
- Lutz Bornmann and Rüdiger Mutz. Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11):2215–2222, 2015.
- Kazuma Hashimoto, Georgios Kontonatsios, Makoto Miwa, and Sophia Ananiadou. Topic detection using paragraph vectors to support active learning in systematic reviews. Journal of Biomedical Informatics, 62:59–65, 2016.
- Andreas Holzinger. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2):119–131, 2016.
- Tom Hope and Dafna Shahaf. Ballpark learning: Estimating labels from rough group comparisons. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9852 LNAI:299–314, 2016.
- Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active Learning by Querying Informative and Representative Examples. Advances in Neural Information Processing Systems, pages 892–900, 2010.
- Georgios Kontonatsios, Austin J. Brockmeier, Piotr Przybyla, John McNaught, Tingting Mu, John Y. Goulermas, and Sophia Ananiadou. A semi-supervised approach using label propagation to support citation screening. Journal of Biomedical Informatics, 72:67–76, 2017.
- Quoc V. Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. International Conference on Machine Learning, 32:1188–1196, 2014.
- Makoto Miwa, James Thomas, Alison O’Mara-Eves, and Sophia Ananiadou. Reducing systematic review workload through certainty-based screening. Journal of Biomedical Informatics, 51:242–253, 2014.
- Fredrik Olsson. A literature survey of active machine learning in the context of natural language processing. Swedish Institute of Computer Science, pages 134–138, 2009.
- Alison O’Mara-Eves, James Thomas, John McNaught, Makoto Miwa, and Sophia Ananiadou. Using text mining for study identification in systematic reviews: A systematic review of current approaches. Systematic Reviews, 4(1):1–22, 2015.
- Andrew Plume and Daphne Van Weijen. Publish or perish? The rise of the fractional author. Research Trends, 38(3):16–18, 2014.
- Piotr Przybyla, Austin J. Brockmeier, Georgios Kontonatsios, Marie-Annick Le Pogam, John McNaught, Erik von Elm, Kay Nolan, and Sophia Ananiadou. Prioritising references for systematic reviews with RobotAnalyst: A user study. Research Synthesis Methods, (June):1–19, 2018.
- Burr Settles. Active Learning Literature Survey. Machine Learning, 15(2):201–221, 2010.
- Simon Tong and Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, pages 45–66, 2001.
- Byron C. Wallace, Thomas A. Trikalinos, Joseph Lau, Carla Brodley, and Christopher H. Schmid. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics, 11, 2010.
- Liping Yang, Alan M. MacEachren, Prasenjit Mitra, and Teresa Onorati. Visually-Enabled Active Deep Learning for (Geo) Text and Image Classification: A Review. ISPRS International Journal of Geo-Information, 7(2):65, 2018.
Contact Persons at the University of Würzburg
Dr. Martin Fischbach (Primary Contact Person), Mensch-Computer-Interaktion, Universität Würzburg
martin.fischbach@uni-wuerzburg.de