- Start: August 1, 2014
- End: July 31, 2018
- Contract Terms: Funded by National Science Foundation
- Principal Investigator: Eric Ringger
- Co-Investigator: Kevin Seppi
- Website: https://facwiki.cs.byu.edu/nlp/index.php/Main_Page
- Project Description:
Users are often confronted with large, unannotated digital document datasets (e.g., court proceedings, presidential libraries, WikiLeaks), and they must make sense of them quickly.
Existing search tools are not particularly useful for such datasets when the user does not know what to look for. Computer-aided discovery of important trends and patterns can make these collections more valuable and informative. Topic models
such as latent Dirichlet allocation (LDA) are a common tool for addressing such information needs, but these tools are often presented to users as a “take it or leave it” proposition.
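For concreteness, the sketch below fits the kind of static, "take it or leave it" LDA model described above using scikit-learn; the toy corpus, topic count, and parameter values are purely illustrative and are not part of the project.

```python
# A minimal sketch of a static ("take it or leave it") LDA topic model,
# fit with scikit-learn. Corpus and parameters are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the court reviewed the evidence in the appeal",
    "the president signed the trade agreement",
    "leaked cables discuss the trade negotiations",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic proportions

# Print the top words for each discovered topic.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```

Once fit, the user has no way to steer what the topics look like; that gap is what the proposal targets.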
Effective methods should use the insights and deep knowledge of the users seeking to understand these collections. Research into effective environments for information exploration has traditionally been split between two distinct research areas: models and interfaces. Machine learning researchers build ever more complicated models, and human-computer interaction researchers build shiny interfaces that assume a static model. The lack of communication between these groups comes at a price: for many years, the objective function used by machine learning researchers building topic models (likelihood on held-out data) was negatively correlated with human judgments of topic model quality. Similarly, existing interfaces do not allow the user to correct or guide topic models.
This proposal bridges this divide by bringing together machine learning and human-computer interaction researchers to build topic model interfaces that allow users to interactively refine the entire topic modeling process: the number and granularity of topics, vocabulary selection (the words considered or ignored by the model), and constraints on which words appear together in topics. Moreover, documents do not appear in isolation; effective analysis must also include document metadata, which users can explore and interact with through a metadata map interface. While these options have been usable by and accessible to machine learning experts for years, they have not been incorporated into new, more broadly accessible tools.
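As a rough illustration of that refinement loop, the sketch below simply re-fits a model after hypothetical user feedback on the vocabulary and topic count; the project's actual interactive system would update the model rather than retrain from scratch and would support richer constraints (such as which words may share a topic), which this sketch does not attempt.

```python
# Minimal sketch of one refinement step: the user removes uninformative words
# and changes the topic count, and the model is re-fit from scratch. This is
# only an illustration; it does not implement the interactive inference or
# word-level constraints the project describes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def refit(docs, num_topics, ignored_words):
    """Re-fit an LDA model after user feedback on vocabulary and granularity."""
    vectorizer = CountVectorizer(stop_words=list(ignored_words) or None)
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=0)
    lda.fit(counts)
    return vectorizer, lda

docs = [
    "the court reviewed the appeal and the new evidence",
    "the president signed the trade agreement",
    "leaked cables describe the trade negotiations in detail",
]

# First pass with default settings; second pass after the user marks "court"
# and "appeal" as uninformative and asks for finer-grained topics.
model_v1 = refit(docs, num_topics=2, ignored_words=set())
model_v2 = refit(docs, num_topics=3, ignored_words={"court", "appeal"})
```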
This research provides advances in three areas: new machine learning models focused on user needs, along with efficient, scalable inference for such Bayesian models; new visualizations and interfaces for interactive machine learning; and evaluations. Compared to static topic modeling approaches, our new topic models will better incorporate the effect of metadata to discover coherent topics, use active learning to better discover useful annotations, and represent topic distributions more efficiently to allow interactive vocabulary modification. This work will also create new interfaces for interactive vocabulary discovery, which, while important for unsupervised learning (such as topic modeling), will likely also help those investigating supervised learning.
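One simple way to picture the role of metadata is to aggregate per-document topic proportions by a metadata field; the sketch below does this for a hypothetical year field and only illustrates the kind of aggregate view a metadata map could expose, not the metadata-aware models the project will build.

```python
# Minimal sketch: combine per-document topic proportions (from any fitted
# topic model) with a hypothetical "year" metadata field to get the kind of
# aggregate view a metadata map interface could display.
import numpy as np

doc_topics = np.array([  # rows: documents, columns: topic proportions
    [0.8, 0.2],
    [0.3, 0.7],
    [0.1, 0.9],
])
years = np.array([2014, 2014, 2015])  # illustrative metadata

for year in np.unique(years):
    mean_props = doc_topics[years == year].mean(axis=0)
    print(year, mean_props.round(2))
```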
The work will be of significance to anyone in possession of large collections of text, including libraries, courts, publishers, scholars, and forensics experts. Topic models have grown in importance outside of computer science and have been adopted by digital humanists and social scientists. More flexible tools to dig through large text datasets will allow these researchers to conduct and communicate their research more efficiently.