2019 Volume 48 Issue 1-2 Pages 1-16
The development of next-generation sequencing-based genome analysis methods has made microbiome 16S rRNA sequence data easily accessible. Various methods for analyzing microbiome data have been proposed; among them, the latent Dirichlet allocation (LDA) model, which is frequently used to extract latent topics of words from documents, has been proposed for extracting microbial clusters. LDA models that utilize supervisory information such as document characteristics (e.g., supervised LDA or LDA with Dirichlet multinomial regression (DMR)) have already been developed; however, these methods have not been applied to microbial data. Moreover, with existing LDA approaches, the proportion of each extracted topic tends to be relatively small, and topics that account for a large proportion are often not extracted. We therefore established a Bayesian nonparametric topic model with DMR and evaluated it on two real microbiome 16S rRNA datasets. A Bayesian nonparametric model postulates an infinite number of latent variables and can differentiate the proportions of those latent variables. By using a method that generates a stick-breaking process on the basis of covariate values, we propose a Bayesian nonparametric topic model that extracts topics associated with covariate values. In the evaluation, the proposed method showed the highest predictive ability among the methods compared. In addition, it extracted microbial topics that are associated with a clinical outcome and whose proportions are relatively large.