IPSJ Transactions on Bioinformatics
Online ISSN : 1882-6679
ISSN-L : 1882-6679
A Novel Metagenomic Binning Framework Using NLP Techniques in Feature Extraction
Viet Toan Tran, Hoang D. Quach, Phuong V. D. Van, Van Hoai Tran

2022 Volume 15 Pages 1-8

Abstract

Metagenomics studies microorganisms sampled directly from the environment, without relying on traditional culturing. In such studies, the output of the binning step serves as input for later steps of a metagenomic project, such as assembly and annotation. The main challenge in binning is the lack of explicit features in metagenomic reads, especially in short-read datasets. There are two approaches to binning: supervised and unsupervised learning. Unfortunately, only about 1% of microorganisms in nature are annotated, which causes problems for supervised approaches when the dataset under study contains unknown species. Previous studies usually assumed that reads sharing a taxonomic label have similar k-mer distributions. Our new method uses Natural Language Processing (NLP) techniques to generate feature vectors. Additionally, the paper presents a comprehensive unsupervised framework for applying different embeddings drawn from notable NLP techniques in topic modeling and sentence embedding. Experimental results on simulated datasets show that the performance of the proposed approach is comparable to that of previous studies, demonstrating the feasibility of applying NLP to metagenomic binning. The program can be found at https://github.com/vandinhvyphuong/NLPBimeta.
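
The idea of treating a read as a "sentence" of k-mer "words" can be illustrated with a minimal sketch. This is not the authors' implementation and does not reproduce the topic-modeling or sentence-embedding methods the abstract refers to; it only shows the simpler bag-of-words view that underlies the k-mer distribution assumption. The function names (`kmer_tokens`, `kmer_frequency_vector`, `cosine_similarity`) and the choice of k = 4 are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def kmer_tokens(read, k=4):
    """Split a read into overlapping k-mers ("words").

    k-mers containing any base outside A/C/G/T (e.g. N) are skipped.
    """
    read = read.upper()
    tokens = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set(ALPHABET):
            tokens.append(kmer)
    return tokens


def kmer_frequency_vector(read, k=4):
    """Return the normalized k-mer frequency vector of a read.

    The vocabulary is every possible k-mer over A/C/G/T in lexicographic
    order, so the vector always has 4**k entries.
    """
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(kmer_tokens(read, k))
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[word] / total for word in vocab]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors of this kind could then be passed to any clustering algorithm. An NLP-based approach would replace the raw frequency vector with a learned embedding of the same k-mer tokens.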

© 2022 by the Information Processing Society of Japan