2022 Volume 15 Pages 1-8
Metagenomics studies microorganisms sampled directly from the environment, without traditional culturing. In such studies, the output of the binning step serves as input for subsequent steps of a metagenomic project, such as assembly and annotation. Binning methods follow two approaches: supervised and unsupervised learning. Unfortunately, only about 1% of microorganisms in nature have been annotated, which causes problems for supervised approaches when the dataset under study contains unknown species. The main challenge of binning is the lack of explicit features in metagenomic reads, especially in short-read datasets. Previous studies usually assumed that reads sharing a taxonomic label have similar k-mer distributions. We propose a new method that uses Natural Language Processing (NLP) techniques to generate feature vectors. Additionally, the paper presents a comprehensive unsupervised framework for applying different embeddings, drawn from notable NLP techniques in topic modeling and sentence embedding. Experimental results on simulated datasets show that the proposed approach performs comparably to previous studies, demonstrating the feasibility of applying NLP to metagenomic binning. The program is available at https://github.com/vandinhvyphuong/NLPBimeta.
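The idea of treating a read as text can be illustrated with a minimal sketch. It is not the implementation in the NLPBimeta repository. It only assumes that a read is split into overlapping k-mers that play the role of "words", and that a normalized bag-of-words count vector serves as a baseline feature vector. The function names, the choice of k, and the cosine comparison are all hypothetical choices made for illustration. In an NLP-based pipeline, the k-mer word list would instead be passed to a topic model or a sentence-embedding model.

```python
from collections import Counter
from itertools import product
from math import sqrt


def read_to_kmer_words(read, k=4):
    """Split a read into overlapping k-mers ("words").

    k-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    read = read.upper()
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    return [kmer for kmer in kmers if set(kmer) <= set("ACGT")]


def kmer_frequency_vector(read, k=4):
    """Return normalized k-mer frequencies over all 4**k possible k-mers."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read_to_kmer_words(read, k))
    total = sum(counts.values())
    if total == 0:
        # Read shorter than k, or no valid k-mers: return an all-zero vector.
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    read_a = "ACGTACGTACGTACGT"
    read_b = "ACGTACGTTCGTACGA"
    print(read_to_kmer_words("ACGTAC", k=3))
    print(cosine_similarity(kmer_frequency_vector(read_a),
                            kmer_frequency_vector(read_b)))
```

Under the assumption stated in the abstract, reads from the same taxon would tend to produce vectors with higher similarity, so an unsupervised clustering step could group them. An embedding model would replace the plain frequency vector at this point.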