The Lancelet (Branchiostoma belcheri) Genome Sequencing and Annotation Project Database
Lancelet (amphioxus) represents the most basal living chordate lineage (the cephalochordates), which diverged from urochordates and vertebrates about 550 million years ago, and it retains a body plan and morphology most similar to those of fossil Cambrian chordates. Because it occupies a key evolutionary position, it has been widely used in research on cephalochordate biology and chordate evolution, especially the origin and evolution of vertebrate adaptive immunity. Recently, an active RAG transposon containing ProtoRAG (a prototypic recombination-activating gene, or RAG) was discovered in the lower chordate lancelet. This sequence encodes RAG1-like (RAG1L) and RAG2-like (RAG2L) proteins and meets the structural criteria for the long-sought RAG transposon, illuminating the origins of V(D)J recombination and providing strong evidence in favour of the RAG transposon hypothesis for the origin of jawed vertebrate adaptive immunity. Since lancelet has become one of the best proxies for understanding the chordate ancestral state, a database with multiple lancelet genomes and annotation data, including domain types, is urgently needed to investigate the loss and gain of domains in orthologues among species. In particular, we wanted to enable searches for ancient domain types (non-vertebrate-specific domains) and for novel domain combinations that arose during evolution from invertebrates to vertebrates, which may provide new insights into the chordate ancestral state.
Main points highlighted in our LanceletDB project
Here, we present a lancelet (Branchiostoma belcheri) genome sequencing and annotation project database, named LanceletDB and available at http://genome.bucm.edu.cn/lancelet. LanceletDB is a web-accessible integrated genome database for two popular lancelet species (B. belcheri and B. floridae). It provides convenient URL-based retrieval, browsing and presentation of several types of information online, including genome sequences, gene models, gene function and domain types in orthologues among type species, gene expression patterns in lancelet embryogenesis, various expressed sequence tag (EST) sets, and the alternative polyadenylation (APA) sites profiled by the high-throughput sequencing of APA sites (SAPAS) method. In addition, we integrate the released diploid lancelet genome annotation data (B. floridae) to expand LanceletDB and extend its usefulness. These data are available through the search page, the BLAST page and the genome browser, which together provide an integrated display of the annotation data. The advances and new biological information are outlined below.
(1) The B. belcheri diploid assembly is fragmented and highly polymorphic, so we developed a novel algorithm, implemented as an automated pipeline (HaploMerger), to create the reference haploid assembly from the original diploid assembly. The haploid assembly adopted in LanceletDB may represent a better reference assembly for lancelet B. belcheri, because it maintains better sequence contiguity and continuity. HaploMerger is available as an open-source package on our website at http://genome.bucm.edu.cn/lancelet/download_softwares.php.
(2) As previously described, gene model sets were obtained by integrating the results of de novo, homology-based and transcriptome-based gene prediction. Proteins were annotated by searching against the InterPro database, the Pfam domain database, the Gene Ontology database and the KEGG database. In particular, LanceletDB details the domain types in orthologues among lancelet and other species, which enables the investigation of the loss and gain of domains in orthologues, especially ancient domain types (non-vertebrate-specific domains) and novel domain combinations that arose during evolution from invertebrates to vertebrates.
(3) EST alignments and RNA-seq read coverage data are visualized to support the gene models in LanceletDB. These include the RNA-seq reads generated from eight lancelet samples corresponding to the major stages of early embryonic development, given in hours or days post-fertilization (hpf, dpf): oosperm (0 hpf), 4-8 cells (0.5 hpf), blastula (4 hpf), cap-gastrula (5 hpf), cup-gastrula (6 hpf), late neurula (20 hpf), 1 gill (30 hpf) and larva (6 dpf). These data help profile the dynamic usage of lancelet genes during embryogenesis and provide additional experimental support for the gene models in LanceletDB.
(4) The generated lancelet alternative polyadenylation (APA) sites dataset, “B.belcheri_v7h2_polyA_V.anguillarum-infects-intestine”, not only helps us annotate the APA sites for the predicted transcript models in LanceletDB, but is also well documented as a new searchable dataset in our web-accessible APASdb, detailing APA site switching in the intestine of lancelets with and without challenge by Vibrio anguillarum.
(5) LanceletDB is a user-friendly web database: keywords such as gene ID, gene name, symbol and fuzzy description can be used to query transcript sets of interest directly. LanceletDB supports URL-based retrieval, browsing and display of several types of information, including exon-intron structure, poly(A) signal type, poly(A) sites, 3'-UTR regions, and domain types in orthologues among lancelet and other species.
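The loss and gain of domains described in point (2) can be pictured as a set comparison between the domain types of two orthologues. The following is a minimal sketch of that idea only, not the method used to build LanceletDB: the function name compare_domains and the domain labels (domain_A, domain_B, ...) are hypothetical placeholders, and real input would be Pfam or InterPro domain identifiers taken from the annotation.

```python
def compare_domains(lancelet_domains, vertebrate_domains):
    """Compare the domain types of two orthologues.

    Both arguments are iterables of domain identifiers. The result lists
    domains present only in the lancelet protein, present only in the
    vertebrate protein, and shared by both.
    """
    lancelet = set(lancelet_domains)
    vertebrate = set(vertebrate_domains)
    return {
        "lancelet_only": sorted(lancelet - vertebrate),
        "vertebrate_only": sorted(vertebrate - lancelet),
        "shared": sorted(lancelet & vertebrate),
    }


# Hypothetical placeholder annotations for one orthologue pair
# (illustrative labels, not real LanceletDB data).
lancelet_gene = ["domain_A", "domain_B", "domain_C"]
vertebrate_gene = ["domain_A", "domain_B", "domain_D"]

result = compare_domains(lancelet_gene, vertebrate_gene)
print(result)
```

Here a domain listed under lancelet_only is a candidate for loss along the vertebrate lineage (or gain in lancelet), and one under vertebrate_only is a candidate for a novel vertebrate domain; deciding between these readings requires additional outgroup species.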
Search by gene ID, gene name, symbol or description as used in publications and databases. Examples: ssr4, hox, TLR, caspase and Bb_073760F.
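As an illustration of how a keyword query over ID, symbol and description might behave, here is a minimal sketch using a case-insensitive substring match. It is an assumption about the general idea, not LanceletDB's actual search implementation; the Transcript class, the search function and all record contents (GENE_0001 and so on) are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    gene_id: str
    symbol: str
    description: str


def search(records, keyword):
    """Return records whose ID, symbol or description contains the keyword.

    Matching is a case-insensitive substring test; an empty or
    whitespace-only keyword returns no records.
    """
    needle = keyword.strip().lower()
    if not needle:
        return []
    return [
        r for r in records
        if needle in r.gene_id.lower()
        or needle in r.symbol.lower()
        or needle in r.description.lower()
    ]


# Hypothetical placeholder records (not real LanceletDB entries).
records = [
    Transcript("GENE_0001", "tlr_placeholder", "placeholder receptor description"),
    Transcript("GENE_0002", "hox_placeholder", "placeholder homeobox description"),
    Transcript("GENE_0003", "casp_placeholder", "placeholder caspase description"),
]

hits = search(records, "HOX")
print([r.gene_id for r in hits])
```

With this kind of matching, a symbol such as hox and a description word such as caspase are handled the same way, which is why short keywords can return many transcripts.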
Leiming You, et al. LanceletDB: an integrated genome database for lancelet, comparing domain types and combination in orthologues among lancelet and other species. Database: The Journal of Biological Databases and Curation. 2019, doi:10.1093/database/baz056.
LanceletDB current version: 1.0 (change log).