The Lancelet (Branchiostoma belcheri) Genome Sequence and Annotation Project Database
The lancelet (amphioxus) is a cephalochordate, the most basal extant chordate lineage, which diverged from the other two chordate lineages (urochordates and vertebrates) about half a billion years ago. It is widely used as a model system in evolutionary developmental biology to understand the basic patterning mechanisms of the chordate body plan and the origin of vertebrates, especially the origin of vertebrate adaptive immunity. Recently, ProtoRAG, an active DNA transposon family, was discovered in the lancelet (Branchiostoma belcheri). It encodes RAG1-like and RAG2-like proteins and meets the structural criteria for the long-sought RAG transposon, providing powerful evidence in favor of the RAG transposon hypothesis for the origin of jawed vertebrate adaptive immunity. A database of multiple lancelet genomes and their annotation is therefore urgently needed to provide new insights into the chordate ancestral state and vertebrate evolution.
Main points highlighted in our LanceletDB project
Here we present the lancelet genome annotation project database, LanceletDB, available at http://mosas.sysu.edu.cn/genome. LanceletDB is a national science foundation project that provides the reference haploid genome sequence and annotation data for the cephalochordate lancelet (Branchiostoma belcheri), including gene models and functional annotation, gene expression patterns during lancelet embryogenesis, alternative polyadenylation (APA) sites and several expressed sequence tag (EST) sets. We also integrate the publicly available diploid genome sequence and annotation data for Branchiostoma floridae to extend the usefulness of LanceletDB. These data are accessible through the search page, the BLAST page and the genome browser (GBrowser), which provides an integrated display of the annotation data. The advances and new biological information are outlined below.
(a). The diploid assembly is fragmented and highly polymorphic. We developed novel algorithms (HaploMerger) to reconstruct reference haploid assemblies (v2h7, v15h11 and v18h27) from the original diploid assembly. The HaploMerger package is available here (http://mosas.sysu.edu.cn/genome/download_softwares.php).
The new reference assembly (bbv18h27) was created using a pipeline that combines hierarchical scaffolding, a hybrid assembly method (Illumina and 454 reads were used for de novo assembly) and the HaploMerger2 algorithms (an unpublished updated version) for error correction and haploid assembly reconstruction. The new reference assembly has a size of ~426 Mb, with scaffold and contig N50 sizes of 2.3 Mb and 46 kb, respectively.
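For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed: sequences are sorted from longest to shortest, and the N50 is the length of the sequence at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions, not LanceletDB data or code from the project.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths.

    The N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy lengths, for illustration only (not real assembly data).
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # prints 40: 50 + 40 = 90 covers half of 150
```

Under this definition, a scaffold N50 of 2.3 Mb means that half of the assembled bases lie in scaffolds of at least 2.3 Mb.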
(b). As described in our previous publication, gene model sets were obtained by integrating the results of de novo, homology-based and transcriptome-based gene prediction. Proteins were annotated by searching against the InterPro database, the Pfam domain database, the Gene Ontology database and the KEGG database.
(c). RNA sequencing (RNA-Seq) was used to analyze gene expression patterns at the major stages of lancelet (Branchiostoma belcheri) embryogenesis: oosperm (0 hpf), blastula (4 hpf), cap-gastrula (5 hpf), cup-gastrula (6 hpf), anterior somite visible (10.5 hpf), late neurula (20 hpf), 1 gill (30 hpf) and larva (6 dpf). These data help profile the dynamic usage of lancelet genes and also provide additional experimental support for the gene models in LanceletDB.
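The sampled stages mix two time units (hpf, hours post fertilization, and dpf, days post fertilization). As a small sketch of how the time course could be put on a single axis for plotting or ordering expression profiles, the snippet below converts every stage to hours. The variable and function names are illustrative assumptions; only the stage names and time points come from the text above.

```python
# Stages and sampling times as listed in point (c); the unit is kept explicit.
STAGES = [
    ("oosperm", 0, "hpf"),
    ("blastula", 4, "hpf"),
    ("cap-gastrula", 5, "hpf"),
    ("cup-gastrula", 6, "hpf"),
    ("anterior somite visible", 10.5, "hpf"),
    ("late neurula", 20, "hpf"),
    ("1 gill", 30, "hpf"),
    ("larva", 6, "dpf"),
]


def to_hours(value, unit):
    """Convert a sampling time to hours post fertilization."""
    if unit == "hpf":
        return float(value)
    if unit == "dpf":
        return float(value) * 24  # one day = 24 hours
    raise ValueError(f"unknown time unit: {unit}")


# Stage name -> hours post fertilization, in sampling order.
TIMELINE = {name: to_hours(value, unit) for name, value, unit in STAGES}

for name, hours in TIMELINE.items():
    print(f"{name}: {hours:g} h")
```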
(d). Alternative polyadenylation (APA) sites in normal and Vibrio anguillarum-infected intestine were identified by the Sequencing Alternative Polyadenylation Sites (SAPAS) method, which uses high-throughput sequencing to identify and quantify the 3'-ends of polyadenylated transcripts. The APA data are visualized to provide additional experimental support for the lancelet gene models.
(e). LanceletDB is a user-friendly web database in which keywords such as gene ID, gene name, symbol and fuzzy description can be used to query transcript sets of interest directly. LanceletDB supports URL-based retrieval, browsing and display of several types of information, including exon-intron structure, poly(A) signal type, poly(A) sites and 3'-UTR regions.
Search by ID, gene name, symbol or description as available in publications and databases. Example queries: IFT27, IFT, TLR, caspase and Bb_063970F.
LanceletDB current version: 1.0 change log.