Unlike well-established protein-coding genes, publicly available lncRNAs are often sparsely annotated, partial in scope, and scattered across collections. For example, a large proportion of reported “lncRNAs” are assembled from short sequencing reads and tend to be incomplete at the 5’ or 3’ end. cDNA libraries are very often truncated at the 5’ end because of RNA degradation or because reverse transcriptase fails to copy through to the 5’ end. RNA-seq reads also cover the 5’ and 3’ ends unevenly. These inaccurate and truncated lncRNA annotations can have a profound impact on downstream uses of the data, such as misinterpretation of mRNA fragments as lncRNAs, unreliable transcript abundance estimates by FPKM, and misidentification of lncRNA promoter sites.
Arraystar maintains high-quality proprietary transcriptome and lncRNA databases that extensively collect lncRNAs from all major external data sources, knowledge-based mining of scientific publications, and our own lncRNA discovery pipelines. We place particular emphasis on collecting full-length lncRNAs. LncRNAs that are annotated as full length or experimentally supported in the public databases are compiled with higher priority. The lncRNAs in Arraystar proprietary transcriptome databases and in newly published studies are carefully assessed against supporting data for sequence completeness: 5’ ends by host gene histone marks, CAGE clusters, and DNase hypersensitivity (DHS) data; 3’ ends by poly(A)-position profiling (3P-Seq). Additionally, lncRNA candidates are evaluated for protein-coding potential by a combination of prediction methods. Only the lncRNAs that pass these assessments are curated into the full-length lncRNA collections (Fig. 1).
Figure 1. Comprehensive and robust collection of full-length lncRNAs from all major sources.
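The assessment steps described above can be pictured as a simple filter. The following is only a minimal sketch under stated assumptions, not Arraystar's actual pipeline: the `Candidate` record, its field names, and the thresholds `min_evidence` and `max_coding_calls` are all hypothetical, since the text does not say how the individual lines of evidence are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of the evidence gathered for one lncRNA candidate."""
    name: str
    has_histone_marks: bool  # histone-mark support at the 5' end
    has_cage_cluster: bool   # CAGE cluster at the 5' end
    has_dhs: bool            # DNase hypersensitivity (DHS) at the 5' end
    has_3p_seq: bool         # 3P-Seq poly(A) site at the 3' end
    coding_calls: tuple      # one bool per prediction method; True = "looks coding"


def five_prime_supported(c, min_evidence=1):
    # Assumption: at least `min_evidence` of the three 5' data types must agree.
    return sum([c.has_histone_marks, c.has_cage_cluster, c.has_dhs]) >= min_evidence


def three_prime_supported(c):
    return c.has_3p_seq


def is_noncoding(c, max_coding_calls=0):
    # Assumption: reject a candidate if more than `max_coding_calls` methods call it coding.
    return sum(c.coding_calls) <= max_coding_calls


def passes_curation(c):
    return five_prime_supported(c) and three_prime_supported(c) and is_noncoding(c)


candidates = [
    Candidate("cand_A", True, True, False, True, (False, False)),
    Candidate("cand_B", False, False, False, True, (False, False)),  # no 5' support
    Candidate("cand_C", True, True, True, True, (True, False)),      # one method calls it coding
]
curated = [c.name for c in candidates if passes_curation(c)]
print(curated)  # ['cand_A']
```

Keeping each check in its own function makes it easy to tighten one criterion (for example, requiring two independent 5’ data types) without touching the others.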
Arraystar Human LncRNA Array V5.0 has a total of 39,317 lncRNAs in two lncRNA collection tiers: 8,393 Gold Standard LncRNAs and 30,924 Reliable LncRNAs.
The Gold Standard LncRNAs are well-annotated, experimentally validated, genuine lncRNAs, in contrast to the very large numbers of partial fragments, incomplete UTRs, and less reliable sequences deposited as “lncRNAs” in the public databases. They are complete with annotations of lncRNA transcription units, transcript isoforms, functional molecular mechanisms, and subcellular localizations. They are selected from:
- lncRNAdb v2.0, compiled as the reference database for functional lncRNAs;
- Experimentally validated lncRNAs (Featured lncRNAs) from LncRNAWiki;
- Level 1 GENCODE v21 lncRNAs with experimental support by RT-PCR-Seq and manual curation;
- RefSeq high-confidence full-length lncRNAs under stringent selection;
- Arraystar lncRNA complete transcripts with 5' TSS, 3' ends, and expression data defined by ENCODE CAGE clusters, PolyA-seq, deep RNA-Seq, and capture-seq [3, 4];
- Arraystar continuous lncRNA curation from scientific publications, assessed with the same stringency as the Arraystar lncRNA complete transcripts.
The Reliable LncRNAs form the comprehensive yet highly reliable collection tier of the lncRNA transcriptome and comprise the lncRNAs that remain after the Gold Standard LncRNAs are selected. Their sequences are consolidated into transcription unit (TU) models. From each TU, one best transcript is selected as the representative lncRNA based on the transcript source, length, and other supporting information. 32,667 Reliable LncRNAs were constructed from 308,525 putative lncRNA sequences.
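As a rough illustration of choosing one representative per TU, the sketch below groups transcripts by TU and ranks them by source and then by length. It is an assumption-laden example rather than the actual selection procedure: the `SOURCE_RANK` ordering, the dictionary keys, and the sample transcripts are hypothetical, and the "other supporting information" mentioned above is not modeled.

```python
from collections import defaultdict

# Hypothetical source ranking (lower is preferred); the real priority order is not given.
SOURCE_RANK = {"RefSeq": 0, "GENCODE": 1, "assembled": 2}


def pick_representatives(transcripts):
    """Return {tu_id: transcript_id}, choosing one transcript per transcription unit."""
    by_tu = defaultdict(list)
    for t in transcripts:
        by_tu[t["tu"]].append(t)

    reps = {}
    for tu, members in by_tu.items():
        # Prefer the better-ranked source, then the longer transcript;
        # the id is a final tie-breaker so the result is deterministic.
        best = min(
            members,
            key=lambda t: (
                SOURCE_RANK.get(t["source"], len(SOURCE_RANK)),
                -t["length"],
                t["id"],
            ),
        )
        reps[tu] = best["id"]
    return reps


transcripts = [
    {"id": "tx1", "tu": "TU1", "source": "assembled", "length": 3200},
    {"id": "tx2", "tu": "TU1", "source": "GENCODE", "length": 2100},
    {"id": "tx3", "tu": "TU1", "source": "GENCODE", "length": 2600},
    {"id": "tx4", "tu": "TU2", "source": "assembled", "length": 900},
]
representatives = pick_representatives(transcripts)
print(representatives)  # {'TU1': 'tx3', 'TU2': 'tx4'}
```

In this toy input, `tx1` is the longest transcript in TU1 but loses to `tx3` because source is ranked ahead of length; swapping the order of the key tuple would give length priority instead.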
LncRNA Array Service