A Survey of the Sorghum Transcriptome Using Single-molecule Long Reads

Introduction

A complete transcript resource is essential to gene discovery and studies of genetic variation, especially in species that lack a reference genome. Alternative splicing (AS) is an evolutionarily critical feature of eukaryotic genes that increases the proteome diversity of cells. AS can also be an important layer of gene regulation in response to environmental changes [1,2]. High-throughput RNA sequencing (RNAseq) of short fragments (with read lengths usually below 300 bp) can provide in-depth coverage of low-abundance transcripts, but assembling full transcripts with bioinformatic algorithms remains challenging, especially for genes that have undergone extensive AS [3]. In recent years, single-molecule sequencing platforms, such as those of Pacific Biosciences (PacBio) and Oxford Nanopore Technologies, have been characterized by long read lengths, high throughput, high accuracy, and the absence of amplification [4,5]. These platforms allow direct sequencing of full-length transcripts, which supports the identification of gene isoforms better than short-read RNA sequencing because there is no need to reconstruct the transcript variants [6,7].

Recent studies in several plant species have shown that single-molecule transcriptome sequencing is an effective means of identifying gene isoforms. For instance, using Isoform Sequencing (Iso-Seq) developed by PacBio, a study in maize revealed 111,151 isoforms in six tissues, and tissue-specific isoforms were identified and investigated [8]. Likewise, comparative analyses of long-read transcriptomes from maize and sorghum have uncovered evolutionarily conserved isoforms and species-specific AS patterns [9]. Studies of many other plant species, such as rice, clover, sugarcane, bamboo, and coffee [10–15], have also been reported, and these provide a comprehensive supplement for understanding the diversity of transcriptomes.

For species without a high-quality reference genome, however, accurate identification of AS isoforms remains difficult. Because AS patterns take different forms, sequence alignment of isoforms may not be efficient for revealing AS sites, especially for sequences with minor changes. In a study of Amborella trichopoda, Liu and colleagues described a pipeline to identify AS isoforms without using a reference genome; this all-vs.-all approach yielded 428 pairs of AS isoforms with a validation rate of 66–76% [16]. Combined analysis of short-read and long-read transcriptome sequencing can be a robust means of characterizing the structures and expression profiles of AS isoforms in non-model species. One approach is to assemble reads into a reference and determine AS sites by re-alignment. For instance, the IDP-denovo tool takes both short and long reads to construct a 'pseudo-genome' for AS identification. This method has shown a substantial increase in efficiency for non-model species [17].

Single-molecule transcriptome sequencing produces a novel set of full-length isoforms that is well suited to downstream analyses, including the identification of long non-coding RNAs (lncRNAs), alternative polyadenylation (APA), and fusion transcripts [7,18]. Recently, lncRNAs have been found to be abundant in plant genomes, and they can play important regulatory roles in diverse biological processes [19–21]. Although the number of genomic databases of lncRNAs is increasing [22,23], the understanding of the function of lncRNAs in plants is still limited. Little conservation of lncRNAs at the sequence level is observed between distantly related species, which poses an obstacle to the computational prediction of lncRNA function. Still, a key aspect of the function of lncRNAs is their association with small regulatory RNAs (e.g. microRNA [miRNA] and small interfering RNA [siRNA]). Mature miRNAs are mainly 20–21 bp RNAs that can direct the expression of their target transcripts at the post-transcriptional level [24]; miRNAs can also trigger the synthesis of secondary siRNAs, including trans-acting small interfering RNAs (tasiRNAs) and phased small interfering RNAs (phasiRNAs), which further modify gene expression [24–26].

With the development of more competent bioinformatic methods, single-molecule sequencing in non-model species will yield new insights into the regulation of gene expression. The genus Camellia is well known for its cultivars, which can be used for making tea, ornamentals, and edible oil [27]. Recently, high-quality reference genomes have been released for two genotypes of Camellia sinensis [28,29]. Here, we performed Iso-Seq analyses in five Camellia japonica tissues and designed a novel pipeline for identifying AS isoforms without using a reference genome; finally, we identified a new phasiRNA locus that is responsive to temperature stresses.

Materials and methods

Plant materials and treatments

Camellia japonica plants were grown in the greenhouse of the Research Institute of Subtropical Forestry (Fuyang, Zhejiang, China). For sample preparation, the plant tissues were collected, immediately frozen in liquid nitrogen, and stored at −80°C until further use. For temperature treatment, small cuttings of C. japonica (10–15 cm) were kept in a growth chamber under long-day conditions (16-h light/8-h dark) at 24°C and 40% humidity. For the low-temperature treatment, a glass freezer was controlled by a temperature sensor (PURUI G6000, Ningbo, China). For the high-temperature treatment, an incubator was set to the appropriate temperature prior to the experiment to stabilize the internal temperature. To collect the pericarp and immature seed, a young fruit was sliced in half, and the tissues were removed and collected using a sharp scalpel.

RNA preparation, library construction, sequencing, and IsoSeq data processing

Total RNA was extracted from the tissues of C. japonica using an RNAprep Pure Plant Plus Kit (Transgen, Cat No. DP441, Beijing, China). The concentration and integrity of the total RNA were checked before library construction. A Nanodrop 2000 spectrophotometer (Thermo Fisher, CA, USA) was used to measure the RNA concentration, and samples with more than 200 ng/µL and an optical density (OD) 260/280 ratio above 2.0 were used. Approximately equal amounts of RNA from the different tissue types were mixed according to the concentrations. To construct libraries for IsoSeq analysis, high-quality mRNA was purified with oligo(dT) beads (Invitrogen, Cat No. 61002) and then reverse-transcribed into cDNA using a SMARTer PCR cDNA Synthesis Kit (Clontech, Cat No. 634926). The cDNA fragments were size-selected on a BluePippin device (Sage Science, Beverly, MA, USA).

The Iso-Seq protocol was performed on a PacBio sequencer using the RSII platform, as previously described [5,18]. The raw reads were first filtered to remove reads with an accuracy below 0.75 or a length below 50 bp. The reads-of-insert (ROIs) were then divided into full-length and non-full-length reads based on the presence of 5ʹ and 3ʹ adapters. The full-length non-chimeric reads were clustered using the Iterative Clustering for Error Correction software to generate cluster consensus sequences representing high-quality isoforms (over 99% accuracy). The non-redundant isoforms were then retrieved using CD-HIT [30]. All sequencing data were deposited with the National Center for Biotechnology Information (NCBI) under the BioProject ID: PRJNA564707.

Transcriptome sequencing and data processing

The Illumina HiSeq platform was used to perform RNA sequencing for each sample to generate 2 × 150 bp short reads. The methods of library construction, sequencing, and data processing were as described elsewhere [31]. All clean reads were deposited in the NCBI Short Read Archive under the BioProject mentioned above. The clean reads were mapped to the non-redundant isoforms with bowtie2 v2.1.0 [32], and the expression level of each transcript was calculated as FPKM with RSEM v1.2.15 [33]. The same clean reads of each sample were also used to validate the AS sites predicted by IsoSplitter. For differential analysis, transcripts with at least a two-fold change and an FDR below 0.001 were identified with edgeR, as previously described [34].
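The two quantities used in this step can be illustrated with a minimal Python sketch. It assumes the textbook FPKM definition (RSEM itself works with effective transcript lengths, which are not modelled here), and the function names `fpkm` and `is_det` are hypothetical helpers, not part of RSEM or edgeR.

```python
import math


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Textbook definition only; RSEM uses effective lengths (assumption noted).
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


def is_det(log2_fold_change, fdr, min_fold=2.0, max_fdr=0.001):
    """Apply the two thresholds quoted in the text to a single transcript."""
    return abs(log2_fold_change) >= math.log2(min_fold) and fdr < max_fdr
```

For example, a 1 kb transcript with 100 fragments in a library of one million mapped fragments has an FPKM of 100 under this definition.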

AS identification and the IsoSplitter pipeline

To identify AS isoforms without a reference genome, we designed a sequence-alignment pipeline that uses the isoforms to predict and validate the AS sites (IsoSplitter, available at https://github.com/Hengfu-Yin/IsoSplitter). Briefly, IsoSplitter invokes a modified SIM4 program to find split-sites of transcripts [35], and each split-site is validated and quantified using the high-depth short reads. We modified the SIM4 'word size' of the core region to 15 (the default value is 12), which gives more stringent alignment results for further analyses. The detailed manual is available on the webpage. For this study, the non-redundant isoforms from Iso-Seq sequencing were used for AS identification as follows: 'IsoSplittingAnchor -i 95 -L 30bp longReadsFile'; to validate the AS sites, the clean reads from Illumina sequencing were used: 'ShortReadsAligner -q longReadsFile ShortReadsFile Breakpoint_out'. To quantify AS isoforms, we further identified the isoform-specific reads by mapping the short reads across the 'split-sites', and the value of 'average read counts per split per million reads' (ACM) was obtained to reveal isoform expression. The script for short-read mapping and quantification used in this study is available at https://github.com/Hengfu-Yin/IsoSplitter/scripts/ACM_quantification.py.
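One plausible reading of the ACM definition is sketched below in Python: the junction-read counts of an isoform's split-sites are averaged and then scaled by the library size in millions of reads. This is an interpretation of the wording above, not the authors' ACM_quantification.py script, and the function name `acm` is hypothetical.

```python
def acm(split_site_read_counts, total_reads):
    """Average junction-read count per split-site, per million reads.

    split_site_read_counts: junction-read counts, one per split-site of an isoform.
    total_reads: number of clean reads in the library (assumed normalizer).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not split_site_read_counts:
        return 0.0
    mean_count = sum(split_site_read_counts) / len(split_site_read_counts)
    return mean_count / (total_reads / 1_000_000)
```

Under this reading, an isoform with two split-sites covered by 46 and 92 junction reads in a library of 46 million reads has an ACM of 1.5.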

Small RNA expression analysis

The 21 bp small RNA sequences were used to design primers for quantitative expression analysis (Supplementary Table S1). The total RNA was prepared and normalized before reverse transcription with the Mir-X miRNA First-Strand Synthesis Kit (Clontech, Cat No. 638315, Dalian, China). For PCR analysis, a Mir-X miRNA qRT-PCR TB Green Kit (Clontech, Cat No. 638314, Dalian, China) was used according to the user's manual. The U6 sequence was used as the internal reference, and miR167 [mature sequence: tgaagctgccagcatgatctg; 35] was also used as a control for gene expression analysis. Three biological replicates were obtained for expression analysis. To predict the targets of the secondary siRNAs, we chose a transcriptome assembly of C. japonica to reduce the complexity [31], and the psRNATarget server was used with default settings [37]. The gene-specific primers (Supplementary Table S1) for quantitative real-time PCR (qRT-PCR) analysis of potential target genes were designed with Primer Express 3.0.1 (Applied Biosystems), and the reactions were carried out using a SYBR Premix Ex Taq kit (Takara, Dalian, China) as described [31].

Bioinformatic and statistical analysis

To predict the coding sequences of transcripts, TransDecoder (http://transdecoder.sourceforge.net/) was used to identify open reading frames (ORFs), and ORFs with more than 100 codons were kept for annotation analysis. The NCBI nucleotide sequences, NCBI non-redundant protein sequences, and Swiss-Prot were used to annotate the derived protein sequences with BLAST 2.2.31+. To identify lncRNAs, the pipeline of lncRNA prediction (PLEK software version 1.2, available at https://sourceforge.net/projects/plek/files/) was initially used [38] with the maize model (-model maize_ens_linli.model -range maize_ens_linli.range -minlength 300), and CPC software (cpc-0.9-r2 with default settings) was also used to find lncRNAs [39]. To obtain the final set of lncRNAs, we combined the predicted results and retained the overlapping sequences as lncRNAs. For homology analysis of lncRNAs, the genome annotations of Populus trichocarpa version 3.0 [40], Vitis vinifera (http://www.genoscope.cns.fr/externe/GenomeBrowser/Vitis/) [41], and Camellia sinensis [28] were downloaded. We then used sequence similarity alignment with BLASTN (version 2.2.2.31+, cut-off E-value: 1e-10) to identify homologous lncRNAs in different species. To determine the polyA signature, the 3ʹ UTR sequences were retrieved and searched with SIGNITRUTH to reveal the enriched signals [42].

For Gene Ontology (GO) enrichment analysis of differentially expressed genes, the hypergeometric test was used to calculate the enrichment probability for each GO term in a differentially expressed transcript (DET) set, and the results were further corrected by the Benjamini-Hochberg method [43]. To visualize the relationships of the enriched GO terms, the top 30 GO terms from biological processes were grouped using REVIGO with the default settings and plotted using Cytoscape 3.1.1 [44]. To compare the occurrence of polyA signal (PAS) sites between the AS and non-AS groups, a one-sided Fisher's exact test was used to calculate the significance of enrichment of PAS sites. For the prediction of phasiRNA loci, previous small RNA sequencing data were used with UEA sRNA Workbench version 3.2 [44,45].
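The statistics named above can be sketched with the Python standard library alone. The block shows an upper-tail hypergeometric probability and a Benjamini-Hochberg adjustment; the function names and argument names are hypothetical, and this is an illustration of the general tests rather than the exact software used in the study.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n transcripts are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

The same upper-tail probability is what a one-sided Fisher's exact test for enrichment computes on the corresponding 2 × 2 table, so `hypergeom_upper_tail` can also illustrate the PAS comparison between the AS and non-AS groups.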

Results

Extensive transcript isoforms from the Iso-Seq-based transcriptome in C. japonica

To construct a complete resource for gene discovery in C. japonica, we performed full-length transcriptome sequencing with the PacBio Iso-Seq technology. A mixed RNA sample from five different tissue types (Fig. 1A) was used for library construction with preferential sizes of 1–2 kb, 2–3 kb, 3–6 kb, and 5–10 kb. All libraries were sequenced on a PacBio SMRT sequencing platform. In total, 901,752 raw reads (around 10.2 billion bases) were generated; after filtering, 537,587 subreads representing 9.6 billion bases were obtained, including full-length and non-full-length transcripts (Fig. 1B, C). The size distribution of ROIs was consistent with the cDNA size selection used for library construction (Supplementary Fig. 1).

Figure 1. An overview of Iso-Seq transcriptome sequencing in Camellia japonica. (A) The plant tissues used for library construction. FB, floral bud; YL, young leaf; SK, seed kernel; PE, pericarp; IS, immature seed. The inset figure is a close-up image of fruit tissues. (B) The distribution of sequencing reads with 5ʹ and 3ʹ primers in libraries of different size. (C) The distribution of full-length and non-full-length sequences in libraries of different size.

To obtain high-quality gene isoforms, the subreads were polished with Illumina sequencing reads from different tissues to correct sequencing errors, and the redundant sequences with high similarity were then filtered [30]. The resulting dataset (111,277 transcripts in total), which included multiple AS isoforms of transcripts, was used for further analysis. To annotate the transcripts, multiple public databases of gene resources were searched for sequence similarities (Supplementary Table S2). In total, 108,083 transcripts were annotated, and the majority of the transcripts were found in the Non-Redundant Protein database (Supplementary Table S2; Supplementary Dataset 1). To accurately quantify the expression levels of transcripts, short reads from the Illumina sequencing platform were generated for the five tissue types with three biological replicates. The average number of reads per library was about 46 M, and the average mapping rate of total reads was 75.57% (Supplementary Table S3). The expression levels were calculated by aligning the short reads to the transcripts, and the distribution of fragments per kilobase of transcript per million mapped reads (FPKM) values was calculated for each library (Supplementary Fig. S2; Supplementary Dataset 2).

Identification of gene AS isoforms based on long and short reads without a reference genome

Due to the lack of a high-quality genome for C. japonica, the determination of AS isoforms was not trivial. We implemented a novel pipeline, called IsoSplitter, to identify AS sites based on the sequence alignment of transcripts. We adopted the alignment algorithms of SIM4 to determine the high-similarity regions for initial AS identification. As a tool designed to align cDNA to genomic DNA sequences, SIM4 determines the high-similarity regions (HSPs) with a 12-mer screening followed by a dynamic programming algorithm. This has been shown to have high accuracy and efficiency [35]. To identify potential AS sites of transcripts, we designed a reverse-tracing method through the modified SIM4 program: the HSP regions were screened for "split-sites" (sites that were adjacent and supported by another transcript) based on a core region of 15-mers; we then grouped potential gene isoforms and counted the occurrences of split-sites to reveal the transcript diversity (the details are presented in the Materials and Methods).
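The idea of a split-site can be illustrated with a deliberately simplified Python sketch. It is not the SIM4-based method described above: it assumes two isoforms that match exactly apart from one internal segment present only in the longer isoform, and it requires each matching flank to be at least as long as the 15-nt core. The function name `find_split_site` and its return format are hypothetical.

```python
def find_split_site(short_iso, long_iso, core=15):
    """Return (position_in_short, extra_segment_in_long), or None.

    Simplification: exact-matching flanks and a single extra segment only.
    """
    if len(short_iso) >= len(long_iso):
        return None
    # Longest common prefix of the two isoforms.
    p = 0
    while p < len(short_iso) and short_iso[p] == long_iso[p]:
        p += 1
    # Longest common suffix, not allowed to overlap the prefix in short_iso.
    s = 0
    max_s = len(short_iso) - p
    while s < max_s and short_iso[-1 - s] == long_iso[-1 - s]:
        s += 1
    # Both flanks must reach the core length and together cover short_iso.
    if p < core or s < core or p + s != len(short_iso):
        return None
    return p, long_iso[p:len(long_iso) - s]
```

A real aligner tolerates mismatches and handles several split-sites per transcript pair; the sketch only shows why a site in one transcript needs support from a second transcript that continues through it.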

We aligned the short reads to validate the AS sites by screening for junction reads (reads that map partially next to the predicted AS sites and are split at exactly the same location). The IsoSplitter pipeline was remarkably efficient in identifying AS sites. In total, we found 61,838 transcripts with at least one AS site among the above-mentioned 111,277 transcripts (accounting for 55.6%; Supplementary Dataset 3), and 257,692 AS sites were identified based on the SIM4 alignments (Fig. 2A, B; Supplementary Dataset 3). To further evaluate the AS sites, we mapped the short reads from the different tissue types; 13,068 transcripts with at least one AS site were validated, and the largest share of these transcripts (6,373) was found in all tissues (Fig. 2C, D; Supplementary Dataset 4). Overall, 51,527 AS sites were supported by junction reads across the tissues, and 28,889 of these sites were uncovered in every tissue type (Fig. 2B). These results indicate that the IsoSplitter pipeline is effective in identifying gene AS sites without reference genome information.

Figure 2. Identification and analysis of gene alternative splicing isoforms based on long and short reads in Camellia japonica. (A) The number of AS sites that were discovered and validated with Illumina sequencing reads using the IsoSplitter pipeline. (B) A Venn diagram of AS sites validated using Illumina sequencing reads from different tissue types. (C) The number of AS transcripts that were discovered and validated with Illumina sequencing reads using the IsoSplitter pipeline. (D) A Venn diagram of AS transcripts with at least one AS site validated using Illumina sequencing reads from different tissue types.

DETs and tissue-specific expression of AS isoforms in distinctive tissue types

We performed a statistical analysis to identify DETs between tissue types (Fig. 3A); in total, we identified 48,487 DETs in the five tissue types (FDR < 0.005; fold-change > 2; Supplementary Dataset 5). The IS versus PE comparison had the smallest number of DETs among the comparisons of tissue types (Fig. 3A). The expression levels of all DETs were used to perform correlation analyses among the samples. All replicates showed a high degree of correlation, indicating the reproducibility of the gene expression analysis; IS and PE displayed a particularly high correlation, which agreed with the finding that these two tissues had the fewest DETs between them (Supplementary Fig. 3). These results indicate that the analysis of a long-read transcriptome coupled with short reads is suitable for gene expression studies.

Figure 3. The distribution of DETs and functional enrichment analyses of tissue-specific isoforms in five distinctive tissue types. (A) The distribution of up- and down-regulated DETs between tissue types. (B) The GO enrichment analysis of seed kernel-specific AS isoforms. The GO terms were summarized using REVIGO (http://revigo.irb.hr/), and the degree of red colour indicates the significance (P-value) of the enrichment as listed in Supplementary Dataset 6. (C) The normalized expression (Z-score) of seed kernel-specific AS genes (ACM) that were annotated as lipid biosynthesis genes (LUP, beta-amyrin synthase; PLD, phospholipase D; SQE, squalene epoxidase; EH, epoxide hydrolase; fadD, long chain acyl-CoA synthetase).

To investigate the gene isoforms that are specifically found in individual tissues, we obtained the tissue-specific isoforms based on the IsoSplitter analysis and performed functional enrichment analysis. The isoforms that were supported by short 'junction reads' in a tissue type were used to reveal the tissue-specific AS events. We performed Gene Ontology (GO) enrichment to identify pathways related to tissue-specific gene isoforms (Fig. 3B; Supplementary Dataset 6). The 30 most enriched GO terms were analysed to reveal the biological processes with tissue-specific isoforms; many enriched GO terms were consistent with the function of the tissues. For example, 'photosynthetic electron transport' was enriched in leaves, and 'regulation of embryonic development' was enriched in seed kernel (Fig. 3B; Supplementary Fig. 4; Supplementary Dataset 6). We further investigated the tissue-specific expression levels of the enriched seed kernel isoforms using the ACM quantification method (see Materials and Methods for details; Supplementary Dataset 7). Genes involved in lipid biosynthesis, including beta-amyrin synthase (LUP) and squalene epoxidase (SQE), were highly expressed in seed kernels (Fig. 3C).

Identification and characterization of lncRNAs in C. japonica

The obtained transcriptome was searched for lncRNAs; in total, 20,734 transcripts were identified as lncRNAs (Supplementary Dataset 8). The majority of lncRNAs were between 1 and 2 kb in length (63%, Fig. 4A). We then compared the lncRNAs from C. japonica with those of various plant species. Only a small number of lncRNAs displayed sequence homology to distantly related species: 318 and 513 lncRNAs were homologous to Populus and Vitis species, respectively (Fig. 4B), whereas a large number of lncRNAs (17,842, 86.1%) were homologous to those of a closely related species, Camellia sinensis (Fig. 4B). To identify potential miRNA-harbouring lncRNAs, we aligned the mature sequences of miRNAs from previous studies in Camellia species [36] and found that 720 lncRNAs matched mature miRNAs, indicating that these are potential miRNA-harbouring lncRNAs. We examined the expression of lncRNAs across tissue types and found that the average expression levels displayed minor variations (Fig. 4C).

Figure 4. Characterization of lncRNAs and their associated miRNAs in Camellia. (A) The length distribution of lncRNAs. (B) The homologous lncRNAs in different plant species and potential miRNA-harbouring lncRNAs in Camellia japonica. (C) The expression of lncRNAs in different plant tissue types.

Polyadenylation patterns of the C. japonica transcriptome

To investigate the polyadenylation (polyA) sites of transcripts, we first translated the transcripts to obtain the coding sequences, and the 3ʹ UTR sequences were clipped for further analysis. The polyA tails were identified by sliding-window scanning with a window of 10 nucleotides containing at least nine A residues. The 50 bp sequences upstream of a polyA tail were retrieved, and sequences with the motif 'AATAAA' were kept for the identification of PAS. We used an exhaustive counting program to identify potential signatures [42]. 'AATAAA' was the most frequently discovered hexamer (Fig. 5A), and 'ATAAAA', 'AAATAA', and 'ATAAAT' were the abundant ones associated with the 'AATAAA' motif (Fig. 5A).
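The scanning procedure described above can be sketched in Python as follows. The window size, A threshold, and 50-nt upstream region follow the text; choosing the first A of the first qualifying window as the tail start is an assumption, and the function names are hypothetical rather than taken from the counting program cited in [42].

```python
from collections import Counter


def find_polya_start(seq, window=10, min_a=9):
    """Index of the first 'A' in the first window with >= min_a A bases, or -1."""
    seq = seq.upper()
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        if chunk.count("A") >= min_a:
            return i + chunk.index("A")
    return -1


def upstream_hexamers(seq, upstream=50, k=6):
    """Count all k-mers in the region immediately upstream of the polyA tail."""
    start = find_polya_start(seq)
    if start < 0:
        return Counter()
    region = seq.upper()[max(0, start - upstream):start]
    return Counter(region[i:i + k] for i in range(len(region) - k + 1))
```

Summing the counters returned for all 3ʹ UTR sequences and ranking the hexamers would give a frequency table of the kind summarized in Fig. 5A.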

Figure 5. The analysis of PAS in Camellia japonica. (A) The distribution of PAS identified in Camellia japonica using transcripts. (B) The PAS sites enriched in AS isoforms. The red colour indicates PAS sites that are significantly enriched compared with the non-AS transcripts.

Furthermore, we analysed the frequency of the top 15 hexamers in the AS transcripts. We tested the probability of occurrence of each hexamer in the group of AS genes and found that the frequency of the AATAAA motif was not significantly different from that of the whole gene set (Fig. 5B). We also found that eight hexamers had higher occurrence probabilities in AS isoforms, suggesting that AS transcripts might have different preferences for PAS sites (Fig. 5B).

Identification of a new phasiRNA locus potentially involved in cold and heat stresses of Camellia

The phased secondary small RNA loci were predicted using miRNAs and the long-read transcriptome. We found that 182 transcripts were potential phasiRNA loci (Supplementary Table S4). Among these, certain loci, including auxin responsive factor, auxin signalling F-box, and zinc-finger domain containing protein, have been reported in other plant lineages, suggesting conserved evolutionary origins. We also noticed that 41 transcripts encoding lipoxygenase were predicted as potential phasiRNA loci (Supplementary Table S4); all of these transcripts contained a region of 252 bp that could potentially generate 12 consecutive 21-bp siRNAs (Fig. 6A). To further evaluate this region as a phasiRNA locus, we combined the small RNA datasets from C. japonica and C. azalea to identify the secondary siRNAs. Only five 21-bp siRNAs (exact matches) were obtained (Fig. 6A).
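The arithmetic behind the locus (252 bp = 12 × 21 bp) and the exact-match search can be sketched in Python. The sketch cuts a region into consecutive 21-nt registers on one strand and looks up each register in a table of small RNA read counts; strand offsets and the scoring used by phasiRNA prediction tools are not modelled, and both function names are hypothetical.

```python
def phased_registers(region, size=21):
    """Cut a region into consecutive, non-overlapping windows of `size` nt."""
    if len(region) % size != 0:
        raise ValueError("region length must be a multiple of the phase size")
    return [region[i:i + size] for i in range(0, len(region), size)]


def count_exact_matches(registers, small_rna_counts):
    """Map each 1-based register index to the read count of its exact match.

    small_rna_counts: dict of small RNA sequence -> read count (assumed input).
    """
    return {
        i: small_rna_counts.get(reg, 0)
        for i, reg in enumerate(registers, start=1)
    }
```

With a 252-nt region, `phased_registers` returns 12 registers, and registers with a non-zero count in `count_exact_matches` correspond to siRNAs supported by exact-match reads.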

Figure 6. Identification and expression analyses of a phasiRNA locus from lipoxygenases. (A) The 252 bp region was identified with support from small RNA sequencing data from Camellia japonica and Camellia azalea [31,36]. The numbers are counts of 21-bp siRNA fragments identified by deep sequencing. (B) The expression of each potential siRNA, measured with probes by real-time quantitative PCR in cold and heat treatments. The y-axis indicates the relative expression values. miR167 was used as a control. (C) A heatmap plot of correlations of expression of potential siRNAs.

It has been shown that lipoxygenases can be induced by thermal stresses. We reasoned that the production of secondary siRNAs from this locus might be responsive to stresses. We designed 12 short siRNA probes and performed an expression analysis under low- and high-temperature treatments. The short probes M1-M4 displayed a consistent induction of expression upon both cold and heat stresses (Fig. 6B), and probes M7, M10, and M12 were induced under the −5 and 42°C treatments (Fig. 6B). We performed a correlation analysis of the expression of the probes, and a high correlation among M1-M5 was observed (Fig. 6C), suggesting a 'one-hit' model at the 5ʹ region of the phasiRNA locus. However, high correlations between M6-M7 and M11-M12 were also observed (Fig. 6C), suggesting a complex origin of the secondary siRNA biosynthesis. To further investigate the functions of the secondary siRNAs, we predicted the potential targets using a transcriptome assembly of C. japonica [31]. In addition to the lipoxygenase genes, the secondary siRNAs were predicted to target many other genes, including protein phosphatase, glycosyl hydrolase, protein kinase, and more (Supplementary Dataset 9). We performed gene expression analysis of some potential targets and showed that RAN GTPase, xyloglucan endotransglucosylase, and ATPase were differentially expressed in response to heat and cold stresses (Supplementary Fig. 5). These results suggest that the secondary siRNAs might regulate downstream gene expression in a trans-acting manner.

Discussion

Transcriptome sequencing has revealed a great complexity of AS in plant cells. The roles of AS in gene regulation have been found to be closely related to plant development, growth, and stress resistance [1,47]. The recent development of single-molecule sequencing technologies has provided an efficient way to obtain complete transcripts that can be used for AS, PAS, and lncRNA analyses [7]. Additionally, the use of Iso-Seq in various plant species has the potential to be an important tool for studying the genomic basis of adaptations.

We combined long-read and short-read transcriptome sequencing approaches in Camellia japonica, which lacks a reference genome. With the development of novel bioinformatic pipelines, we characterized genome-wide AS patterns, APA, and non-coding RNAs; the integrative analysis of Iso-Seq transcripts and small RNAs revealed a new phasiRNA locus that may be involved in the regulation of temperature stresses.

IsoSplitter is a novel pipeline for AS identification using long-read transcriptomes for non-model species

Single-molecule transcriptome sequencing is a useful technology for unravelling AS isoforms, especially for species without reference genomes. Our design for IsoSplitter can efficiently identify AS sites by aligning isoform sequences. In this study, the screening of the transcriptome yielded 61,838 transcripts out of 111,277 with at least one AS site (Fig. 2C); the discovery rate of the SIM4-based alignment algorithm was significantly improved. For comparison with a previous analysis pipeline, we applied the method based on BLAST 2.2.2.31+ with a cut-off E-value of 1e-15, as described in Liu et al. [15], to search for homologous sequences and obtained 906 pairs of AS isoforms (not shown) in C. japonica. Another key feature of IsoSplitter is that, if short-read RNAseq data are available, it can map the short reads to identify junction reads that validate the predicted AS sites. Using Illumina reads from five tissue types, we validated 13,068 transcripts with at least one AS site (Fig. 2B); based on the short-read analysis, tissue-specific AS isoforms were revealed for further analyses (Fig. 3B, C). The Iso-Seq pipeline is commonly used in combined long-read and short-read sequencing in plants (Wang et al., 2019b; Xu et al., 2015), so this pipeline can be a powerful way to determine AS sites and tissue-specific AS isoforms. This study of Camellia japonica provides a comprehensive example of an integrative analysis of both long-read and short-read transcriptomes to uncover AS sites when no reference genome is available.

Differences of APA between AS and non-AS transcripts

Polyadenylation is a key step in mRNA maturation, and it also plays an important role in the regulation of translation. The Iso-Seq pipeline has been shown to be an efficient means of identifying APA isoforms [7,18]. To investigate the PAS sites of transcripts in C. japonica, the 3ʹ-UTR sequences were retrieved based on coding sequences. We showed that 'AATAAA' was the most frequent PAS signal, which is consistent with studies in corn and sorghum [9]. Some hexamers, including 'ATATAT' and 'TATATA', which are abundant in corn and sorghum, were not among the top 15 hexamers in C. japonica, suggesting a different preference of PAS (Fig. 5A). However, due to the lack of a reference genome, the identification of the 3ʹ-UTR and polyA tail might introduce bias during the selection of sequences, which could lead to the omission of some high-frequency PAS signals. Our enrichment analysis showed that eight hexamers were significantly enriched in the AS isoform group, which indicates that the AS transcripts might have a distinct mechanism of polyA tail processing. This processing has been found to be highly correlated with splicing [48], and this result suggests that genes with AS might produce different 3ʹ-UTR sequences.

A newly evolved phasiRNA locus in Camellia

A diverse range of plant lineages feature phasiRNAs, suggesting that they have a deep evolutionary origin [49,50]. Recent evidence has indicated that not only the 21 bp but also the 24 bp secondary siRNA-producing loci are widely distributed in plants [51]. The phasiRNA-generating loci are found in both protein-coding and non-coding transcripts [25]. Using the isoform sequences and small RNA sequencing data, we predicted 183 transcripts that were potentially phasiRNA loci, including some conserved loci encoding myeloblastosis transcription factor, nucleotide-binding leucine-rich repeat, pentatricopeptide repeats, auxin-related F-box (AFB), and others [Supplementary Table 4; 24]. We also predicted a locus in the transcripts of lipoxygenases that includes a region of 252 bp in length, from which 12 consecutive 21 bp secondary siRNAs can potentially be produced (Fig. 6A). This locus appears to be a newly evolved phasiRNA transcript in Camellia, as it has not been discovered in other plant lineages. We showed that the secondary siRNAs were expressed at low levels using small RNA sequencing data from various plant tissues [36], and that heat and cold stresses induced the levels of the secondary siRNAs (Fig. 6). Previous studies have shown that the lipoxygenases belong to a large gene family that is involved in several biotic and abiotic responses in C. sinensis [52,53]. The members of the lipoxygenase family in C. sinensis underwent extensive AS, which led to the truncation of some proteins that might have regulatory functions responsive to biotic and abiotic stresses [53]. A potential siRNA-producing region has been found in at least 42 transcripts in the isoform dataset (Supplementary Table 4), partly due to the AS isoforms, which suggests that it may have important regulatory functions for downstream genes.

Source: https://www.tandfonline.com/doi/full/10.1080/15476286.2020.1738703