Rna-seq Differential Expression Number of Reads Required
hamidghaedi / RNA-seq-differential-expression Goto Github PK
Differential gene expression analysis on RNA-seq data
RNA-seq-differential-expression'south Introduction
RNA-seq-differential-expression
Differential gene expression analysis on RNA-seq data
RNA-seq
RNA-seq is emerging as a powerful genome-broad gene expression analysis tools in genomic studies. Technically an RNA-seq experiment get-go from library preparation, sequencing, and information analysis. Data analysis is ordinarily kickoff from quality control, read mapping to a reference genome. From RNA reads aligned to a reference genome, a count matrix will generate that is the master file for differential expression assay. For differential gene expression assay from this count RNA matrix, there are two main R package which are widely used in this field: DeSeq2 and edgeR and limma.
In this commodity I volition share my workflow for performing differential gene expression analysis on RNA-seq count matrix. The data used here are coming from float cancer data from TCGA (TCGA-BLCA ). Here is a cursory overview on assay steps:
- Obtaining data
- Data transformation, quality control and data normalization
- Differential gene expression analysis
- Visualization
1. Obtaining data
#________Packgaes______________# library( "TCGAbiolinks" ) # bioconductor package library( "SummarizedExperiment" ) # bioconductor parcel library( "DESeq2" ) # bioconductor package library( "IHW" ) # bioconductor bundle library( "biomaRt" ) # bioconductor packet library( "apeglm" ) # bioconductor package library( "pheatmap" ) # CRAN package library( "RColorBrewer" ) # CRAN parcel library( "PCAtools" ) # bioconductor package library(reshape2) # CRAN package #_______ Downloading_Data_______# query_TCGA = GDCquery( projection = "TCGA-BLCA" , data.category = "Transcriptome Profiling" , # parameter enforced by GDCquery experimental.strategy = "RNA-Seq" , workflow.type = "HTSeq - Counts" ) GDCdownload(query = query_TCGA) dat <- GDCprepare(query = query_TCGA, save = TRUE, save.filename = "exp.rda" ) # exp matrix rna <- equally.data.frame(SummarizedExperiment ::analysis(dat)) # clinical information clinical <- data.frame(dat @ colData) # Check how many tumor and control sample are in that location: ## information are stored under "definition" column in the clinical dataset. table(clinical $ definition) # Likewise from sample id it is possible to count normal and tumor samples: table(substr(colnames(rna),fourteen,14)) #The count matrix (rna dataset) and the rows of the column data (clinical dataset) MUST be in the same society. # to see whether all rows of clinical are present in rna datset all(rownames(clinical) %in% colnames(rna)) # whether they are in the same society: all(rownames(clinical) == colnames(rna)) # if non reorder them by: #rna <- rna[, rownames(clinical)] #all(rownames(clinical) == colnames(rna))
2. Data transformation, quality control and normalization
For differential expression analysis, using the raw counts is just fine. Still before performing whatsoever analysis it is good thought to go a full general picture of data shape and distribution. To this aim it is necessary to transform data. For data transformation we will employ variance stabilizing transformations (VST) method. This methods is described in the DESeq2 paper. Transformed information is useful to do some quality control checking on expression data. Quality control metrics would be used to bank check sample level QC and gene level QC. The DESeq2 package workflow performs gene level QC automatically. This includes removing genes that are very unlikely to be differentially expressed in the study design. Such genes are peradventure those with zero counts in all samples, those with a hugely skewed read count and depression mean normalized counts.
By sample level QC, we tin go how well the biological replicates(samples in unlike groups) cluster together. Actually sample level QC isapplied to respond the questions "How are samples like to each other in termsof expression values?" , " Which samples are unlike from each other?", "Exercise thesedifferences marshal with our expectation from written report design?" There are two statistical tools useful to gain insight into sample level QC: PCA and Hierarchical clustering.
#_______Making_Expression_Object__________# #We volition apply the column "definition", as the grouping variable for gene expression assay. # supplant spaces with "_" in levels of definition cavalcade clinical $ definition <- gsub( " " , "_" , clinical $ definition) # making the definition column as cistron clinical $ definition <- every bit.factor(clinical $ definition) # relevling factor to ensure tumors would exist compared to normal tissue. levels(clinical $ definition) # clinical $ definition <- relevel(clinical $ definition, ref = "Solid_Tissue_Normal" ) # Making DESeqDataSet object which stores all experiment data dds <- DESeqDataSetFromMatrix(countData = rna, colData = clinical, design = ~ definition) #dds # prefilteration: it is not necessary but recommended to filter out low expressed genes go on <- rowSums(counts(dds)) > = 10 dds <- dds[go on,] # data tranfromation vsd <- vst(dds, blind = FALSE) # making PC object p <- pca(analysis(vsd), metadata = colData(vsd), removeVar = 0.i) # create PCA plot for PCA1 and PCA2 biplot(p, colby = "definition" , lab = NULL, legendPosition = 'right' )
As depicted by the above plot, samples are somehow clustered together based on the "definition" variable levels: "Solid_Tissue_Normal", and "Primary_solid_Tumor". Its helpful to see how other PCs -other than PC1 and PC2 would clusters samples together. At that place are no PC combinations that can perfectly cluster samples based on the "definition" variable and this is may exist considering of how RNA-seq data are.
# Fol all pinnacle 10 possible combination pairsplot(p, components = getComponents(p, c(one : 10)), triangle = TRUE, trianglelabSize = 12, hline = 0, vline = 0, pointSize = 0.4, gridlines.major = Simulated, gridlines.small-scale = FALSE, colby = 'definition' , title = 'Pairs plot' , plotaxes = False, margingaps = unit(c(- 0.01, - 0.01, - 0.01, - 0.01), 'cm' ))
Hierarchical clustering is suitable to visualize similarities between samples with regard to factor expression profile. At the same time it will provide information about how gene expression profiles are different between samples. Here we take a total of 433 samples , and so visualizing all these sample in a plot would not be very informative. Here I will compare nineteen normal samples for making a sample-sample heatmap for visualizing hierarchical clustering
normal_idx <- substr(colnames(assay(vsd)),14,fourteen) == "1" n_sample <- analysis(vsd)[, c(normal_idx) ] colnames(n_sample) <- paste( "NT_" , substr(colnames(n_sample),1,12)) # Dissimilarity matrix calculation sampleDists <- dist(t(n_sample)) sampleDistMatrix <- every bit.matrix(sampleDists) rownames(sampleDistMatrix) <- c(colnames(n_sample), colnames(t_sample)) colnames(sampleDistMatrix) <- NULL colors <- colorRampPalette( rev(brewer.pal(nine, "Blues" )) )(255) # heatmap visualization pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, col = colors)
3. Differential cistron expression analysis
#_______DE_analysis________# dds <- DESeq(dds) #This would take some time #default method res <- results(dds, alpha = 0.05, altHypothesis = "greaterAbs" , lfcThreshold = one.5) # blastoff controls FDR charge per unit #issue Log fold alter shrinkage method (suitable for logfc based visualization) resLFC <- lfcShrink(dds, coef =resultsNames(dds)[2], type = "apeglm" ) resLFC.Ordered <- resLFC[with(resLFC, club(abs(log2FoldChange), padj, decreasing = Truthful)), ] # converting Ensebl id to Gene symbole using biomart ens2symbol <- role(ids){ mart <- useDataset( "hsapiens_gene_ensembl" , useMart( "ensembl" )) genes <- getBM(filters = "ensembl_gene_id" , attributes = c( "ensembl_gene_id" , "hgnc_symbol" ), values = ids, mart = mart) return(genes) } df <- ens2symbol(row.names(res)) res_df <- as.information.frame(res) res_df $ ensembl_gene_id <- row.names(res_df) res_df <- merge(df,res_df, by = "ensembl_gene_id" ) resOrdered <- res_df[with(res_df, society(abs(log2FoldChange), padj, decreasing = TRUE)), ] #saving the results write.csv(res_df, file = paste0(resultsNames(dds)[2], ".csv" ) #outcome with Independent hypothesis weighting resIHW <- results(dds, filterFun = ihw, alpha = 0.05, altHypothesis = "greaterAbs" , lfcThreshold = 1.5) resIHW_df <- every bit.information.frame(resIHW) resIHW_df $ ensembl_gene_id <- row.names(resIHW_df) resIHW_df <- merge(df,resIHW_df, past = "ensembl_gene_id" ) resIHWOrdered <- resIHW_df[with(resIHW_df, order(abs(log2FoldChange), padj, decreasing = TRUE)), ] write.csv(resIHW_df, file = paste0( "IHW" ,resultsNames(dds)[ii], ".csv" ))
four. Visualization
There are several mode for gene expression analysis results visualization.
MA-plot visualize gene significantly (bluish dots with p < 0.05) upwards-regulated (log2FC > ane.v) or down-regulated (log2FC < -1.5).
#ploting genes differentially expressed ylim <- c(- six.5,six.5) drawLines <- office() abline(h =c(- 1.five,1.v),col = "dodgerblue" ,lwd = ii) plotMA(resLFC, ylim = ylim); drawLines()
Height 25 dysregulated gene
# obtaining normalized count read dys_reg <- assay(vsd)[which(row.names(assay(vsd)) %in% resOrdered $ ensembl_gene_id [one : 25]), ] #melting dataset for visualization melted_norm_counts <- information.frame(cook(dys_reg)) colnames(melted_norm_counts) <- c( "gene" , "samplename" , "normalized_counts" ) melted_norm_counts $ grouping <- ifelse(melted_norm_counts $ samplename %in% colnames(assay(vsd))[normal_idx], "Normal" , "Tumor" ) # write.tabular array to import in python write.table(melted_norm_counts, file = "data.csv" , sep = "," , row.names = F)
It is very interesting to visualize gene count data past violin plot. Information technology is very informative! At the fourth dimension of writing , there is no piece of cake way to plot values in R "dodge" fashion. So I exported information into a file to plot this file in python
. The jupyter notebook code could exist notice in the repostory.
import pandas as pd import seaborn as sns from matplotlib.pyplot import figure #reading data data = pd.read_csv('data.csv') data.caput() ## plotting information figure(num = None, figsize =(xiv, 6), dpi = 300, facecolor = 'w', edgecolor = 'k') graph = sns.violinplot(ten = "gene", y = "normalized_counts", hue = "group", data = data, palette = "Set3", split = True) graph.set_xticklabels(graph.get_xticklabels(), rotation = 45, horizontalalignment = 'right')
RNA-seq-differential-expression's People
Contributors
Watchers
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for edifice user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for edifice UI on the spider web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open up Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvass and HTML. 📊📈🎉
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some affair interesting about web. New door for the world.
-
server
A server is a program made to process requests and evangelize information to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a slice of software to reply intelligently.
-
Visualization
Some thing interesting about visualization, utilize data art
-
Game
Some thing interesting about game, brand everyone happy.
Recommend Org
-
Facebook
We are working to build community through open up source engineering. NB: members must accept two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for anybody.
-
Alibaba
Alibaba Open up Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
Cathay tencent open source team.
Jobs
JoobleSource: https://githubhelp.com/hamidghaedi/RNA-seq-differential-expression
0 Response to "Rna-seq Differential Expression Number of Reads Required"
Post a Comment