Rna-seq Differential Expression Number of Reads Required

RNA-seq-differential-expression

hamidghaedi / RNA-seq-differential-expression Goto Github PK

0.0 ane.0 2.0 940 KB

Differential gene expression analysis on RNA-seq data

Jupyter Notebook 98.49% R 1.51%

RNA-seq-differential-expression'south Introduction

RNA-seq-differential-expression

Differential gene expression analysis on RNA-seq data

RNA-seq

RNA-seq is emerging as a powerful genome-broad gene expression analysis tools in genomic studies. Technically an RNA-seq experiment get-go from library preparation, sequencing, and information analysis. Data analysis is ordinarily kickoff from quality control, read mapping to a reference genome. From RNA reads aligned to a reference genome, a count matrix will generate that is the master file for differential expression assay. For differential gene expression assay from this count RNA matrix, there are two main R package which are widely used in this field: DeSeq2 and edgeR and limma.

In this commodity I volition share my workflow for performing differential gene expression analysis on RNA-seq count matrix. The data used here are coming from float cancer data from TCGA (TCGA-BLCA ). Here is a cursory overview on assay steps:

  1. Obtaining data
  2. Data transformation, quality control and data normalization
  3. Differential gene expression analysis
  4. Visualization

1. Obtaining data

                                          #________Packgaes______________#                    library(                      "TCGAbiolinks"                    )                                          #                      bioconductor package                    library(                      "SummarizedExperiment"                    )                                          #                      bioconductor parcel                    library(                      "DESeq2"                    )                                          #                      bioconductor package                    library(                      "IHW"                    )                                          #                      bioconductor bundle                    library(                      "biomaRt"                    )                                          #                      bioconductor packet                    library(                      "apeglm"                    )                                          #                      bioconductor package                    library(                      "pheatmap"                    )                                          #                      CRAN package                    library(                      "RColorBrewer"                    )                                          #                      CRAN parcel                    library(                      "PCAtools"                    )                                          #                      bioconductor package                    library(reshape2)                                          #                      CRAN package                                          #_______ Downloading_Data_______#                    query_TCGA                    =                    GDCquery(                    projection                    =                                          "TCGA-BLCA"                    ,                    data.category                    =                                          "Transcriptome Profiling"                    ,                                          #                      parameter enforced by GDCquery                    experimental.strategy                    =                                          "RNA-Seq"                    ,                    workflow.type                    =                                          "HTSeq - Counts"                    )  GDCdownload(query                    =                    query_TCGA)                    dat                    <-                    GDCprepare(query                    =                    query_TCGA,                    save                    =                    TRUE,                    save.filename                    =                                          "exp.rda"                    )                                          #                      exp matrix                    rna                    <-                    equally.data.frame(SummarizedExperiment                    ::analysis(dat))                                          #                      clinical information                    clinical                    <-                    data.frame(dat                    @                    colData)                                          #                      Check how many tumor and control sample are in that location:                                          ## information are stored under "definition" column in the clinical dataset.                    table(clinical                    $                    definition)                                          #                      Likewise from sample id it is possible to count normal and tumor samples:                    table(substr(colnames(rna),fourteen,14))                                          #The count matrix (rna dataset) and the rows of the column data (clinical dataset) MUST be in the same society.                                          #                      to see whether all rows of clinical are present in rna datset                    all(rownames(clinical)                    %in%                    colnames(rna))                                          #                      whether they are in the same society:                    all(rownames(clinical)                    ==                    colnames(rna))                                          #                      if non reorder them by:                                          #rna <- rna[, rownames(clinical)]                                          #all(rownames(clinical) == colnames(rna))                  

2. Data transformation, quality control and normalization

For differential expression analysis, using the raw counts is just fine. Still before performing whatsoever analysis it is good thought to go a full general picture of data shape and distribution. To this aim it is necessary to transform data. For data transformation we will employ variance stabilizing transformations (VST) method. This methods is described in the DESeq2 paper. Transformed information is useful to do some quality control checking on expression data. Quality control metrics would be used to bank check sample level QC and gene level QC. The DESeq2 package workflow performs gene level QC automatically. This includes removing genes that are very unlikely to be differentially expressed in the study design. Such genes are peradventure those with zero counts in all samples, those with a hugely skewed read count and depression mean normalized counts.

By sample level QC, we tin go how well the biological replicates(samples in unlike groups) cluster together. Actually sample level QC isapplied to respond the questions "How are samples like to each other in termsof expression values?" , " Which samples are unlike from each other?", "Exercise thesedifferences marshal with our expectation from written report design?" There are two statistical tools useful to gain insight into sample level QC: PCA and Hierarchical clustering.

= ten dds <- dds[go on,] # data tranfromation vsd <- vst(dds, blind=False) # making PC object p <- pca(assay(vsd), metadata = colData(vsd), removeVar = 0.1) # create PCA plot for PCA1 and PCA2 biplot(p, colby = "definition", lab = NULL, legendPosition = 'right')">
                                          #_______Making_Expression_Object__________#                                          #We volition apply the column "definition", as the grouping variable for gene expression assay.                                                              #                      supplant spaces with "_" in levels of definition cavalcade                    clinical                    $                    definition                    <-                    gsub(                      "                      "                    ,                                          "_"                    ,                    clinical                    $                    definition)                                          #                      making the definition column as cistron                    clinical                    $                    definition                    <-                    every bit.factor(clinical                    $                    definition)                                          #                      relevling factor to ensure tumors would exist compared to normal tissue.                    levels(clinical                    $                    definition)                                          #                                        clinical                    $                    definition                    <-                    relevel(clinical                    $                    definition,                    ref                    =                                          "Solid_Tissue_Normal"                    )                                          #                      Making DESeqDataSet object which stores all experiment data                    dds                    <-                    DESeqDataSetFromMatrix(countData                    =                    rna,                    colData                    =                    clinical,                    design                    =                    ~                    definition)                                          #dds                                          #                      prefilteration: it is not necessary but recommended to filter out low expressed genes                    go on                    <-                    rowSums(counts(dds))                    >                    =                    10                    dds                    <-                    dds[go on,]                                          #                      data tranfromation                    vsd                    <-                    vst(dds,                    blind                    =                    FALSE)                                          #                      making PC object                    p                    <-                    pca(analysis(vsd),                    metadata                    =                    colData(vsd),                    removeVar                    =                    0.i)                                          #                      create PCA plot for PCA1 and PCA2                    biplot(p,                    colby                    =                                          "definition"                    ,                    lab                    =                    NULL,                    legendPosition                    =                                          'right'                    )

alt text

As depicted by the above plot, samples are somehow clustered together based on the "definition" variable levels: "Solid_Tissue_Normal", and "Primary_solid_Tumor". Its helpful to see how other PCs -other than PC1 and PC2 would clusters samples together. At that place are no PC combinations that can perfectly cluster samples based on the "definition" variable and this is may exist considering of how RNA-seq data are.

                                          #                      Fol all pinnacle 10 possible combination                                        pairsplot(p,                    components                    =                    getComponents(p, c(one                    :                    10)),                    triangle                    =                    TRUE,                    trianglelabSize                    =                    12,                    hline                    =                    0,                    vline                    =                    0,                    pointSize                    =                    0.4,                    gridlines.major                    =                    Simulated,                    gridlines.small-scale                    =                    FALSE,                    colby                    =                                          'definition'                    ,                    title                    =                                          'Pairs plot'                    ,                    plotaxes                    =                    False,                    margingaps                    =                    unit(c(-                    0.01,                    -                    0.01,                    -                    0.01,                    -                    0.01),                                          'cm'                    ))

alt text

Hierarchical clustering is suitable to visualize similarities between samples with regard to factor expression profile. At the same time it will provide information about how gene expression profiles are different between samples. Here we take a total of 433 samples , and so visualizing all these sample in a plot would not be very informative. Here I will compare nineteen normal samples for making a sample-sample heatmap for visualizing hierarchical clustering

                    normal_idx                    <-                    substr(colnames(assay(vsd)),14,fourteen)                    ==                                          "1"                                        n_sample                    <-                    analysis(vsd)[, c(normal_idx) ] colnames(n_sample)                    <-                    paste(                      "NT_"                    , substr(colnames(n_sample),1,12))                                          #                      Dissimilarity matrix calculation                    sampleDists                    <-                    dist(t(n_sample))                    sampleDistMatrix                    <-                    every bit.matrix(sampleDists) rownames(sampleDistMatrix)                    <-                    c(colnames(n_sample), colnames(t_sample)) colnames(sampleDistMatrix)                    <-                    NULL                    colors                    <-                    colorRampPalette( rev(brewer.pal(nine,                                          "Blues"                    )) )(255)                                          #                      heatmap visualization                    pheatmap(sampleDistMatrix,                    clustering_distance_rows                    =                    sampleDists,                    clustering_distance_cols                    =                    sampleDists,                    col                    =                    colors)

alt text

3. Differential cistron expression analysis

                                          #_______DE_analysis________#                    dds                    <-                    DESeq(dds)                                          #This would take some time                                          #default method                    res                    <-                    results(dds,                    alpha                    =                    0.05,                    altHypothesis                    =                                          "greaterAbs"                    ,                    lfcThreshold                    =                    one.5)                                          #                      blastoff controls FDR charge per unit                                          #issue Log fold alter shrinkage method (suitable for logfc based visualization)                    resLFC                    <-                    lfcShrink(dds,                    coef                    =resultsNames(dds)[2],                    type                    =                                          "apeglm"                    )                    resLFC.Ordered                    <-                    resLFC[with(resLFC, club(abs(log2FoldChange),                    padj,                    decreasing                    =                    Truthful)), ]                                          #                      converting Ensebl id to Gene symbole using biomart                    ens2symbol                    <-                    role(ids){                    mart                    <-                    useDataset(                      "hsapiens_gene_ensembl"                    , useMart(                      "ensembl"                    ))                    genes                    <-                    getBM(filters                    =                                          "ensembl_gene_id"                    ,                    attributes                    =                    c(                      "ensembl_gene_id"                    ,                      "hgnc_symbol"                    ),                    values                    =                    ids,                    mart                    =                    mart)                    return(genes) }                    df                    <-                    ens2symbol(row.names(res))                    res_df                    <-                    as.information.frame(res)                    res_df                    $                    ensembl_gene_id                    <-                    row.names(res_df)                    res_df                    <-                    merge(df,res_df,                    by                    =                                          "ensembl_gene_id"                    )                    resOrdered                    <-                    res_df[with(res_df, society(abs(log2FoldChange),                    padj,                    decreasing                    =                    TRUE)), ]                                          #saving the results                    write.csv(res_df,                    file                    =                    paste0(resultsNames(dds)[2],                                          ".csv"                    )                                          #outcome with Independent hypothesis weighting                    resIHW                    <-                    results(dds,                    filterFun                    =                    ihw,                    alpha                    =                    0.05,                    altHypothesis                    =                                          "greaterAbs"                    ,                    lfcThreshold                    =                    1.5)                    resIHW_df                    <-                    every bit.information.frame(resIHW)                    resIHW_df                    $                    ensembl_gene_id                    <-                    row.names(resIHW_df)                    resIHW_df                    <-                    merge(df,resIHW_df,                    past                    =                                          "ensembl_gene_id"                    )                    resIHWOrdered                    <-                    resIHW_df[with(resIHW_df, order(abs(log2FoldChange),                    padj,                    decreasing                    =                    TRUE)), ]  write.csv(resIHW_df,                    file                    =                    paste0(                      "IHW"                    ,resultsNames(dds)[ii],                                          ".csv"                    ))

four. Visualization

There are several mode for gene expression analysis results visualization.

MA-plot visualize gene significantly (bluish dots with p < 0.05) upwards-regulated (log2FC > ane.v) or down-regulated (log2FC < -1.5).

                                          #ploting genes differentially expressed                                        ylim                    <-                    c(-                    six.5,six.5)                    drawLines                    <-                    office() abline(h                    =c(-                    1.five,1.v),col                    =                                          "dodgerblue"                    ,lwd                    =                    ii) plotMA(resLFC,                    ylim                    =                    ylim); drawLines()

alt text

Height 25 dysregulated gene

                                          #                      obtaining normalized count read                    dys_reg                    <-                    assay(vsd)[which(row.names(assay(vsd))                    %in%                    resOrdered                    $                    ensembl_gene_id                    [one                    :                    25]), ]                                          #melting dataset for visualization                    melted_norm_counts                    <-                    information.frame(cook(dys_reg)) colnames(melted_norm_counts)                    <-                    c(                      "gene"                    ,                                          "samplename"                    ,                                          "normalized_counts"                    )                    melted_norm_counts                    $                    grouping                    <-                    ifelse(melted_norm_counts                    $                    samplename                    %in%                    colnames(assay(vsd))[normal_idx],                                          "Normal"                    ,                                          "Tumor"                    )                                          #                      write.tabular array to import in python                    write.table(melted_norm_counts,                    file                    =                                          "data.csv"                    ,                    sep                    =                                          ","                    ,                    row.names                    =                    F)

It is very interesting to visualize gene count data past violin plot. Information technology is very informative! At the fourth dimension of writing , there is no piece of cake way to plot values in R "dodge" fashion. So I exported information into a file to plot this file in python. The jupyter notebook code could exist notice in the repostory.

                    import                    pandas                    as                    pd                    import                    seaborn                    as                    sns                    from                    matplotlib.pyplot                    import                    figure                    #reading data                    data                    =                    pd.read_csv('data.csv')                    data.caput()                    ## plotting information                    figure(num                    =                    None,                    figsize                    =(xiv,                    6),                    dpi                    =                    300,                    facecolor                    =                    'w',                    edgecolor                    =                    'k')                    graph                    =                    sns.violinplot(ten                    =                    "gene",                    y                    =                    "normalized_counts",                    hue                    =                    "group",                    data                    =                    data,                    palette                    =                    "Set3",                    split                    =                    True)                    graph.set_xticklabels(graph.get_xticklabels(),                    rotation                    =                    45,                    horizontalalignment                    =                    'right')

alt text

RNA-seq-differential-expression's People

Contributors

hamidghaedi avatar

Watchers

Hamid Ghaedi avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for edifice user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for edifice UI on the spider web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open up Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • Laravel photo Laravel

    A PHP framework for web artisans

  • D3 photo D3

    Bring data to life with SVG, Canvass and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some affair interesting about web. New door for the world.

  • server

    A server is a program made to process requests and evangelize information to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a slice of software to reply intelligently.

  • Visualization

    Some thing interesting about visualization, utilize data art

  • Game

    Some thing interesting about game, brand everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open up source engineering. NB: members must accept two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for anybody.

  • Alibaba photo Alibaba

    Alibaba Open up Source for everyone

  • D3 photo D3

    Data-Driven Documents codes.

  • Tencent photo Tencent

    Cathay tencent open source team.

Jobs

Jooble

corleystideass.blogspot.com

Source: https://githubhelp.com/hamidghaedi/RNA-seq-differential-expression

0 Response to "Rna-seq Differential Expression Number of Reads Required"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel