Introduction to Bioinformatics Differential Expression Analysis (DEG) using GEO GeneChip data - Limma Algorithm Data Code Results Interpretation

Updated to 2 days ago

Differential expression analysis is often used as the first step in bioinformatics analysis based on gene expression matrices, which helps us to look at the differences in gene expression across samples to determine the link between the gene to be studied and the phenotype. Commonly used gene expression data comes from gene chips or high-throughput sequencing. Although the matrices look similar, they obey different distributions and therefore require different methods when performing differential expression. For the average researcher in the life sciences, understanding obscure algorithms is not of much value. In this paper, we try to be concise and give the simplest demonstration from three aspects: data - algorithm - result. Note: The code in this article only applies to the counts data of gene chips! The limma algorithm is used!

An algorithm for differential representation of TCGA-based FPKM data can be found at: (not written yet, will add in a few days)

1. Data preparation

Data preparation includes an expression matrix and a grouping matrix.

Expression Matrix:

clustering matrix

The first column is the name of the sample and the second column is the name of the group, noting that each column should have a column name

2. Variance analysis using the Limma package

First install the limma package and the gplots package

source("/")
biocLite("Limma")
biocLite("gplots")

retrieve data

#DGE for microarray by limma
library('gplots')
library('gplots')
setwd("C:/Users/lenovo/DEG")
foldChange=0.5 #fold change=1 means the difference is twice as large
padj=0.01 #padj=0.05 means corrected p-value is less than 0.05
rawexprSet=("",header=TRUE,=1, = FALSE)
# Read the matrix file, this is the input data path, change it to your own file name#
dim(rawexprSet)
exprSet=log2(rawexprSet)
par(mfrow=c(1,2))
boxplot((exprSet),col="blue") ## Draw a box plot to compare data distribution
exprSet[1:5,1:5]
group <- ("",header=TRUE,=1, = FALSE)
group <- group[,1] #Define the comparison group, modified by the number of cancerous and normal samples#
design <- (~0+factor(group))# set group as a model matrix#
colnames(design)=levels(factor(group))
rownames(design)=colnames(exprSet)

Note that not all of the data in the expression matrix downloaded from GEO is logged, and the data that is not logged needs to be logged itself.

The cause and determination of log handling is described in:

Reasons why log2 processing is required for differential expression analysis of GEO chip data

/tuanzide5233/article/details/88542805

The need for log2 and normalization in differential expression analysis of GEO chip data

/tuanzide5233/article/details/88542558

If the data does not need to be logged, simply comment out the code shown in the figure by prefixing it with a #

After the notes:

The box plot in the lower right corner shows that the data is still relatively neat and can be analyzed in the next step

calculation step

fit <- lmFit(exprSet,design)
<-makeContrasts(paste0(unique(group),collapse = "-"),levels = design)
fit2=(fit,)
fit2 <- eBayes(fit2)  ## default no trend !!!
##eBayes() with trend=TRUE
tempOutput = topTable(fit2,coef=1,n=Inf,adjust="BH") 
nrDEG = (tempOutput)

Output results:

allDiff <- nrDEG
diff=allDiff
(diff, "")
diffSig = diff[(diff$ < padj & (diff$logFC>foldChange | diff$logFC<(-foldChange))),]# Filter for significantly different #s.
#(diffSig, file="",sep="\t",quote=F)#Outputs significantly different expressions to the file diffSig.
(diffSig, "")
diffUp = diff[(diff$ < padj & (diff$logFC>foldChange)),]#foldchange>0 is an upward adjustment, foldchange<0 is a downward adjustment#
#(diffUp, file="",sep="\t",quote=F)##39-42 Enter up and down adjustments into up and down files respectively##
(diffUp, "")
diffDown = diff[(diff$ < padj & (diff$logFC<(-foldChange))),]
#(diffDown, file="",sep="\t",quote=F)
(diffDown, "")

Here you can see that all results, all results that satisfy the filtering condition (i.e., multiples of difference expression), upwardly adjusted results, and downwardly adjusted results are output separately according to padj.

The filter criteria for this step are set at the very beginning of the code.

Reasons why log2 processing is required for differential expression analysis of GEO chip data

/tuanzide5233/article/details/88542805

The need for log2 and normalization in differential expression analysis of GEO chip data

/tuanzide5233/article/details/88542558

Tutorial on making a difference expression matrix

/tuanzide5233/article/details/83659768

Heat mapping of differential expression is detailed in

/tuanzide5233/article/details/83659501

Tutorial on differential expression analysis of RNAseq data using edgeR

/tuanzide5233/article/details/88785486

Differential Expression Analysis (DEG) Solution for 'No Duplicate Names'

/tuanzide5233/article/details/86568155

Survival Analysis Tutorial Series (I) Survival Analysis Using the Bio Trustee Toolbox

/tuanzide5233/article/details/83685403

Enrichment Analysis and Visualization of Protein Interaction Networks (PPIs) Cystocape Getting Started Guide

/tuanzide5233/article/details/88048439

Advanced version of Venn plot: Upset plot introductory practical code details - UpSetR package introduction

/tuanzide5233/article/details/83109527

Plotting pathway enrichment analysis bubble plots using the ggplot2 package for the R language: data structure and code

/tuanzide5233/article/details/82141817