Power analysis for RNAseq

Conducting a power analysis for RNASeq data

My sediment stress paper is in review at PeerJ and they need info about power analysis calculation and info on biological & technical replicates. They recommended using the R package RNASeqPower for power analysis/sample size calculation.

From PeerJ:

Please edit your manuscript to include BOTH:

a power analysis calculation (software tools are available, e.g. https://doi.org/doi:10.18129/B9.bioc.RNASeqPower (see usage instructions)) AND
information on biological and technical replicates used to achieve the claimed statistical power

“A suitable wording for the power analysis would be something like: ‘The statistical power of this experimental design, calculated in RNASeqPower is 0.84’ (replace the program name if you used a different tool and provide the actual calculated power).”

Power analysis

What is an RNASeq power analysis? Why do a power analysis?

An RNASeq power analysis is a statistical method used to estimate the number of samples required to detect differential gene expression with a certain degree of statistical pwoer
- Selecting optimal number of biological replicates to achieve desired statistical power (sample size estimation)
- Estimating the likelihood of successfully finding statistical significance in dataset (power estimation)

From the RNASeqPower vignette:

A formal sample size calculation for comparison of two groups will involve five factors, each of which is an argument to the rnapower function.

The depth of sequencing and consequent expected count μ for a given transcript, argument depth.
- Depth of sequencing refers to number of reads generated for each sample
- Expected count is the average number of reads expected to align to a given transcript in a single sample
The coefficient of variation of counts within each of the two groups, argument cv.
- cv is a measure of the dispersion of the data
- Reflects the degree of biological and technical variability within each group
The relative expression that we wish to detect ∆, argument effect.
- Refers to the fold change in expression level between two groups
The target false positive rate α and false negative rate β desired (or power = 1 − β), arguments alpha and power.
- False positive rate (α) is the probability of incorrectly rejecting the null hypothesis when it is actually true
- False negative rate (β) is the probability of incorrectly failing to reject the null hypothesis when its actually false
- Power is the probability of correctly rejecting the null when its actually false
The number of samples n in each group, argument n

Note: there are options for n and n2, as well as cv and cv2 if there are two groups. If only one group is specified, the program will assume that the n and cv values are equal for both groups.

Examples

Let’s do some examples first:

rnapower(depth = 20, cv = 0.4, effect = c(1.25, 1.5, 1.75, 2), alpha = 0.05, power = c(0.8, 0.9))

           0.8       0.9
1.25 66.204618 88.629200
1.5  20.051644 26.843463
1.75 10.526332 14.091771
2     6.861294  9.185326

We see here to detect a 25% increase with high power (0.9), we need ~89 samples. But if we want to detect a 100% change with high power, we only need ~10 samples. As the true biological impact grows, the sample size decreases. Let’s increase the depth:

rnapower(depth = 100, cv = 0.4, effect = c(1.25, 1.5, 1.75, 2), alpha = 0.05, power = c(0.8, 0.9))

8      0.9
25 53.594215 71.74745
5  16.232283 21.73042
75  8.521316 11.40762
   5.554381  7.43574

Increasing the depth decreases the sample sizes needed. What if we changed the cv?

rnapower(depth = 20, cv = 0.3, effect = c(1.25, 1.5, 1.75, 2), alpha = 0.05, power = c(0.8, 0.9))

8       0.9
25 44.136412 59.086133
5  13.367763 17.895642
75  7.017554  9.394514
   4.574196  6.123551

Decreasing the cv also decreases the sample sizes needed.

My data

In my study, I have 6 species from 2 locations. Does this mean I should do a power analysis for each species? I’ll start w/ Acerv from FL.

First, I’m going to calculate the coefficient of variation of each gene in the filtered gene count matrix.

##filtered gene count matrix 
acerv_filt <- read.csv("Output/DESeq2/acerv/acerv_counts_sub_filt_20210327.csv")
colnames(acerv_filt)[1] <- "gene_id"
for ( col in 1:ncol(acerv_filt)){
  colnames(acerv_filt)[col] <-  gsub("X", "", colnames(acerv_filt)[col]) # remvoing X from col names
}

# Separate into Control, T1, T2, T3 & T4 dfs
acerv_filt_control <- acerv_filt[,c("gene_id","25_ctl1_Ac_GF_1", "27_ctl2_Ac_YG_1", "41_ctl3_Ac_RN_1")]
acerv_filt_T1 <- acerv_filt[,c("gene_id", "37_T13_Ac_ML", "52_T11_Ac_II")] # removed sample 24 from dataset during filtering 
acerv_filt_T2 <- acerv_filt[,c("gene_id","31_T22_Ac_UV", "38_T23_Ac_IN", "53_T21_Ac_NH")]
acerv_filt_T3 <- acerv_filt[,c("gene_id","19_T33_Ac_WK", "47_T31_Ac_JB", "57_T32_Ac_NM")]
acerv_filt_T4 <- acerv_filt[,c("gene_id","35_T43_Ac_MT", "54_T42_Ac_JQ")] # removed sample 45 from dataset during filtering 

# Calculate row mean, std dev, and coefficient of variation
acerv_filt_control$mean <- rowMeans(acerv_filt_control[, c(2:4)], na.rm = T)
acerv_filt_control$sd <- rowSds(as.matrix(acerv_filt_control[, c(2:4)], na.rm = T))
acerv_filt_control$cv <- rowCVs(acerv_filt_control[,2:4])
mean(acerv_filt_control$cv, na.rm = T) # 0.2569478
median(acerv_filt_control$cv, na.rm = T) # 0.2067829

acerv_filt_T1$mean <- rowMeans(acerv_filt_T1[, c(2:3)], na.rm = T)
acerv_filt_T1$sd <- rowSds(as.matrix(acerv_filt_T1[, c(2:3)], na.rm = T))
acerv_filt_T1$cv <- rowCVs(acerv_filt_T1[,2:3])
mean(acerv_filt_T1$cv, na.rm = T) # 0.2148519
median(acerv_filt_T1$cv, na.rm = T) # 0.1741535

acerv_filt_T2$mean <- rowMeans(acerv_filt_T2[, c(2:4)], na.rm = T)
acerv_filt_T2$sd <- rowSds(as.matrix(acerv_filt_T2[, c(2:4)], na.rm = T))
acerv_filt_T2$cv <- rowCVs(acerv_filt_T2[,2:4])
mean(acerv_filt_T2$cv, na.rm = T) # 0.3396199
median(acerv_filt_T2$cv, na.rm = T) # 0.2994043

acerv_filt_T3$mean <- rowMeans(acerv_filt_T3[, c(2:4)], na.rm = T)
acerv_filt_T3$sd <- rowSds(as.matrix(acerv_filt_T3[, c(2:4)], na.rm = T))
acerv_filt_T3$cv <- rowCVs(acerv_filt_T3[,2:4])
mean(acerv_filt_T3$cv, na.rm = T) # 0.5463428
median(acerv_filt_T3$cv, na.rm = T) # 0.5210751

acerv_filt_T4$mean <- rowMeans(acerv_filt_T4[, c(2:3)], na.rm = T)
acerv_filt_T4$sd <- rowSds(as.matrix(acerv_filt_T4[, c(2:3)], na.rm = T))
acerv_filt_T4$cv <- rowCVs(acerv_filt_T4[,2:3])
mean(acerv_filt_T4$cv, na.rm = T) # 0.1414061
median(acerv_filt_T4$cv, na.rm = T) # 0.1008634

# Average cv for T1, T2, T3, T4
df <- c(0.2148519, 0.3396199, 0.5463428, 0.1414061)
mean(df) # 0.3105552 

Now I have the CVs for each group for Acerv. It looks like w/ RNASeqPower, there can only be two groups. I’m going to pool the treatment samples.

Parameters:

Depth = 5 (?)
n = 3 (control)
n2 = 10 (T1, T2, T3, T4 - not including outliers)
cv = 0.2569478
cv2 = 0.3105552
effect = 1.25, 1.5, 1.75, 2
alpha = 0.05
power = 0.8, 0.9

Now lets try the code! First, I’m including all the things:

rnapower(depth = 5, n = 3, n2 = 10, cv = 0.2569478, cv2 = 0.3105552, effect = c(1.25, 1.5, 1.75, 2), alpha = 0.05, power = c(0.8, 0.9))

When I try to run the code above with all the info, I get this error: Exactly one of n, cv, effect, alpha, or power should be unspecified. So it looks like I need to not include one of the arguments so the function can calculate it for me. But what am I trying to calculate if I have all the information already?

Try running w/o specifying effect size:

# w/o effect size 
rnapower(depth = 5, n = 3, n2 = 10, cv = 0.2569478, cv2 = 0.3105552, alpha = 0.05, power = c(0.8, 0.9))

     0.8      0.9 
2.621310 3.049565 

This means that, with the given parameters, I will have a 2.6 effect size at 0.8 power. But is this for each gene? Or the mean of the genes for each group?

Let’s try running w/o specifying n sizes.

# w/o n
rnapower(depth = 5, cv = 0.2569478, cv2 = 0.3105552, effect = c(1.25, 1.5, 1.75, 2), alpha = 0.05, power = c(0.8, 0.9))

           0.8       0.9
1.25 88.661651 118.69279
1.5  26.853291  35.94894
1.75 14.096931  18.87179
2     9.188689  12.30105

This is saying that, for instance, I need a sample size of 9 to achieve an effect size of 2 and a power of 0.8? Does this mean a sample size of 10 for all groups?

Now let’s try running w/o specifying power

rnapower(depth = 5, n = 3, n2 = 10, cv = 0.2569478, cv2 = 0.3105552, effect = c(1.25, 1.5, 1.75, 2), alpha = 0.05)

1.25        1.5       1.75          2 
0.09488785 0.21734266 0.36954688 0.52198959 

To detect 100% effect (effect of 2), the power will be 0.52. That seems pretty low, given that power is usually set to 0.8 or 0.9.

How is depth of coverage calculated?

From Sims et al. 2014: “The average depth of sequencing coverage can be defined theoretically as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length.”
- For Acerv: (50 bp read length * 16,499,909 avg reads) / 430,000,000 bp genome length = 1.92 coverage. What does this mean? Each gene has 1.92 average coverage? That doesn’t make any sense, but need to check actually genome bp length
From Hart et al. 2013: “We refer to depth and coverage interchangeably—both meaning how many reads are assigned to a particular gene”
- This article corresponds to the RNASeqPower R package
- “Overall, 85%–96% of all annotated genes were sequenced at a rate of 0.1 reads per million or greater, regardless of the biospecimen or depth. In other words, a sequencing depth of 10 million reads will ensure that approximately 90% of all genes will be covered by at least 10 reads.”
This is the advice that chatGPT gave me (based on Conesa et al. 2016):
- 1) Determine total number of reads generated by sequencing platform for a single sample
  - Avg number of reads for Acerv = 16,499,909
- 2) Estimate size of transcriptome being analyzed
  - Number of protein coding genes in Acerv = 33,322
- 3) Divide the total number of reads by the estimated size of the transcriptome to obtain depth of sequencing
  - 16,499,909 avg reads / 33,322 genes = 495.17 reads per gene?

Given this calculation, lets calculate n size using depth=495.

rnapower(depth = 495, cv = 0.2569478, cv2 = 0.3105552, effect = c(1.25, 1.5, 1.75, 2), alpha = 0.05, power = c(0.8, 0.9))

8       0.9
25 26.246523 35.136647
5   7.949384 10.641970
75  4.173117  5.586619
   2.720129  3.641481

To achieve an effect size of 2 and a power of 0.8, I would need ~2-3 samples per group?

This tutorial says that “Replicates are almost always preferred to greater sequencing depth for bulk RNA-Seq” and provides these recommendations for general gene-level differential expression:

ENCODE guidelines suggest 30 million SE reads per sample (stranded).
15 million reads per sample is often sufficient, if there are a good number of replicates (>3).
Spend money on more biological replicates, if possible.
Generally recommended to have read length >= 50 bp

Issues / questions / notes

Samples in sediment stress study have different read lengths - FL samples are all 50 bp long, but HI samples are either 50 or 125 bp long
I have more than 2 groups (in HI, 3 groups; in FL, 5 groups), but RNASeqPower can only handle 2 groups. Do I pool the samples from the treatments?
What does PeerJ mean by the following statement?
- “information on biological and technical replicates used to achieve the claimed statistical power” - what information do they mean here?
Should all genes be held to the same effect size? What subset of the genes (i.e., all genes, filtered genes, differentially expressed genes) should we be considering here?
If the analysis says I need more replicates, what can I do??
- From my reading, it seems like biological replicates are more important than sequencing depth

Other papers to look at

Ching et al. 2014 - Power analysis and sample size estimation for RNA-Seq differential expression
Zhao et al. 2018 - RnaSeqSampleSize: real data based sample size estimation for RNA sequencing
- This is another R package, but uses different parameters (see vignette)

Written on March 27, 2023