Mcap Ribofree expression test
I am deciding what kind of depth/coverage I will need for sequencing if I proceed with the Zymo Ribofree library prep kit. I am using the gene count matrices generated through the Express Compare project. In this post, I looked at all of the Mcap Ribofree gene count matrices that I could find to assess how many genes were retained before and after filtering. I did a two-step filtering process - first, I removed all rows whose sum was zero (ie those genes were not expressed at all). Then I applied the pOverA(0.75,5).
All of the code was run on my personal computer under R version 4.3.1
```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)
library(genefilter) library(tidyverse)
getwd()
“/Users/jillashey/Desktop/PutnamLab/Hawaii/2023/DevelopmentalTimeseries”
There were many different gene count matrices on the express compare github.
### CSV #1
`MCap_gene_matrix.csv` from [here](https://github.com/hputnam/Express_Compare/tree/main/RAnalysis/Data/Mcap)
```{r}
# Read in data
matrix1 <- read.csv("MCap_gene_matrix.csv")
matrix1 <- as.data.frame(matrix1)
rownames(matrix1) <- matrix1[,1] #set first column that contains gene names as rownames
matrix1 <- matrix1[,-1] # remove column w/ gene names
dim(matrix1)
# 55086 genes present
# Keep ribofree samples
matrix1 <- matrix1 %>%
select(Sample4, Sample5, Sample6)
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix1 <- matrix1[rowSums(matrix1) != 0, ]
dim(matrix1)
# 53498 genes retained
ffun<-filterfun(pOverA(0.75,5)) #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa1 <- genefilter((matrix1), ffun) #apply filter
sum(filt_outrm_poa1)
# 26476 genes retained
filt_outrm_poa1 <- matrix1[filt_outrm_poa1,] #keep only rows that passed filter
CSV #2
hisat2_MCap_gene_matrix.csv
from here. On github, it is named MCap_gene_matrix.csv
but I added the hisat2 to differentiate it from the csv above.
# Read in data
matrix2 <- read.csv("hisat2_MCap_gene_matrix.csv")
matrix2 <- as.data.frame(matrix2)
rownames(matrix2) <- matrix2[,1] #set first column that contains gene names as rownames
matrix2 <- matrix2[,-1] # remove column w/ gene names
dim(matrix2)
# 55086 genes present
# Keep ribofree samples
matrix2 <- matrix2 %>%
select(Sample4, Sample5, Sample6)
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix2 <- matrix2[rowSums(matrix2) != 0, ]
dim(matrix2)
# 53498 genes retained
ffun<-filterfun(pOverA(0.75,5)) #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa2 <- genefilter((matrix2), ffun) #apply filter
sum(filt_outrm_poa2)
# 25708 genes retained
filt_outrm_poa2 <- matrix2[filt_outrm_poa2,] #keep only rows that passed filter
CSV #3
fastp_STAR_gene_count_matrix.csv
from here. On github, it is named gene_matrix.csv
but I added the fastp_STAR to differentiate it from the other csvs.
# Read in data
matrix3 <- read.csv("fastp_STAR_gene_count_matrix.csv")
matrix3 <- as.data.frame(matrix3)
rownames(matrix3) <- matrix3[,1] #set first column that contains gene names as rownames
matrix3 <- matrix3[,-1] # remove column w/ gene names
dim(matrix3)
# 63227 genes present
# Keep ribofree samples
#matrix3 <- matrix3 %>%
# select(Sample4, Sample5, Sample6)
These sample names (X54, X57, X65) do not correspond to the others in matrix 1 and matrix 2 above (Sample4, Sample5, Sample6). That is confusing but this gene count matrix is under the Mcap Ribofree folder on github.
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix3 <- matrix3[rowSums(matrix3) != 0, ]
dim(matrix3)
# 62673 genes retained
ffun<-filterfun(pOverA(0.75,5)) #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa3 <- genefilter((matrix3), ffun) #apply filter
sum(filt_outrm_poa3)
# 46806 genes retained
filt_outrm_poa3 <- matrix3[filt_outrm_poa3,] #keep only rows that passed filter
CSV #4
trimmomatic_STAR_gene_count_matrix.csv
from here. On github, it is named gene_matrix.csv
but I added the trimmomatic_STAR to differentiate it from the other csvs.
# Read in data
matrix4 <- read.csv("trimmomatic_STAR_gene_count_matrix.csv")
matrix4 <- as.data.frame(matrix4)
rownames(matrix4) <- matrix4[,1] #set first column that contains gene names as rownames
matrix4 <- matrix4[,-1] # remove column w/ gene names
dim(matrix4)
# 63227 genes present
# Keep ribofree samples
#matrix3 <- matrix3 %>%
# select(Sample4, Sample5, Sample6)
These sample names (X54, X57, X65) correspond to matrix 3, but not to the others in matrix 1 and matrix 2 above (Sample4, Sample5, Sample6). That is confusing but this gene count matrix is under the Mcap Ribofree folder on github.
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix4 <- matrix4[rowSums(matrix4) != 0, ]
dim(matrix4)
# 62707 genes retained
ffun<-filterfun(pOverA(0.75,5)) #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa4 <- genefilter((matrix4), ffun) #apply filter
sum(filt_outrm_poa4)
# 48366 genes retained
filt_outrm_poa4 <- matrix4[filt_outrm_poa4,] #keep only rows that passed filter
CSV #5
stringtie_assembly_gene_count_matrix.csv
from here. On github, it is named gene_matrix.csv
but I added the stringtie_assembly to differentiate it from the other csvs.
# Read in data
matrix5 <- read.csv("stringtie_assembly_gene_count_matrix.csv")
matrix5 <- as.data.frame(matrix5)
rownames(matrix5) <- matrix5[,1] #set first column that contains gene names as rownames
matrix5 <- matrix5[,-1] # remove column w/ gene names
dim(matrix5)
# 55086 genes present
# Keep ribofree samples
matrix5 <- matrix5 %>%
select(Sample4.gtf, Sample5.gtf, Sample6.gtf)
These sample names correspond to matrix 1 and matrix 2.
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix5 <- matrix5[rowSums(matrix5) != 0, ]
dim(matrix5)
# 53723 genes retained
ffun<-filterfun(pOverA(0.75,5)) #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa5 <- genefilter((matrix5), ffun) #apply filter
sum(filt_outrm_poa5)
# 28835 genes retained
filt_outrm_poa5 <- matrix5[filt_outrm_poa5,] #keep only rows that passed filter
CSV #6
stringtie_assembly2_gene_count_matrix.csv
from here. On github, it is named gene_matrix.csv
but I added the stringtie_assembly2 to differentiate it from the other csvs.
# Read in data
matrix6 <- read.csv("stringtie_assembly2_gene_count_matrix.csv")
matrix6 <- as.data.frame(matrix6)
rownames(matrix6) <- matrix6[,1] #set first column that contains gene names as rownames
matrix6 <- matrix6[,-1] # remove column w/ gene names
dim(matrix6)
# 54993 genes present
# Keep ribofree samples
matrix6 <- matrix6 %>%
select(Sample4.2.gtf, Sample5.2.gtf, Sample6.2.gtf)
These sample names correspond to matrix 1 and matrix 2.
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix6 <- matrix6[rowSums(matrix6) != 0, ]
dim(matrix6)
# 53638 genes retained
ffun<-filterfun(pOverA(0.75,5)) #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa6 <- genefilter((matrix6), ffun) #apply filter
sum(filt_outrm_poa6)
# 28825 genes retained
filt_outrm_poa6 <- matrix6[filt_outrm_poa6,] #keep only rows that passed filter
Summary
The count matrices had different sample names in them. Here’s the summary info for matrices that included Sample4, Sample5, and Sample6.
Matrix | Library Prep | Genes present | Genes retained after removing rows == 0 | Genes retained after pOverA | Bioinformatics processing | Link to csv |
---|---|---|---|---|---|---|
Matrix 1 | RiboFree | 55086 | 53498 | 26476 | Unknown | https://github.com/hputnam/Express_Compare/tree/main/RAnalysis/Data/Mcap |
Matrix 2 | RiboFree | 55086 | 53498 | 25708 | Hisat2 | https://github.com/hputnam/Express_Compare/blob/main/hisat2_workflow/counts_matrices/MCap_gene_matrix.csv |
Matrix 5 | RiboFree | 55086 | 53723 | 28835 | Stringtie | https://github.com/hputnam/Express_Compare/blob/main/stringtie_assembly/counts_matrices/Mcap/redo/gene_count_matrix.csv |
Matrix 6 | RiboFree | 54993 | 53638 | 28825 | Stringtie | https://github.com/hputnam/Express_Compare/blob/main/stringtie_assembly/counts_matrices/Mcap/redo/gene_count_matrix2.csv |
Here’s the summary info for matrices that included X54, X57, and X65.
Matrix | Library Prep | Genes present | Genes retained after removing rows == 0 | Genes retained after pOverA | Bioinformatics processing | Link to csv |
---|---|---|---|---|---|---|
Matrix 3 | RiboFree | 63227 | 62673 | 46806 | Fastp, STAR, stringtie | https://github.com/hputnam/Express_Compare/blob/main/star_kallisto_workflow/Output/Fastp_STAR_Stringtie_Matrices/Mcap_RiboFree/gene_count_matrix.csv |
Matrix 4 | RiboFree | 63227 | 62707 | 48366 | Trimmomatic, STAR, stringtie | https://github.com/hputnam/Express_Compare/blob/main/star_kallisto_workflow/Output/Trimmomatic_STAR_Stringtie_Matrices/Mcap_RiboFree/gene_count_matrix.csv |