Mcap Ribofree expression test
I am deciding what kind of depth/coverage I will need for sequencing if I proceed with the Zymo Ribofree library prep kit. I am using the gene count matrices generated through the Express Compare project. In this post, I looked at all of the Mcap Ribofree gene count matrices that I could find to assess how many genes were retained before and after filtering. I did a two-step filtering process - first, I removed all rows whose sum was zero (ie those genes were not expressed at all). Then I applied the pOverA(0.75,5).
All of the code was run on my personal computer under R version 4.3.1
```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)
library(genefilter) library(tidyverse)
getwd()
“/Users/jillashey/Desktop/PutnamLab/Hawaii/2023/DevelopmentalTimeseries”
There were many different gene count matrices on the express compare github. 
### CSV #1
`MCap_gene_matrix.csv` from [here](https://github.com/hputnam/Express_Compare/tree/main/RAnalysis/Data/Mcap)
```{r}
# Read in data
matrix1 <- read.csv("MCap_gene_matrix.csv")
matrix1 <- as.data.frame(matrix1)
rownames(matrix1) <- matrix1[,1] #set first column that contains gene names as rownames
matrix1 <- matrix1[,-1] # remove column w/ gene names 
dim(matrix1)
# 55086 genes present 
# Keep ribofree samples
matrix1 <- matrix1 %>%
  select(Sample4, Sample5, Sample6)
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix1 <- matrix1[rowSums(matrix1) != 0, ]
dim(matrix1)
# 53498 genes retained 
ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa1 <- genefilter((matrix1), ffun) #apply filter
sum(filt_outrm_poa1) 
# 26476 genes retained  
filt_outrm_poa1 <- matrix1[filt_outrm_poa1,] #keep only rows that passed filter
CSV #2
hisat2_MCap_gene_matrix.csv from here. On github, it is named MCap_gene_matrix.csv but I added the hisat2 to differentiate it from the csv above.
# Read in data
matrix2 <- read.csv("hisat2_MCap_gene_matrix.csv")
matrix2 <- as.data.frame(matrix2)
rownames(matrix2) <- matrix2[,1] #set first column that contains gene names as rownames
matrix2 <- matrix2[,-1] # remove column w/ gene names 
dim(matrix2)
# 55086 genes present 
# Keep ribofree samples
matrix2 <- matrix2 %>%
  select(Sample4, Sample5, Sample6)
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix2 <- matrix2[rowSums(matrix2) != 0, ]
dim(matrix2)
# 53498 genes retained 
ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa2 <- genefilter((matrix2), ffun) #apply filter
sum(filt_outrm_poa2) 
# 25708 genes retained  
filt_outrm_poa2 <- matrix2[filt_outrm_poa2,] #keep only rows that passed filter
CSV #3
fastp_STAR_gene_count_matrix.csv from here. On github, it is named gene_matrix.csv but I added the fastp_STAR to differentiate it from the other csvs.
# Read in data
matrix3 <- read.csv("fastp_STAR_gene_count_matrix.csv")
matrix3 <- as.data.frame(matrix3)
rownames(matrix3) <- matrix3[,1] #set first column that contains gene names as rownames
matrix3 <- matrix3[,-1] # remove column w/ gene names 
dim(matrix3)
# 63227 genes present 
# Keep ribofree samples
#matrix3 <- matrix3 %>%
#  select(Sample4, Sample5, Sample6)
These sample names (X54, X57, X65) do not correspond to the others in matrix 1 and matrix 2 above (Sample4, Sample5, Sample6). That is confusing but this gene count matrix is under the Mcap Ribofree folder on github.
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix3 <- matrix3[rowSums(matrix3) != 0, ]
dim(matrix3)
# 62673 genes retained 
ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa3 <- genefilter((matrix3), ffun) #apply filter
sum(filt_outrm_poa3) 
# 46806 genes retained  
filt_outrm_poa3 <- matrix3[filt_outrm_poa3,] #keep only rows that passed filter
CSV #4
trimmomatic_STAR_gene_count_matrix.csv from here. On github, it is named gene_matrix.csv but I added the trimmomatic_STAR to differentiate it from the other csvs.
# Read in data
matrix4 <- read.csv("trimmomatic_STAR_gene_count_matrix.csv")
matrix4 <- as.data.frame(matrix4)
rownames(matrix4) <- matrix4[,1] #set first column that contains gene names as rownames
matrix4 <- matrix4[,-1] # remove column w/ gene names 
dim(matrix4)
# 63227 genes present 
# Keep ribofree samples
#matrix3 <- matrix3 %>%
#  select(Sample4, Sample5, Sample6)
These sample names (X54, X57, X65) correspond to matrix 3, but not to the others in matrix 1 and matrix 2 above (Sample4, Sample5, Sample6). That is confusing but this gene count matrix is under the Mcap Ribofree folder on github.
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix4 <- matrix4[rowSums(matrix4) != 0, ]
dim(matrix4)
# 62707 genes retained 
ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa4 <- genefilter((matrix4), ffun) #apply filter
sum(filt_outrm_poa4) 
# 48366 genes retained  
filt_outrm_poa4 <- matrix4[filt_outrm_poa4,] #keep only rows that passed filter
CSV #5
stringtie_assembly_gene_count_matrix.csv from here. On github, it is named gene_matrix.csv but I added the stringtie_assembly to differentiate it from the other csvs.
# Read in data
matrix5 <- read.csv("stringtie_assembly_gene_count_matrix.csv")
matrix5 <- as.data.frame(matrix5)
rownames(matrix5) <- matrix5[,1] #set first column that contains gene names as rownames
matrix5 <- matrix5[,-1] # remove column w/ gene names 
dim(matrix5)
# 55086 genes present 
# Keep ribofree samples
matrix5 <- matrix5 %>%
  select(Sample4.gtf, Sample5.gtf, Sample6.gtf)
These sample names correspond to matrix 1 and matrix 2.
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix5 <- matrix5[rowSums(matrix5) != 0, ]
dim(matrix5)
# 53723 genes retained 
ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa5 <- genefilter((matrix5), ffun) #apply filter
sum(filt_outrm_poa5) 
# 28835 genes retained  
filt_outrm_poa5 <- matrix5[filt_outrm_poa5,] #keep only rows that passed filter
CSV #6
stringtie_assembly2_gene_count_matrix.csv from here. On github, it is named gene_matrix.csv but I added the stringtie_assembly2 to differentiate it from the other csvs.
# Read in data
matrix6 <- read.csv("stringtie_assembly2_gene_count_matrix.csv")
matrix6 <- as.data.frame(matrix6)
rownames(matrix6) <- matrix6[,1] #set first column that contains gene names as rownames
matrix6 <- matrix6[,-1] # remove column w/ gene names 
dim(matrix6)
# 54993 genes present 
# Keep ribofree samples
matrix6 <- matrix6 %>%
  select(Sample4.2.gtf, Sample5.2.gtf, Sample6.2.gtf)
These sample names correspond to matrix 1 and matrix 2.
Filter reads
# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix6 <- matrix6[rowSums(matrix6) != 0, ]
dim(matrix6)
# 53638 genes retained 
ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa6 <- genefilter((matrix6), ffun) #apply filter
sum(filt_outrm_poa6) 
# 28825 genes retained  
filt_outrm_poa6 <- matrix6[filt_outrm_poa6,] #keep only rows that passed filter
Summary
The count matrices had different sample names in them. Here’s the summary info for matrices that included Sample4, Sample5, and Sample6.
| Matrix | Library Prep | Genes present | Genes retained after removing rows == 0 | Genes retained after pOverA | Bioinformatics processing | Link to csv | 
|---|---|---|---|---|---|---|
| Matrix 1 | RiboFree | 55086 | 53498 | 26476 | Unknown | https://github.com/hputnam/Express_Compare/tree/main/RAnalysis/Data/Mcap | 
| Matrix 2 | RiboFree | 55086 | 53498 | 25708 | Hisat2 | https://github.com/hputnam/Express_Compare/blob/main/hisat2_workflow/counts_matrices/MCap_gene_matrix.csv | 
| Matrix 5 | RiboFree | 55086 | 53723 | 28835 | Stringtie | https://github.com/hputnam/Express_Compare/blob/main/stringtie_assembly/counts_matrices/Mcap/redo/gene_count_matrix.csv | 
| Matrix 6 | RiboFree | 54993 | 53638 | 28825 | Stringtie | https://github.com/hputnam/Express_Compare/blob/main/stringtie_assembly/counts_matrices/Mcap/redo/gene_count_matrix2.csv | 
Here’s the summary info for matrices that included X54, X57, and X65.
| Matrix | Library Prep | Genes present | Genes retained after removing rows == 0 | Genes retained after pOverA | Bioinformatics processing | Link to csv | 
|---|---|---|---|---|---|---|
| Matrix 3 | RiboFree | 63227 | 62673 | 46806 | Fastp, STAR, stringtie | https://github.com/hputnam/Express_Compare/blob/main/star_kallisto_workflow/Output/Fastp_STAR_Stringtie_Matrices/Mcap_RiboFree/gene_count_matrix.csv | 
| Matrix 4 | RiboFree | 63227 | 62707 | 48366 | Trimmomatic, STAR, stringtie | https://github.com/hputnam/Express_Compare/blob/main/star_kallisto_workflow/Output/Trimmomatic_STAR_Stringtie_Matrices/Mcap_RiboFree/gene_count_matrix.csv |