Mcap Ribofree expression test

I am deciding what kind of depth/coverage I will need for sequencing if I proceed with the Zymo Ribofree library prep kit. I am using the gene count matrices generated through the Express Compare project. In this post, I looked at all of the Mcap Ribofree gene count matrices that I could find to assess how many genes were retained before and after filtering. I did a two-step filtering process - first, I removed all rows whose sum was zero (ie those genes were not expressed at all). Then I applied the pOverA(0.75,5).

All of the code was run on my personal computer under R version 4.3.1

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)

library(genefilter) library(tidyverse)

getwd()

“/Users/jillashey/Desktop/PutnamLab/Hawaii/2023/DevelopmentalTimeseries”

There were many different gene count matrices on the express compare github. 

### CSV #1

`MCap_gene_matrix.csv` from [here](https://github.com/hputnam/Express_Compare/tree/main/RAnalysis/Data/Mcap)

```{r}
# Read in data
matrix1 <- read.csv("MCap_gene_matrix.csv")
matrix1 <- as.data.frame(matrix1)
rownames(matrix1) <- matrix1[,1] #set first column that contains gene names as rownames
matrix1 <- matrix1[,-1] # remove column w/ gene names 
dim(matrix1)
# 55086 genes present 

# Keep ribofree samples
matrix1 <- matrix1 %>%
  select(Sample4, Sample5, Sample6)

Filter reads

# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix1 <- matrix1[rowSums(matrix1) != 0, ]
dim(matrix1)
# 53498 genes retained 

ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa1 <- genefilter((matrix1), ffun) #apply filter
sum(filt_outrm_poa1) 
# 26476 genes retained  

filt_outrm_poa1 <- matrix1[filt_outrm_poa1,] #keep only rows that passed filter

CSV #2

hisat2_MCap_gene_matrix.csv from here. On github, it is named MCap_gene_matrix.csv but I added the hisat2 to differentiate it from the csv above.

# Read in data
matrix2 <- read.csv("hisat2_MCap_gene_matrix.csv")
matrix2 <- as.data.frame(matrix2)
rownames(matrix2) <- matrix2[,1] #set first column that contains gene names as rownames
matrix2 <- matrix2[,-1] # remove column w/ gene names 
dim(matrix2)
# 55086 genes present 

# Keep ribofree samples
matrix2 <- matrix2 %>%
  select(Sample4, Sample5, Sample6)

Filter reads

# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix2 <- matrix2[rowSums(matrix2) != 0, ]
dim(matrix2)
# 53498 genes retained 

ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa2 <- genefilter((matrix2), ffun) #apply filter
sum(filt_outrm_poa2) 
# 25708 genes retained  

filt_outrm_poa2 <- matrix2[filt_outrm_poa2,] #keep only rows that passed filter

CSV #3

fastp_STAR_gene_count_matrix.csv from here. On github, it is named gene_matrix.csv but I added the fastp_STAR to differentiate it from the other csvs.

# Read in data
matrix3 <- read.csv("fastp_STAR_gene_count_matrix.csv")
matrix3 <- as.data.frame(matrix3)
rownames(matrix3) <- matrix3[,1] #set first column that contains gene names as rownames
matrix3 <- matrix3[,-1] # remove column w/ gene names 
dim(matrix3)
# 63227 genes present 

# Keep ribofree samples
#matrix3 <- matrix3 %>%
#  select(Sample4, Sample5, Sample6)

These sample names (X54, X57, X65) do not correspond to the others in matrix 1 and matrix 2 above (Sample4, Sample5, Sample6). That is confusing but this gene count matrix is under the Mcap Ribofree folder on github.

Filter reads

# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix3 <- matrix3[rowSums(matrix3) != 0, ]
dim(matrix3)
# 62673 genes retained 

ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa3 <- genefilter((matrix3), ffun) #apply filter
sum(filt_outrm_poa3) 
# 46806 genes retained  

filt_outrm_poa3 <- matrix3[filt_outrm_poa3,] #keep only rows that passed filter

CSV #4

trimmomatic_STAR_gene_count_matrix.csv from here. On github, it is named gene_matrix.csv but I added the trimmomatic_STAR to differentiate it from the other csvs.

# Read in data
matrix4 <- read.csv("trimmomatic_STAR_gene_count_matrix.csv")
matrix4 <- as.data.frame(matrix4)
rownames(matrix4) <- matrix4[,1] #set first column that contains gene names as rownames
matrix4 <- matrix4[,-1] # remove column w/ gene names 
dim(matrix4)
# 63227 genes present 

# Keep ribofree samples
#matrix3 <- matrix3 %>%
#  select(Sample4, Sample5, Sample6)

These sample names (X54, X57, X65) correspond to matrix 3, but not to the others in matrix 1 and matrix 2 above (Sample4, Sample5, Sample6). That is confusing but this gene count matrix is under the Mcap Ribofree folder on github.

Filter reads

# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix4 <- matrix4[rowSums(matrix4) != 0, ]
dim(matrix4)
# 62707 genes retained 

ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa4 <- genefilter((matrix4), ffun) #apply filter
sum(filt_outrm_poa4) 
# 48366 genes retained  

filt_outrm_poa4 <- matrix4[filt_outrm_poa4,] #keep only rows that passed filter

CSV #5

stringtie_assembly_gene_count_matrix.csv from here. On github, it is named gene_matrix.csv but I added the stringtie_assembly to differentiate it from the other csvs.

# Read in data
matrix5 <- read.csv("stringtie_assembly_gene_count_matrix.csv")
matrix5 <- as.data.frame(matrix5)
rownames(matrix5) <- matrix5[,1] #set first column that contains gene names as rownames
matrix5 <- matrix5[,-1] # remove column w/ gene names 
dim(matrix5)
# 55086 genes present 

# Keep ribofree samples
matrix5 <- matrix5 %>%
  select(Sample4.gtf, Sample5.gtf, Sample6.gtf)

These sample names correspond to matrix 1 and matrix 2.

Filter reads

# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix5 <- matrix5[rowSums(matrix5) != 0, ]
dim(matrix5)
# 53723 genes retained 

ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa5 <- genefilter((matrix5), ffun) #apply filter
sum(filt_outrm_poa5) 
# 28835 genes retained  

filt_outrm_poa5 <- matrix5[filt_outrm_poa5,] #keep only rows that passed filter

CSV #6

stringtie_assembly2_gene_count_matrix.csv from here. On github, it is named gene_matrix.csv but I added the stringtie_assembly2 to differentiate it from the other csvs.

# Read in data
matrix6 <- read.csv("stringtie_assembly2_gene_count_matrix.csv")
matrix6 <- as.data.frame(matrix6)
rownames(matrix6) <- matrix6[,1] #set first column that contains gene names as rownames
matrix6 <- matrix6[,-1] # remove column w/ gene names 
dim(matrix6)
# 54993 genes present 

# Keep ribofree samples
matrix6 <- matrix6 %>%
  select(Sample4.2.gtf, Sample5.2.gtf, Sample6.2.gtf)

These sample names correspond to matrix 1 and matrix 2.

Filter reads

# Remove rows whose sum s 0 (ie gene was not expressed at all)
matrix6 <- matrix6[rowSums(matrix6) != 0, ]
dim(matrix6)
# 53638 genes retained 

ffun<-filterfun(pOverA(0.75,5))  #set up filtering parameters--LOOK INTO FILTERING PARAMETERS
filt_outrm_poa6 <- genefilter((matrix6), ffun) #apply filter
sum(filt_outrm_poa6) 
# 28825 genes retained  

filt_outrm_poa6 <- matrix6[filt_outrm_poa6,] #keep only rows that passed filter

Summary

The count matrices had different sample names in them. Here’s the summary info for matrices that included Sample4, Sample5, and Sample6.

Matrix	Library Prep	Genes present	Genes retained after removing rows == 0	Genes retained after pOverA	Bioinformatics processing	Link to csv
Matrix 1	RiboFree	55086	53498	26476	Unknown	https://github.com/hputnam/Express_Compare/tree/main/RAnalysis/Data/Mcap
Matrix 2	RiboFree	55086	53498	25708	Hisat2	https://github.com/hputnam/Express_Compare/blob/main/hisat2_workflow/counts_matrices/MCap_gene_matrix.csv
Matrix 5	RiboFree	55086	53723	28835	Stringtie	https://github.com/hputnam/Express_Compare/blob/main/stringtie_assembly/counts_matrices/Mcap/redo/gene_count_matrix.csv
Matrix 6	RiboFree	54993	53638	28825	Stringtie	https://github.com/hputnam/Express_Compare/blob/main/stringtie_assembly/counts_matrices/Mcap/redo/gene_count_matrix2.csv

Here’s the summary info for matrices that included X54, X57, and X65.

Matrix	Library Prep	Genes present	Genes retained after removing rows == 0	Genes retained after pOverA	Bioinformatics processing	Link to csv
Matrix 3	RiboFree	63227	62673	46806	Fastp, STAR, stringtie	https://github.com/hputnam/Express_Compare/blob/main/star_kallisto_workflow/Output/Fastp_STAR_Stringtie_Matrices/Mcap_RiboFree/gene_count_matrix.csv
Matrix 4	RiboFree	63227	62707	48366	Trimmomatic, STAR, stringtie	https://github.com/hputnam/Express_Compare/blob/main/star_kallisto_workflow/Output/Trimmomatic_STAR_Stringtie_Matrices/Mcap_RiboFree/gene_count_matrix.csv

Written on January 31, 2024