RNA-Seq

Goal: Obtain gene counts from O. faveolata RNA-Seq data

1) Check data integrity

a) Count all files to make sure all downloaded

ls -1 | wc -l

b) Verify data integrity with md5sum

md5sum *.fastq.gz > checkmd5.md5
md5sum -c checkmd5.md5

Should output ‘OK’ next to each file name

c) Count number of reads per file

Fastq files have a specifc format (see here) with 4 lines per read. To count the number of reads, only 1 in every 4 rows needs to be counted.

# Getting one line per read and counting
zgrep -c "^>" *fastq.gz

2) Evaluate quality of reads

a) Run FastQC to check read quality

FastQC is a tool to help identify any discrepancies or problems in your data.

module load FastQC/0.11.8-Java-1.8 

for file in /path/to/data/raw/*fastq.gz
do
fastqc $file --outdir /path/to/results/fastqc_results/raw
done

b) Run MultiQC to visualize quality checks

MultiQC works with FastQC output to aggregate and present results in user-friendly way.

module load MultiQC/1.7-foss-2018b-Python-2.7.15

multiqc /path/to/results/fastqc_results/raw/*fastqc.zip -o /path/to/results/multiqc_results/raw

To fully visualize multiQC output, I move the output of MultiQC to my local computer and look at the html file. Because the remote server doesn’t recognize my local file paths, I need to open another terminal window and do it from there.

# Moving html file
scp jillashey@xxx:/path/to/results/multiqc/raw/multiqc_report.html /my/local/computer/ 

# Moving folder 
scp -r jillashey@xx:/path/to/results/multiqc/raw/multiqc_data /my/local/computer/

3) Trim reads

a) Unzip files

Trimmomatic can’t run if files are zipped

gunzip *fastq.gz

b) Run Trimmomatic to trim reads and clean adaptors.

Trimmomatic trims fastq data and removes adapters.

module load Trimmomatic/0.38-Java-1.8

for file in /path/to/data/raw/*fastq
do
        java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.38.jar SE -phred33 \
        $file $file.trim.fq ILLUMINACLIP:/path/to/clip/Illumina_adapter_reads_PE_SE.fa:2:30:10 \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:20 >> TrimmedAmount.txt
done

4) Evaluate quality of trimmed reads

a) Check file integrity

Count number of files

ls -1 | wc -l

Count number of lines per read

zgrep -c "^>" *fastq.trim.fq

b) Run FastQC to check read quality of trimmed reads

module load FastQC/0.11.8-Java-1.8 

for file in /path/to/data/trimmed/*trim.fq
do
fastqc $file --outdir /path/to/results/fastqc_results/trimmed
done

c) Run MultiQC to visualize quality checks

module load MultiQC/1.7-foss-2018b-Python-2.7.15

multiqc /path/to/results/fastqc_results/trimmed/*trim.fq.zip -o /path/to/results/multiqc_results/trimmed

# Moving html file
scp jillashey@xxx:/path/to/results/multiqc_results/trimmed/multiqc_report.html /my/local/computer/ 

# Moving folder 
scp -r jillashey@xx:/path/to/results/multiqc_results/trimmed/multiqc_data /my/local/computer/

5) Align reads to reference genome

STAR (Spliced Transcripts Alignment to a Reference) aligns reads to reference genome. STAR uses sequential maximum mappable seed search followed by seed clustering and stitching to align reads. This method is very fast and accurate, but it requires a lot of memory.

STAR also requires a genome fasta file and genome annotation file.

a) Generate genome index

module load STAR/2.5.3a-foss-2016b

STAR --runThreadN 10 \
--runMode genomeGenerate \
--genomeDir /path/to/GenomeIndex_Ofav \
--genomeFastaFiles /path/to/genome/fasta/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.fna \
--sjdbGTFfile /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff

b) Align reads to genome

# symbolically link to existing directory 
ln -s /path/to/data/trimmed/*trim.fq .

#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=jillashey@uri.edu
#SBATCH --account=putnamlab
#SBATCH --error="Align_Ofav_out_error"
#SBATCH --output="Align_Ofav_out"

module load STAR/2.5.3a-foss-2016b

F=/path/to/AlignReads_Ofav

array1=($(ls $F/*trim.fq))
for i in ${array1[@]}
do
STAR --runMode alignReads --quantMode TranscriptomeSAM --outTmpDir ${i}_TMP --readFilesIn ${i} --genomeDir /path/to/GenomeIndex_Ofav --twopassMode Basic --twopass1readsN -1 --outStd Log BAM_Unsorted BAM_Quant --outSAMtype BAM Unsorted SortedByCoordinate --outReadsUnmapped Fastx --outFileNamePrefix ${i}.
done 

Obtain gene count with stringTie

a) Move BAM files (aligned and sorted by coordinates) to stringTie folder

mv *Aligned.sortedByCoord.out.bam /path/to/stringTie/Ofav/BAM

b) Assemble and estimate reads

module load StringTie/2.1.1-GCCcore-7.3.0

F=/path/to/stringTie/Ofav/BAM

array1=($(ls $F/*bam))
for i in ${array1[@]}; do
	stringtie -G /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff -e -o ${i}.gtf ${i}
	echo "${i}"
done

c) Merge stringTie gtf results

# move GTF files to their own folder 
mv *gtf ../GTFfiles/

ls *gtf > ofav_mergelist.txt
cat ofav_mergelist.txt

module load StringTie/2.1.1-GCCcore-7.3.0

stringtie --merge -p 8 -G /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff -o stringtie_ofav_merged.gtf ofav_mergelist.txt

d) Assess assembly quality

module load gffcompare/0.11.5-foss-2018b

gffcompare -r /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff -o Ofav.merged stringtie_ofav_merged.gtf

e) Re-estimate assembly

module load StringTie/2.1.1-GCCcore-7.3.0

F=/path/to/stringTie/Ofav/BAM

array1=($(ls $F/*bam))
for i in ${array1[@]}
do
stringtie -e -G /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff -o ${i}.merge.gtf ${i}
echo "${i}"
done

# move merged GTF files to their own folder 
mv *merge.gtf ../GTF_merge

f) Create gene matrix

module load StringTie/2.1.1-GCCcore-7.3.0
module load Python/2.7.15-foss-2018b

F=/path/to/stringTie/Ofav/GTF_merge/

array2=($(ls *merge.gtf))

for i in ${array2[@]}
do
echo "${i} $F${i}" >> sample_list_ofav.txt
done

# get prepDE.py function here (xxxx)
python prepDE.py -g gene_count_ofav_matrix.csv -i sample_list_ofav.txt

g) Secure-copy gene counts onto local computer

scp jillashey@xxx:/path/to/stringTie/Ofav/GTF_merge/gene_count_ofav_matrix.csv /my/local/computer/ 

Yay, you got O. fav gene counts! Now head over to R for some differential gene expression analysis!

Written on September 24, 2020