RNA-Seq
Goal: Obtain gene counts from O. faveolata RNA-Seq data
1) Check data integrity
a) Count all files to make sure all downloaded
ls -1 | wc -l
b) Verify data integrity with md5sum
md5sum *.fastq.gz > checkmd5.md5
md5sum -c checkmd5.md5
Should output ‘OK’ next to each file name
c) Count number of reads per file
Fastq files have a specifc format (see here) with 4 lines per read. To count the number of reads, only 1 in every 4 rows needs to be counted.
# Getting one line per read and counting
zgrep -c "^>" *fastq.gz
2) Evaluate quality of reads
a) Run FastQC to check read quality
FastQC is a tool to help identify any discrepancies or problems in your data.
module load FastQC/0.11.8-Java-1.8
for file in /path/to/data/raw/*fastq.gz
do
fastqc $file --outdir /path/to/results/fastqc_results/raw
done
b) Run MultiQC to visualize quality checks
MultiQC works with FastQC output to aggregate and present results in user-friendly way.
module load MultiQC/1.7-foss-2018b-Python-2.7.15
multiqc /path/to/results/fastqc_results/raw/*fastqc.zip -o /path/to/results/multiqc_results/raw
To fully visualize multiQC output, I move the output of MultiQC to my local computer and look at the html file. Because the remote server doesn’t recognize my local file paths, I need to open another terminal window and do it from there.
# Moving html file
scp jillashey@xxx:/path/to/results/multiqc/raw/multiqc_report.html /my/local/computer/
# Moving folder
scp -r jillashey@xx:/path/to/results/multiqc/raw/multiqc_data /my/local/computer/
3) Trim reads
a) Unzip files
Trimmomatic can’t run if files are zipped
gunzip *fastq.gz
b) Run Trimmomatic to trim reads and clean adaptors.
Trimmomatic trims fastq data and removes adapters.
module load Trimmomatic/0.38-Java-1.8
for file in /path/to/data/raw/*fastq
do
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.38.jar SE -phred33 \
$file $file.trim.fq ILLUMINACLIP:/path/to/clip/Illumina_adapter_reads_PE_SE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:20 >> TrimmedAmount.txt
done
4) Evaluate quality of trimmed reads
a) Check file integrity
Count number of files
ls -1 | wc -l
Count number of lines per read
zgrep -c "^>" *fastq.trim.fq
b) Run FastQC to check read quality of trimmed reads
module load FastQC/0.11.8-Java-1.8
for file in /path/to/data/trimmed/*trim.fq
do
fastqc $file --outdir /path/to/results/fastqc_results/trimmed
done
c) Run MultiQC to visualize quality checks
module load MultiQC/1.7-foss-2018b-Python-2.7.15
multiqc /path/to/results/fastqc_results/trimmed/*trim.fq.zip -o /path/to/results/multiqc_results/trimmed
# Moving html file
scp jillashey@xxx:/path/to/results/multiqc_results/trimmed/multiqc_report.html /my/local/computer/
# Moving folder
scp -r jillashey@xx:/path/to/results/multiqc_results/trimmed/multiqc_data /my/local/computer/
5) Align reads to reference genome
STAR (Spliced Transcripts Alignment to a Reference) aligns reads to reference genome. STAR uses sequential maximum mappable seed search followed by seed clustering and stitching to align reads. This method is very fast and accurate, but it requires a lot of memory.
STAR also requires a genome fasta file and genome annotation file.
a) Generate genome index
module load STAR/2.5.3a-foss-2016b
STAR --runThreadN 10 \
--runMode genomeGenerate \
--genomeDir /path/to/GenomeIndex_Ofav \
--genomeFastaFiles /path/to/genome/fasta/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.fna \
--sjdbGTFfile /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff
b) Align reads to genome
# symbolically link to existing directory
ln -s /path/to/data/trimmed/*trim.fq .
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=jillashey@uri.edu
#SBATCH --account=putnamlab
#SBATCH --error="Align_Ofav_out_error"
#SBATCH --output="Align_Ofav_out"
module load STAR/2.5.3a-foss-2016b
F=/path/to/AlignReads_Ofav
array1=($(ls $F/*trim.fq))
for i in ${array1[@]}
do
STAR --runMode alignReads --quantMode TranscriptomeSAM --outTmpDir ${i}_TMP --readFilesIn ${i} --genomeDir /path/to/GenomeIndex_Ofav --twopassMode Basic --twopass1readsN -1 --outStd Log BAM_Unsorted BAM_Quant --outSAMtype BAM Unsorted SortedByCoordinate --outReadsUnmapped Fastx --outFileNamePrefix ${i}.
done
Obtain gene count with stringTie
a) Move BAM files (aligned and sorted by coordinates) to stringTie folder
mv *Aligned.sortedByCoord.out.bam /path/to/stringTie/Ofav/BAM
b) Assemble and estimate reads
module load StringTie/2.1.1-GCCcore-7.3.0
F=/path/to/stringTie/Ofav/BAM
array1=($(ls $F/*bam))
for i in ${array1[@]}; do
stringtie -G /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff -e -o ${i}.gtf ${i}
echo "${i}"
done
c) Merge stringTie gtf results
# move GTF files to their own folder
mv *gtf ../GTFfiles/
ls *gtf > ofav_mergelist.txt
cat ofav_mergelist.txt
module load StringTie/2.1.1-GCCcore-7.3.0
stringtie --merge -p 8 -G /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff -o stringtie_ofav_merged.gtf ofav_mergelist.txt
d) Assess assembly quality
module load gffcompare/0.11.5-foss-2018b
gffcompare -r /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff -o Ofav.merged stringtie_ofav_merged.gtf
e) Re-estimate assembly
module load StringTie/2.1.1-GCCcore-7.3.0
F=/path/to/stringTie/Ofav/BAM
array1=($(ls $F/*bam))
for i in ${array1[@]}
do
stringtie -e -G /path/to/genome/annotation/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.gff -o ${i}.merge.gtf ${i}
echo "${i}"
done
# move merged GTF files to their own folder
mv *merge.gtf ../GTF_merge
f) Create gene matrix
module load StringTie/2.1.1-GCCcore-7.3.0
module load Python/2.7.15-foss-2018b
F=/path/to/stringTie/Ofav/GTF_merge/
array2=($(ls *merge.gtf))
for i in ${array2[@]}
do
echo "${i} $F${i}" >> sample_list_ofav.txt
done
# get prepDE.py function here (xxxx)
python prepDE.py -g gene_count_ofav_matrix.csv -i sample_list_ofav.txt
g) Secure-copy gene counts onto local computer
scp jillashey@xxx:/path/to/stringTie/Ofav/GTF_merge/gene_count_ofav_matrix.csv /my/local/computer/