Astrangia 2021 mRNA analysis
Astrangia 2021 mRNA analysis
These data came from my Astrangia 2021 experiment, during which adult Astrangia colonies were exposed to ambient and high temperatures for ~9 months.
Files were downloaded to this location: /data/putnamlab/KITT/hputnam/20230614_Astrangia_mRNA
Set-up
Make new folders for this project in my directory
cd /data/putnamlab/jillashey
mkdir Astrangia2021
cd Astrangia2021
mkdir mRNA
cd mRNA
mkdir data scripts fastqc
cd data
mkdir raw
cd raw
Copy the new files into raw data folder
cp /data/putnamlab/KITT/hputnam/20230614_Astrangia_mRNA/* .
Check to make sure all files were transferred successfully
md5sum *.fastq.gz > checkmd5.md5
md5sum -c checkmd5.md5
AST-1065_R1_001.fastq.gz: OK
AST-1065_R2_001.fastq.gz: OK
AST-1105_R1_001.fastq.gz: OK
AST-1105_R2_001.fastq.gz: OK
AST-1147_R1_001.fastq.gz: OK
AST-1147_R2_001.fastq.gz: OK
AST-1412_R1_001.fastq.gz: OK
AST-1412_R2_001.fastq.gz: OK
AST-1560_R1_001.fastq.gz: OK
AST-1560_R2_001.fastq.gz: OK
AST-1567_R1_001.fastq.gz: OK
AST-1567_R2_001.fastq.gz: OK
AST-1617_R1_001.fastq.gz: OK
AST-1617_R2_001.fastq.gz: OK
AST-1722_R1_001.fastq.gz: OK
AST-1722_R2_001.fastq.gz: OK
AST-2000_R1_001.fastq.gz: OK
AST-2000_R2_001.fastq.gz: OK
AST-2007_R1_001.fastq.gz: OK
AST-2007_R2_001.fastq.gz: OK
AST-2302_R1_001.fastq.gz: OK
AST-2302_R2_001.fastq.gz: OK
AST-2360_R1_001.fastq.gz: OK
AST-2360_R2_001.fastq.gz: OK
AST-2398_R1_001.fastq.gz: OK
AST-2398_R2_001.fastq.gz: OK
AST-2404_R1_001.fastq.gz: OK
AST-2404_R2_001.fastq.gz: OK
AST-2412_R1_001.fastq.gz: OK
AST-2412_R2_001.fastq.gz: OK
AST-2512_R1_001.fastq.gz: OK
AST-2512_R2_001.fastq.gz: OK
AST-2523_R1_001.fastq.gz: OK
AST-2523_R2_001.fastq.gz: OK
AST-2563_R1_001.fastq.gz: OK
AST-2563_R2_001.fastq.gz: OK
AST-2729_R1_001.fastq.gz: OK
AST-2729_R2_001.fastq.gz: OK
AST-2755_R1_001.fastq.gz: OK
AST-2755_R2_001.fastq.gz: OK
Count number of reads per file
zgrep -c "@NGSNJ" *fastq.gz
AST-1065_R1_001.fastq.gz:42852873
AST-1065_R2_001.fastq.gz:42852873
AST-1105_R1_001.fastq.gz:36082209
gzip: AST-1105_R2_001.fastq.gz: unexpected end of file
AST-1105_R2_001.fastq.gz:409
AST-1147_R1_001.fastq.gz:41008372
AST-1147_R2_001.fastq.gz:41008372
AST-1412_R1_001.fastq.gz:33078120
AST-1412_R2_001.fastq.gz:33078120
AST-1560_R1_001.fastq.gz:35909631
AST-1560_R2_001.fastq.gz:35909631
AST-1567_R1_001.fastq.gz:36789468
AST-1567_R2_001.fastq.gz:36789468
AST-1617_R1_001.fastq.gz:31764172
AST-1617_R2_001.fastq.gz:31764172
AST-1722_R1_001.fastq.gz:34941521
AST-1722_R2_001.fastq.gz:34941521
AST-2000_R1_001.fastq.gz:30315074
AST-2000_R2_001.fastq.gz:30315074
AST-2007_R1_001.fastq.gz:39369837
AST-2007_R2_001.fastq.gz:39369837
AST-2302_R1_001.fastq.gz:33326148
AST-2302_R2_001.fastq.gz:33326148
AST-2360_R1_001.fastq.gz:32840507
AST-2360_R2_001.fastq.gz:32840507
AST-2398_R1_001.fastq.gz:33309047
AST-2398_R2_001.fastq.gz:33309047
AST-2404_R1_001.fastq.gz:38349371
AST-2404_R2_001.fastq.gz:38349371
AST-2412_R1_001.fastq.gz:35436774
AST-2412_R2_001.fastq.gz:35436774
AST-2512_R1_001.fastq.gz:35323599
AST-2512_R2_001.fastq.gz:35323599
AST-2523_R1_001.fastq.gz:34086320
AST-2523_R2_001.fastq.gz:34086320
AST-2563_R1_001.fastq.gz:35799766
AST-2563_R2_001.fastq.gz:35799766
AST-2729_R1_001.fastq.gz:22735638
AST-2729_R2_001.fastq.gz:22735638
AST-2755_R1_001.fastq.gz:33057601
AST-2755_R2_001.fastq.gz:33057601
Raw QC
Run fastqc to quality check raw reads
In scripts folder: nano fastqc_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH --error="fastqc_raw_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastqc_raw_output" #once your job is completed, any final job report comments will be put in this file
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
for file in /data/putnamlab/jillashey/Astrangia2021/mRNA/data/raw/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/jillashey/Astrangia2021/mRNA/fastqc/raw
done
multiqc --interactive fastqc_results
Submitted batch job 265922
QC results:
The quality scores all look really good, happy about that. There is quite a bit of adapter content though.
Trim data
Using fastp to trim data.
In scripts folder: nano fastp_QC.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH --error="fastp_QC_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastp_QC_output" #once your job is completed, any final job report comments will be put in this file
# Load modules needed
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
# Make an array of sequences to trim in raw data directory
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/data/raw
array1=($(ls *R1_001.fastq.gz))
echo "Read trimming of adapters started." $(date)
# fastp and fastqc loop
for i in ${array1[@]}; do
fastp --in1 ${i} \
--in2 $(echo ${i}|sed s/_R1/_R2/)\
--out1 /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.${i} \
--out2 /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.$(echo ${i}|sed s/_R1/_R2/) \
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--unqualified_percent_limit 10 \
--length_required 100 \
--cut_right cut_right_window_size 5 cut_right_mean_quality 20
fastqc /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.${i}
fastqc /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.$(echo ${i}|sed s/_R1/_R2/)
done
echo "Read trimming of adapters complete." $(date)
# Quality Assessment of Trimmed Reads
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/ #go to output directory
# Compile MultiQC report from FastQC files
multiqc --interactive ./
echo "Cleaned MultiQC report generated." $(date)
Submitted batch job 267173
ADD QC PLOTS
Let’s look at the counts from the trimmed data
zgrep -c "@NGSNJ" *fastq.gz
trimmed.AST-1065_R1_001.fastq.gz:0
trimmed.AST-1065_R2_001.fastq.gz:28514704
trimmed.AST-1105_R1_001.fastq.gz:0
trimmed.AST-1105_R2_001.fastq.gz:283
trimmed.AST-1147_R1_001.fastq.gz:0
trimmed.AST-1147_R2_001.fastq.gz:27401584
trimmed.AST-1412_R1_001.fastq.gz:0
trimmed.AST-1412_R2_001.fastq.gz:23324255
trimmed.AST-1560_R1_001.fastq.gz:0
trimmed.AST-1560_R2_001.fastq.gz:25261438
trimmed.AST-1567_R1_001.fastq.gz:0
trimmed.AST-1567_R2_001.fastq.gz:23711472
trimmed.AST-1617_R1_001.fastq.gz:0
trimmed.AST-1617_R2_001.fastq.gz:21247535
trimmed.AST-1722_R1_001.fastq.gz:0
trimmed.AST-1722_R2_001.fastq.gz:24754153
trimmed.AST-2000_R1_001.fastq.gz:0
trimmed.AST-2000_R2_001.fastq.gz:20483353
trimmed.AST-2007_R1_001.fastq.gz:0
trimmed.AST-2007_R2_001.fastq.gz:26160891
trimmed.AST-2302_R1_001.fastq.gz:0
trimmed.AST-2302_R2_001.fastq.gz:22054499
trimmed.AST-2360_R1_001.fastq.gz:0
trimmed.AST-2360_R2_001.fastq.gz:21377658
trimmed.AST-2398_R1_001.fastq.gz:0
trimmed.AST-2398_R2_001.fastq.gz:21796937
trimmed.AST-2404_R1_001.fastq.gz:0
trimmed.AST-2404_R2_001.fastq.gz:26894798
trimmed.AST-2412_R1_001.fastq.gz:0
trimmed.AST-2412_R2_001.fastq.gz:25372505
trimmed.AST-2512_R1_001.fastq.gz:0
trimmed.AST-2512_R2_001.fastq.gz:24221309
trimmed.AST-2523_R1_001.fastq.gz:0
trimmed.AST-2523_R2_001.fastq.gz:23751953
trimmed.AST-2563_R1_001.fastq.gz:0
trimmed.AST-2563_R2_001.fastq.gz:25242216
trimmed.AST-2729_R1_001.fastq.gz:0
trimmed.AST-2729_R2_001.fastq.gz:15551660
trimmed.AST-2755_R1_001.fastq.gz:0
trimmed.AST-2755_R2_001.fastq.gz:24720522
Why do the R1 files have 0 reads??
Align trimmed data to reference genome
Should I use hisat2 or bowtie? Only considering using bowtie because that’s used in a lot of ncRNA analyses. Let’s do both!
Make new folders in output directory
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/
mkdir hisat2 bowtie
cd hisat2
mkdir refs align
cd ../bowtie
mkdir refs align
HISAT2 alignment
First, index the reference genome.
In scripts folder: nano hisat2_build.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH --error="hisat2_build_error" #if your job fails, the error report will be put in this file
#SBATCH --output="hisat2_build_output" #once your job is completed, any final job report comments will be put in this file
# load modules needed
module load HISAT2/2.2.1-foss-2019b #Alignment to reference genome: HISAT2
echo "Indexing reference genome" $(date)
# Index the reference genome for A. poculata
hisat2-build -f /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta ./Apoc_ref
echo "Referece genome indexed!" $(date)
Submitted batch job 267749. After this is done running, move reference files from scripts folder to hisat2 refs folder: mv Apoc_ref.* ../output/hisat2/refs/
Use the indexed reference genome to align sequences
In scripts folder: nano hisat2_align.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH --error="hisat2_align_error" #if your job fails, the error report will be put in this file
#SBATCH --output="hisat2_align_output" #once your job is completed, any final job report comments will be put in this file
# load modules needed
module load HISAT2/2.2.1-foss-2019b #Alignment to reference genome: HISAT2
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
echo "Aligning reads to reference genome" $(date)
# This script exports alignments as bam files, sorts the bam file because Stringtie takes a sorted file for input (--dta), and removes the sam file because it is no longer needed
array=($(ls /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/*fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
hisat2 -p 8 --rna-strandness RF --dta -q -x ../output/hisat2/refs/Apoc_ref -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -S ${i}.sam
samtools sort -@ 8 -o ${i}.bam ${i}.sam
echo "${i} bam-ified!"
rm ${i}.sam
done
echo "Alignment complete!" $(date)
Submitted batch job 267750 - this job was completed, but I think it didn’t work properly. In the hisat2 error file, there was some kind of GCC conflict error. Purged modules from the environment, then ran job again. Submitted batch job 269176 on 6/30/23. Checked the error file 30 seconds after running and got the same errors:
foss/2018b(23):ERROR:150: Module 'foss/2018b' conflicts with the currently loaded module(s) 'foss/2019b'
foss/2018b(23):ERROR:102: Tcl command execution failed: conflict foss
GCCcore/7.3.0(23):ERROR:150: Module 'GCCcore/7.3.0' conflicts with the currently loaded module(s) 'GCCcore/8.3.0'
GCCcore/7.3.0(23):ERROR:102: Tcl command execution failed: conflict GCCcore
GCCcore/7.3.0(23):ERROR:150: Module 'GCCcore/7.3.0' conflicts with the currently loaded module(s) 'GCCcore/8.3.0'
GCCcore/7.3.0(23):ERROR:102: Tcl command execution failed: conflict GCCcore
GCCcore/7.3.0(23):ERROR:150: Module 'GCCcore/7.3.0' conflicts with the currently loaded module(s) 'GCCcore/8.3.0'
GCCcore/7.3.0(23):ERROR:102: Tcl command execution failed: conflict GCCcore
GCCcore/7.3.0(23):ERROR:150: Module 'GCCcore/7.3.0' conflicts with the currently loaded module(s) 'GCCcore/8.3.0'
GCCcore/7.3.0(23):ERROR:102: Tcl command execution failed: conflict GCCcore
Stopping the job now. I don’t think it likes the modules that I’m trying to load. Use these modules instead: HISAT2/2.2.1-gompi-2021b
and SAMtools/1.16.1-GCC-11.3.0
. Submitted batch job 269177. still getting a conflict error. Adding this line before loading modules: source /usr/share/Modules/init/sh # load the module function (specific need to my computer)
from Emma’s script. Not sure what it does but maybe it’ll help??? Submitted batch job 269178. nope same error. Trying HISAT2/2.2.1-gompi-2021b
and SAMtools/1.15.1-GCC-11.2.0
(from Emma’s script linked above). Submitted batch job 269179. This worked!
View mapping percentages for the hisat2 alignment
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
for i in *.bam; do
echo "${i}" >> mapped_reads_counts_Apoc
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_Apoc
done
Move BAM files in trimmed folder to hisat align folder
mv *bam ../../output/hisat2/align/
Should the output should be R1 and R2 in one sam file…why is this not the case with the hisat2 samples? if not now, when do they get combined into a single file?
Bowtie2 alignment
First, index the reference genome.
In scripts folder: nano bowtie2_build.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH --error="bowtie2_build_error" #if your job fails, the error report will be put in this file
#SBATCH --output="bowtie2_build_output" #once your job is completed, any final job report comments will be put in this file
# load modules needed
module load Bowtie2/2.4.4-GCC-11.2.0
echo "Indexing reference genome" $(date)
# Index the reference genome for A. poculata
bowtie2-build /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta Apoc_ref.btindex
echo "Referece genome indexed!" $(date)
Submitted batch job 269279
After this is done running, move reference files to bowtie folder: mv *.bt2 ../output/bowtie/refs/
Use the indexed reference genome to align sequences
In scripts folder: nano bowtie_align.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH --error="bowtie_align_error" #if your job fails, the error report will be put in this file
#SBATCH --output="bowtie_align_output" #once your job is completed, any final job report comments will be put in this file
module load Bowtie2/2.4.4-GCC-11.2.0
echo "Aligning reads" $(date)
array=($(ls /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/*fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
bowtie2 -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -x /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/refs/Apoc_ref.btindex -b --align-paired-reads
done
echo "Reads aligned!" $(date)
Submitted batch job 275228. Didn’t work. Getting this error:
0 reads
0.00% overall alignment rate
Warning: Same mate file "/data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1065_R2_001.fastq.gz" appears as argument to both -1 and -2
0 reads
Maybe it didn’t like the --align-paired-reads
argument? Going to remove this and rerun the code. Submitted batch job 275230. Got the same error… Editing the code so that the array is array=($(ls /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/*_R1_001.fastq.gz))
based on Emma’s code. Submitted batch job 275237. Failed again, still saying 0% alignment rate. In the -x
argument, changed Apoc_ref.btindex
to Apoc_ref
. Submitted batch job 275239. It didn’t like that! Changing it back.
Still getting 0% alignment. The error file says: “Error while reading BAM extra subfields
(ERR): bowtie2-align exited with value 1”. Maybe it’s an issue with the path? Changing -x /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/refs/Apoc_ref.btindex
to -x ../output/bowtie/refs/Apoc_ref.btindex
. Submitted batch job 275248. Still getting the same error???
It’s 20231004. Let’s get back to aligning with bowtie2. This is the code I have now:
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH --error="bowtie_align_error" #if your job fails, the error report will be put in this file
#SBATCH --output="bowtie_align_output" #once your job is completed, any final job report comments will be put in this file
module load Bowtie2/2.4.4-GCC-11.2.0
echo "Aligning reads" $(date)
array=($(ls /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/*_R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
bowtie2 -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -x ../output/bowtie/refs/Apoc_ref.btindex -b
done
echo "Reads aligned!" $(date)
Let’s try running it again and seeing what happens. Submitted batch job 283318. Nope same error as above. I may need to add a bam file name. Added -b ${i}
. Submitted batch job 283321. Didn’t work, this is the error that I’m getting: Warning: Output file '{i}' was specified without -S. This will not work in future Bowtie 2 versions. Please use -S instead.
-S
means to write the output files as sam files, but I don’t want them as sam files. I’m going to add the -S
anyway and see what happens. I can always convert the sam to bam files. Submitted batch job 283322
Now I’m getting this error:
Extra parameter(s) specified: "/data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1412_R1_001.fastq.gz"
Note that if <mates> files are specified using -1/-2, a <singles> file cannot
also be specified. Please run bowtie separately for mates and singles.
Error: Encountered internal Bowtie 2 exception (#1)
Command: /opt/software/Bowtie2/2.4.4-GCC-11.2.0/bin/bowtie2-align-s --wrapper basic-0 -x ../output/bowtie/refs/Apoc_ref.btindex -S /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1412_R1_001.fastq.gz -1 /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1412_R1_001.fastq.gz -2 /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1412_R2_001.fastq.gz -b /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1412_R1_001.fastq.gz
(ERR): bowtie2-align exited with value 1
It looks like it thinks I’m trying to run a singles file but its not working…Going back! Changing the syntax of the array a little bit and removing the bam and sam commands
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH --error="bowtie_align_error" #if your job fails, the error report will be put in this file
#SBATCH --output="bowtie_align_output" #once your job is completed, any final job report comments will be put in this file
module load Bowtie2/2.4.4-GCC-11.2.0
echo "Aligning reads" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/
array=($(ls /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/*_R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
bowtie2 -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -x /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/refs/Apoc_ref.btindex
done
echo "Reads aligned!" $(date)
Submitted batch job 283326. Now I got this error:
Error, fewer reads in file specified with -1 than in file specified with -2
terminate called after throwing an instance of 'int'
(ERR): bowtie2-align died with signal 6 (ABRT)
Unable to read file magic number
Unsure why bowtie is so angry at me. boo. Still need to troubleshoot as of 10/4/23.
20240103
I may have found the source of the problem…when I trim my data, for some reason, the R1 files now have no data in them. Need to go back to trimming step.
Assemble and quantify transcripts
Hisat2
Going to be using stringtie to assemble and quantify transcripts
cd /data/putnamlab/jillashey/Astrangia2021/mRNA
mkdir stringtie
cd scripts
In the scripts folder: nano assemble.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=128GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/output/hisat2/align
#SBATCH --error="stringtie-hisat-assemble_error" #if your job fails, the error report will be put in this file
#SBATCH --output="stringtie-hisat-assemble_error" #once your job is completed, any final job report comments will be put in this file
module load StringTie/2.2.1-GCC-11.2.0
echo "Starting stringtie assembly" $(date)
# Transcript assembly: StringTie
array1=($(ls *.bam)) #Make an array of sequences to assemble
for i in ${array1[@]}; do
sample_name=`echo $i| awk -F [_] '{print $1"_"$2"_"$3}'`
stringtie -p 8 -e -B -G /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -A ${sample_name}.gene_abund.tab -o ${sample_name}.gtf ${i}
echo "StringTie assembly for seq file ${i}" $(date)
done
echo "Stringtie assembly complete" $(date)
Submitted batch job 275232. Job was pending for about two days, but then ran in about an hour.
Move GTF and gene abundance files into stringtie folder, as they are currently in the hisat2 align folder
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/hisat2/align
mv *gtf ../../../stringtie/
mv *gene_abund.tab ../../../stringtie/
mv *.ctab ../../../stringtie/
mv stringtie-hisat-assemble_error ../../../scripts/
Make the gene count matrix! This step uses a script called prepDE.py that can be downloaded/copied from the StringTie github repo. Copy the script into the scripts folder.
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
nano prepDE.py # copy script from github into here and save
In the scripts folder: nano prepDE.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=200GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --error="prepDE_error" #if your job fails, the error report will be put in this file
#SBATCH --output="prepDE_output" #once your job is completed, any final job report comments will be put in this file
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/stringtie
echo "Starting assembly and assembly QC" $(date)
# Load packages
module load Python/2.7.15-foss-2018b #Python
module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0 #Transcript assembly QC: GFFCompare
# Make gtf_list.txt file
ls *.gtf > gtf_list.txt
stringtie --merge -e -p 8 -G /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -o Apoc_merged.gtf gtf_list.txt #Merge GTFs
echo "Stringtie merge complete" $(date)
gffcompare -r /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -G -o merged Cvir_merged.gtf #Compute the accuracy
echo "GFFcompare complete, Starting gene count matrix assembly..." $(date)
#Note: the merged part is actually redundant and unnecessary unless we perform the original stringtie step without the -e function and perform
#re-estimation with -e after stringtie --merge, but will redo the pipeline later and confirm that I get equal results.
#make gtf list text file
for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt
python /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts/prepDE.py -g Apoc_mRNA_count_matrix.csv -i listGTF.txt #Compile the gene count matrix
echo "Gene count matrix compiled." $(date)
The above code was written by Zoe in this post. Submitted batch job 275703
20240103
I’m going to go in date order and then rewrite and make it pretty. Today, I figured out that the trimming step resulted in the R1 files having 0 reads. Let’s try to fix that. When I look at the trimmed R1 files, it says:
@HD VN:1.0 SO:unsorted
@SQ SN:chromosome_1 LN:21106950
@SQ SN:chromosome_2 LN:21532011
@SQ SN:chromosome_3 LN:22725890
@SQ SN:chromosome_4 LN:27282911
@SQ SN:chromosome_5 LN:29602567
@SQ SN:chromosome_6 LN:30212294
@SQ SN:chromosome_7 LN:30683508
@SQ SN:chromosome_8 LN:29863146
@SQ SN:chromosome_9 LN:31044737
@SQ SN:chromosome_10 LN:33861578
@SQ SN:chromosome_11 LN:33534404
@SQ SN:chromosome_12 LN:42634786
@SQ SN:chromosome_13 LN:42928715
@SQ SN:chromosome_14 LN:58409227
@PG ID:bowtie2 PN:bowtie2 VN:2.4.4 CL:"/opt/software/Bowtie2/2.4.4-GCC-11.2.0/bin/bowtie2-align-s --wrapper basic-0 -x ../output/bowtie/refs/Apoc_ref.btindex -1 /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1065_R1_001.fastq.gz -2 /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1065_R2_001.fastq.gz -b /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/trimmed.AST-1065_R1_001.fastq.gz"
Did I accidently replace my data with bowtie info? I’m going to check to make sure the raw reads look okay still and then rerun fastp. I’m going to try trimming just one sample first in interactive mode in the trim folder.
module load fastp/0.19.7-foss-2018b
fastp --in1 ../raw/AST-1065_R1_001.fastq.gz \
--in2 ../raw/AST-1065_R2_001.fastq.gz \
--out1 test.AST-1065_R1_001.fastq.gz \
--out2 test.AST-1065_R2_001.fastq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--unqualified_percent_limit 10 \
--length_required 100 \
--cut_right cut_right_window_size 5 cut_right_mean_quality 20
Didn’t finish running, but it seems like this is the way to go. Rerun fastp_QC.sh
. Submitted batch job 291968
Now I’m going to attempt to run bowtie2 again. I already made the bowtie2 genome index above so lets align the data. I need to be careful not to overwrite my trimmed files with bowtie info. First, I’m going to try with just one sample:
interactive
module load GCCcore/11.2.0 #I needed to add this to resolve conflicts between loaded GCCcore/9.3.0 and GCCcore/11.3.0
module load Bowtie2/2.4.4-GCC-11.2.0
bowtie2 -1 trimmed.AST-1065_R1_001.fastq.gz -2 trimmed.AST-1065_R2_001.fastq.gz -x /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/refs/Apoc_ref.btindex -S test
Batch job 291993. This appears to have been successful. Let’s try to convert the test sam file into a bam file.
module load SAMtools/1.9-foss-2018b
samtools sort -@ 8 -o test.bam test.sam
Gives me this error: Illegal instruction (core dumped)
.
20240104
Trimming finished last night and plots look good. Now I’m going to run bowtie2 to align all samples. Here’s the bowtie2 code for all of the samples. In scripts: nano bowtie2_align.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.out
module load Bowtie2/2.4.4-GCC-11.2.0
echo "Aligning reads" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/
array=($(ls *R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
bowtie2 -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -x /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/refs/Apoc_ref.btindex -S -b --align-paired-reads
done
echo "Reads aligned!" $(date)
Submitted batch job 292070. failed immediately. Error was: Warning: Output file '/data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/test.AST-1065_R1_001.fastq.gz.bam' was specified without -S. This will not work in future Bowtie 2 versions. Please use -S instead.
. Going to add -S
even though I want a bam file. Submitted batch job 292080. Now this error:
Extra parameter(s) specified: "/data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/test.AST-1065_R1_001.fastq.gz.bam"
Note that if <mates> files are specified using -1/-2, a <singles> file cannot
also be specified. Please run bowtie separately for mates and singles.
Error: Encountered internal Bowtie 2 exception (#1)
I think it’s angry that I specified a single output file for bam so I’m going to remove the file name. Submitted batch job 292081
Now it says this:
Error: 0 mate files/sequences were specified with -1, but 1
mate files/sequences were specified with -2. The same number of mate files/
sequences must be specified with -1 and -2.
Error: Encountered internal Bowtie 2 exception (#1)
Command: /opt/software/Bowtie2/2.4.4-GCC-11.2.0/bin/bowtie2-align-s --wrapper basic-0 -x /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/refs/Apoc_ref.btindex -S -1 -2 /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/test.AST-1065_R2_001.fastq.gz -b --align-paired-reads /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/test.AST-1065_R1_001.fastq.gz
(ERR): bowtie2-align exited with value 1
but there is a file in the -1 position? Maybe the array doesn’t like having the full path name in the array itself? Going to remove that and rerun. Submitted batch job 292082. Same error as above… This is what it thinks my command is: Command: /opt/software/Bowtie2/2.4.4-GCC-11.2.0/bin/bowtie2-align-s --wrapper basic-0 -x /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/refs/Apoc_ref.btindex -S -1 -2 trimmed.AST-1065_R2_001.fastq.gz -b --align-paired-reads trimmed.AST-1065_R1_001.fastq.gz
. For some reason, it is moving the R1 to the end of the script. I’m going to space out using \
linebreaks the script to see if that fixes it
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Bowtie2/2.4.4-GCC-11.2.0
echo "Aligning reads" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/
array=($(ls *R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
bowtie2 -x /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/refs/Apoc_ref.btindex \
-1 ${i} \
-2 $(echo ${i}|sed s/_R1/_R2/) \
-S \
-b \
--align-paired-reads
done
echo "Reads aligned!" $(date)
Submitted batch job 292083. SAME ERROR. Going to comment everything except for the -x, -1, and -2 out. Submitted batch job 292084. Appears to be running but the output seems like its going into the slurm output file…Canceled the job. I’m going to add the -b back in a specify a file name (-b align.${i}
). This should output bam files with the specific file names. Submitted batch job 292085. Still seems to be outputting to the slurm output file…I’m going to stop bowtie and comment out the b and include the -S back in. Submitted batch job 292098
The aligned files are getting put in sam files but labeled as .gz. Will need to convert to sam files, then to bam files once alignment is finished. The output files are also labeled R1 but contain info for both R1 and R2 of that sample.
20240105
Convert .gz files to proper .sam file name and convert to bam file. Test on one sample first in a test script. In the scripts folder: nano test_sam_to_bam.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim
#mv align.trimmed.AST-1065_R1_001.fastq.gz align.trimmed.AST-1065.sam
samtools sort -o align.trimmed.AST-1065.bam align.trimmed.AST-1065.sam
Submitted batch job 292161. Worked! Now I need to make do it for all the samples
20240107
First, I’m going to move the aligned reads to the bowtie output alignment folder: mv /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/align* /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
In the scripts folder: nano sam_to_bam.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
echo "Converting sam to bam" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
array=($(ls *R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
mv ${i} ${i}.sam
samtools sort -o ${i}.bam ${i}.sam
rm ${i}.sam
done
echo "Files converted to bam, sam files removed" $(date)
Submitted batch job 292223. Job has been pending for a while…cancelled the job. Adding -@ 8
to the script to indicate to use 8 threads while running. Submitted batch job 292231
20240107
Cancelled job 292231, as it was still pending. Editing the sbatch info for the script so that the time is 24 hrs, n tasks per node is 1 and mem is 100 GB. Submitted batch job 292244. Got an error saying it didn’t recognize the -@
argument. Removing that and resubmitting. Submitted batch job 292245. Now getting the error: ls: cannot access *R1_001.fastq.gz: No such file or directory
. The aligned files are gone from the bowtie align folder…fml. Somehow, they must have gotten deleted from the sam to bam script? That’s super annoying.
I guess I need to rerun bowtie. Submitted batch job 292246
20240114
Going to run a test sam to bam so that I can confirm that the files are not being deleted.
In the scripts folder, modify the test_sam_to_bam.sh
script:
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
echo "Converting sam to bam for test sample" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim
mv align.test.AST-1065_R1_001.fastq.gz align.test.AST-1065.sam
samtools sort -o align.test.AST-1065.bam align.test.AST-1065.sam
echo "Sam to bam conversion complete for test sample" $(date)
Submitted batch job 292494. Took ~45 mins. Check % reads aligned
interactive
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
samtools flagstat align.test.AST-1065.bam
57029408 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
21403407 + 0 mapped (37.53% : N/A)
57029408 + 0 paired in sequencing
28514704 + 0 read1
28514704 + 0 read2
15442422 + 0 properly paired (27.08% : N/A)
17601996 + 0 with itself and mate mapped
3801411 + 0 singletons (6.67% : N/A)
135006 + 0 with mate mapped to a different chr
64974 + 0 with mate mapped to a different chr (mapQ>=5)
37.53% mapped using bowtie2, whereas alignment with hisat2 for this sample was 47.17%. Even though the alignment with hisat2 was higher, I am more inclined to go with the bowtie2 alignment, as I want to be consistent in mapping algorithms for the mRNA and smRNA analysis. Other papers that have examined miRNAs in corals have also used bowtie for the mRNA analysis (i.e., Gajigan & Conaco 2017; Baumgarten et al. 2013, etc).
Now let’s do the sam to bam for all the samples. First, I’m going to move the aligned reads to the bowtie output alignment folder: mv /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/align* /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
In the scripts folder: nano sam_to_bam.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
echo "Converting sam to bam" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
array=($(ls *R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
mv ${i} ${i}.sam
samtools sort -o ${i}.bam ${i}.sam
#rm ${i}.sam
done
echo "Files converted to bam" $(date)
Submitted batch job 292496. Took about 9.5 hours. Calculate % mapped from the bam files. In scripts, nano map_percent.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
echo "Calculating the percentage of reads aligned" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
for i in *.bam; do
echo "${i}" >> mapped_reads_counts_Apoc
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_Apoc
done
echo "Calculation complete" $(date)
Submitted batch job 292513. Took about 20 mins, mapping %s definitely lower with bowtie2 than with hisat2.
align.test.AST-1065.bam
21403407 + 0 mapped (37.53% : N/A)
align.trimmed.AST-1065_R1_001.fastq.gz.bam
21403407 + 0 mapped (37.53% : N/A)
align.trimmed.AST-1105_R1_001.fastq.gz.bam
184 + 0 mapped (32.51% : N/A)
align.trimmed.AST-1147_R1_001.fastq.gz.bam
29489221 + 0 mapped (53.81% : N/A)
align.trimmed.AST-1412_R1_001.fastq.gz.bam
16202316 + 0 mapped (34.73% : N/A)
align.trimmed.AST-1560_R1_001.fastq.gz.bam
3312164 + 0 mapped (6.56% : N/A)
align.trimmed.AST-1567_R1_001.fastq.gz.bam
11081976 + 0 mapped (23.37% : N/A)
align.trimmed.AST-1617_R1_001.fastq.gz.bam
20896972 + 0 mapped (49.18% : N/A)
align.trimmed.AST-1722_R1_001.fastq.gz.bam
24918472 + 0 mapped (50.33% : N/A)
align.trimmed.AST-2000_R1_001.fastq.gz.bam
14086921 + 0 mapped (34.39% : N/A)
align.trimmed.AST-2007_R1_001.fastq.gz.bam
22480611 + 0 mapped (42.97% : N/A)
align.trimmed.AST-2302_R1_001.fastq.gz.bam
22989178 + 0 mapped (52.12% : N/A)
align.trimmed.AST-2360_R1_001.fastq.gz.bam
14679046 + 0 mapped (34.33% : N/A)
align.trimmed.AST-2398_R1_001.fastq.gz.bam
22409210 + 0 mapped (51.40% : N/A)
align.trimmed.AST-2404_R1_001.fastq.gz.bam
21960526 + 0 mapped (40.83% : N/A)
align.trimmed.AST-2412_R1_001.fastq.gz.bam
19389724 + 0 mapped (38.21% : N/A)
align.trimmed.AST-2512_R1_001.fastq.gz.bam
4958601 + 0 mapped (10.24% : N/A)
align.trimmed.AST-2523_R1_001.fastq.gz.bam
17472526 + 0 mapped (36.78% : N/A)
align.trimmed.AST-2563_R1_001.fastq.gz.bam
19339538 + 0 mapped (38.31% : N/A)
align.trimmed.AST-2729_R1_001.fastq.gz.bam
4728959 + 0 mapped (15.20% : N/A)
align.trimmed.AST-2755_R1_001.fastq.gz.bam
20252460 + 0 mapped (40.96% : N/A)
As stated above, I am more inclined to go with bowtie2 but will discuss with Hollie.
Assemble using stringtie. In scripts, nano assemble.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=128GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load StringTie/2.2.1-GCC-11.2.0
echo "Assembling transcripts using stringtie" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
array1=($(ls *.bam)) #Make an array of sequences to assemble
for i in ${array1[@]}; do
stringtie -p 8 -e -B -G /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -A ${i}.gene_abund.tab -o ${i}.gtf ${i}
done
echo "Assembly for each sample complete " $(date)
Submitted batch job 292514. Took about 30 mins
Move the stringtie related files to the stringtie folder.
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
mv *gtf ../../../stringtie/bowtie/
mv *tab ../../../stringtie/bowtie/
Now I will merge the gtf files together with stringtie, assess the assembly with gffcompare, and make a gene count matrix with the prepDE.py
script, which is in the scripts folder. In the scripts folder, nano prepDE.sh
:
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=128GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load StringTie/2.2.1-GCC-11.2.0
module load Python/3.9.6-GCCcore-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0
####MAYBE LOAD GCC IF ERRORS
echo "Merging assembled files" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/stringtie/bowtie
#make gtf_list.txt file
ls *.gtf > gtf_list.txt
stringtie --merge -e -p 8 -G /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -o Apoc_merged.gtf gtf_list.txt #Merge GTFs
echo "Stringtie merge complete, starting gffcompare" $(date)
gffcompare -r /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -G -o merged Apoc_merged.gtf #Compute the accuracy
echo "GFFcompare complete, Starting gene count matrix assembly" $(date)
#make gtf list text file
for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt
python /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts/prepDE.py -g Apoc_gene_count_matrix.csv -i listGTF.txt #Compile the gene count matrix
echo "Gene count matrix compiled." $(date)
Submitted batch job 292515. Failed, some GCC error. Changed Python module. Submitted batch job 292516. Failed agian. Gave this error:
Error: no transcripts were found in input file Apoc_merged.gtf
48184 reference transcripts loaded.
0 query transfrags loaded.
File "/data/putnamlab/jillashey/Astrangia2021/mRNA/scripts/prepDE.py", line 34
print "Error: line should have a sample ID and a file path:\n%s" % (line.strip())
^
SyntaxError: invalid syntax
Need to troubleshoot this.
20240115
Remove sam files to save space
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
rm *sam
The prepDE.sh
script didn’t work, so I may just try running the components separately. Make a list of all gtf files and merge together with stringtie.
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/stringtie/bowtie
module load StringTie/2.2.1-GCC-11.2.0
ls *.gtf > gtf_list.txt
stringtie --merge -e -p 8 -G /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -o Apoc_merged.gtf gtf_list.txt #Merge GTFs
Saying Illegal instruction (core dumped)
. Why? Do I need a different version of stringtie? Let’s try this version: StringTie/2.1.4-GCC-9.3.0
. Success! Let’s look at what the merged file looks like:
head Apoc_merged.gtf
# stringtie --merge -e -p 8 -G /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -o Apoc_merged.gtf gtf_list.txt
# StringTie version 2.1.4
chromosome_1 StringTie transcript 20664 21393 1000 - . gene_id "MSTRG.1"; transcript_id "evm.model.chromosome_1.1"; ref_gene_id "evm.TU.chromosome_1.1"; cov "0.00413735";
chromosome_1 StringTie exon 20664 21106 1000 - . gene_id "MSTRG.1"; transcript_id "evm.model.chromosome_1.1"; exon_number "1"; ref_gene_id "evm.TU.chromosome_1.1";
chromosome_1 StringTie exon 21189 21393 1000 - . gene_id "MSTRG.1"; transcript_id "evm.model.chromosome_1.1"; exon_number "2"; ref_gene_id "evm.TU.chromosome_1.1";
chromosome_1 StringTie transcript 34636 40489 1000 + . gene_id "MSTRG.2"; transcript_id "evm.model.chromosome_1.2"; ref_gene_id "evm.TU.chromosome_1.2"; cov "0.280786";
chromosome_1 StringTie exon 34636 34734 1000 + . gene_id "MSTRG.2"; transcript_id "evm.model.chromosome_1.2"; exon_number "1"; ref_gene_id "evm.TU.chromosome_1.2";
chromosome_1 StringTie exon 34890 35046 1000 + . gene_id "MSTRG.2"; transcript_id "evm.model.chromosome_1.2"; exon_number "2"; ref_gene_id "evm.TU.chromosome_1.2";
chromosome_1 StringTie exon 37554 37710 1000 + . gene_id "MSTRG.2"; transcript_id "evm.model.chromosome_1.2"; exon_number "3"; ref_gene_id "evm.TU.chromosome_1.2";
chromosome_1 StringTie exon 39812 39877 1000 + . gene_id "MSTRG.2"; transcript_id "evm.model.chromosome_1.2"; exon_number "4"; ref_gene_id "evm.TU.chromosome_1.2";
The gene ids were named MSTRG from stringtie but the transcript ids still have the chromosome info. Assess the accuracy of the merged assembly with gffcompare.
module purge
module load GffCompare/0.12.6-GCC-11.2.0
gffcompare -r /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -G -o merged Apoc_merged.gtf #Compute the accuracy
echo "GFFcompare complete, Starting gene count matrix assembly" $(date)
Similar to above, saying Illegal instruction (core dumped)
. Trying a different version of gffcompare: GffCompare/0.12.1-GCCcore-8.3.0
. Worked!
# gffcompare v0.12.1 | Command line was:
#gffcompare -r /data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3 -G -o merged Apoc_merged.gtf
#
#= Summary for dataset: Apoc_merged.gtf
# Query mRNAs : 48184 in 47159 loci (28499 multi-exon transcripts)
# (854 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 48184 in 47159 loci (28499 multi-exon)
# Super-loci w/ reference transcripts: 47159
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 100.0 | 100.0 |
Intron level: 100.0 | 100.0 |
Intron chain level: 100.0 | 100.0 |
Transcript level: 100.0 | 100.0 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 28499
Matching transcripts: 48184
Matching loci: 47159
Missed exons: 0/228071 ( 0.0%)
Novel exons: 0/228064 ( 0.0%)
Missed introns: 0/180537 ( 0.0%)
Novel introns: 0/180537 ( 0.0%)
Missed loci: 0/47159 ( 0.0%)
Novel loci: 0/47159 ( 0.0%)
Total union super-loci across all input datasets: 47159
48184 out of 48184 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
The stats look good. The merged files are providing info on where the transcripts/exons are in the genome I believe. Make gft list text file for gene count matrix creation
for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt
Load Python and compile the gene count matrix.
module purge
module load Python/3.8.2-GCCcore-9.3.0
python prepDE.py -g Apoc_gene_count_matrix.csv -i listGTF.txt #Compile the gene count matrix
Got this error:
File "/data/putnamlab/jillashey/Astrangia2021/mRNA/scripts/prepDE.py", line 34
print "Error: line should have a sample ID and a file path:\n%s" % (line.strip())
^
SyntaxError: invalid syntax
Confused because the file does have the sample id and file path…Maybe move the prepDE.py
script into this folder?
mv /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts/prepDE.py .
Nope still getting the same error. Might have to use python 2 version, as that is what the script is written in.
module purge
module load Python/2.7.18-GCCcore-9.3.0
python prepDE.py -g Apoc_gene_count_matrix.csv -i listGTF.txt
Success!!!! Copy the gene count matrix onto my computer and rename so that the file has bowtie included in the name (to distinguish it from the hisat2 gene count matrix).
For lncRNA
- CPC: https://github.com/biocoder/CPC2/blob/master/bin/CPC2.py
- lncRNA discovery overview: https://github.com/urol-e5/deep-dive/blob/main/D-Apul/code/05.32-lncRNA-discovery-overview.Rmd
NEED TO ASK HP: WAS THE SEQUENCING DONE RF, FF OR FR?