Astrangia 2021 small RNA analysis
These data came from my Astrangia 2021 experiment, during which adult Astrangia colonies were exposed to ambient and high temperatures for ~9 months.
Files were downloaded to this location: /data/putnamlab/KITT/hputnam/20230605_Astrangia_smallRNA
Set-up
Make new folders for this project in my directory
cd /data/putnamlab/jillashey
mkdir Astrangia2021
cd Astrangia2021
mkdir smRNA
cd smRNA
mkdir data scripts fastqc
cd data
mkdir raw
cd raw
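The same layout can also be created in one pass with `mkdir -p`. This is a minimal sketch: `PROJECT_ROOT` is a placeholder variable (it defaults to a temporary directory so the example is safe to run anywhere), and the `data/trim`, `fastqc/raw` and `fastqc/trim` folders are included because later scripts in these notes write to them.

```shell
# PROJECT_ROOT is a placeholder; set it to /data/putnamlab/jillashey to build the real tree.
PROJECT_ROOT="${PROJECT_ROOT:-$(mktemp -d)}"

# -p creates missing parent folders and does not complain if a folder already exists.
for d in data/raw data/trim scripts fastqc/raw fastqc/trim; do
  mkdir -p "$PROJECT_ROOT/Astrangia2021/smRNA/$d"
done

# List what was created.
find "$PROJECT_ROOT/Astrangia2021" -type d | sort
```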
Copy the new files into raw data folder
cp /data/putnamlab/KITT/hputnam/20230605_Astrangia_smallRNA/* .
Check to make sure all files were transferred successfully
md5sum *.fastq.gz > checkmd5.md5
md5sum -c checkmd5.md5
AST-1065_R1_001.fastq.gz: OK
AST-1065_R2_001.fastq.gz: OK
AST-1105_R1_001.fastq.gz: OK
AST-1105_R2_001.fastq.gz: OK
AST-1147_R1_001.fastq.gz: OK
AST-1147_R2_001.fastq.gz: OK
AST-1412_R1_001.fastq.gz: OK
AST-1412_R2_001.fastq.gz: OK
AST-1560_R1_001.fastq.gz: OK
AST-1560_R2_001.fastq.gz: OK
AST-1567_R1_001.fastq.gz: OK
AST-1567_R2_001.fastq.gz: OK
AST-1617_R1_001.fastq.gz: OK
AST-1617_R2_001.fastq.gz: OK
AST-1722_R1_001.fastq.gz: OK
AST-1722_R2_001.fastq.gz: OK
AST-2000_R1_001.fastq.gz: OK
AST-2000_R2_001.fastq.gz: OK
AST-2007_R1_001.fastq.gz: OK
AST-2007_R2_001.fastq.gz: OK
AST-2302_R1_001.fastq.gz: OK
AST-2302_R2_001.fastq.gz: OK
AST-2360_R1_001.fastq.gz: OK
AST-2360_R2_001.fastq.gz: OK
AST-2398_R1_001.fastq.gz: OK
AST-2398_R2_001.fastq.gz: OK
AST-2404_R1_001.fastq.gz: OK
AST-2404_R2_001.fastq.gz: OK
AST-2412_R1_001.fastq.gz: OK
AST-2412_R2_001.fastq.gz: OK
AST-2512_R1_001.fastq.gz: OK
AST-2512_R2_001.fastq.gz: OK
AST-2523_R1_001.fastq.gz: OK
AST-2523_R2_001.fastq.gz: OK
AST-2563_R1_001.fastq.gz: OK
AST-2563_R2_001.fastq.gz: OK
AST-2729_R1_001.fastq.gz: OK
AST-2729_R2_001.fastq.gz: OK
AST-2755_R1_001.fastq.gz: OK
AST-2755_R2_001.fastq.gz: OK
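One caveat: a checksum file generated from the copied files and then checked against those same files will always report OK, so it does not by itself prove the copy matches the source. A minimal sketch of a source-versus-copy comparison follows; `verify_copy` is a hypothetical helper, the demo uses throwaway files in a temporary directory, and GNU `md5sum` is assumed.

```shell
# verify_copy SRC DEST: checksum the files in SRC, then check the copies in DEST against them.
verify_copy() {
  local src="$1" dest="$2"
  # Run md5sum from inside SRC so the recorded file names are relative.
  (cd "$src" && md5sum -- *.fastq.gz) > "$dest/source.md5"
  # --quiet prints only files that fail; the exit status is non-zero on any mismatch.
  (cd "$dest" && md5sum -c --quiet source.md5)
}

# Toy demonstration with a throwaway file (not real data).
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/dest"
printf 'reads\n' | gzip > "$demo/src/sample_R1_001.fastq.gz"
cp "$demo/src/"*.fastq.gz "$demo/dest/"

if verify_copy "$demo/src" "$demo/dest"; then
  copy_status="OK"
else
  copy_status="FAILED"
fi
echo "copy check: $copy_status"
```

For the real data, the two arguments would be the KITT directory and the `raw` folder.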
Count number of reads per file
zgrep -c "@GWNJ" *fastq.gz
AST-1065_R1_001.fastq.gz:18782226
AST-1065_R2_001.fastq.gz:18782226
AST-1105_R1_001.fastq.gz:18535712
AST-1105_R2_001.fastq.gz:18535712
AST-1147_R1_001.fastq.gz:43815757
AST-1147_R2_001.fastq.gz:43815757
AST-1412_R1_001.fastq.gz:17729353
AST-1412_R2_001.fastq.gz:17729353
AST-1560_R1_001.fastq.gz:19958419
AST-1560_R2_001.fastq.gz:19958419
AST-1567_R1_001.fastq.gz:18414936
AST-1567_R2_001.fastq.gz:18414936
AST-1617_R1_001.fastq.gz:17164109
AST-1617_R2_001.fastq.gz:17164109
AST-1722_R1_001.fastq.gz:17993435
AST-1722_R2_001.fastq.gz:17993435
AST-2000_R1_001.fastq.gz:18885883
AST-2000_R2_001.fastq.gz:18885883
AST-2007_R1_001.fastq.gz:17958643
AST-2007_R2_001.fastq.gz:17958643
AST-2302_R1_001.fastq.gz:17901570
AST-2302_R2_001.fastq.gz:17901570
AST-2360_R1_001.fastq.gz:17996561
AST-2360_R2_001.fastq.gz:17996561
AST-2398_R1_001.fastq.gz:18231685
AST-2398_R2_001.fastq.gz:18231685
AST-2404_R1_001.fastq.gz:17661430
AST-2404_R2_001.fastq.gz:17661430
AST-2412_R1_001.fastq.gz:18215455
AST-2412_R2_001.fastq.gz:18215455
AST-2512_R1_001.fastq.gz:17643371
AST-2512_R2_001.fastq.gz:17643371
AST-2523_R1_001.fastq.gz:17901421
AST-2523_R2_001.fastq.gz:17901421
AST-2563_R1_001.fastq.gz:18067665
AST-2563_R2_001.fastq.gz:18067665
AST-2729_R1_001.fastq.gz:18840062
AST-2729_R2_001.fastq.gz:18840062
AST-2755_R1_001.fastq.gz:18122482
AST-2755_R2_001.fastq.gz:18122482
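Counting header lines with `zgrep` depends on knowing the instrument prefix (`@GWNJ` here). An alternative that does not depend on the header text is to count lines and divide by four, since each FASTQ record is four lines. A small sketch, where `count_reads` is a hypothetical helper and the demo file is a toy example:

```shell
# count_reads FILE.fastq.gz: number of reads = number of lines / 4
count_reads() {
  local n
  n=$(gzip -dc -- "$1" | wc -l)
  echo $(( n / 4 ))
}

# Toy two-read FASTQ, just to show the helper in use.
demo_fq="$(mktemp -d)/demo_R1_001.fastq.gz"
printf '@GWNJ:1\nACGT\n+\nIIII\n@GWNJ:2\nTTGA\n+\nIIII\n' | gzip > "$demo_fq"
count_reads "$demo_fq"
```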
Raw QC
Run fastqc to quality check raw reads
In scripts folder: nano fastqc_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH --error="fastqc_raw_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastqc_raw_output" #once your job is completed, any final job report comments will be put in this file
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
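# Make sure the output folder exists before FastQC writes to it
mkdir -p /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/raw
for file in /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw/*fastq.gz
do
fastqc "$file" --outdir /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/raw
done
# Point MultiQC at the folder that FastQC actually wrote to
multiqc --interactive /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/raw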
Submitted batch job 261246
Trim data
I have not done trimming specific to small RNAs before, but this paper gave a nice workflow for miRNA analysis. They suggested using cutadapt. I’m going to follow their code, which applies a minimum length of 18 bp and a maximum length of 30 bp. It doesn’t trim any adapters, but we will see how the reads look after this length filtering.
In scripts folder: nano cutadapt.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH --error="cutadapt_error" #if your job fails, the error report will be put in this file
#SBATCH --output="cutadapt_output" #once your job is completed, any final job report comments will be put in this file
module load cutadapt/3.5-GCCcore-11.2.0
# Make array of sequences to cut
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
array1=($(ls *R1_001.fastq.gz))
echo "Trimming reads so that min length is 18 bp and max length is 30 bp" $(date)
# cutadapt loop
for i in ${array1[@]}; do
cutadapt --minimum-length=18 --maximum-length=30 -o /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/trimmed.${i} -p /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/trimmed.$(echo ${i}|sed s/_R1/_R2/) ${i} $(echo ${i}|sed s/_R1/_R2/)
done
echo "Trimming done!" $(date)
Submitted batch job 269180. Okay, cutadapt is not working. No adapters are being removed, so every read is still longer than the 30 bp maximum and gets discarded, and the resulting trimmed files are empty. Cancelling this job.
20230630 I should also ask Sam White/Javi about their trimming of miRNA data…
Questions for Javi/Sam
- details on seq for E5
- what trimming software did they use?
- did they do something like setting the min to 18 and max to 30
From Hao et al. 2021: “Raw reads obtained from the sequencing machine were filtered to get clean tags according to the following rules: removing low quality reads containing more than one low quality (Q-value ≤ 20) base or containing unknown nucleotides (N) to get the high-quality reads. Then, high-quality reads were filtered by removing reads without 3′ adapters, containing 5′ adapters, containing 3′ and 5′ adapters but no small RNA fragment between them, containing polyA in small RNA fragment and shorter than 18 nt to get clean tags. The clean tags were aligned with small RNAs in the GenBank database”. This sounds like something I should try, but I’m not sure how to identify the 3’ and 5’ adapters in my sequences.
trimming - try cutadapt, trimmomatic or trimgalore
Looking at the adapter content plot in MultiQC, it looks like the reads contain the Illumina universal adapter and the Illumina small RNA 5’ adapter. The R1 reads have the universal adapter and the R2 reads have the small RNA 5’ adapter. Not sure why that is. I looked up the adapter sequences and found this post, which lists the Illumina Universal Adapter as AGATCGGAAGAG and the Illumina Small RNA 5’ Adapter as GATCGTCGGACT. I am unsure when this page was written, but I’m going to test them out.
nano cutadapt.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH --error="cutadapt_error" #if your job fails, the error report will be put in this file
#SBATCH --output="cutadapt_output" #once your job is completed, any final job report comments will be put in this file
module load cutadapt/3.5-GCCcore-11.2.0
# Make array of sequences to cut
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
array1=($(ls *R1_001.fastq.gz))
echo "Trimming reads so that min length is 18 bp and max length is 30 bp" $(date)
echo "Starting to trim using the Illumina universal adapter" $(date)
for i in ${array1[@]}; do
cutadapt -a AGATCGGAAGAG --minimum-length=18 --maximum-length=30 -o /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/trimmed.${i} -p /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/trimmed.$(echo ${i}|sed s/_R1/_R2/) ${i} $(echo ${i}|sed s/_R1/_R2/)
done
echo "Starting to trim using the Illumina small rna 5' adapter" $(date)
for i in ${array1[@]}; do
cutadapt -a GATCGTCGGACT --minimum-length=18 --maximum-length=30 -o /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/trimmed.again.${i} -p /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/trimmed.again.$(echo ${i}|sed s/_R1/_R2/) ${i} $(echo ${i}|sed s/_R1/_R2/)
done
echo "Trimming done!" $(date)
Submitted batch job 275252. The trimming parameters seem to be too stringent. It is saying that the reads are either too long or too short. Not sure what this means. I’m going to try Sam’s code where he trimmed some smRNAs using a software called flexbar. Need to ask Kevin Bryan to install flexbar.
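A possible explanation for the "too long" messages: in cutadapt's paired-end mode, `-a` applies only to R1, and the R2 adapter has to be given with `-A`. With the default pair filter, a pair is discarded if either read fails the length limits, so an untrimmed R2 would push every pair over the 30 bp maximum. The sketch below builds (and only prints) a single command that gives both adapters. `build_cutadapt_cmd` is a hypothetical helper, and assigning the universal adapter to R1 and the small RNA 5' adapter to R2 is an assumption based on the MultiQC plot described above.

```shell
# Adapter prefixes taken from the notes above; the R1/R2 assignment is an assumption.
R1_ADAPTER="AGATCGGAAGAG"
R2_ADAPTER="GATCGTCGGACT"

# build_cutadapt_cmd R1_FILE OUTDIR: print a paired-end cutadapt command without running it.
build_cutadapt_cmd() {
  local r1="$1" outdir="$2" r2
  r2=$(echo "$r1" | sed 's/_R1/_R2/')
  # -a trims the 3' adapter from R1, -A trims the 3' adapter from R2.
  echo cutadapt -a "$R1_ADAPTER" -A "$R2_ADAPTER" \
    --minimum-length 18 --maximum-length 30 \
    -o "$outdir/trimmed.$r1" -p "$outdir/trimmed.$r2" \
    "$r1" "$r2"
}

cmd=$(build_cutadapt_cmd AST-1065_R1_001.fastq.gz ../trim)
echo "$cmd"
```

Printing the command first makes it easy to check the file pairing before committing cluster time to it.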
Flexbar installed!
Let’s try Flexbar trimming code that Sam White wrote for the e5 small RNA analysis
First, I’m only going to try to trim one sample (2 reads) to see if flexbar works.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
First, make the NEB adapters fasta file.
nano NEB-adapters.fasta
>first
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>second
GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT
In scripts folder: nano test_flexbar.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --error="flexbar_raw_error" #if your job fails, the error report will be put in this file
#SBATCH --output="flexbar_raw_output" #once your job is completed, any final job report comments will be put in this file
module load Flexbar/3.5.0-foss-2018b
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
R1_fastq=/data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw/AST-1065_R1_001.fastq.gz
R2_fastq=/data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw/AST-1065_R2_001.fastq.gz
flexbar \
-r ${R1_fastq} \
-p ${R2_fastq} \
-a NEB-adapters.fasta \
-ap ON \
-qf i1.8 \
-qt 25 \
--post-trim-length 35 \
--target TEST_AST-1065 \
--zip-output GZ
Submitted batch job 284050. Finished in about 25 mins.
Now I’m going to run fastqc on the test sample and see how it looks:
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
fastqc TEST_AST-1065_1.fastq.gz TEST_AST-1065_2.fastq.gz
multiqc *fastqc*
The plots look good! I’m going to move forward w/ flexbar trimming.
In scripts folder: nano flexbar.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=200GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --error="flexbar_error" #if your job fails, the error report will be put in this file
#SBATCH --output="flexbar_output" #once your job is completed, any final job report comments will be put in this file
module load Flexbar/3.5.0-foss-2018b
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
echo "Trimming reads using flexbar" $(date)
array1=($(ls *R1_001.fastq.gz))
for i in ${array1[@]}; do
flexbar \
-r ${i} \
-p $(echo ${i}|sed s/_R1/_R2/) \
-a NEB-adapters.fasta \
-ap ON \
-qf i1.8 \
-qt 25 \
--post-trim-length 35 \
--target trim.$(echo ${i}|sed s/_R1/_R2/) \
--zip-output GZ
done
Submitted batch job 284064
Move trimmed reads to trim folder
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
mv trim.AST-* ../trim/
Trim QC
Run fastqc to quality check trim reads
In scripts folder: nano fastqc_trim.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH --error="fastqc_trim_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastqc_trim_output" #once your job is completed, any final job report comments will be put in this file
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
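# Make sure the output folder exists before FastQC writes to it
mkdir -p /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/trim
for file in /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/*fastq.gz
do
fastqc "$file" --outdir /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/trim
done
# Point MultiQC at the folder that FastQC actually wrote to
multiqc --interactive /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/trim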
Submitted batch job 284426
Count number of reads per file
zgrep -c "@GWNJ" *fastq.gz
trim.AST-1065_R2_001.fastq.gz_1.fastq.gz:17829111
trim.AST-1065_R2_001.fastq.gz_2.fastq.gz:17829111
trim.AST-1105_R2_001.fastq.gz_1.fastq.gz:17238126
trim.AST-1105_R2_001.fastq.gz_2.fastq.gz:17238126
trim.AST-1147_R2_001.fastq.gz_1.fastq.gz:40415224
trim.AST-1147_R2_001.fastq.gz_2.fastq.gz:40415224
trim.AST-1412_R2_001.fastq.gz_1.fastq.gz:16279555
trim.AST-1412_R2_001.fastq.gz_2.fastq.gz:16279555
trim.AST-1560_R2_001.fastq.gz_1.fastq.gz:17827024
trim.AST-1560_R2_001.fastq.gz_2.fastq.gz:17827024
trim.AST-1567_R2_001.fastq.gz_1.fastq.gz:16611397
trim.AST-1567_R2_001.fastq.gz_2.fastq.gz:16611397
trim.AST-1617_R2_001.fastq.gz_1.fastq.gz:16077717
trim.AST-1617_R2_001.fastq.gz_2.fastq.gz:16077717
trim.AST-1722_R2_001.fastq.gz_1.fastq.gz:16430221
trim.AST-1722_R2_001.fastq.gz_2.fastq.gz:16430221
trim.AST-2000_R2_001.fastq.gz_1.fastq.gz:17428854
trim.AST-2000_R2_001.fastq.gz_2.fastq.gz:17428854
trim.AST-2007_R2_001.fastq.gz_1.fastq.gz:16559551
trim.AST-2007_R2_001.fastq.gz_2.fastq.gz:16559551
trim.AST-2302_R2_001.fastq.gz_1.fastq.gz:16665370
trim.AST-2302_R2_001.fastq.gz_2.fastq.gz:16665370
trim.AST-2360_R2_001.fastq.gz_1.fastq.gz:16648356
trim.AST-2360_R2_001.fastq.gz_2.fastq.gz:16648356
trim.AST-2398_R2_001.fastq.gz_1.fastq.gz:16788208
trim.AST-2398_R2_001.fastq.gz_2.fastq.gz:16788208
trim.AST-2404_R2_001.fastq.gz_1.fastq.gz:16712903
trim.AST-2404_R2_001.fastq.gz_2.fastq.gz:16712903
trim.AST-2412_R2_001.fastq.gz_1.fastq.gz:17488508
trim.AST-2412_R2_001.fastq.gz_2.fastq.gz:17488508
trim.AST-2512_R2_001.fastq.gz_1.fastq.gz:16265716
trim.AST-2512_R2_001.fastq.gz_2.fastq.gz:16265716
trim.AST-2523_R2_001.fastq.gz_1.fastq.gz:16995265
trim.AST-2523_R2_001.fastq.gz_2.fastq.gz:16995265
trim.AST-2563_R2_001.fastq.gz_1.fastq.gz:17023002
trim.AST-2563_R2_001.fastq.gz_2.fastq.gz:17023002
trim.AST-2729_R2_001.fastq.gz_1.fastq.gz:17121869
trim.AST-2729_R2_001.fastq.gz_2.fastq.gz:17121869
trim.AST-2755_R2_001.fastq.gz_1.fastq.gz:16269823
trim.AST-2755_R2_001.fastq.gz_2.fastq.gz:16269823
20231130
Locations of cnidarian miRNA data
- Stylophora pistillata
- Hydra
- Nematostella
- Acropora muricata, Montipora capricornis, Montipora foliosa, Pocillopora verrucosa
20240103 Should I retrim with fastp to keep it consistent with mRNA? Going to try it out and compare results from flexbar. In trimmed smRNA folder, make new directory to put fastp trimmed reads
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim
mkdir fastp
In scripts folder: nano fastp_QC.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.out
# Load modules needed
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
# Make an array of sequences to trim in raw data directory
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
array1=($(ls *R1_001.fastq.gz))
echo "Read trimming of adapters started." $(date)
# fastp and fastqc loop
for i in ${array1[@]}; do
# --length_required 100 is not set (a commented-out option cannot sit inside a continued command)
fastp --in1 ${i} \
--in2 $(echo ${i}|sed s/_R1/_R2/) \
--out1 /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/trimmed.${i} \
--out2 /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/trimmed.$(echo ${i}|sed s/_R1/_R2/) \
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--unqualified_percent_limit 10 \
--cut_right --cut_right_window_size 5 --cut_right_mean_quality 20
fastqc /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/trimmed.${i}
fastqc /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/trimmed.$(echo ${i}|sed s/_R1/_R2/)
done
echo "Read trimming of adapters complete." $(date)
# Quality Assessment of Trimmed Reads
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp #go to output directory
# Compile MultiQC report from FastQC files
multiqc --interactive ./
echo "Cleaned MultiQC report generated." $(date)
Submitted batch job 291969. Took about 5 hours. The QC plots don’t look amazing and the read length is still at 100 bp. I’m going to rerun with the argument --length_limit 30 added to fastp, which means that reads longer than 30 bp will be discarded. Submitted batch job 291997. Downloaded the QC report and it looks okay: a low number of reads but high duplication. Let’s see how many reads are in each file:
zgrep -c "@GWNJ" *fastq.gz
trimmed.AST-1065_R1_001.fastq.gz:7834561
trimmed.AST-1065_R2_001.fastq.gz:7834561
trimmed.AST-1105_R1_001.fastq.gz:8754651
trimmed.AST-1105_R2_001.fastq.gz:8754651
trimmed.AST-1147_R1_001.fastq.gz:22326820
trimmed.AST-1147_R2_001.fastq.gz:22326820
trimmed.AST-1412_R1_001.fastq.gz:8158050
trimmed.AST-1412_R2_001.fastq.gz:8158050
trimmed.AST-1560_R1_001.fastq.gz:8733402
trimmed.AST-1560_R2_001.fastq.gz:8733402
trimmed.AST-1567_R1_001.fastq.gz:9830273
trimmed.AST-1567_R2_001.fastq.gz:9830273
trimmed.AST-1617_R1_001.fastq.gz:8146294
trimmed.AST-1617_R2_001.fastq.gz:8146294
trimmed.AST-1722_R1_001.fastq.gz:9014021
trimmed.AST-1722_R2_001.fastq.gz:9014021
trimmed.AST-2000_R1_001.fastq.gz:10252309
trimmed.AST-2000_R2_001.fastq.gz:10252309
trimmed.AST-2007_R1_001.fastq.gz:9622779
trimmed.AST-2007_R2_001.fastq.gz:9622779
trimmed.AST-2302_R1_001.fastq.gz:8921101
trimmed.AST-2302_R2_001.fastq.gz:8921101
trimmed.AST-2360_R1_001.fastq.gz:8635502
trimmed.AST-2360_R2_001.fastq.gz:8635502
trimmed.AST-2398_R1_001.fastq.gz:9565008
trimmed.AST-2398_R2_001.fastq.gz:9565008
trimmed.AST-2404_R1_001.fastq.gz:8798584
trimmed.AST-2404_R2_001.fastq.gz:8798584
trimmed.AST-2412_R1_001.fastq.gz:7032530
trimmed.AST-2412_R2_001.fastq.gz:7032530
trimmed.AST-2512_R1_001.fastq.gz:7463762
trimmed.AST-2512_R2_001.fastq.gz:7463762
trimmed.AST-2523_R1_001.fastq.gz:8272150
trimmed.AST-2523_R2_001.fastq.gz:8272150
trimmed.AST-2563_R1_001.fastq.gz:8537237
trimmed.AST-2563_R2_001.fastq.gz:8537237
trimmed.AST-2729_R1_001.fastq.gz:7929477
trimmed.AST-2729_R2_001.fastq.gz:7929477
trimmed.AST-2755_R1_001.fastq.gz:9775688
trimmed.AST-2755_R2_001.fastq.gz:9775688
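To put a number on how the two trimmers compare, the read counts can be turned into a percentage retained. A quick sketch, where `percent_retained` is a hypothetical helper and the counts are the AST-1065 values copied from the zgrep outputs in these notes:

```shell
# percent_retained RAW_COUNT TRIMMED_COUNT: percent of reads that survived trimming
percent_retained() {
  awk -v raw="$1" -v kept="$2" 'BEGIN { printf "%.1f\n", 100 * kept / raw }'
}

# AST-1065: raw reads, flexbar-trimmed reads, fastp-trimmed reads
echo "flexbar: $(percent_retained 18782226 17829111)%"
echo "fastp:   $(percent_retained 18782226 7834561)%"
```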
Flexbar keeps more reads, but it seems like it is combining the two reads into one for each sample (which I don’t want to do yet). Let’s edit the flexbar code and see if I can fix that. For now, I just commented out the line --target trim.$(echo ${i}|sed s/_R1/_R2/) \, which names the files. Also changing max read length to 30. Submitted batch job 292029
20240104
Flexbar trimming finished last night, but each sample overwrote the previous one, so only the last sample has output files. Looking at the flexbar documentation, it seems like --target is the prefix for output file names or paths, whereas --output-reads and --output-reads2 set the output files for reads 1 and 2 directly instead of using the target prefix. So I am just dumb and should’ve specified the output reads arguments instead of target.
Let’s look at the flexbar script again:
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=200GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Flexbar/3.5.0-foss-2018b
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
echo "Trimming reads using flexbar" $(date)
array1=($(ls *R1_001.fastq.gz))
for i in ${array1[@]}; do
flexbar \
-r ${i} \
-p $(echo ${i}|sed s/_R1/_R2/) \
-a NEB-adapters.fasta \
-ap ON \
-qf i1.8 \
-qt 30 \
-t /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
--post-trim-length 30 \
--output-reads trim.${i} \
--output-reads2 trim.$(echo ${i}|sed s/_R1/_R2/) \
--zip-output GZ
done
Made those changes for output reads to the script. Also changed the script so that -qt was 30 (i.e., a Phred score of 30) and the post-trim length was 30 bp. Submitted batch job 292075. It kept failing, and for some reason all my raw files ended up empty. I’m now re-copying them from the KITT directory (thank god for backups) and then will rerun flexbar. Added the trim. prefix to the beginning of the output file names so that the original files don’t get overwritten. Submitted batch job 292079.
Checked back and it is still overwriting the files with the flexbar.log and flexbar fastq files… Also, the slurm error report tells me that the post trim length argument was not found. Looking at the code again, I didn’t add a \ after the -t line. Going to add it and rerun. Also going to try taking out the -- and just using -. Submitted batch job 292092
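One way to avoid the missing-backslash problem is to collect the options in a bash array, where each option sits on its own line with no continuation characters. The sketch below also builds a per-sample `--target` prefix, so each sample would get its own output names instead of overwriting the previous one. It only prints the command. The flags are the ones already used in these notes, and the `sample` and `flexbar_args` names are hypothetical.

```shell
r1="AST-1065_R1_001.fastq.gz"
r2=$(echo "$r1" | sed 's/_R1/_R2/')
# Strip the read suffix to get a per-sample prefix, e.g. AST-1065
sample="${r1%_R1_001.fastq.gz}"

# One option per line; a forgotten backslash can no longer cut the command short.
flexbar_args=(
  -r "$r1"
  -p "$r2"
  -a NEB-adapters.fasta
  -ap ON
  -qf i1.8
  -qt 30
  --post-trim-length 30
  --target "trim.${sample}"
  --zip-output GZ
)

# Print the command for inspection; drop the leading echo to actually run it.
echo flexbar "${flexbar_args[@]}"
```

Going by the test run earlier (TEST_AST-1065_1.fastq.gz and TEST_AST-1065_2.fastq.gz), a target of trim.AST-1065 should give trim.AST-1065_1.fastq.gz and trim.AST-1065_2.fastq.gz.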
Job finished but the files are empty. Why does Flexbar hate me? The error file said it couldn’t open the files that flexbar created but why did those need to be opened anyway?
I’m going to go back to the original script that I ran because it seems to have worked, even though it named both files R2.
In the scripts folder: nano flexbar_og.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=200GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Flexbar/3.5.0-foss-2018b
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
echo "Trimming reads using flexbar" $(date)
array1=($(ls *R1_001.fastq.gz))
for i in ${array1[@]}; do
flexbar \
-r ${i} \
-p $(echo ${i}|sed s/_R1/_R2/) \
-a NEB-adapters.fasta \
-ap ON \
-qf i1.8 \
-qt 30 \
--post-trim-length 30 \
--target ${i} \
--zip-output GZ
done
Submitted batch job 292095
While that is running, I’m going to figure out how to set up a conda environment using the conda unity documentation. For whatever reason, Kevin Bryan recommended I use a conda environment.
I’m going to make it in the putnam lab folder.
First I need to load miniconda module: module load Miniconda3/4.9.2
Now I need to create a conda environment.
conda create --prefix /data/putnamlab/miranda
If I need to update:
==> WARNING: A newer version of conda exists. <==
current version: 4.9.2
latest version: 23.11.0
Please update conda by running
$ conda update -n base -c defaults conda
Now that the conda environment is created, I need to activate it.
conda activate /data/putnamlab/miranda
It told me my shell had not been properly configured. To do that:
conda init
no change /opt/software/Miniconda3/4.9.2/condabin/conda
no change /opt/software/Miniconda3/4.9.2/bin/conda
no change /opt/software/Miniconda3/4.9.2/bin/conda-env
no change /opt/software/Miniconda3/4.9.2/bin/activate
no change /opt/software/Miniconda3/4.9.2/bin/deactivate
no change /opt/software/Miniconda3/4.9.2/etc/profile.d/conda.sh
no change /opt/software/Miniconda3/4.9.2/etc/fish/conf.d/conda.fish
no change /opt/software/Miniconda3/4.9.2/shell/condabin/Conda.psm1
no change /opt/software/Miniconda3/4.9.2/shell/condabin/conda-hook.ps1
no change /opt/software/Miniconda3/4.9.2/lib/python3.8/site-packages/xontrib/conda.xsh
no change /opt/software/Miniconda3/4.9.2/etc/profile.d/conda.csh
modified /home/jillashey/.bashrc
==> For changes to take effect, close and re-open your current shell. <==
I closed the terminal window and logged back in on a new window. Now let’s activate!
conda activate /data/putnamlab/miranda
Now the conda environment is activated! My shell prompt now looks like this: (/data/putnamlab/miranda) [jillashey@ssh3 putnamlab]$. To deactivate, do conda deactivate.
Create a conda environment for mirdeep2 using the same steps above.
I’m going to first work in the mirdeep2 environment. Activate the environment: conda activate /data/putnamlab/mirdeep2
Install mirdeep2 within the conda env: conda install bioconda::mirdeep2. This will take a few minutes to install and load the required packages.
I’m going to try to run mirDeep2 using code from the mirdeep2 github tutorial and Sam White’s code from the E5 deep dive project.
Before running any mirdeep2 modules, I need to upload some databases to the HPC and configure some of my files. Let’s first configure my files. There are 2 files (R1 and R2) per sample, so I need to concatenate and collapse the reads.
# The trimmed files are gzipped and fastx_collapser needs plain text, so decompress while concatenating
zcat trimmed.AST-1065_R1_001.fastq.gz trimmed.AST-1065_R2_001.fastq.gz > cat.trimmed.AST-1065.fastq
module load FASTX-Toolkit/0.0.14-GCC-9.3.0
fastx_collapser -v -i cat.trimmed.AST-1065.fastq -o collapse.cat.trimmed.AST-1065.fastq
head collapse.cat.trimmed.AST-1065.fastq
>1-116635
TGGTCTATGGTGTAACTGGCAACACGTCTGT
>2-115039
ACAGACGTGTTGCCAGTTACACCATAGACCA
>3-104350
TGGTCTATGGTGTAACTGGCAACACGTCTGTT
>4-103158
AACAGACGTGTTGCCAGTTACACCATAGACCA
>5-71882
TGAAAATCTTTTCTCTGAAGTGGAA
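As per the mirdeep2 documentation, the read ID must end with _xNumber and is not allowed to contain whitespace. So it has to have the format name_uniqueNumber_xnumber.
# Write to a temporary file first: redirecting straight back onto the input file would empty it before sed reads it
sed '/^>/ s/-/_x/g' collapse.cat.trimmed.AST-1065.fastq \
| sed '/^>/ s/>/>seq_/' \
> collapse.cat.trimmed.AST-1065.tmp \
&& mv collapse.cat.trimmed.AST-1065.tmp collapse.cat.trimmed.AST-1065.fastq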
>seq_1_x116635
TGGTCTATGGTGTAACTGGCAACACGTCTGT
>seq_2_x115039
ACAGACGTGTTGCCAGTTACACCATAGACCA
>seq_3_x104350
TGGTCTATGGTGTAACTGGCAACACGTCTGTT
>seq_4_x103158
AACAGACGTGTTGCCAGTTACACCATAGACCA
>seq_5_x71882
TGAAAATCTTTTCTCTGAAGTGGAA
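Before handing the collapsed file to mirdeep2, it can be worth confirming that every header follows the name_uniqueNumber_xnumber pattern. A minimal sketch, assuming the format described above; `count_bad_ids` is a hypothetical helper and the demo file is a toy example with one deliberately unconverted header:

```shell
# count_bad_ids FASTA: number of header lines that do NOT look like >name_number_xcount
count_bad_ids() {
  # grep -c prints 0 and exits non-zero when nothing matches, hence the || true
  grep '^>' "$1" | grep -Evc '^>[^[:space:]]+_[0-9]+_x[0-9]+$' || true
}

# Toy file: the first header is converted, the second is still in fastx_collapser style.
demo_fa="$(mktemp -d)/demo_collapsed.fa"
printf '>seq_1_x116635\nTGGTCTATGG\n>2-115039\nACAGACGTGT\n' > "$demo_fa"
count_bad_ids "$demo_fa"
```

A count of zero means all headers match the pattern.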
Next I need to reformat the genome fasta description lines. miRDeep2 can’t process genome FastAs with spaces in the description lines. I don’t think the Apoc genome has any spaces but I’m going to double check.
cd /data/putnamlab/jillashey/Astrangia_Genome/
grep "^>" apoculata.assembly.scaffolds_chromosome_level.fasta
>chromosome_1
>chromosome_2
>chromosome_3
>chromosome_4
>chromosome_5
>chromosome_6
>chromosome_7
>chromosome_8
>chromosome_9
>chromosome_10
>chromosome_11
>chromosome_12
>chromosome_13
>chromosome_14
Nice, there are no spaces so I don’t need to reformat. If I did, I would sub the spaces with underscores.
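For reference, the underscore substitution could look like the sketch below. `clean_fasta_headers` is a hypothetical helper that writes to a new file rather than editing the genome in place, and the demo header is made up.

```shell
# clean_fasta_headers IN OUT: replace spaces with underscores on header lines only
clean_fasta_headers() {
  # Sequence lines do not start with ">" and pass through unchanged.
  sed '/^>/ s/ /_/g' "$1" > "$2"
}

# Toy genome with a space in the description line.
demo_dir=$(mktemp -d)
printf '>chromosome_1 length=100\nACGTACGT\n' > "$demo_dir/genome.fasta"
clean_fasta_headers "$demo_dir/genome.fasta" "$demo_dir/genome.nospaces.fasta"
grep '^>' "$demo_dir/genome.nospaces.fasta"
```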
Index the genome with bowtie (NOT bowtie2). In the scripts folder: nano bowtie_build.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.out
module load GCCcore/11.3.0 #I needed to add this to resolve conflicts between loaded GCCcore/9.3.0 and GCCcore/11.3.0
module load Bowtie/1.3.1-GCC-11.3.0
# Index the reference genome for A. poculata
bowtie-build /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta Apoc_ref.btindex
echo "Reference genome indexed!" $(date)
The indexed genome lives in the scripts folder for now.
Load the miRBase mature miRNA fasta database onto the server. I downloaded it onto my computer (it’s not very large) on 1/3/24 and will now copy it to the server. It will live in the refs folder in the smRNA directory.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA
mkdir refs
cd refs
ls
20240103_mature.fa
head 20240103_mature.fa
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cel-let-7-3p MIMAT0015091 Caenorhabditis elegans let-7-3p
CUAUGCAAUUUUCUACCUUACC
>cel-lin-4-5p MIMAT0000002 Caenorhabditis elegans lin-4-5p
UCCCUGAGACCUCAAGUGUGA
>cel-lin-4-3p MIMAT0015092 Caenorhabditis elegans lin-4-3p
ACACCUGGGCUCUCCGGGUACC
>cel-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p
CAUACUUCCUUACAUGCCCAUA
Check how many mature miRNA sequences there are in the file
zgrep -c ">" 20240103_mature.fa
48885
I’m going to reformat the fasta header names here so there are no spaces
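# Write to a temporary file first: redirecting straight back onto the input file would empty it before sed reads it
sed '/^>/ s/ /_/g' 20240103_mature.fa \
| sed '/^>/ s/,//g' \
> 20240103_mature.tmp \
&& mv 20240103_mature.tmp 20240103_mature.fa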
head 20240103_mature.fa
>cel-let-7-5p_MIMAT0000001_Caenorhabditis_elegans_let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cel-let-7-3p_MIMAT0015091_Caenorhabditis_elegans_let-7-3p
CUAUGCAAUUUUCUACCUUACC
>cel-lin-4-5p_MIMAT0000002_Caenorhabditis_elegans_lin-4-5p
UCCCUGAGACCUCAAGUGUGA
>cel-lin-4-3p_MIMAT0015092_Caenorhabditis_elegans_lin-4-3p
ACACCUGGGCUCUCCGGGUACC
>cel-miR-1-5p_MIMAT0020301_Caenorhabditis_elegans_miR-1-5p
CAUACUUCCUUACAUGCCCAUA
Do I need to change the U to T in 20240103_mature.fa? Let’s try mapping first and seeing.
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/collapse.cat.trimmed.AST-1065.fastq -e -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s reads_collapsed.fa -t reads_collapsed_vs_genome.arf -v
Didn’t take very long! Only a few seconds. It output this:
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_10292
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 4289826 379347 3910479 8.843 91.157
seq: 4289826 379347 3910479 8.843 91.157
Not a very high mapping rate but I wonder if that’s normal. I may need to change the U to T in the miRBase file. It also produced a bowtie.log file:
less bowtie.log
# reads processed: 1768
# reads with at least one reported alignment: 215 (12.16%)
# reads that failed to align: 1496 (84.62%)
# reads with alignments suppressed due to -m: 57 (3.22%)
Reported 613 alignments to 1 output stream(s)
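On the U versus T question: the mapper.pl command above never reads the miRBase file, so the conversion cannot affect this mapping rate; it would only matter at the miRDeep2.pl step. These notes do not settle whether miRDeep2 needs it, but if it does, a sketch of the conversion is below. `rna_to_dna` is a hypothetical helper that leaves header lines alone and writes to a new file.

```shell
# rna_to_dna IN OUT: change U to T on sequence lines only (headers are left untouched)
rna_to_dna() {
  sed '/^>/!s/U/T/g' "$1" > "$2"
}

# Toy example using the first miRBase record shown above.
demo_dir=$(mktemp -d)
printf '>cel-let-7-5p_MIMAT0000001_Caenorhabditis_elegans_let-7-5p\nUGAGGUAGUAGGUUGUAUAGUU\n' > "$demo_dir/mature.fa"
rna_to_dna "$demo_dir/mature.fa" "$demo_dir/mature.dna.fa"
cat "$demo_dir/mature.dna.fa"
```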
The reads_collapsed_vs_genome.arf file provides info about the sequences that did align:
head reads_collapsed_vs_genome.arf
seq_19_x21706 32 1 32 aacttttgacggtggatctcttggctcacgca chromosome_2 32 42321 42352 aacttttgacggtggatctcttggctcacgca - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_19_x21706 32 1 32 aacttttgacggtggatctcttggctcacgca chromosome_2 32 53070 53101 aacttttgacggtggatctcttggctcacgca - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_19_x21706 32 1 32 aacttttgacggtggatctcttggctcacgca chromosome_2 32 20734 20765 aacttttgacggtggatctcttggctcacgca - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_19_x21706 32 1 32 aacttttgacggtggatctcttggctcacgca chromosome_2 32 31478 31509 aacttttgacggtggatctcttggctcacgca - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x21513 32 1 32 tgcgtgagccaagagatccaccgtcaaaagtt chromosome_2 32 31478 31509 tgcgtgagccaagagatccaccgtcaaaagtt + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x21513 32 1 32 tgcgtgagccaagagatccaccgtcaaaagtt chromosome_2 32 20734 20765 tgcgtgagccaagagatccaccgtcaaaagtt + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x21513 32 1 32 tgcgtgagccaagagatccaccgtcaaaagtt chromosome_2 32 42321 42352 tgcgtgagccaagagatccaccgtcaaaagtt + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x21513 32 1 32 tgcgtgagccaagagatccaccgtcaaaagtt chromosome_2 32 53070 53101 tgcgtgagccaagagatccaccgtcaaaagtt + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_33_x17087 43 1 43 ttgctacgatcttctgagattaagcctttgttctaagatttgt chromosome_2 43 879093 879135 ttgctacgatcttctgagattaagcctttgttctaagatttgt + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_33_x17087 43 1 43 ttgctacgatcttctgagattaagcctttgttctaagatttgt chromosome_2 43 38360 38402 ttgctacgatcttctgagattaagcctttgttctaagatttgt - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
Not sure what all of this info means, but I will look into it. I also need to look into why/how the reads get collapsed, because the collapse step left me with only 1796 sequences (compared to the 26561348 sequences I had in the cat file) and I want to make sure that’s normal.
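For reference, the miRDeep2 documentation describes the .arf file as a tab-separated table with 13 columns per alignment. The sketch below labels the columns of one line copied from the head output above. The short column names are labels made up for this example (an assumption), not official names.

```shell
# One alignment from reads_collapsed_vs_genome.arf above, rebuilt with tab separators
arf_line=$(printf '%s\t' seq_19_x21706 32 1 32 aacttttgacggtggatctcttggctcacgca chromosome_2 32 42321 42352 aacttttgacggtggatctcttggctcacgca - 0)
arf_line="${arf_line}mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm"

# Print "label<TAB>value" for each column of every ARF line read from stdin
label_arf() {
  awk -F'\t' '
    BEGIN {
      n = split("read_id read_len read_start read_end read_seq target_id target_len target_start target_end target_seq strand mismatches edit_string", name, " ")
    }
    { for (i = 1; i <= NF && i <= n; i++) printf "%s\t%s\n", name[i], $i }
  '
}

printf '%s\n' "$arf_line" | label_arf
```

Read this way, the line says that the 32-nt read seq_19 (seen 21706 times before collapsing) matches chromosome_2 at 42321-42352 on the minus strand with 0 mismatches; in the edit string, m marks a matching position.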
Now let’s run mirdeep2!
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/collapse.cat.trimmed.AST-1065.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature.fa none none -t N.vectensis -P -v -g -1 2>report.log
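I need to specify none 2x because I do not have a file of mature miRNAs from a related species or a file of known precursor miRNAs for this species. Going by the miRDeep2 usage (reads, genome, mappings, miRNAs_ref, miRNAs_other, precursors), the miRBase file in this command sits in the slot for known mature miRNAs of this species. I got an error: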
#Starting miRDeep2
/data/putnamlab/mirdeep2/bin/miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/collapse.cat.trimmed.AST-1065.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature.fa none none -t N.vectensis -P -v -g -1
miRDeep2 started at 16:15:57
mkdir mirdeep_runs/run_04_01_2024_t_16_15_57
#testing input files
started: 16:16:06
sanity_check_mature_ref.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature.fa
ended: 16:16:06
total:0h:0m:0s
sanity_check_reads_ready_file.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/collapse.cat.trimmed.AST-1065.fastq
started: 16:16:06
ESC[1;31mError: ESC[0mproblem with /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/collapse.cat.trimmed.AST-1065.fastq
Error in line 860: Either the sequence
GATGGAATTGTAGCAT
contains less than 17 characters or contains characters others than [acgtunACGTUN]
Please make sure that your file only comprises sequences that have at least 17 characters
containing letters [acgtunACGTUN]
My collapsed read file has some sequences that are <17 bp which mirdeep2 doesn’t like. I need to remove sequences with <17 nts (or do it during the trimming step). Used chatgpt for the code below :)
#!/bin/bash
# Define the input and output files
input_file="collapse.cat.trimmed.AST-1065.fastq"
output_file="17_collapse.cat.trimmed.AST-1065.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
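The same filter can be written more compactly. This is a sketch that assumes the collapsed file is two-line FASTA (a header line followed by the whole sequence on one line), which is how the collapsed file looks here. The file names and the function name filter_min_len are placeholders.

```shell
# Toy two-line FASTA: the 16-nt read mirdeep2 complained about, plus one 25-nt read
tmpdir=$(mktemp -d)
printf '>seq_1_x10\nGATGGAATTGTAGCAT\n>seq_2_x5\nTGAAAATCTTTTCTCTGAAGTGGAA\n' > "$tmpdir/toy_collapsed.fa"

# Usage: filter_min_len MIN_LENGTH FASTA
# Keeps a record only if its sequence line has at least MIN_LENGTH characters
filter_min_len() {
  awk -v min="$1" '
    /^>/ { header = $0; next }                    # remember the header, wait for the sequence
    length($0) >= min { print header; print $0 }  # print the pair only if long enough
  ' "$2"
}

filter_min_len 17 "$tmpdir/toy_collapsed.fa" > "$tmpdir/toy_17.fa"
cat "$tmpdir/toy_17.fa"
```

Passing the cutoff as an argument makes it easy to reuse the function if a different minimum length is needed later.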
Rerun mirdeep2 with the new file
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/17_collapse.cat.trimmed.AST-1065.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature.fa none none -t N.vectensis -P -v -g -1 2>report.log
Took about 2 mins and there were no alignments. I’m going to change the U to T in the 20240103_mature.fa file, then rerun the mapping and mirdeep2 steps.
#!/bin/bash
# Define the input and output files
input_file="20240103_mature.fa" # Replace with your actual input file name
output_file="20240103_mature_T.fa"
# Initialize the output file
> "$output_file"
# Use awk to process the file
awk '{
if (substr($0, 1, 1) == ">") {
print $0 >> "'$output_file'" # Print the identifier as is
} else {
gsub(/U/, "T", $0) # Replace U with T in sequences
print $0 >> "'$output_file'"
}
}' "$input_file"
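The same U-to-T conversion can also be done with a sed one-liner that skips header lines. This is a sketch on toy data: the first entry uses the miR-100-5p sequence that appears later in these notes, and the second entry is a made-up placeholder whose header contains a U that must stay unchanged.

```shell
tmpdir=$(mktemp -d)
# Toy miRBase-style file in the RNA alphabet
printf '>hsa-miR-100-5p_toy\nAACCCGUAGAUCCGAACUUGUG\n>toy-miR-U1_made_up\nUGAGGUAGUAGGUUGUAUAGUU\n' > "$tmpdir/toy_mature.fa"

# On every line that is NOT a header, translate U/u to T/t
sed '/^>/!y/Uu/Tt/' "$tmpdir/toy_mature.fa" > "$tmpdir/toy_mature_T.fa"

# Sanity check: count sequence lines that still contain a U (should print 0)
grep -v '^>' "$tmpdir/toy_mature_T.fa" | grep -c '[Uu]'
```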
I guess I don’t need to redo the mapping step because the mapping did not use the 20240103_mature.fa file. Therefore, I will proceed to the mirdeep2 step.
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/fastp/17_collapse.cat.trimmed.AST-1065.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
Still no alignment :’( I need to talk to Sam and ask him the following:
- How many reads were left after he concatenated and collapsed his fastq files?
- What was his alignment after the mapping step?
Also what if I didn’t collapse the reads? What if I just removed the heading, + sign, and quality scores and formatted it like mirdeep2 wants it?
I just briefly reran the collapse step and now there are hundreds of thousands of sequences…may need to rerun mapping idk
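On the idea of skipping the collapse: a sketch of turning FASTQ records straight into FASTA with seq_N_xCOUNT style IDs, giving every read a count of 1. This only shows the reformatting on toy data. Whether mirdeep2 copes well with such a large uncollapsed file is untested here, and the function name fastq_to_fasta is a placeholder.

```shell
tmpdir=$(mktemp -d)
# Toy FASTQ with two 25-nt reads
printf '@read1\nTGAAAATCTTTTCTCTGAAGTGGAA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n' > "$tmpdir/toy.fastq"
printf '@read2\nTTCCACTTCAGAGAAAAGATTTTCA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n' >> "$tmpdir/toy.fastq"

# Every 4-line FASTQ record keeps only line 2 (the sequence);
# the header, the "+" line and the quality string are dropped
fastq_to_fasta() {
  awk 'NR % 4 == 2 { n++; printf ">seq_%d_x1\n%s\n", n, $0 }' "$1"
}

fastq_to_fasta "$tmpdir/toy.fastq" > "$tmpdir/toy.fa"
cat "$tmpdir/toy.fa"
```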
20240105
Flexbar finished running overnight, took about 9 hours. Now I need to QC it. First going to move it from the raw data folder to the trim data folder.
In /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim, I moved the old flexbar trimmed seqs to the folder flexbar_old. I moved the newly trimmed seqs from the raw data folder into /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar.
Now let’s run fastqc on the newly trimmed samples using the fastqc_trim.sh script from above, but changing the directories:
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.out
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "QC for trimmed reads using flexbar with max length of 30 bp" $(date)
for file in /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/trim/flexbar
done
echo "QC complete, run multiqc" $(date)
multiqc --interactive fastqc_results/trim/flexbar
Submitted batch job 292159
If the data looks good, I will do another test run of mirdeep2. The job finished in ~30 mins but it did not complete the MultiQC step, probably because the multiqc path in the script (fastqc_results/trim/flexbar) does not match the fastqc --outdir. When I went into the folder to run the multiqc step by hand, it appears to only have run it on the R2 files. I’m going to move the R1 and R2 fastqc info into separate folders and run the QC on them separately. There is probably a better way to do this idk.
mkdir R1 R2
mv *_1_fastqc* R1
mv *_2_fastqc* R2
Now go into each folder and run multiqc separately. QC looks good for both reads.
Next I need to cat, collapse and prep reads for mirdeep2.
20240107
I’m going to write a script that will cat and collapse the reads. I’ll do it on a test sample first. In the scripts folder: nano test_cat_collapse.sh.
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FASTX-Toolkit/0.0.14-GCC-9.3.0
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
echo "Concatenating R1 and R2 for test sample" $(date)
cat AST-1065_R1_001.fastq.gz_1.fastq AST-1065_R1_001.fastq.gz_2.fastq > cat.AST-1065.fastq
echo "Collapsing redundant sequences with fastx collapse" $(date)
fastx_collapser -v -i cat.AST-1065.fastq -o collapse.cat.AST-1065.fastq
echo "Prep sequence IDs for mirdeep2 analysis" $(date)
sed '/^>/ s/-/_x/g' collapse.cat.AST-1065.fastq \
| sed '/^>/ s/>/>seq_/' \
> collapse.cat.AST-1065.fastq
echo "Done!" $(date)
Submitted batch job 292222. I did the genome and database prep already so I don’t need to redo that. Took about 20 mins, but the collapsed file is empty. The sed step reads from and writes to the same file (collapse.cat.AST-1065.fastq), and the shell empties the output file before sed gets to read it. Going to have the output file for the sed portion be sed.collapse.cat.AST-1065.fastq instead. Submitted batch job 292224. That iteration worked!
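A sketch of the ID-renaming step on toy data, writing to a separate output file so that the input is never truncated while it is being read. The toy headers follow the rank-count pattern that the sed commands in the script expect, and the file names are placeholders.

```shell
tmpdir=$(mktemp -d)
# Toy collapsed FASTA: headers are "<rank>-<count>"
printf '>1-357414\nTGGTCTATGGTGTAACTGGCAACACGTCTG\n>2-138955\nACAGACGTGTTGCCAGTTACACCATAGACC\n' > "$tmpdir/collapse.toy.fa"

# Header lines only: "-" becomes "_x", then ">" becomes ">seq_".
# Input and output are different files.
sed -e '/^>/ s/-/_x/' -e '/^>/ s/^>/>seq_/' "$tmpdir/collapse.toy.fa" > "$tmpdir/sed.collapse.toy.fa"

head -n 2 "$tmpdir/sed.collapse.toy.fa"
```

Both substitutions run in a single sed call here, which avoids the pipe between two sed processes but gives the same headers.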
Check how many sequences are in the collapsed file.
zgrep -c ">" sed.collapse.cat.AST-1065.fastq
11979585
head sed.collapse.cat.AST-1065.fastq
>seq_1_x357414
TGGTCTATGGTGTAACTGGCAACACGTCTG
>seq_2_x138955
ACAGACGTGTTGCCAGTTACACCATAGACC
>seq_3_x125294
AACAGACGTGTTGCCAGTTACACCATAGAC
>seq_4_x98253
TGAAAATCTTTTCTCTGAAGTGGAA
>seq_5_x87633
TTCCACTTCAGAGAAAAGATTTTCA
Nice, almost 12 million sequences were retained. I was looking at the fastx collapse documentation and it said that the first number in the sequence id corresponded to a sequence and the second number corresponded to how many times that sequence appeared prior to the file being collapsed. So for example, >seq_1_x357414 was the most represented sequence, as indicated by the 1, and it appeared 357414 times in the pre-collapsed file.
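Because each header carries the copy number, summing the numbers after _x should give back the number of reads in the pre-collapse file, which makes a quick sanity check on the collapse step. A sketch on the first three records shown above; the function name total_reads is a placeholder.

```shell
tmpdir=$(mktemp -d)
printf '>seq_1_x357414\nTGGTCTATGGTGTAACTGGCAACACGTCTG\n' > "$tmpdir/toy_collapsed.fa"
printf '>seq_2_x138955\nACAGACGTGTTGCCAGTTACACCATAGACC\n' >> "$tmpdir/toy_collapsed.fa"
printf '>seq_3_x125294\nAACAGACGTGTTGCCAGTTACACCATAGAC\n' >> "$tmpdir/toy_collapsed.fa"

# Split each header on "_x" and add up the copy numbers
total_reads() {
  awk -F'_x' '/^>/ { sum += $2 } END { printf "%d\n", sum }' "$1"
}

echo "unique sequences: $(grep -c '^>' "$tmpdir/toy_collapsed.fa")"
echo "reads represented: $(total_reads "$tmpdir/toy_collapsed.fa")"
```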
Need to make sure that all of my sequences are at least 17 nt, as mirdeep2 does not run if any sequence is shorter than 17 nt. I’m reusing the length filter from above (the chatgpt code) with the new file names.
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-1065.fastq"
output_file="17_sed.collapse.cat.AST-1065.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-1065.fastq
11979585
Retained all of the sequences, so none were shorter than 17 nt. NOW let’s attempt a mirdeep2 run.
conda activate /data/putnamlab/mirdeep2
Map reads to genome first
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s 20240107_reads_collapsed.fa -t 20240107_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_56091
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 35658222 3275049 32383173 9.185 90.815
seq: 35658222 3275049 32383173 9.185 90.815
Still got a pretty high % of unmapped reads…let’s look at some of the files produced.
head 20240107_reads_collapsed.fa
>seq_1_x357414
TGGTCTATGGTGTAACTGGCAACACGTCTG
>seq_2_x138955
ACAGACGTGTTGCCAGTTACACCATAGACC
>seq_3_x125294
AACAGACGTGTTGCCAGTTACACCATAGAC
>seq_4_x98253
TGAAAATCTTTTCTCTGAAGTGGAA
>seq_5_x87633
TTCCACTTCAGAGAAAAGATTTTCA
zgrep -c ">" 20240107_reads_collapsed.fa
11979585
head 20240107_reads_collapsed_vs_genome.arf
seq_7_x81424 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x81424 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x81424 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x81424 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x38316 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 27510 27539 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x38316 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 49105 49134 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x38316 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 38360 38389 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x38316 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 879106 879135 acaaatcttagaacaaaggcttaatctcag - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_25_x32759 30 1 30 ttgctacgatcttctgagattaagcctttg chromosome_2 30 879093 879122 ttgctacgatcttctgagattaagcctttg + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_25_x32759 30 1 30 ttgctacgatcttctgagattaagcctttg chromosome_2 30 38373 38402 ttgctacgatcttctgagattaagcctttg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
wc -l 20240107_reads_collapsed_vs_genome.arf
1091161 20240107_reads_collapsed_vs_genome.arf
I’m not sure what the 20240107_reads_collapsed_vs_genome.arf file means. Let’s see how many unique sequences are in that file
cut -f1 20240107_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
523747
So ~500,000 unique sequences were mapped to the genome? That is how I am interpreting this. Let’s try to run mirdeep2.
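The unique-ID count and the mapper.pl statistics presumably measure different things: the IDs in the .arf are distinct sequences, while the mapped total in the statistics looks like a count of reads, i.e. the sum of the copy numbers. A sketch that computes both from the first column, on a toy .arf with only two columns filled in:

```shell
tmpdir=$(mktemp -d)
# Toy ARF: seq_7 maps to two loci, seq_17 to one (only the first two columns are filled in)
printf 'seq_7_x81424\t30\nseq_7_x81424\t30\nseq_17_x38316\t30\n' > "$tmpdir/toy.arf"

# Distinct read IDs that have at least one alignment
cut -f1 "$tmpdir/toy.arf" | sort -u > "$tmpdir/mapped_ids.txt"
unique_mapped=$(wc -l < "$tmpdir/mapped_ids.txt" | tr -d ' ')

# Total reads those IDs stand for (sum of the numbers after "_x")
reads_mapped=$(awk -F'_x' '{ sum += $2 } END { printf "%d\n", sum }' "$tmpdir/mapped_ids.txt")

echo "unique sequences mapped: $unique_mapped; reads represented: $reads_mapped"
```

If the second number computed on the real file comes out close to 3275049, that would support this reading of the mapping statistics.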
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/20240107_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
I’m going to let it run for ~10 mins then cut it off, as it probably takes a while (Sam White said his script took several days). I’m not sure if I can activate a conda env in a job script, emailed Kevin Bryan to ask.
Cut the script off, this is as far as it got:
#####################################
# #
# miRDeep2.0.1.3 #
# #
# last change: 08/11/2019 #
# #
#####################################
miRDeep2 started at 17:59:09
#Starting miRDeep2
#testing input files
#parsing genome mappings
#excising precursors
#preparing signature
#folding precursors
Yay, things are happening! Hopefully I can run this by the e5 meeting on Friday.
20240107
Kevin Bryan confirmed that I can run a conda environment in a job script, I just need to add -i to the #!/bin/bash because of the way conda changes environments. Going to try this now on the test smRNA sample. In the scripts folder: nano test_mirdeep2.sh.
#!/bin/bash -i
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on test sample" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/20240107_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for test sample" $(date)
conda deactivate
Submitted batch job 292242. Job has been pending for a few mins and says it’s waiting for resources.
20240109
After pending for about a day, mirdeep2 finally finished running on the test sample. It took about 3.5 hours to run. It created these folders/files in the scripts folder:
drwxr-xr-x. 3 jillashey 4.0K Jan 9 07:49 mirdeep_runs
drwxr-xr-x. 2 jillashey 4.0K Jan 9 07:52 dir_prepare_signature1704804676
-rw-r--r--. 1 jillashey 377 Jan 9 11:23 error_09_01_2024_t_07_49_09.log
-rw-r--r--. 1 jillashey 63K Jan 9 11:24 result_09_01_2024_t_07_49_09.csv
-rw-r--r--. 1 jillashey 704K Jan 9 11:24 result_09_01_2024_t_07_49_09.html
drwxr-xr-x. 2 jillashey 4.0K Jan 9 11:28 pdfs_09_01_2024_t_07_49_09
-rw-r--r--. 1 jillashey 28K Jan 9 11:28 result_09_01_2024_t_07_49_09.bed
drwxr-xr-x. 2 jillashey 4.0K Jan 9 11:28 mirna_results_09_01_2024_t_07_49_09
-rw-r--r--. 1 jillashey 20K Jan 9 11:28 report.log
Let’s look at them each.
cd mirdeep_runs
cd run_09_01_2024_t_07_49_09 # folder
ls
identified_precursors.fa output.mrd rfam_vs_precursor.bwt run_09_01_2024_t_07_49_09_parameters survey.csv
The identified precursors fasta file includes the precursor sequences (ie the part of the sequence that forms the pre-miRNA)
head identified_precursors.fa
>chromosome_10_48090
ugauggagauggagaacgagaguggacuggacaguuuggcacugaagguucccuuuauaagcaguguuuuucuuucgacuacc
>M:chromosome_10_48090
TGTTTTTCTTTCGACTACC
>L:chromosome_10_48090
AGTGGACTGGACAGTTTGGCACTGAAGGTTCCCTTTATAAGCAG
>S:chromosome_10_48090
TGATGGAGATGGAGAACGAG
>chromosome_14_108653
cgcgcgcuauaguuacaguagcuauagcgcgcacuauaauuauagcagcuauagcgcacgcuauaguuagaaacuguagcgcgaguu
zgrep -c ">" identified_precursors.fa
18695
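Almost 19,000 FASTA records, although that count includes the M:, L: and S: entries listed under each precursor, so there are fewer distinct precursors.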
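To count only the precursor records, the M:, L: and S: headers can be filtered out (these prefixes presumably stand for mature, loop and star). A sketch on toy data with shortened sequences; the file name is a placeholder.

```shell
tmpdir=$(mktemp -d)
# Toy version of identified_precursors.fa: one full precursor block plus one more precursor
printf '>chromosome_10_48090\nugauggag\n>M:chromosome_10_48090\nTGTTTTTC\n' > "$tmpdir/toy_precursors.fa"
printf '>L:chromosome_10_48090\nAGTGGACT\n>S:chromosome_10_48090\nTGATGGAG\n' >> "$tmpdir/toy_precursors.fa"
printf '>chromosome_14_108653\ncgcgcgcu\n' >> "$tmpdir/toy_precursors.fa"

# All FASTA headers, then only headers that do not start with M:, L: or S:
all_records=$(grep -c '^>' "$tmpdir/toy_precursors.fa")
precursors=$(grep '^>' "$tmpdir/toy_precursors.fa" | grep -vc '^>[MLS]:')

echo "all records: $all_records; precursors only: $precursors"
```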
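The output.mrd file has one block per candidate precursor, I believe: the score breakdown, the read counts, the expected and observed structure, and then every read or reference miRNA that aligns to the precursor.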
>chromosome_7_30929
score total 2.4
score for star read(s) -1.3
score for read counts 0
score for mfe 2.1
score for randfold 1.6
total read count 13651
mature read count 13499
loop read count 0
star read count 152
exp fffffffffffffffffffMMMMMMMMMMMMMMMMMMMMlllllllllllllllSSSSSSSSSSSSSSSSSSSSffffffffffffffffffffffffff
ffffffffffff
obs fffffffffffffffffffMMMMMMMMMMMMMMMMMMMMlllllllllllllSSSSSSSSSSSSSSSSSSSSSfffffffffffffffffffffffffff
ffffffffffff
pri_seq cgcacugcaguugacgugaacccguagauccgaacuugugggauuuuucuccacaaguucggcuccaugguccacgugugcugugcucacaaacguugcu
acagcgugguca
pri_struct .((((.(((....(((((...((((.((.((((((((((((((....)).)))))))))))).)).))))..))))).))).))))......((((((..
.))))))..... #MM
seq_7183221_x1 .................Uaacccguagauccgaacuugug............................................................
............ 1
seq_2180915_x2 ..................aacccguagauccgaacu................................................................
............ 0
seq_9719163_x1 ..................aacccguagauccgaUcuu...............................................................
............ 1
seq_2267835_x2 ..................Cacccguagauccgaacuu...............................................................
............ 1
ola-miR-100_MIMAT0022614_Oryzias_latipes_miR-100 ..................aacccguagauccgaacuu...............................................................
............ 0
seq_126867_x21 ..................aacccguagauccgaacuug..............................................................
............ 0
seq_254962_x11 ..................Cacccguagauccgaacuug..............................................................
............ 1
sbo-miR-100_MIMAT0049501_Saimiri_boliviensis_miR-100 ..................aacccguagauccgaacuugu.............................................................
............ 0
dma-miR-100_MIMAT0049252_Daubentonia_madagascariensis_miR-100 ..................aacccguagauccgaacuugu.............................................................
............ 0
seq_199802_x14 ..................aacccguagauccgaacuugC.............................................................
............ 1
pmi-miR-100-5p_MIMAT0032156_Patiria_miniata_miR-100-5p ..................aacccguagauccgaacuugu.............................................................
............ 0
seq_2153292_x2 ..................aacccguagauccgaGcuugu.............................................................
............ 1
seq_7747127_x1
zgrep -c ">" output.mrd
5231
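To skim the scores without paging through every alignment block, the precursor ID and its "score total" line can be pulled out together. A sketch on a toy file that keeps only the lines needed; the three IDs and scores are the ones that show up in the known-miRNA bed file later in these notes, and the function name list_scores is a placeholder.

```shell
tmpdir=$(mktemp -d)
# Toy output.mrd reduced to the header and score lines
printf '>chromosome_7_30929\nscore total\t2.4\nscore for mfe\t2.1\n' > "$tmpdir/toy.mrd"
printf '>chromosome_6_22810\nscore total\t19.4\n' >> "$tmpdir/toy.mrd"
printf '>chromosome_11_54894\nscore total\t-3.5\n' >> "$tmpdir/toy.mrd"

# Print "precursor_id<TAB>total_score" for every block
list_scores() {
  awk '/^>/ { id = substr($0, 2) } /^score total/ { print id "\t" $NF }' "$1"
}

# Keep only candidates with a total score of 0 or higher
list_scores "$tmpdir/toy.mrd" | awk -F'\t' '$2 >= 0'
```

The cutoff of 0 is only an example value; the survey table is the place to decide which score cutoff gives an acceptable signal-to-noise ratio.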
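The rfam vs precursor file looks like bowtie output from aligning the precursor-derived sequences (the M:, L: and S: entries) against Rfam rRNA and tRNA entries, presumably so that candidates resembling rRNAs or tRNAs can be flagged, rather than info about where the rRNAs and tRNAs sit on the chromosomes.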
head rfam_vs_precursor.bwt
M:chromosome_13_66113 + AM086652.1/1-576_RF00177;SSU_rRNA_5; 468 TGTTTCGGGATTGCAATG IIIIIIIIIIIIIIIIII 3
M:chromosome_13_66113 + AF508778.1/21-597_RF00177;SSU_rRNA_5; 469 TGTTTCGGGATTGCAATG IIIIIIIIIIIIIIIIII 3
M:chromosome_13_66113 + AJ310485.1/21-596_RF00177;SSU_rRNA_5; 468 TGTTTCGGGATTGCAATG IIIIIIIIIIIIIIIIII 3
M:chromosome_13_66113 + DQ057346.1/21-597_RF00177;SSU_rRNA_5; 469 TGTTTCGGGATTGCAATG IIIIIIIIIIIIIIIIII 3
S:chromosome_13_66113 + AACY021626480.1/155-83_RF00005;tRNA; 26 TTTGTTTCGTAAGCAAA IIIIIIIIIIIIIIIII 2 7:T>C
S:chromosome_13_66113 + AACY023301721.1/825-896_RF00005;tRNA; 25 TTTGTTTCGTAAGCAAA IIIIIIIIIIIIIIIII 2 12:A>G
S:chromosome_13_66113 + AACY022901721.1/116-188_RF00005;tRNA; 26 TTTGTTTCGTAAGCAAA IIIIIIIIIIIIIIIII 2 12:A>G
M:chromosome_14_72583 + CP000030.1/153914-153986_RF00005;tRNA; 3 CGGTTAGCTCAGTTGGTAGA IIIIIIIIIIIIIIIIIIII 13
M:chromosome_14_72583 + AACY020037993.1/1235-1163_RF00005;tRNA; 3 CGGTTAGCTCAGTTGGTAGA IIIIIIIIIIIIIIIIIIII 13
M:chromosome_14_72583 + AACY020166163.1/13-85_RF00005;tRNA; 3 CGGTTAGCTCAGTTGGTAGA IIIIIIIIIIIIIIIIIIII 13
wc -l rfam_vs_precursor.bwt
2241 rfam_vs_precursor.bwt
The run parameters file has the code specifics
Start: 09_01_2024_t_07_49_09
Script /data/putnamlab/mirdeep2/bin/miRDeep2.pl
args /data/putnamlab/mirdeep2/bin/miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/20240107_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature_T.fa none none -t N.vectensis -P -v -g -1
dir_with_tmp_files dir_miRDeep2_09_01_2024_t_07_49_09
dir /glfs/brick01/gv0/putnamlab/jillashey/Astrangia2021/smRNA/scripts
file_reads /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq
file_genome /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta
file_reads_vs_genome /data/putnamlab/jillashey/Astrangia2021/smRNA/20240107_reads_collapsed_vs_genome.arf
file_mature_ref_this_species /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/20240103_mature_T.fa
file_mature_ref_other_species none
option{t} = N.vectensis
option{v} = used
miRDeep runtime:
started: 7:49:09
ended: 11:28:44
total:3h:39m:35s
The survey file includes info about the mirdeep2 scores. This is the same info that is at the top of the html and csv files.
miRDeep2 score novel miRNAs reported by miRDeep2 novel miRNAs, estimated false positives novel miRNAs, estimated true positives known miRNAs in species known miRNAs in data known miRNAs detected by miRDeep2 estimated signal-to-noise excision gearing
10 49 3 +/- 2 46 +/- 2 (93 +/- 3%) 48885 83 1 (1%) 15.6 1
9 51 3 +/- 2 48 +/- 2 (93 +/- 3%) 48885 83 1 (1%) 15.1 1
8 56 4 +/- 2 52 +/- 2 (93 +/- 3%) 48885 83 1 (1%) 15.4 1
7 64 4 +/- 2 60 +/- 2 (94 +/- 3%) 48885 83 1 (1%) 16.3 1
6 68 4 +/- 2 64 +/- 2 (94 +/- 3%) 48885 83 1 (1%) 16 1
5 70 5 +/- 2 65 +/- 2 (93 +/- 3%) 48885 83 1 (1%) 15 1
4 72 5 +/- 2 67 +/- 2 (93 +/- 3%) 48885 83 1 (1%) 14 1
3 101 6 +/- 2 95 +/- 2 (94 +/- 2%) 48885 83 1 (1%) 16.2 1
2 183 9 +/- 3 174 +/- 3 (95 +/- 2%) 48885 83 72 (87%) 20.2 1
1 260 22 +/- 4 238 +/- 4 (91 +/- 2%) 48885 83 72 (87%) 11.6 1
0 315 53 +/- 6 262 +/- 6 (83 +/- 2%) 48885 83 72 (87%) 5.9 1
-1 365 97 +/- 8 268 +/- 8 (73 +/- 2%) 48885 83 72 (87%) 3.8 1
-2 462 150 +/- 10 312 +/- 10 (67 +/- 2%) 48885 83 72 (87%) 3.1 1
-3 707 228 +/- 14 479 +/- 14 (68 +/- 2%) 48885 83 72 (87%) 3.1 1
-4 1003 402 +/- 18 601 +/- 18 (60 +/- 2%) 48885 83 73 (88%) 2.5 1
-5 1167 750 +/- 23 417 +/- 23 (36 +/- 2%) 48885 83 73 (88%) 1.6 1
-6 1309 1227 +/- 35 82 +/- 35 (6 +/- 3%) 48885 83 73 (88%) 1.1 1
-7 1405 1733 +/- 42 0 +/- 0 (0 +/- 0%) 48885 83 73 (88%) 0.8 1
-8 1653 2208 +/- 45 0 +/- 0 (0 +/- 0%) 48885 83 73 (88%) 0.7 1
-9 2013 2615 +/- 49 0 +/- 0 (0 +/- 0%) 48885 83 73 (88%) 0.8 1
-10 2423 2951 +/- 53 0 +/- 0 (0 +/- 0%) 48885 83 73 (88%) 0.8 1
Going into the dir_prepare_signature1704804676 folder from the scripts folder
cd dir_prepare_signature1704804676
ls
mature_vs_precursors.arf precursors.ebwt.2.ebwt precursors.ebwt.rev.1.ebwt reads_vs_precursors.arf signature_unsorted.arf.tmp
mature_vs_precursors.bwt precursors.ebwt.3.ebwt precursors.ebwt.rev.2.ebwt reads_vs_precursors.bwt signature_unsorted.arf.tmp2
precursors.ebwt.1.ebwt precursors.ebwt.4.ebwt precursors.fa signature_unsorted.arf
Looked at the mature vs. precursors file
head mature_vs_precursors.arf
hsa-miR-100-5p_MIMAT0000098_Homo_sapiens_miR-100-5p 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30929 22 19 40 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
hsa-miR-100-5p_MIMAT0000098_Homo_sapiens_miR-100-5p 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30930 22 69 90 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
mmu-miR-100-5p_MIMAT0000655_Mus_musculus_miR-100-5p 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30930 22 69 90 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
mmu-miR-100-5p_MIMAT0000655_Mus_musculus_miR-100-5p 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30929 22 19 40 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
rno-miR-100-5p_MIMAT0000822_Rattus_norvegicus_miR-100-5p 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30930 22 69 90 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
rno-miR-100-5p_MIMAT0000822_Rattus_norvegicus_miR-100-5p 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30929 22 19 40 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
gga-miR-100-5p_MIMAT0001178_Gallus_gallus_miR-100-5p 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30930 22 69 90 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
gga-miR-100-5p_MIMAT0001178_Gallus_gallus_miR-100-5p 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30929 22 19 40 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
aga-miR-100_MIMAT0001498_Anopheles_gambiae_miR-100 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30930 22 69 90 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
aga-miR-100_MIMAT0001498_Anopheles_gambiae_miR-100 22 1 22 aacccgtagatccgaacttgtg chromosome_7_30929 22 19 40 aacccgtagatccgaacttgtg + 0 mmmmmmmmmmmmmmmmmmmmmm
wc -l mature_vs_precursors.arf
191 mature_vs_precursors.arf
I’m not sure what this means…Need to look into this more. It has the same layout as the .arf from the mapping step, so I think it lists the miRBase mature sequences that align to my excised precursors. It looks like most of them are related to miR-100, which makes sense as, as far as I know, this is the only miRNA reported to be shared between bilaterians and cnidarians.
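To check the impression that most hits are miR-100, the reference IDs in the first column can be reduced to the miRNA name and tallied. A sketch on three toy lines that keep only the first column plus one dummy column. It assumes the ID layout seen above (species prefix, a dash, the miRNA name, then an underscore), and the function name count_mirna_names is a placeholder.

```shell
tmpdir=$(mktemp -d)
# Toy mature_vs_precursors.arf: only column 1 matters for this tally
printf 'hsa-miR-100-5p_MIMAT0000098_Homo_sapiens_miR-100-5p\t22\n' > "$tmpdir/toy_mature_vs_precursors.arf"
printf 'mmu-miR-100-5p_MIMAT0000655_Mus_musculus_miR-100-5p\t22\n' >> "$tmpdir/toy_mature_vs_precursors.arf"
printf 'aga-miR-100_MIMAT0001498_Anopheles_gambiae_miR-100\t22\n' >> "$tmpdir/toy_mature_vs_precursors.arf"

# Column 1 -> text before the first "_" -> drop the species prefix -> count each name
count_mirna_names() {
  cut -f1 "$1" | cut -d_ -f1 | sed 's/^[^-]*-//' | sort | uniq -c | sort -rn
}

count_mirna_names "$tmpdir/toy_mature_vs_precursors.arf"
```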
Back in the scripts directory, look at the error file:
RNAfold: invalid option -- n
total number of rounds controls=100
1^M2^M3^M4^M5^M6^M7^M8^M9^M10^M11^M12^M13^M14^M15^M16^M17^M18^M19^M20^M21^M22^M23^M24^M25^M26^M27^M28^M29^M30^M31^M32^M33^M34^M35^M36^M37^M38^M39^M40^M41^M42^M43^M44^M45^M46^M47^M48^M49^M50^M51^M52^M53^M54^M55^M56^M57^M58^M59^M60^M61^M62^M63^M64^M65^M66^M67^M68^M69^M70^M71^M72^M73^M74^M75^M76^M77^M78^M79^M80^M81^M82^M83^M84^M85^M86^M87^M88^M89^M90^M91
^M92^M93^M94^M95^M96^M97^M98^M99^M100^Mcontrols performed
Not sure what this means either…some issue with the RNAfold option? But I got a randfold pvalue.
The pdf folder includes a pdf file for each miRNA (?) identified and provides info about the scores and gives a nice graph about where the mature and star sequences are
In my first run of mirdeep2, 318 unique pdfs were produced. In the mirna_results_09_01_2024_t_07_49_09 results folder, there are .bed and .fa files. The bed files contain this info:
less known_mature_09_01_2024_t_07_49_09_score-50_to_na.bed
browser position chromosome_6:21902755-21902777
browser hide all
track name="notTrackname.known_miRNAs" description="known miRNAs detected by miRDeep2 for notTrackname" visibility=2
itemRgb="On";
chromosome_6 21902755 21902777 chromosome_6_22810 19.4 + 21902755 21902777 255,0,0
chromosome_7 21599230 21599250 chromosome_7_30929 2.4 - 21599230 21599250 0,0,255
chromosome_11 24411396 24411418 chromosome_11_54894 -3.5 - 24411396 24411418 0,0,255
and the fa files contain this info:
less known_mature_09_01_2024_t_07_49_09_score-50_to_na.fa
>chromosome_6_22810
AAGAACACCCAAAATAGCTGAA
>chromosome_7_30929
ACCCGTAGATCCGAACTTGT
>chromosome_11_54894
GCGGGTGTGTGTGTGTGTGTGT
So I suppose it’s giving information about the known mature sequences and their location on the Astrangia genome. The other files in this folder contain bed and fa files for the known precursors, star and mature sequences, as well as the novel precursors, star and mature sequences.
In the main scripts folder, there is also a file result_09_01_2024_t_07_49_09 in .bed, .csv and .html format. It has the same info in all of them, but it is summarizing the parameters used, the survey info, the novel miRNAs and the known miRNAs. In the survey info, it says that 83 known miRNAs were identified in my data but only 3 are listed at the bottom?
Anything that is listed as mirdeep2 results from Jan 9 are associated with AST-1065.
I wonder if these results would change if I trimmed to 25 bp. I also wonder what would happen if I removed N. vectensis as my related species. Sam White put S. purpuratus as the related species, but N. vectensis is more closely related. I’m going to remove the related species and rerun mirdeep2. Submitted batch job 292316. Started running immediately, nice.
I’m also going to rerun flexbar with trimming of 25 bp. In the trim data folder, make a folder for flexbar 25 bp
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim
mkdir flexbar_25bp
In the scripts folder: nano flexbar_25bp.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=200GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Flexbar/3.5.0-foss-2018b
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
echo "Trimming reads to 25 bp using flexbar" $(date)
array1=($(ls *R1_001.fastq.gz))
for i in ${array1[@]}; do
flexbar \
-r ${i} \
-p $(echo ${i}|sed s/_R1/_R2/) \
-a NEB-adapters.fasta \
-ap ON \
-qf i1.8 \
-qt 30 \
-t /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar_25bp \
-k 25 \
-R trim.${i} \
-P trim.$(echo ${i}|sed s/_R1/_R2/) \
-z GZ
done
echo "Trimming complete" $(date)
Submitted batch job 292317. Immediately got an error saying: ERROR: Could not open file trim.AST-2360_R1_001.fastq.gz.gz. I commented out the -z option. Submitted batch job 292344.
20240110
Hooray! Flexbar 25 bp and mirdeep2 test finished running. Let’s look at the mirdeep2 results. As a reminder, I reran the mirdeep2 code but removed the specification of Nematostella as a related species. When removing this specification, I got marginally more (1-2 more) novel miRNAs predicted, but got the same number of known miRNAs identified. The mirdeep2 documentation states “it will in practice always improve miRDeep2 performance if miRNAs from some related species is input, even if it is not closely related.” So it is likely best to keep Nematostella as a related species in the code.
My next step is to run mirdeep2 on the newly trimmed (25 bp) reads. Again, I’m going to run it on a test sample. First, move the newly trimmed reads from the raw data folder to the flexbar 25 bp folder.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/raw
mv trim* ../trim/flexbar_25bp/
Next run fastqc. In the fastqc folder, make a folder for the new flexbar 25bp QC data.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/trim
mkdir flexbar_25bp
cd flexbar_25bp
mkdir R1 R2
In the scripts folder, edit the fastqc_trim.sh script:
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "QC for trimmed reads using flexbar with max length of 25 bp" $(date)
for file in /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar_25bp/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/trim/flexbar_25bp
done
echo "FastQC complete" $(date)
#multiqc --interactive fastqc_results/trim/flexbar
20240112
Had to rerun the trimming because I accidentally left 30 as the max length. Now I’m running the QC. Submitted batch job 292416
Once this has finished running, navigate to the flexbar 25bp fastqc folder and move the FastQC output files into the R1 and R2 folders.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/fastqc/trim/flexbar_25bp
mv *_R1* R1
mv *_R2* R2
Go into each folder and run MultiQC on each. I have to do this because for some reason, multiQC was just running QC stats on the R2 reads when the R1 and R2 reads were in the same folder.
I think it would be a good idea to make a custom database that includes primarily cnidarian miRNAs, as there are not many in mirbase itself. Baumgarten et al. (2017) did something similar (under 2.3 miRNA annotation in their paper).
- “miRNAs were then annotated using the miRDeep2 package with default settings (Friedlander et al., 2012). To identify putatively conserved miRNAs based on previous de novo annotations of other cnidarian genomes, we created a reference library of mature miRNA sequences from N. vectensis (Grimson et al., 2008; Moran et al., 2014), Hydra magnipapillata (Krishna et al., 2013) and Stylophora pistillata (Liew et al., 2014).”
Therefore, I am going to make a custom database of the cnidarian miRNAs that I am aware of.
- Stylophora pistillata - Table S11 from Liew et al. 2014
- Aiptasia - Table S4 from Baumgarten et al. 2017
- Acropora digitifera - Table S5 from Gajigan & Conaco 2017
- the sea anemones Edwardsiella carnea, Scolanthus callimorphus, Metridium senile and Anemonia viridis, and the stony coral Acropora millepora - Tables S1 and S2 from Praher et al. 2021
- Hydra - Table S1 (SuppFile2) from Krishna et al. 2012
- Nematostella - Table S1 from Moran et al. 2014
- Anemonia viridis - Table S2 from Urbarova et al. 2018
- Nematostella + Hydra are also both on miRBase
In google sheets, I gathered all the cnidarian miRNA sequences that I could find and made it into a csv file. This is the format:
miRNA Mature_miRNA_sequence Species Citation Notes
spi-mir-temp-1 acccguagauccgaacuugugg Stylophora pistillata Liew et al. 2014 Matches miR-100 family.
spi-mir-temp-2 uaucgaauccgucaaaaagaga Stylophora pistillata Liew et al. 2014 NA
spi-mir-temp-3 ucagggauuguggugaguuaguu Stylophora pistillata Liew et al. 2014 NA
spi-mir-temp-4 aaagaaguacaagugguaggg Stylophora pistillata Liew et al. 2014 Exact match of nve-miR-2023.
spi-mir-temp-5 gagguccggaugguuga Stylophora pistillata Liew et al. 2014 NA
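I downloaded the csv to my computer and converted it to fasta so that the miRNA name, species, citation and notes form the header line, denoted with a “>”, with the sequence on the line below. Note that the awk command takes the header from fields 1-4 and the sequence from field 5, so it expects the sequence to be the last column of the exported csv.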
awk -F',' 'NR>1 {print ">"$1" "$2" "$3" "$4"\n"$5}' cnidarian_miRNAs.csv > cnidarian_miRNAs.fasta
Now go back to the miRbase fasta file and subset so that I make a file with Hydra and Nematostella sequences only
awk '/>.*hma|>.*nve/ {print; getline; print}' 20240103_mature_T.fa > subset.fasta
I then copied the subset fasta info into the cnidarian_miRNAs.fasta. A complete cnidarian miRNA fasta! Reformat the fasta header names so there are no spaces.
sed '/^>/ s/ /_/g' cnidarian_miRNAs.fasta \
| sed '/^>/ s/,//g' \
> cnidarian_miRNAs.tmp && mv cnidarian_miRNAs.tmp cnidarian_miRNAs.fasta
Reformat sequences so that everything is uppercase
awk '/^>/ {print; getline; print toupper($0); next} {print}' cnidarian_miRNAs.fasta > cnidarian_miRNAs.tmp && mv cnidarian_miRNAs.tmp cnidarian_miRNAs.fasta
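A note on rewriting a fasta file under its own name: the shell truncates the file named after > before the command starts reading, so the input and output of one command should not be the same file. Below is a minimal sketch of the header clean-up and uppercase steps written through a temporary file. The demo record is made up and the temp file names are placeholders.

```shell
# Made-up one-record fasta standing in for cnidarian_miRNAs.fasta.
demo_fa=$(mktemp)
printf '>demo mir 1, test\nacgu\n' > "$demo_fa"

# Write to a temporary file first, then replace the original only on success.
tmp_fa=$(mktemp)
sed -e '/^>/ s/ /_/g' -e '/^>/ s/,//g' "$demo_fa" \
  | awk '/^>/ {print; next} {print toupper($0)}' > "$tmp_fa" \
  && mv "$tmp_fa" "$demo_fa"

cleaned=$(cat "$demo_fa")
echo "$cleaned"
rm -f "$demo_fa"
```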
I then copied the file to andromeda to /data/putnamlab/jillashey/Astrangia2021/smRNA/refs. Change the U to T in the fasta file
#!/bin/bash
# Define the input and output files
input_file="cnidarian_miRNA.fa" # Replace with your actual input file name
output_file="cnidarian_miRNA_T.fa"
# Initialize the output file
> "$output_file"
# Use awk to process the file
awk '{
if (substr($0, 1, 1) == ">") {
print $0 >> "'$output_file'" # Print the identifier as is
} else {
gsub(/U/, "T", $0) # Replace U with T in sequences
print $0 >> "'$output_file'"
}
}' "$input_file"
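For reference, the same U to T conversion could probably be done in one sed call that skips header lines, so a U in a header (for example in an author name) is left alone. This is a sketch on a made-up record, not the script that was run above.

```shell
# Made-up record: the header contains a U that must not be changed.
converted=$(printf '>avi-demo_Urbarova\nUGAGGUAG\nugagguag\n' \
  | sed '/^>/!y/Uu/Tt/')   # on non-header lines, map U->T and u->t
echo "$converted"
```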
Now I am going to run mirdeep2 with the cnidarian miRNA file! I’m going to modify test_mirdeep2.sh (nano test_mirdeep2.sh) so that the fasta file is the cnidarian_miRNA_T.fa file. Submitted batch job 292356. Didn’t work. Removed the -t argument. Submitted batch job 292358…still not running. Getting this error:
bash: cannot set terminal process group (-1): Function not implemented
bash: no job control in this shell
Need to troubleshoot this.
When I look at the report.log file, it says:
miRDeep2 started at 12:39:17
mkdir mirdeep_runs/run_12_01_2024_t_12_39_17
#testing input files
started: 12:39:23
sanity_check_mature_ref.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/cnidarian_miRNA_T.fa
Error: problem with /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/cnidarian_miRNA_T.fa
Error in line 64: The sequence
SPIS
contains characters others than [acgtunACGTUN]
Please check your file for the following issues:
I. Sequences are allowed only to comprise characters [ACGTNacgtn].
II. Identifiers are not allowed to have withespaces.
I looked at the cnidarian_miRNA_T.fa file and found that for a few sequences, it has Spis instead of the actual sequence:
>spi-mir-temp-42 Stylophora pistillata Liew et al. 2014 NA
ugugcaagaauuugagucgcugg
>apa-mir-100 Exaiptasia pallida Baumgarten et al. 2017 "miR-100; Nve
Spis
>apa-mir-2022a Exaiptasia pallida Baumgarten et al. 2017 "miR-2022; Nve
Spis
>apa-mir-2023 Exaiptasia pallida Baumgarten et al. 2017 "miR-2023; Nve
Spis
>apa-mir-2025 Exaiptasia pallida Baumgarten et al. 2017 "miR-2025; Nve
Adi"
>apa-mir-2026 Exaiptasia pallida Baumgarten et al. 2017 miR-2026; Nve
aauuucaaauauccacugauug
>apa-mir-2030 Exaiptasia pallida Baumgarten et al. 2017 "miR-2030; Nve
Spis
>apa-mir-2036 Exaiptasia pallida Baumgarten et al. 2017 "miR-2036; Nve
Spis
>apa-mir-2037 Exaiptasia pallida Baumgarten et al. 2017 "miR-2037; Nve
Spis"
>apa-mir-2050 Exaiptasia pallida Baumgarten et al. 2017 "miR-2050; Nve
Spis
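A quick pre-check of the fasta before submitting the job would have caught this without waiting on the queue. The sketch below is my own approximation of the two rules listed in the report.log message (sequence characters and whitespace in identifiers); it is not the sanity_check_mature_ref.pl script itself, and check_fasta is a made-up helper name.

```shell
# check_fasta FILE: report headers containing whitespace and sequence lines
# containing anything other than A, C, G, T, U or N (either case).
check_fasta() {
  awk '
    /^>/ { if ($0 ~ /[ \t]/) printf "line %d: header contains whitespace\n", NR; next }
    $0 !~ /^[ACGTUNacgtun]+$/ { printf "line %d: unexpected characters: %s\n", NR, $0 }
  ' "$1"
}

# Made-up four-line fasta with one good record and one bad record.
demo_ref=$(mktemp)
printf '>ok-mir-1\nacgu\n>bad mir 2\nSpis\n' > "$demo_ref"
report=$(check_fasta "$demo_ref")
echo "$report"
rm -f "$demo_ref"
```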
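I think the problem is the commas inside the quoted Notes column: awk -F',' splits on every comma, including the ones inside quotes, so part of the note ends up where the sequence should be. Going back to the csv and replacing those commas with semicolons.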
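To see the effect of commas inside a quoted field, here is a small sketch with a made-up csv in the column order the awk command reads (name, species, citation, notes, sequence). The rows and sequences are placeholders, not real entries from my sheet.

```shell
# Made-up csv: row 1 has commas inside the quoted notes, row 2 uses semicolons.
demo_csv=$(mktemp)
cat > "$demo_csv" <<'EOF'
miRNA,Species,Citation,Notes,Sequence
demo-mir-1,Species one,Author 2017,"miR-100; Nve, Spis, Adi",acgu
demo-mir-2,Species one,Author 2017,miR-100; Nve; Spis; Adi,acgu
EOF

# awk splits on every comma, so row 1 has 7 fields and field 5 is part of the note.
field_report=$(awk -F',' 'NR>1 {print $1 ": " NF " fields, field 5 = [" $5 "]"}' "$demo_csv")
echo "$field_report"
rm -f "$demo_csv"
```

With semicolons in the notes, the row keeps five fields and the sequence stays in field 5.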
I downloaded the edited csv to my computer and converted it to fasta again, so that the miRNA name, species, citation and notes form the header line, denoted with a “>”, with the sequence on the line below (the awk command takes the sequence from field 5, the last column of the exported csv).
awk -F',' 'NR>1 {print ">"$1" "$2" "$3" "$4"\n"$5}' cnidarian_miRNAs.csv > cnidarian_miRNAs.fasta
That seems to have fixed the problem. Reformat the fasta header names so there are no spaces.
sed '/^>/ s/ /_/g' cnidarian_miRNAs.fasta \
| sed '/^>/ s/,//g' \
> cnidarian_miRNAs.tmp && mv cnidarian_miRNAs.tmp cnidarian_miRNAs.fasta
Reformat sequences so that everything is uppercase
awk '/^>/ {print; getline; print toupper($0); next} {print}' cnidarian_miRNAs.fasta > cnidarian_miRNAs.tmp && mv cnidarian_miRNAs.tmp cnidarian_miRNAs.fasta
I then copied the file to andromeda to /data/putnamlab/jillashey/Astrangia2021/smRNA/refs. Change the U to T in the fasta file
#!/bin/bash
# Define the input and output files
input_file="cnidarian_miRNAs.fa" # Replace with your actual input file name
output_file="cnidarian_miRNAs_T.fa"
# Initialize the output file
> "$output_file"
# Use awk to process the file
awk '{
if (substr($0, 1, 1) == ">") {
print $0 >> "'$output_file'" # Print the identifier as is
} else {
gsub(/U/, "T", $0) # Replace U with T in sequences
print $0 >> "'$output_file'"
}
}' "$input_file"
Now concatenate the cnidarian miRNAs with the mature miRNA fasta from miRBase.
cat 20240103_mature_T.fa cnidarian_miRNAs_T.fa > mature_mirbase_cnidarian_T.fa
Now edit test_mirdeep2.sh so that mature_mirbase_cnidarian_T.fa is the input fasta file. Submitted batch job 292458. Failed. The report log file is telling me that I need to put none none before the -t flag. Changed that and resubmitted the job. Submitted batch job 292459. Hooray, it appears to be running!
Sam White talked to Azenta (who did the sequencing for my project) and they recommended using Trimmomatic for trimming the small RNA reads. They also recommended tossing out read 2 and only using read 1 for analysis. I concatenate and collapse the reads anyway, so that shouldn’t matter too much. Sam will get back to me about trimming info/code from Azenta soon.
20240116
While I wait for Sam to get back to me about the trimming info, I’m going to run a mirdeep2 test on another sample (AST-2000). Unzip the files first.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
gunzip AST-2000_R1_001.fastq.*
Next, go to scripts folder and modify test_cat_collapse.sh so that the sample is AST-2000. Submitted batch job 292595. Took about 20 mins.
head sed.collapse.cat.AST-2000.fastq
>seq_1_x719979
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_2_x647300
GAGAATTCTACCACTGAACCACCAGTGC
>seq_3_x228711
GCACTGTGGTTCAGTGGTAGAATTCTC
>seq_4_x206224
GAGAATTCTACCACTGAACCACAGTGC
>seq_5_x161452
GCACTGGTGGTTCAGTGGTAGAATTCT
zgrep -c ">" sed.collapse.cat.AST-2000.fastq
10773578
Remove any sequences that are <17 nts.
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2000.fastq"
output_file="17_sed.collapse.cat.AST-2000.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2000.fastq
10773578
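The identical counts before and after filtering can be cross-checked by tallying sequence lengths directly. This is a sketch on a made-up three-record file; on the real data the input would be the sed.collapse.cat file, and the shortest length reported should be 17 or more if nothing needed removing.

```shell
# Made-up collapsed fasta with sequences of length 20, 17 and 8.
demo_collapsed=$(mktemp)
printf '>seq_1_x5\nACGTACGTACGTACGTACGT\n>seq_2_x3\nACGTACGTACGTACGTA\n>seq_3_x1\nACGTACGT\n' > "$demo_collapsed"

# Count how many sequences there are of each length, shortest first.
length_tally=$(awk '!/^>/ {n[length($0)]++} END {for (len in n) print len, n[len]}' "$demo_collapsed" | sort -n)
shortest=$(printf '%s\n' "$length_tally" | head -n 1 | cut -d' ' -f1)
echo "$length_tally"
echo "shortest sequence length: $shortest"
rm -f "$demo_collapsed"
```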
Looks like no sequences were removed. Now let’s attempt the mirdeep2 run with AST-2000. Map reads to genome first.
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2000.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s 20240116_reads_collapsed.fa -t 20240116_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_5771
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 34857708 5888651 28969057 16.893 83.107
seq: 34857708 5888651 28969057 16.893 83.107
Higher mapping % than AST-1065. Let’s look at the files produced.
head 20240116_reads_collapsed.fa
>seq_1_x719979
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_2_x647300
GAGAATTCTACCACTGAACCACCAGTGC
>seq_3_x228711
GCACTGTGGTTCAGTGGTAGAATTCTC
>seq_4_x206224
GAGAATTCTACCACTGAACCACAGTGC
>seq_5_x161452
GCACTGGTGGTTCAGTGGTAGAATTCT
zgrep -c ">" 20240116_reads_collapsed.fa
10773578
head 20240116_reads_collapsed_vs_genome.arf
seq_7_x119692 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x119692 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x119692 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x119692 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_15_x41699 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_15_x41699 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_15_x41699 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_15_x41699 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_27_x28319 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 20610 20639 tccgacactcagacagacatgctcctggga + mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_27_x28319 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 42197 42226 tccgacactcagacagacatgctcctggga + mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 20240116_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
1016532
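The count above is the number of unique collapsed sequences with at least one genome hit. Because each ID ends in _x followed by the read count, the same pipe can be extended to sum how many reads those sequences represent. A sketch on made-up, truncated ARF-style lines (only the first column matters here):

```shell
# Made-up tab-separated lines; the first column mimics the seq_N_xCOUNT IDs.
demo_arf=$(mktemp)
printf 'seq_7_x100\t30\tchromosome_2\nseq_7_x100\t30\tchromosome_2\nseq_15_x40\t30\tchromosome_2\n' > "$demo_arf"

# Unique IDs, then sum the read counts stored after "_x".
mapped_summary=$(cut -f1 "$demo_arf" | sort -u \
  | awk -F'_x' '{reads += $2} END {print NR " unique sequences, " reads " reads"}')
echo "$mapped_summary"
rm -f "$demo_arf"
```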
Edit the test_mirdeep2.sh script to contain info for AST-2000
#!/bin/bash -i
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on test sample AST-2000" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2000.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/20240116_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for test sample AST-2000" $(date)
conda deactivate
Submitted batch job 292597. This took about 7 hours to run.
I then compared the AST-1065 and AST-2000 samples in R to see if there were any overlapping sequences (code here). I found 21 unique overlapping sequences between the two samples! Exciting.
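The comparison itself was done in R. As a rough shell alternative (not the R code referred to above), shared sequences between two samples could be found with sort and comm once the sequences of interest have been pulled into one-per-line text files. The file names and sequences below are made up.

```shell
# Made-up sequence lists for two samples, one sequence per line.
workdir_cmp=$(mktemp -d)
printf 'ACGTACGTACGTACGTACGTA\nTTGACCGGTTAACCGGTTAAC\nGGGTTTAAACCCGGGTTTAAA\n' > "$workdir_cmp/sampleA_seqs.txt"
printf 'TTGACCGGTTAACCGGTTAAC\nCCCCAAAATTTTGGGGCCCCA\nACGTACGTACGTACGTACGTA\n' > "$workdir_cmp/sampleB_seqs.txt"

# comm needs sorted input; -12 keeps only lines present in both files.
sort -u "$workdir_cmp/sampleA_seqs.txt" > "$workdir_cmp/A.sorted"
sort -u "$workdir_cmp/sampleB_seqs.txt" > "$workdir_cmp/B.sorted"
shared=$(comm -12 "$workdir_cmp/A.sorted" "$workdir_cmp/B.sorted")
shared_n=$(printf '%s\n' "$shared" | grep -c .)
echo "$shared_n sequences shared between the two samples"
rm -rf "$workdir_cmp"
```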
20240119
Hollie recommended that I compare the 25 bp vs 30 bp trimming to see whether one trim length yielded more miRNAs than the other. First, I need to prep the AST-2000 25bp sample for mirdeep2.
Unzip the files.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar_25bp
gunzip trim.AST-2000*
Next, go to scripts folder and modify test_cat_collapse.sh so that the sample is AST-2000 from the flexbar 25 bp folder. Submitted batch job 293019. Took about 15 mins
head sed.collapse.cat.AST-2000_25bp.fastq
>seq_1_x1075819
GCACTGGTGGTTCAGTGGTAGAATT
>seq_2_x673045
GAGAATTCTACCACTGAACCACCAG
>seq_3_x339694
GCACTGTGGTTCAGTGGTAGAATTC
>seq_4_x212925
GAGAATTCTACCACTGAACCACAGT
>seq_5_x150098
AGAATTCTACCACTGAACCACCAGT
zgrep -c ">" sed.collapse.cat.AST-2000_25bp.fastq
9235930
Remove any seqs that are <17 nts.
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2000_25bp.fastq"
output_file="17_sed.collapse.cat.AST-2000_25bp.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2000_25bp.fastq
9235930
No reads removed. Now map the reads to the genome using mirdeep2
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar_25bp/17_sed.collapse.cat.AST-2000_25bp.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s 20240119_reads_collapsed.fa -t 20240119_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_51907
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 34857708 6018696 28839012 17.266 82.734
seq: 34857708 6018696 28839012 17.266 82.734
Slightly higher mapping than the 30bp AST-2000. Let’s look at the files
head 20240119_reads_collapsed.fa
>seq_1_x1075819
GCACTGGTGGTTCAGTGGTAGAATT
>seq_2_x673045
GAGAATTCTACCACTGAACCACCAG
>seq_3_x339694
GCACTGTGGTTCAGTGGTAGAATTC
>seq_4_x212925
GAGAATTCTACCACTGAACCACAGT
>seq_5_x150098
AGAATTCTACCACTGAACCACCAGT
zgrep -c ">" 20240119_reads_collapsed.fa
9235930
head 20240119_reads_collapsed_vs_genome.arf
seq_6_x145559 25 1 25 aacttttgacggtggatctcttggc chromosome_2 25 20741 20765 aacttttgacggtggatctcttggc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x145559 25 1 25 aacttttgacggtggatctcttggc chromosome_2 25 31485 31509 aacttttgacggtggatctcttggc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x145559 25 1 25 aacttttgacggtggatctcttggc chromosome_2 25 42328 42352 aacttttgacggtggatctcttggc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x145559 25 1 25 aacttttgacggtggatctcttggc chromosome_2 25 53077 53101 aacttttgacggtggatctcttggc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_18_x42856 25 1 25 tgcgtgagccaagagatccaccgtc chromosome_2 25 42321 42345 tgcgtgagccaagagatccaccgtc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_18_x42856 25 1 25 tgcgtgagccaagagatccaccgtc chromosome_2 25 53070 53094 tgcgtgagccaagagatccaccgtc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_18_x42856 25 1 25 tgcgtgagccaagagatccaccgtc chromosome_2 25 31478 31502 tgcgtgagccaagagatccaccgtc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_18_x42856 25 1 25 tgcgtgagccaagagatccaccgtc chromosome_2 25 20734 20758 tgcgtgagccaagagatccaccgtc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_27_x32954 25 1 25 tccgacactcagacagacatgctcc chromosome_2 25 42197 42221 tccgacactcagacagacatgctcc mmmmmmmmmmmmmmmmmmmmmmmmm
seq_27_x32954 25 1 25 tccgacactcagacagacatgctcc chromosome_2 25 52946 52970 tccgacactcagacagacatgctcc mmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 20240119_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
792395
Fewer unique collapsed reads mapped to the genome for the 25 bp than the 30 bp AST-2000 sample (792,395 vs 1,016,532).
Edit the test_mirdeep2.sh script to contain info for AST-2000 25bp
#!/bin/bash -i
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on test sample AST-2000 trimmed to 25bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar_25bp/17_sed.collapse.cat.AST-2000_25bp.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/20240119_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for test sample AST-2000 trimmed to 25bp" $(date)
conda deactivate
Submitted batch job 293022
20240121
Took about 16 hours to run. Now going to download the csv and compare the AST-2000 25 vs 30 bp results.
AST-2000 25 bp
AST-2000 30 bp
The 25bp trimming had more miRNAs identified but more false positives. There were 111 unique seqs when comparing 25bp v 30bp AST-2000. I think I am going to move forward with the 30bp trimming.
Yay!!!!! Okay well now I can do the rest of the mirdeep2 runs for all the samples. Idk how I feel about putting all the samples in a loop because I don’t want them to overwrite one another. I might do a separate script for each sample…If I do that, I’ll need to make separate scripts for the mirdeep2 itself.
In /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar, gunzip all fastq files.
I can write a single script to concatenate (with the cat command) and collapse reads (with fastx_collapser from the FASTX-Toolkit). In scripts folder: nano cat_collapse.sh.
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FASTX-Toolkit/0.0.14-GCC-9.3.0
echo "Concatenate and collapse smRNA reads" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
samples=$(ls *_R1_001.fastq.gz_1.fastq | sed 's/\(.*\)_R1_001.fastq.gz_1.fastq/\1/')
for sample in $samples
do
# Concatenate paired-end reads
cat "${sample}_R1_001.fastq.gz_1.fastq" "${sample}_R1_001.fastq.gz_2.fastq" > "cat.${sample}.fastq"
echo "${sample} reads are concatenated into one file per sample" $(date)
# Collapse concatenated reads
fastx_collapser -v -i "cat.${sample}.fastq" -o "collapse.cat.${sample}.fastq"
echo "${sample} reads collapsed" $(date)
done
Submitted batch job 293086
20240122
All files are now concatenated and collapsed. I need to now prep the sequence IDs for mirdeep2 with sed. I should’ve put this in the script. I’m also going to remove any reads <17nts.
AST-1065
sed '/^>/ s/-/_x/g' collapse.cat.AST-1065.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-1065.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-1065.fastq"
output_file="17_sed.collapse.cat.AST-1065.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-1065.fastq
11979585
Not doing AST-1105 bc of poor QC and mapping
AST-1147
sed '/^>/ s/-/_x/g' collapse.cat.AST-1147.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-1147.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-1147.fastq"
output_file="17_sed.collapse.cat.AST-1147.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-1147.fastq
18025765
AST-1412
sed '/^>/ s/-/_x/g' collapse.cat.AST-1412.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-1412.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-1412.fastq"
output_file="17_sed.collapse.cat.AST-1412.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-1412.fastq
11251691
AST-1560
sed '/^>/ s/-/_x/g' collapse.cat.AST-1560.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-1560.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-1560.fastq"
output_file="17_sed.collapse.cat.AST-1560.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-1560.fastq
11702120
AST-1567
sed '/^>/ s/-/_x/g' collapse.cat.AST-1567.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-1567.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-1567.fastq"
output_file="17_sed.collapse.cat.AST-1567.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-1567.fastq
9871465
AST-1617
sed '/^>/ s/-/_x/g' collapse.cat.AST-1617.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-1617.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-1617.fastq"
output_file="17_sed.collapse.cat.AST-1617.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-1617.fastq
8081542
AST-1722
sed '/^>/ s/-/_x/g' collapse.cat.AST-1722.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-1722.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-1722.fastq"
output_file="17_sed.collapse.cat.AST-1722.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-1722.fastq
7574236
AST-2000
sed '/^>/ s/-/_x/g' collapse.cat.AST-2000.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2000.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2000.fastq"
output_file="17_sed.collapse.cat.AST-2000.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2000.fastq
10773578
AST-2007
sed '/^>/ s/-/_x/g' collapse.cat.AST-2007.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2007.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2007.fastq"
output_file="17_sed.collapse.cat.AST-2007.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2007.fastq
8653745
AST-2302
sed '/^>/ s/-/_x/g' collapse.cat.AST-2302.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2302.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2302.fastq"
output_file="17_sed.collapse.cat.AST-2302.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2302.fastq
11391780
AST-2360
sed '/^>/ s/-/_x/g' collapse.cat.AST-2360.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2360.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2360.fastq"
output_file="17_sed.collapse.cat.AST-2360.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2360.fastq
9737775
AST-2398
sed '/^>/ s/-/_x/g' collapse.cat.AST-2398.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2398.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2398.fastq"
output_file="17_sed.collapse.cat.AST-2398.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2398.fastq
9507430
AST-2404
sed '/^>/ s/-/_x/g' collapse.cat.AST-2404.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2404.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2404.fastq"
output_file="17_sed.collapse.cat.AST-2404.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2404.fastq
11599063
AST-2412
sed '/^>/ s/-/_x/g' collapse.cat.AST-2412.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2412.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2412.fastq"
output_file="17_sed.collapse.cat.AST-2412.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2412.fastq
10485205
AST-2512
sed '/^>/ s/-/_x/g' collapse.cat.AST-2512.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2512.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2512.fastq"
output_file="17_sed.collapse.cat.AST-2512.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2512.fastq
9710109
AST-2523
sed '/^>/ s/-/_x/g' collapse.cat.AST-2523.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2523.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2523.fastq"
output_file="17_sed.collapse.cat.AST-2523.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2523.fastq
11748402
AST-2563
sed '/^>/ s/-/_x/g' collapse.cat.AST-2563.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2563.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2563.fastq"
output_file="17_sed.collapse.cat.AST-2563.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2563.fastq
11689234
AST-2729
sed '/^>/ s/-/_x/g' collapse.cat.AST-2729.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2729.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2729.fastq"
output_file="17_sed.collapse.cat.AST-2729.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2729.fastq
10703305
AST-2755
sed '/^>/ s/-/_x/g' collapse.cat.AST-2755.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.AST-2755.fastq
#!/bin/bash
# Define the input and output files
input_file="sed.collapse.cat.AST-2755.fastq"
output_file="17_sed.collapse.cat.AST-2755.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.AST-2755.fastq
10834414
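The length filter above is repeated verbatim for every sample, with only the file names changing. A minimal sketch of the same filter as a reusable function is below. The function name `filter_min_length` and its argument order are illustrative assumptions, and it assumes two-line records (a `>` header followed by one sequence line), as in the collapsed files. It passes the minimum length to awk with `-v` rather than splicing shell variables into the awk program.

```shell
#!/bin/bash
# Sketch: keep only records whose sequence is at least min_len nt long.
# filter_min_length is a hypothetical helper name, not part of any tool used here.
# Assumes two-line FASTA-style records: a ">" header, then a single sequence line.
filter_min_length() {
    local min_len="$1" input_file="$2" output_file="$3"
    awk -v min="$min_len" '
        /^>/ { header = $0; next }                      # remember the most recent header
        length($0) >= min { print header; print $0 }    # emit the record if long enough
    ' "$input_file" > "$output_file"
}

# Example call (same file names as the AST-2755 block above):
# filter_min_length 17 sed.collapse.cat.AST-2755.fastq 17_sed.collapse.cat.AST-2755.fastq
```

Because the output is written with a single `>` redirect, the output file does not need to be initialized first.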
Okay, everything is prepped! I’m going to remove the extra files made (i.e., the cat, collapse, and sed iterations of each file). Going back to the main folder, I’m going to make a mirdeep2 folder where I will run all of my mirdeep2 code so that the output is all in one place.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA
mkdir mirdeep2
cd mirdeep2
mkdir AST-1065 AST-1147 AST-1412 AST-1560 AST-1567 AST-1617 AST-1722 AST-2000 AST-2007 AST-2302 AST-2360 AST-2398 AST-2404 AST-2412 AST-2512 AST-2523 AST-2563 AST-2729 AST-2755
Now let’s run mirdeep2 on all of the samples! I am paranoid that if I run the script in a loop over all the samples, one sample’s output will overwrite another’s. Therefore, I am going to run each sample individually, each in its own folder.
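For reference, the per-sample scripts below differ only in the sample ID, so they could also be generated from one template while still keeping each run in its own directory. This is only a sketch, not what was run here: `write_mirdeep2_script` and `base_dir` are hypothetical names, the mail options are left out for brevity, and the paths simply mirror the ones used in this notebook.

```shell
#!/bin/bash
# Sketch only: print a miRDeep2 SLURM script for one sample.
# write_mirdeep2_script and base_dir are illustrative names, not part of miRDeep2 or SLURM.
write_mirdeep2_script() {
    local sample="$1"
    local base_dir="/data/putnamlab/jillashey/Astrangia2021/smRNA"
    local genome="/data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta"
    # Unquoted heredoc: ${sample} expands now; \$(date) stays literal until the job runs.
    cat <<EOF
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --account=putnamlab
#SBATCH -D ${base_dir}/mirdeep2/${sample}
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample ${sample} trimmed to 30bp" \$(date)
miRDeep2.pl ${base_dir}/data/trim/flexbar/17_sed.collapse.cat.${sample}.fastq ${genome} ${base_dir}/mirdeep2/${sample}/${sample}_reads_collapsed_vs_genome.arf ${base_dir}/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample ${sample} trimmed to 30bp" \$(date)
conda deactivate
EOF
}

# Example: write_mirdeep2_script AST-1147 > AST-1147_mirdeep2.sh
```

Because the `-D` working directory and the `.arf` path both contain the sample ID, each generated job writes into its own sample folder.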
AST-1065
cd AST-1065
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-1065_reads_collapsed.fa -t AST-1065_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_51087
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 35658222 3275049 32383173 9.185 90.815
seq: 35658222 3275049 32383173 9.185 90.815
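The `%mapped` column is just mapped reads divided by total reads, times 100. A small sketch to reproduce it from the counts in the table is below; `pct_mapped` is a hypothetical helper name and the three-decimal rounding is chosen to match the log format.

```shell
#!/bin/bash
# Sketch: recompute the %mapped value reported in the mapper.pl statistics.
# pct_mapped is an illustrative helper, not part of miRDeep2.
pct_mapped() {
    local mapped="$1" total="$2"
    # awk does the floating-point division; print with three decimals like the log
    awk -v m="$mapped" -v t="$total" 'BEGIN { printf "%.3f\n", 100 * m / t }'
}

# Example with the AST-1065 numbers above:
# pct_mapped 3275049 35658222
```

With the AST-1065 counts this gives 9.185, matching the table, so roughly 9 in every 100 reads in this sample were placed on the genome.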
Look at the mapping results
head AST-1065_reads_collapsed.fa
>seq_1_x357414
TGGTCTATGGTGTAACTGGCAACACGTCTG
>seq_2_x138955
ACAGACGTGTTGCCAGTTACACCATAGACC
>seq_3_x125294
AACAGACGTGTTGCCAGTTACACCATAGAC
>seq_4_x98253
TGAAAATCTTTTCTCTGAAGTGGAA
>seq_5_x87633
TTCCACTTCAGAGAAAAGATTTTCA
zgrep -c ">" AST-1065_reads_collapsed.fa
11979585
head AST-1065_reads_collapsed_vs_genome.arf
seq_7_x81424 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x81424 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x81424 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x81424 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x38316 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 27510 27539 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x38316 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 49105 49134 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x38316 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 38360 38389 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x38316 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 879106 879135 acaaatcttagaacaaaggcttaatctcag - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_25_x32759 30 1 30 ttgctacgatcttctgagattaagcctttg chromosome_2 30 879093 879122 ttgctacgatcttctgagattaagcctttg + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_25_x32759 30 1 30 ttgctacgatcttctgagattaagcctttg chromosome_2 30 38373 38402 ttgctacgatcttctgagattaagcctttg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-1065_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
523747
conda deactivate
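The `cut -f1 ... | sort | uniq | wc -l` check above counts unique collapsed sequences that mapped, not reads. Since each ID carries its read count after `_x` (e.g. `seq_7_x81424`), the counts can be summed over the unique IDs to get mapped reads instead. The sketch below assumes the `.arf` file is tab-separated with the read ID in column 1; `mapped_read_total` is a hypothetical helper name. Whether this sum equals the `mapped` column in the mapper.pl statistics is an assumption I have not checked against the mapper.pl source.

```shell
#!/bin/bash
# Sketch: sum the _x read counts over the unique read IDs in an .arf file.
# mapped_read_total is an illustrative helper, not part of miRDeep2.
mapped_read_total() {
    local arf="$1"
    cut -f1 "$arf" \
        | sort -u \
        | awk -F'_x' '{ sum += $NF } END { printf "%.0f\n", sum }'   # $NF is the count after "_x"
}

# Example:
# mapped_read_total AST-1065_reads_collapsed_vs_genome.arf
```

`sort -u` removes the repeated IDs from multi-mapping sequences first, so each collapsed sequence contributes its count only once.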
Run mirdeep2. nano AST-1065_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1065
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-1065 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1065/AST-1065_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-1065 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293148
AST-1147
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1147
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1147.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-1147_reads_collapsed.fa -t AST-1147_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_53573
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 80830448 23638038 57192410 29.244 70.756
seq: 80830448 23638038 57192410 29.244 70.756
Look at the mapping results
head AST-1147_reads_collapsed.fa
>seq_1_x3434945
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_2_x1817679
GGCGAGAATTCTACCACTGAACCACCAGTG
>seq_3_x926642
AGCGAGAATTCTACCACTGAACCACCAGTG
>seq_4_x602263
GCACTGTGGTTCAGTGGTAGAATTCTCGCC
>seq_5_x475440
GGCGAGAATTCTACCACTGAACCACAGTGC
zgrep -c ">" AST-1147_reads_collapsed.fa
18025765
head AST-1147_reads_collapsed_vs_genome.arf
seq_6_x427578 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x427578 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x427578 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x427578 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_14_x202590 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_14_x202590 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_14_x202590 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_14_x202590 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x127492 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 879106 879135 acaaatcttagaacaaaggcttaatctcag - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x127492 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 38360 38389 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-1147_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
3227088
conda deactivate
Run mirdeep2. nano AST-1147_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1147
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-1147 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1147.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1147/AST-1147_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-1147 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293149
AST-1412
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1412
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1412.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-1412_reads_collapsed.fa -t AST-1412_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_56487
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 32559110 3100518 29458592 9.523 90.477
seq: 32559110 3100518 29458592 9.523 90.477
Look at the mapping results
head AST-1412_reads_collapsed.fa
>seq_1_x137015
GGAAGAGCACACGTCTGAACTCCAGTCACT
>seq_2_x100202
AACTTTTGACGGTGGATCTCTTGGCTCACG
>seq_3_x94707
TCGGACTGTAGAACTCTGAACGTGTAGATC
>seq_4_x78147
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_5_x73890
TACTGGATAACTAAGGGAAAGTTTGGCTAA
zgrep -c ">" AST-1412_reads_collapsed.fa
11251691
head AST-1412_reads_collapsed_vs_genome.arf
seq_2_x100202 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_2_x100202 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_2_x100202 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_2_x100202 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_19_x39468 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_19_x39468 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_19_x39468 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_19_x39468 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x31213 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 52946 52975 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x31213 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 31354 31383 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-1412_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
599077
conda deactivate
Run mirdeep2. nano AST-1412_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1412
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-1412 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1412.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1412/AST-1412_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-1412 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293150
AST-1560
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1560
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1560.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-1560_reads_collapsed.fa -t AST-1560_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_58565
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 35654048 767007 34887041 2.151 97.849
seq: 35654048 767007 34887041 2.151 97.849
Look at the mapping results
head AST-1560_reads_collapsed.fa
>seq_1_x404539
GACTTTGTAGCATAGGTAAGGTTAGTGCAT
>seq_2_x272137
TCAGATGCACTAACCTTACCTATGCTACAA
>seq_3_x78722
GGAAGAGCACACGTCTGAACTCCAGTCACG
>seq_4_x58838
TCGGACTGTAGAACTCTGAACGTGTAGATC
>seq_5_x46651
GCACTGATGGTTCAGTGGTAGAATTCTCGC
zgrep -c ">" AST-1560_reads_collapsed.fa
11702120
head AST-1560_reads_collapsed_vs_genome.arf
seq_22_x19193 30 1 30 caggtctgtgatgcccttagatgtccgggg chromosome_2 30 32082 32111 caggtctgtgatgcccttagatgtacaggg - 2 mmmmmmmmmmmmmmmmmmmmmmmmMmMmmm
seq_22_x19193 30 1 30 caggtctgtgatgcccttagatgtccgggg chromosome_2 30 42925 42954 caggtctgtgatgcccttagatgtacaggg - 2 mmmmmmmmmmmmmmmmmmmmmmmmMmMmmm
seq_22_x19193 30 1 30 caggtctgtgatgcccttagatgtccgggg chromosome_2 30 21338 21367 caggtctgtgatgcccttagatgtacaggg - 2 mmmmmmmmmmmmmmmmmmmmmmmmMmMmmm
seq_43_x14818 30 1 30 acaggtctgtgatgcccttagatgtccggg chromosome_2 30 32083 32112 acaggtctgtgatgcccttagatgtacagg - 2 mmmmmmmmmmmmmmmmmmmmmmmmmMmMmm
seq_43_x14818 30 1 30 acaggtctgtgatgcccttagatgtccggg chromosome_2 30 42926 42955 acaggtctgtgatgcccttagatgtacagg - 2 mmmmmmmmmmmmmmmmmmmmmmmmmMmMmm
seq_43_x14818 30 1 30 acaggtctgtgatgcccttagatgtccggg chromosome_2 30 21339 21368 acaggtctgtgatgcccttagatgtacagg - 2 mmmmmmmmmmmmmmmmmmmmmmmmmMmMmm
seq_72_x11510 30 1 30 aggtctgtgatgcccttagatgtccggggc chromosome_2 30 32081 32110 aggtctgtgatgcccttagatgtacagggc - 2 mmmmmmmmmmmmmmmmmmmmmmmMmMmmmm
seq_72_x11510 30 1 30 aggtctgtgatgcccttagatgtccggggc chromosome_2 30 42924 42953 aggtctgtgatgcccttagatgtacagggc - 2 mmmmmmmmmmmmmmmmmmmmmmmMmMmmmm
seq_72_x11510 30 1 30 aggtctgtgatgcccttagatgtccggggc chromosome_2 30 21337 21366 aggtctgtgatgcccttagatgtacagggc - 2 mmmmmmmmmmmmmmmmmmmmmmmMmMmmmm
seq_87_x10194 30 1 30 agccaggaatcctaaccgctagaccatctg chromosome_12 30 41900395 41900424 agccaggaatcctaaccgctagaccatttg - 1 mmmmmmmmmmmmmmmmmmmmmmmmmmmMmm
cut -f1 AST-1560_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
123911
conda deactivate
Run mirdeep2. nano AST-1560_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1560
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-1560 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1560.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1560/AST-1560_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-1560 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293152
AST-1567
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1567
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1567.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-1567_reads_collapsed.fa -t AST-1567_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_12237
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 33222794 3149689 30073105 9.481 90.519
seq: 33222794 3149689 30073105 9.481 90.519
Look at the mapping results
head AST-1567_reads_collapsed.fa
>seq_1_x330204
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_2_x226911
GCACTGATGGTTCAGTGGTAGAATTCTCGC
>seq_3_x157338
GGCGAGAATTCTACCACTGAACCACCAGTG
>seq_4_x98181
AGCGAGAATTCTACCACTGAACCACCAGTG
>seq_5_x77909
GGAAGAGCACACGTCTGAACTCCAGTCACA
zgrep -c ">" AST-1567_reads_collapsed.fa
9871465
head AST-1567_reads_collapsed_vs_genome.arf
seq_9_x59542 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x59542 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x59542 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x59542 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_24_x25786 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_24_x25786 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_24_x25786 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_24_x25786 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_37_x16700 29 1 29 aatggataaccctcaaccgtccggacctc chromosome_14 29 6295369 6295397 aatggataaccctcaaccgtccggacctc - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_54_x12863 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 31354 31383 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-1567_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
569922
conda deactivate
Run mirdeep2. nano AST-1567_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1567
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-1567 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1567.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1567/AST-1567_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-1567 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293153
I’m going to call it quits for today. 3 samples are running through mirdeep2 now, and 2 samples are pending on the server.
20240123
Last night/early this morning, AST-1065, AST-1412, AST-1560, and AST-1567 all finished running mirdeep2. I downloaded the html and csv output files to my computer. I’m going to continue with the remaining samples.
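One quick way to see which samples have finished is to look for the result table in each sample folder. The sketch below assumes miRDeep2 writes a `result_<timestamp>.csv` file into the job's working directory and that sample folders are named `AST-*`; `list_finished_samples` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print the sample folders that already contain a miRDeep2 result table.
# list_finished_samples is an illustrative helper; the result_*.csv pattern is an assumption.
list_finished_samples() {
    local mirdeep_dir="$1" d
    for d in "$mirdeep_dir"/AST-*/; do
        [ -d "$d" ] || continue                       # skip if the glob matched nothing
        if ls "$d"result_*.csv >/dev/null 2>&1; then  # any result table present?
            basename "$d"
        fi
    done
}

# Example:
# list_finished_samples /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2
```

A result file only shows that miRDeep2 reached its reporting step, so report.log and the slurm error file are still worth a look for each sample.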
AST-1617
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1617
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1617.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-1617_reads_collapsed.fa -t AST-1617_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_11212
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 32155434 7494814 24660620 23.308 76.692
seq: 32155434 7494814 24660620 23.308 76.692
Look at the mapping results
head AST-1617_reads_collapsed.fa
>seq_1_x1583958
TAAGACTATGATTATATGCAGCTTCTTGCA
>seq_2_x1378737
ATTGGTTTCGAGATGCAAGAAGCTGCATAT
>seq_3_x447663
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_4_x368633
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_5_x329309
GAGAATTCTACCACTGAACCACCAGTGC
zgrep -c ">" AST-1617_reads_collapsed.fa
8081542
head AST-1617_reads_collapsed_vs_genome.arf
seq_7_x232409 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x232409 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x232409 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x232409 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_16_x108899 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_16_x108899 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_16_x108899 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_16_x108899 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_30_x33370 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 31354 31383 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_30_x33370 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 20610 20639 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-1617_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
1368740
conda deactivate
Run mirdeep2. nano AST-1617_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1617
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-1617 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1617.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1617/AST-1617_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-1617 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293162
AST-1722
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1722
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1722.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-1722_reads_collapsed.fa -t AST-1722_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_13688
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 32860442 11326758 21533684 34.469 65.531
seq: 32860442 11326758 21533684 34.469 65.531
Look at the mapping results
head AST-1722_reads_collapsed.fa
>seq_1_x1663348
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_2_x942322
GGCGAGAATTCTACCACTGAACCACCAGTG
>seq_3_x422823
AGCGAGAATTCTACCACTGAACCACCAGTG
>seq_4_x345835
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_5_x329076
GCACTGTGGTTCAGTGGTAGAATTCTCGCC
zgrep -c ">" AST-1722_reads_collapsed.fa
7574236
head AST-1722_reads_collapsed_vs_genome.arf
seq_6_x311272 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x311272 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x311272 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_6_x311272 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x140612 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x140612 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x140612 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x140612 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x123585 30 1 30 cgagctgttcttcctcgcaaagactgtgtg chromosome_2 30 32824 32853 cgagctgttcttcctcgcaaagactgtgtg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x123585 30 1 30 cgagctgttcttcctcgcaaagactgtgtg chromosome_2 30 43667 43696 cgagctgttcttcctcgcaaagactgtgtg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-1722_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
1688293
conda deactivate
Run mirdeep2. nano AST-1722_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1722
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-1722 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1722.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1722/AST-1722_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-1722 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293188
AST-2000
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2000
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2000.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2000_reads_collapsed.fa -t AST-2000_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_15165
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 34857708 5888651 28969057 16.893 83.107
seq: 34857708 5888651 28969057 16.893 83.107
Look at the mapping results
head AST-2000_reads_collapsed.fa
>seq_1_x719979
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_2_x647300
GAGAATTCTACCACTGAACCACCAGTGC
>seq_3_x228711
GCACTGTGGTTCAGTGGTAGAATTCTC
>seq_4_x206224
GAGAATTCTACCACTGAACCACAGTGC
>seq_5_x161452
GCACTGGTGGTTCAGTGGTAGAATTCT
zgrep -c ">" AST-2000_reads_collapsed.fa
10773578
head AST-2000_reads_collapsed_vs_genome.arf
seq_7_x119692 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x119692 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x119692 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_7_x119692 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_15_x41699 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_15_x41699 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_15_x41699 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_15_x41699 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_27_x28319 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 20610 20639 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_27_x28319 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 42197 42226 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2000_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
1016532
conda deactivate
Run mirdeep2. nano AST-2000_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2000
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2000 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2000.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2000/AST-2000_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2000 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293203
AST-2007
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2007
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2007.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2007_reads_collapsed.fa -t AST-2007_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_18266
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 33119102 7261428 25857674 21.925 78.075
seq: 33119102 7261428 25857674 21.925 78.075
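The %mapped and %unmapped columns can be re-derived from the counts as a quick sanity check. This is a minimal sketch using the AST-2007 totals above; the variable names are illustrative and are not part of the mapper.pl output.

```shell
# Recompute %mapped and %unmapped from the mapper.pl counts reported for AST-2007.
total=33119102
mapped=7261428
unmapped=$((total - mapped))

# awk does the floating-point division; three decimals matches the mapper.pl table.
pct_mapped=$(awk -v m="$mapped" -v t="$total" 'BEGIN { printf "%.3f", 100 * m / t }')
pct_unmapped=$(awk -v u="$unmapped" -v t="$total" 'BEGIN { printf "%.3f", 100 * u / t }')

echo "mapped=${pct_mapped}% unmapped=${pct_unmapped}% (unmapped reads: ${unmapped})"
```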
Look at the mapping results
head AST-2007_reads_collapsed.fa
>seq_1_x745390
TGAAAATCTTTTCTCTGAAGTGGAA
>seq_2_x666548
TTCCACTTCAGAGAAAAGATTTTCA
>seq_3_x176638
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_4_x162570
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_5_x157967
GAGAATTCTACCACTGAACCACCAGTGC
zgrep -c ">" AST-2007_reads_collapsed.fa
8653745
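Each collapsed header has the form seq_<rank>_x<count>, so counting ">" lines gives the number of unique sequences, not the number of reads. The sketch below tallies both on the five AST-2007 records shown above rather than on the full file; the temporary demo file is a stand-in, and on the real data the same two commands would be pointed at AST-2007_reads_collapsed.fa.

```shell
# Build a tiny stand-in FASTA from the first five AST-2007 records shown above.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>seq_1_x745390
TGAAAATCTTTTCTCTGAAGTGGAA
>seq_2_x666548
TTCCACTTCAGAGAAAAGATTTTCA
>seq_3_x176638
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_4_x162570
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_5_x157967
GAGAATTCTACCACTGAACCACCAGTGC
EOF

# Number of unique (collapsed) sequences = number of header lines.
n_unique=$(grep -c '^>' "$demo_fa")

# Number of reads represented = sum of the counts after "_x" in each header.
n_reads=$(awk -F'_x' '/^>/ { sum += $2 } END { print sum }' "$demo_fa")

echo "unique sequences: $n_unique, reads represented: $n_reads"
```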
head AST-2007_reads_collapsed_vs_genome.arf
seq_9_x90166 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x90166 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x90166 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x90166 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_11_x64281 30 1 30 agcacacacagtctttgcgaggaagaacag chromosome_2 30 22076 22105 agcacacacagtctttgcgaggaagaacag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_11_x64281 30 1 30 agcacacacagtctttgcgaggaagaacag chromosome_2 30 43663 43692 agcacacacagtctttgcgaggaagaacag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_11_x64281 30 1 30 agcacacacagtctttgcgaggaagaacag chromosome_2 30 32820 32849 agcacacacagtctttgcgaggaagaacag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x52870 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 879106 879135 acaaatcttagaacaaaggcttaatctcag - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x52870 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 38360 38389 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x52870 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 27510 27539 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2007_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
949862
conda deactivate
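The count above (949862) is the number of distinct collapsed sequences with at least one genome hit, out of the 8653745 collapsed sequences in the .fa file. It does not count reads or alignments, because one sequence can hit several loci. A small sketch of that distinction, using three of the AST-2007 ARF rows shown above in a temporary stand-in file:

```shell
# Stand-in ARF: three rows copied from the AST-2007 output above, converted to tabs
# because real ARF files are tab-separated (which is what cut -f1 relies on).
demo_arf=$(mktemp)
tr ' ' '\t' > "$demo_arf" <<'EOF'
seq_9_x90166 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x90166 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_11_x64281 30 1 30 agcacacacagtctttgcgaggaagaacag chromosome_2 30 22076 22105 agcacacacagtctttgcgaggaagaacag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
EOF

# Same result as `cut -f1 | sort | uniq | wc -l`; sort -u merges the sort and uniq steps.
n_mapped_seqs=$(cut -f1 "$demo_arf" | sort -u | wc -l | tr -d ' ')

# Total alignments (rows) and hits for one multi-mapping sequence.
n_alignments=$(wc -l < "$demo_arf" | tr -d ' ')
hits_seq9=$(awk -F'\t' '$1 == "seq_9_x90166"' "$demo_arf" | wc -l | tr -d ' ')

echo "mapped sequences: $n_mapped_seqs, alignments: $n_alignments, seq_9 hits: $hits_seq9"
```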
Run mirdeep2. nano AST-2007_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2007
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2007 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2007.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2007/AST-2007_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2007 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293252
AST-2302
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2302
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2302.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2302_reads_collapsed.fa -t AST-2302_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_35942
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 33330740 5544909 27785831 16.636 83.364
seq: 33330740 5544909 27785831 16.636 83.364
Look at the mapping results
head AST-2302_reads_collapsed.fa
>seq_1_x184430
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_2_x151792
TGAAAATCTTTTCTCTGAAGTGGAA
>seq_3_x131568
TTCCACTTCAGAGAAAAGATTTTCA
>seq_4_x102320
GGCGAGAATTCTACCACTGAACCACCAGTG
>seq_5_x88502
GCACTGGTGGTTCAGTGGTAGAATTCTC
zgrep -c ">" AST-2302_reads_collapsed.fa
11391780
head AST-2302_reads_collapsed_vs_genome.arf
seq_8_x67603 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_8_x67603 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_8_x67603 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_8_x67603 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_21_x35069 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 52946 52975 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_21_x35069 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 31354 31383 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_21_x35069 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 20610 20639 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_21_x35069 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 42197 42226 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_30_x23771 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 38360 38389 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_30_x23771 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 27510 27539 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2302_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
1149823
conda deactivate
Run mirdeep2. nano AST-2302_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2302
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2302 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2302.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2302/AST-2302_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2302 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293254
AST-2360
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2360
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2360.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2360_reads_collapsed.fa -t AST-2360_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_39928
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 33296712 4751266 28545446 14.269 85.731
seq: 33296712 4751266 28545446 14.269 85.731
Look at the mapping results
head AST-2360_reads_collapsed.fa
>seq_1_x265109
TAAGACTATGATTATATGCAGCTTCTTGCA
>seq_2_x223703
ATTGGTTTCGAGATGCAAGAAGCTGCATAT
>seq_3_x134102
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_4_x86252
AACTTTTGACGGTGGATCTCTTGGCTCACG
>seq_5_x66786
GGCGAGAATTCTACCACTGAACCACCAGTG
zgrep -c ">" AST-2360_reads_collapsed.fa
9737775
head AST-2360_reads_collapsed_vs_genome.arf
seq_4_x86252 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_4_x86252 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_4_x86252 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_4_x86252 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_22_x22715 30 1 30 aagagcgccatttgcgttcaaagattcgat chromosome_2 30 52982 53011 aagagcgccatttgcgttcaaagattcgat + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_22_x22715 30 1 30 aagagcgccatttgcgttcaaagattcgat chromosome_2 30 31390 31419 aagagcgccatttgcgttcaaagattcgat + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_22_x22715 30 1 30 aagagcgccatttgcgttcaaagattcgat chromosome_2 30 20646 20675 aagagcgccatttgcgttcaaagattcgat + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_22_x22715 30 1 30 aagagcgccatttgcgttcaaagattcgat chromosome_2 30 42233 42262 aagagcgccatttgcgttcaaagattcgat + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_24_x22365 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_24_x22365 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2360_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
464780
conda deactivate
Run mirdeep2. nano AST-2360_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2360
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2360 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2360.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2360/AST-2360_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2360 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293255
AST-2398
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2398
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2398.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2398_reads_collapsed.fa -t AST-2398_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_48437
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 33576416 7946641 25629775 23.667 76.333
seq: 33576416 7946641 25629775 23.667 76.333
Look at the mapping results
head AST-2398_reads_collapsed.fa
>seq_1_x881553
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_2_x433411
GGCGAGAATTCTACCACTGAACCACCAGTG
>seq_3_x266696
AGCGAGAATTCTACCACTGAACCACCAGTG
>seq_4_x218909
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_5_x195072
GAGAATTCTACCACTGAACCACCAGTGC
zgrep -c ">" AST-2398_reads_collapsed.fa
9507430
head AST-2398_reads_collapsed_vs_genome.arf
seq_8_x145611 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_8_x145611 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_8_x145611 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_8_x145611 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_17_x64706 29 1 29 aatggataaccctcaaccgtccggacctc chromosome_14 29 6295369 6295397 aatggataaccctcaaccgtccggacctc - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_18_x61937 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_18_x61937 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_18_x61937 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_18_x61937 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_21_x38252 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 27510 27539 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2398_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
conda deactivate
Run mirdeep2. nano AST-2398_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2398
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2398 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2398.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2398/AST-2398_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2398 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293257
Need to run the quantifier module for all samples: https://github.com/rajewsky-lab/mirdeep2/tree/master
How do I find the MFE? Is it calculated by mirdeep2 or by the quantifier module? I think it is linked to the randfold step. Need to look into this.
I looked at the mirdeep2 PDF outputs in the supplementary materials of Gajigan & Conaco 2017, and their MFE values are similar to mine. However, their Table S5 reports MFE values below -25 kcal/mol. How did they get from the MFE value that mirdeep2 gave them to the MFE displayed in their table?
20240124
All of the mirdeep2 jobs finished last night/early this morning. Now time to start some more.
AST-2404
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2404
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2404.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2404_reads_collapsed.fa -t AST-2404_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_58495
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 33425806 5062194 28363612 15.145 84.855
seq: 33425806 5062194 28363612 15.145 84.855
Look at the mapping results
head AST-2404_reads_collapsed.fa
>seq_1_x144129
GGAAGAGCACACGTCTGAACTCCAGTCACC
>seq_2_x132047
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_3_x124405
AACTTTTGACGGTGGATCTCTTGGCTCACG
>seq_4_x97061
TCGGACTGTAGAACTCTGAACGTGTAGATC
>seq_5_x94760
GCACTGATGGTTCAGTGGTAGAATTCTCGC
zgrep -c ">" AST-2404_reads_collapsed.fa
11599063
head AST-2404_reads_collapsed_vs_genome.arf
seq_3_x124405 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_3_x124405 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_3_x124405 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_3_x124405 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x51804 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x51804 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x51804 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_9_x51804 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_11_x39689 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 52946 52975 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_11_x39689 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 31354 31383 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2404_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
881575
conda deactivate
Run mirdeep2. nano AST-2404_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2404
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2404 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2404.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2404/AST-2404_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2404 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293308
AST-2412
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2412
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2412.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2412_reads_collapsed.fa -t AST-2412_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_59645
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 34977016 2244564 32732452 6.417 93.583
seq: 34977016 2244564 32732452 6.417 93.583
Look at the mapping results
head AST-2412_reads_collapsed.fa
>seq_1_x1651314
GACTTTGTAGCATAGGTAAGGTTAGTGCAT
>seq_2_x1124747
TCAGATGCACTAACCTTACCTATGCTACAA
>seq_3_x189997
CAGATGCACTAACCTTACCTATGCTACAAA
>seq_4_x98647
AACTTTTGACGGTGGATCTCTTGGCTCACG
>seq_5_x84357
ATCAGATGCACTAACCTTACCTATGCTACA
zgrep -c ">" AST-2412_reads_collapsed.fa
10485205
head AST-2412_reads_collapsed_vs_genome.arf
seq_4_x98647 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_4_x98647 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_4_x98647 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_4_x98647 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x44893 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x44893 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x44893 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x44893 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_33_x19915 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 42197 42226 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_33_x19915 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 52946 52975 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2412_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
415100
conda deactivate
Run mirdeep2. nano AST-2412_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2412
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2412 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2412.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2412/AST-2412_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2412 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293309
AST-2512
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2512
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2512.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2512_reads_collapsed.fa -t AST-2512_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_60774
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 32531432 1232618 31298814 3.789 96.211
seq: 32531432 1232618 31298814 3.789 96.211
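Mapping rates differ a lot between samples (3.789% here versus 15.145% for AST-2404), so a side-by-side table makes outliers easier to spot. The sketch below assumes the "Mapping statistics" lines from each run were saved to a file named <sample>_mapper_stats.txt. That file name is hypothetical and is not something mapper.pl writes; the stand-in files are built from the numbers reported in this notebook.

```shell
work=$(mktemp -d)

# Stand-in stats files holding the lines that mapper.pl printed for three samples.
header='#desc total mapped unmapped %mapped %unmapped'
printf '%s\ntotal: 33425806 5062194 28363612 15.145 84.855\n' "$header" > "$work/AST-2404_mapper_stats.txt"
printf '%s\ntotal: 34977016 2244564 32732452 6.417 93.583\n'  "$header" > "$work/AST-2412_mapper_stats.txt"
printf '%s\ntotal: 32531432 1232618 31298814 3.789 96.211\n'  "$header" > "$work/AST-2512_mapper_stats.txt"

# Pull the "total:" line out of each file into one tab-separated summary.
summary="$work/mapping_summary.tsv"
printf 'sample\ttotal\tmapped\tpct_mapped\n' > "$summary"
for f in "$work"/*_mapper_stats.txt; do
  sample=$(basename "$f" _mapper_stats.txt)
  awk -v s="$sample" '$1 == "total:" { printf "%s\t%s\t%s\t%s\n", s, $2, $3, $5 }' "$f" >> "$summary"
done
cat "$summary"

# Sample with the lowest mapping rate (skip the header, sort numerically on column 4).
tab=$(printf '\t')
lowest=$(tail -n +2 "$summary" | sort -t "$tab" -k4,4n | head -n 1 | cut -f1)
echo "lowest mapping rate: $lowest"
```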
Look at the mapping results
head AST-2512_reads_collapsed.fa
>seq_1_x1229410
AAATACAAATCGTTCAGGTATTAGGAGTGA
>seq_2_x900725
AGCTCACTCCTAATACCTGAACGATTTGTA
>seq_3_x539793
AGATGGAATTGTAGCATG
>seq_4_x488884
CATGCTACAATTCCATCT
>seq_5_x298053
ACTGGATAACTAAGGGAAAGTTTGGCTAAT
zgrep -c ">" AST-2512_reads_collapsed.fa
9710109
head AST-2512_reads_collapsed_vs_genome.arf
seq_20_x37501 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x37501 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x37501 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_20_x37501 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_51_x16071 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 42321 42350 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_51_x16071 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 53070 53099 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_51_x16071 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_51_x16071 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_76_x10820 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 42197 42226 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_76_x10820 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 52946 52975 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2512_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
237458
conda deactivate
Run mirdeep2. nano AST-2512_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2512
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2512 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2512.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2512/AST-2512_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2512 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293310
AST-2523
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2523
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2523.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2523_reads_collapsed.fa -t AST-2523_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_61964
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 33990530 4117469 29873061 12.114 87.886
seq: 33990530 4117469 29873061 12.114 87.886
Look at the mapping results
head AST-2523_reads_collapsed.fa
zgrep -c ">" AST-2523_reads_collapsed.fa
head AST-2523_reads_collapsed_vs_genome.arf
cut -f1 AST-2523_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
conda deactivate
Run mirdeep2. nano AST-2523_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2523
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2523 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2523.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2523/AST-2523_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2523 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293319
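The mirdeep2 SLURM scripts above are identical apart from the sample ID, so they could be generated from one template instead of edited by hand in nano. A hedged sketch: the loop and the variable names (base, genome, out_dir) are illustrative, the scripts are written to a temporary directory, and nothing is submitted.

```shell
# Hypothetical helper: write one mirdeep2 SLURM script per sample from a shared template.
# Scripts land in a temporary directory and are NOT submitted; sbatch is left to the user.
base=/data/putnamlab/jillashey/Astrangia2021/smRNA
genome=/data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta
out_dir=$(mktemp -d)

for sample in AST-2404 AST-2412 AST-2512 AST-2523; do
  script="${out_dir}/${sample}_mirdeep2.sh"
  # Unquoted heredoc: ${sample} etc. expand now, while \$(date) stays literal for run time.
  cat > "$script" <<EOF
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=jillashey@uri.edu
#SBATCH --account=putnamlab
#SBATCH -D ${base}/mirdeep2/${sample}
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample ${sample} trimmed to 30bp" \$(date)
miRDeep2.pl ${base}/data/trim/flexbar/17_sed.collapse.cat.${sample}.fastq ${genome} ${base}/mirdeep2/${sample}/${sample}_reads_collapsed_vs_genome.arf ${base}/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample ${sample} trimmed to 30bp" \$(date)
conda deactivate
EOF
done

n_scripts=$(ls "$out_dir" | wc -l | tr -d ' ')
echo "wrote $n_scripts scripts to $out_dir"
```

Each generated file could then be reviewed and submitted with sbatch from the scripts folder, the same way as the hand-written versions.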
I want to try running the quantifier module. The mirdeep2 GitHub says that the input should be:
- A FASTA file with precursor sequences,
- A FASTA file with mature miRNA sequences,
- A FASTA file with deep sequencing reads, and
- Optionally a FASTA file with star sequences and the 3 letter code of the species of interest.
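Since every input is expected to be FASTA, and the reads file used in this notebook has a .fastq extension even though mapper.pl was run on it with -c (the FASTA-input flag), it may be worth checking the actual format before running the quantifier. A minimal sketch: is_fasta is a hypothetical helper, not part of mirdeep2, and it only looks at the first non-empty line.

```shell
# Hypothetical helper: succeed if the first non-empty line of a file starts with ">".
is_fasta() {
  first_line=$(awk 'NF { print; exit }' "$1")
  case "$first_line" in
    '>'*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Stand-in files: one FASTA record and one FASTQ record.
work=$(mktemp -d)
printf '>seq_1_x745390\nTGAAAATCTTTTCTCTGAAGTGGAA\n' > "$work/demo_reads.fa"
printf '@read1\nTGAAAATCTTTTCTCTGAAGTGGAA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n' > "$work/demo_reads.fastq"

for f in "$work/demo_reads.fa" "$work/demo_reads.fastq"; do
  if is_fasta "$f"; then
    echo "$(basename "$f"): looks like FASTA"
  else
    echo "$(basename "$f"): does not look like FASTA"
  fi
done
```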
I need to create fasta files with the known and novel mature and precursor sequences. Going to test this on AST-1065 first.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1065/mirna_results_22_01_2024_t_17_11_47
cat known_mature_22_01_2024_t_17_11_47_score-50_to_na.fa novel_mature_22_01_2024_t_17_11_47_score-50_to_na.fa > AST-1065-known_novel_mature.fa
cat known_pres_22_01_2024_t_17_11_47_score-50_to_na.fa novel_pres_22_01_2024_t_17_11_47_score-50_to_na.fa > AST-1065-known_novel_pres.fa
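After concatenating, a quick check is that the combined file has as many records as the two inputs together and that no header appears twice. The sketch below shows this on small stand-in files; the file names and sequences are made up for illustration, and the same grep commands would apply to the AST-1065 files above.

```shell
work=$(mktemp -d)

# Stand-in known and novel mature FASTA files (hypothetical IDs and sequences).
printf '>known_1\nTGAAAATCTTTTCTCTGAAGTGG\n>known_2\nGCACTGGTGGTTCAGTGGTAGA\n' > "$work/known_mature.fa"
printf '>novel_1\nAACTTTTGACGGTGGATCTCTT\n' > "$work/novel_mature.fa"

cat "$work/known_mature.fa" "$work/novel_mature.fa" > "$work/known_novel_mature.fa"

# Record counts: the combined file should equal known + novel.
n_known=$(grep -c '^>' "$work/known_mature.fa")
n_novel=$(grep -c '^>' "$work/novel_mature.fa")
n_combined=$(grep -c '^>' "$work/known_novel_mature.fa")

# Headers that occur more than once would make the quantifier counts ambiguous.
n_dup_ids=$(grep '^>' "$work/known_novel_mature.fa" | sort | uniq -d | wc -l | tr -d ' ')

echo "known=$n_known novel=$n_novel combined=$n_combined duplicate_ids=$n_dup_ids"
```

If an input file lacks a trailing newline, cat fuses its last sequence line with the next header, and the combined count comes out one short.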
Now I have files with the known and novel mature and precursor sequences. I hope I am correct in my interpretation that the mature and precursor seqs should come from the mirdeep2 output. Time to run quantifier!
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1065
conda activate /data/putnamlab/mirdeep2
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1065/mirna_results_22_01_2024_t_17_11_47/AST-1065-known_novel_pres.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1065/mirna_results_22_01_2024_t_17_11_47/AST-1065-known_novel_mature.fa -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq
getting samples and corresponding read numbers
Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 35658222 69287 35588935 0.194 99.806
seq: 35658222 69287 35588935 0.194 99.806
analyzing data
326 mature mappings to precursors
Expressed miRNAs are written to expression_analyses/expression_analyses_1706128068/miRNA_expressed.csv
not expressed miRNAs are written to expression_analyses/expression_analyses_1706128068/miRNA_not_expressed.csv
Creating miRBase.mrd file
Mapped READS readin - DONE
Very low mapping, but 326 mature mappings to precursors is good because I had 326 mature miRNA seqs to begin with. Looking at the output, the mirdeep2.pl module and the quantifier.pl module produce different read counts. I’m still not sure if I’m supposed to use the mirbase mature and precursor seqs or the ones produced from the mirdeep2 output. Let’s try running the quantifier module again using the mirbase info instead. I’m using the cnidarian mature miRNA fasta that I created a while ago. I downloaded the hairpin.fa file from mirbase to use as the fasta for the precursor sequences.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1065
conda activate /data/putnamlab/mirdeep2
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/hairpin.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-1065.fastq
getting samples and corresponding read numbers
Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 35658222 168104 35490118 0.471 99.529
seq: 35658222 168104 35490118 0.471 99.529
analyzing data
53298 mature mappings to precursors
Expressed miRNAs are written to expression_analyses/expression_analyses_1706129674/miRNA_expressed.csv
not expressed miRNAs are written to expression_analyses/expression_analyses_1706129674/miRNA_not_expressed.csv
Creating miRBase.mrd file
Mapped READS readin - DONE
Instead of creating pdfs, I am getting this: Negative repeat count does nothing at /data/putnamlab/mirdeep2/bin/quantifier.pl line 1312, <IN> line 122912.
Not sure what it means but it makes me think that I should use the former script for quantifying (the one that uses the precursor and mature fastas from the mirdeep2 output). Okay yes I should use the other script. The expressed miRNA counts were just for the IDs from mirbase, so that doesn’t help me very much.
Also tried to figure out how to calculate MFE based on the supplemental files from the Gajigan & Conaco 2017 paper. In the supplement, they included their PDF output with the MFE scores (which are similar to mine) and their supplemental table with MFE values in kcal/mol, which were all -18 or lower. I plotted them against one another (x axis was MFE in kcal/mol, y axis was MFE from pdf) and fit a line (y = -0.0437*x + 0.73, R-squared = 0.604). Then I rearranged the equation to solve for x (x = (0.73-y)/0.0437). I used the MFE pdf value as the y and attempted to re-calculate the MFE in kcal/mol. The equation did not exactly capture MFE, as the output was not the same as the MFE values reported. Here is the google sheet where I did those calculations. HOW TO CALCULATE MFE??????? This paper and this paper may help. Will look into it tomorrow.
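The rearranged equation can be written as a small helper. This is only a sketch of the empirical conversion described above: the function name `pdf_mfe_to_kcal` is made up for illustration, and the slope and intercept are the fitted values quoted in these notes (R-squared = 0.604), so the result is an approximation and not a true MFE.

```shell
#!/bin/bash
# Hypothetical helper: invert the fitted line y = -0.0437*x + 0.73,
# where y is the MFE value from the mirdeep2 PDF and x is MFE in kcal/mol.
pdf_mfe_to_kcal() {
    awk -v y="$1" 'BEGIN { printf "%.2f\n", (0.73 - y) / 0.0437 }'
}

# A PDF value of 1.5166 sits on the fitted line at x = -18 kcal/mol
pdf_mfe_to_kcal 1.5166
```

Because the fit only explains about 60% of the variance, values recovered this way are probably best treated as a rough check rather than something to report.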
20240129
RE my notes above, I did not look at them but it is still on my list of things to do. First, I’m going to prioritize finishing running mirdeep2 for all samples.
AST-2563
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2563
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2563.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2563_reads_collapsed.fa -t AST-2563_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_53481
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 34046004 3503553 30542451 10.291 89.709
seq: 34046004 3503553 30542451 10.291 89.709
Look at the mapping results
head AST-2563_reads_collapsed.fa
>seq_1_x117943
ATGCGTAGTGGAATACTCTGGAAAGTGT
>seq_2_x105383
ACACTTTCCAGAGTATTCCACTACGCAT
>seq_3_x69250
AACTTTTGACGGTGGATCTCTTGGCTCACG
>seq_4_x59605
GGAAGAGCACACGTCTGAACTCCAGTCACA
>seq_5_x56451
AGAAATGTGTGTAGCTGAGCAGTACTAATT
zgrep -c ">" AST-2563_reads_collapsed.fa
11689234
head AST-2563_reads_collapsed_vs_genome.arf
seq_3_x69250 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_3_x69250 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_3_x69250 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_3_x69250 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_16_x27176 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 42197 42226 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_16_x27176 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 52946 52975 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_16_x27176 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 31354 31383 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_16_x27176 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 20610 20639 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_26_x22828 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_26_x22828 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2563_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
671528
conda deactivate
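The `cut | sort | uniq | wc -l` count above gives the number of unique collapsed sequences that mapped. A possible extension, sketched below, also sums the read counts carried in the `_xCOUNT` suffix of each ID, so that the number of reads those sequences represent is visible too. The function name `arf_summary` is an assumption for illustration; it only relies on column 1 of the .arf file holding IDs like `seq_3_x69250`, as in the `head` output above.

```shell
#!/bin/bash
# Hypothetical helper: summarise the unique read IDs in a mapper.pl .arf file.
arf_summary() {
    # Column 1 is the collapsed read ID (seq_N_xCOUNT); a sequence that maps
    # to several loci appears on several lines, so de-duplicate first.
    cut -f1 "$1" | sort -u | awk -F'_x' '
        { n++; reads += $2 }
        END { printf "unique_mapped=%d reads_represented=%d\n", n, reads }'
}

# Usage (file name from this sample):
# arf_summary AST-2563_reads_collapsed_vs_genome.arf
```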
Run mirdeep2. nano AST-2563_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2563
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2563 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2563.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2563/AST-2563_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2563 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293787
AST-2729
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2729
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2729.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2729_reads_collapsed.fa -t AST-2729_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_54539
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 34243738 1423780 32819958 4.158 95.842
seq: 34243738 1423780 32819958 4.158 95.842
Look at the mapping results
head AST-2729_reads_collapsed.fa
>seq_1_x228146
GCACTGATGGTTCAGTGGTAGAATTCTCGC
>seq_2_x188816
GGAAGAGCACACGTCTGAACTCCAGTCACC
>seq_3_x137884
TCGGACTGTAGAACTCTGAACGTGTAGATC
>seq_4_x104317
AACTCTAAGCGGTGGATCACTCGGCTCGTG
>seq_5_x86774
ACACACGAGCCGAGTGATCCACCGCTTAGA
zgrep -c ">" AST-2729_reads_collapsed.fa
10703305
head AST-2729_reads_collapsed_vs_genome.arf
seq_36_x27457 30 1 30 tggctccccggcggggaatcgaaccccggt chromosome_14 30 42692589 42692618 tggctccccggcggggaatcgaaccccggt - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_59_x18783 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_59_x18783 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_59_x18783 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_59_x18783 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_94_x13183 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 52946 52975 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_94_x13183 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 31354 31383 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_94_x13183 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 20610 20639 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_94_x13183 30 1 30 tccgacactcagacagacatgctcctggga chromosome_2 30 42197 42226 tccgacactcagacagacatgctcctggga + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_201_x7329 30 1 30 tggctccccggcggggaaatgaaccccggt chromosome_14 30 42692589 42692618 tggctccccggcggggaaaggaaccccggt - 2 mmmmmmmmmmmmmmmmmmMMmmmmmmmmmm
cut -f1 AST-2729_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
299482
conda deactivate
Run mirdeep2. nano AST-2729_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2729
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2729 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2729.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2729/AST-2729_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2729 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293788
AST-2755
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2755
Run mapping step
conda activate /data/putnamlab/mirdeep2
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2755.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s AST-2755_reads_collapsed.fa -t AST-2755_reads_collapsed_vs_genome.arf -v
discarding short reads
mapping reads to genome index
trimming unmapped nts in the 3' ends
Log file for this run is in mapper_logs and called mapper.log_55943
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 32539646 3255098 29284548 10.003 89.997
seq: 32539646 3255098 29284548 10.003 89.997
Look at the mapping results
head AST-2755_reads_collapsed.fa
>seq_1_x282696
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_2_x165033
GGCGAGAATTCTACCACTGAACCACCAGTG
>seq_3_x76982
AGCGAGAATTCTACCACTGAACCACCAGTG
>seq_4_x73794
TGAAAATCTTTTCTCTGAAGTGGAA
>seq_5_x66624
TTCCACTTCAGAGAAAAGATTTTCA
zgrep -c ">" AST-2755_reads_collapsed.fa
10834414
head AST-2755_reads_collapsed_vs_genome.arf
seq_12_x49088 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 42323 42352 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x49088 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 53072 53101 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x49088 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 20736 20765 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_12_x49088 30 1 30 aacttttgacggtggatctcttggctcacg chromosome_2 30 31480 31509 aacttttgacggtggatctcttggctcacg - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_35_x24220 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 879106 879135 acaaatcttagaacaaaggcttaatctcag - 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_35_x24220 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 38360 38389 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_35_x24220 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 27510 27539 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_35_x24220 30 1 30 acaaatcttagaacaaaggcttaatctcag chromosome_2 30 49105 49134 acaaatcttagaacaaaggcttaatctcag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_40_x21090 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 31478 31507 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_40_x21090 30 1 30 tgcgtgagccaagagatccaccgtcaaaag chromosome_2 30 20734 20763 tgcgtgagccaagagatccaccgtcaaaag + 0 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cut -f1 AST-2755_reads_collapsed_vs_genome.arf | sort | uniq | wc -l
588859
conda deactivate
Run mirdeep2. nano AST-2755_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2755
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 on sample AST-2755 trimmed to 30bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/17_sed.collapse.cat.AST-2755.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-2755/AST-2755_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2 concluded for sample AST-2755 trimmed to 30bp" $(date)
conda deactivate
Submitted batch job 293789. All of the mirdeep2 samples finished running today, yay!!!
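The per-sample blocks above differ only in the sample ID, so the commands could be generated in a loop instead of being typed out each time. Below is a dry-run sketch that only prints the mapper.pl command for each sample (it does not run anything). The function name `print_mapper_cmd` and the `base` variable are made up for illustration; the paths and flags are copied from the commands above.

```shell
#!/bin/bash
# Dry run: print (do not execute) the mapper.pl command for each sample ID.
base=/data/putnamlab/jillashey/Astrangia2021/smRNA

print_mapper_cmd() {
    local s="$1"
    printf 'mapper.pl %s/data/trim/flexbar/17_sed.collapse.cat.%s.fastq -c -p %s/scripts/Apoc_ref.btindex -s %s_reads_collapsed.fa -t %s_reads_collapsed_vs_genome.arf -v\n' \
        "$base" "$s" "$base" "$s" "$s"
}

for sample in AST-2563 AST-2729 AST-2755; do
    print_mapper_cmd "$sample"
done
```

Printing first makes it easy to compare the generated lines against the ones that were run by hand before handing them to the scheduler.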
20240130
I am now looking at all of the output data in an R script and I am thinking that I should concatenate all of the reads together, regardless of timepoint or treatment. Since it’s so many files, I’m going to write a script to concatenate and collapse.
In the scripts folder: nano cat_collapse_all.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FASTX-Toolkit/0.0.14-GCC-9.3.0
echo "Concatenate and collapse smRNA reads from ALL samples" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
# Concatenate reads
cat AST*.fastq > cat.all.fastq
echo "Reads concatenated, start collapse" $(date)
# Collapse concatenated reads
fastx_collapser -v -i cat.all.fastq -o collapse.cat.all.fastq
echo "Reads collapsed, start header adjustments for mirdeep2" $(date)
sed '/^>/ s/-/_x/g' collapse.cat.all.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.all.fastq
echo "Headers adjusted, start removing sequences <17 nts" $(date)
# Define the input and output files
input_file="sed.collapse.cat.all.fastq"
output_file="17_sed.collapse.cat.all.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.all.fastq
echo "Sequences removed, ready for mirdeep2" $(date)
Submitted batch job 293863. Ran into this error: fastx_collapser: Error: invalid quality score data on line 234199948 (quality_tok = "AAFFFJAJJJJJJJJJJJJJJJJJJJJ@GWNJ-0957:1001:GW2306054826th:4:1101:1560:1801 1:N:0:GTAGAGAT". Maybe the concatenate step didn’t work so well? I’m not sure why it would have failed but maybe the samples ‘catted’ together so that the first line of one sample ended up with the last line of another sample.
Trying to look at the line where things errored out:
sed -n '234199948,+20p' cat.all.fastq
AAFFFJAJJJJJJJJJJJJJJJJJJJJ
@GWNJ-0957:1001:GW2306054826th:4:1101:1560:1801 1:N:0:GTAGAGAT
CACCCCTCTTCCAATAACTTTACCTCTTA
+
AAAFA<F<FFJFJFA-FFJF-<FFJJJF7
@GWNJ-0957:1001:GW2306054826th:4:1101:1621:1801 1:N:0:GTAGAGAT
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
+
AAAFFJJJJJA--FJ7AAJFAJ-FF<F-7J
@GWNJ-0957:1001:GW2306054826th:4:1101:2067:1801 1:N:0:GTAGAGAT
TTTTGAAATCTAGAAGCTATGAAACT
+
AAAFFJJJJJJJFFJ<FFFF<JJJJJ
@GWNJ-0957:1001:GW2306054826th:4:1101:2087:1801 1:N:0:GTAGAGAT
TGATCTTGTAGGTTCCATCTTAT
+
AA<AA<FFJ7F7<7AAAJFJFJF
@GWNJ-0957:1001:GW2306054826th:4:1101:2189:1801 1:N:0:GTAGAGAT
GCCTTTGTGCTATGATCTGTTGAGGTTCTG
+
AA<AFJJJFJJJ7AFFJJJFFJJFJJJ-F-
@GWNJ-0957:1001:GW2306054826th:4:1101:3224:1801 1:N:0:GTAGAGAT
I appear to be correct in that one line was catted to the end of another line. How to prevent this???? Idk if this is even the only instance where this happened in the cat file. I added the line below to the cat_collapse_all.sh script after the reads are concatenated and before the collapsing.
sed 's/@/\n@/g' cat.all.fastq > check.cat.all.fastq
Submitted batch job 293874. Now I’m getting this error: fastx_collapser: input file (check.cat.all.fastq) has unknown file format (not FASTA or FASTQ), first character = (10). Character 10 is a newline, so the sed line most likely put a blank line at the very top of the file (before the first @). I’m adding this line in after the sed line: grep -v '^[[:space:]]*$' check.cat.all.fastq > check.cat.all.fastq. This should remove any blank lines in the data. Submitted batch job 293876
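One way to find out which input file is behind the error at line 234199948 is to check each FASTQ before concatenating, rather than repairing the concatenated file afterwards. The sketch below is an assumption about what is worth checking, not a confirmed diagnosis: it flags a file that lacks a trailing newline (which would fuse with the next file under `cat`), a file whose line count is not a multiple of 4, or a record whose sequence and quality lines differ in length. The function name `check_fastq` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: report whether a FASTQ file looks safe to concatenate.
check_fastq() {
    local f="$1"
    # If the last byte is not a newline, cat would glue the next file onto it.
    if [ -n "$(tail -c 1 "$f")" ]; then
        echo "$f: no trailing newline"
        return 1
    fi
    # Each record is 4 lines; line 2 (sequence) and line 4 (quality)
    # must have the same length.
    awk -v f="$f" '
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 0 && length($0) != seqlen {
            printf "%s: length mismatch at line %d\n", f, NR
            bad = 1
            exit
        }
        END {
            if (!bad && NR % 4 != 0) {
                printf "%s: %d lines, not a multiple of 4\n", f, NR
                bad = 1
            }
            exit bad + 0
        }' "$f"
}

# Usage (from the flexbar folder):
# for f in AST*.fastq; do check_fastq "$f" || echo "problem with $f"; done
```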
20240130
Now I’m getting this error: grep: input file ‘check.cat.all.fastq’ is also the output. fastx_collapser: Premature End-Of-File (filename ='check.cat.all.fastq'). I’m going to rename the grep output so that it is grep.check.cat.all.fastq and change the input file name to grep.check.cat.all.fastq for fastx. Submitted batch job 293889. Ran for 3 hours and then failed. This is the error: fastx_collapser: Error: invalid quality score data on line 234199948 (quality_tok = "AAFFFJAJJJJJJJJJJJJJJJJJJJJ". AGGGGGGGHHHHHHGGHHHH. I don’t know what to do… Okay, ChatGPT wrote an awk script for me. In the scripts folder, I did nano filter_length.awk and put the following code in:
awk '{
# Read the header line
header = $0
# Read the sequence line
getline
sequence = $0
# Read the "+" line
getline
# Read the quality score line
quality = $0
# Check if sequence length matches quality score length
if (length(sequence) == length(quality)) {
# Print the record
print header
print sequence
print "+"
print quality
}
}' input.fastq > output.fastq
This code SHOULD remove any record whose sequence length differs from the length of its quality score line. I added this line: awk -f /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/filter_length.awk grep.check.cat.all.fastq > awk.grep.check.cat.all.fastq below the grep line. Now cat_collapse_all.sh looks like this:
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FASTX-Toolkit/0.0.14-GCC-9.3.0
echo "Concatenate and collapse smRNA reads from ALL samples" $(date)
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
# Concatenate reads
#cat AST*.fastq > cat.all.fastq
# Make sure there are no issues with cat - ensure that every @ symbol is a new line
echo "Make sure all lines start with @ symbol" $(date)
sed 's/@/\n@/g' cat.all.fastq > check.cat.all.fastq
echo "Remove any blank lines" $(date)
grep -v '^[[:space:]]*$' check.cat.all.fastq > grep.check.cat.all.fastq
echo "Remove any sequences where the length of the sequence and length of quality score differ" $(date)
awk -f /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/filter_length.awk grep.check.cat.all.fastq > awk.grep.check.cat.all.fastq
echo "Reads concatenated, start collapse" $(date)
# Collapse concatenated reads
fastx_collapser -v -i awk.grep.check.cat.all.fastq -o collapse.cat.all.fastq
echo "Reads collapsed, start header adjustments for mirdeep2" $(date)
sed '/^>/ s/-/_x/g' collapse.cat.all.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.all.fastq
echo "Headers adjusted, start removing sequences <17 nts" $(date)
# Define the input and output files
input_file="sed.collapse.cat.all.fastq"
output_file="17_sed.collapse.cat.all.fastq"
# Initialize the output file
> "$output_file"
# Use awk to process the sequences
awk '{
if (substr($0, 1, 1) == ">") {
header = $0
getline
sequence = $0
if (length(sequence) >= 17) {
print header >> "'$output_file'"
print sequence >> "'$output_file'"
}
}
}' "$input_file"
zgrep -c ">" 17_sed.collapse.cat.all.fastq
echo "Sequences removed, ready for mirdeep2" $(date)
Submitted batch job 293969
20240201
NOW I got this error:
awk: /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/filter_length.awk:1: awk '{
awk: /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/filter_length.awk:1: ^ invalid char ''' in expression
awk: /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/filter_length.awk:1: awk '{
awk: /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/filter_length.awk:1: ^ syntax error
fastx_collapser: Premature End-Of-File (filename ='awk.grep.check.cat.all.fastq')
I might just manually do it. I also rezipped all of the files in my flexbar trimmed data folder.
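The awk errors above are consistent with filter_length.awk containing the shell wrapper (`awk '{ ... }' input.fastq > output.fastq`): a file passed with `awk -f` should hold only the awk program itself. A minimal sketch of such a program file is below. It keeps the same idea as the original (drop records where sequence and quality lengths differ) and assumes the records are still in clean 4-line frames, which may not hold if lines were fused or lost.

```shell
#!/bin/bash
# Write an awk program file that holds only awk code (no "awk '...'" wrapper).
cat > filter_length.awk <<'EOF'
# Keep a 4-line FASTQ record only if sequence and quality lengths match.
{
    header = $0
    if ((getline sequence) <= 0) exit   # sequence line
    if ((getline plus) <= 0) exit       # "+" line
    if ((getline quality) <= 0) exit    # quality line
    if (length(sequence) == length(quality)) {
        print header
        print sequence
        print "+"
        print quality
    }
}
EOF

# Usage (file names as in cat_collapse_all.sh):
# awk -f filter_length.awk grep.check.cat.all.fastq > awk.grep.check.cat.all.fastq
```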
20240202
I am going to follow the concatenate step in Sam’s code. Here’s his code. He does a lot of echoing and what not.
# Load bash variables into memory
source .bashvars
# Check for existence of concatenated FastA before running
if [ ! -f "${output_dir_top}/${concatenated_trimmed_reads_fastq}" ]; then
cat ${trimmed_fastqs_dir}/*.fastq.gz \
> "${output_dir_top}/${concatenated_trimmed_reads_fastq}"
fi
ls -lh "${output_dir_top}/${concatenated_trimmed_reads_fastq}"
Let’s try to modify it for my info. First, I’m just going to concatenate. In the scripts folder: nano cat_all.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load FASTX-Toolkit/0.0.14-GCC-9.3.0
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
echo "Concatenate smRNA reads from ALL samples" $(date)
# Check for existence of concatenated FastA before running
if [ ! -f "cat.all.fastq" ]; then
cat AST*.fastq.gz \
> cat.all.fastq
fi
echo "Reads concatenated" $(date)
Submitted batch job 294079. Ran in about 2 mins but produced a binary file which is weird… Going to edit the code so that it is just cat AST*.fastq.gz. Submitted batch job 294080. Took about 2 mins. Once again, produced a binary file. Maybe because I am concatenating .gz files? Let me try to run fastx_collapser and see if it works. In the scripts folder: nano collapse_all.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FASTX-Toolkit/0.0.14-GCC-9.3.0
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
echo "Collapse concatenated reads" $(date)
fastx_collapser -v -i cat.all.fastq -o collapse.cat.all.fastq
echo "Reads collapsed" $(date)
Submitted batch job 294081. Failed immediately and got this error: fastx_collapser: input file (cat.all.fastq) has unknown file format (not FASTA or FASTQ), first character = ^_ (31). Okay I might go back to basics… so the cat step worked when I was just catting R1 and R2 together. Maybe let’s try to do that with two samples and then cat those samples together and see if they collapse.
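Running `cat` on .gz files produces another gzip stream, not plain text. That would explain both the binary output and the first character `^_ (31)`, since byte 31 (0x1f) is the first byte of the gzip magic number. A sketch that decompresses while concatenating is below; the function name `concat_gz_fastq` is made up for illustration, and it assumes the inputs are intact gzip files.

```shell
#!/bin/bash
# Hypothetical helper: decompress every input .gz to stdout and write the
# combined plain-text FASTQ to the file named in the first argument.
concat_gz_fastq() {
    local out="$1"
    shift
    gunzip -c "$@" > "$out"
}

# Usage (from the flexbar folder):
# concat_gz_fastq cat.all.fastq AST*.fastq.gz
```

This leaves the .gz files in place, so nothing has to be re-zipped afterwards.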
interactive
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
gunzip AST-1065_R1_001.fastq*
gunzip AST-1105_R1_001.fastq*
cat AST-1065_R1_001.fastq.gz_1.fastq AST-1065_R1_001.fastq.gz_2.fastq > AST-1065.fastq
cat AST-1105_R1_001.fastq.gz_1.fastq AST-1105_R1_001.fastq.gz_2.fastq > AST-1105.fastq
cat AST-1065.fastq AST-1105.fastq > test.fastq
module load FASTX-Toolkit/0.0.14-GCC-9.3.0
fastx_collapser -v -i AST-1065.fastq -o collapse.AST-1065.fastq
Input: 35658222 sequences (representing 35658222 reads)
Output: 11979585 sequences (representing 35658222 reads)
Okay so fastx is happy with the R1 and R2 concatenated output. Now let’s try on the test.fastq file.
fastx_collapser -v -i test.fastq -o collapse.test.fastq
Got this error: fastx_collapser: Error: invalid quality score data on line 234199948 (quality_tok = "AAFFFJAJJJJJJJJJJJJJJJJJJJJ". Let’s look at the line where the error is occurring. Going to set it 10 lines up so that I can see what’s going on
sed -n '234199948, 234199948p' test.fastq
AAFFFJAJJJJJJJJJJJJJJJJJJJJ
Does it hate the J ascii character???
I am now realizing that AST-1105 was one of the samples that looked weird and that I excluded from analysis. Let me choose a different sample to test with AST-1065
gunzip AST-2404_R1_001.fastq.gz*
cat AST-2404_R1_001.fastq.gz_1.fastq AST-2404_R1_001.fastq.gz_2.fastq > AST-2404.fastq
Let’s look at how many lines are in each file
wc -l AST-1065.fastq
142632888 AST-1065.fastq
wc -l AST-2404.fastq
133703224 AST-2404.fastq
Since I am catting these two files, 142632888 lines + 133703224 lines = 276336112 lines – this is how many lines should be in the fastq file once I cat them together.
cat AST-1065.fastq AST-2404.fastq > test.fastq
wc -l test.fastq
276336112 test.fastq
Expected number of lines. Let’s try to collapse
fastx_collapser -v -i test.fastq -o collapse.test.fastq
Input: 69084028 sequences (representing 69084028 reads)
Output: 22118660 sequences (representing 69084028 reads)
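The input count reported by fastx_collapser agrees with the line count divided by four (276336112 / 4 = 69084028), which is a quick sanity check that the concatenated file is made of whole 4-line records. A small sketch of that check is below; the function name `fastq_read_count` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the number of reads in an uncompressed FASTQ,
# or fail if the line count is not a multiple of 4.
fastq_read_count() {
    local lines
    lines=$(wc -l < "$1")
    if [ $((lines % 4)) -ne 0 ]; then
        echo "$1: $lines lines is not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Usage:
# fastq_read_count test.fastq
```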
Took a while but worked!!! Maybe it is the problem sample AST-1105 that is messing everything up…going to move it to its own folder in the flexbar folder and rerun the script.
mkdir AST-1105
mv *1105* AST-1105
cd ../../../scripts/
sbatch cat_all.sh
Submitted batch job 294086. Took 2 mins but still a binary file…Do I have to just do it manually??? Maybe instead I will cat all of the R1 and R2 reads together and then cat those files together.
We had the e5 molecular meeting and Sam White said that Azenta recommended that we toss read 2 and not use it at all. So I may try to just cat the R1… I’m going to edit the cat_all.sh script. Now here’s what it says:
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load FASTX-Toolkit/0.0.14-GCC-9.3.0
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
echo "Unzip R1 files" $(date)
gunzip AST*_1.fastq.gz
echo "Unzipping complete, concatenate smRNA reads from ALL samples - R1 only" $(date)
cat AST*_1.fastq > cat.all.fastq
echo "R1 concatenated" $(date)
Submitted batch job 294088. Took about 5 mins. Looks good!! Finally successful cat (so far). Now running the collapse_all.sh script. Submitted batch job 294089
20240204
Success!!!!!!! Reads have been concatenated and collapsed!!!!! At least the R1 reads. Here’s the output message:
Collapse concatenated reads Fri Feb 2 14:31:47 EST 2024
Input: 343437674 sequences (representing 343437674 reads)
Output: 62419423 sequences (representing 343437674 reads)
Reads collapsed Fri Feb 2 16:28:11 EST 2024
Go to the file in /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar and check it out
head collapse.cat.all.fastq
>1-8268409
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>2-2708289
GCACTGGTGGTTCAGTGGTAGAATTCTC
>3-2338875
GACTTTGTAGCATAGGTAAGGTTAGTGCAT
>4-2210232
AACTTTTGACGGTGGATCTCTTGGCTCACG
>5-1864084
TAAGACTATGATTATATGCAGCTTCTTGCA
tail collapse.cat.all.fastq
>62419419-1
ATTTGACGAAGGCTCCAAAGGAAGTCATGG
>62419420-1
TTGCCCGTATTACTGCCGT
>62419421-1
TACCTGCCCTATTTGCCTTATACTAG
>62419422-1
CTGGAAATCTGCTGGACTTACGTTT
>62419423-1
CGACAGTGGGCTGAAGCTG
zgrep -c ">" collapse.cat.all.fastq
62419423
Prep the sequence headers for mirdeep2 analysis
sed '/^>/ s/-/_x/g' collapse.cat.all.fastq \
| sed '/^>/ s/>/>seq_/' \
> sed.collapse.cat.all.fastq
head sed.collapse.cat.all.fastq
>seq_1_x8268409
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_2_x2708289
GCACTGGTGGTTCAGTGGTAGAATTCTC
>seq_3_x2338875
GACTTTGTAGCATAGGTAAGGTTAGTGCAT
>seq_4_x2210232
AACTTTTGACGGTGGATCTCTTGGCTCACG
>seq_5_x1864084
TAAGACTATGATTATATGCAGCTTCTTGCA
zgrep -c ">" sed.collapse.cat.all.fastq
62419423
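Before handing the file to mirdeep2, it may be worth confirming that every header now matches the `>SSS_INT_xINT` pattern that mirdeep2 asks for in read IDs. The sketch below counts headers that do not match. The function name `count_bad_headers` is made up for illustration, and the regular expression is my reading of that pattern (three non-space characters, a running number, then `_x` and a read count), not something taken from the mirdeep2 source.

```shell
#!/bin/bash
# Hypothetical helper: count FASTA header lines that do not look like
# >SSS_INT_xINT (e.g. >seq_1_x8268409). Prints 0 when all headers match.
count_bad_headers() {
    grep '^>' "$1" | grep -Evc '^>[^[:space:]]{3}_[0-9]+_x[0-9]+$' || true
}

# Usage:
# count_bad_headers sed.collapse.cat.all.fastq
```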
Now I can run mirdeep2 (mapping and prediction steps). Because the input fasta file will likely be so big, I’m going to include the mapper.pl with the mirdeep2.pl script. First, make a new folder in the mirdeep2 folder for all samples.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2
mkdir all
Run mirdeep2. In the scripts folder: nano mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mapping" $(date)
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/sed.collapse.cat.all.fastq -c -p /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts/Apoc_ref.btindex -s /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/all_reads_collapsed.fa -t /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/all_reads_collapsed_vs_genome.arf -v
echo "Mapping complete, Starting mirdeep2" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/sed.collapse.cat.all.fastq /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/all_reads_collapsed_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -t N.vectensis -P -v -g -1 2>report.log
echo "mirdeep2" $(date)
conda deactivate
Submitted batch job 294127
20240206
Success! mirdeep2 finished running, took about 1.5 days. I’m going to download the results to my computer. Since the output is in the scripts folder, I’m going to move all of the output to the all folder. Here’s a summary of the output:
I think my next steps will be filtering in R. I’m going to filter the csv so that I retain potential miRNAs that have a mirdeep2 score > 10, no rfam info, at least 10 reads in mature and star read count, and significant randfold p-value (this has been done in most of the other cnidarian miRNA papers). When I did this filtering in this script, I ended up with 278 novel miRNAs.
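The filtering itself is done in R, but the same criteria can be sketched on the command line as a quick cross-check. Everything about the table layout here is an assumption to verify against the header line of the actual file: a tab-delimited table with the mirdeep2 score in column 2, rfam alert in column 4 (with `-` meaning no rfam hit), mature read count in column 6, star read count in column 8, and significant randfold p-value (`yes`/`no`) in column 9. I am also reading "at least 10 reads in mature and star" as at least 10 in each. The function name `filter_mirdeep` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: keep candidate rows that pass the filtering criteria.
# Column positions are assumptions (see the paragraph above).
filter_mirdeep() {
    awk -F'\t' '
        $2 + 0 > 10  &&    # mirdeep2 score > 10
        $4 == "-"    &&    # no rfam alert
        $6 + 0 >= 10 &&    # mature read count
        $8 + 0 >= 10 &&    # star read count
        $9 == "yes"        # significant randfold p-value
    ' "$1"
}

# Usage (file name is a placeholder):
# filter_mirdeep novel_candidates.tsv | wc -l
```

If the row count from this does not match what the R script gives, the column positions are the first thing to check.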
Additionally, I will need to write a script that checks the following: “requirement of a 2-nucleotide overhang on the 3’ end of the precursor miRNA, 5’ consistency of the mature miRNA strand (at least 90% of the reads have to be starting from the same position), and at least 16 nucleotide complementarity between mature and star strand” (Praher et al., 2021). I may do this manually by looking at the PDFs…
I also need to figure out how MFE is calculated?
After I do that, I will probably blast the predicted seqs against a tRNA and rRNA database to remove any unwanted RNAs. RNAcentral gives a good overview of the different ncRNA databases that could be used. Here are some options:
- rRNA
- tRNA
I should also blast it against the NCBI database.
20240226
Since mirdeep2 has run with all R1 reads concatenated, I can now run the quantifier module for each sample. I think I might need to collapse each R1 file and modify it with the correct headers. From the mirdeep2 github, it says the input for the quantifier module is:
- A FASTA file with precursor sequences,
- a FASTA file with mature miRNA sequences,
- a FASTA file with deep sequencing reads, and
- optionally a FASTA file with star sequences and the 3 letter code of the species of interest.
The fasta files with the precursor, mature and star sequences are in this folder: /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57. The fasta files are separated by known and novel, so I will cat the known and novel sequences together.
cat known_mature_04_02_2024_t_11_15_57_score-50_to_na.fa novel_mature_04_02_2024_t_11_15_57_score-50_to_na.fa > mature_all.fa
cat known_pres_04_02_2024_t_11_15_57_score-50_to_na.fa novel_pres_04_02_2024_t_11_15_57_score-50_to_na.fa > precursor_all.fa
cat known_star_04_02_2024_t_11_15_57_score-50_to_na.fa novel_star_04_02_2024_t_11_15_57_score-50_to_na.fa > star_all.fa
Let’s try to run a sample without collapsing it.
conda activate /data/putnamlab/mirdeep2
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -s /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/star_all.fa -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq
As I suspected, it gave me this error:
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq ids do not have the correct format
it must have the id line >SSS_INT_xINT
SSS is a three letter code indicating the sample origin
INT is just a running number
xINT is the number of read occurrences
But it did give me this recommendation:
You can use the mapper.pl module to create such a file from a fasta file with
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq.collapsed
So let’s give that a try! Ran in about a minute. Let’s try to run the quantifier module now.
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -s /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/star_all.fa -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq.collapsed
Ran in about a minute and gave this output:
getting samples and corresponding read numbers
Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 17829111 43551 17785560 0.244 99.756
seq: 17829111 43551 17785560 0.244 99.756
mapping star sequences against index
analyzing data
1873 mature mappings to precursors
1865 star mappings to precursors
Expressed miRNAs are written to expression_analyses/expression_analyses_1708999755/miRNA_expressed.csv
not expressed miRNAs are written to expression_analyses/expression_analyses_1708999755/miRNA_not_expressed.csv
Creating miRBase.mrd file
Mapped READS readin - DONE
make_html2.pl -q expression_analyses/expression_analyses_1708999755/miRBase.mrd -k mature_all.fa -y 1708999755 -o -i expression_analyses/expression_analyses_1708999755/mature_all.fa_mapped.arf -j expression_analyses/expression_analyses_1708999755/star_all.fa_mapped.arf -l -M miRNAs_expressed_all_samples_1708999755.csv
miRNAs_expressed_all_samples_1708999755.csv file with miRNA expression values
parsing miRBase.mrd file finished
creating PDF files
Can't use string ("29") as a HASH ref while "strict refs" in use at /data/putnamlab/mirdeep2/bin/make_html2.pl line 658.
I also ran the code with the -s argument removed. It gave the same mapping, but it did print out "creating pdf for chromosome_XXXX" while the other line of code didn’t. I looked at less miRNAs_expressed_all_samples_1708999755.csv
#miRNA read_count precursor total seq seq(norm)
chromosome_10_365643 2.00 chromosome_10_365643 2.00 2.00 33.80
chromosome_10_365701 2.00 chromosome_10_365701 2.00 2.00 33.80
chromosome_10_366823 1.00 chromosome_10_366823 1.00 1.00 16.90
chromosome_10_366897 229.00 chromosome_10_366897 229.00 229.00 3870.47
chromosome_10_367039 2.00 chromosome_10_367039 2.00 2.00 33.80
chromosome_10_367612 0.00 chromosome_10_367612 0.00 0 0
chromosome_10_367894 56.00 chromosome_10_367894 56.00 56.00 946.49
chromosome_10_368110 0.00 chromosome_10_368110 0.00 0 0
chromosome_10_368443 0.00 chromosome_10_368443 0.00 0 0
chromosome_10_370371 11.00 chromosome_10_370371 11.00 11.00 185.92
chromosome_10_370480 0.00 chromosome_10_370480 0.00 0 0
chromosome_10_370856 2.00 chromosome_10_370856 2.00 2.00 33.80
I’m not sure what read count, seq and seq(norm) mean…but this file looks different than miRNA_expressed.csv, which looks like:
#miRNA read_count precursor
chromosome_10_365643 2 chromosome_10_365643
chromosome_10_365701 2 chromosome_10_365701
chromosome_10_366823 1 chromosome_10_366823
chromosome_10_366897 229 chromosome_10_366897
chromosome_10_367039 2 chromosome_10_367039
chromosome_10_367612 0 chromosome_10_367612
chromosome_10_367894 56 chromosome_10_367894
chromosome_10_368110 0 chromosome_10_368110
chromosome_10_368443 0 chromosome_10_368443
chromosome_10_370371 11 chromosome_10_370371
chromosome_10_370480 0 chromosome_10_370480
chromosome_10_370856 2 chromosome_10_370856
chromosome_10_371859 1 chromosome_10_371859
chromosome_10_371901 0 chromosome_10_371901
It looks like they have the same read counts, but what does seq(norm) mean? I need to read this page. I think I would just use the read count info, but the counts seem so low. This is also just one sample, and I’m not seeing all of the potential miRNAs when I just look at the data briefly.
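A quick back-calculation from the rows of miRNAs_expressed_all_samples_1708999755.csv hints at what seq(norm) is. Dividing read_count by seq(norm) and multiplying by 10^6 gives nearly the same number for every non-zero row, which is what a reads-per-million scaling by one sample-wide total would produce. This is an inference from the numbers in this file, not something taken from the mirdeep2 documentation, and the helper name is made up.

```shell
# implied_total COUNT NORM
# Back-calculate the denominator implied by a reads-per-million scaling:
#   NORM = COUNT / total * 1e6   =>   total = COUNT / NORM * 1e6
implied_total() {
  awk -v c="$1" -v n="$2" 'BEGIN { printf "%.0f\n", c / n * 1e6 }'
}

implied_total 229 3870.47   # chromosome_10_366897
implied_total 56 946.49     # chromosome_10_367894
```

Both rows come out at about 59,166, so seq(norm) looks like read_count scaled to a total of roughly 59 thousand reads for this sample. Which reads make up that total still needs to be confirmed.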
Also maybe look at this: https://www.biorxiv.org/content/10.1101/2021.10.19.464446v1.full.pdf
It would be interesting to try the -W and -k tags in the quantifier module. -W indicates that read counts are weighted by their number of mappings, e.g. a read that maps twice adds 0.5 to the read profile at each position. -k also considers precursor-mature mappings that have different ids, e.g. let7c would be allowed to map to pre-let7a.
20240227
Going to continue looking at the mirdeep2 quantifier output. When looking at other miRNA cnidarian papers (e.g. Liew et al. 2014, Baumgarten et al.), it seems like my counts (at least for this one sample) are comparable. Let’s try with the -W tag.
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -W -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq.collapsed
getting samples and corresponding read numbers
Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 17829111 43551 17785560 0.244 99.756
seq: 17829111 43551 17785560 0.244 99.756
analyzing data
1873 mature mappings to precursors
Expressed miRNAs are written to expression_analyses/expression_analyses_1709042138/miRNA_expressed.csv
not expressed miRNAs are written to expression_analyses/expression_analyses_1709042138/miRNA_not_expressed.csv
Creating miRBase.mrd file
Mapped READS readin - DONE
make_html2.pl -q expression_analyses/expression_analyses_1709042138/miRBase.mrd -k mature_all.fa -y 1709042138 -o -i expression_analyses/expression_analyses_1709042138/mature_all.fa_mapped.arf -l -M miRNAs_expressed_all_samples_1709042138.csv -W expression_analyses/expression_analyses_1709042138/read_occ
miRNAs_expressed_all_samples_1709042138.csv file with miRNA expression values
parsing miRBase.mrd file finished
Took a few mins to run and generate the pdfs. Let’s look at the expressed file:
#miRNA read_count precursor
chromosome_10_365643 2 chromosome_10_365643
chromosome_10_365701 2 chromosome_10_365701
chromosome_10_366823 0.5 chromosome_10_366823
chromosome_10_366897 229 chromosome_10_366897
chromosome_10_367039 2 chromosome_10_367039
chromosome_10_367612 0 chromosome_10_367612
chromosome_10_367894 3.83455475552999 chromosome_10_367894
chromosome_10_368110 0 chromosome_10_368110
chromosome_10_368443 0 chromosome_10_368443
chromosome_10_370371 0.703267973856209 chromosome_10_370371
chromosome_10_370480 0 chromosome_10_370480
chromosome_10_370856 2 chromosome_10_370856
chromosome_10_371859 1 chromosome_10_371859
chromosome_10_371901 0 chromosome_10_371901
Interesting. So with the -W tag, read counts are weighted by their number of mappings, e.g. a read that maps twice adds 0.5 to the read profile at each position. So for instance, does that mean that chromosome_10_366823 has 2 reads mapped to it? That’s confusing.
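One reading of the -W description, offered as an interpretation and not confirmed against the mirdeep2 source: each read contributes its number of occurrences divided by the number of places it maps. Under that reading, chromosome_10_366823 dropping from 1 to 0.5 would mean its single read also maps to one other location, not that two reads map to it. Likewise, chromosome_10_367894 dropping from 56 to about 3.83 would mean most of its reads map to many locations. The sketch below uses a hypothetical three-column input (read id, occurrences, number of mappings) to show the arithmetic.

```shell
# weighted_count FILE
# FILE: one line per read, "READ_ID OCCURRENCES N_MAPPINGS" (hypothetical format).
# Each read adds OCCURRENCES / N_MAPPINGS to the total.
weighted_count() {
  awk '$3 > 0 { sum += $2 / $3 } END { printf "%.4g\n", sum + 0 }' "$1"
}
```

For a file with the single line "r1 1 2", this prints 0.5, which matches the pattern seen for chromosome_10_366823.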
Let’s try with the -k tag.
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -k -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq.collapsed
getting samples and corresponding read numbers
Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics
#desc total mapped unmapped %mapped %unmapped
total: 17829111 43551 17785560 0.244 99.756
seq: 17829111 43551 17785560 0.244 99.756
analyzing data
6029 mature mappings to precursors
Expressed miRNAs are written to expression_analyses/expression_analyses_1709043340/miRNA_expressed.csv
not expressed miRNAs are written to expression_analyses/expression_analyses_1709043340/miRNA_not_expressed.csv
Creating miRBase.mrd file
Mapped READS readin - DONE
make_html2.pl -q expression_analyses/expression_analyses_1709043340/miRBase.mrd -k mature_all.fa -y 1709043340 -o -i expression_analyses/expression_analyses_1709043340/mature_all.fa_mapped.arf -M miRNAs_expressed_all_samples_1709043340.csv
miRNAs_expressed_all_samples_1709043340.csv file with miRNA expression values
parsing miRBase.mrd file finished
Not sure what is different looking at the output files. It is giving me more mature mappings to precursors (6029 vs. 1873), presumably because -k also counts mature sequences that map to a precursor with a different id.
I think I am just going to stick with the original code without the star sequence information. Going to now process the rest of the samples. I moved all of the stuff I did above into a folder called test (/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/test). First, going to run the mapper module to collapse the samples. I could definitely do this in a script but I want to do it manually the first time so that I am understanding how the code works.
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq.collapsed
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1147_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1147_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_156798
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1412_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1412_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_157142
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1560_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1560_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_187756
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1567_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1567_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_157352
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1617_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1617_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_157469
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1722_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1722_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_157568
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2000_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2000_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_157668
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2007_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2007_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_157831
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2302_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2302_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_157928
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2360_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2360_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158073
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2398_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2398_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158172
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2404_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2404_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158270
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2412_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2412_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158370
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2512_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2512_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158493
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2523_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2523_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158577
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2563_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2563_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158695
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2729_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2729_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158793
mapper.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2755_R1_001.fastq.gz_1.fastq -e -m -h -s /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2755_R1_001.fastq.gz_1.fastq.collapsed
# Log file for this run is in mapper_logs and called mapper.log_158919
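For future runs, the manual calls above could be generated by a loop. This is a minimal sketch that only prints the commands so they can be inspected first. It reuses the mapper.pl flags from the calls above (-e -m -h -s) and assumes the trimmed R1 files all end in _R1_001.fastq.gz_1.fastq. The function name is made up.

```shell
# print_collapse_cmds DIR
# Prints one mapper.pl call per trimmed R1 fastq file found in DIR.
# Nothing is executed here; the output is plain text.
print_collapse_cmds() {
  for fq in "$1"/*_R1_001.fastq.gz_1.fastq; do
    [ -e "$fq" ] || continue     # skip if the glob matched nothing
    printf 'mapper.pl %s -e -m -h -s %s.collapsed\n' "$fq" "$fq"
  done
}
```

Running print_collapse_cmds on the flexbar folder lists the commands, and piping that output to sh would run them one after another. This assumes the paths contain no spaces.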
This is what the top lines of one of the collapsed files looks like:
head AST-2755_R1_001.fastq.gz_1.fastq.collapsed
>seq_0_x282696
GCACTGGTGGTTCAGTGGTAGAATTCTCGC
>seq_282696_x73794
TGAAAATCTTTTCTCTGAAGTGGAA
>seq_356490_x65651
TGACTAGATATATACTCATGCT
>seq_422141_x62171
GCACTGATGGTTCAGTGGTAGAATTCTCGC
>seq_484312_x61289
TGACTAGATATACACTCATTCT
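The headers follow the >SSS_INT_xINT format that the quantifier asked for, and the running number looks cumulative (0, then 282696, then 282696 + 73794 = 356490, and so on). That makes it possible to recover both the number of unique sequences and the total number of reads from a collapsed file. A minimal sketch, assuming every header should match a three-letter code, a running number, and an x-prefixed count; the function name is made up.

```shell
# collapsed_stats FILE
# Prints "UNIQUE TOTAL BAD": the number of header lines, the sum of the
# _xN read counts, and the number of headers not matching >SSS_INT_xINT.
collapsed_stats() {
  awk '
    /^>/ {
      unique++
      if ($0 ~ /^>[A-Za-z][A-Za-z][A-Za-z]_[0-9]+_x[0-9]+$/) {
        n = $0
        sub(/^.*_x/, "", n)      # keep only the count after "_x"
        total += n
      } else {
        bad++
      }
    }
    END { printf "%d %d %d\n", unique, total, bad }
  ' "$1"
}
```

The TOTAL value can then be compared against the "total" column in the quantifier mapping statistics for the same sample as a sanity check.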
Count the number of unique sequences in each collapsed file.
zgrep -c ">" *fastq.collapsed
AST-1065_R1_001.fastq.gz_1.fastq.collapsed:5255291
AST-1147_R1_001.fastq.gz_1.fastq.collapsed:7375460
AST-1412_R1_001.fastq.gz_1.fastq.collapsed:4942609
AST-1567_R1_001.fastq.gz_1.fastq.collapsed:4255807
AST-1617_R1_001.fastq.gz_1.fastq.collapsed:3410530
AST-1560_R1_001.fastq.gz_1.fastq.collapsed:5054135
AST-1722_R1_001.fastq.gz_1.fastq.collapsed:3144650
AST-2000_R1_001.fastq.gz_1.fastq.collapsed:4692561
AST-2007_R1_001.fastq.gz_1.fastq.collapsed:3658938
AST-2302_R1_001.fastq.gz_1.fastq.collapsed:4891546
AST-2360_R1_001.fastq.gz_1.fastq.collapsed:4165438
AST-2398_R1_001.fastq.gz_1.fastq.collapsed:4065976
AST-2404_R1_001.fastq.gz_1.fastq.collapsed:5091971
AST-2412_R1_001.fastq.gz_1.fastq.collapsed:4462387
AST-2512_R1_001.fastq.gz_1.fastq.collapsed:4235505
AST-2523_R1_001.fastq.gz_1.fastq.collapsed:5171942
AST-2563_R1_001.fastq.gz_1.fastq.collapsed:5147628
AST-2729_R1_001.fastq.gz_1.fastq.collapsed:4622195
AST-2755_R1_001.fastq.gz_1.fastq.collapsed:4816214
Retained 3.1-7.4 million unique sequences per sample. Run quantifier module on all samples. Again, I could definitely do this in a script but I want to do it manually the first time so that I am understanding how the code works.
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 17829111 43551 17785560 0.244 99.756
#seq: 17829111 43551 17785560 0.244 99.756
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709048527/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1147_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 40415224 869326 39545898 2.151 97.849
#seq: 40415224 869326 39545898 2.151 97.849
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709052026/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1412_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16279555 69729 16209826 0.428 99.572
#seq: 16279555 69729 16209826 0.428 99.572
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709053021/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1560_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 17827024 6510 17820514 0.037 99.963
#seq: 17827024 6510 17820514 0.037 99.963
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709054526/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1567_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16611397 105945 16505452 0.638 99.362
#seq: 16611397 105945 16505452 0.638 99.362
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709053231/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1617_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16077717 121046 15956671 0.753 99.247
#seq: 16077717 121046 15956671 0.753 99.247
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709055134/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1722_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16430221 258350 16171871 1.572 98.428
#seq: 16430221 258350 16171871 1.572 98.428
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709055319/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2000_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 17428854 91821 17337033 0.527 99.473
#seq: 17428854 91821 17337033 0.527 99.473
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709055397/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2007_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16559551 102426 16457125 0.619 99.381
#seq: 16559551 102426 16457125 0.619 99.381
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709055480/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2302_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16665370 230870 16434500 1.385 98.615
#seq: 16665370 230870 16434500 1.385 98.615
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709055687/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2360_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16648356 20477 16627879 0.123 99.877
#seq: 16648356 20477 16627879 0.123 99.877
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709055924/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2398_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16788208 230412 16557796 1.372 98.628
#seq: 16788208 230412 16557796 1.372 98.628
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709055986/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2404_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16712903 128490 16584413 0.769 99.231
#seq: 16712903 128490 16584413 0.769 99.231
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709057037/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2412_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 17488508 27520 17460988 0.157 99.843
#seq: 17488508 27520 17460988 0.157 99.843
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709057124/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2512_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16265716 14678 16251038 0.090 99.910
#seq: 16265716 14678 16251038 0.090 99.910
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709057192/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2523_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16995265 88993 16906272 0.524 99.476
#seq: 16995265 88993 16906272 0.524 99.476
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709057251/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2563_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 17023002 70635 16952367 0.415 99.585
#seq: 17023002 70635 16952367 0.415 99.585
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709057330/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2729_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 17121869 23317 17098552 0.136 99.864
#seq: 17121869 23317 17098552 0.136 99.864
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709057412/miRNA_expressed.csv
quantifier.pl -p /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/precursor_all.fa -m /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -d -r /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2755_R1_001.fastq.gz_1.fastq.collapsed
#desc total mapped unmapped %mapped %unmapped
#total: 16269823 127434 16142389 0.783 99.217
#seq: 16269823 127434 16142389 0.783 99.217
#analyzing data
#1873 mature mappings to precursors
#Expressed miRNAs are written to expression_analyses/expression_analyses_1709057479/miRNA_expressed.csv
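The manual quantifier calls could likewise be generated by a loop. This is a minimal sketch that only prints the commands. It reuses the flags from the calls above (-p, -m, -d, -r). The per-sample output folder is an addition that is not part of the original workflow: because quantifier.pl names its output by timestamp, running each sample in its own folder is one way to keep track of which file belongs to which sample. The function name and the OUT_DIR argument are made up.

```shell
# print_quant_cmds READ_DIR RESULTS_DIR OUT_DIR
# Prints one quantifier.pl call per collapsed file in READ_DIR.
# RESULTS_DIR holds precursor_all.fa and mature_all.fa.
# OUT_DIR/<sample> is a hypothetical per-sample working folder.
print_quant_cmds() {
  for fa in "$1"/*_R1_001.fastq.gz_1.fastq.collapsed; do
    [ -e "$fa" ] || continue     # skip if the glob matched nothing
    sample=$(basename "$fa" _R1_001.fastq.gz_1.fastq.collapsed)
    printf 'mkdir -p %s/%s && (cd %s/%s && quantifier.pl -p %s/precursor_all.fa -m %s/mature_all.fa -d -r %s)\n' \
      "$3" "$sample" "$3" "$sample" "$2" "$2" "$fa"
  done
}
```

All three arguments should be absolute paths, since each command changes directory before calling quantifier.pl.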
Mapped % are very low for all samples (0.04-2.2%, with only 4 of the 19 samples above 1%). This does not surprise me since miRNAs aren’t super abundant in the genome and the reads are only being mapped against the miRNA precursors. For each sample, a miRNAs_expressed_all_samples_XXXXX.csv file got produced, as well as files of miRNAs expressed and not expressed. Look at how many lines are in the miRNA expressed all samples file for each sample:
wc -l miRNAs_expressed_all_samples*
1874 miRNAs_expressed_all_samples_1709048527.csv
1874 miRNAs_expressed_all_samples_1709052026.csv
1874 miRNAs_expressed_all_samples_1709052927.csv
1874 miRNAs_expressed_all_samples_1709053021.csv
1874 miRNAs_expressed_all_samples_1709053231.csv
1874 miRNAs_expressed_all_samples_1709054272.csv
1874 miRNAs_expressed_all_samples_1709054526.csv
1874 miRNAs_expressed_all_samples_1709055134.csv
1874 miRNAs_expressed_all_samples_1709055319.csv
1874 miRNAs_expressed_all_samples_1709055397.csv
1874 miRNAs_expressed_all_samples_1709055480.csv
1874 miRNAs_expressed_all_samples_1709055687.csv
1874 miRNAs_expressed_all_samples_1709055924.csv
1874 miRNAs_expressed_all_samples_1709055986.csv
1874 miRNAs_expressed_all_samples_1709057037.csv
1874 miRNAs_expressed_all_samples_1709057124.csv
1874 miRNAs_expressed_all_samples_1709057192.csv
1874 miRNAs_expressed_all_samples_1709057251.csv
1874 miRNAs_expressed_all_samples_1709057330.csv
1874 miRNAs_expressed_all_samples_1709057412.csv
1874 miRNAs_expressed_all_samples_1709057479.csv
39354 total
All have the same number of lines. Are they in the same order?
# Top of file
head miRNAs_expressed_all_samples_1709048527.csv
#miRNA read_count precursor total seq seq(norm)
chromosome_10_365643 2.00 chromosome_10_365643 2.00 2.00 36.63
chromosome_10_365701 2.00 chromosome_10_365701 2.00 2.00 36.63
chromosome_10_366823 1.00 chromosome_10_366823 1.00 1.00 18.31
chromosome_10_366897 229.00 chromosome_10_366897 229.00 229.00 4193.60
chromosome_10_367039 2.00 chromosome_10_367039 2.00 2.00 36.63
chromosome_10_367612 0.00 chromosome_10_367612 0.00 0 0
chromosome_10_367894 56.00 chromosome_10_367894 56.00 56.00 1025.51
chromosome_10_368110 0.00 chromosome_10_368110 0.00 0 0
chromosome_10_368443 0.00 chromosome_10_368443 0.00 0 0
head miRNAs_expressed_all_samples_1709057479.csv
#miRNA read_count precursor total seq seq(norm)
chromosome_10_365643 10.00 chromosome_10_365643 10.00 10.00 58.13
chromosome_10_365701 0.00 chromosome_10_365701 0.00 0 0
chromosome_10_366823 10.00 chromosome_10_366823 10.00 10.00 58.13
chromosome_10_366897 1.00 chromosome_10_366897 1.00 1.00 5.81
chromosome_10_367039 8.00 chromosome_10_367039 8.00 8.00 46.50
chromosome_10_367612 0.00 chromosome_10_367612 0.00 0 0
chromosome_10_367894 54.00 chromosome_10_367894 54.00 54.00 313.89
chromosome_10_368110 0.00 chromosome_10_368110 0.00 0 0
chromosome_10_368443 15.00 chromosome_10_368443 15.00 15.00 87.19
# Bottom of file
tail miRNAs_expressed_all_samples_1709048527.csv
chromosome_9_363461 0.00 chromosome_9_363461 0.00 0 0
chromosome_9_363522 3.00 chromosome_9_363522 3.00 3.00 54.94
chromosome_9_363632 0.00 chromosome_9_363632 0.00 0 0
chromosome_9_363719 6.00 chromosome_9_363719 6.00 6.00 109.88
chromosome_9_364029 3.00 chromosome_9_364029 3.00 3.00 54.94
chromosome_9_364032 0.00 chromosome_9_364032 0.00 0 0
chromosome_9_364034 49.00 chromosome_9_364034 49.00 49.00 897.32
chromosome_9_364498 0.00 chromosome_9_364498 0.00 0 0
chromosome_9_364714 13.00 chromosome_9_364714 13.00 13.00 238.06
chromosome_9_364995 2.00 chromosome_9_364995 2.00 2.00 36.63
tail miRNAs_expressed_all_samples_1709057479.csv
chromosome_9_363461 6.00 chromosome_9_363461 6.00 6.00 34.88
chromosome_9_363522 4.00 chromosome_9_363522 4.00 4.00 23.25
chromosome_9_363632 0.00 chromosome_9_363632 0.00 0 0
chromosome_9_363719 5.00 chromosome_9_363719 5.00 5.00 29.06
chromosome_9_364029 33.00 chromosome_9_364029 33.00 33.00 191.82
chromosome_9_364032 0.00 chromosome_9_364032 0.00 0 0
chromosome_9_364034 159.00 chromosome_9_364034 159.00 159.00 924.24
chromosome_9_364498 2.00 chromosome_9_364498 2.00 2.00 11.63
chromosome_9_364714 9.00 chromosome_9_364714 9.00 9.00 52.32
chromosome_9_364995 7.00 chromosome_9_364995 7.00 7.00 40.69
From a quick look at the top and bottom of files, it looks like they are in the same order. I copied all miRNAs_expressed_all_samples_XXXX.csv
files to my local computer and renamed each file so that the sample ID was in the name of the file. Now I’ll look at them in R.
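The head/tail comparison above only covers 20 rows per file. A fuller check, sketched here on toy files, is to hash the first column (the miRNA ID) of every file and count the distinct hashes; one distinct hash means every file lists the IDs in the same order. The toy file names and contents are placeholders, and the sketch assumes the files are tab-separated.

```shell
# Toy stand-ins for two quantifier.pl outputs (assumed tab-separated)
workdir=$(mktemp -d)
printf '#miRNA\tread_count\nchromosome_10_365643\t2.00\nchromosome_10_365701\t2.00\n' \
    > "$workdir/miRNAs_expressed_all_samples_A.csv"
printf '#miRNA\tread_count\nchromosome_10_365643\t10.00\nchromosome_10_365701\t0.00\n' \
    > "$workdir/miRNAs_expressed_all_samples_B.csv"

# Hash only column 1 (the miRNA ID) of each file, then count the distinct hashes
n_orders=$(for f in "$workdir"/miRNAs_expressed_all_samples_*.csv; do
    cut -f1 "$f" | md5sum | cut -d' ' -f1
done | sort -u | wc -l | tr -d ' ')

echo "distinct ID orderings: $n_orders"
```

On the real data, the same loop could be pointed at the directory holding the 21 files; anything other than 1 would mean the rows need to be matched by ID rather than by position.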
20240305
Still trying to figure out how the heck MFE is calculated and where that number is stored in the files that mirdeep2 produces. I looked through all of the files and STILL CAN’T FIND IT…
I do think that the files mature_vs_precursors.bwt
and reads_vs_precursors.bwt
in /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/dir_prepare_signature1707063877
will help me calculate if 90% of the reads share the same nucleotide start at the 5’ end.
20240319
I have now identified all putative miRNAs (code here) and run DESeq2 (code here). For both the mRNAs and the miRNAs, I now have a list of unique differentially expressed genes or miRNAs. I can use these lists to filter the gene and miRNA fastas so that I can run miranda with the subsetted sequences.
I need to run miranda with the 3’ UTRs of the mRNAs, meaning I have to use the gff to identify the 3’ UTRs in the genome and then subset those sequences specifically… When looking at the Astrangia gff, there are only 687 3’ UTR rows, despite there being >47,000 genes. Somehow, I need to get the sequences of the 3’ UTRs for all mRNA sequences.
These are some options for IDing the 3’ end:
- https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4241-1
- GETUTR
- 3USS
- UTRscan
- Maker
- Augustus
I know this won’t be the same, but I am going to run blast with the query as the miRNA sequences and the reference db as the mRNA sequences. In the scripts folder: nano blastn_miRNA.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
echo "Making blast db from mRNA sequences" $(date)
makeblastdb -in /data/putnamlab/jillashey/Astrangia_Genome/apoculata_mrna_v2.0.fasta -dbtype 'nucl' -out /data/putnamlab/jillashey/Astrangia2021/smRNA/data/apoc_mRNA_db
echo "Blast db creation complete, blasting miRNAs against db" $(date)
blastn -query /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -db /data/putnamlab/jillashey/Astrangia2021/smRNA/data/apoc_mRNA_db -outfmt 6 -evalue 4 -num_threads 15 -out /data/putnamlab/jillashey/Astrangia2021/smRNA/data/blastn_miRNA_query.tab
echo "Blast complete!" $(date)
Submitted batch job 309649. Ran fast, but output file was empty…
20240320
Going to try to rerun the script above but removing any extraneous flags in hopes that this will fix the problem. Submitted batch job 309660. Still empty but not getting any error…I find it hard to believe that there are no blast matches between the miRNAs and the mRNAs…though I guess I am looking for complementary sequences, not identical sequences. Is there a way to do this in blast? Yes there is! I just need to add -strand both
to search both strands, although both is already the default for blastn (the flag applies to the query strand). Submitted batch job 309661. Still ran but still output file is empty… Truly don’t understand and I am quite frustrated. Maybe let’s try blastx? In the scripts folder: nano blastx_miRNA.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
echo "Making blast db from protein sequences" $(date)
makeblastdb -in /data/putnamlab/jillashey/Astrangia_Genome/apoculata_proteins_v2.0.fasta -dbtype prot -out /data/putnamlab/jillashey/Astrangia2021/smRNA/data/apoc_prot_db
echo "Blast db creation complete, blasting miRNAs against db" $(date)
blastx -query /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -db /data/putnamlab/jillashey/Astrangia2021/smRNA/data/apoc_prot_db -out /data/putnamlab/jillashey/Astrangia2021/smRNA/data/blastx_miRNA_query.txt -outfmt 6
echo "Blast complete!" $(date)
Submitted batch job 309665. That worked! It was very fast. I wonder why it’s not working with the nucleotide sequences…Maybe there really are no hits? I just find that hard to believe but idk maybe.
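One possible reason the blastn runs came back empty: the default blastn task is megablast, whose default word size is 28, which is longer than a typical ~22 nt mature miRNA, so no seed match can ever be found. BLAST+ has a blastn-short task meant for short queries. The sketch below counts how many toy miRNAs are shorter than the megablast word size and, only if BLAST+ is installed, reruns the search with -task blastn-short. The toy sequences and the MIRNA_FA and MRNA_DB variables are placeholders I made up, not files from this analysis.

```shell
workdir=$(mktemp -d)
# Toy mature miRNA fasta (hypothetical sequences, one line per sequence)
printf '>toy_miR_1\nTGAGGTAGTAGGTTGTATAGTT\n>toy_miR_2\nTCACAGTGGCTAAGTTCTGC\n' \
    > "$workdir/mature_toy.fa"

# Count sequences too short to contain a 28-nt exact seed (megablast default word size)
word_size=28
n_short=$(awk -v w="$word_size" '!/^>/ && length($0) < w {n++} END {print n+0}' \
    "$workdir/mature_toy.fa")
echo "$n_short sequences are shorter than word size $word_size"

# Optional rerun with the short-query task. MIRNA_FA and MRNA_DB are hypothetical
# environment variables for the miRNA fasta and the nucleotide BLAST database.
if command -v blastn >/dev/null 2>&1 && [ -n "${MIRNA_FA:-}" ] && [ -n "${MRNA_DB:-}" ]; then
    blastn -task blastn-short -query "$MIRNA_FA" -db "$MRNA_DB" \
        -outfmt 6 -evalue 4 -out "$workdir/blastn_short_miRNA_query.tab"
fi
```

Even with hits, BLAST looks for near-identical stretches rather than seed-based miRNA binding, so this would only be a rough sanity check next to miranda.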
20240321
Coming to the realization that I either have to locate the 3’ UTRs in the genome myself or run a gene prediction software (augustus/maker) to find the 3’ UTR coordinates. My worry is that I use augustus or maker and not all of the 3’UTRs for the mRNAs are identified.
The other idea was to look at the stop codon coordinates for each mRNA and assume that everything after the stop codon sequence is the 3’UTR. This will probably take me less time but still not sure how to do it. After searching around, the bedtools
suite may help me. Someone on a github issue page suggested the following:
“I recommend you get a BED file with the position of the CDS (the annotation is always correct….). You can then use some Bedtools operations: bedtools merge the different exons of your Isoscm gtf files which form a same gene (they are next to each other). Use bedtools substract using the CDS bed file, leaving you with this structure: 5’/ a hole corresponding to the cds/3’. To finish, use bedtools closest with this last file and the CDS bed file to get the closest features to the right of each cds. You get the 3’ of each cds this way. Use bedtools intersect on the original Isoscm file and you are left with only the 3’ features (exons and introns).”
In other words:
Exons = gene - introns
CDS = gene - introns - UTRs
therefore also:
CDS = Exons - UTRs
This makes sense to me. First, I’m going to make CDS- and exon-specific gff files.
cd /data/putnamlab/jillashey/Astrangia_Genome
awk '$3 == "exon"' apoculata_v2.0.gff3 > exons.gff
wc -l exons.gff
239708 exons.gff
head exons.gff
chromosome_1 tRNAScan-SE exon 12926481 12926553 62.76 + . ID=Ser_58_exon;Parent=Ser_58_tRNA
chromosome_1 tRNAScan-SE exon 20472613 20472685 57.43 - . ID=Ser_104_exon;Parent=Ser_104_tRNA
chromosome_1 . exon 15863279 15863694 . - . ID=evm.model.chromosome_1.1448.exon3;Parent=evm.model.chromosome_1.1448
chromosome_1 . exon 15864523 15864816 . - . ID=evm.model.chromosome_1.1448.exon2;Parent=evm.model.chromosome_1.1448
chromosome_1 . exon 15865287 15865521 . - . ID=evm.model.chromosome_1.1448.exon1;Parent=evm.model.chromosome_1.1448
chromosome_1 . exon 15865882 15865901 . - . ID=evm.model.chromosome_1.1449.exon27;Parent=evm.model.chromosome_1.1449
chromosome_1 . exon 15866163 15866244 . - . ID=evm.model.chromosome_1.1449.exon26;Parent=evm.model.chromosome_1.1449
chromosome_1 . exon 15866990 15867097 . - . ID=evm.model.chromosome_1.1449.exon25;Parent=evm.model.chromosome_1.1449
chromosome_1 . exon 15867543 15867625 . - . ID=evm.model.chromosome_1.1449.exon24;Parent=evm.model.chromosome_1.1449
chromosome_1 . exon 15868562 15868656 . - . ID=evm.model.chromosome_1.1449.exon23;Parent=evm.model.chromosome_1.1449
awk '$3 == "CDS"' apoculata_v2.0.gff3 > cds.gff
wc -l cds.gff
236997 cds.gff
head cds.gff
chromosome_1 . CDS 15863279 15863694 . - 2 ID=cds.evm.model.chromosome_1.1448;Parent=evm.model.chromosome_1.1448
chromosome_1 . CDS 15864523 15864816 . - 2 ID=cds.evm.model.chromosome_1.1448;Parent=evm.model.chromosome_1.1448
chromosome_1 . CDS 15865287 15865377 . - 0 ID=cds.evm.model.chromosome_1.1448;Parent=evm.model.chromosome_1.1448
chromosome_1 . CDS 15865882 15865901 . - 2 ID=cds.evm.model.chromosome_1.1449;Parent=evm.model.chromosome_1.1449
chromosome_1 . CDS 15866163 15866244 . - 0 ID=cds.evm.model.chromosome_1.1449;Parent=evm.model.chromosome_1.1449
chromosome_1 . CDS 15866990 15867097 . - 0 ID=cds.evm.model.chromosome_1.1449;Parent=evm.model.chromosome_1.1449
chromosome_1 . CDS 15867543 15867625 . - 2 ID=cds.evm.model.chromosome_1.1449;Parent=evm.model.chromosome_1.1449
chromosome_1 . CDS 15868562 15868656 . - 1 ID=cds.evm.model.chromosome_1.1449;Parent=evm.model.chromosome_1.1449
chromosome_1 . CDS 15870067 15870167 . - 0 ID=cds.evm.model.chromosome_1.1449;Parent=evm.model.chromosome_1.1449
chromosome_1 . CDS 15870684 15870725 . - 0 ID=cds.evm.model.chromosome_1.1449;Parent=evm.model.chromosome_1.1449
There are multiple exons and CDSs for many genes. I’m going to merge both of these features with bedtools merge.
cd /data/putnamlab/jillashey/Astrangia_Genome
mkdir scripts
cd scripts
Bedtools merge requires that the data is pre-sorted by chromosome and then by start position. In the scripts folder: nano bed_merge.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia_Genome/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BEDTools/2.30.0-GCC-11.3.0
cd /data/putnamlab/jillashey/Astrangia_Genome
echo "Sorting by chromosome and then by start position" $(date)
sort -k1,1 -k4,4n exons.gff > exons.sorted.gff
sort -k1,1 -k4,4n cds.gff > cds.sorted.gff
echo "Sorting complete, starting merge with exons" $(date)
bedtools merge -i exons.sorted.gff > exon.merge.bed
echo "Exon merge complete, starting merge with CDS" $(date)
bedtools merge -i cds.sorted.gff > cds.merge.bed
echo "Merges complete!" $(date)
Submitted batch job 309744. That ran super fast but the number of lines in each file did not decrease all that much. I’m going to add the -d
flag, which controls how close two features must be in order to merge. By default, it is set to 0, but I’m going to set it at 2000. Submitted batch job 309748. Once again, ran super fast.
wc -l exon.merge.bed
47035 exon.merge.bed
wc -l cds.merge.bed
46485 cds.merge.bed
These values make more sense to me, as that is about how many genes are in the genome. But they are still not identical…Going to do the subtraction anyway and see what happens. I’ll subtract the exon from the CDS? Using bedtools subtract. In the scripts folder: nano bed_subtract.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia_Genome/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BEDTools/2.30.0-GCC-11.3.0
cd /data/putnamlab/jillashey/Astrangia_Genome
echo "Subtracting exons from CDS" $(date)
bedtools subtract -a cds.merge.bed -b exon.merge.bed > test.bed
echo "Subtraction complete!" $(date)
Submitted batch job 309749. Test.bed file is empty… I’m guessing this was because the files are different lengths, although a more likely cause is that every CDS interval sits inside an exon interval, so subtracting exons (-b) from CDSs (-a) leaves nothing; swapping -a and -b would be the direction that leaves UTR pieces. I need to somehow remove the genes that do not have CDSs or exons annotated. In R (code here), I created a separate column for the parent ID in the attribute column and removed rows that do not have a row corresponding to an mRNA, an exon or a CDS. I then made two dataframes, one that contained only exons and one that contained only CDSs. I saved these as new gffs and put them in the Astrangia genome folder on Andromeda. I edited the bed_merge.sh
to include the new gffs as input. Submitted batch job 309756. Look at number of lines and header:
wc -l exon.merge.bed
67105 exon.merge.bed
wc -l cds.merge.bed
64688 cds.merge.bed
head exon.merge.bed
Ala_1098_tRNA 22229936 22230008
Ala_1107_tRNA 20496372 20496446
Ala_1112_tRNA 18122334 18122406
Ala_1113_tRNA 18121779 18121851
Ala_1114_tRNA 18121543 18121615
Ala_1115_tRNA 18121306 18121378
Ala_1116_tRNA 18121069 18121141
Ala_1117_tRNA 18120832 18120904
Ala_1144_tRNA 11456062 11456134
Ala_1158_tRNA 8187155 8187227
head cds.merge.bed
evm.model.chromosome_10.1 12622 13741
evm.model.chromosome_10.10 55895 56373
evm.model.chromosome_10.10 58725 58982
evm.model.chromosome_10.100 966627 971367
evm.model.chromosome_10.1000 9548910 9549171
evm.model.chromosome_10.1001 9559190 9569583
evm.model.chromosome_10.1001 9572518 9574130
evm.model.chromosome_10.1002 9580187 9582242
evm.model.chromosome_10.1003 9587494 9595116
evm.model.chromosome_10.1004 9597669 9600588
They are still not the same number of lines. But let’s QC real quick. Pull out the bed data from a gene that is in both the exon and cds files.
# cds file
evm.model.chromosome_10.1004 9597669 9600588
# 9600588 - 9597669 = 2919
evm.model.chromosome_10.1004 9602637 9607464
# 9607464 - 9602637 = 4827
# exon file
evm.model.chromosome_10.1004 9597669 9600588
# 9600588 - 9597669 = 2919
evm.model.chromosome_10.1004 9602637 9607464
# 9607464 - 9602637 = 4827
Both the cds and exon files have two lines for this gene, interesting. Why? Maybe too far apart to be grouped even though they appear to be assigned to the same gene. From the start of the first entry to the end of the second entry, the gene spans 9607464 - 9597669 = 9795
bp. The gap between the two entries is smaller: 9602637 - 9600588 = 2049 bp, which is just over the 2000 bp I set for -d earlier. I’m going to increase the -d
flag to 15000. I also added wc -l
into the script for each ending file so I don’t have to do it manually. Submitted batch job 309759. Still not the same, but I am going to remove any gene/chromosome names that have tRNA
in them, as I don’t think these are true exons. Removed them in R and reuploaded gffs to server. Now rerunning bed_merge.sh
. Submitted batch job 309764. Here are the line numbers.
46551 exon.merge.bed
46547 cds.merge.bed
So close… I wonder what the discrepancy is? The rows also look similar in both files… Changing the -d flag to 10000 and rerunning. Submitted batch job 309766.
47001 exon.merge.bed
46993 cds.merge.bed
Still close…and still look the same:
head -20 exon.merge.bed
evm.model.chromosome_10.1 12622 13741
evm.model.chromosome_10.10 55895 58982
evm.model.chromosome_10.100 966626 971376
evm.model.chromosome_10.1000 9548910 9549171
evm.model.chromosome_10.1001 9559190 9574130
evm.model.chromosome_10.1002 9580187 9582242
evm.model.chromosome_10.1003 9587494 9595116
evm.model.chromosome_10.1004 9597669 9607464
evm.model.chromosome_10.1005 9616973 9637640
evm.model.chromosome_10.1006 9645049 9652090
evm.model.chromosome_10.1007 9660976 9662278
evm.model.chromosome_10.1008 9662508 9662954
evm.model.chromosome_10.1009 9663218 9665033
evm.model.chromosome_10.101 971989 975710
evm.model.chromosome_10.1010 9667643 9669122
evm.model.chromosome_10.1011 9669880 9673441
evm.model.chromosome_10.1012 9676142 9683173
evm.model.chromosome_10.1013 9689782 9691495
evm.model.chromosome_10.1014 9697732 9706569
evm.model.chromosome_10.1015 9707472 9711010
head -20 cds.merge.bed
evm.model.chromosome_10.1 12622 13741
evm.model.chromosome_10.10 55895 58982
evm.model.chromosome_10.100 966627 971367
evm.model.chromosome_10.1000 9548910 9549171
evm.model.chromosome_10.1001 9559190 9574130
evm.model.chromosome_10.1002 9580187 9582242
evm.model.chromosome_10.1003 9587494 9595116
evm.model.chromosome_10.1004 9597669 9607464
evm.model.chromosome_10.1005 9616973 9637640
evm.model.chromosome_10.1006 9645049 9652090
evm.model.chromosome_10.1007 9660976 9662278
evm.model.chromosome_10.1008 9662508 9662954
evm.model.chromosome_10.1009 9663218 9665033
evm.model.chromosome_10.101 971989 975710
evm.model.chromosome_10.1010 9667643 9669122
evm.model.chromosome_10.1011 9669880 9673441
evm.model.chromosome_10.1012 9676142 9683173
evm.model.chromosome_10.1013 9689782 9691495
evm.model.chromosome_10.1014 9697732 9706569
evm.model.chromosome_10.1015 9707472 9711010
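If I subtract the cds from the exon, most rows will result in 0, since the spans are identical; only rows like evm.model.chromosome_10.100 (exon 966626-971376 vs. CDS 966627-971367) differ at all. Not sure what to do here… I may call it quits for the day here. Will ask the molecular mechanisms team tomorrow if they have any suggestions.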
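An alternative to distance-based merging is to group rows by the Parent attribute, so each transcript is handled on its own regardless of intron length. The sketch below takes the outer exon span and outer CDS span per transcript and reports whatever exon sequence lies outside the CDS, labeled by strand. It uses the evm.model.chromosome_1.1448 rows shown earlier plus one made-up plus-strand transcript (toy.plus). The row helper and the output labels are my own, and it ignores introns that fall inside a UTR, so treat it as a rough first pass.

```shell
workdir=$(mktemp -d)

# Helper writing a minimal tab-separated GFF row: chrom, type, start, end, strand, parent
row() { printf '%s\t.\t%s\t%s\t%s\t.\t%s\t.\tParent=%s\n' "$@"; }

{
    # Rows copied from the exon and CDS heads above (minus strand)
    row chromosome_1 exon 15863279 15863694 - evm.model.chromosome_1.1448
    row chromosome_1 exon 15864523 15864816 - evm.model.chromosome_1.1448
    row chromosome_1 exon 15865287 15865521 - evm.model.chromosome_1.1448
    row chromosome_1 CDS  15863279 15863694 - evm.model.chromosome_1.1448
    row chromosome_1 CDS  15864523 15864816 - evm.model.chromosome_1.1448
    row chromosome_1 CDS  15865287 15865377 - evm.model.chromosome_1.1448
    # Hypothetical plus-strand transcript
    row toy_chr exon 100 900 + toy.plus
    row toy_chr CDS  200 700 + toy.plus
} > "$workdir/toy.gff"

utrs=$(awk -F'\t' -v OFS='\t' '
{
    match($9, /Parent=[^;]+/)                      # transcript ID from column 9
    id = substr($9, RSTART + 7, RLENGTH - 7)
    s = $4 + 0; e = $5 + 0                         # force numeric comparison
    chr[id] = $1; strand[id] = $7
    if ($3 == "exon") {
        if (!(id in emin) || s < emin[id]) emin[id] = s
        if (!(id in emax) || e > emax[id]) emax[id] = e
    } else if ($3 == "CDS") {
        if (!(id in cmin) || s < cmin[id]) cmin[id] = s
        if (!(id in cmax) || e > cmax[id]) cmax[id] = e
    }
}
END {
    for (id in cmin) {
        # Low-coordinate side is the 3 prime end on "-" and the 5 prime end on "+"
        if (emin[id] < cmin[id])
            print chr[id], emin[id], cmin[id] - 1, id, (strand[id] == "-" ? "three_prime_UTR" : "five_prime_UTR")
        if (emax[id] > cmax[id])
            print chr[id], cmax[id] + 1, emax[id], id, (strand[id] == "-" ? "five_prime_UTR" : "three_prime_UTR")
    }
}' "$workdir/toy.gff" | LC_ALL=C sort -k1,1 -k2,2n)

echo "$utrs"
```

For evm.model.chromosome_1.1448 this reports only a 5’ UTR (15865378 to 15865521), because the first CDS row starts exactly where the last exon starts, which fits with so few 3’ UTRs being annotated in the gff.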
20240325
Javi sent me code to create 3’UTRs! I will use it to create 3’UTRs for my GFF so that I can extract the 3’UTR sequences for miranda. First, identify the counts of each feature from the gff file.
GFF_FILE="/data/putnamlab/jillashey/Astrangia_Genome/apoculata_v2.0.gff3"
genome="/data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta"
grep -v '^#' ${GFF_FILE} | cut -s -f 3 | sort | uniq -c | sort -rn > all_features.txt
cat all_features.txt
239708 exon
236997 CDS
47156 gene
45867 mRNA
44823 stop_codon
44312 start_codon
2889 five_prime_UTR
2317 tRNA
687 three_prime_UTR
Extract feature types and generate individual gffs for each feature. In Javi’s code, he has grep $'\tmRNA\t' ${GFF_FILE} | grep -v '^NC_' > ACER_k2.GFFannotation.mRNA.gff
, which I will follow. I’m not sure why he is removing lines that have NC_ in them… maybe it’s an Acerv-specific thing? I’m going to remove it for now, but may come back if I need to filter out specific rows.
grep $'\texon\t' ${GFF_FILE} > apoc_GFFannotation.exon.gff
grep $'\tCDS\t' ${GFF_FILE} > apoc_GFFannotation.CDS.gff
grep $'\tgene\t' ${GFF_FILE} > apoc_GFFannotation.gene.gff
grep $'\tmRNA\t' ${GFF_FILE} > apoc_GFFannotation.mRNA.gff
grep $'\tstop_codon\t' ${GFF_FILE} > apoc_GFFannotation.stop_codon.gff
grep $'\tstart_codon\t' ${GFF_FILE} > apoc_GFFannotation.start_codon.gff
grep $'\tfive_prime_UTR\t' ${GFF_FILE} > apoc_GFFannotation.five_prime_UTR.gff
grep $'\ttRNA\t' ${GFF_FILE} > apoc_GFFannotation.tRNA.gff
grep $'\tthree_prime_UTR\t' ${GFF_FILE} > apoc_GFFannotation.three_prime_UTR.gff
Extract chromosome lengths
cat ${genome} | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > apoc.Chromosome_lenghts.txt
cat apoc.Chromosome_lenghts.txt
chromosome_1 21106950
chromosome_2 21532011
chromosome_3 22725890
chromosome_4 27282911
chromosome_5 29602567
chromosome_6 30212294
chromosome_7 30683508
chromosome_8 29863146
chromosome_9 31044737
chromosome_10 33861578
chromosome_11 33534404
chromosome_12 42634786
chromosome_13 42928715
chromosome_14 58409227
Extract scaffold names
awk -F" " '{print $1}' apoc.Chromosome_lenghts.txt > apoc.Chromosome_names.txt
Sort the gffs by chromosome name. In the Astrangia_Genome/scripts
folder: nano bed_sort.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia_Genome/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BEDTools/2.30.0-GCC-11.3.0
cd /data/putnamlab/jillashey/Astrangia_Genome
echo "Sorting gffs by chromosome" $(date)
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.exon.gff > apoc_GFFannotation.exon_sorted.gff
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.CDS.gff > apoc_GFFannotation.CDS_sorted.gff
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.gene.gff > apoc_GFFannotation.gene_sorted.gff
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.mRNA.gff > apoc_GFFannotation.mRNA_sorted.gff
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.stop_codon.gff > apoc_GFFannotation.stop_codon_sorted.gff
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.start_codon.gff > apoc_GFFannotation.start_codon_sorted.gff
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.five_prime_UTR.gff > apoc_GFFannotation.five_prime_UTR_sorted.gff
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.tRNA.gff > apoc_GFFannotation.tRNA_sorted.gff
sortBed -faidx apoc.Chromosome_names.txt -i apoc_GFFannotation.three_prime_UTR.gff > apoc_GFFannotation.three_prime_UTR_sorted.gff
echo "Sorting complete!" $(date)
Submitted batch job 309987. Ran super fast. Use flankBed to extract the UTRs around the genes. Javi used 2kb and 3kb as his cutoffs (ie he extracted the flanks that were 2000 and 3000 bp around his gene of interest). In his actual analysis, he used 3kb. I’m going to use 3kb for now, but may have to redo with a different number. I need to look into how far the UTRs are away from the gene end. I’m also going to remove portions of the UTRs that overlap with neighboring genes. In the scripts folder: nano flank_sub_bed.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia_Genome/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BEDTools/2.30.0-GCC-11.3.0
cd /data/putnamlab/jillashey/Astrangia_Genome
echo "Extracting 3kb UTRs" $(date)
flankBed -i apoc_GFFannotation.gene_sorted.gff -g ${genome} -l 0 -r 3000 -s | awk '{gsub("gene","3prime_UTR",$3); print $0 }' | awk '{if($5-$4 > 3)print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}' | tr ' ' '\t' > apoc.GFFannotation.3UTR_3kb.gff
flankBed -i apoc_GFFannotation.gene_sorted.gff -g ${genome} -l 3000 -r 0 -s | awk '{gsub("gene","5prime_UTR",$3); print $0 }'| awk '{if($5-$4 > 3)print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}'| tr ' ' '\t' > apoc.GFFannotation.5UTR_3kb.gff
echo "Subtract portions of UTRs that overlap nearby genes" $(date)
subtractBed -a apoc.GFFannotation.3UTR_3kb.gff -b apoc_GFFannotation.gene_sorted.gff > apoc.GFFannotation.3UTR_3kb_corrected.gff
subtractBed -a apoc.GFFannotation.5UTR_3kb.gff -b apoc_GFFannotation.gene_sorted.gff > apoc.GFFannotation.5UTR_3kb_corrected.gff
echo "UTRs identified!" $(date)
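To make explicit what -l 0 -r 3000 -s is meant to do, here is an awk imitation of a strand-aware downstream flank on two made-up genes. This is only my approximation of the behaviour (a fixed 3000 bp window after the gene end on the plus strand, before the gene start on the minus strand, clipped to the chromosome), not bedtools itself. The gene rows and the chromosome length are hypothetical.

```shell
workdir=$(mktemp -d)
flank=3000
chrom_len=5000   # hypothetical chromosome length

# Two hypothetical genes, one per strand (tab-separated GFF rows)
printf 'toy_chr\t.\tgene\t1000\t1500\t.\t+\t.\tID=toy_plus\n'  >  "$workdir/genes.gff"
printf 'toy_chr\t.\tgene\t2000\t2600\t.\t-\t.\tID=toy_minus\n' >> "$workdir/genes.gff"

flanks=$(awk -F'\t' -v OFS='\t' -v w="$flank" -v len="$chrom_len" '
{
    if ($7 == "+") { s = $5 + 1; e = $5 + w }   # downstream = after the gene end
    else           { s = $4 - w; e = $4 - 1 }   # downstream = before the gene start
    if (s < 1)   s = 1                          # clip to chromosome bounds
    if (e > len) e = len
    if (e >= s) print $1, $2, "3prime_UTR", s, e, $6, $7, $8, $9
}' "$workdir/genes.gff")

echo "$flanks"
```

The window is the same size for every gene, so these are candidate downstream regions rather than measured UTRs; the subtract step afterwards only trims the parts that run into a neighboring gene.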
Submitted batch job 309994. Gave me this error:
*****ERROR: Unrecognized parameter: 0 *****
*****ERROR: Need both -l and -r.
*****ERROR: Must supply -l and -r or just -b with -s.
It does not seem to like 0 being set for -l
or -r
. I’m going to set the 0 to 1 instead. Submitted batch job 309995. Still getting an error saying unrecognized parameter. Replace flankBed
with bedtools flank
and subtractBed
with bedtools subtract
. Submitted batch job 309997. Still getting an error saying unrecognized parameter. I might need to give the -g
flag the apoc.Chromosome_lenghts.txt
file. In the help notes, it says:
The genome file should tab delimited and structured as follows:
<chromName><TAB><chromSize>
For example, Human (hg19):
chr1 249250621
chr2 243199373
...
chr18_gl000207_random 4262
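A plausible explanation for the "Unrecognized parameter: 0" error, which changing 0 to 1 could not fix: ${genome} was set in my login shell but never inside the batch script, and the script runs with --export=NONE, so the variable is probably empty in the job. When an unquoted empty variable is expanded it vanishes from the command line, so -g ends up taking -l as its value and the 0 is left over. The sketch below reproduces that argument shift without running bedtools; n_args and g_value are just illustrative variable names.

```shell
# Simulate the batch environment, where genome was never defined
unset genome

# Unquoted, the empty variable disappears from the argument list entirely
set -- -i apoc_GFFannotation.gene_sorted.gff -g ${genome} -l 0 -r 3000 -s
n_args=$#
echo "number of arguments: $n_args"

# Find which argument a program would read as the value of -g
prev=""
g_value=""
for arg in "$@"; do
    if [ "$prev" = "-g" ]; then g_value=$arg; fi
    prev=$arg
done
echo "-g received: $g_value"
```

Defining the path inside the script itself, or writing the file name directly after -g, avoids the problem. The -g option also expects the two-column chromosome sizes file described in the help text rather than the genome fasta.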
Going to add apoc.Chromosome_lenghts.txt
as the -g
argument. Submitted batch job 310001. Now getting the error: Error: Unable to open file apoc.GFFannotation.5UTR_3kb.gff. Exiting.
Editing output file names so that they correspond to 5’ or 3’. Submitted batch job 310005. Worked, hooray! Now I will download the following files to my local computer: apoc.GFFannotation.3UTR_3kb_corrected.gff
, apoc.GFFannotation.5UTR_3kb_corrected.gff
, apoc_GFFannotation.gene_sorted.gff
, and apoc_GFFannotation.mRNA_sorted.gff
. After working through the R script that Javi provided me, I don’t think I need to do all of the R code that he proposed. I think I just need to extract the 3’UTR sequences from the genome using bedtools (example here). I can use bedtools getfasta. In the /data/putnamlab/jillashey/Astrangia_Genome/scripts
folder: nano bed_getfasta_3UTR.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia_Genome/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BEDTools/2.30.0-GCC-11.3.0
cd /data/putnamlab/jillashey/Astrangia_Genome
echo "Extracting 3' UTR sequences" $(date)
bedtools getfasta -fi apoculata.assembly.scaffolds_chromosome_level.fasta -bed apoc.GFFannotation.3UTR_3kb_corrected.gff -fo apoc_3UTR.fasta -name
echo "Sequence extraction complete!" $(date)
Submitted batch job 310035. Ran in 9 seconds. Look at the output fasta.
zgrep -c ">" apoc_3UTR.fasta
60613
head apoc_3UTR.fasta
>3prime_UTR::chromosome_1:17663-20663
GGAACGCGTACCAATTTTATCCGTGATCGGACTACCTCGCGACTAGGTCCGGGTAAAATTTTGGTTCAGCACGGGTGAGGTAAATCCGTTTACACGTCAAATGTTATCCGTGCTGAACCATTTTTTTCGGTATGCGTTCCATTTTTACTCGAGCCGTGCCGCGTAAAATGGGCTTGAGTTATAAGCAATCAAAACTAAAATGAAGGGTGTCATTGCAAGATTAAACTGTTGCTTTGGTAACCTTTTATGTCATAAAAATCATAACAACGTGTTCCGCAATCATTGGGCATTTGTTTGACACCATGATTGTAGCATCAACTGATGAAGAGTGGTTAAAGTGACCCGTCAAAATCCAAGACTTGGAAAGTGCTGGAAAATTGAGCCAGCCACCTTAAACCTTCATCTTAACTCAGTGAAGGCAAGAACTAGCCAAAAATTCTTATAAACAATGAACAGGGTCAGTGACAGACCCCGTTAAGTGTCAACAGGATAACCCCCGACTACTCGTATTTATTATTCAAACAGTTCAGTTTCTCTCTAAGATAAAATTCAAATTTAAGAGGCTGTGTCTGTAATACTGGTCTGCCCGATCTGGCTTGTTAAATAGATTTCGCTCTTTTCTAGTGTGTTGCAAGACTTTCATGTACTTATACTAAAAACACAATTCACTTGACCGGATCATTAATTTTTTTTTCGAGTTTTTAAGGATTTTCGATTCACCCGAACGAGCACTTGTAACGACCACTGTCAAGGCTTAAATTTACCACAACTTTGAAAATTAATTAAAACAGAACTAACAACTTCTGTACCAAACAAACACCACAGCCTGTTAAATTATGTCTGTTTAATTTATCACACCAACTTTCACGAACATTGAGCTCGAATCAAGCTAGCAGCAAAACATTAACGACACCCCTATCACACATCTTGATGAAACACTTGAAAACGGCTCAGATTCAAACTTTAGTTTTGAAAAAGTGGGGCGGTCTACTTCAATCAAACCAACGCAGCCGAAAAGTAGATGTAAAAAAAAAATGCTCCTATCGTAAAAAACCTTCTGGTATCGTCTTAATGACATATTTACAGCTGTGTAAACTTGCCCAATTTCGAAGATCTTCATGTTTGCAGTCATTTTGAGGGACTCGTCAGGACGAGGAGTTTGATCAACTGGCCTCGCAGACATACCTAACACCATCAGAAATAAACAAAGATTAGATATAAATGCATTGCAGTTCAACTGAAAGAGAATAAATCGTGTAATAGATCAACGGGTCGAGAATTATGATTAACGTTTAGCTATAAGCTTTATATTGATCTACTCCCGTTGTTATTGTTGAATGAATCGGAGAGTTTTTCAAAAAATTATTTAACATCTGTTACAAATAAAAATAACAACAATTTTTTTTTTCAAGTTTACAAAATAGCTCGTGCAGTATTTGATCTTAACGACTGAAAAAGACATAGTATATCTTGTTTCTAAGCCCTTACCTCTTTTATCGGGTCGAGAATTCACCTTATAAAACTCTGTGCGTTTGCAAATAAAACAAAACAAATCATTTCGCACTGATCAAGCTTGACGCGTTGTTTAAATGCCGCCGACATCGAAAAAATTACACCAATATTGTAGCCATCAGATACCATCTGTGGCGTCCGAAATATATTATCGTAAGTAAAGAAATAGTTTATTCTATGTTCGTTTTTAAAAAAGATTAATAAAAATGAAATCGTTTTCAGTAAACCTACCAAAGTTTCGTGATGTTAACGTTCCCTGATGGTGTTCAATTATTCTAAAAATAGTCGTTTAATTTGAACGGAATAGTTTCAATTAAAACTTAAATATCTTTTAATTTTAAAAAAGTTAATATACAAGGCTTCCAAAAATAAACAAACCAGCATGCACAGAACATTTTTTAGCAAAACTGAACGAGTTTTTCTGGGTAGGGTTATAAAACAGGCTACGCCTGCACTGTTATAATGAAACGCTTTGTTAAAGGGAACCT
CTGTTGTTTAGGCAAGTTGACGTTGAGTTGTTTGAAATAATGCCTAAACTACAATTGTACAGGACAGGGGAGAAACAGCATTTAATAACAACCGTAAACGGAAAAGAAAAAGGCTTTTAAATTTCCCGCTCGGCGCCATTTTGAAATTTATGACGTTAGTGCGTTGTCCAAATGTTTCACTCAAAAGAACCTCTGCTTTCTAATAATAGCGGCAACGCAGTGACGTCATAAAATTCAAAATGGCGGCGCGCGGGAAATTCAAAAGCACAAAAATTGACAAATCAAAGGCCCAGTATTCCAATTAAAAAATGGAATCCGCGACTTTAAACAACCAAAACGATTTTAGACTGTGGTAAACTTATTATTTTGACCAAAACTGTTTTACAGTCGAGGTTCCCTTTAAGCAAAACACCTTAAAGCTTAATAGTCCAGTCATTTTGTGGTTGGACCTGCAACTAATTGCAACAGAAGCTTTTTGACCGCAAATCATTTCGAATAAGTCTCTAGATTAACATACTAAAAGTCCCAAAGGATTCATGCATGGGAACATCTCGAAAATTTAACGGATTCTAAGTTGTGCTTTTGCGTAACATGAAAACGAGGCTCAAAAACAGAGGTTCTGTTATTATGCTTGTTTATGGCAAGAATTTCATTTATTTCATCCAAAAACAATGAAATATACTTTAAATATGCTTGAAAACTTTCATCCAGAGGAAAAATTAATCAAAAAGTATTCAATAAGCAAAATAAATCTCATTATATTTAAATTTTCAATTGTGAAGCGAATTGGACTACACCAAAAATGGTAGTTGCGATAGTTTATTATCCTTTTATGGATCGCACAAAAAAATTGCAAGCGAGCCAACAAATAAATATATCTTTCACAATGGTGAAATATTCTAAAAGAACATCGTTTGTGAAATGAAAACGAATTTCACAGTTTCTTCCAGCTCTTGTGTCAATCACCGTCGTCATCGCTATCATCGAAATCTTGATGGGG
>3prime_UTR::chromosome_1:40489-43489
TTCCAGGAATTTTGAAATAATTTTCTACAATTAATTTATTGTAATAAGAAAATCAACTGTATGCAATTGTTATGTAGACAGTCTCTAAATATCTTCTTAGTACACTCCAGAAAGCACATCATTTCTCACCACCATTGAAAAGTGCATATAAAAACTCTCAGATTCTCGCCATGTTTTCGAGCTGCAAAGAGCTTTATAACTTCTTTTGGATCCTCTCCGCATATTCGTGACGTAAATCTGATTTCACAGGTAGCCCAGGAGGTCTGAACTCCGCTTTCACTGTAGTCGTGTTCGGGTGGATTTTGTCTTGATTCAAAATACTAGCTGGTGTGGCTTGTCCAGCTGGATGTAAAGTTTATATATAGCTTTTTAGAACCAGAAACACAACTAGTATCTATTATTTTATACACTTTTTAATTGTGGCGGGAAAACGTGTACAACATGATGTACGTTGATGGGATCGTGTCCTCAAAGACCCACTGCCCCAAGGCCATGGAGTCCAAAGCACGGCCCCTATATATTTGTTTTAATCTTCAAGGCAAATGCGCATTACCAGTATCAATTATCAAGGATGGACCGTGAACAGTGTAGAAATAAACAGAGTAAGAAACGAAACAAAGTATTTCCTTGTTCCTATGCTTTATGGACATTGGTGGCTTGATTATACTTTATCTTCTTAATCTGGACAGTTGAGGTTTCGCAAGAAGTGTTTAAGTGAATGTTTACCTGTTCATGAGGGTGATCAGACCGCTTTTTGTAGCAGAATTTAATGCCAGATATTTTCATTGATTTCCGTCCGCCATGTTGGAGCCCAAGCAGTTGGGCTCCAACATGGCGTCTCCATAATGTGCTCTCTAAATTTGCGTGAAACATTTTGACGGATAACTCGAGTACGAAATAATGTAAAGACCTGAGACTTGGACAAGTAGTATGTTTATATAGTAGTTTTCTACAACCTGTAATATTCTTGATTTTTTTGACTCAATAGTTTTGGATTTTTTTTCGCGGCGTGACAGTGAAAACCACCTATTGAAATGGATCGCGCCTTCGGCACCCGAGGAGTTTAATTCACGTTGCCCCCGAATAGAGAGCCTGCTGGCAGGCTACAGTTAATCTTAAGTTGCATTATGTAACCGGGCCCTGGCCCAGTTCTTCGTGTTCCACTTGTATTTGAGTGATTTGTTGGTCTTTGTTTGCCAAAAACAAAAGAAAAAAATCAAAAATGTCGAGCAAAATCTAAATTATAGACTACGTATAGAACACAATAGTGGCTAAAGCGTAAAAATGGCCAGTTACACAGCGACCTCACACAAGAGTGCATGGACGGCGGAGCATTTTTCCAACTGGGGGGGCTTATGAGACTCTTCTCTTATAAATTCTTTATTTTTTCTGTAAAAGTGGGGGGCTAAAGCCCCCCCAGCCCCCCTGGCTCCGCCGTCCCTGGAGTGTTTGGGAAATAGAGATTACTCCATGGAAAATGCGCGCGTACGATTTTTATTCACGAGTTGAGTAGTATCAGAAAACGAACGAGTGAGCGTAGCGAACGAGTGATTTTTCTGATACAACTCTGCGAGTGAATAAAAATCGTACAAAGCATTTTCCATGGTGTAATTTGTTTATTTTATAGATACTGAGGTTTTTGGAAAGCAAATTAAATCGCTCGACTTTTTATCAGTTACGCGCGTGTTTAGGAAACCCATTCATCTGCCATAAACTAAAATAACTAAAACATGTTTGTCTTTCAAAATCACATCTTTAATGAAGTGTAGCACAAGAAATAAGACGTCAGTTTATACTCTCGCAACTTTTACAGAAACAACTTTTGATAGAAAACCTGACTGAGACTTGTACAAAACTTGCAAGTTCCAGTTCAATGTTTTAACAGAACAAAACAGCGCTAAACGCTGAATAACTCATCGCGAACTCGAGCGAACTTTTACAAATTAAATCAAGAAAACAGTCTGCCAAAACCAAGGACAAACCTGCAGTGTTTTAGTCAT
CATCTGAATCGAAACTCGCTGCAATCCCTCGGAGTGAAGACGGCCCGTATTCTTGTCCTTCCTTTGTGCGAATTTATACTCAAAGGTTTATTCAACTCGGCAGGAGGAATTTCGGTTATCTCGCTTCGTGTTTCTCTTTTTGTTTGAAGAAACGTTGTGACAACAAGGCCGTCTTGTTTCGTTTTCTTTATAGTGTTTTGGTTTTCCTGTTCTGCAATATAAACGACTCAGTGGAGCCGTTAAAGACACAAACAAACCTTGAACGTGTCTCCGCCATGTCGCCGGTGTTTACGACGAGAATCAAAATGGCGATACGGTGTTGTTCACAAGTGAGTTTTGGCCGTTCGCGCTTGTGATTGGACGAACGACGTTTTTTCACTAGTGAGTTTTCATCGTTCGCCTTTCCTATTGGCTAATGCATTTTACGTATCATTTGTACGCAATCGTTTTTTCGTGAGAAAATCTCAGTATCTATGATATAAATAGTTGTTCTGTTCTGATATTTCCATCGGTTTGTATAAAAATAACCTAAGAAGGCGAGAAACAAAACTGTTACGAATCATTTACTTTTCTAGCCCCAAAAGATGATTCCCAGGGGTTTTTGCCTCGTTCTTGACTTAGAGGGGTGTGGTACTTGACCGATGTTTGGTTATTAGGGTGTTACAGAGGGTTTTTAAATGTTCAAAGACTTCCTAAGGAAGCTCCCTCTCTTTACGGACACGGACACCTCCCTGTTAAAGACAGTTTATTTGGTCCGTAAGAGATCAGAATTTATACAAACTTTACCTCTGTAATACGAATGCCTCCATGATCTAGCCGCTTTTTTCTAACCCTTTGGTATCCGACCGCGCTATTTTTTGTAACGTTTCAAACGCGCGAAGGACAATCGAAGCTTCTCGTACGATGAACGTATAATTTCTCAAAAAACTAGCTTCATGTACTTCATTTTCATTCCCAGGCCTTCAAACTGACAGAATATGAGCAAAATCGTTTTATGAAT
>3prime_UTR::chromosome_1:53463-55801
Hooray! I have the 3’ UTRs! But how do I know which gene they go with? I guess by location? Editing the script above so that instead of -name, it’s -fullHeader. Submitted batch job 310037. Nope, that didn’t change anything. Somehow I need to figure out how to obtain the gene ID that each 3’UTR sequence corresponds to, likely based on the coordinates of the 3’UTR and gene sequences.
20240326
I’m going to run a test of miranda with the mature miRNAs and the 3’UTRs. In the miranda documentation, file1 is the miRNA file and file2 is the 3’UTR sequences. There are options to play with scaling and alignment parameters (i.e. -en, -sc, -scale, -loose, etc.), but for now I will just run a basic test to see if it works. In the scripts folder, nano miranda_test.sh:
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/miranda
miranda /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa /data/putnamlab/jillashey/Astrangia_Genome/apoc_3UTR.fasta
conda deactivate
The miranda conda environment appears to be empty on the server… might need to reactivate? Removing miranda from /data/putnamlab and going to try to create a new one in the conda folder on putnamlab.
cd /data/putnamlab/conda
module load Miniconda3/4.9.2
# Create a new environment
conda create --prefix /data/putnamlab/conda/miranda
conda install miranda
Gives me this error: EnvironmentLocationNotFound: Not a conda environment: /data/putnamlab/miranda. There is a miranda folder, but it only has the history of the commands that I have run for this. I deleted this folder and am going to try installing with mamba, as this is what the bioconda website recommends.
module load Mamba/22.11.1-4
# Create a new environment
mamba create --name miranda_JA miranda
Didn’t work, gave me this error:
Looking for: ['miranda']
error libmamba Could not open lockfile '/opt/software/Mamba/22.11.1-4/pkgs/cache/cache.lock'
error libmamba Could not open lockfile '/opt/software/Mamba/22.11.1-4/pkgs/cache/cache.lock'
conda-forge/noarch 16.3MB @ 7.2MB/s 2.4s
conda-forge/linux-64 39.1MB @ 7.1MB/s 6.2s
Could not solve for environment specs
Encountered problems while solving:
- nothing provides requested miranda
The environment can't be solved, aborting the operation
Trying again
module purge
module load Miniconda3/4.9.2
conda create --prefix /data/putnamlab/conda/miranda
cd /data/putnamlab/conda/miranda
conda install bioconda::miranda
Downloading and Extracting Packages
ca-certificates-2024 | 127 KB | ######################################################################################## | 100%
certifi-2024.2.2 | 159 KB | ######################################################################################## | 100%
miranda-3.3a | 58 KB | ######################################################################################## | 100%
Preparing transaction: done
Verifying transaction: failed
EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
environment location: /opt/software/Miniconda3/4.9.2
uid: 1050
gid: 1012
Says that I do not have permission to write to the Miniconda location. But I’m not trying to write there. Asked ChatGPT and this is what it gave me:
conda create --prefix /data/putnamlab/conda/miranda
conda activate /data/putnamlab/conda/miranda
conda install bioconda::miranda
Success! Looks like I just had to activate the environment before installing. Let’s now try running miranda. In the scripts folder, nano miranda_test.sh:
#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/conda/miranda
miranda /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa /data/putnamlab/jillashey/Astrangia_Genome/apoc_3UTR.fasta
conda deactivate
Submitted batch job 310076
20240328
The miranda test finished in about a day. The output is in slurm-310076.out in the scripts folder; I should’ve redirected it to an output file, but oh well. The file is quite large. Here’s an example of the information in it:
Read Sequence:chromosome_1:40489-43489 (3000 nt)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing Scan: chromosome_12_481048 vs chromosome_1:40489-43489
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Forward: Score: 152.000000 Q:2 to 18 R:855 to 875 Align Len (17) (70.59%) (82.35%)
Query: 3' cccTTGTTCG-GCTTTGTAAc 5'
||: || :||||||||
Ref: 5' ctaAATTTGCGTGAAACATTt 3'
Energy: -11.470000 kCal/Mol
Scores for this hit:
>chromosome_12_481048 chromosome_1:40489-43489 152.00 -11.47 2 18 855 875 17 70.59% 82.35%
Score for this Scan:
Seq1,Seq2,Tot Score,Tot Energy,Max Score,Max Energy,Strand,Len1,Len2,Positions
>>chromosome_12_481048 chromosome_1:40489-43489 152.00 -11.47 152.00 -11.47 2 20 3000 855
Complete
So much information… Let’s go through this line by line. The top line, Read Sequence, is the 3’UTR sequence. The Performing Scan line indicates the specific miRNA that will be compared to the specified 3’UTR sequence. The Forward line has the score, the positions in the query (Q) and reference (R) sequences that matched up, and the alignment length. I’m not sure what the percentages mean… Maybe it’s the percent of base pairs in the miRNA that aligned? The Query is the miRNA sequence and the Ref is the 3’UTR sequence. The bars between the query and reference sequences represent aligned base pairs, and the colons represent wobble base pairs. The Scores for this hit line has the total score and energy of this interaction. A good cutoff is typically 100-150 for score and -10 kcal/mol for energy. I think the 2 and 18 are the positions in the miRNA (query) that align with the reference, and likewise 855 and 875 are the positions in the 3’UTR (reference). Still not sure what the percentages mean. The Score for this Scan line includes much of the same information as above. I’m guessing that a hit is a match of the miRNA to a specific portion of the 3’UTR, and the scan refers to all of the hits along a specific 3’UTR sequence? Not totally sure though.
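The two percentages can be checked by hand against the alignment string. In the example above, the match line has 12 bars and 2 colons over an alignment length of 17, and 12/17 = 70.59% while 14/17 = 82.35%, which suggests that the first value counts only the bar (Watson-Crick) pairs and the second also counts the wobble pairs. A minimal sketch of that arithmetic (the variable names are mine; the match string is copied from the example hit):

```shell
# Match line copied from the example hit above (bars = pairs, colons = wobble pairs)
match='||: || :||||||||'
align_len=17   # from "Align Len (17)" on the Forward line

# Count each symbol by deleting every other character and measuring what is left
bars=$(printf '%s' "$match" | tr -cd '|' | wc -c | tr -d ' ')
colons=$(printf '%s' "$match" | tr -cd ':' | wc -c | tr -d ' ')

# First percentage: bars only; second percentage: bars plus colons
pct_pairs=$(awk -v b="$bars" -v n="$align_len" 'BEGIN {printf "%.2f", 100 * b / n}')
pct_pairs_wobble=$(awk -v b="$bars" -v c="$colons" -v n="$align_len" 'BEGIN {printf "%.2f", 100 * (b + c) / n}')

echo "bars=$bars colons=$colons"
echo "$pct_pairs% $pct_pairs_wobble%"
```

The same count on the second example further down (12 bars, 2 colons, alignment length 16) gives 75.00% and 87.50%, which also matches its Forward line.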
This file is massive. It would have taken ~1 hour to download to my personal computer. I’m going to add some lines to the script that count all of the putative interactions. This code is being added to the miranda_test.sh script (and the actual miranda code is being commented out so it does not run again).
zgrep -c "Performing Scan" slurm-310076.out
I used the “Performing Scan” phrase because I believe that this is the header for each putative interaction. Submitted batch job 310209. Ran in about 6 mins and there are 112,558,341 putative interactions! That is crazy. I think I should apply some filtering parameters in the miranda code now. I’m going to add -sc 100 and -en -10. Also adding -out test_miranda.tab. Submitted batch job 310211.
20240401
Miranda has pre-emptively stopped and restarted a couple of times. I’m going to stop job 310211 now and assess the results thus far.
These were the current settings:
Current Settings:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Query Filename: /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa
Reference Filename: /data/putnamlab/jillashey/Astrangia_Genome/apoc_3UTR.fasta
Gap Open Penalty: -9.000000
Gap Extend Penalty: -4.000000
Score Threshold: 100.000000
Energy Threshold: -10.000000 kcal/mol
Scaling Parameter: 4.000000
If I want to invoke strict seed binding, I will probably have to play with the gap open and extend penalty thresholds. Here’s an example of one of the miRNA:mRNA matches:
Performing Scan: chromosome_12_481048 vs chromosome_1:17663-20663
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Forward: Score: 143.000000 Q:2 to 17 R:2880 to 2900 Align Len (16) (75.00%) (87.50%)
Query: 3' ccctTGTT-CGGCTTTGTAAc 5'
|||| | :||||:|||
Ref: 5' tttcACAATGGTGAAATATTc 3'
Energy: -12.000000 kCal/Mol
Scores for this hit:
>chromosome_12_481048 chromosome_1:17663-20663 143.00 -12.00 2 17 2880 2900 16 75.00% 87.50%
Very cool! How do I invoke strict seed binding?
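As far as I know, the miRanda 3.3a usage text lists a -strict flag that demands strict pairing in the 5’ seed region, so the gap penalties may not need to change; this is worth confirming with miranda --help in the conda environment before relying on it. A sketch of what the call might look like (the file names and output name are placeholders, and the command only runs if miranda and both inputs are present):

```shell
# Placeholder file names; substitute the real miRNA and 3'UTR fasta paths
mirna_fa=mature_all.fa
utr_fa=apoc_3UTR.fasta

# Same thresholds as the run above, plus -strict for seed pairing (assumed flag, check --help)
miranda_args="$mirna_fa $utr_fa -sc 100 -en -10 -strict -out test_miranda_strict.tab"
echo "miranda $miranda_args"

# Only run if miranda is on the PATH and both inputs exist and are non-empty
if command -v miranda >/dev/null 2>&1 && [ -s "$mirna_fa" ] && [ -s "$utr_fa" ]; then
    # Word splitting of the argument string is intended here
    miranda $miranda_args
fi
```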
20240409
In order for miranda to run faster, I am going to filter the input so that I’m only using the differentially expressed genes and miRNAs. First, I’m going to subset the mature_all.fa file. I have a list of my differentially expressed miRNAs (DEM_list.txt). I’m copying it to the server into /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57. Subset the mature fasta file based on the miRNAs in the DEM list.
grep -A 1 -f DEM_list.txt mature_all.fa | grep -v "^--$" > mature_DE.fa
zgrep -c ">" mature_DE.fa
50
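One thing to watch with the grep approach: grep -f matches substrings, so an ID that is a prefix of another ID would pull both records. A minimal sketch of an exact-match alternative, assuming each header is > followed by the ID (the IDs below are made up so that one is a prefix of the other):

```shell
tmp=$(mktemp -d)
# Toy inputs: made-up IDs chosen so that one is a prefix of another
printf '%s\n' 'chromosome_1_12' > "$tmp/DEM_list.txt"
cat > "$tmp/mature_all.fa" <<'EOF'
>chromosome_1_12
AAAA
>chromosome_1_123
CCCC
EOF

# grep -f matches substrings, so both records come through
n_grep=$(grep -A 1 -f "$tmp/DEM_list.txt" "$tmp/mature_all.fa" | grep -c '^>')

# awk: load the IDs, then keep a record only when its header ID matches exactly
awk 'NR == FNR {keep[$1]; next}
     /^>/ {id = substr($1, 2); printing = (id in keep)}
     printing' "$tmp/DEM_list.txt" "$tmp/mature_all.fa" > "$tmp/mature_DE.fa"
n_awk=$(grep -c '^>' "$tmp/mature_DE.fa")

echo "grep kept $n_grep records, awk kept $n_awk"
```

If the number of records in mature_DE.fa equals the number of IDs in DEM_list.txt, the grep version did not pick up any extras.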
Excellent, now all differentially expressed miRNAs are in mature_DE.fa. Time to work on subsetting the gene list by the DEGs. This one is a little bit more complicated because I have a fasta of the 3’UTR sequences for each gene, which is being used as input for the miranda script. However, the 3’UTR sequences are only labeled with a coordinate (i.e. >chromosome14:1-200) as opposed to the gene that corresponds to that specific 3’UTR sequence (i.e. >gene1). I can connect them with bedtools intersect (also used above). First, convert the fasta headers of the 3’UTR fasta file to bed file format.
cd /data/putnamlab/jillashey/Astrangia_Genome
awk -F'[:-]' '/^>/ {print $2 "\t" $3-1 "\t" $4}' apoc_3UTR.fasta > apoc_3UTR.bed
Intersect BED files with primary GFF. In the scripts folder: nano bed_intersect_3UTR.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia_Genome/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BEDTools/2.30.0-GCC-11.3.0
cd /data/putnamlab/jillashey/Astrangia_Genome
echo "Merging 3'UTR bed file with gff file to get genes associated with 3'UTR sequences" $(date)
bedtools intersect -a apoc_3UTR.bed -b apoculata_v2.0.gff3 -wa -wb > intersect_output.txt
echo "Merge complete" $(date)
Submitted batch job 310666. Got this error:
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
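Apparently bedtools is looking for the bed file to have chromosome, start, and end columns. With -F'[:-]', the header splits into $1 = >chromosome_1, $2 = start, and $3 = end, so my first attempt printed the start, the end minus 1, and an empty field, with no chromosome column. I need to include the chromosome column: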
awk -F'[:-]' '/^>/ {chromosome=substr($1, 2); start=$2; end=$3; print chromosome "\t" start "\t" end}' apoc_3UTR.fasta > apoc_3UTR.bed
head apoc_3UTR.bed
chromosome_1 17663 20663
chromosome_1 40489 43489
chromosome_1 53463 55801
chromosome_1 53463 55801
chromosome_1 74713 76735
chromosome_1 74713 76735
chromosome_1 92518 93919
chromosome_1 94852 95518
chromosome_1 92518 93919
chromosome_1 101190 104190
Now try running the bed_intersect_3UTR.sh script. Submitted batch job 310668. It runs, but the output file is empty (size 0)… try using apoc_GFFannotation.gene_sorted.gff as the gff input. Submitted batch job 310669. Still produced nothing…
head apoc_GFFannotation.gene_sorted.gff
chromosome_1 EVM gene 20664 21393 . - . ID=evm.TU.chromosome_1.1;Name=EVM%20prediction%20chromosome_1.1
chromosome_1 . gene 34636 40489 . + . ID=evm.TU.chromosome_1.2;Name=EVM%20prediction%20chromosome_1.2
chromosome_1 . gene 43758 53463 . + . ID=evm.TU.chromosome_1.3;Name=EVM%20prediction%20chromosome_1.3
chromosome_1 . gene 55802 60431 . - . ID=evm.TU.chromosome_1.4;Name=EVM%20prediction%20chromosome_1.4
chromosome_1 . gene 62282 74713 . + . ID=evm.TU.chromosome_1.5;Name=EVM%20prediction%20chromosome_1.5
chromosome_1 . gene 76736 78519 . - . ID=evm.TU.chromosome_1.6;Name=EVM%20prediction%20chromosome_1.6
chromosome_1 . gene 78661 92518 . + . ID=evm.TU.chromosome_1.7;Name=EVM%20prediction%20chromosome_1.7
chromosome_1 . gene 93920 94852 . - . ID=evm.TU.chromosome_1.8;Name=EVM%20prediction%20chromosome_1.8
chromosome_1 . gene 104191 109764 . - . ID=evm.TU.chromosome_1.9;Name=EVM%20prediction%20chromosome_1.9
chromosome_1 . gene 113080 113418 . - . ID=evm.TU.chromosome_1.10;Name=EVM%20prediction%20chromosome_1.10
>chromosome_1:17663-20663
GGAACGCGTACCAATTTTATCCGTGATCGGACTACCTCGCGACTAGGTCCGGGTAAAATTTTGGTTCAGCACGGGTGAGGTAAATCCGTTTACACGTCAAATGTTATCCGTGCTGAACCATTTTTTTCGGTATGCGTTCCATTTTTACTCGAGCCGTGCCGCGTAAAATGGGCTTGAGTTATAAGCAATCAAAACTAAAATGAAGGGTGTCATTGCAAGATTAAACTGTTGCTTTGGTAACCTTTTATGTCATAAAAATCATAACAACGTGTTCCGCAATCATTGGGCATTTGTTTGACACCATGATTGTAGCATCAACTGATGAAGAGTGGTTAAAGTGACCCGTCAAAATCCAAGACTTGGAAAGTGCTGGAAAATTGAGCCAGCCACCTTAAACCTTCATCTTAACTCAGTGAAGGCAAGAACTAGCCAAAAATTCTTATAAACAATGAACAGGGTCAGTGACAGACCCCGTTAAGTGTCAACAGGATAACCCCCGACTACTCGTATTTATTATTCAAACAGTTCAGTTTCTCTCTAAGATAAAATTCAAATTTAAGAGGCTGTGTCTGTAATACTGGTCTGCCCGATCTGGCTTGTTAAATAGATTTCGCTCTTTTCTAGTGTGTTGCAAGACTTTCATGTACTTATACTAAAAACACAATTCACTTGACCGGATCATTAATTTTTTTTTCGAGTTTTTAAGGATTTTCGATTCACCCGAACGAGCACTTGTAACGACCACTGTCAAGGCTTAAATTTACCACAACTTTGAAAATTAATTAAAACAGAACTAACAACTTCTGTACCAAACAAACACCACAGCCTGTTAAATTATGTCTGTTTAATTTATCACACCAACTTTCACGAACATTGAGCTCGAATCAAGCTAGCAGCAAAACATTAACGACACCCCTATCACACATCTTGATGAAACACTTGAAAACGGCTCAGATTCAAACTTTAGTTTTGAAAAAGTGGGGCGGTCTACTTCAATCAAACCAACGCAGCCGAAAAGTAGATGTAAAAAAAAAATGCTCCTATCGTAAAAAACCTTCTGGTATCGTCTTAATGACATATTTACAGCTGTGTAAACTTGCCCAATTTCGAAGATCTTCATGTTTGCAGTCATTTTGAGGGACTCGTCAGGACGAGGAGTTTGATCAACTGGCCTCGCAGACATACCTAACACCATCAGAAATAAACAAAGATTAGATATAAATGCATTGCAGTTCAACTGAAAGAGAATAAATCGTGTAATAGATCAACGGGTCGAGAATTATGATTAACGTTTAGCTATAAGCTTTATATTGATCTACTCCCGTTGTTATTGTTGAATGAATCGGAGAGTTTTTCAAAAAATTATTTAACATCTGTTACAAATAAAAATAACAACAATTTTTTTTTTCAAGTTTACAAAATAGCTCGTGCAGTATTTGATCTTAACGACTGAAAAAGACATAGTATATCTTGTTTCTAAGCCCTTACCTCTTTTATCGGGTCGAGAATTCACCTTATAAAACTCTGTGCGTTTGCAAATAAAACAAAACAAATCATTTCGCACTGATCAAGCTTGACGCGTTGTTTAAATGCCGCCGACATCGAAAAAATTACACCAATATTGTAGCCATCAGATACCATCTGTGGCGTCCGAAATATATTATCGTAAGTAAAGAAATAGTTTATTCTATGTTCGTTTTTAAAAAAGATTAATAAAAATGAAATCGTTTTCAGTAAACCTACCAAAGTTTCGTGATGTTAACGTTCCCTGATGGTGTTCAATTATTCTAAAAATAGTCGTTTAATTTGAACGGAATAGTTTCAATTAAAACTTAAATATCTTTTAATTTTAAAAAAGTTAATATACAAGGCTTCCAAAAATAAACAAACCAGCATGCACAGAACATTTTTTAGCAAAACTGAACGAGTTTTTCTGGGTAGGGTTATAAAACAGGCTACGCCTGCACTGTTATAATGAAACGCTTTGTTAAAGGGAACCT
CTGTTGTTTAGGCAAGTTGACGTTGAGTTGTTTGAAATAATGCCTAAACTACAATTGTACAGGACAGGGGAGAAACAGCATTTAATAACAACCGTAAACGGAAAAGAAAAAGGCTTTTAAATTTCCCGCTCGGCGCCATTTTGAAATTTATGACGTTAGTGCGTTGTCCAAATGTTTCACTCAAAAGAACCTCTGCTTTCTAATAATAGCGGCAACGCAGTGACGTCATAAAATTCAAAATGGCGGCGCGCGGGAAATTCAAAAGCACAAAAATTGACAAATCAAAGGCCCAGTATTCCAATTAAAAAATGGAATCCGCGACTTTAAACAACCAAAACGATTTTAGACTGTGGTAAACTTATTATTTTGACCAAAACTGTTTTACAGTCGAGGTTCCCTTTAAGCAAAACACCTTAAAGCTTAATAGTCCAGTCATTTTGTGGTTGGACCTGCAACTAATTGCAACAGAAGCTTTTTGACCGCAAATCATTTCGAATAAGTCTCTAGATTAACATACTAAAAGTCCCAAAGGATTCATGCATGGGAACATCTCGAAAATTTAACGGATTCTAAGTTGTGCTTTTGCGTAACATGAAAACGAGGCTCAAAAACAGAGGTTCTGTTATTATGCTTGTTTATGGCAAGAATTTCATTTATTTCATCCAAAAACAATGAAATATACTTTAAATATGCTTGAAAACTTTCATCCAGAGGAAAAATTAATCAAAAAGTATTCAATAAGCAAAATAAATCTCATTATATTTAAATTTTCAATTGTGAAGCGAATTGGACTACACCAAAAATGGTAGTTGCGATAGTTTATTATCCTTTTATGGATCGCACAAAAAAATTGCAAGCGAGCCAACAAATAAATATATCTTTCACAATGGTGAAATATTCTAAAAGAACATCGTTTGTGAAATGAAAACGAATTTCACAGTTTCTTCCAGCTCTTGTGTCAATCACCGTCGTCATCGCTATCATCGAAATCTTGATGGGG
>chromosome_1:40489-43489
TTCCAGGAATTTTGAAATAATTTTCTACAATTAATTTATTGTAATAAGAAAATCAACTGTATGCAATTGTTATGTAGACAGTCTCTAAATATCTTCTTAGTACACTCCAGAAAGCACATCATTTCTCACCACCATTGAAAAGTGCATATAAAAACTCTCAGATTCTCGCCATGTTTTCGAGCTGCAAAGAGCTTTATAACTTCTTTTGGATCCTCTCCGCATATTCGTGACGTAAATCTGATTTCACAGGTAGCCCAGGAGGTCTGAACTCCGCTTTCACTGTAGTCGTGTTCGGGTGGATTTTGTCTTGATTCAAAATACTAGCTGGTGTGGCTTGTCCAGCTGGATGTAAAGTTTATATATAGCTTTTTAGAACCAGAAACACAACTAGTATCTATTATTTTATACACTTTTTAATTGTGGCGGGAAAACGTGTACAACATGATGTACGTTGATGGGATCGTGTCCTCA
Hmm, so the gene and the UTR do not overlap by the looks of it. The first line in the sorted gene gff has start and stop 20664 21393, but the 3’UTR sequence goes from chromosome_1:17663-20663, so there is no overlap. Maybe I need to use the sorted 3’UTR gff file, aka apoc_GFFannotation.three_prime_UTR_sorted.gff? Submitted batch job 310671. Still empty :’(. Maybe I calculated the 3’UTRs incorrectly… or maybe not, but I am not getting any overlap because I used flank bed. So how to connect them…
This is what flank bed did:
I need to know the flanks, which are the 3’UTRs, and what is being flanked, which is the input, i.e. the gene itself. I need to know the genes that were flanked! Trying apoc.GFFannotation.3UTR_3kb.gff, which is the gff that was produced by flank bed. Submitted batch job 310686. This produced an output file, but I’m not sure I understand it.
wc -l test.txt
117033 test.txt
head test.txt
chromosome_1 17663 20663 chromosome_1 EVM 3prime_UTR 17664 20663 . - . ID=evm.TU.chromosome_1.1;Name=EVM%20prediction%20chromosome_1.1
chromosome_1 40489 43489 chromosome_1 . 3prime_UTR 40490 43489 . + . ID=evm.TU.chromosome_1.2;Name=EVM%20prediction%20chromosome_1.2
chromosome_1 53463 55801 chromosome_1 . 3prime_UTR 53464 56463 . + . ID=evm.TU.chromosome_1.3;Name=EVM%20prediction%20chromosome_1.3
chromosome_1 53463 55801 chromosome_1 . 3prime_UTR 52802 55801 . - . ID=evm.TU.chromosome_1.4;Name=EVM%20prediction%20chromosome_1.4
chromosome_1 53463 55801 chromosome_1 . 3prime_UTR 53464 56463 . + . ID=evm.TU.chromosome_1.3;Name=EVM%20prediction%20chromosome_1.3
chromosome_1 53463 55801 chromosome_1 . 3prime_UTR 52802 55801 . - . ID=evm.TU.chromosome_1.4;Name=EVM%20prediction%20chromosome_1.4
chromosome_1 74713 76735 chromosome_1 . 3prime_UTR 74714 77713 . + . ID=evm.TU.chromosome_1.5;Name=EVM%20prediction%20chromosome_1.5
chromosome_1 74713 76735 chromosome_1 . 3prime_UTR 73736 76735 . - . ID=evm.TU.chromosome_1.6;Name=EVM%20prediction%20chromosome_1.6
chromosome_1 74713 76735 chromosome_1 . 3prime_UTR 74714 77713 . + . ID=evm.TU.chromosome_1.5;Name=EVM%20prediction%20chromosome_1.5
chromosome_1 74713 76735 chromosome_1 . 3prime_UTR 73736 76735 . - . ID=evm.TU.chromosome_1.6;Name=EVM%20prediction%20chromosome_1.6
The numbers don’t look correct. It is finding intersections, but it’s the same info because the gff input is the 3’UTR 3kb downstream gff itself. There is an associated ID which corresponds to the gene, but I am hesitant to use it, especially since there are so many rows that there are probably repeats. I could use bedtools closest, which searches for the nearest feature in the other file. In the scripts folder: nano bed_close_3UTR.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia_Genome/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BEDTools/2.30.0-GCC-11.3.0
cd /data/putnamlab/jillashey/Astrangia_Genome
echo "Finding closest gene to 3'UTR seqs " $(date)
bedtools closest -a apoc_3UTR.bed -b apoc_GFFannotation.gene_sorted.gff > closest_genes.txt
echo "Complete!" $(date)
Submitted batch job 310687. Got this error:
Error: Sorted input specified, but the file apoc_3UTR.bed has the following out of order record
chromosome_1 92518 93919
Adding this line: sort -k1,1 -k2,2n -o apoc_3UTR_sorted.bed apoc_3UTR.bed to the script. Submitted batch job 310688. Now got this error:
ERROR: chromomsome sort ordering for file apoc_GFFannotation.gene_sorted.gff is inconsistent with other files. Record was:
chromosome_10 . gene 12623 13741 . + . ID=evm.TU.chromosome_10.1;Name=EVM%20prediction%20chromosome_10.1
But an output file did get produced!
wc -l closest_genes.txt
73170 closest_genes.txt
chromosome_1 17663 20663 chromosome_1 EVM gene 20664 21393 . - . ID=evm.TU.chromosome_1.1;Name=EVM%20prediction%20chromosome_1.1
chromosome_1 40489 43489 chromosome_1 . gene 34636 40489 . + . ID=evm.TU.chromosome_1.2;Name=EVM%20prediction%20chromosome_1.2
chromosome_1 53463 55801 chromosome_1 . gene 43758 53463 . + . ID=evm.TU.chromosome_1.3;Name=EVM%20prediction%20chromosome_1.3
chromosome_1 53463 55801 chromosome_1 . gene 55802 60431 . - . ID=evm.TU.chromosome_1.4;Name=EVM%20prediction%20chromosome_1.4
chromosome_1 53463 55801 chromosome_1 . gene 43758 53463 . + . ID=evm.TU.chromosome_1.3;Name=EVM%20prediction%20chromosome_1.3
chromosome_1 53463 55801 chromosome_1 . gene 55802 60431 . - . ID=evm.TU.chromosome_1.4;Name=EVM%20prediction%20chromosome_1.4
chromosome_1 74713 76735 chromosome_1 . gene 62282 74713 . + . ID=evm.TU.chromosome_1.5;Name=EVM%20prediction%20chromosome_1.5
chromosome_1 74713 76735 chromosome_1 . gene 76736 78519 . - . ID=evm.TU.chromosome_1.6;Name=EVM%20prediction%20chromosome_1.6
chromosome_1 74713 76735 chromosome_1 . gene 62282 74713 . + . ID=evm.TU.chromosome_1.5;Name=EVM%20prediction%20chromosome_1.5
chromosome_1 74713 76735 chromosome_1 . gene 76736 78519 . - . ID=evm.TU.chromosome_1.6;Name=EVM%20prediction%20chromosome_1.6
Interesting, there are duplicates, which makes sense since it was doing it for all the genes. I think I need to select the ones on the - strand.
awk '$10 == "-" {print}' closest_genes.txt > closest_genes_neg_strand.txt
wc -l closest_genes_neg_strand.txt
21183 closest_genes_neg_strand.txt
Maybe not? Idk. Some chromosome info is just on the + strand and some on the - strand. An example:
chromosome_1 1315574 1318574 chromosome_1 EVM gene 1318575 1320328 . - . ID=evm.TU.chromosome_1.94;Name=EVM%20prediction%20chromosome_1.94
chromosome_1 1320328 1321584 chromosome_1 EVM gene 1318575 1320328 . - . ID=evm.TU.chromosome_1.94;Name=EVM%20prediction%20chromosome_1.94
chromosome_1 1335009 1336339 chromosome_1 . gene 1330359 1335009 . + . ID=evm.TU.chromosome_1.96;Name=EVM%20prediction%20chromosome_1.96
chromosome_1 1335009 1336339 chromosome_1 . gene 1330359 1335009 . + . ID=evm.TU.chromosome_1.96;Name=EVM%20prediction%20chromosome_1.96
evm.TU.chromosome_1.94 is only on the - strand, even though it shows up twice. evm.TU.chromosome_1.96 is only on the + strand, even though it shows up twice as well. Lots of duplicates? I am confused about the strandedness.
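One way to make sense of the strand pattern, assuming the 3’UTRs were generated as downstream flanks of each gene: for a + strand gene the 3’UTR should begin right at the gene end (BED start equals gene end), and for a - strand gene it should end right before the gene start (BED end equals gene start minus 1). The rows above fit this. The interval 53463-55801 sits between evm.TU.chromosome_1.3 (+) and evm.TU.chromosome_1.4 (-) and is a valid 3’UTR for both, while 1320328-1321584 lies on the 5’ side of the - strand gene evm.TU.chromosome_1.94 and is probably just its nearest neighbor. A minimal sketch of a filter built on that assumption (column numbers follow closest_genes.txt; the three rows are copied from above):

```shell
tmp=$(mktemp -d)
# Three rows copied from closest_genes.txt above (whitespace-separated here)
cat > "$tmp/closest_genes.txt" <<'EOF'
chromosome_1 53463 55801 chromosome_1 . gene 43758 53463 . + . ID=evm.TU.chromosome_1.3;Name=EVM%20prediction%20chromosome_1.3
chromosome_1 1315574 1318574 chromosome_1 EVM gene 1318575 1320328 . - . ID=evm.TU.chromosome_1.94;Name=EVM%20prediction%20chromosome_1.94
chromosome_1 1320328 1321584 chromosome_1 EVM gene 1318575 1320328 . - . ID=evm.TU.chromosome_1.94;Name=EVM%20prediction%20chromosome_1.94
EOF

# $2,$3 = 3'UTR BED start/end; $7,$8 = gene start/end; $10 = gene strand
awk '($10 == "+" && $2 == $8) || ($10 == "-" && $3 == $7 - 1)' \
    "$tmp/closest_genes.txt" > "$tmp/closest_genes_3prime.txt"

n_kept=$(wc -l < "$tmp/closest_genes_3prime.txt" | tr -d ' ')
echo "kept $n_kept of 3 rows"
cut -d' ' -f2,3,10 "$tmp/closest_genes_3prime.txt"
```

This keeps both strands rather than only the - strand rows, and drops pairings where the interval is merely adjacent to the wrong end of a gene.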
Next steps:
- Subset closest_genes.txt to only contain DEGs
- Extract 3’UTR sequence info about the DEGs
- Use this info to subset the 3’UTR DEG sequences
20240410
I’m going to subset the closest_genes.txt file to contain only DEGs. I have a list of my differentially expressed genes (DEG_list.txt). I’m copying it to the server into /data/putnamlab/jillashey/Astrangia_Genome. In the closest_genes.txt file, the 12th column has the information about the corresponding gene. This info is under the ID=XXXXX part of that column.
grep -Ff DEG_list.txt closest_genes.txt > deg_closest_genes.txt
wc -l deg_closest_genes.txt
3187 deg_closest_genes.txt
Sanity checking by randomly taking IDs and searching the DEG list. Not seeing all of them in the DEG list… Ah, because grep pulls any line that contains a matching string. So for example, if I had DEG evm.TU.chromosome_4.153, grep would also pull evm.TU.chromosome_4.1535 because it contains the same string. Remove ID= and other extraneous info from the last column.
awk 'BEGIN{FS=OFS="\t"} {split($NF, id, ";"); split(id[1], id_value, "="); $NF=id_value[2]; print $0, id_value[1]}' closest_genes.txt > modified_closest_genes.txt
head modified_closest_genes.txt
chromosome_1 17663 20663 chromosome_1 EVM gene 20664 21393 . - . evm.TU.chromosome_1.1 ID
chromosome_1 40489 43489 chromosome_1 . gene 34636 40489 . + . evm.TU.chromosome_1.2 ID
chromosome_1 53463 55801 chromosome_1 . gene 43758 53463 . + . evm.TU.chromosome_1.3 ID
chromosome_1 53463 55801 chromosome_1 . gene 55802 60431 . - . evm.TU.chromosome_1.4 ID
chromosome_1 53463 55801 chromosome_1 . gene 43758 53463 . + . evm.TU.chromosome_1.3 ID
chromosome_1 53463 55801 chromosome_1 . gene 55802 60431 . - . evm.TU.chromosome_1.4 ID
chromosome_1 74713 76735 chromosome_1 . gene 62282 74713 . + . evm.TU.chromosome_1.5 ID
chromosome_1 74713 76735 chromosome_1 . gene 76736 78519 . - . evm.TU.chromosome_1.6 ID
chromosome_1 74713 76735 chromosome_1 . gene 62282 74713 . + . evm.TU.chromosome_1.5 ID
chromosome_1 74713 76735 chromosome_1 . gene 76736 78519 . - . evm.TU.chromosome_1.6 ID
Not sure why it made a column that just has “ID” in it, but whatever. Count number of columns and rows:
awk '{print NF}' modified_closest_genes.txt | sort -nu | tail -n 1
# 13 columns
wc -l modified_closest_genes.txt
# 73170 rows
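The extra column is there because the command prints id_value[1] after $0, and id_value[1] is the key to the left of the = sign, i.e. the literal text ID. A sketch of the same extraction without it (the input row is copied from closest_genes.txt with the tabs written out; the array names are mine):

```shell
tmp=$(mktemp -d)
# One row from closest_genes.txt, tab-separated as bedtools writes it
printf 'chromosome_1\t17663\t20663\tchromosome_1\tEVM\tgene\t20664\t21393\t.\t-\t.\tID=evm.TU.chromosome_1.1;Name=EVM%%20prediction%%20chromosome_1.1\n' \
    > "$tmp/closest_genes.txt"

# Split the attribute column on ";" then "=", and overwrite it with the ID value only
awk 'BEGIN {FS = OFS = "\t"}
     {split($NF, attr, ";"); split(attr[1], kv, "="); $NF = kv[2]; print}' \
    "$tmp/closest_genes.txt" > "$tmp/modified_closest_genes.txt"

n_cols=$(awk -F'\t' '{print NF}' "$tmp/modified_closest_genes.txt")
gene_id=$(cut -f12 "$tmp/modified_closest_genes.txt")
echo "columns=$n_cols id=$gene_id"
```

The gene ID stays in column 12 either way, so the later $12 lookups work with or without the trailing ID column.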
Remove any duplicate rows
awk '!seen[$0]++' modified_closest_genes.txt > uniq_modified_closest_genes.txt
awk '{print NF}' uniq_modified_closest_genes.txt | sort -nu | tail -n 1
# 13 columns
wc -l uniq_modified_closest_genes.txt
57300 uniq_modified_closest_genes.txt
Okay, now I can filter uniq_modified_closest_genes.txt by the DEG list.
awk 'NR==FNR{deg[$1]; next} $12 in deg' DEG_list.txt uniq_modified_closest_genes.txt > uniq_modified_closest_degs.txt
head uniq_modified_closest_degs.txt
chromosome_1 987846 990846 chromosome_1 . gene 990847 993879 . - . evm.TU.chromosome_1.74 ID
chromosome_1 993879 996650 chromosome_1 . gene 990847 993879 . - . evm.TU.chromosome_1.74 ID
chromosome_1 1335009 1336339 chromosome_1 . gene 1330359 1335009 . + . evm.TU.chromosome_1.96 ID
chromosome_1 1397052 1400052 chromosome_1 EVM gene 1400053 1410554 . - . evm.TU.chromosome_1.105 ID
chromosome_1 2741797 2744797 chromosome_1 . gene 2744798 2746546 . - . evm.TU.chromosome_1.220 ID
chromosome_1 3378499 3381499 chromosome_1 . gene 3381500 3384205 . - . evm.TU.chromosome_1.267 ID
chromosome_1 3387647 3390647 chromosome_1 . gene 3390648 3394579 . - . evm.TU.chromosome_1.268 ID
chromosome_1 3452321 3455321 chromosome_1 . gene 3455322 3508018 . - . evm.TU.chromosome_1.272 ID
chromosome_1 3790853 3791719 chromosome_1 EVM gene 3791720 3796967 . - . evm.TU.chromosome_1.303 ID
chromosome_1 4998359 5000981 chromosome_1 . gene 4997077 4998359 . + . evm.TU.chromosome_1.419 ID
wc -l uniq_modified_closest_degs.txt
956 uniq_modified_closest_degs.txt
Does this make sense? That not all of the DEGs are represented in the 3’UTRs? I also see there are some duplicates (evm.TU.chromosome_1.74) where two 3’UTR sequences were closest to evm.TU.chromosome_1.74 coordinates.
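A quick way to put numbers on both questions is to compare the unique gene IDs in column 12 against the DEG list. A sketch with made-up gene IDs (geneA, geneB, geneC) standing in for the real ones, and hypothetical output file names:

```shell
tmp=$(mktemp -d)
# Toy inputs: three made-up DEG IDs, two of which appear in column 12 of the table
printf '%s\n' geneA geneB geneC > "$tmp/DEG_list.txt"
printf 'chr\t1\t2\tchr\t.\tgene\t3\t4\t.\t+\t.\t%s\tID\n' geneA geneA geneC \
    > "$tmp/uniq_modified_closest_degs.txt"

# Unique gene IDs that received at least one 3'UTR
cut -f12 "$tmp/uniq_modified_closest_degs.txt" | sort -u > "$tmp/degs_with_utr.txt"
# DEGs in the list that never show up in the table
sort -u "$tmp/DEG_list.txt" | comm -23 - "$tmp/degs_with_utr.txt" > "$tmp/degs_without_utr.txt"
# Gene IDs attached to more than one 3'UTR row
cut -f12 "$tmp/uniq_modified_closest_degs.txt" | sort | uniq -d > "$tmp/degs_multi_utr.txt"

n_with=$(wc -l < "$tmp/degs_with_utr.txt" | tr -d ' ')
missing=$(cat "$tmp/degs_without_utr.txt")
multi=$(cat "$tmp/degs_multi_utr.txt")
echo "with_utr=$n_with missing=$missing multi=$multi"
```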
tail uniq_modified_closest_degs.txt
chromosome_9 23374102 23375784 chromosome_9 . gene 23369259 23374102 . - . evm.TU.chromosome_9.2218 ID
chromosome_9 23399610 23402610 chromosome_9 . gene 23402611 23417409 . - . evm.TU.chromosome_9.2221 ID
chromosome_9 23472207 23473119 chromosome_9 . gene 23469848 23472207 . + . evm.TU.chromosome_9.2232 ID
chromosome_9 23484687 23486331 chromosome_9 . gene 23486332 23493309 . - . evm.TU.chromosome_9.2235 ID
chromosome_9 23643484 23646484 chromosome_9 . gene 23646485 23658996 . - . evm.TU.chromosome_9.2248 ID
chromosome_9 23662284 23664931 chromosome_9 EVM gene 23664932 23667853 . + . evm.TU.chromosome_9.2250 ID
chromosome_9 23667853 23669741 chromosome_9 EVM gene 23664932 23667853 . + . evm.TU.chromosome_9.2250 ID
chromosome_9 23837944 23840944 chromosome_9 . gene 23832305 23837944 . + . evm.TU.chromosome_9.2269 ID
chromosome_9 24085331 24088331 chromosome_9 . gene 24088332 24106631 . - . evm.TU.chromosome_9.2293 ID
chromosome_9 24157356 24160356 chromosome_9 . gene 24160357 24173706 . - . evm.TU.chromosome_9.2299 ID
Only goes up to chromosome 9?? Weird. Even the closest_genes.txt file only goes up to chromosome 9… I think it goes back to this error from above:
ERROR: chromomsome sort ordering for file apoc_GFFannotation.gene_sorted.gff is inconsistent with other files. Record was:
chromosome_10 . gene 12623 13741 . + . ID=evm.TU.chromosome_10.1;Name=EVM%20prediction%20chromosome_10.1
Let’s try to re-sort the gff?
sort -k1,1 -k4,4n apoc_GFFannotation.gene_sorted.gff > sorted_apoc_GFFannotation.gene_sorted.gff
Ah, so it’s sorting so that chromosome 1 is first, then 10, etc. 9 is last because it is the last digit in 1-9 (the sort on the chromosome name is alphabetical, not numeric). But the original file (apoc_GFFannotation.gene_sorted.gff) is sorted by chromosome? I am so frustrated haha. Look at all lines with chromosome 10:
awk '$1 == "chromosome_10"' apoc_GFFannotation.gene_sorted.gff
Weird, it looks like chromosome 10 is repeating itself?
chromosome_10 EVM gene 33840711 33841013 . - . ID=evm.TU.chromosome_10.3274;Name=EVM%20prediction%20chromosome_10.3274
chromosome_10 . gene 33859427 33860415 . + . ID=evm.TU.chromosome_10.3275;Name=EVM%20prediction%20chromosome_10.3275
chromosome_10 . gene 12623 13741 . + . ID=evm.TU.chromosome_10.1;Name=EVM%20prediction%20chromosome_10.1
chromosome_10 . gene 15707 16786 . - . ID=evm.TU.chromosome_10.2;Name=EVM%20prediction%20chromosome_10.2
chromosome_10 EVM gene 18626 18934 . + . ID=evm.TU.chromosome_10.3;Name=EVM%20prediction%20chromosome_10.3
chromosome_10 . gene 22435 26984 . - . ID=evm.TU.chromosome_10.4;Name=EVM%20prediction%20chromosome_10.4
chromosome_10 EVM gene 29369 30803 . + . ID=evm.TU.chromosome_10.5;Name=EVM%20prediction%20chromosome_10.5
chromosome_10 EVM gene 31529 36887 . - . ID=evm.TU.chromosome_10.6;Name=EVM%20prediction%20chromosome_10.6
It ends with evm.TU.chromosome_10.3275, then restarts with evm.TU.chromosome_10.1…Are all of the chromosomes 10-14 like this?
awk '$1 == "chromosome_11"' apoc_GFFannotation.gene_sorted.gff
Nope, 11 doesn’t look like that. Let’s try to remove duplicate rows from apoc_GFFannotation.gene_sorted.gff?
wc -l apoc_GFFannotation.gene_sorted.gff
47156 apoc_GFFannotation.gene_sorted.gff
awk '!seen[$0]++' apoc_GFFannotation.gene_sorted.gff > uniq_apoc_GFFannotation.gene_sorted.gff
wc -l uniq_apoc_GFFannotation.gene_sorted.gff
47156 uniq_apoc_GFFannotation.gene_sorted.gff
This didn’t do anything. Looking at the bed_close_3UTR.sh script, I have a line that does sort -k1,1 -k2,2n -o apoc_3UTR_sorted.bed apoc_3UTR.bed, aka not sorting with bedtools. In this line: bedtools closest -a apoc_3UTR_sorted.bed -b apoc_GFFannotation.gene_sorted.gff > closest_genes.txt, let’s try replacing apoc_GFFannotation.gene_sorted.gff with sorted_apoc_GFFannotation.gene_sorted.gff, so that both inputs are sorted the same way. Submitted batch job 311017. No error message produced.
wc -l closest_genes.txt
89784 closest_genes.txt
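The ordering error came from the two files disagreeing about chromosome order: sort -k1,1 orders the names as text (chromosome_1, chromosome_10, ..., chromosome_9), while the original gff did not follow that order. bedtools closest only needs both inputs sorted the same way. A sketch that sorts both files with the same rule, using made-up records and file names:

```shell
tmp=$(mktemp -d)
# Toy BED (start in column 2) and GFF (start in column 4), deliberately out of order
printf 'chromosome_9\t5\t10\nchromosome_10\t7\t9\nchromosome_1\t3\t8\n' > "$tmp/utr.bed"
printf 'chromosome_10\t.\tgene\t10\t20\t.\t+\t.\tID=g2\nchromosome_1\t.\tgene\t9\t12\t.\t-\t.\tID=g1\nchromosome_9\t.\tgene\t11\t15\t.\t+\t.\tID=g3\n' > "$tmp/genes.gff"

# LC_ALL=C makes the text ordering of column 1 identical for both files
LC_ALL=C sort -k1,1 -k2,2n "$tmp/utr.bed" > "$tmp/utr_sorted.bed"
LC_ALL=C sort -k1,1 -k4,4n "$tmp/genes.gff" > "$tmp/genes_sorted.gff"

bed_order=$(cut -f1 "$tmp/utr_sorted.bed" | tr '\n' ' ')
gff_order=$(cut -f1 "$tmp/genes_sorted.gff" | tr '\n' ' ')
echo "bed: $bed_order"
echo "gff: $gff_order"
```

Setting the locale explicitly is a precaution on my part; without it, sort can order text differently depending on the environment the job runs in.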
Now let’s try this again. Make the ID column.
awk 'BEGIN{FS=OFS="\t"} {split($NF, id, ";"); split(id[1], id_value, "="); $NF=id_value[2]; print $0, id_value[1]}' closest_genes.txt > modified_closest_genes.txt
wc -l modified_closest_genes.txt
89784 modified_closest_genes.txt
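To make the ID step concrete, here is the same awk command applied to a single line. The input line is made up to mimic one row of closest_genes.txt (3 BED columns followed by 9 GFF columns), and `extract_gene_id` is just a wrapper name for illustration.

```shell
extract_gene_id() {   # reads closest-gene rows on stdin
  awk 'BEGIN{FS=OFS="\t"} {
    split($NF, id, ";")            # "ID=...;Name=..." -> id[1] = "ID=..."
    split(id[1], id_value, "=")    # id_value[1] = "ID", id_value[2] = gene ID
    $NF = id_value[2]              # overwrite the attribute column with the bare ID
    print $0, id_value[1]          # append the literal tag "ID" as an extra column
  }'
}

# Show only the two columns that change (12 = gene ID, 13 = the "ID" tag)
printf 'chromosome_1\t987846\t990846\tchromosome_1\t.\tgene\t990847\t993879\t.\t-\t.\tID=evm.TU.chromosome_1.74;Name=EVM%%20prediction%%20chromosome_1.74\n' |
  extract_gene_id | cut -f12,13
```

The gene ID ends up in column 12, which is the column the DEG filter keys on later.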
Remove any duplicate rows
awk '!seen[$0]++' modified_closest_genes.txt > uniq_modified_closest_genes.txt
wc -l uniq_modified_closest_genes.txt
67788 uniq_modified_closest_genes.txt
Okay now I can filter uniq_modified_closest_genes.txt
by the DEG list.
awk 'NR==FNR{deg[$1]; next} $12 in deg' DEG_list.txt uniq_modified_closest_genes.txt > uniq_modified_closest_degs.txt
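The check on this filter can also be scripted instead of spot-checked. A minimal sketch, assuming the gene ID is in column 12 of the filtered file and in column 1 of DEG_list.txt (both function names are hypothetical):

```shell
# Rows in the filtered file whose gene ID is NOT in the DEG list (should print nothing).
rows_not_in_deg_list() {   # $1 = DEG list, $2 = filtered closest-gene file
  awk 'NR==FNR { deg[$1]; next } !($12 in deg)' "$1" "$2"
}

# DEGs that never appear in the filtered file, i.e. DEGs with no 3-prime UTR row.
degs_without_utr() {       # $1 = DEG list, $2 = filtered closest-gene file
  awk 'NR==FNR { seen[$12]; next } !($1 in seen)' "$2" "$1"
}

# Usage (file names from the notes above):
# rows_not_in_deg_list DEG_list.txt uniq_modified_closest_degs.txt
# degs_without_utr     DEG_list.txt uniq_modified_closest_degs.txt
```

Note that the `NR==FNR` idiom misbehaves if the first file passed to awk is empty, so it is worth confirming both files have content first.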
Sanity check! Check some of the IDs from uniq_modified_closest_degs.txt
against the DEG list. Okay, looking good so far. Now I need to make a new column that combines columns 1, 2, and 3 (i.e., the 3’UTR info columns) so that it matches the sequence headers in apoc_3UTR.fasta
. I.e., the new column should look like this: chromosome_1:17663-20663
.
awk '{print $1":"$2"-"$3, $0}' uniq_modified_closest_degs.txt > uniq_modified_closest_degs_3UTRid.txt
head uniq_modified_closest_degs_3UTRid.txt
chromosome_1:987846-990846 chromosome_1 987846 990846 chromosome_1 . gene 990847 993879 . - . evm.TU.chromosome_1.74 ID
chromosome_1:993879-996650 chromosome_1 993879 996650 chromosome_1 . gene 990847 993879 . - . evm.TU.chromosome_1.74 ID
chromosome_1:1335009-1336339 chromosome_1 1335009 1336339 chromosome_1 . gene 1330359 1335009 . + . evm.TU.chromosome_1.96 ID
chromosome_1:1397052-1400052 chromosome_1 1397052 1400052 chromosome_1 EVM gene 1400053 1410554 . - . evm.TU.chromosome_1.105 ID
chromosome_1:2741797-2744797 chromosome_1 2741797 2744797 chromosome_1 . gene 2744798 2746546 . - . evm.TU.chromosome_1.220 ID
chromosome_1:3378499-3381499 chromosome_1 3378499 3381499 chromosome_1 . gene 3381500 3384205 . - . evm.TU.chromosome_1.267 ID
chromosome_1:3387647-3390647 chromosome_1 3387647 3390647 chromosome_1 . gene 3390648 3394579 . - . evm.TU.chromosome_1.268 ID
chromosome_1:3452321-3455321 chromosome_1 3452321 3455321 chromosome_1 . gene 3455322 3508018 . - . evm.TU.chromosome_1.272 ID
chromosome_1:3790853-3791719 chromosome_1 3790853 3791719 chromosome_1 EVM gene 3791720 3796967 . - . evm.TU.chromosome_1.303 ID
chromosome_1:4998359-5000981 chromosome_1 4998359 5000981 chromosome_1 . gene 4997077 4998359 . + . evm.TU.chromosome_1.419 ID
Separate the new column into its own text file.
awk '{print $1}' uniq_modified_closest_degs_3UTRid.txt > DEG_3UTR.txt
Now subset the 3’UTR fasta file based on the 3’UTR DEG seqs in the DEG_3UTR.txt
list.
grep -A 1 -f DEG_3UTR.txt apoc_3UTR.fasta | grep -v "^--$" > 3UTR_DE.fa
zgrep -c ">" 3UTR_DE.fa
2179
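One caveat with `grep -f` here is that it matches substrings, so an ID like chromosome_1:100-200 would also pull in a header such as chromosome_1:100-2000, and `-A 1` assumes each sequence sits on a single line. A hedged alternative, assuming the FASTA headers are exactly `>` plus the ID from DEG_3UTR.txt (`subset_fasta_by_id` is a made-up name):

```shell
subset_fasta_by_id() {   # $1 = ID list (one per line, no ">"), $2 = FASTA file
  awk '
    NR == FNR { want[$1]; next }                  # first file: remember wanted IDs
    /^>/      { keep = (substr($1, 2) in want) }  # header line: exact match on the ID
    keep                                          # print header and all of its sequence lines
  ' "$1" "$2"
}

# Usage (file names from the notes above):
# subset_fasta_by_id DEG_3UTR.txt apoc_3UTR.fasta > 3UTR_DE.fa
```

Duplicate IDs in the list are harmless with this approach, since each FASTA record is printed at most once.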
Important note: uniq_modified_closest_degs_3UTRid.txt
has the DEG and 3’UTR ids in it. I copied this file, along with 3UTR_DE.fa
to my local computer.
I believe I can run miranda now!!!!!!!!! In /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
: nano miranda_de.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "starting miranda run with differentially expressed genes and miRNAs with score cutoff >100 and energy cutoff <-10"$(date)
module load Miniconda3/4.9.2
conda activate /data/putnamlab/conda/miranda
miranda /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_DE.fa /data/putnamlab/jillashey/Astrangia_Genome/3UTR_DE.fa -sc 100 -en -10 -out /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_de.tab
conda deactivate
echo "miranda run finished!"$(date)
echo "counting number of putative interactions predicted"$(date)
zgrep -c "Performing Scan" /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_de.tab
Submitted batch job 311424
20240411
Miranda finished running in about 15 mins and ~108,000 putative interactions were predicted! This is great! Let’s look at some of the results:
Forward: Score: 123.000000 Q:2 to 18 R:676 to 694 Align Len (16) (81.25%) (87.50%)
Query: 3' cccTTGTTCGGCTTTGTAAc 5'
|||||||| |||: ||
Ref: 5' tcaAACAAGCC-AAATTTTc 3'
Energy: -12.090000 kCal/Mol
Scores for this hit:
>chromosome_12_481048 chromosome_1:987846-990846 123.00 -12.09 2 18 676 694 16 81.25% 87.50%
Woohoo. In this example, there isn’t exactly strict seed binding and there is a G:U wobble pair. Once again, how do I invoke strict seed binding?? A comment on this biostars question about miranda output meaning said: “No, I’m currently still using miRanda but I might complement it by using other Tools. Anyway, if you do not use the default parameters and therefore add this function “ -strict “ in the command, this means that you require strict alignment in the seed region (position 2-8). We already identified some interesting targets.” In the manual, it does not say anything about -strict
function. Maybe I’ll try it? Added the strict flag and renamed the output file. Submitted batch job 311700. The question also recommended the following to parse the miranda output:
grep -A 1 "Scores for this hit:" miranda_out.txt | sort | grep '>'
Which will provide me with this:
>mirna_name transcript_target 143.00 -22.87 2 18 339 357 16 75.00% 87.50%
With the header being this:
mirna Target Score Energy-Kcal/Mol Query-Aln(start-end) Subject-Al(Start-End) Al-Len Subject-Identity Query-Identity
The strict miranda run finished, and it looks like there is no difference in the number of outputs. But the strict file appears to contain less info:
ls -othr
total 355M
-rw-r--r--. 1 jillashey 315M Apr 10 11:57 miranda_de.tab
-rw-r--r--. 1 jillashey 40M Apr 11 10:16 miranda_de_strict.tab
Let’s parse the outputs from both files.
grep -A 1 "Scores for this hit:" miranda_de.tab | sort | grep '>' > miranda_de_parsed.txt
wc -l miranda_de_parsed.txt
750674 miranda_de_parsed.txt
grep -A 1 "Scores for this hit:" miranda_de_strict.tab | sort | grep '>' > miranda_de_strict_parsed.txt
wc -l miranda_de_strict_parsed.txt
14899 miranda_de_strict_parsed.txt
Strict definitely resulted in fewer hits, which is probably good because I want the binding to be pretty stringent, given the strict seed binding that appears to be present in cnidarians. Yay!!!!!!! Copying miranda_de_strict_parsed.txt
to my local computer.
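A quick per-miRNA summary of the parsed file shows how many putative targets each miRNA has. A sketch, assuming each parsed line starts with `>miRNA_name` as in the example above (`hits_per_mirna` is a hypothetical helper):

```shell
hits_per_mirna() {   # $1 = parsed miranda file
  awk '{ name = $1; sub(/^>/, "", name); n[name]++ }      # strip ">" and count per miRNA
       END { for (m in n) printf "%s\t%d\n", m, n[m] }' "$1" |
    sort -t "$(printf '\t')" -k2,2nr -k1,1                 # most hits first, ties by name
}

# Usage (file name from the notes above):
# hits_per_mirna miranda_de_strict_parsed.txt | head
```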
20240421
I’m going to now run miranda on all miRNAs and mRNAs.
In /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
: nano miranda_strict_all.sh
#!/bin/bash -i
#SBATCH -t 200:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB --cpus-per-task=24
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "starting miranda run with all genes and miRNAs with score cutoff >100, energy cutoff <-10, and strict binding invoked"$(date)
module load Miniconda3/4.9.2
conda activate /data/putnamlab/conda/miranda
miranda /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa /data/putnamlab/jillashey/Astrangia_Genome/apoc_3UTR.fasta -sc 100 -en -10 -strict -out /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all.tab
conda deactivate
echo "miranda run finished!"$(date)
echo "counting number of putative interactions predicted" $(date)
zgrep -c "Performing Scan" /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all.tab
echo "Parsing output" $(date)
grep -A 1 "Scores for this hit:" /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all.tab | sort | grep '>' > /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all_parsed.txt
echo "counting number of putative interactions predicted" $(date)
wc -l /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all_parsed.txt
echo "miranda script complete" $(date)
Submitted batch job 312601. Job might pend for a while because it requires a lot of resources. Started running after about 13 hours.
20240422
Now that the job above is running, I’m going to run miranda with the miRNAs and the lncRNAs. lncRNAs may bind to miRNAs and “sponge” them up, essentially sequestering them so that they can’t bind to their target mRNAs and degrade those mRNAs.
In /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
: nano miranda_strict_lncRNA.sh
#!/bin/bash -i
#SBATCH -t 200:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB --cpus-per-task=24
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "starting miranda run with all lncRNAs and miRNAs with score cutoff >100, energy cutoff <-10, and strict binding invoked"$(date)
module load Miniconda3/4.9.2
conda activate /data/putnamlab/conda/miranda
miranda /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2/apoc_bedtools_lncRNAs.fasta -sc 100 -en -10 -strict -out /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_lncRNA.tab
conda deactivate
echo "miranda run finished!"$(date)
echo "counting number of putative interactions predicted" $(date)
zgrep -c "Performing Scan" /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_lncRNA.tab
echo "Parsing output" $(date)
grep -A 1 "Scores for this hit:" /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_lncRNA.tab | sort | grep '>' > /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_lncRNA_parsed.txt
echo "counting number of putative interactions predicted" $(date)
wc -l /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_lncRNA_parsed.txt
echo "miranda script complete" $(date)
Submitted batch job 312886
20240426
Both miranda jobs have preemptively failed and restarted. Not sure why…Going back to what Kevin Bryan said to me regarding this in the Apul genome assembly: “you didn’t specify -t SLURM_CPUS_ON_NODE (and also #SBATCH --exclusive) to make use of all of the CPU cores on the node. You might want to consider re-submitting this job with those parameters. Because the nodes generally have 36 cores, it should be able to catch up to where it is now in a little over half a day, assuming perfect scaling.” Looking at the miranda manual, it doesn’t look like there is any flag to specify cpus on node. Cancelled jobs 312601
and 312886
. I am going to redo the SLURM info in these scripts. I edited both miranda_strict_all.sh
and miranda_strict_lncRNA.sh
so that the slurm headers look like:
#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
For miranda_strict_all.sh
: Submitted batch job 313313. For miranda_strict_lncRNA.sh
: Submitted batch job 313314. Both started running right away!
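Since miranda has no threading flag that I could find, asking for more cores on its own probably will not make a single miranda process faster. One possible workaround (not what was run above) is to split the miRNA FASTA into chunks and launch one miranda process per chunk. A sketch of the splitting step only; `split_fasta` and the chunk naming are made up for illustration:

```shell
split_fasta() {   # $1 = input FASTA, $2 = number of chunks, $3 = output prefix
  awk -v n="$2" -v prefix="$3" '
    /^>/ { out = sprintf("%s_%02d.fa", prefix, (rec++ % n) + 1) }  # deal records round-robin
    { print > out }                                                # sequence lines follow their header
  ' "$1"
}

# Example: split_fasta mature_all.fa 4 mature_chunk
# Each mature_chunk_NN.fa could then go to its own miranda call with the same
# -sc 100 -en -10 -strict settings, and the parsed outputs concatenated afterwards.
```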
20240517
Over the last few weeks, I downloaded the mRNA miranda results and identified correlations between differentially expressed mRNAs and miRNAs (see code here). I have now isolated 51 genes that are shared across 4 comparisons (TP0vTP5 amb, TP0vTP7 amb, TP0vTP5 heat, and TP0vTP7 heat) (see csv of that list hereXXXXX). With this list, I want to subset those genes from the mRNA fasta and blast them against the protein db.
I also might run interproscan…idk yet.
cd /data/putnamlab/jillashey/Astrangia2021/mRNA
mkdir blast
cd blast
Copied in sigcorr_subset.fasta
, which I made on my computer. In the scripts folder: nano sigcorr_blast.sh
#!/bin/bash
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=125GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/blast
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/blast
echo "Blasting Apoc genes of interest (significant correlation with miRNAs) against remote nt database" $(date)
blastn -query sigcorr_subset.fasta -db nt -evalue 1E-40 -num_threads 10 -max_target_seqs 3 -max_hsps 3 -outfmt 6 -out sigcorr_subset_blast_results.txt
echo "Blast complete" $(date)
Submitted batch job 317026. Also going to submit a blastx
job. Submitted batch job 317071
20240812
Zoe found this great tool called GeneExt, which is used to adjust/extend genes so that the 3’UTRs are annotated based on the mapping of reads, helping with overall mapping in scRNA and tag-seq. I’m going to use it to find the 3’UTRs for my genes! Cnidarian miRNAs bind to the 3’UTR of genes and in the code above, I estimated the 3’UTRs as the region of up to 3 kb immediately downstream of each gene. However, this was just an estimate and I want to use GeneExt to find the 3’UTRs in a quantified way. I’m using Zoe’s code.
First, convert gff3 to gtf using gffread. In the /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
folder: nano gffread.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --mem=250GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --error="%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output="%x_output.%j" #once your job is completed, any final job report comments will be put in this file
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
# load modules needed
module load gffread/0.12.7-GCCcore-11.2.0
cd /data/putnamlab/jillashey/Astrangia_Genome/
# Combined code, want to see if these are any different
gffread -E apoculata_v2.0.gff3 -T -o apoculata_v2.0.gtf
Submitted batch job 334287. Ran super fast. Checked gtf file, looks good. Next, combine 20 bam files together from the mRNA mapping step.
In the /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
folder, nano merge_bam.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=125GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load SAMtools/1.9-foss-2018b
cd /data/putnamlab/jillashey/Astrangia2021/mRNA/output/bowtie/align
#use samtools merge to merge all the files
samtools merge AST_merge.bam *.bam
Submitted batch job 334289. While this is running, create apoculata_v2.0_modified.gtf
in /data/putnamlab/jillashey/Astrangia_Genome
so that the transcript_ids have a -T at the end. The gene and transcript ids in the gtf must be different for GeneExt.
cd /data/putnamlab/jillashey/Astrangia_Genome
sed 's/transcript_id "\([^"]*\)"/transcript_id "\1-T"/g' apoculata_v2.0.gtf > apoculata_v2.0_modified.gtf
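To show what the sed substitution does, here it is on one made-up GTF line (the IDs are placeholders, not real Astrangia IDs). Only the transcript_id value gets the -T suffix; gene_id is left alone.

```shell
add_transcript_suffix() {   # reads GTF lines on stdin
  # Capture whatever sits between the quotes after transcript_id and append -T to it
  sed 's/transcript_id "\([^"]*\)"/transcript_id "\1-T"/g'
}

# Print just the attribute column (9) of the modified line
printf 'chromosome_1\tEVM\ttranscript\t100\t900\t.\t+\t.\ttranscript_id "gene1"; gene_id "gene1";\n' |
  add_transcript_suffix | cut -f9
```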
Zoe already loaded the GeneExt program into her directory here: /data/putnamlab/zdellaert/snRNA/programs/GeneExt
but I couldn’t get it to activate so I’m going to try to install it here /data/putnamlab/conda
.
cd /data/putnamlab/conda
mkdir GeneExt
cd GeneExt
git clone https://github.com/sebepedroslab/GeneExt.git
interactive
module load Miniconda3/4.9.2
conda env create -n geneext -f environment.yaml
Probably going to run for a while. Once this is installed, I can run GeneExt!
20240813
Installation for GeneExt ran for most of the day yesterday, then quit. Updating it now:
cd /data/putnamlab/conda/GeneExt/
interactive -c 10
module load Miniconda3/4.9.2
conda env update -n geneext -f environment.yaml
This ran for about an hour but successfully updated. Let’s try to run the test data to make sure the program works.
exit # exit previous interactive mode
interactive -c 10
conda activate geneext
# test run
python geneext.py -g test_data/annotation.gtf -b test_data/alignments.bam -o result.gtf --peak_perc 0
Output:
____ _____ _
/ ___| ___ _ __ ___| ____|_ _| |_
| | _ / _ \ '_ \ / _ \ _| \ \/ / __|
| |_| | __/ | | | __/ |___ > <| |_
\____|\___|_| |_|\___|_____/_/\_\__|
______ ___ ______
-----[______]==[___]==[______]===>----
Gene model adjustment for improved single-cell RNA-seq data counting
╭──────────────────╮
│ Preflight checks │
╰──────────────────╯
Genome annotation warning: Could not find "gene" features in test_data/annotation.gtf! Trying to fix ...
╭───────────╮
│ Execution │
╰───────────╯
Running macs2 ...
########## macs2 FAILED ##############
return code: 1
Output: Traceback (most recent call last):
File "/home/jillashey/.conda/envs/geneext/bin/macs2", line 653, in <module>
main()
File "/home/jillashey/.conda/envs/geneext/bin/macs2", line 49, in main
from MACS2.callpeak_cmd import run
File "/home/jillashey/.conda/envs/geneext/lib/python3.9/site-packages/MACS2/callpeak_cmd.py", line 23, in <module>
from MACS2.OptValidator import opt_validate
File "/home/jillashey/.conda/envs/geneext/lib/python3.9/site-packages/MACS2/OptValidator.py", line 20, in <module>
from MACS2.IO.Parser import BEDParser, ELANDResultParser, ELANDMultiParser, \
File "__init__.pxd", line 206, in init MACS2.IO.Parser
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Traceback (most recent call last):
File "/glfs/brick01/gv0/putnamlab/conda/GeneExt/geneext.py", line 703, in <module>
helper.run_macs2(tempdir+'/' + 'plus.bam','plus',tempdir,verbose = verbose)
File "/glfs/brick01/gv0/putnamlab/conda/GeneExt/geneext/helper.py", line 79, in run_macs2
ps.check_returncode()
File "/home/jillashey/.conda/envs/geneext/lib/python3.9/subprocess.py", line 460, in check_returncode
raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '('macs2', 'callpeak', '-t', 'tmp/plus.bam', '-f', 'BAM', '--keep-dup', '20', '-q', '0.01', '--shift', '1', '--extsize', '100', '--broad', '--nomodel', '--min-length', '30', '-n', 'plus', '--outdir', 'tmp')' returned non-zero exit status 1.
Something went wrong.
20240819
GeneExt is being weird when installing so I’m going to delete what I have so far and restart. BLEH.
interactive
module load Miniconda3/4.9.2
cd /data/putnamlab/conda
rm -r GeneExt/
git clone https://github.com/sebepedroslab/GeneExt.git
cd GeneExt/
conda env create -n geneext -f environment.yaml
This was much faster than last time…Let’s try to run the test data?
exit # need to exit interactive mode after I create the env
interactive
conda activate geneext
# test run
python geneext.py -g test_data/annotation.gtf -b test_data/alignments.bam -o result.gtf --peak_perc 0
Got same error as above. I hate myself. Going to email Kevin Bryan. Okay while I go insane with this, I want to run ShortStack on my AST data. We ended up using this program for the e5 deep dive ncRNA paper, it was recommended by Javi, who has used it in the past.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA
mkdir shortstack
Kevin Bryan already installed short stack on Andromeda (thank god). Trimmed reads are in /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
. Sam merged R1 and R2 before running. I am going to just run on R1 for a first pass. In the scripts folder: nano shortstack_test.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Running short stack on mature trimmed miRNAs (R1) from AST project"
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
# Load modules
module load ShortStack/4.0.2-foss-2022a
# Run short stack
ShortStack \
--genomefile /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta \
--readfile AST-1065_R1_001.fastq.gz_1.fastq \
AST-1147_R1_001.fastq.gz_1.fastq \
AST-1412_R1_001.fastq.gz_1.fastq \
AST-1560_R1_001.fastq.gz_1.fastq \
AST-1567_R1_001.fastq.gz_1.fastq \
AST-1617_R1_001.fastq.gz_1.fastq \
AST-1722_R1_001.fastq.gz_1.fastq \
AST-2000_R1_001.fastq.gz_1.fastq \
AST-2007_R1_001.fastq.gz_1.fastq \
AST-2302_R1_001.fastq.gz_1.fastq \
AST-2360_R1_001.fastq.gz_1.fastq \
AST-2398_R1_001.fastq.gz_1.fastq \
AST-2404_R1_001.fastq.gz_1.fastq \
AST-2412_R1_001.fastq.gz_1.fastq \
AST-2512_R1_001.fastq.gz_1.fastq \
AST-2523_R1_001.fastq.gz_1.fastq \
AST-2563_R1_001.fastq.gz_1.fastq \
AST-2729_R1_001.fastq.gz_1.fastq \
AST-2755_R1_001.fastq.gz_1.fastq \
--known_miRNAs /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa \
--outdir /data/putnamlab/jillashey/Astrangia2021/smRNA/shortstack \
--threads 10 \
--dn_mirna
echo "Short stack run complete"
Submitted batch job 334980. Failed with this error: Required executable wigToBigWig : Not found!
. Confused by this…apparently wigToBigWig is a UCSC utility that converts wig files to the bigWig format, so it looks like a missing dependency. Not sure how to make it available…Going to email Kevin Bryan!
20240829
Kevin Bryan got back to me and said to add the module Kent_tools/442-GCC-11.3.0
to find wigToBigWig
for the short stack code. Submitted batch job 336294. Running successfully!
I also emailed him about the issues installing gene ext and he said “it looks like you have two different genext environments loaded there since some paths refer to /home/jillashey/.conda/envs/geneext and others to /glfs/brick01/gv0/putnamlab/conda/GeneExt, so I think that’s what’s causing the issue. You might want to delete both and recreate the second one to ensure it has all the dependencies.” I deleted gene ext in /home/jillashey/.conda/envs/
directory. Went to /glfs/brick01/gv0/putnamlab/conda/
and deleted gene ext. Because it takes a while for gene ext to install, I’m going to run the get_geneext.sh
script. Submitted batch job 336304.
20240830
Gene ext installed but once again, got the same error message even though I thought I deleted it from all other locations. Here’s the error once again:
########## macs2 FAILED ##############
return code: 1
Output: Traceback (most recent call last):
File "/home/jillashey/.conda/envs/geneext/bin/macs2", line 653, in <module>
main()
File "/home/jillashey/.conda/envs/geneext/bin/macs2", line 49, in main
from MACS2.callpeak_cmd import run
File "/home/jillashey/.conda/envs/geneext/lib/python3.9/site-packages/MACS2/callpeak_cmd.py", line 23, in <module>
from MACS2.OptValidator import opt_validate
File "/home/jillashey/.conda/envs/geneext/lib/python3.9/site-packages/MACS2/OptValidator.py", line 20, in <module>
from MACS2.IO.Parser import BEDParser, ELANDResultParser, ELANDMultiParser, \
File "__init__.pxd", line 206, in init MACS2.IO.Parser
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Maybe it’s something in here: /home/jillashey/.conda/pkgs
but there are so many folders and tools in there that I am not sure which one to delete. But shortstack did run! Look at the out file in the scripts folder (slurm-336294.out
):
Found a total of 51 MIRNA loci
Writing final files
Non-MIRNA loci by DicerCall:
N 9093
23 48
22 38
21 16
24 12
Creating visualizations of microRNA loci with strucVis
<<< WARNING >>>
Do not rely on these results alone to annotate new MIRNA loci!
The false positive rate for de novo MIRNA identification is low, but NOT ZERO
Insepct each mirna locus, especially the strucVis output, and see
https://doi.org/10.1105/tpc.17.00851 , https://doi.org/10.1093/nar/gky1141
Thu 29 Aug 2024 19:06:45 -0400 EDT
Run Completed!
Short stack run complete
The out file found 51 putative miRNAs.
Look at shortstack results (using e5 deep dive shortstack code as a reference).
echo "Number of potential loci:"
awk '(NR>1)' Results.txt | wc -l
Number of potential loci:
9258
Column 20 in the Results.txt
file identifies if a cluster is an miRNA or not (Y or N). 51 loci are characterized as miRNA and 9207 loci are not characterized as miRNAs. Column 21 in the Results.txt
file identifies if a cluster aligned to a known miRNA or not (NA or the name of the matching miRNA). 37 loci aligned to a known miRNA and 9222 loci did not. However, there are only 5 valid miRNAs that match up with known miRNAs:
column: line 3 is too long, output will be truncated
chromosome_6:21902734-21902826 Y nve-miR-9425_MIMAT0035384_Nematostella_vectensis_miR-9425;miR-9425_Nematostella_vectensis_Moran_et_al._2014_NA
chromosome_7:2303817-2303911 Y sca-nve-F-miR-2036_Scolanthus_callimorphus_Praher_et_al._2021_Transcriptome-level;eca-nve-F-miR-2036_Edwardsiella_carnea_Praher_et_al._2021_Transcriptome-level
0-5p;ccr-miR-100_MIMAT0026195_Cyprinus_carpio_miR-100;pmi-miR-100-5p_MIMAT0032156_Patiria_miniata_miR-100-5p;cpi-miR-100-5p_MIMAT0037714_Chrysemys_picta_miR-100-5p;chi-miR-100-5p_MIMAT0035897_Capra_hircus_miR-100-5p;dma-miR-100_MIMAT0049252_Daubentonia_madagascariensis_miR-100;sbo-miR-100_MIMAT0049501_Saimiri_boliviensis_miR-100;ola-miR-100_MIMAT0022614_Oryzias_latipes_miR-100
chromosome_8:4117884-4117974 Y Adi-Mir-2030_5p_Acropora_digitifera_Gajigan_&_Conaco_2017_nve-miR-2030-5p;_nve-miR-2030-5p;_spi-miR-temp-40;ami-nve-F-miR-2030-5p_Acropora_millepora_Praher_et_al._2021_NA;adi-nve-F-miR-2030_Acropora_digitifera__Praher_et_al._2021_NA;spi-mir-temp-40_Stylophora_pistillata_Liew_et_al._2014_Close_match_of_nve-miR-2030;eca-nve-F-miR-2030_Edwardsiella_carnea_Praher_et_al._2021_Transcriptome-level;miR-2030_Nematostella_vectensis_Moran_et_al._2014_NA
chromosome_14:8601339-8601434 Y apa-mir-2037_Exaiptasia_pallida_Baumgarten_et_al._2017_miR-2037;_Nve;_Spis;spi-mir-temp-20_Stylophora_pistillata_Liew_et_al._2014_NA;eca-nve-F-miR-2037_Edwardsiella_carnea_Praher_et_al._2021_Transcriptome-level;spi-nve-F-miR-2037_Stylophora_pistillata_Praher_et_al._2021_NA;sca-nve-F-miR-2037-3p_Scolanthus_callimorphus_Praher_et_al._2021_Transcriptome-level;ami-nve-F-miR-2037-3p_Acropora_millepora_Praher_et_al._2021_NA;mse-nve-F-miR-2037-3p_Metridium_senile_Praher_et_al._2021_Transcriptome-level;avi-miR-temp-2037_Anemonia_viridis_Urbarova_et_al._2018_NA
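These counts can be pulled in one pass with awk, skipping the header so it is not counted as a locus. A sketch, assuming Results.txt is tab-delimited with MIRNA in column 20 and known_miRNAs in column 21 (`count_mirna_calls` is a made-up name):

```shell
count_mirna_calls() {   # $1 = ShortStack Results.txt
  awk -F'\t' '
    NR == 1 { next }                              # skip the header row
    { call[$20]++ }                               # Y / N in the MIRNA column
    $21 != "NA" { known++ }                       # any match to a known miRNA
    $20 == "Y" && $21 != "NA" { both++ }          # called as miRNA AND matches a known one
    END {
      printf "MIRNA_Y\t%d\nMIRNA_N\t%d\nknown_match\t%d\nMIRNA_Y_and_known\t%d\n",
             call["Y"], call["N"], known, both
    }
  ' "$1"
}

# Usage:
# count_mirna_calls Results.txt
```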
Sam White found a bug in the shortstack code that was making the Results.gff3 starting coordinates 1 greater than those listed in the FastA description lines. For example:
grep "Cluster_1155.mature" Results.gff3 mir.fasta
Results.gff3:chromosome_5 ShortStack mature_miRNA 8149619 8149640 822 - . ID=Cluster_1155.mature;Parent=Cluster_1155
mir.fasta:>Cluster_1155.mature::chromosome_5:8149618-8149640(-)
The Results.gff3
file has Cluster_1155
starting at position 8149619, while mir.fasta
has it starting at position 8149618. This is an issue because the shortstack fasta headers and gff information need to match or there will be downstream issues. I will need to clean up the fasta with Sam’s code.
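Sam’s actual code is not reproduced here, but the idea can be sketched as follows: add 1 to the start coordinate in each FASTA header so it agrees with Results.gff3. This assumes headers always end in `:start-end(strand)` as in the example above and that the GFF3 coordinates are the ones to keep; `fix_fasta_starts` is a hypothetical name.

```shell
fix_fasta_starts() {   # reads FASTA on stdin, writes corrected FASTA to stdout
  awk '
    /^>/ && match($0, /:[0-9]+-[0-9]+\([+-]\)$/) {
      head   = substr($0, 1, RSTART)          # everything up to and including the ":"
      coords = substr($0, RSTART + 1)         # e.g. 8149618-8149640(-)
      dash   = index(coords, "-")             # first "-" separates start from end
      start  = substr(coords, 1, dash - 1)
      printf "%s%d%s\n", head, start + 1, substr(coords, dash)
      next
    }
    { print }                                 # sequence lines and other headers pass through
  '
}

# Usage (file name from the notes above; output name is a placeholder):
# fix_fasta_starts < mir.fasta > mir_coords_fixed.fasta
```

Headers that do not match the expected pattern are passed through unchanged, so it is worth grepping a few clusters in both files afterwards to confirm they agree.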
20240901
Attempting gene ext installation again.
cd /data/putnamlab/conda
git clone https://github.com/sebepedroslab/GeneExt.git
interactive
module load Miniconda3/4.9.2
cd GeneExt
conda env create --prefix /data/putnamlab/conda/GeneExt/geneext -f environment.yaml
python geneext.py -g test_data/annotation.gtf -b test_data/alignments.bam -o result.gtf --peak_perc 0
____ _____ _
/ ___| ___ _ __ ___| ____|_ _| |_
| | _ / _ \ '_ \ / _ \ _| \ \/ / __|
| |_| | __/ | | | __/ |___ > <| |_
\____|\___|_| |_|\___|_____/_/\_\__|
______ ___ ______
-----[______]==[___]==[______]===>----
Gene model adjustment for improved single-cell RNA-seq data counting
╭──────────────────╮
│ Preflight checks │
╰──────────────────╯
Genome annotation warning: Could not find "gene" features in test_data/annotation.gtf! Trying to fix ...
╭───────────╮
│ Execution │
╰───────────╯
Running macs2 ...
########## macs2 FAILED ##############
return code: 1
Output: Traceback (most recent call last):
File "/data/putnamlab/conda/GeneExt/geneext/bin/macs2", line 653, in <module>
main()
File "/data/putnamlab/conda/GeneExt/geneext/bin/macs2", line 49, in main
from MACS2.callpeak_cmd import run
File "/data/putnamlab/conda/GeneExt/geneext/lib/python3.9/site-packages/MACS2/callpeak_cmd.py", line 23, in <module>
from MACS2.OptValidator import opt_validate
File "/data/putnamlab/conda/GeneExt/geneext/lib/python3.9/site-packages/MACS2/OptValidator.py", line 20, in <module>
from MACS2.IO.Parser import BEDParser, ELANDResultParser, ELANDMultiParser, \
File "__init__.pxd", line 206, in init MACS2.IO.Parser
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Traceback (most recent call last):
File "/glfs/brick01/gv0/putnamlab/conda/GeneExt/geneext.py", line 703, in <module>
helper.run_macs2(tempdir+'/' + 'plus.bam','plus',tempdir,verbose = verbose)
File "/glfs/brick01/gv0/putnamlab/conda/GeneExt/geneext/helper.py", line 79, in run_macs2
ps.check_returncode()
File "/data/putnamlab/conda/GeneExt/geneext/lib/python3.9/subprocess.py", line 460, in check_returncode
raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '('macs2', 'callpeak', '-t', 'tmp/plus.bam', '-f', 'BAM', '--keep-dup', '20', '-q', '0.01', '--shift', '1', '--extsize', '100', '--broad', '--nomodel', '--min-length', '30', '-n', 'plus', '--outdir', 'tmp')' returned non-zero exit status 1.
Still getting same error but now it seems to be in the correct location.
20240912
Kevin Wong is also running into the same error when installing Gene Ext on the UMiami server. Zoe did install Gene Ext successfully in June on Andromeda and Unity. Tried activating the conda env in her folder on Andromeda: /data/putnamlab/zdellaert/snRNA/programs/GeneExt
but do not have permissions. Asked her to run chmod o+rwx /data/putnamlab/zdellaert/snRNA/programs/GeneEx
Let’s see if it works now.
interactive
module load Miniconda3/4.9.2
cd /data/putnamlab/conda
git clone https://github.com/sebepedroslab/GeneExt.git
cd GeneExt/
conda env create -n geneext -f environment.yaml
20241023
Hello world. Stuff still didn’t work but Kevin got it to work on the UMiami server!!! He had to update numpy and pandas versions. See his post here.
Going to remove GeneExt and try to reinstall by updating the python package versions.
module load Miniconda3/4.9.2
conda activate /data/putnamlab/conda/GeneExt
20241113
Hello again. I decided not to run gene ext – I did it with the e5 data and the estimates of the 3’UTRs were ~1000bp, which was my estimate as well. So I am going to clean up my shortstack output using Sam’s code.
Examine the Results.txt
file.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/shortstack
head Results.txt | column -t
Locus Name Chrom Start End Length Reads UniqueReads FracTop Strand MajorRNA MajorRNAReads Short Long 21 22 23 24 DicerCall MIRNA known_miRNAs
chromosome_1:18460-18882 Cluster_1 chromosome_1 18460 18882 423 460 107 0.9108695652173913 + UCAAACCAACGCAGCCGAAAAGUAGAUGU 131 18 414 2 1 15 10 N N NA
chromosome_1:70293-70717 Cluster_2 chromosome_1 70293 70717 425 62964 382 1.0 + UAAGACUCUGAGGAUGGAAU 30756 37586 393 2202 15934 4643 2206 N N NA
chromosome_1:125835-126254 Cluster_3 chromosome_1 125835 126254 420 5946 24 0.0 - CGAAAACGAAUCUGCUAA 3302 5885 0 59 2 0 0 N N NA
chromosome_1:187614-188042 Cluster_4 chromosome_1 187614 188042 429 942 156 0.024416135881104035 - UAAAUCGGUUUUCUGUUUUCCAACGUGU 279 8 923 3 1 3 4 N N NA
chromosome_1:734332-734761 Cluster_5 chromosome_1 734332 734761 430 809 87 0.9802224969097652 + ACAACAAGAACUUGCGUUUGCUGAACGUU 264 1 794 4 1 5 4 N N NA
chromosome_1:772269-772697 Cluster_6 chromosome_1 772269 772697 429 463 71 0.028077753779697623 - UGUUCAAAUCUGUAAGCAUUAACCGAAGC 240 5 451 0 0 1 6 N N NA
chromosome_1:809191-809619 Cluster_7 chromosome_1 809191 809619 429 1150 216 0.0017391304347826088 - UCAUAUCGUCUACAUUGUGAUGCACCAGU 250 23 1113 0 1 1 12 N N NA
chromosome_1:811151-811576 Cluster_8 chromosome_1 811151 811576 426 1399 392 0.0035739814152966403 - UAGCCGACCAGGAUUGAAGGAUUGAGUCUU 49 19 1352 9 2 1 16 N N NA
chromosome_1:812845-813280 Cluster_9 chromosome_1 812845 813280 436 2670 414 0.0026217228464419477 - UGCUUCUGGUGCACUGUAAAUGACAGCUCC 739
wc -l Results.txt
9259 Results.txt
Columns of interest:
- Column 1 - genomic region of miRNA match
- Column 20 - shortstack miRNA? Y/N
- Column 21 - match to mirbase? NA or mirbase match
awk '{print $1"\t"$20"\t"$21}' "Results.txt" | head | column -t
Locus MIRNA known_miRNAs
chromosome_1:18460-18882 N NA
chromosome_1:70293-70717 N NA
chromosome_1:125835-126254 N NA
chromosome_1:187614-188042 N NA
chromosome_1:734332-734761 N NA
chromosome_1:772269-772697 N NA
chromosome_1:809191-809619 N NA
chromosome_1:811151-811576 N NA
chromosome_1:812845-813280 N NA
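The column numbers above (20 and 21) are hard-coded, so they would silently point at the wrong fields if a different ShortStack version changed the column order. A minimal sketch of looking the columns up by header name instead, run here on a small toy file (Results_toy.txt and its three rows are made up for illustration, not real output):

```shell
# Toy stand-in for Results.txt: tab-delimited, only three of the real columns.
tmpdir=$(mktemp -d)
printf 'Locus\tMIRNA\tknown_miRNAs\n' > "$tmpdir/Results_toy.txt"
printf 'chromosome_1:100-200\tN\tNA\n' >> "$tmpdir/Results_toy.txt"
printf 'chromosome_6:300-400\tY\tnve-miR-9425\n' >> "$tmpdir/Results_toy.txt"
printf 'chromosome_7:500-600\tY\tNA\n' >> "$tmpdir/Results_toy.txt"

# Read the header once to map column names to positions, then count
# all miRNA loci (MIRNA == Y) and the subset with a database match.
summary=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["MIRNA"]) == "Y" {
    total++
    if ($(col["known_miRNAs"]) != "NA") known++
  }
  END { printf "total=%d known=%d\n", total, known }
' "$tmpdir/Results_toy.txt")
echo "$summary"
```

On the real file, the same awk program could be pointed at Results.txt, assuming the header names MIRNA and known_miRNAs are spelled as in the head output above.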
miRNAs matching mirbase:
awk '$20 == "Y" && $21 != "NA" {print $1"\t"$20"\t"$21}' "Results.txt" | head | column -t
column: line 3 is too long, output will be truncated
chromosome_6:21902734-21902826 Y nve-miR-9425_MIMAT0035384_Nematostella_vectensis_miR-9425;miR-9425_Nematostella_vectensis_Moran_et_al._2014_NA
chromosome_7:2303817-2303911 Y sca-nve-F-miR-2036_Scolanthus_callimorphus_Praher_et_al._2021_Transcriptome-level;eca-nve-F-miR-2036_Edwardsiella_carnea_Praher_et_al._2021_Transcriptome-level
0-5p;ccr-miR-100_MIMAT0026195_Cyprinus_carpio_miR-100;pmi-miR-100-5p_MIMAT0032156_Patiria_miniata_miR-100-5p;cpi-miR-100-5p_MIMAT0037714_Chrysemys_picta_miR-100-5p;chi-miR-100-5p_MIMAT0035897_Capra_hircus_miR-100-5p;dma-miR-100_MIMAT0049252_Daubentonia_madagascariensis_miR-100;sbo-miR-100_MIMAT0049501_Saimiri_boliviensis_miR-100;ola-miR-100_MIMAT0022614_Oryzias_latipes_miR-100
chromosome_8:4117884-4117974 Y Adi-Mir-2030_5p_Acropora_digitifera_Gajigan_&_Conaco_2017_nve-miR-2030-5p;_nve-miR-2030-5p;_spi-miR-temp-40;ami-nve-F-miR-2030-5p_Acropora_millepora_Praher_et_al._2021_NA;adi-nve-F-miR-2030_Acropora_digitifera__Praher_et_al._2021_NA;spi-mir-temp-40_Stylophora_pistillata_Liew_et_al._2014_Close_match_of_nve-miR-2030;eca-nve-F-miR-2030_Edwardsiella_carnea_Praher_et_al._2021_Transcriptome-level;miR-2030_Nematostella_vectensis_Moran_et_al._2014_NA
chromosome_14:8601339-8601434 Y apa-mir-2037_Exaiptasia_pallida_Baumgarten_et_al._2017_miR-2037;_Nve;_Spis;spi-mir-temp-20_Stylophora_pistillata_Liew_et_al._2014_NA;eca-nve-F-miR-2037_Edwardsiella_carnea_Praher_et_al._2021_Transcriptome-level;spi-nve-F-miR-2037_Stylophora_pistillata_Praher_et_al._2021_NA;sca-nve-F-miR-2037-3p_Scolanthus_callimorphus_Praher_et_al._2021_Transcriptome-level;ami-nve-F-miR-2037-3p_Acropora_millepora_Praher_et_al._2021_NA;mse-nve-F-miR-2037-3p_Metridium_senile_Praher_et_al._2021_Transcriptome-level;avi-miR-temp-2037_Anemonia_viridis_Urbarova_et_al._2018_NA
awk '$20 == "Y" && $21 != "NA" {print $1"\t"$20"\t"$21}' "Results.txt" | wc -l
5
There are 5 miRNAs that had matches to mirbase (all from the cnidarian miRNAs). How many miRNAs were identified in total?
awk '$20 == "Y" {print $1"\t"$20"\t"$21}' "Results.txt" | wc -l
51
Look at fasta
grep "^>" "mir.fasta" | head
>Cluster_30::chromosome_1:2984896-2984988(+)
>Cluster_30.mature::chromosome_1:2984918-2984940(+)
>Cluster_30.star::chromosome_1:2984947-2984968(+)
>Cluster_137::chromosome_1:16932584-16932677(-)
>Cluster_137.mature::chromosome_1:16932635-16932657(-)
>Cluster_137.star::chromosome_1:16932604-16932626(-)
>Cluster_247::chromosome_2:2155517-2155610(+)
>Cluster_247.mature::chromosome_2:2155539-2155561(+)
>Cluster_247.star::chromosome_2:2155568-2155590(+)
>Cluster_449::chromosome_2:20014601-20014694(+)
The fasta starting coordinates need to be fixed due to a bug in the code (see issue), which incorrectly calculates the starting coordinates in the fasta output. All other files where start/stop coordinates are displayed are correct. Using Sam White’s code.
awk '
/^>/ {
# Split the line into main parts based on "::" delimiter
split($0, main_parts, "::")
# Extract the coordinate part and strand information separately
coordinates_strand = main_parts[2]
split(coordinates_strand, coord_parts, "[:-]")
# Determine if the strand information is present and extract it
strand = ""
if (substr(coordinates_strand, length(coordinates_strand)) ~ /[\(\)\-\+]/) {
strand = substr(coordinates_strand, length(coordinates_strand) - 1)
coordinates_strand = substr(coordinates_strand, 1, length(coordinates_strand) - 2)
split(coordinates_strand, coord_parts, "[:-]")
}
# Increment the starting coordinate by 1
new_start = coord_parts[2] + 1
# Reconstruct the description line with the new starting coordinate
new_description = main_parts[1] "::" coord_parts[1] ":" new_start "-" coord_parts[3] strand
# Print the modified description line
print new_description
# Skip to the next line to process the sequence line
next
}
# For sequence lines, print them as-is
{
print
}
' "mir.fasta" \
> "mir_coords_fixed.fasta"
diff "mir.fasta" \
"mir_coords_fixed.fasta" \
| head
< >Cluster_30::chromosome_1:2984896-2984988(+)
---
> >Cluster_30::chromosome_1:2984897-2984988(+)
3c3
< >Cluster_30.mature::chromosome_1:2984918-2984940(+)
---
> >Cluster_30.mature::chromosome_1:2984919-2984940(+)
5c5
< >Cluster_30.star::chromosome_1:2984947-2984968(+)
Success. Select only mature miRNAs from fasta
awk '/^>/ {p = ($0 ~ /mature/)} p' mir_coords_fixed.fasta > mir_coords_fixed_mature.fasta
zgrep -c ">" mir_coords_fixed_mature.fasta
51
I want to double check my 3’UTR identification for Astrangia and rerun miranda with the shortstack miRNAs.
Going to redo the 3’UTR estimation (using e5 deep dive code). I already have all_features.txt, apoc.Chromosome_lenghts.txt, and apoc.Chromosome_names.txt.
Generate individual gff for genes
cd /data/putnamlab/jillashey/Astrangia_Genome
grep $'\tgene\t' apoculata_v2.0.gff3 > apoc_gene.gtf
Create 1kb 3’UTR
interactive
module load BEDTools/2.30.0-GCC-11.3.0
bedtools flank -i apoc_gene.gtf -g apoc.Chromosome_lenghts.txt -l 0 -r 1000 -s | awk '{gsub("gene","3prime_UTR",$3); print $0 }' | awk '{if($5-$4 > 3)print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}' | tr ' ' '\t' > apoc_3UTR_1kb.gtf
Subtract portions of 3’ UTRs that overlap nearby genes
bedtools subtract -a apoc_3UTR_1kb.gtf -b apoc_gene.gtf > apoc_3UTR_1kb_corrected.gtf
The corrected file might have more lines than the input because bedtools subtract removes gene overlaps from the 3’UTRs. So if a 3’UTR contains a short gene (e.g., 200 bp), that 200 bp section is subtracted and the rest of the 3’UTR is kept, regardless of where it is, which can split one 3’UTR into two features. This might not make sense except in my head.
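To make the line-count point concrete, here is a small sketch of the interval arithmetic for one 3’UTR with one short gene inside it. This is plain awk on made-up coordinates (1-based, inclusive), not bedtools itself, and it only handles a single gene per UTR:

```shell
# One toy 3'UTR and one short gene that sits inside it (hypothetical coordinates).
utr_start=1000; utr_end=2000
gene_start=1400; gene_end=1600

pieces=$(awk -v us="$utr_start" -v ue="$utr_end" \
             -v gs="$gene_start" -v ge="$gene_end" '
  BEGIN {
    # No overlap: the UTR is kept whole.
    if (ge < us || gs > ue) { printf "%d-%d\n", us, ue; exit }
    # Piece of the UTR upstream of the gene, if any.
    if (gs > us) printf "%d-%d\n", us, gs - 1
    # Piece of the UTR downstream of the gene, if any.
    if (ge < ue) printf "%d-%d\n", ge + 1, ue
  }')
echo "$pieces"
```

One input interval becomes two output intervals here, which is why the corrected gtf can end up with more lines than the uncorrected one.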
Extract 3’UTR sequences from genome
awk '{print $1 "\t" $4-1 "\t" $5 "\t" $9 "\t" "." "\t" $7}' apoc_3UTR_1kb_corrected.gtf | sed 's/"//g' > apoc_3UTR_1kb_corrected.bed
bedtools getfasta -fi apoculata.assembly.scaffolds_chromosome_level.fasta -bed apoc_3UTR_1kb_corrected.bed -fo apoc_3UTR_1kb.fasta -name
Hooray now we have a fasta of the 3’UTRs with their corresponding gene!
Before running miranda, clean up /data/putnamlab/jillashey/Astrangia2021/smRNA
cd /data/putnamlab/jillashey/Astrangia2021/smRNA
ls -othr
total 1.9G
drwxr-xr-x 4 jillashey 4.0K Oct 9 2023 fastqc
-rw-r--r-- 1 jillashey 80K Jan 4 2024 reads_collapsed.fa
-rw-r--r-- 1 jillashey 92K Jan 4 2024 reads_collapsed_vs_genome.arf
drwxr-xr-x 2 jillashey 4.0K Jan 4 2024 dir_prepare_signature1704403591
-rw-r--r-- 1 jillashey 377 Jan 4 2024 error_04_01_2024_t_16_26_17.log
-rw-r--r-- 1 jillashey 4.5K Jan 4 2024 result_04_01_2024_t_16_26_17.csv
-rw-r--r-- 1 jillashey 45K Jan 4 2024 result_04_01_2024_t_16_26_17.html
drwxr-xr-x 2 jillashey 4.0K Jan 4 2024 pdfs_04_01_2024_t_16_26_17
-rw-r--r-- 1 jillashey 941 Jan 4 2024 result_04_01_2024_t_16_26_17.bed
drwxr-xr-x 2 jillashey 4.0K Jan 4 2024 mirna_results_04_01_2024_t_16_26_17
drwxr-xr-x 2 jillashey 4.0K Jan 4 2024 dir_prepare_signature1704404056
-rw-r--r-- 1 jillashey 377 Jan 4 2024 error_04_01_2024_t_16_34_08.log
-rw-r--r-- 1 jillashey 4.5K Jan 4 2024 result_04_01_2024_t_16_34_08.csv
-rw-r--r-- 1 jillashey 45K Jan 4 2024 result_04_01_2024_t_16_34_08.html
drwxr-xr-x 2 jillashey 4.0K Jan 4 2024 pdfs_04_01_2024_t_16_34_08
-rw-r--r-- 1 jillashey 941 Jan 4 2024 result_04_01_2024_t_16_34_08.bed
drwxr-xr-x 2 jillashey 4.0K Jan 4 2024 mirna_results_04_01_2024_t_16_34_08
-rw-r--r-- 1 jillashey 500M Jan 7 2024 20240107_reads_collapsed.fa
-rw-r--r-- 1 jillashey 147M Jan 7 2024 20240107_reads_collapsed_vs_genome.arf
drwxr-xr-x 7 jillashey 4.0K Jan 7 2024 mirdeep_runs
drwxr-xr-x 2 jillashey 4.0K Jan 7 2024 dir_prepare_signature1704668491
-rw-r--r-- 1 jillashey 29 Jan 7 2024 error_07_01_2024_t_17_59_09.log
-rw-r--r-- 1 jillashey 2.7K Jan 7 2024 report.log
-rw-r--r-- 1 jillashey 449M Jan 16 2024 20240116_reads_collapsed.fa
-rw-r--r-- 1 jillashey 256M Jan 16 2024 20240116_reads_collapsed_vs_genome.arf
drwxr-xr-x 2 jillashey 4.0K Jan 19 2024 mapper_logs
-rw-r--r-- 1 jillashey 359M Jan 19 2024 20240119_reads_collapsed.fa
-rw-r--r-- 1 jillashey 246 Jan 19 2024 bowtie.log
-rw-r--r-- 1 jillashey 181M Jan 19 2024 20240119_reads_collapsed_vs_genome.arf
drwxr-xr-x 3 jillashey 4.0K Jan 24 2024 refs
drwxr-xr-x 22 jillashey 4.0K Jan 30 2024 mirdeep2
drwxr-xr-x 4 jillashey 4.0K Mar 20 2024 data
drwxr-xr-x 2 jillashey 4.0K May 8 2024 miranda
drwxr-xr-x 18 jillashey 4.0K Oct 27 13:52 scripts
drwxr-xr-x 3 jillashey 4.0K Nov 14 08:59 shortstack
mv 2024* mirdeep2
mv reads* mirdeep2
mv result_04_01_2024_t_16_* mirdeep2
mv mapper_logs/ mirdeep2
mv dir_prepare_signature1704* mirdeep2
mv mirna_results_04_01_2024_t_16_* mirdeep2
mv *log mirdeep2
mv pdfs_04_01_2024_t_16_* mirdeep2
mv mirdeep_runs/ mirdeep2
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda
mkdir old
mv * old
Rerun miranda with shortstack miRNAs and 3’UTR fasta (with gene names included in fasta headers yay). In the scripts folder, I have a script miranda_strict_all.sh that ran miranda with the mirdeep2 miRNAs. Going to rename this script:
mv miranda_strict_all.sh miranda_strict_all_mirdeep2.sh
Write new script for miranda with shortstack miRNAs. In the scripts folder: nano miranda_strict_all_shortstack.sh
#!/bin/bash -i
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Apoc starting miranda run with all genes and miRNAs with energy cutoff <-20 and strict binding invoked" $(date)
echo "Using updated miRNAs from short stack"
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/conda/miranda
miranda /data/putnamlab/jillashey/Astrangia2021/smRNA/shortstack/mir_coords_fixed_mature.fasta /data/putnamlab/jillashey/Astrangia_Genome/apoc_3UTR_1kb.fasta -en -20 -strict -out /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all_1kb_apoc_shortstack.tab
conda deactivate
echo "miranda run finished!" $(date)
echo "counting number of interactions attempted" $(date)
zgrep -c "Performing Scan" /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all_1kb_apoc_shortstack.tab
echo "Parsing output" $(date)
grep -A 1 "Scores for this hit:" /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all_1kb_apoc_shortstack.tab | sort | grep '>' > /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all_1kb_apoc_shortstack_parsed.txt
echo "counting number of putative interactions predicted" $(date)
wc -l /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all_1kb_apoc_shortstack_parsed.txt
echo "Apoc miranda script complete" $(date)
Submitted batch job 348808. Ran in ~20 mins. Look at output:
counting number of interactions attempted Thu Nov 14 14:43:19 EST 2024
2547144
Parsing output Thu Nov 14 14:43:28 EST 2024
counting number of putative interactions predicted Thu Nov 14 14:43:28 EST 2024
5187 /data/putnamlab/jillashey/Astrangia2021/smRNA/miranda/miranda_strict_all_1kb_apoc_shortstack_parsed.txt
Look at txt file
head miranda_strict_all_1kb_apoc_shortstack_parsed.txt
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_10.126;Name=EVM%20prediction%20chromosome_10.126::chromosome_10:1142345-1143345 170.00 -21.16 2 20 717 739 19 84.21% 84.21%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_11.3016;Name=EVM%20prediction%20chromosome_11.3016::chromosome_11:31902611-31903611 166.00 -21.91 2 21 842 865 21 76.19% 85.71%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_11.359;Name=EVM%20prediction%20chromosome_11.359::chromosome_11:3751965-3752965 164.00 -20.30 2 18 439 461 17 82.35% 88.24%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_12.1475;Name=EVM%20prediction%20chromosome_12.1475::chromosome_12:15544468-15545468 155.00 -21.98 2 17 149 174 19 73.68% 78.95%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_12.1806;Name=EVM%20prediction%20chromosome_12.1806::chromosome_12:19406302-19407302 163.00 -20.09 2 17 977 999 16 87.50% 87.50%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_12.2475;Name=EVM%20prediction%20chromosome_12.2475::chromosome_12:26095701-26096701 173.00 -20.76 2 21 336 355 19 84.21% 89.47%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_12.2814;Name=EVM%20prediction%20chromosome_12.2814::chromosome_12:28905929-28906929 163.00 -20.09 2 17 349 371 16 87.50% 87.50%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_13.1892;Name=EVM%20prediction%20chromosome_13.1892::chromosome_13:19643825-19644825 177.00 -20.79 2 18 860 881 16 93.75% 93.75%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_13.2239;Name=EVM%20prediction%20chromosome_13.2239::chromosome_13:23509872-23510872 173.00 -21.02 2 21 577 596 19 84.21% 89.47%
>Cluster_1155.mature::chromosome_5:8149619-8149640(-) ID=evm.TU.chromosome_13.330;Name=EVM%20prediction%20chromosome_13.330::chromosome_13:2959926-2960926 172.00 -21.83 2 17 218 239 15 93.33% 93.33%
How many unique miRNAs had predicted interactions?
cut -f1 miranda_strict_all_1kb_apoc_shortstack_parsed.txt | sort | uniq | wc -l
51
How many unique 3’UTRs had interactions with miRNAs?
cut -f2 miranda_strict_all_1kb_apoc_shortstack_parsed.txt | sort | uniq | wc -l
4515
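The two counts above give totals only. A sketch of how the number of predicted targets per miRNA could be tallied from the same parsed file, shown on a toy tab-delimited file (parsed_toy.txt and its miRNA/gene names are made up; the real file has more columns but the same first two):

```shell
# Toy stand-in for the parsed miranda output: miRNA, target, score, energy.
tmpdir=$(mktemp -d)
parsed="$tmpdir/parsed_toy.txt"
printf '>miR_A\tgene1\t170.00\t-21.16\n' > "$parsed"
printf '>miR_A\tgene2\t166.00\t-21.91\n' >> "$parsed"
printf '>miR_B\tgene1\t164.00\t-20.30\n' >> "$parsed"

# Count rows per miRNA (column 1), then list miRNAs from most to fewest targets.
targets_per_mirna=$(cut -f1 "$parsed" | sort | uniq -c \
  | awk '{print $2 "\t" $1}' | sort -k2,2nr)
echo "$targets_per_mirna"
```

Note that this counts rows, so a miRNA with two binding sites in the same 3’UTR would be counted twice; adding a `sort -u` on columns 1 and 2 first would count unique miRNA-UTR pairs instead.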
Copy miranda_strict_all_1kb_apoc_shortstack_parsed.txt onto local computer.
20241118
Rerun mirdeep2???? Okay. I feel like I have a little more clarity with the mirdeep2 analysis now that I’ve also run it for my Mcap data. First, I need to map the samples to the genome with the mapper.pl script. Genome must first be indexed by bowtie-build (NOT bowtie2). Because I have multiple samples, I need to make a config file that contains fq file locations and a unique 3 letter code (see mirdeep2 documentation).
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
nano config.txt
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1065_R1_001.fastq.gz_1.fastq s01
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1147_R1_001.fastq.gz_1.fastq s02
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1412_R1_001.fastq.gz_1.fastq s03
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1560_R1_001.fastq.gz_1.fastq s04
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1567_R1_001.fastq.gz_1.fastq s05
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1617_R1_001.fastq.gz_1.fastq s06
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-1722_R1_001.fastq.gz_1.fastq s07
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2000_R1_001.fastq.gz_1.fastq s08
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2007_R1_001.fastq.gz_1.fastq s09
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2302_R1_001.fastq.gz_1.fastq s10
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2360_R1_001.fastq.gz_1.fastq s11
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2398_R1_001.fastq.gz_1.fastq s12
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2404_R1_001.fastq.gz_1.fastq s13
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2412_R1_001.fastq.gz_1.fastq s14
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2512_R1_001.fastq.gz_1.fastq s15
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2523_R1_001.fastq.gz_1.fastq s16
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2563_R1_001.fastq.gz_1.fastq s17
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2729_R1_001.fastq.gz_1.fastq s18
/data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar/AST-2755_R1_001.fastq.gz_1.fastq s19
In the config.txt file, I have the path and file name, as well as the unique 3 letter code (s followed by sample number).
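Typing the config file by hand is error-prone with this many samples. A minimal sketch of generating it with a loop instead, demonstrated in a temporary folder with three empty placeholder files (the temp folder and placeholders are only for illustration; on the server the loop would run over the real flexbar folder):

```shell
# Placeholder files standing in for the trimmed R1 fastq files.
tmpdir=$(mktemp -d)
touch "$tmpdir/AST-1065_R1_001.fastq.gz_1.fastq" \
      "$tmpdir/AST-1147_R1_001.fastq.gz_1.fastq" \
      "$tmpdir/AST-1412_R1_001.fastq.gz_1.fastq"

# Write one line per file: full path, then a 3-character code (s01, s02, ...).
i=0
for fq in "$tmpdir"/*_R1_001.fastq.gz_1.fastq; do
  i=$((i + 1))
  printf '%s s%02d\n' "$fq" "$i"
done > "$tmpdir/config.txt"
cat "$tmpdir/config.txt"
```

The glob picks up every matching file in sorted order, so any sample that should be left out would still need to be removed from config.txt by hand, and the s%02d codes only stay at 3 characters for up to 99 samples.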
In the scripts folder: nano mapper_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load GCCcore/11.3.0 #I needed to add this to resolve conflicts between loaded GCCcore/9.3.0 and GCCcore/11.3.0
module load Bowtie/1.3.1-GCC-11.3.0
echo "Index Apoc genome" $(date)
# Index the reference genome for Apoc
bowtie-build /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/Apoc_ref.btindex
echo "Reference genome indexed!" $(date)
echo "Unload unneeded packages and run mapper script for trimmed stringent reads" $(date)
module unload GCCcore/11.3.0
module unload Bowtie/1.3.1-GCC-11.3.0
conda activate /data/putnamlab/mirdeep2
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
mapper.pl config.txt -e -d -h -j -l 18 -m -p /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/Apoc_ref.btindex -s apoc_mapped_reads.fa -t apoc_mapped_reads_vs_genome.arf
echo "Mapping complete for trimmed reads" $(date)
conda deactivate
Submitted batch job 349330. Ran in about an hour.
20241119
Look at the output
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
zgrep -c ">" apoc_mapped_reads.fa
88460783
wc -l apoc_mapped_reads_vs_genome.arf
17920088 apoc_mapped_reads_vs_genome.arf
less bowtie.log
# reads processed: 4816214
# reads with at least one reported alignment: 326607 (6.78%)
# reads that failed to align: 4413294 (91.63%)
# reads with alignments suppressed due to -m: 76313 (1.58%)
Reported 624511 alignments to 1 output stream(s)
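The percentages in bowtie.log can be recomputed from the raw counts, which is handy when comparing several logs. A small sketch using a toy copy of the log lines shown above (bowtie_toy.log is a made-up file name; the line wording is assumed to match what Bowtie 1 printed here):

```shell
# Toy copy of the bowtie.log summary lines.
tmpdir=$(mktemp -d)
cat > "$tmpdir/bowtie_toy.log" <<'EOF'
# reads processed: 4816214
# reads with at least one reported alignment: 326607 (6.78%)
# reads that failed to align: 4413294 (91.63%)
# reads with alignments suppressed due to -m: 76313 (1.58%)
EOF

# Pull out total and aligned read counts, then print the alignment rate (%).
rate=$(awk -F': ' '
  /reads processed/                 { total = $2 }
  /at least one reported alignment/ { split($2, a, " "); aligned = a[1] }
  END { if (total > 0) printf "%.2f\n", 100 * aligned / total }
' "$tmpdir/bowtie_toy.log")
echo "$rate"
```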
I can now run mirdeep2 to predict my miRNAs. Move the mapper output from the data folder to the output folder.
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2
mkdir 20241119
cd /data/putnamlab/jillashey/Astrangia2021/smRNA/data/trim/flexbar
mv mapper_logs/ /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/20241119/
mv *mapped* /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/20241119/
mv bowtie.log /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/20241119/
The output files from the mapper.pl portion of mirdeep2 will be used as input for the miRDeep2.pl portion of the pipeline. In the scripts folder: nano mirna_predict_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2" $(date)
miRDeep2.pl /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/20241119/apoc_mapped_reads.fa /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/20241119/apoc_mapped_reads_vs_genome.arf /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa none none -P -v -g -1 2>report.log
echo "mirdeep2 complete" $(date)
conda deactivate
Submitted batch job 349437. Ran in about 2 days.
20241121
Downloaded output to computer and identified putative novel and known miRNAs. There were ~240 putative miRNAs total. I will need to look at pdfs to determine MFE and total number of binding bps in hairpin structure.
Look into these:
Understanding mirdeep2 output: I understand the mirdeep2 output but I do not understand the known miRNA output info. On the summary table that is output with the csv/html, it says XXXX # of known miRNAs were detected. However, in the /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/AST-1560/dir_prepare_signature1705975309 folder, there is a file (mature_vs_precursors.arf) that has info about known sequences which I am confused by. It looks like these are known miRNAs that were identified in the Astrangia genome, as they are given genomic coordinates. I may need to go through these files and make sure I am not missing anything. For instance, when I look up chromosome_7_11677 (genomic coordinates for known miRNA ola-miR-100) in that file, it provides me with 80 other matches that have the same genomic coordinates and are the same as miR-100. I may need to go through these files for each sample to make sure that I am not missing any known info.
General mirdeep2 questions
- How do I find the MFE? Is it calculated by mirdeep2 or by the quantifier module? I think it is linked to the randfold step. Need to look into this.
- I looked at Gajigan & Conaco 2017 mirdeep2 pdf outputs from their supplementary materials and they got similar MFE values in their pdfs. However, in Table S5, they have MFE info that is <-25 kcal/mol. How did they get from the MFE that mirdeep2 gave them to the MFE that was displayed in their table?
- After I identify the putative miRNAs, I should blast against tRNA and rRNA dbs
Interpretation of mirdeep2 output
- https://ccbr.github.io/pipeliner-docs/miRNA-seq/miRSeq-Output-Files/
good resource for miranda
- https://bioinformaticsworkbook.org/dataAnalysis/SmallRNA/Miranda_miRNA_Target_Prediction.html#gsc.tab=0