Developmental 2023 Timeseries smRNA analysis
Developmental 2023 Timeseries smRNA analysis
These data came from my developmental timeseries experiment in 2023 with Montipora capitata in Hawaii. In this experiment, Montipora capitata embryos and larvae were exposed to ambient and heat stress over 72 hours from embryo to swimming larvae. Samples were collected at 8 time points: 1, 4, 9, 14, 22, 28, 48 and 72 hpf. The github for this project is here.
I initially sent 8 libraries (n=1 from each timepoint) to Oklahoma Medical Research Foundation NGS Core for sequencing and they looked good, so I sent out the rest of my sampels (24 samples, n=3 per timepoint). Data is now back!
Files were downloaded to these location on Andromeda: XXXX and XXXX. Time to analyze! I’m going to write my notebook in chronological order by date and then will reorganize once the workflow is complete.
20241024
Sequences were received from sequencer today! Make a new directory in my own folder on Andromeda for this project:
cd /data/putnamlab/jillashey
mkdir DT_Mcap_2023
cd DT_Mcap_2023
mkdir smRNA
cd dmRNA
mkdir data scripts output
cd data
mkdir raw trim
cd raw
Copy the files into the raw data folder
cp XXXXXX/* .
cp XXXXXX .
nano fastqc_smallRNAseq_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "Start fastqc" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw
for file in *fastq.gz
do
fastqc $file
done
echo "Fastqc complete, start multiqc" $(date)
multiqc *
echo "multiqc complete!" $(date)
Submitted batch job 340170. Raw QC for smRNAseq data here. Raw QC looking very funky. There are definitely batch effects from the initial and last sequencing and a high level of duplication in the last sequencing run (not uncommon for bbs but also may be due to number of cycles during library prep). The adapter content plots look very different between the two batches as well. The Zymo library prep kit protocol recommended single end sequencing at 75bp max. Oops I did paired end seq at 150bp. I believe that this means I will only be moving forward with R1. The protocol also recommends using cutadapt to trim.
In the scripts folder: nano cutadapt_trim.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
module load cutadapt/4.2-GCCcore-11.3.0
module load FastQC/0.11.8-Java-1.8
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/
#echo "Count number of reads in each file" $(date)
#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt
echo "Start read trimming with cutadapt, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# cutadapt loop
for i in ${array1[@]}; do
cutadapt -a TGGAATTCTCGGGTGCCAAGG \
-u 3 \
-m 15 \
-o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim_${i} \
$i
fastqc /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim_${i} \
-o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc
done
echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)
module purge
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc
multiqc *
echo "MultiQC complete" $(date)
echo "Count number of reads in each trimmed file" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim
zgrep -c "@LH00" *.gz > smRNA_trim_read_count.txt
Submitted batch job 345007. The -a
denotes the adapter sequence that I want to trim and its the Illumina TruSeq small RNA adapter, which Zymo uses in its kit.
20241025
Cutadapt finished running overnight. Cutadapt trim QC is here. The initial samples that were sequenced look good, but the ones sequenced most recently don’t seem to have gotten any adapter trimmed off. Run fastp instead (based on Sam’s trimming for e5).
In the scripts folder: nano fastp_trim.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
module load bio/fastp/0.23.2-GCC-11.2.0
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/
#echo "Count number of reads in each file" $(date)
#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt
echo "Start read trimming with fastp, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# fastp and fastqc loop
for i in ${array1[@]}; do
fastp --in1 ${i} \
--in2 $(echo ${i}|sed s/_R1/_R2/)\
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--trim_poly_g \
--overlap_len_require 17 \
--length_limit 31 \
--merge \
--merged_out /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i}
fastqc /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i}
done
echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim
multiqc *
echo "MultiQC complete" $(date)
echo "Count number of reads in each file" $(date)
zgrep -c "@LH00" *.gz > smRNA_trim_read_count.txt
Submitted batch job 345041. Seems to have worked but the merge wasn’t amazing…seems like a lot of reads were lost when I merged because I required a decently high overlap length. Looking at the e5 [trimming iterations] for smRNAseq data, it also seems like I might be good to just use R1, similar to what Sam did with e5 deep dive data.
In the scripts folder: nano fastp_trim_R1.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/
#echo "Count number of reads in each file" $(date)
#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt
echo "Start read trimming with fastp, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# fastp and fastqc loop
for i in ${array1[@]}; do
fastp --in1 ${i} \
--out1 /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i} \
--qualified_quality_phred 30 \
--trim_poly_g \
--length_limit 31 \
fastqc /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i}
done
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim
multiqc *
echo "MultiQC complete" $(date)
echo "Count number of reads in each file" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim
zgrep -c "@LH00" *.gz > smRNA_trim_read_count.txt
Submitted batch job 345061. Started, then stopped it. Looking at the slurm error file, I am getting this information:
Detecting adapter sequence for read1...
No adapter detected for read1
Read1 before filtering:
total reads: 19773964
total bases: 2985868564
Q20 bases: 2596847950(86.9713%)
Q30 bases: 2097923022(70.2617%)
Read1 after filtering:
total reads: 475
total bases: 13245
Q20 bases: 13005(98.188%)
Q30 bases: 12269(92.6312%)
Filtering result:
reads passed filter: 475
reads failed due to low quality: 901386
reads failed due to too many N: 76
reads failed due to too short: 103
reads failed due to too long: 18871924
reads with adapter trimmed: 0
bases trimmed due to adapters: 0
Reads are failing left and right because they are too long…I think i need to specify what adapters that I want to trim because fastp is trying to detect the adapters by analyzing the tails of the first ~1M reads. I’m going to supply it with adapter sequences instead. In my raw QC, it says some of the reads have the illumina small RNA 3’ adapter and some have the illumina universal adapter.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data
nano illumina_adapters.fasta
>Illumina TruSeq small RNA 3' adapter
TGGAATTCTCGGGTGCCAAGG
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
In the scripts folder, edit: nano fastp_trim_R1.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
#module load fastp/0.19.7-foss-2018b
module load bio/fastp/0.23.2-GCC-11.2.0
module load FastQC/0.11.8-Java-1.8
#module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/
#echo "Count number of reads in each file" $(date)
#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt
echo "Start read trimming with fastp, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# fastp and fastqc loop
for i in ${array1[@]}; do
fastp --in1 ${i} \
--out1 /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i} \
--qualified_quality_phred 30 \
--trim_poly_x \
#--length_limit 31 \
--adapter_fasta /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/illumina_adapters.fasta
fastqc /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i}
done
Submitted batch job 345073. Ran in 1.5 hours, and then ran multiQC on the output. Also looked at the individual QC htmls. Most samples have good phred scores and a peak where the miRNAs should be in terms of length. But there is a lot of weirdness… a lot of overrepresented sequences that might be adapters? And still a decent amount of adapter content. I’m not sure what I should be trimming and i am stressed haha. I need to look into this further and think about what I’m really trimming. also evaluate tools to use…maybe email some ppl???? Also the files say 31bp but they are not.
20241027
Even though the QC looks horrible, I want to run short stack on this data to see what it looks like. In the scripts folder: nano shortstack.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Running short stack on mature trimmed miRNAs (R1) from Mcap DT project"
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim
# Load modules
module load ShortStack/4.0.2-foss-2022a
module load Kent_tools/442-GCC-11.3.0
# Run short stack
ShortStack \
--genomefile /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
--readfile trim.fastp.31bp.10_small_RNA_S4_R1_001.fastq.gz \
trim.fastp.31bp.11_small_RNA_S5_R1_001.fastq.gz \
trim.fastp.31bp.13_S76_R1_001.fastq.gz \
trim.fastp.31bp.14_small_RNA_S6_R1_001.fastq.gz \
trim.fastp.31bp.23_S77_R1_001.fastq.gz \
trim.fastp.31bp.24_small_RNA_S7_R1_001.fastq.gz \
trim.fastp.31bp.26_small_RNA_S8_R1_001.fastq.gz \
trim.fastp.31bp.28_small_RNA_S9_R1_001.fastq.gz \
trim.fastp.31bp.35_S78_R1_001.fastq.gz \
trim.fastp.31bp.36_small_RNA_S10_R1_001.fastq.gz \
trim.fastp.31bp.37_small_RNA_S11_R1_001.fastq.gz \
trim.fastp.31bp.39_small_RNA_S12_R1_001.fastq.gz \
trim.fastp.31bp.47_small_RNA_S13_R1_001.fastq.gz \
trim.fastp.31bp.48_small_RNA_S14_R1_001.fastq.gz \
trim.fastp.31bp.51_small_RNA_S15_R1_001.fastq.gz \
trim.fastp.31bp.52_S79_R1_001.fastq.gz \
trim.fastp.31bp.60_S80_R1_001.fastq.gz \
trim.fastp.31bp.61_small_RNA_S16_R1_001.fastq.gz \
trim.fastp.31bp.62_small_RNA_S17_R1_001.fastq.gz \
trim.fastp.31bp.63_small_RNA_S18_R1_001.fastq.gz \
trim.fastp.31bp.6_small_RNA_S1_R1_001.fastq.gz \
trim.fastp.31bp.72_S81_R1_001.fastq.gz \
trim.fastp.31bp.73_small_RNA_S19_R1_001.fastq.gz \
trim.fastp.31bp.74_small_RNA_S20_R1_001.fastq.gz \
trim.fastp.31bp.75_small_RNA_S21_R1_001.fastq.gz \
trim.fastp.31bp.7_small_RNA_S2_R1_001.fastq.gz \
trim.fastp.31bp.85_S82_R1_001.fastq.gz \
trim.fastp.31bp.86_small_RNA_S22_R1_001.fastq.gz \
trim.fastp.31bp.87_small_RNA_S23_R1_001.fastq.gz \
trim.fastp.31bp.88_small_RNA_S24_R1_001.fastq.gz \
trim.fastp.31bp.8_small_RNA_S3_R1_001.fastq.gz \
trim.fastp.31bp.9_S75_R1_001.fastq.gz \
--known_miRNAs /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa \
--outdir /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/shortstack \
--threads 10 \
--dn_mirna
echo "Short stack complete!"
Submitted batch job 345205.
20241028
Shortstack finished running overnight but doesn’t seem to have worked well. Getting this at the end of the error file:
Traceback (most recent call last):
File "/opt/software/ShortStack/4.0.2-foss-2022a/ShortStack", line 3592, in <module>
mir_qdata = mirna(args, merged_bam, fai, pmir_bedfile, read_count)
File "/opt/software/ShortStack/4.0.2-foss-2022a/ShortStack", line 2259, in mirna
dn_q_bedlines, dn_locus_bedlines = zip(*denovo_mloci1)
ValueError: not enough values to unpack (expected 2, got 0)
Also getting this in the error file for the samples:
# reads processed: 18875806
# reads with at least one alignment: 999 (0.01%)
# reads that failed to align: 18874807 (99.99%)
Reported 2699 alignments
[bam_sort_core] merging from 0 files and 10 in-memory blocks...
# reads processed: 11804203
# reads with at least one alignment: 1216 (0.01%)
# reads that failed to align: 11802987 (99.99%)
Reported 2981 alignments
[bam_sort_core] merging from 0 files and 10 in-memory blocks...
# reads processed: 24558419
# reads with at least one alignment: 50 (0.00%)
# reads that failed to align: 24558369 (100.00%)
Reported 803 alignments
Vast majority of reads are not aligning which makes sense because I gave it very long reads (>150bp). This is probably why it failed. No results files or miRNA fasta files were generated either. Need to revisit trimming
20241105
Revisiting trimming with cutadapt. In the scripts folder, edit cutadapt_trim.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
module load cutadapt/4.2-GCCcore-11.3.0
#module load cutadapt/3.5-GCCcore-11.2.0
module load FastQC/0.11.8-Java-1.8
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/
#echo "Count number of reads in each file" $(date)
#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt
echo "Start read trimming with cutadapt, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# cutadapt loop
for i in ${array1[@]}; do
cutadapt -a TGGAATTCTCGGGTGCCAAGG -u 3 -m 15 -q 20,20 --trim-n --max-n 0 -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim_cutadapt_${i} $i
done
echo "Read trimming of adapters complete" $(date)
Submitted batch job 347763. Took almost 4 hours. I didn’t put the QC code in here because I wanted to run it in interactive mode.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim
interactive
module load FastQC/0.11.8-Java-1.8
fastqc *
Started running in interactive mode at 2:15pm
mv *qc ../../output/fastqc/trim
cd ../../output/fastqc/trim
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc *
Looks very similar to cutadapt attempt 1. The samples that were sequenced in the first batch were able to be successfully trimmed to 25 bp. Going to rerun cutadapt with some more stringent cutoffs and more adapter seqs that were identified in the individual fastqc outputs. In the scripts folder: nano cutadapt_trim_stringent.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
module load cutadapt/4.2-GCCcore-11.3.0
#module load cutadapt/3.5-GCCcore-11.2.0
module load FastQC/0.11.8-Java-1.8
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/
#echo "Count number of reads in each file" $(date)
#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt
echo "Start read trimming with cutadapt, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# cutadapt loop
for i in ${array1[@]}; do
cutadapt -a TGGAATTCTCGGGTGCCAAGG -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -a GATCGGAAGAGCACACGTCTGAACTCCAGTCA -a GAAACGTTGGGTTGCGGTATTGGAAGAGCACACGTCTGAACTCCAGTCAC -a AGTGTCCTATCAGCTACCATCGGAAGAGCACACGTCTGAACTCCAGTCAC -a CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -a TAGCGTCGAACGGGCGCAATCGGAAGAGCACACGTCTGAACTCCAGTCAC -a ACGTCTGAACTCCAGTCACTTACAGGAATCTGGGGGGGGGGGGGGGGGGG -a TATCGGTGAAACATCCTCATCGGAAGAGCACACGTCTGAACTCCAGTCAC -m 18 -M 30 -q 30,30 --trim-n --max-n 0 --discard-untrimmed -e 0.1 -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_${i} $i
done
## added adapters that were overrepresented sequences from the individual fastqc htmls
echo "Read trimming of adapters complete" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/
fastqc * -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim_stringent
done
echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)
module unload cutadapt/4.2-GCCcore-11.3.0
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim_stringent
multiqc *
echo "MultiQC complete" $(date)
echo "Count number of reads in each trimmed file" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
zgrep -c "@LH00" *.gz > smRNA_trim_stringent_read_count.txt
Submitted batch job 347786. Based on initial looks, a lot of the reads are getting tossed because they are too short.
20241108
Took a while but ran. The lengths look much better, as does the GC content. There are some samples that have such a high duplication rate though…I know this is a result of PCR artifacts. I may run shortstack to see what the output looks like. I may need to do some adapter trimming as well with certain samples.
next steps:
- run short stack on stringent trimmed reads
- run picard (picard/2.25.1-Java-11)
- EstimateLibraryComplexity - Estimates the numbers of unique molecules in a sequencing library.
- MarkDuplicates - Identifies duplicate reads.
USAGE: PicardCommandLine[-h]
Run short stack.
In the scripts folder: nano shortstack_trim_stringent.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Running short stack on mature trimmed miRNAs from Mcap DT project"
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
# Load modules
module load ShortStack/4.0.2-foss-2022a
module load Kent_tools/442-GCC-11.3.0
# Run short stack
# Run short stack
ShortStack \
--genomefile /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
--readfile trim_stringent_cutadapt_10_small_RNA_S4_R1_001.fastq.gz \
trim_stringent_cutadapt_11_small_RNA_S5_R1_001.fastq.gz \
trim_stringent_cutadapt_13_S76_R1_001.fastq.gz \
trim_stringent_cutadapt_63_small_RNA_S18_R1_001.fastq.gz \
--known_miRNAs /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa \
--outdir /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/shortstack_trim_stringent \
--threads 10 \
--dn_mirna
echo "Short stack complete!"
Submitted batch job 348221 - only doing a subset for now. Ran but ended with this error: /var/spool/slurmd/job348221/slurm_script: line 56: --known_miRNAs: command not found
, even though that is a command…Maybe copy mature_mirbase_cnidarian_T.fa
to this folder?
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data
cp /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa .
Edit the shortstack_trim_stringent.sh
script so that the known miRNAs path is the new path (/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa
). Submitted batch job 348256. Got the same error. Trying on a subset and making sure that the backslashes are correct. Submitted batch job 348269
Let’s try running picard in interactive mode
interactive
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/shortstack_trim_stringent/ShortStack_1731009886
module load picard/2.25.1-Java-11
To execute picard run: java -jar $EBROOTPICARD/picard.jar
I just want to test picard on one of the bam files (trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam
).
# Run EstimateLibraryComplexity
java -jar $EBROOTPICARD/picard.jar EstimateLibraryComplexity \
I=trim_stringent_cutadapt_24_small_RNA_S7_R1_001.bam \
O=trim_stringent_cutadapt_24_small_RNA_complexity_metrics.txt
INFO 2024-11-08 09:24:40 EstimateLibraryComplexity
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** EstimateLibraryComplexity -I trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam -O trim_stringent_cutadapt_51_small_RNA_complexity_metrics.txt
**********
09:24:42.296 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/software/picard/2.25.1-Java-11/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Nov 08 09:24:42 EST 2024] EstimateLibraryComplexity INPUT=[trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam] OUTPUT=trim_stringent_cutadapt_51_small_RNA_complexity_metrics.txt MIN_IDENTICAL_BASES=5 MAX_DIFF_RATE=0.03 MIN_MEAN_QUALITY=20 MAX_GROUP_RATIO=500 MAX_READ_LENGTH=0 MIN_GROUP_COUNT=2 READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=33053659 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Nov 08 09:24:42 EST 2024] Executing as jillashey@n063.cluster.com on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.2+9; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.25.1
INFO 2024-11-08 09:24:42 EstimateLibraryComplexity Will store 33053659 read pairs in memory before sorting.
INFO 2024-11-08 09:24:45 EstimateLibraryComplexity Finished reading - read 0 records - moving on to scanning for duplicates.
[Fri Nov 08 09:24:45 EST 2024] picard.sam.markduplicates.EstimateLibraryComplexity done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2035417088
didn’t really seem to run successfully…I got an output file but no information in the output file. Let’s try running MarkDuplicates
java -jar $EBROOTPICARD/picard.jar MarkDuplicates -I trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam -O marked_dups_trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam -M marked_dup_51_small_RNA_metrics.txt
09:32:36.015 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/software/picard/2.25.1-Java-11/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Nov 08 09:32:36 EST 2024] MarkDuplicates INPUT=[trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam] OUTPUT=marked_dups_trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam METRICS_FILE=marked_dup_51_small_RNA_metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Nov 08 09:32:36 EST 2024] Executing as jillashey@n063.cluster.com on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.2+9; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.25.1
INFO 2024-11-08 09:32:36 MarkDuplicates Start of doWork freeMemory: 2018640208; totalMemory: 2035417088; maxMemory: 31136546816
INFO 2024-11-08 09:32:36 MarkDuplicates Reading input file and constructing read end information.
INFO 2024-11-08 09:32:36 MarkDuplicates Will retain up to 112813575 data points before spilling to disk.
INFO 2024-11-08 09:32:37 MarkDuplicates Read 102803 records. 0 pairs never matched.
INFO 2024-11-08 09:32:38 MarkDuplicates After buildSortedReadEndLists freeMemory: 1285863816; totalMemory: 2214113280; maxMemory: 31136546816
INFO 2024-11-08 09:32:38 MarkDuplicates Will retain up to 973017088 duplicate indices before spilling to disk.
Killed
Failed bleh. Also no miRNA loci found in short stack…may need to adjust trimming parameters.
20241112
Need to retrim but first going to run short stack with the samples from the initial sequencing run. These have lower duplication rates and I do not think that they have as much PCR duplication. In the scripts folder: nano shortstack_trim_stringent_sub.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Running short stack on mature trimmed miRNAs from Mcap DT project"
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
# Load modules
module load ShortStack/4.0.2-foss-2022a
module load Kent_tools/442-GCC-11.3.0
# Run short stack
# Run short stack
ShortStack \
--genomefile /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
--readfile trim_stringent_cutadapt_13_S76_R1_001.fastq.gz \
trim_stringent_cutadapt_23_S77_R1_001.fastq.gz \
trim_stringent_cutadapt_35_S78_R1_001.fastq.gz \
trim_stringent_cutadapt_52_S79_R1_001.fastq.gz \
trim_stringent_cutadapt_60_S80_R1_001.fastq.gz \
trim_stringent_cutadapt_72_S81_R1_001.fastq.gz \
trim_stringent_cutadapt_85_S82_R1_001.fastq.gz \
trim_stringent_cutadapt_9_S75_R1_001.fastq.gz \
--known_miRNAs /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa \
--threads 10 \
--dn_mirna
echo "Short stack complete!"
Submitted batch job 348560. Ran in a few hours. Output said it found 3 miRNA loci but not sure if this includes known miRNAs…
awk '$3 == "mature_miRNA"' Results.gff3 > filtered_mature_miRNA.gff
head filtered_mature_miRNA.gff
Montipora_capitata_HIv3___Scaffold_2 ShortStack mature_miRNA 46381100 46381121 476 + . ID=Cluster_2999.mature;Parent=Cluster_2999
Montipora_capitata_HIv3___Scaffold_8 ShortStack mature_miRNA 22810894 22810915 204 + . ID=Cluster_12334.mature;Parent=Cluster_12334
Montipora_capitata_HIv3___Scaffold_14 ShortStack mature_miRNA 2986538 2986560 1 + . ID=Cluster_27569.mature;Parent=Cluster_27569
awk -F'\t' '$20 == "Y"' Results.txt > Results_miRNA.txt
head Results_miRNA.txt
Montipora_capitata_HIv3___Scaffold_2:46381078-46381169 Cluster_2999 Montipora_capitata_HIv3___Scaffold_2 46381078 46381169 92 1012 59 1.0 + ACAAUGUUUCGGCUUGUUCCCG 476 108 8 169 705 17 5 22 Y spi-mir-temp-14_Stylophora_pistillata_Liew_et_al._2014_NA
Montipora_capitata_HIv3___Scaffold_8:22810843-22810935 Cluster_12334 Montipora_capitata_HIv3___Scaffold_8 22810843 22810935 93 4483 95 1.0 + AUGCAGCGGAAAUCGAACCUGGG 3053 154 15 43 256 3224 791 23 Y NA
Montipora_capitata_HIv3___Scaffold_14:2986487-2986580 Cluster_27569 Montipora_capitata_HIv3___Scaffold_14 2986487 2986580 94 49 7 1.0 + ACGGUGAAAGUCGUCUCAAUAUUCA 35 2 35 0 1 1 10 N Y spi-mir-temp-30_Stylophora_pistillata_Liew_et_al._2014_Close_match_of_nve-miR-2036.;eca-nve-F-miR-2036_Edwardsiella_carnea_Praher_et_al._2021_Transcriptome-level;Adi-Mir-2036_3p_Acropora_digitifera_Gajigan_&_Conaco_2017_nve-miR-2036-3p;_nve-miR-2036-3p;_spi-miR-temp-30;ami-nve-F-miR-2036-3p_Acropora_millepora_Praher_et_al._2021_NA;spi-nve-F-miR-2036_Stylophora_pistillata_Praher_et_al._2021_NA
Clearly didn’t pick up all miRNAs. Maybe trying running mirdeep2?? I did a lot of trouble shooting with AST and mirdeep2 (see post here). First, I need to map the samples to the genome with the mapper.pl
script. Genome must first be indexed by bowtie-build (NOT bowtie2). Because I have multiple samples, I need to make a config file that contains fq file locations and a unique 3 letter code (see mirdeep2 documentation). I will also be unzipping the files so remove the .gz from file name.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
nano config.txt
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_10_small_RNA_S4_R1_001.fastq s10
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_11_small_RNA_S5_R1_001.fastq s11
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_13_S76_R1_001.fastq s13
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_14_small_RNA_S6_R1_001.fastq s14
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_23_S77_R1_001.fastq s23
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_24_small_RNA_S7_R1_001.fastq s24
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_26_small_RNA_S8_R1_001.fastq s26
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_28_small_RNA_S9_R1_001.fastq s28
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_35_S78_R1_001.fastq s35
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_36_small_RNA_S10_R1_001.fastq s36
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_37_small_RNA_S11_R1_001.fastq s37
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_39_small_RNA_S12_R1_001.fastq s39
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_47_small_RNA_S13_R1_001.fastq s47
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_48_small_RNA_S14_R1_001.fastq s48
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_51_small_RNA_S15_R1_001.fastq s51
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_52_S79_R1_001.fastq s52
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_60_S80_R1_001.fastq s60
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_61_small_RNA_S16_R1_001.fastq s61
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_62_small_RNA_S17_R1_001.fastq s62
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_63_small_RNA_S18_R1_001.fastq s63
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_6_small_RNA_S1_R1_001.fastq s06
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_72_S81_R1_001.fastq s72
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_73_small_RNA_S19_R1_001.fastq s73
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_74_small_RNA_S20_R1_001.fastq s74
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_75_small_RNA_S21_R1_001.fastq s75
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_7_small_RNA_S2_R1_001.fastq s07
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_85_S82_R1_001.fastq s85
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_86_small_RNA_S22_R1_001.fastq s86
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_87_small_RNA_S23_R1_001.fastq s87
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_88_small_RNA_S24_R1_001.fastq s88
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_8_small_RNA_S3_R1_001.fastq s08
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_9_S75_R1_001.fastq s09
In the config.txt
file, I have the path and file name, as well as the unique 3 letter code (s followed by sample number).
In the scripts folder: nano mapper_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load GCCcore/11.3.0 #I needed to add this to resolve conflicts between loaded GCCcore/9.3.0 and GCCcore/11.3.0
module load Bowtie/1.3.1-GCC-11.3.0
echo "Index Mcap genome" $(date)
# Index the reference genome for Mcap
bowtie-build /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_ref.btindex
echo "Referece genome indexed!" $(date)
echo "Unload unneeded packages and run mapper script for trimmed stringent reads" $(date)
module unload module load GCCcore/11.3.0
module unload Bowtie/1.3.1-GCC-11.3.0
conda activate /data/putnamlab/mirdeep2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
gunzip *.fastq.gz
mapper.pl config.txt -e -d -h -j -l 18 -m -p /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_ref.btindex -s mapped_reads.fa -t mapped_reads_vs_genome.arf
echo "Mapping complete for trimmed stringent reads" $(date)
conda deactivate
Submitted batch job 348612. The flags mean the following:
-e
- use fastq files as input-d
- input is config file-h
- convert to fasta format-i
- convert RNA to DNA alphabet-j
- remove entries with non-canonical letters-l
- discard reads shorter than 18 nt-m
- collapse reads-p
- map to genome–genome has to be indexed by bowtie build-s
- mapped reads fasta output-t
- read mappings v genome output
Also consider adding -q which allows mapping with one mismatch in seed but mapping takes longer. Ran in about an hour. Look at output
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
zgrep -c ">" mapped_reads.fa
21804779
wc -l mapped_reads_vs_genome.arf
8098667 mapped_reads_vs_genome.arf
less bowtie.log
# reads processed: 2215013
# reads with at least one reported alignment: 416126 (18.79%)
# reads that failed to align: 1500959 (67.76%)
# reads with alignments suppressed due to -m: 297928 (13.45%)
Reported 841100 alignments to 1 output stream(s)
I can now run mirdeep2 but this is a problem for tomorrow!
20241113
Short stack and mirdeep2 output has been in this folder: /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
. Going to move to output folder.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output
mkdir mirdeep2_trim_stringent
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
mv ShortStack_1731* ../../output/shortstack_trim_stringent/
mv mapp* ../../output/mirdeep2_trim_stringent/
mv bowtie.log ../../output/mirdeep2_trim_stringent/
mv config.txt ../../output/mirdeep2_trim_stringent/
The output files from the mapper.pl
portion of mirdeep2 will be used as input for the mirdeep2.pl
portion of the pipeline. In the scripts folder: nano mirna_predict_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 with reads that were trimmed stringently via cutadapt" $(date)
miRDeep2.pl /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mapped_reads.fa /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mapped_reads_vs_genome.arf /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa none none -P -v -g -1 2>report.log
echo "mirdeep2 complete" $(date)
conda deactivate
Submitted batch job 348663. In the past, mirdeep2 has taken a few days to process my AST samples so I am guessing it will take a similar amount of time. The flags mean the following:
mapped_reads.fa
- fasta file with sequencing readsMontipora_capitata_HIv3.assembly.fasta
- genome fastamapped_reads_vs_genome.arf
- mapped reads to the genome in miRDeep2 arf formatmature_mirbase_cnidarian_T.fa
- known miRNAsnone
- no fasta file for precursor sequences. I accidently added twonone
flags in there, hopefully that wont mess things up too much.-P
- report novel miRNAs-v
- output verbose mode-g -1
- report all potential miRNAs regardless of low score2>report.log
- redirects errors to log file
20241118
Took about 2 days to run and most of output was put in scripts folder. Downloaded output to my computer.
20241121
I need to revisit trimming. I downloaded slurm-347786.out
(most recent trimming iteration output) to my computer to look at sample adapter information.
# M10
Total reads processed: 19,773,964
Reads with adapters: 12,283,878 (62.1%)
== Read fate breakdown ==
Reads that were too short: 3,481,028 (17.6%)
Reads that were too long: 7,532,588 (38.1%)
Reads with too many N: 33,125 (0.2%)
Reads discarded as untrimmed: 25,736 (0.1%)
Reads written (passing filters): 8,701,487 (44.0%)
Total basepairs processed: 2,985,868,564 bp
Quality-trimmed: 49,791,977 bp (1.7%)
Total written (filtered): 175,293,612 bp (5.9%)
# M11
Total reads processed: 12,310,413
Reads with adapters: 10,586,911 (86.0%)
== Read fate breakdown ==
Reads that were too short: 6,691,286 (54.4%)
Reads that were too long: 1,817,067 (14.8%)
Reads with too many N: 14,264 (0.1%)
Reads discarded as untrimmed: 13,200 (0.1%)
Reads written (passing filters): 3,774,596 (30.7%)
Total basepairs processed: 1,858,872,363 bp
Quality-trimmed: 32,849,546 bp (1.8%)
Total written (filtered): 79,147,991 bp (4.3%)
# M13
Total reads processed: 25,732,779
Reads with adapters: 25,613,472 (99.5%)
== Read fate breakdown ==
Reads that were too short: 7,832,390 (30.4%)
Reads that were too long: 4,570,545 (17.8%)
Reads with too many N: 6,929 (0.0%)
Reads discarded as untrimmed: 48,788 (0.2%)
Reads written (passing filters): 13,274,127 (51.6%)
Total basepairs processed: 3,885,649,629 bp
Quality-trimmed: 133,678,654 bp (3.4%)
Total written (filtered): 328,216,706 bp (8.4%)
# M14
Total reads processed: 35,754,439
Reads with adapters: 29,105,761 (81.4%)
== Read fate breakdown ==
Reads that were too short: 26,385,621 (73.8%)
Reads that were too long: 6,927,638 (19.4%)
Reads with too many N: 8,413 (0.0%)
Reads discarded as untrimmed: 15,538 (0.0%)
Reads written (passing filters): 2,417,229 (6.8%)
Total basepairs processed: 5,398,920,289 bp
Quality-trimmed: 90,439,586 bp (1.7%)
Total written (filtered): 53,521,904 bp (1.0%)
# M23
Total reads processed: 28,608,340
Reads with adapters: 28,488,320 (99.6%)
== Read fate breakdown ==
Reads that were too short: 8,300,653 (29.0%)
Reads that were too long: 5,340,020 (18.7%)
Reads with too many N: 8,149 (0.0%)
Reads discarded as untrimmed: 50,412 (0.2%)
Reads written (passing filters): 14,909,106 (52.1%)
Total basepairs processed: 4,319,859,340 bp
Quality-trimmed: 121,050,542 bp (2.8%)
Total written (filtered): 381,785,695 bp (8.8%)
# M24
Total reads processed: 16,185,558
Reads with adapters: 15,051,209 (93.0%)
== Read fate breakdown ==
Reads that were too short: 1,238,159 (7.6%)
Reads that were too long: 1,298,021 (8.0%)
Reads with too many N: 53,492 (0.3%)
Reads discarded as untrimmed: 17,232 (0.1%)
Reads written (passing filters): 13,578,654 (83.9%)
Total basepairs processed: 2,444,019,258 bp
Quality-trimmed: 51,942,286 bp (2.1%)
Total written (filtered): 258,626,182 bp (10.6%)
# M26
Total reads processed: 45,463,982
Reads with adapters: 33,200,737 (73.0%)
== Read fate breakdown ==
Reads that were too short: 26,954,608 (59.3%)
Reads that were too long: 12,621,446 (27.8%)
Reads with too many N: 21,951 (0.0%)
Reads discarded as untrimmed: 29,590 (0.1%)
Reads written (passing filters): 5,836,387 (12.8%)
Total basepairs processed: 6,865,061,282 bp
Quality-trimmed: 114,030,951 bp (1.7%)
Total written (filtered): 124,668,281 bp (1.8%)
# M28
Total reads processed: 29,330,724
Reads with adapters: 22,529,784 (76.8%)
== Read fate breakdown ==
Reads that were too short: 20,934,300 (71.4%)
Reads that were too long: 6,864,087 (23.4%)
Reads with too many N: 5,150 (0.0%)
Reads discarded as untrimmed: 36,179 (0.1%)
Reads written (passing filters): 1,491,008 (5.1%)
Total basepairs processed: 4,428,939,324 bp
Quality-trimmed: 57,180,891 bp (1.3%)
Total written (filtered): 33,212,070 bp (0.7%)
# M35
Total reads processed: 27,415,125
Reads with adapters: 27,297,487 (99.6%)
== Read fate breakdown ==
Reads that were too short: 8,746,075 (31.9%)
Reads that were too long: 4,714,853 (17.2%)
Reads with too many N: 7,301 (0.0%)
Reads discarded as untrimmed: 46,476 (0.2%)
Reads written (passing filters): 13,900,420 (50.7%)
Total basepairs processed: 4,139,683,875 bp
Quality-trimmed: 135,071,303 bp (3.3%)
Total written (filtered): 350,393,120 bp (8.5%)
# M36
Total reads processed: 21,669,140
Reads with adapters: 15,229,442 (70.3%)
== Read fate breakdown ==
Reads that were too short: 3,530,235 (16.3%)
Reads that were too long: 6,397,943 (29.5%)
Reads with too many N: 45,695 (0.2%)
Reads discarded as untrimmed: 37,563 (0.2%)
Reads written (passing filters): 11,657,704 (53.8%)
Total basepairs processed: 3,272,040,140 bp
Quality-trimmed: 51,150,713 bp (1.6%)
Total written (filtered): 228,309,042 bp (7.0%)
# M37
Total reads processed: 20,247,832
Reads with adapters: 19,802,633 (97.8%)
== Read fate breakdown ==
Reads that were too short: 7,675,729 (37.9%)
Reads that were too long: 434,204 (2.1%)
Reads with too many N: 45,858 (0.2%)
Reads discarded as untrimmed: 20,730 (0.1%)
Reads written (passing filters): 12,071,311 (59.6%)
Total basepairs processed: 3,057,422,632 bp
Quality-trimmed: 62,466,338 bp (2.0%)
Total written (filtered): 246,713,865 bp (8.1%)
# M39
Total reads processed: 20,298,000
Reads with adapters: 14,625,467 (72.1%)
== Read fate breakdown ==
Reads that were too short: 13,581,604 (66.9%)
Reads that were too long: 5,768,966 (28.4%)
Reads with too many N: 3,465 (0.0%)
Reads discarded as untrimmed: 11,546 (0.1%)
Reads written (passing filters): 932,419 (4.6%)
Total basepairs processed: 3,064,998,000 bp
Quality-trimmed: 41,859,848 bp (1.4%)
Total written (filtered): 19,320,488 bp (0.6%)
# M47
Total reads processed: 61,683,862
Reads with adapters: 56,145,283 (91.0%)
== Read fate breakdown ==
Reads that were too short: 13,015,507 (21.1%)
Reads that were too long: 5,696,266 (9.2%)
Reads with too many N: 166,609 (0.3%)
Reads discarded as untrimmed: 153,126 (0.2%)
Reads written (passing filters): 42,652,354 (69.1%)
Total basepairs processed: 9,314,263,162 bp
Quality-trimmed: 181,161,945 bp (1.9%)
Total written (filtered): 846,330,853 bp (9.1%)
# M48
Total reads processed: 26,207,354
Reads with adapters: 5,677,260 (21.7%)
== Read fate breakdown ==
Reads that were too short: 3,896,981 (14.9%)
Reads that were too long: 20,434,620 (78.0%)
Reads with too many N: 6,817 (0.0%)
Reads discarded as untrimmed: 90,086 (0.3%)
Reads written (passing filters): 1,778,850 (6.8%)
Total basepairs processed: 3,957,310,454 bp
Quality-trimmed: 46,499,881 bp (1.2%)
Total written (filtered): 35,591,311 bp (0.9%)
# M51
Total reads processed: 20,348,304
Reads with adapters: 7,710,258 (37.9%)
== Read fate breakdown ==
Reads that were too short: 7,042,815 (34.6%)
Reads that were too long: 12,673,695 (62.3%)
Reads with too many N: 2,273 (0.0%)
Reads discarded as untrimmed: 14,418 (0.1%)
Reads written (passing filters): 615,103 (3.0%)
Total basepairs processed: 3,072,593,904 bp
Quality-trimmed: 48,509,512 bp (1.6%)
Total written (filtered): 12,632,989 bp (0.4%)
# M52
Total reads processed: 29,279,123
Reads with adapters: 29,137,374 (99.5%)
== Read fate breakdown ==
Reads that were too short: 7,298,239 (24.9%)
Reads that were too long: 6,020,859 (20.6%)
Reads with too many N: 8,657 (0.0%)
Reads discarded as untrimmed: 61,778 (0.2%)
Reads written (passing filters): 15,889,590 (54.3%)
Total basepairs processed: 4,421,147,573 bp
Quality-trimmed: 151,159,079 bp (3.4%)
Total written (filtered): 408,145,425 bp (9.2%)
# M60
Total reads processed: 24,332,450
Reads with adapters: 24,230,741 (99.6%)
== Read fate breakdown ==
Reads that were too short: 9,734,570 (40.0%)
Reads that were too long: 3,070,229 (12.6%)
Reads with too many N: 5,938 (0.0%)
Reads discarded as untrimmed: 38,908 (0.2%)
Reads written (passing filters): 11,482,805 (47.2%)
Total basepairs processed: 3,674,199,950 bp
Quality-trimmed: 119,459,545 bp (3.3%)
Total written (filtered): 282,122,042 bp (7.7%)
# M61
Total reads processed: 24,731,279
Reads with adapters: 6,861,730 (27.7%)
== Read fate breakdown ==
Reads that were too short: 6,034,816 (24.4%)
Reads that were too long: 17,805,944 (72.0%)
Reads with too many N: 2,980 (0.0%)
Reads discarded as untrimmed: 76,252 (0.3%)
Reads written (passing filters): 811,287 (3.3%)
Total basepairs processed: 3,734,423,129 bp
Quality-trimmed: 46,085,583 bp (1.2%)
Total written (filtered): 16,211,139 bp (0.4%)
# M62
Total reads processed: 15,145,513
Reads with adapters: 7,633,247 (50.4%)
== Read fate breakdown ==
Reads that were too short: 4,771,918 (31.5%)
Reads that were too long: 7,508,877 (49.6%)
Reads with too many N: 11,032 (0.1%)
Reads discarded as untrimmed: 45,692 (0.3%)
Reads written (passing filters): 2,807,994 (18.5%)
Total basepairs processed: 2,286,972,463 bp
Quality-trimmed: 37,090,659 bp (1.6%)
Total written (filtered): 55,944,873 bp (2.4%)
# M63
Total reads processed: 21,903,406
Reads with adapters: 10,379,795 (47.4%)
== Read fate breakdown ==
Reads that were too short: 9,069,045 (41.4%)
Reads that were too long: 11,612,371 (53.0%)
Reads with too many N: 4,485 (0.0%)
Reads discarded as untrimmed: 28,940 (0.1%)
Reads written (passing filters): 1,188,565 (5.4%)
Total basepairs processed: 3,307,414,306 bp
Quality-trimmed: 46,947,183 bp (1.4%)
Total written (filtered): 25,709,935 bp (0.8%)
# M6
Total reads processed: 24,211,196
Reads with adapters: 15,778,017 (65.2%)
== Read fate breakdown ==
Reads that were too short: 15,305,643 (63.2%)
Reads that were too long: 8,428,574 (34.8%)
Reads with too many N: 1,536 (0.0%)
Reads discarded as untrimmed: 43,373 (0.2%)
Reads written (passing filters): 432,070 (1.8%)
Total basepairs processed: 3,655,890,596 bp
Quality-trimmed: 52,333,701 bp (1.4%)
Total written (filtered): 9,537,955 bp (0.3%)
# M72
Total reads processed: 25,266,689
Reads with adapters: 25,144,539 (99.5%)
== Read fate breakdown ==
Reads that were too short: 7,803,487 (30.9%)
Reads that were too long: 3,726,653 (14.7%)
Reads with too many N: 7,388 (0.0%)
Reads discarded as untrimmed: 51,139 (0.2%)
Reads written (passing filters): 13,678,022 (54.1%)
Total basepairs processed: 3,815,270,039 bp
Quality-trimmed: 137,376,345 bp (3.6%)
Total written (filtered): 328,694,194 bp (8.6%)
# M73
Total reads processed: 5,419,554
Reads with adapters: 5,395,120 (99.5%)
== Read fate breakdown ==
Reads that were too short: 3,459,671 (63.8%)
Reads that were too long: 320,201 (5.9%)
Reads with too many N: 6,668 (0.1%)
Reads discarded as untrimmed: 4,293 (0.1%)
Reads written (passing filters): 1,628,721 (30.1%)
Total basepairs processed: 818,352,654 bp
Quality-trimmed: 17,216,446 bp (2.1%)
Total written (filtered): 32,665,596 bp (4.0%)
# M74
Total reads processed: 23,599,752
Reads with adapters: 6,346,294 (26.9%)
== Read fate breakdown ==
Reads that were too short: 5,447,500 (23.1%)
Reads that were too long: 17,208,033 (72.9%)
Reads with too many N: 3,295 (0.0%)
Reads discarded as untrimmed: 55,067 (0.2%)
Reads written (passing filters): 885,857 (3.8%)
Total basepairs processed: 3,563,562,552 bp
Quality-trimmed: 44,030,663 bp (1.2%)
Total written (filtered): 18,225,152 bp (0.5%)
# M75
Total reads processed: 39,936,022
Reads with adapters: 31,308,251 (78.4%)
== Read fate breakdown ==
Reads that were too short: 26,334,128 (65.9%)
Reads that were too long: 9,283,704 (23.2%)
Reads with too many N: 15,318 (0.0%)
Reads discarded as untrimmed: 53,526 (0.1%)
Reads written (passing filters): 4,249,346 (10.6%)
Total basepairs processed: 6,030,339,322 bp
Quality-trimmed: 104,351,331 bp (1.7%)
Total written (filtered): 94,582,130 bp (1.6%)
# M7
Total reads processed: 30,275,553
Reads with adapters: 24,226,251 (80.0%)
== Read fate breakdown ==
Reads that were too short: 11,063,587 (36.5%)
Reads that were too long: 6,232,765 (20.6%)
Reads with too many N: 48,605 (0.2%)
Reads discarded as untrimmed: 51,715 (0.2%)
Reads written (passing filters): 12,878,881 (42.5%)
Total basepairs processed: 4,571,608,503 bp
Quality-trimmed: 88,504,444 bp (1.9%)
Total written (filtered): 255,714,395 bp (5.6%)
# M85
Total reads processed: 21,835,402
Reads with adapters: 21,711,666 (99.4%)
== Read fate breakdown ==
Reads that were too short: 7,627,502 (34.9%)
Reads that were too long: 2,616,361 (12.0%)
Reads with too many N: 5,995 (0.0%)
Reads discarded as untrimmed: 48,112 (0.2%)
Reads written (passing filters): 11,537,432 (52.8%)
Total basepairs processed: 3,297,145,702 bp
Quality-trimmed: 123,423,025 bp (3.7%)
Total written (filtered): 268,088,833 bp (8.1%)
# M86
Total reads processed: 17,514,990
Reads with adapters: 8,429,216 (48.1%)
== Read fate breakdown ==
Reads that were too short: 5,771,993 (33.0%)
Reads that were too long: 9,103,226 (52.0%)
Reads with too many N: 10,039 (0.1%)
Reads discarded as untrimmed: 38,292 (0.2%)
Reads written (passing filters): 2,591,440 (14.8%)
Total basepairs processed: 2,644,763,490 bp
Quality-trimmed: 46,364,967 bp (1.8%)
Total written (filtered): 52,301,229 bp (2.0%)
# M87
Total reads processed: 35,917,496
Reads with adapters: 30,802,070 (85.8%)
== Read fate breakdown ==
Reads that were too short: 27,183,568 (75.7%)
Reads that were too long: 5,601,464 (15.6%)
Reads with too many N: 11,049 (0.0%)
Reads discarded as untrimmed: 64,118 (0.2%)
Reads written (passing filters): 3,057,297 (8.5%)
Total basepairs processed: 5,423,541,896 bp
Quality-trimmed: 82,622,227 bp (1.5%)
Total written (filtered): 68,971,401 bp (1.3%)
# M88
Total reads processed: 28,792,588
Reads with adapters: 11,393,165 (39.6%)
== Read fate breakdown ==
Reads that were too short: 9,948,192 (34.6%)
Reads that were too long: 17,523,034 (60.9%)
Reads with too many N: 4,898 (0.0%)
Reads discarded as untrimmed: 30,002 (0.1%)
Reads written (passing filters): 1,286,462 (4.5%)
Total basepairs processed: 4,347,680,788 bp
Quality-trimmed: 58,097,687 bp (1.3%)
Total written (filtered): 27,541,768 bp (0.6%)
# M8
Total reads processed: 37,732,448
Reads with adapters: 21,548,435 (57.1%)
== Read fate breakdown ==
Reads that were too short: 19,322,853 (51.2%)
Reads that were too long: 16,300,593 (43.2%)
Reads with too many N: 7,780 (0.0%)
Reads discarded as untrimmed: 49,950 (0.1%)
Reads written (passing filters): 2,051,272 (5.4%)
Total basepairs processed: 5,697,599,648 bp
Quality-trimmed: 82,384,975 bp (1.4%)
Total written (filtered): 45,156,438 bp (0.8%)
# M9
Total reads processed: 25,911,154
Reads with adapters: 25,800,281 (99.6%)
== Read fate breakdown ==
Reads that were too short: 8,632,667 (33.3%)
Reads that were too long: 4,601,815 (17.8%)
Reads with too many N: 6,587 (0.0%)
Reads discarded as untrimmed: 45,543 (0.2%)
Reads written (passing filters): 12,624,542 (48.7%)
Total basepairs processed: 3,912,584,254 bp
Quality-trimmed: 153,459,563 bp (3.9%)
Total written (filtered): 305,187,572 bp (7.8%)
A lot of samples had a high percentage of reads that were too short for the 18 bp cutoff. There were also some samples that did not have many reads with adapters, which means I need to go back to the fastqc files to identify adapters/sequences that were overrepresented in these samples and add them to the cut adapt code. By doing this, I will capture more reads with adapters for trimming.
Next trimming iteration: decrease min length to 15 bp, look at fastqc for more adapter sequences
20241203
I am going to look at the outputs from the individual sample fastQC to see what adapters I am missing. I will focus primarily on the samples that did not have high adapter content in my trimmed stringent iteration. I am hoping there is no limit to number of adapters.
Most of the samples I did in the initial sequencing batch have no adaptere hits for overrepresented sequences. Using the individual fastQC files, I pulled all the adapters that had >1% of possible contamination and got over 400 unique adapter sequences lol. Instead, I am going to go through the individual fastQCs again while looking at the cutadapt output to see which samples did not have many reads with adapters.
Adapters to include:
TGGAATTCTCGGGTGCCAAGG
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
GATCGGAAGAGCACACGTCTGAACTCCAGTCA
GAAACGTTGGGTTGCGGTATTGGAAGAGCACACGTCTGAACTCCAGTCAC
AGTGTCCTATCAGCTACCATCGGAAGAGCACACGTCTGAACTCCAGTCAC
CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
TAGCGTCGAACGGGCGCAATCGGAAGAGCACACGTCTGAACTCCAGTCAC
ACGTCTGAACTCCAGTCACTTACAGGAATCTGGGGGGGGGGGGGGGGGGG
TATCGGTGAAACATCCTCATCGGAAGAGCACACGTCTGAACTCCAGTCAC
ACCGTTGTTCGAGCAGGAATCGGAAGAGCACACGTCTGAACTCCAGTCAC # M10
TCGGAAGAGCACACGTCTGAACTCCAGTCACTACCGAGGATCTGGGGGGG # M11
TCGGAAGAGCACACGTCTGAACTCCAGTCACTCGTAGTGATCTGGGGGGG # M14
TCGGAAGAGCACACGTCTGAACTCCAGTCACCGTTAGAAATCTGGGGGGG # M26
TCGGAAGAGCACACGTCTGAACTCCAGTCACCTACGACAATCTGGGGGGG # M28
CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC # M36
AAGAGCACACGTCTGAACTCCAGTCACTAAGTGGTATCTGGGGGGGGGGG # M39
ACGTCTGAACTCCAGTCACCGGACAACATCTGGGGGGGGGGGGGGGGGGG # M48
TCGGAAGAGCACACGTCTGAACTCCAGTCACATATGGATATCTGGGGGGG # M51
TCGGAAGAGCACACGTCTGAACTCCAGTCACGCGCAAGCATCTAGGGGGG # M61
GAAGAGCACACGTCTGAACTCCAGTCACCCGTGAAGATCTGGGGGGGGGG # M62
TCGGAAGAGCACACGTCTGAACTCCAGTCACAAGATACTATCTGGGGGGG # M63
TCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGGCCATCTGGGGGGG # M6
ACGTCTGAACTCCAGTCACGGAGCGTCATCTGGGGGGGGGGGGGGGGGGG # M74
TCGGAAGAGCACACGTCTGAACTCCAGTCACATGGCATGATCTGGGGGGG # M75
CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC # M7
CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC # M86
TCGGAAGAGCACACGTCTGAACTCCAGTCACGCAATGCAATCTGGGGGGG # M87
TCGGAAGAGCACACGTCTGAACTCCAGTCACGTTCCAATATCTAGGGGGG # M88
TCGGAAGAGCACACGTCTGAACTCCAGTCACGATTCTGCATCTGGGGGGG # M8
Make new folders for output
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data
mkdir trim_stringent_take2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc
mkdir trim_stringent_take2
My last cutadapt run (job 347786) took about a day to run. I am going to set the time higher this time since I am adding so many more adapters. In the scripts folder: nano cutadapt_trim_stringent_take2.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
module load cutadapt/4.2-GCCcore-11.3.0
#module load cutadapt/3.5-GCCcore-11.2.0
module load FastQC/0.11.8-Java-1.8
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/
#echo "Count number of reads in each file" $(date)
#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt
echo "Start read trimming with cutadapt, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# cutadapt loop
for i in ${array1[@]}; do
cutadapt -b TGGAATTCTCGGGTGCCAAGG -b AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -b GATCGGAAGAGCACACGTCTGAACTCCAGTCA -b GAAACGTTGGGTTGCGGTATTGGAAGAGCACACGTCTGAACTCCAGTCAC -b AGTGTCCTATCAGCTACCATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b TAGCGTCGAACGGGCGCAATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b ACGTCTGAACTCCAGTCACTTACAGGAATCTGGGGGGGGGGGGGGGGGGG -b TATCGGTGAAACATCCTCATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b ACCGTTGTTCGAGCAGGAATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b TCGGAAGAGCACACGTCTGAACTCCAGTCACTACCGAGGATCTGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACTCGTAGTGATCTGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACCGTTAGAAATCTGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACCTACGACAATCTGGGGGGG -b CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b AAGAGCACACGTCTGAACTCCAGTCACTAAGTGGTATCTGGGGGGGGGGG -b ACGTCTGAACTCCAGTCACCGGACAACATCTGGGGGGGGGGGGGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACATATGGATATCTGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACGCGCAAGCATCTAGGGGGG -b GAAGAGCACACGTCTGAACTCCAGTCACCCGTGAAGATCTGGGGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACAAGATACTATCTGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGGCCATCTGGGGGGG -b ACGTCTGAACTCCAGTCACGGAGCGTCATCTGGGGGGGGGGGGGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACATGGCATGATCTGGGGGGG -b CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b TCGGAAGAGCACACGTCTGAACTCCAGTCACGCAATGCAATCTGGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACGTTCCAATATCTAGGGGGG -b TCGGAAGAGCACACGTCTGAACTCCAGTCACGATTCTGCATCTGGGGGGG -m 15 -M 35 -q 30,30 --trim-n --max-n 0 --discard-untrimmed -e 0.1 -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent_take2/trim_stringent_take2_cutadapt_${i} $i
done
## added even more adapters that were overrepresented sequences from the individual fastqc htmls
echo "Read trimming of adapters complete" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent_take2/
fastqc * -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim_stringent_take2
done
echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)
module unload cutadapt/4.2-GCCcore-11.3.0
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim_stringent_take2
multiqc *
echo "MultiQC complete" $(date)
echo "Count number of reads in each trimmed file" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent_take2
zgrep -c "@LH00" *.gz > smRNA_trim_stringent_take2_read_count.txt
Submitted batch job 352143. In this iteration, I used cutadapt with -b: For this type of adapter, the sequence is specified with -b ADAPTER (or use the longer spelling –anywhere ADAPTER). The adapter may appear in the beginning (even degraded), within the read, or at the end of the read (even partially). The decision which part of the read to remove is made as follows: If there is at least one base before the found adapter, then the adapter is considered to be a 3’ adapter and the adapter itself and everything following it is removed. Otherwise, the adapter is considered to be a 5’ adapter and it is removed from the read, but the sequence after it remains.
discussion w. zoe
- do they all have adapters?
- blast overrepresented sequences
- reverse complement of sequence
- what would data look like if i dont throw out anything >30bp
- sequences of index primers
Bleh trimming is hard and confusing.
20241206
While trimming takes place, I am going to run the quantifier module to obtain counts for the miRNAs. From the mirdeep2 output, I identified putative novel and known miRNAs (see code). For the novel miRNAs, I filtered the list so that mirdeep2 score >10, no rfam info, at least 10 reads in mature read count, and significant randfold p-value; 52 putative novel miRNAs were identified. For the known miRNAs, I filtered the list so that mirdeep2 score >0, no rfam info, at least 10 reads in mature read count, and significant ranfold p-value; 7 putative known miRNAs were identified. 59 miRNAs total!
# novel
Montipora_capitata_HIv3___Scaffold_2_98529
Montipora_capitata_HIv3___Scaffold_8_488597
Montipora_capitata_HIv3___Scaffold_4_264216
Montipora_capitata_HIv3___Scaffold_9_587940
Montipora_capitata_HIv3___Scaffold_9_625013
Montipora_capitata_HIv3___Scaffold_2_71488
Montipora_capitata_HIv3___Scaffold_9_613239
Montipora_capitata_HIv3___Scaffold_11_759461
Montipora_capitata_HIv3___Scaffold_12_859123
Montipora_capitata_HIv3___Scaffold_10_631026
Montipora_capitata_HIv3___Scaffold_8_585867
Montipora_capitata_HIv3___Scaffold_2_124999
Montipora_capitata_HIv3___Scaffold_8_578139
Montipora_capitata_HIv3___Scaffold_11_826703
Montipora_capitata_HIv3___Scaffold_10_729473
Montipora_capitata_HIv3___Scaffold_10_667068
Montipora_capitata_HIv3___Scaffold_151_1075760
Montipora_capitata_HIv3___Scaffold_14_1017829
Montipora_capitata_HIv3___Scaffold_5_309429
Montipora_capitata_HIv3___Scaffold_8_508612
Montipora_capitata_HIv3___Scaffold_8_553085
Montipora_capitata_HIv3___Scaffold_7_456843
Montipora_capitata_HIv3___Scaffold_4_291272
Montipora_capitata_HIv3___Scaffold_6_400996
Montipora_capitata_HIv3___Scaffold_13_954675
Montipora_capitata_HIv3___Scaffold_8_542727
Montipora_capitata_HIv3___Scaffold_14_999529
Montipora_capitata_HIv3___Scaffold_14_994729
Montipora_capitata_HIv3___Scaffold_9_617967
Montipora_capitata_HIv3___Scaffold_5_322659
Montipora_capitata_HIv3___Scaffold_6_419669
Montipora_capitata_HIv3___Scaffold_10_645863
Montipora_capitata_HIv3___Scaffold_2_87665
Montipora_capitata_HIv3___Scaffold_1_35393
Montipora_capitata_HIv3___Scaffold_4_232369
Montipora_capitata_HIv3___Scaffold_5_375083
Montipora_capitata_HIv3___Scaffold_2_64557
Montipora_capitata_HIv3___Scaffold_3_149822
Montipora_capitata_HIv3___Scaffold_9_606053
Montipora_capitata_HIv3___Scaffold_11_801837
Montipora_capitata_HIv3___Scaffold_8_508613
Montipora_capitata_HIv3___Scaffold_8_535535
Montipora_capitata_HIv3___Scaffold_11_796076
Montipora_capitata_HIv3___Scaffold_11_787579
Montipora_capitata_HIv3___Scaffold_11_816162
Montipora_capitata_HIv3___Scaffold_13_949160
Montipora_capitata_HIv3___Scaffold_11_841599
Montipora_capitata_HIv3___Scaffold_5_310446
Montipora_capitata_HIv3___Scaffold_6_439094
Montipora_capitata_HIv3___Scaffold_5_365006
Montipora_capitata_HIv3___Scaffold_2_67225
Montipora_capitata_HIv3___Scaffold_5_307262
# known
Montipora_capitata_HIv3___Scaffold_14_1018693
Montipora_capitata_HIv3___Scaffold_2_96328
Montipora_capitata_HIv3___Scaffold_8_565039
Montipora_capitata_HIv3___Scaffold_8_586427
Montipora_capitata_HIv3___Scaffold_14_992284
Montipora_capitata_HIv3___Scaffold_12_910953
Montipora_capitata_HIv3___Scaffold_1_4183
I now want to do 2 things - quantify the miRNAs and use miranda to assess miRNA binding to 3’UTR. Let’s quantify first.
All output files from mirdeep2 were put in the scripts folder. Move them to the proper output folder.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
mv dir_prepare_signature1731506059/ ../output/mirdeep2_trim_stringent/
mv result_13_11_2024_t_08_48_44.* ../output/mirdeep2_trim_stringent/
mv mirna_results_13_11_2024_t_08_48_44/ ../output/mirdeep2_trim_stringent/
mv report.log ../output/mirdeep2_trim_stringent/
mv pdfs_13_11_2024_t_08_48_44/ ../output/mirdeep2_trim_stringent/
mv mirdeep_runs/ ../output/mirdeep2_trim_stringent/
In this folder (/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mirna_results_13_11_2024_t_08_48_44
), there are bed and fasta files for the known and novel mature, star and precursor sequences. I need to filter them by the miRNAs that I identified. Make a text file with the novel and known names of putative miRNAs.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mirna_results_13_11_2024_t_08_48_44
nano putative_miRNA_list.txt
Montipora_capitata_HIv3___Scaffold_2_98529
Montipora_capitata_HIv3___Scaffold_8_488597
Montipora_capitata_HIv3___Scaffold_4_264216
Montipora_capitata_HIv3___Scaffold_9_587940
Montipora_capitata_HIv3___Scaffold_9_625013
Montipora_capitata_HIv3___Scaffold_2_71488
Montipora_capitata_HIv3___Scaffold_9_613239
Montipora_capitata_HIv3___Scaffold_11_759461
Montipora_capitata_HIv3___Scaffold_12_859123
Montipora_capitata_HIv3___Scaffold_10_631026
Montipora_capitata_HIv3___Scaffold_8_585867
Montipora_capitata_HIv3___Scaffold_2_124999
Montipora_capitata_HIv3___Scaffold_8_578139
Montipora_capitata_HIv3___Scaffold_11_826703
Montipora_capitata_HIv3___Scaffold_10_729473
Montipora_capitata_HIv3___Scaffold_10_667068
Montipora_capitata_HIv3___Scaffold_151_1075760
Montipora_capitata_HIv3___Scaffold_14_1017829
Montipora_capitata_HIv3___Scaffold_5_309429
Montipora_capitata_HIv3___Scaffold_8_508612
Montipora_capitata_HIv3___Scaffold_8_553085
Montipora_capitata_HIv3___Scaffold_7_456843
Montipora_capitata_HIv3___Scaffold_4_291272
Montipora_capitata_HIv3___Scaffold_6_400996
Montipora_capitata_HIv3___Scaffold_13_954675
Montipora_capitata_HIv3___Scaffold_8_542727
Montipora_capitata_HIv3___Scaffold_14_999529
Montipora_capitata_HIv3___Scaffold_14_994729
Montipora_capitata_HIv3___Scaffold_9_617967
Montipora_capitata_HIv3___Scaffold_5_322659
Montipora_capitata_HIv3___Scaffold_6_419669
Montipora_capitata_HIv3___Scaffold_10_645863
Montipora_capitata_HIv3___Scaffold_2_87665
Montipora_capitata_HIv3___Scaffold_1_35393
Montipora_capitata_HIv3___Scaffold_4_232369
Montipora_capitata_HIv3___Scaffold_5_375083
Montipora_capitata_HIv3___Scaffold_2_64557
Montipora_capitata_HIv3___Scaffold_3_149822
Montipora_capitata_HIv3___Scaffold_9_606053
Montipora_capitata_HIv3___Scaffold_11_801837
Montipora_capitata_HIv3___Scaffold_8_508613
Montipora_capitata_HIv3___Scaffold_8_535535
Montipora_capitata_HIv3___Scaffold_11_796076
Montipora_capitata_HIv3___Scaffold_11_787579
Montipora_capitata_HIv3___Scaffold_11_816162
Montipora_capitata_HIv3___Scaffold_13_949160
Montipora_capitata_HIv3___Scaffold_11_841599
Montipora_capitata_HIv3___Scaffold_5_310446
Montipora_capitata_HIv3___Scaffold_6_439094
Montipora_capitata_HIv3___Scaffold_5_365006
Montipora_capitata_HIv3___Scaffold_2_67225
Montipora_capitata_HIv3___Scaffold_5_307262
Montipora_capitata_HIv3___Scaffold_14_1018693
Montipora_capitata_HIv3___Scaffold_2_96328
Montipora_capitata_HIv3___Scaffold_8_565039
Montipora_capitata_HIv3___Scaffold_8_586427
Montipora_capitata_HIv3___Scaffold_14_992284
Montipora_capitata_HIv3___Scaffold_12_910953
Montipora_capitata_HIv3___Scaffold_1_4183
Cat the known and novel miRNA fasta together and filter fasta so that I keep only the miRNAs in putative_miRNA_list.txt
.
cat novel_mature_13_11_2024_t_08_48_44_score-50_to_na.fa known_mature_13_11_2024_t_08_48_44_score-50_to_na.fa > putative_miRNAs.fa
grep -F -f putative_miRNA_list.txt putative_miRNAs.fa -A 1 --no-group-separator > putative_miRNAs_filt.fa
Do the same with the precursor sequences.
cat novel_pres_13_11_2024_t_08_48_44_score-50_to_na.fa known_pres_13_11_2024_t_08_48_44_score-50_to_na.fa > putative_precursors.fa
grep -F -f putative_miRNA_list.txt putative_precursors.fa -A 1 --no-group-separator > putative_precursors_filt.fa
Similar to the mapper module, I can use a config.txt
file to specify all the samples. I also need to use the mapped_reads.fa
as the collapsed reads.
In the scripts folder: nano quantifier_mirdeep2.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Quantifying smRNA counts" $(date)
conda activate /data/putnamlab/mirdeep2
quantifier.pl -r /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mapped_reads.fa -p /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mirna_results_13_11_2024_t_08_48_44/putative_precursors_filt.fa -m /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mirna_results_13_11_2024_t_08_48_44/putative_miRNAs_filt.fa -c /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/config.txt
echo "Quantifying complete!" $(date)
conda deactivate
Submitted batch job 352797. Success! But…
# reads processed: 21804779
# reads with at least one reported alignment: 5795 (0.03%)
# reads that failed to align: 21798984 (99.97%)
Reported 6985 alignments to 1 output stream(s)
Only ~5700 reads aligned…I am going to mess around with the cutoffs. Going to add the following:
-k
- also considers precursor-mature mappings that have different IDs-g 3
- number of allowed mismatches when mapping reads to precursors, default is 1-e 4
- number of nt upstream of mature sequence to consider, default is 2-f 7
- number of nt downstream of mature sequence to consider, default is 5-w
- considers whole precursor as mature sequence
Submitted batch job 352801
Identify 3’UTRs. First, identify counts of each feature from gff file
cd /data/putnamlab/jillashey/genome/Mcap/V3
module load BEDTools/2.30.0-GCC-11.3.0
grep -v '^#' Montipora_capitata_HIv3.genes.gff3 | cut -s -f 3 | sort | uniq -c | sort -rn > all_features_mcap.txt
256031 exon
256031 CDS
54384 transcript
Generate individual gff for transcripts
grep $'\ttranscript\t' Montipora_capitata_HIv3.genes.gff3 > Mcap_transcript.gtf
Extract scaffold lengths
cat is Montipora_capitata_HIv3.assembly.fasta | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > Mcap_Chromosome_lengths.txt
wc -l Mcap_Chromosome_lengths.txt
1699 Mcap_Chromosome_lengths.txt
Extract scaffold names
awk -F" " '{print $1}' Mcap_Chromosome_lengths.txt > Mcap_Chromosome_names.txt
Sort gtf
bedtools sort -faidx Mcap_Chromosome_names.txt -i Mcap_transcript.gtf > Mcap_transcript_sorted.gtf
Create 1kb 3’UTR in gff
bedtools flank -i Mcap_transcript_sorted.gtf -g Mcap_Chromosome_lengths.txt -l 0 -r 1000 -s | awk '{gsub("transcript","3prime_UTR",$3); print $0 }' | awk '{if($5-$4 > 3)print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}' | tr ' ' '\t' > Mcap_3UTR_1kb.gtf
Subtract portions of 3’UTRs that overlap with nearby genes
bedtools subtract -a Mcap_3UTR_1kb.gtf -b Mcap_transcript_sorted.gtf > Mcap_3UTR_1kb_corrected.gtf
Extract 3’UTR sequences from genome
awk '{print $1 "\t" $4-1 "\t" $5 "\t" $9 "\t" "." "\t" $7}' Mcap_3UTR_1kb_corrected.gtf | sed 's/"//g' > Mcap_3UTR_1kb_corrected.bed
bedtools getfasta -fi Montipora_capitata_HIv3.assembly.fasta -bed Mcap_3UTR_1kb_corrected.bed -fo Mcap_3UTR_1kb.fasta -name
grep -c ">" Mcap_3UTR_1kb.fasta
55148
Run miranda for the Mcap data. In the scripts folder: nano miranda_strict_all_1kb_mcap.sh
#!/bin/bash -i
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Mcap starting miranda run with all genes and miRNAs with energy cutoff <-20 and strict binding invoked"$(date)
module load Miniconda3/4.9.2
conda activate /data/putnamlab/conda/miranda
miranda /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mirna_results_13_11_2024_t_08_48_44/putative_miRNAs_filt.fa /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_3UTR_1kb.fasta -en -20 -strict -out /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/miranda_trim_stringent/miranda_strict_all_1kb_mcap.tab
conda deactivate
echo "miranda run finished!"$(date)
echo "counting number of interactions possible" $(date)
zgrep -c "Performing Scan" /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/miranda_trim_stringent/miranda_strict_all_1kb_mcap.tab
echo "Parsing output" $(date)
grep -A 1 "Scores for this hit:" /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/miranda_trim_stringent/miranda_strict_all_1kb_mcap.tab | sort | grep '>' > /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/miranda_trim_stringent/miranda_strict_all_1kb_parsed_mcap.txt
echo "counting number of putative interactions predicted" $(date)
wc -l /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/miranda_trim_stringent/miranda_strict_all_1kb_parsed_mcap.txt
echo "Mcap DT miranda script complete" $(date)
Submitted batch job 352829. And the quantifier script finished running!
# reads processed: 21804779
# reads with at least one reported alignment: 20407 (0.09%)
# reads that failed to align: 21784372 (99.91%)
Reported 24370 alignments to 1 output stream(s)
Still very low alignment rate…and when looking at the csv, only the samples that were sequenced in the first batch got read counts. The others have 0s for basically all miRNAs. Need to look back at trimming.
Miranda script ran!
counting number of interactions possible Fri Dec 6 16:07:31 EST 2024
3253732
Parsing output Fri Dec 6 16:07:43 EST 2024
counting number of putative interactions predicted Fri Dec 6 16:07:44 EST 2024
25457 /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/miranda_trim_stringent/miranda_strict_all_1kb_parsed_mcap.txt
Copy onto local computer. Rerunning the QC for raw data (Submitted batch job 352838).
20241230
Back at it! Holidays and travel have prevented me from working more on this. I reran the QC for the raw data and put it in my repo (link here). There is a lot of duplication. Even in the individual fastqc files, there is high duplication for each sample. My sequence duplication plotes look exactly like example 4 (see below) from this page:
On the page, the author writes about this example: “This was about the worst example I could find. This was an RNA-Seq library which had some nasty contamination and PCR duplication. you can see that in the raw data the majority of reads come from sequences which are present tens of thousands of times in the library, but that there is also strong representation from reads which are present hundreds or thousands of times, suggesting that the duplication is widespreads. In this case deduplicating may help the experiment, but it causes the loss of 93% of the original data so it’s a pretty drastic step to take.” Need to discuss with Hollie.
I am going to rerun fastp on R1. Make new folder for them.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data
mkdir trim_fastp
I already have a script (fastp_trim_R1.sh
) that will run this. I am not going to set a length limit for now. I am also changing --trim_poly_x
to --trim_poly_g
, which is what was used in the e5 deep dive code. Submitted batch job 354306
20241231
Talked to Hollie briefly about my concerns. My data looks relatively similar to the deep dive smRNA QC. We also know we have a small diversity library (ie there are not that many unique miRNAs) and we sequenced to 28 million reads, which is a lot of sequencing to throw at a low diversity library. The Zymo miRNA library prep kit recommended to only sequence 75bp single end, but we did 150bp paired end. Following Sam’s notebook, I am going to use flexbar to trim on R1 only. I am first only going to trim to 75 bp and then see how that looks.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data
mkdir flexbar_75bp
In the scripts folder: nano flexbar_75bp.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Flexbar trimming to 75bp" $(date)
module load Flexbar/3.5.0-foss-2018b
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
for i in ${array1[@]}; do
flexbar \
-r ${i} \
-a /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/illumina_adapters.fasta \
-qf i1.8 \
-qt 25 \
--post-trim-length 75 \
--target /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.${i} \
--zip-output GZ
done
echo "Flexbar trimming complete, run QC" $(date)
for file in /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/flexbar_75bp
done
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/flexbar_75bp
multiqc *
echo "QC complete" $(date)
Submitted batch job 354418
20250101
Happy new year! Flexbar ran in about 9 hours. The QC data is here. The length of samples are from 20 - 70bp, quite a range. The adapter content looks good though. Still a lot of overrepresented sequences and sequence duplication but we are going to move forward. Now I need to run the mapper portion of mirdeep2. According to the mirdeep2 documentation because I have multiple samples, I need to make a config file that contains the fq file locations and a unique 3 letter code. I will also be unzipping the files so remove the .gz from file name.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp
nano config.txt
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.10_small_RNA_S4_R1_001.fastq.gz.fastq s10
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.11_small_RNA_S5_R1_001.fastq.gz.fastq s11
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.13_S76_R1_001.fastq.gz.fastq s13
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.14_small_RNA_S6_R1_001.fastq.gz.fastq s14
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.23_S77_R1_001.fastq.gz.fastq s23
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.24_small_RNA_S7_R1_001.fastq.gz.fastq s24
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.26_small_RNA_S8_R1_001.fastq.gz.fastq s26
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.28_small_RNA_S9_R1_001.fastq.gz.fastq s28
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.35_S78_R1_001.fastq.gz.fastq s35
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.36_small_RNA_S10_R1_001.fastq.gz.fastq s36
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.37_small_RNA_S11_R1_001.fastq.gz.fastq s37
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.39_small_RNA_S12_R1_001.fastq.gz.fastq s39
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.47_small_RNA_S13_R1_001.fastq.gz.fastq s47
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.48_small_RNA_S14_R1_001.fastq.gz.fastq s48
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.51_small_RNA_S15_R1_001.fastq.gz.fastq s51
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.52_S79_R1_001.fastq.gz.fastq s52
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.60_S80_R1_001.fastq.gz.fastq s60
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.61_small_RNA_S16_R1_001.fastq.gz.fastq s61
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.62_small_RNA_S17_R1_001.fastq.gz.fastq s62
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.63_small_RNA_S18_R1_001.fastq.gz.fastq s63
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.6_small_RNA_S1_R1_001.fastq.gz.fastq s06
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.72_S81_R1_001.fastq.gz.fastq s72
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.73_small_RNA_S19_R1_001.fastq.gz.fastq s73
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.74_small_RNA_S20_R1_001.fastq.gz.fastq s74
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.75_small_RNA_S21_R1_001.fastq.gz.fastq s75
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.7_small_RNA_S2_R1_001.fastq.gz.fastq s07
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.85_S82_R1_001.fastq.gz.fastq s85
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.86_small_RNA_S22_R1_001.fastq.gz.fastq s86
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.87_small_RNA_S23_R1_001.fastq.gz.fastq s87
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.88_small_RNA_S24_R1_001.fastq.gz.fastq s88
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.8_small_RNA_S3_R1_001.fastq.gz.fastq s08
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp/trim.flexbar.75bp.9_S75_R1_001.fastq.gz.fastq s09
In the config.txt file, I have the path and file name, as well as the unique 3 letter code (s followed by sample number). In the scripts folder, I already have a mapper script (mapper_mirdeep2.sh
) that I am going to edit with the updated paths.
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load GCCcore/11.3.0 #I needed to add this to resolve conflicts between loaded GCCcore/9.3.0 and GCCcore/11.3.0
#module load Bowtie/1.3.1-GCC-11.3.0
#echo "Index Mcap genome" $(date)
# Index the reference genome for Mcap
#bowtie-build /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_ref.btindex
#module unload module load GCCcore/11.3.0
#module unload Bowtie/1.3.1-GCC-11.3.0
#echo "Map trimmed reads with mirdeep2 mapper script" $(date)
conda activate /data/putnamlab/mirdeep2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_75bp
gunzip *.fastq.gz
mapper.pl config.txt -e -d -h -j -l 18 -m -q -p /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_ref.btindex -s mapped_reads.fa -t mapped_reads_vs_genome.arf
echo "Mapping complete for trimmed stringent reads" $(date)
conda deactivate
Submitted batch job 354426. The flags mean the following:
-e
- use fastq files as input-d
- input is config file-h
- convert to fasta format-i
- convert RNA to DNA alphabet-j
- remove entries with non-canonical letters-l
- discard reads shorter than 18 nt-m
- collapse reads-p
- map to genome–genome has to be indexed by bowtie build-s
- mapped reads fasta output-t
- read mappings v genome output-q
- map with one mismatch in the seed
20250102
Here is the mapping results:
#desc total mapped unmapped %mapped %unmapped
total: 628271581 61907926 566363655 9.854 90.146
s06: 16391063 52942 16338121 0.323 99.677
s07: 26105703 4952449 21153254 18.971 81.029
s08: 20914298 102971 20811327 0.492 99.508
s09: 17402129 3044859 14357270 17.497 82.503
s10: 17319666 1252959 16066707 7.234 92.766
s11: 8702825 573453 8129372 6.589 93.411
s13: 18013957 4252214 13761743 23.605 76.395
s14: 22983933 205680 22778253 0.895 99.105
s23: 20418458 5323812 15094646 26.074 73.926
s24: 15238105 569210 14668895 3.735 96.265
s26: 29779739 1057567 28722172 3.551 96.449
s28: 20579195 104219 20474976 0.506 99.494
s35: 18792516 4519750 14272766 24.051 75.949
s36: 20592535 3164178 17428357 15.366 84.634
s37: 18149246 1286810 16862436 7.090 92.910
s39: 15148458 118789 15029669 0.784 99.216
s47: 57405680 13777537 43628143 24.000 76.000
s48: 23820154 361072 23459082 1.516 98.484
s51: 15857877 55736 15802141 0.351 99.649
s52: 22109116 5553213 16555903 25.117 74.883
s60: 14726932 3047585 11679347 20.694 79.306
s61: 21442064 142987 21299077 0.667 99.333
s62: 12666509 226583 12439926 1.789 98.211
s63: 14685987 125076 14560911 0.852 99.148
s72: 17626646 2760425 14866221 15.661 84.339
s73: 5016164 2083222 2932942 41.530 58.470
s74: 20074738 130657 19944081 0.651 99.349
s75: 23958496 411982 23546514 1.720 98.280
s85: 14366598 1732149 12634449 12.057 87.943
s86: 13903555 500450 13403105 3.599 96.401
s87: 21859884 216463 21643421 0.990 99.010
s88: 22219355 200927 22018428 0.904 99.096
The samples that were trimmed to be 30-20bp mapped better than the samples that were trimmed to 69-75bp. Should I trim now so that the rest of the samples are also ~30bp? Lets rerun the trimming but trim to 35bp max.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data
mkdir flexbar_35bp
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc
mkdir flexbar_35bp
In the scripts folder: nano flexbar_35bp.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Flexbar trimming to 35bp" $(date)
module load Flexbar/3.5.0-foss-2018b
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
for i in ${array1[@]}; do
flexbar \
-r ${i} \
-a /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/illumina_adapters.fasta \
-qf i1.8 \
-qt 25 \
--post-trim-length 35 \
--target /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.${i} \
--zip-output GZ
done
echo "Flexbar trimming complete, run QC" $(date)
for file in /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/flexbar_35bp
done
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/flexbar_35bp
multiqc *
echo "QC complete" $(date)
Submitted batch job 354480. While this runs, going to run mirdeep2 miRNA predictions on the reads trimmed to 75bp. First, make new output folders and move mapper mirdeep2 output from data folder.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output
mkdir mirdeep2_flexbar_75bp mirdeep2_flexbar_35bp
cd ../data/flexbar_75bp/
mv mapp* ../../output/mirdeep2_flexbar_75bp/
mv bowtie.log ../../output/mirdeep2_flexbar_75bp/
In the scripts folder: nano mirdeep2_flexbar_75bp.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2
echo "Starting mirdeep2 with reads trimmed with flexbar to max 75bp" $(date)
miRDeep2.pl /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_flexbar_75bp/mapped_reads.fa /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_flexbar_75bp/mapped_reads_vs_genome.arf /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa none none -P -v -g -1 2>report.log
echo "mirdeep2 complete" $(date)
conda deactivate
Submitted batch job 354482. This will take about 2 days.
20250103
Flexbar 35bp trimming finished running (QC data here). Higher amounts of duplication than the 75bp trimming iteration. Going to run mapping on these reads. Because I have multiple samples, I need to make a config file that contains the fq file locations and a unique 3 letter code. I will also be unzipping the files so remove the .gz from file name.
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp
nano config.txt
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.10_small_RNA_S4_R1_001.fastq.gz.fastq s10
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.11_small_RNA_S5_R1_001.fastq.gz.fastq s11
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.13_S76_R1_001.fastq.gz.fastq s13
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.14_small_RNA_S6_R1_001.fastq.gz.fastq s14
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.23_S77_R1_001.fastq.gz.fastq s23
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.24_small_RNA_S7_R1_001.fastq.gz.fastq s24
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.26_small_RNA_S8_R1_001.fastq.gz.fastq s26
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.28_small_RNA_S9_R1_001.fastq.gz.fastq s28
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.35_S78_R1_001.fastq.gz.fastq s35
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.36_small_RNA_S10_R1_001.fastq.gz.fastq s36
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.37_small_RNA_S11_R1_001.fastq.gz.fastq s37
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.39_small_RNA_S12_R1_001.fastq.gz.fastq s39
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.47_small_RNA_S13_R1_001.fastq.gz.fastq s47
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.48_small_RNA_S14_R1_001.fastq.gz.fastq s48
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.51_small_RNA_S15_R1_001.fastq.gz.fastq s51
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.52_S79_R1_001.fastq.gz.fastq s52
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.60_S80_R1_001.fastq.gz.fastq s60
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.61_small_RNA_S16_R1_001.fastq.gz.fastq s61
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.62_small_RNA_S17_R1_001.fastq.gz.fastq s62
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.63_small_RNA_S18_R1_001.fastq.gz.fastq s63
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.6_small_RNA_S1_R1_001.fastq.gz.fastq s06
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.72_S81_R1_001.fastq.gz.fastq s72
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.73_small_RNA_S19_R1_001.fastq.gz.fastq s73
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.74_small_RNA_S20_R1_001.fastq.gz.fastq s74
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.75_small_RNA_S21_R1_001.fastq.gz.fastq s75
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.7_small_RNA_S2_R1_001.fastq.gz.fastq s07
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.85_S82_R1_001.fastq.gz.fastq s85
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.86_small_RNA_S22_R1_001.fastq.gz.fastq s86
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.87_small_RNA_S23_R1_001.fastq.gz.fastq s87
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.88_small_RNA_S24_R1_001.fastq.gz.fastq s88
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.8_small_RNA_S3_R1_001.fastq.gz.fastq s08
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp/trim.flexbar.35bp.9_S75_R1_001.fastq.gz.fastq s09
In the config.txt file, I have the path and file name, as well as the unique 3 letter code (s followed by sample number). In the scripts folder: nano mapper_mirdeep2_flexbar_35bp.sh
#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load GCCcore/11.3.0 #I needed to add this to resolve conflicts between loaded GCCcore/9.3.0 and GCCcore/11.3.0
#module load Bowtie/1.3.1-GCC-11.3.0
#echo "Index Mcap genome" $(date)
# Index the reference genome for Mcap
#bowtie-build /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_ref.btindex
#module unload module load GCCcore/11.3.0
#module unload Bowtie/1.3.1-GCC-11.3.0
echo "Map trimmed reads (flexbar 35bp) with mirdeep2 mapper script" $(date)
conda activate /data/putnamlab/mirdeep2
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/flexbar_35bp
gunzip *.fastq.gz
mapper.pl config.txt -e -d -h -j -l 18 -m -q -p /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_ref.btindex -s mapped_reads.fa -t mapped_reads_vs_genome.arf
echo "Mapping complete for trimmed reads (flexbar 35bp)" $(date)
conda deactivate
Submitted batch job 354532
20250105
mirdeep2 miRNA prediction finished running early this morning.
https://dnatech.ucdavis.edu/faqs/should-i-remove-pcr-duplicates-from-my-rna-seq-data
Next steps for trimming ????
- Look at each raw QC file and trim each one using adapters and index primers
- blast overrepresented seqs – maybe they are something ?
- reverse complement of adapters and primers
- what would data look like if i dont throw out anything >30bp
Zoe cutadapt trimming: https://github.com/zdellaert/LaserCoral/blob/1b776313512822d368e957591890b2228c590bd5/code/RNA-seq-bioinf.md