Developmental 2023 Timeseries smRNA analysis

These data came from my 2023 developmental timeseries experiment with Montipora capitata in Hawaii, in which embryos and larvae were exposed to ambient and heat stress over 72 hours, from embryo to swimming larva. Samples were collected at 8 time points: 1, 4, 9, 14, 22, 28, 48 and 72 hpf. The github for this project is here.

I initially sent 8 libraries (n=1 from each timepoint) to the Oklahoma Medical Research Foundation NGS Core for sequencing and they looked good, so I sent out the rest of my samples (24 samples, n=3 per timepoint). Data is now back!

Files were downloaded to these locations on Andromeda: XXXX and XXXX. Time to analyze! I’m going to write my notebook in chronological order by date and then reorganize once the workflow is complete.

20241024

Sequences were received from sequencer today! Make a new directory in my own folder on Andromeda for this project:

cd /data/putnamlab/jillashey
mkdir DT_Mcap_2023
cd DT_Mcap_2023
mkdir smRNA
cd smRNA
mkdir data scripts output
cd data 
mkdir raw trim 
cd raw

Copy the files into the raw data folder

cp XXXXXX/* .
cp XXXXXX . 

nano fastqc_smallRNAseq_raw.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts            
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2

echo "Start fastqc" $(date)

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw

for file in *fastq.gz
do 
fastqc $file
done

echo "Fastqc complete, start multiqc" $(date)

multiqc *

echo "multiqc complete!" $(date)

Submitted batch job 340170. Raw QC for smRNAseq data here. Raw QC looking very funky. There are definitely batch effects between the initial and the latest sequencing runs, and a high level of duplication in the last run (not uncommon for small RNA libraries, but it may also be due to the number of cycles used during library prep). The adapter content plots look very different between the two batches as well. The Zymo library prep kit protocol recommended single-end sequencing at 75 bp max. Oops, I did paired-end sequencing at 150 bp. I believe this means I will only be moving forward with R1. The protocol also recommends using cutadapt to trim.

In the scripts folder: nano cutadapt_trim.sh

#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts            
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

# Load modules 
module load cutadapt/4.2-GCCcore-11.3.0
module load FastQC/0.11.8-Java-1.8

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/

#echo "Count number of reads in each file" $(date)

#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt

echo "Start read trimming with cutadapt, followed by QC" $(date)

# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))

# cutadapt loop
for i in ${array1[@]}; do
    cutadapt -a TGGAATTCTCGGGTGCCAAGG \
        -u 3 \
        -m 15 \
        -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim_${i} \
        $i
    
    fastqc /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim_${i} \
        -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc
done

echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)

module purge 
module load MultiQC/1.9-intel-2020a-Python-3.8.2

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc
multiqc * 

echo "MultiQC complete" $(date)

echo "Count number of reads in each trimmed file" $(date)

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim
zgrep -c "@LH00" *.gz > smRNA_trim_read_count.txt

Submitted batch job 345007. The -a denotes the adapter sequence that I want to trim; it’s the Illumina TruSeq small RNA adapter, which Zymo uses in its kit.
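As a sanity check before and after trimming, the adapter sequence itself can be counted directly in the reads. A sketch (exact matches only, so it undercounts reads with sequencing errors):

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw
# count reads whose sequence line contains the TruSeq small RNA 3' adapter
for f in *R1_001.fastq.gz; do
    n=$(zcat "$f" | awk 'NR % 4 == 2' | grep -c "TGGAATTCTCGGGTGCCAAGG")
    echo -e "${f}\t${n}"
done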

20241025

Cutadapt finished running overnight. Cutadapt trim QC is here. The initial samples that were sequenced look good, but the ones sequenced most recently don’t seem to have gotten any adapter trimmed off. Run fastp instead (based on Sam’s trimming for e5).

In the scripts folder: nano fastp_trim.sh

#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts            
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

# Load modules 
module load bio/fastp/0.23.2-GCC-11.2.0
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/

#echo "Count number of reads in each file" $(date)

#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt

echo "Start read trimming with fastp, followed by QC" $(date)

# Make an array of sequences to trim 
array1=($(ls *R1_001.fastq.gz))

# fastp and fastqc loop 
for i in ${array1[@]}; do
    fastp --in1 ${i} \
        --in2 $(echo ${i} | sed 's/_R1/_R2/') \
        --detect_adapter_for_pe \
        --qualified_quality_phred 30 \
        --trim_poly_g \
        --overlap_len_require 17 \
        --length_limit 31 \
        --merge \
        --merged_out /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i}
    # write FastQC reports where the multiqc step below expects them
    fastqc /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i} \
        -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim
done
        
echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim
multiqc *

echo "MultiQC complete" $(date)

echo "Count number of reads in each file" $(date)

zgrep -c "@LH00" *.gz > smRNA_trim_read_count.txt

Submitted batch job 345041. Seems to have worked, but the merge wasn’t amazing…it looks like a lot of reads were lost in merging because I required a fairly high overlap length. Looking at the e5 trimming iterations for smRNAseq data, it also seems like I might be good to just use R1, similar to what Sam did with the e5 deep dive data.

In the scripts folder: nano fastp_trim_R1.sh

#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts            
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

# Load modules 
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/

#echo "Count number of reads in each file" $(date)

#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt

echo "Start read trimming with fastp, followed by QC" $(date)

# Make an array of sequences to trim 
array1=($(ls *R1_001.fastq.gz))

# fastp and fastqc loop 
for i in ${array1[@]}; do
    fastp --in1 ${i} \
        --out1 /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i} \
        --qualified_quality_phred 30 \
        --trim_poly_g \
        --length_limit 31
    # no trailing backslash after the last fastp flag, otherwise the fastqc
    # call below gets swallowed as extra fastp arguments
    fastqc /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i} \
        -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim
done

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim
multiqc *

echo "MultiQC complete" $(date)

echo "Count number of reads in each file" $(date)

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim

zgrep -c "@LH00" *.gz > smRNA_trim_read_count.txt

Submitted batch job 345061. Started, then stopped it. Looking at the slurm error file, I am getting this information:

Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 19773964
total bases: 2985868564
Q20 bases: 2596847950(86.9713%)
Q30 bases: 2097923022(70.2617%)

Read1 after filtering:
total reads: 475
total bases: 13245
Q20 bases: 13005(98.188%)
Q30 bases: 12269(92.6312%)

Filtering result:
reads passed filter: 475
reads failed due to low quality: 901386
reads failed due to too many N: 76
reads failed due to too short: 103
reads failed due to too long: 18871924
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Reads are failing left and right because they are too long…I think I need to specify which adapters I want to trim, because fastp is trying to detect the adapters by analyzing the tails of the first ~1M reads. I’m going to supply it with adapter sequences instead. In my raw QC, it says some of the reads have the Illumina small RNA 3’ adapter and some have the Illumina universal adapter.
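Before committing to an adapter list, the dominant 3’ adapter can also be found empirically. With ~150 bp untrimmed reads and ~22 nt inserts, the adapter should begin somewhere around base 23, so tallying that window shows which sequence actually follows the inserts. A sketch (the sample file and the offset are assumptions to adjust):

# most common 21-mers starting at base 23 of each read; top hits should be the true 3' adapter
zcat 10_small_RNA_S4_R1_001.fastq.gz | awk 'NR % 4 == 2 {print substr($0, 23, 21)}' | sort | uniq -c | sort -rn | head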

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data

nano illumina_adapters.fasta
>Illumina TruSeq small RNA 3' adapter
TGGAATTCTCGGGTGCCAAGG
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

In the scripts folder, edit: nano fastp_trim_R1.sh

#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts            
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

# Load modules 
#module load fastp/0.19.7-foss-2018b
module load bio/fastp/0.23.2-GCC-11.2.0
module load FastQC/0.11.8-Java-1.8
#module load MultiQC/1.9-intel-2020a-Python-3.8.2

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/

#echo "Count number of reads in each file" $(date)

#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt

echo "Start read trimming with fastp, followed by QC" $(date)

# Make an array of sequences to trim 
array1=($(ls *R1_001.fastq.gz))

# fastp and fastqc loop 
for i in ${array1[@]}; do
    # --length_limit 31 removed for this run; note that a commented-out flag in
    # the middle of a line continuation breaks the command, so it can't stay inline
    fastp --in1 ${i} \
        --out1 /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i} \
        --qualified_quality_phred 30 \
        --trim_poly_x \
        --adapter_fasta /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/illumina_adapters.fasta
    fastqc /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim.fastp.31bp.${i}
done

Submitted batch job 345073. Ran in 1.5 hours, and then I ran multiQC on the output. Also looked at the individual QC htmls. Most samples have good phred scores and a peak where the miRNAs should be in terms of length. But there is a lot of weirdness…a lot of overrepresented sequences that might be adapters, and still a decent amount of adapter content. I’m not sure what I should be trimming and I am stressed haha. I need to look into this further and think about what I’m really trimming. Also evaluate tools to use…maybe email some people? Also, the file names say 31bp but the reads are not length-limited to 31 bp this time.

20241027

Even though the QC looks horrible, I want to run short stack on this data to see what it looks like. In the scripts folder: nano shortstack.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Running short stack on mature trimmed miRNAs (R1) from Mcap DT project"

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim

# Load modules 
module load ShortStack/4.0.2-foss-2022a  
module load Kent_tools/442-GCC-11.3.0

# Run short stack
ShortStack \
--genomefile /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
--readfile trim.fastp.31bp.10_small_RNA_S4_R1_001.fastq.gz \
trim.fastp.31bp.11_small_RNA_S5_R1_001.fastq.gz \
trim.fastp.31bp.13_S76_R1_001.fastq.gz \
trim.fastp.31bp.14_small_RNA_S6_R1_001.fastq.gz \
trim.fastp.31bp.23_S77_R1_001.fastq.gz \
trim.fastp.31bp.24_small_RNA_S7_R1_001.fastq.gz \
trim.fastp.31bp.26_small_RNA_S8_R1_001.fastq.gz \
trim.fastp.31bp.28_small_RNA_S9_R1_001.fastq.gz \
trim.fastp.31bp.35_S78_R1_001.fastq.gz \
trim.fastp.31bp.36_small_RNA_S10_R1_001.fastq.gz \
trim.fastp.31bp.37_small_RNA_S11_R1_001.fastq.gz \
trim.fastp.31bp.39_small_RNA_S12_R1_001.fastq.gz \
trim.fastp.31bp.47_small_RNA_S13_R1_001.fastq.gz \
trim.fastp.31bp.48_small_RNA_S14_R1_001.fastq.gz \
trim.fastp.31bp.51_small_RNA_S15_R1_001.fastq.gz \
trim.fastp.31bp.52_S79_R1_001.fastq.gz \
trim.fastp.31bp.60_S80_R1_001.fastq.gz \
trim.fastp.31bp.61_small_RNA_S16_R1_001.fastq.gz \
trim.fastp.31bp.62_small_RNA_S17_R1_001.fastq.gz \
trim.fastp.31bp.63_small_RNA_S18_R1_001.fastq.gz \
trim.fastp.31bp.6_small_RNA_S1_R1_001.fastq.gz \
trim.fastp.31bp.72_S81_R1_001.fastq.gz \
trim.fastp.31bp.73_small_RNA_S19_R1_001.fastq.gz \
trim.fastp.31bp.74_small_RNA_S20_R1_001.fastq.gz \
trim.fastp.31bp.75_small_RNA_S21_R1_001.fastq.gz \
trim.fastp.31bp.7_small_RNA_S2_R1_001.fastq.gz \
trim.fastp.31bp.85_S82_R1_001.fastq.gz \
trim.fastp.31bp.86_small_RNA_S22_R1_001.fastq.gz \
trim.fastp.31bp.87_small_RNA_S23_R1_001.fastq.gz \
trim.fastp.31bp.88_small_RNA_S24_R1_001.fastq.gz \
trim.fastp.31bp.8_small_RNA_S3_R1_001.fastq.gz \
trim.fastp.31bp.9_S75_R1_001.fastq.gz \
--known_miRNAs /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa \
--outdir /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/shortstack \
--threads 10 \
--dn_mirna

echo "Short stack complete!"

Submitted batch job 345205.

20241028

Shortstack finished running overnight but doesn’t seem to have worked well. Getting this at the end of the error file:

Traceback (most recent call last):
  File "/opt/software/ShortStack/4.0.2-foss-2022a/ShortStack", line 3592, in <module>
    mir_qdata = mirna(args, merged_bam, fai, pmir_bedfile, read_count)
  File "/opt/software/ShortStack/4.0.2-foss-2022a/ShortStack", line 2259, in mirna
    dn_q_bedlines, dn_locus_bedlines = zip(*denovo_mloci1)
ValueError: not enough values to unpack (expected 2, got 0)

Also getting this in the error file for the samples:

# reads processed: 18875806
# reads with at least one alignment: 999 (0.01%)
# reads that failed to align: 18874807 (99.99%)
Reported 2699 alignments
[bam_sort_core] merging from 0 files and 10 in-memory blocks...
# reads processed: 11804203
# reads with at least one alignment: 1216 (0.01%)
# reads that failed to align: 11802987 (99.99%)
Reported 2981 alignments
[bam_sort_core] merging from 0 files and 10 in-memory blocks...
# reads processed: 24558419
# reads with at least one alignment: 50 (0.00%)
# reads that failed to align: 24558369 (100.00%)
Reported 803 alignments

The vast majority of reads are not aligning, which makes sense because I gave it very long reads (~150 bp, since the adapters never came off). This is probably why it failed. No results files or miRNA fasta files were generated either. Need to revisit trimming.
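Before retrimming, the read-length distribution of a trimmed file confirms the problem directly. A sketch (any trimmed file works):

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim
# count of reads at each length, longest lengths last; untrimmed adapters pile up at ~150 bp
zcat trim.fastp.31bp.10_small_RNA_S4_R1_001.fastq.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c | tail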

20241105

Revisiting trimming with cutadapt. In the scripts folder, edit cutadapt_trim.sh

#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts            
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

# Load modules 
module load cutadapt/4.2-GCCcore-11.3.0
#module load cutadapt/3.5-GCCcore-11.2.0
module load FastQC/0.11.8-Java-1.8

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/

#echo "Count number of reads in each file" $(date)

#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt

echo "Start read trimming with cutadapt, followed by QC" $(date)

# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))

# cutadapt loop
for i in ${array1[@]}; do
    cutadapt -a TGGAATTCTCGGGTGCCAAGG -u 3 -m 15 -q 20,20 --trim-n --max-n 0 -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim/trim_cutadapt_${i} $i
done

echo "Read trimming of adapters complete" $(date)

Submitted batch job 347763. Took almost 4 hours. I didn’t put the QC code in here because I wanted to run it in interactive mode.

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim
interactive
module load FastQC/0.11.8-Java-1.8
fastqc *

Started running in interactive mode at 2:15pm

mv *fastqc* ../../output/fastqc/trim
cd ../../output/fastqc/trim
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc *

Looks very similar to cutadapt attempt 1. The samples that were sequenced in the first batch were able to be successfully trimmed to 25 bp. Going to rerun cutadapt with some more stringent cutoffs and more adapter seqs that were identified in the individual fastqc outputs. In the scripts folder: nano cutadapt_trim_stringent.sh

#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts            
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

# Load modules 
module load cutadapt/4.2-GCCcore-11.3.0
#module load cutadapt/3.5-GCCcore-11.2.0
module load FastQC/0.11.8-Java-1.8

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw/

#echo "Count number of reads in each file" $(date)

#zgrep -c "@LH00" *.gz > smRNA_raw_read_count.txt

echo "Start read trimming with cutadapt, followed by QC" $(date)

# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))

# cutadapt loop
for i in ${array1[@]}; do
cutadapt -a TGGAATTCTCGGGTGCCAAGG \
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    -a GATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    -a GAAACGTTGGGTTGCGGTATTGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -a AGTGTCCTATCAGCTACCATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -a CTCGGCATGGAAGCCGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -a TAGCGTCGAACGGGCGCAATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -a ACGTCTGAACTCCAGTCACTTACAGGAATCTGGGGGGGGGGGGGGGGGGG \
    -a TATCGGTGAAACATCCTCATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -m 18 -M 30 -q 30,30 --trim-n --max-n 0 --discard-untrimmed -e 0.1 \
    -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_${i} $i
done

## added adapters that were overrepresented sequences from the individual fastqc htmls 

echo "Read trimming of adapters complete" $(date)

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/
fastqc * -o /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim_stringent

echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)

module unload cutadapt/4.2-GCCcore-11.3.0
module load MultiQC/1.9-intel-2020a-Python-3.8.2

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim_stringent
multiqc * 

echo "MultiQC complete" $(date)

echo "Count number of reads in each trimmed file" $(date)

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
zgrep -c "@LH00" *.gz > smRNA_trim_stringent_read_count.txt

Submitted batch job 347786. Based on initial looks, a lot of the reads are getting tossed because they are too short.
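The per-sample losses can be skimmed straight out of the slurm output without waiting for MultiQC; a sketch, assuming cutadapt’s standard report wording:

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
# pull the "too short" line from each sample's cutadapt report
grep "too short" slurm-347786.out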

20241108

Took a while but it ran. The lengths look much better, as does the GC content. Some samples have a very high duplication rate though…likely a result of PCR artifacts. I may run shortstack to see what the output looks like. I may also need to do more adapter trimming for certain samples.

next steps:

  • run short stack on stringent trimmed reads
  • run picard (picard/2.25.1-Java-11)
    • EstimateLibraryComplexity - estimates the number of unique molecules in a sequencing library
    • MarkDuplicates - identifies duplicate reads

Run short stack.

In the scripts folder: nano shortstack_trim_stringent.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Running short stack on mature trimmed miRNAs from Mcap DT project"

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent

# Load modules 
module load ShortStack/4.0.2-foss-2022a  
module load Kent_tools/442-GCC-11.3.0

# Run short stack
ShortStack \
--genomefile /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
--readfile trim_stringent_cutadapt_10_small_RNA_S4_R1_001.fastq.gz \
trim_stringent_cutadapt_11_small_RNA_S5_R1_001.fastq.gz \
trim_stringent_cutadapt_13_S76_R1_001.fastq.gz \
trim_stringent_cutadapt_63_small_RNA_S18_R1_001.fastq.gz \
--known_miRNAs /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa \
--outdir /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/shortstack_trim_stringent \
--threads 10 \
--dn_mirna

echo "Short stack complete!"

Submitted batch job 348221 - only doing a subset for now. Ran but ended with this error: /var/spool/slurmd/job348221/slurm_script: line 56: --known_miRNAs: command not found, even though that is a valid ShortStack option…Maybe copy mature_mirbase_cnidarian_T.fa to this folder?

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data
cp /data/putnamlab/jillashey/Astrangia2021/smRNA/refs/mature_mirbase_cnidarian_T.fa .

Edit the shortstack_trim_stringent.sh script so that the known miRNAs path is the new path (/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa). Submitted batch job 348256. Got the same error. Trying on a subset and making sure that the backslashes are correct. Submitted batch job 348269
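For what it’s worth, that error is what bash prints when a line continuation breaks: a stray space or tab after a backslash ends the command there, and the next line (--known_miRNAs ...) gets run as its own command. A quick check for whitespace after continuation backslashes (sketch; assumes GNU grep’s -P):

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
# flag any line ending in a backslash followed by whitespace
grep -nP '\\[ \t]+$' shortstack_trim_stringent.sh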

Let’s try running picard in interactive mode

interactive 
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/shortstack_trim_stringent/ShortStack_1731009886
module load picard/2.25.1-Java-11
To execute picard run: java -jar $EBROOTPICARD/picard.jar

I just want to test picard on one of the bam files (trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam).

# Run EstimateLibraryComplexity
java -jar $EBROOTPICARD/picard.jar EstimateLibraryComplexity \
    I=trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam \
    O=trim_stringent_cutadapt_51_small_RNA_complexity_metrics.txt
    
INFO	2024-11-08 09:24:40	EstimateLibraryComplexity	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    EstimateLibraryComplexity -I trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam -O trim_stringent_cutadapt_51_small_RNA_complexity_metrics.txt
**********


09:24:42.296 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/software/picard/2.25.1-Java-11/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Nov 08 09:24:42 EST 2024] EstimateLibraryComplexity INPUT=[trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam] OUTPUT=trim_stringent_cutadapt_51_small_RNA_complexity_metrics.txt    MIN_IDENTICAL_BASES=5 MAX_DIFF_RATE=0.03 MIN_MEAN_QUALITY=20 MAX_GROUP_RATIO=500 MAX_READ_LENGTH=0 MIN_GROUP_COUNT=2 READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=33053659 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Nov 08 09:24:42 EST 2024] Executing as jillashey@n063.cluster.com on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.2+9; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.25.1
INFO	2024-11-08 09:24:42	EstimateLibraryComplexity	Will store 33053659 read pairs in memory before sorting.
INFO	2024-11-08 09:24:45	EstimateLibraryComplexity	Finished reading - read 0 records - moving on to scanning for duplicates.
[Fri Nov 08 09:24:45 EST 2024] picard.sam.markduplicates.EstimateLibraryComplexity done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2035417088

Didn’t really seem to run successfully…I got an output file but no information in it. It read 0 records, likely because EstimateLibraryComplexity only considers paired-end reads and these BAMs are single-end R1 alignments.
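A quick way to check that the BAM really has no read pairs (sketch; the SAMtools module version is a guess for this cluster):

module load SAMtools/1.16.1-GCC-11.3.0   # version is an assumption; check module avail
samtools flagstat trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam

Let’s try running MarkDuplicates.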

java -jar $EBROOTPICARD/picard.jar MarkDuplicates -I trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam -O marked_dups_trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam -M marked_dup_51_small_RNA_metrics.txt

09:32:36.015 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/software/picard/2.25.1-Java-11/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Nov 08 09:32:36 EST 2024] MarkDuplicates INPUT=[trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam] OUTPUT=marked_dups_trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam METRICS_FILE=marked_dup_51_small_RNA_metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Nov 08 09:32:36 EST 2024] Executing as jillashey@n063.cluster.com on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.2+9; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.25.1
INFO	2024-11-08 09:32:36	MarkDuplicates	Start of doWork freeMemory: 2018640208; totalMemory: 2035417088; maxMemory: 31136546816
INFO	2024-11-08 09:32:36	MarkDuplicates	Reading input file and constructing read end information.
INFO	2024-11-08 09:32:36	MarkDuplicates	Will retain up to 112813575 data points before spilling to disk.
INFO	2024-11-08 09:32:37	MarkDuplicates	Read 102803 records. 0 pairs never matched.
INFO	2024-11-08 09:32:38	MarkDuplicates	After buildSortedReadEndLists freeMemory: 1285863816; totalMemory: 2214113280; maxMemory: 31136546816
INFO	2024-11-08 09:32:38	MarkDuplicates	Will retain up to 973017088 duplicate indices before spilling to disk.
Killed

Failed bleh, most likely killed for exceeding memory on the interactive node. Also, no miRNA loci were found by short stack…may need to adjust trimming parameters.
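If I retry MarkDuplicates interactively, capping the JVM heap might keep it under the node’s limit (a sketch; the 4g value is an assumption, and a batch job with more memory is the safer option):

# same MarkDuplicates call, with an explicit heap cap so the node's OOM killer isn't triggered
java -Xmx4g -jar $EBROOTPICARD/picard.jar MarkDuplicates \
    -I trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam \
    -O marked_dups_trim_stringent_cutadapt_51_small_RNA_S15_R1_001.bam \
    -M marked_dup_51_small_RNA_metrics.txt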

20241112

Need to retrim but first going to run short stack with the samples from the initial sequencing run. These have lower duplication rates and I do not think that they have as much PCR duplication. In the scripts folder: nano shortstack_trim_stringent_sub.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Running short stack on mature trimmed miRNAs from Mcap DT project"

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent

# Load modules 
module load ShortStack/4.0.2-foss-2022a  
module load Kent_tools/442-GCC-11.3.0

# Run short stack
ShortStack \
--genomefile /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
--readfile trim_stringent_cutadapt_13_S76_R1_001.fastq.gz \
trim_stringent_cutadapt_23_S77_R1_001.fastq.gz \
trim_stringent_cutadapt_35_S78_R1_001.fastq.gz \
trim_stringent_cutadapt_52_S79_R1_001.fastq.gz \
trim_stringent_cutadapt_60_S80_R1_001.fastq.gz \
trim_stringent_cutadapt_72_S81_R1_001.fastq.gz \
trim_stringent_cutadapt_85_S82_R1_001.fastq.gz \
trim_stringent_cutadapt_9_S75_R1_001.fastq.gz \
--known_miRNAs /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa \
--threads 10 \
--dn_mirna

echo "Short stack complete!"

Submitted batch job 348560. Ran in a few hours. Output said it found 3 miRNA loci but not sure if this includes known miRNAs…

awk '$3 == "mature_miRNA"' Results.gff3 > filtered_mature_miRNA.gff
head filtered_mature_miRNA.gff
Montipora_capitata_HIv3___Scaffold_2    ShortStack      mature_miRNA    46381100        46381121        476     +       .       ID=Cluster_2999.mature;Parent=Cluster_2999
Montipora_capitata_HIv3___Scaffold_8    ShortStack      mature_miRNA    22810894        22810915        204     +       .       ID=Cluster_12334.mature;Parent=Cluster_12334
Montipora_capitata_HIv3___Scaffold_14   ShortStack      mature_miRNA    2986538 2986560 1       +       .       ID=Cluster_27569.mature;Parent=Cluster_27569

awk -F'\t' '$20 == "Y"' Results.txt > Results_miRNA.txt
head Results_miRNA.txt
Montipora_capitata_HIv3___Scaffold_2:46381078-46381169  Cluster_2999    Montipora_capitata_HIv3___Scaffold_2    46381078        46381169        92      1012    59      1.0     +       ACAAUGUUUCGGCUUGUUCCCG  476     108     8       169     705     17      5       22      Y       spi-mir-temp-14_Stylophora_pistillata_Liew_et_al._2014_NA
Montipora_capitata_HIv3___Scaffold_8:22810843-22810935  Cluster_12334   Montipora_capitata_HIv3___Scaffold_8    22810843        22810935        93      4483    95      1.0     +       AUGCAGCGGAAAUCGAACCUGGG 3053    154     15      43      256     3224    791     23      Y       NA
Montipora_capitata_HIv3___Scaffold_14:2986487-2986580   Cluster_27569   Montipora_capitata_HIv3___Scaffold_14   2986487 2986580 94      49      7       1.0     +       ACGGUGAAAGUCGUCUCAAUAUUCA       35      2       35      0       1       1       10      N       Y       spi-mir-temp-30_Stylophora_pistillata_Liew_et_al._2014_Close_match_of_nve-miR-2036.;eca-nve-F-miR-2036_Edwardsiella_carnea_Praher_et_al._2021_Transcriptome-level;Adi-Mir-2036_3p_Acropora_digitifera_Gajigan_&_Conaco_2017_nve-miR-2036-3p;_nve-miR-2036-3p;_spi-miR-temp-30;ami-nve-F-miR-2036-3p_Acropora_millepora_Praher_et_al._2021_NA;spi-nve-F-miR-2036_Stylophora_pistillata_Praher_et_al._2021_NA

Clearly it didn’t pick up all the miRNAs. Maybe try running mirdeep2? I did a lot of troubleshooting with AST and mirdeep2 (see post here). First, I need to map the samples to the genome with the mapper.pl script. The genome must first be indexed by bowtie-build (NOT bowtie2). Because I have multiple samples, I need to make a config file that contains the fq file locations and a unique 3-letter code for each (see the mirdeep2 documentation). I will also be unzipping the files, so the .gz is removed from the file names.

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent

nano config.txt

/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_10_small_RNA_S4_R1_001.fastq s10
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_11_small_RNA_S5_R1_001.fastq s11
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_13_S76_R1_001.fastq s13
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_14_small_RNA_S6_R1_001.fastq s14
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_23_S77_R1_001.fastq s23
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_24_small_RNA_S7_R1_001.fastq s24
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_26_small_RNA_S8_R1_001.fastq s26
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_28_small_RNA_S9_R1_001.fastq s28
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_35_S78_R1_001.fastq s35
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_36_small_RNA_S10_R1_001.fastq s36
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_37_small_RNA_S11_R1_001.fastq s37
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_39_small_RNA_S12_R1_001.fastq s39
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_47_small_RNA_S13_R1_001.fastq s47
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_48_small_RNA_S14_R1_001.fastq s48
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_51_small_RNA_S15_R1_001.fastq s51
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_52_S79_R1_001.fastq s52
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_60_S80_R1_001.fastq s60
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_61_small_RNA_S16_R1_001.fastq s61
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_62_small_RNA_S17_R1_001.fastq s62
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_63_small_RNA_S18_R1_001.fastq s63
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_6_small_RNA_S1_R1_001.fastq s06
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_72_S81_R1_001.fastq s72
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_73_small_RNA_S19_R1_001.fastq s73
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_74_small_RNA_S20_R1_001.fastq s74
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_75_small_RNA_S21_R1_001.fastq s75
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_7_small_RNA_S2_R1_001.fastq s07
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_85_S82_R1_001.fastq s85
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_86_small_RNA_S22_R1_001.fastq s86
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_87_small_RNA_S23_R1_001.fastq s87
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_88_small_RNA_S24_R1_001.fastq s88
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_8_small_RNA_S3_R1_001.fastq s08
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent/trim_stringent_cutadapt_9_S75_R1_001.fastq s09

In the config.txt file, I have the path and file name, as well as the unique 3 letter code (s followed by sample number).
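For the record, the config file could also be generated instead of hand-typed; a sketch, assuming the trimmed fastqs are already gunzipped and the code is simply "s" plus the zero-padded sample number:

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
# pull the sample number out of each file name and emit "path sNN"
for f in trim_stringent_cutadapt_*_R1_001.fastq; do
    num=$(echo "$f" | sed -E 's/trim_stringent_cutadapt_([0-9]+)_.*/\1/')
    printf "%s/%s s%02d\n" "$PWD" "$f" "$num" >> config.txt
done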

In the scripts folder: nano mapper_mirdeep2.sh

#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load GCCcore/11.3.0 #I needed to add this to resolve conflicts between loaded GCCcore/9.3.0 and GCCcore/11.3.0
module load Bowtie/1.3.1-GCC-11.3.0

echo "Index Mcap genome" $(date)

# Index the reference genome for Mcap 
bowtie-build /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_ref.btindex

echo "Referece genome indexed!" $(date)
echo "Unload unneeded packages and run mapper script for trimmed stringent reads" $(date)

module unload GCCcore/11.3.0 
module unload Bowtie/1.3.1-GCC-11.3.0

conda activate /data/putnamlab/mirdeep2

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent

gunzip *.fastq.gz

mapper.pl config.txt -e -d -h -j -l 18 -m -p /data/putnamlab/jillashey/genome/Mcap/V3/Mcap_ref.btindex -s mapped_reads.fa -t mapped_reads_vs_genome.arf

echo "Mapping complete for trimmed stringent reads" $(date)

conda deactivate 

Submitted batch job 348612. The flags mean the following:

  • -e - use fastq files as input
  • -d - input is a config file
  • -h - convert to fasta format
  • -j - remove entries with non-canonical letters
  • -l 18 - discard reads shorter than 18 nt
  • -m - collapse reads
  • -p - map to the genome (which has to be indexed by bowtie-build)
  • -s - fasta output of mapped reads
  • -t - output of read mappings vs the genome

Also consider adding -q, which allows mapping with one mismatch in the seed, though mapping takes longer. Ran in about an hour. Look at the output:

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent

zgrep -c ">" mapped_reads.fa
21804779

wc -l mapped_reads_vs_genome.arf
8098667 mapped_reads_vs_genome.arf

less bowtie.log 
# reads processed: 2215013
# reads with at least one reported alignment: 416126 (18.79%)
# reads that failed to align: 1500959 (67.76%)
# reads with alignments suppressed due to -m: 297928 (13.45%)
Reported 841100 alignments to 1 output stream(s)

I can now run mirdeep2 but this is a problem for tomorrow!

20241113

Short stack and mirdeep2 output has been accumulating in this folder: /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent. Going to move it to the output folder.

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output
mkdir mirdeep2_trim_stringent

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/trim_stringent
mv ShortStack_1731* ../../output/shortstack_trim_stringent/
mv mapp* ../../output/mirdeep2_trim_stringent/
mv bowtie.log ../../output/mirdeep2_trim_stringent/
mv config.txt ../../output/mirdeep2_trim_stringent/

The output files from the mapper.pl portion of mirdeep2 will be used as input for the mirdeep2.pl portion of the pipeline. In the scripts folder: nano mirna_predict_mirdeep2.sh

#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

#module load Miniconda3/4.9.2
conda activate /data/putnamlab/mirdeep2

echo "Starting mirdeep2 with reads that were trimmed stringently via cutadapt" $(date)

miRDeep2.pl /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mapped_reads.fa \
/data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/mirdeep2_trim_stringent/mapped_reads_vs_genome.arf \
/data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/mature_mirbase_cnidarian_T.fa \
none none -P -v -g -1 2>report.log

echo "mirdeep2 complete" $(date)

conda deactivate

Submitted batch job 348663. In the past, mirdeep2 has taken a few days to process my AST samples so I am guessing it will take a similar amount of time. The flags mean the following:

  • mapped_reads.fa - fasta file with sequencing reads
  • Montipora_capitata_HIv3.assembly.fasta - genome fasta
  • mapped_reads_vs_genome.arf - mapped reads to the genome in miRDeep2 arf format
  • mature_mirbase_cnidarian_T.fa - known miRNAs
  • none none - no fasta files supplied for other-species mature miRNAs or for precursor sequences. I accidentally added two none flags in there, but miRDeep2.pl actually expects both of these positional arguments, so this should be fine.
  • -P - report novel miRNAs
  • -v - output verbose mode
  • -g -1 - analyze all potential precursors, with no cap on their number
  • 2>report.log - redirects errors to log file
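One note on output locations: since the script never cd’s anywhere and #SBATCH -D points at the scripts folder, report.log and the miRDeep2 result files should land there. Progress can be followed while it runs with:

tail -f /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts/report.log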

20241118

Took about 2 days to run, and most of the output was put in the scripts folder (as expected from the #SBATCH -D setting). Downloaded the output to my computer.

20241121

I need to revisit trimming. I downloaded slurm-347786.out (most recent trimming iteration output) to my computer to look at sample adapter information.

# M10
Total reads processed:              19,773,964
Reads with adapters:                12,283,878 (62.1%)

== Read fate breakdown ==
Reads that were too short:           3,481,028 (17.6%)
Reads that were too long:            7,532,588 (38.1%)
Reads with too many N:                  33,125 (0.2%)
Reads discarded as untrimmed:           25,736 (0.1%)
Reads written (passing filters):     8,701,487 (44.0%)

Total basepairs processed: 2,985,868,564 bp
Quality-trimmed:              49,791,977 bp (1.7%)
Total written (filtered):    175,293,612 bp (5.9%)

# M11
Total reads processed:              12,310,413
Reads with adapters:                10,586,911 (86.0%)

== Read fate breakdown ==
Reads that were too short:           6,691,286 (54.4%)
Reads that were too long:            1,817,067 (14.8%)
Reads with too many N:                  14,264 (0.1%)
Reads discarded as untrimmed:           13,200 (0.1%)
Reads written (passing filters):     3,774,596 (30.7%)

Total basepairs processed: 1,858,872,363 bp
Quality-trimmed:              32,849,546 bp (1.8%)
Total written (filtered):     79,147,991 bp (4.3%)

# M13
Total reads processed:              25,732,779
Reads with adapters:                25,613,472 (99.5%)

== Read fate breakdown ==
Reads that were too short:           7,832,390 (30.4%)
Reads that were too long:            4,570,545 (17.8%)
Reads with too many N:                   6,929 (0.0%)
Reads discarded as untrimmed:           48,788 (0.2%)
Reads written (passing filters):    13,274,127 (51.6%)

Total basepairs processed: 3,885,649,629 bp
Quality-trimmed:             133,678,654 bp (3.4%)
Total written (filtered):    328,216,706 bp (8.4%)

# M14
Total reads processed:              35,754,439
Reads with adapters:                29,105,761 (81.4%)

== Read fate breakdown ==
Reads that were too short:          26,385,621 (73.8%)
Reads that were too long:            6,927,638 (19.4%)
Reads with too many N:                   8,413 (0.0%)
Reads discarded as untrimmed:           15,538 (0.0%)
Reads written (passing filters):     2,417,229 (6.8%)

Total basepairs processed: 5,398,920,289 bp
Quality-trimmed:              90,439,586 bp (1.7%)
Total written (filtered):     53,521,904 bp (1.0%)

# M23
Total reads processed:              28,608,340
Reads with adapters:                28,488,320 (99.6%)

== Read fate breakdown ==
Reads that were too short:           8,300,653 (29.0%)
Reads that were too long:            5,340,020 (18.7%)
Reads with too many N:                   8,149 (0.0%)
Reads discarded as untrimmed:           50,412 (0.2%)
Reads written (passing filters):    14,909,106 (52.1%)

Total basepairs processed: 4,319,859,340 bp
Quality-trimmed:             121,050,542 bp (2.8%)
Total written (filtered):    381,785,695 bp (8.8%)

# M24 
Total reads processed:              16,185,558
Reads with adapters:                15,051,209 (93.0%)

== Read fate breakdown ==
Reads that were too short:           1,238,159 (7.6%)
Reads that were too long:            1,298,021 (8.0%)
Reads with too many N:                  53,492 (0.3%)
Reads discarded as untrimmed:           17,232 (0.1%)
Reads written (passing filters):    13,578,654 (83.9%)

Total basepairs processed: 2,444,019,258 bp
Quality-trimmed:              51,942,286 bp (2.1%)
Total written (filtered):    258,626,182 bp (10.6%)

# M26
Total reads processed:              45,463,982
Reads with adapters:                33,200,737 (73.0%)

== Read fate breakdown ==
Reads that were too short:          26,954,608 (59.3%)
Reads that were too long:           12,621,446 (27.8%)
Reads with too many N:                  21,951 (0.0%)
Reads discarded as untrimmed:           29,590 (0.1%)
Reads written (passing filters):     5,836,387 (12.8%)

Total basepairs processed: 6,865,061,282 bp
Quality-trimmed:             114,030,951 bp (1.7%)
Total written (filtered):    124,668,281 bp (1.8%)

# M28 
Total reads processed:              29,330,724
Reads with adapters:                22,529,784 (76.8%)

== Read fate breakdown ==
Reads that were too short:          20,934,300 (71.4%)
Reads that were too long:            6,864,087 (23.4%)
Reads with too many N:                   5,150 (0.0%)
Reads discarded as untrimmed:           36,179 (0.1%)
Reads written (passing filters):     1,491,008 (5.1%)

Total basepairs processed: 4,428,939,324 bp
Quality-trimmed:              57,180,891 bp (1.3%)
Total written (filtered):     33,212,070 bp (0.7%)

# M35
Total reads processed:              27,415,125
Reads with adapters:                27,297,487 (99.6%)

== Read fate breakdown ==
Reads that were too short:           8,746,075 (31.9%)
Reads that were too long:            4,714,853 (17.2%)
Reads with too many N:                   7,301 (0.0%)
Reads discarded as untrimmed:           46,476 (0.2%)
Reads written (passing filters):    13,900,420 (50.7%)

Total basepairs processed: 4,139,683,875 bp
Quality-trimmed:             135,071,303 bp (3.3%)
Total written (filtered):    350,393,120 bp (8.5%)

# M36
Total reads processed:              21,669,140
Reads with adapters:                15,229,442 (70.3%)

== Read fate breakdown ==
Reads that were too short:           3,530,235 (16.3%)
Reads that were too long:            6,397,943 (29.5%)
Reads with too many N:                  45,695 (0.2%)
Reads discarded as untrimmed:           37,563 (0.2%)
Reads written (passing filters):    11,657,704 (53.8%)

Total basepairs processed: 3,272,040,140 bp
Quality-trimmed:              51,150,713 bp (1.6%)
Total written (filtered):    228,309,042 bp (7.0%)

# M37
Total reads processed:              20,247,832
Reads with adapters:                19,802,633 (97.8%)

== Read fate breakdown ==
Reads that were too short:           7,675,729 (37.9%)
Reads that were too long:              434,204 (2.1%)
Reads with too many N:                  45,858 (0.2%)
Reads discarded as untrimmed:           20,730 (0.1%)
Reads written (passing filters):    12,071,311 (59.6%)

Total basepairs processed: 3,057,422,632 bp
Quality-trimmed:              62,466,338 bp (2.0%)
Total written (filtered):    246,713,865 bp (8.1%)

# M39
Total reads processed:              20,298,000
Reads with adapters:                14,625,467 (72.1%)

== Read fate breakdown ==
Reads that were too short:          13,581,604 (66.9%)
Reads that were too long:            5,768,966 (28.4%)
Reads with too many N:                   3,465 (0.0%)
Reads discarded as untrimmed:           11,546 (0.1%)
Reads written (passing filters):       932,419 (4.6%)

Total basepairs processed: 3,064,998,000 bp
Quality-trimmed:              41,859,848 bp (1.4%)
Total written (filtered):     19,320,488 bp (0.6%)

# M47
Total reads processed:              61,683,862
Reads with adapters:                56,145,283 (91.0%)

== Read fate breakdown ==
Reads that were too short:          13,015,507 (21.1%)
Reads that were too long:            5,696,266 (9.2%)
Reads with too many N:                 166,609 (0.3%)
Reads discarded as untrimmed:          153,126 (0.2%)
Reads written (passing filters):    42,652,354 (69.1%)

Total basepairs processed: 9,314,263,162 bp
Quality-trimmed:             181,161,945 bp (1.9%)
Total written (filtered):    846,330,853 bp (9.1%)

# M48
Total reads processed:              26,207,354
Reads with adapters:                 5,677,260 (21.7%)

== Read fate breakdown ==
Reads that were too short:           3,896,981 (14.9%)
Reads that were too long:           20,434,620 (78.0%)
Reads with too many N:                   6,817 (0.0%)
Reads discarded as untrimmed:           90,086 (0.3%)
Reads written (passing filters):     1,778,850 (6.8%)

Total basepairs processed: 3,957,310,454 bp
Quality-trimmed:              46,499,881 bp (1.2%)
Total written (filtered):     35,591,311 bp (0.9%)

# M51
Total reads processed:              20,348,304
Reads with adapters:                 7,710,258 (37.9%)

== Read fate breakdown ==
Reads that were too short:           7,042,815 (34.6%)
Reads that were too long:           12,673,695 (62.3%)
Reads with too many N:                   2,273 (0.0%)
Reads discarded as untrimmed:           14,418 (0.1%)
Reads written (passing filters):       615,103 (3.0%)

Total basepairs processed: 3,072,593,904 bp
Quality-trimmed:              48,509,512 bp (1.6%)
Total written (filtered):     12,632,989 bp (0.4%)

# M52
Total reads processed:              29,279,123
Reads with adapters:                29,137,374 (99.5%)

== Read fate breakdown ==
Reads that were too short:           7,298,239 (24.9%)
Reads that were too long:            6,020,859 (20.6%)
Reads with too many N:                   8,657 (0.0%)
Reads discarded as untrimmed:           61,778 (0.2%)
Reads written (passing filters):    15,889,590 (54.3%)

Total basepairs processed: 4,421,147,573 bp
Quality-trimmed:             151,159,079 bp (3.4%)
Total written (filtered):    408,145,425 bp (9.2%)

# M60
Total reads processed:              24,332,450
Reads with adapters:                24,230,741 (99.6%)

== Read fate breakdown ==
Reads that were too short:           9,734,570 (40.0%)
Reads that were too long:            3,070,229 (12.6%)
Reads with too many N:                   5,938 (0.0%)
Reads discarded as untrimmed:           38,908 (0.2%)
Reads written (passing filters):    11,482,805 (47.2%)

Total basepairs processed: 3,674,199,950 bp
Quality-trimmed:             119,459,545 bp (3.3%)
Total written (filtered):    282,122,042 bp (7.7%)

# M61
Total reads processed:              24,731,279
Reads with adapters:                 6,861,730 (27.7%)

== Read fate breakdown ==
Reads that were too short:           6,034,816 (24.4%)
Reads that were too long:           17,805,944 (72.0%)
Reads with too many N:                   2,980 (0.0%)
Reads discarded as untrimmed:           76,252 (0.3%)
Reads written (passing filters):       811,287 (3.3%)

Total basepairs processed: 3,734,423,129 bp
Quality-trimmed:              46,085,583 bp (1.2%)
Total written (filtered):     16,211,139 bp (0.4%)

# M62
Total reads processed:              15,145,513
Reads with adapters:                 7,633,247 (50.4%)

== Read fate breakdown ==
Reads that were too short:           4,771,918 (31.5%)
Reads that were too long:            7,508,877 (49.6%)
Reads with too many N:                  11,032 (0.1%)
Reads discarded as untrimmed:           45,692 (0.3%)
Reads written (passing filters):     2,807,994 (18.5%)

Total basepairs processed: 2,286,972,463 bp
Quality-trimmed:              37,090,659 bp (1.6%)
Total written (filtered):     55,944,873 bp (2.4%)

# M63
Total reads processed:              21,903,406
Reads with adapters:                10,379,795 (47.4%)

== Read fate breakdown ==
Reads that were too short:           9,069,045 (41.4%)
Reads that were too long:           11,612,371 (53.0%)
Reads with too many N:                   4,485 (0.0%)
Reads discarded as untrimmed:           28,940 (0.1%)
Reads written (passing filters):     1,188,565 (5.4%)

Total basepairs processed: 3,307,414,306 bp
Quality-trimmed:              46,947,183 bp (1.4%)
Total written (filtered):     25,709,935 bp (0.8%)

# M6 
Total reads processed:              24,211,196
Reads with adapters:                15,778,017 (65.2%)

== Read fate breakdown ==
Reads that were too short:          15,305,643 (63.2%)
Reads that were too long:            8,428,574 (34.8%)
Reads with too many N:                   1,536 (0.0%)
Reads discarded as untrimmed:           43,373 (0.2%)
Reads written (passing filters):       432,070 (1.8%)

Total basepairs processed: 3,655,890,596 bp
Quality-trimmed:              52,333,701 bp (1.4%)
Total written (filtered):      9,537,955 bp (0.3%)

# M72 
Total reads processed:              25,266,689
Reads with adapters:                25,144,539 (99.5%)

== Read fate breakdown ==
Reads that were too short:           7,803,487 (30.9%)
Reads that were too long:            3,726,653 (14.7%)
Reads with too many N:                   7,388 (0.0%)
Reads discarded as untrimmed:           51,139 (0.2%)
Reads written (passing filters):    13,678,022 (54.1%)

Total basepairs processed: 3,815,270,039 bp
Quality-trimmed:             137,376,345 bp (3.6%)
Total written (filtered):    328,694,194 bp (8.6%)

# M73 
Total reads processed:               5,419,554
Reads with adapters:                 5,395,120 (99.5%)

== Read fate breakdown ==
Reads that were too short:           3,459,671 (63.8%)
Reads that were too long:              320,201 (5.9%)
Reads with too many N:                   6,668 (0.1%)
Reads discarded as untrimmed:            4,293 (0.1%)
Reads written (passing filters):     1,628,721 (30.1%)

Total basepairs processed:   818,352,654 bp
Quality-trimmed:              17,216,446 bp (2.1%)
Total written (filtered):     32,665,596 bp (4.0%)

# M74
Total reads processed:              23,599,752
Reads with adapters:                 6,346,294 (26.9%)

== Read fate breakdown ==
Reads that were too short:           5,447,500 (23.1%)
Reads that were too long:           17,208,033 (72.9%)
Reads with too many N:                   3,295 (0.0%)
Reads discarded as untrimmed:           55,067 (0.2%)
Reads written (passing filters):       885,857 (3.8%)

Total basepairs processed: 3,563,562,552 bp
Quality-trimmed:              44,030,663 bp (1.2%)
Total written (filtered):     18,225,152 bp (0.5%)

# M75
Total reads processed:              39,936,022
Reads with adapters:                31,308,251 (78.4%)

== Read fate breakdown ==
Reads that were too short:          26,334,128 (65.9%)
Reads that were too long:            9,283,704 (23.2%)
Reads with too many N:                  15,318 (0.0%)
Reads discarded as untrimmed:           53,526 (0.1%)
Reads written (passing filters):     4,249,346 (10.6%)

Total basepairs processed: 6,030,339,322 bp
Quality-trimmed:             104,351,331 bp (1.7%)
Total written (filtered):     94,582,130 bp (1.6%)

# M7
Total reads processed:              30,275,553
Reads with adapters:                24,226,251 (80.0%)

== Read fate breakdown ==
Reads that were too short:          11,063,587 (36.5%)
Reads that were too long:            6,232,765 (20.6%)
Reads with too many N:                  48,605 (0.2%)
Reads discarded as untrimmed:           51,715 (0.2%)
Reads written (passing filters):    12,878,881 (42.5%)

Total basepairs processed: 4,571,608,503 bp
Quality-trimmed:              88,504,444 bp (1.9%)
Total written (filtered):    255,714,395 bp (5.6%)

# M85
Total reads processed:              21,835,402
Reads with adapters:                21,711,666 (99.4%)

== Read fate breakdown ==
Reads that were too short:           7,627,502 (34.9%)
Reads that were too long:            2,616,361 (12.0%)
Reads with too many N:                   5,995 (0.0%)
Reads discarded as untrimmed:           48,112 (0.2%)
Reads written (passing filters):    11,537,432 (52.8%)

Total basepairs processed: 3,297,145,702 bp
Quality-trimmed:             123,423,025 bp (3.7%)
Total written (filtered):    268,088,833 bp (8.1%)

# M86
Total reads processed:              17,514,990
Reads with adapters:                 8,429,216 (48.1%)

== Read fate breakdown ==
Reads that were too short:           5,771,993 (33.0%)
Reads that were too long:            9,103,226 (52.0%)
Reads with too many N:                  10,039 (0.1%)
Reads discarded as untrimmed:           38,292 (0.2%)
Reads written (passing filters):     2,591,440 (14.8%)

Total basepairs processed: 2,644,763,490 bp
Quality-trimmed:              46,364,967 bp (1.8%)
Total written (filtered):     52,301,229 bp (2.0%)

# M87
Total reads processed:              35,917,496
Reads with adapters:                30,802,070 (85.8%)

== Read fate breakdown ==
Reads that were too short:          27,183,568 (75.7%)
Reads that were too long:            5,601,464 (15.6%)
Reads with too many N:                  11,049 (0.0%)
Reads discarded as untrimmed:           64,118 (0.2%)
Reads written (passing filters):     3,057,297 (8.5%)

Total basepairs processed: 5,423,541,896 bp
Quality-trimmed:              82,622,227 bp (1.5%)
Total written (filtered):     68,971,401 bp (1.3%)

# M88
Total reads processed:              28,792,588
Reads with adapters:                11,393,165 (39.6%)

== Read fate breakdown ==
Reads that were too short:           9,948,192 (34.6%)
Reads that were too long:           17,523,034 (60.9%)
Reads with too many N:                   4,898 (0.0%)
Reads discarded as untrimmed:           30,002 (0.1%)
Reads written (passing filters):     1,286,462 (4.5%)

Total basepairs processed: 4,347,680,788 bp
Quality-trimmed:              58,097,687 bp (1.3%)
Total written (filtered):     27,541,768 bp (0.6%)

# M8
Total reads processed:              37,732,448
Reads with adapters:                21,548,435 (57.1%)

== Read fate breakdown ==
Reads that were too short:          19,322,853 (51.2%)
Reads that were too long:           16,300,593 (43.2%)
Reads with too many N:                   7,780 (0.0%)
Reads discarded as untrimmed:           49,950 (0.1%)
Reads written (passing filters):     2,051,272 (5.4%)

Total basepairs processed: 5,697,599,648 bp
Quality-trimmed:              82,384,975 bp (1.4%)
Total written (filtered):     45,156,438 bp (0.8%)

# M9
Total reads processed:              25,911,154
Reads with adapters:                25,800,281 (99.6%)

== Read fate breakdown ==
Reads that were too short:           8,632,667 (33.3%)
Reads that were too long:            4,601,815 (17.8%)
Reads with too many N:                   6,587 (0.0%)
Reads discarded as untrimmed:           45,543 (0.2%)
Reads written (passing filters):    12,624,542 (48.7%)

Total basepairs processed: 3,912,584,254 bp
Quality-trimmed:             153,459,563 bp (3.9%)
Total written (filtered):    305,187,572 bp (7.8%)

A lot of samples had a high percentage of reads that were too short for the 18 bp cutoff. There were also some samples that did not have many reads with adapters, which means I need to go back to the fastqc files to identify adapters/sequences that were overrepresented in those samples and add them to the cutadapt code. By doing this, I will capture more reads with adapters for trimming.

Next trimming iteration: decrease min length to 15 bp, look at fastqc for more adapter sequences
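Rather than copying sequences out of each html by hand, the overrepresented-sequence tables can be pulled straight from the FastQC zips; a sketch, assuming the zips are in the trim_stringent QC folder:

cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/output/fastqc/trim_stringent
# print the top of the Overrepresented sequences module from each report
for z in *_fastqc.zip; do
    echo "== ${z} =="
    unzip -p "$z" "*/fastqc_data.txt" | awk '/>>Overrepresented sequences/,/>>END_MODULE/' | head -n 12
done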

Rerun cutadapt with updated trimming. Look at the individual fastqc files to determine any other adapters, particularly for samples that did not have a high % of reads with adapters:

# sketch of the next cutadapt iteration (additional adapters TBD from fastqc)
cutadapt -a TGGAATTCTCGGGTGCCAAGG -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA [...other adapters...] \
-m 15 -M 35 -q 20,20 --trim-n --max-n 1 -e 0.15 \
-o output.fastq input.fastq

# draft --readfile list for a follow-up ShortStack run (a subset of samples commented out)
--readfile trim_stringent_cutadapt_10_small_RNA_S4_R1_001.fastq.gz \
trim_stringent_cutadapt_11_small_RNA_S5_R1_001.fastq.gz \
trim_stringent_cutadapt_13_S76_R1_001.fastq.gz \
trim_stringent_cutadapt_14_small_RNA_S6_R1_001.fastq.gz \
trim_stringent_cutadapt_23_S77_R1_001.fastq.gz \
trim_stringent_cutadapt_24_small_RNA_S7_R1_001.fastq.gz \
trim_stringent_cutadapt_26_small_RNA_S8_R1_001.fastq.gz \
trim_stringent_cutadapt_28_small_RNA_S9_R1_001.fastq.gz \
trim_stringent_cutadapt_35_S78_R1_001.fastq.gz \
trim_stringent_cutadapt_36_small_RNA_S10_R1_001.fastq.gz \
trim_stringent_cutadapt_37_small_RNA_S11_R1_001.fastq.gz \
trim_stringent_cutadapt_39_small_RNA_S12_R1_001.fastq.gz \
trim_stringent_cutadapt_47_small_RNA_S13_R1_001.fastq.gz \
trim_stringent_cutadapt_48_small_RNA_S14_R1_001.fastq.gz \
trim_stringent_cutadapt_51_small_RNA_S15_R1_001.fastq.gz \
trim_stringent_cutadapt_52_S79_R1_001.fastq.gz \
trim_stringent_cutadapt_60_S80_R1_001.fastq.gz \
trim_stringent_cutadapt_61_small_RNA_S16_R1_001.fastq.gz \
trim_stringent_cutadapt_62_small_RNA_S17_R1_001.fastq.gz \
trim_stringent_cutadapt_63_small_RNA_S18_R1_001.fastq.gz \
#trim_stringent_cutadapt_6_small_RNA_S1_R1_001.fastq.gz \
#trim_stringent_cutadapt_72_S81_R1_001.fastq.gz \
#trim_stringent_cutadapt_73_small_RNA_S19_R1_001.fastq.gz \
#trim_stringent_cutadapt_74_small_RNA_S20_R1_001.fastq.gz \
#trim_stringent_cutadapt_75_small_RNA_S21_R1_001.fastq.gz \
#trim_stringent_cutadapt_7_small_RNA_S2_R1_001.fastq.gz \
#trim_stringent_cutadapt_85_S82_R1_001.fastq.gz \
#trim_stringent_cutadapt_86_small_RNA_S22_R1_001.fastq.gz \
#trim_stringent_cutadapt_87_small_RNA_S23_R1_001.fastq.gz \
#trim_stringent_cutadapt_88_small_RNA_S24_R1_001.fastq.gz \
#trim_stringent_cutadapt_8_small_RNA_S3_R1_001.fastq.gz \
#trim_stringent_cutadapt_9_S75_R1_001.fastq.gz \

Zoe cutadapt trimming: https://github.com/zdellaert/LaserCoral/blob/1b776313512822d368e957591890b2228c590bd5/code/RNA-seq-bioinf.md

code to extract fasta info from shortstack: https://github.com/urol-e5/deep-dive/blob/main/F-Pmea/code/13.2.1.1-Pmea-sRNAseq-ShortStack-FastA-extraction.md

Written on October 24, 2024