Developmental 2023 Timeseries mRNA analysis - polyA
These data come from my 2023 developmental timeseries experiment with Montipora capitata in Hawaii. In this experiment, M. capitata embryos and larvae were exposed to ambient and heat-stress conditions over 72 hours, from embryo to swimming larva. Samples were collected at 8 time points: 1, 4, 9, 14, 22, 28, 48 and 72 hpf. The GitHub repository for this project is here.
We previously sequenced these samples using rRNA-depletion library methods and wanted to compare those libraries with ones prepared using polyA selection. I did the library prep myself with the Zymo-Seq SwitchFree 3' mRNA library prep kit (see my notebook posts for more information on library prep). The sequencing was done through the Oklahoma Medical Research Foundation NGS Core, which I found on Genohub. Here is the sequencing information:
- Instrument: Illumina NovaSeq X Plus - 10B - PE 150 Cycle
- Read length: 2 x 150bp (Paired End)
- Pricing unit: per lane
- Number of samples: 8 libraries
- Guaranteed number of pass filter PE reads/sample: 20M (10M in each direction)
- Deliverables: FastQ files uploaded to Genohub project bucket
Files were downloaded to this location on Unity: XXXX. Goodbye Andromeda, hello Unity. See Unity documentation and Putnam lab Unity documentation (thanks Zoe :’)) for info on using Unity. Time to analyze!
Check files to ensure transfer is complete and no files were corrupted.
md5sum *fastq.gz > checkmd5.txt
md5sum -c checkmd5.txt
M10_S136_R1_001.fastq.gz: OK
M10_S136_R2_001.fastq.gz: OK
M11_S137_R1_001.fastq.gz: OK
M11_S137_R2_001.fastq.gz: OK
M13_S138_R1_001.fastq.gz: OK
M13_S138_R2_001.fastq.gz: OK
M14_S139_R1_001.fastq.gz: OK
M14_S139_R2_001.fastq.gz: OK
M23_S140_R1_001.fastq.gz: OK
M23_S140_R2_001.fastq.gz: OK
M24_S141_R1_001.fastq.gz: OK
M24_S141_R2_001.fastq.gz: OK
M26_S142_R1_001.fastq.gz: OK
M26_S142_R2_001.fastq.gz: OK
M28_S143_R1_001.fastq.gz: OK
M28_S143_R2_001.fastq.gz: OK
M35_S144_R1_001.fastq.gz: OK
M35_S144_R2_001.fastq.gz: OK
M36_S145_R1_001.fastq.gz: OK
M36_S145_R2_001.fastq.gz: OK
M37_S146_R1_001.fastq.gz: OK
M37_S146_R2_001.fastq.gz: OK
M39_S147_R1_001.fastq.gz: OK
M39_S147_R2_001.fastq.gz: OK
M47_S148_R1_001.fastq.gz: OK
M47_S148_R2_001.fastq.gz: OK
M48_S149_R1_001.fastq.gz: OK
M48_S149_R2_001.fastq.gz: OK
M51_S150_R1_001.fastq.gz: OK
M51_S150_R2_001.fastq.gz: OK
M52_S151_R1_001.fastq.gz: OK
M52_S151_R2_001.fastq.gz: OK
M60_S152_R1_001.fastq.gz: OK
M60_S152_R2_001.fastq.gz: OK
M61_S153_R1_001.fastq.gz: OK
M61_S153_R2_001.fastq.gz: OK
M62_S154_R1_001.fastq.gz: OK
M62_S154_R2_001.fastq.gz: OK
M63_S155_R1_001.fastq.gz: OK
M63_S155_R2_001.fastq.gz: OK
M6_S132_R1_001.fastq.gz: OK
M6_S132_R2_001.fastq.gz: OK
M72_S156_R1_001.fastq.gz: OK
M72_S156_R2_001.fastq.gz: OK
M73_S157_R1_001.fastq.gz: OK
M73_S157_R2_001.fastq.gz: OK
M74_S158_R1_001.fastq.gz: OK
M74_S158_R2_001.fastq.gz: OK
M75_S159_R1_001.fastq.gz: OK
M75_S159_R2_001.fastq.gz: OK
M7_S133_R1_001.fastq.gz: OK
M7_S133_R2_001.fastq.gz: OK
M85_S160_R1_001.fastq.gz: OK
M85_S160_R2_001.fastq.gz: OK
M86_S161_R1_001.fastq.gz: OK
M86_S161_R2_001.fastq.gz: OK
M87_S162_R1_001.fastq.gz: OK
M87_S162_R2_001.fastq.gz: OK
M88_S163_R1_001.fastq.gz: OK
M88_S163_R2_001.fastq.gz: OK
M8_S134_R1_001.fastq.gz: OK
M8_S134_R2_001.fastq.gz: OK
M9_S135_R1_001.fastq.gz: OK
M9_S135_R2_001.fastq.gz: OK
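All checksums come back OK. One caveat: running md5sum -c against a list generated after download only guards against corruption from this point forward; if Genohub also supplies a checksum manifest, checking against that verifies the transfer itself. A minimal sketch, assuming a provider manifest named provider_md5.txt (hypothetical name) in the same directory:
# Compare local files against the sequencing center's checksums
md5sum -c provider_md5.txt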
Make directories on Unity. Symlink the raw data from the project directory into the work directory.
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
ln -s /project/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/*fastq.gz .
Run fastqc on raw reads. nano raw_fastqc.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --signal=2
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 48:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
module load uri/main
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
# Create an array of fastq files to process
files=(*001.fastq.gz)
# Run fastqc in parallel
echo "Starting fastqc..." $(date)
parallel -j 20 "fastqc {} -o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/raw && echo 'Processed {}'" ::: "${files[@]}"
echo "fastQC done." $(date)
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/raw
echo "Starting multiqc..." $(date)
multiqc *
echo "Initial QC of polyA data complete." $(date)
Submitted batch job 32423847. Raw QC data is here. Time to trim!
This info is from the Zymo switchfree protocol:
Read 1 provides UMI information and Read 2 provides the transcript sequence. Read 1 will sequence the 8-nucleotide UMI, the 6 non-T random nucleotides, and the oligo dT region sequentially before sequencing the insert region. Therefore, it is recommended to assign no more than 25 cycles to Read 1. This is sufficient to allow for cluster identification, quality metrics calculation, and coverage of UMI information while avoiding more low-quality bases associated with sequencing through the oligo dT region.
It is possible to sequence longer for Read 1 (e.g., 100 bp or 150 bp paired-end sequencing) when other libraries of high complexity are sequenced on the same lane with the Zymo-Seq SwitchFree™ 3′ mRNA libraries. Read 2 will provide the transcript sequences.
Generally, it is sufficient to trim Read 2 with the Illumina® TruSeq® adapter sequence: "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT". When the read length is longer than the average library insert size, Read 2 will possibly sequence into the oligo dT sequence, the random hexamer, and even the UMI sequence. Additionally, a polyA tail may also be present at the end of Read 2. These additional bases can be trimmed to potentially improve alignment. An example using Cutadapt for such 2-step trimming is shown below.
cutadapt -a A{8}B{6}N{8}AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o adapterTrimmed.fastq.gz sample.fastq.gz
cutadapt -a "A{100}" -o completeTrimmed.fastq.gz adapterTrimmed.fastq.gz
We highly recommend discarding Read 1 after UMI extraction in any bioinformatic analysis because of the poor sequence quality associated with the oligo dT region.
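For reference, a minimal UMI-extraction sketch with umi_tools (not part of this pipeline; filenames are placeholders and umi_tools is assumed to be installed). It moves the 8-nt UMI from the start of Read 1 into the read names of both mates and keeps only the UMI-tagged Read 2:
# Extract the 8-nt UMI from the 5' end of R1, tag read names with it,
# and keep only R2 (R1 is discarded via /dev/null, per the Zymo recommendation)
umi_tools extract \
    --bc-pattern=NNNNNNNN \
    --stdin=sample_R1.fastq.gz \
    --read2-in=sample_R2.fastq.gz \
    --stdout=/dev/null \
    --read2-out=sample_R2_umi.fastq.gz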
Hollie also went through some trimming iterations on her SwitchFree data. See her code here. In the scripts folder: nano trim_R2_cutadapt.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=300GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
module load uri/main
module load all/cutadapt/3.5-GCCcore-11.2.0
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
# Create an array of fastq files to process
array=(*R2_001.fastq.gz)
echo "trim using cutadapt" $(date)
for i in "${array[@]}"; do
sample_name=$(echo "$i" | awk -F '_' '{print $1}')
# Use original filename ($i) as input
cutadapt -a A{8}B{6}N{8}AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o "${sample_name}_adapterTrimmed.fastq.gz" "$i"
cutadapt -a A{100} -o "${sample_name}_completeTrimmed.fastq.gz" "${sample_name}_adapterTrimmed.fastq.gz"
echo "trimming complete" $(date)
# Move trimmed file
mv "${sample_name}_completeTrimmed.fastq.gz" /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/trim/
# FastQC
fastqc "/work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/trim/${sample_name}_completeTrimmed.fastq.gz" \
-o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim
done
# Run MultiQC once after all samples are processed
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim
echo "Starting multiqc..." $(date)
multiqc *
echo "QC of trimmed polyA data complete." $(date)
Submitted batch job 32583551. Lots of polyG and adapter content still left. Discussed with Zoe. Based on the cutadapt manual, I need to add --nextseq-trim=20 because of how two-color chemistry reads need quality trimming: "Some Illumina instruments use a two-color chemistry to encode the four bases. This includes the NextSeq and the NovaSeq. In those instruments, a 'dark cycle' (with no detected color) encodes a G. However, dark cycles also occur when sequencing 'falls off' the end of the fragment. The read then contains a run of high-quality, but incorrect 'G' calls at its 3' end."
I am also going to do some quality trimming to toss any bad reads (i.e., Phred score < 20). Because cutadapt applies its trimming steps in a fixed order within a single call, I will need to do two trimming passes, similar to what I did above. I am also going to trim both reads this time. In the scripts folder: nano trim_cutadapt.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=300GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
module load uri/main
module load all/cutadapt/3.5-GCCcore-11.2.0
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
# Create an array of fastq files to process
array=(*001.fastq.gz)
echo "trim using cutadapt" $(date)
for i in "${array[@]}"; do
sample_name=$(echo "$i" | awk -F'_' '{print $1 "_" $3}')
cutadapt --nextseq-trim=20 -q 20,20 --minimum-length 20 -o "${sample_name}_PolyGTrimmed.fastq.gz" "$i"
cutadapt -a A{8}B{6}N{8}AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o "${sample_name}_AdapterPolyGTrimmed.fastq.gz" "${sample_name}_PolyGTrimmed.fastq.gz"
cutadapt -a A{100} -o "${sample_name}_completeTrimmed.fastq.gz" "${sample_name}_AdapterPolyGTrimmed.fastq.gz"
echo "trimming complete" $(date)
# Move trimmed file
mv "${sample_name}_completeTrimmed.fastq.gz" /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/trim/
# FastQC
fastqc "/work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/trim/${sample_name}_completeTrimmed.fastq.gz" \
-o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim
done
# Run MultiQC once after all samples are processed
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim
echo "Starting multiqc..." $(date)
multiqc *
echo "Multiqc complete." $(date)
Submitted batch job 32648465. Still a lot of adapter content. Discussed with Zoe. I have paired-end reads but only really care about R2. I still need to give cutadapt both input files, but since I am only interested in trimming adapters from R2, I should be using -A.
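For reference, a minimal sketch of the distinction (placeholder filenames): -a specifies a 3' adapter for R1, -A specifies a 3' adapter for R2, and paired mode takes both input files with -o/-p for the two outputs so the mates stay in sync.
# Trim the TruSeq adapter from R2 only, while keeping the reads paired
cutadapt \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    -o out_R1.fastq.gz -p out_R2.fastq.gz \
    in_R1.fastq.gz in_R2.fastq.gz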
Also beginning to use scratch directories on Unity (see the Unity documentation and Putnam lab documentation).
In the scripts folder: nano trim_round2_cutadapt.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=300GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
module load uri/main
module load all/cutadapt/3.5-GCCcore-11.2.0
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
# Make array of files to process
R1_raw=(*R1*.fastq.gz)
R2_raw=(*R2*.fastq.gz)
for i in "${!R1_raw[@]}"; do
sample_name=$(basename "${R1_raw[$i]}" | sed -E 's/_S[0-9]+_R1_001.fastq.gz//')
# Step 1: PolyG trimming
cutadapt --nextseq-trim=20 -q 20,20 --minimum-length=20 \
-o "${sample_name}_R1_PolyGTrimmed.fastq.gz" \
-p "${sample_name}_R2_PolyGTrimmed.fastq.gz" \
"${R1_raw[$i]}" "${R2_raw[$i]}"
# Step 2: Adapter trimming
cutadapt -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -a A{100} -q 20,20 --minimum-length=20 \
-o "${sample_name}_R1_AdapterPolyGTrimmed.fastq.gz" \
-p "${sample_name}_R2_AdapterPolyGTrimmed.fastq.gz" \
"${sample_name}_R1_PolyGTrimmed.fastq.gz" "${sample_name}_R2_PolyGTrimmed.fastq.gz"
echo "trimming complete for $sample_name" $(date)
# Move only the correct files
mv "${sample_name}_R1_AdapterPolyGTrimmed.fastq.gz" "${sample_name}_R2_AdapterPolyGTrimmed.fastq.gz" \
/scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2
# FastQC on the final adapter+PolyG trimmed files
fastqc "/scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2/${sample_name}_R1_AdapterPolyGTrimmed.fastq.gz" \
-o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2
fastqc "/scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2/${sample_name}_R2_AdapterPolyGTrimmed.fastq.gz" \
-o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2
done
# Run MultiQC once after all samples are processed
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2
echo "Starting multiqc..." $(date)
multiqc *
echo "Multiqc complete." $(date)
Submitted batch job 32744961. The cutadapt version on Unity is 3.5 which does not have the capacity for polyA tail trimming. I installed the updated version via conda on Unity. See cutadapt installation instructions and Unity information about conda environments.
cd /work/pi_hputnam_uri_edu/conda/envs
module load conda/latest # need to load before making any conda envs
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
conda create --prefix /work/pi_hputnam_uri_edu/conda/envs/cutadapt cutadapt
conda activate /work/pi_hputnam_uri_edu/conda/envs/cutadapt
cutadapt --version
5.0
Successful installation. Okay lets try running this. In the scripts folder: nano trim_round2_cutadaptv5.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
# Activate conda env
module load conda/latest
conda activate /work/pi_hputnam_uri_edu/conda/envs/cutadapt
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
# Array of target samples (MXX numbers only)
target_samples=("M6" "M7" "M8" "M9" "M10" "M11" "M13" "M14" "M23" "M24" "M26" "M28" "M35" "M36" "M37" "M39" "M47" "M48" "M51" "M52" "M60" "M61" "M62" "M63" "M72" "M73" "M74" "M75" "M85" "M86" "M87" "M88")
# Process each target sample
for sample in "${target_samples[@]}"; do
# Find corresponding raw files
R1_raw=(${sample}_S*_R1_001.fastq.gz)
R2_raw=(${sample}_S*_R2_001.fastq.gz)
# Verify files exist
if [ ${#R1_raw[@]} -eq 1 ] && [ ${#R2_raw[@]} -eq 1 ]; then
# Step 1: PolyG trimming
cutadapt --nextseq-trim=20 -q 20,20 --minimum-length=20 \
-o "${sample}_R1_PolyGTrimmed.fastq.gz" \
-p "${sample}_R2_PolyGTrimmed.fastq.gz" \
"${R1_raw[0]}" "${R2_raw[0]}"
# Step 2: Adapter trimming
cutadapt -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --poly-a -q 20,20 --minimum-length=20 \
-o "${sample}_R1_AdapterPolyGTrimmed.fastq.gz" \
-p "${sample}_R2_AdapterPolyGTrimmed.fastq.gz" \
"${sample}_R1_PolyGTrimmed.fastq.gz" "${sample}_R2_PolyGTrimmed.fastq.gz"
echo "Trimming complete for ${sample}" $(date)
# Move final files
mv "${sample}_R1_AdapterPolyGTrimmed.fastq.gz" "${sample}_R2_AdapterPolyGTrimmed.fastq.gz" \
/scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2
# FastQC analysis
fastqc "/scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2/${sample}_R1_AdapterPolyGTrimmed.fastq.gz" \
-o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2
fastqc "/scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2/${sample}_R2_AdapterPolyGTrimmed.fastq.gz" \
-o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2
else
echo "Error: Found ${#R1_raw[@]} R1 and ${#R2_raw[@]} R2 files for ${sample}"
fi
done
# Deactivate conda env
conda deactivate
# Load MultiQC modules
module load uri/main
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2
multiqc *
echo "MultiQC complete!" $(date)
Submitted batch job 33026985. Trimming occurred! But I forgot to create the /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2/ directory, so no fastqc was done. Let's do that in an interactive session:
salloc
# Create the missing directory first, then move the trimmed files over
mkdir -p /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2
mv /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw/*AdapterPolyGTrimmed.fastq.gz /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2
Run QC as a job. I also accidentally deleted the M85 R1 file, but that is okay for now. nano trim_round2_QC.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
#module load parallel/20240822
module load fastqc/0.12.1
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2
fastqc *AdapterPolyGTrimmed.fastq.gz -o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2
echo "fastqc complete" $(date)
# Load MultiQC modules
module load uri/main
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2
multiqc *
echo "MultiQC complete!" $(date)
Submitted batch job 33041843. The MultiQC report can be found here. There is still polyA tail on the R2 samples…and many of them have a lot of reads around 30bp. wtf????? I'm going to run QC on the intermediate cutadapt files (*PolyGTrimmed.fastq.gz) to see what the differences are.
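Before that, a quick way to eyeball where those ~30bp reads are piling up is a read-length histogram on one of the final trimmed files (a minimal sketch using M6 as an example):
# Sequence lines are every 4th line of a fastq (offset 2); count reads per length
# and print the most common lengths first, as "count length" pairs
zcat /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_data_round2/M6_R2_AdapterPolyGTrimmed.fastq.gz \
    | awk 'NR % 4 == 2 {print length($0)}' \
    | sort -n | uniq -c | sort -rn | head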
In the scripts folder: nano trim_round2_QC_PolyGTrimmed.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
#module load parallel/20240822
module load fastqc/0.12.1
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
fastqc *PolyGTrimmed.fastq.gz -o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2_PolyGTrimmed
echo "fastqc complete" $(date)
# Load MultiQC modules
module load uri/main
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round2_PolyGTrimmed
multiqc *
echo "MultiQC complete!" $(date)
Submitted batch job 33173364. Here, we see that the reads are still 150bp long and adapter is still present.
Let's try trimming *PolyGTrimmed.fastq.gz by specifying the length of the polyA stretch we want to trim. nano trim_round3_cutadaptv5.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
# Activate conda env
module load conda/latest
conda activate /work/pi_hputnam_uri_edu/conda/envs/cutadapt
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
# Array of target samples (MXX numbers only)
target_samples=("M6" "M7" "M8" "M9" "M10" "M11" "M13" "M14" "M23" "M24" "M26" "M28" "M35" "M36" "M37" "M39" "M47" "M48" "M51" "M52" "M60" "M61" "M62" "M63" "M72" "M73" "M74" "M75" "M85" "M86" "M87" "M88")
echo "Trimming started at" $(date)
# Process each target sample
for sample in "${target_samples[@]}"; do
# Find corresponding raw files
R1_polyg_trim=(${sample}_R1_PolyGTrimmed.fastq.gz)
R2_polyg_trim=(${sample}_R2_PolyGTrimmed.fastq.gz)
# Run cutadapt
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-a "A{20}" \
-A "A{20}" \
-q 20,20 \
--minimum-length=20 \
-o "${sample}_R1_AdapterPolyATrimmed.fastq.gz" \
-p "${sample}_R2_AdapterPolyATrimmed.fastq.gz" \
"${R1_polyg_trim[0]}" "${R2_polyg_trim[0]}"
done
echo "Trimming complete, starting QC" $(date)
fastqc *AdapterPolyGATrimmed.fastq.gz
# Load MultiQC modules
module load uri/main
module load all/MultiQC/1.12-foss-2021b
multiqc *
echo "MultiQC complete!" $(date)
Submitted batch job 33316139. QC didn't work: the fastqc glob (*AdapterPolyGATrimmed.fastq.gz) doesn't match the files the trimming step actually wrote (*AdapterPolyATrimmed.fastq.gz). Run QC as its own job instead. nano trim_round3_QC.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
#module load parallel/20240822
module load fastqc/0.12.1
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round3
fastqc *AdapterPolyATrimmed.fastq.gz -o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round3
echo "fastqc complete" $(date)
# Load MultiQC modules
module load uri/main
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round3
multiqc *
echo "MultiQC complete!" $(date)
Submitted batch job 33320990. Still not removing all of the polyA tail. Use A{10} (for both -a and -A) instead of A{20}; the shorter pattern should let cutadapt match shorter polyA stretches at the read ends. nano trim_round4_cutadaptv5.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
# Activate conda env
module load conda/latest
conda activate /work/pi_hputnam_uri_edu/conda/envs/cutadapt
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
# Array of target samples (MXX numbers only)
target_samples=("M6" "M7" "M8" "M9" "M10" "M11" "M13" "M14" "M23" "M24" "M26" "M28" "M35" "M36" "M37" "M39" "M47" "M48" "M51" "M52" "M60" "M61" "M62" "M63" "M72" "M73" "M74" "M75" "M85" "M86" "M87" "M88")
echo "Trimming started at" $(date)
# Process each target sample
for sample in "${target_samples[@]}"; do
# Find corresponding raw files
R1_polyg_trim=(${sample}_R1_PolyGTrimmed.fastq.gz)
R2_polyg_trim=(${sample}_R2_PolyGTrimmed.fastq.gz)
# Run cutadapt
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-a "A{10}" \
-A "A{10}" \
-q 20,20 \
--minimum-length=20 \
-o "${sample}_R1_AdapterPolyA10Trimmed.fastq.gz" \
-p "${sample}_R2_AdapterPolyA10Trimmed.fastq.gz" \
"${R1_polyg_trim[0]}" "${R2_polyg_trim[0]}"
done
Submitted batch job 33322136. Run QC. nano trim_round4_QC.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
#module load parallel/20240822
module load fastqc/0.12.1
mv /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw/*AdapterPolyA10Trimmed.fastq.gz /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round4
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round4
fastqc *AdapterPolyA10Trimmed.fastq.gz -o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round4
echo "fastqc complete" $(date)
# Load MultiQC modules
module load uri/main
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round4
multiqc *
echo "MultiQC complete!" $(date)
Submitted batch job 33328319.
By default cutadapt removes only the best-matching adapter from each read (one adapter per pass), so supplying both the TruSeq adapter and A{10} in a single call trims one or the other, not both. I may need to run the adapter and polyA trimming parts separately… nano trim_round5_cutadaptv5.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
# Activate conda env
module load conda/latest
conda activate /work/pi_hputnam_uri_edu/conda/envs/cutadapt
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/data/raw
# Array of target samples (MXX numbers only)
target_samples=("M6" "M7" "M8" "M9" "M10" "M11" "M13" "M14" "M23" "M24" "M26" "M28" "M35" "M36" "M37" "M39" "M47" "M48" "M51" "M52" "M60" "M61" "M62" "M63" "M72" "M73" "M74" "M75" "M85" "M86" "M87" "M88")
# Process each target sample
for sample in "${target_samples[@]}"; do
# Find corresponding raw files
R1_raw=(${sample}_S*_R1_001.fastq.gz)
R2_raw=(${sample}_S*_R2_001.fastq.gz)
# Verify files exist
if [ ${#R1_raw[@]} -eq 1 ] && [ ${#R2_raw[@]} -eq 1 ]; then
# Step 1: PolyG trimming
cutadapt --nextseq-trim=20 -q 20,20 --minimum-length=20 \
-o "${sample}_R1_PolyGTrimmed.fastq.gz" \
-p "${sample}_R2_PolyGTrimmed.fastq.gz" \
"${R1_raw[0]}" "${R2_raw[0]}"
echo "polyG trimming complete for ${sample}" $(date)
# Step 2: adapter trimming
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-q 20,20 \
--minimum-length=20 \
-o "${sample}_R1_AdapterPolyGTrimmed.fastq.gz" \
-p "${sample}_R2_AdapterPolyGTrimmed.fastq.gz" \
"${sample}_R1_PolyGTrimmed.fastq.gz" "${sample}_R2_PolyGTrimmed.fastq.gz"
echo "Adapter trimming complete for ${sample}" $(date)
# Step 3: polyA trimming
cutadapt \
-a "A{10}" \
-A "A{10}" \
-q 20,20 \
--minimum-length=20 \
-o "${sample}_R1_AdapterPolyGA10Trimmed.fastq.gz" \
-p "${sample}_R2_AdapterPolyGA10Trimmed.fastq.gz" \
"${sample}_R1_AdapterPolyGTrimmed.fastq.gz" "${sample}_R2_AdapterPolyGTrimmed.fastq.gz"
echo "PolyA trimming complete for ${sample}" $(date)
fi
done
echo "Trimming complete, run fastQC" $(date)
mv *Trimmed.fastq.gz /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round5
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round5
fastqc *AdapterPolyGA10Trimmed.fastq.gz -o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round5
echo "fastqc complete" $(date)
# Load MultiQC modules
module load uri/main
module load all/MultiQC/1.12-foss-2021b
cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/fastqc/trim_round5
multiqc *
echo "MultiQC complete!" $(date)
Submitted batch job 33332147. FINALLY!!!!! SUCCESS!!!!!!!
TIME TO ALIGN!!!! In the scripts folder: nano align_R2.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
## Genome has already been indexed for hisat2
# load modules needed
module load uri/main
module load all/HISAT2/2.2.1-gompi-2022a
module load all/SAMtools/1.18-GCC-12.3.0
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round5
echo "Aligning polyA R2 samples!" $(date)
for READ in *_R2_AdapterPolyGA10Trimmed.fastq.gz; do
ID=$(echo "$READ" | cut -d'_' -f1)
# Run alignment
hisat2 --rna-strandness F -p 8 --dta -x /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/McapV3_hisat2_ref -U "$READ" -S "${ID}.sam"
# Convert to BAM and sort
samtools sort -@ 8 -o "${ID}.bam" "${ID}.sam"
echo "${ID} bam-ified!"
# Remove SAM file
rm "${ID}.sam"
done
echo "PolyA R2 samples alignment complete!" $(date)
Submitted batch job 33409631.
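hisat2 prints its alignment summary to stderr, which this job routes to slurm-%j.error, so per-sample alignment rates can be pulled out of the log afterwards (a quick check, assuming the default summary wording):
# Run from the scripts directory; one "overall alignment rate" line per sample
grep "overall alignment rate" slurm-33409631.error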
Assemble! nano assemble_R2.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
# load modules needed
module load uri/main
module load all/StringTie/2.2.1-GCC-11.2.0
echo "Assembling transcripts using stringtie" $(date)
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round5
array1=(*.bam) # Make an array of BAM files to assemble
for i in "${array1[@]}"; do
    stringtie -p 8 -e -B -G /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -A "${i}.gene_abund.tab" -o "${i}.gtf" "$i"
done
echo "Assembly complete " $(date)
Submitted batch job 33420828. TPM information can be seen in the gtf files.
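Gene-level TPM also lands in the per-sample abundance tables written by -A. A quick way to peek at one sample (a sketch assuming stringtie's standard nine-column gene abundance layout, with TPM in the last column):
# Show gene ID (column 1) and TPM (column 9) for sample M6
cut -f1,9 /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round5/M6.bam.gene_abund.tab | head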
Interactive mode!
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round5
salloc
module load uri/main
module load all/StringTie/2.2.1-GCC-11.2.0
# List of gtfs
ls *.gtf > gtf_list.txt
# Merge gtfs into single gtf
stringtie --merge -e -p 8 -G /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -o Mcap_polyA_merged.gtf gtf_list.txt #Merge GTFs
module load all/GffCompare/0.12.6-GCC-11.2.0
gffcompare -r /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -G -o merged Mcap_polyA_merged.gtf #Compute the accuracy
54384 reference transcripts loaded.
2 duplicate reference transcripts discarded.
54384 query transfrags loaded.
Look at the merge stats. Near-perfect sensitivity and precision are expected here, since stringtie was run with -e and -G: the assembled transcripts were constrained to the reference annotation.
# gffcompare v0.12.6 | Command line was:
#gffcompare -r /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -G -o merged Mcap_polyA_merged.gtf
#
#= Summary for dataset: Mcap_polyA_merged.gtf
# Query mRNAs : 54384 in 54185 loci (36023 multi-exon transcripts)
# (141 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 54382 in 54185 loci (36023 multi-exon)
# Super-loci w/ reference transcripts: 54185
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |   100.0    |
        Exon level:   100.0     |   100.0    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:    99.9     |    99.9    |
       Locus level:   100.0     |   100.0    |
Matching intron chains: 36023
Matching transcripts: 54349
Matching loci: 54183
Missed exons: 0/256029 ( 0.0%)
Novel exons: 0/256028 ( 0.0%)
Missed introns: 0/201643 ( 0.0%)
Novel introns: 0/201643 ( 0.0%)
Missed loci: 0/54185 ( 0.0%)
Novel loci: 0/54185 ( 0.0%)
Total union super-loci across all input datasets: 54185
54384 out of 54384 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
Make gtf list text file for count matrix generation
for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt
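prepDE.py uses the first column of this file as the sample label in the count matrix and the second as the path to that sample's GTF, so it's worth a quick spot-check:
# Each line should read "sample_label /path/to/sample.gtf"
head -2 listGTF.txt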
Run as a job. nano prepDE.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts
module load uri/main
module load Python/2.7.18-GCCcore-9.3.0
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round5
echo "Creating count matrices " $(date)
python /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts/prepDE.py -g Mcap_polyA_gene_count_matrix.csv -i listGTF.txt
echo "Count matrices complete" $(date)
Submitted batch job 33422229. Move the csvs to the output folder and download them to my computer.
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/trim_round5
mv *count_matrix.csv /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/
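To pull the matrices down to a local machine, something like this from the local side (the "unity" hostname is a placeholder for your SSH alias):
# Run locally; copies the count matrices from Unity to the current directory
rsync -avz unity:/work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/output/*count_matrix.csv .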