Developmental 2023 Timeseries mRNA analysis - egg and sperm

I have sequenced and analyzed samples from the following time points: 1, 4, 9, 14, 22, 28, 48, and 72 hpf. The GitHub repo for this project is here. During the closed portion of my defense, we discussed incorporating unfertilized egg and sperm samples into the analyses to understand the relative contributions of the sperm vs. the egg to the mRNA complement. In my 2023 experiment, I did not collect sperm or unfertilized egg samples, so I am going to use the sperm and egg samples from Van Etten et al. 2020, which analyzed sperm and unfertilized egg samples in Mcap. On accessing the sample fastq files, the paper states: “The egg data are publicly available under NCBI BioProject PRJNA616341 (SAMN14486762, SAMN14486763, SAMN14486764) and the sperm data are publicly available under NCBI BioProject PRJNA339779.”
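Those BioProject/BioSample IDs can be translated into SRR run accessions with NCBI Entrez Direct (just a sketch; it assumes esearch/efetch are available on the system, e.g., via a module or conda environment):

# List run accessions (first column of runinfo) for each BioProject
esearch -db sra -query PRJNA616341 | efetch -format runinfo | cut -d',' -f1   # egg
esearch -db sra -query PRJNA339779 | efetch -format runinfo | cut -d',' -f1   # sperm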

Here are the samples that I need to download from NCBI (the run accessions are listed in the prefetch script below):

The egg and sperm libraries were generated using different methods, which is confusing/annoying.

|                  | Sperm | Egg |
|------------------|-------|-----|
| Library prep kit | Illumina TruSeq RNA Library Prep Kit v2 | Standard Illumina strand-specific RNA-seq prep with polyA selection |
| Sequencer        | Illumina MiSeq flowcell using the Illumina MiSeq Reagent Kit v3 | Illumina HiSeq |
| Configuration    | Single end | Paired end |

To download these sequences from NCBI, I need to run SRA toolkit on Unity (see example from ZD notebook).

nano prefetch.sh

#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts

module load uri/main
module load SRA-Toolkit/3.0.3-gompi-2022a

cd /project/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/

prefetch --max-size 30GB SRR4048723 # sperm sample
prefetch --max-size 30GB SRR11452263 # egg sample
prefetch --max-size 30GB SRR11452262 # egg sample
prefetch --max-size 30GB SRR11452251 # egg sample

Submitted batch job 33030565. Downloaded successfully. I now need to convert the .sra files to fastq, also with the SRA Toolkit.

nano fasterq.sh

#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/polyA/scripts

module load uri/main
module load SRA-Toolkit/3.0.3-gompi-2022a

cd /project/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/

fasterq-dump SRR11452251
fasterq-dump SRR11452262
fasterq-dump SRR11452263
fasterq-dump SRR4048723

Submitted batch job 33035417.
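As a quick sanity check on the dumped fastq files (not part of the original run), read counts can be tallied per file; for the paired-end egg samples, the _1 and _2 counts should match:

cd /project/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/

# A fastq record is 4 lines, so reads = lines / 4
for f in *.fastq; do
    n=$(wc -l < "$f")
    echo "$f: $((n / 4)) reads"
done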

Make directories on Unity (a sketch of the setup is below), then symlink the raw data from the project directory into the work directory.
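For reference, the /work directory setup might look something like this (paths are taken from the scripts later in this post; the exact layout is an assumption):

mkdir -p /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts
mkdir -p /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/data/raw
mkdir -p /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/output/fastqc/raw
mkdir -p /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/output/fastqc/trim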

cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/data/raw
ln -s /project/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/*fastq .

Run fastqc on the raw reads. nano raw_fastqc.sh

#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --signal=2
#SBATCH --no-requeue
#SBATCH --mem=100GB
#SBATCH -t 12:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts 

# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
module load uri/main
module load all/MultiQC/1.12-foss-2021b

cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/data/raw

# Create an array of fastq files to process
files=($('ls' *.fastq)) 

# Run fastqc in parallel
echo "Starting fastqc..." $(date)
parallel -j 20 "fastqc {} -o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/output/fastqc/raw && echo 'Processed {}'" ::: "${files[@]}"
echo "fastQC done." $(date)

cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/output/fastqc/raw

echo "Starting multiqc..." $(date)
multiqc *

echo "Initial QC of egg/sperm data complete." $(date)

Submitted batch job 33036909. The data are super clean! I need to do some adapter trimming, but other than that, it's very high quality.

nano trim_cutadapt.sh

#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=100GB
#SBATCH -t 12:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts 

# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
module load uri/main
module load all/cutadapt/3.5-GCCcore-11.2.0
module load all/MultiQC/1.12-foss-2021b

cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/data/raw

# Adapter sequence (Illumina universal)
ADAPTER=AGATCGGAAGAGC

echo "Trim egg samples (PE)" $(date)
# Paired-end files
for ID in SRR11452251 SRR11452262 SRR11452263; do
  cutadapt -a $ADAPTER -A $ADAPTER  -q 20,20 --minimum-length=20 -o ${ID}_1_AdapterTrimmed.fastq -p ${ID}_2_AdapterTrimmed.fastq ${ID}_1.fastq ${ID}_2.fastq
done

echo "Trim sperm samples (SE)" $(date)
# Single-end file
cutadapt -a $ADAPTER  -q 20,20 --minimum-length=20 -o SRR4048723_AdapterTrimmed.fastq SRR4048723.fastq

mv *AdapterTrimmed.fastq /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim

# Create an array of fastq files to process
files=($('ls' *.fastq)) 

# Run fastqc in parallel
echo "Starting fastqc..." $(date)
parallel -j 20 "fastqc {} -o /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/output/fastqc/trim && echo 'Processed {}'" ::: "${files[@]}"
echo "fastQC done." $(date)

cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/output/fastqc/trim

echo "Starting multiqc..." $(date)
multiqc *
echo "MultiQC complete..." $(date)

Submitted batch job 33041964. Trimming looks great! Trimmed QC can be found here.

Time to align! nano align.sh

#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=100GB
#SBATCH -t 12:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts 

# load modules needed
module load uri/main
module load all/HISAT2/2.2.1-gompi-2022a
module load all/SAMtools/1.18-GCC-12.3.0

echo "Building genome reference" $(date)

# index the Mcap reference genome; write the index files to the genome directory
hisat2-build -f /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.assembly.fasta /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/McapV3_hisat2_ref
echo "Referece genome indexed. Starting alingment" $(date)

# This script exports alignments as bam files
# sorts the bam file because Stringtie takes a sorted file for input (--dta)
# removes the sam file because it is no longer needed

cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim

echo "Align egg samples (PE)" $(date)
# Paired-end files
for ID in SRR11452251 SRR11452262 SRR11452263; do
    R1="${ID}_1_AdapterTrimmed.fastq"
    R2="${ID}_2_AdapterTrimmed.fastq"
    sample_name="${ID}"
    
    hisat2 -p 8 --rna-strandness RF --dta \
        -x /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/McapV3_hisat2_ref \
        -1 ${R1} -2 ${R2} -S ${sample_name}.sam

    samtools sort -@ 8 -o ${sample_name}.bam ${sample_name}.sam
    echo "${sample_name} bam-ified!"
    rm ${sample_name}.sam
done
echo "Egg samples (PE) aligned!" $(date)

echo "Align sperm samples (SE)" $(date)
ID="SRR4048723"
READ="${ID}_AdapterTrimmed.fastq"

hisat2 -p 8 --dta \
    -x /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/McapV3_hisat2_ref \
    -U ${READ} -S ${ID}.sam

samtools sort -@ 8 -o ${ID}.bam ${ID}.sam
echo "${ID} bam-ified!"
rm ${ID}.sam
echo "Sperm samples (SE) aligned!" $(date)

echo "Alignment complete!" $(date)

Submitted batch job 33167366. Important note: for the sperm sample, do not specify --rna-strandness, as that library is unstranded; for the egg samples, set --rna-strandness RF.
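If I wanted to confirm strandedness empirically rather than relying on the library prep descriptions, RSeQC's infer_experiment.py is one option. This is only a sketch; it assumes RSeQC is installed and that a BED12 file of the Mcap gene models (here called Mcap_genes.bed12) has been generated separately:

# Hypothetical check, not run as part of this workflow
# For PE data, a high "1+-,1-+,2++,2--" fraction indicates an RF (reverse-stranded) library
infer_experiment.py -r Mcap_genes.bed12 -i SRR11452251.bam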

Alignment complete! Took 1.5 hours.

19257744 reads; of these:
  19257744 (100.00%) were paired; of these:
    7874384 (40.89%) aligned concordantly 0 times
    7691727 (39.94%) aligned concordantly exactly 1 time
    3691633 (19.17%) aligned concordantly >1 times
    ----
    7874384 pairs aligned concordantly 0 times; of these:
      146546 (1.86%) aligned discordantly 1 time
    ----
    7727838 pairs aligned 0 times concordantly or discordantly; of these:
      15455676 mates make up the pairs; of these:
        14657075 (94.83%) aligned 0 times
        599112 (3.88%) aligned exactly 1 time
        199489 (1.29%) aligned >1 times
61.94% overall alignment rate
[bam_sort_core] merging from 3 files and 8 in-memory blocks...
22408812 reads; of these:
  22408812 (100.00%) were paired; of these:
    6983155 (31.16%) aligned concordantly 0 times
    8980110 (40.07%) aligned concordantly exactly 1 time
    6445547 (28.76%) aligned concordantly >1 times
    ----
    6983155 pairs aligned concordantly 0 times; of these:
      226933 (3.25%) aligned discordantly 1 time
    ----
    6756222 pairs aligned 0 times concordantly or discordantly; of these:
      13512444 mates make up the pairs; of these:
        12316684 (91.15%) aligned 0 times
        846830 (6.27%) aligned exactly 1 time
        348930 (2.58%) aligned >1 times
72.52% overall alignment rate
[bam_sort_core] merging from 5 files and 8 in-memory blocks...
21204959 reads; of these:
  21204959 (100.00%) were paired; of these:
    5630028 (26.55%) aligned concordantly 0 times
    10575974 (49.88%) aligned concordantly exactly 1 time
    4998957 (23.57%) aligned concordantly >1 times
    ----
    5630028 pairs aligned concordantly 0 times; of these:
      188403 (3.35%) aligned discordantly 1 time
    ----
    5441625 pairs aligned 0 times concordantly or discordantly; of these:
      10883250 mates make up the pairs; of these:
        9800975 (90.06%) aligned 0 times
        819802 (7.53%) aligned exactly 1 time
        262473 (2.41%) aligned >1 times
76.89% overall alignment rate
[bam_sort_core] merging from 4 files and 8 in-memory blocks...
17901354 reads; of these:
  17901354 (100.00%) were unpaired; of these:
    3546701 (19.81%) aligned 0 times
    12504646 (69.85%) aligned exactly 1 time
    1850007 (10.33%) aligned >1 times
80.19% overall alignment rate
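For a quick cross-check of these rates against the sorted BAMs, samtools flagstat can be run on each file (a sketch, using the same SAMtools module as the alignment script; run interactively or in a small job):

module load uri/main
module load all/SAMtools/1.18-GCC-12.3.0
cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim

for bam in *.bam; do
    echo "== ${bam} =="
    samtools flagstat ${bam}
done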

Alignment rates look great. Time to assemble! nano assemble.sh

#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=100GB
#SBATCH -t 12:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts 

# load modules needed
module load uri/main
module load all/StringTie/2.2.1-GCC-11.2.0

echo "Assembling transcripts using stringtie" $(date)

cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim

array1=($(ls *.bam)) #Make an array of sequences to assemble

for i in ${array1[@]}; do
    stringtie -p 8 -e -B -G /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -A ${i}.gene_abund.tab -o ${i}.gtf ${i}
done

echo "Assembly complete " $(date)

Submitted batch job 33316146
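Once that job finishes, a quick look at the per-sample outputs confirms that StringTie produced a GTF and a gene abundance table for each BAM (a sketch; not part of the original job):

cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim
ls -lh *.bam.gtf *.gene_abund.tab

# Count transcript rows in each per-sample GTF (column 3 is the feature type)
for g in *.bam.gtf; do
    n=$(awk '$3 == "transcript"' "$g" | wc -l)
    echo "$g: $n transcripts"
done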

Do the rest of the assembly in interactive mode.

cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim
salloc 
module load uri/main
module load all/StringTie/2.2.1-GCC-11.2.0

# List of gtfs
ls *.gtf > gtf_list.txt

# Merge gtfs into single gtf 
stringtie --merge -e -p 8 -G /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -o Mcap_egg_sperm_merged.gtf gtf_list.txt #Merge GTFs 

module load all/GffCompare/0.12.6-GCC-11.2.0
gffcompare -r /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -G -o merged Mcap_egg_sperm_merged.gtf #Compute the accuracy 
54384 reference transcripts loaded.
  2 duplicate reference transcripts discarded.
  54384 query transfrags loaded.

Look at merge stats

# gffcompare v0.12.6 | Command line was:
#gffcompare -r /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes.gff3 -G -o merged Mcap_egg_sperm_merged.gtf
#
#= Summary for dataset: Mcap_egg_sperm_merged.gtf 
#     Query mRNAs :   54384 in   54185 loci  (36023 multi-exon transcripts)
#            (141 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :   54382 in   54185 loci  (36023 multi-exon)
# Super-loci w/ reference transcripts:    54185
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |   100.0    |
        Exon level:   100.0     |   100.0    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |
     Matching intron chains:   36023
       Matching transcripts:   54375
              Matching loci:   54185
          Missed exons:       0/256029  (  0.0%)
           Novel exons:       0/256029  (  0.0%)
        Missed introns:       0/201643  (  0.0%)
         Novel introns:       0/201643  (  0.0%)
           Missed loci:       0/54185   (  0.0%)
            Novel loci:       0/54185   (  0.0%)
 Total union super-loci across all input datasets: 54185 
54384 out of 54384 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)

Make gtf list text file for count matrix generation

for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt

Download the prepDE.py script from here and put it in the scripts folder, then load Python and compile the gene count matrix.
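For reference, the download might look like this (the raw GitHub URL is an assumption; prepDE.py lives in the StringTie repository):

cd /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts
# Assumed location of prepDE.py in the gpertea/stringtie repo
wget https://raw.githubusercontent.com/gpertea/stringtie/master/prepDE.py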

salloc 
module load uri/main
module load Python/2.7.18-GCCcore-9.3.0

python /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts/prepDE.py -g Mcap_egg_sperm_gene_count_matrix.csv -i listGTF.txt

It keeps failing with an Illegal instruction error. Try running it as a job instead. nano prepDE.sh

#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=100GB
#SBATCH -t 12:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts 

module load uri/main
module load Python/2.7.18-GCCcore-9.3.0

cd /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim

echo "Creating count matrices " $(date)

python /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/scripts/prepDE.py -g Mcap_egg_sperm_gene_count_matrix.csv -i listGTF.txt

echo "Count matrices complete" $(date)

Submitted batch job 33316331. Success! Move the csvs to the work directory.

mv /scratch/workspace/jillashey_uri_edu-ashey_scratch/Mcap2023/egg_sperm_trim/*csv /work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/output

Download csvs to computer.
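From my local machine, that might look something like this with scp (the login hostname, username, and destination path are all assumptions):

# Run locally; Unity login hostname and username format are assumptions
scp jillashey_uri_edu@unity.rc.umass.edu:/work/pi_hputnam_uri_edu/jillashey/Mcap_2023/egg_sperm/output/*csv ~/Downloads/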

Written on April 15, 2025