Developmental 2023 Timeseries mRNA analysis (initial)
Developmental 2023 Timeseries mRNA analysis (initial)
These data came from my developmental timeseries experiment in 2023 with Montipora capitata in Hawaii. In this experiment, Montipora capitata embryos and larvae were exposed to ambient and heat stress over 72 hours from embryo to swimming larvae. Samples were collected at 8 time points: 1, 4, 9, 14, 22, 28, 48 and 72 hpf. The github for this project is here.
To see which time points we want to sequence further, we ran an intial library prep and sequencing run on 8 samples (n=1 from each time point at ambient). The library prep was done by me using the Zymo Ribofree library prep kit (see my notebook posts for more information on library prep). The sequencing was done through Oklahoma Medical Research Foundation NGS Core, which I found on Genohub. Here is the sequencing information:
- Instrument: Illumina NovaSeq X Plus - 10B - PE 150 Cycle
- Read length: 2 x 150bp (Paired End)
- Pricing unit: per lane
- Number of samples: 8 libraries
- Guaranteed number of pass filter PE reads/sample: 20M (10M in each direction)
- Deliverables: FastQ files uploaded to Genohub project bucket
Files were downloaded to this location on Andromeda: /data/putnamlab/KITT/hputnam/20240328_Mcap_RNASeq_Devo/
. Time to analyze! I’m going to write my notebook in chronological order by date and then will reorganize once the workflow is complete.
20240328
Sequences were received from sequencer today! Make a new directory in my own folder on Andromeda for this project:
cd /data/putnamlab/jillashey
mkdir DT_Mcap_2023
cd DT_Mcap_2023
mkdir mRNA
cd mRNA
mkdir data scripts output
cd data
mkdir raw trim
cd raw
Copy the files into the raw data folder
cp /data/putnamlab/KITT/hputnam/20240328_Mcap_RNASeq_Devo/* .
Check md and initial raw QC was already done by Hollie. Raw QC is here. Overall, QC looks pretty good. The phred scores are high across the board. There is a decent amount of duplication but this is not unusual in early life stage samples. There is quite a bit of adapter content present so that definitely needs to come out. I’m going to use fastp to trim the reads. In the scripts folder: nano fastp_qc.sh
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/data/raw/
echo "Count number of reads in each file" $(date)
zgrep -c "@LH00" *.gz > raw_read_count.txt
echo "Start read trimming with fastp, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# fastp and fastqc loop
for i in ${array1[@]}; do
fastp --in1 ${i} \
--in2 $(echo ${i}|sed s/_R1/_R2/)\
--out1 /data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.${i} \
--out2 /data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.$(echo ${i}|sed s/_R1/_R2/) \
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--unqualified_percent_limit 10 \
--length_required 100 \
--cut_right cut_right_window_size 5 cut_right_mean_quality 20 \
fastqc /data/putnamlab/jillashey/DT_Mcap_2023/output/fastqc/trim/trim.${i} \
fastqc /data/putnamlab/jillashey/DT_Mcap_2023/output/fastqc/trim/trim.$(echo ${i}|sed s/_R1/_R2/)
done
echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/output/fastqc/trim
multiqc *
echo "MultiQC complete" $(date)
Submitted batch job 310216
For whatever reason, the QC portion of the script didn’t run, so I’m going to run a QC trim script. In the scripts folder: nano fastqc_trim.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "Count number of reads in each trimmed file" $(date)
zgrep -c "@LH00" /data/putnamlab/jillashey/DT_Mcap_2023/data/trim/*.gz > /data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim_read_count.txt
echo "Start fastqc" $(date)
for file in /data/putnamlab/jillashey/DT_Mcap_2023/data/trim/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/jillashey/DT_Mcap_2023/output/fastqc/trim
done
echo "Fastqc complete, start multiqc" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/output/fastqc/trim
multiqc *
echo "multiqc complete!" $(date)
Submitted batch job 310233. Here are the raw read counts:
cd /data/putnamlab/jillashey/DT_Mcap_2023/data/raw
less raw_read_count.txt
13_S31_R1_001.fastq.gz:25264563
13_S31_R2_001.fastq.gz:25264563
23_S32_R1_001.fastq.gz:36786710
23_S32_R2_001.fastq.gz:36786710
35_S33_R1_001.fastq.gz:27430001
35_S33_R2_001.fastq.gz:27430001
52_S34_R1_001.fastq.gz:41778167
52_S34_R2_001.fastq.gz:41778167
60_S35_R1_001.fastq.gz:43121328
60_S35_R2_001.fastq.gz:43121328
72_S36_R1_001.fastq.gz:33397282
72_S36_R2_001.fastq.gz:33397282
85_S37_R1_001.fastq.gz:24484998
85_S37_R2_001.fastq.gz:24484998
9_S30_R1_001.fastq.gz:30559603
9_S30_R2_001.fastq.gz:30559603
Here are the trimmed read counts:
cd /data/putnamlab/jillashey/DT_Mcap_2023/data/raw
less trim_read_count.txt
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.13_S31_R1_001.fastq.gz:17766477
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.13_S31_R2_001.fastq.gz:17766477
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.23_S32_R1_001.fastq.gz:24346211
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.23_S32_R2_001.fastq.gz:24346211
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.35_S33_R1_001.fastq.gz:17755579
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.35_S33_R2_001.fastq.gz:17755579
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.52_S34_R1_001.fastq.gz:26921564
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.52_S34_R2_001.fastq.gz:26921564
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.60_S35_R1_001.fastq.gz:27316402
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.60_S35_R2_001.fastq.gz:27316402
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.72_S36_R1_001.fastq.gz:20445985
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.72_S36_R2_001.fastq.gz:20445985
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.85_S37_R1_001.fastq.gz:16126033
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.85_S37_R2_001.fastq.gz:16126033
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.9_S30_R1_001.fastq.gz:21574873
/data/putnamlab/jillashey/DT_Mcap_2023/data/trim/trim.9_S30_R2_001.fastq.gz:21574873
A decent amount of reads were removed from the samples. Between 60-70% of reads were retained in these samples. This is slightly unusual to me, as the raw QC looked decent (esp in terms of high phred scores) and in other datasets, I usually retain 80-90% of the reads. This is the first time I’m using genohub/OK sequencing center so maybe thats it? There is also a high duplication rate, which may be contributing. The QC for trimmed reads ran in about an hour. Trimmed QC is here. The adapter content is all gone which is great. Quality still looks good. In samples 13 and 30, there is a small spike in the GC normal distribution but overall, still looks normally distributed.
Align reads to Mcap genome using hisat2. I’m using Mcap genome V3, which I obtained from the Rutgers website on 3/28/24. In the scripts folder: nano align.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# load modules needed
module load HISAT2/2.2.1-foss-2019b #Alignment to reference genome: HISAT2
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
cd /data/putnamlab/jillashey/DT_Mcap_2023/data/trim/
echo "Building genome reference" $(date)
# index the reference genome for Mcap output index to working directory
hisat2-build -f /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta Mcap_ref
echo "Referece genome indexed. Starting alingment" $(date)
# This script exports alignments as bam files
# sorts the bam file because Stringtie takes a sorted file for input (--dta)
# removes the sam file because it is no longer needed
array=($(ls *_R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
sample_name=`echo $i| awk -F [.] '{print $2}'`
hisat2 -p 8 --rna-strandness RF --dta -x Mcap_ref -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -S ${sample_name}.sam
samtools sort -@ 8 -o ${sample_name}.bam ${sample_name}.sam
echo "${i} bam-ified!"
rm ${sample_name}.sam
done
echo "Alignment complete!" $(date)
echo "View mapping percentages " $(date)
for i in *.bam; do
echo "${i}" >> mapped_reads_counts_Mcap.txt
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_Mcap.txt
done
Submitted batch job 310234. Ran in 2 hours. Here’s the mapping information:
13_S31_R1_001.bam
17303971 + 0 mapped (45.92% : N/A)
23_S32_R1_001.bam
39709957 + 0 mapped (68.10% : N/A)
35_S33_R1_001.bam
27912358 + 0 mapped (66.19% : N/A)
52_S34_R1_001.bam
45865671 + 0 mapped (74.17% : N/A)
60_S35_R1_001.bam
46178976 + 0 mapped (73.88% : N/A)
72_S36_R1_001.bam
36113018 + 0 mapped (76.53% : N/A)
85_S37_R1_001.bam
32246414 + 0 mapped (79.69% : N/A)
9_S30_R1_001.bam
26154642 + 0 mapped (56.50% : N/A)
These mapping percentages look pretty good. I’m not surprised that sample 13 is so low. This sample is from 4 hpf where I messed up the sampling protocol and added too much salt water (with larvae) to shield. While there was RNA in the samples, the RNA appeared relatively degraded on the gel (see gel here). Move the bam files + mapped percentages text file to the hisat2 output folder.
cd /data/putnamlab/jillashey/DT_Mcap_2023/data/trim
mv *bam ../../output/hisat2/
mv mapped_reads_counts_Mcap.txt ../../output/hisat2/
Assemble reads using stringtie. In the scripts folder: nano assemble.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load StringTie/2.2.1-GCC-11.2.0
echo "Assembling transcripts using stringtie" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/output/hisat2
array1=($(ls *.bam)) #Make an array of sequences to assemble
for i in ${array1[@]}; do
stringtie -p 8 -e -B -G /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.genes.gff3 -A ${i}.gene_abund.tab -o ${i}.gtf ${i}
done
echo "Assembly for each sample complete " $(date)
Submitted batch job 310259. Move the gtf and tab files to the stringtie output folder.
cd /data/putnamlab/jillashey/DT_Mcap_2023/output/hisat2
mv *gtf ../stringtie/
mv *tab ../stringtie/
cd ../stringtie
Make a list of gtfs
ls *.gtf > gtf_list.txt
Merge gtfs into a single gtf with stringtie
module purge
module load StringTie/2.1.4-GCC-9.3.0
stringtie --merge -e -p 8 -G /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.genes.gff3 -o Mcap_merged.gtf gtf_list.txt #Merge GTFs
wc -l Mcap_merged.gtf
310417 Mcap_merged.gtf
Assess accuracy of merged assembly
module purge
module load GffCompare/0.12.1-GCCcore-8.3.0
gffcompare -r /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.genes.gff3 -G -o merged Mcap_merged.gtf #Compute the accuracy
54384 reference transcripts loaded.
2 duplicate reference transcripts discarded.
54384 query transfrags loaded.
Check out the merged file
less merged.stats
#= Summary for dataset: Mcap_merged.gtf
# Query mRNAs : 54384 in 54185 loci (36023 multi-exon transcripts)
# (141 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 54382 in 54185 loci (36023 multi-exon)
# Super-loci w/ reference transcripts: 54185
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 100.0 | 100.0 |
Intron level: 100.0 | 100.0 |
Intron chain level: 100.0 | 100.0 |
Transcript level: 100.0 | 100.0 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 36023
Matching transcripts: 54363
Matching loci: 54183
Missed exons: 0/256029 ( 0.0%)
Novel exons: 0/256028 ( 0.0%)
Missed introns: 0/201643 ( 0.0%)
Novel introns: 0/201643 ( 0.0%)
Missed loci: 0/54185 ( 0.0%)
Novel loci: 0/54185 ( 0.0%)
Total union super-loci across all input datasets: 54185
54384 out of 54384 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
Make gtf list text file for gene count matrix creation
for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt
Download the prepDE.py script from here and put it in the stringtie output folder. Load python and compile the gene count matrix
module purge
module load Python/2.7.18-GCCcore-9.3.0
python prepDE.py -g Mcap_gene_count_matrix.csv -i listGTF.txt
wc -l Mcap_gene_count_matrix.csv
54385 Mcap_gene_count_matrix.csv
Copy the matrices onto my local computer. Woohoo!
20240919
QC for Mcap DT 2023 data that came back from sequencer 9/19/24
less 10_S12_R1_001.fastq.gz
@LH00260:104:22GLL5LT4:2:1101:13405:1070 1:N:0:ACTAAGATAT+CCGCGGTTGT
CNTTGTGATCAGTCCCGGGTGATCGGAAGAGCACACGTCTGAACTCCAGTCACACTAAGATATCTGGGGGGGCGTGTTTTTTTTTTGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
I#IIIIII99999II9IIIIIIIIIII--IIII9IIIIIIIIIIIIIIIIIIII-I-II9-I99--9-I9I9----999I999-9------9-999--999999999999999IIIII9IIIIIIIIIIIIIIIIIIIIII9IIIIIIIII
less 10_S12_R2_001.fastq.gz
@LH00260:104:22GLL5LT4:2:1101:13405:1070 2:N:0:ACTAAGATAT+CCGCGGTTGT
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
IIIIIIIII-IIIIIIIII9III999I9IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
less 10_S28_R1_001.fastq.gz
@LH00260:104:22GLL5LT4:3:1101:3039:1028 1:N:0:ACTAAGATAT+CNGCGGTTGT
CNCCCCTCTTGCCAGATACTTGTTGTCAGATGTAACAGCTAGGCACAGGACATGTCCCATGTGGCCCTTATGTCCATCATTAGCCTTGTGGCAACCCGCTATCTTTCCTTGTTTGCAGCCAGTTTCAATATCCCACTTGATAATAGAACAG
+
I#9I9IIIIIIIIIIIII-9IIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIII-IIIII9I9II-III9I-IIIIII9IIIII99I-I9II9I--II--II999IIIIIIIIII9-III9I9IIIIIIIII9-I9I99IIIII-9-II99
less 10_S28_R2_001.fastq.gz
@LH00260:104:22GLL5LT4:3:1101:3039:1028 2:N:0:ACTAAGATAT+CNGCGGTTGT
TGCTGTCAGAGTGCTCAAGGGTCATCAGCTGTCTGTCACCTGTCTCTGTGTGTCCCCTGCTGACAAGTTTGTTTTCTCTGGGTCTAAGGCCTGTTCCATTATCAAGTGGGATATTGAAACTGGCTGCAAACAAGGAAAGATATCGGGTTGC
+
I9IIIIIIIIIIIIIIIIIII-9IIIIIIIIIIIIIIII9IIIII9IIII9IIIII-II9III9I9III9IIII99IIIII9IIIIIII-II-IIII99IIII9IIIIIIIIIIIIII9III9IIIIII9IIIIIIIIIIIIIII9IIIII
which is small RNA sequence and which is RNA sequence?
nano fastqc_all_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/test
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "Start fastqc" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/test/
for file in *fastq.gz
do
fastqc $file
done
echo "Fastqc complete, start multiqc" $(date)
multiqc *
echo "multiqc complete!" $(date)
Submitted batch job 338507
20240924
md5sum *RNA*.fastq.gz
702e86f018e45af059dcf669dd9e9b69 10_RNAseq_S28_R1_001.fastq.gz
683c7346bc1540d5a9b683fcafe8d839 10_RNAseq_S28_R2_001.fastq.gz
6a8e0308d34a7fe8e1dd950093f2d8cc 10_small_RNA_S4_R1_001.fastq.gz
5df1e17c449a0b2ca1709d23d9218a56 10_small_RNA_S4_R2_001.fastq.gz
4f46b70db8b4fd7f4c5f7c6d238ff672 11_RNAseq_S29_R1_001.fastq.gz
2c9db533ddd6fd9d1c0bf317ef64f512 11_RNAseq_S29_R2_001.fastq.gz
cbc48eb41ca388fb271a134d1c91cf06 11_small_RNA_S5_R1_001.fastq.gz
8e38afa5a4fa8e7aba3a66673dda99de 11_small_RNA_S5_R2_001.fastq.gz
9c23113bfd69cef0307438ca7279b117 14_RNAseq_S30_R1_001.fastq.gz
d27a1c25717ed55a2f19c279ed856509 14_RNAseq_S30_R2_001.fastq.gz
bb71fd3c990ce6f43be33a9e64520a7a 14_small_RNA_S6_R1_001.fastq.gz
99463142ad98ae02877b22ab4f227657 14_small_RNA_S6_R2_001.fastq.gz
bf857b42e61e2acfe5683537ccaebc8f 24_RNAseq_S31_R1_001.fastq.gz
2caea6f8376cb1def3f4d56c4a34794f 24_RNAseq_S31_R2_001.fastq.gz
f633b41622ecb8e3c046183dfd991ed0 24_small_RNA_S7_R1_001.fastq.gz
b38dee47e573c9a3eadcf79462bfb539 24_small_RNA_S7_R2_001.fastq.gz
466170e4bcfd6212622f32c554d52310 26_RNAseq_S32_R1_001.fastq.gz
550f784d7010a90a7979261acc4c247b 26_RNAseq_S32_R2_001.fastq.gz
2fcfbaa6de7007ada3af07b428dbd808 26_small_RNA_S8_R1_001.fastq.gz
9580845c7ca925b374459502f02fcc88 26_small_RNA_S8_R2_001.fastq.gz
8c0f6b8c1eb6544f22837be614d986da 28_RNAseq_S33_R1_001.fastq.gz
d3287f2a30504dfc950003d205cb648f 28_RNAseq_S33_R2_001.fastq.gz
3a63dab0d402c9c7b16db1370217ffa6 28_small_RNA_S9_R1_001.fastq.gz
0c56c73926283b43e6869996211be968 28_small_RNA_S9_R2_001.fastq.gz
5ecd3e61c2a8ed327b8115ae1e3c39eb 36_RNAseq_S34_R1_001.fastq.gz
37213fa3c4293381337465cad6d8ac9b 36_RNAseq_S34_R2_001.fastq.gz
84d42290a59f7f0cd1ce0214fd2031c6 36_small_RNA_S10_R1_001.fastq.gz
2275f89f69dec7f9f46f7969d91dd211 36_small_RNA_S10_R2_001.fastq.gz
300af022d27f8c796ddae7ad6cfd09b5 37_RNAseq_S35_R1_001.fastq.gz
7f0aa8a5fad7f1ef4e3d0b362debd9be 37_RNAseq_S35_R2_001.fastq.gz
83bdde7d4200b56eb81a80c371ff8530 37_small_RNA_S11_R1_001.fastq.gz
7c2bd1fc45277927f41eb771844e96a1 37_small_RNA_S11_R2_001.fastq.gz
8b68ef85d8a3427ea28b9b26602f6f08 39_RNAseq_S36_R1_001.fastq.gz
f619aa13b71e0fda37028361296c6890 39_RNAseq_S36_R2_001.fastq.gz
9712e83fccdee9c882721ce04a207162 39_small_RNA_S12_R1_001.fastq.gz
e5ead367c5104dfbbe7cecdc41589cb7 39_small_RNA_S12_R2_001.fastq.gz
6539a11de0504b9cd27ca0e99105eb31 47_RNAseq_S37_R1_001.fastq.gz
bcb55aadf544806e64973c8787270433 47_RNAseq_S37_R2_001.fastq.gz
e4f5279eef748705acc52b768d1ea06b 47_small_RNA_S13_R1_001.fastq.gz
35f57befe5167b7badfec76ed76164f1 47_small_RNA_S13_R2_001.fastq.gz
deeabbc714346a2be395ffee05534954 48_RNAseq_S38_R1_001.fastq.gz
3e57052be925bcf99a98d4bb4e2d0ef0 48_RNAseq_S38_R2_001.fastq.gz
f49407dd83da7ad68d6a353b1208f615 48_small_RNA_S14_R1_001.fastq.gz
995f6c2dbae731b5ae27e12f77966a4c 48_small_RNA_S14_R2_001.fastq.gz
81dd42bd91c0489520d5d6f43e93f7a7 51_RNAseq_S39_R1_001.fastq.gz
97005a5febfcd778642d296cfcff9804 51_RNAseq_S39_R2_001.fastq.gz
df7108ac712799380dc588e461d8faec 51_small_RNA_S15_R1_001.fastq.gz
15dd7cef395d93ad66412dba8576e022 51_small_RNA_S15_R2_001.fastq.gz
a17dfb969e4a6ac0a2b56aa037835ebd 61_RNAseq_S40_R1_001.fastq.gz
ef4dd53f6703e684bb048c6564c354cf 61_RNAseq_S40_R2_001.fastq.gz
3445966f7329f286a98a522835e398f8 61_small_RNA_S16_R1_001.fastq.gz
641c4aab05aead0e6493d54c906fc93e 61_small_RNA_S16_R2_001.fastq.gz
06111c15e28d3f0f27ecccc62906720d 62_RNAseq_S41_R1_001.fastq.gz
daced4a41b537bfad3788acc74465a15 62_RNAseq_S41_R2_001.fastq.gz
54468dc20565a6c032b6d73e6469fb46 62_small_RNA_S17_R1_001.fastq.gz
e28992a3905a6f178391d2dedc1f822f 62_small_RNA_S17_R2_001.fastq.gz
1c27d12f33601d51ea106d8ed48979e4 63_RNAseq_S42_R1_001.fastq.gz
16c999fa3820f42463791dcb03ec742b 63_RNAseq_S42_R2_001.fastq.gz
3a72bf0f147ba3d76251b62d2ea3c1dc 63_small_RNA_S18_R1_001.fastq.gz
6c243256ab289457fda664e86fc37d76 63_small_RNA_S18_R2_001.fastq.gz
26573c077d8d8da839ae711cedf42169 6_RNAseq_S25_R1_001.fastq.gz
3b7192772b515e3a2042f2dbf93c49f7 6_RNAseq_S25_R2_001.fastq.gz
652a9bbfd2eea5f6386be36ecb3fe7f7 6_small_RNA_S1_R1_001.fastq.gz
032ec5fd341c19df7d459a9509b64530 6_small_RNA_S1_R2_001.fastq.gz
b919bc773e19bc85c474d9b4d186a034 73_RNAseq_S43_R1_001.fastq.gz
799498b20adb92837e50f5c895caa54e 73_RNAseq_S43_R2_001.fastq.gz
bbf88e6ae4e152a0df2f7aad92108235 73_small_RNA_S19_R1_001.fastq.gz
da5e7f96e85eb981aaa76c5352c1a49b 73_small_RNA_S19_R2_001.fastq.gz
c87909a4b6cab7c6e8c77d89d32c9c1c 74_RNAseq_S44_R1_001.fastq.gz
8df0ef1d222ddde86c4597f03b0460e6 74_RNAseq_S44_R2_001.fastq.gz
b922733a75397fc7fd813404b373495c 74_small_RNA_S20_R1_001.fastq.gz
4b23d7649be0cfa2717f378939c655b5 74_small_RNA_S20_R2_001.fastq.gz
6fef936167af9a4721e22e5d811cc736 75_RNAseq_S45_R1_001.fastq.gz
8d4eaa76e3c47a3912aea942a7aa1c4b 75_RNAseq_S45_R2_001.fastq.gz
155b7e19e5d7a0fd8741256b3439909e 75_small_RNA_S21_R1_001.fastq.gz
2585139012a94991a2d928c3d943181c 75_small_RNA_S21_R2_001.fastq.gz
3af0d57415887b98c7ba8664f4be2175 7_RNAseq_S26_R1_001.fastq.gz
e82390f5537be0bec7eac9a6494dd042 7_RNAseq_S26_R2_001.fastq.gz
53605664bacdbc1e37b3120e2ac1e715 7_small_RNA_S2_R1_001.fastq.gz
7e7f794900855f697e760346fd88eb4b 7_small_RNA_S2_R2_001.fastq.gz
1090d5de0a6f71add02897147e640f3f 86_RNAseq_S46_R1_001.fastq.gz
87d08fd60ca278fe9c146ed525964da7 86_RNAseq_S46_R2_001.fastq.gz
46f3c89fa284d6a66d7a20814fe1f0fe 86_small_RNA_S22_R1_001.fastq.gz
c1618b96765199bc4db850b6805f64df 86_small_RNA_S22_R2_001.fastq.gz
fffe1bb3b1f92327fab218cf707ebdaf 87_RNAseq_S47_R1_001.fastq.gz
74a26b03b2527e0f3c51ffc555ef2455 87_RNAseq_S47_R2_001.fastq.gz
1aa383be2213f2f7982e8d88eefafca0 87_small_RNA_S23_R1_001.fastq.gz
b3e1c3a9314aec52f69dd88c6f699d60 87_small_RNA_S23_R2_001.fastq.gz
b28bdacd246571e26e248c858c07721d 88_RNAseq_S48_R1_001.fastq.gz
8795588fb6d7b74b162735241750060d 88_RNAseq_S48_R2_001.fastq.gz
9b4562d4eb6f1d662b8a04d1293a9053 88_small_RNA_S24_R1_001.fastq.gz
9bb1259ee25a5d410a8fae5b7f007c9a 88_small_RNA_S24_R2_001.fastq.gz
70f7d1fb97f559af56814d7f4b3c02c8 8_RNAseq_S27_R1_001.fastq.gz
57cf892971851db9f64d51f24acaab6b 8_RNAseq_S27_R2_001.fastq.gz
2b43ef5206c66fba3499301bbf00d963 8_small_RNA_S3_R1_001.fastq.gz
f458050f088dcfda08219a4773f21092 8_small_RNA_S3_R2_001.fastq.gz
20240926
nano fastqc_RNAseq_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "Start fastqc" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/data/raw
for file in *fastq.gz
do
fastqc $file
done
echo "Fastqc complete, start multiqc" $(date)
multiqc *
echo "multiqc complete!" $(date)
Submitted batch job 340171. Raw QC for RNAseq data here.
nano fastqc_smallRNAseq_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "Start fastqc" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/smRNA/data/raw
for file in *fastq.gz
do
fastqc $file
done
echo "Fastqc complete, start multiqc" $(date)
multiqc *
echo "multiqc complete!" $(date)
Submitted batch job 340170
20241023
Now that I have all my data and have QCed the data, I am going to trim. For the RNAseq data, there are definitely some batch effects between the sequencing runs. May need to remove the samples that were sequenced earlier (March 2024) downstream.
Trim using fastp. In the scripts folder: nano XXX
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# Load modules
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/data/raw/
echo "Count number of reads in each file" $(date)
zgrep -c "@LH00" *.gz > raw_read_count.txt
echo "Start read trimming with fastp, followed by QC" $(date)
# Make an array of sequences to trim
array1=($(ls *R1_001.fastq.gz))
# fastp and fastqc loop
for i in ${array1[@]}; do
fastp --in1 ${i} \
--in2 $(echo ${i}|sed s/_R1/_R2/)\
--out1 /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/data/trim/trim.${i} \
--out2 /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/data/trim/trim.$(echo ${i}|sed s/_R1/_R2/) \
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--unqualified_percent_limit 10 \
--length_required 100 \
--cut_right cut_right_window_size 5 cut_right_mean_quality 20 \
fastqc /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/fastqc/trim/trim.${i} \
fastqc /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/fastqc/trim/trim.$(echo ${i}|sed s/_R1/_R2/)
done
echo "Read trimming of adapters and QC complete. Starting multiqc" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/fastqc/trim
multiqc *
echo "MultiQC complete" $(date)
echo "Count number of reads in each file" $(date)
zgrep -c "@LH00" *.gz > trim_read_count.txt
Submitted batch job 344847. QC looks good! Adapter content gone and phred score high across bases. Trimmed QC for RNAseq data here.
Align to Mcap V3 genome. In the scripts folder: nano align.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
# load modules needed
module load HISAT2/2.2.1-foss-2019b #Alignment to reference genome: HISAT2
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/data/trim/
echo "Building genome reference" $(date)
# index the reference genome for Mcap output index to working directory
hisat2-build -f /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta Mcap_ref
echo "Referece genome indexed. Starting alingment" $(date)
# This script exports alignments as bam files
# sorts the bam file because Stringtie takes a sorted file for input (--dta)
# removes the sam file because it is no longer needed
array=($(ls *_R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
sample_name=`echo $i| awk -F [.] '{print $2}'`
hisat2 -p 8 --rna-strandness RF --dta -x Mcap_ref -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -S ${sample_name}.sam
samtools sort -@ 8 -o ${sample_name}.bam ${sample_name}.sam
echo "${i} bam-ified!"
rm ${sample_name}.sam
done
echo "Alignment complete!" $(date)
echo "View mapping percentages " $(date)
for i in *.bam; do
echo "${i}" >> mapped_reads_counts_Mcap.txt
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts_Mcap.txt
done
Submitted batch job 344964. Mapping percentages:
10_RNAseq_S28_R1_001.bam
4239251 + 0 mapped (42.36% : N/A)
11_RNAseq_S29_R1_001.bam
5991800 + 0 mapped (33.98% : N/A)
13_S31_R1_001.bam
17304441 + 0 mapped (45.92% : N/A)
14_RNAseq_S30_R1_001.bam
8085710 + 0 mapped (45.42% : N/A)
23_S32_R1_001.bam
39709953 + 0 mapped (68.10% : N/A)
24_RNAseq_S31_R1_001.bam
13563672 + 0 mapped (62.72% : N/A)
26_RNAseq_S32_R1_001.bam
14010844 + 0 mapped (61.96% : N/A)
28_RNAseq_S33_R1_001.bam
11577688 + 0 mapped (64.26% : N/A)
35_S33_R1_001.bam
27912360 + 0 mapped (66.19% : N/A)
36_RNAseq_S34_R1_001.bam
15565022 + 0 mapped (64.14% : N/A)
37_RNAseq_S35_R1_001.bam
16439270 + 0 mapped (66.31% : N/A)
39_RNAseq_S36_R1_001.bam
15933330 + 0 mapped (70.04% : N/A)
47_RNAseq_S37_R1_001.bam
20838605 + 0 mapped (81.81% : N/A)
48_RNAseq_S38_R1_001.bam
20274902 + 0 mapped (80.73% : N/A)
51_RNAseq_S39_R1_001.bam
19029047 + 0 mapped (72.68% : N/A)
52_S34_R1_001.bam
45865662 + 0 mapped (74.17% : N/A)
60_S35_R1_001.bam
46178975 + 0 mapped (73.88% : N/A)
61_RNAseq_S40_R1_001.bam
12591636 + 0 mapped (63.30% : N/A)
62_RNAseq_S41_R1_001.bam
14910190 + 0 mapped (67.58% : N/A)
63_RNAseq_S42_R1_001.bam
20914310 + 0 mapped (79.37% : N/A)
6_RNAseq_S25_R1_001.bam
11665744 + 0 mapped (62.94% : N/A)
72_S36_R1_001.bam
36113020 + 0 mapped (76.53% : N/A)
73_RNAseq_S43_R1_001.bam
13295206 + 0 mapped (73.43% : N/A)
74_RNAseq_S44_R1_001.bam
13728145 + 0 mapped (68.90% : N/A)
75_RNAseq_S45_R1_001.bam
8505675 + 0 mapped (68.58% : N/A)
7_RNAseq_S26_R1_001.bam
11945532 + 0 mapped (58.78% : N/A)
85_S37_R1_001.bam
32246414 + 0 mapped (79.69% : N/A)
86_RNAseq_S46_R1_001.bam
15919552 + 0 mapped (75.01% : N/A)
87_RNAseq_S47_R1_001.bam
11527670 + 0 mapped (68.50% : N/A)
88_RNAseq_S48_R1_001.bam
13406820 + 0 mapped (71.74% : N/A)
8_RNAseq_S27_R1_001.bam
10389582 + 0 mapped (58.35% : N/A)
9_S30_R1_001.bam
26154642 + 0 mapped (56.50% : N/A)
Most samples ranged from 56-80% mapping, which is good. A couple of samples (10, 11, 13, 14) has <50% mapping. This is not unexpected, as these samples were all 4 hpf, which is a period of maternal/zygotic turnover and duplication. 4 hpf is also the timepoint where I messed up the sampling protocol and added too much salt water (with larvae) to shield. While there was RNA in the samples, the RNA appeared relatively degraded on the gel (see 13 in gel picture here).
Move the bam files + mapped percentages text file to the hisat2 output folder.
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/data/trim
mv *bam ../../output/hisat2/
mv mapped_reads_counts_Mcap.txt ../../output/hisat2/
Assemble reads using stringtie using the updated gff from AH (see code here). In the scripts folder: nano assemble.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load StringTie/2.2.1-GCC-11.2.0
echo "Assembling transcripts using stringtie" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/hisat2
array1=($(ls *.bam)) #Make an array of sequences to assemble
for i in ${array1[@]}; do
stringtie -p 8 -e -B -G /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.genes_fixed.gff3 -A ${i}.gene_abund.tab -o ${i}.gtf ${i}
done
echo "Assembly for each sample complete " $(date)
Submitted batch job 344979. Move the gtf and tab files to the stringtie output folder.
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/hisat2
mv *gtf ../stringtie/
mv *tab ../stringtie/
cd ../stringtie
Make a list of gtfs
ls *.gtf > gtf_list.txt
Merge gtfs into a single gtf with stringtie
interactive
module load StringTie/2.1.4-GCC-9.3.0
stringtie --merge -e -p 8 -G /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.genes_fixed.gff3 -o Mcap_merged.gtf gtf_list.txt #Merge GTFs
wc -l Mcap_merged.gtf
310417 Mcap_merged.gtf
Assess accuracy of merged assembly with gffcompare
module purge
module load GffCompare/0.12.1-GCCcore-8.3.0
gffcompare -r /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.genes_fixed.gff3 -G -o merged Mcap_merged.gtf #Compute the accuracy
54384 reference transcripts loaded.
2 duplicate reference transcripts discarded.
54384 query transfrags loaded.
Look at merge stats
less merged.stats
# gffcompare v0.12.1 | Command line was:
#gffcompare -r /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.genes_fixed.gff3 -G -o merged Mcap_merged.gtf
#
#= Summary for dataset: Mcap_merged.gtf
# Query mRNAs : 54384 in 54185 loci (36023 multi-exon transcripts)
# (141 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 54382 in 54185 loci (36023 multi-exon)
# Super-loci w/ reference transcripts: 54185
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 99.9 | 99.9 |
Intron level: 100.0 | 100.0 |
Intron chain level: 100.0 | 100.0 |
Transcript level: 99.9 | 99.9 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 36023
Matching transcripts: 54309
Matching loci: 54173
Missed exons: 0/256029 ( 0.0%)
Novel exons: 0/256026 ( 0.0%)
Missed introns: 0/201643 ( 0.0%)
Novel introns: 0/201643 ( 0.0%)
Missed loci: 0/54185 ( 0.0%)
Novel loci: 0/54185 ( 0.0%)
Total union super-loci across all input datasets: 54185
54384 out of 54384 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
head merged.tracking
TCONS_00000001 XLOC_000001 Montipora_capitata_HIv3___RNAseq.g4584.t1|Montipora_capitata_HIv3___RNAseq.g4584.t1 = q1:MSTRG.4|Montipora_capitata_HIv3___RNAseq.g4584.t1|13|0.000000|0.000000|1.125630|1854
TCONS_00000002 XLOC_000002 Montipora_capitata_HIv3___RNAseq.g4586.t1|Montipora_capitata_HIv3___RNAseq.g4586.t1 = q1:MSTRG.30|Montipora_capitata_HIv3___RNAseq.g4586.t1|22|0.000000|0.000000|0.144260|2124
TCONS_00000003 XLOC_000003 Montipora_capitata_HIv3___TS.g26272.t1|Montipora_capitata_HIv3___TS.g26272.t1 = q1:MSTRG.30|Montipora_capitata_HIv3___TS.g26272.t1|1|0.000000|0.000000|0.073721|589
TCONS_00000004 XLOC_000004 Montipora_capitata_HIv3___TS.g26273.t1|Montipora_capitata_HIv3___TS.g26273.t1 = q1:MSTRG.5|Montipora_capitata_HIv3___TS.g26273.t1|1|0.000000|0.000000|0.042288|612
TCONS_00000005 XLOC_000005 Montipora_capitata_HIv3___RNAseq.g4588.t1|Montipora_capitata_HIv3___RNAseq.g4588.t1 = q1:MSTRG.6|Montipora_capitata_HIv3___RNAseq.g4588.t1|1|0.000000|0.000000|0.016101|1101
TCONS_00000006 XLOC_000006 Montipora_capitata_HIv3___RNAseq.g4589.t1|Montipora_capitata_HIv3___RNAseq.g4589.t1 = q1:MSTRG.7|Montipora_capitata_HIv3___RNAseq.g4589.t1|92|0.000000|0.000000|0.033519|10527
TCONS_00000007 XLOC_000007 Montipora_capitata_HIv3___TS.g26276.t1|Montipora_capitata_HIv3___TS.g26276.t1 = q1:MSTRG.8|Montipora_capitata_HIv3___TS.g26276.t1|1|0.000000|0.000000|0.191457|261
TCONS_00000008 XLOC_000008 Montipora_capitata_HIv3___TS.g26277.t1|Montipora_capitata_HIv3___TS.g26277.t1 = q1:MSTRG.9|Montipora_capitata_HIv3___TS.g26277.t1|1|0.000000|0.000000|0.085318|273
TCONS_00000009 XLOC_000009 Montipora_capitata_HIv3___RNAseq.g4592.t1|Montipora_capitata_HIv3___RNAseq.g4592.t1 = q1:MSTRG.12|Montipora_capitata_HIv3___RNAseq.g4592.t1|2|0.000000|0.000000|1.483990|267
TCONS_00000010 XLOC_000010 Montipora_capitata_HIv3___TS.g26284.t1|Montipora_capitata_HIv3___TS.g26284.t1 = q1:MSTRG.19|Montipora_capitata_HIv3___TS.g26284.t1|6|0.000000|0.000000|0.017805|945
Looking at the gffcompare documentation, the tracking file matches up transcripts between samples. Includes the MSTRG ID? Interesting. Is this file matching up the MSTRG to the gene ids?
Make gtf list text file for gene count matrix creation
for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt
Download the prepDE.py script from here and put it in the scripts folder (I already had it in here). Load python and compile the gene count matrix
module purge
module load Python/2.7.18-GCCcore-9.3.0
python /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts/prepDE.py -g Mcap_gene_count_matrix.csv -i listGTF.txt
wc -l Mcap_gene_count_matrix.csv
54385 Mcap_gene_count_matrix.csv
Copy to local computer. Hooray!
20241108
Identify lncRNAs using my Astrangia code and Zach’s oyster code as reference.
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/stringtie
awk '$3 == "transcript" && $1 !~ /^#/' merged.annotated.gtf | awk '($5 - $4 > 199) || ($4 - $5 > 199)' > Mcap_lncRNA_candidates.gtf
head Mcap_lncRNA_candidates.gtf
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 82397 95409 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4584.t1"; gene_id "MSTRG.4"; gene_name "Montipora_capitata_HIv3___RNAseq.g4584.t1"; xloc "XLOC_000001"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4584.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4584.t1"; class_code "="; tss_id "TSS1";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 109801 163388 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4586.t1"; gene_id "MSTRG.30"; gene_name "Montipora_capitata_HIv3___RNAseq.g4586.t1"; xloc "XLOC_000002"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4586.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4586.t1"; class_code "="; tss_id "TSS2";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 162568 163156 . + . transcript_id "Montipora_capitata_HIv3___TS.g26272.t1"; gene_id "MSTRG.30"; gene_name "Montipora_capitata_HIv3___TS.g26272.t1"; xloc "XLOC_000003"; ref_gene_id "Montipora_capitata_HIv3___TS.g26272.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26272.t1"; class_code "="; tss_id "TSS3";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 169950 170561 . + . transcript_id "Montipora_capitata_HIv3___TS.g26273.t1"; gene_id "MSTRG.5"; gene_name "Montipora_capitata_HIv3___TS.g26273.t1"; xloc "XLOC_000004"; ref_gene_id "Montipora_capitata_HIv3___TS.g26273.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26273.t1"; class_code "="; tss_id "TSS4";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 170982 172082 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4588.t1"; gene_id "MSTRG.6"; gene_name "Montipora_capitata_HIv3___RNAseq.g4588.t1"; xloc "XLOC_000005"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4588.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4588.t1"; class_code "="; tss_id "TSS5";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 176400 276376 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4589.t1"; gene_id "MSTRG.7"; gene_name "Montipora_capitata_HIv3___RNAseq.g4589.t1"; xloc "XLOC_000006"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4589.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4589.t1"; class_code "="; tss_id "TSS6";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 204191 204451 . + . transcript_id "Montipora_capitata_HIv3___TS.g26276.t1"; gene_id "MSTRG.8"; gene_name "Montipora_capitata_HIv3___TS.g26276.t1"; xloc "XLOC_000007"; ref_gene_id "Montipora_capitata_HIv3___TS.g26276.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26276.t1"; class_code "="; tss_id "TSS7";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 223366 223638 . + . transcript_id "Montipora_capitata_HIv3___TS.g26277.t1"; gene_id "MSTRG.9"; gene_name "Montipora_capitata_HIv3___TS.g26277.t1"; xloc "XLOC_000008"; ref_gene_id "Montipora_capitata_HIv3___TS.g26277.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26277.t1"; class_code "="; tss_id "TSS8";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 330242 330991 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4592.t1"; gene_id "MSTRG.12"; gene_name "Montipora_capitata_HIv3___RNAseq.g4592.t1"; xloc "XLOC_000009"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4592.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4592.t1"; class_code "="; tss_id "TSS9";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 396628 400449 . + . transcript_id "Montipora_capitata_HIv3___TS.g26284.t1"; gene_id "MSTRG.19"; gene_name "Montipora_capitata_HIv3___TS.g26284.t1"; xloc "XLOC_000010"; ref_gene_id "Montipora_capitata_HIv3___TS.g26284.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26284.t1"; class_code "="; tss_id "TSS10";
wc -l Mcap_lncRNA_candidates.gtf
54314 Mcap_lncRNA_candidates.gtf
Use bedtools to extract sequence data from the reference genome based on the coordinates from the GTF. The resulting seqs represent potential lncRNA candidates.
interactive
module load BEDTools/2.30.0-GCC-11.3.0
bedtools getfasta -fi /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta -bed Mcap_lncRNA_candidates.gtf -fo Mcap_lncRNA_candidates.fasta -fullHeader -split
zgrep -c ">" Mcap_lncRNA_candidates.fasta
Run CPC2.
module load CPC2/1.0.1-foss-2022a
CPC2.py -i /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/stringtie/Mcap_lncRNA_candidates.fasta
awk '$8 != "coding"' cpc2output.txt.txt > Mcap_noncoding_transcripts_info.txt
head Mcap_noncoding_transcripts_info.txt
#ID transcript_length peptide_length Fickett_score pI ORF_integrity coding_probability label
transcript::Montipora_capitata_HIv3___Scaffold_1:204190-204451 261 87 0.35153 10.518226432800294 1 0.10615 noncoding
transcript::Montipora_capitata_HIv3___Scaffold_1:223365-223638 273 91 0.45305 9.851301002502442 1 0.437989 noncoding
transcript::Montipora_capitata_HIv3___Scaffold_1:330241-330991 750 53 0.44283 4.5459486007690435 1 0.100817 noncoding
transcript::Montipora_capitata_HIv3___Scaffold_1:499117-506138 7021 102 0.29595000000000005 9.135637474060061 1 0.0984293 noncoding
transcript::Montipora_capitata_HIv3___Scaffold_1:581275-583890 2615 84 0.39626000000000006 9.359922981262208 1 0.154755 noncoding
transcript::Montipora_capitata_HIv3___Scaffold_1:600468-626017 25549 131 0.2921 10.076875877380367 1 0.19239 noncoding
transcript::Montipora_capitata_HIv3___Scaffold_1:696118-715336 19218 107 0.29367000000000004 10.136251258850098 1 0.108674 noncoding
transcript::Montipora_capitata_HIv3___Scaffold_1:1058733-1069888 11155 118 0.29367000000000004 5.867625617980956 1 0.3621 noncoding
transcript::Montipora_capitata_HIv3___Scaffold_1:1128944-1147155 18211 133 0.2921 8.951193428039552 1 0.23396 noncoding
awk '$8 == "noncoding" {print $1}' cpc2output.txt.txt > Mcap_noncoding_transcripts_ids.txt
head Mcap_noncoding_transcripts_ids.txt
transcript::Montipora_capitata_HIv3___Scaffold_1:204190-204451
transcript::Montipora_capitata_HIv3___Scaffold_1:223365-223638
transcript::Montipora_capitata_HIv3___Scaffold_1:330241-330991
transcript::Montipora_capitata_HIv3___Scaffold_1:499117-506138
transcript::Montipora_capitata_HIv3___Scaffold_1:581275-583890
transcript::Montipora_capitata_HIv3___Scaffold_1:600468-626017
transcript::Montipora_capitata_HIv3___Scaffold_1:696118-715336
transcript::Montipora_capitata_HIv3___Scaffold_1:1058733-1069888
transcript::Montipora_capitata_HIv3___Scaffold_1:1128944-1147155
transcript::Montipora_capitata_HIv3___Scaffold_1:1152456-1156614
Extract putative lncRNAs from fasta
sed 's/^transcript:://' Mcap_noncoding_transcripts_ids.txt > Mcap_noncoding_transcripts_ids_modified.txt
grep -A 1 -Ff Mcap_noncoding_transcripts_ids_modified.txt Mcap_lncRNA_candidates.fasta | sed '/^--$/d' > Mcap_lncRNAs_putative.fasta
I have putative lncRNAs.
20241112
Download rfam sequences to blast against putative lncRNAs to remove any unwanted rRNAs, tRNAs, etc. First, make a refs folder for the sequences
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA
mkdir refs
In the scripts folder: nano wget_rfam.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Download rfam sequences" $(date)
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/refs
wget -r -np -nd -A .fa.gz ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/
echo "Download complete, cat all sequences" $(date)
cat *fa.gz > rfam_seqs.fa.gz
rm RF*
echo "Cat complete" $(date)
Submitted batch job 348568
After the sequences are done downloading, blast the putative lncRNAs against the rfam sequences. Make new folder for output.
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output
mkdir lnc_blast
In the scripts folder: nano blastn_rfam.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.9.0-iimpi-2019b
echo "Blasting putative lncRNAs against rfam" $(date)
echo "Making blast db" $(date)
makeblastdb -in /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/refs/rfam_seqs.fa -out /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/refs/rfam_db -dbtype nucl
echo "Db creation complete, blast putative lncRNAs against rfam db" $(date)
blastn -query /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/stringtie/Mcap_lncRNAs_putative.fasta -db /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/refs/rfam_db -out /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/lnc_blast/Mcap_lnc_rfam_blastn.tab -evalue 1E-40 -num_threads 10 -max_target_seqs 1 -max_hsps 1 -outfmt 6
echo "Blast complete" $(date)
Submitted batch job 348600. Ran in about 5 mins. Look at output
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/lnc_blast
head Mcap_lnc_rfam_blastn.tab
Montipora_capitata_HIv3___Scaffold_10:50987091-51005681 MU825396.1/212773-212653 96.694 121 4 0 10869 10989 121 1 1.95e-48 202
Montipora_capitata_HIv3___Scaffold_1013:6609-13142 MU826834.1/1912895-1913082 95.531 179 6 2 3345 3522 188 11 6.68e-74 285
Montipora_capitata_HIv3___Scaffold_1013:13927-20177 MU827310.1/718302-718465 96.951 164 5 0 2981 3144 164 1 3.85e-71 276
Montipora_capitata_HIv3___Scaffold_11:3287015-3291500 MU827309.1/2350248-2350386 96.063 127 5 0 4035 4161 127 1 1.02e-50 207
Montipora_capitata_HIv3___Scaffold_13:24569367-24573028 MU827310.1/717380-717262 95.798 119 5 0 1580 1698 1 119 2.35e-46 193
Montipora_capitata_HIv3___Scaffold_13:24617531-24626060 LJWW01001071.1/6624-6820 95.690 116 3 2 2761 2876 142 29 9.11e-44 185
Montipora_capitata_HIv3___Scaffold_13:24738989-24746769 MU826834.1/1912895-1913082 95.506 178 5 2 3075 3251 188 13 1.03e-72 281
Montipora_capitata_HIv3___Scaffold_13:24798335-24808343 LSMT01000541.1/6279-6088 97.917 192 4 0 2639 2830 192 1 3.57e-88 333
Montipora_capitata_HIv3___Scaffold_4:36816332-36834523 MU827817.1/523255-523552 90.333 300 25 4 17249 17546 1 298 3.80e-105 390
Montipora_capitata_HIv3___Scaffold_4:31137676-31146429 MU826834.1/1905971-1906134 93.252 163 11 0 4963 5125 163 1 1.97e-60 241
wc -l Mcap_lnc_rfam_blastn.tab
15 Mcap_lnc_rfam_blastn.tab
Only 15 sequences got hits as possible rfam contaminants. Make a list of these.
awk '{print $1}' Mcap_lnc_rfam_blastn.tab > sequences_to_remove.txt
Remove sequences from fasta
cd ../stringtie
grep -v -F -f <(sed 's/^/>/' /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/lnc_blast/sequences_to_remove.txt) Mcap_lncRNAs_putative.fasta | sed '/^$/d' > Mcap_lncRNAs_final.fasta
zgrep -c ">" Mcap_lncRNAs_putative.fasta
31519
zgrep -c ">" Mcap_lncRNAs_final.fasta
31504
20241113
I need to quantify the lncRNAs with kallisto. Make kallisto output folder
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output
mkdir kallisto
The lncRNA fasta (Mcap_lncRNAs_final.fasta
) will be used to generate the kallisto index. Then, the RNAseq reads (located /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/data/trim
) will be aligned to the lncRNA index using kallisto. In the scripts folder: nano kallisto_lnc.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load kallisto/0.48.0-gompi-2022a
echo "Creating kallisto index from lncRNA fasta" $(date)
kallisto index -i /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/kallisto/mcap_lncRNA_index /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/stringtie/Mcap_lncRNAs_final.fasta
echo "lncRNA index creation complete, starting read alignment" $(date)
array=($(ls /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/data/trim/*R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
# Extract just the filename from the input FASTQ file path
filename=$(basename ${i})
# Construct the output directory path
output_dir="/data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/kallisto/kallisto.${filename}"
# Run kallisto quant
kallisto quant -i /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/kallisto/mcap_lncRNA_index -o ${output_dir} ${i} $(echo ${i} | sed 's/_R1/_R2/')
done
echo "lncRNA alignment complete!" $(date)
Submitted batch job 348664. Ran in about 3 hours. Run trinity to generate lncRNa count matrix. in the scripts folder: nano trinity_gene_matrix.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=32GB --cpus-per-task=24
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Trinity/2.15.1-foss-2022a
echo "Use trinity to generate lncRNA count matrix" $(date)
perl $EBROOTTRINITY/trinityrnaseq-v2.15.1/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/kallisto/Mcap_lncRNA_count_matrix \
--name_sample_by_basedir \
/data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/kallisto/*/abundance.tsv
echo "LncRNA count matrix created!" $(date)
Submitted batch job 348671. Ran super fast but no slurm error or output files…But we did get 4 output files:
-rw-r--r--. 1 jillashey 8.1M Nov 13 12:20 Mcap_lncRNA_count_matrix.isoform.TPM.not_cross_norm
-rw-r--r--. 1 jillashey 6.6M Nov 13 12:20 Mcap_lncRNA_count_matrix.isoform.counts.matrix
-rw-r--r--. 1 jillashey 0 Nov 13 12:20 Mcap_lncRNA_count_matrix.isoform.TMM.EXPR.matrix
-rw-r--r--. 1 jillashey 684 Nov 13 12:20 Mcap_lncRNA_count_matrix.isoform.TPM.not_cross_norm.runTMM.R
According to the Trinity wiki, these files mean the following:
kallisto.isoform.counts.matrix : the estimated RNA-Seq fragment counts (raw counts)
kallisto.isoform.TPM.not_cross_norm : a matrix of TPM expression values (not cross-sample normalized)
kallisto.isoform.TMM.EXPR.matrix : a matrix of TMM-normalized expression values
Copy Mcap_lncRNA_count_matrix.isoform.counts.matrix
onto local computer!
20241118
Huang et al. 2019 used RIblast to compute interaction probabilities for each possible lncRNA-mRNA pair based on free energy required for hybridization. The github for RIblast is here.
cd /data/putnamlab/jillashey
git clone https://github.com/fukunagatsu/RIblast.git
make # compile
The first step is to generate a database. I’m going to generate a db of mRNAs and the lncRNAs will be my query. No idea how long this is supposed to take.
interactive
cd /data/putnamlab/jillashey/genome/Mcap/V3
gunzip Montipora_capitata_HIv3.genes.cds.fna.gz
cd /data/putnamlab/jillashey/genome/RIblast
export PATH=$PATH:/data/putnamlab/jillashey/RIblast
RIblast db -i /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.genes.cds.fna -o mRNA_db
Taking a while so I am going to submit as a job. In the Mcap scripts folder (/data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
): nano RIblast_db.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=32GB --cpus-per-task=24
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Make RI blast db for mRNAs" $(date)
cd /data/putnamlab/jillashey/genome/RIblast
export PATH=$PATH:/data/putnamlab/jillashey/RIblast
RIblast db -i /data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.genes.cds.fna -o mRNA_db
echo "mRNA db creation complete" $(date)
Submitted batch job 349336
20241119
Took about 13 hours. Move output into refs folder.
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
mv mRNA_db.* ../refs/
Make output folder for RIblast
cd /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output
mkdir RIblast
Next step is to search the potential RNA-RNA interactions of the query sequence (lncRNAs) to the db sequences (mRNAs). In the scripts folder: nano RIblast_ris.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=125GB --cpus-per-task=24
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Start RNA-RNA interaction predictions" $(date)
export PATH=$PATH:/data/putnamlab/jillashey/RIblast
RIblast ris -i /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/stringtie/Mcap_lncRNAs_final.fasta -o /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/output/RIblast/RIblast_lncRNA_output.txt -d /data/putnamlab/jillashey/DT_Mcap_2023/mRNA/refs/mRNA_db -e -20.0
echo "RNA interaction prediction complete - lncRNA as query and mRNA as db" $(date)
Submitted batch job 349435. The -e
flag refers to the hybridization energy threshold for seed search, default is -6.0 but I set it at -20.0 for increased stringency.
20241125
Ran for 5 days, then timed out. Look at results:
RIblast_lncRNA_output.txt
3907608,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___RNAseq.g26524.t1,2634,9.50348,-22.54,-13.0365,(1101-1123:1513-1491)
3907609,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___RNAseq.g37057.t2,2322,12.1252,-21.51,-9.38481,(65-82:1428-1411)
3907610,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___RNAseq.g5510.t1,369,8.12618,-20.68,-12.5538,(47-63:363-347)
3907611,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___RNAseq.g45082.t1,2604,8.09315,-20.9,-12.8069,(43-60:661-644)
3907612,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___RNAseq.g12829.t1,1161,11.4163,-23.01,-11.5937,(940-962:979-959)
3907613,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___RNAseq.g13008.t1,1317,12.3443,-22.45,-10.1057,(514-532:645-627)
3907614,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___RNAseq.g41166.t1,4125,8.98718,-20.68,-11.6928,(170-189:3145-3126)
3907615,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___RNAseq.g5583.t1,372,6.80997,-22.06,-15.25,(1384-1402:51-33)
3907616,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___TS.g47533.t1,11961,10.1834,-20.97,-10.7866,(777-797:9969-9949)
3907617,Montipora_capitata_HIv3___Scaffold_10:60672740-60674143,1403,Montipora_capitata_HIv3___TS.g2834.t1,2289,7.26336,-21.12,-13.8566,(236-255:1854-1835)
wc -l RIblast_lncRNA_output.txt
3,907,621 RIblast_lncRNA_output.txt
Column 1 is the ID (looks like a running count of interactions predicted), column 2 is the query RNA (the lncRNA in our case), column 3 is the query length, column 4 is the target name (mRNA in our case), column 6 is the target length, column 7 is the accessibility energy, column 8 is the hybridization energy, column 9 is the interaction energy, and column 10 is the interacted regions ([region in query]:[region in target]).
I need to set a longer time and maybe more tasks per node / cpus per task. How many interactions should there be? There are 54384 genes and 31504 lncRNAs. 54384*31504 = 1.7 BILLION. I only got to 3 million!!!
maybe break up into 6 different fastas and then run as separate jobs?