Aging ncRNAs
ncRNA characterization for coral aging
Emma and I have been talking about using ncRNA expression as a ‘clock’ to age a coral. We decided to move forward with an initial analysis of Mcap (since we have so many samples across life stages) RNAseq data. For this first pass, we are going to characterize lncRNAs (and maybe smRNAs) in Mcap samples from Emma’s work (adults) and my work (larvae). I’m not sure if we will be able to derive smRNAs from RNAseq data, but we can try!
All of our sequenced samples are on Unity in various places. Here are the paths for the data we will use:
/project/pi_hputnam_uri_edu/raw_sequencing_data/20220203_BleachedPairs_RNASeq
/project/pi_hputnam_uri_edu/raw_sequencing_data/20240328_Mcap_RNASeq_Devo
/project/pi_hputnam_uri_edu/raw_sequencing_data/20240920_Ashey_Mcap_Devo
/project/pi_hputnam_uri_edu/raw_sequencing_data/20191104_HoloInt/30-274849628
/project/pi_hputnam_uri_edu/raw_sequencing_data/20200404_HoloInt_Batch3
First, I am going to sym link all the folders to one central folder. This might take a while so will do as a job. In the scripts folder: nano symlinks.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/coral_aging/scripts
# Path to your TSV file
FILE_LIST="/work/pi_hputnam_uri_edu/jillashey/coral_aging/mcap_rna_seq_files.tsv"
# Directory where you want to place all symlinks
LINK_DIR="/work/pi_hputnam_uri_edu/jillashey/coral_aging/data/raw"
# Skip header and read file
tail -n +2 "$FILE_LIST" | while IFS=$'\t' read -r R1 R2 DIR; do
# Remove any trailing carriage return characters
R1=$(echo "$R1" | tr -d '\r')
R2=$(echo "$R2" | tr -d '\r')
DIR=$(echo "$DIR" | tr -d '\r')
SRC1="${DIR}/${R1}"
SRC2="${DIR}/${R2}"
ln -sf "$SRC1" "$LINK_DIR/$(basename "$R1")"
ln -sf "$SRC2" "$LINK_DIR/$(basename "$R2")"
done
Submitted batch job 38064879. Ran super fast lol. QC the data. In the scripts folder: nano raw_qc.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/coral_aging/scripts
# load modules needed
module load fastqc/0.12.1
module load uri/main
module load all/MultiQC/1.12-foss-2021b
# Set directory where all fastq.gz files are located
DATA_DIR="/work/pi_hputnam_uri_edu/jillashey/coral_aging/data/raw"
# Output directory for FastQC results
FASTQC_OUT="/work/pi_hputnam_uri_edu/jillashey/coral_aging/output/fastqc/raw"
echo "Running FastQC on all FASTQ files in $DATA_DIR..."
# Run FastQC on all .fastq.gz files
fastqc -o "$FASTQC_OUT" "$DATA_DIR"/*.fastq.gz
echo "FastQC complete. Running MultiQC..."
# Move into the FastQC output directory and run MultiQC
cd "$FASTQC_OUT" || exit
multiqc .
echo "MultiQC report generated in $FASTQC_OUT"
Submitted batch job 38065181. Raw QC is here. Overall, the data looks quite good. There are some differences in sequencing depth, as well as read length. We will keep an eye on these as we move forward. The primary issue with the raw reads is the adapter content. Emma and I both used fastp to trim so I am going to stick with that. Also going to make a scratch directory to store the trimmed fastq files and bam files.
ws_allocate -n coral_age -G pi_hputnam_uri_edu -d 30 -r 5 -m jillashey@uri.edu
Info: creating workspace.
/scratch3/workspace/jillashey_uri_edu-coral_age
remaining extensions : 5
remaining time in days: 30
cd /scratch3/workspace/jillashey_uri_edu-coral_age
mkdir trim bam
In the scripts folder: nano trim_and_qc.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/coral_aging/scripts
# load modules needed
module load fastqc/0.12.1
module load uri/main
module load all/MultiQC/1.12-foss-2021b
module load fastp/0.23.2-GCC-11.2.0
# Set paths
RAW_DIR="/work/pi_hputnam_uri_edu/jillashey/coral_aging/data/raw"
TRIM_DIR="/scratch3/workspace/jillashey_uri_edu-coral_age/trim"
FASTQC_DIR="/work/pi_hputnam_uri_edu/jillashey/coral_aging/output/fastqc/trim"
MULTIQC_DIR="/work/pi_hputnam_uri_edu/jillashey/coral_aging/output/fastqc/trim"
echo "Starting trimming and QC pipeline..."
# Get all R1 files
for R1 in "$RAW_DIR"/*_R1_001.fastq.gz; do
BASE=$(basename "$R1" _R1_001.fastq.gz)
R2="$RAW_DIR/${BASE}_R2_001.fastq.gz"
OUT1="$TRIM_DIR/trim.${BASE}_R1_001.fastq.gz"
OUT2="$TRIM_DIR/trim.${BASE}_R2_001.fastq.gz"
echo "--------------------------------------------"
echo "Processing sample: $BASE"
echo "Input R1: $R1"
echo "Input R2: $R2"
echo "Output R1: $OUT1"
echo "Output R2: $OUT2"
# Run fastp
echo "Running fastp..."
fastp \
--in1 "$R1" \
--in2 "$R2" \
--out1 "$OUT1" \
--out2 "$OUT2" \
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--unqualified_percent_limit 10 \
#--length_required 100 \
--cut_right \
--cut_right_window_size 5 \
--cut_right_mean_quality 20
echo "Finished fastp for $BASE."
# Run fastqc
echo "Running FastQC..."
fastqc -o "$FASTQC_DIR" "$OUT1" "$OUT2"
echo "Finished FastQC for $BASE."
done
# Run multiqc
echo "Running MultiQC on all FastQC outputs..."
multiqc "$FASTQC_DIR" -o "$MULTIQC_DIR"
echo "MultiQC complete. All done!"
trim batch job 38139904. QC batch job 38154121. Trimmed QC is here. Trimming looks great!!
Align trimmed reads. In the scripts folder: nano align.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=2
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 100:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/coral_aging/scripts
# Load modules
module load uri/main
module load HISAT2/2.2.1-gompi-2022a
module load SAMtools/1.18-GCC-12.3.0
# Set directories
TRIM_DIR="/scratch3/workspace/jillashey_uri_edu-coral_age/trim"
BAM_DIR="/scratch3/workspace/jillashey_uri_edu-coral_age/bam"
HISAT2_INDEX="/work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/McapV3_hisat2_ref"
## hisat2 reference for Mcap V3 already built and in HI_Genomes folder
# Get list of trimmed R1 files
array=($(ls "$TRIM_DIR"/trim.*_R1_001.fastq.gz))
# Alignment loop
for R1 in "${array[@]}"; do
# Get basename (e.g., trim.sample -> sample)
basefile=$(basename "$R1")
sample_name=$(echo "$basefile" | cut -d '.' -f2)
R2="${R1/_R1_/_R2_}"
echo "Aligning sample: $sample_name"
echo " Input R1: $R1"
echo " Input R2: $R2"
# Run HISAT2 alignment
hisat2 -p 8 --rna-strandness RF --dta \
-x "$HISAT2_INDEX" \
-1 "$R1" -2 "$R2" \
-S "${sample_name}.sam"
# Convert to sorted BAM
samtools sort -@ 8 -o "$BAM_DIR/${sample_name}.bam" "${sample_name}.sam"
echo " -> ${sample_name}.bam created in $BAM_DIR"
# Remove intermediate SAM
rm "${sample_name}.sam"
done
echo "Alignment complete!" $(date)
# Mapping summary
SUMMARY_FILE="$BAM_DIR/mapped_reads_counts_Mcap.txt"
echo "Sample Mapping Summary - $(date)" > "$SUMMARY_FILE"
for BAM in "$BAM_DIR"/*.bam; do
echo "$(basename "$BAM")" >> "$SUMMARY_FILE"
samtools flagstat "$BAM" | grep "mapped (" >> "$SUMMARY_FILE"
done
echo "Mapping summary saved to $SUMMARY_FILE"
Submitted batch job 38207703. High mapping percentages for all samples (except for a few early life stage samples from my work, this is to be expected). See mapping output here.
Run stringtie to assemble reads. In the scripts folder: nano assemble.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=2
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 100:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/coral_aging/scripts
# Load modules
module load uri/main
module load StringTie/2.2.1-GCC-11.2.0
# Set directories
BAM_DIR="/scratch3/workspace/jillashey_uri_edu-coral_age/bam"
STRING_DIR="/scratch3/workspace/jillashey_uri_edu-coral_age/stringtie"
GFF="/work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3"
# Get list of bam files
array=($(ls "$BAM_DIR"/*bam))
echo "Assembling transcripts using stringtie" $(date)
# Loop through and run StringTie
for i in "${array[@]}"; do
sample_name=$(basename "$i" .bam)
echo "Running StringTie on $sample_name"
stringtie "$i" \
-p 8 \
-e \
-B \
-G "$GFF" \
-A "$STRING_DIR/${sample_name}.gene_abund.tab" \
-o "$STRING_DIR/${sample_name}.gtf"
echo "Done with $sample_name"
done
echo "All transcript assemblies complete!" $(date)
Submitted batch job 38291875.
Once assembly is done, make a list of gtfs.
cd /scratch3/workspace/jillashey_uri_edu-coral_age/stringtie
ls *.gtf > gtf_list.txt
Merge gtfs into single gtf to check accuracy
module load uri/main
module load StringTie/2.2.1-GCC-11.2.0
stringtie --merge -e -p 8 -G /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -o mcap_age_merged.gtf gtf_list.txt
Use gffcompare to look at accuracy of assembly
module load GffCompare/0.12.6-GCC-11.2.0
gffcompare -r /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.genes_fixed.gff3 -G -o merged mcap_age_merged.gtf
54384 reference transcripts loaded.
2 duplicate reference transcripts discarded.
54384 query transfrags loaded.
Look at merge stats
#= Summary for dataset: mcap_age_merged.gtf
# Query mRNAs : 54384 in 54185 loci (36023 multi-exon transcripts)
# (141 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 54382 in 54185 loci (36023 multi-exon)
# Super-loci w/ reference transcripts: 54185
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 99.9 | 99.9 |
Intron level: 100.0 | 100.0 |
Intron chain level: 100.0 | 100.0 |
Transcript level: 99.9 | 99.9 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 36023
Matching transcripts: 54307
Matching loci: 54173
Missed exons: 0/256029 ( 0.0%)
Novel exons: 0/256026 ( 0.0%)
Missed introns: 0/201643 ( 0.0%)
Novel introns: 0/201643 ( 0.0%)
Missed loci: 0/54185 ( 0.0%)
Novel loci: 0/54185 ( 0.0%)
Total union super-loci across all input datasets: 54185
54384 out of 54384 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
Assembly looks good! Now on to lncRNA identification (using previous Mcap workflow). From the merged gtf file, pull out any transcripts that are >199bp (the typical cutoff for lncRNA length).
salloc
awk '$3 == "transcript" && $1 !~ /^#/' merged.annotated.gtf | awk '($5 - $4 > 199) || ($4 - $5 > 199)' > mcap_age_lncRNA_candidates.gtf
head mcap_age_lncRNA_candidates.gtf
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 82397 95409 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4584.t1"; gene_id "MSTRG.4"; gene_n
ame "Montipora_capitata_HIv3___RNAseq.g4584.t1"; xloc "XLOC_000001"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4584.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4584.t1"; class_code "="; t
ss_id "TSS1";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 109801 163388 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4586.t1"; gene_id "MSTRG.7"; gene_n
ame "Montipora_capitata_HIv3___RNAseq.g4586.t1"; xloc "XLOC_000002"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4586.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4586.t1"; class_code "="; t
ss_id "TSS2";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 162568 163156 . + . transcript_id "Montipora_capitata_HIv3___TS.g26272.t1"; gene_id "MSTRG.7"; gene_name
"Montipora_capitata_HIv3___TS.g26272.t1"; xloc "XLOC_000003"; ref_gene_id "Montipora_capitata_HIv3___TS.g26272.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26272.t1"; class_code "="; tss_id "TSS3"
;
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 169950 170561 . + . transcript_id "Montipora_capitata_HIv3___TS.g26273.t1"; gene_id "MSTRG.5"; gene_name
"Montipora_capitata_HIv3___TS.g26273.t1"; xloc "XLOC_000004"; ref_gene_id "Montipora_capitata_HIv3___TS.g26273.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26273.t1"; class_code "="; tss_id "TSS4"
;
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 170982 172082 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4588.t1"; gene_id "MSTRG.13"; gene_
name "Montipora_capitata_HIv3___RNAseq.g4588.t1"; xloc "XLOC_000005"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4588.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4588.t1"; class_code "=";
tss_id "TSS5";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 176400 276376 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4589.t1"; gene_id "MSTRG.14"; gene_
name "Montipora_capitata_HIv3___RNAseq.g4589.t1"; xloc "XLOC_000006"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4589.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4589.t1"; class_code "=";
tss_id "TSS6";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 204191 204451 . + . transcript_id "Montipora_capitata_HIv3___TS.g26276.t1"; gene_id "MSTRG.15"; gene_nam
e "Montipora_capitata_HIv3___TS.g26276.t1"; xloc "XLOC_000007"; ref_gene_id "Montipora_capitata_HIv3___TS.g26276.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26276.t1"; class_code "="; tss_id "TSS7
";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 223366 223638 . + . transcript_id "Montipora_capitata_HIv3___TS.g26277.t1"; gene_id "MSTRG.16"; gene_nam
e "Montipora_capitata_HIv3___TS.g26277.t1"; xloc "XLOC_000008"; ref_gene_id "Montipora_capitata_HIv3___TS.g26277.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26277.t1"; class_code "="; tss_id "TSS8
";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 330242 330991 . + . transcript_id "Montipora_capitata_HIv3___RNAseq.g4592.t1"; gene_id "MSTRG.20"; gene_
name "Montipora_capitata_HIv3___RNAseq.g4592.t1"; xloc "XLOC_000009"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4592.t1"; cmp_ref "Montipora_capitata_HIv3___RNAseq.g4592.t1"; class_code "=";
tss_id "TSS9";
Montipora_capitata_HIv3___Scaffold_1 StringTie transcript 396628 400449 . + . transcript_id "Montipora_capitata_HIv3___TS.g26284.t1"; gene_id "MSTRG.23"; gene_nam
e "Montipora_capitata_HIv3___TS.g26284.t1"; xloc "XLOC_000010"; ref_gene_id "Montipora_capitata_HIv3___TS.g26284.t1"; cmp_ref "Montipora_capitata_HIv3___TS.g26284.t1"; class_code "="; tss_id "TSS1
0";
wc -l mcap_age_lncRNA_candidates.gtf
54314 mcap_age_lncRNA_candidates.gtf
Use bedtools to extract sequence data from the reference genome based on the coordinates from the GTF. The resulting seqs represent potential lncRNA candidates.
module load bedtools2/2.31.1
bedtools getfasta -fi /work/pi_hputnam_uri_edu/HI_Genomes/MCapV3/Montipora_capitata_HIv3.assembly.fasta -bed mcap_age_lncRNA_candidates.gtf -fo mcap_age_lncRNA_candidates.fasta -fullHeader -split
zgrep -c ">" mcap_age_lncRNA_candidates.fasta
54314
I will need to run CPC2 to identify transcripts that are coding v. noncoding, but it is not installed on Unity. I asked them to install it and they said:
This is essentially a script that can be set up by following the instructions in the link you provided. Below is an example to show how to set it up. (Everything run from the terminal)
"""
# Get a compute node
salloc -c 2 --mem=10g
# Create and activate a python virtual environment called venv
python -m venv venv
source venv/bin/activate
# Download prerequisites dor CPC2
pip install biopython six
# Download CPC2
wget https://github.com/gao-lab/CPC2_standalone/archive/refs/tags/v1.0.1.tar.gz
gzip -dc v1.0.1.tar.gz | tar xf -
cd CPC2_standalone-1.0.1
# Set up CPC2
export CPC_HOME="$PWD"
cd libs/libsvm
gzip -dc libsvm-3.18.tar.gz | tar xf -
cd libsvm-3.18
make clean && make
# Run example
cd $CPC_HOME
bin/CPC2.py -i data/example.fa -o example_output
Ran the above code and installed here: /work/pi_hputnam_uri_edu/pgrams/CPC2_standalone-1.0.1
.
Run CPC2 using my data (needs a fasta file to run)
cd /work/pi_hputnam_uri_edu/pgrams/CPC2_standalone-1.0.1
source ../venv/bin/activate
export CPC_HOME="$PWD"
python bin/CPC2.py \
-i /scratch3/workspace/jillashey_uri_edu-coral_age/stringtie/mcap_age_lncRNA_candidates.fasta -o /scratch3/workspace/jillashey_uri_edu-coral_age/stringtie/mcap_age_CPC2.txt
Look at results!
head mcap_age_CPC2.txt
#ID transcript_length peptide_length Fickett_score pI ORF_integrity coding_probabi
lity label
Montipora_capitata_HIv3___Scaffold_1:82396-95409 13013 184 0.34138 9.893076515197755 1
0.817981 coding
Montipora_capitata_HIv3___Scaffold_1:109800-163388 53588 1078 0.2921 7.529540824890138 1
1 coding
Montipora_capitata_HIv3___Scaffold_1:162567-163156 589 140 0.3788 8.867577934265139 1
0.688879 coding
Montipora_capitata_HIv3___Scaffold_1:169949-170561 612 204 0.46457000000000004 8.5183
52699279784 1 0.998454 coding
Montipora_capitata_HIv3___Scaffold_1:170981-172082 1101 367 0.45627 9.624243354797361 1
1 coding
Montipora_capitata_HIv3___Scaffold_1:176399-276376 99977 349 0.29367000000000004 9.5263
15879821777 1 0.995895 coding
Montipora_capitata_HIv3___Scaffold_1:204190-204451 261 87 0.35153 10.518226432800294 1
0.105992 noncoding
Montipora_capitata_HIv3___Scaffold_1:223365-223638 273 91 0.45305 9.851301002502442 1
0.43786 noncoding
Montipora_capitata_HIv3___Scaffold_1:330241-330991 750 53 0.44283 4.5459486007690435 1
0.100661 noncoding
wc -l mcap_age_CPC2.txt
54315 mcap_age_CPC2.txt
Filter by noncoding transcripts
awk '$8 != "coding"' mcap_age_CPC2.txt > mcap_age_noncoding_transcripts_info.txt
wc -l mcap_age_noncoding_transcripts_info.txt
31444 mcap_age_noncoding_transcripts_info.txt
awk '$8 == "noncoding" {print $1}' mcap_age_CPC2.txt > mcap_age_noncoding_transcripts_ids.txt
wc -l mcap_age_noncoding_transcripts_ids.txt
31443 mcap_age_noncoding_transcripts_ids.txt
Extract putative lncRNAs from fasta
grep -A 1 -Ff mcap_age_noncoding_transcripts_ids.txt mcap_age_lncRNA_candidates.fasta | sed '/^--$/d' > mcap_age_lncRNA_putative.fasta
To remove potential contimation from rRNAs or tRNAs, download rfam database and blast putative lncRNAs against it. In the scripts folder: nano rfam_blast.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=2
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 100:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/coral_aging/scripts
# Load modules
module load uri/main
module load BLAST+/2.15.0-gompi-2023a
echo "Download rfam sequences" $(date)
cd /scratch3/workspace/jillashey_uri_edu-coral_age/rfam_ref
wget -r -np -nd -A .fa.gz ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/
echo "Download complete, cat all sequences" $(date)
cat *fa.gz > rfam_seqs.fa.gz
rm RF*
gunzip rfam_seqs.fa.gz
echo "Cat and unzip complete" $(date)
echo "Blasting putative lncRNAs against rfam db" $(date)
echo "Making blast db" $(date)
makeblastdb -in /scratch3/workspace/jillashey_uri_edu-coral_age/rfam_ref/rfam_seqs.fa -out /scratch3/workspace/jillashey_uri_edu-coral_age/rfam_ref/rfam_db -dbtype nucl
echo "Db creation complete, blast putative lncRNAs against rfam db" $(date)
blastn -query /scratch3/workspace/jillashey_uri_edu-coral_age/stringtie/mcap_age_lncRNA_putative.fasta -db /scratch3/workspace/jillashey_uri_edu-coral_age/rfam_ref/rfam_db -out /scratch3/workspace/jillashey_uri_edu-coral_age/blast/mcap_age_lncRNA_rfam_blastn.tab -evalue 1E-40 -num_threads 10 -max_target_seqs 1 -max_hsps 1 -outfmt 6
echo "Blast complete" $(date)
Submitted batch job 38305252. Look at blast output
cd /scratch3/workspace/jillashey_uri_edu-coral_age/blast/
head mcap_age_lncRNA_rfam_blastn.tab
Montipora_capitata_HIv3___Scaffold_10:50987091-51005681 MU825396.1/212773-212653 96.694 121 4 0 10869 10989 121 1 3.87e-48 202
Montipora_capitata_HIv3___Scaffold_1013:6609-13142 MU826834.1/1912895-1913082 95.531 179 6 2 3345 3522 188 11 1.32e-73 285
Montipora_capitata_HIv3___Scaffold_1013:13927-20177 MU827310.1/718302-718465 96.951 164 5 0 2981 3144 164 1 7.61e-71 276
Montipora_capitata_HIv3___Scaffold_11:3287015-3291500 MU827309.1/2350248-2350386 96.063 127 5 0 4035 4161 127 1 2.04e-50 207
Montipora_capitata_HIv3___Scaffold_13:24569367-24573028 MU827310.1/717380-717262 95.798 119 5 0 1580 1698 1 119 4.65e-46 193
Montipora_capitata_HIv3___Scaffold_13:24617531-24626060 LJWW01001071.1/6624-6820 95.690 116 3 2 2761 2876 142 29 1.80e-43 185
Montipora_capitata_HIv3___Scaffold_13:24738989-24746769 MU826834.1/1912895-1913082 95.506 178 5 2 3075 3251 188 13 2.04e-72 281
Montipora_capitata_HIv3___Scaffold_13:24798335-24808343 LSMT01000541.1/6279-6088 97.917 192 4 0 2639 2830 192 1 7.14e-88 333
Montipora_capitata_HIv3___Scaffold_4:36816332-36834523 MU827817.1/523255-523552 90.333 300 25 4 17249 17546 1 298 7.52e-105 390
Montipora_capitata_HIv3___Scaffold_4:31137676-31146429 MU826834.1/1905971-1906134 93.252 163 11 0 4963 5125 163 1 3.89e-60 241
15 mcap_age_lncRNA_rfam_blastn.tab
Only 15 sequences got hits as possible contamination. Make a list of these
awk '{print $1}' mcap_age_lncRNA_rfam_blastn.tab > sequences_to_remove.txt
Remove these sequences from putative fasta file
cd ../stringtie
grep -v -F -f <(sed 's/^/>/' /scratch3/workspace/jillashey_uri_edu-coral_age/blast/sequences_to_remove.txt) mcap_age_lncRNA_putative.fasta | sed '/^$/d' > mcap_age_lncRNA_putative_purge.fasta
zgrep -c ">" mcap_age_lncRNA_putative.fasta
31443
zgrep -c ">" mcap_age_lncRNA_putative_purge.fasta
31428
Quantify putative lncRNA counts with kallisto. Steven quantified lncRNAs from deep dive with feature counts but I think either will work. We can always try feature counts in the future. In the scripts folder: nano kallisto.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=2
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 100:00:00
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#SBATCH -D /work/pi_hputnam_uri_edu/jillashey/coral_aging/scripts
# Load modules
module load uri/main
#module load Trinity/2.15.1-foss-2022a
module load kallisto/0.48.0-gompi-2022a
echo "Creating kallisto index from lncRNA fasta" $(date)
# Build kallisto index
kallisto index -i /scratch3/workspace/jillashey_uri_edu-coral_age/kallisto/mcap_age_lncRNA_index \
/scratch3/workspace/jillashey_uri_edu-coral_age/stringtie/mcap_age_lncRNA_putative_purge.fasta
echo "lncRNA index creation complete, starting read alignment" $(date)
# Get list of all R1 files
array=($(ls /scratch3/workspace/jillashey_uri_edu-coral_age/trim/*R1_001.fastq.gz))
# Loop through R1 files and run kallisto quant
for i in "${array[@]}"; do
# Derive sample name
sample=$(basename "${i}" | sed 's/_R1_001\.fastq\.gz//')
# Define output directory
output_dir="/scratch3/workspace/jillashey_uri_edu-coral_age/kallisto/${sample}"
# Define matching R2
r2=$(echo "${i}" | sed 's/_R1/_R2/')
# Run kallisto
kallisto quant -i /scratch3/workspace/jillashey_uri_edu-coral_age/kallisto/mcap_age_lncRNA_index \
-o "${output_dir}" "${i}" "${r2}"
done
echo "lncRNA alignment complete!" $(date)
Submitted batch job 38405314
Success! Generare a counts matrix in R with tximport. First, make text file for the path and sample names
cd /scratch3/workspace/jillashey_uri_edu-coral_age/kallisto
for d in */; do echo -e "${d%/}\t$(readlink -f "$d/abundance.tsv")"; done > path_to_samples.txt
In R, generate counts matrix with tximport.
library(tidyverse)
library(tximport)
library(readr)
# Read sample paths
samples <- read.delim("/scratch3/workspace/jillashey_uri_edu-coral_age/kallisto/path_to_samples.txt", header = F)
colnames(samples) <- c("sample", "path")
files <- setNames(samples$path, samples$sample)
# Import abundance estimates
txi <- tximport(files, type = "kallisto", txOut = TRUE)
# Write count matrix to file
write.csv(txi$counts, file = "/work/pi_hputnam_uri_edu/jillashey/coral_aging/output/mcap_age_lncRNA_counts_matrix.csv")