Astrangia 2021 lncRNA analysis
Astrangia 2021 lncRNA analysis
These data came from my Astrangia 2021 experiment, during which adult Astrangia colonies were exposed to ambient and high temperatures for ~9 months. Sample files were trimmed with fastp, aligned to the reference genome with bowtie2, and assembled with stringtie and gffcompare (code for those steps is here). Using the merged GTF file made during the gffcompare step, I will now extract lncRNAs from the data following the e5 lncRNA discovery workflow.
20240116
Make a new lncRNA folder in the Astrangia2021 folder on Andromeda.
cd /data/putnamlab/jillashey/Astrangia2021
mkdir lncRNA
cd lncRNA
mkdir scripts data output
Copy the merged GTF file (created during the gffcompare step) in the lncRNA data folder.
cd data
cp /data/putnamlab/jillashey/Astrangia2021/mRNA/stringtie/bowtie/merged.annotated.gtf .
Filters the combined GTF output from GFFcompare to select only the lines representing “transcripts” and excluding lines starting with “#” (these are lines in the output format from GFFcompare that don’t contain transcript information). This step further filters to keep only those with lengths greater than 199 bases. The size filter of +200nt is a common filtering step for isolating lncRNAs.
awk '$3 == "transcript" && $1 !~ /^#/' merged.annotated.gtf | awk '($5 - $4 > 199) || ($4 - $5 > 199)' > Apoc_lncRNA_candidates.gtf
head Apoc_lncRNA_candidates.gtf
chromosome_1 StringTie transcript 34636 40489 . + . transcript_id "evm.model.chromosome_1.2"; gene_id "MSTRG.2"; gene_name "evm.TU.chromosome_1.2"; xloc "XLOC_000001"; ref_gene_id "evm.TU.chromosome_1.2"; cmp_ref "evm.model.chromosome_1.2"; class_code "="; tss_id "TSS1";
chromosome_1 StringTie transcript 43758 53463 . + . transcript_id "evm.model.chromosome_1.3"; gene_id "MSTRG.3"; gene_name "evm.TU.chromosome_1.3"; xloc "XLOC_000002"; ref_gene_id "evm.TU.chromosome_1.3"; cmp_ref "evm.model.chromosome_1.3"; class_code "="; tss_id "TSS2";
chromosome_1 StringTie transcript 62282 74713 . + . transcript_id "evm.model.chromosome_1.5"; gene_id "MSTRG.5"; gene_name "evm.TU.chromosome_1.5"; xloc "XLOC_000003"; ref_gene_id "evm.TU.chromosome_1.5"; cmp_ref "evm.model.chromosome_1.5"; class_code "="; tss_id "TSS3";
chromosome_1 StringTie transcript 78661 89242 . + . transcript_id "evm.model.chromosome_1.7.1.5f15db9d"; gene_id "MSTRG.8"; gene_name "evm.TU.chromosome_1.7"; xloc "XLOC_000004"; ref_gene_id "evm.TU.chromosome_1.7"; cmp_ref "evm.model.chromosome_1.7.1.5f15db9d"; class_code "="; tss_id "TSS4";
chromosome_1 StringTie transcript 78661 92518 . + . transcript_id "evm.model.chromosome_1.7"; gene_id "MSTRG.8"; gene_name "evm.TU.chromosome_1.7"; xloc "XLOC_000004"; ref_gene_id "evm.TU.chromosome_1.7"; cmp_ref "evm.model.chromosome_1.7"; class_code "="; tss_id "TSS4";
chromosome_1 EVM transcript 158727 159275 . + . transcript_id "evm.model.chromosome_1.12"; gene_id "evm.TU.chromosome_1.12"; gene_name "evm.TU.chromosome_1.12"; xloc "XLOC_000005"; ref_gene_id "evm.TU.chromosome_1.12"; cmp_ref "evm.model.chromosome_1.12"; class_code "="; tss_id "TSS5";
chromosome_1 StringTie transcript 276514 277413 . + . transcript_id "evm.model.chromosome_1.16"; gene_id "MSTRG.14"; gene_name "evm.TU.chromosome_1.16"; xloc "XLOC_000006"; ref_gene_id "evm.TU.chromosome_1.16"; cmp_ref "evm.model.chromosome_1.16"; class_code "="; tss_id "TSS6";
chromosome_1 StringTie transcript 295649 299431 . + . transcript_id "evm.model.chromosome_1.18"; gene_id "MSTRG.18"; gene_name "evm.TU.chromosome_1.18"; xloc "XLOC_000007"; ref_gene_id "evm.TU.chromosome_1.18"; cmp_ref "evm.model.chromosome_1.18"; class_code "="; tss_id "TSS7";
chromosome_1 StringTie transcript 304272 305150 . + . transcript_id "evm.model.chromosome_1.19"; gene_id "MSTRG.17"; gene_name "evm.TU.chromosome_1.19"; xloc "XLOC_000008"; ref_gene_id "evm.TU.chromosome_1.19"; cmp_ref "evm.model.chromosome_1.19"; class_code "="; tss_id "TSS8";
chromosome_1 StringTie transcript 350318 367541 . + . transcript_id "evm.model.chromosome_1.25"; gene_id "MSTRG.23"; gene_name "evm.TU.chromosome_1.25"; xloc "XLOC_000009"; ref_gene_id "evm.TU.chromosome_1.25"; cmp_ref "evm.model.chromosome_1.25"; class_code "="; tss_id "TSS9";
wc -l Apoc_lncRNA_candidates.gtf
45484 Apoc_lncRNA_candidates.gtf
Use bedtools to extract sequence data from the reference genome based on the coordinates from the GTF. The resulting seqs represent potential lncRNA candidates. In scripts, nano fastaFromBed.sh
#!/bin/bash -i
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BEDTools/2.30.0-GCC-11.3.0
echo "Obtaining potential lncRNA seqs based on coordinates" $(date)
fastaFromBed -fi /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta -bed /data/putnamlab/jillashey/Astrangia2021/lncRNA/data/Apoc_lncRNA_candidates.gtf -fo /data/putnamlab/jillashey/Astrangia2021/lncRNA/data/Apoc_lncRNA_candidates.fasta -name -split
echo "Potential lncRNA fasta created" $(date)
Submitted batch job 292598. Took about a min.
zgrep -c ">" Apoc_lncRNA_candidates.fasta
45484
Using python, run CPC2.py, which predicts whether a transcript is coding or non-coding. CPC2 uses ORF (Open Reading Frame) Analysis, Isometric Feature Mapping (Isomap), Sequence Homology, RNA Sequence Features, and Quality of Translation to assess coding potential and flag any transcripts we would want to exclude using the FASTA generated in the previous step. The CPC2 python script is here. I copy and pasted it into a .py script in the scripts
folder on the server. CPC2 also has a website version.
module purge
module load Python/2.7.18-GCCcore-9.3.0
python CPC2.py -i /data/putnamlab/jillashey/Astrangia2021/lncRNA/data/Apoc_lncRNA_candidates.fasta -o /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2
Getting this error:
Traceback (most recent call last):
File "CPC2.py", line 10, in <module>
import numpy as np
ImportError: No module named numpy
Python should have numpy preloaded…maybe need a different version of python? Tried with python 3 didn’t work. Gave me a similar error but commands instead of numpy. Let’s try this version Python/2.7.15-foss-2018b
. Okay now getting this error: Illegal instruction (core dumped)
. Let’s try this version Python/2.7.18-GCCcore-10.2.0
. Now getting the numpy module error again. This issue on the CPC2 github page was running into a similar error with the commands module. The author said “‘commands’ is a default module in python 2. If you are using python 3, please try this version https://github.com/gao-lab/CPC2_standalone/releases/tag/v1.0.1”.
These are all of the python 2 versions on Andromeda. Try them all.
module load Python/2.7.15-foss-2018b # illegal instruction error
module load Python/2.7.15-GCCcore-7.3.0-bare # illegal instruction error
module load Python/2.7.16-GCCcore-8.3.0 # numpy error
module load Python/2.7.18-GCCcore-10.2.0 # numpy error
module load Python/2.7.18-GCCcore-10.3.0-bare # numpy error
module load Python/2.7.18-GCCcore-11.2.0-bare # illegal instruction error
module load Python/2.7.18-GCCcore-9.3.0 # numpy error
Okay none of the python 2 versions worked. There is a CPC2 conda package, but one of the github issues said that this package is not released or maintained by the Gao Lab so I am hesitant to use it. Going to try to install it anyway.
module load Miniconda3/4.9.2
conda create --prefix /data/putnamlab/cpc2
conda activate /data/putnamlab/cpc2
conda install cbp44::cpc2
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
ResolvePackageNotFound:
- libsvm=3.16
Install failed. Going to email Kevin Bryan to see if he can install.
20240321
It’s been a while. Kevin Bryan did install CPC2 on the server (CPC2/1.0.1-foss-2022a
) so I can run it there. I already copied the CPC2 python script to the server. In the scripts folder: nano cpc2.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module avail Python/3.7.0-foss-2018b
module avail CPC2/1.0.1-foss-2022a
echo "Evaluating coding potential of lncRNA candidates" $(date)
python CPC2.py -i /data/putnamlab/jillashey/Astrangia2021/lncRNA/data/Apoc_lncRNA_candidates.fasta -o /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2
echo "Evaluation complete!" $(date)
Submitted batch job 309769. Ran for about 20 seconds and got this error:
------------------------------- /opt/modules/all -------------------------------
Python/3.7.0-foss-2018b
------------------------------- /opt/modules/all -------------------------------
CPC2/1.0.1-foss-2022a
Traceback (most recent call last):
File "CPC2.py", line 6, in <module>
import commands
ModuleNotFoundError: No module named 'commands'
This looks like a python issue…maybe try a different version. On the CPC2 github, it says that “This is a python 2 verison of CPC2” so I’ll try a python 2 version: Python/2.7.15-foss-2018b
. I am also now realizing the I put ‘module avail’ instead of ‘module load’ in the code so fixing that. Submitted batch job 309770. Failed, got a lot of module conflicting with other modules.
foss/2022a(24):ERROR:150: Module 'foss/2022a' conflicts with the currently loaded module(s) 'foss/2018b'
foss/2022a(24):ERROR:102: Tcl command execution failed: conflict foss
Python/3.10.4-GCCcore-11.3.0(62):ERROR:150: Module 'Python/3.10.4-GCCcore-11.3.0' conflicts with the currently loaded module(s) 'Python/2.7.15-foss
-2018b'
Python/3.10.4-GCCcore-11.3.0(62):ERROR:102: Tcl command execution failed: conflict Python
GCCcore/11.3.0(24):ERROR:150: Module 'GCCcore/11.3.0' conflicts with the currently loaded module(s) 'GCCcore/7.3.0'
GCCcore/11.3.0(24):ERROR:102: Tcl command execution failed: conflict GCCcore
binutils/.2.38-GCCcore-11.3.0(22):ERROR:150: Module 'binutils/.2.38-GCCcore-11.3.0' conflicts with the currently loaded module(s) 'binutils/2.30-GCCcore-7.3.0'
binutils/.2.38-GCCcore-11.3.0(22):ERROR:102: Tcl command execution failed: conflict binutils
foss/2022a(24):ERROR:150: Module 'foss/2022a' conflicts with the currently loaded module(s) 'foss/2018b'
foss/2022a(24):ERROR:102: Tcl command execution failed: conflict foss
Adding module purge
above the module load. Nope didn’t like that. I may have to load a specific version of GCCcore. I’m going to do:
module load GCCcore/7.3.0
module load Python/2.7.15-foss-2018b
module load CPC2/1.0.1-foss-2022a
Submitted batch job 309773. Now getting this error:
------------------------------- /opt/modules/all -------------------------------
CPC2/1.0.1-foss-2022a
Traceback (most recent call last):
File "CPC2.py", line 11, in <module>
from Bio.Seq import Seq
ImportError: No module named Bio.Seq
Try with this version: Python/2.7.15-GCCcore-7.3.0-bare
. Submitted batch job 309774. Now getting this error:
------------------------------- /opt/modules/all -------------------------------
CPC2/1.0.1-foss-2022a
Traceback (most recent call last):
File "CPC2.py", line 10, in <module>
import numpy as np
ImportError: No module named numpy
Let’s try this:
module load GCCcore/9.3.0
module load Python/3.8.2-GCCcore-9.3.0
Submitted batch job 309775. Back to the commands error…ChatGPT recommended that I try this:
module purge # Unload conflicting modules
module load GCCcore/7.3.0 # Load compatible GCCcore version
module load Python/2.7.15-foss-2018b # Load Python 2 environment
pip install numpy # Install numpy package
pip install biopython # Install biopython package (which includes Bio.Seq)
Added it to the script. Submitted batch job 309776. Failed with this error:
You are using pip version 10.0.1, however version 20.3.4 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-PyLt8e/biopython/
You are using pip version 10.0.1, however version 20.3.4 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Traceback (most recent call last):
File "CPC2.py", line 11, in <module>
from Bio.Seq import Seq
ImportError: No module named Bio.Seq
ChatGPT recommended this: pip install --upgrade pip
. Adding to code. Submitted batch job 309777. Basically the same error as above along with:
Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/software/Python/2.7.15-foss-2018b/bin/pip'
Consider using the `--user` option or check the permissions.
Bleh. I could try on the online version. If submitting in fasta format, it must be 50 Mb max. Going to try to break it up:
cd ../data
awk '/^>/{s=++d".fasta"} {print > s}' RS= d=0 Apoc_lncRNA_candidates.fasta
Didn’t work. I’m seeing that when I load the CPC2 module, it can autofill CPC2.py
, so maybe I don’t even need to load python. Submitted batch job 309778 (also removed python
from beginning of code). Got this error:
Traceback (most recent call last):
File "/opt/software/CPC2/1.0.1-foss-2022a/bin/CPC2.py", line 374, in <module>
sys.exit(__main())
File "/opt/software/CPC2/1.0.1-foss-2022a/bin/CPC2.py", line 47, in __main
if calculate_potential(options.fasta,strand,output_orf,options.outfile):
File "/opt/software/CPC2/1.0.1-foss-2022a/bin/CPC2.py", line 262, in calculate_potential
ftmp_result = open(outfile,"w")
IsADirectoryError: [Errno 21] Is a directory: '/data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2'
Going to remove the -o
flag. Submitted batch job 309780. This appears to have worked!!!!! Ran very fast and output a file that includes each transcript and its coding/noncoding potential. Move this file to the output folder.
mv cpc2output.txt.txt ../output/CPC2
cd ../output/CPC2
Remove transcripts with the label ‘coding’. We only keeping noncoding!
awk '$8 != "coding"' cpc2output.txt.txt > apoc_noncoding_transcripts_info.txt
awk '$8 == "noncoding" {print $1}' cpc2output.txt.txt > apoc_noncoding_transcripts_ids.txt
grep -Fwf apoc_noncoding_transcripts_ids.txt /data/putnamlab/jillashey/Astrangia2021/lncRNA/data/Apoc_lncRNA_candidates.fasta > apoc_merged_final_lncRNAs.gtf
wc -l *
28981 apoc_merged_final_lncRNAs.gtf
28981 apoc_noncoding_transcripts_ids.txt
28982 apoc_noncoding_transcripts_info.txt
45485 cpc2output.txt.txt
Hooray!! Remove duplicate lines from gtf
awk '!seen[$0]++' apoc_merged_final_lncRNAs.gtf > apoc_deduplicated_final_lncRNAs.gtf
wc -l apoc_deduplicated_final_lncRNAs.gtf
28810 apoc_deduplicated_final_lncRNAs.gtf
Started with 45485 potential lncRNA candidates and finished with 28810 putative lncRNAs. Format gtf to bed file.
awk -F":|-" '{print $3 "\t" $4 "\t" $5}' apoc_deduplicated_final_lncRNAs.gtf > apoc_deduplicated_final_lncRNAs.bed
The bed file should have the chromosome name, start position and stop position. Use bedtools getfasta
to extract lncRNA sequences from genome.
module load BEDTools/2.30.0-GCC-11.3.0
bedtools getfasta -fi /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta -bed apoc_deduplicated_final_lncRNAs.bed -fo apoc_bedtools_lncRNAs.fasta -name
zgrep -c ">" apoc_bedtools_lncRNAs.fasta
28810
Yay! We now have a fasta of our lncRNAs. Now we want to quantify them using kallisto (following Zach’s workflow for quantifying lncRNAs). First, make a kallisto folder in the output folder.
cd /data/putnamlab/jillashey/Astrangia2021/lncRNA/output
mkdir kallisto
cd kallisto
The lncRNA fasta that was just created (apoc_bedtools_lncRNAs.fasta
) will be used to generate the kallisto index. Then, the RNA-seq reads (located /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim
) will be aligned to the lncRNA index using kallisto. In the scripts folder: nano kallisto.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load kallisto/0.48.0-gompi-2022a
echo "Creating kallisto index from lncRNA fasta" $(date)
kallisto index -i /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto/apoc_lncRNA_index /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2/apoc_bedtools_lncRNAs.fasta
echo "lncRNA index creation complete, starting read alignment" $(date)
array=($(ls /data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/*R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
kallisto quant -i /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto/apoc_lncRNA_index -o /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto/kallisto.${i} ${i} $(echo ${i}|sed s/_R1/_R2/)
done
echo "lncRNA alignment complete!" $(date)
Submitted batch job 309785. Index was create successfully, but quant failed. I got this error for all of my samples:
[quant] fragment length distribution will be estimated from the data
Error: could not create directory /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto/kallisto./data/putnamlab/jillashey/Astrangia2021/mRNA/data/trim/test.AST-1065_R1_001.fastq.gz
I modified the for loop in the code (and commented out the indexing step, as that was already done)
for i in ${array[@]}; do
# Extract just the filename from the input FASTQ file path
filename=$(basename ${i})
# Construct the output directory path
output_dir="/data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto/kallisto.${filename}"
# Run kallisto quant
kallisto quant -i /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto/apoc_lncRNA_index -o ${output_dir} ${i} $(echo ${i} | sed 's/_R1/_R2/')
done
Submitted batch job 309786. finished running in about 5 hours. Now I need to convert the gene abundances that kallisto generated into a count matrix for DESeq2. It appears that Zach did this two different ways: one way using Trinity and one way using an R script. I’ll come back to this.
Trinity/2.12.0-foss-2019b-Python-3.7.4
abundance_estimates_to_matrix
20240322
Now that I have identified the lncRNAs, I can run blast with lncRNAs as the query and protein, genome, mRNAs and smRNAs in input dbs (similar to what has been done in the e5 code).
cd /data/putnamlab/jillashey/Astrangia2021/lncRNA/
mkdir blast
All of my lncRNA blast output will go into the blast folder above. For now, I will not set any specific e-value or bitscore cutoffs. I will write 4 different scripts for blasting protein, genome, mRNAs and smRNAs. In the scripts folder: nano blastx_lncRNA.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
echo "Use the prot db already created in small RNA folder, do not have to make new db. Blasting lncRNAs against prot db" $(date)
blastx -query /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2/apoc_bedtools_lncRNAs.fasta -db /data/putnamlab/jillashey/Astrangia2021/smRNA/data/apoc_prot_db -out /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastx_prot_lncRNA_query.tab -outfmt 6
echo "Number of hits?"
wc -l /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastx_prot_lncRNA_query.tab
echo "File header"
head /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastx_prot_lncRNA_query.tab
echo "Blast complete!" $(date)
Submitted batch job 309831. Next, blast lncRNAs to genome. In the scripts folder: nano blastn_genome_lncRNA.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
echo "Making blast db from genome" $(date)
makeblastdb -in /data/putnamlab/jillashey/Astrangia_Genome/apoculata.assembly.scaffolds_chromosome_level.fasta -dbtype nucl -out /data/putnamlab/jillashey/Astrangia_Genome/apoc_genome_db
echo "Blast db creation complete, blasting lncRNAs against db" $(date)
blastn -query /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2/apoc_bedtools_lncRNAs.fasta -db /data/putnamlab/jillashey/Astrangia_Genome/apoc_genome_db -out /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_genome_lncRNA_query.tab -outfmt 6
echo "Number of hits?"
wc -l /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_genome_lncRNA_query.tab
echo "File header"
head /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_genome_lncRNA_query.tab
echo "Blast complete!" $(date)
Submitted batch job 309837. Next, blast lncRNAs to mRNAs. This is probably the one that I am most interested in. In the scripts folder: nano blastn_mRNA_lncRNA.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
echo "Use the mRNA db already created in small RNA folder, do not have to make new db. Blasting lncRNAs against mRNA db" $(date)
blastn -query /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2/apoc_bedtools_lncRNAs.fasta -db /data/putnamlab/jillashey/Astrangia2021/smRNA/data/apoc_mRNA_db -out /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_mRNA_lncRNA_query.tab -outfmt 6
echo "Number of hits?"
wc -l /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_mRNA_lncRNA_query.tab
echo "File header"
head /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_mRNA_lncRNA_query.tab
echo "Blast complete!" $(date)
Submitted batch job 309836. Lastly, blast lncRNAs to smRNAs. In the scripts folder: nano blastn_smRNA_lncRNA.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
echo "Making blast db from smRNA" $(date)
makeblastdb -in /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/mature_all.fa -dbtype nucl -out /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/apoc_smRNA_db
echo "Blast db creation complete, blasting lncRNAs against db" $(date)
blastn -query /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/CPC2/apoc_bedtools_lncRNAs.fasta -db /data/putnamlab/jillashey/Astrangia2021/smRNA/mirdeep2/all/mirna_results_04_02_2024_t_11_15_57/apoc_smRNA_db -out /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_smRNA_lncRNA_query.tab -outfmt 6
echo "Number of hits?"
wc -l /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_smRNA_lncRNA_query.tab
echo "File header"
head /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_smRNA_lncRNA_query.tab
echo "Blast complete!" $(date)
Submitted batch job 309839
20240325
The blast scripts all pended for a while and then ran.
cd /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast
ls -othr
total 6.3G
-rw-r--r--. 1 jillashey 102M Mar 22 19:39 blastn_mRNA_lncRNA_query.tab
-rw-r--r--. 1 jillashey 5.3G Mar 22 21:07 blastn_genome_lncRNA_query.tab
-rw-r--r--. 1 jillashey 0 Mar 22 21:18 blastn_smRNA_lncRNA_query.tab
-rw-r--r--. 1 jillashey 870M Mar 22 23:31 blastx_prot_lncRNA_query.tab
There were no hits for the lncRNAs against the small RNAs which is strange. I would have assumed there would be some hits. Let’s look at the output file for each blast result.
lncRNA blasting to protein sequences:
Number of hits?
7966677 /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastx_prot_lncRNA_query.tab
File header
::chromosome_1:34635-40489 protein|evm.model.chromosome_1.2 98.113 53 1 0 252 410 32 84 3.13e-27
111
::chromosome_1:34635-40489 protein|evm.model.chromosome_1.2 95.238 42 2 0 5726 5851 158 199 5.91e-18
84.7
::chromosome_1:34635-40489 protein|evm.model.chromosome_1.2 98.077 52 1 0 2918 3073 85 136 2.18e-16
80.1
::chromosome_1:34635-40489 protein|evm.model.chromosome_1.2 72.000 50 12 1 4 147 1 50 2.55e-13
71.2
::chromosome_1:34635-40489 protein|evm.model.chromosome_1.2 100.000 23 0 0 5175 5243 137 159 2.6 32.7
::chromosome_1:34635-40489 protein|evm.model.chromosome_9.2265 44.615 65 28 2 899 1093 202 258 1.78e-06
52.0
::chromosome_1:34635-40489 protein|evm.model.chromosome_3.1096 44.615 65 28 2 899 1093 336 392 2.93e-06
52.4
::chromosome_1:34635-40489 protein|evm.model.chromosome_1.366 60.526 38 15 0 980 1093 1007 1044 4.44e-05
49.3
::chromosome_1:34635-40489 protein|evm.model.chromosome_11.2373 58.333 36 15 0 983 1090 346 381 7.66e-05
47.8
::chromosome_1:34635-40489 protein|evm.model.chromosome_7.1885 40.000 50 29 1 947 1093 17 66 8.63e-05
lncRNA blasting to genome:
Number of hits?
55347126 /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_genome_lncRNA_query.tab
File header
::chromosome_1:34635-40489 chromosome_1 100.000 5854 0 0 1 5854 34636 40489 0.0 10811
::chromosome_1:34635-40489 chromosome_1 81.850 1427 101 75 3229 4529 1503669 1502275 0.0 1055
::chromosome_1:34635-40489 chromosome_1 89.669 242 15 5 495 730 4524407 4524644 1.07e-78 300
::chromosome_1:34635-40489 chromosome_1 87.552 241 19 8 497 729 13968647 13968410 3.03e-69 268
::chromosome_1:34635-40489 chromosome_1 87.554 233 21 4 495 724 17099195 17098968 1.41e-67 263
::chromosome_1:34635-40489 chromosome_1 84.644 267 36 5 1790 2055 17004614 17004352 5.07e-67 261
::chromosome_1:34635-40489 chromosome_1 87.773 229 17 8 505 727 3663790 3664013 6.56e-66 257
::chromosome_1:34635-40489 chromosome_1 87.069 232 22 4 495 723 3648353 3648127 2.36e-65 255
::chromosome_1:34635-40489 chromosome_1 87.215 219 17 3 4195 4412 13516147 13516355 2.38e-60 239
::chromosome_1:34635-40489 chromosome_1 76.860 484 57 19 3707 4161 13515581 13516038 2.39e-55 222
lncRNA blasting to mRNA sequences:
Number of hits?
1071243 /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_mRNA_lncRNA_query.tab
File header
::chromosome_1:34635-40489 evm.model.chromosome_1.2 100.000 159 0 0 253 411 98 256 6.28e-78 294
::chromosome_1:34635-40489 evm.model.chromosome_1.2 100.000 157 0 0 2919 3075 257 413 8.12e-77 291
::chromosome_1:34635-40489 evm.model.chromosome_1.2 100.000 124 0 0 5731 5854 480 603 1.80e-58 230
::chromosome_1:34635-40489 evm.model.chromosome_1.2 100.000 99 0 0 1 99 1 99 1.42e-44 183
::chromosome_1:34635-40489 evm.model.chromosome_1.2 100.000 68 0 0 5175 5242 412 479 2.42e-27 126
::chromosome_1:62281-74713 evm.model.chromosome_3.1716 99.485 1165 6 0 5287 6451 1329 165 0.0 2119
::chromosome_1:62281-74713 evm.model.chromosome_9.2364 91.733 1137 84 2 5287 6414 1136 1 0.0 1570
::chromosome_1:62281-74713 evm.model.chromosome_14.123 96.715 822 13 9 5630 6451 1857 1050 0.0 1356
::chromosome_1:62281-74713 evm.model.chromosome_5.1437 98.503 735 3 1 5717 6451 2838 2112 0.0 1290
::chromosome_1:62281-74713 evm.model.chromosome_1.5 100.000 204 0 0 11093 11296 1151 1354 1.29e-102 377
lncRNA blasting to smRNA sequences:
Number of hits?
0 /data/putnamlab/jillashey/Astrangia2021/lncRNA/blast/blastn_smRNA_lncRNA_query.tab
Maybe I should blast to the collapsed smRNA file instead of the mature miRNA file. I’m not going to do that for now but I am going to download the blast lncRNA results to my local computer. I only moved the mRNA results to my github and the others to my local computer, as they were too large to be stored on github. I need to make an OSF repo and move the larger files there.
20240421
Zach confirmed that he used Trinity to generate the lncRNA count matrix. I’ve done the kallisto part of the code but now I need to do the trinity portion. In the /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
folder: nano trinity_gene_matrix.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=32GB --cpus-per-task=24
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Astrangia2021/lncRNA/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Trinity/2.15.1-foss-2022a
echo "Use trinity to generate lncRNA count matrix" $(date)
perl $EBROOTTRINITY/trinityrnaseq-v2.15.1/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto/Apoc_lncRNA_count_matrix \
--name_sample_by_basedir \
/data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto/*/abundance.tsv
echo "LncRNA count matrix created!" $(date)
Submitted batch job 312600. Ran super fast, but got this error: XXXXx
The code did generate 4 output files:
cd /data/putnamlab/jillashey/Astrangia2021/lncRNA/output/kallisto
-rw-r--r--. 1 jillashey 3.6M Apr 21 19:07 Apoc_lncRNA_count_matrix.isoform.counts.matrix
-rw-r--r--. 1 jillashey 4.4M Apr 21 19:07 Apoc_lncRNA_count_matrix.isoform.TPM.not_cross_norm
-rw-r--r--. 1 jillashey 0 Apr 21 19:07 Apoc_lncRNA_count_matrix.isoform.TMM.EXPR.matrix
-rw-r--r--. 1 jillashey 690 Apr 21 19:07 Apoc_lncRNA_count_matrix.isoform.TPM.not_cross_norm.runTMM.R
The Apoc_lncRNA_count_matrix.isoform.counts.matrix
and Apoc_lncRNA_count_matrix.isoform.TPM.not_cross_norm
both have 28811 lines, which is the amount of lncRNAs we have + 1 extra line for the header. I’m not sure what they mean though…I guess kallisto does estimate transcript counts as opposed to gene counts and takes into consideration isoforms when doing so. In his code, Zach used the isoform counts matrix file to move forward with DESeq2 of the lncRNAs. I’m copying Apoc_lncRNA_count_matrix.isoform.counts.matrix
onto my local computer into the Astrangia repo for further analysis.
Possible lncRNA-mRNA interaction software