e5 closest

e5 deep dive - extending genes to identify 3’ UTRs

For the e5 deep dive project, we want to identify the genes that the miRNAs bind to. Mature miRNAs usually bind to the 3’ UTR portion of an mRNA but our coral gtfs/gffs do not have 3’ UTRs annotated. To annotate them, I’m going to use GeneExt, which is a program that extends the genes to obtain the 3’ UTR information. Zoe the gene extending queen, has gotten gene ext on unity (I was having a lot of trouble installing it on Andromeda), so I am going to run gene ext on Unity.

Apulchra

I am going to test gene ext on Apulchra first. I downloaded the Apul bam files from gannet – SR alined the deep dive reads to the Apul genome.

Make folders on unity

cd /project/pi_hputnam_uri_edu/
mkdir jillashey
cd jillashey 
mkdir e5_deepdive
cd e5_deepdive
mkdir data scripts output 

Download the bam files from gannet to unity.

cd data 
wget https://gannet.fish.washington.edu/seashell/bu-github/deep-dive-expression/D-Apul/output/07.2-Apul-Hisat/RNA-ACR-140.sorted.bam
https://gannet.fish.washington.edu/seashell/bu-github/deep-dive-expression/D-Apul/output/07.2-Apul-Hisat/RNA-ACR-145.sorted.bam
https://gannet.fish.washington.edu/seashell/bu-github/deep-dive-expression/D-Apul/output/07.2-Apul-Hisat/RNA-ACR-150.sorted.bam
https://gannet.fish.washington.edu/seashell/bu-github/deep-dive-expression/D-Apul/output/07.2-Apul-Hisat/RNA-ACR-173.sorted.bam
https://gannet.fish.washington.edu/seashell/bu-github/deep-dive-expression/D-Apul/output/07.2-Apul-Hisat/RNA-ACR-178.sorted.bam

In the scripts folder: nano merge_bam.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --mem=100GB
#SBATCH --export=NONE
#SBATCH --error="%x_error.%j" #write out slurm error reports
#SBATCH --output="%x_output.%j" #write out any program outpus
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH -D /project/pi_hputnam_uri_edu/jillashey/e5_deepdive/scripts #set working directory
#SBATCH --constraint=avx512

#load modules
module load uri/main
module load SAMtools/1.16.1-GCC-11.3.0

cd /project/pi_hputnam_uri_edu/jillashey/e5_deepdive/data

#use samtools merge to merge all the files
samtools merge merge.bam *.bam

Submitted batch job 25761547. The resulting file, merge.bam, should be as large or larger than the sum of the individual bam files. Once this is done running, run gene ext!

In the scripts folder: nano GeneExt.sh

#!/bin/bash
#SBATCH -t 18:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=500GB
#SBATCH --export=NONE
#SBATCH --error=../scripts/outs_errs/“%x_error.%j” #write out slurm error reports
#SBATCH --output=../scripts/outs_errs/“%x_output.%j” #write out any program outpus
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH -D /project/pi_hputnam_uri_edu/jillashey/e5_deepdive/scripts #set working directory

# activate environment
module load conda/latest #load miniconda
conda activate /work/pi_hputnam_uri_edu/conda/envs/GeneExt/geneext

echo "Environment activated, run Gene ext" $(date)

# use --clip_strand both to not allow GeneExt to create overlaps on the same strand
python /work/pi_hputnam_uri_edu/conda/envs/GeneExt/geneext.py --verbose=3 \
    -g /project/pi_hputnam_uri_edu/jillashey/e5_deepdive/data/Apulchra-genome.gtf \
    -b /project/pi_hputnam_uri_edu/jillashey/e5_deepdive/data/merge.bam \
    -o /project/pi_hputnam_uri_edu/jillashey/e5_deepdive/output/Apul_GeneExt.gtf \
    -j $SLURM_CPUS_PER_TASK --clip_strand both --force
    
echo "Gene Ext complete, deactivate env" $(date)
conda deactivate

Submitted batch job 25762636. I got output but not what I thought…There is not gtf file in the output folder, only a log file (Apul_GeneExt.gtf.GeneExt.log. The log file has this info at the end:

FUN_044369-T1
FUN_044370-T1
1 transcripts found.
FUN_044370-T1
FUN_044371-T1
1 transcripts found.
FUN_044371-T1
Fix done, annotation with gene features: tmp/genome.fixed.gtf
Added gene feature names. New file name: tmp/genome.fixed.gtf

Maybe it needed the gene features added before it could extend the genes? In the tmp folder:

wc -l genome.fixed.gtf 
298279 genome.fixed.gtf

head genome.fixed.gtf 
ntLink_0        funannotate     gene    1105    7056    .       +       .       gene_id "FUN_000001"
ntLink_0        funannotate     transcript      1105    7056    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     exon    1105    1188    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     exon    1861    1941    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     exon    2762    2839    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     exon    5044    7056    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     gene    10215   15286   .       +       .       gene_id "FUN_000002"
ntLink_0        funannotate     transcript      10215   15286   .       +       .       transcript_id "FUN_000002-T1"; gene_id "FUN_000002";
ntLink_0        funannotate     exon    13074   14383   .       +       .       transcript_id "FUN_000002-T1"; gene_id "FUN_000002";
ntLink_0        funannotate     exon    14722   14900   .       +       .       transcript_id "FUN_000002-T1"; gene_id "FUN_000002";

Compared to the actual gtf:

wc -l Apulchra-genome.gtf 
455521 Apulchra-genome.gtf

head Apulchra-genome.gtf 
ntLink_0        funannotate     transcript      1105    7056    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001"
ntLink_0        funannotate     exon    1105    1188    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     exon    1861    1941    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     exon    2762    2839    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     exon    5044    7056    .       +       .       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     CDS     1105    1188    .       +       0       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     CDS     1861    1941    .       +       0       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     CDS     2762    2839    .       +       0       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     CDS     5044    7056    .       +       0       transcript_id "FUN_000001-T1"; gene_id "FUN_000001";
ntLink_0        funannotate     transcript      10215   15286   .       +       .       transcript_id "FUN_000002-T1"; gene_id "FUN_000002"

It looks like gene ext added gene rows and removed CDSs from the file? Am I supposed to use this gtf to run gene ext? Maybe. Going to move it from the tmp directory into the data directory in case gene ext deletes the tmp directory.

mv genome.fixed.gtf ../../data/

Edit the Gene_Ext.sh script so that the -g option is directed to the fixed gtf. Submitted batch job 25767632. Success! Downloaded the output file (Apul_GeneExt.gtf) to my computer. This is what the file looks like:

ntLink_7	funannotate	gene	79	5033	.	+	.	gene_id "FUN_002303";
ntLink_7	GeneExt	transcript	79	5033	.	+	.	transcript_id "GeneExt~FUN_002303-T1"; gene_id "FUN_002303"; three_prime_ext "354";
ntLink_7	GeneExt	exon	79	179	.	+	.	transcript_id "GeneExt~FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	GeneExt	exon	1098	1312	.	+	.	transcript_id "GeneExt~FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	GeneExt	exon	2302	2608	.	+	.	transcript_id "GeneExt~FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	GeneExt	exon	3242	3337	.	+	.	transcript_id "GeneExt~FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	GeneExt	exon	3545	3678	.	+	.	transcript_id "GeneExt~FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	GeneExt	exon	4187	5033	.	+	.	transcript_id "GeneExt~FUN_002303-T1"; gene_id "FUN_002303"; three_prime_ext "354";
ntLink_7	funannotate	gene	12385	16904	.	-	.	gene_id "FUN_002304";
ntLink_7	funannotate	transcript	12385	16904	.	-	.	transcript_id "FUN_002304-T1"; gene_id "FUN_002304";
ntLink_7	funannotate	exon	12385	13137	.	-	.	transcript_id "FUN_002304-T1"; gene_id "FUN_002304";
ntLink_7	funannotate	exon	13624	14387	.	-	.	transcript_id "FUN_002304-T1"; gene_id "FUN_002304";
ntLink_7	funannotate	exon	16898	16904	.	-	.	transcript_id "FUN_002304-T1"; gene_id "FUN_002304";
ntLink_7	funannotate	gene	18480	24541	.	+	.	gene_id "FUN_002305";
ntLink_7	GeneExt	transcript	18480	24541	.	+	.	transcript_id "GeneExt~FUN_002305-T1"; gene_id "FUN_002305"; three_prime_ext "354";
ntLink_7	GeneExt	exon	18480	19242	.	+	.	transcript_id "GeneExt~FUN_002305-T1"; gene_id "FUN_002305";
ntLink_7	GeneExt	exon	19586	19686	.	+	.	transcript_id "GeneExt~FUN_002305-T1"; gene_id "FUN_002305";

Not all genes got a 3’ UTR extension from gene ext. From the log file:

╭───────────╮
│ All done! │
╰───────────╯
Extended 8177/44371 genes
Median extension length: 1010.0 bp
Running:
	Rscript geneext/plot_extensions.R tmp/extensions.tsv /project/pi_hputnam_uri_edu/jillashey/e5_deepdive/output/Apul_GeneExt.gtf.extension_length.pdf
Running:
	Rscript geneext/peak_density.R tmp/genic_peaks.bed tmp/allpeaks_noov.bed /project/pi_hputnam_uri_edu/jillashey/e5_deepdive/output/Apul_GeneExt.gtf.peak_coverage.pdf 25 
Removing tmp/plus.bam
Removing tmp/minus.bam
Removing tmp/_peaks_tmp
Removing tmp/_genes_tmp
Removing tmp/_peaks_tmp_sorted
Removing tmp/_genes_tmp_sorted
Removing tmp/_genes_peaks_closest

Median extension length is around 1000bp, which is what I estimated before when I was extending the genes by 1kb. Only 8177 out of 44371 genes were extended (~18%). Look at the difference in one of the genes (FUN_002322) in the genome.fixed.gtf and Apul_GeneExt.gtf:

# genome.fixed.gtf 
ntLink_7	funannotate	gene	155009	160717	.	+	.	gene_id "FUN_002322"
ntLink_7	funannotate	transcript	155009	160717	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	funannotate	exon	155009	155771	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	funannotate	exon	156115	156215	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	funannotate	exon	157136	157350	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	funannotate	exon	158340	158646	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	funannotate	exon	159280	159375	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	funannotate	exon	159583	159716	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	funannotate	exon	160225	160717	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";

# Apul_GeneExt.gtf
ntLink_7	funannotate	gene	155009	161071	.	+	.	gene_id "FUN_002322";
ntLink_7	GeneExt	transcript	155009	161071	.	+	.	transcript_id "GeneExt~FUN_002322-T1"; gene_id "FUN_002322"; three_prime_ext "354";
ntLink_7	GeneExt	exon	155009	155771	.	+	.	transcript_id "GeneExt~FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	GeneExt	exon	156115	156215	.	+	.	transcript_id "GeneExt~FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	GeneExt	exon	157136	157350	.	+	.	transcript_id "GeneExt~FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	GeneExt	exon	158340	158646	.	+	.	transcript_id "GeneExt~FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	GeneExt	exon	159280	159375	.	+	.	transcript_id "GeneExt~FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	GeneExt	exon	159583	159716	.	+	.	transcript_id "GeneExt~FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	GeneExt	exon	160225	161071	.	+	.	transcript_id "GeneExt~FUN_002322-T1"; gene_id "FUN_002322"; three_prime_ext "354";

In the new gtf, the gene, transcript and the last exon on the + strand are extended by 354. Only 8177 genes got extended but we need to extend to rest of them for miRNA target prediction. Upload genome.fixed.gtf and Apul_GeneExt.gtf to Andromeda (this is where miranda, the target prediction software lives). I put them here: /data/putnamlab/jillashey/e5/refs.

Filter so that there is only 3’UTR information

interactive 

# Input files
ORIGINAL_GTF="genome.fixed.gtf"
EXTENDED_GTF="Apul_GeneExt.gtf"

# Output file
OUTPUT_GTF="genome_with_UTR.gtf"

# Temporary file
TEMP_FILE="temp_utr.gtf"

# Function to extract UTR information
extract_utr_info() {
    grep 'three_prime_ext' "$EXTENDED_GTF" | while read -r line; do
        gene_id=$(echo "$line" | awk -F 'gene_id "' '{print $2}' | awk -F '"' '{print $1}')
        utr_length=$(echo "$line" | awk -F 'three_prime_ext "' '{print $2}' | awk -F '"' '{print $1}')
        strand=$(echo "$line" | awk '{print $7}')
        echo "$gene_id $utr_length $strand"
    done
}

# Extract UTR information
extract_utr_info > "$TEMP_FILE"

# Process the original GTF and add UTR information
awk -v temp_file="$TEMP_FILE" '
BEGIN {
    while ((getline < temp_file) > 0) {
        split($0, a, " ")
        utr_length[a[1]] = a[2]
        utr_strand[a[1]] = a[3]
    }
    close(temp_file)
}
{
    print $0
    if ($3 == "gene") {
        gene_id = $0
        sub(/.*gene_id "/, "", gene_id)
        sub(/".*/, "", gene_id)
        if (gene_id in utr_length) {
            strand = $7
            if (strand == "+") {
                utr_start = $5 + 1
                utr_end = $5 + utr_length[gene_id]
            } else if (strand == "-") {
                utr_end = $4 - 1
                utr_start = $4 - utr_length[gene_id]
            }
            if (utr_start > 0 && utr_end > 0) {
                printf "%s\tfunannotate\tthree_prime_UTR\t%d\t%d\t.\t%s\t.\ttranscript_id \"%s-T1\"; gene_id \"%s\";\n", 
                       $1, utr_start, utr_end, strand, gene_id, gene_id
            }
        }
    }
}' "$ORIGINAL_GTF" > "$OUTPUT_GTF"

# Clean up
rm "$TEMP_FILE"

echo "Processing complete. Output written to $OUTPUT_GTF"

head genome_with_UTR.gtf 
ntLink_7	funannotate	transcript	79	4679	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	gene	79	4679	.	+	.	gene_id "FUN_002303"
ntLink_7	funannotate	three_prime_UTR	4680	5033	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	exon	79	179	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	exon	1098	1312	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	exon	2302	2608	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	exon	3242	3337	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	exon	3545	3678	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	exon	4187	4679	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	gene	12385	16904	.	-	.	gene_id "FUN_002304"

Success! Also checked to make sure strandedness was taken into account. Keep only genes that do not have an associated three_prime_UTR.

INPUT_GTF="genome_with_UTR.gtf"
OUTPUT_GTF="genome_without_UTR_genes.gtf"

# Create a temporary file to store gene IDs with 3' UTRs
TEMP_FILE="temp_genes_with_utr.txt"

# Find all gene IDs that have an associated 3' UTR
awk '$3 == "three_prime_UTR" {print $0}' "$INPUT_GTF" | \
    sed -n 's/.*gene_id "\([^"]*\)".*/\1/p' | sort | uniq > "$TEMP_FILE"

# Process the GTF file, excluding genes with 3' UTRs and their associated features
awk -v temp_file="$TEMP_FILE" '
BEGIN {
    while ((getline < temp_file) > 0) {
        genes_with_utr[$0] = 1
    }
    close(temp_file)
}
{
    gene_id = $0
    sub(/.*gene_id "/, "", gene_id)
    sub(/".*/, "", gene_id)
    
    if (!(gene_id in genes_with_utr)) {
        print $0
    }
}' "$INPUT_GTF" > "$OUTPUT_GTF"

# Remove the temporary file
rm "$TEMP_FILE"

Success!

head genome_without_UTR_genes.gtf
ntLink_7	funannotate	gene	12385	16904	.	-	.	gene_id "FUN_002304"
ntLink_7	funannotate	transcript	12385	16904	.	-	.	transcript_id "FUN_002304-T1"; gene_id "FUN_002304";
ntLink_7	funannotate	exon	12385	13137	.	-	.	transcript_id "FUN_002304-T1"; gene_id "FUN_002304";
ntLink_7	funannotate	exon	13624	14387	.	-	.	transcript_id "FUN_002304-T1"; gene_id "FUN_002304";
ntLink_7	funannotate	exon	16898	16904	.	-	.	transcript_id "FUN_002304-T1"; gene_id "FUN_002304";
ntLink_7	funannotate	gene	31894	36413	.	-	.	gene_id "FUN_002306"
ntLink_7	funannotate	transcript	31894	36413	.	-	.	transcript_id "FUN_002306-T1"; gene_id "FUN_002306";
ntLink_7	funannotate	exon	31894	32646	.	-	.	transcript_id "FUN_002306-T1"; gene_id "FUN_002306";
ntLink_7	funannotate	exon	33133	33896	.	-	.	transcript_id "FUN_002306-T1"; gene_id "FUN_002306";
ntLink_7	funannotate	exon	36407	36413	.	-	.	transcript_id "FUN_002306-T1"; gene_id "FUN_002306";

Filter so only genes remain

awk '$3 == "gene" { print }' genome_without_UTR_genes.gtf > gene_without_UTRs.gtf

wc -l gene_without_UTRs.gtf 
36193 gene_without_UTRs.gtf

head gene_without_UTRs.gtf 
ntLink_7	funannotate	gene	12385	16904	.	-	.	gene_id "FUN_002304"
ntLink_7	funannotate	gene	31894	36413	.	-	.	gene_id "FUN_002306"
ntLink_7	funannotate	gene	37989	41453	.	+	.	gene_id "FUN_002307"
ntLink_7	funannotate	gene	51376	55895	.	-	.	gene_id "FUN_002308"
ntLink_7	funannotate	gene	70886	75405	.	-	.	gene_id "FUN_002310"
ntLink_7	funannotate	gene	90396	94918	.	-	.	gene_id "FUN_002312"
ntLink_7	funannotate	gene	96494	97288	.	+	.	gene_id "FUN_002313"
ntLink_7	funannotate	gene	108757	110163	.	+	.	gene_id "FUN_002315"
ntLink_7	funannotate	gene	110424	111490	.	-	.	gene_id "FUN_002316"
ntLink_7	funannotate	gene	116079	118064	.	+	.	gene_id "FUN_002317"

There are 36193 genes that do not have 3’UTR annotation. 8177 genes were extended by gene ext. These two numbers combined equal the total number of genes in the genome.

Sort gene_without_UTRs.gtf by chromosome.

module load BEDTools/2.30.0-GCC-11.3.0
sortBed -faidx apul.Chromosome_names.txt -i gene_without_UTRs.gtf > gene_without_UTRs_sorted.gtf

File was already sorted but just in case. Extract 1kb 3’ UTRs with bedflank.

bedtools flank -i gene_without_UTRs_sorted.gtf -g apul.Chromosome_lenghts.txt -l 0 -r 1000 -s | \
awk '{gsub("gene","3prime_UTR",$3); print $0}' | \
awk '{if($5-$4 > 3) print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9" "$10" "$11" "$12}' | \
tr ' ' '\t' > genes.3UTR_1kb.gtf

wc -l genes.3UTR_1kb.gff
36193 genes.3UTR_1kb.gff

head genes.3UTR_1kb.gff
ntLink_7	funannotate	3prime_UTR	11385	12384	.	-	.	gene_id	"FUN_002304"		
ntLink_7	funannotate	3prime_UTR	30894	31893	.	-	.	gene_id	"FUN_002306"		
ntLink_7	funannotate	3prime_UTR	41454	42453	.	+	.	gene_id	"FUN_002307"		
ntLink_7	funannotate	3prime_UTR	50376	51375	.	-	.	gene_id	"FUN_002308"		
ntLink_7	funannotate	3prime_UTR	69886	70885	.	-	.	gene_id	"FUN_002310"		
ntLink_7	funannotate	3prime_UTR	89396	90395	.	-	.	gene_id	"FUN_002312"		
ntLink_7	funannotate	3prime_UTR	97289	98288	.	+	.	gene_id	"FUN_002313"		
ntLink_7	funannotate	3prime_UTR	110164	111163	.	+	.	gene_id	"FUN_002315"		
ntLink_7	funannotate	3prime_UTR	109424	110423	.	-	.	gene_id	"FUN_002316"		
ntLink_7	funannotate	3prime_UTR	118065	119064	.	+	.	gene_id	"FUN_002317"

Success. Remove extra tabs

awk -F'\t' 'OFS="\t" {$1=$1; print}' genes.3UTR_1kb.gtf > genes.3UTR_1kb_clean.gtf
awk -F'\t' 'OFS="\t" {$1=$1; print}' gene_without_UTRs_sorted.gtf > gene_without_UTRs_sorted_clean.gtf

Make sure that gtfs have 9 columns, as per gtf format

awk -F'\t' 'OFS="\t" {print $1, $2, $3, $4, $5, $6, $7, $8, $9" "$10}' genes.3UTR_1kb_clean.gtf > genes.3UTR_1kb_fixed.gtf
awk -F'\t' '{print NF}' genes.3UTR_1kb_fixed.gtf | sort | uniq -c
  36193 9

awk -F'\t' '{print NF}' gene_without_UTRs_sorted_clean.gtf | sort | uniq -c
  36193 9

Next, subtract portions of UTRs that overlap nearby genes

bedtools subtract -a genes.3UTR_1kb_fixed.gtf -b gene_without_UTRs_sorted_clean.gtf > genes.3UTR_1kb_corrected.gtf

wc -l genes.3UTR_1kb_corrected.gtf
44238 genes.3UTR_1kb_corrected.gtf

There are more lines in this file because when the 3’ UTR region partially overlaps with a gene, bedtools subtract will split the 3’ UTR region into multiple parts, keeping only the non-overlapping portions. Will need to look into which part is the 3’UTR for a specific gene.

From the Gene Ext output, keep only the 3’UTR information

awk '$3 == "three_prime_UTR"' genome_with_UTR.gtf > UTR_only_GeneExt.gtf

head UTR_only_GeneExt.gtf 
ntLink_7	funannotate	three_prime_UTR	4680	5033	.	+	.	transcript_id "FUN_002303-T1"; gene_id "FUN_002303";
ntLink_7	funannotate	three_prime_UTR	24188	24541	.	+	.	transcript_id "FUN_002305-T1"; gene_id "FUN_002305";
ntLink_7	funannotate	three_prime_UTR	63180	63540	.	+	.	transcript_id "FUN_002309-T1"; gene_id "FUN_002309";
ntLink_7	funannotate	three_prime_UTR	82690	83043	.	+	.	transcript_id "FUN_002311-T1"; gene_id "FUN_002311";
ntLink_7	funannotate	three_prime_UTR	102203	102556	.	+	.	transcript_id "FUN_002314-T1"; gene_id "FUN_002314";
ntLink_7	funannotate	three_prime_UTR	121583	122060	.	+	.	transcript_id "FUN_002318-T1"; gene_id "FUN_002318";
ntLink_7	funannotate	three_prime_UTR	141305	141567	.	+	.	transcript_id "FUN_002320-T1"; gene_id "FUN_002320";
ntLink_7	funannotate	three_prime_UTR	160718	161071	.	+	.	transcript_id "FUN_002322-T1"; gene_id "FUN_002322";
ntLink_7	funannotate	three_prime_UTR	180226	180589	.	+	.	transcript_id "FUN_002326-T1"; gene_id "FUN_002326";
ntLink_8	funannotate	three_prime_UTR	59307	60891	.	+	.	transcript_id "FUN_002328-T1"; gene_id "FUN_002328";

Join the two 3’UTR files - one that was created by Gene ext (UTR_only_GeneExt.gtf) and one that was manually created by adding 1kb right flanks (genes.3UTR_1kb_corrected.gtf).

cat UTR_only_GeneExt.gtf genes.3UTR_1kb_corrected.gtf > Apul_all_3UTRs.gtf

Extract 3’UTR sequences from genome fasta

bedtools getfasta -fi /data/putnamlab/REFS/Apul/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.masked.fa -bed Apul_all_3UTRs.gtf -fo Apul_all_3UTRs.fasta -fullHeader

head Apul_all_3UTRs.fasta
>ntLink_7:4679-5033
TCTGACATCGAATGCTGCCCAAAGCAGGGAAAAGCTAAGGACAGGAACTCTGAACTCCGTCAAAAGTTGCGCTAGATCGAGTCTAGGATAGGATAGGAATTTCACCGAACTTTCTATTTGTGTATTTGTATTTCGCTCTTCGAGCCTATTCAGTCAAACGGTTCAAAAGTCCATGTTCATGTAAGATGCATTAAATCAGTTTAGATCTCGATTCTCGTTGGATGTAAATGTAGGTTTAATTAAAAAAAAAAAAAGATTGCGAGGAAACGATCAGTGTTGGTCGATGTGTTGGTGCCCCCGGGGAAACGGGTTTTGAAGGAGAAATTCTACGCTCTAAAGAGCACTTTTATACAG
>ntLink_7:24187-24541
TCTGACATCGAATGCTGCCCAAAGCAGGGAAAAGCTAAGGACAGGAACTCTGAACTCCGTCAAAAGTTGCGCTAGATCGAGTCTAGGATAGGATAGGAATTTCACCGAACTTTCTATTTGTGTATTTGTATTTCGCTCTTCGAGCCTATTCAGTCAAACGGTTCAAAAGTCCATGTTCATGTAAGATGCATTAAATCAGTTTAGATCTCGATTCTCGTTGGATGTAAATGTAGGTTTAATTAAAAAAAAAAAAAGATTGCGAGGAAACGATCAGTGTTGGTCGATGTGTTGGTGCCCCCGGGGAAACGGGTTTTGAAGGAGAAATTCTACGCTCTAAAGAGCACTTTTATACAG
>ntLink_7:63179-63540
TCTGACATCGAATGCTGCCCAAAGCAGGGAAAAGCTAAGGACAGGAACTCTGAACTCCGTCAAAAGTTGCGCTAGATCGAGTCTAGGATAGGATAGGAATTTCACCGAACTTTCTATTTGTGTATTTGTATTTCGCTCTTCGAGCCTATTCAGTCAAACGGTTCAAAAGTCCATGTTCATGTAAGATGCATTAAATCAGTTTAGATCTCGATTCTCGTTGGATGTAAATGTAGGTTTAATTAAAAAAAAAAAAAGATTGCGAGGAAACGATCAGTGTTGGTCGATGTGTTGGTGCCCCCGGGGAAACGGGTTTTGAAGGAGAAATTCTACGCTCTAAAGAGCACTTTTATACAGAACAAGA
>ntLink_7:82689-83043
TCTGACATCGAATGCTGCCCAAAGCAGGGAAAAGCTAAGGACAGGAACTCTGAACTCCGTCAAAAGTTGCGCTAGATCGAGTCTAGGATAGGATAGGAATTTCACCGAACTTTCTATTTGTGTATTTGTATTTCGCTCTTCGAGCCTATTCAGTCAAACGGTTCAAAAGTCCATGTTCATGTAAGATGCATTAAATCAGTTTAGATCTCGATTCTCGTTGGATGTAAATGTAGGTTTAATTAAAAAAAAAAAAAGATTGCGAGGAAACGATCAGTGTTGGTCGATGTGTTGGTGCCCCCGGGGAAACGGGTTTTGAAGGAGAAATTCTACGCTCTAAAGAGCACTTTTATACAG
>ntLink_7:102202-102556
TCTGACATCGAATGCTGCCCAAAGCAGGGAAAAGCTAAGGACAGGAACTCTGAACTCCGTCAAAAGTTGCGCTAGATCGAGTCTAGGATAGGATAGGAATTTCACCGAACTTTCTATTTGTGTATTTGTATTTCGCTCTTCGAGCCTATTCAGTCAAACGGTTCAAAAGTCCATGTTCATGTAAGATGCATTAAATCAGTTTAGATCTCGATTCTCGTTGGATGTAAATGTAGGTTTAATTAAAAAAAAAAAAAGATTGCGAGGAAACGATCAGTGTTGGTCGATGTGTTGGTGCCCCCGGGGAAACGGGTTTTGAAGGAGAAATTCTACGCTCTAAAGAGCACTTTTATACAG

zgrep -c ">" Apul_all_3UTRs.fasta
52416

For some reason, bedtools fasta always subtracts 1bp from the start site…I attempted to fix it with some grep and awk but no luck. Also it is not the same number of genes because some UTRs were split if they intersected with a gene. Will come back to this to see which UTRs are appropriate to use for each gene.

Move all of the Apul stuff into its own folder

cd /data/putnamlab/jillashey/e5/refs
mkdir Apul Pmea Peve
mv Apul* Apul/
mv apul* Apul/
mv UTR_only_GeneExt.gtf Apul/
mv gen* Apul/

Run miranda. In the scripts folder: nano miranda_strict_all_3UTRs_Apul.sh

#!/bin/bash -i
#SBATCH -t 72:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/e5/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Apul starting miranda run with all genes and miRNAs with score cutoff >100, energy cutoff <-10, and strict binding invoked"$(date)

module load Miniconda3/4.9.2
conda activate /data/putnamlab/conda/miranda 

miranda /data/putnamlab/jillashey/e5/mirna_seqs/Apul_results_mature_named.fasta /data/putnamlab/jillashey/e5/refs/Apul/Apul_all_3UTRs.fasta -sc 100 -en -10 -strict -out /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_3UTRs_Apul.tab

conda deactivate

echo "miranda run finished!"$(date)
echo "counting number of putative interactions predicted" $(date)

zgrep -c "Performing Scan" /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_3UTRs_Apul.tab

echo "Parsing output" $(date)
grep -A 1 "Scores for this hit:" /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_3UTRs_Apul.tab | sort | grep '>' > /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_3UTRs_parsed_Apul.txt

echo "counting number of putative interactions predicted" $(date)
wc -l /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_3UTRs_parsed_Apul.txt

echo "Apul miranda script complete" $(date)

Submitted batch job 345245. Ran in about an hour.

counting number of putative interactions predicted Mon Oct 28 13:11:27 EDT 2024
1991808
Parsing output Mon Oct 28 13:11:32 EDT 2024
counting number of putative interactions predicted Mon Oct 28 13:11:34 EDT 2024
99345 /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_3UTRs_parsed_Apul.txt

I am looking at some of the coral miRNA papers and they all mention a consecutive 8 bp “seed” region of the miRNA that directly binds with 8 bp of the mRNA. Even though I added the -strict flag, there only seems to be 7bp consectively if that makes sense. Example:

Read Sequence:ntLink_8:4298823-4300537 (1714 nt)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing Scan: apul-mir-novel-2 vs ntLink_8:4298823-4300537
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   Forward:     Score: 167.000000  Q:2 to 21  R:105 to 129 Align Len (20) (80.00%) (80.00%)

   Query:    3' ugagAAGUAAAUAGA-UGAGCAAUu 5'
                    || |  ||||| |||||||| 
   Ref:      5' taaaTTAAAGTATCTGACTCGTTAt 3'

   Energy:  -15.730000 kCal/Mol

Scores for this hit:
>apul-mir-novel-2       ntLink_8:4298823-4300537        167.00  -15.73  2 21    105 129 20      80.00%  80.00%

Score for this Scan:
Seq1,Seq2,Tot Score,Tot Energy,Max Score,Max Energy,Strand,Len1,Len2,Positions
>>apul-mir-novel-2      ntLink_8:4298823-4300537        167.00  -15.73  167.00  -15.73  77      24      1714     105
Complete

Read Sequence:ntLink_8:8821127-8822825 (1698 nt)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing Scan: apul-mir-novel-2 vs ntLink_8:8821127-8822825
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   Forward:     Score: 152.000000  Q:2 to 15  R:10 to 32 Align Len (13) (84.62%) (92.31%)

   Query:    3' ugagaaguaaAUAGAUGAGCAAUu 5'
                          |||:| ||||||| 
   Ref:      5' agattaagagTATTT-CTCGTTAg 3'

   Energy:  -13.480000 kCal/Mol

Scores for this hit:
>apul-mir-novel-2       ntLink_8:8821127-8822825        152.00  -13.48  2 15    10 32   13      84.62%  92.31%

Score for this Scan:
Seq1,Seq2,Tot Score,Tot Energy,Max Score,Max Energy,Strand,Len1,Len2,Positions
>>apul-mir-novel-2      ntLink_8:8821127-8822825        152.00  -13.48  152.00  -13.48  158     24      1698     10
Complete

Going to rerun and set the -scale flag to 8. In the (paltry) miranda information online, this flag does the following: Set the scaling parameter to scale. This scaling is applied to match / mismatch scores in the critical 10bp region of the 5’ end of the microRNA. Many known examples of miRNA:Target duplexes are highly complementary in this region. This parameter can be thought of as a contrast function to more effectively detect alignments of this type. The default is 4 and I’m going to set it to 8. Submitted batch job 345472. Took about 10 hrs to run.

counting number of putative interactions predicted Tue Oct 29 04:06:29 EDT 2024
1991808
Parsing output Tue Oct 29 04:06:36 EDT 2024
counting number of putative interactions predicted Tue Oct 29 04:06:37 EDT 2024
99403 /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_3UTRs_8seed_parsed_Apul.txt

Interesting that the number of putative interactions actually increased from the previous run (99345 vs 99403). Look at the parsed files:

head miranda_strict_all_3UTRs_parsed_Apul.txt 
>apul-mir-100	ntLink_6:10255357-10256357	145.00	-15.40	2 19	264 286	20	65.00%	70.00%
>apul-mir-100	ntLink_6:10563192-10564192	153.00	-20.42	2 14	791 810	12	83.33%	91.67%
>apul-mir-100	ntLink_6:10617301-10618424	145.00	-15.84	2 16	679 697	14	78.57%	78.57%
>apul-mir-100	ntLink_6:10672091-10673508	161.00	-23.97	2 16	951 969	14	92.86%	92.86%
>apul-mir-100	ntLink_6:11090988-11091647	146.00	-15.95	2 11	545 564	9	88.89%	100.00%
>apul-mir-100	ntLink_6:11405289-11406669	149.00	-16.68	2 16	1000 1018	14	78.57%	85.71%
>apul-mir-100	ntLink_6:1142082-1142891	142.00	-16.61	2 11	9 28	9	88.89%	88.89%
>apul-mir-100	ntLink_6:11697492-11698492	154.00	-18.81	2 15	540 559	13	84.62%	84.62%
>apul-mir-100	ntLink_6:12100733-12101733	154.00	-15.72	2 15	563 582	13	84.62%	84.62%
>apul-mir-100	ntLink_6:12100733-12101733	157.00	-23.22	2 16	467 485	14	85.71%	92.86%

head miranda_strict_all_3UTRs_8seed_parsed_Apul.txt
>apul-mir-100	ntLink_6:10255357-10256357	285.00	-15.40	2 19	264 286	20	65.00%	70.00%
>apul-mir-100	ntLink_6:10563192-10564192	293.00	-20.42	2 14	791 810	12	83.33%	91.67%
>apul-mir-100	ntLink_6:10617301-10618424	285.00	-15.84	2 16	679 697	14	78.57%	78.57%
>apul-mir-100	ntLink_6:10672091-10673508	301.00	-23.97	2 16	951 969	14	92.86%	92.86%
>apul-mir-100	ntLink_6:11090988-11091647	286.00	-15.95	2 11	545 564	9	88.89%	100.00%
>apul-mir-100	ntLink_6:11405289-11406669	289.00	-16.68	2 16	1000 1018	14	78.57%	85.71%
>apul-mir-100	ntLink_6:1142082-1142891	282.00	-16.61	2 11	9 28	9	88.89%	88.89%
>apul-mir-100	ntLink_6:11697492-11698492	294.00	-18.81	2 15	540 559	13	84.62%	84.62%
>apul-mir-100	ntLink_6:12100733-12101733	294.00	-15.72	2 15	563 582	13	84.62%	84.62%
>apul-mir-100	ntLink_6:12100733-12101733	297.00	-23.22	2 16	467 485	14	85.71%	92.86%

The interactions appear to be the same except the score is much higher, maybe overly inflated? Are the scans the same?

# from original run
Read Sequence:ntLink_8:8646169-8647749 (1580 nt)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing Scan: apul-mir-novel-2 vs ntLink_8:8646169-8647749
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   Forward:     Score: 152.000000  Q:2 to 15  R:386 to 408 Align Len (13) (84.62%) (92.31%)

   Query:    3' ugagaaguaaAUAGAUGAGCAAUu 5'
                          |||:| ||||||| 
   Ref:      5' agattaagagTATTT-CTCGTTAg 3'

   Energy:  -13.480000 kCal/Mol

Scores for this hit:
>apul-mir-novel-2       ntLink_8:8646169-8647749        152.00  -13.48  2 15    386 408 13      84.62%  92.31%

Score for this Scan:
Seq1,Seq2,Tot Score,Tot Energy,Max Score,Max Energy,Strand,Len1,Len2,Positions
>>apul-mir-novel-2      ntLink_8:8646169-8647749        152.00  -13.48  152.00  -13.48  156     24      1580     386
Complete

# from scale 8 run 
Read Sequence:ntLink_8:8646169-8647749 (1580 nt)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing Scan: apul-mir-novel-2 vs ntLink_8:8646169-8647749
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   Forward:     Score: 292.000000  Q:2 to 15  R:386 to 408 Align Len (13) (84.62%) (92.31%)

   Query:    3' ugagaaguaaAUAGAUGAGCAAUu 5'
                          |||:| ||||||| 
   Ref:      5' agattaagagTATTT-CTCGTTAg 3'

   Energy:  -13.480000 kCal/Mol

Scores for this hit:
>apul-mir-novel-2       ntLink_8:8646169-8647749        292.00  -13.48  2 15    386 408 13      84.62%  92.31%

Score for this Scan:
Seq1,Seq2,Tot Score,Tot Energy,Max Score,Max Energy,Strand,Len1,Len2,Positions
>>apul-mir-novel-2      ntLink_8:8646169-8647749        292.00  -13.48  292.00  -13.48  156     24      1580     386
Complete

The only difference is the score so I think I am inflating the calculations somehow, would need to look at source code. After looking at other papers that have used miranda, they typically employ strict binding, energy <-10, default scale (4), and 140-150 score (see example from human flu hozt response to virus here). Gajigan & Conaco (2017) used miranda on an Acropora spp as well and they used strict seed binding and energy <-10. They further narrowed targets with exact seed match (is seed match mean 6, 7, 8bp?) and an A in position 1 (of miRNA or mRNA?); they cited a mammalian paper here for these choices specifically (Agarwal, Bell, Nam & Bartel 2015). They also used fasta to check the targets for more extensive complementarity and scored the alignments as described in Moran et al. 2014 (classic nematostella miRNA paper). Moran et al. 2014 did the following: “mapped sequences with FASTA v36 (Pearson and Lipman 1988) using the parameters -n -H -Q -f -16 -r +15/-10 -g -10 -w 100 -W 25 -E 100000 -i -U and scored the alignments using a weighted sum of the number of mismatches (scoremismatch = 1, scoreG:U = 0.5) with mismatches for guide RNA nucleotides g2–g13 counting double (scoremismatch = 2, scoreG:U = 1).” Moran et al. got this scoring method from Fahlgren et al. 2007, who did the following: “miRNA targets were computationally predicted as described [34]. Briefly, potential targets from FASTA searches (+15/−10 match/mismatch scoring ratio, -16 gap penalty and a RNA scoring matrix) were scored using a position-dependent, mispair penalty system [34]. Penalties were assessed for mismatches, bulges, and gaps (+1 per position) and G∶U pairs (+0.5 per position). Penalties were doubled if the mismatch, bulge, gap, or G∶U pair occurred at positions 2 to 13 relative to the 5′ end of the miRNA. Only one single-nt bulge or single-nt gap was allowed. Based on a reference set of validated miRNA targets, only predicted targets with scores of four or less were considered reasonable.” Need to look more into this scoring.

I am going to run miranda once more with the following flags: -en -10 -strict. The score (140), scale (4), gap-open penalty (-4), and gap-extend pentaly (-9) will be kept as the defaults. Submitted batch job 345844

I want to check to see how similar the results would be between the 1kb estimate and the gene ext + 1kb estimate. Use this post as reference.

cd /data/putnamlab/jillashey/e5/refs/Apul

grep $'\tgene\t' genome.fixed.gtf > Apul_gene.gtf
wc -l Apul_gene.gtf 
44371 Apul_gene.gtf

Sort by chromosome

module load BEDTools/2.30.0-GCC-11.3.0

sortBed -faidx apul.Chromosome_names.txt -i Apul_gene.gtf > Apul_gene_sorted.gtf

Extract 1kb 3’UTRs and keep associated gene info

bedtools flank -i Apul_gene_sorted.gtf -g apul.Chromosome_lenghts.txt -l 0 -r 1000 -s | \
awk 'BEGIN{OFS="\t"} {
    gsub("gene","3prime_UTR",$3);
    if($5-$4 > 3) {
        gene_id = $NF;
        sub(/;$/, "", gene_id);
        print $1, $2, $3, $4, $5, $6, $7, $8, gene_id;
    }
}' > Apul_3UTR_1kb.gtf

Subtract overlaps of other genes

bedtools subtract -a Apul_3UTR_1kb.gtf -b Apul_gene_sorted.gtf > Apul_3UTR_1kb_corrected.gtf

Extract 3’UTR seqs from genome fasta

bedtools getfasta -fi /data/putnamlab/REFS/Apul/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.masked.fa -bed Apul_3UTR_1kb_corrected.gtf -fo Apul_3UTR_1kb.fasta -fullHeader

Run miranda - edit miranda_strict_all_1kb_apul.sh in scripts folder so that input 3’UTR fasta is /data/putnamlab/jillashey/e5/refs/Apul/Apul_3UTR_1kb.fasta. Submitted batch job 345851.

counting number of putative interactions predicted Tue Oct 29 16:11:40 EDT 2024
2009896
Parsing output Tue Oct 29 16:11:45 EDT 2024
counting number of putative interactions predicted Tue Oct 29 16:11:46 EDT 2024
99112 /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_1kb_parsed_apul.txt

Slightly less putative interactions than when using the gene ext 3’UTRs and the manual 1kb 3’UTRs. The outputs look very similar to one another:

head miranda_strict_all_3UTRs_parsed_Apul.txt
>apul-mir-100	ntLink_6:10255357-10256357	145.00	-15.40	2 19	264 286	20	65.00%	70.00%
>apul-mir-100	ntLink_6:10563192-10564192	153.00	-20.42	2 14	791 810	12	83.33%	91.67%
>apul-mir-100	ntLink_6:10617301-10618424	145.00	-15.84	2 16	679 697	14	78.57%	78.57%
>apul-mir-100	ntLink_6:10672091-10673508	161.00	-23.97	2 16	951 969	14	92.86%	92.86%
>apul-mir-100	ntLink_6:11090988-11091647	146.00	-15.95	2 11	545 564	9	88.89%	100.00%
>apul-mir-100	ntLink_6:11405289-11406669	149.00	-16.68	2 16	1000 1018	14	78.57%	85.71%
>apul-mir-100	ntLink_6:1142082-1142891	142.00	-16.61	2 11	9 28	9	88.89%	88.89%
>apul-mir-100	ntLink_6:11697492-11698492	154.00	-18.81	2 15	540 559	13	84.62%	84.62%
>apul-mir-100	ntLink_6:12100733-12101733	154.00	-15.72	2 15	563 582	13	84.62%	84.62%
>apul-mir-100	ntLink_6:12100733-12101733	157.00	-23.22	2 16	467 485	14	85.71%	92.86%
(base) [jillashey@ssh3 miranda]$ head miranda_strict_all_1kb_parsed_apul.txt
>apul-mir-100	ntLink_6:10255357-10256357	145.00	-15.40	2 19	264 286	20	65.00%	70.00%
>apul-mir-100	ntLink_6:10563192-10564192	153.00	-20.42	2 14	791 810	12	83.33%	91.67%
>apul-mir-100	ntLink_6:10617301-10618301	145.00	-15.84	2 16	679 697	14	78.57%	78.57%
>apul-mir-100	ntLink_6:10672508-10673508	161.00	-23.97	2 16	534 552	14	92.86%	92.86%
>apul-mir-100	ntLink_6:11090988-11091647	146.00	-15.95	2 11	545 564	9	88.89%	100.00%
>apul-mir-100	ntLink_6:11405669-11406669	149.00	-16.68	2 16	620 638	14	78.57%	85.71%
>apul-mir-100	ntLink_6:1142082-1143082	142.00	-16.61	2 11	9 28	9	88.89%	88.89%
>apul-mir-100	ntLink_6:11697492-11698492	154.00	-18.81	2 15	540 559	13	84.62%	84.62%
>apul-mir-100	ntLink_6:12100733-12101733	154.00	-15.72	2 15	563 582	13	84.62%	84.62%
>apul-mir-100	ntLink_6:12100733-12101733	157.00	-23.22	2 16	467 485	14	85.71%	92.86%

I also figured out how to get the gene name to be in the 3’UTR fasta file. For the 1kb only data:

module load BEDTools/2.30.0-GCC-11.3.0

awk '{print $1 "\t" $4-1 "\t" $5 "\t" $9 "\t" "." "\t" $7}' Apul_3UTR_1kb_corrected.gtf | sed 's/"//g' > Apul_3UTR_1kb_corrected.bed

bedtools getfasta -fi /data/putnamlab/REFS/Apul/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.masked.fa -bed Apul_3UTR_1kb_corrected.bed -fo Apul_3UTR_1kb.fasta -name

head Apul_3UTR_1kb.fasta
>FUN_002303::ntLink_7:4679-5679
TCTGACATCGAATGCTGCCCAAAGCAGGGAAAAGCTAAGGACAGGAACTCTGAACTCCGTCAAAAGTTGCGCTAGATCGAGTCTAGGATAGGATAGGAATTTCACCGAACTTTCTATTTGTGTATTTGTATTTCGCTCTTCGAGCCTATTCAGTCAAACGGTTCAAAAGTCCATGTTCATGTAAGATGCATTAAATCAGTTTAGATCTCGATTCTCGTTGGATGTAAATGTAGGTTTAATTAAAAAAAAAAAAAGATTGCGAGGAAACGATCAGTGTTGGTCGATGTGTTGGTGCCCCCGGGGAAACGGGTTTTGAAGGAGAAATTCTACGCTCTAAAGAGCACTTTTATACAGAACAAGAACACTGAAAAAAGTGAACAAATTTCTTAAGAAAACACGATCCGAACAAATTTCGGATTACGCAGTTTAGAAAAAGGAGAAAATTTTCGCGCCCCCTCCATAGAGAGGGACTGGGGCCGGTTTCTCGAAAGTCCCGATAATGAGCTGTCGCCGTTTGCATTAAAGATCGAGGTCTCAATAGTTTTGCATCTAGCATGATAAAACCATCAGTGAAACAAAATGGAGTAGTCTGCTAGCAAGGACCCGCGCTCTTATTCTTTATATTCCGATTTGACTATTTGATCCCGGGCCCCAAAAGTTACCGGGACTTCCGAGAAACGGGCACACCCCAGGGCTCCTTTCGCTCGCTCGCGCCGACGCAGTAAGAACGGCAAAAACAGCTGAATCCAACTTCGCTTGAATTGGAAATACCATCACTGCAGTGGCCTCCATACGAATTCGACAAAAGAAAAGCTTAGGAGAAATTAAACATGACCTCTACGCCGGGGTAAGTTACCGTGATAACTGAGTTACCGGGTGGCTGATCGAATTGCCCCGCACCATTTTTAGCTGCGCGATATCCTGTCTTGCTTTTCTTGATTTGGCGGGGGAAGAAATTCTGGATTAAAATAAGAACCTTAGGGTCTTGCGTTTGTCTCGG
>FUN_002304::ntLink_7:11384-12384

From the gene ext + 1kb data:

awk -F'\t' '{split($9,a,";"); for(i in a) {if(a[i]~/gene_id/) {gsub(/^[ \t]+|[ \t]+$/, "", a[i]); gsub(/"/, "", a[i]); gene=a[i]}}} {print $1 "\t" $4-1 "\t" $5 "\t" gene "\t" "." "\t" $7}' Apul_all_3UTRs.gtf | sed 's/gene_id //' > Apul_all_3UTRs.bed

bedtools getfasta -fi /data/putnamlab/REFS/Apul/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.masked.fa -bed Apul_all_3UTRs.bed -fo Apul_all_3UTRs.fasta -name

head Apul_all_3UTRs.fasta
FUN_002303::ntLink_7:4679-5033
TCTGACATCGAATGCTGCCCAAAGCAGGGAAAAGCTAAGGACAGGAACTCTGAACTCCGTCAAAAGTTGCGCTAGATCGAGTCTAGGATAGGATAGGAATTTCACCGAACTTTCTATTTGTGTATTTGTATTTCGCTCTTCGAGCCTATTCAGTCAAACGGTTCAAAAGTCCATGTTCATGTAAGATGCATTAAATCAGTTTAGATCTCGATTCTCGTTGGATGTAAATGTAGGTTTAATTAAAAAAAAAAAAAGATTGCGAGGAAACGATCAGTGTTGGTCGATGTGTTGGTGCCCCCGGGGAAACGGGTTTTGAAGGAGAAATTCTACGCTCTAAAGAGCACTTTTATACAG
>FUN_002305::ntLink_7:24187-24541
TCTGACATCGAATGCTGCCCAAAGCAGGGAAAAGCTAAGGACAGGAACTCTGAACTCCGTCAAAAGTTGCGCTAGATCGAGTCTAGGATAGGATAGGAATTTCACCGAACTTTCTATTTGTGTATTTGTATTTCGCTCTTCGAGCCTATTCAGTCAAACGGTTCAAAAGTCCATGTTCATGTAAGATGCATTAAATCAGTTTAGATCTCGATTCTCGTTGGATGTAAATGTAGGTTTAATTAAAAAAAAAAAAAGATTGCGAGGAAACGATCAGTGTTGGTCGATGTGTTGGTGCCCCCGGGGAAACGGGTTTTGAAGGAGAAATTCTACGCTCTAAAGAGCACTTTTATACAG
>FUN_002309::ntLink_7:63179-63540

Rerun miranda so that gene names are associated with the interactions. Submitted batch job 346015 for gene ext + 1kb data. Submitted batch job 346016 for the 1kb data only. Both ran in ~20 mins.

# Gene ext + 1kb data
counting number of putative interactions predicted Wed Oct 30 09:10:32 EDT 2024
1991808
Parsing output Wed Oct 30 09:10:38 EDT 2024
counting number of putative interactions predicted Wed Oct 30 09:10:39 EDT 2024
99345 /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_3UTRs_parsed_Apul.txt

# 1kb data only
counting number of putative interactions predicted Wed Oct 30 09:11:28 EDT 2024
2009896
Parsing output Wed Oct 30 09:11:31 EDT 2024
counting number of putative interactions predicted Wed Oct 30 09:11:32 EDT 2024
99112 /data/putnamlab/jillashey/e5/output/miranda/miranda_strict_all_1kb_parsed_apul.txt

Written on August 19, 2024