Apulchra genome assembly
Apulchra genome assembly
Sperm and tissue from adult Acropora pulchra colonies were collected from Moorea, French Polynesia and sequencing with PacBio (long reads) and Illumina (short reads). This post will detail the genome assembly notes. The github for this project is here.
I’m going to write notes and code chronologically so that I can keep track of what I’m doing each day. When assembly is complete, I will compile the workflow in a separate post.
20240206
Met w/ Ross and Hollie today re Apulchra genome assembly. We decided to move forward with the workflow from Stephens et al. 2022 which assembled genomes for 4 Hawaiian corals.
PacBio long reads were received in late Jan/early Feb 2024. According to reps from Genohub, the PacBio raw output looks good.
We decided to move forward with Canu to assembly the genome. Canu is specialized to assemble PacBio sequences, operating in three phases: correction, trimming and assembly. According to the Canu website, “The correction phase will improve the accuracy of bases in reads. The trimming phase will trim reads to the portion that appears to be high-quality sequence, removing suspicious regions such as remaining SMRTbell adapter. The assembly phase will order the reads into contigs, generate consensus sequences and create graphs of alternate paths.”
The PacBio files that will be used for assembly are located here on Andromeda: /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead
. The files in the folder that we will use are:
m84100_240128_024355_s2.hifi_reads.bc1029.bam
m84100_240128_024355_s2.hifi_reads.bc1029.bam.pbi
The bam file contains all of the read information in computer language and the pbi file is an index file of the bam. Both are needed in the same folder to run Canu.
For Canu, input files must be fasta or fastq format. I’m going to use bam2fastq
from the PacBio github. This module is not on Andromeda so I will need to install it via conda.
The PacBio sequencing for the Apul genome were done with HiFi sequencing that are produced with circular consensus sequencing on PacoBio long read systems. Here’s how HiFi reads are generated from the PacBio website:
Since Hifi sequencing was used, a specific HiCanu flag (-pacbio-hifi
) must be used. Additionally, in the Canu tutorial, it says that if this flag is used, the program is assuming that the reads are trimmed and corrected already. However, our reads are not. I’m going to try to run the first pass at Canu with the -raw
flag.
Before Canu, I will run bam2fastq
. This is a part of the PacBio BAM toolkit package pbtk
. I need to create the conda environment and install the package. Load miniconda module: module load Miniconda3/4.9.2
. Create a conda environment.
conda create --prefix /data/putnamlab/conda/pbtk
Install the package. Once the package is installed on the server, anyone in the Putnam lab group can use it.
conda activate /data/putnamlab/conda/pbtk
conda install -c bioconda pbtk
In my own directory, make a new directory for genome assembly
cd /data/putnamlab/jillashey
mkdir Apul_Genome
cd Apul_Genome
mkdir assembly structural functional
cd assembly
mkdir scripts data output
Run code to make the PacBio bam file to a fastq file. In the scripts folder: nano bam2fastq.sh
#!/bin/bash -i
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
conda activate /data/putnamlab/conda/pbtk
echo "Convert PacBio bam file to fastq file" $(date)
bam2fastq -o /data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam
echo "Bam to fastq complete!" $(date)
conda deactivate
Submitted batch job 294235
20240208
Job pended for about a day, then ran in 1.5 hours. I got this error message: bash: cannot set terminal process group (-1): Function not implemented bash: no job control in this shell
, but not sure why. A fastq.gz file was produced! The file is pretty large (35G).
less m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq.gz
@m84100_240128_024355_s2/261887593/ccs
TATAAGTTTTACAGCTGCCTTTTGCTCAGCAAAGAAAGCAGCATTGTTATTGAACAGAAAAAGCCTTTTGGTGATATAAAGGTTTCTAAGGGACCAAAGTTTGATCTAGTATGCTAAGTGTGGTGGGTTAAAACTTTGTTTCACCTTTTTCCCGTGATGTTACAAAATTGGTGCAAATATTCATGACGTCGTACTCAAGTCTGACACTAAGAGCATGCAATTACTTAAACAAACAAGCCATACACCAATAAACTGAAGCTCTGTCAACTAGAAAACCTTTGAGTATTTTCATTGTAAAGTACAAGTGGTATATTGTCACTTGCTTTTACAACTTGAAGAACTCACAGTTAAGTTACTAGATTCACCATAGTGCTTGGCAATGAAGAAGCCAAATCACATAAAGTCGGAGCATGTGGTGTTTAGACCTAATCAAACAAGAACACAATATTTAGTACCTGCATCCTGTCTAAGGAGGAAATTTTAAGCTGCTTTCTTTTAAATTTTTTTTATTAGCATTTCAATGGTTGAGGTCGATTATAGGTGCTAGGCTTTAATTCCGTACTATGAAAGAAGAAAGGTCGTTGTTATTAACCATGTCAAACAGAGAAACACATGGTAAAAAATTGACTTCCTTTTCCTCTCGTTGCCACTTAAGCTTAATGATGGTGTTTGACCTGAAAGATGTTACAATTGTTTTAGATGAAAAGACTGTTCTGCGTAAAACAGTGAAGCCTCCCAACTTATTTTGTTATGTGGATTTTGTTGTCTTGTTAGTAACATGTATTGGACTATCTTTTGTGAGTACATAGCTTTTTTTCCATCAACTGACTATATACGTGGTGTAATTTGAGATCATGCCTCCAAGTGTTAGTCTTTTGTTTGGGGCTAACTCGTAAAAGACAAAGGGAGGGGGGTTGTCTAATTCCTAAGCAAAGCATTAAGTTTAACACAGGAAATTGTTTGCGTTGATATTGCTATCCTTTCAGCCCCAAACAAAAAATTTAATGGTTATTTTATTTTACATCTATTGTAAATATATTTTAACATTAATTTTTATTATTGCACTGTAAATACTTGTACTAATGTTCTGTTTGAATTAATTTTGATTCATTCCTTGTGCTTACAACAACAGGGATACAAAACCGATATGTATAATAATACTATTAGAGATGCTTATTTGCATTTTTAGCCCATACCATGAGTTTTAATAACGCCAGGCCATTGGAGATTTTATGGAGTGAGGATTCATTGTACAAACATGGTTGATTTAATATTAAAGTTGTATCCAAATAATTAATATCTGCTGTGATCAGTGAAAGATTGACCTTTCAGTTGTTTGGTTGCACCTTCATCTTATTGGAAACAACTGAATGGAGCATCTTTCCAGTTTAAAAATGTACCACTGCCCACTTTCATGAAGTTATGCCACATATTAATAATGACTATTAATTGTTGAAAACCCTTCTTCCAAAATGTTTCCATTTATTTGTAATAGCATATGTGGTCCATCAGAACAATAATTTAAATCATTACTATTAATAATTTTCCAATAACTGACTTTCAAACCTAGCCAACAAGCATAAGTCAGTAAGCCACAGAGTCCAGAGATACACTTACACTTTACTTTTCACTTCTGAAACATTTTATAATCTCAGTATGAGCATAGAACTTTTCAGTTGGGCAGCATGGAATAGAACCTTTGGACCCCTCTGTGAATATCAAAAATAGGCAACCACTTCCAGCATACACTCTAGCCTCCTTCATAAAGCAAGCCTTAGTGTTTTAGCTTCTACTAGTTAGATTCATTTTAAAAGAAGTTCAGTATACTTAATCTTATAGAAGCTGATTGTGATATAATTGCATAGGTGGATCTCAGAAAAGTGAGATGTAGCTGTCAAATTAAAAGAAGTCCTTTCCAAGCGTAGCTTCTGATAAACAATGCATTTTAGTTAACATTGGATTATGGTTTCAAAGGACTTGTAAAGCTAAATTCAAGTTTTTATGACAACTTGAAAGCCTTTGCCACAGTCTCCGCTGATTTAAGACTTCCATCAAAGTTAGAGTGGTGTGAATGCATCTCCACATGCAATTAATAAAGGTGAGGCAGACAACACAAAACACCCTGGTGCACCATCAACTCCACGGATCACTTGACTGTAACGCCATCTTATACAGCGACTGCCAATTGGAACTGGAAGATCAGGAATGATCTCTTTCACATGGGAAATGAGCATGGTCTTGATAATGCTTTCATCAGTATCCCAATGTTGAAGACAGAAGGACTTTGTTGAATGTACTAAAAGAGAGGGACCAACATCCACTGGATCTGGAACAGAATGCCAAAGCATATGCAAAATGTTATCATGTATAATAAAGAAAAAACCAAACTCATGAAGTTCAACGGTCACAGGCTTTGGTAAAAGACTGTACATGGTAAAGTACTCAGGGTGAAAGTTTGTATCCAAAAAACGAAAAGCTAACCTGTACCACATCTCTTCCGAGAGTCCACTGACACAAATGCAATATTAGGATTGCCAGTCGCATATTTACACGTCCATGGCACATTGATAACTGCTTCATGATCAAAAAAGTATCCTACTGCAAACCGTGATGAATATTCCACATTTTGTAAGGATTTGATTTCATTTTGTAAGAATGCTTGAATTGAACCTTGTAGCTGAAGAAGTTGTGGTACTGGAATAGTTACTATGACTGA
See how many @m84100
are in the file. I’m not sure what these stand for, maybe contigs?
zgrep -c "@m84100" m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq.gz
5898386
More than 5 million, so many not contigs? I guess it represents the number of HiFi reads generated. Now time to run Canu! Canu is already installed on the server, which is nice.
In the scripts folder: nano canu.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load canu/2.2-GCCcore-11.2.0
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Unzip paco-bio fastq file" $(date)
gunzip m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq.gz
echo "Unzip complete, starting assembly" $(date)
canu -p apul -d /data/putnamlab/jillashey/Apul_Genome/assembly/data genomeSize=475m -raw -pacbio-hifi m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq
echo "Canu assembly complete" $(date)
I’m not sure if the raw
and -pacbio-hifi
will be compatible, as the Canu tutorial says that the -pacbio-hifi
assumes that the input is trimmed and corrected (still not sure what this means). Submitted batch job 294325
20240212
The canu script appears to have ran. The script canu.sh
itself ran in ~40 mins, but it spawned 10000s of other jobs on the server (for parallel processing, I’m guessing). Since I didn’t start those jobs, I didn’t get emails when they finished, so I just had to check the server every few hours. It took about a day for the rest of the jobs to finish. First, I’m looking at the slurm-294325.error
output file. There’s a lot in this file, but I will break it down.
It first provides details on the slurm support and associated memory, as well as the number of threads that each portion of canu will need to run.
-- Slurm support detected. Resources available:
-- 25 hosts with 36 cores and 124 GB memory.
-- 3 hosts with 24 cores and 124 GB memory.
-- 1 host with 36 cores and 123 GB memory.
-- 1 host with 48 cores and 753 GB memory.
-- 8 hosts with 36 cores and 61 GB memory.
-- 1 host with 48 cores and 250 GB memory.
-- 2 hosts with 36 cores and 250 GB memory.
-- 2 hosts with 48 cores and 375 GB memory.
-- 4 hosts with 36 cores and 502 GB memory.
--
-- (tag)Threads
-- (tag)Memory |
-- (tag) | | algorithm
-- ------- ---------- -------- -----------------------------
-- Grid: meryl 12.000 GB 6 CPUs (k-mer counting)
-- Grid: hap 12.000 GB 12 CPUs (read-to-haplotype assignment)
-- Grid: cormhap 13.000 GB 12 CPUs (overlap detection with mhap)
-- Grid: obtovl 8.000 GB 6 CPUs (overlap detection)
-- Grid: utgovl 8.000 GB 6 CPUs (overlap detection)
-- Grid: cor -.--- GB 4 CPUs (read correction)
-- Grid: ovb 4.000 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 16.000 GB 1 CPU (overlap store sorting)
-- Grid: red 10.000 GB 6 CPUs (read error detection)
-- Grid: oea 8.000 GB 1 CPU (overlap error adjustment)
-- Grid: bat 64.000 GB 8 CPUs (contig construction with bogart)
-- Grid: cns -.--- GB 8 CPUs (consensus)
--
-- Found trimmed raw PacBio HiFi reads in the input files.
The file also says that it skipped the correction and trimming steps. This indicates that adding the -raw
flag didn’t work.
-- Stages to run:
-- assemble HiFi reads.
--
--
-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.
Two histograms are then presented. The first is a histogram of correct reads:
-- In sequence store './apul.seqStore':
-- Found 5897694 reads.
-- Found 79182880880 bases (166.7 times coverage).
-- Histogram of corrected reads:
--
-- G=79182880880 sum of || length num
-- NG length index lengths || range seqs
-- ----- ------------ --------- ------------ || ------------------- -------
-- 00010 26133 264284 7918288640 || 1150-2287 8297|--
-- 00020 22541 592551 15836596549 || 2288-3425 76768|-------------
-- 00030 20184 964605 23754880672 || 3426-4563 316378|---------------------------------------------------
-- 00040 18285 1377072 31673163019 || 4564-5701 397324|---------------------------------------------------------------
-- 00050 16551 1832151 39591448183 || 5702-6839 374507|------------------------------------------------------------
-- 00060 14786 2337786 47509730799 || 6840-7977 351843|--------------------------------------------------------
-- 00070 12854 2910882 55428020435 || 7978-9115 339864|------------------------------------------------------
-- 00080 10612 3585762 63346313974 || 9116-10253 340204|------------------------------------------------------
-- 00090 7724 4449769 71264594524 || 10254-11391 341160|-------------------------------------------------------
-- 00100 1150 5897693 79182880880 || 11392-12529 343170|-------------------------------------------------------
-- 001.000x 5897694 79182880880 || 12530-13667 340043|------------------------------------------------------
-- || 13668-14805 336214|------------------------------------------------------
-- || 14806-15943 328927|-----------------------------------------------------
-- || 15944-17081 316290|---------------------------------------------------
-- || 17082-18219 293393|-----------------------------------------------
-- || 18220-19357 261212|------------------------------------------
-- || 19358-20495 225351|------------------------------------
-- || 20496-21633 188701|------------------------------
-- || 21634-22771 154123|-------------------------
-- || 22772-23909 124842|--------------------
-- || 23910-25047 99659|----------------
-- || 25048-26185 78280|-------------
-- || 26186-27323 61530|----------
-- || 27324-28461 47760|--------
-- || 28462-29599 37668|------
-- || 29600-30737 29339|-----
-- || 30738-31875 22548|----
-- || 31876-33013 17292|---
-- || 33014-34151 13055|---
-- || 34152-35289 9797|--
-- || 35290-36427 7251|--
-- || 36428-37565 5005|-
-- || 37566-38703 3560|-
-- || 38704-39841 2474|-
-- || 39842-40979 1553|-
-- || 40980-42117 1006|-
-- || 42118-43255 604|-
-- || 43256-44393 334|-
-- || 44394-45531 184|-
-- || 45532-46669 102|-
-- || 46670-47807 45|-
-- || 47808-48945 18|-
-- || 48946-50083 11|-
-- || 50084-51221 4|-
-- || 51222-52359 1|-
-- || 52360-53497 2|-
-- || 53498-54635 0|
-- || 54636-55773 0|
-- || 55774-56911 0|
-- || 56912-58049 1|-
There’s also a histogram of corrected-trimmed reads, but it is the exact same as the histogram above. The histogram represents the length ranges for each sequence and the number of sequences that have that length range. For example, if we look at the top row, there are 8297 sequences that range in length from 1150-2287 bp.
It looks like canu did run jobs by itself:
-- For 5897694 reads with 79182880880 bases, limit to 791 batches.
-- Will count kmers using 16 jobs, each using 13 GB and 6 threads.
--
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs. First attempt out of 2.
--
-- 'meryl-count.jobSubmit-01.sh' -> job 294326 tasks 1-16.
--
----------------------------------------
-- Starting command on Thu Feb 8 14:32:12 2024 with 8923.722 GB free disk space
cd /glfs/brick01/gv0/putnamlab/jillashey/Apul_Genome/assembly/data
sbatch \
--depend=afterany:294326 \
--cpus-per-task=1 \
--mem-per-cpu=4g \
-D `pwd` \
-J 'canu_apul' \
-o canu-scripts/canu.01.out canu-scripts/canu.01.sh
-- Finished on Thu Feb 8 14:32:13 2024 (one second) with 8923.722 GB free disk space
Let’s look at the output files that canu produced in /data/putnamlab/jillashey/Apul_Genome/assembly/data
. I’ll be using the Canu tutorial output information to understand outputs.
-rwxr-xr-x. 1 jillashey 1.1K Feb 8 14:13 apul.seqStore.sh
-rw-r--r--. 1 jillashey 951 Feb 8 14:31 apul.seqStore.err
drwxr-xr-x. 3 jillashey 4.0K Feb 8 14:32 apul.seqStore
drwxr-xr-x. 9 jillashey 4.0K Feb 8 23:57 unitigging
drwxr-xr-x. 2 jillashey 4.0K Feb 9 01:36 canu-scripts
lrwxrwxrwx. 1 jillashey 24 Feb 9 01:36 canu.out -> canu-scripts/canu.09.out
drwxr-xr-x. 2 jillashey 4.0K Feb 9 01:36 canu-logs
-rw-r--r--. 1 jillashey 23K Feb 9 01:47 apul.report
-rw-r--r--. 1 jillashey 7.0M Feb 9 01:51 apul.contigs.layout.tigInfo
-rw-r--r--. 1 jillashey 155M Feb 9 01:51 apul.contigs.layout.readToTig
-rw-r--r--. 1 jillashey 2.8G Feb 9 01:57 apul.unassembled.fasta
-rw-r--r--. 1 jillashey 943M Feb 9 02:01 apul.contigs.fasta
The apul.report
file will provide information about the analysis during assembly, including histogram of read lengths, the histogram or k-mers in the raw and corrected reads, the summary of corrected data, summary of overlaps, and the summary of contig lengths. The histogram of read lengths is the same as in the error file above. There is also a histogram (?) of the mer information:
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 0 0.0000 0.0000
-- 2- 2 4316301 **** 0.0150 0.0002
-- 3- 4 855268 0.0172 0.0002
-- 5- 7 183637 0.0183 0.0002
-- 8- 11 62578 0.0187 0.0002
-- 12- 16 29616 0.0188 0.0002
-- 17- 22 22278 0.0189 0.0003
-- 23- 29 20222 0.0190 0.0003
-- 30- 37 28236 0.0190 0.0003
-- 38- 46 73803 0.0192 0.0003
-- 47- 56 556391 0.0194 0.0003
-- 57- 67 6166302 ****** 0.0218 0.0010
-- 68- 79 35800144 *************************************** 0.0477 0.0098
-- 80- 92 63190280 ********************************************************************** 0.1835 0.0633
-- 93- 106 26607016 ***************************** 0.3988 0.1607
-- 107- 121 2495822 ** 0.4799 0.2024
-- 122- 137 2376701 ** 0.4871 0.2066
-- 138- 154 16117515 ***************** 0.4965 0.2131
-- 155- 172 47510630 **************************************************** 0.5575 0.2605
-- 173- 191 41907409 ********************************************** 0.7262 0.4060
-- 192- 211 9190117 ********** 0.8650 0.5375
-- 212- 232 1806623 ** 0.8934 0.5669
-- 233- 254 4028020 **** 0.8997 0.5743
-- 255- 277 3875382 **** 0.9140 0.5926
-- 278- 301 1456651 * 0.9271 0.6107
-- 302- 326 1905710 ** 0.9319 0.6181
-- 327- 352 3003944 *** 0.9388 0.6293
-- 353- 379 1723658 * 0.9491 0.6477
-- 380- 407 968475 * 0.9549 0.6587
-- 408- 436 1249856 * 0.9583 0.6657
-- 437- 466 890757 0.9626 0.6752
-- 467- 497 768714 0.9656 0.6824
-- 498- 529 937920 * 0.9683 0.6892
-- 530- 562 618907 0.9716 0.6978
-- 563- 596 582755 0.9737 0.7039
-- 597- 631 535408 0.9757 0.7100
-- 632- 667 441122 0.9775 0.7159
-- 668- 704 459953 0.9791 0.7211
-- 705- 742 367266 0.9807 0.7268
-- 743- 781 354455 0.9819 0.7316
-- 782- 821 304354 0.9832 0.7365
There are 22-mers. A k-mer are substrings of length k contained in a biological sequence. For example, the term k-mer refers to all of a sequence’s subsequences of length k such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). So if we have 22-mers, we have subsequences of 22 nt? The Canu documentation says that k-mer histograms with more than 1 peak likely indicate a heterozygous genome. I’m not sure if the stars represent peaks or counts but if this is a histogram of k-mer information, it has two peaks, indicating a heterozygous genome.
The Canu documentation states that corrected read reports should be given with information about number of reads, coverage, N50, etc. My log file does not have this information, likely because the trimming and correcting steps were not performed. Instead, I have this information:
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 4114 0.07 10652.45 +- 6328.73 1185.95 +- 2112.86 (bad trimming)
-- middle-hump 4148 0.07 11485.92 +- 4101.92 4734.88 +- 3702.70 (bad trimming)
-- no-5-prime 8533 0.14 9311.86 +- 5288.80 2087.03 +- 3315.79 (bad trimming)
-- no-3-prime 10888 0.18 7663.24 +- 5159.57 1677.51 +- 3090.22 (bad trimming)
--
-- low-coverage 48831 0.83 6823.01 +- 3725.15 16.35 +- 15.52 (easy to assemble, potential for lower quality consensus)
-- unique 5419159 91.89 9403.44 +- 4761.57 110.87 +- 40.33 (easy to assemble, perfect, yay)
-- repeat-cont 93801 1.59 7730.76 +- 4345.55 1001.60 +- 665.30 (potential for consensus errors, no impact on assembly)
-- repeat-dove 380 0.01 23076.97 +- 3235.06 878.08 +- 510.47 (hard to assemble, likely won't assemble correctly or eve
n at all)
--
-- span-repeat 64724 1.10 11397.59 +- 5058.57 2335.58 +- 2836.91 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 182925 3.10 9764.80 +- 4218.80 (should be uniquely placed, low potential for consensus e
rrors, no impact on assembly)
-- uniq-repeat-dove 14288 0.24 17978.89 +- 4847.85 (will end contigs, potential to misassemble)
-- uniq-anchor 19659 0.33 11312.36 +- 4510.86 5442.58 +- 3930.94 (repeat read, with unique section, probable bad read)
I’m not sure why I’m getting all of this information or what it means. There is a high % of unique reads in the data which is good. In the file, there is also information about edges (not sure what this means), as well as error rates. May discuss further with Hollie.
The Canu output documentation says that I’m supposed to get a file with corrected and trimmed reads but I don’t have those. I do have apul.unassembled.fasta
and apul.contigs.fasta
.
head apul.unassembled.fasta
>tig00000838 len=19665 reads=4 class=unassm suggestRepeat=no suggestBubble=no suggestCircular=no trim=0-19665
TAAAAACATTGATTCTTGTTTCAATATGAGACTTGTTTCGGAAGATGTTCGCGCAGGTTACATTTCATAATCCACAAGAAATGCGACATCGCCAACCTTA
CTTTAGTGTTTGCATTTAAGCAAAACAATGATAAAGAAACAAATCTCACATCTCGCAAAAGTATGCATTCTATGAAGAACAATGAAAATTAATGAAAGTG
AATCTTACACCTCCTATTCAAGACGCCGCATTAATTCAACTTGTTGATTTCTCCTAAAACGCTTTCTTTTAGAAGGCTTTCTTAGTCTTTCAATTGTAAA
GAATACATAAAGGACTCATGACCACTTATGTTCTTAAGTGTTACTGCTGCTTTAAAACACGTTACAAACCACATGTGAATATAGTTGCGGCACAGAAGGG
AAAATCGCTGAAATGCTGTCCAAATATACACAATATCATTAAGTAAAGTACGATCGTCCGGGTGAGTGTAGTCCTGAGAAGGACTGTTTGAGATGACATT
GACTGACGTTTCGACAACCTGAGCGGAAGTCATCTTCAGAGTCATCTTCACTTGACTCTGAAGATGACTTCCGCTCAGGTTGTCGAAACGTCAGTCAATG
TCATCTCAAACAGTCCTTCTCAGGACTACACTCACCCGGACGATCGTACTTTACTTAATGATATGACTCCTGGGTTCAAACCATTTACAAATATACACAA
GTTTGAAAGATCATATCGCCTGCCAGTTTTACAACTCGTCTTAGACACAATGGAATACAAAACCCTACCGAACGAATACCTTTGATTGAGATTTATGAAT
GTGAAACAGCGACTTCGAGAGAAAAACGAATTCTTAAAAATGCAGTTCAACTCTATCGTCATTCAACTGAAGCCAGCCGTGGCTCGGCTTCACAAGAAAG
head apul.contigs.fasta
>tig00000004 len=43693 reads=3 class=contig suggestRepeat=no suggestBubble=yes suggestCircular=no trim=0-43693
TACAATTTTAGAACACGGGACCAGCTTAGCATAATAGCTTCACCTTTCGTCTATCTAACTCTAGGAAGTTTTAATTTTTTCAAGTATTATAAAGGGCTCC
GTCGACTGTCAAAGATTTGCCTTTTCAAGCTCCAATGGAAACTGTAAAAGTTGCGTATTTTTACGAGCTTGAAACGCATCTTGGTATGCCCGATATCAAG
TCGAAATTAATATTGAGATAATCCTTTGGCCTTCTTCTATCAACATCTGAAATAAAAATCTGGGCAGTTTGAACGCGCTCTATCAAACATAACAGATTTG
AAGGTAGGTGATAACTTATTTGCATAATCTACGTTAACAAAAAGTCTATTTATAGAATGACTACTCGGCATATTTCTAACAGTGGTACTTCAGATACGTT
TTGATGGACTTATTATTCTGTCGTTTGTATTGTTTTCTTCAATTATTTAGCCTTAATAATTCCAAATAATAAAGAAATAAGGAAAGTCTTTGGTGTAAGT
CACACTCAAAAGGTGAGTTTCAACAGTTACTGAACACCCTTACGTATTAAACAGTCATTTCAATTTCCAGATTCTAACAGAAAATGTCAAATCGTTGTTT
TATAGTAGAAATCCATCTTCAAAAGTTATTCCCCGCTTATGCAGGCTTGATTCTCGCGGCTCTTTCCAGCTCGGTTTACAATATAAGACACCGGTGCAGA
TACCATTGAACTTGTAAACAATGTCACGCAAATTAAACTGTACTTCAATTTGCAAGCCATACAGCTTTAAGTCAGGTCTTTATTGAACTTTCTAAGTCAA
GGTTGGGGAATATAAAGATATTTTATTACCAGTATATTTTCGGTGAAAATTACAACGGATACATGTTATGGGCCTGTTCTTTAAACTCAGTTACATACAT
The unassembled file contains the reads and low-coverage contigs which couldn’t be incorporated into the primary assembly. The contigs file contains the full assembly, including unique, repetitive, and bubble elements. What does this mean? Unsure, but the header line provides metadata on each sequence.
len
Length of the sequence, in bp.
reads
Number of reads used to form the contig.
class
Type of sequence. Unassembled sequences are primarily low-coverage sequences spanned by a single read.
suggestRepeat
If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
suggestBubble
If yes, sequence is likely a bubble based on potential placement within longer sequences.
suggestCircular
If yes, sequence is likely circular. The fasta line will have a trim=X-Y to indicate the non-redundant coordinates
If we are looking at this header: >tig00000004 len=43693 reads=3 class=contig suggestRepeat=no suggestBubble=yes suggestCircular=no trim=0-43693
, the length is the contig is 43693 bp, 3 reads were used to form the contig, class is a contig, no repeats used to form the contig, the sequence is likely a bubble based on placement within longer sequences, the sequence is not circular, and the entire contig is non-redundant. Not sure what bubble means…
Because Canu didn’t trim or correct the Hifi reads, I may need to use a different assembly tool. I’m going to try Hifiasm, which is a fast haplotype-resolved de novo assembler designed for PacBio HiFi reads. According to Hifiasm github, here are some good reasons to use Hifiasm:
-
Hifiasm delivers high-quality telomere-to-telomere assemblies. It tends to generate longer contigs and resolve more segmental duplications than other assemblers.
-
Hifiasm can purge duplications between haplotigs without relying on third-party tools such as purge_dups. Hifiasm does not need polishing tools like pilon or racon, either. This simplifies the assembly pipeline and saves running time.
-
Hifiasm is fast. It can assemble a human genome in half a day and assemble a ~30Gb redwood genome in three days. No genome is too large for hifiasm.
-
Hifiasm is trivial to install and easy to use. It does not required Python, R or C++11 compilers, and can be compiled into a single executable. The default setting works well with a variety of genomes.
If I use this tool, I may not need to pilon to polish the assembly. This tool isn’t on the server, so I’ll need to create a conda environment and install the package.
cd /data/putnamlab/conda
mkdir hifiasm
cd hifiasm
module load Miniconda3/4.9.2
conda create --prefix /data/putnamlab/conda/hifiasm
conda activate /data/putnamlab/conda/hifiasm
conda install -c bioconda hifiasm
Once this package is installed, run code for hifiasm assembly. In the scripts folder: nano hifiasm.sh
#!/bin/bash -i
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
conda activate /data/putnamlab/conda/hifiasm
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Starting assembly with hifiasm" $(date)
hifiasm -o apul.hifiasm m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq
echo "Assembly with hifiasm complete!" $(date)
conda deactivate
Submitted batch job 300534
20240213
Even though the reads were not trimmed or corrected with Canu, I am going to run Busco on the output. This will provide information about how well the genome was assembled and its completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs. Danielle and Kevin have both run BUSCO before and used similar scripts but I think I’ll adapt mine a little to fit my needs and personal preferences for code.
From the Busco user manual, the mandatory parameters are -i
, which defines the input fasta file and -m
, which sets the assessment mode (in our case, genome). Some recommended parameters incude l
(specify busco lineage dataset; in our case, metazoans), c
(specify number of cores to use), and -o
(assigns specific label to output).
In /data/putnamlab/shared/busco/scripts
, the script busco_init.sh
has information about the modules to load and in what order. Both Danielle and Kevin sourced this file specifically in their code, but I will probably just copy and paste the modules. In the same folder, they also used busco-config.ini
as input for the --config
flag in busco, which provides a config file as an alternative to command line parameters. I am not going to use this config file (yet), as Danielle and Kevin were assembling transcriptomes and I’m not sure what the specifics of the file are (or what they should be for genomes). In /data/putnamlab/shared/busco/downloads/lineages/metazoa_odb10
, there is information about the metazoan database.
In /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
, nano busco_canu.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BUSCO/5.2.2-foss-2020b
module load BLAST+/2.11.0-gompi-2020b
module load AUGUSTUS/3.4.0-foss-2020b
module load SEPP/4.4.0-foss-2020b
module load prodigal/2.6.3-GCCcore-10.2.0
module load HMMER/3.3.2-gompi-2020b
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Begin busco on canu-assembled fasta" $(date)
busco -i apul.contigs.fasta -m genome -l /data/putnamlab/shared/busco/downloads/lineages/metazoa_odb10 -c 15 -o apul.busco.canu
echo "busco complete for canu-assembled fasta" $(date)
Submitted batch job 301588. This failed and gave me some errors. This one seemed to have been the fatal one: Message: BatchFatalError(AttributeError("'NoneType' object has no attribute 'remove_tmp_files'"))
. Danielle ran into a similar error in her busco code so I am going to try to set the --config
file as "$EBROOTBUSCO/config/config.ini"
.
In the script, nano busco_canu.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
#module load BUSCO/5.2.2-foss-2020b
#module load BLAST+/2.11.0-gompi-2020b
#module load AUGUSTUS/3.4.0-foss-2020b
#module load SEPP/4.4.0-foss-2020b
#module load prodigal/2.6.3-GCCcore-10.2.0
#module load HMMER/3.3.2-gompi-2020b
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Begin busco on canu-assembled fasta" $(date)
source "/data/putnamlab/shared/busco/scripts/busco_init.sh" # sets up the modules required for this in the right order
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 15 --long -i apul.contigs.fasta -m genome -l /data/putnamlab/shared/busco/downloads/lineages/metazoa_odb10 -o apul.busco.canu
echo "busco complete for canu-assembled fasta" $(date)
Submitted batch job 301594. Failed, same error as before. Going to try copying Kevin and Danielle code directly, even though its a little messy and confusing with paths.
In the script, nano busco_canu.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Begin busco on canu-assembled fasta" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.contigs.fasta" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.busco.canu -m genome
echo "busco complete for canu-assembled fasta" $(date)
Submitted batch job 301599. This appears to have worked! Took about an hour to run. This is the primary result in the out file:
# BUSCO version is: 5.2.2
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /data/putnamlab/jillashey/Apul_Genome/assembly/data/apul.contigs.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:94.4%[S:9.4%,D:85.0%],F:2.7%,M:2.9%,n:954
901 Complete BUSCOs (C)
90 Complete and single-copy BUSCOs (S)
811 Complete and duplicated BUSCOs (D)
26 Fragmented BUSCOs (F)
27 Missing BUSCOs (M)
954 Total BUSCO groups searched
We have 94.4% completeness with this assembly but 85% complete and duplicated BUSCOs. The busco manual says this on high levels of duplication: “BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis”.
Danielle also got a high number (78.9%) of duplicated BUSCOs in her de novo transcriptome of Apulchra, but Kevin got much less duplication (6.9%) in his Past transcriptome assembly. I need to ask Danielle if she ended up using her Trinity results (which had a high duplication percentage) for her alignment for Apul. I also need to ask her if she thinks the high duplication percentage is biologically meaningful.
Might be worth running HiFiAdapter Filt
20240215
Last night, the hifiasm job failed after almost 2 days but the email says PREEMPTED, ExitCode0. Two minutes after the job failed, job 300534 started again on the server and it says its a hifiasm job…I did not start this job myself, not sure what happened. Looking on the server now, hifiasm is running but has only been running for about 18 hours (as of 2pm today). It’s the same job number though which is strange.
20240220
Hifiasm job is still running after ~5 days. In the meantime, I’m going to run HiFiAdapterFilt, which is an adapter filtering command for PacBio HiFi data. On the github page, it says that the tool converts .bam to .fastq and removes reads with remnant PacBio adapter sequences. Required dependencies are BamTools and BLAST+; optional dependencies are NCBI FCS Adaptor and pigz. It looks like I’ll need to use the original bam file instead of the converted fastq file.
The github says I should add the script and database to my path using:
export PATH=$PATH:[PATH TO HiFiAdapterFilt]
export PATH=$PATH:[PATH TO HiFiAdapterFilt]/DB
I will do this in the script for the adapter filt code that I write myself. In the scripts folder, make a folder for hifi information
mkdir HiFiAdapterFilt
cd HiFiAdapterFilt
I need to make a script for the hifi adapter code. In the scripts/HiFiAdapterFilt
folder, I copy and pasted the linked code into hifiadapterfilt.sh
. Make a folder for pacbio databases and copy in the db information from the github.
mkdir DB
cd DB
nano pacbio_vectors_db
>gnl|uv|NGB00972.1:1-45 Pacific Biosciences Blunt Adapter
ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT
>gnl|uv|NGB00973.1:1-35 Pacific Biosciences C2 Primer
AAAAAAAAAAAAAAAAAATTAACGGAGGAGGAGGA
In the scripts/HiFiAdapterFilt
folder: nano hifiadapterfilt_JA.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts/HiFiAdapterFilt
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load GCCcore/11.3.0 # need this to resolve conflicts between GCCcore/8.3.0 and loaded GCCcore/11.3.0
module load BamTools/2.5.1-GCC-8.3.0
module load BLAST+/2.9.0-iimpi-2019b
cd /data/putnamlab/jillashey/Apul_Genome/assembly/scripts/HiFiAdapterFilt
echo "Setting paths" $(date)
export PATH=$PATH:[/data/putnamlab/jillashey/Apul_Genome/assembly/scripts/HiFiAdapterFilt] # path to original script
export PATH=$PATH:[/data/putnamlab/jillashey/Apul_Genome/assembly/scripts/HiFiAdapterFilt]/DB # path to db info
echo "Paths set, starting adapter filtering" $(date)
bash hifiadapterfilt.sh -p /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029 -l 44 -m 97 -o /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Completing adapter filtering" $(date)
The -l
and -m
refer to the minimum length of adapter match to remove and the minumum percent match of adapter to remove, respectively. I left them as the default settings for now. Submitted batch job 303636. Giving me this error:
hifiadapterfilt.sh: line 63: /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.temp_file_list: Permission denied
cat: /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam.temp_file_list: No such file or directory
cat: /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam.temp_file_list: No such file or directory
Giving me permission denied to write in the folder? I’ll sym link the bam file to my data folder
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
ln -s /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam
Editing script so that the prefix is connecting with the sym linked file. Submitted batch job 303637. Still giving me the same error. Hollie may need to give me permission to write and access files in that specific folder.
20240221
Probably need to run haplomerger2, which is installed on the server already.
I’m also looking at the Canu FAQs to see if there is any info about using PacBio HiFi reads. Under the question “What parameters should I use for my reads?”, they have this info:
The defaults for -pacbio-hifi should work on this data. There is still some variation in data quality between samples. If you have poor continuity, it may be because the data is lower quality than expected. Canu will try to auto-adjust the error thresholds for this (which will be included in the report). If that still doesn’t give a good assembly, try running the assembly with -untrimmed. You will likely get a genome size larger than you expect, due to separation of alleles. See My genome size and assembly size are different, help! for details on how to remove this duplication.
When I look at the question “My genome size and assembly size are different, help!”, it says that this difference could be due to a heterozygous genome where the assembly separated some loci or the previous estimate is incorrect. They recommended running BUSCO to check completeness of the assembly (which I already did) and using purge_dups to remove duplication. I will look into this.
Next steps
- Run Canu with
-untrimmed
option - Run purge_dups - targets the removal of duplicated sequences to enhance overall quality of assembly
- Run haplomerger - merges haplotypes to addresws heterozygosity
In the scripts folder: nano canu_untrimmed.sh
#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load canu/2.2-GCCcore-11.2.0
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
#echo "Unzip paco-bio fastq file" $(date)
#gunzip m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq.gz
echo "Starting assembly w/ untrimmed flag" $(date)
canu -p apul.canu.untrimmed -d /data/putnamlab/jillashey/Apul_Genome/assembly/data genomeSize=475m -raw -pacbio-hifi m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq
echo "Canu assembly complete" $(date)
Submitted batch job 303660
On the purge_dups github, they say to install using the following:
git clone https://github.com/dfguan/purge_dups.git
cd purge_dups/src && make
# only needed if running run_purge_dups.py
git clone https://github.com/dfguan/runner.git
cd runner && python3 setup.py install --user
Cloned both into the assembly folder (ie /data/putnamlab/jillashey/Apul_Genome/assembly/
).
First, use pd_config.py to generate a configuration file. Here’s possible usage:
usage: pd_config.py [-h] [-s SRF] [-l LOCD] [-n FN] [--version] ref pbfofn
generate a configuration file in json format
positional arguments:
ref reference file in fasta/fasta.gz format
pbfofn list of pacbio file in fastq/fasta/fastq.gz/fasta.gz format (one absolute file path per line)
optional arguments:
-h, --help show this help message and exit
-s SRF, --srfofn SRF list of short reads files in fastq/fastq.gz format (one record per line, the
record is a tab splitted line of abosulte file path
plus trimmed bases, refer to
https://github.com/dfguan/KMC) [NONE]
-l LOCD, --localdir LOCD
local directory to keep the reference and lists of the
pacbio, short reads files [.]
-n FN, --name FN output config file name [config.json]
--version show program's version number and exit
# Example
./scripts/pd_config.py -l iHelSar1.pri -s 10x.fofn -n config.iHelSar1.PB.asm1.json ~/vgp/release/insects/iHelSar1/iHlSar1.PB.asm1/iHelSar1.PB.asm1.fa.gz pb.fofn
I need to make a list of the pacbio files that I’ll use in the script and put it in a file called pb.fofn.
for filename in *.fastq; do echo $PWD/$filename; done > pb.fofn
In my scripts folder: nano pd_config_apul.py
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Python/3.9.6-GCCcore-11.2.0 # do i need python?
echo "Making config file for purge dups scripts" $(date)
/data/putnamlab/jillashey/Apul_Genome/assembly/purge_dups/scripts/pd_config.py -l /data/putnamlab/jillashey/Apul_Genome/assembly/data -n config.apul.canu.json /data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq /data/putnamlab/jillashey/Apul_Genome/assembly/data/pb.fofn
echo "Config file complete" $(date)
Submitted batch job 303661. Ran in 1 second. Got this in the error file:
cp: ‘/data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq’ and ‘/data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq’ are the same file
cp: ‘/data/putnamlab/jillashey/Apul_Genome/assembly/data/pb.fofn’ and ‘/data/putnamlab/jillashey/Apul_Genome/assembly/data/pb.fofn’ are the same file
But it did generate a config file, which looks like this:
{
"cc": {
"fofn": "/data/putnamlab/jillashey/Apul_Genome/assembly/data/pb.fofn",
"isdip": 1,
"core": 12,
"mem": 20000,
"queue": "normal",
"mnmp_opt": "",
"bwa_opt": "",
"ispb": 1,
"skip": 0
},
"sa": {
"core": 12,
"mem": 10000,
"queue": "normal"
},
"busco": {
"core": 12,
"mem": 20000,
"queue": "long",
"skip": 0,
"lineage": "mammalia",
"prefix": "m84100_240128_024355_s2.hifi_reads.bc1029.fastq_purged",
"tmpdir": "busco_tmp"
},
"pd": {
"mem": 20000,
"queue": "normal"
},
"gs": {
"mem": 10000,
"oe": 1
},
"kcp": {
"core": 12,
"mem": 30000,
"fofn": "",
"prefix": "m84100_240128_024355_s2.hifi_reads.bc1029.fastq_purged_kcm",
"tmpdir": "kcp_tmp",
"skip": 1
},
"ref": "/data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq",
"out_dir": "m84100_240128_024355_s2.hifi_reads.bc1029.fastq"
}
My config file looks basically the same as the example one on the github. Manually edited the config file so that the out_dir was /data/putnamlab/jillashey/Apul_Genome/assembly/data/
. Now the purging can begin using run_purge_dups.py
. Here’s possible usage:
usage: run_purge_dups.py [-h] [-p PLTFM] [-w WAIT] [-r RETRIES] [--version]
config bin_dir spid
purge_dups wrapper
positional arguments:
config configuration file
bin_dir directory of purge_dups executable files
spid species identifier
optional arguments:
-h, --help show this help message and exit
-p PLTFM, --platform PLTFM
workload management platform, input bash if you want to run locally
-w WAIT, --wait WAIT <int> seconds sleep intervals
-r RETRIES, --retries RETRIES
maximum number of retries
--version show program's version number and exit
# Example
python scripts/run_purge_dups.py config.iHelSar1.json src iHelSar1
In my scripts folder: nano run_purge_dups_apul.py
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load Python/3.9.6-GCCcore-11.2.0 # do i need python?
echo "Starting to purge duplications" $(date)
/data/putnamlab/jillashey/Apul_Genome/assembly/purge_dups/scripts/run_purge_dups.py /data/putnamlab/jillashey/Apul_Genome/assembly/scripts/config.apul.canu.json /data/putnamlab/jillashey/Apul_Genome/assembly/purge_dups/src apul
echo "Duplication purge complete" $(date)
Submitted batch job 303662. Failed immediately with this error:
Traceback (most recent call last):
File "/data/putnamlab/jillashey/Apul_Genome/assembly/purge_dups/scripts/run_purge_dups.py", line 3, in <module>
from runner.manager import manager
ModuleNotFoundError: No module named 'runner'
I installed runner but the code is not seeing it…where am I supposed to put it? inside of the purge_dups github? Okay going to move runner
folder inside of the purge_dups
folder.
cd /data/putnamlab/jillashey/Apul_Genome/assembly
mv runner/ purge_dups/scripts/
Submitting job again, Submitted batch job 303663. Got the same error. In the run_purge_dups.py
script itself, the first few lines are:
#!/usr/bin/env python3
from runner.manager import manager
from runner.hpc import hpc
from multiprocessing import Process, Pool
import sys, os, json
import argparse
So it isn’t seeing that the runner module is there. This issue and this issue on the github were reported but never really answered in a clear way. Will have to look into this more.
In other news, the canu script finished running but looks like it failed. This is the bottom of the error message:
ERROR:
ERROR: Failed with exit code 139. (rc=35584)
ERROR:
ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
ABORT: failed to configure the overlap store.
ABORT:
ABORT: Disk space available: 8477.134 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (unitigging/apul.canu.untrimmed.ovlStore.config.err):
ABORT:
ABORT:
ABORT: Finding number of overlaps per read and per file.
ABORT:
ABORT: Moverlaps
ABORT: ------------ ----------------------------------------
ABORT:
ABORT: Failed with 'Segmentation fault'; backtrace (libbacktrace):
ABORT:
Unsure what it means…
20240301
BIG NEWS!!!!! This week, a paper came out that assembled and annotated the Orbicella faveolata genome using PacBio HiFi reads (Young et al. 2024)!!!!!!! The github for this paper has a detailed pipeline for how the genome was put together. Since I am also using HiFi reads, I will be following their methodology! I am using this pipeline starting at line 260.
I changed the file from bam to fastq, but now I need to change it to fasta with seqtk
. In the scripts folder: nano seqtk.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load seqtk/1.3-GCC-9.3.0
echo "Convert PacBio fastq file to fasta file" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
seqtk seq -a m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq > m84100_240128_024355_s2.hifi_reads.bc1029.fasta
echo "Fastq to fasta complete! Summarize read lengths" $(date)
awk '/^>/{printf("%s\t",substr($0,2));next;} {print length}' m84100_240128_024355_s2.hifi_reads.bc1029.fasta > rr_read_lengths.txt
echo "Read length summary complete" $(date)
Submitted batch job 304257.
In R, I looked at the data to quantify length for each read. See code here.
```{r, echo=F} read.table(file = “../data/rr_read_lengths.txt”, header = F) %>% dplyr::rename(“hifi_read_name” = 1, “length” = 2) -> hifi_read_length nrow(hifi_read_length) # 5,898,386 total reads mean(hifi_read_length$length) # mean length of reads is 13,424.64 sum(hifi_read_length$length) #length sum 79,183,709,778. Will need this for the NCBI submission
Make histogram for read bins from raw hifi data
```{r, echo = F}
ggplot(data = hifi_read_length,
aes(x = length, fill = "blue")) +
geom_histogram(binwidth = 2000) +
labs(x = "Raw Read Length", y = "Count", title = "Histogram of Raw HiFi Read Lengths") +
scale_fill_manual(values = c("blue")) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
20240303
My next step is to remove any contaminant reads from the raw hifi reads. From Young et al. 2024: “Raw HiFi reads first underwent a contamination screening, following the methodology in [68], using BLASTn [32, 68] against the assembled mitochondrial O. faveolata genome and the following databases: common eukaryote contaminant sequences (ftp.ncbi.nlm.nih. gov/pub/kitts/contam_in_euks.fa.gz), NCBI viral (ref_ viruses_rep_genomes) and prokaryote (ref_prok_rep_ genomes) representative genome sets”.
I tried to run the update_blastdb.pl
script (included in the blast program) with the BLAST+/2.13.0-gompi-2022a
module but I got this error:
Can't locate Archive/Tar.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /opt/software/BLAST+/2.13.0-gompi-2022a/bin/update_blastdb.pl line 41.
BEGIN failed--compilation aborted at /opt/software/BLAST+/2.13.0-gompi-2022a/bin/update_blastdb.pl line 41.
Not sure what this means…will email Kevin Bryan to ask about it, as I don’t want to mess with anything on the installed modules. I did download the contam_in_euks.fa.gz
db to my computer so I’m going to copy it to Andromeda. This file is considerably smaller than the viral or prok dbs.
Make a database folder in the Apul genome folder.
cd /data/putnamlab/jillashey/Apul_Genome
mkdir dbs
cd dbs
zgrep -c ">" contam_in_euks.fa
3554
Now I ran run a script that blasts the pacbio fasta against these sequences. In /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
, nano blast_contam_euk.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
echo "BLASTing hifi fasta against eukaryote contaminant sequences" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -subject /data/putnamlab/jillashey/Apul_Genome/dbs/contam_in_euks.fa -task megablast -outfmt 6 -evalue 4 -perc_identity 90 -num_threads 15 -out contaminant_hits_euks_rr.txt
echo "BLAST complete, remove contaminant seqs from hifi fasta" $(date)
awk '{ if( ($4 >= 50 && $4 <= 99 && $3 >=98 ) ||
($4 >= 100 && $4 <= 199 && $3 >= 94 ) ||
($4 >= 200 && $3 >= 90) ) {print $0}
}' contaminant_hits_euks_rr.txt > contaminants_pass_filter_euks_rr.txt
echo "Contaminant seqs removed from hifi fasta" $(date)
Submitted batch job 304389. Finished in about 2.5 hours. Looked at the output in R (code here).
20240304
Emailed Kevin Bryan this morning and asked if he knew anything about why the update_blastdb.pl
wasn’t working. Still waiting to hear back from him.
Kevin Bryan also emailed me this morning about my hifiasm job that has been running for 18 days and said: “This job has been running for 18 days. I just took a look at it and it appears you didn’t specify -t $SLURM_CPUS_ON_NODE
(and also #SBATCH --exclusive
) to make use of all of the CPU cores on the node. You might want to consider re-submitting this job with those parameters. Because the nodes generally have 36 cores, it should be able to catch up to where it is now in a little over half a day, assuming perfect scaling.”
I need to add those parameters into the hifiasm code, so cancelling the 300534
job. In /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
, nano hifiasm.sh
:
#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
conda activate /data/putnamlab/conda/hifiasm
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Starting assembly with hifiasm" $(date)
hifiasm -o apul.hifiasm m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq -t 36
echo "Assembly with hifiasm complete!" $(date)
conda deactivate
Submitted batch job 304463
20240304
Response from Kevin Bryan about viral and prok blast databases: “Ok, I downloaded those databases, and actually consolidated the rest of them into /data/shared/ncbi-db/
, under which is a directory for today, and then there will be a new one next Sunday and following Sundays. There’s a file /data/shared/ncbi-db/.ncbirc
that gets updated to point to the current directory, which the blast* tools will automatically pick up, so you can just do -db ref_prok_rep_ genomes
, for example.
For other tools that can read the ncbi databases, you can use blastdb_path -db ref_viruses_rep_genomes
to get the path, although for some reason with the nr database you need to specify -dbtype prot
, i.e., blastdb_path -db nr -dbtype prot
.
The reason for the extra complication is because otherwise a job that runs while the database is being updated may fail or return strange results. The dated directories should resolve this issue. Note that Unity blast-plus modules work in a similar way with a different path, /datasets/bio/ncbi-db
.”
Amazing! Now I can move forward with the blasting against viral and prok genomes. In the scripts folder: nano blastn_viral.sh
#!/bin/bash
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Blasting hifi reads against viral genomes to look for contaminants" $(date)
blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db ref_viruses_rep_genomes -outfmt 6 -evalue 1e-4 -perc_identity 90 -out viral_contaminant_hits_rr.txt
echo "Blast complete!" $(date)
Submitted batch job 304500
In the scripts folder: nano blastn_prok.sh
#!/bin/bash
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Blasting hifi reads against prokaryote genomes to look for contaminants" $(date)
blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db ref_prok_rep_genomes -outfmt 6 -evalue 1e-4 -perc_identity 90 -out prok_contaminant_hits_rr.txt
echo "Blast complete!" $(date)
Submitted batch job 304502
20240306
Making list of all software programs that Young et al. 2024 used and if they are on Andromeda
- blastn
- On Andromeda? YES
- Meryl
- On Andromeda? NO
- Genome-Scope2
- On Andromeda? NO
- Hifiasm
- On Andromeda? NO but I added it to
putnamlab
via conda
- On Andromeda? NO but I added it to
- Quast
- On Andromeda? YES
- Busco
- On Andromeda? YES
- Merqury
- On Andromeda? NO
- RepeatModeler2
- On Andromeda? NO
- Repeat-Masker
- On Andromeda? YES
- TeloScafs
- On Andromeda? NO
- PASA
- On Andromeda? NO
- funnannotate
- On Andromeda? NO
- Augustus
- On Andromeda? YES
- GeneMark-ES/ET
- On Andromeda? YES but only GeneMark-ET
- snap
- On Andromeda? YES
- glimmerhmm
- On Andromeda? NO
- Evidence Modeler
- On Andromeda? NO
- tRNAscan-SE
- On Andromeda? YES
- Trinity
- On Andromeda? Yes
- InterproScan
- On Andromeda? YES
20240311
Hifiasm (with unfiltered reads) finished running over the weekend and the prok blast script preemptively ended and then restarted in the early hours of this morning. I think this might be because I am not making use of all cores on the node (similar to my earlier hifiasm script). I cancelled the prok blast job (304502
) and edited the script so that it includes the flag -num_threads 36
. Submitted batch job 305351
It created many files:
-rw-r--r--. 1 jillashey 19G Mar 7 23:58 apul.hifiasm.ec.bin
-rw-r--r--. 1 jillashey 47G Mar 8 00:08 apul.hifiasm.ovlp.source.bin
-rw-r--r--. 1 jillashey 17G Mar 8 00:12 apul.hifiasm.ovlp.reverse.bin
-rw-r--r--. 1 jillashey 1.2G Mar 8 01:37 apul.hifiasm.bp.r_utg.gfa
-rw-r--r--. 1 jillashey 21M Mar 8 01:37 apul.hifiasm.bp.r_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.6M Mar 8 01:41 apul.hifiasm.bp.r_utg.lowQ.bed
-rw-r--r--. 1 jillashey 1.1G Mar 8 01:42 apul.hifiasm.bp.p_utg.gfa
-rw-r--r--. 1 jillashey 21M Mar 8 01:42 apul.hifiasm.bp.p_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.2M Mar 8 01:46 apul.hifiasm.bp.p_utg.lowQ.bed
-rw-r--r--. 1 jillashey 506M Mar 8 01:47 apul.hifiasm.bp.p_ctg.gfa
-rw-r--r--. 1 jillashey 11M Mar 8 01:47 apul.hifiasm.bp.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar 8 01:49 apul.hifiasm.bp.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 469M Mar 8 01:50 apul.hifiasm.bp.hap1.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar 8 01:50 apul.hifiasm.bp.hap1.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar 8 01:52 apul.hifiasm.bp.hap1.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 468M Mar 8 01:52 apul.hifiasm.bp.hap2.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar 8 01:52 apul.hifiasm.bp.hap2.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 1.9M Mar 8 01:54 apul.hifiasm.bp.hap2.p_ctg.lowQ.bed
This page gives a brief overview of the hifiasm output files, which is super helpful. It generates the assembly graphs in Graphical Fragment Assembly (GFA) format.
- prefix.r_utg.gfa: haplotype-resolved raw unitig graph. This graph keeps all haplotype information.
- A unitig is a portion of a contig. It is a nondisputed and assembled group of fragments. A contiguous sequence of ordered unitigs is a contig, and a single unitig can be in multiple contigs.
- prefix.p_utg.gfa: haplotype-resolved processed unitig graph without small bubbles. Small bubbles might be caused by somatic mutations or noise in data, which are not the real haplotype information. Hifiasm automatically pops such small bubbles based on coverage. The option –hom-cov affects the result. See homozygous coverage setting for more details. In addition, the option -p forcedly pops bubbles.
- Confused about the bubbles, but it looks like a medium level (what is a “medium” level”?) of heterozygosity will result in bubbles (see image in this post)
- Homozygous coverage refers to coverage threshold for homozygous reads. Hifiasm prints it as:
[M::purge_dups] homozygous read coverage threshold: X
. If it is not around homozygous coverage, the final assembly might be either too large or too small.
- prefix.p_ctg.gfa: assembly graph of primary contigs. This graph includes a complete assembly with long stretches of phased blocks.
- From my understanding based on this post discussing concepts in phased assemblies, a phased assembly identifies different alleles
- prefix.a_ctg.gfa: assembly graph of alternate contigs. This graph consists of all contigs that are discarded in primary contig graph.
- There were none of these files in my output. Does this mean all contigs created were used in the final assembly?
- prefix.hap*.p_ctg.gfa: phased contig graph. This graph keeps the phased contigs for haplotype 1 and haplotype 2.
I believe this file (apul.hifiasm.bp.p_ctg.gfa
) contains the sequence information for the assembled contigs. zgrep -c "S" apul.hifiasm.bp.p_ctg.gfa
showed that there were 188 Segments, or continuous sequences, in this assembly, meaning there are 188 contigs (to my understanding). This file (apul.hifiasm.bp.p_ctg.noseq.gfa
) contains information about the reads used to construct the contigs in a plain text format.
head apul.hifiasm.bp.p_ctg.noseq.gfa
S ptg000001l * LN:i:21642937 rd:i:82
A ptg000001l 0 + m84100_240128_024355_s2/165481517/ccs 0 25806 id:i:5084336 HG:A:a
A ptg000001l 6512 + m84100_240128_024355_s2/221910786/ccs 0 21626 id:i:5609370 HG:A:a
A ptg000001l 7287 - m84100_240128_024355_s2/128778986/ccs 0 24026 id:i:2691352 HG:A:a
A ptg000001l 7493 + m84100_240128_024355_s2/262476615/ccs 0 30540 id:i:287120 HG:A:a
A ptg000001l 15042 + m84100_240128_024355_s2/191824085/ccs 0 27619 id:i:2783058 HG:A:a
A ptg000001l 16616 - m84100_240128_024355_s2/29886096/ccs 0 28664 id:i:2336674 HG:A:a
A ptg000001l 17527 - m84100_240128_024355_s2/37553051/ccs 0 31883 id:i:2336554 HG:A:a
A ptg000001l 21104 - m84100_240128_024355_s2/242419829/ccs 0 28551 id:i:5120251 HG:A:a
A ptg000001l 21788 + m84100_240128_024355_s2/266536351/ccs 0 27994 id:i:5577462 HG:A:a
The S
line is the Segment and it acts as a header for the the contig. LN:i:
is the segment length (in this case, 21,642,937 bp). The rd:i:
is the read coverage, calculated by the reads coming from the same contig (in this case, read coverage is 82, which is high). The A
lines provide information about the sequences that make up the contig. Here’s what each column means
- Column 1: should always be A
- Column 2: contig name
- Column 3: contig start coordinate of subregion constructed by read
- Column 4: read strand (+ or -)
- Column 5: read name
- Column 6: read start coordinate of subregion which is used to construct contig
- Column 7: read end coordinate of subregion which is used to construct contig
- Column 8: read ID
- Column 9: haplotype status of read.
HG:A:a
,HG:A:p
, andHG:A:m
indicate that the read is non-binnable (ie heterozygous), father/hap1 specific, or mother/hap2 specific.
If I’m interpreting this correctly, it looks like most of the reads in the first contig are heterozygous.
The error file (slurm-304463.error
) for this script contains the histogram of the kmers. It has 4 iterations of kmer histograms with some sort of analysis in between the histograms. Here’s what the last histogram looks like:
[M::ha_analyze_count] lowest: count[9] = 2315
[M::ha_analyze_count] highest: count[83] = 417609
[M::ha_hist_line] 1: ****************************************************************************************************> 1732549
[M::ha_hist_line] 2: ************* 53793
[M::ha_hist_line] 3: **** 16969
[M::ha_hist_line] 4: ** 9184
[M::ha_hist_line] 5: * 5507
[M::ha_hist_line] 6: * 4361
[M::ha_hist_line] 7: * 3389
[M::ha_hist_line] 8: * 2959
[M::ha_hist_line] 9: * 2315
[M::ha_hist_line] 10: * 2369
[M::ha_hist_line] 11: 2037
[M::ha_hist_line] 12: 1889
[M::ha_hist_line] 13: 1724
[M::ha_hist_line] 14: 1729
[M::ha_hist_line] 15: 1721
[M::ha_hist_line] 16: 1641
[M::ha_hist_line] 17: 1530
[M::ha_hist_line] 18: 1687
[M::ha_hist_line] 19: 1389
[M::ha_hist_line] 20: 1341
[M::ha_hist_line] 21: 1370
[M::ha_hist_line] 22: 1269
[M::ha_hist_line] 23: 1249
[M::ha_hist_line] 24: 1329
[M::ha_hist_line] 25: 1327
[M::ha_hist_line] 26: 1317
[M::ha_hist_line] 27: 1274
[M::ha_hist_line] 28: 1370
[M::ha_hist_line] 29: 1495
[M::ha_hist_line] 30: 1468
[M::ha_hist_line] 31: 1677
[M::ha_hist_line] 32: 1625
[M::ha_hist_line] 33: 1729
[M::ha_hist_line] 34: 1697
[M::ha_hist_line] 35: 1825
[M::ha_hist_line] 36: 1919
[M::ha_hist_line] 37: 1966
[M::ha_hist_line] 38: 2066
[M::ha_hist_line] 39: * 2101
[M::ha_hist_line] 40: * 2195
[M::ha_hist_line] 41: * 2119
[M::ha_hist_line] 42: * 2100
[M::ha_hist_line] 43: * 2325
[M::ha_hist_line] 44: * 2644
[M::ha_hist_line] 45: * 2807
[M::ha_hist_line] 46: * 3080
[M::ha_hist_line] 47: * 3289
[M::ha_hist_line] 48: * 3661
[M::ha_hist_line] 49: * 3984
[M::ha_hist_line] 50: * 4856
[M::ha_hist_line] 51: * 5391
[M::ha_hist_line] 52: ** 6627
[M::ha_hist_line] 53: ** 7648
[M::ha_hist_line] 54: ** 9319
[M::ha_hist_line] 55: *** 11051
[M::ha_hist_line] 56: *** 13316
[M::ha_hist_line] 57: **** 16452
[M::ha_hist_line] 58: ***** 19650
[M::ha_hist_line] 59: ****** 24229
[M::ha_hist_line] 60: ******* 29998
[M::ha_hist_line] 61: ********* 37438
[M::ha_hist_line] 62: *********** 45813
[M::ha_hist_line] 63: ************* 54367
[M::ha_hist_line] 64: **************** 67165
[M::ha_hist_line] 65: ******************* 79086
[M::ha_hist_line] 66: *********************** 95901
[M::ha_hist_line] 67: *************************** 111990
[M::ha_hist_line] 68: ******************************* 128413
[M::ha_hist_line] 69: *********************************** 147679
[M::ha_hist_line] 70: ***************************************** 171224
[M::ha_hist_line] 71: ********************************************** 193638
[M::ha_hist_line] 72: **************************************************** 217984
[M::ha_hist_line] 73: ********************************************************** 242258
[M::ha_hist_line] 74: **************************************************************** 266643
[M::ha_hist_line] 75: ********************************************************************** 291355
[M::ha_hist_line] 76: **************************************************************************** 317522
[M::ha_hist_line] 77: ********************************************************************************* 337960
[M::ha_hist_line] 78: ************************************************************************************** 358656
[M::ha_hist_line] 79: ******************************************************************************************* 378437
[M::ha_hist_line] 80: ********************************************************************************************** 393447
[M::ha_hist_line] 81: ************************************************************************************************* 405399
[M::ha_hist_line] 82: *************************************************************************************************** 413774
[M::ha_hist_line] 83: **************************************************************************************************** 417609
[M::ha_hist_line] 84: **************************************************************************************************** 416365
[M::ha_hist_line] 85: **************************************************************************************************** 417459
[M::ha_hist_line] 86: *************************************************************************************************** 413637
[M::ha_hist_line] 87: ************************************************************************************************* 404341
[M::ha_hist_line] 88: ********************************************************************************************* 387519
[M::ha_hist_line] 89: ***************************************************************************************** 372331
[M::ha_hist_line] 90: ************************************************************************************ 352702
[M::ha_hist_line] 91: ******************************************************************************** 333308
[M::ha_hist_line] 92: ************************************************************************* 305452
[M::ha_hist_line] 93: ******************************************************************* 279706
[M::ha_hist_line] 94: ************************************************************** 257317
[M::ha_hist_line] 95: ******************************************************* 230115
[M::ha_hist_line] 96: ************************************************* 205580
[M::ha_hist_line] 97: ******************************************* 181564
[M::ha_hist_line] 98: ************************************** 159113
[M::ha_hist_line] 99: ********************************* 139005
[M::ha_hist_line] 100: ***************************** 120518
[M::ha_hist_line] 101: ************************* 102686
[M::ha_hist_line] 102: ********************* 86025
[M::ha_hist_line] 103: ****************** 73567
[M::ha_hist_line] 104: *************** 61207
[M::ha_hist_line] 105: ************ 50380
[M::ha_hist_line] 106: ********** 41491
[M::ha_hist_line] 107: ******** 34384
[M::ha_hist_line] 108: ******* 28223
[M::ha_hist_line] 109: ***** 22483
[M::ha_hist_line] 110: **** 18607
[M::ha_hist_line] 111: **** 14975
[M::ha_hist_line] 112: *** 12513
[M::ha_hist_line] 113: ** 10316
[M::ha_hist_line] 114: ** 8237
[M::ha_hist_line] 115: ** 6969
[M::ha_hist_line] 116: * 6015
[M::ha_hist_line] 117: * 5348
[M::ha_hist_line] 118: * 4850
[M::ha_hist_line] 119: * 4508
[M::ha_hist_line] 120: * 4436
[M::ha_hist_line] 121: * 4296
[M::ha_hist_line] 122: * 4655
[M::ha_hist_line] 123: * 4414
[M::ha_hist_line] 124: * 4850
[M::ha_hist_line] 125: * 5053
[M::ha_hist_line] 126: * 5326
[M::ha_hist_line] 127: * 6256
[M::ha_hist_line] 128: ** 6763
[M::ha_hist_line] 129: ** 7359
[M::ha_hist_line] 130: ** 8371
[M::ha_hist_line] 131: ** 9116
[M::ha_hist_line] 132: ** 10114
[M::ha_hist_line] 133: *** 11557
[M::ha_hist_line] 134: *** 12951
[M::ha_hist_line] 135: *** 14573
[M::ha_hist_line] 136: **** 16195
[M::ha_hist_line] 137: **** 17982
[M::ha_hist_line] 138: ***** 19859
[M::ha_hist_line] 139: ***** 22041
[M::ha_hist_line] 140: ****** 24033
[M::ha_hist_line] 141: ****** 26500
[M::ha_hist_line] 142: ******* 30035
[M::ha_hist_line] 143: ******** 32677
[M::ha_hist_line] 144: ********* 36297
[M::ha_hist_line] 145: ********* 39324
[M::ha_hist_line] 146: ********** 43146
[M::ha_hist_line] 147: *********** 47105
[M::ha_hist_line] 148: ************ 52168
[M::ha_hist_line] 149: ************* 55935
[M::ha_hist_line] 150: *************** 61001
[M::ha_hist_line] 151: **************** 65990
[M::ha_hist_line] 152: ***************** 70743
[M::ha_hist_line] 153: ****************** 74363
[M::ha_hist_line] 154: ******************* 79804
[M::ha_hist_line] 155: ******************** 83768
[M::ha_hist_line] 156: ********************* 88057
[M::ha_hist_line] 157: ********************** 92818
[M::ha_hist_line] 158: *********************** 97623
[M::ha_hist_line] 159: ************************* 103918
[M::ha_hist_line] 160: ************************** 107072
[M::ha_hist_line] 161: ************************** 110105
[M::ha_hist_line] 162: *************************** 113902
[M::ha_hist_line] 163: **************************** 117243
[M::ha_hist_line] 164: **************************** 118933
[M::ha_hist_line] 165: ***************************** 122058
[M::ha_hist_line] 166: ****************************** 123371
[M::ha_hist_line] 167: ****************************** 125091
[M::ha_hist_line] 168: ****************************** 125263
[M::ha_hist_line] 169: ****************************** 125254
[M::ha_hist_line] 170: ****************************** 123856
[M::ha_hist_line] 171: ****************************** 123656
[M::ha_hist_line] 172: ***************************** 121423
[M::ha_hist_line] 173: ***************************** 121400
[M::ha_hist_line] 174: **************************** 117980
[M::ha_hist_line] 175: **************************** 115344
[M::ha_hist_line] 176: *************************** 112655
[M::ha_hist_line] 177: ************************** 109202
[M::ha_hist_line] 178: ************************* 105297
[M::ha_hist_line] 179: ************************ 102172
[M::ha_hist_line] 180: *********************** 97507
[M::ha_hist_line] 181: ********************** 93418
[M::ha_hist_line] 182: ********************* 88400
[M::ha_hist_line] 183: ******************** 83674
[M::ha_hist_line] 184: ******************* 77971
[M::ha_hist_line] 185: ***************** 72480
[M::ha_hist_line] 186: **************** 68366
[M::ha_hist_line] 187: *************** 63165
[M::ha_hist_line] 188: ************** 58702
[M::ha_hist_line] 189: ************* 54012
[M::ha_hist_line] 190: ************ 50360
[M::ha_hist_line] 191: *********** 45887
[M::ha_hist_line] 192: ********** 40846
[M::ha_hist_line] 193: ********* 36887
[M::ha_hist_line] 194: ******** 33506
[M::ha_hist_line] 195: ******* 30266
[M::ha_hist_line] 196: ******* 27487
[M::ha_hist_line] 197: ****** 24333
[M::ha_hist_line] 198: ***** 21602
[M::ha_hist_line] 199: ***** 19303
[M::ha_hist_line] 200: **** 17108
[M::ha_hist_line] 201: **** 15156
[M::ha_hist_line] 202: *** 13661
[M::ha_hist_line] 203: *** 12076
[M::ha_hist_line] 204: *** 10526
[M::ha_hist_line] 205: ** 9215
[M::ha_hist_line] 206: ** 8143
[M::ha_hist_line] 207: ** 7395
[M::ha_hist_line] 208: ** 6602
[M::ha_hist_line] 209: * 5949
[M::ha_hist_line] 210: * 5447
[M::ha_hist_line] 211: * 4869
[M::ha_hist_line] 212: * 4270
[M::ha_hist_line] 213: * 3890
[M::ha_hist_line] 214: * 3731
[M::ha_hist_line] 215: * 3629
[M::ha_hist_line] 216: * 3494
[M::ha_hist_line] 217: * 3613
[M::ha_hist_line] 218: * 3512
[M::ha_hist_line] 219: * 3618
[M::ha_hist_line] 220: * 3772
[M::ha_hist_line] 221: * 3774
[M::ha_hist_line] 222: * 3708
[M::ha_hist_line] 223: * 3818
[M::ha_hist_line] 224: * 3986
[M::ha_hist_line] 225: * 4029
[M::ha_hist_line] 226: * 4380
[M::ha_hist_line] 227: * 4386
[M::ha_hist_line] 228: * 4510
[M::ha_hist_line] 229: * 4678
[M::ha_hist_line] 230: * 4797
[M::ha_hist_line] 231: * 5106
[M::ha_hist_line] 232: * 5197
[M::ha_hist_line] 233: * 5242
[M::ha_hist_line] 234: * 5474
[M::ha_hist_line] 235: * 5733
[M::ha_hist_line] 236: * 6021
[M::ha_hist_line] 237: * 6265
[M::ha_hist_line] 238: * 6246
[M::ha_hist_line] 239: ** 6646
[M::ha_hist_line] 240: ** 6722
[M::ha_hist_line] 241: ** 6844
[M::ha_hist_line] 242: ** 6733
[M::ha_hist_line] 243: ** 7254
[M::ha_hist_line] 244: ** 7250
[M::ha_hist_line] 245: ** 7243
[M::ha_hist_line] 246: ** 7275
[M::ha_hist_line] 247: ** 7405
[M::ha_hist_line] 248: ** 7421
[M::ha_hist_line] 249: ** 7571
[M::ha_hist_line] 250: ** 7291
[M::ha_hist_line] 251: ** 7331
[M::ha_hist_line] 252: ** 7309
[M::ha_hist_line] 253: ** 7278
[M::ha_hist_line] 254: ** 7264
[M::ha_hist_line] 255: ** 7092
[M::ha_hist_line] 256: ** 6912
[M::ha_hist_line] 257: ** 6958
[M::ha_hist_line] 258: ** 6689
[M::ha_hist_line] 259: ** 6607
[M::ha_hist_line] 260: ** 6542
[M::ha_hist_line] 261: * 6242
[M::ha_hist_line] 262: * 6185
[M::ha_hist_line] 263: * 5934
[M::ha_hist_line] 264: * 5717
[M::ha_hist_line] 265: * 5388
[M::ha_hist_line] 266: * 5448
[M::ha_hist_line] 267: * 5279
[M::ha_hist_line] 268: * 4944
[M::ha_hist_line] 269: * 4724
[M::ha_hist_line] 270: * 4549
[M::ha_hist_line] 271: * 4404
[M::ha_hist_line] 272: * 4417
[M::ha_hist_line] 273: * 4041
[M::ha_hist_line] 274: * 3888
[M::ha_hist_line] 275: * 3760
[M::ha_hist_line] 276: * 3592
[M::ha_hist_line] 277: * 3490
[M::ha_hist_line] 278: * 3150
[M::ha_hist_line] 279: * 3001
[M::ha_hist_line] 280: * 2926
[M::ha_hist_line] 281: * 3084
[M::ha_hist_line] 282: * 2758
[M::ha_hist_line] 283: * 2672
[M::ha_hist_line] 284: * 2526
[M::ha_hist_line] 285: * 2432
[M::ha_hist_line] 286: * 2313
[M::ha_hist_line] 287: * 2314
[M::ha_hist_line] 288: * 2255
[M::ha_hist_line] 289: * 2266
[M::ha_hist_line] 290: * 2230
[M::ha_hist_line] 291: * 2124
[M::ha_hist_line] 292: * 2109
[M::ha_hist_line] 293: * 2208
[M::ha_hist_line] 294: * 2187
[M::ha_hist_line] 295: * 2154
[M::ha_hist_line] 296: * 2139
[M::ha_hist_line] 297: * 2160
[M::ha_hist_line] 298: * 2211
[M::ha_hist_line] 299: * 2220
[M::ha_hist_line] 300: * 2364
[M::ha_hist_line] 301: * 2340
[M::ha_hist_line] 302: * 2367
[M::ha_hist_line] 303: * 2554
[M::ha_hist_line] 304: * 2611
[M::ha_hist_line] 305: * 2559
[M::ha_hist_line] 306: * 2525
[M::ha_hist_line] 307: * 2573
[M::ha_hist_line] 308: * 2791
[M::ha_hist_line] 309: * 2797
[M::ha_hist_line] 310: * 2819
[M::ha_hist_line] 311: * 2945
[M::ha_hist_line] 312: * 2958
[M::ha_hist_line] 313: * 3034
[M::ha_hist_line] 314: * 3275
[M::ha_hist_line] 315: * 3267
[M::ha_hist_line] 316: * 3351
[M::ha_hist_line] 317: * 3284
[M::ha_hist_line] 318: * 3513
[M::ha_hist_line] 319: * 3510
[M::ha_hist_line] 320: * 3692
[M::ha_hist_line] 321: * 3704
[M::ha_hist_line] 322: * 3755
[M::ha_hist_line] 323: * 3906
[M::ha_hist_line] 324: * 3910
[M::ha_hist_line] 325: * 3888
[M::ha_hist_line] 326: * 4038
[M::ha_hist_line] 327: * 4052
[M::ha_hist_line] 328: * 4249
[M::ha_hist_line] 329: * 4048
[M::ha_hist_line] 330: * 3908
[M::ha_hist_line] 331: * 4098
[M::ha_hist_line] 332: * 3993
[M::ha_hist_line] 333: * 4066
[M::ha_hist_line] 334: * 4055
[M::ha_hist_line] 335: * 4079
[M::ha_hist_line] 336: * 4027
[M::ha_hist_line] 337: * 3871
[M::ha_hist_line] 338: * 3942
[M::ha_hist_line] 339: * 3919
[M::ha_hist_line] 340: * 3896
[M::ha_hist_line] 341: * 3891
[M::ha_hist_line] 342: * 3721
[M::ha_hist_line] 343: * 3773
[M::ha_hist_line] 344: * 3657
[M::ha_hist_line] 345: * 3596
[M::ha_hist_line] 346: * 3377
[M::ha_hist_line] 347: * 3273
[M::ha_hist_line] 348: * 3190
[M::ha_hist_line] 349: * 3224
[M::ha_hist_line] 350: * 3142
[M::ha_hist_line] 351: * 3076
[M::ha_hist_line] 352: * 3058
[M::ha_hist_line] 353: * 2916
[M::ha_hist_line] 354: * 2776
[M::ha_hist_line] 355: * 2764
[M::ha_hist_line] 356: * 2785
[M::ha_hist_line] 357: * 2701
[M::ha_hist_line] 358: * 2490
[M::ha_hist_line] 359: * 2416
[M::ha_hist_line] 360: * 2361
[M::ha_hist_line] 361: * 2298
[M::ha_hist_line] 362: * 2325
[M::ha_hist_line] 363: * 2146
[M::ha_hist_line] 364: * 2136
[M::ha_hist_line] 365: * 2099
[M::ha_hist_line] rest: *********************************************************************************************** 396842
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 83; peak_het: -1
[M::ha_ct_shrink::285039.075*35.33] ==> counted 17036513 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137320573 minimizers
[M::ha_pt_gen::285482.798*35.31] ==> indexed 2135588024 positions, counted 17036513 distinct minimizer k-mers
[M::ha_assemble::297514.365*35.34@246.283GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 1183659490
[M::ha_print_ovlp_stat] # strong overlaps: 596745657
[M::ha_print_ovlp_stat] # weak overlaps: 586913833
[M::ha_print_ovlp_stat] # exact overlaps: 1149035029
[M::ha_print_ovlp_stat] # inexact overlaps: 34624461
[M::ha_print_ovlp_stat] # overlaps without large indels: 1180991551
[M::ha_print_ovlp_stat] # reverse overlaps: 410771757
Writing reads to disk...
Reads has been written.
Thats a lot of information. The Hifiasm output page says that for heterozygous samples (which mine likely are), there should be 2 peaks in the k-mer plot, where the smaller peak is around the heterozygous read coverage and the larger peak is around the homozygous read coverage. This is true for all k-mer plots produced from this data. In all of my k-mer plots, the homozygous peak is 83 and the heterozygous peak is 168. I’m going to include the information without the k-mer plots below because they are so large:
[M::ha_analyze_count] lowest: count[17] = 11309
[M::ha_analyze_count] highest: count[85] = 9499858
## first k-mer plot
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[168] = 3093394
[M::ha_ft_gen] peak_hom: 168; peak_het: 85
[M::ha_ct_shrink::3427.856*4.32] ==> counted 2684382 distinct minimizer k-mers
[M::ha_ft_gen::3431.212*4.31@20.882GB] ==> filtered out 2684382 k-mers occurring 840 or more times
[M::ha_opt_update_cov] updated max_n_chain to 840
[M::yak_count] collected 2139678804 minimizers
[M::ha_pt_gen::4355.642*5.80] ==> counted 31723566 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[17] = 2230
[M::ha_analyze_count] highest: count[83] = 418590
## second k-mer plot
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[168] = 125423
[M::ha_pt_gen] peak_hom: 168; peak_het: 83
[M::ha_ct_shrink::4355.907*5.81] ==> counted 17720475 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2139678804 minimizers
[M::ha_pt_gen::4818.141*7.37] ==> indexed 2125675713 positions, counted 17720475 distinct minimizer k-mers
[M::ha_assemble::100006.981*34.52@107.237GB] ==> corrected reads for round 1
[M::ha_assemble] # bases: 79183709778; # corrected bases: 91988177; # recorrected bases: 115414
[M::ha_assemble] size of buffer: 61.709GB
[M::yak_count] collected 2137476150 minimizers
[M::ha_pt_gen::100401.244*34.48] ==> counted 19037497 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[13] = 1719
[M::ha_analyze_count] highest: count[83] = 417869
## third k-mer plot
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 83; peak_het: -1
[M::ha_ct_shrink::100401.532*34.48] ==> counted 17060420 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137476150 minimizers
[M::ha_pt_gen::100851.319*34.43] ==> indexed 2135499073 positions, counted 17060420 distinct minimizer k-mers
[M::ha_assemble::192251.063*35.13@143.827GB] ==> corrected reads for round 2
[M::ha_assemble] # bases: 79186546424; # corrected bases: 1694280; # recorrected bases: 16078
[M::ha_assemble] size of buffer: 60.552GB
[M::yak_count] collected 2137334800 minimizers
[M::ha_pt_gen::192627.741*35.11] ==> counted 18787923 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[9] = 2352
[M::ha_analyze_count] highest: count[83] = 417664
## fourth k-mer plot
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 83; peak_het: -1
[M::ha_ct_shrink::192628.074*35.11] ==> counted 17040202 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137334800 minimizers
[M::ha_pt_gen::193066.706*35.08] ==> indexed 2135587079 positions, counted 17040202 distinct minimizer k-mers
[M::ha_assemble::284661.787*35.35@241.898GB] ==> corrected reads for round 3
[M::ha_assemble] # bases: 79186730278; # corrected bases: 235080; # recorrected bases: 12650
[M::ha_assemble] size of buffer: 60.546GB
[M::yak_count] collected 2137320573 minimizers
[M::ha_pt_gen::285038.830*35.33] ==> counted 18769062 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[9] = 2315
[M::ha_analyze_count] highest: count[83] = 417609
## fifth k-mer plot
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 83; peak_het: -1
[M::ha_ct_shrink::285039.075*35.33] ==> counted 17036513 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137320573 minimizers
[M::ha_pt_gen::285482.798*35.31] ==> indexed 2135588024 positions, counted 17036513 distinct minimizer k-mers
[M::ha_assemble::297514.365*35.34@246.283GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 1183659490
[M::ha_print_ovlp_stat] # strong overlaps: 596745657
[M::ha_print_ovlp_stat] # weak overlaps: 586913833
[M::ha_print_ovlp_stat] # exact overlaps: 1149035029
[M::ha_print_ovlp_stat] # inexact overlaps: 34624461
[M::ha_print_ovlp_stat] # overlaps without large indels: 1180991551
[M::ha_print_ovlp_stat] # reverse overlaps: 410771757
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
[M::purge_dups] homozygous read coverage threshold: 168
[M::purge_dups] purge duplication coverage threshold: 210
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] homozygous read coverage threshold: 168
[M::purge_dups] purge duplication coverage threshold: 210
[M::mc_solve_core::0.284] ==> Partition
[M::adjust_utg_by_primary] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.bp.p_ctg.gfa to disk...
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.bp.hap1.p_ctg.gfa to disk...
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.bp.hap2.p_ctg.gfa to disk...
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.16.1-r375
[M::main] CMD: hifiasm -o apul.hifiasm -t 36 m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq
[M::main] Real time: 304623.035 sec; CPU: 10520661.634 sec; Peak RSS: 246.283 GB
Okay so 4 rounds of assembly were done, hypothetically improving the assembly each time. Weirdly, the first 2 rounds had the heterozygous peak at 68, but the last couple of rounds had the heterozygous peak at -1. In all k-mer plots, there is a high peak at the 1 location, but this doesn’t seem unusual (based on the example posts here and here provided by the hifiasm log interpretation section). Overall, I think this file is providing information on the assembly iterations.
Given all of this information, let’s now run BUSCO to assess completeness of assembly. In the scripts folder: nano busco_unfilt_hifiasm.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Convert from gfa to fasta for downstream use" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.bp.p_ctg.gfa > apul.hifiasm.bp.p_ctg.fa
echo "Begin busco on unfiltered hifiasm-assembled fasta" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.bp.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.busco.canu -m genome
echo "busco complete for unfiltered hifiasm-assembled fasta" $(date)
Submitted batch job 305426. I could also run some analysis on the hap assemblies, but I’m going to wait until I am done with the filtering and re-assembly, as that assembly is the one that I will likely be moving forward with. How are they making the distinction between hap1/father and hap2/mother?
Busco ran in ~45 mins and results look a lot better than they did with canu! This is the primary result from the output file:
2024-03-11 13:37:21 INFO:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:93.3%[S:92.0%,D:1.3%],F:3.1%,M:3.6%,n:954 |
|890 Complete BUSCOs (C) |
|878 Complete and single-copy BUSCOs (S) |
|12 Complete and duplicated BUSCOs (D) |
|30 Fragmented BUSCOs (F) |
|34 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
This looks so much better than the canu assembly! The previous canu assembly was 94.4% complete, but had only 9.4% single copy BUSCOs and 85% duplicated BUSCOs. Ideally, the duplication level should be lower. The hifiasm assembly had 93.3% completeness, 92% single copy BUSCOs, and 1.3% duplicated BUSCOs. Hifiasm is definitely the way to go for assembly. Now I just have to wait until the blast prok is done so I can remove any contamination. After I remove contamination, I will re-assemble using hifiasm.
20240313
Blast prok failed after 2 days and then restarted on Andromeda. Maybe I should stop the code after 2 days…idk. Maybe I need to increase the memory for the job? Canceling the job (305351
) and increasing the memory (#SBATCH --mem=500GB
). Submitted batch job 308997
20240316
Copied the prok data to my local computer. It is still running on the server, but I’m nervous it will restart again. If it does restart, I’ll cancel the job and just use this data from today. On my local computer, I combined the prok and viral blast results and then removed any hits whose bit score was <1000.
cd /Users/jillashey/Desktop/PutnamLab/Apulchra_genome
cat viral_contaminant_hits_rr.txt prok_contaminant_hits_rr.txt > all_contaminant_hits_rr.txt
awk '$12 > 1000 {print $0}' all_contaminant_hits_rr.txt > contaminant_hits_pv_passfilter_rr.txt
I then looked at the contamination hits in R. See code here.
As a summary, I first read in the eukaryotic blast hits that passed the contamination threshold. I found that only 2 reads had any euk contamination (m84100_240128_024355_s2/48759857/ccs
and m84100_240128_024355_s2/234751852/ccs
). I then read in the prokaryotic and viral blast hits that passed the threshold (only 224 blast hits passed). I calculated the percentage of each hits align length to the contigs so if there was a result that had 100%, that would mean that the whole contig was a contaminant. I looked at a histogram of the % alignments and found that most of the % alignments are on the lower size and there are not many 100% sequences. I summarized the contigs that were to be filtered out and found that 222 contigs had some level of pv contamination. I added the euk + pv contamination reads together (224 total) and calculated the proportion of contamination to raw reads. The contamination ended up being only 0.003797649% of the raw reads, which is pretty amazing! I calculated the mean length of the filtered reads (13,242.1 bp) and the sums of the unfiltered read length (79183709778 total bp) and filtered read length (79181142809 total bp). Using these lengths, I calculated a rough estimation of sequencing depth and found we have roughly 100x coverage! That’s similar to the Young et al. 2024 results as well. Finally, I wrote the list of filtered reads to a text file on my local computer. This information will be used to filter the raw reads on Andromeda prior to assembly. I’m impressed with the low contamination and the high coverage of the PacBio HiFi reads.
20240319
Prok blast script finally finished running! Took about 5 days. Cat the prok and viral results together and remove anything that has a bit score <1000.
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
cat viral_contaminant_hits_rr.txt prok_contaminant_hits_rr.txt > all_contaminant_hits_rr.txt
awk '$12 > 1000 {print $0}' all_contaminant_hits_rr.txt > contaminant_hits_pv_passfilter_rr.txt
Similarly to what I did above, I looked at the contamination hits in R. See code here.
As a summary, I first read in the eukaryotic blast hits that passed the contamination threshold. I found that only 2 reads had any euk contamination (m84100_240128_024355_s2/48759857/ccs
and m84100_240128_024355_s2/234751852/ccs
). I then read in the prokaryotic and viral blast hits that passed the threshold (only 2 blast hits passed). I calculated the percentage of each hits align length to the contigs so if there was a result that had 100%, that would mean that the whole contig was a contaminant. I looked at a histogram of the % alignments and found that most of the % alignments are on the lower size and there are not many 100% sequences. I summarized the contigs that were to be filtered out and found that 494 contigs had some level of pv contamination. I added the euk + pv contamination reads together (496 contigs total) and calculated the proportion of contamination to raw reads. The contamination ended up being only 0.00840908% of the raw reads, which is pretty amazing! I calculated the mean length of the filtered reads (13,424.84 bp) and the sums of the unfiltered read length (79183709778 total bp) and filtered read length (79178244314 total bp). Using these lengths, I calculated a rough estimation of sequencing depth and found we have roughly 100x coverage! That’s similar to the Young et al. 2024 results as well. Finally, I wrote the list of filtered reads to a text file on my local computer. This information will be used to filter the raw reads on Andromeda prior to assembly. I’m impressed with the low contamination and the high coverage of the PacBio HiFi reads.
20240320
Before filtered the hifi reads, I’m going to clean up the /data/putnamlab/jillashey/Apul_Genome/assembly/data
folder so that I have more memory for the next steps. Here’s whats in there right now:
total 312G
-rw-r--r--. 1 jillashey 148G Feb 8 02:37 m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq
-rwxr-xr-x. 1 jillashey 1.1K Feb 8 14:13 apul.seqStore.sh
-rw-r--r--. 1 jillashey 951 Feb 8 14:31 apul.seqStore.err
drwxr-xr-x. 3 jillashey 4.0K Feb 8 14:32 apul.seqStore
-rw-r--r--. 1 jillashey 23K Feb 9 01:47 apul.report
-rw-r--r--. 1 jillashey 7.0M Feb 9 01:51 apul.contigs.layout.tigInfo
-rw-r--r--. 1 jillashey 155M Feb 9 01:51 apul.contigs.layout.readToTig
-rw-r--r--. 1 jillashey 2.8G Feb 9 01:57 apul.unassembled.fasta
-rw-r--r--. 1 jillashey 943M Feb 9 02:01 apul.contigs.fasta
drwxr-xr-x. 2 jillashey 4.0K Feb 13 14:01 busco_output
drwxr-xr-x. 3 jillashey 4.0K Feb 13 14:01 busco_downloads
-rw-r--r--. 1 jillashey 7.2K Feb 13 14:02 busco_96228.log
lrwxrwxrwx. 1 jillashey 108 Feb 20 14:08 m84100_240128_024355_s2.hifi_reads.bc1029.bam -> /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam
-rw-r--r--. 1 jillashey 106 Feb 21 16:48 pb.fofn
drwxr-xr-x. 9 jillashey 4.0K Feb 21 16:53 unitigging
-rw-r--r--. 1 jillashey 74G Mar 1 16:08 m84100_240128_024355_s2.hifi_reads.bc1029.fasta
-rw-r--r--. 1 jillashey 244M Mar 1 16:36 rr_read_lengths.txt
-rw-r--r--. 1 jillashey 106K Mar 3 16:27 contaminant_hits_euks_rr.txt
-rw-r--r--. 1 jillashey 2.4K Mar 3 16:29 contaminants_pass_filter_euks_rr.txt
-rw-r--r--. 1 jillashey 69M Mar 5 23:20 viral_contaminant_hits_rr.txt
-rw-r--r--. 1 jillashey 19G Mar 7 23:58 apul.hifiasm.ec.bin
-rw-r--r--. 1 jillashey 47G Mar 8 00:08 apul.hifiasm.ovlp.source.bin
-rw-r--r--. 1 jillashey 17G Mar 8 00:12 apul.hifiasm.ovlp.reverse.bin
-rw-r--r--. 1 jillashey 1.2G Mar 8 01:37 apul.hifiasm.bp.r_utg.gfa
-rw-r--r--. 1 jillashey 21M Mar 8 01:37 apul.hifiasm.bp.r_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.6M Mar 8 01:41 apul.hifiasm.bp.r_utg.lowQ.bed
-rw-r--r--. 1 jillashey 1.1G Mar 8 01:42 apul.hifiasm.bp.p_utg.gfa
-rw-r--r--. 1 jillashey 21M Mar 8 01:42 apul.hifiasm.bp.p_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.2M Mar 8 01:46 apul.hifiasm.bp.p_utg.lowQ.bed
-rw-r--r--. 1 jillashey 506M Mar 8 01:47 apul.hifiasm.bp.p_ctg.gfa
-rw-r--r--. 1 jillashey 11M Mar 8 01:47 apul.hifiasm.bp.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar 8 01:49 apul.hifiasm.bp.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 469M Mar 8 01:50 apul.hifiasm.bp.hap1.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar 8 01:50 apul.hifiasm.bp.hap1.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar 8 01:52 apul.hifiasm.bp.hap1.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 468M Mar 8 01:52 apul.hifiasm.bp.hap2.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar 8 01:52 apul.hifiasm.bp.hap2.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 1.9M Mar 8 01:54 apul.hifiasm.bp.hap2.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 495M Mar 11 12:52 apul.hifiasm.bp.p_ctg.fa
-rw-r--r--. 1 jillashey 246M Mar 19 14:05 prok_contaminant_hits_rr.txt
-rw-r--r--. 1 jillashey 315M Mar 19 16:08 all_contaminant_hits_rr.txt
-rw-r--r--. 1 jillashey 1.1M Mar 19 16:09 contaminant_hits_pv_passfilter_rr.txt
I removed the following:
rm -r apul.seqStore
rm apul*
rm -r busco*
rm pb.fofn
rm -r unitigging/
Now I have more space. Copy the file all_contam_rem_good_hifi_read_list.txt
that was generated from the R code mentioned above. This specific file was written starting on line 242. It contains the reads that have passed contamination filtering. I copied this file into /data/putnamlab/jillashey/Apul_Genome/assembly/data
.
wc -l all_contam_rem_good_hifi_read_list.txt
5897892 all_contam_rem_good_hifi_read_list.txt
The vast majority of the hifi reads are retained after contamination filtering, which is a good sign of high quality sequencing. My next step is to subset the raw hifi fasta file to remove the contaminants identified above. I can do this with the seqtk subseq command. In the scripts folder: nano subseq.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load seqtk/1.3-GCC-9.3.0
echo "Subsetting hifi reads that passed contamination filtering" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
seqtk subseq m84100_240128_024355_s2.hifi_reads.bc1029.fasta all_contam_rem_good_hifi_read_list.txt > hifi_rr_allcontam_rem.fasta
echo "Subsetting complete!" $(date)
Submitted batch job 309659. Finished in about 8 mins but the output file looks like this:
>m84100_240128_024355_s2/261887593/ccs:18224-18224
A
>m84100_240128_024355_s2/255530003/ccs:21870-21870
A
>m84100_240128_024355_s2/249237028/ccs:23691-23691
A
>m84100_240128_024355_s2/262606536/ccs:14772-14772
A
>m84100_240128_024355_s2/217322854/ccs:12923-12923
A
>m84100_240128_024355_s2/256512826/ccs:12914-12914
A
>m84100_240128_024355_s2/245632166/ccs:28440-28440
A
>m84100_240128_024355_s2/250548903/ccs:23076-23076
A
>m84100_240128_024355_s2/256054930/ccs:15405-15405
A
>m84100_240128_024355_s2/241242930/ccs:15521-15521
A
>m84100_240128_024355_s2/254348773/ccs:14578-14578
C
>m84100_240128_024355_s2/252319399/ccs:12407-12407
A
>m84100_240128_024355_s2/229183717/ccs:5757-5757
Not ideal. It looks like it only took the first letter from each sequence. Maybe I need to remove the length information from the all_contam_rem_good_hifi_read_list.txt
file?
awk '{$2=""; print $0}' all_contam_rem_good_hifi_read_list.txt > output_file.txt
Edit the script so that the list of reads to keep is output_file.txt
and decrease mem to 250GB. Submitted batch job 309672. This appears to have worked!
zgrep -c ">" hifi_rr_allcontam_rem.fasta
5897892
Following Young et al. 2024, I need to use Merqury and GenomeScope2 analysis of the cleaned reads. Merqury and Meryl seem to be related somehow, but not sure. Meryl is a tool for counting and working with sets of k-mers. It is a part of Canu as well. So it seems like Meryl counts the k-mers and Merqury estimates accuracy and completeness? Young et al. 2024 used only meryl here to generate a kmer database, which was then used as input to genomescope2. Merqury was used after hifiasm assembly.
First, the best value for k needs to be determined for use in Meryl. This can be done with the best_k.sh
script from Merqury. I’m going to copy this script into my own scripts folder. Young et al. 2024 used 500mb estimated genome size. This is what I will use too, as previous coral/Acropora genomes are about this size. I will need to install Meryl and Merqury first. I’m following the Meryl and Merqury githubs for installation instructions.
For Meryl:
wget https://github.com/marbl/meryl/releases/download/v1.4.1/meryl-1.4.1.Linux-amd64.tar.xz
tar -xJf meryl-1.4.1.Linux-amd64.tar.xz
export PATH=/data/putnamlab/jillashey/Apul_Genome/assembly/meryl-1.4.1/bin:$PATH
Now that I have exported the PATH variable to include the directory where the Meryl executable files are, I can run Meryl commands (I think), regardless of working directory. For now, lets run meryl with k=18. The k-mer database will be generated using k=18 and the cleaned reads. In the scripts folder: nano meryl.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
export PATH=/data/putnamlab/jillashey/Apul_Genome/assembly/meryl-1.4.1/bin:$PATH
echo "Creating meryl k-mer db" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
meryl k=18 \
count hifi_rr_allcontam_rem.fasta \
output meryl_merc
echo "Meryl k-mer db complete!" $(date)
Submitted batch job 309682. While this runs, lets try to install Merqury.
git clone https://github.com/marbl/merqury.git
cd merqury
export MERQURY=$PWD:$PATH
Hypothetically, Merqury is now installed. Perhaps now we can run the best k script? I am not sure what genome size estimate to use. Shinzato et al. 2020 assessed several Acropora genomes and found the genome size ranged from 384 (Acropora microphthalma) to 447 (Acropora hyacinthus) Mb. I think I will go with 450 Mb. Let’s try to run the best k script.
$MERQURY/best_k.sh 450000000
Giving me this error:
-bash: /data/putnamlab/jillashey/Apul_Genome/assembly/merqury:/data/putnamlab/jillashey/Apul_Genome/assembly/meryl-1.4.1/bin:/path/to/meryl-1.4.1/bin:/path/to/meryl/…/bin:/opt/software/Miniconda3/4.9.2/bin:/opt/software/Miniconda3/4.9.2/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/jillashey/.local/bin:/home/jillashey/bin/best_k.sh: No such file or directory
I’m not sure what it means or how to fix it. On the Merqury github, it lists dependencies that Merqury may need to run:
- gcc 10.2.0 or higher (for installing Meryl)
- Meryl v1.4.1
- Java run time environment (JRE)
- R with argparse, ggplot2, and scales (recommend R 4.0.3+)
- bedtools
- samtools
Maybe I need to load these? Idk I am confused. I may just continue with the initial assembly…because I don’t really see the importance of this step. In Young et al. 2024, it says “the kmer profile of cleaned raw HiFi reads was generated with Meryl [34], and used for genome profiling with GenomeScope2 [69] to estimate genome size, repetitiveness, heterozygosity, and ploidy.” I will come back to this. I may need to email Kevin Bryan to install meryl, merqury and genomescope2.
While I wait for his response, I will start running the initial hifiasm assembly with default flags. I have a script for hifiasm on the server, but I’m going to make a new one for the initial assembly. In the scripts folder: nano initial_hifiasm.sh
#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
conda activate /data/putnamlab/conda/hifiasm
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Starting assembly with hifiasm" $(date)
hifiasm -o apul.hifiasm.intial hifi_rr_allcontam_rem.fasta -t 36 2> apul_hifiasm_allcontam_rem_initial.asm.log
echo "Assembly with hifiasm complete!" $(date)
conda deactivate
Submitted batch job 309689
20240325
Initial assembly ran in about 3 days. These are the files that were generated:
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
-rw-r--r--. 1 jillashey 19G Mar 24 02:10 apul.hifiasm.intial.ec.bin
-rw-r--r--. 1 jillashey 47G Mar 24 02:20 apul.hifiasm.intial.ovlp.source.bin
-rw-r--r--. 1 jillashey 17G Mar 24 02:23 apul.hifiasm.intial.ovlp.reverse.bin
-rw-r--r--. 1 jillashey 1.2G Mar 24 03:40 apul.hifiasm.intial.bp.r_utg.gfa
-rw-r--r--. 1 jillashey 21M Mar 24 03:40 apul.hifiasm.intial.bp.r_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.6M Mar 24 03:44 apul.hifiasm.intial.bp.r_utg.lowQ.bed
-rw-r--r--. 1 jillashey 1.1G Mar 24 03:45 apul.hifiasm.intial.bp.p_utg.gfa
-rw-r--r--. 1 jillashey 21M Mar 24 03:45 apul.hifiasm.intial.bp.p_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.2M Mar 24 03:49 apul.hifiasm.intial.bp.p_utg.lowQ.bed
-rw-r--r--. 1 jillashey 506M Mar 24 03:50 apul.hifiasm.intial.bp.p_ctg.gfa
-rw-r--r--. 1 jillashey 11M Mar 24 03:50 apul.hifiasm.intial.bp.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar 24 03:52 apul.hifiasm.intial.bp.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 469M Mar 24 03:52 apul.hifiasm.intial.bp.hap1.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar 24 03:52 apul.hifiasm.intial.bp.hap1.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar 24 03:54 apul.hifiasm.intial.bp.hap1.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 468M Mar 24 03:55 apul.hifiasm.intial.bp.hap2.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar 24 03:55 apul.hifiasm.intial.bp.hap2.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 1.9M Mar 24 03:56 apul.hifiasm.intial.bp.hap2.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 80K Mar 24 03:57 apul_hifiasm_allcontam_rem_initial.asm.log
Many files. The output file description can be found above and also on the hifiasm output website. The log output file contains the k-mer histogram (similar to what is posted above), which shows two peaks, indicative of a heterozygous genome assembly. This is what the bottom of the log file looks like:
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137171007 minimizers
[M::ha_pt_gen::281249.073*35.34] ==> indexed 2135566893 positions, counted 17030417 distinct minimizer k-mers
[M::ha_assemble::292970.825*35.36@202.994GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 1183659340
[M::ha_print_ovlp_stat] # strong overlaps: 596745652
[M::ha_print_ovlp_stat] # weak overlaps: 586913688
[M::ha_print_ovlp_stat] # exact overlaps: 1149035007
[M::ha_print_ovlp_stat] # inexact overlaps: 34624333
[M::ha_print_ovlp_stat] # overlaps without large indels: 1180991403
[M::ha_print_ovlp_stat] # reverse overlaps: 410771728
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
[M::purge_dups] homozygous read coverage threshold: 168
[M::purge_dups] purge duplication coverage threshold: 210
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] homozygous read coverage threshold: 168
[M::purge_dups] purge duplication coverage threshold: 210
[M::mc_solve_core::0.308] ==> Partition
[M::adjust_utg_by_primary] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.intial.bp.p_ctg.gfa to disk...
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.intial.bp.hap1.p_ctg.gfa to disk...
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.intial.bp.hap2.p_ctg.gfa to disk...
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.16.1-r375
[M::main] CMD: hifiasm -o apul.hifiasm.intial -t 36 hifi_rr_allcontam_rem.fasta
[M::main] Real time: 299500.413 sec; CPU: 10366612.139 sec; Peak RSS: 202.994 GB
Does the M::purge_dups
mean that this is the value that I should be using for the purge_dups
flag in hifiasm? It gives me two lines for M::purge_dups
: homozygous read coverage threshold: 168
and purge duplication coverage threshold: 210
. I may need to talk with Ross and Hollie more about this. I’m going to QC the assembly and the haplotype assemblies with busco and quast. In the scripts folder: nano initial_qc.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Convert from gfa to fasta for downstream use" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.intial.bp.p_ctg.gfa > apul.hifiasm.intial.bp.p_ctg.fa
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.intial.bp.hap1.p_ctg.gfa > apul.hifiasm.intial.bp.hap1.p_ctg.fa
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.intial.bp.hap2.p_ctg.gfa > apul.hifiasm.intial.bp.hap2.p_ctg.fa
echo "Begin busco on filtered hifiasm-assembled fasta (initial run)" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.intial.bp.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.initial.busco -m genome
echo "busco complete for unfiltered hifiasm-assembled fasta (initial run)" $(date)
echo "Begin busco on hifiasm-assembled haplotype 1 fasta" $(date)
# Reset query
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.intial.bp.hap1.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.initial.hap1.busco -m genome
echo "busco complete for hifiasm-assembled haplotype 1 fasta (initial run)" $(date)
echo "Begin busco on hifiasm-assembled haplotype 2 fasta" $(date)
# Reset query
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.intial.bp.hap2.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.initial.hap2.busco -m genome
echo "busco complete for hifiasm-assembled haplotype 2 fasta (initial run)" $(date)
echo "busco complete all assemblies of interest (initial run)" $(date)
echo "Begin quast of primary and haplotypes (initial run)" $(date)
module load QUAST/5.2.0-foss-2021b
# there is another version of quast if the one above does not work: QUAST/5.0.2-foss-2020b-Python-2.7.18
quast -t 15 --eukaryote \
apul.hifiasm.intial.bp.p_ctg.fa \
apul.hifiasm.intial.bp.hap1.p_ctg.fa \
apul.hifiasm.intial.bp.hap2.p_ctg.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast
echo "Quast complete (initial run); all QC complete!" $(date)
Submitted batch job 309969. Ran in ~2 hours. Busco ran but only for the primariy assembly. For some reason, the hap1 and hap2 busco did not run. Quast appears to have failed with these errors:
foss/2021b(24):ERROR:150: Module 'foss/2021b' conflicts with the currently loaded module(s) 'foss/2020b'
foss/2021b(24):ERROR:102: Tcl command execution failed: conflict foss
Python/3.9.6-GCCcore-11.2.0(61):ERROR:150: Module 'Python/3.9.6-GCCcore-11.2.0' conflicts with the currently loaded module(s) 'Python/3.8.6-GCCcore-10.2.0'
Python/3.9.6-GCCcore-11.2.0(61):ERROR:102: Tcl command execution failed: conflict Python
Perl/5.34.0-GCCcore-11.2.0(133):ERROR:150: Module 'Perl/5.34.0-GCCcore-11.2.0' conflicts with the currently loaded module(s) 'Perl/5.32.0-GCCcore-10.2.0'
Perl/5.34.0-GCCcore-11.2.0(133):ERROR:102: Tcl command execution failed: conflict Perl
foss/2021b(24):ERROR:150: Module 'foss/2021b' conflicts with the currently loaded module(s) 'foss/2020b'
foss/2021b(24):ERROR:102: Tcl command execution failed: conflict foss
Python/3.9.6-GCCcore-11.2.0(61):ERROR:150: Module 'Python/3.9.6-GCCcore-11.2.0' conflicts with the currently loaded module(s) 'Python/3.8.6-GCCcore-10.2.0'
Python/3.9.6-GCCcore-11.2.0(61):ERROR:102: Tcl command execution failed: conflict Python
SciPy-bundle/2021.10-foss-2021b(30):ERROR:150: Module 'SciPy-bundle/2021.10-foss-2021b' conflicts with the currently loaded module(s) 'SciPy-bundle/2020.11-foss-2020b'
SciPy-bundle/2021.10-foss-2021b(30):ERROR:102: Tcl command execution failed: conflict SciPy-bundle
GCCcore/11.2.0(24):ERROR:150: Module 'GCCcore/11.2.0' conflicts with the currently loaded module(s) 'GCCcore/10.2.0'
GCCcore/11.2.0(24):ERROR:102: Tcl command execution failed: conflict GCCcore
Basically just a lot of conflicting modules. Below are the results for the Busco code for the primary assembly:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:93.3%[S:92.0%,D:1.3%],F:3.1%,M:3.6%,n:954 |
|890 Complete BUSCOs (C) |
|878 Complete and single-copy BUSCOs (S) |
|12 Complete and duplicated BUSCOs (D) |
|30 Fragmented BUSCOs (F) |
|34 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Going to edit the initial_qc.sh
script so that I am including all things for busco to run properly. I’m also commenting out the lines that ran successfully and switching the module to QUAST/5.0.2-foss-2020b-Python-2.7.18
. Submitted batch job 309984. It ran, but only the hap1 busco scores were generated, not hap2. Below are the results for the hap1 assembly:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:93.3%[S:92.8%,D:0.5%],F:3.0%,M:3.7%,n:954 |
|890 Complete BUSCOs (C) |
|885 Complete and single-copy BUSCOs (S) |
|5 Complete and duplicated BUSCOs (D) |
|29 Fragmented BUSCOs (F) |
|35 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Quast also failed again with this error:
Python/2.7.18-GCCcore-10.2.0(58):ERROR:150: Module 'Python/2.7.18-GCCcore-10.2.0' conflicts with the currently loaded module(s) 'Python/3.8.6-GCCcore-10.2.0'
Python/2.7.18-GCCcore-10.2.0(58):ERROR:102: Tcl command execution failed: conflict Python
Python/2.7.18-GCCcore-10.2.0(58):ERROR:150: Module 'Python/2.7.18-GCCcore-10.2.0' conflicts with the currently loaded module(s) 'Python/3.8.6-GCCcore-10.2.0'
Python/2.7.18-GCCcore-10.2.0(58):ERROR:102: Tcl command execution failed: conflict Python
I’m going to add module purge
and module load Python/2.7.18-GCCcore-10.2.0
prior to loading quast. I’m also going to comment out lines that have already run successfully. Submitted batch job 310034. Ran in ~45 mins. Below are the results for the hap2 assembly:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:94.0%[S:92.7%,D:1.3%],F:2.9%,M:3.1%,n:954 |
|896 Complete BUSCOs (C) |
|884 Complete and single-copy BUSCOs (S) |
|12 Complete and duplicated BUSCOs (D) |
|28 Fragmented BUSCOs (F) |
|30 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Once again, quast failed to run and I got this error:
ERROR! File not found (contigs): apul.hifiasm.intial.bp.p_ctg.fa
In case you have troubles running QUAST, you can write to quast.support@cab.spbu.ru
or report an issue on our GitHub repository https://github.com/ablab/quast/issues
Please provide us with quast.log file from the output directory.
I’m going to write quast its own script. Quast can be run with or without a reference genome. I’m going to try both, using Amillepora as the reference genome. In the scripts folder: nano initial_quast.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
module purge
module load Python/2.7.18-GCCcore-10.2.0
module load QUAST/5.0.2-foss-2020b-Python-2.7.18
# previously used QUAST/5.2.0-foss-2021b but it failed and produced module conflict errors
echo "Begin quast of primary and haplotypes (initial run) w/ reference" $(date)
quast -t 10 --eukaryote \
apul.hifiasm.intial.bp.p_ctg.fa \
apul.hifiasm.intial.bp.hap1.p_ctg.fa \
apul.hifiasm.intial.bp.hap2.p_ctg.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast
echo "Quast complete (initial run); all QC complete!" $(date)
Submitted batch job 310038. I need to read more about quast command line options, as there seem to be a lot of options. Also need to look into including a reference vs not including a reference. Ran super fast but success! Quast is super informative!!!!! The most useful information is in the report.*
files. This is from the report.txt
file:
All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).
Assembly apul.hifiasm.intial.bp.p_ctg apul.hifiasm.intial.bp.hap1.p_ctg apul.hifiasm.intial.bp.hap2.p_ctg Amil.v2.01.chrs
# contigs (>= 0 bp) 188 275 162 854
# contigs (>= 1000 bp) 188 275 162 851
# contigs (>= 5000 bp) 188 273 162 748
# contigs (>= 10000 bp) 186 271 162 672
# contigs (>= 25000 bp) 166 246 153 545
# contigs (>= 50000 bp) 98 163 124 445
Total length (>= 0 bp) 518528298 481372407 480341213 475381253
Total length (>= 1000 bp) 518528298 481372407 480341213 475378544
Total length (>= 5000 bp) 518528298 481363561 480341213 475052084
Total length (>= 10000 bp) 518514885 481349140 480341213 474498957
Total length (>= 25000 bp) 518188097 480901871 480181619 472383091
Total length (>= 50000 bp) 515726224 477880931 479141568 468867721
# contigs 188 275 162 854
Largest contig 45111900 21532546 22038975 39361238
Total length 518528298 481372407 480341213 475381253
GC (%) 39.05 39.03 39.04 39.06
N50 16268372 12353884 13054353 19840543
N75 13007972 7901416 8791894 1469964
L50 11 15 15 9
L75 20 28 26 23
# N's per 100 kbp 0.00 0.00 0.00 7.79
Look at all of that info! The initial assembly appears to be the best in terms of all the stats. The total length is longer than the Amillepora and the haplotype assemblies. Additionally, it has the largest contig. The Amillepora genome has a higher N50 but the N50 for the primary assembly still looks good. There were 188 contigs generated in the primary assembly. Haplotype 1 assembly had more contigs (275), while haplotype 2 had less (162). Amillepora has 854 contigs which is so high! The primary assembly contig number is much lower than Atenuis (614), Adigitifera (955), or Amillepora (854). Quast, you have converted me <3. My next steps are to play with the -s
flag in hifiasm to determine the threshold at which duplicate haplotigs should be purged. The default is 0.55 and Young et al. (2024) ran it with a range of values (0.55, 0.50, 0.45, 0.40, 0.35, 0.30). They found that all worked well to resolve haplotypes, so they stuck with the default of 0.55. They also used the --primary
flag, which outputs a primary and alternate assembly as opposed to the primary, hap1 and hap2 assemblies. In their code, they justified this by saying “running hifiasm using the primary flag as we have no real way of knowing if the haplotypes produced are real or not” (line 826).
Starting a run where -s
is 0.3 and 0.8. In the /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
folder, nano s30_hifiasm.sh
:
#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
conda activate /data/putnamlab/conda/hifiasm
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Starting assembly with hifiasm" $(date)
hifiasm -o apul.hifiasm.s30 hifi_rr_allcontam_rem.fasta -s 0.3 -t 36 2> apul_hifiasm_allcontam_rem_s30.asm.log
echo "Assembly with hifiasm complete!" $(date)
conda deactivate
Submitted batch job 310048. In the scripts folder, nano s80_hifiasm.sh
:
#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
conda activate /data/putnamlab/conda/hifiasm
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Starting assembly with hifiasm" $(date)
hifiasm -o apul.hifiasm.s80 hifi_rr_allcontam_rem.fasta -s 0.80 -t 36 2> apul_hifiasm_allcontam_rem_s80.asm.log
echo "Assembly with hifiasm complete!" $(date)
conda deactivate
Submitted batch job 310049
20240329
The s30 script finished running in about 3 days. Now I’m going to run busco and quast for QC on these assemblies. In the scripts folder: nano s30_primary_busco.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Convert from gfa to fasta for downstream use" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s30.bp.p_ctg.gfa > apul.hifiasm.s30.bp.p_ctg.fa
echo "Begin busco on hifiasm-assembled primary fasta with -s 0.30" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s30.bp.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s30.primary.busco -m genome
echo "busco complete for hifiasm-assembled primary fasta with -s 0.30" $(date)
Submitted batch job 310241. In the scripts folder: nano s30_hap1_busco.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Convert from gfa to fasta for downstream use" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s30.bp.hap1.p_ctg.gfa > apul.hifiasm.s30.bp.hap1.p_ctg.fa
echo "Begin busco on hifiasm-assembled fasta hap1 with -s 0.30" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s30.bp.hap1.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s30.hap1.busco -m genome
echo "busco complete for hifiasm-assembled fasta hap1 with -s 0.30" $(date)
Submitted batch job 310242. In the scripts folder: nano s30_hap2_busco.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Convert from gfa to fasta for downstream use" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s30.bp.hap2.p_ctg.gfa > apul.hifiasm.s30.bp.hap2.p_ctg.fa
echo "Begin busco on hifiasm-assembled fasta hap2 with -s 0.30" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s30.bp.hap2.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s30.hap2.busco -m genome
echo "busco complete for hifiasm-assembled fasta hap2 with -s 0.30" $(date)
Submitted batch job 310243. In the scripts folder: nano s30_quast.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
module purge
module load Python/2.7.18-GCCcore-10.2.0
module load QUAST/5.0.2-foss-2020b-Python-2.7.18
# previously used QUAST/5.2.0-foss-2021b but it failed and produced module conflict errors
echo "Begin quast of primary and haplotypes (s30 run) w/ reference" $(date)
quast -t 10 --eukaryote \
apul.hifiasm.s30.bp.p_ctg.fa \
apul.hifiasm.s30.bp.hap1.p_ctg.fa \
apul.hifiasm.s30.bp.hap2.p_ctg.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/s30
echo "Quast complete (s30 run); all QC complete!" $(date)
Submitted batch job 310244. So many jobs! The primary busco finished in about an hour. Let’s look at the results.
Busco for primary assembly:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:93.3%[S:92.9%,D:0.4%],F:3.1%,M:3.6%,n:954 |
|890 Complete BUSCOs (C) |
|886 Complete and single-copy BUSCOs (S) |
|4 Complete and duplicated BUSCOs (D) |
|30 Fragmented BUSCOs (F) |
|34 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Pretty similar to the initial primary assembly, which was the same in completeness (93.3%) but slightly lower in single copy buscos (92% in initial vs 92.9% in s30).
Busco for the hap1 assembly:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:93.4%[S:92.9%,D:0.5%],F:3.1%,M:3.5%,n:954 |
|891 Complete BUSCOs (C) |
|886 Complete and single-copy BUSCOs (S) |
|5 Complete and duplicated BUSCOs (D) |
|30 Fragmented BUSCOs (F) |
|33 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Almost identical to primary assembly with this flag. Also pretty similar to the initial hap1 assembly.
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:93.9%[S:93.6%,D:0.3%],F:2.8%,M:3.3%,n:954 |
|896 Complete BUSCOs (C) |
|893 Complete and single-copy BUSCOs (S) |
|3 Complete and duplicated BUSCOs (D) |
|27 Fragmented BUSCOs (F) |
|31 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Once again, pretty similar to the primary and hap1 assemblies, as well as the initial hap2 assembly. So what do these results mean? That the assembly results aren’t really affected by the -s
flag? I will have to see what the s80 results look like.
20240401
The s80 script finished a couple of days ago. Now time to assess completeness and what not. In the scripts folder: nano s80_primary_busco.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Convert from gfa to fasta for downstream use" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s80.bp.p_ctg.gfa > apul.hifiasm.s80.bp.p_ctg.fa
echo "Begin busco on hifiasm-assembled primary fasta with -s 0.80" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s80.bp.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s80.primary.busco -m genome
echo "busco complete for hifiasm-assembled primary fasta with -s 0.80" $(date)
Submitted batch job 310321. In the scripts folder: nano s80_hap1_busco.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Convert from gfa to fasta for downstream use" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s80.bp.hap1.p_ctg.gfa > apul.hifiasm.s80.bp.hap1.p_ctg.fa
echo "Begin busco on hifiasm-assembled fasta hap1 with -s 0.80" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s80.bp.hap1.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s80.hap1.busco -m genome
echo "busco complete for hifiasm-assembled fasta hap1 with -s 0.80" $(date)
Submitted batch job 310322. In the scripts folder: nano s80_hap2_busco.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
echo "Convert from gfa to fasta for downstream use" $(date)
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/
awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s80.bp.hap2.p_ctg.gfa > apul.hifiasm.s80.bp.hap2.p_ctg.fa
echo "Begin busco on hifiasm-assembled fasta hap2 with -s 0.80" $(date)
labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s80.bp.hap2.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"
source "${busco_shared}/scripts/busco_init.sh" # sets up the modules required for this in the right order
# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s80.hap2.busco -m genome
echo "busco complete for hifiasm-assembled fasta hap2 with -s 0.80" $(date)
Submitted batch job 310323
Talked with Hollie last week about possibly assembly the mitochondrial genome for Apul. There are some Apul mito sequences on NCBI, which I’m going to pull and blast against the pacbio raw reads. If there are any hits, I will know that mito sequences are present in the data and its possible to do a mito assembly. There are 16 putative mito sequences for Apul in the NCBI link above. I’m going to pull those sequences and blast them. Make mito folders:
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
mkdir mito
cd mito
Using the putative Apul mito sequences from NCBI, make a fasta file with the 16 sequences. File is called mito_seqs_ncbi.fasta
. Similar to what I did with the viral, euk and prok sequences, blast the mito sequences against the raw reads. In the scripts folder: nano blastn_mito_seqs_ncbi.sh
#!/bin/bash
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module load BLAST+/2.13.0-gompi-2022a
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
echo "Build mito seq db" $(date)
makeblastdb -in /data/putnamlab/jillashey/Apul_Genome/assembly/data/mito/mito_seqs_ncbi.fasta -dbtype nucl -out /data/putnamlab/jillashey/Apul_Genome/assembly/data/mito/mito_seqs_ncbi_db
echo "Blasting hifi reads against viral genomes to look for contaminants" $(date)
blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db /data/putnamlab/jillashey/Apul_Genome/assembly/data/mito/mito_seqs_ncbi_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out mito_hits_rr.txt
echo "Blast complete!" $(date)
Submitted batch job 310324. Checking back a couple hours later. Let’s look at the mito results first:
wc -l mito_hits_rr.txt
12449 mito_hits_rr.txt
head mito_hits_rr.txt
m84100_240128_024355_s2/246617425/ccs NC_081454.1:14774-15130 94.175 103 6 0 6728 6830 308 206 1.49e-39 158
m84100_240128_024355_s2/246617425/ccs NC_081454.1:2434-16341 94.175 103 6 0 6728 6830 12648 12546 1.49e-39 158
m84100_240128_024355_s2/246485798/ccs NC_081454.1:14774-15130 94.175 103 6 0 4066 4168 308 206 8.50e-40 158
m84100_240128_024355_s2/246485798/ccs NC_081454.1:2434-16341 94.175 103 6 0 4066 4168 12648 12546 8.50e-40 158
m84100_240128_024355_s2/248320207/ccs NC_081454.1:14442-14741 92.079 202 14 2 19987 20187 67 267 3.13e-77 283
m84100_240128_024355_s2/248320207/ccs NC_081454.1:2434-16341 92.079 202 14 2 19987 20187 12075 12275 3.13e-77 283
m84100_240128_024355_s2/257166685/ccs NC_081454.1:14442-14741 93.035 201 14 0 16520 16720 67 267 1.18e-80 294
m84100_240128_024355_s2/257166685/ccs NC_081454.1:2434-16341 93.035 201 14 0 16520 16720 12075 12275 1.18e-80 294
m84100_240128_024355_s2/255267441/ccs NC_081454.1:14442-14741 92.537 201 14 1 14988 15187 67 267 1.86e-78 287
m84100_240128_024355_s2/255267441/ccs NC_081454.1:2434-16341 92.537 201 14 1 14988 15187 12075 12275 1.86e-78 287
I need to talk more with Hollie about the mito asssembly because I am still a little confused about this portion.
The busco information for the s80 assembly run also finished. Let’s look at the results!
Primary assembly:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:93.5%[S:89.8%,D:3.7%],F:3.2%,M:3.3%,n:954 |
|892 Complete BUSCOs (C) |
|857 Complete and single-copy BUSCOs (S) |
|35 Complete and duplicated BUSCOs (D) |
|31 Fragmented BUSCOs (F) |
|31 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Hap1 assembly:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:93.6%[S:91.4%,D:2.2%],F:3.1%,M:3.3%,n:954 |
|893 Complete BUSCOs (C) |
|872 Complete and single-copy BUSCOs (S) |
|21 Complete and duplicated BUSCOs (D) |
|30 Fragmented BUSCOs (F) |
|31 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Hap2 assembly:
--------------------------------------------------
|Results from dataset metazoa_odb10 |
--------------------------------------------------
|C:92.3%[S:91.0%,D:1.3%],F:2.9%,M:4.8%,n:954 |
|880 Complete BUSCOs (C) |
|868 Complete and single-copy BUSCOs (S) |
|12 Complete and duplicated BUSCOs (D) |
|28 Fragmented BUSCOs (F) |
|46 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
--------------------------------------------------
Assemblies look quite similar to one another and to the prior assemblies. 89.8% of single copy buscos for the primary assembly is the lowest of all of the assemblies. I’m now going to run quast with the initial, -s 0.30, -s 0.80, and the Amillepora assemblies to compare. In the scripts folder: nano test_quast.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
module purge
module load Python/2.7.18-GCCcore-10.2.0
module load QUAST/5.0.2-foss-2020b-Python-2.7.18
# previously used QUAST/5.2.0-foss-2021b but it failed and produced module conflict errors
echo "Begin quast of initial, s30, and s80 assemblies w/ reference" $(date)
quast -t 15 --eukaryote \
apul.hifiasm.intial.bp.p_ctg.fa \
apul.hifiasm.intial.bp.hap1.p_ctg.fa \
apul.hifiasm.intial.bp.hap2.p_ctg.fa \
apul.hifiasm.s30.bp.p_ctg.fa \
apul.hifiasm.s30.bp.hap1.p_ctg.fa \
apul.hifiasm.s30.bp.hap2.p_ctg.fa \
apul.hifiasm.s80.bp.p_ctg.fa \
apul.hifiasm.s80.bp.hap1.p_ctg.fa \
apul.hifiasm.s80.bp.hap2.p_ctg.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/s80
echo "Quast complete" $(date)
Submitted batch job 310352. Finished in about 5 mins. Here’s quast:
All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).
Assembly apul.hifiasm.intial.bp.p_ctg apul.hifiasm.intial.bp.hap1.p_ctg apul.hifiasm.intial.bp.hap2.p_ctg apul.hifiasm.s30.bp.p_ctg apul.hifiasm.s30.bp.hap1.p_ctg apul.hifiasm.s30.bp.hap2.p_ctg apul.hifiasm.s80.bp.p_ctg apul.hifiasm.s80.bp.hap1.p_ctg apul.hifiasm.s80.bp.hap2.p_ctg Amil.v2.01.chrs
# contigs (>= 0 bp) 188 275 162 180 247 167 206 258 189 854
# contigs (>= 1000 bp) 188 275 162 180 247 167 206 258 189 851
# contigs (>= 5000 bp) 188 273 162 180 247 167 206 256 189 748
# contigs (>= 10000 bp) 186 271 162 178 246 166 204 255 187 672
# contigs (>= 25000 bp) 166 246 153 158 219 161 187 235 178 545
# contigs (>= 50000 bp) 98 163 124 92 142 132 120 155 150 445
Total length (>= 0 bp) 518528298 481372407 480341213 504851641 484060404 461127429 558522339 509131880 465604880 475381253
Total length (>= 1000 bp) 518528298 481372407 480341213 504851641 484060404 461127429 558522339 509131880 465604880 475378544
Total length (>= 5000 bp) 518528298 481363561 480341213 504851641 484060404 461127429 558522339 509123034 465604880 475052084
Total length (>= 10000 bp) 518514885 481349140 480341213 504838228 484053588 461119824 558508926 509116218 465589833 474498957
Total length (>= 25000 bp) 518188097 480901871 480181619 504499732 483598713 461019312 558241588 508764673 465424962 472383091
Total length (>= 50000 bp) 515726224 477880931 479141568 502109496 480738474 460002421 555789785 505834173 464397174 468867721
# contigs 188 275 162 180 247 167 206 258 189 854
Largest contig 45111900 21532546 22038975 30476199 22329680 19744096 22038975 22153531 22038975 39361238
Total length 518528298 481372407 480341213 504851641 484060404 461127429 558522339 509131880 465604880 475381253
GC (%) 39.05 39.03 39.04 39.04 39.03 39.04 39.07 39.05 39.03 39.06
N50 16268372 12353884 13054353 16275225 13330421 14742043 14962207 11978068 12847727 19840543
N75 13007972 7901416 8791894 13021168 9796342 9573480 10779388 8114208 6210685 1469964
L50 11 15 15 13 14 14 16 16 14 9
L75 20 28 26 21 24 24 27 29 26 23
# N's per 100 kbp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.79
So much good info!!! The initial assembly still has the largest contig, but the s80 assembly has the longest total lengths. The Amillepora genome still has the best N50 value but the initial assembly also has a good N50. Overall, the initial assembly is the best out of these assemblies. The initial hap2 assembly has the lowest number of contigs (162). Out of the primary assemblies, the s30 assembly has the lowest number of contigs (180), while the initial assembly had 188 contigs and the s80 assembly had 206 contigs.
20240408
Let’s see how many rows have >1000 bit score
awk '{ if ($NF > 1000) count++ } END { print count }' mito_hits_rr.txt
7056
How many rows have a % match >85%?
awk '$3 > 85 {count++} END {print count}' mito_hits_rr.txt
12449
wc -l mito_hits_rr.txt
12449 mito_hits_rr.txt
There are definitely mito sequences in the raw hifi reads. I’ll be using MitoHiFi to assemble the Apul mito genome. This tool is specific for mitogenome assembly from PacBio HiFi reads. After I assemble it, I will remove it from the hifi raw reads before assembly of the nuclear genome. First, I’ll need to install with conda following the instructions on their github.
cd /data/putnamlab/conda
module load Miniconda3/4.9.2
# Clone repo
git clone https://github.com/marcelauliano/MitoHiFi.git
# Create a conda environment with yml file that is inside MitoHiFi/environment
conda env create -n mitohifi_env -f MitoHiFi/environment/mitohifi_env.yml
To activate and run the now-installed mitohifi:
conda activate mitohifi_env
(mitohifi_env) python MitoHiFi/src/mitohifi.py -h
Now we can run mito hifi! Go back to assembly folder and create a mito db folder. I will need to use mitohifi command findMitoReference.py
to pull mito references from closely related genomes. Young et al. 2024 pulled 4 mito genomes from NCBI (Platygyra carnosa, Favites abdita, Dipsastraea favus, and the old Orbicella faveolata). He then ran mitohifi for all of them with the Ofav hifi reads, which I’m not really sure why he did that. Maybe because he wanted to create a phylogenetic tree downstream? I’m going to pull the Acropora millepora mito sequences as a reference.
When I try to activate the conda env, I am getting this:
conda activate /data/putnamlab/conda/mitohifi_env
Not a conda environment: /data/putnamlab/conda/mitohifi_env
conda activate /data/putnamlab/conda/MitoHiFi
Not a conda environment: /data/putnamlab/conda/MitoHiFi
Very strange…maybe I just need to load miniconda and run python /data/putnamlab/conda/MitoHiFi/src/findMitoReference.py
?
Go to the assembly sequence folder: nano find_mito_ref.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=125GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module purge
module load Miniconda3/4.9.2
echo "Grabbing mito refs from NCBI" $(date)
python /data/putnamlab/conda/MitoHiFi/src/findMitoReference.py --species "Acropora millepora" --email jillashey@uri.edu --outfolder /data/putnamlab/jillashey/Apul_Genome/dbs
echo "Mito grab complete!" $(date)
Submitted batch job 310649. Immediately got this error:
Traceback (most recent call last):
File "/data/putnamlab/conda/MitoHiFi/src/findMitoReference.py", line 23, in <module>
from Bio import Entrez
ModuleNotFoundError: No module named 'Bio'
So I think I do need to activate the environment. Try to create a new env.
cd /data/putnamlab/conda/
conda create -n mitohifi_env -f MitoHiFi/environment/mitohifi_env.yml
WARNING: A conda environment already exists at '/home/jillashey/.conda/envs/mitohifi_env'
Remove existing environment (y/[n])? n
Ooooo I have a super secret conda env. Let’s try to activate it in the script.
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=125GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error
module purge
module load Miniconda3/4.9.2
conda activate /home/jillashey/.conda/envs/mitohifi_env
echo "Grabbing mito refs from NCBI" $(date)
python findMitoReference.py --species "Acropora millepora" --email jillashey@uri.edu --outfolder /data/putnamlab/jillashey/Apul_Genome/dbs
echo "Mito grab complete!" $(date)
conda deactivate
Immediately got this error:
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
I need to do conda init
but where?
cd /home/jillashey/.conda/envs/mitohifi_env
conda init
no change /opt/software/Miniconda3/4.9.2/condabin/conda
no change /opt/software/Miniconda3/4.9.2/bin/conda
no change /opt/software/Miniconda3/4.9.2/bin/conda-env
no change /opt/software/Miniconda3/4.9.2/bin/activate
no change /opt/software/Miniconda3/4.9.2/bin/deactivate
no change /opt/software/Miniconda3/4.9.2/etc/profile.d/conda.sh
no change /opt/software/Miniconda3/4.9.2/etc/fish/conf.d/conda.fish
no change /opt/software/Miniconda3/4.9.2/shell/condabin/Conda.psm1
no change /opt/software/Miniconda3/4.9.2/shell/condabin/conda-hook.ps1
no change /opt/software/Miniconda3/4.9.2/lib/python3.8/site-packages/xontrib/conda.xsh
no change /opt/software/Miniconda3/4.9.2/etc/profile.d/conda.csh
no change /home/jillashey/.bashrc
No action taken.
Need to figure this out! Here’s what the conda installation portion of their github says:
- Install MitoFinder and/or MITOS outside of Conda.
- Ensure MitoFinder and/or MITOS are added to the PATH before starting the run. Please note that MitoFinder and/or MITOS should be installed separately and made accessible via the PATH environment variable to ensure their proper integration with MitoHiFi. Once those are installed, do:
#Clone MitoHiFi git repo
git clone https://github.com/marcelauliano/MitoHiFi.git
#create a conda environment with our yml file that is inside MitoHiFi/environment
conda env create -n mitohifi_env -f MitoHiFi/environment/mitohifi_env.yml
Add MitoFinder and/or MITOS to the PATH and then activate your mitohifi_env conda environment.
Hmm confused. come back to this.