Apulchra genome assembly

Apulchra genome assembly

Sperm and tissue from adult Acropora pulchra colonies were collected from Moorea, French Polynesia and sequencing with PacBio (long reads) and Illumina (short reads). This post will detail the genome assembly notes. The github for this project is here.

I’m going to write notes and code chronologically so that I can keep track of what I’m doing each day. When assembly is complete, I will compile the workflow in a separate post.

20240206

Met w/ Ross and Hollie today re Apulchra genome assembly. We decided to move forward with the workflow from Stephens et al. 2022 which assembled genomes for 4 Hawaiian corals.

PacBio long reads were received in late Jan/early Feb 2024. According to reps from Genohub, the PacBio raw output looks good.

We decided to move forward with Canu to assembly the genome. Canu is specialized to assemble PacBio sequences, operating in three phases: correction, trimming and assembly. According to the Canu website, “The correction phase will improve the accuracy of bases in reads. The trimming phase will trim reads to the portion that appears to be high-quality sequence, removing suspicious regions such as remaining SMRTbell adapter. The assembly phase will order the reads into contigs, generate consensus sequences and create graphs of alternate paths.”

The PacBio files that will be used for assembly are located here on Andromeda: /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead. The files in the folder that we will use are:

m84100_240128_024355_s2.hifi_reads.bc1029.bam
m84100_240128_024355_s2.hifi_reads.bc1029.bam.pbi

The bam file contains all of the read information in computer language and the pbi file is an index file of the bam. Both are needed in the same folder to run Canu.

For Canu, input files must be fasta or fastq format. I’m going to use bam2fastqfrom the PacBio github. This module is not on Andromeda so I will need to install it via conda.

The PacBio sequencing for the Apul genome were done with HiFi sequencing that are produced with circular consensus sequencing on PacoBio long read systems. Here’s how HiFi reads are generated from the PacBio website:

Since Hifi sequencing was used, a specific HiCanu flag (-pacbio-hifi) must be used. Additionally, in the Canu tutorial, it says that if this flag is used, the program is assuming that the reads are trimmed and corrected already. However, our reads are not. I’m going to try to run the first pass at Canu with the -raw flag.

Before Canu, I will run bam2fastq. This is a part of the PacBio BAM toolkit package pbtk. I need to create the conda environment and install the package. Load miniconda module: module load Miniconda3/4.9.2. Create a conda environment.

conda create --prefix /data/putnamlab/conda/pbtk

Install the package. Once the package is installed on the server, anyone in the Putnam lab group can use it.

conda activate /data/putnamlab/conda/pbtk
conda install -c bioconda pbtk

In my own directory, make a new directory for genome assembly

cd /data/putnamlab/jillashey
mkdir Apul_Genome
cd Apul_Genome
mkdir assembly structural functional
cd assembly
mkdir scripts data output

Run code to make the PacBio bam file to a fastq file. In the scripts folder: nano bam2fastq.sh

#!/bin/bash -i
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/pbtk

echo "Convert PacBio bam file to fastq file" $(date)

bam2fastq -o /data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam

echo "Bam to fastq complete!" $(date)

conda deactivate

Submitted batch job 294235

20240208

Job pended for about a day, then ran in 1.5 hours. I got this error message: bash: cannot set terminal process group (-1): Function not implemented bash: no job control in this shell, but not sure why. A fastq.gz file was produced! The file is pretty large (35G).

less m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq.gz 

@m84100_240128_024355_s2/261887593/ccs
TATAAGTTTTACAGCTGCCTTTTGCTCAGCAAAGAAAGCAGCATTGTTATTGAACAGAAAAAGCCTTTTGGTGATATAAAGGTTTCTAAGGGACCAAAGTTTGATCTAGTATGCTAAGTGTGGTGGGTTAAAACTTTGTTTCACCTTTTTCCCGTGATGTTACAAAATTGGTGCAAATATTCATGACGTCGTACTCAAGTCTGACACTAAGAGCATGCAATTACTTAAACAAACAAGCCATACACCAATAAACTGAAGCTCTGTCAACTAGAAAACCTTTGAGTATTTTCATTGTAAAGTACAAGTGGTATATTGTCACTTGCTTTTACAACTTGAAGAACTCACAGTTAAGTTACTAGATTCACCATAGTGCTTGGCAATGAAGAAGCCAAATCACATAAAGTCGGAGCATGTGGTGTTTAGACCTAATCAAACAAGAACACAATATTTAGTACCTGCATCCTGTCTAAGGAGGAAATTTTAAGCTGCTTTCTTTTAAATTTTTTTTATTAGCATTTCAATGGTTGAGGTCGATTATAGGTGCTAGGCTTTAATTCCGTACTATGAAAGAAGAAAGGTCGTTGTTATTAACCATGTCAAACAGAGAAACACATGGTAAAAAATTGACTTCCTTTTCCTCTCGTTGCCACTTAAGCTTAATGATGGTGTTTGACCTGAAAGATGTTACAATTGTTTTAGATGAAAAGACTGTTCTGCGTAAAACAGTGAAGCCTCCCAACTTATTTTGTTATGTGGATTTTGTTGTCTTGTTAGTAACATGTATTGGACTATCTTTTGTGAGTACATAGCTTTTTTTCCATCAACTGACTATATACGTGGTGTAATTTGAGATCATGCCTCCAAGTGTTAGTCTTTTGTTTGGGGCTAACTCGTAAAAGACAAAGGGAGGGGGGTTGTCTAATTCCTAAGCAAAGCATTAAGTTTAACACAGGAAATTGTTTGCGTTGATATTGCTATCCTTTCAGCCCCAAACAAAAAATTTAATGGTTATTTTATTTTACATCTATTGTAAATATATTTTAACATTAATTTTTATTATTGCACTGTAAATACTTGTACTAATGTTCTGTTTGAATTAATTTTGATTCATTCCTTGTGCTTACAACAACAGGGATACAAAACCGATATGTATAATAATACTATTAGAGATGCTTATTTGCATTTTTAGCCCATACCATGAGTTTTAATAACGCCAGGCCATTGGAGATTTTATGGAGTGAGGATTCATTGTACAAACATGGTTGATTTAATATTAAAGTTGTATCCAAATAATTAATATCTGCTGTGATCAGTGAAAGATTGACCTTTCAGTTGTTTGGTTGCACCTTCATCTTATTGGAAACAACTGAATGGAGCATCTTTCCAGTTTAAAAATGTACCACTGCCCACTTTCATGAAGTTATGCCACATATTAATAATGACTATTAATTGTTGAAAACCCTTCTTCCAAAATGTTTCCATTTATTTGTAATAGCATATGTGGTCCATCAGAACAATAATTTAAATCATTACTATTAATAATTTTCCAATAACTGACTTTCAAACCTAGCCAACAAGCATAAGTCAGTAAGCCACAGAGTCCAGAGATACACTTACACTTTACTTTTCACTTCTGAAACATTTTATAATCTCAGTATGAGCATAGAACTTTTCAGTTGGGCAGCATGGAATAGAACCTTTGGACCCCTCTGTGAATATCAAAAATAGGCAACCACTTCCAGCATACACTCTAGCCTCCTTCATAAAGCAAGCCTTAGTGTTTTAGCTTCTACTAGTTAGATTCATTTTAAAAGAAGTTCAGTATACTTAATCTTATAGAAGCTGATTGTGATATAATTGCATAGGTGGATCTCAGAAAAGTGAGATGTAGCTGTCAAATTAAAAGAAGTCCTTTCCAAGCGTAGCTTCTGATAAACAATGCATTTTAGTTAACATTGGATTATGGTTTCAAAGGACTTGTAAAGCTAAATTCAAGTTTTTATGACAACTTGAAAGCCTTTGCCACAGTCTCCGCTGATTTAAGACTTCCATCAAAGTTAGAGTGGTGTGAATGCATCTCCACATGCAATTAATAAAGGTGAGGCAGACAACACAAAACACCCTGGTGCACCATCAACTCCACGGATCACTTGACTGTAACGCCATCTTATACAGCGACTGCCAATTGGAACTGGAAGATCAGGAATGATCTCTTTCACATGGGAAATGAGCATGGTCTTGATAATGCTTTCATCAGTATCCCAATGTTGAAGACAGAAGGACTTTGTTGAATGTACTAAAAGAGAGGGACCAACATCCACTGGATCTGGAACAGAATGCCAAAGCATATGCAAAATGTTATCATGTATAATAAAGAAAAAACCAAACTCATGAAGTTCAACGGTCACAGGCTTTGGTAAAAGACTGTACATGGTAAAGTACTCAGGGTGAAAGTTTGTATCCAAAAAACGAAAAGCTAACCTGTACCACATCTCTTCCGAGAGTCCACTGACACAAATGCAATATTAGGATTGCCAGTCGCATATTTACACGTCCATGGCACATTGATAACTGCTTCATGATCAAAAAAGTATCCTACTGCAAACCGTGATGAATATTCCACATTTTGTAAGGATTTGATTTCATTTTGTAAGAATGCTTGAATTGAACCTTGTAGCTGAAGAAGTTGTGGTACTGGAATAGTTACTATGACTGA

See how many @m84100 are in the file. I’m not sure what these stand for, maybe contigs?

zgrep -c "@m84100" m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq.gz 
5898386

More than 5 million, so many not contigs? I guess it represents the number of HiFi reads generated. Now time to run Canu! Canu is already installed on the server, which is nice.

In the scripts folder: nano canu.sh

#!/bin/bash 
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load canu/2.2-GCCcore-11.2.0 

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Unzip paco-bio fastq file" $(date)

gunzip m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq.gz

echo "Unzip complete, starting assembly" $(date)

canu -p apul -d /data/putnamlab/jillashey/Apul_Genome/assembly/data genomeSize=475m -raw -pacbio-hifi m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq

echo "Canu assembly complete" $(date)

I’m not sure if the raw and -pacbio-hifi will be compatible, as the Canu tutorial says that the -pacbio-hifi assumes that the input is trimmed and corrected (still not sure what this means). Submitted batch job 294325

20240212

The canu script appears to have ran. The script canu.sh itself ran in ~40 mins, but it spawned 10000s of other jobs on the server (for parallel processing, I’m guessing). Since I didn’t start those jobs, I didn’t get emails when they finished, so I just had to check the server every few hours. It took about a day for the rest of the jobs to finish. First, I’m looking at the slurm-294325.error output file. There’s a lot in this file, but I will break it down.

It first provides details on the slurm support and associated memory, as well as the number of threads that each portion of canu will need to run.

-- Slurm support detected.  Resources available:
--     25 hosts with  36 cores and  124 GB memory.
--      3 hosts with  24 cores and  124 GB memory.
--      1 host  with  36 cores and  123 GB memory.
--      1 host  with  48 cores and  753 GB memory.
--      8 hosts with  36 cores and   61 GB memory.
--      1 host  with  48 cores and  250 GB memory.
--      2 hosts with  36 cores and  250 GB memory.
--      2 hosts with  48 cores and  375 GB memory.
--      4 hosts with  36 cores and  502 GB memory.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     12.000 GB    6 CPUs  (k-mer counting)
-- Grid:  hap       12.000 GB   12 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   13.000 GB   12 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     8.000 GB    6 CPUs  (overlap detection)
-- Grid:  utgovl     8.000 GB    6 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       16.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       10.000 GB    6 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       64.000 GB    8 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- Found trimmed raw PacBio HiFi reads in the input files.

The file also says that it skipped the correction and trimming steps. This indicates that adding the -raw flag didn’t work.

--   Stages to run:
--     assemble HiFi reads.
--
--
-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.

Two histograms are then presented. The first is a histogram of correct reads:

-- In sequence store './apul.seqStore':
--   Found 5897694 reads.
--   Found 79182880880 bases (166.7 times coverage).
--    Histogram of corrected reads:
--    
--    G=79182880880                      sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        26133    264284   7918288640  ||       1150-2287         8297|--
--    00020        22541    592551  15836596549  ||       2288-3425        76768|-------------
--    00030        20184    964605  23754880672  ||       3426-4563       316378|---------------------------------------------------
--    00040        18285   1377072  31673163019  ||       4564-5701       397324|---------------------------------------------------------------
--    00050        16551   1832151  39591448183  ||       5702-6839       374507|------------------------------------------------------------
--    00060        14786   2337786  47509730799  ||       6840-7977       351843|--------------------------------------------------------
--    00070        12854   2910882  55428020435  ||       7978-9115       339864|------------------------------------------------------
--    00080        10612   3585762  63346313974  ||       9116-10253      340204|------------------------------------------------------
--    00090         7724   4449769  71264594524  ||      10254-11391      341160|-------------------------------------------------------
--    00100         1150   5897693  79182880880  ||      11392-12529      343170|-------------------------------------------------------
--    001.000x             5897694  79182880880  ||      12530-13667      340043|------------------------------------------------------
--                                               ||      13668-14805      336214|------------------------------------------------------
--                                               ||      14806-15943      328927|-----------------------------------------------------
--                                               ||      15944-17081      316290|---------------------------------------------------
--                                               ||      17082-18219      293393|-----------------------------------------------
--                                               ||      18220-19357      261212|------------------------------------------
--                                               ||      19358-20495      225351|------------------------------------
--                                               ||      20496-21633      188701|------------------------------
--                                               ||      21634-22771      154123|-------------------------
--                                               ||      22772-23909      124842|--------------------
--                                               ||      23910-25047       99659|----------------
--                                               ||      25048-26185       78280|-------------
--                                               ||      26186-27323       61530|----------
--                                               ||      27324-28461       47760|--------
--                                               ||      28462-29599       37668|------
--                                               ||      29600-30737       29339|-----
--                                               ||      30738-31875       22548|----
--                                               ||      31876-33013       17292|---
--                                               ||      33014-34151       13055|---
--                                               ||      34152-35289        9797|--
--                                               ||      35290-36427        7251|--
--                                               ||      36428-37565        5005|-
--                                               ||      37566-38703        3560|-
--                                               ||      38704-39841        2474|-
--                                               ||      39842-40979        1553|-
--                                               ||      40980-42117        1006|-
--                                               ||      42118-43255         604|-
--                                               ||      43256-44393         334|-
--                                               ||      44394-45531         184|-
--                                               ||      45532-46669         102|-
--                                               ||      46670-47807          45|-
--                                               ||      47808-48945          18|-
--                                               ||      48946-50083          11|-
--                                               ||      50084-51221           4|-
--                                               ||      51222-52359           1|-
--                                               ||      52360-53497           2|-
--                                               ||      53498-54635           0|
--                                               ||      54636-55773           0|
--                                               ||      55774-56911           0|
--                                               ||      56912-58049           1|-

There’s also a histogram of corrected-trimmed reads, but it is the exact same as the histogram above. The histogram represents the length ranges for each sequence and the number of sequences that have that length range. For example, if we look at the top row, there are 8297 sequences that range in length from 1150-2287 bp.

It looks like canu did run jobs by itself:

--  For 5897694 reads with 79182880880 bases, limit to 791 batches.
--  Will count kmers using 16 jobs, each using 13 GB and 6 threads.
--
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
--
-- 'meryl-count.jobSubmit-01.sh' -> job 294326 tasks 1-16.
--
----------------------------------------
-- Starting command on Thu Feb  8 14:32:12 2024 with 8923.722 GB free disk space

    cd /glfs/brick01/gv0/putnamlab/jillashey/Apul_Genome/assembly/data
    sbatch \
      --depend=afterany:294326 \
      --cpus-per-task=1 \
      --mem-per-cpu=4g   \
      -D `pwd` \
      -J 'canu_apul' \
      -o canu-scripts/canu.01.out  canu-scripts/canu.01.sh

-- Finished on Thu Feb  8 14:32:13 2024 (one second) with 8923.722 GB free disk space

Let’s look at the output files that canu produced in /data/putnamlab/jillashey/Apul_Genome/assembly/data. I’ll be using the Canu tutorial output information to understand outputs.

-rwxr-xr-x. 1 jillashey 1.1K Feb  8 14:13 apul.seqStore.sh
-rw-r--r--. 1 jillashey  951 Feb  8 14:31 apul.seqStore.err
drwxr-xr-x. 3 jillashey 4.0K Feb  8 14:32 apul.seqStore
drwxr-xr-x. 9 jillashey 4.0K Feb  8 23:57 unitigging
drwxr-xr-x. 2 jillashey 4.0K Feb  9 01:36 canu-scripts
lrwxrwxrwx. 1 jillashey   24 Feb  9 01:36 canu.out -> canu-scripts/canu.09.out
drwxr-xr-x. 2 jillashey 4.0K Feb  9 01:36 canu-logs
-rw-r--r--. 1 jillashey  23K Feb  9 01:47 apul.report
-rw-r--r--. 1 jillashey 7.0M Feb  9 01:51 apul.contigs.layout.tigInfo
-rw-r--r--. 1 jillashey 155M Feb  9 01:51 apul.contigs.layout.readToTig
-rw-r--r--. 1 jillashey 2.8G Feb  9 01:57 apul.unassembled.fasta
-rw-r--r--. 1 jillashey 943M Feb  9 02:01 apul.contigs.fasta

The apul.report file will provide information about the analysis during assembly, including histogram of read lengths, the histogram or k-mers in the raw and corrected reads, the summary of corrected data, summary of overlaps, and the summary of contig lengths. The histogram of read lengths is the same as in the error file above. There is also a histogram (?) of the mer information:

--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2   4316301 ****                                                                   0.0150 0.0002
--       3-     4    855268                                                                        0.0172 0.0002
--       5-     7    183637                                                                        0.0183 0.0002
--       8-    11     62578                                                                        0.0187 0.0002
--      12-    16     29616                                                                        0.0188 0.0002
--      17-    22     22278                                                                        0.0189 0.0003
--      23-    29     20222                                                                        0.0190 0.0003
--      30-    37     28236                                                                        0.0190 0.0003
--      38-    46     73803                                                                        0.0192 0.0003
--      47-    56    556391                                                                        0.0194 0.0003
--      57-    67   6166302 ******                                                                 0.0218 0.0010
--      68-    79  35800144 ***************************************                                0.0477 0.0098
--      80-    92  63190280 ********************************************************************** 0.1835 0.0633
--      93-   106  26607016 *****************************                                          0.3988 0.1607
--     107-   121   2495822 **                                                                     0.4799 0.2024
--     122-   137   2376701 **                                                                     0.4871 0.2066
--     138-   154  16117515 *****************                                                      0.4965 0.2131
--     155-   172  47510630 ****************************************************                   0.5575 0.2605
--     173-   191  41907409 **********************************************                         0.7262 0.4060
--     192-   211   9190117 **********                                                             0.8650 0.5375
--     212-   232   1806623 **                                                                     0.8934 0.5669
--     233-   254   4028020 ****                                                                   0.8997 0.5743
--     255-   277   3875382 ****                                                                   0.9140 0.5926
--     278-   301   1456651 *                                                                      0.9271 0.6107
--     302-   326   1905710 **                                                                     0.9319 0.6181
--     327-   352   3003944 ***                                                                    0.9388 0.6293
--     353-   379   1723658 *                                                                      0.9491 0.6477
--     380-   407    968475 *                                                                      0.9549 0.6587
--     408-   436   1249856 *                                                                      0.9583 0.6657
--     437-   466    890757                                                                        0.9626 0.6752
--     467-   497    768714                                                                        0.9656 0.6824
--     498-   529    937920 *                                                                      0.9683 0.6892
--     530-   562    618907                                                                        0.9716 0.6978
--     563-   596    582755                                                                        0.9737 0.7039
--     597-   631    535408                                                                        0.9757 0.7100
--     632-   667    441122                                                                        0.9775 0.7159
--     668-   704    459953                                                                        0.9791 0.7211
--     705-   742    367266                                                                        0.9807 0.7268
--     743-   781    354455                                                                        0.9819 0.7316
--     782-   821    304354                                                                        0.9832 0.7365

There are 22-mers. A k-mer are substrings of length k contained in a biological sequence. For example, the term k-mer refers to all of a sequence’s subsequences of length k such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). So if we have 22-mers, we have subsequences of 22 nt? The Canu documentation says that k-mer histograms with more than 1 peak likely indicate a heterozygous genome. I’m not sure if the stars represent peaks or counts but if this is a histogram of k-mer information, it has two peaks, indicating a heterozygous genome.

The Canu documentation states that corrected read reports should be given with information about number of reads, coverage, N50, etc. My log file does not have this information, likely because the trimming and correcting steps were not performed. Instead, I have this information:

--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing       4114    0.07    10652.45 +- 6328.73       1185.95 +- 2112.86    (bad trimming)
--   middle-hump          4148    0.07    11485.92 +- 4101.92       4734.88 +- 3702.70    (bad trimming)
--   no-5-prime           8533    0.14     9311.86 +- 5288.80       2087.03 +- 3315.79    (bad trimming)
--   no-3-prime          10888    0.18     7663.24 +- 5159.57       1677.51 +- 3090.22    (bad trimming)
--   
--   low-coverage        48831    0.83     6823.01 +- 3725.15         16.35 +- 15.52      (easy to assemble, potential for lower quality consensus)
--   unique            5419159   91.89     9403.44 +- 4761.57        110.87 +- 40.33      (easy to assemble, perfect, yay)
--   repeat-cont         93801    1.59     7730.76 +- 4345.55       1001.60 +- 665.30     (potential for consensus errors, no impact on assembly)
--   repeat-dove           380    0.01    23076.97 +- 3235.06        878.08 +- 510.47     (hard to assemble, likely won't assemble correctly or eve
n at all)
--   
--   span-repeat         64724    1.10    11397.59 +- 5058.57       2335.58 +- 2836.91    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont   182925    3.10     9764.80 +- 4218.80                             (should be uniquely placed, low potential for consensus e
rrors, no impact on assembly)
--   uniq-repeat-dove    14288    0.24    17978.89 +- 4847.85                             (will end contigs, potential to misassemble)
--   uniq-anchor         19659    0.33    11312.36 +- 4510.86       5442.58 +- 3930.94    (repeat read, with unique section, probable bad read)

I’m not sure why I’m getting all of this information or what it means. There is a high % of unique reads in the data which is good. In the file, there is also information about edges (not sure what this means), as well as error rates. May discuss further with Hollie.

The Canu output documentation says that I’m supposed to get a file with corrected and trimmed reads but I don’t have those. I do have apul.unassembled.fasta and apul.contigs.fasta.

head apul.unassembled.fasta
>tig00000838 len=19665 reads=4 class=unassm suggestRepeat=no suggestBubble=no suggestCircular=no trim=0-19665
TAAAAACATTGATTCTTGTTTCAATATGAGACTTGTTTCGGAAGATGTTCGCGCAGGTTACATTTCATAATCCACAAGAAATGCGACATCGCCAACCTTA
CTTTAGTGTTTGCATTTAAGCAAAACAATGATAAAGAAACAAATCTCACATCTCGCAAAAGTATGCATTCTATGAAGAACAATGAAAATTAATGAAAGTG
AATCTTACACCTCCTATTCAAGACGCCGCATTAATTCAACTTGTTGATTTCTCCTAAAACGCTTTCTTTTAGAAGGCTTTCTTAGTCTTTCAATTGTAAA
GAATACATAAAGGACTCATGACCACTTATGTTCTTAAGTGTTACTGCTGCTTTAAAACACGTTACAAACCACATGTGAATATAGTTGCGGCACAGAAGGG
AAAATCGCTGAAATGCTGTCCAAATATACACAATATCATTAAGTAAAGTACGATCGTCCGGGTGAGTGTAGTCCTGAGAAGGACTGTTTGAGATGACATT
GACTGACGTTTCGACAACCTGAGCGGAAGTCATCTTCAGAGTCATCTTCACTTGACTCTGAAGATGACTTCCGCTCAGGTTGTCGAAACGTCAGTCAATG
TCATCTCAAACAGTCCTTCTCAGGACTACACTCACCCGGACGATCGTACTTTACTTAATGATATGACTCCTGGGTTCAAACCATTTACAAATATACACAA
GTTTGAAAGATCATATCGCCTGCCAGTTTTACAACTCGTCTTAGACACAATGGAATACAAAACCCTACCGAACGAATACCTTTGATTGAGATTTATGAAT
GTGAAACAGCGACTTCGAGAGAAAAACGAATTCTTAAAAATGCAGTTCAACTCTATCGTCATTCAACTGAAGCCAGCCGTGGCTCGGCTTCACAAGAAAG

head apul.contigs.fasta 
>tig00000004 len=43693 reads=3 class=contig suggestRepeat=no suggestBubble=yes suggestCircular=no trim=0-43693
TACAATTTTAGAACACGGGACCAGCTTAGCATAATAGCTTCACCTTTCGTCTATCTAACTCTAGGAAGTTTTAATTTTTTCAAGTATTATAAAGGGCTCC
GTCGACTGTCAAAGATTTGCCTTTTCAAGCTCCAATGGAAACTGTAAAAGTTGCGTATTTTTACGAGCTTGAAACGCATCTTGGTATGCCCGATATCAAG
TCGAAATTAATATTGAGATAATCCTTTGGCCTTCTTCTATCAACATCTGAAATAAAAATCTGGGCAGTTTGAACGCGCTCTATCAAACATAACAGATTTG
AAGGTAGGTGATAACTTATTTGCATAATCTACGTTAACAAAAAGTCTATTTATAGAATGACTACTCGGCATATTTCTAACAGTGGTACTTCAGATACGTT
TTGATGGACTTATTATTCTGTCGTTTGTATTGTTTTCTTCAATTATTTAGCCTTAATAATTCCAAATAATAAAGAAATAAGGAAAGTCTTTGGTGTAAGT
CACACTCAAAAGGTGAGTTTCAACAGTTACTGAACACCCTTACGTATTAAACAGTCATTTCAATTTCCAGATTCTAACAGAAAATGTCAAATCGTTGTTT
TATAGTAGAAATCCATCTTCAAAAGTTATTCCCCGCTTATGCAGGCTTGATTCTCGCGGCTCTTTCCAGCTCGGTTTACAATATAAGACACCGGTGCAGA
TACCATTGAACTTGTAAACAATGTCACGCAAATTAAACTGTACTTCAATTTGCAAGCCATACAGCTTTAAGTCAGGTCTTTATTGAACTTTCTAAGTCAA
GGTTGGGGAATATAAAGATATTTTATTACCAGTATATTTTCGGTGAAAATTACAACGGATACATGTTATGGGCCTGTTCTTTAAACTCAGTTACATACAT

The unassembled file contains the reads and low-coverage contigs which couldn’t be incorporated into the primary assembly. The contigs file contains the full assembly, including unique, repetitive, and bubble elements. What does this mean? Unsure, but the header line provides metadata on each sequence.

len
   Length of the sequence, in bp.

reads
   Number of reads used to form the contig.

class
   Type of sequence.  Unassembled sequences are primarily low-coverage sequences spanned by a single read.

suggestRepeat
   If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.

suggestBubble
   If yes, sequence is likely a bubble based on potential placement within longer sequences.

suggestCircular
   If yes, sequence is likely circular.  The fasta line will have a trim=X-Y to indicate the non-redundant coordinates

If we are looking at this header: >tig00000004 len=43693 reads=3 class=contig suggestRepeat=no suggestBubble=yes suggestCircular=no trim=0-43693, the length is the contig is 43693 bp, 3 reads were used to form the contig, class is a contig, no repeats used to form the contig, the sequence is likely a bubble based on placement within longer sequences, the sequence is not circular, and the entire contig is non-redundant. Not sure what bubble means…

Because Canu didn’t trim or correct the Hifi reads, I may need to use a different assembly tool. I’m going to try Hifiasm, which is a fast haplotype-resolved de novo assembler designed for PacBio HiFi reads. According to Hifiasm github, here are some good reasons to use Hifiasm:

  • Hifiasm delivers high-quality telomere-to-telomere assemblies. It tends to generate longer contigs and resolve more segmental duplications than other assemblers.

  • Hifiasm can purge duplications between haplotigs without relying on third-party tools such as purge_dups. Hifiasm does not need polishing tools like pilon or racon, either. This simplifies the assembly pipeline and saves running time.

  • Hifiasm is fast. It can assemble a human genome in half a day and assemble a ~30Gb redwood genome in three days. No genome is too large for hifiasm.

  • Hifiasm is trivial to install and easy to use. It does not required Python, R or C++11 compilers, and can be compiled into a single executable. The default setting works well with a variety of genomes.

If I use this tool, I may not need to pilon to polish the assembly. This tool isn’t on the server, so I’ll need to create a conda environment and install the package.

cd /data/putnamlab/conda
mkdir hifiasm
cd hifiasm

module load Miniconda3/4.9.2
conda create --prefix /data/putnamlab/conda/hifiasm
conda activate /data/putnamlab/conda/hifiasm
conda install -c bioconda hifiasm

Once this package is installed, run code for hifiasm assembly. In the scripts folder: nano hifiasm.sh

#!/bin/bash -i
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/hifiasm

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Starting assembly with hifiasm" $(date)

hifiasm -o apul.hifiasm m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq

echo "Assembly with hifiasm complete!" $(date)

conda deactivate

Submitted batch job 300534

20240213

Even though the reads were not trimmed or corrected with Canu, I am going to run Busco on the output. This will provide information about how well the genome was assembled and its completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs. Danielle and Kevin have both run BUSCO before and used similar scripts but I think I’ll adapt mine a little to fit my needs and personal preferences for code.

From the Busco user manual, the mandatory parameters are -i, which defines the input fasta file and -m, which sets the assessment mode (in our case, genome). Some recommended parameters incude l (specify busco lineage dataset; in our case, metazoans), c (specify number of cores to use), and -o (assigns specific label to output).

In /data/putnamlab/shared/busco/scripts, the script busco_init.sh has information about the modules to load and in what order. Both Danielle and Kevin sourced this file specifically in their code, but I will probably just copy and paste the modules. In the same folder, they also used busco-config.ini as input for the --config flag in busco, which provides a config file as an alternative to command line parameters. I am not going to use this config file (yet), as Danielle and Kevin were assembling transcriptomes and I’m not sure what the specifics of the file are (or what they should be for genomes). In /data/putnamlab/shared/busco/downloads/lineages/metazoa_odb10, there is information about the metazoan database.

In /data/putnamlab/jillashey/Apul_Genome/assembly/scripts, nano busco_canu.sh

#!/bin/bash 
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BUSCO/5.2.2-foss-2020b
module load BLAST+/2.11.0-gompi-2020b
module load AUGUSTUS/3.4.0-foss-2020b
module load SEPP/4.4.0-foss-2020b
module load prodigal/2.6.3-GCCcore-10.2.0
module load HMMER/3.3.2-gompi-2020b

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Begin busco on canu-assembled fasta" $(date)

busco -i apul.contigs.fasta -m genome -l /data/putnamlab/shared/busco/downloads/lineages/metazoa_odb10 -c 15 -o apul.busco.canu

echo "busco complete for canu-assembled fasta" $(date)

Submitted batch job 301588. This failed and gave me some errors. This one seemed to have been the fatal one: Message: BatchFatalError(AttributeError("'NoneType' object has no attribute 'remove_tmp_files'")). Danielle ran into a similar error in her busco code so I am going to try to set the --config file as "$EBROOTBUSCO/config/config.ini".

In the script, nano busco_canu.sh

#!/bin/bash 
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

#module load BUSCO/5.2.2-foss-2020b
#module load BLAST+/2.11.0-gompi-2020b
#module load AUGUSTUS/3.4.0-foss-2020b
#module load SEPP/4.4.0-foss-2020b
#module load prodigal/2.6.3-GCCcore-10.2.0
#module load HMMER/3.3.2-gompi-2020b

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Begin busco on canu-assembled fasta" $(date)

source "/data/putnamlab/shared/busco/scripts/busco_init.sh"  # sets up the modules required for this in the right order

busco --config "$EBROOTBUSCO/config/config.ini" -f -c 15 --long -i apul.contigs.fasta -m genome -l /data/putnamlab/shared/busco/downloads/lineages/metazoa_odb10 -o apul.busco.canu

echo "busco complete for canu-assembled fasta" $(date)

Submitted batch job 301594. Failed, same error as before. Going to try copying Kevin and Danielle code directly, even though its a little messy and confusing with paths.

In the script, nano busco_canu.sh

#!/bin/bash 
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Begin busco on canu-assembled fasta" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.contigs.fasta" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini"  -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.busco.canu -m genome

echo "busco complete for canu-assembled fasta" $(date)

Submitted batch job 301599. This appears to have worked! Took about an hour to run. This is the primary result in the out file:

# BUSCO version is: 5.2.2 
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /data/putnamlab/jillashey/Apul_Genome/assembly/data/apul.contigs.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk

        ***** Results: *****

        C:94.4%[S:9.4%,D:85.0%],F:2.7%,M:2.9%,n:954        
        901     Complete BUSCOs (C)                        
        90      Complete and single-copy BUSCOs (S)        
        811     Complete and duplicated BUSCOs (D)         
        26      Fragmented BUSCOs (F)                      
        27      Missing BUSCOs (M)                         
        954     Total BUSCO groups searched                

We have 94.4% completeness with this assembly but 85% complete and duplicated BUSCOs. The busco manual says this on high levels of duplication: “BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis”.

Danielle also got a high number (78.9%) of duplicated BUSCOs in her de novo transcriptome of Apulchra, but Kevin got much less duplication (6.9%) in his Past transcriptome assembly. I need to ask Danielle if she ended up using her Trinity results (which had a high duplication percentage) for her alignment for Apul. I also need to ask her if she thinks the high duplication percentage is biologically meaningful.

Might be worth running HiFiAdapter Filt

20240215

Last night, the hifiasm job failed after almost 2 days but the email says PREEMPTED, ExitCode0. Two minutes after the job failed, job 300534 started again on the server and it says its a hifiasm job…I did not start this job myself, not sure what happened. Looking on the server now, hifiasm is running but has only been running for about 18 hours (as of 2pm today). It’s the same job number though which is strange.

20240220

Hifiasm job is still running after ~5 days. In the meantime, I’m going to run HiFiAdapterFilt, which is an adapter filtering command for PacBio HiFi data. On the github page, it says that the tool converts .bam to .fastq and removes reads with remnant PacBio adapter sequences. Required dependencies are BamTools and BLAST+; optional dependencies are NCBI FCS Adaptor and pigz. It looks like I’ll need to use the original bam file instead of the converted fastq file.

The github says I should add the script and database to my path using:

export PATH=$PATH:[PATH TO HiFiAdapterFilt]
export PATH=$PATH:[PATH TO HiFiAdapterFilt]/DB

I will do this in the script for the adapter filt code that I write myself. In the scripts folder, make a folder for hifi information

mkdir HiFiAdapterFilt
cd HiFiAdapterFilt

I need to make a script for the hifi adapter code. In the scripts/HiFiAdapterFilt folder, I copy and pasted the linked code into hifiadapterfilt.sh. Make a folder for pacbio databases and copy in the db information from the github.

mkdir DB
cd DB
nano pacbio_vectors_db
>gnl|uv|NGB00972.1:1-45 Pacific Biosciences Blunt Adapter
ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT
>gnl|uv|NGB00973.1:1-35 Pacific Biosciences C2 Primer
AAAAAAAAAAAAAAAAAATTAACGGAGGAGGAGGA

In the scripts/HiFiAdapterFilt folder: nano hifiadapterfilt_JA.sh

#!/bin/bash 
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts/HiFiAdapterFilt
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load GCCcore/11.3.0 # need this to resolve conflicts between GCCcore/8.3.0 and loaded GCCcore/11.3.0
module load BamTools/2.5.1-GCC-8.3.0 
module load BLAST+/2.9.0-iimpi-2019b 

cd /data/putnamlab/jillashey/Apul_Genome/assembly/scripts/HiFiAdapterFilt

echo "Setting paths" $(date)

export PATH=$PATH:[/data/putnamlab/jillashey/Apul_Genome/assembly/scripts/HiFiAdapterFilt] # path to original script
export PATH=$PATH:[/data/putnamlab/jillashey/Apul_Genome/assembly/scripts/HiFiAdapterFilt]/DB # path to db info 

echo "Paths set, starting adapter filtering" $(date)

bash hifiadapterfilt.sh -p /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029 -l 44 -m 97 -o /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Completing adapter filtering" $(date)

The -l and -m refer to the minimum length of adapter match to remove and the minumum percent match of adapter to remove, respectively. I left them as the default settings for now. Submitted batch job 303636. Giving me this error:

hifiadapterfilt.sh: line 63: /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.temp_file_list: Permission denied
cat: /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam.temp_file_list: No such file or directory
cat: /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam.temp_file_list: No such file or directory

Giving me permission denied to write in the folder? I’ll sym link the bam file to my data folder

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
ln -s /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam

Editing script so that the prefix is connecting with the sym linked file. Submitted batch job 303637. Still giving me the same error. Hollie may need to give me permission to write and access files in that specific folder.

20240221

Probably need to run haplomerger2, which is installed on the server already.

I’m also looking at the Canu FAQs to see if there is any info about using PacBio HiFi reads. Under the question “What parameters should I use for my reads?”, they have this info:

The defaults for -pacbio-hifi should work on this data. There is still some variation in data quality between samples. If you have poor continuity, it may be because the data is lower quality than expected. Canu will try to auto-adjust the error thresholds for this (which will be included in the report). If that still doesn’t give a good assembly, try running the assembly with -untrimmed. You will likely get a genome size larger than you expect, due to separation of alleles. See My genome size and assembly size are different, help! for details on how to remove this duplication.

When I look at the question “My genome size and assembly size are different, help!”, it says that this difference could be due to a heterozygous genome where the assembly separated some loci or the previous estimate is incorrect. They recommended running BUSCO to check completeness of the assembly (which I already did) and using purge_dups to remove duplication. I will look into this.

Next steps

  • Run Canu with -untrimmed option
  • Run purge_dups - targets the removal of duplicated sequences to enhance overall quality of assembly
  • Run haplomerger - merges haplotypes to addresws heterozygosity

In the scripts folder: nano canu_untrimmed.sh

#!/bin/bash 
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load canu/2.2-GCCcore-11.2.0 

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

#echo "Unzip paco-bio fastq file" $(date)

#gunzip m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq.gz

echo "Starting assembly w/ untrimmed flag" $(date)

canu -p apul.canu.untrimmed -d /data/putnamlab/jillashey/Apul_Genome/assembly/data genomeSize=475m -raw -pacbio-hifi m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq

echo "Canu assembly complete" $(date)

Submitted batch job 303660

On the purge_dups github, they say to install using the following:

git clone https://github.com/dfguan/purge_dups.git
cd purge_dups/src && make

# only needed if running run_purge_dups.py
git clone https://github.com/dfguan/runner.git
cd runner && python3 setup.py install --user

Cloned both into the assembly folder (ie /data/putnamlab/jillashey/Apul_Genome/assembly/).

First, use pd_config.py to generate a configuration file. Here’s possible usage:

usage: pd_config.py [-h] [-s SRF] [-l LOCD] [-n FN] [--version] ref pbfofn

generate a configuration file in json format

positional arguments:
  ref                   reference file in fasta/fasta.gz format
  pbfofn                list of pacbio file in fastq/fasta/fastq.gz/fasta.gz format (one absolute file path per line)

optional arguments:
  -h, --help            show this help message and exit
  -s SRF, --srfofn SRF  list of short reads files in fastq/fastq.gz format (one record per line, the
                        record is a tab splitted line of abosulte file path
                        plus trimmed bases, refer to
                        https://github.com/dfguan/KMC) [NONE]
  -l LOCD, --localdir LOCD
                        local directory to keep the reference and lists of the
                        pacbio, short reads files [.]
  -n FN, --name FN      output config file name [config.json]
  --version             show program's version number and exit

# Example 
./scripts/pd_config.py -l iHelSar1.pri -s 10x.fofn -n config.iHelSar1.PB.asm1.json ~/vgp/release/insects/iHelSar1/iHlSar1.PB.asm1/iHelSar1.PB.asm1.fa.gz pb.fofn

I need to make a list of the pacbio files that I’ll use in the script and put it in a file called pb.fofn.

for filename in *.fastq; do echo $PWD/$filename; done > pb.fofn

In my scripts folder: nano pd_config_apul.py

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load Python/3.9.6-GCCcore-11.2.0 # do i need python?

echo "Making config file for purge dups scripts" $(date)

/data/putnamlab/jillashey/Apul_Genome/assembly/purge_dups/scripts/pd_config.py -l /data/putnamlab/jillashey/Apul_Genome/assembly/data -n config.apul.canu.json /data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq /data/putnamlab/jillashey/Apul_Genome/assembly/data/pb.fofn

echo "Config file complete" $(date)

Submitted batch job 303661. Ran in 1 second. Got this in the error file:

cp: ‘/data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq’ and ‘/data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq’ are the same file
cp: ‘/data/putnamlab/jillashey/Apul_Genome/assembly/data/pb.fofn’ and ‘/data/putnamlab/jillashey/Apul_Genome/assembly/data/pb.fofn’ are the same file

But it did generate a config file, which looks like this:

{
  "cc": {
    "fofn": "/data/putnamlab/jillashey/Apul_Genome/assembly/data/pb.fofn",
    "isdip": 1,
    "core": 12,
    "mem": 20000,
    "queue": "normal",
    "mnmp_opt": "",
    "bwa_opt": "",
    "ispb": 1,
    "skip": 0
  },
  "sa": {
    "core": 12,
    "mem": 10000,
    "queue": "normal"
  },
  "busco": {
    "core": 12,
    "mem": 20000,
    "queue": "long",
    "skip": 0,
    "lineage": "mammalia",
    "prefix": "m84100_240128_024355_s2.hifi_reads.bc1029.fastq_purged",
    "tmpdir": "busco_tmp"
  },
  "pd": {
    "mem": 20000,
    "queue": "normal"
  },
  "gs": {
    "mem": 10000,
    "oe": 1
  },
  "kcp": {
    "core": 12,
    "mem": 30000,
    "fofn": "",
    "prefix": "m84100_240128_024355_s2.hifi_reads.bc1029.fastq_purged_kcm",
    "tmpdir": "kcp_tmp",
    "skip": 1
  },
  "ref": "/data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq",
  "out_dir": "m84100_240128_024355_s2.hifi_reads.bc1029.fastq"
}

My config file looks basically the same as the example one on the github. Manually edited the config file so that the out_dir was /data/putnamlab/jillashey/Apul_Genome/assembly/data/. Now the purging can begin using run_purge_dups.py. Here’s possible usage:

usage: run_purge_dups.py [-h] [-p PLTFM] [-w WAIT] [-r RETRIES] [--version]
                         config bin_dir spid

purge_dups wrapper

positional arguments:
  config                configuration file
  bin_dir               directory of purge_dups executable files
  spid                  species identifier

optional arguments:
  -h, --help            show this help message and exit
  -p PLTFM, --platform PLTFM
                        workload management platform, input bash if you want to run locally
  -w WAIT, --wait WAIT  <int> seconds sleep intervals
  -r RETRIES, --retries RETRIES
                        maximum number of retries
  --version             show program's version number and exit
  
# Example 
python scripts/run_purge_dups.py config.iHelSar1.json src iHelSar1

In my scripts folder: nano run_purge_dups_apul.py

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load Python/3.9.6-GCCcore-11.2.0 # do i need python?

echo "Starting to purge duplications" $(date)

/data/putnamlab/jillashey/Apul_Genome/assembly/purge_dups/scripts/run_purge_dups.py /data/putnamlab/jillashey/Apul_Genome/assembly/scripts/config.apul.canu.json /data/putnamlab/jillashey/Apul_Genome/assembly/purge_dups/src apul

echo "Duplication purge complete" $(date)

Submitted batch job 303662. Failed immediately with this error:

Traceback (most recent call last):
  File "/data/putnamlab/jillashey/Apul_Genome/assembly/purge_dups/scripts/run_purge_dups.py", line 3, in <module>
    from runner.manager import manager
ModuleNotFoundError: No module named 'runner'

I installed runner but the code is not seeing it…where am I supposed to put it? inside of the purge_dups github? Okay going to move runner folder inside of the purge_dups folder.

cd /data/putnamlab/jillashey/Apul_Genome/assembly
mv runner/ purge_dups/scripts/

Submitting job again, Submitted batch job 303663. Got the same error. In the run_purge_dups.py script itself, the first few lines are:

#!/usr/bin/env python3

from runner.manager import manager
from runner.hpc import hpc
from multiprocessing import Process, Pool
import sys, os, json
import argparse

So it isn’t seeing that the runner module is there. This issue and this issue on the github were reported but never really answered in a clear way. Will have to look into this more.

In other news, the canu script finished running but looks like it failed. This is the bottom of the error message:

ERROR:
ERROR:  Failed with exit code 139.  (rc=35584)
ERROR:

ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to configure the overlap store.
ABORT:
ABORT: Disk space available:  8477.134 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (unitigging/apul.canu.untrimmed.ovlStore.config.err):
ABORT:
ABORT:   
ABORT:   Finding number of overlaps per read and per file.
ABORT:   
ABORT:      Moverlaps
ABORT:   ------------ ----------------------------------------
ABORT:   
ABORT:   Failed with 'Segmentation fault'; backtrace (libbacktrace):
ABORT:

Unsure what it means…

20240301

BIG NEWS!!!!! This week, a paper came out that assembled and annotated the Orbicella faveolata genome using PacBio HiFi reads (Young et al. 2024)!!!!!!! The github for this paper has a detailed pipeline for how the genome was put together. Since I am also using HiFi reads, I will be following their methodology! I am using this pipeline starting at line 260.

I changed the file from bam to fastq, but now I need to change it to fasta with seqtk. In the scripts folder: nano seqtk.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load seqtk/1.3-GCC-9.3.0

echo "Convert PacBio fastq file to fasta file" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

seqtk seq -a m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq > m84100_240128_024355_s2.hifi_reads.bc1029.fasta

echo "Fastq to fasta complete! Summarize read lengths" $(date)

awk '/^>/{printf("%s\t",substr($0,2));next;} {print length}' m84100_240128_024355_s2.hifi_reads.bc1029.fasta > rr_read_lengths.txt

echo "Read length summary complete" $(date)

Submitted batch job 304257.

In R, I looked at the data to quantify length for each read. See code here.

```{r, echo=F} read.table(file = “../data/rr_read_lengths.txt”, header = F) %>% dplyr::rename(“hifi_read_name” = 1, “length” = 2) -> hifi_read_length nrow(hifi_read_length) # 5,898,386 total reads mean(hifi_read_length$length) # mean length of reads is 13,424.64 sum(hifi_read_length$length) #length sum 79,183,709,778. Will need this for the NCBI submission


Make histogram for read bins from raw hifi data

```{r, echo = F}
ggplot(data = hifi_read_length, 
       aes(x = length, fill = "blue")) +
  geom_histogram(binwidth = 2000) + 
  labs(x = "Raw Read Length", y = "Count", title = "Histogram of Raw HiFi Read Lengths") + 
  scale_fill_manual(values = c("blue")) + 
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

20240303

My next step is to remove any contaminant reads from the raw hifi reads. From Young et al. 2024: “Raw HiFi reads first underwent a contamination screening, following the methodology in [68], using BLASTn [32, 68] against the assembled mitochondrial O. faveolata genome and the following databases: common eukaryote contaminant sequences (ftp.ncbi.nlm.nih. gov/pub/kitts/contam_in_euks.fa.gz), NCBI viral (ref_ viruses_rep_genomes) and prokaryote (ref_prok_rep_ genomes) representative genome sets”.

I tried to run the update_blastdb.pl script (included in the blast program) with the BLAST+/2.13.0-gompi-2022a module but I got this error:

Can't locate Archive/Tar.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /opt/software/BLAST+/2.13.0-gompi-2022a/bin/update_blastdb.pl line 41.
BEGIN failed--compilation aborted at /opt/software/BLAST+/2.13.0-gompi-2022a/bin/update_blastdb.pl line 41.

Not sure what this means…will email Kevin Bryan to ask about it, as I don’t want to mess with anything on the installed modules. I did download the contam_in_euks.fa.gz db to my computer so I’m going to copy it to Andromeda. This file is considerably smaller than the viral or prok dbs.

Make a database folder in the Apul genome folder.

cd /data/putnamlab/jillashey/Apul_Genome
mkdir dbs 
cd dbs

zgrep -c ">" contam_in_euks.fa 
3554

Now I ran run a script that blasts the pacbio fasta against these sequences. In /data/putnamlab/jillashey/Apul_Genome/assembly/scripts, nano blast_contam_euk.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

echo "BLASTing hifi fasta against eukaryote contaminant sequences" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data


blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -subject /data/putnamlab/jillashey/Apul_Genome/dbs/contam_in_euks.fa -task megablast -outfmt 6 -evalue 4 -perc_identity 90 -num_threads 15 -out contaminant_hits_euks_rr.txt

echo "BLAST complete, remove contaminant seqs from hifi fasta" $(date)

awk '{ if( ($4 >= 50 && $4 <= 99 && $3 >=98 ) ||
         ($4 >= 100 && $4 <= 199 && $3 >= 94 ) ||
         ($4 >= 200 && $3 >= 90) )  {print $0}
    }' contaminant_hits_euks_rr.txt > contaminants_pass_filter_euks_rr.txt

echo "Contaminant seqs removed from hifi fasta" $(date)

Submitted batch job 304389. Finished in about 2.5 hours. Looked at the output in R (code here).

20240304

Emailed Kevin Bryan this morning and asked if he knew anything about why the update_blastdb.pl wasn’t working. Still waiting to hear back from him.

Kevin Bryan also emailed me this morning about my hifiasm job that has been running for 18 days and said: “This job has been running for 18 days. I just took a look at it and it appears you didn’t specify -t $SLURM_CPUS_ON_NODE (and also #SBATCH --exclusive) to make use of all of the CPU cores on the node. You might want to consider re-submitting this job with those parameters. Because the nodes generally have 36 cores, it should be able to catch up to where it is now in a little over half a day, assuming perfect scaling.”

I need to add those parameters into the hifiasm code, so cancelling the 300534 job. In /data/putnamlab/jillashey/Apul_Genome/assembly/scripts, nano hifiasm.sh:

#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/hifiasm

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Starting assembly with hifiasm" $(date)

hifiasm -o apul.hifiasm m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq -t 36

echo "Assembly with hifiasm complete!" $(date)

conda deactivate

Submitted batch job 304463

20240304

Response from Kevin Bryan about viral and prok blast databases: “Ok, I downloaded those databases, and actually consolidated the rest of them into /data/shared/ncbi-db/, under which is a directory for today, and then there will be a new one next Sunday and following Sundays. There’s a file /data/shared/ncbi-db/.ncbirc that gets updated to point to the current directory, which the blast* tools will automatically pick up, so you can just do -db ref_prok_rep_ genomes, for example.

For other tools that can read the ncbi databases, you can use blastdb_path -db ref_viruses_rep_genomes to get the path, although for some reason with the nr database you need to specify -dbtype prot, i.e., blastdb_path -db nr -dbtype prot.

The reason for the extra complication is because otherwise a job that runs while the database is being updated may fail or return strange results. The dated directories should resolve this issue. Note that Unity blast-plus modules work in a similar way with a different path, /datasets/bio/ncbi-db.”

Amazing! Now I can move forward with the blasting against viral and prok genomes. In the scripts folder: nano blastn_viral.sh

#!/bin/bash 
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Blasting hifi reads against viral genomes to look for contaminants" $(date)

blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db ref_viruses_rep_genomes -outfmt 6 -evalue 1e-4 -perc_identity 90 -out viral_contaminant_hits_rr.txt

echo "Blast complete!" $(date)

Submitted batch job 304500

In the scripts folder: nano blastn_prok.sh

#!/bin/bash 
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Blasting hifi reads against prokaryote genomes to look for contaminants" $(date)

blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db ref_prok_rep_genomes -outfmt 6 -evalue 1e-4 -perc_identity 90 -out prok_contaminant_hits_rr.txt

echo "Blast complete!" $(date)

Submitted batch job 304502

20240306

Making list of all software programs that Young et al. 2024 used and if they are on Andromeda

  • blastn
    • On Andromeda? YES
  • Meryl
    • On Andromeda? NO
  • Genome-Scope2
    • On Andromeda? NO
  • Hifiasm
    • On Andromeda? NO but I added it to putnamlab via conda
  • Quast
    • On Andromeda? YES
  • Busco
    • On Andromeda? YES
  • Merqury
    • On Andromeda? NO
  • RepeatModeler2
    • On Andromeda? NO
  • Repeat-Masker
    • On Andromeda? YES
  • TeloScafs
    • On Andromeda? NO
  • PASA
    • On Andromeda? NO
  • funnannotate
    • On Andromeda? NO
  • Augustus
    • On Andromeda? YES
  • GeneMark-ES/ET
    • On Andromeda? YES but only GeneMark-ET
  • snap
    • On Andromeda? YES
  • glimmerhmm
    • On Andromeda? NO
  • Evidence Modeler
    • On Andromeda? NO
  • tRNAscan-SE
    • On Andromeda? YES
  • Trinity
    • On Andromeda? Yes
  • InterproScan
    • On Andromeda? YES

20240311

Hifiasm (with unfiltered reads) finished running over the weekend and the prok blast script preemptively ended and then restarted in the early hours of this morning. I think this might be because I am not making use of all cores on the node (similar to my earlier hifiasm script). I cancelled the prok blast job (304502) and edited the script so that it includes the flag -num_threads 36. Submitted batch job 305351

It created many files:

-rw-r--r--. 1 jillashey  19G Mar  7 23:58 apul.hifiasm.ec.bin
-rw-r--r--. 1 jillashey  47G Mar  8 00:08 apul.hifiasm.ovlp.source.bin
-rw-r--r--. 1 jillashey  17G Mar  8 00:12 apul.hifiasm.ovlp.reverse.bin
-rw-r--r--. 1 jillashey 1.2G Mar  8 01:37 apul.hifiasm.bp.r_utg.gfa
-rw-r--r--. 1 jillashey  21M Mar  8 01:37 apul.hifiasm.bp.r_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.6M Mar  8 01:41 apul.hifiasm.bp.r_utg.lowQ.bed
-rw-r--r--. 1 jillashey 1.1G Mar  8 01:42 apul.hifiasm.bp.p_utg.gfa
-rw-r--r--. 1 jillashey  21M Mar  8 01:42 apul.hifiasm.bp.p_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.2M Mar  8 01:46 apul.hifiasm.bp.p_utg.lowQ.bed
-rw-r--r--. 1 jillashey 506M Mar  8 01:47 apul.hifiasm.bp.p_ctg.gfa
-rw-r--r--. 1 jillashey  11M Mar  8 01:47 apul.hifiasm.bp.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar  8 01:49 apul.hifiasm.bp.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 469M Mar  8 01:50 apul.hifiasm.bp.hap1.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar  8 01:50 apul.hifiasm.bp.hap1.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar  8 01:52 apul.hifiasm.bp.hap1.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 468M Mar  8 01:52 apul.hifiasm.bp.hap2.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar  8 01:52 apul.hifiasm.bp.hap2.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 1.9M Mar  8 01:54 apul.hifiasm.bp.hap2.p_ctg.lowQ.bed

This page gives a brief overview of the hifiasm output files, which is super helpful. It generates the assembly graphs in Graphical Fragment Assembly (GFA) format.

  • prefix.r_utg.gfa: haplotype-resolved raw unitig graph. This graph keeps all haplotype information.
    • A unitig is a portion of a contig. It is a nondisputed and assembled group of fragments. A contiguous sequence of ordered unitigs is a contig, and a single unitig can be in multiple contigs.
  • prefix.p_utg.gfa: haplotype-resolved processed unitig graph without small bubbles. Small bubbles might be caused by somatic mutations or noise in data, which are not the real haplotype information. Hifiasm automatically pops such small bubbles based on coverage. The option –hom-cov affects the result. See homozygous coverage setting for more details. In addition, the option -p forcedly pops bubbles.
    • Confused about the bubbles, but it looks like a medium level (what is a “medium” level”?) of heterozygosity will result in bubbles (see image in this post)
    • Homozygous coverage refers to coverage threshold for homozygous reads. Hifiasm prints it as: [M::purge_dups] homozygous read coverage threshold: X. If it is not around homozygous coverage, the final assembly might be either too large or too small.
  • prefix.p_ctg.gfa: assembly graph of primary contigs. This graph includes a complete assembly with long stretches of phased blocks.
    • From my understanding based on this post discussing concepts in phased assemblies, a phased assembly identifies different alleles
  • prefix.a_ctg.gfa: assembly graph of alternate contigs. This graph consists of all contigs that are discarded in primary contig graph.
    • There were none of these files in my output. Does this mean all contigs created were used in the final assembly?
  • prefix.hap*.p_ctg.gfa: phased contig graph. This graph keeps the phased contigs for haplotype 1 and haplotype 2.

I believe this file (apul.hifiasm.bp.p_ctg.gfa) contains the sequence information for the assembled contigs. zgrep -c "S" apul.hifiasm.bp.p_ctg.gfa showed that there were 188 Segments, or continuous sequences, in this assembly, meaning there are 188 contigs (to my understanding). This file (apul.hifiasm.bp.p_ctg.noseq.gfa) contains information about the reads used to construct the contigs in a plain text format.

head apul.hifiasm.bp.p_ctg.noseq.gfa
S	ptg000001l	*	LN:i:21642937	rd:i:82
A	ptg000001l	0	+	m84100_240128_024355_s2/165481517/ccs	0	25806	id:i:5084336	HG:A:a
A	ptg000001l	6512	+	m84100_240128_024355_s2/221910786/ccs	0	21626	id:i:5609370	HG:A:a
A	ptg000001l	7287	-	m84100_240128_024355_s2/128778986/ccs	0	24026	id:i:2691352	HG:A:a
A	ptg000001l	7493	+	m84100_240128_024355_s2/262476615/ccs	0	30540	id:i:287120	HG:A:a
A	ptg000001l	15042	+	m84100_240128_024355_s2/191824085/ccs	0	27619	id:i:2783058	HG:A:a
A	ptg000001l	16616	-	m84100_240128_024355_s2/29886096/ccs	0	28664	id:i:2336674	HG:A:a
A	ptg000001l	17527	-	m84100_240128_024355_s2/37553051/ccs	0	31883	id:i:2336554	HG:A:a
A	ptg000001l	21104	-	m84100_240128_024355_s2/242419829/ccs	0	28551	id:i:5120251	HG:A:a
A	ptg000001l	21788	+	m84100_240128_024355_s2/266536351/ccs	0	27994	id:i:5577462	HG:A:a

The S line is the Segment and it acts as a header for the the contig. LN:i: is the segment length (in this case, 21,642,937 bp). The rd:i: is the read coverage, calculated by the reads coming from the same contig (in this case, read coverage is 82, which is high). The A lines provide information about the sequences that make up the contig. Here’s what each column means

  • Column 1: should always be A
  • Column 2: contig name
  • Column 3: contig start coordinate of subregion constructed by read
  • Column 4: read strand (+ or -)
  • Column 5: read name
  • Column 6: read start coordinate of subregion which is used to construct contig
  • Column 7: read end coordinate of subregion which is used to construct contig
  • Column 8: read ID
  • Column 9: haplotype status of read. HG:A:a, HG:A:p, and HG:A:m indicate that the read is non-binnable (ie heterozygous), father/hap1 specific, or mother/hap2 specific.

If I’m interpreting this correctly, it looks like most of the reads in the first contig are heterozygous.

The error file (slurm-304463.error) for this script contains the histogram of the kmers. It has 4 iterations of kmer histograms with some sort of analysis in between the histograms. Here’s what the last histogram looks like:

[M::ha_analyze_count] lowest: count[9] = 2315
[M::ha_analyze_count] highest: count[83] = 417609
[M::ha_hist_line]     1: ****************************************************************************************************> 1732549
[M::ha_hist_line]     2: ************* 53793
[M::ha_hist_line]     3: **** 16969
[M::ha_hist_line]     4: ** 9184
[M::ha_hist_line]     5: * 5507
[M::ha_hist_line]     6: * 4361
[M::ha_hist_line]     7: * 3389
[M::ha_hist_line]     8: * 2959
[M::ha_hist_line]     9: * 2315
[M::ha_hist_line]    10: * 2369
[M::ha_hist_line]    11:  2037
[M::ha_hist_line]    12:  1889
[M::ha_hist_line]    13:  1724
[M::ha_hist_line]    14:  1729
[M::ha_hist_line]    15:  1721
[M::ha_hist_line]    16:  1641
[M::ha_hist_line]    17:  1530
[M::ha_hist_line]    18:  1687
[M::ha_hist_line]    19:  1389
[M::ha_hist_line]    20:  1341
[M::ha_hist_line]    21:  1370
[M::ha_hist_line]    22:  1269
[M::ha_hist_line]    23:  1249
[M::ha_hist_line]    24:  1329
[M::ha_hist_line]    25:  1327
[M::ha_hist_line]    26:  1317
[M::ha_hist_line]    27:  1274
[M::ha_hist_line]    28:  1370
[M::ha_hist_line]    29:  1495
[M::ha_hist_line]    30:  1468
[M::ha_hist_line]    31:  1677
[M::ha_hist_line]    32:  1625
[M::ha_hist_line]    33:  1729
[M::ha_hist_line]    34:  1697
[M::ha_hist_line]    35:  1825
[M::ha_hist_line]    36:  1919
[M::ha_hist_line]    37:  1966
[M::ha_hist_line]    38:  2066
[M::ha_hist_line]    39: * 2101
[M::ha_hist_line]    40: * 2195
[M::ha_hist_line]    41: * 2119
[M::ha_hist_line]    42: * 2100
[M::ha_hist_line]    43: * 2325
[M::ha_hist_line]    44: * 2644
[M::ha_hist_line]    45: * 2807
[M::ha_hist_line]    46: * 3080
[M::ha_hist_line]    47: * 3289
[M::ha_hist_line]    48: * 3661
[M::ha_hist_line]    49: * 3984
[M::ha_hist_line]    50: * 4856
[M::ha_hist_line]    51: * 5391
[M::ha_hist_line]    52: ** 6627
[M::ha_hist_line]    53: ** 7648
[M::ha_hist_line]    54: ** 9319
[M::ha_hist_line]    55: *** 11051
[M::ha_hist_line]    56: *** 13316
[M::ha_hist_line]    57: **** 16452
[M::ha_hist_line]    58: ***** 19650
[M::ha_hist_line]    59: ****** 24229
[M::ha_hist_line]    60: ******* 29998
[M::ha_hist_line]    61: ********* 37438
[M::ha_hist_line]    62: *********** 45813
[M::ha_hist_line]    63: ************* 54367
[M::ha_hist_line]    64: **************** 67165
[M::ha_hist_line]    65: ******************* 79086
[M::ha_hist_line]    66: *********************** 95901
[M::ha_hist_line]    67: *************************** 111990
[M::ha_hist_line]    68: ******************************* 128413
[M::ha_hist_line]    69: *********************************** 147679
[M::ha_hist_line]    70: ***************************************** 171224
[M::ha_hist_line]    71: ********************************************** 193638
[M::ha_hist_line]    72: **************************************************** 217984
[M::ha_hist_line]    73: ********************************************************** 242258
[M::ha_hist_line]    74: **************************************************************** 266643
[M::ha_hist_line]    75: ********************************************************************** 291355
[M::ha_hist_line]    76: **************************************************************************** 317522
[M::ha_hist_line]    77: ********************************************************************************* 337960
[M::ha_hist_line]    78: ************************************************************************************** 358656
[M::ha_hist_line]    79: ******************************************************************************************* 378437
[M::ha_hist_line]    80: ********************************************************************************************** 393447
[M::ha_hist_line]    81: ************************************************************************************************* 405399
[M::ha_hist_line]    82: *************************************************************************************************** 413774
[M::ha_hist_line]    83: **************************************************************************************************** 417609
[M::ha_hist_line]    84: **************************************************************************************************** 416365
[M::ha_hist_line]    85: **************************************************************************************************** 417459
[M::ha_hist_line]    86: *************************************************************************************************** 413637
[M::ha_hist_line]    87: ************************************************************************************************* 404341
[M::ha_hist_line]    88: ********************************************************************************************* 387519
[M::ha_hist_line]    89: ***************************************************************************************** 372331
[M::ha_hist_line]    90: ************************************************************************************ 352702
[M::ha_hist_line]    91: ******************************************************************************** 333308
[M::ha_hist_line]    92: ************************************************************************* 305452
[M::ha_hist_line]    93: ******************************************************************* 279706
[M::ha_hist_line]    94: ************************************************************** 257317
[M::ha_hist_line]    95: ******************************************************* 230115
[M::ha_hist_line]    96: ************************************************* 205580
[M::ha_hist_line]    97: ******************************************* 181564
[M::ha_hist_line]    98: ************************************** 159113
[M::ha_hist_line]    99: ********************************* 139005
[M::ha_hist_line]   100: ***************************** 120518
[M::ha_hist_line]   101: ************************* 102686
[M::ha_hist_line]   102: ********************* 86025
[M::ha_hist_line]   103: ****************** 73567
[M::ha_hist_line]   104: *************** 61207
[M::ha_hist_line]   105: ************ 50380
[M::ha_hist_line]   106: ********** 41491
[M::ha_hist_line]   107: ******** 34384
[M::ha_hist_line]   108: ******* 28223
[M::ha_hist_line]   109: ***** 22483
[M::ha_hist_line]   110: **** 18607
[M::ha_hist_line]   111: **** 14975
[M::ha_hist_line]   112: *** 12513
[M::ha_hist_line]   113: ** 10316
[M::ha_hist_line]   114: ** 8237
[M::ha_hist_line]   115: ** 6969
[M::ha_hist_line]   116: * 6015
[M::ha_hist_line]   117: * 5348
[M::ha_hist_line]   118: * 4850
[M::ha_hist_line]   119: * 4508
[M::ha_hist_line]   120: * 4436
[M::ha_hist_line]   121: * 4296
[M::ha_hist_line]   122: * 4655
[M::ha_hist_line]   123: * 4414
[M::ha_hist_line]   124: * 4850
[M::ha_hist_line]   125: * 5053
[M::ha_hist_line]   126: * 5326
[M::ha_hist_line]   127: * 6256
[M::ha_hist_line]   128: ** 6763
[M::ha_hist_line]   129: ** 7359
[M::ha_hist_line]   130: ** 8371
[M::ha_hist_line]   131: ** 9116
[M::ha_hist_line]   132: ** 10114
[M::ha_hist_line]   133: *** 11557
[M::ha_hist_line]   134: *** 12951
[M::ha_hist_line]   135: *** 14573
[M::ha_hist_line]   136: **** 16195
[M::ha_hist_line]   137: **** 17982
[M::ha_hist_line]   138: ***** 19859
[M::ha_hist_line]   139: ***** 22041
[M::ha_hist_line]   140: ****** 24033
[M::ha_hist_line]   141: ****** 26500
[M::ha_hist_line]   142: ******* 30035
[M::ha_hist_line]   143: ******** 32677
[M::ha_hist_line]   144: ********* 36297
[M::ha_hist_line]   145: ********* 39324
[M::ha_hist_line]   146: ********** 43146
[M::ha_hist_line]   147: *********** 47105
[M::ha_hist_line]   148: ************ 52168
[M::ha_hist_line]   149: ************* 55935
[M::ha_hist_line]   150: *************** 61001
[M::ha_hist_line]   151: **************** 65990
[M::ha_hist_line]   152: ***************** 70743
[M::ha_hist_line]   153: ****************** 74363
[M::ha_hist_line]   154: ******************* 79804
[M::ha_hist_line]   155: ******************** 83768
[M::ha_hist_line]   156: ********************* 88057
[M::ha_hist_line]   157: ********************** 92818
[M::ha_hist_line]   158: *********************** 97623
[M::ha_hist_line]   159: ************************* 103918
[M::ha_hist_line]   160: ************************** 107072
[M::ha_hist_line]   161: ************************** 110105
[M::ha_hist_line]   162: *************************** 113902
[M::ha_hist_line]   163: **************************** 117243
[M::ha_hist_line]   164: **************************** 118933
[M::ha_hist_line]   165: ***************************** 122058
[M::ha_hist_line]   166: ****************************** 123371
[M::ha_hist_line]   167: ****************************** 125091
[M::ha_hist_line]   168: ****************************** 125263
[M::ha_hist_line]   169: ****************************** 125254
[M::ha_hist_line]   170: ****************************** 123856
[M::ha_hist_line]   171: ****************************** 123656
[M::ha_hist_line]   172: ***************************** 121423
[M::ha_hist_line]   173: ***************************** 121400
[M::ha_hist_line]   174: **************************** 117980
[M::ha_hist_line]   175: **************************** 115344
[M::ha_hist_line]   176: *************************** 112655
[M::ha_hist_line]   177: ************************** 109202
[M::ha_hist_line]   178: ************************* 105297
[M::ha_hist_line]   179: ************************ 102172
[M::ha_hist_line]   180: *********************** 97507
[M::ha_hist_line]   181: ********************** 93418
[M::ha_hist_line]   182: ********************* 88400
[M::ha_hist_line]   183: ******************** 83674
[M::ha_hist_line]   184: ******************* 77971
[M::ha_hist_line]   185: ***************** 72480
[M::ha_hist_line]   186: **************** 68366
[M::ha_hist_line]   187: *************** 63165
[M::ha_hist_line]   188: ************** 58702
[M::ha_hist_line]   189: ************* 54012
[M::ha_hist_line]   190: ************ 50360
[M::ha_hist_line]   191: *********** 45887
[M::ha_hist_line]   192: ********** 40846
[M::ha_hist_line]   193: ********* 36887
[M::ha_hist_line]   194: ******** 33506
[M::ha_hist_line]   195: ******* 30266
[M::ha_hist_line]   196: ******* 27487
[M::ha_hist_line]   197: ****** 24333
[M::ha_hist_line]   198: ***** 21602
[M::ha_hist_line]   199: ***** 19303
[M::ha_hist_line]   200: **** 17108
[M::ha_hist_line]   201: **** 15156
[M::ha_hist_line]   202: *** 13661
[M::ha_hist_line]   203: *** 12076
[M::ha_hist_line]   204: *** 10526
[M::ha_hist_line]   205: ** 9215
[M::ha_hist_line]   206: ** 8143
[M::ha_hist_line]   207: ** 7395
[M::ha_hist_line]   208: ** 6602
[M::ha_hist_line]   209: * 5949
[M::ha_hist_line]   210: * 5447
[M::ha_hist_line]   211: * 4869
[M::ha_hist_line]   212: * 4270
[M::ha_hist_line]   213: * 3890
[M::ha_hist_line]   214: * 3731
[M::ha_hist_line]   215: * 3629
[M::ha_hist_line]   216: * 3494
[M::ha_hist_line]   217: * 3613
[M::ha_hist_line]   218: * 3512
[M::ha_hist_line]   219: * 3618
[M::ha_hist_line]   220: * 3772
[M::ha_hist_line]   221: * 3774
[M::ha_hist_line]   222: * 3708
[M::ha_hist_line]   223: * 3818
[M::ha_hist_line]   224: * 3986
[M::ha_hist_line]   225: * 4029
[M::ha_hist_line]   226: * 4380
[M::ha_hist_line]   227: * 4386
[M::ha_hist_line]   228: * 4510
[M::ha_hist_line]   229: * 4678
[M::ha_hist_line]   230: * 4797
[M::ha_hist_line]   231: * 5106
[M::ha_hist_line]   232: * 5197
[M::ha_hist_line]   233: * 5242
[M::ha_hist_line]   234: * 5474
[M::ha_hist_line]   235: * 5733
[M::ha_hist_line]   236: * 6021
[M::ha_hist_line]   237: * 6265
[M::ha_hist_line]   238: * 6246
[M::ha_hist_line]   239: ** 6646
[M::ha_hist_line]   240: ** 6722
[M::ha_hist_line]   241: ** 6844
[M::ha_hist_line]   242: ** 6733
[M::ha_hist_line]   243: ** 7254
[M::ha_hist_line]   244: ** 7250
[M::ha_hist_line]   245: ** 7243
[M::ha_hist_line]   246: ** 7275
[M::ha_hist_line]   247: ** 7405
[M::ha_hist_line]   248: ** 7421
[M::ha_hist_line]   249: ** 7571
[M::ha_hist_line]   250: ** 7291
[M::ha_hist_line]   251: ** 7331
[M::ha_hist_line]   252: ** 7309
[M::ha_hist_line]   253: ** 7278
[M::ha_hist_line]   254: ** 7264
[M::ha_hist_line]   255: ** 7092
[M::ha_hist_line]   256: ** 6912
[M::ha_hist_line]   257: ** 6958
[M::ha_hist_line]   258: ** 6689
[M::ha_hist_line]   259: ** 6607
[M::ha_hist_line]   260: ** 6542
[M::ha_hist_line]   261: * 6242
[M::ha_hist_line]   262: * 6185
[M::ha_hist_line]   263: * 5934
[M::ha_hist_line]   264: * 5717
[M::ha_hist_line]   265: * 5388
[M::ha_hist_line]   266: * 5448
[M::ha_hist_line]   267: * 5279
[M::ha_hist_line]   268: * 4944
[M::ha_hist_line]   269: * 4724
[M::ha_hist_line]   270: * 4549
[M::ha_hist_line]   271: * 4404
[M::ha_hist_line]   272: * 4417
[M::ha_hist_line]   273: * 4041
[M::ha_hist_line]   274: * 3888
[M::ha_hist_line]   275: * 3760
[M::ha_hist_line]   276: * 3592
[M::ha_hist_line]   277: * 3490
[M::ha_hist_line]   278: * 3150
[M::ha_hist_line]   279: * 3001
[M::ha_hist_line]   280: * 2926
[M::ha_hist_line]   281: * 3084
[M::ha_hist_line]   282: * 2758
[M::ha_hist_line]   283: * 2672
[M::ha_hist_line]   284: * 2526
[M::ha_hist_line]   285: * 2432
[M::ha_hist_line]   286: * 2313
[M::ha_hist_line]   287: * 2314
[M::ha_hist_line]   288: * 2255
[M::ha_hist_line]   289: * 2266
[M::ha_hist_line]   290: * 2230
[M::ha_hist_line]   291: * 2124
[M::ha_hist_line]   292: * 2109
[M::ha_hist_line]   293: * 2208
[M::ha_hist_line]   294: * 2187
[M::ha_hist_line]   295: * 2154
[M::ha_hist_line]   296: * 2139
[M::ha_hist_line]   297: * 2160
[M::ha_hist_line]   298: * 2211
[M::ha_hist_line]   299: * 2220
[M::ha_hist_line]   300: * 2364
[M::ha_hist_line]   301: * 2340
[M::ha_hist_line]   302: * 2367
[M::ha_hist_line]   303: * 2554
[M::ha_hist_line]   304: * 2611
[M::ha_hist_line]   305: * 2559
[M::ha_hist_line]   306: * 2525
[M::ha_hist_line]   307: * 2573
[M::ha_hist_line]   308: * 2791
[M::ha_hist_line]   309: * 2797
[M::ha_hist_line]   310: * 2819
[M::ha_hist_line]   311: * 2945
[M::ha_hist_line]   312: * 2958
[M::ha_hist_line]   313: * 3034
[M::ha_hist_line]   314: * 3275
[M::ha_hist_line]   315: * 3267
[M::ha_hist_line]   316: * 3351
[M::ha_hist_line]   317: * 3284
[M::ha_hist_line]   318: * 3513
[M::ha_hist_line]   319: * 3510
[M::ha_hist_line]   320: * 3692
[M::ha_hist_line]   321: * 3704
[M::ha_hist_line]   322: * 3755
[M::ha_hist_line]   323: * 3906
[M::ha_hist_line]   324: * 3910
[M::ha_hist_line]   325: * 3888
[M::ha_hist_line]   326: * 4038
[M::ha_hist_line]   327: * 4052
[M::ha_hist_line]   328: * 4249
[M::ha_hist_line]   329: * 4048
[M::ha_hist_line]   330: * 3908
[M::ha_hist_line]   331: * 4098
[M::ha_hist_line]   332: * 3993
[M::ha_hist_line]   333: * 4066
[M::ha_hist_line]   334: * 4055
[M::ha_hist_line]   335: * 4079
[M::ha_hist_line]   336: * 4027
[M::ha_hist_line]   337: * 3871
[M::ha_hist_line]   338: * 3942
[M::ha_hist_line]   339: * 3919
[M::ha_hist_line]   340: * 3896
[M::ha_hist_line]   341: * 3891
[M::ha_hist_line]   342: * 3721
[M::ha_hist_line]   343: * 3773
[M::ha_hist_line]   344: * 3657
[M::ha_hist_line]   345: * 3596
[M::ha_hist_line]   346: * 3377
[M::ha_hist_line]   347: * 3273
[M::ha_hist_line]   348: * 3190
[M::ha_hist_line]   349: * 3224
[M::ha_hist_line]   350: * 3142
[M::ha_hist_line]   351: * 3076
[M::ha_hist_line]   352: * 3058
[M::ha_hist_line]   353: * 2916
[M::ha_hist_line]   354: * 2776
[M::ha_hist_line]   355: * 2764
[M::ha_hist_line]   356: * 2785
[M::ha_hist_line]   357: * 2701
[M::ha_hist_line]   358: * 2490
[M::ha_hist_line]   359: * 2416
[M::ha_hist_line]   360: * 2361
[M::ha_hist_line]   361: * 2298
[M::ha_hist_line]   362: * 2325
[M::ha_hist_line]   363: * 2146
[M::ha_hist_line]   364: * 2136
[M::ha_hist_line]   365: * 2099
[M::ha_hist_line]  rest: *********************************************************************************************** 396842
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 83; peak_het: -1
[M::ha_ct_shrink::285039.075*35.33] ==> counted 17036513 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137320573 minimizers
[M::ha_pt_gen::285482.798*35.31] ==> indexed 2135588024 positions, counted 17036513 distinct minimizer k-mers
[M::ha_assemble::297514.365*35.34@246.283GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 1183659490
[M::ha_print_ovlp_stat] # strong overlaps: 596745657
[M::ha_print_ovlp_stat] # weak overlaps: 586913833
[M::ha_print_ovlp_stat] # exact overlaps: 1149035029
[M::ha_print_ovlp_stat] # inexact overlaps: 34624461
[M::ha_print_ovlp_stat] # overlaps without large indels: 1180991551
[M::ha_print_ovlp_stat] # reverse overlaps: 410771757
Writing reads to disk... 
Reads has been written.

Thats a lot of information. The Hifiasm output page says that for heterozygous samples (which mine likely are), there should be 2 peaks in the k-mer plot, where the smaller peak is around the heterozygous read coverage and the larger peak is around the homozygous read coverage. This is true for all k-mer plots produced from this data. In all of my k-mer plots, the homozygous peak is 83 and the heterozygous peak is 168. I’m going to include the information without the k-mer plots below because they are so large:

[M::ha_analyze_count] lowest: count[17] = 11309
[M::ha_analyze_count] highest: count[85] = 9499858

## first k-mer plot

[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[168] = 3093394
[M::ha_ft_gen] peak_hom: 168; peak_het: 85
[M::ha_ct_shrink::3427.856*4.32] ==> counted 2684382 distinct minimizer k-mers
[M::ha_ft_gen::3431.212*4.31@20.882GB] ==> filtered out 2684382 k-mers occurring 840 or more times
[M::ha_opt_update_cov] updated max_n_chain to 840
[M::yak_count] collected 2139678804 minimizers
[M::ha_pt_gen::4355.642*5.80] ==> counted 31723566 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[17] = 2230
[M::ha_analyze_count] highest: count[83] = 418590

## second k-mer plot

[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[168] = 125423
[M::ha_pt_gen] peak_hom: 168; peak_het: 83
[M::ha_ct_shrink::4355.907*5.81] ==> counted 17720475 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2139678804 minimizers
[M::ha_pt_gen::4818.141*7.37] ==> indexed 2125675713 positions, counted 17720475 distinct minimizer k-mers
[M::ha_assemble::100006.981*34.52@107.237GB] ==> corrected reads for round 1
[M::ha_assemble] # bases: 79183709778; # corrected bases: 91988177; # recorrected bases: 115414
[M::ha_assemble] size of buffer: 61.709GB
[M::yak_count] collected 2137476150 minimizers
[M::ha_pt_gen::100401.244*34.48] ==> counted 19037497 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[13] = 1719
[M::ha_analyze_count] highest: count[83] = 417869

## third k-mer plot

[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 83; peak_het: -1
[M::ha_ct_shrink::100401.532*34.48] ==> counted 17060420 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137476150 minimizers
[M::ha_pt_gen::100851.319*34.43] ==> indexed 2135499073 positions, counted 17060420 distinct minimizer k-mers
[M::ha_assemble::192251.063*35.13@143.827GB] ==> corrected reads for round 2
[M::ha_assemble] # bases: 79186546424; # corrected bases: 1694280; # recorrected bases: 16078
[M::ha_assemble] size of buffer: 60.552GB
[M::yak_count] collected 2137334800 minimizers
[M::ha_pt_gen::192627.741*35.11] ==> counted 18787923 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[9] = 2352
[M::ha_analyze_count] highest: count[83] = 417664

## fourth k-mer plot

[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 83; peak_het: -1
[M::ha_ct_shrink::192628.074*35.11] ==> counted 17040202 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137334800 minimizers
[M::ha_pt_gen::193066.706*35.08] ==> indexed 2135587079 positions, counted 17040202 distinct minimizer k-mers
[M::ha_assemble::284661.787*35.35@241.898GB] ==> corrected reads for round 3
[M::ha_assemble] # bases: 79186730278; # corrected bases: 235080; # recorrected bases: 12650
[M::ha_assemble] size of buffer: 60.546GB
[M::yak_count] collected 2137320573 minimizers
[M::ha_pt_gen::285038.830*35.33] ==> counted 18769062 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[9] = 2315
[M::ha_analyze_count] highest: count[83] = 417609

## fifth k-mer plot

[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 83; peak_het: -1
[M::ha_ct_shrink::285039.075*35.33] ==> counted 17036513 distinct minimizer k-mers
[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137320573 minimizers
[M::ha_pt_gen::285482.798*35.31] ==> indexed 2135588024 positions, counted 17036513 distinct minimizer k-mers
[M::ha_assemble::297514.365*35.34@246.283GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 1183659490
[M::ha_print_ovlp_stat] # strong overlaps: 596745657
[M::ha_print_ovlp_stat] # weak overlaps: 586913833
[M::ha_print_ovlp_stat] # exact overlaps: 1149035029
[M::ha_print_ovlp_stat] # inexact overlaps: 34624461
[M::ha_print_ovlp_stat] # overlaps without large indels: 1180991551
[M::ha_print_ovlp_stat] # reverse overlaps: 410771757
Writing reads to disk... 
Reads has been written.
Writing ma_hit_ts to disk... 
ma_hit_ts has been written.
Writing ma_hit_ts to disk... 
ma_hit_ts has been written.
bin files have been written.
[M::purge_dups] homozygous read coverage threshold: 168
[M::purge_dups] purge duplication coverage threshold: 210
Writing raw unitig GFA to disk... 
Writing processed unitig GFA to disk... 
[M::purge_dups] homozygous read coverage threshold: 168
[M::purge_dups] purge duplication coverage threshold: 210
[M::mc_solve_core::0.284] ==> Partition
[M::adjust_utg_by_primary] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.bp.p_ctg.gfa to disk... 
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.bp.hap1.p_ctg.gfa to disk... 
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.bp.hap2.p_ctg.gfa to disk... 
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.16.1-r375
[M::main] CMD: hifiasm -o apul.hifiasm -t 36 m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq
[M::main] Real time: 304623.035 sec; CPU: 10520661.634 sec; Peak RSS: 246.283 GB

Okay so 4 rounds of assembly were done, hypothetically improving the assembly each time. Weirdly, the first 2 rounds had the heterozygous peak at 68, but the last couple of rounds had the heterozygous peak at -1. In all k-mer plots, there is a high peak at the 1 location, but this doesn’t seem unusual (based on the example posts here and here provided by the hifiasm log interpretation section). Overall, I think this file is providing information on the assembly iterations.

Given all of this information, let’s now run BUSCO to assess completeness of assembly. In the scripts folder: nano busco_unfilt_hifiasm.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.bp.p_ctg.gfa > apul.hifiasm.bp.p_ctg.fa

echo "Begin busco on unfiltered hifiasm-assembled fasta" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.bp.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini"  -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.busco.canu -m genome

echo "busco complete for unfiltered hifiasm-assembled fasta" $(date)

Submitted batch job 305426. I could also run some analysis on the hap assemblies, but I’m going to wait until I am done with the filtering and re-assembly, as that assembly is the one that I will likely be moving forward with. How are they making the distinction between hap1/father and hap2/mother?

Busco ran in ~45 mins and results look a lot better than they did with canu! This is the primary result from the output file:

2024-03-11 13:37:21 INFO:       

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.3%[S:92.0%,D:1.3%],F:3.1%,M:3.6%,n:954      |
        |890    Complete BUSCOs (C)                       |
        |878    Complete and single-copy BUSCOs (S)       |
        |12     Complete and duplicated BUSCOs (D)        |
        |30     Fragmented BUSCOs (F)                     |
        |34     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

This looks so much better than the canu assembly! The previous canu assembly was 94.4% complete, but had only 9.4% single copy BUSCOs and 85% duplicated BUSCOs. Ideally, the duplication level should be lower. The hifiasm assembly had 93.3% completeness, 92% single copy BUSCOs, and 1.3% duplicated BUSCOs. Hifiasm is definitely the way to go for assembly. Now I just have to wait until the blast prok is done so I can remove any contamination. After I remove contamination, I will re-assemble using hifiasm.

20240313

Blast prok failed after 2 days and then restarted on Andromeda. Maybe I should stop the code after 2 days…idk. Maybe I need to increase the memory for the job? Canceling the job (305351) and increasing the memory (#SBATCH --mem=500GB). Submitted batch job 308997

20240316

Copied the prok data to my local computer. It is still running on the server, but I’m nervous it will restart again. If it does restart, I’ll cancel the job and just use this data from today. On my local computer, I combined the prok and viral blast results and then removed any hits whose bit score was <1000.

cd /Users/jillashey/Desktop/PutnamLab/Apulchra_genome
cat viral_contaminant_hits_rr.txt prok_contaminant_hits_rr.txt > all_contaminant_hits_rr.txt
awk '$12 > 1000 {print $0}' all_contaminant_hits_rr.txt > contaminant_hits_pv_passfilter_rr.txt

I then looked at the contamination hits in R. See code here.

As a summary, I first read in the eukaryotic blast hits that passed the contamination threshold. I found that only 2 reads had any euk contamination (m84100_240128_024355_s2/48759857/ccs and m84100_240128_024355_s2/234751852/ccs). I then read in the prokaryotic and viral blast hits that passed the threshold (only 224 blast hits passed). I calculated the percentage of each hits align length to the contigs so if there was a result that had 100%, that would mean that the whole contig was a contaminant. I looked at a histogram of the % alignments and found that most of the % alignments are on the lower size and there are not many 100% sequences. I summarized the contigs that were to be filtered out and found that 222 contigs had some level of pv contamination. I added the euk + pv contamination reads together (224 total) and calculated the proportion of contamination to raw reads. The contamination ended up being only 0.003797649% of the raw reads, which is pretty amazing! I calculated the mean length of the filtered reads (13,242.1 bp) and the sums of the unfiltered read length (79183709778 total bp) and filtered read length (79181142809 total bp). Using these lengths, I calculated a rough estimation of sequencing depth and found we have roughly 100x coverage! That’s similar to the Young et al. 2024 results as well. Finally, I wrote the list of filtered reads to a text file on my local computer. This information will be used to filter the raw reads on Andromeda prior to assembly. I’m impressed with the low contamination and the high coverage of the PacBio HiFi reads.

20240319

Prok blast script finally finished running! Took about 5 days. Cat the prok and viral results together and remove anything that has a bit score <1000.

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

cat viral_contaminant_hits_rr.txt prok_contaminant_hits_rr.txt > all_contaminant_hits_rr.txt
awk '$12 > 1000 {print $0}' all_contaminant_hits_rr.txt > contaminant_hits_pv_passfilter_rr.txt

Similarly to what I did above, I looked at the contamination hits in R. See code here.

As a summary, I first read in the eukaryotic blast hits that passed the contamination threshold. I found that only 2 reads had any euk contamination (m84100_240128_024355_s2/48759857/ccs and m84100_240128_024355_s2/234751852/ccs). I then read in the prokaryotic and viral blast hits that passed the threshold (only 2 blast hits passed). I calculated the percentage of each hits align length to the contigs so if there was a result that had 100%, that would mean that the whole contig was a contaminant. I looked at a histogram of the % alignments and found that most of the % alignments are on the lower size and there are not many 100% sequences. I summarized the contigs that were to be filtered out and found that 494 contigs had some level of pv contamination. I added the euk + pv contamination reads together (496 contigs total) and calculated the proportion of contamination to raw reads. The contamination ended up being only 0.00840908% of the raw reads, which is pretty amazing! I calculated the mean length of the filtered reads (13,424.84 bp) and the sums of the unfiltered read length (79183709778 total bp) and filtered read length (79178244314 total bp). Using these lengths, I calculated a rough estimation of sequencing depth and found we have roughly 100x coverage! That’s similar to the Young et al. 2024 results as well. Finally, I wrote the list of filtered reads to a text file on my local computer. This information will be used to filter the raw reads on Andromeda prior to assembly. I’m impressed with the low contamination and the high coverage of the PacBio HiFi reads.

20240320

Before filtered the hifi reads, I’m going to clean up the /data/putnamlab/jillashey/Apul_Genome/assembly/data folder so that I have more memory for the next steps. Here’s whats in there right now:

total 312G
-rw-r--r--. 1 jillashey 148G Feb  8 02:37 m84100_240128_024355_s2.hifi_reads.bc1029.fastq.fastq
-rwxr-xr-x. 1 jillashey 1.1K Feb  8 14:13 apul.seqStore.sh
-rw-r--r--. 1 jillashey  951 Feb  8 14:31 apul.seqStore.err
drwxr-xr-x. 3 jillashey 4.0K Feb  8 14:32 apul.seqStore
-rw-r--r--. 1 jillashey  23K Feb  9 01:47 apul.report
-rw-r--r--. 1 jillashey 7.0M Feb  9 01:51 apul.contigs.layout.tigInfo
-rw-r--r--. 1 jillashey 155M Feb  9 01:51 apul.contigs.layout.readToTig
-rw-r--r--. 1 jillashey 2.8G Feb  9 01:57 apul.unassembled.fasta
-rw-r--r--. 1 jillashey 943M Feb  9 02:01 apul.contigs.fasta
drwxr-xr-x. 2 jillashey 4.0K Feb 13 14:01 busco_output
drwxr-xr-x. 3 jillashey 4.0K Feb 13 14:01 busco_downloads
-rw-r--r--. 1 jillashey 7.2K Feb 13 14:02 busco_96228.log
lrwxrwxrwx. 1 jillashey  108 Feb 20 14:08 m84100_240128_024355_s2.hifi_reads.bc1029.bam -> /data/putnamlab/KITT/hputnam/20240129_Apulchra_Genome_LongRead/m84100_240128_024355_s2.hifi_reads.bc1029.bam
-rw-r--r--. 1 jillashey  106 Feb 21 16:48 pb.fofn
drwxr-xr-x. 9 jillashey 4.0K Feb 21 16:53 unitigging
-rw-r--r--. 1 jillashey  74G Mar  1 16:08 m84100_240128_024355_s2.hifi_reads.bc1029.fasta
-rw-r--r--. 1 jillashey 244M Mar  1 16:36 rr_read_lengths.txt
-rw-r--r--. 1 jillashey 106K Mar  3 16:27 contaminant_hits_euks_rr.txt
-rw-r--r--. 1 jillashey 2.4K Mar  3 16:29 contaminants_pass_filter_euks_rr.txt
-rw-r--r--. 1 jillashey  69M Mar  5 23:20 viral_contaminant_hits_rr.txt
-rw-r--r--. 1 jillashey  19G Mar  7 23:58 apul.hifiasm.ec.bin
-rw-r--r--. 1 jillashey  47G Mar  8 00:08 apul.hifiasm.ovlp.source.bin
-rw-r--r--. 1 jillashey  17G Mar  8 00:12 apul.hifiasm.ovlp.reverse.bin
-rw-r--r--. 1 jillashey 1.2G Mar  8 01:37 apul.hifiasm.bp.r_utg.gfa
-rw-r--r--. 1 jillashey  21M Mar  8 01:37 apul.hifiasm.bp.r_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.6M Mar  8 01:41 apul.hifiasm.bp.r_utg.lowQ.bed
-rw-r--r--. 1 jillashey 1.1G Mar  8 01:42 apul.hifiasm.bp.p_utg.gfa
-rw-r--r--. 1 jillashey  21M Mar  8 01:42 apul.hifiasm.bp.p_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.2M Mar  8 01:46 apul.hifiasm.bp.p_utg.lowQ.bed
-rw-r--r--. 1 jillashey 506M Mar  8 01:47 apul.hifiasm.bp.p_ctg.gfa
-rw-r--r--. 1 jillashey  11M Mar  8 01:47 apul.hifiasm.bp.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar  8 01:49 apul.hifiasm.bp.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 469M Mar  8 01:50 apul.hifiasm.bp.hap1.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar  8 01:50 apul.hifiasm.bp.hap1.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar  8 01:52 apul.hifiasm.bp.hap1.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 468M Mar  8 01:52 apul.hifiasm.bp.hap2.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar  8 01:52 apul.hifiasm.bp.hap2.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 1.9M Mar  8 01:54 apul.hifiasm.bp.hap2.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 495M Mar 11 12:52 apul.hifiasm.bp.p_ctg.fa
-rw-r--r--. 1 jillashey 246M Mar 19 14:05 prok_contaminant_hits_rr.txt
-rw-r--r--. 1 jillashey 315M Mar 19 16:08 all_contaminant_hits_rr.txt
-rw-r--r--. 1 jillashey 1.1M Mar 19 16:09 contaminant_hits_pv_passfilter_rr.txt

I removed the following:

rm -r apul.seqStore
rm apul*
rm -r busco*
rm pb.fofn 
rm -r unitigging/

Now I have more space. Copy the file all_contam_rem_good_hifi_read_list.txt that was generated from the R code mentioned above. This specific file was written starting on line 242. It contains the reads that have passed contamination filtering. I copied this file into /data/putnamlab/jillashey/Apul_Genome/assembly/data.

wc -l all_contam_rem_good_hifi_read_list.txt
 5897892 all_contam_rem_good_hifi_read_list.txt

The vast majority of the hifi reads are retained after contamination filtering, which is a good sign of high quality sequencing. My next step is to subset the raw hifi fasta file to remove the contaminants identified above. I can do this with the seqtk subseq command. In the scripts folder: nano subseq.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load seqtk/1.3-GCC-9.3.0

echo "Subsetting hifi reads that passed contamination filtering" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

seqtk subseq m84100_240128_024355_s2.hifi_reads.bc1029.fasta all_contam_rem_good_hifi_read_list.txt > hifi_rr_allcontam_rem.fasta

echo "Subsetting complete!" $(date)

Submitted batch job 309659. Finished in about 8 mins but the output file looks like this:

>m84100_240128_024355_s2/261887593/ccs:18224-18224
A
>m84100_240128_024355_s2/255530003/ccs:21870-21870
A
>m84100_240128_024355_s2/249237028/ccs:23691-23691
A
>m84100_240128_024355_s2/262606536/ccs:14772-14772
A
>m84100_240128_024355_s2/217322854/ccs:12923-12923
A
>m84100_240128_024355_s2/256512826/ccs:12914-12914
A
>m84100_240128_024355_s2/245632166/ccs:28440-28440
A
>m84100_240128_024355_s2/250548903/ccs:23076-23076
A
>m84100_240128_024355_s2/256054930/ccs:15405-15405
A
>m84100_240128_024355_s2/241242930/ccs:15521-15521
A
>m84100_240128_024355_s2/254348773/ccs:14578-14578
C
>m84100_240128_024355_s2/252319399/ccs:12407-12407
A
>m84100_240128_024355_s2/229183717/ccs:5757-5757

Not ideal. It looks like it only took the first letter from each sequence. Maybe I need to remove the length information from the all_contam_rem_good_hifi_read_list.txt file?

awk '{$2=""; print $0}' all_contam_rem_good_hifi_read_list.txt > output_file.txt

Edit the script so that the list of reads to keep is output_file.txt and decrease mem to 250GB. Submitted batch job 309672. This appears to have worked!

zgrep -c ">" hifi_rr_allcontam_rem.fasta
5897892

Following Young et al. 2024, I need to use Merqury and GenomeScope2 analysis of the cleaned reads. Merqury and Meryl seem to be related somehow, but not sure. Meryl is a tool for counting and working with sets of k-mers. It is a part of Canu as well. So it seems like Meryl counts the k-mers and Merqury estimates accuracy and completeness? Young et al. 2024 used only meryl here to generate a kmer database, which was then used as input to genomescope2. Merqury was used after hifiasm assembly.

First, the best value for k needs to be determined for use in Meryl. This can be done with the best_k.sh script from Merqury. I’m going to copy this script into my own scripts folder. Young et al. 2024 used 500mb estimated genome size. This is what I will use too, as previous coral/Acropora genomes are about this size. I will need to install Meryl and Merqury first. I’m following the Meryl and Merqury githubs for installation instructions.

For Meryl:

wget https://github.com/marbl/meryl/releases/download/v1.4.1/meryl-1.4.1.Linux-amd64.tar.xz
tar -xJf meryl-1.4.1.Linux-amd64.tar.xz
export PATH=/data/putnamlab/jillashey/Apul_Genome/assembly/meryl-1.4.1/bin:$PATH

Now that I have exported the PATH variable to include the directory where the Meryl executable files are, I can run Meryl commands (I think), regardless of working directory. For now, lets run meryl with k=18. The k-mer database will be generated using k=18 and the cleaned reads. In the scripts folder: nano meryl.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

export PATH=/data/putnamlab/jillashey/Apul_Genome/assembly/meryl-1.4.1/bin:$PATH

echo "Creating meryl k-mer db" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

meryl k=18 \
count hifi_rr_allcontam_rem.fasta \
output meryl_merc

echo "Meryl k-mer db complete!" $(date)

Submitted batch job 309682. While this runs, lets try to install Merqury.

git clone https://github.com/marbl/merqury.git
cd merqury
export MERQURY=$PWD:$PATH

Hypothetically, Merqury is now installed. Perhaps now we can run the best k script? I am not sure what genome size estimate to use. Shinzato et al. 2020 assessed several Acropora genomes and found the genome size ranged from 384 (Acropora microphthalma) to 447 (Acropora hyacinthus) Mb. I think I will go with 450 Mb. Let’s try to run the best k script.

$MERQURY/best_k.sh 450000000

Giving me this error:

-bash: /data/putnamlab/jillashey/Apul_Genome/assembly/merqury:/data/putnamlab/jillashey/Apul_Genome/assembly/meryl-1.4.1/bin:/path/to/meryl-1.4.1/bin:/path/to/meryl/…/bin:/opt/software/Miniconda3/4.9.2/bin:/opt/software/Miniconda3/4.9.2/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/jillashey/.local/bin:/home/jillashey/bin/best_k.sh: No such file or directory

I’m not sure what it means or how to fix it. On the Merqury github, it lists dependencies that Merqury may need to run:

  • gcc 10.2.0 or higher (for installing Meryl)
  • Meryl v1.4.1
  • Java run time environment (JRE)
  • R with argparse, ggplot2, and scales (recommend R 4.0.3+)
  • bedtools
  • samtools

Maybe I need to load these? Idk I am confused. I may just continue with the initial assembly…because I don’t really see the importance of this step. In Young et al. 2024, it says “the kmer profile of cleaned raw HiFi reads was generated with Meryl [34], and used for genome profiling with GenomeScope2 [69] to estimate genome size, repetitiveness, heterozygosity, and ploidy.” I will come back to this. I may need to email Kevin Bryan to install meryl, merqury and genomescope2.

While I wait for his response, I will start running the initial hifiasm assembly with default flags. I have a script for hifiasm on the server, but I’m going to make a new one for the initial assembly. In the scripts folder: nano initial_hifiasm.sh

#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/hifiasm

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Starting assembly with hifiasm" $(date)

hifiasm -o apul.hifiasm.intial hifi_rr_allcontam_rem.fasta -t 36 2> apul_hifiasm_allcontam_rem_initial.asm.log

echo "Assembly with hifiasm complete!" $(date)

conda deactivate

Submitted batch job 309689

20240325

Initial assembly ran in about 3 days. These are the files that were generated:

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

-rw-r--r--. 1 jillashey  19G Mar 24 02:10 apul.hifiasm.intial.ec.bin
-rw-r--r--. 1 jillashey  47G Mar 24 02:20 apul.hifiasm.intial.ovlp.source.bin
-rw-r--r--. 1 jillashey  17G Mar 24 02:23 apul.hifiasm.intial.ovlp.reverse.bin
-rw-r--r--. 1 jillashey 1.2G Mar 24 03:40 apul.hifiasm.intial.bp.r_utg.gfa
-rw-r--r--. 1 jillashey  21M Mar 24 03:40 apul.hifiasm.intial.bp.r_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.6M Mar 24 03:44 apul.hifiasm.intial.bp.r_utg.lowQ.bed
-rw-r--r--. 1 jillashey 1.1G Mar 24 03:45 apul.hifiasm.intial.bp.p_utg.gfa
-rw-r--r--. 1 jillashey  21M Mar 24 03:45 apul.hifiasm.intial.bp.p_utg.noseq.gfa
-rw-r--r--. 1 jillashey 8.2M Mar 24 03:49 apul.hifiasm.intial.bp.p_utg.lowQ.bed
-rw-r--r--. 1 jillashey 506M Mar 24 03:50 apul.hifiasm.intial.bp.p_ctg.gfa
-rw-r--r--. 1 jillashey  11M Mar 24 03:50 apul.hifiasm.intial.bp.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar 24 03:52 apul.hifiasm.intial.bp.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 469M Mar 24 03:52 apul.hifiasm.intial.bp.hap1.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar 24 03:52 apul.hifiasm.intial.bp.hap1.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 2.0M Mar 24 03:54 apul.hifiasm.intial.bp.hap1.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey 468M Mar 24 03:55 apul.hifiasm.intial.bp.hap2.p_ctg.gfa
-rw-r--r--. 1 jillashey 9.9M Mar 24 03:55 apul.hifiasm.intial.bp.hap2.p_ctg.noseq.gfa
-rw-r--r--. 1 jillashey 1.9M Mar 24 03:56 apul.hifiasm.intial.bp.hap2.p_ctg.lowQ.bed
-rw-r--r--. 1 jillashey  80K Mar 24 03:57 apul_hifiasm_allcontam_rem_initial.asm.log

Many files. The output file description can be found above and also on the hifiasm output website. The log output file contains the k-mer histogram (similar to what is posted above), which shows two peaks, indicative of a heterozygous genome assembly. This is what the bottom of the log file looks like:

[M::ha_pt_gen::] counting in normal mode
[M::yak_count] collected 2137171007 minimizers
[M::ha_pt_gen::281249.073*35.34] ==> indexed 2135566893 positions, counted 17030417 distinct minimizer k-mers
[M::ha_assemble::292970.825*35.36@202.994GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 1183659340
[M::ha_print_ovlp_stat] # strong overlaps: 596745652
[M::ha_print_ovlp_stat] # weak overlaps: 586913688
[M::ha_print_ovlp_stat] # exact overlaps: 1149035007
[M::ha_print_ovlp_stat] # inexact overlaps: 34624333
[M::ha_print_ovlp_stat] # overlaps without large indels: 1180991403
[M::ha_print_ovlp_stat] # reverse overlaps: 410771728
Writing reads to disk... 
Reads has been written.
Writing ma_hit_ts to disk... 
ma_hit_ts has been written.
Writing ma_hit_ts to disk... 
ma_hit_ts has been written.
bin files have been written.
[M::purge_dups] homozygous read coverage threshold: 168
[M::purge_dups] purge duplication coverage threshold: 210
Writing raw unitig GFA to disk... 
Writing processed unitig GFA to disk... 
[M::purge_dups] homozygous read coverage threshold: 168
[M::purge_dups] purge duplication coverage threshold: 210
[M::mc_solve_core::0.308] ==> Partition
[M::adjust_utg_by_primary] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.intial.bp.p_ctg.gfa to disk... 
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.intial.bp.hap1.p_ctg.gfa to disk... 
[M::adjust_utg_by_trio] primary contig coverage range: [142, infinity]
Writing apul.hifiasm.intial.bp.hap2.p_ctg.gfa to disk... 
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.16.1-r375
[M::main] CMD: hifiasm -o apul.hifiasm.intial -t 36 hifi_rr_allcontam_rem.fasta
[M::main] Real time: 299500.413 sec; CPU: 10366612.139 sec; Peak RSS: 202.994 GB

Does the M::purge_dups mean that this is the value that I should be using for the purge_dups flag in hifiasm? It gives me two lines for M::purge_dups: homozygous read coverage threshold: 168 and purge duplication coverage threshold: 210. I may need to talk with Ross and Hollie more about this. I’m going to QC the assembly and the haplotype assemblies with busco and quast. In the scripts folder: nano initial_qc.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.intial.bp.p_ctg.gfa > apul.hifiasm.intial.bp.p_ctg.fa

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.intial.bp.hap1.p_ctg.gfa > apul.hifiasm.intial.bp.hap1.p_ctg.fa

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.intial.bp.hap2.p_ctg.gfa > apul.hifiasm.intial.bp.hap2.p_ctg.fa

echo "Begin busco on filtered hifiasm-assembled fasta (initial run)" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.intial.bp.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.initial.busco -m genome

echo "busco complete for unfiltered hifiasm-assembled fasta (initial run)" $(date)
echo "Begin busco on hifiasm-assembled haplotype 1 fasta" $(date)

# Reset query 
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.intial.bp.hap1.p_ctg.fa" # set this to the query (genome/transcriptome) you are running

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.initial.hap1.busco -m genome

echo "busco complete for hifiasm-assembled haplotype 1 fasta (initial run)" $(date)
echo "Begin busco on hifiasm-assembled haplotype 2 fasta" $(date)

# Reset query 
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.intial.bp.hap2.p_ctg.fa" # set this to the query (genome/transcriptome) you are running

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.initial.hap2.busco -m genome

echo "busco complete for hifiasm-assembled haplotype 2 fasta (initial run)" $(date)
echo "busco complete all assemblies of interest (initial run)" $(date)
echo "Begin quast of primary and haplotypes (initial run)" $(date)

module load QUAST/5.2.0-foss-2021b 
# there is another version of quast if the one above does not work: QUAST/5.0.2-foss-2020b-Python-2.7.18

quast -t 15 --eukaryote \
apul.hifiasm.intial.bp.p_ctg.fa \
apul.hifiasm.intial.bp.hap1.p_ctg.fa \
apul.hifiasm.intial.bp.hap2.p_ctg.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast

echo "Quast complete (initial run); all QC complete!" $(date)

Submitted batch job 309969. Ran in ~2 hours. Busco ran but only for the primariy assembly. For some reason, the hap1 and hap2 busco did not run. Quast appears to have failed with these errors:

foss/2021b(24):ERROR:150: Module 'foss/2021b' conflicts with the currently loaded module(s) 'foss/2020b'
foss/2021b(24):ERROR:102: Tcl command execution failed: conflict foss

Python/3.9.6-GCCcore-11.2.0(61):ERROR:150: Module 'Python/3.9.6-GCCcore-11.2.0' conflicts with the currently loaded module(s) 'Python/3.8.6-GCCcore-10.2.0'
Python/3.9.6-GCCcore-11.2.0(61):ERROR:102: Tcl command execution failed: conflict Python

Perl/5.34.0-GCCcore-11.2.0(133):ERROR:150: Module 'Perl/5.34.0-GCCcore-11.2.0' conflicts with the currently loaded module(s) 'Perl/5.32.0-GCCcore-10.2.0'
Perl/5.34.0-GCCcore-11.2.0(133):ERROR:102: Tcl command execution failed: conflict Perl

foss/2021b(24):ERROR:150: Module 'foss/2021b' conflicts with the currently loaded module(s) 'foss/2020b'
foss/2021b(24):ERROR:102: Tcl command execution failed: conflict foss

Python/3.9.6-GCCcore-11.2.0(61):ERROR:150: Module 'Python/3.9.6-GCCcore-11.2.0' conflicts with the currently loaded module(s) 'Python/3.8.6-GCCcore-10.2.0'
Python/3.9.6-GCCcore-11.2.0(61):ERROR:102: Tcl command execution failed: conflict Python

SciPy-bundle/2021.10-foss-2021b(30):ERROR:150: Module 'SciPy-bundle/2021.10-foss-2021b' conflicts with the currently loaded module(s) 'SciPy-bundle/2020.11-foss-2020b'
SciPy-bundle/2021.10-foss-2021b(30):ERROR:102: Tcl command execution failed: conflict SciPy-bundle

GCCcore/11.2.0(24):ERROR:150: Module 'GCCcore/11.2.0' conflicts with the currently loaded module(s) 'GCCcore/10.2.0'
GCCcore/11.2.0(24):ERROR:102: Tcl command execution failed: conflict GCCcore

Basically just a lot of conflicting modules. Below are the results for the Busco code for the primary assembly:

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.3%[S:92.0%,D:1.3%],F:3.1%,M:3.6%,n:954      |
        |890    Complete BUSCOs (C)                       |
        |878    Complete and single-copy BUSCOs (S)       |
        |12     Complete and duplicated BUSCOs (D)        |
        |30     Fragmented BUSCOs (F)                     |
        |34     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Going to edit the initial_qc.sh script so that I am including all things for busco to run properly. I’m also commenting out the lines that ran successfully and switching the module to QUAST/5.0.2-foss-2020b-Python-2.7.18. Submitted batch job 309984. It ran, but only the hap1 busco scores were generated, not hap2. Below are the results for the hap1 assembly:

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.3%[S:92.8%,D:0.5%],F:3.0%,M:3.7%,n:954      |
        |890    Complete BUSCOs (C)                       |
        |885    Complete and single-copy BUSCOs (S)       |
        |5      Complete and duplicated BUSCOs (D)        |
        |29     Fragmented BUSCOs (F)                     |
        |35     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Quast also failed again with this error:

Python/2.7.18-GCCcore-10.2.0(58):ERROR:150: Module 'Python/2.7.18-GCCcore-10.2.0' conflicts with the currently loaded module(s) 'Python/3.8.6-GCCcore-10.2.0'
Python/2.7.18-GCCcore-10.2.0(58):ERROR:102: Tcl command execution failed: conflict Python

Python/2.7.18-GCCcore-10.2.0(58):ERROR:150: Module 'Python/2.7.18-GCCcore-10.2.0' conflicts with the currently loaded module(s) 'Python/3.8.6-GCCcore-10.2.0'
Python/2.7.18-GCCcore-10.2.0(58):ERROR:102: Tcl command execution failed: conflict Python

I’m going to add module purge and module load Python/2.7.18-GCCcore-10.2.0 prior to loading quast. I’m also going to comment out lines that have already run successfully. Submitted batch job 310034. Ran in ~45 mins. Below are the results for the hap2 assembly:

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:94.0%[S:92.7%,D:1.3%],F:2.9%,M:3.1%,n:954      |
        |896    Complete BUSCOs (C)                       |
        |884    Complete and single-copy BUSCOs (S)       |
        |12     Complete and duplicated BUSCOs (D)        |
        |28     Fragmented BUSCOs (F)                     |
        |30     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Once again, quast failed to run and I got this error:

ERROR! File not found (contigs): apul.hifiasm.intial.bp.p_ctg.fa
In case you have troubles running QUAST, you can write to quast.support@cab.spbu.ru
or report an issue on our GitHub repository https://github.com/ablab/quast/issues
Please provide us with quast.log file from the output directory.

I’m going to write quast its own script. Quast can be run with or without a reference genome. I’m going to try both, using Amillepora as the reference genome. In the scripts folder: nano initial_quast.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

module purge
module load Python/2.7.18-GCCcore-10.2.0
module load QUAST/5.0.2-foss-2020b-Python-2.7.18
# previously used QUAST/5.2.0-foss-2021b but it failed and produced module conflict errors

echo "Begin quast of primary and haplotypes (initial run) w/ reference" $(date)

quast -t 10 --eukaryote \
apul.hifiasm.intial.bp.p_ctg.fa \
apul.hifiasm.intial.bp.hap1.p_ctg.fa \
apul.hifiasm.intial.bp.hap2.p_ctg.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast

echo "Quast complete (initial run); all QC complete!" $(date)

Submitted batch job 310038. I need to read more about quast command line options, as there seem to be a lot of options. Also need to look into including a reference vs not including a reference. Ran super fast but success! Quast is super informative!!!!! The most useful information is in the report.* files. This is from the report.txt file:

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    apul.hifiasm.intial.bp.p_ctg  apul.hifiasm.intial.bp.hap1.p_ctg  apul.hifiasm.intial.bp.hap2.p_ctg  Amil.v2.01.chrs
# contigs (>= 0 bp)         188                           275                                162                                854            
# contigs (>= 1000 bp)      188                           275                                162                                851            
# contigs (>= 5000 bp)      188                           273                                162                                748            
# contigs (>= 10000 bp)     186                           271                                162                                672            
# contigs (>= 25000 bp)     166                           246                                153                                545            
# contigs (>= 50000 bp)     98                            163                                124                                445            
Total length (>= 0 bp)      518528298                     481372407                          480341213                          475381253      
Total length (>= 1000 bp)   518528298                     481372407                          480341213                          475378544      
Total length (>= 5000 bp)   518528298                     481363561                          480341213                          475052084      
Total length (>= 10000 bp)  518514885                     481349140                          480341213                          474498957      
Total length (>= 25000 bp)  518188097                     480901871                          480181619                          472383091      
Total length (>= 50000 bp)  515726224                     477880931                          479141568                          468867721      
# contigs                   188                           275                                162                                854            
Largest contig              45111900                      21532546                           22038975                           39361238       
Total length                518528298                     481372407                          480341213                          475381253      
GC (%)                      39.05                         39.03                              39.04                              39.06          
N50                         16268372                      12353884                           13054353                           19840543       
N75                         13007972                      7901416                            8791894                            1469964        
L50                         11                            15                                 15                                 9              
L75                         20                            28                                 26                                 23             
# N's per 100 kbp           0.00                          0.00                               0.00                               7.79

Look at all of that info! The initial assembly appears to be the best in terms of all the stats. The total length is longer than the Amillepora and the haplotype assemblies. Additionally, it has the largest contig. The Amillepora genome has a higher N50 but the N50 for the primary assembly still looks good. There were 188 contigs generated in the primary assembly. Haplotype 1 assembly had more contigs (275), while haplotype 2 had less (162). Amillepora has 854 contigs which is so high! The primary assembly contig number is much lower than Atenuis (614), Adigitifera (955), or Amillepora (854). Quast, you have converted me <3. My next steps are to play with the -s flag in hifiasm to determine the threshold at which duplicate haplotigs should be purged. The default is 0.55 and Young et al. (2024) ran it with a range of values (0.55, 0.50, 0.45, 0.40, 0.35, 0.30). They found that all worked well to resolve haplotypes, so they stuck with the default of 0.55. They also used the --primary flag, which outputs a primary and alternate assembly as opposed to the primary, hap1 and hap2 assemblies. In their code, they justified this by saying “running hifiasm using the primary flag as we have no real way of knowing if the haplotypes produced are real or not” (line 826).

Starting a run where -s is 0.3 and 0.8. In the /data/putnamlab/jillashey/Apul_Genome/assembly/scripts folder, nano s30_hifiasm.sh:

#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/hifiasm

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Starting assembly with hifiasm" $(date)

hifiasm -o apul.hifiasm.s30 hifi_rr_allcontam_rem.fasta -s 0.3 -t 36 2> apul_hifiasm_allcontam_rem_s30.asm.log

echo "Assembly with hifiasm complete!" $(date)

conda deactivate

Submitted batch job 310048. In the scripts folder, nano s80_hifiasm.sh:

#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/hifiasm

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Starting assembly with hifiasm" $(date)

hifiasm -o apul.hifiasm.s80 hifi_rr_allcontam_rem.fasta -s 0.80 -t 36 2> apul_hifiasm_allcontam_rem_s80.asm.log

echo "Assembly with hifiasm complete!" $(date)

conda deactivate

Submitted batch job 310049

20240329

The s30 script finished running in about 3 days. Now I’m going to run busco and quast for QC on these assemblies. In the scripts folder: nano s30_primary_busco.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s30.bp.p_ctg.gfa > apul.hifiasm.s30.bp.p_ctg.fa

echo "Begin busco on hifiasm-assembled primary fasta with -s 0.30" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s30.bp.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s30.primary.busco -m genome

echo "busco complete for hifiasm-assembled primary fasta with -s 0.30" $(date)

Submitted batch job 310241. In the scripts folder: nano s30_hap1_busco.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s30.bp.hap1.p_ctg.gfa > apul.hifiasm.s30.bp.hap1.p_ctg.fa

echo "Begin busco on hifiasm-assembled fasta hap1 with -s 0.30" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s30.bp.hap1.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s30.hap1.busco -m genome

echo "busco complete for hifiasm-assembled fasta hap1 with -s 0.30" $(date)

Submitted batch job 310242. In the scripts folder: nano s30_hap2_busco.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s30.bp.hap2.p_ctg.gfa > apul.hifiasm.s30.bp.hap2.p_ctg.fa

echo "Begin busco on hifiasm-assembled fasta hap2 with -s 0.30" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s30.bp.hap2.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s30.hap2.busco -m genome

echo "busco complete for hifiasm-assembled fasta hap2 with -s 0.30" $(date)

Submitted batch job 310243. In the scripts folder: nano s30_quast.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

module purge
module load Python/2.7.18-GCCcore-10.2.0
module load QUAST/5.0.2-foss-2020b-Python-2.7.18
# previously used QUAST/5.2.0-foss-2021b but it failed and produced module conflict errors

echo "Begin quast of primary and haplotypes (s30 run) w/ reference" $(date)

quast -t 10 --eukaryote \
apul.hifiasm.s30.bp.p_ctg.fa \
apul.hifiasm.s30.bp.hap1.p_ctg.fa \
apul.hifiasm.s30.bp.hap2.p_ctg.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/s30

echo "Quast complete (s30 run); all QC complete!" $(date)

Submitted batch job 310244. So many jobs! The primary busco finished in about an hour. Let’s look at the results.

Busco for primary assembly:

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.3%[S:92.9%,D:0.4%],F:3.1%,M:3.6%,n:954      |
        |890    Complete BUSCOs (C)                       |
        |886    Complete and single-copy BUSCOs (S)       |
        |4      Complete and duplicated BUSCOs (D)        |
        |30     Fragmented BUSCOs (F)                     |
        |34     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Pretty similar to the initial primary assembly, which was the same in completeness (93.3%) but slightly lower in single copy buscos (92% in initial vs 92.9% in s30).

Busco for the hap1 assembly:

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.4%[S:92.9%,D:0.5%],F:3.1%,M:3.5%,n:954      |
        |891    Complete BUSCOs (C)                       |
        |886    Complete and single-copy BUSCOs (S)       |
        |5      Complete and duplicated BUSCOs (D)        |
        |30     Fragmented BUSCOs (F)                     |
        |33     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Almost identical to primary assembly with this flag. Also pretty similar to the initial hap1 assembly.

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.9%[S:93.6%,D:0.3%],F:2.8%,M:3.3%,n:954      |
        |896    Complete BUSCOs (C)                       |
        |893    Complete and single-copy BUSCOs (S)       |
        |3      Complete and duplicated BUSCOs (D)        |
        |27     Fragmented BUSCOs (F)                     |
        |31     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Once again, pretty similar to the primary and hap1 assemblies, as well as the initial hap2 assembly. So what do these results mean? That the assembly results aren’t really affected by the -s flag? I will have to see what the s80 results look like.

20240401

The s80 script finished a couple of days ago. Now time to assess completeness and what not. In the scripts folder: nano s80_primary_busco.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s80.bp.p_ctg.gfa > apul.hifiasm.s80.bp.p_ctg.fa

echo "Begin busco on hifiasm-assembled primary fasta with -s 0.80" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s80.bp.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s80.primary.busco -m genome

echo "busco complete for hifiasm-assembled primary fasta with -s 0.80" $(date)

Submitted batch job 310321. In the scripts folder: nano s80_hap1_busco.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s80.bp.hap1.p_ctg.gfa > apul.hifiasm.s80.bp.hap1.p_ctg.fa

echo "Begin busco on hifiasm-assembled fasta hap1 with -s 0.80" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s80.bp.hap1.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s80.hap1.busco -m genome

echo "busco complete for hifiasm-assembled fasta hap1 with -s 0.80" $(date)

Submitted batch job 310322. In the scripts folder: nano s80_hap2_busco.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data/

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s80.bp.hap2.p_ctg.gfa > apul.hifiasm.s80.bp.hap2.p_ctg.fa

echo "Begin busco on hifiasm-assembled fasta hap2 with -s 0.80" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s80.bp.hap2.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.s80.hap2.busco -m genome

echo "busco complete for hifiasm-assembled fasta hap2 with -s 0.80" $(date)

Submitted batch job 310323

Talked with Hollie last week about possibly assembly the mitochondrial genome for Apul. There are some Apul mito sequences on NCBI, which I’m going to pull and blast against the pacbio raw reads. If there are any hits, I will know that mito sequences are present in the data and its possible to do a mito assembly. There are 16 putative mito sequences for Apul in the NCBI link above. I’m going to pull those sequences and blast them. Make mito folders:

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data
mkdir mito
cd mito

Using the putative Apul mito sequences from NCBI, make a fasta file with the 16 sequences. File is called mito_seqs_ncbi.fasta. Similar to what I did with the viral, euk and prok sequences, blast the mito sequences against the raw reads. In the scripts folder: nano blastn_mito_seqs_ncbi.sh

#!/bin/bash 
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Build mito seq db" $(date)

makeblastdb -in /data/putnamlab/jillashey/Apul_Genome/assembly/data/mito/mito_seqs_ncbi.fasta -dbtype nucl -out /data/putnamlab/jillashey/Apul_Genome/assembly/data/mito/mito_seqs_ncbi_db

echo "Blasting hifi reads against viral genomes to look for contaminants" $(date)

blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db /data/putnamlab/jillashey/Apul_Genome/assembly/data/mito/mito_seqs_ncbi_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out mito_hits_rr.txt

echo "Blast complete!" $(date)

Submitted batch job 310324. Checking back a couple hours later. Let’s look at the mito results first:

wc -l mito_hits_rr.txt 
12449 mito_hits_rr.txt

head mito_hits_rr.txt 
m84100_240128_024355_s2/246617425/ccs	NC_081454.1:14774-15130	94.175	103	6	0	6728	6830	308	206	1.49e-39	158
m84100_240128_024355_s2/246617425/ccs	NC_081454.1:2434-16341	94.175	103	6	0	6728	6830	12648	12546	1.49e-39	158
m84100_240128_024355_s2/246485798/ccs	NC_081454.1:14774-15130	94.175	103	6	0	4066	4168	308	206	8.50e-40	158
m84100_240128_024355_s2/246485798/ccs	NC_081454.1:2434-16341	94.175	103	6	0	4066	4168	12648	12546	8.50e-40	158
m84100_240128_024355_s2/248320207/ccs	NC_081454.1:14442-14741	92.079	202	14	2	19987	20187	67	267	3.13e-77	283
m84100_240128_024355_s2/248320207/ccs	NC_081454.1:2434-16341	92.079	202	14	2	19987	20187	12075	12275	3.13e-77	283
m84100_240128_024355_s2/257166685/ccs	NC_081454.1:14442-14741	93.035	201	14	0	16520	16720	67	267	1.18e-80	294
m84100_240128_024355_s2/257166685/ccs	NC_081454.1:2434-16341	93.035	201	14	0	16520	16720	12075	12275	1.18e-80	294
m84100_240128_024355_s2/255267441/ccs	NC_081454.1:14442-14741	92.537	201	14	1	14988	15187	67	267	1.86e-78	287
m84100_240128_024355_s2/255267441/ccs	NC_081454.1:2434-16341	92.537	201	14	1	14988	15187	12075	12275	1.86e-78	287

I need to talk more with Hollie about the mito asssembly because I am still a little confused about this portion.

The busco information for the s80 assembly run also finished. Let’s look at the results!

Primary assembly:

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.5%[S:89.8%,D:3.7%],F:3.2%,M:3.3%,n:954      |
        |892    Complete BUSCOs (C)                       |
        |857    Complete and single-copy BUSCOs (S)       |
        |35     Complete and duplicated BUSCOs (D)        |
        |31     Fragmented BUSCOs (F)                     |
        |31     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Hap1 assembly:

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.6%[S:91.4%,D:2.2%],F:3.1%,M:3.3%,n:954      |
        |893    Complete BUSCOs (C)                       |
        |872    Complete and single-copy BUSCOs (S)       |
        |21     Complete and duplicated BUSCOs (D)        |
        |30     Fragmented BUSCOs (F)                     |
        |31     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Hap2 assembly:

        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:92.3%[S:91.0%,D:1.3%],F:2.9%,M:4.8%,n:954      |
        |880    Complete BUSCOs (C)                       |
        |868    Complete and single-copy BUSCOs (S)       |
        |12     Complete and duplicated BUSCOs (D)        |
        |28     Fragmented BUSCOs (F)                     |
        |46     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Assemblies look quite similar to one another and to the prior assemblies. 89.8% of single copy buscos for the primary assembly is the lowest of all of the assemblies. I’m now going to run quast with the initial, -s 0.30, -s 0.80, and the Amillepora assemblies to compare. In the scripts folder: nano test_quast.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

module purge
module load Python/2.7.18-GCCcore-10.2.0
module load QUAST/5.0.2-foss-2020b-Python-2.7.18
# previously used QUAST/5.2.0-foss-2021b but it failed and produced module conflict errors

echo "Begin quast of initial, s30, and s80 assemblies w/ reference" $(date)

quast -t 15 --eukaryote \
apul.hifiasm.intial.bp.p_ctg.fa \
apul.hifiasm.intial.bp.hap1.p_ctg.fa \
apul.hifiasm.intial.bp.hap2.p_ctg.fa \
apul.hifiasm.s30.bp.p_ctg.fa \
apul.hifiasm.s30.bp.hap1.p_ctg.fa \
apul.hifiasm.s30.bp.hap2.p_ctg.fa \
apul.hifiasm.s80.bp.p_ctg.fa \
apul.hifiasm.s80.bp.hap1.p_ctg.fa \
apul.hifiasm.s80.bp.hap2.p_ctg.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/s80

echo "Quast complete" $(date)

Submitted batch job 310352. Finished in about 5 mins. Here’s quast:

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    apul.hifiasm.intial.bp.p_ctg  apul.hifiasm.intial.bp.hap1.p_ctg  apul.hifiasm.intial.bp.hap2.p_ctg  apul.hifiasm.s30.bp.p_ctg  apul.hifiasm.s30.bp.hap1.p_ctg  apul.hifiasm.s30.bp.hap2.p_ctg  apul.hifiasm.s80.bp.p_ctg  apul.hifiasm.s80.bp.hap1.p_ctg  apul.hifiasm.s80.bp.hap2.p_ctg  Amil.v2.01.chrs
# contigs (>= 0 bp)         188                           275                                162                                180                        247                             167                             206                        258                             189                             854            
# contigs (>= 1000 bp)      188                           275                                162                                180                        247                             167                             206                        258                             189                             851            
# contigs (>= 5000 bp)      188                           273                                162                                180                        247                             167                             206                        256                             189                             748            
# contigs (>= 10000 bp)     186                           271                                162                                178                        246                             166                             204                        255                             187                             672            
# contigs (>= 25000 bp)     166                           246                                153                                158                        219                             161                             187                        235                             178                             545            
# contigs (>= 50000 bp)     98                            163                                124                                92                         142                             132                             120                        155                             150                             445            
Total length (>= 0 bp)      518528298                     481372407                          480341213                          504851641                  484060404                       461127429                       558522339                  509131880                       465604880                       475381253      
Total length (>= 1000 bp)   518528298                     481372407                          480341213                          504851641                  484060404                       461127429                       558522339                  509131880                       465604880                       475378544      
Total length (>= 5000 bp)   518528298                     481363561                          480341213                          504851641                  484060404                       461127429                       558522339                  509123034                       465604880                       475052084      
Total length (>= 10000 bp)  518514885                     481349140                          480341213                          504838228                  484053588                       461119824                       558508926                  509116218                       465589833                       474498957      
Total length (>= 25000 bp)  518188097                     480901871                          480181619                          504499732                  483598713                       461019312                       558241588                  508764673                       465424962                       472383091      
Total length (>= 50000 bp)  515726224                     477880931                          479141568                          502109496                  480738474                       460002421                       555789785                  505834173                       464397174                       468867721      
# contigs                   188                           275                                162                                180                        247                             167                             206                        258                             189                             854            
Largest contig              45111900                      21532546                           22038975                           30476199                   22329680                        19744096                        22038975                   22153531                        22038975                        39361238       
Total length                518528298                     481372407                          480341213                          504851641                  484060404                       461127429                       558522339                  509131880                       465604880                       475381253      
GC (%)                      39.05                         39.03                              39.04                              39.04                      39.03                           39.04                           39.07                      39.05                           39.03                           39.06          
N50                         16268372                      12353884                           13054353                           16275225                   13330421                        14742043                        14962207                   11978068                        12847727                        19840543       
N75                         13007972                      7901416                            8791894                            13021168                   9796342                         9573480                         10779388                   8114208                         6210685                         1469964        
L50                         11                            15                                 15                                 13                         14                              14                              16                         16                              14                              9              
L75                         20                            28                                 26                                 21                         24                              24                              27                         29                              26                              23             
# N's per 100 kbp           0.00                          0.00                               0.00                               0.00                       0.00                            0.00                            0.00                       0.00                            0.00                            7.79           

So much good info!!! The initial assembly still has the largest contig, but the s80 assembly has the longest total lengths. The Amillepora genome still has the best N50 value but the initial assembly also has a good N50. Overall, the initial assembly is the best out of these assemblies. The initial hap2 assembly has the lowest number of contigs (162). Out of the primary assemblies, the s30 assembly has the lowest number of contigs (180), while the initial assembly had 188 contigs and the s80 assembly had 206 contigs.

20240408

Let’s see how many rows have >1000 bit score

awk '{ if ($NF > 1000) count++ } END { print count }' mito_hits_rr.txt 
7056

How many rows have a % match >85%?

awk '$3 > 85 {count++} END {print count}' mito_hits_rr.txt
12449

wc -l mito_hits_rr.txt 
12449 mito_hits_rr.txt

There are definitely mito sequences in the raw hifi reads. I’ll be using MitoHiFi to assemble the Apul mito genome. This tool is specific for mitogenome assembly from PacBio HiFi reads. After I assemble it, I will remove it from the hifi raw reads before assembly of the nuclear genome. First, I’ll need to install with conda following the instructions on their github.

cd /data/putnamlab/conda
module load Miniconda3/4.9.2

# Clone repo
git clone https://github.com/marcelauliano/MitoHiFi.git

# Create a conda environment with yml file that is inside MitoHiFi/environment
conda env create -n mitohifi_env -f MitoHiFi/environment/mitohifi_env.yml 

To activate and run the now-installed mitohifi:

conda activate mitohifi_env
(mitohifi_env) python MitoHiFi/src/mitohifi.py -h

Now we can run mito hifi! Go back to assembly folder and create a mito db folder. I will need to use mitohifi command findMitoReference.py to pull mito references from closely related genomes. Young et al. 2024 pulled 4 mito genomes from NCBI (Platygyra carnosa, Favites abdita, Dipsastraea favus, and the old Orbicella faveolata). He then ran mitohifi for all of them with the Ofav hifi reads, which I’m not really sure why he did that. Maybe because he wanted to create a phylogenetic tree downstream? I’m going to pull the Acropora millepora mito sequences as a reference.

When I try to activate the conda env, I am getting this:

conda activate /data/putnamlab/conda/mitohifi_env
Not a conda environment: /data/putnamlab/conda/mitohifi_env

conda activate /data/putnamlab/conda/MitoHiFi
Not a conda environment: /data/putnamlab/conda/MitoHiFi

Very strange…maybe I just need to load miniconda and run python /data/putnamlab/conda/MitoHiFi/src/findMitoReference.py?

Go to the assembly sequence folder: nano find_mito_ref.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=125GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module purge
module load Miniconda3/4.9.2

echo "Grabbing mito refs from NCBI" $(date)

python /data/putnamlab/conda/MitoHiFi/src/findMitoReference.py --species "Acropora millepora" --email jillashey@uri.edu --outfolder /data/putnamlab/jillashey/Apul_Genome/dbs

echo "Mito grab complete!" $(date)

Submitted batch job 310649. Immediately got this error:

Traceback (most recent call last):
  File "/data/putnamlab/conda/MitoHiFi/src/findMitoReference.py", line 23, in <module>
    from Bio import Entrez
ModuleNotFoundError: No module named 'Bio'

So I think I do need to activate the environment. Try to create a new env.

cd /data/putnamlab/conda/
conda create -n mitohifi_env -f MitoHiFi/environment/mitohifi_env.yml 

WARNING: A conda environment already exists at '/home/jillashey/.conda/envs/mitohifi_env'
Remove existing environment (y/[n])? n

Ooooo I have a super secret conda env. Let’s try to activate it in the script.

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=125GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module purge
module load Miniconda3/4.9.2

conda activate /home/jillashey/.conda/envs/mitohifi_env

echo "Grabbing mito refs from NCBI" $(date)

python findMitoReference.py --species "Acropora millepora" --email jillashey@uri.edu --outfolder /data/putnamlab/jillashey/Apul_Genome/dbs

echo "Mito grab complete!" $(date)

conda deactivate

Immediately got this error:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

I need to do conda init but where?

cd /home/jillashey/.conda/envs/mitohifi_env

conda init
no change     /opt/software/Miniconda3/4.9.2/condabin/conda
no change     /opt/software/Miniconda3/4.9.2/bin/conda
no change     /opt/software/Miniconda3/4.9.2/bin/conda-env
no change     /opt/software/Miniconda3/4.9.2/bin/activate
no change     /opt/software/Miniconda3/4.9.2/bin/deactivate
no change     /opt/software/Miniconda3/4.9.2/etc/profile.d/conda.sh
no change     /opt/software/Miniconda3/4.9.2/etc/fish/conf.d/conda.fish
no change     /opt/software/Miniconda3/4.9.2/shell/condabin/Conda.psm1
no change     /opt/software/Miniconda3/4.9.2/shell/condabin/conda-hook.ps1
no change     /opt/software/Miniconda3/4.9.2/lib/python3.8/site-packages/xontrib/conda.xsh
no change     /opt/software/Miniconda3/4.9.2/etc/profile.d/conda.csh
no change     /home/jillashey/.bashrc
No action taken.

Need to figure this out! Here’s what the conda installation portion of their github says:

  1. Install MitoFinder and/or MITOS outside of Conda.
  2. Ensure MitoFinder and/or MITOS are added to the PATH before starting the run. Please note that MitoFinder and/or MITOS should be installed separately and made accessible via the PATH environment variable to ensure their proper integration with MitoHiFi. Once those are installed, do:
#Clone MitoHiFi git repo
git clone https://github.com/marcelauliano/MitoHiFi.git

#create a conda environment with our yml file that is inside MitoHiFi/environment
conda env create -n mitohifi_env -f MitoHiFi/environment/mitohifi_env.yml 

Add MitoFinder and/or MITOS to the PATH and then activate your mitohifi_env conda environment.

Hmm confused. come back to this.

20240527

It’s been a while. Coming back to installing mitohifi. I emailed Kevin Bryan about it on 4/30 and he said:

“For 2, it looks like you created the conda environment on the login node, instead of in an interactive session, so the compute nodes are not seeing it (remember that the /home directory on the login nodes is separate from the compute nodes for legacy reasons; I hope to fix this eventually).”

So I need to create the conda environment in an interactive session.

cd /data/putnamlab/conda
interactive

Once in the interactive session, clone the github

git clone https://github.com/marcelauliano/MitoHiFi.git

Create a conda environment with the yml file inside MitoHiFi/environment

conda env create -n mitohifi_env -f MitoHiFi/environment/mitohifi_env.yml 

Was taking a while to load and then the connection to the server was broken since I was on/off my computer for most of the day doing library prep. Will retry tomorrow when I am on my computer all day

20240610

Back at it. Let’s try to run this as a job so I dont have to sit here all day.

cd /data/putnamlab/conda
mkdir scripts 
cd scripts 

In the scripts folder: nano load_mitohifi.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=125GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/conda/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

## Attempting to install mitohifi: https://github.com/marcelauliano/MitoHiFi 

echo "Start" $(date)

module load Miniconda3/4.9.2

cd /data/putnamlab/conda/

# Go into interactive mode 
interactive

# Clone github 
git clone https://github.com/marcelauliano/MitoHiFi.git

# Create conda env 
conda env create -n mitohifi_env -f MitoHiFi/environment/mitohifi_env.yml 

# Activate conda env 
conda activate mitohifi_env

# Attempt to run mitohifi
python MitoHiFi/src/mitohifi.py -h

# Deactivate conda env 
conda deactivate 

echo "End" $(date)

Submitted batch job 320286. Ran for about an hour and completed but I don’t think it installed properly. From the out file:

Preparing transaction: ...working... done
Verifying transaction: ...working... failed
End Mon Jun 10 08:41:25 EDT 2024

From the error file:

CondaVerificationError: The package for pandas located at /home/jillashey/.conda/pkgs/pandas-1.3.5-py37he8f5f7f_0
appears to be corrupted. The path 'lib/python3.7/site-packages/pandas/util/version/__init__.py'
specified in the package manifest cannot be found.

CondaVerificationError: The package for pandas located at /home/jillashey/.conda/pkgs/pandas-1.3.5-py37he8f5f7f_0
appears to be corrupted. The path 'lib/python3.7/site-packages/pandas/util/version/__pycache__/__init__.cpython-37.pyc'
specified in the package manifest cannot be found.

Bleh.

20240611

Loaded env today (conda activate mitohifi_env) and ran python MitoHiFi/src/mitohifi.py -h and somehow it worked????

usage: MitoHiFi [-h] (-r <reads>.fasta | -c <contigs>.fasta) -f
                <relatedMito>.fasta -g <relatedMito>.gbk -t <THREADS> [-d]
                [-a {animal,plant,fungi}] [-p <PERC>] [-m <BLOOM FILTER>]
                [--max-read-len MAX_READ_LEN] [--mitos]
                [--circular-size CIRCULAR_SIZE]
                [--circular-offset CIRCULAR_OFFSET] [-winSize WINSIZE]
                [-covMap COVMAP] [-v] [-o <GENETIC CODE>]

required arguments:
  -r <reads>.fasta      -r: Pacbio Hifi Reads from your species
  -c <contigs>.fasta    -c: Assembled fasta contigs/scaffolds to be searched
                        to find mitogenome
  -f <relatedMito>.fasta
                        -f: Close-related Mitogenome is fasta format
  -g <relatedMito>.gbk  -k: Close-related species Mitogenome in genebank
                        format
  -t <THREADS>          -t: Number of threads for (i) hifiasm and (ii) the
                        blast search

optional arguments:
  -d                    -d: debug mode to output additional info on log
  -a {animal,plant,fungi}
                        -a: Choose between animal (default) or plant
  -p <PERC>             -p: Percentage of query in the blast match with close-
                        related mito
  -m <BLOOM FILTER>     -m: Number of bits for HiFiasm bloom filter [it maps
                        to -f in HiFiasm] (default = 0)
  --max-read-len MAX_READ_LEN
                        Maximum lenght of read relative to related mito
                        (default = 1.0x related mito length)
  --mitos               Use MITOS2 for annotation (opposed to default
                        MitoFinder
  --circular-size CIRCULAR_SIZE
                        Size to consider when checking for circularization
  --circular-offset CIRCULAR_OFFSET
                        Offset from start and finish to consider when looking
                        for circularization
  -winSize WINSIZE      Size of windows to calculate coverage over the
                        final_mitogenom
  -covMap COVMAP        Minimum mapping quality to filter reads when building
                        final coverage plot
  -v, --version         show program's version number and exit
  -o <GENETIC CODE>     -o: Organism genetic code following NCBI table (for
                        mitogenome annotation): 1. The Standard Code 2. The
                        Vertebrate MitochondrialCode 3. The Yeast
                        Mitochondrial Code 4. The Mold,Protozoan, and
                        Coelenterate Mitochondrial Code and the
                        Mycoplasma/Spiroplasma Code 5. The Invertebrate
                        Mitochondrial Code 6. The Ciliate, Dasycladacean and
                        Hexamita Nuclear Code 9. The Echinoderm and Flatworm
                        Mitochondrial Code 10. The Euplotid Nuclear Code 11.
                        The Bacterial, Archaeal and Plant Plastid Code 12. The
                        Alternative Yeast Nuclear Code 13. The Ascidian
                        Mitochondrial Code 14. The Alternative Flatworm
                        Mitochondrial Code 16. Chlorophycean Mitochondrial
                        Code 21. Trematode Mitochondrial Code 22. Scenedesmus
                        obliquus Mitochondrial Code 23. Thraustochytrium
                        Mitochondrial Code 24. Pterobranchia Mitochondrial
                        Code 25. Candidate Division SR1 and Gracilibacteria
                        Code

Confused but not going to question it…alright! First, pull mito sequences from NCBI. I’m going to use Acropora millepora and Acropora digitifera.

cd /data/putnamlab/jillashey/Apul_Genome/assembly
mkdir mito
cd mito
mkdir ref_mito_genome/

python /data/putnamlab/conda/MitoHiFi/src/findMitoReference.py --species "Acropora millepora" \
--email jillashey@uri.edu \
--outfolder ref_mito_genome/

python /data/putnamlab/conda/MitoHiFi/src/findMitoReference.py --species "Acropora digitifera" \
--email bdy8@miami.edu \
--outfolder ref_mito_genome/

In the ref_mito_genome folder, Acropora millepora output is NC_081453.1.fasta and NC_081453.1.gb and Acropora digitifera is NC_022830.1.fasta and NC_022830.1.gb. Now we need to constract the mito genome using mitohifi.py. In the /data/putnamlab/jillashey/Apul_Genome/assembly/scripts folder: nano mitohifi_amil.sh

#!/bin/bash 
#SBATCH -t 48:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Starting mito assembly with Amillepora refs" $(date)

conda activate mitohifi_env

cd /data/putnamlab/jillashey/Apul_Genome/assembly/mito

python /data/putnamlab/conda/MitoHiFi/src/mitohifi.py -r /data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.fasta \
-f /data/putnamlab/jillashey/Apul_Genome/assembly/mito/ref_mito_genome/NC_081453.1.fasta \
-g /data/putnamlab/jillashey/Apul_Genome/assembly/mito/ref_mito_genome/NC_081453.1.gb \
-t 8 \
-o 5 #invertebrate mitochondrial code

echo "Mito assembly complete!" $(date)

Submitted batch job 320590. Gives me this error:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Had this issue above. Try running in interactive mode? Job 320591. Okay still failing. Going to try to do the conda init.

echo $SHELL
/bin/bash
conda init bash

conda init bash
no change     /opt/software/Miniconda3/4.9.2/condabin/conda
no change     /opt/software/Miniconda3/4.9.2/bin/conda
no change     /opt/software/Miniconda3/4.9.2/bin/conda-env
no change     /opt/software/Miniconda3/4.9.2/bin/activate
no change     /opt/software/Miniconda3/4.9.2/bin/deactivate
no change     /opt/software/Miniconda3/4.9.2/etc/profile.d/conda.sh
no change     /opt/software/Miniconda3/4.9.2/etc/fish/conf.d/conda.fish
no change     /opt/software/Miniconda3/4.9.2/shell/condabin/Conda.psm1
no change     /opt/software/Miniconda3/4.9.2/shell/condabin/conda-hook.ps1
no change     /opt/software/Miniconda3/4.9.2/lib/python3.8/site-packages/xontrib/conda.xsh
no change     /opt/software/Miniconda3/4.9.2/etc/profile.d/conda.csh
no change     /home/jillashey/.bashrc
No action taken.

Nothing happened. The shell seems to think everything is fine. Let’s check out the bash files.

nano ~/.bashrc

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=

# User specific aliases and functions

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/opt/software/Miniconda3/4.9.2/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/opt/software/Miniconda3/4.9.2/etc/profile.d/conda.sh" ]; then
        . "/opt/software/Miniconda3/4.9.2/etc/profile.d/conda.sh"
    else
        export PATH="/opt/software/Miniconda3/4.9.2/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

Everything looks in order. Check conda installation path

ls -l /opt/software/Miniconda3/4.9.2/bin/conda
-rwxr-xr-x. 1 bryank bryank 531 May 13  2021 /opt/software/Miniconda3/4.9.2/bin/conda

echo $PATH 
/opt/software/Miniconda3/4.9.2/bin:/opt/software/Miniconda3/4.9.2/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/jillashey/.local/bin:/home/jillashey/bin

Tried running conda init in /data/putnamlab/conda and /data/putnamlab/conda/MitoHifi but no changes. When I activate the env when I am NOT in an interactive session, it does start to run which is confusing…Need to email Kevin Bryan.

I also need to blast the symbiont genome information. Based on the ITS2 data, the Acropora spp from the Manava site have mostly A1 and D1 symbionts, so I’ll be using the A1 genome and the D1 genome.

20240617

Going to blast to sym genomes now. First download them:

cd /data/putnamlab/jillashey/Apul_Genome/dbs
wget http://smic.reefgenomics.org/download/Smic.genome.scaffold.final.fa.gz
wget https://marinegenomics.oist.jp/symbd/download/102_symbd_genome_scaffold.fa.gz

In the assembly scripts folder: nano blastn_sym.sh

#!/bin/bash 
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Build A1 seq db" $(date)

makeblastdb -in /data/putnamlab/jillashey/Apul_Genome/dbs/Smic.genome.scaffold.final.fa -dbtype nucl -out /data/putnamlab/jillashey/Apul_Genome/dbs/A1_db

echo "Build D1 seq db" $(date)

makeblastdb -in /data/putnamlab/jillashey/Apul_Genome/dbs/102_symbd_genome_scaffold.fa -dbtype nucl -out /data/putnamlab/jillashey/Apul_Genome/dbs/D1_db

echo "Blasting hifi reads against symbiont A1 genome to look for contaminants" $(date)

blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db /data/putnamlab/jillashey/Apul_Genome/dbs/A1_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out sym_A1_contaminant_hits_rr.txt

echo "A1 blast complete! Now blasting hifi reads against symbiont D1 genome to look for contaminants" $(date)

blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db /data/putnamlab/jillashey/Apul_Genome/dbs/D1_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out sym_D1_contaminant_hits_rr.txt

echo "D1 blast complete!"$(date)

Submitted batch job 323705. Ran in about 1.5 days.

20240619

Cat the sym blast results together and remove anything that has a bit score <1000.

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

cat sym_A1_contaminant_hits_rr.txt sym_D1_contaminant_hits_rr.txt > sym_contaminant_hits_rr.txt
awk '$12 > 1000 {print $0}' sym_contaminant_hits_rr.txt > contaminant_hits_sym_passfilter_rr.txt

wc -l contaminant_hits_sym_passfilter_rr.txt 
12 contaminant_hits_sym_passfilter_rr.txt

Pretty clean when bit scores <1000 are removed. Copy this data onto my computer and remove the contaminants in the R script.

Still need to get mito hifi to work.

20240622

TRINITY DID IT!!!!!! She is my hero!!!!! She used a docker singularity install and was able to run it. Here are the output files and folders:

cd /data/putnamlab/tconn/mito

-rw-r--r--. 1 trinity.conn  19K Jun 21 13:58 final_mitogenome.fasta
-rw-r--r--. 1 trinity.conn  33K Jun 21 13:58 final_mitogenome.gb
-rw-r--r--. 1 trinity.conn  275 Jun 21 13:58 contigs_stats.tsv
-rw-r--r--. 1 trinity.conn 1004 Jun 21 13:58 shared_genes.tsv
-rw-r--r--. 1 trinity.conn  48K Jun 21 13:58 final_mitogenome.annotation.png
-rw-r--r--. 1 trinity.conn  48K Jun 21 13:58 contigs_annotations.png
-rw-r--r--. 1 trinity.conn  37K Jun 21 13:58 all_potential_contigs.fa
-rw-r--r--. 1 trinity.conn  20K Jun 21 13:58 coverage_plot.png
-rw-r--r--. 1 trinity.conn  20K Jun 21 13:59 final_mitogenome.coverage.png
drwxr-xr-x. 4 trinity.conn 4.0K Jun 21 13:59 potential_contigs
drwxr-xr-x. 2 trinity.conn 4.0K Jun 21 13:59 contigs_circularization
drwxr-xr-x. 2 trinity.conn 4.0K Jun 21 13:59 final_mitogenome_choice
drwxr-xr-x. 2 trinity.conn 4.0K Jun 21 13:59 reads_mapping_and_assembly
drwxr-xr-x. 2 trinity.conn 4.0K Jun 21 13:59 contigs_filtering
drwxr-xr-x. 2 trinity.conn 4.0K Jun 21 13:59 coverage_mapping
drwxr-xr-x. 2 trinity.conn 4.0K Jun 21 17:55 MitoHifi_out

Let’s look at the output files (explanation of output files is here on the mitohifi github. The final_mitogenome.fasta is the final mitochondria circularized and rotated to start at tRNA-Phe and is 18480 bp long for our genome; the final_mitogenome.gb is the final mitochondria annotated in GenBank format.

head final_mitogenome.fasta
>ptg000003l_rc_rotated
CAAACATTAGGACAATAAGACCTGACTTCATCCAAGTGACAAACCACTGGGTTAAATCTG
TTTTATGTTTAATACACAAATTGACGACGGCCATGCAATACCTGTCAATGAAGGATTCAA
GTTTGGGTAAGGTCTCTCGCGGACTATCGAATTAAACGACACGCTCCTCTAATTAAAACA
GTGAACAGCCAAGTTTTTTGAATTTTAACCTTGCGGTCGTACTACTCAAGCGGAAAATTT
CTGACTTTTTAGGATTGCTTCACATCTTTTTCATTATTTACAGTATAGACTACCAGGGTC
CCTAATCCTGTTTGCTCCCCATACTCTCGTGTTTTAGCCATCACACTATAATCTCAAAAA
TAAATAGTCTTCACGTCTAAAGTTCTTTTTTCTATTTACACATTCCACCGCTACAAAAAA
ATTCCATTTACCTTCTTAAATTATAAAACCCTTTTTAATTAAAACGGCCTATCACACCCT
TTACGCTTTTGCCCACAAAACTAGCCCTTAAGTTTCACCGCGTCTGCTGGCACTTAATTT

The final final_mitogenome.coverage.png shows the sequencing coverage throughout the final mitogenome

The final final_mitogenome.annotation.png shows the predicted genes throughout the final mitogenome

The contigs_stats.tsv file contains the statistics of your assembled mitos such as the number of genes, size, whether it was circularized or not, if the sequence has frameshifts, etc.

less contig_stats.tsv 

# Related mitogenome is 18479 bp long and has 17 genes
contig_id       frameshifts_found       annotation_file length(bp)      number_of_genes was_circular
final_mitogenome        No frameshift found     final_mitogenome.gb     18480   24      True
ptg000003l      No frameshift found     final_mitogenome.gb     18480   24      True

The shared_genes.tsv shows the comparison of annotation between close-related mitogenome and all potential contigs assembled.

less shared_genes.tsv

contig_id       shared_genes    unique_to_contig        unique_to_relatedMito
final_mitogenome        {'ATP6': [1, 1], 'ATP8': [1, 1], 'COX1': [1, 1], 'COX2': [1, 1], 'COX3': [1, 1], 'CYTB': [1, 1], 'ND1': [1, 1], 'ND2': [1, 1], 'ND3': [1, 1], 'ND4': [1, 1], 'ND4L': [1, 1], 'ND5': [1, 1], 'ND6': [1, 1], 'tRNA-Met': [1, 1], 'tRNA-Trp': [1, 1]}      {'rrnL': [1, 0], 'tRNA-Arg': [1, 0], 'tRNA-Asp': [1, 0], 'tRNA-Gln': [1, 0], 'tRNA-Glu': [1, 0], 'tRNA-Gly': [1, 0], 'tRNA-His': [1, 0], 'tRNA-Pro': [1, 0], 'tRNA-Ser': [1, 0]}        {'l-rRNA': [0, 1], 's-rRNA': [0, 1]}
ptg000003l      {'ATP6': [1, 1], 'ATP8': [1, 1], 'COX1': [1, 1], 'COX2': [1, 1], 'COX3': [1, 1], 'CYTB': [1, 1], 'ND1': [1, 1], 'ND2': [1, 1], 'ND3': [1, 1], 'ND4': [1, 1], 'ND4L': [1, 1], 'ND5': [1, 1], 'ND6': [1, 1], 'tRNA-Met': [1, 1], 'tRNA-Trp': [1, 1]}      {'rrnL': [1, 0], 'tRNA-Arg': [1, 0], 'tRNA-Asp': [1, 0], 'tRNA-Gln': [1, 0], 'tRNA-Glu': [1, 0], 'tRNA-Gly': [1, 0], 'tRNA-His': [1, 0], 'tRNA-Pro': [1, 0], 'tRNA-Ser': [1, 0]}        {'l-rRNA': [0, 1], 's-rRNA': [0, 1]}

I uploaded all of these files onto the Apul genome github in a mito assembly folder. With the completed mito assembly, I can now blast the Apul mito fasta against the hifi reads. In the /data/putnamlab/jillashey/Apul_Genome/assembly/scripts folder: nano blastn_mito.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load BLAST+/2.13.0-gompi-2022a

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Build Apul mito genome db" $(date)

makeblastdb -in /data/putnamlab/tconn/mito/final_mitogenome.fasta -dbtype nucl -out /data/putnamlab/jillashey/Apul_Genome/dbs/mito_db

echo "Blasting hifi reads against mito genome to look for contaminants" $(date)

blastn -query m84100_240128_024355_s2.hifi_reads.bc1029.fasta -db /data/putnamlab/jillashey/Apul_Genome/dbs/mito_db -outfmt 6 -evalue 1e-4 -perc_identity 90 -out mito_contaminant_hits_rr.txt

echo "Mito blast complete!"$(date)

Submitted batch job 324454. Once this is done running, I can purge all the potential contaminants! Ran in about 2.5 hours. Remove all hits <1000.

awk '$12 > 1000 {print $0}' mito_contaminant_hits_rr.txt > contaminant_hits_mito_passfilter_rr.txt

wc -l contaminant_hits_mito_passfilter_rr.txt 
1921 contaminant_hits_mito_passfilter_rr.txt

Copy contaminant_hits_mito_passfilter_rr.txt onto computer and identify the reads that are contaminants. This will produce the file all_contam_rem_good_hifi_read_list.txt, which represents the raw hifi reads with the ones marked as contaminants removed. Copy the file all_contam_rem_good_hifi_read_list.txt that was generated from the R script. This specific file was written starting on line 290. It contains the reads that have passed contamination filtering. I copied this file into /data/putnamlab/jillashey/Apul_Genome/assembly/data.

wc -l all_contam_rem_good_hifi_read_list.txt
5896466 all_contam_rem_good_hifi_read_list.txt

Remove the length information from the file

awk '{$2=""; print $0}' all_contam_rem_good_hifi_read_list.txt > output_file.txt

wc -l output_file.txt 
5896466 output_file.txt

Run the subseq.sh script in /data/putnamlab/jillashey/Apul_Genome/assembly/scripts to subset the raw hifi fasta file to remove the contaminants identified above. Submitted batch job 324463

20240623

Script above ran in about 22 minutes and the resulting output file is /data/putnamlab/jillashey/Apul_Genome/assembly/data/hifi_rr_allcontam_rem.fasta. This file represents the raw hifi reads with eukaryotic, mitochondrial, symbiont, viral and prokaryotic contaminant reads removed. Out of 5898386 raw hifi reads, there were only 1922 that were identified as contamination. This is only 0.03258519% of the raw reads, which is pretty amazing!

Now that we have clean reads, assembly can begin! In my crazy code above, I ran a couple of different iterations of hifiasm changing the -s option, which sets a similary threshold for duplicate haplotigs that should be purged; the default is 0.55. The iterations that I ran (0.3, 0.55, and 0.8) all worked well to resolve haplotypes with the heterozygosity. Therefore, I stuck with the default 0.55 option. I’m also using -primary to output a primary and alternate assembly, instead of an assembly and two haplotype assemblies, as we have no real way of knowing if the haplotypes produced are real or not.

In the scripts folder, modify hifiasm.sh.

#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/hifiasm

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Starting assembly with hifiasm" $(date)

hifiasm -o apul.hifiasm.s55_pa hifi_rr_allcontam_rem.fasta --primary -s 0.55 -t 36 2> apul_hifiasm_allcontam_rem_s55_pa.log

echo "Assembly with hifiasm complete!" $(date)

conda deactivate

Submitted batch job 324472

20240626

Took about 3 days to run. Yay output!! The primary assembly file is apul.hifiasm.s55_pa.p_ctg.gfa and the alternate assembly file is apul.hifiasm.s55_pa.a_ctg.gfa. Let’s QC! Convert gfa to fa

## PRIMARY 
awk '/^S/{print ">"$2"\n"$3}' apul.hifiasm.s55_pa.p_ctg.gfa | fold > apul.hifiasm.s55_pa.p_ctg.fa

zgrep -c ">" apul.hifiasm.s55_pa.p_ctg.fa
187

## ALTERNATE 
awk '/^S/{print ">"$2"\n"$3}' apul.hifiasm.s55_pa.a_ctg.gfa | fold > apul.hifiasm.s55_pa.a_ctg.fa

zgrep -c ">" apul.hifiasm.s55_pa.a_ctg.fa
3548

Run busco for the primary and alternate assembly. In the scripts folder: nano busco_qc.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

###### PRIMARY ASSEMBLY w/ -s 0.55

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s55_pa.p_ctg.gfa > apul.hifiasm.s55_pa.p_ctg.fa

echo "Begin busco on hifiasm-assembled primary fasta with -s 0.55" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.primary.busco -m genome

echo "busco complete for hifiasm-assembled primary fasta with -s 0.55" $(date)

###### ALTERNATE ASSEMBLY w/ -s 0.55

awk '/^S/{print ">"$2;print $3}' apul.hifiasm.s55_pa.a_ctg.gfa > apul.hifiasm.s55_pa.a_ctg.fa

echo "Begin busco on hifiasm-assembled alternate fasta with -s 0.55" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.a_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.alternate.busco -m genome

echo "busco complete for hifiasm-assembled alternate fasta with -s 0.55" $(date)

Submitted batch job 327004. Got this error:

Use --offline to prevent permission denied issues from downloads

2024-06-27 08:00:19 ERROR:      The input file does not contain nucleotide sequences.
2024-06-27 08:00:19 ERROR:      BUSCO analysis failed !
2024-06-27 08:00:19 ERROR:      Check the logs, read the user guide (https://busco.ezlab.org/busco_userguide.html), and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues

Convert gfa to fa

## PRIMARY 
awk '/^S/{print ">"$2"\n"$3}' apul.hifiasm.s55_pa.p_ctg.gfa | fold > apul.hifiasm.s55_pa.p_ctg.fa

zgrep -c ">" apul.hifiasm.s55_pa.p_ctg.fa
187

## ALTERNATE 
awk '/^S/{print ">"$2"\n"$3}' apul.hifiasm.s55_pa.a_ctg.gfa | fold > apul.hifiasm.s55_pa.a_ctg.fa

zgrep -c ">" apul.hifiasm.s55_pa.a_ctg.fa
3548

In the scripts folder: nano quast_qc.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

module purge
module load Python/2.7.18-GCCcore-10.2.0
module load QUAST/5.0.2-foss-2020b-Python-2.7.18
# previously used QUAST/5.2.0-foss-2021b but it failed and produced module conflict errors

echo "Begin quast of primary and alternate assemblies w/ reference" $(date)

quast -t 10 --eukaryote \
apul.hifiasm.s55_pa.p_ctg.fa \
apul.hifiasm.s55_pa.a_ctg.fa \
/data/putnamlab/jillashey/genome/Ofav_Young_et_al_2024/Orbicella_faveolata_gen_17.scaffolds.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
/data/putnamlab/jillashey/genome/Aten/GCA_014633955.1_Aten_1.0_genomic.fna \
/data/putnamlab/jillashey/genome/Ahya/GCA_014634145.1_Ahya_1.0_genomic.fna \
/data/putnamlab/jillashey/genome/Ayon/GCA_014634225.1_Ayon_1.0_genomic.fna \
/data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
/data/putnamlab/jillashey/genome/Pacuta/V2/Pocillopora_acuta_HIv2.assembly.fasta \
/data/putnamlab/jillashey/genome/Peve/Porites_evermanni_v1.fa \
/data/putnamlab/jillashey/genome/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.fna \
/data/putnamlab/jillashey/genome/Pcomp/Porites_compressa_contigs.fasta \
/data/putnamlab/jillashey/genome/Plutea/plut_final_2.1.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast

echo "Quast complete; all QC complete!" $(date)

Run busco for the primary and alternate assembly. In the scripts folder: nano busco_qc.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

###### PRIMARY ASSEMBLY w/ -s 0.55

echo "Begin busco on hifiasm-assembled primary fasta with -s 0.55" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.p_ctg.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.primary.busco -m genome

echo "busco complete for hifiasm-assembled primary fasta with -s 0.55" $(date)

###### ALTERNATE ASSEMBLY w/ -s 0.55

#echo "Begin busco on hifiasm-assembled alternate fasta with -s 0.55" $(date)

#labbase=/data/putnamlab
#busco_shared="${labbase}/shared/busco"
#[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.a_ctg.fa" # set this to the query (genome/transcriptome) you are running
#[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

#source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
#cd "${labbase}/${Apul_Genome/assembly/data}"
#busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.alternate.busco -m genome

#echo "busco complete for hifiasm-assembled alternate fasta with -s 0.55" $(date)

Only going to do the primary because was getting erros before in the input file formats. The busco for alternate assembly is commented out. Submitted batch job 327033. Ran successfully in about 30 mins! Here are the results for the primary assembly:

# BUSCO version is: 5.2.2 
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /data/putnamlab/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.p_ctg.fa
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk

        ***** Results: *****

        C:93.3%[S:92.0%,D:1.3%],F:3.1%,M:3.6%,n:954        
        890     Complete BUSCOs (C)                        
        878     Complete and single-copy BUSCOs (S)        
        12      Complete and duplicated BUSCOs (D)         
        30      Fragmented BUSCOs (F)                      
        34      Missing BUSCOs (M)                         
        954     Total BUSCO groups searched                

Dependencies and versions:
        hmmsearch: 3.3
        metaeuk: GITDIR-NOTFOUND

93.3% completeness, which is the same as my initial/iterative runs. 92% of single copy BUSCOs, which is great for the assembly.

Quast also finished running in about 6 mins. It created a lot of output files:

2024-06-27 09:00:57
RESULTS:
  Text versions of total report are saved to /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/report.txt, report.tsv, and report.tex
  Text versions of transposed total report are saved to /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/transposed_report.txt, transposed_report.tsv, and transposed_report.tex
  HTML version (interactive tables and plots) is saved to /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/report.html
  PDF version (tables and plots) is saved to /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/report.pdf
  Icarus (contig browser) is saved to /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/icarus.html
  Log is saved to /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast/quast.log

Downloaded icarus.html, report.html, report.pdf, and report.txt to /Users/jillashey/Desktop/PutnamLab/Repositories/Apulchra_genome/output/assembly/primary on my personal computer.

20240630

The pipeline from Young et al. 2024 used ragtag and ntlinks to scaffold the assembly. Now I need to do the same. I used hifiasm to assembly the long reads into contigs (approx 168 contigs in the primary assembly). Next, I need to assembly the contigs into scaffolds.

  • Contigs = set of partially overlapping reads
  • Scffold = set of contigs ordered and oriented to position information. The scaffolds also incorporate empty spaces or gaps.

Young et al. 2024 ended up going with the ntlinks to assemble the scaffolds. The ragtag program uses old reference genomes, so Young et al. used the old Orbicella genome assembly and I would have to use one of the old Acropora assemblies. ntlinks uses the long read information along with the newly assembled contigs. I think I will go with ntlink because it is a tool for de novo genome assembly and long read data can be used with it. The ntlink software has options to run multiple iterations/rounds of ntlink to achieve the highest possible contiguity without sacrificing assembly correctness. From the Basic Protocol 3 from the ntlinks paper: “Using the in-code round capability of ntLink allows a user to maximize the contiguity of the final assembly without needing to manually run ntLink multiple times. To avoid re-mapping the reads at each round, ntLink lifts over the mapping coordinates from the input draft assembly to the output post-ntLink scaffolds, which can then be used for the next round of ntLink. The same process can be repeated as many times as needed, thus enabling multiple rounds of ntLink to be powered by a single instance of long-read mapping.” Therefore, I need to turn the scaffolds into contigs. Install ntlinks on Andromeda.

cd /data/putnamlab/conda
module load Miniconda3/4.9.2
conda create --prefix /data/putnamlab/conda/ntlink
conda activate /data/putnamlab/conda/ntlink
conda install -c bioconda -c conda-forge ntlink

In the /data/putnamlab/jillashey/Apul_Genome/assembly/scripts folder: nano ntlinks_5rounds.sh

#!/bin/bash -i
#SBATCH -t 30-00:00:00
#SBATCH --nodes=1 --ntasks-per-node=36
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --exclusive
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

module load Miniconda3/4.9.2

conda activate /data/putnamlab/conda/ntlink

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Starting scaffolding of hifiasm primary assembly with ntlinks (rounds = 5)" $(date)

ntLink_rounds run_rounds_gaps \
t=36 \
g=100 \
rounds=5 \
gap_fill \
target=apul.hifiasm.s55_pa.p_ctg.fa \
reads=hifi_rr_allcontam_rem.fasta \
out_prefix=apul_ntlink_s55

echo "Scaffolding of hifiasm primary assembly with ntlinks (rounds = 5) complete!" $(date)

Submitted batch job 328341

20240701

ntlink ran in about 4 hours and produced a LOT of output files which I’m not sure what they all mean:

-rw-r--r--. 1 jillashey  11G Jun 30 20:55 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey 8.6K Jun 30 20:58 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.trimmed_scafs.agp
-rw-r--r--. 1 jillashey 495M Jun 30 21:14 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
-rw-r--r--. 1 jillashey 8.7K Jun 30 21:14 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey   72 Jun 30 21:14 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.scaffolds.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey   72 Jun 30 21:14 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey   76 Jun 30 21:14 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.agp -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey   63 Jun 30 21:14 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.verbose_mapping.tsv -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey  11G Jun 30 21:33 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey 8.3K Jun 30 21:49 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.trimmed_scafs.agp
-rw-r--r--. 1 jillashey 495M Jun 30 22:02 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
-rw-r--r--. 1 jillashey 8.4K Jun 30 22:02 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey  106 Jun 30 22:02 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey  106 Jun 30 22:02 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey  110 Jun 30 22:02 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.agp -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey   97 Jun 30 22:02 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.verbose_mapping.tsv -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey  11G Jun 30 22:20 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey 7.7K Jun 30 22:37 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.trimmed_scafs.agp
-rw-r--r--. 1 jillashey 495M Jun 30 22:50 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
-rw-r--r--. 1 jillashey 7.7K Jun 30 22:50 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey  113 Jun 30 22:50 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey  113 Jun 30 22:50 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey  117 Jun 30 22:50 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.agp -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey  104 Jun 30 22:50 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.verbose_mapping.tsv -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey  11G Jun 30 23:08 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey 7.7K Jun 30 23:23 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.trimmed_scafs.agp
-rw-r--r--. 1 jillashey 495M Jun 30 23:37 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
-rw-r--r--. 1 jillashey 7.7K Jun 30 23:37 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey  120 Jun 30 23:37 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey  120 Jun 30 23:37 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey  124 Jun 30 23:37 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.agp -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey  111 Jun 30 23:37 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.verbose_mapping.tsv -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey  11G Jun 30 23:55 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.verbose_mapping.tsv
-rw-r--r--. 1 jillashey 7.7K Jul  1 00:11 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.trimmed_scafs.agp
-rw-r--r--. 1 jillashey 495M Jul  1 00:24 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
-rw-r--r--. 1 jillashey 7.7K Jul  1 00:24 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey  127 Jul  1 00:24 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey  127 Jul  1 00:24 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.ntLink.gap_fill.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa
lrwxrwxrwx. 1 jillashey  131 Jul  1 00:24 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.agp -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.ntLink.scaffolds.gap_fill.fa.agp
lrwxrwxrwx. 1 jillashey  118 Jul  1 00:24 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.verbose_mapping.tsv -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.gap_fill.fa.k32.w100.z1000.verbose_mapping.tsv
lrwxrwxrwx. 1 jillashey   90 Jul  1 00:24 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.5rounds.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.ntLink.ntLink.ntLink.ntLink.gap_fill.fa
lrwxrwxrwx. 1 jillashey   70 Jul  1 00:24 apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa -> apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.gap_fill.5rounds.fa

I think the files are representing the iterations of ntlink that was run. For example, files with one ntLink in the file name are from the first iteration, files with two ntLink in the file name are from the second iteration, etc. In the output file, it also gave me a lot of info. It gave me a lot of info on the specific code/parameters for each iteration and then provided me with the final file: Done ntLink rounds! Final scaffolds found in apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa. I now need to run QC on the scaffolded assembly.

In the scripts folder: nano busco_ntlink_qc.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Convert from gfa to fasta for downstream use" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

echo "Begin busco on scaffolded assembly" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${Apul_Genome/assembly/data}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apul.ntlink.busco -m genome

echo "busco complete for scaffolded assembly" $(date)

Submitted batch job 328382. Failed, need to rerun

In the scripts folder: nano quast_ntlink_qc.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

module purge
module load Python/2.7.18-GCCcore-10.2.0
module load QUAST/5.0.2-foss-2020b-Python-2.7.18
# previously used QUAST/5.2.0-foss-2021b but it failed and produced module conflict errors

echo "Begin quast of scaffolded assemblies w/ references" $(date)

quast -t 10 --eukaryote \
apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa \
apul.hifiasm.s55_pa.p_ctg.fa \
apul.hifiasm.s55_pa.a_ctg.fa \
/data/putnamlab/jillashey/genome/Ofav_Young_et_al_2024/Orbicella_faveolata_gen_17.scaffolds.fa \
/data/putnamlab/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta \
/data/putnamlab/jillashey/genome/Aten/GCA_014633955.1_Aten_1.0_genomic.fna \
/data/putnamlab/jillashey/genome/Ahya/GCA_014634145.1_Ahya_1.0_genomic.fna \
/data/putnamlab/jillashey/genome/Ayon/GCA_014634225.1_Ayon_1.0_genomic.fna \
/data/putnamlab/jillashey/genome/Mcap/V3/Montipora_capitata_HIv3.assembly.fasta \
/data/putnamlab/jillashey/genome/Pacuta/V2/Pocillopora_acuta_HIv2.assembly.fasta \
/data/putnamlab/jillashey/genome/Peve/Porites_evermanni_v1.fa \
/data/putnamlab/jillashey/genome/Ofav/GCF_002042975.1_ofav_dov_v1_genomic.fna \
/data/putnamlab/jillashey/genome/Pcomp/Porites_compressa_contigs.fasta \
/data/putnamlab/jillashey/genome/Plutea/plut_final_2.1.fasta \
-o /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast

echo "Quast complete; all QC complete!" $(date)

Submitted batch job 328389. Move results into new directory:

cd /data/putnamlab/jillashey/Apul_Genome/assembly/output/quast
mkdir ntlink
mv *report* ntlink/
mv basic_stats/ ntlink/
mv icarus* ntlink/
mv quast.log ntlink/

Downloaded icarus.html, report.html, report.pdf, and report.txt to /Users/jillashey/Desktop/PutnamLab/Repositories/Apulchra_genome/output/assembly/ntlink on my personal computer. The quast looks good!!

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds  apul.hifiasm.s55_pa.p_ctg  apul.hifiasm.s55_pa.a_ctg  Orbicella_faveolata_gen_17.scaffolds  Amil.v2.01.chrs  GCA_014633955.1_Aten_1.0_genomic  GCA_014634145.1_Ahya_1.0_genomic  GCA_014634225.1_Ayon_1.0_genomic  Montipora_capitata_HIv3.assembly  Pocillopora_acuta_HIv2.assembly  Porites_evermanni_v1  GCF_002042975.1_ofav_dov_v1_genomic  Porites_compressa_contigs  plut_final_2.1
# contigs (>= 0 bp)         174                                                         187                        3548                       51                                    854              1538                              2758                              1010                              1699                              474                              8186                  1933                                 1071                       2975          
# contigs (>= 1000 bp)      174                                                         187                        3548                       51                                    851              1519                              2681                              992                               1697                              474                              8186                  1933                                 1071                       2933          
# contigs (>= 5000 bp)      174                                                         187                        3508                       51                                    748              1164                              1473                              677                               1642                              461                              6821                  1933                                 1071                       2138          
# contigs (>= 10000 bp)     172                                                         185                        3365                       51                                    672              1013                              1084                              581                               1567                              447                              5755                  1142                                 1071                       1951          
# contigs (>= 25000 bp)     153                                                         165                        2831                       48                                    545              819                               769                               460                               1008                              394                              4531                  831                                  1054                       1650          
# contigs (>= 50000 bp)     89                                                          97                         1445                       40                                    445              670                               581                               377                               540                               203                              3258                  687                                  965                        1344          
Total length (>= 0 bp)      518313916                                                   518458989                  482735234                  493925641                             475381253        403138309                         447200179                         438047505                         780507976                         408287534                        603805388             485548939                            751252456                  552020673     
Total length (>= 1000 bp)   518313916                                                   518458989                  482735234                  493925641                             475378544        403124220                         447138055                         438033530                         780506390                         408287534                        603805388             485548939                            751252456                  551983631     
Total length (>= 5000 bp)   518313916                                                   518458989                  482572037                  493925641                             475052084        402086370                         443760117                         437148684                         780327690                         408248011                        599829750             485548939                            751252456                  550215806     
Total length (>= 10000 bp)  518300503                                                   518445576                  481501732                  493925641                             474498957        400989068                         441052114                         436458337                         779744100                         408137131                        592561315             478462976                            751252456                  548844833     
Total length (>= 25000 bp)  517996671                                                   518118788                  472081622                  493863958                             472383091        397954288                         435913299                         434475505                         770218970                         407100823                        571497796             473919972                            750872185                  543736169     
Total length (>= 50000 bp)  515685899                                                   515656915                  421575419                  493582559                             468867721        392444742                         429324050                         431614917                         753567665                         400343351                        525186424             468943051                            747605773                  532574147     
# contigs                   174                                                         187                        3548                       51                                    854              1538                              2758                              1010                              1699                              474                              8186                  1933                                 1071                       2975          
Largest contig              45111900                                                    45111900                   5479021                    40246328                              39361238         4392697                           10924033                          11713616                          69151359                          16633824                         1802771               4771691                              7905324                    3122227       
Total length                518313916                                                   518458989                  482735234                  493925641                             475381253        403138309                         447200179                         438047505                         780507976                         408287534                        603805388             485548939                            751252456                  552020673     
GC (%)                      39.05                                                       39.05                      39.02                      39.49                                 39.06            38.93                             38.97                             39.03                             39.66                             38.11                            39.02                 38.99                                39.13                      39.05         
N50                         17861421                                                    16268372                   721379                     33295526                              19840543         1165953                           1584703                           3033871                           47716837                          5167277                          171385                1162446                              1540036                    660708        
N75                         13936008                                                    13007972                   110933                     24061036                              1469964          537206                            753273                            1342298                           38979999                          3166945                          85873                 575799                               817138                     325442        
L50                         10                                                          11                         160                        7                                     9                101                               86                                45                                7                                 24                               935                   124                                  140                        242           
L75                         18                                                          20                         589                        12                                    23               228                               191                               98                                11                                49                               2169                  272                                  308                        540           
# N's per 100 kbp           0.00                                                        0.00                       0.00                       1.02                                  7.79             7389.85                           7778.63                           6736.15                           18.07                             0.00                             6749.89               26684.10                             0.00                       8717.64       

There is an improvement of number of contigs from the initial assembly (187 contigs for initial assembly and 174 contigs for ntlinks cleaned assembly).

20240801

Met w/ Trinity last week and we discussed next steps for genome structural and functional annotation. We decided that she is going to move forward with the funannotate steps. I am going to focus on obtaining methylation data from the PacBio reads because apparently the reads also contain information about the methylation status of the bases. This is a helpful video that explains how pacbio reads have methylation data. Essentially, it uses kinetics to see how far apart the bases are from one another. Hifi sequencing uses a polymerase that incorporates fluorescently labeled nucleotides in real time complementary to a native DNA strand. Epigenetic modifications, like methylation, impact how fast the bases are added. Base modifications can be inferred from per-base pulse width (PW) and inter-pulse duration (IPD) kinetics.

I am now looking into the options for PacBio DNA methylation detection/estimation. I’ve found a few tools so far.

  • MethBat - aggregate and analyze CpG methylation calls. There are four main workflows:
    • Rare methylation analysis - Identify regions in a single dataset exhibiting a “rare” methylation patterns relative to a collection of background datasets; requires pre-defined regions such as all known CpG islands.
    • Cohort methylation analysis - Identify regions exhibiting different methylation patterns between case and control datasets; requires pre-defined regions such as all known CpG islands.
    • Segmentation - Segment (or divide) CpGs for an individual dataset into regions with a shared methylation pattern; no pre-defined regions required.
    • Signature generation - Identify regions exhibiting different methylation patterns between case and control datasets; no pre-defined regions required.

The first two require pre-defined regions of CpG islands, so I don’t think I can use those. Signature generation appears to require some kind of contrast? like different treatments or something so that might not be the way to go either. Segmentation seems like the best bet at the moment because it doesn’t require any pre-defined regions. However, the input does require the output from pb-CpG-tools, which needs mapped Hifi reads. I do not have mapped Hifi reads because what would I be mapping to??

  • Jasmine - predicts 5mC of each CpG site in Pacbio Hifi reads
    • This seems like the package to use based on the sequencing data that we have. Input is Pacbio reads with kinetics. I’m not sure if our data (or hifi reads in general) have kinetics automatically. No worries, I will be able to generate Hifi reads with kinetics with ccs-kinetics-bystrandify, which is an executable in the pbtk package. As stated above, base modifications can be inferred from per-base pulse width (PW) and inter-pulse duration (IPD) kinetics so ccs uses this information to apply the kinetic information to the reads (this is my understanding). The ccs call requires a bam file but it is easy to turn a fasta into a bam.
    • Not sure if I should wait until the structural/functional annotation is completed. It would obviously be more meaningful to have the structural and functional annotation information. But I may just run it on my primary and ntlinks assemblies to see how the programs work.

So my next steps are:

  • Convert my final ntlinks fasta to bam
  • Convert my primary assembly fasta to bam
  • Run ccs-kinetics-bystrandify in the pbtk package on both bam files
  • Install jasmine (see info here)
  • Run jasmine!

After looking briefly on the internet, it looks like there aren’t a ton of tools to convert fasta files to bam files. But the original data came as a bam file (m84100_240128_024355_s2.hifi_reads.bc1029.bam). It is totally unfiltered and unassembled. Let’s try to run that? At least run the ccs-kinetics-bystrandify.

Make a new methylation directory

cd /data/putnamlab/jillashey/Apul_Genome
mkdir methylation
cd methylation
mkdir scripts data output 

In the scripts folder: nano ccs-kinetics.sh

#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=500GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/methylation/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/pbtk

echo "Adding kinetics information to hifi reads" $(date)

ccs-kinetics-bystrandify /data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.bam /data/putnamlab/jillashey/Apul_Genome/methylation/data/apul_hifi_raw_kinetics.bam

echo "Kinetics complete!" $(date)

conda deactivate

Submitted batch job 333762. Pended for an hour, then ran in 5 mins.

20240802

Time to install jasmine!

cd/data/putnamlab/conda
module load Miniconda3/4.9.2
conda create --prefix /data/putnamlab/conda/jasmine
conda install -c bioconda jasmine

This takes a couple of minutes but once its installed, jasmine can be run. In the /data/putnamlab/jillashey/Apul_Genome/methylation/scripts folder, nano jasmine.sh

#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/methylation/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/jasmine

echo "Running jasmine" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/methylation/data/

jasmine apul_hifi_raw_kinetics.bam apul_hifi_raw_kinetics_5mc.bam

echo "Jasmine complete!" $(date)

conda deactivate

Submitted batch job 333785. Failed immediately with this error:

ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
Exception in thread "main" java.lang.NullPointerException
        at uio.amg.zhong.jasmine.JASMINE.findXMLfile(JASMINE.java:494)
        at uio.amg.zhong.jasmine.JASMINE.main(JASMINE.java:40)

20240829

Hollie and I met earlier this week and we briefly discussed methylation for Apul. I said that some of the MethBat tools required information about known CpG locations. She recommended I try emboss fuzznuc, which searches for specific patterns in nucleotide sequences (such as CGs). Why do we care about CGs? CpG sites occur when a cytosine is followed by a guanine in a linear sequence. Cytosines in CpG motifs can be methylated. So in order to find methylation, we need to find the CpG sites.

The Roberts lab has used the fuzznuc program before to identidy CG motifs; Sam’s notebook has an example of how to use fuzznuc, and this page has instances where the output file from fuzznuc was used in analysis. Emboss is already installed on Andromeda yay. In the /data/putnamlab/jillashey/Apul_Genome/methylation/scripts folder: nano fuzznuc.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/methylation/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

# Load module 
module load EMBOSS/6.6.0-foss-2018b

echo "Running fuzznuc on assembled Apul genome" $(date)

# Run fuzznuc 
fuzznuc \
-sequence /data/putnamlab/jillashey/Apul_Genome/assembly/data/apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa \
-pattern CG \
-outfile /data/putnamlab/jillashey/Apul_Genome/methylation/output/CGmotif_fuzznuc_Apul.gff \
-rformat gff 

echo "Fuzznuc complete" $(date)

Submitted batch job 336307. Ran very fast. Let’s look at the output.

wc -l CGmotif_fuzznuc_Apul.gff 
16011621 CGmotif_fuzznuc_Apul.gff

head CGmotif_fuzznuc_Apul.gff 
##gff-version 3
##sequence-region ntLink_7 1 182921
#!Date 2024-08-29
#!Type DNA
#!Source-version EMBOSS 6.6.0.0
ntLink_7	fuzznuc	nucleotide_motif	47	48	2	+	.	ID=ntLink_7.1;note=*pat pattern:CG
ntLink_7	fuzznuc	nucleotide_motif	50	51	2	+	.	ID=ntLink_7.2;note=*pat pattern:CG
ntLink_7	fuzznuc	nucleotide_motif	97	98	2	+	.	ID=ntLink_7.3;note=*pat pattern:CG
ntLink_7	fuzznuc	nucleotide_motif	99	100	2	+	.	ID=ntLink_7.4;note=*pat pattern:CG
ntLink_7	fuzznuc	nucleotide_motif	124	125	2	+	.	ID=ntLink_7.5;note=*pat pattern:CG

tail CGmotif_fuzznuc_Apul.gff 
ptg000187l	fuzznuc	nucleotide_motif	16594	16595	2	+	.	ID=ptg000187l.326;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	16672	16673	2	+	.	ID=ptg000187l.327;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	16891	16892	2	+	.	ID=ptg000187l.328;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	16923	16924	2	+	.	ID=ptg000187l.329;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	17048	17049	2	+	.	ID=ptg000187l.330;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	17120	17121	2	+	.	ID=ptg000187l.331;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	17701	17702	2	+	.	ID=ptg000187l.332;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	17753	17754	2	+	.	ID=ptg000187l.333;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	17765	17766	2	+	.	ID=ptg000187l.334;note=*pat pattern:CG
ptg000187l	fuzznuc	nucleotide_motif	17890	17891	2	+	.	ID=ptg000187l.335;note=*pat pattern:CG

Lot of instances of CGs in the genome. Calculate how many CG motifs per chromosome.

awk '{print $1}' CGmotif_fuzznuc_Apul.gff | sort | uniq -c > CpG_chrom_counts.txt
    174 #!Date
    174 ##gff-version
   2660 ntLink_0
   5528 ntLink_1
  18990 ntLink_2
   4230 ntLink_3
  16135 ntLink_4
 718456 ntLink_6
   6379 ntLink_7
1131794 ntLink_8
 673167 ptg000001l
 487373 ptg000002l
 481523 ptg000004l
  41081 ptg000005l
  72434 ptg000006l
 384667 ptg000007l
1210027 ptg000008l
 576662 ptg000009l
  71786 ptg000010l
 431224 ptg000011l
 612568 ptg000012l
 471803 ptg000015l
 399662 ptg000016l
 402360 ptg000017l
 514573 ptg000018l
 184959 ptg000019l
 546594 ptg000020l
 660490 ptg000021l
 300528 ptg000022l
1411649 ptg000023l
 371263 ptg000024l
 651898 ptg000025l
 463202 ptg000026l
 477238 ptg000027l
  20084 ptg000028l
  54282 ptg000029c
 104517 ptg000030l
 487040 ptg000031l
  87528 ptg000033l
 107979 ptg000034l
 294044 ptg000035l
 206362 ptg000036l
   8351 ptg000037l
   9522 ptg000038l
  32088 ptg000039l
  12627 ptg000040l
   3994 ptg000043l
   1557 ptg000045l
   1134 ptg000046l
 379058 ptg000047l
   2221 ptg000048l
  70943 ptg000049l
    822 ptg000050l
  11322 ptg000051l
   1607 ptg000052l
   2012 ptg000053l
    976 ptg000054l
   1523 ptg000055l
   1606 ptg000056l
   1168 ptg000057l
  55399 ptg000059l
   3128 ptg000060c
   6757 ptg000061l
    962 ptg000063l
   4760 ptg000064l
   1381 ptg000065l
   2038 ptg000066l
   5588 ptg000067l
  12558 ptg000069l
   5742 ptg000070l
  31307 ptg000072c
  31474 ptg000073l
    107 ptg000074l
   1752 ptg000075l
  10527 ptg000076l
   1221 ptg000077l
    701 ptg000078l
    866 ptg000079l
   1273 ptg000080l
   5641 ptg000081l
   2028 ptg000082l
   3914 ptg000083l
   3283 ptg000085l
   2711 ptg000086l
   2651 ptg000087l
   2899 ptg000088l
   1270 ptg000089l
   1474 ptg000090l
    797 ptg000092l
   1080 ptg000093l
   1040 ptg000094l
   1193 ptg000095l
   1674 ptg000096l
   1586 ptg000097l
   1601 ptg000098l
   1385 ptg000099l
   1524 ptg000100l
    866 ptg000101l
   1850 ptg000102l
   2880 ptg000105l
   1386 ptg000106l
   2113 ptg000107l
   2548 ptg000108l
   1351 ptg000109l
   1268 ptg000112l
   1804 ptg000113l
   1450 ptg000114l
   1189 ptg000115l
   1357 ptg000116l
    817 ptg000117l
    969 ptg000118l
    605 ptg000119l
    866 ptg000120l
   1330 ptg000121l
   1436 ptg000122l
   1308 ptg000123l
   1708 ptg000124l
    637 ptg000125l
    947 ptg000126l
   1023 ptg000127l
   1626 ptg000128l
   1220 ptg000129l
   1187 ptg000130l
    812 ptg000131l
    855 ptg000132l
   1153 ptg000133l
   3997 ptg000134l
    112 ptg000135l
    477 ptg000136l
   1906 ptg000137l
   2694 ptg000138l
    432 ptg000139l
    802 ptg000140l
   2269 ptg000141l
    423 ptg000142l
    609 ptg000144l
    504 ptg000145l
   1177 ptg000146l
   1059 ptg000147l
   1177 ptg000148l
   2720 ptg000149l
   4343 ptg000151l
    716 ptg000152l
    877 ptg000153l
   1155 ptg000154l
    948 ptg000155l
    462 ptg000158l
    494 ptg000159l
   1126 ptg000160l
    408 ptg000161l
    311 ptg000162l
    388 ptg000163l
    151 ptg000164l
    782 ptg000165l
    360 ptg000166l
    519 ptg000167l
    392 ptg000168l
    535 ptg000169l
    468 ptg000170l
    504 ptg000171l
   1376 ptg000172l
    697 ptg000173l
    568 ptg000174l
    525 ptg000175l
   1460 ptg000176l
   1564 ptg000177l
    554 ptg000178l
   1129 ptg000179l
    220 ptg000180l
   1048 ptg000181l
    746 ptg000182l
    892 ptg000183l
   1494 ptg000184l
   1252 ptg000185l
    471 ptg000186l
    335 ptg000187l
    174 ##sequence-region
    174 #!Source-version
    174 #!Type

20240910

Met with Trinity this morning! She has completed the repeat masker/modeling part and is now working on the structural and functional annotation. We discussed submitting the assembled genome to NCBI, which I am going to look into. When I was making the submission on GenBank, one of the submission questions was “do you want to submit motif/modification information” since its PacBio sequencing. I looked at NCBI’s page on this and they mentioned an analysis workflow (RS_Modification_and_Motif_Analysis) that will identify motifs and modifications but I couldn’t find much info about it on the pacbio website. I emailed the pacbio people to see where I should start my methylation analysis. I was also looking at the Apul PacBio summary report and it has some information on methylation as well which I have never noticed. It has plots of CpG methylation in reads but no other information.

I emailed PacBio to ask where to start with all these things.

20240918

Maybe the raw hifi reads already have the kinetics info in them that is needed for jasmine. Before I tried to convert the original bam file to one with kinetics but now I’m going to try running just the raw hifi bam through jasmine.

#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/methylation/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/jasmine

echo "Running jasmine" $(date)

jasmine /data/putnamlab/jillashey/Apul_Genome/assembly/data/m84100_240128_024355_s2.hifi_reads.bc1029.bam /data/putnamlab/jillashey/Apul_Genome/methylation/data/apul_hifi_5mc.bam

echo "Jasmine complete!" $(date)

conda deactivate

Submitted batch job 338453. Failed with same error as before:

bash: cannot set terminal process group (-1): Function not implemented
bash: no job control in this shell
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
Exception in thread "main" java.lang.NullPointerException
        at uio.amg.zhong.jasmine.JASMINE.findXMLfile(JASMINE.java:494)
        at uio.amg.zhong.jasmine.JASMINE.main(JASMINE.java:40)

Got email back from PacBio people:

“For your record, we have opened case 00233359 for this inquiry.

First, I think the information on the NCBI page is not going to be as relevant in this particular instance. The base modification files that are referenced on that page are outputs from the Microbial Genome Analysis workflow and they emphasize 6mA and 4mC motifs which are the most common modifications in bacterial genomes.

For methylation analyses in eukaryotes, our key tools are focused on analysis of 5mC in CpG sites. 5mC methylation probabilities in CpG sites are called using our tools primrose or jasmine and are encoded in the hifi_reads.bam file as the MM and ML tags. We have two tools for the analyses of these data pb CpGtools and methbat.

pb cpgtools is the older of the two tools and uses either a trained machine learning model or a pileup model to summarize methylation probabilities across sites to provide evidence of hyper- or hypo-methylation. Pbcpgtools can also be used to summarize 5mC calls for individual samples, which can then be used to build cohort profiles for methbat.

Methbat is the newer of the two tools and is technically still an “in development”, but it has four workflows that are supported, which are summarized here on the user guide page. Which workflow you want to use is going to depend your experimental design and what you would like to test.

If you would be up for sending me some information on your dataset and what you are interested testing for, I would be very happy to weigh in on which workflow might be most useful.”

I sent him some info about the data that I have, what I am trying to do and what code I have tried so far. Going to download the example datasets from the PacBio websites to try to run jasmine in interactive mode. Going to submit as a job.

#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/methylation/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/jasmine

echo "Running jasmine" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/methylation/data/test

jasmine m64168_200820_000733.subreads.bam test_jasmine.bam

echo "Jasmine complete!" $(date)

conda deactivate

Submitted batch job 338456. Hmm got the same error as above…so maybe an installation error?

20240924

PacBio responded with very helpful info about methylation analysis. They recommended that I:

  • Align sequences to reference genome with pbmm2, a minimap2 SMRT wrapper specifically for PacBio data.
  • Use the aligned bam file as input for pb-CpG-tools which generates site methylation probabilities for hifi reads
  • Use pb-CpG-tools output as input for MethBat

I do not need to run ccs-bystrandify or jasmine because the 5mC calling takes place on the instrument, so MM and ML tags should already be included with the data. Let’s install pbmm2 via PacBio conda instructions.

cd /data/putnamlab/conda
module load Miniconda3/4.9.2
conda create --prefix /data/putnamlab/conda/pbmm2

conda activate /data/putnamlab/conda/pbmm2
conda install -c bioconda pbmm2
conda deactivate 

The pbmm2 documentation says to align the bam file to the reference genome. But I just created the reference genome from these reads…I guess I will use the new reference genome? I am going to used the unmasked version that is in my folder. In the /data/putnamlab/jillashey/Apul_Genome/assembly/scripts folder: nano pbmm2_index.sh

#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/pbmm2

echo "Indexing reference genome" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

pbmm2 index apul.hifiasm.s55_pa.p_ctg.fa.k32.w100.z1000.ntLink.5rounds.fa apul_ref_out.mmi --preset CCS

echo "Index of ref genome complete" $(date)

conda deactivate

Submitted batch job 339851. Took about 15 seconds yay. Now let’s align the raw bam file to the index. I could align either a fasta or bam file to the reference. I’m going to start with the raw bam file. This file does not have any contaminants removed or is assembled in any way. nano pbmm2_align.sh

#!/bin/bash -i
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/assembly/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

conda activate /data/putnamlab/conda/pbmm2

echo "Aligning raw bam" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/assembly/data

pbmm2 align apul_ref_out.mmi m84100_240128_024355_s2.hifi_reads.bc1029.bam out.aligned.bam --sort --preset HIFI

echo "Alignment complete" $(date)

conda deactivate

Submitted batch job 339861. Currently pending because it needs resources. While I wait, I’m going to install the other pacbio packages for methylation analysis. For the pb-CpG-tools, I need to download the release from github and unpack.

cd /data/putnamlab/conda

wget https://github.com/PacificBiosciences/pb-CpG-tools/releases/download/v2.3.2/pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu.tar.gz
tar -xzf pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu.tar.gz

# Run help option to test binary and see latest usage details:
pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores --help

Great, that was super easy. Install MethBat with conda

cd /data/putnamlab/conda
module load Miniconda3/4.9.2
conda create --prefix /data/putnamlab/conda/methbat

conda activate /data/putnamlab/conda/methbat
conda install -c bioconda methbat
conda deactivate 

Success! And alignment is currently running. Ran in about 4 hours. Run pb-CpG-tools to assess methylation site probabilities at CpG sites. In the /data/putnamlab/jillashey/Apul_Genome/methylation/scripts folder: nano pb_cpg_probs.sh

#!/bin/bash 
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=250GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/methylation/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Using aligned bam to generate cpg probabilities" $(date)

cd /data/putnamlab/jillashey/Apul_Genome/methylation/data

/data/putnamlab/conda/pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores \
  --bam /data/putnamlab/jillashey/Apul_Genome/assembly/data/out.aligned.bam \
  --output-prefix Apul.pbmm2 \
  --model /data/putnamlab/conda/pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/models/pileup_calling_model.v1.tflite \
  --threads 10

echo "cpg probability prediction complete!" $(date)

Submitted batch job 339998

20240925

The pb-CpG-tools has information on the output. Let’s look at the beginning of the bed file:

head Apul.pbmm2.combined.bed 
ntLink_7	3388	3389	15.0	Total	4	0	4	0.0
ntLink_7	3395	3396	8.5	Total	4	0	4	0.0
ntLink_7	3431	3432	3.7	Total	5	0	5	0.0
ntLink_7	3467	3468	4.8	Total	5	0	5	0.0
ntLink_7	3507	3508	4.6	Total	5	0	5	0.0
ntLink_7	3512	3513	5.4	Total	5	0	5	0.0
ntLink_7	3536	3537	3.9	Total	5	0	5	0.0
ntLink_7	3546	3547	4.8	Total	5	0	5	0.0
ntLink_7	3552	3553	7.5	Total	5	0	5	0.0
ntLink_7	3607	3608	4.8	Total	5	0	5	0.0

tail Apul.pbmm2.combined.bed 
ptg000187l	16593	16594	73.4	Total	49	36	13	73.5
ptg000187l	16671	16672	93.8	Total	46	44	2	95.7
ptg000187l	16890	16891	92.0	Total	42	39	3	92.9
ptg000187l	16922	16923	89.6	Total	40	36	4	90.0
ptg000187l	17047	17048	76.1	Total	32	25	7	78.1
ptg000187l	17119	17120	95.4	Total	31	30	1	96.8
ptg000187l	17700	17701	94.5	Total	15	15	0	100.0
ptg000187l	17752	17753	89.9	Total	14	13	1	92.9
ptg000187l	17764	17765	91.4	Total	14	13	1	92.9
ptg000187l	17889	17890	79.8	Total	10	8	2	80.0

Columns

  1. Reference name - contig name from reference genome (I used the genome that I assembled)
  2. Start coordinate
  3. End coordinate
  4. Modification score - modification probability score or the likelihood that the cytosine in the CpG site is methylated. Higher score = higher likelihood of methylation at that site
  5. Haplotype - total, probabilities combined across all reads
  6. Coverage - number of reads covering CpG site
  7. Estimated modified site count - number of CpG sites estimated to be methylated
  8. Estimated unmodified site count - number of CpG sites estimated to be unmethylated
  9. Discretized modification probability - ratio of modified to unmodified sites, indicating the confidence that the site is methylated (ie modified) or unmethylated

We are interested in the modification score, which ranges from 0 to 100, with 0 being unmethylated and 100 being fully methylated. Scores that range from 20-90 are considered partially methylated and may be areas of active gene expression or transcriptional plasticity.

THIS IS SO COOL!!!!!!

20240930

I am now interested in the overlap of the CpGs with genomic features. Trinity provided me with the path that I can use for the Apul gff: /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3. Going to use bedtools intersect to find intersections of genes and CpGs.

cd /data/putnamlab/tconn/predict_results

head Acropora_pulchra.gff3 
##gff-version 3
ntLink_0	funannotate	gene	1105	7056	.	+	.	ID=FUN_000001;
ntLink_0	funannotate	mRNA	1105	7056	.	+	.	ID=FUN_000001-T1;Parent=FUN_000001;product=hypothetical protein;
ntLink_0	funannotate	exon	1105	1188	.	+	.	ID=FUN_000001-T1.exon1;Parent=FUN_000001-T1;
ntLink_0	funannotate	exon	1861	1941	.	+	.	ID=FUN_000001-T1.exon2;Parent=FUN_000001-T1;
ntLink_0	funannotate	exon	2762	2839	.	+	.	ID=FUN_000001-T1.exon3;Parent=FUN_000001-T1;
ntLink_0	funannotate	exon	5044	7056	.	+	.	ID=FUN_000001-T1.exon4;Parent=FUN_000001-T1;
ntLink_0	funannotate	CDS	1105	1188	.	+	0	ID=FUN_000001-T1.cds;Parent=FUN_000001-T1;
ntLink_0	funannotate	CDS	1861	1941	.	+	0	ID=FUN_000001-T1.cds;Parent=FUN_000001-T1;
ntLink_0	funannotate	CDS	2762	2839	.	+	0	ID=FUN_000001-T1.cds;Parent=FUN_000001-T1;

Look for intersects in methylation data and gff

cd /data/putnamlab/jillashey/Apul_Genome/methylation/data

interactive
module load BEDTools/2.30.0-GCC-11.3.0

bedtools intersect -a Apul.pbmm2.combined.bed -b /data/putnamlab/tconn/predict_results/Acropora_pulchra.gff3 -wa -wb > Apul_methylation_genome_intersect.bed

head Apul_methylation_genome_intersect.bed 
ntLink_7	3388	3389	15.0	Total	4	0	4	0.0	ntLink_7	funannotate	gene	79	4679	.	+	.	ID=FUN_002303;
ntLink_7	3388	3389	15.0	Total	4	0	4	0.0	ntLink_7	funannotate	mRNA	79	4679	.	+	.	ID=FUN_002303-T1;Parent=FUN_002303;product=hypothetical protein;
ntLink_7	3395	3396	8.5	Total	4	0	4	0.0	ntLink_7	funannotate	gene	79	4679	.	+	.	ID=FUN_002303;
ntLink_7	3395	3396	8.5	Total	4	0	4	0.0	ntLink_7	funannotate	mRNA	79	4679	.	+	.	ID=FUN_002303-T1;Parent=FUN_002303;product=hypothetical protein;
ntLink_7	3431	3432	3.7	Total	5	0	5	0.0	ntLink_7	funannotate	gene	79	4679	.	+	.	ID=FUN_002303;
ntLink_7	3431	3432	3.7	Total	5	0	5	0.0	ntLink_7	funannotate	mRNA	79	4679	.	+	.	ID=FUN_002303-T1;Parent=FUN_002303;product=hypothetical protein;
ntLink_7	3467	3468	4.8	Total	5	0	5	0.0	ntLink_7	funannotate	gene	79	4679	.	+	.	ID=FUN_002303;
ntLink_7	3467	3468	4.8	Total	5	0	5	0.0	ntLink_7	funannotate	mRNA	79	4679	.	+	.	ID=FUN_002303-T1;Parent=FUN_002303;product=hypothetical protein;
ntLink_7	3507	3508	4.6	Total	5	0	5	0.0	ntLink_7	funannotate	gene	79	4679	.	+	.	ID=FUN_002303;
ntLink_7	3507	3508	4.6	Total	5	0	5	0.0	ntLink_7	funannotate	mRNA	79	4679	.	+	.	ID=FUN_002303-T1;Parent=FUN_002303;product=hypothetical protein;

wc -l Apul_methylation_genome_intersect.bed 
15564554 Apul_methylation_genome_intersect.bed

Select genes only

grep -w "gene" Apul_methylation_genome_intersect.bed > Apul_methylation_gene_only_intersect.bed

wc -l Apul_methylation_gene_only_intersect.bed 
6246019 Apul_methylation_gene_only_intersect.bed

Count number of genes in genome

cd /data/putnamlab/tconn/predict_results

awk '$3 == "gene"' Acropora_pulchra.gff3 | wc -l
44371

cut -f 3 Acropora_pulchra.gff3 | sort | uniq
CDS
exon
gene
mRNA
tRNA

for reference: https://github.com/hputnam/Meth_Compare?tab=readme-ov-file

20241203

Talked with Ross and Trinity last week to discuss figures and tables to include in the paper. Meeting summary (by Trinity):

  • We agreed that I will take over the bulk of responsibility for writing the draft, with Jill & I listed as Co-First Authors
  • Our first deadline will be December 6th – I will have a rough draft prepared in the overleaf and will let everyone know when that’s done!
  • We will drop figures and such in the shared github – Jill and I will also keep in contact

Figures

  • image of pulchra + sampling/geographic distribution (Trinity)
  • potential Busco scores (Jill + whatever Trinity can help with!)
  • repeat content distribution (Trinity)
  • bioanalyzer result/sequence quality statistics (Supplementary?) Tables
  • comparison assembly statistics to sanger A.palmata & A.cervicornis, and A.digitifera & A.millepora genomes (Jill)
  • description of structural + functional annotation & comparison to other Acropora assemblies (Trinity)

Other tasks for Jill & Trinity

  • Jill & Trinity: do more literature searches on use of pacbio for detection of methylation data to provide context for Jill’s methylation analysis
  • Jill & Trinity: think a little more about whether we want to include any other non-acroporids in genome comparison

Trinity recently reran busco on our masked genome and it looks beautiful! 96.6% completeness! I now need to run busco on the other genomes to compare completeness. We decided to look at Amillepora, Adigitifera, Acervicornis, and Apalmata (see table).

Amil BUSCO: cd /data/putnamlab/jillashey/genome/Amil_v2.01. In this folder: nano amil_busco.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/genome/Amil_v2.01
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Begin busco on Amil fasta" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/genome/Amil_v2.01/Amil.v2.01.chrs.fasta" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${jillashey/genome/Amil_v2.01}"
busco --config "$EBROOTBUSCO/config/config.ini"  -f -c 20 --long -i "${query}" -l metazoa_odb10 -o amil.busco -m genome

echo "busco complete for Amil" $(date)

Submitted batch job 352146

Adig BUSCO: /data/putnamlab/jillashey/genome/Adig. In this folder: nano adig_busco.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/genome/Adig
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Begin busco on Adig fasta" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/genome/Adig/GCA_014634065.1_Adig_2.0_genomic.fna" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${jillashey/genome/Adig}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o adig.busco -m genome

echo "busco complete for Amil" $(date)

Submitted batch job 352147

I am using recently made Acerv and Apalm genomes. For Acerv:

cd /data/putnamlab/jillashey/genome
mkdir jaAcrCerv1.1
cd jaAcrCerv1.1
wget ftp://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/cax/CAXITW01.fasta.gz

In this folder: nano acerv_busco.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/genome/jaAcrCerv1.1
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Begin busco on Acerv fasta" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/genome/jaAcrCerv1.1/CAXITW01.fasta" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${jillashey/genome/jaAcrCerv1.1}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o acerv.busco -m genome

echo "busco complete for Acerv" $(date)

Submitted batch job 352148. For Apalm:

cd /data/putnamlab/jillashey/genome
mkdir jaAcrPala1.1
cd jaAcrPala1.1
wget ftp://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/cax/CAXIQB01.fasta.gz

In this folder: nano apalm_busco.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/genome/jaAcrPala1.1
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Begin busco on Apalm fasta" $(date)

labbase=/data/putnamlab
busco_shared="${labbase}/shared/busco"
[ -z "$query" ] && query="${labbase}/jillashey/genome/jaAcrPala1.1/CAXIQB01.fasta" # set this to the query (genome/transcriptome) you are running
[ -z "$db_to_compare" ] && db_to_compare="${busco_shared}/downloads/lineages/metazoa_odb10"

source "${busco_shared}/scripts/busco_init.sh"  # sets up the modules required for this in the right order

# This will generate output under your $HOME/busco_output
cd "${labbase}/${jillashey/genome/jaAcrPala1.1}"
busco --config "$EBROOTBUSCO/config/config.ini" -f -c 20 --long -i "${query}" -l metazoa_odb10 -o apalm.busco -m genome

echo "busco complete for Apalm" $(date)

Submitted batch job 352149. All ran very fast! Look at results:

# Amil 
        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:92.7%[S:87.2%,D:5.5%],F:2.6%,M:4.7%,n:954      |
        |884    Complete BUSCOs (C)                       |
        |832    Complete and single-copy BUSCOs (S)       |
        |52     Complete and duplicated BUSCOs (D)        |
        |25     Fragmented BUSCOs (F)                     |
        |45     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------
        
# Adig
        --------------------------------------------------
        |Results from dataset metazoa_odb10               |
        --------------------------------------------------
        |C:93.2%[S:92.8%,D:0.4%],F:3.4%,M:3.4%,n:954      |
        |889    Complete BUSCOs (C)                       |
        |885    Complete and single-copy BUSCOs (S)       |
        |4      Complete and duplicated BUSCOs (D)        |
        |32     Fragmented BUSCOs (F)                     |
        |33     Missing BUSCOs (M)                        |
        |954    Total BUSCO groups searched               |
        --------------------------------------------------

Acerv and Apalm failed with this error:

2024-12-03 11:09:13 ERROR:      Unable to parse metaeuk results. This typically occurs because sequence headers contain pipes ('|'). Metaeuk uses pipes as delimiters in the results files. The additional pipes interfere with BUSCO's ability to accurately parse the results.To fix this problem remove any pipes from sequence headers and try again.
2024-12-03 11:09:13 ERROR:      BUSCO analysis failed!
Checking sequence headers and yes they do contain . Edit code for both species so that the “ ” is changed to a “-“ with the following line of code:
sed 's/|/-/g' GENOME.fasta > GENOME_modified.fasta

Submitted batch job 352154 for Apalm and Submitted batch job 352155 for Acerv. Zoe also informed me of the Acerv and Apalm genomes that Nick assembled (see his github and paper). I am also going to download those and run busco

NCBI genome accessions are GCA_025960835.2 for A. palmata, GCA_037043185.1 for A. cervicornis version 1, and GCA_041430625.1 for A. cervicornis version 2

20250109

With regards to the methylation data, I want to do the following:

  • Percentage of exons, introns, intergentic regions with methylated CpGs
  • Patterns of CpG density in all features
interactive
module load BEDTools/2.30.0-GCC-11.3.0

cd /data/putnamlab/jillashey/Apul_Genome/methylation/data
cp /data/putnamlab/jillashey/Apul_Genome/assembly/data/chrom_lengths.txt .

# Remove > at beginning of chromosome name 
sed 's/^>//' chrom_lengths.txt > cleaned_chrom_lengths.txt

# Extract genes
awk '$3=="transcript" {
    split($0, a, "gene_id ");
    split(a[2], b, "\"");
    print $1"\t"$4-1"\t"$5"\t"b[2]
}' /data/putnamlab/tconn/annotate_results/Acropora_pulchra.gtf > genes.bed

# Extract exons
awk '$3=="exon" {
    split($0, a, "gene_id ");
    split(a[2], b, "\"");
    print $1"\t"$4-1"\t"$5"\t"b[2]
}' /data/putnamlab/tconn/annotate_results/Acropora_pulchra.gtf > exons.bed

# Extract introns (assuming genes are continuous)
bedtools subtract -a genes.bed -b exons.bed > introns.bed

# Change cleaned_chrom_lengths.txt to tab delimited file instead of space delimited file 
sed 's/ /\t/' cleaned_chrom_lengths.txt > tab_delimited_chrom_lengths.txt

# Sort tab delim file 
sort -k1,1 tab_delimited_chrom_lengths.txt > sorted_chrom_lengths.txt

# Sort gene.bed file 
sort -k1,1 -k2,2n genes.bed > sorted_genes.bed

# Extract intergenic regions
bedtools complement -i sorted_genes.bed -g tab_delimited_chrom_lengths.txt > intergenic.bed

Sort methylation data file

sort -k1,1 -k2,2n Apul.pbmm2.combined.bed > sorted_Apul.pbmm2.combined.bed

Intersect bed feature files with methylation data. I need to run these as a job. In the scripts folder: nano intersect_methylation.sh

#!/bin/bash 
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/jillashey/Apul_Genome/methylation/scripts
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.error

echo "Intersect bed feature files with methylation data" $(date)

module load BEDTools/2.30.0-GCC-11.3.0

cd /data/putnamlab/jillashey/Apul_Genome/methylation/data

# Sort methylation file 
sort -k1,1 -k2,2n Apul.pbmm2.combined.bed > sorted_Apul.pbmm2.combined.bed

# Intersect 
bedtools intersect -a genes.bed -b sorted_Apul.pbmm2.combined.bed -wa -wb > gene_methylation.bed
bedtools intersect -a exons.bed -b sorted_Apul.pbmm2.combined.bed -wa -wb > exon_methylation.bed
bedtools intersect -a introns.bed -b sorted_Apul.pbmm2.combined.bed -wa -wb > intron_methylation.bed
bedtools intersect -a intergenic.bed -b sorted_Apul.pbmm2.combined.bed -wa -wb > intergenic_methylation.bed

echo "Intersection complete" $(date)

Submitted batch job 354769. Ran in <5 mins. Count CpGs in each region

wc -l gene_methylation.bed exon_methylation.bed intron_methylation.bed intergenic_methylation.bed
Written on February 6, 2024