Skip to content

Setting a batch analysis for single alignment-wide ω #1955

@desmodus1984

Description

@desmodus1984

Hi,
I have identified orthologues with Orthofinder, and I am interested in identifying genes under selection with hyphy and if possible doing the analysis in parallel.

I installed HyPhy with conda, and I am trying to understand how to set the parameters for the analysis.
Since I have duplications and potential paralogues I will restrict the analysis to alignments with single-copy orthologues.

I was reading the tutorial on the website and it says: "This tutorial assumes you are specifically using the HyPhy executable HYPHYMP. If you have installed a different executable (e.g. HYPHYMPI), you may need to alter some commands."
Since I installed with conda, I'd appreciate some help with the code.

The website indicates to use:

# Path to files of interest
FILEPATH=/absolute/path/to/directory/with/all/your/files/

# Run BUSTED on all files $FILEPATH/*.fas
for FILE in `ls $FILEPATH/*.fas`; do
    echo "Running BUSTED on $FILE"
    echo `(echo "10"; echo "4"; echo "1"; echo $FILE; echo "Y"; echo "1"; echo "d") | HYPHYMP`
done

I don't have HYPHYMP. First, I want to do a quick single alignment-wide ω to find genes with signal of selection and then do BUSTED with one or two species of interest, and since I have access to a cluster I would like to submit several jobs with HYPHYMPI.

The code I thought would be:

#!/bin/bash

#PBS -l nodes=1:ppn=32
#PBS -l walltime=03:00:00
#PBS -l mem=100gb
#PBS -t 1-100
#PBS -k oe
#PBS -N single-msa-w
#PBS -o ${PBS_JOBNAME}_${PBS_JOBID}.log
#PBS -e ${PBS_JOBNAME}_${PBS_JOBID}.err
#PBS -l naccesspolicy=singlejob

# Load bash environment
source ~/.bashrc

# Init conda
source /data/common/MMA/mamba.init.sh
conda activate NGS-forge

cd /data/common/MMA/data/genomes/

# Save starting time
start_time=$(date +%s)
echo "Start ${PBS_JOBNAME}: Job started: $(date)"

# Get the sample name for this array job
base=$(sed -n "${PBS_ARRAYID}p" dn.dn.sps.txt)

#Calculate single alignment-wide ω

echo "Starting single alignment-wide ω on ${base} at $(date)"

HYPHYMPI CPU=32  ...

I can't find the code for running the single alignment-wide ω.
Could you please indicate the code to run the single alignment-wide ω?

Then, to run busted on one branch out of 11 (species), so I assume it would be:

HYPHYMPI CPU=32  busted-ph --alignment path/to/${base}.fasta --tree my_tree.nwk --branches Hydlep --output ${base}.json

As mentioned before, I ran Orthofinder to identify orthologues and orthogroups. The issue/ perhaps advantage, is that it adds the file name as prefix to the protein fasta headeres, so the multiple sequence alignments look like:
Ailmel_rna-XM_034655643.1
Calurs_rna-XM_073887989.1
Canlup_rna-XM_038532347.1
Hydlep_GNX-026372-mRNA
Lepwed_LOC_00017029-mRNA-10
Mirang_rna-XM_064596364.1
Mirleo_rna-XM_035014847.1
Musnig_rna-XM_059416383.1
Neosch_rna-XM_044916755.1
Ursarc_rna-XM_026498632.3

In my case, the tree is:
((((Calurs,(Halgry,((((Hydlep,Lepwed),(Mirang,Mirleo))),Neosch))),Musnig),(Ursarc,Ailmel)),Canlup);

Can I reconcile the alignment and the tree with trim-tree, next either parse or save the reconciled files and then run the analysis?

Orthofinder uses protein data. I looked at the alignments and many seem to need trimming due to issues like gaps. Is there any program that you recommend for doing protein-alignment trimming? I am planning to use trimAI but I wanted to ask you first.

Lastly, after trimming the aa, can HyPhy run the single alignment-wide ω/ busted analysis with the protein alignment (set input data) or must I backtranslate to codon sequences?

Thank you.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions