-
Notifications
You must be signed in to change notification settings - Fork 70
Description
Hi,
I have identified orthologues with Orthofinder, and I am interested in identifying genes under selection with hyphy and if possible doing the analysis in parallel.
I installed HyPhy with conda, and I am trying to understand how to set the parameters for the analysis.
Since I have duplications and potential paralogues I will restrict the analysis to alignments with single-copy orthologues.
I was reading the tutorial on the website and it says: "This tutorial assumes you are specifically using the HyPhy executable HYPHYMP. If you have installed a different executable (e.g. HYPHYMPI), you may need to alter some commands."
Since I installed with conda, I'd appreciate some help with the code.
The website indicates to use:
# Path to files of interest
FILEPATH=/absolute/path/to/directory/with/all/your/files/
# Run BUSTED on all files $FILEPATH/*.fas
for FILE in `ls $FILEPATH/*.fas`; do
echo "Running BUSTED on $FILE"
echo `(echo "10"; echo "4"; echo "1"; echo $FILE; echo "Y"; echo "1"; echo "d") | HYPHYMP`
done
I don't have HYPHYMP. First, I want to do a quick single alignment-wide ω to find genes with signal of selection and then do BUSTED with one or two species of interest, and since I have access to a cluster I would like to submit several jobs with HYPHYMPI.
The code I thought would be:
#!/bin/bash
#PBS -l nodes=1:ppn=32
#PBS -l walltime=03:00:00
#PBS -l mem=100gb
#PBS -t 1-100
#PBS -k oe
#PBS -N single-msa-w
#PBS -o ${PBS_JOBNAME}_${PBS_JOBID}.log
#PBS -e ${PBS_JOBNAME}_${PBS_JOBID}.err
#PBS -l naccesspolicy=singlejob
# Load bash environment
source ~/.bashrc
# Init conda
source /data/common/MMA/mamba.init.sh
conda activate NGS-forge
cd /data/common/MMA/data/genomes/
# Save starting time
start_time=$(date +%s)
echo "Start ${PBS_JOBNAME}: Job started: $(date)"
# Get the sample name for this array job
base=$(sed -n "${PBS_ARRAYID}p" dn.dn.sps.txt)
#Calculate single alignment-wide ω
echo "Starting single alignment-wide ω on ${base} at $(date)"
HYPHYMPI CPU=32 ...
I can't find the code for running the single alignment-wide ω.
Could you please indicate the code to run the single alignment-wide ω?
Then, to run busted on one branch out of 11 (species), so I assume it would be:
HYPHYMPI CPU=32 busted-ph --alignment path/to/${base}.fasta --tree my_tree.nwk --branches Hydlep --output ${base}.json
As mentioned before, I ran Orthofinder to identify orthologues and orthogroups. The issue/ perhaps advantage, is that it adds the file name as prefix to the protein fasta headeres, so the multiple sequence alignments look like:
Ailmel_rna-XM_034655643.1
Calurs_rna-XM_073887989.1
Canlup_rna-XM_038532347.1
Hydlep_GNX-026372-mRNA
Lepwed_LOC_00017029-mRNA-10
Mirang_rna-XM_064596364.1
Mirleo_rna-XM_035014847.1
Musnig_rna-XM_059416383.1
Neosch_rna-XM_044916755.1
Ursarc_rna-XM_026498632.3
In my case, the tree is:
((((Calurs,(Halgry,((((Hydlep,Lepwed),(Mirang,Mirleo))),Neosch))),Musnig),(Ursarc,Ailmel)),Canlup);
Can I reconcile the alignment and the tree with trim-tree, next either parse or save the reconciled files and then run the analysis?
Orthofinder uses protein data. I looked at the alignments and many seem to need trimming due to issues like gaps. Is there any program that you recommend for doing protein-alignment trimming? I am planning to use trimAI but I wanted to ask you first.
Lastly, after trimming the aa, can HyPhy run the single alignment-wide ω/ busted analysis with the protein alignment (set input data) or must I backtranslate to codon sequences?
Thank you.