Python package to search/retrieve/filter proteins and protein structures.
It uses
- UniProt SPARQL endpoint to search for proteins and their measured or predicted 3D structures.
- UniProt taxonomy to search for taxonomy.
- QuickGO to search for Gene Ontology terms.
- gemmi to work with macromolecular models.
- dask-distributed to compute in parallel.
The package is used by
An example workflow:
graph TB;
taxonomy[/Search taxon/] -. taxon_ids .-> searchuniprot[/Search UniprotKB/]
goterm[/Search GO term/] -. go_ids .-> searchuniprot[/Search UniprotKB/]
searchuniprot --> |uniprot_accessions|searchpdbe[/Search PDBe/]
searchuniprot --> |uniprot_accessions|searchaf[/Search Alphafold/]
searchuniprot -. uniprot_accessions .-> searchemdb[/Search EMDB/]
searchuniprot -. uniprot_accessions .-> searchuniprotdetails[/Search UniProt details/]
searchintactionpartners[/Search interaction partners/] -.-x |uniprot_accessions|searchuniprot
searchcomplexes[/Search complexes/]
searchpdbe -->|pdb_ids|fetchpdbe[Retrieve PDBe]
searchaf --> |uniprot_accessions|fetchad(Retrieve AlphaFold)
searchemdb -. emdb_ids .->fetchemdb[Retrieve EMDB]
fetchpdbe -->|mmcif_files| chainfilter{{Filter on chain of uniprot}}
chainfilter --> |mmcif_files| residuefilter{{Filter on chain length}}
fetchad -->|mmcif_files| confidencefilter{{Filter out low confidence}}
confidencefilter --> |mmcif_files| ssfilter{{Filter on secondary structure}}
residuefilter --> |mmcif_files| ssfilter
ssfilter -. mmcif_files .-> convert2cif([Convert to cif])
ssfilter -. mmcif_files .-> convert2uniprot_accessions([Convert to UniProt accessions])
classDef dashedBorder stroke-dasharray: 5 5;
goterm:::dashedBorder
taxonomy:::dashedBorder
searchemdb:::dashedBorder
fetchemdb:::dashedBorder
searchintactionpartners:::dashedBorder
searchcomplexes:::dashedBorder
searchuniprotdetails:::dashedBorder
convert2cif:::dashedBorder
convert2uniprot_accessions:::dashedBorder
(Dotted nodes and edges are side-quests.)
pip install protein-quest
Or to use the latest development version:
pip install git+https://github.com/haddocking/protein-quest.git
The main entry point is the protein-quest command line tool which has multiple
subcommands to perform actions.
To use the package programmatically, see the Jupyter notebooks and API documentation.
While downloading or copying files it uses a global cache (located at
~/.cache/protein-quest) and hardlinks to save disk space and improve speed.
This behavior can be customized with the --no-cache, --cache-dir, and
--copy-method command line arguments.
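The idea of serving files from a cache through hardlinks can be illustrated with a small standard-library sketch. This is not protein-quest's implementation: the function name `link_or_copy` and the fall-back-to-copy behaviour are assumptions made for illustration only.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(cached: Path, target: Path) -> str:
    """Place `cached` at `target`, preferring a hardlink over a full copy.

    Hypothetical helper: returns "hardlink" or "copy" so the caller can see
    which method was used.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        # os.link refuses to overwrite, so remove a stale target first.
        target.unlink()
    try:
        # A hardlink shares the data blocks of the cached file,
        # so it takes no extra disk space and is created instantly.
        os.link(cached, target)
        return "hardlink"
    except OSError:
        # Hardlinks fail across filesystems or where unsupported;
        # fall back to a regular copy.
        shutil.copy2(cached, target)
        return "copy"
```

A hardlinked file and its cached original are the same data on disk, so editing one in place changes the other; that is one reason a tool may offer a copy method as an alternative.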
protein-quest search uniprot \
--taxon-id 9606 \
--reviewed \
--subcellular-location-uniprot "nucleus" \
--subcellular-location-go GO:0005634 \
--molecular-function-go GO:0003677 \
--limit 100 \
uniprot_accs.txt
(GO:0005634 is "Nucleus" and GO:0003677 is "DNA binding")
protein-quest search pdbe uniprot_accs.txt pdbe.csv
The pdbe.csv file is written, containing the PDB ID and chain of each UniProt
accession.
protein-quest search alphafold uniprot_accs.txt alphafold.csv
protein-quest search emdb uniprot_accs.txt emdbs.csv
protein-quest retrieve pdbe pdbe.csv downloads-pdbe/
protein-quest retrieve alphafold alphafold.csv downloads-af/
For each entry, the cif file is downloaded.
protein-quest retrieve emdb emdbs.csv downloads-emdb/
Filter AlphaFoldDB structures based on confidence (pLDDT). Keeps entries in which the number of residues with a confidence score above the threshold falls within the requested range. Also writes pdb files with only those residues.
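The selection rule of the confidence filter can be sketched in a few lines of plain Python. This is an illustration only, not the package's code: the names `ConfidenceResult` and `filter_by_confidence` are hypothetical, and the strict "above" comparison and the inclusive residue bounds are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ConfidenceResult:
    kept_residues: list  # indices of residues above the threshold
    passed: bool         # True if the kept count is within the bounds


def filter_by_confidence(plddts, threshold, min_residues, max_residues):
    """Apply a pLDDT threshold to per-residue scores of one structure.

    Assumptions: a residue is kept when its score is strictly greater than
    `threshold`, and the residue count bounds are inclusive.
    """
    kept = [i for i, score in enumerate(plddts) if score > threshold]
    passed = min_residues <= len(kept) <= max_residues
    return ConfidenceResult(kept_residues=kept, passed=passed)
```

In this sketch a structure is accepted or rejected as a whole, while the list of kept residue indices is what would be written out to the reduced file.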
protein-quest filter confidence \
--confidence-threshold 50 \
--min-residues 100 \
--max-residues 1000 \
./downloads-af ./filtered
Make PDBe files smaller by only keeping the first chain of the found UniProt entry and renaming it to chain A.
protein-quest filter chain \
pdbe.csv \
./downloads-pdbe ./filtered-chains
protein-quest filter residue \
--min-residues 100 \
--max-residues 1000 \
./filtered-chains ./filtered
To filter on structures that are mostly alpha helices and have no beta sheets, run the secondary-structure filter. See the following notebook to determine the ratio of secondary structure elements.
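The ratio test behind such a filter can be sketched as follows. The function name `passes_secondary_structure` is hypothetical, and the inclusive comparisons (helix ratio at least the minimum, sheet ratio at most the maximum) are assumptions rather than documented behaviour of the command.

```python
def passes_secondary_structure(
    nr_helix_residues,
    nr_sheet_residues,
    nr_residues,
    ratio_min_helix=None,
    ratio_max_sheet=None,
):
    """Return True if the helix and sheet ratios satisfy the given limits.

    A limit of None means that criterion is not applied.
    """
    if nr_residues <= 0:
        # No residues means no meaningful ratio; reject the structure.
        return False
    helix_ratio = nr_helix_residues / nr_residues
    sheet_ratio = nr_sheet_residues / nr_residues
    if ratio_min_helix is not None and helix_ratio < ratio_min_helix:
        return False
    if ratio_max_sheet is not None and sheet_ratio > ratio_max_sheet:
        return False
    return True
```

With a minimum helix ratio of 0.5 and a maximum sheet ratio of 0.0, a 100-residue chain with 60 helix residues passes only if it has no sheet residues at all.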
protein-quest filter secondary-structure \
--ratio-min-helix-residues 0.5 \
--ratio-max-sheet-residues 0.0 \
--write-stats filtered-ss/stats.csv \
./filtered-chains ./filtered-ss
protein-quest search taxonomy "Homo sapiens" -
You might not know the identifier of a Gene Ontology (GO) term to pass to
protein-quest search uniprot. You can use the following command to search for a
GO term.
protein-quest search go --limit 5 --aspect cellular_component apoptosome -
Use https://www.ebi.ac.uk/complexportal to find interaction partners of a given UniProt accession.
protein-quest search interaction-partners Q05471 interaction-partners-of-Q05471.txt
The interaction-partners-of-Q05471.txt file contains UniProt accessions (one
per line).
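Because the output is a plain list with one accession per line, it is easy to load in a script. A minimal sketch, assuming only that format; the helper name `read_accessions` is hypothetical, and skipping blank lines and duplicates is a choice made here, not something the tool is known to require.

```python
from pathlib import Path


def read_accessions(path):
    """Read UniProt accessions from a text file with one accession per line.

    Blank lines are skipped and duplicates are dropped while keeping the
    order of first appearance.
    """
    seen = {}
    for line in Path(path).read_text().splitlines():
        accession = line.strip()
        if accession:
            # A dict preserves insertion order, so it doubles as an ordered set.
            seen.setdefault(accession, None)
    return list(seen)
```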
Given UniProt accessions, search for macromolecular complexes at https://www.ebi.ac.uk/complexportal and return the complex entries and their members.
echo Q05471 | protein-quest search complexes - complexes.csv
The complexes.csv looks like
query_protein,complex_id,complex_url,complex_title,members
Q05471,CPX-2122,https://www.ebi.ac.uk/complexportal/complex/CPX-2122,Swr1 chromatin remodelling complex,P31376;P35817;P38326;P53201;P53930;P60010;P80428;Q03388;Q03433;Q03940;Q05471;Q06707;Q12464;Q12509
To get details (like protein name, sequence length, organism) for a list of UniProt accessions:
protein-quest search uniprot-details uniprot_accs.txt uniprot_details.csv
The uniprot_details.csv looks like:
uniprot_accession,uniprot_id,sequence_length,reviewed,protein_name,taxon_id,taxon_name
A0A087WUV0,ZN892_HUMAN,522,True,Zinc finger protein 892,9606,Homo sapiens
Some tools (for example powerfit) only
work with .cif files and not *.cif.gz or *.bcif files.
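For the gzip case alone, the conversion amounts to decompression. The sketch below uses only the standard library and is not how protein-quest converts files; the helper name `gunzip_cif` is hypothetical, and it does not handle .bcif files, which use a binary encoding and need a dedicated reader.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_cif(src: Path, out_dir: Path) -> Path:
    """Decompress a *.cif.gz file into `out_dir` and return the new path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # "model.cif.gz" becomes "model.cif"; names without ".gz" are kept as-is.
    dest = out_dir / src.name.removesuffix(".gz")
    with gzip.open(src, "rb") as fin, dest.open("wb") as fout:
        # Stream the data so large structures are not loaded fully into memory.
        shutil.copyfileobj(fin, fout)
    return dest
```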
protein-quest convert structures --format cif --output-dir ./filtered-cif ./filtered-ss
After running some filters you might want to know which UniProt accessions are still present in the filtered structures.
protein-quest convert uniprot ./filtered-ss uniprot_accs.filtered.txt
You can use protein-quest --prov ... to store provenance information of your
CLI invocations in a
Research Object crate file called
ro-crate-metadata.json.
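An RO-Crate metadata file is JSON-LD whose entities live in a top-level "@graph" list, so it can be inspected with the standard library. The sketch below lists the identifier and type of each entity; the helper name `list_crate_entities` is hypothetical, and which entities protein-quest actually records is not assumed here.

```python
import json
from pathlib import Path


def list_crate_entities(crate_path):
    """Return (id, type) pairs for every entity in an RO-Crate metadata file.

    "@type" is returned as stored, which may be a string or a list of strings.
    """
    metadata = json.loads(Path(crate_path).read_text())
    return [
        (entity.get("@id"), entity.get("@type"))
        for entity in metadata.get("@graph", [])
    ]
```

Calling `list_crate_entities("ro-crate-metadata.json")` after a run with provenance enabled gives a quick overview of what was recorded.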
Protein quest can also help LLMs like Claude Sonnet 4 by providing a set of tools for protein structures.
To run the MCP server, install the mcp extra with:
pip install protein-quest[mcp]
The server can be started with:
protein-quest mcp
The MCP server contains a prompt template to search/retrieve/filter candidate structures.
The protein-quest command line tool supports shell autocompletion using
shtab.
Initialize for bash shell with:
mkdir -p ~/.local/share/bash-completion/completions
protein-quest --print-completion bash > ~/.local/share/bash-completion/completions/protein-quest
Initialize for zsh shell with:
mkdir -p ~/.local/share/zsh/site-functions
protein-quest --print-completion zsh > ~/.local/share/zsh/site-functions/_protein-quest
fpath=("$HOME/.local/share/zsh/site-functions" $fpath)
autoload -Uz compinit && compinit
For development information and contribution guidelines, please see CONTRIBUTING.md.
