rhapsody package

Rhapsody: a program for pathogenicity prediction of human missense variants based on sequence, structure and dynamics of proteins

rhapsody.initialSetup(working_dir=None, refresh=False, download_EVmutation=True)[source]

Function to be run right after installation for setting up the environment and main parameters and for training the default classifiers. By default, a working directory will be created in the user home directory (~/rhapsody/). Previous configuration data will be recovered. Additional data from EVmutation website will be automatically downloaded (~1.4GB).

Parameters
  • working_dir (str) – path to a local folder

  • refresh (bool) – if True, previous trained classifiers will be deleted, if found

  • download_EVmutation (bool) – if True, precomputed EVmutation scores will be downloaded (recommended)

rhapsody.getDefaultTrainingDataset()[source]
rhapsody.getDefaultClassifiers()[source]

Returns a dictionary with the paths to the three default classifiers ('full', 'reduced' and 'EVmut')

rhapsody.importDefaultClassifier(version)[source]

Imports the specified classifier and its summary

Parameters

version (str) – either ‘full’, ‘reduced’ or ‘EVmut’

rhapsody.delSettings()[source]
rhapsody.getSettings(print=True)[source]

Returns and prints essential information about the current Rhapsody configuration, such as the location of working directory and default classifiers

rhapsody.calcScoreMetrics(y_test, y_pred, bootstrap=0, **resample_kwargs)[source]

Compute accuracy metrics of continuous values (optionally bootstrapped)

rhapsody.calcClassMetrics(y_test, y_pred, bootstrap=0, **resample_kwargs)[source]

Compute accuracy metrics of binary labels (optionally bootstrapped)

rhapsody.calcPathogenicityProbs(CV_info, num_bins=15, ppred_reliability_cutoff=200, pred_distrib_fig='predictions_distribution.png', path_prob_fig='pathogenicity_prob.png', **kwargs)[source]

Compute pathogenicity probabilities, from predictions on CV test sets

rhapsody.RandomForestCV(feat_matrix, n_estimators=1500, max_features=2, **kwargs)[source]
rhapsody.trainRFclassifier(feat_matrix, n_estimators=1500, max_features=2, pickle_name='trained_classifier.pkl', feat_imp_fig='feat_importances.png', **kwargs)[source]
rhapsody.extendDefaultTrainingDataset(names, arrays, base_default_featset='full')[source]

base : array Input array to extend.

names : string, sequence String or sequence of strings corresponding to the names of the new fields.

data : array or sequence of arrays Array or sequence of arrays storing the fields to add to the base.

rhapsody.print_pred_distrib_figure(filename, bins, histo, dx, J_opt)[source]
rhapsody.print_path_prob_figure(filename, bins, histo, dx, path_prob, smooth_plot=None, cutoff=200)[source]
rhapsody.print_ROC_figure(filename, fpr, tpr, auc_stat)[source]
rhapsody.print_feat_imp_figure(filename, feat_imp, featset)[source]
rhapsody.queryUniprot(*args, n_attempts=3, dt=1, **kwargs)[source]

Redefine prody function to check for no internet connection

class rhapsody.UniprotMapping(acc, recover_pickle=False, **kwargs)[source]

Bases: object

__init__(acc, recover_pickle=False, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

refresh()[source]

Refresh imported Uniprot records and mappings, and delete precomputed alignments.

getFullRecord()[source]

Returns the output from queryUniprot()

getPDBrecords()[source]

Returns a dictionary containing only the ‘dbReference’ records relative to PDB, extracted from the full Uniprot record.

getPDBmappings(PDBID=None)[source]

Returns a list of dictionaries, with mappings of the Uniprot sequence onto single PDB chains. For each PDB chain, the residue intervals retrieved from the Uniprot database are parsed into a list of tuples (‘chain_sel’) corresponding to endpoints of individual segments. NB: ‘@’ stands for ‘all chains’, following Uniprot naming convention.

alignSinglePDB(PDBID, chain='longest')[source]

Aligns the Uniprot sequence with the sequence from the given PDB entry.

alignCustomPDB(PDB, chain='all', title=None, recover=False)[source]

Aligns the Uniprot sequence with the sequence from the given PDB.

alignAllPDBs(chain='longest')[source]

Aligns the Uniprot sequence with the sequences of all PDBs in the Uniprot record.

mapSingleResidue(resid, check_aa=False, depth='best')[source]

Map a single amino acid in a Uniprot sequence to PDBs. If ‘check_aa’ is True, it will return only PDB residues with the wild-type amino acid. If ‘depth’ is ‘matching’, it will use info from Uniprot record to determine which PDBs contain the given residue, and if ‘depth’ is ‘best’ only the longest chain will be considered and printed, to save time. If ‘depth’ is all, it will perform a thorough search among all PDBs (slow). The matching PDB residues will be sorted, in descending order, according to the identity of the relative chain with the Uniprot sequence.

mapSingleRes2CustomPDBs(resid, check_aa=False)[source]

Map an amino acid in the Uniprot sequence to aligned custom PDBs. If ‘check_aa’ is True, it will return only PDB residues with the wild-type amino acid.

setAlignAlgorithm(align_algorithm=1, gap_open_penalty=- 0.5, gap_ext_penalty=- 0.1, refresh=True)[source]

Set the Biopython alignment algorithm used for aligning Uniprot sequence to PDB sequences. All precomputed alignments will be deleted.

savePickle(filename=None, folder=None, store_custom_PDBs=False)[source]
recoverPickle(filename=None, folder=None, days=30, **kwargs)[source]
resetTimestamp()[source]
calcEvolProperties(resid='all', refresh=False, folder=None, max_cols=None, max_seqs=25000, **kwargs)[source]

Computes Evol properties, i.e. Shannon entropy, Mutual Information and Direct Information, from Pfam Multiple Sequence Alignments, for a given residue.

rhapsody.mapSAVs2PDB(SAV_coords, custom_PDB=None, refresh=False, status_file=None, status_prefix=None)[source]
rhapsody.seqScanning(Uniprot_coord, sequence=None)[source]

Returns a list of SAVs. If the string ‘Uniprot_coord’ is just a Uniprot ID, the list will contain all possible amino acid substitutions at all positions in the sequence. If ‘Uniprot_coord’ also includes a specific position, the list will only contain all possible amino acid variants at that position. If ‘sequence’ is ‘None’ (default), the sequence will be downloaded from Uniprot.

rhapsody.printSAVlist(input_SAVs, filename)[source]
class rhapsody.PDBfeatures(PDB, n_modes='all', recover_pickle=False, **kwargs)[source]

Bases: object

A class for deriving structural and dynamical properties from a PDB structure.

Parameters
  • PDB (Atomic, str) – an object or a PDB code identifying a PDB structure.

  • n_modes (int, str) – number of GNM/ANM modes to be computed.

  • recover_pickle (bool) – whether or not to recover precomputed pickle, if found

__init__(PDB, n_modes='all', recover_pickle=False, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

getPDB()[source]

Returns the parsed PDB structure as an AtomGroup object.

refresh()[source]

Deletes all precomputed ENM models and features, and resets time stamp.

recoverPickle(folder=None, filename=None, days=30, **kwargs)[source]

Looks for precomputed pickle for the current PDB structure.

Parameters
  • folder (str) – path of folder where pickles are stored. If not specified, pickles will be searched for in the local Rhapsody installation folder.

  • filename (str) – name of the pickle. If not specified, the default filename 'PDBfeatures-[PDBID].pkl' will be used. If a PDBID is not found, user must specify a valid filename.

  • days (int) – number of days after which a pickle will be considered too old and won’t be recovered.

savePickle(folder=None, filename=None)[source]

Stores a pickle of the current class instance. The pickle will contain all information and precomputed features, but not GNM and ANM models. In case a PDBID is missing, the parsed PDB AtomGroup is stored as well.

Parameters
  • folder (str) – path of the folder where the pickle will be saved. If not specified, the local Rhapsody installation folder will be used.

  • filename (str) – name of the pickle. By default, the pickle will be saved as 'PDBfeatures-[PDBID].pkl'. If a PDBID is not defined, the user must provide a filename.

Returns

pickle path

Return type

str

resetTimestamp()[source]
setNumModes(n_modes)[source]

Sets the number of ENM modes to be computed. If different from the number provided at instantiation, any precomputed features will be deleted.

calcGNM(chID, env='chain')[source]

Builds GNM model for the selected chain.

Parameters
  • chID (str) – chain identifier

  • env (str) – environment model, i.e. 'chain', 'reduced' or 'sliced'

Returns

GNM model

Return type

GNM

calcANM(chID, env='chain')[source]

Builds ANM model for the selected chain.

Parameters
  • chID (str) – chain identifier

  • env (str) – environment model, i.e. 'chain', 'reduced' or 'sliced'

Returns

ANM model

Return type

ANM

calcGNMfeatures(chain='all', env='chain', GNM_PRS=True)[source]

Computes GNM-based features.

Parameters
  • chain (str) – chain identifier

  • env (str) – environment model, i.e. 'chain', 'reduced' or 'sliced'

  • GNM_PRS (bool) – whether or not to compute features based on Perturbation Response Scanning analysis

calcANMfeatures(chain='all', env='chain', ANM_PRS=True, stiffness=True, MBS=False)[source]

Computes ANM-based features.

Parameters
  • chain (str) – chain identifier

  • env (str) – environment model, i.e. 'chain', 'reduced' or 'sliced'

  • ANM_PRS (bool) – whether or not to compute features based on Perturbation Response Scanning analysis

  • stiffness (bool) – whether or not to compute stiffness with MechStiff

  • MBS (bool) – whether or not to compute Mechanical Bridging Score

calcDSSP(chain='whole')[source]

Runs DSSP on the PDB structure.

Parameters

chain (str) – chain identifier. If 'whole', the whole complex will be considered

Returns

modified PDB object with DSSP properties added as additional attributes, accessible via method getData()

Return type

AtomGroup

calcSASA(chain='all')[source]

Computes Solvent Accessible Surface Area of single chains with DSSP algorithm.

Parameters

chain (str) – chain identifier

calcDeltaSASA(chain='all')[source]

Computes the difference between Solvent Accessible Surface Area of an isolated chain and of the same chain seen in the complex.

Parameters

chain (str) – chain identifier

calcSelFeatures(chain='all', resid=None, sel_feats=None)[source]

Computes selected PDB-based features for all chains in the PDB structure, for a specific chain or for a single residue. Available features are listed in PDB_FEATS().

Parameters
  • chain (str) – chain identifier

  • resid (int) – residue number. If selected, a single chain must be also specified

  • sel_feats – list of feature names. If None, all PDB_FEATS() will be computed

Returns

a dictionary, containing names and values (or error messages) of selected features, for each chain or residue

Return type

dict

rhapsody.calcPDBfeatures(mapped_SAVs, sel_feats=None, custom_PDB=None, refresh=False, status_file=None, status_prefix=None)[source]
rhapsody.queryPolyPhen2(filename, dump=True, prefix='pph2', fasta_file=None, fix_isoforms=False, ignore_errors=False, **kwargs)[source]
rhapsody.parsePolyPhen2output(pph2_output)[source]

Import PolyPhen-2 results directly from the output of ‘queryPolyPhen2’ or from a file (in ‘full’ format).

rhapsody.getSAVcoords(parsed_lines)[source]

Extracts SAV Uniprot coordinates as provided by the user. If not possible, the Uniprot coordinates computed by PolyPhen-2 will be returned. A string containing the original submitted SAV is returned as well.

rhapsody.calcPolyPhen2features(PolyPhen2output)[source]
rhapsody.recoverEVmutFeatures(SAVs)[source]

Compute EVmutation features by fetching precomputed scores from the downloaded local folder. If multiple values are found for a given variant, the average will be taken.

Parameters

SAVs (list or tuple of strings) – list of SAV coordinates, e.g. 'P17516 135 G E'.

Returns

an array of EVmutation features for each SAV

Return type

NumPy structured array

rhapsody.calcPfamFeatures(SAVs, status_file=None, status_prefix=None)[source]
rhapsody.calcBLOSUMfeatures(SAV_coords)[source]
class rhapsody.Rhapsody(query=None, query_type='SAVs', queryPolyPhen2=True, **kwargs)[source]

Bases: object

A class implementing the Rhapsody algorithm for pathogenicity prediction of human missense variants and that can also be used to compare results from other prediction tools, namely PolyPhen-2 and EVmutation.

__init__(query=None, query_type='SAVs', queryPolyPhen2=True, **kwargs)[source]

Initialize a Rhapsody object with a list of SAVs (optional).

Parameters
  • query (str, list) –

    Single Amino Acid Variants (SAVs) in Uniprot coordinates.

    • If None, the SAV list can be imported at a later moment, by using .importPolyPhen2output(), .queryPolyPhen2() or .setSAVs()

    • if query_type = 'SAVs' (default), query should be a filename, a string or a list/tuple of strings, containing Uniprot SAV coordinates, with the format 'P17516 135 G E'. The string could also be just a single Uniprot sequence identifier (e.g. 'P17516'), or the coordinate of a specific site in a sequence (e.g. 'P17516 135'), in which case all possible 19 amino acid substitutions at the specified positions will be analyzed.

    • if query_type = 'PolyPhen2', query should be a filename containing the output from PolyPhen-2, usually named pph2-full.txt

  • query_type (str) – 'SAVs' or 'PolyPhen2'

  • queryPolyPhen2 (bool) – if True, the PolyPhen-2 online tool will be queryied with the list of SAVs

setSAVs(query)[source]
queryPolyPhen2(query, filename='rhapsody-SAVs.txt')[source]
importPolyPhen2output(filename)[source]
getSAVcoords()[source]
setFeatSet(featset)[source]
setCustomPDB(custom_PDB)[source]
setTrueLabels(true_label_dict)[source]
getUniprot2PDBmap(filename='rhapsody-Uniprot2PDB.txt', print_header=True, refresh=False)[source]

Maps each SAV to the corresponding resid in a PDB chain.

getPDBcoords()[source]
getUniqueSAVcoords()[source]
calcFeatures(filename='rhapsody-features.txt', refresh=False)[source]
importFeatMatrix(struct_array)[source]
exportTrainingData(refresh=False)[source]
importPrecomputedExtraFeatures(features_dict)[source]
importClassifiers(classifier, aux_classifier=None, force_env=None)[source]
getPredictions(SAV='all', classifier='best', PolyPhen2=True, EVmutation=True, PDBcoords=False, refresh=False)[source]
getResAvgPredictions(resid=None, classifier='best', PolyPhen2=True, EVmutation=True, refresh=False)[source]
printPredictions(classifier='best', PolyPhen2=True, EVmutation=True, filename='rhapsody-predictions.txt', print_header=True)[source]
writePDBs(PDBID=None, predictions='best', path_prob=True, filename_prefix='rhapsody-PDB', refresh=False)[source]
savePickle(filename='rhapsody-pickle.pkl')[source]
rhapsody.calcPredictions(feat_matrix, clsf, SAV_coords=None)[source]
rhapsody.rhapsody(query, query_type='SAVs', main_classifier=None, aux_classifier=None, custom_PDB=None, force_env=None, refresh=False, log=True, **kwargs)[source]

Obtain Rhapsody pathogenicity predictions on a list of human missense variants ([ref])

Parameters
  • query (str, list) –

    Single Amino Acid Variants (SAVs) in Uniprot coordinates

    • if query_type = 'SAVs' (default), it should be a filename, a string or a list/tuple of strings, containing Uniprot SAV coordinates, with the format 'P17516 135 G E'. The string could also be just a single Uniprot sequence identifier (e.g. 'P17516'), or the coordinate of a specific site in a sequence (e.g. 'P17516 135'), in which case all possible 19 amino acid substitutions at the specified positions will be analyzed.

    • if query_type = 'PolyPhen2', it should be a filename containing the output from PolyPhen-2, usually named pph2-full.txt

  • query_type (str) – 'SAVs' or 'PolyPhen2'

  • main_classifier (str) – main classifier’s filename. If None, the default full Rhapsody classifier will be used

  • aux_classifier (str) – auxiliary classifier’s filename. If both main_classifier and aux_classifier are None, the default reduced Rhapsody classifier will be used

  • custom_PDB (str, AtomGroup) – a PDBID, a filename or an Atomic to be used for computing structural and dynamical features, instead of the PDB structure automatically selected by the program

  • force_env (str) – force a specific environment model for GNM/ANM calculations, among 'chain', 'reduced' and 'sliced'. If None (default), the model of individual dynamical features will match that found in the classifier’s feature set

  • refresh (str) – if True, precomputed features and PDB mappings found in the working directory will be ignored and computed again

  • log (str) – if True, log messages will be saved in rhapsody-log.txt

ref

Ponzoni L, Bahar I. Structural dynamics is a determinant of the functional significance of missense variants. PNAS 2018 115 (16) 4164-4169.

rhapsody.print_sat_mutagen_figure(filename, rhapsody_obj, res_interval=None, PolyPhen2=True, EVmutation=True, extra_plot=None, fig_height=8, fig_width=None, dpi=300, min_interval_size=15, html=False, main_clsf='main', aux_clsf='aux.')[source]