Macromolecular modeling and design in Rosetta: recent methods and frameworks

The understanding that molecular structure determines biological function has motivated decades of experimental determination of protein structure and function. Many computational packages have been developed to guide experimental methods and elucidate macromolecular structure, including Rosetta. Rosetta offers capabilities spanning many bioinformatics and structural-bioinformatics tasks. Computational structural biology frameworks with similarly comprehensive scope are few, but key to progress in biology. Schrodinger1, the Molecular Operating Environment2, and Discovery Studio3 are computational chemistry platforms for advanced modeling and design for structural biology, drug discovery and material science, based on molecular mechanics, molecular dynamics and quantum mechanics calculations. The HHSuite4 includes tools for bioinformatics, sequence alignments, structure prediction and modeling. The BioChemicalLibrary5 (BCL) includes tools for structure prediction, drug discovery, and several sequence-to-structure methods using machine learning approaches. The Integrative Modeling Platform6 (IMP) models large macromolecular complexes by incorporating various types of experimental data. OpenBabel7 is a ChemInformatics toolbox supporting molecular mechanics calculations, being most heavily used for interconversion of file formats.

Molecular dynamics packages like CHARMM8, AMBER9, GROMACS10 and others simulate most atoms explicitly with a physics-based energy function and propagate the system by numerically solving Newton’s equations of motion. These methods can be used for folding small proteins, model refinement, modeling phenomena such as ion flow through membrane channels, and modeling interactions with small molecules, and are therefore highly complementary to Rosetta. OpenMM11 is an API (application programming interface) for setting up molecular simulations and can be used as a library or standalone application.

Many other tools are available for more specialized tasks, for instance for de novo modeling (AlphaFold12,13, QUARK14, RaptorX15), homology modeling (Modeller16, SwissModel17), fold recognition (iTasser18), protein-protein docking (HADDOCK19, Zdock20, ClusPro21), ligand docking (AutoDock22, FlexX23, Glide24) and numerous other tasks requiring molecular modeling. As the focus here is on Rosetta developments, a comprehensive list of related methods is provided in the Supplementary Note.

One of Rosetta’s advantages is inter-operability of its large number of applications; however, this makes it challenging to track the scope of functionality available to scientists who wish to use the software. This Perspective is meant to guide new, returning, or seasoned users; to help them find the right protocol hiding in the Rosetta haystack.

Development of Rosetta started in the mid-1990s; it was initially aimed at protein structure prediction and protein folding25. Over time, the number of applications grew to address diverse modeling tasks, from protein–protein or protein–small molecule docking to incorporating NMR data, loop modeling, protein design, and interaction with peptides and nucleic acids. Over more than 20 years, the community of developers and scientists, the RosettaCommons, grew from a single academic laboratory to laboratories at over 60 institutions worldwide26. The software has undergone several transitions, including in programming language and implementation, with the latest protocols based on Rosetta3, first released in 2008 27. The score function has been continuously improved and is described in refs. 28 and 29. As part of our sustained focus on accessibility, usability, and scientific reproducibility, we developed several interfaces (PyRosetta30, RosettaScripts31, Foldit32), and emphasized publishing protocol captures33 to accompany manuscripts. As those interfaces have grown more versatile and modular, development has accelerated and branched in many directions. However, because this interoperability, extensibility and modularity let scientists combine modules in a wide variety of ways, it has become difficult to keep up with all the developments within the software and the scientific community. Here we have compiled the latest method developments in Rosetta from the past five years, divided into several categories; we provide direction on where to find further information for specific modeling problems. The Supplementary Note contains more details on the protocols with extensive links to documentation, resources on the web, limitations, and competitors.

Some Rosetta applications suffer from (1) technical challenges in implementation, (2) a lack of documentation, protocol captures, or support, and (3) a need for more diverse chemistries for biomolecules. Technical challenges are either historical or due to lack of interest in the community to develop and advance methods in these unique areas.

In a typical protocol, many independent trajectories are generated, and the final models are evaluated based on the scientific objective. This setup highlights common limitations in Rosetta protocols involving sampling, scoring (discussed in the score function section), or technical challenges. Many protocols suffer from under-sampling 34, especially when flexibility is involved. Sampling is a limitation for structure prediction (especially for large structures), protein design and unconstrained global protein-protein docking. For example, even with local docking we are limited by backbone flexibility, and performance deteriorates with larger flexibility in the binding interface. Small molecule docking similarly relies on correct identification of the binding interface and is limited by flexibility between unbound and bound states. Enormous conformational search spaces are also prohibitive for RNA modeling due to the size and combinatorics of their torsion space (see RNA section), for membrane proteins due to their size, and for carbohydrates because of branching and flexibility.

A typical Rosetta protocol proceeds as follows: the conformation of a biomolecule (the Pose) is altered, either deterministically or stochastically, via a Mover, and the resulting conformation is evaluated by a ScoreFunction. The move is accepted or rejected based on the Metropolis criterion and the energy difference between the original and the new conformation: a move that lowers the energy is always accepted, whereas a move that raises it by ΔE is accepted with probability exp(−ΔE/kT).
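The accept/reject step described above can be sketched in a few lines of Python. This is a minimal illustration of the Metropolis criterion on a one-dimensional toy problem, not Rosetta code: `metropolis_accept`, `run_monte_carlo`, `toy_score` and `toy_move` are hypothetical names standing in for the ScoreFunction and Mover concepts, and `kT` is an arbitrary temperature factor.

```python
import math
import random


def metropolis_accept(e_old, e_new, kT=1.0, rng=random):
    """Return True if a move from energy e_old to e_new is accepted."""
    delta = e_new - e_old
    if delta <= 0.0:
        # Downhill (or neutral) moves are always accepted.
        return True
    # Uphill moves are accepted with probability exp(-delta / kT).
    return rng.random() < math.exp(-delta / kT)


def run_monte_carlo(score, propose, x0, n_steps=2000, kT=1.0, seed=0):
    """Generic Monte Carlo loop: propose a move, score it, accept or reject."""
    rng = random.Random(seed)
    x, e = x0, score(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = propose(x, rng)      # plays the role of a Mover
        e_new = score(x_new)         # plays the role of a ScoreFunction
        if metropolis_accept(e, e_new, kT, rng):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


# Toy stand-ins (assumptions for illustration only).
def toy_score(x):
    return (x - 3.0) ** 2


def toy_move(x, rng):
    return x + rng.uniform(-0.5, 0.5)


best_x, best_e = run_monte_carlo(toy_score, toy_move, x0=0.0)
print(f"lowest-energy state found: x={best_x:.3f}, E={best_e:.4f}")
```

Running many such trajectories from different seeds and keeping the lowest-scoring results mirrors, in miniature, how independent trajectories are generated and then filtered.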

There are still several limitations to the score function: (1) it does not directly estimate entropy 52, which has been shown to improve sampling efficiency 53. However, rotamer bond angles, solvation, fragments and pair terms all implicitly model this component of the free energy, which at these temperatures and solvation densities account for more than half of the entropy. (2) In most cases, knowledge-based score terms are derived from high-resolution crystal structures, which represent a single state on the energy landscape and do not capture flexibility well (compared to solution NMR); (3) knowledge-based terms are less interpretable and transferable than physics-based terms; (4) the cost of scoring scales with the number of score terms, so scoring has become slower, yet more accurate, over time; (5) the solvation model is implicit, hence fast, but hinders explicit modeling of ions, water molecules, or lipid environments; (6) several score functions for specific applications (RNA, membrane proteins, carbohydrates, non-canonical amino acids) are still under development.

Hydrogen bond networks are important for biomolecular structure and catalysis but have been challenging to design because of pairwise interactions that have multi-body, cooperative properties. The HBNet protocol 47 has been used to design de novo coiled coils with interaction specificity mediated by designed hydrogen bond networks, including homo-oligomers 47, membrane proteins 48, and large sets of orthogonal heterodimers 49. An improvement to HBNet uses a Monte Carlo search to sample hydrogen bond networks with drastically improved performance 50. We further developed a statistical potential to place highly-coordinated water molecules on the surface of biomolecules. On a data set of 153 high-resolution protein-protein interfaces, the method predicts 17% of native interface waters with 20% precision within 0.5 Å of the crystallographic water positions 51. The potential is accessible through the ExplicitWaterMover (formerly WaterBoxMover) in RosettaScripts.

The newest score function, REF2015 29, reproduces thermodynamic observables (such as liquid-phase properties 36 and liquid-to-vapor transfer free energies 37) in addition to structure-based tests 38. It also utilizes a new, derivative-free optimization technique, which is suitable for robust optimization of >100 parameters. Further, a new energy term was added that takes into consideration non-ideality of bond lengths and angles in cartesian space 39. The cartesian term 39 is also the basis for a cartesian_ddG method that has been used to calculate ΔΔGs of mutations to assess changes in protein stability. Only the backbones and side chains of residues near the mutation site are allowed to move 40. Due to the local optimization, this protocol is much faster than the previous gold-standard ddg_monomer 41, while retaining the same level of accuracy. REF2015 is now compatible with an expanded palette of chemical building-blocks: canonical and non-canonical L-α-amino acids and their D-amino acid counterparts, exotic achiral amino acids, peptoids, and oligoureas, and can model metalloproteins 42. Score functions that enable simultaneous modeling of protein and RNA are being explored 43. REF2015 is now thread-safe and fully mirror symmetric, i.e. enantiomers in mirror conformations score identically. Guidance energy terms for design have been added to encourage certain features, such as specific amino acid compositions 44,45, hydrogen bonding networks, or global or local net charges, and discourage others, such as repeat sequences that hinder NMR assignments, buried unsatisfied hydrogen bond donors and acceptors, or voids within the protein 46.

Some energy terms are decomposed into several components to parameterize each of them separately. For instance, the van der Waals energy is split into attractive and repulsive terms between different residues, in addition to an intra-residue repulsive term. A detailed account of the all-atom score function was published recently 28.
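To make the idea of decomposing one energy term concrete, the sketch below splits a 12-6 Lennard-Jones energy at its minimum into an attractive and a repulsive component that sum back to the original energy, so that each part could be weighted separately. It is an illustrative assumption about how such a split can be done, not the Rosetta implementation, which adds smoothing and distance cutoffs; `lj_energy` and `lj_split` are hypothetical names, and `sigma` here denotes the separation at the energy minimum.

```python
def lj_energy(r, sigma, epsilon):
    """12-6 Lennard-Jones energy with its minimum (-epsilon) at r == sigma."""
    ratio = sigma / r
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)


def lj_split(r, sigma, epsilon):
    """Split the Lennard-Jones energy into (attractive, repulsive) parts.

    Below the minimum the attractive part is held at -epsilon and the
    remainder is assigned to the repulsive part; beyond the minimum the
    whole energy is attractive and the repulsive part is zero.
    """
    energy = lj_energy(r, sigma, epsilon)
    if r <= sigma:
        return -epsilon, energy + epsilon
    return energy, 0.0


# Illustrative values only (distances in angstroms, energies in arbitrary units).
for distance in (3.0, 4.0, 5.0):
    attractive, repulsive = lj_split(distance, sigma=4.0, epsilon=0.1)
    print(f"r={distance:.1f}  attractive={attractive:+.4f}  repulsive={repulsive:+.4f}")
```

Because the two parts always sum to the full energy, giving them equal weights recovers the original potential, while unequal weights let clashes be penalized more or less strongly than favorable packing is rewarded.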


3. Major applications

Predicting protein structures

Rosetta was originally developed for de novo protein structure prediction, assembling fragments from known protein structures via a Monte Carlo procedure and evaluating the models with the score function. While the community’s main goals have moved to macromolecular design over the past decade, performance in the CASP13 blind prediction challenge remains respectable54, with ranking for refinement and prediction of multimeric complexes among the top three groups. Meanwhile, other groups have refined their tools by exploiting evolutionary couplings and machine learning; for instance, Google’s DeepMind developed AlphaFold12,13 (which uses Rosetta for refinement), with outstanding performance in the recent CASP13 54. Another highly ranking method is the Zhang server built on iTasser14, and QUARK14.

Homology modeling was improved by using multiple templates in RosettaCM55 (now available on the new Robetta56,57 server), which hybridizes the most homologous portions from multiple templates into a single model, while modeling missing residues de novo55. Without a template, predicting protein structures de novo remains one of the most challenging tasks in structural biology, even though the incorporation of evolutionary coupling constraints (for instance from GREMLIN58) has led to enormous improvements in model quality. An iterative hybridize approach improves sampling and uses a genetic algorithm that recombines models from an input pool to create models that have features from their parents but are also distinct. Creating several child models in each iteration, updating the input pool, and performing 30–50 iterations led to improved model accuracy because features that are scored favorably are repeatedly used in the recombination, such that the models in the pool converge over time. Iterative hybridization has been used to improve model quality of de novo predicted models59 as well as homology models60. Model refinement or generating ensembles of structures (useful for design) can be accomplished by several algorithms in Rosetta: FastRelax61, Backrub62, or vicinity sampling using KIC/Next-Generation-KIC loop modeling63,64. Loop modeling65 was implemented early in Rosetta66,67, with initial approaches relying on fragment sampling and iterative Cyclic Coordinate Descent (CCD)68 for chain closure. Later, a kinematic closure (termed “KIC”) approach relied on polynomial resultants to analytically solve for closed conformations, producing more native-like loops69,70. Next-Generation KIC (NGK)64 is a recent innovation that improves sampling by employing diversification (i.e. a wider range of conformations) and intensification (i.e. focus around previously generated conformations), substantially increasing the fraction of near-native models64 and enabling the modeling of longer loops. A related method, GeneralizedKIC44 (GenKIC), samples loop geometries between fixed endpoints, including non-standard peptide chemistries or chemistries that conventional loop-modeling algorithms do not typically handle.
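The pool-based recombination loop behind iterative hybridization can be illustrated with a toy genetic algorithm. The sketch below is only an analogy under strong simplifying assumptions: each "model" is a short list of numbers rather than a protein structure, `toy_score` is a made-up scoring function, and `recombine` and `iterative_hybridize` are hypothetical names, not Rosetta functions.

```python
import random


def recombine(parent_a, parent_b, rng, mutation=0.1):
    """Create a child that mixes features of two parents and adds small noise."""
    child = []
    for a, b in zip(parent_a, parent_b):
        value = a if rng.random() < 0.5 else b          # inherit from either parent
        child.append(value + rng.gauss(0.0, mutation))  # stay distinct from both
    return child


def iterative_hybridize(pool, score, n_iterations=40, n_children=10, seed=0):
    """Repeatedly recombine pool members and keep the best-scoring models."""
    rng = random.Random(seed)
    pool_size = len(pool)
    pool = sorted(pool, key=score)
    for _ in range(n_iterations):
        children = []
        for _ in range(n_children):
            parent_a, parent_b = rng.sample(pool, 2)
            children.append(recombine(parent_a, parent_b, rng))
        # Update the pool: favorably scored features survive and get reused.
        pool = sorted(pool + children, key=score)[:pool_size]
    return pool


# Toy scoring function (an assumption): squared distance to a hidden target.
TARGET = [1.0, -2.0, 0.5, 3.0]


def toy_score(model):
    return sum((m - t) ** 2 for m, t in zip(model, TARGET))


start_rng = random.Random(1)
start_pool = [[start_rng.uniform(-5.0, 5.0) for _ in TARGET] for _ in range(8)]
final_pool = iterative_hybridize(start_pool, toy_score)
print("best starting score:", min(toy_score(m) for m in start_pool))
print("best final score:   ", toy_score(final_pool[0]))
```

Because the pool is always refilled with the best of parents and children, the best score can never get worse between iterations, which is the mechanism by which the pool converges over time.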

Modeling protein–protein complexes

Another early expansion of Rosetta’s functionality was RosettaDock, a method for predicting the structure of protein-protein complexes. The latest version, RosettaDock4.0 74, incorporates protein flexibility from pre-generated protein ensembles, mimicking conformer selection. Sampling efficiency has been improved by automatically adjusting the sampling rate based on the diversity of the input ensembles. Scoring has been improved by a six-dimensional coarse-grained scoring scheme called motif_dock_score, employing score grids generated from known complexes in the Protein Data Bank (PDB). In local docking benchmarks with backbone deviations of up to 2.2 Å, RosettaDock4.0 successfully docked ~50% of complexes74. For symmetric homomers, Rosetta SymDock2 75 uses the same six-dimensional scoring scheme as RosettaDock. Symmetry information can be extracted from a homologous complex, or from a global docking search for a given point symmetry using our symmetry framework152. An induced-fit based all-atom refinement relieves clashes in tightly-packed complexes to give physically realistic models. On a benchmark set of 43 complexes with different cyclic and dihedral symmetries, global docking on homology models had accuracies of 61% and 42% for cyclic and dihedral symmetries, respectively75. These accuracies can be dramatically improved when adding restraints.

Docking small molecule ligands into proteins

Structure-based drug design has become a key drug optimization tool and leverages the vast array of knowledge contained in the increasing numbers of deposited structures in the PDB. RosettaLigand76 has demonstrated success in predicting small molecule-protein interactions. Later in the drug development process, medicinal chemists optimize ligands based on structure-activity relationships (SAR) by synthesizing different ligands that share a core chemical scaffold and are assumed to bind to their target in a similar fashion153. RosettaLigandEnsemble79 improves sampling during ligand docking by taking advantage of ligand similarities and docking a congeneric series of ligands simultaneously, allowing for a placement that works for all considered ligands while optimizing the binding interface for each ligand independently. Experimental SARs can help identify preferred binding modes. Small molecule ligands can also be used as competitive inhibitors of protein-protein interactions. However, a protein’s inhibitor-bound conformation often differs from the unbound or protein-protein bound conformation, thus Rosetta’s ability to model protein conformational flexibility is key. Rosetta’s pocket optimization approach identifies protein surface pockets and uses their volume as an additional scoring term: this allows the user to start from an unbound protein structure and bias sampling such that low-energy pocket-containing states are preferentially explored80,81. The sampled conformations match “druggable” alternate conformations observed in ligand-bound structures80,81, making these states excellent starting points for virtual screening. Pockets sampled on a protein surface can then be matched to complementary ligands by using the pocket as the starting point for pharmacophore-based screening154.

Modeling and designing antibodies and immune system proteins

Due to the therapeutic significance of antibodies, several antibody-specific and immune-specific protocols have been developed for structure prediction, docking and design (with specific protocols targeting IgG, T-cell receptors, displayed antigens of the Major Histocompatibility Complex (MHC) and other soluble antigens and immunogens). RosettaAntibody85–88 is a protocol for modeling of antibodies88. It identifies homologous templates, assembles them into a single structure and then models CDR H3 loops de novo while refining the VH-VL orientation155. Recent advances use multiple templates155, incorporate key structural constraints156,157 into CDR H3 modeling, model camelid antibodies87 and antibodies on the scale of the human repertoire158,159. AbPredict89 predicts antibody structures without homologous templates. Instead, it samples backbone fragments and rigid-body orientations from known antibody structures, without relying on sequence homology, therefore accurately modeling cases with sequence identity as low as 10%. AbPredict2 is available as a webserver90. SnugDock93 is a related method for antibody-antigen docking, taking as input a plausible starting conformation and optionally an ensemble of antibodies/antigens. SnugDock then runs local docking to refine both the antibody–antigen interface and the heavy–light chain interface (within the antibody) and re-models the CDR H2/H3 loops at the interface. Recent advances include a CDR H3 structural constraint156,157 and docking camelid antibodies160. Limitations in antibody modeling depend on the task: docking is limited by knowledge of the binding site (global vs. local docking); structure prediction, design and refinement are limited by protein flexibility, and modeling of CDRs or other loops is challenging if they are longer than 12 to 15 residues.

RosettaAntibodyDesign94 (RAbD) is based on RosettaAntibody87 (see above) and allows design of specific CDRs of different clusters and lengths, sequence design using cluster-based CDR profiles or conservative mutations, or de novo design of whole antibodies. RAbD uses North-Dunbrack CDR clustering161, reducing deleterious sequence mutations, and was benchmarked on 60 diverse antibody-antigen interfaces from complexes including both λ and κ light chains. Experimental benchmarking of two antibody-antigen complexes showed affinity improvements between 10 and 50-fold. Rosetta has been integrated with experimental immunogenic epitope data, MHC epitope prediction tools, and host genomic data to design proteins with reduced immunogenicity while retaining function and stability95. The approach implements machine learning-based epitope prediction for 28 different alleles, restricts design to select 15mer epitope regions, and uses a greedy stepwise protein design96 to eliminate the most immunogenic epitopes with the fewest mutations, avoiding disruptive core mutations likely to destabilize the protein. Another method, AbDesign, splits experimentally determined antibody structures along conserved positions to create interchangeable segments and then recombines them to produce a diverse set of novel antibody models97,98. The models are docked to a target of interest, either locally to a specific epitope, or globally, followed by an optimization step consisting of rigorous backbone sampling and sequence design for improving model stability and binding affinity.

Designing new proteins and functions

Protein design162 relies on several of the same core functionalities needed for structure prediction, and synergy and interoperability between design and prediction models has always been a core Rosetta design principle. This synergy is well illustrated by the biased forward folding method: during de novo protein design163, a test for the consistency of the designed sequence is whether ab initio structure prediction will yield the same structure that was used as a starting point for the design. However, computationally testing a large number of designs is prohibitively expensive because of the vast conformational search space for ab initio structure prediction. To limit that space and test more designs, biased forward folding72 uses three (instead of 200) fragments per residue position, with fragments being chosen based on the RMSD to the native structure used to instantiate the design process. Protein design is easier when starting from known structures and when redesigning for well understood objectives like thermostability164. More difficult design objectives include de novo design (without a template structure) and design for novel folds or functions. Successes in these cases require sampling of enormous conformational spaces, depending on the protein size. Another simplification of de novo design is thermostabilization of the protein, essentially creating rigid structures that are mostly non-functional, by expanding the energy gap between folded and unfolded designs to facilitate structural characterization. To date, novel functional designs mostly exploit known structures, and the next frontier is the design of novel functions onto de novo scaffolds. Moreover, nature typically does not design for the global minimum energy conformation (in terms of stability) because proteins require flexibility to carry out their functions.
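The fragment-filtering idea in biased forward folding can be sketched as a simple ranking step: for one residue window, keep only the few candidate fragments closest to the design model. The example below is a hedged illustration, not the Rosetta implementation. As a simplifying assumption it measures the deviation over backbone torsion angles (in degrees) instead of superimposed coordinates, and `angle_diff`, `torsion_rmsd`, `pick_fragments` and the candidate values are all hypothetical.

```python
import math


def angle_diff(a, b):
    """Smallest signed difference between two angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def torsion_rmsd(fragment, reference):
    """Root-mean-square deviation between two lists of torsion angles."""
    diffs = [angle_diff(f, r) for f, r in zip(fragment, reference)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def pick_fragments(candidates, reference, n_keep=3):
    """Keep the n_keep candidates closest to the reference conformation."""
    ranked = sorted(candidates, key=lambda frag: torsion_rmsd(frag, reference))
    return ranked[:n_keep]


# Made-up (phi, psi) values for a three-residue window of a helical design model.
reference = [-60.0, -45.0, -60.0, -45.0, -60.0, -45.0]
candidates = [
    [-62.0, -43.0, -58.0, -47.0, -61.0, -44.0],     # near-helical
    [-120.0, 130.0, -120.0, 130.0, -120.0, 130.0],  # strand-like
    [-65.0, -40.0, -65.0, -40.0, -65.0, -40.0],     # near-helical
    [60.0, 45.0, 60.0, 45.0, 60.0, 45.0],           # left-handed
    [-57.0, -47.0, -57.0, -47.0, -57.0, -47.0],     # near-helical
]

selected = pick_fragments(candidates, reference)
for fragment in selected:
    print(f"{torsion_rmsd(fragment, reference):6.2f}  {fragment}")
```

Restricting each position to a handful of near-native fragments shrinks the search space dramatically, which is why many more designs can be tested for the same compute budget.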

Design of novel protein structures and functions towards therapeutic intervention is addressed by various methods in Rosetta: SEWING creates de novo designs by recombining parts of protein structures from randomly-selected helical building blocks99. SEWING’s requirement-driven approach allows users to specify features that should be incorporated into their designs during backbone generation without requiring a certain size or three-dimensional fold. New features include incorporation of functional motifs such as protein-binding peptides for protein interface design and ligand binding sites for ligand-binding protein design100. A similar algorithm has been implemented for antibody design (AbDesign, see above), which was generalized for enzyme design165. A more general approach is RosettaRemodel, which performs protein design by rebuilding parts or all of the structure101 from fragments of known protein structures. RosettaRemodel uses a blueprint file in which the user defines secondary and supersecondary structure of the desired fold. Remodel interfaces with various Rosetta protocols and allows de novo modeling, fixed-backbone sequence design, refinement, loop insertion, deletion, and remodeling, disulfide engineering, domain assembly, and motif grafting.

A common task is not only design towards a certain goal (positive design), but additionally, design away from undesired features (negative design). Such a Multi-State Design166 (MSD) approach evaluates strengths and weaknesses of a single sequence on multiple backbones, for instance binding to one but not another protein partner. REstrained CONvergence103 (RECON) allows each state to sample multiple sequences during the design process, which is iteratively applied by increasing the restraint weight to encourage sequence convergence. RECON achieves on average 70% sequence recovery (a 30% increase compared to MSD) for large multi-state design problems, such as antibody affinity maturation or predicting evolutionary sequence profiles of flexible backbones167,168.

Protein function can be designed by motif grafting, i.e. grafting a known motif or predicted active- or binding-site from a template structure onto a new protein. This approach has been used for antibodies and vaccine design104 using the fold_from_loops application, where the functional motif is used as a starting point of an extended structure that is folded following the constraints of a target topology. Iterative refinement is carried out via sequence design and structural relaxation before filtering and human-guided optimization. This protocol has been extended into the Functional Folding and Design (FunFolDes) protocol, including multi-segment motif grafting, different residue length motif insertion, incorporating restraints, and folding in the presence of a binding target105. Performance of the folding stage can be improved by selecting fragments according to the target topology via the StructFragmentMover.

Designing interfaces between proteins and interaction partners

Protein design problems include interface design of proteins with proteins or small molecule ligands and predicting ΔΔGs of mutation (e.g. alanine scanning). Predicting ΔΔGs of mutations for protein stability or protein-protein interactions is difficult, and correlation coefficients with experiment are low (0.5–0.7)169, because the effect of the mutation is small compared to the total energy in the system, and because protein flexibility adds noise to the energies that can mask the effect of mutations. In alanine scanning (mutating into Ala), methods that use a “soft-repulsive” score function without modeling backbone flexibility170,171 typically outperform methods that allow protein flexibility and use hard-repulsive score functions172. FlexDDG106 improves protein-protein interface ΔΔG predictions and generalizes them to residues other than Ala. The protocol creates conformational ensembles using backrub sampling173, then repacks sidechains, minimizes torsions and computes the change in protein-protein interaction ΔΔG by averaging across the ensembles. On 1240 interface mutants, FlexDDG outperforms the earlier ddg_monomer application, which was created to predict changes in stability upon mutation, not interfaces.
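The ensemble-averaging step can be written out as simple arithmetic. The sketch below assumes that each ensemble member contributes a score for the complex and for each separated partner, computes a binding energy per member, and averages before taking the mutant minus wild-type difference. The function names and all numbers are invented for illustration; they are not FlexDDG output, and the published protocol includes details (such as score reweighting) that are not reproduced here.

```python
from statistics import mean


def binding_energy(complex_score, partner_a_score, partner_b_score):
    """Interaction energy: the complex minus its separated partners."""
    return complex_score - (partner_a_score + partner_b_score)


def ensemble_ddg(wild_type, mutant):
    """Ensemble-averaged change in binding energy upon mutation.

    Each argument is a list of (complex, partner_a, partner_b) score tuples,
    one tuple per ensemble member.
    """
    dg_wild_type = mean([binding_energy(*member) for member in wild_type])
    dg_mutant = mean([binding_energy(*member) for member in mutant])
    return dg_mutant - dg_wild_type


# Hypothetical scores for a two-member ensemble (arbitrary energy units).
wild_type_scores = [(-310.0, -150.0, -140.0), (-312.0, -151.0, -141.0)]
mutant_scores = [(-305.0, -150.0, -139.0), (-306.0, -149.0, -139.0)]

ddg = ensemble_ddg(wild_type_scores, mutant_scores)
print(f"predicted change in binding energy: {ddg:+.2f}")
```

In this convention a positive value means the mutation weakens binding. Averaging over the ensemble damps the conformational noise that would otherwise mask a small mutational effect.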

Symmetric protein assemblies can be modeled using parametric design. Nature created super-helical coiled-coils that are well-described by geometric equations using Crick parameters174, including variables for the radius of the bundle, major helical twist, minor helix rotation about the primary axis, etc. Several Movers such as MakeBundle, PerturbBundle, and BundleGridSampler allow designing helical bundles48,108 and β-barrels based on pre-defined or sampled parameters. These parametric methods do not rely on fragment libraries and can be applied to non-canonical coiled-coil heteropolymers.
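As an illustration of what "parametric" means here, the sketch below generates Cα positions for one helix of a coiled coil from a handful of Crick-style parameters: the superhelical radius `r0`, the helix radius `r1`, the superhelical and helical frequencies `omega0` and `omega1` (radians per residue), the pitch angle `alpha`, and two phase offsets. It follows one commonly used form of the Crick equations; sign conventions and parameter names differ between implementations, the numerical values are illustrative assumptions, and this is not the code behind MakeBundle.

```python
import math


def superhelix_axis(t, r0, omega0, alpha, phi0=0.0):
    """Point on the helix axis, which itself winds around the bundle (z) axis."""
    theta = omega0 * t + phi0
    return (r0 * math.cos(theta),
            r0 * math.sin(theta),
            omega0 * r0 / math.tan(alpha) * t)


def crick_ca(t, r0, r1, omega0, omega1, alpha, phi0=0.0, phi1=0.0):
    """C-alpha position of residue index t in a Crick-parameterized coiled coil."""
    theta = omega0 * t + phi0   # rotation about the bundle axis
    psi = omega1 * t + phi1     # rotation about the helix's own axis
    x = (r0 * math.cos(theta) + r1 * math.cos(theta) * math.cos(psi)
         - r1 * math.cos(alpha) * math.sin(theta) * math.sin(psi))
    y = (r0 * math.sin(theta) + r1 * math.sin(theta) * math.cos(psi)
         + r1 * math.cos(alpha) * math.cos(theta) * math.sin(psi))
    z = omega0 * r0 / math.tan(alpha) * t - r1 * math.sin(alpha) * math.sin(psi)
    return (x, y, z)


# Illustrative values (assumptions): angstroms and radians per residue.
R0, R1 = 5.0, 2.26
OMEGA0 = math.radians(-2.85)
OMEGA1 = math.radians(102.85)
RISE = 1.51                            # rise per residue along the helix axis
ALPHA = math.asin(R0 * OMEGA0 / RISE)  # pitch angle consistent with RISE

coords = [crick_ca(t, R0, R1, OMEGA0, OMEGA1, ALPHA) for t in range(28)]
for x, y, z in coords[:4]:
    print(f"{x:8.3f} {y:8.3f} {z:8.3f}")
```

Sampling a grid over a few of these parameters, for example `r0` and `omega0`, yields a family of backbones without any fragment library, which is the kind of search that parametric design protocols automate.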

Modeling peptides and peptidomimetics

The inherent flexibility of peptides gives them a large conformational search space, which makes them challenging to model; when peptide modeling is combined with another simulation, e.g. docking, the search space grows further and the task becomes difficult for any method. PIPER-FlexPepDock111 is Rosetta’s global peptide docking protocol. It rigid-body docks fragments using PIPER FFT-based docking175, and refines the complex using FlexPepDock109. PIPER-FlexPepDock can generate peptide-protein complexes from a peptide sequence and a free receptor structure. Performance decreases when the receptor is flexible.

Cyclic peptide conformations can be sampled with simple_cycpep_predict, restricting the conformational search space through cyclization44,45,108 via the Generalized Kinematic Closure (GenKIC) algorithm (see “loop modeling” above). The simple_cycpep_predict application does not rely on protein fragments and can model non-canonical chemistries, being a generalization of earlier protocols.

Experimental protein structure determination is challenging for proteins on solid surfaces such as biominerals, self-assembled monolayers, inorganic catalysts, and nanomaterials. RosettaSurface114 samples protein conformations ab initio in both the solution and adsorbed states to account for adsorption-induced conformational changes. Experimental data can be incorporated115 to improve scoring.

Using experimental data to direct modeling

Using experimental data in modeling can vastly restrict the conformational space, allowing the modeling of larger, more complex biomolecules to greater accuracy. Electron density maps generated by cryo-electron microscopy (cryoEM) or X-ray crystallography have improved in quality and become substantially more available in the past decade, and methods to incorporate them can produce high-resolution structures. To deal with variations in the resolution of these methods, RosettaES118 samples enumeratively, without requiring initial assignment of densities; it gradually extends the model one residue at a time until all residues are assigned. At each iteration, short fragments are used to sample the nearby conformational space of the growing model, while undergoing a series of clustering and filtering steps based on the energy and fit to the density. If assignment is complete but the data are low-resolution, refinement into density maps is necessary. Several methods have been developed for density maps in the 3.0–4.5 Å resolution range. More recently, an automated fragment-guided refinement pipeline121 splits the density map into independent training and validation maps. It finds regions with poor density fit, iteratively rebuilds them with fragments using the training map, filters the models based on their fit to the validation map, model geometry from MolProbity and fit to the full map, and then optimizes against the full map. Further, the frameworks for electron density maps and carbohydrate modeling143 (below) were connected144, allowing refinement of carbohydrates into low-resolution density maps.

NMR data were incorporated into de novo structure prediction early on, embodied in RosettaNMR. Chemical shifts were used for fragment picking using CS-Rosetta122, which could be used with Nuclear Overhauser Enhancements (NOEs), Residual Dipolar Couplings (RDCs)176, Pseudo-Contact Shifts (PCSs)123,124,177 and Paramagnetic Relaxation Enhancement (PRE) data. Improvements, for instance through RASREC resampling178 allowed the use of sparse179 or unassigned data180, easier-to-obtain data (backbone-only181), modeling larger and more complex proteins182, membrane proteins183, symmetric systems184, and combination with data from SAXS185, cryoEM186, distance restraints from homologous proteins187 and evolutionary couplings188. CS-Rosetta also has the AutoNOE189,190 module for automated assignment of NOESY data for use in structure calculations. RosettaNMR was recently overhauled and reconciled with CS-Rosetta and PCS-Rosetta to seamlessly integrate several types of NMR restraints (CS, RDC, PCS, PRE, NOE) in one consistent framework191 for structure prediction, protein-protein docking, protein-ligand docking, and symmetric assemblies.

Covalent labeling mass spectrometry data provide information on the relative solvent exposure of residues, yielding information on protein tertiary structure. A low-resolution score term that allows for use of hydroxyl radical footprinting has been implemented that can improve model quality in structure prediction126,127. Moreover, data from chemical cross-linking mass spectrometry have been incorporated into an automated workflow to identify protein-protein interactions. The PyTXMS128 protocol combines the sensitivity of mass spectrometry to analyze complex samples with the power of Rosetta structural modeling and protein-protein docking to efficiently sample the vast conformational space and identify interactions. A machine learning algorithm based on high resolution MS1 data guides the selection of potential binding interfaces, which is validated and adjusted using a repository of structural models and MS2 (data-dependent acquisition (DDA)) samples.

Modeling nucleic acids and their interactions with proteins

DNA and RNA modeling must address several challenges: a scarcity of structures, which leads to under-developed score functions; low-quality alignments; and a much larger torsional sampling space than for proteins (a 70-residue RNA is comparable to a 200-residue protein). In contrast to protein helices, where side chains display sequence information on the helix exterior, the side chains (bases) of helical RNA point inwards and hide sequence information from the environment, which makes prediction of tertiary or non-local contacts more difficult. Non-local contacts are mediated by loops, which are themselves challenging for prediction algorithms. Several advances have been made in the representation of nucleic acids in Rosetta. The StepWise Monte Carlo protocol (SWM) has achieved RNA structure predictions reaching atomic accuracy131; the approach provides an acceleration over the original enumerative StepWise Assembly (SWA) method129,130. A version of SWA that rebuilds one nucleotide at a time enables fine-grained correction of errors in RNA coordinates fit into crystallographic or cryo-EM maps by Enumerative Real-space Refinement ASsisted by Electron density under Rosetta135,136 (ERRASER).
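The size comparison at the start of this section can be made concrete by counting freely rotatable torsions per residue. The sketch below assumes the standard nomenclature: six backbone torsions (α to ζ) plus the glycosidic torsion χ for each nucleotide, and φ and ψ for each amino acid, with ω treated as fixed and protein side-chain torsions ignored. The counting scheme is an illustrative assumption and not how Rosetta measures sampling difficulty.

```python
# Rotatable torsions per residue under the simplifying assumptions above.
RNA_TORSIONS = ("alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi")
PROTEIN_TORSIONS = ("phi", "psi")  # omega assumed fixed near 180 degrees

def torsion_count(n_residues, torsions):
    """Total number of torsional degrees of freedom for a chain."""
    return n_residues * len(torsions)

if __name__ == "__main__":
    print("70-nt RNA:      ", torsion_count(70, RNA_TORSIONS))
    print("200-aa protein: ", torsion_count(200, PROTEIN_TORSIONS))
```

Under these assumptions, the two chains have torsion counts of the same order, which is consistent with the comparison made in the text.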

The most recent advances in RNA tools expand the fragment assembly protocol to support modeling RNA-protein complexes through simultaneous folding and docking134. RNA-protein interactions are handled via additional knowledge-based score terms that supplement the low-resolution RNA score function. Free energy perturbations from RNA or protein mutations can be modeled with the Rosetta-Vienna ΔΔG protocol43. Structure coordinates can further be built into cryo-EM density maps for large RNA-protein complexes with DRRAFTER (De novo Ribonucleoprotein modeling in Real space through Assembly of Fragments Together with Experimental density in Rosetta)138. Redesign and prediction of protein-DNA interfaces192,193 have been accomplished with flexible protein backbones194, genetic algorithms192,194,195 and motif-biased rotamer sampling196,197. A potential limitation is the reliance on fixed DNA backbone conformations, even though DNA backbones can be flexible. Key to successful protein-DNA design is a score function optimized197,198 for these highly charged and solvated interfaces. Rosetta supports prediction of specificity and affinity199, prediction of the DNA-binding preferences of homologous proteins, and multi-template modeling in RosettaCM55,200.

Modeling membrane proteins

Membrane proteins constitute about 30% of all proteins and are targets for over 60% of pharmaceuticals on the market201. However, experimental difficulties have limited our understanding of their structures202. Previously, Yarov-Yarovoy203 and Barth204 implemented tools for low- and high-resolution structure prediction of membrane proteins, termed RosettaMembrane. These tools were re-engineered for compatibility with Rosetta327 into a platform called RosettaMP139. RosettaMP implements core modules for representing, sampling, and scoring proteins in the context of an implicit membrane. RosettaMP is compatible with key modeling protocols including docking, design, ΔΔG prediction169, PyMOL visualization205, and assembly of symmetric proteins. Additionally, a set of basic modeling tools140 allows scoring, transforming a membrane protein into the membrane coordinate frame, de novo modeling of single-span transmembrane helices, introducing mutations, and visualization in the membrane. RosettaMP has enabled rapid development of new tools including structure-based detection of lipid-exposed residues in the membrane141 and domain assembly of full-length protein models from structures of transmembrane and soluble domains142. The RosettaCM protocol for multi-template homology modeling has also been adapted to membrane proteins33.

Describing membrane protein energetics is challenging because these proteins reside in an anisotropic environment and bury polar solvent molecules (e.g. water, ions) that stabilize the structure and participate in important conformational transitions. Implicit membrane models often fail to reliably model membrane protein interiors. The SPaDES method uses a hybrid explicit-implicit solvent model that enhances the prediction and design of membrane protein structures206. Limitations of membrane protein modeling are similar to, but less severe than, those of RNA modeling: there are fewer structures in databases, fewer method developers in this field and hence fewer available tools. Consequently, the score function is less mature than the latest score functions for soluble proteins: the implicit-solvent hydrophobic slab model is a coarse-grained representation of the membrane. Ongoing efforts expand this model by including pores, lipid specificity and different thicknesses207, yet many effects remain to be accounted for, such as measurement-specific or observed membrane geometries (micelles, bicelles, nanodiscs, vesicles, different pore types, fusion and fission of multiple membranes) and macroscopic physical phenomena like membrane tension and fluidity. Including these effects is challenging because it requires experimental measurements to parameterize the models and the adaptation of a multitude of score terms.
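The idea of an implicit hydrophobic slab can be sketched as a smooth switching function of depth along the membrane normal, which interpolates between a water-like and a membrane-like per-residue energy. The example below is a generic illustration under stated assumptions and is not the RosettaMP implementation: the function names, the 15 Å half-thickness, the steepness exponent and the linear interpolation between two reference energies are all hypothetical choices.

```python
def membrane_fraction(z, half_thickness=15.0, steepness=10.0):
    """Smooth slab profile: close to 1 at the bilayer center, close to 0 in water.

    z is the depth along the membrane normal in angstroms, with z = 0 at the
    center of the bilayer. half_thickness and steepness are assumed values.
    """
    ratio = abs(z) / half_thickness
    return 1.0 / (1.0 + ratio ** steepness)

def residue_environment_energy(z, e_water, e_membrane,
                               half_thickness=15.0, steepness=10.0):
    """Interpolate a per-residue energy between water and membrane values."""
    f = membrane_fraction(z, half_thickness, steepness)
    return f * e_membrane + (1.0 - f) * e_water

if __name__ == "__main__":
    # Hypothetical reference energies for a hydrophobic residue.
    for depth in (0.0, 10.0, 15.0, 20.0, 30.0):
        energy = residue_environment_energy(depth, e_water=1.0, e_membrane=-2.0)
        print(f"z = {depth:5.1f} A  energy = {energy:6.3f}")
```

Because such a profile depends only on depth, it cannot represent pores, curvature or lipid composition, which is why the extensions mentioned above require additional geometric parameters.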

Adding carbohydrates to the modeling process

Carbohydrates are fundamental to life208,209, but because of challenges in experimental characterization and computational sampling and scoring, their structures have been historically under-studied. The RosettaCarbohydrate framework143 models carbohydrate structures and complexes such as glycosylated proteins or protein–sugar complexes with the same algorithms one would use for proteins. RosettaCarbohydrate can handle commonly studied and uncommon carbohydrate structures, including linear, cyclic, and branched structures, sugar modifications, and conjugations. Methods exist for sampling ring conformations, packing substituents, refining glycosidic linkages, sampling from linkage “fragments”, and extending glycan chains. Scoring of saccharide-containing structures includes a quantum-mechanically derived intrinsic backbone term210. Because saccharide residues are stored as distinct data structures, we can integrate bioinformatic and statistical data into these algorithms, opening the door for glycoengineering and design applications. RosettaCarbohydrate has been integrated with other frameworks, such as loop modeling (GenKIC and Stepwise Assembly), refinement (GlycanTreeModeler), symmetry, and RosettaScripts-accessible classes such as MoveMaps and ResidueSelectors. Linkages are automatically determined during PDB read-in. Carbohydrates work with Cartesian minimization and can be refined into electron density maps144. Limitations of the carbohydrate framework include the increased sampling space due to carbohydrate flexibility and branching, and the need to model many different chemistries with possible branching and cyclization. Developments in this area have only recently started and much work has yet to be done.