RELION: Implementation of a Bayesian approach to cryo-EM structure determination
RELION, for REgularized LIkelihood OptimizatioN, is an open-source computer program for the refinement of macromolecular structures by single-particle analysis of electron cryo-microscopy (cryo-EM) data. Whereas alternative approaches often rely on user expertise for the tuning of parameters, RELION uses a Bayesian approach to infer parameters of a statistical model from the data. This paper describes developments that reduce the computational costs of the underlying maximum a posteriori (MAP) algorithm, as well as statistical considerations that yield new insights into the accuracy with which the relative orientations of individual particles may be determined. A so-called gold-standard Fourier shell correlation (FSC) procedure to prevent overfitting is also described. The resulting implementation yields high-quality reconstructions and reliable resolution estimates with minimal user intervention and at acceptable computational costs.
1. Introduction

The increased applicability of the technique is expected to attract new researchers to the field. Because conventional data collection and processing procedures often rely on user expertise, the need for improved ease-of-use and automation is now widely recognized. More convenient data collection schemes are being developed through a combination of automated data acquisition software (Suloway et al., 2005) and improvements in the latest generation of electron microscopes (Shrum et al., in press; Fischer et al., 2010). To cope with the large amounts of data from these experiments, semi-automated image processing pipelines and dedicated electronic notebooks have been proposed (Lander et al., 2009; Ludtke et al., 2003). Continuing developments in these areas are expected to increase the accessibility of cryo-EM structure determination to inexperienced users.
However, many cryo-EM projects still suffer from important hurdles in image processing that cannot be overcome by automation and increased volumes of data alone. Existing image processing procedures often comprise a concatenation of multiple steps, such as particle alignment, class averaging, reconstruction, resolution estimation and filtering, and many of these steps involve the tuning of specific parameters. Whereas appropriate use of these procedures may yield useful results, suboptimal parameter settings or inadequate combinations of the separate steps may also lead to grossly incorrect structures, representing a potential pitfall for newcomers to the field.
Recently, I described a Bayesian approach to cryo-EM structure determination, in which the reconstruction problem is expressed as the optimization of a single target function (Scheres, 2012). In particular, the reconstruction problem is formulated as finding the model that has the highest probability of being the correct one in the light of both the observed data and available prior information. Optimization of this posterior distribution is called maximum a posteriori (MAP), or regularized likelihood, optimization. The Bayesian interpretation places the cryo-EM structure determination process on a firm theoretical basis, where explicit statistical assumptions about the model and the data, as well as the optimization strategy itself, can be discussed and improved if deemed necessary. Whereas conventional refinement procedures employ many ad hoc parameters that need to be tuned by an expert user, the Bayesian approach iteratively learns most parameters of the statistical model from the data themselves.
This paper describes the implementation of the Bayesian approach to single-particle reconstruction in the stand-alone computer program RELION, which stands for REgularized LIkelihood OptimizatioN. Translating the theoretical framework into a useful computer program poses a considerable challenge, and various algorithmic developments are described that allow MAP optimization of single-particle reconstructions at an acceptable computational cost. Moreover, the theoretical framework provided by the Bayesian approach may yield valuable insights into outstanding questions. As an example of this, I describe an approach that uses the statistical data model to estimate the accuracy with which individual particles may be aligned, and to quantify the contribution of different frequencies to this accuracy. Finally, because in principle some degree of overfitting might still go unnoticed in the previously proposed MAP optimization approach (Scheres, 2012), a new procedure is described that eradicates the possibility of overfitting by the use of so-called "gold-standard" FSC calculations (Henderson et al., 2012; Scheres et al., 2012). Application of RELION to both simulated and experimental data illustrates that reconstructions that are free from overfitting may be obtained in a highly objective manner, without compromising reconstruction quality and at acceptable computational costs.
2. Approach
2.1. Theoretical background
MAP refinement of cryo-EM single-particle reconstructions is based on the following linear model in Fourier space:
$$X_{ij} = \mathrm{CTF}_{ij} \sum_{l=1}^{L} P_{jl}^{\phi} V_{kl} + N_{ij}, \tag{1}$$
where:
• $X_{ij}$ is the $j$th component, with $j = 1,\ldots,J$, of the 2D Fourier transform $X_i$ of the $i$th experimental image, with $i = 1,\ldots,N$.
• $\mathrm{CTF}_{ij}$ is the $j$th component of the contrast transfer function for the $i$th image.
• $V_{kl}$ is the $l$th component, with $l = 1,\ldots,L$, of the 3D Fourier transform $V_k$ of the $k$th of $K$ underlying structures in the data set. Multiple structures ($K > 1$) may be used to describe structural heterogeneity in the data, and $K$ is assumed to be known. All components $V_{kl}$ are assumed to be independent, zero-mean, and Gaussian distributed with variance $\tau_{kl}^{2}$.
• $P^{\phi}$ is a $J \times L$ matrix of elements $P_{jl}^{\phi}$. The operation $\sum_{l=1}^{L} P_{jl}^{\phi} V_{kl}$ for all $j$ extracts a slice out of the 3D Fourier transform of the $k$th underlying structure, and $\phi$ defines the orientation of the 2D Fourier transform with respect to the 3D structure, comprising a 3D rotation and a phase shift accounting for a 2D origin offset in the experimental image. Similarly, the operation $\sum_{j=1}^{J} P_{lj}^{\phi T} X_{ij}$ for all $l$ places the 2D Fourier transform of an experimental image back into the 3D transform.
• $N_{ij}$ is noise in the complex plane, which is assumed to be independent, zero-mean, and Gaussian distributed with variance $\sigma_{ij}^{2}$.
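To make the data model concrete, the following sketch simulates an observation according to Eq. (1). It is a toy under stated assumptions: the slice operation $\sum_l P_{jl}^{\phi} V_{kl}$ is replaced by a precomputed real-space projection, and all inputs (`proj`, `ctf`, `sigma2`) are hypothetical arrays supplied by the caller.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_image(proj, ctf, sigma2):
    """Toy forward model of Eq. (1): X = CTF * (slice of V) + N.

    proj   : 2D real-space projection, standing in for the extracted slice
    ctf    : 2D array with the CTF evaluated at every Fourier component
    sigma2 : noise variance per Fourier component; following Eq. (7), the
             real and imaginary parts each have variance sigma2
    """
    F = np.fft.fft2(proj) * ctf
    noise = (rng.normal(scale=np.sqrt(sigma2), size=F.shape)
             + 1j * rng.normal(scale=np.sqrt(sigma2), size=F.shape))
    return F + noise
```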
Imagining an ensemble of possible solutions, the reconstruction problem is formulated as finding the model with parameter set Θ that has the highest probability of being the correct one in the light of both the observed data 𝒳 and the prior information 𝒴. According to Bayes’ law, this so-called posterior distribution factorizes into two components:
$$P(\Theta \mid \mathcal{X}, \mathcal{Y}) \propto P(\mathcal{X} \mid \Theta, \mathcal{Y}) \, P(\Theta \mid \mathcal{Y}), \tag{2}$$
where the likelihood $P(\mathcal{X} \mid \Theta, \mathcal{Y})$ quantifies the probability of observing the data given the model, and the prior $P(\Theta \mid \mathcal{Y})$ expresses how likely that model is given the prior information. The likelihood is computed based on the assumption of independent, zero-mean Gaussian noise in the images, and one marginalizes over the orientations $\phi$ and class assignments $k$. The variance $\sigma_{ij}^{2}$ of the noise components is unknown and is estimated from the data; its variation with resolution allows the description of non-white, or coloured, noise. The prior is based on the assumption that the Fourier components of the signal are also independent, zero-mean and Gaussian distributed, with unknown and resolution-dependent variance $\tau_{kl}^{2}$ (see Scheres, 2012 for more details). The model $\hat{\Theta}$, including all $V_{kl}$, $\sigma_{ij}^{2}$ and $\tau_{kl}^{2}$, that optimizes the posterior distribution $P(\Theta \mid \mathcal{X}, \mathcal{Y})$ is called the maximum a posteriori (MAP) estimate. Note that previously discussed ML methods in the Fourier domain (Scheres et al., 2007b) aimed to optimize $P(\mathcal{X} \mid \Theta, \mathcal{Y})$.
Optimization of $P(\Theta \mid \mathcal{X}, \mathcal{Y})$ may be achieved by the expectation–maximization algorithm (Dempster et al., 1977), in which case the following iterative algorithm is obtained:
$$V_{kl}^{(n+1)} = \frac{\sum_{i=1}^{N} \int_{\phi} \Gamma_{ik\phi}^{(n)} \sum_{j=1}^{J} P_{lj}^{\phi T} \frac{\mathrm{CTF}_{ij} X_{ij}}{\sigma_{ij}^{2(n)}} \, d\phi}{\sum_{i=1}^{N} \int_{\phi} \Gamma_{ik\phi}^{(n)} \sum_{j=1}^{J} P_{lj}^{\phi T} \frac{\mathrm{CTF}_{ij}^{2}}{\sigma_{ij}^{2(n)}} \, d\phi + \frac{1}{\tau_{kl}^{2(n)}}}, \tag{3}$$
$$\sigma_{ij}^{2(n+1)} = \frac{1}{2} \sum_{k=1}^{K} \int_{\phi} \Gamma_{ik\phi}^{(n)} \left| X_{ij} - \mathrm{CTF}_{ij} \sum_{l=1}^{L} P_{jl}^{\phi} V_{kl}^{(n)} \right|^{2} d\phi, \tag{4}$$
$$\tau_{kl}^{2(n+1)} = \frac{1}{2} \left| V_{kl}^{(n+1)} \right|^{2}, \tag{5}$$
where $\Gamma_{ik\phi}^{(n)}$ is the posterior probability of class assignment $k$ and orientation assignment $\phi$ for the $i$th image, given the model at iteration $(n)$. It is calculated as follows:
$$\Gamma_{ik\phi}^{(n)} = \frac{P(X_i \mid k, \phi, \Theta^{(n)}, \mathcal{Y}) \, P(k, \phi \mid \Theta^{(n)}, \mathcal{Y})}{\sum_{k'=1}^{K} \int_{\phi'} P(X_i \mid k', \phi', \Theta^{(n)}, \mathcal{Y}) \, P(k', \phi' \mid \Theta^{(n)}, \mathcal{Y}) \, d\phi'}, \tag{6}$$
with:
$$P(X_i \mid k, \phi, \Theta^{(n)}, \mathcal{Y}) = \prod_{j=1}^{J} \frac{1}{2\pi\sigma_{ij}^{2(n)}} \exp\left( \frac{\left| X_{ij} - \mathrm{CTF}_{ij} \sum_{l=1}^{L} P_{jl}^{\phi} V_{kl}^{(n)} \right|^{2}}{-2\sigma_{ij}^{2(n)}} \right), \tag{7}$$
and $P(k, \phi \mid \Theta^{(n)}, \mathcal{Y})$ may be used to express prior information about the distribution of the hidden variables $k$ and $\phi$. In practice, the integrations over $\phi$ are replaced by (Riemann) summations over discretely sampled orientations, and translations are limited to a user-defined range. Also, the power of the signal, $\tau_{kl}^{2}$, and of the noise, $\sigma_{ij}^{2}$, are estimated as 1D vectors that vary only with the resolution of the Fourier components $j$ and $l$.
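To make the structure of one expectation–maximization iteration explicit, the sketch below implements Eqs. (3)–(7) for a deliberately stripped-down model: the slice operator $P^{\phi}$ is taken as the identity and the orientational integrals are omitted, so that only the class assignments $k$ are marginalized. All array names and shapes are illustrative assumptions, not RELION's actual C++ implementation.

```python
import numpy as np

def em_iteration(X, V, ctf, sigma2, tau2):
    """One iteration of Eqs. (3)-(7) for a stripped-down model (identity
    slice operator, no orientations, uniform prior over classes). A sketch.

    X      : (N, J) observed Fourier components of N images
    V      : (K, J) current estimates of the K class references
    ctf    : (N, J) CTF values per image and component
    sigma2 : (N, J) current noise variance estimates
    tau2   : (K, J) current signal power estimates
    """
    # E-step, Eqs. (6)-(7): log-posterior of each class per image, using
    # log-domain arithmetic for numerical stability.
    diff2 = np.abs(X[:, None, :] - ctf[:, None, :] * V[None, :, :]) ** 2
    logG = -np.sum(diff2 / (2 * sigma2[:, None, :]), axis=2)     # (N, K)
    logG -= logG.max(axis=1, keepdims=True)
    G = np.exp(logG)
    G /= G.sum(axis=1, keepdims=True)

    # M-step, Eq. (3): Wiener-filtered update of each class reference.
    num = np.einsum('nk,nj->kj', G, ctf * X / sigma2)
    den = np.einsum('nk,nj->kj', G, ctf ** 2 / sigma2) + 1.0 / tau2
    V_new = num / den

    # Eq. (4): updated noise variance per image and Fourier component.
    diff2_new = np.abs(X[:, None, :] - ctf[:, None, :] * V_new[None, :, :]) ** 2
    sigma2_new = 0.5 * np.einsum('nk,nkj->nj', G, diff2_new)

    # Eq. (5): updated signal power per class and Fourier component.
    tau2_new = 0.5 * np.abs(V_new) ** 2
    return V_new, G, sigma2_new, tau2_new
```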
The iterative algorithm in Eqs. (3)–(7) is started from an initial estimate of $V_k$: the starting model. If $K > 1$, multiple different starting models are obtained by randomly dividing the data set in the first iteration; the user controls the number of models $K$ that are refined simultaneously. Initial estimates for $\tau_{kl}^{2}$ and $\sigma_{ij}^{2}$ are calculated from the power spectra of the starting model and of the individual particles, respectively.
It is important to note that the algorithm outlined above is a local optimizer. Consequently, the outcome of a refinement depends on the suitability of the starting model, and grossly incorrect starting models may lead to suboptimal results. To reduce bias towards a possibly incorrect starting model, a strong low-pass filter is typically applied to the starting model.
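Such a filter may be applied directly in Fourier space. The sketch below is a minimal illustration with a hypothetical helper, assuming a cubic map; a practical implementation would typically soften the cutoff edge to avoid ringing artifacts.

```python
import numpy as np

def lowpass_filter(vol, angpix, cutoff):
    """Hard spherical low-pass filter of a cubic 3D map in Fourier space.

    vol    : 3D numpy array (cubic)
    angpix : pixel size in Angstrom
    cutoff : resolution cutoff in Angstrom
    """
    ft = np.fft.fftn(vol)
    freq = np.fft.fftfreq(vol.shape[0], d=angpix)       # in 1/Angstrom
    fx, fy, fz = np.meshgrid(freq, freq, freq, indexing="ij")
    mask = np.sqrt(fx ** 2 + fy ** 2 + fz ** 2) <= 1.0 / cutoff
    return np.real(np.fft.ifftn(ft * mask))
```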
2.2. Increasing computational speed: fast Fourier-space interpolation
Eqs. (3)–(7) represent a daunting computational challenge. Within each iteration, for every experimental image one has to evaluate the posterior probability Γikϕ(n) for all possible ϕ and k, and each image has to be back-projected into the 3D map with the corresponding weight for all ϕ and all k. Previous ML implementations reduced computational costs by keeping a set of pre-calculated 2D reference projections on a relatively coarsely sampled orientational grid in memory (Scheres et al., 2007a,b). Moreover, summations over all experimental images, in-plane rotations and translations were performed in 2D, and the corresponding weighted sums were also stored in memory. The resulting quadratic scaling of computer memory usage with the angular sampling rate in practice meant that ML refinements could not be performed with angular sampling rates finer than 10°, which seriously limited attainable resolutions.
RELION implements a drastically different approach. Instead of storing many 2D images in computer memory, it calculates projection and back-projection operations on-the-fly. The main advantage of this approach is that memory requirements no longer increase with increasing angular sampling rates, apart from the storage of a larger $\Gamma_{ik\phi}^{(n)}$ array. However, because the (back-)projection operations have to be performed for many experimental images and a large number of orientations, this approach requires fast calculation of the (back-)projection operations in order to be computationally feasible.
As mentioned above, the projection and back-projection operations involve taking 2D slices out of a 3D Fourier transform, and putting them back in. This requires some sort of interpolation, because the 3D Cartesian grid on which $V_k$ is sampled does not generally coincide with the 2D Cartesian grid of $X_i$. To speed up the calculations inside RELION, the 3D Fourier transform $V_k$ is oversampled twice by zero-padding of the map in real space, and projection operations are then performed using linear interpolation in Fourier space. The linear interpolation scheme makes the matrices $P^{\phi}$ very sparse, so that the computational cost of the projection operations is minimized and the integrals over $\phi$ in Eqs. (3)–(7) may be evaluated within reasonable time. To reduce artifacts in the projections, a reverse gridding correction (with a $\mathrm{sinc}^{2}$ function) is applied to the 3D map prior to calculation of its Fourier transform.
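The following sketch shows the core of such a slice-extraction operation with linear interpolation. It is a bare-bones illustration only: the twofold oversampling, Friedel-symmetry bookkeeping and the reverse gridding correction are all omitted, and the function name and array conventions are assumptions rather than RELION's actual code.

```python
import numpy as np

def extract_slice(V_ft, rot):
    """Extract a central 2D slice from a centred 3D Fourier transform by
    trilinear interpolation (the projection operation of Section 2.2).

    V_ft : (n, n, n) complex array, zero frequency at the centre
    rot  : (3, 3) rotation matrix defining the slice orientation
    """
    n = V_ft.shape[0]
    c = n // 2
    k = np.arange(n) - c
    ky, kx = np.meshgrid(k, k, indexing="ij")
    # 3D coordinates of the slice's Fourier components (rotated kz = 0 plane)
    pts = rot @ np.stack([kx.ravel(), ky.ravel(), np.zeros(kx.size)])
    base = np.floor(pts).astype(int)
    frac = pts - base
    out = np.zeros(kx.size, dtype=complex)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # trilinear weight: (1 - frac) for offset 0, frac for offset 1
                w = (np.abs(1 - dx - frac[0]) * np.abs(1 - dy - frac[1])
                     * np.abs(1 - dz - frac[2]))
                ix, iy, iz = base[0] + dx, base[1] + dy, base[2] + dz
                ok = (np.abs(ix) < c) & (np.abs(iy) < c) & (np.abs(iz) < c)
                out[ok] += w[ok] * V_ft[iz[ok] + c, iy[ok] + c, ix[ok] + c]
    return out.reshape(n, n)
```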
A similar, inverse procedure is followed for the back-projection operations, where the 2D Fourier transforms $X_i$ are placed into an oversampled 3D transform using the transpose of the matrix $P^{\phi}$. However, the summation over all back-projected images in the numerator of Eq. (3) then results in a severely non-uniformly sampled 3D transform. This transform must be properly weighted before the actual reconstruction is obtained by an inverse Fourier transform operation, since straightforward division by the weights in the denominator of Eq. (3) would lead to unsatisfactory results. For this purpose, RELION implements a modified version of an iterative gridding reconstruction algorithm that was previously proposed for medical magnetic resonance imaging (MRI) (Pipe and Menon, 1999) and positron emission tomography (PET) (Matej and Lewitt, 2001). This algorithm is described in more detail in Appendix A.
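The central idea of such iterative gridding weighting can be sketched as follows. This illustrates only the Pipe-and-Menon-style weight estimation under simplifying assumptions; it is not the modified algorithm of Appendix A.

```python
import numpy as np
from scipy.ndimage import convolve

def gridding_weights(density, kernel, n_iter=10):
    """Iteratively estimate per-voxel weights for a non-uniformly sampled
    Fourier transform: repeatedly set w <- w / ((w * density) convolved with
    the interpolation kernel), driving the convolved, weighted sampling
    density towards 1 everywhere.

    density : 3D array with the accumulated sampling weights from
              back-projection (the denominator terms of Eq. (3))
    kernel  : small 3D array representing the linear interpolation kernel
    """
    w = np.ones_like(density, dtype=float)
    for _ in range(n_iter):
        d = convolve(w * density, kernel, mode="constant")
        w /= np.maximum(d, 1e-12)   # guard against empty regions
    return w
```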
2.3. Increasing computational speed: adaptive expectation–maximization
With the computational cost of the (back-)projection operations reduced, the most costly operation in Eqs. (3)–(7) is the calculation of the squared $\ell_2$-norm in Eq. (7), which has to be evaluated for all $i$, $k$ and $\phi$. In particular, the orientations $\phi$ span a large 5D domain, comprising three rotations and two translations. Several approaches have previously been proposed to accelerate these calculations through domain reduction (Scheres et al., 2005; Tagare et al., 2010). In the domain reduction strategy, the integration over the entire domain is replaced by an integration over a significantly smaller sub-domain. Because in practice the posterior distribution $\Gamma_{ik\phi}^{(n)}$ is close to zero for many $k$ and $\phi$, this turns out to be an effective way to approximate the total integration at strongly reduced computational costs.
RELION implements a modified version of the adaptive expectation–maximization algorithm that was proposed by Tagare et al. (2010). For each experimental image, in a first pass $\Gamma_{ik\phi}^{(n)}$ is evaluated over the entire domain using a relatively coarsely sampled grid of $\phi$. The array of all $\Gamma_{ik\phi}^{(n)}$ is sorted, and a sub-domain of all $k$ and $\phi$ is selected that corresponds to the highest values of $\Gamma_{ik\phi}^{(n)}$, which together sum to a significant fraction $\xi$, typically 99.9%, of the total probability mass on the coarse grid. Then, in a second pass, $\Gamma_{ik\phi}^{(n)}$ is evaluated only over the selected sub-domain using a finer grid.
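The first-pass selection can be expressed in a few lines. This sketch assumes a flattened array of posterior probabilities on the coarse grid and returns the grid points to be re-examined on the finer grid.

```python
import numpy as np

def significant_subdomain(gamma, xi=0.999):
    """Select the smallest set of coarse-grid points (over k and phi) whose
    posterior probabilities sum to a fraction xi of the total mass.

    gamma : flat array of posterior probabilities Gamma on the coarse grid
    """
    p = gamma / gamma.sum()
    order = np.argsort(p)[::-1]          # sort by descending probability
    n_keep = int(np.searchsorted(np.cumsum(p[order]), xi)) + 1
    return order[:n_keep]                # indices to refine in the second pass
```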
The adaptive algorithm requires two discrete sampling grids of the continuous orientations $\phi$: a coarse one and a fine one. To avoid a bias towards certain orientations, both grids ought to be sampled uniformly over the entire domain. For computational efficiency it is also convenient if the sampling points on the coarse grid can be related at little computational cost to their neighbouring points on the fine grid. For the sampling of the 2D translations, both requirements are easily fulfilled using Cartesian grids in Euclidean space. However, for the 3D orientations, no point set is known that achieves exactly uniform sampling.
RELION parameterizes 3D orientations by three Euler angles, and approximates a uniform sampling of the first two Euler angles on the sphere using the HEALPix framework (Gorski et al., 2005). The HEALPix approach was originally proposed for the field of astronomy (where pixelized images of the sky are represented on a sphere), and it has two characteristics that are particularly useful for the adaptive expectation–maximization algorithm outlined above: (i) it yields a reasonable approximation to a uniform sampling of the sphere so that bias towards certain orientations may be minimized; and (ii) it generates discrete grids in a hierarchical way that allows fast calculation of neighbouring sampling points in grids with distinct angular sampling rates. In particular, each subsequent grid in the hierarchy contains four times more sampling points than the previous one, yielding an angular sampling rate that is approximately twice as high.
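In Python, such a hierarchy of orientational grids can be generated with the healpy bindings to the HEALPix library; the mapping of HEALPix (colatitude, longitude) onto the first two Euler angles shown here is an illustrative assumption, not RELION's exact convention.

```python
import numpy as np
import healpy as hp  # Python bindings to the HEALPix library

def healpix_euler_angles(order):
    """First two Euler angles (rot, tilt) sampled on the HEALPix grid at a
    given level of the hierarchy. Each level has 12 * 4**order pixels: four
    times more points, and hence ~2x finer angular sampling, than the last.
    """
    nside = 2 ** order
    tilt, rot = hp.pix2ang(nside, np.arange(hp.nside2npix(nside)))
    return np.degrees(rot), np.degrees(tilt)

coarse_rot, coarse_tilt = healpix_euler_angles(2)  # 192 directions (~15 deg)
fine_rot, fine_tilt = healpix_euler_angles(3)      # 768 directions (~7.5 deg)
```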
The implemented adaptive expectation–maximization algorithm uses a given grid in the HEALPix hierarchy for the coarse sampling of the first two Euler angles, and the next one in the hierarchy for the fine sampling. In addition, it uses a two times finer, linear sampling of the third Euler angle and of both translations in the fine grid. Thereby, the fine grid has $2^5 = 32$ times more sampling points than the coarse sampling grid. Consequently, the maximum speed-up of the adaptive approach will be close to 32-fold (i.e. if only one sampling point contributes to 99.9% of the probability mass on the coarse grid). In practice, the posterior distributions are typically relatively broad during the initial stages of refinement (where low-resolution models provide less information to distinguish different orientations), and these distributions become more "spiky" towards convergence. Therefore, more orientations will contribute significantly to the probability mass on the coarse grid during the first few iterations, when speed-ups are typically less pronounced, while towards the end of the refinement speed-ups become much more important.
2.4. Increasing computational speed: local orientational searches
Another effective approach to domain reduction is to limit the integrations to those orientations in the vicinity of the optimal orientations from the previous iteration. This approach is used in many structure determination procedures, and it is sometimes referred to as performing local angular searches. This approach may provide large speed-ups, but its effect on the quality of the reconstruction depends strongly on the assumption that the optimal orientations from the previous iteration are close to the true orientations. Therefore, local angular searches with fine orientational samplings are most useful during the later stages of refinement, after exhaustive searches with coarser samplings have provided orientations that are relatively close to the correct ones.
Inside the statistical framework, local angular searches may be implemented as a prior on the hidden variables. By setting $P(k, \phi \mid \Theta^{(n)}, \mathcal{Y}) = 0$ for orientations that are far away from the optimal ones in the previous iteration, integrations over those orientations may be avoided. Conventional local angular searches, where equal probabilities are given to all orientations in a user-defined search range, correspond to using a rectangular function for $P(k, \phi \mid \Theta^{(n)}, \mathcal{Y})$. RELION instead uses a truncated Gaussian function for $P(k, \phi \mid \Theta^{(n)}, \mathcal{Y})$, and integrations are limited to orientations within three times a user-defined standard deviation. This procedure down-weights orientations that are relatively far away from the optimal orientations of the previous iteration, thereby providing a more continuous transition between orientations that are close to the previous ones and orientations that fall outside the user-defined search range.
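A minimal sketch of such a truncated-Gaussian orientational prior for a single Euler angle is shown below; the function and its arguments are illustrative assumptions.

```python
import numpy as np

def orientational_prior(angles, prev_angle, sigma):
    """Truncated-Gaussian prior on an Euler angle for local searches:
    orientations further than 3*sigma from the previous best angle get zero
    probability, so the corresponding integrations can be skipped entirely.

    angles     : sampled angle values in degrees
    prev_angle : optimal angle from the previous iteration, in degrees
    sigma      : user-defined standard deviation, in degrees
    """
    d = np.abs(angles - prev_angle)
    d = np.minimum(d, 360.0 - d)                      # wrap around the circle
    prior = np.where(d <= 3 * sigma,
                     np.exp(-0.5 * (d / sigma) ** 2), 0.0)
    s = prior.sum()
    return prior / s if s > 0 else prior
```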
2.5. Assessing alignment accuracy based on SNR considerations
The accuracy with which individual particles may be aligned remains an unknown in many structure determination procedures. However, this value is of great interest, as it may be used to predict the attainable resolution for a given data set. The effect of orientational errors may be modelled by a B-factor on the reconstruction, so that orientational errors of a given magnitude will limit the resolution in a predictable manner, e.g. see Table 2 in Henderson et al. (2011).
The statistical assumptions of the MAP approach may be used to estimate the accuracy with which orientational assignments can be made for a given model. If orientation $\phi_T$ is the true one for the $i$th image, then the ratio $R_{F/T}$ of the posterior probabilities of assigning a false orientation $\phi_F$ versus the true orientation $\phi_T$ (for a given class assignment $k$, and assuming equal prior probabilities for both orientations) is given by:
$$R_{F/T} = \frac{\Gamma_{ik\phi_F}}{\Gamma_{ik\phi_T}} = \exp\left( \sum_{j=1}^{J} \frac{\left| \mathrm{CTF}_{ij} \sum_{l=1}^{L} P_{jl}^{\phi_F} V_{kl} - \mathrm{CTF}_{ij} \sum_{l=1}^{L} P_{jl}^{\phi_T} V_{kl} \right|^{2}}{-2\sigma_{ij}^{2}} \right). \tag{8}$$
If $R_{F/T}$ is close to one for two neighbouring orientations, these orientations cannot be distinguished from each other. If, on the other hand, $R_{F/T}$ is very low, the posterior probability of assigning the correct orientation is much larger than that of assigning the incorrect one, so that the correct orientation can readily be identified. At every iteration, RELION assumes for a random subset of 100 experimental images that the most likely orientations from the previous iteration are the correct ones, and then modifies, for each image, each of the three Euler angles and two translations in small steps until $R_{F/T} < 0.01$. The average values of the corresponding rotational and translational differences are reported by the program, and these values are considered to represent the accuracy with which different orientations may be distinguished reliably.
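Eq. (8) is straightforward to evaluate once two reference projections are available. The sketch below assumes hypothetical Fourier-space projections (e.g. from a slice-extraction helper like the one in Section 2.2) and mirrors the iterative probing described above in a comment; `project` is a hypothetical name.

```python
import numpy as np

def r_false_true(ref_true, ref_false, ctf, sigma2):
    """Eq. (8): ratio of posterior probabilities of a false versus the true
    orientation, under the model's Gaussian noise assumptions.

    ref_true, ref_false : Fourier components of the two reference projections
    ctf, sigma2         : CTF and noise variance per Fourier component
    """
    diff2 = np.abs(ctf * ref_false - ctf * ref_true) ** 2
    return np.exp(-np.sum(diff2 / (2 * sigma2)))

# RELION-style probe (sketch): perturb one Euler angle in small steps until
# R_F/T drops below 0.01, e.g.
#   angle = 0.0
#   while r_false_true(project(V, phi), project(V, phi + angle),
#                      ctf, sigma2) >= 0.01:
#       angle += 0.1
```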
2.6. Preventing overfitting: “gold-standard” FSC calculations
In many structure determination procedures the resolution is assessed by FSC curves between reconstructions from halves of the data set, while a single model is used for the angular assignments. It is well-known that bias towards noise in this single model may lead to spurious correlations between the half-reconstructions. Over-optimistic low-pass filtering based on the inflated resolution estimates may then lead to further enhancement of the noise in the model. As a result, during multiple refinement iterations the amount of noise may gradually increase and final resolution estimates may be grossly exaggerated. This phenomenon has been called over-refinement, or overfitting. More realistic estimates of resolution may be obtained by refining a separate model for two independent halves of the data, so that FSC curves between the two half-reconstructions are free from spurious correlations. Such FSCs between independent reconstructions have been termed “gold-standard” FSCs (Henderson et al., 2012). As shown previously, “gold-standard” FSCs may be used to prevent overfitting without loss of reconstruction quality (Scheres et al., 2012).
Although MAP optimization was shown to effectively reduce overfitting, in theory some overfitting may still occur within the original MAP approach. If noise were somehow to build up in the single reconstruction that is used for refinement, then the estimated power of the signal would be inflated through Eq. (5), which could in turn lead to further overfitting. Although overfitting was observed to be much reduced compared to conventional refinement procedures, indications of a limited extent of overfitting in the MAP approach were indeed observed for very noisy data; see Scheres (2012) for more details.
To completely eradicate overfitting from the refinement process, an approach to estimate the power of the signal based on “gold-standard” FSC calculations was implemented inside the framework of MAP optimization. For this purpose, the data set is divided into two random halves at the outset of refinement, and two sets of model parameters Θ are refined separately, one for each half of the data. Because refinements with K > 1 of independent random halves of the data might converge to distinct classification solutions, this procedure was only implemented for the K = 1 case, and in the following all subscripts k have been dropped. At the end of every iteration, an FSC curve between the two independent reconstructions is calculated, and this curve is converted into an estimate for the resolution-dependent signal-to-noise ratio using:
$$\mathrm{SNR}_{\mathrm{MAP}}(\nu) = \frac{\mathrm{FSC}(\nu)}{1 - \mathrm{FSC}(\nu)}, \tag{9}$$
which is then used to estimate the power spectrum of the underlying signal:
$$\tau^{2(n)}(\nu) = \frac{\mathrm{SNR}_{\mathrm{MAP}}(\nu)}{\displaystyle \frac{1}{N_{\nu}} \sum_{l \in \nu} \sum_{i=1}^{N} \int_{\phi} \Gamma_{i\phi}^{(n)} \sum_{j=1}^{J} P_{lj}^{\phi T} \frac{\mathrm{CTF}_{ij}^{2}}{\sigma_{ij}^{2(n)}} \, d\phi}, \tag{10}$$
where $l \in \nu$ indicates that the $l$th 3D Fourier component lies within resolution shell $\nu$, and $N_{\nu}$ is the total number of Fourier components that lie within that shell.
The estimated values of $\tau^{2}(\nu)$ are then used to calculate the optimal 3D linear filter for both reconstructions according to Eq. (3). Note that despite the 1D character of $\tau^{2}(\nu)$ and $\mathrm{FSC}(\nu)$, the modelled SNR in the Fourier domain may still be anisotropic, through the anisotropic CTF models and uneven orientational distributions in Eq. (3). Also note that Eq. (10) replaces Eq. (5) of the original MAP algorithm.
Only upon convergence of the refinement may the two subsets be joined to calculate a single reconstruction from all images. This final reconstruction will have a higher SNR than the two reconstructions from the independent halves of the data, but in order to prevent overfitting it may no longer be used in refinement. As suggested by Rosenthal and Henderson (2003), upon convergence the FSC curve is modified as $\mathrm{FSC}' = 2\,\mathrm{FSC}(\nu) / (1 + \mathrm{FSC}(\nu))$ to estimate the resolution of the combined reconstruction. Consequently, the frequency where the gold-standard FSC curve passes through 0.143 indicates the estimated resolution of the map. Recent insights that take into account that the volume occupied by the particle is typically only a fraction of the entire reconstructed volume (Sindelar and Grigorieff, in press) may be considered in future versions of the program.
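The chain from half-maps to FSC, SNR and resolution estimate can be sketched as follows. This is a minimal illustration with hypothetical inputs; the $\tau^{2}(\nu)$ estimate of Eq. (10) additionally requires the accumulated $\mathrm{CTF}^{2}/\sigma^{2}$ term from the denominator of Eq. (3), which is not computed here.

```python
import numpy as np

def gold_standard_fsc(half1, half2, angpix):
    """FSC between two independently refined half-maps, the derived SNR of
    Eq. (9), the Rosenthal-Henderson corrected curve, and the resolution at
    which the gold-standard FSC passes through 0.143. A sketch.
    """
    n = half1.shape[0]
    f1, f2 = np.fft.fftn(half1), np.fft.fftn(half2)
    freq = np.fft.fftfreq(n, d=angpix)
    fx, fy, fz = np.meshgrid(freq, freq, freq, indexing="ij")
    shell = np.rint(np.sqrt(fx**2 + fy**2 + fz**2) * n * angpix).astype(int)
    fsc = np.zeros(n // 2)
    for s in range(n // 2):
        m = shell == s
        num = np.real(np.sum(f1[m] * np.conj(f2[m])))
        den = np.sqrt(np.sum(np.abs(f1[m]) ** 2) * np.sum(np.abs(f2[m]) ** 2))
        fsc[s] = num / den if den > 0 else 0.0
    snr_map = np.maximum(fsc, 0.0) / np.maximum(1.0 - fsc, 1e-6)   # Eq. (9)
    fsc_full = 2 * fsc / (1 + fsc)   # estimate for the combined reconstruction
    below = np.nonzero(fsc < 0.143)[0]                             # 0.143 criterion
    below = below[below > 0]
    resolution = (n * angpix) / below[0] if below.size else 2.0 * angpix
    return fsc, snr_map, fsc_full, resolution
```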
2.7. General implementation details
RELION is implemented as a stand-alone program, and its open-source C++ code is available for download from http://www2.mrc-lmb.cam.ac.uk/relion. All developments described above have been implemented in version 1.1. Pieces of code, e.g. for dealing with symmetry, Euler angle operations and image I/O, were copied and/or adapted from the open-source packages XMIPP (Sorzano et al., 2004) and BSOFT (Heymann and Belnap, 2007), and all Euler angle and symmetry conventions are in accordance with the 3D-EM standard conventions (Heymann et al., 2005). A graphical interface is provided to facilitate its use by novice users.
Following the strategy employed in BSOFT, all metadata I/O is through plain text files in the STAR format (Hall, 1991). This format provides a convenient way to store tables of label-value pairs in a highly structured manner that is similar to XML but much easier to read by humans. The crystallographic community makes extensive use of the STAR format through crystallographic information files (CIF) (Hall et al., 1991). The structured metadata I/O in RELION was designed to facilitate its incorporation into umbrella-like packages that provide a uniform interface to a range of other programs. Efforts to do so in APPION (Lander et al., 2009) and EMAN2 (Tang et al., 2007) are currently ongoing (personal communication with Bridget Carragher and Steven Ludtke, respectively).
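For illustration, a particle metadata table in such a STAR file might look as follows; the exact set of metadata labels is version dependent, so the entries below should be taken as a hypothetical example rather than a prescribed format:

```
data_images
loop_
_rlnImageName
_rlnDefocusU
_rlnAngleRot
_rlnAngleTilt
000001@particles.mrcs  18503.2   34.7  102.5
000002@particles.mrcs  21040.8  151.2   88.3
```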
Despite the above-mentioned algorithmic efforts to speed up the calculations, RELION may still require considerable amounts of CPU time, depending on the task at hand. To further reduce computation times, RELION adopts a hybrid parallelization scheme that operates at two distinct levels. Distributed-memory parallelization through the message passing interface (MPI) is employed to divide the data set into subsets of images that are processed in parallel. A work-on-demand implementation, where a master node dispatches relatively small jobs to slave nodes that request work whenever they are idle, allows efficient use of heterogeneous computer clusters. The processing of the random halves of the data for the gold-standard FSC calculations is also handled through MPI, with each half of the data sent to a different subset of the slaves. At a lower level, shared-memory parallelization through POSIX threads is employed to further divide the workload of the MPI nodes, with each thread processing a subset of all orientations for each individual image. The distinct advantage of using threads over MPI is that all threads can access the same computer memory, so that the total amount of memory in modern multi-core computing nodes may be used more efficiently. Taken together, the hybrid parallelization approach provides maximum flexibility, both in terms of scalability and of memory usage.
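The work-on-demand pattern at the MPI level can be sketched as follows. This is a schematic in Python with mpi4py, under assumed batch sizes and particle counts, purely to illustrate the dispatch logic; RELION's own implementation is in C++ and additionally splits each batch over POSIX threads.

```python
from mpi4py import MPI  # sketch only: RELION's MPI layer is C++, not Python

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
TAG_WORK, TAG_STOP = 1, 2

if rank == 0:
    # Master: dispatch small batches of particle indices on demand
    # (10000 particles in batches of 100 are assumed numbers).
    batches = [list(range(b, b + 100)) for b in range(0, 10000, 100)]
    active = size - 1
    while active > 0:
        status = MPI.Status()
        comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        if batches:
            comm.send(batches.pop(), dest=status.Get_source(), tag=TAG_WORK)
        else:
            comm.send(None, dest=status.Get_source(), tag=TAG_STOP)
            active -= 1
else:
    # Slave: request work whenever idle; a pool of threads would further
    # split each batch over orientations (shared-memory level, omitted here).
    while True:
        comm.send(None, dest=0)                 # signal idleness
        batch = comm.recv(source=0, tag=MPI.ANY_TAG)
        if batch is None:
            break
        # ... evaluate Eqs. (6)-(7) for the particles in `batch` ...
```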