Towards coarse-grained modeling of proteins

This paper introduces a basic theoretical background to the description of conformational dynamics of pro- teins through a system of interacting domains. The essential collective degrees of freedom derived by principal component analysis of a molecular dynamics trajectory are used as dynamic variables deﬁning the projection operator technique that underlies the formalism suggested. The explicit form of the corresponding projection operator is obtained, and the projection method is employed to derive systems of coupled generalized Langevin equations for both individual atomic degrees of freedom and essential collective degrees of freedom in a protein. A deﬁnition of correlated domains in proteins is introduced based on the analysis of the essen- tial dynamics. Examples of identiﬁcation of such domains are presented. A system of coupled generalized Langevin equations is derived representing the protein through a few interacting domains embedded into a dissipative medium. Further developments and potential applications of the formalism are outlined.


Introduction
Building coarse-grained models for proteins is one of the major unresolved challenges for the theory.Proteins are complex soft-matter systems containing thousands of atoms interacting with each other within the protein molecule, as well as with the environment (solvent) and constantly changing their spatial conformations as a result of this interaction.It is known that the dynamics of protein molecules is highly hierarchical, e.g., it comprises chaotic high-frequency motions of individual atoms and small atomic groups superimposed with slower and more regular motions involving significant parts of the macromolecule.An appropriate identification, characterization, and prediction of these slow collective motions would tremendously benefit to both the basic understanding of protein dynamics and applications to the design of functional nanosystems employing proteins (drug design, biosensing, biodiagnostics, etc.).
A significant effort has been invested by researchers to develop a methodology of characterization of the collective modes in proteins through the multivariate analysis of molecular dynamic trajectories of proteins [1][2][3][4][5][6].In this method, which is known as the principal component analysis (PCA), the trajectory of the protein in the phase space, X (t) = {X 1 (t) , X 2 (t) , . . .X 3N (t)}, where X i are the Cartesian coordinates of individual atoms, is obtained from molecular dynamics simulations.The covariance matrix c ij = (X i (t) − X i ) (X j (t) − X j ) is defined through averaging over the entire trajectory, and the eigenvalue problem, is solved to obtain the eigenvectors . .E k 3N and eigenvalues σ k (k = 1, 2, . . .3N ), where N is the total number of atoms considered.The normalized eigenvectors, E 1 , E 2 , . . .E 3N , represent a set of 3N orthogonal collective degrees of freedom.One can consider the set of eigenvectors as an intrinsic coordinate frame in the phase space, and project the phase trajectory X (t) on them, The functions x k (t) defined by equation ( 2) can be viewed as the collective coordinates representing the conformational changes in the protein embedded in solvent.The eigenvalues σ k derived from equation (1) represent the mean square displacements along the corresponding collective degrees of freedom.Thus, it is possible to rank the collective degrees of freedom according to the magnitude of the associated eigenvalues, and to consider a truncated set of collective coordinates, x 1 , x 2 , . . .x kmax , k max 3N , which include only those collective coordinates that have the highest magnitude of the displacements [1][2][3][4][5][6][7].This truncated set of collective coordinates is sometimes referred to as the essential degrees of freedom [4,7].The complementary set of collective coordinates, x kmax +1 , x kmax +2 , . . .x 3N is interpreted as small-amplitude fluctuations.
The PCA-based techniques have provided valuable information about the geometry of the conformational changes, and are currently implemented in popular simulation packages such as AMBER and GROMACS.However, this formalism alone does not provide a sufficient insight into the dynamics of the conformational motions in macromolecules.Thus, protein isoforms sometimes show only minor geometrical differences, and yet have dramatically different functionalities.Composition of solvent is another factor, whose impact is difficult to capture by analysing the geometry of the phase trajectory alone.Thus, dynamics of the collective motions needs to be addressed in addition to their characterization through the standard PCA techniques.
Extensive attempts to develop a dynamic theory of conformational motions in proteins are presented in references [3,[7][8][9][10][11][12][13][14][15][16] and citations therein.It has been suggested to describe the dynamics of proteins by the classic Langevin equations of motion, employing either the essential collective coordinates x [3,14] or the Cartesian coordinates of atoms X [9] as dynamic variables.Several authors proposed employing the generalized Langevin equation as a more comprehensive approach to the description of damped motion of individual atoms in a protein [10,12,13].In the recent study [16], it has been postulated that the motion along the essential collective coordinates can be described by the generalized Langevin equation as well.Based on this assumption, an approach has been developed that represents protein dynamics by motion along a single collective coordinate that has been derived through the PCA of a molecular dynamics trajectory [16].However, applicability of the generalized Langevin equation to the essential degrees of freedom extracted from molecular-dynamic trajectories has not been proven rigorously, and any relation between the Langevin equation for Cartesian coordinates of atoms in the protein and those for the collective essential coordinates derived through PCA has not been established.Furthermore, employing a single collective degree of freedom is too a restrictive assumption for realistic proteins.
A fundamental unsolved problem that requires accounting for more than one collective degree of freedom in proteins, is representation of the collective motion in terms of particular domains containing atoms that move in a coherent way.Efforts in identifying such domains based on molecular simulations have been recently reviewed [17].Difficulties arise even with the very definition of domains, which sometimes include rather vague criteria such as being a visually recognizable substructure in the protein [17].In the most elaborate approach [18][19][20][21], domains are defined as rigid bodies, and identified by clustering of translations and rotations of elementary building blocks.The problem of this approach, however, is that those elementary building blocks should be postulated a priori (residuals, groups of a few atoms, etc.).Also the differences in motion that need to be captured are very subtle and susceptible to thermal noise, to sampling scheme, and to other uncertainties.A proper filtering of these unwanted impacts is complicated and computationally expensive to implement.A universal and dynamically justified concept for identifying the correlated domains has not been developed to the date.This challenge might be solved through a theoretical approach employing collective coordinates as dynamic variables in a protein, and identifying the couplings that cause the formation of correlated domains.However, a theoretical methodology describing the conformational dynamics in a protein based on multiple collective degrees of freedom still needs to be developed to this end.
In this paper, a comprehensive dynamic formalism is introduced that eventually permits to define the correlated domains in a protein, and describe their motion by a system of dynamic equations of motion that are parameterized employing PCA of molecular dynamics trajectories.In section 1, the set of essential collective coordinates derived by PCA is employed to construct the projection operator.In section 2, systems of coupled generalized equations for both individual atomic degrees of freedom and essential collective degrees of freedom are derived through the PCAbased projection operator method.In section 3, a definition of correlated domains in proteins is introduced based on the analysis of coupling of dynamic variables in equations of motion, and the examples of identifying such domains are presented.In section 4, generalized Langevin equations are derived describing the protein as a system of interacting domains.Potential applications of the formalism, further developments, and related challenges are discussed.Section 5 summarizes conclusions of the work.

Projection operator methodology for protein dynamics
In this section, systems of coupled generalized Langevin equations are derived for both Cartesian degrees of freedom of individual atoms and the collective essential coordinates in a protein, based on the projection operator method [22][23][24][25].For this purpose, the projection operator is defined employing a set of multiple essential degrees of freedom, which are derived through the PCA of molecular dynamics trajectories.

The projection operator from PCA-defined eigenvectors
Consider the eigenvectors E 1 , E 2 , . . .E 3N derived from equation (1).As it has been discussed, the eigenvectors represent the set of the protein's collective degrees of freedom, which can be subdivided into the essential degrees of freedom describing significant displacements, E 1 , E 2 , . . .E kmax , k max 3N , and the complementary set of collective degrees of freedom that correspond to smallamplitude fluctuations, E kmax +1 , E kmax +2 , . . .E 3N .The number of essential degrees of freedom, k max , is not determined or limited at this point.The set of essential degrees of freedom and that associated with fluctuations can be viewed as two orthogonal subspaces of the 3N -dimensional phase space.
Next, let us introduce the operator P and the complementary operator 1 − P , where Y is an arbitrary vector.One can easily check that the operators P and 1 − P can be interpreted as geometrical projections of the vector Y onto the subspace of the essential degrees of freedom and onto the subspace of fluctuations, respectively.This can be demonstrated by the following examples, In particular, when the operators P and 1 − P are applied to the trajectory vector X (t) = {X 1 (t) , X 2 (t) , . . .X 3N (t)}, this provides the essential component of the trajectory X E and the fluctuation component It is clear that

Essential dynamics for individual coordinates of atoms
The next step is employing the introduced projection operators to derive the generalized Langevin equations for individual atomic degrees of freedom, using the analogy with the Mori projection operator formalism [22][23][24][25].Consider again the trajectory of the entire system, X (t) = {X 1 (t) , X 2 (t) , . . .X 3N (t)} .The vector X (t) obeys the equation of motion, where m is a diagonal matrix providing the masses of atoms.Taking into account equation ( 6) where X 1−E represents minor changes in the positions of atoms as compared to a more pronounced essential motion given by X E , it is possible to approximately represent the force in the right-hand side of equation (7) in the following form, In equation ( 8), F X E is the mean force, K X 1−E represents fluctuations of the force, and K is the matrix with the elements K ij = −∂F i /∂X j .Equation (7) can thus be replaced with, Next, the projection operators P and 1 − P as defined in section 2.1 are applied to both sides of equation ( 9), which gives the formal equations of motion for X E and X 1−E , respectively: If the fluctuation contribution, X 1−E , changes much faster than X E , the elements of the matrix K can be considered as constants.With this assumption, equation ( 11) can be solved with respect to X 1−E , which gives the general solution in the form, The kernel Z (t) under the integral in the right-hand side of equation ( 12) in a general case is a non-diagonal matrix, whose elements are linear combinations of the terms sin (ω i t + ϕ i ), where ω 2 i are the eigenvalues of (1 − P ) m −1 K.The second term in the right-hand side of equation (12), which is symbolically represented by R, is also a linear combination of the harmonic functions of time such as sin (ω i t + ϕ i ), weighted with the values of X 1−E (0) and ˙ X 1−E (0) at the initial time t = 0. R can be viewed as the contribution of random noise to X 1−E .
The expression (1 − P ) m −1 F X E under the integral in the right-hand side of equation (12) represents the projection of the mass-weighted force m −1 F X E acting in the subspace of essential motions, onto the subspace of fluctuations.This can be rephrased as the coupling of fluctuations with the essential degrees of freedom.In the case of bilinear coupling, the solution of equation ( 12) can be represented in the following form [26][27][28][29][30][31], The form of the damping kernel Z H (t) and the random function R H (t) for the case of bilinear coupling can be found, for example, in references [27] and [30].Substitution of equation ( 13) in the right-hand side of equation ( 10) provides the equation of motion for the projection of the trajectory onto the subspace of essential motions X E (t), which can be converted into the form similar to the generalized Langevin equation, Here U X E is the potential of mean force, defined in such a way that ∂U can be interpreted as an external random force, in the sense that R X (t) does not depend on the dynamics of the system considered [25,26] and satisfies the requirements [12,30].The final step is rewriting equation (15) in the form of the set of scalar equations of motion for 3N atomic coordinates X E i : Here and The system of equations of motion ( 16) describes the trajectories for all atoms in the system, projected onto the subspace of the essential degrees of freedom that are identified by PCA of atomic trajectories.The first term in the right-hand side represents a "purely" essential motion defined by the mean forces −∂U ∂X E j .The other terms in the right-hand side describe the effect of fluctuations onto the essential motion.The fluctuations are included in the equation in the form τ ) dτ and the random force ρ i (t).It can be seen that in equation ( 16), the atomic degrees of freedom are coupled through the summations in the right-hand side.Intuitively, the coupling of the degrees of freedom of individual atoms should be responsible for the formation of coherent domains in proteins and therefore, this coupling is of particular interest in the context of this paper.According to equation (16), the coupling between the particular degrees of freedom i and j is defined, first of all, by the coefficients C ij .Following equation ( 17), the coefficients C ij are determined by the values E k i , which represent the direction cosines of the eigenvectors E k in the 3N -dimentional phase space of the protein, The values E k i can also be viewed as the projections of the eigenvectors E k onto the Cartesian degrees of freedom of individual atoms.It is noteworthy that only the eigenvectors that correspond to the essential degrees of freedom (k k max ) contribute to the coupling.Clearly, the essential collective coordinates should play a central role in building coarsegrained models of proteins.To better understand this role, equations of motions for the collective coordinates x k will now be derived and analysed.

Essential dynamics for collective coordinates and coupling of atomic degrees of freedom
By the definition, the essential collective coordinates x k can be obtained through the scalar product X E • E k = x k .Accordingly, the equations of motion for x k (t) are provided by a similar procedure applied to both sides of equation ( 15), To simplify this equation, let us first note that for an arbitrary vector Y , and, therefore, the operator P in the right-hand side of equation ( 19) can be omitted.Second, these improvements, equation ( 19) can be rewritten as follows, where Equation ( 20) provides a system of equations of motion for the essential collective coordinates x k .It is evident that in a general case, any essential collective coordinate x k is dynamically coupled to other essential collective coordinates in the system.The coupling of the mean forces is provided by the matrix of effective reciprocal mass µ −1 , whose elements are given by equation (21), and the coupling of the dissipative forces is provided by the memory matrix ξ (t), as given by equation (22).The meaning and structure of elements of the matrices µ −1 and ξ (t) are easier to understand, considering the simplified case of a single essential degree of freedom, when k max =1.In this case, the effective reciprocal mass is simply which can be interpreted as a weighted average of the reciprocal masses of atoms, with the weights equal to E 1 i 2 .Thus, the value E 1 i 2 represents a measure of involvement of the i-th atomic degree of freedom in the collective dynamics of the system.The memory function is given by a weighted summation of the contributions from individual atomic degrees of freedom as well, In the right-hand side of equation ( 25), the most significant term (containing the diagonal elements Z X,ii ) has the structure similar to that in equation ( 24), e.g.involvement of i-th atomic degree of freedom in the collective dynamics is characterized by the value In a general case of multiple essential collective degrees of freedom, the values of the directional cosines E k i can be interpreted as a measure of correlation of a particular atomic degree of freedom i with the essential collective coordinate k.Clearly, the direction cosines E k i can adopt positive, negative, or zero values.In the first case, the collective mode represented by E k and the atomic degree of freedom i are in phase, in the second case they are in anti-phase, and in the third case there is no correlation.The magnitude E k i is representative of the level of the correlation; the larger E k i is, the stronger is the involvement of the atomic degree of freedom i into the collective mode k.
One can conclude that the coupling of individual atomic degrees of freedom that is discussed in section 2.2 is mediated by correlations of the atomic degrees of freedom with the essential collective degrees of freedom in the system.This fact is reflected by the structure of the coupling coefficients C ij which, after equation (17), are given by a sum of the expressions E k i E k j over all essential collective modes k.

Correlated domains in macromolecules
Having introduced the methodology to the description of essential dynamics in macromolecules, the next step is attempting to represent this dynamics through domains containing the atoms that move in a coherent way, e.g.show a strong coupling.In section 2 it has been demonstrated that essential collective coordinates obtained through PCA of molecular dynamics trajectories play an important role as mediators of coupling of individual atomic degrees of freedom.However, despite a rather common expectation, the essential collective coordinates generally do not explicitly represent any particular groups of atoms.This is demonstrated, for example, by the fact that the elements of the matrix of reciprocal effective mass µ −1 ij in equation ( 20) are representative of a weighted average of masses of atoms involved in the collective motion, rather than of a cumulative mass of any group of atoms.Below, coherently moving domains of atoms in a protein are identified based on the analysis of couplings of atomic degrees of freedom.This approach does not employ any a priori assumptions regarding the structure of domains, and is based on the analysis of the coupling coefficients C ij in equation ( 16).

Correlated domains from dynamic coupling of coordinates of atoms
Consider the set of equations of motion for projected atomic coordinates (16), which we rewrite as follows, where i = 1, 2, . ..3N , and Consider again the essential eigenvectors define the coupling coefficients C ij as given by equation ( 17).In a system containing N atoms, the entire sets of direction cosines , where n = 1,. . .N. Each of these subsets represents the direction cosines relative to the Cartesian degrees of freedom of an individual atom x, y, and z.The coupling coefficients C ij can now be represented by where n 1 , n 2 = 1, 2, . . .N ; α, β = 1, 2, or 3 denote the Cartesian degrees of freedom x, y, and z; and E k n,α are the directional cosines of the collective vectors E k with respect to the atomic degrees of freedom {n, α}.Accordingly, equation ( 26) is converted to As it has been previously discussed, the values E k n,α represent correlations of the essential collective degrees of freedom with individual atomic degrees of freedom, and at the same time, they define the couplings between the degrees of freedom of individual atoms.Therefore, it is natural to define correlated domains as groups of atoms for which the values E k n,α have a similar magnitude for each of the essential collective degrees of freedom k, and are nonzero for at least some k [32].This definition of domains can be illustrated by the following simple classification of cross-correlation terms in the equation of motion for the atomic coordinates in a hypothetic protein containing a single correlated domain: (i) The atoms n 1 and n 2 belong to the correlated domain and their Cartesian degrees of freedom are similar (α = β).Since in this case E k n1,α ≈ E k n2,β for all k, and E k n1,α = 0 for at least some k, the coupling coefficients in the equation of motion for the atomic coordinate {n 1 , α} are nonzero and positive.Let us denote such cases by In equation ( 29) δ denotes the domain, N δ denotes the set of atoms in the domain δ, and the expression n 2 ∈ N δ means that the atom n 2 belongs to the domain δ.
(ii) The atom n 1 belongs to the correlated domain δ, whereas the atom n 2 does not belong to this domain, n 2 / ∈ N δ , and/or the Cartesian degrees of freedom are different, α = β.The contributions of such cases are represented by Since the values E k n1,α and E k n2,β can now vary in both magnitude and sign, the coupling coefficients C n1,α,n2,β and C n1,α,n2,α in equation (30) are given by a summation of both positive and negative terms.This generates significantly smaller values for the coupling coefficients, which can also vary in signs.As a result, the total contribution to the equation of motion for the coordinate {n 1 , α} is much less than in the case (i).
The entire equation of motion for the atomic coordinate {n 1 , α} in a single correlated domain reads, ẌE n1,α = ẌE n1,α (i) where the first term in the right-hand side describes the coupling of degrees of freedom which are correlated within the domain, whereas the second term corresponds to the coupling of degrees of freedom which do not show a strong correlation within the domain.Since in most cases the contribution (i) is much larger than (ii), in the first approximation one can disregard the uncorrelated term, and assume that ẌE .
This result demonstrates the physical meaning of domains in the present theory.Domains are groups of atoms that show a strong dynamic coupling in the equation of motion for projected atomic coordinates.The domains are identified as groups of atoms, for which the corresponding direction cosines of the essential collective degrees of freedom E k n,α adopt similar values for each k.No assumption regarding any elementary building blocks and/or interatomic interactions are made in this work to identify the domains.

Example: domains in a misfolded prion protein
In this section, correlated domains in a macromolecule are identified for the example of a misfolded conformation of prion protein PrP 27-30, which has been recently suggested as a heuristic prototype of the pathogenic prion isoform apt to aggregate into oligomers [33].The structure of a monomer of PrP 27-30, which contains 139 residues, is shown in Fig. 1.The molecule contains two α-helices, as well as several layers of β-strands.To the date, little is known about the properties of such misfolded prion isoforms, as well as about the mechanism through which they arise and aggregate into dimmers, trimmers, and other oligomers.Thus, a rigorous and efficient approach to the characterization of structure and stability of the misfolded prion proteins would be of a tremendous importance.Starting with the structure shown in figure 1, a molecular dynamics trajectory of the solvated protein was generated using the NAMD2 code with Amber parm99 force field and explicit TIP3P water [34].After equilibration, a trajectory of more than 3 ns in duration has been obtained.At the beginning of the trajectory as well as after every 1 ns, 0.2 ns intervals were analysed by PCA.Four such intervals have been analysed independently.For essential collective coordinates, 30 principal components with the highest eigenvalues (k max = 30) were identified for each of these four intervals.All atoms in the molecule were accounted for when doing the PCA, which corresponded to N = 2150.The direction cosines of the collective coordinates E k n,α have been represented by N points, each corresponding to an individual atom, in the 3k max dimensional space of essential collective motions.In this space, the points that are located close to each other represent a similarity in directions of motion of the corresponding atoms.To obtain correlated domains, the N points have been clustered using the nearest-neighbor technique [35].This technique has been selected because no structural property, such as the number of domains, need to be assumed a priori.The only restriction employed is that the domains are assumed to contain more than 2 atoms.As a part of the nearest-neighbor clustering, the inter-domain distance d needs to be identified, which defines the maximum distance between the points representing atoms in the 3k max dimensional space of essential collective degrees of freedom, for the corresponding atoms belong to the same domain.Note that the distance d is a dimensionless value, as it follows from the definition of the metric in the space of the directional cosines of the collective degrees of freedom.As it can be seen in figures 2(a) and 2(b), the identification of the domains is sensitive to the selection of d.Thus, no correlated domains can be identified below a minimum distance d min ≈ 0.001, whereas most of the protein molecule is recognized as one large domain beyond a maximum distance d max ≈ 0.005 − 0.006.The most interesting and informative breakdown of the molecule into domains is reached between these two limit values.In the next publication [34], the clustering methodology is analysed more in detail.Here examples are considered for distances d that maximize the difference between the number of atoms involved in all domains and in the largest domain.In The examples in figures 3(a) and 3(b) show five largest domains identified in PrP 27-30.The correlated domains are shown with colors, whereas atoms that do not belong to the largest domains are colored gray.An important result that emerges from the figures is that the identified correlated domains form compact groups of atoms, although the clustering formalism does not require any proximity of the locations of atoms in the primary, secondary, or tertiary structure.By the definition, a proximity in the 3k max dimensional space of essential collective motions reveals only a similarity in directions of the motion of atoms.The fact that this proximity identifies the compact atomic groups, confirms the viability of the clustering formalism.Another noticeable feature is that some side groups connected to the correlated domains have not been recognized as belonging to these domains.The explanation is that the motion of side groups is more flexible in comparison with the main-chain groups, which results in a weaker correlation.Figures 4(a-d) compare five largest domains found at various trajectory times: after equilibration (a), after 1 ns (b), after 2 ns (c), and after 3 ns (d).To obtain these, four fragments of the trajectory, each of 0.2 ns, have been analysed independently.From the figures it is evident that, despite a certain variability of the domains over the trajectory, there is a reasonable match between the secondary structure and the domains identified.Thus, significant parts of the α-helices and β-strands are involved in correlated domains.However, the overlap between the domains and the secondary structure is not complete.Usually, only parts of α-helices are recognized as a single domain, which reveals a flexibility within these structures.Furthermore, structure elements that are quite remotely separated in the main chain, but located near each other in the tertiary structure, are sometimes recognized as a single domain (see e.g. an example in figure 4(b)).
Another aspect that needs addressing is that the domain systems shown in figures 4(a-c) are representative of the variability of the essential collective motion.Indeed, changes in the correlated domains reflect the corresponding changes in the set of essential collective coordinates derived by PCA of different parts of the trajectory.The most important is that the changes in the domain make it possible to distinguish minor conformational variations from more significant structural changes.Thus, all four structures presented in figures 4(a-c) are somewhat different.However, not all of these differences are important.For example in figures 4(a), 4(c), and 4(d), the domains colored orange, green, and red show the same basic structure, in spite of a variability in the size and position of the domains.In figure 4(b), however, the domains have composed a different configuration, which can be interpreted as a conformational transition in the β-rich part of the protein.
One can conclude that the introduced formalism of domain identification offers a straightforward and efficient methodology of characterizing the conformation stability over a molecular dynamics trajectory.If the configuration of major correlated domains does not significantly depend on the sampling interval, then the conformation space of the proteins can be considered as largely stable.
Otherwise, any significant changes in the conformation space generate easily detectable changes in the domains.

Towards coarse-grained dynamics of correlated domains
After the correlated domains are identified, coarse-grained dynamics of a protein can be explicitly represented by considering the motion of the domains as a result of their interaction with each other and with the environment.Thus, each domain can be characterized by the number of atoms involved N δ , the mass M δ of the domain, and the coordinates of the center of mass Xδ Xδ Here δ denotes the domains, α denotes the Cartesian coordinates of the domains x, y, or z, and X E n represents the coordinates of individual atoms.The expression n ∈ N δ α says that only α-th coordinates of atoms involved in the domain δ are accounted for to define the center-of-mass coordinate Xδ α .The coordinates of the centers of the domains of masses Xδ α can also be expressed through the essential collective coordinates where the identity is taken into account, and Double differentiation of equation ( 34) over time and replacing of ẍk with equation ( 20) leads to Equation ( 35) is a formal equation of motion for the coarse-grained degrees of freedom represented by the coordinates of the centers of the domains of masses Xδ α .If the total number of such coarsegrained degrees of freedom is equal to the number of the essential collective coordinates, x k can be expressed through Xδ where Equation ( 36) describes the coarse-grained dynamics in a protein through a few interacting domains embedded in a dissipative medium.The equation can be entirely parameterized based on the dynamics of essential collective motions discussed in section 2. If the effective masses, mean forces, and memory kernels are available for a set of essential collective coordinates, then the corresponding parameters ∂U ∂ Xs , V sl , and ζ sl (t) can be identified, provided that the domains of atoms with strongly correlated degrees of freedom exist in the molecule.An important outcome of the theory is that the maximum number of addressable coarse-grained degrees of freedom Xs is equal to the number of essential collective coordinates x k .Thus, if only one collective coordinate is considered (k max = 1), then no more than one coarse-grained degree of freedom can be described by equation (36).If k max = 3, the corresponding set of 3 equations of motion can represent the {x, y, z} coordinates of one domain, or it can describe particular degrees of freedom belonging to two or three different domains.In a general case, the number of essential collective coordinates k max cannot be less than the number of the coarse-grained degrees of freedom that is necessary to describe.
The requirement to the set of essential collective coordinates to be equal or exceed the number of coarse-grained degrees of freedom that need to be described, complements the standard multivariate analysis of molecular dynamic trajectories, which only ranks the collective coordinates according to the associated mean-square displacements and does not identify what exactly the set of essential coordinates should be.The required minimum number of coarse-grained degrees of freedom provides a guidance for this choice.
Another issue of the standard ranking of the collective coordinates according to the associated mean-square displacements is that such a ranking does not define what the "sufficient" value of the displacement is for a coordinate to be essential.This implies that the standard eigenvalue ranking needs to be complemented by another criterion based on the dynamics of the essential motions.To find such a criterion, let us recall that the formalism presented in section 2 assumes that (i) the displacements related to the essential motions are significantly larger than the fluctuations; and (ii) the essential motions are significantly slower than the fluctuations.Evidently, the selection of a set of essential collective coordinates should be consistent with both assumptions.However, only the requirement (i) is fulfilled by the standard ranking of the collective coordinates x k according to the respective eigenvalues.A solution that emerges from the analysis in section 2 is to complement the ranking of eigenvalues by a comparison of the decay times τ xx and τ ξ , which correspond to the autocorrelation function x k (t) x k (0) and to the memory kernel, ξ kk (t), respectively, for each of the potentially essential coordinates x k .Indeed, the requirement that motion along the collective coordinates is slow compared to fluctuations means that τ xx should be larger than τ ξ .Thus, the requirement τ xx > τ ξ employed together with the standard ranking of the eigenvalues of the covariance matrix would provide a sufficient criterion for identifying the sets of essential collective coordinates.A potential problem may be presented by very slow modes that are not captured well enough because their characteristic time is larger than the trajectory time [16,36].These undersampled modes combine small eigenvalues σ k with large decay times τ xx , and thus do not match neither the category of essential modes, nor the category of fluctuations.A solution is to identify and eliminate the undersampled modes, for example, through conventional drift reduction techniques.This is a natural limitation of any analysis based on molecular dynamics trajectories.
As it follows from the previous discussion, the memory kernel matrix ξ (t) is one of the central quantities in the present theory.The diagonal elements ξ kk (t) are required for complementary ranking of the essential degrees of freedom, and the entire matrix ξ (t) is needed to parameterize the coarse-grained equations of motion.For a single essential coordinate, it has been suggested to extract ξ (t) from molecular dynamics trajectories by solving the memory equation for the velocity autocorrelation function, ẋ (t) ẋ (0) , which is derived from the generalized Langevin equation through the well-known procedure [10,16,37].In the case of multiple collective coordinates that is considered in this paper, the procedure analogous to that described in reference [37] leads to the following set of integral equations for ξ kl (t): The required correlation functions can be evaluated numerically from molecular dynamics trajectories [38], and the set of integral equations (38) can then be solved with respect to ξ kl (t).Identification of the mean forces −∂U ∂x k is the most challenging implication of the theory.It has been suggested in the literature to evaluate the potential of mean force U (x) from the phase space density Ψ (x) that is derived from molecular dynamics trajectories [7,14,16], provided that the set of snapshots obtained from molecular dynamics simulation satisfies the condition of ergodicity [7,14,16].In the case of multiple essential coordinates, however, the mean force along a particular collective coordinate x k is a function of other coordinates.The expression for the mean force −∂U x 1 , x 2 , . . .x kmax ∂x k , in principle, can be derived analytically based on the formalism from sections 2.2 and 2.3.This would also make it possible to predict the variability of domains with time.To accomplish this, however, a self-consistent solution of equations ( 7) and (20) needs to be found.Solving this fundamental challenge appears to be one of the most important and promising milestones in the future development of the theory.At the present stage, however, the major purpose of the theory is characterization and comparison of protein conformations based on molecular dynamics trajectories.For this particular purpose, employing dependencies analogous to equation (39)in order to evaluate U (x) projected onto particular collective coordinates [14] appears to be a reasonable approximate solution, provided that the output of the molecular dynamics trajectory satisfies the requirements for such an interpretation [7,14,16].

Summary
This work introduces a comprehensive theoretical methodology of describing the conformational dynamics of proteins based on the projection operator method [22][23][24][25].The essential collective degrees of freedom defined by the principal component analysis of the molecular dynamics trajectory are used as dynamic variables in the formalism.The explicit form of the corresponding projection operator is obtained, and the projection technique is employed to derive systems of the coupled generalized Langevin equations for both individual atomic degrees of freedom and essential collective degrees of freedom.The number of the essential degrees of freedom is not limited in the theory.Unlike other studies of protein dynamics, the present theory is valid for any number of essential coordinates.In particular, the coupling of relevant dynamic variables is explicitly included in the equations of motion.The theory includes the model with a single essential degree of freedom, as a particular case.
Based on the analysis of the coupling of the dynamic variables, a consistent definition of correlated domains in a protein has been introduced.The domains, which are supposed to serve as major building blocks for coarse-grained modelling of proteins, are defined as groups of atoms whose Cartesian coordinates show a strong coupling in the generalized Langevin equation of motion.For such groups of atoms, the direction cosines of the essential collective degrees of freedom adopt similar values.Accordingly, the domains are identified through a simple clustering procedure.Unlike the existing approaches to the identification of the domains in proteins, subject to clustering are the directional cosines of the essential collective degrees of freedom in the phase space, and not translations and/or rotations of individual atomic groups, which makes the formalism immune to noises.Furthermore, no limiting assumptions are made regarding the structure of domains, their number, or interatomic interactions in the protein.The formalism of domain identification is general, physically transparent, and intimately related to the essential dynamics of the protein.
An example of identification of correlated domains is provided for the misfolded isoform of a prion protein, .The example demonstrates that the identified domains are composed of compact groups of atoms, although the spatial proximity of atoms is not required by the formalism.The domains have also shown a reasonable match with the secondary structure; however, there is no complete similarity.Some domains follow closely particular elements of secondary structure or their parts, while others are composed of different elements that are located near each other in the tertiary structure but are quire remotely separated in the main chain.
It has been demonstrated that the introduced formalism of domain identification offers a straightforward and efficient methodology of characterizing the conformation stability of proteins over a molecular dynamic trajectory.If the configuration of major correlated domains does not significantly depend on the sampling interval, then the conformation of the proteins can be considered as largely stable.Otherwise, changes in the essential collective coordinates cause easily detectable changes in the correlated domains.
The potential of building analytic coarse-grained models describing conformational motions in a protein through a few interacting domains embedded in a dissipative medium has been analysed in detail.For the first time, generalized Langevin equations of motion for the Cartesian coordinates of the correlated domains are derived and parameterized analytically based on the equations of motion for the essential collective coordinates.It is demonstrated that the number of addressable coarsegrained degrees of freedom cannot exceed the number of essential collective coordinates identified in the macromolecule.Dynamic criteria for identifying the set of essential collective coordinates are identified.Methodologies of parametrizing the equations of motion for the correlated domains are outlined, and potential further developments are discussed.Thus, a fundamental challenge and one of the most important and promising future milestones in the extension of the formalism is prediction of the dynamic variability of the energy landscape that generates changes in the domain structure of the protein.

Figure 2 .
Figure 2. Examples of typical dependencies of the domain system in PrP 27-30 on the interdomain distance d: (a) -the number of correlated domains as a function of d; (b) -the number of atoms in all correlated domains (solid line), and the number of atoms in the largest domain (dotted line) as functions of d; (c) -the difference between the number of atoms in all domains and in the largest domain as a function of d.Note that the distance d is a dimensionless value, as it follows from the definition of the metric in the space of directional cosines of essential collective degrees of freedom.

figure 2 (
figure 2(c) it can be seen that for PrP 27-30, the corresponding optimum value of d is close to 0.003.The examples in figures 3(a) and 3(b) show five largest domains identified in PrP 27-30.The correlated domains are shown with colors, whereas atoms that do not belong to the largest domains are colored gray.An important result that emerges from the figures is that the identified correlated domains form compact groups of atoms, although the clustering formalism does not require any proximity of the locations of atoms in the primary, secondary, or tertiary structure.By the definition, a proximity in the 3k max dimensional space of essential collective motions reveals only a similarity in directions of the motion of atoms.The fact that this proximity identifies the compact atomic groups, confirms the viability of the clustering formalism.Another noticeable feature is that some side groups connected to the correlated domains have not been recognized as belonging to these domains.The explanation is that the motion of side groups is more flexible in comparison with the main-chain groups, which results in a weaker correlation.

Figure 3 .
Figure 3. Example of five largest domains identified in PrP 27-30: the general view (a) and cartoon (b).The domains are shown blue, yellow, orange, green, and red.Parts of the protein that are not involved in the domains are shown gray.

Figure 4 .
Figure 4. Five largest correlated domains in PrP 27-30 identified at various trajectory times: after equilibration (a), after 1 ns (b), after 2 ns (c), and after 3 ns (d).Off-domain parts of the protein are not shown.The examples from figures 3(a) and 3(b) correspond to (d).