Prediction of interacting single-stranded RNA bases by protein binding patterns

Alexandra Shulman-Peleg1∗, Maxim Shatsky2, Ruth Nussinov3,4†, Haim J. Wolfson1

1 School of Computer Science, Beverly and Raymond Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
2 Physical Biosciences Division, Berkeley National Lab, California, USA
3 Basic Research Program, SAIC-Frederick, Inc., Center for Cancer Research Nanobiology Program, NCI, Frederick, MD 21702, USA
4 Department of Human Genetics and Molecular Medicine, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel

Abstract

Prediction of protein-RNA interactions at the atomic level of detail is crucial for our ability to understand and interfere with processes such as gene expression and regulation. Here, we investigate protein binding pockets that accommodate extruded nucleotides not involved in RNA base pairing. We observed that 86% of protein interacting nucleotides are part of a consecutive fragment of at least two nucleotides whose rings have a significant interaction with the protein. Many of these share the same protein binding cavity, and almost 30% of such pairs are π-stacked. Since these local geometries cannot be inferred from the nucleotide identities, we present a novel framework for their prediction from the properties of protein binding sites. First, we present a classification of known RNA nucleotide and dinucleotide protein binding sites and identify the common types of shared 3D physico-chemical binding patterns. These are recognized by a new classification methodology based on spatial multiple alignment. The shared patterns reveal novel similarities between dinucleotide binding sites of proteins with different overall sequences, folds and functions. Then, we use these patterns for the prediction of RNA fragments that can bind to a protein of interest.
Specifically, we search a given target protein for regions similar to known binding patterns and achieve a success rate of 80%. Based on the interaction patterns, we further predict an RNA fragment that interacts with those protein binding sites. Since the RNA fragment can have a previously unknown sequence and structure, this is especially useful for applications such as RNA aptamer design. By searching a database of all known small molecule binding sites for regions similar to nucleotide and dinucleotide binding patterns, we can suggest fragments and scaffolds that target nucleotide binding sites.

Keywords: protein-RNA interactions, nucleotide and dinucleotide binding sites, physico-chemical binding patterns, multiple binding site alignment, RNA aptamer drug design.

∗ Corresponding authors’ emails: {shulmana,wolfson}@post.tau.ac.il
† The publisher or recipient acknowledges right of the U.S. Government to retain a nonexclusive, royalty-free license in and to any copyright covering the article.

1 Introduction

Protein-RNA interactions are crucial for many cellular processes, such as gene expression and regulation, protein synthesis and many more. Understanding and predicting such interactions at the atomic level is crucial for our ability to interfere with malfunctioning processes in disease. Consequently, the analysis of protein-RNA interactions and RNA binding sites has been a field of intensive research. The pioneering works of Jones et al.1 and Nadassy et al.2 have provided important insights into the physico-chemical and geometrical nature as well as the amino acid composition of these regions. The specific atomic interactions formed between nucleotides and amino acids have been analyzed and compared in several important contributions3–8. By analyzing the amino acid composition in RNA binding sites, several successful methods for the prediction of RNA binding sites were developed9–13.
However, most of these methods do not distinguish between the two main interaction types, which were thoroughly studied by Draper14: (1) interactions with the backbone of double-stranded RNA molecules; (2) interactions of single-stranded RNA bases that are accommodated in protein binding pockets. As noted in a recent comprehensive review by Auweter et al.15, while the first type of interaction occurs through positively charged protein surface patches, the second type, contacts with single-stranded nucleotides, often involves hydrophobic patches. The contacts are often with the unpaired nucleic acid bases, while direct contacts with the phosphate moieties of the backbone, which point towards the bulk solution, can be rare15. Since flexibility plays an important role in interactions with the RNA backbone15, we focus on the prediction of protein interactions formed with the single-stranded nucleotide bases. Sequences of such nucleotides, which are not involved in local base pairs and are extruded from the surrounding double-stranded helix, also termed extruded helical single strands, were recently described as special motifs in SCOR (Structural Classification of RNA) and were proposed to be mediators in RNA-RNA and RNA-protein interactions16, 17.

Several works have classified protein-RNA interactions based on the sequences and/or overall structures of the corresponding protein18–20 or RNA molecules17, 21, 22. However, these do not always capture the similarity in the local regions that are responsible for nucleotide binding. Analysis of these regions is important for several reasons. First, proteins of the same family can form different interactions with RNA nucleotides23–25. Second, RNA molecules can be flexible and can explore different conformations that are “fixed” by the protein, whose binding site is more rigid and quenches this motion26, 27.
This observation is also supported by a recent study of Ellis and Jones28, who evaluated the conformational changes in known RNA binding proteins and observed that the flexibility in the protein binding sites is not significant and should allow the structural prediction of these interaction regions.

Previous works that analyzed the physico-chemical binding patterns of nucleotides focused on single nucleotides1, 5, 29, 30 and did not apply them to the prediction of protein-RNA interactions. Since the binding patterns of nucleotides of the same type (e.g., adenine, guanine) are very similar5, their application to the prediction of RNA binding sites and the reconstruction of protein-RNA complexes is rather limited.

Here, we investigate the protein binding pockets that accommodate extruded nucleotides not involved in RNA base pairing. We observed that many of these protein cavities are common to pairs of consecutive nucleotides that are often π-stacked with each other. Consequently, we suggest that consideration of the binding patterns of pairs of consecutive RNA nucleotides may be essential for the correct prediction of protein-RNA interactions. We observed that the local nucleotide geometries cannot be inferred from the nucleotide identities and developed a novel framework for their prediction through the recognition of known protein binding patterns. Specifically, we present a classification of nucleotide and dinucleotide binding sites, which are described by a set of physico-chemical properties that may be created by amino acids with different identities and different spatial locations of backbone atoms. Towards this goal, we have developed a new classification algorithm which performs multiple structural alignments and validates the spatial superimposition of the cluster members. The created clusters describe the common types of 3D consensus binding patterns that are used for several applications.
First, their recognition allows atomic level predictions of dinucleotide binding sites with a success rate of 80%. Second, combining the predictions of neighboring dinucleotide binding sites allows us to predict the structure and the sequence of consecutive RNA fragments of, on average, 5 nucleotides. Finally, searching the database of drug binding sites for patterns similar to nucleotide and dinucleotide binding sites can assist in the prediction of ligands and ligand fragments that can be used to interfere with protein-RNA interactions.

2 Results

Our goal is to recognize and predict the main types of interactions between protein binding sites and single-stranded RNA bases. Specifically, we focus on the protein binding pockets that accommodate extruded16 nucleotides not involved in RNA base pairing (see Figure 1). We define a nucleotide binding site as the protein Connolly solvent accessible surface area31 within 2 Å of the surface of the RNA nucleotide ring. Nucleotides with a protein binding site area larger than 3 Å² are considered protein interacting. Given a pair of extruded consecutive nucleotides that interact with the protein, a dinucleotide binding site is defined by the pair of corresponding nucleotide binding sites. The physico-chemical properties of the binding sites are represented by points in 3D space termed pseudocenters, extracted from the protein amino acids according to Schmitt et al.32. Each pseudocenter represents a group of atoms according to the interactions in which it may participate: hydrogen-bond donor, hydrogen-bond acceptor, mixed donor/acceptor, hydrophobic aliphatic and aromatic (π) contacts. Only surface exposed pseudocenters are considered. Figure 1 presents several examples of extruded nucleotide pairs and their protein dinucleotide binding sites.
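The binding site definitions above can be summarized as a small data model. The following is only an illustrative sketch: the class and function names (`PseudocenterType`, `NucleotideSite`, `dinucleotide_sites`) are hypothetical, the contact area is assumed to have been computed elsewhere, and the check that a nucleotide is extruded (unpaired) is not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class PseudocenterType(Enum):
    """The five interaction types listed in the text."""
    DONOR = "hydrogen-bond donor"
    ACCEPTOR = "hydrogen-bond acceptor"
    DONOR_ACCEPTOR = "mixed donor/acceptor"
    ALIPHATIC = "hydrophobic aliphatic"
    AROMATIC = "aromatic (pi)"


@dataclass
class Pseudocenter:
    kind: PseudocenterType
    xyz: Tuple[float, float, float]


@dataclass
class NucleotideSite:
    residue_index: int      # position of the nucleotide on the RNA strand
    contact_area: float     # protein surface area (in Å^2) within 2 Å of the ring
    pseudocenters: List[Pseudocenter] = field(default_factory=list)

    def is_interacting(self, min_area: float = 3.0) -> bool:
        # Text: an area larger than 3 Å^2 marks a protein interacting nucleotide.
        return self.contact_area > min_area


def dinucleotide_sites(sites: List[NucleotideSite]):
    """Pair up consecutive nucleotides that both interact with the protein."""
    ordered = sorted(sites, key=lambda s: s.residue_index)
    return [
        (a, b)
        for a, b in zip(ordered, ordered[1:])
        if b.residue_index == a.residue_index + 1
        and a.is_interacting()
        and b.is_interacting()
    ]
```

In this sketch a dinucleotide binding site is simply the pair of the two single-nucleotide sites, mirroring the definition in the text.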
The analysis below is performed on a non-redundant dataset of all existing protein-RNA structures, which consisted of 278 complexes with a resolution of 3 Å or better and less than 50% sequence identity in at least one chain (see Methods). This section starts with some general observations regarding protein-RNA interactions and the protein binding pockets that are involved. We present our classification methodology and show its application to the recognition of 3D consensus patterns of dinucleotide binding sites. As illustrated in Figure 2, these patterns are further used for the prediction of dinucleotide binding sites as well as RNA fragments and protein-RNA complexes. Finally, we describe how our infrastructure can be used in drug design applications for the discovery of RNA aptamers33 and small molecules that target nucleotide and dinucleotide binding sites.

2.1 Interactions of single-stranded RNA bases

Our first goal is to understand which types of interactions and binding pockets are involved in protein-RNA interactions. We observed that 24% of all unpaired RNA nucleotides have a significant contact with the protein (calculated on the above described dataset of protein-RNA complexes, see also Methods and Supplementary Material). Only 3% occupy binding pockets of single nucleotides, whose neighbors on the strand do not form a significant contact with the protein. Consequently, 21% of the unpaired nucleotides interact with the protein and have at least one RNA strand neighbor that interacts with the protein via its ring as well. Thus, 86% of protein interacting extruded nucleotides are part of a fragment of at least two nucleotides that have a significant interaction with the protein. Interestingly, we observed that in many cases two consecutive nucleotides share the same binding cavity. They can either be π-stacked with each other or form interactions with the same water molecule (e.g., PDB: 1wmq) or the protein (see Figure 1).
Here, we estimate the frequency and the geometries of the first phenomenon, which was biochemically studied for some specific protein-RNA complexes34. We define a pair of consecutive extruded nucleotides as π-stacking if the planes of the corresponding nucleotides are either parallel or perpendicular, with an angle of 180 ± 20° or 90 ± 20° respectively. In addition, we require that the distance between the nucleotide ring centroids is less than 7 Å35. Table 1 presents the frequency of π-stacking observed in two non-redundant datasets constructed from all the available RNA structures and protein-RNA complexes respectively. We observed that, in general, 31% of consecutive pairs of extruded nucleotides are involved in local intra-RNA π-stacking interactions. When considering protein-RNA complexes, 28% of consecutive, protein interacting nucleotides are π-stacked with each other. As expected, T-shaped (perpendicular) π-stacking is very rare (8% of consecutive pairs) and was never observed in protein binding pockets. When trying to estimate the sequence specificity of π-stacking, the only dinucleotide pair observed to have a clear tendency for π-stacking was the AC pair (significant by the hypergeometric p-value above 0.009). It constituted 60% and 37% of π-stacked pairs in the datasets of protein-RNA complexes and all RNA structures respectively (see Supplementary Material). These results show that the nucleotide geometries cannot be predicted from the RNA sequence alone and that nucleotide identities are not sufficient for the atomic level prediction of protein-RNA complexes.

To summarize, we have observed that in most cases protein-RNA interactions are mediated by consecutive, protein interacting nucleotides, many of which are accommodated in the same protein binding cavity and are π-stacked.
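The geometric π-stacking criterion defined above can be sketched as follows. This is a minimal illustration under stated assumptions: the angle between ring planes is taken as the angle between their normal vectors, angles near 0° and near 180° are treated as equivalent (the sign of a plane normal is arbitrary), and the function name `classify_stacking` and its parameters are hypothetical rather than part of the described software.

```python
import math


def _angle_deg(u, v):
    """Angle between two 3D vectors in degrees, in the range [0, 180]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    return math.degrees(math.acos(cos))


def classify_stacking(normal_a, centroid_a, normal_b, centroid_b,
                      angle_tol=20.0, max_centroid_dist=7.0):
    """Return 'parallel', 't-shaped' or 'none' for two nucleotide rings.

    normal_*   : normal vector of each ring plane
    centroid_* : ring centroid coordinates (in Å)
    """
    # Text: ring centroids must be less than 7 Å apart.
    if math.dist(centroid_a, centroid_b) >= max_centroid_dist:
        return "none"
    angle = _angle_deg(normal_a, normal_b)
    # Parallel planes: 180 ± 20 degrees (assumption: 0 ± 20 is equivalent).
    if angle <= angle_tol or angle >= 180.0 - angle_tol:
        return "parallel"
    # Perpendicular (T-shaped) planes: 90 ± 20 degrees.
    if abs(angle - 90.0) <= angle_tol:
        return "t-shaped"
    return "none"
```

For example, two rings with antiparallel normals whose centroids are 3.5 Å apart are classified as parallel, while the same rings placed 8 Å apart are not considered π-stacked.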
Consequently, the binding cavities of single nucleotides are not always sufficient for a proper physico-chemical description of protein-RNA interactions, and the binding patterns of dinucleotide binding sites should be considered. Moreover, since the RNA sequence cannot be used for the prediction of local nucleotide geometries, below we propose to utilize the protein binding patterns for the prediction of RNA nucleotide orientations and protein-RNA complexes.

2.2 Classification of nucleotide and dinucleotide binding sites

Here, we classify all the nucleotide and dinucleotide binding sites extracted from the dataset of protein-RNA complexes (see Methods). We consider all the dinucleotide binding sites that accommodate pairs of consecutive extruded nucleotides. In addition, binding sites of single nucleotides that are not part of such pairs are considered as single nucleotide binding sites. We present a unique classification methodology which validates the cluster quality by multiple spatial binding site alignment. Specifically, we have developed a center-star classification algorithm (see Methods), which creates clusters by iteratively adding binding sites in order of decreasing similarity (based on pairwise spatial alignments) and validates each new addition by the multiple spatial alignment among all current cluster members. If the multiple similarity, measured by the score of the common physico-chemical binding pattern, is lower than a predefined threshold (e.g., ≤ 30% of one of the binding sites), the new member is ignored and not added to the cluster. The main advantage of this approach is that we validate the spatial superimposition of the cluster members and assess the quality of the shared physico-chemical binding pattern. We applied this classification methodology and obtained 186 clusters of protein dinucleotide binding patterns, created from the binding sites of all pairs of consecutive extruded nucleotides.
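The clustering loop described above can be outlined in a few lines. The sketch below is not the published implementation: the seeding rule (the next unassigned site becomes the cluster center) is an assumption, and `pair_sim` and `multi_sim` are hypothetical callbacks standing in for the pairwise and multiple spatial binding site alignments, whose details are given in Methods.

```python
from typing import Callable, Hashable, List, Sequence


def center_star_clusters(
    sites: Sequence[Hashable],
    pair_sim: Callable[[Hashable, Hashable], float],
    multi_sim: Callable[[List[Hashable]], float],
    threshold: float = 0.30,
) -> List[List[Hashable]]:
    """Greedy center-star clustering validated by a multiple-alignment score.

    pair_sim(a, b)    : pairwise similarity of two binding sites (hypothetical)
    multi_sim(members): score of the pattern shared by all members (hypothetical)
    """
    unassigned = list(sites)
    clusters: List[List[Hashable]] = []
    while unassigned:
        # Assumption: the next unassigned site seeds a new cluster.
        center = unassigned.pop(0)
        cluster = [center]
        # Candidates are visited in order of decreasing similarity to the center.
        candidates = sorted(unassigned, key=lambda s: pair_sim(center, s),
                            reverse=True)
        for cand in candidates:
            # Validate the addition against all current members; a score at or
            # below the threshold means the candidate is ignored.
            if multi_sim(cluster + [cand]) > threshold:
                cluster.append(cand)
        unassigned = [s for s in unassigned if s not in cluster]
        clusters.append(cluster)
    return clusters
```

Sites rejected by the multiple-alignment check remain unassigned and can seed or join later clusters, which is one way unique binding sites end up as singleton clusters.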
In addition, we obtained 41 clusters that describe protein binding patterns of single nucleotides that are not part of a consecutive protein interacting pair. The full details of all the clusters are provided in the Supplementary Material. However, many of the considered binding sites are unique, and the number of clusters with more than one member is 53 and 15 for dinucleotide and nucleotide binding sites respectively. Since binding patterns of single nucleotides were considered by previous studies1, 5, 29, 30, 36, here we focus on the analysis of dinucleotide binding patterns and distinguish between two types: (1) patterns formed by dinucleotide binding sites of proteins with similar structural folds; (2) patterns formed by dinucleotide binding sites of proteins with different structures. We start this section with a description of the clusters that involve members of the RNA-binding domain family, which exhibit all of these types of similarities. Then, we present examples of additional similarities revealed by our classification.

2.2.1 The RNA-binding domains

The RNA recognition motif (RRM), also known as the RNA-binding domain (RBD) or ribonucleoprotein domain (RNP), is considered to be one of the most abundant protein domains23. In spite of its overall structural simplicity, this domain can recognize a wide variety of RNAs and can perform various biological functions. Our dataset contained 9 complexes with RBDs of the following 6 types: (1) spliceosomal U1A proteins (PDBs: 1m5o, 1urn, 1sj3); (2) splicing factor U2B’ (PDB: 1a9n); (3) sex lethal protein, Sxl (PDB: 1b7f); (4) HuD proteins (PDBs: 1g2e, 1fxl); (5) poly(A)-binding protein (PDB: 1cvj); and (6) pre-mRNA splicing factor U2AF65 (PDB: 2g4b). Although all of these proteins share less than 2% overall sequence identity, all of them possess the RPM1 and RPM2 sequence motifs23.
Our clusters reveal the similarities and the differences of the various nucleotide binding sites which allow these proteins to achieve a range of required biological functions. Notably, the results obtained with our protein binding site alignment method (see Table 2) are consistent with the manual observations and comparisons of nucleotide binding sites which were performed for some of the complexes23, 24. Table 2 presents the details of all the dinucleotide clusters which involve RBDs. As can be seen, the created clusters divide our proteins into three main groups: (1) HuD and Sxl, which have 7 common dinucleotide binding sites in spite of sharing only 49% sequence identity; (2) U1A and U2B proteins, which share 69% sequence identity and 4 common dinucleotide binding sites; (3) U2AF65 and poly(A) binding proteins, which are clear outliers. They do not share clusters with other RBDs and are indeed known to be different from the rest and from each other37, 38. Below we provide a detailed description of the similarities recognized by our clusters.

Tyr-Phe aromatic platform of RBDs. Figure 3(a) presents a superimposition of six complexes, which constitute one of our largest RBD clusters, formed by four different protein types (U1A, U2B, HuD and Sxl; see Table 2, cluster 100). All the dinucleotide binding sites in this cluster share the same aromatic binding platform formed by the conserved Phe and Tyr residues (see Figure 3(b)). These aromatic residues form π-stacking interactions with nucleotides that have exactly the same spatial orientation but different nucleotide identities. Previous biological studies have shown the importance of these π-stacking interactions for affinity and stability39. The correct recognition of this platform by our method provides a validation of its correctness.
The spatial conservation of the physico-chemical properties shared by all the complexes provides additional insights into the formation and importance of this platform (see Supplementary Material). Interestingly, the structure of HuD in complex with the c-fos AU-rich element (PDB: 1fxl) was classified in a separate cluster together with a Rho terminating factor (PDB: 1pvo), which was recognized to have the same Tyr-Phe aromatic binding platform. This is a novel similarity, which we could not find in previous studies. The reason that the HuD binding site of 1fxl was not classified with the rest of the RBDs is that its similarity to U1A (PDB: 1sj3) is less than 30%: these two binding pockets share only 5 common features, while the common pattern of the above described cluster has 6. Remarkably, the binding site of the Rho terminating factor exhibits a pattern of 7 physico-chemical properties which is very similar to that of HuD with c-fos (see Supplementary Material). As illustrated in Figure 3(b), although the Rho terminating factor is structurally different from RBDs and does not contain the required RPM sequence motifs, our binding site alignment method reveals that it uses a very similar aromatic binding platform, which binds dinucleotides in a manner similar to RBDs.

RBDs and proteins of other folds. In addition to the already mentioned similarity of the Tyr-Phe RBD platform to that of the Rho terminating factor, our clusters reveal novel similarities of RBDs to other structurally, functionally and evolutionarily unrelated proteins (see Table 2). Figure 3(c, top and middle) illustrates the similarity between the dinucleotide binding sites of π-stacked A{C|G} pairs of U1A and U2B and the binding site of a π-stacked anticodon pair GC of cysteinyl-tRNA synthetase (PDB: 1u0b; G33-C34 are part of the GCA anticodon loop). Figure 3(c, bottom) presents an additional example of the similarity between U2AF and a guanine transglycosylase (PDB: 1q2r).
In both structures there is an unusual turn between the considered nucleotides. As can be seen, this kink, which is measured to be 120° in U2AF38, breaks the binding site into two separate surface patches, which are similar in the two structures. In both examples, in spite of the structural differences between the proteins, the binding sites have similar surfaces and shapes and bind the π-stacked nucleotide pairs in exactly the same orientation. A very similar situation is observed in the rest of the clusters of this type, which are detailed in Table 2.

2.2.2 Similar structures, similar dinucleotide binding sites

Similar binding patterns formed by structurally similar proteins are expected. This type of similarity is usually well illustrated in the biological literature, which allows us to verify the correctness of the recognized physico-chemical patterns. In addition, the correct superimposition of the overall structures by the transformations calculated for the dinucleotide binding sites allows us to validate the correctness of the method. Table 3 details all the recognized clusters of this type. Below we describe several clusters that provide some insights into the RNA sequence specificity and reveal the repetitive nature of certain binding patterns.

Nucleotide sequence specificity. MS2 RNA hairpin coat-protein complexes have been widely used as a model system for studying RNA sequence specificity40, 41. Our dataset contained 20 complexes of RNA bacteriophage capsid proteins, most of which were determined as part of these studies. These studies have revealed that the main driving force for complex formation is the π-stacking interaction between the -5 base and the conserved tyrosine side chain. Our clusters support this finding. First, they reveal that the only extruded pair of nucleotides which has a sufficient contact with the protein is the pair of nucleotide bases -4 and -5 41 (see Table 3, cluster 215).
Second, they recognize the physico-chemical binding pattern common to all binding sites of this dinucleotide pair (see Figure 4(a)). They clearly show the conservation of the critical tyrosine side chain (Tyr85, PDB: 2izn), which can tolerate any nucleotide sequence mutation (U|G|A|C). In addition, the specific pattern recognized for the binding site of adenosine can explain the specificity of this position for this nucleotide type42.

Repetitive protein patterns. Our dataset contained 4 structures of Pumilio proteins, which regulate mRNA expression43. These were observed to have three types of dinucleotide binding sites, which are detailed in clusters 83, 92 and 206. Interestingly, our clusters reveal the repetitive nature of the Pumilio repeats, which are comprised of similar structural and sequence motifs. Figure 4 shows one example of the similarity between the binding sites located on the sequence motifs known as repeats 6 and 8 43. The similar physico-chemical properties and surfaces of these binding sites are formed by the amino acid sequence QYYQ, which has the same spatial arrangement in both repeats (see Supplementary Material). Although the overall structures of the compared proteins are similar, the repetitive nature of these spatial patterns cannot be recognized using standard structural alignment methods44.

2.2.3 Different structures, similar dinucleotide binding sites

As illustrated by the example of the Pumilio repeat proteins, similarity of sequence patterns can lead to similarity of dinucleotide binding sites. We start this section with an additional example, in which we recognize similar binding patterns created by an even fuzzier sequence motif. We further describe several of our clusters that reveal similar dinucleotide binding sites of proteins that do not share any sequence or structure motifs (see Table 4).
K homology (KH) motif. The K homology (KH) motif is a widespread RNA-binding sequence motif containing an invariant Gly-X-X-Gly segment45, which is known to be shared by proteins with globally distinct structures46. Two of our clusters (clusters 34 and 106, see Table 4) reveal the similarity between consecutive dinucleotide binding sites of the transcription elongation proteins NusA (PDB: 2atw, 2asb) and the neuronal splicing factors Nova-1 (PDB: 2anr, 2ann). Both of these protein types possess the KH motifs, which are aligned by our method. Figure 5 presents the common binding pattern of three consecutive nucleotides, GAA (G1-A3, 2atw) and CAC (C214-C216, 2anr), of NusA and Nova respectively. The recognized common protein physico-chemical pattern, which is consistent with previous studies19, 45, 47, consists of the conserved hydrogen bond donors contributed by the GXXG motif (2anr: G22, G25; 2atw: G237, G240) and hydrophobic aliphatic interactions formed by the conserved Ile residues (2anr: I21, I39; 2atw: I236, I255).

Aminoacyl-tRNA synthetases. Nine of our clusters of the second type reveal similarities between different types of aminoacyl-tRNA synthetase proteins. Interestingly, the structures of tRNA synthetases coding for different amino acids are different, and they are currently thought to have evolved from different single domain proteins which are not known to have any common ancestor48. However, it is estimated that the structural relationship between these proteins may reflect intermediate steps in the establishment of codon-amino acid interactions48. Our clusters, which are detailed in Table 4, support this theory and reveal binding patterns that may be reused in different protein regions that are important for RNA recognition. Figure 5(b) illustrates one such example of the similarity between two anticodon loop binding sites of glutamyl-tRNA synthetase (PDB: 1n78) and aspartyl-tRNA synthetase (PDB: 1il2).
These binding sites accommodate the dinucleotide pairs CU and GU that are part of the codons CUC and GUC coding for Glu and Asp respectively. Notably, neither the overall structures of these complexes nor the rest of the anticodon loops are aligned by our method. This suggests that the recognized pattern may be the only building block reused during evolution in a manner that allows the specificity of each binding site. Figure 5(c) presents an additional example of similarity, between aspartyl-tRNA synthetase (PDB: 1asy) and rRNA (uracil-5-)-methyltransferase RumA (PDB: 2bh2). Both binding sites bind the GU nucleotide pair. In 1asy this pair of nucleotides (634-635) is located in the anticodon loop, while in 2bh2 the GU pair (1954-1955) is responsible for the recognition of the hairpin by the N-terminal OB fold. As can be seen, in both of the above examples, in spite of the structural and functional differences of the proteins, their binding sites share similar physico-chemical properties and shapes and bind exactly the same dinucleotide sequence in the same spatial orientation.

2.3 Prediction of RNA binding sites and structure of protein-RNA complex

Given a target protein structure, we aim to recognize the binding sites of its extruded nucleotides and to construct the RNA fragments that can bind to them, predicting the structure of an RNA strand and of the protein-RNA complex. Figure 2 illustrates the flow of our classification and prediction processes. Specifically, given the clusters of binding sites described in the previous section, we define the 3D consensus binding patterns by the physico-chemical properties shared by all cluster members (see Methods). Given a target protein structure not used in the classification, we search its surface for the presence of any of these 3D consensus patterns, predicting its nucleotide and dinucleotide binding sites.
Using the nucleotide orientations observed in the consensus patterns allows us to predict the structure of RNA strands and protein-RNA complexes.

2.3.1 Prediction of RNA binding sites

Here, we use the created clusters to predict RNA binding sites that accommodate unpaired extruded nucleotides. Specifically, given a target protein structure not used for the classification, we search its surface for regions similar to the created 3D consensus binding patterns; these regions are predicted to serve as binding sites. Due to the low number of single nucleotide clusters, we evaluate our predictions of dinucleotide binding sites only. We perform leave-one-out tests on all the structures that participated in the above described clusters with more than one member. For each left out complex, we repeat the classification procedure without its structure and use the obtained clusters to define a new set of 3D consensus binding patterns. Then, we search the surface of the left out protein for the presence of the constructed 3D patterns. All of these are searched by a single algorithm (RnaPred, see Methods), which recognizes the top ranking patterns that are similar to some protein regions. For each dinucleotide binding site of each left out protein, we check whether it was correctly predicted by this procedure. A dinucleotide binding site is considered to be correctly predicted if one of the 20 top ranking solutions fulfils at least one of the following requirements: (1) the RMSD between the dinucleotides of the real complex and those predicted by the protein-profile alignment is ≤ 3 Å; (2) at least four (or 30%) of the physico-chemical properties (pseudocenters) involved in protein-dinucleotide interactions of the specific site were correctly predicted by the protein-profile alignment. The percentage of successfully predicted dinucleotide binding sites is the success rate of our methodology. Notably, the success rate calculated in these leave-one-out tests is 80%.
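The evaluation criterion above can be written down compactly. The sketch below is one possible reading, not the evaluation code used in the study: the phrase "at least four (or 30%)" is interpreted as "at least four pseudocenters, or at least 30% of the pseudocenters of the site", and the names `Solution`, `first_correct_rank` and `success_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Solution:
    rmsd: float                 # RMSD (in Å) of predicted vs. real dinucleotide
    matched_pseudocenters: int  # interacting pseudocenters recovered by the match


def solution_is_correct(sol: Solution, n_site_pseudocenters: int,
                        rmsd_cutoff: float = 3.0,
                        min_matched: int = 4,
                        min_fraction: float = 0.30) -> bool:
    # Requirement (1): the dinucleotide is placed within 3 Å RMSD.
    if sol.rmsd <= rmsd_cutoff:
        return True
    # Requirement (2), as interpreted here: four pseudocenters, or 30% of them.
    if sol.matched_pseudocenters >= min_matched:
        return True
    return (n_site_pseudocenters > 0
            and sol.matched_pseudocenters / n_site_pseudocenters >= min_fraction)


def first_correct_rank(ranked: List[Solution], n_site_pseudocenters: int,
                       top_k: int = 20) -> Optional[int]:
    """1-based rank of the first correct solution among the top_k, else None."""
    for rank, sol in enumerate(ranked[:top_k], start=1):
        if solution_is_correct(sol, n_site_pseudocenters):
            return rank
    return None


def success_rate(ranks: List[Optional[int]]) -> float:
    """Fraction of binding sites with a correct solution among the top ranks."""
    return sum(r is not None for r in ranks) / len(ranks)
```

Keeping the rank of the first correct solution, rather than a plain yes/no flag, also makes it straightforward to report how often the correct prediction is the top ranking solution or falls within the top 5.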
Moreover, in 50% of the cases the correct prediction was the top ranking solution, and in 82% of the cases it was within the 5 top ranking solutions. We consider this to be a success, especially since we do not only point to some specific amino acids that are involved in the interactions, but predict the spatial orientation of the RNA nucleotides in the protein binding site. Consequently, we expect that the combination of such predictions will allow the reconstruction of protein-RNA complexes with an unknown structure. Below, we detail our first steps towards this goal.

2.3.2 Prediction of RNA strands and protein-RNA complexes

Given a protein structure, we would like to be able to predict which RNA fragments it can bind and what the structure of the resulting protein-RNA complex is. This is a very ambitious goal, and currently we only show some initial steps towards it. Existing tools are unable to provide a good solution to this problem for several reasons. First, we aim to solve this problem without any assumption about, or knowledge of, the structure of the RNA molecule. Due to the limited number of existing RNA structures, this is an important requirement. Unfortunately, it prevents the use of standard methods such as docking. Currently, the most applicable approach is the superimposition of the given protein upon the protein-RNA complex of its closest homologue. This provides a prediction of the complex between the input protein and the RNA molecule bound to the homologue. However, this approach has several limitations. First, as can be seen in the example of the RBD clusters, similar overall sequences and folds do not always lead to similar nucleotide binding modes. Second, the superimposition of proteins by their backbone atoms often misaligns the RNA molecules and the specific nucleotide binding sites of interest. Our methodology improves these points in the following ways.
First, by looking for the similarity of the physico-chemical patterns in the binding sites, we use only the most reliable protein information, which indeed leads to similarity in binding. Second, we superimpose the proteins according to the transformations calculated for their binding sites, which usually optimize the similarity of the RNA nucleotide orientations. Here, we extend the RnaPred algorithm to the following scheme. Given a protein structure, we first recognize all regions that are similar to any of the above described 3D consensus patterns. Then, we superimpose the matched 3D patterns upon our target structure and consider the RNA dinucleotides that are bound to these patterns. We consider all the solutions and try to connect the nucleotides superimposed on neighboring regions to form a valid continuous RNA fragment. The longest RNA fragment with the highest similarity score of the alignment between the protein binding sites and the selected 3D patterns is the top ranking solution. Since the constructed RNA fragment is composed of dinucleotides taken from 3D consensus patterns of totally different proteins, it can differ from any RNA fragment of known sequence and structure. This allows us to make unique predictions that cannot be achieved by other methods. As described below, these can contribute to such applications as the design of RNA aptamer drugs which target protein-RNA interactions. To evaluate the performance of this method, we performed leave-one-out tests similar to those described in the previous section. As before, each time we left out one protein structure, re-clustered, and created a new set of 3D consensus patterns that do not assume any knowledge of the left-out structure. Then, this structure was searched for the highest scoring sequence of regions similar to the constructed 3D consensus patterns that allows us to create the longest continuous RNA fragment.
The average length of the predicted fragment was 5 nucleotides and the average running time on a standard PC was 3.5 minutes (AMD Opteron 242, 1593MHz). Notably, when we calculated the RMSD between the predicted RNA fragments and the real fragments bound to the left-out structures, in 22% of the cases it was less than 3Å, and in 17% of the cases it was even less than 1Å. In the rest of the cases, although we precisely reconstructed some sub-fragments, we added false positive predictions which pointed to regions that do not interact in the given complex. Figure 6 presents two examples which illustrate our success and limitations. In the first example, we reconstruct the RNA fragment and the structure of the HutP antitermination complex (1wpu). The total length of the RNA strand of this complex is 7 nucleotides and it interacts with the protein through 4 dinucleotide binding sites. When we applied our prediction method and searched the surface of the HutP protein, we correctly predicted all 4 of its dinucleotide binding sites. As expected, the prediction was made based on four 3D consensus patterns of homologous proteins (1wrq, 1wmq), the chaining of which constituted the top ranking solution. Interestingly, 2 of these patterns were based on clusters that contained additional binding sites of Aminoacyl-tRNA synthetase (1n78), whose anticodon CUC loop was recognized to be similar to the central UAG motif of HutP (see Table 4, cluster 63). In spite of this addition, the patterns were correctly mapped to the query protein and the structure of the RNA fragment was correctly predicted with an overall RMSD of 0.5Å from the native. However, the structure of the two 5′ UU nucleotides was not predicted by our method. The first 5′ U was ignored due to some missing atoms in one of the structures (PDB:1wrq), which led to a singleton cluster of 1wmq that was not used for the construction of 3D patterns.
Since the second U nucleotide has no interaction with the protein, its binding site and ring orientation could not be predicted by our method (see Figure 6 (a) top). Interestingly, this nucleotide is also more flexible than the rest and its prediction is more difficult. Figure 6 (b) presents another example, in which we reconstruct part of the hepatitis delta precursor in interaction with the U1A protein. The loop of this RNA molecule interacts with the protein through 6 dinucleotide binding sites. Four of these sites, formed by the nucleotides A1-C5 of U1A [39], were correctly predicted and mapped by our method. This allowed us to reconstruct a fragment of five RNA nucleotides with an overall RMSD of less than 0.3Å from their native structure. However, we failed to recognize the dinucleotide binding site of C5-A6. Consequently, the fragment A6-C7, the binding site of which was correctly mapped to the protein surface, was not added to the reconstructed RNA fragment. Interestingly, this failure occurred in the prediction of the Tyr-Phe binding platform which is described above (see Figure 3). This is explained by the fact that during the reclustering without the query protein, another structure of Sxl (PDB:1fxl) was added to the cluster. The Tyr-Phe binding platform of the query protein detailed in Figure 3 (a) is slightly different from the binding platform of 1fxl represented in Figure 3 (b). The main difference is not in the aromatic stacking platform, but in the spatial disposition of the rest of the properties that form the binding sites. Consequently, when we combined the two patterns into one cluster represented by only 5 pseudocenters of 1fxl, we failed to correctly map this pattern onto the structure of 1sj3, which is the most different from 1fxl (see Section 2.1). Since in the current version we attempt to reconstruct only sequential fragments, we stopped upon the failure and did not extend the predicted RNA fragment.
2.4 Drug design applications

There are two main types of drugs that can be developed to prevent the formation of protein-RNA interactions. The first type comprises the most common small molecule drugs, which are bioavailable when administered orally. The second type comprises drugs based on short strands of RNA oligonucleotides. These ligands, known as aptamers, are selected for their ability to bind proteins with both high affinity and high specificity [33]. Below we present drug discovery applications and show how our methodology can contribute to the development of both types of drugs.

2.4.1 Discovery of small molecule drugs

We have observed that the average surface area of the protein binding sites that accommodate dinucleotide pairs is 145Å² if the pairs are π-stacked and 210Å² otherwise. Since optimal drug molecules were proposed to have a surface area smaller than 140Å² for bioavailability reasons [49], the dinucleotide binding sites have a high potential to accommodate drugs and serve as drug targets. To provide suggestions of potential ligands and ligand fragments that can target dinucleotide binding sites, we have searched the constructed 3D consensus patterns against a database of all known small molecule binding sites (see Methods). The top ranking solutions of this search are the binding sites with properties similar to dinucleotide binding sites. The drug leads and the substrates that are bound to them provide ideas for small molecules that can bind to the dinucleotide binding sites. The details of the top ranking binding sites and their corresponding ligands are provided in the Supplementary Material. Figure 7 (a) presents one example, in which the U3-A4 dinucleotide binding site of HutP is aligned to the binding site of Cyclophilin A bound to sanglifehrin macrolide (PDB:1nmk, rank 24). As can be seen, the sanglifehrin molecule has a volume and shape similar to the UA pair. Its chemical groups and scaffold can be used in the design of new inhibitors for HutP.
2.4.2 Discovery of novel aptamers

Our new method, RnaPred, described in Section 2.3.1, predicts the sequence and the structure of the RNA strands that can bind to the protein of interest. It suggests initial RNA sequences that can serve as the starting point in the design of new aptamers. Since our method uses the information from all available protein-RNA complexes, it can design previously unknown RNA sequences and structures that were never experimentally observed. Since all the predictions are knowledge-based, the constructed models rest on rational considerations of the nucleotide geometries typical of the observed protein physico-chemical patterns.

2.4.3 Aptamer optimization

Obviously, the predictions provided by a computational method are only a set of suggestions, which require further validation and optimization. One way to improve the binding affinity and/or selectivity is to optimize the single extruded nucleotides, which are unpaired and are not part of a protein interacting dinucleotide pair. To obtain ideas for the chemical groups and scaffolds that can be used for the modification, we can search the above described database of drug-like binding sites for those that are similar to the nucleotide binding site of interest. The detailed results of such searches performed with the constructed 3D consensus binding patterns of single nucleotide binding sites are provided in the Supplementary Material. Figure 7 (b) presents one example of the similarity between a single nucleotide binding site (A37) of Cysteinyl-tRNA synthetase (PDB:1u0b) and that of Human macrophage elastase (MMP-12, PDB:1ros, ranked 28) bound to its inhibitor.
Since the binding regions of the two proteins have similar physico-chemical properties and secondary structure elements (a helix and 2 strands), the inhibitors developed for MMP-12 can provide useful ideas for the chemical groups that can be used to substitute and optimize the adenine nucleotide of Cysteinyl-tRNA synthetase.

3 Summary and conclusions

Motivated by the important role of extruded non-paired RNA nucleotides in protein-RNA recognition, we have investigated their local geometries and interactions. We have observed that in most cases of protein-nucleotide interactions, there are several consecutive RNA nucleotides that are not involved in RNA base pairing. Since the nucleotide identities are not indicative of their spatial geometry, we consider the protein pockets that accommodate them. We observed that many of the consecutive nucleotide pairs share the same binding cavity and interact with each other. Consequently, we suggest that the binding patterns common to such nucleotide pairs provide a more accurate representation of these interactions. We proposed a novel algorithmic framework which starts with the classification of all known nucleotide and dinucleotide binding sites according to their spatial physico-chemical patterns. These patterns, termed 3D consensus patterns, are further used for the prediction of binding sites and RNA fragments. Obviously, the proposed framework is just a starting point and each of its stages can be further enhanced and improved. The classification methodology, which has the advantage of the spatial validation of the created patterns, has all the disadvantages of the regular center star clustering and is sensitive to the selected star centers and the order of traversal. The created 3D consensus patterns, which were shown to be extremely useful, do not contain the information about the variation of the spatial patterns of the cluster members, whose description is not straightforward.
Selection of the shared pattern coordinates from a single structure could further influence the results. Currently, we do not predict the interactions formed with the RNA backbone, which are often represented by smaller and more flexible physico-chemical binding patterns. Nonetheless, the results of this paper indicate that it is possible to predict the interactions and the structure of fragments of single-stranded RNA bases. We intend to use the methodology presented here as the first part of a two-stage scheme for the prediction of complete protein-interacting RNA strands. The results of this paper allow the prediction of binding sites and interactions formed with the nucleotide bases. Modeling the short RNA sub-fragments between the predicted regions is expected to allow the challenging reconstruction of complete RNA fragments. In addition to being an important milestone towards the ultimate goal of RNA-protein structure prediction, our results provide several important insights. First, the presented classification reveals novel and surprising similarities between dinucleotide binding sites formed by proteins with different overall sequences, folds and functions. These results suggest that certain physico-chemical patterns may be reused during evolution in different protein regions that are important for RNA recognition. Second, we have presented a framework which allows a successful prediction of dinucleotide binding sites as well as recognition of ligands and ligand fragments that can target them. We hope that this will be useful in the design of aptamer and small molecule drugs that interfere with protein-RNA interactions.

4 Methods

Dataset construction and resources. The datasets of RNA structures and protein-RNA complexes were retrieved from the NDB database [50] in March 2007 and contained 1103 and 278 structures, respectively.
The dataset of protein-RNA complexes contained only structures with a resolution of 3Å or better, while the dataset of all RNA structures contained NMR structures as well. Sequence redundancy was removed using BlastClust [51] with a 50% sequence identity threshold. When considering protein-RNA complexes, structures with more than 50% sequence identity in both RNA and protein chains were removed. The local base pairing of RNA nucleotides was recognized using the 3DNA software [52].

Alignment of nucleotide and dinucleotide binding sites. Since it was shown that single nucleotides can bind in alternative modes even to the same protein binding site [53], the alignment of their binding sites was performed using our previously developed MultiBind method [54]. This method allows the recognition of the maximal physico-chemical pattern common to the input set of binding sites, without using any information regarding the corresponding binding partners. The dinucleotide binding sites extracted from protein-RNA complexes contain additional information about the spatial orientation of their two nucleotides, which we aim to predict. Consequently, it is essential that the alignment method requires the similarity of the nucleotide geometries in the aligned binding sites. To fulfil this requirement, we have developed a new algorithm, RnaBind, which aligns dinucleotide binding sites and utilizes the nucleotide orientation for the construction of 3D transformations that superimpose the input binding sites. Specifically, each dinucleotide pair is represented by the two centroids of the corresponding nucleotide rings and the phosphate atom between them. Then we apply the Least-Squares Fitting method [55] to calculate the transformation that provides the best alignment of such representative triplets extracted from the input binding sites.
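The superposition of two representative triplets (ring centroid, phosphate, ring centroid) can be illustrated with the following sketch. It is not the RnaBind implementation. The text cites the Least-Squares Fitting method [55]; the code below solves the same least-squares superposition problem with Horn's unit-quaternion formulation and a plain power iteration, a substitution made here only to keep the example free of external libraries. The function names and the toy coordinates are hypothetical.

```python
import math

def centroid(points):
    """Arithmetic mean of a list of 3D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]

def fit_transform(moving, target, iterations=500):
    """Return (rot, t) minimising sum |rot*m + t - g|^2 over paired points m, g."""
    cm, ct = centroid(moving), centroid(target)
    a = [[p[i] - cm[i] for i in range(3)] for p in moving]
    b = [[p[i] - ct[i] for i in range(3)] for p in target]
    # Cross-covariance s[i][j] = sum_k a_k[i] * b_k[j]
    s = [[sum(u[i] * v[j] for u, v in zip(a, b)) for j in range(3)] for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; its top eigenvector is the optimal unit quaternion.
    n = [[sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
         [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
         [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
         [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz]]
    # Shift the spectrum so that all eigenvalues are non-negative, then power-iterate.
    shift = max(sum(abs(v) for v in row) for row in n)
    if shift == 0.0:
        raise ValueError("degenerate input: all points coincide")
    q = [1.0, 0.5, 0.25, 0.125]  # arbitrary start; assumed not orthogonal to the solution
    for _ in range(iterations):
        q = [sum(n[i][j] * q[j] for j in range(4)) + shift * q[i] for i in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    w, x, y, z = q
    rot = [[w*w + x*x - y*y - z*z, 2*(x*y - w*z), 2*(x*z + w*y)],
           [2*(y*x + w*z), w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
           [2*(z*x - w*y), 2*(z*y + w*x), w*w - x*x - y*y + z*z]]
    t = [ct[i] - sum(rot[i][j] * cm[j] for j in range(3)) for i in range(3)]
    return rot, t

def apply_transform(rot, t, p):
    """Rotate point p by rot and translate it by t."""
    return [sum(rot[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

# Hypothetical triplet and a copy rotated by 90 degrees about z, then translated.
triplet = [(0.0, 0.0, 0.0), (3.0, 1.0, 0.0), (4.0, -2.0, 1.0)]
moved = [(5.0, -2.0, 1.0), (4.0, 1.0, 1.0), (7.0, 2.0, 2.0)]
rot, t = fit_transform(triplet, moved)
```

The fixed iteration count and the fixed starting vector are simplifications; a production implementation would more likely use an SVD or a symmetric eigensolver from a numerical library.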
Once the binding sites are superimposed in 3D space, we apply maximum weight matching in a weighted bipartite graph [56] to determine the 1:1 correspondences between the matched pseudocenters of the input binding sites, and score their similarity by the scoring functions developed for our MultiBind method [54].

Multiple center star clustering. Standard clustering methods such as UPGMA or k-means [57] provide a general methodology to group any elements based on their pairwise relations. However, complex elements that are similar to each other in pairs may not be similar as a whole group. For the clustering of dinucleotide binding sites we are interested in computing clusters of binding sites that are similar as a whole group, i.e. in finding a group of binding sites that share a significant consensus pattern. Therefore, we developed the following new clustering procedure. Similar to most existing methods, we start by performing all-against-all pairwise alignments between the objects of interest. Then, we use the calculated pairwise scores to create a graph in which the objects of interest are the nodes. Edges are created between similar nodes, i.e. nodes with a pairwise similarity score above a certain threshold. We require that the similarity score between two nodes, $S(n_1, n_2)$, be at least 30% of the score of one of them aligned to itself: $(S(n_1, n_2) \ge 0.3 \cdot S(n_1, n_1)) \lor (S(n_1, n_2) \ge 0.3 \cdot S(n_2, n_2))$. Then, for each node, we consider the “star” that is created by its edges and determine its weight as the sum of their corresponding scores. We go over all the “stars” in decreasing order of their weight and try to create a cluster around each “star center”. For each star, we go over its edges in decreasing order of their score and try to create a cluster composed of the star center and the nodes of the selected edges. For example, we start with the highest scoring edge and perform the alignment between its two nodes.
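The graph construction and the ordering of stars and edges described so far can be sketched as follows. This is only an illustration of the bookkeeping: the function name, the dictionary-based input format and the toy scores are assumptions, and the pairwise scores themselves would come from the binding site alignment, which is not reproduced here.

```python
def build_stars(self_score, pair_score, fraction=0.3):
    """Return a list of (weight, center, neighbours), heaviest star first.

    self_score: {node: S(node, node)}
    pair_score: {(node_a, node_b): S(node_a, node_b)}, one entry per unordered pair
    """
    edges = {node: [] for node in self_score}
    for (a, b), s in pair_score.items():
        # An edge exists if the pairwise score reaches 30% of either self-alignment score.
        if s >= fraction * self_score[a] or s >= fraction * self_score[b]:
            edges[a].append((s, b))
            edges[b].append((s, a))
    stars = []
    for center, star_edges in edges.items():
        star_edges.sort(key=lambda e: e[0], reverse=True)  # highest scoring edge first
        weight = sum(s for s, _ in star_edges)              # star weight = sum of edge scores
        stars.append((weight, center, [nbr for _, nbr in star_edges]))
    stars.sort(key=lambda st: st[0], reverse=True)          # heaviest star first
    return stars

# Hypothetical scores for three binding sites A, B and C.
self_score = {"A": 10.0, "B": 10.0, "C": 20.0}
pair_score = {("A", "B"): 4.0, ("A", "C"): 2.0, ("B", "C"): 5.0}
stars = build_stars(self_score, pair_score)
```

In this toy input the pair A, C falls below both thresholds (2.0 is less than 3.0 and 6.0), so B becomes the heaviest star and its edge to C is tried before its edge to A.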
Then, we add the second edge and perform a multiple alignment between three nodes: the “star” center and the two nodes of its two highest scoring edges. If the score of the core of this multiple alignment is above 30% of the score of each node, this triplet is defined as a cluster. Otherwise, the node that was just added in this iteration is removed from the cluster; it will be considered later according to its other edges. For the current “star”, we proceed over the rest of its edges and add to the cluster only those nodes whose multiple alignment with the current cluster members receives a high enough score (more than 30% similarity in our example). Figure 8 (a) illustrates the process of creation of three clusters. The procedure terminates when all the nodes have been assigned to a cluster. Each cluster defines a consensus 3D pattern, which consists of the 3D points shared by all of its members. The coordinates that describe this pattern are taken from the node that served as the “star center” and originated the cluster.

Prediction of dinucleotide binding sites. We have developed a method, RnaPred, which searches a complete protein for regions similar to the created 3D consensus patterns. The input protein and the patterns are represented by the sets of their corresponding pseudocenter points (see Section 2.1). The RnaPred algorithm consists of the following 4 stages: (1) generation of a set of candidate transformations and superimposition; (2) definition of the 1:1 correspondence between the matched points; (3) scoring and selection of the top ranking solutions; (4) clustering of the solutions of each 3D pattern, as well as of different patterns mapped to the same protein region.

Superimposition: The generation of all candidate transformations is done with the Geometric Hashing method [58], which consists of two stages, preprocessing and recognition.
In the preprocessing stage, each triplet of pseudocenters from the complete molecule is considered as a local reference frame. Then, the coordinates of the other pseudocenters, calculated with respect to this coordinate system, are stored in a Geometric Hash Table. The key to the hash table consists of the point coordinates and physico-chemical property. In the recognition stage the same process is repeated for each 3D consensus pattern. For each pair of reference frames, one from a 3D pattern and one from the query protein, we count the number of matched points. For each pattern, we consider the reference frames with a significant number of matched points and construct a set of candidate transformations that can superimpose the given pattern upon the entire query protein. This procedure has several advantages for our goal. First, all the information of the complete molecule is stored only once for all the 3D patterns. Second, we avoid the processing of points that cannot be matched under any transformation. This is especially important for our application since the complete molecule is significantly larger than the 3D consensus patterns.

Correspondence and Scoring: The 1:1 correspondence and the scoring are implemented in the same manner as in the RnaBind method. We ignore solutions that are too small to be significant, i.e. those that align less than 30% of a 3D consensus pattern (the score of the alignment is less than 30% of the score of the pattern aligned to itself).

Clustering: First, for each 3D pattern we cluster similar solutions that superimpose it onto similar spatial locations. This is achieved by applying the efficient RMSD clustering described by Rarey et al. [59], with a default threshold of 3Å. When similar alignments are detected, we retain only the single solution with the highest score. Since our goal is to retain only a small set of top ranking solutions, we also need to cluster the solutions that align different 3D consensus patterns to similar protein regions.
We define two mappings of different 3D consensus patterns to be similar if their corresponding match lists (defined by the 1:1 correspondence) share at least 70% of their pseudocenters of the complete query molecule. Since the algorithm performs simultaneous alignment against all the constructed patterns, it is extremely fast. Its average running time for searching a complete protein surface for the presence of all the dinucleotide consensus binding patterns, measured during all the leave-one-out tests (see Results), is 3 minutes on a standard PC (AMD Opteron 242, 1593MHz).

Reconstruction of protein-RNA complexes. Here, we extend the RnaPred algorithm described in the previous section to the recognition of the longest consecutive set of alignments of 3D patterns to neighboring protein regions. Specifically, given all possible alignments of all the created 3D consensus patterns to the query protein, we construct a graph in which these solutions are the nodes. Let $(a_1, a_2)$ and $(b_1, b_2)$ represent the RNA nucleotide pairs bound to two aligned 3D patterns. An edge in this graph is created if the alignments of the 3D patterns and their corresponding dinucleotides satisfy the following requirements: (1) they have a pair of overlapping nucleotide rings: $\mathrm{dist}(a_2, b_1) \le \epsilon_1$; (2) they have a pair of distant rings: $\mathrm{dist}(a_1, b_2) > \epsilon_1$ (i.e. the 3D patterns are aligned to different protein regions); (3) the planes of the overlapping rings are parallel to each other: $\mathrm{angle}(a_2, b_1) \le \epsilon_2$. Using $\epsilon_1 = 2.0$ and $\epsilon_2 = 20^\circ$, we ensure that the dinucleotide pairs from two 3D patterns can be combined into a consecutive RNA fragment. Given the constructed graph, we implemented a variation of the Bellman-Ford algorithm [60] which recognizes the K longest paths with the maximal weight. Specifically, one path ($p_1$) is considered to be better than another ($p_2$) if it has at least the same length ($\mathrm{length}(p_1) \ge \mathrm{length}(p_2)$) and its cumulative score is at least as high ($Cs(p_1) \ge Cs(p_2)$).
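The three edge requirements and the search for the best chain can be illustrated by the sketch below. Several points are assumptions made for the example and are not taken from the paper: each ring is represented by its centroid and a unit normal of its plane, the distance threshold is taken to be in Å, paths are ranked lexicographically by (length, cumulative score), and a simple exhaustive depth-first search over simple paths stands in for the Bellman-Ford variation used by the authors.

```python
import math

EPS_DIST = 2.0    # epsilon_1 from the text (unit assumed to be Angstrom)
EPS_ANGLE = 20.0  # epsilon_2 from the text, in degrees

def plane_angle(n1, n2):
    """Angle in degrees between two ring planes given by unit normals (sign ignored)."""
    dot = abs(sum(x * y for x, y in zip(n1, n2)))
    return math.degrees(math.acos(min(1.0, dot)))

def can_chain(first, second):
    """Requirements (1)-(3) for placing `second` directly after `first`."""
    (a1, a2), (b1, b2) = first["centroids"], second["centroids"]
    overlapping = math.dist(a2, b1) <= EPS_DIST
    distinct = math.dist(a1, b2) > EPS_DIST
    parallel = plane_angle(first["normals"][1], second["normals"][0]) <= EPS_ANGLE
    return overlapping and distinct and parallel

def best_chain(placements):
    """Exhaustive search for the longest, then highest scoring, simple path."""
    n = len(placements)
    succ = [[j for j in range(n) if j != i and can_chain(placements[i], placements[j])]
            for i in range(n)]
    best_path, best_key = [], (0, 0.0)

    def extend(path, total):
        nonlocal best_path, best_key
        if (len(path), total) > best_key:
            best_path, best_key = list(path), (len(path), total)
        for j in succ[path[-1]]:
            if j not in path:
                extend(path + [j], total + placements[j]["score"])

    for i in range(n):
        extend([i], placements[i]["score"])
    return best_path, best_key[1]

# Hypothetical dinucleotide placements along the x axis; the last one is tilted by 45 degrees.
flat = ((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
tilt = ((0.0, math.sqrt(0.5), math.sqrt(0.5)),) * 2
placements = [
    {"centroids": ((0.0, 0.0, 0.0), (4.0, 0.0, 0.0)), "normals": flat, "score": 1.0},
    {"centroids": ((4.5, 0.0, 0.0), (8.0, 0.0, 0.0)), "normals": flat, "score": 2.0},
    {"centroids": ((8.2, 0.0, 0.0), (12.0, 0.0, 0.0)), "normals": flat, "score": 3.0},
    {"centroids": ((12.1, 0.0, 0.0), (16.0, 0.0, 0.0)), "normals": tilt, "score": 5.0},
]
path, total = best_chain(placements)
```

The exhaustive search is exponential in the worst case and is used here only because it is short and easy to check; it is not meant to reproduce the running times reported above.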
The cumulative score, which is the sum of the scores of all the nodes along the path, reflects the quality of the alignments between the 3D consensus patterns and the corresponding neighboring protein regions. Figure 8 (b) provides a schematic representation of the main idea of this method.

Database of small molecule binding sites. The dataset of PDB structures complexed with small molecules was retrieved from the PDBsum database [61] in May 2007. We retained only the binding sites of compounds with more than 7 non-hydrogen atoms that are not covalently linked to the protein. In addition, to remove the binding sites complexed with natural substrates, such as ATP or GTP, we counted the frequency of occurrence of each ligand in the PDB. Small molecules that appeared in more than 10 structures were assumed to be natural, frequently occurring substrates which cannot make a significant contribution to our searches for drugs and drug fragments. As a result, we constructed a dataset of 3999 binding sites, which were screened for their similarity to nucleotide and dinucleotide binding patterns.

Supplementary Material

Supplementary Figure S1 presents histograms of angles and distances between the ring planes of consecutive, protein interacting nucleotides. Supplementary Table S1 presents the general frequencies of consecutive, extruded, protein interacting nucleotides that are not involved in RNA base pairing. Table S2 presents the analysis of the sequence specificity of pairs of consecutive, extruded π-stacked nucleotides. Table S3 presents the full list of clusters of single, extruded, nucleotide binding sites. Table S4 presents the full list of clusters of dinucleotide binding sites. Table S5 presents the details of the similarity between the Phe-Tyr binding platforms of HuD RBD (1fxl) and Rho terminating factor (1pvo) revealed by cluster 18. Table S6 details the common physico-chemical binding pattern of the Phe-Tyr binding platform of RBDs revealed by cluster 100.
Table S7 details the physico-chemical properties shared by 8 dinucleotide binding sites of Pumilio proteins of cluster 83 (QYYQ motif). Tables S8-S9 provide lists of the 200 top ranking protein small molecule binding sites that were recognized to be similar to the created nucleotide and dinucleotide consensus binding patterns, respectively.

Acknowledgments

We thank O. Dror for contribution of software for this project. The research of AS-P was supported by the Clore PhD Fellowship. The research of HJW has been supported in part by the Israel Science Foundation (grant no. 281/05), by the NIAID, NIH (grant No. 1UC1AI067231), by the Binational US-Israel Science Foundation (BSF) and by the Hermann Minkowski-Minerva Center for Geometry at TAU. This publication has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under contract NO1-CO-12400. This research was supported [in part] by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.

References

[1] Jones, S., Daley, D. T. A., Luscombe, N. M., Berman, H. M. & Thornton, J. M. (2001). Protein-RNA interactions: a structural analysis. Nucleic Acids Res. 29, 943–954.
[2] Nadassy, K., Wodak, S. J. & Janin, J. (1999). Structural features of protein-nucleic acid recognition sites. Biochemistry, 38, 1999–2017.
[3] Chen, Y., Kortemme, T., Robertson, T., Baker, D. & Varani, G. (2004). A new hydrogen-bonding potential for the design of protein-RNA interactions predicts specific contacts and discriminates decoys. Nucleic Acids Res. 32, 5147–5162.
[4] Allers, J. & Shamoo, Y. (2001). Structure-based analysis of protein-RNA interactions using the program ENTANGLE. J. Mol. Biol. 311, 75–86.
[5] Morozova, N., Allers, J., Myers, J. & Shamoo, Y. (2006). Protein-RNA interactions: exploring binding patterns with a three-dimensional superposition analysis of high resolution structures. Bioinformatics, 22, 2746–2752.
[6] Hoffman, M. M., Khrapov, M. A., Cox, J. C., Yao, J., Tong, L. & Ellington, A. D. (2004). AANT: the Amino Acid-Nucleotide Interaction Database. Nucleic Acids Res. 32, D174–181.
[7] Treger, M. & Westhof, E. (2001). Statistical analysis of atomic contacts at RNA-protein interfaces. J Mol Recognit. 14, 199–214.
[8] Cheng, A. C., Chen, W. W., Fuhrmann, C. N. & Frankel, A. D. (2003). Recognition of nucleic acid bases and base-pairs by hydrogen bonding to amino acid side-chains. J. Mol. Biol. 327, 781–96.
[9] Jeong, E., Chung, I. F. & Miyano, S. (2004). A neural network method for identification of RNA-interacting residues in protein. Genome Inform. 15, 105–16.
[10] Terribilini, M., Lee, J.-H., Yan, C., Jernigan, R. L., Honavar, V. & Dobbs, D. (2006). Prediction of RNA binding sites in proteins from amino acid sequence. RNA, 12, 1450–1462.
[11] Terribilini, M., Sander, J. D., Lee, J. H., Zaback, P., Jernigan, R. L., Honavar, V. & Dobbs, D. (2007). RNABindR: a server for analyzing and predicting RNA-binding sites in proteins. Nucleic Acids Res. 35, W578–84.
[12] Kim, O. T. P., Yura, K. & Go, N. (2006). Amino acid residue doublet propensity in the protein-RNA interface and its application to RNA interface prediction. Nucleic Acids Res. 34, 6450–60.
[13] Wang, L. & Brown, S. J. (2006). BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 34, W243–W248.
[14] Draper, D. E. (1999). Themes in RNA-protein recognition. J. Mol. Biol. 293, 255–70.
[15] Auweter, S. D., Oberstrass, F. C. & Allain, F. H.-T. (2006). Sequence-specific binding of single-stranded RNA: is there a code for recognition? Nucleic Acids Res. 34, 4943–4959.
[16] Klosterman, P. S., Hendrix, D.
K., Tamura, M., Holbrook, S. R. & Brenner, S. E. (2004). Three-dimensional motifs from the SCOR, structural classification of RNA database: extruded strands, base triples, tetraloops and U-turns. Nucleic Acids Res. 32, 2342–52.
[17] Tamura, M., Hendrix, D. K., Klosterman, P. S., Schimmelman, N. R., Brenner, S. E. & Holbrook, S. R. (2004). SCOR: Structural Classification of RNA, version 2.0. Nucleic Acids Res. 32, D182–4.
[18] Chen, Y. & Varani, G. (2005). Protein families and RNA recognition. FEBS J. 272, 2088–97.
[19] Lunde, B. M., Moore, C. & Varani, G. (2007). RNA-binding proteins: modular design for efficient function. Nat Rev Mol Cell Biol. 8, 479–490.
[20] Murzin, A., Brenner, S., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
[21] Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S. R. & Bateman, A. (2005). Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, D121–4.
[22] Macke, T. J., Ecker, D. J., Gutell, R. R., Gautheret, D., Case, D. A. & Sampath, R. (2001). RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res. 29, 4724–35.
[23] Maris, C., Dominguez, C. & Allain, F. H.-T. (2005). The RNA recognition motif, a plastic RNA-binding platform to regulate post-transcriptional gene expression. FEBS Journal, 272 (9), 2118–2131.
[24] Handa, N., Nureki, O., Kurimoto, K., Kim, I., Sakamoto, H., Shimura, Y., Muto, Y. & Yokoyama, S. (1999). Structural basis for recognition of the tra mRNA precursor by the Sex-lethal protein. Nature, 398, 579–585.
[25] Antson, A. A. (2000). Single-stranded-RNA binding proteins. Curr Opin Struct Biol. 10, 87–94.
[26] Turner, B., Melcher, S. E., Wilson, T. J., Norman, D. G. & Lilley, D. M. (2005). Induced fit of RNA on binding the L7Ae protein to the kink-turn motif. RNA, 11, 1192–200.
[27] Shajani, Z., Drobny, G. & Varani, G. (2007).
Binding of U1A protein changes RNA dynamics as observed by 13C NMR relaxation studies. Biochemistry, 46, 5875–83.
[28] Ellis, J. J. & Jones, S. (2007). Evaluating conformational changes in protein structures binding RNA. Proteins, in press.
[29] Moodie, S. L., Mitchell, J. B. O. & Thornton, J. M. (1996). Protein recognition of adenylate: an example of a fuzzy recognition template. J. Mol. Biol. 263, 486–500.
[30] Kuttner, Y. Y., Sobolev, V., Raskind, A. & Edelman, M. (2003). A consensus-binding structure for adenine at the atomic level permits searching for the ligand site in a wide spectrum of adenine-containing complexes. Proteins, 52, 400–411.
[31] Connolly, M. (1983). Analytical molecular surface calculation. J. Appl. Cryst. 16, 548–558.
[32] Schmitt, S., Kuhn, D. & Klebe, G. (2002). A new method to detect related function among proteins independent of sequence or fold homology. J. Mol. Biol. 323, 387–406.
[33] Bunka, D. H. J. & Stockley, G. (2006). Aptamers come of age - at last. Nat Rev Microbiol, 4, 588–596.
[34] Guallar, V. & Borrelli, K. W. (2005). A binding mechanism in protein-nucleotide interactions: Implication for U1A RNA binding. Proc. Natl. Acad. Sci. USA, 102, 3954–3959.
[35] Burley, S. K. & Petsko, G. A. (1985). Aromatic-aromatic interaction: a mechanism of protein structure stabilization. Science, 229, 23–28.
[36] Shulman-Peleg, A., Nussinov, R. & Wolfson, H. J. (2004). Recognition of functional sites in protein structures. J. Mol. Biol. 339(3), 607–633.
[37] Deo, R. C., Bonanno, J. B., Sonenberg, N. & Burley, S. K. (1999). Recognition of polyadenylate RNA by the poly(A)-binding protein. Cell, 98, 835–845.
[38] Sickmier, E. A., Frato, K. E., Shen, H., Paranawithana, S. R., Green, M. R. & Kielkopf, C. L. (2006). Structural Basis for Polypyrimidine Tract Recognition by the Essential Pre-mRNA Splicing Factor U2AF65. Molecular Cell, 23, 49–59.
[39] Shiels, J. C., Tuite, J. B., Nolan, S. J. & Baranger, A. M. (2002).
Investigation of a conserved stacking interaction in target site recognition by the U1A protein. Nucleic Acids Res. 30, 550–558.
[40] Grahn, E., Moss, T., Helgstrand, C., Fridborg, K., Sundaram, M., Tars, K., Lago, H., Stonehouse, N. J., Davis, D. R., Stockley, P. G. & Liljas, L. (2001). Structural basis of pyrimidine specificity in the MS2 RNA hairpin-coat-protein complex. RNA, 7, 1616–1627.
[41] Horn, W. T., Tars, K., Grahn, E., Helgstrand, C., Baron, A. J., Lago, H., Adams, C. J., Peabody, D. S., Phillips, S. E., Stonehouse, N. J., Liljas, L. & Stockley, P. G. (2006). Structural basis of RNA binding discrimination between bacteriophages Qbeta and MS2. Structure, 14, 487–495.
[42] Helgstrand, C., Grahn, E., Moss, T., Stonehouse, N. J., Tars, K., Stockley, P. G. & Liljas, L. (2002). Investigating the structural basis of purine specificity in the structures of MS2 coat protein RNA translational operator hairpins. Nucl. Acids Res. 30 (12), 2678–2685.
[43] Wang, X., McLachlan, J., Zamore, P. D. & Hall, T. M. (2002). Modular recognition of RNA by a human pumilio-homology domain. Cell, 110, 501–512.
[44] Shatsky, M., Nussinov, R. & Wolfson, H. J. (2004). A method for simultaneous alignment of multiple protein structures. Proteins, 56, 143–156.
[45] Lewis, H. A., Musunuru, K., Jensen, K. B., Edo, C., Chen, H., Darnell, R. B. & Burley, S. K. (2000). Sequence-Specific RNA Binding by a Nova KH Domain: Implications for Paraneoplastic Disease and the Fragile X Syndrome. Cell, 100, 323–332.
[46] Grishin, N. V. (2001). KH domain: one motif, two folds. Nucleic Acids Res. 29, 638–643.
[47] Beuth, B., Pennell, S., Arnvig, K. B., Martin, S. R. & Taylor, I. A. (2005). Structure of a Mycobacterium tuberculosis NusA-RNA complex. EMBO J. 24, 3576–87.
[48] Bori-Sanz, T., Guitart, T. & Ribas de Pouplana, L. (2006). Aminoacyl-tRNA synthetases: a complex system beyond protein synthesis. Cont. to Science, 3, 149–165.
[49] Veber, D. F., Johnson, S. R., Cheng, H. Y., Smith, B.
R., Ward, K. W. & Kopple, K. D. (2002). Molecular properties that influence the oral bioavailability of drug candidates. J. Med. Chem. 45, 2615–2623.
[50] Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S. H., Srinivasan, A. R. & Schneider, B. (2003). The nucleic acid database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J. 63 (3), 751–759.
[51] Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
[52] Lu, X.-J. & Olson, W. K. (2003). 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucl. Acids Res. 31 (17), 5108–5121.
[53] Denessiouk, K. A., Rantanen, V. & Johnson, M. (2001). Adenine Recognition: A motif present in ATP-, CoA-, NAD-, NADP-, and FAD-dependent proteins. Proteins, 44, 282–291.
[54] Shatsky, M., Shulman-Peleg, A., Nussinov, R. & Wolfson, H. (2006). The multiple common point set problem and its application to molecule binding pattern detection. J Comput Biol. 13, 407–42.
[55] Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 34, 827–828.
[56] Mehlhorn, K. (1999). The LEDA platform of combinatorial and geometric computing. Cambridge University Press.
[57] Jain, A. K., Murty, M. N. & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 3, 264–323.
[58] Wolfson, H. J. (1990). Model-Based Object Recognition by Geometric Hashing. In Proc. of the 1st European Conf. on Comp. Vision (ECCV) LNCS pp. 526–536, Springer-Verlag.
[59] Rarey, M., Wefing, S. & Lengauer, T. (1996). Placement of medium-sized molecular fragments into active sites of proteins. J. Computer-Aided Molecular Design, 10, 41–54.
[60] Cormen, T., Leiserson, C. & Rivest, R. (1990). Introduction to Algorithms. MIT Press.
[61] Laskowski, R. A., Chistyakov, V. V.
& Thornton, J. M. (2005). PDBsum more: new summaries and analyses of the known 3D structures of proteins and nucleic acids. Nucleic Acids Res., 33, D266–268.
[62] Andreeva, A., Howorth, D. & Murzin, A. G. (June 4, 2007). The pre-SCOP database. http://www.mrc-lmb.cam.ac.uk/agm/pre-scop/.
[63] Eiler, S., Dock-Bregeon, A., Moulinier, L., Thierry, J. C. & Moras, D. (1999). Synthesis of aspartyl-tRNA(Asp) in Escherichia coli–a snapshot of the second step. EMBO J., 18, 6532–6541.
[64] Yang, X., Gérczei, T., Glover, L. T. & Correll, C. C. (2001). Crystal structures of restrictocin-inhibitor complexes with implications for RNA recognition and base flipping. Nat Struct Biol., 8, 968–973.
[65] Tishchenko, A., Nikulin, A., Fomenkova, N., Nevskaya, N., Nikonov, O., Dumas, P., Moine, H., Ehresmann, B., Ehresmann, C., Piendl, W., Lamzin, V., Garber, M. & Nikonov, S. (2001). Detailed analysis of RNA-protein interactions within the ribosomal protein S8-rRNA complex from the archaeon Methanococcus jannaschii. J. Mol. Biol., 311, 311–324.
[66] Lee, T. T., Agarwalla, S. & Stroud, R. M. (2005). A unique RNA Fold in the RumA-RNA-cofactor ternary complex contributes to substrate selectivity and enzymatic function. Cell, 120, 599–611.

Tables

Dataset | Num. of PDB structures | Num. of consecutive extruded pairs | π-stacked parallel | π-stacked perpendicular
Protein-RNA complexes | 278 | 307 | 87 (28%) | 0 (0%)
All RNA structures | 1139 | 4114 | 1291 (31%) | 27 (0.7%)

Table 1: Frequency of π-stacking interactions between consecutive nucleotides. The table presents the statistics of π-stacking interactions observed in the datasets of protein-RNA complexes and all RNA structures. Columns 1-2 present the dataset details. Column 3 details the number of pairs of consecutive extruded nucleotides, where only protein interacting nucleotides were considered in the dataset of protein-RNA complexes. Columns 4-5 present the total numbers and the frequencies of the considered types of π-stacking interactions.
To prevent the influence of missing strands which are part of a double helix, sequences of more than 5 consecutive π-stacked nucleotides were not considered. Only the standard, unmodified nucleotides A, G, C, U, I were considered.

Num. | Scop dist. | π type | Cluster size | Nucleotide pairs | Cluster center (pdb:chains:nts) | Location | Cluster members annotation
96 | 1 | 1 | 3 | AU | 1sj3:PR:A49U50 | A1-U2 of U1A [39] | U1A (1urn, 1m5o, 1sj3)
28 | 1 | 2 | 3 | UU | 1urn:AP:U7U8 | U2-U3 of U1A [39] | U1A (1urn, 1m5o, 1sj3)
122 | 3 | 2 | 2 | UA, UG | 1g2e:AB:U1A2 | 5′, U3-G4 of Sxl [24] | HuD (1g2e), Sxl (1b7f)
19 | 3 | 2 | 3 | UU | 1fxl:AB:U2U3 | U5-U6 of Sxl [24] | HuD (1fxl, 1g2e), Sxl (1b7f)
194 | 3 | 2 | 3 | UU | 1b7f:AP:U6U7 | U6-U7 of Sxl [24] | HuD (1fxl, 1g2e), Sxl (1b7f)
127 | 3 | 2 | 3 | AU, UU | 1g2e:AB:A6U7 | U6-U7 of Sxl [24] | HuD (1fxl, 1g2e), Sxl (1b7f)
128 | 3 | 2 | 3 | UU | 1g2e:AB:U7U8 | U7-U8 of Sxl [24] | HuD (1fxl, 1g2e), Sxl (1b7f)
129 | 3 | 2 | 3 | UU | 1g2e:AB:U8U9 | U8-U9 of Sxl [24] | HuD (1fxl, 1g2e), Sxl (1b7f)
79 | 3 | 1 | 4 | UG | 1m5o:CB:U38G39 | U3-G4 of U1A [39] | U1A (1sj3, 1m5o, 1urn), U2B (1a9n)
99 | 3 | 2 | 4 | GC | 1sj3:PR:G52C53 | G4-C5 of U1A [39] | U1A (1sj3, 1m5o, 1urn), U2B (1a9n)
100 | 3 | 2 | 6 | CA, AU, GU | 1sj3:PR:C53A54 | G4-U5 of Sxl [24], C5-A6 of U1A [39] | U1A (1sj3, 1m5o, 1urn), U2B (1a9n), HuD (1g2e), Sxl (1b7f)
32 | 7 | 1 | 5 | AC, AG, GC | 1urn:AP:A11C12 | A6-C7 of U1A [39], G33-C34 of anticodon loop | U1A (1sj3, 1m5o, 1urn), U1B' (1a9n), tRNA synthetases (1u0b)
168 | 7 | 1 | 2 | AA, GA | 1cvj:AM:A4A5 | A4-A5 of poly(A) and G2-A3 of NSP3 | RNA-binding domain, NSP3 homodimer (1knz)
280 | 7 | 2 | 2 | CU, UU | 1q2r:AE:C8U9 | RNA turn at C32C33 of tRNA and U4-U5 of U2AF | tRNA-guanine transglycosylase (1q2r), U2AF (2g4b)
18 | 7 | 2 | 2 | UU, UC | 1fxl:AB:U1U2 | U1-U2 of HuD, U1-C2 of Rho | RNA-binding domain, Rho termination factor (1pvo)

Table 2: Dinucleotide clusters that involve RNA-binding domain proteins. The first column is the cluster number (according to the full list provided in the Supplementary Material). The second column specifies the maximal SCOP [20, 62] distance between the cluster members.
Proteins at distance 3 belong to the same SCOP super family, while proteins at distance 7 have totally different overall folds and no common structural annotation. Column 4 indicates whether all the cluster dinucleotide pairs are π-stacked (1) or non π-stacked (2). Column 5 indicates the number of binding sites in the cluster and column 6 provides the details of the binding site that initiated the cluster (PDB code, protein and RNA chains, nucleotides' (nts) numbers according to the appearance on the RNA chain). Column 7 specifies the location of the considered dinucleotides and Column 8 details the cluster members.

Num. | Scop dist. | π type | Cluster size | Nucleotide pairs | Cluster center (pdb:chains:nts) | Location | Protein annotation
215 | 1 | 2 | 20 | UA, GA, AA, CA | 2izn:AR:U9A10 | RNA 4nt-loop: 5′, -4 [42] | RNA bacteriophage capsid protein
206 | 1 | 2 | 9 | AU, AG, GU, UA | 1m8y:AC:A8U9 | Pumilio repeats 3, 4, 6 [43] | Pumilio-Homology Domain
83 | 1 | 2 | 8 | UG, UA, UC | 1m8w:AC:U2G3 | Pumilio repeats 6, 8 [43] | Pumilio-Homology Domain
92 | 1 | 2 | 4 | CC, AU | 1m8w:AE:C5C6 | Pumilio repeat 4 [43] | Pumilio-Homology Domains
254 | 1 | 1 | 3 | UA, UU | 1n1h:AB:U1A2 | RNA 5′ | Minor core protein λ3
148 | 1 | 1 | 2 | GC | 1c0a:AB:G65C66 | RNA 3′ GCCA [63] | Aspartyl-tRNA Synthetases
149 | 1 | 1 | 2 | CC | 1c0a:AB:C66C67 | RNA 3′ GCCA [63] | Aspartyl-tRNA Synthetases
126 | 1 | 1 | 2 | UA | 1g2e:AB:U5A6 | RNA U-6, A-7 | HuD protein (RNA-binding domain)
14 | 1 | 2 | 2 | GA | 1h2c:AR:G2A3 | RNA 3′ G2-A3 | Matrix Protein Vp40
225 | 1 | 2 | 2 | AG, AU | 1jbt:AC:A15G16 | Unfolded SRL tetraloop [64] | Ribotoxin restrictocin
200 | 1 | 1 | 2 | CU | 1si2:AB:C6U7 | RNA C6-U7 | Translation initiation factor
146 | 2 | 3 | 4 | UC | 1c0a:AB:U31C32 | RNA anticodon loop U-635, C636 [63] | Aspartyl-tRNA Synthetases
208 | 2 | 2 | 2 | AU, AA | 1i6u:AC:U25A26 | RNA A-640, U641 [65] | Ribosomal proteins S8
113 | 2 | 2 | 2 | CC | 1qtq:AB:C72C73 | RNA 3′ CCA | Glutamyl-tRNA synthetases
49 | 2 | 1 | 2 | CA | 1n78:AC:C73A74 | RNA 3′ CCA | Glutamyl-tRNA synthetases

Table 3: Similar structures, similar dinucleotide binding sites.
Clusters comprised of dinucleotide binding sites of structurally similar proteins (same SCOP super family, according to the pre-SCOP database [62]). The columns are the same as in Table 2, except the last column, which details the common annotation of proteins in the cluster.

Num. | π type | Cluster size | Nucleotide pairs | Cluster center (pdb:chains:nts) | Location | Cluster members annotation
63 | 1 | 4 | AG, UC | 1wrq:AC:A4G5 | central UAG motif of hut mRNA, CUC anticodon loop in synthetase | Hut operon regulatory protein, Aminoacyl-tRNA synthetases (1n78)
87 | 1 | 4 | UA, CA, UU | 1m8w:AC:U6A7 | Pumilio repeat 3-4 [43], RNA U10-U11 of Ro | Pumilio-Homology Domain, Ro ribonucleoprotein (2i91)
34 | 2 | 3 | CA, GA | 2anr:AB:C13A14 | protein KH motif | splicing factors Nova-1, transcription elongation proteins NusA
106 | 1 | 3 | AA, AC | 2atw:AB:A3A4 | protein KH motif | transcription elongation proteins NusA, splicing factors Nova-1
74 | 2 | 2 | IC, CA | 1f7u:AB:I27C28 | ICG anticodon loop, 3′ CCA | Aminoacyl-tRNA synthetases, tRNA-binding arm (1gax)
187 | 2 | 2 | CA, AC | 1il2:AC:C66A67 | 3′ CCA and A3-C4 of NSP3 | Aspartyl-tRNA synthetase, NSP3 homodimer (1knz)
45 | 3 | 2 | CU, GU | 1n78:AC:C33U34 | CUC and GUC anticodon loops | Glutamyl-tRNA synthetase, Aspartyl-tRNA synthetase (1il2)
47 | 3 | 2 | CA | 1n78:AC:C35A36 | anticodon loop (34nts) and anticodon loop | Glutamyl-tRNA synthetases, methionyl-tRNA synthetase (2csx:AC:C34A35)
261 | 2 | 2 | UC, CA | 1qf6:AB:U30C31 | anticodon loop start (with the wobble base) and ACA motif | Threonyl-tRNA synthetase, tRNA pseudouridine synthase (2hvy)
285 | 2 | 2 | GU | 1asy:AR:G34U35 | anticodon loop, hairpin recognition | Aspartyl-tRNA synthetase, rRNA methyltransferase (2bh2)

Table 4: Different structures, similar dinucleotide binding sites. Clusters comprised of similar dinucleotide binding sites of proteins with different overall folds. The columns are the same as in Table 2.

Figures

[Figure 1 panels: (a) Parallel stacking; (b) Perpendicular stacking; (c) Non π-stacking]
Figure 1: Geometries of RNA di-nucleotide pairs.
(a) Parallel π-stacking interactions, as observed in the U1A RNA-binding domain (PDB: 1m5o, A6-C7 of U1A [39]). The protein pseudocenters are represented as balls. Hydrogen bond donors are blue, acceptors red, donors/acceptors green, and aromatic pseudocenters grey. (b) Perpendicular stacking, observed in an unbound structure of 23S ribosomal RNA (1vor, chain B). (c) Non π-stacking interactions of methyltransferase RumA (PDB: 2bh2, G1954-U1955). Although the rings participate in aromatic interactions, they are π-stacked with the protein and not with each other [66].

Figure 2: The algorithmic framework. The framework consists of two main stages. First, we classify known dinucleotide binding patterns and recognize the common types of the physico-chemical binding patterns (left flowchart). Then, we use these patterns for the prediction of di-nucleotide binding sites and for the reconstruction of RNA fragments and protein-RNA complexes (right flowchart).

Figure 3: Di-nucleotide binding sites of RNA-binding domains. (a) Alignment between the six RBD proteins of cluster 100 (see Table 2) according to the transformation calculated for the di-nucleotide binding sites. Proteins are colored ranging from pink (1sj3) to blue (1b7f) according to cluster order in the table. The bottom figure provides a focused view of the common Tyr-Phe aromatic binding platform. The conserved amino acids Tyr (Y13, Y128, Y214 in U1A/U2B, HuD and Sxl respectively) and Phe (F56, F170, F256) are represented as sticks and colored purple. The RNA nucleotides are colored magenta and represented as sticks. The protein pseudocenters are represented as balls and are colored as in Figure 1. (b) Alignment between HuD RBD (1fxl, pink) and Rho termination factor (1pvo, monochrome). The bottom figure presents the Tyr-Phe binding platform shared by the proteins.
As can be seen, in spite of the overall structural differences between the proteins, they use similar Tyr-Phe binding platforms for nucleotide recognition. (c) Top: alignment of the first four members of cluster 32 (see Table 2), colored as in (a). Middle: alignment of all five members of cluster 32, colored as in (a), with 1u0b in green. Bottom: Cluster 180, guanine transglycosylase (1q2r) is green and U2AF (2g4b) is blue and pink.

Figure 4: Di-nucleotide binding sites of proteins with similar structures or sequence patterns. (a) Alignment between 20 di-nucleotide binding sites of MS2 RNA hairpin coat-proteins (cluster 215, Table 3). The left binding site can tolerate any nucleotide sequence, due to the conservation of its Tyr residue, represented by two pseudocenters: donor/acceptor (green) and aromatic (black). (b) Alignment between 8 binding sites of Pumilio proteins (see cluster 83, Table 3). The surfaces of the binding sites are represented as small dots and are colored by the physico-chemical property as in Figure 1. Proteins in which the binding site is located on repeat 6 [43] are colored cyan and those of repeat 8 are magenta. The RNA nucleotides are colored as follows: C - orange, A - red, G - yellow, C - green.

Figure 5: Similar di-nucleotide binding sites of unrelated proteins. (a) Alignment of binding sites of transcription elongation proteins NusA (PDB: 2atw) and neuronal splicing factors Nova-1 (PDB: 2anr) that accommodate nucleotide triplets GAA (G1-A3, purple) and CAC (C214-C216, cyan) respectively. Surface patches with exactly the same properties and shapes (up to 1Å distance deviation) are represented as dots colored as in Figure 1. (b) Alignment between rRNA (Uracil-5-)-methyltransferase RumA (PDB: 2bh2, blue) and Aspartyl-tRNA synthetase (PDB: 1asy, green). Both binding sites bind the GU nucleotide pair.
In 1asy this pair of nucleotides (634-635) is located on the anticodon loop, while in 2bh2 the GU pair (1954-1955) is responsible for the recognition of the hairpin by the N-terminal OB fold. The surface patches are as in (a). (c) Glutamyl-tRNA synthetase (PDB: 1n78, blue) and Aspartyl-tRNA synthetase (PDB: 1il2, green), which bind the RNA fragments CU and GU (colored CPK) respectively. The binding sites were recognized to share 7 pseudocenters, which are represented as balls and are colored as in Figure 1 (the pseudocenters of 1n78 are represented as smaller balls).

Figure 6: Structure prediction of RNA fragments. (a) Reconstruction of the RNA fragment of HutP antitermination complex (1wpu). The predicted and the native nucleotide fragments are red and orange sticks respectively. The opposite view in the upper figure shows the non-interacting U nucleotide, not predicted by our method. (b) Reconstruction of the RNA loop of hepatitis delta precursor, which interacts with U1A protein (1sj3). The predicted nucleotides are red and the native are orange.

Figure 7: (a) Dinucleotide binding site of HutP (1wrq, light blue) aligned to the binding site of Cyclophilin A (pink) bound to sanglifehrin macrolide (PDB: 1nmk). The bottom figure shows the alignment of the RNA (blue) and sanglifehrin (red) from a different angle. (b) Nucleotide binding site (A37, yellow) of Cysteinyl-tRNA synthetase (1u0b, light blue) aligned to the Human macrophage elastase (1ros, pink) bound to its inhibitor (green). The aligned proteins have similar binding sites and similar secondary structure elements (helix and 2 strands), which suggest the potential similarity in ligand binding.

Figure 8: (a) Center star classification algorithm. The star centers s1–s4 are labeled according to the weight and the order of traversal. The edges from s1 to the nodes s2–s4 failed to fulfil the multiple alignment requirements, so these nodes were not added to the cluster of s1.
Consequently, the algorithm proceeded to s2, where s4 fulfilled the multiple alignment requirement and was added to the cluster. (b) The RnaPred algorithm aligns the 3D consensus patterns to some regions (represented by curves) of the complete query protein. We consider the RNA dinucleotides bound to the aligned 3D patterns (pairs of balls, colored according to the pattern) and select the longest and highest scoring fragment of consecutive mappings of such pairs. In this example, the best RNA fragment prediction is of length 4 nucleotides (nts).
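
The fragment selection described in Figure 8(b) can be illustrated with a small sketch. It is not the RnaPred implementation. It assumes that each aligned pattern places one dinucleotide on the query protein, that each base is summarized by a single centroid, that two placed pairs are "consecutive" when the second base of one lies within a tolerance of the first base of the next, and that a fragment's score is the sum of its pairs' scores. The names `PlacedPair` and `best_fragment` and the 1.5 Å tolerance are hypothetical choices made for this illustration only.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class PlacedPair:
    """A dinucleotide carried onto the query protein by one aligned pattern.

    Hypothetical representation: one centroid per base, in Angstroms.
    """
    first: tuple   # (x, y, z) centroid of the first base of the pair
    second: tuple  # (x, y, z) centroid of the second base of the pair
    score: float   # score of the pattern alignment that placed the pair


def best_fragment(pairs, tol=1.5):
    """Return the longest chain of consecutive pairs; ties go to the higher total score."""

    def follows(a, b):
        # b continues a when b's first base sits where a's second base is.
        return dist(a.second, b.first) <= tol

    def key(chain):
        return (len(chain), sum(p.score for p in chain))

    best = []

    def extend(chain, used):
        nonlocal best
        if key(chain) > key(best):
            best = list(chain)
        for j, nxt in enumerate(pairs):
            if j not in used and follows(chain[-1], nxt):
                chain.append(nxt)
                used.add(j)
                extend(chain, used)  # exhaustive search; fine for a handful of pairs
                chain.pop()
                used.remove(j)

    for i, p in enumerate(pairs):
        extend([p], {i})
    return best


# Toy input: three pairs that chain head to tail, plus one isolated high-scoring pair.
pairs = [
    PlacedPair((0, 0, 0), (4, 0, 0), 1.0),
    PlacedPair((4, 0.5, 0), (8, 0, 0), 2.0),
    PlacedPair((8, 0, 0.5), (12, 0, 0), 1.5),
    PlacedPair((30, 0, 0), (34, 0, 0), 9.0),
]
chain = best_fragment(pairs)
fragment_length_nts = len(chain) + 1  # k chained dinucleotides span k + 1 nucleotides
```

In this toy input the three chained pairs form a fragment of 4 nucleotides, the same length as the example in the caption, and the isolated pair is not selected despite its higher individual score because length is compared first.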