N in a protein sequence meaning

11/19/2023

It is well known that, in order to preserve its structure and function, a protein cannot change its sequence at random, but only by mutations occurring preferentially at specific locations. We here investigate quantitatively the amount of variability that is allowed in protein sequence evolution, by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID is a measure of the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family, and moreover it is very similar in different families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations in different sites. However, we demonstrate that correlations are not sufficient to explain the small value of the ID we observe in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach in which correlations are accounted for, is typically significantly larger than the value observed in natural protein families. We further prove that a critical factor to reproduce the natural ID is to take into consideration the phylogeny of sequences.

Citation: Facco E, Pagnani A, Russo ET, Laio A (2019) The intrinsic dimension of protein sequence evolution.

Received: Ap; Accepted: Decem; Published: April 8, 2019

Copyright: © 2019 Facco et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: AP acknowledges financial support from Marie Skłodowska-Curie, grant agreement No. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Protein sequence evolution is an extremely important process in living organisms. During evolution, due to insertions, deletions, and substitutions, a sequence can significantly change. Still, and most importantly, three-dimensional structures and functions are conserved, so that protein domains descending from a common ancestor share fundamental common traits; such protein domains form the so-called families. The birth of a new family is a rare event, while existing families are conserved by evolution.

Despite the fact that the sequence similarity between members of the same family can be extremely low, by looking at the multiple sequence alignment (MSA) of a protein family one immediately notices patterns. Amino acids in specific columns of the MSA are often conserved, and mutations in different columns are in many cases correlated. This observation is at the very basis of statistical models for assessing the probability that a protein sequence belongs to a family or for predicting the three-dimensional structure of the protein from the MSA. Frequent occurrences of the same amino acid in a column of the MSA together with covariation between different columns suggest that evolution modifies the sequences along a number of directions that is much lower than the bare dimension of the space sampled by randomly substituting amino acids.

We here address a specific question: how many independent directions are explored during evolution in a protein family? This issue can be rephrased in the conceptual framework of intrinsic dimension. The intrinsic dimension (ID) of a data set is defined as the minimum number of parameters needed to describe the data without information loss. Several methods are available to estimate the ID: projective methods aim at representing points onto a lower dimensional space by minimizing an error function, while fractal methods measure the scaling of the number of points within a certain radius as such radius grows larger. Nearest neighbors-based methods assume that close points are drawn from a uniform distribution and extract models for the statistical distribution of the first distances.
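To make the nearest-neighbor idea concrete, here is a minimal sketch of one such estimator. It relies on a standard result: if points are locally uniform in d dimensions, the ratio of each point's second to first nearest-neighbor distance follows a Pareto law with exponent d, so a maximum-likelihood estimate is d = N / Σ ln(r2/r1). The function names `hamming_distance_matrix` and `two_nn_id`, and the choice of a normalized Hamming distance between aligned sequences, are assumptions made for illustration and are not claimed to be the authors' exact procedure.

```python
import numpy as np


def hamming_distance_matrix(seqs):
    """Pairwise fraction of mismatching columns between aligned sequences.

    Hypothetical helper: assumes all sequences have the same length
    (rows of an MSA) and treats the gap symbol as an ordinary character.
    """
    arr = np.array([list(s) for s in seqs])
    if arr.ndim != 2:
        raise ValueError("sequences must be aligned (equal length)")
    return (arr[:, None, :] != arr[None, :, :]).mean(axis=2)


def two_nn_id(dist):
    """Maximum-likelihood ID estimate from a square distance matrix.

    Uses only the first and second nearest-neighbor distances of each
    point. Points whose nearest neighbor is at distance zero (exact
    duplicates) are skipped, since their ratio is undefined.
    """
    d = np.array(dist, dtype=float)      # work on a copy
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    d.sort(axis=1)                       # sort each row in ascending order
    r1, r2 = d[:, 0], d[:, 1]
    keep = (r1 > 0) & np.isfinite(r2)
    log_mu = np.log(r2[keep] / r1[keep])
    total = log_mu.sum()
    if total <= 0:
        raise ValueError("all neighbor ratios equal 1; estimate undefined")
    return keep.sum() / total
```

A typical call would be `two_nn_id(hamming_distance_matrix(aligned_sequences))`. Because Hamming distances are discrete, ties between the first and second neighbor are common in small alignments and bias the estimate, so a sketch like this is only meaningful on reasonably large sets of sequences.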
