An important missing piece of the annotation puzzle is an in-depth understanding of the relationship between sequence similarity and function similarity over a continuous range, together with the amount of variability inherent in that relationship at every level of sequence similarity. Solving this puzzle requires generating a sufficiently large and diverse data set of proteins with experimentally characterized function, determining the best way to represent function for modeling purposes, and building and applying an appropriate statistical model. To address this challenge, we present a novel annotation model that predicts the function of a protein of unknown function from its sequence similarity to a protein of known function. Our annotation model is trained on proteins whose functions have been experimentally characterized and is therefore based on primary biological evidence. A major concern with most existing protein annotations is that they are predicted computationally and not derived experimentally. Previous approaches to function prediction that use these data can lead to "circular logic", i.e. using predictions for prediction. The consequences can be over-prediction or outright erroneous predictions. It is therefore imperative that any statistical model be based on primary biological evidence. In our annotation model, BLAST sequence similarity statistics serve as the predictor variables. The output of the model, or response variable, is a measure of function similarity and represents a novel aspect of our approach. The output is a real-valued measure of the similarity of the functional match between two proteins, as opposed to just the textual protein function description provided in a typical BLAST-based annotation.
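The predictor/response framing can be illustrated with a deliberately small sketch. The text does not specify the form of the statistical model or which BLAST statistics are used, so everything here is an assumption: a one-variable least-squares line stands in for the model, percent identity stands in for the BLAST statistic, and the data pairs are made-up toy values, not measurements or results. The names `fit_line` and `predict` are hypothetical.

```python
# Illustrative stand-in only: the actual model form is not given in the text.
# Predictor: a BLAST similarity statistic (percent identity is assumed here).
# Response: a real-valued function similarity score.

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict function similarity for a given predictor value."""
    slope, intercept = model
    return slope * x + intercept

# Made-up toy training pairs (NOT real data): proteins with experimentally
# characterized function would supply these values in practice.
PERCENT_IDENTITY = [20.0, 40.0, 60.0, 80.0]
FUNCTION_SIMILARITY = [0.2, 0.4, 0.6, 0.8]

MODEL = fit_line(PERCENT_IDENTITY, FUNCTION_SIMILARITY)
```

Calling `predict(MODEL, 50.0)` returns a number rather than a text description, which is the contrast with a typical BLAST-based annotation that the paragraph draws.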
Information content (IC) is related to the probability of occurrence of a particular Gene Ontology (GO) term in a data set: less common terms have higher IC, which is interpreted as being more specific. In general, the IC of GO terms increases monotonically as the GO hierarchy is traversed away from the root, toward more specific terms, and the root term always carries an IC of 0.0. Based on IC, metrics can be developed to measure the level of function similarity between two proteins.
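A minimal sketch of how IC might be computed, assuming the common definition IC(t) = -log2 p(t), where p(t) is the fraction of proteins annotated with term t or any of its descendants. The toy ontology, the annotations, and the function names are hypothetical. The `resnik` function shows one well-known IC-based metric (the IC of the most informative common ancestor) purely as an example; the text does not say which metric is used here.

```python
import math

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic": ["root"],
    "dna_binding": ["binding"],
    "kinase": ["catalytic"],
}

# Hypothetical direct annotations: protein -> set of terms.
ANNOTATIONS = {
    "P1": {"dna_binding"},
    "P2": {"binding"},
    "P3": {"kinase"},
    "P4": {"kinase"},
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

def information_content(annotations):
    """Compute IC(t) = log2(n / count(t)), i.e. -log2 p(t), per term."""
    n = len(annotations)
    counts = {}
    for terms in annotations.values():
        # An annotation to a term also counts for every ancestor of it.
        propagated = set()
        for term in terms:
            propagated |= ancestors(term)
        for term in propagated:
            counts[term] = counts.get(term, 0) + 1
    return {term: math.log2(n / count) for term, count in counts.items()}

def resnik(term_a, term_b, ic):
    """IC of the most informative common ancestor of two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max((ic[t] for t in common if t in ic), default=0.0)

IC = information_content(ANNOTATIONS)
```

In this toy data set every protein is counted at the root, so the root has probability 1 and an IC of 0.0, while the rarer term `dna_binding` has the highest IC. Two terms that share only the root, such as `dna_binding` and `kinase`, get a similarity of 0.0 under this metric.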