
Calculation of cleavage entropies allows protease substrate specificity to be quantified, mapped, and compared through an information-entropy-based approach. The metric intrinsically depends on the number of experimentally determined substrate data points.

A statistical analysis of its numerical stability is therefore crucial to estimate the systematic error introduced when specificity is estimated from a limited number of substrates. In this contribution, we present the mathematical basis for estimating the uncertainty in cleavage entropies.

Sets of cleavage entropies are calculated using experimental cleavage data and modeled extreme cases. This allows us to extrapolate the values to an infinite number of samples and to estimate the resulting errors. We therefore encourage experimental researchers in the protease field to record specificity profiles of novel proteases, aiming to identify at least 30 peptide substrates of maximum sequence diversity. We expect a full characterization of protease specificity to be helpful in rationalizing the biological functions of proteases and in assisting rational drug design.

## Characterizing Protease Specificity: How Many Substrates Do We Need?

Proteases are enzymes that proteolytically cleave peptide bonds and account for around two percent of all human gene products [1]. Additionally, they account for one to five percent of the genome of infectious organisms, rendering them attractive drug targets [2]. Proteases are involved in a variety of physiological processes including food digestion [3] as well as complex signaling cascades such as the apoptosis pathway [4], the blood coagulation cascade [5], and the complement system [6].

The broad range of biological functions is reflected in highly specialized substrate specificities of proteases.


While some proteases are highly promiscuous and cleave a variety of substrates, others show high specificity for particular substrate sequences [7]. Substrate specificity of a protease is determined by molecular interactions at the protein-protein interface of protease and substrate in the binding cleft of the protease. Amino acid side chains of the substrate are accommodated within subpockets of the protease.

A unique nomenclature for the subpockets of proteases was developed by Schechter and Berger [8]: the substrate's scissile bond lies between the residues P1 (N-terminal) and P1' (C-terminal), and indices are incremented for residues further out in both directions. Protease subpockets are numbered accordingly as Sn-Sn', ensuring consistent indexing between interacting regions.

Binding modes of substrate peptides are highly similar, as the substrate is locked in an extended beta conformation in the binding cleft [9]. This arrangement typically involves residues P3-P3'; in the case of elastase, even the P5 residue is tightly bound to the protease [10]. Several techniques have been developed to experimentally probe substrate specificity of proteases, as reviewed by Poreba and Drag [11] as well as Diamond [12].

They include diverse experimental approaches based on chromatography [13], phage display [14], and combinatorial substrate libraries [15, 16], as well as the use of fluorogenic substrates [17] and labeling techniques [18, 19]. The MEROPS database [20] hosts an annotated collection of protease cleavage sites from diverse experimental sources, facilitating data mining and comparison of protease specificity [21]. Recently, we have developed metrics to quantify, map, and compare protease specificity.

Subpocket-wise cleavage entropies quantify the specificity of individual protease subpockets as well as overall specificity [24]. They are calculated as a Shannon entropy [25] over the probabilities of occurrence p_a,i of amino acids a at each substrate position i, normalized to the natural occurrence of the amino acids. Cleavage entropies close to the maximum of one indicate unspecific substrate cleavage, whereas low values close to zero indicate stringent substrate recognition. Cleavage entropies have proven helpful for the direct comparison of substrate specificities of proteases, the detection of subsite cooperativity, and tracing protease specificity along evolution [24].
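The idea can be sketched in a few lines. The snippet below is a minimal illustration, not the published implementation: it assumes uniform background frequencies and normalizes by log 20 so that values fall between 0 (fully specific) and 1 (fully unspecific); the published metric additionally corrects for the natural abundance of amino acids.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 proteinogenic amino acids

def cleavage_entropy(substrates, position):
    """Normalized Shannon entropy of amino-acid usage at one substrate
    position: 0 = fully specific pocket, 1 = fully unspecific pocket.
    Sketch only: uniform background frequencies are assumed."""
    counts = Counter(s[position] for s in substrates)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(AMINO_ACIDS))  # scale to [0, 1]

# Toy example: a pocket accepting only Lys or Arg (trypsin-like P1)
subs = ["AAAK", "AAAR", "AAAK", "AAAR", "AAAK"]
print(cleavage_entropy(subs, 3))  # low value -> specific pocket
```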

Nevertheless, it should be noted that the cleavage entropy measures only the promiscuity of a protease. To compare how similar the substrates of two proteases are, other metrics, such as substrate similarity, should be used [26].

We use the term substrate specificity as a measure of substrate variability, not of substrate similarity. Furthermore, the cleavage entropy measures promiscuity and is not equivalent to the sequence logo of a protease [27]. Molecular origins of protease specificity can be investigated based on subpocket-wise cleavage entropies, as they can be mapped directly onto protease pockets and compared to local binding-site characteristics [28].

Furthermore, substrate-guided techniques can be used to intuitively group proteases based on their binding preferences [26].

As all the methods described rely on experimental substrate data, a critical assessment of the underlying data is crucial. The convergence behavior of entropy measurements was described in the literature decades ago [29, 30] and has been studied intensively ever since [31].

Different methods to correct the error due to finite samples, based on the statistics of information entropy, have been reported [32-35].

These approaches are commonly used in a variety of fields, covering not only biologically and chemically relevant information such as DNA sequences [36, 37] and neural spike trains [38], but also other data such as the English language [39].

A common approach is to estimate the underlying probability function and use the result to estimate the entropy of the real probability function, using rank-ordered histogram-based approaches [32, 40] or Bayesian approaches [41]. As estimating the probability distribution from a given sample can be complicated and computationally demanding, easier and faster access to an infinite-sample approximation is of general interest.

In this work, a simple approach to correct the bias of the cleavage entropy due to a limited number of peptide samples is presented. The underlying mathematics is analyzed in order to arrive at a mathematically valid approach that converges to the exact value for an infinite number of substrates.

To further validate the model, test cases are analyzed, and the minimum number of substrates to characterize a protease in terms of subpocket-wise cleavage entropy is calculated.

The performance is further compared with known entropy estimators from the literature [33, 42]. To the best of our knowledge, this is the first time that correction algorithms for finite samples have been used in the context of protease substrate data.

If the total cleavage behavior of a protease with eight subpockets were to be characterized exhaustively, all 20^8 possible octapeptide substrates would have to be tested.


Since this is practically not possible, the probabilities p of finding a specific amino acid at a specific position i in a substrate have to be estimated by testing a subset of these octapeptides and calculating estimated probabilities q.

The empirical probability for an event k_a,i, in our case the occurrence of the amino acid a in one of the eight pockets i, can be calculated as the quotient of the occurrences of amino acid a and the occurrences of any amino acid in this pocket (Eq 2) [43].
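This quotient can be sketched directly; the helper below is an illustrative reconstruction of the counting step, not code from the original work.

```python
from collections import Counter

def empirical_probabilities(substrates, position):
    """q_{a,i}: occurrences of amino acid a at pocket i divided by
    the total number of residues observed at that pocket (cf. Eq 2)."""
    counts = Counter(s[position] for s in substrates)
    n = sum(counts.values())
    return {aa: c / n for aa, c in counts.items()}

q = empirical_probabilities(["GKLF", "GRLF", "AKLF", "GKMF"], 1)
# q["K"] == 0.75 and q["R"] == 0.25 for this toy P1 data
```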


The entropy measure introduced above uses real probabilities p, but in practice only estimated probabilities q are available; in addition, a bias due to the limited number of samples arises. In the general case of any Shannon-entropy-based metric, Eq 3 therefore holds: the expectation value of the entropy cannot be split into the expectation values of the probability and of the logarithmic probability.

The inequality would only become an equality if the values of q_a,i and log q_a,i were independent of each other. This is not the case, as log q_a,i is strictly monotonically increasing with q_a,i (positive correlation), resulting in a general underestimation of the entropy.
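This negative bias is easy to reproduce numerically. The simulation below is an illustration under assumed conditions (a uniform 20-letter alphabet, i.i.d. sampling): the naive plug-in entropy of finite samples lies systematically below the true entropy log(20).

```python
import math
import random

random.seed(1)
K = 20                 # amino-acid alphabet size
TRUE_H = math.log(K)   # entropy of the uniform distribution (nats)

def plugin_entropy(sample):
    """Naive (plug-in) Shannon entropy of a finite sample, in nats."""
    n = len(sample)
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Average the plug-in entropy over many finite samples of size n = 30
n, trials = 30, 2000
mean_h = sum(plugin_entropy([random.randrange(K) for _ in range(n)])
             for _ in range(trials)) / trials
print(TRUE_H, mean_h)  # the finite-sample average lies below log(20)
```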

The aim of this paper is to develop a method that reduces this systematic underestimation and adds a significance value to the estimated and already published values [24]. To analyze the substrate variability of proteases, a mathematical description of the process is necessary. The process of testing sampled substrates out of a larger set can be described mathematically by the binomial distribution.

In this ansatz, the bias of the experimentalist, who most probably tends to test peptides similar to known substrates, or of the experiment itself, is neglected.

The probability q_a,i(k) of measuring k substrates with an amino acid a at position i (e.g., P1) is a function of the total number of known substrates n and the real probability p_a,i that this amino acid is accepted in this pocket (Eq 4). For all modeled data the natural occurrence of amino acids is neglected, but for the analysis of real proteases the probabilities are corrected for their abundance in the proteome [45]. Inserting the probability function into the definition of the cleavage entropy (Eq 1), expansion and reordering of the terms leads to Eq 5.
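Given the binomial ansatz described above, Eq 4 presumably takes the standard binomial form (a reconstruction from the surrounding text; the original notation may differ):

```latex
q_{a,i}(k) \;=\; \binom{n}{k}\, p_{a,i}^{\,k}\,\bigl(1 - p_{a,i}\bigr)^{\,n-k},
\qquad k = 0, 1, \dots, n .
```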

This equation provides a mathematical description of the expectation value of the measured entropy S_i,n as a function of the real entropy and an error term. S_i,n is defined as the entropy calculated from the empirical probabilities without any correction algorithm, including n samples (the classically reported value).

A detailed explanation of how to derive Eq 5 is given in the Supporting Information. The first term on the right-hand side of the second equals sign corresponds to the "real entropy", i.e., the entropy calculated with an infinite number of samples. In the following, this term is called the real entropy or the infinite-sample entropy.

The second term on the right-hand side describes the difference between the real entropy and the measured entropy, which corresponds to the error introduced by the limited sample size. This term is henceforth called the error or correction term. Rearranging the correction term in Eq 5 leads to an equation for the infinite-sample entropy as a function of the measured entropy and the error term (Eq 6).

The error term remains unknown and is investigated more closely in the next paragraph. It is possible to split the error term into two parts. With a second-order Taylor approximation of the logarithmic function, it can be shown that the sum tends toward a constant for a large number of samples n (for samples with an equal distribution, the deviation is smaller than four percent), so the error term is a linear function of the reciprocal number of samples [46].
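For orientation, the second-order Taylor expansion leads, in the standard multinomial setting, to the well-known Miller-Madow form of the finite-sample bias (shown here in nats and before normalization by log 20, as a sketch rather than as the paper's Eq 5):

```latex
\mathbb{E}\!\left[S_{i,n}\right] \;\approx\; S_{i,\infty} \;-\; \frac{K_i - 1}{2n},
```

where K_i is the number of amino acids with nonzero probability in pocket i. This makes the linear dependence on 1/n explicit.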

To gain better insight into the behavior of the term without examining the mathematics in detail, the sum is plotted as a function of the number of samples in Fig 1. The dependence of the pseudo-constant on the probabilities p_a,i and on the number of samples n, and its convergence with an increasing number of samples, is presented in Fig 1.

Due to the low probability of these events, their influence on the calculated entropy is low. In the remainder of the manuscript we will show that the most challenging case for the entropy measurement, in terms of convergence, is the unspecific pocket, where every amino acid has the same likelihood of appearing in the pocket.

The linear behavior of the pseudo-constant can be exploited in a linear-regression approach to remove the error term in Eq 5. The problem with the model is that the value of the pseudo-constant is not known. A way around this problem is to use a linear regression (Eq 7).

To create the second point necessary for the linear regression, bootstrapping is used [47]: a random subset of substrates of size n/2 is chosen and the entropy value S_n/2 for this subset is calculated. By repeating this process many times and using the average value, a good approximation for a second data point can be created.
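The extrapolation idea can be sketched as follows. This is an illustrative reconstruction, not the published code: it assumes E[S_n] = S_inf - C/n, obtains a second point by averaging the entropy of random half-size subsets, fits a line through the two points in 1/n, and reads off the intercept; the subset size n/2 and the number of bootstrap repetitions are illustrative choices.

```python
import math
import random

random.seed(7)

def plugin_entropy(residues):
    """Naive Shannon entropy (nats) from a list of pocket residues."""
    n = len(residues)
    counts = {}
    for r in residues:
        counts[r] = counts.get(r, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def extrapolated_entropy(residues, n_boot=1000):
    """Extrapolate the plug-in entropy to infinite sample size under
    the linear model E[S_n] = S_inf - C/n: fit a line through the
    points (1/n, S_n) and (2/n, <S_{n/2}>), return the intercept."""
    n = len(residues)
    s_full = plugin_entropy(residues)
    half = n // 2
    s_half = sum(plugin_entropy(random.sample(residues, half))
                 for _ in range(n_boot)) / n_boot
    slope = (s_half - s_full) / (1.0 / half - 1.0 / n)
    return s_full - slope / n  # intercept at 1/n -> 0

# Toy P1 data: four residue types, true entropy log(4) ~ 1.386 nats
data = [random.choice("KRDE") for _ in range(40)]
s_plug = plugin_entropy(data)
s_inf = extrapolated_entropy(data)
print(s_plug, s_inf)
```

The correction shifts the estimate upward, counteracting the systematic underestimation of the plug-in value.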

To achieve this, a minimum number of samples is required. The previous section shows that a large part of the systematic error can be removed by the presented approach. Nevertheless, it should be noted that only the systematic error is corrected by this approach. To predict a confidence interval for the entropy, the variance has to be taken into account. In general, a higher number of samples also reduces the uncertainty due to statistical fluctuations (variance).

As the uncertainty of the data point created by bootstrapping cannot be smaller than the error of the data point using all samples, we assume that the standard deviation is the same for both points. The bootstrapping process is repeated many times; therefore, the statistical error of this process is not significant compared to the error due to limited sampling.

By applying the presented rules for removing the systematic error of the entropy and by defining the variance, it is possible to calculate corrected entropy values with a confidence interval.

In other words, it is possible to predict how many substrates we need to significantly characterize a protease in terms of substrate specificity.

## MATERIALS AND METHODS

Extreme cases of cleavage entropies were investigated with the program Mathematica [48].

Starting from a given probability function p, we analyzed the possible measured probabilities q and the values obtained by applying the equations derived in the previous sections. In the extreme case of a totally specific pocket, only one amino acid is accepted, which means the correct value is already known after a single substrate is tested, since measuring negative events (substrates that cannot be cleaved) is not possible.

The entropy of such a pocket is zero and does not change with an increasing number of samples. The uncertainty is likewise always zero in this case. The presented method remains valid here, with a pseudo-constant of zero for the linear fit.


In Fig 2 (upper right), the full line indicates the expectation values of the measured entropy; the shaded areas are the confidence intervals including one standard deviation of the entropy, plotted against the number of samples and against the reciprocal number of samples (Fig 3, upper right).

The real space plot shows that the value is in close proximity to the real value already for a small number of samples. As expected, the reciprocal plot shows an almost linear behavior.