Statistical monitoring methods are currently widely used to control the quality of a process. Most real-life processes, e.g. chemical production and printed circuit board manufacturing, require a multivariate approach so that multiple crucial quality characteristics can be monitored simultaneously. But many commonly used multivariate methods rely on assumptions that are hardly ever satisfied in practice. In this blog post we therefore introduce the modern theory of copulas to allow for better multivariate monitoring.
Dear visitors of the ProgBlogs,
Large data sets containing extensive information about a process, such as sensor readings, are easily and quickly accessible these days. Such data sets are commonly used in combination with statistical methods in order to efficiently control the quality of a process by identifying anomalies in a timely fashion. It turns out that most real-life processes require multivariate methods instead of univariate methods, so that multiple quality characteristics can be monitored simultaneously, to take possible dependencies between the quality characteristics into account. In the presence of dependent quality characteristics, univariate methods would be rather invalid or inefficient.
Most conventional multivariate methods, such as the Hotelling T2 control chart, are based on the assumptions that observations are uncorrelated and multivariate normally distributed. Some of these methods became so popular that they are often applied without carefully checking the above assumptions. However, in practice, the assumptions are almost never satisfied, which can question the efficiency of these methods. Moreover, anomaly detection can be done by monitoring the mean or variance of a multivariate distribution representing the process and raising the alarm when one deviates too much from an expected behavior. Apart from changes in mean or variance, the distribution of the process can also change in dependence structure. Unfortunately, most conventional multivariate methods fail to detect such changes in dependence structure in a timely manner. Therefore, we introduce a copula-based monitoring approach for possibly autocorrelated, non-normal multivariate data that is able to detect changes in the dependence structure of a multivariate distribution, as well as changes in its mean or variance (Krupskii et al., 2019). We are going to create some empirical results to investigate the performance of the copula approach for non-normal bivariate data.
A copula is a multivariate distribution function with all marginals being uniformly distributed on the unit interval [0,1]. This can be easily achieved since any continuous random variable can be transformed to a standard uniform random variable by its probability integral transform (Angus, 1994). The name “copula” is derived from the fact that a copula “couples” a multivariate distribution function to its marginals. More specifically, copulas allow to combine the marginals of univariate random variables to arrive at the multivariate distribution of these random variables, which makes them very useful to model non-normal multivariate data. Additionally, copulas can be used to efficiently describe the dependence structure of a multivariate distribution, as they define non-parametric measures of dependence between the individual components. The main appeal of copulas is that they allow to separate a multivariate distribution into an appropriate copula which describes the dependence structure and all the marginals (Rüschendorf, 2009). This provides a flexible framework for multivariate modeling.
Multivariate data often exhibit several characteristic properties that should be described as precisely as possible using a multivariate distribution function. Therefore, there exist various copula families that are able to express specific distributional characteristics like heavy tails or asymmetry. The most popular and commonly used copulas can be categorized into two classes:
In order to fit the correct copula family to cover characteristic properties of the multivariate data, one has to estimate the unknown parameters of the copula model based on the available data. In practice, parameter estimation is usually done by Maximum Likelihood estimation or the use of rank correlation coefficients such as Spearman’s rho and Kendall’s tau, since it is often possible to define an explicit function between copula parameters and rank correlation coefficients (Genest et al., 2013).
For the bivariate case, there exist many well-investigated copula families which can be used in order to find the best-fitting copula model. For higher dimensions, however, the choice of usable copula families is somewhat limited, since these copula families are restrictive in terms of flexibility and dependence modeling. To overcome this drawback, we will make use of vine copulas which provide a flexible graphical model for describing multivariate copulas composed of a cascade of bivariate copulas, so-called pair-copulas (Brechmann & Schepsmeier, 2013). The main idea of this construction is to decompose a multivariate probability density into blocks of pair-copulas, where each pair can be chosen independently from the others. Specifically, a multivariate density function can be decomposed into factors of conditional densities, each of which can be further decomposed into parts of pair-copulas and conditional marginal densities (Aas et al., 2009). In this way we can iteratively construct a product of pair-copulas and conditional marginal densities.
This representation of a multivariate density, however, is not unambiguous. Therefore, a graphical model which is able to distinguish between the possibilities is needed (Bedford & Cooke, 2002; Brechmann & Schepsmeier, 2013). This graphical model consists of regular vines, so-called R-vines, that are based on a set of trees. For a random vector consisting of d components, we obtain a d-dimensional vine. This vine can then be represented as a set of d-1 trees with d(d-1)/2 edges in total, where each edge stands for the corresponding pair-copula density. In general, the nodes of a tree are equivalent to the edges of the tree above. There is, however, one exception to this rule that only applies to the first tree. Namely, the nodes of the first tree are represented by the d variables. In order to get an idea of what such R-vine could look like, a possible graphical representation of a 5-dimensional R-vine is shown in Figure 2.
Figure 1: Possible 5-dimensional R-vine tree structure
Furthermore, there exist two special cases of R-vine structures. First, an R-vine is called drawable vine, or so-called D-vine, if in each tree there exist two edges for each node at most. Second, an R-vine is called canonical vine, or so-called C-vine, if in each tree there exists one unique node with the number of edges equal to the total number of nodes in that particular tree, minus one. In the end, it is possible to define a statistical vine copula model from a graphical representation, so that the multivariate density can be explicitly decomposed into factors corresponding to an underlying R-vine, D-vine or C-Vine structure.
To select the best-fitting vine copula model for a given dataset, we must process the following separate points. First, we have to select the R-vine structure, i.e., selecting which pairs of variables have to be used. Second, we have to choose a bivariate copula family for each selected pair. And finally, we need to estimate the parameters for each pair-copula. These steps can be done based on a sequential, heuristic method (Dissmann et al., 2013). Basically, this method constructs the R-vine by sequentially defining the trees in such a way that the chosen pairs model the strongest pairwise dependencies present. When the first tree structure is selected, we can choose an appropriate bivariate copula family for each edge of the tree by means of the Akaike information criterion. We are then able to estimate the parameters for each pair-copula in the tree with Maximum Likelihood estimation or the use of rank correlation coefficients, as explained above in the paragraph concerning the introduction to copulas. Subsequently, we can move forward to the next tree and repeat the corresponding steps. If we continue this way, we can iteratively define all trees in order to correctly construct a suitable R-vine structure.
We now have enough knowledge to build a statistical framework around the copula methodology. This framework will be based on control charts, a very popular tool in the Statistical Process Control context. The aim of such control charts is to timely detect anomalies that could indicate a problematic development in the process. Based on this anomaly detection, proactive action can be taken in order to restore the quality of the process. Control charts typically include for each observation a test statistic, control limits and possibly a warning signal. The control limits are decided in an offline training phase, the so-called phase I. Subsequently, incoming observations can be tested in an online monitoring phase, the so-called phase II. Furthermore, a warning signal arises when a test statistic falls outside one of the limits, which indicates that the process is statistically out-of-control.
Following the theory behind control charts, we propose a vine copula-based monitoring approach for multivariate processes (Mühlig, 2017). The basic idea is that a multivariate process can be represented by an appropriate vine copula. When the process is assumed to be in-control, the vine copula follows some in-control distribution. When a change in the process occurs, the process does no longer follow the in-control distribution and can be declared as out-of-control. In order to detect such a change in distribution, we construct a tolerance region by means of density level set estimation (Verdier, 2013).
First, we need a suitable training set for phase I, in which we assume that the process is in-control. Note that we often have to transform some variables by their probability integral transform to satisfy the requirement of standard uniform variables. Now we are able to determine the best-fitting vine copula model by the procedure explained in the above paragraph about finding an appropriate model. Based on the selected vine copula, we estimate the joint density function and calculate a large enough density sample. Subsequently, we calculate the corresponding control limits dependent on whether we want to use the one-sided or two-sided version of the control chart. For the one-sided case, the control limit is defined as the - quantile of the density sample, where is the pre-specified false alarm probability. For the two-sided case, the upper and lower control limit are defined as the 1-/2- quantile and /2- quantile of the density sample, respectively.
For phase II, we have incoming observations and we need to decide if they are generated from the in-control distribution. Again, we first need to transform some variables to standard uniform variables. Then, we can estimate the density value of an incoming observation based on the selected vine copula model in phase I. Subsequently, we compare this density value with the control limits that we determined in phase I. For the one-sided chart, we raise an alarm if the density value falls below the control limit. For the two-sided chart, we raise an alarm if the density value exceeds the upper control limit or falls below the lower control limit.
In order to measure the performance of control charts, we first need to introduce the concept of average run length. For an in-control process, the average run length equals the average number of observations until a false alarm occurs and is denoted by ARL0. For an out-of-control process, the average run length equals the average number of observations until a change is detected and is denoted by ARL1. Ideally, we want ARL0 to be large and ARL1 to be small to ensure a low false alarm rate and fast detection of changes. However, there seems to be a trade-off between ARL0 and ARL1. Therefore, we aim to reach a pre-specified ARL0 value in order to be able to make a fair comparison between control charts. Obviously, the chart with minimum ARL1 value performs best.
We will now analyze a 2-dimensional example to compare the performance of the two-sided copula control chart with the conventional Hotelling T2 control chart. For this purpose, we assume that the bivariate process is best represented by the Clayton copula family. We sample 1,000 observations from this copula family to build the training data for phase I, based upon which we estimate the joint density and calculate a density sample. We calculate the control limits based on the procedure described above, where we set =1-(1-0.0027)^{2} to account for the false alarm probability with two variables involved. We repeat this phase I procedure 1,000 times to obtain an accurate estimate of ARL0. We vary the input parameter over a sequence of possible values, so that we can determine the parameter value for which the ARL0 of both methods are approximately equal. We use this specific value and move on to phase II, in which we manually implement three possible changes in the process. This includes a shift in the mean, variance and dependence structure of the bivariate distribution. Phase II is repeated 1,000 times as well to obtain an accurate estimate of ARL1.
We are able to realize a mean shift by changing the level of the first variable. Figure 2 serves as an example and shows what such a level shift looks like in terms of the copula and its components. We vary the percentage by which the level of the first variable is increased from 10 to 90, by steps of 10, in order to obtain an extensive analysis. The results are visualized in Figure 4, from which we conclude that the copula chart does have a shorter ARL1 than the Hotelling chart in case of a mean shift, especially for small shifts.
Figure 2: Level of the first variable increased by 20 percent
Furthermore, we can implement a variance shift by perturbing the correlation matrix (Galeeva et al., 2007). Figure 3 illustrates what such a shift looks like in terms of the copula and its components. We vary the percentage by which the largest eigenvalue of the correlation matrix is increased from 10 to 90, by steps of 10, to enable distinction between small and large variance shifts. The simulation results are shown in Figure 4, from which we conclude that the copula chart does have a shorter ARL1 than the Hotelling chart in case of a shift in the variance, particularly for larger shifts.
Figure 3: Largest eigenvalue of the correlation matrix increased by 20 percent
Figure 4: Average out-of-control run length as a function of the largest eigenvalue increase
Finally, we manually insert a shift in the dependence structure by sampling from a different copula family, namely the Frank copula. Figure 5 exemplifies what such a shift looks like in terms of the copula and its components. It should be noted that both the mean and the variance of the distribution remain the same. The ARL1 values for a range of possible Kendall’s tau values are shown in Figure 6. From the phase I simulation, it turns out that the specific tau value for which the ARL0 of both methods are the same is approximately equal to 0.65. Therefore, the copula chart does have a much shorter ARL1 than the Hotelling chart if the dependence structure changes according to the Frank copula. In fact, this statement holds for all possible values of Kendall’s tau.
Figure 5: Dependence structure shifted according to the Frank copula
Figure 6: Average out-of-control run length as a function of Kendall’s tau correlation coefficient
Overall, we can conclude that the copula-based control chart significantly outperforms the Hotelling control chart in detecting shifts in the mean, variance or dependence structure of a non-normal bivariate distribution. The theory of copulas therefore offers a refreshing and flexible multivariate monitoring approach to efficiently control the quality of a process.
Are you interested in wind energy and predictive maintenance? Take a look at our blog post:
Marginals: A marginal distribution is the probability distribution for one of the individual components, irrespective of the other components.
Conditional densities: The conditional density of Y given X is the probability density function of the conditional distribution of Y given X in case this distribution is continuous.
Trees: A tree is a non-linear data structure consisting of nodes (points) and edges (lines), which can be used to organize data hierarchically.
Statistical Process Control: Statistical Process Control is a method of quality control that uses statistical techniques to monitor a process.
_{References:}
_{ Aas, K., Czado, C., Frigessi, A., & Bakken, H. (2009). Pair-copula constructions of multiple dependence. Insurance: Mathematics and economics, 44(2), 182-198.}
_{ Angus, J. E. (1994). The probability integral transform and related results. SIAM review, 36(4), 652-654.}
_{ Brechmann, E., & Schepsmeier, U. (2013). Cdvine: Modeling dependence with c-and d-vine copulas in r. Journal of statistical software, 52(3), 1-27.}
_{ Dissmann, J., Brechmann, E. C., Czado, C., & Kurowicka, D. (2013). Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics & Data Analysis, 59, 52-69.}
_{ Galeeva, R., Hoogland, J., Eydeland, A., & Stanley, M. (2007). Measuring correlation risk. Tech. rep.}
_{ Genest, C., Carabarín-Aguirre, A., & Harvey, F. (2013). Copula parameter estimation using Blomqvist’s beta. Journal de la Société Française de Statistique, 154(1), 5-24.}
_{ Krupskii, P., Harrou, F., Hering, A. S., & Sun, Y. (2019). Copula-based monitoring schemes for non-Gaussian multivariate processes. Journal of Quality Technology, 1-16.}
_{ Mühlig, B. (2017). Multivariate process monitoring based on copula structures. Master’s thesis, Julius-Maximilians-Universität Würzburg, Fakultät für Mathematik und Informatik Lehrstuhl für Statistik.}
_{ Rüschendorf, L. (2009). On the distributional transform, Sklar’s theorem, and the empirical copula process. Journal of Statistical Planning and Inference, 139(11), 3921-3927.}
_{ Verdier, G. (2013). Application of copulas to multivariate control charts. Journal of Statistical Planning and Inference, 143(12), 2151-2159.}
Sie möchten mehr über künstliche Intelligenz und ihr Potenzial erfahren? Nutzen Sie die Vorteile und trauen Sie sich, in Ihr erstes KI-Projekt zu starten:
Jetzt kostenlos E-Book downloadenIhr Partner für Predictive Analytics und Data Science.
Weitere Angaben, u. a. zum Datenschutz, finden Sie in unserem Impressum und unserer Datenschutzerklärung.
© 2022 prognostica GmbH