INSTITUTE OF PHYSICS PUBLISHING, MEASUREMENT SCIENCE AND TECHNOLOGY
Meas. Sci. Technol. 15 (2004) 2047–2052, PII: S0957-0233(04)78367-8

SAODR: sequence analysis for outlier data rejection

Franco Pavese and Daniela Ichim
Istituto di Metrologia ‘G Colonnetti’, CNR, Strada delle Cacce 73, 10135, Torino, Italy
E-mail: f.pavese@imgc.cnr.it

Received 17 March 2004, in final form 18 June 2004
Published 26 August 2004
Online at stacks.iop.org/MST/15/2047
doi:10.1088/0957-0233/15/10/014

Abstract
In automatic data acquisition, a sample is generally made up of several instrumental readings. A series of readings is generally reduced to a single value by simple methods, such as averaging. However, outlying values can affect the series. The paper introduces an algorithm, named ‘sequence-analysis outlier data rejection’ (SAODR), which takes into account one of the most common problems affecting the measurand during the acquisition, i.e. a nonlinear drift with embedded sequences of outliers due to pulse-noise peaks. The algorithm uses a time-ordering procedure and the ‘distances’ between successive readings. The frequent case of constant sampling rate is discussed. The reported tests show the results obtained with Fortran 77 and MATLAB® implementations of the algorithm. A rejection efficiency higher than 99% was obtained.

Keywords: data analysis: algorithms and implementation, data management, data acquisition: hardware and software, mathematical procedures and computer techniques, time series analysis, time variability

(Some figures in this article are in colour only in the electronic version)

1. Introduction

The experiment for acquiring a sample y_i, i = 1, ..., I, consists of measurements of the output of an instrument. Though there are acquisition schemes based on a single reading, often (and more safely) each measurement consists of several instrumental readings y_ij in sequence, made at successive times t_ij: (y_ij, t_ij), i = 1, ..., I, j = 1, ..., J. Often, for each i ∈ {1, ..., I}, the J readings are processed on-line by the instrument firmware (usually simply by averaging them) to obtain y_i.

Most often, the value of the response variable y depends on one or more influence variables x_p, p = 1, ..., P, whose values do not remain stable during the acquisition time (typically a ‘drift’ is present). Therefore, instead of a stable influence variable x_p, with x_pi1 = ... = x_piJ for each i (stationary case), there is a (small) change of x_p in time, and hence of y_i: (y_ij, x_pij, t_ij), i = 1, ..., I, j = 1, ..., J (quasi-stationary case), showing the so-called baseline drift. The multiple-reading strategy is preferable, as it produces samples less influenced by the presence of outliers if a statistical analysis is performed on-line on the readings.

General methods of managing outliers or unusual values in a dataset can be found in the literature in books addressing diagnostics, robust statistics or filtering (e.g., [1]). There are many of these methods, since their choice critically depends on the application and on the definition of the outlier type to be identified. This paper, as better specified later, deals with the detection of spikes in the data (i.e., high-frequency features of the data sequence), including the small ones [2], and also discriminates them from sudden changes of the baseline trend (drift, i.e., low-frequency components of the data sequence). Data are assumed to form an ordered sequence, like a time series (short sequences as considered in the paper, or unlimited-length sampled signals). However, any other ordering independent variable (e.g., a spatial one) can be considered instead of time. Finally, an aim of the algorithm is to be fast enough to be used also for on-line rejection, in order to allow real-time data integration and so avoid missing data in the subsequent analysis.

For this type of outlier and aim, detection methods based on cluster analysis are not suitable.
Also, statistical methods have been judged inadequate since they are unable to discriminate between high- and low-frequency components of the data. Transform-domain methods and frequency filtering methods have been ruled out as they are less suitable in the case of short sequences of data (small data sets). Predictive models are limited to the cases when the baseline drift trend can be predicted. The method preferred pertains to the class of gradient methods. This paper discusses a specific novel approach, which also embeds the decisions about detection. The choice of the threshold implemented in the algorithm is more conventional, being a variant of the median, but the user is free to make a different choice, and even to add a double threshold [2] or a training stage, by adapting the corresponding logical block in the procedure. After efficient removal of the outliers, a correct drift (baseline variation) removal can be performed before averaging the cleaned readings.

Figure 1. A typical sample of acquired data, made of a sequence of instrumental readings. They show a trend, due to a low-frequency signal drift, and a system noise. For external reasons, noise peaks can randomly occur (square in the figure), affecting one or more consecutive readings. (Axes: y[i,j] versus t[i,j]; legend: drift, noise, outlier.)

© 2004 IOP Publishing Ltd. Printed in the UK.

A two-step procedure has been introduced in [3]: for each sample y_i, (i) a pre-processing step rejects the outlying readings y_ij; then (ii) the baseline drift is suppressed by means of a regression applied to the ‘cleaned’ readings, to obtain the valid output sample value y_i. The first step consists of a pre-processing algorithm, named ‘sequence-analysis outlier data rejection’ (SAODR), designed to be easily fitted into the instrument firmware. It is based on the distances between consecutive readings for the analysis of the outlying data.
This paper describes the algorithm and then shows the results of its testing with simulated sequences of uniformly spaced data on two implementations, in Fortran 77 and MATLAB®.

2. Assumptions on data characteristics

The characteristics of a data sequence y_ij, such as the very general one shown in figure 1, can be summarized as follows:

(i) acquired data consist of an ordered sequence of readings from an instrument (e.g., a digital voltmeter, bridge, ...), which can be short and which we will call the ‘signal’;
(ii) the signal is assumed to be quasi-stationary during the acquisition of the readings;
(iii) there are three reasons for the signal value changes, of various and generally independent origin:
  1. signal drift, due to the signal being quasi-stationary: it is assumed to have a frequency spectrum limited to sufficiently low frequencies, e.g., to allow modelling with a low-order polynomial;
  2. signal noise, typically due to electrical instrumental noise: random variations of the readings with a broad frequency band (typically white noise), assumed to be stationary within each sequence of readings;
  3. pulse noise, due to spot events in the environment of magnetic, electric (switching), mechanical (shocks) nature, etc. It occurs as random spikes assumed to affect no more than a maximum number K of consecutive readings (generally a few), but with no limits as to the number of occurrences, the position within the sequence, the magnitude and the direction.

In estimating the ‘sample value’ y_i and its associated uncertainty from the readings y_ij, two problems can produce misleading results when attempting outlier identification:

• the number, position and size of the outliers are unknown;
• the low-frequency drift of the signal can be of the same order of magnitude as the outlying readings.

Methods for outlier rejection based only on the values of the readings cannot discriminate between components (1) and (3) of the signal variation.
Moreover, any simple statistical estimate of the data (e.g., the standard deviation) is affected by both the presence of outliers and signal drift and, therefore, it cannot simply be used as a robust threshold to discriminate outliers.

3. The online algorithm SAODR

The SAODR algorithm neither computes the baseline nor uses data values, but analyses the sequences of distances between consecutive instrumental readings. For readings non-uniformly spaced in time, the divided differences or the Euclidean distances should be adopted to take into account different scales in the two variables. In the case of equally spaced data (constant sampling rate), which is the most common case in automatic data acquisition, the vectorial distance between two consecutive readings simply reduces to the projection on the y-axis. SAODR has presently been developed for this case.

The SAODR inputs are: the minimum number J of data (t_i,j, y_i,j), j = 1, ..., J, to be used for estimating each y_i; the number M ≥ J of input initial instrumental readings (t_i,j, y_i,j), j = 1, ..., M; and the hypotheses about the outliers' structure. The latter can be summarized by assuming the maximum number K of consecutive outlying readings allowed (an outlier sequence may contain more than one reading) and a threshold distance value d0 for outlier distance discrimination. These assumptions can be adapted according to the level of knowledge of the data characteristics: e.g., in the case of a sampled signal, K depends on the ratio between the noise-spike time constant and the sampling rate. If this ratio is too high, unnecessarily increasing the average value of K, provisions can be taken to lower it in the algorithm rules. An initial training can also be added for this purpose.

The basic algorithm steps for constant sampling rate are the following, for each i = 1, ..., I:

• Step 1. Acquire a number M ≥ J of instrumental readings in sequence (t_i,j, y_i,j), j = 1, ..., M.
• Step 2. Compute the projection on the y-axis of the (vectorial) distances between consecutive readings y_i,j+1 and y_i,j:
    d_j = dist(y_i,j, y_i,j+1) = |y_i,j+1 − y_i,j|,  j = 1, ..., M − 1.
  With each d_j, j = 1, ..., M − 1, associate the sign s_j:
    s_j = +1 if y_i,j+1 ≥ y_i,j,
    s_j = −1 if y_i,j+1 < y_i,j.
  In the following, d_j is called ‘distance’ and the sign s_j is called ‘direction’.
• Step 3. Define as candidate outlier distances those distances d_j for which d_j > d0 (an a priori defined threshold) and compute the number C of their occurrences.
• Step 4. If M − C − 1 < J, acquire A supplementary readings and go to step 2.
• Step 5. Starting from each candidate outlier distance, say the jth, analyse the subsequence of length L = K + 1 of consecutive distances d_j, ..., d_j+K.
  (a) Declare as not an outlier distance any candidate outlier distance for which the relative subsequence of distances does NOT satisfy one or both of the following conditions:
      (1) more than one candidate outlier distance exists, except for the first and the last distances, d_1 and d_M−1;
      (2) at least one change in direction occurs.
  (b) For all the remaining candidate outlier distances collected in step 3, take the decisions using the method of the ‘truth table’ to identify the outlying readings to reject. The ‘truth table’ presented in table 1 is for K = 2, but a similar table can be constructed for any other value of K.

The feedback step 4 can be dropped if one does not require all y_i to have the same statistical weight, i.e., to be computed from the same number of readings.
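Steps 2 to 4 above can be illustrated with a short Python sketch. It is not the authors' Fortran 77 or MATLAB code: the function names, the example readings and the hand-picked threshold d0 are assumptions made only for illustration, and indices are 0-based rather than 1-based as in the text.

```python
from typing import List, Tuple


def distances_and_directions(y: List[float]) -> Tuple[List[float], List[int]]:
    """Step 2: d_j = |y_{j+1} - y_j| and s_j = +1 if y_{j+1} >= y_j, else -1."""
    d = [abs(b - a) for a, b in zip(y, y[1:])]
    s = [1 if b >= a else -1 for a, b in zip(y, y[1:])]
    return d, s


def candidate_indices(d: List[float], d0: float) -> List[int]:
    """Step 3: 0-based indices j of the distances with d_j > d0."""
    return [j for j, dj in enumerate(d) if dj > d0]


def needs_more_readings(m: int, c: int, j_min: int) -> bool:
    """Step 4: True when M - C - 1 < J, i.e. supplementary readings are needed."""
    return m - c - 1 < j_min


# Illustrative readings (hypothetical values): slow drift plus one spike.
readings = [1.0, 1.1, 1.2, 5.0, 1.4, 1.5]
d, s = distances_and_directions(readings)
cand = candidate_indices(d, d0=1.0)  # d0 chosen by hand for this example
more = needs_more_readings(len(readings), len(cand), j_min=4)
```

In this example the single spike produces two candidate distances (into and out of the spike) with opposite directions, which is the pattern that step 5 then examines.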
To estimate the sample value y_i and its uncertainty, regression can then be applied to the cleaned sequence of at least J data using a suitable functional model of the baseline drift, see figure 1.

4. The simplest SAODR implementation: two-outlying-reading sequences

Choice of the threshold. As the location statistic, an augmented median has been used. The median performs the most robust discrimination of the ‘regular’ distances (ordered by size) from the potential outlying distances, since obviously C < (M − 1)/2. However, simply rejecting all the distances above the median value would be too crude and too ‘expensive’ in terms of acquisition, because half of the distances would be considered as candidate outlier distances. Consequently, a higher threshold d0 is defined by augmenting the median by a factor depending on the standard deviation of the distances below the median (outlier-free by definition), multiplied by an adjustable constant, µ.

Figure 2. A typical reading sequence which contains an outlier to be identified by the ‘truth table’.

Choice of the maximum length of the outlying-reading sequence. The simplest implementation is the one that assumes the noise spikes to affect, at worst, only two consecutive readings (K = 2). From the point of view of the acquisition cost, the size of the increase A = M − J of the number of readings, with respect to the target number J, is another necessary choice: one has to find a trade-off between the increase in the number of readings (A) and the need to resort to the feedback step 4, both having a time cost. A good compromise seems to be to invoke the feedback only when there is more than one outlying reading, thus setting A = 2 in this implementation.

4.1. The ‘truth table’

The analysis of the distance subsequences is the core of the algorithm, see [3]. Its construction is based on the analysis of all possible situations involving outliers in the reading sequence.
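As an aside on the choice of the threshold described at the start of section 4, the augmented median can be sketched in Python as follows. The spread term mirrors the logic chart of the appendix as far as it can be read (square root of the sum of squared distances not exceeding the median, divided by their count); this reading is an assumption, as is the function name, and a conventional standard deviation of the same distances could be substituted.

```python
import math
from statistics import median
from typing import List


def augmented_median_threshold(d: List[float], mu: float) -> float:
    """Return d0 = median + mu * spread of the distances not above the median.

    The spread follows the appendix flowchart as read here (an assumption):
    sqrt(sum of squares) / count, taken over the distances <= median.
    """
    am = median(d)
    low = [x for x in d if x <= am]  # non-empty whenever d is non-empty
    spread = math.sqrt(sum(x * x for x in low)) / len(low)
    return am + mu * spread


# Hypothetical distances: three 'regular' ones and two large ones.
example_d = [1.0, 2.0, 2.0, 10.0, 9.0]
d0 = augmented_median_threshold(example_d, mu=2.0)
candidates = [x for x in example_d if x > d0]
```

Because only the distances at or below the median enter the spread, the two large distances in the example do not inflate the threshold, which is the point of using the median as the location statistic.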
Let us consider, for example, a situation like the one in figure 2 and analyse its main features. In such a situation, a candidate outlier distance is d_j, hence one of the analysed subsequences of distances must be (d_j, d_j+1, d_j+2), in the case K = 2. One can see that the candidate outlier distances in this subsequence are d_j and d_j+2 and that there is only one change of direction (see note 1), namely between d_j and d_j+1. In this particular outlier configuration, the identified outliers must be the readings at positions j + 1 and j + 2 (in table 1, this situation is identified by row number 10 of the ‘truth table’). If considered to be a possible outlier configuration in the studied data acquisition, such a configuration must be identified by the ‘truth table’. The other rows of the ‘truth table’ are to be constructed in a similar manner, that is, by considering all the possible situations identifying outliers, 16 in total for K = 2, plus a few special cases for the first and last readings. These few special cases are due to the fact that the situation C = 1 can occur only when the first or the last reading is involved.

Note 1. No change of direction is meant to represent not an outlier but a mere signal ‘drift’ (component 1 in section 2).

Table 1. The ‘truth table’ for K = 2; each analysed subsequence of distances starts with a candidate outlier distance.

No   d_j > d0?   d_j+1 > d0?   d_j+2 > d0?   s_j*s_j+1 < 0?   s_j+1*s_j+2 < 0?   Declared outlier
1    T           T             T             T                T                  j+1
2    T           T             T             T                F                  j+1, j+2
3    T           T             T             F                T                  j+1, j+2
4    T           T             T             F                F                  None
5    T           T             F             T                T                  j+1
6    T           T             F             T                F                  j+1
7    T           T             F             F                T                  None
8    T           T             F             F                F                  None
9    T           F             T             T                T                  None
10   T           F             T             T                F                  j+1, j+2
11   T           F             T             F                T                  j+1, j+2
12   T           F             T             F                F                  None
13   T           F             F             T                T                  None*
14   T           F             F             T                F                  None*
15   T           F             F             F                T                  None*
16   T           F             F             F                F                  None*

* If C = 1, the first or the last reading is declared outlier.
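The table lookup for K = 2 can be sketched in Python as below. This is only an illustration of table 1, not the authors' implementation. Three choices are assumptions of this sketch: distances past the end of the sequence are treated as ‘not above threshold’ with ‘no change of direction’; after a decision, the distances already explained by the declared outliers are skipped; and the special cases for the first and last readings (C = 1) are not handled. Indices are 0-based.

```python
from typing import Dict, List, Tuple

# Key: (d[j+1] > d0, d[j+2] > d0, s[j]*s[j+1] < 0, s[j+1]*s[j+2] < 0),
# with d[j] > d0 always true. Value: offsets from j of the rejected readings.
# Keys absent from the dictionary (rows 13 to 16) mean 'None'.
TRUTH_TABLE: Dict[Tuple[bool, bool, bool, bool], Tuple[int, ...]] = {
    (True, True, True, True): (1,),       # row 1
    (True, True, True, False): (1, 2),    # row 2
    (True, True, False, True): (1, 2),    # row 3
    (True, True, False, False): (),       # row 4
    (True, False, True, True): (1,),      # row 5
    (True, False, True, False): (1,),     # row 6
    (True, False, False, True): (),       # row 7
    (True, False, False, False): (),      # row 8
    (False, True, True, True): (),        # row 9
    (False, True, True, False): (1, 2),   # row 10
    (False, True, False, True): (1, 2),   # row 11
    (False, True, False, False): (),      # row 12
}


def reject_outliers_k2(y: List[float], d0: float) -> List[int]:
    """Return the 0-based indices of the readings declared outliers (K = 2)."""
    d = [abs(b - a) for a, b in zip(y, y[1:])]
    s = [1 if b >= a else -1 for a, b in zip(y, y[1:])]
    n = len(d)

    def big(i: int) -> bool:   # d_i > d0; False past the end (assumption)
        return i < n and d[i] > d0

    def flip(i: int) -> bool:  # s_i * s_{i+1} < 0; False past the end (assumption)
        return i + 1 < n and s[i] * s[i + 1] < 0

    outliers = set()
    j = 0
    while j < n:
        if not big(j):
            j += 1
            continue
        key = (big(j + 1), big(j + 2), flip(j), flip(j + 1))
        offsets = TRUTH_TABLE.get(key, ())
        outliers.update(j + k for k in offsets)
        j += len(offsets) + 1  # skip distances already explained (assumption)
    return sorted(outliers)


single_spike = reject_outliers_k2([1, 1, 5, 1, 1, 1], d0=1.0)
figure2_like = reject_outliers_k2([1, 1, 5, 4.9, 1, 1, 1], d0=1.0)
pure_drift = reject_outliers_k2([1, 2, 3, 4, 5], d0=0.5)
```

The second call reproduces the configuration discussed for figure 2 (row 10): a large distance, a small one after a change of direction, then a large one in the same direction. The third call shows a monotonic drift whose distances all exceed the threshold yet are not rejected, since no change of direction occurs.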
The columns of the ‘truth table’ are identified by the distances and directions entering in the analysis of a subsequence of K + 1 distances, that is, d_j, d_j+1, ..., d_j+K and the direction pairs (s_j, s_j+1), ..., (s_j+K−1, s_j+K). A complete analysis of all the possible typical situations identifies those sequences that contain outlier readings. Since they can be listed in full, the action that must be taken for each combination can be decided according to the ‘truth table’.

Of course, for greater K the number of subsequences that contain outliers rapidly increases, and so does the analysis time. One of the reasons why the parameter K increases can merely be a very high ratio between the pulse-noise time constant and the sampling rate: obviously, if the former is much higher than the latter, K can easily become too large and should be reduced by a suitable procedure.

The resulting cleaned data sequence (or signal) would look like figure 1 with the high reading (square dot) suppressed (corresponding to figure 2 with the two large distances suppressed), and with an additional clean reading added at the end of the sequence as necessary to keep J constant.

5. Code implementation for the SAODR algorithm and test results

The algorithm was initially implemented in FORTRAN 77 as a subroutine [3], and later also in MATLAB®. The FORTRAN 77 subroutine takes a few microseconds and the MATLAB® one takes about 3.3 ms to run on a slow PC (clock rate <500 MHz). A block logic scheme of the implementation is reported in the appendix and in figures 4 and 5 for the case where step 4 is omitted. Speed is a must, as the algorithm should be essentially transparent to data acquisition and should not influence the sampling rate. The code therefore allows the procedure to stop as soon as the presence of further outliers can be excluded: e.g., if no candidate outlier distance or only one is found, the procedure goes directly to END, see figure 5 (A = 2).
The study of the efficiency in outlier detection was performed by using simulated data sequences having the following parameters for the readings: J = 18, M = J + 2, Omax = 4 (where Omax is the maximum number allowed for the outlying readings in a sub-sequence: Omax ≤ (M/2 − 1)/2, with Omax ≥ K − 1). The basic test sequence of readings includes a random noise component and a non-monotonic baseline drift affecting the readings by a factor of about 2, but no outliers. Then, this basic test sequence is automatically altered by including outlier elements, random in number (up to a maximum value Omax), in position in the sequence, in relative size, R = y_max/y, and in ‘direction’. By randomly varying all these simulation parameters and the discrimination threshold, it is possible to check the ability of the algorithm to identify outlying values closer and closer to the random noise level, and so also to check the sharpness of the discrimination threshold. Tests of groups of 10 000 trials gave essentially the same results, therefore no extension of the tests above 60 000 random sequences was considered.

The number of algorithm failures, i.e., of mismatches between the simulated outliers and the ones recognized by the algorithm, was tested as a function of the maximum outlier size relative to the signal value, R. In the earlier tests of the FORTRAN 77 routine, when Omax matches the hypothesis K = 2, i.e., when Omax = 1, the efficiency of the algorithm was found to be 100%. When Omax > 1, there is a non-zero statistical probability for the outliers within the sequence to form subsequences affecting more than two consecutive readings, violating the assumptions of the present algorithm implementation. This was the main reason for a resulting ≈0.5% inefficiency at Omax = 4 for outliers of size much larger than the signal (R = 15).
A residual inefficiency at the 0.1% level could be due to the occurrence of special sequences, such as multiple outliers starting from the first reading: they could be taken into account by a more complicated ‘truth table’, but the cost/benefit ratio may be too high in most experimental cases.

With the MATLAB® code, testing gave essentially the same results. Again, for the non-monotonic tested sequence, outliers were randomly generated and the efficiency of the algorithm was measured through its complement, the failure percentage. The smaller the failure percentage, the greater the efficiency of the algorithm. The randomness in the outlier generation was again given by value, position, dimension and direction. The outlier position in the data sequence was randomly generated using a uniform discrete distribution, while the direction was generated using the Bernoulli distribution. The magnitude of the outlier was generated by means of the uniform continuous distribution (UC) on [0, 1] and of the dimensioning parameter R, as for the Fortran 77 tests. That is:

    b ∼ UC(0, 1),   R ∈ (1.2, 50),   V = R * b        (1)

where V is the final dimension of the outlier inserted in the test sequence.

Figure 3. The failure percentage of SAODR for Omax = 2 maximum outliers generated in the sequence, baseline drift present. (Axes: failure percentage versus R.)

Figure 4. SAODR logic chart, without procedure step 4, part 1.

Figure 5. SAODR logic chart, without procedure step 4, part 2.

In this testing version, the varied parameters were R, the relative size of the outlier, and the maximum number of the outliers generated in the sequence. R was allowed to vary between 1.2 and 50 using a discrete step, while the maximum number of outliers inserted in the initial test sequence of readings was allowed to be 4, that is Omax ≤ 4. Again, for each combination of the parameters, 60 000 versions of the initial test sequence (inserting outliers) were generated.

In the case Omax = 1, we observed that the inserted outliers were always detected, independently of their dimension. This is a very positive result, since it seems reasonable to believe that most practical acquisition sequences would not contain more than one outlier, at least for moderate lengths of the sequences of data. For the other values of the parameter Omax, we observed that the inefficiency decreases as R increases up to a value of R ≈ 2.5.
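The outlier generation of equation (1) can be sketched in Python as follows. The text does not state how V is applied to a reading, so several choices here are assumptions: V is taken as a displacement relative to the magnitude of the reading, at least one outlier is always inserted, positions are distinct, and the Bernoulli direction uses probability 0.5. The function and variable names are hypothetical.

```python
import random
from typing import List, Tuple


def insert_outliers(y: List[float], r: float, o_max: int,
                    rng: random.Random) -> Tuple[List[float], List[int]]:
    """Return a copy of y with simulated outliers and their 0-based positions."""
    y_out = list(y)
    n_out = rng.randint(1, o_max)                 # random number, up to o_max
    positions = rng.sample(range(len(y)), n_out)  # uniform discrete, distinct
    for p in positions:
        b = rng.random()                          # b ~ UC(0, 1)
        v = r * b                                 # equation (1): V = R * b
        direction = 1 if rng.random() < 0.5 else -1  # Bernoulli, p = 0.5 assumed
        # Assumption: V scales the magnitude of the reading it displaces.
        y_out[p] = y[p] + direction * v * abs(y[p])
    return y_out, sorted(positions)


rng = random.Random(0)        # fixed seed so the example is repeatable
base = [1.0] * 20             # flat hypothetical test sequence, M = 20
altered, where = insert_outliers(base, r=15.0, o_max=4, rng=rng)
```

A failure count in the sense used above would then compare `where` with the indices returned by the rejection procedure over many such random sequences.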
For greater values of R, the inefficiency no longer depends on the magnitude factor and it is about 0.22%, 0.33% and 0.75% for Omax = 2, 3, 4 respectively, as can be seen in figure 3, for example.

6. Conclusions

An online pre-processing algorithm, named SAODR, has been developed for outlier rejection during automatic data acquisition, based on the distances between consecutive readings and on the analysis of subsequences of these distances, instead of resorting to the signal baseline. The computation of the baseline drift is then postponed to an offline step applied to the cleaned sequence.

Implementations with routines written in FORTRAN 77 and MATLAB® have shown that the algorithm works properly and that it is as robust as expected against sequences of outliers, independent of baseline drift and of the size of the outliers. This was tested via data simulation. Even with only a few assumptions on the signal characteristics, one gets an efficiency better than 99% (i.e., fewer than one outlier in 100 is missed).

The use of online outlier rejection would considerably reduce, or eliminate altogether, the number of measurements processed off-line that have to be rejected later because they are outlying, at a stage when new measurements generally can no longer be added. This is particularly important when optimization of the experimental design is applied, reducing or eliminating the redundancy of the total number of measurements, so that missing data can become a problem in the data processing procedure.

Extension of the algorithm to non-uniformly spaced data is possible: suitable definitions of ‘distance’ and of the discrimination threshold are to be adopted to take into account different scales in the two variables. Extension to outlier occurrences affecting longer sequences of consecutive readings is also easy, at the cost of a heavier ‘truth table’ and hence of an increasing computing time.
However, the code runs so fast that it hardly affects the acquisition time, except for fast sampling rates (>10 kHz with FORTRAN and >30 Hz with MATLAB®; an implementation in C is likely to substantially increase these limits).

In conclusion, the SAODR algorithm is simple and fast, and it is independent of the presence of a baseline drift of any shape, provided only that the frequency spectrum of the drift is limited to frequencies much lower than the sampling rate. It could be implemented directly into instrumentation firmware (using machine code) to estimate sample values and their associated uncertainty unaffected by the presence of outlying signal values, in a much more efficient way than the usual simple averaging (performed either by analogue or numerical means) presently used in most instruments.

A runtime version of the MATLAB® code has also been embedded in a LabView (version 6.1) acquisition procedure, where it works fine.

Appendix. Block logic scheme of SAODR actual implementation

Figures 4 and 5 show SAODR implementation flowcharts. The reported implementation shows the case where the feedback loop (step 4 of the procedure in section 3) is omitted.

The codes and full documentation are available on request from f.pavese@imgc.cnr.it and also for downloading on www.amctm.org, the website of the SofToolos MetroNet EU Network on Advanced Mathematical and Computational Tools in Metrology. The use of the algorithm and the codes is open-source (‘OSI Certify Open Source Software’), with GNU general public and library licences for non-commercial use.
Basic definitions:

Readings = the J instrument readings
AugFact = augmentation factor µ
paths = distance between two consecutive readings
signs = sign of paths
AM = median
rms = standard deviation of the paths ≤ AM
OutPaths = candidate outlying paths
GoodReadings = readings not outlying according to SAODR: vector of ‘cleaned’ readings
OutlierReadings = readings corresponding to outliers, eliminated from the ‘cleaned’ readings
index = see the ‘truth table’
U = union of

References

[1] Simonoff J S 1991 Directions in Robust Statistics and Diagnostics, Part II (Berlin: Springer)
[2] Das M and Hunt T 1998 IEEE Midwest Symp. on Systems and Circuits (Southbend, USA) p 501
[3] Pavese F, Ichim D and Ciarlini P 2001 Advanced Mathematical Tools in Metrology V, Series on Advances in Mathematics for Applied Sciences vol 57 (Singapore: World Scientific) p 283