ENBIS-18 in Nancy

2 – 25 September 2018; Ecoles des Mines, Nancy (France) Abstract submission: 20 December 2017 – 4 June 2018

Identification of Outliers and Influential Observations: An Application

4 September 2018, 14:30 – 14:50


Submitted by
Maysa de Magalhães
Rodrigo S. Von Doellinger (Division of Methods and Quality, Brazilian Institute of Geography and Statistics), Maysa S. De Magalhães (National School of Statistical Sciences, Brazilian Institute of Geography and Statistics), Pedro N. Silva (National School of Statistical Sciences, Brazilian Institute of Geography and Statistics)
To provide good-quality statistical information, it may not be necessary to identify every error present in the data. It is sufficient to detect the influential observations, that is, those whose inclusion in or exclusion from the analysis significantly affects the estimate of the parameter of interest.

The approach generally used to identify influential observations is called selective editing (Latouche and Berthelot, 1992; Lawrence and McKenzie, 2000). In selective editing methods, potentially influential observations are ranked according to the values of a score function, which expresses the impact of a potential error on the estimate of the parameter of interest. Observations with scores above a pre-set threshold are considered critical and should be reviewed. Defining the score function involves determining the probability that an observation is in error (risk component) as well as the magnitude of that error (influence component). Risk and influence components are used by score functions presented in the literature (Jäder and Norberg, 2005). According to Di Zio et al. (2008), the methods commonly employed to obtain the risk and influence components are based on comparing the observed values of a given variable with the values predicted by a particular model. The differences between observed and predicted values are used to calculate scores that identify the observations with the greatest impact on the estimate of the parameter of interest.
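The ranking step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the score function used in the paper: the product form (risk times weighted absolute difference, relative to the predicted total), the function names, the toy data and the threshold are all assumptions made for the example.

```python
def selective_editing_scores(observed, predicted, weights, risk):
    """Hypothetical score: risk * weighted |observed - predicted| / predicted total."""
    # Weighted total of the predicted values, used to put the influence on a relative scale.
    total = sum(w * p for w, p in zip(weights, predicted))
    return [
        r * w * abs(y - p) / total
        for y, p, w, r in zip(observed, predicted, weights, risk)
    ]


def flag_critical(scores, threshold):
    """Return the indices of observations whose score exceeds the pre-set threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy data (invented for illustration): the third unit is far from its prediction.
observed = [1000.0, 1200.0, 50000.0, 900.0]
predicted = [1100.0, 1150.0, 1300.0, 950.0]
weights = [10.0, 10.0, 10.0, 10.0]   # sampling weights
risk = [0.05, 0.05, 0.95, 0.05]      # assumed probabilities of error

scores = selective_editing_scores(observed, predicted, weights, risk)
critical = flag_critical(scores, threshold=0.01)
print(critical)
```

Only the flagged units would be sent for manual review; the remaining observations are accepted as they are.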

Di Zio et al. (2008) proposed a multivariate model to estimate both the probability of error and the error magnitude. The method is based on contaminated normal models (Little, 2008). The observed data are described by a mixture of two multivariate normal distributions, one representing the erroneous (contaminated) data and the other the error-free data. The distribution of the contaminated data is assumed to be that of the error-free data with an inflated variance (Ghosh-Dastidar and Schafer, 2006).
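Under such a mixture, the probability that a given observation is contaminated follows from Bayes' rule. The Python sketch below illustrates this for the bivariate case. It assumes that the contaminated component has the same mean as the error-free one and covariance alpha times Sigma; the parameter values are invented, whereas in practice they would be estimated from the data. The function names are hypothetical and do not come from the paper.

```python
import math


def mahalanobis_sq_2d(y, mu, sigma):
    """Squared Mahalanobis distance of a 2-vector y from mu under a 2x2 covariance sigma."""
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Quadratic form v' * inverse(sigma) * v, using the explicit 2x2 inverse.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def posterior_error_probability(d2, p, pi_err, alpha):
    """Posterior probability of contamination under the mixture
    (1 - pi_err) * N(mu, Sigma) + pi_err * N(mu, alpha * Sigma),
    with d2 the squared Mahalanobis distance under N(mu, Sigma) and p the dimension."""
    # Log odds of contaminated versus error-free; the common normalising constant cancels.
    log_odds = (
        math.log(pi_err / (1.0 - pi_err))
        - 0.5 * p * math.log(alpha)
        + 0.5 * d2 * (1.0 - 1.0 / alpha)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Assumed parameters, e.g. for (log income, log expenditure); not estimates from the survey.
mu = (7.0, 9.5)
sigma = ((0.5, 0.3), (0.3, 0.5))
pi_err, alpha = 0.05, 10.0

typical = posterior_error_probability(mahalanobis_sq_2d((7.2, 9.6), mu, sigma), 2, pi_err, alpha)
extreme = posterior_error_probability(mahalanobis_sq_2d((11.0, 9.0), mu, sigma), 2, pi_err, alpha)
print(typical, extreme)
```

An observation close to the centre of the error-free distribution receives a posterior probability near zero, while one far from it receives a probability near one; this posterior can play the role of the risk component in a score function.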

In this paper, the selective editing method proposed by Di Zio et al. (2008) was applied to identify outliers and influential observations in the Household Budget Survey (HBS 2008/2009) of the Brazilian Institute of Geography and Statistics (IBGE), using two variables: monthly household income and annual household expenditure.

Di Zio M., Guarnera U., Luzi O. (2008). Contamination models for the detection of outliers and influential errors in continuous multivariate data. UNECE, Conference of European Statisticians, Work Session on Statistical Editing, Vienna.
Ghosh-Dastidar B., Schafer J.L. (2006). Outlier Detection and Editing Procedures for Continuous Multivariate Data. Journal of Official Statistics, 22 (3), 487–506.
Jäder A., Norberg A. (2005). A Selective Editing Method considering both suspicion and potential impact, developed and applied to the Swedish Foreign Trade Statistics, UN/ECE Work Session on Statistical Data Editing, Ottawa. (http://www.unece.org/stats/documents/2005.05.sde.htm).
Latouche M., Berthelot J.M. (1992). Use of a score function to prioritize and limit recontacts in editing business surveys. Journal of Official Statistics, 8 (3), 389–400.
Lawrence D., McKenzie R. (2000). The General Application of Significance Editing. Journal of Official Statistics, 16 (3), 243–253.
