ENBIS: European Network for Business and Industrial Statistics
ENBIS7 in Dortmund
24 – 26 September 2007
The following abstracts have been accepted for this event:

Local Models in Data Mining
Authors: Gero Szepannek, Julia Schiffner and Claus Weihs (University of Dortmund, Dortmund, Germany)
Primary area of focus / application:
Submitted at 7-Sep-2007 12:40
Accepted
single rules on the whole data. This may especially be the case if the
classes are composed of several subclasses.
This talk gives
an overview of several methods proposed to solve this problem. These
methods can be subdivided into those that need the subclasses
to be specified in advance (see e.g. Weihs et al., 2006) and those that
determine the locality in the data itself in an unsupervised manner (see
e.g. Hastie et al., 1996 or Czogiel et al., 2007). Some new issues are
also presented. All methods are evaluated and compared on several
real-world classification problems.
References:
Czogiel, I., Luebke, K., Zentgraf, M., Weihs, C. (2007): Localized
Linear Discriminant Analysis. In: Decker, R., Lenz, H., Gaul, W. (eds):
Advances in Data Analysis, Springer-Verlag, Heidelberg, 133-140.
Hastie, T., Tibshirani, R., Friedman, J. (1996): Discriminant Analysis
by Gaussian Mixtures, JRSS B 58, 158-176.
Weihs, C., Szepannek, G., Ligges, U., Luebke, K. and Raabe, N. (2006):
Local Models in Register Classification by Timbre. In: V. Batagelij,
H. Bock, A. Ferligoj and A. Ziberna (eds): Data Science and Classification,
Springer-Verlag, Heidelberg, 315-322.

Load Shedding: a new proposal
Authors: R. Faranda, A. Pievatolo and E. Tironi
Primary area of focus / application:
Submitted at 7-Sep-2007 13:31
Accepted
interruptible loads is often the only solution to keep the network in
operation. Normally, in contingencies, the difference between the power
absorbed and the power produced is very low, often less than 1% of the
latter. Therefore if all the loads participated in the load shedding
program, the discomfort would be minimal, considering its usually short
duration. According to this point of view, we present a new approach to
the load shedding program to guarantee the correct electrical system
operation by increasing the number of participants. This new load
control strategy is named Distributed Interruptible Load Shedding
(DILS). Indeed, it is possible to split every user's load into
interruptible and uninterruptible parts, and to operate on the
interruptible part only. The optimal load reduction request is found by
minimizing the expected value of an appropriate cost function, thus
taking the uncertainty about the power absorbed by each customer into
account.
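The idea of choosing the request by minimizing an expected cost can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the cost function (a linear penalty for shedding too little or too much), the single common reduction fraction `r`, the simulated load scenarios, and the grid search over `r` using a sample average in place of the true expectation.

```python
import numpy as np

def expected_cost(r, loads, deficit, under_penalty=10.0, over_penalty=1.0):
    """Sample-average cost of asking every user to cut a fraction r.

    loads   : array (n_scenarios, n_users) of simulated interruptible power
    deficit : power that has to be shed to balance the network
    The penalties are hypothetical: shedding too little is assumed to be
    ten times more costly than shedding too much.
    """
    shed = r * loads.sum(axis=1)                 # power actually shed per scenario
    shortfall = np.maximum(deficit - shed, 0.0)  # not enough was shed
    excess = np.maximum(shed - deficit, 0.0)     # more discomfort than needed
    return float(np.mean(under_penalty * shortfall + over_penalty * excess))

def best_request(loads, deficit, grid=None):
    """Grid search for the fraction r with the smallest sample-average cost."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    costs = [expected_cost(r, loads, deficit) for r in grid]
    return float(grid[int(np.argmin(costs))])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 2000 scenarios of uncertain interruptible load for 200 users (toy data)
    loads = rng.gamma(shape=4.0, scale=0.5, size=(2000, 200))
    print(best_request(loads, deficit=20.0))
```

Because under-shedding is penalized more heavily in this toy cost, the selected fraction tends to sit slightly above the value that would balance the network on average, which is one way uncertainty in the absorbed power can shape the request.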
Presently, several users such as hospitals, data centres, supermarkets,
universities, industries, etc. might be very interested in typical
shedding programs as a way to save money on their electricity bills.
In the future, however, when domotic power plants are likely to be
widely used, the distributors could interest the end users in
participating in DILS programs for either economic or social reasons.
By adopting the DILS program, the distributors can resort to the
interruptible loads not only in case of emergency conditions but also
during normal and alert operations.
Keywords: Blackout, Demand Side Management, Load Shedding,
Interruptible Load, Stochastic Approximation, Uncertain System

Online diagnostics tools in the Mobile Spatial coordinate Measuring System (MScMS)
Authors: Franceschini F. 1, Galetto M. 1, Maisano D. 1, Mastrogiacomo L. 1
Primary area of focus / application:
Submitted at 7-Sep-2007 15:58
Accepted
Abstract
Mobile Spatial coordinate Measuring System (MScMS) is a wireless-sensor-network-based system developed at the industrial metrology and quality engineering laboratory of DISPEA – Politecnico di Torino. It has been designed to perform simple and rapid indoor dimensional measurements of large-size volumes.
It is made up of three basic parts: a “constellation” of wireless devices (Crickets), freely distributed around the working area; a mobile probe to register the coordinate points of the measured object (using the constellation as a reference system); and a PC to store the data sent – via Bluetooth – by the mobile probe and to process them using ad hoc application software written in Matlab. The Crickets and the mobile probe use ultrasound (US) transceivers to communicate and to evaluate mutual distances.
The system makes it possible to calculate the position – in terms of spatial coordinates – of the object points “touched” by the probe. Acquired data are then available for different types of elaboration (determination of distances, curves or surfaces of measured objects).
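How a position can be recovered from measured distances to devices at known locations can be sketched as follows. This is a generic linearized least-squares multilateration, not the “mass-spring system” algorithm used in MScMS; the function name `multilaterate` and the beacon layout are hypothetical.

```python
import numpy as np

def multilaterate(beacons, distances):
    """Estimate a 3-D position from distances to beacons at known positions.

    Subtracting the sphere equation of the first beacon from the others
    gives a linear system in the unknown position, solved by least squares.
    At least four non-coplanar beacons are needed in 3-D.
    """
    b = np.asarray(beacons, dtype=float)
    d = np.asarray(distances, dtype=float)
    A = 2.0 * (b[1:] - b[0])
    rhs = (b[1:] ** 2).sum(axis=1) - (b[0] ** 2).sum() - d[1:] ** 2 + d[0] ** 2
    position, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return position

if __name__ == "__main__":
    # Hypothetical constellation of five devices (coordinates in metres)
    beacons = np.array([[0, 0, 0], [4, 0, 0], [0, 4, 0], [0, 0, 3], [4, 4, 3]], float)
    probe = np.array([1.0, 2.0, 0.5])
    distances = np.linalg.norm(beacons - probe, axis=1)
    print(multilaterate(beacons, distances))
```

With more beacons than unknowns the system is overdetermined, and the size of the least-squares residual gives a rough indication of inconsistent distance readings, which is the kind of situation online diagnostics are meant to catch.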
To protect against causes of error such as US signal diffraction and reflection, external uncontrolled US sources (key jingling, neon blinking, etc.), or unacceptable solutions returned by the software algorithms, MScMS implements some statistical tests for online diagnostics. Three of them are analyzed in this paper: “energy model diagnostics”, based on the “mass-spring system” localization algorithm; “distance model diagnostics”, based on the use of a distance reference standard embedded in the system; and “sensor physical/model diagnostics”, based on the redundancy of the Crickets’ US transceivers. For each measurement, if all these tests are satisfied at once, the measured result may be considered acceptable with a specific confidence level. Otherwise, the measurement is rejected.
This paper, after a general description of the MScMS, focuses on the description of these three online diagnostic tools. Some preliminary results of experimental tests carried out on the system prototype in the industrial metrology and quality engineering laboratory of DISPEA – Politecnico di Torino are also presented and discussed.

Robust estimation of the variogram in computer experiments
Authors: O. Roustant, D. Dupuy, C. Helbert (Ecole des Mines, Saint-Etienne, France)
Primary area of focus / application:
Submitted at 7-Sep-2007 16:04 by Olivier Roustant
Accepted
computer experiments. Coming from geostatistics, the kriging model is a Gaussian
stochastic process
$$Y(x) = m(x) + Z(x)$$
where $x$ is a $d$-dimensional vector, $m(x)$ is a deterministic trend, and $Z(x)$ a stationary
centered stochastic Gaussian process with spatial correlation function $R(h)$. Both trend
and spatial correlation should be estimated from data. However, this is not the case in
computer experiments, since a specific parametric form for $R$ is assumed. The most
common choice is the anisotropic power-exponential function:
$$R(h) = \exp\left(-\sum_{k=1}^d \theta_k |h_k|^{p_k}\right), \quad 0 < p_k \leq 2, \; k=1,\dots,d$$
This contrasts with geostatistics where the spatial correlation is estimated through the
variogram:
$$2\gamma(h) = var(Z(x+h) - Z(x))$$
Defined for intrinsic processes, the variogram is equivalent to $R(h)$ for stationary
processes. Using the variogram instead of the correlation function is recommended even
if the process is stationary, because of possible contaminations by trend estimate
residuals.
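To make the equivalence concrete, assume that $Z$ is stationary with variance $\sigma^2$ (a symbol introduced here, not used elsewhere in the abstract). Expanding the variance of the increment then gives
$$\gamma(h) = \frac{1}{2} var(Z(x+h) - Z(x)) = \sigma^2 (1 - R(h))$$
so that, in the stationary case, the variogram and the correlation function carry the same information up to the scale factor $\sigma^2$.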
The estimation of $\gamma(h)$ from a given design $x^{(1)},...,x^{(n)}$ is not an easy task since the
random variables $(Z(x + h) - Z(x))^2$ are not independent and strongly skewed. In
particular, large values may affect the estimation. For this reason, robust estimation is
encouraged. Two estimators were proposed by Cressie and Hawkins (1980) and Genton
(1998). In this paper, we compare the properties of these estimators with a trimmed
mean. Simulations with various amounts of outliers are done, in the same way as
Genton's. We observe that both estimators give similar results, and both are
outperformed by the trimmed mean. In addition, we extend the study by analyzing the
robustness of these estimators to the deviations from normality. To achieve this, a
3-dimensional industrial problem is considered.
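The effect of outliers on variogram estimation at a single lag can be illustrated with a minimal sketch. It takes the increments $Z(x+h) - Z(x)$ as input and compares the classical (Matheron) estimator, the Cressie-Hawkins estimator in its usual form with the bias correction $0.457 + 0.494/N$, and a trimmed mean of squared increments. The trimmed mean shown here is an assumption: it drops only the largest squared increments and applies no bias correction, which may differ from the version studied by the authors. Genton's estimator is not included.

```python
import numpy as np

def matheron(increments):
    """Classical estimator of gamma(h): half the mean squared increment."""
    d = np.asarray(increments, dtype=float)
    return 0.5 * float(np.mean(d ** 2))

def cressie_hawkins(increments):
    """Cressie-Hawkins robust estimator of gamma(h)."""
    d = np.asarray(increments, dtype=float)
    n = d.size
    fourth_power = np.mean(np.sqrt(np.abs(d))) ** 4
    return 0.5 * float(fourth_power / (0.457 + 0.494 / n))

def trimmed_mean(increments, trim=0.1):
    """Half the mean of squared increments after dropping the largest ones.

    Assumption: one-sided trimming without bias correction, so the result
    is biased downwards on clean Gaussian data.
    """
    sq = np.sort(np.asarray(increments, dtype=float) ** 2)
    k = int(np.floor(trim * sq.size))
    return 0.5 * float(np.mean(sq[: sq.size - k]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(0.0, np.sqrt(2.0), size=5000)   # true gamma(h) = 1
    contaminated = clean.copy()
    idx = rng.choice(clean.size, size=250, replace=False)
    contaminated[idx] = rng.normal(0.0, 10 * np.sqrt(2.0), size=250)
    for name, est in [("Matheron", matheron),
                      ("Cressie-Hawkins", cressie_hawkins),
                      ("trimmed mean", trimmed_mean)]:
        print(name, est(clean), est(contaminated))
```

In this toy setting the increments are drawn independently, whereas real variogram increments are dependent, as noted above; the sketch only shows how strongly a few large values inflate the classical estimator.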
References:
Chilès J.-P., Delfiner P. (1999), Geostatistics: Modeling Spatial Uncertainty, Wiley & Sons
Cressie N. (1993), Statistics for Spatial Data, Wiley & Sons
Cressie N., Hawkins D.H. (1980), ''Robust estimation of the variogram: I'', Mathematical
Geology, 12 (2), 115-125
Genton M. (1998), ''Highly Robust Variogram Estimation'', Mathematical Geology, 30 (2), 213-221
Huber P.J. (1977), Robust Statistical Procedures, SIAM
Rousseeuw P.J., Croux C. (1993), ''Alternatives to the Median Absolute Deviation'', JASA, 88
(424), 1273-1283
Santner T.J., Williams B.J., Notz W.I. (2003), The Design and Analysis of Computer
Experiments, Springer
Keywords: Computer experiments, Variogram, Kriging model, Anisotropy, Robustness.

Model-robust designs for assessing the uncertainty of simulator outputs with linear metamodels
Authors: B. Gauthier, L. Carraro, O. Roustant (Ecole des Mines, Saint-Etienne, France)
Primary area of focus / application:
Submitted at 7-Sep-2007 16:37
Accepted
distribution $Y_{sim}(x)$ of the output of a costly simulator when
the inputs $x$ are random variables with known distribution $\mu$.
Due to the computing time, a Monte Carlo method cannot be applied
directly to the simulator but only to an approximate model
$Y_{app}(x)$. This metamodel is built with few experiments
$X =(x^{(1)},..., x^{(n)})$. The question is: how to choose the design
of experiments $X$, so that the distributions of $Y_{app}(x)$ and $Y_{sim}(x)$ are close?
Consider a deterministic simulator. In many situations, it is approached by a linear
combination of known basis functions $g_0,...,g_p$
$$Y_{sim}(x) = \sum_{i=0}^{p}\beta_ig_i(x) + h(x)$$
with $\beta_0,...,\beta_p$ (unknown) real coefficients, and $h$ an unknown function standing for a
model deviation. The corresponding metamodel is:
$$Y_{app}(x) = \sum_{i=0}^{p}\hat{\beta}_ig_i(x) + \eta(x)$$
where, conditionally on spatial random variables, $(\eta(x))$ is a centered Gaussian
process representing the estimation error. The parameters
$\hat{\beta}_0,...,\hat{\beta}_p,\hat{\sigma}^2$ have to be estimated from the $n$ simulator values
calculated for $x \in X$, for instance by ordinary least squares.
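The workflow described so far (fit a linear metamodel on a small design, then run Monte Carlo on the metamodel instead of the simulator) can be sketched as below. The quadratic basis, the toy simulator, the design points and the uniform input distribution are all hypothetical choices made for illustration. The sketch propagates only the fitted mean and ignores the error process $\eta(x)$.

```python
import numpy as np

def basis(x):
    """Illustrative basis g_0(x) = 1, g_1(x) = x, g_2(x) = x^2."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2])

def fit_metamodel(design, y):
    """Ordinary least-squares estimates of the coefficients and sigma^2."""
    G = basis(design)
    beta_hat, *_ = np.linalg.lstsq(G, y, rcond=None)
    resid = y - G @ beta_hat
    dof = len(y) - G.shape[1]
    sigma2_hat = float(resid @ resid / dof) if dof > 0 else 0.0
    return beta_hat, sigma2_hat

def propagate(beta_hat, sampler, n_mc=100_000, seed=0):
    """Monte Carlo mean and variance of the metamodel output under mu."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, n_mc)
    y = basis(x) @ beta_hat
    return float(y.mean()), float(y.var())

def toy_simulator(x):
    """Stand-in for a costly simulator: quadratic trend plus a deviation h."""
    return 1.0 + 2.0 * x + 0.5 * x ** 2 + 0.1 * np.sin(5.0 * x)

if __name__ == "__main__":
    design = np.linspace(-1.0, 1.0, 8)            # n = 8 simulator runs
    beta_hat, sigma2_hat = fit_metamodel(design, toy_simulator(design))
    uniform = lambda rng, n: rng.uniform(-1.0, 1.0, size=n)
    print(beta_hat, sigma2_hat, propagate(beta_hat, uniform))
```

Changing `design` while keeping everything else fixed shows how the choice of experiments affects the propagated mean and variance, which is the question the abstract addresses.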
In this framework, one can compute the two deviations
$E(Y_{app}(x)) - E(Y_{sim}(x))$ and
$var(Y_{app}(x)) - var(Y_{sim}(x))$. We show that under weak conditions on the model deviation
$h$, it is possible to choose $X$ to minimize these quantities. We assume that $h$ belongs to a
reproducing kernel Hilbert space $H$: in usual cases, this only imposes regularity
conditions on $h$. Following Yue and Hickernell (1998), both criteria can be bounded by
expressions depending only on $\|h\|_H$. Optimal designs are then obtained by minimizing
the largest eigenvalue of positive definite matrices. Finally, this methodology is
extended to stochastic simulators of the form
$$Y_{sim}(x) = \sum_{i=0}^p \beta_i g_i(x) + h(x) + \varepsilon(x) $$
where $(\varepsilon(x))$ is a Gaussian process modelling the numerical error.
References:
Carraro L., Corre B., Helbert C., Roustant O., Josserand S. (2007). Optimal designs for the
propagation of uncertainty in computer experiments, Chemometrics and Intelligent
Laboratory Systems, to appear.
Carraro L., Corre B., Helbert C., Roustant O. (2005). Construction d'un critère d'optimalité
pour plans d'expériences numériques dans le cadre de la quantification d'incertitudes,
Revue de Statistique Appliquée.
Santner T.J., Williams B.J., Notz W.I. (2003). The Design and Analysis of Computer
Experiments, Springer.
Wahba G. (1990). Spline Models for Observational Data, SIAM, Philadelphia.
Yue R.X., Hickernell F.J. (1998). Robust designs for fitting linear models with
misspecification, Statistica Sinica 9, p. 1053-1069.
Keywords: Computer experiments, uncertainty propagation, metamodeling, model-robust
designs, reproducing kernel Hilbert space.

Genetic algorithms and grid technologies in clustering
Authors: Cs. Hajas, Zs. Robotka, Cs. Seres and A. Zempléni (Loránd Eötvös University, Budapest, Hungary)
Primary area of focus / application:
Submitted at 7-Sep-2007 20:38
Accepted
Nowadays, very large data sets often have to be processed. Data mining is
definitely an important and rapidly developing area for such problems. In this
presentation we focus on an important part of such work, namely clustering several
thousand objects of high dimensionality.
For the clustering, we used a version of the genetic algorithm. Such algorithms
imitate the natural selection process by randomly pairing candidates for
the best (fittest) clustering, and avoid convergence to a local maximum through rare,
random mutations. In clustering applications the objective function is based on the
sum of the squared distances between all pairs within the clusters, with a suitable
penalty that favours a small number of clusters.
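A minimal sketch of such a scheme is given below. It is not the authors' algorithm: the label-vector encoding, uniform crossover, keeping the fitter half of the population, the per-cluster penalty and all parameter values are assumptions chosen to keep the example short. Fitness is expressed as a cost to be minimized, which is equivalent to maximizing its negative.

```python
import numpy as np

def cost(labels, sq_dist, penalty):
    """Within-cluster sum of squared pairwise distances plus a penalty
    proportional to the number of clusters used."""
    same = labels[:, None] == labels[None, :]
    within = sq_dist[same].sum() / 2.0      # each unordered pair counted once
    return float(within + penalty * np.unique(labels).size)

def genetic_clustering(points, max_k=4, penalty=1.0, pop_size=30,
                       generations=200, mutation_rate=0.02, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Each candidate is a vector of cluster labels, one label per object
    pop = rng.integers(0, max_k, size=(pop_size, n))
    for _ in range(generations):
        costs = np.array([cost(ind, sq_dist, penalty) for ind in pop])
        parents = pop[np.argsort(costs)[: pop_size // 2]]   # fitter half survives
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.choice(len(parents), size=2, replace=False)]
            child = np.where(rng.random(n) < 0.5, a, b)      # uniform crossover
            mutate = rng.random(n) < mutation_rate           # rare random mutations
            child = np.where(mutate, rng.integers(0, max_k, size=n), child)
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    costs = np.array([cost(ind, sq_dist, penalty) for ind in pop])
    best = int(np.argmin(costs))
    return pop[best], float(costs[best])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
                      rng.normal(5.0, 0.3, size=(10, 2))])
    labels, best_cost = genetic_clustering(data, penalty=5.0)
    print(labels, best_cost)
```

Evaluating the cost of each candidate is independent of the others, which is why this kind of algorithm lends itself to distribution over many processors.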
For large datasets and for algorithms that can easily be parallelised, the use of a
grid of computers is a natural, widely used idea. We compared the performance of the
grid-based version of our algorithm with the traditional, single-processor version.
Our database consisted of 10000 images of medium resolution, so the total size was
around 0.5 GB. Such problems may arise in industrial settings as well, such as in
welding processes or in character recognition for applications such as car
manufacturing (see [1]).
The preprocessing constructs a Gaussian Mixture Model (GMM) representation of the
images. The GMMs are estimated with an improved Expectation Maximization (EM)
algorithm that avoids convergence to the boundary of the parameter space, see [2].
Image clustering is done by matching the representations with a distance measure
based on an approximation of the Kullback-Leibler divergence.
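The Kullback-Leibler divergence between two Gaussian mixtures has no closed form, which is why an approximation is needed. One generic option, shown here as a sketch for one-dimensional mixtures, is a Monte Carlo estimate; the abstract does not say which approximation the authors use, so this should be read as an assumption. Each mixture is passed as a hypothetical tuple `(weights, means, stds)` of NumPy arrays.

```python
import numpy as np

def gmm_logpdf(x, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture, via a stable log-sum-exp."""
    x = np.asarray(x, dtype=float)[:, None]
    comp = (np.log(weights)
            - np.log(stds * np.sqrt(2.0 * np.pi))
            - 0.5 * ((x - means) / stds) ** 2)
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True)))[:, 0]

def gmm_sample(rng, n, weights, means, stds):
    """Draw n points: pick a component, then sample from its Gaussian."""
    idx = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[idx], stds[idx])

def kl_monte_carlo(p, q, n=50_000, seed=0):
    """Monte Carlo estimate of KL(p || q) using samples drawn from p."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(rng, n, *p)
    return float(np.mean(gmm_logpdf(x, *p) - gmm_logpdf(x, *q)))

if __name__ == "__main__":
    p = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([0.5, 0.5]))
    q = (np.array([0.3, 0.7]), np.array([-1.0, 2.0]), np.array([0.5, 1.0]))
    print(kl_monte_carlo(p, q), kl_monte_carlo(q, p))
```

The divergence is not symmetric, so a distance measure built on it typically combines both directions, for example by summing them.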
References:
[1] Aiteanu, D.; Ristic, D.; Graser, A.: Content based threshold adaptation for image
processing in industrial application. Control and Automation, 2005 (ICCA '05),
International Conference on, Volume 2, 26-29 June 2005, pp. 1022-1027.
[2] Zs. Robotka and A. Zempléni: Image Retrieval using Gaussian Mixture Models.
SPLST Symposium, Budapest, 2007.