ENBIS-7 in Dortmund

24 – 26 September 2007

My abstracts


The following abstracts have been accepted for this event:

  • Local Models in Data Mining

    Authors: Gero Szepannek, Julia Schiffner and Claus Weihs (University of Dortmund, Dortmund, Germany)
    Primary area of focus / application:
    Submitted at 7-Sep-2007 12:40 by
    In classification tasks it may sometimes not be meaningful to build
    single rules on the whole data. This may especially be the case if the
    classes are composed of several subclasses.

    This talk gives an overview of several methods proposed to solve this
    problem. These methods can be subdivided into methods that need the
    subclasses to be specified in advance (see e.g. Weihs et al., 2006) and
    methods that determine the local structure from the data itself in an
    unsupervised manner (see e.g. Hastie et al., 1996 or Czogiel et al.,
    2007). Some new issues are also presented. All methods are evaluated
    and compared on several real-world classification problems.
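
    As a rough illustration of the first family (subclasses specified in advance), the sketch below fits one Gaussian per subclass and assigns a point to the class whose subclasses carry the largest summed weight. The function names, the pooled spherical variance and the toy data are assumptions made for this example; it is not the method of any of the cited papers.

```python
import numpy as np

def fit_local_model(X, subclass):
    """Estimate a mean and a prior for each pre-specified subclass."""
    labels, inv = np.unique(subclass, return_inverse=True)
    means = np.array([X[inv == k].mean(axis=0) for k in range(labels.size)])
    priors = np.bincount(inv) / inv.size
    # Assumption: one pooled spherical variance shared by all subclasses.
    var = ((X - means[inv]) ** 2).sum() / X.size
    return labels, means, priors, var

def predict_class(X, model, subclass_to_class):
    """Assign each row of X to the class with the largest summed subclass weight."""
    labels, means, priors, var = model
    # Squared distance of every point to every subclass mean.
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_w = np.log(priors)[None, :] - d2 / (2.0 * var)
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    classes = np.array([subclass_to_class[s] for s in labels])
    class_labels = np.unique(classes)
    scores = np.stack([w[:, classes == c].sum(axis=1) for c in class_labels], axis=1)
    return class_labels[scores.argmax(axis=1)]

# Toy data: class "A" has two separated subclasses, class "B" lies between them,
# so no single linear rule on the whole data separates A from B.
rng = np.random.default_rng(0)
centers = {"a1": (-3.0, 0.0), "a2": (3.0, 0.0), "b1": (0.0, 0.0)}
subclass_to_class = {"a1": "A", "a2": "A", "b1": "B"}
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers.values()])
subclass = np.repeat(list(centers.keys()), 100)
y = np.array([subclass_to_class[s] for s in subclass])

model = fit_local_model(X, subclass)
accuracy = (predict_class(X, model, subclass_to_class) == y).mean()
```

    The accuracy here is computed on the training data only, which is enough to show the mechanics but not to compare methods.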


    Czogiel, I., Luebke, K., Zentgraf, M., Weihs, C. (2007): Localized
    Linear Discriminant Analysis. In: Decker, R., Lenz, H., Gaul, W. (eds):
    Advances in Data Analysis, Springer-Verlag, Heidelberg, 133-140.

    Hastie, T., Tibshirani, R., Friedman, J. (1996): Discriminant Analysis
    by Gaussian Mixtures, JRSS B 58, 158-176.

    Weihs, C., Szepannek, G., Ligges, U., Luebke, K. and Raabe, N. (2006):
    Local Models in Register Classification by Timbre. In: V. Batagelj,
    H. Bock, A. Ferligoj and A. Ziberna (eds): Data Science and Classification,
    Springer-Verlag, Heidelberg, 315-322.
  • Load Shedding: a new proposal

    Authors: R. Faranda, A. Pievatolo and E. Tironi
    Primary area of focus / application:
    Submitted at 7-Sep-2007 13:31 by
    During overloads in the mains, the load curtailment applied to
    interruptible loads is often the only solution to keep the network in
    operation. Normally, in contingencies, the difference between the power
    absorbed and the power produced is very low, often less than 1% of the
    latter. Therefore if all the loads participated in the load shedding
    program, the discomfort would be minimal, considering its usually short
    duration. According to this point of view, we present a new approach to
    the load shedding program to guarantee the correct electrical system
    operation by increasing the number of participants. This new load
    control strategy is named Distributed Interruptible Load Shedding
    (DILS). Indeed, it is possible to split every user's load into
    interruptible and uninterruptible parts, and to operate on the
    interruptible part only. The optimal load reduction request is found by
    minimizing the expected value of an appropriate cost function, thus
    taking the uncertainty about the power absorbed by each customer into
    account.
    Presently, several users such as hospitals, data centres, supermarkets,
    universities, industries, etc. might be very interested in typical
    shedding programs as a way to save money on their electricity bills.
    However, in the future, when the domotic power plants are likely to be
    used widely, the distributors could interest the end users in
    participating in DILS programs for either economic or social reasons.
    By adopting the DILS program, the distributors can resort to the
    interruptible loads not only in case of emergency conditions but also
    during normal and alert operations.

    Key Words - Black out, Demand Side Management, Load Shedding,
    Interruptible Load, Stochastic Approximation, Uncertain System
  • On-line diagnostics tools in the Mobile Spatial coordinate Measuring System (MScMS)

    Authors: Franceschini F., Galetto M., Maisano D., Mastrogiacomo L.
    Primary area of focus / application:
    Submitted at 7-Sep-2007 15:58 by
    Keywords: mobile measuring system, wireless sensor networks, dimensional measurements, diagnostics, localization algorithms, physical and model redundancy.
    Mobile Spatial coordinate Measuring System (MScMS) is a wireless-sensor-network based system developed at the industrial metrology and quality engineering laboratory of DISPEA – Politecnico di Torino. It has been designed to perform simple and rapid indoor dimensional measurements of large-size volumes.
    It is made up of three basic parts: a “constellation” of wireless devices (Crickets), freely distributed around the working area; a mobile probe to register the coordinate points of the measured object (using the constellation as a reference system); and a PC to store the data sent via Bluetooth by the mobile probe and to process them using purpose-built application software written in Matlab. The Crickets and the mobile probe use ultrasound (US) transceivers to communicate and to evaluate mutual distances.
    The system makes it possible to calculate the position, in terms of spatial coordinates, of the object points “touched” by the probe. The acquired data are then available for different types of processing (determination of distances, curves or surfaces of measured objects).
    To protect against causes of error such as US signal diffraction and reflection, external uncontrolled US sources (key jingling, neon blinking, etc.), or unacceptable solutions from the software algorithms, MScMS implements statistical tests for on-line diagnostics. Three of them are analyzed in this paper: “energy model diagnostics”, based on the “mass-spring system” localization algorithm; “distance model diagnostics”, based on a distance reference standard embedded in the system; and “sensor physical/model diagnostics”, based on the redundancy of the Crickets’ US transceivers. For each measurement, if all these tests are passed, the measured result is considered acceptable at a specific confidence level. Otherwise, the measurement is rejected.
    This paper, after a general description of the MScMS, focuses on the description of these three on-line diagnostic tools. Some preliminary results of experimental tests carried out on the system prototype in the industrial metrology and quality engineering laboratory of DISPEA – Politecnico di Torino are also presented and discussed.
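
    The idea of locating the probe from measured distances and rejecting inconsistent measurements can be sketched as follows. This is a generic Gauss-Newton least-squares fit with a residual check, not the “mass-spring” algorithm or any of the three MScMS tests; the beacon layout, the function name `locate` and the 1 cm threshold are hypothetical values chosen for the example.

```python
import numpy as np

def locate(beacons, dists, n_iter=50):
    """Gauss-Newton least-squares fit of a 3-D position to measured distances."""
    p = beacons.mean(axis=0)                      # start at the beacon centroid
    for _ in range(n_iter):
        diff = p - beacons
        ranges = np.linalg.norm(diff, axis=1)
        jac = diff / ranges[:, None]              # derivative of each range w.r.t. p
        step, *_ = np.linalg.lstsq(jac, dists - ranges, rcond=None)
        p = p + step
    residuals = np.linalg.norm(p - beacons, axis=1) - dists
    return p, np.sqrt(np.mean(residuals ** 2))

# Hypothetical constellation (metres); six beacons give redundancy for 3 unknowns.
beacons = np.array([[0, 0, 0], [4, 0, 3], [4, 4, 0],
                    [0, 4, 3], [2, 2, 3], [0, 2, 3]], dtype=float)
truth = np.array([1.0, 2.5, 1.0])
clean = np.linalg.norm(truth - beacons, axis=1)

THRESHOLD = 0.01                                  # assumed RMS acceptance limit (m)

p_ok, rms_ok = locate(beacons, clean)
corrupted = clean.copy()
corrupted[0] += 0.5                               # e.g. a reflected ultrasound path
p_bad, rms_bad = locate(beacons, corrupted)

accepted_ok = rms_ok < THRESHOLD
accepted_bad = rms_bad < THRESHOLD
```

    The check only works because there are more distances than unknowns: with exactly three beacons the residual would be zero for any set of measurements.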
  • Robust estimation of the variogram in computer experiments

    Authors: O. Roustant, D. Dupuy, C. Helbert (Ecole des Mines, Saint-Etienne, France)
    Primary area of focus / application:
    Submitted at 7-Sep-2007 16:04 by Olivier Roustant
    This article deals with the estimation of the spatial correlation of kriging models in
    computer experiments. Coming from geostatistics, the kriging model is a Gaussian
    stochastic process
    $$Y(x) = m(x) + Z(x)$$
    where $x$ is a $d$-dimensional vector, $m(x)$ is a deterministic trend, and $Z(x)$ a stationary
    centered stochastic Gaussian process with spatial correlation function $R(h)$. Both trend
    and spatial correlation should be estimated from data. However, this is not the case in
    computer experiments, since a specific parametric form for $R$ is assumed. The most
    common choice is the anisotropic power-exponential function:
    $$R(h) = \exp\left(-\sum_{k=1}^d \theta_k |h_k|^{p_k}\right), \quad 0 < p_k \leq 2, \; k=1,\dots,d$$

    This contrasts with geostatistics where the spatial correlation is estimated through the
    variogram:
    $$2\gamma(h) = \mathrm{var}(Z(x+h)-Z(x))$$
    Defined for intrinsic processes, the variogram is equivalent to $R(h)$ for stationary
    processes. Using the variogram instead of the correlation function is recommended even
    if the process is stationary, because of possible contamination by the trend estimate.
    The estimation of $\gamma(h)$ from a given design $x^{(1)},...,x^{(n)}$ is not an easy
    task, since the random variables $(Z(x + h) - Z(x))^2$ are not independent and are
    strongly skewed. In particular, large values may affect the estimation. For this reason,
    robust estimation is encouraged. Two estimators were proposed by Cressie and Hawkins
    (1980) and Genton (1998). In this paper, we compare the properties of these estimators
    with those of a trimmed mean. Simulations with various amounts of outliers are carried
    out, in the same way as Genton's. We observe that both estimators give similar results,
    and both are outperformed by the trimmed mean. In addition, we extend the study by
    analyzing the robustness of these estimators to deviations from normality. To achieve
    this, a 3-dimensional industrial problem is considered.
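
    To make the estimators concrete, the sketch below computes the classical (Matheron) estimate, the Cressie-Hawkins estimate and a plain upper-trimmed mean of the squared increments at one lag. The 1-D grid, the white-noise process, the 10% trimming fraction and the 3% contamination are assumptions for the example; Genton's estimator is left out, and the trimmed mean carries no consistency correction, so this does not reproduce the simulation study of the abstract.

```python
import numpy as np

def squared_increments(x, z, h, tol):
    """Squared differences of z over all pairs whose 1-D lag is within tol of h."""
    i, j = np.triu_indices(len(x), k=1)
    keep = np.abs(np.abs(x[i] - x[j]) - h) <= tol
    return (z[i][keep] - z[j][keep]) ** 2

def matheron(sq):
    """Classical estimate of gamma(h): half the mean squared increment."""
    return 0.5 * sq.mean()

def cressie_hawkins(sq):
    """Cressie-Hawkins (1980) estimate based on square-root absolute increments."""
    n = sq.size
    return 0.5 * np.mean(sq ** 0.25) ** 4 / (0.457 + 0.494 / n)

def trimmed(sq, frac=0.1):
    """Half the mean of the squared increments after dropping the largest ones."""
    cut = np.quantile(sq, 1.0 - frac)
    return 0.5 * sq[sq <= cut].mean()   # no bias correction: a naive sketch

rng = np.random.default_rng(0)
x = np.arange(400, dtype=float)
z = rng.normal(size=x.size)             # white noise, so gamma(h) = 1 for h > 0
z_bad = z.copy()
z_bad[rng.choice(x.size, size=12, replace=False)] = 10.0   # 3% gross outliers

sq = squared_increments(x, z_bad, h=1.0, tol=1e-9)
est_matheron = matheron(sq)
est_ch = cressie_hawkins(sq)
est_trim = trimmed(sq)
```

    On contaminated data of this kind the classical estimate is inflated most, because a single outlier enters two squared increments at full weight.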

    Chilès J-P., Delfiner P. (1999), Geostatistics. Modeling Spatial Uncertainty, Wiley & Sons

    Cressie N. (1993), Statistics for Spatial Data, Wiley & Sons

    Cressie N., Hawkins D.H. (1980), ''Robust estimation of the variogram: I'', Mathematical
    Geology, 12 (2), 115-125

    Genton M. (1998), ''Highly Robust Variogram Estimation'', Mathematical Geology, 30 (2), 213-

    Huber P.J. (1977), Robust Statistical Procedures, SIAM

    Rousseeuw P.J., Croux C. (1993), ''Alternatives to the Median Absolute Deviation'', JASA, 88
    (424), 1273-1283

    Santner T.J., Williams B.J., Notz W.I. (2003). The Design and Analysis of Computer
    Experiments, Springer.

    Keywords: Computer experiments, Variogram, Kriging model, Anisotropy, Robustness.
  • Model-robust designs for assessing the uncertainty of simulator outputs with linear metamodels

    Authors: B. Gauthier, L. Carraro, O. Roustant (Ecole des Mines, Saint-Etienne, France)
    Primary area of focus / application:
    Submitted at 7-Sep-2007 16:37 by
    This article addresses the industrial problem of quantifying the
    distribution of the output $Y_{sim}(x)$ of a costly simulator when
    the inputs $x$ are random variables with known distribution $\mu$.
    Due to the computing time, a Monte Carlo method cannot be applied
    directly to the simulator but only to an approximate model
    $Y_{app}(x)$. This metamodel is built from a few experiments
    $X =(x^{(1)},..., x^{(n)})$. The question is: how should the design
    of experiments $X$ be chosen so that the distributions of $Y_{app}(x)$ and $Y_{sim}(x)$ are close?

    Consider a deterministic simulator. In many situations, it is approximated by a linear
    combination of known basis functions $g_0,...,g_p$:

    $$Y_{sim}(x) = \sum_{i=0}^{p}\beta_ig_i(x) + h(x)$$

    with $\beta_0,...,\beta_p$ (unknown) real coefficients, and $h$ an unknown function standing for a
    model deviation. The corresponding metamodel is:

    $$Y_{app}(x) = \sum_{i=0}^{p}\hat{\beta}_ig_i(x) + \eta(x)$$

    where, conditionally on spatial random variables,
    $( \eta (x))$ is a centered Gaussian
    process representing the estimation error. The parameters
    $\hat{\beta}_0,...,\hat{\beta}_p,\hat{\sigma}^2$ have to be
    estimated with the $n$ simulator values calculated for
    $x \in X$ , for instance by ordinary
    In this framework, one can compute the two spreads
    $|E(Y_{app}(x))-E(Y_{sim}(x))|$ and
    $|var(Y_{app}(x))-var(Y_{sim}(x))|$. We show that under weak conditions on the model deviation
    $h$, it is possible to choose $X$ to minimize these quantities. We assume that $h$ belongs to a
    reproducing kernel Hilbert space $H$: in usual cases, this only implies regularity
    conditions on $h$. Following Yue and Hickernell (1998), both criteria can be bounded by
    expressions depending only on
    $||h||_H$. Optimal designs are then obtained by minimizing
    the largest eigenvalue of positive definite matrices. Finally, this methodology is
    extended to stochastic simulators of the form

    $$Y_{sim}(x) = \sum_{i=0}^p \beta_i g_i(x) + h(x) + \varepsilon(x) $$

    where $(\varepsilon(x))$ is a Gaussian process modelling the numerical error.
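
    The workflow described above (fit a linear metamodel on a small design, then run Monte Carlo on the metamodel) can be sketched as below. Everything concrete is an assumption for illustration: the toy `simulator`, the quadratic `basis`, the uniform input distribution, the equispaced design and the use of ordinary least squares. The Gaussian error term $\eta(x)$ is omitted and no design optimisation is performed.

```python
import numpy as np

def basis(x):
    """Hypothetical basis g_0, g_1, g_2 = 1, x, x^2 for a scalar input."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

def simulator(x):
    """Toy stand-in for the costly simulator: linear part plus a deviation h(x)."""
    return 1.0 + 2.0 * x - 0.5 * x ** 2 + 0.05 * np.sin(5.0 * x)

rng = np.random.default_rng(1)

# Design X: n = 7 simulator runs (equispaced here, not an optimised design).
design = np.linspace(-1.0, 1.0, 7)
beta_hat, *_ = np.linalg.lstsq(basis(design), simulator(design), rcond=None)

# Monte Carlo on the metamodel, with inputs drawn from mu (assumed uniform).
x_mc = rng.uniform(-1.0, 1.0, size=100_000)
y_app = basis(x_mc) @ beta_hat

# Evaluating the simulator on the same sample is only affordable for a toy.
y_sim = simulator(x_mc)
mean_gap = abs(y_app.mean() - y_sim.mean())
var_gap = abs(y_app.var() - y_sim.var())
```

    The two gaps are the quantities the abstract proposes to control through the choice of design; changing `design` while keeping the same seven-run budget shows how sensitive they are to it.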


    Carraro L., Corre B., Helbert C., Roustant O., Josserand S. (2007). Optimal designs for the
    propagation of uncertainty in computer experiments, Chemometrics and Intelligent
    Laboratory Systems, to appear.

    Carraro L., Corre B., Helbert C., Roustant O. (2005). Construction d'un critère d'optimalité
    pour plans d'expériences numériques dans le cadre de la quantification d'incertitudes,
    Revue de Statistique Appliquée.

    Santner T.J., Williams B.J., Notz W.I. (2003). The Design and Analysis of Computer
    Experiments, Springer.

    Wahba G. (1990). Spline Models for Observational Data, SIAM, Philadelphia.

    Yue R.-X., Hickernell F.J. (1998). Robust designs for fitting linear models with
    misspecification, Statistica Sinica 9, p. 1053-1069.

    Keywords: Computer experiments, uncertainty propagation, metamodeling, model-robust
    designs, reproducing kernel Hilbert space.
  • Genetic algorithms and grid technologies in clustering

    Authors: Cs. Hajas, Zs. Robotka, Cs. Seres and A. Zempléni (Loránd Eötvös University, Budapest, Hungary)
    Primary area of focus / application:
    Submitted at 7-Sep-2007 20:38 by
    Nowadays, very large data sets often have to be processed. Data mining is
    definitely an important and rapidly developing area for such problems. In this
    presentation we focus on an important part of such work, namely clustering several
    thousand objects of high dimensionality.

    For the clustering, we used a version of the genetic algorithm. Such algorithms
    imitate the natural selection process by random coupling of pairs of candidates for
    the best (fittest) clustering and avoid convergence to a local maximum by rare,
    random mutations. In clustering applications the objective function is based on the
    sum of the squared distances between all pairs within the clusters, with a suitable
    penalty term that favours a small number of clusters.

    For large datasets and algorithms that can easily be parallelised, the use of a
    grid of computers is a natural, widely used idea. We compared the performance of the
    grid-based version of our algorithm to the traditional, single-processor version.
    Our database consisted of 10000 images of medium resolution, so the total size was
    around 0.5GB. Such problems may arise in industrial settings as well, such as in
    welding processes or in character recognition for applications such as car
    manufacturing (see [1]).

    The preprocessing constructs a Gaussian Mixture Model (GMM) representation of the
    images. The GMMs are estimated with an improved Expectation Maximization (EM)
    algorithm that avoids convergence to the boundary of the parameter space, see [2].
    Image clustering is done by matching the representations with a distance-measure,
    based on the approximation of the Kullback-Leibler divergence.
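
    A minimal sketch of a genetic algorithm for clustering, in the spirit of the description above: candidates are label vectors, parents are coupled at random, mutations are rare, and the cost is the within-cluster sum of squared pairwise distances plus a penalty per cluster. The penalty value, population size, crossover scheme and Euclidean toy data are all assumptions for this example; the authors' algorithm, its grid implementation and the GMM-based image distance are not reproduced.

```python
import numpy as np

def cost(X, labels, penalty):
    """Within-cluster sum of squared pairwise distances plus a per-cluster penalty."""
    used = np.unique(labels)
    total = 0.0
    for c in used:
        pts = X[labels == c]
        d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
        total += d2.sum() / 2.0                    # count each unordered pair once
    return total + penalty * used.size

def evolve(X, k_max=4, pop_size=30, n_gen=100, p_mut=0.02, penalty=5.0, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, k_max, size=(pop_size, X.shape[0]))
    history = []
    for _ in range(n_gen):
        costs = np.array([cost(X, ind, penalty) for ind in pop])
        history.append(costs.min())
        parents = pop[np.argsort(costs)[: pop_size // 2]]   # keep the fitter half
        n_child = pop_size - len(parents)
        # Random coupling: uniform crossover between randomly paired parents.
        a = parents[rng.integers(0, len(parents), n_child)]
        b = parents[rng.integers(0, len(parents), n_child)]
        children = np.where(rng.random(a.shape) < 0.5, a, b)
        # Rare random mutations reassign a point to a random cluster.
        mutate = rng.random(children.shape) < p_mut
        children = np.where(mutate, rng.integers(0, k_max, children.shape), children)
        pop = np.vstack([parents, children])
    costs = np.array([cost(X, ind, penalty) for ind in pop])
    history.append(costs.min())
    return pop[costs.argmin()], history

# Toy data: two well-separated groups of ten points each.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal((0.0, 0.0), 0.3, size=(10, 2)),
               data_rng.normal((5.0, 5.0), 0.3, size=(10, 2))])
best_labels, history = evolve(X)
```

    Because the fitter half of each generation is carried over unchanged, the best cost never increases from one generation to the next. Evaluating `cost` for each candidate is independent of the others, which is the step that lends itself to parallel execution.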


    [1] Aiteanu, D., Ristic, D., Graser, A.: Content based threshold adaptation for image
    processing in industrial application. International Conference on Control and
    Automation (ICCA '05), Volume 2, 26-29 June 2005, 1022-1027.

    [2] Zs. Robotka and A. Zempléni: Image Retrieval using Gaussian Mixture Models.
    SPLST Symposium, Budapest, 2007.