ENBIS-18 in Nancy

2 – 5 September 2018; Ecole des Mines, Nancy (France). Abstract submission: 20 December 2017 – 4 June 2018

The following abstracts have been accepted for this event:

  • Application of the Bayesian Spline Model to Estimate Task-Specific Exposures for Volatile Organic Compounds

    Authors: M. Abbas Virji (National Institute for Occupational Safety and Health), E. Andres Houseman (Consultant)
    Primary area of focus / application: Other: Statistical methods in industrial hygiene
    Keywords: Bayesian, Limit of detection, Non-stationary, Spline model, Time-series
    Submitted at 31-May-2018 15:04 by M. Abbas Virji
    Accepted
    4-Sep-2018 12:00 Application of the Bayesian Spline Model to Estimate Task-Specific Exposures for Volatile Organic Compounds
    There is renewed and growing interest in the development and use of direct-reading instruments that have improved sensitivity, detection limits, specificity, multiplexing capability, and other performance characteristics. Direct-reading instruments are valuable tools for measuring exposure, as they provide real-time measurements for rapid decision making, information on short-term exposure variability for identifying exposure excursions and developing control strategies, and metrics of peak exposure for epidemiologic studies. However, statistical analysis of real-time data is complicated by autocorrelation among successive measurements, non-stationary time series, and left-censoring due to the limit of detection. A Bayesian framework is proposed that addresses these issues in order to model workplace factors that affect exposure and to estimate summary statistics for tasks or other covariates of interest. Specifically, a spline-based approach is used to model non-stationary autocorrelation with relatively few assumptions about the autocorrelation structure. Left-censoring is addressed by integrating over the left tail of the distribution. The model is fit using Markov chain Monte Carlo within a Bayesian paradigm. The method can flexibly account for hierarchical relationships, random effects, and fixed effects of covariates. The method is implemented using the rjags package in R and is illustrated by applying it to real-time total volatile organic compound measurements collected in a hospital setting. The model provides estimates of task means, standard deviations, quantiles (e.g., the 95th percentile), and parameter estimates for covariates that can be used to identify and prioritize control measures or serve as metrics of peak exposure in epidemiologic studies. Ongoing work explores extending the method to analyze multivariate data such as multimodal particle size distributions or multiple specific real-time VOC exposures.
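
    The abstract's model is implemented with rjags in R; purely as an illustration of one ingredient, the sketch below (Python/NumPy, hypothetical function and variable names, not the authors' code) shows how left-censored values can contribute the integral of the density over the left tail, i.e. the normal CDF at the limit of detection, to the likelihood.

    import numpy as np
    from scipy.stats import norm

    def censored_loglik(y, lod, mu, sigma):
        """Gaussian log-likelihood with left-censoring at the limit of detection (lod).

        y     : measurements (e.g. log-transformed VOC readings); values below lod are censored
        mu    : model mean at each time point (e.g. the spline fit), same shape as y
        sigma : residual standard deviation
        """
        y, mu = np.asarray(y, float), np.asarray(mu, float)
        censored = y < lod
        ll = np.empty_like(y)
        # uncensored points: ordinary normal log-density
        ll[~censored] = norm.logpdf(y[~censored], loc=mu[~censored], scale=sigma)
        # censored points: integrate the density over the left tail up to the LOD
        ll[censored] = norm.logcdf(lod, loc=mu[censored], scale=sigma)
        return ll.sum()
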
  • Self-Prediction of Migraine Days: Analysis of Cohort of Migraine Patients Using a Digital Platform

    Authors: Marina Vives-Mestres (Curelator Inc.), Kenneth J. Shulman (Curelator Inc.), Alec Mian (Curelator Inc.), Noah Rosen (Northwell Health)
    Primary area of focus / application: Business
    Secondary area of focus / application: Modelling
    Keywords: Headache, Self-monitoring, Self-prediction, eHealth, mHealth
    Submitted at 1-Jun-2018 11:23 by Marina Vives-Mestres
    Accepted
    4-Sep-2018 09:00 Self-Prediction of Migraine Days: Analysis of Cohort of Migraine Patients Using a Digital Platform
    Background:
    In this study, we examine the ability of 1,537 migraineurs to predict their own attacks 24 hours in advance. Prediction of migraine days might be expected to be difficult, as there is significant confusion about whether a given factor is a premonitory symptom or a trigger, i.e. about cause and effect. Additionally, migraine premonitory symptoms and potential trigger factors show significant inter-individual variation [1]. However, learning to accurately predict migraine attacks may aid self-management of the condition, improve quality of life and allow optimal timing of medication dosing. Moreover, understanding the strategies of good predictors may yield generalizable information that is useful for other individuals.

    Methods:
    Individuals with migraine registered to use Curelator Headache® and then used the digital platform to enter, on a daily basis, lifestyle factors, possible headaches and medications, as well as their migraine expectation for the next 24 hours. For each individual we are interested in four variables: (1) the number of correct migraine day predictions, (2) the number of correct migraine-free day predictions, (3) the number of incorrect migraine day predictions and (4) the number of incorrect migraine-free day predictions. These four variables (a 2×2 contingency table) form a composition (living in a restricted space, the simplex), and a multiple regression on the log-ratio coordinates is fitted, adjusted for covariates (a minimal sketch of these coordinates appears after the references below).

    Results:
    Individuals who predicted better than random differed from those with random predictions in gender (with a greater proportion of females in the non-random group), age, number of tracked days and migraine frequency; in all cases the average is higher in the non-random predictors group. Almost all individuals with non-random predictions have a migraine expectation that is positively associated with migraine occurrence the next day. The retained log-ratio model includes the variables: total migraine days tracked with Curelator, migraine frequency, gender and account type. Good migraine day predictors have a higher migraine frequency than good migraine-free day predictors. Individuals who tracked more migraines overall are worse predictors than those who tracked fewer. Regularly menstruating females use the high/moderate prediction options more often than other females, and paid users incorrectly predict migraine days more often than other users.

    Conclusion:
    Migraine frequency is the most relevant variable explaining migraine predictions; thus it is possible that the strategy of good predictors is simply a priori knowledge of the probability of having a migraine, based on their past experience. The second most relevant variable is the number of tracked migraine days, possibly indicating that individuals who have more trouble managing their condition keep using Curelator for longer. Finally, up to 46 daily factors were included in the simplicial model; they barely improved it, but indicate that individual predictions are based not only on what happens the day before the migraine, but also on a longer-term relation between factors and migraine occurrence.

    References:
    1. Peris F et al. Towards improved migraine management: Determining potential trigger factors in individual patients. Cephalalgia. 2017; 37(5):452-463
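
    As referenced in the Methods, the sketch below illustrates (with made-up counts and a crude zero replacement, not the authors' code) how a four-part composition of prediction counts can be mapped to isometric log-ratio (pivot) coordinates and regressed on covariates.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def ilr_pivot(counts):
        """Pivot ilr coordinates of an (n, D) matrix of non-negative parts."""
        x = np.asarray(counts, dtype=float)
        x = np.where(x == 0, 0.5, x)              # crude replacement of zero counts
        x = x / x.sum(axis=1, keepdims=True)      # closure: project onto the simplex
        n, D = x.shape
        z = np.empty((n, D - 1))
        for i in range(D - 1):
            gm = np.exp(np.mean(np.log(x[:, i + 1:]), axis=1))   # geometric mean of remaining parts
            z[:, i] = np.sqrt((D - i - 1) / (D - i)) * np.log(x[:, i] / gm)
        return z

    # columns: correct migraine-day, correct migraine-free, wrong migraine-day,
    # wrong migraine-free predictions (made-up counts for three individuals)
    counts = np.array([[10, 40, 5, 8], [3, 60, 12, 6], [22, 30, 4, 10]])
    covariates = np.array([[34, 120], [51, 300], [28, 90]])       # e.g. age, tracked days (made up)
    fit = LinearRegression().fit(covariates, ilr_pivot(counts))   # one regression per coordinate
    print(fit.coef_)
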
  • Statistical Analysis to Predict Clinical Outcomes with Complex Physiologic Data

    Authors: Monica Puertas (Instituto para la Calidad - Pontificia Universidad Católica del Peru), Jose Zayas-Castro (University of South Florida), Peter Fabri (University of South Florida)
    Primary area of focus / application: Other: South American session
    Secondary area of focus / application: Mining
    Keywords: Prognostic analysis, Clinical outcomes, ICU patients, Platelet count
    Submitted at 1-Jun-2018 22:25 by Monica Puertas
    Accepted
    4-Sep-2018 14:10 Statistical Analysis to Predict Clinical Outcomes with Complex Physiologic Data
    The assessment and monitoring of the circulatory system is essential for patients in intensive care units (ICUs). One component of this system is the platelet count, which is used to assess blood clotting. However, platelet counts represent a dynamic equilibrium of many simultaneous processes. To characterize the value of dynamic changes in platelet counts, we applied analytic methods to datasets of critically ill patients in (i) a population of ICU cardiac surgery patients and (ii) a heterogeneous group of ICU patients. The objective is to develop a methodology to predict patient outcomes with the first dataset, then redefine the methodology for the more heterogeneous and complex dataset, and finally extend it to other clinical parameters. By providing a dynamic patient profile, the diagnosis could be more accurate and, as a consequence, physicians could anticipate changes in the recovery trajectory and prescribe interventions more effectively, leading to possible healthcare cost reductions and improved patient care.
  • Optimal Bayesian Design via MCMC Simulations for a Soldering Reliability Study

    Authors: Rossella Berni (Department of Statistics, Informatics, Applications -University of Florence)
    Primary area of focus / application: Design and analysis of experiments
    Secondary area of focus / application: Reliability
    Keywords: Optimal experimental design, Bayesian design, Utility function, Reliability
    Submitted at 2-Jun-2018 19:18 by Rossella Berni
    Accepted
    5-Sep-2018 10:00 Optimal Bayesian Design via MCMC Simulations for a Soldering Reliability Study
    Optimal design criteria have recently received growing attention, both at the theoretical and computational levels, in part following the increase in computational power. Since the 1970s there has been a long history of seminal papers in the literature on D- and T-optimality, both for estimating model parameters and for discriminating among models. Furthermore, the construction of optimal designs has been improved in a Bayesian framework by introducing prior distributions on models and parameters, and by selecting the optimal design through the definition and maximization of a utility function, also within a decision-analysis framework.

    Notwithstanding the generality achieved, further flexibility is often needed in actual applications, for example by defining a utility function in which the cost of each observation depends on the value taken by the independent variable. Moreover, the relevance of costs may also be evaluated through specific weights, which take environmental conditions and technological information into account.

    In this talk, we consider improving the construction of optimal designs in the technological field by applying Markov chain Monte Carlo simulations and by evaluating: (i) a hierarchical structure for the observed data; (ii) a utility function including costs and weights; (iii) model discrimination.
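
    As a generic illustration of the second ingredient (a toy model and utility of our own, not the authors' formulation), the sketch below scores candidate designs by Monte Carlo evaluation of a Bayesian expected utility in which each observation's cost depends on the value of the independent variable and is scaled by a weight.

    import numpy as np

    rng = np.random.default_rng(0)

    def expected_utility(design, n_sim=2000, cost_weight=0.05):
        """Average utility over prior draws and simulated experiments (toy example)."""
        design = np.asarray(design, dtype=float)
        X = np.column_stack([np.ones_like(design), design])
        u = np.empty(n_sim)
        for s in range(n_sim):
            beta = rng.normal([1.0, 0.5], [0.5, 0.2])            # draw parameters from an assumed prior
            y = X @ beta + rng.normal(0.0, 0.3, design.size)     # simulate the experiment
            beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
            gain = -np.sum((beta_hat - beta) ** 2)               # toy utility: estimation accuracy
            cost = cost_weight * np.sum(design ** 2)             # cost grows with the covariate value
            u[s] = gain - cost
        return u.mean()

    candidates = [np.linspace(0, 1, 6), np.linspace(0, 2, 6), np.array([0., 0., 1., 1., 2., 2.])]
    best = max(candidates, key=expected_utility)                 # pick the design with highest expected utility
    print(best)
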
  • Deep k-Means: Jointly Clustering with k-Means and Learning Representations

    Authors: Thibaut Thonet (University Grenoble Alpes - LIG)
    Primary area of focus / application: Other: Session on deep learning
    Secondary area of focus / application: Mining
    Keywords: Clustering, Deep learning, k-means, Auto-encoders, Unsupervised learning
    Submitted at 3-Jun-2018 19:58 by Thibaut Thonet
    Accepted
    5-Sep-2018 09:30 Deep k-Means: Jointly Clustering with k-Means and Learning Representations
    In this presentation we study the problem of jointly clustering and learning representations. As several previous studies have shown, learning representations that are both faithful to the data to be clustered and adapted to the clustering algorithm can lead to better clustering performance, all the more so when the two tasks are performed jointly. We propose here such an approach for k-Means clustering, based on a continuous reparametrization of the objective function that leads to a truly joint solution. The behavior of our approach is illustrated on various datasets, showing its efficacy in learning representations for objects while clustering them.
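
    The sketch below (PyTorch, with made-up dimensions and hyperparameters) illustrates the general idea of such joint training, though not necessarily the authors' exact reparametrization: hard k-Means assignments are replaced by a differentiable softmax over distances to learnable cluster centers, so the clustering objective and an autoencoder reconstruction loss can be minimized together by gradient descent.

    import torch
    import torch.nn as nn

    class DeepKMeans(nn.Module):
        def __init__(self, d_in=784, d_hid=256, d_emb=10, k=10):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, d_emb))
            self.decoder = nn.Sequential(nn.Linear(d_emb, d_hid), nn.ReLU(), nn.Linear(d_hid, d_in))
            self.centers = nn.Parameter(torch.randn(k, d_emb))    # learnable cluster representatives

        def loss(self, x, lam=0.1, alpha=10.0):
            z = self.encoder(x)
            recon = ((self.decoder(z) - x) ** 2).mean()           # reconstruction term
            d2 = torch.cdist(z, self.centers) ** 2                # squared distances to centers
            soft = torch.softmax(-alpha * d2, dim=1)              # soft (differentiable) assignments
            kmeans = (soft * d2).sum(dim=1).mean()                # relaxed k-Means objective
            return recon + lam * kmeans

    model = DeepKMeans()
    x = torch.rand(32, 784)                                       # dummy batch
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    opt.zero_grad(); model.loss(x).backward(); opt.step()         # one joint gradient step
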
  • Process Optimization through PLS Model Inversion Using Historical Data (Not Necessarily from DOE)

    Authors: Alberto Ferrer (Universidad Politécnica de Valencia), Daniel Palací-López (Universidad Politécnica de Valencia)
    Primary area of focus / application: Other: Process Chemometrics
    Keywords: Process optimization, Partial Least Squares (PLS), Model inversion, Latent variables
    Submitted at 4-Jun-2018 00:24 by Alberto J. Ferrer-Riquelme
    Accepted
    3-Sep-2018 15:30 Process Optimization through PLS Model Inversion Using Historical Data (Not Necessarily from DOE)
    Process data in modern industry, although they share many of the characteristics of Big Data (i.e. volume, variety, veracity, velocity and value), may not really be that “Big” in comparison to other sectors such as social networks, sales, marketing and finance. However, the complexity of the questions we are trying to answer with industrial process data is very high. Not only do we want to find and interpret patterns in the data and use them for predictive purposes, but we also want to extract meaningful relationships that can be used for troubleshooting and process optimization (García-Muñoz and MacGregor 2016).

    Optimizing a production process requires building a causal model that explains how changes in input variables (e.g. materials and their properties, processing conditions…) relate to changes in the outputs (e.g. amount of product obtained, its quality, purity, value, generated pollutants…). For this purpose, deterministic (i.e. first-principles) models are always desirable. However, the lack of knowledge and the generally large amount of resources required to properly construct such models make their use unfeasible in many cases. This is why data-driven models are often resorted to (Liu and MacGregor 2005, Bonvin et al. 2016).

    To guarantee causality when using data-driven approaches, independent variation in the input variables is required. This could be obtained from a Design of Experiments (DOE) (Box, Hunter and Hunter 2005) performed on the plant; the problem is that this is quite difficult to achieve in practice. By contrast, large amounts of historical plant operating data (highly collinear, low-rank data not from a DOE) are available in most production processes. In these contexts, classical linear regression (LR) or even machine learning (ML) methods cannot be used for process optimization, because none of the infinitely many good prediction models that can be fitted is unique or causal. The problem is that the process variables are highly correlated and the number of independent variations in the process is much smaller than the number of measured variables. This calls for the use of latent variable models such as PLS (Partial Least Squares).

    PLS models are especially suited to handling “Big” data. They assume that the input (X) space and the output (Y) space are not of full statistical rank, so they not only model the relationship between X and Y (as classical LR and ML models do) but also provide models for both the X and Y spaces. This gives them a very nice property: uniqueness and causality in the reduced latent space, no matter whether the data come from a DOE or from the daily production process (historical data) (MacGregor 2018).

    Following the ideas of Jaeckle and MacGregor (2000) and Tomba, Barolo and García-Muñoz (2012), in this talk we illustrate how to guide process optimization by PLS model inversion using real historical data from a petrochemical process (not obtained from a DOE). Open issues will also be discussed.
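
    The sketch below (scikit-learn, synthetic data, hypothetical target values) shows a minimal direct PLS model inversion in the spirit of Jaeckle and MacGregor (2000): fit a PLS model, solve for the latent scores that reproduce a desired output, and map them back to the input space. Null-space exploration and operating constraints, which matter in practice, are omitted.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 8))                                  # historical process data (synthetic)
    Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(200, 3))

    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    pls = PLSRegression(n_components=2, scale=False).fit(X - x_mean, Y - y_mean)

    P = pls.x_loadings_                                            # X-space loadings (X ≈ T P^T)
    Q = pls.y_loadings_                                            # Y-space loadings (Y ≈ T Q^T)
    y_star = np.array([1.0, 0.5, -0.2])                            # desired product quality (hypothetical)

    # solve (y_star - y_mean) ≈ Q @ t_star for the latent scores, then map back to X
    t_star, *_ = np.linalg.lstsq(Q, y_star - y_mean, rcond=None)
    x_star = t_star @ P.T + x_mean                                 # candidate operating conditions
    print(np.round(x_star, 2))
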