ENBIS-18 in Nancy

2 – 6 September 2018; École des Mines, Nancy (France). Abstract submission: 20 December 2017 – 4 June 2018

A Comparison of Determining the Number of Components of a PLS Regression for MAR Mechanism

3 September 2018, 14:20 – 14:40

Submitted by: Titin Agustin Nengsih
Authors: Titin Agustin Nengsih (University of Strasbourg)
Abstract
Missing data are endemic in business and industry research and have been a pervasive problem in data analysis since the origin of data collection. Several methods have been developed for handling incomplete data. Imputation is the process of substituting values for missing data before estimating the relevant model parameters. PLS (Partial Least Squares) regression is a multivariate model whose parameters can be estimated by either of two algorithms (SIMPLS or NIPALS); it has been used extensively in business and industry research because of its effectiveness in analyzing causal relationships between several components. However, little has been written on how to handle missing data when using a PLS regression. The NIPALS algorithm has the interesting property of being able to provide estimates on incomplete data. Selecting the number of components needed to build a representative PLS regression model is an important problem. Fitting the number of components of a PLS regression on an incomplete data set raises the question of model validation, which is generally addressed through cross-validation. The number of components is determined using one of several criteria, such as the Q2 criterion, the Akaike Information Criterion (AIC), or the Bayesian Information Criterion (BIC).
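
As a rough illustration of how such a criterion can drive component selection, the sketch below computes a cross-validated Q2 for successive numbers of components and keeps components while Q2 stays above the conventional 0.0975 threshold. It assumes complete data, scikit-learn's NIPALS-based PLSRegression, and synthetic inputs; these choices are illustrative and do not come from the paper itself.

```python
# Minimal sketch (assumed tooling: scikit-learn's PLSRegression, synthetic data).
# Q2_h = 1 - PRESS_h / RSS_{h-1}, with PRESS from cross-validation and RSS from
# the full-data fit with h-1 components.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def q2_per_component(X, y, max_components=10, n_splits=10, seed=0):
    """Return Q2_h for h = 1..max_components."""
    y = np.asarray(y, dtype=float).ravel()
    rss_prev = np.sum((y - y.mean()) ** 2)          # RSS_0: intercept-only model
    q2 = []
    for h in range(1, max_components + 1):
        press = 0.0
        for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
            model = PLSRegression(n_components=h).fit(X[train], y[train])
            press += np.sum((y[test] - model.predict(X[test]).ravel()) ** 2)
        q2.append(1.0 - press / rss_prev)
        # RSS_h from the model refitted on the complete sample
        rss_prev = np.sum((y - PLSRegression(n_components=h)
                           .fit(X, y).predict(X).ravel()) ** 2)
    return np.array(q2)

# Keep components while Q2_h stays above the usual 0.0975 threshold.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)
q2 = q2_per_component(X, y, max_components=5)
n_components = int(np.argmax(q2 < 0.0975)) if np.any(q2 < 0.0975) else len(q2)
print(q2, n_components)
```

The same cross-validated predictions can be reused to compute AIC- or BIC-type criteria; only the penalty term changes.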

The goal of our simulation study is to analyze the impact of the proportion of missing data, under a missing at random (MAR) assumption, on the estimation of the number of components of a PLS regression. We compare six criteria for selecting the number of components, applied both to PLS regression with the NIPALS algorithm (NIPALS-PLSR) on incomplete data and to PLS regression on data sets completed with three imputation methods: multiple imputation by chained equations (MICE), k-nearest neighbor imputation (KNNimpute), and singular value decomposition imputation (SVDimpute). The criteria are Q2-LOO, Q2-10-fold, AIC, AIC-DoF, BIC, and BIC-DoF, evaluated for proportions of missing data ranging from 5% to 50% under the MAR assumption. Our simulation study shows that, whatever the criterion used, the correct number of components of a PLS regression is difficult to determine, especially for small sample sizes and when the proportion of missing data exceeds 30%. MICE comes closest to the correct number of components at each proportion of missingness, although it requires a very long execution time; NIPALS-PLSR ranks second, followed by KNNimpute and SVDimpute. For every criterion except Q2-LOO, the selected number of components is far from the true one, and tolerance to incomplete data sets depends on the sample size, the proportion of missing data, and the chosen component selection method.
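
To make the simulation setup more concrete, the sketch below generates MAR-type missingness (the probability that one variable is missing depends on another, fully observed variable) and fills the gaps with rough scikit-learn analogues of the three imputation methods: IterativeImputer as a MICE-style imputer, KNNImputer for KNNimpute, and a small hand-rolled iterative SVD for SVDimpute. The data, parameters, and implementations are illustrative assumptions, not the authors' code.

```python
# Sketch of the imputation step under assumed tooling (scikit-learn analogues).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

def svd_impute(X, rank=2, n_iter=50):
    """Fill NaNs by iteratively projecting onto a rank-k SVD approximation."""
    X = np.array(X, dtype=float)
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)   # start from column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled[mask] = approx[mask]                      # only update missing cells
    return filled

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# MAR-style missingness: the chance that column 1 is missing depends on column 0.
p_miss = 0.4 / (1 + np.exp(-X[:, 0]))
X_miss = X.copy()
X_miss[rng.random(100) < p_miss, 1] = np.nan

X_mice = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X_miss)
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_miss)
X_svd = svd_impute(X_miss, rank=2)
```

Each completed matrix would then be fed to the component-selection step above, so that the number of components retained under each criterion can be compared with the true one used to simulate the data.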