Statistical Computing SIG

This SIG is concerned with the application of computing to statistics, particularly the use of high performance and grid/cloud approaches. It also covers the use of computer models in wider science, where statistical approaches are used, sometimes unwittingly, and are often embedded deeply within the codes.

The motivation is:

  1. Statistical and computational sciences are pervasive analytical tools used across all science, industry and society.
  2. Almost all major scientific and industrial Grand Challenges today are 'statistics under the hood', and this must affect business and industry as well as leading edge research. See the examples in Part 2, Table 1 of this EPSRC document.
  3. Data driven methods have become endemic, generally with little regard to statistical issues.
  4. With all this data being recorded, statisticians are at the mercy of managers and policy makers who assume that the more data you have, the easier it is to see the wood for the trees. How wrong they are! The more data you have, the more complex the model often needs to be because not all things can be hidden behind Central Limit Theorems etc.
  5. Data expands at a much faster rate than chip processing power, with a doubling time of typically one year rather than the two years taken for junctions per sq. cm.
  6. Complex data takes an even bigger amount of processing power, as the time taken for a simple least squares calibration of an NxP dataset is typically N^2.e^P - and that is ignoring additional variances, non-normal distributions and multivariate solutions.
  7. Chip speed per core is likely to decrease to save energy, with the extra junctions going into multi-core processors - i.e. relatively fine grained shared memory HPC. These are fine where there are a lot of minor processes running, but unless software is specifically written and parallelised, possibly in a new-generation language, there will be little benefit for a single heavy process other than rescuing your PC when it hangs.
  8. Array processors may help - e.g. accelerators or GPGPUs (hopefully with error correction) - but these still require specialist compilation and handling.

Therefore we need to ensure that statistical computing has the tools, or your PC will run no faster in 10 years than it does today on single process applications.

Put it another way - 10 years ago we bought commodity PCs with 200MHz chips and 500MB hard drives. Today we buy commodity PCs with 3GHz chips and 500GB hard drives. In BogoMIPS terms, we have gone from a few hundred to about 5000 million instructions per second - about 20 times - while disks have increased by 1000 times. Say no more.
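As a rough check on those figures, the short Python sketch below turns the quoted clock speeds and disk sizes into growth ratios and implied doubling times. It assumes smooth exponential growth over exactly 10 years, which is a simplification, and the function name doubling_time is purely illustrative.

```python
import math


def doubling_time(ratio, years):
    """Years needed to double, assuming steady exponential growth by `ratio` over `years`."""
    return years * math.log(2) / math.log(ratio)


# Figures quoted in the text for commodity PCs ten years apart.
clock_ratio = 3e9 / 200e6    # 3GHz vs 200MHz chips
disk_ratio = 500e9 / 500e6   # 500GB vs 500MB drives

print(f"clock speed grew {clock_ratio:.0f}x, doubling every "
      f"{doubling_time(clock_ratio, 10):.1f} years")
print(f"disk size grew {disk_ratio:.0f}x, doubling every "
      f"{doubling_time(disk_ratio, 10):.1f} years")
```

On these numbers, disk capacity doubles about once a year while clock speed takes roughly two and a half years, which is in line with the doubling times given in the motivation.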

In 10 years' time we may predict that your PC will have 100s of TB drives and 32 or 64 core chips with much the same clock speed as we have today. Today's HPC will be tomorrow's desktop. We need to ensure that the software will work optimally.
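To illustrate what "specifically written and parallelised" can mean for a statistical job, here is a minimal Python sketch that spreads a bootstrap of the sample mean across several processes using the standard library. The bootstrap is chosen only as a convenient example, not as a method the SIG prescribes, and the function names, the number of tasks and the seeding scheme are all assumptions.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor


def bootstrap_chunk(args):
    """Return the means of n_reps bootstrap resamples of data, seeded per task."""
    data, n_reps, seed = args
    rng = random.Random(seed)  # separate generator for each task (illustrative seeding)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_reps)]


def split_reps(total_reps, n_tasks):
    """Split total_reps into n_tasks near-equal chunk sizes."""
    base, extra = divmod(total_reps, n_tasks)
    return [base + (1 if i < extra else 0) for i in range(n_tasks)]


def parallel_bootstrap(data, total_reps=2000, n_tasks=4, seed=0):
    """Run the bootstrap chunks in separate processes and pool the resampled means."""
    sizes = split_reps(total_reps, n_tasks)
    tasks = [(data, reps, seed + i) for i, reps in enumerate(sizes)]
    with ProcessPoolExecutor(max_workers=n_tasks) as pool:
        chunks = list(pool.map(bootstrap_chunk, tasks))
    return [mean for chunk in chunks for mean in chunk]
```

Calling parallel_bootstrap(sample) on a list of numbers returns the pooled bootstrap means, whose standard deviation estimates the standard error of the mean. On platforms that start worker processes by re-importing the script, the call needs to sit under an `if __name__ == "__main__":` guard. The point of the sketch is that the work only spreads over the cores because it was split into independent tasks by hand; the serial version would use one core however many are available.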

In the UK we are looking to set up a Collaborative Computational Project. Interested people can look at the talks given at a meeting in Manchester on 24th July 2008 where some of these issues were addressed. We also spoke at a lunchtime meeting at the recent RSS Conference in Nottingham but that meeting was not very well attended.

We are keen to include international participation and while the issue is wider than industrial statistics - indeed wider than statistics itself - ENBIS is a suitable platform for working with pan-European industry.

The current SIG leader is John Logsdon, but as he cannot be in Athens, a meeting this year cannot be scheduled unless someone is able to chair it. Email me if you are able to do this (j.logsdon at quantex-research.com).

We hope to have more to report by the Stockholm conference and will keep members posted.