ENBIS: European Network for Business and Industrial Statistics
ENBIS9 Goteborg
20 – 24 September 2009
Abstract submission: 1 February – 31 May 2009
Validation of an experimental method where subjective visual inspection is crucial
22 September 2009, 11:40 – 12:10
Abstract
- Submitted by
- Magnus Pettersson
- Authors
- Magnus Pettersson
- Affiliation
- Statistikkonsulterna, Göteborg, Sweden
- Abstract
- Designing experiments in which the measurement is a subjective rating made by an operator through visual inspection brings certain problems into focus. Three critical properties are sought in the experiment, where the first two correspond to the reliability and the last to the validity of the method:
Intra-rater repeatability, i.e. the agreement between repeated ratings of the same item by the same rater
Inter-rater reproducibility, i.e. the agreement between different raters on the same item
Unbiasedness, i.e. the agreement between the ratings given, across raters and items, and the correct rating
We will study an example where pictures are taken with a scanning electron microscope (SEM) at 3 randomly chosen spots on a test item. The items are rated into one of four classes, where classes 2 and 3 are “approved” and classes 1 and 4 are “rejected”. Test items for classes 1 and 4 were produced and treated to fall into those classes. Test items for classes 2 and 3 were randomly selected from the production process. Each operator rated each item once in a randomized, blinded experiment.
The intra-rater reliability can be estimated by letting the rater re-rate the same item and comparing the agreement within rater. However, a major drawback of doing so can be the cost of re-rating, or the fact that the test destroys the item. A separate test, using pictures taken by a non-participating operator, was used for re-rating. Each rater was given a set of pictures, where each picture was duplicated before randomization. The disadvantage of using this as the sole method is that the surface examined is not completely homogeneous, so the sample area examined differs from one rating occasion to the next. Further, since the rating of one item relies on the examination of at least three pictures, the re-rating test does not fully resemble the true method.
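As an illustration of how agreement on the duplicated pictures could be summarized, the sketch below computes Cohen's kappa between a rater's first and second rating of the same pictures. The abstract does not state which agreement measure was used, so kappa is an assumption here, and the function name `cohen_kappa` and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Cohen's kappa for two equally long sequences of class labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Observed proportion of pictures given the same class on both passes
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance from the marginal class frequencies
    count_first = Counter(first)
    count_second = Counter(second)
    expected = sum(count_first[c] * count_second[c] for c in count_first) / (n * n)
    if expected == 1.0:
        # Only one class was ever used, so kappa is undefined;
        # this sketch reports full agreement in that case.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (classes 1-4) of 8 duplicated pictures by one rater
pass_1 = [1, 2, 2, 3, 3, 4, 2, 3]
pass_2 = [1, 2, 3, 3, 3, 4, 2, 2]
kappa = cohen_kappa(pass_1, pass_2)
print(round(kappa, 3))
```

In this made-up example the rater agrees with themselves on 6 of 8 pictures, and both disagreements fall between the approved classes 2 and 3.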
The inter-rater reproducibility can be estimated by letting a number of raters examine the same items. In the method used, each rater bases the classification of an item on at least three pictures. Since the items can be (at least in some of the classes) more heterogeneous than expected, the rating will depend on the sample areas selected for examination. This will lead to an underestimation of the inter-rater reproducibility.
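One common way to summarize agreement when several raters classify the same items is Fleiss' kappa. The sketch below is a minimal implementation under the assumption that every item is rated once by the same number of raters; the measure, the function name `fleiss_kappa` and the ratings are illustrative assumptions, not taken from the study.

```python
def fleiss_kappa(ratings, classes=(1, 2, 3, 4)):
    """Fleiss' kappa; ratings holds one list of class labels per item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # counts[i][j]: number of raters who put item i into class j
    counts = [[item.count(c) for c in classes] for item in ratings]
    # Per-item agreement: share of rater pairs that chose the same class
    per_item = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    mean_agreement = sum(per_item) / n_items
    # Chance agreement from the overall share of ratings in each class
    class_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(classes))
    ]
    chance = sum(p * p for p in class_share)
    if chance == 1.0:
        return 1.0  # a single class was used throughout; kappa is undefined
    return (mean_agreement - chance) / (1 - chance)


# Hypothetical data: 4 items, each rated once by 3 raters
item_ratings = [[1, 1, 1], [2, 2, 3], [3, 3, 3], [4, 4, 4]]
kappa_between = fleiss_kappa(item_ratings)
print(round(kappa_between, 3))
```

Note that a disagreement caused by raters looking at different sample areas of a heterogeneous item lowers this figure in exactly the same way as a genuine difference in judgement, which is the underestimation described above.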
The unbiasedness (validity) of the test depends on the ability of a rater to rate an item correctly. Items for classes 1 and 4 are deliberately prepared so that they belong to these classes, respectively. However, the distinction between classes 2 and 3 cannot be made without an examination of the same kind as the one being evaluated, thereby reaching a “Catch 22”. The solution tried in this evaluation was to collect all pictures from all operators and randomly select a limited number, which were then rated by all operators in plenum. If a unanimous decision was not reached, another picture was added until a decision was made. The consensus thus reached is defined as the correct answer, and the agreement with the correct answer is used as a measure of validity. Using several samples from each item, it will also be possible to estimate the inter-picture variation, i.e. the heterogeneity within an item.
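A minimal sketch of how agreement with the consensus rating might be tabulated is given below. It reports both exact class agreement and agreement on the approve/reject decision (classes 2 and 3 approved, as stated above). The function name `validity_summary`, the rater labels and all ratings are hypothetical.

```python
APPROVED = {2, 3}  # classes 2 and 3 are "approved", 1 and 4 "rejected"


def validity_summary(ratings, consensus):
    """Share of ratings matching the consensus class and the consensus decision.

    ratings: dict mapping rater name -> list of class labels, one per item
    consensus: list of consensus class labels, one per item
    """
    total = exact = same_decision = 0
    for labels in ratings.values():
        if len(labels) != len(consensus):
            raise ValueError("each rater must rate every item exactly once")
        for given, correct in zip(labels, consensus):
            total += 1
            exact += given == correct
            # Same approve/reject outcome even if the class differs
            same_decision += (given in APPROVED) == (correct in APPROVED)
    return exact / total, same_decision / total


# Hypothetical consensus and individual ratings for 5 items
consensus = [1, 2, 3, 4, 2]
ratings = {"A": [1, 2, 3, 4, 3], "B": [1, 2, 2, 4, 2]}
exact_share, decision_share = validity_summary(ratings, consensus)
print(exact_share, decision_share)
```

Here every disagreement with the consensus is a confusion between classes 2 and 3, so the exact agreement is below 100% while the approve/reject decision is never wrong.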
One major problem is that if the items are easy to rate, the variation among raters, both within and between items, will be small, and the agreement is likely to be 100%. Since defective items will be rare, it also follows that near-defective items will be rare in a true production sample. Meanwhile, the requirements on the properties of the method have to be specified for the items that are close to being rejected. This “difficult” region has to be defined to make an evaluation of the validation method possible: the evaluation has to be based on the worst cases, and any other evaluation will overestimate the agreement. In the presentation we will present some analysis of the prior probability of the item distribution and the probability of making a false rating. Since the true prior distribution of the false rating probability, given the items, is unknown, it will be crucial that the sample of items is selected randomly.
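The overestimation can be illustrated with a small calculation based on the law of total probability: the overall probability of a correct rating is the average of the per-group probabilities, weighted by how common each group of items is. The shares and probabilities below are invented for illustration only and are not results from the study; the function name `overall_agreement` is likewise hypothetical.

```python
def overall_agreement(strata):
    """Weighted probability of a correct rating.

    strata: list of (share_of_items, probability_of_correct_rating) pairs
    """
    total_share = sum(share for share, _ in strata)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("item shares must sum to 1")
    return sum(share * p_correct for share, p_correct in strata)


# Assumed mix: 95% easy items always rated correctly,
# 5% difficult (near-reject) items rated correctly 70% of the time
production_mix = [(0.95, 1.00), (0.05, 0.70)]
difficult_only = [(1.00, 0.70)]

mixed = overall_agreement(production_mix)
worst_case = overall_agreement(difficult_only)
print(round(mixed, 3), round(worst_case, 3))
```

Under these assumed numbers the production mix suggests an agreement of about 98.5%, although the method is right only 70% of the time on the items where the decision actually matters.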