ARTICLE IN PRESS
JOURNAL OF FOOD COMPOSITION AND ANALYSIS
Journal of Food Composition and Analysis 17 (2004) 125–132 www.elsevier.com/locate/jfca
Original Article
Probability weighting of data in small samples to facilitate nutritional interpretation
G.P. Sevenhuysen*
Department of Human Nutritional Sciences, University of Manitoba, Winnipeg, Manitoba, Canada R3T 2N2
Received 17 May 2002; received in revised form 21 July 2003; accepted 22 September 2003
Abstract
Food content or dietary intake is commonly calculated from very small numbers of estimates, and the distributions are often skewed. If the distribution of the component or intake is skewed, the mean value of a sample may not represent the health effect correctly. A mean weighted towards the probability of more frequently occurring estimates should give a better indication of risk. Using 281 data values from a skewed food component distribution, ten random samples were created for each of 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50 and 100 randomly selected food component estimates per sample. The mean and a probability mean were calculated for each sample. The probability mean tended towards the population mean as the number of food component estimates per sample increased, but the two measures are different. It appears that using both the mean and the probability mean provides more information for the nutritional interpretation of data.
© 2003 Elsevier Ltd. All rights reserved.
Keywords: Mean; Probability mean; Variability; Exposure; Food composition; Dietary intake
*Tel.: +1-204-474-9556; fax: +1-204-474-7592. E-mail address: [email protected] (G.P. Sevenhuysen).
0889-1575/$ - see front matter © 2003 Elsevier Ltd. All rights reserved. doi:10.1016/j.jfca.2003.09.003
1. Introduction
The mean represents the average content of a sample of amount values, such as the repeated estimates of amount for a nutrient in samples of the same food, or estimates of daily food intake. The mean is the correct representation of these known values regardless of whether the distribution of values is skewed or symmetrical. The mean content of these known values can be compared with nutrient amounts that are known to have importance for health, such as the daily intakes recommended for a healthy diet. However, the mean does not represent the most likely amount of the nutrient in the next, unknown estimate if the distribution of estimates is skewed. If the distribution is skewed, several
of the amount values occur more frequently than the mean, and they have a higher probability of occurring in future estimates. Since many food component estimates (Pennington, 1993; Holden, 1994), and many daily intakes (McDowdl et al., 1994), show skewed distributions, it is necessary to represent exposure to food components with a different statistical measure than the mean. If the component or intake is skewed in the direction of higher risk of disease, using the mean may lead to incorrect interpretations of health effect. For example, repeated analysis of the fat in a commonly eaten food may show more high fat values than low fat values. The next portion of food is then likely to contain more fat than is indicated by the mean.
Assessing exposure is inherent in the way food composition or dietary record data are used. The values in food tables are used to predict the unknown content of a sample of food. Similarly, the nutrient content of a diet is used to predict the likely nutrient content of future days of intake. If the distribution of estimates is skewed, the mean does not represent the likely exposure to nutrients from future samples.
The minimum value of the component in the food or diet can represent exposure in cases of ‘zero-tolerance’. In other cases, the median, or another percentile, might be used to represent exposure. However, percentiles do not reflect the distribution of estimates. The selected percentile would have to be close to the mode, or other frequently occurring estimates, in order to be valid. It is only possible to select the correct percentile when the distribution of estimates is known, and in practice it is not known. Food composition data, and daily intakes, are commonly calculated from very few estimates. Many food component amounts are calculated from fewer than ten repeat analyses. Most dietary assessments do not use more than a few days of food intake for an individual.
Samples of estimates that are too small to determine the underlying distribution, or to know whether it is skewed, make it difficult to find “the next few estimates most likely to be observed”. It is not appropriate to assume a common underlying distribution. For example, in the case of food composition data, the influences of growing, storage or processing conditions on the composition of a food make the distribution of estimates uncertain. Similarly, individual dietary intakes vary from day to day due to physical, social and psychological influences, which change the distribution of food and nutrient intake.
This paper addresses the issue of predicting the most likely content of future estimates when the number of samples is very small. A statistical representation of estimates is presented that should be used together with the mean to improve the interpretation of food composition and dietary intake data.
2. Materials
The compositional data used represent erucic acid in rapeseed oil and were taken from a Slovakian food composition database. The estimates represented a number of different varieties of rapeseed, including high erucic acid varieties, grown under a variety of climatic and soil conditions. The data showed a negatively skewed distribution. Equally skewed distributions, both negative and positive, have been observed for many other food components in other foods. Similarly, food intake data are often equally skewed. Hence the data were regarded as an example of skewed data commonly available in nutrition research.
The number of analytical estimates was 281. The highest value was six times larger than the smallest value. The skewness was −1.07. All of the data were values obtained by chemical analysis in a number of different laboratories and compiled by the Food Research Institute in Bratislava.
3. Methods
3.1. Most likely estimate
The probability of observing a food component amount is directly related to its frequency in the distribution of amounts for that component. The mode is the most frequently occurring amount. But the mode is not the most likely value of the next estimate to be observed, unless more than 50% of all estimates in the distribution have that same value. The mean is not the most likely value of the next estimate to be observed for the same reason, even if the estimates are normally distributed. Neither is the median the next most likely value, because the median does not represent all estimates in the distribution. The median depends only on the total number of estimates and the value of one of them, and it is independent of all other values in the distribution.
The mean is influenced by the values of the estimates, not the relative distance between each of the values. The mean is influenced by the respective numbers of estimates with high or low values, but not by the extent to which these values happen to be grouped. The distances between estimates are related to their frequency of occurrence in the underlying distribution. Probability theory predicts that the most frequently occurring estimates in the underlying distribution are observed more often than other estimates in the sample. These estimates will have smaller distances to their neighbors than estimates that occur less often in the underlying distribution. Hence, the mean reflects the absolute content of the values, but not the probability that any might occur again.
Weighting the mean by the distances between values adds a different aspect: the probability that the value might occur again. The “most likely value of the next estimate to be observed” was defined as the mean of estimate amount values in a sample, where each amount value is weighted by its probability of occurring again, i.e., the distance to its neighboring values. The effect is to increase the amount values that occur close together and decrease the values that are far apart, respectively representing the center and the extremes of the underlying distribution.
3.2. Most likely estimate of small samples
In very small samples, for example fewer than ten estimates, it is not possible to identify the underlying distribution. Hence the most frequently occurring estimates are not known. However, in small samples two types of data are available:
(a) The values of the estimates (in unit of measurement), and
(b) The distances between the values (same unit as the estimate).
The distances can be used to identify the estimates in the sample that occur most frequently in the underlying, unknown distribution. A distance can therefore be used to weight each estimate, so that the estimates occurring most frequently in the underlying distribution will contribute to a
greater extent to the mean of the sample estimates. The distance associated with each estimate is the average of the two distances to the adjacent estimates with the next lower and next higher values, respectively. For the lowest and highest estimate values in the sample, only the distance to the one adjacent estimate is used. Formula 1 was used to calculate the distances when all estimates were sorted in ascending order by estimate value.
Formula 1: Distance associated with each estimate
$$ d_{x_i} = \frac{|x_i - x_{i-1}| + |x_{i+1} - x_i|}{2} $$
where $x_i$ is the amount value of the $i$th estimate. The distances associated with the lowest or highest estimates in the sample are calculated using only one of the two absolute-value terms, as appropriate.
The distribution of distances differs from the distribution of amount values. The median of the distances of estimates in the sample separates the 50% of estimates that are likely to occur more frequently from other estimates. Fifty percent of estimates were seen as sufficient to capture the center of the distribution of estimates in the sample. Amount values associated with a zero distance, i.e., estimates with the exact same values, are increased to the greatest extent. Amount values associated with the maximum distance observed among the distances in the sample are decreased more than other amount values. Amount values associated with the median distance of all distances in the sample would not change. The changes in amount values are shown in Formula 2.
Formula 2: Weighting factor
$$ wf = (d_{med} - d_{x_i}) $$
where $d_{med}$ is the median of all distances in the sample and $d_{x_i}$ is the average distance between the estimate and its two adjacent estimates.
The weighting factor represents the probability that the associated amount value will occur in the next unknown estimate. The weight is added to the amount value, instead of multiplied, for two reasons. The weight associated with any one amount value does not depend on the distribution of weights in the sample, as would a proportion or probability value.
The weight is in the same unit as the amount value. The value of the weight is negative for distances greater than the median distance and positive for distances smaller than the median distance. The magnitude of a weight can be larger than the amount value with which it is associated, for example in the case of an extreme low amount value separated by a large distance from the nearest higher amount value. In this case, the weighted amount value would be a negative number. Yet meaningful amount values cannot be less than zero, and a negative amount value could not be used to calculate a valid weighted mean. The weighting factor is subtracted from the largest weight, or maximum distance among all distances, observed in the sample to avoid negative values. Since the effect of subtracting the weighting factor from the maximum distance is to add a proportion of the maximum distance to each amount value, the maximum distance is subtracted from the sum of all weighted values, as shown in Formula 3, which gives the calculation of the probability mean.
3.3. Calculations performed
Calculations were made on 120 samples that were randomly selected from the 281 food composition estimates. Twelve groups of ten random samples were created, one group for each of 5, 6, 7, 8, 9,
10, 15, 20, 30, 40, 50 and 100 randomly selected food component estimates per sample. The mean, probability mean (PM), standard deviation (SD) and standard error (SEM) were calculated for each sample (Excel, Microsoft Office 2000).
Formula 3: Probability mean (PM)
$$ PM = \frac{\sum \left[ x_i + \left( d_{max} + (d_{med} - d_{x_i}) \right) \right] - d_{max}}{n} $$
where $x_i$ is the amount value of the $i$th estimate, $d_{max}$ is the maximum distance among all distances in the sample, $d_{med}$ is the median of all distances in the sample, $d_{x_i}$ is the average distance between the estimate and its two adjacent estimates, and $n$ is the number of estimates in the sample. The distribution of the 281 estimates is shown in Fig. 1.
Fig. 1. Distribution of 281 estimates of component amounts. [Histogram: frequency against food component amounts (g), with the mean and median marked.]
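As an illustration only, Formulas 1 to 3 can be sketched in Python as follows. The original calculations were performed in Excel, so the function names here are hypothetical. Two points are assumptions of this sketch rather than statements of the method: the distance for the lowest and highest estimates is taken as the full distance to the single adjacent estimate, and the maximum distance is subtracted once from the sum of the weighted values before dividing by n.

```python
from statistics import median


def neighbor_distances(values):
    """Formula 1: distance associated with each estimate, after sorting."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("at least two estimates are needed")
    distances = []
    for i, x in enumerate(xs):
        if i == 0:
            # Lowest estimate: only the distance to the next higher value (assumption).
            d = xs[1] - xs[0]
        elif i == n - 1:
            # Highest estimate: only the distance to the next lower value (assumption).
            d = xs[-1] - xs[-2]
        else:
            d = ((x - xs[i - 1]) + (xs[i + 1] - x)) / 2
        distances.append(d)
    return xs, distances


def weighting_factors(distances):
    """Formula 2: wf = d_med - d_xi for every estimate."""
    d_med = median(distances)
    return [d_med - d for d in distances]


def probability_mean(values):
    """Formula 3, read as: (sum of weighted values - d_max) / n (assumption)."""
    xs, distances = neighbor_distances(values)
    d_max = max(distances)
    wf = weighting_factors(distances)
    weighted = [x + (d_max + w) for x, w in zip(xs, wf)]
    return (sum(weighted) - d_max) / len(xs)


# Hypothetical sample of six estimates, not taken from the study data.
sample = [38, 44, 50, 51, 52, 53]
print(sum(sample) / len(sample), probability_mean(sample))
```

For this hypothetical sample the closely spaced values 51, 52 and 53 receive positive weighting factors, while 38, 44 and 50 receive negative ones, which is the behavior described for Formula 2.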
4. Results
Fig. 2 shows that the value of the mean is similar for samples of estimates of all sizes. The probability mean was always higher than the mean, but the probability mean varied by sample size. It was highest for the smallest sample sizes and lowest for the largest sample size (see Table 1). For the smallest sample sizes, the value of the probability mean was similar to the mode, amount value 52, of the underlying distribution. The progressive decrease of the probability mean with increasing sample size was due to the larger samples including more of the lower amount values. As a result, the probability mean tended towards the mean as the number of estimates per sample increased. Since the distribution from which the samples were taken was negatively skewed, the probability mean was expected to be larger than the mean.
The variability of the mean and the probability mean was similar, so that the precision of these two measures appears to be comparable for all sample sizes. The formula of the probability mean includes the median and maximum distances, which will differ between samples, and these distances may increase the variability of the probability mean. The results show that this effect was negligible or small for larger sample sizes.
Fig. 2. Probability mean and mean of estimate values by sample size. [Plot: estimate values (mg) against sample size, showing the probability mean, the mean ± SEM, and the median of the 281 data points.]
Table 1
Probability mean and mean of estimate values by sample size

Sample size                  Probability mean   SD of ten sample means   Mean   SD of ten sample means
5                            52.6               5.83                     41.4   4.44
6                            52.4               4.21                     44.5   4.84
7                            51.4               4.16                     42.4   3.21
8                            50.9               2.86                     43.7   3.34
9                            50.4               4.93                     40.9   2.85
10                           50.3               5.16                     43.3   4.81
15                           49.6               3.59                     44.1   3.12
20                           49.0               3.07                     43.4   1.95
30                           48.3               3.43                     42.8   1.80
40                           47.5               2.39                     42.1   1.43
50                           47.2               2.40                     42.2   1.40
100                          46.7               1.32                     42.9   0.65
281 (all available values)   44.9               —                        43.0   —
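The sampling design behind a table of this kind can be sketched as follows. The 281 original values are not reproduced in this paper, so the sketch draws from a hypothetical left-skewed population. The names make_population and summarise, the seed, and the choice of sampling without replacement are all illustrative assumptions, and the probability mean follows one reading of Formula 3 (maximum distance subtracted once from the sum of weighted values). The numbers it prints are therefore not the values in Table 1.

```python
import random
from statistics import mean, median, stdev

SAMPLE_SIZES = [5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 100]


def probability_mean(values):
    """Probability mean under the assumed reading of Formulas 1 to 3."""
    xs = sorted(values)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    # End points use their single gap; interior points average two gaps.
    d = [gaps[0]] + [(g1 + g2) / 2 for g1, g2 in zip(gaps, gaps[1:])] + [gaps[-1]]
    d_med, d_max = median(d), max(d)
    weighted_sum = sum(x + d_max + (d_med - di) for x, di in zip(xs, d))
    return (weighted_sum - d_max) / len(xs)


def make_population(rng, size=281):
    """Hypothetical left-skewed amounts between 10 and 60 (not the study data)."""
    return [max(10.0, 60.0 - rng.gammavariate(2.0, 6.0)) for _ in range(size)]


def summarise(population, rng, samples_per_size=10):
    """Ten random samples per sample size; mean and SD of the sample statistics."""
    rows = []
    for n in SAMPLE_SIZES:
        samples = [rng.sample(population, n) for _ in range(samples_per_size)]
        means = [mean(s) for s in samples]
        pms = [probability_mean(s) for s in samples]
        rows.append({
            "n": n,
            "pm": mean(pms), "sd_pm": stdev(pms),
            "mean": mean(means), "sd_mean": stdev(means),
        })
    return rows


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = make_population(rng)
table = summarise(population, rng)
for row in table:
    print(f"{row['n']:>4} {row['pm']:6.1f} {row['sd_pm']:5.2f} "
          f"{row['mean']:6.1f} {row['sd_mean']:5.2f}")
```

Each printed row has the same columns as Table 1: sample size, probability mean with its SD over ten samples, and mean with its SD over ten samples.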
5. Discussion
The skew in the food composition data used in this study meant that the probability mean was higher than the mean. With a greater skew, the difference between the mean and probability mean would be larger. Had the distribution been normal, the two measures would have been the same. The probability mean, influenced by the most frequently occurring estimates, was found to decrease with increasing sample size. This mathematical behavior matches the conceptual basis of the probability mean. With only a few estimates available, and an unknown underlying distribution, it is difficult to interpret the health importance of the data. It is logical to emphasize the estimates that appear to be associated with the peak of the distribution.
However, as more data become available, the peak of the distribution will be seen in the context of the whole distribution. The peak will lose emphasis when estimates in other parts of the distribution are found to have appreciable frequencies. In this case, the reduction of the probability mean follows the intuitive expectation of a relatively more important mean.
For an appropriate interpretation of health effect, it may be important to emphasize the peak even if it represents less than half the estimates in the distribution. For example, by subtracting distances from the 70th percentile distance instead of the median distance, the most frequently occurring estimates that constitute maybe 30% of the estimates can be selected, instead of the 50% selected in this paper. Or, the weighting factor could be calculated using the 80th percentile distance, representing 20% of the estimates.
The probability mean is relevant to food composition tables, the majority of which present mean values of component data. Previous work on the variability of food composition data (unpublished summaries of Slovak archival food composition data), and the variability of food intake data (McDowdl et al., 1994), suggests that the mean, median and mode can differ by amounts that are large enough to affect the interpretation of potential health effects. Invalid interpretation may lead to problems in estimating exposure to food components and representing the adequacy of a diet reported by an individual. In both cases, health professionals may make inaccurate decisions that could affect the effectiveness of planning or service to clients. In the majority of cases, published food composition data and previous estimates of daily intakes are used to predict the content of food and diets. If the available data are calculated from small samples and the underlying distributions are not known, then the mean is not sufficient for an accurate prediction of health effects.
The interpretation of the data would be improved by having the probability mean available for the same data. In the case of very small samples neither the mean nor the probability mean is a particularly accurate measure. But it is precisely in such cases that having both measures available will improve the interpretation of data. The preferred alternative, which is to generate enough data, will be too expensive under most circumstances. For example, resources are unlikely to be available for repeating the analysis for a food component several hundred times in order to describe the distribution, including an adequate representation of the extreme estimates. Similarly, it is too expensive to measure the diet of a non-institutionalized person over the 40–50 consecutive days needed to define the variability of their intake.
If the mean and the probability mean are calculated for every food component in a food composition database, users can compare the two measures. If the two are the same, then the underlying distribution is likely to be symmetrical, or possibly statistically ‘normal’. If the two differ, then the underlying distribution is likely to be skewed. In this case, the user who wants to represent the content of untested food samples or future diets would select the probability mean as a more valid measure of content of these samples and diets.
6. Conclusion
Food composition tables need to offer both the mean and the probability mean for every food component. The measures provide additional description of the data, in the same way that other measures have been proposed (Rand et al., 1987). Similarly, users of food intake data should have
the mean and probability mean available. Users can select the probability mean as the measure to represent the content of new and unknown estimates and thereby increase the validity of their nutritional interpretations of the data.
Acknowledgements
The author gratefully acknowledges the assistance of Dr. K. Holcikova, Food Research Institute, Bratislava, Slovakia, in reviewing food composition data files.
References
Holden, J.M., 1994. Sampling strategies to assure representative values in food composition data. Food, Nutrition and Agriculture, No. 12. Food and Agriculture Organization, Rome, Italy.
McDowdl, M.A., Briefel, R.R., Alaimo, K., Bischof, A.M., Caughman, C.R., Carroll, M.D., Loria, C.M., Johnson, C.L., 1994. Energy, Macronutrient intakes of persons aged 2 and over, NHANES 1988–1991. Advance Data No. 255, NCHS, USDHH, Washington, USA.
Pennington, J.A.T., 1993. Nutrient Variation. Proceedings of the 18th National Nutrient Database Conference. May 23–26, Boston MA, USA.
Rand, W.M., Windham, C.T., Wyse, B.W., Young, V.Y., 1987. Food Composition Data: A User’s Perspective. United Nations University Press, Tokyo, Japan.