How to Summarize Data




How to Summarize Data: Introduction







An investigator collecting data generally has two goals: to obtain descriptive information about the population from which the sample was drawn and to test hypotheses about that population. We focus here on the first goal: to summarize data collected on a single variable in a way that best describes the larger, unobserved population.




When the value of the variable associated with any given individual is more likely to fall near the mean (average) value for all individuals in the population under study than far from it and equally likely to be above the mean and below it, the mean and standard deviation for the sample observations describe the location and amount of variability among members of the population. When the value of the variable is more likely than not to fall below (or above) the mean, one should report the median and values of at least two other percentiles.
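To make the second rule concrete, the following minimal sketch summarizes a small set of skewed values with the median and two other percentiles. The data are hypothetical, and the choice of the 25th and 75th percentiles (via the standard library's `statistics.quantiles`, Python 3.8 or later) is an assumption made for illustration; the rule above only asks for at least two other percentiles.

```python
import statistics

# Hypothetical, right-skewed measurements (illustrative values only).
values = [1, 1, 2, 2, 3, 3, 4, 5, 8, 21]

mean_value = statistics.mean(values)
median_value = statistics.median(values)

# Count how many values fall below the mean; for skewed data this is
# well over half, so the mean is a poor guide to a "typical" value.
below_mean = sum(1 for v in values if v < mean_value)

# Three cut points dividing the data into quarters:
# 25th percentile, median, 75th percentile.
quartiles = statistics.quantiles(values, n=4)

print("mean:", mean_value)
print("median:", median_value)
print("values below the mean:", below_mean, "of", len(values))
print("25th and 75th percentiles:", quartiles[0], quartiles[2])
```

Here 7 of the 10 values lie below the mean, which is the situation in which the median and percentiles describe the data better than the mean and standard deviation.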




To understand these rules, assume that we observe all members of the population, not only a limited (ideally representative) sample as in an experiment.




For example, suppose we wish to study the height of Martians and, to avoid any guesswork, we visit Mars and measure the entire population—all 200 of them. Figure 2-1 shows the resulting data with each Martian’s height rounded to the nearest centimeter and represented by a circle. There is a distribution of heights of the Martian population. Most Martians are between about 35 and 45 cm tall, and only a few (10 out of 200) are 30 cm or shorter, or 50 cm or taller.





Figure 2-1.



Distribution of heights of 200 Martians, with each Martian’s height represented by a single point. Notice that any individual Martian is more likely to have a height near the mean height of the population (40 cm) than far from it and is equally likely to be shorter or taller than average.





Having successfully completed this project and demonstrated the methodology, we submit a proposal to measure the height of Venusians. Our record of good work assures funding, and we proceed to make the measurements. Following the same conservative approach, we measure the heights of all 150 Venusians. Figure 2-2 shows the measured heights for the entire population of Venus, using the same presentation as Figure 2-1. As on Mars, there is a distribution of heights among members of the population, and all Venusians are around 15 cm tall, almost all of them being taller than 10 cm and shorter than 20 cm.





Figure 2-2.



Distribution of heights of 150 Venusians. Notice that although the average height and dispersion of heights about the mean differ from those of Martians (Fig. 2-1), they both have a similar bell-shaped appearance.





Comparing Figures 2-1 and 2-2 demonstrates that Venusians are shorter than Martians and that the variability of heights within the Venusian population is smaller. Whereas almost all (194 of 200) of the Martians’ heights fall in a range 20 cm wide (30 to 50 cm), the analogous range for Venusians (144 of 150) is only 10 cm wide (10 to 20 cm). Despite these differences, there are important similarities between these two populations. In both, any given member is more likely to be near the middle of the population than far from it and equally likely to be shorter or taller than average. In fact, despite the differences in population size, average height, and variability, the shapes of the distributions of heights of the inhabitants of both planets are almost identical. A most striking result!




Three Kinds of Data







The heights of Martians and Venusians are known as interval data because heights are measured on a scale with constant intervals, in this case, centimeters. For interval data, the absolute difference between two values can always be determined by subtraction.* The difference in heights of Martians who are 35 and 36 cm tall is the same as the difference in height of Martians who are 48 and 49 cm tall. Other variables measured on interval scales include temperature (because a 1°C difference always means the same thing), blood pressure (because a 1 mmHg difference in pressure always means the same thing), and weight.




There are other data, such as gender, state of birth, or whether or not a person has a certain disease, that are not measured on an interval scale. These variables are examples of nominal or categorical data, in which individuals are classified into two or more mutually exclusive and exhaustive categories. For example, people could be categorized as male or female, dead or alive, or as being born in one of the 50 states, District of Columbia, or outside the United States. In every case, it is possible to categorize each individual into one and only one category. In addition, there is no arithmetic relationship or even ordering between the categories.




Ordinal data fall between interval and nominal data. Like nominal data, ordinal data fall into categories, but there is an inherent ordering (or ranking) of the categories. Level of health (excellent, very good, good, fair, or poor) is a common example of a variable measured on an ordinal scale. The different values have a natural order, but the differences or “distances” between adjoining values on an ordinal scale are not necessarily the same and may not even be comparable. For example, excellent health is better than very good health, but this difference is not necessarily the same as the difference between fair and poor health. Indeed, these differences may not even be strictly comparable.




For the remainder of this chapter, we will concentrate on how to describe interval data, particularly how to describe the location and shape of the distributions. Because of the similar shapes of the distributions of heights of Martians and Venusians, we will reduce all the information in Figures 2-1 and 2-2 to a few numbers, called parameters, of the distributions. Indeed, since the shapes of the two distributions are so similar, we only need to describe how they differ; we do this by computing the mean height and the variability of heights about the mean.




* Relative differences can only be computed when there is a true zero point. For example, height has a true zero point, so a Martian that is 45 cm tall is 1.5 times as tall as a Martian that is 30 cm tall. In contrast, temperature measured in degrees Celsius or Fahrenheit does not have a true zero point, so it would be inaccurate to say that 100°C is twice as hot as 50°C. However, the Kelvin temperature scale does have a true zero point. Interval data that have a true zero point are called ratio data. The methods we will be developing only require interval data.




Variables measured on a nominal scale in which there are only two categories are also known as dichotomous variables.




We will present the corresponding approaches for nominal (in Chapters 5 and 11) and ordinal data (in Chapter 10). The basic principles are the same for all three kinds of data.




The Mean







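To indicate the location along the height scale, define the population mean to be the average height of all members of the population. Population means are often denoted by μ, the Greek letter mu. When the population is made up of discrete members,

$$\text{Population mean} = \frac{\text{Sum of values (e.g., heights) for each member of the population}}{\text{Number of population members}}$$

The equivalent mathematical statement is

$$\mu = \frac{\sum X}{N}$$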





in which Σ, Greek capital letter sigma, indicates the sum of the values of the variable X for all N members of the population. Applying this definition to the data in Figures 2-1 and 2-2 yields the result that the mean height of Martians is 40 cm and the mean height of Venusians is 15 cm. These numbers summarize the qualitative conclusion that the distribution of heights of Martians is higher than the distribution of heights of Venusians.
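As a minimal sketch of this definition, the function below divides the sum of the values by the number of population members. The function name `population_mean` and the six heights are hypothetical, chosen only so the arithmetic is easy to follow; they are not the data plotted in Figure 2-1.

```python
def population_mean(values):
    """Return mu = (sum of X) / N for a fully observed population."""
    n = len(values)
    if n == 0:
        raise ValueError("population must have at least one member")
    return sum(values) / n


# Hypothetical heights in cm (illustrative only).
heights_cm = [35, 38, 40, 40, 42, 45]

mu = population_mean(heights_cm)
print(mu)  # 240 / 6 = 40.0
```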




Measures of Variability







Next, we need a measure of dispersion about the mean. A value an equal distance above or below the mean should contribute the same amount to our index of variability, even though in one case the deviation from the mean is positive and in the other it is negative. Squaring a number makes it positive, so let us describe the variability of a population about the mean by computing the average squared deviation from the mean. The average squared deviation from the mean is larger when there is more variability among members of the population (compare the Martians and Venusians). It is called the population variance and is denoted by σ², the square of the lowercase Greek letter sigma. Its precise definition for populations made up of discrete individuals is

$$\text{Population variance} = \frac{\text{Sum of (value for each member of the population} - \text{population mean})^2}{\text{Number of population members}}$$

The equivalent mathematical statement is

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$





Note that the units of variance are the square of the units of the variable of interest. In particular, the variance of Martian heights is 25 cm² and the variance of Venusian heights is 6.3 cm². These numbers summarize the qualitative conclusion that there is more variability in heights of Martians than in heights of Venusians.




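Since variances are often hard to visualize, it is more common to present the square root of the variance, which we might call the square root of the average squared deviation from the mean. Since that is quite a mouthful, this quantity has been named the standard deviation, σ. Therefore, by definition,

$$\text{Population standard deviation} = \sqrt{\text{Population variance}} = \sqrt{\frac{\text{Sum of (value for each member of the population} - \text{population mean})^2}{\text{Number of population members}}}$$

or mathematically,

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$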





where the symbols are defined as before. Note that the standard deviation has the same units as the original observations. For example, the standard deviation of Martian heights is 5 cm, and the standard deviation of Venusian heights is 2.5 cm.
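The two definitions can be sketched in a few lines. The helper names and the four heights below are hypothetical; the values were picked so that the result happens to match the Martian variance of 25 cm² and standard deviation of 5 cm, but they are not the Martian data. Note that both functions divide by N, the population size, because we are treating the values as the entire population.

```python
import math


def population_variance(values):
    """Return sigma^2: the average squared deviation from the population mean."""
    n = len(values)
    if n == 0:
        raise ValueError("population must have at least one member")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_sd(values):
    """Return sigma: the square root of the population variance."""
    return math.sqrt(population_variance(values))


# Hypothetical heights in cm (illustrative only); the mean is 40 cm.
heights_cm = [35, 35, 45, 45]

variance = population_variance(heights_cm)  # in cm squared
sd = population_sd(heights_cm)              # in cm, same units as the data

print(variance)  # each deviation is 5 cm, so (4 * 25) / 4 = 25.0
print(sd)        # 5.0
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev`, which compute the same population quantities.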




The Normal Distribution







Table 2-1 summarizes what we found out about Martians and Venusians. The three numbers in the table tell a great deal: the population size, the mean height, and how much the heights vary about the mean. The distributions of heights on both planets have a similar shape, so that roughly 68% of the heights fall within 1 standard deviation of the mean and roughly 95% within 2 standard deviations of the mean. This pattern occurs so often that mathematicians have studied it and found that if the observed measurement is the sum of many independent small random factors, the resulting measurements will take on values that are distributed like the heights we observed on both Mars and Venus. This distribution is called the normal (or Gaussian) distribution.





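Table 2-1. Population Parameters for Heights of Martians and Venusians

             Size of population   Population mean (cm)   Population standard deviation (cm)
Martians     200                  40                     5
Venusians    150                  15                     2.5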




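The height of the normal distribution at any given value of X is

$$\frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2\right]$$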





Note that the distribution is completely defined by the population mean μ and population standard deviation σ. Therefore, the information given in Table 2-1 is not just a good abstract of the data, it is all the information one needs to describe the population fully if the distribution of values follows a normal distribution.
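As a rough check on these statements, the sketch below evaluates the standard formula for the height of the normal distribution, 1/(σ√(2π)) · exp[−½((X − μ)/σ)²], using only μ and σ, and then computes the fraction of a normal population lying within k standard deviations of the mean. The function names are hypothetical, and the second function relies on the standard identity that this fraction equals erf(k/√2).

```python
import math


def normal_density(x, mu, sigma):
    """Height of the normal distribution at x, given mean mu and SD sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def fraction_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Parameters reported for the Martians: mu = 40 cm, sigma = 5 cm.
mu, sigma = 40, 5

# The curve is symmetric about the mean and highest at the mean.
print(normal_density(35, mu, sigma) == normal_density(45, mu, sigma))
print(normal_density(40, mu, sigma) > normal_density(45, mu, sigma))

print(round(fraction_within(1), 4))  # about 0.68
print(round(fraction_within(2), 4))  # about 0.95
```

The last two values do not depend on μ or σ, which is why the same 68% and 95% rules apply on both planets.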




Getting the Data







So far, everything we have done has been exact because we followed the conservative course of examining every single member of the population. Usually it is physically or fiscally impossible to do this, and we are limited to examining a sample of n individuals drawn from the population in the hope that it is representative of the complete population. Without knowledge of the entire population, we can no longer know the population mean, μ, and population standard deviation, σ. Nevertheless, we can estimate them from the sample. To do so, however, the sample has to be “representative” of the population from which it is drawn.




Random Sampling



All statistical methods are built on the assumption that the individuals included in your sample represent a random sample from the underlying (and unobserved) population. In a random sample every member of the population has an equal probability (chance) of being selected for the sample. For the results of any of the methods developed in this book to be reliable, this assumption has to be met.



The most direct way to create a simple random sample would be to obtain a list of every member of the population of interest, number them from 1 to N (where N is the number of population members), then use a computerized random number generator to select the n individuals for the sample. Table 2-2 shows 100 random numbers between 1 and 150 created with a random number generator. Every number has the same chance of appearing and there is no relationship between adjacent numbers.




Table 2-2. One Hundred Random Numbers between 1 and 150



We could use this table to select a random sample of Venusians from the population shown in Figure 2-2. To do this, we number the Venusians from 1 to 150, beginning with number 1 for the far left individual in Figure 2-2, numbers 2 and 3 for the next two individuals in the second column in Figure 2-2, numbers 4, 5, 6, and 7 for the individuals in the next column, until we reach the individual at the far right of the distribution, who is assigned the number 150. To obtain a simple random sample of six Venusians from this population, we take the first six numbers in the table—2, 101, 49, 54, 30, and 137—and select the corresponding individuals. Figure 2-3 shows the result of this process. (When a number repeats, as with the two 7s in the first column of Table 2-2, simply skip the repeats because the corresponding individual has already been selected.)




Figure 2-3.



To select n = 6 Venusians at random, we number the entire population of N = 150 Venusians from 1 to 150, beginning with the first individual on the far left of the population as number 1. We then select six random numbers from Table 2-2 and select the corresponding individuals for the sample to be observed.




We could create a second random sample by simply continuing in the table beginning with the seventh entry, 40, or starting in another column. The important point is not to reuse any sequence of random numbers already used to select a sample. (As a practical matter, one would probably use a computerized random number generator, which automatically makes each sequence of random numbers independent of the other sequences it generates.) In this way, we ensure that every member of the population is equally likely to be selected for observation in the sample.
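The procedure can be sketched with a computerized random number generator in place of the printed table. This is a minimal illustration, not the method used to build Table 2-2: the function name and the seed value are assumptions, and the numbers drawn will differ from those in the table. Python's `random.sample` draws without replacement, which plays the same role as skipping repeated numbers.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct member numbers from 1..population_size.

    Every member has the same chance of being selected, and no member
    can be selected twice.
    """
    rng = random.Random(seed)  # the seed only makes the example repeatable
    return rng.sample(range(1, population_size + 1), sample_size)


# Select n = 6 of the N = 150 numbered Venusians.
chosen = simple_random_sample(150, 6, seed=1)
print(sorted(chosen))
```

Calling the function again with a different seed (or with no seed) gives a second, independent sample.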



The list of population members from which we drew the random sample is known as a sampling frame. Sometimes it is possible to obtain such a list (for example, a list of all people hospitalized in a given hospital on a given day), but often no such list exists. When there is no list, investigators use other techniques for creating a random sample, such as dialing telephone numbers at random for public opinion polling or selecting geographic locations at random from maps. How the sampling frame is constructed can be very important in determining how well, and to whom, the results of a given study generalize beyond the specific individuals in the sample.*



The procedure we just discussed is known as a simple random sample. In more complex designs, particularly in large surveys or clinical trials, investigators sometimes use stratified random samples in which they first divide the population into different subgroups, or strata (perhaps based on gender, race, or geographic location), then construct simple random samples within each stratum. This procedure is used when there are widely varying numbers of people in the different subpopulations, so that obtaining adequate sample sizes in the smaller subgroups would require collecting more data than necessary in the larger subpopulations if the sampling were done with a simple random sample. Stratification reduces data collection costs by reducing the total sample size necessary to obtain the desired precision in the results, but makes the data analysis more complicated. The basic need to create a random sample in which each member of each subpopulation (stratum) has the same chance of being selected is the same as in a simple random sample.
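A stratified random sample can be sketched as a simple random sample drawn separately within each stratum. The stratum names, sizes, and member numbers below are hypothetical, as is the decision to take the same number of individuals from each stratum; the sketch covers only the selection step, not the more complicated analysis that follows.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample within each stratum.

    strata maps a stratum name to the list of its members;
    sizes maps the same names to the number of members to select.
    Within a stratum, every member has the same chance of selection.
    """
    rng = random.Random(seed)  # the seed only makes the example repeatable
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}


# Hypothetical population of 1000 numbered people in two unequal strata.
strata = {
    "large_group": list(range(1, 901)),     # 900 members
    "small_group": list(range(901, 1001)),  # 100 members
}
sizes = {"large_group": 30, "small_group": 30}

samples = stratified_sample(strata, sizes, seed=2)
for name, members in samples.items():
    print(name, sorted(members))
```

A simple random sample of 60 from this population would be expected to contain only about 6 members of the small group, so reaching 30 of them would require a much larger total sample.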



* We will return to this issue in Chapter 12, with specific emphasis on doing clinical research on people being served at academic medical centers.




Bias



The primary reason for random sampling—whether a simple random sample or a more complex stratified sample—is to avoid bias in selecting the individuals to be included in the sample. A bias is a systematic difference between the characteristics of the members of the sample and the population from which it is drawn.



Biases can be introduced purposefully or by accident. For example, suppose you are interested in describing the age distribution of the population. The easiest way to obtain a sample would be to simply select the people whose age is to be measured from the people in your biostatistics class. The problem with this convenience sample is that you will be leaving out everyone not old enough to be learning biostatistics or those who have outgrown the desire to do so. The results obtained from this convenience sample would probably underestimate both the mean age of people in the entire population and the amount of variation in the population. Biases can also be introduced by selectively placing people in one comparison group or another. For example, if one is conducting an experiment to compare a new drug with conventional therapy, it would be possible to bias the results by putting the sicker people in the conventional therapy group with the expectation that they would do worse than people who were not as sick and were receiving the new drug. Random sampling protects against both these kinds of biases.
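The age example can be illustrated with made-up numbers. In the sketch below, the population (one person at each age from 0 to 89) and the class (everyone aged 20 to 29) are hypothetical simplifications, and the spread of each group is measured with `statistics.pstdev` purely as a descriptive index of variability.

```python
import statistics


def summarize(ages):
    """Return (mean, standard deviation) of a list of ages."""
    return statistics.mean(ages), statistics.pstdev(ages)


# Hypothetical population: one person at each age from 0 through 89.
population_ages = list(range(0, 90))

# Hypothetical convenience sample: only people aged 20 through 29,
# standing in for the members of a biostatistics class.
convenience_ages = [age for age in population_ages if 20 <= age <= 29]

pop_mean, pop_sd = summarize(population_ages)
conv_mean, conv_sd = summarize(convenience_ages)

print("population:  mean", pop_mean, "SD", round(pop_sd, 1))
print("convenience: mean", conv_mean, "SD", round(conv_sd, 1))
```

Both the mean and the spread of the convenience sample fall well below the population values, and collecting more people from the same class would not correct this, because the difference is systematic.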


Jan 20, 2019 | Posted by in ANESTHESIA | Comments Off on How to Summarize Data
