Interpreting Test Scores
As a measurement tool, a test results in a score—a number. A number, however, has no intrinsic meaning and must be compared with something that has meaning to interpret its significance. For a test score to be useful for making decisions about the test, the teacher must interpret the score.
Whether the interpretations are norm-referenced or criterion-referenced, a basic knowledge of statistical concepts is necessary to assess the quality of tests (whether teacher-made or published), understand standardized test scores, summarize assessment results, and explain test scores to others.
Test Score Distributions
Some information about how a test performed as a measurement instrument can be obtained from computer-generated test- and item-analysis reports. In addition to providing item-analysis data such as difficulty and discrimination indexes, such reports often summarize the characteristics of the score distribution. If the teacher does not have access to electronic scoring and computer software for test and item analysis, many of these analyses can be done by hand, although more slowly.
When a test is scored, the teacher is left with a collection of raw scores. Often these scores are recorded according to the names of the students, in alphabetical order, or by student numbers. As an example, suppose that the scores displayed in Table 1 resulted from the administration of a 65-point test to 16 nursing students. Glancing at this collection of numbers, the teacher would find it difficult to answer such questions as:
1. Did a majority of students obtain high or low scores on the test?
2. Did any individuals score much higher or much lower than the majority of the students?
3. Are the scores widely scattered or grouped together?
4. What was the range of scores obtained by the majority of the students? (Nitko & Brookhart, 2007)
To make it easier to see similar characteristics of scores, the teacher should arrange them in rank order, from highest to lowest (Miller, Linn, & Gronlund, 2009), as in Table 2. Ordering the scores in this way makes it obvious that they ranged from 42 to 60, and that one student’s score was much lower than those of the other students.
But the teacher still cannot visualize easily how a typical student performed on the test or the general characteristics of the scores obtained. Removing student names, listing each score once, and tallying how many times each score occurs results in a frequency distribution, as in Table 3. By displaying scores in this way, it is easier for the teacher to identify how well the group of students performed on the exam. The frequency distribution also can be represented graphically as a histogram.
In Figure 1, the scores are ordered from lowest to highest along a horizontal line, left to right, and the number of asterisks above each score indicates the frequency of that score. Frequencies also can be indicated on a histogram by bars, with the height of each bar representing the frequency of the corresponding score, as in Figure 2. A frequency polygon is another way to display a score distribution graphically.
A dot is made above each score value to indicate the frequency with which that score occurred; if no one obtained a particular score, the dot is made on the baseline, at zero. The dots are then connected with straight lines to form a polygon or curve. Figure 3 shows a frequency polygon based on the histogram in Figure 1.
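The tallying and asterisk display described above can be sketched in a few lines of Python. This is only an illustration: the scores are hypothetical (they are not the Table 1 data), the function names are invented for this example, and the histogram is printed with one row per score value rather than along a horizontal axis as in Figure 1.

```python
from collections import Counter

# Hypothetical raw scores for illustration only (not the Table 1 data).
scores = [55, 48, 52, 55, 50, 52, 47, 55, 52, 41]


def frequency_distribution(raw_scores):
    """Tally how often each score occurs, listed from highest to lowest."""
    counts = Counter(raw_scores)
    return sorted(counts.items(), reverse=True)


def text_histogram(raw_scores):
    """Return one line per score value, lowest to highest, one asterisk per student.

    Score values that nobody obtained are kept with zero asterisks, which
    mirrors the baseline (zero) points of a frequency polygon.
    """
    counts = Counter(raw_scores)
    lines = []
    for value in range(min(raw_scores), max(raw_scores) + 1):
        lines.append(f"{value:>3} | {'*' * counts[value]}")
    return lines


for score, frequency in frequency_distribution(scores):
    print(score, frequency)

print("\n".join(text_histogram(scores)))
```

Keeping the empty score values in the display makes gaps and isolated low or high scores easy to spot, which is the point of moving from a list of raw scores to a frequency distribution.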
Histograms and frequency polygons thus show general characteristics such as the scores that occurred most frequently, the score distribution shape, and the range of the scores. The characteristics of a score distribution can be described on the basis of its symmetry, skewness, modality, and kurtosis. These characteristics are illustrated in Figure 4. A symmetric distribution or curve is one in which there are two equal halves, mirror images of each other.
Nonsymmetrical or asymmetric curves have a cluster of scores or a peak at one end and a tail extending toward the other end. This type of curve is said to be skewed; the direction in which the tail extends indicates whether the distribution is positively or negatively skewed. The tail of a positively skewed curve extends toward the right, in the direction of positive numbers on a scale, and the tail of a negatively skewed curve extends toward the left, in the direction of negative numbers.
A positively skewed distribution thus has the largest cluster of scores at the low end of the distribution, which seems counterintuitive. The distribution of test scores from Table 1 is nonsymmetrical and negatively skewed. Remember that the lowest possible score on this test was zero and the highest possible score was 65; the scores were clustered between 43 and 60. Frequency polygons and histograms can differ in the number of peaks they contain; this characteristic is called modality, referring to the mode or the most frequently occurring score in the distribution.
If a curve has one peak, it is unimodal; if it contains two peaks, it is bimodal. A curve with many peaks is multimodal. The relative flatness or peakedness of the curve is referred to as kurtosis. Flat curves are described as platykurtic, moderate curves are said to be mesokurtic, and sharply peaked curves are referred to as leptokurtic (Munro, 2001, p. 45). The histogram in Figure 1 is a bimodal, platykurtic distribution. The shape of a score distribution depends on the characteristics of the test as well as the abilities of the students who were tested (Nitko & Brookhart, 2007).
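Skewness and kurtosis are usually judged by eye from a histogram, but they can also be expressed as numbers. The sketch below uses the common moment-based coefficients, which are an assumption of this example rather than a method taken from the sources cited above. A negative skewness value indicates a tail toward the low scores, a positive value a tail toward the high scores; excess kurtosis below zero suggests a flatter (platykurtic) curve and above zero a more peaked (leptokurtic) one. The scores are hypothetical, the function name is invented, and counting tied most-frequent scores is only a rough stand-in for counting peaks.

```python
from collections import Counter


def shape_summary(scores):
    """Return (skewness, excess_kurtosis, modes) for a list of scores.

    Assumes the scores are not all identical; otherwise the variance is
    zero and the coefficients are undefined.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Second, third, and fourth central moments of the distribution.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    # Scores tied for the highest frequency; two values suggest bimodality.
    counts = Counter(scores)
    top = max(counts.values())
    modes = sorted(value for value, count in counts.items() if count == top)
    return skewness, excess_kurtosis, modes


# Hypothetical scores with one very low value (not the Table 1 data).
example_scores = [41, 47, 48, 50, 52, 52, 52, 55, 55, 55]
skewness, excess_kurtosis, modes = shape_summary(example_scores)
print(f"skewness = {skewness:.2f}, excess kurtosis = {excess_kurtosis:.2f}, modes = {modes}")
```

With only a handful of scores these coefficients are unstable, so they are best read alongside the histogram rather than in place of it.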
Some teachers make grading decisions as if all test score distributions resemble a normal curve, that is, they attempt to “curve” the grades. An understanding of the characteristics of a normal curve would dispel this notion. A normal distribution is a bell-shaped curve that is symmetric, unimodal, and mesokurtic. Figure 5 illustrates a normal distribution.
Many human characteristics such as intelligence, weight, and height are normally distributed; the measurement of any of these attributes in a population would result in more scores in the middle range than at either extreme. However, most score distributions obtained from teacher-made tests do not approximate a normal distribution. This is true for several reasons.
The characteristics of a test greatly influence the resulting score distribution; a very difficult test tends to yield a positively skewed curve. Likewise, the abilities of the students influence the test score distribution. Regardless of the distribution of the attribute of intelligence among the human population, this characteristic is not likely to be distributed normally among a class of nursing students or a group of newly hired RNs.
Because admission and hiring decisions tend to select those individuals who are most likely to succeed in the nursing program or job, a distribution of IQ scores from a class of 16 nursing students or 16 newly hired RNs would tend to be negatively skewed.
Likewise, knowledge of nursing content is not likely to be normally distributed because those who have been admitted to a nursing program or hired as staff nurses are not representative of the population in general. Therefore, grading procedures that attempt to apply the characteristics of the normal curve to a test score distribution are likely to result in unwise and unfair decisions.
Measures of Central Tendency
One of the questions to be answered when interpreting test scores is, “What score is most characteristic or typical of this distribution?” A typical score is likely to be in the middle of a distribution with the other scores clustered around it; measures of central tendency provide a value around which the test scores cluster (Munro, 2001, p. 30). Three measures of central tendency commonly used to interpret test scores are the mode, median, and mean.
The mode, sometimes abbreviated Mo, is the most frequently occurring score in the distribution; it must be a score actually obtained by a student. It can be identified easily from a frequency distribution or graphic display without mathematical calculation. As such, it provides a rough indication of central tendency. The mode, however, is the least stable measure of central tendency because it tends to fluctuate considerably from one sample to another drawn from the same population (Kubiszyn & Borich, 2003; Miller et al., 2009).
That is, if the same 65-item test that yielded the scores in Table 1 were administered to a different group of 16 nursing students in the same program who had taken the same course, the mode might differ considerably. In addition, as in the distribution depicted in Figure 1, the mode has two or more values in some distributions, making it difficult to specify one typical score.
A uniform distribution of scores has no mode; such distributions are likely to be obtained when the number of students is small, the range of scores is large, and each score is obtained by only one student. The median (abbreviated Mdn or P50) is the point that divides the distribution of scores into equal halves (Miller et al., 2009). It is a value above which fall 50% of the scores and below which fall 50% of the scores; thus it represents the 50th percentile.
The median does not have to be an actual score obtained. In an even number of scores, the median is located halfway between the two middle scores; in an odd number of scores, the median is the middle score. Because the median is an index of location, it is not influenced by the value of each score in the distribution. Thus, it is usually a good indication of a typical score in a skewed distribution containing extremely high or low scores (Miller et al.).
The mean is often referred to as the “average” score in a distribution, reflecting the mathematical calculation that determines this measure of central tendency. It is usually abbreviated as M or X̄. The mean is calculated by summing each individual score and dividing by the total number of scores, as in the following formula:
M = ΣX / N [Equation 15.1]
where M is the mean, ΣX is the sum of the individual scores, and N is the total number of scores. Thus, the value of the mean is affected by every score in the distribution (Miller, Linn, & Gronlund, 2009). This property makes it the preferred index of central tendency when a measure of the total distribution is desired.
However, the mean is sensitive to the influence of extremely high or low scores in the distribution, and as such, it may not reflect the typical performance of a group of students. There is a relationship between the shape of a score distribution and the relative locations of these measures of central tendency. In a normal distribution, the mean, median, and mode have the same value, as shown in Figure 5.
In a positively skewed distribution, the mean will yield the highest measure of central tendency and the mode will give the lowest; in a negatively skewed distribution, the mode will be the highest value and the mean the lowest. Figure 6 depicts the relative positions of the three measures of central tendency in skewed distributions. The mean of the distribution of scores from Table 1 is 52.75; the median is 53.5.
The fact that the median is slightly higher than the mean confirms that the median is an index of location or position and is insensitive to the actual score values in the distribution. The mean, because it is affected by every score in the distribution, was influenced by the one extremely low score. Because the shape of this score distribution was negatively skewed, it is expected that the median would be higher than the mean because the mean is always pulled in the direction of the tail (Munro, 2001, p. 34).
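A small worked example may make the pull of an extreme score concrete. The sketch below applies Equation 15.1 to two hypothetical score lists that differ only in their lowest score; the lists and the function name are invented for illustration and are not the Table 1 data.

```python
import statistics


def mean(scores):
    """Equation 15.1: M = (sum of the scores) / (number of scores)."""
    return sum(scores) / len(scores)


# Hypothetical scores; the second list replaces the lowest score (50)
# with one extremely low score (30).
typical_scores = [50, 52, 53, 54, 55, 56]
with_low_outlier = [30, 52, 53, 54, 55, 56]

mean_typical = mean(typical_scores)
mean_outlier = mean(with_low_outlier)
median_typical = statistics.median(typical_scores)
median_outlier = statistics.median(with_low_outlier)

print(f"mean:   {mean_typical:.2f} -> {mean_outlier:.2f}")
print(f"median: {median_typical} -> {median_outlier}")
```

The median stays at 53.5 in both lists because the middle positions are unchanged, while the mean drops by more than three points toward the low tail.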
Measures of Variability
It is possible for two score distributions to have similar measures of central tendency and yet be very different. The scores in one distribution may be tightly clustered around the mean, and in the other distribution, the scores may be widely dispersed over a range of values. Measures of variability are used to determine how similar or different the students are with respect to their scores on a test. The simplest measure of variability is the range, the difference between the highest and lowest scores in the distribution. For the test score distribution in Table 3, the range is 18 (60 − 42 = 18).
The range is sometimes expressed as the highest and lowest scores, rather than a difference score. Because the range is based on only two values, it can be highly unstable. The range also tends to increase with sample size; that is, test scores from a large group of students are likely to be scattered over a wide range because of the likelihood that an extreme score will be obtained (Miller et al., 2009, p. 503). The standard deviation (abbreviated as SD, s, or σ) is the most common and useful measure of variability.
Like the mean, it takes into consideration every score in the distribution. The standard deviation is based on differences between each score and the mean. Thus, it characterizes the average amount by which the scores differ from the mean. The standard deviation is calculated in four steps:
1. Subtract the mean from each score (X − M) to calculate a deviation score (x), which can be positive or negative.
2. Square each deviation score (x²), which eliminates any negative values. Sum all of the squared deviation scores (Σx²).
3. Divide this sum by the number of test scores to yield the variance.
4. Calculate the square root of the variance.
Although other formulas can be used to calculate the standard deviation, the following definitional formula represents these four steps:
SD = √(Σx² / N) [Equation 15.2]
where SD is the standard deviation, Σx² is the sum of the squared deviation scores, and N is the number of scores (Miller et al., 2009, pp. 504–505). The standard deviation of the distribution of scores from Table 1 is 4.1. What does this value mean?
A standard deviation of 4.1 represents the average deviation of scores from the mean. On a 65-point test, 4 points is not a large average difference in scores. If the scores cluster tightly around the mean, the standard deviation will be a relatively small number; if they are widely scattered over a large range of scores, the standard deviation will be a larger number (Kubiszyn & Borich, 2003, p. 271).
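The four steps of the definitional formula (Equation 15.2) can be sketched as follows. The scores are hypothetical and were chosen so that the arithmetic comes out evenly; they are not the Table 1 data, and the function name is invented for this example. Because the formula divides by N, it matches statistics.pstdev in Python's standard library; statistics.stdev divides by N − 1 instead and would give a slightly larger value.

```python
import math
import statistics


def standard_deviation(scores):
    """Definitional formula: SD = sqrt(sum of squared deviations / N)."""
    n = len(scores)
    m = sum(scores) / n
    # Step 1: subtract the mean from each score to get deviation scores.
    deviations = [x - m for x in scores]
    # Step 2: square each deviation score and sum the squares.
    sum_of_squares = sum(d ** 2 for d in deviations)
    # Step 3: divide by the number of scores to get the variance.
    variance = sum_of_squares / n
    # Step 4: take the square root of the variance.
    return math.sqrt(variance)


# Hypothetical scores for illustration only.
sd_scores = [48, 50, 50, 50, 51, 51, 53, 55]

score_range = max(sd_scores) - min(sd_scores)   # simplest measure of variability
sd = standard_deviation(sd_scores)

print(f"range = {score_range}, SD = {sd}")
print(f"statistics.pstdev gives {statistics.pstdev(sd_scores)}")
```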
Interpreting an Individual Score
Interpreting the Results of Teacher-Made Tests
The ability to interpret the characteristics of a distribution of scores will assist the teacher to make norm-referenced interpretations of the meaning of any individual score in that distribution. For example, how should the teacher interpret P. Purdy’s score of 53 on the test whose results were summarized in Table 1? With a median of 53.5, a mean of 52.75, and a standard deviation of 4.1, a score of 53 is about “average.”
All scores between 49 and 57 fall within one standard deviation of the mean, and thus are not significantly different from one another. On the other hand, N. Nardozzi can rejoice because a score of 60 is almost two standard deviations higher than the mean; thus, this score represents achievement that is much better than that of others in the group. The teacher should probably plan to counsel L. Lynch, because a score of 42 is more than two standard deviations below the mean, much lower than others in the group.
However, most nursing educators need to make criterion-referenced interpretations of individual test scores. A student’s score on the test is compared to a preset standard or criterion, and the scores of the other students are not taken into account.
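The norm-referenced comparisons made above for the scores of 53, 60, and 42 amount to expressing each score as a distance from the mean in standard deviation units. The sketch below uses the mean (52.75) and standard deviation (4.1) reported in the text; the function name is invented for this example.

```python
# Mean and standard deviation reported in the text for the Table 1 scores.
MEAN = 52.75
SD = 4.1


def sd_units(score, mean=MEAN, sd=SD):
    """Distance of a score from the mean, in standard deviation units."""
    return (score - mean) / sd


students = [("P. Purdy", 53), ("N. Nardozzi", 60), ("L. Lynch", 42)]
for name, score in students:
    print(f"{name}: {score} is {sd_units(score):+.2f} SD from the mean")
```

The results, about +0.06, +1.77, and −2.62, correspond to the descriptions "about average," "almost two standard deviations higher," and "more than two standard deviations below" the mean.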
The percentage-correct score is a derived score that is often used to report the results of tests that are intended for criterion-referenced interpretation. The percentage correct is a comparison of a student’s score with the maximum possible score; it is calculated by dividing the raw score by the total number of items on the test (Miller et al., 2009, p. 462).
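As a minimal sketch of the percentage-correct calculation, assuming each item is worth one point so that the maximum possible score equals the number of items, a raw score of 53 on the 65-point test works out as follows. The function name is invented for this example.

```python
def percentage_correct(raw_score, max_score):
    """Raw score as a percentage of the maximum possible score."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return 100 * raw_score / max_score


# A raw score of 53 on a 65-point test.
pct = percentage_correct(53, 65)
print(f"{pct:.1f}% correct")
```

Note that the calculation uses only the student's own score and the maximum possible score; no other student's score enters into it.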
Although many teachers believe that percentage-correct scores are an objective indication of how much students really know about a subject, in fact they can change significantly with the difficulty of the test items. Because percentage-correct scores are often used as a basis for assigning letter grades according to a predetermined grading system, it is important to recognize that they are determined more by test difficulty than by true quality of performance.
For tests that are more difficult than they were expected to be, the teacher may want to adjust the raw scores before calculating the percentage correct on that test. The percentage-correct score should not be confused with percentile rank, often used to report the results of standardized tests. The percentile rank describes the student’s relative standing within a group and therefore is a norm-referenced interpretation.
The percentile rank of a given raw score is the percentage of scores in the distribution that occur at or below that score. A percentile rank of 83, therefore, means that the student’s score is equal to or higher than the scores made by 83% of the students in that group; one cannot assume, however, that the student answered 83% of the test items correctly.
Because there are 99 points that divide a distribution into 100 groups of equal size, the highest percentile rank that can be obtained is the 99th. The median is at the 50th percentile. Differences between percentile ranks mean more at the highest and lowest extremes than they do near the median (Kubiszyn & Borich, 2003).
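The "at or below" definition of percentile rank can be sketched as follows, using hypothetical scores rather than the Table 1 data and an invented function name. The sketch applies the definition literally, so the top score in a group comes out at 100; reporting it as the 99th percentile, as described above, is a convention that this example does not apply.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores in the group that fall at or below the given score."""
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100 * at_or_below / len(all_scores)


# Hypothetical group of eight scores for illustration only.
group_scores = [42, 48, 50, 53, 54, 55, 55, 60]

rank_53 = percentile_rank(53, group_scores)
rank_55 = percentile_rank(55, group_scores)
rank_60 = percentile_rank(60, group_scores)

print(rank_53, rank_55, rank_60)
```

Unlike the percentage-correct score, the result depends entirely on the other scores in the group: the same raw score of 53 would have a different percentile rank in a stronger or weaker group.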
Interpreting the Results of Standardized Tests
The results of standardized tests usually are intended to be used to make norm-referenced interpretations. Before making such interpretations, the teacher should keep in mind that standardized tests are more relevant to general rather than specific instructional goals. Additionally, the results of standardized tests are more appropriate for evaluations of groups rather than individuals.
Consequently, standardized test scores should not be used to determine grades for a specific course or to make a decision to hire, promote, or terminate an employee. Like most educational measures, standardized tests provide gross, not precise, data about achievement. Actual differences in performance and achievement are reflected in large score differences. Standardized test results are usually reported in derived scores such as percentile ranks, standard scores, and norm group scores.
Because all of these derived scores should be interpreted in a norm-referenced way, it is important to specify an appropriate norm group for comparison. The user’s manual for any standardized test typically presents norm tables in which each raw score is matched with an equivalent derived score. Standardized test manuals may contain a number of norm tables; the norm group on which each table is based should be fully described.
The teacher should take care to select the norm group that most closely matches the group whose scores will be compared to it (Kubiszyn & Borich, 2003, p. 356; Miller et al., 2009, pp. 464–465). For example, when interpreting the results of standardized tests in nursing, the performance of a group of baccalaureate nursing students should be compared with a norm group of baccalaureate nursing students. Norm tables sometimes permit finer distinctions such as size of program, geographical region, and public versus private affiliation.
Conclusion
To be meaningful and useful for decision making, test scores must be interpreted in either norm-referenced or criterion-referenced ways. Knowledge of basic statistical concepts is necessary to make valid interpretations and to explain test scores to others. Scoring a test results in a collection of numbers known as raw scores. To make raw scores understandable, they can be arranged in frequency distributions or displayed graphically as histograms or frequency polygons.
Score distribution characteristics such as symmetry, skewness, modality, and kurtosis can assist the teacher in understanding how the test performed as a measurement tool as well as in interpreting any one score in the distribution. Measures of central tendency and variability also aid in interpreting individual scores. Measures of central tendency include the mode, median, and mean; each measure has advantages and disadvantages for use. In a normal distribution, these three measures will coincide.
Most score distributions from teacher-made tests do not meet the assumptions of a normal curve. The shape of the distribution can determine the most appropriate index of central tendency to use. Variability in a distribution can be described roughly as the range of scores or more precisely as the standard deviation. Teachers can make criterion-referenced or norm-referenced interpretations of individual student scores.
Norm-referenced interpretations of any individual score should take into account the characteristics of the score distribution, some index of central tendency, and some index of variability. The teacher thus can use the mean and standard deviation to make judgments about how an individual student’s score compares with those of others.
A percentage-correct score is calculated by dividing the raw score by the total possible score; thus it compares the student’s score to a preset standard or criterion and does not take the scores of other students into consideration. A percentage-correct score is not an objective indication of how much a student really knows about a subject because it is affected by the difficulty of the test items.
The percentage-correct score should not be confused with percentile rank, which describes the student’s relative standing within a group and therefore is a norm-referenced interpretation. The percentile rank of a given raw score is the percentage of scores in the distribution that occur at or below that score. The results of standardized tests are usually reported as percentile ranks or other norm-referenced scores.
Teachers should be cautious when interpreting standardized test results so that comparisons with the appropriate norm group are made. Standardized test scores should not be used to determine grades or to make personnel decisions, and results should be interpreted with the understanding that only large differences in scores indicate real differences in achievement levels.