Scoring and Analyzing Tests
After administering a test, the teacher’s responsibility is to score it or arrange to have it scored. The teacher then interprets the results and uses these interpretations to make grading, selection, placement, or other decisions. To accurately interpret test scores, however, the teacher needs to analyze the performance of the test as a whole and of the individual test items, and to use these data to draw valid inferences about student performance.
This information also helps teachers prepare for posttest discussions with students about the exam. This topic discusses the processes of obtaining scores and performing test and item analysis. It also suggests ways in which teachers can use posttest discussions to contribute to student learning and seek student feedback that can lead to test item improvement.
Scoring
Many teachers say that they “grade” tests, when in fact it would be more accurate to say that they “score” tests. Scoring is the process of determining the first direct, unconverted, uninterpreted measure of performance on a test, usually called the raw, obtained, or observed score. The raw score represents the number of correct answers or number of points awarded to separate parts of an assessment (Nitko & Brookhart, 2007).
On the other hand, grading or marking is the process of assigning a symbol to represent the quality of the student’s performance. Symbols can be letters (A, B, C, D, F, which may also include + or −); categories (pass–fail, satisfactory–unsatisfactory); integers (9 through 1); or percentages (100, 99, 98…), among other options (Kubiszyn & Borich, 2003).
In most cases, test scores should not be converted to grades for the purpose of later computing a final average grade. Instead, the teacher should record actual test scores and then combine all scores into a composite score that can be converted to a final grade. Recording actual scores contributes to greater measurement accuracy because information is lost each time scores are converted to symbols.
For example, if scores from 70 to 79 all are converted to a grade of C, each score in this range receives the same grade, although scores of 71 and 78 may represent important differences in achievement. If the C grades all are converted to the same numerical grade, for example, C = 2.0, then such distinctions are lost when the teacher calculates the final grade for the course.
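To make this loss concrete, the short sketch below compares a composite of raw scores with an average of grades converted from those scores first; the scores, the 10-point grading scale, and the grade-point values are hypothetical.

```python
# Illustration of the information lost when test scores are converted to
# letter grades before they are combined. Scores, scale, and grade points
# are hypothetical.

def letter_grade(score):
    """Convert a percentage score to a letter grade on a 10-point scale."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

students = {
    "Student 1": [71, 72, 70],   # low-C performance on three tests
    "Student 2": [78, 79, 77],   # high-C performance on three tests
}

for name, scores in students.items():
    composite_of_raw_scores = sum(scores) / len(scores)
    average_of_converted_grades = sum(
        GRADE_POINTS[letter_grade(s)] for s in scores) / len(scores)
    print(f"{name}: raw-score composite = {composite_of_raw_scores:.1f}, "
          f"average of converted grades = {average_of_converted_grades:.1f}")

# Both students average 2.0 when letter grades are averaged, although their
# raw-score composites (71.0 vs. 78.0) reflect a real difference in achievement.
```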
Weighting Items
As a general rule, each objectively scored test item should have equal weight. Most electronic scoring systems assign 1 point to each correct answer unless the teacher specifies a different item weight; this seems reasonable for hand-scored tests as well. It is difficult for teachers to justify that one item is worth 2 points while another is worth 1 point; such a weighting system also motivates students to argue for partial credit for some answers.
Differential weighting implies that the teacher believes knowledge of one concept to be more important than knowledge of another concept. If this is true, the better approach is to write more items about the important concept; this emphasis would be reflected in the test blueprint, which specifies the number of items for each content area.
When a combination of selection-type items and supply-type items is used on a test, a variable number of points can be assigned to short-answer and essay items to reflect the complexity of the required task and the value of the student’s response (Miller, Linn, & Gronlund, 2009). It is not necessary to adjust the numerical weight of items to achieve a total of 100 points. Although a test of 100 points allows the teacher to calculate a percentage score quickly, this step is not necessary to make valid interpretations of students’ scores.
Correction for Guessing
The raw score sometimes is adjusted or corrected before it is interpreted. One procedure involves applying a formula intended to eliminate any advantage that a student might have gained by guessing correctly. The correction formula reduces the raw score by some fraction of the number of the student’s wrong answers (Miller et al., 2009; Nitko & Brookhart, 2007). The formula can be used only with simple true–false, multiple-choice, and some matching items, and it depends on the number of alternatives per item. The general formula is:
Corrected score = R − W/(n − 1) [Equation 10.1]
where R is the number of right answers, W is the number of wrong answers, and n is the number of options in each item (Miller et al., 2009). Thus, for two-option items such as true–false, the teacher merely subtracts the number of wrong answers from the number of right answers (or raw score); for four-option items, the raw score is reduced by 1/3 of the number of wrong answers.
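A minimal sketch of Equation 10.1, assuming items are scored simply right or wrong and that omitted items are excluded from both counts; the function name and sample values are illustrative.

```python
def corrected_score(right, wrong, options_per_item):
    """Correction-for-guessing formula: corrected score = R - W / (n - 1).

    Omitted items are neither right nor wrong, so they do not enter either count.
    """
    return right - wrong / (options_per_item - 1)

# Two-option (true-false) items: subtract the number of wrong answers.
print(corrected_score(right=40, wrong=10, options_per_item=2))  # 30.0

# Four-option multiple-choice items: subtract 1/3 of the wrong answers.
print(corrected_score(right=40, wrong=9, options_per_item=4))   # 37.0
```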
A correction formula is obviously difficult to use for a test that contains several different item formats. The use of a correction formula is usually appropriate only when students do not have sufficient time to complete all test items and when they have been instructed not to answer any item for which they are uncertain of the answer (Miller et al., 2009).
Even under these circumstances, students may differ in their interpretation of “certainty” and therefore may interpret the advice differently. Some students will guess regardless of the instructions given and the threat of a penalty; the risk-taking or test-wise student is likely to be rewarded with a higher score than the risk-avoiding or non-test-wise student by guessing some answers correctly.
These personality differences cannot be equalized by instructions not to guess and penalties for guessing. The use of a correction formula is also based on the assumption that the student who does not know the answer will guess blindly.
However, Nitko and Brookhart (2007) suggested that the chance of getting a high score by random guessing was slim, although many students choose correct answers through informed guesses based on some knowledge of the content. Based on these limitations and the fact that most tests in nursing education settings are not speeded, the best approach is to advise all students to answer every item, even if they are uncertain about their answers, and apply no correction for guessing.
Item Analysis
Computer software for item analysis is widely available for use with electronic answer sheet scanning equipment. Exhibit 10.1 is an example of a computer-generated item-analysis report. For teachers who do not have access to such equipment and software, procedures for analyzing student responses to test items by hand are described in detail later in this section.
Regardless of the method used for analysis, teachers should be familiar enough with the meaning of each item-analysis statistic to correctly interpret the results. It is important to realize that most item-analysis techniques are designed for items that are scored dichotomously, that is, either right or wrong, from tests that are intended for norm-referenced uses (Nitko & Brookhart, 2007).
Difficulty Index
One useful indication of test-item quality is its difficulty. The most commonly employed index of difficulty is the P-level, the value of which ranges from 0 to 1.00, indicating the percentage of students who answered the item correctly. A P-value of 0 indicates that no one answered the item correctly, and a value of 1.00 indicates that every student answered the item correctly (Nitko & Brookhart, 2007). A simple formula for calculating the P-value is:
P = R/T [Equation 10.2]
where R is the number of students who responded correctly and T is the total number of students who took the test (Miller et al., 2009).
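A brief sketch of Equation 10.2, assuming each student’s response to the item is coded 1 for correct and 0 for incorrect; the response vector is hypothetical.

```python
def difficulty_index(item_responses):
    """P = R / T, where each response is coded 1 (correct) or 0 (incorrect)."""
    return sum(item_responses) / len(item_responses)

# 15 of 20 students answered the item correctly, so P = .75.
responses = [1] * 15 + [0] * 5
print(difficulty_index(responses))  # 0.75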
The difficulty index commonly is interpreted to mean that items with P-values of .20 and below are difficult, and items with P-values of .80 and above are easy. However, this interpretation may imply that test items are intrinsically easy or difficult and may not take into account the quality of the instruction or the abilities of the students in that group.
A group of students who were taught by an expert instructor might tend to answer a test item correctly, whereas a group of students with similar abilities who were taught by an ineffective instructor might tend to answer it incorrectly. Different P-values might be produced by students with more or less ability. Thus, test items cannot be labeled as easy or difficult without considering how well that content was taught.
The P-value also should be interpreted in relationship to the student’s probability of guessing the correct answer. For example, if all students guess the answer to a true–false item, on the basis of chance alone, the P-value of that item should be approximately .50. On a four-option multiple-choice item, chance alone should produce a P-value of .25.
As discussed, a four-alternative multiple-choice item with moderate difficulty therefore would have a P-value approximately halfway between chance (.25) and 1.00, or about .63. For most tests whose results will be interpreted in a norm-referenced way, P-values of .30 to .70 for test items are desirable.
However, for tests whose results will be interpreted in a criterion-referenced manner, as most tests in nursing education settings are, the difficulty level of test items should be compared between two groups: students whose total scores met the criterion and students whose scores did not.
If item difficulty levels indicate a relatively easy (P-value above .70) or relatively difficult (P-value below .30) item, criterion-referenced decisions still will be appropriate if the item correctly classifies students according to the criterion (Miller et al., 2009; Waltz, Strickland, & Lenz, 2005).
Very easy and very difficult items have little power to discriminate between students who know the content and students who do not, and they also decrease the reliability of the test scores. Teachers can use item difficulty information to identify the need for remedial work related to specific content or skills, or to identify test items that are ambiguous (Miller et al., 2009).
Discrimination Index
The discrimination index, D, is a powerful indicator of test-item quality. A positively discriminating item is one that was answered correctly more often by students with high scores on the test than by those whose test scores were low. In other words, a test item with a positive discrimination index discriminates in the same direction as the total test score.
A negatively discriminating item was answered correctly more often by students with low test scores than by students with high scores. When an equal number of high- and low-scoring students answer the item correctly, the item is non-discriminating (Miller et al., 2009; Nitko & Brookhart, 2007). A number of item discrimination indexes are available; a simple method of computing D is:
D = Pu − Pl [Equation 10.3]
where Pu is the fraction of students in the high-scoring group who answered the item correctly and Pl is the fraction of students in the low-scoring group who answered the item correctly. If the number of test scores is large, it is not necessary to include all scores in this calculation. Instead, the teacher (or computer item-analysis software) can use the top 25% and the bottom 25% of scores, based on the assumption that the responses of students in the middle group essentially follow the same pattern (Miller et al., 2009; Waltz et al., 2005). The D-value ranges from −1.00 to +1.00.
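A minimal sketch of Equation 10.3 based on counts from the upper and lower comparison groups; the group sizes and counts below are hypothetical.

```python
def discrimination_index(upper_correct, upper_size, lower_correct, lower_size):
    """D = Pu - Pl: the difference between the fractions of the upper and
    lower scoring groups that answered the item correctly."""
    return upper_correct / upper_size - lower_correct / lower_size

# Of 100 examinees, the 25 highest and 25 lowest total scores form the
# comparison groups; 20 upper-group and 10 lower-group students answered
# the item correctly, so D = .80 - .40 = .40.
print(round(discrimination_index(upper_correct=20, upper_size=25,
                                 lower_correct=10, lower_size=25), 2))  # 0.4
```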
In general, the higher the positive value, the better the test item. An index of +1.00 means that all students in the upper group answered correctly and all students in the lower group answered incorrectly; this indication of maximum positive discriminating power is rarely achieved. D-values of +.20 or above are desirable.
An index of .00 means that equal numbers of students in the upper and lower groups answered the item correctly, and this item has no discriminating power (Miller et al., 2009). Negative D-values signal items that should be reviewed carefully; usually they indicate items that are flawed and need to be revised. One possible interpretation of a negative D-value is that the item was misinterpreted by high scorers or that it provided a clue to low scorers that enabled them to guess the correct answer (Waltz et al., 2005).
When interpreting a D-value, it is important to keep in mind that an item’s power to discriminate is highly related to its difficulty index. An item that is answered correctly by all students has a difficulty index of 1.00; the discrimination index for this item is 0.00, because there is no difference in performance on that item between students whose overall test scores were high and those whose scores were low.
Similarly, if all students answered the item incorrectly, the difficulty index is 0.00, and the discrimination index is also 0.00 because there is no discrimination power. Thus, very easy and very difficult items have low discriminating power. Items with a difficulty index of .50 make maximum discriminating power possible, but do not guarantee it (Miller et al., 2009).
It is important to keep in mind that item-discriminating power does not indicate item validity. To gather evidence of item validity, the teacher would have to compare each test item to an independent measure of achievement, which is seldom possible for teacher-constructed tests. Standardized tests in the same content area usually measure the achievement of more general objectives, so they are not appropriate as independent criteria.
The best measure of the domain of interest usually is the total score on the test if the test has been constructed to correspond to specific instructional objectives and content. Thus, comparing each item’s discriminating power to the performance of the entire test determines how effectively each item measures what the entire test measures. Retaining very easy or very difficult items despite low discrimination power may be desirable so as to measure a representative sample of learning objectives and content (Miller et al., 2009).
Distractor Analysis
As previously indicated, item-analysis statistics can serve as indicators of test item quality. No teacher, however, should make decisions about retaining a test item in its present form, revising it, or eliminating it from future use on the basis of the item statistics alone. Item difficulty and discrimination indexes are not fixed, unchanging characteristics. Item analysis data for a given test item will vary from one administration to another because of factors such as students’ ability levels, quality of instruction, and the size of the group tested.
With very small groups of students, if even a few students had changed their responses to a test item, its difficulty and discrimination indices could have changed considerably (Miller et al., 2009). Thus, when using these indexes to identify questionable items, the teacher should carefully examine each test item for evidence of poorly functioning distractors, ambiguous alternatives, and miskeying.
Every distractor should be selected by at least one lower group student, and more lower group students than higher group students should select it. A distractor that is not selected by any student in the lower group may contain a technical flaw or may be so implausible as to be obvious even to students who lack knowledge of the correct answer. A distractor is ambiguous if upper group students tend to choose it with about the same frequency as the keyed, or correct, response.
This result usually indicates that there is no single clearly correct or best answer. Poorly functioning and ambiguous distractors may be revised to make them more plausible or to eliminate the ambiguity. If a large number of higher scoring students select a particular incorrect response, the teacher should check to see if the answer key is correct. In each case, the content of the item, not the statistics alone, should guide the teacher’s decision making (Nitko & Brookhart, 2007).
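These distractor checks can be expressed as simple screening rules applied to the per-option tallies for the upper and lower scoring groups. In the sketch below, the option labels, tallies, and thresholds are illustrative assumptions rather than fixed standards.

```python
def review_distractors(option_counts, keyed_option):
    """Flag distractors that look implausible, ambiguous, or possibly miskeyed.

    option_counts maps each option letter to an (upper_group, lower_group)
    tally of how many students in each scoring group selected that option.
    """
    keyed_upper, _ = option_counts[keyed_option]
    flags = {}
    for option, (upper, lower) in option_counts.items():
        if option == keyed_option:
            continue
        notes = []
        if lower == 0:
            notes.append("chosen by no lower-group students (implausible?)")
        elif upper >= lower:
            notes.append("chosen at least as often by upper as lower group")
        if upper >= keyed_upper and keyed_upper > 0:
            notes.append("as popular with upper group as the key (ambiguous or miskeyed?)")
        if notes:
            flags[option] = notes
    return flags

# Hypothetical tallies: (upper group, lower group) selections per option;
# option d is the keyed (correct) answer.
counts = {"a": (1, 5), "b": (6, 5), "c": (0, 0), "d": (13, 10)}
print(review_distractors(counts, keyed_option="d"))
# Flags option b (as popular with the upper as the lower group) and
# option c (chosen by no lower-group students).
```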
Performing an Item Analysis by Hand
The following process for performing item analysis by hand is adapted from Nitko and Brookhart (2007) and Miller et al. (2009):
Step 1. After the test is scored, arrange the test scores in rank order, highest to lowest.
Step 2. Divide the scores into a high-scoring half and a low-scoring half. For large groups of students, the scores may be divided into equal thirds or quarters, with only the top and bottom groups used for analysis.
Step 3. For each item, tally the number of students in each group who chose each alternative. Record these counts on a copy of the test item next to each response option.
The keyed response for the following sample item is d; the group of 20 students is divided into 2 groups of 10 students each.
- What is the most likely explanation for breast asymmetry in an adolescent girl? (Higher group / Lower group)
- a. Blocked mammary duct in the larger breast: 0 / 3
- b. Endocrine disorder: 2 / 3
- c. Mastitis in the larger breast: 0 / 0
- d. Normal variation in growth: 8 / 4
Step 4. Calculate the difficulty index for each item. The following formula is a variation of the one presented earlier, to account for the division of scores into two groups:
P = (Rh + Rl)/T [Equation 10.4]
where Rh is the number of students in the high-scoring half who answered correctly, Rl is the number of students in the low-scoring half who answered correctly, and T is the total number of students. For the purpose of calculating the difficulty index, consider omitted responses and multiple responses as incorrect. For the sample item in Step 3, the P-value is (8 + 4)/20 = .60, indicating an item of moderate difficulty.
Step 5. Calculate the discrimination index for each item. Using the data from Step 4, divide Rh by the total number of students in that group to obtain Ph. Repeat the process to calculate Pl from Rl. Subtract Pl from Ph to obtain D. For the sample item, the discrimination index is .80 − .40 = .40, indicating that the item discriminates well between high-scoring and low-scoring students.
Step 6. Check each item for implausible distractors, ambiguity, and miskeying. In the sample item, no students chose option c, “Mastitis in the larger breast.” This distractor does not contribute to the discrimination power of the item, and the teacher should consider replacing it with an alternative that might be more plausible.
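The six steps can also be carried out in a few lines of code. The sketch below assumes a simple data layout, one (total test score, selected option) pair per student, with hypothetical scores constructed to reproduce the tallies of the sample item.

```python
# Hand item-analysis procedure (Steps 1-6) applied to one item.
# Each tuple is (total test score, option chosen on this item); the keyed
# response is "d". The 20 scores and option choices are illustrative.

students = [
    (95, "d"), (92, "d"), (90, "d"), (88, "b"), (87, "d"),
    (85, "d"), (84, "d"), (82, "b"), (81, "d"), (80, "d"),
    (78, "d"), (75, "a"), (74, "b"), (72, "d"), (70, "a"),
    (68, "b"), (66, "d"), (64, "b"), (62, "a"), (60, "d"),
]
KEY = "d"

# Steps 1-2: rank by total score and split into high- and low-scoring halves.
ranked = sorted(students, key=lambda s: s[0], reverse=True)
half = len(ranked) // 2
high, low = ranked[:half], ranked[half:]

# Step 3: tally the option choices in each group.
def tally(group):
    counts = {}
    for _, option in group:
        counts[option] = counts.get(option, 0) + 1
    return counts

print("high group:", tally(high))   # d chosen 8 times, b twice
print("low group:", tally(low))     # d chosen 4 times, a and b 3 times each

# Step 4: difficulty index, P = (Rh + Rl) / T.
r_high = sum(1 for _, option in high if option == KEY)
r_low = sum(1 for _, option in low if option == KEY)
p_value = (r_high + r_low) / len(students)

# Step 5: discrimination index, D = Ph - Pl.
d_value = r_high / len(high) - r_low / len(low)

print(f"P = {p_value:.2f}, D = {d_value:.2f}")  # P = 0.60, D = 0.40

# Step 6: the tallies above show no one chose option c, flagging it as an
# implausible distractor that could be replaced.
```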
No test item should be rejected solely on the basis of item-analysis data. The teacher should carefully examine each questionable item and, if there is no obvious structural defect, it may be best to use the item again with a different group. Remember that with small groups of students, item-analysis data can vary widely from one test administration to another.
Test Characteristics
In addition to item-analysis results, information about how the test performed as a whole also helps teachers to interpret test results. Measures of central tendency and variability, reliability estimates, and the shape of the score distribution can assist the teacher in making judgments about the quality of the test; difficulty and discrimination indices are related to these test characteristics. In addition, teachers should examine test items in the aggregate for evidence of bias.
For example, although there may be no obvious gender bias in any single test item, such a bias may be apparent when all items are reviewed as a group. Similar cases of ethnic, racial, religious, and cultural bias may be found when items are grouped and examined together.
Conducting Posttest Discussions
Giving students feedback about test results can be an opportunity to reinforce learning, to correct misinformation, and to solicit their input for improvement of test items. But a feedback session also can be an invitation to engage in battle, with students attacking to gain extra points and the teacher defending the honor of the test and, it often seems, the very right to give tests.
Discussions with students about the test should be rational rather than opportunities for the teacher to assert power and authority (Kubiszyn & Borich, 2003). Posttest discussions can be beneficial to both teachers and students if they are planned in advance and not emotionally charged. The teacher should prepare for a posttest discussion by completing a test analysis and an item analysis and reviewing the items that were most difficult for the majority of students.
Discussion should focus on the items that were missed and the possible reasons why. Student comments about how the test is designed, its directions, and individual test items provide an opportunity for the teacher to improve the test (Kubiszyn & Borich, 2003). To use time efficiently, the teacher should read the correct answers aloud quickly. If the test is hand-scored, correct answers may also be indicated by the teacher on the students’ answer sheets or test booklets.
If machine scoring is used, the answer key may be projected as a scanned document from a computer or via a document camera or overhead projector. Many electronic scoring applications allow an option for marking the correct or incorrect answers directly on each student’s answer sheet. Teachers should continue to protect the security of the test during the posttest discussion by accounting for all test booklets and answer sheets and by eliminating other opportunities for cheating.
Some teachers do not allow students to use pens or pencils during the feedback session to prevent answer-changing and subsequent complaints that scoring errors were made. Another approach is to distribute pens with red or green ink and permit only those pens to be used to mark answers. Teachers also should decide in advance whether to permit students to take notes during the session. To provide immediate feedback, some teachers allow students to record their answers on their test booklets, on which they also record their names.
At the completion of the exam, students submit their answer sheets and test booklets to the teacher. When all students have finished, they return to the room to check their answers, using only their test booklets. The teacher might project the answers onto a screen as described previously. At the conclusion of this session, the teacher collects the test booklets again. Individual items should not be reviewed and discussed at this point because the test has not yet been scored and analyzed.
However, the teacher may ask students to indicate problematic items and give a rationale for their answers. The teacher can use this input in conjunction with the item-analysis results to evaluate the effectiveness of test items (Kubiszyn & Borich, 2003).
One disadvantage to this method of giving posttest feedback is that because the test has not yet been scored and analyzed, the teacher would not have an opportunity to thoroughly prepare for the session; feedback consists only of the correct answers, and no discussion takes place. Whatever the structure of the posttest discussion, the teacher should control the session so that it produces maximum benefit for all students.
While discussing an item that was answered incorrectly by a majority of students, the teacher should maintain a calm, matter-of-fact, non-defensive attitude. Students who answered the item incorrectly may be asked to provide their rationale for choosing an incorrect response; students who supplied or chose the right answer may be asked to explain why it is correct.
The teacher should avoid arguing with students about individual items and engaging in emotionally charged discussion; instead, the teacher should either invite written comments as described previously or schedule individual appointments to discuss the items in question. Students who need additional help are encouraged to make appointments with the teacher for individual review sessions.
Eliminating Items or Adding Points
Teachers often debate the merits of adjusting test scores by eliminating items or adding points to compensate for real or perceived deficiencies in test construction or performance. For example, during a posttest discussion, students may argue that if they all answered an item incorrectly, the item should be omitted or all students should be awarded an extra point to compensate for the “bad item.”
It is interesting to note that students seldom propose subtracting a point from their scores if they all answer an item correctly. In any case, how should the teacher respond to such requests? In this discussion, a distinction is made between test items that are technically flawed and those that do not function as intended. If test items are properly constructed, critiqued, and proofread, it is unlikely that serious flaws will appear on the test. However, errors that do appear may have varying effects on students’ scores.
For example, if the correct answer to a multiple-choice item is inadvertently omitted from the test, no student will be able to answer the item correctly. In this case, the item simply should not be scored. That is, if the error is discovered during or after test administration and before the test is scored, the item is omitted from the answer key; a test that was intended to be worth 73 points then is worth 72 points.
If the error is discovered after the tests are scored, they can be re-scored. Students often worry about the effect of this change on their scores and may argue that they should be awarded an extra point in this case. In fact, omitting the flawed item and adding a point to the raw score produce nearly identical results: a student who answered 60 of the remaining 72 items correctly would score 60/72 (83.3%) if the item is omitted and 61/73 (83.6%) if a point is added instead.
Although students might view adding a point to their scores as more satisfying, it makes little sense to award a point for an item that was not answered correctly. The “extra” point in fact does not represent knowledge of any content area or achievement of an objective, and therefore it does not contribute to a valid interpretation of the test scores. Teachers should inform students matter-of-factly that an item was eliminated from the test and reassure them that their relative standing with regard to performance on the test has not changed.
If the technical flaw consists of a misspelled word in a true–false item that does not change the meaning of the statement, no adjustment should be made. The teacher should avoid lengthy debate about item semantics if it is clear that such errors are unlikely to have affected the students’ scores. Feedback from students can be used to revise items for later use and sometimes to make changes in instruction.
As previously discussed, teachers should resist the temptation to eliminate items from the test solely on the basis of low difficulty and discrimination indices. Omission of items may affect the validity of the scores from the test, particularly if several items related to one content area or objective are eliminated, resulting in inadequate sampling of that content (Miller et al., 2009).
Because identified flaws in test construction do contribute to measurement errors, the teacher should consider taking them into account when using the test scores to make grading decisions and set cutoff scores. That is, the teacher should not fix cutoff scores for assigning grades until after all tests have been given and analyzed.
The proposed grading scale can then be adjusted if necessary to compensate for deficiencies in test construction. It should be made clear to students that any changes in the grading scale because of flaws in test construction would not adversely affect their grades.
Developing a Test-Item Bank
Because considerable effort goes into developing, administering, and analyzing test items, teachers should develop a system for maintaining and expanding a pool or bank of items from which to select items for future tests. Teachers can maintain databases of test items on their computers with backups on storage devices. When teachers store test item databases electronically, the files must be password-protected and test security maintained.
When developing test banks, the teacher can record the following data with each test item:
(a) the correct response for objective-type items and a brief scoring key for completion or essay items
(b) the course, unit, content area, or objective for which it was designed
(c) the item analysis results for a specified period of time
Exhibit 10.2 offers one such example. Commercially produced software applications can be used in a similar way to develop a database of test items. Each test item is a record in the database. The test items can then be sorted according to the fields in which the data are entered; for example, the teacher could retrieve all items that are classified as Objective 3, with a moderate difficulty index. Many publishers also offer test-item banks that relate to the content contained in their textbooks.
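As a minimal sketch of such a database (the field names, the in-memory sqlite3 storage, and the sample record, including the course label, are assumptions rather than a prescribed format), each item becomes one record that can be filtered by objective and difficulty.

```python
import sqlite3

# One record per test item, with the fields suggested above: key/scoring
# notes, classification, and the most recent item-analysis statistics.
# In practice this database would be stored in a secured file rather than
# in memory, consistent with the test-security cautions above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        item_id INTEGER PRIMARY KEY,
        stem TEXT NOT NULL,
        keyed_response TEXT NOT NULL,
        course TEXT,
        objective TEXT,
        content_area TEXT,
        difficulty REAL,        -- P-value from the most recent administration
        discrimination REAL     -- D-value from the most recent administration
    )
""")
conn.execute(
    "INSERT INTO items (stem, keyed_response, course, objective, content_area,"
    " difficulty, discrimination) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("What is the most likely explanation for breast asymmetry in an adolescent girl?",
     "d", "NUR 301", "Objective 3", "Adolescent development", 0.60, 0.40),
)
conn.commit()

# Retrieve all items classified under Objective 3 with moderate difficulty.
rows = conn.execute(
    "SELECT item_id, stem, difficulty, discrimination FROM items"
    " WHERE objective = ? AND difficulty BETWEEN 0.30 AND 0.70",
    ("Objective 3",),
).fetchall()
for row in rows:
    print(row)
conn.close()
```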
However, faculty members need to be cautious about using these items for their own examinations. The purpose of the test, relevant characteristics of the students to be tested, and the balance and emphasis of content as reflected in the teacher’s test blueprint are the most important criteria for selecting test items. Although some teachers would consider these item banks to be a shortcut to the development and selection of test items, they should be evaluated carefully before they are used.
There is no guarantee that the quality of test items in a published item bank is superior to that of test items that a skilled teacher can construct. Many of the items may be of questionable quality. Masters and colleagues (2001) examined a random sample of 2,913 multiple-choice items from 17 test banks associated with selected nursing textbooks. Items were evaluated to determine if they met accepted guidelines for writing multiple-choice items and were coded as to their cognitive level based on Bloom’s taxonomy.
The researchers found 2,233 violations of item-writing guidelines; whereas most of the problems were minor, some were more serious. Nearly half of the items were at the recall level. In addition, published test-item banks seldom contain item-analysis information such as difficulty and discrimination indices. However, the teacher can calculate this information for each item used or modified from a published item bank, and can develop and maintain an item file.
Conclusion
After administering a test, the teacher must score it and interpret the results. To accurately interpret test scores, the teacher needs to analyze the performance of the test as a whole as well as the individual test items. Information about how the test performed helps teachers to give feedback to students about test results and to improve test items for future use. Scoring is the process of determining the first direct, uninterpreted measure of performance on a test, usually called the raw score.
The raw score usually represents the number of right answers. Test scores should not be converted to grades for the purpose of later computing a final average grade. Instead, the teacher should record actual test scores and then combine them into a composite score that can be converted to a final grade. As a general rule, each objectively scored test item should have equal weight.
If knowledge of one concept is more important than knowledge of another concept, the teacher should sample the more important domain more heavily by writing more items in that area. Most machine-scoring systems assign 1 point to each correct answer; this seems reasonable for hand-scored tests as well. A raw score sometimes is adjusted or corrected before it is interpreted.
One procedure involves applying a formula intended to eliminate any advantage that a student might have gained by guessing correctly. Correcting for guessing is appropriate only when students have been instructed to not answer any item for which they are uncertain of the answer; students may interpret and follow this advice differently. Therefore, the best approach is to advise all students to answer every item, with no correction for guessing applied.
Item analysis can be performed by hand or by the use of a computer program. Teachers should be familiar enough with the meaning of each item-analysis statistic to correctly interpret the results. The difficulty index (P), ranging from 0 to 1.00, indicates the percentage of students who answered the item correctly. Items with P-values of .20 and below are considered to be difficult, and those with P-values of .80 and above are considered to be easy.
However, interpretation of the difficulty index should take into account the quality of the instruction and the abilities of the students in the group. The discrimination index (D), ranging from −1.00 to +1.00, is an indication of the extent to which high-scoring students answered the item correctly more often than low-scoring students did. In general, the higher the positive value, the better the test item; discrimination indexes of at least +.20 are desirable. An item’s power to discriminate is highly related to its difficulty index.
An item that is answered correctly by all students has a difficulty index of 1.00; the discrimination index for this item is 0.00, because there is no difference in performance on that item between high scorers and low scorers. Flaws in test construction may have varying effects on students’ scores and therefore should be handled differently.
If the correct answer to a multiple-choice item is inadvertently omitted from the test, no student will be able to answer the item correctly. In this case, the item simply should not be scored. If a flaw consists of a misspelled word that does not change the meaning of the item, no adjustment should be made. Teachers should develop a system for maintaining a pool or bank of items from which to select items for future tests.
Item banks can be developed by the faculty and stored electronically. Use of published test-item banks should be based on the teacher’s evaluation of the quality of the items as well as on the purpose for testing, relevant characteristics of the students, and the desired emphasis and balance of content as reflected in the teacher’s test blueprint.