Planning for Classroom Testing: Purpose, Population, Test Length, Difficulty, and Discrimination

Purpose and Population

All decisions involved in planning a test are based on a teacher’s knowledge of the purpose of the test and the relevant characteristics of the population of learners to be tested. The purpose of the test involves why it is to be given, what it is supposed to measure, and how the test scores will be used. For example, if a test is to be used to measure the extent to which students have met learning objectives in order to determine course grades, its primary purpose is summative.

If the teacher expects the course grades to reflect real differences in the amount of knowledge among the students, the test must be sufficiently difficult to produce an acceptable range of scores. On the other hand, if a test is to be used primarily to provide feedback to staff nurses about their knowledge following a continuing education program, the purpose of the test is formative. If the results will not be used to make important personnel decisions, a large range of scores is not necessary, and the test items can be of moderate or low difficulty.

A teacher’s knowledge of the population that will be tested is useful in selecting the item formats to be used, determining the length of the test and the testing time required, and selecting the appropriate scoring procedures. The term population is not used here in its research sense, but rather to indicate the general group of learners who will be tested.

The students’ reading levels, English-language literacy, visual acuity, health, and previous testing experience are examples of factors that might influence these decisions. For example, if the population to be tested is a group of five patients who have completed preoperative instruction for coronary bypass graft surgery, the teacher would probably not administer a test of 100 multiple-choice and matching items with a machine-scored answer sheet. However, this type of test might be most appropriate as a final course examination for a class of 75 senior nursing students.

Test Length

The length of the test is an important factor that is related to its purpose, the abilities of the students, the item formats to be used, the amount of testing time available, and the desired reliability of the test scores.

In general, a longer test samples the content domain more adequately and tends to yield more reliable scores. However, if the purpose of the test is to measure knowledge of a small content domain with a limited number of objectives, fewer items will be needed to achieve an adequate sampling of the content. It should be noted that assessment length refers to the number of test items or tasks, not to the amount of time it would take the student to complete the test.

Items that require the student to analyze a complex data set, draw conclusions, and supply or choose a response take more test administration time; therefore, fewer items of those types can be included on a test that must be completed in a fixed time period. When the number of complex assessment tasks that can be included on a test is limited by administration time, it is better to test more frequently than to create longer tests that end up measuring less important learning goals (Miller, Linn, & Gronlund, 2009; Waltz, Strickland, & Lenz, 2005).

Because test length is usually limited by the scheduled length of a testing period, it is wise to construct the test so that the majority of students, working at their normal pace, will be able to attempt all items. This type of test is called a power test. A speeded test is one that does not provide sufficient time for all students to respond to all items. Although most standardized tests are speeded, this type of test generally is not appropriate for teacher-made tests, in which accuracy rather than speed of response is important (Miller et al., 2009; Nitko & Brookhart, 2007).

Difficulty and Discrimination Level

The desired difficulty of a test and its ability to differentiate among various levels of performance are related considerations. Both factors are affected by the purpose of the test and the way in which the scores will be interpreted and used. The difficulty of individual test items affects the average test score: the mean score of a group of students is equal to the sum of the difficulty levels (proportions of correct responses) of the test items.
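This relationship is easy to verify with a small calculation. The following sketch uses invented response data (not drawn from the sources cited here) to show that the mean number of correct answers equals the sum of the per-item difficulty indices:

```python
# A minimal sketch with hypothetical data: the mean test score equals the
# sum of the item difficulty indices (proportion correct per item).
import numpy as np

# Rows = students, columns = items; 1 = correct, 0 = incorrect (invented data).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
])

p_values = responses.mean(axis=0)          # difficulty index for each item
mean_score = responses.sum(axis=1).mean()  # average number correct per student

print(p_values)        # [0.75 0.75 0.25 0.75]
print(p_values.sum())  # 2.5
print(mean_score)      # 2.5 -- equal, as the text states
```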

The difficulty level of each test item depends on the complexity of the task, the ability of the students who answer it, and the quality of the teaching. It may also be related to the perceived complexity of the item; if students perceive the task as too difficult, they may skip it, resulting in a lower percentage of students who answer the item correctly (Nitko & Brookhart, 2007). A general guideline is to aim for a moderate difficulty level (Waltz et al., 2005), but this rule has different applications depending on how the test results will be interpreted.

If test results are to be used to determine the relative achievement of students (i.e., norm-referenced interpretation), the majority of items on the test should be moderately difficult. The recommended difficulty level for selection-type test items depends on the number of choices allowed.

The percentage of students who answer each item correctly should be about midway between 100% and the chance level of guessing correctly (e.g., 50% for true–false items, 25% for four-alternative multiple-choice items). For example, a moderately difficult true–false item should be answered correctly by 75% to 85% of students (Nitko & Brookhart, 2007; Waltz et al., 2005). When the majority of items on a test are too easy or too difficult, they will not discriminate well between students with varying levels of knowledge or ability.
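The "midway between chance and 100%" rule can be expressed as a one-line calculation. This is a minimal sketch of that arithmetic; the function name is invented for illustration:

```python
# A minimal sketch of the "midway between chance and 100%" rule above.
def ideal_difficulty(num_options: int) -> float:
    """Target proportion correct for a moderately difficult selection-type item."""
    chance = 1 / num_options          # probability of answering by blind guessing
    return chance + (1 - chance) / 2  # midway between the chance level and 1.0

print(ideal_difficulty(2))  # 0.75  -> true-false items
print(ideal_difficulty(4))  # 0.625 -> four-alternative multiple-choice items
```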

However, if the teacher wants to make criterion-referenced judgments, which are more commonly used in nursing education and practice settings, the overall concern is whether a student’s performance meets a set standard rather than the actual score itself. If the purpose of the assessment is to screen out the least capable students (e.g., those failing a course), it should be relatively easy for most test-takers.

However, comparing performance to a set standard does not limit assessment to the testing of lower-level knowledge and ability; considerations of assessment validity should guide the teacher to construct tests that adequately sample the knowledge or performance domain. When criterion-referenced test results are reported as percentage scores, their variability (range of scores) may be similar to norm-referenced test results, but the interpretation of the range of scores would be narrower.

For example, on a final examination in a nursing course, the potential score range may be 0% to 100%, but the passing score is set at 80%. Even if there is wide variability of scores on the exam, the primary concern is whether the test correctly classifies each student as performing above or below the standard (e.g., 80%). In this case, the teacher should examine the difficulty level of the test items and compare them between groups (students who met the standard and students who did not), as sketched below.
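The following sketch, using invented response data, illustrates one way such a comparison might look; a positive gap in item difficulty between the two groups suggests the item classifies students consistently with the standard:

```python
# A hedged sketch with invented data: comparing each item's difficulty index
# between students who met the performance standard and those who did not.
met_standard = [[1, 1, 0], [1, 1, 1], [1, 0, 1]]  # rows = students, 1 = correct
not_met      = [[0, 1, 0], [1, 0, 0]]

def difficulty(group: list[list[int]], item: int) -> float:
    """Proportion of the group answering the given item correctly."""
    return sum(student[item] for student in group) / len(group)

for item in range(3):
    p_met = difficulty(met_standard, item)
    p_not = difficulty(not_met, item)
    # A positive gap suggests the item separates the two groups as intended.
    print(f"item {item + 1}: met={p_met:.2f}, not met={p_not:.2f}, gap={p_met - p_not:+.2f}")
```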

If item difficulty levels indicate a relatively easy or relatively difficult exam, criterion-referenced decisions will still be appropriate if the measure consistently classifies students according to the performance standard (Miller et al., 2009; Waltz et al., 2005). It is important to keep in mind that the difficulty level of test items can only be estimated in advance; the accuracy of that estimate depends on the teacher’s experience in testing this content and knowledge of the abilities of the students to be tested.

When the test has been administered and scored, the actual difficulty index for each item can be compared with the expected difficulty, and items can be revised if the actual difficulty level is much lower or much higher than anticipated (Waltz et al., 2005).
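As a final illustration, here is a minimal sketch of that post-administration check. The item names, difficulty values, and the 0.20 review threshold are all invented for illustration, not published cutoffs:

```python
# A minimal sketch (hypothetical values) of flagging items whose actual
# difficulty index departs sharply from the difficulty the teacher anticipated.
expected = {"item1": 0.75, "item2": 0.625, "item3": 0.625}  # anticipated p-values
actual   = {"item1": 0.78, "item2": 0.30,  "item3": 0.90}   # observed p-values

TOLERANCE = 0.20  # review threshold; an assumption for illustration only

for item, expected_p in expected.items():
    gap = actual[item] - expected_p
    if abs(gap) > TOLERANCE:
        print(f"{item}: expected {expected_p:.2f}, observed {actual[item]:.2f} -> review or revise")
```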
