Assessment Reliability: Methods of Measurement and Factors Affecting Qualities of Effective Assessment Procedures

Methods of Measurement and Factors Affecting Reliability

Methods of Estimating Reliability

Because reliability is viewed in terms of different types of consistency, each type is estimated by a different method: consistency over time (stability), over different forms of the assessment (equivalence), within the assessment itself (internal consistency), and over different raters (consistency of ratings or interrater reliability). Each method of estimating reliability is described in further detail below.

Measure of Stability

Evidence of stability indicates whether students would achieve essentially the same scores if they took the same assessment at another time—a test–retest procedure. The correlation between the set of scores obtained on the first administration and the set obtained on the second yields a test–retest reliability coefficient.
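As a rough illustration of the test–retest procedure, the sketch below (Python with NumPy is assumed; the score sets are hypothetical, not drawn from the text) correlates two administrations of the same assessment to obtain a test–retest reliability coefficient.

```python
import numpy as np

# Illustrative only: hypothetical scores for eight students on the same
# assessment administered twice, several weeks apart.
first_administration = np.array([78, 85, 92, 60, 74, 88, 95, 70])
second_administration = np.array([80, 83, 90, 65, 70, 86, 93, 72])

# The test-retest reliability coefficient is the Pearson correlation
# between the two sets of scores.
test_retest_r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability coefficient: {test_retest_r:.2f}")
```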

This type of reliability evidence is known as stability and is appropriate for situations in which the trait being measured is expected to be stable over time. In general, the longer the period of time between administrations of the test, the lower the stability–reliability estimate (Nitko & Brookhart, 2007). In nursing education settings, the test–retest method of obtaining reliable information may have limited usefulness.

If the same test items are used on both tests, the students’ answers on the retest are not independent of their answers on the first test. That is, their responses to the second test may be influenced to some extent by recall of their previous responses or by discussion or individual review of content after taking the first test.

In addition, if there is a long interval between testing occasions, other factors such as real changes in student ability as a result of learning may affect the retest scores. When selecting standardized tests, however, stability is an important consideration (Miller et al., 2009).

Measure of Equivalence

Equivalent-forms reliability, also known as alternate or parallel forms, involves the use of two or more forms of the same assessment, constructed independently but based on the same set of specifications. Both forms of the assessment are administered to the same group of students in close succession, and the resulting scores are correlated.

A high reliability coefficient indicates that the two forms sample the domain of interest equally well, and that generalizations about student performance from one assessment to the other can be made with a high degree of validity.

The equivalent-form estimates of reliability are widely used in standardized testing, primarily to assure test security, but the user cannot assume comparability of alternate forms unless the test manual provides information about equivalence (Miller et al., 2009). This method of reliability estimation is not practical for teacher-constructed assessments because most teachers do not find time to prepare two forms of the same test, let alone to assure that these forms are indeed equivalent (Nitko & Brookhart, 2007).

Measures of Internal Consistency

Internal consistency methods can be used with a set of scores from only one administration of a single assessment. Sometimes referred to as split-half or half-length methods, estimates of internal consistency reveal the extent to which consistent results are obtained from two halves of the same assessment.

The split-half technique consists of dividing the assessment into two equal subtests, usually by including odd-numbered items on one subtest and even-numbered items on the other. Then the subtests are scored separately, and the two subscores are correlated. The resulting correlation coefficient is an estimate of the extent to which the two halves consistently perform the same measurement.
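A minimal sketch of the split-half procedure, assuming Python with NumPy and a small hypothetical matrix of dichotomously scored item responses (rows are students, columns are items):

```python
import numpy as np

# Hypothetical item responses: 1 = correct, 0 = incorrect.
responses = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
])

# Odd-numbered items (columns 0, 2, 4, ... when counting from zero)
# form one half; even-numbered items form the other.
odd_half_scores = responses[:, 0::2].sum(axis=1)
even_half_scores = responses[:, 1::2].sum(axis=1)

# Correlate the two half scores to estimate split-half consistency.
half_correlation = np.corrcoef(odd_half_scores, even_half_scores)[0, 1]
print(f"Correlation between half-test scores: {half_correlation:.2f}")
```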

Longer assessments tend to produce more reliable results than shorter ones, in part because they tend to sample the content domain more fully. Therefore, a split-half reliability estimate tends to underestimate the true reliability of the scores produced by the whole assessment, because each half-test includes only half of the total number of items.

This estimation can be corrected by using the Spearman-Brown prophecy formula, also called the Spearman-Brown double-length formula, as represented by the following equation (Miller et al., 2009, p. 114):

Reliability of full assessment = (2 × correlation between half-test scores) / (1 + correlation between half-test scores)
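Expressed as code, the correction might look like the following sketch (Python is assumed; the function name and input value are illustrative):

```python
def spearman_brown_full_length(half_correlation: float) -> float:
    """Spearman-Brown double-length correction: estimate the reliability
    of the full assessment from the correlation between its two halves."""
    return (2 * half_correlation) / (1 + half_correlation)

# Example: a correlation of 0.60 between half-test scores corresponds to
# an estimated full-length reliability of 0.75.
print(spearman_brown_full_length(0.60))
```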

Another method of estimating the internal consistency of a test is to compute coefficient alpha. Coefficient alpha reliability estimates provide information about the extent to which the assessment tasks measure similar characteristics. When the assessment contains relatively homogeneous material, the coefficient alpha reliability estimate is similar to that produced by the split-half method. In other words, coefficient alpha represents the average correlation obtained from all possible split-half reliability estimates.

The Kuder-Richardson formulas are a specific type of coefficient alpha. Computation of Formula 20 (KR-20) is based on the proportion of correct responses and the standard deviation of the total score distribution. If the assessment items are not expected to vary much in difficulty, the simpler Formula 21 (KR-21) can be used to approximate the value of KR-20, although in most cases it will produce a slightly lower estimate of reliability.

To use either formula, the assessment items must be scored dichotomously, that is, right or wrong (Miller et al., 2009; Nitko & Brookhart, 2007). If the assessment items could receive a range of points, coefficient alpha should be used to provide a reliability estimate. The widespread availability of computer software for assessment scoring and test and item analysis makes these otherwise cumbersome calculations more feasible to obtain efficiently (Miller et al., 2009).
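As a hedged illustration of the KR-20 computation described above, the following Python sketch uses a small hypothetical matrix of dichotomously scored responses; the data are invented for the example, and the population form of the variance is used here.

```python
import numpy as np

# Hypothetical responses: rows are students, columns are items (1 = correct).
responses = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 1, 1],
])

k = responses.shape[1]                              # number of items
p = responses.mean(axis=0)                          # proportion correct for each item
q = 1 - p                                           # proportion incorrect for each item
total_score_variance = responses.sum(axis=1).var()  # variance of the total score distribution

# KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores)
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_score_variance)
print(f"KR-20 estimate: {kr20:.2f}")
```

In practice, as the text notes, test-scoring and item-analysis software typically performs these calculations.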

Measures of Consistency of Ratings

Depending on the type of assessment, error may arise from the procedures used to score a test. Teachers may need to collect evidence to answer the question, “Would this student have obtained the same score if a different person had scored the assessment or judged the performance?”

The easiest method for collecting this evidence is to have two equally qualified persons score each student’s paper or rate each student’s performance. The two scores then are compared to produce a percentage of agreement or correlated to produce an index of scorer consistency, depending on whether agreement in an absolute sense or a relative sense is required.
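The sketch below illustrates both approaches with hypothetical ratings from two raters (Python is assumed): an exact percentage of agreement for the absolute sense, and a correlation coefficient for the relative sense.

```python
import numpy as np

# Hypothetical ratings of eight student performances by two raters.
rater_a = np.array([4, 3, 5, 2, 4, 3, 5, 4])
rater_b = np.array([4, 3, 4, 2, 4, 2, 5, 4])

# Absolute agreement: percentage of performances given identical ratings.
percent_agreement = (rater_a == rater_b).mean() * 100

# Relative consistency: correlation between the two raters' ratings.
interrater_r = np.corrcoef(rater_a, rater_b)[0, 1]

print(f"Exact agreement: {percent_agreement:.0f}%")
print(f"Interrater correlation: {interrater_r:.2f}")
```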

Achieving a high degree of interrater consistency depends on consensus of judgment among raters regarding the value of a given performance. Such consensus is facilitated by the use of scoring rubrics and training of raters to use those rubrics. Interrater consistency is important to ensure that differences in stringency or leniency of ratings between raters do not place some students at a disadvantage (Miller et al., 2009).

Factors That Influence the Reliability of Scores

From the previous discussion, it is obvious that various factors can influence the reliability of a set of test scores. These factors can be categorized into three main sources: the assessment instrument itself, the student, and the assessment administration conditions. Assessment-related factors include the length of the test, the homogeneity of assessment tasks, and the difficulty and discrimination ability of the individual items.

In general, the greater the number of assessment tasks (e.g., test items), the greater the score reliability. The Spearman-Brown prophecy formula can be used to estimate the effect of adding assessment tasks on the reliability coefficient. For example, if a 10-item test has a reliability coefficient of 0.40, adding 15 items (creating a test that is 2.5 times the length of the original test) would produce a reliability estimate of 0.625.
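A brief sketch of that projection, using the general form of the Spearman-Brown prophecy formula with the numbers from the example (Python is assumed; the function name is illustrative):

```python
def spearman_brown_prophecy(reliability: float, length_factor: float) -> float:
    """Projected reliability when an assessment is made length_factor times longer."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A 10-item test with reliability 0.40, lengthened to 25 items (2.5 times longer):
print(spearman_brown_prophecy(0.40, 2.5))  # 0.625, as in the example above
```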

Of course, adding assessment tasks to increase score reliability may be counterproductive after a certain point. After that point, adding tasks will increase the reliability only slightly, and student fatigue and boredom actually may introduce more measurement error. Score reliability is also enhanced by homogeneity of content covered by the assessment.

Course content that is tightly organized and highly interrelated tends to make homogeneous assessment content easier to achieve. Finally, the technical quality of assessment items, their difficulty, and their ability to discriminate between students who know the content and students who do not also affect the reliability of scores.

Moderately difficult items that discriminate well between high achievers and low achievers and that contain no technical errors contribute a great deal to score reliability. Student-related factors include the heterogeneity of the student group, test-taking ability, and motivation. In general, reliability tends to increase as the range of talent in the group of students increases.

Therefore, in situations in which students are very similar to one another in ability, such as in graduate programs, assessments are likely to produce scores with somewhat lower reliability than desired. A student’s test-taking skill and experience also may influence score reliability to the extent that the student is able to obtain a score higher than true ability would predict. The effect of motivation on reliability is proportional to the extent to which it influences individual students differently.

If some students are not motivated to put forth their best efforts on an assessment, their actual achievement levels may not be accurately represented, and their relative achievement in comparison to other students will be difficult to judge. Teachers need to control assessment administration conditions to enhance the reliability of scores. Inadequate time to complete the assessment can lower the reliability of scores because some students who know the content will be unable to respond to all of the items.

Cheating also contributes random errors to assessment scores when students are able to respond correctly to items to which they actually do not know the answer. Cheating, therefore, has the effect of raising the offenders’ observed scores above their true scores, contributing to inaccurate and less meaningful interpretations of test scores.

Because a reliability coefficient is an indication of the amount of measurement error associated with a set of scores, it is useful information for evaluating the meaning and usefulness of those scores. Again, it is important to remember that the numerical value of a reliability coefficient is not a stable property of an assessment; it will fluctuate from one sample of students to another each time the assessment is administered.

Teachers often wonder how high the reliability coefficient should be to ensure that an assessment will produce reliable results. The degree of reliability desired depends on a number of factors, including the importance of the educational decision being made, how far-reaching the consequences would be, and whether it is possible to confirm or reverse the judgment later.

For irreversible decisions that would have serious consequences, like the results of the first attempt of the NCLEX®, a high degree of reliability must be assured. For less important decisions, especially if later review can confirm or reverse them without serious harm to the student, less reliable methods may be acceptable. For teacher-made assessments, a reliability coefficient between 0.60 and 0.85 is desirable (Miller et al., 2009).

Practicality

Although reliability and validity are used to describe the ways in which scores are interpreted and used, practicality (also referred to as usability) is a quality of the assessment instrument itself and its administration procedures. Assessment procedures should be efficient and economical.

An assessment is practical or usable to the extent that it is easy to administer and score, does not take too much time away from other instructional activities, and has reasonable resource requirements. Whether they develop their own tests and other measurement tools or use published instruments, teachers should focus on the following questions to help guide the selection of appropriate assessment procedures (Miller et al., 2009; Nitko & Brookhart, 2007):

  1. Is the assessment easy to construct and use? Essay test items may be written more quickly and easily than multiple-choice items, but they will take more time to score. Multiple-choice items that assess a student’s ability to think critically about clinical problems are time-consuming to construct, but they can be machine-scored quickly and accurately. The teacher must determine the best use of the time available for assessment construction, administration, and scoring. If a published test is selected for assessment of students’ competencies just prior to graduation, is it practical to use? Does proper administration of the test require special training? Are the test administration directions easy to understand?
  2. Is the time needed to administer and score the assessment and interpret the results reasonable? A teacher of a 15-week, 3-credit course wants to give a weekly 10-point quiz that would be reviewed immediately and self-scored by students; these procedures would take a total of 30 minutes of class time. Is this the best use of instructional time? The teacher may decide that there is enormous value in the immediate feedback provided to students during the test review, and that the opportunity to obtain weekly information about the effectiveness of instruction is also beneficial; to that teacher, 30 minutes weekly is time well spent on assessment. Another teacher, whose total instructional time is only 4 days, may find that administering more than one test consumes time that is needed for teaching. Evaluation is an important step in the instructional process, but it cannot replace teaching. Although students often learn from the process of preparing for and taking assessments, instruction is not the primary purpose of assessment, and assessment is not the most efficient or effective way to achieve instructional goals. On the other hand, reliability is related to the length of an assessment (i.e., the number of assessment tasks); it may be preferable to use fewer assessments of longer length rather than more frequent shorter assessments.
  3. Are the costs associated with assessment construction, administration, and scoring reasonable? Although teacher-made assessments may seem to be less expensive than published instruments, the cost of the instructor’s time spent in assessment development must be taken into consideration. Additional costs associated with the scoring of teacher-made assessments also must be calculated. What is the initial cost of purchasing test booklets for published instruments, and can test booklets be reused? What is the cost of answer sheets, and does that cost include scoring services? When considering the adoption of a computerized testing package, teachers and administrators must decide how the costs of the program will be paid and by whom (the educational program or the individual students).
  4. Can the assessment results be interpreted easily and accurately by those who will use them? If teachers score their own assessments, will they obtain information that will help them interpret the results accurately? For example, will they have test and item statistics that will help them make meaning out of the individual test scores? Scanners and software are available that will quickly score assessments that use certain types of answer sheets, but the scope of the information produced in the score report varies considerably. Purchased assessments that are scored by the publisher also yield reports of test results. Are these reports useful for their intended purpose? What information is needed or desired by the teachers who will make evaluation decisions, and is that information provided by the score-reporting service?

Examples of information on score reports include individual raw total scores, individual raw subtest scores, group mean and median scores, individual or group profiles, and individual standard scores. Will the teachers who receive the reports need special training to interpret this information accurately? Some assessment publishers restrict the purchase of instruments to users with certain educational and experience qualifications, in part so that the test results will be interpreted and used properly.

Conclusion

Because assessment results are often used to make important educational decisions, teachers must have confidence in their interpretations of test scores. A valid assessment produces results that allow teachers to make accurate interpretations about a test-taker’s knowledge or ability. Validity is not a static property of the assessment itself, but rather, it refers to the ways in which teachers interpret and use the assessment results.

Validity is not an either/or judgment; there are degrees of validity depending on the purpose of the assessment and how the results are to be used. A single assessment may be used for many different purposes, and the results may have greater validity for one purpose than for another. Teachers must gather a variety of sources of evidence to support the validity of their interpretation and use of assessment results.

Four major considerations for validation are related to content, construct, assessment-criterion relationships, and the consequences of assessment. Content considerations focus on the extent to which the sample of assessment items or tasks represents the domain of content or abilities that the teacher wants to measure. Content validity evidence may be obtained during the assessment-development process as well as by appraising a completed assessment, as in the case of a purchased instrument.

Currently, construct considerations are seen as the unifying concept of assessment validity, representing the extent to which score-based inferences about the construct of interest are accurate and meaningful. Two questions central to the process of construct validation concern how adequately the assessment represents the construct of interest (construct representation), and the extent to which irrelevant or ancillary factors influence the results (construct relevance).

Methods used in construct validation include defining the domain to be measured, analyzing the task-response processes required by the assessment, comparing assessment results of known groups, comparing assessment results before and after a learning activity, and correlating assessment results with other measures. Procedures for collecting evidence using each of these methods were described.

Assessment-criterion relationship considerations for obtaining validity evidence focus on predicting future performance (the criterion) based on current assessment results. Obtaining this type of evidence involves a predictive validation study. If the assessment results are to be used to estimate students’ performance on another assessment (the criterion measure) at the same time, the evidence is concurrent, and obtaining this type of evidence requires a concurrent validation study.

Teachers rarely study the correlation of their own assessment results with criterion measures, but for tests with high-stakes outcomes, such as licensure and certification, this type of validity evidence is critical. Ultimately, assessment validity requires an evaluation of interpretations and use of assessment results. The concept of validity has thus expanded to include consideration of the consequences of assessment use and how results are interpreted to students, teachers, and other stakeholders.

Consideration of consequences must include both intended and unintended effects of assessment, particularly when assessment results are used to make high-stakes decisions. A number of factors affect the validity of assessment results, including characteristics of the assessment itself, the administration and scoring procedures, and the test-takers. Each of these factors was discussed in some detail. Reliability refers to the consistency of scores.

Each assessment produces a limited measure of performance at a specific time. If this measurement is reasonably consistent over time, with different raters, or with different samples of the same domain, teachers can be more confident in the assessment results. Many extraneous factors may influence the measurement of performance, including instability of the behavior being measured, different samples of tasks in each assessment, varying assessment conditions between assessments, and inconsistent scoring procedures.

These and other factors introduce error into every measurement. Methods of determining assessment reliability estimate how much measurement error is present under varying assessment conditions. When assessment results are reasonably consistent, there is less measurement error and greater reliability. Several points are important to an understanding of the concept of assessment reliability.

Reliability pertains to assessment results, not to the assessment instrument itself. A reliability estimate always refers to a particular type of consistency, and it is possible for assessment results to be reliable in one or more of these respects but not in others. A reliability estimate is always calculated with statistical indices that express the relationship between two or more sets of scores.

Reliability is an essential but insufficient condition for validity; low reliability always produces a low degree of validity, but a high reliability estimate does not guarantee a high degree of validity.

Because reliability is viewed in terms of different types of consistency, these types are estimated by different methods: over time (stability), over different forms of the assessment (equivalence), within the assessment itself (internal consistency), and over different raters (consistency of ratings or interrater reliability).

Measures of stability indicate whether students would achieve essentially the same scores if they took the same assessment at another time—a test–retest procedure. Measures of equivalence involve the use of two or more forms of the same assessment, based on the same set of specifications (equivalent or alternate forms). Both forms of the assessment are administered to the same group of students in close succession, and the resulting scores are correlated.

A high reliability coefficient indicates that teachers can make valid generalizations about student performance from one assessment to the other. Equivalent-form estimates of reliability are widely used in standardized testing, but are not practical for teacher-constructed assessments. Measures of internal consistency (split-half or half-length methods) can be used with a set of scores from only one administration of a single assessment.

Estimates of internal consistency reveal the extent to which consistent results are obtained from two halves of the same assessment, revealing the extent to which the test items are internally consistent or homogeneous. Measures of consistency of ratings determine the extent to which ratings from two or more equally qualified persons agree on the score or rating.

Interrater consistency is important to ensure that differences in stringency or leniency of ratings between raters do not place some students at a disadvantage. Use of scoring rubrics and training of raters to use those rubrics facilitates consensus among raters. Various factors can influence the reliability of a set of test scores. These factors can be categorized into three main sources: the assessment instrument itself, the student, and the assessment administration conditions.

Assessment-related factors include the length of the assessment, the homogeneity of assessment content, and the difficulty and discrimination ability of the individual items. Student-related factors include the heterogeneity of the student group, test-taking ability, and motivation. Factors related to assessment administration include inadequate time to complete the test and cheating. In addition, assessment tools should be practical and easy to use.

Although reliability and validity are used to describe the ways in which scores are interpreted and used, practicality or usability is a quality of the instrument itself and its administration procedures. Assessment procedures should be efficient and economical.

Teachers need to evaluate the following factors: ease of construction and use; time needed to administer and score the assessment and interpret the results; costs associated with assessment construction, administration, and scoring; and the ease with which assessment results can be interpreted simply and accurately by those who will use them.
