Qualities of Effective Assessment Procedures and Assessment Validity
How does a teacher know if a test or another assessment instrument is good? If assessment results will be used to make important educational decisions, teachers must have confidence in their interpretations of test scores. Good assessments produce results that can be used to make appropriate inferences about learners’ knowledge and abilities. In addition, assessment tools should be practical and easy to use. Two important questions have been posed to guide the process of constructing or selecting tests and other assessments:
- To what extent will the interpretation of the scores be appropriate, meaningful, and useful for the intended application of the results?
- What are the consequences of the particular uses and interpretations that are made of the results (Miller, Linn, & Gronlund, 2009)?

This chapter also discusses important practical considerations that might affect the choice or development of tests and other instruments.
What Is Assessment Validity?
Definitions of validity have changed over time. Early definitions, formed in the 1940s and early 1950s, emphasized the validity of an assessment tool itself. Tests were characterized as valid or not, apart from consideration of how they were used. It was common in that era to support a claim of validity with evidence that a test correlated well with another “true” criterion.
The concept of validity changed, however, in the 1950s through the 1970s to focus on evidence that an assessment tool is valid for a specific purpose. Most measurement textbooks of that era classified validity into three types (content, criterion-related, and construct) and suggested that validation of a test should include more than one approach.
In the 1980s, the understanding of validity shifted again, to an emphasis on providing evidence to support the particular inferences that teachers make from assessment results. Validity was defined in terms of the appropriateness and usefulness of the inferences made from assessments, and assessment validation was seen as a process of collecting evidence to support those inferences.
The usefulness of the validity “triad” was also questioned. Increasingly, measurement experts recognized that construct validity was the key element and unifying concept of validity (Goodwin, 1997). The current philosophy of validity continues to focus not on assessment tools themselves or on the appropriateness of using a test for a specific purpose, but on the meaningfulness of the interpretations that teachers make of assessment results.
Tests and other assessment instruments yield scores that teachers use to make inferences about how much learners know or what they can do. Validity refers to the adequacy and appropriateness of those interpretations and inferences and how the assessment results are used (Miller et al., 2009).
The emphasis is on the consequences of measurement: Does the teacher make accurate interpretations about learners’ knowledge or ability based on their test scores? Assessment experts increasingly suggest that in addition to collecting evidence to support the accuracy of inferences made, evidence should also be collected about the intended and unintended consequences of the use of a test (Goodwin, 1997; Nitko & Brookhart, 2007).
Validity does not exist on an all-or-none basis (Miller et al., 2009); there are degrees of validity depending on the purpose of the assessment and how the results are to be used. A given assessment may be used for many different purposes, and inferences about the results may have greater validity for one purpose than for another.
For example, a test designed to measure knowledge of perioperative nursing standards may produce results that have high validity for the purpose of determining certification for perioperative staff nurses, but the results may have low validity for assigning grades to students in a perioperative nursing elective course. Additionally, validity evidence may change over time, so that validation of inferences must not be considered a one-time event. Validity is now considered a unitary concept (Miller et al., 2009; Nitko & Brookhart, 2007).
The concept of validity in testing is described in the Standards for Educational and Psychological Testing prepared by a joint committee of the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME).
The most recent Standards (1999) no longer includes the view that there are different types of validity (for example, construct, criterion-related, and content). Instead, there are a variety of sources of evidence to support the validity of the interpretation and use of assessment results. The strongest case for validity can be made when evidence is collected regarding four major considerations for validation:
- Content,
- Construct,
- Assessment-criterion relationships, and
- Consequences (Miller et al., 2009, p. 74).
Each of these considerations is discussed below as it can be used in nursing education settings.
Content Considerations
The goal of content validation is to determine the degree to which a sample of assessment tasks accurately represents the domain of content or abilities about which the teacher wants to interpret assessment results. Tests and other assessment measures usually contain only a sample of all possible items or tasks that could be used to assess the domain of interest.
However, interpretations of assessment results are based on what the teacher believes to be the universe of items that could have been generated. In other words, when a student correctly answers 83% of the items on a women’s health nursing final examination, the teacher usually infers that the student would probably answer correctly 83% of all items in the universe of women’s health nursing content.
The test score thus serves as an indicator of the student’s true standing in the larger domain. Although this type of generalization is commonly made, it should be noted that the domains of achievement in nursing education involve complex understandings and integrated performances, about which it is difficult to judge the representativeness of a sample of assessment tasks (Miller et al., 2009).
A superficial conclusion could be made about the match between a test’s appearance and its intended use by asking a panel of experts to judge whether the test appears to be based on appropriate content. This type of judgment, sometimes referred to as face validity, is not sufficient evidence of content representativeness and should not be used as a substitute for rigorous appraisal of sampling adequacy (Miller et al., 2009).
Efforts to include suitable content on an assessment can and should be made during its development. This process begins with defining the universe of content. The content definition should be related to the purpose for which the test will be used. For example, if a test is supposed to measure a new staff nurse’s understanding of hospital safety policies and procedures presented during orientation, the teacher first defines the universe of content by outlining the knowledge about policies that the staff nurse needs to function satisfactorily.
The teacher then uses professional judgment to write or select test items that satisfactorily represent this desired content domain. If the teacher needs to select an appropriate assessment for a particular use, for example, a standardized achievement test, content validation is also of concern.
A published test may or may not be suitable for the intended use in a particular nursing education program or with a specific group of learners. The ultimate responsibility for appropriate use of an assessment and interpretation of results lies with the teacher (Miller et al., 2009; Standards, 1999).
To determine the extent to which an existing test is suitable, experts in the domain review the assessment, item by item, to determine whether the items or tasks are relevant to and satisfactorily represent the defined domain, as reflected in the table of specifications and the desired learning outcomes. Because these judgments admittedly are subjective, the trustworthiness of this evidence depends on clear instructions to the experts and estimation of rater reliability.
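As a simple illustration of how such expert judgments might be summarized, the Python sketch below tallies the proportion of experts who rate each item as relevant to the defined domain and flags items that fall below a cutoff. The rating scale, the example data, and the 80% threshold are assumptions made only for illustration; they are not prescribed by the Standards or by Miller et al.

```python
# Minimal sketch: summarizing expert relevance ratings during content validation.
# Each expert rates each item as relevant (1) or not relevant (0) to the defined
# content domain. All data, the scale, and the threshold are hypothetical.

ratings = {
    # item_id: ratings from five content experts (1 = relevant, 0 = not relevant)
    "item_01": [1, 1, 1, 1, 1],
    "item_02": [1, 1, 0, 1, 1],
    "item_03": [0, 1, 0, 0, 1],
}

THRESHOLD = 0.80  # assumed cutoff for judging an item adequately relevant

for item, votes in ratings.items():
    proportion_relevant = sum(votes) / len(votes)
    decision = "retain" if proportion_relevant >= THRESHOLD else "review or revise"
    print(f"{item}: {proportion_relevant:.2f} of experts rated it relevant -> {decision}")

# Overall index: the mean proportion of "relevant" ratings across all items,
# one rough indicator of how well the item set represents the domain.
overall = sum(sum(v) / len(v) for v in ratings.values()) / len(ratings)
print(f"Mean item relevance across the assessment: {overall:.2f}")
```

In practice, the teacher would also examine agreement among the raters themselves (for example, with an agreement coefficient) and compare flagged items against the table of specifications before revising or discarding them.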
Construct Considerations
Construct validity has been proposed as the “umbrella” under which all types of assessment validation belong (Goodwin, 1997). Content validation determines how well test scores represent a given domain and is important in evaluating assessments of achievement.
When teachers need to make inferences from assessment results to more general abilities and characteristics, however, such as critical thinking or communication ability, a critical consideration is the construct that the assessment is intended to measure (Miller et al., 2009). A construct is an individual characteristic that is assumed to exist because it explains some observed behavior.
As a theoretical construction, it cannot be observed directly, but it can be inferred from performance on an assessment. Construct validation is the process of determining the extent to which assessment results can be interpreted in terms of a given construct or set of constructs. Two questions, applicable to both teacher-constructed and published assessments, are central to the process of construct validation:
- How adequately does the assessment represent the construct of interest (construct representation)?
- Is the observed performance influenced by any irrelevant or ancillary factors (construct relevance)? (Miller et al., 2009)
Assessment validity is reduced to the extent that important elements of the construct are underrepresented in the assessment. For example, if the construct of interest is clinical problem-solving ability, the validity of a clinical performance assessment would be weakened if it focused entirely on problems defined by the teacher, because the learner’s ability to recognize and define clinical problems is an important aspect of clinical problem solving (Gaberson & Oermann, 2007).
The influence of factors that are unrelated or irrelevant to the construct of interest also reduces assessment validity. For example, students for whom English is a second language may perform poorly on an assessment of clinical problem solving, not because of limited ability to recognize, identify, and solve problems, but because of unfamiliarity with language or cultural colloquialisms used by patients or teachers (Bosher & Bowles, 2008).
Another potential construct-irrelevant factor is writing skill. For example, the ability to communicate clearly and accurately in writing may be an important outcome of a nursing education program, but the construct of interest for a course writing assignment is clinical problem solving. To the extent that student scores on that assignment are affected by spelling or grammatical errors, the validity of the assessment as a measure of that construct is reduced.
Testwiseness, performance anxiety, and learner motivation are additional examples of possible construct-irrelevant factors that may undermine assessment validity (Miller et al., 2009). Construct validation for a teacher-made assessment occurs primarily during its development by collecting evidence of construct representation and construct relevance from a variety of sources. Test manuals for published tests should include evidence that these methods were used to generate evidence of construct validity. Methods used in construct validation include:
- Defining the domain to be measured. The assessment specifications should clearly define the meaning of the construct so that it is possible to judge whether the assessment includes relevant and representative tasks.
- Analyzing the process of responding to tasks required by the assessment. The teacher can administer an assessment task to the learners (for example, a multiple-choice item that purportedly assesses critical thinking) and ask them to think aloud while they perform the test (for example, explain how they arrived at the answer they chose). This method may reveal that students were able to identify the correct answer because the same example was used in class or in an assigned reading, not because they were able to analyze the situation critically.
- Comparing assessment results of known groups. Sometimes it is reasonable to expect that scores on a particular measure will differ from one group to another because members of those groups are known to possess different levels of the ability being measured.
For example, if the purpose of a test is to measure students’ ability to think critically about pediatric clinical problems, students who achieve high scores on this test would be assumed to be better critical thinkers than students who achieve low scores. To collect evidence in support of this assumption, the teacher might design a study to determine if student scores on the test are correlated with their scores on a standardized test of critical thinking in nursing.
The teacher could divide the sample of students into two groups based on their standardized test scores: those who scored high on the standardized test in one group and those whose standardized test scores were low in the other group. Then the teacher would compare the teacher-made test scores of the students in both groups.
If the teacher’s hypothesis is confirmed (that is, if the students with good standardized test scores also obtained high scores on the teacher-made test), this evidence could be used as partial support for construct validation (Miller et al., 2009); a brief computational sketch of this comparison appears after this list. Group-comparison techniques have also been used in studies of test bias or test fairness.
Approaches to detection of test bias have looked for differential item functioning (DIF) related to test-takers’ race, gender, or culture. If test items function differently for members of groups with characteristics that do not directly relate to the variable of interest, differential validity of inferences from the test scores may result.
- Comparing assessment results before and after a learning activity. It is reasonable to expect that student performance on assessments would improve during instruction, whether in the classroom or in the clinical area, but assessment results should not be affected by other variables such as anxiety or memory of the preinstruction assessment content.
For example, evidence that assessment scores improve following instruction but are unaffected by an intervention designed to reduce students’ test anxiety would support the assessment’s construct validity (Miller et al., 2009).
- Correlating assessment results with other measures. Scores produced by a particular assessment should correlate well with scores of other measures of the same construct but show poor correlation with measures of a different construct.
For example, teachers’ ratings of students’ performance in pediatric clinical settings should correlate highly with scores on a final exam testing knowledge of nursing care of children, but would be expected to correlate less strongly with their classroom or clinical performance in a women’s health course. These correlations may be used to support the claim that a test measures the construct of interest (Miller et al., 2009).
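The known-groups comparison and the correlational method just described can be illustrated with a short computational sketch. The Python example below, using entirely hypothetical scores and small samples, compares teacher-made test scores between high- and low-scoring groups on a standardized critical thinking test, and then correlates pediatric clinical performance ratings with a same-construct measure and a different-construct measure. It is a sketch of the analyses a teacher might run, not a prescribed procedure from the sources cited above.

```python
# Minimal sketch of two construct-validation analyses described above.
# All scores are hypothetical; a real study would use actual assessment
# data and far larger samples.

from scipy import stats

# --- Known-groups comparison -------------------------------------------
# Teacher-made test scores, grouped by performance on a standardized
# critical thinking test (high scorers vs. low scorers).
high_group = [88, 92, 85, 90, 79, 94, 87]
low_group = [70, 74, 68, 81, 65, 72, 77]

t_stat, p_value = stats.ttest_ind(high_group, low_group)
print(f"Known-groups comparison: t = {t_stat:.2f}, p = {p_value:.3f}")
# A clear difference in the expected direction offers partial support for
# the claim that the teacher-made test reflects critical thinking ability.

# --- Convergent and discriminant correlations --------------------------
# Clinical performance ratings in pediatrics, final exam scores on nursing
# care of children (same construct domain), and course scores from an
# unrelated women's health course.
pediatric_ratings = [3.8, 3.2, 4.0, 2.9, 3.5, 3.9, 3.1]
pediatric_exam = [90, 78, 94, 72, 84, 92, 75]
womens_health_score = [82, 88, 76, 85, 80, 79, 91]

r_convergent, _ = stats.pearsonr(pediatric_ratings, pediatric_exam)
r_discriminant, _ = stats.pearsonr(pediatric_ratings, womens_health_score)
print(f"Correlation with same-construct measure:      r = {r_convergent:.2f}")
print(f"Correlation with different-construct measure: r = {r_discriminant:.2f}")
# Construct validity evidence is stronger when the first correlation is
# substantially higher than the second.
```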
Assessment-Criterion Relationship Considerations
This approach to obtaining validity evidence focuses on predicting future performance (the criterion) based on current assessment results. For example, nursing faculties often use scores from a standardized comprehensive exam given in the final academic semester or quarter to predict whether prelicensure students are likely to be successful on the NCLEX (the criterion measure).
Obtaining this type of evidence involves a predictive validation study (Miller et al., 2009). If teachers want to use assessment results to estimate students’ performance on another assessment (the criterion measure) at the same time, the validity evidence is concurrent, and obtaining this type of evidence requires a concurrent validation study.
This type of evidence may be desirable for making a decision about whether one test or measurement instrument may be substituted for another, more resource-intensive one. For example, a staff development educator may want to collect concurrent validity evidence to determine if a checklist with a rating scale can be substituted for a less efficient narrative appraisal of a staff nurse’s competence.
Teachers rarely conduct formal studies of the extent to which the scores on assessments that they have constructed are correlated with criterion measures. In some cases, adequate criterion measures are not available; the test in use is considered to be the best instrument that has been devised to measure the ability in question. If better measures were available, they could be used instead of the test being validated.
However, for tests with high-stakes outcomes, such as licensure and certification, this type of validity evidence is crucial. Multiple criterion measures are often used so that the strengths of one measure may offset the weaknesses of others (Miller et al., 2009). The relationship between assessment scores and those obtained on the criterion measure is usually expressed as a correlation coefficient.
A desired level of correlation between the two measures cannot be recommended because the correlation may be influenced by a number of factors, including test length, variability of scores in the distribution, and the amount of time between measures. The teacher who uses the test must exercise good professional judgment to determine what magnitude of correlation is considered adequate for the intended use of the assessment for which criterion-related evidence is desired.
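As a simple illustration of how such an assessment-criterion relationship might be examined, the sketch below correlates hypothetical comprehensive exam scores with a dichotomous NCLEX outcome using a point-biserial correlation. The data, the sample size, and the choice of coefficient are assumptions for illustration only; a formal predictive validation study would involve a much larger sample and careful attention to the factors noted above.

```python
# Minimal sketch of a predictive validation analysis: correlating scores on a
# comprehensive exam (the assessment) with a later NCLEX outcome (the
# criterion). All data are hypothetical and far smaller than a real study.

from scipy import stats

comprehensive_exam = [72, 85, 64, 90, 78, 69, 88, 74, 81, 95]
nclex_pass = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]  # 1 = pass, 0 = fail

# Point-biserial correlation: one reasonable choice when the criterion is
# dichotomous (pass/fail) and the predictor is a continuous score.
r_pb, p_value = stats.pointbiserialr(nclex_pass, comprehensive_exam)
print(f"Assessment-criterion correlation: r = {r_pb:.2f} (p = {p_value:.3f})")

# Interpreting the coefficient still requires professional judgment: the
# magnitude considered adequate depends on test length, score variability,
# the time between measures, and how the prediction will be used.
```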
Consideration of Consequences
Incorporating concern about the social consequences of assessment into the concept of validity is a relatively recent trend. Assessment has both intended and unintended consequences. For example, the faculties of many undergraduate nursing programs have adopted programs of achievement testing that are designed to assess student performance throughout the nursing curriculum.
The intended positive consequence of such testing is to identify students at risk of failure on the NCLEX, and to use this information to design remediation programs to increase student learning. Unintended negative consequences, however, may include increased student anxiety, decreased time for instruction relative to increased time allocated for testing, and tailoring instruction to more closely match the content of the tests while focusing less intently on other important aspects of the curriculum that will not be tested on the NCLEX.
The intended consequence of using standardized comprehensive exam scores to predict success on the NCLEX may be to motivate students whose assessment results predict failure to remediate and prepare more thoroughly for the licensure exam. But an unintended consequence might be that students whose comprehensive exam scores predict NCLEX success may decide not to prepare further for that important exam, risking a negative outcome.
Ultimately, assessment validity requires an evaluation of interpretations and use of assessment results. The concept of validity has thus expanded to include consideration of the consequences of assessment use and how results are interpreted to students, teachers, and other stakeholders. An adequate consideration of consequences must include both intended and unintended effects of assessment, particularly when assessment results are used to make high-stakes decisions (Miller et al., 2009).