Understanding Test Validity and Reliability: A Complete Guide for Researchers and Educators

The reliability of a test describes its consistency, that is, it produces similar results under the same conditions. The validity of a test describes its accuracy, that is, it measures what it purports to measure. To draw trustworthy conclusions from a test's results, the test must be both reliable and valid. Understanding these concepts helps researchers and educators ensure the quality of their assessments and of the information they provide.
The researcher will have to judge the place and significance of test data, not forgetting that the Hawthorne effect can operate negatively or positively on students who have to undertake the tests.
What Affects Test Reliability?
There is a range of issues which might affect the reliability of a test, for example:
- the time of day and the time of the school year;
- the temperature in the test room;
- the perceived importance of the test;
- the degree of formality of the test situation and 'examination nerves';
- the amount of guessing of answers by the students (the calculation of standard error features here);
- the way that the test is administered and the way that it is marked;
- the degree of closure or openness of test items.
Hence the researcher who is considering using testing as a way of acquiring research data must ensure that it is appropriate, valid and reliable (Linn, 1993). Wolf (1994) suggests four main factors that might affect reliability: the range of the group that is being tested, the group’s level of proficiency, the length of the measure (the longer the test the greater the chance of errors), and the way in which reliability is calculated.
Fitz-Gibbon (1997:36) argues that, other things being equal, longer tests are more reliable than shorter tests. Additionally, there are several ways in which reliability might be compromised in tests.
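One standard way to quantify the effect of test length is the Spearman-Brown prophecy formula (a general psychometric result, not drawn from Fitz-Gibbon directly): if a test with reliability r is lengthened by a factor k using comparable items, its predicted reliability is r_k = kr / (1 + (k - 1)r). For example, doubling a test whose reliability is 0.70 gives 2(0.70) / (1 + 0.70) ≈ 0.82.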
Four Major Threats to Test Reliability
Feldt and Brennan (1993) suggest four types of threat to reliability:
- individuals (e.g. their motivation, concentration, forgetfulness, health, carelessness, guessing, their related skills (e.g. reading ability), their familiarity with solving the type of problem set, and the effects of practice);
- situational factors (e.g. the psychological and physical conditions for the test—the context);
- test marker factors, e.g. idiosyncrasy and subjectivity;
- instrument variables (e.g. poor domain sampling, errors in sampling tasks, the realism of the tasks and their relatedness to the experience of the testees, poor question items, the assumption or extent of unidimensionality in item response theory, length of the test, mechanical errors, scoring errors, computer errors).
Common Reliability Problems
There is also a range of particular problems in conducting reliable tests, for example:
- there might be a questionable assumption of transferability of knowledge and skills from one context to another (e.g. students might perform highly in a mathematics examination, but are unable to use the same algorithm in a physics examination);
- students whose motivation, self-esteem, and familiarity with the test situation are low might demonstrate less than their full abilities;
- language and readability exert a significant influence (e.g. whether testees are using their first or second language);
- tests might have a strong cultural bias;
- instructions might be unclear and ambiguous;
- difficulty levels might be too low or too high;
- the number of operations in a single test item might be unreasonable (e.g. students might be able to perform each separate item but might be unable to perform several operations in combination).
Strategies for Improving Test Reliability
To address reliability, there is a need for moderation procedures (before and after the administration of the test) to iron out inconsistencies between test markers (Harlen, 1994), including:
- statistical reference/scaling tests (one common scaling approach is sketched after this list);
- inspection of samples (by post or by visit);
- group moderation of grades;
- post hoc adjustment of marks;
- accreditation of institutions;
- visits of verifiers;
- agreement panels;
- defining marking criteria;
- exemplification;
- group moderation meetings.
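As a concrete illustration of statistical scaling and post hoc adjustment of marks, one common moderation technique linearly rescales each marker's scores onto an agreed common mean and standard deviation. The minimal sketch below shows this in Python; the marker data, the reference scale of 60 ± 10, and all variable names are assumptions for the example, not taken from the source.

```python
import numpy as np

# Hypothetical raw marks from two markers of differing severity.
marks_by_marker = {
    "marker_A": np.array([62.0, 71, 55, 68, 74]),  # tends to mark high
    "marker_B": np.array([48.0, 57, 41, 54, 60]),  # tends to mark low
}

# Agreed reference scale (an assumption for this sketch).
target_mean, target_sd = 60.0, 10.0

moderated = {}
for marker, marks in marks_by_marker.items():
    # Standardize within each marker, then map onto the shared scale.
    z = (marks - marks.mean()) / marks.std(ddof=1)
    moderated[marker] = target_mean + target_sd * z

for marker, marks in moderated.items():
    print(marker, np.round(marks, 1))
```

Note that this removes differences in marker severity and spread, but it assumes each marker saw groups of comparable ability; where that assumption fails, group moderation meetings or agreement panels remain necessary.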
Whilst moderation procedures are essentially post hoc adjustments to scores, agreement trials and practice-marking can be undertaken before the administration of a test, which is particularly important if there are large numbers of scripts or several markers. The issue here is that the results as well as the instruments should be reliable. Reliability is also addressed by:
- calculating coefficients of reliability, using split-half techniques, the Kuder-Richardson formula, parallel/equivalent forms of a test, and test/re-test methods (a worked sketch follows this list);
- calculating and controlling the standard error of measurement;
- increasing the sample size (to maximize the range and spread of scores in a norm-referenced test), though criterion-referenced tests recognize that scores may bunch around the high level (in mastery learning, for example), i.e. that the range of scores might be limited, thereby lowering the correlation coefficient that can be calculated;
- increasing the number of observations made and items included in the test (in order to increase the range of scores);
- ensuring effective domain sampling of items in tests based on item response theory (a particular issue in Computer Adaptive Testing (Thissen, 1990));
- ensuring effective levels of item discriminability and item difficulty.
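To make these procedures concrete, the following minimal sketch computes KR-20, split-half reliability (stepped up with the Spearman-Brown formula), the standard error of measurement, and item difficulty/discrimination indices for dichotomously scored (0/1) items. The data are randomly generated purely for illustration; all names are assumptions, not drawn from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = (rng.random((200, 20)) < 0.6).astype(int)  # 200 students x 20 items

n_items = scores.shape[1]
totals = scores.sum(axis=1)

# Kuder-Richardson formula 20 (KR-20): internal consistency for 0/1 items.
p = scores.mean(axis=0)          # proportion correct = item difficulty
q = 1 - p
kr20 = (n_items / (n_items - 1)) * (1 - (p * q).sum() / totals.var(ddof=1))

# Split-half reliability: correlate odd- and even-numbered item halves,
# then step up to full test length with the Spearman-Brown formula.
odd, even = scores[:, 0::2].sum(axis=1), scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
split_half = 2 * r_half / (1 + r_half)

# Standard error of measurement: SEM = SD_total * sqrt(1 - reliability).
sem = totals.std(ddof=1) * np.sqrt(1 - kr20)

# Item discrimination: (uncorrected) point-biserial correlation of each
# item with the total score.
discrimination = np.array(
    [np.corrcoef(scores[:, i], totals)[0, 1] for i in range(n_items)]
)

print(f"KR-20 = {kr20:.2f}, split-half = {split_half:.2f}, SEM = {sem:.2f}")
print("difficulty:", np.round(p, 2))
print("discrimination:", np.round(discrimination, 2))
```

In practice the corrected item-total correlation (excluding the item from the total) is often preferred for discrimination, and IRT-based approaches such as those used in Computer Adaptive Testing model difficulty and discrimination as item parameters rather than sample statistics.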
Reliability not only has to be achieved but also has to be seen to be achieved, particularly in 'high-stakes' testing (where a great deal hangs on the results of the test, e.g. entrance to higher education or employment). Hence the procedures for ensuring reliability must be transparent. The difficulty here is that the more one moves towards reliability as defined above, the more objective the test becomes, the more students are measured as though they were inanimate objects, and the more decontextualized the test becomes.
An alternative form of reliability, premised on a more constructivist psychology, emphasizes the significance of context, the importance of subjectivity and the need to engage and involve the testee more fully than a simple test does.
Objective tests, as described in this topic, lean strongly towards the positivist paradigm, whilst more phenomenological and interpretive paradigms of social science research emphasize the importance of settings, of individual perceptions and of attitudes, in short, of 'authentic' testing (e.g. using non-contrived, non-artificial forms of test data, such as portfolios, documents, coursework, and tasks that are stronger in realism and more 'hands-on').
Though this latter adopts a view that is closer to assessment than to narrow 'testing', the two nevertheless overlap: both can yield marks, grades and awards; both can be formative as well as summative; and both can be criterion-referenced.
Understanding Test Validity
With regard to validity, it is important to note here that an effective test will ensure adequate:
- content validity (e.g. adequate and representative coverage of program and test objectives in the test items, a key feature of domain sampling); content validity is achieved by ensuring that the content of the test fairly samples the class or fields of the situations or subject matter in question, which means making professional judgements about the relevance and sampling of the contents of the test to a particular domain. It is concerned with coverage and representativeness rather than with patterns of response or scores. It is a matter of judgement rather than measurement (Kerlinger, 1986). Content validity will need to ensure several features of a test (Wolf, 1994): (a) test coverage (the extent to which the test covers the relevant field); (b) test relevance (the extent to which the test items are taught through, or are relevant to, a particular program); (c) program coverage (the extent to which the program covers the overall field in question).
- criterion-related validity (where a high correlation coefficient exists between the scores on the test and the scores on other accepted tests of the same performance); criterion-related validity is achieved by comparing the scores on the test with one or more variables (criteria) from other measures or tests that are considered to measure the same factor (a short computational sketch follows this list). Wolf (1994) argues that a major problem facing test devisers addressing criterion-related validity is the selection of a suitable criterion measure. He cites the example of the difficulty of selecting a suitable criterion of academic achievement in a test of academic aptitude. The criterion must be: (a) relevant (and agreed to be relevant); (b) free from bias (i.e. external factors that might contaminate the criterion are removed); (c) reliable, that is, precise and accurate; (d) capable of being measured or achieved.
- construct validity (e.g. the clear relatedness of a test item to its proposed construct (an unobservable quality or trait), demonstrated by both empirical data and logical analysis and debate, that is, the extent to which particular constructs or concepts can account for performance on the test); construct validity is achieved by ensuring that performance on the test is fairly explained by particular appropriate constructs or concepts. As with content validity, it is not based on test scores, but is more a matter of whether the test items are indicators of the underlying, latent construct in question. In this respect construct validity also subsumes content and criterion-related validity. It is argued (Loevinger, 1957) that construct validity is, in fact, the queen of the types of validity because it is subsumptive and because it concerns constructs or explanations rather than methodological factors. Construct validity is threatened by: (a) under-representation of the construct, i.e. the test is too narrow and neglects significant facets of the construct; (b) the inclusion of irrelevancies (excess reliable variance).
- concurrent validity (where the results of the test concur with results on other tests or instruments that are testing/assessing the same construct/performance; this is similar to predictive validity but without the time dimension, since concurrent validity can be demonstrated simultaneously with another instrument rather than after some time has elapsed);
- face validity (that, superficially, the test appears at face value to test what it is designed to test);
- jury validity (an important element in construct validity, where it is important to agree on the conceptions and operationalization of an unobservable construct);
- predictive validity (where results on a test accurately predict subsequent performance, akin to criterion-related validity);
- consequential validity (where the inferences that can be made from a test are sound);
- systemic validity (Frederiksen and Collins, 1989) (where programme activities both enhance test performance and enhance performance of the construct being addressed in the objective). Cunningham (1998) gives an example of systemic validity: if the test and the objective of vocabulary performance lead testees to increase their vocabulary, then systemic validity has been addressed.
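As a brief computational illustration of criterion-related (here, concurrent) validity, the sketch below correlates scores on a new test with scores on an accepted criterion measure taken at around the same time. All data and names are hypothetical, invented purely for the example.

```python
import numpy as np

# Hypothetical scores for eight testees on a new test and on an
# established criterion measure administered concurrently.
new_test = np.array([55.0, 62, 48, 71, 66, 59, 44, 68])
criterion = np.array([52.0, 65, 45, 74, 70, 61, 40, 63])

# The validity coefficient is simply the Pearson correlation.
r = np.corrcoef(new_test, criterion)[0, 1]
print(f"criterion-related validity coefficient r = {r:.2f}")
```

For predictive validity the logic is identical, except that the criterion (e.g. subsequent examination performance) is measured after a time interval has elapsed.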
To ensure test validity, then, the test must demonstrate fitness for purpose as well as addressing the several types of validity outlined above. The most difficult for researchers to address, perhaps, is construct validity, for it requires agreement on the definition and operationalization of an unseen, half-guessed-at construct or phenomenon. The community of scholars has a role to play here. For a full discussion of validity see Messick (1993).
Validity and Reliability in Life Histories
Three central issues underpin the quality of data generated by life history methodology: representativeness, validity and reliability. Plummer (1983) draws attention to a frequent criticism of life history research, namely that its cases are atypical rather than representative. To avoid this charge, he urges intending researchers to 'work out and explicitly state the life history's relationship to a wider population' (Plummer, 1983) by way of appraising the subject on a continuum of representativeness and non-representativeness.
Reliability in life history research hinges upon the identification of sources of bias and the application of techniques to reduce them. Bias arises from the informant, the researcher, and the interactional encounter itself. Plummer (1983) provides a checklist of some aspects of bias arising from these principal sources. Several validity checks are available to intending researchers. Plummer identifies the following:
1 The subject of the life history may present an autocritique of it, having read the entire product.
2 A comparison may be made with similar written sources by way of identifying points of major divergence or similarity.
3 A comparison may be made with official records by way of imposing accuracy checks on the life history.
4 A comparison may be made by interviewing other informants.
Essentially, the validity of any life history lies in its ability to represent the informant’s subjective reality, that is to say, his or her definition of the situation.