Educational Testing in Research: Types, Construction, and Best Practices for Researchers

When conducting tests in educational research, researchers need to select the type of test carefully, follow a rigorous construction method that secures validity and reliability, and adhere to best practices in administration and interpretation.


Understanding Educational Testing: A Comprehensive Guide

Tests and testing have a long and venerable history. Since the spelling test of Rice (1897), the fatigue test of Ebbinghaus (1897) and the intelligence scale of Binet (1905), the growth of tests has proceeded at an extraordinary pace in terms of volume, variety, scope and sophistication. The field of testing is so extensive that the comments that follow are necessarily introductory, and the reader seeking a deeper understanding will have to refer to specialist texts and sources on the subject.

Limitations of space permit no more than a brief outline of a small number of key issues to do with tests and testing. Readers wishing to undertake studies in greater depth will need to pursue their interests elsewhere. In tests, researchers have at their disposal a powerful method of data collection: an impressive array of instruments for gathering data of a numerical rather than verbal kind.

Types of Educational Tests: Key Classifications

In considering testing for gathering research data, several issues need to be borne in mind:

  • Are we dealing with parametric or nonparametric tests?
  • Are they achievement tests or aptitude tests?
  • Are they norm-referenced or criterion-referenced?
  • Are they available commercially for researchers to use, or will researchers have to develop home-produced tests?
  • Do the test scores derive from a pretest and post-test in the experimental method?
  • Are they group or individual tests?

Let us unpack some of these issues.

Parametric vs. Non-Parametric Tests Explained

Parametric tests are designed to represent the wider population—e.g. of a country or age group. They make assumptions about the wider population and the characteristics of that wider population, i.e. the parameters of abilities are known. They assume (Morrison, 1993):

  • That there is a normal curve of distribution of scores in the population (the bell-shaped symmetry of the Gaussian curve of distribution seen, for example, in standardized scores of IQ or the measurement of people’s height or the distribution of achievement on reading tests in the population as a whole);
  • That there are continuous and equal intervals between the test scores (so that, for example, a score of 80 per cent could be said to be double that of 40 per cent; this differs from the ordinal scaling of rating scales discussed earlier in connection with questionnaire design, where equal intervals between each score could not be assumed).

Parametric tests will usually be published as standardized tests which are commercially available and which have been piloted on a large and representative sample of the whole population. They usually arrive complete with the backup data on sampling, reliability and validity statistics which have been computed in the devising of the tests. Working with these tests enables the researcher to use statistics applicable to interval and ratio levels of data.

On the other hand, non-parametric tests make few or no assumptions about the distribution of the population (the parameters of the scores) or the characteristics of that population. The tests do not assume a regular bell-shaped curve of distribution in the wider population; indeed the wider population is perhaps irrelevant as these tests are designed for a given specific population—a class in school, a chemistry group, a primary school year group.

Because they make no assumptions about the wider population, the researcher is confined to working with non-parametric statistics appropriate to nominal and ordinal levels of data. The attraction of non-parametric statistics is their utility for small samples because they do not make any assumptions about how normal, even and regular the distributions of scores will be.

Furthermore, computation of statistics for non-parametric tests is less complicated than that for parametric tests. It is perhaps safe to assume that a home-devised test (like a home-devised questionnaire) will probably be non-parametric unless it deliberately contains interval and ratio data. Non-parametric tests are the stock-in-trade of classroom teachers—the spelling test, the mathematics test, the end-of-year examination, the mock-examination.

They have the advantage of being tailored to particular institutional, departmental and individual circumstances. They offer teachers a valuable opportunity for quick, relevant and focused feedback on student performance. Parametric tests are more powerful than non-parametric tests because they not only derive from standardized scores but enable the researcher to compare sub-populations with a whole population (e.g. to compare the results of one school or local education authority with the whole country, for instance in comparing students’ performance in norm-referenced or criterion-referenced tests against a national average score in that same test).

They enable the researcher to use powerful statistics in data processing (e.g. means, standard deviations, t tests, Pearson product moment correlations, factor analysis, analysis of variance), and to make inferences about the results. Because non-parametric tests make no assumptions about the wider population a different set of statistics is available to the researcher (e.g. modal scores, rankings, the chi-square statistic, a Spearman correlation). These can be used in very specific situations—one class of students, one year group, one style of teaching, one curriculum area—and hence are valuable to teachers.
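To make the contrast concrete, the following sketch pairs each parametric statistic with a non-parametric counterpart. It is a minimal illustration only, assuming Python with the widely used scipy library; the scores and group names are invented, not taken from any real study.

```python
# Illustrative only: hypothetical test scores for two small teaching groups.
from scipy import stats

group_a = [62, 71, 55, 68, 74, 59, 66, 70, 63, 58]  # invented scores
group_b = [49, 57, 52, 61, 47, 55, 50, 58, 53, 46]

# Parametric comparison: assumes interval-level data and roughly normal distributions.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric counterpart: assumes only ordinal-level (rankable) data.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Correlation between two tests taken by the same ten students.
test_1 = group_a
test_2 = [60, 75, 50, 66, 78, 61, 64, 72, 65, 55]
pearson_r, _ = stats.pearsonr(test_1, test_2)      # parametric
spearman_rho, _ = stats.spearmanr(test_1, test_2)  # non-parametric (works on ranks)

print(f"t-test: t={t_stat:.2f}, p={t_p:.3f}")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.3f}")
print(f"Pearson r={pearson_r:.2f}, Spearman rho={spearman_rho:.2f}")
```

The point of the pairing is simply that the parametric statistics presuppose the assumptions described above, whereas their non-parametric counterparts do not.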

Norm-Referenced, Criterion-Referenced, and Domain-Referenced Tests

A norm-referenced test compares students’ achievements relative to other students’ achievements (e.g. a national test of mathematical performance or a test of intelligence which has been standardized on a large and representative sample of students between the ages of six and sixteen). A criterion-referenced test does not compare student with student but, rather, requires the student to fulfill a given set of criteria, a predefined and absolute standard or outcome (Cunningham, 1998).

For example, a driving test is usually criterion-referenced since to pass it requires the ability to meet certain test items— reversing round a corner, undertaking an emergency stop, avoiding a crash, etc. regardless of how many others have or have not passed the driving test. Similarly many tests of playing a musical instrument require specified performances—e.g. the ability to play a particular scale or arpeggio, the ability to play a Bach fugue without hesitation or technical error. If the student meets the criteria, then he or she passes the examination.

The link between criterion-referenced tests and mastery learning is strong, for both emphasize the achievement of objectives per se rather than in comparison to other students. Both place an emphasis on learning outcomes. Further, Cunningham (1998) has indicated the link between criterion-referencing, minimum competency testing and measurement-driven instruction in the US; all of them share the concern for measuring predetermined and specific outcomes and objectives.

Though this use of criterion-referencing declined in the closing decade of the twentieth century, the use of criterion-referencing to set standards burgeoned in the same period. What we have, then, is the move away from criterion-referencing as measurement of the achievement of detailed and specific behavioral objectives and towards a testing of what a student has achieved that is not so specifically framed.

A criterion-referenced test provides the researcher with information about exactly what a student has learned, what she can do, whereas a norm-referenced test can only provide the researcher with information on how well one student has achieved in comparison to another, enabling rank orderings of performance and achievement to be constructed. Hence a major feature of the norm-referenced test is its ability to discriminate between students and their achievements—a well-constructed norm-referenced test enables differences in achievement to be measured acutely, i.e. to provide variability or a great range of scores.

For a criterion-referenced test this is less of a problem: the intention is to indicate whether students have achieved a set of given criteria, regardless of how many others might or might not have achieved them; hence variability or range is less important here.

The question of the politics in the use of data from criterion-referenced examination results arises when such data are used in a norm-referenced way to compare student with student, school with school, local authority with local authority, region with region (as has been done in the United Kingdom with the publication of ‘league tables’ of local authorities’ successes in the achievement of their students when tested at the age of seven—a process which is envisaged to develop into the publication of achievements at several ages and school by school).

More recently an outgrowth of criterion-referenced testing has been the rise of domain-referenced tests (Gipps, 1994:81). Here considerable significance is accorded to the careful and detailed specification of the content or the domain which will be assessed. The domain is the particular field or area of the subject that is being tested, for example, light in science, two-part counterpoint in music, parts of speech in English language. The domain is set out very clearly and very fully, such that the full depth and breadth of the content is established.

Test items are then selected from this very full field, with careful attention to sampling procedures so that representativeness of the wider field is ensured in the test items. The student’s achievements on that test are computed to yield a proportion of the maximum score possible, and this, in turn, is used as an index of the proportion of the overall domain that she has grasped.

So, for example, if a domain has 1,000 items and the test has 50 items, and the student scores 30 marks from the possible 50, then it is inferred that she has grasped 60 per cent ((30 ÷ 50) × 100) of the domain of 1,000 items. Here inferences are being made from a limited number of items to the student’s achievements in the whole domain; this requires careful and representative sampling procedures for test items.
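The inference itself is simple arithmetic. A minimal sketch using the numbers from the example above (Python, purely illustrative):

```python
# Domain-referenced inference: estimate mastery of the whole domain
# from performance on a representative sample of items.
domain_items = 1000   # items in the fully specified domain
test_items = 50       # items sampled for the test
score = 30            # items the student answered correctly

proportion_grasped = score / test_items              # 0.60
estimated_domain_mastery = proportion_grasped * 100  # 60 per cent
estimated_items_known = round(proportion_grasped * domain_items)

print(f"Estimated mastery: {estimated_domain_mastery:.0f}% of the domain "
      f"(about {estimated_items_known} of {domain_items} items)")
```

The validity of the inference rests entirely on the representativeness of the 50 sampled items, which is why the sampling procedures mentioned above matter so much.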


Commercially Produced vs. Researcher-Produced Tests

There is a battery of tests in the public domain which cover a vast range of topics and which can be used for evaluative purposes. Most schools will have used published tests at one time or another: diagnostic tests, aptitude tests, achievement tests, norm-referenced tests, readiness tests, subject-specific tests, skills tests, criterion-referenced tests, reading tests, verbal reasoning tests, non-verbal reasoning tests, tests of social adjustment, tests of intelligence, tests of critical thinking; the list is colossal.

Possible Attractions

There are several attractions to using published tests:

  • They are objective;
  • They have been piloted and refined;
  • They have been standardized across a named population (e.g. a region of the country, the whole country, a particular age group or various age groups) so that they represent a wide population;
  • They declare how reliable and valid they are (mentioned in the statistical details which are usually contained in the manual of instructions for administering the test);
  • They tend to be parametric tests, hence enabling sophisticated statistics to be calculated;
  • They come complete with instructions for administration;
  • They are often straightforward and quick to administer and to mark;
  • Guides to the interpretation of the data are usually included in the manual;
  • Researchers are spared the task of having to devise, pilot and refine their own test.

Several commercially produced tests have restricted release or availability, hence the researcher might have to register with a particular association before being given clearance to use the test or before being given copies of it. For example, the Psychological Corporation Ltd and McGraw-Hill publishers not only hold the rights to a world-wide battery of tests of all kinds but require registration before releasing tests.

In this example the Psychological Corporation also has different levels of clearance, so that certain parties or researchers may not be eligible to have a test released to them because they do not fulfill particular criteria for eligibility. Published tests by definition are not tailored to institutional or local contexts or needs; indeed their claim to objectivity is made on the grounds that they are deliberately supra-institutional.

The researcher wishing to use published tests must be certain that the purposes, objectives and content of the published tests match the purposes, objectives and content of the evaluation. For example, a published diagnostic test might not fit the needs of an evaluation that requires an achievement test; a test of achievement might not have the predictive quality which the researcher seeks in an aptitude test; a published reading test might not address the areas of reading that the researcher wishes to cover; a verbal reading test written in English might contain language which is difficult for a student whose first language is not English. These are important considerations.

A much-cited text on evaluating the utility of commercially available tests for researchers is the American Psychological Association’s (1974) Standards for Educational and Psychological Testing. The golden rule for deciding to use a published test is that it must demonstrate fitness for purpose. If it fails to demonstrate this, then tests will have to be devised by the researcher.

The attraction of this latter point is that such a ‘home-grown’ test will be tailored to the local and institutional context very tightly, i.e. the purposes, objectives and content of the test will be deliberately fitted to the specific needs of the researcher in a specific, given context. In discussing ‘fitness for purpose’, Cronbach (1949) and Gronlund and Linn (1990) set out a range of criteria against which a commercially produced test can be evaluated for its suitability for specific research purposes.

Against these advantages, of course, there are several important considerations in devising a ‘home-grown’ test. Not only might it be time-consuming to devise, pilot, refine and then administer the test but, because much of it will probably be non-parametric, there will be a more limited range of statistics which may be applied to the data than in the case of parametric tests. The scope of tests and testing is far-reaching; it is as if no areas of educational activity are untouched by them.

Achievement tests, largely summative in nature, measure achieved performance in a given content area. Aptitude tests are intended to predict capability, achievement potential, learning potential and future achievements. However, the assumption that these two constructs—achievement and aptitude—are separate has to be questioned (Cunningham, 1998); indeed it is often the case that a test of aptitude for, say, geography, at a particular age or stage will be measured by using an achievement test at that age or stage.

Cunningham (1998) has suggested that an achievement test might include more straightforward measures of basic skills whereas aptitude tests might put these in combination, e.g. combining reasoning (often abstract) and particular knowledge, i.e. that achievement and aptitude tests differ according to what they are testing. Not only do the tests differ according to what they measure, but, since both can be used predictively, they differ according to what they might be able to predict.

For example, because an achievement test is more specific and often tied to a specific content area, it will be useful as a predictor of future performance in that content area but will be largely unable to predict future performance out of that content area. An aptitude test tends to test more generalized abilities (e.g. aspects of ‘intelligence’, skills and abilities that are common to several areas of knowledge or curricula), hence it is able to be used as a more generalized predictor of achievement.

Achievement tests, Gronlund (1985) suggests, are more linked to school experiences whereas aptitude tests encompass out-of-school learning and wider experiences and abilities. However Cunningham (1998), in arguing that there is a considerable overlap between the two types, is suggesting that the difference is largely cosmetic. An achievement test tends to be much more specific and linked to instructional program and cognate areas than an aptitude test, which looks for more general aptitudes (Hanna, 1993) (e.g. intelligence or intelligences, Gardner, 1993).

How to Construct an Effective Test: 8 Essential Steps

The opportunity to devise a test is exciting and challenging, and in doing so the researcher will have to consider:

  • The purposes of the test (for answering evaluation questions and ensuring that it tests what it is supposed to be testing, e.g. the achievement of the objectives of a piece of the curriculum);
  • The type of test (e.g. diagnostic, achievement, aptitude, criterion-referenced, norm-referenced);
  • The objectives of the test (cast in very specific terms so that the content of the test items can be seen to relate to specific objectives of a programme or curriculum);
  • The content of the test;
  • The construction of the test, involving item analysis in order to clarify the item discriminability and item difficulty of the test (see below);
  • The format of the test—its layout, instructions, method of working and of completion (e.g. oral instructions to clarify what students will need to write, or a written set of instructions to introduce a practical piece of work);
  • The nature of the piloting of the test;
  • The validity and reliability of the test;
  • The provision of a manual of instructions for the administration, marking and data treatment of the test (this is particularly important if the test is not to be administered by the researcher or if the test is to be administered by several different people, so that reliability is ensured by having a standard procedure).

In planning a test the researcher can proceed thus:

Step 1: Identify Your Test Purposes

The purposes of a test are several, for example to diagnose a student’s strengths, weaknesses and difficulties, to measure achievement, to measure aptitude and potential, to identify readiness for a program. Gronlund and Linn (1990) term this ‘placement testing’ and it is usually a form of pretest, normally designed to discover whether students have the essential pre-requisites to begin a program (e.g. in terms of knowledge, skills, understandings).

These types of tests occur at different stages. For example, the placement test is conducted prior to the commencement of a program and will identify starting abilities and achievements—the initial or ‘entry’ abilities in a student.

If the placement test is designed to assign students to tracks, sets or teaching groups (i.e. to place them into administrative or teaching groupings), then the entry test might be criterion-referenced or norm-referenced; if it is designed to measure detailed starting points, knowledge, abilities and skills, then the test might be more criterion-referenced as it requires a high level of detail.

It has its equivalent in ‘baseline assessment’ and is an important feature if one is to measure the ‘value added’ component of teaching and learning: one can only assess how much a set of educational experiences has added value to the student if one knows that student’s starting point and starting abilities and achievements.

  • Formative testing is undertaken during a program, and is designed to monitor students’ progress during that program, to measure achievement of sections of the program, and to diagnose strengths and weaknesses. It is typically criterion-referenced.
  • Diagnostic testing is an in-depth test to discover particular strengths, weaknesses and difficulties that a student is experiencing, and is designed to expose causes and specific areas of weakness or strength. This often requires the test to include several items about the same feature, so that, for example, several types of difficulty in a student’s understanding will be exposed; the diagnostic test will need to construct test items that will focus on each of a range of very specific difficulties that students might be experiencing, in order to identify the exact problems that they are having from a range of possible problems. Clearly this type of test is criterion-referenced.
  • Summative testing is the test given at the end of the program, and is designed to measure achievement, outcomes, or ‘mastery’. This might be criterion-referenced or norm-referenced, depending to some extent on the use to which the results will be put (e.g. to award certificates or grades, to identify achievement of specific objectives).

Step 2: Develop Test Specifications

The test specifications include:

  • Which program objectives and student learning outcomes will be addressed
  • Which content areas will be addressed
  • The relative weightings, balance and coverage of items
  • The total number of items in the test
  • The number of questions required to address a particular element of a program or learning outcomes
  • The exact items in the test.

To ensure validity, it is essential that the objectives of the test are fairly addressed in the test items.

Objectives, it is argued (Mager, 1962; Wiles and Bondi, 1984), should:

(a) Be specific and be expressed with an appropriate degree of precision

(b) Represent intended learning outcomes

(c) Identify the actual and observable behavior which will demonstrate achievement

(d) Include an active verb

(e) Be unitary (focusing on one item per objective).

One way of ensuring that the objectives are fairly addressed in test items is to use a matrix frame that indicates the coverage of content areas, the coverage of the objectives of the program, and the relative weighting of the items on the test. Such a matrix is set out in Box 18.1, taking an example from a secondary school history syllabus.

Box 18.1 indicates the main areas of the program to be covered in the test (content areas); then it indicates which objectives or detailed content areas will be covered (1a–3c)—these numbers refer to the identified specifications in the syllabus; then it indicates the marks/percentages to be awarded for each area. This indicates several points:

  • The least emphasis is given to the build-up to and end of the war (10 marks each in the ‘total’ column);
  • The greatest emphasis is given to the invasion of France (35 marks in the ‘total’ column);
  • There is fairly even coverage of the objectives specified (the figures in the ‘total’ row only vary from 9–13);
  • Greatest coverage is given to objectives 2a and 3a, and least coverage is given to objective 1c;
  • Some content areas are not covered in the test items (the blanks in the matrix).


Hence we have here a test scheme that indicates relative weightings, coverage of objectives and content, and the relation between these two latter elements. Gronlund and Linn (1990) suggest that relative weightings should be addressed by first assigning percentages at the foot of each column, then assigning percentages at the end of each row, and then completing each cell of the matrix within these specifications.

This ensures that appropriate sampling and coverage of the items are achieved. The example of the matrix refers to specific objectives as column headings; of course these could be replaced by factual knowledge, conceptual knowledge and principles, and skills for each of the column headings. Alternatively they could be replaced with specific aspects of an activity, for example (Cohen, Manion and Morrison, 1996:416): designing a crane, making the crane, testing the crane, evaluating the results, improving the design.

Indeed these latter could become content (row) headings, as shown in Box 18.2. Here one can see that practical skills will carry fewer marks than recording skills (the column totals), and that making and evaluating carry equal marks (the row totals). This exercise also gives some indication of the number of items to be included in the test: for instance, in the example of the history test above the matrix yields 5 × 9 = 45 possible items, and in the ‘crane’ activity below it yields 5 × 4 = 20 possible items.

Of course, there could be considerable variation in this; for example, more test items could be inserted if it were deemed desirable to test one cell of the matrix with more than one item (possibly for cross-checking), or indeed there could be fewer items if it were possible to have a single test item that serves more than one cell of the matrix.

The difficulty in matrix construction is that it can easily become a runaway activity, generating very many test items and, hence, leading to an unworkably long test—typically the greater the degree of specificity required, the greater the number of test items there will be. One skill in test construction is to be able to have a single test item that provides valid and reliable data for more than a single factor. Having undertaken the test specifications, the researcher should have achieved clarity on:

(a) The exact test items that test certain aspects of achievement of objectives, program, contents, etc.

(b) The coverage and balance of coverage of the test items

(c) The relative weightings of the test items.
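The matrix-based planning described above lends itself to a little explicit bookkeeping. The sketch below is illustrative only: the content areas, objectives and cell marks are invented (loosely modelled on the history example), and the code simply lays out a content-by-objective grid and checks the row, column and grand totals against the intended weightings.

```python
# Minimal sketch of a test specification matrix.
# Rows are content areas, columns are objectives, cell values are marks
# allocated to items testing that objective within that content area.
# All names and numbers are invented for illustration.

content_areas = ["Build-up to the war", "Outbreak", "Invasion of France",
                 "Turning points", "End of the war"]
objectives = ["1a", "1b", "2a", "2b", "3a"]

marks = [
    [2, 2, 3, 0, 3],   # Build-up to the war  (total 10)
    [4, 3, 5, 4, 4],   # Outbreak             (total 20)
    [7, 6, 8, 7, 7],   # Invasion of France   (total 35)
    [5, 4, 6, 5, 5],   # Turning points       (total 25)
    [2, 2, 3, 0, 3],   # End of the war       (total 10)
]

row_totals = [sum(row) for row in marks]          # emphasis per content area
col_totals = [sum(col) for col in zip(*marks)]    # coverage per objective
grand_total = sum(row_totals)                     # should equal the intended total, e.g. 100

for area, total in zip(content_areas, row_totals):
    print(f"{area:22s} {total:3d} marks")
print("Objective totals:", dict(zip(objectives, col_totals)))
print("Grand total:", grand_total)

# Cells left at 0 flag content/objective combinations not covered by any item.
uncovered = [(content_areas[r], objectives[c])
             for r, row in enumerate(marks) for c, v in enumerate(row) if v == 0]
print("Uncovered cells:", uncovered)
```

Working in this way makes the checks described above (relative weightings, coverage of objectives and content, and uncovered cells) mechanical rather than a matter of inspection.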


Step 3: Select Test Content Through Item Analysis

In selecting the contents of the test, the test items are subjected to item analysis. Gronlund and Linn (1990) suggest that an item analysis will need to consider:

  • The suitability of the format of each item for the (learning) objective (appropriateness)
  • The ability of each item to enable students to demonstrate their performance of the (learning) objective (relevance)
  • The clarity of the task for each item
  • The straightforwardness of the task
  • The unambiguity of the outcome of each item, and agreement on what that outcome should be
  • The cultural fairness of each item
  • The independence of each item (i.e. where the influence of other items of the test is minimal and where successful completion of one item is not dependent on successful completion of another)
  • The adequacy of coverage of each (learning) objective by the items of the test.
Test Construction

In moving to test construction the researcher will need to consider how each element to be tested will be operationalized:

(a) What indicators and kinds of evidence of achievement of the objective will be required

(b) What indicators of high, moderate and low achievement there will be

(c) What the students will be doing when they are working on each element of the test

(d) What the outcome of the test will be (e.g. a written response, a tick in a box of multiple-choice items, an essay, a diagram, a computation).

Presentation, Operation and Response Modes

Indeed the Task Group on Assessment and Testing in the UK (1988) took from the work of the UK’s Assessment of Performance Unit the suggestion that attention will have to be given to the presentation, operation and response modes of a test:

(a) How the task will be introduced (e.g. oral, written, pictorial, computer, practical demonstration)

(b) What the students will be doing when they are working on the test (e.g. mental computation, practical work, oral work, written work)

(c) What the outcome will be—how they will show achievement and present the outcomes (e.g. choosing one item from a multiple-choice question, writing a short response, open-ended writing, oral, practical outcome, computer output).

Stages of Operationalization

Operationalizing a test from objectives can proceed by stages:

  • Identify the objectives/outcomes/elements to be covered.
  • Break down the objectives/outcomes/elements into constituent components or elements
  • Select the components that will feature in the test, such that, if possible, they will represent the larger field (i.e. domain referencing, if required)
  • Recast the components in terms of specific, practical, observable behaviors, activities and practices that fairly represent and cover those components
  • Specify the kinds of data required to provide information on the achievement of the criteria
  • Specify the success criteria (performance indicators) in practical terms, working out marks and grades to be awarded and how weightings will be addressed
  • Write each item of the test
  • Conduct a pilot to refine the language/readability and presentation of the items, to gauge item discriminability, item difficulty and distractors (discussed below), and to address validity and reliability.
Item Analysis

Item analysis, Gronlund and Linn aver (p. 255), is designed to ensure that:

(a) The items function as they are intended, for example, that criterion-referenced items fairly cover the fields and criteria and that norm-referenced items demonstrate item discriminability (discussed below)

(b) The level of difficulty of the items is appropriate (see below: item difficulty)

(c) The test is reliable (free of distractors—unnecessary information and irrelevant cues; see below: distractors) (see also Millman and Greene, 1993).

An item analysis will consider the accuracy levels available in the answer, the item difficulty, the importance of the knowledge or skill being tested, the match of the item to the program, and the number of items to be included. The basis of item analysis can be seen in item response theory (see Hambleton, 1993). Item response theory (IRT) is based on the principle that it is possible to measure single, specific latent traits, abilities and attributes that are not themselves observable, i.e. to infer unobservable quantities from observable ones.

The model assumes a relationship between a person’s possession or level of a particular attribute, trait or ability and his or her response to a test item (a brief sketch of the one-parameter Rasch model follows the list below). IRT is also based on the view that it is possible:

  • To identify objective levels of difficulty of an item, e.g. the Rasch model (Wainer and Mislevy, 1990)
  • To devise items that will be able to discriminate effectively between individuals
  • To describe an item independently of any particular sample of people who might be responding to it, i.e. it is not group dependent (the item difficulty and item discriminability are independent of the sample)
  • To describe a testee’s proficiency in terms of his or her achievement on an item of a known difficulty level
  • To describe a person independently of any sample of items that has been administered to that person (i.e. a testee’s ability does not depend on the particular sample of test items)
  • To specify and predict the properties of a test before it has been administered.
  • For traits to be uni-dimensional (single traits are specifiable, e.g. verbal ability, mathematical proficiency) and to account for test outcomes and performance
  • For a set of items to measure a common trait or ability
  • For a testee’s response to any one test item not to affect his or her response to another test item
  • That the probability of the correct response to an item does not depend on the number of testees who might be at the same level of ability
  • That it is possible to identify objective levels of difficulty of an item
  • That a statistic can be calculated that indicates the precision of the measured ability for each testee, and that this statistic depends on the ability of the testee and the number and properties of the test items.
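As a concrete illustration of the kind of relationship IRT posits, the sketch below implements the one-parameter (Rasch) model, in which the probability of a correct response depends only on the difference between a testee's ability and the item's difficulty on a common latent scale. The ability and difficulty values are invented for illustration.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """One-parameter (Rasch) model:
    P(correct) = exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Invented example: three testees of differing ability attempt the same item (difficulty 0.5).
item_difficulty = 0.5
for ability in (-1.0, 0.5, 2.0):
    p = rasch_probability(ability, item_difficulty)
    print(f"ability={ability:+.1f}  P(correct)={p:.2f}")

# A testee whose ability equals the item's difficulty has a 0.50 chance of success,
# regardless of who else happens to sit the test—the sample-independence claimed above.
```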

In constructing a test the researcher will need to undertake an item analysis to clarify the item discriminability and item difficulty of each item of the test.  Item discriminability refers to the potential of the item in question to be answered correctly by those students who have a lot of the particular quality that the item is designed to measure and to be answered incorrectly by those students who have less of the particular quality that the same item is designed to measure.

In other words, how effective is the test item in showing up differences between a group of students? Does the item enable us to discriminate between students’ abilities in a given field? An item with high discriminability will enable the researcher to see a potentially wide variety of scores on that item; an item with low discriminability will show scores on that item poorly differentiated.

Clearly a high measure of discriminability is desirable. Suppose the researcher wishes to construct a test of mathematics for eventual use with thirty students in a particular school (or with class A in a particular school). The researcher devises a test and pilots it in a different school (or in class B), administering the test to thirty students of the same age (i.e. she matches the pilot sample to the sample in the school or class that will eventually be used).

The scores of the thirty pilot children are then split into three groups of ten students each (high, medium and low scores). It would be reasonable to assume that there will be more correct answers to a particular item amongst the high scorers than amongst the low scorers. For each item compute the following:

Index of discriminability = (number of correct responses in the high-scoring group − number of correct responses in the low-scoring group) ÷ number of students in one group

Suppose all ten students from the high scoring group answered the item correctly and two students from the low scoring group answered the item correctly. The formula would work out thus:

Index of discriminability = (10 − 2) ÷ 10 = 0.80

The maximum index of discriminability is 1.00. Any item whose index of discriminability is less than 0.67, i.e. is too undiscriminating, should be reviewed firstly to find out whether this is due to ambiguity in the wording or possible clues in the wording. If this is not the case, then whether the researcher uses an item with an index lower than 0.67 is a matter of judgement.

It would appear, then, that the item in the example would be appropriate to use in a test. For a further discussion of item discriminability see Linn (1993). One can use the discriminability index to examine the effectiveness of distractors. This is based on the premise that an effective distractor should attract more students from a low scoring group than from a high scoring group. Consider the following example, where low and high scoring groups are identified:

  • Example A: chosen by 10 of the top 10 students and 8 of the bottom 10 students
  • Example B: chosen by no students from either group
  • Example C: chosen by 2 of the top 10 students and 10 of the bottom 10 students

In example A, the item discriminates positively in that it attracts more correct responses (10) from the top 10 students than the bottom 10 (8) and hence is a poor distractor; here, also, the discriminability index is 0.20, hence is a poor discriminator and is also a poor distractor. Example B is an ineffective distractor because nobody was included from either group. Example C is an effective distractor because it includes far more students from the bottom 10 students (10) than the higher group (2).

However, in this case any ambiguities must be ruled out before the discriminating power can be improved. Distractors are the stuff of multiple choice items, where incorrect alternatives are offered, and students have to select the correct alternatives.

Here a simple frequency count of the number of times a particular alternative is selected will provide information on the effectiveness of the distractor: if it is selected many times then it is working effectively; if it is seldom or never selected then it is not working effectively and it should be replaced. If we wished to calculate the item difficulty of a test, we could use the following formula:

Index of item difficulty = (number of students who answered the item correctly ÷ number of students who attempted the item) × 100

The maximum index of difficulty is 100 per cent. Items falling below 33 per cent and above 67 per cent are likely to be too difficult and too easy respectively, so an item falling between these boundaries would be appropriate to use in a test. Here, again, whether the researcher uses an item with an index of difficulty below or above the cut-off points is a matter of judgement. In a norm-referenced test the item difficulty should be around 50 per cent (Frisbie, 1981). For further discussion of item difficulty see Linn (1993) and Hanna (1993).

Given that the researcher can only know the degree of item discriminability and difficulty once the test has been undertaken, there is an unavoidable need to pilot home-grown tests. Items with limited discriminability and inappropriate difficulty must be weeded out and replaced; those items with the greatest discriminability and the most appropriate degrees of difficulty can be retained. This can only be undertaken once data from a pilot have been analyzed.
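A minimal sketch of how such a pilot analysis might be computed is given below. It follows the working definitions used above: the discriminability index is the difference between the high- and low-scoring groups' correct-response counts divided by the size of one group, the difficulty index is the percentage of all testees answering correctly, and distractor analysis is a simple frequency count by group. The function names are hypothetical and the difficulty figures are invented; the discriminability and distractor counts reuse the numbers from the examples above.

```python
# Illustrative item analysis on pilot data.

def discriminability_index(correct_high: int, correct_low: int, group_size: int) -> float:
    """Difference in correct responses between the high- and low-scoring groups,
    divided by the number of students in one group (maximum = 1.00)."""
    return (correct_high - correct_low) / group_size

def difficulty_index(correct_total: int, attempted_total: int) -> float:
    """Percentage of all testees answering the item correctly (maximum = 100)."""
    return 100.0 * correct_total / attempted_total

# From the example above: 10 high scorers and 2 low scorers answered correctly (groups of 10).
print("Discriminability:", discriminability_index(10, 2, 10))   # 0.80, above the 0.67 threshold

# Invented figures: 18 of the 30 pilot students answered the item correctly.
print(f"Difficulty: {difficulty_index(18, 30):.0f}%")            # 60%, within the 33-67% band

# Distractor analysis: how often each option was chosen by the high- and low-scoring groups
# (counts taken from examples A-C above).
option_counts = {"A": (10, 8), "B": (0, 0), "C": (2, 10)}
for option, (high, low) in option_counts.items():
    print(f"Option {option}: high group {high}, low group {low}")
```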

Item discriminability and item difficulty take on differential significance in norm-referenced and criterion-referenced tests. In a norm-referenced test we wish to compare students with each other, hence item discriminability is very important. In a criterion-referenced test, on the other hand, it is not important per se to be able to compare or discriminate between students’ performance.

For example, it may be the case that we wish to discover whether a group of students has learnt a particular body of knowledge, that is the objective, rather than, say, finding out how many have learned it better than others. Hence it may be that a criterion-referenced test has very low discriminability if all the students achieve very well or achieve very poorly, but the discriminability is less important than the fact that the students have or have not learnt the material.

A norm-referenced test would regard such a poorly discriminating item as unsuitable for inclusion, whereas a criterion-referenced test would regard such an item as providing useful information (on success or failure). With regard to item difficulty, in a criterion-referenced test the level of difficulty is that which is appropriate to the task or objective. Hence if an objective is easily achieved then the test item should be easily achieved; if the objective is difficult then the test item should be correspondingly difficult.

This means that, unlike a norm-referenced test where an item might be reworked in order to increase its discriminability index, this is less of an issue in criterion-referencing. Of course, this is not to deny the value of undertaking an item difficulty analysis, rather, it is to question the centrality of such a concern. Gronlund and Linn (1990:265) suggest that where instruction has been effective the item difficulty index of a criterion-referenced test will be high.


In addressing the item discriminability, item difficulty and distractor effect of particular test items, it is advisable, of course, to pilot these tests and to be cautious about placing too great a store on indices of difficulty and discriminability that are computed from small samples. In constructing a test with item analysis, item discriminability, item difficulty and distractor effects in mind, it is important also to consider the actual requirements of the test (Nuttall, 1987; Cresswell and Houston, 1991), for example:

  • Are all the items in the test equally difficult? If not, what makes some items more difficult than the rest?
  • Which items are easy, moderately hard, hard, very hard?
  • What kind of task is each item addressing (e.g. is it (a) a practice item—repeating known knowledge, (b) an application item—applying known knowledge, or (c) a synthesis item—bringing together and integrating diverse areas of knowledge)?
  • Are the items sufficiently within the experience of the students?
  • How motivated will students be by the contents of each item (i.e. how relevant they perceive the item to be, how interesting it is)?

The contents of the test will also need to take account of the notion of fitness for purpose, for example in the types of test items.

Here the researcher will need to consider whether ability, understanding and achievement will be best demonstrated through, for example (Lewis, 1974; Cohen, Manion and Morrison, 1996):

  • An open essay
  • A factual and heavily directed essay
  • Short answer questions
  • Divergent thinking items
  • Completion items
  • Multiple choice items (with one correct answer or more than one correct answer)
  • Matching pairs of items or statements
  • Inserting missing words
  • Incomplete sentences or incomplete, unlabeled diagrams
  • True/false statements
  • Open-ended questions where students are given guidance on how much to write (e.g. 300 words, a sentence, a paragraph)
  • Closed questions.

These items can test recall, knowledge, comprehension, application, analysis, synthesis, and evaluation, i.e. different orders of thinking. These take their rationale from Bloom (1956) on hierarchies of thinking—from low order (comprehension, application), through middle order thinking (analysis, synthesis) to higher order thinking (evaluation, judgement, criticism).

Clearly the selection of the form of the test item will be based on the principle of gaining the maximum amount of information in the most economical way. More recently this is evidenced in the explosive rise of machine-scorable multiple choice completion tests, where optical mark readers and scanners can enter and process large scale data rapidly.

Step 4: Choose the Right Test Format

Much of the discussion in this chapter assumes that the test is of the pen-and-paper variety. Clearly this need not be the case: tests can be written, oral, practical, interactive, computer-based, dramatic, diagrammatic, pictorial or photographic, and can involve audio and video material, presentations, role-play or simulations. This does not negate the issues discussed in this chapter, for the form of the test will still need to consider, for example, reliability and validity, difficulty, discriminability, marking and grading, item analysis, timing.

Indeed several of these factors take on an added significance in non-written forms of testing; for example:

(a) reliability is a major issue in judging live musical performance or the performance of a gymnastics routine—where a ‘one-off’ event is likely

(b) reliability and validity are significant issues in group performance or group exercises—where group dynamics may prevent a testee’s true abilities from being demonstrated.

Clearly the researcher will need to consider whether the test will be undertaken individually, or in a group, and what form it will take.

Step 5: Write Effective Test Items

The test will need to address the intended and unintended clues and cues that might be provided in it, for example (Morris et al., 1987):

  • The number of blanks might indicate the number of words required;
  • The number of dots might indicate the number of letters required;
  • The length of blanks might indicate the length of response required;
  • The space left for completion will give cues about how much to write;
  • Blanks in different parts of a sentence will be assisted by the reader having read the other parts of the sentence (anaphoric and cataphoric reading cues).
Guidelines

Hanna (1993:139–41) and Cunningham (1998) provide several guidelines for constructing short answer items to overcome some of these problems:

  • Make the blanks close to the end of the sentence
  • Keep the blanks the same length
  • Ensure that there can be only a single correct answer
  • Avoid putting several blanks close to each other (in a sentence or paragraph) such that the overall meaning is obscured
  • Only make blanks of key words or concepts, rather than of trivial words
  • Avoid addressing only trivial matters
  • Ensure that students know exactly the kind and specificity of the answer required
  • Specify the units in which a numerical answer is to be given
  • Use short-answers for testing knowledge recall.
Potential Problems

With regard to multiple choice items there are several potential problems:

  • The number of choices in a single multiple choice item (and whether there is one or more right answer(s))
  • The number and realism of the distractors in a multiple-choice item (e.g. there might be many distractors but many of them are too obvious to be chosen—there may be several redundant items)
  • The sequence of items and their effects on each other
  • The location of the correct response(s) in a multiple choice item.
Effectiveness of MCQs

Gronlund and Linn (1990), Hanna (1993: 161–75) and Cunningham (1998) set out several suggestions for constructing effective multiple choice test items:

  • Ensure that they catch significant knowledge and learning rather than low-level recall of facts
  • Frame the nature of the issue in the stem of the item, ensuring that the stem is meaningful in itself (e.g. replace the general ‘sheep: (a) are graminivorous, (b) are cloven footed, (c) usually give birth to one or two calves at a time’ with ‘how many lambs are normally born to a sheep at one time?’)
  • Ensure that the stem includes as much of the item as possible, with no irrelevancies
  • Avoid negative stems to the item
  • Keep the readability levels low
  • Ensure clarity and unambiguity
  • Ensure that all the options are plausible so that guessing of the only possible option is avoided
  • Avoid the possibility of students making the correct choice through incorrect reasoning
  • Include some novelty to the item if it is being used to measure understanding
  • Ensure that there can only be a single correct option (if a single answer is required) and that it is unambiguously the right response
  • Avoid syntactical and grammatical clues by making all options syntactically and grammatically parallel and by avoiding matching the phrasing of a stem with similar phrasing in the response
  • Avoid including in the stem clues as to which may be the correct response
  • Ensure that the length of each response item is the same (e.g. to avoid one long correct answer standing out)
  • Keep each option separate, avoiding options which are included in each other
  • Ensure that the correct option is positioned differently for each item (e.g. so that it is not always option 2)
  • Avoid using options like ‘all of the above’ or ‘none of the above’
  • Avoid answers from one item being used to cue answers to another item—keep items separate.
Problems with True and False Items

Morris et al. (1987:161), Gronlund and Linn (1990), Hanna (1993:147) and Cunningham (1998) also indicate particular problems in true/false questions:

  • Ambiguity of meaning
  • Some items might be partly true or partly false
  • Items that polarize—being too easy or too hard
  • Most items might be true or false under certain conditions
  • It may not be clear to the student whether facts or opinions are being sought
  • As this is dichotomous, students have an even chance of guessing the correct answer
  • An imbalance of true to false statements
  • Some items might contain ‘absolutes’ which give powerful clues, e.g. ‘always’, ‘never’, ‘all’, ‘none’.
Precautions for True and False Items

To overcome these problems the authors suggest several points that can be addressed:

  • Avoid generalized statements (as they are usually false)
  • Avoid trivial questions
  • Avoid negatives and double negatives in statements
  • Avoid over-long and over-complex statements
  • Ensure that items are rooted in facts
  • Ensure that statements can be either only true or false
  • Write statements in everyday language
  • Decide where it is appropriate to use ‘degrees’—‘generally’, ‘usually’, ‘often’—as these are capable of interpretation
  • Avoid ambiguities
  • Ensure that each statement only contains one idea
  • If an opinion is to be sought then ensure that it is attributable to a named source
  • Ensure that true statements and false statements are equal in length and number.
Matching Items and Potential Difficulties

Morris et al. (1987), Hanna (1993:150–2) and Cunningham (1998) also indicate particular potential difficulties in matching items:

  • It might be very clear to a student which items in a list simply cannot be matched to items in the other list (e.g. by dint of content, grammar, concepts), thereby enabling the student to complete the matching by elimination rather than understanding
  • One item in one list might be able to be matched to several items in the other
  • The lists might contain unequal numbers of items, thereby introducing distractors—rendering the selection as much a multiple choice item as a matching exercise.
How to Address Difficulties in Matching Items

The authors suggest that difficulties in matching items can be addressed thus:

  • Ensure that the items for matching are homogeneous—similar—over the whole test (to render guessing more difficult)
  • Avoid constructing matching items whose answers can be worked out by elimination (e.g. by ensuring that: (a) there are different numbers of entries in each column, so that there are more options than there are items to be matched; (b) students cannot progressively reduce the field of options as they increase the number of items that they have matched; (c) the same option may be used more than once)
  • Decide whether to mix the two columns of matched items (i.e. ensure, if desired, that each column includes both items and options)
  • Sequence the options for matching so that they are logical and easy to follow (e.g. by number, by chronology)
  • Avoid over-long columns and keep the columns on a single page
  • Make the statements in the options columns as brief as possible
  • Avoid ambiguity by ensuring that there is a clearly suitable option that stands out from its rivals
  • Make it clear what the nature of the relationship should be between the item and the option (on what terms they relate to each other)
  • Number the items and letter the options.

With regard to essay questions, there are several advantages that can be claimed. For example, an essay, as an open form of testing, enables complex learning outcomes to be measured, it enables the student to integrate, apply and synthesize knowledge, to demonstrate the ability for expression and self-expression, and to demonstrate higher order and divergent cognitive processes.

Further, it is comparatively easy to construct an essay title. On the other hand, essays have been criticized for yielding unreliable data (Gronlund and Linn, 1990; Cunningham, 1998), for being prone to unreliable (inconsistent and variable) scoring, neglectful of intended learning outcomes and prone to marker bias and preference (being too intuitive, subjective, holistic, and time-consuming to mark).

To overcome these difficulties the authors suggest that:

  • The essay question must be restricted to those learning outcomes that are unable to be measured more objectively
  • The essay question must ensure that it is clearly linked to desired learning outcomes; that it is clear what behaviors the students must demonstrate
  • The essay question must indicate the field and tasks very clearly (e.g. ‘compare’, ‘justify’, ‘critique’, ‘summarize’, ‘classify’, ‘analyse’, ‘clarify’, ‘examine’, ‘apply’, ‘evaluate’, ‘synthesize’, ‘contrast’, ‘explain’, ‘illustrate’)
  • Time limits are set for each essay
  • Options are avoided, or, if options are to be given, ensure that, if students have a list of titles from which to choose, each title is equally difficult and equally capable of enabling the student to demonstrate achievement, understanding etc.
  • Marking criteria are prepared and are explicit, indicating what must be included in the answers and the points to be awarded for such inclusions or ratings to be scored for the extent to which certain criteria have been met
  • Decisions are agreed on how to address and score irrelevancies, inaccuracies, poor grammar and spelling
  • The work is double marked, blind, and, where appropriate, without the marker knowing (the name of) the essay writer.

Clearly these are issues of reliability. A further issue is that layout can exert a profound effect on the test.

Step 6: Optimize Test Layout and Design

This will include (Gronlund and Linn, 1990; Hanna, 1993; Linn, 1993; Cunningham, 1998):

  • The nature, length and clarity of the instructions (e.g. what to do, how long to take, how much to do, how many items to attempt, what kind of response is required (a single word, a sentence, a paragraph, a formula, a number, a statement), how and where to enter the response, where to show the ‘working out’ of a problem, where to start new answers (e.g. in a separate booklet), and whether one answer or more than one is required for a multiple-choice item)
  • The spreading of the instructions through the test, avoiding overloading students with too much information at first, and providing instructions for each section as they come to it
  • What marks are to be awarded for which parts of the test
  • Minimizing ambiguity and taking care over the readability of the items
  • The progression from the easy to the more difficult items of the test (i.e. the location and sequence of items)
  • The visual layout of the page, for example, avoiding overloading students with visual material or words
  • The grouping of items—keeping together items that have the same contents or the same format
  • The setting out of the answer sheets/locations so that they can be entered onto computers and read by optical mark readers and scanners (if appropriate).

The layout of the test should support its completion as efficiently and effectively as possible for the student.

Step 7: Determine Appropriate Timing

This refers to two areas:

(a) when the test will take place (the day of the week, month, time of day)

(b) the time allowances to be given to the test and its component items.

With regard to the former, in part this is a matter of reliability, for the time of day, week etc. might influence how alert, motivated, capable a student might be. With regard to the latter, the researcher will need to decide what time restrictions are being imposed and why (for example, is the pressure of a time constraint desirable—to show what a student can do under time pressure—or an unnecessary impediment, putting a time boundary around something that need not be bounded—was Van Gogh put under a time pressure to produce the painting of sunflowers?).

Though it is vital that the student knows what the overall time allowance is for the test, clearly it might be helpful to a student to indicate notional time allowances for different elements of the test; if these are aligned to the relative weightings of the test (see the discussions of weighting and scoring) they enable a student to decide where to place emphasis in the test—she may want to concentrate her time on the high scoring elements of the test.

Further, if the items of the test have exact time allowances, this enables a degree of standardization to be built into the test, and this may be useful if the results are going to be used to compare individuals or groups.

Step 8: Plan Your Scoring System

The awarding of scores for different items of the test is a clear indication of the relative significance of each item—the weightings of each item are addressed in their scoring. It is important to ensure that easier parts of the test attract fewer marks than more difficult parts of it, otherwise a student’s results might be artificially inflated by answering many easy questions and fewer more difficult questions (Gronlund and Linn, 1990).

Attractions

Additionally, there are several attractions to making the scoring of tests as detailed and specific as possible (Cresswell and Houston, 1991; Gipps, 1994), awarding specific points for each item and sub-item (a brief illustrative sketch follows this list), for example:

  • It enables partial completion of the task to be recognized—students gain marks in proportion to how much of the task they have completed successfully (an important feature of domain referencing)
  • It enables a student to compensate for doing badly in some parts of a test by doing well in other parts of the test
  • It enables weightings to be made explicit to the students
  • It enables the rewards for successful completion of parts of a test to reflect considerations such as the length of the item, the time required to complete it, its level of difficulty and its level of importance
  • It facilitates moderation because it is clear and specific
  • It enables comparisons to be made across groups by item
  • It enables reliability indices to be calculated (see discussions of reliability)
  • Scores can be aggregated and converted into grades straightforwardly.

Ebel (1979) argues that the more marks that are available to indicate different levels of achievement (e.g. for the awarding of grades), the greater the reliability of the grades will be, though clearly this could make the test longer.
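To make this concrete, the short Python sketch below shows one way such a weighted mark scheme might be represented; the items, their maximum marks and their difficulty labels are purely illustrative. Because each item is scored separately, partial completion is recognized and harder items contribute more to the total than easy ones.

```python
# Hypothetical mark scheme: harder items carry more marks, so a total
# cannot be inflated simply by answering many easy questions.
mark_scheme = {
    "item_1": {"max_marks": 2, "difficulty": "easy"},
    "item_2": {"max_marks": 4, "difficulty": "medium"},
    "item_3": {"max_marks": 8, "difficulty": "hard"},
}

def total_score(awarded):
    """Aggregate item-level marks into a raw total and a percentage.

    `awarded` maps item ids to the marks actually given; scoring each
    item separately means partial completion earns proportionate credit.
    """
    total = sum(awarded.get(item, 0) for item in mark_scheme)
    maximum = sum(spec["max_marks"] for spec in mark_scheme.values())
    return total, 100.0 * total / maximum

# Full marks on the easy item, partial credit on the harder ones.
raw, percent = total_score({"item_1": 2, "item_2": 3, "item_3": 4})
print(raw, round(percent, 1))   # 9 marks, roughly 64.3 per cent
```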

The scoring scheme will also need to handle poor spelling, grammar and punctuation: are these to be penalized, and how will consistency be assured? Further, how will omissions be treated, for example if a student omits the units of measurement (miles per hour, dollars or pounds, metres or centimetres)?

Related to the scoring of the test is the issue of reporting the results. If the scoring of a test is specific then this enables variety in reporting to be addressed, for example, results may be reported item by item, section by section, or whole test by whole test. This degree of flexibility might be useful for the researcher, as it will enable particular strengths and weaknesses in groups of students to be exposed.

The desirability of some of the above points is open to question. For example, it could be argued that the strength of criterion-referencing is precisely its specificity, and that to aggregate data (e.g. to assign grades) is to lose the very purpose of the criterion-referencing (Gipps, 1994:85). For example, if I am awarded a grade E for spelling in English, and a grade A for imaginative writing, this could be aggregated into a C grade as an overall grade of my English language competence, but what does this C grade mean?

It is meaningless: it has no frame of reference or clear criteria, it loses the useful specificity of the A and E grades, and it is a compromise that actually tells us nothing. Further, aggregating such grades assumes equal levels of difficulty of all items. Of course, raw scores are still open to interpretation—which is a matter of judgement rather than exactitude or precision (Wiliam, 1996).

For example, if a test is designed to assess ‘mastery’ of a subject, then the researcher is faced with the issue of deciding what constitutes ‘mastery’—is it an absolute (i.e. a very high score) or are there gradations, and if the latter, where do these gradations fall? For published tests the scoring is standardized and already made clear, as are the conversions of scores into, for example, percentiles and grades.

Underpinning the discussion of scoring is the need to make it unequivocally clear exactly what the marking criteria are—what will and will not score points. This requires clarifying whether there is a ‘checklist’ of features that must be present in a student’s answer. Clearly, criterion-referenced tests will have to declare their lowest boundary, a cut-off point below which the student is deemed not to have met the criteria.

A compromise can be seen in those criterion-referenced tests which award different grades for different levels of performance of the same task, necessitating the clarification of different cut-off points in the examination. A common example of this can be seen in the GCSE examinations for secondary school pupils in the United Kingdom, where students can achieve a grade between A and F for a criterion-related examination.

The determination of cut-off points has been addressed by Nedelsky (1954), Angoff (1971), Ebel (1972) and Linn (1993). Angoff (1971) suggests a method for dichotomously scored items. Here judges are asked to identify the proportion of minimally acceptable persons who would answer each item correctly. The sum of these proportions would then be taken to represent the minimally acceptable score. An elaborated version of this principle comes from Ebel (1972).

Here a difficulty by relevance matrix is constructed for all the items. Difficulty might be assigned three levels (e.g. easy, medium and hard), and relevance might be assigned three levels (e.g. highly relevant, moderately relevant, barely relevant). When each and every test item has been assigned to the cells of the matrix the judges estimate the proportion of items in each cell that minimally acceptable persons would answer correctly, with the standard for each judge being the weighted average of the proportions in each cell (which are determined by the number of items in each cell).

In this method judges have to consider two factors—relevance and difficulty (unlike Angoff, where only difficulty featured). What characterizes these approaches is the trust that they place in experts in making judgements about levels (e.g. of difficulty, or relevance, or proportions of successful achievement), i.e. they are based on fallible human subjectivity. Ebel (1979) argues that one principle in the assignment of grades is that they should represent equal intervals on the score scales.
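The Python sketch below illustrates the two standard-setting procedures just described. The judges’ item-by-item estimates and the difficulty-by-relevance cells are hypothetical; in practice more judges would be used and their estimates moderated.

```python
# A minimal sketch of the Angoff and Ebel procedures described above;
# all estimates below are hypothetical.

def angoff_cut_score(judge_estimates):
    """Angoff (1971): each judge estimates, for every dichotomously
    scored item, the proportion of minimally acceptable candidates who
    would answer it correctly. Summing a judge's proportions gives that
    judge's cut score (in items); the cut scores are averaged across judges."""
    per_judge = [sum(estimates) for estimates in judge_estimates]
    return sum(per_judge) / len(per_judge)

def ebel_cut_score(cells):
    """Ebel (1972): items are placed in a difficulty-by-relevance matrix.
    For each cell the judge estimates the proportion of its items that a
    minimally acceptable candidate would answer correctly; weighting by
    the number of items per cell gives an expected number of items correct."""
    return sum(n_items * proportion for n_items, proportion in cells)

# Two judges, five items (Angoff).
judges = [
    [0.6, 0.8, 0.5, 0.9, 0.7],   # judge A's estimates, item by item
    [0.5, 0.7, 0.6, 0.8, 0.6],   # judge B's estimates, item by item
]
print(angoff_cut_score(judges))        # 3.35 items out of 5

# One judge's difficulty-by-relevance cells: (number of items, proportion).
cells = [(4, 0.9), (3, 0.7), (2, 0.5), (1, 0.3)]
print(ebel_cut_score(cells))           # 7.0 items correct out of 10
```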

Reference is made to median scores and standard deviations: the median because it is meaningless to assume an absolute zero in scoring, and the standard deviation because it provides a unit of convenient size for the range of scores within each grade (see also Cohen and Holliday, 1996). One procedure is thus:

Step 1 Calculate the median and standard deviation of the scores.

Step 2 Determine the lower score limits of the mark intervals using the median and the standard deviation as the unit of size for each grade.
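One possible reading of this two-step procedure is sketched in Python below. The scores are invented, and the choices of five grades and of one standard deviation as the width of each grade interval, with grade C centred on the median, are assumptions made for illustration only.

```python
import statistics

# Step 1: calculate the median and standard deviation of the scores.
scores = [34, 41, 45, 48, 52, 55, 57, 60, 63, 70]   # illustrative raw scores
median = statistics.median(scores)
sd = statistics.stdev(scores)

# Step 2: determine the lower score limit of each grade, using the
# standard deviation as the width of a grade and centring grade C on
# the median (an assumption for this sketch).
grades = ["A", "B", "C", "D", "E"]
lower_limits = {
    grade: round(median + (1.5 - rank) * sd, 1)
    for rank, grade in enumerate(grades)
}
print(lower_limits)
# {'A': 69.7, 'B': 58.9, 'C': 48.1, 'D': 37.3, 'E': 26.5}
```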

However, the issue of cut-off scores is complicated by the fact that they may vary according to the different purposes and uses of scores (e.g. for diagnosis, for certification, for selection, for program evaluation), as these purposes will affect the number of cut-off points and grades, and the precision of detail required. For a full analysis of determining cut-off grades see Linn (1993). The issue of scoring takes in a range of factors, for example: grade norms, age norms, percentile norms and standard score norms (e.g. z-scores, T-scores, stanine scores, percentiles).

These are beyond the scope of this book to discuss, but readers are referred to Cronbach (1970), Gronlund and Linn (1990), Cohen and Holliday (1996), Hopkins et al. (1996).

Devising A Pretest and Post-Test

The construction and administration of tests is an essential part of the experimental model of research, where a pretest and a post-test have to be devised for the control and experimental groups. The pretest and post-test must adhere to several guidelines:

  • The pretest may have questions which differ in form or wording from the post-test, though the two tests must test the same content, i.e. they will be alternate forms of a test for the same groups.
  • The pretest must be the same for the control and experimental groups.
  • The post-test must be the same for both groups.
  • Care must be taken in the construction of a post-test to avoid making the test easier to complete by one group than another.
  • The level of difficulty must be the same in both tests.

Test data feature centrally in the experimental model of research; additionally, they may feature as part of a questionnaire, interview and documentary material.

Reliability And Validity of Tests

Chapter 5 covers issues of reliability and validity. Suffice it here to say that reliability concerns the degree of confidence that can be placed in the results and the data, which is often a matter of statistical calculation and subsequent redesigning of the test. Validity, on the other hand, concerns the extent to which the test tests what it is supposed to test! This devolves on content, construct, face, criterion-related, and concurrent validity.
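As a small illustration of the kind of statistical calculation involved, the sketch below computes Cronbach’s alpha, one widely used index of internal-consistency reliability; the item scores are hypothetical, and alpha is only one of many possible reliability indices (see Chapter 5).

```python
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance
    of total scores), where k is the number of items. `item_scores` is a
    list of items, each a list of scores (one score per testee)."""
    k = len(item_scores)
    n_testees = len(item_scores[0])
    item_variances = [statistics.pvariance(item) for item in item_scores]
    totals = [sum(item[i] for item in item_scores) for i in range(n_testees)]
    total_variance = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: three items, five testees.
items = [
    [2, 3, 3, 4, 5],
    [1, 2, 3, 3, 4],
    [2, 2, 4, 4, 5],
]
print(round(cronbach_alpha(items), 2))   # about 0.96 for these invented scores
```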

Ethical Issues in Preparing for Tests

A major source of unreliability of test data derives from the extent and ways in which students have been prepared for the test. These can be located on a continuum from direct and specific preparation, through indirect and general preparation, to no preparation at all. With the growing demand for test data (e.g. for selection, for certification, for grading, for employment, for tracking, for entry to higher education, for accountability, for judging schools and teachers) there is a perhaps understandable pressure to prepare students for tests.

This is the ‘high stakes’ aspect of testing (Harlen, 1994), where much hinges on the test results. At one level this can be seen in the backwash effect of examinations on curricula and syllabuses; at another level it can lead to the direct preparation of students for specific examinations. Preparation can take many forms (Mehrens and Kaminski, 1989; Gipps, 1994):

  • Ensuring coverage, amongst other program contents and objectives, of the objectives and program that will be tested
  • Restricting the coverage of the program content and objectives to those only that will be tested
  • Preparing students with ‘exam technique’
  • Practice with past/similar papers
  • Directly matching the teaching to specific test items, where each piece of teaching and content is the same as each test item
  • Practice on an exactly parallel form of the test
  • Telling students in advance what will appear on the test
  • Practice on, and preparation of, the identical test itself (e.g. giving out test papers in advance) without teacher input
  • Practice on, and preparation of, the identical test itself (e.g. giving out the test papers in advance), with the teacher working through the items, maybe providing sample answers.

How ethical it would be to undertake the final four of these, or indeed any apart from the first on the list, is perhaps questionable. Are they cheating or legitimate test preparation? Should one teach to a test? Is not doing so a dereliction of duty (e.g. in criterion- and domain-referenced tests), or does teaching to the test give students an unfair advantage and thus reduce the reliability of the test as a true and fair measure of ability or achievement?

In high stakes assessment (e.g. for public accountability and to compare schools and teachers) there is even the issue of not entering for tests those students whose performance will be low (see, for example, Haladyna, Nolen and Haas, 1991). There is a risk of a correlation between the ‘stakes’ and the degree of unethical practice: the greater the stakes, the greater the incidence of unethical practice. Unethical practice, Gipps (1994) observes, occurs where scores are inflated without supporting reliable inferences about performance or achievement, and where different groups of students are prepared differentially for tests, i.e. where some students are given an unfair advantage over others.

To overcome such problems, she suggests, it is ethical and legitimate for teachers to teach to a broader domain than the test, teachers should not teach directly to the test, and only better instruction, rather than test preparation, is acceptable (Cunningham, 1998).

One can add to this list of considerations (Cronbach, 1970; Hanna, 1993; Cunningham, 1998) the view that:

  • Tests must be valid and reliable (see the chapter on reliability and validity)
  • The administration, marking and use of the test should only be undertaken by suitably competent/qualified people (i.e. people and projects should be vetted)
  • Access to test materials should be controlled, for instance: test items should not be reproduced apart from selections in professional publications; the tests should only be released to suitably qualified professionals in connection with specific professionally acceptable projects
  • Tests should benefit the testee (beneficence)
  • Clear marking and grading protocols should exist (the issue of transparency is discussed in the chapter on reliability and validity)
  • Test results should only be reported in a way that cannot be misinterpreted
  • The privacy and dignity of individuals should be respected (e.g. confidentiality, anonymity, non-traceability)
  • Individuals should not be harmed by the test or its results (non-maleficence)
  • Informed consent to participate in the test should be sought.

Computerized Adaptive Testing

A recent trend in testing is towards computerized adaptive testing (Wainer, 1990). This is particularly useful for large-scale testing, where a wide range of ability can be expected. Here a test must be devised that enables the tester to cover this wide range of ability; hence it must include items ranging from the easy to the difficult: too easy and the test cannot chart the range of high ability (testees simply get all the answers right); too difficult and it cannot chart the range of low ability (testees simply get all the answers wrong).

We find out very little about a testee if we ask a battery of questions which are too easy or too difficult for her.

Further, it is more efficient and reliable if a test can avoid the problem for high ability testees of having to work through a mass of easy items in order to reach the more difficult items, and for low ability testees of having to try to guess the answers to more difficult items. Hence it is useful to have a test that is flexible and that can be adapted to the testees. For example, if a testee found an item too hard the next item could adapt to this and be easier, and, conversely, if a testee was successful on an item the next item could be harder.

Wainer indicates that in an adaptive test the first item is pitched in the middle of the assumed ability range; if the testee answers it correctly then it is followed by a more difficult item, and if the testee answers it incorrectly then it is followed by an easier item. Computers here provide an ideal opportunity to address the flexibility, discriminability and efficiency of testing.
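A minimal sketch of this branching rule is given below; it simply steps up or down a difficulty-ordered item bank after each response, which is a deliberate simplification of the item-response-theory-based selection used in operational adaptive tests, and the item bank itself is invented.

```python
# Start in the middle of the assumed ability range, move to a harder item
# after a correct answer and to an easier item after an incorrect one.

def adaptive_sequence(item_bank, responses):
    """Return the indices of the items administered, given an item bank
    ordered from easiest to hardest and a list of True/False responses."""
    position = len(item_bank) // 2            # first item: mid-range difficulty
    administered = [position]
    for correct in responses:
        step = 1 if correct else -1           # harder if right, easier if wrong
        position = min(len(item_bank) - 1, max(0, position + step))
        administered.append(position)
    return administered

bank = ["very easy", "easy", "moderate", "hard", "very hard"]
# A testee who answers the first two items correctly and the third incorrectly.
for index in adaptive_sequence(bank, [True, True, False]):
    print(bank[index])      # moderate, hard, very hard, hard
```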

Testees can work at their own pace; they need not be discouraged but can be challenged; the test is scored instantly to provide feedback to the testee; a greater range of items can be included in the test; and a greater degree of precision and reliability of measurement can be achieved. Indeed, test security can be increased and the problem of understanding answer sheets is avoided. Clearly the use of computer adaptive testing has several putative attractions.

On the other hand it requires different skills from traditional tests, and these might compromise the reliability of the test, for example:

  • The mental processes required to work with a computer screen and computer programme differ from those required for a pen and paper test
  • Motivation and anxiety levels increase or decrease when testees work with computers
  • The physical environment might exert a significant difference, e.g. lighting, glare from the screen, noise from machines, loading and running the software
  • Reliability shifts from an index of the variability of the test to an index of the standard error of the testee’s performance.
  • The usual formula for calculating the standard error assumes that error variance is the same for all scores, whereas in item response theory error variance is assumed to depend on each testee’s ability. The conventional statistic calculates a single average error variance for summed scores, whereas in item response theory this is at best very crude and at worst misleading, as variation is a function of ability rather than of the test and cannot fairly be summed (see Thissen, 1990, for an analysis of how to address this issue; a brief sketch follows this list);
  • Having so many test items increases the chance of including poor items.
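To illustrate the point about error variance, the brief sketch below uses the Rasch model, the simplest case of item response theory, to show the standard error of measurement varying with the testee’s ability rather than being a single figure for the whole test; the item difficulties are illustrative.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def item_information(theta, difficulty):
    """Rasch item information: p * (1 - p) at ability theta."""
    p = rasch_probability(theta, difficulty)
    return p * (1 - p)

def standard_error(theta, difficulties):
    """SE(theta) = 1 / sqrt(total test information at theta)."""
    total_information = sum(item_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(total_information)

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]    # an illustrative easy-to-hard item set
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(standard_error(theta, difficulties), 2))
# The error is smallest where the testee's ability matches the items'
# difficulties and larger at the extremes, unlike the single classical figure.
```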

Computer adaptive testing requires a large item pool to be developed for each content domain (Flaugher, 1990), with sufficient numbers, variety and spread of difficulty. The items have to be pretested and validated, their difficulty and discriminability calculated, the effect of distractors reduced, the capability of the test to address unidimensionality and/or multidimensionality clarified, and the rules for selecting items put in place.
