## Tests Used for Outcomes Evaluation in Nursing Education and Statistics to Check Reliability and Validity


### Test Statistics Used for Test Reliability in Nursing Education

Various test statistics can be calculated, generated by test authoring software, or reported from computer scoring services. These statistics help faculty interpret test results and provide data for item revision. Test grading software typically provides students’ raw and percentage scores, individual student reports, and test statistics such as central tendencies and test reliability indices as well as item analysis data.

### Raw Score for Testing the Reliability of Test Statistics in Nursing Education

The raw score is the number of test questions answered correctly. Raw scores are the most accurate test scores but yield limited information on their own. A frequency distribution can be used to arrange raw scores into class intervals. If tests are scored by computer, a frequency polygon is likely to be provided with the results. The percentage score compares the raw score with the maximum possible score.
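The raw-to-percentage conversion and a simple frequency distribution can be sketched as follows; the scores and test length are invented for illustration:

```python
# Hypothetical example: converting raw scores to percentage scores
# and tallying a simple frequency distribution of raw scores.
from collections import Counter

raw_scores = [42, 38, 45, 38, 50, 42, 42, 35]  # items answered correctly
max_score = 50                                  # total items on the test

# percentage score = raw score / maximum possible score * 100
percentages = [round(raw / max_score * 100, 1) for raw in raw_scores]
frequency = Counter(raw_scores)  # how many students earned each raw score

print(percentages)    # 42/50 -> 84.0, and so on
print(frequency[42])  # three students scored 42
```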

### Central Tendency of Test Statistics for Reliability Score

Central tendency is a descriptive statistic for a set of scores. Measures of central tendency include the mean, median, and mode. The mean (or average) has the advantage of ease of calculation. The mean is calculated as the sum of all scores divided by the total number of scores.

The median divides the scores in the middle (i.e., 50% of scores fall below the median and 50% of scores are above the median). The median is a better measure of central tendency than the mean if the scores are not normally distributed.
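A minimal sketch of these measures using Python's standard library; the score set is invented, with one high outlier to show why the median can be preferable to the mean:

```python
# Measures of central tendency for an illustrative set of test scores.
import statistics

scores = [70, 72, 75, 78, 80, 82, 85, 85, 98]

mean = statistics.mean(scores)      # sum of all scores / number of scores
median = statistics.median(scores)  # middle score; less affected by the outlier 98
mode = statistics.mode(scores)      # most frequently occurring score

print(mean, median, mode)
```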

### Variability of Test Statistics for Reliability Score

Variability refers to the dispersion of scores and is thus a measure of group heterogeneity. Variability of scores affects other statistics. For example, low variability (homogeneity of scores) will tend to lower reliability coefficients such as the Kuder-Richardson coefficient (Lyman, 1997).

Relative grading scales are most meaningful when they are applied to a wide range of scores. Mastery tests, by design, may show little variability. As groups of students progress in a nursing program, there may be less variability in scores because of attrition of students (failure or withdrawal from the program).
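The Kuder-Richardson coefficient mentioned above (KR-20) can be sketched from dichotomously scored item data; the response matrix here is invented, and a real class would be much larger:

```python
# Illustrative sketch of the Kuder-Richardson 20 (KR-20) reliability
# coefficient: KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of totals).
from statistics import pvariance

def kr20(item_matrix):
    """item_matrix: rows = students, columns = 1 (correct) / 0 (incorrect)."""
    k = len(item_matrix[0])                     # number of items
    totals = [sum(row) for row in item_matrix]  # each student's raw score
    var = pvariance(totals)                     # variance of total scores
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_matrix) / len(item_matrix)
        pq += p * (1 - p)                       # item variance for item i
    return (k / (k - 1)) * (1 - pq / var)

# A heterogeneous group: total scores are well spread out.
spread = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(round(kr20(spread), 2))
```

Shrinking the spread of total scores (e.g., if attrition removes the lowest scorers) lowers the variance in the denominator and with it the coefficient, which is the effect the text describes.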

### Range of Test Statistics for Reliability Score

The range is the simplest measure of variability and is calculated by subtracting the lowest score from the highest score.

### Standard Deviation of Test Statistics for Reliability Score

The standard deviation (SD) of scores is the best measure of variability. Most computer scoring programs provide the SD of the scores with the results. Calculators with statistical functions can also be used to calculate the SD.

For more information on formulas and methods for calculating the SD, consult a statistics text. Conceptually, the SD is the average distance of scores from the mean. The SD can be used in making interpretations from the normal curve (Lyman, 1997).
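Both measures of variability can be computed directly; the scores below are illustrative:

```python
# Range and standard deviation as measures of score variability.
import statistics

scores = [62, 70, 75, 80, 85, 88, 95]

score_range = max(scores) - min(scores)  # highest score minus lowest score
sd = statistics.stdev(scores)            # sample standard deviation

print(score_range)   # 33
print(round(sd, 2))
```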

### Normal Curve of Test Statistics for Reliability Score

The normal curve is a theoretical distribution of scores that is bell shaped and symmetrical. The mean, median, and mode are the same score on a normal curve. Also, for a normal curve, 68% of scores will fall within ± 1 SD of the mean and 95% of scores will fall within ± 2 SDs of the mean. This distribution may be used in assigning grades.
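The 68% and 95% figures can be checked against the theoretical normal distribution with the standard library:

```python
# Proportion of a normal distribution falling within 1 and 2 SDs of the mean.
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

within_1_sd = std_normal.cdf(1) - std_normal.cdf(-1)
within_2_sd = std_normal.cdf(2) - std_normal.cdf(-2)

print(round(within_1_sd, 3))  # ~0.683, i.e., about 68%
print(round(within_2_sd, 3))  # ~0.954, i.e., about 95%
```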

### Standard Error of Measurement of Test Statistics for Reliability Score

The standard error of measurement is an estimate of how much the observed score is likely to differ from the “true” score. That is, the student’s “true” score most likely lies between the observed score plus or minus the standard error. The standard error of measurement is calculated using the SD and the test reliability.

Many computer scoring programs calculate the standard error of measurement. Some faculty members give students the benefit of the doubt and add the standard error to each raw score before they assign grades.
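A common formula for the standard error of measurement is SEM = SD × √(1 − reliability); a minimal sketch, with an illustrative SD and reliability coefficient:

```python
# Standard error of measurement from the test SD and a reliability coefficient.
import math

sd = 8.0            # standard deviation of test scores (illustrative)
reliability = 0.84  # e.g., a KR-20 coefficient (illustrative)

sem = sd * math.sqrt(1 - reliability)
print(round(sem, 1))  # 3.2

# The student's "true" score most likely lies within observed score +/- SEM:
observed = 75
print(round(observed - sem, 1), round(observed + sem, 1))  # 71.8 78.2
```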

### Standardized Scores of Test Statistics for Reliability Score

Standardized scores allow for ease of comparison between individual scores and sets of scores. The z score converts a raw score into units of SD on a normal curve. The z score is calculated as follows:

z = (x − m) / SD

where x = observed score, m = mean, and SD = standard deviation.

For example, a raw score of 34 that works out to z = −0.5 falls approximately 0.5 SD below the mean. Because z scores are expressed using decimals and both positive and negative values, many faculty prefer to use t scores instead. The z score can be used to calculate the t score (t = 50 + 10z). Converting raw scores to t scores has the following advantages:

- The mean of the distribution is set at 50.
- The SD from the mean is set at 10.
- t scores can be manipulated mathematically for grading purposes.
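The two conversions can be sketched together; the mean and SD below are assumptions chosen so that a raw score of 34 reproduces the text's z of −0.5 (the original example's values were not preserved):

```python
# Converting a raw score to a z score and then to a t score.
mean = 38.0  # illustrative class mean
sd = 8.0     # illustrative class SD

def z_score(x, m, sd):
    # units of SD above (+) or below (-) the mean
    return (x - m) / sd

def t_score(z):
    # t distribution of scores: mean set at 50, SD set at 10
    return 50 + 10 * z

z = z_score(34, mean, sd)  # (34 - 38) / 8 = -0.5
t = t_score(z)             # 50 + 10 * (-0.5) = 45.0
print(z, t)
```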

### Conducting the Item Analysis of Test Statistics for Reliability Score

Classic test theory is used for this discussion of item analysis and the discrimination index. Classic test theory and related inferences assume a norm referenced measure. For a critique of classic test theory and an explanation of the newer item response theories, see Developing and Validating Multiple-Choice Test Items (Haladyna, 2004).

Item response theories depend on large samples and thus are of limited application to classroom tests. Item analysis assists faculty in determining whether test items have separated the learners from the non-learners (discrimination). Many computer scoring programs supply item statistics.

### Item Difficulty of Test Statistics for Reliability Score

The item difficulty index (P value) is simply the percentage correct for the group answering the item. The upper limit of item difficulty is 1.00, meaning that 100% of students answered the question correctly.

The lower limit of item difficulty depends on the number of possible responses and is the probability of guessing the correct answer. For example, for a question with four options, P = 0.25 is the lower limit or probability of guessing. An item difficulty index of greater than 0.80 indicates low difficulty; an index of 0.30 to 0.80 indicates medium difficulty; and an index of less than 0.30 indicates high difficulty (Tarrant & Ware, 2012).

McDonald (2013) recommends keeping the P values of the items in the range of 0.70 to 0.80 to help ensure that questions separate learners from non-learners (a good discrimination index). Clifton and Schriner (2010) recommend using 0.50 as a quick reference point, with low limits at 0.30 and high limits at 0.80. Some items may be slightly easier or more difficult, however, and faculty can determine the range of difficulty that is appropriate for their students and tests.
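The P value is simply a proportion, so it can be sketched in a few lines; the item responses are invented:

```python
# Item difficulty index (P value): proportion of the group answering correctly.
def item_difficulty(responses):
    """responses: list of 1 (correct) / 0 (incorrect) for one item."""
    return sum(responses) / len(responses)

item = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # 8 of 10 students answered correctly
p = item_difficulty(item)
print(p)  # 0.8 -> the top of the medium-difficulty band described above
```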

### Item Discrimination of Test Statistics for Reliability Score

Item discrimination, reported as the item discrimination index, refers to how well an item differentiates students who know the content from those who do not. Discrimination can be measured as a point biserial correlation. The point biserial correlation compares each student's item performance with each student's overall test performance.

If a question discriminates well, the point biserial correlation will be highly positive for the correct answer and negative for the distractors. This indicates that the “**learners**,” or the students who knew the content, answered the question correctly and the “**nonlearners**” chose distractors.

An index of greater than 0.40 indicates excellent discrimination; an index of 0.30 to 0.39 indicates good discrimination; an index of 0.15 to 0.29 is satisfactory; an index of less than 0.15 indicates low discrimination; and an index of 0 indicates that there is no discrimination (Tarrant & Ware, 2012). Haladyna (2004) cautions that if the item difficulty index is either too high or too low, the discrimination index is attenuated.

The discrimination index is maximized when the item difficulty is moderate (P = 0.5). Ultimately, test reliability depends on item discrimination. Inclusion of mastery-level material on a norm-referenced test tends to lower test reliability because such items are answered correctly by most students and are thus poor discriminators.
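One common form of the point biserial correlation for an item is (M₁ − M₀)/SDₓ × √(pq), where M₁ and M₀ are the mean total scores of students who answered the item correctly and incorrectly, SDₓ is the SD of total scores, and p and q are the proportions answering correctly and incorrectly. A sketch under that formula, with an invented (and unrealistically small) class:

```python
# Point biserial correlation between one item and total test scores.
# A highly positive value means students with higher totals tended to
# answer the item correctly (the item discriminates well).
from statistics import mean, pstdev

def point_biserial(item, totals):
    """item: 1/0 per student for one question; totals: overall test scores."""
    right = [t for i, t in zip(item, totals) if i == 1]
    wrong = [t for i, t in zip(item, totals) if i == 0]
    p = len(right) / len(item)  # proportion answering correctly
    q = 1 - p
    return (mean(right) - mean(wrong)) / pstdev(totals) * (p * q) ** 0.5

item = [1, 1, 1, 0, 0]        # who answered the item correctly
totals = [48, 45, 40, 33, 30]  # each student's overall test score
print(round(point_biserial(item, totals), 2))
```

Here the three highest scorers answered the item correctly and the two lowest did not, so the correlation is strongly positive; flipping the item column would drive it negative, the pattern described for distractors above.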