
Tuesday, July 27, 2010

Test Item Analysis



Item Discrimination Analysis

Group 1: 0.40 and up
- Items (3), (4), (6), and (8) are considered very good items. They should be retained in the revised test.
- For item (3), three students from the strong group and one from the weak group answered correctly.
- Item (4) is an ideal item: its IF is 0.50, which means it is well centered (50 percent of the students answered correctly and the other 50 percent incorrectly), and its ID is 1.00, which means it discriminates perfectly between the strong and weak groups.
- For item (6), only one weak student answered correctly.
- Item (8) is acceptable because its IF is 0.63 and its ID is 0.40.

Group 2: 0.20 to 0.29
- Items (1) and (10) are marginal items. They should be improved for the revised test.
- For item (1), most students answered correctly; only one weak student answered wrongly. The item needs revision to make its distractors more effective.
- For item (10), almost all students answered wrongly; only one student from the strong group answered correctly. So the item needs to be improved.

Group 3: Below 0.19
- Items (2), (5), (7), and (9) are poor items. They should be discarded or improved through revision.
- For item (2), exactly the same number of students from the strong and the weak group answered correctly, so the item does not discriminate at all.
- Item (5) is clearly a bad item because every student answered it correctly.
- Item (7) should also be rejected because not a single student from the strong group answered correctly, while all the weak students did, which gives the item a negative ID.
- For item (9), more weak students than strong students answered correctly; only one student from the strong group gave the correct answer.
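
As a sketch, the grouping above can be automated. The following Python snippet assigns items to these ID bands; the ID values are illustrative placeholders consistent with the analysis (only items (4) and (8) have IDs stated explicitly in the text):

# A minimal sketch: assign items to the ID bands used in this analysis.
# The ID values are hypothetical, except items (4) and (8), whose IDs
# (1.00 and 0.40) are given above.
item_ids = {1: 0.25, 2: 0.00, 3: 0.60, 4: 1.00, 5: 0.00,
            6: 0.60, 7: -0.80, 8: 0.40, 9: -0.40, 10: 0.20}

def id_band(d):
    """Map an ID value to the revision decision used in this analysis."""
    if d >= 0.40:
        return "Group 1 (very good): retain"
    if d >= 0.30:
        return "between bands (not discussed above)"
    if d >= 0.20:
        return "Group 2 (marginal): improve"
    return "Group 3 (poor): discard or revise"

for item, d in sorted(item_ids.items()):
    print(f"Item ({item}): ID = {d:+.2f} -> {id_band(d)}")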

Test-Retest Reliability

Since the same reading comprehension test was administered twice, over a period of time, to one group of students, test-retest reliability is the appropriate strategy for estimating the reliability of the test. The calculation uses the formula of Pearson's product-moment correlation coefficient:
r = [ΣXY − (ΣX)(ΣY)/N] / √{[ΣX² − (ΣX)²/N][ΣY² − (ΣY)²/N]}
r = [180.91 − (0.04 × 0.08)/10] / √[(173.88 − 0.04²/10)(353.68 − 0.08²/10)]
r = 0.729178253 → rounded to 0.73

From the Pearson product-moment correlation between the two sets of scores, we get a reliability estimate of .73, which means that about 73% of the variance in the students' observed test scores is attributable to their true ability, and the other 27% is attributable to error.
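
As an illustration, the calculation can be scripted in Python using the same computational form of the formula. The two score lists below are hypothetical (the actual scores behind r = .73 are not reproduced in this post):

from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two sets of scores."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = sum_xy - (sum_x * sum_y) / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Hypothetical scores from two administrations of the same test (N = 10)
first = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
second = [13, 14, 10, 17, 15, 12, 15, 12, 11, 16]
print(round(pearson_r(first, second), 2))  # the test-retest reliability estimate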

Split-Half Reliability

Since the test can be administered only once, the 40 test items are divided equally into two halves: the odd-numbered items and the even-numbered items. The scores are then treated just as though they came from two different forms, and the two sets of scores are correlated. The resulting coefficient gives the reliability of either the odd-numbered or the even-numbered items, i.e., of only half of the test. It must therefore be adjusted to provide a coefficient that represents the full-test reliability. This adjustment of the half-test correlation is accomplished by using the Spearman-Brown formula.

The half-test correlation coefficient:
r = [N ΣXY − (ΣX)(ΣY)] / {√[N ΣX² − (ΣX)²] × √[N ΣY² − (ΣY)²]}
r = [(30 × 367.52) − (1.5 × −0.1)] / {√[(30 × 405.12) − 1.5²] × √[(30 × 389.5) − 0.1²]}
r = 0.919036585 → rounded to 0.92
Spearman-Brown formula:
rsb = 2rxy / (1 + rxy)
rsb = (2 × 0.92) / (1 + 0.92)
rsb = 0.96

After adjusting the half-test coefficient with the Spearman-Brown formula, we get a full-test reliability of .96, which is almost perfect. The test items can therefore be considered consistent, or stable, across repeated administrations. Only a small percentage (4%) of the test score variance is attributable to error; the other 96% reliably represents the true ability of the assessed test takers.
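
The whole procedure can be sketched in Python. The split_half_reliability function below (an illustrative name) assumes each student's answers are available as a list of 40 item scores (1 = correct, 0 = wrong); the Spearman-Brown step matches the calculation above:

from statistics import correlation  # Python 3.10+

def spearman_brown(r_half):
    """Adjust a half-test correlation to estimate full-test reliability."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per student (40 items each)."""
    odd = [sum(s[0::2]) for s in item_scores]   # items 1, 3, 5, ..., 39
    even = [sum(s[1::2]) for s in item_scores]  # items 2, 4, 6, ..., 40
    return spearman_brown(correlation(odd, even))

# With the half-test coefficient of 0.92 computed above:
print(round(spearman_brown(0.92), 2))  # 0.96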

The item facility index is a statistical technique used in test item analysis to examine the percentage of students who answered a particular item correctly. The objective of analyzing an item's Item Facility (IF) is to evaluate to what extent the item is easy or difficult for students. The IF value is calculated with the following equation:

IF = Ncorrect / Ntotal

Ncorrect = number of students answering correctly
Ntotal = total number of students taking the test

The result of this formula is an item facility value that can range from 0.00 to 1.00. Thus, if 45 out of 50 students answered a particular item correctly, the proportion would be 45/50 = .90. An IF of .90 means that 90% of the students answered the item correctly and, by extension, that the item is very easy.
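
In Python, the IF calculation is a one-liner; the example below reuses the 45-out-of-50 case from the text:

def item_facility(responses):
    """IF = Ncorrect / Ntotal; responses is a list of 0/1 scores for one item."""
    return sum(responses) / len(responses)

# The worked example above: 45 of 50 students answered correctly
item = [1] * 45 + [0] * 5
print(item_facility(item))  # 0.9 -> a very easy item
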
Item Discrimination (ID) is a statistic that indicates the degree to which an item separates the students who performed well on the test as a whole from those who performed poorly. These groups are sometimes referred to as the "high" and "low" scorers, or the "upper" and "lower" proficiency students. Item discrimination is calculated by first identifying the upper and lower students on the test (using their total scores to sort them from highest to lowest). The upper and lower groups should be made up of equal numbers of students, each representing approximately one third of the total group. The formula is the following:

ID = IF upper – IF lower
ID = item discrimination for an individual item
IF upper = item facility for the upper group on the whole test
IF lower = item facility for the lower group on the whole test
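
A Python sketch of this procedure, using a hypothetical six-student, two-item data set (the function and variable names are illustrative):

def item_discrimination(students, item_index, group_fraction=1/3):
    """ID = IF upper - IF lower for one item.

    students: (total_score, item_answers) pairs, where item_answers is a
    list of 0/1 scores. The upper and lower groups are roughly the top
    and bottom thirds by total score, as described above.
    """
    ranked = sorted(students, key=lambda s: s[0], reverse=True)
    k = max(1, round(len(ranked) * group_fraction))
    upper, lower = ranked[:k], ranked[-k:]
    if_upper = sum(ans[item_index] for _, ans in upper) / k
    if_lower = sum(ans[item_index] for _, ans in lower) / k
    return if_upper - if_lower

# Hypothetical data: six students, two items
students = [(9, [1, 1]), (8, [1, 0]), (6, [1, 1]),
            (5, [0, 1]), (3, [0, 1]), (2, [0, 0])]
print(item_discrimination(students, 0))  # 1.0: the item discriminates perfectly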

Ideal items in an NRT should have an average IF of .50. Such items would thus be well centered, i.e., 50 percent of the students would have answered correctly and, by extension, 50 percent would have answered incorrectly. In reality, however, items rarely have an IF of exactly .50, so those that fall in a range between .30 and .70 are usually considered acceptable for NRT purposes.
Once those items that fall within the .30 to .70 range of IFs are identified, the items among them that have the highest IDs should be further selected for inclusion in the revised test. This process would help the test designer to keep only those items that are well centered and discriminate well between the high and the low scoring students.
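
This selection process amounts to a filter-then-sort, sketched below in Python. The IF/ID pairs are hypothetical except for items (4), (5), and (8), whose values appear in the analysis above:

# Keep items whose IF falls between .30 and .70, then rank survivors by ID.
items = {3: (0.55, 0.60), 4: (0.50, 1.00), 5: (1.00, 0.00),
         7: (0.45, -0.80), 8: (0.63, 0.40)}  # item: (IF, ID)

well_centered = {k: v for k, v in items.items() if 0.30 <= v[0] <= 0.70}
best_first = sorted(well_centered.items(), key=lambda kv: kv[1][1], reverse=True)
for item, (if_val, id_val) in best_first:
    print(f"Item ({item}): IF = {if_val:.2f}, ID = {id_val:+.2f}")

Note that in this sketch item (7) survives the IF filter but sinks to the bottom of the ID ranking, which is why both steps are needed.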



