Session Information
99 ERC SES 07 F, Innovations and Insights in Educational Measurement and Evaluation
Paper Session
Contribution
Objective and Research Questions
The main purpose of this study is to analyze the quality of the Chemistry Achievement Test items of the EUEE administered in 2020 using classical test theory (CTT), factor analysis (FA), generalizability theory (GT), and item response theory (IRT) models, and to identify the comparative advantage of each model for item and test analysis. Based on this objective, the study addresses the following questions.
1. What are the item characteristics of the Chemistry Achievement Test in the EUEE (difficulty, discrimination index, point-biserial correlation, etc.)?
2. Are there significant mean differences among the test booklets of the Chemistry Achievement Test in the EUEE?
3. Is the Chemistry Achievement Test in the EUEE unidimensional or multidimensional?
4. What do the item and test characteristics of the Chemistry Achievement Test look like under the IRT model?
Conceptual and Theoretical Frameworks
The conceptual framework for this study on "Item and Test Analysis of University Entrance Examination" is grounded in classical test theory (CTT) and item response theory (IRT). These theories provide a structured approach to evaluating the reliability, validity, and fairness of examination items. The study aims to assess the quality of test items, measure the test's overall effectiveness, and provide insights into improving assessment practices in chemistry entrance examinations.
Theoretical Framework
1. Classical Test Theory (CTT): examines item difficulty, discrimination, and reliability, and determines internal consistency using statistical measures such as Cronbach’s alpha. It focuses on observed test scores and their relation to true scores and measurement error.
2. Item Response Theory (IRT): analyzes the characteristics of individual items. It models the probability of a correct response as a function of student ability and evaluates item difficulty and discrimination using logistic models (a minimal sketch of one such model follows).
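For illustration only, the sketch below (Python, with hypothetical parameter values) implements the two-parameter logistic (2PL) item characteristic curve, one common logistic model in IRT that relates examinee ability to the probability of answering an item correctly; the specific IRT model fitted in this study may differ.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Probability of a correct response under the two-parameter
    logistic (2PL) IRT model, where `a` is item discrimination,
    `b` is item difficulty, and `theta` is examinee ability."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Example: an item of moderate difficulty (b = 0) and good
# discrimination (a = 1.5) for examinees of varying ability.
abilities = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(icc_2pl(abilities, a=1.5, b=0.0))
# -> roughly [0.047, 0.182, 0.5, 0.818, 0.953]
```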
The study considers the following key variables (the item-level indices are illustrated in a short sketch after this list):
Item Characteristics:
- Item difficulty (proportion of students answering correctly)
- Item discrimination (the ability of an item to differentiate between high- and low-performing students)
- Item reliability (consistency of an item in measuring what it intends to measure)
Test Characteristics:
- Test reliability (consistency of the entire exam over repeated administrations)
- Test validity (the extent to which the test measures chemistry knowledge accurately)
- Test fairness (absence of bias in item formulation and scoring)
Student Performance:
- Individual test scores
- Group performance distribution
- Correlation between student ability and test performance
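As a concrete illustration of the item-level indices listed above, the following sketch (Python, with a hypothetical response matrix) computes the difficulty index and a point-biserial discrimination index from dichotomously scored responses; the rest-score correction is a common design choice, not necessarily the one used in the study.

```python
import numpy as np

# Hypothetical dichotomous responses: rows = students, columns = items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

total = responses.sum(axis=1)

# Difficulty index P: proportion of students answering each item correctly.
difficulty = responses.mean(axis=0)

# Discrimination: point-biserial correlation between each item and the
# total score (here corrected by removing the item from the total).
def point_biserial(item, total):
    rest = total - item
    return np.corrcoef(item, rest)[0, 1]

discrimination = np.array([point_biserial(responses[:, j], total)
                           for j in range(responses.shape[1])])

print("Difficulty:", difficulty)
print("Discrimination:", np.round(discrimination, 2))
```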
Method
Study Design
This study employed a quantitative, descriptive design to conduct a post-test item analysis of national examination test scores. The purpose was to evaluate the psychometric characteristics of the individual items of the Chemistry Achievement Test in the EUEE (2020), and of the test as a whole, including measures of difficulty, discrimination, reliability, and overall test performance.
Data Collection
Data for the analysis consisted of the responses of 45,081 students who sat the national examination. The dataset included the responses to all test items and the corresponding score for each student. The examination consisted of 80 multiple-choice items.
Procedures
1. Data preparation: The raw test data were cleaned and organized, with incomplete or invalid responses excluded from the analysis. Each item response was coded dichotomously as 1 (correct) or 0 (incorrect).
2. Item analysis: The difficulty index (P) was calculated as the proportion of students who answered each item correctly; items with a difficulty index between 0.30 and 0.80 were considered acceptable. The discrimination index (D) was measured using the point-biserial correlation between item scores and total test scores; a higher discrimination index indicates that an item effectively differentiates between high- and low-performing students. A distractor analysis examined whether the incorrect options of each multiple-choice item functioned as intended in attracting lower-performing students.
3. Reliability analysis: The internal consistency of the test was assessed using Cronbach's alpha. A coefficient of 0.70 or higher was considered acceptable, indicating that the test items measure the same underlying construct.
4. Test-level analysis: Descriptive statistics (mean and standard deviation) were calculated to evaluate overall test performance, and the total-score distribution was examined for normality.
Data Analysis Tools
The statistical analyses were conducted using statistical software such as SPSS, STATA, IATA, and jMetrik to ensure accurate computations (a sketch of the core computations is given below).
Ethical Considerations
The study adhered to ethical guidelines, ensuring the confidentiality and anonymity of the students’ data. Institutional approval was obtained, and all data were analyzed in aggregate form to prevent the identification of individual students.
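To make the procedure concrete, here is a minimal Python sketch of the reliability and item-flagging steps; the study itself used packages such as SPSS, STATA, IATA, or jMetrik, so this is only an illustrative re-implementation. The 0.30–0.80 difficulty range and the 0.70 alpha benchmark mirror the stated procedure, while the 0.20 discrimination cut-off and the toy data are assumptions.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for a (students x items) score matrix;
    values of 0.70 or higher are taken as acceptable internal consistency."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def flag_items(difficulty, discrimination, p_range=(0.30, 0.80), min_disc=0.20):
    """Return indices of items outside the acceptable difficulty range
    or below an (assumed) minimum point-biserial discrimination."""
    return [j for j, (p, d) in enumerate(zip(difficulty, discrimination))
            if not (p_range[0] <= p <= p_range[1]) or d < min_disc]

# Hypothetical 5-student x 4-item dichotomous matrix for demonstration.
X = np.array([[1, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 1]])
print("alpha =", round(cronbach_alpha(X), 2))  # -> alpha = 0.55
```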
Expected Outcomes
Item analysis plays a significant role in determining the quality of assessment tools during test construction, validation, and scoring. According to the findings, the majority of items (67.5%) have moderate difficulty, and 72.5% are good discriminators. However, 32.5% of the items need moderate to major revision or should be eliminated from the test. When the four test forms are compared with respect to the arrangement of their items by difficulty level, the items in form 02 appear to be arranged from easy to difficult (figure 3), whereas the graphs for the other forms do not show a consistent pattern in item difficulty. Consistent with this, the mean performance of examinees on form 02 was higher (59.89%) than on the other forms, and the difference in performance among the four forms was significant, F(3, 170275) = 1461.4, p < .001. The effect size (.025) was small but indicates that 2.5% of the variance was accounted for by the independent variable (the arrangement of items in the test). Items identified as having major problems in one test form were not consistently problematic in the other forms; only one item (Q_45) was found to be a bad item across all test forms. This suggests that items can be good or bad depending on their position in the test form.
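As a check on the arithmetic, the reported effect size can be recovered directly from the reported F statistic and its degrees of freedom, assuming the .025 value is eta-squared (a minimal Python sketch):

```python
# One-way ANOVA reported in the text: F(3, 170275) = 1461.4.
# Eta-squared = SS_between / SS_total, which equals
# (F * df_between) / (F * df_between + df_within).
F, df_between, df_within = 1461.4, 3, 170275
eta_squared = (F * df_between) / (F * df_between + df_within)
print(round(eta_squared, 3))  # -> 0.025, i.e. about 2.5% of score variance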