Session Information
22 SES 07 B, Students at Risk: How to avoid drop-outs
Paper Session
Contribution
Student attrition at universities has a negative impact on all parties involved: the students, the institutions, and the general public. Notwithstanding the educational gains a student makes prior to dropping out, university attrition represents a misuse of resources and a lost investment for the university, broader society, and the student. Furthermore, dropping out may create feelings of inadequacy and lead to social stigmatization (Larsen et al., 2013). Accordingly, European education providers and policy makers are increasing their efforts to reduce the number of student dropouts (Gaebel et al., 2012). Identifying students at risk early helps to allocate resources to support those students more efficiently. This is confirmed by Seidman (1996), who suggests that the educational success of potential dropouts depends largely on early identification of at-risk students and a continuous and intensive program of intervention.
Arulampalam et al. (2005) and Danilowicz-Gösele et al. (2014) show for Great Britain and Germany, respectively, that the probability of dropping out can be predicted from the analysis of administrative student data. Both the academic performance of the student and the performance of the student's peer group are relevant for predicting student dropouts.
We develop an early detection system (EDS) that uses machine learning and regression techniques to predict student success in tertiary education. The EDS uses common administrative student data that universities are required to maintain and regularly update; therefore the EDS can be readily implemented at every university in Germany and, with minor changes, at most universities in Europe:
- Demographic data: name, gender, age, residence, nationality, place of birth
- Previous education: previous studies, university entrance qualification, GPA
- Academic performance: passed, failed, and missed exams; grades; credit points (CP)
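For illustration, a single student record combining these three data classes might look like the following sketch; the field names and types are our own assumptions, not a schema prescribed by the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StudentRecord:
    """One administrative student record; all field names are illustrative."""
    # Demographic data
    name: str
    gender: str
    age: int
    residence: str
    nationality: str
    place_of_birth: str
    # Previous education
    previous_studies: Optional[str]
    entrance_qualification: str  # type of university entrance qualification
    school_gpa: float
    # Academic performance, appended at the end of each semester
    exams_passed: int = 0
    exams_failed: int = 0
    exams_missed: int = 0
    grades: List[float] = field(default_factory=list)
    credit_points: int = 0  # accumulated CP
```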
Hence, the EDS is self-adjusting to the university, and because it is regularly updated with recent data at the end of each semester, it is also self-adjusting over time. Furthermore, a self-adjusting EDS that uses readily available administrative data can be implemented and run without the involvement of university staff, considerably easing compliance with data protection law. It also precludes the need for student surveys, which would otherwise have to be repeated for the whole student body and would depend on the students' voluntary participation.
The EDS can be used to monitor individual student groups, study programs, entire student cohorts, and, if desired, even individual students. Thus, the EDS provides a good starting point for descriptive research on dropouts and supports the decision-making processes of university administrations. For example, the EDS allows for studying the effects of changes in study programs and courses and the influence of entry barriers on enrollment (e.g., study fees), and it can monitor the efficiency of intervention measures and aid programs. The EDS can also help allocate support and intervention measures efficiently so that they reach at-risk students. As a general rule, universities already offer a large number of preventative measures to reduce the number of student dropouts. Unfortunately, these programs currently do not help in identifying at-risk students and are, thus, offered to the general student body. Accordingly, in order for at-risk students to benefit from them, they have to self-select into a program. Hence, individual support networks and assistance programs may go underutilized if students are unaware of their availability or are aware but simply do not participate. For at-risk students to benefit from individual support, they must first be aware that they are at risk of dropping out; second, they have to be made aware of the offer of support; and, finally, they must choose to participate.
Method
The EDS was developed and tested at two medium-sized universities in the federal state of North Rhine-Westphalia: a state university (SU) with about 23,000 students and 90 different bachelor programs, and a private university of applied sciences (PUAS) with about 6,500 students and 26 bachelor programs. The machine learning process and regression analysis were performed using data from former bachelor students (graduates and dropouts) between 2007 and 2017 - overall more than 40,000 students. Next, we verified the EDS at both universities with data from one matriculation cohort that was excluded from the training data. In the first step, a prediction model (parameters, weights, rules, and point estimates) is developed using the training data. The aim of the model is to identify potential dropouts as early as possible by classifying student observations in the test cohort as graduates or dropouts and then checking the precision of the prediction. Instead of relying on a single method, the EDS model is composed of multiple evaluation methods (classifiers), which are used alongside each other to evaluate their respective predictive powers. The methods used for the analysis are OLS and probit regression models, a neural network model, and different decision tree algorithms. To combine the predictive powers of these methods, we use a boosting algorithm, which evaluates the influence of the individual methods (weak classifiers) and merges the results into a single (strong) classifier. Here, the adaptive boosting (AdaBoost) algorithm developed by Freund and Schapire (1997; see also Schapire & Freund, 2012) is applied.

Additional data: The EDS uses already available data. However, this does not mean that the EDS cannot utilize additional student data imputed from the available data. Universities typically only know a student's citizenship, place of university entrance qualification, and place of birth. That means that second- or third-generation immigrants cannot be directly identified from the collected data. Since it is known, however, that most second- and third-generation immigrants underperform in the educational system, it is important to be able to distinguish them from their peers. Based on the methodology of Humpert and Schneiderheinze (2002), a name database containing around 200,000 forenames and 600,000 surnames (Michael, 2007; Michael, 2016) is used to determine the likelihood of a migration background. The validity of the imputation was checked on a group of 4,004 students with a known background. More than 94% of the name combinations were correctly assigned.
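To make the boosting step concrete, the following is a minimal sketch using scikit-learn. The paper combines several heterogeneous weak classifiers (OLS, probit, neural networks, decision trees); scikit-learn's AdaBoostClassifier boosts a single base learner (decision stumps by default), so the ensemble below is a simplification, and all data, variable names, and parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Hypothetical stand-in for the administrative data: a feature matrix X
# (demographics plus per-semester performance) and labels y
# (1 = dropout, 0 = graduate).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0.8).astype(int)

# Hold out one "cohort" for verification, mirroring the paper's design.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# AdaBoost (Freund & Schapire, 1997): iteratively reweight misclassified
# observations and merge the weak learners into one strong classifier.
eds = AdaBoostClassifier(n_estimators=200, random_state=0)
eds.fit(X_train, y_train)

y_pred = eds.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))     # dropouts found
print("precision:", precision_score(y_test, y_pred))  # flags that are correct
```

The name-based imputation could likewise be sketched as a lookup plus a combination rule. The real system uses the licensed database of roughly 200,000 forenames and 600,000 surnames cited above; the toy dictionaries and the averaging rule here are purely hypothetical.

```python
# Toy stand-in for the name database; values are assumed likelihoods
# that a name indicates a migration background.
FORENAME_LIKELIHOOD = {"emre": 0.90, "anna": 0.20, "luca": 0.50}
SURNAME_LIKELIHOOD = {"yilmaz": 0.95, "schmidt": 0.05, "rossi": 0.80}

def migration_likelihood(forename: str, surname: str) -> float:
    """Combine per-name likelihoods; unknown names default to 0.5."""
    p_fore = FORENAME_LIKELIHOOD.get(forename.lower(), 0.5)
    p_sur = SURNAME_LIKELIHOOD.get(surname.lower(), 0.5)
    # Simple average; the actual weighting scheme is not published here.
    return (p_fore + p_sur) / 2
```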
Expected Outcomes
The performance of a machine learning method can be described by its forecasting accuracy, specificity, recall, and precision (Ting, 2011; Powers, 2011). For our purposes, a correctly predicted graduate is a student who is correctly rejected as an at-risk student; consequently, a correctly predicted dropout is a student who is correctly identified as an at-risk student. Since the aim of the EDS is to identify students at risk, recall and precision are of particular relevance in the present study. Recall, also known as sensitivity or true positive rate, measures how many of the at-risk students are identified, while precision, also known as positive predictive value, measures how many of the identified students are in fact at risk. Performance data becomes newly available after the completion of each semester. Accordingly, forecast estimates before the end of the first semester are based solely on student demographic data. At both universities, the forecasting accuracy increases from semester to semester, as the probability of dropping out becomes lower with each additional semester and additional performance data becomes available. At both universities, forecasts based on the demographic variables alone have a prognostic accuracy of about 68% (recall: 66.5% SU and 49.78% PUAS). After the first semester, the accuracy improves to 78.9% (recall 74.2%) for the SU and 84.5% (recall 72.1%) for the PUAS, and after the fourth semester to 90% (recall 80.2%) for the SU and 95% (recall 83.22%) for the PUAS. Our results indicate that the forecast accuracy at the PUAS improves faster in the earlier semesters, whereas that of the SU increases at a steadier rate. The forecast accuracy at the SU was 90.99% (81.35% recall) and 91.85% (82.94% recall) for the fifth and sixth semesters, respectively.
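With dropouts coded as the positive class, these four measures follow directly from the confusion matrix; a brief sketch, with made-up labels for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = dropout (at risk), 0 = graduate.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)  # graduates correctly rejected as at-risk
recall = tp / (tp + fn)       # share of actual dropouts that are flagged
precision = tp / (tp + fp)    # share of flagged students who drop out
```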
References
Arulampalam, W., Naylor, R.A. & Smith, J.P. (2005) Effects of in-class variation and student rank on the probability of withdrawal: cross-section and time-series analysis for UK university students. Economics of Education Review, 24, p. 251-62.
Danilowicz-Gösele, K., Meya, J., Schwager, R. & Suntheim, K. (2014) Determinants of students' success at university. Discussion Papers.
Freund, Y. & Schapire, R.E. (1997) A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55, p. 119-39.
Gaebel, M., Hauschildt, K., Mühleck, K. & Smidt, H. (2012) Tracking Learners' and Graduates' Progression Paths. TRACKIT. EUA Publications.
Humpert, A. & Schneiderheinze, K. (2002) Stichprobenziehung für telefonische Zuwandererumfragen. Praktische Erfahrungen und Erweiterung der Auswahlgrundlage. Münster: Waxmann.
Larsen, M.L. et al. (2013) Dropout Phenomena at Universities: What is Dropout? Why does Dropout Occur? What Can be Done by the Universities to Prevent or Reduce it? A systematic review. Danish Clearinghouse for Educational Research.
Michael, J. (2007) Anredebestimmung anhand des Vornamens. c't, 17/2007, p. 182-83.
Michael, J. (2016) Name Quality Pro (to be published). (available from the author; mail to: namequality.pro@gmail.com).
Powers, D.M.W. (2011) Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies, 2(1), p. 37-63.
Schapire, R.E. & Freund, Y. (2012) Boosting: Foundations and Algorithms. Cambridge, MA: MIT Press.
Seidman, A. (1996) Retention Revisited: RET = E ID + (E + I + C) IV. College and University, 71 (Spring), p. 18-20.
Ting, K.M. (2011) Precision and Recall. In C. Sammut & G.I. Webb, eds. Encyclopedia of Machine Learning. Springer US. p. 781 & 901.