Building A Retention Model
“…only the educated are free.” Epictetus
Among the causes Aristotle described behind any outcome are the Efficient (natural), the Formal (momentary, explainable), and the Final (root) cause. Every year countless students start and abandon a formal college education. Each case is unique and should be treated as such when working with an individual student. However, when attempting to design policies and retention tools across the broader student body, we need to find some pattern in the sea of disparate circumstances that cause students to leave and not return. This is one scenario in which data science can help, if used correctly.
To that end, I partnered with a university to try to build a predictive model for student retention as part of my ongoing Data Science journey this year. I am memorializing that experience in this post for my own benefit and, hopefully, that of others: a way-finder of sorts. I am sharing some key observations, considerations, and data issues that we came across while trying to build this model.
For many reasons, my attempt failed. If you’re not stuck in a legacy business mentality, you’ll know that failure is just another learning opportunity. I did in fact learn a lot about spotting patterns in a noisy data set. The big win is that these patterns, when reviewed by the institution’s subject matter experts, can help to create new approaches to stop attrition earlier.
One reason that predicting retention is so hard is information. The best way to know how a student is doing is to spend a lot of time with them. You just can’t do that in real life. So the data you collect represent small slivers of their lives and well-being: data points that require more inference than you might prefer. In this particular effort, the data on a student come from at least three separate places: admissions/registration data, census data, and FAFSA (Free Application for Federal Student Aid) data. Beyond these three, you can have graduation sets, high school data, and so much more.
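As a minimal sketch of what combining these sources involves, the snippet below joins three small in-memory tables on a shared student identifier. The field names (`student_id`, `major`, `enrolled_at_census`, `fafsa_filed`) and the `merge_sources` helper are hypothetical and are not the institution’s schema. The point is only that a student missing from one source should stay in the merged set with a visible gap rather than silently drop out.

```python
def merge_sources(admissions, census, fafsa):
    """Outer-join three {student_id: record} mappings into one row per student."""
    merged = {}
    named_sources = (("admissions", admissions), ("census", census), ("fafsa", fafsa))
    for source_name, source in named_sources:
        for student_id, record in source.items():
            # Create the row on first sight, then layer each source's fields onto it.
            row = merged.setdefault(student_id, {"student_id": student_id, "sources": []})
            row.update(record)
            row["sources"].append(source_name)
    return merged


# Hypothetical sample data: student s2 never filed a FAFSA.
admissions = {"s1": {"major": "Biology"}, "s2": {"major": "Undeclared"}}
census = {"s1": {"enrolled_at_census": True}, "s2": {"enrolled_at_census": True}}
fafsa = {"s1": {"fafsa_filed": True}}

students = merge_sources(admissions, census, fafsa)
for row in students.values():
    # .get() returns None where a source had nothing for this student.
    print(row["student_id"], row["sources"], row.get("fafsa_filed"))
```

Keeping a `sources` list on each row is one way to make missing data explicit, so that the absence of a FAFSA record can itself be reviewed as a possible signal.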
No effort to run models on this data can begin without a lengthy exploratory data analysis period that is conducted in lockstep with an institutional subject matter expert in the research department. It was easy to write code to merge similar columns like semester dates, majors, advisors, and the like. But with the young, nothing stays the same, not even from one semester to the next. Majors change, advisors are reassigned, gender identification changes, and then there are the study abroad months, or a student leaving a club or activity.
So a logical exercise became necessary: deciding which data to use for each feature. The most recent value, an aggregate, a substitute, or a breakout into new features for normalized values? Will our predictions change if we don’t account for multiple major changes, or is that not even relevant? You see where I am going with this complexity. Predicting is hard with this sort of moveable feast of student behavior.
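To make that choice concrete, here is a small sketch that collapses one student’s per-semester records into a single row, keeping the most recent value and an aggregate side by side. The record layout (`term`, `major`, `abroad`) and the `summarize_student` helper are assumptions for illustration, not the features used in the actual project.

```python
def summarize_student(semesters):
    """Collapse chronologically ordered per-semester records into one feature row."""
    majors = [s["major"] for s in semesters]
    # Count transitions between consecutive semesters where the major differs.
    major_changes = sum(1 for prev, cur in zip(majors, majors[1:]) if prev != cur)
    return {
        "latest_major": majors[-1],          # "most recent" strategy
        "major_changes": major_changes,      # "aggregated" strategy
        "ever_studied_abroad": any(s["abroad"] for s in semesters),
        "semesters_enrolled": len(semesters),
    }


# Hypothetical history for one student, oldest term first.
history = [
    {"term": "2019FA", "major": "Undeclared", "abroad": False},
    {"term": "2020SP", "major": "Biology", "abroad": False},
    {"term": "2020FA", "major": "Chemistry", "abroad": True},
]

features = summarize_student(history)
print(features)
```

Keeping both `latest_major` and `major_changes` lets you test later whether the number of changes matters at all, instead of deciding up front.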
In the Data Science field, some argue that the more features (attributes) the better: the typical More is More argument. Others focus on having the right features. Malcolm Gladwell partly takes on this debate in his book Blink. The key takeaway from many of the models I ran, after trying to normalize the data into one large set, was that you are left with more questions than answers.
What I found is that the best thing a model can do for retention or churn is to classify and not predict. The first-generation college student with no social activities, no declared major, and an off-campus address needs an entirely different support network than the third-generation student who has social connections and lives on campus. Essentially, the model should free up the school personnel from sifting through the mountain of data, leaving time for targeted 1-on-1 engagement that pays more attention to the deeply psychological issues that the data often can’t capture.
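One rough way to picture classifying rather than predicting is to group students by a coarse support profile. The sketch below uses plain rules as a stand-in; it is not the model from this project, and the fields (`first_generation`, `lives_on_campus`, `major`, `activities`) and profile labels are hypothetical.

```python
from collections import defaultdict


def support_profile(student):
    """Build a coarse profile key from a few attributes (hypothetical fields)."""
    return (
        "first-gen" if student["first_generation"] else "continuing-gen",
        "on-campus" if student["lives_on_campus"] else "off-campus",
        "declared" if student["major"] != "Undeclared" else "undeclared",
        "socially-active" if student["activities"] > 0 else "no-activities",
    )


def group_by_profile(students):
    """Map each profile key to the list of student ids that share it."""
    groups = defaultdict(list)
    for student in students:
        groups[support_profile(student)].append(student["student_id"])
    return dict(groups)


# Hypothetical sample records.
students = [
    {"student_id": "s1", "first_generation": True, "lives_on_campus": False,
     "major": "Undeclared", "activities": 0},
    {"student_id": "s2", "first_generation": False, "lives_on_campus": True,
     "major": "History", "activities": 3},
    {"student_id": "s3", "first_generation": True, "lives_on_campus": False,
     "major": "Undeclared", "activities": 0},
]

groups = group_by_profile(students)
for profile, ids in groups.items():
    print(profile, ids)
```

The output is a set of groups that staff can review and assign support to, not a dropout probability for any one student.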
Another idea that came out of this study was whether Robotic Process Automation (RPA) could help with data harvesting and cleansing, along with the follow-up action. The logos behind this integration is that much of the data is in legacy systems, some even written in COBOL or similarly dated languages. RPA could help with the process of extracting and rewriting that data through screen scraping, versus finding a developer who can even lasso that legacy dataset. RPA could also be used to automate intervention, taking cues from the model’s classification quarter to quarter to perform registration for the slower-acting students, or sending a message to the advisor when certain behaviors change.
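The intervention half of that idea can be sketched without any RPA tooling: compare each student’s classification across two quarters and produce a message when it changes. The segment labels (`engaged`, `at-risk`) and the `advisor_alerts` helper are hypothetical, and how the message is delivered (email, an RPA bot, a ticket) is left out on purpose.

```python
def advisor_alerts(previous, current):
    """Return messages for students whose segment changed between two quarters."""
    alerts = []
    for student_id, new_segment in sorted(current.items()):
        old_segment = previous.get(student_id)
        # Skip students with no earlier classification; there is nothing to compare.
        if old_segment is not None and old_segment != new_segment:
            alerts.append(f"{student_id}: moved from {old_segment} to {new_segment}")
    return alerts


# Hypothetical classifications for two consecutive quarters.
q1 = {"s1": "engaged", "s2": "engaged", "s3": "at-risk"}
q2 = {"s1": "engaged", "s2": "at-risk", "s3": "at-risk", "s4": "engaged"}

alerts = advisor_alerts(q1, q2)
for message in alerts:
    print(message)
```

Alerting only on changes keeps the advisor’s inbox small; a student who stays at risk quarter after quarter would need a separate rule.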
Ultimately, early detection is the most important part of public health policy. So any delivery that achieves that aim is an early win. Surfacing risks and trends to populations that might otherwise be ignored or missed. Combined with the right feedback loops, we can influence better outcomes for everyone.