A quick intro – I skipped weeks 3 & 4 for the time being since I was well behind schedule because I started late. I jumped to Week 5 – Prediction modeling so that I could participate in the discussion and bazaar, but sadly I was still too far behind to take part. I’m aiming to complete weeks 3 and 4 later when I’m back on track 🙂
My Notes/ Learning:
– Developing a model that can infer an aspect of data (predicted variable) from a combination of other data (predictor variables)
– Inferences about the future/ present
Two categories of prediction model: regression and classification.
Regression:
– A model that predicts a number (E.g. how long a student takes to answer)
– Label –> something we want to predict
Building a Regression model:
- Training labels – a data set where we already know the answer, used to train the model for prediction
- Test data – the data set for testing our model
– The basic idea of regression is to determine which features, in which combination, can predict the label’s value.
E.g. No. of papers = 2 + 2 * (# of grad students) – 0.1 * (# of grad students)^2
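Linear regression:
– only for linear functions (a squared term like the one above is treated as one more input feature)
– flexible, fast
– often more accurate than more complex models, once cross-validated
– feasible to understand our model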
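To make the training/test idea concrete, here is a minimal sketch of fitting a one-predictor linear regression by least squares and then checking it on held-out data. The data (hints requested vs. seconds taken) and the helper names `fit_line` and `mean_absolute_error` are made up for illustration; they are not from the course.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_absolute_error(intercept, slope, xs, ys):
    """Average size of the prediction error on a data set."""
    return sum(abs(intercept + slope * x - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up data: x = number of hints requested, y = seconds taken to answer.
train_x, train_y = [0, 1, 2, 3, 4], [10, 14, 19, 22, 25]  # labels are known
test_x, test_y = [5, 6], [30, 33]                          # held back for testing

intercept, slope = fit_line(train_x, train_y)
print("intercept:", intercept, "slope:", slope)
print("test error:", mean_absolute_error(intercept, slope, test_x, test_y))
```

The model is fitted only on the training pairs; the test pairs are used just once, to see how far off the predictions are on data the model has not seen.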
Regression trees:
E.g. If x > 3, y = 2A + 3B
else if x < -7, y = 2A – 3B
else y = 2A + 0.5B + C
– Different linear relationships between some variables, depending on other variables.
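The example above translates almost directly into code. This is only a sketch of the idea (a different linear formula depending on the value of x); the function name is my own and the coefficients are the illustrative ones from the example.

```python
def piecewise_prediction(x, a, b, c):
    """Pick a different linear formula for y depending on the value of x."""
    if x > 3:
        return 2 * a + 3 * b
    elif x < -7:
        return 2 * a - 3 * b
    else:
        return 2 * a + 0.5 * b + c


print(piecewise_prediction(4, 1, 1, 1))   # uses the x > 3 branch
print(piecewise_prediction(-8, 1, 1, 1))  # uses the x < -7 branch
print(piecewise_prediction(0, 1, 2, 3))   # uses the fallback branch
```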
Classification:
– what we want to predict (the label) is categorical
E.g. Correct/ Wrong (0/1), Category A/B/C/D/E/F
– Each label is associated with a set of “features” that help in predicting the label
– Determine which features, in what combination, can predict the label.
– Many classification algorithms available
– Data mining packages that include them:
- SAS Enterprise Miner
– Some algorithms useful for educational data mining:
- Step regression
- logistic regression
- J48/ C4.5 Decision Trees
- JRip Decision rules
- K* Instance-based classifiers
Step regression:
– fits a linear regression function and applies an arbitrary cut-off
– for binary classification (0/1)
– selects parameters, assigns a weight to each parameter and computes a numerical value
E.g. Y = 0.5a + 0.7b – 0.2c + 0.4d + 0.3
If cut-off = 0.5, all values below 0.5 are treated as 0 and all values >= 0.5 are treated as 1.
– lacks a closed-form expression
– often works better in EDM because it is conservative and less prone to over-fitting
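As a small sketch of the cut-off idea, the example formula can be wrapped in a function that returns 0 or 1. The weights are the ones from the example above; the function name and the sample inputs are my own.

```python
def step_regression_predict(a, b, c, d, cutoff=0.5):
    """Compute the linear score, then turn it into a 0/1 class with a cut-off."""
    y = 0.5 * a + 0.7 * b - 0.2 * c + 0.4 * d + 0.3
    return 1 if y >= cutoff else 0


print(step_regression_predict(0, 0, 0, 0))  # score 0.3, below the cut-off
print(step_regression_predict(1, 0, 0, 0))  # score 0.8, above the cut-off
```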
Logistic regression:
– also for binary classification
– given a specific set of values of the predictor variables, fits a logistic function to the data to find the frequency/ odds of a specific value of the dependent variable
E.g. m = a0 + a1v1 + a2v2 + a3v3…
p(m) = 1/ (1 + e^-m)
– relatively conservative algorithm due to its simple functional form
– useful for cases where changes in predictor variables have predictable effects on the probability of the predicted variable class (without interaction effects)
E.g. of an interaction effect it cannot capture: A = Bad, B = Bad, but A+B = Good
– Automated feature distillation is available but it is not optimal.
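The two formulas above (the weighted sum m and the logistic function p(m)) can be sketched like this. The weights and values passed in are made-up numbers, and `predict_probability` is just a name I picked; in practice the weights would be fitted from training data.

```python
import math


def logistic(m):
    """p(m) = 1 / (1 + e^-m): squashes any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-m))


def predict_probability(intercept, weights, values):
    """m = a0 + a1*v1 + a2*v2 + ..., then pass m through the logistic function."""
    m = intercept + sum(w * v for w, v in zip(weights, values))
    return logistic(m)


print(logistic(0))                                       # m = 0 gives a 50/50 prediction
print(predict_probability(0.3, [0.5, -0.2], [1.0, 2.0]))  # made-up weights and values
```

Because m is a plain weighted sum, each predictor pushes the probability in one fixed direction, which is why interaction effects are hard for this model.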
Decision Trees:
– deal with interaction effects
J48/ C4.5:
– J48 is the open source re-implementation of C4.5 in Weka/RapidMiner
– both numerical and categorical predictor variables can be used
– tries to find optimal split in numerical variables
– relatively conservative
– good when the data has natural splits and multi-level interactions
– good for data where the same construct can be arrived at in multiple ways
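To get a feel for what "finding a split in a numerical variable" means, here is a rough sketch that tries every midpoint between sorted values and keeps the one with the highest information gain. This is a simplification: C4.5 itself uses a gain-ratio criterion and handles many more cases. The data and function names are made up.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result


def best_numeric_split(values, labels):
    """Return (threshold, information gain) for the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Made-up data: seconds spent on a problem and whether the answer was correct.
seconds = [1, 2, 3, 10, 11, 12]
outcome = ["wrong", "wrong", "wrong", "correct", "correct", "correct"]
print(best_numeric_split(seconds, outcome))
```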
Decision rules:
– a set of if-then rules that are checked in order
– many different algorithms available, differing in how rules are generated and selected
– the most popular sub-category (JRip, PART) repeatedly creates decision trees and distills the best rules
– relatively conservative – simpler than most decision trees
– very interpretable models, unlike most other approaches
– good when multi-level interactions are common
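A set of ordered if-then rules is easy to picture as code: go down the list and stop at the first rule that matches. The rules, feature names and thresholds below are hypothetical, made up only to show the mechanism; a real rule learner such as JRip would derive them from data.

```python
# Hypothetical rules for illustration; feature names and thresholds are made up.
RULES = [
    (lambda action: action["hints_used"] >= 3, "wrong"),
    (lambda action: action["seconds_taken"] < 2, "wrong"),
    (lambda action: action["prior_correct"] >= 5, "correct"),
]
DEFAULT_LABEL = "correct"  # used when no rule matches


def classify(action):
    """Check the rules in order and return the label of the first one that fires."""
    for condition, label in RULES:
        if condition(action):
            return label
    return DEFAULT_LABEL


print(classify({"hints_used": 4, "seconds_taken": 10, "prior_correct": 9}))
print(classify({"hints_used": 0, "seconds_taken": 10, "prior_correct": 0}))
```

The order matters: in the first call the student has many prior correct answers, but the hints rule comes first and decides the label.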
K*:
– instance-based classifier
– predicts a data point from neighboring data points (stronger weights when points are nearby)
– good when data is very divergent:
no easy patterns, but there are clusters
different processes can lead to the same result
intractable to find general rules
– sometimes works when nothing else works
Drawback – the whole data set is needed to make a prediction –> more useful for offline analysis
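The "nearby points get stronger weights" idea can be sketched with a simple distance-weighted vote. Note that this is not K* itself (K* uses its own entropy-based distance measure); it is just a plain Euclidean-distance illustration with made-up points and a weighting formula I chose for simplicity.

```python
def distance(p, q):
    """Euclidean distance between two points given as tuples of numbers."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def weighted_neighbor_vote(training, query):
    """Every training point votes for its label; closer points get larger weights."""
    votes = {}
    for features, label in training:
        weight = 1.0 / (1.0 + distance(features, query))
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


# Made-up training data with two clusters.
training = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(weighted_neighbor_vote(training, (0.5, 0.5)))
print(weighted_neighbor_vote(training, (5.5, 5.0)))
```

The function loops over the entire training set for every prediction, which is the drawback noted above: the whole data set has to be kept around.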
Bagged stumps:
– related to decision trees
– builds a lot of trees that use only the first feature (stumps), then aggregates them
– relatively conservative
– a close variant is Random Forest (building a lot of trees and aggregating across the trees)
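Here is a rough sketch of "many one-split trees, then aggregate", assuming a single numeric feature and 0/1 labels. Each stump is trained on a random resample of the data and the final prediction is a majority vote. The data, function names and the choice of 25 stumps are my own assumptions, not from the lecture.

```python
import random


def fit_stump(points):
    """Find the single threshold (and direction) with the fewest training errors."""
    xs = sorted({x for x, _ in points})
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for threshold in thresholds:
        for label_above in (0, 1):
            errors = sum(
                1 for x, y in points
                if (label_above if x > threshold else 1 - label_above) != y
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, label_above)
    return best[1], best[2]


def bagged_stumps(points, n_stumps=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_stumps)]


def predict(stumps, x):
    """Majority vote across all stumps."""
    votes_for_1 = sum(
        label_above if x > threshold else 1 - label_above
        for threshold, label_above in stumps
    )
    return 1 if 2 * votes_for_1 > len(stumps) else 0


# Made-up data: (feature value, label).
data = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
stumps = bagged_stumps(data)
print(predict(stumps, 1), predict(stumps, 9))
```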
Some less-conservative algorithms:
Support Vector Machines (SVM):
– transforms the data space (changing its dimensionality) and then fits a hyperplane which splits the classes.
– creates sophisticated models, great for text mining
– not optimal for other educational data (logs, grades, interactions with software)
Genetic Algorithms:
– use mutation, combination and natural selection to search the space of possible models
– can produce inconsistent answers because of randomness
Neural Networks:
– compose extremely complex relationships by combining “perceptrons” in different fashions
Over-fitting:
Fitting to noise as well as signal.
An over-fit model will perform worse on new data.
Every model is over-fit to some degree – we can try to reduce over-fitting but cannot completely eliminate it.
Check whether the model transfers to new contexts or is over-fit to a specific context.
Test the model on unseen data.
Training set > Test set
Cross-validation:
– split the data points into N equal-sized groups
– train on all groups except one and test on the held-out group
– repeat so that every group is used as the test group once
K-fold: pick a number K and split the data into that many groups
Leave-one-out: every data point is a fold (avoids stratification issues)
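A minimal sketch of the splitting procedure, working on index positions rather than real data. The function name `k_fold_splits` is mine, and the points are dealt into folds in order here; in practice you would usually shuffle them first.

```python
def k_fold_splits(n_points, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_points))
    folds = [indices[i::k] for i in range(k)]  # deal the points into k groups
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


for train, test in k_fold_splits(6, 3):
    print("train:", train, "test:", test)
```

Calling `k_fold_splits(n, n)` puts every point in its own fold, which is the leave-one-out case.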
Flat cross-validation – each point has an equal chance of being placed in any fold
Student-level cross-validation – the minimum requirement (tests generalization to new students)
Other levels are possible too, such as school, lesson, demographic group or software package.
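Student-level cross-validation differs from the flat version only in how folds are assigned: all rows from one student go into the same fold, so the model is always tested on students it has never seen. A sketch with made-up student names (the function name is my own):

```python
def student_level_folds(student_ids, k):
    """Yield (train_rows, test_rows) so that no student appears on both sides."""
    unique_students = sorted(set(student_ids))
    fold_of_student = {s: i % k for i, s in enumerate(unique_students)}
    for held_out in range(k):
        train = [i for i, s in enumerate(student_ids) if fold_of_student[s] != held_out]
        test = [i for i, s in enumerate(student_ids) if fold_of_student[s] == held_out]
        yield train, test


# One entry per logged action; the value is the (made-up) student who produced it.
rows = ["amy", "amy", "bo", "bo", "bo", "cy"]
for train, test in student_level_folds(rows, 3):
    print("train rows:", train, "test rows:", test)
```

The same grouping trick would work for other levels by passing school or lesson identifiers instead of student names.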
Uses of Prediction Modeling
My ideas for using prediction modeling in education:
1. Predict future career path and train accordingly:
If we are able to predict the future career path of students based on their interests in subjects, we can give more field-level training. That kind of education will be more meaningful to students because it helps them gain the skills required in the industry. Students will also be more interested in learning what they like, rather than being forced to learn something they don’t prefer.
2. Provide help for weak students:
Not all students require the same amount of help to understand a subject. Some students may learn more easily than others. If we can predict the points where students are likely to find difficulties, we can provide help in those specific areas.
3. Identify competencies:
If we can identify the competencies of students and what they lack, we can provide more guidance in that area. For example, if a student does his work perfectly and exhibits good leadership but doesn’t practice teamwork, we can guide him to develop the teamwork competency.