Week 8 Activity – Data preparation

Activity: Textual data pre-processing and informal analysis
Rule 1:
I created a list of positive words (unigrams and bigrams) from the given data and used them to identify positive and negative instances.
IF (effective OR intriguing OR breathtaking OR captivated OR (NOT not)_perfect OR loved OR real_chemistry OR really_good OR charm OR enthralled OR beautifully_done OR thoughtprovoking OR poignant OR fabulous OR sweet OR true_chemistry OR so_well OR enjoy OR excellent OR well_handled OR touching OR believable OR likeable OR very_successful OR enjoy OR interesting OR good OR entertaining OR great OR believable OR engaging) THEN pos ELSE neg
This rule does not classify all negative instances correctly, since some of them also contain positive words.
Rule 2:
This rule is based on a list of negative words from the given data.
IF (not_perfect OR dull OR onedimensional OR misused OR unnatural OR lack OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR cliché OR waste OR unintentional_laughs OR silliness OR immaturity OR passionless OR false_hope OR collapse OR annoying OR undercut OR not_so_well OR disaster OR not_original) THEN neg ELSE pos
This rule misclassifies some positive instances, since a few of the negative words also occur in positive instances.
Rule 3: 
To overcome the wrong predictions caused by instances that contain both positive and negative words, I counted the occurrences of each and let the larger count decide.
FOR ALL(effective OR intriguing OR breathtaking OR captivated OR (NOT not)_perfect OR loved OR real_chemistry OR really_good OR charm OR enthralled OR beautifully_done OR thoughtprovoking OR poignant OR fabulous OR sweet OR true_chemistry OR so_well OR enjoy OR excellent OR well_handled OR touching OR believable OR likeable OR very_successful OR enjoy OR interesting OR good OR entertaining OR great OR believable OR engaging) Add 1 to count_pos for each occurrence
FOR ALL(not_perfect OR dull OR onedimensional OR misused OR unnatural OR lack OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR cliché OR waste OR unintentional_laughs OR silliness OR immaturity OR passionless OR false_hope OR collapse OR annoying OR undercut OR not_so_well OR disaster OR not_original) Add 1 to count_neg for each occurrence
IF count_pos > count_neg THEN pos
ELSE neg
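
As a rough illustration of Rule 3 (my own sketch, not part of the activity), here is how the counting rule could look in Python; the cue sets are abbreviated stand-ins for the full hand-picked lists above:

# Minimal sketch of the count-based rule: count positive and negative cue
# words/bigrams in a review and predict whichever polarity dominates.
# The cue sets below are shortened; the full lists are given above.
POS_CUES = {"effective", "breathtaking", "loved", "really_good", "excellent"}
NEG_CUES = {"dull", "worst", "awful", "terrible", "really_bad", "waste"}

def classify(review: str) -> str:
    tokens = review.lower().split()
    # unigrams plus underscore-joined bigrams, matching the cue format
    grams = tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    count_pos = sum(g in POS_CUES for g in grams)
    count_neg = sum(g in NEG_CUES for g in grams)
    return "pos" if count_pos > count_neg else "neg"

print(classify("the acting was really good and the ending was excellent"))  # pos
print(classify("a dull script and a terrible waste of time"))               # neg
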
Rule 4: 
The lists of words above are hand-picked from our sample data, so the above rule over-fits to it. I removed words that can appear in different contexts and kept only words that are predictive wherever they occur.
FOR ALL(effective OR breathtaking OR loved OR (real OR true_chemistry) OR really_good OR enthralled OR beautifully_done OR thoughtprovoking OR fabulous OR excellent OR well_handled OR very_successful) Add 1 to count_pos for each occurrence
FOR ALL(dull OR unnatural OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR waste OR silliness OR annoying) Add 1 to count_neg for each occurrence
IF count_pos > count_neg THEN pos
ELSE neg
Even though the above rule seems to fit reasonably well, it may not be very predictive for instances that contain words other than the ones listed, or that use a listed word in an opposite context. Such cases can be captured to some extent by more complex rules involving the proximity of word occurrences. More features can be added and tested by cross-validation until we get a model with reasonable reliability. My takeaway is that it is not at all an easy task! 🙂

Week 6 Activity

In the activity for Week 6, we were asked to calculate different metrics for assessing models, which were discussed in Ryan Baker’s unit on Behavior Detection and Model Assessment. Two data sets, classifier-data-asgn2.csv and regressor-data-asgn2.csv, were given.
I used Excel for these calculations, and for the last metric (A’ or AUC) I downloaded a plugin called XLSTAT from http://www.xlstat.com/en/, since SPSS did not give the correct answer. Below I detail the steps I followed to complete this activity of 11 questions. I urge you to save your work at each step, since later questions build on the answers to earlier ones. To better understand the steps I’ve described, refer to the lecture videos 🙂

Q1) Using regressor-data-asgn2.csv, what is the Pearson correlation between data and predicted (model)? (Round to three significant digits; e.g. 0.24675 should be written as 0.247) (Hint: this is easy to compute in Excel)


Use the Excel function CORREL or PEARSON to calculate the Pearson correlation for the regressor model, using the two given columns as the input arrays. Round the number you get rather than truncating it, to get the correct answer.

Q2) Using regressor-data-asgn2.csv, what is the RMSE between data and predicted (model)? (Round to three significant digits; e.g. 0.24675 should be written as 0.247) (Hint: this is easy to compute in Excel)


Calculate the residuals (actual data minus predicted model values) and use that column as the array in the formula below:
=SQRT(SUMSQ(A2:A1001)/COUNTA(A2:A1001))

Q3) Using regressor-data-asgn2.csv, what is the MAD between data and predicted (model)? (Round to three significant digits; e.g. 0.24675 should be written as 0.247) (Hint: this is easy to compute in Excel)


Take the absolute values of the previous residual values, e.g. with the array formula =ABS(RMSE!A2:A1001), and average them.
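
For anyone more comfortable with Python than Excel, here is a minimal sketch of Q1–Q3 together; I am assuming the two columns in regressor-data-asgn2.csv are named “data” and “predicted” (adjust to the real header names in your file):

# Pearson correlation, RMSE and MAD between data and model predictions.
# Column names are assumptions; check the actual CSV headers.
import numpy as np
import pandas as pd

df = pd.read_csv("regressor-data-asgn2.csv")
actual, predicted = df["data"], df["predicted"]
residuals = actual - predicted

pearson_r = actual.corr(predicted)        # Q1
rmse = np.sqrt(np.mean(residuals ** 2))   # Q2
mad = np.mean(np.abs(residuals))          # Q3

print(round(pearson_r, 3), round(rmse, 3), round(mad, 3))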

Q4) Using classifier-data-asgn2.csv, what is the accuracy of the predicted (model)? Assume a threshold of 0.5. (Just give a rounded value rather than including the decimal; e.g. write 57.213% as 57) (Hint: this is easy to compute in Excel)


Add a column that converts the model values into labels using the given threshold of 0.5 (if the value is > 0.5, predict Y, otherwise N). Compare this column with the data column to find the number of agreements, then calculate accuracy = no. of agreements / total count.
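
The same step as a Python sketch, assuming the classifier file has a ground-truth column named “data” with Y/N values and a confidence column named “predicted” (again, check the real column names):

# Accuracy at a 0.5 threshold: convert model confidences to Y/N labels
# and compare them with the ground-truth column. Column names are assumptions.
import pandas as pd

df = pd.read_csv("classifier-data-asgn2.csv")
predicted_label = df["predicted"].gt(0.5).map({True: "Y", False: "N"})
accuracy = (predicted_label == df["data"]).mean()
print(round(accuracy * 100))   # Q4 asks for a rounded percentage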

Q5) Using classifier-data-asgn2.csv, how well would a detector perform, if it always picked the majority (most common) class? (Just give a rounded value rather than including the decimal; e.g. write 57.213% as 57) (Hint: this is easy to compute in Excel)


The majority-class detector’s accuracy is simply the frequency of the most common class in the data column: count the occurrences of the majority label and divide by the total count. You can reuse the counts from the previous step.

Q6) Is this detector’s performance better than chance, according to the accuracy and the frequency of the most common class?


Answer Yes/No based on the previous values you got.

Q7) What is this detector’s value for Cohen’s Kappa? Assume a threshold of 0.5. (Just round to the first two decimal places; e.g. write 0.74821 as 0.75).


I compared the data with the model predictions to form the confusion matrix of True Negatives (TN), True Positives (TP), False Positives (FP) and False Negatives (FN), listed the four counts in cells O5 to O8 as below (first digit = data, second digit = model), and then used the formula:
00 (TN) in O5
11 (TP) in O6
01 (FP) in O7
10 (FN) in O8
=((O5+O6)-((((O6+O7)*(O6+O8))/SUM(O5:O8))+(((O5+O7)*(O5+O8))/SUM(O5:O8))))/((SUM(O5:O8))-((((O6+O7)*(O6+O8))/SUM(O5:O8))+(((O5+O7)*(O5+O8))/SUM(O5:O8))))

Alternatively, you may apply the values from your confusion matrix to any online calculator for Cohen’s Kappa.
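
If the Excel formula looks opaque, here is the same computation as a small Python sketch, starting from the four counts (O5–O8 above correspond to TN, TP, FP and FN); the example counts at the end are made up purely for illustration:

# Cohen's Kappa from confusion-matrix counts, mirroring the Excel formula:
# kappa = (observed agreement - expected agreement) / (N - expected agreement),
# with agreement measured in counts rather than proportions.
def cohens_kappa(tn, tp, fp, fn):
    n = tn + tp + fp + fn
    observed = tn + tp
    expected = ((tp + fp) * (tp + fn)) / n + ((tn + fp) * (tn + fn)) / n
    return (observed - expected) / (n - expected)

print(round(cohens_kappa(tn=600, tp=250, fp=80, fn=70), 2))  # made-up counts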

Q8) What is this detector’s precision, assuming we are trying to predict “Y” and assuming a threshold of 0.5 (Just round to the first two decimal places; e.g. write 0.74821 as 0.75).


Use formula Precision = TP/ (TP+FP)

Q9) What is this detector’s recall, assuming we are trying to predict “Y” and assuming a threshold of 0.5 (Just round to the first two decimal places; e.g. write 0.74821 as 0.75).


Use formula Recall = TP/ (TP+FN)

Q10) Based on the precision and recall, should this detector be used for strong interventions that have a high cost if mis-applied, or fail-soft interventions with low benefit and a low cost if mis-applied?


Select the correct option by comparing the precision and recall you just computed: strong interventions with a high cost if mis-applied call for high precision, while fail-soft interventions can tolerate lower precision.

Q11) What is this detector’s value for A’? (Hint: There are some data points with the exact same detector confidence, so it is probably preferable to use a tool that computes A’, such as http://www.columbia.edu/~rsb2162/computeAPrime.zip — rather than a tool that computes the area under the ROC curve).


I used the ROC Curve feature of the XLSTAT plugin to compute the area under the curve (AUC) in Excel.


To compute A’ without ROC curve, you may follow our co-learner’s steps listed in his blog:

Hope this helps you to reach this screen! 🙂






Competency 7.4

Competency 7.4: Describe how models might be used in Learning Analytics research, specifically for the problem of assessing some reasons for attrition along the way in MOOCs.

One particular model described by Dr. Carolyn looks at how certain properties of discussion correlate with dropout in MOOCs. It explores how analyses of sentiment predict attrition over time (sentiment, however, was found to be the least consistent and weakest indicator of dropout). Refer to the article below:

Survival Modeling:

A survival model is a regression model that captures how the probability of survival changes over time. It models the probability at each time point, measured in terms of a hazard ratio that indicates how much more or less likely a student is to drop out. If the hazard ratio is greater than 1, the student is more likely to drop out at the next time point.
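
We did not have to implement this in the course, but to make the idea concrete, here is a minimal hypothetical sketch using the Python lifelines library; the file name and column names are made up:

# Sketch of a survival (Cox proportional hazards) model for MOOC dropout.
# Assumes a hypothetical CSV with one row per student and columns:
#   weeks_active     - last week the student participated (duration)
#   dropped_out      - 1 if the student dropped out, 0 if censored
#   indiv_negativity - e.g. proportion of the student's posts that are negative
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("mooc_participation.csv")   # hypothetical file

cph = CoxPHFitter()
cph.fit(df[["weeks_active", "dropped_out", "indiv_negativity"]],
        duration_col="weeks_active", event_col="dropped_out")
cph.print_summary()
# exp(coef) in the summary is the hazard ratio: a value > 1 means higher
# individual negativity is associated with a higher chance of dropping out
# at the next time point.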

Sentiment analysis in MOOC forums looked at expressed sentiment and exposure to sentiment. The four independent variables Individual Positivity, Individual Negativity, Thread Positivity and Thread Negativity were used to predict the dependent variable, dropout. The effects were relatively weak and inconsistent across courses.

Some factors that may contribute to student attrition, such as a student’s prior motivation, skills and knowledge in the area, and previous experience with MOOCs, are difficult to capture. We can combine different analysis methods, such as social network analysis, text mining, predictive modeling and survey data analysis, to try to get a more complete picture of an individual student and more consistent results.

Competency 7.3/ Assignment

Building a simple text classification experiment – Training and evaluating a simple predictive model


I used the LightSIDE tool, as explained by Dr. Carolyn, to run a simple classification experiment. The tool is easy and straightforward to use if we follow the steps.

In the Extract Features pane, I loaded the NewsgroupTopic dataset from the sample data directory in LightSIDE. I selected Unigram and Bigram features and clicked on Extract. I then saved the feature space for later use.



In the Build Models pane, I used the recently created feature table. I selected Naive Bayes as the learning plugin, set the number of folds to be 20 for cross-validation and clicked on Train. 


I got an accuracy of 58% and Kappa = 0.44 for the model, matching the values given in the assignment, which means my steps were correct 🙂
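
LightSIDE does all of this through its GUI, but roughly the same experiment (unigram + bigram counts, a Naive Bayes learner, k-fold cross-validation) can be sketched in Python with scikit-learn; the texts and topics below are made-up placeholders standing in for the NewsgroupTopic documents and labels:

# Rough scikit-learn analogue of the LightSIDE experiment.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["the team won the game", "the shuttle reached orbit",
         "a great goal in the final", "the rocket launch was delayed"] * 10
topics = ["sport", "space", "sport", "space"] * 10

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
scores = cross_val_score(model, texts, topics, cv=20)  # 20 folds, as in LightSIDE
print(scores.mean())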

Competency 7.1/ 7.2 Text Mining

Text Mining is the process of extracting and identifying useful and meaningful information from different sources of unstructured text data.

Prominent Areas of Text Mining

Information Retrieval:

Information Retrieval is the process of searching for and retrieving the required documents from a collection of documents, based on a given search query. Search engines such as Google and Yahoo make use of IR techniques to match and return documents relevant to the user’s query.

Document Classification/ Text Categorization:

Classification is the process of identifying the category a new observation belongs to, on the basis of a training set consisting of data with pre-defined categories (supervised learning). An example is the classification of email into spam/non-spam.

Clustering:

Clustering is the unsupervised counterpart of classification, where sets of similar objects are grouped into clusters. An example analysis would be the summarization of common complaints from open-ended survey responses.

Trend Analysis:

Trend Analysis is the process of discovering the trends of different topics over a given period of time. It is widely applied in summarizing news events and social network trends. An example would be the prediction of stock prices based on news articles.

Sentiment Analysis:

Sentiment analysis is the process of categorizing opinions as positive, negative or neutral. Sample applications include identifying the sentiment of movie reviews and gaining real-time awareness of users’ feedback.

Sub-area of Text Mining

Collaborative Learning Process Analysis

It is the process of analyzing the collaborative learning process of students using text mining techniques. Different indicators and language features are used for this study. Some of them are:
  • General indicators of interactivity
  • Turn length
  • Conversation Length
  • Number of student questions
  • Student to tutor word ratio
  • Student initiative
  • Features related to cognitive processes
  • Transactivity
Familiarity with the data in the domain is important for understanding and developing relevant features.

Competency 6.2: Key Diagnostic Metrics

Part of Week 6 was designed to teach the diagnostic metrics that tell us how well our models do, as either classifiers or regressors.

Metrics for Classifiers

Accuracy:

The easiest measure of model goodness is accuracy. It is also called agreement, when measuring the inter-rater reliability.

Accuracy = # of agreements/ Total # of assessments

It is generally not considered a good metric across fields, since the assignment to categories is often uneven, which makes accuracy misleading. E.g. a Kindergarten Failure Detector Model can reach 92% accuracy in the extreme case where it always says Pass, simply because most students pass.

Kappa:

Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)

If the Kappa value is
= 0, agreement is at chance
= 1, agreement is perfect
= negative infinity, agreement is perfectly inverse
> 1, something is wrong
< 0, agreement is worse than chance
between 0 and 1, there is no absolute standard. For data-mined models, 0.3–0.5 is considered good enough for publishing.
Kappa is scaled by the proportion of each category, so it is influenced by the data set. We can compare Kappa values within the same data set, but not between two data sets.
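
To make “expected agreement” concrete, here is a small worked example with made-up numbers. Suppose the data says Pass 80% of the time, the detector says Pass 90% of the time, and they agree on 78% of cases:

Expected agreement = 0.8 × 0.9 + 0.2 × 0.1 = 0.74
Kappa = (0.78 − 0.74) / (1 − 0.74) ≈ 0.15

So despite 78% raw agreement, this detector is only slightly better than chance.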

ROC:

The Receiver Operating Characteristic (ROC) curve is used when a model predicts something with two values (e.g. correct/incorrect, dropout/not dropout) and outputs a probability or other real value (e.g. the student will drop out with 73% probability).

Any number can be taken as the cut-off (threshold); some predictions (possibly none) are then classified as 1s and the rest as 0s. For any classification threshold there are four possibilities:
True Positive (TP) – Model and the Data say 1
False Positive (FP) – Data says 0, Model says 1
True Negative (TN) – Model and the Data say 0
False Negative (FN) – Data says 1, Model says 0

The ROC curve has the percentage of false positives (vs. true negatives) on its X axis and the percentage of true positives (vs. false negatives) on its Y axis. The model is good if its curve is above the diagonal chance line.

A’:

A’ is the probability that, if the model is given one example from each category, it will accurately identify which is which. It is a close relative of the area under the ROC curve and is mathematically equivalent to the Wilcoxon statistic. It gives useful results, since we can compute statistical tests for:
– whether two A’ values are significantly different, in the same or different data sets.
– whether an A’ value is significantly different from chance.
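
The intuition can be sketched directly as a pairwise comparison (my own illustration, not the course’s tool): for every (positive, negative) pair of examples, check whether the model gives the positive one the higher confidence, counting ties as half.

# A' as a pairwise probability, assuming `labels` holds 0/1 ground truth and
# `conf` holds the model's confidences. Ties count as half, matching the
# Wilcoxon formulation.
def a_prime(labels, conf):
    pos = [c for l, c in zip(labels, conf) if l == 1]
    neg = [c for l, c in zip(labels, conf) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(a_prime([1, 0, 1, 0, 1], [0.9, 0.4, 0.7, 0.7, 0.2]))  # 0.583...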

A’ Vs Kappa:

A’ is more difficult to compute and works only for two categories. Its meaning is invariant across data sets, i.e. A’ = 0.6 is always better than A’ = 0.5. It is easy to interpret statistically, its values are almost always higher than Kappa values, and it takes confidence into account.

Precision and Recall:

Precision is the probability that a data point classified as true is actually true.
Precision = TP / (TP+FP)
Recall is the probability that a data point that is actually true is classified as true.
Recall = TP / (TP+FN)
They don’t take confidence into account.


Metrics for Regressors

Linear Correlation (Pearson correlation):

r(A,B) asks: when A’s value changes, does B change in the same direction?
It assumes a linear relationship.
If the correlation value is
1.0 : perfect
0.0 : none
-1.0 : perfectly negatively correlated
in between 0 and 1 : it depends on the field
In education, 0.3 is considered good enough, since a lot of factors contribute to any given dependent measure.
Very different functions (or data sets with outliers) may have the same correlation.

R square:

R square is the correlation squared. It measures what percentage of the variance in the dependent measure is explained by the model. When predicting A with B, C, D and E, it is often used as the measure of model goodness rather than r.

MAE/MAD:

Mean Absolute Error/Deviation is the average of the absolute value of (actual value minus predicted value), i.e. the average of each data point’s absolute difference between actual and predicted values. It tells the average amount by which the predictions deviate from the actual values and is very interpretable.

RMSE:

Root Mean Square Error (RMSE) is the square root of the average of (actual value minus predicted value)^2. It can be interpreted similarly to MAD, but it penalizes large deviations more than small ones. It is generally preferred over MAD. A low RMSE is good.

RMSE/MAD    Correlation    Model
Low         High           Good
High        Low            Bad
High        High           Goes in the right direction, but systematically biased
Low         Low            Values are in the right range, but doesn’t capture relative change


Information Criteria:

BiC:

Bayesian Information Criterion (BiC) trades off goodness of fit against flexibility of fit (number of parameters). The formula for linear regression is:
BiC’ = n log(1 - r^2) + p log n
where n is the number of students (data points) and p is the number of variables.
If the value is > 0, the model is worse than expected given the number of variables;
if the value is < 0, it is better than expected given the number of variables.
It can be used to assess the significance of the difference between models (e.g. a difference of 6 implies a statistically significant difference).
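
A quick worked example of this formula with made-up numbers (natural log assumed):

# BiC' = n*log(1 - r^2) + p*log(n), using hypothetical values.
import math

n = 1000   # number of students
p = 5      # number of variables
r = 0.45   # correlation between model and data

bic_prime = n * math.log(1 - r**2) + p * math.log(n)
print(round(bic_prime, 1))   # negative => better than expected for 5 variables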

AiC:

An Information Criterion/Akaike’s Information Criterion (AiC) is an alternative to BiC. It makes a slightly different trade-off between goodness and flexibility of fit.

Note: There is no single measure to choose between classifiers. We have to understand multiple dimensions and use multiple metrics.

Types of Validity

Generalizability:

Does your model remain predictive when used in a new data set?
Generalizability underlies the cross-validation paradigm that is common in data mining. Knowing the context in which the model will be used drives the kind of generalization to be studied.
Failure example: a model of boredom built on data from 3 students fails when applied to new students.

Ecological Validity:

Do your findings apply to real-life situations outside of research settings?
E.g. whether a behavior detector built in lab settings works in real classrooms.

Construct Validity:

Does your model actually measure what it was intended to measure?
Does your model fit the training data? (provided the training data is correct)

Predictive Validity:

Does your model predict not just the present, but the future as well?

Substantive Validity:

Do your results matter?

Content Validity:

From testing; Does your test cover the full domain it is meant to cover?
For behavior modeling, does the model cover the full range of behavior it is intended to?

Conclusion Validity:

Are your conclusions justified based on evidence?

I think the lessons in Weeks 5 and 6 are very useful, especially when we want to dig deep into predictive modeling and diagnose its usefulness. I hope to use them in my predictive modeling work 🙂




Competency 6.1: Engineer both feature and training labels

My notes/ learning

Behavior Detectors

Behavior detectors are automated (predictive) models that can infer from log files whether a student is behaving in a certain way. 
Behaviors:
Disengaged behaviors 
-gaming the system by trying to succeed without learning
-off-task behavior
-carelessness – giving a wrong answer even when having the required skills
-WTF behavior – Without Thinking Fastidiously (by doing unrelated tasks while using the system)
Metacognitive behaviors
-help-avoidance
-unscaffolded self-exploration
-exploration behaviors
Related Problem:
-sensor-free affect detection (without the use of video-capture, gesture capture etc.)
– detecting boredom, frustration, engaged concentration, delight

Ground Truth

Ground truth is the set of correct labels against which a classifier’s predictions are checked in supervised learning/machine learning.
Where to get the training labels from is the big issue in developing behavior detectors.
E.g. How to identify when a student is off-task/ gaming the system?
Behavior labels are noisy; there is no perfect way to get indicators of student behavior.
Sources of ground truth:

  • Self-Report – common for affect and self-efficacy; not common for labeling behavior (students may not admit gaming)
  • Field observations
  • Text replays
  • Video coding

Field observations:
One or more observers watch students and take notes
– requires training to do it right
Text Replays:
Analyzing student interaction behavior from log files based on their input in the system.
– Fast to conduct
– Decent inter-rater reliability
– Agrees with other measures of constructs
– Can be used to train behavior detectors
– Only limited constructs can be coded
– Lower precision than field observation due to lower bandwidth
Video Coding:
Videos of live behavior in the classroom, or screen-replay videos, are analyzed
– slowest, but replicable and precise
– challenges in camera positioning

A Kappa of 0.6 or higher is expected for expert coding.
However, 1000 data points with kappa = 0.5 are better than 100 data points with kappa = 0.7.

Once we have ground truth, we can build the detector.


Feature Engineering

Feature engineering is the art of creating predictor variables. The model will not be good if our features (predictors) are not good. It involves lore rather than well-known and validated principles.

The big idea is how we can take the voluminous, ill-formed and yet under-specified data that we now have in education and shape it into a reasonable set of variables in an efficient and predictive way.

Process:

  1. Brainstorming features – IDEO tips for brainstorming
  2. Deciding what features to create – trade-off between effort and usefulness of feature
  3. Creating the features – Excel, OpenRefine, Distillation code
  4. Studying the impact of features on model goodness
  5. Iterating on features if useful – try close variants and test
  6. Go to 3 (or 1)


Feature engineering can over-fit –> Iterate and use cross-validation, test on held-out data or newly collected data.

Thinking about our variables is likely to yield better results than using pre-existing variables from a standard set.


Knowledge Engineering and Data Mining:

Knowledge engineering is where the model is created by a smart human being, rather than by an exhaustive computer search through all possibilities. It is also called rational modeling or cognitive modeling.

At its best:
Knowledge engineering is the art of a human being becoming deeply familiar with the target construct by carefully studying the data, including possible process data, understanding the relevant theory and thoughtfully crafting an excellent model.
-achieves higher construct validity than data mining, with comparable performance
-may even transfer better to new data (while a data-mined model may get trapped finding features specific to the population)

E.g. Aleven et al.’s (2004, 2006) help-seeking model

It was developed based on scientific articles, experience in designing learning environments, log files of student interaction and experience watching students using educational software in classes.

At its worst:
Knowledge engineering can mean making up a simple model very quickly, calling the resulting construct by a well-known name, and not testing it on data or providing any evidence.
– poorer construct validity than data mining
– predicts desired constructs poorly 
– can slow scientific progress by false results
– can hurt student outcomes by wrong intervention

It is easier to identify whether a data mining model is bad, from its features, validation procedure or goodness metrics; it is harder for knowledge engineering, since the hard work happening in the researcher’s brain is invisible.

To Do’s for both methods:
– Test the models
– Use direct measures (Training labels) or Indirect measures (E.g. predicting student learning). 
– Careful study of construct leads to better features and better models


Assignment – Critical Reflection:

Possible uses in education:
Behavior detection can be used to create automated learning management systems that give hints to users or comment on their performance based on the detected behavior. It can be used where a tutor is not available to help every student: the automated online tutor can jump in with suggestions, and if the student’s behavior is still detected as disengaged, an available tutor can be assigned to that student.

Competency 5.1/ Week 5 Activity

Competency 5.1: Learn to conduct prediction modeling effectively and appropriately.
I think this competency can be achieved if we are able to complete the given activity in RapidMiner. It is quite difficult for a newbie, but it’s well described in the course and definitely doable 🙂

We were asked to build a set of detectors predicting the variable ONTASK for the given data file using RapidMiner 5.3. I had previously installed RapidMiner 6.1, so I used that instead. You progress to the next question only after answering the current one correctly, and there were 13 questions, most of which require you to enter the Kappa of the model you built. There were a few difficulties along the way (for which I will try to give some useful tips), and it was a huge relief when I finally got this screen 😉




1) Build a decision tree (using operator W-J48 from the Weka Extension Pack) on the entire data set. What is the non-cross-validated kappa?

You can follow the RapidMiner walkthrough video to answer this question. It is almost the same set of steps, but in the last stage of the data import from Excel, you need to change the variable types if they are not guessed correctly. My RapidMiner 6.1 guessed the data types correctly, but if you are using 5.3, you should probably change the types of the polynominal variables that were incorrectly guessed as integer. (You can open the input Excel file to check what kind of values are there.)

You should set the attribute name to “ONTASK” and its target role to label (it is the value to be predicted in this exercise), and add the operators W-J48, Apply Model and Performance (Binominal Classification). The rest should be fine.

2) The kappa value you just obtained is artificially high – the model is over-fitting to which student it is. What is the non-cross-validated kappa, if you build the model (using the same operator), excluding student?

There are two ways to exclude a field: delete it from the data, or use the Select Attributes operator. The latter is better for obvious reasons. For this question, you need to add the Select Attributes operator, set attribute filter type = single and attribute = StudentID, and check invert selection (since we are asked to exclude the student).

3) Some other features in the data set may make your model overly specific to the current data set. Which data features would not apply outside of the population sampled in the current data set? Select all that apply.

For this question, you need to select the options which will not generalise outside your population. The system will assist you if you are wrong.

4) What is the non-cross-validated kappa, if you build the W-J48 decision tree model (using the same operator), excluding student and the variables from Question 3?

For this, we need to exclude all the variables from Q3 that do not apply outside the sampled population, in addition to the StudentID we already excluded. Change the attribute filter type to subset, select the attributes to be excluded in the next window, and check invert selection.

5) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use Naive Bayes?

Replace the W-J48 (Weka decision tree) operator with the Naive Bayes operator.

6) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use W-JRip?

Replace Naïve Bayes operator by W-Jrip operator.

7) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use Logistic Regression? (Hint: You will need to transform some variables to make this work; RapidMiner will tell you what to do)

Add operators “Nominal to Numerical” and “Logistic Regression” because Logistic Regression cannot handle polynominal attributes/ label

This was the question I spent the most time on, but I still couldn’t get through it. I’m not sure whether the Kappa I got was wrong or whether the expected answer itself was wrong. Anyway, since I couldn’t afford to spend more than a day on this issue and was on the verge of quitting the activity, I had to get past this question using Ryan’s answer in the Quickhelper discussion forum. That’s unfortunate, but I hardly had a choice 🙁

8) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use Step Regression (called Linear Regression)?

Just replace Logistic Regression by Linear Regression operator.

9) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use k-NN instead of W-J48? (We will discuss the results of this test later).

Replace Linear Regression by K-NN operator.

10) What is the kappa, for the same set of variables you used for question 4, if you use W-J48, and conduct 10-fold stratified-sample cross-validation?

For cross-validating the model, you can refer to the RapidMiner walkthrough. You need to add the X-Validation operator, remove the W-J48, Apply Model and Performance operators from the main process, and add them inside the training and testing subprocesses of the X-Validation operator.

11) Why is the kappa lower for question 10 (cross-validation) than question 4 (no cross-validation)?
K-NN predicts a point using itself when cross-validation is turned off, and that’s bad.

You should be able to answer this question, otherwise the system will help.

12) What is the kappa, for the same set of variables you used for question 4, if you use k-NN, and conduct 10-fold stratified-sample cross-validation?

Replace W-J48 with k-NN inside the X-Validation training subprocess.

13) k-NN and W-J48 got almost the same Kappa when compared using cross-validation. But the kappa for k-NN was much higher (1.000) when cross-validation was not used. Why is that?

You should be able to answer this question as well, else the system will help.

I wanted to post a tutorial with pictures to help newcomers, but I didn’t have time for that since I haven’t started Week 6 yet, which is running now. Hope my tips help. All the best!


Competency 5.2/ Week 5 Reflection

A quick intro: I skipped Weeks 3 and 4 for the time being since I was very much behind schedule because of starting late. I jumped to Week 5 (prediction modeling) so that I could participate in the discussion and bazaar, but sadly I was still too far behind to participate. I’m aiming to complete Weeks 3 and 4 later when I’m back on track 🙂

My Notes/ Learning:

Prediction 
– Developing a model that can infer an aspect of data (predicted variable) from a combination of other data (predictor variables)
– Inferences about the future/ present 
Two categories of prediction model:

  1. Regressors
  2. Classifiers

Regression

– A model that predicts a number in data mining (E.g. How long student takes to answer)
– Label –> something we want to predict
Building a Regression model:
  1. Training label – a data set where we already know the answer, to train the model for prediction
  2. Test data – the data set for testing our model

– The basic idea of regression is to determine which features, in which combination can predict the label’s value.

Linear Regression
E.g. No. of papers = 2+ 2* # of grad students – 0.1*(# of grad students)^2
– only for linear functions 
– flexible, fast
– more accurate than complex models, once cross-validated.
– feasible to understand our model

Regression Trees
E.g. If x>3 y = 2A+ 3B
       else if x< -7 y = 2A- 3B
       else y = 2A+ 0.5B+ C
– Non-linear
– Different linear relationships between some variables, depending on other variables.


Classification

– what we want to predict is categorical
E.g. Correct/ Wrong (0/1), Category A/B/C/D/E/F
– Each label is associated with a set of “features” – to help in predicting the label

Classifier
– Determine which features, in what combination, can predict the label.
– Many classification algorithms available 
– Data mining packages that include them:
  • Rapidminer
  • SAS Enterprise miner
  • Weka
  • KEEL

– Some algorithms useful for educational data mining:
  • Step regression
  • logistic regression
  • J48/ C4.5 Decision Trees
  • JRip Decision rules
  • K* Instance-based classifiers

Step Regression:
– fits linear regression function, has arbitrary cut-off
– for binary classification (0/1)
– selects parameters, assigns weight to each parameter and computes numerical values
E.g.  Y = 0.5 a + 0.7 b – 0.2 c +0.4 d + 0.3
If cut-off = 0.5, all values below 0.5 treated as 0 and all values >= 0.5 treated as 1.
– lack of closed-form expression
– often better in EDM because it is conservative, given the danger of over-fitting

Logistic Regression:
– also for binary classification
-given a specific set of values of the predictor variables, fits a logistic function to the data to find the frequency/odds of a specific value of the dependent variable.
E.g. m = a0 + a1v1 + a2v2 + a3v3 + …
       p(m) = 1/ (1+e^-m)
– relatively conservative algorithm due to its simple functional form.
– useful for cases where changes in the predictor variables have predictable effects on the probability of the predicted class; it does not capture interaction effects
E.g. of an interaction effect: A = Bad, B = Bad, but A + B = Good
– Automated feature distillation available but it is not optimal.
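
A tiny sketch of the two formulas above, with made-up coefficients and predictor values:

# m is a linear combination of predictors; p(m) squashes it into a
# probability between 0 and 1. All numbers here are hypothetical.
import math

def logistic(m):
    return 1.0 / (1.0 + math.exp(-m))

a0, a1, a2 = -1.0, 0.8, 1.5   # hypothetical fitted coefficients
v1, v2 = 2.0, 0.5             # hypothetical predictor values

m = a0 + a1 * v1 + a2 * v2    # m = a0 + a1*v1 + a2*v2
print(round(logistic(m), 3))  # probability of the predicted class (about 0.79)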

Decision Trees:
– Deals with interaction effects


J 48/ C4.5:
 – J48 is the open source re-implementation in Weka/Rapidminer of C4.5
– both numerical and categorical predictor variables can be used
– tries to find optimal split in numerical variables
– relatively conservative
– good when the data has natural splits and multi-level interactions are common
-good for data when same construct can be arrived at in multiple ways.

Decision Rules:
– set of if-then rules to check in order
– many different algorithms available with difference in how rules are generated and selected. 
-the most popular sub-categories (JRip, PART) repeatedly create decision trees and distill the best rules from them
– relatively conservative – simpler than most decision trees.
– very interpretable models, unlike other approaches
– good when multi-level interactions are common

K*
– instance based classifier
-predicts a data point from neighboring data points (stronger weights when points are nearby)
– good when data is very divergent:
no easy patterns, but there are clusters
different processes can lead to the same result
it is intractable to find general rules
– sometimes works when nothing else works
Drawback – whole data set is needed –> useful for offline analysis

Bagged Stumps:
– related to decision trees
– a lot of trees, each with only a first split (one feature), which we later aggregate
– relatively conservative
– close variant is Random forest (building a lot of trees and aggregating across the trees)
Some less conservative (and more complex) algorithms:
Support Vector Machines SVM:
– conducts dimensionality reduction on the data space and then fits a hyperplane that splits the classes.
– creates sophisticated models, great for text mining
– not optimal for other educational data (logs, grades, interactions with software)
Genetic Algorithms:
– uses mutation, combination and natural selection to search the space of possible models.
– can produce inconsistent answers because of randomness
Neural networks:
– composes extremely complex relationships by combining “perceptrons” in different ways
– complicated

Over-fitting:
Fitting to noise as well as signal. 
An over-fit model will be less good for new data.
Every model is over-fit to some extent; we can try to reduce over-fitting but cannot completely eliminate it.
Assessment:
Check whether the model transfers to new contexts or is over-fit to a specific context.
Test model on unseen data
Training set > Test set

Cross-validation:
– split the data points into N equal-sized groups
– train on all groups except one and test on the held-out group
– repeat so that every group is used as the test group once
K-fold: pick a number K and split the data into that many groups
Leave-one-out: every data point is its own fold (avoids stratification issues)
Variants:
Flat cross-validation – each point has an equal chance of being placed in each fold
Student-level cross validation – minimum requirement (testing generalization to new students)
Other levels like school, lesson, demography, software package etc.
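
As a small aside, here is what student-level cross-validation (the minimum requirement above) could look like in Python with scikit-learn; the features, labels and student IDs are all made up for illustration:

# Student-level cross-validation: GroupKFold keeps each student's rows
# together, so every test fold contains only students the model never saw.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # made-up features
y = rng.integers(0, 2, size=200)         # made-up ONTASK-style labels
students = np.repeat(np.arange(20), 10)  # 20 students, 10 rows each

scores = cross_val_score(DecisionTreeClassifier(), X, y,
                         cv=GroupKFold(n_splits=10), groups=students,
                         scoring="accuracy")
print(scores.mean())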

Uses of Prediction Modeling

My ideas in using prediction modeling for education:

1. Predict future career path and train accordingly:
If we are able to predict the future career paths of students based on their interests in subjects, we can give more field-level training. That kind of education will be more meaningful for students to gain the skills required in industry. Students will also be more interested in learning what they like, rather than being forced to learn something they don’t prefer.

2. Provide help for weak students:
Not all students require the same amount of help to understand the subject. Some students may learn easier than others. If we predict the different possible points where students may find difficulties, we can provide help in the specific areas.

3. Identify competencies:
If we can identify the competencies of students and what they lack, we can provide more guidance in those areas. For example, if a student does his work perfectly and exhibits good leadership but doesn’t practice teamwork, we can guide him to develop the teamwork competency.

Competency 2.3

Competency 2.3: Evaluate the impact of policy and strategic planning on systems-level deployment of learning analytics.

#Assignment 68

Learning analytics could create a bigger impact on learning if implemented top-down rather than bottom-up, due to the availability of big data. However, the deployment of learning analytics faces many challenges at the institutional level:


1. Acceptance:
To work big on big data, big support is needed from top management. Top management should foresee the future possibilities of learning analytics and what it can achieve. Only with promising outcomes can they be expected to support it at a large scale. It is not a small change to bring about in a day.

2. Management:
A new department may be needed to manage what should be done in learning analytics. This will require funding, responsible experts, manpower and technical training. Do the institutions have what it takes to commit to this new venture?

3. Ethics:
Personal Data Protection is a growing concern these days. When data is analyzed, it has to pass through humans and systems. How safe can our data be? Could there be a possible breach in security and what could be its implication?

When we have answers for all these questions, we could probably move forward to the next era of data analytics!