Reference: Anthony, L., & Lashkia, G. V. (2003). Mover: A machine learning tool to assist in the reading and writing of technical papers. IEEE Transactions on Professional Communication, 46(3), 185-193.
Background:
- Identifying the structure of text helps in reading and writing research articles.
- The structure of research article introductions in terms of moves is explained in the CARS model (Ref: J. M. Swales, “Aspects of Article Introductions,” Univ. Aston, Language Studies Unit, Birmingham, UK, Res. Rep. No. 1, 1981.).
Problem:
- Identifying the moves in a particular type of article.
- Manual identification of moves by raters (manual annotation) is time-consuming and gives no immediate feedback.
Purpose:
- To provide immediate feedback on move structures in the given text.
Method:
- Supervised learning was used to identify moves in 100 IT research article (RA) abstracts.
- Machine-readable abstracts were pre-processed with subroutines that remove irrelevant characters from the raw text.
- Data were labelled using a modified CARS model, which has three main moves with further steps under each move, as listed below (Ref: L. Anthony, “Writing research article introductions in software engineering: How accurate is a standard model?,” IEEE Trans. Prof. Commun., vol. 42, pp. 38–46, Mar. 1999.):
- Move 1: Establish a territory
  - 1.1 Claim centrality
  - 1.2 Generalize topics
  - 1.3 Review previous research
- Move 2: Establish a niche
  - 2.1 Counter claim
  - 2.2 Indicate a gap
  - 2.3 Raise questions
  - 2.4 Continue a tradition
- Move 3: Occupy the niche
  - 3.1 Outline purpose
  - 3.2 Announce research
  - 3.3 Announce findings
  - 3.4 Evaluate research
  - 3.5 Indicate RA structure
- Supervised Learning System – Implementation details:
- Bag of clusters representation was implemented:
- Input text is divided into clusters of 1 to 5 word tokens so that key phrases and discourse markers are captured as features.
- A bag-of-words model splits input text into single word tokens and therefore ignores word order and semantics, which makes it of little use at the discourse level.
- E.g. “Once upon a time there were three bears” yields clusters such as “once”, “upon”, “once upon”, “once upon a time”.
- Uninformative clusters (noise) were removed with a statistical measure: clusters whose Information Gain (IG) score fell below a threshold were discarded.
- A ‘Location’ feature, the position index of the sentence in the abstract, was added to take account of preceding and following sentences.
- An additional training feature was given to the classifier: the probability of common structural step groupings.
- The Naive Bayes classifier outperformed the other models.
- Tool available for download as AntMover.
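Sketch (not from the paper): the bag-of-clusters representation and the IG-based noise removal described above could look roughly like the following. The whitespace tokenizer, the binary "cluster present / absent" feature, the toy sentences, the label names, and the 0.5 threshold are all assumptions made for illustration; the notes do not record how Mover tokenizes text, computes IG, or sets its threshold.

```python
import math
from collections import Counter


def clusters(sentence, max_len=5):
    """Return all contiguous word clusters of 1..max_len tokens in a sentence."""
    tokens = sentence.lower().split()  # assumed tokenizer: lowercase + whitespace
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = [c for c in counts if c]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)


def information_gain(cluster, examples):
    """IG of the binary feature 'cluster occurs in the sentence' for the labels."""
    all_labels = Counter(label for _, label in examples)
    present = Counter(label for feats, label in examples if cluster in feats)
    absent = Counter(label for feats, label in examples if cluster not in feats)
    gain = entropy(all_labels.values())
    for part in (present, absent):
        size = sum(part.values())
        if size:
            gain -= size / len(examples) * entropy(part.values())
    return gain


# Hypothetical labelled sentences (not taken from the Mover training data).
toy = [
    ("In this paper we propose a new method", "announce_research"),
    ("In this paper we describe a tool", "announce_research"),
    ("However little work has addressed this problem", "indicate_gap"),
    ("However no study has examined this issue", "indicate_gap"),
]
examples = [(clusters(text), label) for text, label in toy]
vocab = set().union(*(feats for feats, _ in examples))
scores = {c: information_gain(c, examples) for c in vocab}

threshold = 0.5  # assumed cut-off, chosen only for this toy data
kept = {c for c, s in scores.items() if s >= threshold}
```

On this toy data, a cluster such as "in this paper" separates the two labels perfectly and is kept, while "this" occurs in every sentence, carries no information about the label, and is dropped. The surviving clusters would then serve as features for a classifier such as Naive Bayes.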
Results:
- Evaluation of Mover:
- Training: 554 examples; Test: 138 examples
- Five-fold cross-validation; average accuracy: 68%
- Classes (steps within the structural moves) with few examples had lower accuracy. Incorrectly classified steps were mostly assigned to another step within the same move (note the similarity among steps 3.1, 3.2, and 3.3).
- Features to improve accuracy:
- When the two most likely classes according to the Naive Bayes probabilities are accepted (instead of predicting only one class), accuracy increased to 86%.
- Flow optimization improved accuracy by 2%.
- Manual correction of steps adds new training data: students' second opinions on the moves classified by the system were used to retrain the model.
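Sketch (not from the paper): the "two most likely decisions" evaluation above amounts to top-2 accuracy over the classifier's class probabilities. The function name, the label names, and the probabilities below are hypothetical and are not Mover's outputs; they only show how accepting the second-ranked class can raise the measured accuracy.

```python
def top_k_accuracy(probabilities, gold, k=1):
    """Fraction of examples whose gold label is among the k most probable classes."""
    hits = 0
    for probs, label in zip(probabilities, gold):
        ranked = sorted(probs, key=probs.get, reverse=True)  # most probable first
        if label in ranked[:k]:
            hits += 1
    return hits / len(gold)


# Hypothetical per-sentence class probabilities, as a Naive Bayes model might emit.
predicted = [
    {"outline_purpose": 0.6, "announce_research": 0.3, "announce_findings": 0.1},
    {"outline_purpose": 0.5, "announce_research": 0.4, "announce_findings": 0.1},
    {"outline_purpose": 0.2, "announce_research": 0.3, "announce_findings": 0.5},
    {"outline_purpose": 0.5, "announce_research": 0.3, "announce_findings": 0.2},
]
gold = ["outline_purpose", "announce_research", "announce_findings", "announce_findings"]

top1 = top_k_accuracy(predicted, gold, k=1)
top2 = top_k_accuracy(predicted, gold, k=2)
```

Here the second sentence is wrong under a single prediction but correct once the runner-up class is accepted, so `top2` exceeds `top1`. This fits the observation that misclassified steps mostly fall within the same move.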
Discussion:
- In two practical classroom applications, using ‘Mover’ helped students to
- identify unnoticed moves in manual analysis.
- analyze moves much faster than manual analysis.
- better understand their own writing and avoid a distorted view of it.
- Implications:
- Important vocabulary for teaching can be identified from the ordered list of word clusters.
- Trained examples can be used as exemplars.
- Aid for immediate analysis of text structure.
- Future Work:
- Increasing the accuracy of Mover.
- Expanding to more fields – currently implemented for engineering and science text types.