Notes: ‘Digital support for academic writing: A review of technologies and pedagogies’

I came across this review article on writing tools, published in 2019, and wanted to make some quick notes to come back to. I’m following my usual format for article notes, which summarizes the gist of a paper with short descriptions under respective headers. I also had a few thoughts on what the paper missed, which I describe in this post.


Strobl, C., Ailhaud, E., Benetos, K., Devitt, A., Kruse, O., Proske, A., & Rapp, C. (2019). Digital support for academic writing: A review of technologies and pedagogies. Computers & Education, 131, 33–48.

Purpose:

  • To present a review of the technologies designed to support writing instruction in secondary and higher education.


Data collection:

  • Writing tools collected from two sources: 1) a systematic search in literature databases and search engines, 2) responses to an online survey sent to research communities on writing instruction.
  • 44 tools selected for fine-grained analysis.

Tools selected:

Academic Vocabulary
Article Writing Tool
C-SAW (Computer-Supported Argumentative Writing)
Carnegie Mellon prose style tool
Correct English (Vantage Learning)
Deutsch-uni online
DicSci (Dictionary of Verbs in Science)
Editor (Serenity Software)
Essay Jack
Essay Map
Klinkende Taal
Marking Mate (standard version)
My Access!
Open Essayist
Paper rater
PEG Writing
Research Writing Tutor
Right Writer
SWAN (Scientific Writing Assistant)
Scribo – Research Question and Literature Search Tool
Thesis Writer
Turnitin (Revision Assistant)
White Smoke

Inclusion criteria:

  • Tools intended solely for primary and secondary education were excluded, since the main focus of the paper was on higher education.
  • Tools focusing solely on features like grammar, spelling, style, or plagiarism detection were excluded.
  • Technologies without an instructional focus – pure online text editors and tools, platforms, or content management systems – were excluded.

I have concerns about the way tools were included in this analysis, particularly because some key tools like AWA/AcaWriter, Writing Mentor, Essay Critic, and Grammarly were not considered. This is one of the main limitations I found in the study. It is not clear how the tools were selected in the systematic search, as there is no information about the databases and keywords used; nor is it explained how the tools focusing on higher education were picked.


Notes: Discourse classification into rhetorical functions

Reference: Cotos, E., & Pendar, N. (2016). Discourse classification into rhetorical functions for AWE feedback. CALICO Journal, 33(1), 92.


  • Computational techniques can be exploited to provide individualized feedback to learners on writing.
  • Genre analysis on writing to identify moves (communicative goal) and steps (rhetorical functions to help achieve the goal) [Swales, 1990].
  • Natural language processing (NLP) and machine learning categorization approaches are widely used to automatically identify discourse structures (e.g. Mover, prior work on IADE).

Purpose:

  • To develop an automated analysis system ‘Research Writing Tutor‘ (RWT) for identifying rhetorical structures (moves and steps) from research writing and provide feedback to students.


  • Sentence-level analysis – each sentence is classified into a move and a step within that move.
  • Data: Introduction sections from 1,020 articles – 51 disciplines, each discipline containing 20 articles, a total of 1,322,089 words.
  • Annotation Scheme:
    • 3 moves, 17 steps – Refer Table 1 from the original paper for detailed annotation scheme (Based on the CARS model).
    • Manual annotation using XML based markup by the Callisto Workbench.
  • Supervised learning approach steps:
    1. Feature selection:
      • Important features – unigrams, trigrams
      • n-gram feature set contained 5,825 unigrams and 11,630 trigrams for moves, and 27,689 unigrams and 27,160 trigrams for steps.
    2. Sentence representation:
      • Each sentence is represented as an n-dimensional vector in the R^n Euclidean space.
      • Boolean representation to indicate presence or absence of feature in sentence.
    3. Training classifier:
      • SVM model for classification.
      • 10-fold cross validation.
      • precision higher than recall – 70.3% versus 61.2% for the move classifier and 68.6% versus 55% for the step classifier – objective is to maximize accuracy.
      • RWT analyzer has two cascaded SVM – move classifier followed by step classifier.
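The representation in steps 1–2 can be sketched as follows – a minimal illustration of boolean unigram/trigram vectors, not the actual RWT implementation (which also performs feature selection over the n-gram sets):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocab(sentences, orders=(1, 3)):
    """Map every unigram and trigram seen in the corpus to a vector index."""
    grams = {g for s in sentences for n in orders
             for g in ngrams(s.lower().split(), n)}
    return {g: i for i, g in enumerate(sorted(grams))}

def boolean_vector(sentence, vocab, orders=(1, 3)):
    """Boolean presence/absence vector for one sentence (a point in R^n)."""
    present = {g for n in orders for g in ngrams(sentence.lower().split(), n)}
    vec = [0] * len(vocab)
    for g in present & vocab.keys():
        vec[vocab[g]] = 1
    return vec

# Toy corpus standing in for the annotated Introduction sentences
corpus = ["this study aims to investigate x",
          "previous research has shown y"]
vocab = build_vocab(corpus)
v = boolean_vector("this study extends previous research", vocab)
```

A linear SVM (e.g. scikit-learn’s LinearSVC) would then be trained on such vectors with 10-fold cross-validation, once for moves and once for steps.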

Results:

  • Move and step classifiers predict some elements better than others (refer the paper for detailed results):
    • Move 2 most difficult to identify (sparse training data).
    • Move 1 gained best recall – less ambiguous cues.
    • 10 out of 17 steps were predicted well.
    • Overall move accuracy of 72.6% and step accuracy of 72.9%.

Future Work:

  • Moving beyond sentence level to incorporate context information and sequence of moves/steps.
  • Knowledge-based approach for hard to identify steps – hand written rules and patterns.
  • Voting algorithm using independent analyzers.

Notes: XIP – Automated rhetorical parsing of scientific metadiscourse

Reference: Simsek, D., Buckingham Shum, S., Sandor, A., De Liddo, A., & Ferguson, R. (2013). XIP Dashboard: visual analytics from automated rhetorical parsing of scientific metadiscourse. In: 1st International Workshop on Discourse-Centric Learning Analytics, 8 Apr 2013, Leuven, Belgium.


Learners should have the ability to critically evaluate research articles and be able to identify the claims and ideas in scientific literature.

Purpose:

  • Automating analysis of research articles to identify evolution of ideas and findings.
  • Describing the Xerox Incremental Parser (XIP) which identifies rhetorically significant structures from research text.
  • Designing a visual analytics dashboard to provide overviews of the student corpus.


  • Argumentative Zoning (AZ), by Simone Teufel, annotates moves in research articles.
  • Sample discourse moves:
    • Summarizing: “The purpose of this article….”
    • Contrasting ideas: “With an absence of detailed work…”
      • Sub-classes: novelty, surprise, importance, emerging issue, open question
  • XIP outputs a raw output file containing semantic tags and concepts extracted from text.
  • Data: Papers from the LAK & EDM conferences and journal – 66 LAK and 239 EDM papers, yielding 7,847 sentences and 40,163 concepts.
  • Dashboard design – Refer original paper to see the process involved in prototyping the visualizations.


  • XIP is now embedded in the Academic Writing Analytics (AWA) tool by UTS. AWA provides analytical and reflective reports on students’ writing.

Notes: Automatic recognition of conceptualization zones in scientific articles

Liakata, M., Saha, S., Dobnik, S., Batchelor, C., & Rebholz-Schuhmann, D. (2012). Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7), 991-1000.


  • Scientific discourse analysis helps in distinguishing the nature of knowledge in research articles (facts, hypothesis, existing and new work).
  • Annotation schemes vary across disciplines in scope and granularity.

Purpose:

  • To build a finer grained annotation scheme to capture the structure of scientific articles (CoreSC scheme).
  • To automate the annotation of full articles at sentence level with CoreSC scheme using machine learning classifiers (SAPIENT “Semantic Annotation of Papers: Interface & ENrichment Tool” available for download here).



  • 265 articles from biochemistry and chemistry, containing 39,915 sentences (>1 million words), annotated in three phases by multiple experts.
  • XML aware sentence splitter SSSplit used for splitting sentences.


  • First layer of the CoreSC scheme with 11 categories for annotation:
    • Background (BAC), Hypothesis (HYP), Motivation (MOT), Goal (GOA), Object (OBJ), Method (MET), Model (MOD), Experiment (EXP), Observation (OBS), Result (RES) and Conclusion (CON).


  1. Text classification:
    • Sentences classified independent of each other.
    • Uses Support Vector Machine (SVM).
    • Features extracted based on different aspects of a sentence, ranging from global features (location within the paper, document structure) to local features. For the complete list of features used, refer the paper.
  2. Sequence labelling:
    • Labels assigned to satisfy dependencies among sentences.
    • Uses Conditional Random Fields (CRF).
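To make the two approaches concrete, here is a toy per-sentence feature extractor mixing global cues (section heading, position in the paper) with local ones (bigrams); the paper’s actual feature set is far richer:

```python
def sentence_features(sentences, headings):
    """One feature dict per sentence: global features (section heading,
    relative position in the paper) plus local bigram features."""
    feats = []
    n = len(sentences)
    for i, (sent, heading) in enumerate(zip(sentences, headings)):
        tokens = sent.lower().split()
        f = {
            "section": heading,                        # document structure (global)
            "position": round(i / max(n - 1, 1), 2),   # location within the paper
        }
        for a, b in zip(tokens, tokens[1:]):           # local bigram features
            f[f"bigram={a}_{b}"] = True
        feats.append(f)
    return feats
```

After vectorizing, such dicts could feed an SVM for the independent per-sentence classification, or a CRF when labels are assigned jointly over the sentence sequence.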

Results and discussion:

  • F-score: ranges from 76% for EXP (Experiment) to 18% for the low-frequency category MOT (Motivation) [refer the complete results from runs configured with different settings and features in Table 2 of the paper].
  • Most important features: n-grams (primarily bigrams), Grammatical triples (GRs), verbs, and global features such as history (sequence of labels) and section headings (refer the paper for a detailed explanation of the features).
  • Classifiers: LibS has the highest accuracy at 51.6%, CRF at 50.4% and LibL at 47.7%.

Application/Future Work:

  • Can be applied to create executive summaries of full papers (based on the entire content and not just abstracts) to identify key information in a paper.
  • CoreSC annotated biology papers to be used for guiding information extraction and retrieval.
  • Generalization to new domains in progress.

Notes: Computational analysis of move structures in academic abstracts


Wu, J. C., Chang, Y. C., Liou, H. C., & Chang, J. S. (2006, July). Computational analysis of move structures in academic abstracts. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 41-44). Association for Computational Linguistics.


  • Swales pattern for research articles: Introduction, Methods, Results, Discussion (IMRD) and Creating a Research Space (CARS) model.
  • Studying the rhetorical structure of texts is found to be useful to aid reading and writing (Mover tool notes here).

Purpose:

  • To automatically analyze move structures (Background, Purpose, Method, Result, and Conclusion) from research article abstracts.
  • To develop an online learning system CARE (Concordancer for Academic wRiting in English) using move structures to help novice writers.


Processes involved:


  • TANGO Concordancer used for extracting collocations with chunking and clause information – Sample  Verb-Noun collocation structures in corpus: VP+NP, VP+PP+NP, and VP+NP+PP (Ref: Jian, J. Y., Chang, Y. C., & Chang, J. S. (2004, July). TANGO: Bilingual collocational concordancer. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions (p. 19). Association for Computational Linguistics.)
    • TANGO Tool accessible here.
  • Data: Corpus of 20,306 abstracts (95,960 sentences) from Citeseer. Manual tagging of moves in 106 abstracts containing 709 sentences. 72,708 collocation types extracted and manually tagged 317 collocations with moves.
  • Hidden Markov Model (HMM) trained using 115 abstracts containing 684 sentences.
  • Different parameters evaluated for the HMM model: “the frequency of collocation types, the number of sentences with collocation in each abstract, move sequence score and collocation score”
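The decoding side of such an HMM can be sketched with a small Viterbi routine; the moves, emission scores, and transition table below are toy values, not the paper’s trained parameters:

```python
import math

MOVES = ["B", "P", "M", "R", "C"]  # Background, Purpose, Method, Result, Conclusion

def viterbi(emissions, trans, w=0.7):
    """Most likely move sequence for one abstract.  emissions[t][m] is the
    collocation-based score of move m for sentence t; trans[(a, b)] is the
    move-transition probability; w weights the transition term (the paper
    reports 0.7 as the best-performing weight)."""
    EPS = 1e-9  # floor for unseen moves/transitions
    best = [{m: (math.log(emissions[0].get(m, EPS)), [m]) for m in MOVES}]
    for t in range(1, len(emissions)):
        layer = {}
        for m in MOVES:
            e = math.log(emissions[t].get(m, EPS))
            score, path = max(
                (best[t - 1][p][0] + w * math.log(trans.get((p, m), EPS)) + e,
                 best[t - 1][p][1] + [m])
                for p in MOVES)
            layer[m] = (score, path)
        best.append(layer)
    return max(best[-1].values())[1]

# Toy 3-sentence abstract: Background -> Purpose -> Method
emissions = [{"B": 0.8, "P": 0.1}, {"P": 0.7, "B": 0.2}, {"M": 0.9}]
trans = {("B", "P"): 0.6, ("P", "M"): 0.7, ("B", "B"): 0.3}
```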

Results:

  • Precision of 80.54% achieved when 627 sentences were qualified with the following parameters: weight of the transition probability function 0.7, and a frequency threshold of 18 for a collocation to be applicable (crucial to exclude unreliable collocations).


  • CARE system interface created for querying and looking up sentences for a specific move.
  • The system is expected to help non-native speakers write abstracts for research articles.

Notes: Visualizing sequential patterns for text mining


Wong, P. C., Cowley, W., Foote, H., Jurrus, E., & Thomas, J. (2000). Visualizing sequential patterns for text mining. In Information Visualization, 2000. InfoVis 2000. IEEE Symposium on (pp. 105-111). IEEE.


  • Mining sequential patterns aims to identify recurring patterns in data over a period of time.
  • A pattern is a finite series of elements from the same domain, e.g. A -> B -> C -> D.
  • Each pattern has a ‘support’ value indicating the percentage of pattern occurrence, and must exceed a minimum support threshold (e.g. 90% of people who did the first process did the second process, followed by the third process).
  • Sequential pattern vs association rule:
    • Sequential pattern – studies ordering/arrangement of elements, e.g. A -> B -> C -> D
    • Association rule – studies togetherness, e.g. A+B+C -> D
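Support, as described above, can be computed directly; a minimal sketch (a real miner like the paper’s tree-based algorithm also enumerates the candidate patterns):

```python
def support(pattern, sequences):
    """Fraction of sequences that contain `pattern` as an ordered (not
    necessarily contiguous) subsequence."""
    def contains(seq, pat):
        it = iter(seq)
        # `elem in it` advances the iterator, so order is enforced
        return all(elem in it for elem in pat)
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# Toy event histories (invented for illustration)
histories = [["A", "B", "C", "D"],
             ["A", "C", "D"],
             ["B", "A", "C"]]
```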

Purpose:

  • Presenting a visual data mining system that combines pattern discovery and visualizations.



Data: An open source corpus containing 1,170 news articles from 1991 to 1997, plus harvested news of 1990 from the TREC5 distribution.


  1. Topic extraction: identifies the topics in documents based on the co-occurrence of words. Words separated by white space are evaluated – stemming is applied; prepositions, pronouns, adjectives, and gerunds are ignored.
  2. Multiresolution binning: bins articles with the same timestamp (e.g. binning by day, week, month, or year).
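The binning step can be sketched like this (the date/topic records are invented for illustration):

```python
from collections import defaultdict
from datetime import date

def bin_articles(articles, resolution="month"):
    """Group (date, topic) records into day/week/month/year bins."""
    key_fn = {
        "day":   lambda d: (d.year, d.month, d.day),
        "week":  lambda d: tuple(d.isocalendar())[:2],  # (ISO year, ISO week)
        "month": lambda d: (d.year, d.month),
        "year":  lambda d: (d.year,),
    }[resolution]
    bins = defaultdict(list)
    for d, topic in articles:
        bins[key_fn(d)].append(topic)
    return dict(bins)

articles = [(date(1991, 1, 3), "oil"),
            (date(1991, 1, 20), "war"),
            (date(1991, 2, 1), "oil")]
```

Switching the `resolution` argument reproduces the idea of re-binning the same corpus at different time granularities.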

Discovery of sequential patterns by Visualization:

  • Plotting topics/ topic combinations over time.
  • Strength: Can quickly view overall patterns and individual occurrence of events.
  • Weakness: no knowledge of the exact connections that make up the pattern, and no statistical support values for the individual patterns.

Discovery of sequential patterns by Data mining:

  • Building patterns on an n-ary tree with elements as nodes.
  • Patterns are valid if the support value is greater than threshold.
  • A sample pattern mining from given input data is given in Figure 2 of the paper.
  • Strength: Provides accurate statistical (support) values for all weak and strong patterns.
  • Weakness: loses temporal and locality information; the large number of patterns produced in text format makes human interpretation harder.

Visual Data Mining system:


  • Combining visualization and data mining to compensate for each other’s weaknesses (refer Figures 4 & 5 in the paper to see the pattern visualizations).
  • Binning resolution can be changed to see different patterns based on day, week, month, year etc.
  • Patterns associated to a particular topic can be picked.


  • The strength of a pattern is not easily identifiable from the visualization alone, without statistical measures; conversely, pattern mining is enhanced by graphical encoding of spatial and temporal information.
  • Knowledge discovery by humans is aided by combining statistical data mining and visualization.

Future Work:

  • Handling larger data sets using secondary memory support and improving the display.
  • Integrating more techniques like association rules into visual data mining environment.

Notes: Discipline-independent argumentative zoning


Teufel, S., Siddharthan, A., & Batchelor, C. (2009, August). Towards discipline-independent argumentative zoning: evidence from chemistry and computational linguistics. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3 (pp. 1493-1502). Association for Computational Linguistics.


  • Argumentative Zoning (AZ) classifies each sentence into one of the categories below, based on the knowledge claims (KC) of authors:
    • Aim, Background, Basis, Contrast, Other and Textual.

[Refer AZ scheme – Teufel, S., Carletta, J., & Moens, M. (1999, June). An annotation scheme for discourse-level argumentation in research articles. In Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics (pp. 110-117). Association for Computational Linguistics.]

Purpose:

  • Establishing a modified AZ scheme, AZ-II, with finer-grained categories (11 instead of 7) to recognize structural and relational categories.
  • Experimenting annotation using AZ scheme in two distinct domains: Chemistry and Computational Linguistics (CL).
  • Testing an annotation scheme to systematically exclude prior domain knowledge of annotators.



  • Domain independent categories so that the annotations can be done based on general, rhetorical and linguistic knowledge and no scientific domain knowledge is necessary.
  • Annotators are semi-informed experts following the rules below, so that existing domain knowledge has minimal interference with annotations:
    • Justification is required for all annotations based on text based evidence such as cues, and other linguistic principles.
    • Discipline specific generics are provided based on high level domain knowledge so that the annotators can identify the validity of knowledge claims made in the domain (E.g. a “Chemistry primer” with high level information regarding common scientific terms to help a non-expert).
    • Guidelines are given with descriptions for annotating the categories; some categories might require domain knowledge for distinguishing them (e.g. authors mentioning the failure of previous methods: OWN_FAIL vs ANTISUPP; reasoning required to come to conclusions from results: OWN_RES vs OWN_CONC).


  • Data:
    • Chemistry – 30 journal articles, 3745 sentences
    • CL – 9 conference articles, 1629 sentences
  • Independent annotations using web based tool. Refer example annotations in appendix of the paper.

Results:

  • Inter-annotator agreement: Fleiss Kappa coefficient, κ = 0.71 for Chemistry and κ = 0.65 for CL.
  • Wide variation in the frequency of categories –> fewer examples for supervised learning for rare categories (Refer ‘Figure 3: Frequency of AZ-II Categories’ in the paper to see the frequency distinctions between the two domains).
  • Pairwise agreement calculated to see the impact of domain knowledge between annotators: κAC = 0.66, κBC = 0.73 and κAB = 0.73 –> Largest disagreement between expert (A) and non-expert (C).
  • Inter-annotator agreement to see the distinction between categories: κbinary = 0.78 for chemistry and κbinary = 0.65 for CL –> Easier distinction of categories in Chemistry than CL.
  • Krippendorff’s category distinctions to see how well a category is distinguished from the other, collapsed categories: κ=0.71 for chemistry, κ=0.65 for CL
    • Well distinguished: OWN MTHD, OWN RES and FUT
    • Less distinguished: ANTISUPP, OWN FAIL and PREV OWN –> troubleshooting required for guidelines
  • Comparison of AZ-II to original AZ annotation scheme by collapsing into 6-category AZ annotation: κ=0.75 –> annotation of high consistency.
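For reference, Fleiss’ κ (the agreement statistic quoted above) can be computed from an items-by-categories count matrix; a minimal sketch:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa.  ratings[i][c] = number of annotators who assigned
    item i to category c (every row must sum to the same rater count)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # overall proportion of assignments per category
    p = [sum(row[c] for row in ratings) / (n_items * n_raters)
         for c in range(n_cats)]
    # mean observed per-item agreement
    p_bar = sum((sum(x * x for x in row) - n_raters)
                / (n_raters * (n_raters - 1))
                for row in ratings) / n_items
    p_exp = sum(x * x for x in p)  # chance agreement
    return (p_bar - p_exp) / (1 - p_exp)

# 3 items, 2 categories, 3 annotators in perfect agreement -> kappa = 1.0
perfect = [[3, 0], [0, 3], [3, 0]]
```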


  • Positive result for domain independent application of AZ scheme and training non experts as annotators.
  • Annotating a more established discipline like Chemistry was easier than CL.

Future Work:

  • Automation of AZ annotation
  • Expanding annotation guidelines to other disciplines and longer journal papers.




Notes: Mover – a Machine Learning tool to analyze technical research papers

Reference: Anthony, L., & Lashkia, G. V. (2003). Mover: A machine learning tool to assist in the reading and writing of technical papers. IEEE Transactions on Professional Communication, 46(3), 185-193.


  • Identifying the structure of text helps in reading and writing research articles.
  • The structure of research article introductions in terms of moves is explained in the CARS model (Ref: J. M. Swales, “Aspects of Article Introductions,” Univ. Aston, Language Studies Unit, Birmingham, UK, Res. Rep. No. 1, 1981.).


  • Identifying the moves in a particular type of article.
  • Time-consuming identification of moves by raters (manual annotation) with no immediate feedback.

Purpose:

  • To provide immediate feedback on move structures in the given text.


  • Using supervised learning to identify moves from 100 IT research article (RA) abstracts.
  • Machine readable abstracts were further pre-processed with subroutines to remove irrelevant characters from raw text.
  • Data labelled based on the modified CARS model which had three main moves with further steps under each move as below: (Ref: L. Anthony, “Writing research article introductions in software engineering: How accurate is a standard model?,” IEEE Trans. Prof. Commun., vol. 42, pp. 38–46, Mar. 1999.)
    1. Establish a territory
      1. Claim centrality
      2. Generalize topics
      3. Review previous research
    2. Establish a niche
      1. Counter claim
      2. Indicate a gap
      3. Raise questions
      4. Continue a tradition
    3. Occupy the niche
      1. Outline purpose
      2. Announce research
      3. Announce findings
      4. Evaluate research
      5. Indicate RA structure


  • Supervised Learning System – Implementation details:
    • Bag of clusters representation was implemented:
      • Dividing input text into clusters of 1-5 word tokens to capture key phrases and discourse markers as features.
      • The bag-of-words model, which splits input text into single word tokens, does not consider word order and semantics – not useful at the discourse level.
      • E.g. “Once upon a time there were three bears” clusters –> “once”, “upon”, “once upon”, “once upon a time”
      • Removal of unhelpful clusters (noise) using statistical measures – Information Gain (IG) scores used to remove clusters below a threshold.
      • ‘Location’ feature added to take note of preceding and later sentences – position index of sentence  in the abstract.
        • Additional training feature for the classifier – probability of common structural step groupings.
    • Naive Bayes learning classifier outperformed other models.
    • Tool available for download as AntMover.
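The cluster extraction step can be sketched as follows (the sentence is the paper’s own example; IG-based noise filtering is omitted):

```python
def clusters(text, max_len=5):
    """All contiguous 1- to 5-token 'clusters' of the input text."""
    tokens = text.lower().split()
    out = []
    for n in range(1, max_len + 1):
        out.extend(" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))
    return out

example = clusters("Once upon a time there were three bears")
```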


  • Evaluation of Mover:
    • Training: 554 examples, Test: 138 examples
    • Five fold cross validation, Average accuracy: 68%
    • Classes (steps within the structural moves) with few examples had lower accuracy. Incorrectly classified steps were mostly from the same move (note the similarity among steps 3.1, 3.2, 3.3).
    • Features to improve accuracy:
      • When two most likely decisions are used (instead of predicting only one class) using the Naive Bayes probabilities, accuracy increased to 86%.
      • Flow optimization effectiveness improved accuracy by 2%.
      • Manual correction of steps added new training data (students’ second opinions on the moves classified by the system were used for retraining the model).
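The “two most likely decisions” relaxation is easy to reproduce given per-class probabilities (e.g. from a Naive Bayes classifier); the step labels and probabilities below are illustrative:

```python
def topk_accuracy(prob_rows, gold, k=2):
    """Accuracy where a prediction counts as correct if the gold step is
    among the k most probable classes."""
    hits = 0
    for probs, g in zip(prob_rows, gold):
        top = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += g in top
    return hits / len(gold)

# Two sentences with toy per-step probabilities and gold labels
rows = [{"1.1": 0.5, "3.1": 0.3, "3.2": 0.2},
        {"3.1": 0.6, "3.2": 0.35, "1.1": 0.05}]
gold = ["3.1", "3.2"]
```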


  • Based on two practical applications in the classroom, the usage of ‘Mover’ assisted students to
    • identify unnoticed moves in manual analysis.
    • analyze moves much faster than manual analysis.
    • better understand own writing and prevent distorted views.
  • Implications:
    • Important vocabulary can be identified for teaching from the ordered cluster of words.
    • Trained examples can be used as exemplars.
    • Aid for immediate analysis of text structure.
  • Future Work:
    • Increasing the accuracy of Mover.
    • Expanding to more fields – currently implemented for engineering and science text types.


Notes: NLP Techniques for peer feedback

Reference: Xiong, W., Litman, D. J., & Schunn, C. D. (2012). Natural language processing techniques for researching and improving peer feedback. Journal of Writing Research, 4(2), 155-176.


  • Feedback on writing is seen to improve students’ writing, but the process is resource intensive.
  • Possible options to reduce the workload in giving feedback:
    • Direct feedback using technology assisted approaches (from grammar checks to complex computational linguistics).
    • Peer Review [Considered in this paper].
  • Peer review:
    • Good feedback from a group of peers is found to be as useful as the instructor’s feedback and even weaker writers are seen to provide useful feedback to stronger writers (See references in original paper).
    • When providing feedback on other students’ work, students become mindful of the mistakes and improve their own writing.
    • Some web-based peer review systems: PeerMark (in Turnitin), SWoRD (used in this study) and Calibrated Peer Review.


  • Challenge lies in the form of feedback provided by peers – peer feedback might not be in a form useful to make revisions. Key features identified to aid revisions:
    1. Localized information (Providing exact location details like paragraph, page numbers or quotations).
    2. Concrete solution (suggesting a possible solution rather than just pointing out the problem).
  • Research problem: Studying peer review is hard with a large amount of feedback data.
  • Practical problem: Identifying useful feedback for students and possible interventions to help them provide good feedback.

Purpose:

  • To automatically process peer feedback and identify the presence or absence of the two key features (Providing feedback on feedback for students and automatically coding feedback for researchers).
  • Refer the prototype shown in Figure 1 of the original paper, which prompts students to provide localized and explicit solutions.

Technical – How? (Details explained in study 1 and study 2)

  1. Building a domain lexicon from common unigrams and bigrams in student papers.
  2. Counting basic features like domain words, modals, negations, overlap between comment and paper etc. from each feedback.
  3. Creating a model to identify the type of feedback (contains localization information or not / contains an explicit solution or not) – a classification task in machine learning.

Method and Results:

Study 1 – Localization Detection:

  • Each feedback comment represented as a vector of the four attributes below:
    1. regularExpressionTag: Regular expressions to match phrases that use location in a comment (E.g. “on page 5”).
    2. #domainWord: Counting the number of domain-related words in a comment (based on the domain lexicon gathered from frequent terms in student papers).
    3. sub-domain-obj, deDeterminer: Extracting syntactic attributes (sub-domain-obj) and count of words like “this, that, these, those” which are demonstrative determiners.
    4. windowSize, #overlaps: Extracting the length of matching words from the document to identify quotes (windowSize) and words overlapped.
  • Weka models were used to automatically code localization information. The decision tree model had the best accuracy (77%, recall 82%, precision 73%) in predicting whether a feedback comment was localized or not. For the rules that made up the decision tree, take a look at Figure 2 of the original paper.
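A rough sketch of the four attribute families (regex location tag, domain-word count, demonstrative determiners, and a crude windowSize via substring overlap); the regex, lexicon, and example comment are invented, and the real system’s attributes are more elaborate:

```python
import re

DETERMINERS = {"this", "that", "these", "those"}

def localization_features(comment, domain_lexicon, paper_text, window=4):
    """Simplified versions of the Study 1 attributes for one comment."""
    tokens = comment.lower().split()
    paper = paper_text.lower()
    # longest run of comment words also found verbatim in the paper
    # (a crude substring-based stand-in for windowSize/#overlaps)
    longest = 0
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + window, len(tokens) + 1)):
            if " ".join(tokens[i:j]) in paper:
                longest = max(longest, j - i)
    return {
        "regularExpressionTag": bool(
            re.search(r"\b(page|paragraph|section)\s+\d+", comment.lower())),
        "#domainWord": sum(t in domain_lexicon for t in tokens),
        "#determiners": sum(t in DETERMINERS for t in tokens),
        "windowSize": longest,
    }
```

A decision tree (or any classifier) would then be trained on these feature dicts to predict localized vs. not localized.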

Study 2 – Solution Detection:

  • Feedback comments represented as vectors using the three types of attributes below (refer Table 2 in the original paper for details):
    • Simple features like word count and the order of comment in overall feedback.
    • Essay attributes to capture the relationship between the comment and the essay and domain topics.
    • Keyword attributes semi-automatically learned based on semantic and syntactic functions.
  • Logistic regression model to detect the presence/absence of explicit solutions (accuracy 83%, recall 91%, precision 83%). Domain-topic words followed by suggestions were highly associated with prediction. Detailed coefficients of attributes predicting the presence of a solution are given in Table 3 of the original paper.
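Scoring a comment with a fitted logistic model reduces to a weighted sum plus a sigmoid; the feature names and coefficients below are made up for illustration, not the paper’s fitted values:

```python
import math

def solution_probability(features, coef, intercept):
    """Logistic-regression score for one comment's feature vector."""
    z = intercept + sum(coef.get(k, 0.0) * v for k, v in features.items())
    return 1 / (1 + math.exp(-z))  # sigmoid of the linear predictor

# Hypothetical fitted model: suggestion wording and domain words
# both push the probability of an explicit solution upward.
p = solution_probability({"suggestion_word": 1, "#domainWord": 2},
                         {"suggestion_word": 1.5, "#domainWord": 0.4},
                         -1.0)
```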

Study 3: Can Research Rely on Automatic Coding?

  • Comparing automatically coded data to hand coded data to see if the accuracy is sufficiently high for practical implementation.
  • Helpfulness ratings by peers and 2 experts (content, writing experts) on peer comments at a review level.
  • To account for expert ratings:
    • Regression analysis using feedback type proportions (praise only comments, summary only comments, problem/solution containing comments), proportion localized critical comments, and proportion solution providing comments as predictors.
    • 10 fold cross validation – SVM best fit.
    • To check whether the same models are built using machine-coded and hand-coded data – 10 stepwise regressions. Refer Table 4 in the original paper to see the feedback features commonly included in the model by the different raters – different features were helpful for different raters.
    • The overall regression model from machine-coded data is similar to the one from hand-coded localization data (most of the positivity, solution and localization features were similar between hand coding and automatic coding).


  • Predictive models for detecting localization and solution information are statistical tools and do not provide deep content insights.
  • To be integrated into SWoRD to provide real time feedback on comments.
  • Technical note: Comments were already pre-processed – segmented into idea units by hand; data split by hand into comment type (summary, praise, criticism).
  • Future work:
    • Examine impact of feedback on feedback comments
    • Obtaining generalization across courses
    • Improving accuracy of prediction


Notes: The calibration of student judgement through self-assessment

Reference: Boud, D., Lawson, R., & Thompson, D. G. (2015). The calibration of student judgement through self-assessment: disruptive effects of assessment patterns. Higher Education Research & Development, 34(1), 45-59.


  • Effective judgement of own work is an important attribute for HDR students
  • Focused on self-assessment (also described as self-regulation and metacognition in some works)


  • Self-assessment is not facilitated through systematic activities, but is rather assumed to develop from normal tasks.
  • Students are not given feedback on their judgement.
  • There is a need to ensure that different assessment methods don’t disrupt students’ learning and judgement.

Purpose:

  • To investigate if students’ judging improves over time through criteria-based self-assessment in the given units of study in two parts:
    1. Replicate improvement of student judgement over time with more data from different disciplines (Repeat questions 1- 4)
    2. Investigate improvement of judging skill over sequential modules and analyze based on assessment type, assessment criteria (New questions 1-4)

Method/ Context:

  • Voluntary self assessment of students in authentic settings using an online assessment system ReView™.
  • Percentage marks from a continuous sliding scale were stored – for both students and tutors.
  • Data from 5 year period in Design, and 3 year period in Business course from two Australian universities.
  • 182 design students, 1162 business students.

Results and Discussion:

Repeat Q1: Does accuracy in students’ self-assessed grades vary by student performance?

  • Ability level – High (distinction or high distinction), low (fail) and mid (pass or credit)
  • p values
  • Yes. Low-ability students did not improve judgement over time; significantly higher improvement for mid-ability students (whose marks came to match tutors’ grading).

Repeat Q2: Does students’ accuracy in self-assessment relate to improved performance?

  • Accuracy levels: Over-, accurate and Under-estimators
  • One way ANOVA
  • Accurate estimators had increased scores over time

Repeat Q3: Do students’ marks converge with tutors’ within a course module?

  • Series of paired t-tests
  • Convergence found in design data, but not in business data
    • Design assessment tasks are scaffolded to lead one to other, but business uses different modes of assessment within a course module.

Repeat Q4: Does the gap between student grades and tutor grades reduce over time?

  • Yes. Students’ judgement improves over the time of the degree programme, but not very useful as the convergence happens almost at the end of the degree programme.

New Q1: Does the gap between student grades and tutor grades reduce across modules designed as a sequence?

  • No data for design; for business with sequential modules, the gap between student and tutor grades showed erratic patterns with no gradual reduction.
  • Leading to examine the type of assessments

New Q2: Does mode/type of assessment task (e.g., written assignment, project, and presentation) influence students’ judgement of grades?

  • Data was inconsistent, despite showing earlier convergence for a few assessment types. Most assessment types saw convergence in iterations 2 or 3 (refer Table 1 from the original paper).
    • Could be due to difference in tasks within each assessment type

New Q3: Does analysis of criteria that relate to type of assessment task influence students’ judgement of grades?

  • Consistent and related criteria for particular assessment type fosters faster calibration in iterations 1 or 2 (Refer Table 2 from the original paper).


  • Criteria are provided for assessment since students are not experts; however, a holistic evaluation of one’s own work is recommended.
  • Not possible to identify the cause for improvement from independent measures.
  • The whole population of students is not included, especially the less engaged students who might be low achieving.
  • There might have been other informal factors (not measured in this study) to influence the results like the following: comments received from staff, discussions with peers, and students’ own aspirations.