Automated Writing Feedback in AcaWriter

You might be familiar with my research in the field of Writing Analytics, particularly Automated Writing Feedback, during my PhD and beyond. The work is based on an automated feedback tool called AcaWriter (previously called Automated Writing Analytics/AWA), which we developed at the Connected Intelligence Centre, University of Technology Sydney.

Recently we have created resources to spread the word and introduce the tool to anyone who wants to learn more. The first is an introductory blog post I wrote for the Society for Learning Analytics Research (SoLAR) Nexus publication. You can access the full blog post here: https://www.solaresearch.org/2020/11/acawriter-designing-automated-feedback-on-writing-that-teachers-and-students-trust/

We also ran a two-hour online workshop as part of a LALN event, adding more detail and resources for others to participate. Details are here: http://wa.utscic.edu.au/events/laln-2020-workshop/

A video recording of the event is available for replay:

Learn more: https://cic.uts.edu.au/tools/awa/

Automated Revision Graphs – AIED 2020

I’ve recently had my writing analytics work published at the 21st International Conference on Artificial Intelligence in Education (AIED 2020), where the theme was “Augmented Intelligence to Empower Education”. It is a short paper describing a text analysis and visualisation method to study revisions. It introduces ‘Automated Revision Graphs’ to study revisions in short texts at the sentence level by visualising text as a graph, with open source code.

Shibani A. (2020) Constructing Automated Revision Graphs: A Novel Visualization Technique to Study Student Writing. In: Bittencourt I., Cukurova M., Muldner K., Luckin R., Millán E. (eds) Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, vol 12164. Springer, Cham. [pdf] https://doi.org/10.1007/978-3-030-52240-7_52

I did a short introductory video for the conference, which can be viewed below:

I also co-authored another paper on multi-modal learning analytics, led by Roberto Martinez, which received the best paper award at the conference. The main contribution of the paper is a set of conceptual mappings from x-y positional data (captured from sensors) to meaningful, measurable constructs of physical classroom movement, grounded in the theory of Spatial Pedagogy. Great effort by the team!

Details of the second paper can be found here:

Martinez-Maldonado R., Echeverria V., Schulte J., Shibani A., Mangaroska K., Buckingham Shum S. (2020) Moodoo: Indoor Positioning Analytics for Characterising Classroom Teaching. In: Bittencourt I., Cukurova M., Muldner K., Luckin R., Millán E. (eds) Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, vol 12163. Springer, Cham. [pdf] https://doi.org/10.1007/978-3-030-52237-7_29

New Research Publications in Learning Analytics

Three of my journal articles were published recently: two on learning analytics/writing analytics implementations [Learning Analytics Special Issue in The Internet and Higher Education journal], and one on a text analysis method [Educational Technology Research and Development journal] that I worked on many years ago and which has only just been published!

Article 1: Educator Perspectives on Learning Analytics in Classroom Practice

The first is predominantly qualitative in nature, based on interviews with instructors about their experiences using learning analytics tools such as the automated writing feedback tool AcaWriter. It provides a practical account of implementing learning analytics in authentic classroom practice, in the voices of educators. Details below:

Abstract: Failing to understand the perspectives of educators, and the constraints under which they work, is a hallmark of many educational technology innovations’ failure to achieve usage in authentic contexts, and sustained adoption. Learning Analytics (LA) is no exception, and there are increasingly recognised policy and implementation challenges in higher education for educators to integrate LA into their teaching. This paper contributes a detailed analysis of interviews with educators who introduced an automated writing feedback tool in their classrooms (triangulated with student and tutor survey data), over the course of a three-year collaboration with researchers, spanning six semesters’ teaching. It explains educators’ motivations, implementation strategies, outcomes, and challenges when using LA in authentic practice. The paper foregrounds the views of educators to support cross-fertilization between LA research and practice, and discusses the importance of cultivating educators’ and students’ agency when introducing novel, student-facing LA tools.

Keywords: learning analytics; writing analytics; participatory research; design research; implementation; educator

Citation and article link: Antonette Shibani, Simon Knight and Simon Buckingham Shum (2020). Educator Perspectives on Learning Analytics in Classroom Practice [Author manuscript]. The Internet and Higher Education. https://doi.org/10.1016/j.iheduc.2020.100730. [Publisher’s free download link valid until 8 May 2020].

Article 2: Implementing Learning Analytics for Learning Impact: Taking Tools to Task

The second, led by Simon Knight, provides a broader framing for how we define impact in learning analytics. It presents a model addressing the key challenges in LA implementations, based on our writing analytics example. Details below:

Abstract: Learning analytics has the potential to impact student learning, at scale. Embedded in that claim are a set of assumptions and tensions around the nature of scale, impact on student learning, and the scope of infrastructure encompassed by ‘learning analytics’ as a socio-technical field. Drawing on our design experience of developing learning analytics and inducting others into its use, we present a model that we have used to address five key challenges we have encountered. In developing this model, we recommend: A focus on impact on learning through augmentation of existing practice; the centrality of tasks in implementing learning analytics for impact on learning; the commensurate centrality of learning in evaluating learning analytics; inclusion of co-design approaches in implementing learning analytics across sites; and an attention to both social and technical infrastructure.

Keywords: learning analytics, implementation, educational technology, learning design

Citation and article link:  Simon Knight, Andrew Gibson and Antonette Shibani (2020). Implementing Learning Analytics for Learning Impact: Taking Tools to Task. The Internet and Higher Education. https://doi.org/10.1016/j.iheduc.2020.100729.

Article 3: Identifying patterns in students’ scientific argumentation: content analysis through text mining using LDA

The third, led by Wanli Xing, discusses the use of Latent Dirichlet Allocation (LDA), a text mining method, to study argumentation patterns in student writing in an unsupervised way. Details below:

Abstract: Constructing scientific arguments is an important practice for students because it helps them to make sense of data using scientific knowledge and within the conceptual and experimental boundaries of an investigation. In this study, we used a text mining method called Latent Dirichlet Allocation (LDA) to identify underlying patterns in students’ written scientific arguments about a complex scientific phenomenon called the Albedo Effect. We further examined how identified patterns compare to existing frameworks related to explaining evidence to support claims and attributing sources of uncertainty. LDA was applied to electronically stored arguments written by 2472 students and concerning how decreases in sea ice affect global temperatures. The results indicated that each content topic identified in the explanations by the LDA— “data only,” “reasoning only,” “data and reasoning combined,” “wrong reasoning types,” and “restatement of the claim”—could be interpreted using the claim–evidence–reasoning framework. Similarly, each topic identified in the students’ uncertainty attributions— “self-evaluations,” “personal sources related to knowledge and experience,” and “scientific sources related to reasoning and data”—could be interpreted using the taxonomy of uncertainty attribution. These results indicate that LDA can serve as a tool for content analysis that can discover semantic patterns in students’ scientific argumentation in particular science domains and facilitate teachers’ providing help to students.

Keywords: text mining, latent dirichlet allocation, educational data mining, scientific argumentation

Citation and article link:  Wanli Xing, Hee-Sun Lee and Antonette Shibani (2020). Identifying patterns in students’ scientific argumentation: content analysis through text mining using Latent Dirichlet Allocation. Educational Technology Research and Development. https://doi.org/10.1007/s11423-020-09761-w.
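
For readers curious about what this looks like in practice, here is a minimal, runnable sketch of unsupervised topic discovery with LDA using scikit-learn. It is my own illustration, not the paper’s pipeline: the toy texts stand in for student arguments, and the parameter choices are arbitrary.

#A minimal LDA sketch (illustration only, not the paper's pipeline)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

#Toy stand-ins for student arguments
arguments = [
    "less sea ice reflects less sunlight so the temperature rises",
    "the data shows temperature increases as sea ice decreases",
    "i am not sure because my knowledge of the model is limited",
    "the graph evidence supports the claim that albedo decreases",
]

#Bag-of-words counts, then an LDA model with 2 topics
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(arguments)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

#Top words per topic, for manual interpretation against a framework
#such as claim-evidence-reasoning
terms = vectorizer.get_feature_names()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print("Topic %d: %s" % (k, ", ".join(top)))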

Working with Jupyter notebooks #code

Jupyter is an open source program that helps you share and run code in many different programming languages. Jupyter notebooks are great for quickly prototyping different versions of code, as they are easy to edit and re-run to try different outputs. The format of a Jupyter notebook is similar to the Markdown-based reports commonly used in R: it can contain blocks of text, code, equations and results (including visualizations) all on one page. We’ve used Jupyter notebooks to run text analysis workshops at conferences, and the feedback was pretty good.

The Writing Analytics workshop is starting at #LAK18. Jupyter notebooks are being used. #great pic.twitter.com/56Zd66ku9L

I find that Jupyter notebooks are great for sharing code and results with different people, and if you host them, they save a lot of trouble when organising a workshop where participants would otherwise need to install software. They work well for a non-technical audience too, since participants can choose to ignore what’s inside a code block by simply running it and focusing on the results. Notebooks are quite popular now for data science experiments, so this post is a good place to start getting to know and use them. You can take an existing notebook (for example, one downloaded from GitHub) and play with it, or create your own Jupyter notebook from scratch. This post will guide you through creating your own notebook from scratch, demonstrating some basic text analysis in Python.

Installing Jupyter

If you want to try a Jupyter notebook first without installing anything, you can do so in a notebook hosted on the official Jupyter site. If you want to install your own copy of Jupyter on your machine to develop code, use one of the two options below:

  • If you are new to Python programming and don’t have Python installed on your machine, the easiest way to install Jupyter is by downloading the Anaconda distribution. This comes with Python built in (you can choose either the 2.7 or the 3.6 version of Python when you download the distribution – the code I’m writing in this post is in 2.7).
  • If you already have Python working on your machine (as I did), the easiest way is to install Jupyter using the pip command, as you would for any Python package. Note that if pip and Python are already set up in your system path, you can simply use $ pip install jupyter from the command prompt.

Now that Jupyter is installed, type the command below in your Anaconda prompt/command prompt to start a Jupyter notebook:

$ jupyter notebook

The Jupyter homepage opens in your default browser at http://localhost:8888, displaying the files present in the current folder, as shown below. You can now create a new Python Jupyter notebook by clicking New -> Python 2 (or Python 3 if you have Python version 3). You can move between folders or create a new folder for your Python notebooks. To change the default opening directory, first move to the required path using cd in the command prompt, and then type $ jupyter notebook. Open the created notebook, which will look like this:

Each cell is a code block by default; this can be changed to a Markdown text block from the drop-down list (see the figure above) to add narrative text accompanying the Python code. Now name your notebook, and try adding both a code block and a Markdown block with different levels of text, following the sample here:
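
Since the screenshots aren’t reproduced here, a minimal sample of my own (the exact content is up to you): a Markdown cell with different levels of text might contain

# My First Notebook
## Getting started
This notebook demonstrates some *basic text analysis* in Python.

and a code cell might contain:

print "Hello Jupyter"   #Python 2 print statement, matching this post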

To execute the blocks, click the Run button (alternatively, use Ctrl+Enter on Windows – keyboard shortcuts can be found under Help -> Keyboard Shortcuts). This renders the output of your code and your Markdown text like this:

That’s it! You have a simple Jupyter notebook running on your machine. Now, to try a bit more, here’s the sample code you can download and run to do some basic text analysis. I’ve defined three steps in this code: importing required packages, defining input text, and analysis. Before importing the packages/libraries you need in step 1, however, they should first be installed on your machine. This can be done using the pip command in the command prompt/Anaconda prompt like this: $ pip install wordcloud (if you run into problems with that, the other option is to download an appropriate version of the package’s wheel from here and install it using $ pip install C:/some-dir/some-file.whl).

Python code for the three steps is below:

#Step 1 - Importing libraries
 
from wordcloud import WordCloud, STOPWORDS  #For word cloud generation
import matplotlib.pyplot as plt             #For displaying figures
import re                          #Regular expressions for string operations


#Step 2 - Defining input text

inputtext = "A cockatoo is a parrot that is any of the 21 species belonging to the bird family Cacatuidae, the only family in the superfamily Cacatuoidea. Along with the Psittacoidea (true parrots) and the Strigopoidea (large New Zealand parrots), they make up the order Psittaciformes (parrots). The family has a mainly Australasian distribution, ranging from the Philippines and the eastern Indonesian islands of Wallacea to New Guinea, the Solomon Islands and Australia. Cockatoos are recognisable by the showy crests and curved bills. Their plumage is generally less colourful than that of other parrots, being mainly white, grey or black and often with coloured features in the crest, cheeks or tail. On average they are larger than other parrots; however, the cockatiel, the smallest cockatoo species, is a small bird. The phylogenetic position of the cockatiel remains unresolved, other than that it is one of the earliest offshoots of the cockatoo lineage. The remaining species are in two main clades. The five large black coloured cockatoos of the genus Calyptorhynchus form one branch. The second and larger branch is formed by the genus Cacatua, comprising 11 species of white-plumaged cockatoos and four monotypic genera that branched off earlier; namely the pink and white Major Mitchell's cockatoo, the pink and grey galah, the mainly grey gang-gang cockatoo and the large black-plumaged palm cockatoo. Cockatoos prefer to eat seeds, tubers, corms, fruit, flowers and insects. They often feed in large flocks, particularly when ground-feeding. Cockatoos are monogamous and nest in tree hollows. Some cockatoo species have been adversely affected by habitat loss, particularly from a shortage of suitable nesting hollows after large mature trees are cleared; conversely, some species have adapted well to human changes and are considered agricultural pests. Cockatoos are popular birds in aviculture, but their needs are difficult to meet. The cockatiel is the easiest cockatoo species to maintain and is by far the most frequently kept in captivity. White cockatoos are more commonly found in captivity than black cockatoos. Illegal trade in wild-caught birds contributes to the decline of some cockatoo species in the wild. Source: https://en.wikipedia.org/wiki/Cockatoo"


print "\nInput text for analysis:\n"
print inputtext


#Step 3 - Analysis

print "Summary statistics of input text:"

wordcount = len(re.findall(r'\w+', inputtext))
print "Wordcount: ", wordcount

charcount = len(inputtext) #including spaces
print "Number of characters: ", charcount

#More options for wordclouds here: https://github.com/amueller/word_cloud
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='black').generate(inputtext)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

The downloadable ipynb file is available on GitHub.

Other notes:

  • This post is intended for anyone who wants to start working with Jupyter notebooks, and assumes prior understanding of programming in Python. The Jupyter notebook is just another environment to work with code easily; the coding process itself is still very traditional. If you’re new to Python programming, this website is a good place to start.
  • You can use multiple versions of Python to run Jupyter notebooks by changing the kernel (the computational engine that executes the code). I have both Python 2 and Python 3 installed, and I switch between them for different programs as needed.
  • While Jupyter notebooks are mainly used to run Python code, they can also be used to run R programs, which requires the R kernel to be installed. The blog post below is a useful guide for doing that: https://www.datacamp.com/community/blog/jupyter-notebook-r

Tools for automated rhetorical analysis of academic writing

Alert – Long post!

In this post, I’m presenting a summary of my review on tools for automatically analyzing rhetorical structures from academic writing.

The tools considered are designed to cater to different users and purposes. AWA and RWT aim to provide feedback for improving students’ academic writing. Mover and SAPIENTA, on the other hand, help researchers identify the structure of research articles. ‘Mover’ even allows users to give a second opinion on the classification of moves and add new training data (this can lead to a less accurate model if students with less expertise add potentially wrong training data). However, these tools have a common thread and fulfil the following criteria:

  • They look at scientific text – full research articles, abstracts or introductions. Tools that automate argumentative zoning of other open text (example) are not considered.
  • They automate the identification of rhetorical structures (zones, moves) in research articles (RA), with the sentence as the unit of analysis.
  • They are broadly based on the Argumentative Zoning (AZ) scheme by Simone Teufel or the CARS model by John Swales (either the original scheme or a modified version of it).

Tools (in alphabetical order):

  1. Academic Writing Analytics (AWA) – Summary notes here

AWA also has a reflective parser to give feedback on students’ reflective writing, but the focus of this post is on the analytical parser. AWA demo, video courtesy of Dr. Simon Knight:

  2. Mover – Summary notes here

Available for download as a standalone application. Sample screenshot below:

[Screenshot: the Mover standalone application]

  3. Research Writing Tutor (RWT) – Summary notes here

RWT demo, video courtesy of Dr. Elena Cotos:

  4. SAPIENTA – Summary notes here.

Available for download as a standalone Java application, or accessible as a web service. Sample screenshot of tagged output from the SAPIENTA web service below:

[Screenshot: tagged output from the SAPIENTA web service]

Annotation Scheme:

The general aim of the schemes used is to be applicable to all academic writing, and this has been successfully tested on data from different disciplines. A comparison of the schemes used by the tools is shown in the table below:

Tool: AWA
Source & description: AWA analytical scheme (modified from AZ for sentence-level parsing)
Annotation scheme:
  • Summarizing
  • Background knowledge
  • Contrasting ideas
  • Novelty
  • Significance
  • Surprise
  • Open question
  • Generalizing

Tool: Mover
Source & description: Modified CARS model (three main moves and further steps)
Annotation scheme:
  1. Establish a territory
    • Claim centrality
    • Generalize topics
    • Review previous research
  2. Establish a niche
    • Counter claim
    • Indicate a gap
    • Raise questions
    • Continue a tradition
  3. Occupy the niche
    • Outline purpose
    • Announce research
    • Announce findings
    • Evaluate research
    • Indicate RA structure

Tool: RWT
Source & description: Modified CARS model (3 moves, 17 steps)
Annotation scheme:
  Move 1. Establishing a territory
    1. Claiming centrality
    2. Making topic generalizations
    3. Reviewing previous research
  Move 2. Identifying a niche
    4. Indicating a gap
    5. Highlighting a problem
    6. Raising general questions
    7. Proposing general hypotheses
    8. Presenting a justification
  Move 3. Addressing the niche
    9. Introducing present research descriptively
    10. Introducing present research purposefully
    11. Presenting research questions
    12. Presenting research hypotheses
    13. Clarifying definitions
    14. Summarizing methods
    15. Announcing principal outcomes
    16. Stating the value of the present research
    17. Outlining the structure of the paper

Tool: SAPIENTA
Source & description: Finer-grained AZ scheme (CoreSC scheme with 11 categories in the first layer)
Annotation scheme:
  • Background (BAC)
  • Hypothesis (HYP)
  • Motivation (MOT)
  • Goal (GOA)
  • Object (OBJ)
  • Method (MET)
  • Model (MOD)
  • Experiment (EXP)
  • Observation (OBS)
  • Result (RES)
  • Conclusion (CON)

Method:

The tools are built on different data sets and methods for automating the analysis. Most of them use manually annotated data as a gold standard for training models to automatically classify the categories. Details below:

Tool: AWA
  • Data type: Any research writing
  • Automation method: NLP rule-based – the Xerox Incremental Parser (XIP) annotates rhetorical functions in discourse.

Tool: Mover
  • Data type: Abstracts
  • Automation method: Supervised learning – a Naïve Bayes classifier, with data represented as a bag of clusters with location information.

Tool: RWT
  • Data type: Introductions
  • Automation method: Supervised learning using Support Vector Machines (SVM), with an n-dimensional vector representation and n-gram features.

Tool: SAPIENTA
  • Data type: Full articles
  • Automation method: Supervised learning using SVM with sentence aspect features, plus sequence labelling using Conditional Random Fields (CRF) for sentence dependencies.

Others:

  • The SciPo tool helps students write summaries and introductions for scientific texts in Portuguese.
  • Another tool, CARE, is a word concordancer used to search for words and moves in research abstracts – summary notes here.
  • A machine learning approach considering three different schemes for annotating scientific abstracts (no tool).

If you think I’ve missed a tool which does similar automated tagging in research articles, do let me know so I can include it in my list 🙂

Notes: Discourse classification into rhetorical functions

Reference: Cotos, E., & Pendar, N. (2016). Discourse classification into rhetorical functions for AWE feedback. CALICO Journal, 33(1), 92.

Background:

  • Computational techniques can be exploited to provide individualized feedback to learners on writing.
  • Genre analysis of writing identifies moves (communicative goals) and steps (rhetorical functions that help achieve the goal) [Swales, 1990].
  • Natural language processing (NLP) and machine learning categorization approaches are widely used to automatically identify discourse structures (e.g. Mover, and prior work on IADE).

Purpose:

  • To develop an automated analysis system, ‘Research Writing Tutor’ (RWT), that identifies rhetorical structures (moves and steps) in research writing and provides feedback to students.

Method:

  • Sentence level analysis – each sentence is classified into a move, and a step within that move.
  • Data: Introduction sections from 1,020 articles across 51 disciplines (20 articles per discipline), totalling 1,322,089 words.
  • Annotation scheme:
    • 3 moves, 17 steps – refer to Table 1 of the original paper for the detailed annotation scheme (based on the CARS model).
    • Manual annotation using XML-based markup in the Callisto Workbench.
  • Supervised learning approach steps:
    1. Feature selection:
      • Important features – unigrams, trigrams
      • n-gram feature set contained 5,825 unigrams and 11,630 trigrams for moves, and 27,689 unigrams and 27,160 trigrams for steps.
    2. Sentence representation:
      • Each sentence is represented as an n-dimensional vector in the R^n Euclidean space.
      • A Boolean representation indicates the presence or absence of each feature in the sentence.
    3. Training classifier:
      • SVM model for classification.
      • 10-fold cross validation.
      • Precision was higher than recall – 70.3% versus 61.2% for the move classifier and 68.6% versus 55% for the step classifier – as the objective was to maximize accuracy.
      • The RWT analyzer has two cascaded SVMs – a move classifier followed by a step classifier (a toy sketch follows this list).
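
To make the pipeline concrete, here is a toy sketch of boolean n-gram sentence vectors feeding a linear SVM, using scikit-learn. The sentences and labels are my own invented stand-ins, not RWT’s data or code.

#A toy move classifier with boolean n-gram features and a linear SVM
#(illustration only; RWT cascades a move classifier and a step classifier)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

sentences = ["Writing feedback tools have received growing attention.",
             "However, little is known about their classroom adoption.",
             "This paper presents an automated analysis of introductions."]
moves = ["move1_territory", "move2_niche", "move3_occupy"]  #toy labels

#Boolean (presence/absence) unigram-to-trigram features; the paper
#used unigrams and trigrams specifically
vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
X = vectorizer.fit_transform(sentences)

move_clf = LinearSVC().fit(X, moves)

new = vectorizer.transform(["We address this gap with a novel parser."])
print("Predicted move: %s" % move_clf.predict(new)[0])
#A second, cascaded SVM would then assign a step within the predicted move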

Results:

  • The move and step classifiers predict some elements better than others (refer to the paper for detailed results):
    • Move 2 was the most difficult to identify (sparse training data).
    • Move 1 gained the best recall (less ambiguous cues).
    • 10 out of 17 steps were predicted well.
    • Overall move accuracy was 72.6%; step accuracy was 72.9%.

Future Work:

  • Moving beyond sentence level to incorporate context information and sequence of moves/steps.
  • Knowledge-based approach for hard-to-identify steps – hand-written rules and patterns.
  • Voting algorithm using independent analyzers.

Notes: XIP – Automated rhetorical parsing of scientific metadiscourse

Reference: Simsek, D., Buckingham Shum, S., Sandor, A., De Liddo, A., & Ferguson, R. (2013). XIP Dashboard: visual analytics from automated rhetorical parsing of scientific metadiscourse. In: 1st International Workshop on Discourse-Centric Learning Analytics, 8 Apr 2013, Leuven, Belgium.

Background:

Learners should be able to critically evaluate research articles and identify the claims and ideas in scientific literature.

Purpose:

  • Automating analysis of research articles to identify evolution of ideas and findings.
  • Describing the Xerox Incremental Parser (XIP) which identifies rhetorically significant structures from research text.
  • Designing a visual analytics dashboard to provide overviews of the student corpus.

Method:

  • Argumentative Zoning (AZ), by Simone Teufel, is used to annotate moves in research articles.
  • Rhetorical moves tagged by XIP partly overlap with and partly differ from the AZ scheme: SUMMARIZING, BACKGROUND KNOWLEDGE, CONTRASTING IDEAS, NOVELTY, SIGNIFICANCE, SURPRISE, OPEN QUESTION, GENERALIZING (a toy illustration of rule-based tagging follows this list).
  • Sample discourse moves:
    • Summarizing: “The purpose of this article….”
    • Contrasting ideas: “With an absence of detailed work…”
      • Sub-classes: novelty, surprise, importance, emerging issue, open question
  • XIP produces a raw output file containing semantic tags and concepts extracted from the text.
  • Data: Papers from the LAK and EDM conferences and journals – 66 LAK and 239 EDM papers, yielding 7,847 sentences and 40,163 concepts.
  • Dashboard design – refer to the original paper for the process involved in prototyping the visualizations.
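
XIP itself is a proprietary rule-based parser, but a toy cue-phrase matcher conveys the flavour of rule-based rhetorical tagging. This is entirely my own illustration with invented patterns, not XIP’s grammars.

#A toy rule-based tagger for rhetorical moves (invented cue phrases;
#XIP's actual grammars are far richer)
import re

CUES = {
    "SUMMARIZING": [r"the purpose of this (article|paper)", r"this paper presents"],
    "CONTRASTING_IDEAS": [r"with an absence of", r"\bhowever\b", r"in contrast"],
    "NOVELTY": [r"\bnovel\b", r"for the first time"],
}

def tag_sentence(sentence):
    text = sentence.lower()
    for move, patterns in CUES.items():
        if any(re.search(p, text) for p in patterns):
            return move
    return "UNTAGGED"

print(tag_sentence("The purpose of this article is to survey feedback tools."))
#-> SUMMARIZING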

Tool:

  • XIP is now embedded in the Academic Writing Analytics (AWA) tool by UTS. AWA provides analytical and reflective reports on students’ writing.

Notes: Automatic recognition of conceptualization zones in scientific articles

Reference:
Liakata, M., Saha, S., Dobnik, S., Batchelor, C., & Rebholz-Schuhmann, D. (2012). Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7), 991-1000.

Background:

  • Scientific discourse analysis helps in distinguishing the nature of knowledge in research articles (facts, hypothesis, existing and new work).
  • Annotation schemes vary across disciplines in scope and granularity.

Purpose:

  • To build a finer grained annotation scheme to capture the structure of scientific articles (the CoreSC scheme).
  • To automate the annotation of full articles at the sentence level with the CoreSC scheme using machine learning classifiers (SAPIENT, “Semantic Annotation of Papers: Interface & ENrichment Tool”, available for download here).

Method:

Data:

  • 265 articles from biochemistry and chemistry, containing 39,915 sentences (>1 million words), annotated in three phases by multiple experts.
  • The XML-aware sentence splitter SSSplit was used for splitting sentences.

Scheme:

  • First layer of the CoreSC scheme with 11 categories for annotation:
    • Background (BAC), Hypothesis (HYP), Motivation (MOT), Goal (GOA), Object (OBJ), Method (MET), Model (MOD), Experiment (EXP), Observation (OBS), Result (RES) and Conclusion (CON).

Implementation:

  1. Text classification:
    • Sentences are classified independently of each other.
    • Uses Support Vector Machines (SVM).
    • Features are extracted from different aspects of a sentence, ranging from global features (location within the paper, document structure) to local features; for the complete list of features used, refer to the paper.
  2. Sequence labelling:
    • Labels are assigned to satisfy dependencies among sentences.
    • Uses Conditional Random Fields (CRF); a toy sketch of this step follows.
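
As a concrete illustration of the sequence labelling step, here is a toy sketch using the sklearn-crfsuite library. The features, sentences and labels are invented stand-ins, not the paper’s feature set.

#A toy CRF labelling sentences with CoreSC categories as a sequence
#(illustration only, using sklearn-crfsuite; not the paper's code)
import sklearn_crfsuite

def features(sentences, i):
    #A mix of local and document-structure features, loosely echoing
    #the kinds of features described in the paper
    return {
        "first_word": sentences[i].split()[0].lower(),
        "relative_position": float(i) / len(sentences),
        "ends_with_ed": sentences[i].rstrip(".").endswith("ed"),  #crude verb cue
    }

doc = ["We investigate cockatoo phylogeny.",
       "DNA was extracted and sequenced.",
       "Two main clades were identified."]
labels = ["GOA", "MET", "RES"]  #toy CoreSC labels

X = [[features(doc, i) for i in range(len(doc))]]  #one document = one sequence
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])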

Results and discussion:

  • F-score: ranges from 76% for EXP (Experiment) to 18% for the low frequency category MOT (Motivation) [refer to the complete results from runs configured with different settings and features in Table 2 of the paper].
  • Most important features: n-grams (primarily bigrams), grammatical triples (GRs), verbs, and global features such as history (the sequence of labels) and section headings (a detailed explanation of the features is given in the paper).
  • Classifiers: LibS has the highest accuracy at 51.6%, CRF at 50.4% and LibL at 47.7%.

Application/Future Work:

  • Can be applied to create executive summaries of full papers (based on the entire content, not just the abstract) to identify key information in a paper.
  • CoreSC annotated biology papers to be used for guiding information extraction and retrieval.
  • Generalization to new domains in progress.

Notes: Computational analysis of move structures in academic abstracts

Reference:

Wu, J. C., Chang, Y. C., Liou, H. C., & Chang, J. S. (2006, July). Computational analysis of move structures in academic abstracts. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 41-44). Association for Computational Linguistics.

Background:

  • Swales’ pattern for research articles: Introduction, Methods, Results, Discussion (IMRD) and the Creating a Research Space (CARS) model.
  • Studying the rhetorical structure of texts has been found useful for aiding reading and writing (Mover tool notes here).

Purpose:

  • To automatically analyze move structures (Background, Purpose, Method, Result, and Conclusion) from research article abstracts.
  • To develop an online learning system CARE (Concordancer for Academic wRiting in English) using move structures to help novice writers.

Method:

  • Processes involved:

[Figure: processes involved in the CARE system]

  • The TANGO Concordancer is used for extracting collocations with chunking and clause information – sample Verb-Noun collocation structures in the corpus: VP+NP, VP+PP+NP, and VP+NP+PP (Ref: Jian, J. Y., Chang, Y. C., & Chang, J. S. (2004, July). TANGO: Bilingual collocational concordancer. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions (p. 19). Association for Computational Linguistics.)
    • The TANGO tool is accessible here.
  • Data: Corpus of 20,306 abstracts (95,960 sentences) from CiteSeer. Moves were manually tagged in 106 abstracts containing 709 sentences; 72,708 collocation types were extracted, and 317 collocations were manually tagged with moves.
  • A Hidden Markov Model (HMM) was trained using 115 abstracts containing 684 sentences (a toy sketch follows this list).
  • Different parameters were evaluated for the HMM model: “the frequency of collocation types, the number of sentences with collocation in each abstract, move sequence score and collocation score”
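
To illustrate the idea (not the CARE system itself), here is a toy HMM decoded with the Viterbi algorithm. All probabilities are invented for illustration, and the observations are simplified cue classes standing in for collocations; the paper estimates its parameters from tagged abstracts.

#A toy HMM for tagging abstract sentences with moves (invented numbers)
import math

moves = ["Background", "Purpose", "Method", "Result", "Conclusion"]
cues = ["background_cue", "purpose_cue", "method_cue", "result_cue", "conclusion_cue"]

start = {"Background": 0.8, "Purpose": 0.1, "Method": 0.05,
         "Result": 0.03, "Conclusion": 0.02}
#Favour the canonical move sequence, with smoothing elsewhere
trans = {p: {n: 0.05 for n in moves} for p in moves}
for a, b in zip(moves, moves[1:]):
    trans[a][b] = 0.8
trans["Conclusion"]["Conclusion"] = 0.8
#Each move mostly emits its own cue class
emit = {m: {c: 0.05 for c in cues} for m in moves}
for m, c in zip(moves, cues):
    emit[m][c] = 0.8

def viterbi(obs):
    #Standard log-space Viterbi decoding
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in moves}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in moves:
            best = max(moves, key=lambda p: V[t-1][p] + math.log(trans[p][s]))
            V[t][s] = V[t-1][best] + math.log(trans[best][s]) + math.log(emit[s][obs[t]])
            back[t][s] = best
    path = [max(moves, key=lambda s: V[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["background_cue", "purpose_cue", "method_cue", "result_cue"]))
#-> ['Background', 'Purpose', 'Method', 'Result']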

Results:

  • A precision of 80.54% was achieved when 627 sentences qualified with the following parameters: weight of the transition probability function 0.7, and a frequency threshold of 18 for a collocation to be applicable (crucial for excluding unreliable collocations).

Conclusion:

  • The CARE system interface was created for querying and looking up sentences for a specific move.
  • The system is expected to help non-native speakers write abstracts for research articles.

Notes: Visualizing sequential patterns for text mining

Reference:

Wong, P. C., Cowley, W., Foote, H., Jurrus, E., & Thomas, J. (2000). Visualizing sequential patterns for text mining. In Information Visualization, 2000. InfoVis 2000. IEEE Symposium on (pp. 105-111). IEEE.

Background:

  • Mining sequential patterns aims to identify recurring patterns in data over a period of time.
  • A pattern is a finite series of elements from the same domain, e.g. A -> B -> C -> D.
  • Each pattern has a minimum ‘support’ value, which indicates the percentage of sequences in which the pattern occurs (e.g. 90% of people who did the first process then did the second process, followed by the third). A toy support computation is sketched after this list.
  • Sequential pattern vs association rule:
    • Sequential pattern – studies ordering/arrangement of elements, e.g. A -> B -> C -> D
    • Association rule – studies togetherness, e.g. A+B+C -> D
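
Here is that toy support computation: the support of a sequential pattern is the fraction of sequences that contain the pattern’s elements in order (not necessarily adjacent). The data is invented for illustration.

#Support of a sequential pattern = fraction of sequences containing
#the pattern's elements in order (not necessarily adjacent)
def contains_in_order(sequence, pattern):
    it = iter(sequence)
    return all(elem in it for elem in pattern)  #'in' consumes the iterator, preserving order

sequences = [["A", "B", "C", "D"],
             ["A", "C", "B", "D"],
             ["B", "A", "C", "D"]]
pattern = ["A", "B", "D"]

support = sum(contains_in_order(s, pattern) for s in sequences) / float(len(sequences))
print("Support of %s: %.2f" % (" -> ".join(pattern), support))  #-> 0.67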

Purpose:

  • Presenting a visual data mining system that combines pattern discovery and visualizations.

Method:

Datasets:

An open source corpus containing 1,170 news articles from 1991 to 1997, plus harvested news from 1990 from the TREC5 distribution.

Pre-processing:

  1. Topic extraction: identifies the topics in documents based on the co-occurrence of words. Words separated by whitespace are evaluated; stemming is applied, and prepositions, pronouns, adjectives and gerunds are ignored.
  2. Multiresolution binning: bins articles with the same timestamp (e.g. binning by day, week, month or year), as sketched below.
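
As a quick sketch of multiresolution binning (my own toy example, not the paper’s code), articles can be grouped by day, month or year depending on the chosen resolution:

#Toy multiresolution binning of timestamped articles (illustration only)
from collections import defaultdict
from datetime import date

articles = [("ice shelf melts", date(1991, 3, 2)),
            ("new parrot species", date(1991, 3, 9)),
            ("markets rally", date(1992, 7, 1))]

def bin_articles(articles, resolution="month"):
    keyfns = {"day":   lambda d: d.isoformat(),
              "month": lambda d: "%d-%02d" % (d.year, d.month),
              "year":  lambda d: str(d.year)}
    bins = defaultdict(list)
    for title, d in articles:
        bins[keyfns[resolution](d)].append(title)
    return dict(bins)

print(bin_articles(articles, "month"))
#-> {'1991-03': ['ice shelf melts', 'new parrot species'], '1992-07': ['markets rally']}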

Discovery of sequential patterns by Visualization:

  • Plotting topics/topic combinations over time.
  • Strength: overall patterns and individual occurrences of events can be viewed quickly.
  • Weakness: no knowledge of the exact connections that make up a pattern, and no statistical support for individual patterns.

Discovery of sequential patterns by Data mining:

  • Builds patterns on an n-ary tree with elements as nodes.
  • Patterns are valid if their support value is greater than a threshold.
  • A sample pattern mined from given input data is shown in Figure 2 of the paper.
  • Strength: provides accurate statistical (support) values for all weak and strong patterns.
  • Weakness: loses temporal and locality information; the large number of patterns produced in text format makes human interpretation harder.

Visual Data Mining system:

[Figure: the visual data mining system]

  • Combines visualization and data mining to compensate for each other’s weaknesses (refer to Figures 4 & 5 in the paper to see the pattern visualizations).
  • The binning resolution can be changed to see different patterns based on day, week, month, year, etc.
  • Patterns associated with a particular topic can be picked out.

Result/Discussion:

  • The strength of a pattern is not easily identifiable from the visualization without statistical measures; pattern mining is enhanced by graphical encoding with spatial and temporal information.
  • Knowledge discovery by humans is aided by combining statistical data mining and visualization.

Future Work:

  • Handling larger data sets using secondary memory support, and improving the display.
  • Integrating more techniques, such as association rules, into the visual data mining environment.