Notes: Automatic recognition of conceptualization zones in scientific articles

Reference:
Liakata, M., Saha, S., Dobnik, S., Batchelor, C., & Rebholz-Schuhmann, D. (2012). Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7), 991-1000.

Background:

  • Scientific discourse analysis helps distinguish the kinds of knowledge conveyed in research articles (facts, hypotheses, existing work and new work).
  • Annotation schemes vary across disciplines in scope and granularity.

Purpose:

  • To build a finer-grained annotation scheme that captures the structure of scientific articles (the CoreSC scheme).
  • To automate sentence-level annotation of full articles with the CoreSC scheme using machine learning classifiers (SAPIENT, “Semantic Annotation of Papers: Interface & ENrichment Tool”, is available for download).

Method:

Data:

  • 265 articles from biochemistry and chemistry, containing 39,915 sentences (>1 million words), annotated in three phases by multiple experts.
  • The XML-aware sentence splitter SSSplit was used to split sentences.

Scheme:

  • First layer of the CoreSC scheme with 11 categories for annotation:
    • Background (BAC), Hypothesis (HYP), Motivation (MOT), Goal (GOA), Object (OBJ), Method (MET), Model (MOD), Experiment (EXP), Observation (OBS), Result (RES) and Conclusion (CON).
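
The eleven first-layer categories can be held in a small lookup structure. The sketch below is only an illustration for these notes: the class name `CoreSC` and the use of a Python `Enum` are assumptions, while the codes and names come from the list above.

```python
from enum import Enum


class CoreSC(Enum):
    """First-layer CoreSC categories (code -> full name), as listed in these notes."""
    BAC = "Background"
    HYP = "Hypothesis"
    MOT = "Motivation"
    GOA = "Goal"
    OBJ = "Object"
    MET = "Method"
    MOD = "Model"
    EXP = "Experiment"
    OBS = "Observation"
    RES = "Result"
    CON = "Conclusion"


# Look up a category by its three-letter code.
label = CoreSC["HYP"]
print(label.name, label.value)
```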

Implementation:

  1. Text classification:
    • Sentences are classified independently of each other.
    • Uses Support Vector Machine (SVM).
    • Features cover different aspects of a sentence, ranging from global features (location within the paper, document structure) to local features. For the complete list of features used, refer to the paper.
  2. Sequence labelling:
    • Labels assigned to satisfy dependencies among sentences.
    • Uses Conditional Random Fields (CRF).
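
The difference between the two approaches can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the feature names, the per-sentence scores and the transition scores are all made up, and no model is trained. `sentence_features` shows one hypothetical way to combine local features (bigrams) with global ones (relative position, section heading); `classify_independently` picks the best label per sentence; `viterbi` picks the best label sequence, which is the standard decoding step for a linear-chain CRF.

```python
def sentence_features(tokens, index, total, heading):
    """Hypothetical feature map: bigrams (local) plus position and heading (global)."""
    feats = {f"bigram={a}_{b}": 1.0 for a, b in zip(tokens, tokens[1:])}
    feats["relative_position"] = index / max(total - 1, 1)
    feats[f"heading={heading.lower()}"] = 1.0
    return feats


def classify_independently(scores):
    """Text classification view: take the top-scoring label for each sentence alone."""
    return [max(s, key=s.get) for s in scores]


def viterbi(scores, transitions):
    """Sequence labelling view: find the label path with the highest total score.

    scores: one dict per sentence mapping label -> score.
    transitions: dict mapping (previous_label, current_label) -> score (default 0).
    """
    labels = list(scores[0])
    best = {label: scores[0][label] for label in labels}
    back = []
    for s in scores[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + s[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    path = [max(labels, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up scores for three consecutive sentences and two labels.
scores = [{"MET": 2.0, "RES": 0.0}, {"MET": 0.4, "RES": 0.5}, {"MET": 2.0, "RES": 0.0}]
# Made-up transition scores that reward staying in the same category.
transitions = {("MET", "MET"): 1.0, ("RES", "RES"): 1.0,
               ("MET", "RES"): -1.0, ("RES", "MET"): -1.0}

print(classify_independently(scores))  # ['MET', 'RES', 'MET']
print(viterbi(scores, transitions))    # ['MET', 'MET', 'MET']
```

In this toy case the middle sentence is nearly ambiguous on its own, so the independent classifier labels it RES, while the sequence view keeps it as MET because its neighbours are both confidently MET.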

Results and discussion:

  • F-score: ranges from 76% for EXP (Experiment) to 18% for the low-frequency category MOT (Motivation) [see Table 2 of the paper for complete results from runs configured with different settings and features].
  • Most important features: n-grams (primarily bigrams), grammatical triples (GRs), verbs, and global features such as history (sequence of labels) and section headings (Detailed explanation for the features
  • Classifiers: LibS (LibSVM) has the highest accuracy at 51.6%, followed by CRF at 50.4% and LibL (LibLinear) at 47.7%.
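
For reference, the two measures reported above can be computed from gold and predicted labels as follows. This is a minimal sketch using the standard definitions (F-score as the harmonic mean of precision and recall, accuracy as the fraction of correctly labelled sentences); the function names and the tiny label lists are made up for illustration and are not data from the paper.

```python
def f_scores(gold, predicted):
    """Per-category F-score: harmonic mean of precision and recall."""
    result = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        result[label] = 2 * precision * recall / total if total else 0.0
    return result


def accuracy(gold, predicted):
    """Fraction of sentences whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


# Made-up labels for four sentences.
gold = ["EXP", "EXP", "MOT", "RES"]
predicted = ["EXP", "EXP", "RES", "RES"]

print(f_scores(gold, predicted))
print(accuracy(gold, predicted))
```

A category that is rare and never predicted correctly, like MOT in this toy input, gets an F-score of 0 even when overall accuracy is fairly high, which is why per-category scores are reported alongside accuracy.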

Application/Future Work:

  • Can be applied to create executive summaries of full papers (based on the entire content and not just abstracts) to identify key information in a paper.
  • CoreSC-annotated biology papers are to be used for guiding information extraction and retrieval.
  • Generalization to new domains in progress.
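
As a rough illustration of the summary application, the sketch below picks sentences from a CoreSC-annotated paper by category. It is a hypothetical selection rule written for these notes, not the procedure used in the paper: the function name `summarize`, the chosen categories, the per-category limit and the example sentences are all assumptions.

```python
def summarize(annotated, categories=("GOA", "MET", "RES", "CON"), per_category=1):
    """Keep the first `per_category` sentences of each chosen category, in document order.

    annotated: list of (category_code, sentence) pairs for the full paper.
    """
    counts = {}
    summary = []
    for label, sentence in annotated:
        if label in categories and counts.get(label, 0) < per_category:
            summary.append(sentence)
            counts[label] = counts.get(label, 0) + 1
    return summary


# Made-up annotated sentences.
paper = [
    ("BAC", "Enzyme X has been studied for decades."),
    ("GOA", "We aim to characterise its binding site."),
    ("MET", "We used X-ray crystallography."),
    ("MET", "Crystals were grown at 4 degrees Celsius."),
    ("RES", "The structure was resolved at 2 angstroms."),
    ("CON", "The binding site differs from earlier models."),
]

for line in summarize(paper):
    print(line)
```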