Reference:
Liakata, M., Saha, S., Dobnik, S., Batchelor, C., & Rebholz-Schuhmann, D. (2012). Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7), 991-1000.
Background:
- Scientific discourse analysis helps distinguish the nature of the knowledge in research articles (facts, hypotheses, existing work and new work).
- Annotation schemes vary across disciplines in scope and granularity.
Purpose:
- To build a finer-grained annotation scheme that captures the structure of scientific articles (the CoreSC scheme).
- To automate sentence-level annotation of full articles with the CoreSC scheme using machine learning classifiers (SAPIENT, the “Semantic Annotation of Papers: Interface & ENrichment Tool”, is available for download).
Method:
Data:
- 265 articles from biochemistry and chemistry, containing 39915 sentences (>1 million words), annotated in three phases by multiple experts.
- The XML-aware sentence splitter SSSplit was used to split the text into sentences.
Scheme:
- First layer of the CoreSC scheme with 11 categories for annotation:
- Background (BAC), Hypothesis (HYP), Motivation (MOT), Goal (GOA), Object (OBJ), Method (MET), Model (MOD), Experiment (EXP), Observation (OBS), Result (RES) and Conclusion (CON).
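The 11 first-layer categories above can be held in a small lookup structure when working with annotated sentences. This is a minimal sketch; the `CoreSC` enum and the `parse_label` helper are hypothetical names chosen for illustration and are not part of the SAPIENT tooling.

```python
from enum import Enum


class CoreSC(Enum):
    """First-layer CoreSC categories, keyed by their abbreviations."""
    BAC = "Background"
    HYP = "Hypothesis"
    MOT = "Motivation"
    GOA = "Goal"
    OBJ = "Object"
    MET = "Method"
    MOD = "Model"
    EXP = "Experiment"
    OBS = "Observation"
    RES = "Result"
    CON = "Conclusion"


def parse_label(abbrev: str) -> CoreSC:
    """Map an abbreviation such as 'res' to its category (case-insensitive)."""
    return CoreSC[abbrev.strip().upper()]


if __name__ == "__main__":
    print(parse_label("res").value)
    print(len(CoreSC))
```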
Implementation:
- Text classification:
- Sentences are classified independently of each other.
- Uses Support Vector Machine (SVM).
- Features are extracted from different aspects of a sentence, ranging from global features (location within the paper, document structure) to local features. For the complete list of features used, refer to the paper.
- Sequence labelling:
- Labels assigned to satisfy dependencies among sentences.
- Uses Conditional Random Fields (CRF).
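The difference between the two approaches can be illustrated with a toy example: independent classification picks the best label for each sentence on its own, while sequence labelling also scores label-to-label transitions and decodes the best whole sequence (here with Viterbi decoding, the standard inference step for linear-chain models such as CRFs). This is only a sketch under stated assumptions: all scores below are hand-set for illustration, whereas a real SVM or CRF would learn them from the annotated corpus, and the function names are hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, float]


def classify_independently(emissions: List[Scores]) -> List[str]:
    """Pick the best-scoring label for each sentence on its own."""
    return [max(scores, key=scores.get) for scores in emissions]


def viterbi(emissions: List[Scores], transitions: Dict[str, Scores]) -> List[str]:
    """Return the label sequence maximising emission + transition scores."""
    labels = list(emissions[0])
    # best[i][y]: best total score of any label path ending in y at sentence i
    best = [dict(emissions[0])]
    back: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        current, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            current[y] = best[-1][prev] + transitions[prev][y] + scores[y]
            pointers[y] = prev
        best.append(current)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=best[-1].get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hand-set toy scores (assumptions, not values from the paper).
emissions = [
    {"MET": 2.0, "RES": 0.5, "CON": 0.0},
    {"MET": 0.4, "RES": 0.5, "CON": 0.6},  # ambiguous sentence
    {"MET": 0.0, "RES": 1.5, "CON": 0.5},
]
transitions = {
    "MET": {"MET": 0.5, "RES": 0.5, "CON": -1.5},
    "RES": {"MET": -0.5, "RES": 0.5, "CON": 0.5},
    "CON": {"MET": -1.0, "RES": -1.0, "CON": 0.5},
}

independent = classify_independently(emissions)  # ['MET', 'CON', 'RES']
sequential = viterbi(emissions, transitions)     # ['MET', 'RES', 'RES']
```

In this toy setup the ambiguous second sentence is labelled CON when judged alone, but RES once the transition scores penalise a Conclusion sandwiched between a Method and a Result sentence.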
Results and discussion:
- F-score: Ranges from 76% for EXP (Experiment) down to 18% for the low-frequency category MOT (Motivation). For complete results from runs configured with different settings and features, see Table 2 of the paper.
- Most important features: n-grams (primarily bigrams), Grammatical triples (GRs), verbs, global features such as history (sequence of labels) and section headings (Detailed explanation for the features
- Classifiers: LibS (LibSVM) has the highest accuracy at 51.6%, followed by CRF at 50.4% and LibL (LibLinear) at 47.7%.
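The two kinds of figures reported above (per-category F-score and overall accuracy) can be computed from gold and predicted sentence labels as sketched below. The label lists are invented toy data, not results from the paper, and the function names are hypothetical; the sketch only shows how a rare category such as MOT can score an F-score of zero while overall accuracy stays moderate.

```python
from collections import Counter
from typing import Dict, List


def per_category_f1(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """F1 per label, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in set(gold) | set(predicted):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of sentences whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy labels (assumptions for illustration only).
gold = ["EXP", "EXP", "EXP", "MOT", "BAC", "BAC"]
pred = ["EXP", "EXP", "BAC", "BAC", "BAC", "BAC"]

f1 = per_category_f1(gold, pred)
acc = accuracy(gold, pred)
```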
Application/Future Work:
- Can be applied to create executive summaries of full papers (based on the entire content and not just abstracts) to identify key information in a paper.
- CoreSC-annotated biology papers are to be used for guiding information extraction and retrieval.
- Generalization to new domains in progress.