Reference:
Wong, P. C., Cowley, W., Foote, H., Jurrus, E., & Thomas, J. (2000). Visualizing sequential patterns for text mining. In Information Visualization, 2000. InfoVis 2000. IEEE Symposium on (pp. 105-111). IEEE.
Background:
- Sequential pattern mining aims to identify recurring patterns in data over a period of time.
- A pattern is a finite series of elements from the same domain, e.g. A -> B -> C -> D.
- Each pattern has a ‘support’ value, which indicates the percentage of the data in which the pattern occurs; patterns are kept only if they meet a minimum support. (E.g. 90% of people who did the first process then did the second process, followed by the third process.)
- Sequential pattern vs association rule:
  - Sequential pattern – studies the ordering/arrangement of elements, e.g. A -> B -> C -> D
  - Association rule – studies togetherness, e.g. A+B+C -> D
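Illustration (not from the paper): a minimal sketch of how ordering changes the support count. The toy sequences and the helper names `contains_in_order`, `sequential_support`, and `cooccurrence_support` are hypothetical. It assumes that gaps between pattern elements are allowed, and it reduces "togetherness" to plain unordered co-occurrence rather than a full association rule with antecedent and consequent.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's elements appear in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # `element in remaining` consumes the iterator up to the match,
    # so later elements must be found after earlier ones.
    return all(element in remaining for element in pattern)


def sequential_support(sequences, pattern):
    """Fraction of sequences that contain the pattern in order."""
    hits = sum(contains_in_order(s, pattern) for s in sequences)
    return hits / len(sequences)


def cooccurrence_support(sequences, items):
    """Fraction of sequences that contain all items, in any order."""
    hits = sum(set(items) <= set(s) for s in sequences)
    return hits / len(sequences)


# Hypothetical toy data: four sequences over the domain {A, B, C, D}.
sequences = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["A", "B", "D"],
    ["D", "C", "B", "A"],
]

ordered = sequential_support(sequences, ["A", "B", "D"])      # order matters
together = cooccurrence_support(sequences, ["A", "B", "D"])   # order ignored
print(ordered, together)
```

The last sequence contains A, B, and D but in reverse order, so it counts towards co-occurrence and not towards the sequential pattern.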
Purpose:
- To present a visual data mining system that combines pattern discovery with visualization.
Method:
Datasets:
An open-source corpus containing 1170 news articles from 1991 to 1997, and news from 1990 harvested from the TREC5 distribution.
Pre-processing:
- Topic extraction: Identifies the topics in documents based on the co-occurrence of words. Words separated by white space are evaluated; stemming is applied, and prepositions, pronouns, adjectives, and gerunds are ignored.
- Multiresolution binning: Groups articles whose timestamps fall into the same bin at the chosen resolution (e.g. binning by day, week, month, or year).
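Illustration (not from the paper): a minimal sketch of multiresolution binning. The article dates, topic names, bin-key formats, and the helpers `bin_key` and `bin_articles` are all assumptions made for this example. Weeks are taken to be ISO calendar weeks, which the paper does not specify.

```python
from collections import defaultdict
from datetime import date


def bin_key(day, resolution):
    """Map a date to a bin label at the requested resolution."""
    if resolution == "day":
        return day.isoformat()
    if resolution == "week":
        iso_year, iso_week, _ = day.isocalendar()  # assumption: ISO weeks
        return f"{iso_year}-W{iso_week:02d}"
    if resolution == "month":
        return f"{day.year}-{day.month:02d}"
    if resolution == "year":
        return str(day.year)
    raise ValueError(f"unknown resolution: {resolution}")


def bin_articles(articles, resolution):
    """Merge the topics of all articles that fall into the same bin."""
    bins = defaultdict(set)
    for day, topics in articles:
        bins[bin_key(day, resolution)].update(topics)
    return dict(bins)


# Hypothetical articles: (publication date, extracted topics).
articles = [
    (date(1991, 1, 7), {"oil", "war"}),
    (date(1991, 1, 9), {"war"}),
    (date(1991, 2, 4), {"oil"}),
]

for resolution in ("day", "week", "month", "year"):
    print(resolution, bin_articles(articles, resolution))
```

Coarser resolutions merge more articles into one bin, so topics that look separate by day can coincide by month or year.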
Discovery of sequential patterns by Visualization:
- Plots topics/topic combinations over time.
- Strength: Can quickly view overall patterns and individual occurrences of events.
- Weakness: Gives no knowledge of the exact connections that make up a pattern, or of the statistical support for individual patterns.
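Illustration (not from the paper): a text-only stand-in for the topic-over-time plot, assuming topics have already been grouped into time bins. The bin labels, topics, and the helpers `occurrence_grid` and `render` are hypothetical; the paper's system uses a graphical display, not this ASCII grid.

```python
def occurrence_grid(binned_topics):
    """Return sorted bin labels and, per topic, one presence flag per bin."""
    bins = sorted(binned_topics)
    topics = sorted({t for present in binned_topics.values() for t in present})
    rows = {t: [t in binned_topics[b] for b in bins] for t in topics}
    return bins, rows


def render(rows):
    """Draw one line per topic: '#' where the topic occurs, '.' where it does not."""
    width = max(len(topic) for topic in rows)
    lines = []
    for topic, marks in rows.items():
        cells = "".join("#" if present else "." for present in marks)
        lines.append(f"{topic.ljust(width)} {cells}")
    return "\n".join(lines)


# Hypothetical monthly bins with the topics seen in each.
binned = {
    "1991-01": {"oil", "war"},
    "1991-02": {"war"},
    "1991-03": {"oil", "peace"},
}

bins, rows = occurrence_grid(binned)
text = render(rows)
print(text)
```

The grid shows at a glance when each topic occurs, but, as noted in the weakness above, it says nothing about which occurrences form a pattern or how well supported that pattern is.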
Discovery of sequential patterns by Data mining:
- Builds patterns on an n-ary tree with elements as nodes.
- A pattern is valid if its support value is greater than the threshold.
- A sample of pattern mining from given input data is shown in Figure 2 of the paper.
- Strength: Provides accurate statistical (support) values for all weak and strong patterns.
- Weakness: Loses temporal and locality information; the large number of patterns produced in text format makes human interpretation harder.
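Illustration (not the paper's algorithm): a minimal sketch of growing patterns on an n-ary tree, where each path from the root spells one pattern. The extension strategy, the `max_len` cap, the toy sequences, and all names (`Node`, `grow`, `mine`, `list_patterns`) are assumptions. It extends only valid patterns, which is safe because a longer pattern can never have higher support than its prefix.

```python
class Node:
    """One tree node: an element plus the support of the pattern ending here."""

    def __init__(self, element, support):
        self.element = element
        self.support = support
        self.children = []


def occurs_in_order(sequence, pattern):
    """True if the pattern's elements appear in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(element in remaining for element in pattern)


def support(sequences, pattern):
    """Fraction of sequences containing the pattern."""
    return sum(occurs_in_order(s, pattern) for s in sequences) / len(sequences)


def grow(node, prefix, sequences, elements, threshold, max_len):
    """Add a child for every one-element extension whose support exceeds the threshold."""
    if len(prefix) >= max_len:
        return
    for element in elements:
        pattern = prefix + [element]
        value = support(sequences, pattern)
        if value > threshold:
            child = Node(element, value)
            node.children.append(child)
            grow(child, pattern, sequences, elements, threshold, max_len)


def mine(sequences, threshold, max_len=3):
    """Build the pattern tree; the root stands for the empty pattern."""
    elements = sorted({e for s in sequences for e in s})
    root = Node(None, 1.0)
    grow(root, [], sequences, elements, threshold, max_len)
    return root


def list_patterns(node, prefix=()):
    """Flatten the tree into (pattern, support) pairs, depth first."""
    found = []
    for child in node.children:
        pattern = prefix + (child.element,)
        found.append((pattern, child.support))
        found.extend(list_patterns(child, pattern))
    return found


# Hypothetical input: four element sequences.
sequences = [["A", "B", "C"], ["A", "C"], ["A", "B", "C"], ["B", "A", "C"]]
root = mine(sequences, threshold=0.5)
for pattern, value in list_patterns(root):
    print(" -> ".join(pattern), value)
```

With this toy input, A -> B has support exactly 0.5 and is rejected, since a pattern must exceed the threshold to be valid. The output is a flat text list of patterns, which hints at why large result sets are hard to interpret.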
Visual Data Mining system:
- Combines visualization and data mining to compensate for each other’s weaknesses (refer to Figures 4 and 5 in the paper for the pattern visualizations).
- Binning resolution can be changed to see different patterns based on day, week, month, year, etc.
- Patterns associated with a particular topic can be selected.
Result/Discussion:
- The strength of a pattern is not easily identifiable from the visualization without statistical measures. Pattern mining is enhanced by a graphical encoding that carries spatial and temporal information.
- Knowledge discovery by humans is aided by combining statistical data mining and visualization.
Future Work:
- Handling larger data sets using secondary memory support and improving the display.
- Integrating more techniques, such as association rules, into the visual data mining environment.