The topic of telling stories from data is huge and would take many hours and books to cover properly. But Dr. Roberto Martinez did a great job of giving us a quick introduction to the topic and its pragmatic application in an hour-long talk at the UTS LX lab. It aligned closely with the Connected Intelligence Centre’s vision of building staff capacity in data science, particularly by keeping humans at the center of the data. This post contains my notes from the talk, where I summarize some of the key messages.
Humans are producing enormous amounts of data these days. According to recent statistics, 2.5 quintillion bytes of data are created every day, and the pace keeps growing. But there is a stark contrast between data and knowledge: data by itself means very little, and knowledge is created only when we make sense of the data. We might be drowning in data, but not in knowledge. Roberto compares this abundance of data to oysters and an insight to a pearl: we need to open many oysters to maybe find one pearl.
The rest of this post is divided into two main sections, (1) data storytelling and (2) data visualization, followed by a few overall key messages that I took away from the talk.
The value of data is not the data itself, but how we present it. This is what makes storytelling so important for presenting insights from data. It is not about presenting ALL the data we have, but about highlighting the main insights that should be noted. It is about finding patterns in the data that engage people with the story, much like hooks in a fictional story. It often works in conjunction with data visualization to communicate results from data. Check out the list of resources given at the end of this post for detailed reading.
There are a few ways to make the insights clear and pop out when communicating the story from data:
- The first step is to declutter the data by removing all the noise. This can be done by stripping down all the unwanted information and building up on the useful insights.
- The next key thing to do is to foreground the things that are important. We do not want so much ink or data that the results become too complicated to understand.
- A data story approach can be used, merging narrative and visuals to engage the audience and point to key messages from the data (see examples of line graphs annotated this way here). Also check out this interesting article and podcast on the good and bad of storytelling for further reading.
Wong, P. C., Cowley, W., Foote, H., Jurrus, E., & Thomas, J. (2000). Visualizing sequential patterns for text mining. In Information Visualization, 2000. InfoVis 2000. IEEE Symposium on (pp. 105-111). IEEE.
- Sequential pattern mining aims to identify patterns that recur in data over a period of time.
- A pattern is a finite series of elements from the same domain, e.g. A -> B -> C -> D
- Each pattern has a ‘support’ value, which indicates the percentage of cases in which the pattern occurs (e.g. 90% of people who did this process did the second process, followed by the third process). A minimum support value is used to decide which patterns to keep.
- Sequential pattern vs association rule:
  - Sequential pattern – studies ordering/arrangement of elements, e.g. A -> B -> C -> D
  - Association rule – studies togetherness, e.g. A+B+C -> D
- The paper presents a visual data mining system that combines pattern discovery and visualization.
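The difference between ordering and togetherness, and the notion of support, can be illustrated with a minimal sketch. The function names and the toy sequences below are my own, not from the paper, and I am assuming that a pattern matches when its elements appear in the same order with gaps allowed; the paper's exact matching rule may differ.

```python
def contains_in_order(sequence, pattern):
    """True if the elements of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `element in remaining` consumes the iterator, so each match must come
    # after the previous one.
    return all(element in remaining for element in pattern)


def sequential_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    if not sequences:
        return 0.0
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)


def itemset_support(sequences, items):
    """Fraction of sequences that contain all `items`, ignoring order."""
    if not sequences:
        return 0.0
    wanted = set(items)
    return sum(wanted <= set(s) for s in sequences) / len(sequences)


# Hypothetical toy data: each inner list is one ordered series of events.
sequences = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
    ["A", "B", "D"],
]

ordered = sequential_support(sequences, ["A", "B", "C", "D"])  # 0.25
together = itemset_support(sequences, ["A", "B", "C", "D"])    # 0.75
print(ordered, together)
```

Only the first sequence has A, B, C, D in that exact order, while three of the four contain all four elements, which is why the ordered support is lower than the "togetherness" support.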
Data: an open-source corpus containing 1170 news articles from 1991 to 1997, and news from 1990 harvested from the TREC5 distribution.
- Topic extraction: Identifies the topics in documents based on the co-occurrence of words. Whitespace-separated words are evaluated after stemming; prepositions, pronouns, adjectives, and gerunds are ignored.
- Multiresolution binning: Groups articles whose timestamps fall in the same bin (e.g. binning by day, week, month, or year).
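As a rough sketch of what multiresolution binning could look like, the snippet below groups topics by day, week, month, or year. The `bin_key` and `bin_articles` helpers, the ISO-week convention, and the sample articles are all assumptions of mine for illustration; the paper does not prescribe this implementation.

```python
from collections import defaultdict
from datetime import date


def bin_key(day, resolution):
    """Map a date to the key of the bin it falls into at the given resolution."""
    if resolution == "day":
        return (day.year, day.month, day.day)
    if resolution == "week":
        iso = day.isocalendar()
        return (iso[0], iso[1])  # (ISO year, ISO week number)
    if resolution == "month":
        return (day.year, day.month)
    if resolution == "year":
        return (day.year,)
    raise ValueError(f"unknown resolution: {resolution!r}")


def bin_articles(articles, resolution):
    """Collect the topics of all articles that share a bin, in time order."""
    bins = defaultdict(set)
    for day, topics in articles:
        bins[bin_key(day, resolution)].update(topics)
    return dict(sorted(bins.items()))


# Hypothetical (date, topics) pairs standing in for the news corpus.
articles = [
    (date(1991, 1, 8), {"oil"}),
    (date(1991, 1, 10), {"war"}),
    (date(1991, 2, 2), {"oil", "peace"}),
]

for resolution in ("day", "week", "month", "year"):
    print(resolution, bin_articles(articles, resolution))
```

Coarser resolutions merge more articles into one bin, so topics that look unrelated at the day level can appear together at the month or year level.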
Discovery of sequential patterns by Visualization:
- Plotting topics or topic combinations over time.
- Strength: Can quickly view overall patterns and individual occurrence of events.
- Weakness: No knowledge of the exact connections that make up a pattern, and no statistical support for individual patterns.
Discovery of sequential patterns by Data mining:
- Patterns are built on an n-ary tree with elements as nodes.
- Patterns are valid if their support value is greater than a threshold.
- An example of mining patterns from input data is given in Figure 2 of the paper.
- Strength: Provides accurate statistical (support) values for all weak and strong patterns.
- Weakness: Loses temporal and locality information, and produces a large number of patterns in text format, making human interpretation harder.
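To make the tree idea concrete, here is a minimal sketch that stores candidate patterns in an n-ary tree (one element per node) and keeps those whose support exceeds a threshold. This is not the paper's algorithm: the brute-force enumeration of subsequences, the `max_len` cap, and all names and data are my own assumptions, suitable only for small examples.

```python
from itertools import combinations


class Node:
    """One element of a pattern; the path from the root spells out the pattern."""

    def __init__(self):
        self.children = {}
        self.count = 0  # number of input sequences containing this pattern


def build_pattern_tree(sequences, max_len=3):
    """Insert every ordered subsequence (up to max_len elements) into the tree."""
    root = Node()
    for seq in sequences:
        # Use a set so that each sequence counts at most once per pattern.
        patterns = set()
        for length in range(1, max_len + 1):
            for idx in combinations(range(len(seq)), length):
                patterns.add(tuple(seq[i] for i in idx))
        for pattern in patterns:
            node = root
            for element in pattern:
                node = node.children.setdefault(element, Node())
            node.count += 1
    return root


def frequent_patterns(root, n_sequences, min_support):
    """Return {pattern: support} for patterns with support above min_support."""
    results = {}
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        for element, child in node.children.items():
            pattern = prefix + (element,)
            support = child.count / n_sequences
            if support > min_support:
                results[pattern] = support
                # Only extend valid patterns: a longer pattern can never
                # have higher support than its own prefix.
                stack.append((pattern, child))
    return results


# Hypothetical toy data: each inner list is one ordered series of elements.
sequences = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
    ["A", "B", "D"],
]

tree = build_pattern_tree(sequences)
results = frequent_patterns(tree, len(sequences), min_support=0.7)
for pattern, support in sorted(results.items()):
    print(" -> ".join(pattern), support)
```

Even on four short sequences this prints eleven patterns as plain text, which hints at the weakness noted above: the output grows quickly and carries no information about when the patterns occurred.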
Visual Data Mining system:
- Combining visualization and data mining to compensate for each other’s weaknesses (refer to Figures 4 & 5 in the paper to see the pattern visualizations).
- Binning resolution can be changed to see different patterns based on day, week, month, year etc.
- Patterns associated with a particular topic can be picked.
- Strength of pattern is not easily identifiable from the visualization without statistical measures. Pattern mining gets enhanced by graphical encoding with spatial and temporal information.
- Knowledge discovery by humans is aided by combining statistical data mining and visualization.
- Handling larger data sets using secondary memory support, and improving the display.
- Integrating more techniques, like association rules, into the visual data mining environment.