Working with Jupyter notebooks

Jupyter is an open-source program that helps you share and run code in many different programming languages. Jupyter notebooks are great for quickly prototyping different versions of code, as they make it easy to edit and try different outputs. The format of a Jupyter notebook is similar to the R Markdown reports commonly used in R: it can contain blocks of text, code, equations and results (including visualizations) all in one page. We’ve used Jupyter notebooks to run text analysis workshops at conferences, and the feedback was pretty good.

The Writing Analytics workshop is starting at #LAK18. Jupyter notebooks are being used. #great

I find that Jupyter notebooks are great for sharing code and results with others, and if you host them yourself, they save a lot of trouble when organising a workshop where participants would otherwise need to install software. They work well for a non-technical audience too, who can choose to ignore what’s inside a code block, simply run it, and focus on the results. Notebooks are quite popular now for data science experiments, so this post is a good place to start getting to know and use them. You can take an existing notebook (say, one downloaded from GitHub) and play with it, or create your own Jupyter notebook from scratch. This post will guide you through creating your own notebook from scratch, demonstrating some basic text analysis in Python.

Installing Jupyter

If you want to try a Jupyter notebook first without installing anything, you can do so in this notebook hosted on the official Jupyter site. If you want to install your own copy of Jupyter on your machine to develop code, use one of the two options below:

  • If you are new to Python programming and don’t have Python installed on your machine, the easiest way to install Jupyter is by downloading the Anaconda distribution. This comes with Python built in (you can choose either Python 2.7 or 3.6 when you download the distribution – the code I’m writing in this post is in 2.7).
  • If you already have Python working on your machine (as I did), the easiest way is to install Jupyter using the pip command, as you would for any Python package. If pip and Python are already set up in your system path, you can simply run $ pip install jupyter from the command prompt.

Now that Jupyter is installed, type the command below in your Anaconda prompt/command prompt to start a Jupyter notebook:

$ jupyter notebook

The Jupyter homepage opens in your default browser at http://localhost:8888, displaying the files in the current folder as shown below. You can now create a new Python Jupyter notebook by clicking on New -> Python 2 (or Python 3 if you have Python version 3). You can move between folders or create a new folder for your Python notebooks. To change the default opening directory, first move to the required path using cd in the command prompt, and then type $ jupyter notebook. Open the created notebook, which would look like this:

This cell is a code block by default, which can be changed to a markdown text block from the drop-down list (check the figure above) to add narrative text accompanying the Python code. Now name your notebook, and try adding both a code block and a markdown block with different levels of text, following the sample here:

To execute the blocks, click the Run button (alternatively, use Ctrl+Enter on Windows – keyboard shortcuts can be found in Help -> Keyboard Shortcuts). This renders the output of your code and your markdown text like this:
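If you’d like something to type into that first code cell, here is a tiny example (my own, not the figure from this post). It also shows one notebook convenience: the value of the last expression in a cell is displayed as the cell’s Out[n] result without an explicit print.

```python
# A typical first code cell. print() writes to the output area,
# and the value of the last expression in the cell is displayed
# as the cell's Out[n] result automatically.
message = "Hello, Jupyter!"
print(message)
len(message)  # rendered as Out[n]: 15
```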

That’s it. You have a simple Jupyter notebook running on your machine. To try a bit more, here’s the sample code you can download and run to do some basic text analysis. I’ve defined three steps in this code: importing required packages, defining input text, and analysis. Before you can import the packages/libraries in step 1, however, they must first be installed on your machine. This can be done using the pip command in the command prompt/Anaconda prompt like this: $ pip install wordcloud (if you run into problems with that, the other option is to download an appropriate version of the package’s wheel from here and install it using $ pip install C:/some-dir/some-file.whl).
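Before running step 1, a small cell like the one below (my own sketch, not part of the original notebook) can confirm whether the required packages are importable, and remind you of the pip command if one is missing:

```python
# Check that each third-party package needed by the notebook
# can be imported; suggest the install command otherwise.
import importlib

for pkg in ("wordcloud", "matplotlib"):
    try:
        importlib.import_module(pkg)
        print(pkg, "is installed")
    except ImportError:
        print(pkg, "is missing - install it with: pip install", pkg)
```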

Python code for the three steps is below:
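The original code is shown as a screenshot on the blog, so as a stand-in here is a minimal sketch of the same three steps using only the standard library. The sample text and the five-word cutoff are my own choices, not the post’s.

```python
# Step 1: import the required packages
import re
from collections import Counter

# Step 2: define the input text (replace with your own sample)
text = """Jupyter notebooks let you mix narrative text, code and
results in a single document. Notebooks are popular for data
science because results are easy to share and easy to rerun."""

# Step 3: analysis - tokenise into lowercase words and count them
words = re.findall(r"[a-z']+", text.lower())
freq = Counter(words)

for word, count in freq.most_common(5):
    print(word, count)
```

Since the post installs the wordcloud package, the original’s analysis step presumably also drew a word cloud; the standard way to do that is WordCloud().generate(text) from the wordcloud package, displayed with matplotlib.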

The downloadable ipynb file is available on GitHub.

Other notes:

  • This post is intended for anyone who wants to start working with Jupyter notebooks, and assumes prior understanding of programming in Python. The Jupyter notebook is another environment for working with code easily, but the coding process itself is still very traditional. If you’re new to Python programming, this website is a good place to start.
  • You can use multiple versions of Python to run Jupyter notebooks by changing the kernel (the computational engine which executes the code). I have both Python 2 and Python 3 installed, and I switch between them for different programs as needed.
  • While Jupyter notebooks are mainly used to run Python code, they can also run R programs, which requires an R kernel to be installed. The blog post below is a useful guide to doing that:

Telling stories with data and visualizations – Some key messages

The topic of telling stories from data is huge and probably needs many hours and books to explain the ideal ways of doing it. But Dr. Roberto Martinez did a great job of giving us a quick introduction to the topic and its pragmatic application in his hour-long talk at the UTS LX lab. It aligned very much with the Connected Intelligence Centre‘s vision of building staff capacity in data science, particularly by keeping humans at the centre of the data. This post includes my notes from the talk, summarising some of the key messages.

Humans are producing enormous amounts of data these days. According to recent statistics, 2.5 quintillion bytes of data are created every day, and the pace keeps growing. But there is a stark contrast between data and knowledge: data by itself means very little, and knowledge is created only when the data is made sense of. We might be drowning in data, but not in knowledge. Roberto compared this abundance of data to oysters, and an insight to a pearl: we need to open many oysters to maybe find one pearl.

The rest of this post is divided into two main sections – 1. Data storytelling, 2. Data visualization – followed by a few overall key messages that I took away from the talk.

Data Storytelling:

The value of data lies not in the data itself, but in how we present it. This is what makes storytelling really important for presenting insights from data. It is not about presenting ALL the data we have, but about highlighting the main insights that should be noted. It is about finding patterns in the data that engage people with the story, just like hooks in a fictional story. Storytelling often operates in conjunction with data visualization to communicate results. Check out the list of resources at the end of this post for detailed reading.

There are a few ways to make the insights clear and pop out when communicating the story from data:

  • The first step is to declutter the data by removing all the noise – stripping out the unwanted information and building on the useful insights.
  • The next key thing is to foreground what is important. We do not want so much ink/data that the results become too complicated to understand.
  • A data story approach can merge narrative and visuals to engage the audience and point to the key messages in the data (see examples of line graphs annotated this way here). Also check out this interesting article and podcast on the good and bad of storytelling for further reading.
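As a concrete illustration of those three points (entirely my own sketch, not from the talk, with made-up numbers): in matplotlib you can strip the clutter, push the context series into the background, and let a single annotation carry the story.

```python
# Declutter a line chart and foreground one series with an annotation.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

months = range(1, 7)
ours = [12, 15, 14, 18, 24, 31]     # the series to foreground
others = [11, 12, 13, 12, 13, 14]   # context series, pushed back

fig, ax = plt.subplots()
ax.plot(months, others, color="lightgrey")            # background context
ax.plot(months, ours, color="tab:blue", linewidth=2)  # foregrounded series
ax.annotate("Growth took off in May",                 # the story itself
            xy=(5, 24), xytext=(2.5, 27),
            arrowprops=dict(arrowstyle="->"))

# Declutter: remove the chart "box" that adds ink without information
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

fig.savefig("story.png")
```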
