Working with Jupyter notebooks #code

Jupyter is an open source program that helps you share and run code in many different programming languages. Jupyter notebooks are great to quickly prototype different versions of code, as they are easy to edit and try different outputs. The format of a Jupyter notebook is similar to reports in the form of Markdowns that are usually used in R. It can contain blocks of text, code, equations and results (including visualizations) all in one page. We’ve used Jupyter notebooks to run text analysis workshops in conferences, and the feedback was pretty good.

The Writing Analytics workshop is starting at #LAK18. Jupyter notebooks are being used. #great pic.twitter.com/56Zd66ku9L

I find that Jupyter notebooks are great for sharing code and results across different people, and if you’re hosting it, it saves a lot of trouble in organising a workshop where you want participants to install software. It works well for non-technical audience too, since they can choose to ignore what’s inside the code block by simply running it and focus on the results block. They are quite popular now for data science experiments, so this post will be a good place to start to know and use them. You can use an already available notebook (if you’ve downloaded one from Github) and play with it, or create your own Jupyter notebook from scratch. This post will guide you to create your own notebook from scratch demonstrating some basic text analysis in Python.

Installing Jupyter

If you want to try a Jupyter notebook first without installing anything, you can do so in this notebook hosted in the official Jupyter site. If you want to install your own copy of Jupyter running in your machine to develop code, then use one of the two options below:

  • If you are new to Python programming, and don’t have python installed in your machine, the easiest way to install Jupyter is by downloading the Anaconda distribution. This comes with in-built Python (you can choose either 2.7 or 3.6 version of Python when you download the distribution – the code I’m writing in this post is in 2.7).
  • If you already have Python working in your machine (as I did), the easiest way is to install Jupyter using the pip command as you do for any Python package. Note that if pip and python are already setup in your system path, you can simply use $ pip install jupyter from the command prompt.

Now that Jupyter is installed, type the command below in your anaconda prompt/command prompt to start a Jupyter notebook:

$ jupyter notebook

The Jupyter homepage opens in your default browser at http://localhost:8888, displaying the files present in the current folder like below. You can now create a new Python jupyter notebook by clicking on New -> Python2 (or Python 3 if you have Python version 3). You can move between folders or create a new folder for your Python notebooks. To change the default opening directory, you should first move to the required path using cd in the command prompt, and then type$ jupyter notebookOpen the created notebook, which would look like this:

This cell is a code block by default, which can be changed to a markdown text block from the drop-down list (check the figure above) to add narrative text accompanying the Python code. Now name your notebook, and try adding both a code block, and markdown block with different levels of text following the sample here:

To execute the blocks, click on the Run button (Alternatively, use Ctrl+Enter in Windows – Keyboard shortcuts can be found in Help -> Keyboard shortcuts). This renders the output of your code and your markdown text like this:

That’s it. You have a simple Jupyter notebook running on your machine. Now to try a bit more, here’s the sample code you can download and run to do some basic text analysis. I’ve defined three steps in this code: Importing required packages, defining input text, and analysis. Before importing the packages/ libraries you need in step 1 however, they should be first installed in your machine. This can be done using the Pip command in the command prompt/anaconda prompt like this:  $ pip install wordcloud (If you run into problems with that, the other option is to download an appropriate version of the package’s wheel from here and install it using $pip install C:/some-dir/some-file.whl).

Python code for the three steps is below:

#Step 1 - Importing libraries
 
from wordcloud import WordCloud, STOPWORDS  #For word cloud generation
import matplotlib.pyplot as plt             #For displaying figures
import re                          #Regular expresions for string operations


#Step 2 - Defining input text

inputtext = "A cockatoo is a parrot that is any of the 21 species belonging to the bird family Cacatuidae, the only family in the superfamily Cacatuoidea. Along with the Psittacoidea (true parrots) and the Strigopoidea (large New Zealand parrots), they make up the order Psittaciformes (parrots). The family has a mainly Australasian distribution, ranging from the Philippines and the eastern Indonesian islands of Wallacea to New Guinea, the Solomon Islands and Australia. Cockatoos are recognisable by the showy crests and curved bills. Their plumage is generally less colourful than that of other parrots, being mainly white, grey or black and often with coloured features in the crest, cheeks or tail. On average they are larger than other parrots; however, the cockatiel, the smallest cockatoo species, is a small bird. The phylogenetic position of the cockatiel remains unresolved, other than that it is one of the earliest offshoots of the cockatoo lineage. The remaining species are in two main clades. The five large black coloured cockatoos of the genus Calyptorhynchus form one branch. The second and larger branch is formed by the genus Cacatua, comprising 11 species of white-plumaged cockatoos and four monotypic genera that branched off earlier; namely the pink and white Major Mitchell's cockatoo, the pink and grey galah, the mainly grey gang-gang cockatoo and the large black-plumaged palm cockatoo. Cockatoos prefer to eat seeds, tubers, corms, fruit, flowers and insects. They often feed in large flocks, particularly when ground-feeding. Cockatoos are monogamous and nest in tree hollows. Some cockatoo species have been adversely affected by habitat loss, particularly from a shortage of suitable nesting hollows after large mature trees are cleared; conversely, some species have adapted well to human changes and are considered agricultural pests. Cockatoos are popular birds in aviculture, but their needs are difficult to meet. The cockatiel is the easiest cockatoo species to maintain and is by far the most frequently kept in captivity. White cockatoos are more commonly found in captivity than black cockatoos. Illegal trade in wild-caught birds contributes to the decline of some cockatoo species in the wild. Source: https://en.wikipedia.org/wiki/Cockatoo"


print("\nInput text for analysis:\n ")
print(inputtext)


#Step 3 - Analysis

print "Summary statistics of input text:"

wordcount = len(re.findall(r'\w+', inputtext))
print "Wordcount: ", wordcount

charcount = len(inputtext) #including spaces
print "Number of characters: ", charcount

#More options for wordclouds here: https://github.com/amueller/word_cloud
wordcloud = WordCloud(    stopwords=STOPWORDS,
                          background_color='black',
                         ).generate(inputtext)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

The downloadable ipynb file is available on Github.

Other notes:

  • This post is intended for anyone who wants to start working with Jupyter notebooks, and assumes prior understanding of programming in Python. The Jupyter notebook is another environment to easily work with code, but the coding process is still very traditional. If you’re new to Python programming, this website is a good place to start.
  • You can use multiple versions of Python to run Jupyter notebooks by changing its Kernel (the computational engine which executes the code). I have both Python 2 & Python 3 installed, and I switch between them for different programs as needed.
  • While Jupyter notebooks are mainly used to run Python code, they can also be used to run R programs, which requires R kernel to be installed. The blog post below is a useful guide to do that: https://www.datacamp.com/community/blog/jupyter-notebook-r