Github blog with Nikola and Jupyter notebooks - setup and workflow

Setting up the Github blog and creating a first post with Nikola was surprisingly easy, after having struggled with some alternatives. But as easy as it was to set up, it is also easy to forget a few intricacies (although it really isn't that complicated) between one post and the next, a couple of weeks later. So here is a short reminder on how to set Nikola up for Github and Jupyter notebooks, and how to post something new. It is assumed that you already have a Github account with a <user>.github.io repository for your personal page.

There are several articles on the subject that might also be helpful, but none of them quite covered all my issues with Python (Anaconda), git and Nikola in one place. So here it goes:

1. Getting started: Python environment, installing packages and Nikola, starting the git repo

This first part is rather dependent on setup and is documented more completely in various other places, such as the nikola site. Nevertheless, especially if you have Anaconda, the following could be useful.

Nikola has quite a few dependencies that are possibly different from those for everyday Python work, so it makes sense to create a new environment. This is easy with conda. (I chose a prompt ending with # because of a conflict with the dollar sign within markdown quotations; I didn't actually run these commands as root, nor should you have to.) By default, my conda installation will create a Python 3.5 environment, which is what we need.

user@macbook:~/Projects/user.github.io# conda create --name blog
user@macbook:~/Projects/user.github.io# source activate blog
(blog)user@macbook:~/Projects/user.github.io#

Conda offers a way to copy an environment (using conda list --explicit), but this doesn't work cross-platform (I already had an environment set up on Linux and wanted the same one on my Mac. Alas.).
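For completeness, the same-platform roundtrip would look roughly like this (spec-file.txt is just an arbitrary file name):

conda list --explicit > spec-file.txt
conda create --name blog --file spec-file.txt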

Using the list of commands outlined here (NB: check it if you start from scratch and also need to set up a Github account), one installs some necessary packages and Nikola with pip. But not before we have installed pip itself in our new environment:

(blog)user@macbook:~/Projects/user.github.io# which pip 
/Users/user/anaconda/bin/pip

Not located in our environment, so install pip first:

(blog)user@macbook:~/Projects/user.github.io# conda install pip
(blog)user@macbook:~/Projects/user.github.io# which pip
/Users/user/anaconda/envs/blog/bin/pip

That looks better. Now we can install in our environment with pip (NB: from now on, omitting the path and prompt):

pip install nikola pyzmq tornado jinja2 requests sphinx markdown ghp-import2

Upgrade Nikola with extras, as recommended on the nikola site:

pip install --upgrade "Nikola[extras]"

One thing is still missing: the Jupyter notebook, which pulls in a rather long list of additional packages. Install it with conda:

conda install notebook

In my most recent case, I did a git clone into my directory because I already had a Github repo with Nikola. Typically, you will first do a nikola init mysite, add and commit this to a repo, and set a Github location as your remote (a sketch of that sequence follows below). I am using it for a personal page, so I have the following remote:

https://github.com/<user>/<user>.github.io.git
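For a fresh start, the sequence would look roughly like this (mysite is a placeholder; the src branch is explained below, and details depend on your setup):

nikola init mysite
cd mysite
git init
git checkout -b src
git add .
git commit -m "Initial Nikola site"
git remote add origin https://github.com/<user>/<user>.github.io.git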

See here for some more hints with respect to the git repo. And I liked this post a lot too.

There is something important to keep in mind with the git repo: there is a source branch (src) and a branch where the generated HTML will live, in this case master (for project pages this can be a /docs directory inside the master branch, or a separate gh-pages branch).
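These branches are set in conf.py. A sketch of the relevant settings as used here (these are real Nikola options, but double-check your own conf.py, as defaults have changed between versions):

# Branch layout for `nikola github_deploy` (personal user page)
GITHUB_SOURCE_BRANCH = 'src'     # branch holding conf.py, posts/, etc.
GITHUB_DEPLOY_BRANCH = 'master'  # branch Github Pages serves the HTML from
GITHUB_REMOTE_NAME = 'origin'
GITHUB_COMMIT_SOURCE = True      # also commit the source branch on deploy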

A second thing related to git: make sure there is a .gitignore file as described in the nikola handbook. Fixing conflicts with files in /cache is no fun.
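Mine is along these lines (following the handbook; your list may vary):

cache
.doit.db*
__pycache__
output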

To see the file structure of what will be published on the site, do:

git checkout master

This will show an index.html, a blogs directory and several more site-related things. Going back to the src branch with:

git checkout src

you will see a conf.py file for Nikola, plus a bunch of other stuff.

By now, we have a Python environment with all necessary packages including Nikola, and a git repository that is linked to our personal Github repo.

2. Adding a Jupyter notebook file as a post

Let's make the new post with a Jupyter notebook:

nikola new_post -f ipynb

Open the newly created, empty notebook in Jupyter:

jupyter notebook posts/new_post.ipynb

Now the actual fun starts: you can write text, equations (using the %%latex cell magic; see the small example below) and of course Python code, including plots. This post was also written in a Jupyter notebook. You can watch your post take shape by telling nikola:

nikola auto -b
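As an aside, an equation cell could look like this (an illustration of the %%latex magic, not a cell from this post):

%%latex
$$ e^{i\pi} + 1 = 0 $$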

When you are done, save the notebook as you want it to appear. We haven't given the notebook any tags yet, but that can easily be done now. For this post, I did:

nikola tags --add "nikola, conda, jupyter notebook" posts/blog-with-nikola-and-github-setup-and-workflow.ipynb

Note that this modifies your notebook file. Also note that if you want to keep a post hidden from the public (for instance, when deploying the site while one post is still under construction), you can give it the tag private. This can of course later be removed with nikola tags --remove.
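For example (using the same file as above):

nikola tags --remove "private" posts/blog-with-nikola-and-github-setup-and-workflow.ipynb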

The last action: deploy the post to Github. Here, Nikola really shines through its simplicity.

nikola github_deploy

That's all!

Using the groupby method in pandas

The DataFrame class of the well-known pandas library is primarily meant for working with tabular data: data where rows and columns have names associated with them. There are a lot of handy methods for all sorts of data manipulations.

This post discusses the .groupby() method. It is similar to an SQL GROUP BY clause (data.groupby('Date').max(), for instance, corresponds roughly to SELECT Date, MAX(...) FROM data GROUP BY Date), so it is typically combined with aggregate functions such as .min() or .mean().

In [1]:
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt

Let us generate some data that needs grouping. To give the data some meaning, let's say we have two temperature sensors, A and B. If one of them indicates a temperature above 0°C, its corresponding batch of the day needs to be discarded. So we want to see whether this occurred and, when it did, on which date. To do this, we will group by date.

In [2]:
# Create timestamps 
N = 1000
datetimes = [datetime.datetime.fromtimestamp(int(ts)) for ts in np.linspace(1.5E9, 1.501E9, N)]
dates = [dt.date() for dt in datetimes]
times = [dt.time() for dt in datetimes]
In [3]:
# Generate data by smoothing random noise (to get something that remotely resembles real data)
# First define a function that returns a gaussian kernel
def make_gaussian_kernel(window_length, sigma):
    kernel_window = 0.5+np.arange(-window_length/2, window_length/2)
    return 1./np.sqrt(np.pi * 2 * sigma ** 2) * np.exp(-(kernel_window ** 2 / (2 * sigma ** 2)))
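A quick sanity check (my addition, not part of the original flow): the kernel is a sampled, normalized Gaussian, so it should sum to roughly one, meaning the convolution below preserves the mean level of the signal:

# Should print a value close to 1 (about 0.99 for this window/sigma)
print(make_gaussian_kernel(100, 20).sum())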
In [4]:
# Give B a shorter correlation time scale and a larger sigma (before filtering)
np.random.seed(12)
data = pd.DataFrame({'Date': dates,
                     'Time': times,
                     'A': -5 + np.convolve(5 * np.random.randn(N), make_gaussian_kernel(100, 20), 'same'),
                     'B': -5 + np.convolve(11 * np.random.randn(N), make_gaussian_kernel(50, 5), 'same')},
                    columns=['Date', 'Time', 'A', 'B'])
data.head()
Out[4]:
Date Time A B
0 2017-07-14 04:40:00 -5.334710 -1.701750
1 2017-07-14 04:56:41 -5.355727 -1.599209
2 2017-07-14 05:13:22 -5.374562 -1.612101
3 2017-07-14 05:30:03 -5.403142 -1.731941
4 2017-07-14 05:46:44 -5.417975 -1.939321
In [5]:
data[['A','B']].plot()
plt.show();

Let's look at the maximum values of A and B directly: we group by Date and take the max of the columns for each group, selecting only columns A and B.

In [6]:
data.groupby('Date').max()[['A','B']]
Out[6]:
A B
Date
2017-07-14 -5.334710 -1.599209
2017-07-15 -5.393345 -1.656905
2017-07-16 -5.813294 -2.144248
2017-07-17 -3.818759 -2.841216
2017-07-18 -3.836921 -3.817385
2017-07-19 -4.235721 -1.978026
2017-07-20 -4.200498 -1.009262
2017-07-21 -4.231529 -2.991666
2017-07-22 -4.338296 -1.940850
2017-07-23 -3.405894 0.389407
2017-07-24 -4.339183 -2.021590
2017-07-25 -5.285316 -0.537233

Since the returned object is a DataFrame, there is the .plot() method for easy plotting.

In [7]:
fig, axs = plt.subplots(1,2, figsize=(12,6))
data.groupby('Date').mean()[['A','B']].plot.bar(ax=axs.flat[0], title='mean')
data.groupby('Date').max()[['A','B']].plot.bar(ax=axs.flat[1], title='max')

axs.flat[0].axhline(0, color='k')
axs.flat[1].axhline(0, color='k')

axs.flat[0].set_ylim(-10,1)
axs.flat[1].set_ylim(-10,1)
fig.autofmt_xdate() # Rotates the dates for better appearance

plt.show();

We see that A never exceeded 0 degrees, whereas B exceeded the limit once, on 23 July.

Another method I don't want to withhold here is .agg(). As you might suspect, it is a more generic aggregation method: we can pass lists (and dictionaries, as shown after the table below) to indicate which aggregations should be applied (and to which columns). A DataFrame with a MultiIndex on the columns is returned (which means we need to do some looping if we want to plot the individual results separately), but the on-screen print looks neat:

In [8]:
data.groupby('Date').agg(['min','mean','max'])[['A','B']]
Out[8]:
A B
min mean max min mean max
Date
2017-07-14 -5.840168 -5.586130 -5.334710 -11.028303 -4.735670 -1.599209
2017-07-15 -5.982163 -5.612892 -5.393345 -9.347597 -4.890982 -1.656905
2017-07-16 -6.523278 -6.165463 -5.813294 -11.950225 -7.311235 -2.144248
2017-07-17 -5.872409 -4.914605 -3.818759 -9.135479 -5.613660 -2.841216
2017-07-18 -5.332909 -4.772618 -3.836921 -9.696789 -6.452641 -3.817385
2017-07-19 -5.331248 -4.987833 -4.235721 -7.326675 -4.898382 -1.978026
2017-07-20 -5.759595 -4.971836 -4.200498 -8.854382 -4.676810 -1.009262
2017-07-21 -5.882903 -4.953391 -4.231529 -11.454454 -7.272826 -2.991666
2017-07-22 -5.100427 -4.629432 -4.338296 -10.950150 -5.907628 -1.940850
2017-07-23 -4.903280 -3.961689 -3.405894 -8.616527 -4.221630 0.389407
2017-07-24 -5.281518 -4.676212 -4.339183 -8.930439 -6.342465 -2.021590
2017-07-25 -5.736660 -5.554209 -5.285316 -7.745590 -5.181151 -0.537233
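As for dictionaries: they map column names to one or more aggregation functions, which is handy when not every column needs the same treatment. A minimal sketch (not an output from the original notebook):

# Only the max of A, but both min and max of B
data.groupby('Date').agg({'A': 'max', 'B': ['min', 'max']})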

If we had data stretching over a longer period, we might be interested in statistics about the aggregates. We can add a lambda function telling us whether the maximum exceeds zero within each group:

In [12]:
data.groupby('Date').agg(['min','mean','max', lambda value:value.max()>0])[['A','B']].describe()
Out[12]:
A B
min mean max <lambda> min mean max <lambda>
count 12.000000 12.000000 12.000000 12.0 12.000000 12.000000 12.000000 12.000000
mean -5.628880 -5.065526 -4.519455 0.0 -9.586384 -5.625423 -1.845682 0.083333
std 0.449701 0.581180 0.751737 0.0 1.467542 1.034041 1.121780 0.288675
min -6.523278 -6.165463 -5.813294 0.0 -11.950225 -7.311235 -3.817385 0.000000
25% -5.875032 -5.562190 -5.297665 0.0 -10.969688 -6.370009 -2.318490 0.000000
50% -5.748128 -4.962614 -4.287009 0.0 -9.241538 -5.397405 -1.959438 0.000000
75% -5.318816 -4.748517 -4.109604 0.0 -8.794918 -4.852154 -1.451722 0.000000
max -4.903280 -3.961689 -3.405894 0.0 -7.326675 -4.221630 0.389407 1.000000

From the mean of our lambda column, we see that B exceeded 0°C in 8.3% of cases (1/12th).

Concluding:

The pandas.DataFrame.groupby() method returns a GroupBy object, on which we can apply aggregate functions such as .mean(). We will typically group on dates or categorical data. Using the .agg() method, we can apply multiple operations to multiple columns.

By directly applying .plot() or .describe() on the resulting DataFrame, our analysis is basically done in a single line of code.