Over the summer of 2018, I completed the Deep Learning Specialization on Coursera taught by Andrew Ng. It was a very good series of courses that introduced me to a lot of concepts in current neural networks research in a way that was very approachable, engaging, and left the door wide open for further exploration. The interviews with prominent researchers in the field were particularly interesting – Geoffrey Hinton, Ian Goodfellow, Yoshua Bengio, and others.
The homework assignments and tests included Python implementations of neural networks and mostly required completing small sections of functions. Typically this meant understanding the operations being performed and how to accomplish them with NumPy matrix operations. These exercises were very useful, but I felt I still didn’t quite understand the details at the level I wanted: despite being able to finish the coding assignments, I really didn’t have an appreciation for the underlying theory and operation.
To fill in the gaps, I started reading Neural Networks for Pattern Recognition by Christopher Bishop. While this book is a bit dated (first edition published in 1996) and doesn’t cover the latest, greatest techniques in Deep Learning or Convolutional Neural Nets, it gives a fantastic treatment of neural nets from a statistical learning theory perspective. The first two chapters cover Statistical Pattern Recognition and Probability Density Estimation including topics on classification and regression, maximum likelihood estimation, and other topics you would expect from a text on statistical learning. Chapters 3 and 4 get into the details of single- and multi-layer neural networks and their mathematical underpinnings.
In addition to reading this book, I decided that to really understand neural networks, I needed to implement them from scratch. I decided to use base R for this since I was more familiar with how to perform matrix operations in R and my intent was to understand neural nets, not the necessary functions in Python. I came up with my list of necessary activation functions for my minimalist implementation:
- tanh
- leaky ReLU
- softmax (for multinomial outputs)
and set about coding the necessary pieces for each: the activation functions themselves (except for tanh, which R supplies), error functions (which all turn out to be the same), derivatives, a neural network data structure, forward propagation, and backward propagation. In the end, I was able to do the whole thing in about 200 lines of R code which you can access on GitHub, if you’re interested: NNFS
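To give a flavor of what "a common form" means here, a minimal sketch (illustrative, not the actual NNFS code) of the activation functions and their derivatives:

```r
# Activations and their derivatives kept in a common form, so that
# forward and backward propagation can stay generic.
leaky_relu <- function(z, alpha = 0.01) ifelse(z > 0, z, alpha * z)
leaky_relu_deriv <- function(z, alpha = 0.01) ifelse(z > 0, 1, alpha)

# Softmax applied column-wise: each column of z is one example.
# Subtracting the column max is a standard numerical-stability trick.
softmax <- function(z) {
  z <- sweep(z, 2, apply(z, 2, max))
  e <- exp(z)
  sweep(e, 2, colSums(e), "/")
}

# tanh itself comes with base R; its derivative is 1 - tanh(z)^2.
tanh_deriv <- function(z) 1 - tanh(z)^2
```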
As I mention on GitHub, because this was a learning process, there are a lot of things this is not:
- fast – it hasn’t been optimized for performance in any way
- general purpose – it does some things generally, but not a long list
- tidy – base R gave me the tools necessary without over-burdening the process
- done – I have no doubt I’ll revisit this often, and bugs in the code are nearly guaranteed
Here are some of the things I learned during this process:
Data Structures are important
No surprise there. I often get stuck in the planning phase of coding and overthink the design. I felt myself falling into that trap, and a few times just wrote code knowing full well it would change. My goal was to have the forward propagation and backward propagation steps be as generic as possible so that I could introduce new activation functions easily. This wasn’t particularly difficult in general, but it did require some special treatment to account for multinomial outputs like softmax, as well as structuring the different activation functions and their derivatives in a common form.
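A hypothetical sketch of the kind of structure I mean (the names are illustrative, not the actual NNFS code): each layer carries its weights, biases, and the name of its activation, so forward propagation can loop over layers without caring which activations are in use.

```r
# One layer = weights, biases, and an activation name.
make_layer <- function(n_in, n_out, activation = "leaky_relu") {
  list(W = matrix(rnorm(n_out * n_in, sd = 0.1), n_out, n_in),
       b = matrix(0, n_out, 1),
       activation = activation)
}

# A 2-input network: one hidden layer of 5 tanh neurons,
# then a 3-class softmax output layer.
net <- list(make_layer(2, 5, "tanh"),
            make_layer(5, 3, "softmax"))
```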
Backpropagation is very elegant
Reading about the backpropagation algorithm, it seems almost obvious that this is how you would want to train a neural network, but it is easy to feel that way when it is explained after the fact. After all, I didn’t have to come up with the idea or prove the theory. The history of its development is very interesting (see Wikipedia for a brief description), and the paper “Learning representations by back-propagating errors” (Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). Nature. 323 (6088): 533–536) brought recognition to the technique.
Coding a neural network from scratch isn’t that hard, except for the hard parts
I was surprised at how much progress I made and that it all was coming together so quickly. Then I started on that last fundamental piece – backpropagation. Understanding the math behind backpropagation wasn’t the challenging bit – that had been worked out and explained to me in Bishop’s book and the 1986 paper by Rumelhart, Hinton and Williams. Putting it all together in code and ensuring all the math worked regardless of which activation functions composed the network was tricky. I even resorted to pen and paper to make sure that I understood the shapes and sizes of the various matrices involved and that I got the linear algebra right. As I added activation functions, I had to revisit the algorithm a few times but settled on a form that supported everything I needed.
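That pen-and-paper shape-checking can be sketched in R (illustrative, not the NNFS code itself): one hidden tanh layer, a linear output, squared error, and a batch of examples stored as columns. The comments track the matrix shapes that required the most care.

```r
set.seed(1)
m  <- 4                                    # examples
X  <- matrix(rnorm(2 * m), 2, m)           # 2 x m inputs
Y  <- matrix(rnorm(m), 1, m)               # 1 x m targets
W1 <- matrix(rnorm(3 * 2), 3, 2); b1 <- matrix(0, 3, 1)  # 3 hidden units
W2 <- matrix(rnorm(3), 1, 3);     b2 <- matrix(0, 1, 1)

# Forward: Z1 is 3 x m, A1 is 3 x m, Z2 is 1 x m.
Z1 <- W1 %*% X + matrix(b1, 3, m)
A1 <- tanh(Z1)
Z2 <- W2 %*% A1 + matrix(b2, 1, m)

# Backward: dZ2 (1 x m) -> dW2 (1 x 3) -> dZ1 (3 x m) -> dW1 (3 x 2).
dZ2 <- (Z2 - Y) / m                        # derivative of half mean squared error
dW2 <- dZ2 %*% t(A1)
db2 <- rowSums(dZ2)
dZ1 <- (t(W2) %*% dZ2) * (1 - A1^2)        # chain rule through tanh
dW1 <- dZ1 %*% t(X)
db1 <- rowSums(dZ1)

# A finite-difference check on one weight confirms the gradient.
eps <- 1e-6
loss <- function(W1_) {
  A1_ <- tanh(W1_ %*% X + matrix(b1, 3, m))
  Z2_ <- W2 %*% A1_ + matrix(b2, 1, m)
  sum((Z2_ - Y)^2) / (2 * m)
}
W1p <- W1; W1p[1, 1] <- W1p[1, 1] + eps
numeric_grad <- (loss(W1p) - loss(W1)) / eps   # should be close to dW1[1, 1]
```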
There’s so much more
Dropout, weight normalization, autoencoders, Boltzmann machines, radial basis functions, convolutional layers, and the list goes on and on. The field is incredibly broad and incredibly deep. I shouldn’t be surprised given the breadth and depth of statistical learning theory and all the branches there are. Neural networks are but one more tool in the toolbox, but a very diverse and interesting one. (I’m particularly fascinated with autoencoders and Boltzmann machines for generating representations of very high dimensional spaces, but that’s another blog post. The linked posts from Rubik’s Code are very good introductions.)
This exploration also reinforced my interest in Statistics and the theoretical underpinnings of those tools, as well as providing more incentive to continue learning across the Statistical Learning and Machine Learning worlds. Just one more step on that journey and one more gap filled only to reveal the incredibly vast landscape yet to be explored.
And then I just started playing. I’ve saved many examples in the NNFS code on GitHub.
Two-class classification was simple even when there were clusters buried inside other clusters, so I tried 6 classes. And then, of course, I went for 10 classes. Maybe that was a little excessive. The original data was randomly generated with 2 input dimensions, and is shown as dots with each class a different color. The colored regions show the class predictions given by the neural network for that point in the graph.
The spiral problem was particularly interesting. Here I generated a spiral of red points on a background of blue. As before, the colored dots are the data and the regions are the predictions from the trained neural network composed of 3 hidden layers having 5 neurons each. (Data was generated using slightly modified code from here.)
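A hedged sketch of the kind of spiral data involved (the parameters here are illustrative, not the exact code I adapted): red points wind outward from the origin while blue points fill the background uniformly.

```r
set.seed(42)
n <- 200
theta <- seq(0, 4 * pi, length.out = n)            # two full turns
r <- theta / (4 * pi)                              # radius grows with angle
spiral <- cbind(x = r * cos(theta), y = r * sin(theta)) +
  matrix(rnorm(2 * n, sd = 0.02), n, 2)            # a little jitter
background <- cbind(x = runif(n, -1, 1), y = runif(n, -1, 1))
X <- rbind(spiral, background)                     # 2n x 2 inputs
y <- rep(c(1, 0), each = n)                        # 1 = red spiral, 0 = blue
```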
I tried my hand at the MNIST digit recognition dataset from a Kaggle competition. My implementation can’t compete with those based on Convolutional Neural Nets, but I can get about 90+% accuracy on a hold-out set of 25% of the original data while training on 50% and using the remaining 25% as a validation set. The figure below shows some of the misclassified examples from a neural net with 4 hidden layers of 25, 20, 15, and 10 neurons each. (Note that there are 784 inputs for each 28×28 pixel image.)

Finally, I tried predicting the RGB value for a pixel in an image given only the X and Y coordinates. For the figures below I used 3 hidden layers with 50 neurons each, so this is a big, expressive network that is working to memorize RGB values for a 200×200 pixel image. Maybe not an overly useful architecture or project, but entertaining nonetheless. Below, the target images are on the left, as if you couldn’t tell.
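A sketch of how an image becomes a regression dataset for this last experiment (illustrative only): every pixel is one training example, with the scaled (x, y) position as input and the RGB triple as the target. The `img` here is a stand-in random array; in practice it would come from something like `png::readPNG`.

```r
h <- 200; w <- 200
img <- array(runif(h * w * 3), dim = c(h, w, 3))   # placeholder 200 x 200 RGB image

grid <- expand.grid(y = seq_len(h), x = seq_len(w))
X <- cbind(grid$x / w, grid$y / h)                 # inputs scaled to (0, 1]
Y <- cbind(img[cbind(grid$y, grid$x, 1)],          # R channel
           img[cbind(grid$y, grid$x, 2)],          # G channel
           img[cbind(grid$y, grid$x, 3)])          # B channel
```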