I really enjoyed reading Artificial Intelligence – A Guide for Thinking Humans by Melanie Mitchell. The author is a professor of computer science and an artificial intelligence (AI) researcher. The book is her attempt at working out if the singularity is near (or at least likely), or if we still are far from creating any true intelligence. In the process, the reader gets an excellent overview of the state of the art in areas such as image recognition, game play, and natural language processing. Even though it is aimed at general readers, I found it to be very good in technical content.
I don’t have any experience working with AI and machine learning (ML). But lately I have been playing around with a very simple neural network in Python. The code comes from the book Classic Computer Science Problems in Python, and trying it out really helped me understand how it works. We also read Grokking Deep Learning in the book club at work. Seeing a neural network that starts with random weights and, after training, is able to make good predictions is almost magical. However, at the same time I don’t see the network as intelligent in any way. To me, it is more just clever use of a form of statistics.
When I talk to other software developers, I find that a lot of them believe we are headed towards the singularity. Or at least that level 5 autonomous cars are imminent (“When do you think it will be illegal for humans to drive?”). I have a hard time seeing the path to that, and Melanie Mitchell is equally skeptical. In the introductory part of the book, she explains the arguments from “singularitarians” like Ray Kurzweil. It boils down to the power of exponential growth – with ever more powerful computers, we will soon be able to recreate human-level intelligence. There are of course skeptics as well, for example pointing out that the exponential growth applies more to hardware than software. In any case, by going through and explaining how the various versions of AI work today, Mitchell gives the reader more information to base their opinion on. And in the process you learn a lot about the state of the art of AI.
Here are summaries of what I liked the most from the different parts of the book. Even though the chapters are pretty short, she manages to pack a lot of relevant information into them.
Even though the idea of building machines that can think has been around for a very long time, the beginning of AI can be pinpointed to a summer workshop in 1956 at Dartmouth College. The “big four” pioneers of the field were present: John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon. There was great optimism in the early years. Herbert Simon predicted that “machines will be capable, within twenty years, of doing any work that a man can do”.
The approaches to AI soon split into different directions. An important distinction is between symbolic AI and subsymbolic AI. In symbolic AI, the goal is to build systems that can reason like humans do when solving problems. The approach involves codifying rules and applying them. This idea dominated the first three decades of the AI field, and produced so-called expert systems.
Subsymbolic AI tries to recreate the partly unconscious process of, for example, recognizing a face, or identifying spoken words. The approach is to mimic how the brain’s neurons work. Frank Rosenblatt invented the perceptron in the late 1950s. It sums a number of inputs, multiplied by different weights. If the sum is greater than a threshold, it produces output 1 (it “fires”), otherwise it produces output 0. This is the precursor of the building blocks of today’s neural networks, used for among other things image recognition.
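A minimal Python sketch of such a unit might look like the following. The weights and threshold are made-up values (not from the book), chosen so that the unit happens to behave like a logical AND of two inputs.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 ("fire") if the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Hypothetical weights and threshold that make the unit act like a logical AND.
and_weights = [1.0, 1.0]
and_threshold = 1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron([a, b], and_weights, and_threshold))
```

Only the input (1, 1) gives a weighted sum above 1.5, so only that case fires. Learning, in this picture, means finding weights and a threshold like these automatically instead of picking them by hand.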
AI turned out to be much harder than expected. As Marvin Minsky later observed: “Easy things are hard”. The original goals of AI – computers that could communicate using natural language, describe what they saw, and learn from only a few examples – are things young children can easily do. But it is hard to get machines to do them. When the results did not materialize, funding dried up, and an “AI winter” followed.
The big advances in image recognition in the past decade have come from the subsymbolic branch of AI. Mitchell describes how pictures of handwritten figures are processed using neural networks with backpropagation. A network is built up of several layers, and each layer consists of many perceptron-like units. Typically there is one unit per pixel in the input layer. Then there are a number of hidden layers, and finally an output layer that indicates what kind of picture it is. All the units in a layer are connected to all the units in the neighboring layers. The weights in each unit determine what output it will give, and those weights are adjusted during training. The error (the output compared to the expected output) determines how much the weights should be changed. These error corrections are propagated back through the layers. After many rounds of training, the network is configured to predict based on the input. She then goes on to describe how convolutions are used to build up representations of larger structures in pictures, such as edges and shapes, when classifying images.
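The training loop described above can be sketched in plain Python. This is a toy under stated assumptions, not the setup from the book: it uses a single hidden layer of three sigmoid units, the XOR function as a stand-in for a labeled data set, and an arbitrarily chosen learning rate and number of rounds. All names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNetwork:
    """One hidden layer of sigmoid units feeding a single sigmoid output unit."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)  # fixed seed so runs are repeatable
        # The last weight in each list is a bias, fed by a constant input of 1.0.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0])))
                  for ws in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def train_step(self, x, target, learning_rate):
        hidden, out = self.forward(x)
        # Error at the output unit, scaled by the slope of the sigmoid.
        delta_out = (out - target) * out * (1.0 - out)
        # Propagate the error back: each hidden unit is blamed in proportion
        # to the weight that connects it to the output unit.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] -= learning_rate * delta_out * v
        for ws, delta in zip(self.w_hidden, delta_hidden):
            for i, v in enumerate(x + [1.0]):
                ws[i] -= learning_rate * delta * v

def mean_squared_error(net, examples):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in examples) / len(examples)

# XOR as a tiny stand-in for a labeled training set.
examples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

net = TinyNetwork(n_inputs=2, n_hidden=3)
error_before = mean_squared_error(net, examples)
for _ in range(5000):
    for x, target in examples:
        net.train_step(x, target, learning_rate=0.5)
error_after = mean_squared_error(net, examples)
print(f"error before: {error_before:.3f}, after: {error_after:.3f}")
```

The network starts with random weights, and the only thing the loop does is nudge each weight a little in the direction that reduces the error on the current example. Real image classifiers differ in scale (and add convolutional layers), but the weight-update idea is the same.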
These systems require labeled images for training. There are some standard image collections that are used for comparing the performance of image classification systems. One such set is ImageNet, consisting of 1.2 million labeled pictures. Amazon’s Mechanical Turk system was used to get humans to label many of the pictures. There is a good description of how intense the competition between various research groups was in trying to get the highest score classifying the ImageNet pictures.
Criticism: While the image recognition systems of today are truly impressive (for example Facebook’s facial recognition), there are some problems with them. It is often claimed that they “learn on their own”. But apart from needing labeled input, there are also many hyperparameters that need to be set. For example, the number of units and layers in the network, and the learning rate. These settings can have a big impact on the performance of the neural net, and finding the right combination is more art than science at the moment.
Then there are adversarial examples. Researchers have found that you can take an image of, say, a school bus, and change some pixels in a way that cannot be detected by humans, but that will fool the system into misclassifying it as, for example, an ostrich. There are also ways of generating pictures that look like random noise, but that will be classified with 99% certainty to be a specific object. These adversarial examples raise the question “What, precisely, are these networks learning?”
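To get a feel for how many tiny pixel changes can add up, here is a toy sketch. It is not a deep network and not the method from the book: it assumes a made-up linear classifier over 100 "pixels" and shifts each pixel by 0.01 against the sign of its weight, in the spirit of what is known as the fast gradient sign method.

```python
def classify(pixels, weights, bias):
    """Toy linear classifier: a positive score means 'school bus', otherwise 'ostrich'."""
    score = sum(p * w for p, w in zip(pixels, weights)) + bias
    return "school bus" if score > 0 else "ostrich"

def sign(x):
    return (x > 0) - (x < 0)

n_pixels = 100
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(n_pixels)]  # made-up weights
bias = 0.1
image = [0.5] * n_pixels  # a uniform grey "picture"

# Move every pixel a tiny step in the direction that lowers the score.
epsilon = 0.01
adversarial = [p - epsilon * sign(w) for p, w in zip(image, weights)]

print(classify(image, weights, bias))        # school bus
print(classify(adversarial, weights, bias))  # ostrich
print(max(abs(a - p) for a, p in zip(adversarial, image)))  # largest change to any pixel
```

No single pixel changes by more than about 0.01 on a 0 to 1 scale, but because every change pushes the score the same way, the total effect is enough to flip the label.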
There are also examples where the answer to that question is not what you expect. One system was trained to distinguish between pictures of landscapes and animals. It worked quite well, until it was discovered that it only distinguished between blurry or sharp backgrounds. The reason was that most pictures of animals had the animal in focus, and the background blurry, whereas the landscape pictures were all sharp. Again, not what was expected. However, you do want the system to pick up on traits that humans don’t notice or can’t see. The problem is that it is hard to know what those characteristics are.
This section starts with a toy example of how reinforcement learning works. A robot dog is being trained to kick a soccer ball. Random movements are performed, and when a sequence of moves leads to a successful outcome (the ball is kicked), this is recorded as something to do more of. However, during the training, you must also sometimes try new moves, even if you have already found some successful ones. This is in order to explore the whole space of potential actions (explore vs exploit). To know what actions to perform that were previously successful, you need to store the state and actions, and the corresponding value those actions are estimated to earn. For the robot dog, the state might be its position and the ball position, and the actions are moves and whether to kick or not. The state, actions and values are stored in a table called a Q-table, and this form of reinforcement learning is sometimes called Q-learning. Q is used instead of V (for value), because the letter V was used for something else in the original paper.
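A stripped-down version of the robot dog can be sketched as follows. The setup is an assumption made for illustration: the dog stands on a line of five positions, the ball is at the last one, and the only actions are stepping left or right. The learning rate, discount factor and exploration rate are arbitrary choices.

```python
import random

N_STATES = 5          # positions 0..4 on a line; the ball is at position 4
ACTIONS = (-1, +1)    # step left or step right
GOAL = N_STATES - 1

def train_q_table(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)   # explore: try a random move
            else:
                # Exploit: take the best-known move, breaking ties at random.
                action = max(ACTIONS, key=lambda a: (q[(state, a)], rng.random()))
            next_state = min(max(state + action, 0), GOAL)
            reward = 1.0 if next_state == GOAL else 0.0
            best_next = 0.0 if next_state == GOAL else max(
                q[(next_state, a)] for a in ACTIONS)
            # Nudge the stored value toward reward plus discounted future value.
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q_table = train_q_table()
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(GOAL)}
print(policy)  # {0: 1, 1: 1, 2: 1, 3: 1}
```

After training, the best-valued action in every position is to step right, toward the ball. The `epsilon` parameter is the explore-versus-exploit knob: with probability 0.2 the dog ignores the table and tries a random move.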
DeepMind was using reinforcement learning (Q-learning), combined with deep neural networks, when it made systems that could play the classic Atari computer games like Breakout, Space Invaders, Asteroids and Pong. They called their approach deep Q-learning. The state in these cases is the current frame (the pixels of the current screen), and three previous frames from previous time steps. To select an action (for Breakout – paddle left, paddle right, or no-op), a convolutional neural network (like from image classification) is used. A trick is needed for how to update the weights, since there is no known answer like there is for a labeled picture.
In reinforcement learning, the value of an action is an estimate of the reward earned at the end of the training episode, if this action is taken. This estimate should be more accurate towards the end of the episode. The trick is to assume the current output of the network is closer to being correct than its output from the previous iteration. So the strategy used is to minimize the difference between the current and previous iterations. The network thus learns to make the outputs consistent from one iteration to the next. This is called temporal difference learning. In many cases, these systems learned to play the Atari games much better than humans could.
Two other game-playing programs are also covered in this section. The first is IBM’s Deep Blue that beat Garry Kasparov in chess in 1997. That program worked by evaluating potential future positions from a tree of possible moves, and then used the minimax algorithm to determine what move to make. Perhaps the most famous game playing example is AlphaGo, the program that defeated Lee Sedol in Go in 2016.
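The minimax idea fits in a few lines of Python. This sketch assumes a made-up game tree written as nested lists, where a number is the evaluation of a final position and a list holds the positions reachable in one move. Deep Blue's real search was of course far more elaborate.

```python
def minimax(node, maximizing):
    """Return the best score reachable from this node.

    A node is either a number (the evaluation of a leaf position) or a list
    of child nodes (the positions reachable in one move).
    """
    if isinstance(node, (int, float)):
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# Made-up two-ply tree: three moves for us, each answered by the opponent.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
best_move = max(range(len(tree)), key=lambda i: minimax(tree[i], maximizing=False))
print(best_move, minimax(tree, maximizing=True))  # prints: 0 3
```

The third move contains the tempting score 14, but the opponent would answer with the reply worth 2. Assuming the opponent always picks the reply that is worst for us, the first move is best, since it guarantees at least 3.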
AlphaGo from DeepMind is similar to Deep Blue in that it analyses move sequences in the game tree. But there are differences that made Go a tougher challenge. There are more potential moves in Go, so the tree of moves is even bigger. Furthermore, nobody has been able to come up with a good evaluation function for how good a given board configuration is. So AlphaGo uses Monte Carlo tree search. Since it is impossible to explore all possible moves in the game tree, it picks a few at random (the Monte Carlo part) and plays those out until the game ends in a win or a loss. The moves along the way are also picked at random. Simulating the game till the end for such a pick is called a roll-out from that position. The result (win or loss) of the roll-out is used to update the statistics for which moves are good at each position. AlphaGo performed close to 2,000 roll-outs per turn, so eventually the statistics for which move to make become pretty good.
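The roll-out idea can be tried on a game much smaller than Go. The sketch below is an assumption-laden toy: it uses a take-away game (players alternate taking 1 to 3 stones, and whoever takes the last stone wins), and it only does the roll-out statistics for the moves at the current position, not the full tree search that AlphaGo performs.

```python
import random

def legal_moves(stones):
    return [n for n in (1, 2, 3) if n <= stones]

def rollout(stones, my_turn, rng):
    """Play random moves to the end; return True if 'we' take the last stone."""
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return my_turn          # whoever just moved took the last stone
        my_turn = not my_turn

def best_move_by_rollouts(stones, rollouts_per_move=1000, seed=0):
    rng = random.Random(seed)
    win_rates = {}
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            win_rates[move] = 1.0   # taking the last stone wins outright
            continue
        # After our move it is the opponent's turn, so my_turn starts as False.
        wins = sum(rollout(remaining, my_turn=False, rng=rng)
                   for _ in range(rollouts_per_move))
        win_rates[move] = wins / rollouts_per_move
    return max(win_rates, key=win_rates.get), win_rates

move, rates = best_move_by_rollouts(5)
print(move, rates)
```

With five stones on the table, taking one stone comes out with the highest win rate (roughly two wins in three under random play, against about one in two for the other moves). It also happens to be the correct move, since it leaves the opponent with four stones. No evaluation function was needed, only the rules and a lot of random games.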
The Monte Carlo tree search is complemented with a deep convolutional neural network. The network is trained to assign rough values for all possible moves from the current position. The ConvNet indicates which moves are good candidates for roll-outs. After the roll-outs, the updated values for what moves were good are used to update the ConvNet’s output, via backpropagation. Eventually, the ConvNet will learn to recognize patterns. The program was improved by playing games against itself, about five million times. AlphaGo thus used a combination of reinforcement learning, Monte Carlo tree search, and deep convolutional neural networks.
Criticism: These game playing successes led DeepMind to claim that they had demonstrated that “… even in the most challenging domains it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules”. However, as impressive as the win over Lee Sedol was, it is important to remember that AlphaGo’s skills at playing Go are only useful for playing Go. They don’t help in any other game, let alone in other tasks. In other words, there is no “transfer learning”.
Further, a lot of real-world tasks do not have a state that is as easy to define as the state in a game. The same goes for evaluating the effect of an action. Also, at least in the AlphaGo case, there was human guidance in deciding to use Monte Carlo tree search, as well as in setting the hyperparameters for the ConvNet.
Natural language processing (NLP) means getting computers to deal with human language. One example is sentiment classification, where the goal is to automate whether a short sentence is positive or negative. For example, for a movie review like “A little too dark for my taste”, did the reviewer like the movie or not? Some early NLP systems looked at the occurrence of individual words to try to determine the sentiment. For example, “dark” in the above example could indicate a negative view. However, in “Despite the heavy subject, there’s enough humor to keep it from becoming too dark”, the sentiment is positive, even though “dark” appears again.
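The weakness of counting words is easy to reproduce. The sketch below assumes a tiny made-up word list with hand-assigned scores; it is only meant to show the failure mode, not how any real system was built.

```python
import re

# Hypothetical mini-lexicon; a real system would use thousands of scored words.
WORD_SCORES = {"dark": -1, "heavy": -1, "boring": -1, "humor": 1, "great": 1}

def word_count_sentiment(text):
    """Add up the scores of the individual words, ignoring order and context."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(WORD_SCORES.get(w, 0) for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(word_count_sentiment("A little too dark for my taste"))
print(word_count_sentiment("Despite the heavy subject, there's enough humor "
                           "to keep it from becoming too dark"))
```

Both reviews come out as negative, even though the second one is positive. The words “heavy” and “dark” outvote “humor”, and nothing in the word counts captures what “despite” and “keep it from” do to the meaning.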
Just looking at individual words is not enough to capture the meaning of a sentence. One improvement is to use a Recurrent Neural Network (RNN). It handles two problems – the variable length input (sentences), and the fact that the order of the words in the sentence is important. The differences compared to the neural network used for image classification are that a hidden unit also has connections to itself and other hidden units (recurrent connections), and that the sentence is processed in time steps (one step per word). The output (positive or negative sentiment) is only output after all the words have been processed. The recurrent connections allow it to process each word while remembering the context (the previous words in the sentence).
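The mechanics of the time steps can be sketched without any training. The code below assumes a three-word vocabulary, four hidden units, and random untrained weights, so the numbers it produces mean nothing. It only shows how the hidden state is carried from word to word.

```python
import math
import random

VOCAB = ["dog", "bites", "man"]
HIDDEN_SIZE = 4

rng = random.Random(0)
# Random, untrained weights: input-to-hidden and hidden-to-hidden (recurrent).
w_input = [[rng.uniform(-1, 1) for _ in VOCAB] for _ in range(HIDDEN_SIZE)]
w_recurrent = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
               for _ in range(HIDDEN_SIZE)]

def encode(sentence):
    """Feed the words in one per time step and return the final hidden state."""
    hidden = [0.0] * HIDDEN_SIZE
    for word in sentence.split():
        # One input slot per vocabulary word; the slot for this word is set to 1.
        x = [1.0 if word == v else 0.0 for v in VOCAB]
        # The new state depends on the current word AND the previous state.
        hidden = [
            math.tanh(sum(w * xi for w, xi in zip(w_input[j], x))
                      + sum(w * h for w, h in zip(w_recurrent[j], hidden)))
            for j in range(HIDDEN_SIZE)
        ]
    return hidden

print(encode("dog bites man"))
print(encode("man bites dog"))
print(encode("dog bites"))
```

Sentences of any length end up as a hidden state of the same size, and the same three words in a different order give a different state. A trained network would add an output unit that reads the final state and reports positive or negative.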
One problem remains – the input to the neural network must be numbers. How should you encode the input words into numbers? One way is one-hot encoding. If you have 20,000 words, you make one slot for each word. When a given word appears, its slot is one, and all the other slots are zero. The problem with this approach is that there is no way of knowing if two words are close to each other in meaning. For example, hated and disliked in a movie review should have approximately the same meaning, but one-hot encoding will not capture that relation.
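In code, with a four-word vocabulary standing in for the 20,000 words (the word list is made up for illustration), one-hot encoding and its blind spot look like this:

```python
VOCABULARY = ["loved", "liked", "disliked", "hated"]  # tiny stand-in for 20,000 words

def one_hot(word):
    """One slot per vocabulary word; only the slot for this word is 1."""
    return [1 if word == v else 0 for v in VOCABULARY]

def dot(u, v):
    """Dot product, used here as a simple measure of overlap between encodings."""
    return sum(a * b for a, b in zip(u, v))

print(one_hot("hated"))                            # [0, 0, 0, 1]
print(dot(one_hot("hated"), one_hot("disliked")))  # 0
print(dot(one_hot("hated"), one_hot("loved")))     # 0
```

The overlap between hated and disliked is exactly the same as between hated and loved, namely zero. Every word is equally far from every other word, so the encoding carries no information about meaning.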
Enter word2vec. In 2013, researchers at Google came up with a clever scheme to represent words as vectors in a 300-dimensional space. As input, they used massive amounts of text from Google News. For each sentence, they created all the pairs of words occurring next to each other (excluding all short words like a and the). For example, “A man went into a restaurant and ordered a hamburger” would create the pairs (man, went), (went, into), (into, restaurant), (restaurant, ordered) and (ordered, hamburger). These pairs, and the reverse, like (hamburger, ordered), were used to train a regular neural network to predict which words would occur next to each other.
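Generating the training pairs for the example sentence can be sketched like this. The stop-word list is an assumption, containing just enough short words to reproduce the pairs listed above.

```python
import re

STOP_WORDS = {"a", "and", "the"}  # assumed stop list, just enough for this sentence

def adjacent_pairs(sentence):
    """Return (word, neighbor) pairs in both directions, after dropping stop words."""
    words = [w for w in re.findall(r"[a-z]+", sentence.lower())
             if w not in STOP_WORDS]
    forward = list(zip(words, words[1:]))
    backward = [(b, a) for a, b in forward]
    return forward + backward

pairs = adjacent_pairs("A man went into a restaurant and ordered a hamburger")
print(pairs[:5])
# [('man', 'went'), ('went', 'into'), ('into', 'restaurant'),
#  ('restaurant', 'ordered'), ('ordered', 'hamburger')]
```

The remaining five entries in `pairs` are the reversed versions, such as ('hamburger', 'ordered').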
In this case, one-hot encoding was used for both the input and the output. As an example, if there were 700,000 distinct words, then there would be 700,000 inputs to the neural network, and 700,000 outputs. The hidden layer had 300 units. After the network had been trained with billions of word pairs, it could indicate, for a given input word, how likely each output word was to occur next to it. For example, if the input is hamburger, the corresponding slot in the input is one, and all the others are zero. Of the 700,000 output slots, a high value indicates that its corresponding word is likely to appear next to hamburger. Now for the clever bit (in my mind): for each word, the values on the 300 hidden units are used to form the vector for that word. This means that for each of the 700,000 words, a 300-element vector is created. This vector is similar for words with similar meaning. For example, words close to France were Spain, Belgium, Netherlands, Italy and so on, because they all occurred in similar contexts in sentences. Similarly, words close to hamburger are burger, cheeseburger, sandwich, hot dog, tacos and fries.
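Why the hidden units give you one vector per word becomes clearer in a small sketch. It assumes five words instead of 700,000, three hidden units instead of 300, random untrained weights, and hidden units that simply output their weighted sum (no squashing function).

```python
import random

VOCABULARY = ["hamburger", "cheeseburger", "france", "spain", "restaurant"]
EMBEDDING_SIZE = 3  # the real model used 300

rng = random.Random(0)
# One row of input-to-hidden weights per vocabulary word (random here, untrained).
w_input = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_SIZE)] for _ in VOCABULARY]

def hidden_activations(one_hot_input):
    """Linear hidden layer: each hidden unit outputs a weighted sum of the inputs."""
    return [sum(x * w_input[i][j] for i, x in enumerate(one_hot_input))
            for j in range(EMBEDDING_SIZE)]

def word_vector(word):
    one_hot_input = [1.0 if word == v else 0.0 for v in VOCABULARY]
    return hidden_activations(one_hot_input)

# The vector for a word is simply that word's row of weights.
print(word_vector("hamburger") == w_input[VOCABULARY.index("hamburger")])  # True
```

Because only one input slot is non-zero, the hidden values for a word are just the weights leading out of that word's slot. Training adjusts those weights so that words appearing in similar contexts end up with similar rows, which is what makes the vectors useful.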
The word vectors are used in, for example, Google Translate. When translating from, say, English to French, the words of the input sentence are turned into their corresponding vectors. Then the recurrent neural network encodes the sentence in time steps (one per word). The sentence is now encoded in the activations of the hidden units. This encoding is given as input to a decoder network, in the example for French. This is another recurrent network, but one where the outputs are numbers representing the words that form the translated sentence.
NLP also includes reading comprehension. One example of the difficulties of reading comprehension is determining what “it” refers to. For example:
Sentence 1: “I poured water from the bottle into the cup until it was full”
Question: “What was full?” A. The bottle. B. The cup.
Sentence 2: “I poured water from the bottle into the cup until it was empty”
Question: “What was empty?” A. The bottle. B. The cup.
These kinds of tests are called Winograd schemas, named for the pioneering NLP researcher Terry Winograd. The best performance of any program at the time of writing the book was 61% – better than random guessing, but far below human performance. Since these kinds of language questions usually require some form of real-world knowledge (if you pour water from a bottle it becomes empty, not full), it has been proposed to use a series of such questions as an alternative to the Turing test.
Criticism: There has been enormous progress in many areas of NLP. But to get even higher accuracy it looks like there is a need to actually understand the text – finding patterns in the texts is not enough.
Meaning and Understanding
Despite all the successes of the various systems described so far, a common weak point for them is that there is no real “understanding” in them. For example, a state-of-the-art image recognition system does not understand what is in the picture, even if it can correctly classify it. The last few chapters of the book discuss what would be needed to gain true understanding.
First, Mitchell discusses all the implicit knowledge humans have of the world. We know how objects behave in the world. If you drop an object, it will fall, and it will stop, bounce or possibly break when it hits the ground. An object hidden behind another object is still there. This is called intuitive physics. There is also intuitive biology – we know a dog can move on its own, but a stroller can’t. We can also imagine different possible scenarios that could happen. Many of these capacities could be explained to come from us experiencing the physical world (embodiment). Perhaps embodiment is also needed for AI systems that can understand the world the way we do.
There is also an interesting example of abstraction and pattern-finding: Bongard problems. I had not come across these before, but they are featured in Gödel, Escher, Bach: An Eternal Golden Braid (I really should read it). They consist of 6 images of shapes to the left and 6 images to the right. The object is to figure out how the six to the left differ from the six to the right. For example, all the pictures to the left could be one large and one small object, whereas the pictures to the right all contain two small objects. This is quite a hard problem to solve with a program, but much easier for humans.
There is also an interesting example from Mitchell’s own research on analogy making. If abc is changed to abd, how should pqrs be changed? Most people would answer pqrt (replace the last letter with its successor in the alphabet). But there are other possible answers, like pqrd (replace the last letter with d). It was quite interesting to learn about the attempts to write programs that could perform these kinds of tasks automatically. Finally, there are ten questions and answers on what Mitchell thinks the future of AI holds.
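The two candidate answers to the letter-string puzzle correspond to two different rules, which are easy to write down by hand. This is only a sketch of the two rules mentioned above, not of the analogy-making programs from Mitchell's research, which have to discover such rules on their own. The successor rule here does not handle the letter z.

```python
def successor_rule(target):
    """'Replace the last letter with its successor in the alphabet.'"""
    return target[:-1] + chr(ord(target[-1]) + 1)

def literal_rule(target, new_letter="d"):
    """'Replace the last letter with d.'"""
    return target[:-1] + new_letter

# Both rules explain abc -> abd, yet they disagree on pqrs.
assert successor_rule("abc") == literal_rule("abc") == "abd"
print(successor_rule("pqrs"))  # pqrt
print(literal_rule("pqrs"))    # pqrd
```

Both rules fit the one example perfectly. Picking pqrt over pqrd means preferring the more abstract description of what changed, and that preference is the hard part to automate.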
Here are some fun facts I learned from the book:
- In the 1950s, IBM engineer Arthur Samuel programmed an IBM 701 computer to play checkers. He did this without an operating system or even assembler – he had to write everything with op codes and addresses. The program was among the earliest machine-learning programs. Indeed, it was Samuel who coined the term machine learning. (page 152)
- In 1966, for the Summer Vision Project, “Minsky hired a first-year undergraduate and assigned him a problem to solve over the summer: connect a television camera to a computer and get the machine to describe what it sees.” (page 69)
- Breakout was Atari’s effort to create a single-player version of Pong. The design and implementation of Breakout was originally assigned in 1975 to a twenty-year-old employee called Steve Jobs. He enlisted the help of his friend Steve Wozniak. (page 147)
Maybe there will be artificial general intelligence (AGI) one day. But it doesn’t look like any of the existing techniques will bring us there, at least not by themselves. Despite that, I think today’s AI systems are fantastic engineering feats and enormously useful. This book on how these current systems work was a joy to read. It was clear, concise, and very interesting, and I learned a lot from it.