How To Create A Mind
There is another important consideration: How do we set the many parameters that control a pattern recognition system's functioning? These could include the number of vectors that we allow in the vector quantization step, the initial topology of hierarchical states (before the training phase of the hidden Markov model process prunes them back), the recognition threshold at each level of the hierarchy, the parameters that control the handling of the size parameters, and many others. We can establish these based on our intuition, but the results will be far from optimal.
We call these parameters "God parameters" because they are set prior to the self-organizing method of determining the topology of the hidden Markov models (or, in the biological case, before the person learns her lessons by similarly creating connections in her cortical hierarchy). This is perhaps a misnomer, given that these initial DNA-based design details are determined by biological evolution, though some may see the hand of God in that process (and while I do consider evolution to be a spiritual process, this discussion properly belongs in chapter 9).
When it came to setting these "God parameters" in our simulated hierarchical learning and recognizing system, we again took a cue from nature and decided to evolve them, in our case using a simulation of evolution. We used what are called genetic or evolutionary algorithms (GAs), which include simulated sexual reproduction and mutations.
Here is a simplified description of how this method works. First, we determine a way to code possible solutions to a given problem. If the problem is optimizing the design parameters for a circuit, then we define a list of all of the parameters (with a specific number of bits assigned to each parameter) that characterize the circuit. This list is regarded as the genetic code in the genetic algorithm. Then we randomly generate thousands or more genetic codes. Each such genetic code (which represents one set of design parameters) is considered a simulated "solution" organism.
Now we evaluate each simulated organism in a simulated environment by using a defined method to assess each set of parameters. This evaluation is a key to the success of a genetic algorithm. In our example, we would run each program generated by the parameters and judge it on appropriate criteria (did it complete the task, how long did it take, and so on). The best-solution organisms (the best designs) are allowed to survive, and the rest are eliminated.
Now we cause each of the survivors to multiply themselves until the population is back to its original number of solution creatures. This is done by simulating sexual reproduction: In other words, we create new offspring where each new creature draws one part of its genetic code from one parent and another part from a second parent. Usually no distinction is made between male or female organisms; it's sufficient to generate an offspring from any two arbitrary parents, so we're basically talking about same-sex marriage here. This is perhaps not as interesting as sexual reproduction in the natural world, but the relevant point here is having two parents. As these simulated organisms multiply, we allow some mutation (random change) in the chromosomes to occur.
We've now defined one generation of simulated evolution; now we repeat these steps for each subsequent generation. At the end of each generation we determine how much the designs have improved (that is, we compute the average improvement in the evaluation function over all the surviving organisms). When the degree of improvement in the evaluation of the design creatures from one generation to the next becomes very small, we stop this iterative cycle and use the best design(s) in the last generation. (For an algorithmic description of genetic algorithms, see this endnote.)[11]
The key to a genetic algorithm is that the human designers don't directly program a solution; rather, we let one emerge through an iterative process of simulated competition and improvement. Biological evolution is smart but slow, so to enhance its intelligence we greatly speed up its ponderous pace. The computer is fast enough to simulate many generations in a matter of hours or days, and we've occasionally had them run for as long as weeks to simulate hundreds of thousands of generations. But we have to go through this iterative process only once; as soon as we have let this simulated evolution run its course, we can apply the evolved and highly refined rules to real problems in a rapid fashion. In the case of our speech recognition systems, we used them to evolve the initial topology of the network and other critical parameters. We thus used two self-organizing methods: a GA to simulate the biological evolution that gave rise to a particular cortical design, and HHMMs to simulate the cortical organization that accompanies human learning.
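The generational loop described above can be sketched in a few lines of Python. This is only an illustrative toy, not the system we built: the bit-string genome, single-point crossover, the "keep the top scorers" selection rule, the small population, and every name in it (`evolve`, `count_ones`, `min_gain`, and so on) are assumptions made for the example. The toy evaluation function simply counts the 1 bits in a genome.

```python
import random


def count_ones(genome):
    """Toy evaluation function (an assumption): more 1 bits is better."""
    return sum(genome)


def crossover(parent_a, parent_b, rng):
    """Child takes one part of its code from each of two parents."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(genome, rate, rng):
    """Flip each bit independently with a small probability."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in genome]


def evolve(fitness, n_bits=32, pop_size=200, survivors=50,
           mutation_rate=0.01, min_gain=1e-3, max_generations=500, seed=0):
    rng = random.Random(seed)
    # Randomly generate the initial genetic codes.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    previous_avg = None
    for _ in range(max_generations):
        # Evaluate every organism and let only the best designs survive.
        best = sorted(population, key=fitness, reverse=True)[:survivors]
        avg = sum(fitness(g) for g in best) / survivors
        # Stop once the average improvement among survivors is very small.
        if previous_avg is not None and avg - previous_avg < min_gain:
            break
        previous_avg = avg
        # Survivors multiply (with mutation) back to the original population size.
        population = list(best)
        while len(population) < pop_size:
            p1, p2 = rng.sample(best, 2)
            population.append(mutate(crossover(p1, p2, rng), mutation_rate, rng))
    return max(population, key=fitness)


if __name__ == "__main__":
    winner = evolve(count_ones)
    print(count_ones(winner), "of", len(winner), "bits set")
```

In a real application the expensive part is the evaluation function, which here is a one-line stand-in; everything else in the loop stays about this simple.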
Another major requirement for the success of a GA is a valid method of evaluating each possible solution. This evaluation needs to be conducted quickly, because it must take account of many thousands of possible solutions for each generation of simulated evolution. GAs are adept at handling problems with too many variables for which to compute precise analytic solutions. The design of an engine, for example, may involve more than a hundred variables and requires satisfying dozens of constraints; GAs used by researchers at General Electric were able to come up with jet engine designs that met the constraints more precisely than conventional methods.
When using GAs you must, however, be careful what you ask for. A genetic algorithm was used to solve a block-stacking problem, and it came up with a perfect solution...except that it had thousands of steps. The human programmers forgot to include minimizing the number of steps in their evaluation function.
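To make the block-stacking lesson concrete, here is a hedged sketch of two evaluation functions. The details of the original experiment are not given here, so the representation of a plan as a list of steps, the function names, and the `step_penalty` value are all assumptions chosen for illustration.

```python
def naive_fitness(plan, goal_reached):
    """Rewards only success, so a 5,000-step plan scores as well as a 10-step one."""
    return 1.0 if goal_reached else 0.0


def step_aware_fitness(plan, goal_reached, step_penalty=0.001):
    """Still requires success, but discounts the score as the plan gets longer."""
    if not goal_reached:
        return 0.0
    return 1.0 / (1.0 + step_penalty * len(plan))
```

With the first function, evolution has no reason to prefer short plans; with the second, two successful plans are ranked by length, which is what the programmers presumably wanted all along.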
Scott Draves's Electric Sheep project is a GA that produces art. The evaluation function uses human evaluators in an open-source collaboration involving many thousands of people. The art moves through time and you can view it at electricsheep.org.
For speech recognition, the combination of genetic algorithms and hidden Markov models worked extremely well. Simulating evolution with a GA was able to substantially improve the performance of the HHMM networks. What evolution came up with was far superior to our original design, which was based on our intuition.
We then experimented with introducing a series of small variations in the overall system. For example, we would make perturbations (minor random changes) to the input. Another such change was to have adjacent Markov models "leak" into one another by causing the results of one Markov model to influence models that are "nearby." Although we did not realize it at the time, the sorts of adjustments we were experimenting with are very similar to the types of modifications that occur in biological cortical structures.
At first, such changes hurt performance (as measured by accuracy of recognition). But if we reran evolution (that is, reran the GA) with these alterations in place, it would adapt the system accordingly, optimizing it for these introduced modifications. In general, this would restore performance. If we then removed the changes we had introduced, performance would be again degraded, because the system had been evolved to compensate for the changes. The adapted system became dependent on the changes.
One type of alteration that actually helped performance (after rerunning the GA) was to introduce small random changes to the input. The reason for this is the well-known "overfitting" problem in self-organizing systems. There is a danger that such a system will over-specialize to the specific examples contained in the training sample rather than learn patterns that generalize. By making random adjustments to the input, the more invariant patterns in the data survive, and the system thereby learns these deeper patterns. This helped only if we reran the GA with the randomization feature on.
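A minimal sketch of what "small random changes to the input" can look like in code follows. The specifics are assumptions for illustration only: each input is taken to be a list of numeric features, the perturbation is Gaussian noise, and the names `perturb`, `augmented`, `copies`, and `scale` are hypothetical.

```python
import random


def perturb(features, scale, rng):
    """Return a copy of the features with a small random change added to each."""
    return [x + rng.gauss(0.0, scale) for x in features]


def augmented(samples, copies=3, scale=0.05, seed=0):
    """Keep every (features, label) pair and add `copies` perturbed variants."""
    rng = random.Random(seed)
    out = []
    for features, label in samples:
        out.append((list(features), label))
        for _ in range(copies):
            out.append((perturb(features, scale, rng), label))
    return out
```

Because the label is unchanged while the features jitter, only patterns that survive the jitter remain useful to the learner.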
This introduces a dilemma in our understanding of our biological cortical circuits. It had been noticed, for example, that there might indeed be a small amount of leakage from one cortical connection to another, resulting from the way that biological connections are formed: The electrochemistry of the axons and dendrites is apparently subject to the electromagnetic effects of nearby connections. Suppose we were able to run an experiment where we removed this effect in an actual brain. That would be difficult to actually carry out, but not necessarily impossible. Suppose we conducted such an experiment and found that the cortical circuits worked less effectively without this neural leakage. We might then conclude that this phenomenon was a very clever design by evolution and was critical to the cortex's achieving its level of performance. We might further point out that such a result shows that the orderly model of the flow of patterns up the conceptual hierarchy and the flow of predictions down the hierarchy was in fact much more complicated because of this intricate influence of connections on one another.
But that would not necessarily be an accurate conclusion. Consider our experience with a simulated cortex based on HHMMs, in which we implemented a modification very similar to interneuronal cross talk. If we then ran evolution with that phenomenon in place, performance would be restored (because the evolutionary process adapted to it). If we then removed the cross talk, performance would be compromised again. In the biological case, evolution (that is, biological evolution) was indeed "run" with this phenomenon in place. The detailed parameters of the system have thereby been set by biological evolution to be dependent on these factors, so that changing them will negatively affect performance unless we run evolution again. Doing so is feasible in the simulated world, where evolution only takes days or weeks, but in the biological world it would require tens of thousands of years.
So how can we tell whether a particular design feature of the biological neocortex is a vital innovation introduced by biological evolution (that is, one that is instrumental to our level of intelligence) or merely an artifact that the design of the system is now dependent on but could have evolved without? We can answer that question simply by running simulated evolution with and without these particular variations to the details of the design (for example, with and without connection cross talk). We can even do so with biological evolution if we're examining the evolution of a colony of microorganisms where generations are measured in hours, but it is not practical for complex organisms such as humans. This is another one of the many disadvantages of biology.
Getting back to our work in speech recognition, we found that if we ran evolution (that is, a GA) separately on the initial design of (1) the hierarchical hidden Markov models that were modeling the internal structure of phonemes and (2) the HHMMs' modeling of the structures of words and phrases, we got even better results. Both levels of the system were using HHMMs, but the GA would evolve design variations between these different levels. This approach still allowed the modeling of phenomena that occur in between the two levels, such as the smearing of phonemes that often happens when we string certain words together (for example, "How are you all doing?" might become "How're y'all doing?").
It is likely that a similar phenomenon took place in different biological cortical regions, in that they have evolved small differences based on the types of patterns they deal with. Whereas all of these regions use the same essential neocortical algorithm, biological evolution has had enough time to fine-tune the design of each of them to be optimal for their particular patterns. However, as I discussed earlier, neuroscientists and neurologists have noticed substantial plasticity in these areas, which supports the idea of a general neocortical algorithm. If the fundamental methods in each region were radically different, then such interchangeability among cortical regions would not be possible.
The systems we created in our research using this combination of self-organizing methods were very successful. In speech recognition, they were able for the first time to handle fully continuous speech and relatively unrestricted vocabularies. We were able to achieve a high accuracy rate on a wide variety of speakers, accents, and dialects. The current state of the art as this book is being written is represented by a product called Dragon Naturally Speaking (Version 11.5) for the PC from Nuance (formerly Kurzweil Computer Products). I suggest that people try it if they are skeptical about the performance of contemporary speech recognition: accuracies are often 99 percent or higher after a few minutes of training on your voice on continuous speech and relatively unrestricted vocabularies. Dragon Dictation is a simpler but still impressive free app for the iPhone that requires no voice training. Siri, the personal assistant on contemporary Apple iPhones, uses the same speech recognition technology with extensions to handle natural-language understanding.
The performance of these systems is a testament to the power of mathematics. With them we are essentially computing what is going on in the neocortex of a speaker (even though we have no direct access to that person's brain) as a vital step in recognizing what the person is saying and, in the case of systems like Siri, what those utterances mean. We might wonder, if we were to actually look inside the speaker's neocortex, would we see connections and weights corresponding to the hierarchical hidden Markov models computed by the software? Almost certainly we would not find a precise match; the neuronal structures would invariably differ in many details compared with the models in the computer. However, I would maintain that there must be an essential mathematical equivalence to a high degree of precision between the actual biology and our attempt to emulate it; otherwise these systems would not work as well as they do.
LISP
LISP (LISt Processor) is a computer language, originally specified by AI pioneer John McCarthy (1927–2011) in 1958. As its name suggests, LISP deals with lists. Each LISP statement is a list of elements; each element is either another list or an "atom," which is an irreducible item constituting either a number or a symbol. A list included in a list can be the list itself, hence LISP is capable of recursion. Another way that LISP statements can be recursive is if a list includes a list, and so on until the original list is specified. Because lists can include lists, LISP is also capable of hierarchical processing. A list can be a conditional such that it only "fires" if its elements are satisfied. In this way, hierarchies of such conditionals can be used to identify increasingly abstract qualities of a pattern.
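The list-of-lists idea can be illustrated without LISP itself. The sketch below uses Python nested lists as a stand-in; it is an analogy under stated assumptions, not LISP semantics. In particular, the rule that a list "fires" only when every one of its elements is satisfied is a deliberate simplification, and the names `is_atom`, `depth`, `fires`, and the example pattern `apple` are hypothetical.

```python
def is_atom(item):
    """An atom is an irreducible item: a number or a symbol (here, a string)."""
    return isinstance(item, (int, float, str))


def depth(expr):
    """Number of list levels above the atoms, computed recursively."""
    if is_atom(expr):
        return 0
    return 1 + max((depth(e) for e in expr), default=0)


def fires(expr, observed):
    """A list fires only if all of its elements are satisfied.

    An atom is satisfied when it is present in the observed set; a list is
    satisfied when every element (atom or sub-list) is satisfied.
    """
    if is_atom(expr):
        return expr in observed
    return all(fires(e, observed) for e in expr)


# A two-level pattern: a word built from groups of letters.
apple = [["A"], ["P", "P"], ["L", "E"]]
```

Here `fires(apple, {"A", "P", "L", "E"})` is true, while `fires(apple, {"A", "P"})` is false because the lowest-level group containing "L" and "E" is not satisfied, so the higher-level list cannot fire either.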
LISP became the rage in the artificial intelligence community in the 1970s and early 1980s. The conceit of the LISP enthusiasts of the earlier decade was that the language mirrored the way the human brain worked: that any intelligent process could most easily and efficiently be coded in LISP. There followed a mini-boomlet in "artificial intelligence" companies that offered LISP interpreters and related LISP products, but when it became apparent in the mid-1980s that LISP itself was not a shortcut to creating intelligent processes, the investment balloon collapsed.
It turns out that the LISP enthusiasts were not entirely wrong. Essentially, each pattern recognizer in the neocortex can be regarded as a LISP statement: each one constitutes a list of elements, and each element can be another list. The neocortex is therefore indeed engaged in list processing of a symbolic nature very similar to that which takes place in a LISP program. Moreover, it processes all 300 million LISP-like "statements" simultaneously.
However, there were two important features missing from the world of LISP, one of which was learning. LISP programs had to be coded line by line by human programmers. There were attempts to automatically code LISP programs using a variety of methods, but these were not an integral part of the language's concept. The neocortex, in contrast, programs itself, filling its "statements" (that is, the lists) with meaningful and actionable information from its own experience and from its own feedback loops. This is a key principle of how the neocortex works: Each one of its pattern recognizers (that is, each LISP-like statement) is capable of filling in its own list and connecting itself both up and down to other lists. The second difference is the size parameters. One could create a variant of LISP (coded in LISP) that would allow for handling such parameters, but these are not part of the basic language.
LISP is consistent with the original philosophy of the AI field, which was to find intelligent solutions to problems and to code them directly in computer languages. The first attempt at a self-organizing method that would teach itself from experience (neural nets) was not successful because it did not provide a means to modify the topology of the system in response to learning. The hierarchical hidden Markov model effectively provided that through its pruning mechanism. Today, the HHMM together with its mathematical cousins makes up a major portion of the world of AI.
A corollary of the observation of the similarity of LISP and the list structure of the neocortex is an argument made by those who insist that the brain is too complicated for us to understand. These critics point out that the brain has trillions of connections, and since each one must be there specifically by design, they constitute the equivalent of trillions of lines of code. As we've seen, I've estimated that there are on the order of 300 million pattern processors in the neocortex, or 300 million lists where each element in the list is pointing to another list (or, at the lowest conceptual level, to a basic irreducible pattern from outside the neocortex). But 300 million is still a reasonably big number of LISP statements and indeed is larger than any human-written program in existence.
However, we need to keep in mind that these lists are not actually specified in the initial design of the nervous system. The brain creates these lists itself and connects the levels automatically from its own experiences. This is the key secret of the neocortex. The processes that accomplish this self-organization are much simpler than the 300 million statements that constitute the capacity of the neocortex. Those processes are specified in the genome. As I will demonstrate in chapter 11, the amount of unique information in the genome (after lossless compression) as applied to the brain is about 25 million bytes, which is equivalent to less than a million lines of code. The actual algorithmic complexity is even less than that, as most of the 25 million bytes of genetic information pertain to the biological needs of the neurons, and not specifically to their information-processing capability. However, even 25 million bytes of design information is a level of complexity we can handle.
Hierarchical Memory Systems
As I discussed in chapter 3, Jeff Hawkins and Dileep George in 2003 and 2004 developed a model of the neocortex incorporating hierarchical lists that was described in Hawkins and Blakeslee's 2004 book On Intelligence. A more up-to-date and very elegant presentation of the hierarchical temporal memory method can be found in Dileep George's 2008 doctoral dissertation.[12] Numenta has implemented it in a system called NuPIC (Numenta Platform for Intelligent Computing) and has developed pattern recognition and intelligent data-mining systems for such clients as Forbes and Power Analytics Corporation. After working at Numenta, George has started a new company called Vicarious Systems with funding from the Founders Fund (managed by Peter Thiel, the venture capitalist behind Facebook, and Sean Parker, the first president of Facebook) and from Good Ventures, led by Dustin Moskovitz, cofounder of Facebook. George reports significant progress in automatically modeling, learning, and recognizing information with a substantial number of hierarchies. He calls his system a "recursive cortical network" and plans applications for medical imaging and robotics, among other fields. The technique of hierarchical hidden Markov models is mathematically very similar to these hierarchical memory systems, especially if we allow the HHMM system to organize its own connections between pattern recognition modules. As mentioned earlier, HHMMs provide for an additional important element, which is modeling the expected distribution of the magnitude (on some continuum) of each input in computing the probability of the existence of the pattern under consideration. I have recently started a new company called Patterns, Inc., which intends to develop hierarchical self-organizing neocortical models that utilize HHMMs and related techniques for the purpose of understanding natural language.
An important emphasis will be on the ability for the system to design its own hierarchies in a manner similar to a biological neocortex. Our envisioned system will continually read a wide range of material such as Wikipedia and other knowledge resources as well as listen to everything you say and watch everything you write (if you let it). The goal is for it to become a helpful friend answering your questions (before you even formulate them) and giving you useful information and tips as you go through your day.
The Moving Frontier of AI: Climbing the Competence Hierarchy
1. A long tiresome speech delivered by a frothy pie topping.
2. A garment worn by a child, perhaps aboard an operatic ship.
3. Wanted for a twelve-year crime spree of eating King Hrothgar's warriors; officer Beowulf has been assigned the case.
4. It can mean to develop gradually in the mind or to carry during pregnancy.
5. National Teacher Day and Kentucky Derby Day.
6. Wordsworth said they soar but never roam.
7. Four-letter word for the iron fitting on the hoof of a horse or a card-dealing box in a casino.
8. In act three of an 1846 Verdi opera, this Scourge of God is stabbed to death by his lover, Odabella.
-Examples of Jeopardy! queries, all of which Watson got correct. Answers are: meringue harangue, pinafore, Grendel, gestate, May, skylark, shoe. For the eighth query, Watson replied, "What is Attila?" The host responded by saying, "Be more specific?" Watson clarified with, "What is Attila the Hun?," which is correct.
The computer's techniques for unraveling Jeopardy! clues sounded just like mine. That machine zeroes in on key words in a clue, then combs its memory (in Watson's case, a 15-terabyte data bank of human knowledge) for clusters of associations with these words. It rigorously checks the top hits against all the contextual information it can muster: the category name; the kind of answer being sought; the time, place, and gender hinted at in the clue; and so on. And when it feels "sure" enough, it decides to buzz. This is all an instant, intuitive process for a human Jeopardy! player, but I felt convinced that under the hood my brain was doing more or less the same thing.
-Ken Jennings, human Jeopardy! champion who lost to Watson
I, for one, welcome our new robot overlords.
-Ken Jennings, paraphrasing The Simpsons, after losing to Watson
Oh my God. [Watson] is more intelligent than the average Jeopardy! player in answering Jeopardy! questions. That's impressively intelligent.
-Sebastian Thrun, former director of the Stanford AI Lab
Watson understands nothing. It's a bigger steamroller.
-Noam Chomsky
Artificial intelligence is all around us; we no longer have our hand on the plug. The simple act of connecting with someone via a text message, e-mail, or cell phone call uses intelligent algorithms to route the information. Almost every product we touch is originally designed in a collaboration between human and artificial intelligence and then built in automated factories. If all the AI systems decided to go on strike tomorrow, our civilization would be crippled: We couldn't get money from our bank, and indeed, our money would disappear; communication, transportation, and manufacturing would all grind to a halt. Fortunately, our intelligent machines are not yet intelligent enough to organize such a conspiracy.
What is new in AI today is the viscerally impressive nature of publicly available examples. For example, consider Google's self-driving cars (which as of this writing have gone over 200,000 miles in cities and towns), a technology that will lead to significantly fewer crashes, increased road capacity, relief for humans from the chore of driving, and many other benefits. Driverless cars are actually already legal to operate on public roads in Nevada with some restrictions, although widespread usage by the public throughout the world is not expected until late in this decade. Technology that intelligently watches the road and warns the driver of impending dangers is already being installed in cars. One such technology is based in part on the successful model of visual processing in the brain created by MIT's Tomaso Poggio. Called MobilEye, it was developed by Amnon Shashua, a former postdoctoral student of Poggio's. It is capable of alerting the driver to such dangers as an impending collision or a child running in front of the car and has recently been installed in cars by such manufacturers as Volvo and BMW.
I will focus in this section of the book on language technologies for several reasons. Not surprisingly, the hierarchical nature of language closely mirrors the hierarchical nature of our thinking. Spoken language was our first technology, with written language as the second. My own work in artificial intelligence, as this chapter has demonstrated, has been heavily focused on language. Finally, mastering language is a powerfully leveraged capability. Watson has already read hundreds of millions of pages on the Web and mastered the knowledge contained in these documents. Ultimately machines will be able to master all of the knowledge on the Web, which is essentially all of the knowledge of our human-machine civilization.
English mathematician Alan Turing (1912–1954) based his eponymous test on the ability of a computer to converse in natural language using text messages.[13] Turing felt that all of human intelligence was embodied and represented in language, and that no machine could pass a Turing test through simple language tricks. Although the Turing test is a game involving written language, Turing believed that the only way that a computer could pass it would be for it to actually possess the equivalent of human-level intelligence. Critics have proposed that a true test of human-level intelligence should include mastery of visual and auditory information as well.[14] Since many of my own AI projects involve teaching computers to master such sensory information as human speech, letter shapes, and musical sounds, I would be expected to advocate the inclusion of these forms of information in a true test of intelligence. Yet I agree with Turing's original insight that the text-only version of the Turing test is sufficient. Adding visual or auditory input or output to the test would not actually make it more difficult to pass.
One does not need to be an AI expert to be moved by the performance of Watson on Jeopardy! Although I have a reasonable understanding of the methodology used in a number of its key subsystems, that does not diminish my emotional reaction to watching it (him?) perform. Even a perfect understanding of how all of its component systems work, which no one actually has, would not help you to predict how Watson would actually react to a given situation. It contains hundreds of interacting subsystems, and each of these is considering millions of competing hypotheses at the same time, so predicting the outcome is impossible. Doing a thorough analysis, after the fact, of Watson's deliberations for a single three-second query would take a human centuries.
To continue my own history, in the late 1980s and 1990s we began working on natural-language understanding in limited domains. You could speak to one of our products, called Kurzweil Voice, about anything you wanted, so long as it had to do with editing documents. (For example, "Move the third paragraph on the previous page to here.") It worked pretty well in this limited but useful domain. We also created systems with medical domain knowledge so that doctors could dictate patient reports. These systems had enough knowledge of fields such as radiology and pathology that they could question the doctor if something in the report seemed unclear, and would guide the physician through the reporting process. These medical reporting systems have evolved into a billion-dollar business at Nuance.
Understanding natural language, especially as an extension to automatic speech recognition, has now entered the mainstream. As of the writing of this book, Siri, the automated personal assistant on the iPhone 4S, has created a stir in the mobile computing world. You can pretty much ask Siri to do anything that a self-respecting smartphone should be capable of doing (for example, "Where can I get some Indian food around here?" or "Text my wife that I'm on my way," or "What do people think of the new Brad Pitt movie?"), and most of the time Siri will comply. Siri will entertain a small amount of nonproductive chatter. If you ask her what the meaning of life is, she will respond with "42," which fans of The Hitchhiker's Guide to the Galaxy will recognize as its "answer to the ultimate question of life, the universe, and everything." Knowledge questions (including the one about the meaning of life) are answered by Wolfram Alpha, described on page 170. There is a whole world of "chatbots" who do nothing but engage in small talk. If you would like to talk to our chatbot named Ramona, go to our Web site KurzweilAI.net and click on "Chat with Ramona."
Some people have complained to me about Siri's failure to answer certain requests, but I often recall that these are the same people who persistently complain about human service providers also. I sometimes suggest that we try it together, and often it works better than they expect. The complaints remind me of the story of the dog who plays chess. To an incredulous questioner, the dog's owner replies, "Yeah, it's true, he does play chess, but his endgame is weak." Effective competitors are now emerging, such as Google Voice Search.
That the general public is now having conversations in natural spoken language with their handheld computers marks a new era. It is typical that people dismiss the significance of a first-generation technology because of its limitations. A few years later, when the technology does work well, people still dismiss its importance because, well, it's no longer new. That being said, Siri works impressively for a first-generation product, and it is clear that this category of product is only going to get better.
Siri uses the HMM-based speech recognition technologies from Nuance. The natural-language extensions were first developed by the DARPA-funded "CALO" project.15 Siri has been enhanced with Nuance's own natural-language technologies, and Nuance offers a very similar technology called Dragon Go!16 The methods used for understanding natural language are very similar to hierarchical hidden Markov models, and indeed HHMM itself is commonly used. Whereas some of these systems are not specifically labeled as using HMM or HHMM, the mathematics is virtually identical. They all involve hierarchies of linear sequences where each element has a weight, connections that are self-adapting, and an overall system that self-organizes based on learning data. Usually the learning continues during actual use of the system. This approach matches the hierarchical structure of natural language; it is just a natural extension up the conceptual ladder from parts of speech to words to phrases to semantic structures. It would make sense to run a genetic algorithm on the parameters that control the precise learning algorithm of this class of hierarchical learning systems and determine the optimal algorithmic details.
Over the past decade there has been a shift in the way that these hierarchical structures are created. In 1984 Douglas Lenat (born in 1950) started the ambitious Cyc (for enCYClopedic) project, which aimed to create rules that would codify everyday "commonsense" knowledge. The rules were organized in a huge hierarchy, and each rule involved, again, a linear sequence of states. For example, one Cyc rule might state that a dog has a face. Cyc can then link to general rules about the structure of faces: that a face has two eyes, a nose, and a mouth, and so on. We don't need to have one set of rules for a dog's face and then another for a cat's face, though we may of course want to put in additional rules for ways in which dogs' faces differ from cats' faces. The system also includes an inference engine: If we have rules that state that a cocker spaniel is a dog, that dogs are animals, and that animals eat food, and if we were to ask the inference engine whether cocker spaniels eat, the system would respond that yes, cocker spaniels eat food. Over the next twenty years, and with thousands of person-years of effort, over a million such rules were written and tested. Interestingly, the language for writing Cyc rules, called CycL, is almost identical to LISP.
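The cocker spaniel inference can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the dictionaries IS_A and EATS and the helper names are hypothetical, and nothing here reflects CycL syntax or how Cyc's actual inference engine is built. It shows only the idea of inheriting a rule by walking up a hierarchy.

```python
# Hypothetical toy knowledge base (not CycL, not Cyc's real engine).
IS_A = {"cocker spaniel": "dog", "dog": "animal"}  # "X is a Y" rules
EATS = {"animal": "food"}                           # "Y eats Z" rules


def ancestors(kind):
    """Yield `kind` followed by each category above it in the is-a chain."""
    while kind is not None:
        yield kind
        kind = IS_A.get(kind)


def what_does_it_eat(kind):
    """Return what `kind` eats, inherited from the nearest category with a rule."""
    for category in ancestors(kind):
        if category in EATS:
            return EATS[category]
    return None  # no applicable rule was found


print(what_does_it_eat("cocker spaniel"))
```

Because the eating rule is attached to "animal," it never has to be restated for dogs or for cocker spaniels, which is the economy the hierarchy of rules is meant to provide.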
Meanwhile, an opposing school of thought believed that the best approach to natural-language understanding, and to creating intelligent systems in general, was through automated learning from exposure to a very large number of instances of the phenomena the system was trying to master. A powerful example of such a system is Google Translate, which can translate to and from fifty languages. That's nearly 2,500 different translation directions (50 × 49 = 2,450 ordered pairs), although for most language pairs, rather than translate language 1 directly into language 2, it will translate language 1 into English and then English into language 2. That reduces the number of translators Google needed to build to ninety-eight (plus a limited number of non-English pairs for which there is direct translation). The Google translators do not use grammatical rules; rather, they create vast databases for each language pair of common translations based on large "Rosetta stone" corpora of translated documents between two languages. For the six languages that constitute the official languages of the United Nations, Google has used United Nations documents, as they are published in all six languages. For less common languages, other sources have been used.
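The arithmetic behind these counts is easy to check. The short sketch below uses a hypothetical helper, translation_counts, and assumes English is the single pivot language and that no direct non-English pairs are built. For fifty languages it gives 2,450 ordered pairs, close to the figure of 2,500 quoted above, and ninety-eight pivot translators.

```python
def translation_counts(num_languages):
    """Return (direct, pivot) translator counts for a set of languages.

    direct: one translator for every ordered pair of distinct languages.
    pivot:  one translator into the pivot language and one out of it
            for each of the other languages.
    """
    direct = num_languages * (num_languages - 1)
    pivot = 2 * (num_languages - 1)
    return direct, pivot


direct, pivot = translation_counts(50)
print(direct, pivot)
```

The direct count grows with the square of the number of languages while the pivot count grows only linearly, which is why routing through English is attractive despite the extra translation step.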
The results are often impressive. DARPA runs annual competitions for the best automated language translation systems for different language pairs, and Google Translate often wins for certain pairs, outperforming systems created directly by human linguists.
Over the past decade two major insights have deeply influenced the natural-language-understanding field. The first has to do with hierarchies. Although the Google approach started with association of flat word sequences from one language to another, the inherent hierarchical nature of language has inevitably crept into its operation. Systems that methodically incorporate hierarchical learning (such as hierarchical hidden Markov models) provided significantly better performance. However, such systems are not quite as automatic to build. Just as humans need to learn approximately one conceptual hierarchy at a time, the same is true for computerized systems, so the learning process needs to be carefully managed.
The other insight is that hand-built rules work well for a core of common basic knowledge. For translations of short passages, this approach often provides more accurate results. For example, DARPA has rated rule-based Chinese-to-English translators higher than Google Translate for short passages. For what is called the tail of a language, which refers to the millions of infrequent phrases and concepts used in it, the accuracy of rule-based systems approaches an unacceptably low asymptote. If we plot natural-language-understanding accuracy against the amount of training data analyzed, rule-based systems have higher performance initially but level off at fairly low accuracies of about 70 percent. In sharp contrast, statistical systems can reach the high 90s in accuracy but require a great deal of data to achieve that.
Often we need a combination of at least moderate performance on a small amount of training data and then the opportunity to achieve high accuracies with a more significant quantity. Achieving moderate performance quickly enables us to put a system in the field and then to automatically collect training data as people actually use it. In this way, a great deal of learning can occur at the same time that the system is being used, and its accuracy will improve. The statistical learning needs to be fully hierarchical to reflect the nature of language, which also reflects how the human brain works.
This is also how Siri and Dragon Go! work: using rules for the most common and reliable phenomena and then learning the "tail" of the language in the hands of real users. When the Cyc team realized that they had reached a ceiling of performance based on hand-coded rules, they too adopted this approach. Hand-coded rules provide two essential functions. First, they offer adequate initial accuracy, so that a trial system can be placed into widespread usage, where it will improve automatically. Second, they provide a solid basis for the lower levels of the conceptual hierarchy so that the automated learning can begin to learn higher conceptual levels.
As mentioned above, Watson represents a particularly impressive example of the approach of combining hand-coded rules with hierarchical statistical learning. IBM combined a number of leading natural-language programs to create a system that could play the natural-language game of Jeopardy! On February 14–16, 2011, Watson competed with the two leading human players: Brad Rutter, who had won more money than anyone else on the quiz show, and Ken Jennings, who had previously held the Jeopardy! championship for the record time of seventy-five days.
By way of context, I had predicted in my first book, The Age of Intelligent Machines, written in the mid-1980s, that a computer would take the world chess championship by 1998. I also predicted that when that happened, we would either downgrade our opinion of human intelligence, upgrade our opinion of machine intelligence, or downplay the importance of chess, and that if history was a guide, we would minimize chess. Both of these things happened in 1997. When IBM's chess supercomputer Deep Blue defeated the reigning human world chess champion, Garry Kasparov, we were immediately treated to arguments that it was to be expected that a computer would win at chess because computers are logic machines, and chess, after all, is a game of logic. Thus Deep Blue's victory was judged to be neither surprising nor significant. Many of its critics went on to argue that computers would never master the subtleties of human language, including metaphors, similes, puns, double entendres, and humor.
The accuracy of natural-language-understanding systems as a function of the amount of training data. The best approach is to combine rules for the "core" of the language and a data-based approach for the "tail" of the language.
That is at least one reason why Watson represents such a significant milestone: Jeopardy! is precisely such a sophisticated and challenging language task. Typical Jeopardy! queries include many of these vagaries of human language. What is perhaps not evident to many observers is that Watson not only had to master the language in the unexpected and convoluted queries, but for the most part its knowledge was not hand-coded. It obtained that knowledge by actually reading 200 million pages of natural-language documents, including all of Wikipedia and other encyclopedias, comprising 4 trillion bytes of language-based knowledge. As readers of this book are well aware, Wikipedia is not written in LISP or CycL, but rather in natural sentences that have all of the ambiguities and intricacies inherent in language. Watson needed to consider all 4 trillion characters in its reference material when responding to a question. (I realize that Jeopardy! queries are answers in search of a question, but this is a technicality; they ultimately are really questions.) If Watson can understand and respond to questions based on 200 million pages, in three seconds, there is nothing to stop similar systems from reading the other billions of documents on the Web. Indeed, that effort is now under way.
When we were developing character and speech recognition systems and early natural-language-understanding systems in the 1970s through 1990s, we used a methodology of incorporating an "expert manager." We would develop multiple systems to do the same thing but would incorporate somewhat different approaches in each one. Some of the differences were subtle, such as variations in the parameters controlling the mathematics of the learning algorithm. Some variations were fundamental, such as including rule-based systems instead of hierarchical statistical learning systems. The expert manager was itself a software program that was programmed to learn the strengths and weaknesses of these different systems by examining their performance in real-world situations. It was based on the notion that these strengths were orthogonal; that is, one system would tend to be strong where another was weak. Indeed, the overall performance of the combined systems with the trained expert manager in charge was far better than any of the individual systems.
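A minimal sketch of the expert-manager idea follows, in Python. Everything in it is an assumption made for illustration: the ExpertManager class, the notion of a "context" label, the smoothed accuracy estimate, and the two toy experts are hypothetical and are not the systems described above. The sketch shows only the principle that a manager can learn where each system is strong and weight its vote accordingly.

```python
from collections import defaultdict


class ExpertManager:
    """Learns how much to trust each expert in each context (toy sketch)."""

    def __init__(self, experts):
        self.experts = experts               # name -> callable(item) -> answer
        self.hits = defaultdict(lambda: 1)   # smoothed count of correct answers
        self.tries = defaultdict(lambda: 2)  # smoothed count of attempts

    def train(self, examples):
        """Record each expert's performance on (item, context, truth) examples."""
        for item, context, truth in examples:
            for name, expert in self.experts.items():
                self.tries[(name, context)] += 1
                if expert(item) == truth:
                    self.hits[(name, context)] += 1

    def trust(self, name, context):
        """Estimated accuracy of one expert in one context (0.5 if unseen)."""
        return self.hits[(name, context)] / self.tries[(name, context)]

    def decide(self, item, context):
        """Return the answer with the largest trust-weighted vote."""
        votes = defaultdict(float)
        for name, expert in self.experts.items():
            votes[expert(item)] += self.trust(name, context)
        return max(votes, key=votes.get)


# Two hypothetical experts with complementary strengths.
def rule_based(word):
    return word.upper() if len(word) <= 3 else "?"


def statistical(word):
    return word.upper() if len(word) > 3 else "?"


manager = ExpertManager({"rules": rule_based, "stats": statistical})
manager.train([
    ("cat", "short", "CAT"), ("dog", "short", "DOG"),
    ("horse", "long", "HORSE"), ("zebra", "long", "ZEBRA"),
])
print(manager.decide("owl", "short"), manager.decide("tiger", "long"))
```

Neither toy expert is right everywhere, but after training the manager sides with whichever one has proved reliable in the context at hand, so the combination outperforms either system alone.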
Watson works the same way. Using an architecture called UIMA (Unstructured Information Management Architecture), Watson deploys literally hundreds of different systems (many of the individual language components in Watson are the same ones that are used in publicly available natural-language-understanding systems), all of which are attempting to either directly come up with a response to the Jeopardy! query or else at least provide some disambiguation of the query. UIMA is basically acting as the expert manager to intelligently combine the results of the independent systems. UIMA goes substantially beyond earlier systems, such as the one we developed in the predecessor company to Nuance, in that its individual systems can contribute to a result without necessarily coming up with a final answer. It is sufficient if a subsystem helps narrow down the solution. UIMA is also able to compute how much confidence it has in the final answer. The human brain does this also: we are probably very confident of our response when asked for our mother's first name, but we are less so in coming up with the name of someone we met casually a year ago.
Thus rather than come up with a single elegant approach to understanding the language problem inherent in Jeopardy!, the IBM scientists combined all of the state-of-the-art language-understanding modules they could get their hands on. Some use hierarchical hidden Markov models; some use mathematical variants of HHMM; others use rule-based approaches to code directly a core set of reliable rules. UIMA evaluates the performance of each system in actual use and combines them in an optimal way. There is some misunderstanding in the public discussions of Watson in that the IBM scientists who created it often focus on UIMA, which is the expert manager they created. This leads to comments by some observers that Watson has no real understanding of language because it is difficult to identify where this understanding resides. Although the UIMA framework also learns from its own experience, Watson's "understanding" of language cannot be found in UIMA alone but rather is distributed across all of its many components, including the self-organizing language modules that use methods similar to HHMM.
A separate part of Watson's technology uses UIMA's confidence estimate in its answers to determine how to place Jeopardy! bets. While the Watson system is specifically optimized to play this particular game, its core language- and knowledge-searching technology can easily be adapted to other broad tasks. One might think that less commonly shared professional knowledge, such as that in the medical field, would be more difficult to master than the general-purpose "common" knowledge that is required to play Jeopardy! Actually, the opposite is the case: Professional knowledge tends to be more highly organized, structured, and less ambiguous than its commonsense counterpart, so it is highly amenable to accurate natural-language understanding using these techniques. As mentioned, IBM is currently working with Nuance to adapt the Watson technology to medicine.
The conversation that takes place when Watson is playing Jeopardy! is a brief one: A question is posed, and Watson comes up with an answer. (Again, technically, it comes up with a question to respond to an answer.) It does not engage in a conversation that would require tracking all of the earlier statements of all participants. (Siri actually does do this to a limited extent: If you ask it to send a message to your wife, it will ask you to identify her, but it will remember who she is for subsequent requests.) Tracking all of the information in a conversation, a task that would clearly be required to pass the Turing test, is a significant additional requirement but not fundamentally more difficult than what Watson is doing already. After all, Watson has read hundreds of millions of pages of material, which obviously includes many stories, so it is capable of tracking through complicated sequential events. It should therefore be able to follow its own conversations and take that into consideration in its subsequent replies.
Another limitation of the Jeopardy! game is that the answers are generally brief: It does not, for example, pose questions of the sort that ask contestants to name the five primary themes of A Tale of Two Cities. To the extent that it can find documents that do discuss the themes of this novel, a suitably modified version of Watson should be able to respond to this. Coming up with such themes on its own from just reading the book, and not essentially copying the thoughts (even without the words) of other thinkers, is another matter. Doing so would constitute a higher-level task than Watson is capable of today; it is what I call a Turing test–level task. (That being said, I will point out that most humans do not come up with their own original thoughts either but copy the ideas of their peers and opinion leaders.) At any rate, this is 2012, not 2029, so I would not expect Turing test–level intelligence yet. On yet another hand, I would point out that evaluating the answers to questions such as finding key ideas in a novel is itself not a straightforward task. If someone is asked who signed the Declaration of Independence, one can determine whether or not her response is true or false. The validity of answers to higher-level questions such as describing the themes of a creative work is far less easily established.
It is noteworthy that although Watson's language skills are actually somewhat below those of an educated human, it was able to defeat the best two Jeopardy! players in the world. It could accomplish this because it is able to combine its language ability and knowledge understanding with the perfect recall and highly accurate memories that machines possess. That is why we have already largely assigned our personal, social, and historical memories to them.
Although I'm not prepared to move up my prediction of a computer passing the Turing test by 2029, the progress that has been achieved in systems like Watson should give anyone substantial confidence that the advent of Turing-level AI is close at hand. If one were to create a version of Watson that was optimized for the Turing test, it would probably come pretty close.
American philosopher John Searle (born in 1932) argued recently that Watson is not capable of thinking. Citing his "Chinese room" thought experiment (which I will discuss further in chapter 11), he states that Watson is only manipulating symbols and does not understand the meaning of those symbols. Actually, Searle is not describing Watson accurately, since its understanding of language is based on hierarchical statistical processes, not the manipulation of symbols. The only way that Searle's characterization would be accurate is if we considered every step in Watson's self-organizing processes to be "the manipulation of symbols." But if that were the case, then the human brain would not be judged capable of thinking either.
It is amusing and ironic when observers criticize Watson for just doing statistical analysis of language as opposed to possessing the "true" understanding of language that humans have. Hierarchical statistical analysis is exactly what the human brain is doing when it is resolving multiple hypotheses based on statistical inference (and indeed at every level of the neocortical hierarchy). Both Watson and the human brain learn and respond based on a similar approach to hierarchical understanding. In many respects Watson's knowledge is far more extensive than a human's; no human can claim to have mastered all of Wikipedia, which is only part of Watson's knowledge base. Conversely, a human can today master more conceptual levels than Watson, but that is certainly not a permanent gap.
One important system that demonstrates the strength of computing applied to organized knowledge is Wolfram Alpha, an answer engine (as opposed to a search engine) developed by British mathematician and scientist Dr. Stephen Wolfram (born 1959) and his colleagues at Wolfram Research. For example, if you ask Wolfram Alpha (at WolframAlpha.com), "How many primes are there under a million?" it will respond with "78,498." It did not look up the answer; it computed it, and following the answer it provides the equations it used. If you attempted to get that answer using a conventional search engine, it would direct you to links where you could find the algorithms required. You would then have to plug those formulas into a system such as Mathematica, also developed by Dr. Wolfram, but this would obviously require a lot more work (and understanding) than simply asking Alpha.
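To make the contrast between computing and looking up concrete, here is a short Python sketch that computes the same count with a sieve of Eratosthenes. It is only an illustration of computing the answer directly; the function name count_primes_below is hypothetical, and this is not a description of how Alpha or Mathematica arrive at the result.

```python
def count_primes_below(limit):
    """Count the primes strictly less than `limit` with a sieve of Eratosthenes."""
    if limit < 3:
        return 0  # there are no primes below 2
    is_prime = bytearray([1]) * limit  # is_prime[n] == 1 means "n may be prime"
    is_prime[0] = is_prime[1] = 0
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Cross out every multiple of n, starting from n * n.
            multiples = len(range(n * n, limit, n))
            is_prime[n * n::n] = bytearray(multiples)
    return sum(is_prime)


print(count_primes_below(1_000_000))  # prints 78498
```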
Indeed, Alpha consists of 15 million lines of Mathematica code. What Alpha is doing is literally computing the answer from approximately 10 trillion bytes of data that have been carefully curated by the Wolfram Research staff. You can ask a wide range of factual questions, such as "What country has the highest GDP per person?" (Answer: Monaco, with $212,000 per person in U.S. dollars), or "How old is Stephen Wolfram?" (Answer: 52 years, 9 months, 2 days as of the day I am writing this). As mentioned, Alpha is used as part of Apple's Siri; if you ask Siri a factual question, it is handed off to Alpha to handle. Alpha also handles some of the searches posed to Microsoft's Bing search engine.
In a recent blog post, Dr. Wolfram reported that Alpha is now providing successful responses 90 percent of the time.17 He also reports an exponential decrease in the failure rate, with a half-life of around eighteen months. It is an impressive system, and uses handcrafted methods and hand-checked data. It is a testament to why we created computers in the first place. As we discover and compile scientific and mathematical methods, computers are far better than unaided human intelligence in implementing them. Most of the known scientific methods have been encoded in Alpha, along with continually updated data on topics ranging from economics to physics. In a private conversation I had with Dr. Wolfram, he estimated that self-organizing methods such as those used in Watson typically achieve about an 80 perc