You must have heard of IBM's Watson system. It is, of course, the computer that won at Jeopardy against the show's previous champions. Jeopardy is a popular quiz show in which competitors are given clues and must respond with questions that satisfy those clues. For example, a clue like 'This computer beat the reigning world chess champion' would elicit the question 'Who is Deep Blue?'. As you can see, the questions given by the competitors are simple in form ('What is', 'Who is'), so the Jeopardy question-answer format is much like that of any other quiz show. The clues, however, are complex, cover a wide array of topics, and can include puns, puzzles, and maths. The competitors also place bets on each question. Competing at 'Jeopardy' thus requires the right combination of 'natural language understanding, broad knowledge, confidence and strategy'.
Watson's victory thus represents a major milestone for natural language processing, and particularly for the sub-area known as 'Question-Answering'. Question-Answering systems have great practical use in building expert systems, customer support systems, decision-making tools, and enterprise search systems.
The IBM paper, Building Watson: An Overview of the DeepQA project, provides an overview of Watson and the DeepQA architecture that underlies it. The DeepQA architecture defines a framework for developing QA systems in an extensible and modular manner, so that individual components can be customized and robust QA systems can be built and ported across domains. Figure 1 shows a high-level diagram of Watson's major components and how queries are routed through them.
- Query Analysis: This is the first stage, where the input clue is analyzed to determine the question category (puzzle, pun, mathematical, numeric, logical, etc.) and the answer type (person, location, organization, etc.). Complex clues are also decomposed into simpler clues.
- Hypothesis Generation: Watson has at its disposal many sources of information, such as encyclopedias, books, and lists of things like people and countries. Watson does not attempt to get the correct answer straightaway. Instead, it first focuses on generating as many candidate answers as possible, called 'hypotheses'. This ensures that good answers are not missed in the pursuit of the perfect answer. The goal at this stage is to increase recall.
- Soft Filtering: Watson may generate hundreds or even thousands of hypotheses, which then have to be analyzed in detail to find the correct answer. To limit this deep analysis to the most relevant answers, Watson filters out bad candidates using a few inexpensive checks, such as a mismatch between the expected and candidate answer types.
- Hypothesis and Evidence Scoring: Watson now performs a deep analysis of the remaining candidate answers, employing sophisticated linguistic and statistical techniques to gather evidence for each hypothesis. This is one of the most critical parts of Watson, since the evidence collected determines how good the answer is and how confident Watson can be about it.
- Merging and Ranking: Once the evidence is collected, a confidence score is generated for each candidate and the candidates are ranked. Based on the top answer's confidence level, Watson then decides whether or not to answer the question.
Figure 1: DeepQA Architecture (Source: The IBM paper)
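The five stages above can be sketched as a small pipeline. This is only a toy illustration of the flow, not IBM's implementation: the `KNOWLEDGE` table, the keyword-overlap scorer, the type-detection rule, and the 0.5 answering threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    text: str
    answer_type: str
    evidence: dict = field(default_factory=dict)
    confidence: float = 0.0

# Hypothetical toy knowledge source: entity -> (type, descriptive keywords).
KNOWLEDGE = {
    "Deep Blue": ("computer", {"chess", "champion", "beat", "ibm"}),
    "Garry Kasparov": ("person", {"chess", "champion", "world"}),
    "ENIAC": ("computer", {"early", "electronic", "army"}),
}

def analyze_query(clue):
    """Query analysis: extract keywords and guess the expected answer type."""
    words = {w.strip(".,'").lower() for w in clue.split()}
    answer_type = "computer" if "computer" in words else "person"
    return {"keywords": words, "answer_type": answer_type}

def generate_hypotheses(query):
    """Hypothesis generation: favor recall, keep anything with keyword overlap."""
    return [Candidate(name, typ)
            for name, (typ, kw) in KNOWLEDGE.items()
            if kw & query["keywords"]]

def soft_filter(query, candidates):
    """Soft filtering: drop candidates whose type does not match the clue."""
    return [c for c in candidates if c.answer_type == query["answer_type"]]

def score_evidence(query, candidates):
    """Evidence scoring: here a single, deliberately simple overlap score."""
    for c in candidates:
        kw = KNOWLEDGE[c.text][1]
        c.evidence["keyword_overlap"] = len(kw & query["keywords"]) / len(kw)
    return candidates

def merge_and_rank(candidates, threshold):
    """Merging and ranking: average the evidence, answer only if confident."""
    for c in candidates:
        c.confidence = sum(c.evidence.values()) / len(c.evidence)
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if ranked and ranked[0].confidence >= threshold:
        return ranked[0]
    return None  # not confident enough: stay silent

def answer(clue, threshold=0.5):
    query = analyze_query(clue)
    candidates = generate_hypotheses(query)
    candidates = soft_filter(query, candidates)
    candidates = score_evidence(query, candidates)
    return merge_and_rank(candidates, threshold)

best = answer("This computer beat the reigning world chess champion")
print(best.text if best else "no answer")
```

Note how the person candidate survives hypothesis generation but is removed by the type check before any evidence scoring is spent on it, and how raising `threshold` makes the system decline to answer.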
The flexibility of the DeepQA architecture is achieved through the use of the UIMA text analysis framework. At one point in the trials, Watson was taking about two hours to generate an answer. The fix was to parallelize Watson with UIMA-AS, which brought the response time down to the quiz show's average of 2 to 5 seconds. The improvement in accuracy is even more startling. When the IBM team started working on Watson, the gap between the show's participants and early prototypes of Watson was huge. Figure 2 depicts the evolution of Watson's performance. It started from a baseline where precision and recall were nowhere near the cloud of points corresponding to actual human competitors, but gradually reached human-level performance.
Figure 2: Watson's accuracy over time (Source: The IBM paper)
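Because Watson answers only when its confidence is high enough, its accuracy is naturally measured as a trade-off: the stricter the confidence threshold, the fewer questions are attempted and, ideally, the more of those attempts are correct. The sketch below shows one way such a curve could be computed. The `results` list is invented toy data for illustration only, not Watson's actual numbers, and the function name is hypothetical.

```python
def precision_at_threshold(results, threshold):
    """results: list of (confidence, is_correct) pairs, one per question.

    Returns (fraction_answered, precision). Precision is None when
    nothing clears the threshold, since no question was attempted.
    """
    answered = [ok for conf, ok in results if conf >= threshold]
    if not answered:
        return 0.0, None
    return len(answered) / len(results), sum(answered) / len(answered)

# Invented toy data: (confidence of top answer, whether it was correct).
results = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]

for t in (0.0, 0.5, 0.7):
    print(t, precision_at_threshold(results, t))
```

Sweeping the threshold from high to low traces out a curve of precision against the fraction of questions answered, which is one way to place a system and human players on the same plot.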
What enabled Watson to reach this level of performance? Many of the underlying analysis algorithms aren't new, but have been around in the research community for a long time. More than groundbreaking original research, it is pragmatic engineering that lies at the core of Watson's success and the following are the salient contributory factors:
- Building an end-to-end system: Very early on, the team built a baseline end-to-end system and then kept iterating on and improving it. They defined end-to-end evaluation metrics that captured the performance of the system as a whole, rather than focusing only on individual component accuracies in the initial stages. This helped them make the correct trade-offs.
- Pervasive Confidence estimation: Every component in Watson gives a confidence estimate along with its response. This is critical, since these confidence scores can be aggregated into the final confidence for each answer, and it allows components of varying accuracy to be integrated easily. The rule is that no component is assumed to be perfect, but each makes available a confidence estimate for its answers.
- Many experts: There may be competing algorithms to do the same task. Rather than using the best, the system uses multiple algorithms so as to get diverse results and evidence. The confidence estimates help to blend the diverse results.
- Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.
- Massive parallelism: As mentioned, exploiting massive parallelism allows Watson to examine a large number of hypotheses within the time available.
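Several of these factors (pervasive confidence estimation, many experts, and parallelism) can be illustrated together in a few lines. In the sketch below, three independent scorers each return a score between 0 and 1 for a candidate, a logistic function blends them into one confidence, and the candidates are scored concurrently. The scorers, the `WEIGHTS` and `BIAS` values, and the candidate fields are all hypothetical. In a real system the weights would be learned from training data, and this is not a description of how Watson itself combines evidence.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "experts": each maps a candidate to a score in [0, 1].
SCORERS = {
    "type_match": lambda c: 1.0 if c["type"] == c["expected_type"] else 0.0,
    "passage_support": lambda c: min(c["supporting_passages"] / 10, 1.0),
    "popularity": lambda c: c["popularity"],
}

# Hypothetical blending weights (assumed, not learned here).
WEIGHTS = {"type_match": 2.0, "passage_support": 3.0, "popularity": 0.5}
BIAS = -3.0

def blend(scores):
    """Combine per-scorer scores into one confidence with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * s for name, s in scores.items())
    return 1.0 / (1.0 + math.exp(-z))

def score_candidate(candidate):
    """Run every scorer on one candidate and blend the results."""
    scores = {name: fn(candidate) for name, fn in SCORERS.items()}
    return candidate["text"], blend(scores)

def rank(candidates, workers=4):
    """Score candidates concurrently, then sort by blended confidence."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(score_candidate, candidates))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidates = [
    {"text": "A", "type": "computer", "expected_type": "computer",
     "supporting_passages": 8, "popularity": 0.5},
    {"text": "B", "type": "person", "expected_type": "computer",
     "supporting_passages": 9, "popularity": 0.9},
    {"text": "C", "type": "computer", "expected_type": "computer",
     "supporting_passages": 1, "popularity": 0.2},
]

for text, confidence in rank(candidates):
    print(text, round(confidence, 3))
```

No single scorer decides the outcome: candidate B has the most passage support but loses to A because the type scorer disagrees, which is the kind of blending that lets imperfect components be combined safely.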