Friday, July 30, 2010

Speech-recognition software continues to improve

From The NY Times State of the Art column:


Nuance, the company that makes Dragon NaturallySpeaking for Windows, is in a pretty sweet position: It’s essentially a monopoly. One by one, its competitors in the speech-recognition business have either left the market (Philips), gone out of business (Lernout & Hauspie) or turned over their products to Nuance (I.B.M.). Even the sole Mac speech-recognition program, MacSpeech Dictate, can no longer be considered a kind of rival; Nuance bought it this year.

Only the underappreciated, mostly ignored Speech program in Microsoft Windows is around to keep Nuance on its toes.

But here’s the thing: When you’re a monopoly, what incentive do you have to innovate? Does Nuance have the spine to keep prices down and quality up when it’s the only game in town?

As a clue, you can try out NaturallySpeaking 11, which goes on sale Thursday. This upgrade follows the same philosophy as the last few annual updates. It’s full of nips and tucks, all welcome, all well-executed, though none killer — and the annual improvement in dictation accuracy.

Nuance says the new version is 15 percent more accurate. Which is fine, if barely noticeable (how much better is a 15 percent gain when you’re already getting 99.6 percent accuracy?). More interesting is how it got there.
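To put that parenthetical in numbers, here is a rough back-of-the-envelope sketch. It assumes that "15 percent more accurate" means 15 percent fewer errors, which is only one possible reading of Nuance's claim, and it uses a hypothetical 1,000-word passage for scale.

```python
def accuracy_after_error_cut(accuracy_pct: float, error_reduction_pct: float) -> float:
    """Return the new accuracy if the error rate shrinks by the given percentage.

    Assumption: "X percent more accurate" is read as "X percent fewer errors".
    """
    error_pct = 100.0 - accuracy_pct
    new_error_pct = error_pct * (1.0 - error_reduction_pct / 100.0)
    return 100.0 - new_error_pct


before = 99.6                                   # accuracy figure quoted in the column
after = accuracy_after_error_cut(before, 15.0)  # claimed improvement

words = 1000                                    # hypothetical passage length
errors_before = words * (100.0 - before) / 100.0
errors_after = words * (100.0 - after) / 100.0

print(f"accuracy: {before:.2f}% -> {after:.2f}%")
print(f"errors per {words} words: {errors_before:.1f} -> {errors_after:.1f}")
```

Under that reading, accuracy moves from 99.6 to roughly 99.66 percent, or from about 4 wrong words per thousand to about 3.4, which fits the "barely noticeable" verdict.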

Back in December, Nuance began offering a free iPhone app, Dragon Dictation. You speak; the company’s computers in Boston analyze your snippet; within seconds, the converted, typed text appears on your screen.

But this was no altruistic move; Nuance had an ulterior motive. Its computers keep copies of those hundreds of thousands of dictated messages (no names attached, of course), creating an amazing central archive of American voices and speech patterns. Nuance engineers later exploited this gold mine, using it to test out new recognition algorithms to improve Dragon’s accuracy. Sneaky, eh?

The accuracy is so good that you no longer have to begin by reading a four-minute training text, as in years past. I installed the software on my PC, skipped the training, and dictated one of my old columns, 1,300 words. It achieved effectively 100 percent accuracy, even correctly nailing toughies like “LinkedIn,” “Twitterific,” “freebies” and “twentysomethings.” (It made one error, but I’m letting it off the hook for not recognizing the Web site name Bebo.)

Visual changes greet you, too. The biggest one is a cheat sheet of commands that fills a panel on the right side of your screen. It eats up a lot of space, but it’s probably a big help to people who have never realized that you can do a lot more than speak-to-type. You can also control the computer itself.

You can open programs (“open Firefox”), pick menu commands, click Web links, move the cursor, format text (“italicize ‘The New York Times’ ”) and so on. In version 11, you can apply the same formatting (like bold, italic or underline) to every occurrence of a word or phrase in a document. That came in handy when I dictated something about Twitter, and Dragon consistently refused to capitalize it. No problem; I capitalized all occurrences afterward, with a single command.

NaturallySpeaking has never handled children’s voices well, but that’s changed, too. Now even first graders can be first-class speech citizens.

Dictation software is not goof-proof; it may never be. Heaven knows, it’s hard enough understanding other people’s speech even if you’re a human being. (I keep a file of favorite NatSpeak gaffes over the years — “the right or left” became “the writer left”; “a case we summarily dismissed” became “a case we so merrily dismissed”; and “oxymoron” once became “ax a moron,” which is often a tempting idea.)

Over time, your software is supposed to get better and better because each time it makes a mistake, you’re supposed to correct it with your voice. You say, “correct ‘ax a moron,’ ” and up pops a numbered list of alternate transcriptions. You say “choose 2,” the program fixes the text, it learns from its mistake, and you go on.

Unfortunately, to the company’s great frustration, a certain percentage of NatSpeak owners never used the “Correct” spoken command to fix mistakes. Instead, they just double-clicked the error and typed over it, depriving the software of the chance to learn. For these people, accuracy never went up.

So in version 11, Nuance did another sneaky thing: if you manually edit something NatSpeak transcribed, the software compares the new phrase you type with what you originally said. If you change text to something completely different (“a hot day” to “a scorching afternoon”), the software assumes you’re just editing. But if the two phrases sound alike (if you change “basic aberration” to “basic operation,” say), then the software concludes that you’re correcting a transcription error, and it learns. In other words, accuracy will now improve even for people who refuse to get with the program.
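The column doesn't say how Nuance makes that comparison. As a toy illustration of the idea only, the sketch below uses spelling similarity from Python's standard difflib module as a crude stand-in for "sounds alike." The function names and the 0.6 cutoff are hypothetical, and a real recognizer would presumably compare the typed text against the recorded audio rather than against letters.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff, chosen only for illustration.
SOUND_ALIKE_THRESHOLD = 0.6


def similarity(original: str, edited: str) -> float:
    """Return a 0..1 spelling-similarity score (a crude proxy for sound)."""
    return SequenceMatcher(None, original.lower(), edited.lower()).ratio()


def classify_edit(original: str, edited: str) -> str:
    """Label a manual edit as a 'correction' (learn from it) or a 'rewrite' (ignore it)."""
    if similarity(original, edited) >= SOUND_ALIKE_THRESHOLD:
        return "correction"
    return "rewrite"


# The two examples from the column.
print(classify_edit("basic aberration", "basic operation"))
print(classify_edit("a hot day", "a scorching afternoon"))
```

With this cutoff, the first pair is treated as a correction and the second as an ordinary rewrite. Spelling is a weak proxy, though: sound-alike phrases that are spelled very differently could slip under the threshold, which is why a phonetic or acoustic comparison would be the more plausible approach in practice.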

Version 10 introduced shortcut commands like “search the Web for San Diego pizzerias” or “search maps for 200 West 79 Street, New York, New York.” You’d marvel as your PC flew into action, bringing up a Google search or MapQuest page for whatever you said.

There’s more of that in Version 11. In addition to “search the Web for...”, “search e-mail for...”, and “search my computer for...”, you can now speak similar commands to search Wikipedia, Twitter, Facebook and eBay.

There are other improvements, too. If your PC has a multicore processor, NatSpeak divides up the recognition tasks to get better transcription results. The little yellow floating box, where a half-formed transcription used to appear before pouring the text into your document, is gone; now a little Dragon cursor moves along with your text, and changes shape to indicate when it’s ignoring the incoming sound, like when you cough.

And if you have one of the 18 approved digital audio recorders, you can prepare NatSpeak for transcribing your voice recordings after only four minutes of training. It used to be 15. (No, the program still can’t transcribe interviews; no software can. That job, where multiple speakers are talking colloquially with no punctuation, far from the microphone, is still too daunting.)

There’s been some price improvement, too; the Pro version, with features for managing people’s voice accounts over a network, now costs $600 instead of $900, and the Legal edition goes for $800 instead of $1,200. Still stratospheric, but no longer something out of Monty Python.

In short, what’s good about NaturallySpeaking has gotten better. But some of what’s wrong with it stays wrong.

For example, you can edit by voice, with complete random-access control of what you’ve already “typed,” in many important programs: all the Microsoft Office programs, for instance, and, now, the free OpenOffice Writer word processor. But when you dictate into a program like Skype, all you get is creepy random strings of letters, completely useless unless you work in a license-plate factory.

The feature that purports to insert commas and periods automatically, without your having to speak them, is still so tentative and flaky, it wastes more time than it saves.

Writing by dictating still requires a mental adjustment; you pretty much have to know what phrase you want before you start speaking. And accuracy results vary widely according to accents and other factors.

It’s probably not worth the $100 to upgrade to Dragon 11 if you already have 10 (and maybe even 9). But if you have an earlier version, or if you’ve never even tried dictation software, you’ll probably be amazed at how far the technology has come. Yes, Nuance has a near-monopoly in the speech-recognition game, but it’s nice to see it making steady improvements and price cuts as if it didn’t.