The Feldman File: Google and the limits of tweaking

Late last year, several observers wrote about what they saw as a deterioration in the quality of Google's search results:

Jeff Atwood of Stack Overflow wrote that sites that have simply copied Stack Overflow's articles and surrounded them with advertisements are showing up higher in Google's search results than the original Stack Overflow content. According to Google's own rules, that's not supposed to happen.
Vivek Wadhwa wrote in TechCrunch: "Google has become a jungle: a tropical paradise for spammers and marketers. Almost every search takes you to websites that want you to click on links that make them money, or to sponsored sites that make Google money. There’s no way to do a meaningful chronological search."
The "wheels fell off" of Google Search for Paul Kedrosky when he tried to decide which new dishwasher to buy. Thanks to keyword-driven content generated by content mills such as Demand Media and the Yahoo Contributor Network (formerly Associated Content), "Pages and pages of Google results...are just, for practical purposes, advertisements in the loose guise of articles, original or re-purposed. It hearkens back to the dark days of 1999, before Google arrived, when search had become largely useless, with results completely overwhelmed by spam and info-clutter."

Content providers have been trying to "game" Google's results ever since Google became a serious search engine player, but Google has always been able to adapt its algorithms to keep the best results coming up at the top. Now, however, it looks like the gamers are winning, and that's opening the door for other search engines.

This may not be a perfect, or even a relevant, analogy, but it may help explain what Google is facing. Twenty-five years ago, Kurzweil was the only company that could read and convert virtually any typeface to ASCII (OCR, or optical character recognition). They did it by having the machine operator scan in examples of the material to be converted, and then individually identify each character ("this is an 'L'...this is an 'I'...this is a lower-case 'i'") until the reader could understand the test set. Then, the operator could scan in the complete set of documents, and the Kurzweil device would read and convert them. However, there were always characters that it still couldn't read, and the operator would have to stop and correct the mistakes. These corrections would further train the system.

The Kurzweil system could only recognize a limited number of typefaces at a time, because it would get confused. Over time, more training and corrections actually led to lower accuracy, as the system could no longer distinguish between similar characters such as "e", "o" and "q", "E" and "F", "D" and "O", or "I", "i", "L", "l" and "1". Early systems relied on character shapes alone and didn't use dictionaries or context checks. As a result, at some point the operator had to discard the training set and train the device all over again.
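To make the ambiguity concrete, here is a minimal Python sketch of the kind of dictionary check that shape-only recognition lacked. Everything in it is hypothetical: the confusion sets, the tiny word list, and the function names are invented for illustration and are not taken from any Kurzweil or Palantir/Calera product.

```python
from itertools import product

# Hypothetical confusion sets: for each recognized character, the shapes a
# shape-only recognizer might have mistaken it for (the guess itself first).
CONFUSABLE = {
    "l": ["l", "I", "1"],
    "I": ["I", "l", "1"],
    "1": ["1", "l", "I"],
    "o": ["o", "e"],
    "e": ["e", "o"],
}

# Hypothetical word list standing in for a real dictionary.
DICTIONARY = {"limit", "google", "search"}


def candidates(raw_word):
    """Yield every reading of raw_word allowed by the confusion sets."""
    options = [CONFUSABLE.get(ch, [ch]) for ch in raw_word]
    for combo in product(*options):
        yield "".join(combo)


def correct(raw_word):
    """Return the first candidate found in the dictionary, else raw_word unchanged."""
    for cand in candidates(raw_word):
        if cand in DICTIONARY:
            return cand
    return raw_word


print(correct("1imit"))   # the shape-only guess "1imit" is resolved to a real word
print(correct("googlo"))  # "o" and "e" are confusable in the last position
print(correct("xyz"))     # no dictionary match, so the raw guess is kept
```

Shape alone cannot choose between "1imit", "limit" and "Iimit"; only the word list can. This is a sketch of the idea, not of how any commercial OCR engine actually implemented it.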

True algorithmic recognition systems from Palantir/Calera eventually solved the problem and were able to read the vast majority of typefaces without any training. Eventually, through acquisitions and mergers, the technologies of Kurzweil and Palantir/Calera fell under one roof at ScanSoft, and are currently sold as OmniPage 17 by Nuance.

My point is that the training technology of Kurzweil eventually reached its limit. Even after adding the best fixes the company could think of, its technology was eventually supplanted by algorithmically based shape recognition, augmented with dictionaries and context analysis. Google could now face the same challenge. Having tweaked and augmented its search algorithms for years, it may no longer be able to keep up with attempts to game its system. In order to truly fix the problem, Google may have to either switch to a fundamentally different search and filtering technology, or bolt on a radically different approach, such as social searching.

As the Kurzweil case suggests, technologies have limits, and once those limits are reached, it may take radical, not just incremental, changes to the technologies either to get further improvements or to avoid going backward.

Tuesday, January 04, 2011
