The Semantics of “Semantic”

News: Google gives search a refresh, the Wall Street Journal reports.
Views: Lots of these changes sound great to me, as best I can tell from a WSJ report. I’m a fan of improving relevance and I like the idea of “examining a Web page and identifying information about specific entities referenced on it, rather than only look for keywords” depending on how it works. The language evokes processes behind linked data, especially the idea of unambiguous identifiers for entities, allowing aggregation of relevant data from all kinds of sources.

But how? The web hasn’t (yet) gotten around to tagging entities with stable URIs, and Google’s expertise is in crunching text. I suspect the answer involves lots of processing power grinding down text to its component parts and matching up bits with some level of reliability, as opposed to the (supposed) certainty of identification via stable URI. On the one hand, that seems a better strategy than sitting around waiting for certainty to descend from the heavens and resolve all data everywhere into one giant, tidy triple store. On the other hand, that’s still the old method, “look for keywords,” just applied to the really KEY keywords. In other keywords, between the lines of the WSJ story, it’s looking like a turbocharged version of the current technology.
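To make that contrast concrete, here is a toy sketch in Python of matching text to entities "with some level of reliability." It is only my guess at the general shape of the approach, not a description of what Google actually does. The entity table, the example.org URIs, and the guess_entity function are all made up for illustration.

```python
import re

# Hypothetical mini "entity database": each made-up URI stands for one
# entity, with a label and a few words that tend to appear near it.
ENTITIES = {
    "http://example.org/entity/jaguar-animal": {
        "label": "jaguar",
        "context": {"cat", "jungle", "prey", "species"},
    },
    "http://example.org/entity/jaguar-car": {
        "label": "jaguar",
        "context": {"car", "engine", "dealer", "sedan"},
    },
}


def guess_entity(mention, text):
    """Return (uri, confidence) for a mention, or (None, 0.0) if unsure.

    Confidence is the winning candidate's share of all context-word
    overlaps, so it is a rough score, not a guarantee.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scored = []
    for uri, entity in ENTITIES.items():
        if entity["label"] != mention.lower():
            continue
        overlap = len(words & entity["context"])
        scored.append((overlap, uri))
    total = sum(score for score, _ in scored)
    if total == 0:
        # No surrounding evidence at all: the keyword alone is ambiguous.
        return None, 0.0
    best_score, best_uri = max(scored)
    return best_uri, best_score / total


if __name__ == "__main__":
    print(guess_entity("Jaguar", "The jaguar stalked its prey through the jungle."))
    print(guess_entity("Jaguar", "I saw a jaguar yesterday."))
```

With a stable URI in the page itself, none of this guessing would be needed. Without one, the best the matcher can do is a score, and the second call shows how a bare keyword leaves it with nothing to go on.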

If the change involves a difference in kind, it’s the effort to amass that giant and growing entity database, with the likeliest results spit back at users keystroke by keystroke. Considering how the entities are identified, the signal is still certain to contain enough noise to paint a fuzzy picture, leaving satisficing users to make do with almost-relevant information. Good enough for most, maybe, and better than the current flood of recall. But it doesn’t live up to the promises of precision inherent in the “semantic” language.

And the provenance of the information in that entity database raises a potentially larger problem. In the extreme case where Google mashes up the entire web into its own proprietary database, it disintermediates all the sources on which it relies for the value of that database. To the extent that it starves the geese laying those golden eggs, it reduces its own ability to attract the eyeballs it wants to sell to advertisers. By linking to the providers, it automatically shares that wealth and keeps them going, while capturing a share for itself. If it were to become a single-source category killer, it would compromise its own supply of information while fobbing off users with dumbed-down overviews that only kinda sorta meet their needs. That looks from here like a reduction in value on all sides rather than the value-add from a well-integrated general search tool.

Would providers hate that outcome enough to start setting their robots.txt files to restrict or even block Google? They probably can’t expect a much better deal from its competitors, if blocking were a viable option at all. Realistically, that extreme case isn’t an especially likely outcome, but it certainly raises (or reemphasizes) questions about the point at which optimizing for one provider begins to draw down value for the system as a whole.
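For anyone curious what that blocking would look like in practice, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt contents, the example.com URL, and the can_crawl helper are hypothetical. Googlebot is the user-agent name Google's crawler is generally known by, and "SomeOtherBot" is a stand-in for any other crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a provider might publish to shut out Google's
# crawler while leaving the site open to every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def can_crawl(robots_txt, user_agent, url):
    """Report whether these robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/some-story"
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, can_crawl(ROBOTS_TXT, agent, url))
```

The mechanics are a two-line change. The hard part is the business decision, since robots.txt is only a request that well-behaved crawlers choose to honor, and opting out of Google means opting out of the traffic it sends.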

The nod to Facebook and other competitors is expected in a WSJ piece, but not much is made of social search as an alternative. G+ hasn’t gained nearly the traction it would need to be a solid basis for judgments of relevance, despite Google’s attempts to make it so. FB’s prospects seem no better, given the mundane patter on which it would be drawing to power suggestions in search results. True semantic web technologies seem the far better path to truly relevant search results, maybe via the relatively simple shortcut of microdata. That set of options has the additional happy benefit of preserving the diversity and diffusion needed for a healthy ecology of information providers.
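To show why I call microdata a relatively simple shortcut, here is a rough sketch of how little it takes to pull labeled facts out of a marked-up page. It uses only Python's standard html.parser module. The sample HTML is invented, the FlatMicrodataParser class and extract_item function are my own illustrative names, and the parser only handles a single flat item with text-valued properties, so treat it as a toy rather than a conforming microdata processor.

```python
from html.parser import HTMLParser

# Invented sample page fragment using the schema.org Person vocabulary.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Ada Lovelace</span>
  <span itemprop="jobTitle">Mathematician</span>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collect itemprop values from one flat microdata item.

    Simplifying assumptions: a single itemscope, no nested items, and
    every property value is the element's text content.
    """

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._pending_prop = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "itemscope" in attr_map and self.item_type is None:
            self.item_type = attr_map.get("itemtype")
        if "itemprop" in attr_map:
            # Remember the property name until its text arrives.
            self._pending_prop = attr_map["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._pending_prop and text:
            self.properties[self._pending_prop] = text
            self._pending_prop = None


def extract_item(html):
    """Return (item_type, properties) for the first item in html."""
    parser = FlatMicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.item_type, parser.properties


if __name__ == "__main__":
    print(extract_item(SAMPLE_HTML))
```

The point is that the provider, not the crawler, says what the page is about. The markup stays on the provider's own site, which is the diversity-preserving part.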