How Google Attempts to Understand What a Query or Page is About Based Upon Word Relationships
A little crunched for time, and feeling both hungry and lazy, I treated myself to a meal from the local Taco Bell tonight. It’s a new store, and I like that when you place an order at the drive-through, it shows what you ordered on a screen, and the person taking the order asks if your order is right before charging you for it and processing the order. The voice coming out of the speakers asked me if my order looked correct. I took a look, and responded, “I guess it does.” I had to guess, because the list didn’t look very legible:
1 Smthr Burr SC
1 Chalupa SPR Stk
1 Sft Taco Bf Spr
1 Lrg Root Beer
Of course, the screen shows abbreviations for the order, because it needs to abbreviate those words if they stand a chance of fitting on a single line on a ticker tape receipt. That doesn’t make my order any easier to read or understand when it’s displayed that way, and it really doesn’t need to be presented on the computer screen as abbreviated words as long as the abbreviations only appear on the receipt. Repeating what I ordered on a screen and allowing me to confirm the order is a really good idea. But, using the abbreviations for the receipt on that confirmation screen isn’t such a good idea. The people taking the order may recognize the abbreviations, especially after at least one night of having to look at them. But, even though the items look similar to what I ordered, they seem more like gibberish to me.
In the 2008 paper Finding Cars, Goddesses and Enzymes: Parametrizable Acquisition of Labeled Instances for Open-Domain Information Extraction, the authors describe how text on web pages might be labeled as it is crawled, to understand the concepts found in words on those pages. The paper may be a few years old, but Google was granted a patent on a similar process this past May. If words in queries are processed in the same way, to better understand the concepts in them, then search results can be returned on the basis of matching concepts in a query to concepts found on web pages.
The patent is:
Extracting and scoring class-instance pairs
Invented by Marius Pasca
Assigned to Google
US Patent 8,452,763
Granted May 28, 2013
Filed: March 19, 2010
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for extracting and scoring class-instance pairs. One method includes applying extraction patterns to document text to derive class-instance pairs, determining a frequency score and a diversity score for each distinct class-instance pair, and determining a pair score for each class-instance pair from the frequency score and the diversity score.
Another method includes applying extraction patterns to document text to derive candidate class-instance pairs, determining, for each distinct candidate class-instance pair, a number of distinct phrases from which the distinct candidate class-instance pair was derived, and determining a pair score for each distinct candidate class-instance pair from the number of distinct phrases from which the candidate class-instance pair was extracted.
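To make the abstract a little more concrete, here is a minimal sketch in Python of what extracting and scoring class-instance pairs could look like. Everything specific in it is my own assumption rather than something taken from the patent: the two extraction patterns ("such as" and "including"), the crude plural stripping in normalize_class, and the choice to multiply the frequency score by the diversity score. The patent only says that a pair score is determined from the two.

```python
import re
from collections import defaultdict

# Assumed extraction patterns: "<class> such as <instance>" and
# "<class> including <instance>". The patent's real patterns may differ.
PATTERN = re.compile(r"\b(\w+) (?:such as|including) (\w+)", re.IGNORECASE)


def normalize_class(name):
    """Crude, assumed normalization: lowercase and strip one trailing 's'."""
    name = name.lower()
    return name[:-1] if name.endswith("s") else name


def extract_pairs(sentences):
    """Map each (class, instance) pair to the phrases it was derived from."""
    phrases_by_pair = defaultdict(list)
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            pair = (normalize_class(match.group(1)), match.group(2).lower())
            phrases_by_pair[pair].append(match.group(0).lower())
    return phrases_by_pair


def score_pairs(phrases_by_pair):
    """Assumed scoring: frequency = number of matches,
    diversity = number of distinct phrases, pair score = their product."""
    scores = {}
    for pair, phrases in phrases_by_pair.items():
        frequency = len(phrases)
        diversity = len(set(phrases))
        scores[pair] = frequency * diversity
    return scores
```

Fed sentences such as "I love foods such as pizza." and "Try foods including pizza.", the sketch collects the pair (food, pizza) from two distinct phrases, so that pair ends up with a higher diversity score than a pair that is only ever seen in one phrasing.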
Google’s recent Hummingbird update is aimed at examining long and complex queries and returning search results for those queries that don’t necessarily rely upon matching all the words within those queries. The focus of the paper and patent is on finding patterns in data that is mined from web pages by looking for relationships between words in “class-instance” pairs. As the patent tells us:
A class-instance pair is made up of a class name corresponding to a name of an entity class and an instance name corresponding to an instance of the entity class. The instance of the entity class has an “is-a” relationship with the entity class; in other words, the instance of the entity class is an example of the entity class. An example class-instance pair is the pair (food, pizza), because pizza is a food.
Knowing that “pizza is a food” makes it easier for Google to understand what is meant by “pizza” when it appears in a query or on a web page, and to match up queries and web pages that both include that class-instance pair. In much the same way, knowing that “Smthr Burr SC” is a menu item that I may or may not have ordered at Taco Bell makes it easier for me to work out that the abbreviation means “Smothered Burrito, Shredded Chicken”. Yes, that’s what I ordered.
A couple of days ago, I wrote about a patent from Google where the search engine tries to find “known for” terms of interest for entities. A restaurant might be known for a famous chef working there, or a specific menu item that might be unique to that restaurant; like Gordon Ramsay’s restaurants are known for his version of Beef Wellington. The post was How Google Finds ‘Known For’ Terms for Entities. What makes that patent similar to this one is that it focuses upon a specific type of relationship – a “known for” relationship. This new patent also looks for a relationship, an “is a” relationship.
If you want to dig into the process or mathematics behind how Google might identify “is a” relationships and extract terms and concepts that fit those patterns from data on the web, you can get a sense of both from the paper and the patent. What’s more important here is understanding that Google is building a knowledge base of concepts and relationships between words that can help it return relevant results for queries.
When Google acquired Applied Semantics in 2003 (technically, it was called a merger), it also inherited Applied Semantics’ CIRCA technology. At the heart of the technology was the ability to learn about and understand relationships between words. I’ve mentioned “known for” relationships and “is a” relationships, but here are some other relationships mentioned in a white paper about CIRCA:
- Synonymy/antonymy (e.g. “good” is an antonym of “bad”)
- Similarity (“gluttonous” is similar to “greedy”)
- Hypernymy (is a kind of / has kind) (“horse” has kind “Arabian”)
- Membership (“commissioner” is a member of “commission”)
- Meronymy (whole/part relations) (“motor vehicle” has part “clutch pedal”)
- Substance (e.g. “lumber” has substance “wood”)
- Product (e.g. “Microsoft Corporation” produces “Microsoft Access”)
- Attribute (“past”, “preceding” are attributes of “timing”)
- Causation (e.g. “travel” causes “displacement” or “motion”)
- Entailment (e.g. “buying” entails “paying”)
- Lateral bonds (concepts closely related to one another, but not in one of the other relationships, e.g. “dog” and “dog collar”)
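One way to picture a knowledge base built from relationships like these is as a collection of (subject, relation, object) triples. The sketch below is only an illustration under that assumption. The RelationStore class and relation labels such as "has_kind" and "has_part" are hypothetical names of mine, and nothing here is meant to describe how Applied Semantics or Google actually stores this information.

```python
from collections import defaultdict


class RelationStore:
    """Toy store of typed word relationships (hypothetical design)."""

    def __init__(self):
        # subject -> list of (relation, object) tuples
        self._by_subject = defaultdict(list)

    def add(self, subject, relation, obj):
        """Record that `subject` stands in `relation` to `obj`."""
        self._by_subject[subject].append((relation, obj))

    def related(self, subject, relation):
        """Return every object linked to `subject` by `relation`."""
        return [o for r, o in self._by_subject.get(subject, []) if r == relation]


store = RelationStore()
# Examples taken from the list above; the relation labels are assumed.
store.add("horse", "has_kind", "Arabian")
store.add("motor vehicle", "has_part", "clutch pedal")
store.add("Microsoft Corporation", "produces", "Microsoft Access")
store.add("buying", "entails", "paying")
```

Asking the store for what “horse” has as a kind returns “Arabian”, while asking for what “horse” produces returns nothing, which is the point of typing each relationship rather than just noting that two words are connected.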
The future of rankings of search results may rely upon Google building a concept-based knowledge base that understands the relationships between words, as well as the probabilities that a certain relationship was intended when words are used on a page. For example, a page that mentions Microsoft might be about Microsoft as a member of technology companies, or it might be about Microsoft products. If you write a page that includes “Microsoft” in it, and the page also mentions Cisco, Red Hat, Apple and Sun Microsystems, there’s a decent chance that the page is about technology companies. If you write a different page that includes “Microsoft” in it, and it also mentions Access and Word and Excel, then the page is more likely to be about products produced by Microsoft.
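The Microsoft example can be sketched as a very simple calculation. This is a toy illustration and not Google's method: the INTERPRETATIONS table is a hand-built assumption of mine, and turning raw overlap counts into probabilities by dividing by their total is the simplest normalization I could pick.

```python
# Hypothetical evidence terms for two readings of a page mentioning "Microsoft".
INTERPRETATIONS = {
    "technology companies": {"cisco", "red hat", "apple", "sun microsystems"},
    "Microsoft products": {"access", "word", "excel"},
}


def interpretation_probabilities(page_terms):
    """Estimate how likely each reading is from co-occurring terms.

    Counts how many evidence terms for each reading appear on the page,
    then normalizes the counts so they sum to 1. With no evidence at all,
    every reading gets an equal share.
    """
    terms = {term.lower() for term in page_terms}
    counts = {
        label: len(terms & evidence)
        for label, evidence in INTERPRETATIONS.items()
    }
    total = sum(counts.values())
    if total == 0:
        return {label: 1 / len(counts) for label in counts}
    return {label: count / total for label, count in counts.items()}
```

A page mentioning Microsoft, Cisco, Apple and Excel would lean toward the “technology companies” reading, since two of its terms support that reading and only one supports the products reading.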
The words that you choose to use on a web page might send signals to Google about the relationships between those words, influencing Google’s interpretation of your page.
It’s possible that someone reading that last paragraph might say, “That’s obvious, and if you write naturally, those relationships will appear on their own.” But writing “naturally” isn’t just something that flows from your mind to your fingers to your keyboard to your page. Knowing that Google will try to understand the relationships between words that appear in a query or on a page makes it less likely that, in creating those queries or that content, you will send mixed signals caused by a lack of focus on showing off those relationships.
About the Author
Bill Slawski is the Director of Search Marketing at Go Fish Digital and has been promoting websites since 1996. He often blogs about SEO and search-related patents and white papers on his blog SEO by the Sea. Originally an in-house SEO who then worked at agencies and as a solo consultant, he has worn a lot of different hats and has tested and tried out ideas from patents and papers as an ongoing SEO education. Connect with him on Twitter, @bill_slawski, if you’d like to stay in touch or have questions.
Photo thanks to Anders Sandberg (Random search)