Google Mini images in results ASP.NET

Search for our ASP.NET site is provided by a Google Mini. We chose it for speed of set-up, for users' familiarity with the result sets it generates, and because Microsoft Search had not been released when the decision was made. The Google Mini is a 1U rack-mount server supplied by Google. It provides a browser-based administration interface and lets you integrate it into your site in a few different ways. We fire requests at it and get an XML results file back, which we then manipulate to produce a good search experience. We use GSALib, a port of the gsa-japi Java library.

Now that we have it up and running, it is time to start tinkering to try to get better results. The first challenge was getting product image thumbnails shown next to the search results from the Google Mini. This was actually very simple to achieve. The Mini understands meta tags in your HTML. If you put the following into the <head></head> section of the product pages:
<meta name="tw-prodimg" content="48350_01.jpg" />
then when the Mini indexes your pages, this tag, together with any others present, will be stored as a collection against that page’s result. To consume the meta tags, you must explicitly ask for them to be included in the XML results returned by the Mini:

Dim objQuery As New GSA.Query
Dim collections(0) As String
collections(0) = _searchSiteCollection
objQuery.setSiteCollections(collections)
objQuery.setFrontend(_searchFrontEnd)
objQuery.setOutputFormat(GSALib.Constants.Output.XML_NO_DTD)
objQuery.setOutputEncoding(Constants.Encoding.UTF8)
objQuery.setAccess(Constants.Access.PUBLIC)
objQuery.setScrollAhead(CInt(_searchStartPageIndex))
objQuery.setMaxResults(_searchPageSize)
objQuery.setFilter(_searchFilter)
'Set FetchMetaFields to * to get all meta tags associated with each page result
Dim o As String() = {"*"}
objQuery.setFetchMetaFields(o)


As you can see, setFetchMetaFields has been passed “*”, which means return all meta tags for each page result. If you prefer, you can pass a string array naming only the meta tags you are interested in, to reduce the tags returned. Meta tags can also be used to filter result sets, but that is not part of this discussion.

The results will now include <MT> XML nodes containing the meta tags. The GSA library exposes these as a string collection hanging off the search result they belong to. We can therefore show the image on the page using data binding in our repeater control:

<asp:HyperLink ID="sImgLnk" NavigateUrl='<%# Server.HtmlDecode(Eval("Url")) %>'
  runat="server" meta:resourcekey="HypImageResource2"></asp:HyperLink>

Protected Sub repeaterProductResults_ItemDataBound(ByVal sender As Object, ByVal e As System.Web.UI.WebControls.RepeaterItemEventArgs) Handles repeaterProductResults.ItemDataBound
    ' Only data rows carry a search result; skip header, footer and separator items
    If (e.Item.ItemType = ListItemType.Item) OrElse (e.Item.ItemType = ListItemType.AlternatingItem) Then
        Dim SearchResult As GSA.Result = CType(e.Item.DataItem, GSA.Result)
        Dim imageLink As HyperLink = DirectCast(e.Item.FindControl("sImgLnk"), HyperLink)
        Dim thumbPath As String = ConfigurationManager.AppSettings("PathToProductThumbs")
        ' The meta tag name must match the one placed in the product page <head>
        If SearchResult.Metas.Contains("tw-prodimg") Then
            imageLink.ImageUrl = thumbPath & SearchResult.Metas("tw-prodimg")
        Else
            ' No image meta tag on this page, so fall back to a placeholder thumbnail
            imageLink.ImageUrl = thumbPath & "noimage.gif"
        End If
    End If
End Sub


You now have product images shown against the search results. You can obviously expand this so that all your pages have a “searchimage” meta tag, letting news items or other content have individual thumbnails too.
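
For reference, here is a rough sketch of what the <MT> nodes look like and how they could be read without the GSA library. It assumes, as I understand the appliance's XML format, that each result is an <R> element whose <MT> children carry the meta tag name in an N attribute and its value in a V attribute. The sample XML is hand-written for illustration, and the module and function names are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Module MetaTagSketch

    ' Collects the name/value pairs held in the <MT> children of one <R> result element.
    Function ReadMetaTags(ByVal result As XElement) As Dictionary(Of String, String)
        Dim metas As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
        For Each mt As XElement In result.Elements("MT")
            Dim nameAttr As XAttribute = mt.Attribute("N")
            Dim valueAttr As XAttribute = mt.Attribute("V")
            ' Skip any malformed node that lacks a name or a value
            If nameAttr IsNot Nothing AndAlso valueAttr IsNot Nothing Then
                metas(nameAttr.Value) = valueAttr.Value
            End If
        Next
        Return metas
    End Function

    Sub Main()
        ' Hand-written sample result, not real output from a Mini
        Dim sample As XElement = XElement.Parse( _
            "<R N=""1""><U>http://www.example.com/product.aspx</U>" & _
            "<MT N=""tw-prodimg"" V=""48350_01.jpg""/></R>")

        Dim metas As Dictionary(Of String, String) = ReadMetaTags(sample)
        Console.WriteLine(metas("tw-prodimg"))   ' prints 48350_01.jpg
    End Sub

End Module
```

In practice the GSA library does this work for you, so the sketch is only useful for seeing what sits behind the Metas collection used in the repeater code.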

Google Mini excluding ASP.NET page fragments

You have configured your Google Mini and integrated it with your site. What you may find now is that your results are skewed by irrelevant content on your pages. That is what I have just found.

Exclude unwanted page sections

The result set was being skewed by the “customers who bought this also bought…” control and by the site page header and footer. This turned out to be very simple to resolve. There is a pair of HTML comment tags that can be used to stop parts of a page from being indexed. They are defined in the Google document Excluding Unwanted Text from the Index.
Here are the examples from that documentation, reproduced for convenience:

<!--googleoff: anchor--><A href=sharks_rugby.html>shark </A> <!--googleon: anchor-->
<!--googleoff: snippet-->Come to the fair!<!--googleon: snippet-->
<!--googleoff: all-->Come to the fair!<!--googleon: all-->

You surround the control or section of the page that you do not want to participate in the results with a matching googleoff/googleon comment pair, as in the examples above, using one of the attributes described below. This will not affect the rendering of your page, but it does mean something to the Google search appliance.

index: the words between the tags are ignored by Google; they are treated as if they do not occur on the page at all.

anchor: the text of an HTML anchor tag linking to another page will not cause that destination page to appear as a result because of the link on this page.

snippet: the search result will not use the text between the tags in the auto-generated snippet that is included in the results.

all: turns on all of the attributes. Text between the tags is not indexed, followed to another linked-to page, or used for a snippet.
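
If you would rather not type the comment pair around every control, one option is to let a small container control emit it. The following is only a sketch under that assumption: GoogleOffPlaceHolder is a hypothetical name, it always uses the "all" attribute, and it relies on nothing more than the standard PlaceHolder and HtmlTextWriter classes.

```vbnet
Imports System.Web.UI
Imports System.Web.UI.WebControls

' Hypothetical container: whatever is placed inside it is rendered
' between a googleoff/googleon comment pair.
Public Class GoogleOffPlaceHolder
    Inherits PlaceHolder

    Protected Overrides Sub Render(ByVal writer As HtmlTextWriter)
        writer.Write("<!--googleoff: all-->")
        MyBase.Render(writer)   ' renders the child controls as normal
        writer.Write("<!--googleon: all-->")
    End Sub

End Class
```

Typing the comments straight into the .aspx or .master markup works just as well. Either way they must be ordinary HTML comments; server-side <%-- --%> comments are stripped before the page is sent, so the appliance would never see them.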

To solve my problem, googleoff was applied to:

  • “Customers who bought this bought” control reference
  • Product category breadcrumb on the product pages
  • Master page header and footers

As a result, a search for “contact us” no longer returns every page in the site. It used to, because the contact page is linked from every page through the site master pages. The snippets shown in the search results are also much more relevant, and the results are much richer overall.

Take care not to exclude too much of your content from Google, as you cannot predict what someone will search for on your site or why. Excluding too much may hinder visitors from finding what they need, or prevent them from finding it at all.

Check the documentation for the other controls you have available to control the indexing of pages (the crawl).