Google Mini – Remove URL from index

One of my ASP.NET ecommerce applications uses URL rewriting for product pages. For example:

Item Sku Number: 473-151
Product Description: Bright Products, Black box converter
Website URL: http://www.mydomain.com/Products/473-151- Bright Products, Black box converter

Note: the text after the SKU number is irrelevant; the URL-rewriting code disregards it, and only the 473-151 part is used to find the page.
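
Just to illustrate the idea (this is not the site's actual rewriting code; Products.aspx and the sku parameter are invented for the sketch), the rewrite only needs to pick the SKU out of the incoming path, for example in Application_BeginRequest:

' Rough sketch only, e.g. in Global.asax Application_BeginRequest.
' Products.aspx and the sku query parameter are hypothetical names.
Dim skuMatch As System.Text.RegularExpressions.Match = _
    System.Text.RegularExpressions.Regex.Match(Request.Path, "^/Products/(\d+-\d+)")
If skuMatch.Success Then
    ' Everything after the SKU is ignored; only 473-151 locates the page.
    Context.RewritePath("~/Products.aspx?sku=" & skuMatch.Groups(1).Value)
End If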

Google Mini

We use a Google Mini to index the site and provide search results to site users. Because a product page can be reached by a number of different routes, and therefore under different URLs, and to keep the page count down in the results from the box, we use the canonical link tag to declare what should be the definitive URL for the page.

Canonical links are supported by all the search engines of importance. The tag looks like this:
<link rel="canonical" href="http://www.mydomain.com/Products/473-151- Bright Products, Black box converter" />

Change of description

Recently a supplier complained that the product description in the URL for an item was wrong, although it was correct on the page itself. Investigation showed that the item description had been changed because the supplier had rebranded the product, see below.

Item Sku Number: 473-151
Product Description: Mighty Products, Black box converter
Website URL: http://www.mydomain.com/Products/473-151- Bright Products, Black box converter

This meant that when the item was searched for on the Google Mini, it found “Black box converter” but returned the incorrect URL shown above. It should have had the following URL:

Website URL: http://www.mydomain.com/Products/473-151- Mighty Products, Black box converter

What's wrong?

So what is wrong? It turns out that the Google Mini still has the old URL in its index, and that page is very persistent at staying there, so the box happily crawls it each night.

It seems “the index” is a list of pages the Google Mini has found at some time in the past. A page can now have been “unlinked” from the site, with no inbound links to it, but it will still persist in the index and therefore in the results.

The only ways to remove a page from the Google Mini index are highlighted in the document Administering Crawl for Web and File Share Content: Introduction, which states that a page is removed when:

  • The license limit is exceeded
  • The crawl pattern is changed
  • The robots.txt file is changed
  • Document is not found (404)

These are the only ways that a page will be removed. In this scenario the old URL still returns a valid page, because it carries the same item SKU number, so it keeps being indexed under the wrong URL potentially forever!

Also worth noting: if you are having problems with re-indexing the content of a page, rather than its URL, check the “Last-Modified” header being returned by the web server in the response for that page. This is particularly an issue with dynamic pages; static pages are normally handled appropriately from the last-modified date of the file on the site's file system. You can inspect the headers returned by a page using a developer toolbar (now built into IE8).
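
For a dynamic page, a minimal sketch of emitting a sensible Last-Modified header might look like this (ProductDetails.LastUpdated is a made-up property here, purely for illustration):

' Sketch only: emit a Last-Modified header from a dynamic page so the crawler
' can see when the content actually changed. LastUpdated is hypothetical.
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
    Response.Cache.SetLastModified(Me.ProductDetails.LastUpdated)
End Sub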

Solution attempt 1

Aha, I thought, I know how to tackle this. The old URL no longer exists, as it has been superseded by the new page, so the ASP.NET site should issue a “301 Moved Permanently” response to push the Google Mini to index the new page, register that URL and, presumably, drop the old URL from the index.

A couple of lines and the problem was solved, I thought.

If Not officialUriForPage.PathAndQuery.EndsWith(Request.RawUrl) Then
  Response.Clear()
  Response.Status = "301 Moved Permanently"
  Response.AddHeader("Location", utility.GetPublicProductURL( _
            Me.ProductDetails.ProductId, Me.ProductDetails.ItemDescription))
  Response.End()
End If

 

So now the old URL issues a “301 Moved Permanently” response to the browser and the Google Mini, which will go and index the new page and drop the old URL – however, it doesn’t work that way.

Solution attempt 2

After the overnight index, solution attempt 1 turned out to be a failure. Reading the documentation again, it turns out the Google Mini is being helpful and returns both URLs, the new one and the moved one, for any search that hits content in the new URL. It seems the four methods of removal noted earlier really are the only ways to remove a page from the index.

Action

I could put the URL I wanted removed from the index into the “Don’t Crawl URLs” box of the crawl pattern definition in the Google Mini admin pages. The Google Mini would then, after somewhere between 15 minutes and six hours, re-examine the index, realise the page should no longer be there, and remove it. This falls under “The crawl pattern is changed”, the second item in the list of removal conditions above. I could then take the don’t-crawl URL back out of the box so I don’t forget it and accidentally block a future replacement URL. That would work for a few pages, but with about 15,000 products online we need something better.

Instead I went for the last option in the list, “If the search appliance receives a 404 (Document not found) error from the Web server when attempting to fetch a document, the document is removed from the index.”.

Hence I changed the code sample above to redirect to our generic 404 not-found page rather than issuing the permanent redirect. Check that the 404 page responds with a 404 status code in its headers, or the Google Mini will not see the 404. However, I only want this to happen for the Google Mini, not for end users: an end user just wants to be redirected to the new URL, and a 404 not-found is rude and would lose sales as users assume the item no longer exists. Luckily the Google box sends a configurable user agent in its requests, so we can behave differently towards it.

If Not IsNothing(System.Web.HttpContext.Current.Request. _
         ServerVariables("HTTP_USER_AGENT")) _
    AndAlso System.Web.HttpContext.Current.Request. _
         ServerVariables("HTTP_USER_AGENT").Contains("gsa-crawler") Then
    'Not found for Google Mini
    Response.Clear()
    Response.Status = "404 Not Found"
    Response.AddHeader("Location", "/ErrorPages/404.aspx")
    Response.End()
Else
    'Perm redirect
    Response.Clear()
    Response.Status = "301 Moved Permanently"
    Response.AddHeader("Location", common.utility.GetPublicProductURL( _
                Me.ProductDetails.ProductId, Me.ProductDetails.ItemDescription))
    Response.End()
End If

I hope the problem is now resolved.

Invalid Viewstate - Request path: /ScriptResource.axd

 

These errors have been appearing in our logs more and more; they have been building up recently. Thirty today was the final straw, so I had to investigate further.
The exception is as follows:

Process information: 
    Process ID: 5164 
    Process name: w3wp.exe 
    Account name: NT AUTHORITY\NETWORK SERVICE 
 
Exception information: 
    Exception type: HttpException 
    Exception message: Invalid viewstate. 
 
Request information: 
    Request URL: http://www.mydomain.co.uk/ScriptResource.axd?d=zssJ_ZkntDf8wJe24iZ0zF_fJVnfucP3oqIlDwt8BO1EweVFWfL2juu9RlhRVTDPTWMvo7NxKPBKbedKroducts</a></span></li><li><span><ahttp://www.mydomain.co.uk/ScriptResource.axd?d=zssJ_ZkntDf8wJe24iZ0zF_fJVnfucP3oqIlDwt8BO1EweVFWfL2juu9RlhRVTDPTWMvo7NxKPBKbedKroducts&lt;/a></span></li><li><span><ahttp://www.mydomain.co.uk/ScriptResource.axd?d=zssJ_ZkntDf8wJe24iZ0zF_fJVnfucP3oqIlDwt8BO1EweVFWfL2juu9RlhRVTDPTWMvo7NxKPBKbedKroducts&lt;/a></span></li><li><span><ahttp://www.mydomain.co.uk/ScriptResource.axd?d=zssJ_ZkntDf8wJe24iZ0zF_fJVnfucP3oqIlDwt8BO1EweVFWfL2juu9RlhRVTDPTWMvo7NxKPBKbedKroducts&lt;/a></span></li><li><span><a%20href=href=href=href= 
    Request path: /ScriptResource.axd 

I found a promising lead on the Project 31-A blog that led me to Microsoft Connect and this report: Bug IE8 – 4K dropped - "Invalid viewstate" when loading ScriptResource.axd or WebResource.axd (asp net).
It seems there is an IE8 bug that might explain the growing number of occurrences. The description of the issue matches what we are seeing to a tee. I have started work on mitigating our exposure by removing the meta content-type tags (content="text/html; charset=ISO-8859-1") from the pages that have them, in the hope that this reduces the impact.

I am worried that this IS affecting the user experience of our site. More testing required on this one…

SQL Server 2005/2008 – it's great

We upgraded our SQL Server from 2000 to 2008 early this year, and although I have been using 2005 for a while with Microsoft Dynamics GP, only the website really stretches my T-SQL skills to the limit. I have a library of scripts, developed in the SQL Server 2000 days, that solve most problems or can be adapted to address most issues that occur in Dynamics GP.

This year I have been wallowing in

  1. Common Table Expressions (CTEs)
  2. Availability of the TOP clause in INSERT, UPDATE and DELETE statements
  3. Ranking functions
  4. VARCHAR(MAX) data types, and the exciting new date types in SQL 2008
  5. Outputting affected rows with the OUTPUT clause (INSERT/UPDATE/DELETE) – a quick sketch combining this with TOP in a DELETE follows below
  6. Enhancements to TOP that let us filter results in stored procedures
  7. IntelliSense in Management Studio

 

These new friends almost eradicate cursors and triggers for all but the most complex scenarios.
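
As a quick illustration of items 2 and 5 above (the ImportLog table and its columns are invented for this sketch), TOP can now be used directly in a DELETE, and the OUTPUT clause hands back the rows that were affected:

-- Hypothetical purge: remove up to 100 old import-log rows and return
-- exactly which rows were deleted, all in one statement.
DELETE TOP (100)
FROM ImportLog
OUTPUT deleted.LogId, deleted.LoggedOn
WHERE LoggedOn < DATEADD(MONTH, -6, GETDATE());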

I found that Programming Microsoft® SQL Server® 2008 (PRO-Developer) was a great step-up book for getting from SQL 2000-experienced DBA to SQL 2008. No time wasted going over the basics; it is very factual and expects you to pick up from where you left off with SQL 2000.

Another book I still enjoy today is The Guru's Guide to Transact-SQL (Paperback). It is technically a little outdated, but the way it makes you think about SQL really gives you a leg up to a higher level of T-SQL programming that you can still apply to the newer versions of SQL Server.

Deduplication

I had a deduplication problem to solve today: the usual issue of an imported plain-text file with problems. The import table had no primary key and many thousands of duplicate rows that needed removing. This could be done with SQL Server 2000, but it was a pain and never very pretty. Now the solution is beautiful.

Create the table, fill it with some duplicate lines.

CREATE TABLE [ItemImages_Import](
    [ItemCode] [varchar](31) NULL,
    [Graphic] [varchar](255) NULL,
    [Position] [smallint] NULL
) ON [PRIMARY]
 
GO
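
For example, seed it with a little duplicated sample data (the values here are invented for illustration):

-- Some deliberately duplicated rows to play with
INSERT INTO ItemImages_Import (ItemCode, Graphic, Position)
VALUES ('288-288', 'graphic1.jpg', 1),
       ('288-288', 'graphic1.jpg', 1),
       ('288-288', 'graphic2.jpg', 2),
       ('288-999', 'graphic4.jpg', 1),
       ('288-999', 'graphic4.jpg', 1);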

Then, goodness me, all you have to do is run this script and your duplicates are no more!

WITH VirtTable
AS (SELECT
    itemcode, graphic,
    ROW_NUMBER() OVER (PARTITION BY itemcode ORDER BY graphic) AS GroupCounter
FROM ItemImages_Import)
DELETE FROM VirtTable WHERE GroupCounter > 1

This uses the ranking function ROW_NUMBER() to create a temporary “key”, which is then used to delete every copy of a duplicate after the first instance in the table. This is great stuff.

Create Row number in a Grouping using TSQL

My next problem was that I wanted to renumber the position of the images in the table within each item code, overwriting the existing values:

288-288      graphic1.jpg     1
288-288      graphic2.jpg     2
288-288      graphic3.jpg     3
288-999      graphic4.jpg     1
288-999      graphic5.jpg     2
289-100      graphic6.jpg     1
289-100      graphic7.jpg     2
289-100      graphic8.jpg     3
etc

Again, it's a doddle once you get your head around ranking functions.

UPDATE ItemImages_Import
SET position = 0;

WITH VirtTable
AS (SELECT *,
    ROW_NUMBER() OVER (PARTITION BY itemcode ORDER BY graphic) AS GroupCounter
FROM ItemImages_Import)
UPDATE VirtTable
SET position = GroupCounter;

SELECT * FROM ItemImages_Import
ORDER BY ItemCode, Position

Just to prove it works, I blank the column first with the initial UPDATE.

Happy TSQLing!

response.TransmitFile bad coding day

File Downloads from ASP.NET

All I had to do was serve the file as a download to the customer; there the pain began.

If you provide file downloads, it is important to stream the file to the user in a buffered fashion to manage server memory. If many users start downloading your files at once and you are not buffering, each whole file is held in memory, potentially multiple times, causing memory exhaustion. Buffering means only a small chunk of each downloading file is held in memory at any one time.
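
To make that concrete, here is a sketch (not production code) of what a hand-rolled buffered download looks like, where strFilename stands in for whatever file is being served: the file is read and flushed in small chunks, so only one chunk is ever in memory.

' Sketch of what buffering means: only one small chunk is in memory at a time.
Const chunkSize As Integer = 8192
Using fs As New IO.FileStream(strFilename, IO.FileMode.Open, IO.FileAccess.Read)
    Dim buffer(chunkSize - 1) As Byte
    Dim bytesRead As Integer = fs.Read(buffer, 0, chunkSize)
    While bytesRead > 0 AndAlso Response.IsClientConnected
        Response.OutputStream.Write(buffer, 0, bytesRead)
        Response.Flush()
        bytesRead = fs.Read(buffer, 0, chunkSize)
    End While
End Using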

response.TransmitFile

In .NET 2.0 a great new method was added to the framework: Response.TransmitFile. This method removes a load of code that we used to have to write whenever we wanted to provide a file to the customer for download; it streams the file to the user and handles the buffering for us. Look at the following simple code.

response.Clear()
response.ClearContent()
response.ClearHeaders()
response.ContentType = "application/x-msexcel"
response.AppendHeader("Content-Disposition", "attachment; filename=PriceList.xlsx")
response.TransmitFile(StrFilename)
response.End()

 

Now compare that to this version, where no buffering is occurring and the whole file is read into memory before being sent to the browser (from Protect files for download - Code Project):

'This is the function that you have to use
'(note that the entire file ends up in server memory)
Private Function downloadfile(ByVal strFile As String) As Boolean
    Dim fs As IO.FileStream
    ' This is the important part, because you are going to use the local path
    ' to get the file
    Dim strPath As String = Me.Server.MapPath( _
        System.Configuration.ConfigurationSettings.AppSettings("uploadDirectory")) & "\"
    Dim strFileName As String = strFile
    fs = IO.File.Open(strPath & strFileName, IO.FileMode.Open)
    ' The whole file is read into this byte array - no buffering at all
    Dim bytBytes(CInt(fs.Length) - 1) As Byte
    fs.Read(bytBytes, 0, CInt(fs.Length))
    fs.Close()
    Response.AddHeader("Content-disposition", "attachment; filename=" & strFileName)
    Response.ContentType = "application/octet-stream"
    Response.BinaryWrite(bytBytes)
    Response.End()
    Return True
End Function

 

The problem

I got the error below when I implemented my download page;

The file you are trying to open, xxxxxx is in a different format 
than specified by the file extension. Verify that the file is not corrupted 
and is from a trusted source before opening the file. 
Do you want to open the file now?

I could not open the Excel file successfully. I opened the downloaded file in TextPad to look at what was happening inside, but as Office stores its files as zips this didn’t reveal anything immediately to my eye. Instead I tried downloading a text file, and that revealed the issue that led to me finding my mistake.

The downloaded file exhibited corruption, so I started looking for things that could corrupt the file: tracing being switched on for the site, IIS settings, wondering if it had something to do with my new Windows 7 machine, and so on. Several hours after starting what should have been a twenty-minute job, I discovered my error. See my erroneous code below:

My error

' WARNING THIS CODE SNIPPET HAS AN ERROR DO NOT COPY!
Dim StrFilename = Files(0) 
response.Clear()
response.ContentType = "application/x-msexcel"
response.AppendHeader("Content-Disposition", "attachment; filename=PriceList.xlsx") 
response.TransmitFile(StrFilename, IO.FileMode.Open, IO.FileAccess.Read)
response.End()

See what I’d done? I didn’t, for ages. As I had taken some code from another streaming download that I wrote in .NET 1.1, I copied the filename and the file-opening line. Only I also copied the IO.FileMode.Open, IO.FileAccess.Read arguments, and as it happens there is an overload of TransmitFile that takes an offset into the file and a length. Because the enumeration values resolve to integers there was no compile error or runtime error; instead my file was truncated to virtually nothing!
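
In other words, since IO.FileMode.Open resolves to 3 and IO.FileAccess.Read to 1, the broken call was effectively asking TransmitFile to send one byte starting at offset three:

' What the copied arguments actually meant: start at offset 3, send 1 byte.
response.TransmitFile(StrFilename, 3, 1)

' What it should have been: let TransmitFile stream the whole file.
response.TransmitFile(StrFilename)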

On getting rid of those two parameters, all worked great! A number of hours wasted eliminating issues down to the one line of code that was wrong. Funny how you can create masses of functionality and code one day, and on another get bogged down by a stupid error like this!