To inform, entertain and excite my kids, Jamie, Patrick, Aaron & Sarah Middleburgh, our family and friends.

about me
photo of Dave Middleburgh
Hong Kong

Case of Missing Cybercop

Two weeks ago, purely on a whim, I searched for "cybercop" using the "search this blog" feature in the Blogger toolbar. Imagine my disappointment when the screen on the left was returned (match not found!!), especially since I had a recent post where the term was in the URL, the page title and the main heading, as well as on the page I was searching from and indeed on every other page on my blog in the links sections.

This immediately triggered a series of questions and some introspection. Why was the blog search not returning any results? What's the value of a site search in the toolbar if it does not return expected results? How do other search engines see my blog and its content? What have I done "wrong"? (IMHO there is nothing wrong with a little paranoia!!)

To put things in context, I don't really care where my pages come in general search results although it must be confusing for anyone searching for David Middleburgh to get results for my cousin or vice versa. It seemed to me however that a site search should at least "work" and since the one in the toolbar didn't, I reinstated the Picosearch which I had on the site previously. This does work and can be found on the sitemap tab.

This of course didn't explain why the Blogger search was "naff", nor why I couldn't find expected content on Google, Yahoo and other engines even when I did focussed searches. It's worth remembering that searching in the widest sense involves 3 sequential processes:

  1. Discovery: The searchbots have to find your site in order for it to be indexed, otherwise you simply don't get to go to the party.
  2. Indexing: Your site and content have got to be readable by the bot. It's no good if the bot finds the site but can't read it, or indexes it other than as the author intended. To build on the party analogy, it is like giving out an invitation to the party but getting the date wrong, or planning a toga party but ending up hosting a "wake".
  3. Retrieving results: This is a "consumer" focussed process. The goal of a good search engine is to maximize recall, ie bring back as many results as possible which match the search arguments entered within a reasonable time, and to maximize their relevancy, ie rank the results putting the most useful ones in front of the consumer first (and the least useful last).
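
The recall and relevancy goals in step 3 can be made concrete with a few lines of Python. This is only a rough sketch: the page names and "usefulness" scores are made up for illustration, and a real search engine's ranking is far more involved than a single score per page.

```python
def recall(returned, relevant):
    """Fraction of the relevant pages that the search actually returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(returned)) / len(relevant)


def rank(returned, scores):
    """Order results so the highest-scoring (most useful) pages come first."""
    return sorted(returned, key=lambda page: scores.get(page, 0.0), reverse=True)


# Hypothetical data: four pages mention the search term, the engine finds three.
relevant_pages = ["post-a", "post-b", "post-c", "post-d"]
returned_pages = ["post-c", "post-a", "post-d"]
usefulness = {"post-a": 0.9, "post-c": 0.4, "post-d": 0.7}

print(recall(returned_pages, relevant_pages))  # 0.75
print(rank(returned_pages, usefulness))        # ['post-a', 'post-d', 'post-c']
```

A site search that returns "match not found" for a term that is on every page has a recall of zero, which is the failure described at the start of this post.
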
Personally I have no problem with Google's (or anyone else's) algorithms, eg ranking results based on weighting factors including inbound links from "authoritative" sites, topicality of post, locality of searcher v locality of data source, or the weather outside the Google office last weekend. I also have no problem with results not being returned due to timeouts or being suppressed because they are substantively similar to other results from the same domain (have you ever looked at the numbers shown in the right hand corner of the results screen or clicked on the show similar results link on the last page of results?). Indeed I have no problems with changing algorithms which push once-ranking pages to the bottom of the Mariana Trench. I am even quite comfortable in fact with my Alexa ranking!! I actually think that most of my posts are quite trivial (this one excepted).

The key question is whether the content was in fact indexed or not, and it quickly became apparent that there was a discovery/indexing problem. Google and Yahoo both offer sitemapping facilities to let you help them spider the site more effectively. So after checking and cleaning up broken links etc ....

I switched on the Google tools and discovered that the last time they spidered my blog was in July 2005, when they changed their spidering system (and coincidentally when I temporarily stopped blogging). All the decreasing traffic I was getting was coming from old indexes, which were being cleared down as a result of me switching off caching in May 2006. I have switched caching back on (the only way I can work out, on a page basis, what is being indexed and when) and prompted Google to respider the site.

It only discovered the last post, and I came to the conclusion that the bot appears to discover/index through the default Blogger Atom feed (the first feed listed in the home page header) rather than the RSS feed listed second or the HTML itself. My blog settings are such that the home page and the Atom feed only show the last post. This means that if I did 2 postings a week and the bot visited once a week, only half the posts would get discovered. I removed the Atom feed from the header and pointed the sitemapping tool at the RSS feed (which contains the last 15 posts; I am still looking at this). Since there was a shortfall of about 45 posts (mostly prior to July 2005) not being indexed, I put up a temporary page on another (non-authoritative) property listing them for the bot to pick up. When it did, I took this page down since it could distort inbound link counts (and I am actually quite ethical).
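
The "only half the posts" arithmetic can be checked with a small simulation. This is a sketch under simplifying assumptions: posts and bot visits are evenly spaced, the bot only ever sees what is in the feed at the moment it visits, and the function name and parameters are invented for this illustration rather than taken from any Google or Blogger tool.

```python
def discovered_fraction(posts_per_week, visits_per_week, feed_size, weeks=52):
    """Simulate a bot that only sees the posts in the feed at each visit."""
    total_posts = posts_per_week * weeks
    total_visits = visits_per_week * weeks
    seen = set()
    for visit in range(1, total_visits + 1):
        # Number of posts published by the time of this visit.
        published = visit * total_posts // total_visits
        # The feed holds only the newest `feed_size` posts.
        oldest_in_feed = max(0, published - feed_size)
        seen.update(range(oldest_in_feed, published))
    return len(seen) / total_posts


# Feed showing only the last post, 2 posts a week, 1 bot visit a week.
print(discovered_fraction(2, 1, feed_size=1))   # 0.5
# Feed holding the last 15 posts, same posting and visiting rates.
print(discovered_fraction(2, 1, feed_size=15))  # 1.0
```

Under these assumptions a 15-post feed catches everything, as long as the bot keeps visiting at least once for every 15 posts published.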

The Yahoo tools work similarly, although in a couple of key respects they are not as advanced. They do not yet let you put a metatag in the homepage to authenticate ownership, which means you can't see all the useful statistics. Apparently they are working on this. There is also no facility to indicate the priority that an author would put on pages. I think this facility is available(?) in Google XML sitemaps, not that it's any use if you are using the public Google blogging service, because you can't load an XML sitemap anywhere useful. If the author could indicate the page priority, then Google or other search engines would know which one to display when there are a number of "similar" pages. I am completely at a loss to understand why some of my pages display over other similar pages.
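
For what it is worth, the XML sitemap format does define an optional `<priority>` element for each URL, with a value from 0.0 to 1.0. The sketch below shows roughly what such a file looks like when generated with Python's standard library. The blog URLs are hypothetical placeholders, and the priority is only a hint about pages relative to each other on the same site, so there is no guarantee it decides which "similar" page gets displayed.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, priority) pairs; priority runs 0.0 to 1.0."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, priority in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages: the post the author cares about most gets the higher value.
pages = [
    ("http://example.blogspot.com/2006/08/main-post.html", 0.8),
    ("http://example.blogspot.com/2006/08/similar-post.html", 0.3),
]
print(build_sitemap(pages))
```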

I am now looking at how (a mystery) and IceRocket index the site. I have given up on MSN which has no tools .....
