Sriram Krishnan (Moved to http://www.sriramkrishnan.com/blog)

Search. Usability. Virtual machines.Geek stuff

<December 2008>
SuMoTuWeThFrSa
30123456
78910111213
14151617181920
21222324252627
28293031123
45678910


Navigation

Subscriptions

News

Link blog
Technorati Profile
The Blogs I read
Creative Commons Licence
This work is licensed under a Creative Commons License.


Monday, November 22, 2004 - Posts

Reverse Engineering Google Desktop Search

With all the hype around, I thought I would take a look into the internals of Google desktop search to see how it works. Now, I could have probably got a lot more information by disassembling it but I didn't know whether this was prohibited by the license agreement or something. So I used non-intrusive tools like FileMon, Regmon to see it in action. Do note that this post only covers how it crawls and indexes. I haven't looked into how the results are displayed yet - but Scott Hanselman has done some interesting analysis in this area. I've just started writing code for my own version of desktop search - so I wanted to check out how the market leaders work. If someone has a build of MSN Desktop Search, feel free to mail it to me :-)

The Program

The application runs 3 exes at any given point in time- GoogleDesktopIndex.exe, GoogleDesktopSearch.exe and GoogleDesktopCrawl.exe. Of this, GoogleDesktopSearch.exe is the most boring. This is the icon you click on and it runs in your system tray and also launches the other exes. After that, GoogleDesktopCrawl.exe (from here onwards, referred to as the crawler) and GoogleDesktopIndex.exe (the indexer) communicate with each other.

The Registry Keys

The registry keys give you access to options that you don't get with the installer. On my system, the key is present at HKEY_USERS\S-1-5-21-2025429265-1682526488-854245398-1003\Software\Google\Google Desktop. Now, you see some interesting options here.The 'date_dir' value lets you specify where you want your index stored.This is a big thing for people who don't have 1 GB free on their OS partition. The 'indexing_resume_time' gets set every time you say 'Pause  indexing for 15 minutes'. The indexer polls this value to figure out when it should start up again.The others are also pretty self-explanatory though I'm puzzled as to what 'kThrottlerDataNameId' does. This is an interesting key as both the indexer and crawler have a mutex with the same name. I really don't have a guess as to what this does.

On Start-up

The first time the program is run, it goes through your entire hard-disk,email,etc. This is the job of the crawler. Now, how it goes about finding all this stuff is pretty interesting. I ran Filemon(available from www.sysinternals.com) to see what were the files it was reading. The crawler seems to jump around a lot and doesn't seem to read a lot of stuff in order. For example, I had the crawler deep inside D:/Perl and then it jumped to somewhere deep inside E:\. However, this is only for files. I'm not sure how the Outlook mail and the chat logs are being read - or maybe I just missed it.

As expected, the crawler first goes through your desktop , my documents and their subfolders. This makes sense as these are the most visible folders and you would want changes in them to be reflected instantly in the results

Here's the interesting bit - the crawler reads through EXEs. In fact, it reads the entire EXEs and it reads all of them. Now, I have no idea why they would do this - what information could they be looking for *inside* EXEs? I don't think they're looking for plain text strings inside EXEs as search results don't bring them up.This is still a huge mystery to me.

At the same time, the indexer keeps on churning in the background. I'll cover how these 2 communicate and what the indexer writes to a bit later.

Btw, an interesting snippet of information for those among you who poked fun at Microsoft's Indexing Service when Google Desktop Search got released. GDS uses a lot of indexing services' functions to read through office documents - it links to query.dll and calls it for every office document it finds. This is another example of Google using Microsoft technology to take the fight to Microsoft :-)

When you change something

Now,the moment you change something, the crawler picks it up and reads through the file. But what is interesting is how it is notified - it is the indexer and not the crawler which signs up to notifications. If you are curious as to how these notifications are done, check out MSDN for the FindFirstChange and FindNextChange APIs. The indexer on start-up subscribes to be notified when anything changes on any drive . Now, these pose an interesting problem for the indexer - it gets notified when it writes to the disk too (as this counts as files being changed). In fact, it gets tons of such messages - a lot of work must have been put into dealing with those notification messages as quickly as possible.

Now that the indexer knows that something has changed, it has to tell the crawler. Now, here's the problem. I really haven't figured out how the crawler and the indexer talk to each other. But I'm willing to bet they're using memory-mapped files as they both call the MapViewOfFile APIs and Sysinternals' process explorer shows that both hold references to memory-mapped segments.My problem is that FileMon doesn't seem to pick up memory mapped files being written to - so I really can't see when and where this writing is taking place. My guess is that they have a standard producer-consumer system and they use a mutex to synchronize. In fact, there's evidence to point to this as process explorer shows that both the crawler and the indexer access the same mutexes (the _GD_* mutexes). Or I could be completely wrong and this communication. Basically, this is nothing but the indexer telling the crawler -"Hey - that file has changed, go check it out and tell me the contents". I don't understand why this notification work is done by the indexer - somehow seems more logical to be done by the crawler. But hey - who am I to question Google's engineers?

Now that the crawler has been given a file to go and crawl, it does so dutifully. Again, this puzzles me - the indexer already knows some file has been changed. So why doesn't it go read it itself instead of handing work off to the crawler and having the crawler report back to it? No idea.

Now that the crawler has actually read through the file, it needs to tell the indexer. Again, I'm not able to track down how the inter-process communication works.

[Digression : Here I want to talk about why I'm not able to find out how this communication actually works. Basically, it is because I'm too dumb at this kind of thing. Someone like Raymond Chen could probably do this in his sleep and not bat an eyelid.

This is a place where I'm really kicking myself for not knowing WinDBG well enough and being a newbie at this kind of debugging. I tried to see what file is being mapped by setting a breakpoint on the appropriate function in kernel32 but that didn't work (though breakpoints in other dlls work perfectly. I then tried to set a breakpoint on all kernel32 functions by doing 'bm kernel32!*' but threw back the same error message about not finding symbols. This inspite of me having symbols properly set up and it working for all other apps I've been trying to debug. I'll update this post when I figure out what I'm screwing up. I'm going to indulge in some Matt Pietrek-ness by using SoftIce or by using WinDBG better to find out what actually is going on.]

Here's the interesting bit - the crawler doesn't seem to write to any of the files in the Documents and Settings\<UserName>\Local Settings\Application Data\Google folder directly. It sends all this stuff back to the indexer (probably through memory-mapped files again) who goes about indexing it again.

The Files

Here's where some deep magic is happening.In this post, 'deep magic' means I'm too dumb to figure out what exactly is happening so I'm going to try and take a guess at it. Anyway, with that disclaimer out of the way, let's get back to understanding how the files look like. Now, the indexer calls gzip.dll so I'm pretty sure all the files are highly compressed. Anyway, I couldn't find any human readable strings in it.

Here are the files which I see in my index folder

 dbc2em.cf1
 dbc2emh.ht1
 dbdam
 dbdao
 dbeam
 dbeao
 dbm
 dbu2dm.cf1
 dbu2dmh.ht1
 dbvm.cf1
 dbvmh.ht1
 fii.cf1
 fiih.ht1
 hes.evt
 rpm.cf1
 rpmh.ht1

Take everything from now on with an even bigger pinch of salt than what you've been treating yourself to so far.

Right after the crawler reads through the changes, the indexer starts reading through some of the *.ht1 files and writes to some of the *.cf1 files. However, this doesn't work out all the time. When the crawler was doing the first time indexing of my comp, there was a lot of writing happening to these cf1 files. However, now (after finishing the full-crawl), the indexer just seems to read from both the .cf1 and the .ht1 files. I'm not sure as to why this is the case - probably because new files are not being added to the database? In that case, I tried saving a new file and still didn't get the indexer to write to the .cf1 files. This didn't happen - in fact, the indexer didn't write to any location at all! I would appreciate some help here as I'm clueless as to why this is happening -I'm not able to nail down what set of events cause what files to be written to.

I'm guessing that the db* files are some sort of indices (probably a reverse-lookup index). This guess comes from the observation that these files are read from in small chunks - mostly 4 bytes. So those 4 bytes could be some sort of index or a document id into the .cf1 and .ht1 files where the meat of the content seems to be stored.

When you search for something

I still haven't analyzed this part properly but I wanted to write up this blog post before I got involved in other work (like exams, for e.g). Read Scott Hanselman's post to see how GDS hooks into Wininet.dll to make sure every search result coming back from Google.com also invokes Google Desktop Search in the process. The indexer runs a local server at 4664. Another Google joke I guess - to choose a palindrome for a port. This number rings a bell - has Google used this number somewhere else?

Google desktop search for all its faults is a great piece of work. But I have to admit - the Google guys really know their platforms. I mean - one application which uses memory-mapped files,the indexing service, outlook and outlook express add-ins and also dll-hooking. All of this points to Google having people who know Windows really well. Probably all those ex-Microsoft employees :-)

 

 

 

posted Monday, November 22, 2004 1:01 AM by sriram




Powered by Dot Net Junkies, by Telligent Systems