July 2005 - Posts

Windows Vista Forums on DotNetJunkies

I created five new threaded discussion forums for Windows Vista (the operating system formerly known as "Longhorn"):

  • AERO
    AERO stands for Authentic, Energetic, Reflective, and Open and is the set of user experience guidelines for Windows Vista, governing the look and feel of the operating system.
  • Windows Presentation Foundation (Avalon)
    Windows Presentation Foundation (formerly code name "Avalon") is the presentation subsystem class libraries in WinFX.
  • Windows Communication Foundation (Indigo)
    Windows Communication Foundation (formerly code name "Indigo") is a set of .NET technologies for building and running connected systems. It is a new breed of communications infrastructure built around the Web services architecture.
  • InfoCard
    "InfoCard" is the code name for a WinFX component that provides the consistent user experience required by the identity metasystem.
  • RSS in Windows Vista
    Today RSS is primarily used for news sites, blogs, and increasingly for audio-based serialized content. But RSS has the potential for broader reach and to more deeply integrate the information it delivers across applications of various kinds.
     

Search Scoping vs. Search Filtering

In a meeting today I participated in a discussion about search (specifically searching of community resources of course). In the discussion we realized there was a lack of clarity on the difference between a scoped search and a filtered search - depending on who you asked they were either the same thing or very different. So I did a little research, and here are my findings.

The Problem Statement
Here are a few variations of the positions in the meeting:

  • Is scoping the same as filtering?
  • Is scope something you pass into a search and filter is something you do with the result set to reduce it?
  • Is scope the limitation of a search to a specific set of resources, and filter is a way of excluding content based on content attribution?

Research
According to Dictionary.com, scope is:

scope (n.)
 1. The range of one's perceptions, thoughts, or actions.
 2. Breadth or opportunity to function. See Synonyms at room.
 3. The area covered by a given activity or subject. See Synonyms at range.
 4. The length or sweep of a mooring cable.
 5. Informal. A viewing instrument such as a periscope, microscope, or telescope.

Also according to Dictionary.com, filter is:

fil·ter (n.)
 1.
  a. A porous material through which a liquid or gas is passed in order to separate the fluid from suspended particulate matter.
  b. A device containing such a material, especially one used to extract impurities from air or water.
 2.
  a. Any of various electric, electronic, acoustic, or optical devices used to reject signals, vibrations, or radiations of certain frequencies while allowing others to pass.
  b. A colored glass or other transparent material used to select the wavelengths of light allowed to reach a photosensitive material.
 3. Computer Science. A program or routine that blocks access to data that meet a particular criterion: a Web filter that screens out vulgar sites.

I continued to search around looking at other resources. Here are a couple of other references:

UseIt.com - Scoped Search: Sometimes special areas of a site are sufficiently coherent and distinct from the rest of the site that it makes sense to offer a scoped search: restricted to search that subsite only (the search scope).
Panoptic Search says, "...scoped search achieves a similar effect to the Google site:query operator, but allows specification of a comma-separated list of sites to be included or excluded."

Summary
My assessment, which of course should be taken as the final and authoritative word on the topic, is that scoping is the act of limiting a search to a set of resources (websites, sub-sites, etc.) while filtering is the act of excluding content based on contextual things, such as attribution or keywords. An example would be:

Search only dotnetjunkies.com and sqljunkies.com for the term "SqlDataAdapter" and return only content that includes code in C#.

In this example, the query term is "SqlDataAdapter", the scope is dotnetjunkies.com and sqljunkies.com, and the filter is only content with C# code.
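To make the distinction concrete, here is a minimal Python sketch of that example. It is only an illustration: the `Document` class, its `code_languages` attribution field, and the sample URLs are hypothetical, and a real search engine would apply scope at the index level rather than over an in-memory list.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Document:
    url: str
    text: str
    code_languages: frozenset  # hypothetical attribution metadata


def in_scope(doc, scope_sites):
    """Scope: is this document hosted on one of the sites being searched?"""
    host = urlparse(doc.url).hostname or ""
    return any(host == site or host.endswith("." + site) for site in scope_sites)


def matches_query(doc, term):
    """Query: does the document mention the search term?"""
    return term.lower() in doc.text.lower()


def passes_filter(doc, language):
    """Filter: keep only content attributed with the given code language."""
    return language in doc.code_languages


def search(docs, term, scope_sites, language):
    scoped = [d for d in docs if in_scope(d, scope_sites)]    # limit the resources
    hits = [d for d in scoped if matches_query(d, term)]      # run the query
    return [d for d in hits if passes_filter(d, language)]    # exclude by attribution


# Made-up sample data for illustration only.
docs = [
    Document("http://dotnetjunkies.com/a", "Filling a DataSet with SqlDataAdapter", frozenset({"C#"})),
    Document("http://www.sqljunkies.com/b", "SqlDataAdapter tips", frozenset({"VB.NET"})),
    Document("http://example.com/c", "SqlDataAdapter in depth", frozenset({"C#"})),
    Document("http://sqljunkies.com/d", "Indexing strategies", frozenset({"C#"})),
]

results = search(docs, "SqlDataAdapter", {"dotnetjunkies.com", "sqljunkies.com"}, "C#")
```

With this sample data, only the first document survives: the second is in scope but fails the C# filter, the third is out of scope, and the fourth never matches the query term.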

There it is, the definitive difference between a scoped search and a filtered search, and how they can work together.

FCS and Codezone: Web Service vs. RSS vs. Crawling

Recently I was asked why we are using a crawler-based system for the Federated Community Services instead of RSS or Web Services. This is an excellent question, and it requires a little history.

Federated Community Services (FCS) is a Web Service platform that will expose community features to a set of websites, specifically the Codezone Community websites and the Microsoft subsidiary websites. The services will include a variety of useful things, and the first one we are launching is the Codezone Community Content Search service, which I usually refer to simply as Search. The Search service idea dates back to a program that Microsoft started in 2001 called CodeWise. The CodeWise community was made up of a number of both online and offline community resources, such as websites, training companies and magazines. At the time I was a CodeWise member as both an author and the owner of DotNetJunkies.com.

As the program progressed, Microsoft introduced the concept of a community search feature in Visual Studio. At the time it was hoped that it could get in the Everett release (VS.NET 2003), but unfortunately it didn't make the cut. Plans went into full motion to include the community search concept in Whidbey (VS 2005).

There had been many discussions between Microsoft and the CodeWise sites about how the search feature should work, and a few key features were identified:

  1. The site owners must be able to control what content is included in the search index.
  2. The search results should list the title, description and the source URL of the content.
  3. The user should be redirected to the actual content on the page - the content itself should never be syndicated.
There were other features and issues, but these were the critical and defining features. The solution was that the sites would expose the information to Microsoft through a Web Service. An aggregator would call a Web Service hosted by each of the CodeWise websites on a nightly basis. The Web Service would expose information about what content on the site was new, what had changed and what had been deleted. The aggregator would use this information to update the search index.
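
As a rough illustration of that nightly update, the Python sketch below applies one site's reported changes to an in-memory index. The `Entry` and `ChangeSet` shapes are assumptions made for this example; they are not the actual CodeWise Web Service contract, which is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    # The three things a search result was meant to list.
    url: str
    title: str
    description: str


@dataclass
class ChangeSet:
    """Hypothetical shape of what one site's Web Service reports each night."""
    added: list = field(default_factory=list)    # new Entry objects
    changed: list = field(default_factory=list)  # updated Entry objects
    deleted: list = field(default_factory=list)  # URLs of removed content


def apply_change_set(index, changes):
    """Update the search index (a dict keyed by URL) from one site's changes."""
    for entry in changes.added + changes.changed:
        index[entry.url] = entry       # insert new content or overwrite stale metadata
    for url in changes.deleted:
        index.pop(url, None)           # ignore deletions we never indexed
    return index
```

The index only ever holds the title, description and source URL, which matches the requirement that the content itself is never syndicated.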

From the time this model was announced, many of the site owners expressed that they didn't like this idea. It meant that they had to create and maintain a Web Service to expose this information. For some it was difficult because they didn't have the time or resources; for others it was difficult because their sites were static content, not data-driven content, making it difficult to expose in a Web Service. Many sites had asked why Microsoft couldn't just crawl the sites using spidering technology that has existed for years, rather than requiring the site owners to build a Web Service. There was a strong sentiment that being included in the search index should not mean that the site owners had to build new features to accommodate Microsoft, but rather that Microsoft should invest in using technology that had little or no impact on the sites.

The Web Service aggregator system went into place. It launched late, primarily because many of the site owners didn't have their Web Services ready in time. Finally the community search was available to the public in the Visual Studio 2005 beta. Unfortunately there wasn't a lot of content getting into the system. Out of the 30+ sites that were part of the CodeWise program, only 12 sites had created a Web Service that worked reliably, and only 4 sites were exposing content through their service on a regular basis. Basically, while this was a good technical solution, it was not a good practical solution.

In October 2004 I was asked to contract to Microsoft for a year (I have since accepted a full-time position on this project) to fix what was wrong with the CodeWise program from a technical perspective (Amy Sorokas is the marketing superwoman). Specifically I was charged with fixing the CodeWise Search problem, which was in jeopardy of getting pulled from Whidbey due to lack of participation from the CodeWise sites. I began looking at the history of the program, which I was intimately familiar with, and at the requests, complaints, and feedback provided over the four years of the program. It quickly became apparent that the problems were based around three things:
  1. It was too difficult for a site owner to get their content included in the search index.
  2. It was too difficult, or there was too little benefit for a content publisher (site owner or author) to attribute their content according to a taxonomy established by Microsoft.
  3. If the site owners were going to expose their content in a search index, they wanted to be able to expose the search service as well.
I (along with a couple others) determined that the solution was to be tackled in three stages.
  1. Reduce the barrier to entry for getting content into the search system and add the ability for full-text search.
  2. Determine a way to automagically attribute the content, reducing the amount of attribution we ask the site owners to do.
  3. Expose the search functionality as a Web Service for the site owners to implement if they chose.
The solution to #1 was to move away from a system that required the site owners to build a Web Service. We looked into the possibility of RSS, but that didn't provide any better results - the site owners were still required to build a feed based on our specification. Ultimately we wanted the site owners to have to do as little as possible. We revisited the idea of a crawling or spidering mechanism. Something that wouldn't require very much of the sites, and would rely on existing technology - HTTP and HTML.

We met with some internal teams and eventually found a solution developed by our Assistance Platform team. It is a crawler that we can point to a master index, which is simply a list of URLs that we want to index. The URLs we point to are actually similar indexes on each of the sites - a list of links the site owner wants us to index. Nothing but basic HTML. This could be generated dynamically (for data-driven sites) or it could be a static page - it's just HTML. The links in the index point to the actual content we will index. We don't follow links off of the site, and we don't go any deeper than the content page listed. This ensures that we are only indexing what the site owner wants us to.
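
The crawl rules above can be sketched in a few lines of Python. To be clear, this is not the Assistance Platform crawler, whose implementation is not described here; it is a minimal illustration using only the standard library, and names such as `crawl`, `extract_links`, and `fetch_page` are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page they were found on.
    return [urljoin(base_url, href) for href in parser.links]


def fetch_page(url):
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(master_index_url, fetch=fetch_page):
    """Master index -> per-site index pages -> content pages, and no further."""
    pages = {}
    for site_index_url in extract_links(fetch(master_index_url), master_index_url):
        site_host = urlparse(site_index_url).hostname
        for content_url in extract_links(fetch(site_index_url), site_index_url):
            if urlparse(content_url).hostname != site_host:
                continue  # don't follow links off of the site
            # Fetch the listed page but never parse its links: no deeper crawl.
            pages[content_url] = fetch(content_url)
    return pages
```

Passing `fetch` as a parameter is just a convenience so the crawl logic can be tried against in-memory pages. The point of the sketch is that the site owner's index page is the only thing deciding what gets indexed.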

We have been working on this solution for a while, and actually have the crawler running in production now. We are finishing the code that plugs Visual Studio 2005 into our index, and the Web Service for exposing the Search service. The Search service should be online July 19, 2005, barring any last-minute VS2005 integration bugs that come up. The changes will show up in Visual Studio 2005 within a few weeks of that (I haven't gotten a firm date yet).

So, in a nutshell, the reason we chose a crawler over a Web Service or RSS was based on feedback we had gotten from the CodeWise (now rebranded to Codezone) Community site owners themselves. There was a definite need to create a solution where the site owners didn't have to build something to expose a service or feed to us, and crawling and indexing technology proved to be a solution that significantly lowers the technical barrier to entry.

More on Codezone and FCS to come soon.

Barbecues and Name Dropping

This weekend I went to a barbecue at Eric Ewing's. Eric and Bruce do this annual get-together at their incredibly wonderful home on Lake Washington. While there I chatted with Robert Scoble about when we are launching the Federated Community Services and when the new Codezone index will be exposed in Visual Studio 2005. I talked with Jeff Sandquist about the new Mustang GT (which apparently Scobleizer was not too impressed with), and I got beaten up by Kevin Briody for not having a good personal website, and not representing the Sammamish Plateau by not blogging enough.

Kevin had mentioned that Betsy Aoki had blogged about me in regard to her trip to TechEd Europe last week. Sure enough she did - she thinks I look like a pirate.

I am newly recommitted to blog more, and more specifically blog a lot more about Codezone and the Federated Community Services. I will occasionally toss in a blog post about non-tech stuff, such as wakeboarding, poker, or trips to the Home Depot (You can do it. We can help.).

Expect a Codezone blog later today.