Thursday, January 08, 2004 - Posts

The semantic web using xml namespaces

I've got yet another wacky idea: a way to create the semantic web using XML namespaces.  There are some hurdles, and I'll discuss those later.  One of the common problems with the web is that machines cannot get the context of the markup.  When you talk about Java, are you talking about the island in Indonesia, a slang term for coffee, or the programming language?  The machine doesn't know.

Well, that's a problem.  Let’s say you are reading a web site about coffee and you want to find some Java grade A coffee.  If you put Java in a Google search, guess what you get: nothing but links to Java programming.  I’m a .NET guy, so I started going through the links, and finally on result number 68 I got the first reference to the word “coffee”.  Guess again: it had nothing to do with coffee and everything to do with Java programming.  On Google hit #150 I got something about the Indonesian island, and at hit #250 I stopped looking for anything to do with Java and coffee.

To create the semantic web there is RDF.  While RDF works, I was thinking back to an earlier post of mine about XML namespaces.  Namespaces are about context; for example, this node (<xhtml:b/>) belongs to the XHTML collection.  And I thought, what a great way to provide context or “semantic meaning” to XML!  It’s simple and fairly lightweight.  All you do is namespace your nodes and then provide processing instructions that can be used by a semantic parser.  Below is a sample of what I mean.

<?xml version="1.0" encoding="UTF-8"?>
<?semantic namespace="pjava"  library="http://www.sun.com/programming/java" about="Computer programming language"?>
<?semantic namespace="cjava" library="http://www.dictionary.com/slang" about="Coffee from the island of Indonesia"?>
<root xmlns:cjava="http://www.dictionary.com/slang" xmlns:pjava="http://www.sun.com/programming/java" xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <node>
      <pjava:subnode>Java programming is ruled by Sun</pjava:subnode>
      <cjava:subnode><xhtml:b>Java coffee</xhtml:b> is grown in the Sun</cjava:subnode>
   </node>
</root>

Notice that the above XML is a lightweight way to provide semantics.  The other benefit is that by using XML namespaces we can mark up our existing documents regardless of what kind of XML document they are.  We can add semantic namespaces to RSS documents, to WordML, or to just plain old XML.  Of course, you could end up with a large number of semantic processing instruction nodes; however, declaring them once lets you reuse that information throughout your document.  And XML namespaces allow your content's context to cascade, enabling you to do something like this.

<?xml version="1.0" encoding="UTF-8"?>
<?semantic namespace="dotnet" library="http://msdn.microsoft.com/vcsharp/" about=".NET Programming using c#" lang="English"?>
<?semantic namespace="java"  library="http://www.sun.com/programming/java" about="Computer programming language" lang="English"?>
<?semantic namespace="me"  library="sean@somewhere.com" about="Sean's email address" lang="English"?>
<root xmlns:dotnet="http://msdn.microsoft.com/vcsharp/" xmlns:java="http://www.sun.com/programming/java" xmlns:me="http://www.somewhere.com" xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <node>
      <dotnet:subnode>
         <me:semantic>My</me:semantic> Microsoft programming language of choice is C#.  C# is also becoming the crossover language of choice for <java:semantic>Java programmers</java:semantic>.</dotnet:subnode>
   </node>
</root>

Notice that the dotnet namespace is on the parent node, which contains two other semantic subnodes.  The first, <me:semantic>, refers through its semantic processing instruction to an email address for contacting me (not my real address).  The other is for the Java reference.  Another thing you’ll notice is that there is no semantic processing instruction for the xhtml namespace.  This way you can have namespaces that are not included in your semantic processing, which will be helpful in adding semantic information to things like Word documents and existing HTML documents.
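Here is a rough sketch of how a semantic parser might resolve that cascade, written in Python with the standard-library minidom.  The function names (semantic_table, annotate) and the rule that text inherits the nearest enclosing prefix that has a semantic processing instruction are my assumptions for illustration; no existing tool works this way.  The sample is a shortened variant of the document above, with one xhtml:b element added to show a namespace without a semantic processing instruction being passed over.

```python
import re
from xml.dom import minidom

# Shortened variant of the sample document; the xhtml:b element is added
# here only to show a namespace that has no semantic processing instruction.
SAMPLE = """
<?semantic namespace="dotnet" library="http://msdn.microsoft.com/vcsharp/" about=".NET Programming using c#" lang="English"?>
<?semantic namespace="java" library="http://www.sun.com/programming/java" about="Computer programming language" lang="English"?>
<?semantic namespace="me" library="sean@somewhere.com" about="Sean's email address" lang="English"?>
<root xmlns:dotnet="http://msdn.microsoft.com/vcsharp/" xmlns:java="http://www.sun.com/programming/java" xmlns:me="http://www.somewhere.com" xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <node>
      <dotnet:subnode><me:semantic>My</me:semantic> language of choice is <xhtml:b>C#</xhtml:b>, popular with <java:semantic>Java programmers</java:semantic>.</dotnet:subnode>
   </node>
</root>
""".strip()


def semantic_table(doc):
    """Map each prefix named in a <?semantic ...?> instruction to its fields."""
    table = {}
    for child in doc.childNodes:
        if (child.nodeType == child.PROCESSING_INSTRUCTION_NODE
                and child.target == "semantic"):
            # The instruction body is plain text, so pull out name="value" pairs.
            fields = dict(re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', child.data))
            if "namespace" in fields:
                table[fields["namespace"]] = fields
    return table


def annotate(node, table, context=None, out=None):
    """Pair each piece of text with the nearest enclosing semantic prefix."""
    if out is None:
        out = []
    if node.nodeType == node.ELEMENT_NODE and node.prefix in table:
        context = node.prefix  # a declared prefix replaces the inherited context
    if node.nodeType == node.TEXT_NODE and node.data.strip() and context:
        out.append((node.data.strip(), context))
    for child in node.childNodes:
        annotate(child, table, context, out)
    return out


if __name__ == "__main__":
    doc = minidom.parseString(SAMPLE)
    table = semantic_table(doc)
    for text, prefix in annotate(doc, table):
        print(f"{text!r}: {table[prefix]['about']}")
```

Because xhtml has no semantic processing instruction, the text inside <xhtml:b> keeps the dotnet context of its parent, while the <me:semantic> and <java:semantic> subnodes override it for their own text.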

Here’s a breakdown of the semantic processing instructions. 

<?semantic namespace="dotnet" library="http://msdn.microsoft.com/vcsharp/" about=".NET Programming using c#" lang="English"?>
<?semantic namespace="me"  library="sean@somewhere.com" about="Sean's email address" lang="English"?>

First we have the “<?semantic” portion of the node, which tells XML parsers that semantic processing instruction information is available.  Second, the namespace attribute links the XML namespace prefix to the semantic information.  Third, the library attribute gives you a resource identifier, which could be a web site, an email address, or even a physical address.  Fourth, the about attribute gives additional context information.  Lastly, the optional lang attribute gives the language.  In the case of the email address, it tells you the person’s language of choice.
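The breakdown above could be read by a parser along these lines.  XML treats the body of a processing instruction as plain text, so the name="value" pairs have to be pulled out by hand.  This is only a sketch: the function name parse_semantic_pi is made up, and treating namespace, library, and about as required is my assumption (the only thing stated above is that lang is optional).

```python
import re

# Matches name="value" pairs; only double-quoted values are handled,
# since that is the only form used in the examples.
PAIR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

# Assumption: everything except lang is required.
REQUIRED = ("namespace", "library", "about")


def parse_semantic_pi(data):
    """Turn the body of a <?semantic ...?> instruction into a dict."""
    fields = dict(PAIR.findall(data))
    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError("semantic instruction is missing: " + ", ".join(missing))
    fields.setdefault("lang", None)  # lang is optional
    return fields


if __name__ == "__main__":
    body = ('namespace="me"  library="sean@somewhere.com" '
            'about="Sean\'s email address" lang="English"')
    print(parse_semantic_pi(body))
```

Returning None for a missing lang keeps the record shape the same for every instruction, so a caller does not need to check whether the key exists.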

I think the killer app would be a task pane in Word or Excel, or a sidebar app in Longhorn, that would allow you to highlight a section of text and then click a namespace (similar to the “Styles and Formatting” task pane) to add that metadata to the document.  This would allow you to easily add semantic meaning to your documents.
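Stripped of the user interface, the highlight-and-click step comes down to wrapping the selected characters in a namespaced element.  Below is a minimal sketch of that one step in Python, assuming plain text input and character offsets for the selection.  The function name tag_selection is hypothetical, and the element name semantic is borrowed from the second sample.

```python
import re
from xml.sax.saxutils import escape


def tag_selection(text, start, end, prefix):
    """Wrap text[start:end] in <prefix:semantic>, escaping the rest as XML."""
    if not re.fullmatch(r"[A-Za-z_][\w.-]*", prefix):
        raise ValueError("not a usable namespace prefix: %r" % prefix)
    if not 0 <= start < end <= len(text):
        raise ValueError("selection is out of range")
    before, selected, after = text[:start], text[start:end], text[end:]
    return "%s<%s:semantic>%s</%s:semantic>%s" % (
        escape(before), prefix, escape(selected), prefix, escape(after))


if __name__ == "__main__":
    print(tag_selection("Java coffee is grown in the sun", 0, 11, "cjava"))
```

A real task pane would also have to add the matching xmlns declaration and semantic processing instruction to the document if they are not already there.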

I welcome all thoughts and comments and remember this is a first pass.