Thursday, March 17, 2005 - Posts

Screen Scraping DotNetJunkies - oh and a host of other .Text sites

The what?

Here's a link to my DNJScraper.  It works with most of the themes I have tested it with, though if you see any errors please send them to me :)

And Why?

Recently I have had to interface with old web applications.  The database filesystem the application was written on is not thread safe, whereas the web application itself handles the threading.  Because of that limitation, I decided not to access the files directly and to interface with the old web application over HTTP instead.

I was reading an article in some Linux magazine that showed how to use Ruby to spider a blogging engine, as Ruby had "Great String manipulation and Network tools".  I thought it's got to be easier to use .NET's XML DOM, so I decided to do it with dotNetJunkies .Text, and it is a lot easier!

Oh How?

Back in February I blogged about HTML - XHTML, an HTML to XHTML conversion tool built with SGMLReader that I was using.  With that in mind, here's the rest of the app.

To start, here's "SiteInterface.asmx.cs".  This is a simple class that can either POST or GET to any web site and return valid XHTML.

using System;
using System.Collections;
using System.ComponentModel;
using System.Data;
using System.Diagnostics;
using System.Web;
using System.Web.Services;
/* Extra Namespaces Required */
using System.Text;        // Needed For UTF8Encoding
using System.Net;        // Needed For WebClient

namespace DNJScrape
{
    /// <summary>
    /// Summary description for SiteInterface.
    /// </summary>
    public class SiteInterface : System.Web.Services.WebService
    {
        public SiteInterface()
        {
            //CODEGEN: This call is required by the ASP.NET Web Services Designer
            InitializeComponent();
        }
        #region Component Designer generated code
        
        //Required by the Web Services Designer 
        private IContainer components = null;
                
        /// <summary>
        /// Required method for Designer support - do not modify
        /// the contents of this method with the code editor.
        /// </summary>
        private void InitializeComponent()
        {
        }
        /// <summary>
        /// Clean up any resources being used.
        /// </summary>
        protected override void Dispose( bool disposing )
        {
            if(disposing && components != null)
            {
                components.Dispose();
            }
            base.Dispose(disposing);        
        }
        
        #endregion
        /// <summary>
        /// Creates valid XHTML from a (possibly invalid) HTML string
        /// </summary>
        /// <param name="cHtml">Invalid HTML string to be processed</param>
        /// <returns>The input converted to valid XHTML</returns>
        [WebMethod]
        public string MakeValid( string cHtml )
        {
            string cXHTML = "";
            SgmlReaderDll.SGMLReaderHelper Reader = new SgmlReaderDll.SGMLReaderHelper();
            cXHTML = Reader.ProcessString(cHtml);
            return cXHTML;
        }

        /// <summary>
        /// Returns valid XHTML from any URL using GET. Note that the vars are appended to the URL; this is how GET works
        /// </summary>
        /// <param name="cURL"> e.g. http://somedomain.com/default.aspx?var1=foo&var2=bar</param>
        /// <returns>The page content as valid XHTML</returns>
        [WebMethod]
        public string LoadPageGET( string cURL )
        {
            WebClient oClient = new WebClient();
            UTF8Encoding oEncode = new UTF8Encoding();
            return this.MakeValid( oEncode.GetString( oClient.DownloadData( cURL ) ) );
        }

        /// <summary>
        /// Returns valid XHTML from any URL
        /// using POST
        /// e.g. http://somedomain.com/default.aspx
        /// the vars are sent to the server in the request body
        /// </summary>
        /// <param name="cURL"> e.g. http://somedomain.com/default.aspx</param>
        /// <param name="cParams">string "var1=foo&var2=bar"</param>
        /// <returns>The response content as valid XHTML</returns>
        [WebMethod]
        public string LoadPagePOST( string cURL, string cParams )
        {
            WebClient oClient = new WebClient();
            UTF8Encoding oEncode = new UTF8Encoding();
            oClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
            return this.MakeValid( oEncode.GetString( oClient.UploadData(cURL, "POST", Encoding.ASCII.GetBytes(cParams)) ) );
        }
    }
}
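LoadPagePOST expects cParams already in "var1=foo&var2=bar" form, so any value containing characters such as & or = would need escaping first.  Here is a minimal sketch of one way to build that string.  FormParams and Build are hypothetical names of mine and are not part of DNJScrape, and the sketch assumes a framework version that has generics and Uri.EscapeDataString (newer than what the listing above targets).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the DNJScrape source.
public static class FormParams
{
    // Joins name/value pairs into "var1=foo&var2=bar" form,
    // percent-escaping every name and value along the way.
    public static string Build(IEnumerable<KeyValuePair<string, string>> pairs)
    {
        StringBuilder sb = new StringBuilder();
        foreach (KeyValuePair<string, string> pair in pairs)
        {
            // Separate pairs with '&', but not before the first one
            if (sb.Length > 0)
            {
                sb.Append('&');
            }
            sb.Append(Uri.EscapeDataString(pair.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(pair.Value));
        }
        return sb.ToString();
    }
}
```

The result could then be passed as the cParams argument of LoadPagePOST.  Note that Uri.EscapeDataString encodes a space as %20 rather than the + that browsers usually send for forms; servers generally accept both.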

 

Next is "DNJProcessing.asmx.cs", which is where the blogs get processed.  It contains four methods, all of which are marked [WebMethod] (I had ideas of creating a XAP / chubby client for it).

using System;
using System.Collections;
using System.ComponentModel;
using System.Data;
using System.Diagnostics;
using System.Web;
using System.Web.Services;
/* Extra Namespaces Required */
using System.Xml;        // Needed For Xml...
namespace DNJScrape
{
    /// <summary>
    /// Summary description for DNJProcessing.
    /// </summary>
    public class DNJProcessing : System.Web.Services.WebService
    {
        public SiteInterface oSiteInterface;
        public DNJProcessing()
        {
            //CODEGEN: This call is required by the ASP.NET Web Services Designer
            InitializeComponent();
            this.oSiteInterface = new SiteInterface();
        }
        #region Component Designer generated code
        
        //Required by the Web Services Designer 
        private IContainer components = null;
                
        /// <summary>
        /// Required method for Designer support - do not modify
        /// the contents of this method with the code editor.
        /// </summary>
        private void InitializeComponent()
        {
        }
        /// <summary>
        /// Clean up any resources being used.
        /// </summary>
        protected override void Dispose( bool disposing )
        {
            if(disposing && components != null)
            {
                components.Dispose();
            }
            base.Dispose(disposing);        
        }
        
        #endregion
        /// <summary>
        /// Query a DNJ .Text weblog for the archive list of months that the user has posts for
        /// </summary>
        /// <param name="cURL"> http://dotnetjunkies.com/weblog/kmotion/ </param>
        /// <returns>A collection of months and their archive URLs, e.g. http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03.aspx</returns>
        [WebMethod]
        public ReturnObjs.MonthsArchive [] GetMonths( string cURL )
        {        
            // Create an XmlDocument
            XmlDocument oXHTML = new XmlDocument();
            // Get the XHTML for the given URL using GET and load it into oXHTML
            oXHTML.LoadXml( this.oSiteInterface.LoadPageGET( cURL ) );
            // Select the <h3>Archives</h3> tag.  
            XmlNode oArchiveTitle        = oXHTML.SelectSingleNode("//h3[text()='Archives']");
            if( oArchiveTitle == null )
            {
                // Some themes use <h1 class = "listtitle">Archives</h1> instead
                oArchiveTitle        = oXHTML.SelectSingleNode("//h1[text()='Archives']");
            }
            if( oArchiveTitle == null || oArchiveTitle.NextSibling == null )
            {
                // Unrecognised theme: no archive list found, so return an empty result
                return new ReturnObjs.MonthsArchive [0];
            }
            
            // Select the <a> tags of the <ul> (the node next to the title in the DOM).
            XmlNodeList oArchiveLinks    = oArchiveTitle.NextSibling.SelectNodes("li/a");
            // Create a ReturnObj;
            ReturnObjs.MonthsArchive [] oResult = new DNJScrape.ReturnObjs.MonthsArchive [oArchiveLinks.Count];
            int nCounter = 0;
            // Fill ReturnObj with Data
            foreach( XmlNode oMonth in oArchiveLinks )
            {
                oResult[nCounter]            = new ReturnObjs.MonthsArchive();
                // Take the textnodes value 
                oResult[nCounter].cMonth    = oMonth.FirstChild.Value;
                // Take the href attrib value
                oResult[nCounter].cUrl        = oMonth.SelectSingleNode("@href").Value;
                // Could do some string / regexp work with cMonth to obtain the number of posts

                nCounter++;
            }
            return oResult;
        }
        /// <summary>
        /// Query DNJ .Text weblog for the list of Posts for a given month
        /// </summary>
        /// <param name="cURL"> http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03.aspx </param>
        /// <returns></returns>
        [WebMethod]
        public ReturnObjs.MonthPosts []  GetMonthPosts( string cURL )
        {
            // Create an XmlDocument
            XmlDocument oXHTML = new XmlDocument();
            // Get the XHTML for the given URL using GET and load it into oXHTML
            oXHTML.LoadXml( this.oSiteInterface.LoadPageGET( cURL ) );
            // Select the links for each post <div class="post"><h5><a>This link</a></h5></div> 
            XmlNodeList oPosts        = oXHTML.SelectNodes("//div[@class='post']/h5/a");
            if( oPosts.Count == 0 )
            {
                //scrape for this <li class = "entrylistitem">
                oPosts        = oXHTML.SelectNodes("//li[@class='entrylistitem']/a");
            }
            // Create a ReturnObj;
            ReturnObjs.MonthPosts [] oResult = new DNJScrape.ReturnObjs.MonthPosts [oPosts.Count];
            int nCounter = 0;
            // Fill ReturnObj with Data
            foreach( XmlNode oPost in oPosts )
            {
                oResult[nCounter]                = new ReturnObjs.MonthPosts();
                // Take the textnodes value 
                oResult[nCounter].cPostTitle    = oPost.FirstChild.Value;
                // Take the href attrib value
                oResult[nCounter].cUrl            = oPost.SelectSingleNode("@href").Value;
                nCounter++;                
            }            
            return oResult;    
        }

        /// <summary>
        /// Query DNJ .Text weblog for the Post.
        /// </summary>
        /// <param name="cURL"> http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03/14/60261.aspx </param>
        /// <returns></returns>
        [WebMethod]
        public XmlNode  GetPost( string cURL )
        {
            // Create an XmlDocument
            XmlDocument oXHTML = new XmlDocument();
            // Get the XHTML for the given URL using GET and load it into oXHTML
            oXHTML.LoadXml( this.oSiteInterface.LoadPageGET( cURL ) );
            // Select the container of the post <div class="post">...</div> 
            XmlNode oPost        = oXHTML.SelectSingleNode("//div[@class='post']");
            if( oPost == null )
            {
                // scrape for this     <div class = "singlepost">
                oPost        = oXHTML.SelectSingleNode("//div[@class='singlepost']");
            }
            return oPost;
        }
    
        /// <summary>
        /// Query a DNJ .Text weblog for all the posts by a junkie!
        /// </summary>
        /// <param name="cURL"> http://dotnetjunkies.com/WebLog/kmotion/ </param>
        /// <returns>The XHTML of every post, concatenated into one string</returns>
        [WebMethod]
        public string GetAllPosts( string cURL )
        {
            string cResult = "";
            ReturnObjs.MonthsArchive [] oMonths = this.GetMonths( cURL );
            foreach( ReturnObjs.MonthsArchive oMonth in oMonths )
            {
                ReturnObjs.MonthPosts [] oMonthPosts = this.GetMonthPosts( oMonth.cUrl );
                foreach( ReturnObjs.MonthPosts oMonthPost in oMonthPosts )
                {
                    cResult += this.GetPost( oMonthPost.cUrl ).OuterXml;
                }
            }
            return cResult;
        }
    }
}

The methods are:

  • public ReturnObjs.MonthsArchive [] GetMonths( string cURL )

This method receives the URL of the root of the blogger's area (http://dotnetjunkies.com/WebLog/kmotion/).  It returns the months the user has been blogging for and the respective URLs for those months.

  • public ReturnObjs.MonthPosts []  GetMonthPosts( string cURL )

This method receives the URL of the archive page for a given month (http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03.aspx).  It returns a list of all the posts the user made that month and the respective URLs for those posts.

  • public XmlNode  GetPost( string cURL )

This method receives the URL of a post (http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03/01/57618.aspx).  It returns the post as a valid XHTML node.

  • public string GetAllPosts( string cURL )

This method receives the URL of the root of the blogger's area (http://dotnetjunkies.com/WebLog/kmotion/).  It returns a valid XHTML string of all the posts made by that user.

*PLEASE NOTE* - I have only used GetAllPosts once and I have disabled it but left the code in place.  I thought it was a bit irresponsible, as it could put a heavy load on the DNJ servers if the user has made a large number of posts.
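To see the XPath approach used by GetMonths without hitting a live site, here is a small sketch that runs the same queries against an inline string.  The sample markup is made up by me to resemble a cleaned-up .Text sidebar, and ArchiveXPathDemo is a hypothetical name; the sketch also assumes the XHTML carries no default namespace, since otherwise the XPath would need an XmlNamespaceManager.

```csharp
using System;
using System.Xml;

// Hypothetical demo class, not part of the DNJScrape source.
public static class ArchiveXPathDemo
{
    // Made-up sample standing in for the XHTML that MakeValid returns.
    public const string SampleXhtml =
        "<html><body><div>" +
        "<h3>Archives</h3>" +
        "<ul>" +
        "<li><a href=\"/archive/2005/03.aspx\">March, 2005 (4)</a></li>" +
        "<li><a href=\"/archive/2005/02.aspx\">February, 2005 (2)</a></li>" +
        "</ul>" +
        "</div></body></html>";

    // Returns one "text -> href" string per archive link, or an empty array
    // when the heading or the list that should follow it is missing.
    public static string[] ExtractLinks(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Find the heading, then step to the element right after it
        XmlNode title = doc.SelectSingleNode("//h3[text()='Archives']");
        if (title == null || title.NextSibling == null)
        {
            return new string[0];
        }

        XmlNodeList links = title.NextSibling.SelectNodes("li/a");
        string[] result = new string[links.Count];
        for (int i = 0; i < links.Count; i++)
        {
            result[i] = links[i].InnerText + " -> " + links[i].Attributes["href"].Value;
        }
        return result;
    }
}
```

Calling ArchiveXPathDemo.ExtractLinks(ArchiveXPathDemo.SampleXhtml) yields one entry per month in the sample list.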

These are "DNJScrape.ReturnObjs.MonthsArchive" and "DNJScrape.ReturnObjs.MonthPosts", two very simple classes that hold the results returned by the web service.

using System;
namespace DNJScrape.ReturnObjs
{
    public class MonthsArchive
    {
        public MonthsArchive(){    }
        public string cMonth;
        public string cUrl;
    }
    
    public class MonthPosts
    {
        public MonthPosts()    {}
        public string cPostTitle;
        public string cUrl; 
    }
}

And Where?

  • Demo - DNJScraper is a simple ASP.NET application I used to wrap the web services.
  • Code - DNJScrape.zip is the complete source for the application.

So What?

As it happens, it also works for other .Text blog engines, which is not too surprising really, but I don't think it will work on all of them.  It's dependent on the theme chosen by the blogger.
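Since the markup varies by theme, one way to support more of them is to try a list of XPath expressions in order rather than chaining null checks.  This is only a sketch of that idea; ThemeFallback and SelectFirst are hypothetical names and are not in DNJScrape.zip.

```csharp
using System.Xml;

// Hypothetical helper, not part of the DNJScrape source.
public static class ThemeFallback
{
    // Tries each XPath expression in order and returns the first node
    // that matches, or null when none of them match.
    public static XmlNode SelectFirst(XmlNode root, params string[] xpaths)
    {
        foreach (string xpath in xpaths)
        {
            XmlNode node = root.SelectSingleNode(xpath);
            if (node != null)
            {
                return node;
            }
        }
        return null;
    }
}
```

With this, the lookup in GetPost could be written as SelectFirst( oXHTML, "//div[@class='post']", "//div[@class='singlepost']" ), and supporting another theme would mean adding one more expression to the list.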

Have fun

Andy

Now playing: Metallica - Wherever I May Roam

 

 

 
