The what?
Here's a link to my DNJScraper.
It works with most of the themes I have tested it with, though if you see any
errors please send them to me :)
And Why?
Recently I have had to interface with an old web application. The database
file system the application was written on is not thread safe, whereas the web
application itself handles the threading. So I decided not to access the files
directly and instead to interface with the old web application over HTTP.
I was reading an article in some Linux magazine that showed you how to use
Ruby to spider a blogging engine, as Ruby had "Great String
manipulation and Network tools". I thought it's got to be easier to
use .NET's XML DOM, so I've decided to do it with dotNetJunkies
.Text, and it is a lot easier!
Oh How?
Back in February I blogged about
HTML - XHTML, an HTML to XHTML conversion with the SGMLReader tool that I
was using. With that in mind, here's the rest of the app.
To start, here's "SiteInterface.asmx.cs". This is a simple class
that can either POST or GET to any web site and return valid XHTML.
using System;
using System.Collections;
using System.ComponentModel;
using System.Data;
using System.Diagnostics;
using System.Web;
using System.Web.Services;
/* Extra Namespaces Required */
using System.Text; // Needed For UTF8Encoding
using System.Net; // Needed For WebClient
namespace DNJScrape
{
/// <summary>
/// Summary description for SiteInterface.
/// </summary>
public class SiteInterface : System.Web.Services.WebService
{
public SiteInterface()
{
//CODEGEN: This call is required by the ASP.NET Web Services Designer
InitializeComponent();
}
#region Component Designer generated code
//Required by the Web Services Designer
private IContainer components = null;
/// <summary>
/// Required method for Designer support - do not modify
/// the contents of this method with the code editor.
/// </summary>
private void InitializeComponent()
{
}
/// <summary>
/// Clean up any resources being used.
/// </summary>
protected override void Dispose( bool disposing )
{
if(disposing && components != null)
{
components.Dispose();
}
base.Dispose(disposing);
}
#endregion
/// <summary>
/// Creates valid XHTML from an invalid HTML string
/// </summary>
/// <param name="cHtml">Invalid HTML string to be processed</param>
/// <returns></returns>
[WebMethod]
public string MakeValid( string cHtml )
{
string cXHTML = "";
SgmlReaderDll.SGMLReaderHelper Reader = new SgmlReaderDll.SGMLReaderHelper();
cXHTML = Reader.ProcessString(cHtml);
return cXHTML;
}
/// <summary>
/// Returns valid XHTML from any URL using GET. Note the vars are appended to the URL; this is how GET works
/// </summary>
/// <param name="cURL"> i.e. http://somedomain.com/default.aspx?var1=foo&var2=bar</param>
/// <returns></returns>
[WebMethod]
public string LoadPageGET( string cURL )
{
WebClient oClient = new WebClient();
UTF8Encoding oEncode = new UTF8Encoding();
return this.MakeValid( oEncode.GetString( oClient.DownloadData( cURL ) ) );
}
/// <summary>
/// Returns valid XHTML from any URL
/// using POST
/// i.e. http://somedomain.com/default.aspx
/// the vars are sent to the server in the request body
/// </summary>
/// <param name="cURL"> i.e. http://somedomain.com/default.aspx</param>
/// <param name="cParams">string "var1=foo&var2=bar"</param>
/// <returns></returns>
[WebMethod]
public string LoadPagePOST( string cURL, string cParams )
{
WebClient oClient = new WebClient();
UTF8Encoding oEncode = new UTF8Encoding();
oClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
return this.MakeValid( oEncode.GetString( oClient.UploadData(cURL, "POST", Encoding.ASCII.GetBytes(cParams)) ) );
}
}
}
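LoadPagePOST expects cParams to be a ready-made "var1=foo&var2=bar" string, so any value containing a space or an ampersand has to be escaped before it gets there. Here is a minimal sketch of how that string could be built. FormEncoder is a hypothetical helper that is not part of the DNJScrape source, and it assumes a framework version that has Uri.EscapeDataString and generics.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original DNJScrape code.
public static class FormEncoder
{
    // Builds a "name=value&name=value" body with every name and value percent-encoded.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            // Separate pairs with '&' (not before the first pair)
            if (body.Length > 0) body.Append('&');
            body.Append(Uri.EscapeDataString(field.Key));
            body.Append('=');
            body.Append(Uri.EscapeDataString(field.Value));
        }
        return body.ToString();
    }
}
```

The result can be passed as cParams. Spaces come out as %20 rather than +, which servers generally accept for form posts.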
This is "DNJProcessing.asmx.cs". This is where the blogs get
processed. It contains 4 methods, all of which are [WebMethod]s (I had
ideas of creating an XAP / chubby client for it).
using System;
using System.Collections;
using System.ComponentModel;
using System.Data;
using System.Diagnostics;
using System.Web;
using System.Web.Services;
/* Extra Namespaces Required */
using System.Xml; // Needed For Xml...
namespace DNJScrape
{
/// <summary>
/// Summary description for DNJProcessing.
/// </summary>
public class DNJProcessing : System.Web.Services.WebService
{
public SiteInterface oSiteInterface;
public DNJProcessing()
{
//CODEGEN: This call is required by the ASP.NET Web Services Designer
InitializeComponent();
this.oSiteInterface = new SiteInterface();
}
#region Component Designer generated code
//Required by the Web Services Designer
private IContainer components = null;
/// <summary>
/// Required method for Designer support - do not modify
/// the contents of this method with the code editor.
/// </summary>
private void InitializeComponent()
{
}
/// <summary>
/// Clean up any resources being used.
/// </summary>
protected override void Dispose( bool disposing )
{
if(disposing && components != null)
{
components.Dispose();
}
base.Dispose(disposing);
}
#endregion
/// <summary>
/// Query DNJ .Text weblog for the archive list of months that the user has blogs for
/// </summary>
/// <param name="cURL"> http://dotnetjunkies.com/weblog/kmotion/ </param>
/// <returns>A collection of months and their archive URLs, e.g. http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03.aspx</returns>
[WebMethod]
public ReturnObjs.MonthsArchive [] GetMonths( string cURL )
{
// Create an XmlDocument
XmlDocument oXHTML = new XmlDocument();
// Get the XHTML for the given URL using GET and load it into oXHTML
oXHTML.LoadXml( this.oSiteInterface.LoadPageGET( cURL ) );
//string cResult = "";
// Select the <h3>Archives</h3> tag.
XmlNode oArchiveTitle = oXHTML.SelectSingleNode("//h3[text()='Archives']");
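if( oArchiveTitle == null )
{
// Some themes use <h1 class="listtitle">Archives</h1> instead
oArchiveTitle = oXHTML.SelectSingleNode("//h1[text()='Archives']");
}
if( oArchiveTitle == null || oArchiveTitle.NextSibling == null )
{
// No archive list found for this theme, so return an empty result
return new ReturnObjs.MonthsArchive [0];
}
// Select the <a> tags of the <ul> (the node next to the heading in the DOM).
XmlNodeList oArchiveLinks = oArchiveTitle.NextSibling.SelectNodes("li/a");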
// Create a ReturnObj;
ReturnObjs.MonthsArchive [] oResult = new DNJScrape.ReturnObjs.MonthsArchive [oArchiveLinks.Count];
int nCounter = 0;
// Fill ReturnObj with Data
foreach( XmlNode oMonth in oArchiveLinks )
{
oResult[nCounter] = new ReturnObjs.MonthsArchive();
// Take the textnodes value
oResult[nCounter].cMonth = oMonth.FirstChild.Value;
// Take the href attrib value
oResult[nCounter].cUrl = oMonth.SelectSingleNode("@href").Value;
// Could do some string / regexp work with cMonth to obtain the number of posts
nCounter++;
}
return oResult;
}
/// <summary>
/// Query DNJ .Text weblog for the list of Posts for a given month
/// </summary>
/// <param name="cURL"> http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03.aspx </param>
/// <returns></returns>
[WebMethod]
public ReturnObjs.MonthPosts [] GetMonthPosts( string cURL )
{
// Create an XmlDocument
XmlDocument oXHTML = new XmlDocument();
// Get the XHTML for the given URL using GET and load it into oXHTML
oXHTML.LoadXml( this.oSiteInterface.LoadPageGET( cURL ) );
// Select the links for each post <div class="post"><h5><a>This link</a></h5></div>
XmlNodeList oPosts = oXHTML.SelectNodes("//div[@class='post']/h5/a");
if( oPosts.Count == 0 )
{
//scrape for this <li class = "entrylistitem">
oPosts = oXHTML.SelectNodes("//li[@class='entrylistitem']/a");
}
// Create a ReturnObj;
ReturnObjs.MonthPosts [] oResult = new DNJScrape.ReturnObjs.MonthPosts [oPosts.Count];
int nCounter = 0;
// Fill ReturnObj with Data
foreach( XmlNode oPost in oPosts )
{
oResult[nCounter] = new ReturnObjs.MonthPosts();
// Take the textnodes value
oResult[nCounter].cPostTitle = oPost.FirstChild.Value;
// Take the href attrib value
oResult[nCounter].cUrl = oPost.SelectSingleNode("@href").Value;
nCounter++;
}
return oResult;
}
/// <summary>
/// Query DNJ .Text weblog for the Post.
/// </summary>
/// <param name="cURL"> http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03/14/60261.aspx </param>
/// <returns></returns>
[WebMethod]
public XmlNode GetPost( string cURL )
{
// Create an XmlDocument
XmlDocument oXHTML = new XmlDocument();
// Get the XHTML for the given URL using GET and load it into oXHTML
oXHTML.LoadXml( this.oSiteInterface.LoadPageGET( cURL ) );
// Select the container of the post <div class="post">
XmlNode oPost = oXHTML.SelectSingleNode("//div[@class='post']");
if( oPost == null )
{
// scrape for this <div class = "singlepost">
oPost = oXHTML.SelectSingleNode("//div[@class='singlepost']");
}
return oPost;
}
/// <summary>
/// Query DNJ .Text weblog for all the Posts by a junkie!.
/// </summary>
/// <param name="cURL"> http://dotnetjunkies.com/WebLog/kmotion/ </param>
/// <returns></returns>
[WebMethod]
public string GetAllPosts( string cURL )
{
string cResult = "";
ReturnObjs.MonthsArchive [] oMonths = this.GetMonths( cURL );
foreach( ReturnObjs.MonthsArchive oMonth in oMonths )
{
ReturnObjs.MonthPosts [] oMonthPosts = this.GetMonthPosts( oMonth.cUrl );
foreach( ReturnObjs.MonthPosts oMonthPost in oMonthPosts )
{
XmlNode oPost = this.GetPost( oMonthPost.cUrl );
// Skip posts whose theme markup was not recognised
if( oPost != null ) cResult += oPost.OuterXml;
}
}
return cResult;
}
}
}
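To show the archive selection on its own without going over the network, here is a minimal sketch that runs the same XPath steps against an in-memory string. ArchiveSketch is a hypothetical class, and it assumes the XHTML has no default xmlns declared on the root element (with one, the XPath would need an XmlNamespaceManager).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical stand-alone version of the selection logic used in GetMonths.
public static class ArchiveSketch
{
    // Returns the href of every <a> in the list that directly follows the "Archives" heading.
    public static List<string> ArchiveUrls(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Try the <h3> heading first, then fall back to the <h1> used by other themes
        XmlNode title = doc.SelectSingleNode("//h3[text()='Archives']");
        if (title == null)
            title = doc.SelectSingleNode("//h1[text()='Archives']");

        List<string> urls = new List<string>();
        if (title == null || title.NextSibling == null)
            return urls; // theme not recognised, nothing to report

        foreach (XmlNode link in title.NextSibling.SelectNodes("li/a"))
        {
            XmlAttribute href = link.Attributes["href"];
            if (href != null) urls.Add(href.Value);
        }
        return urls;
    }
}
```

Because XmlDocument drops whitespace-only text nodes by default, NextSibling lands on the list element even when the markup is indented.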
The methods are:
- public ReturnObjs.MonthsArchive [] GetMonths( string cURL )
This method receives a URL that's the root of the blogger's area
(http://dotnetjunkies.com/WebLog/kmotion/). It returns the months the user has
been blogging for and the respective URLs for those months.
- public ReturnObjs.MonthPosts [] GetMonthPosts( string cURL )
This method receives a URL that's the location of the blogs for the given month
(http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03.aspx). It returns a
list of all the posts the user has made that month and the respective URLs for
those posts.
- public XmlNode GetPost( string cURL )
This method receives a URL that's the location of the post
(http://dotnetjunkies.com/WebLog/kmotion/archive/2005/03/01/57618.aspx). It
returns a valid XHTML node of the post.
- public string GetAllPosts( string cURL )
*PLEASE NOTE* - I have only used this once and I have disabled it but left the
code in place. I thought it was a bit irresponsible, as it could put a heavy
load on the DNJ servers if the number of posts the user has made is quite large.
This method receives a URL that's the root of the blogger's area
(http://dotnetjunkies.com/WebLog/kmotion/). It returns a valid XHTML string of
all the posts made by that user.
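One way to make something like GetAllPosts kinder to the server would be to pause between requests. This is only a sketch of the idea, not what the application does: PoliteFetcher, its fetch delegate and the delayMs value are all hypothetical, and the delegate stands in for a call such as GetPost so the example runs without a network.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper, not part of the original DNJScrape code.
public static class PoliteFetcher
{
    // Calls fetch for each URL in order, sleeping delayMs between consecutive calls.
    public static List<string> FetchAll(IEnumerable<string> urls, Func<string, string> fetch, int delayMs)
    {
        List<string> pages = new List<string>();
        bool first = true;
        foreach (string url in urls)
        {
            // No pause before the first request, only between requests
            if (!first && delayMs > 0) Thread.Sleep(delayMs);
            first = false;
            pages.Add(fetch(url));
        }
        return pages;
    }
}
```

With a real fetcher, a delay of a second or so per post would spread a large archive over minutes instead of hitting the server in one burst.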
These are "DNJScrape.ReturnObjs.MonthsArchive" and "DNJScrape.ReturnObjs.MonthPosts",
two very simple classes to handle results from the Web Service.
using System;
namespace DNJScrape.ReturnObjs
{
public class MonthsArchive
{
public MonthsArchive(){ }
public string cMonth;
public string cUrl;
}
public class MonthPosts
{
public MonthPosts() {}
public string cPostTitle;
public string cUrl;
}
}
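The comment in GetMonths mentions that some string / regexp work on cMonth could give the number of posts. Here is a minimal sketch of that, assuming the archive link text ends with the count in brackets, e.g. "March, 2005 (3)". That format is an assumption about the theme, and MonthLabel is a hypothetical helper.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original DNJScrape code.
public static class MonthLabel
{
    // Returns the trailing "(n)" count from the label, or -1 when there is none.
    public static int PostCount(string cMonth)
    {
        // Match digits inside brackets at the very end of the text
        Match m = Regex.Match(cMonth, @"\((\d+)\)\s*$");
        return m.Success ? int.Parse(m.Groups[1].Value) : -1;
    }
}
```

Returning -1 rather than throwing keeps the scraper going on themes that do not show a count at all.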
And Where?
- Demo - DNJScraper is a simple ASP.NET application I used to wrap the web services.
- Code - DNJScrape.zip is the complete source for the application.
So What?
As it happens, it also works for more .Text blog engines. Not too surprising really,
but I don't think it will work on all. It depends on the theme
chosen by the blogger.
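Since the markup changes with the theme, supporting another theme mostly means adding another XPath to try. A minimal sketch of that idea, using the two post selectors from the code above; ThemeFallback is a hypothetical helper and not part of the download.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the original DNJScrape code.
public static class ThemeFallback
{
    // Returns the first node matched by any of the XPath expressions, tried in order, or null.
    public static XmlNode SelectFirst(XmlNode root, params string[] xpaths)
    {
        foreach (string xpath in xpaths)
        {
            XmlNode hit = root.SelectSingleNode(xpath);
            if (hit != null) return hit;
        }
        return null; // no known theme matched
    }
}
```

A call such as ThemeFallback.SelectFirst(oXHTML, "//div[@class='post']", "//div[@class='singlepost']") would then cover both themes, and a third theme is one more string in the list.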
Have fun
Andy
Now playing: Metallica - Wherever I May Roam