posted on Wednesday, December 15, 2004 5:57 PM
by
jdixon
Limitation in SQL's XML Support
I managed to outsmart myself again....
I've been working on an application that processes large XML files. One of the routines involves passing the XML to a stored procedure, which then inserts certain parts of the XML into various tables. This is done via sp_xml_preparedocument and openxml. This is fairly trivial and works quite well.
Now, this system processes a lot of XML. Some of the tables involved have reached 15+ million rows, and insert performance has decreased to a point where I had to take action. I decided that, in addition to normal SQL tuning, I would break this task into two parts. First, the XML would be placed, unparsed, into a staging table. A separate process would take the data from this table, perform the proper parsing via openxml, and insert the resulting data into the various destination tables. This separation of work would allow the main process to operate more quickly.
I created the staging table and tested the system performance. The system ran 400% faster. Cool.
Next, I had to write the process that would retrieve the data from this staging table and insert it into the destination tables. This is when I discovered my mistake.
The XML is stored in NTEXT fields in the staging table. I needed to read this into a local variable, so that I could pass it to the sp_xml_preparedocument stored procedure. This can't be done, because local variables can't be declared as NTEXT. I was stuck.
I found a good article on the net that explained a workaround. However, in the article, the author states that he has used this workaround on documents that were “over 36,000 bytes in size.” My documents sometimes reach 20 MB in size. His method doesn't work on documents over about 300K. (You should check it out, though. It's a good read.) I had to solve my own problem.
So, as I was running out of time, I decided to fall back to the one solution that I knew would work. I wrote a .NET process that runs on a background thread. This process reads the XML from the staging table (via an FOR XML EXPLICIT query), parses this data via a couple of calls to the XmlDocument's SelectSingleNode method, and then uses another stored procedure to insert the data into the destination tables. This works, but I don't like it. I would rather have all of this data manipulation occur on the SQL Server, instead of being passed back and forth between the SQL Server and the Application Server. If anyone wants to send some alternatives my way, I would be most appreciative.
I can't wait for SQL 2005. Maybe this kind of thing will go away....