Archive of UserLand's first discussion group, started October 5, 1998.

Re: Processing Word as Input?

Author:Paul Howson
Posted:4/20/1999; 10:40:52 PM
Topic:Processing Word as Input?
Msg #:5145 (In response to 5128)

From what I've read, the XML support in Office 2000 consists of XML "islands" embedded in HTML to preserve extended formatting information. So it is unlikely to be of much use here.

I believe the two fruitful avenues might be:

1. Getting Word to save as HTML (supposedly the default for Office 2000), then parsing that "as XML", then translating it using xmltr or a similar tool. This relies on the HTML being well-formed XML, which might not be the case. With this approach you would be limited to HTML's structural vocabulary, e.g. headings 1 to 6, bold, italic, etc. Nevertheless, in many cases this would be sufficient to encode the structure of a document.

2. Using RTF. RTF is not as bad as you might think, and it is backward compatible a long way (e.g. to Word 5 on the Mac, which many people still use). RTF looks frightening, but it's actually quite simple: it's easy to parse, and easy to throw away the stuff you don't want. The prototype I had working in Frontier was slow only because it was doing C-style getchar() parsing of the RTF file in script. With a ucmd or DLL to do the basic parsing (in C), it could be very fast. I'll post a description of this strategy in a separate message soon.
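
To illustrate the "easy to parse, easy to throw away" point about RTF, here is a minimal sketch in Python. It is not the Frontier prototype mentioned above. The function name rtf_to_text and the list of skipped groups are assumptions made up for this example, and it only handles a small subset of RTF (no Unicode escapes, no tables, and it assumes Windows-1252 for \'hh escapes).

```python
import re

# Groups whose contents are not body text (a small, non-exhaustive set,
# chosen for illustration).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?"   # control word, optional number, optional delimiter space
    r"|\\'([0-9a-fA-F]{2})"   # hex-escaped character, e.g. \'e9
    r"|\\([^a-z'])"           # control symbol such as \\ \{ \} \*
    r"|([{}])"                # group open / close
    r"|([^\\{}]+)"            # run of plain text
)

def rtf_to_text(rtf):
    """Return the body text of an RTF string, discarding formatting."""
    out = []
    stack = []        # saved 'skipping' state of the enclosing groups
    skipping = False  # True while inside a group we want to throw away
    for m in TOKEN.finditer(rtf):
        word, _param, hexch, symbol, brace, text = m.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif word is not None:
            if word in SKIP_DESTINATIONS:
                skipping = True
            elif not skipping and word in ("par", "line"):
                out.append("\n")
            elif not skipping and word == "tab":
                out.append("\t")
            # every other control word is formatting and is ignored
        elif symbol is not None:
            if symbol == "*":
                skipping = True          # \* marks an ignorable destination
            elif not skipping and symbol in "\\{}":
                out.append(symbol)       # escaped literal character
        elif hexch is not None:
            if not skipping:
                out.append(bytes.fromhex(hexch).decode("cp1252", errors="replace"))
        elif text is not None and not skipping:
            # raw line breaks in an RTF file are not significant
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)
```

Calling rtf_to_text on the contents of a file saved from Word as RTF should give back the plain text with paragraph breaks. To keep structure rather than discard it, the same loop could emit tags when it sees control words such as \b or \i instead of ignoring them.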




This page was archived on 6/13/2001; 4:49:26 PM.

© Copyright 1998-2001 UserLand Software, Inc.