Archive of UserLand's first discussion group, started October 5, 1998.

Re: Processing Word as Input?

Author: Paul Howson
Posted: 4/19/1999; 6:45:27 PM
Topic: Processing Word as Input?
Msg #: 5125 (In response to 5112)

This is a real need, especially for publishers. Word is the de facto standard authoring tool (like it or not). People who author material for print publishing mostly use Word (as a publishing person, my experience is that material comes to me in Word format).

A route that I've investigated is to use RTF. RTF is a text-only format. Its design is terrible, but it does operate according to defined rules. You can parse RTF and construct your own representation (according to your needs) of a Word document.
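To make the "defined rules" point concrete, here is a minimal sketch of an RTF tokenizer in Python. It is not the Frontier prototype and not a full RTF reader. It assumes the three basic lexical forms (groups in braces, control words with an optional numeric parameter, and control symbols), decodes `\'hh` escapes as Windows-1252, and ignores everything else in the spec. The names `TOKEN_RE` and `tokenize_rtf` are made up for illustration.

```python
import re

# One alternative per RTF lexical form. Names here are illustrative only.
TOKEN_RE = re.compile(
    r"(?P<open>\{)"                               # start of a group
    r"|(?P<close>\})"                             # end of a group
    r"|\\(?P<word>[a-zA-Z]+)(?P<param>-?\d+)? ?"  # control word, optional number
    r"|\\'(?P<hex>[0-9a-fA-F]{2})"                # \'hh escaped byte
    r"|\\(?P<symbol>[^a-zA-Z])"                   # control symbol such as \{ or \~
    r"|(?P<text>[^\\{}]+)"                        # run of plain text
)

def tokenize_rtf(source):
    """Return a list of (kind, value, param) tuples for an RTF string."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.group("open"):
            tokens.append(("open", None, None))
        elif m.group("close"):
            tokens.append(("close", None, None))
        elif m.group("word"):
            param = m.group("param")
            number = int(param) if param is not None else None
            tokens.append(("control", m.group("word"), number))
        elif m.group("hex"):
            # Assumption: the document uses the Windows-1252 code page.
            char = bytes.fromhex(m.group("hex")).decode("cp1252", errors="replace")
            tokens.append(("text", char, None))
        elif m.group("symbol"):
            symbol = m.group("symbol")
            if symbol in "{}\\":
                # Escaped literal brace or backslash.
                tokens.append(("text", symbol, None))
            else:
                tokens.append(("control", symbol, None))
        else:
            # Raw line breaks in the RTF source are not part of the text.
            text = m.group("text").replace("\r", "").replace("\n", "")
            if text:
                tokens.append(("text", text, None))
    return tokens
```

A caller would walk this token list, keeping a stack of formatting state that is pushed on "open" and popped on "close", and build whatever representation suits its needs.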

As a publisher, I want to retain information about paragraph styles and convert this to XML markup. I want to retain some of the direct formatting which users apply (e.g. bold and italic) and convert this to XML markup too. There's also usually a lot of garbage formatting that I want to throw away.
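As a rough illustration of that filtering step, the Python sketch below takes one already-parsed paragraph and emits XML that keeps the paragraph style plus bold and italic, dropping every other direct format. The input shape (a style name and a list of `(text, formats)` runs), the element names `para`, `b` and `i`, and the function name `paragraph_to_xml` are all assumptions made for the example, not anything Word or Frontier defines.

```python
import xml.etree.ElementTree as ET

# Direct formatting worth keeping, mapped to hypothetical element names.
# Anything not listed here is treated as garbage and dropped.
KEPT_FORMATS = {"bold": "b", "italic": "i"}

def paragraph_to_xml(style, runs):
    """Build a <para> element from a style name and (text, formats) runs."""
    para = ET.Element("para", {"style": style})
    last_child = None
    for text, formats in runs:
        tags = [KEPT_FORMATS[f] for f in sorted(formats) if f in KEPT_FORMATS]
        if not tags:
            # Plain text goes in para.text or in the tail of the previous child.
            if last_child is None:
                para.text = (para.text or "") + text
            else:
                last_child.tail = (last_child.tail or "") + text
            continue
        # Nest the kept formats, e.g. <b><i>text</i></b>.
        outer = ET.SubElement(para, tags[0])
        inner = outer
        for tag in tags[1:]:
            inner = ET.SubElement(inner, tag)
        inner.text = text
        last_child = outer
    return para

example = paragraph_to_xml(
    "Body Text",
    [("Hello ", {"font:Arial"}), ("world", {"bold", "size:13"}), ("!", set())],
)
print(ET.tostring(example, encoding="unicode"))
```

The whitelist is the design choice that matters here: formats are discarded unless explicitly named, so stray fonts, sizes and colours never reach the XML.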

A couple of years back I had a prototype in Frontier to parse RTF. The plan was to deduce the document structure from the way heading styles had been applied and then to re-encode the document as XML, so that it could be cross-media published with Frontier and xmltr. I never had the time to follow it through.
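The structure-deduction idea can be sketched in a few lines of Python. This is not the Frontier prototype, only an illustration under assumptions: paragraphs arrive as `(style, text)` pairs, headings use Word's built-in "Heading 1" to "Heading 9" style names, and the output element names (`document`, `section`, `title`, `para`) and the function name `build_outline` are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Matches Word's built-in heading style names, "Heading 1" to "Heading 9".
HEADING_RE = re.compile(r"^heading ([1-9])$", re.IGNORECASE)

def build_outline(paragraphs):
    """Nest (style, text) paragraphs into sections based on heading levels."""
    root = ET.Element("document")
    # Stack of (heading level, element); the root counts as level 0.
    stack = [(0, root)]
    for style, text in paragraphs:
        match = HEADING_RE.match(style.strip())
        if match:
            level = int(match.group(1))
            # Close every open section at this level or deeper.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            ET.SubElement(section, "title").text = text
            stack.append((level, section))
        else:
            # Body paragraphs attach to the innermost open section.
            ET.SubElement(stack[-1][1], "para").text = text
    return root

doc = build_outline([
    ("Heading 1", "Intro"),
    ("Normal", "First."),
    ("Heading 2", "Detail"),
    ("Normal", "More."),
    ("Heading 1", "Next"),
])
print(ET.tostring(doc, encoding="unicode"))
```

This only works when authors have applied heading styles consistently. A document whose "headings" are just bold body text would need extra heuristics.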

I still think this idea has merit.



This page was archived on 6/13/2001; 4:49:25 PM.

© Copyright 1998-2001 UserLand Software, Inc.