Converting MS Word Documents to Text


Version 2.1, February, 2000
by Titmouse


[ editor's note: regardless of whether you use MS Word or not, the basic information in this article applies to all word/text processor problems, and is worth reading if you want to avoid some very common problems. ]

Like many others, I prefer to do my serious writing in Microsoft Word. While it's not perfect, it has the tools and capabilities to do everything I need and most of what I want. But messages and stories posted to Usenet newsgroups should be in plain ASCII text and, as many have discovered, Word does not make this conversion gracefully. This is a source of much complication, confusion and irritation. After many trials and errors, I think I've finally figured most of it out, and this is a report on my conclusions.

The best and most successful tactic is problem avoidance. It is much easier to prevent things that will cause conversion problems than to fix them after the fact. Some of the information presented here is general and concerns basic formatting, but most of it deals with the specific issues of converting Word documents.

The discussion assumes you are working with Microsoft Word 97 or Word 2000. Although I have not made exhaustive tests with Word 2000, no changes to this document appear to be necessary for it. The two macros presented below also work without modification.


Initial Considerations

Most conversion problems stem from three sources: document formats, paragraph formats, and the extended character set. If you can avoid introducing problems, the conversion should go smoothly. Word is designed to defeat our purpose here, though, so we will have to force it to do what we want. Defaults for all three of these problem areas are wrong for text documents posted to the Internet.

Unless all you do is write stories for posting to the Internet, however, the changes you will need to make are not ones you will want for other kinds of documents. A secondary problem, then, is how to avoid wrecking Word for other purposes.

I thought originally that it would be relatively simple -- just create a new template designed for plain text documents with all the bells and whistles turned off, and there should be no problem. Not so. My second theory was that the issues could be resolved by creating an alternate version of the Normal.Dot template. Also not so. Rather than recount a long process of experimentation, I'll just report my conclusions.


The Document Format

Problems with both document and paragraph formats are most easily handled by creating a template that you can use whenever you start a new story or other longer document intended for the Internet. The template will have the correct font, margins and paragraph format. Then, if you remember to turn off Word's fancy text gizmos as explained later, you can write your document without creating problems for yourself.

To create a new template, launch Word and modify the blank document as follows. First, go to File, Page Setup and set the left and right margins. The top and bottom margins don't matter, as they are ignored when the document is saved as text.

You'll be using a fixed width font, so the line length will determine the number of characters per line. I use and recommend 55 characters per line (as in this document) and strongly recommend that you not exceed 60 characters per line. When you post to the Internet, your message will be handled and read by a wide variety of programs. If your line length is too long, one or more of them may force an early line wrap. You won't know about it until your story arrives on the newsgroup with alternating long and short lines. I've never seen this problem occur with 55 character lines, however.
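
To see what a 55-character limit does to a paragraph outside Word, here is a minimal Python sketch using the standard textwrap module. It is only an illustration, not part of the Word procedure; the function name wrap_paragraph and the sample sentence are made up for the example.

```python
import textwrap

def wrap_paragraph(text, width=55):
    """Re-wrap one paragraph so that no line exceeds `width` characters."""
    # textwrap.fill collapses runs of whitespace and breaks lines at spaces
    return textwrap.fill(text, width=width)

sample = ("When you post to the Internet, your message will be "
          "handled and read by a wide variety of programs.")
wrapped = wrap_paragraph(sample)
print(wrapped)
```

Every line of the result is 55 characters or shorter, which is the property you want your posted text to have.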

How to set the margins depends on the font size. Word's default is 10-point type, but I recommend 12-point type, which produces 10 characters per inch in fixed width fonts. In this case, you'll need a 5.5-inch line length. Assuming paper that is 8.5 inches wide, any combination of left and right margins that totals three inches will work. I use a left margin of one inch and a right margin of two inches, but 1.25 and 1.75 work just as well.

If you insist on 10-point type, which produces 12 characters per inch, you'll need a 4.5-inch line length, so the margins should total four inches. That's actually 54 characters per line, if you're paying close attention, but near enough to 55.

METRIC NOTE: If you use the common alternative standard of A4 paper and metric measurements, the above recommendations translate to left and right margins of 33mm, assuming the font is Courier New at 12 points. This produces 56 characters per line.
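
The margin arithmetic above can be checked with a few lines of Python. This is a sketch under stated assumptions: it takes the character width of a Courier-style font to be 0.6 of the point size (which gives the 10 characters per inch quoted for 12-point type), and it assumes paper widths of 8.5 inches and 210mm. The function name chars_per_line and its parameters are invented for the illustration.

```python
import math

def chars_per_line(paper_width, left_margin, right_margin,
                   point_size=12, units_per_inch=1.0):
    """Estimate how many fixed-width characters fit between the margins.

    Assumes a Courier-style font whose character width is 0.6 of the
    point size, that is, 120 / point_size characters per inch.
    """
    line_inches = (paper_width - left_margin - right_margin) / units_per_inch
    chars_per_inch = 120 / point_size
    return math.floor(line_inches * chars_per_inch)

letter_12 = chars_per_line(8.5, 1, 2)              # 8.5-inch paper, 12 point
letter_10 = chars_per_line(8.5, 2, 2, 10)          # 8.5-inch paper, 10 point
a4_12 = chars_per_line(210, 33, 33, 12, 25.4)      # A4 in millimetres, 12 point
print(letter_12, letter_10, a4_12)                 # prints: 55 54 56
```

If you use a different paper size or margins, change the arguments and confirm the result stays at or below 60.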

All the other settings on the Page Setup dialog should remain the same as usual, so just click OK to set the margins. If you prefer to work in Page Layout View (rather than Normal View), set it now by selecting View, Page Layout. Assuming you plan to use 12-point type, click on the Zoom control near the right edge of the Standard toolbar (the one that begins with the New, Open and Save icons) and set the Zoom to 75%. If you stick with 10-point type, skip this last step.

If your standard setup includes headers or footers, eliminate them from this document. Go to View, Headers and Footers and delete anything in either of them. Otherwise, this information will appear in the final text version.


Font and Paragraph Format

Now, press Ctrl-A (or use Edit, Select All) to select the entire document. While the document may appear blank, it contains a paragraph marker. In Word, the paragraph marker is much more than an end-of-line character; a great deal of formatting is stored with it. If you don't include that initial paragraph marker in your changes, the defaults will remain and return to bite you later.

With the entire document selected, change the font to Courier New, 12 point. If you have another fixed-width font that you prefer, you can use it, since font information will not be saved in the final text version.

Now, with the entire document still selected, go to Format, Paragraph. Make sure that Alignment is set to Left, Indentation and Spacing are zeroed, and the Outline level is set to Body Text. The most vital setting is for Special. The default is First Line with a half-inch indent. Set this to (none).

This last setting causes problems for many users. Although the First Line setting will indent the first line of paragraphs, no tab or other character is actually placed in the document to cause it. Instead, the setting is stored in the paragraph marker and disappears when converted to text, which is why you see a lot of documents where the indent appears to have been lost. In fact, it was never really there in the first place.

The final option, Line Spacing, should be set to single. Don't worry about tab settings. Click OK to implement the paragraph format.

Now, we're ready to save our new template. Select File, Save As. Give your template a name -- I call mine 'Text' -- and change the type to 'Document Template.' Word will automatically place it in your template folder. Click the Save button.


Sticking with ASCII

Now for the thorniest problem, which is Word's insistence on putting extended character codes in documents and leaving them there even when you convert them to text. A little explanation is needed here, although some will already be familiar with this information.

The Usenet standard for text-oriented newsgroups calls for plain ASCII text. ASCII (American Standard Code for Information Interchange) predates widespread computer use and is most closely associated with Teletype machines. It is a seven-bit coding scheme, since seven bits provide 128 numbers (0-127). At the time, that seemed sufficient to represent the 52 capital and lowercase letters, the 10 digits, common punctuation symbols, and various control codes for line feeds, carriage returns, tabs, form feeds and so on.

Binary computers, though, use powers of two, most famously the eight-bit byte. ASCII coding fit neatly into a byte, with one bit left over which was initially ignored. That didn't last, of course, and several schemes evolved for extending the character set by using that spare bit to provide an additional 128 codes (128-255). The most popular of these today is the family Microsoft calls ANSI (after the American National Standards Institute), in which the first 128 codes correspond to ASCII. What the upper 128 represent, at least in the Microsoft world, depends on context, including language, font and software.

Here's the problem. When Word converts a document to text, it uses ANSI, not ASCII. Extended character codes above 127 remain in the text. What shows up on the screen -- letters from other languages, math symbols, and little black boxes for anything the software can't display -- depends partly on which flavor of conversion you used but mostly on the software used to read it.

There is no cure within Word; your only choice is prevention.
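
To make the seven-bit limit concrete, the following Python sketch looks up the byte values of a few characters Word likes to substitute. It assumes the ANSI variant in use is Windows code page 1252, the usual one on Western-language systems; other systems may differ. The names SUBSTITUTIONS and ansi_code are invented for the example.

```python
# A few characters Word substitutes automatically, keyed by description
SUBSTITUTIONS = {
    "left curly quote": "\u201c",
    "right curly quote": "\u201d",
    "ellipsis": "\u2026",
    "em dash": "\u2014",
    "copyright sign": "\u00a9",
}

def ansi_code(ch):
    """Return the single byte value of `ch` in Windows code page 1252."""
    return ch.encode("cp1252")[0]

codes = {name: ansi_code(ch) for name, ch in SUBSTITUTIONS.items()}
for name, code in codes.items():
    # Anything of 128 or more needs the eighth bit and is not ASCII
    print(f"{name}: {code} (fits in 7 bits: {code < 128})")
```

Every one of these lands above 127, while the straight quote you actually typed is code 34.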


Avoiding Extended Character Codes

While you can put extended codes in your documents intentionally -- nearly everything on the Insert menu will do so, for example -- the ones Word does for you without asking are the biggest source of problems. These mostly originate from the 'AutoFormat As You Type' tab of the AutoCorrect page of the Tools menu. The 'AutoCorrect' tab contributes a few additional gotchas, and the (plain) 'AutoFormat' tab can also cause problems.

The crux of the problem is that these settings are not stored in any template. They stay with the program, not the document, and they retain their settings until you change them explicitly.

Since you probably will want at least some of these features turned on for standard Word documents, there are only two choices. One is to turn them on and off manually depending on what you're working on; the other choice is to use a pair of macros to do the work for you. (You'll still have to remember to run the macros, of course.) I have included the two macros necessary and will explain how to implement and use them later.


Copying Your AutoCorrect Setup

Before making any changes, make a copy of your current setup. Start Word with a blank document, click on 'Tools' on the menu bar, and then choose 'AutoCorrect...' You will see the AutoCorrect page with four tabs: AutoCorrect, AutoFormat As You Type, AutoText, and AutoFormat.

The third of these, AutoText, provides boilerplate entries that require a manual step to insert in a document. If you use this facility in documents intended for publication in Usenet newsgroups, just make sure such entries don't contain non-ASCII characters. This caveat aside, AutoText is not relevant to our text-conversion problems.

We may change the other three tabs, though, so let's make a backup copy. Click on the AutoCorrect heading to make sure the dialog has the focus, then hold down the Alt key and press PrintScreen. This copies the dialog to the clipboard. Close the dialog and, in your blank document, press Ctrl + V (or click Edit, Paste) to insert a picture of the dialog in your document. Press Enter.

Now return to Tools, AutoCorrect. Select the 'AutoFormat As You Type' tab. Press Alt + PrintScreen again. Close the dialog and press Ctrl + V to insert a copy of this tab in your document. Press Enter.

Finally, return to Tools, AutoCorrect one more time and select the 'AutoFormat' tab. Copy it to the clipboard with Alt + PrintScreen, close the dialog, and insert it into your document with Ctrl + V. Now, save the document as 'AutoCorrect Settings' and print a copy for reference.


The AutoCorrect Tab

This tab is concerned with typing mistakes. In the top part are five checkbox options. I have four of the five turned on normally, omitting the second, 'Capitalize first letter of sentences.' In my experience, checking this box makes Word capitalize things I don't want it to. In any case, you can set the first four checkboxes according to your preferences. They don't create conversion problems.

The fifth checkbox controls the bottom half of this tab and can cause problems, however. In particular, it converts (c) and (r) to the Copyright and Registered symbols and three successive periods to the ellipsis symbol. These, of course, all require extended codes. In preparing a plain text document, you don't need to change any of the replacements. Just uncheck the 'Replace text as you type' checkbox, and Word will ignore the list. This also means it will not correct the many common typographical errors on the list, however, so a spelling check becomes more important than ever.

There is an alternative, which is what I've chosen to do. I deleted the first several entries in the table -- the ones that convert smilies as well as the copyright and registered symbols. Now I can leave the autocorrection of common typos turned on without danger of substituting an illegal character. It's something of an awkward choice, but personally I'd rather catch the typos.


The AutoFormat-As-You-Type Tab

This is the bad boy, responsible for most of the problems experienced in converting Word documents to plain text. For standard documents, I have everything checked except for hyperlinks, the third from last. For text documents, I turn everything off.

As you can see, the middle section converts straight quotes to curly quotes, ordinals to superscript, common fractions to their graphic equivalents, dual hyphens to real dashes, and *bold* and _underlining_ to actual bold and underlining. Some of these use extended character codes and the others are formatting that a text file cannot hold, so most of them don't convert properly to text.
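
If some of these characters have already crept into a piece of text, the substitutions can be reversed outside Word. The Python sketch below is one possible approach, not a feature of Word: the table ASCII_EQUIVALENTS and the function to_plain_ascii are invented for the illustration, and the table covers only the characters discussed in this article.

```python
# Plain ASCII stand-ins for the automatic substitutions described above
ASCII_EQUIVALENTS = {
    "\u2018": "'", "\u2019": "'",        # curly single quotes
    "\u201c": '"', "\u201d": '"',        # curly double quotes
    "\u2013": "--", "\u2014": "--",      # en and em dashes
    "\u2026": "...",                     # ellipsis
    "\u00a9": "(c)", "\u00ae": "(r)",    # copyright and registered signs
    "\u00bc": "1/4", "\u00bd": "1/2", "\u00be": "3/4",  # common fractions
}

def to_plain_ascii(text):
    """Replace known typographic characters, then list anything left over."""
    for fancy, plain in ASCII_EQUIVALENTS.items():
        text = text.replace(fancy, plain)
    # Characters with codes above 127 that the table did not cover
    leftovers = sorted({ch for ch in text if ord(ch) > 127})
    return text, leftovers

fixed, leftovers = to_plain_ascii("\u201cWait\u2026\u201d she said \u2014 twice.")
print(fixed)
print(leftovers)
```

Check the result by eye afterwards. A fraction that directly follows a digit, for example, will run into it and needs a space added by hand.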


The AutoFormat Tab

The settings on this tab are almost identical with those on the previous one. Where the first makes its changes as you type, the changes on this tab are made only if you tell Word -- by selecting Format, AutoFormat -- to perform them. If you don't do that, you can leave these settings alone. Since I change my settings via macro, it's just as easy to switch them off and on.


Macros to Turn Text Settings On and Off

As you can see, considerable labor is required to change these settings manually, especially if you switch between document types frequently. As mentioned earlier, you can't solve this problem by putting the desired settings in the Text.dot template. You can't even fix it by creating an alternate version of Normal.dot, the template Word always uses. The AutoCorrect settings are independent of the template.

Instead, the simplest way to switch is with a pair of macros. You could record them yourself if you know how, but I've provided copies here and directions on how to create them.

First, if you haven't already, save this document and load it into Word. Find this location again, and follow the steps below. Be sure you have a copy of your original AutoCorrect settings before proceeding.

As provided, the macros switch almost everything off for text documents and back on for others. You may prefer a different setup. It's easy to change. The lines in the macro correspond exactly to the checkboxes on the three AutoCorrect tabs, with True meaning checked and False meaning unchecked. Using the copy of your setup as a guide, change the Text_OFF settings in the provided example from True to False or vice versa.

The Text_OFF settings should correspond to your current, preferred setup for normal documents. I recommend that you use the suggested settings for the Options section of Text_ON, but the first four entries in the AutoCorrect section can be changed as desired. The fifth entry under AutoCorrect toggles the Replace Text feature off and on. If you delete the problem-causing entries from the table, you can leave this alone. Just delete the line from both macros and it won't be changed by either of them.

Keep a copy of this document with your preferred settings. If you decide later to modify them, it's easy to change the macros. First, edit the text to reflect your new preferences. Then go to the Macros dialog (Alt + F8), delete the old versions, and then recreate them using your modified versions.


Creating the Macros

In the section immediately below labeled TEXT_ON MACRO, highlight and copy the lines between START and STOP. The shortcut for Copy is Ctrl + C.




TEXT_ON MACRO

START

' AutoCorrect tab: the first four entries are personal preference
With AutoCorrect
  .CorrectInitialCaps = True
  .CorrectSentenceCaps = False
  .CorrectDays = True
  .CorrectCapsLock = True
  ' Replace text as you type: off, so (c), (r) and ... stay as typed
  .ReplaceText = False
End With
With Options
  ' AutoFormat As You Type tab: everything off for plain text
  .AutoFormatAsYouTypeApplyHeadings = False
  .AutoFormatAsYouTypeApplyBorders = False
  .AutoFormatAsYouTypeApplyBulletedLists = False
  .AutoFormatAsYouTypeApplyNumberedLists = False
  .AutoFormatAsYouTypeApplyTables = False
  .AutoFormatAsYouTypeReplaceQuotes = False
  .AutoFormatAsYouTypeReplaceSymbols = False
  .AutoFormatAsYouTypeReplaceOrdinals = False
  .AutoFormatAsYouTypeReplaceFractions = False
  .AutoFormatAsYouTypeReplacePlainTextEmphasis = False
  .AutoFormatAsYouTypeReplaceHyperlinks = False
  .AutoFormatAsYouTypeFormatListItemBeginning = False
  .AutoFormatAsYouTypeDefineStyles = False
  ' AutoFormat tab: everything off for plain text
  .AutoFormatApplyHeadings = False
  .AutoFormatApplyLists = False
  .AutoFormatApplyBulletedLists = False
  .AutoFormatApplyOtherParas = False
  .AutoFormatReplaceQuotes = False
  .AutoFormatReplaceSymbols = False
  .AutoFormatReplaceOrdinals = False
  .AutoFormatReplaceFractions = False
  .AutoFormatReplacePlainTextEmphasis = False
  .AutoFormatReplaceHyperlinks = False
  .AutoFormatPreserveStyles = False
  .AutoFormatPlainTextWordMail = False
End With

STOP



Now, press Alt + F8. This brings up the Macros dialog. If there's anything in the top box, Macro Name, press the Delete key to clear it. Type Text_ON, then click the Create button.

This will open the Visual Basic Editor. In the right pane, you should see the cursor on a blank line. Above it will be several lines beginning with 'Sub Text_ON.' Immediately below will be a line that says 'End Sub.' Press Ctrl + V (or use Edit, Paste) to insert the text you copied. Click the X in the upper right corner, which will close the Visual Basic Editor and return you to this document.

Now, repeat the process to create a Text_OFF macro. Begin by copying the following lines between START and STOP as before:




TEXT_OFF MACRO

START

' AutoCorrect tab: the first four entries are personal preference
With AutoCorrect
  .CorrectInitialCaps = True
  .CorrectSentenceCaps = False
  .CorrectDays = True
  .CorrectCapsLock = True
  ' Replace text as you type: back on for normal documents
  .ReplaceText = True
End With
With Options
  ' AutoFormat As You Type tab: edit to match your usual setup
  .AutoFormatAsYouTypeApplyHeadings = True
  .AutoFormatAsYouTypeApplyBorders = True
  .AutoFormatAsYouTypeApplyBulletedLists = True
  .AutoFormatAsYouTypeApplyNumberedLists = True
  .AutoFormatAsYouTypeApplyTables = True
  .AutoFormatAsYouTypeReplaceQuotes = True
  .AutoFormatAsYouTypeReplaceSymbols = True
  .AutoFormatAsYouTypeReplaceOrdinals = True
  .AutoFormatAsYouTypeReplaceFractions = True
  .AutoFormatAsYouTypeReplacePlainTextEmphasis = True
  ' Set the next line to False if you normally leave hyperlinks unchecked
  .AutoFormatAsYouTypeReplaceHyperlinks = True
  .AutoFormatAsYouTypeFormatListItemBeginning = True
  .AutoFormatAsYouTypeDefineStyles = True
  ' AutoFormat tab: edit to match your usual setup
  .AutoFormatApplyHeadings = True
  .AutoFormatApplyLists = True
  .AutoFormatApplyBulletedLists = True
  .AutoFormatApplyOtherParas = True
  .AutoFormatReplaceQuotes = True
  .AutoFormatReplaceSymbols = True
  .AutoFormatReplaceOrdinals = True
  .AutoFormatReplaceFractions = True
  .AutoFormatReplacePlainTextEmphasis = True
  .AutoFormatReplaceHyperlinks = True
  .AutoFormatPreserveStyles = True
  .AutoFormatPlainTextWordMail = True
End With

STOP



Once again, press Alt + F8 to bring up the Macros dialog. Press the Delete key to clear the Macro Name box, and type Text_OFF, then click the Create button.

The cursor will again be on a blank line below several lines beginning with 'Sub Text_OFF' and above a line that says 'End Sub.' Press Ctrl + V (or use Edit, Paste) to insert the text you copied. Click the X in the upper right corner to close the Visual Basic Editor and return to this document.

You should now have two macros, Text_ON and Text_OFF. To test them, press Alt + F8, and double-click the Text_ON macro (or click Text_ON and then the Run button). Go to the Tools, AutoCorrect dialog and check the 'AutoFormat As You Type' tab. Everything should be turned off. Now run the Text_OFF macro and check the dialog again. Everything should be switched back to your preferred settings.


Creating Your Text

So, with these tools in hand, you're ready to start a new project. To create a document, use File, New and select the Text template you created earlier. Before doing anything, run the Text_ON macro. You'll need to run the macro again each time you begin a new editing session and run the Text_OFF macro whenever you switch to another kind of document.

Now, all you have to do is to keep in mind the eventual goal. Mostly that means not doing things you know won't convert, such as Word styles, bulleted lists, section breaks, columns, and so on. Avoid bold, italics and underlining. If you need this kind of emphasis, follow the plain text conventions of indicating bold by preceding and following the text with asterisks like *this* and underlining or italic with underscores like _this_. With the AutoFormat features turned off, these will not be converted.

For titles, I recommend a simple block at the left margin, as in the following example.

Converting Word Documents to Text
By Titmouse
(C) August, 1999

You may wish to use capital letters for the actual title. For section headings, I recommend placing two blank lines before and one after. I've used this convention throughout this document.

If you want to underline a heading, do so with hyphens on a separate line beneath. Keep in mind, however, that if you do this in any font other than Courier (or some other monospaced font), you actually have no idea how many hyphens are needed unless you count the characters in the heading. Most fonts are proportional. Each character, that is, has a separate width, so that an 'm' and an 'i' take up different amounts of line space. With monospaced fonts like Courier, each character has the same width.
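
Counting the characters is easy to hand to a program. As a small illustration in Python (the function name underline_heading is made up for the example), this builds a hyphen rule exactly as long as the heading above it:

```python
def underline_heading(heading, mark="-"):
    """Return the heading followed by a rule of the same length."""
    return heading + "\n" + mark * len(heading)

underlined = underline_heading("Saving the Document as Text")
print(underlined)
```

The two lines match only when displayed in a monospaced font, for the reason given above.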

You also need to decide how you want to separate paragraphs in your text. There are two basic approaches. In one, paragraphs are not indented and an extra blank line separates them. In the other, paragraphs are indented with a tab or spaces and the extra line is omitted. Either of these is acceptable, but the first is preferred. Some software seems to strip out tabs and spaces.


Saving the Document as Text

While you're working on the story or article, save it as a normal Word document. You'll probably want to maintain an archive version in that format anyway. When you've finished the final editing for your story and are ready to post it, save a new copy as 'MS-DOS Text with Line Breaks.'

Then close the document in Word (or exit Word), double-click on your new document to load it into Notepad or Wordpad, and inspect it carefully for surprises. If you need to make corrections other than centering titles and headings with spaces, go back to your Word document to make them and then resave over your text version, always specifying 'MS-DOS Text with Line Breaks.'
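
Part of that inspection can be automated. The Python sketch below is a suggestion rather than part of the Word procedure: it flags lines longer than 60 characters, codes above 127, and tabs. The function name find_problems and the sample text are invented for the example.

```python
def find_problems(text, max_width=60):
    """Return (line_number, message) pairs for lines likely to cause trouble."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_width:
            problems.append((number, f"{len(line)} characters long"))
        for column, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                problems.append((number, f"non-ASCII code {ord(ch)} at column {column}"))
            elif ch == "\t":
                problems.append((number, f"tab at column {column}"))
    return problems

sample = "A clean line.\nA \u201ccurly\u201d line.\n" + "x" * 61
problems = find_problems(sample)
for number, message in problems:
    print(f"line {number}: {message}")
```

To check a saved file, read it with open(filename, encoding="latin-1") and pass the contents to find_problems. That encoding maps every byte to the code of the same number, so the codes reported are the byte values actually stored in the file.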

There is an alternative for those who use tab-indented paragraphs or spaces to provide formatting. If you save your final text version as 'MS-DOS Text with Layout,' Word converts tabs to spaces and generally preserves the visual layout. For reasons that escape me, an extension of 'asc' is used for such documents. You'll probably want to rename it with 'txt,' since the 'asc' extension probably won't be recognized. Be aware, though, that some software eliminates "extra" spaces. This is why block format is preferred.

When you're ready to post it, open the text version, copy the contents and paste into whatever software you're using to post with. This should work in all cases except for longer stories that exceed the limits of certain providers (AOL, most notoriously). If you have that problem, you'll need to go back to your original story and break it into segments that fit under the limits.
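
Breaking a long story into segments can also be sketched in Python. This assumes block-format paragraphs separated by one blank line, and the limit of 300 characters is only a stand-in for the demonstration; check what your own provider allows. The function name split_story is made up for the example.

```python
def split_story(text, max_chars):
    """Split text into parts of at most max_chars, breaking between paragraphs.

    Paragraphs are assumed to be separated by one blank line. A single
    paragraph longer than max_chars is kept whole in its own part.
    """
    parts, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current part
            parts.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts

story = "\n\n".join(f"Paragraph {n}. " + "word " * 20 for n in range(1, 7))
parts = split_story(story, max_chars=300)
print([len(part) for part in parts])
```

Because the breaks fall only between paragraphs, joining the parts with blank lines gives back the original text.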


Final Thoughts

Okay, that's more than enough. I hope I haven't left out anything significant or made any stupid mistakes. I'm sure wiser heads will let me know, if so. I'll repost this note periodically with accumulated corrections. A copy of the latest version will also be available on the FAQs pages (both web and ftp versions) at ASSTR.

After the original publication of this document, there was considerable discussion about various problems in converting existing documents to ASCII and correcting format problems in other people's documents. I included some ideas in the original version, but this seems to me to be a topic of sufficient complexity to require its own discussion. If there's enough interest, I would be willing to take it on.

Please note that if you want to e-mail me directly, the address is 'nitesweats |AT| aol.com' not the dummy address in the header.

Peace,
Titmouse



