Clean Source Files

When someone asks me to format an ebook for them, they always ask what kind of file they should send. I want to say, “Send a txt file with the italics tagged.” Being a realist, I know that won’t happen. Composing in a text editor is about as much fun as eating oatmeal at every meal for the rest of your life. Visually, text editors suck. Almost every writer I know is used to, and comfortable with, a manuscript looking a certain way on the screen. They have habits, they have quirks, they need their writing to look a certain way in order for the story to feel “right.” I get that–I’m the same way. I’m comfortable reading in a text editor, but I still don’t like the way it looks. So when people ask what kind of file to send, I tell them, “Whatever you have.” Most people send me a doc or docx file created in a word processor.

What does an ideal doc file look like? (Show/Hide is toggled so you can see the spaces and paragraph returns)

Your source file is just that, the source, the master file. It will be used to create other files which are formatted for a specific use. Whatever formatting you put into the source file will have to be removed before it can be formatted into an ebook.

To get the best looking, goof-free ebooks, you need to start with a clean source file. To clean an existing file, you need to:

  • Remove tabs
  • Remove headers and page numbering
  • Uncenter centered text
  • Tag scene or section breaks
  • Tag special layouts such as lists, block quotes, centered text, etc.
  • Remove extra paragraph returns
  • Remove page breaks
  • Remove line breaks
  • Remove columns
  • Remove text boxes
  • Remove extra spaces between words, sentences, and before and after paragraph returns
  • Tag special formatting such as italics, bolding and underlining
  • Get all the curly quotes turned in the right direction
  • Deal with special characters as needed

TABS: Innocuous little bastids that can play havoc in ebook files. They can cause skipped lines, compressed text, font size changes and other atrocities. Instead of using tabs to indent paragraphs, use a style sheet.

Use Find/Replace to get rid of tabs in MS Word: Type ^t in the FIND box and leave the REPLACE box blank. Hit Replace All and the tabs will disappear.
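If you would rather script this step than click through Word, the same idea can be sketched in a few lines of Python. This is only an illustration and it assumes you have already saved the manuscript as plain text; the function name remove_tabs is made up for the example.

```python
def remove_tabs(text: str) -> str:
    """Delete every tab character, like Find ^t / Replace (blank) in Word."""
    return text.replace("\t", "")


# Hypothetical sample: two paragraphs indented with tabs.
sample = "\tChapter One\n\tIt was a dark and stormy night."
print(remove_tabs(sample))
```

The paragraph returns are left alone; only the tabs go away, which is exactly what the Word version does.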

HEADERS/PAGE NUMBERS: There are no page numbers in ebooks. If you leave in headers and page numbers, they will float randomly throughout the text. Close any header/footer boxes and shut off the page numbering. Warning: hiding the display is NOT the same as turning it off completely. Make sure the headers and footers are gone.

CENTERED TEXT: Not a biggie, unless you’ve used tabs or the space bar to center text. In those cases you have to remove the tabs and spaces.

TAG SCENE OR SECTION BREAKS: In the next step we will remove extra paragraph returns, so do this step first. If the writer used paragraph returns rather than a pound sign, asterisks, two-headed goats or something to indicate a scene or section break, now is the time to find them. Tag them with a pound sign.
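For a plain-text copy of the manuscript, the tagging can be roughed out in Python. This sketch assumes each paragraph sits on its own line and that the writer marked a scene break with at least one empty line; both are assumptions about your file, and tag_scene_breaks and its blank_lines parameter are hypothetical names. If the writer also double-spaced between ordinary paragraphs, raise blank_lines so only the bigger gaps get tagged.

```python
import re


def tag_scene_breaks(text: str, blank_lines: int = 1) -> str:
    """Replace each run of `blank_lines` or more empty lines with a '#' line.

    Lines that hold only spaces or tabs count as empty.
    """
    # One newline ends the paragraph, then N or more (optionally blank-padded)
    # newlines make up the gap that the writer used as a scene break.
    pattern = r"\n(?:[ \t]*\n){%d,}" % blank_lines
    return re.sub(pattern, "\n#\n", text)


# Hypothetical sample: one blank line between the second and third paragraphs.
sample = "She left.\nHe stayed.\n\nThe next morning it rained."
print(tag_scene_breaks(sample))
```

Run this before stripping extra paragraph returns, for the same reason given above: once the returns are gone, the breaks are gone with them.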

TAG SPECIAL LAYOUTS: This might include block quotes, bullet lists, snippets of poetry–anything that doesn’t use the default paragraph style. I use tags to make them easy to find when I’m formatting an ebook. It doesn’t matter how you tag the text, as long as it is unique so you can run a search for it.

EXTRA PARAGRAPH RETURNS: These can play havoc in ebooks and cause blank pages or misaligned text. If you use Word to format your ebook, there are specific instances to use extra paragraph returns. I’ll cover that in another cheat sheet. For your source file, get rid of every extra return–double returns between paragraphs, between headers and body text, between pages. If the paragraph return is not at the end of text, make it go away.

Use FIND/REPLACE in MS Word to get rid of extra paragraph returns: Type ^p^p in the Find box and ^p in the Replace box. Keep clicking Replace All until it tells you it can’t find any more.
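Word makes you click Replace All over and over because each pass only turns a pair of returns into one. On a plain-text copy, a regular expression can collapse a run of any length in a single pass. A minimal sketch, assuming one paragraph per line; collapse_returns is a name invented for this example.

```python
import re


def collapse_returns(text: str) -> str:
    """Collapse any run of paragraph returns down to a single return."""
    # Normalize Windows line endings first so the pattern below sees them.
    text = text.replace("\r\n", "\n")
    return re.sub(r"\n{2,}", "\n", text)


# Hypothetical sample with double and quadruple returns.
sample = "Paragraph one.\n\nParagraph two.\n\n\n\nParagraph three."
print(collapse_returns(sample))
```

Make sure scene breaks are already tagged before running anything like this, or they will vanish along with the extra returns.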

REMOVE PAGE BREAKS: If you are using Word to create your ebook and you used the INSERT PAGE BREAK command, that’s okay. But if it becomes necessary to nuke your file in order to remove excess codes, the page breaks will disappear and you’ll have to reinsert them. You may as well get in the habit of not bothering with page breaks in source files.

REMOVE LINE BREAKS: While line breaks in a Word file translate fairly well into ebooks, they can go awry if you attempt to use them to micro-manage text alignment. Text “flows” on an ereader, plus users can adjust the size of text and line spacing, and that means line breaks can end up in odd places. There are ways to manipulate text without using line breaks.

REMOVE COLUMNS: See above about text flow. If you are converting Word files into ebooks, columns will turn into gobbledygook. One needs to use tables to create columns in HTML.

REMOVE TEXT BOXES: Text boxes will not work in ebooks. Take them out.

REMOVE EXTRA SPACES: Ereaders faux-justify text by spreading out the spaces between words (ick). Extra spaces between sentences can turn HUGE. Extra spaces at the beginning of a paragraph can make your text look wobbly. Extra spaces at the ends of paragraphs can cause blank pages in an ebook.

Use FIND/REPLACE in Word to get rid of extra spaces: Type two spaces in the Find box and one space in the Replace box, then keep clicking Replace All until Word says there are no more to find. For spaces at the beginning of paragraphs type ^p(space bar) in the Find box and ^p in the Replace box. Do a Replace All. To get rid of spaces at the ends of paragraphs, type (space bar)^p in the Find box and ^p in the Replace box. Do a Replace All.
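The same three passes can be sketched in Python for a plain-text copy of the file. This assumes one paragraph per line, so "beginning of a paragraph" and "beginning of a line" mean the same thing; clean_spaces is a hypothetical name.

```python
import re


def clean_spaces(text: str) -> str:
    """Collapse runs of spaces and trim spaces at both ends of every line."""
    # Pass 1: two or more spaces in a row become one space.
    text = re.sub(r" {2,}", " ", text)
    # Passes 2 and 3: drop spaces right after or right before a return.
    # MULTILINE makes ^ and $ match at the start and end of each line.
    text = re.sub(r"^ +| +$", "", text, flags=re.MULTILINE)
    return text


# Hypothetical sample with doubled, leading and trailing spaces.
sample = "  Hello  world.  \n Next   line. "
print(clean_spaces(sample))
```

Unlike the Word version, there is no need to repeat the first pass, because the pattern matches a run of any length at once.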

TAG SPECIAL FORMATTING: Special formatting includes italics, bolding and underlining.

If during composition you have used style sheets, turned off auto-correct and auto-formatting features, and haven’t done anything to your source file that introduces weird coding, and you plan to use a copy of the document file to create a Word file to submit to distributors, you can skip this step.

Everybody else needs to tag their special formatting because when you nuke this file to get rid of excess coding, or copy it into a text editor for HTML coding, the formatting will disappear. Tagging is easy.

If I am making a Word file, I use these tags:

  • -STARTI- for italics
  • -STARTB- for bold
  • -STARTU- for underline
  • -END- to close the tag

You don’t want to use HTML tags such as <i> or <b> in Word because they will seriously screw up the Find/Replace function. The all caps and hyphens make the tags unique and easy to find.

To tag special formatting with Find/Replace, leave the Find box empty, but ask it to look for italics, or bold, or underlining. In the Replace box type (for italics) -STARTI-^&-END- and do a Replace All. All of your italicized text will look like this: “-STARTI-now is the time for all good men-END-.”

After you have run the file through a text editor to get rid of excess coding, you can restore the special formatting with Find/Replace. Activate “wild cards” in the search options. Type -STARTI-*-END- in the Find box, then leave the Replace box blank but activate italics (under search options FONT). Do a Replace All and your italics are restored. Then use Find/Replace to remove the tags. Try this. It takes a lot more time to read these instructions than it does to tag and untag your text.
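If the tagged text is headed for HTML rather than back into Word, the tags can be swapped for real markup with a short script. This is a sketch under assumptions, not a description of any particular formatter's workflow: it assumes the -STARTI-, -STARTB-, -STARTU- and -END- tags listed above, that tagged spans are never nested inside each other (they all share the same -END- tag, so nesting would be ambiguous), and the name tags_to_html is made up for the example.

```python
import re

# Map the letter in each -START?- tag to an HTML element name.
TAGS = {"I": "i", "B": "b", "U": "u"}


def tags_to_html(text: str) -> str:
    """Turn -STARTI-...-END- style tags into <i>...</i> style markup."""

    def swap(match: re.Match) -> str:
        element = TAGS[match.group(1)]
        return f"<{element}>{match.group(2)}</{element}>"

    # Non-greedy (.*?) stops at the nearest -END-, so two tagged spans in
    # one paragraph are not merged. DOTALL lets a span cross a line break.
    return re.sub(r"-START([IBU])-(.*?)-END-", swap, text, flags=re.DOTALL)


sample = "-STARTI-now is the time-END- for -STARTB-all-END- good men"
print(tags_to_html(sample))
```

Because the tags are unique strings, a pattern like this will not trip over ordinary hyphens or capital letters in the story itself.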

A word about using underlining in ebooks. Live links in an ebook display as underlined text. So when you use underlines there is a small risk that readers could pause in their reading to see where a “link” leads and if it leads nowhere, it could annoy them. Just something to consider.

TURN THE CURLY QUOTES IN THE RIGHT DIRECTION: This is a fun one (aka a pain in my behind). In Find/Replace, type a double quote mark (") in the Find box and the same double quote mark (") in the Replace box, then do a Replace All. As long as Word’s smart quotes auto-format option is turned on, Word will very kindly turn all the curly quotes in the “right” direction. Mostly. If you end a sentence with an em dash, Word will make the ending quote mark a left double quote mark. Backward. You can use Find to search those out and manually reverse the backward quotes. Or you can wait until you do a final proofread on your ereader. Or, if you hire someone to format the ebook, you can give them a heads-up that you used this method and they will root the oddballs out for you. Same goes for apostrophes.
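The backward quote after an em dash is regular enough that a script can hunt for it in a plain-text copy. The sketch below rests on one assumption that will not hold for every manuscript: a left double quote that comes right after an em dash and right before a space, a return or the end of the file is really a closing quote. Dialogue that opens right after a dash is followed by a letter, so it is left alone. The name fix_quotes_after_dash is hypothetical, and the result still deserves a proofread.

```python
import re

# Unicode escapes for the three characters involved.
EM_DASH = "\u2014"
LEFT_QUOTE = "\u201c"
RIGHT_QUOTE = "\u201d"


def fix_quotes_after_dash(text: str) -> str:
    """Flip a left double quote that closes dialogue cut off by an em dash."""
    # The lookahead (?=\s|$) requires whitespace or end of text after the
    # quote mark, without consuming it.
    pattern = EM_DASH + LEFT_QUOTE + r"(?=\s|$)"
    return re.sub(pattern, EM_DASH + RIGHT_QUOTE, text)


# Hypothetical sample: interrupted dialogue with a backward closing quote.
sample = "\u201cBut I thought\u2014\u201c she said."
print(fix_quotes_after_dash(sample))
```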

SPECIAL CHARACTERS: Most word processors have a special characters map or the equivalent. Word will allow you to insert hundreds of special characters and they will print beautifully.

Ebooks are safest with ASCII characters. Anything that is not an ASCII character can show up in an ebook as a question mark, a weird little box or number, or some bizarro character that makes no sense at all. Characters outside of ASCII (and reserved characters–straight quotes, ampersands and less-than or greater-than marks) need to be inserted as named entities. If your text requires special characters, my best recommendation is to do a mock-up and convert it into a mobi or epub file and see how it looks on an actual ereader. If you end up with question marks or odd boxes, you’ll need to use named HTML entities or find a replacement.
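For a sense of what the entity swap looks like, here is a minimal Python sketch using the entity table that ships with the standard library. It assumes you run it on the story text before any HTML tags are added, since it would also convert the angle brackets of real tags. The name to_entities is made up, and falling back to a numeric reference when a character has no name in the table is a choice made for this example. Whatever it produces, the advice above still applies: check the result on an actual ereader.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace reserved and non-ASCII characters with HTML entities."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # Covers ", &, <, > and the named HTML 4 characters,
            # for example curly quotes and accented letters.
            out.append(f"&{codepoint2name[code]};")
        elif code > 127:
            # No name in the table: fall back to a numeric reference.
            out.append(f"&#{code};")
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical sample with accented letters and an ampersand.
sample = "caf\u00e9 & cr\u00e8me"
print(to_entities(sample))
```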

There you go, do all the above and you’ll have a nice, clean source file to work with. If you hire a formatter and send them a file that is clean and tagged, the formatter will think very kindly of you. It might even save you some money. If you format your own ebooks, you’ve greatly–immensely–reduced the risk of introducing formatting goofs into the finished product.

If you intend to format an ebook in Word, I recommend you make a copy of the clean source file, then load the copy into a text editor. That will strip out most of the obnoxious coding that can cause hiccups. Then all you do is open a new file in Word, apply your formatting style sheet, and copy the stripped file into the new file, and you are ready to format a masterpiece.

12 thoughts on “Clean Source Files”

  1. Thanks, Jaye – I never would have gotten all my formatting into a Word document without your instructions: it fought me tooth and nail.

    BTW, if you just copy and paste -STARTI-^&-END- from above into the Word Search and Replace text box, it adds an extra ^ – and doesn’t work.

    The learning curve is steep – Word 2011 for Mac is entirely different from 2004 (plus I went from Snow Leopard to Mt. Lion – and now it wants me to upgrade to Mavericks) – and they don’t give you a manual! I like having the Scrivener manual – the MS online help is meh.

    I so appreciate your help.


  4. Jaye – is there a way to find/replace a hard return? I’m working on a colleague’s manuscript and they used SHIFT-ENTER to format a lot of citations. I haven’t figured out what keystrokes I can use to find those.
