Pre-Production Check List: Cleaning Text

Hi, folks! Popping out of my mole hole for a breather. With so many writers getting ebooks ready for the holiday rush, it’s time for a quick refresher course in that most essential step: Cleaning up the text to get it ready for formatting.

Clean text is the key ingredient for a good-looking ebook that works the way it’s supposed to. Over the course of doing a LOT of ebooks (seems like a thousand this week alone, heh) I’ve come up with a little check list to take me through the steps.

  • CHECK: Copy the file
  • CHECK: Tag special formatting–italics, underlining, bolding
  • CHECK: Scan for special styling–quotes, song lyrics, poetry, letters, etc.–tag those instances

Tagging. Because I use several different programs when working on one project, I’ve come up with tags that transfer from program to program without giving search functions fits. It doesn’t matter what tags you use as long as they are easy to find and don’t contain any characters that cause program meltdowns.
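
To make the idea concrete, here is a small Python sketch of one possible tag scheme. The tags themselves (`-I-`, `-/I-` and so on) and the function name are invented for illustration; they are not the tags used for the projects described here. The only point is that plain-ASCII markers are easy to search for and can be swapped for real HTML with simple string replacement once the text is clean.

```python
# Hypothetical placeholder tags (an assumption, not the author's actual scheme).
# Any plain-ASCII markers that never occur in ordinary prose would work.
TAG_TO_HTML = {
    "-I-": "<i>", "-/I-": "</i>",
    "-B-": "<b>", "-/B-": "</b>",
    "-U-": "<u>", "-/U-": "</u>",
}


def tags_to_html(text: str) -> str:
    """Swap each placeholder tag for its HTML equivalent using plain string replacement."""
    for tag, html in TAG_TO_HTML.items():
        text = text.replace(tag, html)
    return text


print(tags_to_html("She -I-really-/I- meant it."))
```

Because the markers are ordinary characters, the same search works in Word, a text editor, or a script.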

  • CHECK: Kill any tabs (I do this in Word because it’s so easy–one global search for ^t and a global replace with nothing. All done.)
  • CHECK: Turn ‘soft’ returns into ‘hard’ returns. Soft returns do funny things when copy/pasted back and forth. Easier to deal with them now. (UPDATE: It was pointed out that I didn’t explain how to do this. Oops. In Word, it is very easy. Search for ^l (lower case L) and do a global replace with ^p.)
  • CHECK: Copy/paste the entire file into a text editor
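
For anyone who would rather script the tab and soft-return steps than run them in Word, here is a rough Python equivalent. It is a sketch, not the Word workflow itself: the function name is made up, and it assumes the soft returns reach the script as a vertical tab (U+000B, the character Word uses for a manual line break) or a Unicode line separator (U+2028). Depending on how the text was exported, they may already have been converted to ordinary line breaks.

```python
def strip_tabs_and_soft_returns(text: str) -> str:
    """Delete every tab and turn soft returns into ordinary newlines."""
    # Same effect as searching for ^t and replacing with nothing.
    text = text.replace("\t", "")
    # Assumption: soft returns show up as U+000B or U+2028 in the exported text.
    for soft_return in ("\x0b", "\u2028"):
        text = text.replace(soft_return, "\n")
    return text


sample = "\tFirst line\x0bsecond line"
print(strip_tabs_and_soft_returns(sample))
```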

Why a text editor? Unlike a word processor, a text editor doesn’t add anything to the file unless I specifically tell it to. No hidden codes, no surprises. I use Notepad++, freeware that is powerful, easy to learn, and makes formatting ebooks in html a breeze.

  • CHECK: Eliminate extra spaces. Between sentences, after paragraphs, before paragraphs, between words. All must go.
  • CHECK: Tag scene breaks. Blank lines show up in manuscripts, often for no reason at all. I want to make sure a blank line is supposed to be there, so I tag all deliberately blank lines.
  • CHECK: Eliminate extra paragraph returns. Don’t need them, don’t want them, make them all go away. I usually leave a blank line where there is supposed to be a page or section break. All the rest go.
  • CHECK: Clean up special formatting tags. Rewriting and revising often leaves artifacts–italicized blank spaces, for instance. Also, when formatting with html, styling should be within a paragraph. There are rules. Making sure all the special formatting follows the rules makes my life easier.
  • CHECK: Search for inappropriate paragraph breaks. This is a real problem with books that have been scanned from print and restored via OCR. I search for paragraphs that begin with lower case characters or end without punctuation, and that finds most of the inappropriate breaks. (the rest are found in the final proofread)
  • CHECK: Search for reserved characters: straight quotes, straight apostrophes, ampersands, greater-than and less-than brackets (< and >). These don’t always cause problems, but sometimes they do and that can cause interesting hiccups in an ebook. Easier to just turn them into named entities.
  • CHECK: Seek out non-ASCII characters and symbols. These will turn into question marks or bizarre symbols in the text editor. Ebook readers will not render them, so they must be turned into named entities.
  • CHECK: Standardize punctuation. Ebooks are real books, and require real printer’s punctuation. I go through and make sure em dashes are em dashes and not quickie writer shorthand, that ellipses all look the same, that apostrophes and quote marks are turned in the correct direction.
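
The extra-space, extra-return, and bad-paragraph-break steps can be roughed out in Python like this. It is only a sketch under some assumptions: the function names are invented, each paragraph sits on its own line, and deliberate scene breaks have already been tagged so every remaining blank line is safe to drop.

```python
import re

# Characters that can legitimately end a paragraph (an illustrative list, adjust to taste).
CLOSERS = ".!?:)\"'\u201d\u2019\u2026"


def tidy_whitespace(text: str) -> str:
    """Collapse runs of spaces, trim spaces at line ends, and drop blank lines."""
    cleaned = []
    for line in text.split("\n"):
        line = re.sub(r" {2,}", " ", line).strip(" ")
        if line:  # blank lines are dropped; scene breaks are assumed to be tagged already
            cleaned.append(line)
    return "\n".join(cleaned)


def suspicious_breaks(text: str) -> list:
    """Return 1-based line numbers of paragraphs that start lower case or end without punctuation."""
    flagged = []
    for number, para in enumerate(text.split("\n"), start=1):
        if not para:
            continue
        if para[0].islower() or para[-1] not in CLOSERS:
            flagged.append(number)
    return flagged


raw = "It was a dark  night. \n\n\nThe rain fell and\nkept falling."
clean = tidy_whitespace(raw)
print(suspicious_breaks(clean))
```

Expect false positives: chapter headings and other lines that end without punctuation get flagged too, so the result is a list of places to look at, not a list of things to fix automatically.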

That checklist takes care of almost everything. Even though it sounds like a lot, most of the steps can be taken care of in one or two Find/Replace operations. Most manuscripts I work on can be cleaned up in less than an hour.
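
As an example of how several steps collapse into one pass, here is a Python sketch that handles the reserved characters and the non-ASCII characters together. It leans on `codepoint2name` from Python’s standard `html.entities` module, which covers the HTML 4 named entities; anything non-ASCII without a name in that table falls back to a numeric character reference. The function name is made up, and the sketch assumes it runs once on raw text before any HTML tags have been added, since it would escape the tags (and any existing entities) as well.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace reserved and non-ASCII characters with HTML character references."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # Named entity: covers &, <, >, the straight double quote, and many symbols.
            out.append(f"&{codepoint2name[code]};")
        elif code > 127:
            # No HTML 4 name for this character, so use a numeric reference instead.
            out.append(f"&#{code};")
        else:
            out.append(ch)
    return "".join(out)


print(to_entities("Tom & Jerry"))
```

Note that the straight apostrophe is left alone here, because the HTML 4 table has no named entity for it; that one is better handled in the punctuation step, where it gets turned into a proper curly apostrophe.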

Even if you are formatting your ebook in a word processor or in Scrivener, this is good practice for every project. (Skip the steps about using named entities, but do check for non-ASCII characters.) It will clean out the junk the programs put in and go a long, long way toward making your ebook look professional.

Have fun! (I’m headed back to the mole hole)



11 thoughts on “Pre-Production Check List: Cleaning Text”

  1. Jaye,

    Great checklist! The regexes come in handy when checking for busted paragraphs. Out of curiosity, which non-ASCII characters are you having rendering problems with, and on which platform? We use UTF-8 characters for everything (even Asian-language books in Chinese, Korean, and Thai), and we haven’t seen any problems (the e-ink Kindles have weird spacing on the Thai, but it still renders). We’re not talking about bad Unicode values that show up as squares in a text editor, but rather characters outside what’s on the standard US keyboard. Hope we’re not screwing something up on something we didn’t test on…

    • I get the oddball occasionally, usually in Word files. The writer inserts a special character or symbol from a non-ASCII character set. They usually show up in foreign words or in symbols. I try to catch them in the clean-up, but if not, they show up in the proofread. I ALWAYS look for them in files that are the result of scanned and OCR’d material. OCR has a bad habit of inserting weird characters that look all right in a Word file, but don’t render in ebooks.

      Just so folks know, UTF-8 has a wide range of characters it will render perfectly. If you can’t find the character or symbol in the standard ASCII set, there is probably a named entity to use. w3schools has it all listed.

    • I have a Welsh character: “w” with a caret on top. It’s used for the first w in “Cwn Annwn”, the Hound of Annwn (the hounds of hell). I can enter it in Word or Scrivener as a special character, but it does not render properly in HTML and there is no named entity for it (curses). I make do without the caret.

      • Although the w with a ^ is probably rare, I wonder if the w3schools guys could come up with a character and assign a name to it. Email them and ask.

  2. I wish to god I knew what you are talking about, Jaye. I can read the words but they register as gobbledy-gook in my seriously underachieving brain.

  3. A great refresher on the all-important prep one must complete to create a good-looking e-book. I just would add one thing to the list — ensure that any italicized text that is also quoted (e.g., phone conversations to indicate the person not “in” the scene) includes the beginning and/or ending quotation mark (as appropriate) within whatever wrapper one uses to indicate that the text should be italicized. Not doing so may cause the e-book to render with a “floating quote mark” at the end of a line or the beginning of the next line.

  4. Hi Jon. You bring up a point about a weird little glitch I have noticed in Kindle formats. I have no idea why it does it–it makes no sense. But using regular html open and close tags for italics and bolding sometimes causes the text to hiccup. It causes an extra space or repeated text or a line jump. I have not noticed it in EPUB files, but it happens often enough in Kindle files that I don’t use them. I use span classes instead.

