Pre-Production Check List: Cleaning Text

Hi, folks! Popping out of my mole hole for a breather. With so many writers getting ebooks ready for the holiday rush, it’s time for a quick refresher course in that most essential step: Cleaning up the text to get it ready for formatting.

Clean text is the key ingredient for a good looking ebook that works the way it’s supposed to. Over the course of doing a LOT of ebooks (seems like a thousand this week alone, heh) I’ve come up with a little check list to take me through the steps.

  • CHECK: Copy the file
  • CHECK: Tag special formatting–italics, underlining, bolding
  • CHECK: Scan for special styling–quotes, song lyrics, poetry, letters, etc.–tag those instances

Tagging. Because I use several different programs when working on one project, I’ve come up with tags that transfer from program to program without giving search functions fits. It doesn’t matter what tags you use as long as they are easy to find and don’t contain any characters that cause program meltdowns.

  • CHECK: Kill any tabs (I do this in Word because it’s so easy–one global search for ^t and a global replace with nothing. All done.)
  • CHECK: Turn ‘soft’ returns into ‘hard’ returns. Soft returns do funny things when copy/pasted back and forth. Easier to deal with them now. (UPDATE: It was pointed out that I didn’t explain how to do this. Oops. In Word, it is very easy. Search for ^l (lower case L) and do a global replace with ^p.)
  • CHECK: Copy/paste the entire file into a text editor
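
If you would rather script the tab and soft-return steps than run them in Word, here is a minimal Python sketch. It assumes the manuscript text is already in a string. It also assumes Word's manual line break (^l) arrives as a vertical-tab character (U+000B) or a Unicode line separator (U+2028), so check your own file before trusting it. The function name is made up for this example.

```python
def kill_tabs_and_soft_returns(text: str) -> str:
    """Remove tabs and turn soft returns into hard returns."""
    # Same idea as searching for ^t and replacing with nothing.
    text = text.replace("\t", "")
    # Assumption: soft returns show up as vertical tab or U+2028.
    for soft_return in ("\x0b", "\u2028"):
        text = text.replace(soft_return, "\n")
    return text


sample = "\tFirst line\x0bsecond line\n\tNext paragraph"
print(kill_tabs_and_soft_returns(sample))
```

Running it on a copy of the file, not the original, keeps with the first item on the checklist.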

Why a text editor? Unlike a word processor, a text editor doesn’t add anything to the file unless I specifically tell it to. No hidden codes, no surprises. I use Notepad++, freeware that is powerful, easy to learn, and makes formatting ebooks in html a breeze.

  • CHECK: Eliminate extra spaces. Between sentences, after paragraphs, before paragraphs, between words. All must go.
  • CHECK: Tag scene breaks. Blank lines show up in manuscripts, often for no reason at all. I want to make sure a blank line is supposed to be there, so I tag all deliberately blank lines.
  • CHECK: Eliminate extra paragraph returns. Don’t need them, don’t want them, make them all go away. I usually leave a blank line where there is supposed to be a page or section break. All the rest go.
  • CHECK: Clean up special formatting tags. Rewriting and revising often leaves artifacts–italicized blank spaces, for instance. Also, when formatting with html, styling should be within a paragraph. There are rules. Making sure all the special formatting follows the rules makes my life easier.
  • CHECK: Search for inappropriate paragraph breaks. This is a real problem with books that have been scanned from print and restored via OCR. I search for paragraphs that begin with lower case characters or end without punctuation, and that finds most of the inappropriate breaks. (The rest turn up in the final proofread.)
  • CHECK: Search for reserved characters: straight quotes, straight apostrophes, ampersands, greater-than and less-than brackets. These don’t always cause problems, but sometimes they do and that can cause interesting hiccups in an ebook. Easier to just turn them into named entities.
  • CHECK: Seek out non-ASCII characters and symbols. These will turn into question marks or bizarre symbols in the text editor. Ebook readers will not render them, so they must be turned into named entities.
  • CHECK: Standardize punctuation. Ebooks are real books, and require real printer’s punctuation. I go through and make sure em dashes are em dashes and not quickie writer shorthand, that ellipses all look the same, that apostrophes and quote marks are turned in the correct direction.
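
The spacing and paragraph-break checks above can be roughed out in a few lines of Python. This is only a sketch under some assumptions: each paragraph sits on its own line, scene breaks have already been tagged (the `-SCENEBREAK-` tag is hypothetical, use your own), and the list of acceptable closing punctuation is my guess at what a novel needs.

```python
import re

SCENE_BREAK = "-SCENEBREAK-"  # hypothetical tag for deliberately blank lines
CLOSERS = ".!?:\"'\u201d\u2019\u2026)"  # characters allowed to end a paragraph


def clean_whitespace(text: str) -> str:
    """Remove extra spaces and extra paragraph returns."""
    # Collapse runs of spaces between words and sentences.
    text = re.sub(r" {2,}", " ", text)
    # Strip spaces before and after each paragraph.
    text = re.sub(r"^ +| +$", "", text, flags=re.MULTILINE)
    # Collapse repeated returns into a single return.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip("\n")


def suspicious_breaks(text: str) -> list[tuple[int, str]]:
    """List paragraphs that start lower case or end without punctuation."""
    flagged = []
    for number, para in enumerate(text.split("\n"), start=1):
        if not para or para == SCENE_BREAK:
            continue
        if para[0].islower() or para[-1] not in CLOSERS:
            flagged.append((number, para))
    return flagged


messy = "He  walked  in.  \n\n\n  and sat down.\nShe left"
tidy = clean_whitespace(messy)
for number, para in suspicious_breaks(tidy):
    print(number, para)
```

Chapter headings will be flagged too, since they usually end without punctuation. The output is a list to look over by eye, not a list of things to fix automatically.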

That checklist takes care of almost everything. Even though it sounds like a lot, most of the steps can be taken care of in one or two Find/Replace operations. Most manuscripts I work on can be cleaned up in less than an hour.
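
As one example of how a checklist step turns into a single operation, here is a hedged Python sketch of the reserved-character and non-ASCII steps. It uses the standard library's table of HTML entity names and falls back to a numeric entity when a character has no name. It leaves straight quotes and apostrophes alone, on the assumption that they get turned into curly ones during the punctuation step. It also assumes the text has no HTML tags or entities in it yet; run it twice and the ampersands get escaped twice.

```python
from html.entities import codepoint2name


def find_non_ascii(text: str) -> list[str]:
    """Return each distinct non-ASCII character found, sorted."""
    return sorted({ch for ch in text if ord(ch) > 127})


def to_named_entities(text: str) -> str:
    """Replace &, <, > and every non-ASCII character with an entity."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if ch in "&<>" or code > 127:
            name = codepoint2name.get(code)
            # Numeric entity when no name exists for this character.
            pieces.append(f"&{name};" if name else f"&#{code};")
        else:
            pieces.append(ch)
    return "".join(pieces)


line = "Tom & Jerry \u2014 caf\u00e9"
print(find_non_ascii(line))
print(to_named_entities(line))
```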

Even if you are formatting your ebook in a word processor or in Scrivener, this is good practice for every project. (Skip the steps about using named entities, but do check for non-ASCII characters.) It will clean out the junk the programs put in and go a long, long way toward making your ebook look professional.

Have fun! (I’m headed back to the mole hole)

 


Taking Some of the Pain Out Of Proofreading Your Ebook

Okay, everybody, raise your hand and wave it wildly if you love proofreading your ebook!

*crickets*

Yeah, me, too. Nonetheless, proofreading your ebook is essential. By that I mean, actually opening the ebook on your Kindle or Nook or iPad or phone or magic toaster, and going over it word by word, character by character. You’re not just looking at the text. Funny things can happen during conversion. You need to find the goofs and glitches and fix them.

If you don’t have an ereader? Download Calibre onto your computer. It’s free, the display is attractive, and while it doesn’t give you the exact display you’d find on a handheld ereader, it is good enough you should be able to spot the worst problems.

When I proofread an ebook I’ve produced, I load it onto one of my Kindles, run it through its paces (make sure all the links work, and that it responds properly to all the user-interface commands, and that the navigation guide is properly displayed), then I go through the text. I pull up the actual file on my computer and make corrections as I find them. No biggie.

Where the process gets sticky is when someone else proofreads. I prefer the author proof the text. Not just because it’s time-consuming and not much fun, but because the author is the most deeply invested in their work and the final proof is their opportunity to tweak and polish. Plus, they can actually see how graphical elements look in “real time” and see if text effects look good on the screen.

You can’t mark up an ebook. Oh, you can use bookmarks and notes, but it’s ridiculously difficult transferring those to another device, especially when working with “document” as opposed to “book” files. And because such things as “percentage of book read” and “location” depend on the device and the user font and spacing preferences, those are not reliable markers either. What I’ve been doing is asking writers to type out their notes with enough text for me to search and to note which chapter or section the goof/change is in. There are inherent problems with this method. One is typos (the writer’s and mine). Another is fatigue. If you’re tired, the temptation is there to think, Ah, a backward quote mark doesn’t really matter, or What difference does this not-quite right word make?

I stumbled onto a method with a book that required two proofreaders. The key is Square Brackets.

[ ]

In the books I produce, there is usually no reason to use square brackets. That makes them, for search purposes, unique characters. What I did was copy the ebook file(s) and turn them into text files. Windows and Mac users have a basic text editor included (under Accessories in Windows–mine is called Notepad). It will open a text file. So the writer opens the text file and while they are proofreading the ebook on their device or in Calibre, if they find a goof or want a change, they can mark up the file. All they have to do is enclose any changes in square brackets. It looks like this:

[Image: aProof1]

When the author is done, they send the entire file back to me. I open it side by side with the ebook file, search for square brackets and voila! I can see the author’s notes in “real time.” If there are text changes, I can copy the author’s exact text and paste it into the ebook file. No typos. (Watch those quote marks and apostrophes–make sure you don’t accidentally use straight quotes instead of curly.) Last night I keyed in the corrections from the above example. What would have been a two- to three-hour job using the old method took me about 30 minutes instead. That included going back through and double-checking my work. The writer reports that after she got over her shock over how weird the text file looks, the job was much, much easier on her end, too.
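
If the marked-up file is long, the bracket search can be scripted so you get every note in one list. A minimal Python sketch, assuming the proofread text is in a string and that square brackets appear nowhere else in the book; the function name and sample text are made up for illustration.

```python
import re

# Anything between [ and ] on a single line, with no nested brackets.
BRACKET_NOTE = re.compile(r"\[([^\[\]]*)\]")


def find_bracket_notes(text: str) -> list[tuple[int, str]]:
    """Return (line number, note) for every bracketed note in the text."""
    notes = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in BRACKET_NOTE.finditer(line):
            notes.append((line_number, match.group(1)))
    return notes


proofed = "She walked [ran] home.\nNo change here.\nHe said [delete this] hello [hi]."
for line_number, note in find_bracket_notes(proofed):
    print(line_number, note)
```

The line numbers refer to the text file, not to locations in the ebook, so the notes still have to be matched up by searching for the surrounding text.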

What about the rest of you? Has anyone else found simpler or more effective ways for proofreading ebooks when two or more people are involved in the process?

 

 

Scan, OCR and Restore BackList Books

This week I read a comment on a blog (can’t remember where–sorry) where a writer said she was putting off reissuing her backlist titles because she didn’t have accessible computer files for them and so she’d have to scan the actual books, run them through an OCR program and format them. She didn’t know how to do that.

I hear ya, sister. A few months ago I’d have nodded in agreement, and said, “Yep, too hard, too time-consuming, too expensive.” Now, however, having spent the past few months restoring nearly two dozen old paperback books from scans and turning them into ebooks, I know it’s NOT too hard, it IS time-consuming, and the cost can range from dollars per page (expensive) to FREE (DIY option).

(Another option is to retype the book, but quite frankly, folks, unless you are a super-typist with wrists of steel–which I most certainly am not–that is a daunting proposition.)

You know me. Somebody sez, “Can you do this?” and I reply, “How hard can it be?” Then I bumble and fumble around until I figure out how to do it. Then I come on here and am able to give you some tips that mean you can skip the bumbling and fumbling part. Unless you enjoy b&f. In that case, you can stop reading this post.

This is for the Do-It-Yourselfers.

SCANNING

Do a Google search for “scanning books” and the result will come up with thousands of services that will take your old books or manuscripts and turn them into pdf or doc files. Some services will scan the book without harming the binding, some will chop off the spine, destroying the book. Prices range from per-page costs to flat-rate. I haven’t used any of those services, so I can’t recommend any of them. You’ll have to do your own research.

You can also take your old books or manuscripts to a copy store such as Fed-Ex/Kinkos or a full-service office supply store such as Staples, and either do it yourself on their equipment or have them do it for you.

If you happen to own a scanner, you can do it at home. This is the insane option because quite frankly most home scanners are ridiculous beasts that take their sweet time. (I know this because I had to try it myself, just to see, and so scanned a nearly 300-page manuscript–easy on the hands, tough on the buttocks. It took hours!) If you are home-scanning actual pages from a paperback, you will have to play with the settings on your scanner because most are at their best scanning photos and that resolution is far too high to get good results. Best results are achieved if you copy the pages onto good quality 20# or 24# copy paper and then scan the copies.

However you choose to have your book/manuscript scanned, my recommendation is to have the scanner turn it into a pdf file. There are services and programs that will do the OCR conversion during the scan and produce a .doc, .docx or .rtf file for you. On the surface, it looks like a bargain. I think it’s dangerous because: 1) the file you receive will be huge and bloated and junked up with tons of coding that can severely mess up your ebook; 2) it will not save you any work during clean-up and in some ways it makes clean-up more of a chore; 3) it could give you a false sense of security that your file is cleaner than it actually is and your ebook could end up like so many that are on my Kindle right now, full of formatting errors and gibberish.

Here is a file that has been scanned and converted at the same time:

Here is a file that has been DIY scanned and turned into a .doc file:

It’s a big mess, too, but there are actually fewer dangerous formatting issues you will have to address. Awful as it looks, this example is easier to clean up and turn into an ebook than the first example. So save your money (and a few headaches) and run the pdf through the OCR program yourself.

OCR

Scanned PDF files are image files. Pictures of a page. In order to clean up and format the pages they must be converted into text. That’s where OCR comes in–Optical Character Recognition.

I found a nifty little program called FreeOCR. It’s a free program you download onto your computer. It’s a powerful program with a few bells and whistles–none of which I recommend you use. This is a case where the more you automate the process, the worse your results will be. There is no good substitute for the human eye and human instincts when it comes to restoring a document file. You’re better off in the long run doing a basic OCR conversion. That means, open the FreeOCR program, open a pdf file, then render it page by page. (Depending on the size of the file and the density of the type, a complete book will take between 20 minutes and an hour.)

The original scanned page is on the left, the OCR conversion is on the right. You can see what a mess it is. That’s because the OCR is very efficient. It turns not only images of text into text, it turns water stains, wrinkles, shadows, and debris embedded in the paper into text, too. If there are notes in the margin, it will try to turn that into text. A basic scan also inserts a hard paragraph return at the end of every line, gets rid of paragraph indents and destroys special formatting such as bold and italics (the first time I saw this I totally freaked out). Some things convert more cleanly than others. If you’re converting a decades-old paperback where the pages have yellowed and degraded, the conversion will be a HUGE mess.

But not a hopeless mess.

CLEAN UP

FreeOCR gives you an option of saving your rendered document as a Word file. You can do that and clean up your file in Word. There is a much easier, faster and more efficient way. Use a text editor (with a little eventual help from Word). I use Notepad++, a program you can download for free. Save your OCR rendering into the clipboard (or do a right click, Select All/Copy) and paste it into the text editor.

Whether you use Word or a text editor, this is the time-consuming part of the process. And there’s no help for it. If you want a good-looking ebook, you need to make your converted file squeaky clean. (Your other option is hiring someone to do it for you. BUT–and this is a huge but–you have to make sure the service you hire is NOT automating the process, but that there is instead an actual human being going through the book word by word and restoring the text. Those automated programs are powerful and they do a good job on some projects, but I have ebooks I have purchased on my Kindle right now that are unreadable messes due to those programs.)

I have learned a few things to make the job go faster and more efficiently.

  1. Save restoring the paragraphs for last. Take a look at the image of the OCR conversion in Word. I toggled on the Show/Hide feature so you can see how every line has a paragraph return. What you see is the layout from the printed book. That can help during clean up.
  2. Work off the actual pages. Either have the actual book in front of you or split your computer screen and have the pdf file open to the scanned pages. That way if the OCR mangled the text, you can retype a word or line from the actual copy instead of trying to guess what it is supposed to say. You can also tag special formatting such as italics as you go along.
  3. Use Find/Replace.

The text will be full of oddball characters (I call them bug shit). Things like degree symbols, floating quote marks, greater and less than characters, slashes, tildes. If something doesn’t belong in your text file–Find/Replace All gets rid of it. You can also use it to get rid of headers, footers and page numbers. Once you have the text cleaned up, you can use Find/Replace All to get rid of extra paragraph returns, restore the proper paragraphs and un-hyphenate any words that had been split in the printed version. (BONUS TIP: Before you get rid of the extra paragraph returns, use Find/Replace to add an extra space at the end of each line. That keeps words from being joined and makes it easier to find the hyphens you want to get rid of.)
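
For those comfortable with a little scripting, the last step (add a space at the end of each line, join the lines, un-hyphenate) might look something like this in Python. It is a sketch, not a finished tool. It assumes you have already gone through the file by hand and typed a marker at the end of every real paragraph; the `-PARA-` marker is hypothetical, so pick anything that never occurs in the book.

```python
import re

PARA_TAG = "-PARA-"  # hypothetical marker typed at the end of each real paragraph


def rejoin_lines(text: str) -> str:
    """Join OCR line fragments into paragraphs and mend split words."""
    paragraphs = []
    for chunk in text.split(PARA_TAG):
        # Bonus tip: a space at the end of every line keeps words apart.
        lines = [line.strip() + " " for line in chunk.splitlines() if line.strip()]
        joined = "".join(lines)
        # A hyphen followed by that added space marks a word split in print.
        joined = re.sub(r"(\w)- (\w)", r"\1\2", joined).strip()
        if joined:
            paragraphs.append(joined)
    return "\n".join(paragraphs)


ocr = "It was a dark and stor-\nmy night. The rain\nfell.-PARA-\nNext para-\ngraph here.-PARA-"
print(rejoin_lines(ocr))
```

One caution: a genuinely hyphenated word that happened to break at the end of a printed line (well-known, for instance) will lose its hyphen here, so those still need a human eye.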

So, yes, this is time-consuming, but it is not hard nor does it have to be expensive. It is definitely worthwhile to get your backlist back in circulation.

 

More About Ebook Formatting, Source Files and Tales of Tagging

First an apology for not answering every comment this week. On the “Source Files Update” post there were some great comments. People are coming up with solutions and solving problems. So go read the comments over there. One commenter in particular is hard at work on the subject of formatting ebooks from word processor files. I’ve been corresponding with William Ockham regarding his efforts to create a program that will make it easy to format a word processor file into a good-looking ebook. I’ve sent William some grotty files and he’s been problem solving. I’ve brought one of his comments over to this post so you can get a better idea of what he’s doing:

Wow, I’m flattered. I’ve been busy with my guest blogging stint over at http://www.thepassivevoice.com and didn’t see all these comments. Since there is some interest here, I’ll share what I can of my plans. I firmly believe that writers should use whatever tool works for them. For most people, that’s Microsoft Word. Some folks are using Scrivener and almost everyone else is using some word processor (a flavor of OpenOffice or those WordPerfect holdouts).

The first thing I’m going to release is a free document to source file converter service (to use Jaye’s terms). You save your manuscript in RTF format (pretty much every program supports RTF) and upload it to my service. My program will go through and do all the stuff that Jaye talks about. It will strip all the formatting except bold, italics, and chapter headings. You get back a nice clean source file in RTF format. You load it up into your tool and save it back as a .doc file and you have a source file suitable as the input for ebook formatting. It’s not much, but it is a nice little timesaver and your ebook formatter will thank you (even if you DIY). Did I mention it would be free?

I really appreciate all the expressions of support. I hadn’t really given much thought to a Kickstarter, but I am thinking about it now. In the meantime, there is something you could do to help. I need test cases. That is, I need real manuscripts before they’ve been given the Jaye Manus treatment. If anyone has copies of their novels (or short story collections) that they wouldn’t mind sharing with me, I would really appreciate it. I promise not to use them for anything other than perfecting my software. I will send you the cleaned up version and destroy or return the original when I’m done.

If you can help in this way, save your gnarliest files (smart quotes, em dashes, paragraphs indented with tabs and spaces, whatever) in RTF format and email them to razoroftruth at gmail dot com.

Let me know what program (e.g., Microsoft Word) and version (like 2000 or 2007) and whether you are using Windows, Mac, or Linux (or other Unix variant).

Which brings us to another problem I’m working on with source files–tagging. One of the things keeping me so busy this week is learning HTML. Turns out it’s kind of fun and quite the challenge. I also discovered that my resulting ebook files are much smaller–why? Who knows. But that’s a plus since I love using graphics for headers and such. Anyhow, the biggest challenge has been doing an ebook in screenplay format. It’s not difficult. It requires essentially three styles: Centered, Block Quote and Hanging Text. Since it ran about 120 pages in manuscript form, the real challenge was making sure every style was properly applied. I also wanted a way to NOT have to go in and tweak every line of text.

Now me, I happen to think FIND/REPLACE is the greatest invention since the light bulb. I’ve stated before that Word’s F/R is a powerhouse. Indeed. I also made some very interesting discoveries about Word and text editors and how they interact re formatting tags.

Le sigh…

Let’s talk about the two most common special formatting tags in the writing universe. Asterisks to indicate bolded text and underscores to indicate italics. Most editors and agents understand what those marks mean. Sending an e-query with those tags in place would be perfectly acceptable. Except… Even if you turn off the auto-formatting features, Word treats them like special characters and so does a text editor. Meaning, a text editor will strip them out. So those are out. You can use them if you like–they are easy to read–but if you ever have to copy the file into a text editor, you’ll lose the tags and your special formatting.

Anyhow, I’ve been using my own little special formatting tags–ii for italics, BB for bolding, and UU for underlining. Nobody but me sees them or has to read them, so no big deal. BUT, I am in the process of creating a cheat sheet for Source Files, and need to come up with tags that One) Make sense; Two) Are easy to remember and use; Three) Don’t activate “helpfulness” in word processors; Four) Work well in FIND/REPLACE operations. Number three is a bitch. I popped around in different programs to see how they handle various tags. Turns out non-letter characters are a problem when created in strings–Word, especially, kept getting wobbly and persnickety. Plus, some can cause problems in HTML coding because it uses so many characters for commands. For instance, I tried i/TEXT/i for italics. That seems fairly straightforward, right? It didn’t make Word go all wobbly either and it translated into a text editor. Problems arose when I did F/R operations in the text editor. I needed characters that are NOT used in coding. Which leaves out almost all of them.

Ah ha, most FIND operations can be made case sensitive. And there is one non-letter character that gave me no problems at all–the lowly dash/hyphen. So here are a few of the tags I ended up with:

  • -ITAL-   -NOITAL-
  • -CTR-     -NOCTR-
  • -BQ-        -NOBQ-
  • -NBSP-

Those might seem a little “wordy” but they are pretty self-explanatory (italics, centered text, block quote, non-breaking space) and they don’t cause interpretation wars between programs. When I paste the Word file into the text editor, all I have to do is run FIND/REPLACE operations to insert the coding. (Example: -ITAL- becomes <i> and -NOITAL- becomes </i> to make italicized text.) Most fiction doesn’t require every paragraph be tagged, so I won’t go into the nifty little shortcuts I found.
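
The same FIND/REPLACE pass can be written as a short Python sketch. Only the italics pair comes from the example above. The block quote and non-breaking space replacements are my assumptions about what the coding might be. Centered text is left out because it depends on how your stylesheet is set up.

```python
# Tag on the left, HTML on the right.
TAG_TO_HTML = {
    "-ITAL-": "<i>",
    "-NOITAL-": "</i>",
    "-BQ-": "<blockquote>",     # assumption: block quote markup
    "-NOBQ-": "</blockquote>",  # assumption
    "-NBSP-": "&nbsp;",         # assumption: non-breaking space entity
}


def tags_to_html(text: str) -> str:
    """Swap each source-file tag for its HTML equivalent."""
    for tag, markup in TAG_TO_HTML.items():
        # str.replace is case sensitive, like a case-sensitive FIND.
        text = text.replace(tag, markup)
    return text


print(tags_to_html("She was -ITAL-very-NOITAL- sure."))
```

Because the match is case sensitive and the tags are all caps between hyphens, ordinary words such as "vital" or a hyphenated phrase in lower case are left alone.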

The really important thing I’ve discovered is that not all tagging is equal and some of the old printer’s tags will not work because the programs want to do something with them and it’s not always what the writer intends.

So how about you, folks? What nifty tricks have you come up with for tagging the special formatting in your files?

Scrivener and the Ebook

I heard about the Scrivener writing program last year. I followed a link to Literature and Latte, roamed the site and got interested. What finally hooked me was the promise of a new way to organize a writing project. So in January I purchased the program. (Why so long? Because I’m scared of hardware. Go ahead and laugh, the old man does. Then again he’s a mechanical genius and never worries about the damned things eating him while he sleeps. When he starts fiddling with things, I have to leave the room, freaked out about him pissing off the machines. Because I’m convinced my computer is just waiting for me to do something stupid, it takes me forever to work up the nerve to install new programs. Now you know.)

As a novelist I’m not exactly tidy. My process involves notebooks, Post-it notes, scraps of envelopes, colored pens and pencils, sketch pads, clippings, piles of reference books bristling with place markers and file folders. Rolls of butcher paper, colored Sharpies, white boards and reams of printed manuscript also play a role. Clutter is part of my process. I need to see it, need to have it where I can put my hands on it. That’s what Scrivener does. It digitizes my clutter. Instead of having it scattered on my desk top, it’s on the screen, available with a click of the mouse.

Scrivener is not a word processor. It’s a writing program. It’s not for generating printed documents, it’s for generating files. (You can use it to create printed documents, but using Scrivener to write a letter to your mom would be like using a bulldozer to dig a post hole.) It very easily and quickly generates lots of different types of files, including PDF, RTF, Word, epub and mobi (Kindle) files.

Me, being me, I got ideas. Once I got interested in self-published ebooks, I had to try it for myself. I’m always interested in how things are made. Since I read almost exclusively on my Kindle now, I see a wide range of quality in ebook production. When I see a problem, I try to figure out the source of the problem. If you’re a regular reader, you have seen my obsession with em dashes and spacing issues and ways to exploit the ereader vehicle. To me, the two biggest concerns for any ebook formatter should be: Readability and easy navigation.

There are many ebook formatters who are familiar with programming, coding and HTML. I’m not one of them. I have no idea what goes on behind the screen. Because of the limitations of MS Word and the often interesting glitches that occur when converting Word files, learning ebook production increased my vocabulary (children, cover your ears). I also had problems organizing layouts. Every change made in Word offers the program new chances to screw things up. So I was figuring out HTML and basic programming, but slowly.

Then came Scrivener. It speaks my language and doesn’t require me to memorize codes and commands. With its organizational capabilities, I saw a way to make more than a simple ebook. I could make beautiful ebooks. It took me a few tries to figure out the possibilities until I achieved something that came close to matching my vision. You can see my latest creation here.

I also discovered a few quirks and limitations. BUT, by using Word’s powerhouse Find and Replace feature to take care of spacing issues and oddball punctuation, then stripping the file in a text editor, I can produce a squeaky clean file ready for layout in Scrivener. (And yeah, I know, it’s not particularly efficient, but I can do it very quickly and it makes sense to my peculiar way of thinking.) Scrivener’s special character map is far better than Word’s, too. So while I’m proofreading it is very easy to make a little cheat sheet off to the side for any special characters needed, then a simple search and replace takes care of those. Also, Scrivener’s formatting is basic, without all the desktop publishing features of Word (full of cute little traps that can make a total mess of an ebook). Since ebook formatting (for fiction) is quite basic, too, Scrivener’s simple design is perfect. I created a template so all I have to do is break up the main file, then move sections around to get the layout I want. I can send an RTF or PDF file to the writer for their approval, and if they want things changed around, no problem at all. Minimal risk of screwing up the formatting.

Graphics are a breeze with Scrivener, too. I’m not talking about covers. I’ve tried my hand at making ebook covers, with decidedly mixed results, but that’s a whole other project I’m working on. I’m talking about such things as fancy font chapter heads, scene break indicators and illustrations. Graphics open up a whole new world of possibilities to make an ebook visually appealing. I have only dabbled in graphics thus far, but I can see the potential and have several interesting experiments in mind.

Granted, not every indie author is interested in learning how to format their ebooks. Formatting isn’t difficult at all, but there is a learning curve and a million and one little details to track. If you want to focus on the writing and hire out the production jobs, that’s fine. Formatting isn’t terribly expensive and won’t break any writer’s bank. If, however, you’re a die-hard do-it-yourselfer, but you aren’t adept at programming and coding, Scrivener is an excellent way to go.

If anyone who uses Scrivener has tips and tricks for ebook production, or would like to know exactly how I created Beauty and the Feast, leave a note in the comments. I can write another blog post.