Restore Your Back List Books: Step 2: Part 1: The BIG Clean

“It took months! It was so frustrating I don’t know if I can ever do it again. But I have so many back list books I want to self publish...”

I’ve heard variations on that plaintive theme many, many times. Writers want to bring their back list back to life only to discover that a) they do not have a digital copy; b) the original manuscript is a mess of markup and it’s not the edited, final version anyway; c) they can’t find anyone who will restore the books within their budget. So they do it themselves. Because, seriously, how hard can it be?

On the left, a scanned paperback novel; on the right, the conversion via OCR into a Word doc.

Doesn’t look too bad, now does it? The innocent writer sets about restoring the document, beginning at page one, and… disaster. Why? To understand why, you have to know what is going on behind the scenes. The scanned document is an image file. (Some scanners convert to text during the scan, which saves some steps and gives very good results, but it doesn’t eliminate all the problems.) OCR is optical character recognition, meaning the program looks at a picture and decides what letter or character it might be. Depending on the typefaces used, font sizes, line height, condition of the paper and other factors, conversion can range from nearly pristine to what looks like glyphs etched into an alien spacecraft. And then… you have Word (or just about any word processor). It takes that converted text and does its utmost best to recreate the formatting.

The OCR conversion on crack, er, in a Word doc.

The program works really hard to recreate the formatting, using various fonts, section breaks, tabs, columns, text tables, images, etc. To give you an idea about how hard it works, the screen shot you see above is for a straight text (no illustrations) novel that is 72,664 words long. The file size as it stands is 7,117 KB. Over SEVEN MB! (The text by itself creates a file that is only 408 KB.) There is absolutely no way any person in the world can do battle against that mess in a reasonable amount of time. The more you try to fix the formatting, the worse it is going to get.

So, I’m going to show you a way to restore the text–not the document!–that will allow you to create a new document that is readable, workable, and editable with a minimum of fuss and bother. It won’t take months or weeks. It will take only hours or at most a few days.

You followed Step 1: Scan and Convert. Your document file is ready to work on. (I will be using Word because so many people use Word, but the principles apply to any word processor. Adjust as necessary.) The very first thing you MUST do is acquire a decent text editor. I use Notepad++. (It’s a free download, stable, and for our purposes, easy to use.) Go get it now. You can’t do what I’m about to show you without it.

Ready to begin? Open that bloated Word doc. We are going to do three things:

  1. Tag the paragraphs
  2. Delete headers and footers
  3. Tag italics

This will be a bit tedious. (I have looked for a fast Find/Replace that works every time without making things worse, and haven’t found it. So, while this is boring, it DOES work every time.) So put on a movie or queue up some music, make a fresh pot of coffee and get comfortable.

What tag to use? It doesn’t matter as long as it is unique. I prefer the little grave accent (the backtick) that shares a key with the tilde ` –I have never ever used it in decades of writing. I don’t even know why it’s on my keyboard, but there it is, conveniently located and it doesn’t require the Shift key. I always run a quick Find/Replace to delete any instances where OCR conversion has put those in the document. (If I ever run into a case where the writer actually used the ` then I can easily put them back in later.)

Start at the bottom, work your way up, ignore the odd things that happen to the formatting as you work.

You will start at the bottom–it’s less crazy making, trust me. Just tag the start of every paragraph. If you reach a header or footer that is in text, highlight and delete it. If Word has turned it into an actual header or footer and it’s grayed out, you can safely ignore it. If Word has turned your chapter headings into images, then you will have to type in new ones. Make sure you tag those, too. You will also want to tag deliberate blank lines such as scene breaks. I always insert `## to indicate a deliberate blank line.

QUICK TIP: If, for whatever reason, your document isn’t displaying paragraph indents or if they are difficult to see, open the original scan and place it side by side on your computer screen. Use the original as your guide to find and tag your paragraphs.

By the time you reach the beginning the document is going to look insane–ten times worse than when you started. IT DOES NOT MATTER. One more step and you will never have to look at this particular document again.

TAG YOUR ITALICS

Word is going to do a mediocre job of restoring your italics. But that’s okay. You can get most of them now and restore the rest later. For now, do a simple Find/Replace.

This will wrap all your italics in tags. Even though you will have to change the html tags before you return the text to a word processor, I use them now to make other searches easier.

A WARNING: You may be tempted to also tag bolded text such as that found in headers or subtitles. Don’t. It’s unnecessary and it will make extra work for you down the line. Tag bolded text ONLY if instances occur within paragraphs. Otherwise, just do the italics. Same applies for such things as underlined text. I’ll give you some tips about those special cases later.

QUICK TIP: Illustrations, photographs and other graphical images are going to disappear in the next step. You can delete them as you find them if you want, but it’s not necessary. I do note them as I find them, though. What I do: `[IMAGE CAPTION "Buster Bigbelly on his famous trick pony, Pal." 1948, photo by J. Somebody, page 134] If the image lacks a caption, I insert `[IMAGE photo of a man on a horse, page 134] The page number refers to the original book. If I intend to recover and use the images in an ebook or print-on-demand, I handle those separately from the text.

Now open the text editor. In Word do a Ctrl-a (Select All). Then Ctrl-c (Copy). Open a new file in the text editor and do a Ctrl-v (paste). Now your entire document is pasted into the text editor.

From 7MB to 408KB in minutes.

I do the majority of clean up in the text editor. Every document is going to be different and have different issues. Most fiction writers aren’t familiar with text editors and it looks funny and distracting and it makes it hard for them to work. Since I can’t possibly in one blog post cover all the many searches that I use, I am going to go with the bare minimum that will get you where you need to be.

RESTORE PARAGRAPHS

Before we restore the paragraphs we are going to add a space to the end of every line. It’s not always necessary, but when it is necessary you will be very sorry you did not do this. So, since it doesn’t hurt even if you don’t need it, do it. From the menu bar select Search>Replace and open the Find/Replace box. Make sure the “Extended” circle is checked. In the Find field type \r; in the Replace field tap the space bar once and then type \r.

Click Replace All and now you have an extra space at the end of every line. Now open the Find/Replace box again. In the Find field type \r\n and leave the Replace field empty (no spaces either).

Now click Replace All. As soon as you do, your ENTIRE text file is going to turn into a single line. There will not be a single paragraph or line break to be seen.

Next, open the Find/Replace box. In the Find field type your tag; in the Replace field put a hard return, \r\n, in front of the tag.

IMPORTANT: I used the diacritical mark as my tag. That is what I ask it to search for. If you used a different tag, use that. Do a Replace All and every paragraph you tagged is now restored. Use Find/Replace All to delete all your tags.
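If you are comfortable with a little scripting, the same idea can be sketched in Python. This is only an illustration of the logic, not a replacement for the Notepad++ steps; the function name `restore_paragraphs` is made up, and the sketch assumes the backtick is your paragraph tag. It also tidies stray whitespace inside each paragraph, which goes slightly further than the steps above.

```python
TAG = "`"  # assumption: the backtick paragraph tag used in this post


def restore_paragraphs(text: str, tag: str = TAG) -> str:
    """Rebuild paragraphs from text where each true paragraph starts with a tag."""
    # Treat Windows and old Mac line endings the same way.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Join every line with a space so words at line ends never run together.
    text = text.replace("\n", " ")
    # Each tag marks the start of a real paragraph.
    pieces = text.split(tag)
    # Collapse runs of whitespace inside each paragraph (an extra tidy-up).
    paragraphs = [" ".join(piece.split()) for piece in pieces]
    # Skip the empty piece that sits before the first tag.
    return "\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    sample = "`It was a dark\r\nand stormy night.\r\n`The rain fell\r\nin torrents.\r\n"
    print(restore_paragraphs(sample))
```

A scene-break marker such as `## simply comes out as a paragraph containing ##, so it stays easy to find later.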

You may have missed a few or mis-tagged a few paragraphs. You can find many of them now with this search. Open the Find box, select the “Regular expression” circle, check “Match case” (otherwise capital letters will match, too), and search for this: \n[a-z]

Now tell it to Find Next. This will find any instances of paragraphs that begin with a lower case letter. You can fix those paragraphs manually.
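The same check can be sketched in Python for anyone who prefers a list of suspects over clicking Find Next. The function name `lowercase_starts` is hypothetical, and the sketch assumes one paragraph per line.

```python
import re


def lowercase_starts(text: str) -> list[tuple[int, str]]:
    """Return (line number, preview) for paragraphs that begin with a lower case letter."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # re.match only looks at the start of the line; it is case sensitive by default.
        if re.match(r"[a-z]", line):
            hits.append((number, line[:40]))
    return hits


if __name__ == "__main__":
    sample = "First paragraph ends here.\nand this one was split by mistake.\nThird is fine."
    for number, preview in lowercase_starts(sample):
        print(number, preview)
```

It only reports; the actual fix still needs a human deciding whether the break belongs there.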

Word probably used a bunch of tabs–often within paragraphs for justifying text. You want those gone. Open the Find box. Make sure the “Extended” circle is checked. In the Find field type \t and put a single space in the Replace field. Do a Replace All and all the tabs will be replaced with a space.

QUICK TIP: If you have an “oh shit” moment and have done something you did not want to do, go to Edit>Undo. Notepad++ will let you go back as many steps as you need to.

DASHES: HYPHENS AND EM DASHES

When the print book was produced, the typesetter used a variety of dashes–hyphens, em and en dashes, half-ems, 3/4 ens, etc.–to lay out the text. Words were hyphenated. I have tried turning off hyphenation in the original converted document to decidedly mixed and mostly unpleasant results. My recommendation in that regard is to not bother. Here is another of those tedious chores that requires the human eye and some common sense. You can use the Find/Replace function to help you along. Scroll through, find an instance of a dash or hyphenation, then select it with any spaces around it and search for other occurrences. You can use Replace to delete unwanted hyphenation, but be careful about using Replace All. Under Edit in the menu bar you will find the Character Panel. It lists character codes 0 through 255, including such useful items as em and en dashes (these sit in the extended range, above plain ASCII). You can insert them manually or use them in the Replace field.
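For readers who like to script, here is a rough Python sketch that lists hyphenation candidates for review instead of replacing them blindly. It assumes the paragraphs have already been rejoined with a space at each old line end, so a word split in print shows up as "com- pletely". The name `hyphenation_candidates` and the pattern are illustrative assumptions.

```python
import re

# A word fragment, a hyphen, one space, then a lower case continuation.
HYPHEN_BREAK = re.compile(r"([A-Za-z]+)- ([a-z]+)")


def hyphenation_candidates(text: str) -> list[tuple[str, str]]:
    """Return (text as found, joined suggestion) pairs for a human to review."""
    return [
        (match.group(0), match.group(1) + match.group(2))
        for match in HYPHEN_BREAK.finditer(text)
    ]


if __name__ == "__main__":
    sample = "She was com- pletely lost in the well- known maze."
    for found, joined in hyphenation_candidates(sample):
        print(f"{found!r} -> {joined!r}?")
```

The sample shows why Replace All is risky: "com- pletely" should become "completely", but "well- known" should keep its hyphen and only lose the space.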

QUICK TIP: While you are fixing the dashes, you will notice all sorts of interesting characters–what I affectionately call “bugshit”. These are OCR artifacts. You might see bullet symbols or British pound symbols or plus signs. You can delete them as you go, or do Find/Replace All to delete them en masse. Just copy/paste the character into the Find field. I highly recommend that if you do a Replace All, that you replace inappropriate characters with a blank space. It’s a lot easier to delete blank spaces than it is to root out joined words.
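A small Python sketch of the same clean-up, for those who want a head count of the stray characters before deleting anything. The set of expected characters is my assumption and will need adjusting for your book; the function names are made up for illustration.

```python
from collections import Counter

# Assumption: characters that are normal in this book. Everything else is suspect.
EXPECTED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    " \n\r.,;:!?'\"()-`<>/#[]"
    "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"  # curly quotes, dashes, ellipsis
)


def odd_characters(text: str) -> Counter:
    """Count every character that is not in the expected set."""
    return Counter(ch for ch in text if ch not in EXPECTED)


def blank_out(text: str, unwanted: str) -> str:
    """Replace each unwanted character with a space, never with nothing."""
    table = {ord(ch): " " for ch in unwanted}
    return text.translate(table)


if __name__ == "__main__":
    sample = "The\u2022quick \u00a3brown fox+jumps"
    print(odd_characters(sample))
    print(blank_out(sample, "\u2022\u00a3+"))
```

Replacing with a space rather than nothing follows the advice above: the doubled spaces this leaves behind are easy to sweep up later, while joined words are not.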

A WORD OF WISDOM: Relax. This is a tedious process and imprecise. If you obsess about perfection at this point you will drive yourself nuts. Don’t bother going through the text word by word, line by line or even paragraph by paragraph. This is a BIG CLEAN. Suppose your car was wrecked and you took it to the body shop, and what if the first thing the mechanic did was whip out the wax and buffing wheel and start polishing the hood? Um, no. The first thing you do is pound out the dents and make sure the mechanical parts are in working order. The time for wax and polish is later. Right now, just focus on pounding out dents.

TIDY THE SPECIAL FORMATTING/ITALICS

If for some reason OCR conversion didn’t recognize your italics, you can skip this step. I’ll give you some tips later on how to fix that. If you do have italics, be aware that conversion and Word did a sloppy job of it. Use Find to search for your tags. You can search for either <i> or </i>. Use Find Next and go through the text, deleting any unnecessary tags (such as italicized blank spaces) and tidying the rest. Make sure if you delete either an open or closed tag, that you also delete its corresponding tag. Again, if you happen to see more bugshit while you’re doing this, fix that, too.

When you’re done tidying, use Find/Replace to turn the html tags into something that will not give word processors a case of the vapors. Turn <i> into -STARTI- and </i> into -ENDI-. The hyphens and all caps will help refine your search.
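Here is a minimal Python sketch of the tidy-and-convert step, assuming the italics were tagged with <i> and </i> as described. The function names are hypothetical, and the balance check is only a rough count; it will not tell you where a stray tag is.

```python
import re


def drop_empty_italics(text: str) -> str:
    """Remove italic tags that wrap nothing but whitespace, keeping the whitespace."""
    return re.sub(r"<i>(\s*)</i>", r"\1", text)


def tags_balanced(text: str) -> bool:
    """Rough check: the number of opening tags equals the number of closing tags."""
    return text.count("<i>") == text.count("</i>")


def to_placeholders(text: str) -> str:
    """Swap the html tags for plain-text markers a word processor will leave alone."""
    return text.replace("<i>", "-STARTI-").replace("</i>", "-ENDI-")


if __name__ == "__main__":
    sample = "He read <i>Moby Dick</i><i> </i>twice."
    cleaned = drop_empty_italics(sample)
    print(tags_balanced(cleaned))
    print(to_placeholders(cleaned))
```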

GET RID OF EXTRA SPACES

Use Find/Replace All to rid your text of extra spaces.

  • In the Find field insert TWO blank spaces; in the Replace field insert ONE blank space. Click Replace All until it tells you it can find no more.
  • Make sure the “extended” circle is highlighted. In the Find field type \n with one blank space after it; in the Replace field type \n with no spaces. Click Replace All until it tells you it can find no more.
  • Make sure the “extended” circle is highlighted. In the Find field insert one blank space and \r; in the Replace field type \r with no spaces. Click Replace All until it tells you it can find no more.
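The three searches above can be sketched as one small Python function. This is an illustration under the assumption that the text uses ordinary spaces (not tabs or non-breaking spaces); the name `strip_extra_spaces` is made up.

```python
import re


def strip_extra_spaces(text: str) -> str:
    """Collapse doubled spaces and trim spaces around line breaks."""
    # Two or more spaces become one (no need to repeat until none are left).
    text = re.sub(r" {2,}", " ", text)
    # Spaces at the start of a line go away.
    text = re.sub(r"\n +", "\n", text)
    # Spaces at the end of a line go away; the line break itself is kept.
    text = re.sub(r" +(\r?\n)", r"\1", text)
    return text


if __name__ == "__main__":
    sample = "One  two   three. \r\n Four."
    print(repr(strip_extra_spaces(sample)))
```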

Congratulations. Have a drink or a piece of dark chocolate. You deserve it. You have repaired the worst damage to your text. If the text editor is driving you nuts, you can stop using it now. In my next blog post, Part 2 of the Big Clean, I will take you back to Word so you can finish the job in a more comfortable environment. If by chance you are intrigued by the possibilities for some powerhouse searches and find/replace functions to clean up issues specific to your project, ask about them in the comments and let’s see if we can come up with a solution for you.


Pre-Production Check List: Cleaning Text

Hi, folks! Popping out of my mole hole for a breather. With so many writers getting ebooks ready for the holiday rush, it’s time for a quick refresher course in that most essential step: Cleaning up the text to get it ready for formatting.

Clean text is the key ingredient for a good looking ebook that works the way it’s supposed to. Over the course of doing a LOT of ebooks (seems like a thousand this week alone, heh) I’ve come up with a little check list to take me through the steps.

  • CHECK: Copy the file
  • CHECK: Tag special formatting–italics, underlining, bolding
  • CHECK: Scan for special styling–quotes, song lyrics, poetry, letters, etc.–tag those instances

Tagging. Because I use several different programs when working on one project, I’ve come up with tags that transfer from program to program without giving search functions fits. It doesn’t matter what tags you use as long as they are easy to find and don’t contain any characters that cause program meltdowns.

  • CHECK: Kill any tabs (I do this in Word because it’s so easy–one global search for ^t and a global replace with nothing. All done.)
  • CHECK: Turn ‘soft’ returns into ‘hard’ returns. Soft returns do funny things when copy/pasted back and forth. Easier to deal with them now. (UPDATE: It was pointed out that I didn’t explain how to do this. Oops. In Word, it is very easy. Search for ^l (lower case L) and do a global replace with ^p.)
  • CHECK: Copy/paste the entire file into a text editor

Why a text editor? Unlike a word processor, a text editor doesn’t add anything to the file unless I specifically tell it to. No hidden codes, no surprises. I use Notepad++, freeware that is powerful, easy to learn, and makes formatting ebooks in html a breeze.

  • CHECK: Eliminate extra spaces. Between sentences, after paragraphs, before paragraphs, between words. All must go.
  • CHECK: Tag scene breaks. Blank lines show up in manuscripts, often for no reason at all. I want to make sure a blank line is supposed to be there, so I tag all deliberately blank lines.
  • CHECK: Eliminate extra paragraph returns. Don’t need them, don’t want them, make them all go away. I usually leave a blank line where there is supposed to be a page or section break. All the rest go.
  • CHECK: Clean up special formatting tags. Rewriting and revising often leaves artifacts–italicized blank spaces, for instance. Also, when formatting with html, styling should be within a paragraph. There are rules. Making sure all the special formatting follows the rules makes my life easier.
  • CHECK: Search for inappropriate paragraph breaks. This is a real problem with books that have been scanned from print and restored via OCR. I search for paragraphs that begin with lower case characters or end without punctuation, and that finds most of the inappropriate breaks. (the rest are found in the final proofread)
  • CHECK: Search for reserved characters: straight quotes, straight apostrophes, ampersands, greater-than and less-than brackets. These don’t always cause problems, but sometimes they do and that can cause interesting hiccups in an ebook. Easier to just turn them into named entities.
  • CHECK: Seek out non-ASCII characters and symbols. These will turn into question marks or bizarre symbols in the text editor. Ebook readers will not render them, so they must be turned into named entities.
  • CHECK: Standardize punctuation. Ebooks are real books, and require real printer’s punctuation. I go through and make sure em dashes are em dashes and not quickie writer shorthand, that ellipses all look the same, that apostrophes and quote marks are turned in the correct direction.
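The named-entity steps in the checklist can be sketched in Python with the standard library's entity table. This is one possible approach, not the method used for the checklist above; `to_entities` is a made-up name. It assumes it runs on plain text before any html markup is added (otherwise the markup's own brackets would be converted too), and it leaves straight apostrophes alone because the table has no name for them.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Turn reserved and non-ASCII characters into html entities."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in '&<>"':
            # Reserved characters always get their named entity.
            out.append(f"&{codepoint2name[code]};")
        elif code > 127:
            # Use the name when one exists, otherwise fall back to a numeric entity.
            name = codepoint2name.get(code)
            out.append(f"&{name};" if name else f"&#{code};")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_entities("Tom & Jerry \u2014 caf\u00e9"))
```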

That checklist takes care of almost everything. Even though it sounds like a lot, most of the steps can be taken care of in one or two Find/Replace operations. Most manuscripts I work on can be cleaned up in less than an hour.

Even if you are formatting your ebook in a word processor or in Scrivener, this is good practice for every project. (Skip the steps about using named entities, but do check for non-ASCII characters) It will clean out the junk the programs put in and go a long, long way toward making your ebook look professional.

Have fun! (I’m headed back to the mole hole)

 

Restore Paragraphs in an OCR Scan

Earlier, I wrote a post about DIY scanning and doing an OCR rendering and clean-up of your back list books. It doesn’t have to be expensive and it’s not difficult to do. It does require patience, because cleaning up an OCR rendering takes time.

If you used FreeOCR (as I’d recommended) one thing you’ve noticed is that it inserts a hard return at the end of every single line. The first time I saw that I freaked out a bit. I envisioned having to go through the entire file, manually deleting those extra returns and restoring every paragraph. Then I discovered the hard returns actually help in cleaning up the file because I can work line by line through the text, comparing it to the original material.

Once the text is cleaned up, the paragraphs do need to be restored. If you are using Notepad++ (a text editor that I highly recommend) you can use Find/Replace to do the job. The first step takes some time, but the actual restoration uses the power of Replace All to do the job quickly.

Before you begin work on the file, do a Save As and work on the copy. That way if you mess up, the original is intact and you can easily start over.

STEP ONE: Insert an extra line between each “true” paragraph.

In order to keep an eye on what you are doing, toggle on the Show Characters button. It’s in the toolbar and the icon looks like a blue pilcrow (paragraph symbol). It will display black boxes with [CR]–for carriage return–and [LF]–line feed–wherever there is a hard return.

Once you have an extra line between every true paragraph, you will need to insert an extra space at the end of every line. This way you won’t end up with joined words.

STEP TWO: Open the Find/Replace box and toggle on “extended”.
In the Find box type: \r
In the Replace box type: (space)\r
(don’t type out “space” just tap the space bar once)
Do a Replace All

Now you are going to tag the places where you WANT a hard return.

STEP THREE: In the Find box type: \r\n(space)
In the Replace box type: \r\n-N-
Do a Replace All

Now the step where you have to steel your nerves. Remove ALL the hard returns.

STEP FOUR: In the Find box type: \r\n
Leave the Replace box blank (no spaces either)
Do a Replace All.

Now you have one giant block of text with zero hard returns. But don’t freak out. Now you restore the proper paragraphs.

STEP FIVE: In the Find box type: -N-
In the Replace box type: \r\n
Do a Replace All.

Now your paragraphs are restored and there are no extra hard returns to be found. You will need to now get rid of those extra spaces at the end of each paragraph.

In the Find box type: (space)\r
In the Replace box type: \r
Do a Replace All.
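For the curious, the steps above translate almost word for word into Python string replacements. This is a sketch for illustration only; `restore_ocr_paragraphs` is a made-up name, and it assumes the file uses Windows line endings (\r\n) and already has the extra blank line between true paragraphs from Step One.

```python
def restore_ocr_paragraphs(text: str, marker: str = "-N-") -> str:
    """Mirror the Find/Replace steps: pad, tag, flatten, restore, trim."""
    # Step two: add a space before every carriage return.
    text = text.replace("\r", " \r")
    # Step three: a return followed by a space marks a blank line, so tag it.
    text = text.replace("\r\n ", "\r\n" + marker)
    # Step four: remove ALL hard returns.
    text = text.replace("\r\n", "")
    # Step five: turn each tag back into a hard return.
    text = text.replace(marker, "\r\n")
    # Last: remove the extra space at the end of each paragraph.
    text = text.replace(" \r", "\r")
    return text


if __name__ == "__main__":
    sample = "It was a dark\r\nand stormy night.\r\n\r\nThe rain fell\r\nin torrents.\r\n"
    print(repr(restore_ocr_paragraphs(sample)))
```

The very last paragraph keeps one trailing space because no return follows it; that is harmless and easy to trim.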

That’s it. Except for the first step where you have to insert an extra line between each real paragraph, explaining this takes longer than doing it. This method is a whole lot easier than manually deleting the unwanted hard returns.

Have fun!

Scan, OCR and Restore BackList Books

This week I read a comment on a blog (can’t remember where–sorry) where a writer said she was putting off reissuing her backlist titles because she didn’t have accessible computer files for them and so she’d have to scan the actual books, run them through an OCR program and format them. She didn’t know how to do that.

I hear ya, sister. A few months ago I’d have nodded in agreement, and said, “Yep, too hard, too time-consuming, too expensive.” Now, however, having spent the past few months restoring nearly two dozen old paperback books from scans and turning them into ebooks, I know it’s NOT too hard, it IS time-consuming, and the cost can range from dollars per page (expensive) to FREE (DIY option).

(Another option is to retype the book, but quite frankly, folks, unless you are a super-typist with wrists of steel–which I most certainly am not–that is a daunting proposition.)

You know me. Somebody sez, “Can you do this?” and I reply, “How hard can it be?” Then I bumble and fumble around until I figure out how to do it. Then I come on here and am able to give you some tips that mean you can skip the bumbling and fumbling part. Unless you enjoy b&f. In that case, you can stop reading this post.

This is for the Do-It-Yourselfers.

SCANNING

Do a Google search for “scanning books” and the result will come up with thousands of services that will take your old books or manuscripts and turn them into pdf or doc files. Some services will scan the book without harming the binding, some will chop off the spine, destroying the book. Prices range from per-page costs to flat-rate. I haven’t used any of those services, so I can’t recommend any of them. You’ll have to do your own research.

You can also take your old books or manuscripts to a copy store such as Fed-Ex/Kinkos or a full-service office supply store such as Staples, and either do it yourself on their equipment or have them do it for you.

If you happen to own a scanner, you can do it at home. This is the insane option because quite frankly most home scanners are ridiculous beasts that take their sweet time. (I know this because I had to try it myself just to see and so scanned a nearly 300 page manuscript–easy on the hands, tough on the buttocks. It took hours!) If you are home-scanning actual pages from a paperback, you will have to play with the settings on your scanner because most are at their best scanning photos and that resolution is far too high to get good results. Best results are achieved if you copy the pages onto good quality 20# or 24# copy paper and then scan the copies.

However you choose to have your book/manuscript scanned, my recommendation is to have the scanner turn it into a pdf file. There are services and programs that will do the OCR conversion during the scan and produce a .doc, .docx or .rtf file for you. On the surface, it looks like a bargain. I think it’s dangerous because: 1) the file you receive will be huge and bloated and junked up with tons of coding that can severely mess up your ebook; 2) it will not save you any work during clean-up and in some ways it makes clean-up more of a chore; 3) it could give you a false sense of security that your file is cleaner than it actually is and your ebook could end up like so many that are on my Kindle right now, full of formatting errors and gibberish.

Here is a file that has been scanned and converted at the same time:

Here is a file that has been DIY scanned and turned into a .doc file:

It’s a big mess, too, but there are actually fewer dangerous formatting issues you will have to address. Awful as it looks, this example is easier to clean up and turn into an ebook than the first example. So save your money (and a few headaches) and run the pdf through the OCR program yourself.

OCR

Scanned PDF files are image files. Pictures of a page. In order to clean up and format the pages they must be converted into text. That’s where OCR comes in–Optical Character Recognition.

I found a nifty little program called FreeOCR. It’s a free program you download onto your computer. It’s a powerful program with a few bells and whistles–none of which I recommend you use. This is a case where the more you automate the process, the worse your results will be. There is no good substitute for the human eye and human instincts when it comes to restoring a document file. You’re better off in the long-run by doing a basic OCR conversion. That means, open the FreeOCR program, open a pdf file, then render it page by page (depending on the size of the file and the density of the type, to do a complete book the process will take between 20 minutes and an hour).

The original scanned page is on the left, the OCR conversion is on the right. You can see what a mess it is. That’s because the OCR is very efficient. It turns not only images of text into text, it turns water stains, wrinkles, shadows, and debris embedded in the paper into text, too. If there are notes in the margin, it will try to turn those into text. A basic scan also inserts a hard paragraph return at the end of every line, gets rid of paragraph indents and destroys special formatting such as bold and italics (the first time I saw this I totally freaked out). Some things convert more cleanly than others. If you’re converting a decades-old paperback where the pages have yellowed and degraded, the conversion will be a HUGE mess.

But not a hopeless mess.

CLEAN UP

FreeOCR gives you an option of saving your rendered document as a Word file. You can do that and clean up your file in Word. There is a much easier, faster and more efficient way. Use a text editor (with a little eventual help from Word). I use Notepad++, a program you can download for free. Save your OCR rendering into the clipboard (or do a right click, Select All/Copy) and paste it into the text editor.

Whether you use Word or a text editor, this is the time-consuming part of the process. And there’s no help for it. If you want a good-looking ebook, you need to make your converted file squeaky clean. (Your other option is hiring someone to do it for you. BUT–and this is a huge but–you have to make sure the service you hire is NOT automating the process, but that there is instead an actual human being going through the book word by word and restoring the text. Those automated programs are powerful and they do a good job on some projects, but I have ebooks I have purchased on my Kindle right now that are unreadable messes due to those programs.)

I have learned a few things to make the job go faster and more efficiently.

  1. Save restoring the paragraphs for last. Take a look at the image of the OCR conversion in Word. I toggled on the Show/Hide feature so you can see how every line has a paragraph return. What you see is the layout from the printed book. That can help during clean up.
  2. Work off the actual pages. Either have the actual book in front of you or split your computer screen and have the pdf file open to the scanned pages. That way if the OCR mangled the text, you can retype a word or line from the actual copy instead of trying to guess what it is supposed to say. You can also tag special formatting such as italics as you go along.
  3. Use Find/Replace.

The text will be full of oddball characters (I call them bug shit). Things like degree symbols, floating quote marks, greater and less than characters, slashes, tildes. If something doesn’t belong in your text file–Find/Replace All gets rid of it. You can also use it to get rid of headers, footers and page numbers. Once you have the text cleaned up, you can use Find/Replace All to get rid of extra paragraph returns, restore the proper paragraphs and un-hyphenate any words that had been split in the printed version. (BONUS TIP: Before you get rid of the extra paragraph returns use Find/Replace to add an extra space at the end of each line. That keeps words from being joined and makes it easier to find hyphens you want to get rid of)
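As one example of what such a clean-up search might look like in script form, here is a rough Python sketch that drops lines holding nothing but a page number or a known running header. It assumes the OCR output still has one printed line per text line; the function name and the sample title "THE LOST RANCH" are invented for illustration. Be aware that it would also drop a legitimate line that consists only of a short number.

```python
import re


def drop_page_furniture(text: str, running_heads: tuple[str, ...] = ()) -> str:
    """Remove bare page numbers and exact-match running headers or footers."""
    heads = {head.strip() for head in running_heads}
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"\d{1,4}", stripped):
            continue  # a line holding nothing but a page number
        if stripped and stripped in heads:
            continue  # a running header or footer you listed
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "12\nTHE LOST RANCH\nHe rode on.\n\nShe waited.\n13"
    print(drop_page_furniture(sample, running_heads=("THE LOST RANCH",)))
```

OCR often mangles headers, so expect near-misses that an exact match will not catch; those still need the human eye.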

So, yes, this is time-consuming, but it is not hard nor does it have to be expensive. It is definitely worthwhile to get your backlist back in circulation.