Restore Your Back List Books: Step 2: Part 2: Create a Workable Document

All righty then. You have scanned and converted your printed book. You have cleaned out the very worst boogers and formatting. You now have pure text you can turn into a document you can actually read and edit. You are this close to having a manuscript that is no more difficult to work with than any other WIP.

Before we get into specifics, let me explain up front why I use the style and font that I’m going to use in my examples. I’m an old-school writer and for years and years I worked in standard manuscript format for submission to editors. 12 pt Courier, double-spaced, wide margins, underlining for italics. Nothing awakens my inner editor faster than 12 pt Courier, et al. That’s me. You need to use whatever style, font, etc. that works best for you. If Candara 11pt, 1.5 line spacing or Garamond 13pt, triple spaced lets you work efficiently, then use it. It doesn’t make a whit of difference what your working document LOOKS like as long as you are comfortable and you can work.

First, let’s do a little prep work with our original material–the print book. No matter how careful you are, no matter how good the equipment, shit happens. Text gets garbled, a page is missed, a wrinkled page is turned into abstract art. So go through your original pages and mark sections and chapter starts with a paper clip or sticky note. If you suspect your italics or other special formatting is messed up or missing, scan through the printed pages and highlight the italics (you’d be surprised how well italics “leap” off the printed page–you can scan very quickly).

Ready? Open Word (or whatever word processor you prefer) to a blank document. Apply the style “Normal.” Open up your text editor file. Do a Ctrl-a (Select All), Ctrl-c (Copy), then go to Word and do Ctrl-v (Paste). Your text is now a document file. Looks a whole lot different from what you started with, right? Now modify the “Normal” style to make it look the way YOU want it to look. (font, line spacing, paragraph indents, etc.)

Not only does it look different, it’s a whole lot smaller, too. This sample file went from over 7MB to its current 472KB. No columns, tables, tabs, changing fonts, or any of the other bloat or nonsense that make your job so hard. Despite still needing some work, it’s readable. If you wanted to start right now from page one, word one to begin the final cleaning, you could do so without ripping out your hair or giving up in frustration.

But wait! I have some tips and tricks you can use to make the job go even faster.

BUILD A NAVIGATION GUIDE

Word has its strengths–navigation is one of them. You’re going to make it very easy to move around in your manuscript by using styles. Heading styles, to be exact. Scroll through your document and apply a heading style to your chapter heads.

If you’re using Word 2010, it has a nifty navigation panel that allows you to see where you are in your document at all times. It has plenty of levels, too. So if you have a very long, complex document, you can do something like apply Heading 1 to chapter heads; apply Heading 2 to sections; apply Heading 3 to the first paragraph after a scene break, and so on. Taking ten or twenty minutes to do this now will save you tons of time later when you, for instance, run into a patch of garbled text and need to find it in the original. It’s a whole lot easier to search a known section than it is to scroll around in the document to figure out where you are then have to paw through the original. You can modify the heading styles to look any way you want them to look. It doesn’t matter, this is for your eyes only.

QUICK TIP: If you are using an older version of Word that does not have a navigation pane, click and hold down your mouse on the right hand scroll bar. It will tell you where you are in the document.

RESTORE YOUR SPECIAL FORMATTING

Now you can restore your italics (and other special formatting). As noted before, I like to use underlining when I’m cleaning up a restored document. Underlining is more visible than italics, and it’s very easy to change the underlining to italics later if necessary. Do whatever is comfortable for you. Open up the Find/Replace box and make it look exactly like this:

[Screenshot of the Find/Replace box]

Do a Replace All and done. Use Find/Replace to get rid of the tags. (Make sure you uncheck “use wildcards” and select No Formatting for the Replace field.) If by chance your italics didn’t make it through conversion, I recommend you wait until after you have proofread the text and run the final spell check before you put the italics back in. It will make searching for the text you want easier.

RESTORE CURLY QUOTES

Word also does a nifty little trick for you. In Find/Replace if you type " in the Find field and " in the Replace field, then do a Replace All it will turn your quote marks in the right direction (mostly). Type ' in the Find field and ' in the Replace field, do a Replace All, and it will turn your apostrophes and single quotes, too (mostly). I say “mostly” because a few will still be turned wrong, but you can find those easily enough when you’re proofreading.
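
If you would rather turn the quotes before the text ever goes back into Word, the same idea can be roughed out in a few lines of Python. This is only a sketch: the function name curl_quotes is made up for illustration, and the rule it uses (a quote mark at the start of the text or after a space, an opening bracket or a dash opens; anything else closes) is an assumption about how such a pass might work, not a description of what Word does internally.

```python
import re

def curl_quotes(text):
    """Turn straight quotes into curly ones with a simple position rule (hypothetical helper)."""
    # A double quote at the very start, or after whitespace, "(", "[", a dash
    # or a hyphen, is treated as an opening quote.
    text = re.sub(r'(^|[\s(\[\u2014-])"', "\\1\u201c", text)
    # Every double quote still left is treated as a closing quote.
    text = text.replace('"', "\u201d")
    # Same rule for single quotes; an opening double quote may also come first.
    text = re.sub(r"(^|[\s(\[\u2014\u201c-])'", "\\1\u2018", text)
    # Whatever is left is an apostrophe or a closing single quote.
    text = text.replace("'", "\u2019")
    return text

print(curl_quotes('"Don\'t go," she said.'))
```

Like the Word trick, it gets things right mostly. A leading apostrophe, as in 'Twas, comes out facing the wrong way, so those still need a human eye during proofreading.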

PRELIMINARY SPELL CHECK

You will make life much easier for yourself if you run a spell check BEFORE you start proofreading. By this point Word has already warned you that “there are too many spelling and grammar errors…” Wimpy. At this point you will run into a lot of joined words, mis-hyphenated words and gibberish. This is your opportunity to clean those up. In most cases, it will take a while, so put on a movie or queue up some music, make a fresh pot of coffee and make yourself comfortable.

QUICK TIP: If you open the Find/Replace box in Word 2010 you will see down at the bottom left a box for “Options”–open it.

Go through the menu and customize it to suit your document’s needs. It will make life much easier on you. Also, on the Find/Replace box (scroll up to see it) you will notice a button that says “Special.” Click that and it will open a menu that contains such special characters as em dashes and paragraphs. You can search for those.

A WARNING: Be very cautious about how you use “Change All.” Remember, OCR has interpreted images into characters, and like any interpreter, it can be sort of stupid. It’ll trip you up. At this stage, you are far better off correcting one word at a time, even if it takes some extra time.

SPECIAL FORMATTING FOR PARAGRAPHS

As you go through the document, you might find such things as letters, notes, text messages, poetry, song lyrics, lists–instances that will require special handling when the document is turned into a book. DO NOT FORMAT THESE. This is your source document. If you want to turn it into a book or an ARC, you will do so with a copy of the file. Instead, make a note (for yourself or for the person you hire to format your books) and highlight it. A few examples:

  • [NUMBERED LIST]
  • [POETRY, OFFSET AND ITALICIZED]
  • [LETTER, SIGNATURE RIGHT ALIGN]

It will make your (and possibly my) life easier. If you hire out the formatting, let your formatter know about the notes and they’ll handle it from there.

CONGRATULATIONS

Your text is now clean enough for you to go through it and treat it like any other proofreading job. It won’t leave you curled in a fetal ball, weeping about the immenseness of it all; it won’t leave you with bald patches from tearing your hair out. After your proofread (which will NOT take months) you’ll have clean, error-free text ready to be formatted into an ebook or print on demand book or both, and your readers will thank you.

So no more excuses. Get that back list back into circulation.

Restore Your Back List Books: Step 2: Part 1: The BIG Clean

“It took months! It was so frustrating I don’t know if I can ever do it again. But I have so many back list books I want to self publish...”

I’ve heard variations on that plaintive theme many, many times. Writers want to bring their back list back to life only to discover that a) they do not have a digital copy; b) the original manuscript is a mess of markup and it’s not the edited, final version anyway; c) they can’t find anyone to restore the book/s for them that fits within their budget. So they do it themselves. Because, seriously, how hard can it be?

On the left, a scanned paperback novel; on the right, the conversion via OCR into a Word doc.

Doesn’t look too bad, now does it? The innocent writer sets about restoring the document, beginning at page one, and… disaster. Why? To understand why, you have to know what is going on behind the scenes. The scanned document is an image file. (There are some scanners that convert to text during the scan and that saves some steps and the results are very good, but it doesn’t eliminate all the problems.) OCR is optical character recognition, meaning the program looks at a picture and decides what letter or character it might be. Depending on the typefaces used, font sizes, line height, condition of the paper and other factors, conversion can range from nearly pristine to what looks like glyphs etched into an alien spacecraft. And then… you have Word (or just about any word processor). It takes that converted text and does its utmost best to recreate the formatting.

The OCR conversion on crack, er, in a Word doc.

The program works really hard to recreate the formatting, using various fonts, section breaks, tabs, columns, text tables, images, etc. To give you an idea about how hard it works, the screen shot you see above is for a straight text (no illustrations) novel that is 72,664 words long. The file size as it stands is 7,117 KB. Over SEVEN MB! (The text by itself creates a file that is only 408 KB.) There is absolutely no way any person in the world can do battle against that mess in a reasonable amount of time. The more you try to fix the formatting, the worse it is going to get.

So, I’m going to show you a way to restore the text–not the document!–that will allow you to create a new document that is readable, workable, and editable with a minimum of fuss and bother. It won’t take months or weeks. It will take only hours or at most a few days.

You followed Step 1: Scan and Convert. Your document file is ready to work on. (I will be using Word because so many people use Word, but the principles apply to any word processor. Adjust as necessary.) The very first thing you MUST do is acquire a decent text editor. I use Notepad++. (It’s a free download, stable, and for our purposes, easy to use.) Go get it now. You can’t do what I’m about to show you without it.

Ready to begin? Open that bloated Word doc. We are going to do three things:

  1. Tag the paragraphs
  2. Delete headers and footers
  3. Tag italics

This will be a bit tedious. (I have looked for a fast Find/Replace that works every time without making things worse, and haven’t found it. So, while this is boring, it DOES work every time.) So put on a movie or queue up some music, make a fresh pot of coffee and get comfortable.

What tag to use? It doesn’t matter as long as it is unique. I prefer the little diacritical character under the tilde key ` –I have never ever used it in decades of writing. I don’t even know why it’s on my keyboard, but there it is, conveniently located and it doesn’t require the Shift key. I always run a quick Find/Replace to delete any instances where OCR conversion has put those in the document. (If I ever run into a case where the writer actually used the ` then I can easily put them back in later.)

Start at the bottom, work your way up, ignore the odd things that happen to the formatting as you work.

You will start at the bottom–it’s less crazy making, trust me. Just tag the start of every paragraph. If you reach a header or footer that is in text, highlight and delete it. If Word has turned it into an actual header or footer and it’s grayed out, you can safely ignore it. If Word has turned your chapter headings into images, then you will have to type in new ones. Make sure you tag those, too. You will also want to tag deliberate blank lines such as scene breaks. I always insert `## to indicate a deliberate blank line.

QUICK TIP: If, for whatever reason, your document isn’t displaying paragraph indents or if they are difficult to see, open the original scan and place it side by side on your computer screen. Use the original as your guide to find and tag your paragraphs.

By the time you reach the beginning the document is going to look insane–ten times worse than when you started. IT DOES NOT MATTER. One more step and you will never have to look at this particular document again.

TAG YOUR ITALICS

Word is going to do a mediocre job of restoring your italics. But that’s okay. You can get most of them now and restore the rest later. For now, do a simple Find/Replace.

[Screenshot of the Find/Replace box]

This will wrap all your italics in tags. Even though you will have to change the html tags before you return the text to a word processor, I use them now to make other searches easier.

A WARNING: You may be tempted to also tag bolded text such as that found in headers or subtitles. Don’t. It’s unnecessary and it will make extra work for you down the line. Tag bolded text ONLY if instances occur within paragraphs. Otherwise, just do the italics. Same applies for such things as underlined text. I’ll give you some tips about those special cases later.

QUICK TIP: Illustrations, photographs and other graphical images are going to disappear in the next step. You can delete them as you find them if you want, but it’s not necessary. I do note them as I find them, though. What I do: `[IMAGE CAPTION “Buster Bigbelly on his famous trick pony, Pal.” 1948, photo by J. Somebody, page 134] If the image lacks a caption, I insert `[IMAGE photo of a man on a horse, page 134] The page number refers to the original book. If I intend to recover and use the images in an ebook or print-on-demand, I handle those separately from the text.
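
Once the text is in plain form, those image notes are easy to pull back out as a checklist. The snippet below is a small sketch, assuming every note starts with [IMAGE and ends with a page number and a closing bracket, as in the two examples above; the function name image_notes is invented for illustration.

```python
import re

# Matches "[IMAGE ... page 134]" and captures the page number.
IMAGE_NOTE = re.compile(r"\[IMAGE\b[^\]]*?page (\d+)\]")

def image_notes(text):
    """Return (page number, full note) for every image note found (hypothetical helper)."""
    return [(int(m.group(1)), m.group(0)) for m in IMAGE_NOTE.finditer(text)]

sample = (
    "`[IMAGE photo of a man on a horse, page 134]\n"
    "`Some ordinary paragraph.\n"
    "`[IMAGE CAPTION \u201cPal.\u201d 1948, page 12]"
)
for page, note in sorted(image_notes(sample)):
    print(page, note)
```

Sorting by page number gives you a list you can walk through with the original book in hand when you go back for the images.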

Now open the text editor. In Word do a Ctrl-a (Select All). Then Ctrl-c (Copy). Open a new file in the text editor and do a Ctrl-v (paste). Now your entire document is pasted into the text editor.

From 7MB to 408KB in seconds.

I do the majority of clean up in the text editor. Every document is going to be different and have different issues. Most fiction writers aren’t familiar with text editors and it looks funny and distracting and it makes it hard for them to work. Since I can’t possibly in one blog post cover all the many searches that I use, I am going to go with the bare minimum that will get you where you need to be.

RESTORE PARAGRAPHS

Before we restore the paragraphs we are going to add a space to the end of every line. It’s not always necessary, but when it is necessary you will be very sorry you did not do this. So, since it doesn’t hurt even if you don’t need it, do it. From the menu bar select Search>Replace and open the Find/Replace box. In the Find field type \r and in the Replace field insert a blank space \r. Make it look EXACTLY like this:

[Screenshot of the Find/Replace box]

Click Replace All and now you have an extra space at the end of every line. Now open the Find box and make it look EXACTLY like this:

[Screenshot of the Find/Replace box]

Now click Replace All. As soon as you do, your ENTIRE text file is going to turn into a single line. There will not be a single paragraph or line break to be seen.

Next, open the Find/Replace box and make it look EXACTLY like this:

[Screenshot of the Find/Replace box]

IMPORTANT: I used the diacritical mark as my tag. That is what I ask it to search for. If you used a different tag, use that. Do a Replace All and every paragraph you tagged is now restored. Use Find/Replace All to delete all your tags.
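
For anyone who wants to see the logic of those passes laid out in one place, here is a rough Python sketch. It assumes the three Replace All steps amount to this: put a space where every line break was, remove the line breaks, then start a new paragraph at every tag. The function name restore_paragraphs is invented for illustration, and the sketch also squeezes runs of spaces inside each paragraph, which the text editor steps leave for later.

```python
TAG = "`"  # the paragraph tag used in this walkthrough; change it if you used another

def restore_paragraphs(raw, tag=TAG):
    """Rebuild one-line-per-paragraph text from tagged text (hypothetical helper)."""
    # Treat Windows (\r\n) and bare \r line endings the same way.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # A space goes where each line break was so words at line ends do not
    # fuse, and the whole text becomes a single line.
    text = text.replace("\n", " ")
    # Every tag marks the start of a paragraph; the tag itself is dropped.
    pieces = text.split(tag)
    # Squeeze stray whitespace inside each paragraph and skip empty pieces.
    paragraphs = [" ".join(piece.split()) for piece in pieces]
    return "\n".join(p for p in paragraphs if p)

sample = "`Chapter One\r\n`It was a dark\r\nand stormy night.\r\n`##\r\n`The end."
print(restore_paragraphs(sample))
```

Note that the `## scene-break marker survives as a paragraph of its own, so deliberate blank lines are still easy to find afterward.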

You may have missed a few or mis-tagged a few paragraphs. You can find many of them now with this search. Open the Find box and search for this \n[a-z]

Now tell it to Find Next. This will find any instances of paragraphs that begin with a lower case letter. You can fix those paragraphs manually.
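
The same check can be sketched outside the text editor. This Python version assumes one paragraph per line and lists every paragraph that starts with a lower case letter along with its line number; the name lowercase_starts is made up for the example. Unlike the \n[a-z] search, it also looks at the very first line of the file.

```python
import re

def lowercase_starts(text):
    """List (line number, first 40 characters) for paragraphs starting in lower case."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        # re.match only looks at the start of the line.
        if re.match(r"[a-z]", line):
            hits.append((number, line[:40]))
    return hits

sample = "It was late.\nand then he left.\n\u201cHello,\u201d she said.\nthe end"
for number, start in lowercase_starts(sample):
    print(number, start)
```

A hit is only a suspect, not a verdict. Dialogue that continues after a scene break, for instance, can start in lower case on purpose, so fix these by hand.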

Word probably used a bunch of tabs–often within paragraphs for justifying text. You want those gone. Open the Find box. Make sure the “Extended” circle is checked. In the Find field type \t and put a single space in the Replace field. Do a Replace All and all the tabs will be replaced with a space.

QUICK TIP: If you have an “oh shit” moment and have done something you did not want to do, go to Edit>Undo. Notepad++ will let you go back as many steps as you need to.

DASHES: HYPHENS AND EM DASHES

When the print book was produced, the typesetter used a variety of dashes–hyphens, em and en dashes, half-ems, 3/4 ens, etc.–to lay out the text. Words were hyphenated. I have tried turning off hyphenation in the original converted document to decidedly mixed and mostly unpleasant results. My recommendation in that regard is to not bother. Here is another of those tedious chores that requires the human eye and some common sense. You can use the Find/Replace function to help you along. Scroll through, find an instance of a dash or hyphenation, then select it with any spaces around it and search for other occurrences. You can use Replace to delete unwanted hyphenation, but be careful about using Replace All. Under Edit in the menu bar you will find the Character Panel. It contains all the ASCII characters, including such useful items as em and en dashes. You can insert them manually or use them in the Replace field.
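
To get a feel for how much leftover hyphenation you are facing, you can list the suspects before you touch anything. The sketch below assumes that, once the lines were joined, a word broken at a line end shows up as letters, a hyphen, a space, then lower case letters. The pattern and the name hyphenation_candidates are illustrative assumptions, not a tested rule for every book.

```python
import re

# Letters, a hyphen, one space, then lower case letters: "frus- trated".
BROKEN = re.compile(r"\b([A-Za-z]+)- ([a-z]+)\b")

def hyphenation_candidates(text):
    """Return (text as found, suggested joined word) pairs for review."""
    return [(m.group(0), m.group(1) + m.group(2)) for m in BROKEN.finditer(text)]

sample = "She was frus- trated by the pre- and post-war editions."
for found, joined in hyphenation_candidates(sample):
    print(found, "->", joined)
```

The sample shows why Replace All is risky here: “frus- trated” should be joined, but “pre- and” is a legitimate hanging hyphen and joining it would make nonsense. The list is for your eyes, one item at a time.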

QUICK TIP: While you are fixing the dashes, you will notice all sorts of interesting characters–what I affectionately call “bugshit”. These are OCR artifacts. You might see bullet symbols or British pound symbols or plus signs. You can delete them as you go, or do Find/Replace All to delete them en masse. Just copy/paste the character into the Find field. I highly recommend that if you do a Replace All, that you replace inappropriate characters with a blank space. It’s a lot easier to delete blank spaces than it is to root out joined words.
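
If you want a head count of the odd characters before you start deleting, something like the following Python sketch can help. It assumes “expected” means plain letters, digits, standard keyboard punctuation, curly quotes, en and em dashes and the ellipsis; anything else gets counted. The names EXPECTED, odd_characters and blank_out are invented for illustration, and the expected set will need adjusting for your book (accented letters in names, for example, would be flagged as odd).

```python
from collections import Counter
import string

# Characters considered normal for this sketch; adjust to suit your text.
EXPECTED = set(string.ascii_letters + string.digits + string.punctuation + " \n")
EXPECTED |= set("\u2018\u2019\u201c\u201d\u2013\u2014\u2026")  # curly quotes, dashes, ellipsis

def odd_characters(text):
    """Count every character that is not in the expected set."""
    return Counter(ch for ch in text if ch not in EXPECTED)

def blank_out(text, unwanted):
    """Replace each unwanted character with a blank space, never with nothing."""
    return text.translate({ord(ch): " " for ch in unwanted})

sample = "He paid \u00a3 five \u2022 pounds"
print(odd_characters(sample).most_common())
print(blank_out(sample, "\u00a3\u2022"))
```

The count is a survey, not a hit list. Plus signs, for instance, are ordinary keyboard characters and will not be flagged, so you still choose what to pass to blank_out yourself. Replacing with a space rather than nothing follows the advice above: extra spaces are easy to clean up later, joined words are not.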

A WORD OF WISDOM: Relax. This is a tedious process and imprecise. If you obsess about perfection at this point you will drive yourself nuts. Don’t bother going through the text word by word, line by line or even paragraph by paragraph. This is a BIG CLEAN. Suppose your car was wrecked and you took it to the body shop, and what if the first thing the mechanic did was whip out the wax and buffing wheel and start polishing the hood? Um, no. The first thing you do is pound out the dents and make sure the mechanical parts are in working order. The time for wax and polish is later. Right now, just focus on pounding out dents.

TIDY THE SPECIAL FORMATTING/ITALICS

If for some reason OCR conversion didn’t recognize your italics, you can skip this step. I’ll give you some tips later on how to fix that. If you do have italics, be aware that conversion and Word did a sloppy job of it. Use Find to search for your tags. You can search for either <i> or </i>. Use Find Next and go through the text, deleting any unnecessary tags (such as italicized blank spaces) and tidying the rest. Make sure if you delete either an open or closed tag, that you also delete its corresponding tag. Again, if you happen to see more bugshit while you’re doing this, fix that, too.

When you’re done tidying, use Find/Replace to turn the html tags into something that will not give word processors a case of the vapors. Turn <i> into -STARTI- and </i> into -ENDI-. The hyphens and all caps will help refine your search.
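
Here is a rough Python sketch of the tidy-then-swap idea, for those who like to double check their work. It assumes one paragraph per line and that italics never run across a paragraph break, so a paragraph with unequal numbers of opening and closing tags is flagged for a manual look. The function names are made up for the example.

```python
import re

def unbalanced_paragraphs(text):
    """Return (line number, paragraph) wherever <i> and </i> counts differ."""
    problems = []
    for number, para in enumerate(text.split("\n"), start=1):
        if para.count("<i>") != para.count("</i>"):
            problems.append((number, para))
    return problems

def tidy_and_swap(text):
    """Drop italic tags wrapped around nothing but spaces, then swap the tags."""
    # "<i> </i>" becomes a plain space; the whitespace itself is kept.
    text = re.sub(r"<i>(\s*)</i>", r"\1", text)
    return text.replace("<i>", "-STARTI-").replace("</i>", "-ENDI-")

sample = "She read <i>Dune</i><i> </i>twice.\nHe said <i>no."
print(unbalanced_paragraphs(sample))
print(tidy_and_swap("She read <i>Dune</i><i> </i>twice."))
```

Run the balance check first and fix what it finds by hand. Swapping the tags while a stray one is still loose only hides the problem under a new name.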

GET RID OF EXTRA SPACES

Use Find/Replace All to rid your text of extra spaces.

  • In the Find field insert TWO blank spaces; in the Replace field insert ONE blank space. Click Replace All until it tells you it can find no more.
  • Make sure the “extended” circle is highlighted. In the Find field type \n with one blank space after it; in the Replace field type \n with no spaces. Click Replace All until it tells you it can find no more.
  • Make sure the “extended” circle is highlighted. In the Find field insert one blank space and \r; in the Replace field type \r with no spaces. Click Replace All until it tells you it can find no more.
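
The three searches above boil down to a few substitutions. As a sketch only, assuming the text uses ordinary spaces and that line endings can be normalized to \n first, the whole pass might look like this in Python (squeeze_spaces is a made-up name):

```python
import re

def squeeze_spaces(text):
    """Collapse runs of spaces and trim spaces next to line breaks."""
    # Treat Windows (\r\n) and bare \r line endings the same way.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Two or more spaces become one; a single pass replaces repeated Replace All clicks.
    text = re.sub(r" {2,}", " ", text)
    # No space at the start of a line.
    text = text.replace("\n ", "\n")
    # No space at the end of a line.
    text = text.replace(" \n", "\n")
    return text

sample = "One  two   three. \n Four    five.\r\n Six"
print(squeeze_spaces(sample))
```

Collapsing the runs first matters: once every run is down to a single space, one pass each is enough for the spaces on either side of a line break.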

Congratulations. Have a drink or a piece of dark chocolate. You deserve it. You have repaired the worst damage to your text. If the text editor is driving you nuts, you can stop using it now. In my next blog post, Part 2 of the Big Clean, I will take you back to Word so you can finish the job in a more comfortable environment. If by chance you are intrigued by the possibilities for some powerhouse searches and find/replace functions to clean up issues specific to your project, ask about them in the comments and let’s see if we can come up with a solution for you.

Restore Your Back List Books: Step 1: Scan and Convert

As I write this I have around two million words’ worth of back list books sitting on my desk, awaiting conversion from print into ebooks. In the past week alone I have scanned, converted and restored over 400K words to the stage where I can send the doc files to the writer for proofreading.

Tedious, yes. Daunting, perhaps. Expensive, sometimes. Impossible and difficult, no way. Writers with back list, please, if you have gotten the rights back to your work, don’t let either expense or the thought of so much work stop you from bringing your back list back to life and reissuing it as either ebooks or print-on-demand or both.

Summertime is a fabulous time for restoring back list. Especially for the do-it-yourselfer, since you can take your laptop out on the deck and do the tedious work while working on your tan. (I like to queue up oddball indie films on Netflix and semi-watch and semi-listen to them while I’m working.) Over the next few blog posts, I’ll take you step-by-step through the process.

Understand, this process ranges from very expensive (having someone else do ALL the work for you) to no-cash-outlay at all (takes time). One way I save writers money–and time–is by doing the scanning, conversion and gross restoration (which I can do in hours) then sending them a Word doc in manuscript format so they can do the fine tuning and proofreading. It’s still tedious, but it’s not rip-your-hair-out frustrating.

A word of caution: There are some services that promise to scan, convert and turn your print book into an ebook, all for one very low price. This is the process used by many of the big publishing houses and this is why so many of their (your!) ebooks are broken, ugly, and riddled with formatting errors and typos. Research those services extensively. If there is any hint that they convert pdf files into ebooks, walk away. Run away! There is the right way to do this and there is the super-speed, el-cheapo, don’t give a shit about the quality of product way–and nothing in between.

This is the process for the RIGHT way:

  1. Scan the book into a pdf file
  2. Convert the pdf using OCR into a document file
  3. Gross restoration: remove headers, footers, page numbers, and bugshit produced when conversion “reads” speckles, debris, foxing, watermarks or penciled notations as characters; restore paragraphs; restore special formatting such as italics or bolded text; remove all formatting artifacts embedded by the pdf AND the word processor.
  4. Fine tune and proofread.
  5. Format the fully restored text for either digital or print-on-demand.
  6. Proofread the ebook and/or print-on-demand.

Skip any of the above steps and you’ll end up with a substandard product that is disrespectful to your written work AND to your readers. There is no way to skip any of those steps and turn out a great product. I can, however, share quite a few tricks and tips that will make the process easier for you.

STEP 1: SCAN AND CONVERT

Two ways to do this.

SOMEONE ELSE: If you do a Google search for “book scanning services” you will turn up hundreds of companies that will scan and convert your printed book into a workable document file. Or, you can run down to your local office supply store (Kinko’s or Staples) and they will do the job while you wait and give you a CD or thumbdrive containing your file to take home. Prices are all over the board. I recommend you budget $100. Chances are, the job can be done far more cheaply than that, and you can use your change to have a really nice lunch while you’re waiting for your book to be scanned.

DO-IT-YOURSELF: It is possible you have everything you need already to scan and convert your books.

  • X-acto knife or paper cutter
  • Scanner
  • External storage device or cloud service
  • Conversion program

“X-acto knife? Paper cutter? Jaye, what are you talking about?”

To easily scan your books, you will need to take them apart. The easiest way to do this is to run down to the office supply store and have them chop off the spines. They’ll charge you a couple of bucks and it only takes minutes. One BIG caution here. If your mass market paperback is decades old (or sometimes, only a few years old, depending on how cheap-o the original publisher was) the paper could be badly degraded to the point where any rough handling can tear it, crinkle or shred pages, or even break off chunks. The best way to cut off their spines is by hand–gently. I use a metal ruler and an X-acto knife (I buy blades in bulk, so I always have fresh blades). If you want to do this at home, a good paper cutter (available at any hobby and craft store) will do the job nicely. (This is also a good job for a bored kid–“Mom, I have noooothing to do!” “Here, darling, chop the spine off this book.”)

It takes me about ten minutes to despine a fragile old paperback by hand. Not a big deal.

What if it’s a rare hardcover and you don’t want it chopped and destroyed? That is going to cost you–even if you do it yourself. You will have to copy each page (one page to a sheet, please–doing it two-up will turn into a restoration nightmare), then scan the copies. Nice thing about this is, though, if you use a heavy weight bond copy paper (at least 20#) you can run the sheets through a high speed scanner and it’ll take minutes instead of hours.

IMPORTANT TIP: If you’re chopping the book apart yourself, make sure you remove ALL the binding glue. It can jam your scanner or copier, or even melt into the works.

What if you don’t have a scanner? Double check because you just might. Most printers sold these days are multi-purpose: print, copy, scan, fax. If you don’t have a scanner, it might be cost effective to invest in one. For less than $200 you can get a really good multi-purpose printer. (My home multi-purpose printer was on sale for under $150 and it will do double-sided scans in bulk at a pretty good clip–ain’t technology grand?)

You want to output your scans as pdf files. And those are huge. Hence, you’ll want either an external storage device (such as a flashdrive or an external hard drive) or a cloud service (such as Dropbox). It will make handling the files ever so much easier and keep your computer from having hissy fits and being draggy.

QUICK TIP: Rubber bands. Keep a good supply on hand. Cats, kids, open windows, fans, a careless hand wave, and there go all those pages you cut apart. Old paperback pages are so flimsy they’ll glide under furniture. Keep your work banded and save yourself some headaches.

IMPORTANT TIP: Always do a test run with the front or back matter before you run pages through a sheet feeder or a high-speed scanner. Fragile, flimsy, brittle paper can be eaten by the machine. Pages can twist and turn and wrinkle from the heat. Some books must be hand scanned on the bed, one sheet at a time.

Some useful things to know about scanning:

  • If your scanner allows it, scan in black and white. Your output files will be smaller and more readable.
  • Experiment with the resolution and go with the lowest resolution that gives you a workable scan. The higher the resolution, the bigger your files will be AND the greater the amount of speckling and debris the scan will pick up. The only time you need to scan at a high resolution is if your book has illustrations or photographs. In that case, you might want to do one run at a lower setting for the text, then do a high resolution scan of your images.
  • If the pages are so flimsy there is significant bleed-thru from the opposing pages, you will need to scan them via the bed (rather than the sheet feeder). Use a sheet of black card stock as a backer and that will reduce or eliminate the bleed-thru.

CONVERSION

The very best program I have found is Adobe Acrobat XI. Not only will it compile all your files (if you have to hand scan the pages, you could end up with hundreds of individual files), but it will quickly and (fairly) cleanly convert the pdf into a workable Word document. It’s a bit pricy and not a program for a person doing one or two jobs. If you have an extensive back list and intend to do the restoration yourself, then it is worth the investment because it will save you tons of time. Some people use it for creating print-on-demand books, too.

There are also hundreds of programs (many as free downloads) and online services (also, many that are free) that will convert your pdf/s into a workable document. Do a Google search for “pdf conversion” and you’ll have a wide variety to choose from.

IMPORTANT TIP: Results will vary. Before you download any program or pay for a subscription or use an online service, test a few pages and see how they look. NO OCR conversion will produce perfect results, but some conversions are much, MUCH better than others and therefore much easier for you to restore the text back to its original glory. It’s worth an hour or so of your time to find the best one for you.

There you go. Your book is scanned and converted and ready for restoration. You all are lucky in that I’ve learned a lot from doing a lot and I’ll save you a LOT of fumbling around with my many tips and tricks. Watch this space for the next post: STEP 2: Gross restoration.