Restore Your Back List Books: Step 1: Scan and Convert

As I write this I have around two million words’ worth of back list books sitting on my desk, awaiting conversion from print into ebooks. In the past week alone I have scanned, converted and restored over 400K words to the stage where I can send the doc files to the writer for proofreading.

Tedious. Yes. Daunting, perhaps. Expensive, sometimes. Impossible and difficult, no way. Writers with back list, please, if you have gotten the rights back to your work, don’t let either expense or the thought of so much work stop you from bringing your back list back to life and reissuing it as either ebooks or print-on-demand or both.

Summertime is a fabulous time for restoring back list. Especially for the do-it-yourselfer, since you can take your laptop out on the deck and do the tedious work while working on your tan. (I like to queue up oddball indie films on Netflix and semi-watch and semi-listen to them while I’m working.) Over the next few blog posts, I’ll take you step-by-step through the process.

Understand, this process ranges from very expensive (having someone else do ALL the work for you) to no-cash-outlay at all (takes time). One way I save writers money–and time–is by doing the scanning, conversion and gross restoration (which I can do in hours) then sending them a Word doc in manuscript format so they can do the fine tuning and proofreading. It’s still tedious, but it’s not rip-your-hair-out frustrating.

A word of caution: There are some services that promise to scan, convert and turn your print book into an ebook, all for one very low price. This is the process used by many of the big publishing houses and this is why so many of their (your!) ebooks are broken, ugly, and riddled with formatting errors and typos. Research those services extensively. If there is any hint that they convert pdf files into ebooks, walk away. Run away! There is the right way to do this and there is the super-speed, el-cheapo, don’t give a shit about the quality of product way–and nothing in between.

This is the process for the RIGHT way:

  1. Scan the book into a pdf file
  2. Convert the pdf using OCR into a document file
  3. Gross restoration: remove headers, footers, page numbers, and bugshit produced when conversion “reads” speckles, debris, foxing, watermarks or penciled notations as characters; restore paragraphs; restore special formatting such as italics or bolded text; remove all formatting artifacts embedded by the pdf AND the word processor.
  4. Fine tune and proofread.
  5. Format the fully restored text for either digital or print-on-demand.
  6. Proofread the ebook and/or print-on-demand.

Skip any of the above steps and you’ll end up with a substandard product that is disrespectful to your written work AND to your readers. There is no way to skip any of those steps and turn out a great product. I can, however, share quite a few tricks and tips that will make the process easier for you.
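To give a feel for what the gross restoration in step 3 involves, here is a minimal Python sketch that drops lines consisting only of a page number or a running header. It is an illustration, not the method I use (I do this with Find/Replace). The function name, the sample page, and the header titles are all made up for the example.

```python
import re

def strip_page_furniture(lines, running_headers):
    """Drop lines that are only a page number or a known running header."""
    headers = {h.strip().lower() for h in running_headers}
    kept = []
    for line in lines:
        text = line.strip()
        if re.fullmatch(r"\d{1,4}", text):
            continue  # a bare number on its own line: treated as a page number
        if text.lower() in headers:
            continue  # matches one of the running headers supplied by the caller
        kept.append(line)
    return kept

# Hypothetical page as it might come out of a conversion.
page = [
    "THE LOST HEIR",
    "She opened the door and",
    "stepped into the hall.",
    "127",
]
cleaned = strip_page_furniture(page, ["The Lost Heir", "Jane Example"])
print(cleaned)
```

A sketch like this will also remove a year or any other number that happens to sit alone on a line, so the result still needs a human eye.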

STEP 1: SCAN AND CONVERT

Two ways to do this.

SOMEONE ELSE: If you do a Google search for “book scanning services” you will turn up hundreds of companies that will scan and convert your printed book into a workable document file. Or, you can run down to your local office supply store (Kinko’s or Staples) and they will do the job while you wait and give you a CD or thumbdrive containing your file to take home. Prices are all over the board. I recommend you budget $100. Chances are, the job can be done far more cheaply than that, and you can use your change to have a really nice lunch while you’re waiting for your book to be scanned.

DO-IT-YOURSELF: It is possible you have everything you need already to scan and convert your books.

  • X-acto knife or paper cutter
  • Scanner
  • External storage device or cloud service
  • Conversion program

“X-acto knife? Paper cutter? Jaye, what are you talking about?”

To easily scan your books, you will need to take them apart. The easiest way to do this is to run down to the office supply store and have them chop off the spines. They’ll charge you a couple of bucks and it only takes minutes. One BIG caution here. If your mass market paperback is decades old (or sometimes, only a few years old, depending on how cheap-o the original publisher was) the paper could be badly degraded to the point where any rough handling can tear it, crinkle or shred pages, or even break off chunks. The best way to cut the spines off those books is by hand–gently. I use a metal ruler and an X-acto knife (I buy blades in bulk, so I always have fresh blades). If you want to do this at home, a good paper cutter (available at any hobby and craft store) will do the job nicely. (This is also a good job for a bored kid–“Mom, I have noooothing to do!” “Here, darling, chop the spine off this book.”)

It takes me about ten minutes to despine a fragile old paperback by hand. Not a big deal.

What if it’s a rare hardcover and you don’t want it chopped and destroyed? That is going to cost you–even if you do it yourself. You will have to copy each page (one page to a sheet, please–doing it two-up will turn into a restoration nightmare), then scan the copies. Nice thing about this is, though, if you use a heavy weight bond copy paper (at least 20#) you can run the sheets through a high speed scanner and it’ll take minutes instead of hours.

IMPORTANT TIP: If you’re chopping the book apart yourself, make sure you remove ALL the binding glue. It can jam your scanner or copier, or even melt into the works.

What if you don’t have a scanner? Double check because you just might. Most printers sold these days are multi-purpose: print, copy, scan, fax. If you don’t have a scanner, it might be cost effective to invest in one. For less than $200 you can get a really good multi-purpose printer. (My home multi-purpose printer was on sale for under $150 and it will do double-sided scans in bulk at a pretty good clip–ain’t technology grand?)

You want to output your scans as pdf files. And those are huge. Hence, you’ll want either an external storage device (such as a flashdrive or an external hard drive) or a cloud service (such as Dropbox). It will make handling the files ever so much easier and keep your computer from having hissy fits and being draggy.

QUICK TIP: Rubber bands. Keep a good supply on hand. Cats, kids, open windows, fans, a careless hand wave, and there go all those pages you cut apart. Old paperback pages are so flimsy they’ll glide under furniture. Keep your work banded and save yourself some headaches.

IMPORTANT TIP: Always do a test run with the front or back matter before you run pages through a sheet feeder or a high-speed scanner. Fragile, flimsy, brittle paper can be eaten by the machine. Pages can twist and turn and wrinkle from the heat. Some books must be hand scanned on the bed, one sheet at a time.

Some useful things to know about scanning:

  • If your scanner allows it, scan in black and white. Your output files will be smaller and more readable.
  • Experiment with the resolution and go with the lowest resolution that gives you a workable scan. The higher the resolution, the bigger your files will be AND the greater the amount of speckling and debris the scan will pick up. The only time you need to scan at a high resolution is if your book has illustrations or photographs. In that case, you might want to do one run at a lower setting for the text, then do a high resolution scan of your images.
  • If the pages are so flimsy there is significant bleed-thru from the opposing pages, you will need to scan them via the bed (rather than the sheet feeder). Use a sheet of black card stock as a backer and that will reduce or eliminate the bleed-thru.
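To see why resolution and color mode matter so much for file size, here is a rough back-of-the-envelope sketch in Python. It computes the uncompressed size of a single page image. The page dimensions are an assumption (roughly a mass market paperback), and real pdf files are compressed, so treat the numbers as a way to compare settings rather than a prediction of what your scanner will produce.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Assumed page size: 4.25 x 7 inches, roughly a mass market paperback.
for dpi in (150, 300, 600):
    bw = raw_scan_bytes(4.25, 7, dpi, 1)      # black and white, 1 bit per pixel
    color = raw_scan_bytes(4.25, 7, dpi, 24)  # full color, 24 bits per pixel
    print(f"{dpi} dpi: B&W {bw / 1e6:.2f} MB, color {color / 1e6:.2f} MB")
```

Doubling the resolution quadruples the number of pixels, and 24-bit color carries 24 times the data of a 1-bit black and white scan at the same resolution.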

CONVERSION

The very best program I have found is Adobe Acrobat XI. Not only will it compile all your files (if you have to hand scan the pages, you could end up with hundreds of individual files), but it will quickly and (fairly) cleanly convert the pdf into a workable Word document. It’s a bit pricy and not a program for a person doing one or two jobs. If you have an extensive back list and intend to do the restoration yourself, then it is worth the investment because it will save you tons of time. Some people use it for creating print-on-demand books, too.
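One small thing to watch if you do end up with hundreds of individual files: a plain alphabetical sort puts page10 ahead of page2, and the pages need to be in reading order before they are combined. The Python sketch below shows one way to sort file names by page number. The file names are hypothetical, and this has nothing to do with how Acrobat itself orders files.

```python
import re

def natural_key(name):
    """Split a file name into text and number chunks so 'page10' sorts after 'page9'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]

# Hypothetical file names from a hand-scanning session.
scans = ["page10.pdf", "page2.pdf", "page1.pdf", "page11.pdf"]
print(sorted(scans))                   # plain alphabetical order
print(sorted(scans, key=natural_key))  # numeric page order
```

Naming your scans with leading zeros (page001, page002) sidesteps the problem entirely.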

There are also hundreds of programs (many as free downloads) and online services (also, many that are free) that will convert your pdf/s into a workable document. Do a Google search for “pdf conversion” and you’ll have a wide variety to choose from.

IMPORTANT TIP: Results will vary. Before you download any program or pay for a subscription or use an online service, test a few pages and see how they look. NO OCR conversion will produce perfect results, but some conversions are much, MUCH better than others and therefore much easier for you to restore the text back to its original glory. It’s worth an hour or so of your time to find the best one for you.
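If you want something more objective than eyeballing the test pages, one option is to hand-type a short passage and score each converter’s output against it. The Python sketch below does that with the standard library’s difflib. The converter names and sample text are invented for the example, and the score is only a rough guide to how much clean-up you are in for.

```python
from difflib import SequenceMatcher

def similarity(reference, ocr_text):
    """Return a 0-to-1 score for how closely OCR output matches hand-typed text."""
    # Collapse whitespace so line breaks do not count against the converter.
    ref = " ".join(reference.split())
    ocr = " ".join(ocr_text.split())
    return SequenceMatcher(None, ref, ocr, autojunk=False).ratio()

# Hand-typed passage, plus the same passage from two hypothetical converters.
reference = "She opened the door and stepped into the hall."
candidates = {
    "converter_a": "She opened the door and\nstepped into the hall.",
    "converter_b": "5he opened tbe d00r and\nstepped int0 the ha11.",
}
scores = {name: similarity(reference, text) for name, text in candidates.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

The higher the score, the closer that conversion came to the hand-typed passage.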

There you go. Your book is scanned and converted and ready for restoration. You all are lucky in that I’ve learned a lot from doing a lot and I’ll save you a LOT of fumbling around with my many tips and tricks. Watch this space for the next post: STEP 2: Gross restoration.

 

 


Scan, OCR and Restore BackList Books

This week I read a comment on a blog (can’t remember where–sorry) where a writer said she was putting off reissuing her backlist titles because she didn’t have accessible computer files for them and so she’d have to scan the actual books, run them through an OCR program and format them. She didn’t know how to do that.

I hear ya, sister. A few months ago I’d have nodded in agreement, and said, “Yep, too hard, too time-consuming, too expensive.” Now, however, having spent the past few months restoring nearly two dozen old paperback books from scans and turning them into ebooks, I know it’s NOT too hard, it IS time-consuming, and the cost can range from dollars per page (expensive) to FREE (DIY option).

(Another option is to retype the book, but quite frankly, folks, unless you are a super-typist with wrists of steel–which I most certainly am not–that is a daunting proposition.)

You know me. Somebody sez, “Can you do this?” and I reply, “How hard can it be?” Then I bumble and fumble around until I figure out how to do it. Then I come on here and am able to give you some tips that mean you can skip the bumbling and fumbling part. Unless you enjoy b&f. In that case, you can stop reading this post.

This is for the Do-It-Yourselfers.

SCANNING

Do a Google search for “scanning books” and the results will include thousands of services that will take your old books or manuscripts and turn them into pdf or doc files. Some services will scan the book without harming the binding, some will chop off the spine, destroying the book. Prices range from per-page costs to flat-rate. I haven’t used any of those services, so I can’t recommend any of them. You’ll have to do your own research.

You can also take your old books or manuscripts to a copy store such as Fed-Ex/Kinkos or a full-service office supply store such as Staples, and either do it yourself on their equipment or have them do it for you.

If you happen to own a scanner, you can do it at home. This is the insane option because quite frankly most home scanners are ridiculous beasts that take their sweet time (I know this because I had to try it myself just to see and so scanned a nearly 300 page manuscript–easy on the hands, tough on the buttocks. It took hours!) If you are home-scanning actual pages from a paperback, you will have to play with the settings on your scanner because most are at their best scanning photos and that resolution is far too high to get good results. Best results are achieved if you copy the pages onto good quality 20# or 24# copy paper and then scan the copies.

However you choose to have your book/manuscript scanned, my recommendation is to have the scanner turn it into a pdf file. There are services and programs that will do the OCR conversion during the scan and produce a .doc, .docx or .rtf file for you. On the surface, it looks like a bargain. I think it’s dangerous because: 1) the file you receive will be huge and bloated and junked up with tons of coding that can severely mess up your ebook; 2) it will not save you any work during clean-up and in some ways it makes clean-up more of a chore; 3) it could give you a false sense of security that your file is cleaner than it actually is and your ebook could end up like so many that are on my Kindle right now, full of formatting errors and gibberish.

Here is a file that has been scanned and converted at the same time:

Here is a file that has been DIY scanned and turned into a .doc file:

It’s a big mess, too, but there are actually fewer dangerous formatting issues you will have to address. Awful as it looks, this example is easier to clean up and turn into an ebook than the first example. So save your money (and a few headaches) and run the pdf through the OCR program yourself.

OCR

Scanned PDF files are image files. Pictures of a page. In order to clean up and format the pages they must be converted into text. That’s where OCR comes in–Optical Character Recognition.

I found a nifty little program called FreeOCR. It’s a free program you download onto your computer. It’s a powerful program with a few bells and whistles–none of which I recommend you use. This is a case where the more you automate the process, the worse your results will be. There is no good substitute for the human eye and human instincts when it comes to restoring a document file. You’re better off in the long run doing a basic OCR conversion. That means, open the FreeOCR program, open a pdf file, then render it page by page (depending on the size of the file and the density of the type, a complete book will take between 20 minutes and an hour).

The original scanned page is on the left, the OCR conversion is on the right. You can see what a mess it is. That’s because the OCR is very efficient. It turns not only images of text into text, it turns water stains, wrinkles, shadows, and debris embedded in the paper into text, too. If there are notes in the margin, it will try to turn that into text. A basic scan also inserts a hard paragraph return at the end of every line, gets rid of paragraph indents and destroys special formatting such as bold and italics (the first time I saw this I totally freaked out). Some things convert more cleanly than others. If you’re converting a decades-old paperback where the pages have yellowed and degraded, the conversion will be a HUGE mess.

But not a hopeless mess.

CLEAN UP

FreeOCR gives you an option of saving your rendered document as a Word file. You can do that and clean up your file in Word. There is a much easier, faster and more efficient way. Use a text editor (with a little eventual help from Word). I use Notepad++, a program you can download for free. Save your OCR rendering into the clipboard (or do a right click, Select All/Copy) and paste it into the text editor.

Whether you use Word or a text editor, this is the time-consuming part of the process. And there’s no help for it. If you want a good-looking ebook, you need to make your converted file squeaky clean. (Your other option is hiring someone to do it for you. BUT–and this is a huge but–you have to make sure the service you hire is NOT automating the process, but that there is instead an actual human being going through the book word by word and restoring the text. Those automated programs are powerful and they do a good job on some projects, but I have ebooks I have purchased on my Kindle right now that are unreadable messes due to those programs.)

I have learned a few things to make the job go faster and more efficiently.

  1. Save restoring the paragraphs for last. Take a look at the image of the OCR conversion in Word. I toggled on the Show/Hide feature so you can see how every line has a paragraph return. What you see is the layout from the printed book. That can help during clean up.
  2. Work off the actual pages. Either have the actual book in front of you or split your computer screen and have the pdf file open to the scanned pages. That way if the OCR mangled the text, you can retype a word or line from the actual copy instead of trying to guess what it is supposed to say. You can also tag special formatting such as italics as you go along.
  3. Use Find/Replace.

The text will be full of oddball characters (I call them bug shit). Things like degree symbols, floating quote marks, greater and less than characters, slashes, tildes. If something doesn’t belong in your text file–Find/Replace All gets rid of it. You can also use it to get rid of headers, footers and page numbers. Once you have the text cleaned up, you can use Find/Replace All to get rid of extra paragraph returns, restore the proper paragraphs and un-hyphenate any words that had been split in the printed version. (BONUS TIP: Before you get rid of the extra paragraph returns use Find/Replace to add an extra space at the end of each line. That keeps words from being joined and makes it easier to find the hyphens you want to get rid of.)
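For anyone comfortable with a little scripting, the same Find/Replace passes can be sketched in Python. This is an illustration of the idea, not a replacement for checking the text by eye. It makes two assumptions that are mine, not FreeOCR’s: that the characters in DEBRIS never belong in your book, and that you have left a blank line at each real paragraph break.

```python
import re

# Characters treated as OCR debris here; adjust the set for your own book.
DEBRIS = re.compile(r"[°~<>/\\|^]")

def clean_and_rejoin(ocr_text):
    """Strip stray characters, then rebuild paragraphs from hard-wrapped lines."""
    paragraphs = []
    # Assumption: a blank line marks each real paragraph break.
    for block in re.split(r"\n\s*\n", ocr_text.strip()):
        lines = [DEBRIS.sub("", line).strip() for line in block.splitlines()]
        # The bonus tip: a space at the end of each line keeps words apart.
        joined = "".join(line + " " for line in lines if line)
        # A hyphen followed by that added space marks a word split across lines.
        joined = re.sub(r"(\w)- (\w)", r"\1\2", joined)
        paragraphs.append(re.sub(r" {2,}", " ", joined).strip())
    return "\n".join(p for p in paragraphs if p)

# Hypothetical OCR output with debris and a word split across two lines.
sample = (
    "She opened the ~door and step-\n"
    "ped into the hall.°\n"
    "\n"
    "No one was there.\n"
)
print(clean_and_rejoin(sample))
```

Be aware that this blindly rejoins every word split at a line end, so a genuinely hyphenated word that happened to break at its hyphen (time-consuming, say) will lose the hyphen. That is one more reason to work with the actual pages in front of you.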

So, yes, this is time-consuming, but it is not hard nor does it have to be expensive. It is definitely worthwhile to get your backlist back in circulation.