Scan, OCR and Restore BackList Books

This week I read a comment on a blog (can’t remember where–sorry) where a writer said she was putting off reissuing her backlist titles because she didn’t have accessible computer files for them and so she’d have to scan the actual books, run them through an OCR program and format them. She didn’t know how to do that.

I hear ya, sister. A few months ago I’d have nodded in agreement, and said, “Yep, too hard, too time-consuming, too expensive.” Now, however, having spent the past few months restoring nearly two dozen old paperback books from scans and turning them into ebooks, I know it’s NOT too hard, it IS time-consuming, and the cost can range from dollars per page (expensive) to FREE (DIY option).

(Another option is to retype the book, but quite frankly, folks, unless you are a super-typist with wrists of steel–which I most certainly am not–that is a daunting proposition.)

You know me. Somebody sez, “Can you do this?” and I reply, “How hard can it be?” Then I bumble and fumble around until I figure out how to do it. Then I come on here and am able to give you some tips that mean you can skip the bumbling and fumbling part. Unless you enjoy b&f. In that case, you can stop reading this post.

This is for the Do-It-Yourselfers.

SCANNING

Do a Google search for “scanning books” and you will find thousands of services that will take your old books or manuscripts and turn them into pdf or doc files. Some services will scan the book without harming the binding; some will chop off the spine, destroying the book. Prices range from per-page costs to flat-rate. I haven’t used any of those services, so I can’t recommend any of them. You’ll have to do your own research.

You can also take your old books or manuscripts to a copy store such as Fed-Ex/Kinkos or a full-service office supply store such as Staples, and either do it yourself on their equipment or have them do it for you.

If you happen to own a scanner, you can do it at home. This is the insane option because, quite frankly, most home scanners are ridiculous beasts that take their sweet time. (I know this because I had to try it myself just to see, and so scanned a nearly 300-page manuscript–easy on the hands, tough on the buttocks. It took hours!) If you are home-scanning actual pages from a paperback, you will have to play with the settings on your scanner, because most are at their best scanning photos and that resolution is far too high to get good results. Best results are achieved if you copy the pages onto good quality 20# or 24# copy paper and then scan the copies.

However you choose to have your book/manuscript scanned, my recommendation is to have the scanner turn it into a pdf file. There are services and programs that will do the OCR conversion during the scan and produce a .doc, .docx or .rtf file for you. On the surface, it looks like a bargain. I think it’s dangerous because: 1) the file you receive will be huge and bloated and junked up with tons of coding that can severely mess up your ebook; 2) it will not save you any work during clean-up and in some ways it makes clean-up more of a chore; 3) it could give you a false sense of security that your file is cleaner than it actually is and your ebook could end up like so many that are on my Kindle right now, full of formatting errors and gibberish.

Here is a file that has been scanned and converted at the same time:

Here is a file that has been DIY scanned and turned into a .doc file:

It’s a big mess, too, but there are actually fewer dangerous formatting issues you will have to address. Awful as it looks, this example is easier to clean up and turn into an ebook than the first example. So save your money (and a few headaches) and run the pdf through the OCR program yourself.

OCR

PDF files are image files. Pictures of a page. In order to clean up and format the pages they must be converted into text. That’s where OCR comes in–Optical Character Recognition.

I found a nifty little program called FreeOCR. It’s a free program you download onto your computer. It’s a powerful program with a few bells and whistles–none of which I recommend you use. This is a case where the more you automate the process, the worse your results will be. There is no good substitute for the human eye and human instincts when it comes to restoring a document file. You’re better off in the long run doing a basic OCR conversion. That means: open the FreeOCR program, open a pdf file, then render it page by page. (Depending on the size of the file and the density of the type, a complete book will take between 20 minutes and an hour.)

The original scanned page is on the left, the OCR conversion is on the right. You can see what a mess it is. That’s because the OCR is very efficient. It turns not only images of text into text, it turns water stains, wrinkles, shadows, and debris embedded in the paper into text, too. If there are notes in the margin, it will try to turn those into text. A basic scan also inserts a hard paragraph return at the end of every line, gets rid of paragraph indents and destroys special formatting such as bold and italics (the first time I saw this I totally freaked out). Some things convert more cleanly than others. If you’re converting a decades-old paperback where the pages have yellowed and degraded, the conversion will be a HUGE mess.
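If you want a rough idea of how much debris the OCR left behind before you start cleaning, one option is to count the characters you wouldn't expect to see in ordinary prose. Here is a minimal Python sketch of that idea. The function name `tally_oddballs` and the `EXPECTED` character set are my own assumptions for illustration (they have nothing to do with FreeOCR), and you would want to add curly quotes, dashes and accented letters to the set if your book uses them.

```python
from collections import Counter
import string

# Characters assumed to be "normal" in plain prose. This set is an
# assumption for illustration; extend it to suit your own book.
EXPECTED = set(string.ascii_letters + string.digits + " \n\t.,;:!?'\"()-")


def tally_oddballs(text, expected=EXPECTED):
    """Count every character in text that is not in the expected set."""
    return Counter(ch for ch in text if ch not in expected)
```

Calling `tally_oddballs(page_text).most_common()` on a pasted page lists the stray characters from most to least frequent, which gives you a shopping list of things to hunt down later.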

But not a hopeless mess.

CLEAN UP

FreeOCR gives you the option of saving your rendered document as a Word file. You can do that and clean up your file in Word, but there is a much easier, faster and more efficient way: use a text editor (with a little eventual help from Word). I use Notepad++, a program you can download for free. Copy your OCR rendering to the clipboard (or do a right click, Select All/Copy) and paste it into the text editor.

Whether you use Word or a text editor, this is the time-consuming part of the process. And there’s no help for it. If you want a good-looking ebook, you need to make your converted file squeaky clean. (Your other option is hiring someone to do it for you. BUT–and this is a huge but–you have to make sure the service you hire is NOT automating the process, but that there is instead an actual human being going through the book word by word and restoring the text. Those automated programs are powerful and they do a good job on some projects, but I have ebooks I have purchased on my Kindle right now that are unreadable messes due to those programs.)

I have learned a few things to make the job go faster and more efficiently.

  1. Save restoring the paragraphs for last. Take a look at the image of the OCR conversion in Word. I toggled on the Show/Hide feature so you can see how every line has a paragraph return. What you see is the layout from the printed book. That can help during clean up.
  2. Work off the actual pages. Either have the actual book in front of you or split your computer screen and have the pdf file open to the scanned pages. That way if the OCR mangled the text, you can retype a word or line from the actual copy instead of trying to guess what it is supposed to say. You can also tag special formatting such as italics as you go along.
  3. Use Find/Replace.

The text will be full of oddball characters (I call them bug shit). Things like degree symbols, floating quote marks, greater and less than characters, slashes, tildes. If something doesn’t belong in your text file–Find/Replace All gets rid of it. You can also use it to get rid of headers, footers and page numbers. Once you have the text cleaned up, you can use Find/Replace All to get rid of extra paragraph returns, restore the proper paragraphs and un-hyphenate any words that had been split in the printed version. (BONUS TIP: Before you get rid of the extra paragraph returns, use Find/Replace to add an extra space at the end of each line. That keeps words from being joined and makes it easier to find hyphens you want to get rid of.)
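For anyone who would rather script these Find/Replace passes than click through them, here is a minimal Python sketch of the same sequence. It is not how Notepad++ or Word does it, just one way to mimic the steps. All of the function names are hypothetical, the `DEBRIS` list is an assumption you would tailor to each book, and the sketch assumes you have already marked the true paragraph breaks with a blank line while working through the pages.

```python
import re

# Stray characters to delete everywhere. This list is an assumption;
# make sure none of them legitimately appear in your book.
DEBRIS = "°~<>\\/"


def strip_debris(text, debris=DEBRIS):
    """Delete every occurrence of the listed stray characters."""
    return text.translate({ord(ch): None for ch in debris})


def drop_page_numbers(text):
    """Remove lines that contain nothing but a one- to four-digit number."""
    lines = text.split("\n")
    kept = [ln for ln in lines if not re.fullmatch(r"\s*\d{1,4}\s*", ln)]
    return "\n".join(kept)


def rejoin_paragraphs(text):
    """Join hard-wrapped lines; a blank line is assumed to mark a real break."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if not lines:
            continue
        # Bonus tip: a space after every line keeps words from running together.
        joined = "".join(ln + " " for ln in lines)
        # A hyphen followed by that space is where a word was split in print.
        joined = re.sub(r"(?<=\w)- (?=\w)", "", joined)
        paragraphs.append(joined.strip())
    return "\n".join(paragraphs)


def clean_ocr_text(text):
    """Run the three passes in the order described above."""
    return rejoin_paragraphs(drop_page_numbers(strip_debris(text)))
```

Run `clean_ocr_text()` on a copy of your file, never the only one. The un-hyphenating step can't tell a word split at the end of a line from a genuinely hyphenated word that happened to land there ("well-" / "known" would come out as "wellknown"), so you still need to proofread against the actual pages.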

So, yes, this is time-consuming, but it is not hard nor does it have to be expensive. It is definitely worthwhile to get your backlist back in circulation.

 

19 thoughts on “Scan, OCR and Restore BackList Books”

  1. Actually all of my backlist titles are on disk. Unfortunately, some of those disks are 5.25″ floppies. 3.5″ floppies I can restore, 5.25″ that have been stored for the past twenty years in an attic over the garage, not so much. There’s a lesson in there somewhere. Heh.

  2. There’s one more option – an option you may not like or consider – but available nonetheless: one of the programs which converts spoken words into text, such as Dragon Dictate.

    Instead of typing like mad, you would read your book into their microphone/headset combination.

    It would give you a lot of practice using the dictation program, and would leave your Dragon well trained to recognize your voice. I have found it reasonably competent at interpreting what I say – and the text is there to compare. Possibly you could hire someone to read the book in?

    The disadvantage is that, at least with the version I have (the new one is too expensive for me to update to), you have to speak the punctuation. It gets automatic very quickly, but it is a drawback.

    About those 5.25″ floppies – there are probably some real nerds out there who can still read them for you, unless the disks are physically damaged and won’t go into a reader. The information density on them was very low by modern standards – that may actually help recover them. When I was in grad school, my fiance showed me you could actually put magnetic filings on magnetic tape and SEE the 1s and 0s – and taught me how to read a magnetic tape bit by bit. And I mean bit (0 or 1).

    Some people, once they have learned to dictate, do really well and go very fast with the dictation software – and it is a nice skill to have if you develop carpal tunnel syndrome, or, like a friend’s husband, detach your biceps muscle from your shoulder and have to have it surgically reattached – and can’t type for months.

    • Hi, ABE. I actually have Dragon and use it–not as much as I should because I do have carpal tunnel syndrome and I have to be careful about binge typing. I actually began dictating one of my books and doing a fair job of “teaching” Dragon my way of talking… then allergy season hit. Dragon did not like me sneezing, snuffling, clearing my throat, coughing and snorting. But you are absolutely right. Dragon software is wonderful. It takes some getting used to, but I know several people who swear by it.

      As for those ancient floppies, bah. I wrote off the few I could find. I’m sure there are geniuses who could recover data from them, but if I remember correctly each one only held about two or three chapters of manuscript, which means that for each book, I’d have to find a bunch of disks. Much easier to scan the old books. They have machines that can do an entire book in about five minutes and poof, there’s a pdf file.

  3. And people wonder why I held on to my old 5.25″ floppy drive. All I need is a computer with an IDE-compatible bus, & I can read those old floppies…

  4. (Not) Surprisingly, I’ve got a computer here with a 5.25″ drive and a 3.5″ drive. I don’t know if it works, though. I have some 5.25″ disks around and could test it… after getting it set back up. ;)

    By the way, Jaye, is there any way to automatically subscribe to the comments? I like getting the new posts in my e-mail, and I’d like to get the comments that way, too. Automatically.

    • You can get email notifications if you click the little box below the reply box that says notify. Have to remember to click it before you Post Comment, otherwise the box disappears.

      I have my old books scanned, so the floppies are moot now. Live and learn, though. Now I back up on external drives and in the cloud. I used to think having printed copies was good enough. Silly me. Silly silly silly…

      • Hi Jaye:

        I do remember to click on the e-mail notifications check box when I post replies, but I was hoping that there was some way I had yet to ascertain that would send all replies to every new journal post as well. That way, I keep up on the thread without having to post a reply that most likely would be meaningless twaddle. Oh well. ;)

  5. No, oh well, me, Jon! You made me realize I had my follow blog widget kind of hidden. It is now up top where it belongs and you can sign up to follow the blog via email. Sorry about that.

  6. Most enlightening, informative, and helpful! Thank you. Now if you could please get Microsoft and Google to follow your advice. I recently obtained some electronic copies of books which I wanted to read but not necessarily to pay antiquarian market rates to own. They were marked all over that they were scanned by one or the other of these companies, who ought to be ashamed of the unreadable mess which resulted from (apparently) simply scanning and posting. They were akin to something a dog would wisely leave outside and a cat would attempt to cover up after ejecting. Clearly those involved in the project to digitize books at Microsoft and Google don’t actually care that their product is garbage…

  7. Sorry, Chris. Frustrating. I don’t know if anyone is making any money off those scanned public domain books or not. The thing is, it’s very easy to scan, convert and upload a document into a Kindle. (I just did it myself, uploading a non-formatted story onto my Kindle for proofreading.) Personally, I don’t know why anyone would want their name associated with a poorly done book. Maybe they think they are doing a public service and readers should be grateful for whatever they get. I don’t see it that way. I think it’s disrespectful to the books and the readers.

  8. Jaye is right when she says OCR introduces errors into your finished files. But that’s not the only sort of error the conversion process can introduce. In my case, another sort of error–one that looked like a proofreading error to my readers but wasn’t!–dealt one of my own backlist books a nearly-mortal blow.

    I was very ill the week I finished proofreading my backlist book, MISS GRANTHAM’S ONE TRUE SIN, and I accidentally uploaded to Kindle and Nook an older, working copy that was full of errors. Then I headed off to the hospital and only realized my mistake many days, several thousand downloads, and a few angry reviews later. :-/

    Formatting services can save you from making such mistakes!

    Jaye, I don’t think you offer scanning service, so I don’t think I’m out of line inserting a little advert here (but if you object, just delete, and I’ll have no hard feelings!):

    For a flat fee of $25 (the lowest price I know of!), I scan paperbacks and run them through OCR, returning them in .rtf or .pdf format (whichever you prefer). My OCR software is ~good~, and the .rtf produced is great to work from (I did my own 7 backlist books this way!) I turn scanning/OCR orders around in under 24 hours.

    How can I do this? Because I’m a graphic artist in my other life, and I own several hundred dollars worth of specialized equipment (rapid-scanner, OCR software, and book spine cutter).

    Why do I offer it at a price so low? Because I don’t believe in cheating people, and because I get lots of business that way. As it is, it’s good money considering that, after doing it so many times, it only takes me about a half hour now. At the same time, I’m providing a really useful and valuable service for my authors. It’s a win-win.

    References upon request.

    • Hi, Meylinda. Oh I hear ya about the problem of mixing up files. Check the dates, check the dates, check the dates. Le sigh…
      As for adverts, well, post again with a link. Nice to have a face with a service. Folks can check you out.

  9. Pingback: Boast Post: This Time It’s All About Me | J W Manus

  10. Meylinda, I wish I’d found your post a year ago. I have the rights back to 30 romances, had them scanned, and then began trying to clean them up. Paid to have ten of them done, couldn’t afford the rest, so trying to clean them up myself. Will now try Jaye’s method–if I can figure out how to turn the scanned files into PDF’s. Thanks to both of you for the info.

    • Hi, Bobby. Yet another romance writer DD1 likes better than me. :D Hi!

      The files you had scanned should be pdf files–unless you had them turned into doc files. If they are doc files, open them in your word processor and tag your italics. Then copy/paste the entire file into a text editor (I highly recommend Notepad++ –there’s a link to it in my cheat sheet page on tools and programs). Do your clean up in a text editor. You will get much better results. If the files are in pdf format, you can download FreeOCR (also linked in the cheat sheet) then render the pdf files into text files. It takes a little time, but it’s inexpensive. Again, copy/paste it into the text editor.

      If you have trouble figuring out the OCR reader or text editor, email me. I’ve picked up some nifty tricks.

      jayewmanus at gmail dot com
