Restore Paragraphs in an OCR Scan

Earlier, I wrote a post about DIY scanning, OCR rendering, and cleaning up your backlist books. It doesn’t have to be expensive and it’s not difficult to do. It does require patience, because cleaning up an OCR rendering takes time.

If you used FreeOCR (as I’d recommended), one thing you’ve probably noticed is that it inserts a hard return at the end of every single line. The first time I saw that I freaked out a bit. I envisioned having to go through the entire file, manually deleting those extra returns and restoring every paragraph. Then I discovered the hard returns actually help in cleaning up the file, because I can work line by line through the text, comparing it to the original material.

Once the text is cleaned up, the paragraphs do need to be restored. If you are using Notepad++ (a text editor that I highly recommend), you can use Find/Replace to do the job. The first step takes some time, but the actual restoration uses the power of Replace All and goes quickly.

Before you begin work on the file, do a Save As and work on the copy. That way if you mess up, the original is intact and you can easily start over.

STEP ONE: Insert an extra line between each “true” paragraph.

Work through the file and, wherever a real paragraph ends, press Enter to add a blank line. To keep an eye on what you are doing, toggle on the Show All Characters button. It’s in the toolbar and the icon looks like a blue pilcrow (paragraph symbol). It will display black boxes with [CR]–for carriage return–and [LF]–line feed–wherever there is a hard return.
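The Find/Replace steps in this post search for \r and \r\n, so they only work if the file really uses CR LF returns. For anyone who wants to double-check that outside Notepad++, here is a minimal Python sketch. The function name count_line_endings and the sample text are hypothetical, made up for illustration.

```python
def count_line_endings(text: str) -> dict:
    """Count each style of hard return found in the text."""
    crlf = text.count("\r\n")
    # A lone CR or LF is any occurrence that is not part of a CR LF pair.
    lone_cr = text.count("\r") - crlf
    lone_lf = text.count("\n") - crlf
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf}


# Hypothetical sample: two wrapped lines, a blank line, then a new paragraph.
sample = "first line\r\nsecond line\r\n\r\nnew paragraph\r\n"
print(count_line_endings(sample))
```

When reading a real file, open it with newline="" so Python does not translate the returns before you count them. If the count shows only LF returns, the \r searches in the steps will find nothing.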

Once you have an extra line between every true paragraph, you will need to insert an extra space at the end of every line. This way you won’t end up with joined words.

STEP TWO: Open the Find/Replace box and toggle on “extended”.
In the Find box type: \r
In the Replace box type: (space)\r
(don’t type out “space”; just tap the space bar once)
Do a Replace All

Now you are going to tag the places where you WANT a hard return.

STEP THREE: In the Find box type: \r\n(space)
(as before, tap the space bar once instead of typing “space”)
In the Replace box type: \r\n-N-
Do a Replace All

Now the step where you have to steel your nerves. Remove ALL the hard returns.

STEP FOUR: In the Find box type: \r\n
Leave the Replace box blank (no spaces either)
Do a Replace All.

Now you have one giant block of text with zero hard returns. But don’t freak out. Now you restore the proper paragraphs.

STEP FIVE: In the Find box type: -N-
In the Replace box type: \r\n
Do a Replace All.

Now your paragraphs are restored and there are no extra hard returns to be found. You will now need to get rid of those extra spaces at the end of each paragraph.

STEP SIX: In the Find box type: (space)\r
In the Replace box type: \r
Do a Replace All.
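
For anyone who would rather script it, the Find/Replace sequence above can be sketched in Python. This is only a sketch under assumptions, not something Notepad++ provides: the function name restore_paragraphs and the sample text are hypothetical, the file is assumed to use CR LF returns, and the tag -N- is assumed not to appear anywhere in the book.

```python
def restore_paragraphs(text: str) -> str:
    """Mirror the Replace All sequence, one replace per step."""
    # Add a space before every carriage return so words do not join.
    text = text.replace("\r", " \r")
    # Tag each blank line (now a single space) as a wanted hard return.
    text = text.replace("\r\n ", "\r\n-N-")
    # Remove ALL hard returns.
    text = text.replace("\r\n", "")
    # Turn the tags back into hard returns.
    text = text.replace("-N-", "\r\n")
    # Final clean-up: drop the extra space at the end of each paragraph.
    text = text.replace(" \r", "\r")
    return text


# Hypothetical sample: two paragraphs, each wrapped over two lines,
# with the blank line from step one between them.
sample = (
    "It was a dark\r\n"
    "and stormy night.\r\n"
    "\r\n"
    "The rain fell\r\n"
    "in torrents.\r\n"
)
print(repr(restore_paragraphs(sample)))
```

As in Notepad++, the very last paragraph loses its closing hard return and keeps its trailing space, because no return follows it for the final replace to match.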

That’s it. Except for the first step where you have to insert an extra line between each real paragraph, explaining this takes longer than doing it. This method is a whole lot easier than manually deleting the unwanted hard returns.

Have fun!


15 thoughts on “Restore Paragraphs in an OCR Scan”

      • Jaye,

        This is very helpful, as it is exactly what I am trying to do–create ebooks out of “extra” favorite paperbacks I have. Tear apart, scan, etc. But I am stuck on Step 3: find \r\n(space). I have typed this in the Find space and \r\n-N- in Replace–but when I hit Replace it says it can’t find any instances of \r\n(space).

  1. Hi Jaye:

    I’ve done a similar process, but using Word. The real trick, as you so correctly pointed out, is to ensure there is always a blank line between paragraphs. I replaced two paragraph marks with a tab, replaced all remaining paragraph marks with a space, then replaced the tabs with paragraph marks.

    Always nice to see how others solve similar problems! ;)
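
The three-replace approach described in the comment above can be sketched in Python. This is a rough illustration, not Word itself: the function name restore_with_tab_placeholder and the sample text are hypothetical, each \n stands in for one Word paragraph mark, and the text is assumed to contain no tabs of its own.

```python
def restore_with_tab_placeholder(text: str) -> str:
    """Use a tab as a temporary stand-in for real paragraph breaks."""
    # Treat every hard return as a single paragraph mark.
    text = text.replace("\r\n", "\n")
    # Two paragraph marks in a row mean a true paragraph break: park a tab there.
    text = text.replace("\n\n", "\t")
    # Every remaining paragraph mark is just a wrapped line: make it a space.
    text = text.replace("\n", " ")
    # Turn the parked tabs back into paragraph marks.
    text = text.replace("\t", "\n")
    return text


# Hypothetical sample with a blank line between the two paragraphs.
sample = "It was a dark\nand stormy night.\n\nThe rain fell\nin torrents.\n"
print(repr(restore_with_tab_placeholder(sample)))
```

Because the wrapped returns are replaced with a space directly, this variant needs no separate pass to add or remove spaces at the ends of lines. The final return does become one trailing space.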

  2. I am missing one little piece of information: when the OCR software runs, does it delete the blank lines between paragraphs if they are there?

    Okay, two pieces: If a paragraph is indented, does this indentation get turned into something? Like a tab? Seems to me that would be a requirement.

    I read your posts and store the information in a ‘if I need to know how to do this, Jaye has a post on it’ form. Including the comments. It seems to me that Word does this easily (if the blank lines and indents are there) with global replace commands. Are you trying NOT to use Word?

    I haven’t used the free OCR software, but that may be coming, as a novel I’m hoping to re-do (I have a great idea that will make it work today) is only available to me in paper form.

    BTW, do you have thoughts/experience on the scanner side – i.e., a recommendation or a caution re a particular scanner? Thanks.

    • Hi ABE,
      A scan produces an image. A picture of the text. That’s why you have to run the scan through an OCR program. The OCR interprets the images as text characters. So, if there is no image to interpret (a space, a tab, indent, a blank line) it won’t be reproduced. Sometimes the OCR “reads” specks and shadows as characters. This is why an OCR rendering must be cleaned up.

      As for Word versus a text editor: I have found through the process of cleaning up a whole lot of scanned books that the text editor is faster and more efficient than Word. IF one is doing a clean-up in Word, the basic information is pretty much the same. Word’s Find/Replace is the cleaner-upper’s best friend. The search terms are different. For instance, ^p will find hard returns and ^t will find tabs.

      As for scanners, I don’t know much about them. I have an Epson 2480 that is several years old and slower than dead people. I did scan a book with it. I do NOT recommend the process. It’s extremely slow and a pain in the butt–cheaper (in aspirin savings alone!) to pay $25 for someone else to run it through a commercial scanner. I had to try it just to see if it could be done. The answer is yes, but too much hassle. Places like Kinko’s or Office Depot or Staples can run your book or manuscript through a scanner in about 5-10 minutes and the quality will probably be much better than anything you could produce on a home scanner. There are many services on-line that will do the job, too. (I have one recommendation for a scanning service: Melynda Andrews. If you or anyone wants her contact info, send me an email at jayewmanus at gmail dot com) Some scanning services will also convert your scan into a document file. That will still require extensive cleaning and reformatting. I say, don’t bother. Run the scan through an OCR rendering yourself.

      • Thanks. That explains the lack of paragraphing info – the software is looking for characters, not formatted text.

        It would seem easy to program in some general intelligence. After all, humans automatically look for paragraphing information, but we are the great general multipurpose computer – and programs are often incomplete.

        I did OCR, I forget for what, a number of years ago (might be a large number). It was pitifully inadequate: I remember having a 5% error rate – it was easier for a fast typist to just retype most things. I was curious about newer stuff, having read a lot of complaints from people who read ebooks badly scanned by the big traditional publishers. Scanned so badly as to be unreadable – as if no human ever looked at the result of the scan before it was put out as an ebook.

        Now I might try speaking into my Dragon Dictate software, to save myself typing. Dragon has a much smaller error rate for me (~1-2%), though I need to be more diligent about training the things it doesn’t get (I usually just retype, but this results in Dragon not improving).

        I’m glad you suggested places like Staples – I wasn’t thinking of them for scanning. Mostly I see things where you have to send your valuable single copy of something and risk losing it in the mail. The THOUGHT of losing things in the mail has kept me from trying. If I can stand there while something is processed in-house, I’ll be much happier.

      • Hi ABE:

        I remember back when OCR tech was coming out and they (whoever “they” were) touted “…a 98% accuracy rate.” What they didn’t tell you was that said accuracy rate was per character. In other words, each character has a 2% chance of being wrong.

        Fortunately, that rate has improved. And the “intelligence” of the OCR software has improved, allowing the software to “learn” what the patterns really mean, thanks to a patient, human operator who is running the program.

        It is faster than retyping, but the scanned data MUST be carefully proofed!

        Jon
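
To put the per-character figure from the comment above in perspective, here is a small worked sketch in Python. The page size (2,000 characters) and line length (60 characters) are assumptions chosen only for illustration, the function names are hypothetical, and the second calculation assumes errors strike each character independently, which real OCR errors may not.

```python
def expected_errors(char_count: int, accuracy: float) -> float:
    """Average number of wrong characters at a given per-character accuracy."""
    return char_count * (1.0 - accuracy)


def chance_line_is_clean(chars_per_line: int, accuracy: float) -> float:
    """Chance that every character on a line is right, assuming independent errors."""
    return accuracy ** chars_per_line


# Assumed sizes for illustration only.
page_chars = 2000
line_chars = 60

print(round(expected_errors(page_chars, 0.98)))
print(round(chance_line_is_clean(line_chars, 0.98), 2))
```

Under those assumptions, a 98% rate works out to roughly 40 wrong characters per page, and only about a 30% chance that any given line comes through untouched. That is why every line needs proofing.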

  3. Patti,
    If you are using Notepad++, make sure you’ve toggled “Extended” in the search box and have added an extra space at the end of every line.

    If you are using a word processor, \r\n won’t work at all. To find hard returns in Word use the search term ^p. Modify the steps so that you tag the double hard return to indicate the proper start of a paragraph.

    • Jaye, I am using Notepad++, although I am new to it. Your instructions said it was easier than Word–why is that?

      I have toggled Extended and I believe Step 2 worked correctly. There is a space before every instance of CR-LF

      I snipped pictures of my file after doing step 2 and also of the dialog box in step 3 (although the snip mostly fails because the dialog box goes transparent when the snipping tool opens). I could attach those but don’t see how in this email.

      I tried it again–I believe I have typed in your step 3 Find/Replace instructions correctly–if I hit Replace All it says 0 occurrences; if I hit Find it says no instances of \r\n(space) found.

      I am just trying a sample file, to figure out how it works.

      • Problem solved! MY BAD: When I write an action enclosed in parentheses in the instructions, what I MEAN is to do the action, not copy the text. I.e., (space) actually means use the space bar to create a space. I apologize for not making that clear.
