Restore Paragraphs in an OCR Scan

Earlier, I wrote a post about DIY scanning, OCR rendering, and clean-up of your backlist books. It doesn’t have to be expensive, and it’s not difficult to do. It does require patience, because cleaning up an OCR rendering takes time.

If you used FreeOCR (as I’d recommended), one thing you’ve probably noticed is that it inserts a hard return at the end of every single line. The first time I saw that, I freaked out a bit. I envisioned having to go through the entire file, manually deleting those extra returns and restoring every paragraph. Then I discovered the hard returns actually help in cleaning up the file, because I can work line by line through the text, comparing it to the original material.

Once the text is cleaned up, the paragraphs do need to be restored. If you are using Notepad++ (a text editor that I highly recommend), you can use Find/Replace to do it. The first step takes some time, but the actual restoration uses the power of Replace All and goes quickly.

Before you begin work on the file, do a Save As and work on the copy. That way, if you mess up, the original is intact and you can easily start over.

STEP ONE: Insert an extra blank line between “true” paragraphs.

To keep an eye on what you are doing, toggle on the Show All Characters button. It’s in the toolbar, and the icon looks like a blue pilcrow (paragraph symbol). It will display black boxes with [CR] (carriage return) and [LF] (line feed) wherever there is a hard return.

Once you have a blank line between every pair of true paragraphs, you need to add a space at the end of every line. That way, you won’t end up with joined words when the hard returns are removed.

STEP TWO: Open the Find/Replace box and, under Search Mode, select “Extended”.
In the Find box type: \r
In the Replace box type: (space)\r
(don’t type out the word “space”; just tap the space bar once)
Do a Replace All.
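
If it helps to see why that extra space matters, here is a tiny Python sketch of the difference. It is only an illustration: the sample fragment and variable names are made up, and it assumes Windows-style line endings (the [CR][LF] pair that Notepad++ displays).

```python
# Hypothetical two-line fragment with a Windows-style hard return.
fragment = "a dark\r\nand stormy night"

# Removing the hard return without adding a space first fuses two words.
without_space = fragment.replace("\r\n", "")

# Doing Step Two first, then removing the hard return, keeps them apart.
with_space = fragment.replace("\r", " \r").replace("\r\n", "")

print(without_space)  # a darkand stormy night
print(with_space)     # a dark and stormy night
```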

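Now you are going to tag the places where you WANT a hard return. After Step Two, the blank lines you inserted should be the only lines that begin with a space, so that is what you search for.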

STEP THREE: In the Find box type: \r\n(space)
In the Replace box type: \r\n-N-
Do a Replace All.
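
To see where the -N- tag actually lands, here is a short Python sketch that traces Steps Two and Three on a made-up sample. The sample text and variable names are mine, and the sketch assumes Windows-style \r\n line endings; treat it as an illustration rather than a replacement for the Notepad++ steps.

```python
# Hypothetical sample: two paragraphs, a hard return after every line,
# and the extra blank line from Step One between them.
sample = (
    "It was a dark\r\n"
    "and stormy night.\r\n"
    "\r\n"
    "The rain fell\r\n"
    "in torrents.\r\n"
)

# Step Two: put a space in front of every carriage return.
spaced = sample.replace("\r", " \r")

# Step Three: a line break followed by a space now occurs only where the
# blank line is, so that is the only place the -N- tag lands.
tagged = spaced.replace("\r\n ", "\r\n-N-")

print(repr(tagged))
```

One thing the sketch makes visible: any line that already begins with a space (an indented line, for example) would be tagged too, so it may be worth checking for leading spaces before you start.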

Now the step where you have to steel your nerves. Remove ALL the hard returns.

STEP FOUR: In the Find box type: \r\n
Leave the Replace box blank (no spaces either)
Do a Replace All.

Now you have one giant block of text with zero hard returns. But don’t freak out. Now you restore the proper paragraphs.

STEP FIVE: In the Find box type: -N-
In the Replace box type: \r\n
Do a Replace All.

Now your paragraphs are restored and there are no extra hard returns to be found. All that’s left is to get rid of those extra spaces at the end of each paragraph.

STEP SIX: In the Find box type: (space)\r
In the Replace box type: \r
Do a Replace All.
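
For anyone who would rather script the whole thing, here is a minimal Python sketch of the same sequence of replacements. It is not part of Notepad++. The function name restore_paragraphs and the sample text are my own inventions, and it assumes the text uses Windows-style \r\n line endings, already has the blank lines from Step One, and doesn’t contain -N- anywhere.

```python
def restore_paragraphs(text: str, marker: str = "-N-") -> str:
    """Mirror the Find/Replace steps above; assumes CRLF line endings."""
    text = text.replace("\r", " \r")                # Step Two: space before each return
    text = text.replace("\r\n ", "\r\n" + marker)   # Step Three: tag the blank lines
    text = text.replace("\r\n", "")                 # Step Four: remove every hard return
    text = text.replace(marker, "\r\n")             # Step Five: restore the wanted returns
    text = text.replace(" \r", "\r")                # Final clean-up: drop the trailing spaces
    return text


# Hypothetical sample with a blank line between the two true paragraphs.
sample = (
    "Call me\r\n"
    "Ishmael.\r\n"
    "\r\n"
    "Some years ago,\r\n"
    "never mind how long.\r\n"
)

restored = restore_paragraphs(sample)
print(repr(restored))
# 'Call me Ishmael.\r\nSome years ago, never mind how long. '
```

Notice that the very last paragraph keeps its trailing space, because there is no hard return after it for the final replacement to match; the same thing happens in Notepad++. If you try this on a real file, open it with newline="" (for example, open(path, newline="")), because otherwise Python converts \r\n to \n while reading and the replacements find nothing to match.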

That’s it. Except for the first step, where you have to insert an extra line between real paragraphs, explaining this takes longer than doing it. This method is a whole lot easier than manually deleting the unwanted hard returns.

Have fun!