There are two types of PDF files that concern writers and from which writers would like to extract editable text.
The first is created by exporting a text document from a word processor or publishing program into a PDF file. The second type is created by scanning printed material and producing a PDF file.
(The second type, the scan, is actually an image file that requires further conversion via optical character recognition, or OCR. OCR conversion requires special software that falls into the category of “you get what you pay for,” and it will be the subject of another blog post.)
This post concerns the first type of PDF. A common request I get is: “I had someone do a print layout for my book and it’s been edited and updated, but it’s in a PDF and I need a final copy as a Word doc. Can you help?” No problem. It takes just a minute, so I don’t charge people to do it. (I do, however, charge an arm and a leg to clean up conversions. Just kidding, only an arm.)
The good news: Converting a PDF file into a Word doc is easier than ever and the results are better, too. And, you probably have the tools on your computer already.
The bad news: Conversion is always a mixed bag—some results are vastly superior and some will make you tear your hair out.
The good news about the bad news is that if you know what is happening, you can fix it without ending up in a weepy, shivering, fetal ball. Or sending people like me an anxious email saying, “I’ve spent months trying to fix this frippin’ Word doc and I’ve torn all my hair out and can you please, please, please help meeeeee!” Then wondering what is wrong with you when, a couple of hours later, I send you a fully restored Word doc. Nothing is wrong with you; I’ve just recovered millions of words from PDF files and pretty much know what I’m doing. 😉
Use MS Word to Convert the PDF
If you have a version of MS Word that is capable of exporting a PDF file, then it is most likely capable of importing one, too. (Opening PDFs arrived with Word 2013; some older versions can export a PDF but cannot open one.) How to know? Open a doc in Word and click Save As. In the dialog box is a dropdown menu of different file types: .doc, .docx, .rtf, .txt, and a bunch of others. If the list includes PDF, you’re probably golden. Conversion is as easy as opening a document.
In Word, click on File > Open and select the PDF file you want to open. (Be patient. Depending on how fast your computer is and how large the PDF file is, conversion may take several minutes.)
Once the file is open, use Save As to save it in the DOCX file format.
In the example the Show feature is activated so you can see the paragraph returns and other formatting.
What I like about this method:
- Headers and footers are rendered as headers and footers (for the most part, depending on how the original PDF was created), meaning they can be quickly deleted or safely ignored.
- It’s not horrible about retaining paragraphs.
What I don’t like about this method:
- It can hide hyphenation. (Sometimes the hyphenation is there but invisible and Word will not allow a search for them—if this occurs, you’ll need a text editor to clean them up. See below.)
- If the fonts used in the PDF are not available on your computer, Word will substitute fonts. If Word is unable to read the font, it will insert black boxes, pink boxes or gibberish.
- Images and other graphics can make the file difficult or impossible to open. This works best for a text-only document.
- Depending on the source PDF, Word can go into overdrive attempting to retain the formatting. That can result in massive (and slow!) files.
Use Google Drive to Convert the PDF
You may have to create a Google account (a Gmail account) in order to use Google Drive, but it’s free and widely available.
- Go to Google Apps > Drive
- Click New > File Upload
- Select the PDF file you want to convert
- When the box opens saying “1 Upload Complete”, click on the file name
- Tell it to “Open with Google Docs”
- File > Download As > Microsoft Word (docx)
- Open the downloaded file in Word
- Save As to make sure the new Word doc is on your computer.
What I like about this method:
- The PDF file is editable in Google Docs, so if you don’t have Word or don’t want to use it, you can work on the PDF directly. VERY IMPORTANT: This version remains in the cloud, not on your computer, so if you want it saved on your computer you will have to download it.
- No real formatting to fight with.
- It makes very little effort to convert images and graphics during conversion, so it rarely chokes up or crashes because of it.
What I don’t like about this method:
- Headers and footers will have to be removed manually.
- Hyphenation will have to be cleaned up manually.
- Spacing issues.
- Not fabulous about retaining paragraphs.
Tips for Making Clean Up Merely Mildly Annoying (as opposed to having you curled up in a fetal ball, quietly weeping)
- Forget trying to retain the formatting from the PDF file. The text is what matters, so focus on it.
- Work in Web Layout view rather than Print Layout view so that you can adjust the width of the screen to approximate the width of the PDF text. This will make checking for and fixing wayward paragraphs easier.
- Make sure all scene breaks, page breaks and deliberate blank lines are clearly tagged with some kind of marker so you know exactly where they are. Don’t use extra hard returns or actual page breaks to mark them—you’ll regret it.
- If possible, work with the Word doc and the PDF open on the screen side by side so you can see scene breaks, page breaks, deliberate blank lines and special formatting such as italics.
- Activate the Show feature (click the pilcrow icon ¶ in the Home Ribbon menu) so you can see such things as paragraph returns, soft returns, tabs and spaces.
- If Word is having trouble reading a font, you will need to try another method. Contact me (see below) and I’ll see if I can find a solution for you.
- Clear the formatting. First, make sure all your scene breaks, page breaks and deliberate blank lines are clearly marked. Second, tag your italics (easy way: https://jwmanus.wordpress.com/tag/italics-in-ebooks/). To clear the formatting, press Ctrl+A to select all the text, then click the Clear All Formatting icon in the Home Ribbon. This will leave you with a blank slate, essentially, and remove any unwanted formatting Word has applied. Apply the Normal style to the selected text, then modify the style so it suits you. Restore the italics.
Quick Find/Replace terms useful for clean up:
Get rid of unwanted page breaks:
In the Find field: ^m
In the Replace field: leave blank
Get rid of unwanted section breaks:
In the Find field: ^b
In the Replace field: leave blank
Turn soft returns into hard returns:
In the Find field: ^l
In the Replace field: ^p
To find and delete unwanted hyphens (in most cases, discretionary hyphens that are turned into single dashes have a space after them):
In the Find field: -(hit the space bar once to create a blank space)
In the Replace field: leave blank
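If you are comfortable with a little scripting, the Find/Replace recipes above can be roughly approximated outside Word. What follows is only a sketch in Python, not something Word itself provides. It assumes you have the text as a string in which page and section breaks appear as form feed characters and soft returns appear as vertical tabs (that is how Word represents them internally, but a plain-text export may use something else, so check your own file). The function name clean_converted_text is made up for this example.

```python
import re

# ASSUMPTION: these control characters mirror how Word stores the marks
# internally. A plain-text export may differ, so inspect your file first.
PAGE_OR_SECTION_BREAK = "\x0c"  # form feed: roughly what ^m and ^b match
SOFT_RETURN = "\x0b"            # vertical tab: roughly what ^l matches
HARD_RETURN = "\n"              # stand-in for Word's ^p paragraph mark


def clean_converted_text(text: str) -> str:
    """Apply the four clean-up recipes to a plain-text string."""
    # Get rid of page and section breaks (Find ^m or ^b, Replace blank)
    text = text.replace(PAGE_OR_SECTION_BREAK, "")
    # Turn soft returns into hard returns (Find ^l, Replace ^p)
    text = text.replace(SOFT_RETURN, HARD_RETURN)
    # Delete a hyphen plus one space, but only between two lowercase
    # letters, so a spaced dash like "one - two" is left alone
    text = re.sub(r"(?<=[a-z])- (?=[a-z])", "", text)
    return text


sample = "The conver- sion went well.\x0bA new line.\x0cNext page."
print(repr(clean_converted_text(sample)))
```

The hyphen rule here is deliberately narrower than typing a hyphen and a space into Word's Find field. It will still join a genuinely hyphenated word that happened to break at the end of a line (such as "well- known"), so check the result against the PDF.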
What if Word has hidden the hyphens?
It’s a common problem. It’s frustrating because you might never know it happened until you format your book as an ebook or send it in an email to someone. To find out if Word has done this, you will need a text editor. On a Windows machine, Notepad works fine. Open a blank document in the text editor. Use Ctrl+a to select all the text in the Word doc. Copy it, then paste it into the text editor. If you see this character ¬ then Word has replaced the hyphenation with “non-characters” that will cause trouble down the line. Word’s Find/Replace won’t do you any good. You will need to tag your italics, copy/paste the entire document into the text editor then use the editor’s Find/Replace function to delete the hyphenation.
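If your text editor's Find/Replace feels clumsy for this, the same check-and-delete step can be sketched in Python. Treat this as an illustration under assumptions, not a guaranteed fix: it assumes the hidden hyphens show up in the pasted text either as the ¬ character (U+00AC) described above or as the usually invisible soft hyphen (U+00AD). The function names are invented for the example.

```python
# ASSUMPTION: hidden hyphenation appears as one of these two characters.
# Your file may use something else, so look at it in a text editor first.
HIDDEN_HYPHENS = {
    "\u00ac": None,  # the visible "¬" character described above
    "\u00ad": None,  # soft hyphen, usually invisible
}


def count_hidden_hyphens(text: str) -> int:
    """Report how many suspect characters the text contains."""
    return sum(text.count(ch) for ch in HIDDEN_HYPHENS)


def strip_hidden_hyphens(text: str) -> str:
    """Delete every suspect character, leaving the rest untouched."""
    return text.translate(str.maketrans(HIDDEN_HYPHENS))


pasted = "The conver\u00acsion kept the for\u00admatting."
print(count_hidden_hyphens(pasted))
print(strip_hidden_hyphens(pasted))
```

Run the count first. If it comes back zero, Word has not hidden anything (or it has used a character this sketch does not know about), and there is nothing to strip.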
If neither of these conversion methods works for you, feel free to contact me at jayewmanus at gmail.com. I have other tools on hand that can convert difficult files. If the conversion does work for you, but you’re struggling with restoring the text, explain your problem in the comments and let’s figure it out.