More About Ebook Formatting, Source Files and Tales of Tagging

First, an apology for not answering every comment this week. On the “Source Files Update” post there were some great comments. People are coming up with solutions and solving problems. So go read the comments over there. One commenter in particular is hard at work on the subject of formatting ebooks from word processor files. I’ve been corresponding with William Ockham regarding his efforts to create a program that will make it easy to format a word processor file into a good-looking ebook. I’ve sent William some grotty files and he’s been problem solving. I’ve brought one of his comments over to this post so you can get a better idea of what he’s doing:

Wow, I’m flattered. I’ve been busy with my guest blogging stint over at http://www.thepassivevoice.com and didn’t see all these comments. Since there is some interest here, I’ll share what I can of my plans. I firmly believe that writers should use whatever tool works for them. For most people, that’s Microsoft Word. Some folks are using Scrivener and almost everyone else is using some word processor (a flavor of OpenOffice or those WordPerfect holdouts).

The first thing I’m going to release is a free document to source file converter service (to use Jaye’s terms). You save your manuscript in RTF format (pretty much every program supports RTF) and upload it to my service. My program will go through and do all the stuff that Jaye talks about. It will strip all the formatting except bold, italics, and chapter headings. You get back a nice clean source file in RTF format. You load it up into your tool and save it back as a .doc file and you have a source file suitable as the input for ebook formatting. It’s not much, but it is a nice little timesaver and your ebook formatter will thank you (even if you DIY). Did I mention it would be free?

I really appreciate all the expressions of support. I hadn’t really given much thought to a Kickstarter, but I am thinking about it now. In the meantime, there is something you could do to help. I need test cases. That is, I need real manuscripts before they’ve been given the Jaye Manus treatment. If anyone has copies of their novels (or short story collections) that they wouldn’t mind sharing with me, I would really appreciate it. I promise not to use them for anything other than perfecting my software. I will send you the cleaned up version and destroy or return the original when I’m done.

If you can help in this way, save your gnarliest files (smart quotes, em dashes, paragraphs indented with tabs and spaces, whatever) in RTF format and email them to razoroftruth at gmail dot com.

Let me know what program (e.g., Microsoft Word) and version (like 2000 or 2007) you are using, and whether you are on Windows, Mac, or Linux (or another Unix variant).

Which brings us to another problem I’m working on with source files–tagging. One of the things keeping me so busy this week is learning HTML. Turns out it’s kind of fun and quite the challenge. I also discovered that my resulting ebook files are much smaller–why? Who knows. But that’s a plus since I love using graphics for headers and such. Anyhow, the biggest challenge has been doing an ebook in screenplay format. It’s not difficult. It requires essentially three styles: Centered, Block Quote and Hanging Text. Since it ran about 120 pages in manuscript form, the real challenge was making sure every style was properly applied. I also wanted a way to NOT have to go in and tweak every line of text.

Now me, I happen to think FIND/REPLACE is the greatest invention since the light bulb. I’ve stated before that Word’s F/R is a powerhouse. Indeed. I also made some very interesting discoveries about Word and text editors and how they interact re formatting tags.

Le sigh…

Let’s talk about the two most common special formatting tags in the writing universe. Asterisks to indicate bolded text and underscores to indicate italics. Most editors and agents understand what those marks mean. Sending an e-query with those tags in place would be perfectly acceptable. Except… Even if you turn off the auto-formatting features, Word treats them like special characters and so does a text editor. Meaning, a text editor will strip them out. So those are out. You can use them if you like–they are easy to read–but if you ever have to copy the file into a text editor, you’ll lose the tags and your special formatting.

Anyhow, I’ve been using my own little special formatting tags–ii for italics, BB for bolding, and UU for underlining. Nobody but me sees them or has to read them, so no big deal. BUT, I am in the process of creating a cheat sheet for Source Files, and need to come up with tags that One) Make sense; Two) Are easy to remember and use; Three) Don’t activate “helpfulness” in word processors; Four) Work well in FIND/REPLACE operations. Number three is a bitch. I popped around in different programs to see how they handle various tags. Turns out non-letter characters are a problem when created in strings–Word, especially, kept getting wobbly and persnickety. Plus, some can cause problems in HTML coding because it uses so many characters for commands. For instance, I tried i/TEXT/i for italics. That seems fairly straightforward, right? It didn’t make Word go all wobbly either and it translated into a text editor. Problems arose when I did F/R operations in the text editor. I needed characters that are NOT used in coding. Which leaves out almost all of them.

Ah ha, most FIND operations can be made case sensitive. And there is one non-letter character that gave me no problems at all–the lowly dash/hyphen. So here are a few of the tags I ended up with:

  • -ITAL-   -NOITAL-
  • -CTR-     -NOCTR-
  • -BQ-        -NOBQ-
  • -NBSP-

Those might seem a little “wordy,” but they are pretty self-explanatory (italics, centered text, block quote, non-breaking space) and they don’t cause interpretation wars between programs. When I paste the Word file into the text editor, all I have to do is run FIND/REPLACE operations to insert the coding. (For example, -ITAL- becomes <i> and -NOITAL- becomes </i> to make italicized text.) Most fiction doesn’t require that every paragraph be tagged, so I won’t go into the nifty little shortcuts I found.
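For anyone who would rather script that step than run each FIND/REPLACE by hand, here is a minimal sketch in Python. It is only an illustration, not the workflow described above: the function name and the mapping table are made up for this example, and only the -ITAL-/-NOITAL- pair comes straight from the post. The HTML chosen for the centered and block quote tags is an assumption, so swap in whatever your own stylesheet expects.

```python
# Hypothetical mapping from the source-file tags to HTML.
# Only -ITAL- / -NOITAL- is spelled out in the post; the rest are assumptions.
TAG_TO_HTML = {
    "-ITAL-": "<i>",
    "-NOITAL-": "</i>",
    "-CTR-": '<p class="center">',  # assumed markup for centered text
    "-NOCTR-": "</p>",
    "-BQ-": "<blockquote>",
    "-NOBQ-": "</blockquote>",
    "-NBSP-": "&nbsp;",
}


def tags_to_html(text: str) -> str:
    """Replace each source-file tag with its HTML equivalent.

    str.replace is case-sensitive, so "-ital-" or ordinary hyphenated
    words are left alone.
    """
    for tag, html_code in TAG_TO_HTML.items():
        text = text.replace(tag, html_code)
    return text


if __name__ == "__main__":
    print(tags_to_html("She -ITAL-really-NOITAL- meant it."))
```

Because each opening tag starts with a hyphen followed directly by its first letter, replacing -ITAL- never touches -NOITAL-, which is one reason the order of the replacements does not matter here.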

The really important thing I’ve discovered is that not all tagging is equal and some of the old printer’s tags will not work because the programs want to do something with them and it’s not always what the writer intends.

So how about you, folks? What nifty tricks have you come up with for tagging the special formatting in your files?


24 thoughts on “More About Ebook Formatting, Source Files and Tales of Tagging”

  1. Le sigh, indeed 😉 It doesn’t matter what I want, because my editor needs a Word doc so she can do markup, my critique group expects the underline-for-italics trick and some form of rtf, and CreateSpace wants a Word doc too (if there’s an OpenOffice template I haven’t found it). What works for me is to edit *in* HTML, using an HTML viewer like CoffeeCup (free). It has the glorious search-and-replace too, and all you have to do is search for the opening underline tag (I’m being verbose to fool WordPress) and replace the u with an i. (and do it again with the added / in front to clear up the ending tags) Instant underline-to-italics!

    I usually write in OpenOffice because I like it and the HTML conversion is MUCH cleaner than Word’s. It still needs cleanup, which is also easier to do in the HTML editor.
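The underline-to-italics swap described in this comment can also be sketched in a few lines of Python instead of an HTML editor. This is an illustration under assumptions: the function name is hypothetical, and it only handles bare underline tags with no attributes, which is what a simple search-and-replace would catch as well.

```python
import re


def underline_to_italics(html_text: str) -> str:
    """Turn bare <u> and </u> tags into <i> and </i>.

    The pattern requires ">" right after the "u", so tags such as
    <ul> are not touched. Tags with attributes are not handled.
    """
    return re.sub(r"<(/?)u>", r"<\1i>", html_text, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(underline_to_italics("A <u>word</u> in a sentence."))
```

One regular expression covers both the opening and the closing tag, so there is no need to run the replacement a second time with the added slash.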

    • I haven’t learned to love the text editor yet (I’m using Notepad++) but it’s got a pretty good F/R and it sure makes coding lines easy.

      What I hope to accomplish with the cheat sheet I keep talking about is that writers can produce clean files from the get-go that stay clean no matter where they’re sent or how they’re used. Coming up with special formatting tags that DON’T trigger programs to do something and that everyone can understand is part of it.

  2. Hmmm. It’s been a little while, but I seem to remember being able to directly change Word’s underline into html code for italics using the advanced functions of search and replace in Word. I don’t remember how, but I followed Guido Henkel’s instructions.

  3. It’s simple as can be, Marie. Leave the FIND box blank, but tag it with Underline (from the special menu) then in the REPLACE box type ^& and it will tag the underlining. Caution: if you put too many HTML tags into any one Word file, it will start to go wobbly. Depending on your system, it could overload and crash, too. (found that out the hard way) Which is a frustration because it would be easier to just tag the text with the actual codes instead of going an extra step. Now I would like to know why some of these posts are italicized….

  4. Have you tried Markdown to format for HTML? I’ve been using it for many years and have gotten to the point where I write web content in a text editor then run a script that automatically transforms my document into HTML.

    • Hi, Michael. I just started doing HTML, and I don’t do anything as fancy as web content, just ebooks. I will check out Markdown. I love the idea of “automating” repeatable tasks.

  5. I’m a little fuzzy on your workflow. You’re not manually adding tags, right? You’re just doing a find-replace in the original document to add your “universal” tags.

    You might want to change those “NOITAL” and such to ENDITAL or ITALOFF. NOITAL sounds like “don’t use italics here”. The closing trash for italics actually

    • Darn phone…

      HTML tags work in pairs, so a closing tag should reflect that even in your own shorthand. NOITAL is focused on what comes after, which is not what a closing tag does. Those tags I suggested focus on the affected material.

      That might sound anal, but good code is written to be easily understood by others, even when they aren’t likely to read it.

      • Good point. I’m still figuring this out. And no, I don’t manually tag if I can help it. I find ways to use FIND/REPLACE. When I do it right, one REPLACE ALL does the trick. Which is actually the only tricky bit in coming up with tags. Making sure they are unique and will not catch other text in the operation. Something I really like about Word’s F/R is that I can use paragraph returns, special formatting and special characters as search terms. Some of the tagging I use now is because I’m not used to reading on a text editor. The screen looks funny. I can read better in Scrivener or a word processor.
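One way to check that tags are unique before running a REPLACE ALL is to scan the file first. The sketch below is a hypothetical helper, not something from this thread: it assumes the hyphen-delimited tags from the post, reports any opening tag whose count does not match its closing tag, and flags other hyphen-wrapped capitals that a careless search might catch.

```python
import re

# Tag names taken from the post; the checker itself is an illustrative assumption.
KNOWN_TAGS = {"-ITAL-", "-NOITAL-", "-CTR-", "-NOCTR-", "-BQ-", "-NOBQ-", "-NBSP-"}
PAIRS = [("-ITAL-", "-NOITAL-"), ("-CTR-", "-NOCTR-"), ("-BQ-", "-NOBQ-")]


def check_tags(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    # Every opening tag should have exactly one matching closing tag.
    for open_tag, close_tag in PAIRS:
        opens, closes = text.count(open_tag), text.count(close_tag)
        if opens != closes:
            problems.append(
                f"{open_tag} appears {opens} times but {close_tag} appears {closes} times"
            )
    # Anything shaped like a tag (hyphen, capitals, hyphen) that is not a known tag.
    for token in re.findall(r"-[A-Z]+-", text):
        if token not in KNOWN_TAGS:
            problems.append(f"unrecognized tag-like text: {token}")
    return problems


if __name__ == "__main__":
    for problem in check_tags("He said -ITAL-never and left."):
        print(problem)
```

Running a check like this on the source file first means one REPLACE ALL is less likely to produce surprises afterward.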

  6. You all have to stop this. I can’t continue to read and write with crossed eyes. As far as I’m concerned your directions are in Korean.

    • I’m making it sound way too complicated, Nila. Not on purpose, but because I’m fumbling and I don’t know all the proper terms yet. In practice, you’d have no problems at all.

      • Oh, good. ‘The End’ is going to end up with graphics between each story, a bio for each author, a story blurb, then the story, an introduction before all that, and maybe some art. It is growing and all I can think of is how the hell am I going to make it look good? I love some of what you have been producing and hope that I can emulate.

  7. Just ran into your journal site here (from a link at the PeoWrimo site on FaceBook) and am enjoying it a lot. But, as a screenwriter (among other things, such as a programmer), I need to correct your assertion that a screenplay just uses monospaced type and centering. It might look that way at first blush, but it’s not really so. And the formatting rules are both rather rigid and very easy to get wrong. Trying to format a screenplay for an e-reader is going to be horribly difficult to get exactly correct because the rules are designed for an 8 1/2″ x 11″ sheet of paper, not a screen of an unpredictable size (or orientation, for that matter). Prime example: I purchased “The Hollywood Standard, Second Revision” for my Nook Tablet. And guess what? The formatting of the examples on “how to correctly format a screenplay” is botched in nearly every way imaginable, save for using Courier New. Luckily I have the first edition in printed form, so the inaccuracy of the e-book version isn’t quite the hindrance to me that it might be to someone who doesn’t own both versions, but it just “goes to show you” that something that looks quite simple usually is much harder than it looks… especially when computers are involved! 😉

    • Hi, Jan. I know my “screenplay” format doesn’t match standard script format. It’s a faux script format. Because, as you point out, on an ereader, of any type, the formatter doesn’t have any control over the screen size, the justification and not much control over the right margin. I think what I managed would raise howls of outrage from screenwriters, but since the book is a novel, I’m hoping its readers will be able to get into the spirit of the thing and imagine they are reading a script. If nothing else, the simplicity of the format makes the script portion easy to read and follow no matter what device the reader is using.

      Right now, there is so little flexibility in font choices, and so little control over screen size, formatters need to be creative (but not too creative!) in their design choices. I no longer look at the ereaders as imitations of paper. I keep looking for ways for the medium to shine on its own terms. At times with more success than others, but I’m working on it.

      I am curious, though, about what will happen with screenwriters and people who use scripts. I imagine they are still using paper. I wonder what will happen to the formatting as more and more go digital.

      • Hi Jaye:

        Yes, screenwriters still use paper. Lots of paper. Frequently in a vast array of colors as a “locked-in” script is altered. Once it’s locked-in, scene numbers get assigned, shooting schedules are developed, and any manner of departments depend on specific sequences of events, all based upon the printed page.

        I rather doubt there will be a huge rush to go digital with scripts. The current rules involving spacing, margins, and line lengths evolved over time to give a rough estimate to the producer and/or director of the overall length of the film, with one page basically equalling one minute of film time. But you probably already knew that. 😉

        By the way, it’s “Jon,” not “Jan.” 😉

      • Hi, Jon (and I apologize for misspelling your name). That almost makes me feel sorry for script writers. I did know about the “one-minute” rule, but only from the sidelines, having read about it. I’ve never actually tried my hand at writing a screenplay.

        But this does lead to a point that I cannot emphasize enough for writers, producers and formatters–CONSIDER THE END USER AND THEIR REQUIREMENTS.

        As you pointed out, the format failed a how-to book on script writing. This, of course, leads me off on one of my bunny trails and… Blog post forthcoming.

  8. Hi again, Jaye:

    (Sorry if this is the wrong place for this note; it just seems to flow with this journal topic, and I can’t seem to find a way to e-mail you directly. Oh well. On with my note!)

    I’ve been reading your journal now with great interest, and I’m really liking your discoveries, recommendations, and suggestions. I also read through the journal entries you shared from Guido Henkel (smart guy, by the way).

    And now, I’m confused.

    Why? Well, a writing group of which I’m a member recommends “Smashwords” to all members as THE way to self-publish an e-book. And reading the “Smashwords Style Guide” by Mark Coker is what has led to my confusion. Because Mr. Coker (founder of Smashwords) recommends USING Word to submit one’s writing to them for formatting.

    Conceptually, I like — no, LOVE — the idea of letting the text be the text and then having a separate formatting component to control how the text looks. Not only does it make sense, it keeps things very simple for the author.

    But you and Guido recommend trying to curtail Word at every step. And I agree with that idea — the extra, “helpful” code Word inserts into a document — even one created using Filtered HTML — is so much overkill.

    It would be quite possible, I suppose, to create one’s e-book in the Guido-recommended HTML source and then open the HTML within Word to create another document to ship off to Smashwords, only to have them tear the file up and, essentially, take it back to the version one had in HTML.

    It seems like a very difficult way to go, rife with potential formatting pitfalls, if one wishes to use Smashwords.

    If.

    What are your thoughts on distributing one’s e-book via Smashwords? And what forms of distribution should one seek out if not using Smashwords?

    Also, I wanted to recommend to you a wonderful free program called Sigil (found at http://code.google.com/p/sigil/ ). The nice thing about Sigil is that it lets the writer work either in a text format or a code format. The file is stored as XHTML, and it will automatically wrap paragraphs with the all-important paragraph tags just by pasting a plain-text file into the text editor! It inherently saves files as EPUB, but, using Calibre as recommended by Guido, the format can be translated in mere moments. Just thought I’d bring it to your attention.

    Enjoying your journal very, very much! Keep up the outstanding work!

    Jon
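To make the paragraph-wrapping idea in this comment concrete, here is a rough Python sketch of what wrapping plain text in paragraph tags involves. It is not Sigil's actual implementation. The function name is hypothetical, and it assumes that paragraphs in the plain-text file are separated by blank lines.

```python
import html
import re


def wrap_paragraphs(plain_text: str) -> str:
    """Wrap each blank-line-separated block of text in <p>...</p>.

    Line breaks inside a block are collapsed to single spaces, and
    &, < and > are escaped so they cannot be mistaken for markup.
    """
    blocks = re.split(r"\n\s*\n", plain_text.strip())
    wrapped = []
    for block in blocks:
        if not block.strip():
            continue  # skip empty input
        one_line = " ".join(block.split())
        wrapped.append(f"<p>{html.escape(one_line, quote=False)}</p>")
    return "\n".join(wrapped)


if __name__ == "__main__":
    print(wrap_paragraphs("First paragraph,\nstill going.\n\nSecond paragraph."))
```

If a manuscript uses a single return between paragraphs instead of a blank line, the split pattern would need to change accordingly.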

    • Hi, Jon. Wow, you’ve practically written a blog post in and of itself.

      I’ll try to answer some questions. One, Smashwords is a terrific site, and considering it has a pipeline of distribution to several retail outlets otherwise closed to indie publishers, it’s a necessary site. Coker has opened doors that might have otherwise remained shut for years. I do recommend uploading to Smashwords for as wide a distribution as possible.

      This is only my theory, but Coker probably used Word and knew a whole lot of writers use Word, so it made perfect sense at the time to base his “meatgrinder” on Word. Better to go with what you have, and what a whole lot of people have, than requiring everybody to make huge changes. Even so, word processors don’t make great ebooks (they make great documents). It’s getting worse instead of better because new distribution outlets are opening up and each one of them has their own little quirks and special requirements. So right now Smashwords is in the position of trying to make everybody happy through compromises.

      Coker has a choice between making the suppliers happy and satisfying the distributors, so he’s doing the best he can out of an increasingly difficult position. I actually blame the device makers. They might have perfectly sound reasons for their proprietary formats and special requirements, but those reasons are in conflict with users and producers.

      Is there an easy cure for this? I don’t see one. At least, not an inexpensive one. If Coker were to decide tomorrow that he’d only accept html files, the end products would be far more stable–but then which version of html? And what about the indie publishers, most of whom are writers who have far better things to do than learn an entirely new way of formatting their work? Some people are adept at computer programming and html, and almost anyone with the will and time can certainly learn it, but most writers would rather put their energy into their writing. Plus, the majority of writers don’t think in “electronic files,” they think in “documents,” and shifting that mind set is a bear. (I am finishing up a project involving 40 writers–I got 40 stories, 39 of which were absolutely beautifully formatted documents, each reflecting the personality and style of the writer, and only one file that didn’t require stripping and flipping in order to turn it into an ebook file–do I demand that those writers, all of them masters of their craft, forget everything they’ve ever learned about word processing and document creation? Then turn around and learn how to create electronic files? Absolutely not. They have better things to do.)

      As for the “best” programs to do it all–when it comes to creating ebooks, I don’t believe there is one “best.” For another project I’m working on I’m using five–FIVE–different programs. Each is terrific, but each one does something better than the others and worse than the others. So I bebop back and forth and make do while trying to do my best.

      I get frustrated with Word not because it’s a bad program–it isn’t–and not that it can’t create perfectly serviceable ebooks–it can (in fact, I often do minimally formatted files that I run through Mobipocket or Calibre to make files I can proofread on my Kindle). The trouble is, the program is full of “helpful” features that can create landmines. Once those “helpful” features encounter the “helpful” features in other programs and they’ve finished shaking hands and screwing around with each other to see who’s top dog, the end result can be a big old mess. Plus, if you have special requirements that require special formatting, you’ve just upped your chances of triggering a mess.

      What’s the solution? I don’t have one. I keep plugging along, trying to figure out ways to prevent problems, solve problems and make attractive, easy to read ebooks. Until the device makers get their acts together and start thinking in terms of what is best for the end user, anyone who produces an ebook is going to have to take into account that there are going to be problems, hassles and frustrations along the way.

    • And it was late at night when I tried to reply to your comment, and I missed a few things.

      The indie publisher has tons of options for distribution. Direct Distribution: Amazon (very easy, requires a mobi or prc file–I kind of prefer the prc files since they seem a bit more stable than mobi files, but that could just be my imagination); Barnes & Noble (also easy, uses an epub file); Kobo (international sales, recently opened, accepts epub files, haven’t tried it yet and have heard mixed comments about how easy the process is); Apple–IF one uses a Mac AND uses Apple’s book creation software AND IF one is willing to abide by Apple’s TOS regarding exclusivity (it’s complicated).

      (creating prc, mobi or epub files from source files in txt, html, doc, etc. format is super easy with free programs such as Calibre and Mobipocket, and uploading and building an ebook in either of those takes about a minute or less)

      Indies also have the option of setting up a publishing company, purchasing ISBNs from Bowker, then working deals with distributors such as Sony and Diesel.

      Indies can also sell their own ebooks (and other formats) directly to customers on their websites and webstores.

      And, even with Smashwords, writers are NOT limited to Word. There are word processors that create doc files comparable to Word that Smashwords will accept and a program such as Scrivener (which is brilliant, by the way, not that I’m biased or anything) can generate files in many, many formats (Scrivener is NOT a document generator), including a perfectly good Word doc file.

      Smashwords is NOT the only game in town. It is one of the more convenient games in town and the most user friendly (comparably). Plus, Coker plays fair with producers and keeps up his end of the agreements he makes, which puts him about 10000 lightyears ahead of traditional publishers in that regard. I DO believe his ‘meatgrinder’ needs redoing, but since it appears he’s having more good luck than bad luck with it right now, I doubt it is going to change any time soon. So Word doc files it is. For now.
