Fixing Weird Formatting, The Programmer’s Way

January 8th, 2010

Here’s a unique problem. A friend of mine suggested I get some Robert E. Howard stories for my Kindle and pointed me to the Project Gutenberg Australia website for free eBooks of his work. “eBook” is a loose term here, since the only versions available are TXT or HTML.

I chose the TXT version, since I could just dump it on my Kindle and have a decent-looking version. The problem: the TXT version is really an HTML page with the eBook as one large chunk of preformatted text. The text is also hard-wrapped at 80 columns, so even if I stripped the HTML from the source, the line breaks wouldn’t match the width of the Kindle’s screen.

So I wrote a Ruby script to fix it. It reads every line of the file. If a line contains text, it prints the text back out followed by a space instead of a newline, so consecutive lines join into one paragraph. If a line is blank, it starts a new paragraph. The script also lets you pass a certain number of lines at the beginning through unchanged, since they usually hold the title and author information and should stay on separate lines.
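As a rough sketch of that logic on its own, here it is pulled into a standalone method and run on a tiny sample. The method name `reflow` and the sample text are made up for illustration and are not part of the script below; the sketch also assumes the whole text fits in a string.

```ruby
# Hypothetical helper for illustration: joins hard-wrapped lines into
# paragraphs, treating blank lines as paragraph breaks. The first `skip`
# lines are passed through on their own lines.
def reflow(text, skip = 0)
  out = +''
  dumped_line = false

  text.each_line do |line|
    line = line.strip

    # Pass the leading lines (title, author) through untouched
    if skip > 0
      out << line << "\n"
      skip -= 1
      next
    end

    if line.empty?
      # A blank line ends the current paragraph
      out << "\n"
      out << "\n" if dumped_line
      dumped_line = false
    else
      # Join this line to the paragraph with a space instead of a newline
      out << line << ' '
      dumped_line = true
    end
  end

  out
end

# Made-up sample: two header lines, then two hard-wrapped paragraphs
sample = "A Title\nAn Author\n\nfirst line of a\nwrapped paragraph\n\nsecond paragraph\n"
puts reflow(sample, 2)
```

Note that each joined line leaves a trailing space, including the last line of a paragraph. That didn’t bother me on the Kindle, but it would be easy to trim.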

It was also a good excuse to learn Ruby’s OptionParser library, although I didn’t dive in too far.

Enjoy.

#!/usr/bin/env ruby

require 'optparse'

# Default options
dumped_line = false
input_file = STDIN
output_file = STDOUT
skip = 0

# Parse the options
ARGV.options do |o|
  script_name = File.basename($0)

  o.set_summary_indent('  ')
  o.banner = "Usage: #{script_name} [OPTIONS] [input_file] [output_file]"
  o.define_head 'Convert given Gutenberg txt file to a cleaner text file'
  o.separator   ''

  o.on('-s', '--skip=val', Integer, 'Lines to skip') { |s| skip = s }

  o.separator   ''

  o.on_tail('-h', '--help', 'Show this help message.') { puts o; exit }

  o.parse!
end

if ARGV.count == 1
  # One more argument means an input file is given
  input_file = File.open ARGV[0]
elsif ARGV.count == 2
  # Two more arguments means both input and output file are given
  input_file = File.open ARGV[0]
  output_file = File.open ARGV[1], 'w'
end

input_file.each_line do |line|

  line.strip!

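  # Pass the first `skip` lines through unchanged, one per line.
  # Checking for > 0 keeps a negative --skip from passing every line through.
  if skip > 0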
    output_file << line
    output_file << "\n"
    skip -= 1
    next
  end

  if line == ''
    # A blank line ends the current paragraph
    output_file << "\n"
    output_file << "\n" if dumped_line
    dumped_line = false
  else
    # Dump the line followed by a space instead of a newline
    output_file << line
    output_file << ' '
    dumped_line = true
  end

end

input_file.close
output_file.close