Fixing Weird Formatting, The Programmer's Way

→ January 8th, 2010

Here’s a unique problem. A friend of mine suggested I get some Robert E. Howard stories for my Kindle and pointed me to the Project Gutenberg Australia website for some free eBooks by him. eBook is a loose term, since the versions that are available are either TXT or HTML versions.

I chose the TXT version, since I could just dump it on my Kindle and have a decent-looking version. The problem: the TXT version is really an HTML page with a large chunk of preformatted text as the eBook. They also hard-wrapped every line at 80 columns. Even if I stripped the HTML from the source, the line breaks didn’t match the width of the Kindle.

So I wrote a Ruby script to fix it. It reads every line of the file. If a line contains text, it prints it back out followed by a space instead of a newline character, so wrapped lines join into one long line. If a line is blank, it ends the current paragraph and starts a new one. The script also lets you skip a certain number of lines at the beginning, since they usually hold the title and author information and should stay on separate lines.

It was also a good excuse to learn Ruby’s OptionParser library, although I didn’t dive in too far.

Enjoy.

#!/usr/bin/env ruby

require 'optparse'

# Default options
dumped_line = false
input_file = STDIN
output_file = STDOUT
skip = 0

# Parse the options
ARGV.options do |o|
  script_name = File.basename($0)

  o.set_summary_indent('  ')
  o.banner = "Usage: #{script_name} [OPTIONS] [input_file] [output_file]"
  o.define_head 'Convert given Gutenberg txt file to a cleaner text file'
  o.separator   ''

  o.on('-s', '--skip=val', Integer, 'Lines to skip') { |s| skip = s }

  o.separator   ''

  o.on_tail('-h', '--help', 'Show this help message.') { puts o; exit }

  o.parse!
end

if ARGV.count == 1
  # One more argument means an input file is given
  input_file = File.open ARGV[0]
elsif ARGV.count == 2
  # Two more arguments means both input and output file are given
  input_file = File.open ARGV[0]
  output_file = File.open ARGV[1], 'w'
end

input_file.each_line do |line|

  line.strip!

  # Pass the first `skip` lines through unchanged
  if skip > 0
    output_file << line
    output_file << "\n"
    skip -= 1
    next
  end

  if line == ''
    # A blank line ends the current paragraph
    output_file << "\n"
    output_file << "\n" if dumped_line
    dumped_line = false
  else
    # Just dump the line without a new line
    output_file << line
    output_file << ' '
    dumped_line = true
  end

end

input_file.close
output_file.close
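
To see the joining rule on its own, here is a minimal sketch of the same loop wrapped in a function that works on a string instead of files. The name unwrap and the sample text are made up for illustration; the logic is meant to mirror the script above, not replace it.

```ruby
# Hypothetical helper: same joining rule as the script, but on a string.
def unwrap(text, skip = 0)
  out = String.new
  dumped_line = false

  text.each_line do |line|
    line = line.strip

    # Pass the first `skip` lines through untouched
    if skip > 0
      out << line << "\n"
      skip -= 1
      next
    end

    if line.empty?
      # A blank line ends the current paragraph
      out << "\n"
      out << "\n" if dumped_line
      dumped_line = false
    else
      # Join wrapped lines with a space instead of a newline
      out << line << ' '
      dumped_line = true
    end
  end

  out
end

sample = "Title\nAuthor\n\nFirst line of\nthe paragraph.\n\nSecond paragraph.\n"
result = unwrap(sample, 2)
```

With two lines skipped, the title and author stay on their own lines and the two wrapped lines come back as a single paragraph. Note that each joined line keeps a trailing space, just like the script's output.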

Weeks in a Month Calculations

→ January 5th, 2010

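I was bitten by a nuance in Ruby where the Date "2010/01/01" is actually in the 53rd week of 2009. As far as I can tell, the cweek method returns the ISO 8601 week number, where weeks start on Monday and a week belongs to the year that contains its Thursday, so a January 1st that falls on a Friday still counts toward the previous year. To see it in action, fire up irb and try: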

require 'date'
Date.civil(2010,1,1).cweek  # => 53
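
A slightly longer sketch makes the behavior easier to see. It uses only the standard library's cwyear, cweek and cwday methods; the variable names are just for illustration.

```ruby
require 'date'

new_years_day = Date.civil(2010, 1, 1)

# ISO year, ISO week number, and ISO day of week (1 = Monday, 7 = Sunday)
iso_parts = [new_years_day.cwyear, new_years_day.cweek, new_years_day.cwday]

# The first Monday of January 2010 is where ISO week 1 of 2010 begins
first_monday = Date.civil(2010, 1, 4)
first_week = [first_monday.cwyear, first_monday.cweek]
```

Here iso_parts comes out as [2009, 53, 5] and first_week as [2010, 1], so pairing cweek with cwyear instead of year avoids the surprise.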

I needed a new way to calculate all of the weeks in a month and my old solution was a hack, so I came up with another quick hack to get it right. Below is my code to extend Date and Time to return an array of ranges, one for every week in a month.

module ActiveSupport #:nodoc:
  module CoreExtensions #:nodoc:

    module Date #:nodoc:
      module MyCalculations
        # Return an array of ranges with the weeks in the month
        def weeks_in_month
          weeks = []
          start = finish = beginning_of_month

          while finish != end_of_month
            finish = start.end_of_week
            finish = end_of_month if finish > end_of_month
            weeks << (start..finish)

            start = finish + 1.day
            start = start.beginning_of_day
          end

          weeks
        end
      end
    end

    module Time #:nodoc:
      module MyCalculations
        # Return an array of ranges with the weeks in the month
        def weeks_in_month
          weeks = []
          start = finish = beginning_of_month

          while finish != end_of_month
            finish = start.end_of_week
            finish = end_of_month if finish > end_of_month
            weeks << (start..finish)

            start = finish + 1.day
            start = start.beginning_of_day
          end

          weeks
        end
      end
    end
  end
end

class Date
  include ActiveSupport::CoreExtensions::Date::MyCalculations
end

class Time
  include ActiveSupport::CoreExtensions::Time::MyCalculations
end
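
For comparison, here is a rough sketch of the same idea using only the standard library's Date, with no ActiveSupport involved. The function name month_weeks is made up for this example, and it assumes weeks run Monday through Sunday; it is an illustration of the approach rather than a drop-in replacement for the extension above.

```ruby
require 'date'

# Hypothetical helper: split a month into week ranges (Monday to Sunday),
# clipping the first and last weeks at the month boundaries.
def month_weeks(year, month)
  first = Date.civil(year, month, 1)
  last  = Date.civil(year, month, -1)  # -1 means the last day of the month

  weeks = []
  start = first

  while start <= last
    # cwday is 1 (Monday) through 7 (Sunday), so this lands on the coming Sunday
    finish = start + (7 - start.cwday)
    finish = last if finish > last
    weeks << (start..finish)

    start = finish + 1
  end

  weeks
end

january = month_weeks(2010, 1)
february = month_weeks(2010, 2)
```

January 2010 starts on a Friday, so the first range covers only January 1st through 3rd and the month splits into five ranges. February 2010 starts on a Monday and has 28 days, so it splits into four full weeks.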