Cleaning Up E-Books

I have a large number of ebooks in Microsoft's .lit format. My Nokia 770 doesn't have any software to read a .lit format book. In fact, I can't say I've ever seen a .lit reader other than Microsoft's own.

What I have seen is the nifty and very useful ConvertLIT, which I use to down-convert the files into plain HTML. I don't even bother with the images. The problem is that the results tend to come out hideously formatted. To clean them up, I came up with a nice combo of HTML Tidy and a Perl script.

Here's my command line for tidy. Beware: because of --write-back, this will modify your original copy!

tidy --bare yes --clean yes --drop-font-tags yes --drop-proprietary-attributes yes --enclose-text yes --output-xhtml yes --word-2000 yes --tidy-mark no --write-back yes TARGETFILENAME.htm
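
If overwriting the original makes you nervous, one option is to copy the file aside first and only then let tidy loose on it. The following is just a rough sketch of that idea, not something from my original setup: the helper name backup_file and the ".bak" suffix are made up for illustration, and it assumes tidy is on your PATH.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: copy $file to "$file.bak" and return the backup's name.
sub backup_file {
    my ($file) = @_;
    my $backup = "$file.bak";
    copy($file, $backup) or die "Could not copy $file to $backup: $!";
    return $backup;
}

# Only do anything when a filename is given on the command line.
if (@ARGV) {
    my $file   = $ARGV[0];
    my $backup = backup_file($file);
    print "Saved a copy as $backup\n";

    # Same tidy options as above, passed as a list so the shell
    # never gets a chance to mangle the filename.
    system('tidy',
        '--bare', 'yes', '--clean', 'yes', '--drop-font-tags', 'yes',
        '--drop-proprietary-attributes', 'yes', '--enclose-text', 'yes',
        '--output-xhtml', 'yes', '--word-2000', 'yes', '--tidy-mark', 'no',
        '--write-back', 'yes', $file);
}
```

The script doesn't check tidy's exit status, since tidy returns a non-zero status for mere warnings as well as for real errors.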

Here is my Perl script. It just runs the file through some regexes and writes the result to the same filename with "NEW" prepended. I also made a nice little progress bar because I was bored.


$file = $ARGV[0];   # Name the file
open(INFO, "< ".$file) or die "Can't read $file: $!";   # Open the file
@lines = <INFO>;    # Read it into an array
close(INFO);        # Close the file

$size = @lines;
$counter = 0;
$size = $size / 50;   # Lines per progress bar segment

open(FILEWRITE, "> NEW".$file) or die "Can't write NEW$file: $!";
foreach(@lines) {
  $counter++;
  # Redraw the progress bar every 50 lines and on the last line
  if(0 == ($counter % 50) || $counter == @lines) {
    print "\rProcessing: [";
    for($i = 0; $i < ($counter / $size); $i++) {
      print "+";
    }
    for($i = 0; $i < (49 - ($counter / $size)); $i++) {
      print "-";
    }
    print "]";
  }

  # Empty paragraph removal
  $_ =~ s/<p>\s*<\/p>//mi;
  if($_ =~ m/^\s*\n$/) {
    # If the line is just a newline or newline and spaces, scrap it.
    $_ = '';
  } else {
    # Remove excess spaces (see the update at the bottom)
    $_ =~ s/ //mi;
    # I get these a lot... (soft hyphens)
    $_ =~ s/­//mi;
  }
  print FILEWRITE $_;
}
close FILEWRITE;
print "\n";

You can download it here, but be careful with it.

Update (01/21/07)
That Perl script has a line, $_ =~ s/ //mi;, which doesn't really make much sense looking at it now. I'm thinking of $_ =~ s/\s\s+/ /mi; as a replacement. Also, for some reason the server throws a 500 error when you try to get that file. I'm working on it.
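
To see what a \s\s+ style replacement does, here is a tiny standalone sketch. The sub name collapse_spaces is made up for the example, and I've used the /g flag here (an assumption on my part, it isn't in the line above) so that every run of whitespace on the line gets collapsed and not just the first one.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of two or more whitespace
# characters into a single space. Note that \s also matches newlines.
sub collapse_spaces {
    my ($text) = @_;
    $text =~ s/\s\s+/ /g;
    return $text;
}

print collapse_spaces("Call   me    Ishmael."), "\n";   # prints "Call me Ishmael."
```

One thing to watch: because \s matches a newline too, a line that ends in a space followed by the newline loses its line break. If that matters, something like s/[ \t]{2,}/ /g only touches spaces and tabs.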