Kiirani.com Blog

My local church discusses God, Gay marriage

posted 27 April 2013

Blasphemy ensues.

St Matthew in the City is my local Anglican church. Local as in one block down the road.

I like to classify myself as an anti-theist, but I have huge amounts of respect for St Matthew’s for their classy sense of humour and progressive attitude. It’s also a very pretty church, architecturally speaking.

St Matthew’s is auctioning their current billboard, which commemorates the marriage equality bill, on Trade Me.

The Q&A on the auction is hilarious.

 

Using Tesseract OCR with PDF scans

posted 22 March 2013

We’re at the very beginning of a push to create a centralised repository of company knowledge: a place where new employees know they can go to find up to date, definitive information.

Just finding a place to start is a daunting task. Which is how I found myself retrieving a dog-eared photocopy of an introductory document from my folder in my cubby in the office I rarely visit.

I’ve been keeping this document around “just in case” for about a year and a half now, but I finally have a use for it: it’s the perfect lightweight overview from which to spawn a dozen or so wiki pages – just enough to get our knowledgeable staff interested in contributing to and maintaining the information.

We’re not sure where the master digital copy of this document got to. It obviously existed a couple of years ago, when the version I photocopied mine from was first printed. Equally obviously, it doesn’t exist now.

While fruitlessly searching through our digital archives, the boss had a marvellous idea: if we can’t find it, we can get the document scanned and OCR’d instead.

Challenge accepted! I know how to scan and OCR a document. Our photocopier even has a document feeder. I can probably work this out in the next half hour or so.

The photocopier emailed me a lovely 13 page PDF, ready to run through tesseract.

Wait, no, that doesn’t sound right.

% tesseract file.pdf output
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Unsupported image type.

Tesseract doesn’t read PDFs.

I’m sure I used it successfully on a TIFF last time, though.
ImageMagick can probably convert that for me.

% convert file.pdf file.tiff
% tesseract file.tiff output
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Error in pixReadFromTiffStream: can't handle bpp > 32
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Unsupported image type.

Mm, nope. Some slight issues with colour TIFF images there.

Let’s make that an 8 bit TIFF instead.

% convert file.pdf -depth 8 file.tiff
% tesseract file.tiff output
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Page 1 of 13
...
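The “bpp > 32” message is about bits per pixel, which is the number of channels multiplied by the bits per channel. As a rough illustration, assuming the first conversion produced a TIFF with 16 bits per channel (the error alone doesn’t say exactly what was in the file, and bits_per_pixel is a made-up helper name):

```shell
# bits per pixel = channels per pixel * bits per channel
bits_per_pixel() {
    echo "$(( $1 * $2 ))"
}

bits_per_pixel 3 16   # RGB at 16 bits per channel: prints 48, over the limit
bits_per_pixel 3 8    # RGB at 8 bits per channel:  prints 24, within the limit
```

Under that assumption, -depth 8 is what brings the file back under the 32 bit ceiling.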

Okay, now we’re getting somewhere. Unfortunately, the result is gibberish:

sugersm um wmcwankhmsmu n mm, may k,mvnImq.\-nyv'r1nunr pcv:mclc"s mnsmmsluv uu5|t.eqweu\\me'v mivvr ucnk‘ lm cenlmsu hcv We r_ mu; M. ZIVEV Var‘ wharwuh

Why? Well, because the TIFF file is barely readable.

There’s one last piece of wisdom here: the default resolution for “convert” is 72dpi.

If we re-convert that at 300dpi, the result… actually comes out in English. Mostly.

% convert -density 300 file.pdf -depth 8 file.tiff
% tesseract file.tiff output
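To put some numbers on the density setting, here is a small sketch of roughly how many pixels a page ends up with. The A4 page size (about 595x842 points, at 72 points to the inch) is an assumption, since the size of the scan isn’t recorded here, and page_pixels is a hypothetical helper:

```shell
# Approximate pixel dimensions of a page rasterised at a given density.
# Arguments: dpi, page width in points, page height in points (1 point = 1/72 inch).
page_pixels() {
    dpi="$1"
    width_pt="$2"
    height_pt="$3"
    echo "$(( width_pt * dpi / 72 ))x$(( height_pt * dpi / 72 ))"
}

page_pixels 72 595 842    # prints 595x842
page_pixels 300 595 842   # prints 2479x3508
```

At 72dpi a line of 10pt text is only about ten pixels tall, which goes some way to explaining the gibberish.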

So, steps to read a PDF document with tesseract:

  1. Convert the PDF document to something else
  2. Make sure that something else is high resolution (300dpi), and 8 bit
  3. Use tesseract to read the something else instead

And all of that took about a half hour to work out.
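The three steps can be wrapped up in a small shell function. This is only a sketch: ocr_pdf is a hypothetical name, it assumes the input file name ends in .pdf, and it assumes both convert and tesseract are installed and on the PATH.

```shell
# Sketch: OCR a scanned PDF by going through a 300dpi, 8 bit TIFF.
# Usage: ocr_pdf file.pdf   (writes file.tiff and file.txt next to it)
ocr_pdf() {
    pdf="$1"
    base="${pdf%.pdf}"   # strip the .pdf suffix to get a base name

    # -density comes before the input so the PDF is rasterised at 300dpi;
    # -depth 8 keeps the TIFF at 8 bits per channel.
    convert -density 300 "$pdf" -depth 8 "$base.tiff" || return 1

    # tesseract adds the .txt extension to the output base name itself.
    tesseract "$base.tiff" "$base"
}
```

Calling ocr_pdf file.pdf runs the same two commands as above and leaves the text in file.txt.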

insomnia

posted 20 March 2013

Spending another night unable to sleep; took my pills, read a book, lay down, relaxed.

Eventually become frustrated. Lying in bed tossing and turning isn’t productive.

I get up, fiddle with my website, write some documentation, think about work tomorrow.

Manage some semblance of exhaustion by 6am… Maybe enough to drop off before 7… I can’t afford to wake up at one in the afternoon after only 6 hours of sleep… again…

I like to think I’m having trouble sleeping because I’m stressed, but then lorazepam, glorious cure of anxiety and creator of unconsciousness, lorazepam doesn’t work. I don’t know how to deal with this.

(pacman) resolving missing dependencies

posted 18 March 2013

I was in dependency hell on my last system update. I’m impatient, so I resolve dependency hell by bypassing dependency checks altogether.

Having successfully run pacman -Sudd, I’m now stuck with some missing dependencies and a major headache every time I try to run things… Yay!

The official way to search your Arch system for missing dependencies is the testdb command.

testdb is useless for piping into pacman: it doesn’t have any special piped output, nor does it have a linear option like pactree (♥ pactree -l).

The resulting output is a mess of human readableness:

missing dependency for blender : openshadinglanguage
missing dependency for gnome-settings-daemon : ibus
missing dependency for lib32-mesa : lib32-libdrm
missing dependency for lib32-mesa : lib32-systemd
missing dependency for libpurple : farstream-0.1
missing dependency for mga-dri : mesa-libgl
missing dependency for openimageio : intel-tbb
missing dependency for workrave : gtkmm3
missing dependency for workrave : python2-cheetah

Fortunately this is pretty easily cleaned up with a little bit of command-line magic. We can cut this at “:”, grab the second half, and then trim off the leading whitespace.

# testdb | cut -d: -f2 | sed 's/^ *//'
openshadinglanguage
ibus
lib32-libdrm
lib32-systemd
farstream-0.1
mesa-libgl
intel-tbb
gtkmm3
python2-cheetah

This list can then be piped into pacman:

# testdb | cut -d: -f2 | sed 's/^ *//' | pacman -S -
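If the same package is missing for more than one parent, it will show up in that list more than once. A slightly more defensive sketch of the same parsing, assuming the output format shown above (missing_deps is a hypothetical helper name, not part of pacman):

```shell
# Read testdb-style lines on stdin and print each missing package once.
missing_deps() {
    # Only lines reporting a missing dependency are kept; everything up to
    # the last " : " is stripped, leaving just the package name.
    sed -n 's/^missing dependency for .* : *//p' | LC_ALL=C sort -u
}

# Intended usage (not run here):
#   testdb | missing_deps | pacman -S -
```

Anything testdb prints that isn’t a “missing dependency” line is dropped rather than handed to pacman as a package name.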

But for me, this doesn’t end here.

The reason I ran an -Sdd in the first place was that I have some weird dependency conflicts between mesa-libgl and my nvidia drivers, and I can’t be bothered fixing those right now.

I could do a basic | grep -v "mesa-libgl", but I thought it would be nice to go through each package and check that it and its dependencies won’t cause me any problems:

# Install one package at a time so each can be checked before confirming
for pkg in $(testdb | cut -d: -f2 | sed 's/^ *//'); do
    pacman -S "$pkg"
done
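For completeness, the grep route can be made a little safer. A plain grep -v "mesa-libgl" also throws away any package whose name merely contains that string; matching the whole line avoids that. A minimal sketch, with without_package as a made-up helper name:

```shell
# Drop one exact package name from a list on stdin.
# -F treats the name as a fixed string, -x matches whole lines only,
# -v inverts the match.
without_package() {
    grep -vxF -- "$1"
}

# Intended usage (not run here):
#   for pkg in $(testdb | cut -d: -f2 | sed 's/^ *//' | without_package mesa-libgl); do
#       pacman -S "$pkg"
#   done
```

The loop is otherwise the same as above, just with the one known troublemaker filtered out first.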

Okay, so, I’m still left in dependency hell with libgl, but most of my system functions again, and my bash-fu is stronger.

bad comments are bad

posted 16 March 2013

My local apache install has some issues. For one thing, it’s completely ignoring everything I put in .htaccess.

My first mistake was going into my main configuration to try and track this down.
My second mistake was reading the comments I’ve left in this file.

I think it might be time to consider reconfiguring apache.

# Wtf? This is redundant..

# This is going to get in the way in later life :)

# For compatibility with my old setup
# Needs to be changed lol

# Icons for apache generated pages. Should be replaced soon

# For the love of god, turn this down after use

# Rewrite rules SHOULD go into .htaccess files, but nvm for now

# Commented until I can migrate script to modern python

# Should php ever break...

# More magic?

# Indexes are globally disabled, but oh well, need to configure somewhere.