pdfgrep

Inspired by the comments on this NewsForge article about concatenating PDF files, I wrote this little script, which lets you search for patterns in PDF documents in exactly the same way you search plain text files with the grep utility. If you have never heard of grep, you probably won't be interested in this script.

The syntax is as follows:

pdfgrep [grep options] <pattern> <file> [file ...]

A comment poster hardcoded the grep options "--context=4 --color=always", but I don't hardcode these because I want to be able to give the context size myself (for four lines just try "pdfgrep -4 ...") and for the colour I have this line in my .bashrc anyway:

export GREP_OPTIONS="--color=auto"
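
To see what the context option does without needing a PDF, here is a small sketch using plain grep. It assumes GNU grep, where -NUM is shorthand for --context=NUM; show_context is a made-up helper name for this demonstration and not part of pdfgrep.

```shell
# show_context is a hypothetical helper used only for this demonstration.
# It feeds the numbers 1..11 to grep, one per line, and searches for the
# line "6" with whatever context option is passed in.
show_context() {
        printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 | grep "$@" -e '^6$'
}

show_context -2            # short form, as in "pdfgrep -4 ..."
show_context --context=2   # long form, as hardcoded by the comment poster
```

With GNU grep both calls should print the lines 4 to 8, that is the match plus two lines on either side.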

Finally, here comes the script:

#!/bin/sh
# 2004-06-22 Ulf Rompe <ulf@@@@rompe.org>
# Updated for filenames containing whitespace 2005-07-18
if [ $# -lt 2 ]; then
        echo 'Syntax: pdfgrep [grep options] <pattern> <file> [file ...]'
        exit 1
fi
grepopts=""
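# Collect leading arguments that start with "-" as grep options.
# case is used instead of echo so that -e and -n are not swallowed by echo.
while [ $# -gt 0 ]; do
        case "$1" in
                -*) grepopts="$grepopts $1"; shift ;;
                *)  break ;;
        esac
done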
pat="$1"
shift
if [ $# -gt 1 ]; then shownames=1; else shownames=0; fi
while [ $# -gt 0 ]; do
        [ "$shownames" = 1 ] && echo "$1:"
        pdftotext -layout "$1" - | egrep $grepopts "$pat"
        shift
done

Comments

An anonymous reader states:

A small note:
You should change 'egrep $grepopts $pat' to 'egrep $grepopts "$pat"',
otherwise you won't be able to grep for patterns containing
whitespace.

And yes, he's right. I will change that right now.
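
A quick way to see the difference, as a sketch only (count_args is a hypothetical helper, not part of the script): unquoted, the shell splits the pattern on whitespace, so egrep would take the first word as the pattern and the second as a file name.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() {
        echo "$#"
}

pat="foo bar"
count_args $pat      # unquoted: the shell splits the value into two words
count_args "$pat"    # quoted: the pattern stays one argument
```

Run in sh or bash with the default IFS, this prints 2 and then 1.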

thanks for this script :)

This is very useful!
A couple of changes, hopefully they are improvements:

1. It would be nice if pdfgrep printed the filename on
every line, and didn't print the filename when nothing
from the file matched. Basically the same behaviour as grep.

2. A small bug: if you pass -e or -n as a grep option,
then the echo in the test of the first while loop treats it as
an option for itself; for example, echo -e outputs just a blank line.

#!/bin/sh
# 2004-06-22 Ulf Rompe 
# Updated for filenames containing whitespace 2005-07-18
if [ $# -lt 2 ]; then
        echo 'Syntax: pdfgrep [grep options] <pattern> <file> [file ...]'
        exit 1
fi
grepopts=""
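while [ "$(echo "x$1" | cut -c2)" = "-" ]; do
        grepopts="$grepopts $1"
        shift
done
pat="$1"
shift
if [ $# -gt 1 ]; then shownames=1; else shownames=0; fi
while [ $# -gt 0 ]; do
        # Prefix each match with the filename when several files are searched.
        # printf is used instead of sed so that "/" in a path cannot break it.
        prefix=""
        [ "$shownames" = 1 ] && prefix="$1: "
        pdftotext -layout "$1" - | egrep $grepopts "$pat" |
                while IFS= read -r line; do printf '%s%s\n' "$prefix" "$line"; done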
        shift
done
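
The second point can be checked in isolation. As an alternative to prefixing an "x", the sketch below uses printf '%s', which treats its argument purely as data; first_char is a hypothetical helper name used only here.

```shell
# first_char is a hypothetical helper: it prints the first character of its
# argument. printf '%s' treats the argument purely as data, so "-e" and "-n"
# are not mistaken for options the way they can be with echo.
first_char() {
        printf '%s' "$1" | cut -c1
}

first_char -e    # an option-like argument
first_char -n    # another one that echo may swallow
first_char foo   # an ordinary pattern
```

The first two calls should report "-" and the last one "f", so the option loop can rely on the result.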

Hm, I agree - nice.

It doesn't work with filenames containing spaces, though.
The problem seems to be that $* is split on IFS.
I suggest adding a
#--------------
IFS=''
#--------------
before the for loop.
Any other ideas?

I.

Oh yes, you are right. A cleaner fix (in my eyes) would be to replace the for loop with a while loop. The problem with the shell is that it lacks a real foreach statement that iterates over arrays instead of string components.

I will replace the script with a version that handles whitespace in filenames in a few minutes. For the archives, this was the code fragment in question:

for i in $*; do
        [ "$shownames" == 1 ] && echo $i":"
        pdftotext -layout $i - | egrep $grepopts $pat
done

And this is what I will add now:

while [ $# -gt 0 ]; do
    [ "$shownames" == 1 ] && echo $1":"
    pdftotext -layout "$1" - | egrep $grepopts $pat
    shift
done

Thanks for the hint!

Just a suggestion -- I think you can get the same effect (using bash anyway, though on many modern Linux machines sh and bash are the same) using the "$@" operator. In other words, your original code could change to

for i in "$@"; do ... done

"$@" turns into "$1" "$2" ... "$N" (whereas "$*" is just "$1 $2 ... $N", joined with the first character of $IFS instead of a literal space). I think that's a little cleaner than the shifting. Anyway, thanks for the script.

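For anyone who wants to try the three forms side by side, here is a minimal sketch. count_args and demo are hypothetical helper names, and the file names are made up.

```shell
# count_args and demo are hypothetical helpers for this demonstration.
count_args() {
        echo "$#"
}

demo() {
        count_args $*      # unquoted: every argument is split on whitespace again
        count_args "$*"    # quoted: all arguments joined into one single word
        count_args "$@"    # quoted: one word per original argument
}

demo "my file.pdf" other.pdf
```

In sh or bash with the default IFS this prints 3, 1 and 2, so only "$@" keeps "my file.pdf" as a single filename.
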
Excellent!

Exactly what I needed today. Works like a champ. Thanks for this script!

Thanks

Thanks for saving us precious time.