Inspired by the comments to this NewsForge article about concatenating PDF files, I wrote this little script, which lets you find patterns in PDF documents in exactly the same way that the grep utility does for plain text files. If you have never heard of grep, then you probably won't be interested in this script.
The syntax is as follows:
pdfgrep [grep options] <pattern> <file> [file ...]
A comment poster hardcoded the grep options "--context=4 --color=always", but I don't hardcode these because I want to be able to give the context size myself (for four lines just try "pdfgrep -4 ...") and for the colour I have this line in my .bashrc anyway:
export GREP_OPTIONS="--color=auto"
Finally, here comes the script:
    #!/bin/sh
    # 2004-06-22 Ulf Rompe <ulf@@@@rompe.org>
    # Updated for filenames containing whitespace 2005-07-18
    if [ $# -lt 2 ]; then
        echo 'Syntax: pdfgrep [grep options] <pattern> <file> [file ...]'
        exit 1
    fi
    grepopts=""
    while [ `echo $1 | cut -c1` == "-" ]; do
        grepopts="$grepopts $1"
        shift
    done
    pat="$1"
    shift
    if [ $# -gt 1 ]; then shownames=1; else shownames=0; fi
    while [ $# -gt 0 ]; do
        [ "$shownames" == 1 ] && echo $1":"
        pdftotext -layout "$1" - | egrep $grepopts "$pat"
        shift
    done
Comments
An anonymous reader states:
And yes, he's right. I will change that right now.
thanks for this script :)
This is very useful!
A couple of changes, hopefully they are improvements:
1. It would be nice if pdfgrep printed the filename on
every line, and didn't print the filename when nothing
from the file matched. Basically the same behaviour as grep.
2. A small bug: if you pass -e or -n as a grep option,
then the echo in the test of the first while loop treats it as
an option for itself; for example, echo -e outputs just a blank line.
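Both points can be sketched roughly as follows. This is only a minimal sketch, not the script's actual code: the helper names is_option and grep_labelled are hypothetical, and grep -E is assumed to be an acceptable stand-in for egrep. The first helper tests for a leading dash with case, so the argument is never passed to echo. The second prefixes every matching line with a label and prints nothing when no line matches.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the argument starts with a dash.
# case does the pattern match itself, so "-e" or "-n" never reaches echo.
is_option() {
    case "$1" in
        -*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Hypothetical helper: filter stdin through grep and prefix each output
# line with "label:". If grep prints nothing, the loop body never runs,
# so no file name appears for files without a match.
# Usage: some_command | grep_labelled LABEL [grep options] PATTERN
grep_labelled() {
    label="$1"
    shift
    grep -E "$@" | while IFS= read -r line; do
        printf '%s:%s\n' "$label" "$line"
    done
}
```

In the script, the option loop could then read "while is_option "$1"; do", and the output line could become "pdftotext -layout "$1" - | grep_labelled "$1" $grepopts "$pat"". With GNU grep, the options -H --label="$1" should give a similar prefix without the extra loop.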
Hm, I agree - nice.
It doesn't work with filenames containing spaces, though.
The problem seems to be that $* is split on IFS.
I suggest adding a
#--------------
IFS=''
#--------------
before the for loop.
Any other ideas?
I.
Oh yes, you are right. A cleaner fix (in my eyes) would be to replace the for loop with a while loop. The problem with the shell is that it lacks a real foreach statement that iterates over arrays instead of string components.
I will replace the script with a version that handles whitespace in filenames in a few minutes. For the archives, this was the code fragment in question:
    for i in $*; do
        [ "$shownames" == 1 ] && echo $i":"
        pdftotext -layout $i - | egrep $grepopts $pat
    done
And this is what I will add now:
    while [ $# -gt 0 ]; do
        [ "$shownames" == 1 ] && echo $1":"
        pdftotext -layout "$1" - | egrep $grepopts $pat
        shift
    done
Thanks for the hint!
Just a suggestion -- I think you can get the same effect (using bash anyway, though on many modern Linux machines sh and bash are the same) using the "$@" operator. In other words, your original code could change to
for i in "$@"; do ... done
"$@" turns into "$1" "$2" ... "$N" (whereas "$*" is the single word "$1 $2 ... $N", joined by the first character of $IFS instead of a space if IFS has been changed). I think that's a little cleaner than the shifting. Anyway, thanks for the script.
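To see the difference, here is a small sketch with hypothetical helper names (show_args, with_star, with_at). It compares the unquoted $* from the original for loop with "$@" on a file name that contains a space.

```shell
#!/bin/sh
# Hypothetical helper: print every argument on its own line, in brackets,
# so that word boundaries are visible.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Unquoted $* is split again on IFS (space, tab, newline by default),
# so a file name containing a space falls apart into two words.
with_star() {
    show_args $*
}

# "$@" expands to exactly one word per original argument.
with_at() {
    show_args "$@"
}
```

Calling with_star 'my file.pdf' b.pdf prints three bracketed words, while with_at 'my file.pdf' b.pdf prints two, with the space preserved.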
Excellent!
Exactly what I needed today. Works like a champ. Thanks for this script!
Thanks
Thanks for saving us precious time.