Detecting and altering strings in PDFs

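People may not be interested, or Python may not be up to it, so here is a solution in Perl :-). Seriously, as noted above, you don't need to "alter strings". PDF annotations are the solution for you. I had a small project with annotations not long ago, and some of the code comes from there. However, my content parser was not universal, and you don't need full-blown parsing, meaning the ability to alter content and write it back. So I resorted to an external tool. The PDF library I use is somewhat low-level, but I don't mind. It also means you are expected to know PDF internals reasonably well to understand what is going on. Otherwise, just use the tool.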

Here is a command that marks, for example, all gerunds in the OP's file:

perl pdf_hl.pl -f westlaw.pdf -p '\S*ing'
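
To get a feel for what such a pattern picks up, here is a minimal sketch using core Perl only. The helper name build_pattern and the sample sentence are made up for illustration, and a plain alternation stands in for the regex the script assembles from the comma-separated list; the whole-word and case-insensitive defaults mirror the script's -w and -c options.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): joins the comma-separated
# patterns into one alternation, optionally anchored on word boundaries.
sub build_pattern {
    my ($csv, $whole_words, $case_sensitive) = @_;
    my $alt  = join '|', split /,/, $csv;
    my $body = $whole_words ? "\\b(?:$alt)\\b" : "(?:$alt)";
    return $case_sensitive ? qr/$body/ : qr/$body/i;
}

# Same pattern as on the command line, with the default flags.
my $re   = build_pattern('\S*ing', 1, 0);
my @hits = 'Filing and hearing the pending motion' =~ /$re/g;
print join(', ', @hits), "\n";
```

Note that the pattern matches any whole word ending in "ing", so it also catches words that are not gerunds.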

The code (the comments inside are worth reading, too):

use strict;
use warnings;
use XML::Simple;
use CAM::PDF;
use Getopt::Long;
use Regexp::Assemble;

#####################################################################
#
#  This is a PDF highlight mark-up tool.
#  Though fully functional, it's still a prototype proof-of-concept.
#  Please don't feed it with non-pdf files or patterns like '\d*'
#  (because you probably want '\d+', don't you?).
#
#  Requires muPDF-tools installed and in the PATH, plus some CPAN modules.
#
#  ToDo:
#  - error handling is primitive if any.
#  - cropped files (CropBox) are processed incorrectly. Fix it.
#  - of course there can be other useful parameters.
#  - allow loading them from file.
#  - allow searching across lines (e.g. for multi-word patterns)
#    and certainly across "spans" within a line (see mudraw output).
#  - multi-color mark-up, not just yellow.
#  - control over output file name.
#  - compress output (use cleanoutput method instead of output,
#    plus more robust (think compressed object streams) compressors
#    may be useful).
#  - file list processing.
#  - annotations are not just colorful marks on the page, their
#    dictionaries can contain all sorts of useful information, which may
#    be extracted automatically further up the food chain i.e. by
#    whoever consumes these files (date, time, author, comments, actual
#    text below, etc., etc., plus think of customized appearance streams,
#    placing them on layers, etc.).
#  - ???
#
#   Most complexity in the code comes from adding appearance
#   dictionary (AP). You can safely delete it, because most viewers don't
#   need AP for standard annotations. Ironically, muPDF-viewer wants it
#   (otherwise highlight placement is not 100% correct), and since I relied
#   on muPDF-tools, I thought it proper to create PDFs consumable by
#   their viewer... Firefox wants AP too, btw.
#
#####################################################################

my ($file, $csv);
my ($c_flag, $w_flag) = (0, 1);

GetOptions('-f=s' => \$file,   '-p=s' => \$csv,
           '-c!'  => \$c_flag, '-w!'  => \$w_flag)
    and defined($file)
    and defined($csv)
or die "\nUsage: perl $0 -f FILE -p LIST -c -w\n\n",
       "\t-f\t\tFILE\t PDF file to annotate\n",
       "\t-p\t\tLIST\t comma-separated patterns\n",
       "\t-c or -noc\t\t be case sensitive (default = no)\n",
       "\t-w or -now\t\t whole words only (default = yes)\n";

# Combine all patterns into a single regex.
my $re = Regexp::Assemble->new
    ->add(split(',', $csv))
    ->anchor_word($w_flag)
    ->flags($c_flag ? '' : 'i')
    ->re;

# mudraw dumps the page text as XML with a bounding box per character.
my $xml = qx/mudraw -ttt $file/;
my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);
my $pdf = CAM::PDF->new($file);

# Wrap a list of numbers into CAM::PDF number nodes.
sub __num_nodes_list {
    my $precision = shift;
    [ map {CAM::PDF::Node->new('number', sprintf("%.${precision}f", $_))} @_ ]
}

sub add_highlight {
    my ($idx, $x1, $y1, $x2, $y2) = @_;
    my $p = $pdf->getPage($idx);

    # mirror vertically to get to normal cartesian plane
    my ($X1, $Y1, $X2, $Y2) = $pdf->getPageDimensions($idx);
    ($x1, $y1, $x2, $y2) = ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);
    # corner radius
    my $r = 2;

    # AP appearance stream
    my $s = "/GS0 gs 1 1 0 rg 1 1 0 RG\n";
    $s .= "1 j @{[sprintf '%.0f', $r * 2]} w\n";
    $s .= "0 0 @{[sprintf '%.1f', $x2 - $x1]} ";
    $s .= "@{[sprintf '%.1f',$y2 - $y1]} re B\n";

    my $highlight = CAM::PDF::Node->new('dictionary', {
        Subtype => CAM::PDF::Node->new('label', 'Highlight'),
        Rect => CAM::PDF::Node->new('array',
            __num_nodes_list(1, $x1 - $r, $y1 - $r, $x2 + $r * 2, $y2 + $r * 2)),
        QuadPoints => CAM::PDF::Node->new('array',
            __num_nodes_list(1, $x1, $y2, $x2, $y2, $x1, $y1, $x2, $y1)),
        BS => CAM::PDF::Node->new('dictionary', {
            S => CAM::PDF::Node->new('label', 'S'),
            W => CAM::PDF::Node->new('number', 0),
        }),
        Border => CAM::PDF::Node->new('array',
            __num_nodes_list(0, 0, 0, 0)),
        C => CAM::PDF::Node->new('array',
            __num_nodes_list(0, 1, 1, 0)),

        AP => CAM::PDF::Node->new('dictionary', {
            N => CAM::PDF::Node->new('reference',
                $pdf->appendObject(undef,
                    CAM::PDF::Node->new('object',
                        CAM::PDF::Node->new('dictionary', {
                            Subtype => CAM::PDF::Node->new('label', 'Form'),
                            BBox => CAM::PDF::Node->new('array',
                                __num_nodes_list(1, -$r, -$r, $x2 - $x1 + $r * 2,
                                    $y2 - $y1 + $r * 2)),
                            Resources => CAM::PDF::Node->new('dictionary', {
                                ExtGState => CAM::PDF::Node->new('dictionary', {
                                    GS0 => CAM::PDF::Node->new('dictionary', {
                                        BM => CAM::PDF::Node->new('label',
                                            'Multiply'),
                                    }),
                                }),
                            }),
                            StreamData => CAM::PDF::Node->new('stream', $s),
                            Length => CAM::PDF::Node->new('number', length $s),
                        }),
                    ),
                0),
            ),
        }),
    });

    $p->{Annots} ||= CAM::PDF::Node->new('array', []);
    push @{$pdf->getValue($p->{Annots})}, $highlight;

    # mark the page object as changed so it gets written out
    $pdf->{changes}->{$p->{Type}->{objnum}} = 1;
}

my $page_index = 1;
for my $page (@{$tree->{page}}) {
    for my $block (@{$page->{block}}) {
        for my $line (@{$block->{line}}) {
            for my $span (@{$line->{span}}) {
                my $string = join '', map {$_->{c}} @{$span->{char}};
                while ($string =~ /$re/g) {
                    # box runs from the first to the last matched character
                    my ($x1, $y1) =
                        split ' ', $span->{char}->[$-[0]]->{bbox};
                    my (undef, undef, $x2, $y2) =
                        split ' ', $span->{char}->[$+[0] - 1]->{bbox};
                    add_highlight($page_index, $x1, $y1, $x2, $y2);
                }
            }
        }
    }
    $page_index++;
}

# e.g. westlaw.pdf -> westlaw++.pdf (the /r modifier needs Perl 5.14+)
$pdf->output($file =~ s/(.{4}$)/++$1/r);

__END__
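
The two less obvious steps in the main loop, turning match offsets into a box and flipping that box into PDF coordinates, can be tried in isolation. The sketch below uses core Perl only; the span data, the page box and the helper names match_box and to_pdf_space are made up for illustration, and it assumes (as the "mirror vertically" comment implies) that mudraw reports y growing downward from the top of the page.

```perl
use strict;
use warnings;

# Made-up span: one entry per character with a "bbox" string "x1 y1 x2 y2",
# the same shape the script reads from the mudraw XML.
my $text = 'a ruling';
my @chars;
for my $i (0 .. length($text) - 1) {
    push @chars, { c    => substr($text, $i, 1),
                   bbox => join(' ', 10 * $i, 100, 10 * $i + 10, 112) };
}

# Box spanning the first match of $re: left/top from the first matched
# character, right/bottom from the last. Empty list if nothing matches.
sub match_box {
    my ($re, @span) = @_;
    my $string = join '', map { $_->{c} } @span;
    return unless $string =~ $re;
    my ($x1, $y1)               = split ' ', $span[ $-[0] ]{bbox};
    my (undef, undef, $x2, $y2) = split ' ', $span[ $+[0] - 1 ]{bbox};
    return ($x1, $y1, $x2, $y2);
}

# Same arithmetic as in add_highlight: shift by the page's left edge and
# mirror y against the page's top edge.
sub to_pdf_space {
    my ($page, $x1, $y1, $x2, $y2) = @_;
    my ($X1, $Y1, $X2, $Y2) = @$page;
    return ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);
}

my @box = match_box(qr/\b\S*ing\b/i, @chars);
my @pdf = to_pdf_space([0, 0, 612, 792], @box);
print "@box -> @pdf\n";
```

The offsets in @- and @+ are character positions in the joined string, which is why they can index the per-character array directly.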

P.S. I tagged the question with "Perl" to get some feedback (code corrections, etc.) from the community.
