Perl Cookbook

Chapter 20: Web Automation
 

20.6. Extracting or Removing HTML Tags

Problem

You want to remove HTML tags from a string, leaving just plain text.

Solution

The following oft-cited solution is simple but wrong on all but the most trivial HTML:

($plain_text = $html_text) =~ s/<[^>]*>//gs;     #WRONG

A correct but slower and slightly more complicated way is to use the CPAN modules:

use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));

Discussion

As with almost everything else, there is more than one way to do it. Each solution attempts to strike a balance between speed and flexibility. Occasionally you may find HTML that's simple enough that a trivial command line call will work:

% perl -pe 's/<[^>]*>//g' file

However, this will break on files whose tags cross line boundaries, like this:

<IMG SRC = "foo.gif"
     ALT = "Flurp!">

So, you'll see people doing this instead:

% perl -0777 -pe 's/<[^>]*>//gs' file

or its scripted equivalent:

{
    local $/;               # temporary whole-file input mode
    $html = <FILE>;
    $html =~ s/<[^>]*>//gs;
}
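
The difference between the two modes can be seen in a small, self-contained sketch. The sample lines and variable names here are made up for illustration and are not part of the recipe; the same substitution is applied once per line and once to the whole string:

```perl
use strict;
use warnings;

# Sample input (made up for illustration): one tag split across two lines,
# followed by a simple one-line pair of tags.
my @lines = (
    qq{<IMG SRC = "foo.gif"\n},
    qq{     ALT = "Flurp!">\n},
    qq{<B>caption</B>\n},
);

# Line-at-a-time, as perl -pe does it: no single line holds both
# the "<" and the ">" of the IMG tag, so that tag survives untouched.
my $per_line = join '', map { my $copy = $_; $copy =~ s/<[^>]*>//g; $copy } @lines;

# Whole-file mode, as perl -0777 -pe does it: the tag is seen in one piece.
my $whole = join '', @lines;
$whole =~ s/<[^>]*>//gs;

print "per line:\n$per_line\n";
print "whole file:\n$whole\n";
```

In the per-line result the IMG tag is still present; in the whole-file result only a blank line and the word "caption" remain.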

But even that isn't good enough except for simplistic HTML without any interesting bits in it. This approach fails for the following examples of valid HTML (among many others):

<IMG SRC = "foo.gif" ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
    <B>You can't see me!</B>
-->
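
To see these failures concretely, the following sketch runs the substitution over a few of the samples above. The sub name naive_strip is our own, chosen only for this illustration; the pattern is the one from the Solution:

```perl
use strict;
use warnings;

# The substitution from the Solution, wrapped in a sub.
# The name naive_strip is ours, chosen for this illustration.
sub naive_strip {
    my ($html) = @_;
    (my $plain = $html) =~ s/<[^>]*>//gs;
    return $plain;
}

# A few of the problem cases listed in the Discussion.
my @samples = (
    '<IMG SRC = "foo.gif" ALT = "A > B">',
    '<!-- <A comment> -->',
    '<script>if (a<b && a>c)</script>',
    "<!-- This section commented out.\n    <B>You can't see me!</B>\n-->",
);

for my $html (@samples) {
    my $plain = naive_strip($html);
    print "input:  $html\n";
    print "output: $plain\n\n";
}
```

The pattern stops at the first > it meets, whether or not that character really ends a tag. The IMG sample therefore leaves the tail of its ALT attribute behind, both comment samples leave a stray --> (and the second one exposes the text that was meant to be hidden), and the script sample loses the middle of its comparison.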

The only solution that works well here is to use the HTML parsing routines from CPAN. The second code snippet shown above in the Solution demonstrates this better technique.

For more flexible parsing, subclass the HTML::Parser class and only record the text elements you see:

package MyParser;
use HTML::Parser;
use HTML::Entities qw(decode_entities);

@ISA = qw(HTML::Parser);

sub text {
    my($self, $text) = @_;
    print decode_entities($text);
}

package main;
MyParser->new->parse_file(*F);

If you're only interested in simple tags that don't contain others nested inside, you can often make do with an approach like the following, which extracts the title from a non-tricky HTML document:

($title) = ($html =~ m#<TITLE>\s*(.*?)\s*</TITLE>#is);
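
As a rough illustration of where this pattern works and where it does not, here is a short sketch. The sub name extract_title and both sample documents are hypothetical, invented for this example:

```perl
use strict;
use warnings;

# The title pattern from above, wrapped in a sub (the name is ours).
sub extract_title {
    my ($html) = @_;
    my ($title) = ($html =~ m#<TITLE>\s*(.*?)\s*</TITLE>#is);
    return $title;
}

# A plain document: the title spans several lines with padding around it.
my $simple = "<HTML><HEAD>\n<TITLE>\n  Web Automation\n</TITLE></HEAD></HTML>";

# A trickier document: an old title sits inside a comment before the real one.
my $tricky = "<!-- <TITLE>Draft</TITLE> -->\n<TITLE>Final</TITLE>";

print extract_title($simple), "\n";
print extract_title($tricky), "\n";
```

For the plain document the surrounding whitespace is trimmed and the title comes back clean. For the second document the pattern knows nothing about comments, so it returns the commented-out "Draft" rather than "Final".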

Again, the regex approach has its flaws, so a more complete solution using LWP to process the HTML is shown in Example 20.4.

Example 20.4: htitle

#!/usr/bin/perl
# htitle - get html title from URL

die "usage: $0 url ...\n" unless @ARGV;
require LWP;

foreach $url (@ARGV) {
    $ua = LWP::UserAgent->new();
    $res = $ua->request(HTTP::Request->new(GET => $url));
    print "$url: " if @ARGV > 1;
    if ($res->is_success) {
        print $res->title, "\n";
    } else {
        print $res->status_line, "\n";
    }
}

Here's an example of the output:

% htitle http://www.ora.com
www.oreilly.com -- Welcome to O'Reilly & Associates!

% htitle http://www.perl.com/ http://www.perl.com/nullvoid
http://www.perl.com/: The www.perl.com Home Page
http://www.perl.com/nullvoid: 404 File Not Found

See Also

The documentation for the CPAN modules HTML::TreeBuilder, HTML::Parser, HTML::Entities, and LWP::UserAgent; Recipe 20.5

