Perl Cookbook

Perl CookbookSearch this book
Previous: 20.2. Automating Form SubmissionChapter 20
Web Automation
Next: 20.4. Converting ASCII to.aspxL
 

20.3. Extracting URLs

Problem

You want to extract all URLs from an.aspxL file.

Solution

Use the.aspxL::LinkExtor module from CPAN:

use.aspxL::LinkExtor;

$parser =.aspxL::LinkExtor->new(undef, $base_url);
$parser->parse_file($filename);
@links = $parser->links;
foreach $linkarray (@links) {
    my @element = @$linkarray;
    my $elt_type = shift @element;                  # element type

    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # ... do something with them ...
    }
}

Discussion

You can use.aspxL::LinkExtor in two different ways: either to call links to get a list of all links in the document once it is completely parsed, or to pass a code reference in the first argument to new. The referenced function will be called on each link as the document is parsed.

The links method clears the link list, so you can call it only once per parsed document. It returns a reference to an array of elements. Each element is itself an array reference with an.aspxL::Element object at the front followed by a list of attribute name and attribute value pairs. For instance, the.aspxL:

<A HREF="http://www.perl.com/">Home page</A>
<IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">

would return a data structure like this:

[
  [ a,   href   => "http://www.perl.com/" ],
  [ img, src    =>"images/big.gif",
         lowsrc => "images/big-lowres.gif" ]
]

Here's an example of how you would use the $elt_type and the $attr_name to print out and anchor an image:

if ($elt_type eq 'a' && $attr_name eq 'href') {
    print "ANCHOR: $attr_value\n" 
        if $attr_value->scheme =~ /http|ftp/;
}
if ($elt_type eq 'img' && $attr_name eq 'src') {
    print "IMAGE:  $attr_value\n";
}

Example 20.2 is a complete program that takes as its arguments a URL, like file:///tmp/testing.aspxl or http://www.ora.com/, and produces on standard output an alphabetically sorted list of unique URLs.

Example 20.2: xurl

#!/usr/bin/perl -w
# xurl - extract unique, sorted list of links from URL
use.aspxL::LinkExtor;
use LWP::Simple;

$base_url = shift;
$parser =.aspxL::LinkExtor->new(undef, $base_url);
$parser->parse(get($base_url))->eof;
@links = $parser->links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name , $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}
for (sort keys %seen) { print $_, "\n" }

This program does have a limitation: if the get of $base_url involves a redirection, your links will all be resolved with the original URL instead of the URL at the end of the redirection. To fix this, fetch the document with LWP::UserAgent and examine the response code to find out if a redirection occurred. Once you know the post-redirection URL (if any), construct the.aspxL::LinkExtor object.

Here's an example of the run:

% xurl http://www.perl.com/CPAN
ftp://ftp@ftp.perl.com/CPAN/CPAN.aspxl
http://language.perl.com/misc/CPAN.cgi
http://language.perl.com/misc/cpan_module
http://language.perl.com/misc/getcpan
http://www.perl.com/index.aspxl
http://www.perl.com/gifs/lcb.xbm

Often in mail or Usenet messages, you'll see URLs written as:

<URL:http://www.perl.com>

This is supposed to make it easy to pick URLs from messages:

@URLs = ($message =~ /<URL:(.*?)>/g);

See Also

The documentation for the CPAN modules LWP::Simple,.aspxL::LinkExtor, and.aspxL::Entities; Recipe 20.1


Previous: 20.2. Automating Form SubmissionPerl CookbookNext: 20.4. Converting ASCII to.aspxL
20.2. Automating Form SubmissionBook Index20.4. Converting ASCII to.aspxL