Google Hacking for Penetration Testers - Part 21

Figure 5.20 The LinkedIn Profile of the Author of a Government Document
Can this process of grabbing documents and analyzing them be automated? Of course!
As a start we can build a scraper that will find the URLs of Office documents (.doc, .ppt, .xls,
.pps). We then need to download each document and push it through the meta information
parser. Finally, we can extract the interesting bits and do some post-processing on them. We
already have a scraper (see the previous section), so we just need something that will
extract the meta information from the file. Thomas Springer at ServerSniff.net was kind
enough to provide me with the source of his document information script. After some slight
changes it looks like this:
#!/usr/bin/perl
# File-analyzer 0.1, 07/08/2007, thomas springer
# stripped-down version
# slightly modified by roelof temmingh @ paterva.com
# this code is public domain - use at own risk
# this code is using phil harveys ExifTool - THANK YOU, PHIL!!!!
# http://www.ebv4linux.de/images/articles/Phil1.jpg
use strict;
use Image::ExifTool;
#passed parameter is a URL
my ($url)=@ARGV;
# get file and make a nice filename
my $file=get_page($url);
my $time=time;
my $frand=rand(10000);
my $fname="/tmp/".$time.$frand;
# write stuff to a file
open(FL, ">$fname");
print FL $file;
close(FL);
# Get EXIF-INFO
my $exifTool=new Image::ExifTool;
$exifTool->Options(FastScan => '1');          # speed up processing
$exifTool->Options(Binary => '1');            # extract binary tag values as well
$exifTool->Options(Unknown => '2');           # include unknown tags
$exifTool->Options(IgnoreMinorErrors => '1'); # ignore minor errors and warnings
my $info = $exifTool->ImageInfo($fname); # feed standard info into a hash
# delete tempfile
unlink ("$fname");
my @names;
print "Author:".$$info{"Author"}."\n";
print "LastSaved:".$$info{"LastSavedBy"}."\n";
print "Creator:".$$info{"creator"}."\n";
print "Company:".$$info{"Company"}."\n";
print "Email:".$$info{"AuthorEmail"}."\n";
exit; # comment out this line to see more fields
foreach (keys %$info){
print "$_ = $$info{$_}\n";
}
sub get_page{
my ($url)=@_;
#use curl to get it - you might want to change this
# 25 second timeout - also modify as you see fit
my $res=`curl -s -m 25 $url`;
return $res;
}
Save this script as docinfo.pl. You will notice that you'll need some Perl libraries to use
this, specifically the Image::ExifTool library, which is used to get the metadata from the files.
The script uses curl to download the documents from the server, so you'll need that as well.
Curl is set to a 25-second timeout; on a slow link you might want to increase that.
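If the Image::ExifTool module is not installed yet, it can usually be pulled straight from
CPAN. A typical one-liner for this (assuming a working CPAN configuration) would be:
$ perl -MCPAN -e 'install Image::ExifTool'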
Let's see how this script works:
$ perl docinfo.pl http://www.elsevier.com/framework_support/permreq.doc
Author:Catherine Nielsen
LastSaved:Administrator
Creator:
Company:Elsevier Science
Email:
The script looks for five fields in a document: Author, LastSavedBy, Creator, Company,
and AuthorEmail. There are many other fields that might be of interest, like the software
used to create the document.
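For example, a couple of extra print lines could be added just before the exit statement to
report additional tags. The tag names below are purely illustrative; which tags actually exist
depends on the document type, and commenting out the exit will dump every field ExifTool
found:
# illustrative only - available tag names vary per document type
print "Software:".$$info{"Software"}."\n";
print "Title:".$$info{"Title"}."\n";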
On its own this script is only mildly interesting, but it really starts to become powerful
when combined with a scraper and some post-processing of the results. Let's modify the
existing scraper a bit to look like this:
#!/usr/bin/perl
use strict;
my ($domain,$num)=@ARGV;
my @types=("doc","xls","ppt","pps");
my $result;
foreach my $type (@types){
$result=`curl -s -A moo "http://www.google.com/search?q=filetype:$type+site:$domain&hl=en&num=$num&filter=0"`;
parse($result);
}
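# parse: carve the Google result page into per-result snippets and print the URL from each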
sub parse {
($result)=@_;
my $start;
my $end;
my $token="
";
my $count=1;
while (1){
$start=index($result,$token,$start);
$end=index($result,$token,$start+1);
if ($start == -1 || $end == -1 || $start == $end){
last;
}
my $snippet=substr($result,$start,$end-$start);
my ($pos,$url) = cutter("<a href=\"","\"",0,$snippet);
my ($pos,$heading) = cutter(">","</a>",$pos,$snippet);
my ($pos,$summary) = cutter("<font size=-1>","<br>",$pos,$snippet);
# remove <b> and </b>
$heading=cleanB($heading);
$url=cleanB($url);
$summary=cleanB($summary);
print $url."\n";
$start=$end;
$count++;
}
}
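# cutter: returns the position of the end token and the text found between the two tokens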
sub cutter{
my ($starttok,$endtok,$where,$str)=@_;
my $startcut=index($str,$starttok,$where)+length($starttok);
my $endcut=index($str,$endtok,$startcut+1);
my $returner=substr($str,$startcut,$endcut-$startcut);
my @res;
push @res,$endcut;
push @res,$returner;
return @res;
}
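# cleanB: strip <b> and </b> tags from a string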
sub cleanB{
my ($str)=@_;
$str=~s/<b>//g;
$str=~s/<\/b>//g;
return $str;
}
Save this script as scraper.pl. The scraper takes a domain and a number as parameters. The
number is the number of results to return, but multiple page support is not included in the
code. However, it's child's play to modify the script to scrape multiple pages from Google; a
rough sketch of such a change follows below. Note that the scraper has been modified to look
for some common Microsoft Office formats and will loop through them with a
site:domain_parameter filetype:XX search term.
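As a sketch of what multi-page support might look like, the main foreach loop in scraper.pl
could be replaced with something along these lines. This assumes that Google's start
parameter still steps through result pages the way it did at the time of writing, and it
reinterprets $num as the total number of results wanted:
# sketch only: walk through result pages, 100 results at a time
foreach my $type (@types){
  for (my $offset=0; $offset<$num; $offset+=100){
    $result=`curl -s -A moo "http://www.google.com/search?q=filetype:$type+site:$domain&hl=en&num=100&filter=0&start=$offset"`;
    parse($result);
  }
}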
Now all that is needed is something that will put everything together and do some
post-processing on the results. The code could look like this:
#!/usr/bin/perl
use strict;
my ($domain,$num)=@ARGV;
my %ALLEMAIL=(); my %ALLNAMES=();
my %ALLUNAME=(); my %ALLCOMP=();
my $scraper="scrape.pl";
my $docinfo="docinfo.pl";
print "Scraping...please wait...\n";
my @all_urls=`perl $scraper $domain $num`;
if ($#all_urls == -1 ){
print "Sorry - no results!\n";
exit;
}
my $count=0;
foreach my $url (@all_urls){
print "$count / $#all_urls : Fetching $url";
my @meta=`perl $docinfo $url`;
foreach my $item (@meta){
process($item);
}
$count++;
}
#show results
print "\nEmails:\n-------------\n";
foreach my $item (keys %ALLEMAIL){
print "$ALLEMAIL{$item}:\t$item";
}
print "\nNames (Person):\n-------------\n";
foreach my $item (keys %ALLNAMES){
print "$ALLNAMES{$item}:\t$item";
}
print "\nUsernames:\n-------------\n";
foreach my $item (keys %ALLUNAME){
print "$ALLUNAME{$item}:\t$item";
}
print "\nCompanies:\n-------------\n";
foreach my $item (keys %ALLCOMP){
print "$ALLCOMP{$item}:\t$item";
}
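# process: classify a "Type:value" line from docinfo.pl as an e-mail address, a person's name, a username, or a company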
sub process {
my ($passed)=@_;
my ($type,$value)=split(/:/,$passed);
$value=~tr/A-Z/a-z/;
if (length($value)<=1) {return;}
if ($value =~ /[a-zA-Z0-9]/){
if ($type eq "Company"){$ALLCOMP{$value}++;}
else {
if (index($value,"\@")>2){$ALLEMAIL{$value}++; }
elsif (index($value," ")>0){$ALLNAMES{$value}++; }
else{$ALLUNAME{$value}++; }
}
}
}
This script first kicks off scraper.pl with the domain and the number of results that were
passed to it as parameters. It captures the output of that process (a list of URLs) in an array,
and then runs the docinfo.pl script against every URL. The output of docinfo.pl is then sent
for further processing, where some basic checking is done to see whether each value is a
company name, an e-mail address, a username, or a person's name. These are stored in
separate hash tables for later use. When everything is done, the script displays each collected
piece of information and the number of times it occurred across all pages. Does it actually
work? Have a look:
# perl combined.pl xxx.gov 10
Scraping...please wait...
0 / 35 : Fetching http://www.xxx.gov/8878main_C_PDP03.DOC
1 / 35 : Fetching http://***.xxx.gov/1329NEW.doc
2 / 35 : Fetching http://***.xxx.gov/LP_Evaluation.doc
3 / 35 : Fetching http://*******.xxx.gov/305.doc
...
Emails:
-------------
1: ***zgpt@***.ksc.xxx.gov
1: ***[email protected]
1: ***ald.l.***[email protected]
1: ****ie.king@****.xxx.gov
Names (Person):
-------------
1: audrey sch***
1: corina mo****
1: frank ma****
2: eileen wa****
2: saic-odin-**** hq
1: chris wil****
1: nand lal****
1: susan ho****
2: john jaa****
1: dr. paul a. cu****
1: *** project/code 470
1: bill mah****
1: goddard, pwdo - bernadette fo****
1: joanne wo****
2: tom naro****
1: lucero ja****
1: jenny rumb****
1: blade ru****
1: lmit odi****
2: **** odin/osf seat
1: scott w. mci****
2: philip t. me****
1: annie ki****
Usernames:
-------------
1: cgro****
1: ****
1: gidel****
1: rdcho****
1: fbuchan****
2: sst****
1: rbene****
1: rpan****
2: l.j.klau****
1: gane****h
1: amh****
1: caroles****
2: mic****e
1: baltn****r
3: pcu****
1: md****
1: ****wxpadmin
1: mabis****
1: ebo****
2: grid****
1: bkst****
1: ***(at&l)
Companies:
-------------
1: shadow conservatory
[SNIP]
The list of companies has been chopped way down to protect the identity of the
government agency in question, but the script seems to work well. The script can easily be
modified to scrape many more results (across many pages), extract more fields, and get other
file types. By the way, what the heck is the one unedited company known as the "Shadow
Conservatory"?