theHarvester For Open Source Recon

I was using theHarvester the other day and had to do a little extra work to get the data I wanted out of the results. There are plenty of posts out there about exactly how to use theHarvester, so I am not covering that here.

The current version in Christian Martorella's theHarvester GitHub repository has a number of open Pull Requests and Issues, some of which would probably provide the functionality I was looking for. The right answer here is probably to fork the code and add the functionality myself. What I did below is more of a quick fix that I put together during an engagement.

Using All Modules

Upon cloning the latest code, I discovered that the all search type did not really work as intended and only searched Google and PGP keys. The tool help lists many other options:

baidu, bing, bingapi, dogpile, google, googleCSE, googleplus, google-profiles, linkedin, pgp, twitter, vhost, virustotal, threatcrowd, crtsh, netcraft, yahoo, all

So to get around this limitation (since I wanted all the results), I dumped all of those search types (except all) into a file called search-types and then used a bash for loop to run each theHarvester module, writing the results of each module to its own file.

for i in $(cat search-types); do ./theHarvester.py -d target.com -b "$i" -f "$i-results.html" -l 500; done
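
If you would rather keep everything in Python alongside the parser, the same loop can be sketched roughly as follows. This is a minimal sketch, not part of theHarvester: the names SOURCES, build_command and run_all are made up for illustration, the script path ./theHarvester.py is assumed to match the clone used above, and only the -d, -b, -f and -l flags from the loop above are used.

```python
import subprocess

# Source names copied from the tool help listed above, minus "all".
SOURCES = [
    "baidu", "bing", "bingapi", "dogpile", "google", "googleCSE",
    "googleplus", "google-profiles", "linkedin", "pgp", "twitter",
    "vhost", "virustotal", "threatcrowd", "crtsh", "netcraft", "yahoo",
]


def build_command(domain, source, limit=500, script="./theHarvester.py"):
    """Return the argument list for one run against a single source.

    The script path is an assumption; adjust it to wherever the clone lives.
    """
    return [
        script,
        "-d", domain,
        "-b", source,
        "-f", source + "-results.html",
        "-l", str(limit),
    ]


def run_all(domain, sources=SOURCES, limit=500):
    """Run every source in turn; a failing module does not stop the rest."""
    for source in sources:
        result = subprocess.run(build_command(domain, source, limit))
        if result.returncode != 0:
            print(source + ": exited with status " + str(result.returncode))
```

Calling run_all('target.com') from the directory containing theHarvester.py should behave like the bash loop. Passing the arguments as a list avoids any shell quoting issues with the source names.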

Parsing Output

Unfortunately, theHarvester only has two output options: HTML and XML. If you are gathering emails and domains for an attack, it would be nice to just have these in a list. The HTML would have been hard to parse, but luckily the XML is pretty well formatted, so I was able to parse it into a plain list using Python's xml.etree.ElementTree library.

python theharvester_parser.py *.xml

import sys
import xml.etree.ElementTree as ET

# Check for at least one file argument (sys.argv[0] is the script name)
if len(sys.argv) < 2:
    print('Must specify a file or search.')
    sys.exit(1)

# Parse for emails
for i in sys.argv[1:]:
    try:
        tree = ET.parse(i)
    except ET.ParseError:
        # Skip files that are not well-formed XML
        continue
    root = tree.getroot()
    for emails in root.findall('email'):
        print(emails.text)

print('------------------------------------------')

# Parse for domains/IPs
for i in sys.argv[1:]:
    try:
        tree = ET.parse(i)
    except ET.ParseError:
        # Skip files that are not well-formed XML
        continue
    root = tree.getroot()
    for host in root.findall('host'):
        ip = host.find('ip')
        hostname = host.find('hostname')
        # Only print hosts that carry both an IP and a hostname
        if ip is not None and hostname is not None:
            print(ip.text + ' - ' + hostname.text)
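
Because each module writes its own file, the same email or host can show up several times in the combined output. A minimal sketch of a de-duplicating variant follows. It assumes the same layout the parser above relies on (email elements and host elements with ip and hostname children directly under the root); the collect function, the root element name and the sample data are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def collect(xml_documents):
    """Return (emails, hosts) as sorted, de-duplicated lists.

    xml_documents is an iterable of XML strings. Documents that fail to
    parse are skipped, as are host entries missing an ip or hostname.
    """
    emails = set()
    hosts = set()
    for document in xml_documents:
        try:
            root = ET.fromstring(document)
        except ET.ParseError:
            continue
        for email in root.findall('email'):
            if email.text:
                emails.add(email.text.strip().lower())
        for host in root.findall('host'):
            ip = host.find('ip')
            hostname = host.find('hostname')
            if ip is None or hostname is None:
                continue
            if ip.text and hostname.text:
                hosts.add((ip.text.strip(), hostname.text.strip().lower()))
    return sorted(emails), sorted(hosts)


# Made-up sample data in the layout assumed above.
SAMPLE_A = """<theHarvester>
  <email>Jane.Doe@target.com</email>
  <host><ip>192.0.2.10</ip><hostname>www.target.com</hostname></host>
</theHarvester>"""
SAMPLE_B = """<theHarvester>
  <email>jane.doe@target.com</email>
  <email>john.smith@target.com</email>
  <host>mail.target.com</host>
</theHarvester>"""

emails, hosts = collect([SAMPLE_A, SAMPLE_B, "<broken"])
for address in emails:
    print(address)
for ip, hostname in hosts:
    print(ip + ' - ' + hostname)
```

To use it on real output, read each file passed on the command line into a string and hand the list to collect. Lower-casing before adding to the set is a design choice so that addresses differing only in case collapse to one entry.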

LinkedIn Output

Lastly, the LinkedIn module's output differs from the others: it is a list of names and positions instead of emails and domains. I noticed (perhaps anecdotally) that none of the names that also had a position listed were associated with the target company (perhaps they had been in the past).

Also, given the naming convention of the other discovered emails, I knew my target had an email syntax of first.last@target.com.

So I whipped up a few quick lines to parse the names, ignore the ones with position titles, and jam the rest into the correct email format to add to my already discovered emails.

with open('../linkedin-names') as names:
    for name in names:
        name_parts = name.split()
        # Keep only plain "First Last" lines; extra words mean a position title
        if len(name_parts) != 2:
            continue
        print(name_parts[0].lower() + '.' + name_parts[1].lower() + '@target.com')
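
Other targets will use other conventions, so the same idea can be wrapped in a small function with the format as a parameter. This is only a sketch under assumptions: names_to_emails and its pattern placeholders (first, last, f, l) are invented here, and the sample names are made up.

```python
def names_to_emails(lines, domain='target.com', pattern='{first}.{last}'):
    """Turn 'First Last' lines into email addresses.

    Lines that do not split into exactly two words (blank lines, single
    names, names followed by a position title) are skipped. Duplicates
    are dropped while keeping the original order.
    """
    seen = set()
    addresses = []
    for line in lines:
        parts = line.lower().split()
        if len(parts) != 2:
            continue
        first, last = parts
        local = pattern.format(first=first, last=last, f=first[0], l=last[0])
        address = local + '@' + domain
        if address not in seen:
            seen.add(address)
            addresses.append(address)
    return addresses


# Made-up sample input, one name per line as read from a file.
sample = ['Jane Doe\n', 'John Smith Senior Engineer\n', 'Cher\n', '\n', 'jane doe\n']
for address in names_to_emails(sample):
    print(address)
```

For a target that uses first initial plus last name, pattern='{f}{last}' would be passed instead.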

I am not sure whether any of this will be useful again in the future, but it is here if I need it. I may see if I have the time to fork theHarvester code and add some of this functionality to my own version.