I was using theHarvester the other day and had to do a little extra work to get the data I wanted out of the results. There are plenty of posts out there about exactly how to use theHarvester, so I am not covering that.
The current version in Christian Martorella's theHarvester GitHub repository has a bunch of open pull requests and issues, some of which would probably provide the functionality that I was looking for. The right answer here is probably to fork the code and add the functionality myself. What I did below is more of a quick fix that I put together during an engagement.
Using All Modules
Upon cloning the latest code, I discovered that the all search type did not really work as intended and only searched Google and PGP keys. The tool help lists many other options:
baidu, bing, bingapi, dogpile, google, googleCSE, googleplus, google-profiles, linkedin, pgp, twitter, vhost, virustotal, threatcrowd, crtsh, netcraft, yahoo, all
So to get around this limitation (since I wanted all the results), I just dumped all of those search types (except all) to a file called search-types and then used a bash for loop to run each theHarvester module, writing the results of each module to its own file.
for i in $(cat search-types); do ./theHarvester.py -d target.com -b $i -f $i-results.html -l 500; done
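The same loop can be sketched in Python if you would rather build the command lines programmatically. This is only a minimal sketch: build_commands is a hypothetical helper (not part of theHarvester), the source list is copied from the help text quoted above, and the flags simply mirror the bash one-liner.

```python
# Source list copied from the tool help quoted above; "all" is dropped on purpose.
HELP_SOURCES = (
    "baidu, bing, bingapi, dogpile, google, googleCSE, googleplus, "
    "google-profiles, linkedin, pgp, twitter, vhost, virustotal, "
    "threatcrowd, crtsh, netcraft, yahoo, all"
)


def build_commands(domain, sources=HELP_SOURCES, limit=500):
    """Return one theHarvester argument list per search type (hypothetical helper)."""
    names = [s.strip() for s in sources.split(",")]
    return [
        ["./theHarvester.py", "-d", domain, "-b", name,
         "-f", f"{name}-results.html", "-l", str(limit)]
        for name in names
        if name and name != "all"
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; each list could be handed
    # to subprocess.run() to actually execute that module.
    for cmd in build_commands("target.com"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check them before anything touches the target.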
Unfortunately, theHarvester only has two output options: HTML and XML. If you are gathering emails and domains for an attack, it would be nice to just have these in a list. The HTML would have been hard to parse, but luckily the XML is pretty well formatted, so I was able to parse it into a plain list using Python's xml.etree.ElementTree library.
python theharvester_parser.py *.xml
import sys
import xml.etree.ElementTree as ET

# Check for at least one file argument (sys.argv[0] is the script name)
if len(sys.argv) < 2:
    print('Must specify a file or search.')
    sys.exit(1)

# Parse for emails
for i in sys.argv[1:]:
    try:
        tree = ET.parse(i)
    except ET.ParseError:
        # Skip files that are not well-formed XML
        continue
    root = tree.getroot()
    for emails in root.findall('email'):
        print(emails.text)

print('------------------------------------------')

# Parse for domains/IPs
for i in sys.argv[1:]:
    try:
        tree = ET.parse(i)
    except ET.ParseError:
        continue
    root = tree.getroot()
    for host in root.findall('host'):
        ip = host.find('ip')
        hostname = host.find('hostname')
        # Only print hosts that carry both an IP and a hostname
        if ip is not None and hostname is not None:
            print(ip.text + ' - ' + hostname.text)
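Because each module writes its own file, the same address or host can show up in several of them. Here is a minimal sketch of de-duplicating the results across files. The XML layout is assumed from the parser above (email elements under the root, and host elements containing ip and hostname); the theHarvester root tag, the sample values, and the collect_results helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample documents; the element layout is assumed from the parser above.
SAMPLE_A = """<theHarvester>
  <email>alice.smith@target.com</email>
  <host><ip>192.0.2.10</ip><hostname>www.target.com</hostname></host>
</theHarvester>"""

SAMPLE_B = """<theHarvester>
  <email>alice.smith@target.com</email>
  <email>bob.jones@target.com</email>
  <host>mail.target.com</host>
</theHarvester>"""


def collect_results(xml_documents):
    """Return (emails, hosts) as sorted, de-duplicated lists (hypothetical helper)."""
    emails, hosts = set(), set()
    for document in xml_documents:
        try:
            root = ET.fromstring(document)
        except ET.ParseError:
            continue  # skip malformed output
        for email in root.findall("email"):
            if email.text:
                emails.add(email.text.strip().lower())
        for host in root.findall("host"):
            ip, name = host.find("ip"), host.find("hostname")
            # Same rule as the parser above: only hosts with an ip child are kept
            if ip is not None and name is not None:
                hosts.add((ip.text, name.text))
    return sorted(emails), sorted(hosts)


if __name__ == "__main__":
    emails, hosts = collect_results([SAMPLE_A, SAMPLE_B, "<broken"])
    print("\n".join(emails))
    for ip, name in hosts:
        print(f"{ip} - {name}")
```

Using sets keeps the final lists short when several search types return the same address.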
Lastly, the LinkedIn module has a different output than the others: it produces a list of names/positions instead of emails and domains. I noticed (perhaps anecdotally) that all of the names that also had a position listed were not associated with the target company (perhaps they had been in the past).
Also, given the naming convention of the other discovered emails, I knew my target used an email syntax of first.last@target.com.
So I whipped up a few quick lines to parse the names, ignore the ones with position titles, and jam the rest into the correct email format to add to my already discovered emails.
with open('../linkedin-names') as names:
    for name in names:
        name_parts = name.split()
        # Skip anything that is not a plain "First Last" (e.g. names with a position)
        if len(name_parts) != 2:
            continue
        print(name_parts[0].lower() + '.' + name_parts[1].lower() + '@target.com')
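Adding the generated addresses to the ones already discovered can be sketched as a small merge step. This is a hypothetical helper, not something from theHarvester: merge_candidates, the sample names, and the first.last pattern are assumptions for illustration, and only plain two-word names are converted.

```python
def merge_candidates(discovered, names, domain="target.com"):
    """Return discovered emails plus first.last@domain guesses, without duplicates."""
    merged = {e.strip().lower() for e in discovered if e.strip()}
    for line in names:
        parts = line.split()
        if len(parts) == 2:  # only plain "First Last" entries, no position
            merged.add(f"{parts[0].lower()}.{parts[1].lower()}@{domain}")
    return sorted(merged)


if __name__ == "__main__":
    discovered = ["Alice.Smith@target.com\n"]
    names = [
        "Alice Smith\n",
        "Carol White\n",
        "Bob Jones - Sales Manager\n",  # has a position, so it is skipped
    ]
    for address in merge_candidates(discovered, names):
        print(address)
```

Lowercasing both sides before merging means a guessed address does not duplicate one that was already harvested with different capitalization.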
I am not sure if any of this will be useful again in the future, but it is here if I need it. I may see if I have the time to fork theHarvester code and add some of this functionality to my own version.