Hello Folks😗,
Welcome to the sixth blog of our “Python for Hackers” blog series. Today, we’re building a simple tool, a recursive web crawler in Python, which will help us scrape a whole page and pull out various useful pieces of information about it.
🫐 What is Web Crawling & Scraping?
- Web scraping is the process of automatically extracting data from websites
- Web crawling is the process of automatically visiting and downloading web pages. It is often used by search engines to discover and index new web pages.
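To make the distinction concrete, here is a minimal sketch using the same requests and BeautifulSoup libraries we rely on later in this post (https://example.com is just a placeholder URL):

import requests
from bs4 import BeautifulSoup

# Scraping: pull a specific piece of data out of one page
response = requests.get("https://example.com", timeout=3)
soup = BeautifulSoup(response.text, "html.parser")
print("Page title:", soup.title.string if soup.title else "(none)")

# Crawling: discover further pages to visit by following the links on this one
for anchor in soup.find_all("a"):
    href = anchor.get("href")
    if href:
        print("Discovered link:", href)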
How does Web Scraping help hackers?
- To steal email addresses and passwords from social media websites.
- To identify vulnerabilities in websites.
- To launch denial-of-service attacks.
🫐 Building the tool
Before building the tool itself, let me give a brief overview of what you are going to build: a crawler that takes a URL and a recursion depth, visits the page, and recursively collects the links, subdomains, and JavaScript files it finds.
Importing all the necessary modules in our file
import requests
import argparse
from termcolor import colored
from bs4 import BeautifulSoup
import re
from datetime import datetime
from urllib.parse import urljoin
These are all the modules that will be used.
Taking command-line arguments
We will be using the argparse module to accept command-line inputs in this blog. If you are unfamiliar with the argparse module, you can read my prior blogs, where I have taught the fundamentals of using it as well as detailed how to take user input.
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--url', dest='url', help="Specify the URL, provide it along with http/https", required=True)
    parser.add_argument('-d', '--depth', dest='depth', type=int, default=1, help="Specify the recursion depth limit")
    return parser.parse_args()
Let me first explain the various arguments we are adding.
- -u/--url: This argument accepts the URL the user wants to scan.
- -d/--depth: This argument defines the recursion depth limit.
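Assuming the script is saved as crawler.py (the filename here is just an example), a typical run would look something like this:

python3 crawler.py -u https://example.com -d 2

The -u flag is required, while -d falls back to a depth of 1 when omitted.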
Writing the code outside functions
if __name__ == "__main__":
    args = get_args()
    web_crawler = WebCrawler(args.url, args.depth)
    web_crawler.print_banner()
    web_crawler.start_crawling()
    web_crawler.print_results()
Here, we are simply parsing the command-line arguments and assigning them to the args variable. Then we create an object named web_crawler from the WebCrawler class, and during instantiation we pass it the url and depth arguments that were parsed from the command line.
Creating the WebCrawler Class
Defining the __init__ method
class WebCrawler:
    def __init__(self, url, max_depth):
        self.url = url
        self.max_depth = max_depth
        self.subdomains = set()
        self.links = set()
        self.jsfiles = set()
🙌 The __init__ method is used as the class constructor.
Let’s understand the various elements of the code:
- After defining the WebCrawler class, we construct it using the __init__ method.
- The __init__ method initializes the WebCrawler object with the parameters url and max_depth.
- The __init__ method also creates three instance variables named subdomains, links, and jsfiles, all of which are sets (see the note just below).
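A quick note on why sets are used: a set silently ignores duplicates, so the same URL can be added many times while crawling but is only stored (and later printed) once. A tiny illustration:

links = set()
links.add("https://example.com/about")
links.add("https://example.com/about")  # duplicate, ignored by the set
print(len(links))  # prints 1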
Defining the print_banner method
As you may recall, after instantiating the class in the global scope, we called the print_banner method, so let’s write that down.
def print_banner(self):
    print("-" * 80)
    print(colored(f"Recursive Web Crawler starting at {datetime.now().strftime('%d/%m/%Y %H:%M:%S')}", 'cyan', attrs=['bold']))
    print("-" * 80)
    print(f"[*] URL".ljust(20, " "), ":", self.url)
    print(f"[*] Max Depth".ljust(20, " "), ":", self.max_depth)
    print("-" * 80)
I’m not going to explain the code because it’s simple and straightforward.
Defining the start_crawling method
After calling the print_banner method, we called the start_crawling method, so let’s write it down.
def start_crawling(self):
    self.crawl(self.url, depth=1)
The start_crawling method is responsible for calling the crawl method, which is the main method that does the actual crawling. It calls crawl with the following parameters: the url and the starting value of depth, which is 1.
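If the role of the depth parameter is not obvious yet, here is a minimal standalone sketch (no HTTP involved, and the names are purely illustrative) of how depth-limited recursion behaves; it uses the same guard you will see at the top of crawl:

def visit(url, depth, max_depth=2):
    if depth > max_depth:  # same idea as the check in WebCrawler.crawl
        return
    print(f"visiting {url} at depth {depth}")
    visit(url + "/child", depth + 1, max_depth)

visit("https://example.com", depth=1)
# visiting https://example.com at depth 1
# visiting https://example.com/child at depth 2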
Defining the crawl method
def crawl(self, url, depth):
    # Stop recursing once the depth limit is exceeded
    if depth > self.max_depth:
        return
    try:
        response = requests.get(url, timeout=3, allow_redirects=True)
        soup = BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.RequestException as err:
        print(f"[-] An error occurred: {err}")
        return
    # Matches absolute http/https URLs, capturing the host part
    subdomain_query = r"https?://([a-zA-Z0-9.-]+)"
    for link in soup.find_all('a'):
        link_text = link.get('href')
        if link_text:
            if re.match(subdomain_query, link_text) and link_text not in self.subdomains:
                # Absolute URL: record it as a (sub)domain link
                self.subdomains.add(link_text)
            else:
                # Relative URL: resolve it against the current page and crawl it
                full_link = urljoin(url, link_text)
                if full_link != url and full_link not in self.links:
                    self.links.add(full_link)
                    self.crawl(full_link, depth + 1)
    # Collect the src of every <script> tag on the page
    for file in soup.find_all('script'):
        script_src = file.get('src')
        if script_src:
            self.jsfiles.add(script_src)
This is the main method, which handles all the crawling functionality. Let me give a high-level overview of the whole function.
- Because we have built a recursion limit into the function, the very first lines check whether the current value of depth is larger than self.max_depth; if it is, we simply exit the function.
- In the following lines, inside the try block, the code attempts to send an HTTP GET request to the specified URL and parses its response using the BeautifulSoup package.
- If an exception occurs while making the request, we catch it in the except block, print the error, and exit the function.
- Subdomain and link extraction:
  - For extraction, we have built a regex pattern to help match absolute URLs (and thus subdomains) within the HTML text.
  - We then iterate through all the anchor elements in the HTML using BeautifulSoup; for each anchor, we retrieve the href attribute, which contains the URL.
  - If the href value (link_text) is not empty, the code checks whether it matches subdomain_query. If it does and the link is not already in the subdomains set, it is added there. Otherwise, the code builds the full URL with the urljoin function and checks that it differs from the original URL (url) and is not already in the links set. If all conditions are met, the full link is added to the links set and the recursion continues by calling self.crawl with the new link and an incremented depth. A standalone sketch of this branching follows this list.
- JS file extraction:
  - The code also extracts JavaScript files by iterating through all <script> elements in the HTML and retrieving the src attribute. If a src attribute exists, the script source is added to the jsfiles set.
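To make the branching concrete, here is a small standalone sketch (the hrefs are made-up examples) showing which path a given href would take inside the loop above:

import re
from urllib.parse import urljoin

subdomain_query = r"https?://([a-zA-Z0-9.-]+)"
base = "https://example.com/blog/"

for href in ["https://api.example.com/docs", "/contact", "post-1.html"]:
    if re.match(subdomain_query, href):
        print("absolute URL, stored in subdomains:", href)
    else:
        print("relative URL, joined and crawled:", urljoin(base, href))

Running this prints the absolute URL unchanged and joins the two relative hrefs into https://example.com/contact and https://example.com/blog/post-1.html.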
Defining the print_results method
This is the last method of the class, responsible for printing the results we retrieved.
def print_results(self):
    if self.subdomains:
        for subdomain in self.subdomains:
            print(f"[+] Subdomains : {subdomain}")
        print()
    if self.links:
        for link in self.links:
            print(f"[+] Links : {link}")
        print()
    if self.jsfiles:
        for file in self.jsfiles:
            print(f"[+] JS Files : {file}")
The code is simple to understand, so there isn’t much point explaining it line by line.
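For reference, the output of a run looks roughly like the following (the values shown here are hypothetical placeholders, not real results):

[+] Subdomains : https://blog.example.com/
[+] Links : https://example.com/about
[+] Links : https://example.com/contact
[+] JS Files : /static/js/main.js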
So this is the end of the code. If you want to see the whole code, you can check it out at: 🔗Link to code
🐟 Conclusion
In summary, 🌐 the Recursive Web Crawler is a useful tool for web exploration, security 🔒, and development 🛠️. With its recursive depth control and data extraction capabilities, it helps you learn more about your target.
Python is one of the most versatile and easiest programming languages to learn. While it’s frequently used for automating tasks, there is little it cannot do. Web scraping? Got it. Tic-Tac-Toe? Got it. Python-based video games? Yes, it can do that too! Python can even be used for game hacking. But what happens when you want to find out what these Python apps are doing? Can you reverse engineer them? You sure can, but it’s a bit different from reversing most apps. To learn more, try the Python Reverse Engineering Course here at GuidedHacking. There is truly unlimited potential in Python, and it’s no wonder it’s commonly rated the most popular language that people want to learn next.