Hello Folks😗,
Welcome to the sixth blog of our “Python for Hackers” blog series. Today, we’re building a simple tool, a recursive web crawler in Python, which will help us scrape a whole page and pull out various useful pieces of information about it.
🫐 What is Web Crawling & Scraping?
- Web scraping is the process of automatically extracting data from websites
- Web crawling is the process of automatically visiting and downloading web pages. It is often used by search engines to discover and index new web pages.
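To make the distinction concrete, here is a minimal sketch using the same requests and BeautifulSoup libraries we rely on later in this post (https://example.com is just a placeholder URL):

import requests
from bs4 import BeautifulSoup

# Scraping: pull a specific piece of data out of one page
response = requests.get("https://example.com", timeout=3)
soup = BeautifulSoup(response.text, "html.parser")
print("Page title:", soup.title.string if soup.title else "(none)")

# Crawling: discover further pages to visit by following the links on this one
for anchor in soup.find_all("a"):
    href = anchor.get("href")
    if href:
        print("Discovered link:", href)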
How does Web Scraping help hackers?
- To steal email addresses and passwords from social media websites.
- To identify vulnerabilities in websites.
- To launch denial-of-service attacks.
🫐 Building the tool
Before building the tool itself, let me give a brief overview of what you are going to build: a crawler that takes a URL and a recursion depth, visits the page, and recursively collects the links, subdomains, and JavaScript files it finds.
Importing all the necessary modules in our file
import requests
import argparse
from termcolor import colored
from bs4 import BeautifulSoup
import re
from datetime import datetime
from urllib.parse import urljoin
These are all the modules that will be used.
Taking command-line arguments
We will be using the argparse module to accept command-line inputs in this blog. If you are unfamiliar with the argparse module, you can read my prior blogs, where I have taught the fundamentals of using it as well as detailed how to take user input.
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--url', dest='url', help="Specify the URL, provide it along with http/https", required=True)
    parser.add_argument('-d', '--depth', dest='depth', type=int, default=1, help="Specify the recursion depth limit")
    return parser.parse_args()
Let me first explain the various arguments we are adding.
- -u/--url: This argument accepts the URL the user wants to scan.
- -d/--depth: This argument defines the recursion depth limit.
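Assuming the script is saved as crawler.py (the filename here is just an example), a typical run would look something like this:

python3 crawler.py -u https://example.com -d 2

The -u flag is required, while -d falls back to a depth of 1 when omitted.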
Writing the code outside functions
if __name__ == "__main__":
    args = get_args()
    web_crawler = WebCrawler(args.url, args.depth)
    web_crawler.print_banner()
    web_crawler.start_crawling()
    web_crawler.print_results()
Here, we are simply parsing the command-line arguments and assigning them to the args variable. Then we create an object named web_crawler from the WebCrawler class, and during instantiation we pass it the url and depth arguments that were parsed from the command line.
Creating the WebCrawler Class
Defining the __init__ method
class WebCrawler:
    def __init__(self, url, max_depth):
        self.url = url
        self.max_depth = max_depth
        self.subdomains = set()
        self.links = set()
        self.jsfiles = set()
🙌 The __init__ method is used as the class constructor.
Let’s understand the various elements of the code:
- After defining the WebCrawler class, we construct it using the __init__ method.
- The __init__ method initializes the WebCrawler object with the parameters url and max_depth.
- The __init__ method also creates three instance variables named subdomains, links, and jsfiles, all of which are sets (see the note just below).
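A quick note on why sets are used: a set silently ignores duplicates, so the same URL can be added many times while crawling but is only stored (and later printed) once. A tiny illustration:

links = set()
links.add("https://example.com/about")
links.add("https://example.com/about")  # duplicate, ignored by the set
print(len(links))  # prints 1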
Defining the print_banner method
As you may recall, after instantiating the class in the global scope, we called the print_banner method, so let’s write that down.
def print_banner(self):
    print("-" * 80)
    print(colored(f"Recursive Web Crawler starting at {datetime.now().strftime('%d/%m/%Y %H:%M:%S')}", 'cyan', attrs=['bold']))
    print("-" * 80)
    print(f"[*] URL".ljust(20, " "), ":", self.url)
    print(f"[*] Max Depth".ljust(20, " "), ":", self.max_depth)
    print("-" * 80)
I’m not going to explain the code because it’s simple and straightforward.
Defining the start_crawling method
After calling the print_banner method, we called the start_crawling method, so let’s write it down.
def start_crawling(self):
    self.crawl(self.url, depth=1)
The start_crawling method is responsible for calling the crawl method, which is the main method that does the actual crawling. It calls crawl with the following parameters: the url and the starting value of depth, which is 1.
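If the role of the depth parameter is not obvious yet, here is a minimal standalone sketch (no HTTP involved, and the names are purely illustrative) of how depth-limited recursion behaves; it uses the same guard you will see at the top of crawl:

def visit(url, depth, max_depth=2):
    if depth > max_depth:  # same idea as the check in WebCrawler.crawl
        return
    print(f"visiting {url} at depth {depth}")
    visit(url + "/child", depth + 1, max_depth)

visit("https://example.com", depth=1)
# visiting https://example.com at depth 1
# visiting https://example.com/child at depth 2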
Defining the crawl method
def crawl(self, url, depth):
    # Stop recursing once the depth limit is exceeded
    if depth > self.max_depth:
        return
    try:
        response = requests.get(url, timeout=3, allow_redirects=True)
        soup = BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.RequestException as err:
        print(f"[-] An error occurred: {err}")
        return
    # Matches absolute http/https URLs, capturing the host part
    subdomain_query = r"https?://([a-zA-Z0-9.-]+)"
    for link in soup.find_all('a'):
        link_text = link.get('href')
        if link_text:
            if re.match(subdomain_query, link_text) and link_text not in self.subdomains:
                # Absolute URL: record it as a (sub)domain link
                self.subdomains.add(link_text)
            else:
                # Relative URL: resolve it against the current page and crawl it
                full_link = urljoin(url, link_text)
                if full_link != url and full_link not in self.links:
                    self.links.add(full_link)
                    self.crawl(full_link, depth + 1)
    # Collect the src of every <script> tag on the page
    for file in soup.find_all('script'):
        script_src = file.get('src')
        if script_src:
            self.jsfiles.add(script_src)
This is the main method, which handles all the crawling functionality. Let me give a high-level overview of the whole function.
- Because we have built a recursion limit into the function, the very first lines check whether the current value of depth is larger than self.max_depth; if it is, we simply exit the function.
- In the following lines, inside the try block, the code attempts to send an HTTP GET request to the specified URL and parses its response using the BeautifulSoup package.
- If an exception occurs while making the request, we catch it in the except block, print the error, and exit the function.
- Subdomain and link extraction:
  - For extraction, we have built a regex pattern to help match absolute URLs (and thus subdomains) within the HTML text.
  - We then iterate through all the anchor elements in the HTML using BeautifulSoup; for each anchor, we retrieve the href attribute, which contains the URL.
  - If the href value (link_text) is not empty, the code checks whether it matches subdomain_query. If it does and the link is not already in the subdomains set, it is added there. Otherwise, the code builds the full URL with the urljoin function and checks that it differs from the original URL (url) and is not already in the links set. If all conditions are met, the full link is added to the links set and the recursion continues by calling self.crawl with the new link and an incremented depth. A standalone sketch of this branching follows this list.
- JS file extraction:
  - The code also extracts JavaScript files by iterating through all <script> elements in the HTML and retrieving the src attribute. If a src attribute exists, the script source is added to the jsfiles set.
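To make the branching concrete, here is a small standalone sketch (the hrefs are made-up examples) showing which path a given href would take inside the loop above:

import re
from urllib.parse import urljoin

subdomain_query = r"https?://([a-zA-Z0-9.-]+)"
base = "https://example.com/blog/"

for href in ["https://api.example.com/docs", "/contact", "post-1.html"]:
    if re.match(subdomain_query, href):
        print("absolute URL, stored in subdomains:", href)
    else:
        print("relative URL, joined and crawled:", urljoin(base, href))

Running this prints the absolute URL unchanged and joins the two relative hrefs into https://example.com/contact and https://example.com/blog/post-1.html.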
Defining the print_results method
This is the last method of the class, responsible for printing the results we retrieved.
def print_results(self):
    if self.subdomains:
        for subdomain in self.subdomains:
            print(f"[+] Subdomains : {subdomain}")
        print()
    if self.links:
        for link in self.links:
            print(f"[+] Links : {link}")
        print()
    if self.jsfiles:
        for file in self.jsfiles:
            print(f"[+] JS Files : {file}")
The code is simple to understand, so there isn’t much point explaining it line by line.
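For reference, the output of a run looks roughly like the following (the values shown here are hypothetical placeholders, not real results):

[+] Subdomains : https://blog.example.com/
[+] Links : https://example.com/about
[+] Links : https://example.com/contact
[+] JS Files : /static/js/main.js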
So this is the end of the code. If you want to see the whole code, you can check it out at: 🔗Link to code
🐟 Conclusion
In summary, 🌐 the Recursive Web Crawler is a useful tool for web exploration, security 🔒, and development 🛠️. With its recursive depth control and data extraction capabilities, it helps you learn more about your target.
Python is one of the most versatile and easiest programming languages to learn. While it’s frequently used for automating tasks, there is little it cannot do. Web scraping? Got it. Tic-Tac-Toe? Got it. Python-based video games? Yes, it can do that too! Python can even be used for game hacking. But what happens when you want to find out what these Python apps are doing? Can you reverse engineer them? You sure can, but it’s a bit different from reversing most apps. To learn more, try the Python Reverse Engineering Course here at GuidedHacking. There is truly unlimited potential in Python, and it’s no wonder it’s commonly rated the most popular language that people want to learn next.