Creating a Basic Web Scraper With Python 🐍

Learn How to Scrape Indeed.com


I recently got into web scraping as a way to simplify database insertion for a side project of mine. The uses of web scrapers are endless: you can collect football stats, download videos or images, compare flight prices, and more. Whatever you do, keep in mind that there are some basic legal restrictions on scraping that I won't get into here, but that are worth reading up on before you scrape any site.

For this tutorial, we are going to scrape the Indeed website for relevant software developer jobs. There are two main tools to choose from for scraping with Python: BeautifulSoup and Selenium. While BeautifulSoup would work with the Indeed website, we are going to use Selenium because it also works on JavaScript-heavy websites (most modern websites built with frontend frameworks). In other words, you'll get more use out of it.
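For context, a static-HTML scrape with BeautifulSoup looks roughly like the sketch below (requests and bs4 are assumed to be installed separately; note that requests only fetches raw HTML, so anything rendered by JavaScript never shows up):

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML; content rendered client-side by JavaScript won't be in it
html = requests.get('https://www.indeed.com/jobs?q=software%20developer&l=remote').text
soup = BeautifulSoup(html, 'html.parser')

# The same job cards we'll target with Selenium below, if they appear in the static HTML
cards = soup.select('a.tapItem')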

This tutorial works with Indeed's layout as of 11/15/21. Websites change all the time, but hopefully this tutorial will give you enough guidance to make any updates as needed.

Installation & Imports

Begin by installing Selenium, plus webdriver-manager so you don't have to download a ChromeDriver binary by hand. For more information, visit the Selenium website.

pip install selenium webdriver-manager

Import the following into your Python file:

import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException

Grabbing The Job Posts

Now, we will create a function that visits the Indeed website and gathers the company name, position, description, link, and salary from each job posting on the page. The function will also add each posting to a list of dictionaries. Let's start by having the webdriver visit the page:

def find_jobs():
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get('https://www.indeed.com/jobs?q=software%20developer&l=remote&vjk=1e9c69b0185f6d30')
    # Searching for software developer positions that are remote.
    # Change this as you wish, or convert it to accept input
    # variables as sketched below.
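If you do want to parameterize the search, here's a minimal sketch of one way to build the URL from input variables using the standard library (the q and l query parameters come straight from the URL above; the function name is my own):

from urllib.parse import urlencode

def build_indeed_url(query, location):
    # Build the same kind of search URL as above from plain strings
    params = {'q': query, 'l': location}
    return 'https://www.indeed.com/jobs?' + urlencode(params)

url = build_indeed_url('software developer', 'remote')
# -> 'https://www.indeed.com/jobs?q=software+developer&l=remote'
# (urlencode uses '+' for spaces, which query strings treat the same as %20)

Then pass the result to driver.get(url).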

Scraping is primarily about finding HTML elements and gathering data from them. Inspect the page: what do you notice about its layout in your dev tools? I noticed that each posting is contained in an a tag with a class of tapItem.

[Image: Indeed's page layout in dev tools, with a tapItem element selected]

To create a list of job cards, use driver.find_elements_by_class_name('tapItem'). Check out the full list of selectors on the Selenium website.

jobs = driver.find_elements_by_class_name('tapItem')
job_list = []
# This will hold all of the dictionaries we create for each item.
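Class name is only one selector. As a quick aside, any of these Selenium calls should grab the same elements (given the page layout as of the date above); which one you use is mostly taste:

# Three equivalent ways to select the job cards
jobs = driver.find_elements_by_class_name('tapItem')
jobs = driver.find_elements_by_css_selector('a.tapItem')
jobs = driver.find_elements_by_xpath("//a[contains(@class, 'tapItem')]")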

Finding The Job Information

We are going to loop through each of those jobs and find their corresponding title, company, and more. With your dev tools still open, can you locate the HTML tag that holds the company name? How about the location? Once you locate them, select them using the selector above (or a different one you found in the Selenium docs).

for j in jobs:
    dictionary = {}

    company = j.find_element_by_class_name('companyName').text
    dictionary['company'] = company

    position = j.find_element_by_class_name('jobTitle').text
    dictionary['position'] = position

    location = j.find_element_by_class_name('companyLocation').text
    dictionary['location'] = location

You may have noticed that some job postings list a salary while others do not. Salary information is still very important, so how can we grab it from the postings that have it? The answer is a try/except block:

try:
    salary = j.find_element_by_class_name('salary-snippet-container')
    dictionary['salary'] = salary.text
except NoSuchElementException:
    dictionary['salary'] = 'No salary listed'
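If you end up wrapping several fields this way, one way to generalize the pattern is a small helper. This is a sketch of my own (the helper name and fallback argument aren't from Selenium), but it uses the same calls as above:

def safe_text(element, class_name, fallback=''):
    # Return the text of a child element, or a fallback if it doesn't exist
    try:
        return element.find_element_by_class_name(class_name).text
    except NoSuchElementException:
        return fallback

# Inside the loop, the salary lookup becomes one line:
dictionary['salary'] = safe_text(j, 'salary-snippet-container', 'No salary listed')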

All that's left to collect is a link to more information and a short description. Let's grab those!

description = j.find_element_by_class_name('job-snippet').text
dictionary['description'] = description

Finding the link to view more information is easy for this project. Remember how we selected that a tag earlier? Each job card is the a tag itself, so we just have to read its href attribute and we're all set!

dictionary['link'] = j.get_attribute('href')

Creating .txt Files For Each Posting

Alright, we have all of the information, but reading the postings in your terminal isn't very fun or practical. Instead, we are going to create a .txt file for each posting with the information we've gathered. We are also going to run our find_jobs function every 15 minutes so we can get the newest job postings without doing anything! Let's append those dictionaries to our list and save those files! Create a folder called jobPosts to store the .txt files.

job_list.append(dictionary)

with open(f'jobPosts/{company.replace(" ", "")}.txt', 'w') as f:
    f.write(f"{company}\n {position}\n {location}\n {dictionary['salary']} \n {description} \n {dictionary['link']}")

print(f'File saved: {company.replace(" ", "")}.txt')
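One caveat: the files are named after the company and opened in 'w' mode, so two postings from the same company will overwrite each other. Here's a minimal sketch of one way around that (the numbering scheme is my own invention):

import os

def unique_path(folder, name):
    # Append a counter until the name is free: Acme.txt, Acme-1.txt, Acme-2.txt, ...
    path = os.path.join(folder, f'{name}.txt')
    counter = 1
    while os.path.exists(path):
        path = os.path.join(folder, f'{name}-{counter}.txt')
        counter += 1
    return path

# Usage in place of the hard-coded path above:
# with open(unique_path('jobPosts', company.replace(' ', '')), 'w') as f: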

To run the script every 15 minutes, add this to the end of your code outside of the find_jobs function:

if __name__ == '__main__':
    while True:
        find_jobs()
        time_wait = 15
        print(f'Waiting {time_wait} minutes...')
        time.sleep(time_wait * 60)

Congratulations, you've made your first scraper! Now you can run your scraper and have new jobs in this folder every 15 minutes! Sweet! 🍻

This is just one example of how you can automate your life with Python and web scraping. If you used this tutorial, let me know what else you've created or want to create with Python! Thanks for reading! ✨✨

Full Code

import time
import os
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException

def find_jobs():
    driver = webdriver.Chrome(ChromeDriverManager().install())

    # Search for remote software developer positions
    driver.get('https://www.indeed.com/jobs?q=software%20developer&l=remote&vjk=1e9c69b0185f6d30')

    # Each job posting is an a tag with the class 'tapItem'
    jobs = driver.find_elements_by_class_name('tapItem')

    job_list = []

    # Make sure the output folder exists before writing to it
    os.makedirs('jobPosts', exist_ok=True)

    for j in jobs:
        dictionary = {}

        company = j.find_element_by_class_name('companyName').text
        dictionary['company'] = company

        position = j.find_element_by_class_name('jobTitle').text
        dictionary['position'] = position

        location = j.find_element_by_class_name('companyLocation').text
        dictionary['location'] = location

        # Not every posting lists a salary, so fall back gracefully
        try:
            salary = j.find_element_by_class_name('salary-snippet-container')
            dictionary['salary'] = salary.text
        except NoSuchElementException:
            dictionary['salary'] = 'No salary listed'

        description = j.find_element_by_class_name('job-snippet').text
        dictionary['description'] = description

        # The job card itself is the a tag, so its href is the link
        dictionary['link'] = j.get_attribute('href')

        job_list.append(dictionary)

        # Write one .txt file per posting
        with open(f'jobPosts/{company.replace(" ", "")}.txt', 'w') as f:
            f.write(f"{company}\n {position}\n {location}\n {dictionary['salary']} \n {description} \n {dictionary['link']}")

        print(f'File saved: {company.replace(" ", "")}.txt')

    # Close the browser so repeated runs don't leak Chrome windows
    driver.quit()

if __name__ == '__main__':
    while True:
        find_jobs()
        time_wait = 15
        print(f'Waiting {time_wait} minutes...')
        time.sleep(time_wait * 60)