Master the Art of Web Scraping!
Learn Beautiful Soup, web scraping, and sending email alerts through this fun project!
A lot of people have come up with smart ways to make purchases. And in this era of technology, there are modern solutions for just about anything. For instance, companies that help you find the best deals are mushrooming everywhere. But how exactly do they do that? For this project, we are going to create one such program.
The program will track a product that you are interested in buying and send you an email alert when the product's price goes below a set threshold. Follow along to find out how.
Let me take you through it!
A very important step before we start coding: map out the steps.
- Go to amazon.com (or any other website) and find the item you want to purchase.
- Take note of the item of choice, for example by copying its URL.
- Use the BeautifulSoup Python module to scrape the website and retrieve the price of the item.
- Set up your email using the smtplib module to send an email alert when the price drops below a set threshold, so you know when to purchase.
Assuming you have step 1 out of the way, let's head straight to step 2.
- In order for Python code to interact with the web, we use the requests module. This module allows you to send HTTP requests from your program. Sending an HTTP request is much like going to Google, searching for something and getting results back: you ask for information and get back a response.
a) Import the module into Python. (Modules are pieces of code written by other people that you can reuse.) This is an external module and doesn't come pre-installed with Python, so install it first with pip install requests.
b) The web_URL is the link to the item. For this project, the link is from Amazon.com for an electric scooter.
c) Some websites will only send you their content when you scrape them if you send some extra information along with your request. This information is basically some sort of identification of who you are, e.g. what browser you are using, your computer, etc. This is a way of saying: I am not a robot, I swear! (This is not mandatory for all websites, but some, like amazon.com, will require header information. I will show you what happens when you don't send header information.) To get your header information, go to http://myhttpheader.com/
d) Use the requests module to send a GET request, passing the URL and headers as the inputs.
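Steps a) to d) can be sketched roughly as follows. This is a minimal sketch, not the exact code from the project: the URL and the header values are placeholders (assumptions), so swap in the link to your own item and the values that http://myhttpheader.com/ reports for your browser. The fetch_page helper is a hypothetical name introduced just for illustration.

```python
import requests

# Placeholder link (assumption): replace it with the URL of your own item.
WEB_URL = "https://www.amazon.com/dp/EXAMPLE-ITEM-ID"

# Placeholder header values (assumption): copy yours from http://myhttpheader.com/
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_page(url, headers):
    """Send a GET request and return the HTML of the page as a string."""
    response = requests.get(url, headers=headers, timeout=10)
    # Raise an error for 4xx/5xx answers instead of scraping an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_page(WEB_URL, HEADERS) should give you the page's HTML as one long string, which is what gets handed to the scraper later on.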
If you print the text of the response from the GET request (response.text), you get the HTML document of the web page. Below is a snippet. (This is the HTML structure of the website.)
To proceed with scraping the page we have downloaded, this project assumes you have an idea of how a website is written.
However, in layman’s terms, a website can be divided into a title, a main heading, sub-headings, paragraphs, buttons, links that you can click, and so on. HTML is what is used to achieve that. If you need to familiarize yourself further, take a look at this tutorial before proceeding.
2. Now that we have read the web page into our program, the next thing is to find the part of the page that holds the price information.
To tell which part of the website holds the price information, go to the website, put the cursor over the price (in red) and right-click. Click Inspect in the menu that pops up. This is what will appear.
Above is a screenshot of what you will see once you click Inspect. The highlighted part corresponds to what your cursor was hovering over.
As you can see, the part of the code that holds the price information is in a span tag with id=“priceblock_ourprice”.
If you don't understand what a span or an id is, kindly refer to the tutorial I referred you to.
Now comes the web-scraping.
3. Use Beautiful Soup to scrape the website for that information.
On a web page, an id is meant to be unique. That means there should be only one element with a given id on the page. As a result, we can pull that part of the code using the id.
First, import the module. (Note: the module is not pre-installed with Python. You will have to use pip install to install any external modules.)
a) Import the BeautifulSoup class from the bs4 module as shown.
b) Create a BeautifulSoup object, passing in the website data from the GET request we set up earlier, plus a parser. The parser tells BeautifulSoup, “As you are reading through the website, treat it as HTML, i.e. it is in HTML format.”
c) Once that is done, we use the find() method to tell the soup to look through the page and give us the tag that has the id “priceblock_ourprice”.
d) If you print out that information, you will receive the tag where the price information is contained as shown below: (Note that this is the exact same tag that was highlighted in the official website earlier.)
NOTE: If you didn't put the headers information in your GET request, you will receive None as the output. This is Amazon basically thinking you are a thief!
e) To go further, what we need isn’t the entire tag. We only need the price itself. That is where the getText() method comes in. This will get only the text inside that tag as shown below.
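Steps a) to e) can be sketched like this. It is a minimal sketch under a few assumptions: SAMPLE_HTML is a tiny stand-in for the page returned by the GET request (the real page is far longer), get_price_text is a hypothetical helper name, and the parser used is "html.parser", the one that ships with Python.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by the GET request (an assumption for
# illustration; the real page is far longer).
SAMPLE_HTML = """
<html>
  <body>
    <span id="priceblock_ourprice" class="a-size-medium">$109.99</span>
  </body>
</html>
"""


def get_price_text(html):
    """Return the text of the price tag, or None if the tag is missing."""
    # "html.parser" tells BeautifulSoup to read the document as HTML.
    soup = BeautifulSoup(html, "html.parser")
    # Look for the one tag whose id is "priceblock_ourprice".
    price_tag = soup.find(id="priceblock_ourprice")
    if price_tag is None:
        # Happens when the site did not send back the real product page.
        return None
    # Keep only the text inside the tag, without surrounding whitespace.
    return price_tag.getText().strip()


print(get_price_text(SAMPLE_HTML))
```

Checking for None before calling getText() is a small design choice: it keeps the program from crashing when the tag is not found.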
Once you run it, you will get: THE PRICE ALONE!
As such, you have scraped the website. Way to go!
Now onto step 4: send the email alert.
For this step, we will be using the smtplib module. Find the documentation here in order to understand how to use it.
Note: smtplib is what we use to send emails from Python. With this module, you will need to open a connection, secure the connection, enter your credentials (email and password) and send the email. Finally, we will close the connection.
In order to automatically open and close the connection, we use the with keyword in Python.
But first, remember that the email should only be sent once the price goes below a certain threshold.
The first thing to do is to strip the dollar sign ($) from the data and convert it from a string to a floating-point number (decimal).
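One way to do the conversion is sketched below. The parse_price name is hypothetical, and the string "$109.99" stands in for the text scraped earlier. Removing commas as well is an extra precaution I am assuming is useful for prices in the thousands.

```python
def parse_price(price_text):
    """Convert a price string such as "$109.99" into a float."""
    # Drop surrounding whitespace, the dollar sign and thousands separators.
    cleaned = price_text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


price = parse_price("$109.99")
print(price)
```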
From the above code, we get: 109.99
Then, set the threshold and send the email when the price of the product you want to buy goes below it.
a) Set up the connection using the “with” keyword, which opens and closes the connection for you. This is achieved using the SMTP class from the smtplib module. What you need to pass in as an input is the SMTP server address of your email provider. If you will be using a Gmail address to send the email, use Gmail's SMTP server; the same applies for Yahoo Mail.
b) The starttls() method secures your connection.
c) You will then log in to your account using the login() method, passing in the username and password.
d) Send the email using the sendmail() method. You will need the from_address (the address you are sending the email from), the to_address (the address you’re sending the email to) and the message/email body.
Note: An email without a subject is more likely to be flagged as spam. As such, when composing the email, include a subject. This is done with the Subject: line you see at the start of the message. Add two newlines after it, followed by the message; everything after them is interpreted as the body of the email.
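Put together, steps a) to d) and the subject note might look like the sketch below. Everything in capitals is a placeholder (an assumption): the addresses, the password, the threshold, and the server settings, which use Gmail's SMTP server on port 587 as an example. The helper names build_message, send_alert and maybe_send_alert are hypothetical too.

```python
import smtplib

# All values below are placeholders (assumptions); use your own.
SMTP_HOST = "smtp.gmail.com"  # Gmail's SMTP server; other providers differ
SMTP_PORT = 587               # port commonly used together with starttls()
MY_EMAIL = "you@example.com"
MY_PASSWORD = "your-password"
THRESHOLD = 100.00


def build_message(subject, body):
    """Subject line, then two newlines, then the body of the email."""
    return f"Subject:{subject}\n\n{body}"


def send_alert(price, url):
    """Email yourself that the price has dropped."""
    message = build_message(
        "Price alert!",
        f"The item is now ${price:.2f}. Buy it here: {url}",
    )
    # "with" closes the connection automatically when the block ends.
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as connection:
        connection.starttls()                    # secure the connection
        connection.login(MY_EMAIL, MY_PASSWORD)  # enter your credentials
        connection.sendmail(MY_EMAIL, MY_EMAIL, message)


def maybe_send_alert(price, url, threshold=THRESHOLD):
    """Send the alert only when the price is below the threshold."""
    if price < threshold:
        send_alert(price, url)
        return True
    return False
```

Calling maybe_send_alert(109.99, web_URL, threshold=150.00) would attempt to send the email, while a price above the threshold does nothing. Keep the message to plain ASCII characters, since sendmail() encodes a string message as ASCII. Depending on your provider, you may also need an app-specific password rather than your normal one.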
p/s: I changed the threshold price to $150 to test that the email was actually being sent.
And it is working.
Additional Information
Since this code runs on your own console, you would have to run it yourself every day in order to keep tracking the price, which defeats the purpose. Instead, you can upload your code to PythonAnywhere, which keeps the code running and tracking the price for you! And don't worry: your program will be private. No one will see your email/password.
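If the script is run on a schedule, each run only needs to do one check. A minimal sketch of that single check is below. The check_price_once name is hypothetical, and the two lambda/list stand-ins replace the real scraping and emailing functions so that the sketch runs without the network or an email account.

```python
def check_price_once(get_price, send_alert, threshold):
    """Run one price check; return True if an alert was sent."""
    price = get_price()
    if price is None:
        # The price could not be read (for example, the page layout changed).
        return False
    if price < threshold:
        send_alert(price)
        return True
    return False


# Stand-ins (assumptions) for the real scraping and emailing functions.
sent = []
alerted = check_price_once(
    get_price=lambda: 109.99,
    send_alert=sent.append,
    threshold=150.00,
)
print(alerted, sent)
```

Passing the scraping and emailing functions in as arguments keeps the check easy to test before you upload anything.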
You can find the complete code for this project here!
Thank you for reading