Web Scraping with Selenium and AWS Lambda
João Victor Alhadas | Dec 17, 2024
Web scraping can save hours compared to manually collecting data from websites, and AWS Lambda is a good way to set up scripts that run on demand.
In this blog post, we’ll cover everything you need to know: the pros and cons of web scraping and AWS Lambda, plus a walkthrough of the code of an example project that combines these two technologies, which fit very well together.
If you don’t have a good way to retrieve information from a website, data scraping is the way to go. Besides, it’s awesome to see a browser opening, clicking on buttons, and filling out forms by itself.
AWS Lambda is an easy and low-cost service for deploying scripts to the cloud, but how can we get a Chrome browser installed inside a Lambda function?
Let’s talk about these technologies and go through a Python example that solves this mystery. Or, if you are not into reading definitions and just want to go straight to the point, skip to “The codebase” section of this blog post.
Data scraping is the act of retrieving information generated by another program. In this blog post, we will focus on web scraping, a type of data scraping in which the data being retrieved comes from a website.
You should consider web scraping as a last-resort alternative because it’s usually more beneficial to consume APIs (application programming interfaces) which are interfaces made to get the exact information you want.
Web scraping often requires more processing capacity and is also very likely to break if the website changes its layout, or even if a single element changes its identifying attributes.
However, if there is no convenient API to access the data you need and the website you want to scrape is unlikely to have frequent interface changes, Web Scraping is the way to go!
The simplest technique is manual copy and paste: you only need to select the text you want, copy it, and paste it wherever you want to save it. Every programmer has already web scraped from Stack Overflow ^^
Another technique is text pattern matching, which consists of finding a text pattern within the HTML returned by the website. For example, we can combine the Unix commands curl and grep to fetch the website’s HTML and then match the pattern given inside the quotes, which in this case captures the title tag.
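As a rough illustration (the URL is only a placeholder, not a site from the original post), such a pipeline could look like this:

curl -s https://example.com | grep -o "<title>.*</title>"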
A step up from that is HTML parsing: instead of dealing with the raw HTML text, the HTML is parsed into a tree of elements, making it much easier to find the content you want to scrape.
Below is a Python example using the package Beautiful Soup. In this script, we make a request to Wikipedia’s main page, load its response into a BeautifulSoup instance, then use it to get all of the anchor elements of the page, and finally, iterate through the elements found and print each one of its HREF properties:
from bs4 import BeautifulSoup
from urllib.request import urlopen
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
Retrieved from https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)#Code_example
The last technique, DOM parsing, uses a browser controlled by automation software to interact with the website and retrieve its information. With DOM parsing you can open a page, click buttons, fill out forms, run your own scripts on the page, and then extract the information you want. This is the technique we will focus on.
One famous package used for this is Selenium. This technique requires a driver to communicate with the browser installed (e.g. ChromeDriver, GeckoDriver).
This driver exposes an API so that the Selenium package can manage the browser, and the driver version needs to be compatible with the installed browser version.
Here’s an example of a web scraper that opens a browser, gets the header element of a page, and prints its text content:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome() # Open browser
driver.get("http://example.com") # Access page
header_element = driver.find_element(By.CSS_SELECTOR, 'h1') # Find H1 element
print(header_element.text) # Print found element text content
driver.quit() # Close browser
In this blog post, we are going to run the web scraper script in the cloud. Cloud computing means making services, such as hosting and storage, available over the internet.
Some famous examples of cloud providers are Amazon Web Services (AWS), Google Cloud, and Microsoft Azure. There are several advantages to using a cloud solution, such as on-demand scalability, paying only for what you use, and not having to maintain your own servers.
Some of these cloud services are serverless, which means they are executed on demand and the cloud provider takes care of the server infrastructure on behalf of the customer.
Some of these cloud services require the use of containers, which are isolated environments created from images (templates) that package an application together with its operating-system dependencies. Docker is the main platform for working with containers.
Although Cloud computing is great, it is not needed for every project. Running the script on your machine might be enough.
Each one of the providers mentioned has several ways to host a web scraper script. This AWS blog post describes 3 options.
Of the options described there, we will use AWS Lambda. Its main limitations are the 15-minute execution timeout and the 250 MB deployment package size limit (which rises to 10 GB when using container images).
The example script is very simple and takes less than a minute to execute, but as we previously mentioned, Selenium requires a browser, and the Chrome binary size is around 500 MB, which forces us to use the container approach.
There are several ways to set up an AWS Lambda function. One easy way is by using the Serverless Framework, which helps us develop and deploy Lambda functions through a single YAML file declaring the functions, their infrastructure, and the events that will trigger them.
Using the Serverless Framework also allows us to deploy the lambda functions with a single command, simplifying the process a lot.
The Serverless Framework also provides an optional dashboard which gives us an interface to check the function’s health, trigger events manually, and check their logs.
The codebase
In this section, we will analyze some important files of the demo project, which can be checked here.
serverless.yml is the file where we define the Lambda application’s infrastructure, the Lambda functions, and the events that are going to trigger them.
In the provider section of this file, we declare that we are going to use a docker image named img.
The functions section is where we set the lambda functions and their specific configuration like environment variables and handler functions.
Notice that here we add environment variables holding the browser and driver paths, reference the image that was previously declared, set the command executed when the function is triggered to the handler in the example.py file, and define the trigger as a cron job scheduled to run every 6 hours. A rough sketch of what such a file might look like is shown below.
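This is only an illustrative sketch, not the project’s actual serverless.yml: the service name, environment variable names, paths, and handler name are assumptions made for the example.

service: selenium-scraper            # illustrative name

provider:
  name: aws
  ecr:
    images:
      img:                           # the container image referenced below
        path: ./                     # built from the local Dockerfile

functions:
  scraper:
    image:
      name: img
      command:
        - example.handler            # handler function inside example.py
    environment:
      CHROME_PATH: /opt/chrome/chrome        # assumed browser location
      CHROMEDRIVER_PATH: /opt/chromedriver   # assumed driver location
    events:
      - schedule: rate(6 hours)      # trigger the function every 6 hours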
The Dockerfile is the template from which we build our container image. It produces a Linux image capable of running the web scraper: it installs the project requirements (including Chrome and ChromeDriver) and copies the required files into the image.
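The repository’s actual Dockerfile is not reproduced here; the following is only a simplified sketch of the idea, where the base image tag and the install-browser.sh helper are illustrative assumptions (the real Chrome/ChromeDriver installation steps depend on the base image and browser version):

FROM public.ecr.aws/lambda/python:3.11

# Download and install Chrome plus a matching ChromeDriver
# (install-browser.sh is a hypothetical helper script)
COPY install-browser.sh ./
RUN chmod +x ./install-browser.sh && ./install-browser.sh

# Install the Python requirements (Selenium, etc.) and copy the function code
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY example.py ./

# Default handler; the Serverless Framework's command setting can override this
CMD ["example.handler"]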
The example.py file contains the handler function that will be executed when the event is triggered. It is pretty similar to the Selenium example we showed earlier.
The main difference is that we configure the browser to run headless, without displaying an interface (because the Lambda environment has no display), and to use a single process (because the Lambda function only has one CPU).
In this case, the handler function returns a dictionary with a status code and a body, in case we later want to change the trigger from a cron job to an HTTP request.
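Putting these pieces together, a minimal sketch of what such a handler could look like follows; the environment variable names and the scraped page are assumptions for illustration, not the demo project’s exact code:

import os

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


def handler(event, context):
    options = webdriver.ChromeOptions()
    options.binary_location = os.environ["CHROME_PATH"]  # assumed env var with the browser path
    options.add_argument("--headless")        # no display available inside Lambda
    options.add_argument("--no-sandbox")
    options.add_argument("--single-process")  # keep Chrome on a single process
    options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(
        service=Service(os.environ["CHROMEDRIVER_PATH"]),  # assumed env var with the driver path
        options=options,
    )
    driver.get("http://example.com")
    header_text = driver.find_element(By.CSS_SELECTOR, "h1").text
    driver.quit()

    # The statusCode/body shape makes it easy to later swap the cron trigger for an HTTP event
    return {"statusCode": 200, "body": header_text}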
To deploy the Lambda function, we just need to run a single command in the terminal.
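With the Serverless Framework, that command is typically run from the project directory:

serverless deploy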
In this case, the Serverless Framework raises a warning message explaining that the dashboard does not support functions that use container images. This means we will not be able to check the Lambda function’s logs or trigger it manually through the dashboard.
But we can still do both through the AWS console, where the logs of a successful execution can be retrieved from AWS CloudWatch.
This is a very specific example of using AWS Lambda and Selenium, but I hope it illustrates the potential of these technologies. Instead of a web scraper, we could create functions that run end-to-end tests and send a Slack message in case of failure, or an API that calculates the distance between two strings and returns it in the HTTP response. It’s all up to your imagination!
Demo Project Repository:
Content references:
A computer scientist who loves to study new technologies. Also enjoys rap, watching movies and TV shows, sports (especially soccer), and playing videogames.