{"id":11057,"date":"2023-01-12T20:42:29","date_gmt":"2023-01-12T20:42:29","guid":{"rendered":"https:\/\/cheesecakelabs.com\/blog\/"},"modified":"2024-07-31T18:59:15","modified_gmt":"2024-07-31T18:59:15","slug":"selenium-scraper-aws-lambda","status":"publish","type":"post","link":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/","title":{"rendered":"How To Use Selenium To Web-Scrape on AWS Lambda"},"content":{"rendered":"\n<p>Web scraping might save several hours when compared to manually collecting data from websites, and AWS Lambda is a good way to set up scripts to run by demand.<\/p>\n\n\n\n<p>In this blog post, we\u2019ll cover everything you need to know,\u00a0 the good and bad things about Web Scraping and AWS Lambda, and also analyze the code of an example project where we can combine these two technologies that fit very well together.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">In this article<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"#Introduction\">Introduction<\/a>\n<ul class=\"wp-block-list\">\n<li><a href=\"#Data-Scraping\">Data Scraping<\/a><\/li>\n\n\n\n<li><a href=\"#List-of-some-web-scraping-techniques:\">Web scraping techniques<\/a><\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><a href=\"#How-to-use-the-Web-Scraper-script?\">How to use the Web Scraper script?<\/a><\/li>\n\n\n\n<li><a href=\"#How-are-we-going-to-deploy-it-to-AWS-Lambda?\">How are we going to deploy it to AWS Lambda?<\/a>\n<ul class=\"wp-block-list\">\n<li><a href=\"#The-codebase:\">The codebase<\/a><\/li>\n\n\n\n<li><a href=\"#Deploying\">Deploying<\/a><\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><a href=\"#Conclusions\">Conclusions<\/a><\/li>\n\n\n\n<li><a href=\"#Resources\">Resources<\/a><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"Introduction\"><strong>Introduction<\/strong><\/h2>\n\n\n\n<p>If you don&#8217;t have a good way to retrieve information from a website, data scraping is the way to go. 
Besides, it&#8217;s awesome to see a browser opening, clicking on buttons, and filling out forms by itself.<\/p>\n\n\n\n<p>AWS Lambda is an easy, low-cost service for deploying scripts to the cloud, but how can we get a Chrome browser installed in a Lambda function?<\/p>\n\n\n\n<p>Let&#8217;s talk about these technologies and go through a Python example that solves this mystery. Or, if you are not into reading definitions and just want to go straight to the point, skip to &#8220;The codebase&#8221; section of this blog post.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"Data-Scraping\">Data Scraping<\/h2>\n\n\n\n<p>Data scraping is the retrieval of information generated by another program. In this blog post, we will focus on web scraping, which is the type of data scraping where the data being retrieved comes from a website.&nbsp;<\/p>\n\n\n\n<p>You should consider web scraping as a last-resort alternative because it&#8217;s usually more beneficial to consume <a href=\"https:\/\/cheesecakelabs.com\/blog\/api-design-think-first-code-later\/\" target=\"_blank\" rel=\"noreferrer noopener\">APIs<\/a> (application programming interfaces), which are interfaces made to get the exact information you want.<\/p>\n\n\n\n<p>Web scraping often requires more processing capacity than calling an API and is also very likely to break if the website changes its layout, or even if a single element changes its identifying attributes (such as its id or class).<\/p>\n\n\n\n<p>However, if there is no convenient API to access the data you need and the website you want to scrape is unlikely to have frequent interface changes, web scraping is the way to go!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"List-of-some-web-scraping-techniques:\"><strong>List of some web scraping techniques:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Copy and Paste<\/strong><\/li>\n<\/ul>\n\n\n\n<p>This technique is very simple and manual. 
To use it, you only need to select the text you want to get, copy it, and paste it wherever you want to save it. Every programmer has already web scraped from Stack Overflow this way ^^<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Text Pattern Matching<\/strong><\/li>\n<\/ul>\n\n\n\n<p>This technique consists of finding a text pattern match within the website-generated HTML. Below is a screenshot where we combine the Unix commands curl and grep to, respectively, fetch the website HTML and find the text that matches the pattern inside the quotes, which is the title tag in this case.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1200\" height=\"226\" src=\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-scraper-aws-lambda-1200x226.png\" alt=\"\" class=\"wp-image-11058\" srcset=\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-scraper-aws-lambda-1200x226.png 1200w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-scraper-aws-lambda-600x113.png 600w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-scraper-aws-lambda-768x145.png 768w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-scraper-aws-lambda-760x143.png 760w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-scraper-aws-lambda.png 1274w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>HTML Parsing<\/strong><\/li>\n<\/ul>\n\n\n\n<p>In this case, instead of dealing with the raw HTML text, the HTML is parsed into elements, making it easier to find the content you want to scrape.&nbsp;<\/p>\n\n\n\n<p>Below is a <a href=\"https:\/\/cheesecakelabs.com\/blog\/biggest-benefits-of-python\/\" target=\"_blank\" rel=\"noreferrer noopener\">Python<\/a> example using the package <a 
href=\"https:\/\/beautiful-soup-4.readthedocs.io\/en\/latest\/\" target=\"_blank\" rel=\"noreferrer noopener\">Beautiful Soup<\/a>. In this script, we make a request to Wikipedia&#8217;s main page, load its response into a BeautifulSoup instance, use it to get all of the anchor elements of the page, and finally iterate through the elements found and print the href attribute of each one:<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\"><pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-keyword\">from<\/span> bs4 <span class=\"hljs-keyword\">import<\/span> BeautifulSoup\n<span class=\"hljs-keyword\">from<\/span> urllib.request <span class=\"hljs-keyword\">import<\/span> urlopen\n\n<span class=\"hljs-keyword\">with<\/span> urlopen(<span class=\"hljs-string\">'https:\/\/en.wikipedia.org\/wiki\/Main_Page'<\/span>) <span class=\"hljs-keyword\">as<\/span> response:\n   soup = BeautifulSoup(response, <span class=\"hljs-string\">'html.parser'<\/span>)\n   <span class=\"hljs-keyword\">for<\/span> anchor <span class=\"hljs-keyword\">in<\/span> soup.find_all(<span class=\"hljs-string\">'a'<\/span>):\n       print(anchor.get(<span class=\"hljs-string\">'href'<\/span>, <span class=\"hljs-string\">'\/'<\/span>))<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Retrieved from <a 
href=\"https:\/\/en.wikipedia.org\/wiki\/Beautiful_Soup_(HTML_parser)#Code_example\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/en.wikipedia.org\/wiki\/Beautiful_Soup_(HTML_parser)#Code_example<\/a><\/p>\n<\/div>\n<\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>DOM Parsing<\/strong><\/li>\n<\/ul>\n\n\n\n<p>This technique uses a browser controlled by automation software to interact and retrieve the information on the website. So with DOM parsing, you are capable of opening a page, clicking buttons, filling forms, running your scripts on the page, and then getting the information you want. This is the technique we will focus on.<\/p>\n\n\n\n<p>One famous package used for this is <a href=\"https:\/\/www.selenium.dev\/\" target=\"_blank\" rel=\"noreferrer noopener\">Selenium<\/a>. This technique requires a driver to communicate with the browser installed (e.g. <a href=\"https:\/\/chromedriver.chromium.org\/downloads\" target=\"_blank\" rel=\"noreferrer noopener\">ChromeDriver<\/a>, <a href=\"https:\/\/github.com\/mozilla\/geckodriver\/releases\" target=\"_blank\" rel=\"noreferrer noopener\">GeckoDriver<\/a>).<\/p>\n\n\n\n<p>This driver provides an API so the Selenium package can manage the browser. 
The driver version needs to be compatible with the installed browser version.<\/p>\n\n\n\n<p>Here&#8217;s an example of a web scraper that opens a browser, gets the header element of a page, and prints its text content:<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\"><pre class=\"wp-block-code\" aria-describedby=\"shcb-language-2\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-keyword\">from<\/span> selenium <span class=\"hljs-keyword\">import<\/span> webdriver\n<span class=\"hljs-keyword\">from<\/span> selenium.webdriver.common.by <span class=\"hljs-keyword\">import<\/span> By\n\ndriver = webdriver.Chrome() <span class=\"hljs-comment\"># Open browser<\/span>\ndriver.get(<span class=\"hljs-string\">\"http:\/\/example.com\"<\/span>) <span class=\"hljs-comment\"># Access page<\/span>\nheader_element = driver.find_element(By.CSS_SELECTOR, <span class=\"hljs-string\">'h1'<\/span>) <span class=\"hljs-comment\"># Find H1 element<\/span>\nprint(header_element.text) <span class=\"hljs-comment\"># Print found element text content<\/span>\ndriver.quit() <span class=\"hljs-comment\"># Close browser<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-2\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"How-to-use-the-Web-Scraper-script?\"><strong>How to use the Web Scraper script?&nbsp;<\/strong><\/h2>\n\n\n\n<p>In this blog post, we are going to run the web scraper script in the <strong>cloud<\/strong>. 
Cloud computing means making services such as hosting and storage available over the internet.<\/p>\n\n\n\n<p>Some famous examples of cloud providers are <a href=\"https:\/\/aws.amazon.com\/?nc2=h_lg\" target=\"_blank\" rel=\"noreferrer noopener\">Amazon Web Services (AWS)<\/a>, <a href=\"https:\/\/cloud.google.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Google Cloud<\/a>, and <a href=\"https:\/\/azure.microsoft.com\/en-us\/\" target=\"_blank\" rel=\"noreferrer noopener\">Microsoft Azure<\/a>. There are several advantages of using a cloud solution, for example:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You don&#8217;t need to set up your own server<\/li>\n\n\n\n<li>There are a lot of engineers working to keep these services free of security breaches<\/li>\n\n\n\n<li>It&#8217;s usually very easy to scale the capabilities of the cloud service you are using. In many of these services, you can increase the memory and processing capacity just by clicking some buttons<\/li>\n<\/ul>\n\n\n\n<p>Some of these <a href=\"https:\/\/cheesecakelabs.com\/blog\/cloud-services-best-fit-for-your-project\/\" target=\"_blank\" rel=\"noreferrer noopener\">cloud services<\/a> are <strong>serverless<\/strong>, which means they are executed on demand and the cloud provider takes care of the server infrastructure on behalf of the customer.&nbsp;<\/p>\n\n\n\n<p>Some of these cloud services require the usage of <strong>containers<\/strong>, which are isolated, virtualized environments started from images (templates). <a href=\"https:\/\/www.docker.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Docker<\/a> is the most widely used platform for working with containers.<\/p>\n\n\n\n<p>Although cloud computing is great, it is not needed for every project. Running the script on your machine might be enough.\u00a0<\/p>\n\n\n\n<p>Each one of the providers mentioned has several ways to host a web scraper script. 
<a href=\"https:\/\/aws.amazon.com\/blogs\/architecture\/serverless-architecture-for-a-web-scraping-solution\/\" target=\"_blank\" rel=\"noreferrer noopener\">This AWS blog post<\/a> describes 3 options.<\/p>\n\n\n\n<p>To summarize the blog post, the options are:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Create a virtual machine using the EC2 service. This is the most primitive option: you&#8217;d need to set up the machine just like a regular one, and it would be kept running 24 hours a day. This is also the most expensive solution out of the 3 options.<\/li>\n\n\n\n<li>Containerize the script and run it on the AWS Fargate service. Fargate is a serverless option, which is useful because the web scraper only needs to be executed on demand. For a job that runs only occasionally, Fargate is also usually cheaper than an always-on EC2 instance.<\/li>\n\n\n\n<li>Use AWS Lambda, which is also a serverless service and supports both raw code and containerized scripts. It has more limitations compared to Fargate, but it is enough in most cases. For short, infrequent runs it&#8217;s usually the cheapest of the three, and your script might even fit the free tier.<\/li>\n<\/ol>\n\n\n\n<p>Here, we will use AWS Lambda. Its main limitations are a maximum execution time of 15 minutes and a deployment package that can&#8217;t exceed 250 MB unzipped (container images, however, can be up to 10 GB).<\/p>\n\n\n\n<p>The example script is very simple and takes less than a minute to execute, but as we previously mentioned, Selenium requires a browser, and the Chrome binary size is around 500 MB, which forces us to use the container approach.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"How-are-we-going-to-deploy-it-to-AWS-Lambda?\"><strong>How are we going to deploy it to AWS Lambda?<\/strong><\/h2>\n\n\n\n<p>There are several ways to set up an AWS Lambda function. One easy way is by using the <a href=\"https:\/\/serverless.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Serverless Framework<\/a>. 
The Serverless Framework helps us to develop and deploy Lambda functions by using a single YAML file to declare the functions, their infrastructure, and the events that will trigger them.<\/p>\n\n\n\n<p>Using the Serverless Framework also allows us to deploy the Lambda functions with a single command, simplifying the process a lot.<\/p>\n\n\n\n<p>The Serverless Framework also provides an optional dashboard, which gives us an interface to check the functions&#8217; health, trigger events manually, and check their logs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"The-codebase:\"><strong>The codebase:<\/strong><\/h3>\n\n\n\n<p>In this section, we will analyze some important files of a demo project which can be checked <a href=\"https:\/\/github.com\/CheesecakeLabs\/selenium-serverless-example\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/CheesecakeLabs\/selenium-serverless-example\/blob\/main\/serverless.yml\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>serverless.yml<\/strong><\/a><\/p>\n\n\n\n<p>This is the file where we set the Lambda application infrastructure, the Lambda functions, and the events that are going to trigger them.<\/p>\n\n\n\n<p>In the <strong>provider<\/strong> section of this file, we declare that we are going to use a Docker image named <em>img<\/em>.<\/p>\n\n\n\n<p>The <strong>functions<\/strong> section is where we set the Lambda functions and their specific configuration, like environment variables and handler functions.<\/p>\n\n\n\n<p>Notice that the function&#8217;s configuration does four things: it adds environment variables holding the paths to the browser and its driver, it selects the image that was previously declared, it sets the handler in the <em>example.py<\/em> file as the command executed when the function is triggered, and it declares the triggering event, a <em>cron job<\/em> scheduled for every 6 hours.<\/p>\n\n\n\n<p><a 
href=\"https:\/\/github.com\/CheesecakeLabs\/selenium-serverless-example\/blob\/main\/Dockerfile\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Dockerfile<\/strong><\/a><\/p>\n\n\n\n<p>The Dockerfile is the template where we configure our container image. It describes a Linux image capable of running the web scraper. The template installs the project requirements (including Chrome and ChromeDriver) and copies the required files to the image.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/CheesecakeLabs\/selenium-serverless-example\/tree\/main\/src\/handlers\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>example.py<\/strong><\/a><\/p>\n\n\n\n<p>This file has the handler function that will be executed when the event is triggered. It is pretty similar to the Selenium example we showed earlier.<\/p>\n\n\n\n<p>The main difference is that we customize the browser to run headless, that is, without displaying an interface (because the Lambda environment does not have a display), and to use a single process (because the Lambda environment has very limited CPU resources).<\/p>\n\n\n\n<p>In this case, the handler function returns a dictionary with a status code and a body, just in case we want to change the event from a cron job to an HTTP request.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"Deploying\"><strong>Deploying<\/strong><\/h3>\n\n\n\n<p>To deploy the Lambda function, we just need to run a single command in the terminal.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1200\" height=\"275\" src=\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapper-aws-lambda-1200x275.jpg\" alt=\"\" class=\"wp-image-11060\" srcset=\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapper-aws-lambda-1200x275.jpg 1200w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapper-aws-lambda-600x138.jpg 600w, 
https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapper-aws-lambda-768x176.jpg 768w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapper-aws-lambda-1536x352.jpg 1536w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapper-aws-lambda-760x174.jpg 760w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapper-aws-lambda.jpg 1858w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><\/figure>\n\n\n\n<p>In this case, the Serverless Framework raises a warning explaining that the dashboard does not support functions that use container images. It means that we will not be able to check the Lambda function logs or trigger the function manually through the dashboard.<\/p>\n\n\n\n<p>But we can still do both through the AWS console. Below we have a screenshot of a successful log retrieved from Amazon CloudWatch:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1200\" height=\"379\" src=\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapping-aws-1200x379.jpg\" alt=\"\" class=\"wp-image-11062\" srcset=\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapping-aws-1200x379.jpg 1200w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapping-aws-600x190.jpg 600w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapping-aws-768x243.jpg 768w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapping-aws-1536x486.jpg 1536w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapping-aws-760x240.jpg 760w, https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/web-scrapping-aws.jpg 1999w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><\/figure>\n\n\n\n<h2 
class=\"wp-block-heading\" id=\"Conclusions\"><strong>Conclusions<\/strong><\/h2>\n\n\n\n<p>This is a very specific example of the usage of AWS Lambda and Selenium but I hope it can illustrate the potential of these technologies. Instead of creating a web scraper, we can create functions that run end-to-end tests that, in case of failure, send a Slack message, or we can create an API that calculates the distance between two strings and returns it in the HTTP response. It&#8217;s all up to your imagination!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"Resources\"><strong>Resources<\/strong><\/h2>\n\n\n\n<p>Demo Project Repository:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/github.com\/CheesecakeLabs\/selenium-serverless-example\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/CheesecakeLabs\/selenium-serverless-example<\/a><\/li>\n<\/ul>\n\n\n\n<p>Content references:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/aws.amazon.com\/blogs\/architecture\/serverless-architecture-for-a-web-scraping-solution\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/aws.amazon.com\/blogs\/architecture\/serverless-architecture-for-a-web-scraping-solution\/<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_scraping\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/en.wikipedia.org\/wiki\/Data_scraping<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Web_scraping\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/en.wikipedia.org\/wiki\/Web_scraping<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/umihico\/docker-selenium-lambda\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/umihico\/docker-selenium-lambda<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/dev.to\/awscommunity-asean\/creating-an-api-that-runs-selenium-via-aws-lambda-3ck3\" target=\"_blank\" rel=\"noreferrer 
noopener\">https:\/\/dev.to\/awscommunity-asean\/creating-an-api-that-runs-selenium-via-aws-lambda-3ck3<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/docs.aws.amazon.com\/lambda\/latest\/dg\/gettingstarted-limits.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/docs.aws.amazon.com\/lambda\/latest\/dg\/gettingstarted-limits.html<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.docker.com\/resources\/what-container\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.docker.com\/resources\/what-container\/<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Web scraping might save several hours when compared to manually collecting data from websites, and AWS Lambda is a good way to set up scripts to run by demand. In this blog post, we\u2019ll cover everything you need to know,\u00a0 the good and bad things about Web Scraping and AWS Lambda, and also analyze the [&hellip;]<\/p>\n","protected":false},"author":81,"featured_media":11064,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[432],"tags":[305],"class_list":["post-11057","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-engineering","tag-tag-development"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.1.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Setting up a Selenium Scraper on AWS Lambda<\/title>\n<meta name=\"description\" content=\"In this article, you&#039;ll discover more about Web Scrapping and learn step by step how to use Selenium to Web-Scrape on a AWS Lambda.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta 
property=\"og:title\" content=\"Setting up a Selenium Scraper on AWS Lambda\" \/>\n<meta property=\"og:description\" content=\"In this article, you&#039;ll discover more about Web Scrapping and learn step by step how to use Selenium to Web-Scrape on a AWS Lambda.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/\" \/>\n<meta property=\"og:site_name\" content=\"Cheesecake Labs\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cheesecakelabs\" \/>\n<meta property=\"article:published_time\" content=\"2023-01-12T20:42:29+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-07-31T18:59:15+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"860\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Cheesecake Labs\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@cheesecakelabs\" \/>\n<meta name=\"twitter:site\" content=\"@cheesecakelabs\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/\"},\"author\":{\"name\":\"Karran Besen\"},\"headline\":\"How To Use Selenium To Web-Scrape on AWS Lambda\",\"datePublished\":\"2023-01-12T20:42:29+00:00\",\"dateModified\":\"2024-07-31T18:59:15+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/\"},\"wordCount\":1734,\"image\":{\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg\",\"keywords\":[\"development\"],\"articleSection\":[\"Engineering\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/\",\"url\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/\",\"name\":\"Setting up a Selenium Scraper on AWS Lambda\",\"isPartOf\":{\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg\",\"datePublished\":\"2023-01-12T20:42:29+00:00\",\"dateModified\":\"2024-07-31T18:59:15+00:00\",\"author\":{\"@type\":\"person\",\"name\":\"Karran Besen\"},\"description\":\"In this article, you'll discover more about Web Scrapping and learn step by step how to use Selenium to 
Web-Scrape on a AWS Lambda.\",\"breadcrumb\":{\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#primaryimage\",\"url\":\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg\",\"contentUrl\":\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg\",\"width\":1920,\"height\":860,\"caption\":\"woman programming in front of a pc\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/cheesecakelabs.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How To Use Selenium To Web-Scrape on AWS Lambda\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/#website\",\"url\":\"https:\/\/cheesecakelabs.com\/blog\/\",\"name\":\"Cheesecake Labs\",\"description\":\"Nearshore outsourcing company for Web and Mobile design and engineering services, and staff augmentation for startups and enterprises..\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/cheesecakelabs.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"name\":\"Karran 
Besen\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/cheesecakelabs.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2019\/12\/karran-300x300.png\",\"contentUrl\":\"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2019\/12\/karran-300x300.png\",\"caption\":\"Karran Besen\"},\"url\":\"https:\/\/cheesecakelabs.com\/blog\/autor\/karran-besen\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Setting up a Selenium Scraper on AWS Lambda","description":"In this article, you'll discover more about Web Scrapping and learn step by step how to use Selenium to Web-Scrape on a AWS Lambda.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/","og_locale":"en_US","og_type":"article","og_title":"Setting up a Selenium Scraper on AWS Lambda","og_description":"In this article, you'll discover more about Web Scrapping and learn step by step how to use Selenium to Web-Scrape on a AWS Lambda.","og_url":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/","og_site_name":"Cheesecake Labs","article_publisher":"https:\/\/www.facebook.com\/cheesecakelabs","article_published_time":"2023-01-12T20:42:29+00:00","article_modified_time":"2024-07-31T18:59:15+00:00","og_image":[{"width":1920,"height":860,"url":"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg","type":"image\/jpeg"}],"author":"Cheesecake Labs","twitter_card":"summary_large_image","twitter_creator":"@cheesecakelabs","twitter_site":"@cheesecakelabs","twitter_misc":{"Written by":null,"Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#article","isPartOf":{"@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/"},"author":{"name":"Karran Besen"},"headline":"How To Use Selenium To Web-Scrape on AWS Lambda","datePublished":"2023-01-12T20:42:29+00:00","dateModified":"2024-07-31T18:59:15+00:00","mainEntityOfPage":{"@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/"},"wordCount":1734,"image":{"@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#primaryimage"},"thumbnailUrl":"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg","keywords":["development"],"articleSection":["Engineering"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/","url":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/","name":"Setting up a Selenium Scraper on AWS Lambda","isPartOf":{"@id":"https:\/\/cheesecakelabs.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#primaryimage"},"image":{"@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#primaryimage"},"thumbnailUrl":"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg","datePublished":"2023-01-12T20:42:29+00:00","dateModified":"2024-07-31T18:59:15+00:00","author":{"@type":"person","name":"Karran Besen"},"description":"In this article, you'll discover more about Web Scrapping and learn step by step how to use Selenium to Web-Scrape on a AWS 
Lambda.","breadcrumb":{"@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#primaryimage","url":"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg","contentUrl":"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2023\/01\/selenium-web-scraping-aws-lambda.jpg","width":1920,"height":860,"caption":"woman programming in front of a pc"},{"@type":"BreadcrumbList","@id":"https:\/\/cheesecakelabs.com\/blog\/selenium-scraper-aws-lambda\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/cheesecakelabs.com\/blog\/"},{"@type":"ListItem","position":2,"name":"How To Use Selenium To Web-Scrape on AWS Lambda"}]},{"@type":"WebSite","@id":"https:\/\/cheesecakelabs.com\/blog\/#website","url":"https:\/\/cheesecakelabs.com\/blog\/","name":"Cheesecake Labs","description":"Nearshore outsourcing company for Web and Mobile design and engineering services, and staff augmentation for startups and enterprises.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/cheesecakelabs.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","name":"Karran 
Besen","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/cheesecakelabs.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2019\/12\/karran-300x300.png","contentUrl":"https:\/\/ckl-website-static.s3.amazonaws.com\/wp-content\/uploads\/2019\/12\/karran-300x300.png","caption":"Karran Besen"},"url":"https:\/\/cheesecakelabs.com\/blog\/autor\/karran-besen\/"}]}},"_links":{"self":[{"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/posts\/11057","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/users\/81"}],"replies":[{"embeddable":true,"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/comments?post=11057"}],"version-history":[{"count":5,"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/posts\/11057\/revisions"}],"predecessor-version":[{"id":12202,"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/posts\/11057\/revisions\/12202"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/media\/11064"}],"wp:attachment":[{"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/media?parent=11057"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/categories?post=11057"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cheesecakelabs.com\/blog\/wp-json\/wp\/v2\/tags?post=11057"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}