An AWS hack for cheap and reliable proxies
Hey gang,
Following up on my last article about proxy scraping, I thought I'd share a cool hack I found for when you run your scraping workflows through lambdas, which I often find myself doing.
To give you guys a concrete use case, I'll piggyback off the Scraping with Tor article where I scrape Linkedin job offers.
Tor is pretty tough to use on lambdas and Linkedin blocks a ton of free public proxies.
This gives us the perfect situation for our lambda-powered proxy hack.
To follow along you’re gonna need Serverless installed and configured with your AWS account. I explain how to do so in this video!
Scraping Linkedin job offers
Our goal here is to scrape all the new software engineering offers in France every day.
Here's the Linkedin URL I was working with; the page looks fairly simple:
In a world where Linkedin doesn't block our requests, the process would be fairly simple:
- Scrape the job offer urls from this page
- Access the job offers one by one
- Format and return the data
The issue with this process is that when we loop over the job offers, Linkedin will block our IP.
To leverage the hack, we need to create two cloud functions: one responsible for launching the whole process (including scraping the top-level job offer URLs), and another that simply scrapes the job offer information from a job URL.
- The first function, which I'll call main, is triggered every day and scrapes the top-level job offer URLs
- The second function, the worker, accesses a job URL and actually scrapes the content of the offer
- The main function launches the worker for every URL, waiting for the worker to answer with the scraped content (or an error)
So before implementing our hack, this is what our code looks like (the JOB_SCRAPER_ARN variable is the resource name of our worker function):
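A minimal sketch of the main side of that layout, assuming boto3 and a worker that accepts a `{"url": ...}` payload and answers with JSON. The payload shape, the `scrape_offers` helper and the `job_urls` event key are hypothetical, and the top-level page scraping is left out:

```python
import json
import os


def invoke_worker(lambda_client, job_url):
    """Call the worker lambda synchronously and return its decoded JSON answer."""
    response = lambda_client.invoke(
        FunctionName=os.environ["JOB_SCRAPER_ARN"],
        InvocationType="RequestResponse",  # wait for the worker's answer
        Payload=json.dumps({"url": job_url}).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


def scrape_offers(lambda_client, job_urls):
    """Launch the worker once per job URL and collect the answers in order."""
    return [invoke_worker(lambda_client, url) for url in job_urls]


def main(event, context):
    """Daily entry point. Assumes the job URLs arrive in the event (hypothetical)."""
    import boto3  # imported lazily so the helpers above can be tested offline

    return scrape_offers(boto3.client("lambda"), event.get("job_urls", []))
```

Passing the client in as an argument keeps the helpers easy to try out with a stub before deploying anything.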
Our serverless configuration, which will not change throughout the tutorial, looks like this:
I didn't set up a trigger for our worker since it's only meant to be called by the lambda client in the main function.
You can already deploy the functions at this stage, but you'll see that you run into errors really quickly.
Let's now talk about the meat of the article: the hack.
A cold start
To set the context for the hack, I have to go a bit further into what lambdas actually are.
AWS defines Lambda as:
A serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.
Basically, write code and deploy a lambda => you can now use that code everywhere.
Under the hood, you can think of lambdas as a pool of idle VMs that get put to work when your function is called.
When your lambda is called for the first time, a process starts your ephemeral lambda instance, and then your code is run.
That first call takes significantly more time than the subsequent calls, which is why it's called a cold start.
According to AWS:
Cold starts typically occur in under 1% of invocations. The duration of a cold start varies from under 100 ms to over 1 second.
After you've made your calls and the function has been idle for a bit, the instance destroys itself, freeing the space for another use.
Usually cold starts are seen as a drawback of using serverless functions, but in our case we're going to use them as a feature.
Indeed, when a function cold starts, it redeploys itself on another pooled instance, which in turn tends to change the IP address its code runs from.
So if there's a scalable way to force a cold start, there's a way to change the IP our function runs from freely.
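To see this lifecycle for yourself, here's a small sketch that relies on the fact that module-level code only runs once per instance, when it cold starts. The `handler` name and the returned keys are just illustrative:

```python
import uuid

# Module-level code runs once per lambda instance, i.e. during a cold start.
INSTANCE_ID = uuid.uuid4().hex
_invocation_count = 0


def handler(event, context):
    """Report which instance served the call and whether it was a cold start."""
    global _invocation_count
    _invocation_count += 1
    return {
        "instance_id": INSTANCE_ID,
        "cold_start": _invocation_count == 1,  # True only for the first call
    }
```

Calling the deployed function a few times in a row should keep returning the same `instance_id` until the instance is recycled.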
Forcing cold starts
Forcing cold starts is actually super simple: any update to a lambda function will force it to cold start on its next run.
I chose to update the environment of the function that needs to run with a random id.
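A minimal sketch of that update, assuming a boto3 lambda client. The `force_cold_start` helper and the `COLD_START_ID` variable name are hypothetical. Since updating the environment replaces the whole set of variables, the existing ones are read first and merged back in:

```python
import uuid


def force_cold_start(lambda_client, function_name):
    """Write a fresh random id into the function's environment to trigger a cold start."""
    current = lambda_client.get_function_configuration(FunctionName=function_name)
    # Keep the variables that are already set, the update overwrites them all.
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables["COLD_START_ID"] = uuid.uuid4().hex  # hypothetical variable name
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables["COLD_START_ID"]
```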
To use it in a synchronous way as we'll do for our scraping, we also need to wait for our function to finish updating using an AWS waiter:
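With boto3, this could look something like the sketch below, which uses the lambda client's `function_updated` waiter. The `wait_for_update` helper and its default polling values are my own assumptions:

```python
def wait_for_update(lambda_client, function_name, delay=1, max_attempts=60):
    """Block until Lambda reports the pending configuration update as finished."""
    waiter = lambda_client.get_waiter("function_updated")
    waiter.wait(
        FunctionName=function_name,
        # Poll every `delay` seconds, give up after `max_attempts` checks.
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

Invoking the function before the update has finished risks hitting the old instance, which is why the wait matters here.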
Finally, I'll simply add an if/else statement to force a cold start if Linkedin blocked our call, and voilà:
I also added the IP to the output for good measure and to show you how often we actually change our IP.
Forcing a cold start is free and, as you already know, lambda computing is dirt cheap, so it's perfect for small use cases like this.
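A rough sketch of that branch, assuming the worker flags a blocked request with a `blocked` key in its answer (a hypothetical convention). `scrape` and `rotate_ip` are stand-ins for the worker call and the forced cold start:

```python
def scrape_with_rotation(scrape, rotate_ip, job_url, max_attempts=3):
    """Scrape a job URL, forcing a cold start whenever the call was blocked."""
    result = None
    for _ in range(max_attempts):
        result = scrape(job_url)
        if not result.get("blocked"):
            return result  # the call went through, nothing else to do
        rotate_ip()  # blocked: force a cold start before the next try
    return result  # still blocked after max_attempts, return the last answer
```

Capping the number of attempts avoids looping forever if the new IPs keep getting blocked.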
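One way to grab that IP from inside the worker, as a sketch: ask a "what is my IP" endpoint. I'm assuming AWS's checkip.amazonaws.com here, and the `current_public_ip` helper is hypothetical:

```python
from urllib.request import urlopen


def current_public_ip(fetch=urlopen, url="https://checkip.amazonaws.com"):
    """Return the public IP the request is seen from, as plain text."""
    with fetch(url, timeout=5) as response:
        return response.read().decode("utf-8").strip()
```

Adding this value to the worker's answer makes it easy to compare the IP before and after a forced cold start.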