I’ve been working on a lot of different scraping projects lately, and I’ve realized that proxy usage and management are often misunderstood.
Let me explain myself:
Ever heard of companies that want to be ✨ data-driven ✨ just because that’s what everybody’s been talking about?
Feels like it’s the same with scraping and proxies: everybody wants to use proxies but doesn’t really know why, when or how.
Admittedly, things can get a bit complicated, but that’s why I’m writing this article!
I’ll try to make this clearer by:
- Explaining in which situations you need proxies
- Explaining why you need them in these situations
- Giving you practical examples of website scraping
- Giving you the pros and cons of different scraping techniques and technologies
The scraping mind map
The login wall
It all starts with the data and the website you’re looking to scrape.
Is the juicy data you’re looking for behind a login wall?
The login wall is a wall for multiple reasons:
- The first: well, the obvious one, you need to log in before you can get to the data
- The second: the rules change based on the website you’re logged into (LinkedIn doesn’t have the same limits as Facebook, and so on)
- The third: the complexity of the scraping can ramp up very quickly as you scale, because of said limits
That’s why before anything you should ask yourself this:
Can I manage to scrape the same data in a public (non-login) session vs a logged-in session?
Odds are you’ll spend more time figuring out how to log in properly without your session bugging out than you would just scraping public pages.
To give you an example with Facebook scraping:
I reverse engineered the Facebook pages API after login but got this error after going a little too hard.
Turns out I could actually fetch every data point I wanted (besides emails) by simply scraping the public page, which I could do without limits thanks to proxies.
Let’s go a bit further down the mind map, starting with scraping pages with no login.
Scraping pages with no login
Assume you're being watched 🥷
To understand the context behind the technology, remember that websites are designed for requests triggered by humans performing actions (a search, a click…).
When scraping, you should assume that every website enforces a request limit per user connecting to it.
When you make requests on a given website, that website identifies you, the user, by your IP address, and communicates with your browser to serve your requests.
When you’re mass scraping, you take this to the next level and end up sending a ton of requests from your IP to the website’s server very quickly.
If it's done recklessly, you can end up blacklisted from that server 🛑
Even worse, if you're not careful, your IP reputation can take a hit and you could see significant issues on multiple sites: you'll see a lot more captchas, you'll get fewer results on Google, and you might even be asked to re-authenticate every time you check out on popular websites.
To come back to scraping, you might also have to space out your consecutive requests (one page every 15-20s 🐢) simply for the pages to load, or because the website you're scraping has flagged you as using automation tools.
Getting your data this way is a very slow process, and you find yourself questioning why you even started scraping in the first place.
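Just to make that concrete, here's roughly what single-IP, throttled scraping looks like in Python (the URL and the delay are placeholders):

```python
import time
import requests

# Hypothetical list of pages to fetch from a single IP.
urls = [f"https://example.com/catalog?page={i}" for i in range(1, 101)]

for url in urls:
    response = requests.get(url, timeout=30)
    print(url, response.status_code)
    # Wait 15-20s between requests so the site doesn't flag or block us:
    # at this pace, 100 pages already take about half an hour.
    time.sleep(18)
```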
That’s precisely where proxies come in.
Why proxies should be used
A proxy provider gives you multiple IP addresses so that you can act as if you were another computer, hence the name proxy.
With proxies, you can make as many concurrent requests as you have IPs, which is precisely why they're convenient: from the perspective of the website's server, it looks like multiple different computers connecting to it.
You can then ramp up your scraping as you scale your proxies and forget both slow scraping and IP banning issues.
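Here's a minimal sketch of that idea in Python with requests (the proxy URLs are placeholders for whatever your provider gives you):

```python
import itertools
import requests

# Placeholder proxy endpoints -- swap in the ones from your provider.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
proxy_pool = itertools.cycle(PROXIES)

urls = [f"https://example.com/catalog?page={i}" for i in range(1, 31)]

for url in urls:
    proxy = next(proxy_pool)
    try:
        # Each request goes out through a different IP, so the target
        # server sees several "different" visitors instead of one.
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        print(url, proxy, response.status_code)
    except requests.RequestException as exc:
        # Cheap proxies fail often; skip and move on (or retry with the next one).
        print(f"{proxy} failed for {url}: {exc}")
```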
There are a lot of proxy providers out there, with different levels of quality and prices.
Free proxies should always be tested, since some websites don't have any restrictions, but as you scrape well-known sites you're gonna have to get yourself higher-end proxies.
For free proxies, I usually start with pubproxy.com or the Tor network (anonymous and free!), which I wrote a tutorial about.
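If you want to try the Tor route, here's a minimal sketch, assuming a local Tor daemon listening on its default SOCKS port (9050) and the requests[socks] extra installed:

```python
import requests

# Assumes Tor is running locally and exposing its default SOCKS5 port.
# The socks5h scheme makes DNS resolution go through Tor as well.
TOR_PROXY = "socks5h://127.0.0.1:9050"

session = requests.Session()
session.proxies = {"http": TOR_PROXY, "https": TOR_PROXY}

# This should print a Tor exit node's IP, not yours.
print(session.get("https://api.ipify.org").text)
```

Keep in mind Tor exits are slow and often blocked by big sites, so treat it as a free fallback rather than a production proxy pool.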
As for paid proxies, I'm really satisfied with Zyte's: easy to cycle through and very reliable.
I have mixed feelings about brightdata.com (formerly Luminati.io), as I've been getting a few proxy connection errors here and there.
I'm still okay with it, since their residential IPs really do provide great undetectability, which proved very useful when scraping Facebook in one of my recent projects.
For a broader comparison, check out this article on Limeproxies.
Additional best practice: user agent cycling
Now that you understand that performing actions in a browser is equivalent to making requests to a web server, here's some additional information:
When you make a request through a browser, your device and browser are automatically identified by the request’s user agent.
It's usually used by servers to display device-specific content, but it's also used in various analytics.
When scraping and artificially creating requests, you can “fake” these user agents by simply replacing the headers in your requests.
For Python users there are even libraries like latest-user-agents to get recent, legitimate user agents, or you can use this list found off Google.
When you use custom user agents, you increase the reliability of your scraping scripts, because web servers tend to “think” you’re a new, unseen computer.
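Here's a minimal sketch of user agent cycling with requests; I'm hardcoding a small hand-picked list just for illustration, in practice you'd keep it fresh (e.g. with latest-user-agents):

```python
import random
import requests

# A few realistic user agents; refresh this list regularly so it stays current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

for page in range(1, 6):
    # Each request advertises a different browser/device combination.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(
        f"https://example.com/catalog?page={page}", headers=headers, timeout=30
    )
    print(page, response.status_code)
```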
Cloud services
AWS/GCP/Azure
The most famous cloud services provide serverless coding platforms: simply write your piece of scraping code and upload it to any of these providers.
Any time you call your scraper, it'll be provisioned on a random IP, giving you a neat proxy effect.
As you already know if you follow my blog, I'm a big fan of Serverless on AWS, which I use almost every day.
On the topic, I have an article explaining how you would turn a scraper into an API with AWS right here.
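To give you a rough idea, a serverless scraper is basically just a handler function like the sketch below (the URL, the crude title extraction and the event shape are placeholders; on AWS Lambda you'd bundle requests with the function or provide it through a layer):

```python
import json
import requests  # bundled with the function or provided via a Lambda layer


def handler(event, context):
    """Minimal Lambda-style scraper: fetch one page and return its title."""
    # The URL to scrape is passed in the invocation payload.
    url = event.get("url", "https://example.com")
    response = requests.get(url, timeout=30)

    # Placeholder "parsing": a real scraper would extract the fields it needs.
    title = None
    if "<title>" in response.text:
        title = response.text.split("<title>")[1].split("</title>")[0]

    return {
        "statusCode": 200,
        "body": json.dumps(
            {"url": url, "title": title, "http_status": response.status_code}
        ),
    }
```

One caveat: the IP only changes when a new execution environment is spun up, so it's a nice side effect rather than a guaranteed rotation on every single call.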
Zyte/Phantombuster/Apify
These are dedicated scraping cloud services: they already have ready-made script modules used by thousands of users.
You can also duplicate them and make your changes to fit your data scraping needs.
All of these services also include IP proxy services.
Zyte is more dedicated towards "crawling" (sometimes used synonymously with scraping), giving you multiple tools to scrape many e-commerce website pages at once.
Phantombuster and Apify are very similar in the sense that they provide ready-made scraping tools in a no-code way.
Both tools can also be heavily customized: I recently used Apify extensively to build a custom Facebook scraper meant to be managed by a non-developer.
Scraping websites requiring a login
That’s where it gets tougher!
First you have to log in to the website of choice, which poses two questions:
- What account should you use?
- Can you log in from anywhere?
These questions are especially important for sites that are used to being targeted by web scraping, like LinkedIn, Facebook, Twitter or Amazon.
Account limits and human like scraping
There are a lot of different approaches to how to use accounts for this kind of scraping operation, and they depend on your use case.
You can decide to go the fully-fake-new-account route and scrape at will as you rotate through your different accounts (which sometimes isn’t even possible, because you’re looking to scrape data linked to a specific account, like on LinkedIn for instance).
You can decide to use a fully legitimate account and scrape around very cautiously as well.
Guidelines and techniques
Authentication
Don’t connect to a “scraping account” while it’s in use and do not use random proxies because the account will get flagged immediately: almost every top website can detect IP discrepancies.
Say you’re scraping with a French IP but you’re located in the US.
As soon as you log in from your browser on US Wi-Fi, you’re suspected of being a bot user: your future connections come under more scrutiny, with no benefit of the doubt in the event your account gets banned.
You can use login cookies if the website’s cookie expiration suits you; LinkedIn famously uses the li_at cookie to authenticate most requests.
The previous point applies to cookies as well, though: if you log in from anywhere else, your cookies will most likely expire instantly.
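As a minimal sketch, assuming you've copied a valid li_at value from your own browser session (and keeping in mind LinkedIn may require extra headers for some endpoints):

```python
import requests

# Value copied from your own logged-in browser session (keep it secret).
LI_AT = "PASTE_YOUR_li_at_COOKIE_HERE"

session = requests.Session()
# Attach the authentication cookie so requests run as your logged-in account.
session.cookies.set("li_at", LI_AT, domain=".linkedin.com")
session.headers["User-Agent"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# If the cookie is valid (and sent from a consistent IP), this returns the
# logged-in page instead of a redirect to the login form.
response = session.get("https://www.linkedin.com/feed/", allow_redirects=False)
print(response.status_code)
```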
Accounts and limits
The importance of using a legitimate account seems to vary a lot from website to website and from industry to industry.
The general guideline is “act like a human”, but even that isn’t always enough: it’s not uncommon to hit limits on Instagram and Facebook, for example.
I’d say the best practice is to look for people who already have experience scraping these sites.
Phantombuster has the limits for a few of these popular websites documented here (it might be a bit outdated though).
Technologies and tools for reliable scraping on login pages
To make sure you’re considered a human, there are plugins for common browser-automation tools like Puppeteer.
For Puppeteer, the best known is the Extra Stealth plugin.
On Python, I've been using the Undetected Chrome Driver with great results.
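Here's a minimal sketch of what that looks like (assuming undetected-chromedriver is installed and Chrome is available on the machine):

```python
import undetected_chromedriver as uc

# Options work like regular Selenium Chrome options.
options = uc.ChromeOptions()
options.add_argument("--headless=new")

# Starts a Chrome instance patched to avoid the most common automation checks.
driver = uc.Chrome(options=options)
try:
    driver.get("https://example.com/login")
    # From here on it's the regular Selenium API: find the form fields,
    # fill in your credentials, click submit, then navigate and scrape.
    print(driver.title)
finally:
    driver.quit()
```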
I often use Curl converter to convert browser requests into the code of your choice.
It's very useful for parsing cookies, request headers and request data, and it's usually a good starting point for reverse engineering.
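For example, copying a request as cURL from the browser's network tab and running it through the converter gives you something roughly like this (shortened, with placeholder values):

```python
import requests

# Cookies and headers copied from the browser's network tab (shortened here).
cookies = {
    "session_id": "PASTE_YOUR_SESSION_COOKIE",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Accept": "application/json",
    "Referer": "https://example.com/some-page",
}
params = {
    "page": "2",
    "sort": "recent",
}

# Replaying the exact request the browser made, but from your own code.
response = requests.get(
    "https://example.com/api/items",
    params=params,
    cookies=cookies,
    headers=headers,
)
print(response.status_code)
```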
Sum up, TL;DR
- When you're scraping public pages, make sure you're always using proxies, starting with the lower-quality proxies first.
- If you're looking to scrape data behind a login wall, do your best to look for alternatives that don't require logging in, and go back to step one
- If you still have to login, try to use reverse engineering techniques as much as you can
- Respect the automation limits indicated by the website itself or by the community. Make sure to act like a human, and avoid connecting from multiple IPs/browsers at the same time as well
I plan to do another article where I'll get specific with scripts and tech on websites of increasing difficulty!