Scraping with Tor | The Anas Growth Blog


Before anything: in case you didn't know, I made a YouTube video you can watch alongside this article.

At Quable, we’re constantly trying to find automated campaigns to run as background tasks.

The latest one is linked to job boards.

Job boards are fantastic because they surface a bunch of marketing signals and pain points.

If a company opens up 5 sales jobs, it’s a sign that they’re growing fast, which means they’re probably in the market for new organizational or lead generation tools.

In my case, I’m going to use the LinkedIn job board to check which companies are currently using the word “omnichannel” in their job offers.

The motive

Quable is a PIM/DAM software that helps clients sort out their product information mess and manage it all in one platform.

PIM software goes hand in hand with omnichannel merchandising!

The idea here is to find out which companies are looking for manpower and strike up a conversation with their CTO/CMO (ideally) about their information system.

Scraping LinkedIn jobs can be automated with a lot of different tools; I personally use Phantombuster.

Phantombuster’s output got me the company names and their LinkedIn URLs, but it turned out to be a little underwhelming: I couldn't get the job offers themselves.

That was annoying, because the text of an offer is very useful for filtering out consulting companies or freelance offers.

I decided to scrape the job offers on my own!

Scraping the LinkedIn job offers

I used BeautifulSoup and a simple curl request to test the waters:
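
A minimal sketch of what such a first test could look like in Python, using requests in place of the raw curl call. The CSS class in DESCRIPTION_CLASS, the User-Agent header and the helper names are assumptions made for illustration; LinkedIn's markup changes often, so inspect the page source before relying on them.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical class name for the job description container.
# Check the real page source and adjust it.
DESCRIPTION_CLASS = "description__text"

# A browser-like User-Agent, similar to what a copied curl command would send.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def extract_description(html, css_class=DESCRIPTION_CLASS):
    """Return the job description text found in html, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", class_=css_class)
    if node is None:
        return None
    return node.get_text(separator=" ", strip=True)


def fetch_job_offer(url):
    """Download one public job offer page and return its description text."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx answers
    return extract_description(response.text)
```

Keeping the parsing in its own function makes it easy to try on a saved HTML file before sending any request.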


Everything was straightforward and I managed to get the job offer content.

Figure: LinkedIn job offer. The job offer content we're scraping.

But then, I tried to loop through my ~150 relevant results...

After a few dozen requests, LinkedIn was onto my IP:

Figure: tqdm error. After 42 smooth requests, the error below is thrown.

Figure: Jupyter notebook error. LinkedIn broke the connection!

Usually this would mean turning to proxies, BUT I decided to go another route! The Tor route 😏

Tor? Deep web?

The Tor name is a little ambiguous because it can refer both to the browser that connects to the Tor network and to the network itself.

At its core, Tor is a privacy-first tool/network, which is why it has a bad mainstream reputation: anything with privacy as a core feature is bound to be used by criminals.

Tor relays your traffic through several nodes, so the site you scrape sees the IP of a Tor exit node instead of yours, which can make it a good proxy alternative.

The performance isn’t top tier, but for small scraping projects like this one it’s perfect.

Quick Tor setup and usage

Here's how to set up Tor on Windows. On Linux/Debian distributions, it’s as easy as:

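sudo apt-get install tor

tor --hash-password YOUR_PASSWORD

The second command prints a hashed version of your password (a string starting with 16:). Then edit the torrc option file, usually located at /etc/tor/torrc, and add the following lines right above the ### This section is just for location-hidden services ### line, pasting the hash in place of YOUR_HASHED_PASSWORD:

SOCKSPort 9050

ControlPort 9051

HashedControlPassword YOUR_HASHED_PASSWORD

CookieAuthentication 1

The ControlPort line is what lets Python ask Tor for a new identity later on; 9051 is the control port TorRequest uses by default.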

Save, then restart Tor:

sudo /etc/init.d/tor restart

Finally, to use Tor with Python, install TorRequest:

pip install torrequest

Actually using Tor

I put the few lines of scraping into a function and used tr.get (tr being a TorRequest instance) instead of requests.get to make our Tor-wrapped requests to LinkedIn.

Here's how it looks:
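
A minimal sketch of that swap, assuming Tor is configured as described earlier. The helper name get_through_tor, the YOUR_PASSWORD placeholder and the JOB_ID URL are illustrative assumptions; only TorRequest and its get method come from the torrequest package.

```python
def get_through_tor(tr, url):
    """Fetch url with a TorRequest-like object and return the page HTML."""
    response = tr.get(url)  # same call shape as requests.get(url)
    response.raise_for_status()
    return response.text


def main():
    # Imported here so the helper above stays usable without Tor installed.
    from torrequest import TorRequest

    # Pass the plain password here, not the hash stored in torrc.
    with TorRequest(password="YOUR_PASSWORD") as tr:
        # Hypothetical URL: replace JOB_ID with a real job offer id.
        html = get_through_tor(tr, "https://www.linkedin.com/jobs/view/JOB_ID")
        print(len(html))
```

Calling main() needs a running Tor service; the helper itself only expects an object with a get method, so it can be tried with a stand-in first.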


The idea was simply to call the reset_identity function to get a new Tor public IP whenever LinkedIn throws an error.
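
That retry idea can be sketched as follows. This is an illustration under assumptions rather than the exact code behind this article: scrape_all, max_attempts and pause are hypothetical names, and tr only needs to offer get and reset_identity, as a TorRequest instance does.

```python
import time


def scrape_all(tr, urls, max_attempts=3, pause=1.0):
    """Fetch every URL, asking Tor for a new identity after each failure."""
    pages = {}
    for url in urls:
        for _ in range(max_attempts):
            try:
                response = tr.get(url)
                response.raise_for_status()
            except OSError:
                # requests' exceptions derive from OSError, so this covers
                # dropped connections and HTTP error statuses alike.
                tr.reset_identity()  # ask Tor for a new circuit, usually a new exit IP
                time.sleep(pause)    # give the new circuit a moment
                continue
            pages[url] = response.text
            break
        else:
            pages[url] = None  # still failing after max_attempts tries
    return pages
```

Capping the number of attempts avoids looping forever on an offer that has been taken down, and the None entries make it easy to see which URLs to retry later.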

I coded it lazily but hey, no more errors ✔️:

Figure: all the results. We scraped every single job offer we needed :)

Conclusion

Tor is a nice tool to work with when you don't want to go ballistic on your scraping stack (residential proxies, Selenium/Puppeteer, stealth plugins...).

It's not very fast, but it does the job very well for small scraping projects.

Test it out and get back to me! 🤙