June 25, 2021


Before anything else: in case you didn't know, I made a YouTube video you can watch alongside this article.

 

At Quable, we’re constantly trying to find automated campaigns to run as background tasks.


The latest one is linked to job boards.


Job boards are fantastic because they can send a bunch of marketing signals and pain points.


If a company opens up 5 sales jobs, it’s a sign that they’re growing fast, which means they’re probably in the market for new organizational or lead generation tools.


In my case, I’m going to use the LinkedIn job boards to check out which companies are currently using the word “omnichannel” in their job offers.

The motive

Quable is a PIM/DAM software company that helps its clients sort out their product information mess and manage it all in one platform.


PIM software goes hand in hand with omnichannel merchandising!


The idea here is to find out which companies are looking for manpower and strike up a conversation with their CTO/CMO (ideally) about their information system.


Scraping LinkedIn jobs can be automated with a lot of different tools. I personally use Phantombuster.


Phantombuster’s output got me the company names and their LinkedIn URLs, but it turned out to be a little underwhelming: I couldn't get the job offers themselves.


That was annoying, because the offer text is very useful for filtering out consulting companies or freelance offers.


I decided to scrape the job offers on my own!

Scraping the LinkedIn job offers

I used BeautifulSoup and a simple curl request to test the waters:
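The original snippet isn't reproduced here, so below is a minimal sketch of what such a test request could look like with requests and BeautifulSoup. The `description__text` class name, the `User-Agent` header and both function names are assumptions for illustration; LinkedIn's markup changes over time, so inspect the page and adjust.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: a browser-like User-Agent; the original request headers are not shown.
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Assumption: class of the <div> wrapping the job description.
DESCRIPTION_CLASS = "description__text"


def extract_job_description(html, css_class=DESCRIPTION_CLASS):
    """Return the visible text of the job description, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", class_=css_class)
    if node is None:
        return None
    # Join all text fragments with single spaces and trim whitespace.
    return node.get_text(separator=" ", strip=True)


def fetch_job_description(url, get=requests.get):
    """Download one job offer page and extract its description."""
    response = get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return extract_job_description(response.text)
```

Keeping the parsing separate from the download makes it easy to swap `requests.get` for another `get` function later without touching the BeautifulSoup part.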


Everything was straightforward and I managed to get the job offer content. 

The job offer content we're scraping


But then, I tried to loop through my ~150 relevant results...


After a few dozen requests, LinkedIn was onto my IP:

After 42 smooth requests, an error is thrown: LinkedIn broke the connection!


Usually this means it's time to switch to proxies, BUT I decided to go another route! The Tor route 😏

Tor? Deep web?

The Tor name is a little ambiguous because it can refer either to the browser that connects to the Tor network or to the network itself.

At its core, Tor is a privacy-first tool/network, which is why it has a bad mainstream reputation: anything built around privacy is bound to be used by criminals too.



Tor routes your traffic through several relays, so the site you're scraping sees a Tor exit node's IP instead of yours. That makes it a good proxy alternative.

The performance isn’t top tier but for small scraping projects like this one it’s perfect.

Quick Tor setup and usage

Here's how to set up Tor on Windows. On Debian-based Linux distributions, it's as easy as:


sudo apt-get install tor
tor --hash-password YOUR_PASSWORD


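Then edit the torrc configuration file, usually located at /etc/tor/torrc, and add the following lines right above the ### This section is just for location-hidden services ### line. HashedControlPassword takes the hash printed by the previous command (a string starting with 16:), not the plain-text password. ControlPort is what lets Python ask Tor for a new identity; TorRequest connects to port 9051 by default.

SOCKSPort 9050
ControlPort 9051
HashedControlPassword YOUR_HASHED_PASSWORD
CookieAuthentication 1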

Save, then restart Tor:

sudo /etc/init.d/tor restart

Finally, to use Tor with Python, install TorRequest:

pip install torrequest
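Before scraping anything, it can be worth checking that traffic really goes through Tor. Here is a small sketch, assuming Tor listens on the SOCKS port 9050 configured above and that requests has SOCKS support (PySocks, e.g. via `pip install "requests[socks]"`). The helper names and the httpbin.org/ip echo service are my own choices for illustration, not part of the original setup.

```python
import requests


def tor_proxies(socks_port=9050):
    """Build a requests-style proxies dict pointing at the local Tor SOCKS port."""
    # "socks5h" (with the trailing h) resolves DNS through Tor as well.
    address = f"socks5h://127.0.0.1:{socks_port}"
    return {"http": address, "https": address}


def visible_ip(proxies=None, url="https://httpbin.org/ip"):
    """Return the IP address the remote server sees for this request."""
    # Assumption: httpbin.org/ip answers with JSON like {"origin": "<ip>"}.
    response = requests.get(url, proxies=proxies, timeout=30)
    response.raise_for_status()
    return response.json()["origin"]
```

If `visible_ip()` and `visible_ip(tor_proxies())` return different addresses, requests are leaving through a Tor exit node.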

Actually using Tor

I put the few lines of scraping into a function, and used tr.get (tr being a TorRequest instance) instead of requests.get to make our Tor-wrapped requests to LinkedIn.

Here's how it looks:
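The original code isn't reproduced here, so what follows is a hedged sketch of the same idea rather than the exact script. `scrape_all` and `scrape_with_tor` are hypothetical names, and treating every exception as "LinkedIn blocked us" is a simplifying assumption. The only TorRequest calls used are `TorRequest(password=...)`, `tr.get` and `tr.reset_identity`.

```python
import time


def scrape_all(urls, fetch, reset_identity, max_attempts=3, pause=1.0):
    """Fetch every URL, asking for a new Tor identity whenever a request fails.

    fetch(url) returns the scraped content or raises on error;
    reset_identity() asks Tor for a new circuit (and so a new exit IP).
    """
    results = {}
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                # Assumption: any failure is treated as a block by the site.
                if attempt == max_attempts:
                    results[url] = None  # give up on this URL
                else:
                    reset_identity()
                    time.sleep(pause)
    return results


def scrape_with_tor(urls, password):
    """Wire scrape_all to TorRequest (needs torrequest and a configured Tor)."""
    from torrequest import TorRequest  # imported lazily: optional dependency

    with TorRequest(password=password) as tr:
        def fetch(url):
            response = tr.get(url, timeout=30)
            response.raise_for_status()
            return response.text

        return scrape_all(urls, fetch, tr.reset_identity)
```

Passing `fetch` and `reset_identity` in as arguments keeps the retry loop independent from Tor, so it can be tried out with dummy functions before pointing it at LinkedIn. The `password` is the plain-text one you hashed earlier.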

The idea was simply to call the reset_identity function to get a new Tor public IP whenever LinkedIn throws an error.

I coded it lazily but hey, no more errors ✔️:

We scraped every single job offer we needed :)

Conclusion

Tor is a nice tool to work with when you don't want to go ballistic on your scraping stack (residential proxies, selenium/puppeteer, stealth plugins ...).

It's not very fast, but it does the job very well for small scraping projects.

Test it out and get back to me! 🤙

