Bypassing IP rate limiting for web scraping using WireBalancer
My problem with IP rate limiting
A week ago I wanted to scrape a website that had a strong IP-based rate-limiting mechanism. When I did the math, I realized I would need about two weeks just to scrape all the pages I was interested in, so I started thinking about possible solutions to bypass the issue.
The simplest solution would have been to use proxies, but that would have meant an extra cost and relying on third-party servers of dubious reliability and origin. However, I noticed that many cheap and popular VPN services support the WireGuard protocol and allow multiple simultaneous connections, often 5 and sometimes as many as 10.
I won’t name the service I bought, but it’s one of the most popular ones and costs about 5 euros a month.
With it, I have 5 different IPs that I can use at the same time, which lets me scrape up to 5 times faster without ever exceeding the rate limit.
The only problem is that my scraper was just a normal Python program using the requests and aiohttp libraries to make HTTP requests, so it had no native way to send its traffic through WireGuard.
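To make the arithmetic concrete, here is a minimal sketch of the estimate. The page count and the per-IP rate are hypothetical numbers chosen only to land near the "two weeks" figure; they are not the real values for the site I scraped.

```python
SECONDS_PER_DAY = 86_400


def estimate_days(pages: int, requests_per_second_per_ip: float, ip_count: int) -> float:
    """Days needed to fetch `pages` when every IP is throttled independently."""
    if pages < 0 or requests_per_second_per_ip <= 0 or ip_count <= 0:
        raise ValueError("pages must be >= 0; rate and ip_count must be positive")
    total_rate = requests_per_second_per_ip * ip_count
    return pages / total_rate / SECONDS_PER_DAY


# Hypothetical workload: 1,209,600 pages at 1 request per second per IP.
print(f"{estimate_days(1_209_600, 1.0, 1):.1f} days with 1 IP")   # 14.0 days with 1 IP
print(f"{estimate_days(1_209_600, 1.0, 5):.1f} days with 5 IPs")  # 2.8 days with 5 IPs
```

The speedup is linear only as long as the limiter counts requests per IP and nothing else (bandwidth, the target server, the scraper itself) becomes the bottleneck first.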
Why not just buy proxies?
Well, there’s not much to say about proxies: they cost money and you never know who is behind them.
You can never rule out that a proxy service might be backed by a botnet or some other shady setup.
Seriously, think about it: would you ever rent out the network traffic of your VPS or your home network to strangers who will most likely use it for illicit activities?
No sane person would do that, unless driven by strong ideology or financial incentives.
My theory is that most free or very cheap proxy services rely on botnets or compromised networks to provide their service, and therefore it’s never a good idea to use them for sensitive or important activities.
You can’t rule out that traffic is being monitored, logged, or worse, manipulated.
And even if that’s not the case, there’s always the risk that you’re using someone else’s home network at the same time as some criminals, and you could end up under investigation for illegal activities you didn’t commit.
Thinking about it, downloading any random app on a mobile device would be enough to end up as part of a botnet. After all, all it takes is an app that connects to a remote socket and stays in the background. Every packet it receives from the socket is parsed and sent as a TCP/UDP request to a specific target, and the response is sent back to the remote controller. That’s all you need to create a botnet of mobile devices. On top of that, mobile devices are perfect for this purpose since they have two network interfaces (WiFi and mobile data).
With VPNs, by contrast, your traffic almost always exits through dedicated servers, often rented from M247 Ltd or similar companies that offer low-cost dedicated servers with unlimited bandwidth, which makes them well suited to acting as VPN exit nodes.
WireBalancer
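Returning to my problem, it turns out that both requests and aiohttp can be used with SOCKS5 proxies: requests through its optional SOCKS extra (pip install "requests[socks]"), and aiohttp through the third-party aiohttp-socks connector, since aiohttp itself only handles HTTP proxies.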
In reality, almost all software that makes HTTP requests (including web browsers and even operating systems) supports some kind of outbound network configuration using the SOCKS5 protocol, so a sensible approach to building a scraper would be to always develop it normally and then configure the request library to use a local proxy.
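As an illustration of the "develop normally, then point the library at a local proxy" approach, here is a minimal sketch for requests. The address 127.0.0.1:1080 is an assumption for the example, not a documented WireBalancer default, and both helper names are made up for this post.

```python
def socks5_proxies(host: str = "127.0.0.1", port: int = 1080) -> dict[str, str]:
    """Build the proxies mapping that requests expects (host and port are assumed)."""
    # The "socks5h" scheme asks the proxy to resolve hostnames, so DNS lookups
    # also travel through the tunnel instead of leaking from the local machine.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


def fetch(url: str, proxies: dict[str, str], timeout: float = 10.0) -> str:
    """Download a page through the given SOCKS5 proxy and return its body."""
    # Imported here so the helper above stays dependency-free.
    # SOCKS support requires: pip install "requests[socks]"
    import requests

    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling fetch("https://example.com", socks5_proxies(port=1081)) would then send that request through whichever tunnel is listening on port 1081, with no other change to the scraper.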
Considering that I don’t have a proxy server but I do have 5 WireGuard connections, I did some research and, not finding any suitable tool, decided to write my own. The final result was WireBalancer, a Go tool that runs in a container, exposes SOCKS5 servers, and connects to all configured WireGuard profiles.
Main features
I mainly needed four things:
- Load balancing outgoing traffic across the various configured WireGuard servers
- The ability to choose the WireGuard server to use for each outgoing connection by selecting a specific SOCKS5 server
- A very basic dashboard to monitor the status of connections and WireGuard servers and to see how many requests were made through each server
- It had to be simple to use and configure, lightweight, and not slow down connections
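The first and third requirements can be sketched in a few lines of Python. This is a conceptual illustration only, not WireBalancer’s actual Go code: the round-robin strategy, the class name, and the wg0..wg4 backend names are all assumptions made for the example.

```python
import itertools
from collections import Counter


class RoundRobinBalancer:
    """Hand out backends in rotation and count how often each one is used."""

    def __init__(self, backends: list[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)
        # Start every counter at zero so idle backends still show up in the stats.
        self.requests_per_backend = Counter({name: 0 for name in backends})

    def pick(self) -> str:
        """Return the next backend in the rotation and record the request."""
        backend = next(self._cycle)
        self.requests_per_backend[backend] += 1
        return backend


# Hypothetical tunnel names, one per WireGuard profile.
balancer = RoundRobinBalancer(["wg0", "wg1", "wg2", "wg3", "wg4"])
for _ in range(10):
    balancer.pick()
print(dict(balancer.requests_per_backend))
```

Round robin keeps the load on every exit IP equal, which is what matters against a per-IP limiter; the per-backend counters are the kind of number a basic dashboard would display.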
After 24 hours of intensive use, I can say that WireBalancer met all my needs and allowed me to scrape efficiently and without issues, enabling over 4 million total requests with 0 connection errors.
Bonus
One of the SOCKS5 servers exposed by WireBalancer randomizes the connection selection for each HTTP request. If you’re feeling bold and want to have some fun, you can try using it in your main browser or in your system’s global network settings. I’m not exactly sure what would happen, but I imagine many sites would ban your session and you’d get endless captchas, along with some interesting experiences with Cloudflare.