Why We Created Crawlera, The World’s Smartest Web Scraping Proxy Network

Zyte
Mar 14, 2019

Let’s face it, managing your proxy pool can be an absolute pain and the biggest bottleneck to the reliability of your web scraping!

Nothing annoys developers more than crawlers failing because their proxies are continuously being banned.

Not only do you find yourself constantly firefighting proxy issues, but the people who rely on the web data also get increasingly frustrated with you because of the unreliability of the data feed.

Scrapinghub had the same issues for years until we hit our breaking point and decided to solve this problem forever.

Scrapinghub’s Proxy Problems

At the time, Scrapinghub had been in business for about three years, providing web scraping consultancy services to companies looking to outsource their data extraction.

We had established ourselves as the leading provider of web scraping consultancy services. However, our crawl engineers were increasingly running into serious proxy issues as the scale and complexity of our projects grew.

We’d start a project, but our development timelines were constantly being delayed because, after deploying our spiders, we’d run into one proxy issue after another.

First, we’d configure a proxy pool and tell the client that everything was working. Then within a couple of days, everything would be broken. The proxy pool would no longer return requests at the target RPM (requests per minute).

We’d then acquire new proxies, increase the pool size, rotate the proxies and create a new pool to route the requests through. This would work for a while but before long we were back where we started. A proxy pool full of banned IPs and our crawlers unable to make successful requests.

It was a never-ending cycle of swapping and rotating proxies. We couldn’t reliably predict how long it would take us to get a project into full production, leading to frustrated engineers and plenty of hard conversations with customers.

We were getting sick of firefighting proxy issues, but then one project came along that forced us to say “enough is enough” and commit to fixing this problem permanently…

The Project That Broke the Camel’s Back

The client wanted us to build a web scraping infrastructure to scrape product data from 20 e-commerce sites, about 1 million requests per day, which in 2011 was a big deal!

Everything started off great. We developed the spiders, ran a number of pilot crawls and delivered the proof of concept data to the customer.

However, as was all too common, we ran into serious problems scaling the crawls.

Although our spiders were well designed and configured to crawl at a polite speed, when we moved the project from proof of concept to production our proxies were being banned at an alarming rate.

We started the normal process of switching out proxies to try and get the crawlers back up and running.

However, eventually it got to the point that we couldn’t scale the crawl anymore as we couldn’t put out the proxy fires fast enough.

Initially, we told the client that we’d have the issue fixed in 1 or 2 days “as it was just a matter of swapping out the banned IPs”.

However, the days kept ticking by and we still hadn’t found a permanent solution.

Finally, nearly a month later, we fixed it!

The solution…

We stopped focusing on the underlying IPs and put all our energy into intelligently managing them, so that we could not only scrape reliably without fear of being banned, but also keep our development schedules predictable and reduce the time and cost of running our crawls.

We found that without an intelligent proxy management layer, our requests were continuously being blocked and our proxies burned, leaving us constantly scrambling to find new proxies and get our crawlers back up and running again.

However, when managed intelligently we could reliably scrape the web with little fear of our IPs being banned and the accompanying development/crawl delays.

This breakthrough was a game changer for us. With this new proxy management layer, we were able to exponentially scale our crawls and completely remove the headache of managing proxies.

Once configured for a project, this new proxy management layer would automatically select the best proxy to use for the target website and manage all the proxy rotation, throttling, blacklisting, etc., ensuring that we could reliably extract the data we needed.

All without any manual intervention from our crawl engineers!
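To make the idea concrete, here is a minimal sketch of what such a proxy management layer does: rotate requests across healthy IPs, temporarily blacklist IPs that get banned, and throttle requests per domain so crawls stay polite. This is an illustration only, with made-up class names and parameter values, not Crawlera’s actual implementation.

```python
import time
import random
from collections import defaultdict

class ProxyManager:
    """Toy proxy management layer: rotation, blacklisting and per-domain throttling.
    Illustrative sketch only; names and defaults are invented for this example."""

    def __init__(self, proxies, ban_cooldown=600, min_delay=5.0):
        self.proxies = set(proxies)             # e.g. {"http://10.0.0.1:8080", ...}
        self.banned = {}                        # proxy -> time it was banned
        self.last_request = defaultdict(float)  # domain -> timestamp of last request
        self.ban_cooldown = ban_cooldown        # seconds a banned proxy sits out
        self.min_delay = min_delay              # polite per-domain delay (seconds)

    def _available(self):
        now = time.time()
        # Return banned proxies to the pool once their cooldown has expired.
        for proxy, banned_at in list(self.banned.items()):
            if now - banned_at > self.ban_cooldown:
                del self.banned[proxy]
        return [p for p in self.proxies if p not in self.banned]

    def get_proxy(self, domain):
        # Throttle: wait until the polite per-domain delay has elapsed.
        wait = self.min_delay - (time.time() - self.last_request[domain])
        if wait > 0:
            time.sleep(wait)
        self.last_request[domain] = time.time()
        available = self._available()
        if not available:
            raise RuntimeError("All proxies are currently banned")
        return random.choice(available)         # rotate by picking a healthy proxy

    def report_ban(self, proxy):
        # Called when a response looks like a ban (e.g. a 403 or a CAPTCHA page).
        self.banned[proxy] = time.time()
```

A real system layers far more on top of this (per-site ban detection, adaptive delays, session handling), but the core loop of selecting, throttling and blacklisting is the part that removes the manual firefighting.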

As we continued to scale, people were constantly asking us how we were managing our proxies, because they were facing the same reliability issues we had encountered as they scaled their own web scraping. It was at this point that Crawlera was born…

Enter Crawlera — The World’s Smartest Proxy Network

In 2012, we decided to make this technology available to everyone in the form of Crawlera, a proxy management solution specifically designed for web scraping.

Crawlera enabled web scrapers to reliably crawl at scale, managing thousands of proxies internally, so they didn’t have to.

They never needed to worry about rotating or swapping proxies again.

Users loved Crawlera! It removed the frustrations their engineers had with managing their web scraping proxies.

With Crawlera, instead of having to manage a pool of IPs, the user’s spiders send requests directly to Crawlera’s single endpoint API.

Crawlera then selects the best IP and proxy configuration (user agents, request delay, etc.) for that particular website to retrieve the target data.

If a request is blocked, Crawlera then automatically selects the next best IP and reconfigures the proxy configuration before making another request. This process continues until Crawlera is able to obtain a successful request or a predefined request limit has been reached.

All this functionality happens under the hood. The user simply makes a request to Crawlera’s API, and Crawlera takes care of everything else, enabling users to focus on the data, not the proxies.
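As an example, a Python spider might route a request through such a single-endpoint proxy as shown below. The proxy host, port and API-key scheme are placeholders for illustration and may not match the current Crawlera configuration; check the official documentation for the exact endpoint and authentication details.

```python
import requests

# Illustrative sketch: route a request through a Crawlera-style proxy endpoint.
# The hostname, port and credential format below are placeholders.
CRAWLERA_PROXY = "http://<YOUR_API_KEY>:@proxy.crawlera.com:8010"

response = requests.get(
    "https://example.com/product/123",
    proxies={"http": CRAWLERA_PROXY, "https": CRAWLERA_PROXY},
    # HTTPS traffic routed through the proxy may require the provider's CA
    # certificate; verification is disabled here only to keep the sketch simple.
    verify=False,
)
print(response.status_code, len(response.text))
```

From the spider’s point of view, all of the IP selection, rotation and retrying happens on the other side of that single endpoint.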

Crawlera achieves this by managing a massive pool of proxies, carefully rotating, throttling, blacklisting and selecting the optimal IPs for each individual request to give the best results at the lowest cost, completely removing the hassle of managing IPs.

The huge advantage of this approach is that it is extremely reliable and scalable. Crawlera can scale from a few hundred requests per day to millions of requests per day without any additional workload on your part.

Better yet, because Crawlera was built by web scrapers for web scrapers, we know that users only care about successful requests, not the number of proxies. As a result, with Crawlera you pay only for successful requests that return your desired data, not for IPs or the amount of bandwidth you use.

This is a huge benefit for users of Crawlera, as they can accurately predict the cost of their proxy solution as they scale.

For Scrapinghub, having Crawlera at our disposal was a game changer for our business. Our crawl engineers could now focus on what they really enjoyed: building crawlers and delivering accurate, reliable data for our customers, rather than constantly putting out proxy fires just to keep data feeds up and running. The result was happier, more motivated teams and happier customers.

New, Improved Crawlera

Since its original launch, Crawlera has undergone numerous redesigns and improvements to keep pace with the changes in web scraping technologies and cope with the ever more complex challenges experienced when scraping the web.

We’ve added highly targeted geographical support (city-level granularity), residential IPs, headless browser support and custom user agents, to name just a few of the features, making Crawlera the most feature-rich and robust proxy solution for web scraping.

Today, Scrapinghub offers Crawlera in two flavours:

  • Crawlera Self-service: For web scraping teams (or individual developers) that are tired of managing their own proxy pools and are ready to integrate into their web scraping stack an off-the-shelf proxy API that charges only for successful requests.
  • Crawlera Enterprise: For larger organisations with mission-critical web crawling requirements looking for a dedicated crawling partner whose tools and team of crawl consultants can help them crawl more reliably at scale, build custom solutions for their specific requirements, debug any issues they run into when scraping the web, and offer enterprise SLAs.

Crawlera also comes with global support. Clients know that they can get expert input into any proxy issue that may arise, 24 hours a day, 7 days a week, no matter where they are in the world. This gives them immense peace of mind, knowing that they will never be left on their own if they can’t get access to the data they need.

Try Crawlera, The World’s Smartest Proxy Network, Today!

At Scrapinghub, all our products are 100% designed with web scraping in mind. We are committed to helping the web scraping community extract the data they need to grow their businesses.

If you’re tired of troubleshooting proxy issues and would like to give Crawlera a try, sign up today (you can cancel within 7 days if Crawlera isn’t for you) or schedule a call with our crawl consultant team.

At Scrapinghub we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.
