Automatic Extraction in 2021: data extractor features reviewed

Getting hold of clean, accurate web data — quickly, and in a format that’s easy to manage — is a struggle for many organizations. One solution is hiring a couple of enthusiastic interns to copy and paste the information you’re looking for, but they’ll soon hit a wall on larger-scale projects. Alternatively, you might use commercially available data extractor software or scraping apps. Or, if you’re feeling brave, you could try writing your own web scraping script.

Let’s imagine you need to get product names or prices from an e-commerce marketplace for comparison purposes. Web scraping traditionally means identifying product pages on a site and then extracting relevant information from those pages. To achieve this, your developer needs to inspect the site’s source code and then write more code to pick out the relevant bits: links, product names, and prices. There are some excellent tools out there for the job, from CSS and XPath selectors to free, open-source frameworks like Scrapy.
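To make the traditional approach concrete, here is a minimal sketch of what that hand-written extraction code looks like. The HTML structure and class names are invented for illustration; real sites need selectors tailored to their own markup. This example uses the limited XPath support in Python’s standard library:

```python
# Traditional DIY scraping: hand-written XPath-style selectors pulling
# names and prices out of a product listing. The markup is a stand-in
# for a real page your developer would have inspected first.
import xml.etree.ElementTree as ET

listing = """
<html><body>
  <div class="product">
    <a href="/p/1"><span class="name">Acrylic Paint Set</span></a>
    <span class="price">12.99</span>
  </div>
  <div class="product">
    <a href="/p/2"><span class="name">Canvas Panel 10-Pack</span></a>
    <span class="price">8.50</span>
  </div>
</body></html>
"""

root = ET.fromstring(listing)
products = []
for node in root.findall('.//div[@class="product"]'):
    products.append({
        "url": node.find(".//a").get("href"),
        "name": node.find('.//span[@class="name"]').text,
        "price": float(node.find('.//span[@class="price"]').text),
    })

print(products)
```

The catch, of course, is that this code is welded to one site’s markup: as soon as the class names or layout change, the selectors silently stop matching.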

It’s automatic

AI-powered Automatic Extraction harnesses deep learning methods, helping you to retrieve clean, accurate data in seconds rather than days or weeks. And it already supports more than 40 languages, making it easy to scrape web data from all over the world.

Your own data extractor scripts can break if a web page changes, but Automatic Extraction reliably gets the data you want… even from dynamic sites. It’s a huge time saver, taking away the pain of having to maintain your own code. Our API also makes it possible to get data from many different web page types. Say you’re building a comparison tool, allowing your customers to browse prices and availability for high street fashions or automotive parts across lots of different sources. With Automatic Extraction, it’s easy to get reliable data aggregated from e-commerce sites, news articles, blogs and more.
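The basic idea of the API is that you tell it which URLs you want and what kind of page each one is, and it returns structured records. As a hedged sketch (the exact endpoint, authentication, and payload fields are assumptions for illustration, not the documented API), the request body might be built like this:

```python
# Illustrative only: build a batch extraction request of URL + page-type
# pairs. Field names ("url", "pageType") are assumed for this sketch.
import json

def build_extraction_request(urls, page_type):
    """Build the JSON body for a batch extraction call (shape assumed)."""
    return json.dumps([{"url": u, "pageType": page_type} for u in urls])

body = build_extraction_request(
    ["https://example.com/p/1", "https://example.com/p/2"],
    "product",
)
print(body)
```

In practice you would POST a body like this to the extraction endpoint with your API key, and get back one structured record (name, price, availability, and so on) per URL.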

Try Automatic Extraction for free!

Do it yourself

Wouldn’t it be fantastic if you could focus on profitable business activities instead of writing spiders to collect all those relevant page URLs? Automatic Extraction neatly meets this need, handling both extraction and crawling without manual intervention. It’s effectively a spidering service based on the same AI/machine learning technology used in our API. And it also features built-in ban management, using automatic IP rotation to prevent blocking so you don’t need to babysit every crawl.

Just select the data type you want to extract, enter your URLs, and let Automatic Extraction do its thing. Leave it running while you have another meeting. Then come back, check everything went smoothly, and fetch your clean, usable data. That’s all there is to it. Currently, we support news and article data and product data extraction, with more data types coming later this year.

Smart, totally automated crawling

Let’s say you want to extract products from the ‘Arts and Crafts’ category of an online marketplace. You’ll be faced with creating a list of all the URLs you wish to scrape. And that means writing code to crawl the web and collect individual page URLs for feeding into our API for extraction.
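Here is what that DIY URL-collection step amounts to: a crawler that walks the category’s listing pages, follows pagination, and gathers product URLs to feed into extraction. This sketch runs over an in-memory stand-in for a site (the paths and page structure are invented), using only the standard library:

```python
# A breadth-first crawl that separates listing pages (to follow) from
# product pages (to collect). SITE is a fake in-memory site: each path
# maps to the anchors found on that page.
from collections import deque
from html.parser import HTMLParser

SITE = {
    "/arts-and-crafts?page=1": '<a href="/p/1"></a><a href="/p/2"></a>'
                               '<a href="/arts-and-crafts?page=2"></a>',
    "/arts-and-crafts?page=2": '<a href="/p/3"></a>',
    "/p/1": "", "/p/2": "", "/p/3": "",
}

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def crawl(start):
    seen, queue, product_urls = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        if url.startswith("/p/"):
            product_urls.append(url)  # these get fed into extraction
            continue
        parser = LinkCollector()
        parser.feed(SITE[url])
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return product_urls

print(crawl("/arts-and-crafts?page=1"))
```

A real version also has to cope with retries, rate limits, and bans, which is exactly the maintenance burden the automated crawler takes off your plate.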

Our smart automated crawler capability is accessed via the ‘Datasets’ tab in the left-side navigation. It’s easy to use, letting you get the data you want without any coding. Just select the data type you want and you’re ready to go. If you need them, there are a few customization options:

  • Extraction Strategy lets you collect all products from a website (Full Extraction), or just fetch products within a particular category, starting from the specified page (Category Extraction).
  • Extraction request limits can be tweaked too. A low limit lets you experiment without worrying about running out of credits, while larger limits enable scalability and production crawls.
  • Data delivery is up to you: set up S3 export, or skip this option and use the built-in JSON export.

And now you’re all ready to start getting data back. It’s as quick and easy as that.
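Taken together, those choices boil down to a handful of settings. Expressed as a hypothetical config (the field names and values here are invented for illustration, not the product’s actual schema):

```python
# Hypothetical config mirroring the UI choices above. All keys are
# illustrative, not the real schema.
crawl_config = {
    "dataType": "product",
    "extractionStrategy": "fullExtraction",  # or "categoryExtraction"
    "startUrl": "https://example.com/arts-and-crafts",
    "requestLimit": 100,  # keep low while experimenting with credits
    "export": {"format": "json"},  # or an S3 bucket for delivery
}
```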

Hit ‘Start’ and you’ll see products start appearing in seconds. There’s a preview of the items you’re about to extract, plus crawl statistics: the number of requests used from your specified limit, crawl speed in requests per minute, and field coverage.

This friendly new web interface doesn’t replace our original API, which lets developers seamlessly integrate Automatic Extraction into their own applications. And while fully automated crawling is super-easy, the API lets you deploy refined custom crawling strategies tailored to your own specific business needs.

So how do we do it?

New backend, better quality data

What’s around the corner?

  • CSV export to supplement S3 and JSON: allowing quick and easy data validation and analysis without needing developer tools.
  • Incremental crawling for articles: commonly requested by our users, this lets you only push new content from publications within a particular news source.
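To give a feel for what incremental crawling means in practice, here is a minimal sketch: remember which article URLs were delivered on earlier runs and push only the new ones. The function and field names are illustrative, and a real implementation would persist the seen set between runs:

```python
# Incremental delivery sketch: filter a crawl's results down to articles
# not delivered before, and record them as seen for the next run.
def new_articles(crawled, seen):
    """Return articles whose URLs are not in `seen`, updating `seen`."""
    fresh = [a for a in crawled if a["url"] not in seen]
    seen.update(a["url"] for a in fresh)
    return fresh

seen = {"https://news.example.com/a1"}  # delivered on a previous run
run = [
    {"url": "https://news.example.com/a1", "title": "Old story"},
    {"url": "https://news.example.com/a2", "title": "New story"},
]
fresh = new_articles(run, seen)
print([a["title"] for a in fresh])
```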

Try Automatic Extraction for free!

Originally published at https://www.zyte.com.
