Zyte’s New AI-Powered Developer Data Extraction API for E-Commerce & Article Extraction

Zyte
6 min read · Apr 9, 2019

We are delighted to announce the launch of the beta program for Scrapinghub’s new AI-powered developer data extraction API for automated product and article extraction.

After much development with alpha users, our team has refined this machine learning technology to the point that the data extraction engine can automatically identify common items on product and article web pages and extract them, without the need to develop and maintain an individual web crawler for each site.

This enables developers to easily turn unstructured product and article pages into structured datasets at a scale, speed and flexibility that is nearly impossible to achieve when manually developing spiders.

With the AI-enabled data extraction engine behind the developer API, you now have the potential to extract product data from 100,000 e-commerce sites without having to write a custom spider for each of them.

As a result, today we’re delighted to announce the launch of the developer API’s public beta.

Join The Beta Program Today

If you are interested in e-commerce or media monitoring and would like early access to the data extraction developer API, be sure to sign up for the public beta program.

When you sign up for the beta program, you will be issued an API key and documentation on how to use the API. From there you are free to use the developer API for your own projects, and you retain ownership of the data you extract when the beta program closes.

What’s even better, the beta program is completely free. You will be assigned a daily/monthly request quota which you are free to consume as you wish.

The beta program will run until July 9th, so if you’d like to be involved, be sure to sign up today, as places are limited.

How to Use the API

Once you’ve been approved to join the beta program and have received your API key, using the API is very straightforward.

Currently, the API has a single endpoint: https://developerapi.scrapinghub.com/v1/extract. A request is composed of one or more queries where each query contains a URL to extract from, and a page type that indicates what the extraction result should be (product or article).

Requests and responses are transmitted in JSON format over HTTPS. Authentication is performed using HTTP Basic Authentication where your API key is the username and the password is empty.

To make a request, simply send a POST request to the API along with your API key, target URL and pageType (either article or product):

curl --verbose \
--user '[api key]':'' \
--header 'Content-Type: application/json' \
--data '[{"url": "https://blog.scrapinghub.com/gopro-study", "pageType": "article"}]' \
https://developerapi.scrapinghub.com/v1/extract

Or, in Python:

import requests

response = requests.post(
    'https://developerapi.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'}])
print(response.json())

To facilitate query batching (see below), API responses are wrapped in a JSON array. For example, given an article from our blog that we want to extract structured data from, the response from the article extraction API looks like this:

[
  {
    "article": {
      "articleBody": "Unbeknownst to many, there is a data revolution happening in finance.\n\nIn their never ending search for alpha hedge funds and investment banks are increasingly turning to new alternative sources of data to give them an informational edge over the market.\n\nOn the 31st May, Scrapinghub got ...",
      "articleBodyRaw": "<span id=\"hs_cos_wrapper_post_body\" class=\"hs_cos_wrapper hs_cos_wrapper_meta_field hs_cos_wrapper_type_rich_text\" data-hs-cos-general-type=\"meta_field\" data-hs-cos-type=\"rich_text\"><p><span>Unbeknownst to many, there is a data revolution ... ",
      "audioUrls": null,
      "author": "Ian Kerins",
      "authorsList": [
        "Ian Kerins"
      ],
      "breadcrumbs": null,
      "datePublished": "2018-06-19T00:00:00",
      "datePublishedRaw": "June 19, 2018",
      "description": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
      "headline": "A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data",
      "images": [
        "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg"
      ],
      "inLanguage": "en",
      "mainImage": "https://blog.scrapinghub.com/hubfs/conference-1038x576.jpg#keepProtocol",
      "probability": 0.8376080989837646,
      "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data",
      "videoUrls": null
    },
    "error": null,
    "html": "<!DOCTYPE html><!-- start coded_template: id:5871566911 path:generated_layouts/5871566907.html --><!-...",
    "product": null,
    "query": {
      "userMeta": "Ku chatlanin!",
      "userQuery": {
        "pageTypeHint": "article",
        "url": "https://blog.scrapinghub.com/2018/06/19/a-sneak-peek-inside-what-hedge-funds-think-of-alternative-financial-data"
      }
    }
  }
]

Product & Article Extraction

As mentioned previously, the developer API is capable of extracting data from two types of web pages: product pages and article pages.

Product Extraction

The product extraction API enables developers to easily turn product pages into structured datasets for e-commerce monitoring applications.

To make a request to the product extraction API, simply set the “pageType” attribute to “product”, and provide the URL of a product page to the API. Example:

import requests

response = requests.post(
    'https://developerapi.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://www.waterbedbargains.com/innomax-perfections-deep-fill-softside-waterbed/', 'pageType': 'product'}])
print(response.json()[0]['product'])

The product extraction API is able to extract the following data types:

All fields are optional (can be null), except for url and probability.
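Because nearly every field may come back as null, code that consumes extraction results should handle missing values gracefully. A minimal sketch of one way to do this — note that the name and price field names used here are illustrative assumptions; the API documentation lists the actual product schema:

```python
# Guard against null fields when consuming product extraction results.
# NOTE: the 'name' and 'price' field names below are illustrative
# assumptions; consult the API documentation for the real product schema.

def summarize_product(product):
    """Build a short summary, tolerating fields that come back as null."""
    if product is None:
        return 'no product extracted'
    name = product.get('name') or 'unknown product'
    price = product.get('price')
    return name if price is None else '%s (%s)' % (name, price)

print(summarize_product({'name': 'Waterbed', 'price': '499.99'}))
print(summarize_product({'name': None, 'price': None}))
print(summarize_product(None))
```

The same pattern applies to any optional field: always reach for `dict.get` rather than indexing directly, so a null or missing value degrades into a default instead of an exception.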

Article Extraction

The article extraction API enables developers to easily turn articles into structured datasets for media monitoring applications.

To make a request to the article extraction API, simply set the “pageType” attribute to “article”, and provide the URL of an article to the API. Example:

import requests

response = requests.post(
    'https://developerapi.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support', 'pageType': 'article'}])
print(response.json()[0]['article'])

The article extraction API is able to extract the following data types:

Similarly to the product extraction API, all article extraction fields are optional (can be null), except for url and probability.
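Since probability is one of the two always-present fields (visible in the sample response above), it can be used to filter out low-confidence extractions. A minimal sketch — the 0.5 threshold is an illustrative assumption, not an API recommendation:

```python
# Keep only article results where the extractor is reasonably confident.
# NOTE: the 0.5 threshold is an illustrative assumption, not a value
# recommended by the API.

def confident_articles(results, threshold=0.5):
    """Return the extracted articles whose probability meets the threshold."""
    return [r['article'] for r in results
            if r.get('article') and r['article'].get('probability', 0) >= threshold]

sample = [
    {'article': {'headline': 'A', 'probability': 0.84}},
    {'article': {'headline': 'B', 'probability': 0.12}},
    {'article': None},
]
print([a['headline'] for a in confident_articles(sample)])  # → ['A']
```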

Batching Queries

Both the product and the article extraction API offer the ability to submit multiple queries (up to 100) in a single API request:

import requests

response = requests.post(
    'https://developerapi.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support', 'pageType': 'article'},
          {'url': 'https://blog.scrapinghub.com/spidermon-scrapy-spider-monitoring', 'pageType': 'article'},
          {'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'}])
for query_result in response.json():
    print(query_result['article']['headline'])

The API returns extraction results as soon as the data extraction engine produces them, so query results are not necessarily returned in the same order as the original queries.

If you need an easy way to associate the results with the queries that generated them, you can pass an additional “meta” field in the query. The value that you pass will appear as a “userMeta” field in the corresponding query result. For example, you can create a dictionary keyed on the “meta” field to match queries with their corresponding results:

import requests

queries = [
    {'meta': 'query1', 'url': 'https://blog.scrapinghub.com/2016/08/17/introducing-scrapy-cloud-with-python-3-support', 'pageType': 'article'},
    {'meta': 'query2', 'url': 'https://blog.scrapinghub.com/spidermon-scrapy-spider-monitoring', 'pageType': 'article'},
    {'meta': 'query3', 'url': 'https://blog.scrapinghub.com/gopro-study', 'pageType': 'article'},
]
response = requests.post(
    'https://developerapi.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=queries)
query_results = {result['query']['userMeta']: result for result in response.json()}
for query in queries:
    query_result = query_results[query['meta']]
    print(query_result['article']['headline'])
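Each query result also carries an error field (null on success, as in the sample response earlier). When batching queries, it is prudent to check this field before touching the extracted data. A minimal sketch — the failure message in the sample data is an invented placeholder:

```python
# Split batched query results into successes and failures by their
# 'error' field (None on success, as in the sample API response).
# NOTE: the failure message below is an invented placeholder.

def partition_results(results):
    """Return (successes, failures) from a list of query results."""
    successes, failures = [], []
    for result in results:
        if result.get('error'):
            failures.append(result)
        else:
            successes.append(result)
    return successes, failures

sample = [
    {'article': {'headline': 'A'}, 'error': None},
    {'article': None, 'error': 'example extraction error'},
]
ok, failed = partition_results(sample)
print(len(ok), len(failed))  # → 1 1
```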

If you would like to learn more about the developer API’s functionality and how you can use it for your specific projects, check out the API documentation (it will be sent to you when you sign up).

Don’t Forget! Join The Beta Program Today

The Developer API beta program is only open for a limited time (until July 9th), so if you would like early, free access to the future of product and article extraction, be sure to sign up for the public beta program today.
