It’s a 21st-century truism that web data touches virtually every aspect of our daily lives. We create, consume, and interact with it while we’re working, shopping, traveling, and relaxing. It’s not surprising that web data helps companies innovate and get ahead of their competitors. But how can you actually get data from websites? And what’s this thing called ‘web scraping’?

Why would you want to extract data from a website?

Up-to-date, trustworthy data from other websites is the rocket fuel that can power every organization’s successful growth, including your own.

You might want to compare the pricing of competitors’ products across popular e-commerce sites. You could…

In this tutorial, you will learn how to scale up your existing Scrapy project to make more requests and extract more web data.

Scrapy is a very popular web crawling framework and can make your life much easier if you’re a web data extraction developer. Scrapy can handle many web scraping jobs, including URL discovery, parsing, data cleaning, and custom data pipelines. But there’s one thing that Scrapy cannot do out of the box, and it has become a must if you want to extract large amounts of data reliably: proxy management.
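As a rough illustration of what the simplest form of proxy management involves, here is a minimal round-robin sketch using only the standard library. The proxy URLs and the helper name `make_proxy_picker` are hypothetical. The one Scrapy detail it relies on is that the built-in HttpProxyMiddleware reads the `proxy` key from `request.meta`. Real proxy management also has to deal with bans, retries, and throttling, which this sketch does not attempt.

```python
from itertools import cycle

# Hypothetical proxy endpoints, for illustration only.
PROXIES = [
    "http://proxy-a.example.com:8000",
    "http://proxy-b.example.com:8000",
    "http://proxy-c.example.com:8000",
]


def make_proxy_picker(proxies):
    """Return a function that hands out proxies in round-robin order."""
    pool = cycle(proxies)

    def next_meta():
        # Scrapy's HttpProxyMiddleware looks for the "proxy" key in request.meta.
        return {"proxy": next(pool)}

    return next_meta


next_meta = make_proxy_picker(PROXIES)

# Four consecutive requests: the fourth wraps around to the first proxy.
metas = [next_meta() for _ in range(4)]
```

Inside a spider, the returned dictionary could be passed as `meta=next_meta()` when building each `scrapy.Request`.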

In order to scale…

Dynamic pricing is a great tool for businesses, especially those in the e-commerce field. A lot of major companies already use web-extracted pricing data to formulate pricing strategies, adapt to price variations, spot MAP violations, and analyze customer opinions. Adding dynamic pricing to that brings an array of benefits, like following the competition, adjusting prices instantly, and easily capturing quantitative metrics about your products to boost revenue.
Using dynamic pricing makes perfect sense for the bottom line of your business.

If you’re looking to learn more about dynamic pricing and how to make the most of it, I…

Handling JavaScript objects is an important skill for any web data extraction developer. You might only start dipping your toes into this area when dealing with dynamic pages, but you will quickly see that <script> tags are a good way to get data in general. At the start, it can seem daunting to get data from the nested dictionaries inside blocks of JavaScript code. However, I am going to introduce you to two packages I use to make getting info from these a breeze!

Parse script text using the chompjs library

So, let’s assume you’ve extracted this text from a script tag:

__DATA__ = {"data":{"type":"@products", "products":[{"id":12345678…
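The sample above is truncated, so the sketch below uses a made-up script string instead. It is not chompjs: it is a standard-library fallback that only works when the embedded object is strict JSON, using `json.JSONDecoder.raw_decode` to read one object and ignore whatever follows it. chompjs, through `chompjs.parse_js_object`, is aimed at the looser JavaScript syntax (unquoted keys, single quotes, trailing commas) that the `json` module rejects. The variable name `__STATE__` and the helper `extract_first_json_object` are assumptions for illustration.

```python
import json

# Made-up script text for illustration; not the truncated sample above.
script_text = (
    'window.__STATE__ = {"data": {"products": '
    '[{"id": 1, "name": "Widget"}, {"id": 2, "name": "Gadget"}]}};'
)


def extract_first_json_object(text):
    """Decode the first {...} object found in text, ignoring trailing code."""
    start = text.index("{")  # raises ValueError if there is no object at all
    # raw_decode parses one JSON value starting at `start` and returns
    # the value together with the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


data = extract_first_json_object(script_text)
product_names = [product["name"] for product in data["data"]["products"]]
```

Once the text is a Python dictionary, the nested values can be reached with ordinary key and index lookups, as the last line shows.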

Article and news data extraction is becoming increasingly popular and widely used by companies. Data quality plays a vital role in making sure these projects succeed. If the quality of the extracted articles is not good enough, your whole business could be at risk, especially if it depends on the constant flow of high-quality article data.

Data quality enables your business to move data across your organization and transform it into something valuable for your users or customers. …

We are delighted to announce that ExtractSummit 2021 will once again be a virtual event, brought to you by Zyte (formerly Scrapinghub). We had to make a quick decision last year and turn the event into a virtual one. The good news was that it allowed so many of our community to join us from all over the world: we had 3,000 web data extraction experts sign up.

We loved that so many more of you could join us so this year we want to create an online experience that will allow the community to interact and network more.

2020 and 2021 thus far have not been business as usual. Last year began with the continuation of the longest bull market run in modern economic history, which traces its beginnings to the market low of early March 2009 that followed the financial crisis triggered by sub-prime mortgages. The S&P 500 closed at 676.53 on 9th March 2009 and at 3,386.15 on 19th February 2020. That’s roughly a fivefold rise (a gain of about 400%) in a little over a decade, one of the best runs ever for a major index and close to the 417% run experienced in the 1990s.
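Percentage growth is easy to misread, because "the end value as a percentage of the start" and "the gain as a percentage of the start" differ by 100 points. A quick check using the two closing values quoted above (the variable names are illustrative):

```python
start_close = 676.53   # S&P 500 close on 9 March 2009, as quoted above
end_close = 3386.15    # S&P 500 close on 19 February 2020, as quoted above

# How many times larger the index was at the end of the run.
ratio = end_close / start_close

# Gain relative to the starting level, in percent.
gain_pct = (ratio - 1) * 100

print(f"ratio: {ratio:.2f}x, gain: {gain_pct:.1f}%")
```

The end value is about five times the start value, which corresponds to a gain of roughly 400%.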

Then came March 2020 and the…

Web data touches every aspect of our lives. We create, consume, and interact with it while we’re working, shopping, traveling, and relaxing. Extracting meaningful data from the web reliably, cost-effectively, and at scale can play a vital role in helping companies get up-to-the-minute insights about their own brand, understand the competitive landscape, and optimize their product offerings and pricing strategies.

Current events have intensified digital transformation, shaping new digital habits. Social distancing along with the shift to remote working has changed the way we shop, socialize and consume data. …

Six years ago, Zyte (formerly Scrapinghub) released Dateparser, an open-source library that parses human-readable dates. In October 2020 we released version 1.0.0, a very important milestone. In this article, I’d like to introduce this library and share some insights into why it’s so popular.

Dateparser was developed to make date extraction from HTML pages easier. Initially, it was used only by web scraping developers, but later it was quickly adopted by the wider community as well. It has been used for multiple applications like command-line tools, chatbots, etc.

Key features of Dateparser

  • Support for almost every existing date format: absolute…

Whether we are talking about the credit-granting or insurance underwriting arena, alternative data usually refers to datasets not inherently related to an individual’s credit or insurance claim behavior. Traditional data is usually limited to data originating at a credit bureau (think Equifax, Experian, TransUnion), credit or insurance application data, or an institution’s proprietary files on an existing customer.

Alternative data has in a lot of senses become not only a hot topic but even a buzzword, partly because of the data explosion of the last decade (IDC estimated that 1.2 zettabytes of data were created in 2010. 2018 saw…

Zyte (formerly Scrapinghub)

Hi, we’re Zyte, the central point of entry for all your web data needs.
