Newspaper3k: Advanced News and Article Extraction in Python

Introduction

Newspaper3k is a Python 3 library for extracting news, full-text content, and article metadata from websites. Its API aims for the same simplicity as the requests library, and it uses lxml for fast parsing. It can pull authors, publication dates, main text, and images, and it can run basic Natural Language Processing (NLP) to produce keywords and summaries.

Beyond scraping single pages, the library can identify news URLs on a site, download articles using multiple threads, and work with more than 10 languages, including English, Chinese, German, and Arabic.

Installation

To get started with Newspaper3k, ensure you are using Python 3. The library is installed via pip3.

Important: Install newspaper3k, not newspaper. The newspaper package is for Python 2 and is deprecated.

pip3 install newspaper3k

For Debian / Ubuntu users, you might need to install additional dependencies:

sudo apt-get install python3-pip
sudo apt-get install python3-dev
sudo apt-get install libxml2-dev libxslt-dev
sudo apt-get install libjpeg-dev zlib1g-dev libpng-dev
pip3 install newspaper3k
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

For macOS (OS X) users, using Homebrew:

brew install libxml2 libxslt
brew install libtiff libjpeg webp little-cms2
pip3 install newspaper3k
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
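
After installing, it can help to confirm that Python can find the module. The package is installed as newspaper3k but imported as newspaper, as the examples in this document show. The check below is a minimal sketch using only the standard library; the helper name is_importable is made up for this illustration.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The pip package is "newspaper3k", but the import name is "newspaper".
    if is_importable("newspaper"):
        print("newspaper is importable")
    else:
        print("newspaper not found; try: pip3 install newspaper3k")
```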

Examples

Here are some examples demonstrating how to use Newspaper3k to extract information from articles and news sources.

Extracting a Single Article

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)

article.download()
article.parse()

print(f"Authors: {article.authors}")
print(f"Publish Date: {article.publish_date}")
print(f"Text: {article.text[:200]}...")
print(f"Top Image: {article.top_image}")
print(f"Movies: {article.movies}")

# nlp() must run after download() and parse(); it fills in keywords and summary
article.nlp()

print(f"Keywords: {article.keywords}")
print(f"Summary: {article.summary}")
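
Once an article is parsed, it is often convenient to copy the fields printed above into a plain dictionary, for example to write them out as JSON. The following is a rough sketch, not part of the library: article_to_record and text_limit are hypothetical names, and it assumes publish_date is either a datetime or None. It accepts any object with the same attribute names, so a stand-in object is used here to keep the example runnable without network access.

```python
from datetime import datetime
from types import SimpleNamespace


def article_to_record(article, text_limit=200):
    """Copy commonly used fields from a parsed article into a plain dict."""
    publish_date = getattr(article, "publish_date", None)
    return {
        "url": getattr(article, "url", None),
        "title": getattr(article, "title", None),
        "authors": list(getattr(article, "authors", None) or []),
        # Assumption: publish_date is a datetime, or None if no date was found
        "publish_date": publish_date.isoformat() if publish_date else None,
        "text_preview": (getattr(article, "text", "") or "")[:text_limit],
        "top_image": getattr(article, "top_image", None),
    }


# Stand-in for a parsed Article, using the same attribute names
fake_article = SimpleNamespace(
    url="http://example.com/story",
    title="Example story",
    authors=["A. Writer"],
    publish_date=datetime(2013, 12, 30),
    text="x" * 500,
    top_image="http://example.com/image.jpg",
)

record = article_to_record(fake_article)
print(record["publish_date"])  # 2013-12-30T00:00:00
print(len(record["text_preview"]))  # 200
```

With the real library, the same function could be called as article_to_record(article) after article.parse().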

Building a News Source (Paper)

import newspaper

cnn_paper = newspaper.build('http://cnn.com')

print("First 5 article URLs from CNN:")
for article in cnn_paper.articles[:5]:
    print(article.url)

print("\nCategory URLs from CNN:")
for category in cnn_paper.category_urls():
    print(category)

# You can then download, parse, and NLP individual articles from the paper
cnn_article = cnn_paper.articles[0]
cnn_article.download()
cnn_article.parse()
cnn_article.nlp()
print(f"\nFirst CNN article title: {cnn_article.title}")
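
When working through many articles from a paper, individual pages may fail to download or parse. One possible approach is to process a limited number of articles and record failures instead of stopping. This is a sketch only: parse_first and StubArticle are hypothetical names, and the stub stands in for real article objects so the example runs offline.

```python
def parse_first(articles, limit=5):
    """Download and parse up to `limit` articles, skipping any that fail.

    Returns (parsed, failed), where failed holds (url, error message) pairs.
    """
    parsed, failed = [], []
    for article in articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: whichever step raises
            failed.append((article.url, str(exc)))
        else:
            parsed.append(article)
    return parsed, failed


class StubArticle:
    """Stand-in exposing the same download()/parse() methods as an article."""

    def __init__(self, url, broken=False):
        self.url = url
        self.broken = broken
        self.title = None

    def download(self):
        if self.broken:
            raise RuntimeError("download failed")

    def parse(self):
        self.title = "Title for " + self.url


stubs = [StubArticle("u1"), StubArticle("u2", broken=True), StubArticle("u3")]
parsed, failed = parse_first(stubs, limit=3)
print([a.title for a in parsed])  # ['Title for u1', 'Title for u3']
print(failed)  # [('u2', 'download failed')]
```

With a real paper, the call would look like parse_first(cnn_paper.articles, limit=5).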

Language Detection and Specific Language Usage

Newspaper3k can automatically detect languages or be instructed to use a specific one.

import newspaper  # needed for newspaper.build() below
from newspaper import Article

# Example with Chinese article, specifying language
url_chinese = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a_chinese = Article(url_chinese, language='zh') # Chinese

a_chinese.download()
a_chinese.parse()

print(f"Chinese Article Title: {a_chinese.title}")
print(f"Chinese Article Text (first 150 chars): {a_chinese.text[:150]}...")

# Building a paper for a specific language
sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')
article_sina = sina_paper.articles[0]
article_sina.download()
article_sina.parse()
print(f"\nSina Article Title: {article_sina.title}")
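
When sources in several languages are mixed, one simple option is to decide the language code from the URL's hostname and fall back to auto-detection otherwise. The sketch below uses only the standard library. The HOST_LANGUAGES table and the language_for helper are hypothetical and would need to be filled in for your own sources.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from site domain to language code
HOST_LANGUAGES = {
    "sina.com.cn": "zh",
    "spiegel.de": "de",
}


def language_for(url, default=None):
    """Return the language code configured for the URL's host, else `default`."""
    host = (urlsplit(url).hostname or "").lower()
    for domain, code in HOST_LANGUAGES.items():
        # Match the domain itself or any of its subdomains
        if host == domain or host.endswith("." + domain):
            return code
    return default


print(language_for("http://www.sina.com.cn/"))  # zh
print(language_for("http://cnn.com"))  # None
```

A result of None would mean leaving the language argument out, for example Article(url, language=code) if code else Article(url).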

Why Use Newspaper3k?

Newspaper3k offers several features that make it a practical choice for news and article extraction tasks:

Robust Extraction: It reliably extracts text, authors, publication dates, top images, and all images from HTML content.
NLP Capabilities: Built-in Natural Language Processing allows for keyword and summary extraction, providing deeper insights into article content.
Multi-threaded Downloads: Efficiently download multiple articles concurrently, speeding up data collection.
Multi-language Support: Works in over 10 languages, with seamless auto-detection or explicit language specification, making it versatile for global news sources.
News URL Identification: Smartly identifies news-related URLs, helping to focus your scraping efforts.
Ease of Use: Its API is designed to be intuitive and straightforward, allowing developers to quickly integrate it into their projects.
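
To illustrate the idea behind concurrent downloads, here is a generic sketch built on the standard library's thread pool. It is not Newspaper3k's built-in multi-threading support. download_all and StubArticle are hypothetical names, and the sketch assumes each object exposes a download() method and that failures surface as exceptions.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(articles, max_workers=4):
    """Call download() on every article using a small thread pool.

    Returns a list of (article, error) pairs in input order;
    error is None when download() finished without raising.
    """
    def _download(article):
        try:
            article.download()
            return article, None
        except Exception as exc:
            return article, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input
        return list(pool.map(_download, articles))


class StubArticle:
    """Stand-in with a download() method, so no network access is needed."""

    def __init__(self, url, fail=False):
        self.url = url
        self.fail = fail
        self.html = ""

    def download(self):
        if self.fail:
            raise RuntimeError("could not fetch " + self.url)
        self.html = "<html>" + self.url + "</html>"


results = download_all([StubArticle("a"), StubArticle("b", fail=True), StubArticle("c")])
for article, error in results:
    print(article.url, "ok" if error is None else "failed: " + str(error))
```

Keeping max_workers small is a deliberate choice here, since many parallel requests to a single site can get a client rate limited.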
