Newspaper3k: Advanced News and Article Extraction in Python
Summary
Newspaper3k is a Python 3 library for extracting news articles, full text, and article metadata. Inspired by the simplicity of requests and the speed of lxml, it provides tools for scraping and curating articles from many sources. It suits developers who need to gather and process news content programmatically, with built-in keyword extraction and summarization.
Introduction
Newspaper3k is an exceptional Python 3 library that streamlines the process of extracting news, full-text content, and article metadata from websites. It's built to be simple to use, much like the requests library, and leverages lxml for high-speed parsing. Whether you need to pull authors, publication dates, main text, images, or even perform Natural Language Processing (NLP) for keywords and summaries, Newspaper3k offers a comprehensive solution.
This library is not just about basic scraping; it's designed for advanced article curation, capable of identifying news URLs, handling multi-threaded downloads, and working seamlessly across more than 10 languages, including English, Chinese, German, and Arabic.
Installation
To get started with Newspaper3k, ensure you are using Python 3. The library is installed via pip3.
Important: Install newspaper3k, not newspaper. The newspaper package is for Python 2 and is deprecated.
pip3 install newspaper3k
For Debian / Ubuntu users, you might need to install additional dependencies:
sudo apt-get install python3-pip
sudo apt-get install python3-dev
sudo apt-get install libxml2-dev libxslt-dev
sudo apt-get install libjpeg-dev zlib1g-dev libpng-dev
pip3 install newspaper3k
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
For OSX users, using Homebrew or Macports:
brew install libxml2 libxslt
brew install libtiff libjpeg webp little-cms2
pip3 install newspaper3k
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
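Because the pip package is named newspaper3k while the module it installs is imported as newspaper, a quick check after installation can confirm that the right name resolves. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper, not part of Newspaper3k.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The package is installed as newspaper3k but imported as `newspaper`
print("newspaper importable:", is_importable("newspaper"))
```

If this prints False, the install most likely went to a different Python interpreter than the one running the check.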
Examples
Here are some examples demonstrating how to use Newspaper3k to extract information from articles and news sources.
Extracting a Single Article
from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)

# Fetch the page HTML, then extract the article fields from it
article.download()
article.parse()

print(f"Authors: {article.authors}")
print(f"Publish Date: {article.publish_date}")
print(f"Text: {article.text[:200]}...")
print(f"Top Image: {article.top_image}")
print(f"Movies: {article.movies}")

# nlp() must run after parse() and needs the corpora fetched by download_corpora.py
article.nlp()
print(f"Keywords: {article.keywords}")
print(f"Summary: {article.summary}")
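Once an article has been parsed, its fields can be gathered into a plain dictionary for storage or JSON export. The sketch below is one possible way to do that; `article_to_record` is a hypothetical helper, not part of the library, and it relies only on attributes that appear in the examples in this section. It assumes `publish_date` is either a datetime or None.

```python
import json


def article_to_record(article):
    """Collect the parsed fields of an article-like object into a plain dict."""
    published = article.publish_date
    return {
        "url": article.url,
        "title": article.title,
        "authors": list(article.authors),
        # Assumption: publish_date is a datetime when one was found, else None
        "publish_date": published.isoformat() if published else None,
        "text": article.text,
        "top_image": article.top_image,
    }


# Usage, with `article` already downloaded and parsed:
#   print(json.dumps(article_to_record(article), indent=2))
```

Converting the date to an ISO 8601 string keeps the record serializable with the standard json module.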
Building a News Source (Paper)
import newspaper
cnn_paper = newspaper.build('http://cnn.com')
print("First 5 article URLs from CNN:")
for article in cnn_paper.articles[:5]:
    print(article.url)
print("\nCategory URLs from CNN:")
for category in cnn_paper.category_urls():
    print(category)
# You can then download, parse, and NLP individual articles from the paper
cnn_article = cnn_paper.articles[0]
cnn_article.download()
cnn_article.parse()
cnn_article.nlp()
print(f"\nFirst CNN article title: {cnn_article.title}")
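When processing many articles from a source, individual pages can fail to download or parse, so it may help to skip failures instead of stopping. The following is a rough sketch; `parse_first` is a hypothetical helper, and catching the broad `Exception` is a simplification (the library defines its own `ArticleException`, which could be caught instead).

```python
def parse_first(articles, limit=5):
    """Download and parse up to `limit` articles, skipping any that fail."""
    parsed = []
    for article in articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # simplification: catches any failure
            print(f"Skipping {article.url}: {exc}")
            continue
        parsed.append(article)
    return parsed


# Usage, with a paper built by newspaper.build(...):
#   for a in parse_first(cnn_paper.articles, limit=5):
#       print(a.title)
```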
Language Detection and Specific Language Usage
Newspaper3k can automatically detect languages or be instructed to use a specific one.
import newspaper
from newspaper import Article
# Example with Chinese article, specifying language
url_chinese = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a_chinese = Article(url_chinese, language='zh') # Chinese
a_chinese.download()
a_chinese.parse()
print(f"Chinese Article Title: {a_chinese.title}")
print(f"Chinese Article Text (first 150 chars): {a_chinese.text[:150]}...")
# Building a paper for a specific language
sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')
article_sina = sina_paper.articles[0]
article_sina.download()
article_sina.parse()
print(f"\nSina Article Title: {article_sina.title}")
Why Use Newspaper3k?
Newspaper3k stands out for several reasons, making it an excellent choice for news and article extraction tasks:
- Robust Extraction: It extracts text, authors, publication dates, the top image, and all other images from an article's HTML.
- NLP Capabilities: Built-in Natural Language Processing allows for keyword and summary extraction, providing deeper insights into article content.
- Multi-threaded Downloads: Efficiently download multiple articles concurrently, speeding up data collection.
- Multi-language Support: Works in over 10 languages, with seamless auto-detection or explicit language specification, making it versatile for global news sources.
- News URL Identification: Smartly identifies news-related URLs, helping to focus your scraping efforts.
- Ease of Use: Its API is designed to be intuitive and straightforward, allowing developers to quickly integrate it into their projects.
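The multi-threaded downloads mentioned above are exposed through the library's `news_pool` object. The sketch below shows one way it might be wrapped; `download_all` is a hypothetical helper, and the import sits inside the function only so that the definition loads even where the library is absent.

```python
def download_all(source_urls, threads_per_source=2):
    """Build a paper per URL and download all their articles concurrently."""
    # Imported lazily so defining this function does not require the library
    import newspaper
    from newspaper import news_pool

    papers = [newspaper.build(url) for url in source_urls]
    # Total threads = number of papers * threads_per_source
    news_pool.set(papers, threads_per_source=threads_per_source)
    news_pool.join()  # blocks until every download has finished
    return papers


# Usage (performs network requests):
#   papers = download_all(['http://cnn.com', 'http://www.sina.com.cn/'])
#   first = papers[0].articles[0]
#   first.parse()  # articles are downloaded by the pool but still need parsing
```

Keeping `threads_per_source` small limits the load placed on any single site.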
Links
- GitHub Repository: https://github.com/codelucas/newspaper
- Official Documentation: https://newspaper.readthedocs.io
- Online Demo: http://newspaper-demo.herokuapp.com
- Another Online Demo: http://newspaper.chinazt.cc/