Pipet: A Swiss-Army Tool for Web Scraping and Data Extraction

Summary
Pipet is a versatile command-line web scraper designed for hackers, enabling efficient data extraction from various online assets. It supports HTML parsing, JSON parsing, and client-side JavaScript evaluation, leveraging existing tools like `curl` and `playwright` for powerful and flexible scraping operations. This tool is ideal for tracking information, monitoring changes, and automating data collection tasks.
Introduction
Pipet is a powerful and flexible command-line web scraper, often described as a "swiss-army tool" for extracting data from online assets. Built with hackers in mind, it simplifies complex scraping tasks by supporting three primary modes of operation: HTML parsing, JSON parsing, and client-side JavaScript evaluation. Pipet integrates with existing tools like `curl` and `playwright`, and uses Unix pipes to extend its built-in capabilities, making it highly adaptable for various data extraction needs. Whether you need to track shipments, monitor stock prices, or get notified about concert tickets, Pipet provides a robust solution.
Installation
Getting started with Pipet is straightforward, with several installation options available:
Pre-built Binaries
The easiest way to install is by downloading the latest release for your operating system from the official Releases page. After downloading, make the binary executable with `chmod +x pipet` and run `./pipet`.
Compile from Source
If you have Go installed on your system, you can compile and install Pipet directly:
go install github.com/bjesus/pipet/cmd/pipet@latest
Alternatively, you can run it without a full installation using `go run`.
Package Managers
Pipet is also available through various package managers; see the GitHub repository for the current list.
Examples
Pipet's strength lies in its intuitive `.pipet` files, which define how and where to extract data. Here's a quick example to scrape Hacker News:
- Create a file named `hackernews.pipet` with the following content:
    curl https://news.ycombinator.com/
    .title .titleline
      span > a
      .sitebit a
- Run Pipet:
    go run github.com/bjesus/pipet/cmd/pipet@latest hackernews.pipet
    # Or, if installed:
    pipet hackernews.pipet
This will display the latest Hacker News titles and their associated domains directly in your terminal.
Pipet offers many advanced features, including:
- Custom Separators: Use the `--separator` flag to format output.
- JSON Output: Get results as a clean JSON structure with the `--json` flag.
- Templating: Render results into custom HTML or text templates.
- Unix Pipes Integration: Extend functionality by piping data to other command-line tools like `wc` or `htmlq`.
- Monitoring: Set intervals and commands to run on changes, allowing you to track dynamic content.
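The monitoring idea from the list above can be illustrated with a minimal shell sketch: run a command, compare its output with the previous run, and trigger an action only when something changed. This is not Pipet's implementation, and Pipet has its own options for this; the names `STATE_FILE` and `check_for_change` are hypothetical and exist only for illustration.

```shell
# Conceptual sketch of change monitoring; not Pipet's own code.
# STATE_FILE and check_for_change are hypothetical names.
STATE_FILE="$(mktemp)"   # holds the output of the previous run

check_for_change() {
  # $1: command that produces the data, $2: command to run when the data changed
  new_output="$(sh -c "$1")"
  old_output="$(cat "$STATE_FILE")"
  if [ "$new_output" != "$old_output" ]; then
    printf '%s' "$new_output" > "$STATE_FILE"
    sh -c "$2"
  fi
}

check_for_change 'echo v1' 'echo "changed"'   # first run: always counts as a change
check_for_change 'echo v1' 'echo "changed"'   # same output: stays silent
check_for_change 'echo v2' 'echo "changed"'   # new output: triggers again

# A polling loop around a real scrape could look like this (left commented out
# so the sketch terminates):
# while true; do check_for_change 'pipet hackernews.pipet' 'echo "changed"'; sleep 60; done
```

Keeping the previous output in a file rather than a shell variable lets the comparison survive between separate invocations, for example when the check is run from cron instead of a loop.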
Why Use Pipet?
Pipet stands out for its versatility and hacker-friendly design. Its ability to handle HTML, JSON, and JavaScript-rendered content means it can tackle almost any web scraping challenge. By integrating with `curl` for complex HTTP requests and `playwright` for headless browser automation, it provides powerful capabilities without reinventing the wheel. The use of Unix pipes allows for seamless integration into existing workflows and custom data processing. Furthermore, its monitoring features make it an excellent tool for staying updated on online information, from personal alerts to business intelligence.
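As a rough illustration of fitting scraped text into an existing workflow, the sketch below counts stories per domain with standard Unix tools. It assumes each output line holds a title and a domain separated by a tab; the actual layout depends on the separators you configure. The sample lines are invented, and `sample_output` and `domain_counts` are hypothetical helper names.

```shell
# Invented sample standing in for scraper output: title<TAB>domain, one story per line.
sample_output() {
  printf '%s\t%s\n' \
    'Show HN: A tiny scraper' 'github.com' \
    'Why pipes still matter' 'example.org' \
    'Another release' 'github.com'
}

# Read title<TAB>domain lines on stdin and print "domain count", most frequent first.
domain_counts() {
  cut -f2 | sort | uniq -c | sort -rn | awk '{print $2 " " $1}'
}

sample_output | domain_counts
# prints:
# github.com 2
# example.org 1
```

In practice `sample_output` would be replaced by the real scraper invocation, with the rest of the pipeline unchanged.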
Links
- GitHub Repository: https://github.com/bjesus/pipet
- Go Reference: https://pkg.go.dev/github.com/bjesus/pipet
- Releases Page: https://github.com/bjesus/pipet/releases/