Pipet: A Swiss-Army Tool for Web Scraping and Data Extraction

Introduction

Pipet is a command-line web scraper, often described as a "swiss-army tool" for extracting data from online assets and built with hackers in mind. It supports three modes of operation: HTML parsing, JSON parsing, and client-side JavaScript evaluation. Rather than reimplementing HTTP or browser handling, it relies on existing tools such as curl and Playwright, and it uses Unix pipes to extend its built-in capabilities. Typical uses include tracking shipments, monitoring stock prices, and getting notified about concert tickets.

Installation

Getting started with Pipet is straightforward, with several installation options available:

Pre-built Binaries

The easiest way to install is by downloading the latest release for your operating system from the official Releases page. After downloading, make the binary executable with chmod +x pipet and run ./pipet.

Compile from Source

If you have Go installed on your system, you can compile and install Pipet directly:

go install github.com/bjesus/pipet/cmd/pipet@latest

Alternatively, you can run it without installing it by passing the same module path to go run (go run github.com/bjesus/pipet/cmd/pipet@latest).

Package Managers

Pipet is also available through various package managers:

Arch Linux: pipet-git
Homebrew: pipet
Nix: pipet

Examples

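A Pipet job is defined in a .pipet file, which states where to fetch data from and which parts of it to extract. Here's a quick example that scrapes Hacker News.

Create a file named hackernews.pipet with the following content. The first line is the curl command that fetches the page, and the lines below it are CSS selectors; the indented selectors are applied to each element matched by the line above them: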

curl https://news.ycombinator.com/
.title .titleline
  span > a
  .sitebit a

Run Pipet:
```
go run github.com/bjesus/pipet/cmd/pipet@latest hackernews.pipet
# Or, if installed:
pipet hackernews.pipet
```
This will display the latest Hacker News titles and their associated domains directly in your terminal.

Pipet offers many advanced features, including:

Custom Separators: Use the --separator flag to set the separators used in text output.
JSON Output: Get results as a JSON structure with the --json flag.
Templating: Render results into custom HTML or text templates.
Unix Pipes: Pipe a query's result through other command-line tools such as wc or htmlq.
Monitoring: Rerun a scrape at a set interval and run a command whenever the result changes.
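
As a rough sketch of how several of these features combine, the shell script below writes a variant of the Hacker News file that pipes one query through wc -c, then runs it with a custom separator and with JSON output. The file name hn-pipes.pipet is made up for this example. The --interval and --on-change flags and the {} placeholder in the commented-out monitoring command are assumptions based on Pipet's help text, so check pipet --help on your version before relying on them.

```shell
#!/bin/sh
# Sketch only: combines Unix pipes, --separator and --json.
# Assumptions: pipet is on PATH; hn-pipes.pipet is a hypothetical file name.

# Same query as hackernews.pipet, plus one line piped through wc -c.
cat > hn-pipes.pipet <<'EOF'
curl https://news.ycombinator.com/
.title .titleline
  span > a
  span > a | wc -c
  .sitebit a
EOF

if command -v pipet >/dev/null 2>&1; then
  # Text output with a custom separator.
  pipet --separator " | " hn-pipes.pipet || echo "pipet run failed" >&2

  # The same scrape, emitted as JSON.
  pipet --json hn-pipes.pipet || echo "pipet run failed" >&2

  # Monitoring (assumed flags; runs until interrupted, so left commented out):
  # rerun every 60 seconds and call notify-send when the result changes.
  # pipet --interval 60 --on-change "notify-send {}" hn-pipes.pipet
else
  echo "pipet not found on PATH; wrote hn-pipes.pipet only" >&2
fi
```

If pipet is not installed, the script only writes the file, so it is safe to run before choosing an installation method. notify-send is a Linux desktop utility; substitute whatever notifier your system provides.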

Why Use Pipet?

Pipet's main appeal is that one small tool covers HTML, JSON, and JavaScript-rendered content. It delegates complex HTTP requests to curl and headless browser automation to Playwright instead of reimplementing them. Unix pipes let it slot into existing workflows and hand results to the tools you already use for processing. Its monitoring features also make it useful for keeping track of changing online information, from personal alerts to business intelligence.
