AnyCrawl: A High-Performance Node.js/TypeScript Web Crawler for LLM Data

Summary
AnyCrawl is a powerful Node.js/TypeScript web crawler designed to transform websites into LLM-ready data. It excels at extracting structured SERP results from various search engines and features native multi-threading for efficient bulk processing, making it ideal for large-scale data collection.
Introduction
AnyCrawl is a high-performance, Node.js/TypeScript web crawler and scraping toolkit designed to efficiently gather data from the web. It specializes in transforming raw website content into structured, LLM-ready data, making it an invaluable tool for AI development and data analysis. AnyCrawl supports various operations, including comprehensive site crawling, single-page web scraping, and structured SERP (Search Engine Results Page) data extraction from major search engines like Google. Its native multi-threading capabilities ensure fast and scalable processing for bulk tasks.
Installation
The quickest way to self-host AnyCrawl is with Docker Compose, which handles deployment and setup with a single command.
To run AnyCrawl via Docker Compose:
docker compose up -d
If you enable authentication, you'll need to generate an API key. You can do this by executing a command within the running Docker container:
docker compose exec api pnpm --filter api key:generate -- default
For more detailed installation instructions and configuration options, refer to the official documentation.
Examples
AnyCrawl offers flexible APIs for different scraping needs. The examples below cover LLM-powered structured extraction and SERP retrieval.
Web Scraping with LLM Extraction
Beyond returning page content, AnyCrawl can extract structured data from a page using LLM-powered extraction, guided by a JSON schema you supply in the json_options field.
curl -X POST "https://api.anycrawl.dev/v1/scrape" \
-H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"json_options": {
"schema": {
"type": "object",
"properties": {
"company_mission": { "type": "string" },
"is_open_source": { "type": "boolean" },
"employee_count": { "type": "number" }
},
"required": ["company_mission"]
}
}
}'
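The same request can be issued from Node.js/TypeScript. Below is a minimal sketch using the built-in fetch of Node 18+; it assumes the API key is available in an ANYCRAWL_API_KEY environment variable, and the payload simply mirrors the curl example above. The exact shape of the response is not shown here, so consult the API documentation for the returned fields.

// Minimal sketch: call the /v1/scrape endpoint from Node.js 18+ (built-in fetch).
// The payload mirrors the curl example above; ANYCRAWL_API_KEY is assumed to be
// set in the environment, and the response shape is not detailed here.
const response = await fetch("https://api.anycrawl.dev/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    json_options: {
      schema: {
        type: "object",
        properties: {
          company_mission: { type: "string" },
          is_open_source: { type: "boolean" },
          employee_count: { type: "number" },
        },
        required: ["company_mission"],
      },
    },
  }),
});

if (!response.ok) {
  throw new Error(`Scrape request failed: ${response.status} ${response.statusText}`);
}

const result = await response.json(); // structured data extracted per the schema
console.log(result);

Run this as an ES module (for example with tsx or ts-node in ESM mode) so that top-level await is available.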
Search Engine Results (SERP)
Extract structured search results from supported engines such as Google.
curl -X POST https://api.anycrawl.dev/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "query": "AnyCrawl",
    "limit": 10,
    "engine": "google",
    "lang": "all"
  }'
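For completeness, here is the equivalent call from TypeScript, again a sketch assuming Node 18+ fetch and an ANYCRAWL_API_KEY environment variable; the structure of the individual result items is not documented here, so refer to the API reference.

// Minimal sketch: query the /v1/search endpoint from Node.js 18+ (built-in fetch).
// Mirrors the curl example above; ANYCRAWL_API_KEY is assumed to be set, and the
// shape of each result item is left to the official API reference.
const searchResponse = await fetch("https://api.anycrawl.dev/v1/search", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    query: "AnyCrawl",
    limit: 10,
    engine: "google",
    lang: "all",
  }),
});

if (!searchResponse.ok) {
  throw new Error(`Search request failed: ${searchResponse.status}`);
}

const results = await searchResponse.json(); // structured SERP results
console.log(results);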
You can test these APIs and generate code in your preferred language using the AnyCrawl Playground.
Why Use AnyCrawl?
AnyCrawl stands out for several reasons:
- LLM-Ready Data: It transforms raw HTML into clean, structured data optimized for Large Language Models, simplifying your AI workflows.
- High Performance: Leveraging native multi-threading and multi-process capabilities, AnyCrawl handles bulk tasks efficiently and reliably.
- Versatile Scraping: From full-site traversal to single-page content extraction and structured SERP results, it covers a wide range of web data needs.
- Ease of Integration: Built with Node.js and TypeScript, it's easy to integrate into existing projects and offers a clear API.
- Scalability: Designed for batch processing, it can scale to meet demanding data collection requirements.
Links
- GitHub Repository: any4ai/AnyCrawl
- Official Documentation: docs.anycrawl.dev
- API Playground: anycrawl.dev/playground