🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.

Don't be shy, join here: https://discord.gg/jP8KfhDhyN

Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

✨ Check out the latest update, v0.6.0

🎉 Version 0.6.0 is now available! This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! Read the release notes →

🤓 My Personal Story

My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my skills in data extraction.

Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn’t meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.

I made Crawl4AI open-source for two reasons. First, it’s my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.

Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.

🧐 Why Crawl4AI?

  1. Built for LLMs: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
  2. Lightning Fast: Delivers results 6x faster with real-time, cost-efficient performance.
  3. Flexible Browser Control: Offers session management, proxies, and custom hooks for seamless data access.
  4. Heuristic Intelligence: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
  5. Open Source & Deployable: Fully open-source with no API keys, ready for Docker and cloud integration.
  6. Thriving Community: Actively maintained by a vibrant community and the #1 trending GitHub repository.

🚀 Quick Start

  1. Install Crawl4AI:
# Install the package
pip install -U crawl4ai

# For pre release versions
pip install crawl4ai --pre

# Run post-installation setup
crawl4ai-setup

# Verify your installation
crawl4ai-doctor

If you encounter any browser-related issues, you can install them manually:

python -m playwright install --with-deps chromium
  2. Run a simple web crawl with Python:
import asyncio
from crawl4ai import *

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
  3. Or use the new command-line interface:
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Use LLM extraction with a specific question
crwl https://www.example.com/products -q "Extract all product prices"

✨ Features

📝 Markdown Generation
  • 🧹 Clean Markdown: Generates clean, structured Markdown with accurate formatting.
  • 🎯 Fit Markdown: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
  • 🔗 Citations and References: Converts page links into a numbered reference list with clean citations.
  • 🛠️ Custom Strategies: Users can create their own Markdown generation strategies tailored to specific needs.
  • 📚 BM25 Algorithm: Employs BM25-based filtering for extracting core information and removing irrelevant content.
📊 Structured Data Extraction
  • 🤖 LLM-Driven Extraction: Supports all LLMs (open-source and proprietary) for structured data extraction.
  • 🧱 Chunking Strategies: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
  • 🌌 Cosine Similarity: Finds relevant content chunks based on user queries for semantic extraction (see the sketch after this list).
  • 🔎 CSS-Based Extraction: Fast schema-based data extraction using XPath and CSS selectors.
  • 🔧 Schema Definition: Define custom schemas for extracting structured JSON from repetitive patterns.
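
Below is a minimal sketch of query-driven extraction with the cosine-similarity strategy mentioned above. It assumes the documented CosineStrategy parameters (semantic_filter, word_count_threshold, top_k) and requires the optional [torch]/[cosine] extras; the query and URL are illustrative.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    # Keep only content chunks semantically close to the query
    strategy = CosineStrategy(
        semantic_filter="inflation and interest rates",  # illustrative query
        word_count_threshold=10,   # ignore very short chunks
        top_k=3,                   # keep the three best-matching clusters
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
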
🌐 Browser Integration
  • 🖥️ Managed Browser: Use user-owned browsers with full control, avoiding bot detection.
  • 🔄 Remote Browser Control: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
  • 👤 Browser Profiler: Create and manage persistent profiles with saved authentication states, cookies, and settings.
  • 🔒 Session Management: Preserve browser states and reuse them for multi-step crawling.
  • 🧩 Proxy Support: Seamlessly connect to proxies with authentication for secure access (see the sketch after this list).
  • ⚙️ Full Browser Control: Modify headers, cookies, user agents, and more for tailored crawling setups.
  • 🌍 Multi-Browser Support: Compatible with Chromium, Firefox, and WebKit.
  • 📐 Dynamic Viewport Adjustment: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
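
A minimal sketch of how the proxy and session features above fit together. The proxy server address, credentials, session name, and URLs below are placeholders; proxy_config on BrowserConfig and session_id on CrawlerRunConfig follow the documented configuration objects.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Route all traffic through an authenticated proxy (placeholder credentials)
    browser_config = BrowserConfig(
        headless=True,
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "proxy_user",
            "password": "proxy_pass",
        },
    )
    # Reusing the same session_id keeps the browser state across calls
    run_config = CrawlerRunConfig(session_id="login_flow")

    async with AsyncWebCrawler(config=browser_config) as crawler:
        step1 = await crawler.arun(url="https://example.com/login", config=run_config)
        step2 = await crawler.arun(url="https://example.com/dashboard", config=run_config)
        print(step1.success, step2.success)

if __name__ == "__main__":
    asyncio.run(main())
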
🔎 Crawling & Scraping
  • 🖼️ Media Support: Extract images, audio, videos, and responsive image formats like srcset and picture.
  • 🚀 Dynamic Crawling: Execute JavaScript and wait for async or sync conditions to extract dynamic content.
  • 📸 Screenshots: Capture page screenshots during crawling for debugging or analysis (see the sketch after this list).
  • 📂 Raw Data Crawling: Directly process raw HTML (raw:) or local files (file://).
  • 🔗 Comprehensive Link Extraction: Extracts internal, external links, and embedded iframe content.
  • 🛠️ Customizable Hooks: Define hooks at every step to customize crawling behavior.
  • 💾 Caching: Cache data for improved speed and to avoid redundant fetches.
  • 📄 Metadata Extraction: Retrieve structured metadata from web pages.
  • 📡 IFrame Content Extraction: Seamless extraction from embedded iframe content.
  • 🕵️ Lazy Load Handling: Waits for images to fully load, ensuring no content is missed due to lazy loading.
  • 🔄 Full-Page Scanning: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
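
A rough sketch of the screenshot and raw-data features above: it saves a full-page screenshot (returned as a base64 string on result.screenshot) and then crawls an inline HTML string via the raw: prefix (file:// works the same way). The output filename is arbitrary.

import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Capture a screenshot alongside the markdown
        shot = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(screenshot=True),
        )
        if shot.screenshot:
            with open("example.png", "wb") as f:
                f.write(base64.b64decode(shot.screenshot))

        # Process raw HTML directly using the raw: prefix
        raw = await crawler.arun(url="raw:<html><body><h1>Hello</h1></body></html>")
        print(raw.markdown)

if __name__ == "__main__":
    asyncio.run(main())
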
🚀 Deployment
  • 🐳 Dockerized Setup: Optimized Docker image with FastAPI server for easy deployment.
  • 🔑 Secure Authentication: Built-in JWT token authentication for API security.
  • 🔄 API Gateway: One-click deployment with secure token authentication for API-based workflows.
  • 🌐 Scalable Architecture: Designed for mass-scale production and optimized server performance.
  • ☁️ Cloud Deployment: Ready-to-deploy configurations for major cloud platforms.
🎯 Additional Features
  • 🕶️ Stealth Mode: Avoid bot detection by mimicking real users.
  • 🏷️ Tag-Based Content Extraction: Refine crawling based on custom tags, headers, or metadata.
  • 🔗 Link Analysis: Extract and analyze all links for detailed data exploration.
  • 🛡️ Error Handling: Robust error management for seamless execution.
  • 🔐 CORS & Static Serving: Supports filesystem-based caching and cross-origin requests.
  • 📖 Clear Documentation: Simplified and updated guides for onboarding and advanced usage.
  • 🙌 Community Recognition: Acknowledges contributors and pull requests for transparency.

Try it Now!

✨ Play around with this Open In Colab notebook

✨ Visit our Documentation Website

Installation 🛠️

Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.

🐍 Using pip

Choose the installation option that best fits your needs:

Basic Installation

For basic web crawling and scraping tasks:

pip install crawl4ai
crawl4ai-setup # Setup the browser

By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.

👉 Note: When you install Crawl4AI, the crawl4ai-setup should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:

  1. Through the command line:

    playwright install
  2. If the above doesn't work, try this more specific command:

    python -m playwright install chromium

This second method has proven to be more reliable in some cases.


Installation with Synchronous Version

The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:

pip install crawl4ai[sync]

Development Installation

For contributors who plan to modify the source code:

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .                    # Basic installation in editable mode

Install optional features:

pip install -e ".[torch]"           # With PyTorch features
pip install -e ".[transformer]"     # With Transformer features
pip install -e ".[cosine]"          # With cosine similarity features
pip install -e ".[sync]"            # With synchronous crawling (Selenium)
pip install -e ".[all]"             # Install all optional features
🐳 Docker Deployment

🚀 Now Available! Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.

New Docker Features

The new Docker implementation includes:

  • Browser pooling with page pre-warming for faster response times
  • Interactive playground to test and generate request code
  • MCP integration for direct connection to AI tools like Claude Code
  • Comprehensive API endpoints including HTML extraction, screenshots, PDF generation, and JavaScript execution
  • Multi-architecture support with automatic detection (AMD64/ARM64)
  • Optimized resources with improved memory management

Getting Started

# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number

# Visit the playground at http://localhost:11235/playground

For complete documentation, see our Docker Deployment Guide.


Quick Test

Run a quick test (works for both Docker options):

import requests
import time

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Poll until the task is complete (status == "completed")
while True:
    result = requests.get(f"http://localhost:11235/task/{task_id}")
    if result.json().get("status") == "completed":
        break
    time.sleep(1)

print(result.json())

For more examples, see our Docker Examples. For advanced configuration, environment variables, and usage examples, see our Docker Deployment Guide.

🔬 Advanced Usage Examples 🔬

You can browse the full set of examples in the docs/examples directory (https://github.com/unclecode/crawl4ai/docs/examples); a few popular ones are shared here.

📝 Heuristic Markdown Generation with Clean and Fit Markdown
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(
        headless=True,  
        verbose=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
        ),
        # markdown_generator=DefaultMarkdownGenerator(
        #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
        # ),
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(len(result.markdown.raw_markdown))
        print(len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
🖥️ Executing JavaScript & Extracting Structured Data without LLMs
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    schema = {
    "name": "KidoCode Courses",
    "baseSelector": "section.charge-methodology .w-tab-content > div",
    "fields": [
        {
            "name": "section_title",
            "selector": "h3.heading-50",
            "type": "text",
        },
        {
            "name": "section_description",
            "selector": ".charge-content",
            "type": "text",
        },
        {
            "name": "course_name",
            "selector": ".text-block-93",
            "type": "text",
        },
        {
            "name": "course_description",
            "selector": ".course-content-text",
            "type": "text",
        },
        {
            "name": "course_icon",
            "selector": ".image-92",
            "type": "attribute",
            "attribute": "src"
        }
    ]
}

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    browser_config = BrowserConfig(
        headless=False,
        verbose=True
    )
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
        cache_mode=CacheMode.BYPASS
    )
        
    async with AsyncWebCrawler(config=browser_config) as crawler:
        
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config
        )

        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} companies")
        print(json.dumps(companies[0], indent=2))


if __name__ == "__main__":
    asyncio.run(main())
📚 Extracting Structured Data with LLMs
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
            # provider="ollama/qwen2", api_token="no-token", 
            llm_config = LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')), 
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
            Do not miss any models in the entire content. One extracted model JSON format should look like this: 
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
        ),            
        cache_mode=CacheMode.BYPASS,
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
🤖 Using Your Own Browser with a Custom User Profile
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
        
        result = await crawler.arun(
            url,
            config=run_config,
            magic=True,
        )
        
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

✨ Recent Updates

Version 0.6.0 Release Highlights

  • 🌎 World-aware Crawling: Set geolocation, language, and timezone for authentic locale-specific content:

      crun_cfg = CrawlerRunConfig(
          url="https://browserleaks.com/geo",          # test page that shows your location
          locale="en-US",                              # Accept-Language & UI locale
          timezone_id="America/Los_Angeles",           # JS Date()/Intl timezone
          geolocation=GeolocationConfig(                 # override GPS coords
              latitude=34.0522,
              longitude=-118.2437,
              accuracy=10.0,
          )
      )
  • 📊 Table-to-DataFrame Extraction: Extract HTML tables directly to CSV or pandas DataFrames:

      # Assumed imports and a placeholder BrowserConfig so this fragment runs inside an async context
      import pandas as pd
      from typing import List
      from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CrawlResult

      browser_config = BrowserConfig(headless=True)
      crawler = AsyncWebCrawler(config=browser_config)
      await crawler.start()
    
      try:
          # Set up scraping parameters
          crawl_config = CrawlerRunConfig(
              table_score_threshold=8,  # Strict table detection
          )
    
          # Execute market data extraction
          results: List[CrawlResult] = await crawler.arun(
              url="https://coinmarketcap.com/?page=1", config=crawl_config
          )
    
          # Process results
          raw_df = pd.DataFrame()
          for result in results:
              if result.success and result.media["tables"]:
                  raw_df = pd.DataFrame(
                      result.media["tables"][0]["rows"],
                      columns=result.media["tables"][0]["headers"],
                  )
                  break
          print(raw_df.head())
    
      finally:
          await crawler.stop()
  • 🚀 Browser Pooling: Pages launch hot with pre-warmed browser instances for lower latency and memory usage

  • 🕸️ Network and Console Capture: Full traffic logs and MHTML snapshots for debugging:

    crawler_config = CrawlerRunConfig(
        capture_network=True,
        capture_console=True,
        mhtml=True
    )
  • 🔌 MCP Integration: Connect to AI tools like Claude Code through the Model Context Protocol

    # Add Crawl4AI to Claude Code
    claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
  • 🖥️ Interactive Playground: Test configurations and generate API requests with the built-in web interface at http://localhost:11235/playground

  • 🐳 Revamped Docker Deployment: Streamlined multi-architecture Docker image with improved resource efficiency

  • 📱 Multi-stage Build System: Optimized Dockerfile with platform-specific performance enhancements

Read the full details in our 0.6.0 Release Notes or check the CHANGELOG.

Previous Version: 0.5.0 Major Release Highlights

  • 🚀 Deep Crawling System: Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies (see the sketch after this list)
  • ⚡ Memory-Adaptive Dispatcher: Dynamically adjusts concurrency based on system memory
  • 🔄 Multiple Crawling Strategies: Browser-based and lightweight HTTP-only crawlers
  • 💻 Command-Line Interface: New crwl CLI provides convenient terminal access
  • 👤 Browser Profiler: Create and manage persistent browser profiles
  • 🧠 Crawl4AI Coding Assistant: AI-powered coding assistant
  • 🏎️ LXML Scraping Mode: Fast HTML parsing using the lxml library
  • 🌐 Proxy Rotation: Built-in support for proxy switching
  • 🤖 LLM Content Filter: Intelligent markdown generation using LLMs
  • 📄 PDF Processing: Extract text, images, and metadata from PDF files
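
A minimal sketch of the deep crawling system, assuming the documented BFSDeepCrawlStrategy and the deep_crawl_strategy option on CrawlerRunConfig; the depth and page limits are illustrative. The DFS and BestFirst variants mentioned above plug into the same slot.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    # Follow internal links breadth-first, two levels deep, capped at 10 pages
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False,
            max_pages=10,
        ),
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(url="https://docs.crawl4ai.com", config=config)
        for page in results:
            print(page.url)

if __name__ == "__main__":
    asyncio.run(main())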

Read the full details in our 0.5.0 Release Notes.

Version Numbering in Crawl4AI

Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.

Version Numbers Explained

Our version numbers follow this pattern: MAJOR.MINOR.PATCH (e.g., 0.4.3)

Pre-release Versions

We use different suffixes to indicate development stages:

  • dev (0.4.3dev1): Development versions, unstable
  • a (0.4.3a1): Alpha releases, experimental features
  • b (0.4.3b1): Beta releases, feature complete but needs testing
  • rc (0.4.3rc1): Release candidates, potential final version

Installation

  • Regular installation (stable version):

    pip install -U crawl4ai
  • Install pre-release versions:

    pip install crawl4ai --pre
  • Install specific version:

    pip install crawl4ai==0.4.3b1

Why Pre-releases?

We use pre-releases to:

  • Test new features in real-world scenarios
  • Gather feedback before final releases
  • Ensure stability for production users
  • Allow early adopters to try new features

For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the --pre flag.

📖 Documentation & Roadmap

🚨 Documentation Update Alert: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!

For current documentation, including installation instructions, advanced features, and API reference, visit our Documentation Website.

To check our development plans and upcoming features, visit our Roadmap.

📈 Development TODOs
  • 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
  • 1. Question-Based Crawler: Natural language driven web discovery and content extraction
  • 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
  • 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
  • 4. Automated Schema Generator: Convert natural language to extraction schemas
  • 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
  • 6. Web Embedding Index: Semantic search infrastructure for crawled content
  • 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
  • 8. Performance Monitor: Real-time insights into crawler operations
  • 9. Cloud Integration: One-click deployment solutions across cloud providers
  • 10. Sponsorship Program: Structured support system with tiered benefits
  • 11. Educational Content: "How to Crawl" video series and interactive tutorials

🤝 Contributing

We welcome contributions from the open-source community. Check out our contribution guidelines for more information.


📄 License & Attribution 许可与归属

This project is licensed under the Apache License 2.0 with a required attribution clause. See the Apache 2.0 License file for details.

Attribution Requirements

When using Crawl4AI, you must include one of the following attribution methods:

1. Badge Attribution (Recommended)

Add one of these badges to your README, documentation, or website:

Theme                        | Badge
Disco Theme (Animated)       | Powered by Crawl4AI
Night Theme (Dark with Neon) | Powered by Crawl4AI
Dark Theme (Classic)         | Powered by Crawl4AI
Light Theme (Classic)        | Powered by Crawl4AI

HTML code for adding the badges:

<!-- Disco Theme (Animated) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Night Theme (Dark with Neon) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Dark Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Light Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Simple Shield Badge -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
</a>

2. Text Attribution

Add this line to your documentation:

This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.

📚 Citation

If you use Crawl4AI in your research or project, please cite:

@software{crawl4ai2024,
  author = {UncleCode},
  title = {Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/unclecode/crawl4ai}},
  commit = {Please use the commit hash you're working with}
}

Text citation format:

UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software]. 
GitHub. https://github.com/unclecode/crawl4ai

📧 Contact

For questions, suggestions, or feedback, feel free to reach out.

Happy Crawling! 🕸️🚀

🗾 Mission

Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.

We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.

🔑 Key Opportunities
  • Data Capitalization: Transform digital footprints into measurable, valuable assets.
  • Authentic AI Data: Provide AI systems with real human insights.
  • Shared Economy: Create a fair data marketplace that benefits data creators.
🚀 Development Pathway
  1. Open-Source Tools: Community-driven platforms for transparent data extraction.
  2. Digital Asset Structuring: Tools to organize and value digital knowledge.
  3. Ethical Data Marketplace: A secure, fair platform for exchanging structured data.

For more details, see our full mission statement.

Star History

Star History Chart
