crawler.py


Overview

The crawler.py file defines a web crawling component for the InfiniFlow system that asynchronously retrieves and extracts web content from URLs. Its configurable parameters select the format of the extracted content (HTML, Markdown, or plain content) and an optional proxy.

The core functionality revolves around the Crawler class, which integrates with an asynchronous web crawling library (crawl4ai) to fetch web pages, process them, and return the extracted data in the desired format. This component is intended for use in environments where automated web data retrieval and processing are needed, such as AI agents or data pipelines.
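
The format selection described above can be illustrated with a minimal, self-contained sketch. FakeCrawlResult and select_output are hypothetical names introduced only for this example, and the field names are assumptions; they do not claim to match the attributes that crawl4ai actually returns.

```python
from dataclasses import dataclass


@dataclass
class FakeCrawlResult:
    """Hypothetical stand-in for a crawl result; field names are assumptions."""
    html: str
    markdown: str
    content: str


def select_output(result, extract_type="markdown"):
    """Return the field of `result` that matches `extract_type`."""
    # Map each documented extract_type value to a field on the stand-in result.
    fields = {"html": "html", "markdown": "markdown", "content": "content"}
    if extract_type not in fields:
        raise ValueError(f"unsupported extract_type: {extract_type!r}")
    return getattr(result, fields[extract_type])


page = FakeCrawlResult(html="<h1>Title</h1>", markdown="# Title", content="Title")
print(select_output(page, "html"))
print(select_output(page))  # falls back to the documented default, "markdown"
```

A lookup table keeps the set of allowed values in one place, so an unsupported value fails loudly instead of silently returning the wrong format.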


Detailed Documentation

Classes


CrawlerParam

class CrawlerParam(ToolParamBase)

Description:
Defines the configuration parameters for the Crawler component. Inherits from ToolParamBase, which presumably provides the base parameter-handling functionality.

Attributes:

| Attribute | Type | Default | Description |
|---|---|---|---|
| proxy | str or None | None | Optional proxy server address to route requests through. |
| extract_type | str | "markdown" | Format of the extracted content. Allowed values are 'html', 'markdown', or 'content'. |

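Methods:

__init__(): Initializes the attributes listed above with their defaults.
check(): Validates the parameters.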

Usage Example:

params = CrawlerParam()
params.proxy = "http://127.0.0.1:8080"
params.extract_type = "html"
params.check()  # Validates parameters
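
As a rough illustration of what the validation step might involve, the sketch below re-creates the two documented attributes in a standalone class. CrawlerParamSketch is a hypothetical name, and the specific checks and the True return value are assumptions; the real check() and ToolParamBase may behave differently.

```python
ALLOWED_EXTRACT_TYPES = ("html", "markdown", "content")


class CrawlerParamSketch:
    """Hypothetical stand-in for CrawlerParam; not the real class."""

    def __init__(self):
        # Defaults taken from the attribute table above.
        self.proxy = None
        self.extract_type = "markdown"

    def check(self):
        """Raise ValueError if a parameter is outside the documented values."""
        if self.extract_type not in ALLOWED_EXTRACT_TYPES:
            raise ValueError(
                f"extract_type must be one of {ALLOWED_EXTRACT_TYPES}, "
                f"got {self.extract_type!r}"
            )
        if self.proxy is not None and not isinstance(self.proxy, str):
            raise ValueError("proxy must be a string or None")
        return True  # assumption: the real check() may return nothing


params = CrawlerParamSketch()
params.extract_type = "html"
print(params.check())
```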

Crawler

class Crawler(ToolBase, ABC)

Description:
Main class implementing the crawling tool. Inherits from ToolBase (presumably a base class for tools in the system) and ABC (Abstract Base Class). Orchestrates URL validation, asynchronous crawling, and content extraction.
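
The orchestration described above (validate the URL, run the asynchronous fetch, return the result) could be sketched as follows. This is a simplified illustration, not the actual implementation: is_valid_url, fake_get_web, and run_crawl are hypothetical names, the error strings are assumptions, and the fetch is a stub that performs no network access in place of the crawl4ai call.

```python
import asyncio
from urllib.parse import urlparse


def is_valid_url(url):
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


async def fake_get_web(url):
    """Hypothetical stand-in for the asynchronous fetch; no network access."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"# Fetched {url}"


def run_crawl(url, fetch=fake_get_web):
    """Synchronous entry point that validates the URL, then drives the coroutine."""
    if not is_valid_url(url):
        return "URL not valid"
    try:
        # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
        return asyncio.run(fetch(url))
    except Exception as exc:
        return f"An unexpected error occurred: {exc}"


print(run_crawl("https://example.com"))
print(run_crawl("not a url"))
```

Note that asyncio.run cannot be called from code that is already running inside an event loop, so a synchronous wrapper of this kind only suits callers that are themselves synchronous.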

Class-level Attributes:

| Attribute | Type | Description |
|---|---|---|
| component_name | str | Name identifier "Crawler" |

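Methods:

_run(history, **kwargs): Presumably the entry point that validates the URL and drives get_web.
async get_web(url): Asynchronously fetches the page at url and returns the extracted content in the configured format.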


Important Implementation Details


Interactions with Other System Components

This design facilitates easy embedding of the crawler as a modular tool in larger AI agent workflows or data pipelines.


Mermaid Class Diagram

classDiagram
    class CrawlerParam {
        +proxy: str or None
        +extract_type: str
        +__init__()
        +check()
    }

    class Crawler {
        +component_name: str
        +_run(history, **kwargs)
        +async get_web(url)
    }

    ToolParamBase <|-- CrawlerParam
    ToolBase <|-- Crawler
    ABC <|-- Crawler

Summary

The crawler.py file provides an asynchronous web crawling tool encapsulated in the Crawler class and configured via CrawlerParam. It supports three output formats (HTML, Markdown, and plain content) and optional proxy usage, and it plugs into the InfiniFlow agent framework as a tool for retrieving web content in AI and data processing applications.