crawler.py

Overview

The crawler.py file defines a web crawling component within the InfiniFlow system designed to asynchronously retrieve and extract web content from URLs. It provides configurable parameters controlling the format of the extracted content (e.g., HTML, Markdown, or plain content) and supports optional proxy usage.

The core functionality revolves around the Crawler class, which integrates with an asynchronous web crawling library (crawl4ai) to fetch web pages, process them, and return the extracted data in the desired format. This component is intended for use in environments where automated web data retrieval and processing are needed, such as AI agents or data pipelines.

Detailed Documentation

Classes

`CrawlerParam`

class CrawlerParam(ToolParamBase)

Description:
Defines the configuration parameters for the Crawler component. Inherits from ToolParamBase, which likely provides the base functionality for parameter handling.

Attributes:

Attribute	Type	Default	Description
`proxy`	`str` or `None`	`None`	Optional proxy server address to route requests through.
`extract_type`	`str`	`"markdown"`	Defines the format of the extracted content. Allowed values are `'html'`, `'markdown'`, or `'content'`.

Methods:

__init__(self)
Initializes the parameters with their default values (proxy = None, extract_type = "markdown").
check(self)
Validates the parameter values by checking that extract_type is one of the allowed strings ('html', 'markdown', 'content'). Whether an invalid value raises an exception or is only logged is not stated here.

Usage Example:

params = CrawlerParam()
params.proxy = "http://127.0.0.1:8080"
params.extract_type = "html"
params.check()  # Validates parameters
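
The allowed-value check described for check() can be sketched as a stand-alone function. This is a minimal illustration, not the ToolParamBase implementation: the names ALLOWED_EXTRACT_TYPES and check_extract_type are hypothetical, and raising ValueError is an assumption about how an invalid value might be reported.

```python
# Hypothetical stand-in for the validation performed by CrawlerParam.check().
ALLOWED_EXTRACT_TYPES = ("html", "markdown", "content")


def check_extract_type(value: str) -> bool:
    """Return True if value is allowed; raise ValueError otherwise (assumed behavior)."""
    if value not in ALLOWED_EXTRACT_TYPES:
        raise ValueError(
            f"extract_type must be one of {ALLOWED_EXTRACT_TYPES}, got {value!r}"
        )
    return True


print(check_extract_type("html"))  # prints True

try:
    check_extract_type("pdf")
except ValueError as err:
    print(f"rejected: {err}")
```

Validating before the crawl starts keeps a misconfigured extract_type from surfacing only after a page has already been fetched.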

`Crawler`

class Crawler(ToolBase, ABC)

Description:
Main class implementing the crawling tool. Inherits from ToolBase (presumably a base class for tools in the system) and ABC (Abstract Base Class). Handles the orchestration of URL validation, asynchronous crawling, and content extraction.

Class-level Attributes:

Attribute	Type	Description
`component_name`	`str`	Name identifier `"Crawler"`

Methods:

_run(self, history, **kwargs)
Purpose:
Entry point for running the crawler tool. This method:
- Retrieves input (likely from user or upstream process).
- Validates the input as a URL.
- Runs the asynchronous get_web method to fetch and extract content.
- Handles exceptions gracefully.
Parameters:
Parameter	Type	Description
`history`	Any	Context or history of previous interactions (not explicitly used in this code).
`**kwargs`	dict	Additional keyword arguments (not explicitly used).
Returns:
Output wrapped by Crawler.be_output(), a method inherited from ToolBase that likely formats the output for the system.
Usage Example:
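```
# Sketch only: the ToolBase constructor may require arguments not shown here.
crawler = Crawler()
crawler._param = CrawlerParam()
crawler._param.extract_type = "markdown"
# _run takes no URL argument; it reads the URL from the component's input.
output = crawler._run(history=None)
print(output)
```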
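
The control flow described for _run (validate the input, run the coroutine to completion, convert failures into an output string) can be sketched without the framework. Everything here is a stand-in: looks_like_url replaces is_valid_url, fake_fetch replaces get_web and does no network I/O, and the message strings are illustrative rather than taken from the component.

```python
import asyncio


def looks_like_url(text: str) -> bool:
    # Hypothetical stand-in for is_valid_url; only checks the scheme prefix.
    return text.startswith(("http://", "https://"))


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for Crawler.get_web; performs no network I/O.
    await asyncio.sleep(0)
    if "broken" in url:
        raise RuntimeError("simulated fetch failure")
    return f"# Content of {url}"


def run_once(url: str) -> str:
    """Validate the input, run the coroutine, and turn errors into text."""
    if not looks_like_url(url):
        return "URL not valid"
    try:
        # asyncio.run blocks until the coroutine has finished.
        return asyncio.run(fake_fetch(url))
    except Exception as exc:
        return f"An unexpected error occurred: {exc}"


print(run_once("https://example.com"))
print(run_once("not a url"))
print(run_once("https://broken.example"))
```

Returning an error string instead of raising means a downstream component always receives some output, at the cost of having to inspect the text to detect a failure.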
async get_web(self, url)
Purpose:
Asynchronously fetches the web content for the given URL using AsyncWebCrawler from the crawl4ai package.
Parameters:
Parameter	Type	Description
`url`	`str`	The target URL to crawl.
Returns:
Extracted content from the URL, selected by self._param.extract_type:
- 'html': cleaned_html (the cleaned HTML string)
- 'markdown': markdown (Markdown-formatted text)
- 'content': extracted_content (extracted content text)
Falls back to markdown if extract_type holds an unrecognized value.
Implementation Notes:
- Supports optional proxy usage if specified in parameters.
- Uses an async with block to manage the crawler's lifecycle.
- Enables verbose logging for debugging.
Usage Example:
```
import asyncio

crawler = Crawler()
crawler._param = CrawlerParam()
crawler._param.extract_type = "html"

result = asyncio.run(crawler.get_web("https://example.com"))
print(result)
```
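
The format selection in the Returns description of get_web, including the Markdown fallback, can be sketched as a lookup over the result's fields. FakeCrawlResult is a hypothetical stand-in for the object returned by the crawl; only the three field names come from the description above, and select_output is an illustrative helper, not part of the component.

```python
from dataclasses import dataclass


@dataclass
class FakeCrawlResult:
    # Hypothetical stand-in for the crawl result; no real crawl is performed.
    cleaned_html: str
    markdown: str
    extracted_content: str


def select_output(result: FakeCrawlResult, extract_type: str) -> str:
    """Pick the field matching extract_type, falling back to Markdown."""
    field_by_type = {
        "html": "cleaned_html",
        "markdown": "markdown",
        "content": "extracted_content",
    }
    return getattr(result, field_by_type.get(extract_type, "markdown"))


sample = FakeCrawlResult(
    cleaned_html="<h1>Title</h1>",
    markdown="# Title",
    extracted_content="Title",
)
print(select_output(sample, "html"))     # prints <h1>Title</h1>
print(select_output(sample, "unknown"))  # prints # Title
```

A dictionary lookup with a default keeps the fallback in one place, which is equivalent to an if/elif chain that ends by returning the Markdown field.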

Important Implementation Details

Asynchronous Crawling:
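The crawl itself runs asynchronously through AsyncWebCrawler. _run drives it with asyncio.run(), which blocks until the coroutine completes, so each call to _run handles one URL and appears synchronous to the caller. Code that already runs inside an event loop can await get_web directly instead.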
Input Validation:
The method _run uses an external utility is_valid_url (imported locally) to ensure only valid URLs are processed, preventing errors or misuse.
Flexible Content Extraction:
The crawler can return content in three formats:
- html: Cleaned HTML string (the result's cleaned_html field).
- markdown: Markdown formatted content.
- content: Extracted plain text content.
Error Handling:
The _run method captures and returns unexpected errors gracefully as output strings.
Proxy Support:
Allows an HTTP proxy to be configured for routing crawl requests, which is useful under network restrictions or when anonymity is needed.
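
The implementation of is_valid_url is not shown in this document. As a rough idea of what the input validation step might involve, here is a minimal sketch built on the standard library; is_probably_valid_url is a hypothetical name, and the rule it applies (absolute http or https URL with a host) is an assumption, not the behavior of api.utils.web_utils.

```python
from urllib.parse import urlparse


def is_probably_valid_url(text: str) -> bool:
    """Accept only absolute http(s) URLs that include a host (assumed rule)."""
    try:
        parts = urlparse(text.strip())
    except ValueError:
        # urlparse can raise on malformed input such as an unbalanced IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for candidate in ("https://example.com/page", "example.com", "ftp://example.com"):
    print(candidate, is_probably_valid_url(candidate))
```

Rejecting input that lacks a scheme or host before crawling avoids starting a crawler session for text that cannot be fetched.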

Interactions with Other System Components

ToolParamBase and ToolBase (agent.tools.base):
The crawler inherits parameter management and tool interface functionality from these base classes, integrating smoothly into the agent/tool framework.
AsyncWebCrawler (crawl4ai):
Core asynchronous crawler used to fetch and parse web pages.
is_valid_url (api.utils.web_utils):
Utility function to validate URL inputs.
Output Handling:
Uses be_output method (likely from ToolBase) to generate standardized output consumable by other components or user interfaces.

This design facilitates easy embedding of the crawler as a modular tool in larger AI agent workflows or data pipelines.

Mermaid Class Diagram

classDiagram
    class CrawlerParam {
        +proxy: str or None
        +extract_type: str
        +__init__()
        +check()
    }

    class Crawler {
        +component_name: str
        +_run(history, **kwargs)
        +async get_web(url)
    }

    ToolParamBase <|-- CrawlerParam
    ToolBase <|-- Crawler
    ABC <|-- Crawler

Summary

The crawler.py file provides an asynchronous web crawling tool encapsulated in the Crawler class and configured through CrawlerParam. It supports three output formats and optional proxy usage, and it plugs into the InfiniFlow agent framework through the ToolBase interface, which makes it suitable for web content retrieval in AI and data processing applications.