crawler.py
Overview
The crawler.py file defines a web crawling component within the InfiniFlow system designed to asynchronously retrieve and extract web content from URLs. It provides configurable parameters controlling the format of the extracted content (e.g., HTML, Markdown, or plain content) and supports optional proxy usage.
The core functionality revolves around the Crawler class, which integrates with an asynchronous web crawling library (crawl4ai) to fetch web pages, process them, and return the extracted data in the desired format. This component is intended for use in environments where automated web data retrieval and processing are needed, such as AI agents or data pipelines.
Detailed Documentation
Classes
CrawlerParam
class CrawlerParam(ToolParamBase)
Description:
Defines configuration parameters for the Crawler component. Inherits from ToolParamBase, which likely provides base functionality for parameter handling.
Attributes:
| Attribute | Type | Default | Description |
|---|---|---|---|
| proxy | str or None | | Optional proxy server address to route requests through. |
| extract_type | str | | Defines the format of the extracted content. Allowed values are 'html', 'markdown', and 'content'. |
Methods:
__init__(self): Initializes the parameters with default values.
check(self): Validates the parameter values. Specifically, it checks that extract_type is one of the allowed strings ('html', 'markdown', 'content'). If the value is invalid, it raises an error or logs the problem.
Usage Example:
params = CrawlerParam()
params.proxy = "http://127.0.0.1:8080"
params.extract_type = "html"
params.check() # Validates parameters
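The validation described for check() can be sketched as follows. This is a minimal illustration, not the actual CrawlerParam: the class name SketchCrawlerParam, the default values, and the choice of ValueError are all assumptions, since the source only says that invalid values are rejected.

```python
# The three formats the documentation lists as allowed.
ALLOWED_EXTRACT_TYPES = ("html", "markdown", "content")


class SketchCrawlerParam:
    """Hypothetical stand-in for CrawlerParam; not the real class."""

    def __init__(self):
        self.proxy = None               # assumed default, not confirmed by the source
        self.extract_type = "markdown"  # assumed default, not confirmed by the source

    def check(self):
        # Reject any value outside the three documented formats.
        if self.extract_type not in ALLOWED_EXTRACT_TYPES:
            raise ValueError(
                f"extract_type must be one of {ALLOWED_EXTRACT_TYPES}, "
                f"got {self.extract_type!r}"
            )
        return True
```

Calling check() right after setting the attributes surfaces a bad format before any network request is attempted.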
Crawler
class Crawler(ToolBase, ABC)
Description:
Main class implementing the crawling tool. Inherits from ToolBase (presumably a base class for tools in the system) and ABC (Abstract Base Class). Handles the orchestration of URL validation, asynchronous crawling, and content extraction.
Class-level Attributes:
| Attribute | Type | Description |
|---|---|---|
| component_name | str | Name identifier |
Methods:
_run(self, history, **kwargs)
Purpose:
Entry point for running the crawler tool. This method:
Retrieves input (likely from user or upstream process).
Validates the input as a URL.
Runs the asynchronous get_web method to fetch and extract content.
Handles exceptions gracefully.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| history | Any | Context or history of previous interactions (not explicitly used in this code). |
| **kwargs | dict | Additional keyword arguments (not explicitly used). |
Returns:
Output wrapped by Crawler.be_output(), a method inherited from ToolBase that likely formats the output for the system.
Usage Example:
crawler = Crawler()
crawler._param = CrawlerParam()
crawler._param.extract_type = "markdown"
output = crawler._run(history=None)
print(output)
async get_web(self, url)
Purpose:
Asynchronously fetches the web content for the given URL using AsyncWebCrawler from the crawl4ai package.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| url | str | The target URL to crawl. |
Returns:
Extracted content from the URL in the format specified by self._param.extract_type. Possible return values are:
cleaned_html (cleaned HTML string)
markdown (Markdown-formatted text)
extracted_content (plain content text)
Defaults to Markdown if an unrecognized type is specified.
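The mapping from extract_type to a returned field, including the Markdown fallback, might be expressed as in the sketch below. FakeResult and select_content are hypothetical names introduced for illustration; only the three field names come from the description above, and the real method may well use a plain if/elif chain instead of a dictionary.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for the crawl result object."""
    cleaned_html: str
    markdown: str
    extracted_content: str


def select_content(result, extract_type):
    # Map each documented extract_type to the result field it returns.
    field_by_type = {
        "html": "cleaned_html",
        "markdown": "markdown",
        "content": "extracted_content",
    }
    # Unrecognized types fall back to the Markdown field.
    return getattr(result, field_by_type.get(extract_type, "markdown"))
```

Keeping the mapping in one table makes the fallback explicit and keeps the allowed values in a single place.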
Implementation Notes:
Supports optional proxy usage if specified in parameters.
Uses an async with context to manage the crawler lifecycle.
Enables verbose logging for debugging.
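To illustrate what managing the lifecycle with async with means, here is a small sketch built on a stub. StubCrawler, its constructor arguments, and its arun method are assumptions made for this example and perform no network I/O; they are not the crawl4ai API, whose exact signatures should be checked against its documentation.

```python
import asyncio


class StubCrawler:
    """Hypothetical stand-in for an async crawler; performs no network I/O."""

    def __init__(self, verbose=False, proxy=None):
        self.verbose = verbose
        self.proxy = proxy  # optional proxy address, None when unused
        self.events = []

    async def __aenter__(self):
        self.events.append("open")   # resources would be acquired here
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.events.append("close")  # runs even if the body raised
        return False                 # do not suppress exceptions

    async def arun(self, url):
        self.events.append(f"fetch {url}")
        return f"content of {url}"


async def fetch(url, proxy=None):
    crawler = StubCrawler(verbose=True, proxy=proxy)
    async with crawler as session:
        text = await session.arun(url)
    return text, crawler.events
```

The recorded events show that the crawler is opened before the fetch and closed afterwards, which is the guarantee the async with block provides.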
Usage Example:
import asyncio

crawler = Crawler()
crawler._param = CrawlerParam()
crawler._param.extract_type = "html"
result = asyncio.run(crawler.get_web("https://example.com"))
print(result)
Important Implementation Details
Asynchronous Crawling:
get_web is a coroutine built on AsyncWebCrawler, and asyncio.run() is used to drive it to completion from synchronous code. The fetch itself is non-blocking inside the event loop, which helps with slow-loading pages, although each asyncio.run() call still waits for its own crawl to finish.
Input Validation:
The _run method uses an external utility, is_valid_url (imported locally), to ensure only valid URLs are processed, preventing errors or misuse.
Flexible Content Extraction:
The crawler can return content in three formats:
html: HTML string (the cleaned_html field).
markdown: Markdown-formatted content.
content: Extracted plain text content.
Error Handling:
The _run method captures unexpected errors and returns them as output strings.
Proxy Support:
Allows configuration of an HTTP proxy to route crawling requests, useful for network restrictions or anonymity.
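The validate, run, and report-errors flow described for _run can be sketched with the standard library alone. Everything here is illustrative: looks_like_url is a simplified stand-in for is_valid_url, fake_get_web replaces the real get_web, and the returned message strings are assumptions rather than the component's actual output.

```python
import asyncio
from urllib.parse import urlparse


def looks_like_url(text):
    # Simplified stand-in for is_valid_url; the real check may be stricter.
    parts = urlparse(text)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


async def fake_get_web(url):
    # Hypothetical placeholder for get_web; performs no network I/O.
    return f"# Page at {url}"


def run_sketch(user_input, fetch=fake_get_web):
    if not looks_like_url(user_input):
        return "URL not valid"
    try:
        # Drive the coroutine to completion from synchronous code.
        return asyncio.run(fetch(user_input))
    except Exception as exc:
        # Report failures as an output string instead of raising.
        return f"An unexpected error occurred: {exc}"
```

Returning a string on failure, rather than raising, keeps the caller's control flow uniform, at the cost of requiring the caller to distinguish error text from page content.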
Interactions with Other System Components
ToolParamBase and ToolBase (agent.tools.base):
The crawler inherits parameter management and tool interface functionality from these base classes, integrating smoothly into the agent/tool framework.
AsyncWebCrawler (crawl4ai):
Core asynchronous crawler used to fetch and parse web pages.
is_valid_url (api.utils.web_utils):
Utility function to validate URL inputs.
Output Handling:
Uses the be_output method (likely from ToolBase) to generate standardized output consumable by other components or user interfaces.
This design facilitates easy embedding of the crawler as a modular tool in larger AI agent workflows or data pipelines.
Mermaid Class Diagram
classDiagram
class CrawlerParam {
+proxy: str or None
+extract_type: str
+__init__()
+check()
}
class Crawler {
+component_name: str
+_run(history, **kwargs)
+async get_web(url)
}
ToolParamBase <|-- CrawlerParam
ToolBase <|-- Crawler
ABC <|-- Crawler
Summary
The crawler.py file provides an asynchronous web crawling tool encapsulated in the Crawler class and configured via CrawlerParam. It supports three output formats and optional proxy usage, and it integrates with the InfiniFlow agent framework for web content retrieval in AI and data processing applications.