googlescholar.py
Overview
This file implements a component for querying Google Scholar to retrieve scholarly articles and related academic literature. It provides a structured interface to perform searches on Google Scholar using the scholarly Python package, enabling filtering, sorting, and retrieval of metadata such as titles, authors, abstracts, and URLs of publications.
The component is designed as part of a larger agent/tool framework (agent.tools.base) and adheres to a standardized parameter and execution model, including retry mechanisms and timeout controls. It supports configurable search parameters like query keywords, number of results, sorting criteria, publication year range, and patent inclusion.
Classes and Functions
Class: GoogleScholarParam
Defines the parameters used to configure the Google Scholar search component.
Purpose
Encapsulates all configurable options for the Google Scholar search.
Provides validation for parameter values.
Specifies metadata for integration with UI/input forms or other tooling.
Attributes
meta: ToolMeta
Metadata describing the tool, its name, description, and input parameters.
top_n: int
Number of top results to return (default: 12).
sort_by: str
Sort order for results: either 'date' or 'relevance' (default: 'relevance').
year_low: Optional[int]
Lower bound of publication year filter (default: None).
year_high: Optional[int]
Upper bound of publication year filter (default: None).
patents: bool
Whether to include patents in search results (default: True).
Methods
__init__(self)
Initializes default parameter values and metadata.
check(self)
Validates parameter values:
Ensures top_n is a positive integer.
Checks that sort_by is either 'date' or 'relevance'.
Validates that patents is a boolean.
get_input_form(self) -> dict[str, dict]
Returns a dictionary describing the expected input form fields for this tool's parameters.
Example return value: { "query": { "name": "Query", "type": "line" } }
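The checks listed for check() can be sketched as follows. This is a minimal illustration, not the file's actual code: ScholarParamsSketch is a hypothetical stand-in for GoogleScholarParam, and raising ValueError with these messages is an assumption about how failures are reported.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScholarParamsSketch:
    """Hypothetical stand-in for GoogleScholarParam (not the real class)."""

    top_n: int = 12
    sort_by: str = "relevance"
    year_low: Optional[int] = None
    year_high: Optional[int] = None
    patents: bool = True

    def check(self) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.top_n, bool) or not isinstance(self.top_n, int) or self.top_n <= 0:
            raise ValueError("top_n must be a positive integer")
        if self.sort_by not in ("date", "relevance"):
            raise ValueError("sort_by must be 'date' or 'relevance'")
        if not isinstance(self.patents, bool):
            raise ValueError("patents must be a boolean")
```

Running the validation before the search starts means a misconfigured component fails immediately instead of after a network round trip.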
Usage Example
params = GoogleScholarParam()
params.top_n = 10
params.sort_by = 'date'
params.patents = False
params.check() # Validate parameters
Class: GoogleScholar
Subclasses ToolBase (together with ABC, Python's abstract base class) to implement the Google Scholar search logic.
Purpose
Executes the search query using the scholarly API.
Handles retries, error logging, and timeout.
Retrieves and processes search results into a structured format.
Class Attributes
component_name = "GoogleScholar"
Identifier for this component.
Methods
_invoke(self, **kwargs)
Internal method invoked to perform the search.
Decorated with a timeout decorator to limit execution time (default 12 seconds, or the value of the environment variable COMPONENT_EXEC_TIMEOUT).
Parameters:
query (str, required): Search keyword(s) for Google Scholar.
Behavior:
If no query parameter is provided, returns empty content.
Attempts to execute the query up to max_retries + 1 times.
Uses scholarly.search_pubs with parameters from self._param (such as patents, year range, sort order).
For each result, extracts:
Title (r['bib']['title'])
URL (r["pub_url"])
Content combining authors and abstract.
Stores JSON of results and formatted content in output.
Logs and retries on exceptions with a delay (self._param.delay_after_error).
On continuous failure, sets error output.
Returns:
On success, returns the formalized content string.
On failure, returns an error message string.
thoughts(self) -> str
Returns a string describing the current internal "thought process" or purpose of the component, useful for logging or debugging.
Example: "Looking for scholarly papers on {query}, prioritising reputable sources."
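The result handling described for _invoke (title, URL, and a content string built from authors and abstract) can be sketched as below. The function name format_results, the "Author: ... / Abstract: ..." layout, and the fallback to empty strings for missing fields are assumptions made for illustration; only the field paths r['bib']['title'] and r['pub_url'] come from the description above.

```python
import json


def format_results(records, top_n):
    """Turn scholarly-style records into (json_text, formatted_text)."""
    items = []
    for r in records[:top_n]:
        bib = r.get("bib", {})
        authors = bib.get("author", [])
        if isinstance(authors, str):
            # Tolerate a single author given as a plain string.
            authors = [authors]
        items.append({
            "title": bib.get("title", ""),
            "url": r.get("pub_url", ""),
            "content": "Author: " + ", ".join(authors)
                       + "\nAbstract: " + bib.get("abstract", ""),
        })
    # One block of title / URL / content per result, separated by blank lines.
    formatted = "\n\n".join(
        f"{item['title']}\n{item['url']}\n{item['content']}" for item in items
    )
    return json.dumps(items), formatted
```

Keeping both a JSON form and a plain-text form mirrors the two outputs the component stores: one for programmatic consumers and one for direct inclusion in a prompt or log.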
Usage Example
gs = GoogleScholar()
gs._param = GoogleScholarParam()
gs._param.top_n = 5
result = gs._invoke(query="machine learning optimization")
print(result)
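Under the hood, _invoke is described as forwarding the configured filters to scholarly.search_pubs. A rough sketch of such a call is shown here. The helper search_top_n and its injectable search_pubs argument are hypothetical and exist only so the example can run offline; how the actual file truncates results to top_n is not shown in this document, so the use of islice is an assumption.

```python
from itertools import islice


def search_top_n(query, top_n=12, search_pubs=None, **filters):
    """Return at most top_n publication records for query.

    filters are forwarded unchanged (for example patents, year_low,
    year_high, sort_by). search_pubs can be injected for offline use.
    """
    if search_pubs is None:
        # Third-party dependency; the real call needs network access and
        # may be throttled by Google Scholar.
        from scholarly import scholarly
        search_pubs = scholarly.search_pubs
    # Results are iterated lazily, so islice stops after top_n records.
    return list(islice(search_pubs(query, **filters), top_n))
```

For example, search_top_n("machine learning optimization", top_n=5, sort_by="date", patents=False) would request the five most recent non-patent results.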
Implementation Details and Algorithms
Search Execution:
Uses the scholarly Python library's search_pubs function to perform queries. The API abstracts the complexities of scraping or accessing Google Scholar's search results.
Result Extraction:
Uses lambda functions to extract specific fields from each publication record:
Title from r['bib']['title'].
URL from r['pub_url'].
Content combining author list and abstract.
Retry Logic:
Implements retries on exceptions during querying, with a configurable maximum number of retries (max_retries) and delay (delay_after_error).
Timeout Handling:
Uses a decorator to ensure the search does not exceed a configured timeout, preventing long blocking operations.
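The retry and timeout behavior above can be approximated with the standard library alone. The sketch below is not the timeout decorator from api.utils.api_utils, whose implementation is not shown here; it swaps in a thread-based wait from concurrent.futures instead. All three function names are hypothetical, and only COMPONENT_EXEC_TIMEOUT, the 12-second default, max_retries, and delay_after_error come from the description.

```python
import logging
import os
import time
from concurrent.futures import ThreadPoolExecutor


def exec_timeout_seconds(default=12.0):
    """Read the time limit from COMPONENT_EXEC_TIMEOUT, else use the default."""
    return float(os.environ.get("COMPONENT_EXEC_TIMEOUT", default))


def run_with_retries(search, query, max_retries=2, delay_after_error=0.5):
    """Call search(query) up to max_retries + 1 times.

    Returns (results, None) on success or (None, error_message) on failure.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return list(search(query)), None
        except Exception as exc:  # any failure triggers a retry
            last_error = exc
            logging.warning("search attempt %d failed: %s", attempt + 1, exc)
            if attempt < max_retries:
                time.sleep(delay_after_error)  # pause before the next attempt
    return None, f"search failed after {max_retries + 1} attempts: {last_error}"


def call_with_timeout(func, *args, timeout=None, **kwargs):
    """Run func in a worker thread and stop waiting after the time limit.

    Raises concurrent.futures.TimeoutError if the limit is exceeded. The
    worker thread is not killed; it is simply no longer awaited.
    """
    limit = exec_timeout_seconds() if timeout is None else timeout
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func, *args, **kwargs).result(timeout=limit)
    finally:
        pool.shutdown(wait=False)
```

A caller would combine the two as call_with_timeout(run_with_retries, search_fn, "some query"), so that the time limit bounds all attempts together, including the delays between them.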
Interaction with Other System Components
Base Classes:
Inherits from ToolBase and ToolParamBase in agent.tools.base, which provide standard interfaces for parameter handling, input/output management, and execution lifecycle.
Utility Functions:
Uses the timeout decorator from api.utils.api_utils to enforce execution time limits.
Third-Party Library:
Relies on the scholarly package for underlying Google Scholar search functionality.
Environment Configuration:
Execution timeout can be controlled by the environment variable COMPONENT_EXEC_TIMEOUT.
Output Handling:
Sets outputs such as "formalized_content", "json", and "_ERROR" for downstream consumption in the agent framework.
Mermaid Class Diagram
classDiagram
class GoogleScholarParam {
+meta: ToolMeta
+top_n: int
+sort_by: str
+year_low: int
+year_high: int
+patents: bool
+__init__()
+check()
+get_input_form() dict
}
class GoogleScholar {
+component_name: str
+_invoke(**kwargs) str
+thoughts() str
}
GoogleScholar --> GoogleScholarParam : configured by
ToolParamBase <|-- GoogleScholarParam
ToolBase <|-- GoogleScholar
Summary
This file defines a well-structured component for integrating Google Scholar searches into a larger agent-based system. It abstracts search parameters, provides validation, manages API calls with retries and timeouts, and formats results for further processing. The design leverages inheritance and decorators to maintain clean separation of concerns and robustness in execution.