resume.py
Overview
resume.py is a specialized module designed to parse, process, and chunk resume documents (in PDF, DOC, DOCX, and TXT formats) into structured data suitable for downstream Natural Language Processing (NLP) workflows. It acts as a bridge between raw resume files and an internal knowledge base system by:
Sending resume files to a remote parsing service for initial extraction.
Refining and restructuring parsed data through multi-step internal processing.
Tokenizing and organizing resume content into weighted textual representations.
Updating knowledge base configurations with field mappings.
Returning a structured document suitable for indexing or further NLP tasks.
This module is part of the larger InfiniFlow system, leveraging multiple internal packages for parsing (deepdoc.parser.resume), tokenization (rag.nlp), and knowledge base service (api.db.services.knowledgebase_service).
Detailed Component and Function Documentation
Global Variables
forbidden_select_fields4resume (list[str]):
A list of field keys that should not be automatically selected or simplified when processing the resume data. This prevents certain fields from being prematurely reduced or tokenized.
Function: remote_call(filename: str, binary: bytes) -> dict
Purpose:
Performs a remote API call to a locally hosted resume parsing service, sending the binary content of a resume file encoded in base64. It attempts up to 3 retries to obtain parsed results, then applies internal refactoring and multi-step parsing to return a structured dictionary representing the resume data.
Parameters:
filename (str): The name of the resume file (used for logging and request metadata).
binary (bytes): The raw binary content of the resume file.
Returns:
dict: Parsed and processed resume data structured in key-value form.
Exceptions:
Logs exceptions internally if the remote call or processing fails. If all retries fail, returns an empty dictionary.
Usage Example:
with open("resume.pdf", "rb") as f:
    binary_data = f.read()
parsed_resume = remote_call("resume.pdf", binary_data)
Implementation Details:
Constructs a JSON query with user and file metadata.
Uses requests.post to send data to a local API endpoint at http://127.0.0.1:61670/tog.
Processes the raw results with refactor() from deepdoc.parser.resume.
Removes the education, work, or project sections when they are empty lists.
Applies two-step parsing (step_one.refactor() and step_two.parse()), which likely normalize and enrich the data.
Retries up to 3 times; if every attempt fails, the exception is logged and an empty dictionary is returned.
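The encode-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: build_payload, call_with_retries, and the payload field names are hypothetical, and the HTTP call is passed in as a plain function so the sketch runs without the parsing service.

```python
import base64
import logging
from typing import Callable


def build_payload(filename: str, binary: bytes) -> dict:
    """Wrap the raw file bytes in a JSON-serializable request body."""
    return {
        # Field names are hypothetical, not the service's real schema.
        "filename": filename,
        "content_b64": base64.b64encode(binary).decode("ascii"),
    }


def call_with_retries(
    send: Callable[[dict], dict], payload: dict, attempts: int = 3
) -> dict:
    """Call send(payload) up to `attempts` times; return {} if every try fails."""
    for attempt in range(1, attempts + 1):
        try:
            return send(payload)
        except Exception:
            # Log the failure and try again instead of propagating the error.
            logging.exception("resume parse attempt %d/%d failed", attempt, attempts)
    return {}
```

In real use, send would presumably wrap something like requests.post(url, json=payload).json(); injecting it keeps the retry logic easy to test in isolation.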
Function: chunk(filename: str, binary: bytes = None, callback: callable = None, **kwargs) -> list
Purpose:
Core function of the module that validates the file type, reads file content if not provided, invokes the remote parser (remote_call), and then organizes the parsed resume info into a tokenized, weighted document format for indexing and knowledge base integration.
Parameters:
filename (str): Path or name of the resume file to process.
binary (bytes, optional): Raw binary content of the resume. If not provided, the function reads from the filename.
callback (callable, optional): A callback function to report parsing progress. It is called with a float progress value and a status message.
**kwargs: Additional keyword arguments; expects at least kb_id (knowledge base identifier) for updating the parser config.
Returns:
list[dict]: A list containing one dictionary representing the parsed and tokenized resume document.
Raises:
NotImplementedError: If the file extension is not among the supported types (pdf, doc, docx, txt).
Exception: If the remote parser fails to return sufficient results.
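A regex-based extension check of the kind described could look like the sketch below. The pattern, the helper name ensure_supported, the error message, and the case-insensitive matching are all assumptions made for illustration; the module's real pattern may differ.

```python
import re

# Assumed pattern: match the four supported extensions at the end of the name.
SUPPORTED = re.compile(r"\.(pdf|doc|docx|txt)$", re.IGNORECASE)


def ensure_supported(filename: str) -> None:
    """Raise NotImplementedError unless filename has a supported extension."""
    if not SUPPORTED.search(filename):
        raise NotImplementedError(
            "file type not supported yet (pdf, doc, docx, txt supported)"
        )
```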
Usage Example:
def progress_report(progress, message):
    print(f"{progress*100}%: {message}")
docs = chunk("resume.pdf", callback=progress_report, kb_id=123)
Implementation Details:
Validates file extension with regex.
Reads binary data if not provided.
Reports progress via callback.
Calls remote_call to get the parsed resume dictionary.
Validates that at least 7 keys are present to consider parsing successful.
Uses a predefined field_map dictionary to map internal keys to human-readable Chinese field labels.
Constructs a title from key fields (name_kwd, gender_kwd, position_name_tks, age_int) and tokenizes it using rag_tokenizer.
Builds a content string with weighted fields and tokenizes it on two levels: coarse and fine-grained.
For fields in the resume data, applies tokenization or flattening based on field type and presence in forbidden_select_fields4resume.
Updates the knowledge base parser configuration with the field_map via KnowledgebaseService.update_parser_config.
Returns a list with one document dict containing all processed fields and tokenizations.
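The progress reporting and the minimum-key check in these steps can be sketched as below. This is a hedged illustration under stated assumptions: parse_with_progress and MIN_KEYS are hypothetical names, the parser is injected as a function so the sketch runs without the remote service, and the status messages are taken from the flowchart in this document.

```python
from typing import Callable, Optional

MIN_KEYS = 7  # minimum key count described for chunk


def parse_with_progress(
    parse: Callable[[str, bytes], dict],
    filename: str,
    binary: bytes,
    callback: Optional[Callable[[float, str], None]] = None,
) -> dict:
    """Run `parse`, report progress, and reject results with too few keys."""
    # Fall back to a no-op so callers may omit the callback.
    report = callback or (lambda progress, message: None)
    report(0.2, "Resume parsing is going on...")
    resume = parse(filename, binary)
    if len(resume) < MIN_KEYS:
        report(-1, "Failed parsing")
        raise Exception("Resume parser returned too few fields")
    report(0.6, "Done parsing. Chunking...")
    return resume
```

A negative progress value signals failure to the caller before the exception is raised, so a UI can show the error state even if the exception is handled elsewhere.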
Important Implementation Details and Algorithms
Remote Parsing and Retry Logic: The module depends on an external HTTP service running locally for resume parsing. It attempts up to 3 retries for robustness.
Two-step Resume Parsing: After the remote call, two internal steps (step_one.refactor and step_two.parse) apply additional normalization and parsing logic tailored to resume data.
Tokenization: Uses rag_tokenizer.tokenize and rag_tokenizer.fine_grained_tokenize to create token lists for titles and content for use in NLP models or search indices.
Field Flattening and Cleaning: Certain list fields are flattened if they have a single element or are not in the forbidden list to simplify data structure. Also, spaces are removed from token fields using rmSpace.
Knowledge Base Integration: Updates a knowledge base configuration with the current field map to keep metadata aligned.
Progress Callbacks: Supports progress reporting during parsing to allow integration in UI or logging systems.
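The flattening rule stated above (flatten a list when it has a single element or its key is not in the forbidden list) can be illustrated with the sketch below. What "flatten" produces is an assumption here: items are joined into one space-separated string. The function name flatten_fields is hypothetical as well.

```python
from typing import Any, Iterable


def flatten_fields(resume: dict, forbidden: Iterable[str]) -> dict:
    """Join list values into strings, keeping multi-item forbidden fields as lists."""
    forbidden = set(forbidden)
    out: dict[str, Any] = {}
    for key, value in resume.items():
        if isinstance(value, list) and (len(value) == 1 or key not in forbidden):
            # Assumed meaning of "flatten": join the items into one string.
            out[key] = " ".join(str(item) for item in value)
        else:
            # Non-list values and multi-item forbidden lists pass through unchanged.
            out[key] = value
    return out
```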
Interactions with Other System Components
requests library: Sends HTTP requests to the local resume parsing server.
deepdoc.parser.resume module: Provides refactor, step_one.refactor, and step_two.parse, which preprocess and parse resume data internally.
rag.nlp.rag_tokenizer: Tokenizes textual data into coarse and fine-grained tokens for NLP processing.
rag.utils.rmSpace: Cleans whitespace from tokenized strings.
api.db.services.knowledgebase_service.KnowledgebaseService: Updates knowledge base configurations with parsing metadata.
Local Resume Parsing Service: The HTTP API at http://127.0.0.1:61670/tog is a critical external dependency that actually performs the heavy lifting of initial resume extraction.
Module Execution
When run as a script, the module accepts a filename argument and performs chunking with a dummy callback that does nothing.
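A script entry point of this shape might look like the following sketch. The names dummy and main are illustrative, and the chunking function is passed in as a parameter so the example runs without the rest of the system; the real module presumably calls chunk directly under an if __name__ == "__main__" guard.

```python
from typing import Callable, Optional


def dummy(progress: Optional[float] = None, msg: Optional[str] = None) -> None:
    """No-op progress callback: accepts the arguments and ignores them."""


def main(argv: list, chunk_fn: Callable) -> list:
    """Run chunk_fn on the file named by the first command-line argument."""
    if len(argv) < 2:
        raise SystemExit("usage: python resume.py <filename>")
    return chunk_fn(argv[1], callback=dummy)
```

Under these assumptions, the guard at the bottom of the module would call main(sys.argv, chunk).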
Mermaid Flowchart
The file primarily defines two functions without classes. However, it interacts with several imported classes and modules.
To reflect the structure and workflow of resume.py, the following flowchart illustrates the main function relationships and flow of data:
flowchart TD
    A["Start: chunk(filename, binary, callback, **kwargs)"]
    A --> B{Validate file extension}
    B -- Invalid --> C[Raise NotImplementedError]
    B -- Valid --> D[Read binary if not provided]
    D --> E["callback(0.2, 'Resume parsing is going on...')"]
    E --> F["remote_call(filename, binary)"]
    F --> G{Parsing result valid?}
    G -- No --> H["callback(-1, 'Failed parsing'), Raise Exception"]
    G -- Yes --> I["callback(0.6, 'Done parsing. Chunking...')"]
    I --> J[Build field_map and extract titles]
    J --> K[Tokenize title and content]
    K --> L[Flatten and tokenize fields]
    L --> M[Update KnowledgebaseService with field_map]
    M --> N[Return list with processed resume doc]
    N --> O[End]
    subgraph RemoteCall["Remote Call"]
        F --> F1[Send POST request to local API]
        F1 --> F2[Receive JSON response]
        F2 --> F3[Refactor and clean resume data]
        F3 --> F4["step_one.refactor(DataFrame)"]
        F4 --> F5["step_two.parse(DataFrame)"]
        F5 --> F
    end
Summary
resume.py is a resume processing utility that integrates remote parsing services and internal NLP processing steps to convert various resume file types into structured, tokenized documents for knowledge base ingestion and further analysis. It handles error cases, supports progress callbacks, and updates system-wide parser configurations to maintain consistency.
Appendix: Example Usage
def progress_callback(progress, message):
    print(f"[{progress*100:.1f}%] {message}")

# Parsing a resume file
docs = chunk("candidate_resume.pdf", callback=progress_callback, kb_id=42)
for doc in docs:
    print(doc["docnm_kwd"], doc["title_tks"])
    # Further processing...