resume.py

Overview

resume.py is a specialized module designed to parse, process, and chunk resume documents (in PDF, DOC, DOCX, and TXT formats) into structured data suitable for downstream Natural Language Processing (NLP) workflows. It acts as a bridge between raw resume files and an internal knowledge base system by:

Sending resume files to a remote parsing service for initial extraction.
Refining and restructuring parsed data through multi-step internal processing.
Tokenizing and organizing resume content into weighted textual representations.
Updating knowledge base configurations with field mappings.
Returning a structured document suitable for indexing or further NLP tasks.

This module is part of the larger InfiniFlow system, leveraging multiple internal packages for parsing (deepdoc.parser.resume), tokenization (rag.nlp), and knowledge base service (api.db.services.knowledgebase_service).

Detailed Component and Function Documentation

Global Variables

forbidden_select_fields4resume (list[str]):
A list of field keys that should not be automatically selected or simplified when processing the resume data. This prevents certain fields from being prematurely reduced or tokenized.

Function: `remote_call(filename: str, binary: bytes) -> dict`

Purpose:
Sends the base64-encoded binary content of a resume file over HTTP to a locally hosted resume parsing service. It tries the call up to 3 times to obtain parsed results, then applies internal refactoring and two-step parsing to return a structured dictionary representing the resume data.

Parameters:

filename (str): The name of the resume file (used for logging and request metadata).
binary (bytes): The raw binary content of the resume file.

Returns:

dict: Parsed and processed resume data structured in key-value form.

Exceptions:

Logs exceptions internally if the remote call or processing fails. If all retries fail, returns an empty dictionary.

Usage Example:

with open("resume.pdf", "rb") as f:
    binary_data = f.read()
parsed_resume = remote_call("resume.pdf", binary_data)

Implementation Details:

Constructs a JSON query with user and file metadata.
Uses requests.post to send data to a local API endpoint at http://127.0.0.1:61670/tog.
Processes the raw results with refactor() from deepdoc.parser.resume.
Removes empty sections like education, work, or project if they are empty lists.
Applies two-step parsing (step_one.refactor() and step_two.parse()), which likely normalize and enrich the data.
Tries the call up to 3 times. Each failure is logged, and if every attempt fails the function returns an empty dictionary rather than raising.
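
The request-and-retry flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual code: the function name call_with_retries, the payload keys, and the injected post parameter (used so the sketch runs without the service) are all hypothetical. Only the endpoint URL, the base64 encoding, the three tries, and the empty-dictionary fallback come from this document.

```python
import base64
import logging
from typing import Callable

ENDPOINT = "http://127.0.0.1:61670/tog"  # endpoint named in this document


def call_with_retries(filename: str, binary: bytes,
                      post: Callable[[str, dict], dict],
                      attempts: int = 3) -> dict:
    """Send a base64-encoded file to the parser, trying up to `attempts` times.

    `post` is a hypothetical stand-in for an HTTP call such as
    `lambda url, payload: requests.post(url, json=payload).json()`.
    """
    payload = {
        # Hypothetical payload keys; the real request schema is not documented here.
        "filename": filename,
        "content": base64.b64encode(binary).decode("ascii"),
    }
    for attempt in range(1, attempts + 1):
        try:
            return post(ENDPOINT, payload)
        except Exception:
            # Log the failure and fall through to the next attempt.
            logging.exception("Resume parsing attempt %d failed", attempt)
    return {}  # every attempt failed
```

Injecting the transport keeps the retry logic testable in isolation; a caller that receives {} should treat it as a parsing failure.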

Function: `chunk(filename: str, binary: bytes = None, callback: callable = None, **kwargs) -> list`

Purpose:
Core function of the module that validates the file type, reads file content if not provided, invokes the remote parser (remote_call), and then organizes the parsed resume info into a tokenized, weighted document format for indexing and knowledge base integration.

Parameters:

filename (str): Path or name of the resume file to process.
binary (bytes, optional): Raw binary content of the resume. If not provided, the function reads from the filename.
callback (callable, optional): A callback function to report parsing progress. It is called with a float progress value and a status message.
**kwargs: Additional keyword arguments. kb_id (the knowledge base identifier) is expected, because it is used to update the parser configuration.
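
A callback only needs to accept a progress value and a message. The sketch below shows one possible shape that records every report. The class name ProgressRecorder is hypothetical, and treating a negative progress value as a failure signal is an assumption based on the callback(-1, "Failed parsing") step in the flowchart later in this document.

```python
from typing import List, Tuple


class ProgressRecorder:
    """Collects (progress, message) pairs passed to a progress callback."""

    def __init__(self) -> None:
        self.events: List[Tuple[float, str]] = []

    def __call__(self, progress: float, message: str = "") -> None:
        # Store each report in the order it arrives.
        self.events.append((progress, message))

    @property
    def failed(self) -> bool:
        # Assumption: a negative progress value signals failure.
        return any(progress < 0 for progress, _ in self.events)
```

An instance can be passed wherever a callback is expected, for example callback=ProgressRecorder(), and inspected afterwards through its events list.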

Returns:

list[dict]: A list containing one dictionary representing the parsed and tokenized resume document.

Raises:

NotImplementedError: If the file extension is not among supported types (pdf, doc, docx, txt).
Exception: If the remote parser fails to return sufficient results.
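
The two failure conditions can be sketched as small guard functions. The names validate_filename and validate_result, the error messages, and the case-insensitive match are assumptions; the supported extensions and the 7-key threshold are taken from this document.

```python
import re

# Extensions named in this document; case-insensitive matching is an assumption.
SUPPORTED_EXT = re.compile(r"\.(pdf|docx?|txt)$", re.IGNORECASE)
MIN_KEYS = 7  # minimum number of parsed keys, per the implementation details


def validate_filename(filename: str) -> None:
    """Raise NotImplementedError for unsupported file extensions."""
    if not SUPPORTED_EXT.search(filename):
        raise NotImplementedError(
            "Unsupported file type; expected pdf, doc, docx, or txt")


def validate_result(resume: dict) -> dict:
    """Raise if the parser returned too few keys to be usable."""
    if len(resume) < MIN_KEYS:
        raise Exception("Resume parser returned too few fields")
    return resume
```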

Usage Example:

def progress_report(progress, message):
    print(f"{progress:.0%}: {message}")

docs = chunk("resume.pdf", callback=progress_report, kb_id=123)

Implementation Details:

Validates file extension with regex.
Reads binary data if not provided.
Reports progress via callback.
Calls remote_call to get parsed resume dictionary.
Validates that at least 7 keys are present to consider parsing successful.
Uses a predefined field_map dictionary to map internal keys to human-readable Chinese field labels.
Constructs a title from key fields (name_kwd, gender_kwd, position_name_tks, age_int) and tokenizes it using rag_tokenizer.
Builds a content string with weighted fields and tokenizes it on two levels: coarse and fine-grained.
For fields in the resume data, applies tokenization or flattening based on field type and presence in forbidden_select_fields4resume.
Updates knowledge base parser configuration with the field_map via KnowledgebaseService.update_parser_config.
Returns a list with one document dict containing all processed fields and tokenizations.
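
The title-building step can be illustrated with a short sketch. The four field names come from the list above. Everything else is an assumption made for illustration: the function name build_title, joining the parts with "-", taking the first element of list values, and the identity function used as a stand-in for rag_tokenizer.tokenize.

```python
from typing import Any, Callable, Dict, List

# Title fields named in this document.
TITLE_FIELDS = ["name_kwd", "gender_kwd", "position_name_tks", "age_int"]


def build_title(resume: Dict[str, Any],
                tokenize: Callable[[str], str] = lambda text: text) -> str:
    """Join the title fields into one string and pass it to a tokenizer.

    `tokenize` defaults to the identity function; in the real module a
    tokenizer such as rag_tokenizer.tokenize would take its place.
    """
    parts: List[str] = []
    for key in TITLE_FIELDS:
        value = resume.get(key, "")
        if isinstance(value, list):
            # Assumption: only the first element of a list contributes.
            value = value[0] if value else ""
        parts.append(str(value))
    # Assumption: parts are joined with "-".
    return tokenize("-".join(parts))
```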

Important Implementation Details and Algorithms

Remote Parsing and Retry Logic: The module depends on an external HTTP service running locally for resume parsing. For robustness, it tries the call up to 3 times before giving up.
Two-step Resume Parsing: After the remote call, two internal steps (step_one.refactor and step_two.parse) apply additional normalization and parsing logic tailored to resume data.
Tokenization: Uses rag_tokenizer.tokenize and rag_tokenizer.fine_grained_tokenize to create token lists for titles and content for use in NLP models or search indices.
Field Flattening and Cleaning: To simplify the data structure, a list-valued field is flattened when it holds a single element or when its key is not in forbidden_select_fields4resume. Spaces are also removed from token fields using rmSpace.
Knowledge Base Integration: Updates a knowledge base configuration with the current field map to keep metadata aligned.
Progress Callbacks: Supports progress reporting during parsing to allow integration in UI or logging systems.
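
The flattening rule above can be sketched as follows. The function name flatten_fields is hypothetical, and reading "flattened" as "reduced to the first element" is an assumption; the condition itself (single element, or key not in the forbidden list) follows the description in this section.

```python
from typing import Any, Dict, Iterable, Set


def flatten_fields(resume: Dict[str, Any],
                   field_names: Iterable[str],
                   forbidden: Set[str]) -> Dict[str, Any]:
    """Copy the named fields, flattening list values where the rule allows."""
    out: Dict[str, Any] = {}
    for name in field_names:
        if name not in resume:
            continue  # skip fields the parser did not return
        value = resume[name]
        if isinstance(value, list) and value and (
                len(value) == 1 or name not in forbidden):
            # Assumption: flattening keeps only the first element.
            value = value[0]
        out[name] = value
    return out
```

With this rule, a multi-element list survives intact only when its key appears in the forbidden set.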

Interactions with Other System Components

requests library: Sends HTTP requests to the local resume parsing server.
deepdoc.parser.resume module: Uses refactor, step_one, and step_two functions to internally preprocess and parse resume data.
rag.nlp.rag_tokenizer: Tokenizes textual data into coarse and fine-grained tokens for NLP processing.
rag.utils.rmSpace: Cleans whitespace from tokenized strings.
api.db.services.knowledgebase_service.KnowledgebaseService: Updates knowledge base configurations with parsing metadata.
Local Resume Parsing Service: The HTTP API at http://127.0.0.1:61670/tog is a critical external dependency that actually performs the heavy lifting of initial resume extraction.

Module Execution

When run as a script, the module accepts a filename argument and performs chunking with a dummy callback that does nothing.
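
A script entry point of this kind might look like the sketch below. The names dummy and main are hypothetical, and chunk_fn is injected so the sketch runs on its own; how kb_id is supplied in script mode is not stated in this document, so it is left out here.

```python
import sys
from typing import Callable, List, Optional


def dummy(progress: Optional[float] = None, message: str = "") -> None:
    """Progress callback that deliberately ignores every report."""


def main(argv: List[str], chunk_fn: Callable[..., list]) -> list:
    """Run chunking on the filename given as the first argument."""
    if len(argv) < 2:
        raise SystemExit("usage: resume.py <filename>")
    return chunk_fn(argv[1], callback=dummy)

# In the real module this would be guarded by `if __name__ == "__main__":`
# and called as main(sys.argv, chunk).
```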

Mermaid Flowchart

The file defines two functions and no classes, so a class diagram does not apply, although the functions do interact with several imported classes and modules.

The following flowchart illustrates how the main functions relate and how data flows through resume.py:

flowchart TD
    A["Start: chunk(filename, binary, callback, **kwargs)"]
    A --> B{"Validate file extension"}
    B -- Invalid --> C["Raise NotImplementedError"]
    B -- Valid --> D["Read binary if not provided"]
    D --> E["callback(0.2, 'Resume parsing is going on...')"]
    E --> F["remote_call(filename, binary)"]
    F --> G{"Parsing result valid?"}
    G -- No --> H["callback(-1, 'Failed parsing'), raise Exception"]
    G -- Yes --> I["callback(0.6, 'Done parsing. Chunking...')"]
    I --> J["Build field_map and extract titles"]
    J --> K["Tokenize title and content"]
    K --> L["Flatten and tokenize fields"]
    L --> M["Update KnowledgebaseService with field_map"]
    M --> N["Return list with processed resume doc"]
    N --> O["End"]

    subgraph RC["Remote Call"]
        F --> F1["Send POST request to local API"]
        F1 --> F2["Receive JSON response"]
        F2 --> F3["Refactor and clean resume data"]
        F3 --> F4["step_one.refactor(DataFrame)"]
        F4 --> F5["step_two.parse(DataFrame)"]
        F5 --> F
    end

Summary

resume.py is a resume processing utility that combines a remote parsing service with internal NLP processing steps to convert PDF, DOC, DOCX, and TXT resumes into structured, tokenized documents for knowledge base ingestion and further analysis. It handles error cases, supports progress callbacks, and updates the knowledge base parser configuration with the field map so that metadata stays aligned.

Appendix: Example Usage

def progress_callback(progress, message):
    print(f"[{progress*100:.1f}%] {message}")

# Parsing a resume file
docs = chunk("candidate_resume.pdf", callback=progress_callback, kb_id=42)

for doc in docs:
    print(doc["docnm_kwd"], doc["title_tks"])
    # Further processing...
