resume.py


Overview

resume.py is a specialized module that parses, processes, and chunks resume documents (in PDF, DOC, DOCX, and TXT formats) into structured data suitable for downstream Natural Language Processing (NLP) workflows. It acts as a bridge between raw resume files and an internal knowledge base system.

This module is part of the larger InfiniFlow system, leveraging multiple internal packages for parsing (deepdoc.parser.resume), tokenization (rag.nlp), and knowledge base service (api.db.services.knowledgebase_service).


Detailed Component and Function Documentation

Global Variables


Function: remote_call(filename: str, binary: bytes) -> dict

Purpose:
Performs a remote API call to a locally hosted resume parsing service, sending the binary content of a resume file encoded in base64. It attempts up to 3 retries to obtain parsed results, then applies internal refactoring and multi-step parsing to return a structured dictionary representing the resume data.

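Parameters:
filename (str): Name of the resume file being parsed.
binary (bytes): Raw file content, sent to the parsing service base64-encoded.

Returns:
dict: Structured resume data produced by the remote parser and the subsequent refactoring and parsing steps.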

Exceptions:

Usage Example:

with open("resume.pdf", "rb") as f:
    binary_data = f.read()
parsed_resume = remote_call("resume.pdf", binary_data)

Implementation Details:
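As a rough illustration of the behaviour described under Purpose (base64-encode the file, send it to the parsing service, give up after three attempts), the sketch below uses only the standard library. The send callable, the payload field names "filename" and "content", and the helper name call_with_retries are hypothetical; they stand in for the real HTTP call, whose URL and request schema this document does not specify.

```python
import base64
import json
from typing import Callable


def call_with_retries(filename: str, binary: bytes,
                      send: Callable[[str], str],
                      max_attempts: int = 3) -> dict:
    """Send a base64-encoded file through `send`, retrying on any failure."""
    # Hypothetical payload layout: the real service's schema is not documented here.
    payload = json.dumps({
        "filename": filename,
        "content": base64.b64encode(binary).decode("ascii"),
    })
    last_error = None
    for _ in range(max_attempts):
        try:
            # `send` is expected to return the service's JSON response as text.
            return json.loads(send(payload))
        except Exception as exc:  # network error or malformed JSON
            last_error = exc
    raise RuntimeError(f"no result after {max_attempts} attempts") from last_error
```

Passing the transport in as a callable keeps the retry logic testable without a running service; in real use, send would wrap the POST request to the local API.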


Function: chunk(filename: str, binary: bytes = None, callback: callable = None, **kwargs) -> list

Purpose:
Core function of the module that validates the file type, reads file content if not provided, invokes the remote parser (remote_call), and then organizes the parsed resume info into a tokenized, weighted document format for indexing and knowledge base integration.

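Parameters:
filename (str): Path or name of the resume file; its extension determines whether the file is accepted.
binary (bytes, optional): Raw file content. When omitted, the content is read from filename.
callback (callable, optional): Progress reporter invoked as callback(progress, message).
**kwargs: Additional options; the usage examples pass kb_id to identify the target knowledge base.

Returns:
list: A list containing the processed resume document.

Raises:
NotImplementedError: If the file extension is not supported.
Exception: If the remote parser returns no valid result; callback(-1, ...) is invoked first.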

Usage Example:

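def progress_report(progress, message):
    # A negative progress value signals a failure rather than a percentage.
    if progress < 0:
        print(f"Failed: {message}")
    else:
        print(f"{progress * 100:.0f}%: {message}")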

docs = chunk("resume.pdf", callback=progress_report, kb_id=123)

Implementation Details:
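The first two steps named under Purpose (validate the file type, then read the content if it was not supplied) could look roughly like the following. This is a minimal sketch, not the module's actual code: the helper names ensure_supported and load_binary and the regular expression are assumptions, and the extension list is taken from the formats named in the Overview.

```python
import re
from typing import Optional

# Extensions listed in the Overview; the pattern itself is an assumption.
SUPPORTED_EXTENSIONS = re.compile(r"\.(pdf|doc|docx|txt)$", re.IGNORECASE)


def ensure_supported(filename: str) -> None:
    """Raise NotImplementedError for file types the parser does not handle."""
    if not SUPPORTED_EXTENSIONS.search(filename):
        raise NotImplementedError(f"Unsupported resume file type: {filename}")


def load_binary(filename: str, binary: Optional[bytes] = None) -> bytes:
    """Validate the extension, then read the file only if no bytes were given."""
    ensure_supported(filename)
    if binary is None:
        with open(filename, "rb") as handle:
            binary = handle.read()
    return binary
```

Validating before reading means an unsupported file is rejected without touching the disk.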


Important Implementation Details and Algorithms
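The flattening and tokenization steps that chunk performs on the parsed fields can be pictured with the sketch below. It is illustrative only: simple_tokenize is a stand-in for the rag.nlp tokenizer, and the helper names, the "name_kwd" title field, and the rule that only keys ending in "_tks" are tokenized are all assumptions. Only the docnm_kwd and title_tks keys come from the usage examples in this document.

```python
import re


def simple_tokenize(text: str) -> str:
    """Stand-in tokenizer: lowercase word tokens joined by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def flatten_value(value) -> str:
    """Collapse a list-like field into one string; pass scalars through str()."""
    if isinstance(value, (list, tuple)):
        return " ".join(str(item) for item in value if item not in (None, ""))
    return str(value)


def build_doc(filename: str, resume: dict, field_map: dict) -> dict:
    """Turn parsed resume fields into one flat document dict."""
    # Hypothetical choice: use the candidate name as the title, else the file name.
    title = flatten_value(resume.get("name_kwd", filename))
    doc = {"docnm_kwd": filename, "title_tks": simple_tokenize(title)}
    for field in field_map:
        value = resume.get(field)
        if value in (None, "", []):
            continue  # skip fields the parser left empty
        text = flatten_value(value)
        # Assumed convention: "_tks" fields hold tokenized text, others stay verbatim.
        doc[field] = simple_tokenize(text) if field.endswith("_tks") else text
    return doc
```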


Interactions with Other System Components


Module Execution


Mermaid Flowchart

The file primarily defines two functions without classes. However, it interacts with several imported classes and modules.

To reflect the structure and workflow of resume.py, the following flowchart illustrates the main function relationships and flow of data:

flowchart TD
    A["Start: chunk(filename, binary, callback, **kwargs)"]
    A --> B{Validate file extension}
    B -- Invalid --> C[Raise NotImplementedError]
    B -- Valid --> D[Read binary if not provided]
    D --> E["callback(0.2, 'Resume parsing is going on...')"]
    E --> F["remote_call(filename, binary)"]
    F --> G{Parsing result valid?}
    G -- No --> H["callback(-1, 'Failed parsing'), raise Exception"]
    G -- Yes --> I["callback(0.6, 'Done parsing. Chunking...')"]
    I --> J[Build field_map and extract titles]
    J --> K[Tokenize title and content]
    K --> L[Flatten and tokenize fields]
    L --> M[Update KnowledgebaseService with field_map]
    M --> N[Return list with processed resume doc]
    N --> O[End]

    subgraph RC["Remote Call"]
        F --> F1[Send POST request to local API]
        F1 --> F2[Receive JSON response]
        F2 --> F3[Refactor and clean resume data]
        F3 --> F4["step_one.refactor(DataFrame)"]
        F4 --> F5["step_two.parse(DataFrame)"]
        F5 --> F
    end
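
The progress-reporting part of this flow (0.2 before parsing, 0.6 after a successful parse, -1 on failure) can be sketched in a few lines. The function name run_with_progress and the injected parse callable are hypothetical; parse stands in for remote_call so the example runs without the parsing service.

```python
from typing import Callable, Optional


def run_with_progress(filename: str, binary: bytes,
                      parse: Callable[[str, bytes], dict],
                      callback: Optional[Callable[[float, str], None]] = None) -> dict:
    """Report progress around a parse step, mirroring the flowchart's callback calls."""
    # Fall back to a no-op so callers may omit the callback.
    report = callback or (lambda progress, message: None)
    report(0.2, "Resume parsing is going on...")
    resume = parse(filename, binary)
    if not resume:
        # A progress value of -1 marks failure before the exception is raised.
        report(-1, "Failed parsing")
        raise Exception("Resume parsing failed")
    report(0.6, "Done parsing. Chunking...")
    return resume
```

A caller that displays progress should therefore treat a negative value as an error signal rather than a percentage.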

Summary

resume.py is a resume processing utility that integrates remote parsing services and internal NLP processing steps to convert various resume file types into structured, tokenized documents for knowledge base ingestion and further analysis. It handles error cases, supports progress callbacks, and updates system-wide parser configurations to maintain consistency.


Appendix: Example Usage

def progress_callback(progress, message):
    print(f"[{progress*100:.1f}%] {message}")

# Parsing a resume file
docs = chunk("candidate_resume.pdf", callback=progress_callback, kb_id=42)

for doc in docs:
    print(doc["docnm_kwd"], doc["title_tks"])
    # Further processing...
