__init__.py

Overview

This file provides a utility function, `refactor`, that cleans up and normalizes a curriculum vitae (CV) represented as a dictionary. The function removes unnecessary or redundant fields, restructures nested data, and consolidates key information under a standardized "basic" section. It also enriches the CV with attributes derived from the work and education history.

The primary use case is to convert raw CV data, which may come from varied sources or extraction processes, into a consistent format suitable for downstream processing such as storage, display, or analytics within the InfiniFlow system.

Detailed Explanation

Function: `refactor(cv)`

Purpose

Transforms and sanitizes a CV dictionary by:

Removing unwanted metadata fields.
Normalizing nested collection fields (like work experience, education).
Consolidating salary and job-related fields.
Extracting summary information such as earliest work start date, management experience flag, and recent education details.
Adding or updating timestamps and contact information.

Parameters

cv (dict): The input CV data dictionary to be cleaned and normalized. It is expected to contain keys like "basic", "work", "education", and others representing different CV sections.

Returns

dict: The modified and normalized CV dictionary.

Usage Example

from __init__ import refactor

raw_cv = {
    "raw_txt": "Some raw text",
    "basic": {
        "name": "John Doe",
        "basic_salary_month": 5000,
        "photo2": "some_photo_data"
    },
    "work": {
        "1": {
            "start_time": "2018-01-01",
            "management_experience": "Y",
            "annual_salary_from": 60000,
            "position_name": "Manager",
        }
    },
    "education": {
        "1": {
            "start_time": "2014-09-01",
            "school_name": "State University",
            "discipline_name": "Computer Science"
        }
    }
}

clean_cv = refactor(raw_cv)
print(clean_cv["basic"]["salary_month"])  # Output: 5000
print(clean_cv["basic"]["work_start_time"])  # Output: 2018-01-01
print(clean_cv["basic"]["management_experience"])  # Output: Y
print(clean_cv["contact"]["name"])  # Output: John Doe
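
The derived values printed above come from sorting the work and education entries by "start_time". The following is a minimal sketch of that aggregation, not the actual implementation. `summarize_history` is a hypothetical helper, the "N" fallback for the management flag and the "annual_salary" key name are assumptions, and only "position_name" is copied as a stand-in for the job details.

```python
def summarize_history(cv):
    """Hypothetical helper: derive summary fields in cv["basic"]."""
    basic = cv.setdefault("basic", {})

    # ISO-style dates ("YYYY-MM-DD") sort chronologically as plain strings.
    work = sorted((cv.get("work") or {}).values(),
                  key=lambda w: w.get("start_time", ""))
    edu = sorted((cv.get("education") or {}).values(),
                 key=lambda e: e.get("start_time", ""))

    if work:
        # Earliest entry gives the career start date.
        basic["work_start_time"] = work[0].get("start_time", "")
        # Assumption: "N" is used when no entry is flagged "Y".
        has_mgmt = any(w.get("management_experience") == "Y" for w in work)
        basic["management_experience"] = "Y" if has_mgmt else "N"
        # Assumption: the target key name "annual_salary" is illustrative.
        basic["annual_salary"] = work[-1].get("annual_salary_from")
        basic["position_name"] = work[-1].get("position_name")

    if edu:
        # Latest entry gives the most recent school and discipline.
        for key in ("school_name", "discipline_name"):
            if key in edu[-1]:
                basic[key] = edu[-1][key]
    return cv


cv = {
    "work": {
        "0": {"start_time": "2021-03-01", "position_name": "Director",
              "annual_salary_from": 90000},
        "1": {"start_time": "2018-01-01", "position_name": "Manager",
              "management_experience": "Y"},
    },
    "education": {"0": {"start_time": "2014-09-01",
                        "school_name": "State University"}},
}
summary = summarize_history(cv)["basic"]
print(summary["work_start_time"], summary["position_name"])  # 2018-01-01 Director
```

Note that the start date comes from the earliest entry while the job details come from the latest one, so the two can refer to different positions.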

Implementation Details

Field Removal: The function removes fields like "raw_txt", "parser_name", "inference", "ori_text", "use_time", and "time_stat" if they exist and are not None. These fields are considered extraneous metadata.
is_deleted Flag: Sets "is_deleted" to 0, indicating that the CV is active (not marked for deletion).
Basic Section Normalization: Ensures "basic" exists as a dictionary. Removes the "photo2" field from "basic" if present.
Collection Fields Processing: The fields "education", "work", "certificate", "project", "language", "skill", and "training" are each converted from a dictionary to a list if necessary. The "external" key is removed from every item, and the list is then converted back into a dictionary keyed by stringified integer indices.
Salary Field Renaming: Renames "basic_salary_month" to "salary_month" and "expect_annual_salary_from" to "expect_annual_salary".
Work Experience Aggregation: Sorts "work" items by "start_time". Copies the earliest work start date, a management experience flag, and the annual salary and job details of the most recent position into "basic".
Education Aggregation: Sorts "education" items by "start_time". Copies the most recent school and discipline names into "basic".
Timestamp Update: Adds or updates "updated_at" in "basic" with the current datetime.
Contact Info Fallback: Creates "contact" if missing and ensures "name" is populated from "basic" if absent.
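
The structural steps listed above (everything except the work and education aggregation) can be sketched as follows. This is an illustration under stated assumptions rather than the actual source: `normalize_structure` and `finalize` are hypothetical names, re-indexing is assumed to start at "0", values that are neither dictionaries nor lists are assumed to be dropped, and the timestamp format is a guess.

```python
from datetime import datetime

METADATA_FIELDS = ["raw_txt", "parser_name", "inference",
                   "ori_text", "use_time", "time_stat"]
COLLECTION_FIELDS = ["education", "work", "certificate", "project",
                     "language", "skill", "training"]
RENAMES = {"basic_salary_month": "salary_month",
           "expect_annual_salary_from": "expect_annual_salary"}


def normalize_structure(cv):
    """Hypothetical helper: field removal, collections, salary renames."""
    for name in METADATA_FIELDS:
        if cv.get(name) is not None:
            del cv[name]

    cv["is_deleted"] = 0
    basic = cv.setdefault("basic", {})
    basic.pop("photo2", None)

    for name in COLLECTION_FIELDS:
        items = cv.get(name)
        if items is None:
            continue
        if isinstance(items, dict):
            items = list(items.values())
        if not isinstance(items, list):
            # Assumption: unsupported types are dropped entirely.
            del cv[name]
            continue
        for item in items:
            if isinstance(item, dict):
                item.pop("external", None)
        # Assumption: keys are zero-based stringified indices.
        cv[name] = {str(i): item for i, item in enumerate(items)}

    for old, new in RENAMES.items():
        if old in basic:
            basic[new] = basic.pop(old)
    return cv


def finalize(cv):
    """Hypothetical helper: timestamp update and contact name fallback."""
    basic = cv.setdefault("basic", {})
    # Assumption: the timestamp format is illustrative.
    basic["updated_at"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    contact = cv.setdefault("contact", {})
    if not contact.get("name"):
        contact["name"] = basic.get("name", "")
    return cv


cv = finalize(normalize_structure({
    "raw_txt": "Some raw text",
    "basic": {"name": "John Doe", "basic_salary_month": 5000,
              "photo2": "some_photo_data"},
    "skill": [{"name": "Python", "external": {"id": 7}}, {"name": "SQL"}],
    "work": {"7": {"position_name": "Manager"}},
}))
print(cv["skill"])  # {'0': {'name': 'Python'}, '1': {'name': 'SQL'}}
print(cv["work"])   # {'0': {'position_name': 'Manager'}}
```

Under these assumptions the original dictionary keys of a collection are discarded, so the "work" entry stored under "7" comes back under "0".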

Interaction with Other Parts of the System

Data Normalization Module: This file likely serves as part of a data ingestion or preprocessing pipeline within the InfiniFlow system, providing a consistent CV data format to downstream components.
Storage/Database Layer: The normalized CV dictionaries are probably persisted to a database or indexed for search, relying on the standardized keys and structures produced here.
UI/Display Components: Clean, structured CV data facilitates rendering in user interfaces or report generators.
Analytics/Inference Engines: By removing extraneous fields and normalizing data, this module prepares CVs for reliable feature extraction or machine learning inference.

Mermaid Diagram: Function Workflow

flowchart TD
    A[Input CV dict] --> B[Remove unwanted fields]
    B --> C[Set is_deleted = 0]
    C --> D[Ensure 'basic' dict exists]
    D --> E[Remove 'photo2' from basic if exists]
    E --> F["Process collections (education, work, etc.)"]
    F --> G[Rename salary fields in basic]
    G --> H[Sort and analyze work experience]
    H --> I[Sort and analyze education]
    I --> J[Set updated_at timestamp]
    J --> K[Ensure contact.name exists]
    K --> L[Return cleaned CV dict]

Summary

This __init__.py file contains the CV data normalization logic for InfiniFlow's CV processing pipeline. By cleansing, restructuring, and enriching the CV data, it gives subsequent system modules a consistent input. Re-indexing the nested collections, refreshing the timestamp, and falling back to the "basic" name for the contact keep the output predictable even when the raw data is incomplete.