step_one.py


Overview

step_one.py is a utility module that preprocesses raw resume data held in a DataFrame, where each record is a JSON string in the resume_content field. It parses the nested JSON structure of each resume, extracts the relevant fields, normalizes their values, and produces a flat dictionary of cleaned, human-readable resume attributes.

This preprocessing step standardizes resume data in the InfiniFlow pipeline so that downstream modules (e.g., candidate matching, analytics, or ranking algorithms) can work with structured information instead of raw JSON.


Detailed Explanation

Constants


Imported Modules


Main Function

refactor(df)

Purpose:
Transforms a DataFrame df containing raw resume data by parsing embedded JSON, extracting relevant fields, normalizing values, and returning a dictionary representation of the first resume record in a clean, standardized format.

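Parameters:

  • df (pandas.DataFrame): Raw resume data, one resume per row, with the JSON string in the "resume_content" column.

Returns:
A dictionary keyed by the field names taken from the FIELDS constant, holding the cleaned values of the first row of the processed DataFrame.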


Internal Helper Functions

  1. deal_obj(obj, k, kk)

    Extracts a nested JSON value safely.

    • obj (dict): JSON dictionary.

    • k (str): First-level key.

    • kk (str): Second-level key.

    Returns:
    The value at obj[k][kk] if all keys exist and types are correct; otherwise, returns an empty string "".
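The lookup described above can be sketched as follows. This is a minimal illustration of the stated behavior, assuming plain dict inputs; it is not the module's verbatim source.

```python
def deal_obj(obj, k, kk):
    """Return obj[k][kk], or "" when a level is missing or is not a dict."""
    if not isinstance(obj, dict):
        return ""
    inner = obj.get(k, {})
    if not isinstance(inner, dict):
        return ""
    return inner.get(kk, "")


# Sample data invented for illustration.
resume = {"contact": {"phone": "1234567890"}, "basic": "not-a-dict"}
print(deal_obj(resume, "contact", "phone"))   # existing nested key
print(deal_obj(resume, "contact", "wechat"))  # missing second-level key
print(deal_obj(resume, "basic", "name"))      # first level is not a dict
```

Checking the type at each level means malformed records degrade to an empty string rather than raising an exception partway through a batch.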

  2. loadjson(line)

    Parses a JSON string into a dictionary.

    • line (str): JSON string.

    Returns:
    Parsed dictionary if valid JSON; otherwise, an empty dictionary {}.
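A minimal sketch of such a tolerant parser, built on the standard json module. The choice of exception types is an assumption; the module may catch a broader set.

```python
import json


def loadjson(line):
    """Parse a JSON string; fall back to {} when parsing fails."""
    try:
        return json.loads(line)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError; TypeError covers None
        # and other inputs that are not str, bytes or bytearray.
        return {}


print(loadjson('{"basic": {"name": "张三"}}'))
print(loadjson("not json"))
```

In this sketch, valid JSON that is not an object (a list, for example) is returned unchanged, which is one reason the nested lookup in deal_obj checks types before indexing.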

  3. extract(nms, cc=None)

    Extracts columns from the nested JSON stored in the "obj" column of the DataFrame.

    • nms (list of str): Names of keys to extract.

    • cc (str or None): If provided, extraction uses nested keys obj[cc][c].

    The extracted data is assigned as new DataFrame columns, and the column names are added to the list clms for later use.
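The sketch below shows one way such a helper could work with pandas. The nested() helper, the sample row, and the decision to re-serialize top-level sections as JSON text are assumptions made for illustration, not details confirmed by the module.

```python
import json

import pandas as pd

# One sample row, invented for illustration.
df = pd.DataFrame({"resume_content": [
    '{"contact": {"phone": "1234567890"}, "skill": {"0": "Python"}}'
]})
df["obj"] = df["resume_content"].map(json.loads)

clms = []  # column names collected for the final output


def nested(obj, k, kk):
    """Hypothetical stand-in for deal_obj: obj[k][kk] or ""."""
    inner = obj.get(k) if isinstance(obj, dict) else None
    return inner.get(kk, "") if isinstance(inner, dict) else ""


def extract(nms, cc=None):
    """Copy keys from df["obj"] into new columns and record their names."""
    clms.extend(nms)
    for c in nms:
        if cc:
            # Nested lookup: obj[cc][c], with "" for anything missing.
            df[c] = df["obj"].map(lambda o, c=c: nested(o, cc, c))
        else:
            # Assumption: top-level sections are kept as JSON text.
            df[c] = df["obj"].map(
                lambda o, c=c: json.dumps(o.get(c, {}), ensure_ascii=False))


extract(["skill"])
extract(["phone", "wechat"], "contact")
print(df[["skill", "phone", "wechat"]].iloc[0].to_dict())
```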

  4. arr2str(a)

    Converts a list or other iterable to a space-separated string, or returns an empty string if input is falsy.
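A minimal sketch of this conversion. How non-list inputs are handled is an assumption; here lists and tuples are joined and anything else is passed through str().

```python
def arr2str(a):
    """Join list items with spaces; return "" for None, [] or ""."""
    if not a:
        return ""
    if isinstance(a, (list, tuple)):
        return " ".join(str(item) for item in a)
    # Assumption: other values are already flat and only need stringifying.
    return str(a)


print(arr2str(["Internet", "Finance"]))
print(arr2str([1, 2]))
```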


Workflow and Implementation Details

  1. Parsing JSON Content:

    The "resume_content" string in each row is parsed into a dictionary and stored in a new "obj" column.

  2. Handling Missing Values:

    The DataFrame is filled with empty strings for any missing data.

  3. Field Extraction:

    Using the extract helper, fields are pulled out from various nested structures within "obj":

    • Top-level fields like "education", "work", "certificate", "project", "language", "skill".

    • Contact info under "contact" key: "wechat", "phone", "is_deleted", etc.

    • Basic info under "basic" key: "nation", "expect_industry_name", "salary_month", "industry_ids", and many others.

  4. Normalization:

    • Degree codes are translated to names using degrees.get_name.

    • Addresses are converted to full region names using regions.get_names.

    • Industry IDs are expanded to names using industries.get_names.

    • Gender codes 'M' and 'F' are mapped to "男" and "女".

    • Boolean-like flags with values 'Y' or 'N' are mapped to "是" or "否".

    • Some fields are duplicated with more descriptive names (e.g., "management_experience" → "is_management_experience").
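The code-to-label mappings in this step can be sketched with plain dictionaries. The function names are hypothetical, and two behaviors are assumptions: unknown codes become an empty string, and the duplicated field carries the mapped label.

```python
GENDER_LABELS = {"M": "男", "F": "女"}  # male / female
FLAG_LABELS = {"Y": "是", "N": "否"}    # yes / no


def normalize_gender(code):
    """Map a gender code to its label (assumption: unknown codes give "")."""
    return GENDER_LABELS.get(code, "")


def normalize_flag(code):
    """Map a 'Y'/'N' flag to its label (assumption: unknown codes give "")."""
    return FLAG_LABELS.get(code, "")


# Sample record invented for illustration.
record = {"gender": "M", "management_experience": "Y"}
record["gender"] = normalize_gender(record["gender"])
# Duplicate the flag under a more descriptive name.
record["is_management_experience"] = normalize_flag(
    record["management_experience"])
print(record)
```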

  5. Cleanup:

    • Occurrences of the literal string "None" are replaced.

    • Certain columns are removed from the final output list.

    • Tabs and newline characters in values are replaced with spaces or escape sequences.

    • If "phone" is empty but "tel" exists, "phone" is replaced by "tel".
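Two of these cleanup rules are sketched below. The function names are hypothetical, and the exact character mapping (tabs to spaces, line breaks to a literal backslash-n) is an assumption consistent with the description above.

```python
def clean_value(value):
    """Flatten a value to a single line of text."""
    text = str(value).replace("\t", " ")
    # Normalize every line-break style to the two-character sequence \n.
    return text.replace("\r\n", "\\n").replace("\n", "\\n").replace("\r", "\\n")


def fill_phone(record):
    """Use "tel" as "phone" when "phone" is empty."""
    if not record.get("phone") and record.get("tel"):
        record["phone"] = record["tel"]
    return record


print(clean_value("Python\tGo\nRust"))
print(fill_phone({"phone": "", "tel": "010-1234"}))
```

Keeping every value on one line matters if the output is later written to a tab- or line-delimited format, where a stray tab or line break would shift columns.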

  6. Reindexing and Ordering:

    The DataFrame is reordered according to the sorted set of collected column names (clms).

  7. Returning Result:

    Finally, the function returns a dictionary with keys extracted from the FIELDS constant and values from the first row of the processed DataFrame.
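The final pairing of keys and values can be sketched as follows. The FIELDS entries shown are hypothetical, as is the assumption that each entry is a "name type" string whose first token is the key.

```python
# Hypothetical FIELDS entries; the real constant is defined in the module.
FIELDS = ["address STRING", "gender STRING", "phone STRING"]

# Values of the first processed row, in sorted column order.
first_row = ["", "男", "1234567890"]

keys = [entry.split()[0] for entry in FIELDS]
result = dict(zip(keys, first_row))
print(result)
```

Because zip pairs items by position, this approach relies on the column order of the DataFrame matching the order of FIELDS, which is what sorting the columns in the previous step would provide if FIELDS is itself sorted.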


Usage Example

import pandas as pd
from step_one import refactor

# Example raw DataFrame with resume_content column
df = pd.DataFrame({
    "resume_content": [
        '{"basic": {"name": "张三", "gender": "M", "degree": "bachelor", "industry_ids": "1,2"}, "contact": {"phone": "1234567890"}}'
    ],
    "tob_resume_id": ["abc123"],
    "updated_at": ["2025-06-01 12:00:00"]
})

result = refactor(df)
print(result)

Output (abridged; the degree and industry names shown are illustrative and depend on the entity mappings):

{
    "address": "",
    "annual_salary": "",
    "annual_salary_from": "",
    ...
    "gender": "男",
    "degree": "Bachelor's Degree",
    "industry_names": "Industry1 Industry2",
    ...
    "phone": "1234567890",
    ...
}

Interaction with Other System Components


Important Implementation Notes


Mermaid Diagram: Class/Function Structure

flowchart TD
    A["refactor(df)"]
    A --> B["deal_obj(obj, k, kk)"]
    A --> C["loadjson(line)"]
    A --> D["extract(nms, cc=None)"]
    A --> E["arr2str(a)"]

    B -->|Helper| F[Safe nested dict extraction]
    C -->|Helper| G[Parse JSON string]
    D -->|Helper| H[Extract and assign DataFrame columns]
    E -->|Helper| I[Convert list to space-separated string]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    classDef helper fill:#bbf,stroke:#333,stroke-width:1px
    class B,C,D,E helper

Summary

step_one.py is a core preprocessing utility that refactors raw JSON resume data in a DataFrame into a structured, cleaned dictionary. It handles nested JSON parsing, field extraction, value normalization, and data cleaning, preparing resume data for further processing in the InfiniFlow system. The module relies on external entity mappings to convert coded fields into readable names and enforces consistent formatting and localization conventions.