step_one.py

Overview

step_one.py is a utility module that preprocesses raw resume data held in a DataFrame, where each resume is stored as a JSON string in the resume_content field. It parses the nested JSON structures, extracts the relevant fields, normalizes their values, and produces a flat dictionary of cleaned, human-readable resume attributes.

This preprocessing step is crucial in the pipeline of the InfiniFlow system for standardizing resume data, thereby enabling downstream modules (e.g., candidate matching, analytics, or ranking algorithms) to operate efficiently on structured information.

Detailed Explanation

Constants

FIELDS
A list of string definitions representing the expected fields and their corresponding types after the refactor process. This list defines the schema of the output data dictionary.

Imported Modules

json — Used for parsing JSON strings embedded in resume content.
From deepdoc.parser.resume.entities the following are imported:
- degrees — Presumably a mapping utility to convert degree codes to names.
- regions — Provides region name resolution.
- industries — Provides industry name resolution.

Main Function

`refactor(df)`

Purpose:
Transforms a DataFrame df containing raw resume data by parsing embedded JSON, extracting relevant fields, normalizing values, and returning a dictionary representation of the first resume record in a clean, standardized format.

Parameters:

df (pandas.DataFrame): Input DataFrame with at least a column "resume_content" containing raw JSON strings of resumes.

Returns:

dict: A dictionary mapping field names (defined in FIELDS) to their cleaned and processed values for the first row in the DataFrame.

Internal Helper Functions

`deal_obj(obj, k, kk)`
Extracts a nested JSON value safely.
- obj (dict): JSON dictionary.
- k (str): First-level key.
- kk (str): Second-level key.
Returns:
The value at obj[k][kk] if all keys exist and types are correct; otherwise, returns an empty string "".
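
The behaviour described above can be sketched as follows. This is a minimal standalone version written for illustration, not necessarily the module's exact code; the sample resume dictionary is made up.

```python
def deal_obj(obj, k, kk):
    """Return obj[k][kk], or "" if either level is missing or not a dict."""
    if not isinstance(obj, dict):
        return ""
    inner = obj.get(k, {})
    if not isinstance(inner, dict):
        return ""
    return inner.get(kk, "")


# Hypothetical sample data: "basic" is deliberately malformed.
resume = {"contact": {"phone": "1234567890"}, "basic": "not-a-dict"}

found = deal_obj(resume, "contact", "phone")    # key path exists
missing = deal_obj(resume, "contact", "email")  # second-level key absent
wrong_type = deal_obj(resume, "basic", "name")  # first level is not a dict
no_object = deal_obj(None, "contact", "phone")  # input is not a dict at all
```

Returning "" instead of raising keeps a single malformed resume from aborting the whole extraction.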

`loadjson(line)`
Parses a JSON string into a dictionary.
- line (str): JSON string.
Returns:
Parsed dictionary if valid JSON; otherwise, an empty dictionary {}.
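
A tolerant parser of this kind might look like the sketch below. Which exceptions the module actually catches is an assumption here.

```python
import json


def loadjson(line):
    """Parse a JSON string; fall back to {} when parsing fails."""
    try:
        return json.loads(line)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError;
        # TypeError covers None and other non-string input.
        return {}


parsed = loadjson('{"basic": {"name": "张三"}}')
broken = loadjson("{not valid json")
empty = loadjson(None)
```

Note that json.loads returns whatever the JSON encodes, so a valid JSON array would come back as a list rather than a dictionary.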

`extract(nms, cc=None)`
Extracts values from the nested JSON stored in the "obj" column of the DataFrame.
- nms (list of str): Names of the keys to extract.
- cc (str or None): Optional parent key. If provided, each name c in nms is read from obj[cc][c]; otherwise it is read from the top level of obj.
Each extracted value is assigned to a new DataFrame column named after its key, and the names are appended to the list clms for later use.
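
The lookup rule can be illustrated without pandas by working on a list of already-parsed objects. The name extract_values and its list-based interface are assumptions made for this sketch; the module itself writes the results into DataFrame columns.

```python
def extract_values(objs, names, parent=None):
    """Collect one list of values per name from a list of parsed resumes.

    With parent=None each name is read from the top level of the object;
    otherwise it is read from the nested dict stored under parent.
    Missing or malformed data yields "".
    """
    columns = {}
    for name in names:
        values = []
        for obj in objs:
            source = obj if isinstance(obj, dict) else {}
            if parent is not None:
                source = source.get(parent, {})
                if not isinstance(source, dict):
                    source = {}
            values.append(source.get(name, ""))
        columns[name] = values
    return columns


# Hypothetical parsed resumes; the second one has a malformed "contact".
objs = [
    {"skill": "python", "contact": {"phone": "1234567890"}},
    {"contact": "malformed"},
]
top = extract_values(objs, ["skill"])
nested = extract_values(objs, ["phone", "wechat"], parent="contact")
```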

`arr2str(a)`
Converts a list or other iterable to a space-separated string, or returns an empty string if input is falsy.
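
A minimal sketch of such a helper is shown below. Passing plain strings through unchanged is an assumption of this sketch, since the description above does not say how strings are treated.

```python
def arr2str(a):
    """Join the items of an iterable with spaces; falsy input gives ""."""
    if not a:
        return ""
    if isinstance(a, str):
        # Assumption: a plain string is returned as-is rather than
        # being split into characters.
        return a
    return " ".join(str(item) for item in a)


joined = arr2str(["python", "sql"])
numbers = arr2str((1, 2, 3))
nothing = arr2str(None)
```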

Workflow and Implementation Details

Parsing JSON Content:
The "resume_content" string in each row is parsed into a dictionary and stored in a new "obj" column.

Handling Missing Values:
The DataFrame is filled with empty strings for any missing data.

Field Extraction:
Using the extract helper, fields are pulled out from various nested structures within "obj":
- Top-level fields like "education", "work", "certificate", "project", "language", "skill".
- Contact info under "contact" key: "wechat", "phone", "is_deleted", etc.
- Basic info under "basic" key: "nation", "expect_industry_name", "salary_month", "industry_ids", and many others.

Normalization:
- Degree codes are translated to names using degrees.get_name.
- Addresses are converted to full region names using regions.get_names.
- Industry IDs are expanded to names using industries.get_names.
- Gender codes 'M' and 'F' are mapped to "男" and "女".
- Boolean-like flags with values 'Y' or 'N' are mapped to "是" or "否".
- Some fields are duplicated with more descriptive names (e.g., "management_experience" → "is_management_experience").
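
The gender and flag mappings listed above can be sketched with two lookup tables. Mapping unrecognised codes to an empty string is an assumption of this sketch, and the helper names are hypothetical.

```python
GENDER_LABELS = {"M": "男", "F": "女"}
FLAG_LABELS = {"Y": "是", "N": "否"}


def normalize_gender(code):
    """Map 'M'/'F' to "男"/"女"; anything else becomes "" (assumed)."""
    return GENDER_LABELS.get(code, "")


def normalize_flag(code):
    """Map 'Y'/'N' to "是"/"否"; anything else becomes "" (assumed)."""
    return FLAG_LABELS.get(code, "")


# Hypothetical record using field names mentioned in the list above.
record = {"gender": "M", "management_experience": "Y"}
record["gender"] = normalize_gender(record["gender"])
# The flag is stored under a second, more descriptive name.
record["is_management_experience"] = normalize_flag(
    record["management_experience"]
)
```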

Cleanup:
- Occurrences of the string "None" in extracted values are replaced with empty strings.
- Certain intermediate columns are removed from the final output list (clms).
- Tabs and newline characters in values are replaced with spaces or escape sequences.
- If "phone" is empty but "tel" has a value, "phone" takes the value of "tel".
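
Two of the cleanup rules can be sketched as small helpers. The exact replacements are assumptions: here tabs become spaces, and newlines and carriage returns become the two-character sequence \n. The function names are hypothetical.

```python
def clean_value(value):
    """Make a value safe for single-line, tab-separated storage."""
    text = str(value)
    # Tabs become spaces; line breaks become a literal backslash + "n".
    return text.replace("\t", " ").replace("\r", "\\n").replace("\n", "\\n")


def fill_phone(record):
    """Use "tel" as the phone number when "phone" is blank."""
    phone = str(record.get("phone", "")).strip()
    tel = str(record.get("tel", "")).strip()
    if not phone and tel:
        record["phone"] = tel
    return record


cleaned = clean_value("line one\nline\ttwo")
patched = fill_phone({"phone": " ", "tel": "5550100"})
kept = fill_phone({"phone": "1234567890", "tel": "5550100"})
```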

Reindexing and Ordering:
The DataFrame is reordered according to the sorted set of collected column names (clms).

Returning Result:
Finally, the function returns a dictionary with keys extracted from the FIELDS constant and values from the first row of the processed DataFrame.
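
The last two steps can be sketched in plain Python. This sketch assumes that each schema entry is a "name type" string and that names are paired with values by position; both are assumptions, and the sample entries are made up.

```python
def first_row_as_dict(field_defs, row):
    """Pair schema field names with the values of one processed row.

    field_defs: "name type" strings (layout assumed for illustration).
    row: mapping of column name to value for the first resume.
    """
    names = [definition.split()[0] for definition in field_defs]
    # Values are taken in sorted column order, mirroring the reindex step.
    ordered_values = [row[column] for column in sorted(row)]
    return dict(zip(names, ordered_values))


# Hypothetical schema entries and processed row.
field_defs = ["address string", "gender string", "phone string"]
row = {"phone": "1234567890", "gender": "男", "address": ""}
result = first_row_as_dict(field_defs, row)
```

Under positional pairing, the schema has to list its fields in the same sorted order as the columns, otherwise values end up under the wrong keys.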

Usage Example

import pandas as pd
from step_one import refactor

# Example raw DataFrame with resume_content column
df = pd.DataFrame({
    "resume_content": [
        '{"basic": {"name": "张三", "gender": "M", "degree": "bachelor", "industry_ids": "1,2"}, "contact": {"phone": "1234567890"}}'
    ],
    "tob_resume_id": ["abc123"],
    "updated_at": ["2025-06-01 12:00:00"]
})

result = refactor(df)
print(result)

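Output (illustrative; the actual degree and industry names depend on the mappings in deepdoc.parser.resume.entities):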

{
    "address": "",
    "annual_salary": "",
    "annual_salary_from": "",
    ...
    "gender": "男",
    "degree": "Bachelor's Degree",
    "industry_names": "Industry1 Industry2",
    ...
    "phone": "1234567890",
    ...
}

Interaction with Other System Components

Input Source:
Likely receives raw resume data extracted from a data ingestion pipeline or database, where each resume is stored as a JSON string.
Dependency on Entities Module:
Utilizes degrees, regions, and industries from the deepdoc.parser.resume.entities package to translate codes and IDs into human-readable names.
Downstream Usage:
The cleaned dictionary output is probably consumed by other modules in InfiniFlow for resume analysis, candidate ranking, or storage in a structured database.

Important Implementation Notes

The function is designed to process only the first row of the DataFrame and returns a dictionary for that row.
Nested JSON extraction is robust against missing or malformed data, defaulting to empty strings.
Uses json.dumps with ensure_ascii=False to preserve non-ASCII characters during extraction.
The mapping of certain fields (e.g., gender, boolean flags) to Chinese characters ("男", "女", "是", "否") indicates localization and domain-specific data conventions.
The function cleans control characters (tabs, newlines) to ensure output is safe for tabular storage or further text processing.
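
The effect of ensure_ascii=False mentioned above can be seen in a short comparison. The sample dictionary is made up for illustration.

```python
import json

# Hypothetical nested value of the kind found in a resume.
education = {"school": "清华大学", "degree": "本科"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(education)

# With ensure_ascii=False the characters are kept as-is.
readable = json.dumps(education, ensure_ascii=False)
```

Both strings decode back to the same dictionary; the difference only matters when the serialized text is read by people or matched against other text.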

Mermaid Diagram: Class/Function Structure

flowchart TD
    A["refactor(df)"]
    A --> B["deal_obj(obj, k, kk)"]
    A --> C["loadjson(line)"]
    A --> D["extract(nms, cc=None)"]
    A --> E["arr2str(a)"]

    B -->|Helper| F[Safe nested dict extraction]
    C -->|Helper| G[Parse JSON string]
    D -->|Helper| H[Extract and assign DataFrame columns]
    E -->|Helper| I[Convert list to space-separated string]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    classDef helper fill:#bbf,stroke:#333,stroke-width:1px
    class B,C,D,E helper

Summary

step_one.py is a core preprocessing utility that refactors raw JSON resume data in a DataFrame into a structured, cleaned dictionary. It handles nested JSON parsing, field extraction, value normalization, and data cleaning, preparing resume data for further processing in the InfiniFlow system. The module relies on external entity mappings to convert coded fields into readable names and enforces consistent formatting and localization conventions.