step_one.py
Overview
step_one.py is a utility module designed to preprocess and refactor raw resume data contained in a DataFrame, typically extracted from JSON strings in the resume_content field. The main functionality revolves around parsing nested JSON structures inside resumes, extracting relevant fields, normalizing values, and preparing a flat dictionary of cleaned and human-readable resume attributes.
This preprocessing step is crucial in the pipeline of the InfiniFlow system for standardizing resume data, thereby enabling downstream modules (e.g., candidate matching, analytics, or ranking algorithms) to operate efficiently on structured information.
Detailed Explanation
Constants
FIELDS
A list of string definitions representing the expected fields and their corresponding types after the refactor process. This list defines the schema of the output data dictionary.
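The exact entries of FIELDS are not reproduced in this document. As a rough sketch only, assuming each entry is a "name type" string (an assumption based on the description above, not on the source code), a schema could be split into names and types as follows. The list contents and the parse_fields helper are hypothetical.

```python
# Hypothetical excerpt: the real FIELDS list and its entry format may differ.
FIELDS = ["address STRING", "annual_salary int", "gender STRING", "phone STRING"]


def parse_fields(fields):
    """Split each "name type" definition into a (name, type) pair."""
    schema = []
    for definition in fields:
        # partition splits on the first space only; the type part may be empty.
        name, _, type_name = definition.partition(" ")
        schema.append((name, type_name.strip()))
    return schema


schema = parse_fields(FIELDS)
field_names = [name for name, _ in schema]
```

Under this assumption, field_names would supply the keys of the dictionary that refactor returns.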
Imported Modules
json: Used for parsing JSON strings embedded in resume content.
From deepdoc.parser.resume.entities, the following are imported:
degrees: Presumably a mapping utility that converts degree codes to names.
regions: Provides region name resolution.
industries: Provides industry name resolution.
Main Function
refactor(df)
Purpose:
Transforms a DataFrame df containing raw resume data by parsing embedded JSON, extracting relevant fields, normalizing values, and returning a dictionary representation of the first resume record in a clean, standardized format.
Parameters:
df (pandas.DataFrame): Input DataFrame with at least a "resume_content" column containing raw JSON strings of resumes.
Returns:
dict: A dictionary mapping field names (defined in FIELDS) to their cleaned and processed values for the first row in the DataFrame.
Internal Helper Functions
deal_obj(obj, k, kk)
Extracts a nested JSON value safely.
obj (dict): JSON dictionary.
k (str): First-level key.
kk (str): Second-level key.
Returns:
The value at obj[k][kk] if all keys exist and types are correct; otherwise, an empty string "".
loadjson(line)
Parses a JSON string into a dictionary.
line (str): JSON string.
Returns:
The parsed dictionary if the input is valid JSON; otherwise, an empty dictionary {}.
extract(nms, cc=None)
Extracts columns from the nested JSON stored in the "obj" column of the DataFrame.
nms (list of str): Names of keys to extract.
cc (str or None): If provided, extraction uses the nested keys obj[cc][c] for each name c in nms.
The extracted data is assigned as new DataFrame columns, and the column names are added to the list clms for later use.
arr2str(a)
Converts a list or other iterable to a space-separated string, or returns an empty string if the input is falsy.
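The helper descriptions above can be approximated in plain Python. This is a minimal sketch, not the module's actual code. It assumes that non-dict JSON values are treated like invalid input, that a plain string passed to arr2str is returned unchanged, and, for extract, that rows are ordinary dicts passed in explicitly (the real helper works on DataFrame columns and takes only nms and cc).

```python
import json


def deal_obj(obj, k, kk):
    """Return obj[k][kk], or "" when either level is missing or not a dict."""
    if not isinstance(obj, dict):
        return ""
    inner = obj.get(k)
    if not isinstance(inner, dict):
        return ""
    return inner.get(kk, "")


def loadjson(line):
    """Parse a JSON string; fall back to {} on invalid input."""
    try:
        parsed = json.loads(line)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return {}
    # Assumption: only JSON objects are useful here, so lists etc. become {}.
    return parsed if isinstance(parsed, dict) else {}


def arr2str(a):
    """Join the items of an iterable with spaces; falsy input becomes ""."""
    if not a:
        return ""
    if isinstance(a, str):  # assumption: strings pass through unchanged
        return a
    return " ".join(str(item) for item in a)


def extract(rows, nms, cc=None):
    """Copy keys from each row's "obj" dict (or obj[cc]) into the row itself.

    rows is a list of dicts standing in for DataFrame rows (hypothetical
    parameter). Returns the extracted names, mirroring how clms is extended.
    """
    for row in rows:
        obj = row.get("obj")
        for c in nms:
            if cc is None:
                row[c] = obj.get(c, "") if isinstance(obj, dict) else ""
            else:
                row[c] = deal_obj(obj, cc, c)
    return list(nms)


rows = [{"obj": loadjson('{"skill": "Python", "contact": {"phone": "1234567890"}}')}]
clms = extract(rows, ["skill", "education"])
clms += extract(rows, ["phone", "wechat"], cc="contact")
```

Missing keys end up as empty strings rather than raising, which matches the defaulting behaviour described for deal_obj.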
Workflow and Implementation Details
Parsing JSON Content:
The "resume_content" string in each row is parsed into a dictionary and stored in a new "obj" column.
Handling Missing Values:
Missing values in the DataFrame are filled with empty strings.
Field Extraction:
Using the extract helper, fields are pulled out from various nested structures within "obj":
Top-level fields like "education", "work", "certificate", "project", "language", "skill".
Contact info under the "contact" key: "wechat", "phone", "is_deleted", etc.
Basic info under the "basic" key: "nation", "expect_industry_name", "salary_month", "industry_ids", and many others.
Normalization:
Degree codes are translated to names using degrees.get_name.
Addresses are converted to full region names using regions.get_names.
Industry IDs are expanded to names using industries.get_names.
Gender codes 'M' and 'F' are mapped to "男" and "女".
Boolean-like flags with values 'Y' or 'N' are mapped to "是" or "否".
Some fields are duplicated with more descriptive names (e.g., "management_experience" → "is_management_experience").
Cleanup:
Replace occurrences of the string None.
Remove certain columns from the final output list.
Clean values by replacing tabs and newline characters with spaces or escape sequences.
If "phone" is empty but "tel" exists, "phone" is replaced by "tel".
Reindexing and Ordering:
The DataFrame is reordered according to the sorted set of collected column names (clms).
Returning Result:
Finally, the function returns a dictionary with keys extracted from the FIELDS constant and values from the first row of the processed DataFrame.
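The normalization, cleanup, and result-building steps above can be illustrated on a single plain dict. This is a sketch under stated assumptions rather than the module's implementation: finalize, clean_value, and the flag_fields parameter are hypothetical names; the choice to turn tabs into spaces and newlines into a literal "\n" is assumed; and unknown gender or flag codes are left unchanged. The entity lookups (degrees, regions, industries) are omitted because their behaviour is not documented here.

```python
GENDER_LABELS = {"M": "男", "F": "女"}  # male / female
FLAG_LABELS = {"Y": "是", "N": "否"}    # yes / no


def clean_value(value):
    """Blank out None and "None", then make tabs and newlines single-line safe."""
    if value is None or value == "None":
        return ""
    # Assumption: tabs become spaces, newlines become the two characters \n.
    return str(value).replace("\t", " ").replace("\n", "\\n")


def finalize(record, field_names, flag_fields=()):
    """Normalize, clean, and project one extracted record onto field_names."""
    out = dict(record)
    # Duplicate the field under its more descriptive name.
    if "management_experience" in out:
        out["is_management_experience"] = out["management_experience"]
    # Map codes to localized labels; unknown codes pass through unchanged.
    gender = out.get("gender", "")
    out["gender"] = GENDER_LABELS.get(gender, gender)
    for name in flag_fields:
        flag = out.get(name, "")
        out[name] = FLAG_LABELS.get(flag, flag)
    # Fall back to "tel" when "phone" is empty.
    if not out.get("phone") and out.get("tel"):
        out["phone"] = out["tel"]
    # Keep only the schema fields, defaulting missing ones to "".
    return {name: clean_value(out.get(name, "")) for name in field_names}


record = {
    "address": "None",
    "gender": "M",
    "management_experience": "Y",
    "phone": "",
    "tel": "1234567890",
}
result = finalize(
    record,
    ["address", "gender", "is_management_experience", "phone"],
    flag_fields=("is_management_experience",),
)
```

Here result holds an empty address, "男" for gender, "是" for is_management_experience, and the tel value as phone.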
Usage Example
import pandas as pd
from step_one import refactor

# Example raw DataFrame with resume_content column
df = pd.DataFrame({
    "resume_content": [
        '{"basic": {"name": "张三", "gender": "M", "degree": "bachelor", "industry_ids": "1,2"}, "contact": {"phone": "1234567890"}}'
    ],
    "tob_resume_id": ["abc123"],
    "updated_at": ["2025-06-01 12:00:00"]
})

result = refactor(df)
print(result)
Output (illustrative; the actual degree and industry names depend on the entity mappings):
{
"address": "",
"annual_salary": "",
"annual_salary_from": "",
...
"gender": "男",
"degree": "Bachelor's Degree",
"industry_names": "Industry1 Industry2",
...
"phone": "1234567890",
...
}
Interaction with Other System Components
Input Source:
Likely receives raw resume data extracted from a data ingestion pipeline or database, where each resume is stored as a JSON string.
Dependency on Entities Module:
Utilizes degrees, regions, and industries from the deepdoc.parser.resume.entities package to translate codes and IDs into human-readable names.
Downstream Usage:
The cleaned dictionary output is probably consumed by other modules in InfiniFlow for resume analysis, candidate ranking, or storage in a structured database.
Important Implementation Notes
The function is designed to process only the first row of the DataFrame and returns a dictionary for that row.
Nested JSON extraction is robust against missing or malformed data, defaulting to empty strings.
Uses json.dumps with ensure_ascii=False to preserve non-ASCII characters during extraction.
The mapping of certain fields (e.g., gender, boolean flags) to Chinese characters ("男", "女", "是", "否") indicates localization and domain-specific data conventions.
The function cleans control characters (tabs, newlines) to ensure output is safe for tabular storage or further text processing.
Mermaid Diagram: Class/Function Structure
flowchart TD
A["refactor(df)"]
A --> B["deal_obj(obj, k, kk)"]
A --> C["loadjson(line)"]
A --> D["extract(nms, cc=None)"]
A --> E["arr2str(a)"]
B -->|Helper| F[Safe nested dict extraction]
C -->|Helper| G[Parse JSON string]
D -->|Helper| H[Extract and assign DataFrame columns]
E -->|Helper| I[Convert list to space-separated string]
style A fill:#f9f,stroke:#333,stroke-width:2px
classDef helper fill:#bbf,stroke:#333,stroke-width:1px
class B,C,D,E helper
Summary
step_one.py is a core preprocessing utility that refactors raw JSON resume data in a DataFrame into a structured, cleaned dictionary. It handles nested JSON parsing, field extraction, value normalization, and data cleaning, preparing resume data for further processing in the InfiniFlow system. The module relies on external entity mappings to convert coded fields into readable names and enforces consistent formatting and localization conventions.