step_two.py
Overview
step_two.py is a utility module primarily focused on parsing, cleaning, and enriching structured data extracted from resumes (CVs). The module processes various sections of a resume such as education, projects, work experience, and personal information, converting raw input data into tokenized, normalized, and feature-enriched dictionaries to support downstream applications like candidate ranking, matching, or profile analysis in the InfiniFlow system.
Key functionalities include:
Parsing JSON-like strings and handling data integrity.
Tokenizing text fields for better semantic analysis.
Extracting and normalizing education, project, and work experience details.
Enriching resumes with keywords, tags, and calculated features such as work experience duration and school rankings.
Handling timeouts and exceptions robustly during processing.
Utilities for date/time normalization and string cleaning.
The module depends on external components such as deepdoc.parser.resume.entities (for degrees, schools, corporations) and rag.nlp.rag_tokenizer for tokenization, and it uses xpinyin for pinyin conversions of Chinese names.
Classes and Functions
Class: TimeoutException
Custom exception class used to signal a timeout event.
class TimeoutException(Exception):
    pass
Usage: Raised within the time_limit context manager when execution exceeds a specified time limit.
Context Manager: time_limit(seconds)
Sets an alarm signal that interrupts the enclosed block after the given number of seconds, at which point TimeoutException is raised.
Parameters:
seconds (int): Number of seconds before timeout.
Usage Example:
try:
    with time_limit(5):
        # Code block that must complete within 5 seconds
        long_running_function()
except TimeoutException:
    print("Function timed out!")
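A context manager of this kind is commonly built on signal.alarm. The following is a minimal sketch under that assumption, not the module's confirmed implementation; the handler name _raise_timeout is hypothetical, and the technique works only on Unix-like systems and only in the main thread.

```python
import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    pass


@contextmanager
def time_limit(seconds):
    # Hypothetical handler: turn SIGALRM into a TimeoutException.
    def _raise_timeout(signum, frame):
        raise TimeoutException(f"Timed out after {seconds} seconds")

    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(seconds)  # schedule the alarm
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

Cancelling the alarm in the finally clause keeps a late signal from firing after the block has already finished or failed for another reason.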
Function: rmHtmlTag(line)
Removes HTML tags from a given string by replacing them with spaces.
Parameters:
line (str): The input string potentially containing HTML tags.
Returns:
str: The cleaned string without HTML tags.
Implementation details:
Uses a regex to match HTML tags and replaces them with spaces.
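The tag replacement can be sketched as follows. The pattern here is an assumption chosen for illustration (any run of characters between angle brackets); the module's actual regex may be stricter, and the name rm_html_tag is a stand-in for rmHtmlTag.

```python
import re

# Assumed pattern: anything between "<" and ">" with no nested angle brackets.
_TAG_RE = re.compile(r"<[^<>]+>")


def rm_html_tag(line: str) -> str:
    """Replace each HTML tag with a single space."""
    return _TAG_RE.sub(" ", line)


cleaned = rm_html_tag("<p>Senior <b>Engineer</b></p>")
print(cleaned.split())
```

Replacing tags with spaces rather than deleting them keeps words on either side of a tag from being fused together.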
Function: highest_degree(dg)
Determines the highest academic degree from a list or single string of degree names.
Parameters:
dg (list or str): Degree(s) to evaluate.
Returns:
str: The highest degree based on a predefined hierarchy.
Degree ranking:
| Degree | Rank |
|---|---|
| 初中 (Middle school) | 0 |
| 高中 (High school) | 1 |
| 中专 (Technical secondary) | 2 |
| 大专 (Junior college) | 3 |
| 专升本 (Specialized to bachelor) | 4 |
| 本科 (Bachelor) | 5 |
| 硕士 (Master) | 6 |
| 博士 (Doctorate) | 7 |
| 博士后 (Postdoc) | 8 |
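The ranking table translates directly into a lookup. The sketch below is one way to write it; the handling of empty input and of unknown degree names (ranked below every known degree) is an assumption, not documented behaviour.

```python
DEGREE_RANK = {
    "初中": 0, "高中": 1, "中专": 2, "大专": 3, "专升本": 4,
    "本科": 5, "硕士": 6, "博士": 7, "博士后": 8,
}


def highest_degree(dg):
    """Return the highest-ranked degree from a string or a list of strings."""
    if not dg:
        return ""  # assumed behaviour for empty input
    if isinstance(dg, str):
        dg = [dg]
    # Assumption: unknown degree names rank below every known one.
    return max(dg, key=lambda d: DEGREE_RANK.get(d, -1))


print(highest_degree(["本科", "硕士", "大专"]))
```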
Function: forEdu(cv)
Processes the "education_obj" field in the resume dictionary to extract features such as school rankings, degrees, majors, and study periods.
Parameters:
cv (dict): Resume data dictionary containing an "education_obj" key.
Returns:
dict: Updated cv with additional keys related to education features.
Key features computed:
School rank integer and rank-based tags (e.g., "顶尖学校" (top school), "精英学校" (elite school)).
Degree keywords and highest degree.
Tokenization of school and major names.
Integrity score adjustment if education data is missing.
Work experience filter based on education end date.
Usage:
Called internally in parse() to enrich education-related data.
Function: forProj(cv)
Processes the "project_obj" field to tokenize and extract project names and descriptions.
Parameters:
cv (dict): Resume data dictionary containing a "project_obj" key.
Returns:
dict: Updated cv with tokenized project name and description fields.
Usage:
Called internally in parse().
Function: json_loads(line)
Parses a JSON-like string into a Python object, converting True/False literals to strings to avoid decoding errors.
Parameters:
line (str): JSON-like string.
Returns:
object: Decoded Python object (usually dict or list).
Implementation details:
Uses demjson3.decode after regex replacement to handle boolean literals.
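The quote-then-decode idea can be sketched with the standard library. Both the regex and the use of json.loads are assumptions made to keep the example dependency-free: the module itself calls demjson3.decode, and its actual pattern is not shown in this document.

```python
import json
import re

# Hypothetical pattern: a bare True/False value followed by "," or "}".
_BOOL_RE = re.compile(r":\s*(True|False)\s*([,}])")


def json_loads_sketch(line: str):
    """Quote Python-style booleans, then decode the string."""
    quoted = _BOOL_RE.sub(r': "\1"\2', line)
    return json.loads(quoted)  # stand-in for demjson3.decode


print(json_loads_sketch('{"a": True, "b": 1, "c": False}'))
```

Without the substitution, a strict JSON decoder rejects the capitalized literals, since JSON only defines lowercase true and false.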
Function: forWork(cv)
Processes the "work_obj" field to extract detailed work experience features, including company rankings, position names, responsibilities, and subordinate counts, and calculates work duration metrics.
Parameters:
cv (dict): Resume data dictionary containing a "work_obj" key.
Returns:
dict: Updated cv with enriched work-related features.
Key features computed:
Tags for "好公司" (good company) if companies are recognized as high-quality.
Tokenized fields for position names, industries, corporation names, and responsibilities.
Work experience duration calculations.
Maximum subordinate count.
Work start and end dates normalization.
Function: turnTm2Dt(b)
Converts a timestamp or string representing time into a formatted datetime string.
Parameters:
b (str or int): Timestamp or datetime string.
Returns:
str: Formatted string "YYYY-MM-DD HH:MM:SS" or original string if conversion not applicable.
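A conversion of this shape might look like the sketch below. The rule used to recognize a timestamp (ten or more leading digits, with only the first ten interpreted as epoch seconds) and the use of local time are assumptions; turn_tm_to_dt is a stand-in name for turnTm2Dt.

```python
import re
import time


def turn_tm_to_dt(b):
    """Format epoch-like values as 'YYYY-MM-DD HH:MM:SS'; pass other strings through."""
    if not b:
        return b
    b = str(b).strip()
    # Assumption: 10 or more leading digits denote a Unix timestamp;
    # only the first 10 digits (seconds) are used, so milliseconds are dropped.
    if re.match(r"[0-9]{10,}", b):
        b = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(b[:10])))
    return b


print(turn_tm_to_dt("2020-05"))     # not a timestamp, returned unchanged
print(turn_tm_to_dt(1700000000))    # formatted in the local time zone
```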
Function: getYMD(b)
Extracts the Year, Month, and Day components from a string or timestamp.
Parameters:
b (str): Date/time string or timestamp.
Returns:
tuple: (year, month, day) as strings, with defaults if missing.
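The extraction can be illustrated with a single regex. This is a sketch under stated assumptions: the default values (empty year, "01" for month and day), the zero-padding, and the range checks are all hypothetical, and the sketch handles date strings only, on the assumption that timestamps are converted to a date string beforehand.

```python
import re


def get_ymd(b):
    """Return (year, month, day) as strings from a date-like string."""
    # Assumed defaults: empty year, month "01", day "01".
    y, m, d = "", "01", "01"
    match = re.match(r"(\d{4})(?:\D?(\d{1,2}))?(?:\D?(\d{1,2}))?", b or "")
    if not match:
        return y, m, d
    y = match.group(1)
    # Out-of-range months and days fall back to the defaults.
    if match.group(2) and 1 <= int(match.group(2)) <= 12:
        m = match.group(2).zfill(2)
    if match.group(3) and 1 <= int(match.group(3)) <= 31:
        d = match.group(3).zfill(2)
    return y, m, d


print(get_ymd("2021-03-15"))
print(get_ymd("2019.7"))
```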
Function: birth(cv)
Processes the "birth" field to extract birth date features, calculate age, and generate birthday keywords.
Parameters:
cv (dict): Resume data dictionary containing "birth" field.
Returns:
dict: Updated cv with birth-related features.
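The age calculation can be sketched as a difference of calendar years. That method, the helper name age_from_birth, and its today parameter are assumptions for illustration; this document does not state how the module computes age or which keys it writes.

```python
import datetime
import re


def age_from_birth(birth, today=None):
    """Return age in whole calendar years, or None if no year is found."""
    today = today or datetime.date.today()
    match = re.match(r"(\d{4})", str(birth or "").strip())
    if not match:
        return None
    # Assumption: age is the difference of calendar years, ignoring month and day.
    return today.year - int(match.group(1))


print(age_from_birth("1990-05-01", today=datetime.date(2024, 1, 1)))
```

Passing today explicitly makes the function deterministic, which is convenient when testing.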
Function: parse(cv)
The core function that cleans, normalizes, tokenizes, and enriches the entire resume dictionary.
Parameters:
cv (dict): Raw resume data dictionary.
Returns:
dict: Processed resume dictionary containing only feature-rich fields with keys ending in _fea, _tks, _nst, _dt, _int, _flt, _ltks, _kwd, or _id.
Main processing steps:
Remove null or empty fields.
Calculate an integrity filter score based on presence of key fields.
Normalize fields like corporation type, political status, phone numbers.
Parse JSON-like objects embedded as strings.
Tokenize various text fields.
Extract keywords and numerical values.
Generate pinyin tokens for names.
Enrich with tags for yes/no fields.
Call forEdu, forProj, forWork, and birth to process sub-sections.
Compute combined tags and handle work experience date conversions.
Remove non-feature keys.
Convert numpy int64 to native int.
Usage Example:
processed_cv = parse(raw_cv_dict)
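The "Remove non-feature keys" step follows from the suffix list in the Returns description. A minimal sketch, assuming the suffix must appear at the very end of the key; keep_feature_keys is a hypothetical helper, not a function of the module, and the sample keys are made up for the example.

```python
import re

# Suffixes listed in the description of parse().
_FEATURE_KEY_RE = re.compile(r"_(fea|tks|nst|dt|int|flt|ltks|kwd|id)$")


def keep_feature_keys(cv: dict) -> dict:
    """Return a copy of cv containing only keys with a feature suffix."""
    return {k: v for k, v in cv.items() if _FEATURE_KEY_RE.search(k)}


sample = {"name": "x", "name_tks": "x", "age_int": 3, "valid": True}
print(keep_feature_keys(sample))
```

Anchoring the pattern with an underscore and $ means a key such as "valid" is dropped even though it ends in "id".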
Function: dealWithInt64(d)
Recursively converts any numpy.integer types within a dictionary or list to native Python int to avoid serialization issues.
Parameters:
d (dict or list or primitive): Data structure containing possible numpy integers.
Returns:
Same data structure with numpy integers converted to int.
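The recursion can be sketched without importing NumPy. As an assumption made to keep the example dependency-free, numbers.Integral stands in for numpy.integer (NumPy registers its integer scalar types with that abstract class); to_native_ints is a stand-in name for dealWithInt64.

```python
import numbers


def to_native_ints(d):
    """Recursively convert integer-like objects to plain int."""
    if isinstance(d, dict):
        return {k: to_native_ints(v) for k, v in d.items()}
    if isinstance(d, list):
        return [to_native_ints(v) for v in d]
    # Assumption: numbers.Integral stands in for numpy.integer here.
    # bool is excluded so True/False are not turned into 1/0.
    if isinstance(d, numbers.Integral) and not isinstance(d, bool):
        return int(d)
    return d
```

This kind of conversion matters because json.dumps raises a TypeError on NumPy integer scalars, which are not instances of the built-in int.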
Important Implementation Details
Tokenization: Uses rag_tokenizer to tokenize fields for better NLP downstream processing.
Normalization: Regular expressions are heavily used to clean and normalize text fields such as corporation types, political status, and phone numbers.
Date handling: Supports timestamps and string dates, converting them to uniform formats for consistent processing.
Feature engineering: Computes specialized features such as school rankings, degree hierarchies, company quality tags, and work duration to support candidate profiling.
Robustness: Handles exceptions gracefully, logs errors, and uses timeouts to prevent long-running operations from hanging the system.
Name handling: Uses xpinyin to generate pinyin tokens and prefixes for Chinese names, aiding in phonetic matching.
Data reduction: After processing, only keeps fields relevant for feature extraction, deleting raw or redundant fields.
Interaction with Other System Components
Imports entities from deepdoc.parser.resume.entities for degree, school, and corporation metadata lookups.
Uses rag.nlp.rag_tokenizer and rag.nlp.surname for tokenization and surname detection.
Relies on xpinyin.Pinyin for converting Chinese names to pinyin.
The processed data dictionary cv is likely used downstream by modules responsible for candidate ranking, matching, or resume analysis within the InfiniFlow recruitment or HR platform.
Functions like schools.select(), corporations.corpNorm(), and degrees.get_name() are called, indicating integration with external knowledge bases or databases.
Visual Diagram: Class & Function Structure
classDiagram
class TimeoutException {
<<Exception>>
}
class time_limit {
+__enter__()
+__exit__()
}
class step_two {
+rmHtmlTag(line)
+highest_degree(dg)
+forEdu(cv)
+forProj(cv)
+json_loads(line)
+forWork(cv)
+turnTm2Dt(b)
+getYMD(b)
+birth(cv)
+parse(cv)
+dealWithInt64(d)
}
time_limit ..> TimeoutException : raises
Summary
step_two.py parses resume data and extracts features from it, transforming raw, often messy input into normalized, tokenized, and enriched profiles. By processing education, work, projects, dates, and personal information, it produces a structured feature dictionary for downstream machine learning and candidate matching tasks within the InfiniFlow platform.