step_two.py


Overview

step_two.py is a utility module that parses, cleans, and enriches structured data extracted from resumes (CVs). It processes resume sections such as education, projects, work experience, and personal information, converting raw input into tokenized, normalized, feature-enriched dictionaries for downstream applications such as candidate ranking, matching, and profile analysis in the InfiniFlow system.

Key functionalities include:

The module depends on external components such as deepdoc.parser.resume.entities (for degrees, schools, corporations) and rag.nlp.rag_tokenizer for tokenization, and it uses xpinyin for pinyin conversions of Chinese names.


Classes and Functions

Class: TimeoutException

Custom exception class used to signal a timeout event.

class TimeoutException(Exception):
    pass

Usage: Raised within the time_limit context manager when execution exceeds a specified time limit.


Context Manager: time_limit(seconds)

Sets an alarm signal that interrupts execution after the given number of seconds.

Parameters:

seconds: the time limit, in seconds, after which the alarm fires.

Usage Example:

try:
    with time_limit(5):
        # Code block that must complete within 5 seconds
        long_running_function()
except TimeoutException:
    print("Function timed out!")
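The alarm-based mechanism described above could be implemented roughly as follows. This is a minimal sketch, not the module's actual code: the use of contextlib.contextmanager, signal.SIGALRM, and the handler name _on_alarm are assumptions. It relies on signal.alarm, so it only works on Unix-like systems and only in the main thread.

```python
import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    pass


@contextmanager
def time_limit(seconds):
    # Handler (hypothetical name) that turns the alarm signal into an exception.
    def _on_alarm(signum, frame):
        raise TimeoutException(f"Timed out after {seconds} seconds")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)  # schedule the alarm; accepts whole seconds only
    try:
        yield
    finally:
        signal.alarm(0)  # cancel the alarm if the block finished in time
        signal.signal(signal.SIGALRM, previous)  # restore the earlier handler
```

Cancelling the alarm in the finally clause keeps a stale alarm from firing later in unrelated code.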

Function: rmHtmlTag(line)

Removes HTML tags from a given string by replacing them with spaces.

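Parameters:

line: the input string, which may contain HTML tags.

Returns:

The input string with each HTML tag replaced by a space.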

Implementation details:

Uses a regex to match HTML tags and replaces them with spaces.
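A minimal sketch of this kind of tag stripping is shown below. The pattern is an assumption (the module's actual regex is not reproduced here), and rm_html_tag is a hypothetical name used only for illustration.

```python
import re

# Assumed pattern: "<", then one or more characters that are not angle
# brackets, then ">". The real module may use a stricter character class.
_TAG_RE = re.compile(r"<[^<>]+>")


def rm_html_tag(line):
    # Replace each matched tag with a single space so adjacent words
    # do not run together once the markup is gone.
    return _TAG_RE.sub(" ", line)
```

Replacing tags with spaces rather than deleting them keeps text like "a<br>b" from collapsing into "ab".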


Function: highest_degree(dg)

Determines the highest academic degree from a list or single string of degree names.

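Parameters:

dg: a single degree name (string) or a list of degree names.

Returns:

The degree name with the highest rank among those given.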

Degree ranking:

Degree | Rank
初中 (Middle school) | 0
高中 (High school) | 1
中专 (Technical secondary) | 2
大专 (Junior college) | 3
专升本 (Specialized to bachelor) | 4
本科 (Bachelor) | 5
硕士 (Master) | 6
博士 (Doctorate) | 7
博士后 (Postdoc) | 8
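The ranking above can be applied with a simple lookup, as in the following sketch. The name highest_degree_sketch is hypothetical, and two behaviors are assumptions rather than documented facts: empty input returns an empty string, and unknown degree names rank below every known one.

```python
# Rank values copied from the degree ranking table.
DEGREE_RANK = {
    "初中": 0, "高中": 1, "中专": 2, "大专": 3, "专升本": 4,
    "本科": 5, "硕士": 6, "博士": 7, "博士后": 8,
}


def highest_degree_sketch(dg):
    if not dg:
        return ""  # assumption: nothing to rank
    if isinstance(dg, str):
        dg = [dg]  # accept a single degree name as well as a list
    # Assumption: names missing from the table get rank -1.
    return max(dg, key=lambda d: DEGREE_RANK.get(d, -1))
```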


Function: forEdu(cv)

Processes the "education_obj" field in the resume dictionary to extract features such as school rankings, degrees, majors, and study periods.

Parameters:

Returns:

Key features computed:

Usage:

Called internally in parse() to enrich education-related data.


Function: forProj(cv)

Processes the "project_obj" field to tokenize and extract project names and descriptions.

Parameters:

Returns:

Usage:

Called internally in parse().


Function: json_loads(line)

Parses a JSON-like string into a Python object, converting True/False literals to strings to avoid decoding errors.

Parameters:

line: a JSON-like string.

Returns:

The decoded Python object.

Implementation details:

Uses demjson3.decode after regex replacement to handle boolean literals.
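The boolean-quoting step can be illustrated with the sketch below. The regex is an assumed pattern, json_loads_sketch is a hypothetical name, and the standard library's json.loads stands in for demjson3.decode so that the example has no third-party dependency.

```python
import json
import re

# Assumed pattern: a Python-style True/False that appears as an object
# value, i.e. after a colon and before a comma or closing brace.
_BOOL_RE = re.compile(r":\s*(True|False)\s*(,|\})")


def json_loads_sketch(line):
    # Wrap the bare literal in double quotes so it decodes as a string.
    quoted = _BOOL_RE.sub(r': "\1"\2', line)
    # Stand-in for demjson3.decode, which the module is documented to use.
    return json.loads(quoted)
```

This sketch only handles booleans that appear as object values; literals inside arrays would need an additional pattern.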


Function: forWork(cv)

Processes the "work_obj" field to extract detailed work experience features, including company rankings, position names, responsibilities, and subordinate counts, and to calculate work duration metrics.

Parameters:

Returns:

Key features computed:


Function: turnTm2Dt(b)

Converts a timestamp or string representing time into a formatted datetime string.
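One way such a conversion might work is sketched here. The name to_datetime_string, the rule that a run of ten or more digits is a Unix timestamp, and the output format are all assumptions made for illustration.

```python
import re
import time


def to_datetime_string(b):
    if not b:
        return None  # assumption: empty input yields None
    b = str(b).strip()
    # Assumption: ten or more digits is a Unix timestamp, and the first
    # ten digits are seconds (which also covers millisecond timestamps).
    if re.fullmatch(r"[0-9]{10,}", b):
        return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(b[:10])))
    # Anything else is assumed to be a date string already.
    return b
```

Because time.localtime is used, the formatted value depends on the machine's time zone.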

Parameters:

Returns:


Function: getYMD(b)

Extracts the Year, Month, and Day components from a string or timestamp.
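The following sketch shows one way to pull the three components out of a date string. It is an illustration under stated assumptions: get_ymd is a hypothetical name, only string input is handled (timestamps would need converting first), a missing or out-of-range month or day defaults to 1, and input without a four-digit year returns empty strings.

```python
import re

# Four-digit year, then an optional month and day, each preceded by at
# most one non-digit separator ("-", ".", "/", "年", and so on).
_YMD_RE = re.compile(
    r"([0-9]{4})(?:[^0-9]?([0-9]{1,2}))?(?:[^0-9]?([0-9]{1,2}))?"
)


def get_ymd(text):
    m = _YMD_RE.search(str(text or ""))
    if not m:
        return "", "", ""  # assumption: no year found
    year = int(m.group(1))
    month = int(m.group(2)) if m.group(2) else 1
    day = int(m.group(3)) if m.group(3) else 1
    # Assumption: out-of-range values fall back to 1.
    if not 1 <= month <= 12:
        month = 1
    if not 1 <= day <= 31:
        day = 1
    return year, month, day
```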

Parameters:

Returns:


Function: birth(cv)

Processes the "birth" field to extract birth date features, calculate age, and generate birthday keywords.
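A sketch of this kind of enrichment is given below. Several details are assumptions: the input is taken to be an ISO "YYYY-MM-DD" string, the output keys birth_dt, birthday_kwd, and age_int are illustrative names, and age is computed as a simple difference of years. The today parameter is also hypothetical and exists only to make the function deterministic in tests.

```python
import datetime


def add_birth_features(cv, today=None):
    today = today or datetime.date.today()
    birth = cv.get("birth")
    if not birth:
        return cv  # nothing to enrich
    try:
        # Assumption: the first ten characters form an ISO date.
        born = datetime.date.fromisoformat(str(birth)[:10])
    except ValueError:
        return cv  # leave the record untouched if the date is unparseable
    cv["birth_dt"] = born.isoformat()             # normalized birth date
    cv["birthday_kwd"] = born.strftime("%m%d")    # month-day keyword
    cv["age_int"] = today.year - born.year        # year-difference age
    return cv
```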

Parameters:

Returns:


Function: parse(cv)

The core function that cleans, normalizes, tokenizes, and enriches the entire resume dictionary.

Parameters:

Returns:

Main processing steps:

Usage Example:

processed_cv = parse(raw_cv_dict)

Function: dealWithInt64(d)

Recursively converts any numpy.integer types within a dictionary or list to native Python int to avoid serialization issues.
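The recursive conversion can be sketched as follows. The name to_native_ints is hypothetical, and this version returns new containers rather than modifying the input, which may differ from the module's behavior. It requires NumPy to be installed.

```python
import json

import numpy as np


def to_native_ints(d):
    if isinstance(d, dict):
        # Convert values only; dictionary keys are left unchanged.
        return {k: to_native_ints(v) for k, v in d.items()}
    if isinstance(d, list):
        return [to_native_ints(v) for v in d]
    if isinstance(d, np.integer):
        return int(d)  # e.g. np.int64(30) becomes the Python int 30
    return d


raw = {"age_int": np.int64(30), "scores": [np.int32(1), 2]}
# json.dumps(raw) would raise TypeError because np.int64 is not serializable.
serialized = json.dumps(to_native_ints(raw), sort_keys=True)
```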

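Parameters:

d: a dictionary or list that may contain numpy.integer values.

Returns:

The same structure with every numpy.integer value converted to a native Python int.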


Important Implementation Details


Interaction with Other System Components


Visual Diagram: Class & Function Structure

classDiagram
    class TimeoutException {
        <<Exception>>
    }

    class time_limit {
        +__enter__()
        +__exit__()
    }

    class step_two {
        +rmHtmlTag(line)
        +highest_degree(dg)
        +forEdu(cv)
        +forProj(cv)
        +json_loads(line)
        +forWork(cv)
        +turnTm2Dt(b)
        +getYMD(b)
        +birth(cv)
        +parse(cv)
        +dealWithInt64(d)
    }

    time_limit ..> TimeoutException : raises

Summary

step_two.py is a resume parsing and feature extraction module that transforms raw, often messy resume data into normalized, tokenized, and enriched profiles. By processing education, work, projects, dates, and personal information, it produces a structured feature dictionary for downstream machine learning and candidate matching tasks within the InfiniFlow platform.