schools.py

Overview

The schools.py module manages and queries a dataset of educational institutions, primarily universities and colleges. It loads school data from CSV and JSON resources, normalizes school names, assigns ranking information to schools, and offers utility functions to look up a school record and to check whether a school is on a curated "good" list.

This module is designed to assist applications that require normalization, ranking, and filtering of school names, such as academic data processing, university ranking aggregations, or educational data analytics.

Detailed Explanation

Global Variables

- TBL: A pandas DataFrame containing school data loaded from the tab-separated file res/schools.csv. English names (name_en) are lowercased and trimmed at load time, and a rank column is added later by loadRank().
- GOOD_SCH: A set of normalized school names loaded from res/good_sch.json. Punctuation and special characters are stripped from each name so that is_good() can test membership against a uniform form.
- current_file_path: The directory of the current file, used to locate the resource files relative to this script.
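
The loading steps described above can be sketched on in-memory data. This is a minimal illustration, not the module's actual code: the sample rows, the lowercase names tbl and good_sch, and the exact character class used to strip punctuation are all assumptions.

```python
import io
import json
import re

import pandas as pd

# Hypothetical stand-ins for res/schools.csv (tab-separated) and res/good_sch.json.
SCHOOLS_TSV = (
    "name_cn\tname_en\talias\n"
    "清华大学\t Tsinghua University \tthu+清华\n"
    "某某学院\tSome College\t\n"
)
GOOD_JSON = '["tsinghua university", "texas a&m university"]'

# Load the table and replace missing cells (e.g. an empty alias) with "".
tbl = pd.read_csv(io.StringIO(SCHOOLS_TSV), sep="\t", header=0).fillna("")
# Lowercase and trim the English names, as described for TBL.
tbl["name_en"] = tbl["name_en"].map(lambda s: s.lower().strip())

# Strip punctuation and spaces from each "good" name (assumed character class).
good_sch = {re.sub(r"[,. &()（）]+", "", name) for name in json.loads(GOOD_JSON)}

print(tbl["name_en"].tolist())
print(sorted(good_sch))
```

Reading from io.StringIO keeps the sketch runnable without the resource files; the real module reads paths built from current_file_path.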

Functions

`loadRank(fnm: str) -> None`

Loads school ranking information from a CSV file and updates the global TBL DataFrame by assigning a rank to matching school entries.

Parameters:
- fnm (str): Path to the CSV file containing school rankings. Each line should have a school name and its rank separated by a comma.
Returns:
- None. Side effect: updates TBL by adding/modifying the "rank" column.
Details:
Every call first resets the rank of all schools to the default value 1000000. The function then reads the ranking file line by line, extracting a school name and an integer rank from each line, and updates the rank of every row in TBL whose Chinese or English name (name_cn or name_en) equals that name. Lines that fail to parse, and names that match no school, are silently skipped.
Example usage:
```
loadRank("path/to/school.rank.csv")
```
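
To make the default-then-overwrite behaviour concrete, here is a minimal sketch that runs on in-memory data rather than the real resource files. The function name load_rank, the sample rows, and the narrowed exception types are assumptions for illustration, not the module's actual code.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for TBL and a ranking file.
tbl = pd.DataFrame(
    {
        "name_cn": ["清华大学", "北京大学", "某某学院"],
        "name_en": ["tsinghua university", "peking university", "some college"],
    }
)
RANK_CSV = "清华大学,1\npeking university,2\nbad line\nunknown school,3\n"


def load_rank(table: pd.DataFrame, lines) -> None:
    """Assign ranks in place; unmatched schools keep the default."""
    table["rank"] = 1000000  # default for schools absent from the ranking file
    for line in lines:
        parts = line.strip("\n").split(",")
        try:
            name, rank = parts[0].strip(), int(parts[1])
        except (IndexError, ValueError):
            continue  # malformed line: skipped silently
        # Boolean mask: rows whose Chinese or English name equals the name.
        table.loc[(table.name_cn == name) | (table.name_en == name), "rank"] = rank


load_rank(tbl, io.StringIO(RANK_CSV))
print(tbl["rank"].tolist())
```

The line without a comma and the school that is not in the table both leave the DataFrame unchanged, so the third row keeps the default rank.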

`split(txt: str) -> list[str]`

Splits a text string into tokens with special handling for English words that should remain together.

Parameters:
- txt (str): Input string to tokenize.
Returns:
- list[str]: A list of tokens.
Details:
The function normalizes whitespace and then tokenizes the string on spaces. If two consecutive tokens both contain alphabetic characters and the previous token ends with a letter, they are merged back with a single space, so that multi-word English names are not split apart.

Example usage:

```
tokens = split("New York University")
# tokens == ["New York University"]
```
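
The merging rule can be sketched as follows. This is an illustration under one stated assumption: a token is merged into the previous one when the previous token ends with an ASCII letter and the new token starts with one. The name split_sketch is hypothetical.

```python
import re


def split_sketch(txt: str) -> list[str]:
    """Tokenize on whitespace, re-joining runs of English words."""
    tokens: list[str] = []
    for tok in re.sub(r"[ \t]+", " ", txt).split():
        # Assumed rule: merge when the previous token ends with a letter
        # and the current token starts with a letter.
        if tokens and re.search(r"[a-zA-Z]$", tokens[-1]) and re.match(r"[a-zA-Z]", tok):
            tokens[-1] += " " + tok
        else:
            tokens.append(tok)
    return tokens


print(split_sketch("New York   University"))        # ['New York University']
print(split_sketch("清华大学 Tsinghua University"))    # ['清华大学', 'Tsinghua University']
```

Because Chinese names contain no ASCII letters, a Chinese name followed by its English name yields two tokens, and select() then keeps only the first.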

`select(nm: str | list) -> dict | None`

Selects a school record from TBL that matches the given school name or alias.

Parameters:
- nm (str or list): School name or list where the first element is the school name.
Returns:
- dict representing the first matched school record with all columns, or None if no match is found.
Details:
The input name is normalized by:
- If input is a list, taking the first element.
- Applying the split() function and using the first token.
- Lowercasing, trimming, and removing parenthetical content.
- Removing a leading "the" and leading country names ("英国", "美国", "瑞士", i.e. UK, USA, Switzerland).
- Collapsing names that match the pattern "大学.*学院" (a college within a university) to "大学", so that the name resolves to the parent university.
After normalization, the function makes a deep copy of TBL and adds a boolean column, hit_alias, that is True when the normalized name equals one of the school's aliases (aliases are stored in a single string, separated by "+").
It then returns the first record whose Chinese name, English name, or alias matches the normalized input.

Example usage:

```
school = select("Massachusetts Institute of Technology (MIT)")
if school:
    print(school["name_en"], school["rank"])
```
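
The normalization steps listed under Details can be sketched in isolation. The function name normalize_name and the exact regular expressions are assumptions chosen to match the description; the split() step is omitted here for brevity.

```python
import re
from typing import Union


def normalize_name(nm: Union[str, list]) -> str:
    """Approximate the name clean-up that precedes the table lookup."""
    if isinstance(nm, list):
        nm = nm[0]  # a list input contributes only its first element
    nm = str(nm).lower().strip()
    # Drop parenthetical content, ASCII or full-width parentheses.
    nm = re.sub(r"[(（][^()（）]+[)）]", "", nm)
    # Drop a leading "the " or a leading country name.
    nm = re.sub(r"^the |^(英国|美国|瑞士)", "", nm)
    # Collapse "X大学Y学院" to "X大学".
    nm = re.sub(r"大学.*学院", "大学", nm)
    return nm.strip()


print(normalize_name("The University of Edinburgh (UK)"))  # university of edinburgh
print(normalize_name(["英国伦敦大学国王学院"]))                 # 伦敦大学
```

Normalizing before the lookup means the table only needs to store one canonical form per school plus its aliases.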

`is_good(nm: str) -> bool`

Checks if a given school name belongs to the set of "good" schools.

Parameters:
- nm (str): School name to check.
Returns:
- bool: True if the normalized school name is in GOOD_SCH, else False.
Details:
The input name is normalized by:
- Lowercasing.
- Removing any parenthetical content.
- Stripping punctuation and whitespace characters.
The normalized name is then checked against the preloaded GOOD_SCH set.

Example usage:

```
if is_good("Stanford University"):
    print("This is a good school!")
```
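
A minimal sketch of the membership check follows. The set GOOD, the function name is_good_sketch, and the exact set of stripped characters are assumptions; the real GOOD_SCH is built from res/good_sch.json.

```python
import re

# Hypothetical, already-normalized stand-in for GOOD_SCH.
GOOD = {"stanforduniversity", "tsinghuauniversity"}


def is_good_sketch(nm: str, good: set[str] = GOOD) -> bool:
    """Normalize a name the way the text describes, then test membership."""
    # Lowercase and drop parenthetical content.
    nm = re.sub(r"[(（][^()（）]+[)）]", "", nm.lower())
    # Strip whitespace and common punctuation (assumed character class).
    nm = re.sub(r"[\s,.&()（）;；'`]+", "", nm)
    return nm in good


print(is_good_sketch("Stanford University (Stanford)"))  # True
print(is_good_sketch("Some College"))                    # False
```

The check is an exact match on the normalized string, so it only succeeds when the list entries were normalized the same way at load time.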

Implementation Details and Algorithms

Data Loading and Normalization:
The module loads schools from a tab-separated CSV, applying lowercase normalization to English names. Ranking data is loaded separately and merged by matching Chinese or English names.
Name Normalization:
School names are aggressively normalized to handle variations such as parentheses, country prefixes, and different suffixes to improve matching accuracy.
Alias Matching:
School aliases are stored as concatenated strings with "+" separators. The code splits and checks if a queried name appears in these aliases.
Ranking Assignment:
Schools not found in the ranking file default to a very high rank (1000000), effectively marking them as low-ranked or unknown.
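
The alias lookup can be sketched with a small DataFrame. The sample rows and the function name find_school are hypothetical, and the sketch assumes the queried name has already been normalized.

```python
import pandas as pd

# Hypothetical stand-in for TBL; aliases are "+"-separated strings.
tbl = pd.DataFrame(
    {
        "name_cn": ["麻省理工学院", "清华大学"],
        "name_en": ["massachusetts institute of technology", "tsinghua university"],
        "alias": ["mit+麻省理工", "thu+清华"],
        "rank": [1, 20],
    }
)


def find_school(table: pd.DataFrame, name: str):
    """Return the first row matching by Chinese name, English name, or alias."""
    work = table.copy(deep=True)  # leave the shared table untouched
    # True where the name is exactly one of the row's aliases.
    work["hit_alias"] = work["alias"].map(lambda a: name in set(a.split("+")))
    hits = work[(work.name_cn == name) | (work.name_en == name) | work.hit_alias]
    if hits.empty:
        return None
    return hits.iloc[0].to_dict()


print(find_school(tbl, "mit")["name_en"])
print(find_school(tbl, "nowhere"))
```

Splitting the alias string into a set makes the test an exact match per alias, so "mi" does not match "mit". Copying the whole table on every lookup is simple but costs time proportional to the table size.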

Interaction with Other Parts of the System

Resource Files:
- res/schools.csv: Contains raw school data including Chinese and English names, aliases, etc.
- res/good_sch.json: JSON list of reputable or "good" school names.
- res/school.rank.csv: CSV file mapping school names to their numeric ranking.
Data Consumers:
This module acts as a utility and data-provider layer. It is likely called by components that need normalized school records, rank-based filtering, or a reputability check, using select() and is_good() rather than reading the resource files directly.
Pandas Dependency:
The module relies on pandas for loading, updating, and filtering the school table.

Mermaid Class Diagram

```mermaid
classDiagram
    class schools.py {
        +DataFrame TBL
        +set GOOD_SCH
        +loadRank(fnm: str) void
        +split(txt: str) list
        +select(nm: str | list) dict | None
        +is_good(nm: str) bool
    }
```

Summary

The schools.py file is a utility module for loading, normalizing, ranking, and querying school data. It offers simple interfaces for selecting schools by name and checking if a school is in a pre-defined "good" list. It standardizes school names and aliases to facilitate robust matching and filtering, making it valuable for educational data processing pipelines.