schools.py
Overview
The schools.py module provides functionality to manage and query a dataset of educational institutions, primarily universities and colleges. It loads school data from CSV and JSON resources, processes school names into normalized forms, assigns ranking information to schools, and offers utility functions to select school records and verify the quality status of a school.
This module is designed to assist applications that require normalization, ranking, and filtering of school names, such as academic data processing, university ranking aggregations, or educational data analytics.
Detailed Explanation
Global Variables
TBL
A pandas DataFrame containing school data loaded from res/schools.csv. It is preprocessed by lowercasing and trimming English names (name_en), and is augmented later with ranking data.
GOOD_SCH
A set of normalized school names loaded from res/good_sch.json. The names are stripped of punctuation and special characters to create a uniform set to check if a school is considered "good" or reputable.
current_file_path
Stores the directory path of the current file to help locate resource files relative to this script.
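A minimal sketch of how globals like these might be built. The sample rows, the alias column name, and the exact punctuation pattern are assumptions made for illustration; in-memory strings stand in for res/schools.csv and res/good_sch.json so the snippet runs on its own.

```python
import io
import json
import re

import pandas as pd

# Invented sample data standing in for res/schools.csv and res/good_sch.json.
SCHOOLS_TSV = (
    "name_cn\tname_en\talias\n"
    "清华大学\t Tsinghua University \tthu+tsinghua\n"
    "北京大学\tPeking University\tpku\n"
)
GOOD_JSON = '["tsinghua university", "peking university"]'

# Load the tab-separated table, then lowercase and trim the English names.
TBL = pd.read_csv(io.StringIO(SCHOOLS_TSV), sep="\t", header=0).fillna("")
TBL["name_en"] = TBL["name_en"].str.lower().str.strip()

# Strip punctuation and whitespace so every entry uses the same compact key.
# The character set here is an assumption, not the module's actual pattern.
GOOD_SCH = {re.sub(r"[,.&()\s]+", "", name) for name in json.loads(GOOD_JSON)}

print(TBL["name_en"].tolist())
print(sorted(GOOD_SCH))
```

Normalizing once at load time means later lookups only have to normalize the query, not every stored name.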
Functions
loadRank(fnm: str) -> None
Loads school ranking information from a CSV file and updates the global TBL DataFrame by assigning a rank to matching school entries.
Parameters:
fnm (str): Path to the CSV file containing school rankings. Each line should have a school name and its rank separated by a comma.
Returns:
None. Side effect: updates TBL by adding or modifying the "rank" column.
Details:
Initially, all schools are assigned a default high rank value of 1000000. The function reads the ranking file line by line, extracting the school name and rank. It then updates the rank for schools in TBL that match by Chinese or English name (name_cn or name_en). Lines that fail to parse, and names that match no school, are silently ignored.
Example usage:
loadRank("path/to/school.rank.csv")
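The rank assignment described above could look roughly like the following. This is a sketch under stated assumptions, not the module's code: load_rank_sketch is a hypothetical name, it takes an iterable of lines instead of a file path so it can run without files, and the sample table is invented.

```python
import io

import pandas as pd

# Invented sample table; the real TBL is loaded from res/schools.csv.
TBL = pd.DataFrame({
    "name_cn": ["清华大学", "北京大学", "示例学院"],
    "name_en": ["tsinghua university", "peking university", "example college"],
})

def load_rank_sketch(lines) -> None:
    """Assign ranks from 'name,rank' lines; unmatched schools keep the default."""
    TBL["rank"] = 1000000  # default for schools absent from the ranking data
    for line in lines:
        parts = line.strip("\n").split(",")
        try:
            name, rank = parts[0].strip(), int(parts[1])
        except (IndexError, ValueError):
            continue  # malformed lines are skipped silently
        # Match on either the Chinese or the English name.
        hit = (TBL["name_cn"] == name) | (TBL["name_en"] == name)
        TBL.loc[hit, "rank"] = rank

rank_data = io.StringIO(
    "清华大学,1\npeking university,2\nbroken line\nunknown school,9\n"
)
load_rank_sketch(rank_data)
print(TBL[["name_en", "rank"]])
```

Because failures are swallowed, a typo in the ranking file leaves the affected school at the default rank with no warning; logging the skipped lines would be one way to make that visible.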
split(txt: str) -> list[str]
Splits a text string into tokens with special handling for English words that should remain together.
Parameters:
txt (str): Input string to tokenize.
Returns:
list[str]: A list of tokens.
Details:
The function normalizes spaces and then tokenizes the string by spaces. However, if consecutive tokens both contain alphabetic characters and the previous token ends with a letter, they are merged back with a space to avoid splitting multi-word English names incorrectly.
Example usage:
tokens = split("New York University")  # tokens == ["New York University"]
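One way to read the merging rule is sketched below. The exact condition is an assumption: split_sketch (a hypothetical name) merges a token into the previous one when the previous token ends with an ASCII letter and the current token starts with one, which may differ in detail from the module's check.

```python
import re

def split_sketch(txt: str) -> list[str]:
    tokens: list[str] = []
    # Collapse runs of spaces and tabs, then split on whitespace.
    for tok in re.sub(r"[ \t]+", " ", txt).split():
        prev_ends_with_letter = bool(tokens) and re.search(r"[a-zA-Z]$", tokens[-1])
        if prev_ends_with_letter and re.match(r"[a-zA-Z]", tok):
            # Re-join consecutive English words into one token.
            tokens[-1] = tokens[-1] + " " + tok
        else:
            tokens.append(tok)
    return tokens

print(split_sketch("New York University"))
print(split_sketch("清华大学 Tsinghua University (THU)"))
```

Under this reading, a Chinese name, an English name, and a bracketed abbreviation in the same string come out as three separate tokens, while the words of the English name stay together.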
select(nm: str | list) -> dict | None
Selects a school record from TBL that matches the given school name or alias.
Parameters:
nm (str or list): School name, or a list whose first element is the school name.
Returns:
dict representing the first matched school record with all columns, or None if no match is found.
Details:
The input name is normalized by:
If the input is a list, taking the first element.
Applying the split() function and using the first token.
Lowercasing, trimming, and removing parenthetical content.
Removing certain prefixes like "the" or country names ("英国", "美国", "瑞士", that is, UK, USA, Switzerland).
Simplifying phrases like "大学.*学院" (a university name followed by one of its colleges) to "大学" (the university alone).
Then, a deep copy of TBL is made, and a boolean column hit_alias is added marking whether the normalized name appears in the school's alias list (aliases are separated by "+").
The function returns the first matched school record where the Chinese name, English name, or alias matches the normalized input.
Example usage:
school = select("Massachusetts Institute of Technology (MIT)")
if school:
    print(school["name_en"], school["rank"])
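The normalization steps listed above can be approximated with a few regular expressions. This is a sketch only: normalize_name_sketch is a hypothetical helper, the patterns are assumptions, and the split() step is left out, so the sketch works on the whole string instead of its first token.

```python
import re

def normalize_name_sketch(nm) -> str:
    # A list input contributes only its first element.
    if isinstance(nm, list):
        nm = str(nm[0])
    nm = nm.lower().strip()
    # Drop parenthetical content, with ASCII or full-width brackets.
    nm = re.sub(r"[((][^()()]*[))]", "", nm).strip()
    # Drop a leading "the " or a leading country name.
    nm = re.sub(r"^(the |英国|美国|瑞士)", "", nm)
    # Reduce "<university>...<college>" to the university alone.
    nm = re.sub(r"大学.*学院", "大学", nm)
    return nm

print(normalize_name_sketch("The University of Tokyo (UTokyo)"))
print(normalize_name_sketch(["美国哈佛大学"]))
print(normalize_name_sketch("北京大学光华管理学院"))
```

Collapsing a college to its parent university trades precision for recall: a query that names a specific college resolves to the university's record.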
is_good(nm: str) -> bool
Checks if a given school name belongs to the set of "good" schools.
Parameters:
nm (str): School name to check.
Returns:
bool: True if the normalized school name is in GOOD_SCH, else False.
Details:
The input name is normalized by:
Lowercasing.
Removing any parenthetical content.
Stripping punctuation and whitespace characters.
The normalized name is then checked against the preloaded GOOD_SCH set.
Example usage:
if is_good("Stanford University"):
    print("This is a good school!")
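A membership check along these lines might look as follows. The set contents, the is_good_sketch name, and the punctuation character set are all assumptions for illustration; the key point is that the query is reduced to the same compact form as the stored names before the lookup.

```python
import re

# Invented sample set; entries are already in normalized (compact) form.
GOOD_SCH = {"stanforduniversity", "pekinguniversity"}

def is_good_sketch(nm: str) -> bool:
    nm = nm.lower().strip()
    # Remove parenthetical content, with ASCII or full-width brackets.
    nm = re.sub(r"[((][^()()]*[))]", "", nm)
    # Strip punctuation and whitespace so the key matches the stored form.
    nm = re.sub(r"[\s'`,.&();;]+", "", nm)
    return nm in GOOD_SCH

print(is_good_sketch("Stanford University (CA)"))
print(is_good_sketch("Example College"))
```

Since GOOD_SCH is a set, each lookup is a constant-time hash check once the name has been normalized.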
Implementation Details and Algorithms
Data Loading and Normalization:
The module loads schools from a tab-separated CSV, applying lowercase normalization to English names. Ranking data is loaded separately and merged by matching Chinese or English names.
Name Normalization:
School names are aggressively normalized to handle variations such as parentheses, country prefixes, and different suffixes to improve matching accuracy.
Alias Matching:
School aliases are stored as concatenated strings with "+" separators. The code splits each string and checks whether a queried name appears among the aliases.
Ranking Assignment:
Schools not found in the ranking file default to a very high rank (1000000), effectively marking them as low-ranked or unknown.
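The alias matching described above can be sketched with pandas as follows. The column name alias, the sample rows, and the match_sketch helper are assumptions; the query is expected to be normalized already.

```python
import pandas as pd

# Invented sample rows; "alias" is a hypothetical column name.
TBL = pd.DataFrame({
    "name_cn": ["清华大学", "北京大学"],
    "name_en": ["tsinghua university", "peking university"],
    "alias": ["thu+tsinghua", "pku+beida"],
    "rank": [1, 2],
})

def match_sketch(nm: str):
    tbl = TBL.copy(deep=True)  # work on a copy so the shared table is untouched
    # Exact membership in the "+"-separated alias list, not a substring search.
    tbl["hit_alias"] = tbl["alias"].map(lambda x: nm in set(x.split("+"))).astype(bool)
    res = tbl[(tbl["name_cn"] == nm) | (tbl["name_en"] == nm) | tbl["hit_alias"]]
    if res.empty:
        return None
    return res.iloc[0].to_dict()  # first matching record as a plain dict

print(match_sketch("pku"))
print(match_sketch("pk"))
```

Splitting on "+" before testing membership keeps a short query such as "pk" from matching "pku" by accident. Copying the whole table on every call is simple but costs time proportional to the table size; computing the boolean mask without adding a column would avoid the copy.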
Interaction with Other Parts of the System
Resource Files:
res/schools.csv: Contains raw school data including Chinese and English names, aliases, etc.
res/good_sch.json: JSON list of reputable or "good" school names.
res/school.rank.csv: CSV file mapping school names to their numeric ranking.
Data Consumers:
This module likely interacts with other components that require normalized school data, ranked filtering, or verification of school reputability. It acts as a utility or data provider layer.
Pandas Dependency:
Heavy reliance on pandas for data manipulation and filtering.
Mermaid Class Diagram
classDiagram
class schools_py {
+DataFrame TBL
+set GOOD_SCH
+loadRank(fnm: str) void
+split(txt: str) list
+select(nm: str | list) dict | None
+is_good(nm: str) bool
}
Summary
The schools.py file is a utility module for loading, normalizing, ranking, and querying school data. It offers simple interfaces for selecting schools by name and checking if a school is in a pre-defined "good" list. It standardizes school names and aliases to facilitate robust matching and filtering, making it valuable for educational data processing pipelines.