schools.csv

Overview

schools.csv is a large, structured dataset of educational institutions, primarily universities and colleges, located in China and abroad. Each row represents one school with a fixed set of attributes. Although the file carries a .csv extension, the delimiter is not necessarily a comma (the Python example in this document assumes tabs), so check it before loading.

The dataset supplies school entity data to systems that need standardized metadata about educational institutions, such as resume parsers, education verification systems, and academic analytics platforms.

Data Structure and Fields

Each row in the file corresponds to a single educational institution and contains the following columns (fields):

Field Name	Description
`id`	Unique identifier for the school (integer).
`type`	Type code of the school entity (integer). Specific type meanings depend on external system definitions.
`parent_id`	Identifier for a parent institution if applicable (integer). Used for hierarchical relationships.
`name_cn`	School name in Chinese (string).
`name_en`	School name in English (string).
`alias`	Alternative names or nicknames for the school, separated by `::` if multiple (string).
`is_abroad`	Boolean flag (0 or 1) indicating if the school is located abroad (outside of China).
`is_world_known`	Boolean flag (0 or 1) indicating if the school is recognized worldwide.
`school_type`	Category/type of school, e.g., "综合类" (comprehensive), "高职类" (vocational), "医科类" (medical), "艺术类" (arts), etc.
`is_double_first`	Boolean flag (0 or 1) indicating if the school is part of the "Double First Class" initiative (prestigious Chinese universities).
`education_type`	Level or type of education offered, e.g., "本科" (undergraduate), "专科" (junior college, associate level), "独立学院" (independent college), etc.
`province`	Province or state where the school is located (string).
`city`	City where the school is located (string).
`is_985`	Boolean flag (0 or 1) indicating if the school belongs to China's "Project 985" (a group of elite universities).
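
The field descriptions above can be turned into a small typed loader. The following is a minimal sketch, not the project's own loader: it assumes the file has a header row with the column names listed above and uses a tab delimiter (both are assumptions to verify against the real file). The `School` dataclass, the helper names, and the two sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Two made-up rows for illustration only; real ids and names will differ.
SAMPLE = (
    "id\ttype\tparent_id\tname_cn\tname_en\talias\tis_abroad\tis_world_known\t"
    "school_type\tis_double_first\teducation_type\tprovince\tcity\tis_985\n"
    "1\t1\t\t示例大学\tExample University\t示大::EU\t0\t1\t综合类\t1\t本科\t上海市\t上海市\t1\n"
    "2\t1\t1\t示例学院\tExample College\t\t0\t0\t高职类\t0\t专科\t上海市\t上海市\t0\n"
)


@dataclass
class School:
    """Hypothetical typed view of a subset of the columns."""
    id: int
    parent_id: Optional[int]
    name_cn: str
    name_en: str
    aliases: List[str]
    is_abroad: bool
    is_985: bool


def _opt_int(value: Optional[str]) -> Optional[int]:
    """Return None for empty or literal NULL cells, else the integer value."""
    value = (value or "").strip()
    return None if value in ("", "NULL") else int(value)


def parse_row(row: dict) -> School:
    """Convert one raw CSV row (all strings) into a School record."""
    alias = (row.get("alias") or "").strip()
    return School(
        id=int(row["id"]),
        parent_id=_opt_int(row.get("parent_id")),
        name_cn=row["name_cn"].strip(),
        name_en=row["name_en"].strip(),
        # Multiple aliases are separated by "::"; drop empty fragments.
        aliases=[a.strip() for a in alias.split("::") if a.strip()],
        is_abroad=row["is_abroad"].strip() == "1",
        is_985=row["is_985"].strip() == "1",
    )


def load_schools(text: str, delimiter: str = "\t") -> List[School]:
    """Parse delimited text with a header row into School records."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [parse_row(r) for r in reader]


schools = load_schools(SAMPLE)
print(schools[0].aliases)    # ['示大', 'EU']
print(schools[0].parent_id)  # None
```

For a real file, the same `load_schools` function could be fed the contents of `open(path, encoding="utf-8").read()`, passing `delimiter=","` if the file turns out to be comma-separated.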

Usage

This CSV file serves as a static reference dataset. It can be imported, queried, or integrated into applications to:

Validate and standardize school names in user inputs or resumes.
Provide metadata about schools for educational or recruitment analytics.
Map schools to regions, types, or prestige categories (e.g., 985 universities).
Support hierarchical or parent-child relations among institutions.

Example usage in Python with pandas:

import pandas as pd

# Load the schools data. The delimiter is assumed to be a tab here;
# use sep=',' if the file turns out to be comma-separated.
schools_df = pd.read_csv('schools.csv', sep='\t')

# Query example: find all world-known schools in Shanghai
shanghai_world_known = schools_df[
    (schools_df['city'] == '上海市') & (schools_df['is_world_known'] == 1)
]

print(shanghai_world_known[['name_cn', 'name_en']])
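
Validating and standardizing school names, the first use listed above, usually comes down to a lookup keyed on every known name of a school. The sketch below shows one possible approach using only the standard library. The normalization rule (lowercase, whitespace removed), the function names, and the sample records are assumptions for illustration, not the behavior of any existing parser.

```python
from typing import Dict, Iterable, Optional

# Made-up records for illustration; in practice these would be rows from schools.csv.
ROWS = [
    {"id": 1, "name_cn": "示例大学", "name_en": "Example University", "alias": "示大::EU"},
    {"id": 2, "name_cn": "样本学院", "name_en": "Sample College", "alias": ""},
]


def normalize(name: str) -> str:
    """Lowercase and remove all whitespace so minor input differences still match."""
    return "".join(name.split()).lower()


def build_name_index(rows: Iterable[dict]) -> Dict[str, int]:
    """Map every normalized Chinese name, English name, and alias to a school id."""
    index: Dict[str, int] = {}
    for row in rows:
        names = [row["name_cn"], row["name_en"], *row["alias"].split("::")]
        for name in names:
            key = normalize(name)
            if key:
                # On a name collision the first row seen keeps the key.
                index.setdefault(key, row["id"])
    return index


def lookup(index: Dict[str, int], raw_name: str) -> Optional[int]:
    """Return the school id for a raw input name, or None if it is unknown."""
    return index.get(normalize(raw_name))


INDEX = build_name_index(ROWS)
print(lookup(INDEX, " example  university "))  # 1
print(lookup(INDEX, "eu"))                     # 1
print(lookup(INDEX, "Unknown School"))         # None
```

Exact matching after normalization is deliberately simple. Fuzzy matching or abbreviation handling would need additional logic that this sketch does not attempt.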

Important Notes on Implementation/Content

The data includes both Chinese and international schools, marked by is_abroad.
School names may have multiple aliases, separated by :: in the alias field.
The type and parent_id fields suggest a hierarchical or categorical classification, potentially for sub-institutions or branches.
The flags is_double_first and is_985 correspond to Chinese government classifications of elite universities.
The dataset contains thousands of schools across multiple countries and regions.
Some fields may have NULL or empty values indicating unavailable or non-applicable data.

Interaction with Other System Components

Resume Parser: This dataset is likely used by a resume parser system to recognize and standardize school names in resumes, mapping them to known entities.
Entity Recognition: The file is part of the entity data under deepdoc/parser/resume/entities/res/, indicating its use in natural language processing pipelines.
Data Enrichment: Other parts of the system may use this file to enrich user profiles or educational histories with metadata like school prestige or location.
Cross-referencing: The parent_id might be used to link branches or affiliates to main universities, aiding hierarchical understanding.
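
If `parent_id` does link branches to main institutions, the hierarchy can be resolved by following parent links upward. The following is a sketch under that assumption. The id-to-parent mapping is made up, and the convention that a missing parent is represented as `None` is also an assumption (the real file might use an empty cell, NULL, or 0).

```python
from typing import Dict, List, Optional

# Hypothetical id -> parent_id mapping; None means the school has no parent.
PARENTS: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 2, 4: None}


def ancestors(school_id: int, parents: Dict[int, Optional[int]]) -> List[int]:
    """Return the chain of parent ids from nearest to farthest.

    A visited set guards against accidental cycles in the data.
    """
    chain: List[int] = []
    seen = {school_id}
    current = parents.get(school_id)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parents.get(current)
    return chain


def root_of(school_id: int, parents: Dict[int, Optional[int]]) -> int:
    """Return the top-level institution for a school (itself if it has no parent)."""
    chain = ancestors(school_id, parents)
    return chain[-1] if chain else school_id


print(ancestors(3, PARENTS))  # [2, 1]
print(root_of(3, PARENTS))    # 1
print(root_of(4, PARENTS))    # 4
```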

Visual Diagram

The flowchart below shows how the file is typically used in a system context, focusing on its role in data lookup and enrichment.

flowchart TD
    A[Application / Resume Parser] --> B[Load schools.csv Dataset]
    B --> C{Query School Data}
    C -->|Normalize School Name| D[Match by name_cn / name_en / alias]
    C -->|Fetch Metadata| E[Get attributes: province, city, is_985, etc.]
    D --> F[Standardized School Entity]
    E --> F
    F --> G[Use in Resume Parsing / Profile Enrichment]

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:1px
    style E fill:#bbf,stroke:#333,stroke-width:1px
    style F fill:#bfb,stroke:#333,stroke-width:2px
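
The "Fetch Metadata" and "Standardized School Entity" steps in the flowchart can be sketched as a function that merges a matched school row into a parsed education entry. This is an illustrative sketch only: the output keys, the tag names in `FLAG_TAGS`, and the sample row are all hypothetical and are not taken from any existing implementation.

```python
from typing import Dict, List

# Hypothetical mapping from flag columns to short tag names.
FLAG_TAGS: Dict[str, str] = {
    "is_985": "985",
    "is_double_first": "double_first_class",
    "is_world_known": "world_known",
    "is_abroad": "abroad",
}


def prestige_tags(row: dict) -> List[str]:
    """Collect a tag for every flag column whose value is 1 (as int or string)."""
    return [
        tag
        for column, tag in FLAG_TAGS.items()
        if str(row.get(column, "")).strip() == "1"
    ]


def enrich(entry: dict, row: dict) -> dict:
    """Return a copy of an education entry with school metadata attached."""
    enriched = dict(entry)
    enriched["school_id"] = row["id"]
    # Prefer the Chinese name, falling back to the English one if it is empty.
    enriched["school_name"] = row.get("name_cn") or row.get("name_en")
    enriched["province"] = row.get("province") or None
    enriched["city"] = row.get("city") or None
    enriched["tags"] = prestige_tags(row)
    return enriched


# Made-up matched row and resume entry for illustration.
matched_row = {
    "id": 1, "name_cn": "示例大学", "name_en": "Example University",
    "province": "上海市", "city": "上海市",
    "is_985": "1", "is_double_first": "1", "is_world_known": "0", "is_abroad": "0",
}
result = enrich({"raw_school": "example univ.", "degree": "本科"}, matched_row)
print(result["tags"])  # ['985', 'double_first_class']
```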

Summary

File Type: CSV dataset
Content: Detailed metadata of educational institutions worldwide, with a focus on Chinese universities.
Purpose: Provides authoritative school entity data for systems handling education-related information.
Key Fields: School IDs, names, aliases, location, type, and prestige indicators.
Usage: Lookup, normalization, and enrichment of educational institution data.
Integration: Primarily used in resume parsing and entity recognition pipelines.

Systems that need to recognize and analyze educational backgrounds consistently can use this file as their reference list of schools.