mapping.json

Overview

The mapping.json file defines the configuration and schema mappings for an Elasticsearch index. It specifies how documents are indexed, stored, and searched within the Elasticsearch cluster. This configuration includes index settings like shard count and refresh interval, a custom similarity scoring script for text fields, and detailed dynamic mapping templates that automatically map fields based on their names and patterns to appropriate Elasticsearch data types.

This file shapes index behavior to balance search relevance, storage efficiency, and query performance. It is typically supplied when the index is created via the Elasticsearch REST API; static settings such as number_of_shards cannot be changed on an existing index.

Detailed Explanation

Root Structure

settings: Contains index-level settings such as shard count, replica count, refresh interval, and custom similarity algorithms.
mappings: Defines the schema of the documents, including field types and dynamic templates for automatic type inference.

Settings Section

1. index

number_of_shards: integer (value: 2)
Defines how many primary shards the index will have. Sharding enables horizontal scaling.
number_of_replicas: integer (value: 0)
Number of replica shards per primary shard. With zero replicas there is no redundancy: losing a node makes its shards unavailable.
refresh_interval: string (value: "1000ms")
Frequency at which the index is refreshed and made searchable. 1000ms means 1 second.
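
For reference, the three index-level values above correspond to a settings body along the following lines. This is a minimal sketch written as a Python dict and serialized to JSON; the variable names are illustrative, while the keys and values are the ones listed above.

```python
import json

# Index-level settings as described above; the variable name is illustrative.
index_settings = {
    "index": {
        "number_of_shards": 2,         # primary shards
        "number_of_replicas": 0,       # no replica copies, so no redundancy
        "refresh_interval": "1000ms",  # new documents become searchable about once per second
    }
}

# Serialize to the JSON form that would sit under the top-level "settings" key.
settings_json = json.dumps({"settings": index_settings}, indent=2)
print(settings_json)
```

In the actual file the similarity block described next sits alongside index under the same settings key.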

2. similarity

Defines custom similarity scoring methods used during full-text search.

scripted_sim
- type: "scripted"
- script.source: a custom Painless script (Java-like syntax) that calculates the score from:
  - Inverse document frequency (IDF) calculated as:
    idf = log(1 + (field.docCount - term.docFreq + 0.5) / (term.docFreq + 0.5)) / log(1 + ((field.docCount - 0.5) / 1.5))
  - Frequency factor: min(doc.freq, 1)
  - Query boost factor: query.boost
- Purpose: This custom similarity script modifies the scoring to prioritize terms that are rarer across the dataset, with a cap on frequency contribution to avoid over-weighting repeated terms. This improves relevance by emphasizing unique terms.
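
The scoring described above can be sketched in Python to see how the pieces behave. This is an approximation under one stated assumption: the text lists the three factors but not how they are combined, so the sketch multiplies them. The function name and the sample corpus numbers are hypothetical.

```python
import math


def scripted_sim_score(doc_count, doc_freq, term_freq, boost=1.0):
    """Approximate the scripted_sim score for one term in one document.

    Assumption: the IDF, frequency factor, and query boost are multiplied.
    """
    # IDF from the formula above. The denominator equals the numerator for a
    # term that occurs in exactly one document, so such a term has idf == 1.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5)) / math.log(
        1 + (doc_count - 0.5) / 1.5
    )
    # Frequency factor: any number of occurrences counts as one.
    tf = min(term_freq, 1)
    return boost * idf * tf


if __name__ == "__main__":
    # Hypothetical corpus of 1000 documents.
    rare = scripted_sim_score(doc_count=1000, doc_freq=1, term_freq=3)
    common = scripted_sim_score(doc_count=1000, doc_freq=500, term_freq=3)
    print(f"rare term:   {rare:.3f}")
    print(f"common term: {common:.3f}")
```

Two properties follow directly from the formula: repeating a term in a document does not raise its score, and the normalizing denominator puts the rarest possible term (one matching document) at an IDF of 1.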

Mappings Section

1. properties

Defines explicit field mappings:

lat_lon
- type: "geo_point"
- store: "true"
  Stores geographical coordinates with indexing optimized for geo-spatial queries.

2. date_detection

"true"
Enables date detection: string values in new fields are checked against Elasticsearch's dynamic date formats, and fields whose values match are mapped as date.

3. dynamic_templates

Dynamic templates allow automatic field mapping based on field name patterns and regex matching. Each template has:

match: a simple wildcard pattern by default, or a regular expression when match_pattern is set to "regex"
mapping: field type and parameters

Below are key templates:

Template Name	Match Pattern	Mapping Type	Notable Parameters	Description
int	`*_int`	integer	store: true	Maps fields ending with `_int` to integer type
ulong	`*_ulong`	unsigned_long	store: true	Unsigned long integers
long	`*_long`	long	store: true	Signed long integers
short	`*_short`	short	store: true	Short integers
numeric	`*_flt`	float	store: true	Floating-point numbers
tks	`*_tks`	text	analyzer: whitespace, similarity: scripted_sim, store: true	Tokenized text fields with custom similarity
ltks	`*_ltks`	text	analyzer: whitespace, store: true	Tokenized text fields without custom similarity
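kwd	regex `^(.*_(kwd|id|ids|uid…`	keyword	similarity: boolean	Keyword and identifier fields matched by suffix
dt	regex `^.*(_dt|_time|_at)$`	date	multiple date formats	Date fields ending in `_dt`, `_time`, or `_at`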
nested	`*_nst`	nested		Nested objects for complex hierarchies
object	`*_obj`	object	dynamic: true	JSON-like objects with dynamic fields
string	regex `^.*_(with_weight|list)$`	text	index: false, store: true	Text that is stored but not indexed
rank_feature	`*_fea`	rank_feature		Fields to be used for ranking features
rank_features	`*_feas`	rank_features		Sets of ranking features
dense_vector	`*_512_vec`	dense_vector	dims: 512, similarity: cosine, index: true	512-dimensional dense vectors for similarity search
dense_vector	`*_768_vec`	dense_vector	dims: 768, similarity: cosine, index: true	768-dimensional dense vectors
dense_vector	`*_1024_vec`	dense_vector	dims: 1024, similarity: cosine, index: true	1024-dimensional dense vectors
dense_vector	`*_1536_vec`	dense_vector	dims: 1536, similarity: cosine, index: true	1536-dimensional dense vectors
binary	`*_bin`	binary		Binary data fields
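
To make the matching rules concrete, the following sketch resolves a field name against a subset of the templates above. It is an approximation, not Elasticsearch's implementation: wildcard patterns are emulated with Python's fnmatch, regex templates with re.fullmatch, and the first matching template wins, as in Elasticsearch. The kwd template is left out because its regex is truncated in this document; the names TEMPLATES and resolve_template are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# (template name, pattern, is_regex, mapped type), in the order of the table.
# Subset only; the kwd template is omitted because its pattern is truncated.
TEMPLATES = [
    ("int", "*_int", False, "integer"),
    ("ulong", "*_ulong", False, "unsigned_long"),
    ("long", "*_long", False, "long"),
    ("short", "*_short", False, "short"),
    ("numeric", "*_flt", False, "float"),
    ("tks", "*_tks", False, "text"),
    ("ltks", "*_ltks", False, "text"),
    ("dt", r"^.*(_dt|_time|_at)$", True, "date"),
    ("string", r"^.*_(with_weight|list)$", True, "text"),
    ("dense_vector", "*_512_vec", False, "dense_vector"),
]


def resolve_template(field_name):
    """Return (template name, type) of the first matching template, or None."""
    for name, pattern, is_regex, es_type in TEMPLATES:
        if is_regex:
            matched = re.fullmatch(pattern, field_name) is not None
        else:
            matched = fnmatchcase(field_name, pattern)
        if matched:
            return name, es_type
    return None


if __name__ == "__main__":
    for field in ["age_int", "created_at", "title_ltks", "tags_list", "plain"]:
        print(field, "->", resolve_template(field))
```

A field such as plain, which matches no template, is not typed by this file's naming conventions at all, which is why ingestion code needs to follow the suffixes exactly.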

Important Implementation Details

Custom Similarity Script:
The scripted similarity modifies the traditional TF-IDF approach by applying a custom IDF formula and capping the frequency contribution to 1, thus mitigating term frequency bias.
Dynamic Templates:
This approach allows flexible and automatic field typing based on naming conventions or regex patterns, avoiding the need for explicit mapping for every new field.
Dense Vector Fields:
Support for multiple dense vector sizes and cosine similarity enables advanced vector-based similarity searches (e.g., for machine learning embeddings).
Date Formats:
The date template accepts multiple formats, increasing robustness when ingesting data from different sources.
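
As an illustration of how the dense vector fields might be queried, the sketch below builds a search body with a top-level knn clause. It assumes an Elasticsearch version that supports the knn search option; the function name, the field name embedding_512_vec, and the default values of k and num_candidates are illustrative and not taken from mapping.json.

```python
def build_knn_search_body(field, query_vector, k=10, num_candidates=100):
    """Build a search body with a top-level knn clause for a dense_vector field."""
    # The candidate pool per shard should be at least as large as k.
    if num_candidates < k:
        raise ValueError("num_candidates should be at least k")
    return {
        "knn": {
            "field": field,                       # e.g. a *_512_vec field
            "query_vector": list(query_vector),   # must have the field's dims
            "k": k,                               # neighbors to return
            "num_candidates": num_candidates,     # candidates considered per shard
        }
    }


if __name__ == "__main__":
    # Placeholder query vector; a real one would come from an embedding model.
    body = build_knn_search_body("embedding_512_vec", [0.1] * 512, k=5)
    print(sorted(body["knn"].keys()))
```

Because the fields are mapped with cosine similarity, the query vector must come from the same embedding model, and have the same dimension, as the indexed vectors.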

Interaction With Other System Components

Index Creation API:
This JSON is typically submitted to Elasticsearch when creating an index via the REST API (PUT /index_name with body containing this JSON).
Search Queries:
The similarity script impacts scoring during full-text search queries for fields mapped with scripted_sim.
Ingestion Pipelines:
Data ingestion processes must ensure fields follow the naming conventions so dynamic templates apply correctly.
Vector Search Modules:
Dense vector fields interact with ML models or vector search components that rely on cosine similarity for nearest neighbor queries.
Query DSL:
Users or applications querying this index will leverage field types (keyword, date, nested, etc.) as defined here.

Usage Examples

Example 1: Creating an index with this mapping

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d @mapping.json
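
The same request can be issued from Python with only the standard library. This is a sketch under the same assumptions as the curl command (an unsecured cluster at localhost:9200 and an index named my_index); the function names are hypothetical.

```python
import json
import urllib.request


def build_create_index_request(mapping, index_name="my_index",
                               base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates the index."""
    body = json.dumps(mapping).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def create_index_from_file(path="mapping.json", **kwargs):
    """Load the mapping file and send the request; needs a reachable cluster."""
    with open(path, encoding="utf-8") as fh:
        request = build_create_index_request(json.load(fh), **kwargs)
    # urlopen raises urllib.error.HTTPError on a non-2xx response,
    # for example when the index already exists.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling create_index_from_file() with no arguments mirrors the curl command above.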

Example 2: Indexing a document

{
  "user_id": "12345_uid",
  "created_at": "2024-06-01 15:30:00",
  "lat_lon": "40.7128,-74.0060",
  "description_tks": "quick brown fox jumps",
  "embedding_512_vec": [0.12, 0.34, ..., 0.56]  // 512 floats
}

Most fields are mapped dynamically according to the templates, e.g., user_id as keyword and created_at as date. The lat_lon field uses the explicit geo_point mapping from properties, which applies only to that exact field name.
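
Since the vector dimension is encoded in the field name, an ingestion step can check it before a document is sent. The sketch below is one possible pre-indexing check, not part of mapping.json: the function name is hypothetical, the supported dimensions are the four listed in the template table, and the vector values are placeholders.

```python
import re

# Dimensions that have a dense_vector template in the table above.
SUPPORTED_DIMS = {512, 768, 1024, 1536}
_VEC_FIELD = re.compile(r"_(\d+)_vec$")


def check_vector_fields(doc):
    """Return a list of problems found in *_<dims>_vec fields of a document."""
    problems = []
    for name, value in doc.items():
        match = _VEC_FIELD.search(name)
        if match is None:
            continue  # not a vector field by naming convention
        dims = int(match.group(1))
        if dims not in SUPPORTED_DIMS:
            problems.append(f"{name}: no dense_vector template for {dims} dims")
        elif not isinstance(value, list) or len(value) != dims:
            problems.append(f"{name}: expected a list of {dims} numbers")
    return problems


# Placeholder document; a real vector would come from an embedding model.
doc = {
    "user_id": "12345_uid",
    "created_at": "2024-06-01 15:30:00",
    "lat_lon": "40.7128,-74.0060",
    "description_tks": "quick brown fox jumps",
    "embedding_512_vec": [0.1] * 512,
}
print(check_vector_fields(doc))
```

An empty list means every vector field has a matching template and the right length.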

Visual Diagram

flowchart TD
    A[Settings] --> B[Index Settings]
    A --> C[Similarity]
    B --> B1[number_of_shards: 2]
    B --> B2[number_of_replicas: 0]
    B --> B3[refresh_interval: 1000ms]
    C --> C1[scripted_sim]
    C1 --> C1a[Custom IDF script]

    D[Mappings] --> E[Properties]
    E --> E1[lat_lon: geo_point]

    D --> F[Dynamic Templates]

    F --> F1[int: *_int -> integer]
    F --> F2[ulong: *_ulong -> unsigned_long]
    F --> F3[long: *_long -> long]
    F --> F4[short: *_short -> short]
    F --> F5[numeric: *_flt -> float]
    F --> F6[tks: *_tks -> text + scripted_sim]
    F --> F7[ltks: *_ltks -> text]
    F --> F8[kwd: regex -> keyword + boolean similarity]
    F --> F9[dt: regex -> date with multiple formats]
    F --> F10[nested: *_nst -> nested]
    F --> F11[object: *_obj -> object]
    F --> F12["string: regex -> text (not indexed)"]
    F --> F13[rank_feature: *_fea -> rank_feature]
    F --> F14[rank_features: *_feas -> rank_features]
    F --> F15[dense_vector: *_512_vec -> 512 dims, cosine]
    F --> F16[dense_vector: *_768_vec -> 768 dims, cosine]
    F --> F17[dense_vector: *_1024_vec -> 1024 dims, cosine]
    F --> F18[dense_vector: *_1536_vec -> 1536 dims, cosine]
    F --> F19[binary: *_bin -> binary]

Summary

mapping.json is a comprehensive Elasticsearch index configuration file that governs index behavior, scoring, and field mapping. It leverages dynamic templates for flexible field typing, includes a custom similarity script to tailor search relevance, and supports advanced data types like geo points and dense vectors for modern search scenarios.

Because most fields are typed by name, ingestion code and this file must stay in agreement: a field whose name matches no template will not receive the types described here.