Every Indian data scientist hits the same wall.
You need district-level population data. You go to censusindia.gov.in.
You find hundreds of inconsistent Excel files with merged headers,
footnote rows, and zero documentation.
You spend a full day just cleaning the data before doing any actual analysis.
I fixed that. Once. For everyone.
India's Census 2011 district data, clean, typed, and ready for pandas.
640 districts · 29 columns · 0 missing values
Validated against official India total · LGD codes attached
from huggingface_hub import hf_hub_download
import pandas as pd
path = hf_hub_download(
    repo_id="indiaset/census-2011",
    filename="census_2011_districts_final.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
print(df.shape) # (640, 29)
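
Once the file is loaded, the headline claims (no missing values, numeric count columns) are quick to confirm for yourself. The helper below is a minimal sketch: `summarize_frame` is a hypothetical name, not part of the dataset or any library, and the three-row frame is made-up stand-in data so the snippet runs offline. On the real file you would pass the `df` loaded above.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Return basic quality facts: shape, missing cells, non-numeric columns."""
    non_numeric = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    return {
        "shape": frame.shape,
        "missing_cells": int(frame.isna().sum().sum()),
        "non_numeric_columns": non_numeric,
    }


# Made-up stand-in rows (not real census figures), using column names
# from the dataset.
sample = pd.DataFrame(
    {
        "district_name": ["A", "B", "C"],
        "pop_total": [1000, 2000, 1500],
        "pop_male": [520, 1010, 760],
        "pop_female": [480, 990, 740],
    }
)

summary = summarize_frame(sample)
print(summary)
```

On the full dataset, the only non-numeric columns should be the name columns and possibly the code columns, depending on how they are typed in the parquet file.
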
| Column | Description |
|---|---|
| `state_code` | Census 2011 state code |
| `state_name` | Official state/UT name |
| `district_code` | Census 2011 district code |
| `district_name` | District name as per Census |
| `lgd_code` | LGD permanent district code |
| `district_name_lgd` | District name as per LGD |
| `pop_total` | Total population |
| `pop_male` | Male population |
| `pop_female` | Female population |
| `pop_under6_total` | Children under 6 years |
| `pop_sc` | Scheduled Caste population |
| `pop_st` | Scheduled Tribe population |
| `literate_total` | Literate persons |
| `literate_male` | Literate males |
| `literate_female` | Literate females |
| `illiterate_total` | Illiterate persons |
| `workers_total` | Total workers |
| `workers_male` | Male workers |
| `workers_female` | Female workers |
| `non_workers_total` | Non workers |
| `literacy_rate` | Literate / Total × 100 |
| `sex_ratio` | Females per 1000 males |
| `workforce_participation` | Workers / Total × 100 |
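
The three ratio columns at the end of the table can be recomputed from the raw counts. The sketch below assumes that "Total" in those formulas means `pop_total` and that "Literate" and "Workers" mean `literate_total` and `workers_total`; `add_derived_columns` is a hypothetical helper name, and the single row is made-up data, not a real district.

```python
import pandas as pd


def add_derived_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Recompute the three ratio columns from raw counts, per the table's formulas."""
    out = frame.copy()
    # Assumption: "Total" in the table refers to pop_total.
    out["literacy_rate"] = out["literate_total"] / out["pop_total"] * 100
    out["sex_ratio"] = out["pop_female"] / out["pop_male"] * 1000
    out["workforce_participation"] = out["workers_total"] / out["pop_total"] * 100
    return out


# Made-up counts for one imaginary district (not real census figures).
sample = pd.DataFrame(
    {
        "pop_total": [2000],
        "pop_male": [1000],
        "pop_female": [1000],
        "literate_total": [1500],
        "workers_total": [800],
    }
)

derived = add_derived_columns(sample)
print(derived[["literacy_rate", "sex_ratio", "workforce_participation"]])
```

Comparing the recomputed values with the stored columns (within rounding) is a cheap way to confirm you are reading the definitions the same way the dataset does.
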
The most important test - do all 640 district populations
sum to India's official total?
print(df['pop_total'].sum())
Most literate district → Pathanamthitta, Kerala : 88.74%
Least literate district → Alirajpur, Madhya Pradesh : 28.77%
Literacy gap across India : 60 points
Highest sex ratio → Mahe, Puducherry : 1176 per 1000 males
Lowest sex ratio → Leh, Jammu & Kashmir : 690 per 1000 males
National population → 1,210,854,977
Our district sum → 1,210,854,977
Difference → 0 ✅
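
The sum check can be wrapped into a small reusable function. This is a sketch rather than the pipeline's actual validation code: `validate_population` is a hypothetical name, the male-plus-female consistency check is an extra I am assuming is reasonable for this data, and the sample rows are made up so the snippet runs without downloading anything.

```python
import pandas as pd

OFFICIAL_INDIA_TOTAL = 1_210_854_977  # national figure quoted in this post


def validate_population(frame: pd.DataFrame, expected_total: int) -> dict:
    """Compare the district sum with a reference total and check male + female == total."""
    district_sum = int(frame["pop_total"].sum())
    # Assumed extra check: every row's male and female counts add up to its total.
    sex_mismatch = (frame["pop_male"] + frame["pop_female"]) != frame["pop_total"]
    return {
        "district_sum": district_sum,
        "difference": district_sum - expected_total,
        "rows_with_sex_mismatch": int(sex_mismatch.sum()),
    }


# Made-up rows (not real census figures). On the real file, call
# validate_population(df, OFFICIAL_INDIA_TOTAL) instead.
sample = pd.DataFrame(
    {
        "pop_total": [1000, 2000, 1500],
        "pop_male": [520, 1010, 760],
        "pop_female": [480, 990, 740],
    }
)

report = validate_population(sample, expected_total=4500)
print(report)
```

A non-zero `difference` or any mismatching rows would be the signal to go back to the raw files before trusting downstream numbers.
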
Every district in this dataset carries an LGD code - the Government of India's permanent identifier for every administrative unit.
Without LGD codes, joining two Indian datasets is a nightmare:
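# Name match: breaks whenever a source spells the district differently
df[df['district'] == 'Leh(Ladakh)']
# Code match: finds the same row regardless of spelling
df[df['lgd_code'] == 9]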
This dataset has LGD codes for all 640 districts,
including manual verification of Yanam and Mahe - two tiny Puducherry enclaves missing from the official LGD export.
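
To make the difference concrete, here is a minimal sketch of joining two sources by name versus by code. Both frames are made up, the `hospitals` column is invented for illustration, and the `lgd_code` values are placeholders rather than real LGD codes.

```python
import pandas as pd

# Two made-up sources that spell the same districts differently.
# The lgd_code values here are placeholders, not real LGD codes.
census = pd.DataFrame(
    {
        "lgd_code": [101, 102, 103],
        "district_name": ["Leh(Ladakh)", "Mahe", "Yanam"],
        "pop_total": [1000, 2000, 1500],
    }
)
other = pd.DataFrame(
    {
        "lgd_code": [101, 102, 103],
        "district": ["Leh", "Mahé", "Yanam"],
        "hospitals": [3, 1, 2],
    }
)

# Joining on names matches only rows whose spelling happens to agree.
by_name = census.merge(other, left_on="district_name", right_on="district")

# Joining on the code matches every row; validate= raises if a code repeats,
# and indicator= adds a _merge column showing where each row came from.
by_code = census.merge(
    other[["lgd_code", "hospitals"]],
    on="lgd_code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

matched_by_code = int((by_code["_merge"] == "both").sum())
print(len(by_name), "rows matched by name")
print(matched_by_code, "rows matched by lgd_code")
```

Using `how="left"` with `indicator=True` keeps unmatched districts visible as `left_only` rows instead of silently dropping them, which matters when the other source uses post-2011 boundaries.
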
⚠️ This data reflects 2011 boundaries.
The full reproducible pipeline is on GitHub.
Clone it, run the notebook, get the exact same parquet file.
git clone https://github.com/indiaset/census-2011-pipeline
Raw file → filter → clean → validate → LGD join → parquet.
Every step documented. Every decision explained.
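
For readers who want a feel for the shape of such a pipeline before cloning the repo, here is a rough sketch of the filter, clean, validate, and LGD join stages. It is not the code from the repository: every function name, the `level` column, and the raw rows are hypothetical, chosen only to imitate a messy spreadsheet export.

```python
import pandas as pd


def filter_rows(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep district-level rows; drop footnotes and other levels."""
    return raw[raw["level"] == "DISTRICT"].copy()


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Trim names and turn count strings such as '1,000' into integers."""
    out = frame.copy()
    out["district_name"] = out["district_name"].str.strip()
    out["pop_total"] = (
        out["pop_total"].str.replace(",", "", regex=False).astype("int64")
    )
    return out


def validate(frame: pd.DataFrame, expected_total: int) -> pd.DataFrame:
    """Fail loudly if anything is missing or the district sum is off."""
    assert not frame.isna().any().any(), "missing values found"
    assert frame["pop_total"].sum() == expected_total, "district sum mismatch"
    return frame


def join_lgd(frame: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    """Attach LGD codes via the census district code."""
    return frame.merge(lookup, on="district_code", how="left", validate="one_to_one")


# Made-up raw rows: one state-level row, two districts, one footnote row.
raw = pd.DataFrame(
    {
        "level": ["STATE", "DISTRICT", "DISTRICT", None],
        "district_code": ["000", "001", "002", None],
        "district_name": ["State X", " Alpha ", "Beta", "Note: provisional"],
        "pop_total": ["3,000", "1,000", "2,000", None],
    }
)
# Placeholder codes, not real LGD codes.
lookup = pd.DataFrame({"district_code": ["001", "002"], "lgd_code": [11, 12]})

result = join_lgd(validate(clean(filter_rows(raw)), expected_total=3000), lookup)
print(result)
# The last stage would write the frame out with result.to_parquet(...),
# which needs pyarrow or fastparquet installed.
```

Keeping each stage as a function that takes and returns a DataFrame makes it easy to inspect the data between steps, which is where most census cleaning surprises show up.
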
This is dataset #1 under indiaset -
India's open data layer.
| Dataset | Status |
|---|---|
| Census 2011 districts | ✅ Live |
| Indian Elections 1951–2024 | Coming |
| RBI Economic Series | Coming |
| `pip install indiaset` | Coming |
Jaiswal, Ansuman. (2026). India Census 2011 - District Level
[Dataset]. indiaset. Hugging Face.
https://huggingface.co/datasets/indiaset/census-2011
Licensed under CC-BY-4.0 - free to use, just credit the source.
Dataset → https://huggingface.co/datasets/indiaset/census-2011
Pipeline → https://github.com/indiaset/census-2011-pipeline
Follow → https://x.com/indiaset_data