I cleaned India's Census 2011 data so you never have to


Source: dev.to · Published 2026-06-16T12:36Z · Topic: developer-tools

An engineer released a cleaned, validated dataset of India's Census 2011 district-level data on Hugging Face, containing 640 districts with 29 columns and zero missing values. The dataset sums exactly to India's official total population of 1,210,854,977 and includes LGD codes for reliable joins. The project is part of the indiaset initiative to build India's open data layer.

3 min read · Published Jun 16, 2026

Every Indian data scientist hits the same wall.

You need district-level population data. You go to censusindia.gov.in.

You find hundreds of inconsistent Excel files with merged headers, footnote rows, and zero documentation.

You spend a full day just cleaning the data before doing any actual analysis.

I fixed that. Once. For everyone.

indiaset/census-2011

India's Census 2011 district data, clean, typed, and ready for pandas.

640 districts · 29 columns · 0 missing values

Validated against official India total · LGD codes attached

from huggingface_hub import hf_hub_download
import pandas as pd

path = hf_hub_download(
    repo_id="indiaset/census-2011",
    filename="census_2011_districts_final.parquet",
    repo_type="dataset"
)
df = pd.read_parquet(path)
print(df.shape)  # (640, 29)
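The headline figures (640 rows, 29 columns, no missing values) are easy to re-check once the frame is loaded. Here is a minimal sketch of such a check. `check_shape_and_nulls` is a hypothetical helper name, not part of the dataset or of any library, and the toy frame only stands in for the real download:

```python
import pandas as pd


def check_shape_and_nulls(df: pd.DataFrame, rows: int, cols: int) -> dict:
    """Report whether the shape matches and how many cells are missing."""
    return {
        "shape_ok": df.shape == (rows, cols),
        "missing_values": int(df.isna().sum().sum()),
    }


# Toy stand-in for the real frame (illustrative values only).
toy = pd.DataFrame({"district_name": ["A", "B"], "pop_total": [100, 200]})
report = check_shape_and_nulls(toy, rows=2, cols=2)
print(report)
```

On the real frame the call would be `check_shape_and_nulls(df, rows=640, cols=29)`.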

Column	Description
`state_code`	Census 2011 state code
`state_name`	Official state/UT name
`district_code`	Census 2011 district code
`district_name`	District name as per Census
`lgd_code`	LGD permanent district code
`district_name_lgd`	District name as per LGD
`pop_total`	Total population
`pop_male`	Male population
`pop_female`	Female population
`pop_under6_total`	Children under 6 years
`pop_sc`	Scheduled Caste population
`pop_st`	Scheduled Tribe population
`literate_total`	Literate persons
`literate_male`	Literate males
`literate_female`	Literate females
`illiterate_total`	Illiterate persons
`workers_total`	Total workers
`workers_male`	Male workers
`workers_female`	Female workers
`non_workers_total`	Non workers
`literacy_rate`	Literate / Total × 100
`sex_ratio`	Females per 1000 males
`workforce_participation`	Workers / Total × 100
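The last three columns in the table are derived from the raw counts, so they can be recomputed as a cross-check. A minimal sketch, assuming the formulas are exactly as the table states them; `add_derived_rates` is a hypothetical helper and the toy row uses made-up numbers:

```python
import pandas as pd


def add_derived_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Recompute the three derived columns from the raw count columns."""
    out = df.copy()
    out["literacy_rate"] = out["literate_total"] / out["pop_total"] * 100
    out["sex_ratio"] = out["pop_female"] / out["pop_male"] * 1000
    out["workforce_participation"] = out["workers_total"] / out["pop_total"] * 100
    return out


# One made-up district, chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({
    "pop_total": [1000],
    "pop_male": [500],
    "pop_female": [500],
    "literate_total": [750],
    "workers_total": [400],
})
rates = add_derived_rates(toy)
print(rates[["literacy_rate", "sex_ratio", "workforce_participation"]])
```

Comparing these recomputed values against the shipped columns (within a small rounding tolerance) is one way to confirm the derived fields match their stated definitions.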

The most important test - do all 640 district populations sum to India's official total?

print(df['pop_total'].sum())

Most literate district → Pathanamthitta, Kerala : 88.74%

Least literate district → Alirajpur, Madhya Pradesh : 28.77%

Literacy gap across India : about 60 percentage points (88.74 − 28.77 = 59.97)

Highest sex ratio → Mahe, Puducherry : 1176 per 1000 males

Lowest sex ratio → Leh, Jammu & Kashmir : 690 per 1000 males
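Rankings like these come down to finding the rows with the largest and smallest value in a column. A hedged sketch using pandas' `idxmax` and `idxmin`; `extremes` is a hypothetical helper, and the toy frame holds only the two literacy figures quoted above rather than the full dataset:

```python
import pandas as pd


def extremes(df: pd.DataFrame, column: str) -> dict:
    """Return the districts with the highest and lowest value of `column`."""
    hi = df.loc[df[column].idxmax()]
    lo = df.loc[df[column].idxmin()]
    return {
        "max": (hi["district_name"], hi["state_name"]),
        "min": (lo["district_name"], lo["state_name"]),
        "gap": round(float(hi[column] - lo[column]), 2),
    }


# Two rows only, using the literacy figures quoted in the article.
toy = pd.DataFrame({
    "district_name": ["Alirajpur", "Pathanamthitta"],
    "state_name": ["Madhya Pradesh", "Kerala"],
    "literacy_rate": [28.77, 88.74],
})
result = extremes(toy, "literacy_rate")
print(result)
```

The same helper would work for `sex_ratio` or `workforce_participation` on the real frame.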

National population → 1,210,854,977

Our district sum → 1,210,854,977

Difference → 0 ✅
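The zero-difference check can be wrapped in a small function so it runs every time the data is rebuilt. This is a sketch under the assumption that `pop_total` holds integer counts; `population_difference` is a hypothetical name, and the toy frame is not real census data:

```python
import pandas as pd

# National figure quoted in the article.
OFFICIAL_TOTAL_2011 = 1_210_854_977


def population_difference(df: pd.DataFrame, official_total: int) -> int:
    """District sum minus the official total; 0 means an exact match."""
    return int(df["pop_total"].sum()) - official_total


# Toy frame: two made-up districts that should sum to 1000.
toy = pd.DataFrame({"pop_total": [600, 400]})
diff = population_difference(toy, official_total=1000)
print(diff)
```

On the real frame the call would be `population_difference(df, OFFICIAL_TOTAL_2011)`.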

Every district in this dataset carries an LGD code - the Government of India's permanent identifier for every administrative unit.

Without LGD codes, joining two Indian datasets is a nightmare:

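# Fragile: matching on a name breaks as soon as the spelling differs
df[df['district'] == 'Leh(Ladakh)']

# Stable: matching on the LGD code does not depend on spelling
df[df['lgd_code'] == 9]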

This dataset has LGD codes for all 640 districts, including manual verification of Yanam and Mahe - two tiny Puducherry enclaves missing from the official LGD export.
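To make the join argument concrete, here is a minimal sketch of merging two toy tables on `lgd_code` with `DataFrame.merge`. The codes, the second table, and its `hospitals` column are all made up for illustration; they are not real LGD codes or real data:

```python
import pandas as pd

# Made-up codes and values; only the column names follow the dataset.
census = pd.DataFrame({
    "lgd_code": [101, 102],
    "district_name": ["Leh(Ladakh)", "Example District"],
    "pop_total": [1000, 2000],
})
other = pd.DataFrame({
    "lgd_code": [102, 101],
    "district": ["Example Dist.", "Leh"],  # spelled differently on purpose
    "hospitals": [5, 3],
})

# Joining on names finds nothing because the spellings differ.
by_name = census.merge(other, left_on="district_name", right_on="district")

# Joining on the code matches every row; validate= guards against duplicates,
# and indicator= adds a _merge column that flags unmatched rows.
merged = census.merge(
    other[["lgd_code", "hospitals"]],
    on="lgd_code",
    how="left",
    validate="one_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] != "both"]
print(len(by_name), len(unmatched))
```

Checking `unmatched` after every join is a cheap way to notice districts that failed to line up.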

⚠️ This data reflects 2011 boundaries.

The full reproducible pipeline is on GitHub.

Clone it, run the notebook, get the exact same parquet file.

git clone https://github.com/indiaset/census-2011-pipeline

Raw file → filter → clean → validate → LGD join → parquet.

Every step documented. Every decision explained.
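For readers who want a feel for what the filter, clean, and validate stages might involve, here is an illustrative sketch on a toy frame. It is not the author's pipeline: the `level` column, the footnote row, and both function names are assumptions made up for this example.

```python
import pandas as pd


def clean_raw(raw: pd.DataFrame) -> pd.DataFrame:
    """Filter to district rows, tidy names, and coerce counts to integers."""
    # Filter: keep district-level rows; state and footnote rows drop out.
    out = raw[raw["level"] == "DISTRICT"].copy()
    # Clean: trim stray whitespace from names.
    out["district_name"] = out["district_name"].str.strip()
    # Type: unparseable values raise instead of passing through silently.
    out["pop_total"] = pd.to_numeric(out["pop_total"], errors="raise").astype("int64")
    return out.drop(columns="level").reset_index(drop=True)


def validate(df: pd.DataFrame, expected_total: int) -> None:
    """Fail loudly on missing values or a population mismatch."""
    if df.isna().any().any():
        raise ValueError("missing values found")
    total = int(df["pop_total"].sum())
    if total != expected_total:
        raise ValueError(f"sum {total} != expected {expected_total}")


# Toy raw extract: one state row, two district rows, one footnote row.
raw = pd.DataFrame({
    "level": ["STATE", "DISTRICT", "DISTRICT", None],
    "district_name": ["Toy State", " Alpha ", "Beta", "Note: toy footnote"],
    "pop_total": ["1000", "600", "400", None],
})
clean = clean_raw(raw)
validate(clean, expected_total=1000)
print(clean)
```

Raising on a failed check, rather than logging a warning, keeps a bad intermediate file from ever reaching the parquet step.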

This is dataset #1 under indiaset - India's open data layer.

Dataset	Status
Census 2011 districts	✅ Live
Indian Elections 1951–2024	🔜 Coming
RBI Economic Series	🔜 Coming
`pip install indiaset`	🔜 Coming

Jaiswal, Ansuman. (2026). India Census 2011 - District Level [Dataset]. indiaset. Hugging Face. https://huggingface.co/datasets/indiaset/census-2011

Licensed under CC-BY-4.0 - free to use, just credit the source.

🔗 Dataset → https://huggingface.co/datasets/indiaset/census-2011

🔗 Pipeline → https://github.com/indiaset/census-2011-pipeline

🔗 Follow → https://x.com/indiaset_data


Read original on dev.to → dev.to/iam-ansuman/i-cleaned-indias-census-2011-…
