Source: dev.to

I cleaned India's Census 2011 data so you never have to

An engineer released a cleaned, validated dataset of India's Census 2011 district-level data on Hugging Face, containing 640 districts with 29 columns and zero missing values. The dataset sums exactly to India's official total population of 1,210,854,977 and includes LGD codes for reliable joins. The project is part of the indiaset initiative to build India's open data layer.

Published Jun 16, 2026 · 3 min read

Every Indian data scientist hits the same wall.

You need district-level population data. You go to censusindia.gov.in.

You find hundreds of inconsistent Excel files with merged headers, footnote rows, and zero documentation.

You spend a full day just cleaning the data before doing any actual analysis.

I fixed that. Once. For everyone.

indiaset/census-2011

India's Census 2011 district data, clean, typed, and ready for pandas.

640 districts · 29 columns · 0 missing values

Validated against official India total · LGD codes attached

from huggingface_hub import hf_hub_download
import pandas as pd

path = hf_hub_download(
    repo_id="indiaset/census-2011",
    filename="census_2011_districts_final.parquet",
    repo_type="dataset"
)
df = pd.read_parquet(path)
print(df.shape)  # (640, 29)

Column                    Description
state_code                Census 2011 state code
state_name                Official state/UT name
district_code             Census 2011 district code
district_name             District name as per Census
lgd_code                  LGD permanent district code
district_name_lgd         District name as per LGD
pop_total                 Total population
pop_male                  Male population
pop_female                Female population
pop_under6_total          Children under 6 years
pop_sc                    Scheduled Caste population
pop_st                    Scheduled Tribe population
literate_total            Literate persons
literate_male             Literate males
literate_female           Literate females
illiterate_total          Illiterate persons
workers_total             Total workers
workers_male              Male workers
workers_female            Female workers
non_workers_total         Non workers
literacy_rate             Literate / Total × 100
sex_ratio                 Females per 1000 males
workforce_participation   Workers / Total × 100
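
The three derived columns follow directly from the formulas in the table, so they can be recomputed as a sanity check. Below is a minimal sketch of that arithmetic; the function names are illustrative and not part of the dataset or the indiaset pipeline, and the sample numbers are made up.

```python
def literacy_rate(literate_total, pop_total):
    """Literate / Total x 100, as defined in the column table."""
    return literate_total / pop_total * 100


def sex_ratio(pop_female, pop_male):
    """Females per 1000 males."""
    return pop_female / pop_male * 1000


def workforce_participation(workers_total, pop_total):
    """Workers / Total x 100."""
    return workers_total / pop_total * 100


# Made-up numbers for one hypothetical district, not real census figures.
row = {
    "pop_total": 1000,
    "pop_male": 520,
    "pop_female": 480,
    "literate_total": 650,
    "workers_total": 400,
}

print(round(literacy_rate(row["literate_total"], row["pop_total"]), 2))
print(round(sex_ratio(row["pop_female"], row["pop_male"]), 2))
print(round(workforce_participation(row["workers_total"], row["pop_total"]), 2))
```

Because these are plain arithmetic expressions, the same functions should also work element-wise on pandas columns, for example literacy_rate(df['literate_total'], df['pop_total']).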

The most important test: do all 640 district populations sum to India's official total?

print(df['pop_total'].sum())

Most literate district → Pathanamthitta, Kerala : 88.74%

Least literate district → Alirajpur, Madhya Pradesh : 28.77%

Literacy gap across India : about 60 percentage points

Highest sex ratio → Mahe, Puducherry : 1176 females per 1000 males

Lowest sex ratio → Leh, Jammu & Kashmir : 690 females per 1000 males

National population → 1,210,854,977

Our district sum → 1,210,854,977

Difference → 0 ✅
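
One way to express this check as a reusable function is sketched below. The function names are illustrative, and the male-plus-female consistency check is an extra idea rather than something the article describes. The toy numbers are made up; only the national total is taken from the text above.

```python
OFFICIAL_TOTAL_2011 = 1_210_854_977  # national total quoted in this article


def check_population_sum(district_pops, official_total=OFFICIAL_TOTAL_2011):
    """Return (district_sum, difference) for an iterable of district populations."""
    district_sum = int(sum(district_pops))
    return district_sum, official_total - district_sum


def check_sex_split(pop_male, pop_female, pop_total):
    """True when male + female equals the reported total for one district."""
    return pop_male + pop_female == pop_total


# Toy example with made-up numbers and a made-up "official" total of 600.
toy_pops = [100, 200, 300]
print(check_population_sum(toy_pops, official_total=600))  # (600, 0)
print(check_sex_split(52, 48, 100))  # True
```

With the real dataset loaded, check_population_sum(df['pop_total']) should report a difference of 0 if the figures above hold.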

Every district in this dataset carries an LGD (Local Government Directory) code: the Government of India's permanent identifier for every administrative unit.

Without LGD codes, joining two Indian datasets is a nightmare:

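# Fragile: a name match breaks as soon as another source spells it differently
df[df['district_name'] == 'Leh(Ladakh)']

# Robust: the LGD code is the same in every source
df[df['lgd_code'] == 9]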

This dataset has LGD codes for all 640 districts, including manual verification of Yanam and Mahe, two tiny Puducherry enclaves missing from the official LGD export.
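
To make the difference concrete, here is a small self-contained sketch of joining two sources on a district name versus on lgd_code. The join_on helper, the second dataset, its spelling "Leh", and the placeholder_metric column are all hypothetical; only the spelling 'Leh(Ladakh)' and the code 9 come from the snippet above.

```python
def join_on(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key`; left values win on conflicts."""
    right_index = {row[key]: row for row in right_rows}
    joined = []
    for left in left_rows:
        match = right_index.get(left[key])
        if match is not None:
            joined.append({**match, **left})
    return joined


# Census-style row, using the name spelling and LGD code shown in the article.
census = [{"district": "Leh(Ladakh)", "lgd_code": 9}]

# Hypothetical second dataset that spells the same district differently.
other = [{"district": "Leh", "lgd_code": 9, "placeholder_metric": 1.0}]

by_name = join_on(census, other, "district")
by_code = join_on(census, other, "lgd_code")

print(len(by_name), len(by_code))  # 0 1
```

In pandas the equivalent would be a merge on the shared code column, for example df.merge(other_df, on='lgd_code'); passing validate='one_to_one' makes pandas raise an error if either side has duplicate keys.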

⚠️ This data reflects 2011 boundaries.

The full reproducible pipeline is on GitHub.

Clone it, run the notebook, get the exact same parquet file.

git clone https://github.com/indiaset/census-2011-pipeline

Raw file → filter → clean → validate → LGD join → parquet.

Every step documented. Every decision explained.

This is dataset #1 under indiaset, India's open data layer.

Dataset                       Status
Census 2011 districts         ✅ Live
Indian Elections 1951–2024    🔜 Coming
RBI Economic Series           🔜 Coming
pip install indiaset          🔜 Coming

Jaiswal, Ansuman. (2026). India Census 2011 - District Level [Dataset]. indiaset. Hugging Face. https://huggingface.co/datasets/indiaset/census-2011

Licensed under CC-BY-4.0 - free to use, just credit the source.

🔗 Dataset → https://huggingface.co/datasets/indiaset/census-2011

🔗 Pipeline → https://github.com/indiaset/census-2011-pipeline

🔗 Follow → https://x.com/indiaset_data
