# I cleaned India's Census 2011 data so you never have to

> Source: <https://dev.to/iam-ansuman/i-cleaned-indias-census-2011-data-so-you-never-have-to-4g2m>
> Published: 2026-06-16 12:36:40+00:00

Every Indian data scientist hits the same wall.

You need district-level population data. You go to censusindia.gov.in.

You find hundreds of inconsistent Excel files with merged headers, footnote rows, and zero documentation.

You spend a full day just loading the data before doing any actual analysis.

I fixed that. Once. For everyone.

[indiaset/census-2011](https://huggingface.co/datasets/indiaset/census-2011)

India's Census 2011 district data, clean, typed, and ready for pandas.

640 districts · 29 columns · 0 missing values

Validated against official India total · LGD codes attached

``` python
from huggingface_hub import hf_hub_download
import pandas as pd

path = hf_hub_download(
    repo_id="indiaset/census-2011",
    filename="census_2011_districts_final.parquet",
    repo_type="dataset"
)
df = pd.read_parquet(path)
print(df.shape)  # (640, 29)
```

| Column | Description |
|---|---|
| `state_code` | Census 2011 state code |
| `state_name` | Official state/UT name |
| `district_code` | Census 2011 district code |
| `district_name` | District name as per Census |
| `lgd_code` | LGD permanent district code |
| `district_name_lgd` | District name as per LGD |
| `pop_total` | Total population |
| `pop_male` | Male population |
| `pop_female` | Female population |
| `pop_under6_total` | Children under 6 years |
| `pop_sc` | Scheduled Caste population |
| `pop_st` | Scheduled Tribe population |
| `literate_total` | Literate persons |
| `literate_male` | Literate males |
| `literate_female` | Literate females |
| `illiterate_total` | Illiterate persons |
| `workers_total` | Total workers |
| `workers_male` | Male workers |
| `workers_female` | Female workers |
| `non_workers_total` | Non-workers |
| `literacy_rate` | Literate / Total × 100 |
| `sex_ratio` | Females per 1000 males |
| `workforce_participation` | Workers / Total × 100 |
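The three derived columns follow directly from the formulas in the table. Here is a minimal sketch of how they could be recomputed, assuming the column names above; the `toy` rows and the `add_derived_columns` helper are made up for illustration and are not part of the dataset or the pipeline.

```python
import pandas as pd

# Toy rows with made-up numbers, only to illustrate the formulas.
toy = pd.DataFrame({
    "district_name": ["A", "B"],
    "pop_total": [1000, 2000],
    "pop_male": [500, 1100],
    "pop_female": [500, 900],
    "literate_total": [700, 1000],
    "workers_total": [400, 900],
})


def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three ratio columns recomputed."""
    out = df.copy()
    # Literate / Total × 100
    out["literacy_rate"] = out["literate_total"] / out["pop_total"] * 100
    # Females per 1000 males
    out["sex_ratio"] = out["pop_female"] / out["pop_male"] * 1000
    # Workers / Total × 100
    out["workforce_participation"] = out["workers_total"] / out["pop_total"] * 100
    return out


derived = add_derived_columns(toy)
print(derived[["district_name", "literacy_rate", "sex_ratio", "workforce_participation"]])
```

One thing to keep in mind: `literacy_rate` as defined here divides by total population. The officially published Census literacy rate uses the population aged 7 and above as the denominator, so these figures will come out lower than the official ones.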

The most important test: do all 640 district populations sum to India's official total?

``` python
print(df['pop_total'].sum())
# 1210854977 ✅ exact match, zero discrepancy
```

Most literate district → Pathanamthitta, Kerala : 88.74%

Least literate district → Alirajpur, Madhya Pradesh : 28.77%

Literacy gap across India : 59.97 percentage points

Highest sex ratio → Mahe, Puducherry : 1176 per 1000 males

Lowest sex ratio → Leh, Jammu & Kashmir : 690 per 1000 males
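Rankings like these can be pulled out with `idxmax` and `idxmin`. A minimal sketch, assuming the column names from the table above; the `extremes` helper and the toy values are hypothetical, so run it on the real `df` to reproduce the figures quoted here.

```python
import pandas as pd

# Toy rows with made-up values; swap in the real DataFrame to get actual results.
toy = pd.DataFrame({
    "district_name": ["A", "B", "C"],
    "state_name": ["X", "Y", "Z"],
    "literacy_rate": [55.0, 80.5, 30.2],
    "sex_ratio": [950.0, 1020.0, 870.0],
})


def extremes(df: pd.DataFrame, column: str):
    """Return the rows holding the highest and lowest value of `column`."""
    cols = ["district_name", "state_name", column]
    highest = df.loc[df[column].idxmax(), cols]
    lowest = df.loc[df[column].idxmin(), cols]
    return highest, lowest


most_literate, least_literate = extremes(toy, "literacy_rate")
gap = most_literate["literacy_rate"] - least_literate["literacy_rate"]
print(most_literate["district_name"], least_literate["district_name"], round(gap, 2))
```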

National population → 1,210,854,977

Our district sum → 1,210,854,977

Difference → 0 ✅
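A check like this is easy to turn into a reusable validation step. The sketch below is not the pipeline's actual code: the `validate` helper is hypothetical, and the male + female consistency check is an extra assumption on my part, not something the article claims to test.

```python
import pandas as pd


def validate(df: pd.DataFrame, expected_total: int) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    total = int(df["pop_total"].sum())
    if total != expected_total:
        problems.append(f"population sum {total} != expected {expected_total}")
    if df.isna().any().any():
        problems.append("missing values present")
    # Assumption: male and female counts should add up to the total in every row.
    if not (df["pop_male"] + df["pop_female"] == df["pop_total"]).all():
        problems.append("pop_male + pop_female != pop_total in at least one row")
    return problems


# Toy rows with made-up numbers. For the real data, expected_total would be
# 1_210_854_977, the national figure quoted in this post.
toy = pd.DataFrame({
    "pop_total": [1000, 2000],
    "pop_male": [500, 1100],
    "pop_female": [500, 900],
})

problems = validate(toy, expected_total=3000)
print(problems or "all checks passed")
```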

Every district in this dataset carries an LGD (Local Government Directory) code - the Government of India's permanent identifier for every administrative unit.

Without LGD codes, joining two Indian datasets is a nightmare:

``` python
# without LGD - name matching hell
df[df['district_name'] == 'Leh(Ladakh)']
# misses: "Leh Ladakh", "Leh", "LEH"

# with LGD - bulletproof
df[df['lgd_code'] == 9]
# always works, regardless of spelling
```

This dataset has LGD codes for all 640 districts, including manual verification of Yanam and Mahe - two tiny Puducherry enclaves missing from the official LGD export.
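To make the join concrete, here is a rough sketch of merging on `lgd_code` with pandas. Everything in it is illustrative: the `other` dataset, its `rainfall_mm` column, the code values, and the population numbers are made up. The `validate` and `indicator` arguments of `DataFrame.merge` are real and worth using, since they surface duplicate or unmatched codes instead of dropping rows silently.

```python
import pandas as pd

# Made-up codes and values, for illustration only.
census = pd.DataFrame({
    "lgd_code": [101, 102],
    "district_name": ["Leh(Ladakh)", "Kargil"],
    "pop_total": [1000, 2000],
})

# A hypothetical second dataset that spells the district names differently.
other = pd.DataFrame({
    "lgd_code": [102, 101],
    "district": ["KARGIL", "Leh Ladakh"],
    "rainfall_mm": [300.0, 100.0],
})

merged = census.merge(
    other[["lgd_code", "rainfall_mm"]],
    on="lgd_code",
    how="left",
    validate="one_to_one",  # raises if either side has duplicate codes
    indicator=True,         # adds a _merge column: both / left_only / right_only
)

# Rows that found no partner in the other dataset.
unmatched = merged[merged["_merge"] != "both"]
print(merged[["district_name", "pop_total", "rainfall_mm"]])
print(f"unmatched rows: {len(unmatched)}")
```

Matching on the names instead would find nothing here, because neither spelling in `other` equals the Census spelling.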

⚠️ This data reflects 2011 boundaries. Districts created or split since then do not appear as separate rows.

The full reproducible pipeline is on GitHub.

Clone it, run the notebook, get the exact same parquet file.

```
git clone https://github.com/indiaset/census-2011-pipeline
```

Raw file → filter → clean → validate → LGD join → parquet.

Every step documented. Every decision explained.
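For a feel of what the filter and clean steps involve, here is a small sketch on an in-memory frame. It is not the notebook's code: the raw column names (`Level`, `Name`, `TRU`, `TOT_P`), the sample rows, and the `clean` helper are all assumptions chosen to mimic a messy sheet with state-level rows, a footnote row, stray whitespace, and comma-formatted numbers.

```python
import pandas as pd

# Hypothetical raw rows, roughly as they might look after reading a messy sheet.
raw = pd.DataFrame({
    "Level": ["DISTRICT", "STATE", "DISTRICT", "Note: provisional"],
    "Name": [" Alpha ", "Some State", "Beta", None],
    "TRU": ["Total", "Total", "Total", None],
    "TOT_P": ["1,000", "3,000", "2000", None],
})


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep district-level total rows, then normalise names and types."""
    # filter: drops state rows and the footnote row in one go
    keep = (raw["Level"] == "DISTRICT") & (raw["TRU"] == "Total")
    out = raw[keep].copy()
    # clean: strip whitespace, remove thousands separators, cast to integers
    out["district_name"] = out["Name"].str.strip()
    out["pop_total"] = pd.to_numeric(
        out["TOT_P"].str.replace(",", "", regex=False)
    ).astype("int64")
    return out[["district_name", "pop_total"]].reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
# The later steps (validate, LGD join, writing parquet) would follow from here.
```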

This is dataset #1 under **indiaset** - India's open data layer.

| Dataset | Status |
|---|---|
| Census 2011 districts | ✅ Live |
| Indian Elections 1951–2024 | 🔜 Coming |
| RBI Economic Series | 🔜 Coming |
| `pip install indiaset` | 🔜 Coming |

Jaiswal, Ansuman. (2026). India Census 2011 - District Level [Dataset]. indiaset. Hugging Face. [https://huggingface.co/datasets/indiaset/census-2011](https://huggingface.co/datasets/indiaset/census-2011)

Licensed under **CC-BY-4.0** - free to use, just credit the source.

🔗 Dataset → [https://huggingface.co/datasets/indiaset/census-2011](https://huggingface.co/datasets/indiaset/census-2011)

🔗 Pipeline → [https://github.com/indiaset/census-2011-pipeline](https://github.com/indiaset/census-2011-pipeline)

🔗 Follow → [https://x.com/indiaset_data](https://x.com/indiaset_data)
