I cleaned India's Census 2011 data so you never have to

An engineer released a cleaned, validated dataset of India's Census 2011 district-level data on Hugging Face, containing 640 districts with 29 columns and zero missing values. The district populations sum exactly to India's official total of 1,210,854,977, and every district carries an LGD code for reliable joins. The project is part of the indiaset initiative to build India's open data layer.

Every Indian data scientist hits the same wall. You need district-level population data. You go to censusindia.gov.in. You find hundreds of inconsistent Excel files with merged headers, footnote rows, and zero documentation. You spend a full day just loading the data before doing any actual analysis.

I fixed that. Once. For everyone.

[indiaset/census-2011](https://huggingface.co/datasets/indiaset/census-2011)

India's Census 2011 district data: clean, typed, and ready for pandas.

640 districts · 29 columns · 0 missing values

Validated against the official India total · LGD codes attached

```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Download the parquet file from the Hugging Face dataset repo (cached locally)
path = hf_hub_download(
    repo_id="indiaset/census-2011",
    filename="census_2011_districts_final.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
print(df.shape)  # (640, 29)
```

| Column | Description |
|---|---|
| `state_code` | Census 2011 state code |
| `state_name` | Official state/UT name |
| `district_code` | Census 2011 district code |
| `district_name` | District name as per Census |
| `lgd_code` | LGD permanent district code |
| `district_name_lgd` | District name as per LGD |
| `pop_total` | Total population |
| `pop_male` | Male population |
| `pop_female` | Female population |
| `pop_under6_total` | Children under 6 years |
| `pop_sc` | Scheduled Caste population |
| `pop_st` | Scheduled Tribe population |
| `literate_total` | Literate persons |
| `literate_male` | Literate males |
| `literate_female` | Literate females |
| `illiterate_total` | Illiterate persons |
| `workers_total` | Total workers |
| `workers_male` | Male workers |
| `workers_female` | Female workers |
| `non_workers_total` | Non-workers |
| `literacy_rate` | Literate / Total × 100 |
| `sex_ratio` | Females per 1000 males |
| `workforce_participation` | Workers / Total × 100 |

Note that `literacy_rate` divides by the total population, young children included, so it runs lower than the official Census literacy rate, which is computed only over people aged 7 and above.

The most important test: do all 640 district populations sum to India's official total?

```python
print(df['pop_total'].sum())
# 1210854977 ✅ exact match, zero discrepancy
```

```
Most literate district  → Pathanamthitta, Kerala    : 88.74%
Least literate district → Alirajpur, Madhya Pradesh : 28.77%
Literacy gap across India                           : 60 points

Highest sex ratio → Mahe, Puducherry     : 1176 per 1000 males
Lowest sex ratio  → Leh, Jammu & Kashmir : 690 per 1000 males

National population → 1,210,854,977
Our district sum    → 1,210,854,977
Difference          → 0 ✅
```

Every district in this dataset carries an LGD (Local Government Directory) code, the Government of India's permanent identifier for every administrative unit. Without LGD codes, joining two Indian datasets is a nightmare:

```python
# Without LGD: name-matching hell
df[df['district'] == 'Leh(Ladakh)']  # misses "Leh Ladakh", "Leh", "LEH"

# With LGD: bulletproof
df[df['lgd_code'] == 9]  # always works, regardless of spelling
```

This dataset has LGD codes for all 640 districts, including manually verified codes for Yanam and Mahe, two tiny Puducherry enclaves missing from the official LGD export.

⚠️ This data reflects 2011 boundaries. Districts created or split after 2011 are not represented.

The full reproducible pipeline is on GitHub. Clone it, run the notebook, and get the exact same parquet file.

```bash
git clone https://github.com/indiaset/census-2011-pipeline
```

Raw file → filter → clean → validate → LGD join → parquet. Every step documented. Every decision explained.

This is dataset 1 under indiaset, India's open data layer.

| Dataset | Status |
|---|---|
| Census 2011 districts | ✅ Live |
| Indian Elections 1951–2024 | 🔜 Coming |
| RBI Economic Series | 🔜 Coming |
| `pip install indiaset` | 🔜 Coming |

Jaiswal, Ansuman. (2026). *India Census 2011 - District Level Dataset*. indiaset. Hugging Face.
https://huggingface.co/datasets/indiaset/census-2011

Licensed under CC-BY-4.0: free to use, just credit the source.

🔗 Dataset → https://huggingface.co/datasets/indiaset/census-2011

🔗 Pipeline → https://github.com/indiaset/census-2011-pipeline

🔗 Follow → https://x.com/indiaset_data
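
One more thing for anyone joining their own data onto this: here is a minimal sketch of the LGD join idea from earlier. It is not code from the pipeline. The `census` frame is a toy stand-in for the real dataset, `other` is a hypothetical second dataset, and every value except Leh's `lgd_code` of 9 is a made-up placeholder, not a census figure. It assumes both sources carry an integer `lgd_code` column.

```python
import pandas as pd

# Toy stand-in for the census frame (placeholder values, NOT census figures).
census = pd.DataFrame({
    "lgd_code": [9, 101, 102],  # 101 and 102 are hypothetical codes
    "district_name": ["Leh(Ladakh)", "District A", "District B"],
    "pop_total": [1000, 2000, 3000],
})

# Hypothetical second dataset that spells district names differently.
other = pd.DataFrame({
    "lgd_code": [9, 101, 103],
    "district": ["LEH", "district a", "District C"],
    "clinics": [4, 7, 2],
})

# Joining on names: spelling and case differences silently drop every row.
by_name = census.merge(
    other, left_on="district_name", right_on="district", how="inner"
)

# Joining on LGD code: keep every census district and flag the unmatched ones.
by_code = census.merge(
    other[["lgd_code", "clinics"]],
    on="lgd_code",
    how="left",
    validate="one_to_one",  # raises if either side has duplicate codes
    indicator=True,         # adds a `_merge` column describing each row
)
unmatched = by_code.loc[by_code["_merge"] == "left_only", "lgd_code"].tolist()

print(len(by_name), len(by_code), unmatched)  # 0 3 [102]
```

The `validate` and `indicator` arguments are there so a bad join fails loudly: duplicate codes raise an error, and districts with no partner show up in `unmatched` instead of vanishing.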