{"slug": "i-cleaned-india-s-census-2011-data-so-you-never-have-to", "title": "I cleaned India's Census 2011 data so you never have to", "summary": "An engineer released a cleaned, validated dataset of India's Census 2011 district-level data on Hugging Face, containing 640 districts with 29 columns and zero missing values. The dataset sums exactly to India's official total population of 1,210,854,977 and includes LGD codes for reliable joins. The project is part of the indiaset initiative to build India's open data layer.", "body_md": "Every Indian data scientist hits the same wall.\n\nYou need district-level population data. You go to censusindia.gov.in.\n\nYou find hundreds of inconsistent Excel files with merged headers, footnote rows, and zero documentation.\n\nYou spend a full day just loading the data before doing any actual analysis.\n\nI fixed that. Once. For everyone.\n\n[indiaset/census-2011](https://huggingface.co/datasets/indiaset/census-2011)\n\nIndia's Census 2011 district data, clean, typed, and ready for pandas.\n\n640 districts · 29 columns · 0 missing values\n\nValidated against official India total · LGD codes attached\n\n``` python\nfrom huggingface_hub import hf_hub_download\nimport pandas as pd\n\n# Download the parquet file from the Hugging Face Hub (cached locally after the first call)\npath = hf_hub_download(\n    repo_id=\"indiaset/census-2011\",\n    filename=\"census_2011_districts_final.parquet\",\n    repo_type=\"dataset\"\n)\n# Load it into a DataFrame: one row per district\ndf = pd.read_parquet(path)\nprint(df.shape)  # (640, 29)\n```\n\n| Column | Description |\n|---|---|\n| `state_code` | Census 2011 state code |\n| `state_name` | Official state/UT name |\n| `district_code` | Census 2011 district code |\n| `district_name` | District name as per Census |\n| `lgd_code` | LGD permanent district code |\n| `district_name_lgd` | District name as per LGD |\n| `pop_total` | Total population |\n| `pop_male` | Male population |\n| `pop_female` | Female population |\n| `pop_under6_total` | Children under 6 years |\n| `pop_sc` | Scheduled Caste population |\n| `pop_st` | Scheduled Tribe 
population |\n| `literate_total` | Literate persons |\n| `literate_male` | Literate males |\n| `literate_female` | Literate females |\n| `illiterate_total` | Illiterate persons |\n| `workers_total` | Total workers |\n| `workers_male` | Male workers |\n| `workers_female` | Female workers |\n| `non_workers_total` | Non-workers |\n| `literacy_rate` | Literate / Total × 100 |\n| `sex_ratio` | Females per 1000 males |\n| `workforce_participation` | Workers / Total × 100 |\n\nThe most important test: do all 640 district populations sum to India's official total?\n\n``` python\nprint(df['pop_total'].sum())\n# 1210854977 ✅ - exact match, zero discrepancy\n```\n\nMost literate district → Pathanamthitta, Kerala : 88.74%\n\nLeast literate district → Alirajpur, Madhya Pradesh : 28.77%\n\nLiteracy gap across India : 60 points\n\nHighest sex ratio → Mahe, Puducherry : 1176 females per 1000 males\n\nLowest sex ratio → Leh, Jammu & Kashmir : 690 females per 1000 males\n\nNational population → 1,210,854,977\n\nOur district sum → 1,210,854,977\n\nDifference → 0 ✅\n\nEvery district in this dataset carries an LGD code - the Government of India's permanent identifier for every administrative unit.\n\nWithout LGD codes, joining two Indian datasets is a nightmare:\n\n``` python\n# without LGD - name matching hell\ndf[df['district'] == 'Leh(Ladakh)']\n# misses: \"Leh Ladakh\", \"Leh\", \"LEH\"\n\n# with LGD - bulletproof\ndf[df['lgd_code'] == 9]\n# always works, regardless of spelling\n```\n\nThis dataset has LGD codes for all 640 districts, including manual verification of Yanam and Mahe - two tiny Puducherry enclaves missing from the official LGD export.\n\n⚠️ This data reflects 2011 boundaries. Districts created or split after the census do not appear as separate rows.\n\nThe full reproducible pipeline is on GitHub.\n\nClone it, run the notebook, get the exact same parquet file.\n\n``` bash\ngit clone https://github.com/indiaset/census-2011-pipeline\n```\n\nRaw file → filter → clean → validate → LGD join → parquet.\n\nEvery step documented. 
Every decision explained.\n\nThis is dataset #1 under **indiaset**, India's open data layer.\n\n| Dataset | Status |\n|---|---|\n| Census 2011 districts | ✅ Live |\n| Indian Elections 1951–2024 | 🔜 Coming |\n| RBI Economic Series | 🔜 Coming |\n| `pip install indiaset` | 🔜 Coming |\n\nJaiswal, Ansuman. (2026). India Census 2011 - District Level [Dataset]. indiaset. Hugging Face. [https://huggingface.co/datasets/indiaset/census-2011](https://huggingface.co/datasets/indiaset/census-2011)\n\nLicensed under **CC-BY-4.0** - free to use, just credit the source.\n\n🔗 Dataset → [https://huggingface.co/datasets/indiaset/census-2011](https://huggingface.co/datasets/indiaset/census-2011)\n\n🔗 Pipeline → [https://github.com/indiaset/census-2011-pipeline](https://github.com/indiaset/census-2011-pipeline)\n\n🔗 Follow → [https://x.com/indiaset_data](https://x.com/indiaset_data)", "url": "https://wpnews.pro/news/i-cleaned-india-s-census-2011-data-so-you-never-have-to", "canonical_source": "https://dev.to/iam-ansuman/i-cleaned-indias-census-2011-data-so-you-never-have-to-4g2m", "published_at": "2026-06-16 12:36:40+00:00", "updated_at": "2026-06-16 12:47:15.497244+00:00", "lang": "en", "topics": ["developer-tools"], "entities": ["Ansuman Jaiswal", "indiaset", "Hugging Face", "Census 2011", "India", "LGD", "GitHub", "CC-BY-4.0"], "alternates": {"html": "https://wpnews.pro/news/i-cleaned-india-s-census-2011-data-so-you-never-have-to", "markdown": "https://wpnews.pro/news/i-cleaned-india-s-census-2011-data-so-you-never-have-to.md", "text": "https://wpnews.pro/news/i-cleaned-india-s-census-2011-data-so-you-never-have-to.txt", "jsonld": "https://wpnews.pro/news/i-cleaned-india-s-census-2011-data-so-you-never-have-to.jsonld"}}