Hyderabad-based analytics startup Modak Analytics has built India’s first Big Data-based electoral data repository, covering the 81.4 crore voters in India’s 16th national election.
Founded three years ago, Modak Analytics undertook its first pan-India project using its in-house automation technologies: extensive demographic data mining of India’s 814 million voters for a leading national political party.
“The experience of mining 25 million open source data pages of Election Commission published in 12 local languages had given us the necessary push to develop customised data warehouses for corporates across the world,” said Milind Chitgupakar, chief analytics officer of Modak.
The massive exercise covered 81.4 crore voters, the largest electorate ever on the planet. By comparison, the USA has 19.36 crore voters, Indonesia 17.1 crore, Brazil 13.58 crore and the UK 4.55 crore.
Some of the major challenges Modak faced in building a Big Data repository of electoral data were:
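For readers more used to millions than crore, the figures above can be converted with a few lines of Python. This is only an illustrative sketch: the helper name `crore_to_million` is made up for this example, and the only fact it relies on is that 1 crore equals 10 million.

```python
# 1 crore = 10^7, i.e. 10 million.
CRORE = 10_000_000

def crore_to_million(crore: float) -> float:
    """Convert a figure given in crore to millions."""
    return crore * CRORE / 1_000_000

# Electorate sizes in crore, as quoted in the article.
electorates_crore = {
    "India": 81.4,
    "USA": 19.36,
    "Indonesia": 17.1,
    "Brazil": 13.58,
    "UK": 4.55,
}

for country, crore in electorates_crore.items():
    print(f"{country}: {crore} crore = {crore_to_million(crore):.1f} million")
```

On this conversion, India’s 81.4 crore voters correspond to the 814 million quoted elsewhere in the article.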
- 543 Parliamentary and 4120 assembly constituencies
- 9.3 lakh polling booths
- Voter Rolls in PDF in 12 languages
- 9 lakh PDFs, amounting to 2.5 crore pages to be deciphered
- Diverse range of Voter Names and Information
The real challenge was extracting voter information from 2.5 crore PDF pages and transliterating it into English so that it could be fused with other sources. Technology was a big hurdle. The infrastructure, built especially for the project, included a 64-node Hadoop cluster, PostgreSQL and servers that processed a master file containing over 8 terabytes of data. Testing and validation was another big task. ‘First of a Kind’ heuristic (machine learning) algorithms were developed to classify people based on name, geography and similar attributes, which helped identify religion, caste and even ethnicity.
“Data from multiple sources like Census, Economic and Social surveys were mapped to polling booths. Simultaneously, external and proprietary data sources had to be fused with individual voters’ data. Because of this complex nature, no big IT company ever ventured into this,” said Aarti Joshi.
Modak Analytics developed potentially patentable proprietary technologies: ‘Rapid Extraction, Transformation and Loading (RapidETL)’, transliteration of Indian languages, and data dictionaries for all individual states (except Orissa). These were used to process unstructured data and transliterate it quickly, while fuzzy logic helped match records based on Indian names, addresses and villages. RapidETL automated the transformation and merging of massive datasets, which significantly reduced time and cost. Further, math quants produced actionable information to develop individual booth strategies.
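The article does not describe how Modak’s fuzzy matching works, so the following is only a generic sketch of the idea of matching transliterated names that differ slightly in spelling. It uses the standard-library `difflib.SequenceMatcher` as a stand-in similarity measure. The function names, the sample names and the 0.8 threshold are all assumptions made for illustration, not details of RapidETL.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())

def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def best_match(query: str, candidates: list[str], threshold: float = 0.8):
    """Return (candidate, score) for the closest name, or (None, score)
    when no candidate reaches the (assumed) threshold."""
    scored = [(similarity(query, c), c) for c in candidates]
    score, match = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return (match, score) if score >= threshold else (None, score)

# Hypothetical sample data: one transliterated name and a list of
# names from another source it might correspond to.
roll_names = ["Venkat Raman", "Lakshmi Devi", "Ramesh Kumar"]
match, score = best_match("Venkata Ramana", roll_names)
print(match, round(score, 2))
```

A production system would likely combine several signals (address, village, booth) and use language-aware phonetic rules rather than plain character similarity; the sketch shows only the thresholded closest-match step.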