Internship Description
SAMAGRAX - GOVERNMENT DOCUMENT CHATBOT: STREAMLINING ACCESS AND ASSISTANCE
Samagra-Code for GovTech
- Virtual Internship
- 17-Apr-2024
- Pan India
Start date
Immediately
Duration
3 Months
Stipend
₹33,000/month
No. of Credits
10
Apply by
08-May-2024
About the program
Goal
Create a bot capable of answering user questions, based on the RAG framework, using government data extracted from PDFs.

Description
The project aims to develop a chatbot capable of retrieving relevant information from government documents, including both officially typed and scanned documents in Hindi and English. Users should be able to ask questions, and the bot will extract and present cohesive, accurate answers from the PDFs.

Goals and Mid-Point Milestone
Technical tasks:
1. Data Collection and Integration
- Collect government data from sources such as upvidhai.gov.in/Act.aspx and shasanadesh.up.gov.in.
- Integrate collected data into the chatbot's knowledge base for retrieval.
2. Language Processing Capability
- Develop algorithms for parsing English text from the text layer of PDFs.
- Implement OCR algorithms for extracting Hindi text from scanned PDFs.
3. Natural Language Understanding (NLU)
- Implement NLU techniques to understand and interpret user queries accurately.
- Develop algorithms to search for relevant content based on user queries.
4. Content Structuring and Storage
- Structure extracted text into cohesive chunks for efficient storage and retrieval.
- Store structured content in a database for easy access and management.
5. Multi-Language Support
- Develop translation algorithms to support multiple languages, including Hindi and English.
- Ensure seamless translation of content to meet user language preferences.
6. LLM Integration and Training
- Integrate a Language Model (LLM) for generating cohesive answers.
- Train the LLM using relevant datasets to align with the style of government documents.

Expected Outcome
1. Functional chatbot capable of RAG framework-based query responses using government PDF data
2. Accurate retrieval of relevant content from Hindi/English typed and scanned government documents
3. Seamless user experience with cohesive, timely responses
4. Efficient parsing, chunking, and storage of PDF text for easy access
5. Multi-language support for user interaction and response delivery
6. Integration of a Language Model (LLM) for cohesive answer generation
7. Enhanced accessibility to government information
8. Streamlined document retrieval process for improved efficiency

Implementation Details
The implementation involves:
1. Collecting data from sources such as upvidhai.gov.in/Act.aspx and shasanadesh.up.gov.in
2. Parsing documents to extract English text from the text layer of PDFs
3. Parsing documents using OCR to extract Hindi text from PDFs
4. Structuring extracted text into sensible chunks and storing them in a database
5. Translating text to required languages
6. Understanding natural language queries and searching for related content using vector databases
7. Utilizing an LLM to generate cohesive and relevant answers based on questions and retrieved content
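For applicants who want a concrete picture of the chunk, retrieve, and answer steps listed in the implementation details, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the project's actual design: the function names (`chunk_text`, `retrieve`, `build_prompt`), the chunk sizes, and the prompt wording are all hypothetical, and a toy bag-of-words cosine similarity stands in for the embedding model and vector database the project would presumably use. The tokenizer is also English-oriented and would need replacing for Hindi text.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative).

    Assumes overlap < max_words.
    """
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def vectorize(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Assemble the text that would be sent to an LLM (wording is hypothetical)."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

In use, text extracted from each PDF would pass through `chunk_text`, the resulting chunks would be stored, and at query time `retrieve` would pick the most relevant chunks for `build_prompt` before the prompt is handed to the LLM. Overlapping windows are one simple way to avoid cutting a sentence's context at a chunk boundary; splitting on section or paragraph boundaries is an equally reasonable alternative for structured government orders.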
Perks
1. Lucrative stipend of INR 1 lakh over a period of 3 months
2. Dedicated 1-on-1 mentorship by industry experts
3. Hands-on experience to hone your skills
4. Access to bootcamps and expert sessions
5. Potential job/extended internship opportunities
6. Opportunity to network with global open-source tech leaders
Who can apply?
Only those candidates can apply who:
- are from any degree programme and any specialisation
- are available for the duration of 3 months
- have relevant skills and interests
Terms of Engagement
1. ₹50,000 received on completion of the mid-point milestone, as decided with the mentor
2. ₹50,000 received on completion of the final milestone, as decided with the mentor
3. Certificate of completion received on successful completion of the internship
Number of openings
1