Satya Saurabh Mishra - Data Scientist, Engineer, and Researcher

Satya Saurabh Mishra

Exploring the Frontiers of AI and Data Science

Connect with Satya on LinkedIn View Satya's GitHub projects
PDF icon View Resume

About

👋 Hi, I'm Satya Saurabh Mishra, a Data Scientist, Engineer, and Researcher 🤖🔬 with a passion for Machine Learning, Data Science, and Large Language Models (LLMs).
🎓 Currently working as a Data Scientist at Dell Technologies, I am dedicated to pushing the boundaries of AI innovation.

🚀 Welcome to my portfolio!

SKILLS

Large Language Models: Transformers (Hugging Face), LangChain, LangGraph, LlamaIndex, AutoGen, CrewAI
LLM APIs: OpenAI API, Google AI Studio API
Vector Databases: Pinecone, Chroma, FAISS
Machine Learning: PyTorch, TensorFlow, Scikit-learn
Natural Language Processing: NLTK, SpaCy
Data Analysis: NumPy, Pandas, Hugging Face Datasets
Databases: PostgreSQL, MySQL
Data Visualization: Matplotlib, Seaborn, Plotly, Power BI, Tableau
Languages: Python, C++
Version Control: Git, GitHub
IDEs: VS Code, Jupyter Notebooks
Containerization: Docker
MLOps: DVC, MLflow

Publications

Building Trust Workshop at ICLR 2025
Chris Lazar∗, Varun Kausika∗, Satya Saurabh Mishra∗, Saurabh Jha, Priyanka Pathak

Abstract: In the field of text-to-SQL candidate generation, a critical challenge remains in quantifying and assessing the confidence in the generated SQL queries. Existing approaches often rely on large language models (LLMs) that function as opaque processing units, producing outputs for every input without a mechanism to measure their confidence. Current uncertainty quantification techniques for LLMs do not incorporate domain-specific information. In this study, we introduce the concept of query entropy for text-to-SQL candidate confidence estimation and integrate it into existing popular self-correction pipelines to guide generation and prevent resource overuse, using a novel entropy-based clustering technique for generated SQL candidates. We further study how different candidate generation techniques behave under this paradigm.
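The core intuition behind query entropy can be sketched as follows: sample several SQL candidates for one question, group equivalent queries, and compute the Shannon entropy of the resulting distribution. This is only an illustrative sketch, not the paper's method; the paper clusters candidates with a dedicated technique, while here simple whitespace/case normalization stands in for that clustering, and the function name `query_entropy` is chosen for illustration.

```python
import math
from collections import Counter

def query_entropy(sql_candidates):
    """Shannon entropy of the empirical distribution over sampled
    SQL candidates for one question, after crude normalization.
    Low entropy = the model keeps producing the same query
    (high confidence); high entropy = candidates disagree."""
    normalized = [" ".join(q.lower().split()) for q in sql_candidates]
    counts = Counter(normalized)
    n = len(normalized)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Identical candidates -> entropy 0 (confident);
# five distinct candidates -> maximal entropy log2(5) (uncertain).
low = query_entropy(["SELECT * FROM t"] * 5)
high = query_entropy(["SELECT a FROM t", "SELECT b FROM t",
                      "SELECT c FROM t", "SELECT d FROM t",
                      "SELECT e FROM t"])
```

In a self-correction pipeline, such a score could gate further generation rounds: stop early when entropy is low, spend more budget when it is high.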

Hugging Face Models and Dataset
Anamika Chatterjee∗, Harshit Skichi∗, Satya Saurabh Mishra∗, Saurabh Jha

Dataset: We curate a large finance dataset from Investopedia using a new technique that leverages unstructured scraped data and an LLM to generate structured data suitable for fine-tuning embedding models. The generation pipeline uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.
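The self-verification step can be illustrated with a minimal sketch. The actual pipeline uses an LLM to verify generated pairs; here a crude lexical-overlap check stands in for that LLM call, and the function name `is_grounded` and the `threshold` parameter are illustrative assumptions, not the published method.

```python
def is_grounded(answer, source_text, threshold=0.8):
    """Crude stand-in for LLM self-verification: keep a generated
    QA pair only if most answer tokens also appear in the scraped
    source passage, i.e. the answer is unlikely to be hallucinated."""
    answer_tokens = set(answer.lower().split())
    source_tokens = set(source_text.lower().split())
    if not answer_tokens:
        return False
    overlap = len(answer_tokens & source_tokens) / len(answer_tokens)
    return overlap >= threshold
```

Pairs failing the check would be dropped before the dataset is used for fine-tuning.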

Model:

  • The embedding model is fine-tuned on top of BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search in RAG applications.
  • The Large Language Model (LLM) is an instruct fine-tuned version of mistralai/Mistral-7B-v0.1, trained on our open-sourced finance dataset developed for finance applications by the FinLang team.
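The semantic-search use case for the embedding model boils down to cosine-similarity ranking over the 768-dimensional vectors. The sketch below shows that retrieval step only, with tiny hand-made vectors; in practice the vectors would come from the fine-tuned bge-base-en-v1.5 model (e.g. via the `sentence-transformers` `encode` API), and `top_k` is a hypothetical helper name.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Rank document embeddings by cosine similarity to the query
    embedding and return the indices of the k best matches."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarities
    return np.argsort(-sims)[:k]      # best-first indices

# Toy 2-d stand-ins for 768-d embeddings:
order = top_k(np.array([1.0, 0.0]),
              np.array([[1.0, 0.0],   # identical direction
                        [0.0, 1.0],   # orthogonal
                        [0.9, 0.1]])) # nearly aligned
```

In a RAG application the top-ranked passages would then be fed to the LLM as context.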

Timeline