Satya Saurabh Mishra - Data Scientist, Engineer, and Researcher

Satya Saurabh Mishra

Exploring the Frontiers of AI and Data Science

Connect with Satya on LinkedIn View Satya's GitHub projects
PDF icon View Resume

About

👋 Hi, I'm Satya Saurabh Mishra, a Data Scientist, Engineer, and Researcher 🤖🔬 with a passion for Machine Learning, Data Science, and Large Language Models (LLMs).
🎓 Currently working as a Data Scientist at Dell Technologies, I am dedicated to pushing the boundaries of AI innovation.

🚀 Welcome to my portfolio!

SKILLS

Large Language Models: Transformers (Hugging Face), LangChain, LangGraph, LlamaIndex, AutoGen, CrewAI
LLM APIs: OpenAI API, Google AI Studio API
Vector Databases: Pinecone, Chroma, FAISS
Machine Learning: PyTorch, TensorFlow, Scikit-learn
Natural Language Processing: NLTK, SpaCy
Data Analysis: NumPy, Pandas, Hugging Face Datasets
Databases: PostgreSQL, MySQL
Data Visualization: Matplotlib, Seaborn, Plotly, Power BI, Tableau
Languages: Python, C++
Version Control: Git, GitHub
IDEs: VS Code, Jupyter Notebooks
Containerization: Docker
MLOps: DVC, MLflow

Publications

Building Trust Workshop at ICLR 2025
Chris Lazar∗, Varun Kausika∗, Satya Saurabh Mishra∗, Saurabh Jha, Priyanka Pathak

Abstract: In the field of text-to-SQL candidate generation, a critical challenge remains in quantifying and assessing the confidence in the generated SQL queries. Existing approaches often rely on large language models (LLMs) that function as opaque processing units, producing outputs for every input without a mechanism to measure their confidence. Current uncertainty quantification techniques for LLMs do not incorporate domain-specific information. In this study, we introduce the concept of query entropy for text-to-SQL candidate confidence estimation and integrate it into existing popular self-correction pipelines to guide generation and prevent resource overuse, using a novel entropy-based clustering technique for generated SQL candidates. We further study how different candidate generation techniques behave under this paradigm.
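The core intuition behind query entropy can be sketched as follows: sample several SQL candidates for one question, group equivalent queries, and compute the Shannon entropy of the resulting distribution. This is only an illustrative sketch, not the paper's method; the paper clusters candidates with a dedicated technique, while here simple whitespace/case normalization stands in for that clustering, and the function name `query_entropy` is chosen for illustration.

```python
import math
from collections import Counter

def query_entropy(sql_candidates):
    """Shannon entropy of the empirical distribution over sampled
    SQL candidates for one question, after crude normalization.
    Low entropy = the model keeps producing the same query
    (high confidence); high entropy = candidates disagree."""
    normalized = [" ".join(q.lower().split()) for q in sql_candidates]
    counts = Counter(normalized)
    n = len(normalized)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Identical candidates -> entropy 0 (confident);
# five distinct candidates -> maximal entropy log2(5) (uncertain).
low = query_entropy(["SELECT * FROM t"] * 5)
high = query_entropy(["SELECT a FROM t", "SELECT b FROM t",
                      "SELECT c FROM t", "SELECT d FROM t",
                      "SELECT e FROM t"])
```

In a self-correction pipeline, such a score could gate further generation rounds: stop early when entropy is low, spend more budget when it is high.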

Hugging Face Models and Dataset
Anamika Chatterjee∗, Harshit Skichi∗, Satya Saurabh Mishra∗, Saurabh Jha

Dataset: We curate a large finance dataset from Investopedia using a new technique that leverages unstructured scraped data and an LLM to generate structured data suitable for fine-tuning embedding models. The generation pipeline uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.
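The self-verification step can be illustrated with a minimal sketch. The actual pipeline uses an LLM to verify generated pairs; here a crude lexical-overlap check stands in for that LLM call, and the function name `is_grounded` and the `threshold` parameter are illustrative assumptions, not the published method.

```python
def is_grounded(answer, source_text, threshold=0.8):
    """Crude stand-in for LLM self-verification: keep a generated
    QA pair only if most answer tokens also appear in the scraped
    source passage, i.e. the answer is unlikely to be hallucinated."""
    answer_tokens = set(answer.lower().split())
    source_tokens = set(source_text.lower().split())
    if not answer_tokens:
        return False
    overlap = len(answer_tokens & source_tokens) / len(answer_tokens)
    return overlap >= threshold
```

Pairs failing the check would be dropped before the dataset is used for fine-tuning.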

Model:

  • The embedding model is fine-tuned on top of BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search in RAG applications.
  • The Large Language Model (LLM) is an instruct fine-tuned version of mistralai/Mistral-7B-v0.1, trained on our open-sourced finance dataset developed for finance applications by the FinLang team.
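The semantic-search use case for the embedding model boils down to cosine-similarity ranking over the 768-dimensional vectors. The sketch below shows that retrieval step only, with tiny hand-made vectors; in practice the vectors would come from the fine-tuned bge-base-en-v1.5 model (e.g. via the `sentence-transformers` `encode` API), and `top_k` is a hypothetical helper name.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Rank document embeddings by cosine similarity to the query
    embedding and return the indices of the k best matches."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarities
    return np.argsort(-sims)[:k]      # best-first indices

# Toy 2-d stand-ins for 768-d embeddings:
order = top_k(np.array([1.0, 0.0]),
              np.array([[1.0, 0.0],   # identical direction
                        [0.0, 1.0],   # orthogonal
                        [0.9, 0.1]])) # nearly aligned
```

In a RAG application the top-ranked passages would then be fed to the LLM as context.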

Timeline