Hey DataGeeks, I'm Pavan
I am a Data Explorer passionate about diving into every field where data is prominent. My journey spans the full spectrum, from Market Research & Supply Chain Analytics to designing Databases & ETL Pipelines. I extend this expertise into AI, developing Machine Learning and Deep Learning models with a specific focus on BERT-based Text & Semantic Analysis.
Data Scientist | ML Engineer | Product Analytics
Location : Seattle, WA, USA
Mobile : +1 (929) 278-4589
Email : pavan.yellathakota.ds@gmail.com
LinkedIn : https://linkedin.com/in/yellatp
GitHub : https://github.com/yellatp
Professional Summary
Data Scientist with 3+ years of experience developing predictive models and automated data infrastructure. Proven track record in improving search precision, designing quantitative research pipelines, and implementing data-driven solutions for marketing and product growth. Skilled in bridging the gap between data engineering and stakeholder decision-making through statistical validation, A/B testing, and interactive analytics.
Technical Skills
| Domain | Stack |
|---|---|
| Languages & Databases | |
| AWS Cloud Data | |
| ML Frameworks | |
| Tools & Visualization | |
Professional Experience
Alphonso AI, backed by Shipley Center for Innovation | Founding ML Engineer
Potsdam, NY | Jul 2025 – Present
- Backend Architecture: Designed a 0-to-1 Backend Ecosystem using FastAPI and PostgreSQL, orchestrating a scalable microservices bridge between Java-based core services and Python-native ML workloads.
- Cost-Efficient Infrastructure: Deployed and managed production services on DigitalOcean VPS to optimize infrastructure overhead; implemented Docker-based containerization to ensure environment parity across R&D and production.
- Advanced Retrieval (RAG): Engineered a Multi-Model "Text-to-Query" (TTQ) engine leveraging Gemini (Vertex AI) and DeepSeek APIs to enable dynamic, prompt-driven semantic search across high-dimensional talent data.
- Search Optimization: Deployed a multi-stage retrieval pipeline utilizing pgvector for Approximate Nearest Neighbor (ANN) search and CUDA-accelerated Cross-Encoders for high-precision re-ranking (targeting a 38% improvement in Precision@N).
- Domain-Aware Recommendation: Developed a sector-specific ranking system using Vectorized Embeddings; shifted logic from generic role-matching to domain-expertise alignment, improving candidate-to-company fit.
- Generative Team-Composition: Built a module that translates natural language product descriptions into granular technical requirements and specific candidate matches, bridging the gap for non-technical founders.
- System Design & MCP: Led relational schema normalization, API contract definition, and R&D into Model Context Protocol (MCP) for agentic, self-correcting database interactions.
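
The retrieve-then-re-rank idea and the Precision@N metric mentioned above can be sketched in plain Python. This is only an illustration under stated assumptions: the brute-force cosine search stands in for an ANN index such as pgvector, the `rerank_scores` lookup stands in for a cross-encoder, and every ID, vector, and score is made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, corpus, rerank_score, k, n):
    """Stage 1 keeps the k most similar vectors; stage 2 re-orders them with a costlier scorer."""
    candidates = sorted(
        corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True
    )[:k]
    return sorted(candidates, key=rerank_score, reverse=True)[:n]


def precision_at_n(ranked_ids, relevant_ids, n):
    """Fraction of the top-n ranked items that are labeled relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for item in ranked_ids[:n] if item in relevant_ids)
    return hits / n


# Hypothetical toy data: four candidate profiles as 2-d embeddings.
corpus = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.8, 0.2]}
# Hypothetical second-stage scores (a real system would call a cross-encoder here).
rerank_scores = {"a": 0.1, "b": 0.9, "c": 1.0, "d": 0.5}

top = retrieve_then_rerank([1.0, 0.0], corpus, rerank_scores.get, k=3, n=2)
print(top, precision_at_n(top, relevant_ids={"b", "c"}, n=2))
```

The first stage is cheap and recall-oriented, so a relevant item it drops (here `c`) can never be recovered by the re-ranker; that trade-off is what the choice of `k` controls.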

Student Managed Investment Fund, Clarkson University | Graduate Quantitative Researcher
Potsdam, NY | Sep 2024 – Apr 2025
- Portfolio Management: Managed a $650K real-capital portfolio, delivering a 51% total return and outperforming the S&P 500 benchmark by 26 percentage points (2,600 bps).
- Alternative Data Pipeline: Built a sentiment analysis engine scraping Reddit/YouTube to validate fundamental buy signals, using BERT-based sentiment scoring to overlay quantitative signals on traditional financial metrics.
- Automation: Automated the extraction of financial statements from SEC EDGAR using Python & Vertex AI, reducing data collection time by 80% for the analyst team.
- Risk Modeling: Developed Monte Carlo simulations and risk-parity models to stress-test overweight positions and quantify potential drawdowns for high-conviction trades.
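
A Monte Carlo drawdown stress test of the kind described in the risk-modeling bullet might look roughly like the sketch below. It assumes independent, normally distributed daily returns, which is a simplification, and the drift, volatility, and path count are hypothetical values rather than figures from the fund.

```python
import random


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def simulate_drawdowns(mu, sigma, days, n_paths, seed=0):
    """Simulate n_paths value paths with i.i.d. Gaussian daily returns; return each path's max drawdown."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    drawdowns = []
    for _ in range(n_paths):
        value = 1.0
        path = [value]
        for _ in range(days):
            value *= 1.0 + rng.gauss(mu, sigma)
            path.append(value)
        drawdowns.append(max_drawdown(path))
    return drawdowns


# Hypothetical inputs: 0.05% daily drift, 2% daily volatility, one trading year.
dds = sorted(simulate_drawdowns(mu=0.0005, sigma=0.02, days=252, n_paths=200, seed=1))
p95 = dds[(95 * len(dds)) // 100]  # empirical 95th-percentile drawdown
print(f"95th percentile max drawdown: {p95:.1%}")
```

Reading a high percentile of the simulated drawdowns, rather than the mean, is one common way to size an overweight position against a bad-but-plausible year.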

HAVK Mladost (Elite Athletics Club) | Graduate Data Science Consultant
Potsdam, NY | Oct 2023 – May 2025
- Cloud Migration: Architected a centralized data lake on AWS S3, migrating legacy records to a queryable cloud environment and reducing data retrieval latency by 40%.
- ETL Optimization: Developed PySpark ETL jobs on AWS Glue to process 1M+ cross-channel events; utilized partition pruning to optimize query costs and speed.
- Uplift Modeling: Applied uplift modeling and behavioral clustering to identify high-value fan segments, optimizing marketing spend and merchandise revenue.
- Performance Analytics: Developed backend services with FastAPI and built interactive dashboards that delivered real-time performance insights to World Championship coaches.
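
To make the uplift-modeling bullet concrete, here is a minimal sketch of the simplest uplift estimate: the difference in conversion rate between treated and control groups within each segment. It is not necessarily the method used in the project, and the segment names and records are hypothetical.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """records: iterable of (segment, treated, converted) tuples.

    Returns {segment: treated conversion rate - control conversion rate},
    skipping segments that lack either a treated or a control group.
    """
    # Per segment and arm: [number of users, number of conversions]
    counts = defaultdict(lambda: {"treated": [0, 0], "control": [0, 0]})
    for segment, treated, converted in records:
        arm = "treated" if treated else "control"
        counts[segment][arm][0] += 1
        counts[segment][arm][1] += int(converted)

    uplift = {}
    for segment, arms in counts.items():
        (n_t, conv_t), (n_c, conv_c) = arms["treated"], arms["control"]
        if n_t == 0 or n_c == 0:
            continue  # no comparison possible without both arms
        uplift[segment] = conv_t / n_t - conv_c / n_c
    return uplift


# Hypothetical campaign data: (segment, received promotion?, purchased?)
records = [
    ("loyal", True, True), ("loyal", True, True),
    ("loyal", False, True), ("loyal", False, True),
    ("casual", True, True), ("casual", True, False),
    ("casual", False, False), ("casual", False, False),
]
print(uplift_by_segment(records))
```

In this toy data the "loyal" segment buys with or without the promotion, so spend there yields no uplift, while the "casual" segment responds to it.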

eAppSys Limited | Business Data Analyst
Hyderabad, India | Jul 2022 – Dec 2022
- Forecasting: Developed demand forecasting models (Prophet/SARIMAX) for 1,500+ SKUs, integrating exogenous variables (holidays, promotions) to reduce forecast error (MAPE) by 15%.
- Reporting Automation: Designed and deployed automated KPI dashboards in Oracle Analytics Cloud (OAC), saving the procurement team 12+ hours/week of manual reporting time.
- ML Workflows: Implemented GxP-compliant ML workflows on Oracle Cloud Infrastructure (OCI) with real-time alerts, achieving 99.9% uptime for critical inventory monitoring.
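
For reference, MAPE (mean absolute percentage error), the forecast metric cited above, can be computed as in this short sketch. The demand figures are made up, and the function assumes no actual value is zero.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Lower is better."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical weekly demand for one SKU versus two competing forecasts.
actual = [100, 200, 150]
baseline = [120, 160, 150]
improved = [110, 180, 150]
print(mape(actual, baseline), mape(actual, improved))
```

Because MAPE measures error, a better model lowers it; comparing a baseline and an improved forecast on the same actuals is how a relative reduction would typically be reported.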

Kantar GDC India | Data Analyst
Pune, India | Sep 2021 – May 2022
- Pipeline Automation: Built automated data pipelines for Tracker and Syndicated Research projects using Python and PySpark, integrating 10M+ survey records from 30+ sources and reducing processing latency by 30%.
- Statistical Analysis: Developed sampling approaches and statistical significance testing to ensure data representativeness across Middle East and Central Africa markets.
- Consumer Insights: Supported recurring monthly/quarterly client tracking projects by developing regression models and delivering insights for 10+ FMCG and Telecom clients.
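
As one example of the statistical significance testing mentioned above, a pooled two-proportion z-test for survey results could be sketched as follows. The test choice is an assumption for illustration (the original does not name the tests used), and the sample counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1, x2 are success counts; n1, n2 are sample sizes.
    Returns (z statistic, p-value) using the pooled standard error.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # both samples are all-success or all-failure
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical brand-awareness counts from two survey waves.
z, p = two_proportion_z_test(60, 100, 40, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The normal approximation behind this test is only reasonable when each group has enough successes and failures; small cells would call for an exact test instead.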

Some Notable Projects
| Project | Description | Tech Stack |
|:---:|:---|:---:|
| **[Text-Analysis-using-NLP-LDA](https://github.com/yellatp/Text-Analysis-using-NLP-LDA)** | NLP project focused on topic modeling and text analysis. | NLP, LDA, Python |
| **[Detoxify Telugu](https://github.com/yellatp/detoxify-telugu)** | Toxic comment classification for Telugu language. | NLP, Deep Learning |
| **[Synthetic Data Generator](https://github.com/yellatp/Synthetic-Data-Generator)** | Tool to generate synthetic datasets for testing/training. | Python, Data Gen |
| **[BingeMax Recommendation Engine](https://github.com/yellatp/BingeMax-Personalized-Movie-Recommendation-Engine)** | Personalized movie recommendation system. | ML, Recommender Systems |
| **[Fintech Sales GAP Analysis](https://github.com/yellatp/Fintech-Sales-GAP-Analysis)** | Analyzing sales gaps in fintech products. | Data Analysis, Visualization |
| **[KonnectR Fullstack App](https://github.com/yellatp/KonnectR_flask_fullstack_app)** | Fullstack web application built with Flask. | Flask, Python, Web |
| **[PreOwned Cars Price Prediction](https://github.com/yellatp/PreOwnedCars_Price_Prediction_Model_V1.0)** | ML model to predict prices of used cars. | Regression, Scikit-learn |
| **[Fake News Classifier](https://github.com/yellatp/Fake-News-Classifier)** | Identification of fake news articles using ML. | Classification, NLP |
| **[Content Strategy Netflix](https://github.com/yellatp/Content-Strategy-Analysis-NETFLIX)** | Data-driven strategy analysis for Netflix content. | Data Science, EDA |
| **[Supply Chain Analysis](https://github.com/yellatp/Supply-Chain-Analysis-Python)** | Optimization and analysis of supply chain data. | Python, Logistics |
| **[GenZ Career Preferences](https://github.com/yellatp/GenZ-Career-Preferences-Report)** | Analysis report on GenZ career trends. | Research, Analytics |
| **[Website A/B Testing](https://github.com/yellatp/Website-AB-Testing-Python)** | Statistical analysis of A/B test results. | Statistics, Python |
Last Updated: 2026 by PAVAN YELLATHAKOTA