Sports Data Engineer
I build production-grade data pipelines for professional sports — from real-time MLB pitch tracking at Sportradar to cloud analytics on GCP & AWS. 5+ years turning raw game data into decisions.
I'm a Data Engineer based in Corpus Christi, TX focused on sports data infrastructure and cloud analytics. My career bridges raw event data and production-ready insights — whether that's pitch-by-pitch MLB tracking at Sportradar or GCP analytical models cutting client time-to-insight by 45%.
At Synergy Sports (Sportradar) I captured and structured live MLB game data feeding broadcast networks and franchise analytics platforms. At Vikua I led technical delivery across 6 client environments, engineering pipelines with 99.7% uptime and building SQL models that cut cloud compute costs by 18%.
Currently completing the MIT MicroMasters in Statistics and Data Science and expanding into statistical modeling. Fluent in English and Spanish, with working knowledge of Italian. Open to relocation and fully remote roles.
Hero project in active development: a dual-path architecture for pitcher fatigue, bullpen readiness, and matchup leverage. A deterministic replay engine publishes MLB game events to Kafka/Redpanda, while streaming jobs compute fast provisional alerts and batch models reconstruct canonical truth in an Iceberg lakehouse. The reconciliation layer measures where real-time decisions diverge from the final record.
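A minimal sketch of what the reconciliation check could look like, assuming both paths emit one decision per game event. The `Alert` schema, the `reconcile` function, and the sample records are hypothetical and illustrative only; they are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One fatigue decision for a pitcher at a game event (hypothetical schema)."""
    game_id: str
    event_seq: int
    pitcher_id: str
    fatigued: bool


def reconcile(provisional, canonical):
    """Compare streaming alerts with batch results, keyed by (game_id, event_seq)."""
    truth = {(a.game_id, a.event_seq): a for a in canonical}
    report = {"match": 0, "diverged": [], "missing_in_batch": []}
    for alert in provisional:
        key = (alert.game_id, alert.event_seq)
        final = truth.get(key)
        if final is None:
            # The streaming path alerted on an event the batch path never recorded.
            report["missing_in_batch"].append(key)
        elif final.fatigued == alert.fatigued:
            report["match"] += 1
        else:
            # The fast decision disagrees with the final record.
            report["diverged"].append(key)
    return report


# Made-up sample data for illustration.
provisional = [
    Alert("g1", 1, "p42", False),
    Alert("g1", 2, "p42", True),
    Alert("g1", 3, "p42", True),
]
canonical = [
    Alert("g1", 1, "p42", False),
    Alert("g1", 2, "p42", False),  # batch path revised this decision
]
report = reconcile(provisional, canonical)
```

Keying on an event sequence number rather than wall-clock time is a design assumption here: it keeps the comparison stable when events arrive late or out of order.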
End-to-end MLB analytics platform ingesting Statcast pitch-by-pitch data through a Bronze/Silver/Gold medallion architecture. DuckDB analytical layer runs SQL window functions on millions of rows 3-5x faster than Pandas. XGBoost model predicts pitcher CSW rate with TimeSeriesSplit validation (training folds never see future data) and per-prediction SHAP explainability. Claude API generates AI scouting reports, with the LLM strictly positioned after all statistics are computed.
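To illustrate why time-ordered validation guards against leakage, here is a dependency-free sketch of expanding-window splits in the spirit of scikit-learn's TimeSeriesSplit. The function name `expanding_window_splits` is hypothetical, and this is an illustration of the idea rather than the project's validation code.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where every train index precedes the test fold."""
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    test_size = n_samples // (n_splits + 1)
    first_test = n_samples - n_splits * test_size
    for start in range(first_test, n_samples, test_size):
        # Train on everything before the fold, test on the next block in time.
        yield list(range(start)), list(range(start, start + test_size))


# Rows are assumed to be sorted by game date before splitting.
folds = list(expanding_window_splits(n_samples=10, n_splits=3))
```

Because each training set ends strictly before its test fold begins, a model evaluated this way is never scored on pitches that occurred earlier than the data it learned from.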
Production-style sports data platform on a full Bronze/Silver/Gold architecture: raw GPS tracking ingestion, validation, Parquet transformation, and a player analytics layer. CI/CD via GitHub Actions, an Apache Airflow DAG for orchestration, and Terraform provisioning a 3-layer AWS S3 data lake.
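The validation step between Bronze and Silver could be sketched roughly as follows. The field names (`player_id`, `lat`, `lon`, `speed_mps`) and both functions are assumptions made for illustration; the actual schema and rules are not described above.

```python
def validate_gps_row(row):
    """Return a list of problems; an empty list means the row can move to Silver."""
    problems = []
    if not row.get("player_id"):
        problems.append("missing player_id")
    lat, lon, speed = row.get("lat"), row.get("lon"), row.get("speed_mps")
    if not isinstance(lat, (int, float)) or not -90 <= lat <= 90:
        problems.append("lat out of range")
    if not isinstance(lon, (int, float)) or not -180 <= lon <= 180:
        problems.append("lon out of range")
    if not isinstance(speed, (int, float)) or speed < 0:
        problems.append("negative or missing speed")
    return problems


def split_bronze(rows):
    """Route raw rows to Silver or to a rejects list that keeps the reasons."""
    silver, rejected = [], []
    for row in rows:
        problems = validate_gps_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            silver.append(row)
    return silver, rejected


# Made-up sample rows for illustration.
bronze = [
    {"player_id": "p7", "lat": 27.8, "lon": -97.4, "speed_mps": 6.2},
    {"player_id": "p7", "lat": 127.8, "lon": -97.4, "speed_mps": 6.2},
    {"player_id": "", "lat": 27.8, "lon": -97.4, "speed_mps": -1.0},
]
silver, rejected = split_bronze(bronze)
```

Keeping rejected rows alongside their reasons, instead of dropping them, makes it possible to audit data quality per source later.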
Production ML pipeline for injury risk in professional football using 88 actual Real Madrid injury events (2021-2025). Full medallion architecture with 13 point-in-time-correct features, leakage guards in code and tests, a 5-job CI/CD pipeline, an 80% coverage gate, and ML outputs with mandatory confidence intervals and OOD flags.
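As an illustration of what "point-in-time correct" means for a feature, the sketch below counts prior injuries using only events strictly before the prediction date. The function `prior_injury_count`, the 365-day window, and the dates are hypothetical examples, not one of the project's 13 features or its data.

```python
from datetime import date, timedelta


def prior_injury_count(injury_dates, as_of, window_days=365):
    """Count injuries in the window before as_of, excluding as_of itself."""
    window_start = as_of - timedelta(days=window_days)
    # The strict upper bound is the leakage guard: the event being
    # predicted on as_of must never contribute to its own feature.
    return sum(1 for d in injury_dates if window_start <= d < as_of)


# Made-up injury history for illustration.
history = [date(2023, 1, 10), date(2023, 9, 2), date(2024, 3, 15)]
count = prior_injury_count(history, as_of=date(2024, 3, 15))
```

A unit test that asserts the event on the prediction date is excluded is one simple way to keep a guard like this from regressing.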
Automated PII-safe pipeline on GCP integrating banking, POS, and university sources into a Master User Model via Bronze-Silver-Gold layering in BigQuery. SHA-256 hashing and boolean masking for PII compliance, orchestrated with Cloud Composer/Airflow. Full Terraform IaC.
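A rough sketch of how SHA-256 hashing and boolean masking can be combined so that records from different sources still join on the same user. The salt, the normalization step, the field names, and both functions are assumptions made for illustration and are not described in the pipeline above.

```python
import hashlib


def hash_pii(value, salt):
    """Return a SHA-256 hex digest of a normalized identifier (salt is an assumption)."""
    normalized = value.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def mask_record(record, salt):
    """Replace the raw identifier with a hash and reduce the phone to a presence flag."""
    return {
        "user_key": hash_pii(record["email"], salt),
        "has_phone": bool(record.get("phone")),  # boolean masking: keep presence only
        "source": record["source"],
    }


# Made-up records from two hypothetical sources.
salt = "example-salt"
bank = mask_record({"email": "Ana@Example.com ", "phone": "555-0100", "source": "bank"}, salt)
pos = mask_record({"email": "ana@example.com", "phone": "", "source": "pos"}, salt)
```

Normalizing before hashing is what lets the two sources resolve to the same `user_key` without either raw email leaving the ingestion layer.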
Let's build something