2026 ML Engineer / Data Scientist

Almeria AH

Python · LightGBM · Darts · ML

Quantitative pipeline to decide whether Almeria's vegetable market prices carry a real predictive edge. Daily scrapers, 26 time series, 6 competing models, and a walk-forward expanding-window evaluation with a binary PASS/FAIL criterion.

The Challenge

Almeria is Europe's greenhouse, but auction-house vegetable prices move in informational fog. Before investing in a prediction product, I needed to answer a binary question: can a naive baseline be beaten using public data?

Results

  • 26 time series across 5 products and 6 auction houses, since January 2025
  • 6 competing models evaluated with a walk-forward expanding window
  • Point metrics (MAE/RMSE/MAPE) + interval metrics (pinball, 80/95% coverage)
  • Explicit go/no-go criterion: ≥15% MAE improvement vs naive baseline
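
To make the evaluation concrete, here is a minimal sketch of an expanding-window backtest with the metrics and the go/no-go rule listed above. It is not the project's code: the function names, the weekly season length, the synthetic series, and the choice to compute the improvement on a single series (rather than aggregated across all 26) are assumptions for illustration.

```python
from statistics import fmean

def expanding_window_splits(n_obs, min_train, horizon=1, step=1):
    """Yield (train_end, test_indices); training always covers [0, train_end)."""
    train_end = min_train
    while train_end + horizon <= n_obs:
        yield train_end, list(range(train_end, train_end + horizon))
        train_end += step

def naive(history, horizon):
    """Baseline: repeat the last observed price."""
    return [history[-1]] * horizon

def seasonal_naive(history, horizon, season=7):
    """Repeat the value from one season earlier (season=7 is an assumption)."""
    return [history[-season + (h % season)] for h in range(horizon)]

def walk_forward_mae(series, forecaster, min_train, horizon=1, step=1):
    """MAE over every out-of-sample point of the expanding-window backtest."""
    errors = []
    for train_end, test_idx in expanding_window_splits(len(series), min_train, horizon, step):
        preds = forecaster(series[:train_end], horizon)  # past data only
        errors.extend(abs(series[i] - p) for i, p in zip(test_idx, preds))
    return fmean(errors)

def pinball_loss(y_true, y_quantile, q):
    """Mean pinball (quantile) loss for quantile level q in (0, 1)."""
    return fmean(max(q * (t - p), (q - 1) * (t - p)) for t, p in zip(y_true, y_quantile))

def coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    return sum(lo <= t <= hi for t, lo, hi in zip(y_true, lower, upper)) / len(y_true)

def verdict(mae_model, mae_naive, threshold=0.15):
    """PASS if the model reduces MAE by at least `threshold` relative to naive."""
    improvement = (mae_naive - mae_model) / mae_naive
    return ("PASS" if improvement >= threshold else "FAIL"), improvement

# Synthetic, perfectly weekly series: illustrative only, not market data
series = [1, 2, 3, 4, 5, 6, 7] * 4
mae_naive = walk_forward_mae(series, naive, min_train=14)
mae_candidate = walk_forward_mae(series, seasonal_naive, min_train=14)
label, gain = verdict(mae_candidate, mae_naive)
```

Fixing the threshold as a default argument before any model is run mirrors the idea of writing the criterion down ahead of the experiment.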

The Solution

I built an end-to-end research pipeline: four daily scrapers (fhalmeria, ASAJA, AEMET, hortoinfo); idempotent normalization into SQLite; lag, weather, and news features; six competing models (Naive, SeasonalNaive, ARIMA, LGBM, LGBMMeteo, LGBMRich); and a walk-forward expanding-window evaluation with point and interval metrics. The go/no-go criterion is explicit: beat the naive baseline by ≥15% in MAE.
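
One way to get idempotent normalization in SQLite is an upsert keyed on the natural identity of a price observation, so that re-running a scrape for the same day overwrites rows instead of duplicating them. The sketch below assumes a hypothetical `prices` table, hypothetical column names, and made-up rows; the real schema is not described here. The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical schema: one row per (day, auction house, product)
SCHEMA = """
CREATE TABLE IF NOT EXISTS prices (
    day TEXT NOT NULL,
    auction_house TEXT NOT NULL,
    product TEXT NOT NULL,
    price_eur_kg REAL NOT NULL,
    PRIMARY KEY (day, auction_house, product)
)
"""

UPSERT = """
INSERT INTO prices (day, auction_house, product, price_eur_kg)
VALUES (?, ?, ?, ?)
ON CONFLICT (day, auction_house, product)
DO UPDATE SET price_eur_kg = excluded.price_eur_kg
"""

def normalize(raw_row):
    """Trim and lowercase names; parse a decimal-comma price string."""
    day, house, product, price = raw_row
    return (
        day.strip(),
        house.strip().lower(),
        product.strip().lower(),
        float(str(price).replace(",", ".")),
    )

def load(conn, raw_rows):
    """Create the table if needed and upsert all rows in one transaction."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, [normalize(r) for r in raw_rows])

conn = sqlite3.connect(":memory:")
rows = [
    ("2025-03-10", " House_A ", "Tomato", "1,25"),  # made-up values
    ("2025-03-10", "house_a", "cucumber", "0,80"),
]
load(conn, rows)
load(conn, rows)  # re-running the same scrape adds no duplicates
count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

Normalizing the key columns before the upsert matters: without it, `" House_A "` and `"house_a"` would be stored as two different auction houses.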

Motivation

I wanted a project where methodological rigor was the feature, not the decoration. If the answer is PASS there's a product in 2027; if it's FAIL, I archive the hypothesis with data and avoid repeating the experiment out of forgetfulness years later.

Challenges

The trickiest part was avoiding leakage in the walk-forward evaluation: each window may only see data that was available at its actual point in time, which forces feature versioning (forecast vs. realized weather, the previous day's news, etc.). On top of that, the fhalmeria price boards that the scrapers parse change format every few weeks.
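
A point-in-time ("as-of") lookup is one way to implement that kind of feature versioning. The sketch below assumes each weather record stores both the date it describes and the date it became known; the record layout, field order, and values are hypothetical and only illustrate the idea.

```python
from datetime import date

# Hypothetical versioned records: (target_date, issued_at, kind, temp_max_c)
WEATHER = [
    (date(2025, 3, 10), date(2025, 3, 8), "forecast", 19.0),
    (date(2025, 3, 10), date(2025, 3, 9), "forecast", 21.0),
    (date(2025, 3, 10), date(2025, 3, 11), "realized", 23.5),
]

def weather_as_of(records, target_date, cutoff):
    """Return the newest version for target_date issued on or before cutoff.

    Returns None when nothing was known yet at the cutoff.
    """
    visible = [r for r in records if r[0] == target_date and r[1] <= cutoff]
    if not visible:
        return None
    return max(visible, key=lambda r: r[1])

# A window whose training data ends on 2025-03-09 sees only the latest forecast
seen_in_backtest = weather_as_of(WEATHER, date(2025, 3, 10), cutoff=date(2025, 3, 9))

# The realized value only becomes visible to windows ending on or after 2025-03-11
seen_later = weather_as_of(WEATHER, date(2025, 3, 10), cutoff=date(2025, 3, 12))
```

Using the realized temperature for a window that ends on 2025-03-09 would be exactly the leakage described above: the model would be trained on a value that did not exist yet.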

Learnings

I learned that a well-measured naive baseline is the best intellectual-honesty tool for an ML project: if you can't beat it consistently in walk-forward, you don't have a product. I also learned that writing a binary PASS/FAIL criterion before the experiment saves months of self-deception.

Context

Research in its final phase. The go/no-go verdict is expected before 2026-09-30. There is no product in 2026; the output is a public verdict document.