What we're building

Each project starts with a real problem and a dataset. We build the pipeline, train the models, deploy the infrastructure, and ship the interface.

01
Phase 0 Complete

Tutela Ignis

Wildfire Prediction System

Domain: Environmental ML / Climate Science

Tutela Ignis ingests data from 10 sources, including ICNF historical fires, NASA FIRMS active detections, ERA5-Land reanalysis, IPMA weather stations, Copernicus Sentinel imagery, DEM terrain, CORINE land cover, OpenStreetMap, and WorldPop population density. These feed a 35-feature engineering pipeline organized in Bronze, Silver, and Gold layers, which produces training data for an ensemble of CNN-LSTM (50% weight), XGBoost (35%), and Random Forest (15%) models. The system targets 90%+ accuracy and an AUC-ROC of 0.95+ for daily fire risk prediction across continental Portugal.
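As a rough illustration of how the stated ensemble weights could combine per-model outputs, here is a minimal sketch. It assumes soft voting (a weighted average of each model's predicted fire probability); the description above does not say how the weights are applied, and the dictionary keys and function name are hypothetical.

```python
# Weights come from the project description; the keys are illustrative names.
ENSEMBLE_WEIGHTS = {"cnn_lstm": 0.50, "xgboost": 0.35, "random_forest": 0.15}


def ensemble_risk(probabilities: dict) -> float:
    """Return the weighted average of per-model fire probabilities.

    Assumption: each model emits a probability in [0, 1] for the same
    location and day, and the ensemble is a plain weighted average.
    """
    missing = ENSEMBLE_WEIGHTS.keys() - probabilities.keys()
    if missing:
        raise ValueError(f"missing model outputs: {sorted(missing)}")
    return sum(
        weight * probabilities[name]
        for name, weight in ENSEMBLE_WEIGHTS.items()
    )


# Example with made-up probabilities for one grid cell on one day.
risk = ensemble_risk({"cnn_lstm": 0.8, "xgboost": 0.6, "random_forest": 0.4})
print(round(risk, 2))
```

Because the weights sum to 1, the result stays in the same [0, 1] range as the inputs.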

  • 10 automated data collectors with full test coverage
  • Bronze to Silver to Gold data lake architecture
  • CNN-LSTM + XGBoost + Random Forest ensemble
  • Target deployment: NVIDIA GB10 Grace Blackwell
Python, TensorFlow, CNN-LSTM, XGBoost, FastAPI, PostGIS, Polars, Celery
10 data sources
35 features
84 tests passing
02
Active Development

Earthbenders

LiDAR Terrain Analysis Platform

Domain: Geospatial ML / Remote Sensing

Earthbenders is a 13-service platform built for the Instituto de Planeamento Regenerativo (IPR). Users draw polygons on a MapLibre interface, triggering a Celery worker pipeline that downloads LiDAR data from DGT, builds DTMs and DSMs, and generates 26 derivative products using GDAL, SAGA GIS, and WhiteboxTools. Outputs include contour lines, hillshade, slope, aspect, geomorphons, stream networks, watersheds, flood zones, erosion risk, and pond siting analysis. Rasters are served as Cloud Optimized GeoTIFFs via TiTiler, and PostGIS vectors are served via the Martin tile server.
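To make one of the derivative products concrete, the sketch below computes slope at a single DTM cell. It is only an illustration of the quantity: it uses plain central differences on a nested list, not the kernels that GDAL, SAGA GIS, or WhiteboxTools apply, and the function name and grid layout are assumptions.

```python
import math


def slope_degrees(dtm, row, col, cell_size):
    """Slope in degrees at an interior cell of a DTM grid.

    Assumptions: `dtm` is a list of rows of elevations in metres,
    `cell_size` is the ground resolution in metres, and (row, col)
    is not on the grid border.
    """
    # Central differences approximate the elevation gradient.
    dz_dx = (dtm[row][col + 1] - dtm[row][col - 1]) / (2 * cell_size)
    dz_dy = (dtm[row + 1][col] - dtm[row - 1][col]) / (2 * cell_size)
    # Slope is the angle whose tangent is the gradient magnitude.
    return math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))


# A synthetic 3x3 tile at 0.5 m resolution that rises 0.5 m per cell eastward.
tile = [
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
]
print(slope_degrees(tile, 1, 1, cell_size=0.5))
```

A surface that rises one metre per metre of horizontal distance, as in this tile, corresponds to a 45 degree slope.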

  • 26 spatial analysis products from a single polygon input
  • 50 cm / 2 m LiDAR from the Portuguese DGT
  • COG raster + PostGIS vector dual tile serving
  • MapLibre 2D + CesiumJS 3D visualization
Python, FastAPI, GDAL, WhiteboxTools, React, MapLibre, PostGIS, Celery, Azure Blob
26 spatial products
50 cm LiDAR resolution
13 services
03
Functional Prototype

Digital Archive AI

Document Intelligence Pipeline

Domain: Document AI / NLP / Computer Vision

This system processes municipal archival records through a 5-DAG Airflow pipeline. Stage 1 ingests records and PDFs from the Archeevo API. Stage 2 runs PP-OCRv5 for typed text, TrOCR-Large for handwriting, and DocLayout-YOLO for page layout analysis. Stage 3 uses Qwen3-VL 30B for visual descriptions and Qwen3 32B for ISAD(G) metadata enrichment. Stage 4 generates embeddings via E5-Large (text), SigLIP 2 (visual), and ColQwen2.5 (multi-vector) for semantic retrieval. A Streamlit portal provides 8 interactive pages including a LangGraph RAG chatbot for conversational search over the archive.
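Multi-vector retrieval of the kind ColQwen2.5 supports is usually scored with ColBERT-style late interaction (MaxSim): each query vector is matched to its best document vector and the matches are summed. The sketch below shows that scoring rule on toy vectors; it is not the project's retrieval code, and in practice a vector database would perform this step.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query vectors of the best
    similarity against any document vector.

    Assumption: vectors are already normalized, so the dot product
    acts as cosine similarity.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional example with two query tokens and two document patches.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
print(maxsim_score(query, document))
```

Documents are then ranked by this score, which rewards pages where every part of the query finds a strong match somewhere in the document.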

  • 11 AI models across OCR, VLM, LLM, NER, and embeddings
  • VRAM-aware orchestration on a single 32GB GPU
  • Multi-vector retrieval: text + visual + ColBERT
  • LangGraph RAG chatbot over archival documents
Python, Airflow, PyTorch, Qwen3, PaddleOCR, Qdrant, FastAPI, Streamlit, LangGraph
11 ML models
5 pipeline DAGs
114 tests passing
04
Production

Government AI Infrastructure

Public Sector AI Services Cluster

Domain: MLOps / Infrastructure

A k3s Kubernetes cluster runs on GPU hardware and is managed via Flux CD GitOps and Ansible playbooks. The cluster hosts LLM inference services, a real-time conversational avatar that uses faster_whisper ASR and Edge TTS, an XTTS v2 text-to-speech service, and a Whisper speech-to-text service. CI/CD runs on GitLab with an in-cluster runner, and infrastructure documentation is served via MkDocs. The avatar system supports Portuguese, English, Spanish, and French with automatic language detection and is accessible via a Cloudflare tunnel.

  • GitOps-managed Kubernetes with Flux CD
  • Real-time conversational AI avatar (Aria)
  • Multi-language speech processing pipeline
  • Ansible-automated node provisioning
k3s, Flux CD, Ansible, Docker, GitLab CI, MkDocs
3 AI services
4 languages
99% availability