Research & Projects

What we're building

Each project starts with a real problem and a dataset. We build the pipeline, train the models, deploy the infrastructure, and ship the interface.

Phase 0 Complete

Tutela Ignis

Wildfire Prediction System

Domain: Environmental ML / Climate Science

Tutela Ignis ingests data from 10 distinct sources, including ICNF historical fires, NASA FIRMS active detections, ERA5-Land reanalysis, IPMA weather stations, Copernicus Sentinel imagery, DEM terrain, CORINE land cover, OpenStreetMap, and WorldPop population density. These feed a 35-feature engineering pipeline (Bronze to Silver to Gold) that produces training data for a weighted ensemble of three models: CNN-LSTM (50% weight), XGBoost (35%), and Random Forest (15%). The system targets 90%+ accuracy and 0.95+ AUC-ROC for daily fire risk prediction across continental Portugal.

10 automated data collectors with full test coverage
Bronze to Silver to Gold data lake architecture
CNN-LSTM + XGBoost + Random Forest ensemble
Target deployment: NVIDIA GB10 Grace Blackwell
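
As an illustration of how the ensemble weights above could combine per-model outputs, here is a minimal sketch of weighted soft voting. The function name, the dictionary keys, and the example probabilities are hypothetical; only the 50/35/15 weights come from the description, and the project's actual combination logic may differ.

```python
# Weights from the project description; the model keys are illustrative names.
ENSEMBLE_WEIGHTS = {"cnn_lstm": 0.50, "xgboost": 0.35, "random_forest": 0.15}


def ensemble_fire_risk(probabilities, weights=ENSEMBLE_WEIGHTS):
    """Combine per-model fire probabilities into one weighted risk score.

    `probabilities` maps each model name to its predicted probability of
    fire (0.0 to 1.0) for a single grid cell and day.
    """
    if set(probabilities) != set(weights):
        raise ValueError("probabilities must cover exactly the weighted models")
    for name, p in probabilities.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} probability out of range: {p}")
    total_weight = sum(weights.values())
    # Normalising by the total keeps the result a probability even if the
    # weights are later retuned and no longer sum to exactly 1.
    return sum(weights[m] * probabilities[m] for m in weights) / total_weight


# Made-up model outputs for one cell on one day.
risk = ensemble_fire_risk(
    {"cnn_lstm": 0.80, "xgboost": 0.60, "random_forest": 0.40}
)
print(f"ensemble fire risk: {risk:.2f}")
```

With these example inputs the CNN-LSTM output dominates the score, which is the intended effect of giving it half of the total weight.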

Python, TensorFlow, CNN-LSTM, XGBoost, FastAPI, PostGIS, Polars, Celery

10 data sources

35 features

84 tests passing

Active Development

Earthbenders

LiDAR Terrain Analysis Platform

Domain: Geospatial ML / Remote Sensing

Earthbenders is a 13-service platform built for Instituto de Planeamento Regenerativo (IPR). Users draw polygons on a MapLibre interface, triggering a Celery worker pipeline that downloads LiDAR data from DGT, builds DTMs/DSMs, and generates 26 derivative products using GDAL, SAGA GIS, and WhiteboxTools. Outputs include contour lines, hillshade, slope, aspect, geomorphons, stream networks, watersheds, flood zones, erosion risk, and pond siting analysis. Results are served as Cloud Optimized GeoTIFFs via TiTiler and PostGIS vectors via Martin tile server.

26 spatial analysis products from a single polygon input
50 cm / 2 m LiDAR from Portuguese DGT
COG raster + PostGIS vector dual tile serving
MapLibre 2D + CesiumJS 3D visualization
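
To make one of the derivative products concrete, the sketch below computes slope for a single DTM cell using Horn's 3x3 finite-difference method, a common choice for slope rasters. It is a pure-Python illustration under stated assumptions: the function name and the sample elevations are hypothetical, and the platform itself relies on GDAL, SAGA GIS, and WhiteboxTools rather than code like this.

```python
import math


def horn_slope_degrees(window, cell_size):
    """Slope in degrees at the centre of a 3x3 elevation window.

    `window` is three rows of three elevations (north row first), and
    `cell_size` is the ground distance between cell centres, in the same
    unit as the elevations (for example 0.5 for 50 cm LiDAR in metres).
    """
    (a, b, c), (d, _centre, f), (g, h, i) = window
    # Horn's method: weighted differences of the eight neighbours.
    dz_dx = ((c + 2 * f + i) - (a + 2 * d + g)) / (8 * cell_size)
    dz_dy = ((g + 2 * h + i) - (a + 2 * b + c)) / (8 * cell_size)
    return math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))


# A uniform ramp rising 0.5 m per 0.5 m cell towards the east: a 45 degree slope.
ramp = [
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
]
flat = [[10.0] * 3 for _ in range(3)]

ramp_slope = horn_slope_degrees(ramp, cell_size=0.5)
flat_slope = horn_slope_degrees(flat, cell_size=0.5)
print(f"ramp: {ramp_slope:.1f} deg, flat: {flat_slope:.1f} deg")
```

Applying the same window calculation to every interior cell of a DTM yields a slope raster; aspect can be derived from the same two gradients.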

Python, FastAPI, GDAL, WhiteboxTools, React, MapLibre, PostGIS, Celery, Azure Blob

26 spatial products

50 cm LiDAR resolution

13 services

Functional Prototype

Digital Archive AI

Document Intelligence Pipeline

Domain: Document AI / NLP / Computer Vision

This system processes municipal archival records through a 5-DAG Airflow pipeline. Stage 1 ingests records and PDFs from the Archeevo API. Stage 2 runs PP-OCRv5 for typed text, TrOCR-Large for handwriting, and DocLayout-YOLO for page layout analysis. Stage 3 uses Qwen3-VL 30B for visual descriptions and Qwen3 32B for ISAD(G) metadata enrichment. Stage 4 generates embeddings via E5-Large (text), SigLIP 2 (visual), and ColQwen2.5 (multi-vector) for semantic retrieval. A Streamlit portal provides 8 interactive pages including a LangGraph RAG chatbot for conversational search over the archive.

11 AI models across OCR, VLM, LLM, NER, and embeddings
VRAM-aware orchestration on a single 32GB GPU
Multi-vector retrieval: text + visual + ColBERT
LangGraph RAG chatbot over archival documents
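
The multi-vector (ColBERT-style) retrieval mentioned above scores a document by late interaction: each query vector is matched to its best document vector, and the matches are summed. The sketch below shows that MaxSim scoring in plain Python. The function names, the two-dimensional toy vectors, and the document IDs are hypothetical; in the real pipeline the vectors would come from ColQwen2.5 and the search would run inside a vector database rather than in a Python loop.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Late-interaction score: sum over query vectors of the best match."""
    if not doc_vectors:
        raise ValueError("document has no vectors")
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


def rank_documents(query_vectors, documents):
    """Return (doc_id, score) pairs sorted from best to worst match."""
    scored = [
        (doc_id, maxsim_score(query_vectors, vectors))
        for doc_id, vectors in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: two query token vectors and two archival documents.
query = [[1.0, 0.0], [0.0, 1.0]]
documents = {
    "deed": [[1.0, 0.0], [0.5, 0.5]],
    "letter": [[0.0, 0.2]],
}

ranking = rank_documents(query, documents)
print(ranking)
```

Unlike single-vector search, this keeps one vector per token or image patch, so a document can match different parts of the query with different parts of its content.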

Python, Airflow, PyTorch, Qwen3, PaddleOCR, Qdrant, FastAPI, Streamlit, LangGraph

11 ML models

5 pipeline DAGs

114 tests passing

Production

Government AI Infrastructure

Public Sector AI Services Cluster

Domain: MLOps / Infrastructure

A k3s Kubernetes cluster runs on GPU hardware and is managed via Flux CD GitOps and Ansible playbooks. The cluster hosts LLM inference services, a real-time conversational avatar that uses faster_whisper for ASR and Edge TTS for speech synthesis, an XTTS v2 text-to-speech service, and a Whisper speech-to-text service. CI/CD runs on GitLab with an in-cluster runner, and infrastructure documentation is served via MkDocs. The avatar supports Portuguese, English, Spanish, and French with automatic language detection and is accessible via a Cloudflare tunnel.

GitOps-managed Kubernetes with Flux CD
Real-time conversational AI avatar (Aria)
Multi-language speech processing pipeline
Ansible-automated node provisioning

k3s, Flux CD, Ansible, Docker, GitLab CI, MkDocs

3 AI services

4 languages

99% availability