Research & Projects
What we're building
Each project starts with a real problem and a dataset. We build the pipeline, train the models, deploy the infrastructure, and ship the interface.
Tutela Ignis
Wildfire Prediction System
Domain: Environmental ML / Climate Science
Tutela Ignis ingests data from 10 distinct sources, including ICNF historical fires, NASA FIRMS active detections, ERA5-Land reanalysis, IPMA weather stations, Copernicus Sentinel imagery, DEM terrain, CORINE land cover, OpenStreetMap, and WorldPop population density. These feed a 35-feature engineering pipeline (Bronze to Silver to Gold), producing training data for an ensemble of CNN-LSTM (50% weight), XGBoost (35%), and Random Forest (15%) models. The system targets at least 90% accuracy and an AUC-ROC of 0.95 or higher for daily fire risk prediction across continental Portugal.
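As a rough illustration of how the stated ensemble weights could combine per-model outputs, the sketch below applies a weighted average (soft voting) to three fire probabilities. The combination rule, the `ensemble_risk` helper, and the example inputs are assumptions made for illustration; only the 50/35/15 weights come from the description above.

```python
from typing import Mapping

# Ensemble weights as stated in the project description.
WEIGHTS = {"cnn_lstm": 0.50, "xgboost": 0.35, "random_forest": 0.15}


def ensemble_risk(probabilities: Mapping[str, float],
                  weights: Mapping[str, float] = WEIGHTS) -> float:
    """Weighted average of per-model fire probabilities (soft voting).

    Hypothetical helper: the real system may combine models differently.
    """
    missing = set(weights) - set(probabilities)
    if missing:
        raise ValueError(f"missing model outputs: {sorted(missing)}")
    total = sum(weights.values())
    return sum(weights[name] * probabilities[name] for name in weights) / total


# Illustrative inputs only, not real model outputs.
example = {"cnn_lstm": 0.80, "xgboost": 0.60, "random_forest": 0.40}
risk = ensemble_risk(example)
print(f"ensemble fire risk: {risk:.2f}")  # prints: ensemble fire risk: 0.67
```

Dividing by the weight total keeps the result a valid probability even if the weights are later retuned and no longer sum exactly to one.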
- 10 automated data collectors with full test coverage
- Bronze to Silver to Gold data lake architecture
- CNN-LSTM + XGBoost + Random Forest ensemble
- Target deployment: NVIDIA GB10 Grace Blackwell
Python, TensorFlow, CNN-LSTM, XGBoost, FastAPI, PostGIS, Polars, Celery
10 data sources
35 features
84 tests passing
Earthbenders
LiDAR Terrain Analysis Platform
Domain: Geospatial ML / Remote Sensing
Earthbenders is a 13-service platform built for the Instituto de Planeamento Regenerativo (IPR). Users draw polygons on a MapLibre interface, triggering a Celery worker pipeline that downloads LiDAR data from DGT, builds DTMs/DSMs, and generates 26 derivative products using GDAL, SAGA GIS, and WhiteboxTools. Outputs include contour lines, hillshade, slope, aspect, geomorphons, stream networks, watersheds, flood zones, erosion risk, and pond siting analysis. Results are served as Cloud Optimized GeoTIFFs via TiTiler and as PostGIS vectors via the Martin tile server.
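To make one of the derivative products concrete, here is a minimal sketch of a slope calculation on a gridded DTM using plain central differences. It is not the platform's implementation (which, per the description, relies on GDAL, SAGA GIS, and WhiteboxTools); the `slope_degrees` helper and the synthetic grid are hypothetical, and edge cells are simply left at zero for brevity.

```python
import math
from typing import List

Grid = List[List[float]]


def slope_degrees(dtm: Grid, cell_size: float) -> Grid:
    """Slope in degrees for interior cells via central differences.

    Edge cells are left at 0.0 to keep the sketch short.
    """
    rows, cols = len(dtm), len(dtm[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Elevation change per metre along x (columns) and y (rows).
            dz_dx = (dtm[r][c + 1] - dtm[r][c - 1]) / (2 * cell_size)
            dz_dy = (dtm[r + 1][c] - dtm[r - 1][c]) / (2 * cell_size)
            out[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return out


# Synthetic 4x4 plane rising 2 m per 2 m cell along x, i.e. a 45-degree slope.
cell = 2.0
plane = [[col * cell for col in range(4)] for _ in range(4)]
slopes = slope_degrees(plane, cell)
print(round(slopes[1][1], 6))
```

Production tools typically use a 3x3 neighbourhood estimator and handle edges and nodata explicitly, so values from this sketch may differ slightly from theirs on rough terrain.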
- 26 spatial analysis products from a single polygon input
- 50 cm/2 m LiDAR from the Portuguese DGT
- COG raster + PostGIS vector dual tile serving
- MapLibre 2D + CesiumJS 3D visualization
Python, FastAPI, GDAL, WhiteboxTools, React, MapLibre, PostGIS, Celery, Azure Blob
26 spatial products
50 cm LiDAR resolution
13 services
Digital Archive AI
Document Intelligence Pipeline
Domain: Document AI / NLP / Computer Vision
This system processes municipal archival records through a 5-DAG Airflow pipeline. Stage 1 ingests records and PDFs from the Archeevo API. Stage 2 runs PP-OCRv5 for typed text, TrOCR-Large for handwriting, and DocLayout-YOLO for page layout analysis. Stage 3 uses Qwen3-VL 30B for visual descriptions and Qwen3 32B for ISAD(G) metadata enrichment. Stage 4 generates embeddings via E5-Large (text), SigLIP 2 (visual), and ColQwen2.5 (multi-vector) for semantic retrieval. A Streamlit portal provides 8 interactive pages including a LangGraph RAG chatbot for conversational search over the archive.
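The multi-vector embeddings mentioned for Stage 4 are usually scored with late interaction: each query token vector is matched to its best document token vector, and those maxima are summed (the MaxSim rule used by ColBERT-style models). The sketch below shows that rule on toy 2-D vectors. The function names and data are hypothetical, and whether the pipeline scores exactly this way is an assumption.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector],
                 doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the best doc match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank(query_tokens: Sequence[Vector],
         docs: Dict[str, Sequence[Vector]]) -> List[str]:
    """Document ids ordered from highest to lowest MaxSim score."""
    return sorted(docs, key=lambda k: maxsim_score(query_tokens, docs[k]),
                  reverse=True)


# Toy token embeddings, for illustration only.
query = [(1.0, 0.0), (0.0, 1.0)]
docs = {
    "doc_a": [(1.0, 0.0), (0.0, 1.0)],  # matches both query tokens
    "doc_b": [(1.0, 0.0), (1.0, 0.0)],  # matches only the first
}
ranking = rank(query, docs)
print(ranking)  # prints: ['doc_a', 'doc_b']
```

In practice a vector database would perform this scoring over stored multi-vectors; the pure-Python version only shows the arithmetic.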
- 11 AI models across OCR, VLM, LLM, NER, and embeddings
- VRAM-aware orchestration on a single 32GB GPU
- Multi-vector retrieval: text + visual + ColBERT
- LangGraph RAG chatbot over archival documents
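One simple way to think about VRAM-aware orchestration on a single GPU is to walk the models in pipeline order and group them into batches whose combined footprint fits the memory budget, unloading between batches. The sketch below assumes this greedy strategy; the `plan_batches` helper, the stage names, and all footprint numbers are placeholders, not measurements from the actual system.

```python
from typing import Dict, List

GPU_BUDGET_GB = 32.0  # single 32 GB GPU, per the description


def plan_batches(footprints_gb: Dict[str, float],
                 budget_gb: float = GPU_BUDGET_GB) -> List[List[str]]:
    """Group models, in pipeline order, into batches that fit the budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0.0
    for name, size in footprints_gb.items():
        if size > budget_gb:
            raise ValueError(f"{name} needs {size} GB, over the {budget_gb} GB budget")
        if used + size > budget_gb:
            # Current batch is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0.0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Placeholder footprints in GB (illustrative, not measured).
footprints = {"ocr": 2.0, "layout": 1.0, "vlm": 24.0, "llm": 22.0, "embedder": 3.0}
plan = plan_batches(footprints)
print(plan)  # prints: [['ocr', 'layout', 'vlm'], ['llm', 'embedder']]
```

A real scheduler would also account for activation memory and KV caches, which vary with batch size and context length.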
Python, Airflow, PyTorch, Qwen3, PaddleOCR, Qdrant, FastAPI, Streamlit, LangGraph
11 ML models
5 pipeline DAGs
114 tests passing
Government AI Infrastructure
Public Sector AI Services Cluster
Domain: MLOps / Infrastructure
A k3s Kubernetes cluster runs on GPU hardware and is managed via Flux CD GitOps and Ansible playbooks. The cluster hosts LLM inference services, a real-time conversational avatar using faster_whisper ASR and Edge TTS, XTTS v2 text-to-speech, and Whisper speech-to-text services. CI/CD runs on GitLab with an in-cluster runner, and infrastructure documentation is served via MkDocs. The avatar system supports Portuguese, English, Spanish, and French with automatic language detection, and is accessible via a Cloudflare tunnel.
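Automatic language detection across four supported languages needs a rule for what happens when the detector reports an unsupported language or a low-confidence guess. The sketch below shows one possible routing rule, assuming the ASR step yields a language code and a probability (faster_whisper exposes both on the info object returned by `transcribe`). The `select_language` helper, the Portuguese fallback, and the 0.5 threshold are assumptions, not details from the deployed avatar.

```python
SUPPORTED = ("pt", "en", "es", "fr")  # Portuguese, English, Spanish, French
DEFAULT_LANGUAGE = "pt"  # assumption: fall back to Portuguese
MIN_CONFIDENCE = 0.5     # assumption: illustrative threshold


def select_language(detected: str, probability: float) -> str:
    """Pick the reply language from a detected code and its confidence."""
    # Normalise regional variants such as "pt-BR" to the base code.
    code = detected.lower().split("-")[0]
    if code in SUPPORTED and probability >= MIN_CONFIDENCE:
        return code
    return DEFAULT_LANGUAGE


print(select_language("en", 0.93))  # prints: en
print(select_language("de", 0.99))  # prints: pt
```

The selected code can then be passed to the text-to-speech stage so the reply is voiced in the same language the user spoke.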
- GitOps-managed Kubernetes with Flux CD
- Real-time conversational AI avatar (Aria)
- Multi-language speech processing pipeline
- Ansible-automated node provisioning
k3s, Flux CD, Ansible, Docker, GitLab CI, MkDocs
3 AI services
4 languages
99% availability