Over the last few months, we’ve been reshaping our workflows at EDGEMTech to put AI directly in the loop of our daily engineering work, especially around embedded systems and edge computing.
Like many teams, we see clear gains in development speed and bug resolution, and it strengthens our specification-driven approach and software component integration. At the same time, our projects often involve industrial, medical, or otherwise critical environments, which come with strict constraints on confidentiality, digital sovereignty, and long‑term control over our infrastructure.
Cloud-first AI tooling is not acceptable for a large part of our work. This is why we started building our own AI stack, designed from day one to be local, privacy-preserving, and specialized for embedded and edge software development.
What we are building
We’re developing an internal AI coding assistant, currently used on:
-
our edgem1/verdin build system including TorizonOS
-
The SO3/AVZ hypervisor
-
LVGL projects and other UI stacks
-
generic embedded and Linux-based repos
The assistant runs fully local: no source code or customer data ever leaves the machine. It is built around a modern open model (Qwen 3.6 35B MoE, quantized) serving as the core, with a harness that adds:
-
RAG over code and docs (per-project ChromaDB collections)
-
corpus-aware context (one “corpus” per project/repo)
-
persistent memory and skills, allowing the assistant to learn from real usage across sessions
-
optional fine-tuning via lightweight adapters when it becomes useful again
In other words, it is not “just a model on a GPU”, but an AI tooling layer tailored to our codebases and workflows.
Architecture, in practice
Our current deployment runs on a developer laptop (RTX 4060 8GB + 62GB RAM). This configuration is deliberately constrained: it forces us to focus, to rationalize the scope, and to validate the architecture and tooling on realistic, everyday hardware.
-
Model server: Qwen3.6‑35B‑A3B in Q8_0 via llama.cpp, with hybrid offload
-
attention on GPU, experts in RAM
-
~7 tokens/s in typical usage, offline only
-
-
Front-end: a CLI chat client aware of the current working directory
-
Tools exposed to the model:
-
bash,edit_file,append_file,write_filefor code edits -
web_searchfor explicit internet queries when allowed -
remember,save_skill,search_historyfor long-term memory and procedure learning
-
-
Knowledge layers:
-
system prompts and project rules
-
RAG corpora per repo
-
persistent memories and skills
-
cross-session history search
-
Thanks to our collaboration with the REDS (Reconfigurable Embedded Digital Institute) from HEIG‑VD (University of Applied Sciences in Yverdon-les-Bains), we also have access to high‑end GPU infrastructure (for example RTX 6000‑class GPUs with 96 GB of VRAM). This will allow us to:
-
run heavier models or higher‑throughput instances when needed
-
perform faster and more extensive fine‑tuning runs
-
~48 tokens/s in typical usage, offline only
-
deploy a dedicated server‑grade instance of the assistant in the near future, while preserving the same architecture and local‑first design
Everything is orchestrated so that the assistant can read, navigate, search, and edit code in a controlled way, with guardrails on file size, number of tool calls per turn, and explicit confirmation for any non‑read‑only actions.
Corpora: how we structure knowledge
Central to this approach is the notion of a corpus: each project or workspace is its own corpus with:
-
its own RAG index
-
its own memories and skills
-
project-specific rules and conventions
For example:
-
edgem1is a curated corpus built from our build system checkout (verdin, virt64, scripts, doc, etc.) -
any other repo (LVGL, SO3, u-boot, …) becomes a generic corpus indexed with a standard directory walker
The CLI automatically detects the active corpus from the current directory, or lets you pick one:
-
edgem-chat— open a chat bound to the current repo -
edgem-chat --corpus lvgl— explicitly pick a known corpus -
edgem-chat --here— treat the current directory as an ad-hoc corpus -
edgem-reindex/edgem-index— (re)build the index for a given tree
Large multi-component workspaces can be split into multiple corpora so that, for example, a huge upstream tree like u‑boot doesn’t dilute the relevance of RAG for a smaller in‑house component.
Four learning loops, different time scales
One important lesson from our experiments is that the model is only one part of the system. The real performance comes from how we manage knowledge and learning around it.
We currently combine four learning loops, each with its own “clock”:
-
RAG
-
On‑the‑fly retrieval of code and documentation from the indexed corpora
-
No retraining required, index refresh via reindexing
-
-
Memory and skills
-
The assistant can persist facts and procedures (with confirmation)
-
These are re‑used in later sessions, effectively encoding habits and best practices
-
-
Curated datasets from real usage
-
We log selected good exchanges as training samples
-
We also build datasets from git history, documentation, “how‑to” procedures, and teacher‑student distillation on our own codebases
-
-
Fine-tuning adapters (QLoRA)
-
When needed, we can train small LoRA adapters on external GPUs (RunPod, or HEIG‑VD infrastructure)
-
Only a small fraction of weights is adapted, which keeps costs and deployment simple
-
Interestingly, our latest tests with Qwen3.6‑35B show that, on a focused benchmark of our LVGL/SO3/Verdin questions, the stock model already outperforms our previous fine‑tuned coder model. In that context, a well‑designed RAG + memory + skills setup matters more than aggressive fine‑tuning, at least for now.
Safety and reliability
Because this assistant can read and modify code, we’ve built in several safety mechanisms:
-
automatic approval only for read‑only commands
-
confirmation required for any write, run, or destructive action
-
protection against full rewrites of large files
-
per‑turn tool call budget and caching to avoid repeated commands
-
explicit mechanisms to mark answers as “good” or “bad” so they are (or are not) used in future training
We also keep the model’s “thinking mode” disabled in this context, as it tends to waste tokens without bringing benefits for our type of tasks.
Why this matters for our services
For our customers, this infrastructure is a way to:
-
accelerate development and debugging on complex embedded and edge platforms
-
keep all source code and sensitive data within a trusted Swiss‑based infrastructure
-
leverage the latest open models (Qwen, DeepSeek, Apertus, etc.) while retaining full control over deployment
-
build project‑specific assistants that really “know” the codebase and its conventions
Our collaboration and proximity with REDS is a key asset here. It gives us access to cutting‑edge research around open models and agent frameworks, and allows us to validate ideas quickly on real industrial use cases. It also paves the way for a server‑grade deployment of this assistant on powerful, on‑premise GPU hardware in the near future.
We will continue to evolve this platform, refine our datasets, and specialize the assistant further for embedded and edge scenarios.
If you are interested in local AI assistants, RAG over real codebases, or agent frameworks for embedded systems, we’d be very happy to exchange experiences and ideas.
