EDGEM·AI: A Local, RAG‑Powered Assistant for Our Embedded Projects

Daniel Rossier

(@daniel)

Topic starter 10/06/2026 3:05 pm [#57]

Over the last few months, we’ve been reshaping our workflows at EDGEMTech to put AI directly in the loop of our daily engineering work, especially around embedded systems and edge computing.

Like many teams, we see clear gains in development speed and bug resolution, and AI assistance strengthens both our specification-driven approach and the integration of software components. At the same time, our projects often involve industrial, medical, or otherwise critical environments, which come with strict constraints on confidentiality, digital sovereignty, and long‑term control over our infrastructure.

Cloud-first AI tooling is not acceptable for a large part of our work. This is why we started building our own AI stack, designed from day one to be local, privacy-preserving, and specialized for embedded and edge software development.

What we are building

We’re developing an internal AI coding assistant, currently used on:

our edgem1/verdin build system including TorizonOS
the SO3/AVZ hypervisor
LVGL projects and other UI stacks
generic embedded and Linux-based repos

The assistant runs fully locally: no source code or customer data ever leaves the machine. At its core is a modern open model (Qwen3.6‑35B‑A3B, a quantized mixture-of-experts model), wrapped in a harness that adds:

RAG over code and docs (per-project ChromaDB collections)
corpus-aware context (one “corpus” per project/repo)
persistent memory and skills, allowing the assistant to learn from real usage across sessions
optional fine-tuning via lightweight adapters, if and when it becomes useful again

In other words, it is not “just a model on a GPU”, but an AI tooling layer tailored to our codebases and workflows.
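As a rough illustration of what per-project retrieval means (this is not the actual EDGEM·AI code), here is a minimal, standard-library-only sketch. It swaps the real vector store for a toy bag-of-words cosine similarity, so it only shows the shape of the idea: one index per corpus, queried independently. The names `CorpusIndex`, `add` and `query`, as well as the sample documents, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into identifier-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CorpusIndex:
    """Hypothetical stand-in for one per-project collection."""

    def __init__(self, name):
        self.name = name
        self.chunks = []  # list of (source_path, text, term_counts)

    def add(self, source, text):
        self.chunks.append((source, text, Counter(tokenize(text))))

    def query(self, question, k=2):
        """Return up to k (source, text) pairs, best match first."""
        q = Counter(tokenize(question))
        scored = [(cosine(q, vec), src, text) for src, text, vec in self.chunks]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(src, text) for score, src, text in scored[:k] if score > 0]


# One index per corpus keeps an unrelated tree from polluting the results.
indexes = {"lvgl": CorpusIndex("lvgl"), "so3": CorpusIndex("so3")}
indexes["lvgl"].add("docs/widgets.md", "lv_label_create creates a label widget on a parent object")
indexes["lvgl"].add("docs/timers.md", "lv_timer_handler must be called periodically from the main loop")
indexes["so3"].add("doc/sched.md", "the scheduler picks the next ready thread")

hits = indexes["lvgl"].query("how do I create a label widget?")
```

The retrieved chunks would then be prepended to the model prompt. A real setup would use embeddings rather than word counts, but the per-corpus separation works the same way.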

Architecture, in practice

Our current deployment runs on a developer laptop (RTX 4060 with 8 GB of VRAM, 62 GB of RAM). This configuration is deliberately constrained: it forces us to focus, to rationalize the scope, and to validate the architecture and tooling on realistic, everyday hardware.

Model server: Qwen3.6‑35B‑A3B in Q8_0 via llama.cpp, with hybrid offload
- attention on GPU, experts in RAM
- ~7 tokens/s in typical usage, offline only
Front-end: a CLI chat client aware of the current working directory
Tools exposed to the model:
- bash, edit_file, append_file, write_file for code edits
- web_search for explicit internet queries when allowed
- remember, save_skill, search_history for long-term memory and procedure learning
Knowledge layers:
- system prompts and project rules
- RAG corpora per repo
- persistent memories and skills
- cross-session history search
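A back-of-the-envelope calculation shows why hybrid offload is the natural choice on such a laptop. The sketch below assumes the Q8_0 block layout used by llama.cpp (32 int8 weights plus one 16-bit scale per block, i.e. 8.5 bits per weight) and takes nominal parameter counts from the model name (35B in total, roughly 3B active per token). Real model files deviate somewhat from this, and the KV cache needs memory on top, so treat the numbers as estimates only.

```python
# 32 int8 weights + one fp16 scale per block -> 8.5 bits per weight
BITS_PER_WEIGHT_Q8_0 = (32 * 8 + 16) / 32


def weights_gib(n_params, bits_per_weight=BITS_PER_WEIGHT_Q8_0):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


total = weights_gib(35e9)   # whole model (nominal count, an assumption)
active = weights_gib(3e9)   # weights touched per token (the "A3B" part)

print(f"all weights      : {total:.1f} GiB")
print(f"active per token : {active:.1f} GiB")
print("fits in 8 GiB of VRAM :", total < 8)
print("fits in 62 GiB of RAM :", total < 62)
```

Under these assumptions the full weights come to roughly 35 GiB, far more than the GPU can hold but comfortably within system RAM, while only about 3 GiB of weights are read per generated token. That is consistent with keeping attention on the GPU and the experts in RAM.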

Thanks to our collaboration with the REDS institute (Reconfigurable and Embedded Digital Systems) at HEIG‑VD (University of Applied Sciences in Yverdon-les-Bains), we also have access to high‑end GPU infrastructure (for example RTX 6000‑class GPUs with 96 GB of VRAM). This will allow us to:

run heavier models or higher‑throughput instances when needed
perform faster and more extensive fine‑tuning runs
reach about 48 tokens/s in typical usage, still offline only
deploy a dedicated server‑grade instance of the assistant in the near future, while preserving the same architecture and local‑first design

Everything is orchestrated so that the assistant can read, navigate, search, and edit code in a controlled way, with guardrails on file size, on the number of tool calls per turn, and with explicit confirmation for any action that is not read‑only.
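To give an idea of what a per-turn guardrail can look like, here is a minimal sketch of a tool-call limiter that also reuses results of identical calls. It is an illustration under assumptions, not the harness itself: the class name `TurnGuard`, the default limit of 8 calls and the `fake_bash` helper are all hypothetical.

```python
class ToolBudgetExceeded(Exception):
    """Raised when a turn tries to run more tools than allowed."""


class TurnGuard:
    """Hypothetical per-turn limiter: caps tool calls and caches results."""

    def __init__(self, max_calls=8):
        self.max_calls = max_calls
        self.calls = 0
        self.cache = {}

    def run(self, tool_name, argument, tool_fn):
        key = (tool_name, argument)
        if key in self.cache:
            # Identical call within the same turn: answer from the cache.
            return self.cache[key]
        if self.calls >= self.max_calls:
            raise ToolBudgetExceeded(f"budget of {self.max_calls} tool calls used up")
        self.calls += 1
        result = tool_fn(argument)
        self.cache[key] = result
        return result


executed = []


def fake_bash(cmd):
    """Stand-in for a real tool; records what would have been executed."""
    executed.append(cmd)
    return f"ran {cmd}"


guard = TurnGuard(max_calls=2)
guard.run("bash", "ls", fake_bash)
guard.run("bash", "ls", fake_bash)          # served from the cache
guard.run("bash", "git status", fake_bash)  # second real call, budget now used up
```

A fresh `TurnGuard` would be created for each turn, so the budget and the cache never leak from one user request into the next.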

Corpora: how we structure knowledge

Central to this approach is the notion of a corpus: each project or workspace is its own corpus with:

its own RAG index
its own memories and skills
project-specific rules and conventions

For example:

edgem1 is a curated corpus built from our build system checkout (verdin, virt64, scripts, doc, etc.)
any other repo (LVGL, SO3, u-boot, …) becomes a generic corpus indexed with a standard directory walker

The CLI automatically detects the active corpus from the current directory, or lets you pick one:

edgem-chat — open a chat bound to the current repo
edgem-chat --corpus lvgl — explicitly pick a known corpus
edgem-chat --here — treat the current directory as an ad-hoc corpus
edgem-reindex / edgem-index — (re)build the index for a given tree

Large multi-component workspaces can be split into multiple corpora so that, for example, a huge upstream tree like u‑boot doesn’t dilute the relevance of RAG for a smaller in‑house component.
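Corpus detection and splitting can be sketched as a longest-match lookup on the current directory: when corpora are nested, the most specific root wins. This is a minimal illustration under assumptions, not the CLI's actual logic, and the `Corpus` class, the `detect_corpus` function and all paths are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Corpus:
    name: str
    root: PurePosixPath  # top of the tree this corpus indexes


def detect_corpus(cwd, corpora):
    """Return the corpus whose root contains cwd, preferring the deepest root."""
    cwd = PurePosixPath(cwd)
    matches = [c for c in corpora if cwd == c.root or c.root in cwd.parents]
    if not matches:
        return None  # a caller could fall back to an ad-hoc corpus here
    return max(matches, key=lambda c: len(c.root.parts))


# Hypothetical layout: a large upstream tree split out of its parent workspace.
corpora = [
    Corpus("edgem1", PurePosixPath("/work/edgem1")),
    Corpus("u-boot", PurePosixPath("/work/edgem1/u-boot")),
    Corpus("lvgl", PurePosixPath("/work/lvgl")),
]

in_uboot = detect_corpus("/work/edgem1/u-boot/drivers", corpora)
in_scripts = detect_corpus("/work/edgem1/scripts", corpora)
outside = detect_corpus("/tmp", corpora)
```

With this layout, a session started inside the upstream tree retrieves only from that tree's index, while the rest of the workspace keeps its own smaller and more relevant index.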

Four learning loops, different time scales

One important lesson from our experiments is that the model is only one part of the system. The real performance comes from how we manage knowledge and learning around it.

We currently combine four learning loops, each with its own “clock”:

RAG
- On‑the‑fly retrieval of code and documentation from the indexed corpora
- No retraining required; the index is refreshed by reindexing
Memory and skills
- The assistant can persist facts and procedures (with confirmation)
- These are re‑used in later sessions, effectively encoding habits and best practices
Curated datasets from real usage
- We log selected good exchanges as training samples
- We also build datasets from git history, documentation, “how‑to” procedures, and teacher‑student distillation on our own codebases
Fine-tuning adapters (QLoRA)
- When needed, we can train small LoRA adapters on external GPUs (RunPod, or HEIG‑VD infrastructure)
- Only a small fraction of the weights is trained, which keeps costs low and deployment simple
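The "small fraction" can be made concrete with the standard LoRA parameter count: for a weight matrix of size d_out x d_in, a rank-r adapter adds two matrices of sizes d_out x r and r x d_in, so r * (d_out + d_in) trainable parameters in total. The sketch below applies this to one layer; the 4096 x 4096 shape and rank 16 are illustrative values, not the dimensions of the model or the settings of any actual training run.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters relative to one frozen d_out x d_in matrix."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return adapter / full


fraction = lora_trainable_fraction(4096, 4096, 16)
print(f"{fraction:.2%} of this layer's weight count is trained")
```

For this example the adapter amounts to less than 1% of the layer's weights. With QLoRA, the frozen base weights are additionally kept in 4-bit precision during training, which is what makes such runs fit on modest GPUs.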

Interestingly, our latest tests with Qwen3.6‑35B show that, on a focused benchmark of our LVGL/SO3/Verdin questions, the stock model already outperforms our previous fine‑tuned coder model. In that context, a well‑designed RAG + memory + skills setup matters more than aggressive fine‑tuning, at least for now.

Safety and reliability

Because this assistant can read and modify code, we’ve built in several safety mechanisms:

automatic approval only for read‑only commands
confirmation required for any write, run, or destructive action
protection against full rewrites of large files
per‑turn tool call budget and caching to avoid repeated commands
explicit mechanisms to mark answers as “good” or “bad” so they are (or are not) used in future training
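The first two rules can be illustrated with a deliberately conservative approval check: a command is auto-approved only if it is a single call to an allowlisted read-only program, and everything else goes to the user. This is a sketch under assumptions, not the actual policy; the allowlist contents and the function name `needs_confirmation` are hypothetical.

```python
import shlex

# Illustrative allowlist of programs treated as read-only.
READ_ONLY_COMMANDS = {"ls", "cat", "grep", "head", "tail", "wc", "pwd"}

# Characters that allow chaining, redirection or substitution in a shell.
SHELL_METACHARACTERS = set(";|&><`$\n\r")


def needs_confirmation(command):
    """Return True unless the command is a single allowlisted read-only call."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return True  # pipes, redirections, substitutions: do not auto-approve
    try:
        argv = shlex.split(command)
    except ValueError:
        return True  # unbalanced quotes and similar parse errors
    if not argv:
        return True
    return argv[0] not in READ_ONLY_COMMANDS
```

For example, `needs_confirmation("ls -la src")` returns `False`, whereas `needs_confirmation("cat a.txt > b.txt")` and `needs_confirmation("rm -rf build")` both return `True`. Erring on the side of asking costs a keystroke; erring the other way can cost a working tree.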

We also keep the model’s “thinking mode” disabled in this context, as it tends to waste tokens without bringing benefits for our type of tasks.

Why this matters for our services

For our customers, this infrastructure is a way to:

accelerate development and debugging on complex embedded and edge platforms
keep all source code and sensitive data within a trusted Swiss‑based infrastructure
leverage the latest open models (Qwen, DeepSeek, Apertus, etc.) while retaining full control over deployment
build project‑specific assistants that really “know” the codebase and its conventions

Our collaboration and proximity with REDS is a key asset here. It gives us access to cutting‑edge research around open models and agent frameworks, and allows us to validate ideas quickly on real industrial use cases. It also paves the way for a server‑grade deployment of this assistant on powerful, on‑premise GPU hardware in the near future.

We will continue to evolve this platform, refine our datasets, and specialize the assistant further for embedded and edge scenarios.

If you are interested in local AI assistants, RAG over real codebases, or agent frameworks for embedded systems, we’d be very happy to exchange experiences and ideas.

This topic was modified 2 months ago 3 times by Daniel Rossier

Topic Tags

#EmbeddedSoftware #AI #LVGL #LVGLSafe #SWEngineering
