

13
Apr 26

Qwen 3.5 397B on a MacBook @ 29 tokens / s

A year ago I could only read about models in the 397B league. Today I can run one on my laptop. The combination of llama.cpp’s importance matrix (imatrix) with Unsloth’s per-model adaptive layer quantization is what makes it all possible.

Qwen3.5-397B-A17B is an 807GB beast that, even at Q4, would need more than 200GB of GPU RAM. Some people even call it “local”.

I have a fairly strong laptop: an M5 Max MacBook Pro 128GB (40-core GPU). That’s 128GB of unified RAM, most of which I can dedicate to the GPU. So 200GB+ is not something I can comfortably run on it. I could via SSD + RAM, but it would be more of an experiment than putting the model to work.

But Qwen 397B is an MoE model: it has 512 experts and routes to only 10 of them per token, so the other 502 can be quantized way below Q4 without hurting that particular inference. imatrix keeps the important weights close to their original values; Unsloth’s adaptive layer quantization figures out which layers can tolerate stronger quantization. The combination makes 397B at just ~2 bits fit in 106GB, which, according to my tests, retains a surprising amount of its intelligence. And on top of all that.. 106GB is my laptop territory.

The model is “Qwen3.5-397B-A17B-UD-IQ2_XXS”:

  • “UD” = Unsloth Dynamic 2.0, different layers are quantized differently
  • “I” / imatrix = most important weights are rounded to minimize their loss / error
  • “XXS”: extra, extra small.. 807GB => 106GB

After downloading it, before doing anything else I wanted to understand what it is I am going to ask my laptop to swallow:

$ ll -h ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS
total 224361224
-rw-r--r--  1 user  staff    10M Apr 12 18:50 Qwen3.5-397B-A17B-UD-IQ2_XXS-00001-of-00004.gguf
-rw-r--r--  1 user  staff    46G Apr 12 20:12 Qwen3.5-397B-A17B-UD-IQ2_XXS-00003-of-00004.gguf
-rw-r--r--  1 user  staff    14G Apr 12 20:57 Qwen3.5-397B-A17B-UD-IQ2_XXS-00004-of-00004.gguf
-rw-r--r--  1 user  staff    46G Apr 12 21:12 Qwen3.5-397B-A17B-UD-IQ2_XXS-00002-of-00004.gguf

Here is the 106GB. The original 16-bit model is 807GB. If it were “just” uniformly quantized to 2 bits, it would take (397B × 2 bits) / 8 = ~99GB, but I am looking at 106GB, hence I wanted to look under the hood to see the actual quantization recipe for this model:
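That back-of-the-envelope math is easy to check (a quick sketch in Python; decimal gigabytes, and note the quoted 807GB on disk includes more than just raw 16-bit weights):

```python
# Rough size math for a 397B-parameter model (decimal GB).
params = 397e9                    # total parameters
bf16_gb = params * 16 / 8 / 1e9   # raw 16-bit weights: ~794 GB (on-disk adds metadata, etc.)
q2_gb = params * 2 / 8 / 1e9      # naive uniform 2-bit quantization: ~99 GB
actual_gb = 106                   # what UD-IQ2_XXS actually takes on disk

extra_gb = actual_gb - q2_gb      # budget spent on higher-precision tensors: ~7 GB
print(round(bf16_gb), round(q2_gb), round(extra_gb))  # -> 794 99 7
```

That ~7GB surplus is exactly the point of the dynamic recipe: it is not spread uniformly, it goes to the tensors that matter most.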

$ gguf-dump \
  ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS/Qwen3.5-397B-A17B-UD-IQ2_XXS-00002-of-00004.gguf \
  2>&1 | head -200

super interesting. the expert tensors (ffn_gate_exps, ffn_up_exps and ffn_down_exps) are quantized at ~2 bits, but the rest are kept at much higher precision. This is where the 7GB difference (99GB vs. 106GB) really pays off: 7GB of packed intelligence on top of the expert tensors.

trial by fire

By trial and error I found that 16K of context is the sweet spot for the 128GB of unified memory, but the GPU wired memory limit needs to be raised a little to fit it (it defaults to around 96GB):

$ sudo sysctl iogpu.wired_limit_mb=122880

“llama.cpp” is the best choice to run this model (since MLX does not support IQ2_XXS quantization):

$ llama-server \
  -m ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS/Qwen3.5-397B-A17B-UD-IQ2_XXS-00001-of-00004.gguf \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --temp 1.0 --top-p 0.95 --top-k 20

My current use case, as I described here, is finding the best model assembly to help me make sense of my kids’ school work and progress. If anything is super messy in terms of organization, variety of disconnected systems where the kids’ data lives, and communication inconsistencies, it is US public schools. A small army of Claude Sonnets does it well’ish, but it is really expensive, hence the hope that “Qwen3.5 397B” could be a drop-in replacement.

In order to make sense of which local models “do good” I built cupel: https://github.com/tolitius/cupel, and that is the next step: fire it up and test “Qwen3.5 397B” on multi-turn, tool use, etc.. tasks:

Qwen3.5 397B A17B on M5 Max MacBook 128GB

It is on par with “Qwen 3.5 122B 4bit”, but I suspect I need to work on more exquisite prompts to distill the difference.

And, after all the tests, I found “Qwen3.5 397B IQ2” to be.. amazing. Even at 2 bits, it is extremely intelligent, and is able to call tools, pass context between turns, organize a very messy set of tables into clean aggregates, etc.

What surprised me the most is the 29 tokens per second average generation speed:

prompt eval time =   269.46 ms /   33 tokens (  8.17 ms per token, 122.46 tokens per second)
       eval time = 79785.85 ms / 2458 tokens ( 32.46 ms per token,  30.81 tokens per second)
      total time = 80055.31 ms / 2491 tokens
 
slot release: id 1 | task 7953 | stop processing: n_tokens = 2490, truncated = 0
srv update_slots: all slots are idle

this is one of the examples from the “llama.cpp” logs. the prompt processing speed depends on batching and ranged from 80 tokens per second to 330 tokens per second.

The disadvantages I can see so far:

  • can’t really run it efficiently in the assembly, since it is the only model that fits / can be loaded. with 122B (65GB) I can still run more models side by side
  • I don’t expect it to handle large context well due to hardware memory limitation
  • theoretically it would have a harder time with very specialized knowledge where a specific expert is needed but that expert’s weights are “too crushed” to give a clean answer. But, just maybe, the “I” in “IQ2_XXS” makes sure that the important weights stay very close to their original values
  • under load I saw the speed dropping from 30 to 17 tokens per second. I suspect it is caused by the prompt cache filling up and triggering evictions, but needs more research

But.. 512 experts, 397B of stored knowledge, 17B active parameters per token and all that at 29 tokens per second on a laptop.


09
Apr 26

US Public Schools meet Qwen 3.5 122B A10B

The last Llama (Scout/Maverick) was released a year ago. Since then US-based releases have been super rare: Granite 3.3, GPT-OSS 20B & 120B, Nemotron 3 Nano / Super and now Gemma 4. Meanwhile the Chinese open-model output is multiples of that, in volume and quality: Qwens, DeepSeeks, Kimis, MiniMaxes, GLMs, MiMos, Seeds, etc..

Gemma 4 is like a breath of fresh air. Not just the model itself, but the rollout, the beauty, the innovation: K=V in global attention, Per-Layer Embeddings, tri-modal minis (E4B, E2B), etc.

Most of my local LLM usage used to be via rented GPUs: Google Cloud, AWS, etc. But about a month ago I decided to bring it all home, and bought a shiny M5 Max MacBook Pro 128GB. It is a beast of a laptop, but also opens up the kind of models I can run locally: 128GB of unified RAM and all.

Besides the cost, the true benefit of running models locally is privacy. I never felt at ease sending my data to “OpenRouter => Model A” or even hosting it in AWS on P4d/P4de instances (NVIDIA A100): it is still my data, and it is not home, where I am.

But my laptop is.

When it comes to LLMs, unless it is research or coding, finding utility is difficult. But I have kids, and they have school, and if anything is super messy in terms of organization, variety of disconnected systems where the kids’ data lives, and communication inconsistencies, it is US public schools. But being a parent is fun, and this mess is a great fit for LLMs to make sense of. Local LLMs solve the last piece: my kids’ data stays on my laptop at home.

So it began. I loaded all I could onto my 128GB friendly beast and started looking at which models are good for what. The flow is not difficult:

  • go to many different school affiliated websites
    • some have APIs
    • some I need to screen-scrape with Playwright
    • some are a little of both, plus funky captchas, logins, etc.
  • then, when on “a” website
    • some teachers keep things inside a slide deck on “slide 13”
    • some in obscure folders
    • others on different systems, buried under many irrelevant links

LLMs need to scout through all this ambiguity and come back to me with a clear signal: what is due tomorrow and this week; what the grades are, why they are what they are, etc. Again, a great use case for an LLM, since it is lots of unorganized text with a clear goal to optimize for.

You may be thinking just about now: “OpenClaw”. And you would be correct, this is where I started, but then I realized that OpenClaw is only as good as the set of LLMs behind it. Also, if I schedule a vanilla OS cron that invokes a “school skill”, the number of tokens sent to the LLM goes from 10K down to about 600. And while I do have an OpenClaw running on VPS / OpenRouter, this was not (maybe yet) a good use of it.

In order to rank local models I scavenged a few problems over the years that I had to solve with the big boys: Claude, OpenAI, Grok and Gemini. They are nice enough to record everything we talk about, which is anything but local, but in this case it gave me a chance to collect a few problems and convert them to prompts with rubrics.

I then wrote a script to start making sense of what works for me vs. what is advertised and/or works for others. The script grew fast, but was missing a look and feel.

So I added UI to it: https://github.com/tolitius/cupel

Besides the usual general problems, I used a few specific prompts with tool use and multi-turns (multiple steps composed via tool calling), focused specifically on school related activities.

After a few nights of trial and error, I found that “Qwen 3.5 122B A10B Q4” is the best and the closest to solving most of the tasks. A pleasant surprise, by the way, was “NVIDIA Nemotron 3 Super 120B A12B 4bit”. I really like this model, it is fast and unusually great. “Unusually” because previous Nemotrons did not stand out the way this one does.

And then Gemma 4 came around.

Interestingly, at least for my use case, “Qwen 3.5 122B A10B Q4” still performs better than “Gemma 4 26B A4B”, and is about 50/50 accuracy-wise with “Gemma 4 31B”, but it wins hands down in speed. “Gemma 4 31B” at full precision runs at about 7 tokens per second on an M5 Max MacBook Pro 128GB, whereas “Qwen 3.5 122B A10B Q4” runs at 50 to 65 tokens / second.

here I tested Gemma 4 via OpenRouter to avoid any misconfiguration on my side (plus it is 2x faster there)

But I suspect I still need to learn “The Way of Gemma” to make it work much better. It really is a giant leap forward given its size vs. quality. After all, at 31B, although dense, it stands side by side with 122B.


06
Jan 26

Mad Skills to Learn The Universe

Simple wins. Always. Because it is very composable, fun to work with, and organically evolves.

LLMs are great at language and text, but they don’t know how to do anything, so they rely on something running (e.g. a server) and listening to their output. If the output is formatted in a certain way, say:

sum(41, 1)

the server / listener can convert it to code, call a “sum” function and return 42:

In this particular case we would say that the LLM has a “sum” tool. And the LLM would have a prompt similar to this one where it would identify that two numbers need to be added from the user’s prompt, and would “call a tool” with “sum(41, 1)”.
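A toy version of that listener might look like this (hypothetical names; real stacks use structured JSON tool calls, but the idea is the same):

```python
import re

# the "tools" this listener knows how to run (hypothetical registry)
TOOLS = {"sum": lambda a, b: a + b}

def handle(llm_output: str):
    """If the LLM output looks like a tool call, run it; otherwise pass the text through."""
    match = re.fullmatch(r"(\w+)\((.*)\)", llm_output.strip())
    if match and match.group(1) in TOOLS:
        name, raw_args = match.groups()
        args = [int(a) for a in raw_args.split(",")]  # toy parser: integer args only
        return TOOLS[name](*args)
    return llm_output

print(handle("sum(41, 1)"))  # -> 42
```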

But that was a few years ago, when we would beg LLMs to call a tool and return the response in JSON, pretty please. And sometimes it would even work, and sometimes it would not.

Fast forward to now, we have Tools, MCP, Agents, LSP, + many more acronyms. Plus LLMs are much better at following our “please JSON” begging.

However, one capability, one discovery and innovation in this field, stands out: Skills. The reason it stands out is that, compared to all the hype and acronyms, Skills are… simple.

Just a Directory

A Skill is effectively a directory. That’s it.

It doesn’t require a complex registration process or a heavy framework. It requires a folder, a markdown file, and maybe a script or two. It relies on a standard structure that the Agent (current fancy name for a “piece of code that calls LLM”) knows how to traverse.

This simplicity lowers the barrier to entry significantly. The best way to learn something is to do it, hence I needed a problem to solve.

The “Dad” Problem

My kids are navigating the ways of middle/high school: one is deep in ionic bonding in Chemistry, another one is grappling with graphing sine waves in Pre-Calculus.

I can explain these concepts to them with words, or I can scribble on a napkin (which I love doing), but some concepts just need to move. They need to be visualized. I could write a Python program or a D3.js visualization for every single problem, but that takes too long, even when done directly with “the big 4”.

I needed a way to say: “Show me a covalent bond in H₂O” and have the visualization write itself, test itself, run itself, and show me that interactive covalent bond moving step by step.

A great fit for a “Skill”, I thought, and built salvador, which can do this:

> /visualize a covalent bond in H₂O


Salvador… Who?

Salvador is an Anthropic Skill (but I hear others are adopting skills as well) that lives in my terminal. Its job is to visualize physics, math, or any other Universe concepts on demand.

The beauty is in the organization and progressive disclosure. The Agent doesn’t load the entire “visualization knowledge base” into its context window at once, like it did with MCP. It discovers the skill via metadata and only loads the heavy instructions when I actually ask for a visualization.

Here is what it looks like on disk:

salvador/
├── SKILL.md                 <-- the brain
├── scripts/                 <-- the hands
│   ├── inspect.js
│   └── setup.sh
└── templates/               <-- the scaffolding
    └── visualization.html

The Brain (SKILL.md)

This is the core. It uses a markdown format with YAML frontmatter. But inside, it’s not code – it’s a codified intent. It defines the flow: analyze the ask, divide and conquer, plan the steps / structure, code the solution, verify the result, go back if needed, etc.

The Hands (scripts/)

An agent without tools is just a chatbot. For salvador to work, it needs to “see” what it created.

A small script, inspect.js, lives inside and spins up a headless browser. When the Agent generates a visualization, it doesn’t “hope” it works. It runs inspect.js to capture console errors and take screenshots.

If the screen, or a sequence of screens, are not what they should be, whether the laws of Physics are off, or a text overlay is misplaced, the Agent catches it, understands the problem, and rewrites the code.

The Loop

This is where the “agent” part becomes real. It’s not the usual “text in, code out”, it’s a loop:

Idea -> Code -> Inspect -> Fix (as many times as needed) -> Render
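That loop is simple enough to sketch (the stand-ins below are toys for the real generate / inspect steps, which in salvador are the LLM and inspect.js):

```python
def agent_loop(ask, generate, inspect, max_tries=5):
    """Idea -> Code -> Inspect -> Fix (as many times as needed) -> Render."""
    feedback = None
    for _ in range(max_tries):
        code = generate(ask, feedback)   # write (or rewrite) the visualization
        errors = inspect(code)           # headless browser: console errors, screenshots
        if not errors:
            return code                  # clean -> render it
        feedback = errors                # feed the problems into the next attempt
    raise RuntimeError("could not produce a clean visualization")

# toy stand-ins: the first draft gets the gravity constant wrong, the second fixes it
drafts = iter(["g = 3.7", "g = 9.81"])
result = agent_loop("bouncing ball on Earth",
                    generate=lambda ask, fb: next(drafts),
                    inspect=lambda code: [] if "9.81" in code else ["wrong gravity"])
print(result)  # -> g = 9.81
```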

Now, when we work on grokking a Physics problem at the kitchen table, I have a lot more than a napkin drawing: me and Salvador are cooking:

$ claude
> /visualize the difference between gravity on Earth vs Mars using a bouncing ball

Salvador picks up the intent, writes the physics simulation, tests it, fixes the gravity constant if it got it wrong, and pops open a browser window:

Humans today create and “conduct” code with or without LLMs. But complementing it with a Skill unlocks learning The Universe quark by quark, lepton by lepton.

Composable. And very open for evolution, because it is.. Simple. and Simple Wins. Always.


09
Apr 17

Hazelcast: Keep your cluster close, but cache closer

Hazelcast has a neat feature called Near Cache. When clients talk to Hazelcast servers, each get/put is a network call, and depending on how far away the cluster is, these calls may get pretty costly.

The idea of Near Cache is to bring data closer to the caller, and keep it in sync with the source. Which is why it is highly recommended for data structures that are mostly read.

Hazelcast Near Cache
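The mechanics can be sketched in a few lines of Python (a toy model, not Hazelcast’s actual implementation): reads are served from a local copy, and the server broadcasts invalidation events when a key changes:

```python
class NearCache:
    """Toy read-through cache with server-driven invalidation."""
    def __init__(self, server):
        self.server = server        # stands in for the remote cluster
        self.local = {}             # entries kept in the client's memory
        server.subscribers.append(self)

    def get(self, k):
        if k not in self.local:             # miss -> one "network" call
            self.local[k] = self.server.data[k]
        return self.local[k]                # hit -> no network call

    def invalidate(self, k):
        self.local.pop(k, None)             # the server said this key changed

class Server:
    def __init__(self):
        self.data, self.subscribers = {}, []
    def put(self, k, v):
        self.data[k] = v
        for s in self.subscribers:          # broadcast the invalidation event
            s.invalidate(k)

server = Server()
server.put(41, 41)
cache = NearCache(server)
print(cache.get(41))   # -> 41 (fetched from the server, now local)
server.put(41, 42)     # invalidates the near-cached entry
print(cache.get(41))   # -> 42 (re-fetched from the source)
```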

Near Cache can be created / configured on both the server and the client side.

Optionally Near Cache keys can be stored on the file system, and then preloaded when the client restarts.

The examples below are run from a Clojure REPL and use chazel, which is a Clojure library for Hazelcast. To follow along you can:

$ git clone https://github.com/tolitius/chazel
$ cd chazel
$ boot dev
boot.user=> ;; ready for examples

In case you don’t have boot installed, it is a one-liner install.

Server Side Setup

We’ll use two different servers, not too far from each other, so that the network latency is enough to get a good visual on how Near Cache can help.

On the server side we’ll create an "events" map (which will start the server if it was not yet started), and will add 100,000 pseudo events to it:

;; these are done on the server:
 
(def m (hz-map "events"))
 
(dotimes [n 100000] (put! m n n))

We can visualize all these puts with hface:

hface putting 100,000 entries

Client Side Without Near Cache

On the client side we’ll create a function to walk over the first n keys in the "events" map:

(defn walk-over [m n]
  (dotimes [k n] (get m k)))

Create a new Hazelcast client instance (without Near Cache configured), and walk over the first 100,000 events (twice):

(def hz-client (client-instance {:hosts ["10.x.y.z"]}))
 
(def m (hz-map "events" hz-client))
 
(time (walk-over m 100000))
=> "Elapsed time: 30534.997599 msecs"
 
(time (walk-over m 100000))
=> "Elapsed time: 30547.810322 msecs"

Each iteration took roughly 30.5 seconds, and monitoring the server’s network showed packets going back and forth for every get:

Hazelcast with no Near Cache

We can see that all these packets correlate well with the "events" map:

hface putting 100,000 entries

Client Side With Near Cache

Now let’s create a different client and configure it with Near Cache for the "events" map:

(def client-with-nc (client-instance {:hosts ["10.x.y.z"]
                                      :near-cache {:name "events"}}))

Let’s repeat the exercise:

(def m (hz-map "events" client-with-nc))
 
(time (walk-over m 100000))
=> "Elapsed time: 30474.719965 msecs"
 
(time (walk-over m 100000))
=> "Elapsed time: 102.141527 msecs"

The first iteration took 30.5 seconds as expected, but the second, and all the subsequent ones, took 100 milliseconds. That’s because the Near Cache kicked in, and all these events are now close to the client: in the client’s memory.

As expected all subsequent calls do not use the server:

Hazelcast with Near Cache

Keeping Near Cache in Sync

The first logical question is: ok, I brought these events into memory, but wouldn’t they become stale if they change on the server?

Let’s check:

;; checking on the client side
(get m 41)
=> 41
;; on the server: changing the value of a key 41 to 42
(put! m 41 42)
;; checking again on the client side
(get m 41)
=> 42

Pretty neat. Hazelcast invalidates near-cached entries by broadcasting invalidation events from the cluster members. These events are fire and forget, but Hazelcast is very good at figuring out if and when they get lost.

There are a couple of system properties that could be configured to control this behaviour:

  • hazelcast.invalidation.max.tolerated.miss.count: Default value is 10. If the missed invalidation count goes above this value, the relevant cached data is made unreachable, and new values are repopulated from the source.

  • hazelcast.invalidation.reconciliation.interval.seconds: Default value is 60 seconds. A periodic task scans cluster members at this interval and compares the generated invalidation events with the ones the Near Cache received.
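Both are plain JVM system properties, so (with illustrative values, assuming a standalone member started from the command line) they could be set like this:

```shell
java -Dhazelcast.invalidation.max.tolerated.miss.count=20 \
     -Dhazelcast.invalidation.reconciliation.interval.seconds=30 \
     -cp hazelcast.jar com.hazelcast.core.server.StartServer
```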

Near Cache Preloader

If a client restarts, its near cache is lost and has to be naturally repopulated by application / client requests.

Near Cache can be configured with a preloader that persists all the keys from the map to disk, and repopulates the cache from that file after a restart.

Let’s create a client instance with such a preloader:

(def client-with-nc (client-instance {:hosts ["10.x.y.z"] 
                                      :near-cache {:name "events"
                                                   :preloader {:enabled true
                                                               :store-initial-delay-seconds 60}}}))

And walk over the map:

(def m (hz-map "events" client-with-nc))
 
(walk-over m 100000)

As per the store-initial-delay-seconds config property, 60 seconds after we created a reference to this map, the preloader will persist the keys into the nearCache-events.store file (the filename is configurable):

INFO: Stored 100000 keys of Near Cache events in 306 ms (1953 kB)

Now let’s restart the client and try to iterate over the map again:

(shutdown-client client-with-nc)
(def client-with-nc (client-instance {:hosts ["10.x.y.z"]
                                      :near-cache {:name "events"
                                      :preloader {:enabled true}}}))
 
(def m (hz-map "events" client-with-nc))
 
(time (walk-over m 100000))
INFO: Loaded 100000 keys of Near Cache events in 3230 ms
"Elapsed time: 2920.688369 msecs"
 
(time (walk-over m 100000))
;; "Elapsed time: 103.878848 msecs"

The first iteration took 3 seconds (and not 30): once the preloader loaded all the keys, the rest (27 seconds worth of data) came back from the client’s memory.

This 3 second spike can be observed in the network usage:

Hazelcast with Near Cache

And all the subsequent calls now again take 100 ms.

Near Cache Full Config

There are a lot more Near Cache knobs beyond the map name and preloader. All are well documented in the Hazelcast docs and available as edn config with chazel.

Here is an example:

{:in-memory-format :BINARY,
 :invalidate-on-change true,
 :time-to-live-seconds 300,
 :max-idle-seconds 30,
 :cache-local-entries true,
 :local-update-policy :CACHE_ON_UPDATE,
 :preloader {:enabled true,
             :directory "nearcache-example",
             :store-initial-delay-seconds 15,
             :store-interval-seconds 60},
 :eviction  {:eviction-policy :LRU,
             :max-size-policy :ENTRY_COUNT,
             :size 800000}}

Any config options that are not provided will be set to Hazelcast defaults.


10
Jan 17

Hubble Space Mission Securely Configured

This week we learned of a recent alien hacking into Earth. We have a suspect but are unsure about the true source. Could be the Borg, could be Klingons, could be Cardassians. One thing is certain: we need better security and flexibility for our space missions.

We’ll start with the first line of defense: science based education. The better we are educated, the better we are equipped to make decisions, to understand the universe, to vote.

One of the greatest space exploration frontiers is the Hubble Space Telescope. For the next 5 minutes we will work on configuring and bringing the telescope online, keeping things secure and flexible in the process.

The Master Plan

In order to keep things simple and solid we are going to use these tools:

  • Vault is a tool for managing secrets.
  • Consul, besides being the chief magistrate of the Roman Republic, is now also a service discovery and distributed configuration tool.
  • Hazelcast is a simple, powerful in-memory data grid that is a pleasure to work with.

Here is the master plan to bring Hubble online:

override hazelcast hosts

Hubble has its own internal configuration file which is not environment specific:

{:hubble {:server {:port 4242}
          :store {:url "spacecraft://tape"}
          :camera {:mode "mono"}
          :mission {:target "Eagle Nebula"}
 
          :log {:name "hubble-log"
                :hazelcast {:hosts "OVERRIDE ME"
                            :group-name "OVERRIDE ME"
                            :group-password "OVERRIDE ME"
                            :retry-ms 5000
                            :retry-max 720000}}}}

As you can see, the initial, default mission is the “Eagle Nebula”; Hubble’s state is stored on tape, it uses a mono (vs. color) camera, and it has an internal server that runs on port 4242.

Another thing to notice: Hubble stores an audit/event log in a Hazelcast cluster. This cluster needs an environment specific location and creds. While the location may or may not be encrypted, the creds definitely should be.

All of the above can of course be, and some of it will be, overridden at startup. We are going to keep the overrides in Consul, and the creds in Vault. On Hubble startup the Consul overrides will be merged with the internal Hubble config, and the creds will be decrypted, securely read from Vault, and used to connect to the Hazelcast cluster.

Environment Matters

Before configuring Hubble, let’s create and initialize the environment. As mentioned before, we need to set up Consul, Vault and Hazelcast.

Consul and Vault

Consul will play two roles in the setup:

  • a “distributed configuration” service
  • Vault’s secret backend

Both can be easily started with docker. We’ll use cault’s help to set up both.

$ git clone https://github.com/tolitius/cault
$ cd cault
 
$ docker-compose up -d
Creating cault_consul_1
Creating cault_vault_1

Cault runs both Consul and Vault’s official docker images, with Consul configured to be Vault’s backend. Almost done.

Once the Vault is started, it needs to be “unsealed”:

docker exec -it cault_vault_1 sh
$ vault init          ## will show 5 unseal keys and a root token
$ vault unseal        ## use 3 out of 5 unseal keys
$ vault auth          ## use a root token        ## >>> (!) remember this token

Not to duplicate it here, you can follow unsealing Vault step by step with visuals in cault docs.

We’ll also save the Hubble secrets here, from within the docker image:

$ vi creds

add {"group-name": "big-bank", "group-password": "super-s3cret!!!"} and save the file.

now write it into Vault:

$ vault write secret/hubble-audit value=@creds
 
Success! Data written to: secret/hubble-audit

This way the actual group name and password won’t show up in the bash history.

Hazelcast Cluster in 1, 2, 3

The next part of the environment is a Hazelcast cluster where Hubble will be sending all of the events.

We’ll do it with chazel. I’ll use boot in this example, but you can use lein / gradle / pom.xml, anything that can bring [chazel "0.1.12"] from clojars.

Open a new terminal and:

$ boot repl
boot.user=> (set-env! :dependencies '[[chazel "0.1.12"]])
boot.user=> (require '[chazel.core :as hz])
 
;; creating a 3 node cluster
boot.user=> (hz/cluster-of 3 :conf (hz/with-creds {:group-name "big-bank"
                                                   :group-password "super-s3cret!!!"}))
 
Members [3] {
    Member [192.168.0.108]:5701 - f6c0f121-53e8-4be0-a958-e8d35571459d
    Member [192.168.0.108]:5702 - e773c493-efe8-4806-b568-d2af57947fc9
    Member [192.168.0.108]:5703 - f9e0719d-aec7-405e-9aef-48baa56b11ec this}

And we have a 3 node Hazelcast cluster up and running.

Note that in a real world scenario Consul, Vault and the Hazelcast cluster would already be running before we get to write and deploy the Hubble code.

Let there be Hubble!

The Hubble codebase lives on github, as it should :) So let’s clone it first:

$ git clone https://github.com/tolitius/hubble
 
$ cd hubble

“Putting some data where Consul is”

We do have Consul up and running, but we have no overrides in it. We can either:

  • manually add overrides for Hubble config or
  • just initialize Consul with current Hubble config / default overrides

Hubble has init-consul boot task which will just copy a part of Hubble config to Consul, so we can override values later if we need to:

$ boot init-consul
read config from resource: "config.edn"
22:49:34.919 [clojure-agent-send-off-pool-0] INFO  hubble.env - initializing Consul at http://localhost:8500/v1/kv

Let’s revisit Hubble config and figure out what needs to be overridden:

{:hubble {:server {:port 4242}
          :store {:url "spacecraft://tape"}
          :camera {:mode "mono"}
          :mission {:target "Eagle Nebula"}
 
          :log {:enabled false                              ;; can be overridden at startup / runtime / consul, etc.
                :auth-token "OVERRIDE ME"
                :name "hubble-log"
                :hazelcast {:hosts "OVERRIDE ME"
                            :group-name "OVERRIDE ME"
                            :group-password "OVERRIDE ME"
                            :retry-ms 5000
                            :retry-max 720000}}
 
          :vault {:url "OVERRIDE ME"}}}

The only obvious thing to override is hubble/log/hazelcast/hosts, since the creds need to be overridden securely later at runtime, as does hubble/log/auth-token. In fact, if you look into Consul, you will see neither the creds nor the auth token.

The less obvious thing to override is the hubble/vault/url. We need this, so Hubble knows where Vault lives once it needs to read and decrypt creds at runtime.

We will also override hubble/log/enabled to enable Hubble event logging.

So let’s override these in Consul:

  • hubble/log/hazelcast/hosts to ["127.0.0.1"]
  • hubble/vault/url to http://127.0.0.1:8200
  • hubble/log/enabled to true

We could go to the Consul UI and override these one by one, but it is easier to do it programmatically in one shot.

Envoy Extraordinary and Minister Plenipotentiary

Hubble relies on envoy to communicate with Consul, so writing a value or a map with all overrides can be done in a single go:

(from under /path/to/hubble)

$ boot dev
boot.user=> (require '[envoy.core :as envoy])
nil
boot.user=> (def overrides {:hubble {:log {:enabled true
                                           :hazelcast {:hosts ["127.0.0.1"]}}
 
                                     :vault {:url "http://127.0.0.1:8200"}}})
#'boot.user/overrides
boot.user=> (envoy/map->consul "http://localhost:8500/v1/kv" overrides)

We can spot check these in Consul UI:

override hazelcast hosts

Consul is all ready. And we are ready to bring Hubble online.

Secrets are best kept by people who don’t know them

Two more things solve the puzzle: the Hazelcast creds and the auth token. We know the creds are encrypted and live in Vault. In order to securely read them out we need a token to access them. But we also do not want to expose the token to these creds, so we ask Vault to place the creds in one of its cubbyholes for, say, 120 ms, and to generate a temporary, one-time-use token to access this cubbyhole. This way, once the Hubble app gets the creds at runtime, this auth token has done its job and can no longer be used.

In Vault lingo this is called “Response Wrapping“.

cault, the one you cloned at the very beginning, has a script to generate this token. And supporting documentation on response wrapping.

We saved the Hubble Hazelcast creds under secret/hubble-audit, so let’s generate this temp token for it. We need the Vault root token from the “vault init” step in order for the cault script to work:

(from under /path/to/cault)

$ export VAULT_ADDR=http://127.0.0.1:8200
$ export VAULT_TOKEN=797e09b4-aada-c3e9-7fe8-4b7f6d67b4aa
 
$ ./tools/vault/cubbyhole-wrap-token.sh /secret/hubble-audit
eda33881-5f34-cc34-806d-3e7da3906230

eda33881-5f34-cc34-806d-3e7da3906230 is the token we need, and, by default, it is going to be good for 120 ms. In order to pass it along to the Hubble start, we’ll rely on cprop to merge an ENV var (could be a system property, etc.) with the existing Hubble config.

In the Hubble config the token lives here:

{:hubble {:log {:auth-token "OVERRIDE ME"}}}

So to override it we can simply export an ENV var before running the Hubble app:

(from under /path/to/hubble)

$ export HUBBLE__LOG__AUTH_TOKEN=eda33881-5f34-cc34-806d-3e7da3906230

Now we are 100% ready. Let’s roll:

(from under /path/to/hubble)

$ boot up
INFO  mount-up.core - >> starting.. #'hubble.env/config
read config from resource: "config.edn"
INFO  mount-up.core - >> starting.. #'hubble.core/camera
INFO  mount-up.core - >> starting.. #'hubble.core/store
INFO  mount-up.core - >> starting.. #'hubble.core/mission
INFO  mount-up.core - >> starting.. #'hubble.watch/consul-watcher
INFO  hubble.watch - watching on http://localhost:8500/v1/kv/hubble
INFO  mount-up.core - >> starting.. #'hubble.server/http-server
INFO  mount-up.core - >> starting.. #'hubble.core/mission-log
INFO  vault.client - Read cubbyhole/response (valid for 0 seconds)
INFO  chazel.core - connecting to:  {:hosts [127.0.0.1], :group-name ********, :group-password ********, :retry-ms 5000, :retry-max 720000}
Jan 09, 2017 11:54:40 PM com.hazelcast.core.LifecycleService
INFO: hz.client_0 [big-bank] [3.7.4] HazelcastClient 3.7.4 (20161209 - 3df1bb5) is STARTING
Jan 09, 2017 11:54:40 PM com.hazelcast.core.LifecycleService
INFO: hz.client_0 [big-bank] [3.7.4] HazelcastClient 3.7.4 (20161209 - 3df1bb5) is STARTED
Jan 09, 2017 11:54:40 PM com.hazelcast.client.connection.ClientConnectionManager
INFO: hz.client_0 [big-bank] [3.7.4] Authenticated with server [192.168.0.108]:5703, server version:3.7.4 Local address: /127.0.0.1:52261
Jan 09, 2017 11:54:40 PM com.hazelcast.client.spi.impl.ClientMembershipListener
INFO: hz.client_0 [big-bank] [3.7.4]
 
Members [3] {
    Member [192.168.0.108]:5701 - f6c0f121-53e8-4be0-a958-e8d35571459d
    Member [192.168.0.108]:5702 - e773c493-efe8-4806-b568-d2af57947fc9
    Member [192.168.0.108]:5703 - f9e0719d-aec7-405e-9aef-48baa56b11ec
}
 
Jan 09, 2017 11:54:40 PM com.hazelcast.core.LifecycleService
INFO: hz.client_0 [big-bank] [3.7.4] HazelcastClient 3.7.4 (20161209 - 3df1bb5) is CLIENT_CONNECTED
Starting reload server on ws://localhost:52265
Writing adzerk/boot_reload/init17597.cljs to connect to ws://localhost:52265...
 
Starting file watcher (CTRL-C to quit)...
 
Adding :require adzerk.boot-reload.init17597 to app.cljs.edn...
Compiling ClojureScript...
• js/app.js
Elapsed time: 8.926 sec

Exploring Universe with Hubble

… All systems are check … All systems are online

Let’s go to http://localhost:4242/, where Hubble’s server is listening:

Let’s repoint Hubble to the Cat’s Eye Nebula by changing hubble/mission/target to “Cats Eye Nebula”:

Also, let’s upgrade Hubble’s camera from a monochrome one to one that captures color by changing hubble/camera/mode to “color”:
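Both of these tweaks can also be made without any UI, straight against Consul’s HTTP KV API. A sketch, assuming the local Consul from earlier is listening on localhost:8500:

```shell
# repoint the mission and switch the camera via Consul's KV endpoint;
# each PUT returns "true", and Hubble's consul watcher picks up the change
curl -X PUT -d 'Cats Eye Nebula' http://localhost:8500/v1/kv/hubble/mission/target
curl -X PUT -d 'color' http://localhost:8500/v1/kv/hubble/camera/mode
```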

Check the event log

Captain wanted the full report of events from the Hubble log. Aye aye, captain:

(from under a boot repl with a chazel dep, as we discussed above)

;; (Hubble serializes its events with transit)
boot.user=> (require '[chazel.serializer :as ser])
 
boot.user=> (->> (for [[k v] (into {} (hz/hz-map "hubble-log"))]
                   [k (ser/transit-in v)])
                 (into {})
                 pprint)
{1484024251414
 {:name "#'hubble.core/mission",
  :state {:active true, :details {:target "Cats Eye Nebula"}},
  :action :up},
 1484024437754
 {:name "#'hubble.core/camera",
  :state {:on? true, :settings {:mode "color"}},
  :action :up}}

This is the event log persisted in Hazelcast. In case Hubble goes offline, we still have both: its configuration reliably stored in Consul and all of its events stored in the Hazelcast cluster.

Looking the Hazelcast cluster in the face

This is not necessary, but we can also monitor the state of the Hubble event log with hface:

But.. how?

To peek a bit inside, here is how Consul overrides are merged with the Hubble config:

(defn create-config []
  (let [conf (load-config :merge [(from-system-props)
                                  (from-env)])]
    (->> (conf :consul)
         to-consul-path
         (envoy/merge-with-consul conf))))

And here is how Hazelcast creds are read from Vault:

(defn with-creds [conf at token]
  (-> (vault/merge-config conf {:at at
                                :vhost [:hubble :vault :url]
                                :token token})
      (get-in at)))

And these creds are only merged into a subset of the Hubble config that is used once to connect to the Hazelcast cluster:

(defstate mission-log :start (hz/client-instance (env/with-creds env/config
                                                                 [:hubble :log :hazelcast]
                                                                 [:hubble :log :auth-token]))
                      :stop (hz/shutdown-client mission-log))

In other words, the creds never make it into env/config: they are only seen once, at cluster connection time, and only by the Hazelcast client instance.

You can follow hubble/env.clj to see how it all comes together.

While we tried to get closer to rocket science, it is in fact really simple to integrate Vault and Consul into a Clojure application.

The first step is made

We are operating Hubble and raising human intelligence one nebula at a time.