Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

The 10-second hook

No Java install, no daemon, no native build — npx a tool-calling local LLM and start chatting:

bash

npx @deemwario/mochallama chat -m qwen2.5-1.5b

The CLI ships its own jlink-built JDK 22 runtime image via npm, so no JDK is needed on the host. qwen2.5-1.5b is the default tool-capable preset; the model downloads on first run into ~/.chatbot_models.

Embed it: the smallest plain-Java snippet

Two dependencies — the Java jar plus the platform aggregator that resolves the right native classifier jar for your host:

build.gradle.kts / pom.xml

kotlin

implementation("io.github.deemwario:mochallama-core:0.1.6")
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")

xml

<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>

java

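import tools.deemwar.mochallama.panama.ChatEngine;
import java.nio.file.Path;

public class Haiku {
    public static void main(String[] args) throws Exception {
        // Load a local GGUF model in-process (no server, no HTTP).
        var engine = ChatEngine.load(Path.of("/path/to/model.gguf"));
        // Arguments: the prompt, then 128 and 0.7 (presumably max tokens and temperature).
        String reply = engine.chat("Write a haiku about Project Panama.", 128, 0.7);
        System.out.println(reply);
    }
}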

JVM flags

JDK 22+ is required: the FFM API was finalized in JDK 22 (JEP 454). Run with --enable-native-access=ALL-UNNAMED, which grants native access to code loaded from the classpath.
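For readers who have not used FFM before, here is a minimal sketch of the kind of call the flag governs. It is not mochallama's code: it binds the C standard library's strlen as a stand-in for a llama.cpp function, and it assumes a 64-bit platform where size_t fits a Java long. The class name FfmStrlen is made up for illustration.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class FfmStrlen {
    /** Calls the C library's strlen on a Java string through an FFM downcall. */
    static long strlen(String text) {
        Linker linker = Linker.nativeLinker();
        // Look the symbol up in the default (C standard library) lookup.
        MemorySegment symbol = linker.defaultLookup().find("strlen").orElseThrow();
        // size_t strlen(const char*): assumed to map to long on a 64-bit platform.
        // downcallHandle is a restricted method, which is what
        // --enable-native-access is about.
        MethodHandle handle = linker.downcallHandle(
                symbol, FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Copy the string into native memory as a NUL-terminated UTF-8 C string;
            // the memory is freed when the arena closes.
            MemorySegment cString = arena.allocateFrom(text);
            return (long) handle.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("strlen downcall failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(strlen("Project Panama")); // prints 14
    }
}
```

Run it with java --enable-native-access=ALL-UNNAMED FfmStrlen.java on JDK 22 or later. No JNI glue code or native compilation step is involved, which is the property the rest of this page relies on.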

Or one Spring dependency

The starter autoconfigures a local model service and the OpenAI-compatible endpoints — no spring-ai dependency required:

build.gradle.kts / pom.xml

kotlin

implementation("io.github.deemwario:mochallama-spring-boot-starter:0.1.6")
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")

xml

<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-spring-boot-starter</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>

Tell it which model to load. A Hugging Face id is the simplest option: the starter resolves and caches the GGUF on first start. In src/main/resources/application.properties:

properties

llamacpp.model.hf-id=Qwen/Qwen2.5-1.5B-Instruct-GGUF
# or an explicit url + filename:
# llamacpp.model.url=https://.../qwen2.5-1.5b-instruct-q4_k_m.gguf
# llamacpp.model.filename=qwen2.5-1.5b-instruct-q4_k_m.gguf

Start the app, then point any OpenAI client at it. The model loads asynchronously, so endpoints return 503 until the model reports state: READY. POST /v1/chat/completions handles non-streaming requests, SSE streaming (stream: true), and tools / tool_choice; GET /v1/models lists the loaded model.

bash

curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello from local llama.cpp"}]}'
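Because the endpoints answer 503 while the model is still loading, a client that starts alongside the app may want to retry. The following is a rough sketch using only the JDK's java.net.http client; the class name, the retry count, and the pause are illustrative choices, and the base URL assumes Spring Boot's default port as in the curl example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class LocalChatClient {
    // Assumed base URL: same host and port as the curl example above.
    static final String BASE_URL = "http://localhost:8080";

    /** Escapes the characters most likely to break a JSON string literal. */
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Builds a non-streaming chat request carrying a single user message. */
    static HttpRequest chatRequest(String baseUrl, String prompt) {
        String body = "{\"messages\":[{\"role\":\"user\",\"content\":\""
                + jsonEscape(prompt) + "\"}]}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    /** Sends the request, retrying while the server answers 503 (model still loading). */
    static String chatWhenReady(HttpClient client, HttpRequest request,
                                int maxAttempts, Duration pause) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 503) {
                // Any non-503 answer (success or another error) is returned as raw JSON text.
                return response.body();
            }
            Thread.sleep(pause.toMillis());
        }
        throw new IllegalStateException("Model not ready after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = chatRequest(BASE_URL, "Hello from local llama.cpp");
        System.out.println(chatWhenReady(client, request, 30, Duration.ofSeconds(2)));
    }
}
```

The hand-rolled JSON keeps the sketch dependency-free; in a real application a JSON library or an existing OpenAI client would be the more robust choice.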

A real multi-turn CLI

mochallama chat is a stateful REPL — it keeps the full conversation history, not amnesiac single turns.

bash

# List the tool-capable presets / loaded models
npx @deemwario/mochallama models

# Start a multi-turn chat; the conversation is saved as a session
npx @deemwario/mochallama chat -m qwen2.5-1.5b

# List past sessions (id, model, turns, last-updated)
npx @deemwario/mochallama sessions

# Continue a prior conversation
npx @deemwario/mochallama chat --resume <id>

Sessions persist at ~/.chatbot_models/sessions/<id>.json. Pass --no-save for an ephemeral run. Inside the REPL, slash commands /reset, /help, and /exit are available.
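Since sessions are plain files under ~/.chatbot_models/sessions, other JVM code can discover them without going through the CLI. The sketch below only lists session ids from file names; it makes no assumption about the JSON contents, and the class and method names are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class SessionFiles {
    private static final String SUFFIX = ".json";

    /** Strips the ".json" suffix from a session file name to recover its id. */
    static String idOf(String fileName) {
        return fileName.substring(0, fileName.length() - SUFFIX.length());
    }

    /** Returns the sorted ids of all session files in dir, or an empty list if dir is missing. */
    static List<String> listSessionIds(Path dir) {
        if (!Files.isDirectory(dir)) {
            return List.of();
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    .map(SessionFiles::idOf)
                    .sorted()
                    .toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Location taken from the text above; the id is the file name minus ".json".
        Path sessions = Path.of(System.getProperty("user.home"), ".chatbot_models", "sessions");
        listSessionIds(sessions).forEach(System.out::println);
    }
}
```

Any id printed this way should be usable with chat --resume <id>, assuming the file name and the session id match as the path pattern suggests.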

Honest positioning

Today most local-LLM setups for the JVM reach your app over HTTP: Ollama, llama-server, LM Studio and friends are all separate processes, and Spring AI / LangChain4j just point an HTTP client at them. The in-process alternatives are either non-JVM or, on the JVM, one of two things: pure-Java Jlama (which reimplements inference on the incubating Vector API and is GGUF-less), or JNI bindings whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM (GA) + real upstream llama.cpp + Spring-autoconfigured OpenAI wire API + tools-and-SSE-together + zero native-install.

It is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider-portability you still want Spring AI or LangChain4j — mochallama slots in under them as the local provider via its Spring AI ChatModel adapter. And if you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. See the full, PR-welcome breakdown in Compare.

What to do next

Quickstart — time-to-first-success: npx, plain Java, and Spring Boot.
Why mochallama — the FFM-not-JNI, prebuilt-not-compiled, tool-only decisions.
Examples — curl, OpenAI Python SDK, Spring Boot, CLI, tools + streaming.
Compare — mochallama vs Ollama, Jlama, java-llama.cpp, Spring AI, node-llama-cpp.

mochallama: a local, tool-calling LLM inside your JVM
