No Java install, no daemon, no native build — npx a tool-calling local LLM and start chatting:
```bash
npx @deemwario/mochallama chat -m qwen2.5-1.5b
```

The CLI ships its own jlink JDK-22 runtime image via npm, so this needs no JDK on the host. `qwen2.5-1.5b` is the default tool-capable preset; the model downloads on first run into `~/.chatbot_models`.

Embedding the engine in a Java application takes two dependencies: the core Java jar plus the platform aggregator, which resolves the right native classifier jar for your host.
```kotlin
implementation("io.github.deemwario:mochallama-core:0.1.6")
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")
```

```xml
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>
```

```java
import tools.deemwar.mochallama.panama.ChatEngine;
import java.nio.file.Path;

var engine = ChatEngine.load(Path.of("/path/to/model.gguf"));
String reply = engine.chat("Write a haiku about Project Panama.", 128, 0.7);
System.out.println(reply);
```

JVM flags: JDK 22+ is required (FFM is GA there). Run with `--enable-native-access=ALL-UNNAMED`.

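A host application can check these two requirements at startup and fail with a clear message instead of a linkage error later. The sketch below is only an illustration: `RuntimePreflight` and its methods are hypothetical names, not part of mochallama, and the flag check is a heuristic that assumes the option was passed on the JVM command line.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical startup check; not part of the mochallama API. */
final class RuntimePreflight {

    /** FFM is final from JDK 22 onward, so older feature releases are rejected. */
    static boolean jdkSupported(int featureVersion) {
        return featureVersion >= 22;
    }

    /** Heuristic: looks for the native-access option among the JVM input arguments. */
    static boolean nativeAccessRequested(List<String> jvmArgs) {
        return jvmArgs.stream().anyMatch(arg -> arg.startsWith("--enable-native-access"));
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        if (!jdkSupported(feature)) {
            throw new IllegalStateException("JDK 22+ required, found " + feature);
        }
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (!nativeAccessRequested(jvmArgs)) {
            System.err.println("Hint: start the JVM with --enable-native-access=ALL-UNNAMED");
        }
    }
}
```

The flag check only prints a hint, because the option can also reach the JVM by routes this heuristic does not inspect.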
The Spring Boot starter autoconfigures a local model service and the OpenAI-compatible endpoints, with no spring-ai dependency required:
```kotlin
implementation("io.github.deemwario:mochallama-spring-boot-starter:0.1.6")
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")
```

```xml
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-spring-boot-starter</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>
```

Tell it which model to load: a Hugging Face id is the simplest (it resolves and caches the GGUF on first start). In `src/main/resources/application.properties`:

```properties
llamacpp.model.hf-id=Qwen/Qwen2.5-1.5B-Instruct-GGUF
# or an explicit url + filename:
# llamacpp.model.url=https://.../qwen2.5-1.5b-instruct-q4_k_m.gguf
# llamacpp.model.filename=qwen2.5-1.5b-instruct-q4_k_m.gguf
```

Start the app (the model loads asynchronously, and endpoints return 503 until `state: READY`), then point any OpenAI client at it. `POST /v1/chat/completions` handles non-streaming requests, `stream:true` SSE, and `tools` / `tool_choice`; `GET /v1/models` lists the loaded model.

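For callers on the JVM, the readiness behaviour described above can be handled with the JDK's built-in `java.net.http` client. The following is a minimal sketch, not mochallama's own client: `LocalChatClient`, `chatRequest`, `escapeJson`, and `sendWhenReady` are hypothetical names, the base URL assumes the app listens on port 8080, and the retry count and delay are arbitrary. It retries while the server answers 503, the status the endpoints return while the model is still loading.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical plain-JDK client for the OpenAI-compatible endpoint. */
final class LocalChatClient {

    // Assumption: the app listens on localhost:8080.
    static final String BASE_URL = "http://localhost:8080";

    /** Builds a POST /v1/chat/completions request carrying one user message. */
    static HttpRequest chatRequest(String baseUrl, String userMessage) {
        String body = "{\"messages\":[{\"role\":\"user\",\"content\":\""
                + escapeJson(userMessage) + "\"}]}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .timeout(Duration.ofMinutes(2))
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    /** Minimal JSON string escaping; a real client would use a JSON library. */
    static String escapeJson(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Sends the request, retrying while the server answers 503 (model still loading). */
    static String sendWhenReady(HttpClient client, HttpRequest request,
                                int maxAttempts, Duration delay) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 503) {
                return response.body();
            }
            Thread.sleep(delay.toMillis());
        }
        throw new IllegalStateException("model not ready after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = chatRequest(BASE_URL, "Hello from local llama.cpp");
        System.out.println(sendWhenReady(client, request, 30, Duration.ofSeconds(2)));
    }
}
```

Any status other than 503 is returned to the caller as-is, so error responses still need to be inspected by whoever consumes the body.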
```bash
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello from local llama.cpp"}]}'
```

`mochallama chat` is a stateful REPL: it keeps the full conversation history instead of running amnesiac single turns.

```bash
# List the tool-capable presets / loaded models
npx @deemwario/mochallama models

# Start a multi-turn chat; the conversation is saved as a session
npx @deemwario/mochallama chat -m qwen2.5-1.5b

# List past sessions (id, model, turns, last-updated)
npx @deemwario/mochallama sessions

# Continue a prior conversation
npx @deemwario/mochallama chat --resume <id>
```

Sessions persist at `~/.chatbot_models/sessions/<id>.json`. Pass `--no-save` for an ephemeral run. Inside the REPL, the slash commands `/reset`, `/help`, and `/exit` are available.

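Because sessions are plain files on disk, other tooling can discover them without going through the CLI. The sketch below lists session ids by file name only. It makes no assumption about the JSON contents, and `SessionFiles` is a hypothetical helper rather than part of mochallama.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that enumerates saved sessions by file name. */
final class SessionFiles {

    /** The session directory named in the text: ~/.chatbot_models/sessions. */
    static Path defaultSessionDir() {
        return Path.of(System.getProperty("user.home"), ".chatbot_models", "sessions");
    }

    /** Returns the sorted ids of all <id>.json files in dir, or an empty list if dir is missing. */
    static List<String> listSessionIds(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return List.of();
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(".json"))
                    .map(name -> name.substring(0, name.length() - ".json".length()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        listSessionIds(defaultSessionDir()).forEach(System.out::println);
    }
}
```

An id printed here is what `chat --resume <id>` expects, given the `<id>.json` naming above.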
Today almost every local-LLM path for the JVM reaches your app over HTTP: Ollama, llama-server, LM Studio and friends are all separate processes, and Spring AI / LangChain4j just point an HTTP client at them. The in-process alternatives are either non-JVM or, on the JVM, either pure-Java Jlama (reimplements inference on the incubating Vector API, GGUF-less) or JNI bindings, whose native faults can take down the whole JVM. mochallama fills the empty quadrant: FFM (GA) + real upstream llama.cpp + Spring-autoconfigured OpenAI wire API + tools-and-SSE-together + zero native-install.

It is an inference engine and wire API, not a RAG/agent framework. For orchestration, memory, and provider portability you still want Spring AI or LangChain4j; mochallama slots in under them as the local provider via its Spring AI ChatModel adapter. If you want a shared standalone model server with automatic GPU offload and the widest model catalogue, Ollama is the easier on-ramp. See the full, PR-welcome breakdown in Compare.