Everyone securing physical AI is staring at the wrong layer. The fear is the model: jailbreaks, adversarial patches, a vision-language brain that hallucinates an action. Real concerns, but not where the cheap compromise lives. The cheap compromise is the inference server you launch to run that model, and almost nobody who clones the repo reads what it binds to. By the end of this guide you will have a GR00T N1.7 policy server that listens only on loopback, sits behind a mutual-TLS proxy, and answers to exactly one named certificate, and you will have proven the default exposure before closing it.

NVIDIA shipped Isaac GR00T N1.7 in 2026 as an open, commercially licensed humanoid foundation model with a vision-language backbone (Cosmos-Reason2 / Qwen3-VL) pretrained on 20,854 hours of egocentric video. On May 7, 2026 NVIDIA published the end-to-end workflow that fine-tunes N1.7 and deploys it on a Unitree G1 through the GEAR-SONIC whole-body controller, and on June 1, 2026 it announced a reference humanoid built on the same stack. The standard deployment is a GPU policy server that a thin client polls for actions. That server speaks ZeroMQ, and by default it listens on tcp://0.0.0.0:5555 with msgpack frames: no authentication, no transport encryption. That socket is a remote control for a body. Anyone who can reach port 5555 can send get_action with their own image and instruction and get joint commands back. Hold this frame the whole way through: the policy server is a service account with hands, so treat it like one.

Prerequisites

  • A Linux x86_64 workstation with an NVIDIA dGPU, 16 GB VRAM minimum for inference, CUDA 12.8 driver stack.
  • git, git-lfs, and ffmpeg installed. Confirm git lfs version returns cleanly.
  • The uv package manager (the repo's supported install path).
  • ss (iproute2) and nmap for the exposure check. A second machine or second user on the same LAN to play attacker, or just a second terminal if you only want the local proof.
  • For the identity layer: ghostunnel (a small mTLS proxy) and openssl.
  • A HuggingFace login able to pull nvidia/GR00T-N1.7-3B.

Step-by-step

1. Clone the repo and build the environment

sudo apt-get update && sudo apt-get install -y git-lfs ffmpeg
git lfs install
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --python 3.10
uv run python -c "import gr00t; print('GR00T installed successfully')"

This pulls GR00T N1.7 with its submodules and builds the pinned Python 3.10 environment. Let uv sync own the GPU dependencies, flash-attn included. Those are the ones that break when you try to pip them by hand.

2. Start the policy server and read the bind address

uv run python gr00t/eval/run_gr00t_server.py \
    --model-path nvidia/GR00T-N1.7-3B \
    --embodiment-tag OXE_DROID_RELATIVE_EEF_RELATIVE_JOINT \
    --device cuda:0

The server loads the checkpoint and prints Server is ready and listening on tcp://0.0.0.0:5555. Read that bind literally. 0.0.0.0 means every interface on the box, not just localhost. This is the line everyone scrolls past while the weights load. The first time I stood one of these up, I watched a 30-second load and never registered the address until I ran ss out of habit.

3. Confirm the listener is exposed

In a second shell:

ss -ltnp | grep 5555

Expected:

LISTEN 0  ... 0.0.0.0:5555  0.0.0.0:*  users:(("python",...))

From another host on the same network:

nmap -p 5555 <server-lan-ip>

A 5555/tcp open result means the action channel is reachable from anything that can route to this machine. No password prompt, no certificate, nothing standing between the network and the actuators.
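If nmap is not installed on the second host, the same reachability check can be scripted. The following is a minimal sketch using only the Python standard library. port_reachable is a hypothetical helper name, not part of the GR00T repo, and it only tests whether a TCP connection completes, which is all an attacker needs to begin with.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes (hypothetical helper)."""
    try:
        # create_connection performs the full TCP handshake, like an nmap connect scan.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts.
        return False


# Example call from the second host (replace the placeholder address yourself):
#   port_reachable("<server-lan-ip>", 5555)
```

A True result from another machine carries the same meaning as the open result from nmap: the action channel accepts connections from the network.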

4. Prove the channel takes commands with no identity

The repo's open-loop evaluator is the normal client, and it doubles as the cleanest demonstration that the socket trusts whoever connects:

uv run python gr00t/eval/open_loop_eval.py \
    --dataset-path demo_data/droid_sample \
    --embodiment-tag OXE_DROID_RELATIVE_EEF_RELATIVE_JOINT \
    --host <server-lan-ip> \
    --port 5555 \
    --traj-ids 1 2 \
    --action-horizon 8

Run this from the second host. It connects to a server it does not own, sends observations as a ZMQ REQ peer, and gets a predicted action horizon back. The server never asked who was calling. On a real deployment those returned values are decoded by GEAR-SONIC into leg, arm, and hand joint targets under the UNITREE_G1_SONIC embodiment tag.

5. Stop listening on every interface

Kill the server, then pin the bind to loopback:

uv run python gr00t/eval/run_gr00t_server.py \
    --model-path nvidia/GR00T-N1.7-3B \
    --embodiment-tag OXE_DROID_RELATIVE_EEF_RELATIVE_JOINT \
    --device cuda:0 \
    --host 127.0.0.1

Re-run ss -ltnp | grep 5555 and it should read 127.0.0.1:5555. The remote nmap from step 3 should now report the port closed or filtered. One flag removes the entire LAN from the attack surface. It does not finish the job, because your on-robot client still has to reach the model, but it forces every legitimate caller through a channel you control.
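To make that re-check repeatable, the ss output can be parsed rather than read by eye. This is a sketch under assumptions: it expects the column layout that ss -ltn prints on current iproute2 (State, Recv-Q, Send-Q, Local Address:Port), and the function names are hypothetical, chosen for this example.

```python
import ipaddress


def listener_addresses(ss_output: str, port: int) -> list[str]:
    """Collect the local addresses that `ss -ltn` reports as listening on `port`."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # skips the header row and any non-listening sockets
        # Fourth column is Local Address:Port, e.g. 0.0.0.0:5555 or [::1]:5555.
        addr, _, local_port = fields[3].rpartition(":")
        if local_port == str(port):
            found.append(addr.strip("[]"))
    return found


def _is_loopback(addr: str) -> bool:
    try:
        # Drop an interface suffix such as 127.0.0.1%lo before parsing.
        return ipaddress.ip_address(addr.split("%")[0]).is_loopback
    except ValueError:
        return False  # "*" or anything unparseable is treated as exposed


def is_loopback_only(addresses: list[str]) -> bool:
    """True only if at least one listener exists and every one is on loopback."""
    return bool(addresses) and all(_is_loopback(a) for a in addresses)


# Example wiring (assumes ss is on PATH):
#   import subprocess
#   out = subprocess.run(["ss", "-ltn"], capture_output=True, text=True).stdout
#   print(is_loopback_only(listener_addresses(out, 5555)))
```

Treating an unparseable address as exposed is a deliberate choice: a check like this should fail closed.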

6. Give the server an identity and authenticate every caller with mTLS

Loopback-only means traffic has to be tunneled in. Put a mutual-TLS proxy in front so only a client holding a trusted certificate can open the channel. Generate a service identity for the policy server and a client identity for the robot controller:

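# Server (policy) identity. The subjectAltName must match the address the client
# dials: Go's TLS stack, which ghostunnel uses, ignores the CN for hostname checks.
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -keyout policy-key.pem -out policy-cert.pem -subj "/CN=groot-policy" \
  -addext "subjectAltName=IP:<server-lan-ip>"
# Client (controller) identity
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -keyout ctrl-key.pem -out ctrl-cert.pem -subj "/CN=g1-controller"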

Run ghostunnel on the server box, terminating TLS and forwarding plaintext to the loopback ZMQ port, accepting only the controller's certificate CN:

ghostunnel server \
  --listen 0.0.0.0:7555 \
  --target 127.0.0.1:5555 \
  --cert policy-cert.pem --key policy-key.pem \
  --cacert ctrl-cert.pem \
  --allow-cn g1-controller

On the controller side, run a client tunnel and point GR00T's client at the local end:

ghostunnel client \
  --listen 127.0.0.1:5555 \
  --target <server-lan-ip>:7555 \
  --cert ctrl-cert.pem --key ctrl-key.pem \
  --cacert policy-cert.pem

Now the client command from step 4 runs with --host 127.0.0.1 instead of the server's LAN address, and the bytes cross the network only inside an authenticated, encrypted session. A caller without the controller's private key (ctrl-key.pem) is rejected at the proxy before a single observation reaches the model; the certificate itself is public and proves nothing on its own. The --allow-cn flag is your access-control rule: the policy server trusts one named identity, not one open port. The 90-day certificate lifetime gives you a forced rotation cadence rather than a key that lives forever.
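The decision behind --allow-cn is small enough to sketch. The code below is an illustration of the rule, not ghostunnel's implementation (ghostunnel is written in Go). It assumes the peer certificate arrives in the dictionary shape that Python's ssl.SSLSocket.getpeercert() returns, and peer_common_name, is_allowed, and ALLOWED_CNS are hypothetical names.

```python
from typing import Optional

# Assumed allowlist: the single controller identity created in this step.
ALLOWED_CNS = {"g1-controller"}


def peer_common_name(peer_cert: dict) -> Optional[str]:
    """Extract the subject CN from a getpeercert()-style dictionary."""
    # "subject" is a tuple of RDNs; each RDN is a tuple of (key, value) pairs.
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def is_allowed(peer_cert: dict, allowed=ALLOWED_CNS) -> bool:
    """Admit the caller only if its verified certificate carries an allowed CN."""
    return peer_common_name(peer_cert) in allowed
```

The check is only meaningful after the TLS layer has verified the certificate chain. A CN read from an unverified certificate is just a string the caller chose.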

Verify it works

  1. ss -ltnp | grep 5555 on the server shows 127.0.0.1:5555, never 0.0.0.0.
  2. nmap -p 5555 <server-lan-ip> from a third host reports the port not open. nmap -p 7555 <server-lan-ip> shows the tunnel port open.
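  3. The GR00T client through the ghostunnel client returns an action horizon. A direct connection to 7555 without a client certificate fails the TLS handshake: the connecting side typically sees an error such as remote error: tls: certificate required (the exact wording depends on the TLS library and protocol version), and ghostunnel logs the rejected handshake on the server.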
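The negative half of check 3 can be automated with a probe that deliberately presents no client certificate. This is a sketch using Python's standard ssl module. probe_without_client_cert is a hypothetical helper, and the three result strings are this example's own convention. One detail matters: under TLS 1.3 the client side of the handshake can complete before the server has evaluated the missing certificate, so the probe also attempts a read before deciding.

```python
import socket
import ssl


def probe_without_client_cert(host: str, port: int, timeout: float = 3.0) -> str:
    """Return "unreachable", "rejected", or "accepted" for an anonymous TLS client."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # probe only: nothing the server says is trusted
    ctx.verify_mode = ssl.CERT_NONE  # and no client certificate is loaded
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    try:
        with ctx.wrap_socket(raw) as tls:
            # A server that requires a client cert answers with an alert or a close.
            return "accepted" if tls.recv(1) else "rejected"
    except socket.timeout:
        return "accepted"  # session stayed open and nobody turned the caller away
    except OSError:
        return "rejected"  # ssl.SSLError and connection resets both land here
    finally:
        raw.close()


# Example call from a third host (replace the placeholder address yourself):
#   probe_without_client_cert("<server-lan-ip>", 7555)
```

Against the tunnel port the expected answer is "rejected". An "accepted" result means the proxy is not enforcing client certificates and the setup in step 6 needs another look.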

Common pitfalls

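  • --network host removes the container boundary. If you containerize the server and run it with docker run --network host, the container shares the host's network stack. The --host 127.0.0.1 bind still lands on host loopback, but a dropped or mistyped flag falls back to 0.0.0.0 and exposes 5555 on the LAN with no publish step in between. Prefer bridge networking and publish nothing, or publish only 7555 from the tunnel. Under bridge networking 127.0.0.1 means the container's own loopback, so the tunnel has to run in the same network namespace as the server.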
  • msgpack reads as opaque, so people assume it is safe. It is neither encrypted nor signed. Anyone sniffing the LAN before you add TLS can see and replay observations and actions.
  • Embodiment-tag mismatch fails quietly. If the client tag does not match the server tag, you can still get a response shaped like actions that means nothing for your robot. Keep --embodiment-tag identical on both ends and assert it.
  • CUDA 13.x platforms. On DGX Spark or other CUDA 13 boxes, run uv run bash scripts/patch_triton_cuda13.sh before serving, or Triton import errors will masquerade as a networking failure.
  • flash-attn build noise. Let uv sync own it. Hand-installing flash-attn against the wrong CUDA path is the most common setup failure, and it has nothing to do with the security work.

Wrap-up

You now have a GR00T N1.7 policy server bound to loopback, fronted by a mutual-TLS proxy, answering to exactly one named certificate. You proved the default exposure, then closed it, and you can show the difference with nmap and a failed handshake.

The honest counterpoint: loopback plus mTLS does not stop a caller who holds a valid certificate from sending a malicious instruction, and it does nothing about adversarial input to the camera. Identity is authentication and authorization, not intent. But it is the layer that turns "anyone on the network" into "one auditable, revocable service," which is the precondition for everything after it. You cannot rate-limit, log, or revoke a caller you never identified.

Next moves, each tied to something above:

  1. Put the action stream behind a policy check: log every get_action with the client CN and the instruction text, so a misbehaving identity is traceable rather than anonymous.
  2. Add a guard that rejects any request whose --action-horizon exceeds your tested ceiling. Start at 8, the value you ran in step 4.
  3. Alert on the bind: if ss -ltnp ever shows 5555 on anything but 127.0.0.1, treat it as an incident and rotate both certificates immediately instead of waiting out the 90-day lifetime.
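The first two moves can be sketched together. Everything named here is hypothetical: check_horizon, audit_record, MAX_ACTION_HORIZON, and the record fields are illustrations, and the GR00T repo does not define where such hooks attach. The sketch also assumes the client CN is available at the point where you log, which depends on where you terminate TLS.

```python
import json
import time
from typing import Optional

# Assumed ceiling: the horizon exercised in step 4.
MAX_ACTION_HORIZON = 8


def check_horizon(requested: int, ceiling: int = MAX_ACTION_HORIZON) -> int:
    """Return the requested horizon, or raise if it is outside the tested range."""
    if not 1 <= requested <= ceiling:
        raise ValueError(f"action horizon {requested} outside tested range 1..{ceiling}")
    return requested


def audit_record(client_cn: str, instruction: str, horizon: int,
                 now: Optional[float] = None) -> str:
    """Build one JSON line tying a get_action call to the identity that made it."""
    return json.dumps(
        {
            "ts": time.time() if now is None else now,
            "event": "get_action",
            "client_cn": client_cn,
            "instruction": instruction,
            "action_horizon": horizon,
        },
        sort_keys=True,
    )


# Example: validate first, then log what was actually allowed through.
#   horizon = check_horizon(8)
#   print(audit_record("g1-controller", "pick up the cup", horizon))
```

Validating before logging keeps rejected requests out of the action path. Whether you also log the rejections is a policy decision, but for incident response they are usually the interesting entries.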

Sources

  • https://github.com/NVIDIA/Isaac-GR00T
  • https://github.com/NVIDIA/Isaac-GR00T/blob/main/getting_started/policy.md
  • https://huggingface.co/blog/nvidia/gr00t-n1-7
  • https://github.com/NVlabs/GR00T-WholeBodyControl
  • https://nvidianews.nvidia.com/news/nvidia-open-humanoid-robot-reference-design