Trainium exploration

This past Saturday I attended the Small Language Model Build day at AGI House, where AWS ran a workshop to help developers familiarize themselves with the Trainium platform.  My main focus was to see how easy it is to develop specialized kernels for Trainium, and I have to say it wasn’t too hard on the instances and notebooks that were provided.  Additionally, with some help from the AWS solutions architects on hand, the examples were easy to transfer to your own instances if you want to keep learning after the workshop.

I won’t go too much into the background and architecture of Trainium – that’s for a later post and there are links below if you want to learn more.  This post is more about getting your environment up using your own instances so you can experiment on your own time.

Spinning up an instance

I’m assuming you already have an AWS account and are familiar with the console. Go to EC2 and search for ‘neuron’.  You should get two results – I chose the Ubuntu version and launched it.  You should see something like the screenshot below once it’s running.

If you select your instance and hit the Connect button, you should be provided with the ssh command to log into the instance.

Setup your environment

Once you’re logged into your instance, make sure you source the right environment.  This was something that wasn’t apparent from the documentation, but it’s critical to making sure everything works.

% source /opt/aws_neuronx_venv_pytorch_2_8_nxd_training/bin/activate

Clone the neuron-workshops repo (linked in the references below) and pip install the requirements.txt

git clone https://github.com/aws-neuron/neuron-workshops.git ~/neuron-workshops

cd ~/neuron-workshops/labs/FineTuning/HuggingFaceExample/02_inference/01_finetuning/assets

pip install -r requirements.txt

You should also configure a HuggingFace token in the environment variable HF_TOKEN as it’s used by some of the scripts.
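Before kicking off a long run, it’s worth confirming the token is actually visible to the scripts. A small sketch of that check – the helper name here is my own illustration, not part of the workshop code:

```python
import os

def hf_token_status() -> str:
    """Report whether the HF_TOKEN environment variable is set."""
    return "set" if os.environ.get("HF_TOKEN") else "missing"

print(f"HF_TOKEN is {hf_token_status()}")
```

Export the token in your shell first (export HF_TOKEN=...) so it’s inherited by anything you launch from that session.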

Running the scripts

At this point, your environment is set up and it’s just a matter of moving the code from the notebooks to .py files (for example, with jupyter nbconvert --to script) and running them.  As I mentioned before, sourcing the virtual environment and installing from the requirements.txt pretty much ensures that everything runs.  As of this writing, I’ve run the first two examples without any problems.

Running Lab 01 (finetune_llama.py)

Use the following command line to run the fine-tuning example on your instance.  The code can be found here

neuron_parallel_compile torchrun --nnodes 1 --nproc_per_node 2 finetune_llama.py \
  --bf16 True \
  --dataloader_drop_last True \
  --disable_tqdm True \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing True \
  --learning_rate 5e-05 \
  --logging_steps 10 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_r 16 \
  --max_steps 1000 \
  --model_id Qwen/Qwen3-1.7B \
  --output_dir ~/environment/ml/qwen \
  --per_device_train_batch_size 2 \
  --tensor_parallel_size 2 \
  --tokenizer_id Qwen/Qwen3-1.7B

This will go through the compilation and training phases.  Note that neuron_parallel_compile runs the script just to extract and pre-compile the graphs, so the loss and gradient values in its output aren’t meaningful.  Some output you may see:

{'loss': 0.0, 'learning_rate': 1.0000000000000002e-06, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3543022415039769}
{'loss': 0.0, 'learning_rate': 5.000000000000001e-07, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3579175704989154}
nrtucode: internal error: 27 object(s) leaked, improper teardown
{'loss': 0.0, 'learning_rate': 0.0, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3615328994938539}
Skipping trainer.save_model() while running under neuron_parallel_compile
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 654.4413, 'train_samples_per_second': 3.056, 'train_steps_per_second': 1.528, 'train_loss': 5.508159908946247e-35, 'epoch': 0.3615328994938539}

and

    "start_time": 1759690929.3364053,
    "compilation_time": 1270.3649690151215
}
2025-10-05 19:23:19.000701:  1390  INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 6
2025-10-05 19:23:19.000701:  1390  INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 5
2025-10-05 19:23:19.000701:  1390  INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 1

Running Lab 2 – Writing your own kernel

We’re going to run nki.py, which is based on the notebook.

% python3 nki.py

The output is much simpler than Lab 1 and will look something like this:

(aws_neuronx_venv_pytorch_2_8_nxd_training) ubuntu@ip-172-31-7-165:~/neuron-workshops/labs$ python3 nki.py
NKI and NumPy match
/home/ubuntu/neuron-workshops/labs/nki.py:40: DeprecationWarning: Use torch_xla.device instead
  device = xm.xla_device()
2025-10-06 22:05:32.136528: W neuron/pjrt-api/neuronpjrt.cc:1972] Use PJRT C-API 0.73 as client did not specify a PJRT C-API version
2025-Oct-06 22:05:36.0816 1617:1679 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):213 CCOM WARN NET/OFI Failed to initialize sendrecv protocol
2025-Oct-06 22:05:36.0821 1617:1679 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):354 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Oct-06 22:05:36.0826 1617:1679 [0] ncclResult_t nccl_net_ofi_init_no_atexit_fini_v6(ncclDebugLogger_t):183 CCOM WARN NET/OFI Initializing plugin failed
2025-Oct-06 22:05:36.0830 1617:1679 [0] net_plugin.cc:97 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Checking correctness of nki_matmul_basic
2025-10-06 22:05:36.000984:  1617  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.21.18209.0+043b1bf7/MODULE_17414998555191982264+e30acd3a/model.neff
NKI and Torch match
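The “NKI and NumPy match” and “NKI and Torch match” lines come from correctness checks in the script: the custom kernel’s output is compared against a reference implementation on the same inputs. A minimal sketch of that pattern, with a plain NumPy matmul standing in for the NKI kernel (the function names here are illustrative, not from the workshop code):

```python
import numpy as np

def check_against_numpy(kernel_fn, a, b, atol=1e-5):
    """Compare a candidate matmul kernel's output against the NumPy baseline."""
    out = kernel_fn(a, b)
    ref = a @ b
    return np.allclose(out, ref, atol=atol)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)

# Using the baseline itself as the "kernel" so the sketch is self-checking.
print("match" if check_against_numpy(lambda x, y: x @ y, a, b) else "mismatch")
```

On the instance, the real check swaps the lambda for the compiled NKI kernel and moves the tensors through the XLA device, but the compare-against-a-trusted-baseline shape is the same.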

There you go – a simple way to set up your instances to start experimenting with Trainium and run a few examples. More references below, as well as links to some of the AWS solutions architects that helped me.

References

https://youtu.be/9ihlYCzEuLQ?si=BIqMta-7qeH0RqFG

https://catalog.workshops.aws/event/dashboard/en-US/workshop/labs/02-lab-two

https://github.com/aws-neuron/neuron-workshops/tree/main

LinkedIn (thanks for the help!)

https://www.linkedin.com/in/jianying-lang-303bb538/

https://www.linkedin.com/in/emily-webber-921b4969/

https://www.linkedin.com/in/jimburtoft/