This past Saturday I attended the Small Language Model Build Day at AGI House, where AWS ran a workshop to help developers familiarize themselves with the Trainium platform. My main focus was to see how easy it is to develop specialized kernels for Trainium, and I have to say it wasn’t too hard using the instances and notebooks provided. Additionally, with some help from the AWS solutions architects on hand, the examples were easily transferable to your own instances if you choose to keep learning after the workshop.
I won’t go too much into the background and architecture of Trainium – that’s for a later post, and there are links below if you want to learn more. This post is about getting your environment set up on your own instances so you can experiment on your own time.
Spinning up an instance
I’m assuming you already have an AWS account and are familiar with the console. Go to EC2 and search for ‘neuron’. You should get two results – I chose the Ubuntu version and started it. You should see something like the screenshot below once it’s running.
If you select your instance and hit the Connect button, you should be provided with the ssh command to log into the instance.
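The exact command depends on your key pair and the instance’s public DNS, but it will look roughly like this (the key filename and hostname below are placeholders – use the values the Connect tab shows you):

```shell
# Placeholder key name and hostname -- copy the real command from the Connect tab
ssh -i ~/.ssh/my-keypair.pem ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```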
Set up your environment
Once you’re logged into your instance, make sure you source the right environment. This wasn’t apparent from the documentation, but it’s critical to making sure everything works.
% source /opt/aws_neuronx_venv_pytorch_2_8_nxd_training/bin/activate
Clone the workshop repo (https://github.com/aws-neuron/neuron-workshops) and pip install the requirements.txt:

git clone https://github.com/aws-neuron/neuron-workshops.git
cd ~/neuron-workshops/labs/FineTuning/HuggingFaceExample/02_inference/01_finetuning/assets
pip install -r requirements.txt
You should also configure a HuggingFace token in the environment variable HF_TOKEN as it’s used by some of the scripts.
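For example (the token value below is a placeholder – generate a real one under your HuggingFace account settings):

```shell
# Placeholder token -- create yours at huggingface.co/settings/tokens
export HF_TOKEN=hf_your_token_here
```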
Running the scripts
At this point, your environment is set up and it’s just a matter of moving the code from the notebooks to .py files and running them. As I mentioned before, sourcing the virtual environment and installing from the requirements.txt pretty much ensures that everything runs. As of this writing, I’ve run the first two examples without any problems.
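If you’d rather not copy cells by hand, `jupyter nbconvert --to script notebook.ipynb` does the conversion for you. A minimal stdlib-only sketch of the same idea (assuming notebook-format v4 JSON, where each cell carries a `cell_type` and its `source` lines) looks like:

```python
import json

def notebook_to_script(ipynb_path, py_path):
    """Concatenate a notebook's code cells into a plain .py script."""
    with open(ipynb_path) as f:
        nb = json.load(f)
    with open(py_path, "w") as f:
        for cell in nb.get("cells", []):
            # Skip markdown/raw cells; only code cells end up in the script
            if cell.get("cell_type") == "code":
                f.write("".join(cell.get("source", [])).rstrip() + "\n\n")
```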
Running Lab 01 (finetune_llama.py)
Use the following command line to run the fine-tuning example on your instance. The code can be found in the workshop repo.
neuron_parallel_compile torchrun --nnodes 1 --nproc_per_node 2 finetune_llama.py \
  --bf16 True --dataloader_drop_last True --disable_tqdm True \
  --gradient_accumulation_steps 1 --gradient_checkpointing True \
  --learning_rate 5e-05 --logging_steps 10 \
  --lora_alpha 32 --lora_dropout 0.05 --lora_r 16 \
  --max_steps 1000 --model_id Qwen/Qwen3-1.7B \
  --output_dir ~/environment/ml/qwen \
  --per_device_train_batch_size 2 --tensor_parallel_size 2 \
  --tokenizer_id Qwen/Qwen3-1.7B
This will go through the compilation and training phases. Note that under neuron_parallel_compile the script runs in graph-extraction mode against placeholder data, so metrics like the loss are not meaningful here. Some of the output you may see:
{'loss': 0.0, 'learning_rate': 1.0000000000000002e-06, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3543022415039769}
{'loss': 0.0, 'learning_rate': 5.000000000000001e-07, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3579175704989154}
nrtucode: internal error: 27 object(s) leaked, improper teardown
{'loss': 0.0, 'learning_rate': 0.0, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3615328994938539}
Skipping trainer.save_model() while running under neuron_parallel_compile
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 654.4413, 'train_samples_per_second': 3.056, 'train_steps_per_second': 1.528, 'train_loss': 5.508159908946247e-35, 'epoch': 0.3615328994938539}
and
"start_time": 1759690929.3364053,
"compilation_time": 1270.3649690151215
}
2025-10-05 19:23:19.000701: 1390 INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 6
2025-10-05 19:23:19.000701: 1390 INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 5
2025-10-05 19:23:19.000701: 1390 INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 1
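The neuron_parallel_compile wrapper only traces and pre-compiles the graphs (hence the skipped trainer.save_model() in the output above). To actually train with the cached compiled graphs, run the same torchrun command again without the wrapper:

```shell
# Same flags as before, minus the neuron_parallel_compile prefix;
# the compiled graphs cached by the pre-compilation run are reused
torchrun --nnodes 1 --nproc_per_node 2 finetune_llama.py \
  --bf16 True --dataloader_drop_last True --disable_tqdm True \
  --gradient_accumulation_steps 1 --gradient_checkpointing True \
  --learning_rate 5e-05 --logging_steps 10 \
  --lora_alpha 32 --lora_dropout 0.05 --lora_r 16 \
  --max_steps 1000 --model_id Qwen/Qwen3-1.7B \
  --output_dir ~/environment/ml/qwen \
  --per_device_train_batch_size 2 --tensor_parallel_size 2 \
  --tokenizer_id Qwen/Qwen3-1.7B
```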
Running Lab 02 – Writing your own kernel
We’re going to run nki.py, which is based on the notebook.
% python3 nki.py
The output is much simpler than Lab 1 and will look something like this:
(aws_neuronx_venv_pytorch_2_8_nxd_training) ubuntu@ip-172-31-7-165:~/neuron-workshops/labs$ python3 nki.py
NKI and NumPy match
/home/ubuntu/neuron-workshops/labs/nki.py:40: DeprecationWarning: Use torch_xla.device instead
device = xm.xla_device()
2025-10-06 22:05:32.136528: W neuron/pjrt-api/neuronpjrt.cc:1972] Use PJRT C-API 0.73 as client did not specify a PJRT C-API version
2025-Oct-06 22:05:36.0816 1617:1679 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):213 CCOM WARN NET/OFI Failed to initialize sendrecv protocol
2025-Oct-06 22:05:36.0821 1617:1679 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):354 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Oct-06 22:05:36.0826 1617:1679 [0] ncclResult_t nccl_net_ofi_init_no_atexit_fini_v6(ncclDebugLogger_t):183 CCOM WARN NET/OFI Initializing plugin failed
2025-Oct-06 22:05:36.0830 1617:1679 [0] net_plugin.cc:97 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Checking correctness of nki_matmul_basic
2025-10-06 22:05:36.000984: 1617 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.21.18209.0+043b1bf7/MODULE_17414998555191982264+e30acd3a/model.neff
NKI and Torch match
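The “NKI and NumPy match” lines come from a correctness check that compares the kernel’s output against a NumPy reference. A sketch of that check pattern (with the kernel output stood in by the reference itself, since the real value would come from nki_matmul_basic running on a NeuronCore):

```python
import numpy as np

def check_against_numpy(kernel_out, lhs, rhs, atol=1e-4, rtol=1e-2):
    """Compare a kernel's matmul output against NumPy's reference result."""
    return np.allclose(kernel_out, np.matmul(lhs, rhs), atol=atol, rtol=rtol)

# Stand-in for the NKI kernel output; on the instance this array would come
# from the compiled kernel running on the device
rng = np.random.default_rng(0)
lhs = rng.standard_normal((64, 128)).astype(np.float32)
rhs = rng.standard_normal((128, 512)).astype(np.float32)
kernel_out = np.matmul(lhs, rhs)

print("NKI and NumPy match" if check_against_numpy(kernel_out, lhs, rhs)
      else "NKI and NumPy differ")
```

The loose tolerances account for the kernel accumulating in a different order (and possibly a different precision) than NumPy.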
There you go – a simple way to set up your instances to start experimenting with Trainium and run a few examples. More references below, as well as links to some of the AWS solutions architects that helped me.
References
https://youtu.be/9ihlYCzEuLQ?si=BIqMta-7qeH0RqFG
https://catalog.workshops.aws/event/dashboard/en-US/workshop/labs/02-lab-two
https://github.com/aws-neuron/neuron-workshops/tree/main
LinkedIn (thanks for the help!): https://www.linkedin.com/in/jianying-lang-303bb538/

