Experimenting with the LeRobot SO-101

Some good, some not so good, and clarifications on the documentation

Summary:

About a month ago I got a LeRobot kit from WowRobo. I’ve assembled the leader and follower arms, teleoperated the units, and also collected data and trained a model to perform tasks. So far the documentation has been easy to follow. However, there have been some hiccups, and they are documented here:

Prerequisites

HF LeRobot documentation: https://huggingface.co/docs/lerobot/so101

WowRobo kit: https://shop.wowrobo.com/products/so-arm101-diy-kit-assembled-version-1?variant=46588641607897

Issue #1: Power Supplies

Maybe it’s different for other motors (mine were Feetech), but the leader and follower arms have different power supplies, and the difference matters: the leader uses a 30W 5V supply, while the follower uses a 36W 12V supply. Mixing them up tripped me up several times, with the board losing motor IDs in the middle of a run.

Issue #2: Training

So you’ve collected your teleop data and are ready to train. There are several options:

  1. Locally – I would not recommend training locally; it is VERY slow even on my M3 MacBook Pro, and it may take days to reach 10K steps.
  2. Google Colab – the alternative for those who are GPU-poor, and the option I ultimately used. The HF instructions have a page that walks you through setting it up here. However, the free tier only gives you a T4, which works if you set batch_size=1 and then hope you won’t run out of memory. If you want the beefier A100, which will train 100K steps in about 5 hours, you’ll either have to upgrade to Colab Pro or pay as you go. The PAYG option works for a one-time run, but you’ll have to babysit the notebook or it will disconnect (I observed this happening every 90 minutes) and you’ll have to start over, unless you mounted the output to your own Google Drive (see below). Colab Pro is supposed to disconnect less often, but your mileage may vary. As of today, $10 (100 credits) will train your model with some credits left over.
  3. GPU providers – there are plenty to choose from, but you’ll have to do your own setup.

Mounting your Google Drive to your Colab notebook: do this in Colab to save your checkpoints so you can resume if you’re disconnected.

from google.colab import drive
drive.mount('/content/drive')
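With the drive mounted, a disconnected run can usually be resumed from the last saved checkpoint. This is a sketch, not a command I ran verbatim: it assumes LeRobot's documented resume mechanism (pointing --config_path at the checkpoint's train_config.json with --resume=true) and the output_dir from my training command; adjust the paths to your own run.

```shell
# Resume training from the last checkpoint saved to Google Drive.
# Paths are illustrative; point config_path at your own run's train_config.json.
!python lerobot/src/lerobot/scripts/lerobot_train.py \
  --config_path=drive/MyDrive/outputs/train/act_so101_test/checkpoints/last/pretrained_model/train_config.json \
  --resume=true
```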

Issue #3: Missing model.safetensors file

When it came time to exercise my newly trained model, I discovered the model.safetensors file wasn’t there. I’m still not sure what happened, but check for this file as checkpoints are written; otherwise all that training is for naught.
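A quick way to catch this early is to scan the run's checkpoint directories for the file after each save. This is a hypothetical helper (not part of LeRobot), assuming the checkpoints/&lt;step&gt;/pretrained_model layout that the training output uses:

```python
from pathlib import Path

def checkpoints_missing_safetensors(train_dir):
    """Return pretrained_model dirs under train_dir/checkpoints
    that are missing model.safetensors."""
    missing = []
    for ckpt in sorted(Path(train_dir).glob("checkpoints/*/pretrained_model")):
        if not (ckpt / "model.safetensors").exists():
            missing.append(ckpt)
    return missing
```

Run it against your output_dir between checkpoints; an empty list means every checkpoint has its weights.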

I used the following command line, which is different from the one published in the HF docs.

!python lerobot/src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=dpang/record-test \
--policy.type=act \
--output_dir=drive/MyDrive/outputs/train/act_so101_test \
--job_name=lr_20251211_0949 \
--policy.device=cuda \
--wandb.enable=true \
--policy.push_to_hub=true \
--policy.repo_id=dpang/my_policy \
--save_freq=1000 \
--batch_size=2

Issue #4: Running inference

To properly exercise the model, make sure to uncomment/add the teleop arguments to the command line provided in the HF instructions; otherwise you can’t reset the scene. I’m not sure why they’re commented out in the example, because you really need them between episodes.

lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/tty.usbmodem5AB01812601 \
  --robot.cameras="{front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}, top: {type: opencv, index_or_path: 1, width: 1920, height: 1080, fps: 30}}" \
  --robot.id=le_follower_arm \
  --display_data=false \
  --dataset.repo_id=dpang/eval_test \
  --dataset.single_task="Push cup forward" \
  --policy.path=/Users/dpang/dev/lerebotHackathon20250615/lerobot/outputs_push_cup/train/push_cup_test/checkpoints/100000/pretrained_model \
  --teleop.type=so101_leader \
  --teleop.port=/dev/tty.usbmodem5AB01788091 \
  --teleop.id=le_leader_arm

Issue #5: Training pi05 on Colab

So inference with the ‘act’ model went smoothly, but when it came time to try the ‘pi05’ model, things didn’t work as expected. The Colab training command in the documentation here didn’t work, so I used the line below instead.

!python lerobot/src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=dpang/record-test \
--policy.type=pi05 \
--batch_size=4 \
--steps=20000 \
--output_dir=drive/MyDrive/outputs/train/my_pi0_5 \
--job_name=my_pi0_5_training_20260116 \
--policy.device=cuda \
--wandb.enable=true \
--policy.repo_id=dpang/my_policy

In addition, I got error messages requiring authorization for the “paligemma-3b-pt-224” model. A link is provided to request that authorization, but you’ll have to restart the notebook afterward. Also, make sure to log into HF, otherwise you’ll error out when trying to push the model to HF.

!huggingface-cli login

Here is a link to a successful run.

An AI investor’s thesis on Google

A few months ago, I wrote about how the large models eat up the stack and the workflow, and also expand an individual’s reach beyond their skill set (think engineers becoming product managers and GTM specialists). Carrying that analogy from individuals to companies, we have a company that could potentially eat the entire AI ecosystem: Google. They already have the full stack: the chips, the talent, the resources, the models, not to mention loads and loads of data. It would not be inconceivable for them to spread their wings and embed themselves into more of the AI and business domains.

The impetus for this train of thought is a hackathon I went to that featured the app building talents of their AI Studio product. Google has been on a marketing and publicity blitz the past few months – holding a ton of meetups and hackathons publicizing this and other features. I’ve probably seen Paige Bailey more in the past few months than in the past 2 years combined.

In the hackathon, we were given 3 hours to vibe code, deploy a product, and put together a presentation with a 3-minute video (on YouTube, of course). On top of that, part of the criteria for winning was how our product performed on social media. In most hackathons this would be impossible, because most new offerings from startups don’t work half the time. If something were to be accomplished in 3 hours, there would be a template or a workshop-like program where teams are walked through a reference implementation, which most would hand in anyway.

This was purely starting from scratch with nothing but the build feature of AI Studio, and it worked. We vibe coded and deployed our app into production, as did many other teams, and the variety of ideas that came to fruition was staggering.

So how does this go towards the thesis of Google eating everything?

It’s similar to the Apple Store or Amazon’s marketplace. As more apps get deployed with increasing sophistication in Google’s app ecosystem, Google gets to see what works and what doesn’t. They can then choose to buy, host, or duplicate the product. Either way, Google gets to expand their footprint throughout the AI economy (and collect all that data to boot).

So, what could get in their way? Plenty. Don’t forget Google is still a large company and unless they’ve drastically revamped their culture and structure, they’re still prone to the same missteps that plague big behemoths. Throw in antitrust and competition from another 800-pound gorilla – the Elon Musk company universe (X, XAI, Tesla, Neuralink, SpaceX, …) which will stop at nothing until achieving total domination, and we should see plenty of fireworks in the next few years.

I haven’t mentioned the big frontier labs (OpenAI, Anthropic). They’ll still be around and possibly survive, but they’re not going to 10x, let alone 100x from where they are – they still have to spend to expand, and the combination of Google, Elon, open source, and Chinese models/companies are going to eat into their margins. As an AI investor, my money, across all private and public markets, is going to be on Google.

Trainium exploration

This past Saturday I attended the Small Language Model Build day at AGI House where AWS ran a workshop to help developers familiarize themselves with the Trainium platform.  My main focus was to see how easy it was to develop specialized kernels for Trainium, and I have to say it wasn’t too hard on the instances and notebooks that were provided.  Additionally, with some help from the AWS solutions architects on hand, the examples were easily transferable to your own instances if you choose to learn more after the workshop.

I won’t go too much into the background and architecture of Trainium – that’s for a later post and there are links below if you want to learn more.  This post is more about getting your environment up using your own instances so you can experiment on your own time.

Spinning up an instance

I’m assuming you already have an AWS account and are familiar with the console. Go to EC2 and search for ‘neuron’. You should get two results – I chose the Ubuntu version and started it. After the instance is running, you should see something like the screenshot below.

If you select your instance and hit the Connect button, you should be provided with the ssh command to log into the instance.

Setup your environment

Once you’re logged into your instance, make sure you source the right environment. This wasn’t apparent from the documentation, but it’s critical to making sure everything works.

% source /opt/aws_neuronx_venv_pytorch_2_8_nxd_training/bin/activate

Clone the neuron-workshops repo (the one linked in the references below) and pip install the requirements.txt:

git clone https://github.com/aws-neuron/neuron-workshops.git ~/neuron-workshops

cd ~/neuron-workshops/labs/FineTuning/HuggingFaceExample/02_inference/01_finetuning/assets

pip install -r requirements.txt

You should also configure a Hugging Face token in the environment variable HF_TOKEN, as it’s used by some of the scripts.
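For example (the token value below is a placeholder; generate a real one under your Hugging Face account settings):

```shell
# Export the Hugging Face token for the current shell session so the
# workshop scripts can authenticate. Replace the placeholder with your token.
export HF_TOKEN="hf_your_token_here"
```

Add the line to ~/.bashrc if you want it to persist across logins.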

Running the scripts

At this point, your environment is setup and it’s just a matter of moving the code from the notebooks to .py files and running them.  As I mentioned before, sourcing the virtual environment and installing from the requirements.txt pretty much ensures that everything runs.  As of this writing, I’ve run the first two examples without any problems.

Running Lab 01 (finetune_llama.py)

Use the following command line to run the fine-tuning example on your instance. The code can be found here.

neuron_parallel_compile torchrun --nnodes 1 --nproc_per_node 2 finetune_llama.py \
  --bf16 True \
  --dataloader_drop_last True \
  --disable_tqdm True \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing True \
  --learning_rate 5e-05 \
  --logging_steps 10 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_r 16 \
  --max_steps 1000 \
  --model_id Qwen/Qwen3-1.7B \
  --output_dir ~/environment/ml/qwen \
  --per_device_train_batch_size 2 \
  --tensor_parallel_size 2 \
  --tokenizer_id Qwen/Qwen3-1.7B

This will go through the compilation and training phases. Some of the output you may see:

{'loss': 0.0, 'learning_rate': 1.0000000000000002e-06, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3543022415039769}
{'loss': 0.0, 'learning_rate': 5.000000000000001e-07, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3579175704989154}
nrtucode: internal error: 27 object(s) leaked, improper teardown
{'loss': 0.0, 'learning_rate': 0.0, 'grad_norm': -1.7642974853515625e-05, 'epoch': 0.3615328994938539}
Skipping trainer.save_model() while running under neuron_parallel_compile
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 654.4413, 'train_samples_per_second': 3.056, 'train_steps_per_second': 1.528, 'train_loss': 5.508159908946247e-35, 'epoch': 0.3615328994938539}

and

    "start_time": 1759690929.3364053,
    "compilation_time": 1270.3649690151215
}
2025-10-05 19:23:19.000701:  1390  INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 6
2025-10-05 19:23:19.000701:  1390  INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 5
2025-10-05 19:23:19.000701:  1390  INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 1

Running Lab 2 – Writing your own kernel

We’re going to run nki.py, which is based on the notebook.

% python3 nki.py

The output is much simpler than Lab 1 and will look something like this:

(aws_neuronx_venv_pytorch_2_8_nxd_training) ubuntu@ip-172-31-7-165:~/neuron-workshops/labs$ python3 nki.py
NKI and NumPy match
/home/ubuntu/neuron-workshops/labs/nki.py:40: DeprecationWarning: Use torch_xla.device instead
  device = xm.xla_device()
2025-10-06 22:05:32.136528: W neuron/pjrt-api/neuronpjrt.cc:1972] Use PJRT C-API 0.73 as client did not specify a PJRT C-API version
2025-Oct-06 22:05:36.0816 1617:1679 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):213 CCOM WARN NET/OFI Failed to initialize sendrecv protocol
2025-Oct-06 22:05:36.0821 1617:1679 [0] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):354 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Oct-06 22:05:36.0826 1617:1679 [0] ncclResult_t nccl_net_ofi_init_no_atexit_fini_v6(ncclDebugLogger_t):183 CCOM WARN NET/OFI Initializing plugin failed
2025-Oct-06 22:05:36.0830 1617:1679 [0] net_plugin.cc:97 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Checking correctness of nki_matmul_basic
2025-10-06 22:05:36.000984:  1617  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.21.18209.0+043b1bf7/MODULE_17414998555191982264+e30acd3a/model.neff
NKI and Torch match

There you go – a simple way to set up your instances to start experimenting with Trainium and run a few examples. More references below, as well as links to some of the AWS solutions architects that helped me.

References

https://youtu.be/9ihlYCzEuLQ?si=BIqMta-7qeH0RqFG

https://catalog.workshops.aws/event/dashboard/en-US/workshop/labs/02-lab-two

https://github.com/aws-neuron/neuron-workshops/tree/main

LinkedIn (thanks for the help!)

https://www.linkedin.com/in/jianying-lang-303bb538/

https://www.linkedin.com/in/emily-webber-921b4969/

https://www.linkedin.com/in/jimburtoft/

It all seems so quaint – looking back at GenAI posts from 3 years ago

I just had a chance to pull my head above water and put some thoughts down after looking at my AI posts from 3 years ago. It’s amazing what has evolved since then, and it provides a framework for thinking about what comes next.

The posts in question are related to coding (https://numbersandcode.wordpress.com/2022/12/16/coding-kaprekars-constant-using-chatgpt/) and image and video generation (https://numbersandcode.wordpress.com/2022/06/22/dall-e/).

One recurring theme when writing these articles is that there’s always a phrase like “unless you’ve been living under a rock” or “unless you’ve been a hermit,” because every week some AI subject goes viral. So unless you’ve been sleeping for the past 20 years like Rip Van Winkle, you would know about…

Cursor from Anysphere and other coding apps like Windsurf, Factory, Devin from Cognition Labs, Lovable, Replit, Bolt… the list goes on and on, not to mention their valuations (Cursor at $9.9B: https://techcrunch.com/2025/06/05/cursors-anysphere-nabs-9-9b-valuation-soars-past-500m-arr/, and Windsurf, bought by OpenAI for $3B).

Three years ago I was cutting and pasting code between the editor and ChatGPT. Now it’s built into the IDE or in the case of Replit and Lovable, everything is done online. The stack has compressed and with the current trends around agents and reasoning models, what’s preventing whole workflows and systems from being reproduced from a prompt? In the future, you could just have an idea and prompt a product into existence, complete with market research and GTM strategy, not to mention the implementation and product documentation in between.

…and unless you’re Sleeping Beauty, you would know about

Veo 3, Google’s new video generation model that’s been taking the internet by storm, with numerous examples here and here. The improvement is more dramatic: whereas generating code and putting together an algorithm was achievable three years ago, stringing videos together to make a believable ad wasn’t remotely possible. Now anyone can make movies, shorts, ads, and scenarios, and personalize them, possibly in realtime.

By now you would’ve figured out the framework needed to think about what could happen in the next 3 years. It’s compression and expansion: the collapse of the tech stack, workflows, functions, and organizations that are familiar in your world, and AI’s ability to help you expand your reach beyond your perception.