Prototyping with the Google Coral ML (TPU) Accelerator Module

5 min read · Feb 15, 2021

Google recently released the Coral TPU Accelerator Module, a solderable multi-chip module that supports PCIe Gen2 x1 and USB 2.0 HS serial interfaces. It belongs to the same family of Coral products, available in multiple formats (single-board computer, PCIe boards, USB accelerators, SoM daughter board), that work with TensorFlow Lite inference models.

With the release of the module, Google is enabling the end user to embed the TPU in custom-designed solutions. The module measures just 15 x 10 mm, with 120 tiny pins in an LGA footprint.

To build a prototype, I had to create a symbol for the Coral Module and the corresponding PCB footprint in KiCad (open source schematic and PCB design software). The module is tiny and has 120 pins, but most of them are connected to VIN and GND. For USB you only need to work with around 8–10 pins, but my breakout board brings all the available signals out to the DIP pins. One mistake I made was not matching the length of the PCIe traces. The manual specifies that, due to internal dimension constraints, the PCIe differential traces are not length matched inside the module and must be matched outside it. But as I am experimenting only with USB at this time, this small mistake is not an issue: the USB differential traces are length matched inside the module.

I ordered the PCBs and the laser stencil, applied solder paste, and reflowed the board in a low-cost SMD soldering oven using the profile recommended by Google. Then I added some male pin headers.

Coral Module soldered to the breakout board

The next step is to add the external components on the protoboard (I used a PCB as it is better than a breadboard for fast signals like USB 2.0 High Speed). The Coral module requires 3.3V and 1.8V to operate. All I/O levels are 1.8V.

Fully assembled protoboard with Coral Module breakout inserted in a socket, the 3.3V and 1.8V supply on the left and the USB mini connector on the top.

There is an option to apply only 3.3V to VIN and let the module generate the 1.8V for the I/O, but it requires some additional signal conditioning outside the module. If you supply both 3.3V and 1.8V, the external circuitry is much simpler. I used two LM338T-ADJ regulators (not optimal, but it is what I had on hand… I need to switch to LDOs) to generate 3.3V and 1.8V from the 5V supplied by the USB host, which means I am using bus-powered mode. To use self-powered mode, I need to experiment more with detecting VBUS and enabling the TPU so it can be enumerated and recognized by the host. I tried simply plugging a self-powered board into USB and it resulted in a "device not recognized" error. But if you power everything from VBUS, it works fine.
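As a rough aid for setting the two rails, the divider on an adjustable regulator of this type can be estimated with the usual Vout = Vref * (1 + R2 / R1) relation. This is only a sketch: the 1.25V reference and the 240 ohm R1 are assumed nominal values, not taken from my board, the small adjust-pin current is ignored, and the function names are made up for illustration. Check the regulator datasheet before picking parts.

```python
# Sketch: feedback resistor for an adjustable regulator (LM317/LM338 style).
# Assumptions, not from the article: V_REF is about 1.25 V, R1 is 240 ohm,
# and the adjust-pin current term is ignored.

V_REF = 1.25   # volts, nominal reference between OUT and ADJ (assumed)
R1 = 240.0     # ohms, resistor between OUT and ADJ (assumed)


def r2_for_vout(v_out, r1=R1, v_ref=V_REF):
    """Return R2 (ADJ to GND) from Vout = Vref * (1 + R2 / R1)."""
    if v_out <= v_ref:
        raise ValueError("output cannot be at or below the reference voltage")
    return r1 * (v_out / v_ref - 1.0)


def vout_for_r2(r2, r1=R1, v_ref=V_REF):
    """Return the nominal output voltage for a given R2."""
    return v_ref * (1.0 + r2 / r1)


for rail in (3.3, 1.8):
    print(f"{rail} V rail: R2 is roughly {r2_for_vout(rail):.1f} ohm")
```

This only covers the divider. Whether the regulator has enough input headroom when fed from 5V is a separate question, and one of the reasons an LDO is the better choice.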

PGOOD (purple) goes high 10ms after power (yellow and cyan) is applied to the board. Green is RST_L, using arbitrary values for the RC circuit that generated a signal with a 74ms rise time, later tuned to 26ms.

You also need an RC circuit to generate a delay greater than 10ms on the RST_L signal, as required by the module. I used a simple 10K resistor + 10uF capacitor to generate a 26ms rise time and applied the output to the RST_L signal. A LOW to HIGH transition on RST_L enables the module.
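For sanity-checking RC values before soldering, an ideal first-order RC model is usually enough. The sketch below assumes an ideal step on the 1.8V rail and no loading from the RST_L pin, and the function names are made up for illustration. Note that the ideal model gives a time constant of 100ms for 10K and 10uF, so the figure read from the scope depends on how the rise time is measured and on the real component values. Treat the output as an estimate only.

```python
import math


def time_to_fraction(r_ohm, c_farad, fraction):
    """Time for an ideal RC step response to reach `fraction` of its final value."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    return -r_ohm * c_farad * math.log(1.0 - fraction)


def rise_time_10_90(r_ohm, c_farad):
    """10% to 90% rise time of an ideal RC step response (tau * ln 9)."""
    return r_ohm * c_farad * math.log(9.0)


def cap_for_rise_time(r_ohm, t_rise_s):
    """Capacitance giving the requested 10% to 90% rise time with resistor r_ohm."""
    return t_rise_s / (r_ohm * math.log(9.0))


r = 10e3  # 10K resistor, as in the text
print(f"tau for 10K + 10uF: {r * 10e-6 * 1e3:.0f} ms")
print(f"10-90% rise time for 10K + 10uF: {rise_time_10_90(r, 10e-6) * 1e3:.0f} ms")
print(f"C for a 26 ms 10-90% rise time with 10K: {cap_for_rise_time(r, 26e-3) * 1e6:.2f} uF")
```

Whatever values you pick, the point is that RST_L must still be low when PGOOD goes high and stay low for more than 10ms, so check the result on a scope.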

Connections used for USB 2.0 prototyping

VIN (3.3V)
GND
AON (1.8V)
PGOOD4 (Power Good)
PMIC_EN (connected to AON, enable Power Management IC)
RST_L (TPU reset, connected to 1.8V through RC network)
USB2_DATA_P
USB2_DATA_N
USB_SEL (tied to AON. It is possible to monitor PGOOD4 and activate USB)

RC circuit to generate 26ms delay on RST_L

Once the protoboard is plugged into the PC (USB 2.0), it is recognized, and you need to install the driver and the API (I used pycoral) by following this guide.
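A quick way to confirm enumeration on Linux is to look for the Edge TPU in the `lsusb` output. The sketch below parses that text. The two USB IDs are the ones commonly reported for Edge TPU devices (one before the runtime has loaded the firmware, one after), and I am treating them as an assumption rather than something guaranteed for every board. The sample text and the function name are made up for illustration.

```python
import re

# USB IDs commonly reported for Edge TPU devices (assumption, verify on your system):
# 1a6e:089a before the runtime loads firmware, 18d1:9302 afterwards.
EDGE_TPU_IDS = {
    "1a6e:089a": "not yet initialized",
    "18d1:9302": "initialized",
}

ID_PATTERN = re.compile(r"\bID ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})\b")


def find_edge_tpus(lsusb_text):
    """Return (usb_id, state) tuples for Edge TPU entries found in lsusb output."""
    found = []
    for line in lsusb_text.splitlines():
        match = ID_PATTERN.search(line)
        if match:
            usb_id = match.group(1).lower()
            if usb_id in EDGE_TPU_IDS:
                found.append((usb_id, EDGE_TPU_IDS[usb_id]))
    return found


# Illustrative sample, not captured from my board.
sample = """Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 004: ID 1a6e:089a Global Unichip Corp.
"""
print(find_edge_tpus(sample))
```

On a real system you would pass in the actual output, for example `subprocess.run(["lsusb"], capture_output=True, text=True).stdout`. An empty list means the host has not enumerated the module at all, which points to power, reset or USB wiring rather than software.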

Testing the module in USB 2.0 High Speed Mode and comparison to USB 3.0

When I ran the classify_image.py example with a USB accelerator (the accelerator, not the module) in USB 3.0 mode, I got the following results:

USB 3.0: 12ms first run (model load + inference), 2.3ms following tests (inference only)

Running the same test with the USB accelerator and my custom board with the Coral module in USB 2.0 mode, I got the following results:

USB 2.0: 95ms first run (model load + inference), 8.5ms following tests (inference only)
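The split between the first run and the following runs is easy to reproduce with a small timing harness. This is a sketch, not the code from classify_image.py: the function names are made up, and a dummy workload stands in for the TPU so the snippet runs without any hardware.

```python
import time


def time_runs(fn, runs=5):
    """Call fn() `runs` times and return each duration in milliseconds."""
    durations_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def summarize(durations_ms):
    """Return (first run, mean of the remaining runs); the mean is None if there is only one run."""
    first, rest = durations_ms[0], durations_ms[1:]
    steady = sum(rest) / len(rest) if rest else None
    return first, steady


def fake_invoke():
    """Stand-in workload so the sketch runs without a TPU attached."""
    sum(range(10_000))


first_ms, steady_ms = summarize(time_runs(fake_invoke))
print(f"first run: {first_ms:.2f} ms, following runs: {steady_ms:.2f} ms")
```

With pycoral, `fn` would be the `invoke` method of the interpreter returned by `make_interpreter` from `pycoral.utils.edgetpu`, called after `allocate_tensors()` and after the input tensor has been set. The first call is slower because the model is transferred to the Edge TPU.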

Some back of the envelope calculation

The TensorFlow Lite model used by the classify_image.py example is mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite. The model size is 4,067,904 bytes. The parrot image size is 150,528 bytes. Transfer speed is the most important factor in the total inference time. Assuming SPI @ 80MHz to be 6 times slower than USB 2.0 HS (80 Mbit/s versus 480 Mbit/s), and assuming it is possible to port the driver and pycoral to a microcontroller, we should expect to see inference times of 48ms-100ms, or around 10Hz.
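The same estimate can be written down as a few lines of arithmetic. This sketch uses raw line rates only (480 Mbit/s for USB 2.0 HS, one bit per clock at 80MHz for SPI) and ignores protocol overhead, so the numbers are lower bounds on transfer time. The variable and function names are made up for illustration.

```python
MODEL_BYTES = 4_067_904   # model file size, from the text
IMAGE_BYTES = 150_528     # input image size, from the text (224 * 224 * 3)

USB2_HS_BPS = 480e6       # raw USB 2.0 High Speed line rate, bits per second
SPI_BPS = 80e6            # SPI at 80 MHz, one bit per clock (assumed)

MEASURED_USB2_INFERENCE_MS = 8.5  # inference-only time measured over USB 2.0


def transfer_ms(n_bytes, bits_per_second):
    """Raw time in milliseconds to move n_bytes over a link, ignoring overhead."""
    return n_bytes * 8 / bits_per_second * 1000.0


slowdown = USB2_HS_BPS / SPI_BPS
scaled_inference_ms = MEASURED_USB2_INFERENCE_MS * slowdown

print(f"SPI is {slowdown:.0f}x slower than USB 2.0 HS")
print(f"image over USB 2.0: {transfer_ms(IMAGE_BYTES, USB2_HS_BPS):.1f} ms")
print(f"image over SPI:     {transfer_ms(IMAGE_BYTES, SPI_BPS):.1f} ms")
print(f"model over USB 2.0: {transfer_ms(MODEL_BYTES, USB2_HS_BPS):.1f} ms")
print(f"model over SPI:     {transfer_ms(MODEL_BYTES, SPI_BPS):.1f} ms")
print(f"scaled inference:   {scaled_inference_ms:.0f} ms "
      f"({1000.0 / scaled_inference_ms:.0f} Hz)")
```

Scaling the measured 8.5ms by the factor of 6 lands at the low end of the 48ms-100ms range. Protocol overhead and a slower host would push the real figure toward the upper end.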

Conclusions

So I could verify my prototype board was working fine. The example gives an error if it cannot find the Coral TPU connected.

I will be testing PCIe by manufacturing a PCIe x1 board for my PC.

The Coral ecosystem currently requires a Linux, Windows or macOS system. My final objective is to do a feasibility analysis of running the Coral Module with a bare microcontroller using a serial interface like SPI. Of course, this means I need to port all the software to an interface microcontroller powerful enough to act as a USB 2.0 High Speed host and to interface with the user microcontroller through SPI (some sort of SPI to Google Coral TPU bridge). If it works, it will considerably lower the energy footprint and bring powerful ML inference at the edge to the massive IoT marketplace.

Please read my follow up story… Prototyping with Google Coral ML (TPU) Accelerator Module with PCIe interface.

Written by Tony Kim