What It Takes for AMD to Bring AI Into Space
From C-3PO and R2-D2 in “Star Wars” to HAL 9000 in “2001: A Space Odyssey” and Data in “Star Trek,” perhaps one of the most common elements in space-based science fiction is the fully aware, self-learning AI, a.k.a. artificial general intelligence, or AGI. While we’re definitely not there yet, recent advances in AI and in edge computing are beginning to make the “fiction” in “science fiction” look less far-fetched.
Part of the excitement around these technologies is that advances in capabilities are being driven by a global ecosystem across multiple areas. Many of the recent advances target terrestrial systems and devices—and rightly so, given the vast number of untapped and yet-to-be-invented use cases and applications. Additionally, there’s recently been a push for bringing as much AI capability to the edge as possible to help address the exploding cloud resource demand that’s making the total cost of operation untenable.
When it comes to the edge, there really isn’t any further you can go than outer space. Every starry-eyed child (and perhaps a few adults) who has ever fallen in love with stories of humans in space can’t help but get excited when the topics of space and AI are brought together. This was the case for me recently when Advanced Micro Devices (AMD) announced the launch of the Versal AI Edge XQRVE2302 adaptive SoC.
This second-generation part is a radiation-tolerant, space-grade adaptive SoC with an array of integrated AI/machine learning (ML) engines and, according to AMD, will be qualified for space flight.
Aside from the extreme environments in which a space-qualified processor must operate, anything going into a space-based platform will be limited in terms of physical footprint and power consumption. AMD has stated that, relative to its first-generation space-grade SoC, the VE2302 is nearly 75% lower in both board area and power consumption.
Radiation tolerance vs. space grade
Radiation-tolerance and space-grade qualifications are, in fact, two different characteristics. In an interview with EE Times, Ken O’Neill, AMD’s space systems architect, made the distinction that “radiation tolerance is essential but not sufficient to be space-grade qualified.”
As part of the Versal family of adaptive processors, the VE2302 achieves its radiation tolerance through patented techniques developed for the Versal family’s terrestrial solutions. While the radiation found in space differs from that in terrestrial applications, with more protons and heavy ions, O’Neill said the same mitigation techniques apply and have been shown to be highly effective against space-based radiation.
One such radiation-tolerance technique is the use of a triple-redundant, hard-wired MicroBlaze CPU with a single-event upset (SEU)-optimized voting circuit in the platform management controller of the VE2302.
When charged particles like protons or heavy ions pass through the semiconductor, an SEU could occur. An SEU essentially changes the state of a flip-flop or a memory cell due to a sudden pulse of current caused by the radiation passing through the semiconductor.
An SEU won’t necessarily cause a catastrophic malfunction in and of itself. It all depends on which part of the system the SEU occurs in and what function is affected.
For example, when an SEU occurs in a control register, it could cause a temporary malfunction of the IC. When this happens, the SEU is called a single-event functional interrupt (SEFI). One of the goals of radiation tolerance is to mitigate SEFIs. Through its radiation testing, AMD has found that this triple-redundant architecture significantly reduces the probability of SEFIs in the platform management controller. This matters because the platform management controller performs internal management, security and power-on reset functions within the Versal adaptive SoC and, crucially, mitigates the occurrence of SEUs in the configuration memory of the device.
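To make the idea of triple redundancy with a voting circuit concrete, here is a minimal software sketch of a bitwise two-of-three majority voter. It is only an illustration of the general technique, not AMD’s hard-wired implementation; the function names, the 8-bit register value and the flipped bit position are all assumptions chosen for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def inject_seu(word: int, bit: int) -> int:
    """Model a single-event upset by flipping one bit of a stored word."""
    return word ^ (1 << bit)


# Hypothetical 8-bit control register value held in three redundant copies.
golden = 0b1011_0010
copies = [golden, golden, golden]

# A particle strike flips bit 5 in one copy only.
copies[1] = inject_seu(copies[1], 5)

# The voter masks the upset because the other two copies still agree.
voted = majority_vote(*copies)

# If the same bit were upset in two copies, the voter would pass the wrong value through.
double_hit = majority_vote(inject_seu(golden, 5), inject_seu(golden, 5), golden)
```

A single upset is outvoted, while two upsets on the same bit are not, which is why redundancy schemes like this reduce the probability of a functional interrupt rather than eliminating it.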
Along with radiation tolerance, monolithic ICs like the VE2302 must then comply with qualification requirements from the various space agencies around the world, in order for them to achieve space-grade qualification. While each of the space agencies has its own variation, they have much in common with the Department of Defense’s (DoD) MIL-PRF-38535 document.
According to O’Neill, there were four specific areas that AMD worked on to ensure that their latest part would qualify.
The first of these was to mitigate the effects of repeated thermal cycling experienced by ICs used in space-based platforms, such as satellites. This is especially important for low-earth-orbit (LEO) satellites, which are used for, among other things, bidirectional communications and sensing. In these applications, latency is critical, and minimizing the distance to earth by using this orbit is ideal.
At LEO, satellites typically orbit the earth every 90 minutes and continuously cycle at that frequency between maximum temperatures when facing the sun and minimum temperatures when shielded from the sun by the earth. Depending on the design, ICs going through this thermal cycle will experience stresses that may cause physical warping of the device. To protect against this, the VE2302 packaging was enhanced with a metal structure called a stiffener ring. Attached to the package, the ring mitigates thermal cycling effects and extends the number of cycles the VE2302 can withstand while maintaining its structural and functional integrity, including the flatness of the package, known as coplanarity.
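A quick back-of-the-envelope calculation shows why thermal cycling adds up so quickly at LEO. The sketch below takes the 90-minute orbit from the text and assumes one hot/cold cycle per orbit; the five-year mission length is a hypothetical figure chosen only for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def thermal_cycles(orbit_minutes: float, mission_years: float) -> int:
    """Estimate hot/cold cycles over a mission, assuming one cycle per orbit."""
    cycles_per_day = MINUTES_PER_DAY / orbit_minutes
    return round(cycles_per_day * 365 * mission_years)


cycles_per_day = MINUTES_PER_DAY / 90        # 16 orbits, so 16 cycles, per day
five_year_cycles = thermal_cycles(90, 5)     # hypothetical 5-year mission: 29,200 cycles
```

Under those assumptions, a part sees 16 cycles a day and tens of thousands over a multi-year mission, which is the kind of repeated stress the packaging has to survive.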
Another focus area was the solder ball materials used with the part. Restriction of Hazardous Substances (RoHS) compliance requires solder balls to be lead free, so solder balls in terrestrial ICs typically use a tin-based material. However, these types of solder balls are susceptible to what are called “tin whiskers,” which are artifacts that spread out from the connection points and interfere with their physical and operational integrity. To increase the reliability of the connection points, a tin/lead alloy is used in space-based components like the VE2302. While this is clearly no longer lead free, the RoHS directive has an exemption that allows it for space flight hardware.
The third area of focus was on the encapsulation of components at a chip level. While the VE2302 is a monolithic SoC, the package also includes decoupling capacitors. For space-grade ICs, all components are coated with an organic compound to prevent dust and debris from compromising physical and operational integrity of the chip.
Last, but certainly not least, is the actual qualification of the part. While not simple by any means, this is normally relatively straightforward: as mentioned earlier, it means ensuring compliance with the DoD’s MIL-PRF-38535 standard.
However, in the VE2302’s case, there was a slight wrinkle, in that the AMD part marks the first time an organic package was used as opposed to a ceramic package. This was done because ceramic packages can’t support transceivers faster than 12.5 Gb/s. With the VE2302 being capable of slightly more than 26 Gb/s, the decision was made to use organic packaging so as not to constrain transceiver performance, which is critical for the AI/ML use cases that the VE2302 is targeting.
This ceramic packaging transceiver limitation is widely known and accepted, and as O’Neill said, the rest of the industry is also moving to organic packaging for this reason. AMD just happens to be one of the first to do so and at the time of qualification testing, the industry hadn’t yet converged on a revision to the MIL-PRF-38535 standard to accommodate organic packaging. As such, AMD used pre-standard testing by taking the existing standard and modifying it with applicable organic packaging specifications from JEDEC. MIL-PRF-38535 has since been updated with organic packaging requirements released in November of 2022 and AMD, according to O’Neill, is confident that the pre-standard testing results will be more than sufficient for qualification.
AI in space
On the AI side of things, AMD optimized the VE2302 for converting raw sensor data into useful information, as well as ML inferencing with not just INT8 support but also the addition of two of the more prevalent inferencing data types, INT4 and BFLOAT16. As such, the SoC is targeted primarily at anomaly and image detection use cases.
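To give a feel for what the integer inferencing data types mean in practice, the sketch below applies simple symmetric quantization to a handful of made-up weights at INT8 and INT4 precision. This is a generic textbook scheme, not AMD’s toolchain, and the weight values and function names are assumptions for illustration. BFLOAT16 is a floating-point format and is not covered here.

```python
def quantize(values, bits):
    """Symmetric quantization: map floats to signed integers of the given bit width."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale factor."""
    return [q * scale for q in quantized]


# Hypothetical model weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.27, 0.33]

q8, scale8 = quantize(weights, 8)
q4, scale4 = quantize(weights, 4)

# Worst-case reconstruction error at each precision.
err8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, scale8)))
err4 = max(abs(w - d) for w, d in zip(weights, dequantize(q4, scale4)))
```

The INT4 version stores each weight in half the bits of INT8 but reconstructs it less precisely, which is the basic trade-off between memory and power on one side and accuracy on the other.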
Automated command and control with on-board AI
One such use case is a planetary lander attempting a landing on a remote planet or moon. While the planet might have been surveyed from afar, an image feed from one of the on-board cameras can be passed through the SoC’s AI engine to refine the search for a suitable landing site. Upon identifying such a site, the VE2302 can then power up a laser rangefinder to continuously provide altitude and velocity readings, which the chip can then use to autonomously make real-time thrust and directional adjustments.
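As a rough illustration of how altitude and velocity readings can drive thrust adjustments, the sketch below uses basic constant-deceleration kinematics: to bring a descent speed v to zero over a remaining altitude h, the lander needs a net deceleration of v²/(2h), plus enough thrust to cancel gravity. This is a deliberately simplified, one-dimensional model and not how AMD or any real lander implements guidance; the lunar gravity constant, starting conditions and function names are assumptions.

```python
LUNAR_GRAVITY = 1.62  # m/s^2, approximate surface gravity of the Moon (assumed target body)


def required_thrust_accel(altitude_m, descent_speed_mps, gravity=LUNAR_GRAVITY):
    """Thrust acceleration needed to reach zero speed at zero altitude."""
    if altitude_m <= 0:
        raise ValueError("altitude must be positive")
    braking = descent_speed_mps ** 2 / (2 * altitude_m)
    return braking + gravity


def simulate_descent(altitude_m, descent_speed_mps, dt=0.01):
    """Recompute thrust at every step, as if fresh rangefinder readings arrived each time."""
    h, v = altitude_m, descent_speed_mps
    while h > 0.5 and v > 0.1:
        thrust = required_thrust_accel(h, v)
        v -= (thrust - LUNAR_GRAVITY) * dt   # net deceleration slows the descent
        h -= v * dt                          # lander keeps closing on the surface
    return h, v


# Hypothetical starting point: 1 km up, descending at 50 m/s.
initial_thrust = required_thrust_accel(1000.0, 50.0)
final_altitude, final_speed = simulate_descent(1000.0, 50.0)
```

Because the thrust is recomputed from the latest readings at every step, the loop corrects itself as conditions change, which is the appeal of doing this on board rather than waiting on commands from the ground.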
Fault condition monitoring, prediction and resolution with on-board AI
Due to the remote and hostile environment in which space-based platforms operate, methods of identifying and resolving common fault conditions, such as system crashes, that might be simple on earth become extremely difficult, sometimes impossible, in space, at least without AI.
This next example actually happened to a friend of mine, who at the time was the director of satellite operations at one of the major satellite companies in the United States. While working with her software development group, she learned that the code would sometimes cause the processor to crash. When she asked the group how they planned on recovering control of the satellite when this happened, the answer she received was that they would simply reboot the computer. To which she responded with an arched eyebrow and a simple, “While in space?”
The collective face palming that ensued could probably have been heard from said satellite as the team realized the complexity of not only monitoring for the fault condition from the ground station, but then also sending a remote command to a satellite that was essentially dead in terms of communications and processing resources. With an AI/ML-capable SoC, however, monitoring for and even predicting the conditions that lead to a computer crash like the one above can be done with on-board resources, without communicating with the ground station. The response can also be automated so that the condition is avoided in the first place.
AI inferencing could also use inputs like telemetry signals to predict impending fault conditions or use image sensor inputs for obstacle detection and avoidance.
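To show the general shape of on-board telemetry monitoring, the sketch below flags readings that deviate sharply from recent history using a rolling z-score. This is a simple statistical stand-in for a trained ML model, not what would actually run on the VE2302’s AI engines; the class name, window size, threshold and temperature readings are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class TelemetryMonitor:
    """Flag samples that sit far outside the recent history of a telemetry channel."""

    def __init__(self, window=20, threshold=3.0):
        self.history = deque(maxlen=window)   # most recent normal samples
        self.threshold = threshold            # allowed deviation in standard deviations

    def update(self, sample: float) -> bool:
        """Return True if the sample looks anomalous; otherwise add it to the history."""
        anomalous = False
        if len(self.history) >= 5:            # wait for a minimal baseline first
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0 and abs(sample - mu) / sigma > self.threshold:
                anomalous = True
        if not anomalous:
            self.history.append(sample)       # keep outliers out of the baseline
        return anomalous


# Hypothetical board temperature readings in degrees Celsius, with one spike.
readings = [20.1, 19.9, 20.0, 20.2, 19.8, 20.1, 20.0, 35.0, 20.1]
monitor = TelemetryMonitor()
flags = [monitor.update(r) for r in readings]
```

In this run only the 35.0 reading is flagged. An on-board system could use a flag like this to trigger a corrective action locally instead of waiting for the ground station to notice.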
Other useful capabilities
While neither required nor necessarily AI-related, one other useful feature when dealing with space-based platforms is the ability to reprogram the adaptive SoC even while deployed in space. Not all radiation-tolerant FPGAs offer this capability, but according to AMD, the VE2302 supports “unlimited reprogramming during development as well as after deployment, including in-flight in the harsh radiation environment of space.” This allows the adaptive SoC to be reconfigured as the mission changes, or to perform different duties using the same chip, saving board space and costs.
One small step leading to a giant leap?
In terms of AI and more specifically AI in space, we must first learn to walk before we can run.
With SoCs like the VE2302, we are starting to take the first steps in trying to figure out how to maximize AI-capable compute resources in space-grade form factors, as well as what we can do with them once we get them up there.
We’re clearly not yet at the AGI level of AI capability. However, with many small steps being taken by a global ecosystem spanning hardware, software, as well as model creation and optimization, it might not be very far into the future when we look back and see how big of a leap we’ve taken.
After all, it only took a little more than 60 years from when we first discovered flight to when we landed on the moon.
This article is extracted from https://www.eetimes.com/what-it-takes-for-amd-to-bring-ai-into-space/