An engineering story
To track interesting quality metrics of our upcoming XiVO Office product (XiVO IPBX Open Hardware project), we have decided to add temperature sensors to our current XIOH PCB.
In computers, the typical way to report the temperature to the main operating system is through SMBus. This is suitable in our case: we already have an MSP430 microcontroller that handles the power sequence and is connected to the SMBus of the board. We will connect some diodes to the MSP430 to measure the temperature. So the time has come to make use of the SMBus between the MSP and the EP80579 (our main System On Chip), for temperature measurements and also other purposes.
The MSP430 does not have a full-featured SMBus controller, only a generic I2C one. SMBus is a variant of I2C, with additional electrical and timing constraints at the physical layer and a definition of the message formats at the network layer.
Although formatting and parsing SMBus messages is easy, properly using the I2C controller of the MSP430 in a multi-master environment is not without pitfalls, even if we did not care about the SMBus timing constraints. To do it with the needed reliability, it is necessary to have detailed knowledge of the whole system and to take into consideration all kinds of interactions on the bus and in the chips. In our business, the reliability wanted by customers is typically high enough that it makes sense to build robust systems instead of rushing a collapsing sandcastle to market. Plus, in this particular case, we are dealing with the subsystem that brings up the whole board and keeps it running, and for which the cost of debugging in the field is absurdly high.
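To illustrate how little code the message layer itself needs, here is a minimal sketch in C of checking and decoding the reply to an SMBus Read Word transaction. It assumes the optional Packet Error Code (PEC) is in use, which nothing above states for our board; the function names and the 7-bit address in the usage note are hypothetical and are not taken from our firmware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SMBus Packet Error Code: CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07),
 * initial value 0, processed MSB first, no final XOR. */
uint8_t smbus_pec(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80)
                crc = (uint8_t)((crc << 1) ^ 0x07);
            else
                crc = (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Hypothetical helper: validate and decode the reply of a Read Word
 * transaction with PEC.
 *   addr7   : 7-bit address of the target device
 *   command : command code sent in the write phase
 *   reply   : the 3 bytes read back (data low, data high, PEC)
 * Returns 0 and stores the word in *value, or -1 on a PEC mismatch. */
int smbus_parse_read_word(uint8_t addr7, uint8_t command,
                          const uint8_t reply[3], uint16_t *value)
{
    uint8_t covered[5];

    /* The PEC covers every byte of the transaction, addresses included. */
    covered[0] = (uint8_t)(addr7 << 1);        /* address + write bit */
    covered[1] = command;
    covered[2] = (uint8_t)((addr7 << 1) | 1);  /* address + read bit */
    covered[3] = reply[0];                     /* data byte low */
    covered[4] = reply[1];                     /* data byte high */

    if (smbus_pec(covered, sizeof covered) != reply[2])
        return -1;

    *value = (uint16_t)(reply[0] | (reply[1] << 8));
    return 0;
}
```

A caller would pass the three bytes read from the bus, for example `smbus_parse_read_word(0x48, 0x00, reply, &raw)`, and treat a return value of -1 as a corrupted transfer. None of this touches the hard part, which is driving the I2C controller correctly when several masters share the bus.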
All complex chips come with various design errors, and the MSP430 is no exception. On the exact version that we use, there are 6 documented errata affecting the I2C controller, of which 4 clearly apply to our board, 1 clearly does not apply, and 1 required careful system analysis to determine that the preconditions for the erroneous behavior could not occur in our system.
On top of the 4 errata applying to the I2C controller, we have to deal with errata for other parts of the MSP430, plus some detailed aspects that are not errata but also limit the ways in which we can make reliable use of the chip for the tasks we want. Failure to properly take all those details into consideration would lead to eventual faults of various natures, probably including MSP430 crashes that are impossible to diagnose and that lead to spurious shutdowns, systems stuck in the powered state, or any random behavior and degradation of system functions.
The impact could be wide-ranging, with potential consequences for availability, maintainability, safety, security, and reliability!
It is worth noting that one of the errata that could have the biggest consequences can only be handled by using one specific software architecture to drive the I2C controller, and that specific software architecture is not the first thing that comes to mind given our preexisting firmware. This is a case where iterating on the design of the I2C code would have meant its complete rewrite.
Complex systems, even moderately so, need a careful design, especially on components that are critical for business or technical reasons. Wishful thinking never produces high reliability and neither does excessive reliance on luck. Modeling, even informally, sometimes pays.