
Building Safety-Critical Systems: Lessons from Aviation, Railway, and Industrial Domains

by Sani Saša Burgić · Technology

Deep dive into developing safety-critical embedded systems across multiple domains, covering DO-178C, EN 50128, ISO 13849, and the practical challenges of achieving functional safety certification.

Safety-Critical · Functional Safety · DO-178C · ISO 13849 · Certification

Safety-critical systems are the backbone of modern transportation, industrial automation, and medical devices. When software failures can lead to loss of life or catastrophic damage, development becomes a different discipline entirely. Here's what I've learned from building certified systems across aviation, railway, and industrial sectors.

Understanding Safety Integrity Levels

Different domains use different safety classification schemes, but they all serve the same purpose: quantifying acceptable risk.

Aviation: DO-178C Levels

  • Level A: Catastrophic failure (loss of aircraft)
  • Level B: Hazardous failure (serious injury)
  • Level C: Major failure (passenger discomfort)
  • Level D: Minor failure (no safety impact)
  • Level E: No safety effect

Railway: EN 50128 SIL Levels

  • SIL-4: Intolerable risk (10⁻⁹ failures/hour)
  • SIL-3: Undesirable risk (10⁻⁸ failures/hour)
  • SIL-2: Tolerable risk (10⁻⁷ failures/hour)
  • SIL-1: Acceptable risk (10⁻⁶ failures/hour)

Industrial: ISO 13849 Performance Levels

  • PL e: Highest risk reduction
  • PL d: High risk reduction
  • PL c: Medium risk reduction (our current HMI target)
  • PL b: Low risk reduction
  • PL a: Minimal risk reduction

The DO-178C Journey: Aviation Software Certification

Working with DO-178C at Schiebel and RT-RK taught me that aviation software development is fundamentally different from commercial development.

Key Principles

1. Requirements Traceability

Every line of code must trace back to a requirement. Every requirement must trace to:

  • System requirements
  • Test cases
  • Design documents
  • Verification procedures
// Example: Requirement-driven development
// REQ-SYS-001: System shall process heartbeat every 100ms
// REQ-SW-010: Software shall implement heartbeat handler
// TEST-SW-010: Verify heartbeat timing ±5ms

void heartbeat_handler(void) {
    // MISRA compliant implementation: static counter, no dynamic memory
    static uint32_t counter = 0U;

    counter++;

    // Once the limit is reached, reset the counter and enter the safe state
    if (counter >= MAX_HEARTBEAT_COUNT) {
        counter = 0U;
        trigger_safety_shutdown();
    }
}

2. Structural Coverage

DO-178C requires specific code coverage levels:

  • Level A: Modified Condition/Decision Coverage (MC/DC)
  • Level B: Decision Coverage
  • Level C: Statement Coverage

MC/DC is particularly challenging—every condition in a decision must be shown to independently affect the outcome.
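To make MC/DC concrete: for a two-condition AND, three test cases suffice, because each pair below differs in exactly one condition while flipping the outcome. A hypothetical sketch (the function and names are illustrative, not from a certified codebase):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical decision under test: release only when the door is
 * closed AND speed is within limits. */
static bool brake_release_allowed(bool door_closed, bool speed_ok) {
    return door_closed && speed_ok;
}

/* MC/DC for a two-condition AND needs only 3 of the 4 input
 * combinations: cases 1+2 show door_closed independently flips the
 * outcome, cases 1+3 show the same for speed_ok. */
void run_mcdc_cases(void) {
    assert(brake_release_allowed(true,  true)  == true);   /* case 1: baseline */
    assert(brake_release_allowed(false, true)  == false);  /* case 2: door_closed */
    assert(brake_release_allowed(true,  false) == false);  /* case 3: speed_ok */
}
```

Note that plain decision coverage would be satisfied by just cases 1 and 2; MC/DC is stricter precisely because it demands evidence of each condition's independent effect.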

3. MISRA Compliance

MISRA C coding standards are typically mandatory. Key rules include:

  • No dynamic memory allocation
  • No recursion
  • Restricted pointer arithmetic
  • Explicit type conversions
  • No undefined behavior
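A few of these rules in miniature (the snippets are illustrative patterns in the spirit of MISRA C, not output of a rule checker):

```c
#include <stdint.h>

#define BUFFER_SIZE 64U

/* Rule: no dynamic memory allocation - use a statically allocated pool
 * instead of malloc(). */
static uint8_t rx_buffer[BUFFER_SIZE];

/* Rule: explicit type conversions - no silent narrowing from
 * uint32_t to uint8_t. */
uint8_t clamp_to_byte(uint32_t value) {
    uint8_t result;
    if (value > 255U) {
        result = (uint8_t)255U;
    } else {
        result = (uint8_t)value;  /* explicit, reviewable cast */
    }
    return result;
}

/* Rule: no recursion - iterative traversal with a bounded loop. */
uint32_t sum_buffer(const uint8_t *buf, uint32_t len) {
    uint32_t sum = 0U;
    for (uint32_t i = 0U; i < len; i++) {
        sum += (uint32_t)buf[i];
    }
    return sum;
}
```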

Practical Challenges

Challenge 1: Tool Qualification

Even your compiler and debugger need qualification. Any tool that can introduce errors must be qualified according to DO-330.

Solution: Use pre-qualified toolchains or budget for tool qualification activities.

Challenge 2: Change Management

A one-line bug fix can trigger extensive re-verification if it affects certified code.

Solution: Partition software into safety-critical and non-critical regions. Use strong interfaces with minimal coupling.

EN 50128: Railway Safety at SIL-4

The railway R&D project at Thales targeted SIL-4—the highest safety integrity level. This meant designing systems where failure probability must be less than 10⁻⁹ per hour.

Architecture for Safety

Hardware Redundancy

Primary Controller (STM32F4) <-> Safety Monitor (Independent MCU)
         |                                    |
         v                                    v
   Dual Watchdogs                      Cross-checking
         |                                    |
         v                                    v
   Safe State Transition         <->    Voting Logic
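The cross-checking and voting step in the diagram can be sketched as a 2-out-of-2 comparison between the two channels; the tolerance value and names below are illustrative assumptions, not the actual Thales design:

```c
#include <stdint.h>

/* Illustrative sketch of a 2-out-of-2 cross-check: both channels must
 * agree within a tolerance, otherwise the system transitions to the
 * safe state. */
#define SPEED_TOLERANCE 2U  /* assumed tolerance, in sensor units */

typedef enum {
    VOTE_AGREE,
    VOTE_DISAGREE_SAFE_STATE
} VoteResult_t;

VoteResult_t vote_2oo2(uint32_t primary_speed, uint32_t monitor_speed) {
    /* Absolute difference without signed arithmetic */
    uint32_t diff = (primary_speed > monitor_speed)
                        ? (primary_speed - monitor_speed)
                        : (monitor_speed - primary_speed);

    return (diff <= SPEED_TOLERANCE) ? VOTE_AGREE
                                     : VOTE_DISAGREE_SAFE_STATE;
}
```

The point of the pattern: neither channel alone can keep the system running; a disagreement always resolves to the safe state, never to "trust the primary".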

Software Patterns

Defensive Programming:

// SIL-4 compliant data handling
typedef struct {
    uint32_t data;
    uint32_t data_inverse;  // Redundant inverse
    uint16_t crc;           // Integrity check
} SafeData_t;

ErrorCode_t safe_data_write(SafeData_t *sd, uint32_t value) {
    if (sd == NULL) {
        return ERROR_NULL_POINTER;
    }

    sd->data = value;
    sd->data_inverse = ~value;
    sd->crc = calculate_crc16((uint8_t*)&value, sizeof(value));

    // Verify write
    if ((sd->data != value) || (sd->data != ~sd->data_inverse)) {
        trigger_safety_fault();
        return ERROR_DATA_CORRUPTION;
    }

    return ERROR_NONE;
}
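The read path has to re-validate the same redundancy before the value is trusted. A minimal self-contained sketch of the counterpart (the CRC-16 routine here is a generic stand-in for the project's actual implementation):

```c
#include <stdint.h>
#include <stddef.h>

typedef enum {
    ERROR_NONE,
    ERROR_NULL_POINTER,
    ERROR_DATA_CORRUPTION
} ErrorCode_t;

typedef struct {
    uint32_t data;
    uint32_t data_inverse;  /* Redundant inverse */
    uint16_t crc;           /* Integrity check */
} SafeData_t;

/* Generic CRC-16 (Modbus polynomial) as a stand-in implementation */
static uint16_t calculate_crc16(const uint8_t *buf, size_t len) {
    uint16_t crc = 0xFFFFU;
    for (size_t i = 0U; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc & 1U) ? (uint16_t)((crc >> 1) ^ 0xA001U)
                             : (uint16_t)(crc >> 1);
        }
    }
    return crc;
}

/* Every read re-checks both the inverse copy and the CRC before the
 * value reaches the application */
ErrorCode_t safe_data_read(const SafeData_t *sd, uint32_t *out) {
    if ((sd == NULL) || (out == NULL)) {
        return ERROR_NULL_POINTER;
    }
    if (sd->data != ~sd->data_inverse) {
        return ERROR_DATA_CORRUPTION;
    }
    if (sd->crc != calculate_crc16((const uint8_t *)&sd->data,
                                   sizeof(sd->data))) {
        return ERROR_DATA_CORRUPTION;
    }
    *out = sd->data;
    return ERROR_NONE;
}
```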

Testing at SIL-4

  • Formal Methods: Mathematical proof of correctness
  • Fault Injection: Deliberate introduction of errors
  • FMEA/FMECA: Systematic failure mode analysis
  • Environmental Testing: Temperature, vibration, EMI

ISO 13849 & ISO 25119: Industrial Safety HMI

Currently at TTControl, I'm developing a safety HMI device compliant with:

  • ISO 13849 (Performance Level c)
  • ISO 25119 (Safety Requirement Level C)

Multi-Processor Safety Architecture

The NXP i.MX 8QM presents unique challenges with its heterogeneous cores, including two ARM Cortex-M4F cores:

+---------------------------------------+
|         Application Core              |
|  (FreeRTOS - Non-Safety-Critical)     |
+---------------------------------------+
              | (IPC)
              v
+---------------------------------------+
|          Safety Core                  |
|   (SafeRTOS - Safety-Critical)        |
|   - Input validation                  |
|   - Safety logic                      |
|   - Watchdog management               |
+---------------------------------------+
              |
              v
+---------------------------------------+
|      Hardware Safety Layer            |
|   - Emergency stop circuits           |
|   - Dual-channel inputs               |
|   - Failsafe outputs                  |
+---------------------------------------+

Key Design Decisions

1. Partitioning

Separate safety-critical from non-critical functions:

  • Safety Core: Emergency stop, safety logic, diagnostics
  • Application Core: UI rendering, networking, data logging

2. Communication

Inter-processor communication with safety in mind:

// Safety IPC with timeout and validation
typedef enum {
    IPC_CMD_ESTOP = 0x01,
    IPC_CMD_STATUS = 0x02,
    IPC_CMD_HEARTBEAT = 0xFF
} IPC_Command_t;

typedef struct {
    IPC_Command_t cmd;
    uint32_t sequence;
    uint32_t timestamp;
    uint8_t data[32];
    uint16_t crc;
} __attribute__((packed)) IPC_Message_t;

bool ipc_send_safety(IPC_Command_t cmd, const uint8_t *data, size_t len) {
    IPC_Message_t msg;

    // Zero the whole message so unused payload bytes are deterministic
    // and the CRC is reproducible on both sides
    memset(&msg, 0, sizeof(msg));

    msg.cmd = cmd;
    msg.sequence = get_next_sequence();
    msg.timestamp = get_timestamp_ms();
    memcpy(msg.data, data, MIN(len, sizeof(msg.data)));
    msg.crc = calculate_crc16((uint8_t*)&msg,
                              sizeof(msg) - sizeof(msg.crc));

    return send_with_timeout(&msg, sizeof(msg), IPC_TIMEOUT_MS);
}
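On the receiving side, the same message must be validated before it is acted on. A hedged sketch, assuming monotonic sequence numbers and a millisecond timestamp (the policies and constants are illustrative, not the shipped code; sequence wrap-around handling is omitted):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef enum {
    IPC_CMD_ESTOP = 0x01,
    IPC_CMD_STATUS = 0x02,
    IPC_CMD_HEARTBEAT = 0xFF
} IPC_Command_t;

typedef struct {
    IPC_Command_t cmd;
    uint32_t sequence;
    uint32_t timestamp;
    uint8_t data[32];
    uint16_t crc;
} __attribute__((packed)) IPC_Message_t;

#define IPC_MAX_AGE_MS 50U  /* assumed freshness window */

/* Generic CRC-16 (Modbus polynomial) as a stand-in implementation */
static uint16_t crc16(const uint8_t *buf, size_t len) {
    uint16_t crc = 0xFFFFU;
    for (size_t i = 0U; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc & 1U) ? (uint16_t)((crc >> 1) ^ 0xA001U)
                             : (uint16_t)(crc >> 1);
        }
    }
    return crc;
}

/* Rejects messages that are corrupted, replayed, or stale */
bool ipc_validate(const IPC_Message_t *msg,
                  uint32_t last_sequence, uint32_t now_ms) {
    if (msg == NULL) {
        return false;
    }
    /* Corruption: CRC over everything except the CRC field itself */
    if (msg->crc != crc16((const uint8_t *)msg,
                          sizeof(*msg) - sizeof(msg->crc))) {
        return false;
    }
    /* Replay or duplicate: sequence must advance */
    if (msg->sequence <= last_sequence) {
        return false;
    }
    /* Staleness: message older than the freshness window */
    if ((now_ms - msg->timestamp) > IPC_MAX_AGE_MS) {
        return false;
    }
    return true;
}
```

The safety-relevant property is that every rejection path is silent drop plus diagnostics, never "use the last good value" for commands like E-stop.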

3. Diagnostics

Continuous self-testing:

  • RAM tests (March algorithm)
  • Flash integrity (CRC)
  • Clock monitoring
  • Voltage supervision
  • Communication path testing
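As a flavour of what the RAM test looks like, here is a deliberately simplified destructive sketch: a reduced march element, not a full March C- implementation, and only usable on a region whose contents may be destroyed:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced march-style RAM test: write/verify 0x00 ascending, then
 * verify/write/verify 0xFF descending. Detects stuck-at faults but
 * not the full coupling-fault set that March C- covers. */
bool ram_march_test(volatile uint8_t *start, size_t len) {
    /* Element 1: ascending - write 0x00 and read it back */
    for (size_t i = 0U; i < len; i++) {
        start[i] = 0x00U;
        if (start[i] != 0x00U) {
            return false;
        }
    }
    /* Element 2: descending - expect 0x00, write 0xFF, read back */
    for (size_t i = len; i > 0U; i--) {
        if (start[i - 1U] != 0x00U) {
            return false;
        }
        start[i - 1U] = 0xFFU;
        if (start[i - 1U] != 0xFFU) {
            return false;
        }
    }
    return true;
}
```

In a real device this runs in sections, with interrupts masked and the tested region's contents saved or excluded, so the test can execute periodically without disturbing live data.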

Lessons Learned Across Domains

1. Start with Safety in Mind

Retrofitting safety into an existing design is expensive and often impossible. Safety must be part of the initial architecture.

2. Document Everything

In safety-critical development, if it's not documented, it doesn't exist. Maintain:

  • Design rationale
  • Safety analysis
  • Test evidence
  • Change history

3. Automate What You Can

  • Static analysis (MISRA checking)
  • Unit testing frameworks
  • Requirements tracing tools
  • Coverage analysis

4. Plan for Certification Early

Understand certification requirements before writing code:

  • Which standard applies?
  • What safety level is needed?
  • What evidence is required?
  • What tools need qualification?

5. Build a Safety Culture

Safety isn't just process—it's mindset. Encourage:

  • Questioning assumptions
  • Reporting potential issues
  • Learning from near-misses
  • Sharing lessons learned

The Future of Safety-Critical Systems

Emerging trends include:

  • AI/ML in safety contexts (challenging for certification)
  • Cybersecurity integration (ISO 21434, IEC 62443)
  • Over-the-air updates for certified software
  • Model-based development with automatic code generation

Each brings new challenges but also opportunities for safer, more reliable systems.

Conclusion

Building safety-critical systems is demanding but rewarding. Every project contributes to systems that protect human lives. Whether it's aircraft flying safely, trains running reliably, or industrial equipment operating without accidents—our work makes a difference.

The key is maintaining rigorous engineering discipline while continuously learning and adapting to new technologies and standards. Safety is never "done"—it's an ongoing commitment to excellence.


These lessons come from real projects across multiple safety domains. While specific implementations vary, the fundamental principles of safety-critical development remain constant: rigor, traceability, and an unwavering commitment to doing things right.