
Building Safety-Critical Systems: Lessons from Aviation, Railway, and Industrial Domains

by Sani Saša Burgić · Technology

Deep dive into developing safety-critical embedded systems across multiple domains, covering DO-178C, EN 50128, ISO 13849, and the practical challenges of achieving functional safety certification.

Safety-Critical · Functional Safety · DO-178C · ISO 13849 · Certification

Safety-critical systems are the backbone of modern transportation, industrial automation, and medical devices. When software failures can lead to loss of life or catastrophic damage, development becomes a different discipline entirely. Here's what I've learned from building certified systems across aviation, railway, and industrial sectors.

Understanding Safety Integrity Levels

Different domains use different safety classification schemes, but they all serve the same purpose: quantifying acceptable risk.

Aviation: DO-178C Levels

  • Level A: Catastrophic failure (loss of aircraft)
  • Level B: Hazardous failure (serious injury)
  • Level C: Major failure (passenger discomfort)
  • Level D: Minor failure (no safety impact)
  • Level E: No safety effect

Railway: EN 50128 SIL Levels

  • SIL-4: Intolerable risk (10⁻⁹ failures/hour)
  • SIL-3: Undesirable risk (10⁻⁸ failures/hour)
  • SIL-2: Tolerable risk (10⁻⁷ failures/hour)
  • SIL-1: Acceptable risk (10⁻⁶ failures/hour)

Industrial: ISO 13849 Performance Levels

  • PL e: Highest risk reduction
  • PL d: High risk reduction
  • PL c: Medium risk reduction (our current HMI target)
  • PL b: Low risk reduction
  • PL a: Minimal risk reduction

The DO-178C Journey: Aviation Software Certification

Working with DO-178C at Schiebel and RT-RK taught me that aviation software development is fundamentally different from commercial development.

Key Principles

1. Requirements Traceability

Every line of code must trace back to a requirement. Every requirement must trace to:

  • System requirements
  • Test cases
  • Design documents
  • Verification procedures
// Example: Requirement-driven development
// REQ-SYS-001: System shall process heartbeat every 100ms
// REQ-SW-010: Software shall implement heartbeat handler
// TEST-SW-010: Verify heartbeat timing ±5ms

void heartbeat_handler(void) {
    // MISRA compliant implementation: static counter, no dynamic memory
    static uint32_t counter = 0U;

    counter++;

    // Once the limit is reached, reset the counter and enter the safe state
    if (counter >= MAX_HEARTBEAT_COUNT) {
        counter = 0U;
        trigger_safety_shutdown();
    }
}

2. Structural Coverage

DO-178C requires specific code coverage levels:

  • Level A: Modified Condition/Decision Coverage (MC/DC)
  • Level B: Decision Coverage
  • Level C: Statement Coverage

MC/DC is particularly challenging—every condition in a decision must be shown to independently affect the outcome.
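To make MC/DC concrete: for a two-condition AND, three test cases suffice, because each pair below differs in exactly one condition while flipping the outcome. A hypothetical sketch (the function and names are illustrative, not from a certified codebase):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical decision under test: release only when the door is
 * closed AND speed is within limits. */
static bool brake_release_allowed(bool door_closed, bool speed_ok) {
    return door_closed && speed_ok;
}

/* MC/DC for a two-condition AND needs only 3 of the 4 input
 * combinations: cases 1+2 show door_closed independently flips the
 * outcome, cases 1+3 show the same for speed_ok. */
void run_mcdc_cases(void) {
    assert(brake_release_allowed(true,  true)  == true);   /* case 1: baseline */
    assert(brake_release_allowed(false, true)  == false);  /* case 2: door_closed */
    assert(brake_release_allowed(true,  false) == false);  /* case 3: speed_ok */
}
```

Note that plain decision coverage would be satisfied by just cases 1 and 2; MC/DC is stricter precisely because it demands evidence of each condition's independent effect.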

3. MISRA Compliance

MISRA C coding standards are typically mandatory. Key rules include:

  • No dynamic memory allocation
  • No recursion
  • Restricted pointer arithmetic
  • Explicit type conversions
  • No undefined behavior
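A few of these rules in miniature (the snippets are illustrative patterns in the spirit of MISRA C, not output of a rule checker):

```c
#include <stdint.h>

#define BUFFER_SIZE 64U

/* Rule: no dynamic memory allocation - use a statically allocated pool
 * instead of malloc(). */
static uint8_t rx_buffer[BUFFER_SIZE];

/* Rule: explicit type conversions - no silent narrowing from
 * uint32_t to uint8_t. */
uint8_t clamp_to_byte(uint32_t value) {
    uint8_t result;
    if (value > 255U) {
        result = (uint8_t)255U;
    } else {
        result = (uint8_t)value;  /* explicit, reviewable cast */
    }
    return result;
}

/* Rule: no recursion - iterative traversal with a bounded loop. */
uint32_t sum_buffer(const uint8_t *buf, uint32_t len) {
    uint32_t sum = 0U;
    for (uint32_t i = 0U; i < len; i++) {
        sum += (uint32_t)buf[i];
    }
    return sum;
}
```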

Practical Challenges

Challenge 1: Tool Qualification

Even your compiler and debugger need qualification. Any tool that can introduce errors must be qualified according to DO-330.

Solution: Use pre-qualified toolchains or budget for tool qualification activities.

Challenge 2: Change Management

A one-line bug fix can trigger extensive re-verification if it affects certified code.

Solution: Partition software into safety-critical and non-critical regions. Use strong interfaces with minimal coupling.

EN 50128: Railway Safety at SIL-4

The railway R&D project at Thales targeted SIL-4—the highest safety integrity level. This meant designing systems where failure probability must be less than 10⁻⁹ per hour.

Architecture for Safety

Hardware Redundancy

Primary Controller (STM32F4) <-> Safety Monitor (Independent MCU)
         |                                    |
         v                                    v
   Dual Watchdogs                      Cross-checking
         |                                    |
         v                                    v
   Safe State Transition         <->    Voting Logic
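The cross-checking and voting step in the diagram can be sketched as a 2-out-of-2 comparison between the two channels; the tolerance value and names below are illustrative assumptions, not the actual Thales design:

```c
#include <stdint.h>

/* Illustrative sketch of a 2-out-of-2 cross-check: both channels must
 * agree within a tolerance, otherwise the system transitions to the
 * safe state. */
#define SPEED_TOLERANCE 2U  /* assumed tolerance, in sensor units */

typedef enum {
    VOTE_AGREE,
    VOTE_DISAGREE_SAFE_STATE
} VoteResult_t;

VoteResult_t vote_2oo2(uint32_t primary_speed, uint32_t monitor_speed) {
    /* Absolute difference without signed arithmetic */
    uint32_t diff = (primary_speed > monitor_speed)
                        ? (primary_speed - monitor_speed)
                        : (monitor_speed - primary_speed);

    return (diff <= SPEED_TOLERANCE) ? VOTE_AGREE
                                     : VOTE_DISAGREE_SAFE_STATE;
}
```

The point of the pattern: neither channel alone can keep the system running; a disagreement always resolves to the safe state, never to "trust the primary".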

Software Patterns

Defensive Programming:

// SIL-4 compliant data handling
typedef struct {
    uint32_t data;
    uint32_t data_inverse;  // Redundant inverse
    uint16_t crc;           // Integrity check
} SafeData_t;

ErrorCode_t safe_data_write(SafeData_t *sd, uint32_t value) {
    if (sd == NULL) {
        return ERROR_NULL_POINTER;
    }

    sd->data = value;
    sd->data_inverse = ~value;
    sd->crc = calculate_crc16((uint8_t*)&value, sizeof(value));

    // Verify write
    if ((sd->data != value) || (sd->data != ~sd->data_inverse)) {
        trigger_safety_fault();
        return ERROR_DATA_CORRUPTION;
    }

    return ERROR_NONE;
}
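The read path has to re-validate the same redundancy before the value is trusted. A minimal self-contained sketch of the counterpart (the CRC-16 routine here is a generic stand-in for the project's actual implementation):

```c
#include <stdint.h>
#include <stddef.h>

typedef enum {
    ERROR_NONE,
    ERROR_NULL_POINTER,
    ERROR_DATA_CORRUPTION
} ErrorCode_t;

typedef struct {
    uint32_t data;
    uint32_t data_inverse;  /* Redundant inverse */
    uint16_t crc;           /* Integrity check */
} SafeData_t;

/* Generic CRC-16 (Modbus polynomial) as a stand-in implementation */
static uint16_t calculate_crc16(const uint8_t *buf, size_t len) {
    uint16_t crc = 0xFFFFU;
    for (size_t i = 0U; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc & 1U) ? (uint16_t)((crc >> 1) ^ 0xA001U)
                             : (uint16_t)(crc >> 1);
        }
    }
    return crc;
}

/* Every read re-checks both the inverse copy and the CRC before the
 * value reaches the application */
ErrorCode_t safe_data_read(const SafeData_t *sd, uint32_t *out) {
    if ((sd == NULL) || (out == NULL)) {
        return ERROR_NULL_POINTER;
    }
    if (sd->data != ~sd->data_inverse) {
        return ERROR_DATA_CORRUPTION;
    }
    if (sd->crc != calculate_crc16((const uint8_t *)&sd->data,
                                   sizeof(sd->data))) {
        return ERROR_DATA_CORRUPTION;
    }
    *out = sd->data;
    return ERROR_NONE;
}
```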

Testing at SIL-4

  • Formal Methods: Mathematical proof of correctness
  • Fault Injection: Deliberate introduction of errors
  • FMEA/FMECA: Systematic failure mode analysis
  • Environmental Testing: Temperature, vibration, EMI

ISO 13849 & ISO 25119: Industrial Safety HMI

Currently at TTControl, I'm developing a safety HMI device compliant with:

  • ISO 13849 (Performance Level c)
  • ISO 25119 (Safety Requirement Level C)

Multi-Processor Safety Architecture

The NXP i.MX 8QM presents unique challenges with its heterogeneous cores, including two ARM Cortex-M4F cores:

+---------------------------------------+
|         Application Core              |
|  (FreeRTOS - Non-Safety-Critical)     |
+---------------------------------------+
              | (IPC)
              v
+---------------------------------------+
|          Safety Core                  |
|   (SafeRTOS - Safety-Critical)        |
|   - Input validation                  |
|   - Safety logic                      |
|   - Watchdog management               |
+---------------------------------------+
              |
              v
+---------------------------------------+
|      Hardware Safety Layer            |
|   - Emergency stop circuits           |
|   - Dual-channel inputs               |
|   - Failsafe outputs                  |
+---------------------------------------+

Key Design Decisions

1. Partitioning

Separate safety-critical from non-critical functions:

  • Safety Core: Emergency stop, safety logic, diagnostics
  • Application Core: UI rendering, networking, data logging

2. Communication

Inter-processor communication with safety in mind:

// Safety IPC with timeout and validation
typedef enum {
    IPC_CMD_ESTOP = 0x01,
    IPC_CMD_STATUS = 0x02,
    IPC_CMD_HEARTBEAT = 0xFF
} IPC_Command_t;

typedef struct {
    IPC_Command_t cmd;
    uint32_t sequence;
    uint32_t timestamp;
    uint8_t data[32];
    uint16_t crc;
} __attribute__((packed)) IPC_Message_t;

bool ipc_send_safety(IPC_Command_t cmd, const uint8_t *data, size_t len) {
    IPC_Message_t msg;

    // Zero the whole message so unused payload bytes are deterministic
    // and the CRC is reproducible on both sides
    memset(&msg, 0, sizeof(msg));

    msg.cmd = cmd;
    msg.sequence = get_next_sequence();
    msg.timestamp = get_timestamp_ms();
    memcpy(msg.data, data, MIN(len, sizeof(msg.data)));
    msg.crc = calculate_crc16((uint8_t*)&msg,
                              sizeof(msg) - sizeof(msg.crc));

    return send_with_timeout(&msg, sizeof(msg), IPC_TIMEOUT_MS);
}
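On the receiving side, the same message must be validated before it is acted on. A hedged sketch, assuming monotonic sequence numbers and a millisecond timestamp (the policies and constants are illustrative, not the shipped code; sequence wrap-around handling is omitted):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef enum {
    IPC_CMD_ESTOP = 0x01,
    IPC_CMD_STATUS = 0x02,
    IPC_CMD_HEARTBEAT = 0xFF
} IPC_Command_t;

typedef struct {
    IPC_Command_t cmd;
    uint32_t sequence;
    uint32_t timestamp;
    uint8_t data[32];
    uint16_t crc;
} __attribute__((packed)) IPC_Message_t;

#define IPC_MAX_AGE_MS 50U  /* assumed freshness window */

/* Generic CRC-16 (Modbus polynomial) as a stand-in implementation */
static uint16_t crc16(const uint8_t *buf, size_t len) {
    uint16_t crc = 0xFFFFU;
    for (size_t i = 0U; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc & 1U) ? (uint16_t)((crc >> 1) ^ 0xA001U)
                             : (uint16_t)(crc >> 1);
        }
    }
    return crc;
}

/* Rejects messages that are corrupted, replayed, or stale */
bool ipc_validate(const IPC_Message_t *msg,
                  uint32_t last_sequence, uint32_t now_ms) {
    if (msg == NULL) {
        return false;
    }
    /* Corruption: CRC over everything except the CRC field itself */
    if (msg->crc != crc16((const uint8_t *)msg,
                          sizeof(*msg) - sizeof(msg->crc))) {
        return false;
    }
    /* Replay or duplicate: sequence must advance */
    if (msg->sequence <= last_sequence) {
        return false;
    }
    /* Staleness: message older than the freshness window */
    if ((now_ms - msg->timestamp) > IPC_MAX_AGE_MS) {
        return false;
    }
    return true;
}
```

The safety-relevant property is that every rejection path is silent drop plus diagnostics, never "use the last good value" for commands like E-stop.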

3. Diagnostics

Continuous self-testing:

  • RAM tests (March algorithm)
  • Flash integrity (CRC)
  • Clock monitoring
  • Voltage supervision
  • Communication path testing
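As a flavour of what the RAM test looks like, here is a deliberately simplified destructive sketch: a reduced march element, not a full March C- implementation, and only usable on a region whose contents may be destroyed:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced march-style RAM test: write/verify 0x00 ascending, then
 * verify/write/verify 0xFF descending. Detects stuck-at faults but
 * not the full coupling-fault set that March C- covers. */
bool ram_march_test(volatile uint8_t *start, size_t len) {
    /* Element 1: ascending - write 0x00 and read it back */
    for (size_t i = 0U; i < len; i++) {
        start[i] = 0x00U;
        if (start[i] != 0x00U) {
            return false;
        }
    }
    /* Element 2: descending - expect 0x00, write 0xFF, read back */
    for (size_t i = len; i > 0U; i--) {
        if (start[i - 1U] != 0x00U) {
            return false;
        }
        start[i - 1U] = 0xFFU;
        if (start[i - 1U] != 0xFFU) {
            return false;
        }
    }
    return true;
}
```

In a real device this runs in sections, with interrupts masked and the tested region's contents saved or excluded, so the test can execute periodically without disturbing live data.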

Lessons Learned Across Domains

1. Start with Safety in Mind

Retrofitting safety into an existing design is expensive and often impossible. Safety must be part of the initial architecture.

2. Document Everything

In safety-critical development, if it's not documented, it doesn't exist. Maintain:

  • Design rationale
  • Safety analysis
  • Test evidence
  • Change history

3. Automate What You Can

  • Static analysis (MISRA checking)
  • Unit testing frameworks
  • Requirements tracing tools
  • Coverage analysis

4. Plan for Certification Early

Understand certification requirements before writing code:

  • Which standard applies?
  • What safety level is needed?
  • What evidence is required?
  • What tools need qualification?

5. Build a Safety Culture

Safety isn't just process—it's mindset. Encourage:

  • Questioning assumptions
  • Reporting potential issues
  • Learning from near-misses
  • Sharing lessons learned

The Future of Safety-Critical Systems

Emerging trends include:

  • AI/ML in safety contexts (challenging for certification)
  • Cybersecurity integration (ISO 21434, IEC 62443)
  • Over-the-air updates for certified software
  • Model-based development with automatic code generation

Each brings new challenges but also opportunities for safer, more reliable systems.

Conclusion

Building safety-critical systems is demanding but rewarding. Every project contributes to systems that protect human lives. Whether it's aircraft flying safely, trains running reliably, or industrial equipment operating without accidents—our work makes a difference.

The key is maintaining rigorous engineering discipline while continuously learning and adapting to new technologies and standards. Safety is never "done"—it's an ongoing commitment to excellence.


These lessons come from real projects across multiple safety domains. While specific implementations vary, the fundamental principles of safety-critical development remain constant: rigor, traceability, and an unwavering commitment to doing things right.