It is exam season, and rather than relying solely on textbooks, I decided to work through a hands-on case study to better understand how process mining techniques are applied to real-life data.
This topic is structured as a three-part series:
- Part 1: Process Discovery
- Part 2: Conformance Checking
- Part 3: Performance Analysis
Part 1: Process Discovery
This article uses the “Sepsis Cases - Event Log”, a real-life event log that contains events of sepsis cases from a hospital. Sepsis is a life-threatening condition, typically caused by infection, that requires timely and coordinated medical intervention.
Event Log Overview
The data was recorded by the hospital’s ERP (Enterprise Resource Planning) system and consists of approximately 1,000 cases and 15,000 events, spanning 16 distinct activity types. In addition to event-level information, the log contains 39 contextual data attributes, including responsible medical units, diagnostic test results, and checklist-related information.
In this event log, each case represents the complete care pathway of an individual patient through the hospital, from admission to subsequent examinations and treatments.
Before conducting any analysis, the raw XES file was inspected programmatically using Python and pm4py in order to understand the full set of recorded event and case attributes.
Full list of attributes
Index(['InfectionSuspected', 'org:group', 'DiagnosticBlood', 'DisfuncOrg',
'SIRSCritTachypnea', 'Hypotensie', 'SIRSCritHeartRate', 'Infusion',
'DiagnosticArtAstrup', 'concept:name', 'Age', 'DiagnosticIC',
'DiagnosticSputum', 'DiagnosticLiquor', 'DiagnosticOther',
'SIRSCriteria2OrMore', 'DiagnosticXthorax', 'SIRSCritTemperature',
'time:timestamp', 'DiagnosticUrinaryCulture', 'SIRSCritLeucos',
'Oligurie', 'DiagnosticLacticAcid', 'lifecycle:transition', 'Diagnose',
'Hypoxie', 'DiagnosticUrinarySediment', 'DiagnosticECG',
'case:concept:name', 'Leucocytes', 'CRP', 'LacticAcid'],
dtype='object')
Like other event logs used in process mining, the dataset contains the three core elements required for analysis:
- case:concept:name: case ID (patient-level)
- concept:name: activity name
- time:timestamp: timestamp
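The inspection step described above can be sketched roughly as follows. This is a minimal sketch, assuming a recent pm4py 2.x release; `load_event_log` and `missing_core_columns` are illustrative helper names of my own, not pm4py functions, and the file path in the usage note is a placeholder.

```python
# Sketch: load an XES log with pm4py and confirm the three core columns exist.
# Only read_xes and convert_to_dataframe are pm4py calls; the rest is illustrative.

CORE_COLUMNS = ("case:concept:name", "concept:name", "time:timestamp")


def load_event_log(path):
    """Read an XES file and return it as a pandas DataFrame (one row per event)."""
    import pm4py  # third-party; imported lazily so the check below runs without it

    log = pm4py.read_xes(path)
    # Older pm4py versions return an EventLog object here; converting is
    # harmless when the result is already a DataFrame.
    return pm4py.convert_to_dataframe(log)


def missing_core_columns(columns):
    """Return the core process-mining columns that are absent from `columns`."""
    present = set(columns)
    return [name for name in CORE_COLUMNS if name not in present]
```

With the log loaded as `df = load_event_log("sepsis.xes")` (placeholder path), `print(df.columns)` produces an attribute listing like the one above, and `missing_core_columns(df.columns)` should return an empty list.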
To understand the log better, I took a closer look at the attributes.
The event log contains a wide range of contextual attributes. For example, diagnostic results such as leukocyte count, CRP, and lactic acid levels are recorded only for relevant events.
Sample Event-Level Attributes (Excerpt)
| Row | InfectionSuspected | org:group | DiagnosticBlood | Leucocytes | CRP | LacticAcid |
|---|---|---|---|---|---|---|
| 0 | True | A | True | NaN | NaN | NaN |
| 1 | NaN | B | NaN | 9.6 | NaN | NaN |
| 2 | NaN | B | NaN | NaN | 21.0 | NaN |
| 3 | NaN | B | NaN | NaN | NaN | 2.2 |
| 4 | NaN | C | NaN | NaN | NaN | NaN |
This illustrates that many clinical attributes are event-specific rather than universally recorded across all events, resulting in a structurally sparse but semantically meaningful log.
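To put a number on this sparsity, one option is to compute, per attribute, the share of events in which a value is actually recorded. The sketch below uses only the five rows from the excerpt above, with `None` standing in for NaN; `fill_rates` is an illustrative helper name, not part of any library.

```python
# Rows copied from the excerpt table; None stands in for NaN (not recorded).
COLUMNS = ["InfectionSuspected", "org:group", "DiagnosticBlood",
           "Leucocytes", "CRP", "LacticAcid"]
ROWS = [
    (True, "A", True, None, None, None),
    (None, "B", None, 9.6, None, None),
    (None, "B", None, None, 21.0, None),
    (None, "B", None, None, None, 2.2),
    (None, "C", None, None, None, None),
]


def fill_rates(columns, rows):
    """Share of rows in which each column holds a recorded value."""
    total = len(rows)
    return {
        name: sum(row[i] is not None for row in rows) / total
        for i, name in enumerate(columns)
    }


rates = fill_rates(COLUMNS, ROWS)
for name, rate in rates.items():
    print(f"{name:<20}{rate:.0%}")
```

In this excerpt only org:group is filled in every row. On the full log held as a pandas DataFrame, `df.notna().mean()` gives the same per-column share.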
Process Discovery
I typically start with the Inductive Miner for noisy logs, and only explore alternative algorithms if interpretability or specific patterns require it.
Before applying the Inductive Miner, I assessed whether this dataset would be suitable for simpler discovery approaches such as the Alpha Miner. The Alpha Miner is sensitive to noise and handles short loops poorly, so it was deemed inappropriate for the following reasons:
Activity Sequences
- The log contains 846 unique activity sequences (variants). With approx. 1,000 cases, this means that most cases follow a path that no other case shares.
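A variant count of this kind can be obtained by ordering each case's events by timestamp and counting the distinct activity sequences. The sketch below does this in plain Python on a small invented log (the case IDs, activity names, and timestamps are toy values, not taken from the Sepsis data); pm4py also offers `pm4py.get_variants` for the same purpose.

```python
from collections import defaultdict

# Toy events as (case id, activity, timestamp); all values are invented.
EVENTS = [
    ("c1", "register", 1), ("c1", "triage", 2), ("c1", "release", 3),
    ("c2", "register", 1), ("c2", "triage", 2), ("c2", "release", 3),
    ("c3", "register", 1), ("c3", "lab test", 2),
    ("c3", "triage", 3), ("c3", "release", 4),
]


def traces_by_case(events):
    """Group events per case and order each case's activities by timestamp."""
    grouped = defaultdict(list)
    for case_id, activity, timestamp in events:
        grouped[case_id].append((timestamp, activity))
    # sorted() orders by timestamp first; ties fall back to the activity name.
    return {
        case_id: tuple(activity for _, activity in sorted(pairs))
        for case_id, pairs in grouped.items()
    }


def count_variants(events):
    """Number of distinct activity sequences across all cases."""
    return len(set(traces_by_case(events).values()))


print(count_variants(EVENTS), "variants in", len(traces_by_case(EVENTS)), "cases")
```

Here c1 and c2 share a sequence while c3 follows its own, so the toy log has two variants across three cases.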
Loops
- Around 700 of the approx. 1,000 cases contain at least one repeated activity, so roughly 70% of cases include a loop.
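One simple way to arrive at such a figure is to flag every case in which some activity occurs more than once. This treats a repeated activity as a proxy for a loop, which is an assumption of this sketch; the traces below are invented toy data and the helper names are illustrative.

```python
# Toy traces keyed by case id; all values are invented.
TRACES = {
    "c1": ("register", "triage", "release"),
    "c2": ("register", "lab test", "triage", "lab test", "release"),
    "c3": ("register", "lab test", "lab test", "release"),
    "c4": ("register", "release"),
}


def has_repetition(trace):
    """True if at least one activity occurs more than once in the trace."""
    return len(set(trace)) < len(trace)


def share_with_repetition(traces):
    """Fraction of cases whose trace repeats at least one activity."""
    repeated = sum(has_repetition(trace) for trace in traces.values())
    return repeated / len(traces)


print(f"{share_with_repetition(TRACES):.0%} of cases repeat an activity")
```

In the toy data, c2 and c3 repeat "lab test", so half of the cases are flagged.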
Noise
- Several activities each account for less than 1% of all events. They represent rare but valid clinical pathways rather than recording errors.
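A frequency check like this boils down to counting events per activity and comparing each share against a threshold. The sketch below uses an invented list of activity names (the names and counts are toy values) and a hypothetical helper `rare_activities`; the 1% cut-off mirrors the figure quoted above.

```python
from collections import Counter

# Toy list of activity names, one entry per event; all values are invented.
ACTIVITIES = ["triage"] * 150 + ["lab test"] * 49 + ["transfer"] * 1


def rare_activities(activities, threshold=0.01):
    """Activities whose share of all events is below `threshold`."""
    counts = Counter(activities)
    total = sum(counts.values())
    return {
        activity: count / total
        for activity, count in counts.items()
        if count / total < threshold
    }


print(rare_activities(ACTIVITIES))
```

Only "transfer" falls below the threshold here, with 1 of 200 events.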
Based on these characteristics, the Inductive Miner was selected as the primary discovery technique, as it produces sound, block-structured process models while remaining robust to variability, loops, and noise.
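For reference, running the Inductive Miner in pm4py could look roughly like the sketch below. It assumes a recent pm4py 2.x release; the wrapper name `discover_with_inductive_miner` is my own, and the default `noise_threshold` is a placeholder, since this article does not state which value was used.

```python
def discover_with_inductive_miner(log, noise_threshold=0.0):
    """Discover a process tree with the Inductive Miner and derive a Petri net.

    `log` is an event log as returned by pm4py.read_xes. The noise threshold
    default is a placeholder, not the setting used in the article.
    """
    import pm4py  # third-party dependency, imported lazily

    # The Inductive Miner returns a block-structured process tree.
    tree = pm4py.discover_process_tree_inductive(
        log, noise_threshold=noise_threshold
    )
    # The tree can be converted to a Petri net with initial and final markings.
    net, initial_marking, final_marking = pm4py.convert_to_petri_net(tree)
    return tree, (net, initial_marking, final_marking)
```

The resulting tree can be rendered with `pm4py.view_process_tree(tree)`, which requires Graphviz. Raising `noise_threshold` filters infrequent behavior and usually yields a simpler model, at the cost of dropping rare pathways.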
Key Structural Insights from Process Discovery
The process discovery phase reveals that sepsis treatment pathways are highly variable, loop-intensive, and exception-driven. Rather than exhibiting a dominant linear flow, the process is characterized by iterative diagnostics, parallel activities, and a wide range of low-frequency pathways.
These structural characteristics are not artifacts of poor data quality, but intrinsic properties of the underlying clinical process. As a result, subsequent analyses do not assume a single reference model, but instead focus on understanding deviations, performance implications, and improvement opportunities within a highly flexible process structure.