Python for Clinical Study Reports and Submission

R/Pharma 2025 Workshop

Yilong Zhang, Nan Xiao

2025-11-07

Welcome

Outline

Four parts of this workshop:

  1. Python environment setup (Nan)
    Use uv to create and manage reproducible Python projects. Develop and collaborate in GitHub Codespaces, Visual Studio Code, or Positron.

  2. Python packages for clinical reporting (Yilong)
    A guided tour of essential packages such as polars and rtflite, with demonstrations of creating TLFs commonly used in clinical trials.

  3. Manage clinical trial A&R projects (Yilong)
    Practical project structure, conventions, and execution from data to deliverables.

  4. Prepare eCTD submission packages (Nan)
    An example workflow for assembling submission-ready source code and outputs using py-pkglite, aligned with eCTD requirements.

Disclaimer

The views and opinions expressed in this presentation are those of the individual presenters and do not represent those of their affiliated organizations or institutions.

Training objective

Learn how to use Python to:

  • Create tables for clinical study reports
  • Organize clinical development projects effectively
  • Prepare eCTD submission packages to regulatory agencies

Note

Toolchains, processes, and formats vary across organizations. This workshop presents one common approach.

Note

Interested in R? See https://r4csr.org/

Acknowledgements

  • R/Pharma organizers

    • It is a fun and productive annual gathering
    • Please consider sharing stories and use cases to expand the community
  • Team members from Meta Platforms and Merck & Co., Inc., Rahway, NJ, USA

  • Contributors of pycsr and r4csr training materials

    • Please consider submitting issues or pull requests in the repos

Preparation

In this workshop, we assume you have basic Python programming experience and clinical development knowledge.

Pre-configured Codespaces for pycsr book

Examples:

  • Data manipulation, plotting, and reporting: polars, plotnine, rtflite.
  • ADaM data: adsl, adae, etc.

Resource

Philosophy

We share the automation philosophy of the R community, described in Section 1.1 of the R Packages book and quoted here.

  • “Anything that can be automated, should be automated.”
  • “Do as little as possible by hand. Do as much as possible with functions.”
  • “The goal is to spend your time thinking about what you want to do rather than thinking about the minutiae of package structure.”

Python environment setup

Development environments

Three recommended options:

GitHub Codespaces

  • Cloud-based, pre-configured
  • No local setup needed
  • 120 free core-hours/month for personal accounts

Positron

  • Posit’s next-gen IDE
  • Native notebook support
  • Built-in data viewer

VS Code

  • Most popular choice
  • Rich extension ecosystem
  • Essential extensions: Python, Pylance, Ruff, Quarto

Why uv?

uv is a modern Python package and project manager written in Rust.

Replaces scattered toolchain:

  • pip + venv + pyenv + pip-tools + setuptools

Benefits:

  • Fast: 10-100x faster than pip
  • Complete: Manages Python versions, dependencies, builds
  • Modern: Uses pyproject.toml as single source of truth
  • Reliable: Automatic dependency resolution and lock files

Installing uv

Skip if using GitHub Codespaces: uv is pre-installed there.

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows:

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Verify:

uv --version

Quick start with uv

# Create new project
uv init pycsr-example
cd pycsr-example

# Pin Python version
uv python pin 3.14.0

# Add dependencies
uv add polars plotnine rtflite

# Add dev dependencies
uv add --dev ruff pytest mypy

# Sync environment
uv sync

Python toolchain essentials

Ruff - Code formatting and linting

uv run ruff format .
uv run ruff check .

mypy - Type checking

uv run mypy src/

pytest - Testing framework

uv run pytest tests/

All configured in pyproject.toml.

Key concepts

Virtual environments are mandatory in Python

  • Isolate project dependencies
  • Prevent conflicts
  • Enable reproducibility

Dependency locking

  • uv.lock pins exact versions
  • Ensures reproducible environments
  • Similar to R’s renv.lock

.python-version file

  • Specifies exact Python version (for example, 3.14.0)
  • Critical for regulatory submissions

Delivering TLFs in CSR

ICH E3 guidance

The ICH E3 guideline, Structure and Content of Clinical Study Reports, provides guidance to assist sponsors in the development of a CSR.

In a CSR, most TLFs are located in:

  • Section 10: Study patients
  • Section 11: Efficacy evaluation
  • Section 12: Safety evaluation
  • Section 14: Tables, Figures and Graphs referred to but not included in the text
  • Section 16: Appendices

Datasets

Tools

  • polars: Python package for data manipulation similar to dplyr/tidyr R packages

  • rtflite: Python package for creating production-ready tables and figures in RTF format similar to R package r2rtf

polars intro

Why polars?

https://pycsr.org/tlf-overview.html#polars

Modern Python dataframe library designed for performance and expressiveness.

Key advantages:

  • Fast: Written in Rust with parallel execution
  • Memory efficient: Lazy evaluation and streaming support
  • Type-safe: Strong type system prevents common errors
  • Modern API: Method chaining with clear, readable syntax

Core operations:

df.filter(pl.col("AGE") > 65)
df.group_by("TRT01P").agg(n = pl.len())
df.pivot(index="row", on="TRT01PN", values="n")

Essential patterns for TLFs

Counting participants:

df.group_by("TRT01P").agg(n = pl.len())

Calculating percentages:

# counts and totals are illustrative names for the per-group and per-arm counts
counts.join(totals, on="TRT01P").with_columns(
    pct = (100.0 * pl.col("n") / pl.col("total")).round(1)
)

Pivoting to wide format:

counts.pivot(index="category", on="TRT01P", values="n")

rtflite intro

Motivation

In the pharmaceutical industry, RTF and Microsoft Word documents play a central role in preparing clinical study reports.

Different organizations can have different table standards:

  • For example, table layout, font size, border type, footnote, data source

Package overview

The rtflite package provides the flexibility to customize table appearance:

  • Table component: title, column header, footnote, etc.
  • Table cell style: size, border type, color, font size, text color, alignment, etc.
  • Flexible control: the specification of the cell style can be row or column vectorized.
  • Complicated format: pagination, section grouping, multiple table concatenations, etc.

Note

The rtflite package can also convert figures to RTF format.

Example

import polars as pl
import rtflite as rtf

# Load and prepare data
df = pl.read_parquet("data/adsl.parquet")

# Create RTF document
doc = rtf.RTFDocument(
    df=df.head(6),
    rtf_body=rtf.RTFBody(),
)

doc.write_rtf("output.rtf")

Key components

rtflite provides Python classes, such as RTFDocument and RTFBody, that map to table elements. The goal is to help you translate a data frame into a table in an RTF file.


CSR examples

Disposition table

Key concepts:

  • Track participant flow (enrolled -> completed/discontinued)
  • Use .pivot() to reshape data to wide format
  • Handle missing categories with .fill_null(0)
  • Multi-level headers with borders

Link: https://pycsr.org/tlf-disposition.html

Analysis population

Key concepts:

  • Document multiple analysis populations (ITT, efficacy, safety)
  • Population flags: ITTFL, EFFFL, SAFFL
  • Reusable helper functions
  • Conditional formatting: N for totals, N (%) for subsets

Link: https://pycsr.org/tlf-population.html
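
The conditional formatting rule can live in one reusable helper. This is a minimal sketch: the function name `fmt_cell`, the row labels, and the counts are hypothetical.

```python
from typing import Optional


def fmt_cell(n: int, total: Optional[int] = None) -> str:
    """Return 'N' for a totals row, or 'N (xx.x)' when a denominator is given."""
    if total is None:
        return str(n)
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{n} ({100.0 * n / total:.1f})"


# Hypothetical counts for one treatment arm
randomized = 86
rows = {
    "Participants in population": fmt_cell(randomized),
    "Participants included in ITT population": fmt_cell(86, randomized),
    "Participants included in safety population": fmt_cell(84, randomized),
}

for label, cell in rows.items():
    print(f"{label}: {cell}")
```

Centralizing the rule in one function keeps the N and N (%) conventions identical across every table that reuses it.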

Baseline characteristics

Key concepts:

  • Separate functions for continuous vs categorical
  • Continuous: Mean (SD), Median [Min, Max]
  • Categorical: n (%)
  • Build tables with proper indentation

Link: https://pycsr.org/tlf-baseline.html
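
The split between continuous and categorical summaries can be sketched with two small functions. To stay dependency-free, this example uses the Python standard library rather than polars; the function names, decimal places, and data are assumptions.

```python
import statistics
from collections import Counter


def summarize_continuous(values: list) -> dict:
    """Mean (SD) and Median [Min, Max] for non-missing numeric values."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    median = statistics.median(values)
    return {
        "Mean (SD)": f"{mean:.1f} ({sd:.2f})",
        "Median [Min, Max]": f"{median:.1f} [{min(values):.1f}, {max(values):.1f}]",
    }


def summarize_categorical(values: list) -> dict:
    """n (%) per category, using the number of records as the denominator."""
    counts = Counter(values)
    total = len(values)
    return {k: f"{n} ({100.0 * n / total:.1f})" for k, n in sorted(counts.items())}


age = [63.0, 71.0, 77.0, 81.0, 68.0]  # made-up ages
sex = ["F", "M", "F", "F", "M"]       # made-up categories

print(summarize_continuous(age))
print(summarize_categorical(sex))
```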

Efficacy table

Key concepts:

  • LOCF imputation for missing data
  • ANCOVA with statsmodels
  • LS means at baseline mean
  • Multiple table sections in one document
  • Comprehensive footnotes

Link: https://pycsr.org/tlf-efficacy-ancova.html

AE Summary table

Key concepts:

  • Count unique participants with .n_unique()
  • Standard AE categories (any, drug-related, serious, deaths)
  • Join with population totals for percentages
  • Multi-level column headers

Link: https://pycsr.org/tlf-ae-summary.html

Specific AE table

Key concepts:

  • Hierarchical structure: SOC -> Preferred Terms
  • Standardize terms with .str.to_titlecase()
  • Conditional formatting with lambda functions
  • Bold headers for top-level categories

Link: https://pycsr.org/tlf-ae-specific.html

Break (5 min)

Analysis package

What is an analysis package?

A Python package designed specifically to organize analysis scripts and code for a clinical trial project.

Purpose:

Our primary focus is creating a standard folder structure to organize the project, with the following goals in mind:

  • Project containers for clinical trial deliverables
  • Reproducible environments for analyses
  • Submission-ready structures for regulatory review

Combines:

  • Python package structure (code organization)
  • Quarto project (report generation)
  • Regulatory requirements (eCTD submission)

Package structure

demo-py-esub/
├── pyproject.toml          # Project metadata
├── .python-version         # Python version
├── uv.lock                 # Locked dependencies
├── src/demo001/            # Study-specific code
│   ├── __init__.py
│   └── utils.py
├── analysis/               # Quarto analysis docs
│   └── tlf-*.qmd
├── data/                   # ADaM datasets
├── output/                 # Generated TLFs
└── tests/                  # Validation tests

See: https://pycsr.org/pkg-structure.html
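
A quick structural check can confirm that a project follows this layout. The sketch below is a hypothetical helper, not part of the demo repository; the expected entries are taken from the tree above.

```python
from pathlib import Path

# Top-level entries from the layout shown above
EXPECTED = (
    "pyproject.toml",
    ".python-version",
    "uv.lock",
    "src",
    "analysis",
    "data",
    "output",
    "tests",
)


def missing_paths(root: Path, expected: tuple = EXPECTED) -> list:
    """Return the expected entries that do not exist under root."""
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # Prints an empty list when every expected entry is present
    print(missing_paths(Path("demo-py-esub")))
```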

Benefits

Consistency

  • Standard structure across projects
  • Team knows where files belong

Reproducibility

  • uv.lock pins dependencies
  • .python-version specifies Python

Automation

  • uv sync restores environment
  • quarto render generates outputs
  • pytest validates code

Compliance

  • Built-in documentation
  • Testing infrastructure
  • Standard structure

Git-centric workflow

Core principle: All project assets in version control.

Plain text workflow:

  • .qmd files for analysis (not .ipynb for final deliverables)
  • .md files for documentation
  • .toml files for configuration
  • Avoid .xlsx files for tracking

Project tracking:

  • Issues for requirements
  • Pull requests for review
  • Project boards (Kanban)

See: https://pycsr.org/pkg-management.html

Development lifecycle

Planning:

  • Define TLFs from SAP
  • Create mock tables
  • Assign validation levels
  • Lock Python version and package repo

Development:

  • Create feature branches
  • Implement in analysis/ and src/
  • Self-test against mocks
  • Open pull requests

Validation:

  • Independent review
  • Write unit tests in tests/
  • Run automated checks (ruff, mypy, pytest)

Delivery:

  • Generate all outputs with quarto render
  • Prepare submission package

Break (5 min)

eCTD submission

FDA requirements

FDA Study Data Technical Conformance Guide Section 4.1.2.10:

Submit programs for primary and secondary efficacy analyses. Specify software in ADRG. Use ASCII text format. No executable extensions.

Goal: Enable reviewers to understand and confirm analysis algorithms.

See: https://pycsr.org/submission-overview.html

Demo repositories

Analysis package: https://github.com/elong0527/demo-py-esub

Submission package: https://github.com/elong0527/demo-py-ectd

Clone and explore to see complete examples.

eCTD Module 5 structure

m5/datasets/<study-id>/analysis/adam/
├── datasets/
│   ├── *.xpt               # ADaM datasets
│   ├── define.xml
│   ├── adrg.pdf            # Instructions
│   └── analysis-results-metadata.pdf
└── programs/
    ├── py0pkgs.txt         # Packed Python package
    ├── tlf-01-*.txt        # Analysis programs
    └── tlf-02-*.txt

Key: All files in programs/ must be ASCII text.
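
The ASCII requirement is easy to check automatically before packaging. The following sketch uses a hypothetical helper name and assumes the folder holds the files destined for programs/.

```python
from pathlib import Path


def non_ascii_files(folder: Path) -> list:
    """Return names of files under folder that contain any non-ASCII byte."""
    bad = []
    for path in sorted(folder.rglob("*")):
        if path.is_file() and not path.read_bytes().isascii():
            bad.append(path.name)
    return bad


if __name__ == "__main__":
    # An empty list means every file is pure ASCII
    print(non_ascii_files(Path("programs")))
```

Reading raw bytes instead of decoded text means the check does not depend on guessing each file's encoding.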

The solution: pkglite for Python

Packs Python projects into portable text files.

Why needed:

  • Python packages have directory structure
  • May contain binary files
  • FDA requires ASCII text format

pkglite capabilities:

  • Pack entire project into single .txt file
  • Preserve file paths and metadata
  • Unpack to restore original structure
  • Support multiple packages in one file

Documentation: https://pharmaverse.github.io/py-pkglite/

Packing workflow

1. Create .pkgliteignore

uvx pkglite use demo-py-esub/

2. Pack the package

uvx pkglite pack demo-py-esub/ \
  -o programs/py0pkgs.txt

3. Convert Quarto to Python scripts

  • Render .qmd -> verify it works
  • Convert .qmd -> .ipynb -> .py
  • Clean and format with ruff
  • Save as .txt (no .py extension)

See: https://pycsr.org/submission-package.html

Packed file format

Human-readable Debian Control File (DCF) format:

# Generated by py-pkglite
# Use `pkglite unpack` to restore

Package: demo-py-esub
File: pyproject.toml
Format: text
Content:
  [project]
  name = "demo001"
  version = "0.1.0"
  ...

Reviewers can read without special tools.

Updating ADRG

Document the Python environment:

Python environment:

Software  Version  Description
Python    3.14.0   Programming language
uv        0.9.9    Package manager

Packages:

Package  Version  Description
polars   1.35.1   Data manipulation
rtflite  1.1.0    RTF generation
demo001  0.1.0    Study functions

Appendix: Step-by-step reproduction instructions.
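
Version rows like those above can be drafted from the active environment instead of being typed by hand. This is a sketch with hypothetical helper names; it assumes it is run inside the project environment (for example through uv), and the lock file remains the authoritative record.

```python
import platform
from importlib import metadata


def package_versions(names: list) -> dict:
    """Look up installed versions; mark packages that are not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def version_rows(names: list) -> list:
    """Format 'name version' rows, starting with the running Python version."""
    rows = [f"Python {platform.python_version()}"]
    rows += [f"{name} {ver}" for name, ver in package_versions(names).items()]
    return rows


if __name__ == "__main__":
    print("\n".join(version_rows(["polars", "rtflite", "demo001"])))
```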

Dry run testing

Essential: Simulate reviewer experience before submission.

Workflow:

  1. Create clean directory
  2. Copy submission materials
  3. Unpack package: uvx pkglite unpack programs/py0pkgs.txt -o .
  4. Install dependencies: cd demo-py-esub && uv sync
  5. Run each program in the project environment: for f in ../programs/tlf-*.txt; do uv run python "$f"; done
  6. Verify outputs match originals

Catches: Missing dependencies, path errors, platform issues.

See: https://pycsr.org/submission-dryrun.html
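
The final verification step can be scripted. The sketch below compares files by SHA-256 digest; the function names and the `*.rtf` pattern are assumptions. A byte-for-byte match is a strict criterion, so outputs that embed timestamps would need a content-level comparison instead.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_outputs(original: Path, rerun: Path, pattern: str = "*.rtf") -> dict:
    """Classify each original output as 'match', 'differs', or 'missing'."""
    status = {}
    for src in sorted(original.glob(pattern)):
        other = rerun / src.name
        if not other.exists():
            status[src.name] = "missing"
        elif file_digest(src) == file_digest(other):
            status[src.name] = "match"
        else:
            status[src.name] = "differs"
    return status
```

A typical call would pass the original output folder and the folder produced by the dry run, then flag any entry that is not "match".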

Q&A

Resources

Book:

Regulatory:

Technical: