🐍 Python Execution in Manuscripts

Execute Python code directly in your manuscripts with Jupyter notebook-like functionality, creating dynamic documents where data analysis and narrative text are seamlessly integrated.

Overview

Rxiv-Maker's Python execution feature transforms static manuscripts into dynamic, data-driven documents. Code executes during PDF compilation, ensuring your results stay synchronized with your data and analysis.

Key Benefits

  • 🔄 Live Data Integration: Numbers update automatically when data changes
  • 📊 Dynamic Analysis: Perform calculations and generate insights during compilation
  • 🔒 Reproducible Results: All analysis code lives with your manuscript
  • ⚡ Intelligent Caching: Skip unchanged computations for efficient rebuilds

Syntax Overview

{{py:exec}} - Initialization Blocks

Use for data loading, computations, and variable setup:

{{py:exec
import pandas as pd
import numpy as np

# Load your research data
df = pd.read_csv("FIGURES/DATA/experiment_results.csv")

# Perform analysis
total_samples = len(df)
mean_response = df['response_time'].mean()
correlation = np.corrcoef(df['treatment'], df['outcome'])[0, 1]

# Prepare summary statistics
stats = {
    'n': total_samples,
    'mean_rt': mean_response,
    'r': correlation,
    'effect_size': df['treatment'].mean() - df['control'].mean()
}
}}

Important: {{py:exec}} blocks are removed from the final PDF; they exist only to run initialization code.

{{py:get variable}} - Value Insertion

Insert computed values directly into your narrative text:

Our experiment included {{py:get stats['n']}} participants.
The mean response time was {{py:get stats['mean_rt']:.2f}} ms,
with a treatment effect of {{py:get stats['effect_size']:.3f}}
(r = {{py:get stats['r']:.2f}}).

Output in PDF:

Our experiment included 847 participants. The mean response time was 423.56 ms, with a treatment effect of 0.152 (r = 0.73).
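
The text after the colon is a standard Python format specification, as accepted by str.format. A few example renderings (the variable names and values here are placeholders):

{{py:get count:,}}    → 12,847  (thousands separator)
{{py:get share:.1%}}  → 15.0%   (percentage with one decimal)
{{py:get effect:.3f}} → 0.152   (three decimal places)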

3-Step Execution Model

Rxiv-Maker follows a predictable execution sequence:

Step 1: Execute All {{py:exec}} Blocks

All initialization blocks run in document order, building up the analysis context:

{{py:exec
# Block 1 - Data loading
df = pd.read_csv("FIGURES/DATA/data.csv")
}}

Text content here...

{{py:exec
# Block 2 - Analysis (can use df from Block 1)
mean_value = df['measurement'].mean()
std_value = df['measurement'].std()
}}

More text...

{{py:exec
# Block 3 - Summary (can use df and the values from Block 2)
summary = {'conclusion': f"mean {mean_value:.2f} (SD {std_value:.2f})"}
}}

Step 2: Process All {{py:get}} Insertions

After all execution blocks complete, value insertions are processed:

Dataset size: {{py:get len(df)}}
Key finding: {{py:get mean_value:.3f}}
Summary: {{py:get summary['conclusion']}}

Step 3: Continue with LaTeX Processing

The document proceeds with normal Rxiv-Maker processing (figures, citations, etc.).

Practical Examples

Research Data Analysis

{{py:exec
import pandas as pd
import numpy as np
from scipy import stats
from pathlib import Path

# Load experimental data
data_file = Path("FIGURES/DATA/experiment_2024.csv")
if data_file.exists():
    df = pd.read_csv(data_file)

    # Data validation
    complete_cases = df.dropna()
    exclusion_rate = (len(df) - len(complete_cases)) / len(df) * 100

    # Statistical analysis
    treatment_group = complete_cases[complete_cases['group'] == 'treatment']
    control_group = complete_cases[complete_cases['group'] == 'control']

    # T-test
    t_stat, p_value = stats.ttest_ind(
        treatment_group['outcome'],
        control_group['outcome']
    )

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(
        ((len(treatment_group) - 1) * treatment_group['outcome'].var() +
         (len(control_group) - 1) * control_group['outcome'].var()) /
        (len(treatment_group) + len(control_group) - 2)
    )
    cohens_d = (treatment_group['outcome'].mean() -
                control_group['outcome'].mean()) / pooled_std

    analysis_complete = True
else:
    print("Warning: Data file not found")
    analysis_complete = False
}}

## Methods

{% if analysis_complete %}
We analyzed {{py:get len(complete_cases)}} participants after excluding
{{py:get exclusion_rate:.1f}}% of cases due to missing data. Participants
were randomly assigned to treatment (n = {{py:get len(treatment_group)}})
or control (n = {{py:get len(control_group)}}) conditions.

## Results

The treatment group showed significantly higher outcomes
(M = {{py:get treatment_group['outcome'].mean():.2f}},
SD = {{py:get treatment_group['outcome'].std():.2f}}) compared to
the control group (M = {{py:get control_group['outcome'].mean():.2f}},
SD = {{py:get control_group['outcome'].std():.2f}}),
t({{py:get len(treatment_group) + len(control_group) - 2}}) = {{py:get t_stat:.2f}},
p = {{py:get p_value:.3f}}, d = {{py:get cohens_d:.2f}}.
{% else %}
*Analysis pending - data file not available*
{% endif %}

Real-time Data Updates

{{py:exec
import requests
import pandas as pd
from datetime import datetime

# Fetch latest arXiv submission data
def update_arxiv_stats():
    try:
        # This would fetch real data in practice
        # url = "https://arxiv.org/stats/monthly_submissions"
        # response = requests.get(url)

        # For demo, using cached data
        df = pd.read_csv("FIGURES/DATA/arxiv_monthly.csv")
        df['date'] = pd.to_datetime(df['year_month'])

        return {
            'total_submissions': int(df['submissions'].sum()),
            'latest_month': df.iloc[-1]['year_month'],
            'monthly_average': float(df['submissions'].mean()),
            'growth_rate': 0.15,  # Computed from trend analysis
            'last_updated': datetime.now().strftime("%B %Y")
        }
    except Exception as e:
        print(f"Data update failed: {e}")
        return None

# Get current statistics
arxiv_stats = update_arxiv_stats()
data_available = arxiv_stats is not None
}}

{% if data_available %}
This analysis uses arXiv data through {{py:get arxiv_stats['latest_month']}}
containing {{py:get arxiv_stats['total_submissions']:,}} total submissions
with an average of {{py:get arxiv_stats['monthly_average']:.0f}} submissions
per month ({{py:get arxiv_stats['growth_rate']:.1%}} annual growth rate).
Data last updated: {{py:get arxiv_stats['last_updated']}}.
{% else %}
*arXiv statistics unavailable; the data update failed.*
{% endif %}

Integration with src/py/ Modules

Rxiv-Maker automatically adds your manuscript's src/py/ directory to the Python path:

MANUSCRIPT/
└── src/
    └── py/
        ├── analysis.py
        ├── plotting.py
        └── utils.py

analysis.py:

import pandas as pd
import numpy as np
from typing import Dict, Any

def load_and_validate_data(filepath: str) -> pd.DataFrame:
    """Load CSV data with validation."""
    df = pd.read_csv(filepath)

    # Data validation
    if df.empty:
        raise ValueError("Dataset is empty")

    # Remove outliers with the IQR rule (numeric columns only)
    numeric = df.select_dtypes(include=[np.number])
    Q1 = numeric.quantile(0.25)
    Q3 = numeric.quantile(0.75)
    IQR = Q3 - Q1
    df_clean = df[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]

    print(f"Loaded {len(df)} records, {len(df_clean)} after outlier removal")
    return df_clean

def compute_summary_stats(df: pd.DataFrame) -> Dict[str, Any]:
    """Compute comprehensive summary statistics."""
    return {
        'count': len(df),
        'means': df.select_dtypes(include=[np.number]).mean().to_dict(),
        'stds': df.select_dtypes(include=[np.number]).std().to_dict(),
        'correlations': df.corr(numeric_only=True).to_dict()
    }
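
The manuscript example below also imports create_distribution_plot from plotting.py. The function name comes from that example; this implementation is only a sketch of what such a module might contain:

plotting.py:

import matplotlib
matplotlib.use("Agg")  # render without a display during compilation
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path

def create_distribution_plot(df: pd.DataFrame, output_path: str) -> None:
    """Plot the distribution of each numeric column and save the figure."""
    numeric = df.select_dtypes(include="number")
    fig, ax = plt.subplots(figsize=(6, 4))
    for column in numeric.columns:
        ax.hist(numeric[column], bins=30, alpha=0.5, label=column)
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")
    ax.legend()
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(output_path)  # .svg extension yields an SVG file
    plt.close(fig)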

In your manuscript:

{{py:exec
from analysis import load_and_validate_data, compute_summary_stats
from plotting import create_distribution_plot

# Load and analyze data
df = load_and_validate_data("FIGURES/DATA/experiment.csv")
stats = compute_summary_stats(df)

# Generate figure
create_distribution_plot(df, "FIGURES/PLOTS/distribution.svg")
}}

Our analysis included {{py:get stats['count']}} participants with
mean age {{py:get stats['means']['age']:.1f}} years.

Error Handling and Debugging

Understanding Error Messages

When Python code fails, detailed error information appears in the compilation output:

Python execution error in exec block: Initialization block execution failed (in manuscript:45):
Error in manuscript:45: [Errno 2] No such file or directory: 'FIGURES/DATA/missing_file.csv'

Best Practices for Robust Code

{{py:exec
from pathlib import Path
import pandas as pd

# Check file existence before loading
data_file = Path("FIGURES/DATA/results.csv")

if data_file.exists():
    try:
        df = pd.read_csv(data_file)
        analysis_successful = True
        sample_size = len(df)
        mean_value = df['measurement'].mean()

        print(f"Successfully loaded {sample_size} records")

    except pd.errors.EmptyDataError:
        print("Warning: Data file is empty")
        analysis_successful = False
        sample_size = 0
        mean_value = None

    except Exception as e:
        print(f"Data loading error: {e}")
        analysis_successful = False
        sample_size = 0
        mean_value = None
else:
    print(f"Warning: {data_file} not found")
    analysis_successful = False
    sample_size = 0
    mean_value = None

# Provide fallback values
display_size = sample_size if analysis_successful else "N/A"
display_mean = f"{mean_value:.2f}" if mean_value is not None else "not available"
}}

{% if analysis_successful %}
Our dataset contains {{py:get sample_size}} samples with
mean value {{py:get mean_value:.2f}}.
{% else %}
*Analysis could not be completed - data file issues detected.*
{% endif %}
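
Because display_size and display_mean always hold a printable value, they can also be inserted without a conditional guard:

Dataset: {{py:get display_size}} samples (mean value: {{py:get display_mean}}).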

Security and Limitations

Security Model

Python code executes in a subprocess sandbox with:

  • Restricted module access: Only safe, scientific computing modules allowed
  • File system limitations: Access limited to manuscript directory (a defensive pattern is sketched after this list)
  • Network restrictions: External network calls require explicit approval
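
Independent of the sandbox's own enforcement, a defensive path check keeps your code from straying outside the manuscript directory. A minimal sketch (it assumes exec blocks run with the manuscript root as the working directory; Path.is_relative_to requires Python 3.9+):

from pathlib import Path

# Assumption: the working directory is the manuscript root
MANUSCRIPT_ROOT = Path.cwd().resolve()

def safe_path(relative: str) -> Path:
    """Resolve a path and refuse anything outside the manuscript directory."""
    candidate = (MANUSCRIPT_ROOT / relative).resolve()
    if not candidate.is_relative_to(MANUSCRIPT_ROOT):  # Python 3.9+
        raise ValueError(f"Path escapes manuscript directory: {relative}")
    return candidate

data_file = safe_path("FIGURES/DATA/results.csv")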

Allowed Modules

# ✅ Safe modules (always available)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import json
import csv
from pathlib import Path
from datetime import datetime

# ⚠️ Network modules (use carefully)
import requests
import urllib

# ❌ Restricted modules (security risk)
import os          # File system access
import subprocess  # Command execution
import sys         # System manipulation

Performance Optimization

Intelligent Caching

Rxiv-Maker caches Python execution results and skips unchanged code:

{{py:exec
# This expensive computation is cached
large_dataset = pd.read_csv("FIGURES/DATA/large_file.csv")
complex_analysis = perform_expensive_analysis(large_dataset)

# Results are cached until code or data changes
}}

Cache Management

{{py:exec
from pathlib import Path
import json

# Manual caching for expensive operations
cache_file = Path("FIGURES/DATA/.cache/analysis_results.json")

if cache_file.exists():
    with open(cache_file, 'r') as f:
        cached_results = json.load(f)
    print("Loaded cached analysis results")
    use_cached = True
else:
    # Perform expensive analysis
    cached_results = perform_analysis()

    # Save to cache
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, 'w') as f:
        json.dump(cached_results, f)
    print("Analysis complete, results cached")
    use_cached = False

summary_stats = cached_results['summary']
}}
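
To make a manual cache invalidate itself when the underlying data changes, one option is to fold a content hash of the input file into the cache filename. A sketch, assuming hashlib is available in the sandbox:

{{py:exec
import hashlib
import json
from pathlib import Path

data_file = Path("FIGURES/DATA/large_file.csv")

# Key the cache on the file's content, so edits to the data invalidate it
digest = hashlib.md5(data_file.read_bytes()).hexdigest()[:12]
cache_file = Path(f"FIGURES/DATA/.cache/analysis_{digest}.json")

if cache_file.exists():
    cached_results = json.loads(cache_file.read_text())
else:
    cached_results = perform_analysis()  # placeholder for the expensive step
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(cached_results))
}}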

Migration from Legacy Syntax

Old Syntax (Deprecated)

<!-- Old approach -->
{{py:
import pandas as pd
df = pd.read_csv("data.csv")
result = df.mean()
print(f"Mean: {result}")
}}

New Syntax (Current)

<!-- New 3-step approach -->
{{py:exec
import pandas as pd
df = pd.read_csv("FIGURES/DATA/data.csv")
mean_result = df['value'].mean()
}}

The mean value is {{py:get mean_result:.2f}}.

Summary

Python execution in Rxiv-Maker enables:

  1. Dynamic manuscripts where data and narrative stay synchronized
  2. Reproducible analysis with version-controlled code
  3. Live data integration for always-current results
  4. Professional workflows with caching and error handling

Ready to add dynamic analysis to your manuscripts? Start with simple examples and gradually incorporate more complex data processing as needed.

Next Steps:

  • LaTeX Injection → for precise typesetting control
  • VS Code Extension → for enhanced editing
  • Examples → for real-world use cases