# 🔄 Reproducible Workflows
Create fully reproducible scientific manuscripts with automated versioning, dependency management, and collaborative workflows using Rxiv-Maker's advanced features.
## Overview
Reproducible workflows ensure that your research can be easily validated, shared, and built upon by others. Rxiv-Maker provides the infrastructure to create completely reproducible manuscripts that include automated figure generation, consistent environments, and transparent methodologies.
### Key Benefits
- Complete Reproducibility: Every figure, calculation, and result can be regenerated
- Version Control Integration: Full Git integration with automated builds
- Collaborative Workflows: Team-friendly processes with consistent environments
- Automated Quality Assurance: Continuous integration with testing and validation
- Long-term Sustainability: Future-proof your research with containerized environments
## Workflow Components

### 1. Version Control Setup
Integrate your manuscript with Git for comprehensive version tracking:
```bash
# Initialize manuscript repository
git init manuscript-project
cd manuscript-project

# Create initial structure
rxiv init
git add .
git commit -m "Initial manuscript structure"

# Set up GitHub repository
gh repo create --public
git push -u origin main
```
### 2. Environment Management
Create reproducible environments for your analysis:
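A minimal sketch of such an environment, assuming a conda-based setup and reusing the environment name from the README template and the package versions pinned in `requirements.txt` later in this guide:

```yaml
# environment.yml
name: manuscript-env
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas=2.0.3
  - matplotlib=3.7.2
  - seaborn=0.12.2
  - scipy=1.11.1
  - pip
  - pip:
      - rxiv-maker==2.5.0
```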
### 3. GitHub Actions Integration
Automate manuscript building and validation:
```yaml
# .github/workflows/build-manuscript.yml
name: Build Manuscript

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install rxiv-maker

      - name: Validate manuscript
        run: rxiv validate

      - name: Build PDF
        run: rxiv pdf

      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: manuscript-pdf
          path: output/*.pdf
```
### 4. Automated Figure Generation
Create figures that regenerate automatically from source data:
```python
# FIGURES/fig_analysis.py
"""Automated figure generation with complete reproducibility."""

import hashlib
import json
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats

# Write outputs next to this script so the test suite can find them
FIG_DIR = Path(__file__).parent


def verify_data_integrity(filepath, expected_hash=None):
    """Verify data hasn't changed unexpectedly by checksumming it."""
    with open(filepath, 'rb') as f:
        file_hash = hashlib.sha256(f.read()).hexdigest()
    if expected_hash and file_hash != expected_hash:
        raise ValueError(f"Data integrity check failed for {filepath}")
    return file_hash


# Load and verify data
data_file = Path("data/experiment_results.csv")
data_hash = verify_data_integrity(data_file)
print(f"Data verified: {data_hash[:8]}...")

# Create reproducible figure
df = pd.read_csv(data_file)

plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='condition', y='response_time')
plt.title(f'Response Times by Condition (n={len(df)})')
plt.ylabel('Response Time (ms)')
plt.xlabel('Experimental Condition')

# Add a statistical annotation for the two-condition case
conditions = df['condition'].unique()
if len(conditions) == 2:
    group1 = df[df['condition'] == conditions[0]]['response_time']
    group2 = df[df['condition'] == conditions[1]]['response_time']
    stat, p_value = stats.ttest_ind(group1, group2)
    plt.text(0.5, 0.95, f'p = {p_value:.3f}',
             transform=plt.gca().transAxes, ha='center')

plt.tight_layout()
plt.savefig(FIG_DIR / 'fig_analysis.png', dpi=300, bbox_inches='tight')
plt.close()

# Log analysis metadata alongside the figure
metadata = {
    'data_hash': data_hash,
    'n_subjects': len(df),
    'conditions': list(conditions),
    'analysis_date': pd.Timestamp.now().isoformat(),
}
with open(FIG_DIR / 'fig_analysis_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
```
### 5. Data Pipeline Management
Create transparent data processing workflows:
```python
# FIGURES/data_pipeline.py
"""Reproducible data processing pipeline."""

import json
import logging

import pandas as pd
import yaml

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DataPipeline:
    def __init__(self, config_file="pipeline_config.yml"):
        self.config = self.load_config(config_file)
        self.steps = []

    @staticmethod
    def load_config(config_file):
        """Load pipeline settings from a YAML configuration file."""
        with open(config_file) as f:
            return yaml.safe_load(f) or {}

    def load_raw_data(self, filepath):
        """Load and validate raw data."""
        logger.info(f"Loading raw data from {filepath}")
        df = pd.read_csv(filepath)

        # Data validation
        required_columns = self.config.get('required_columns', [])
        missing_cols = set(required_columns) - set(df.columns)
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")

        logger.info(f"Loaded {len(df)} rows, {len(df.columns)} columns")
        self.steps.append(f"load_raw: {len(df)} rows")
        return df

    def clean_data(self, df):
        """Apply data cleaning procedures."""
        logger.info("Cleaning data...")
        initial_rows = len(df)

        # Remove duplicates
        df = df.drop_duplicates()

        # Handle missing values
        df = df.dropna(subset=self.config.get('required_fields', []))

        final_rows = len(df)
        logger.info(f"Cleaning: {initial_rows} → {final_rows} rows")
        self.steps.append(f"clean: {initial_rows} → {final_rows} rows")
        return df

    def generate_report(self):
        """Generate a processing report for the manuscript's audit trail."""
        return {
            'pipeline_steps': self.steps,
            'config': self.config,
            'timestamp': pd.Timestamp.now().isoformat(),
        }


# Execute pipeline
if __name__ == "__main__":
    pipeline = DataPipeline()

    # Process data
    raw_data = pipeline.load_raw_data("data/raw_experiment.csv")
    clean_data = pipeline.clean_data(raw_data)

    # Save processed data
    clean_data.to_csv("data/processed_experiment.csv", index=False)

    # Generate report
    report = pipeline.generate_report()
    with open("data/processing_report.json", "w") as f:
        json.dump(report, f, indent=2)
```
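The pipeline reads its validation rules from `pipeline_config.yml` (loaded above with PyYAML, which you would need to install). A minimal sketch of that file, with illustrative column names, might look like:

```yaml
# pipeline_config.yml
required_columns:
  - subject_id        # illustrative column names; adapt to your dataset
  - condition
  - response_time
required_fields:
  - condition
  - response_time
```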
## Collaboration Workflows

### Multi-Author Coordination
Manage collaborative writing with clear workflows:
```yaml
# .github/workflows/collaborative-review.yml
name: Collaborative Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Check for conflicts
        run: |
          git merge-tree $(git merge-base HEAD main) HEAD main

      - name: Validate changes
        run: |
          pip install rxiv-maker
          rxiv validate

      - name: Generate diff report
        run: |
          git diff main...HEAD --name-only > changed_files.txt
          echo "Changed files:" >> $GITHUB_STEP_SUMMARY
          cat changed_files.txt >> $GITHUB_STEP_SUMMARY

      - name: Build preview
        run: |
          rxiv pdf
          echo "Preview build completed" >> $GITHUB_STEP_SUMMARY
```
### Code Review Guidelines
Establish clear review processes:
- Content Reviews: Focus on scientific accuracy and clarity
- Technical Reviews: Verify code functionality and reproducibility
- Style Reviews: Ensure consistent formatting and citation style
- Data Reviews: Validate data processing and statistical analysis
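One lightweight way to make these expectations explicit is a pull request template that turns them into a reviewer checklist. This is a sketch of such a template, not an Rxiv-Maker feature:

```markdown
<!-- .github/PULL_REQUEST_TEMPLATE.md -->
## Review checklist
- [ ] Content: scientific accuracy and clarity
- [ ] Technical: figure scripts run and outputs regenerate
- [ ] Style: formatting and citation style are consistent
- [ ] Data: processing steps and statistical analysis validated
```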
## Quality Assurance

### Automated Testing
Test your manuscript components automatically:
```python
# tests/test_figures.py
"""Test suite for figure generation."""

import subprocess
from pathlib import Path

import matplotlib
import pandas as pd
import pytest

matplotlib.use('Agg')  # Non-interactive backend


class TestFigures:
    def test_data_loading(self):
        """Test that required data files exist and are valid."""
        data_path = Path("data/experiment_results.csv")
        assert data_path.exists(), "Missing experiment data file"

        df = pd.read_csv(data_path)
        assert len(df) > 0, "Empty data file"
        assert 'condition' in df.columns, "Missing condition column"
        assert 'response_time' in df.columns, "Missing response_time column"

    def test_figure_generation(self):
        """Test that figures generate without errors."""
        result = subprocess.run(['python', 'FIGURES/fig_analysis.py'],
                                capture_output=True, text=True)
        assert result.returncode == 0, f"Figure generation failed: {result.stderr}"

        # Check output files
        assert Path("FIGURES/fig_analysis.png").exists(), "Figure not generated"
        assert Path("FIGURES/fig_analysis_metadata.json").exists(), "Metadata not generated"

    def test_statistical_validity(self):
        """Test statistical analysis components."""
        df = pd.read_csv("data/experiment_results.csv")

        # Basic statistical checks
        assert df['response_time'].std() > 0, "No variance in response times"
        assert df['condition'].nunique() > 1, "Only one condition found"

        # Check for reasonable values
        rt = df['response_time']
        assert rt.min() > 0, "Negative response times found"
        assert rt.max() < 10000, "Unreasonably large response times"


if __name__ == "__main__":
    pytest.main([__file__, "-v"])
```
### Continuous Integration
Set up comprehensive CI/CD for manuscripts:
```yaml
# .github/workflows/full-validation.yml
name: Full Manuscript Validation

on:
  push:
    branches: [ main ]
  schedule:
    - cron: '0 0 * * 0'  # Weekly builds

jobs:
  validate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.9', '3.10', '3.11']
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest

      - name: Run tests
        run: pytest tests/ -v

      - name: Validate manuscript structure
        run: rxiv validate --strict

      - name: Build manuscript
        run: rxiv pdf

      - name: Check output quality
        run: |
          ls -la output/
          file output/*.pdf

      - name: Archive results
        uses: actions/upload-artifact@v3
        with:
          name: manuscript-${{ matrix.python-version }}
          path: output/
```
## Best Practices

### 1. Documentation Standards
Document everything for future reproducibility:
````markdown
# README.md for manuscript project

## Reproduction Instructions

1. **Environment Setup**:
   ```bash
   conda env create -f environment.yml
   conda activate manuscript-env
   ```

2. **Data Requirements**:
   - `data/raw_experiment.csv`: Raw experimental data (contact: [email protected])
   - Expected data format: [describe schema]
   - Data collection period: [dates]

3. **Building the Manuscript**:
   ```bash
   rxiv pdf
   ```

4. **Running Tests**:
   ```bash
   pytest tests/ -v
   ```

## File Organization

- `FIGURES/`: Automated figure generation scripts
- `data/`: Raw and processed data files
- `tests/`: Test suite for reproducibility
- `output/`: Generated manuscript files
- `.github/`: Automation workflows
````
### 2. Dependency Management
Pin all dependencies for long-term reproducibility:
```text
# requirements.txt
rxiv-maker==2.5.0
pandas==2.0.3
matplotlib==3.7.2
seaborn==0.12.2
scipy==1.11.1
numpy==1.24.3
pytest==7.4.0
```
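These pins can be captured from a working environment rather than written by hand. `pip freeze` snapshots everything installed, while pip-tools' `pip-compile` resolves pins from a shorter, hand-maintained `requirements.in`:

```bash
# Snapshot the exact versions from the current environment
pip freeze > requirements.txt

# Or resolve pins from a hand-maintained requirements.in (requires pip-tools)
pip-compile requirements.in -o requirements.txt
```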
### 3. Data Versioning
Track data changes with version control:
```bash
# Use Git LFS for large data files
git lfs track "data/*.csv"
git lfs track "data/*.xlsx"
git add .gitattributes

# Or use DVC for data versioning
dvc init
dvc add data/experiment_results.csv
git add data/experiment_results.csv.dvc
```
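If you choose DVC, the tracked files can also be pushed to shared storage so collaborators retrieve the exact data version referenced in Git; the remote URL below is a placeholder:

```bash
# Configure a shared DVC remote (placeholder URL) and upload the data
dvc remote add -d storage s3://my-lab-bucket/manuscript-data
dvc push

# Collaborators fetch the version referenced by the committed .dvc file
dvc pull
```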
### 4. Container Integration
Create fully isolated environments:
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /manuscript

# Install system dependencies (LaTeX for PDF builds)
RUN apt-get update && apt-get install -y \
    texlive-latex-extra \
    texlive-fonts-recommended \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy manuscript files
COPY . .

# Build manuscript
CMD ["rxiv", "pdf"]
```
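A typical way to use this image (the tag name is arbitrary) is to build it once and mount the output directory, so the generated PDF lands back on the host, assuming the build writes to `output/` as in the examples above:

```bash
# Build the image and run the manuscript build inside the container
docker build -t manuscript-build .
docker run --rm -v "$(pwd)/output:/manuscript/output" manuscript-build
```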
## Troubleshooting

### Common Issues
**Build failures in CI:**

- Check Python version compatibility
- Verify all dependencies are pinned
- Ensure data files are accessible

**Figure generation errors:**

- Test figure scripts locally first
- Check data file paths and formats
- Verify statistical analysis assumptions

**Collaboration conflicts:**

- Use clear branching strategies
- Establish code review processes
- Document contribution guidelines
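For the branching point in particular, one convention that keeps conflicts manageable is a short-lived branch per section or analysis, merged through a reviewed pull request; the branch name and commit message here are examples:

```bash
# One branch per section or analysis, merged via a reviewed pull request
git checkout -b revise-methods-section
git commit -am "Clarify statistical model in Methods"
git push -u origin revise-methods-section
gh pr create --fill
```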
### Performance Optimization
For large manuscripts or complex analyses:
```python
# Use caching for expensive computations
import pickle
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def expensive_analysis(data_hash):
    """Cache expensive analysis results in memory, keyed by the data checksum."""
    results = perform_analysis(data_hash)  # placeholder for the actual analysis
    return results


# Or use file-based caching that invalidates when the data file changes
def cached_analysis(data_file, cache_file):
    data_path, cache_path = Path(data_file), Path(cache_file)
    if cache_path.exists() and cache_path.stat().st_mtime > data_path.stat().st_mtime:
        with open(cache_path, 'rb') as f:
            return pickle.load(f)

    # Perform the analysis and store the result for next time
    results = perform_analysis(data_file)  # placeholder for the actual analysis
    with open(cache_path, 'wb') as f:
        pickle.dump(results, f)
    return results
```
## Summary
Reproducible workflows with Rxiv-Maker provide:
- Complete Automation: Every aspect of manuscript generation is automated
- Version Control Integration: Full Git workflows with CI/CD
- Collaborative Features: Team-friendly processes and review workflows
- Quality Assurance: Automated testing and validation
- Long-term Sustainability: Containerized, documented, and version-controlled
**Next Steps:**

- GitHub Actions Guide → for detailed CI/CD setup
- Docker Integration → for containerized workflows
- Collaboration Guide → for team processes
Perfect for research teams that need reliable, shareable, and long-lasting scientific manuscripts.