Tags: analytics, data-pipeline, etl, machine-learning, streamlit
30-second version: This is a full-stack fitness analytics platform that transforms 14 years of workout data into actionable intelligence. An AWS-based ETL pipeline ingests MapMyRun exports, ML models automatically classify workouts as runs vs. dog walks, and a multi-page Streamlit dashboard surfaces trends, anomalies, and personalized insights—all self-hosted for complete data ownership.
2-minute version
After a decade of tracking every run, I had data but no insights. The metrics lived in a fitness app I didn't control, in a format I couldn't query, with no way to answer questions like "how has my consistency changed over time?" or "what patterns predict my best performances?"

Then in June 2018, a chocolate lab puppy named Choco arrived. Suddenly my pristine running data became contaminated with daily dog walks, all logged as generic "walks" with no distinction from actual training runs. My fitness app couldn't tell the difference. Neither could I, months later, when trying to analyze my running progress.

This project started as a classification problem: build an ML model to distinguish real runs from pup walks based on pace, distance, and duration patterns. It evolved into a complete analytics platform:

- **Cloud ETL Pipeline**: Upload a CSV to S3, Lambda processes and enriches the data, RDS stores it for querying
- **ML Classification**: K-means clustering identifies workout types (real_run, pup_walk, mixed, outlier) with era-based smart defaults
- **Multi-Dimensional Analytics**: Consistency scoring, trend detection, anomaly identification, and forecasting
- **Interactive Dashboard**: Six specialized views including "The Choco Effect", a data story showing how one dog transformed my exercise patterns

The result? A 4x increase in workout frequency post-Choco, a complete shift in workout composition, and finally, a way to answer "am I actually more consistent now, or does it just feel that way?"

**Live at [workouts.barbhs.com](https://workouts.barbhs.com)**

The Problem: When a Puppy Breaks Your Data
I had 14 years of running data. Clean, consistent, trackable. Then Choco arrived.
| Before Choco (2011-2018) | After Choco (2018-Present) |
|---|---|
| ~4 workouts/month | ~16 workouts/month |
| 95% focused running | Mixed runs + daily dog walks |
| Clear activity labels | Everything labeled “Walk” |
| Easy to analyze | Impossible to distinguish |
The real problem wasn’t the data volume—it was data quality degradation. My fitness app treated a 3-mile training run the same as a 20-minute pup walk around the block. Trend analysis became meaningless. Progress tracking broke. I couldn’t answer basic questions:
- Am I actually running more or just walking the dog?
- Has my running pace improved or degraded?
- What does “consistency” even mean now?
The Deeper Friction
| Challenge | Impact |
|---|---|
| No workout classification | Can’t separate training from dog walks |
| Data locked in app | No custom queries or analysis |
| Manual exports only | No automated pipeline |
| Platform dependency | Years of data held hostage |
| No intelligence layer | Raw numbers, no insights |
The Solution: Full-Stack Fitness Intelligence
┌─────────────────────────────────────────────────────────────────────────┐
│ Cloud ETL Pipeline │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ MapMyRun │───▶│ S3 │───▶│ Lambda │───▶│ RDS │ │
│ │ CSV │ │ Bucket │ │ ETL │ │ MySQL │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ SNS │ │
│ │ Notification │ │
│ └──────────────┘ │
│ │ │
└────────────────────────────────────────┼────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Intelligence & Analytics Layer │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ K-Means │ │ Trend │ │ Anomaly │ │
│ │ Classifier │ │ Detection │ │ Detection │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └─────────────────┴─────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Consistency │ │
│ │ Analyzer │ │
│ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Streamlit Dashboard │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Main │ │ Choco │ │ Trends │ │ Calendar │ │
│ │ Dashboard │ │ Effect │ │ Analysis │ │ View │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ │
│ │ SQL │ │ Model │ │
│ │ Explorer │ │ Management │ │
│ └────────────┘ └────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
How It Works
Data Pipeline
| Stage | What Happens |
|---|---|
| Export | Download workout history CSV from MapMyRun |
| Upload | Move to S3 bucket (triggers Lambda) |
| Extract | Parse CSV, validate schema, extract workout IDs |
| Transform | Clean data, normalize formats, deduplicate by workout ID |
| Load | Insert new records to AWS RDS MySQL |
| Notify | SNS notification with row counts |
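
The Transform stage in the table above can be illustrated with a few small helpers. This is a minimal sketch, not the project's Lambda code: the URL shape (`/workout/<digits>`), the `HH:MM:SS` / `MM:SS` duration formats, and all function names are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed URL shape: the workout ID is the run of digits after "/workout/".
_WORKOUT_ID_RE = re.compile(r"/workout/(\d+)")


def extract_workout_id(url: str) -> Optional[str]:
    """Return the numeric workout ID embedded in a workout URL, or None."""
    match = _WORKOUT_ID_RE.search(url)
    return match.group(1) if match else None


def duration_to_seconds(text: str) -> int:
    """Convert 'HH:MM:SS' or 'MM:SS' (assumed formats) to whole seconds."""
    seconds = 0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def pace_min_per_mile(duration_s: int, distance_mi: float) -> Optional[float]:
    """Pace in decimal minutes per mile; None when distance is not positive."""
    if distance_mi <= 0:
        return None
    return (duration_s / 60) / distance_mi
```

Returning `None` for an unparseable URL or a zero distance leaves the decision about rejecting the row to the validation step rather than raising mid-transform.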
ETL Processing Details
The Lambda function handles incremental updates by tracking workout IDs:

```python
from typing import List, Tuple


class WorkoutProcessor:
    """Processes workout data and identifies new records."""

    def process_file(self, file_key: str, existing_ids: List[str]) -> Tuple[int, List[str]]:
        """Process new workout file and identify new records."""
        # self.storage and WorkoutDataValidator are defined elsewhere in the project
        df = self.storage.read_file(file_key)
        WorkoutDataValidator.validate_dataframe(df)

        # Keep only rows whose workout ID (extracted from the MapMyRun URL) is not already stored
        new_df = df[~df['workout_id'].isin(existing_ids)]
        new_workout_ids = new_df['workout_id'].tolist()
        return len(new_df), new_workout_ids
```

Key transformations:

- Extract workout_id from MapMyRun URL (used as primary key)
- Convert duration strings to seconds
- Normalize pace formats
- Validate required fields

ML Classification System
The intelligence service uses K-means clustering to automatically categorize workouts:
| Classification | Pace Range | Typical Pattern |
|---|---|---|
| real_run | 8-12 min/mi | Focused training sessions |
| pup_walk | 20-28 min/mi | Dog walking adventures |
| mixed | Variable | Combined activities |
| outlier | Extreme values | Data quality issues |
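
To make the clustering idea concrete, here is a dependency-free sketch of two-cluster k-means (Lloyd's algorithm) on pace alone. It is an illustration only: the architecture table lists scikit-learn for the real classifier, which clusters on pace, distance, and duration and also produces `mixed` and `outlier` labels. The function names here are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def kmeans_1d(values: List[float], iterations: int = 20) -> Tuple[List[float], List[int]]:
    """Two-cluster k-means (Lloyd's algorithm) on a single feature.

    Returns (centroids in ascending order, cluster index per value).
    Expects a non-empty list.
    """
    centroids = [min(values), max(values)]  # deterministic initialisation
    assignments = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k in (0, 1):
            members = [v for v, a in zip(values, assignments) if a == k]
            updated.append(mean(members) if members else centroids[k])
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, assignments


def label_by_pace(paces: List[float]) -> List[str]:
    """Name the faster cluster real_run and the slower one pup_walk."""
    _, assignments = kmeans_1d(paces)
    return ["real_run" if a == 0 else "pup_walk" for a in assignments]
```

Seeding the centroids at the minimum and maximum pace keeps the result deterministic and guarantees that cluster 0 is always the faster one, so no relabeling step is needed afterwards.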
Era-Based Smart Defaults
When clustering has insufficient data (<5 workouts), the system uses era-based defaults:

```python
from datetime import datetime

# The Choco Effect Date: June 1, 2018
choco_date = datetime(2018, 6, 1)

# Smart fallback hierarchy:
# 1. Primary: K-means ML clustering (requires ≥5 workouts)
# 2. Secondary: Era-based defaults (medium confidence: 0.5)
#    - Pre-Choco (before 2018-06-01): defaults to real_run
#    - Post-Choco (after 2018-06-01): defaults to pup_walk
# 3. Tertiary: Rule-based classification by pace thresholds
```

This leverages the behavioral pattern: before Choco, I primarily ran. After Choco, walks dominated.

Consistency Scoring
Multi-dimensional analysis across four factors:
| Dimension | What It Measures |
|---|---|
| Frequency | Workouts per time period |
| Timing | Regularity of workout days/times |
| Performance | Stability of pace and distance |
| Streaks | Consecutive active days/weeks |
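
Two of the dimensions in the table, Frequency and Streaks, reduce to simple date arithmetic. The sketch below shows one plausible way to compute them; the function names are hypothetical, and how the project weights the four dimensions into a single score is not covered here.

```python
from datetime import date, timedelta
from typing import Iterable


def longest_daily_streak(workout_dates: Iterable[date]) -> int:
    """Length of the longest run of consecutive calendar days with a workout."""
    days = sorted(set(workout_dates))  # several workouts in one day count once
    best = current = 0
    previous = None
    for day in days:
        if previous is not None and day - previous == timedelta(days=1):
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = day
    return best


def workouts_per_week(workout_dates: Iterable[date]) -> float:
    """Average workouts per week over the span from first to last workout."""
    dates = list(workout_dates)
    if not dates:
        return 0.0
    span_days = (max(dates) - min(dates)).days + 1
    return len(dates) / (span_days / 7)
```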
What Shipped
Dashboard Views
- AI Intelligence Hub: Daily briefing with personalized insights, recommendations, and performance summary
- The Choco Effect: Data story visualization showing pre/post transformation with interactive timeline
- Monthly Dashboard: Calendar view with workout density heatmap and weekly aggregations
- Trends Analysis: Statistical trend detection with confidence intervals and forecasting
- SQL Explorer: Direct database queries with syntax highlighting and result export
- Model Management: View and adjust ML classification parameters
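
The trend detection behind the Trends Analysis view can be pictured as a least-squares line through a metric over time. This is a minimal sketch under that assumption, with a hypothetical function name; it leaves out the confidence intervals and forecasting the view also provides.

```python
from typing import Sequence, Tuple


def linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of values against their index (0, 1, 2, ...).

    Returns (slope, intercept). Requires at least two values.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean
```

Fed monthly workout counts, a positive slope means frequency is rising by roughly that many workouts per month; fed pace values, a negative slope means getting faster.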
ML/AI Features
- K-Means Workout Classifier: Automatic categorization based on pace, distance, and duration clustering
- Trend Detection: Statistical analysis of performance over configurable time windows
- Anomaly Detection: IQR, z-score, and modified z-score methods for outlier identification
- Consistency Scoring: Multi-dimensional analysis of workout patterns
- Intelligence Briefing: Automated generation of personalized insights and recommendations
- Algorithm Transparency: Full visibility into how each AI feature makes decisions
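
The three anomaly detection methods named above are standard and short enough to sketch with the standard library. The thresholds used here (1.5 for IQR, 3.0 for z-score, 3.5 for modified z-score) are conventional defaults, not values confirmed from the project, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev, quantiles
from typing import List


def iqr_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


def zscore_outliers(values: List[float], threshold: float = 3.0) -> List[float]:
    """Values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def modified_zscore_outliers(values: List[float], threshold: float = 3.5) -> List[float]:
    """Median/MAD variant, less distorted by the outliers being hunted."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]
```

On a small sample, a single extreme value inflates the standard deviation enough to hide itself from the plain z-score test, which is one reason to offer the median-based variants alongside it.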
Data Infrastructure
- Cloud ETL Pipeline: S3 → Lambda → RDS with SNS notifications
- Environment-Aware Config: Automatic switching between dev (local MySQL) and prod (AWS RDS)
- Incremental Updates: Workout ID tracking prevents duplicate entries
- Data Validation: Schema enforcement and error handling at ingestion
- Export Capabilities: CSV/JSON export for external analysis
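
Environment-aware configuration usually comes down to one switch read at startup. The sketch below shows the general shape only: the `APP_ENV` and `RDS_HOST` variable names, the `workouts` database name, and the `DatabaseConfig` class are all hypothetical and not taken from the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DatabaseConfig:
    host: str
    port: int
    database: str


def load_database_config(env: Mapping[str, str] = os.environ) -> DatabaseConfig:
    """Pick dev or prod settings from a hypothetical APP_ENV variable."""
    if env.get("APP_ENV", "dev") == "prod":
        # Prod: the host comes from the environment, so no RDS endpoint is hard-coded.
        return DatabaseConfig(host=env["RDS_HOST"], port=3306, database="workouts")
    # Dev: local MySQL on the default port.
    return DatabaseConfig(host="localhost", port=3306, database="workouts")
```

Passing the environment in as a mapping keeps the function easy to test without touching real environment variables.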
Developer Experience
- Development Mode: Auth bypass for testing (`STREAMLIT_DEV_MODE=true`)
- MkDocs Documentation: Comprehensive user and developer guides
- Pytest Suite: Risk-based testing with comprehensive mocking
- Systemd Deployment: Production deployment with service management
The Choco Effect: A Data Story
The centerpiece of this project is “The Choco Effect” dashboard—a portfolio-quality demonstration of how one dog transformed 14 years of fitness data.
The Transformation
| Metric | Pre-Choco (7 years) | Post-Choco (6.5 years) | Change |
|---|---|---|---|
| Total Workouts | ~340 | ~1,250 | 3.7x |
| Workouts/Month | ~4 | ~16 | 4x |
| Avg Pace | ~9 min/mi | ~18 min/mi | Slower (walks!) |
| Avg Distance | ~4 mi | ~2 mi | Shorter, more frequent |
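
The Change column follows directly from the approximate totals and period lengths in the table. A quick check of the arithmetic, using only those rounded figures:

```python
def per_month(total_workouts: float, years: float) -> float:
    """Average workouts per month over a period given in years."""
    return total_workouts / (years * 12)


# Approximate totals and period lengths taken from the table above.
pre_rate = per_month(340, 7)       # about 4 workouts/month
post_rate = per_month(1250, 6.5)   # about 16 workouts/month

total_ratio = 1250 / 340           # about 3.7x more workouts overall
rate_ratio = post_rate / pre_rate  # about 4x more workouts per month
```

The two ratios differ slightly because the post-Choco period is shorter, so the same total increase is packed into fewer months.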
What The Data Shows
The transformation timeline reveals:
- Dramatic frequency increase: From sporadic runner to daily walker
- Pace bifurcation: Clear separation between run pace and walk pace clusters
- Consistency improvement: More regular, smaller efforts vs. infrequent big efforts
- Activity composition shift: From 95% runs to 70% walks + 30% runs
The Choco Effect Visualization
The dashboard creates an interactive timeline showing:

- Monthly workout frequency (bar chart with pre/post coloring)
- Average pace over time (line chart showing the bifurcation)
- Vertical marker at June 2018: "Choco Arrives"
- Before/after comparison cards with key metrics
- AI classification breakdown (pie chart)
- Sample classifications with confidence scores

Architecture
| Layer | Components | Technology |
|---|---|---|
| Data Ingestion | ETL pipeline, validation | AWS Lambda, S3, SNS |
| Storage | Workout database | AWS RDS (MySQL) |
| Intelligence | Classification, analytics | scikit-learn, scipy |
| Visualization | Interactive charts | Plotly, Streamlit |
| UI | Multi-page dashboard | Streamlit 1.44+ |
| Infrastructure | Deployment, config | Docker, systemd, TOML |
Why Streamlit?
Streamlit was chosen for rapid data app development:

- **Native Pandas/Plotly integration** without custom API layers
- **Built-in session state** for multi-step workflows
- **Multi-page navigation** with minimal configuration
- **Caching decorators** for performance optimization
- **Single Python codebase** simplifies deployment

Trade-offs accepted: Less UI customization than React, Streamlit-specific session patterns.

Project Structure
```
src/
├── config/                         # Environment-aware configuration
│   ├── database.py                 # Database connection settings
│   ├── app.py                      # Application configuration
│   └── logging_config.py
├── services/                       # Business logic layer
│   ├── database_service.py         # Centralized DB operations
│   └── intelligence_service.py     # AI/ML engine
├── utils/                          # Analytics utilities
│   ├── statistics.py               # Statistical analysis
│   ├── consistency_analyzer.py     # Multi-dimensional scoring
│   └── data_filters.py             # Shared filtering logic
├── views/                          # Streamlit pages
│   ├── dash.py                     # Monthly dashboard
│   ├── choco_effect.py             # The Choco Effect story
│   ├── fitness-overview.py         # SQL query interface
│   ├── login.py                    # Authentication
│   └── tools/                      # Analysis tools
│       ├── trends.py               # Statistical trends
│       ├── history.py              # Workout history
│       └── mapping.py              # Geographic viz
└── streamlit_app.py                # Main entry point
```

Implementation Notes
Performance Benchmarks
| Operation | Target | Actual |
|---|---|---|
| AI Classification (1K+ workouts) | <5s | ~3s |
| Intelligence Brief generation | <3s | ~2s |
| Algorithm Transparency loading | <3s | ~1s |
| Page load (all views) | <2s | <1.5s |
Key Design Decisions
- Workout ID as primary key: Extracted from MapMyRun URLs for deduplication
- Era-based classification defaults: Leverages known behavioral shift at Choco arrival
- Unified data filtering: Shared utilities prevent inconsistencies between views
- Risk-based testing: Focus coverage on high-risk paths (core flows, session state, API calls)
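
The era-based defaults among these design decisions can be sketched as a small fallback function for workouts that clustering cannot handle. Several details are assumptions made for illustration: the day of June 1, 2018 itself is treated as post-Choco, the pace rule applies only when no date is available, paces between the two ranges in the pace table map to `mixed`, anything else maps to `outlier`, and the rule-based tier reports no confidence because the write-up gives none.

```python
from datetime import datetime
from typing import Optional, Tuple

CHOCO_DATE = datetime(2018, 6, 1)
MIN_WORKOUTS_FOR_CLUSTERING = 5


def use_clustering(n_workouts: int) -> bool:
    """K-means is the primary path once enough workouts exist."""
    return n_workouts >= MIN_WORKOUTS_FOR_CLUSTERING


def fallback_classification(
    workout_date: Optional[datetime],
    pace_min_per_mile: Optional[float],
) -> Tuple[str, Optional[float]]:
    """Classify one workout when clustering is unavailable.

    Returns (label, confidence); confidence is None for the rule-based tier.
    """
    if workout_date is not None:
        # Secondary tier: era-based default at the stated medium confidence.
        label = "real_run" if workout_date < CHOCO_DATE else "pup_walk"
        return label, 0.5
    if pace_min_per_mile is not None:
        # Tertiary tier: thresholds borrowed from the pace-range table.
        if 8 <= pace_min_per_mile <= 12:
            return "real_run", None
        if 20 <= pace_min_per_mile <= 28:
            return "pup_walk", None
        if 12 < pace_min_per_mile < 20:
            return "mixed", None  # assumed mapping for in-between paces
    return "outlier", None  # assumed catch-all for extreme or missing values
```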
Defensive Programming Patterns
The codebase emphasizes robustness:
```python
# Type conversion with fallback
if isinstance(week_num, str):
    try:
        week_num = int(week_num)
    except ValueError:
        week_num = 1  # Safe fallback

# Division by zero prevention
if pre_choco_freq == 0:
    transformation_text = "∞x increase - started from zero!"
else:
    transformation_factor = post_choco_freq / pre_choco_freq
```
What’s Next
Active Development
- Integration with Apple Health or Garmin APIs for automated ingestion
- Weather data enrichment for performance correlation
- Geographic visualization of workout routes
Future Enhancements
- Mobile-responsive dashboard improvements
- Goal setting and progress tracking features
- Community data sharing (anonymized patterns)
- Custom ML model training on personal data
Links
- Live Dashboard: [workouts.barbhs.com](https://workouts.barbhs.com)
- View Repository
- Read Documentation
Related
- The Choco Effect Data Story — Deep dive into the before/after analysis
- learn_pytest — Testing patterns developed for this project