# DVC Study

**Repository Path**: charlize/dvc-study

## Basic Information

- **Project Name**: DVC Study
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-11
- **Last Updated**: 2025-11-11

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# DVC Tutorial: Complete Data Version Control Workflow

This tutorial demonstrates essential DVC commands and workflows for machine learning projects using a simple classification example.

## Prerequisites

- Python 3.8+
- Git
- DVC (`pip install dvc` or follow the [official installation guide](https://dvc.org/doc/install))
- uv (for dependency management)

## Project Setup

### 1. Initialize Git and DVC

```bash
# Initialize Git repository
git init

# Initialize DVC project
dvc init

# Commit DVC initialization
git add .
git commit -m "Initialize DVC project"
```

### 2. Set Up Dependencies with uv

Create `pyproject.toml`:

```toml
[project]
name = "dvc-tutorial"
version = "0.1.0"
description = "A simple DVC tutorial project for machine learning"
authors = [
    {name = "Tutorial User", email = "user@example.com"}
]
dependencies = [
    "pandas",
    "numpy",
    "scikit-learn",
    "joblib"
]
requires-python = ">=3.8"
```

Install dependencies:

```bash
uv sync
```

### 3. Create Project Files

**`train_model.py`** - Main ML script:

```python
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib
import argparse


def generate_data(output_file="data.csv", n_samples=1000, n_features=20):
    """Generate synthetic classification data."""
    print(f"Generating {n_samples} samples with {n_features} features...")

    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=15,
        n_redundant=5,
        random_state=42
    )

    # Convert to DataFrame
    feature_names = [f"feature_{i}" for i in range(n_features)]
    df = pd.DataFrame(X, columns=feature_names)
    df['target'] = y

    df.to_csv(output_file, index=False)
    print(f"Data saved to {output_file}")
    print(f"Shape: {df.shape}")


def train_model(input_file="data.csv", model_output="model.joblib"):
    """Train a Random Forest classifier."""
    print(f"Training model using {input_file}...")

    # Load data
    df = pd.read_csv(input_file)
    X = df.drop('target', axis=1)
    y = df['target']

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    print("Model trained successfully!")
    print(f"Accuracy: {accuracy:.4f}")

    # Save model
    joblib.dump(model, model_output)
    print(f"Model saved to {model_output}")

    # Save metrics
    with open("metrics.txt", "w") as f:
        f.write(f"Accuracy: {accuracy:.4f}\n")
        f.write("Classification Report:\n")
        f.write(classification_report(y_test, y_pred))

    return accuracy


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Data generation and ML model training")
    parser.add_argument("--mode", choices=["generate", "train", "all"], default="all",
                        help="What to do: generate data, train model, or both")
    parser.add_argument("--data-file", default="data.csv", help="Output/input data file")
    parser.add_argument("--model-file", default="model.joblib", help="Output model file")
    parser.add_argument("--n-samples", type=int, default=1000, help="Number of samples to generate")

    args = parser.parse_args()

    if args.mode in ["generate", "all"]:
        generate_data(args.data_file, args.n_samples)

    if args.mode in ["train", "all"]:
        train_model(args.data_file, args.model_file)
```

**`params.yaml`** - Configuration parameters:

```yaml
# Data generation parameters
n_samples: 1000
n_features: 20

# Model training parameters
test_size: 0.2
n_estimators: 100
random_state: 42
```
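As written, `train_model.py` hard-codes `test_size`, `n_estimators`, and `random_state` even though `params.yaml` declares them (and `n_features` is declared but never passed to the script). A minimal sketch of reading the values from `params.yaml` instead, as an optional refinement; it assumes `pyyaml` is added to the project dependencies:

```python
# Sketch: load hyperparameters from params.yaml instead of hard-coding them.
# Assumes "pyyaml" has been added to the dependencies in pyproject.toml.
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# These values could then replace the constants inside train_model(), e.g.:
#   train_test_split(..., test_size=params["test_size"], random_state=params["random_state"])
#   RandomForestClassifier(n_estimators=params["n_estimators"], random_state=params["random_state"])
print(params)
```

Either way, the `params:` entries in the `dvc.yaml` below are what make DVC re-run the affected stages when these values change.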
**`dvc.yaml`** - Pipeline definition:

```yaml
stages:
  generate_data:
    cmd: uv run python train_model.py --mode generate --data-file data.csv --n-samples ${n_samples}
    deps:
      - train_model.py
      - params.yaml
    params:
      - n_samples
      - n_features
    outs:
      - data.csv

  train_model:
    cmd: uv run python train_model.py --mode train --data-file data.csv --model-file model.joblib
    deps:
      - train_model.py
      - data.csv
      - params.yaml
    params:
      - test_size
      - n_estimators
      - random_state
    metrics:
      - metrics.txt:
          cache: false
    outs:
      - model.joblib
```

## DVC Commands Walkthrough

### Command 1: `dvc repro` - Run Pipeline

**Purpose**: Execute pipeline stages automatically based on dependencies and changes.

**Command**:

```bash
dvc repro
```

**Expected Output**:

```
Running stage 'generate_data':
> uv run python train_model.py --mode generate --data-file data.csv --n-samples 1000
Generating 1000 samples with 20 features...
Data saved to data.csv
Shape: (1000, 21)

Running stage 'train_model':
> uv run python train_model.py --mode train --data-file data.csv --model-file model.joblib
Training model using data.csv...
Model trained successfully!
Accuracy: 0.9000
Model saved to model.joblib
```

**Validation**:

```bash
# Check that files were created
ls -la data.csv model.joblib metrics.txt

# View metrics
cat metrics.txt
```

### Command 2: `dvc status` - Check Status

**Purpose**: Check which files have changed compared to the last DVC commit.

**Command**:

```bash
dvc status
```

**Expected Output** (when up-to-date):

```
Pipeline is up to date.
```

### Command 3: Parameter Changes and Re-running the Pipeline

**Purpose**: Demonstrate how DVC automatically detects changes and re-runs affected stages.

**Steps**:

1. **Change a parameter**:

```bash
# Edit params.yaml to use more samples
sed -i 's/n_samples: 1000/n_samples: 2000/' params.yaml
```

2. **Re-run pipeline**:

```bash
dvc repro
```

**Expected Output**:

```
Running stage 'generate_data':
> uv run python train_model.py --mode generate --data-file data.csv --n-samples 2000
Generating 2000 samples with 20 features...

Running stage 'train_model':
> uv run python train_model.py --mode train --data-file data.csv --model-file model.joblib
Model trained successfully!
Accuracy: 0.9175
```

3. **Check improved metrics**:

```bash
cat metrics.txt
```

### Command 4: `dvc add` - Track Manual Files

**Purpose**: Track files that were created outside the DVC pipeline.

**Steps**:

1. **Create external dataset**:

```bash
uv run python -c "
import pandas as pd
import numpy as np

# Create validation dataset
X_val = np.random.randn(100, 20)
y_val = np.random.randint(0, 2, 100)

feature_names = [f'feature_{i}' for i in range(20)]
df_val = pd.DataFrame(X_val, columns=feature_names)
df_val['target'] = y_val

df_val.to_csv('validation_data.csv', index=False)
print(f'Validation data created: {df_val.shape}')
"
```

2. **Track with DVC**:

```bash
dvc add validation_data.csv
```

**Expected Output**:

```
To track the changes with git, run:
    git add validation_data.csv.dvc .gitignore
```

3. **Commit to Git**:

```bash
git add validation_data.csv.dvc .gitignore
git commit -m "Add validation dataset"
```
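With `validation_data.csv` now under DVC control, you might sanity-check the trained model against it. The following is a minimal sketch, assuming a hypothetical `evaluate.py` run in a workspace that already contains `model.joblib` and `validation_data.csv`; because the labels above are random, expect accuracy near chance:

```python
# Sketch: evaluate the trained model on the DVC-tracked validation set.
# evaluate.py is hypothetical and not part of the pipeline defined above.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

model = joblib.load("model.joblib")
df_val = pd.read_csv("validation_data.csv")
X_val = df_val.drop("target", axis=1)
y_val = df_val["target"]

print(f"Validation accuracy: {accuracy_score(y_val, model.predict(X_val)):.4f}")
```

Run it with `uv run python evaluate.py` after `dvc repro` has produced the model.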
### Command 5: `dvc commit` - Manual File Changes

**Purpose**: Commit manual changes to tracked files when dependencies haven't changed.

**Steps**:

1. **Manually modify a tracked file**:

```bash
echo "Manual annotation: Model validated on 100 samples" >> metrics.txt
```

2. **Check DVC status**:

```bash
dvc status
```

**Expected Output**:

```
train_model:
    changed outs:
        modified: metrics.txt
```

3. **Commit the manual change**:

```bash
dvc commit -f
```

**Expected Output**:

```
Updating lock file 'dvc.lock'
```

### Command 6: `dvc push` - Push to Remote Storage

**Purpose**: Push large files to remote storage (so they're not stored in Git).

**Steps**:

1. **Set up remote storage**:

```bash
# Create local directory to simulate remote storage
mkdir -p ../dvc-remote-storage

# Add as DVC remote
dvc remote add -d myremote ../dvc-remote-storage
```

2. **Push data files**:

```bash
dvc push
```

**Expected Output**:

```
2 files pushed
```

3. **Commit remote configuration**:

```bash
git add .dvc/config
git commit -m "Add remote storage configuration"
```

### Command 7: `dvc pull` - Pull from Remote Storage

**Purpose**: Retrieve large files from remote storage when working on different machines.

**Steps**:

1. **Remove large files**:

```bash
rm data.csv model.joblib validation_data.csv
ls -la *.csv *.joblib  # Should show no files
```

2. **Pull from remote**:

```bash
dvc pull
```

**Expected Output**:

```
A       data.csv
A       model.joblib
A       validation_data.csv
3 files added
```

3. **Verify files are restored**:

```bash
ls -la *.csv *.joblib  # Should show restored files
```

## Additional Useful DVC Commands

### `dvc dag` - Visualize Pipeline

```bash
dvc dag
```

### `dvc metrics show` - View Metrics

```bash
dvc metrics show
```

### `dvc params diff` - Compare Parameters

```bash
# Show parameters that differ from the last committed version
dvc params diff
```

### `dvc gc` - Clean Up Cache

```bash
# Remove cached files not referenced in the current workspace
dvc gc --workspace
```

### `dvc diff` - Compare Experiments

```bash
# Compare current workspace with Git HEAD
dvc diff HEAD
```

## Common Workflows

### Workflow 1: Start New Experiment

```bash
# Modify parameters or code
vim params.yaml  # or vim train_model.py

# Run pipeline
dvc repro

# Check results
dvc metrics show
cat metrics.txt

# Commit changes
git add .
git commit -m "Experiment: improved model with more data"
```

### Workflow 2: Share with Team

```bash
# Push code to Git
git push origin main

# Push data to remote storage
dvc push
```

### Workflow 3: Clone and Setup on New Machine

```bash
# Clone repository
git clone <repository-url>
cd dvc-tutorial

# Install dependencies
uv sync

# Pull large data files
dvc pull

# Run pipeline
dvc repro
```
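A note on the metrics used in the workflows above: `metrics.txt` is free-form text, and DVC's metrics commands (`dvc metrics show`, `dvc metrics diff`) work best with structured formats such as JSON or YAML. A minimal sketch of writing a structured file as well; `metrics.json` is a hypothetical extra output that would also need to be listed under the `metrics:` section of `dvc.yaml`:

```python
# Sketch: write structured metrics alongside metrics.txt so that
# `dvc metrics show` / `dvc metrics diff` can parse and compare them.
# metrics.json is hypothetical and not part of the pipeline defined above.
import json

def save_metrics_json(accuracy, path="metrics.json"):
    with open(path, "w") as f:
        json.dump({"accuracy": round(accuracy, 4)}, f, indent=2)

# Call at the end of train_model(), e.g. save_metrics_json(accuracy).
```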
## Key Concepts Summary

1. **Pipeline Stages**: Defined in `dvc.yaml` with dependencies, commands, and outputs
2. **Parameters**: Track configuration in `params.yaml`
3. **Metrics**: Automatically track model performance
4. **Remote Storage**: Large files stored separately from Git
5. **Reproducibility**: Changes automatically trigger appropriate stage re-runs

## Troubleshooting

### Issues with `dvc repro`:

- Check dependencies with `dvc status`
- Use `dvc repro --force` to force re-run all stages
- Verify command syntax in `dvc.yaml`

### Issues with `dvc push`/`pull`:

- Check remote configuration: `dvc remote list`
- Ensure remote directory exists and has proper permissions
- Verify network connectivity for cloud remotes

### Large Files in Git:

- Ensure `.gitignore` excludes large files (DVC adds entries automatically for files it tracks)
- Check that all large files are tracked with `dvc add` or produced as pipeline `outs`
- Use `git status` to verify no large files are untracked

## Next Steps

1. Try adding more stages to the pipeline (data preprocessing, feature engineering, etc.); see the sketch at the end of this document
2. Experiment with different remote storage options (S3, Google Cloud, Azure)
3. Integrate with CI/CD for automated pipeline execution
4. Explore DVC's experiment tracking features
5. Try DVC Studio for visualization and collaboration

This tutorial provides a complete foundation for using DVC in real-world machine learning projects!
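As a concrete starting point for the first of the next steps above (adding a preprocessing stage), here is a hypothetical `preprocess.py` sketch; the file names, the scaling choice, and the stage itself are assumptions rather than part of the tutorial pipeline:

```python
# Hypothetical preprocess.py for an extra pipeline stage: standardize the
# generated features before training. Not part of the dvc.yaml defined above.
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(input_file="data.csv", output_file="processed_data.csv"):
    df = pd.read_csv(input_file)
    features = df.drop("target", axis=1)

    scaler = StandardScaler()
    scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
    scaled["target"] = df["target"].values

    scaled.to_csv(output_file, index=False)
    print(f"Processed data saved to {output_file}")


if __name__ == "__main__":
    preprocess()
```

Wiring it in would mean adding a stage to `dvc.yaml` whose `deps` include `data.csv` and `preprocess.py`, whose `outs` include `processed_data.csv`, and pointing the `train_model` stage at `processed_data.csv` instead of `data.csv`.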