# DVC Study

**Repository Path**: charlize/dvc-study

## Basic Information

- **Project Name**: DVC Study
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-11
- **Last Updated**: 2025-11-11

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# DVC Tutorial: Complete Data Version Control Workflow

This tutorial demonstrates essential DVC commands and workflows for machine learning projects using a simple classification example.

## Prerequisites

- Python 3.8+
- Git
- DVC (`pip install dvc` or follow the [official installation guide](https://dvc.org/doc/install))
- uv (for dependency management)

## Project Setup

### 1. Initialize Git and DVC

```bash
# Initialize Git repository
git init

# Initialize DVC project
dvc init

# Commit DVC initialization
git add .
git commit -m "Initialize DVC project"
```

### 2. Set Up Dependencies with uv

Create `pyproject.toml`:

```toml
[project]
name = "dvc-tutorial"
version = "0.1.0"
description = "A simple DVC tutorial project for machine learning"
authors = [
    {name = "Tutorial User", email = "user@example.com"}
]
dependencies = [
    "pandas",
    "numpy",
    "scikit-learn",
    "joblib"
]
requires-python = ">=3.8"
```

Install dependencies:

```bash
uv sync
```

### 3. Create Project Files

**`train_model.py`** - Main ML script:

```python
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib
import argparse


def generate_data(output_file="data.csv", n_samples=1000, n_features=20):
    """Generate synthetic classification data."""
    print(f"Generating {n_samples} samples with {n_features} features...")

    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=15,
        n_redundant=5,
        random_state=42
    )

    # Convert to DataFrame
    feature_names = [f"feature_{i}" for i in range(n_features)]
    df = pd.DataFrame(X, columns=feature_names)
    df['target'] = y

    df.to_csv(output_file, index=False)
    print(f"Data saved to {output_file}")
    print(f"Shape: {df.shape}")


def train_model(input_file="data.csv", model_output="model.joblib"):
    """Train a Random Forest classifier."""
    print(f"Training model using {input_file}...")

    # Load data
    df = pd.read_csv(input_file)
    X = df.drop('target', axis=1)
    y = df['target']

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    print("Model trained successfully!")
    print(f"Accuracy: {accuracy:.4f}")

    # Save model
    joblib.dump(model, model_output)
    print(f"Model saved to {model_output}")

    # Save metrics
    with open("metrics.txt", "w") as f:
        f.write(f"Accuracy: {accuracy:.4f}\n")
        f.write("Classification Report:\n")
        f.write(classification_report(y_test, y_pred))

    return accuracy


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Data generation and ML model training")
    parser.add_argument("--mode", choices=["generate", "train", "all"], default="all",
                        help="What to do: generate data, train model, or both")
    parser.add_argument("--data-file", default="data.csv", help="Output/input data file")
    parser.add_argument("--model-file", default="model.joblib", help="Output model file")
    parser.add_argument("--n-samples", type=int, default=1000, help="Number of samples to generate")

    args = parser.parse_args()

    if args.mode in ["generate", "all"]:
        generate_data(args.data_file, args.n_samples)

    if args.mode in ["train", "all"]:
        train_model(args.data_file, args.model_file)
```

**`params.yaml`** - Configuration parameters:

```yaml
# Data generation parameters
n_samples: 1000
n_features: 20

# Model training parameters
test_size: 0.2
n_estimators: 100
random_state: 42
```
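As written, `train_model.py` hard-codes `test_size`, `n_estimators`, and `random_state` even though `params.yaml` declares them (and `n_features` is declared but never passed to the script). A minimal sketch of reading the values from `params.yaml` instead, as an optional refinement; it assumes `pyyaml` is added to the project dependencies:

```python
# Sketch: load hyperparameters from params.yaml instead of hard-coding them.
# Assumes "pyyaml" has been added to the dependencies in pyproject.toml.
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# These values could then replace the constants inside train_model(), e.g.:
#   train_test_split(..., test_size=params["test_size"], random_state=params["random_state"])
#   RandomForestClassifier(n_estimators=params["n_estimators"], random_state=params["random_state"])
print(params)
```

Either way, the `params:` entries in the `dvc.yaml` below are what make DVC re-run the affected stages when these values change.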
**`dvc.yaml`** - Pipeline definition:

```yaml
stages:
  generate_data:
    cmd: uv run python train_model.py --mode generate --data-file data.csv --n-samples ${n_samples}
    deps:
      - train_model.py
      - params.yaml
    params:
      - n_samples
      - n_features
    outs:
      - data.csv

  train_model:
    cmd: uv run python train_model.py --mode train --data-file data.csv --model-file model.joblib
    deps:
      - train_model.py
      - data.csv
      - params.yaml
    params:
      - test_size
      - n_estimators
      - random_state
    metrics:
      - metrics.txt:
          cache: false
    outs:
      - model.joblib
```

## DVC Commands Walkthrough

### Command 1: `dvc repro` - Run Pipeline

**Purpose**: Execute pipeline stages automatically based on dependencies and changes.

**Command**:

```bash
dvc repro
```

**Expected Output**:

```
Running stage 'generate_data':
> uv run python train_model.py --mode generate --data-file data.csv --n-samples 1000
Generating 1000 samples with 20 features...
Data saved to data.csv
Shape: (1000, 21)

Running stage 'train_model':
> uv run python train_model.py --mode train --data-file data.csv --model-file model.joblib
Training model using data.csv...
Model trained successfully!
Accuracy: 0.9000
Model saved to model.joblib
```

**Validation**:

```bash
# Check that files were created
ls -la data.csv model.joblib metrics.txt

# View metrics
cat metrics.txt
```

### Command 2: `dvc status` - Check Status

**Purpose**: Check which files have changed compared to the last DVC commit.

**Command**:

```bash
dvc status
```

**Expected Output** (when up-to-date):

```
Pipeline is up to date.
```

### Command 3: Parameter Changes and Re-running the Pipeline

**Purpose**: Demonstrate how DVC automatically detects changes and re-runs affected stages.

**Steps**:

1. **Change a parameter**:

```bash
# Edit params.yaml to use more samples
sed -i 's/n_samples: 1000/n_samples: 2000/' params.yaml
```

2. **Re-run pipeline**:

```bash
dvc repro
```

**Expected Output**:

```
Running stage 'generate_data':
> uv run python train_model.py --mode generate --data-file data.csv --n-samples 2000
Generating 2000 samples with 20 features...

Running stage 'train_model':
> uv run python train_model.py --mode train --data-file data.csv --model-file model.joblib
Model trained successfully!
Accuracy: 0.9175
```

3. **Check improved metrics**:

```bash
cat metrics.txt
```

### Command 4: `dvc add` - Track Manual Files

**Purpose**: Track files that were created outside the DVC pipeline.

**Steps**:

1. **Create external dataset**:

```bash
uv run python -c "
import pandas as pd
import numpy as np

# Create validation dataset
X_val = np.random.randn(100, 20)
y_val = np.random.randint(0, 2, 100)

feature_names = [f'feature_{i}' for i in range(20)]
df_val = pd.DataFrame(X_val, columns=feature_names)
df_val['target'] = y_val

df_val.to_csv('validation_data.csv', index=False)
print(f'Validation data created: {df_val.shape}')
"
```

2. **Track with DVC**:

```bash
dvc add validation_data.csv
```

**Expected Output**:

```
To track the changes with git, run:
    git add validation_data.csv.dvc .gitignore
```

3. **Commit to Git**:

```bash
git add validation_data.csv.dvc .gitignore
git commit -m "Add validation dataset"
```
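With `validation_data.csv` now under DVC control, you might sanity-check the trained model against it. The following is a minimal sketch, assuming a hypothetical `evaluate.py` run in a workspace that already contains `model.joblib` and `validation_data.csv`; because the labels above are random, expect accuracy near chance:

```python
# Sketch: evaluate the trained model on the DVC-tracked validation set.
# evaluate.py is hypothetical and not part of the pipeline defined above.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

model = joblib.load("model.joblib")
df_val = pd.read_csv("validation_data.csv")
X_val = df_val.drop("target", axis=1)
y_val = df_val["target"]

print(f"Validation accuracy: {accuracy_score(y_val, model.predict(X_val)):.4f}")
```

Run it with `uv run python evaluate.py` after `dvc repro` has produced the model.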
### Command 5: `dvc commit` - Manual File Changes

**Purpose**: Commit manual changes to tracked files when dependencies haven't changed.

**Steps**:

1. **Manually modify a tracked file**:

```bash
echo "Manual annotation: Model validated on 100 samples" >> metrics.txt
```

2. **Check DVC status**:

```bash
dvc status
```

**Expected Output**:

```
train_model:
    changed outs:
        modified: metrics.txt
```

3. **Commit the manual change**:

```bash
dvc commit -f
```

**Expected Output**:

```
Updating lock file 'dvc.lock'
```

### Command 6: `dvc push` - Push to Remote Storage

**Purpose**: Push large files to remote storage (so they're not stored in Git).

**Steps**:

1. **Set up remote storage**:

```bash
# Create local directory to simulate remote storage
mkdir -p ../dvc-remote-storage

# Add as DVC remote
dvc remote add -d myremote ../dvc-remote-storage
```

2. **Push data files**:

```bash
dvc push
```

**Expected Output**:

```
2 files pushed
```

3. **Commit remote configuration**:

```bash
git add .dvc/config
git commit -m "Add remote storage configuration"
```

### Command 7: `dvc pull` - Pull from Remote Storage

**Purpose**: Retrieve large files from remote storage when working on different machines.

**Steps**:

1. **Remove large files**:

```bash
rm data.csv model.joblib validation_data.csv
ls -la *.csv *.joblib  # Should show no files
```

2. **Pull from remote**:

```bash
dvc pull
```

**Expected Output**:

```
A       data.csv
A       model.joblib
A       validation_data.csv
3 files added
```

3. **Verify files are restored**:

```bash
ls -la *.csv *.joblib  # Should show restored files
```

## Additional Useful DVC Commands

### `dvc dag` - Visualize Pipeline

```bash
dvc dag
```

### `dvc metrics show` - View Metrics

```bash
dvc metrics show
```

### `dvc params diff` - Compare Parameters

```bash
# Show parameters that differ from the last committed version
dvc params diff
```

### `dvc gc` - Clean Up Cache

```bash
# Remove cached files not referenced in the current workspace
dvc gc --workspace
```

### `dvc diff` - Compare Experiments

```bash
# Compare current workspace with Git HEAD
dvc diff HEAD
```

## Common Workflows

### Workflow 1: Start New Experiment

```bash
# Modify parameters or code
vim params.yaml  # or vim train_model.py

# Run pipeline
dvc repro

# Check results
dvc metrics show
cat metrics.txt

# Commit changes
git add .
git commit -m "Experiment: improved model with more data"
```

### Workflow 2: Share with Team

```bash
# Push code to Git
git push origin main

# Push data to remote storage
dvc push
```

### Workflow 3: Clone and Setup on New Machine

```bash
# Clone repository
git clone <repository-url>
cd dvc-tutorial

# Install dependencies
uv sync

# Pull large data files
dvc pull

# Run pipeline
dvc repro
```
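A note on the metrics used in the workflows above: `metrics.txt` is free-form text, and DVC's metrics commands (`dvc metrics show`, `dvc metrics diff`) work best with structured formats such as JSON or YAML. A minimal sketch of writing a structured file as well; `metrics.json` is a hypothetical extra output that would also need to be listed under the `metrics:` section of `dvc.yaml`:

```python
# Sketch: write structured metrics alongside metrics.txt so that
# `dvc metrics show` / `dvc metrics diff` can parse and compare them.
# metrics.json is hypothetical and not part of the pipeline defined above.
import json

def save_metrics_json(accuracy, path="metrics.json"):
    with open(path, "w") as f:
        json.dump({"accuracy": round(accuracy, 4)}, f, indent=2)

# Call at the end of train_model(), e.g. save_metrics_json(accuracy).
```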
## Key Concepts Summary

1. **Pipeline Stages**: Defined in `dvc.yaml` with dependencies, commands, and outputs
2. **Parameters**: Track configuration in `params.yaml`
3. **Metrics**: Automatically track model performance
4. **Remote Storage**: Large files stored separately from Git
5. **Reproducibility**: Changes automatically trigger appropriate stage re-runs

## Troubleshooting

### Issues with `dvc repro`:

- Check dependencies with `dvc status`
- Use `dvc repro --force` to force re-run all stages
- Verify command syntax in `dvc.yaml`

### Issues with `dvc push`/`pull`:

- Check remote configuration: `dvc remote list`
- Ensure remote directory exists and has proper permissions
- Verify network connectivity for cloud remotes

### Large Files in Git:

- Ensure `.gitignore` excludes large files (DVC adds entries automatically for files it tracks)
- Check that all large files are tracked with `dvc add` or produced as pipeline `outs`
- Use `git status` to verify no large files are untracked

## Next Steps

1. Try adding more stages to the pipeline (data preprocessing, feature engineering, etc.); see the sketch at the end of this document
2. Experiment with different remote storage options (S3, Google Cloud, Azure)
3. Integrate with CI/CD for automated pipeline execution
4. Explore DVC's experiment tracking features
5. Try DVC Studio for visualization and collaboration

This tutorial provides a complete foundation for using DVC in real-world machine learning projects!
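As a concrete starting point for the first of the next steps above (adding a preprocessing stage), here is a hypothetical `preprocess.py` sketch; the file names, the scaling choice, and the stage itself are assumptions rather than part of the tutorial pipeline:

```python
# Hypothetical preprocess.py for an extra pipeline stage: standardize the
# generated features before training. Not part of the dvc.yaml defined above.
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(input_file="data.csv", output_file="processed_data.csv"):
    df = pd.read_csv(input_file)
    features = df.drop("target", axis=1)

    scaler = StandardScaler()
    scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
    scaled["target"] = df["target"].values

    scaled.to_csv(output_file, index=False)
    print(f"Processed data saved to {output_file}")


if __name__ == "__main__":
    preprocess()
```

Wiring it in would mean adding a stage to `dvc.yaml` whose `deps` include `data.csv` and `preprocess.py`, whose `outs` include `processed_data.csv`, and pointing the `train_model` stage at `processed_data.csv` instead of `data.csv`.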