# garbage-classification-xgboost

**Repository Path**: mtq851/garbage-classification-xgboost

## Basic Information

- **Project Name**: garbage-classification-xgboost
- **Description**: An intelligent garbage classification system based on TF-IDF and XGBoost
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-12-13
- **Last Updated**: 2025-12-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# 🗑️ 智能垃圾分类系统 / Intelligent Garbage Classification System

[中文](#中文指南) | [English](#english-guide)

---

<a name="中文指南"></a>

## 🇨🇳 中文指南

### 📖 项目简介
本项目是一个基于机器学习的智能垃圾分类系统。它利用 TF-IDF 进行文本特征提取，并使用 **XGBoost** 算法对垃圾名称进行分类。

系统支持将垃圾分为以下四类：
*   **0**: 可回收垃圾
*   **1**: 干垃圾
*   **2**: 湿垃圾
*   **3**: 有害垃圾

### 📂 目录结构
为了确保代码正常运行，请保持以下目录结构：

```text
.
├── data/
│   └── garbage_sorting.csv    # 训练数据 (需包含 garbage_name 和 type 列)
├── model/
│   ├── tfidf_vectorizer.pkl   # 训练好的 TF-IDF 向量化器 (自动生成)
│   └── gc_model_v1.pkl        # 训练好的 XGBoost 模型 (自动生成)
├── utils/
│   └── common.py              # 通用工具模块 (包含 data_preprocessing)
├── train.py                   # 模型训练脚本
├── predict.py                 # 推理/预测脚本 (包含 GarbageClassifier 类)
└── README.md
```

### 🛠️ 环境依赖
请确保安装了以下 Python 库：

```bash
pip install pandas scikit-learn xgboost jieba joblib
```

### 🚀 快速开始

#### 1. 模型训练
运行训练脚本以生成模型文件。脚本会自动读取数据、预处理、提取特征并训练 XGBoost 模型。

```bash
python train.py
```
*   **输入**: `../data/garbage_sorting.csv`
*   **输出**: 模型文件将保存至 `../model/` 目录。
*   **特征工程**: 使用字符级 (Char-level) TF-IDF (1-3 grams)。
*   **模型参数**: `n_estimators=150`, `max_depth=6`, `learning_rate=0.2`。

#### 2. 模型预测 (推理)
使用 `GarbageClassifier` 类进行单条或批量预测。该类实现了**单例加载模式**，避免了重复加载模型带来的性能损耗，非常适合集成到 Web API 中。

**示例代码:**

```python
from predict import GarbageClassifier

# 1. 初始化 (只加载一次模型)
classifier = GarbageClassifier()

# 2. 预测
test_items = ["瓶子", "猫砂", "苹果皮", "电池"]
for item in test_items:
    result = classifier.predict(item)
    if result:
        print(f"[{result['text']}] -> {result['label_name']} (ID: {result['label_id']})")
```

### 📊 分类标签说明
| ID | 类别名称 | 示例 |
|:---:|:---|:---|
| 0 | 可回收垃圾 | 瓶子, 塑料, 纸箱 |
| 1 | 干垃圾 | 贝壳, 烟蒂, 陶瓷 |
| 2 | 湿垃圾 | 果皮, 剩菜, 骨头 |
| 3 | 有害垃圾 | 电池, 药瓶, 油漆 |

---

<a name="english-guide"></a>

## 🇬🇧 English Guide

### 📖 Introduction
This project is an intelligent garbage classification system based on machine learning. It uses **TF-IDF** for text feature extraction and **XGBoost** for classification.

The system classifies waste into four categories:
*   **0**: Recyclable Waste
*   **1**: Residual Waste (Dry)
*   **2**: Household Food Waste (Wet)
*   **3**: Hazardous Waste

### 📂 Directory Structure
Please maintain the following folder structure for the code to run correctly:

```text
.
├── data/
│   └── garbage_sorting.csv    # Training data (must contain 'garbage_name' and 'type' columns)
├── model/
│   ├── tfidf_vectorizer.pkl   # Trained TF-IDF vectorizer (Auto-generated)
│   └── gc_model_v1.pkl        # Trained XGBoost model (Auto-generated)
├── utils/
│   └── common.py              # Utility module (contains data_preprocessing)
├── train.py                   # Training script
├── predict.py                 # Inference/Prediction script (GarbageClassifier class)
└── README.md
```
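Because training depends on the `garbage_name` and `type` columns, it can help to check the CSV before running `train.py`. The following is a minimal sketch, not part of the repository: the helper name `validate_rows` is hypothetical, and it assumes that `type` holds the integer IDs 0 to 3 listed above.

```python
import csv
import io

REQUIRED_COLUMNS = {"garbage_name", "type"}
VALID_TYPES = {0, 1, 2, 3}  # assumption: the 'type' column stores these integer IDs


def validate_rows(handle):
    """Read CSV rows from a file-like object and return (name, label) pairs.

    Raises ValueError if a required column is missing or a row is malformed.
    """
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    rows = []
    for index, row in enumerate(reader, start=1):
        name = (row["garbage_name"] or "").strip()
        try:
            label = int(row["type"])
        except (TypeError, ValueError):
            raise ValueError(f"data row {index}: non-integer type {row['type']!r}")
        if not name or label not in VALID_TYPES:
            raise ValueError(f"data row {index}: invalid entry {row!r}")
        rows.append((name, label))
    return rows


# In-memory sample standing in for data/garbage_sorting.csv
sample = io.StringIO("garbage_name,type\n瓶子,0\n猫砂,1\n苹果皮,2\n电池,3\n")
rows = validate_rows(sample)
print(rows)
```

To check the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to the same function; the encoding is also an assumption.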

### 🛠️ Requirements
Ensure you have the following Python libraries installed:

```bash
pip install pandas scikit-learn xgboost jieba joblib
```

### 🚀 Quick Start

#### 1. Training
Run the training script to generate the model artifacts. The script handles data loading, preprocessing, feature engineering, and model training.

```bash
python train.py
```
*   **Input**: `../data/garbage_sorting.csv`
*   **Output**: Model files are saved to `../model/`.
*   **Feature Engineering**: Char-level TF-IDF (1-3 grams).
*   **Hyperparameters**: `n_estimators=150`, `max_depth=6`, `learning_rate=0.2`.
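To make the "char-level, 1-3 grams" setting concrete, the sketch below lists the character n-grams that such a feature extractor would count for a short item name. It is a pure-Python illustration, not the code in `train.py`; the function name `char_ngrams` is hypothetical. In scikit-learn, `TfidfVectorizer(analyzer="char", ngram_range=(1, 3))` is one configuration that produces features of this kind, though whether the training script uses exactly that call is an assumption.

```python
from typing import List


def char_ngrams(text: str, min_n: int = 1, max_n: int = 3) -> List[str]:
    """Return all character n-grams of length min_n..max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


grams = char_ngrams("苹果皮")
print(grams)  # ['苹', '果', '皮', '苹果', '果皮', '苹果皮']

# Related names share n-grams, which gives the classifier common features.
shared = set(char_ngrams("电池")) & set(char_ngrams("蓄电池"))
print(shared)
```

Character n-grams suit this task because Chinese item names are short and not whitespace-delimited, so overlapping substrings such as `电池` carry most of the signal.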

#### 2. Prediction (Inference)
Use the `GarbageClassifier` class for single or batch predictions. The class implements a **load-once mechanism**: the model is loaded into memory when the classifier is initialized and reused for every subsequent prediction. This avoids the cost of reloading the model on each request and makes the class well suited to Web APIs.

**Usage Example:**

```python
from predict import GarbageClassifier

# 1. Initialize (Loads model into memory once)
classifier = GarbageClassifier()

# 2. Predict
test_items = ["瓶子", "猫砂", "苹果皮", "电池"]
for item in test_items:
    result = classifier.predict(item)
    if result:
        print(f"[{result['text']}] -> {result['label_name']} (ID: {result['label_id']})")
```
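The load-once idea can be sketched independently of the real model files. The example below is an illustration under stated assumptions, not the implementation in `predict.py`: `LoadOnceClassifier` and `fake_loader` are hypothetical names, the loader is a stand-in for deserializing the vectorizer and model, and returning `None` for empty input is assumed from the `if result:` check in the usage example.

```python
from typing import Callable, Dict, Optional

LABEL_NAMES = {0: "可回收垃圾", 1: "干垃圾", 2: "湿垃圾", 3: "有害垃圾"}


def fake_loader() -> Callable[[str], int]:
    """Stand-in for loading model artifacts from disk; counts how often it runs."""
    fake_loader.calls += 1
    lookup = {"瓶子": 0, "猫砂": 1, "苹果皮": 2, "电池": 3}
    # Unknown names fall back to 1; an arbitrary choice for this demo only.
    return lambda text: lookup.get(text, 1)


fake_loader.calls = 0


class LoadOnceClassifier:
    # Shared by all instances, so the expensive load happens at most once.
    _predict_fn: Optional[Callable[[str], int]] = None

    def __init__(self, loader: Callable[[], Callable[[str], int]] = fake_loader):
        if LoadOnceClassifier._predict_fn is None:
            LoadOnceClassifier._predict_fn = loader()

    def predict(self, text: str) -> Optional[Dict[str, object]]:
        text = text.strip()
        if not text:
            return None  # assumed behavior for empty input
        label_id = LoadOnceClassifier._predict_fn(text)
        return {"text": text, "label_id": label_id, "label_name": LABEL_NAMES[label_id]}


first = LoadOnceClassifier()
second = LoadOnceClassifier()  # reuses the already-loaded predictor
print(fake_loader.calls)  # 1
print(second.predict("电池"))
```

Caching at class level means a web handler can construct the classifier freely without triggering another load. A module-level instance created at startup would achieve the same effect.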

### 📊 Label Mapping
| ID | Category Name (CN) | Type | Examples |
|:---:|:---|:---|:---|
| 0 | 可回收垃圾 | Recyclable | Bottles, Plastics, Cardboard |
| 1 | 干垃圾 | Dry/Residual | Shells, Cigarette butts, Ceramics |
| 2 | 湿垃圾 | Wet/Food | Fruit peels, Leftovers, Bones |
| 3 | 有害垃圾 | Hazardous | Batteries, Medicine bottles, Paint |