# garbage-classification-xgboost
**Repository Path**: mtq851/garbage-classification-xgboost
## Basic Information
- **Project Name**: garbage-classification-xgboost
- **Description**: An intelligent garbage classification system based on TF-IDF and XGBoost
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-12-13
- **Last Updated**: 2025-12-13
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# 🗑️ 智能垃圾分类系统 / Intelligent Garbage Classification System
[中文](#中文指南) | [English](#english-guide)
---
## 🇨🇳 中文指南
### 📖 项目简介
本项目是一个基于机器学习的智能垃圾分类系统。它利用 TF-IDF 进行文本特征提取,并使用 **XGBoost** 算法对垃圾名称进行分类。
系统支持将垃圾分为以下四类:
* **0**: 可回收垃圾
* **1**: 干垃圾
* **2**: 湿垃圾
* **3**: 有害垃圾
### 📂 目录结构
为了确保代码正常运行,请保持以下目录结构:
```text
.
├── data/
│   └── garbage_sorting.csv      # 训练数据 (需包含 garbage_name 和 type 列)
├── model/
│   ├── tfidf_vectorizer.pkl     # 训练好的 TF-IDF 向量化器 (自动生成)
│   └── gc_model_v1.pkl          # 训练好的 XGBoost 模型 (自动生成)
├── utils/
│   └── common.py                # 通用工具模块 (包含 data_preprocessing)
├── train.py                     # 模型训练脚本
├── predict.py                   # 推理/预测脚本 (包含 GarbageClassifier 类)
└── README.md
```
### 🛠️ 环境依赖
请确保安装了以下 Python 库:
```bash
pip install pandas scikit-learn xgboost jieba joblib
```
### 🚀 快速开始
#### 1. 模型训练
运行训练脚本以生成模型文件。脚本会自动读取数据、预处理、提取特征并训练 XGBoost 模型。
```bash
python train.py
```
* **输入**: `../data/garbage_sorting.csv`
* **输出**: 模型文件将保存至 `../model/` 目录。
* **特征工程**: 使用字符级 (Char-level) TF-IDF (1-3 grams)。
* **模型参数**: `n_estimators=150`, `max_depth=6`, `learning_rate=0.2`。
#### 2. 模型预测 (推理)
使用 `GarbageClassifier` 类进行单条或批量预测。该类实现了**单例加载模式**,避免了重复加载模型带来的性能损耗,非常适合集成到 Web API 中。
**示例代码:**
```python
from predict import GarbageClassifier

# 1. 初始化 (只加载一次模型)
classifier = GarbageClassifier()

# 2. 预测
test_items = ["瓶子", "猫砂", "苹果皮", "电池"]
for item in test_items:
    result = classifier.predict(item)
    if result:
        print(f"[{result['text']}] -> {result['label_name']} (ID: {result['label_id']})")
```
### 📊 分类标签说明
| ID | 类别名称 | 示例 |
|:---:|:---|:---|
| 0 | 可回收垃圾 | 瓶子, 塑料, 纸箱 |
| 1 | 干垃圾 | 贝壳, 烟蒂, 陶瓷 |
| 2 | 湿垃圾 | 果皮, 剩菜, 骨头 |
| 3 | 有害垃圾 | 电池, 药瓶, 油漆 |
---
## 🇬🇧 English Guide
### 📖 Introduction
This project is an intelligent garbage classification system based on machine learning. It uses **TF-IDF** for text feature extraction and **XGBoost** for classification.
The system classifies waste into four categories:
* **0**: Recyclable Waste
* **1**: Residual Waste (Dry)
* **2**: Household Food Waste (Wet)
* **3**: Hazardous Waste
### 📂 Directory Structure
Please maintain the following folder structure for the code to run correctly:
```text
.
├── data/
│   └── garbage_sorting.csv      # Training data (must contain 'garbage_name' and 'type' columns)
├── model/
│   ├── tfidf_vectorizer.pkl     # Trained TF-IDF vectorizer (auto-generated)
│   └── gc_model_v1.pkl          # Trained XGBoost model (auto-generated)
├── utils/
│   └── common.py                # Utility module (contains data_preprocessing)
├── train.py                     # Training script
├── predict.py                   # Inference/prediction script (GarbageClassifier class)
└── README.md
```
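Because training depends on the `garbage_name` and `type` columns, it can help to check the CSV header before running `train.py`. The snippet below is a minimal sketch using only the standard library; the helper name `missing_columns` and the sample rows are illustrative assumptions, not part of this repository.

```python
import csv
import io

# Column names taken from the directory listing above.
REQUIRED_COLUMNS = {"garbage_name", "type"}

def missing_columns(csv_file):
    """Return the required columns absent from the CSV header (empty set if none)."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header

# Demo on an in-memory sample; these rows are made up for illustration.
sample = io.StringIO("garbage_name,type\n瓶子,0\n电池,3\n")
print(missing_columns(sample))  # set()
```

To check the real file, pass an open file handle instead, for example `open("data/garbage_sorting.csv", encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption about how the dataset was saved.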
### 🛠️ Requirements
Ensure you have the following Python libraries installed:
```bash
pip install pandas scikit-learn xgboost jieba joblib
```
### 🚀 Quick Start
#### 1. Training
Run the training script to generate the model artifacts. The script handles data loading, preprocessing, feature engineering, and model training.
```bash
python train.py
```
* **Input**: `../data/garbage_sorting.csv`
* **Output**: Model files are saved to `../model/`.
* **Feature Engineering**: Char-level TF-IDF (1-3 grams).
* **Hyperparameters**: `n_estimators=150`, `max_depth=6`, `learning_rate=0.2`.
#### 2. Prediction (Inference)
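Use the `GarbageClassifier` class for single or batch predictions. The class follows a **load-once (singleton) pattern**: the vectorizer and model are loaded into memory one time and reused for every subsequent call. This avoids the cost of reloading the model on each request and makes the class well suited to serving behind a Web API.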
**Usage Example:**
```python
from predict import GarbageClassifier

# 1. Initialize (loads the model into memory once)
classifier = GarbageClassifier()

# 2. Predict (bottle, cat litter, apple peel, battery)
test_items = ["瓶子", "猫砂", "苹果皮", "电池"]
for item in test_items:
    result = classifier.predict(item)
    if result:
        print(f"[{result['text']}] -> {result['label_name']} (ID: {result['label_id']})")
```
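The load-once behaviour described above can be sketched in a few lines. This is one possible way to get that effect, assuming a cached loader function; it is not the actual implementation in `predict.py`. The names `get_model`, `_StubModel`, and `LOAD_COUNT` are hypothetical, and the stub stands in for the real `.pkl` artifacts so the example runs without them.

```python
import functools

class _StubModel:
    """Stand-in for the real vectorizer and model pair; always predicts label 0."""
    def predict(self, text):
        return 0

LOAD_COUNT = 0  # counts how many times the expensive load actually runs

@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model on the first call and return the cached object afterwards."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # A real loader would deserialize the .pkl files here (for example with joblib.load).
    return _StubModel()

first = get_model()
second = get_model()
print(first is second, LOAD_COUNT)  # True 1
```

In a web service, each request handler would call `get_model()` and receive the same in-memory object, so the load cost is paid only on the first request.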
### 📊 Label Mapping
| ID | Category Name (CN) | Type | Examples |
|:---:|:---|:---|:---|
| 0 | 可回收垃圾 | Recyclable | Bottles, Plastics, Cardboard |
| 1 | 干垃圾 | Dry/Residual | Shells, Cigarette butts, Ceramics |
| 2 | 湿垃圾 | Wet/Food | Fruit peels, Leftovers, Bones |
| 3 | 有害垃圾 | Hazardous | Batteries, Medicine bottles, Paint |
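
The mapping in the table can be expressed as a small lookup that turns a predicted ID into a result dictionary. This is a sketch only: the keys `text`, `label_id`, and `label_name` come from the usage example, but the names `LABEL_NAMES` and `format_result` are hypothetical, and whether the real `label_name` holds the Chinese or the English name is an assumption.

```python
# ID -> category name, as listed in the label table.
LABEL_NAMES = {
    0: "可回收垃圾",  # Recyclable
    1: "干垃圾",      # Dry/Residual
    2: "湿垃圾",      # Wet/Food
    3: "有害垃圾",    # Hazardous
}

def format_result(text, label_id):
    """Build a result dict with the keys used in the usage example."""
    return {"text": text, "label_id": label_id, "label_name": LABEL_NAMES[label_id]}

result = format_result("电池", 3)
print(f"[{result['text']}] -> {result['label_name']} (ID: {result['label_id']})")
# [电池] -> 有害垃圾 (ID: 3)
```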