# garbage-classification-xgboost

**Repository Path**: mtq851/garbage-classification-xgboost

## Basic Information

- **Project Name**: garbage-classification-xgboost
- **Description**: An intelligent garbage classification system based on TF-IDF and XGBoost
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-12-13
- **Last Updated**: 2025-12-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# 🗑️ Intelligent Garbage Classification System

### 📖 Introduction

This project is a machine-learning system that classifies garbage by item name. It uses **TF-IDF** for text feature extraction and **XGBoost** for classification.

The system classifies waste into four categories:

* **0**: Recyclable Waste
* **1**: Residual Waste (Dry)
* **2**: Household Food Waste (Wet)
* **3**: Hazardous Waste

### 📂 Directory Structure

Keep the following directory structure so that the code runs correctly:

```text
.
├── data/
│   └── garbage_sorting.csv     # Training data (must contain 'garbage_name' and 'type' columns)
├── model/
│   ├── tfidf_vectorizer.pkl    # Trained TF-IDF vectorizer (auto-generated)
│   └── gc_model_v1.pkl         # Trained XGBoost model (auto-generated)
├── utils/
│   └── common.py               # Utility module (contains data_preprocessing)
├── train.py                    # Training script
├── predict.py                  # Inference/prediction script (GarbageClassifier class)
└── README.md
```

### 🛠️ Requirements

Install the following Python libraries:

```bash
pip install pandas scikit-learn xgboost jieba joblib
```

### 🚀 Quick Start

#### 1. Training

Run the training script to generate the model artifacts. The script loads the data, preprocesses it, extracts features, and trains the XGBoost model.

```bash
python train.py
```

* **Input**: `../data/garbage_sorting.csv`
* **Output**: Model files are saved to `../model/`.
* **Feature Engineering**: Char-level TF-IDF (1-3 grams).
* **Hyperparameters**: `n_estimators=150`, `max_depth=6`, `learning_rate=0.2`.

#### 2. Prediction (Inference)

Use the `GarbageClassifier` class for single or batch predictions. The class implements a **load-once mechanism** (a singleton-style loading pattern): the model is loaded once instead of being reloaded for every request, which makes the class well suited to integration into a Web API.

**Usage Example:**

```python
from predict import GarbageClassifier

# 1. Initialize (loads the model into memory once)
classifier = GarbageClassifier()

# 2. Predict
test_items = ["瓶子", "猫砂", "苹果皮", "电池"]  # bottle, cat litter, apple peel, battery
for item in test_items:
    result = classifier.predict(item)
    if result:
        print(f"[{result['text']}] -> {result['label_name']} (ID: {result['label_id']})")
```

### 📊 Label Mapping

| ID | Category Name (CN) | Type | Examples |
|:---:|:---|:---|:---|
| 0 | 可回收垃圾 | Recyclable | Bottles, Plastics, Cardboard |
| 1 | 干垃圾 | Dry/Residual | Shells, Cigarette butts, Ceramics |
| 2 | 湿垃圾 | Wet/Food | Fruit peels, Leftovers, Bones |
| 3 | 有害垃圾 | Hazardous | Batteries, Medicine bottles, Paint |