# codeChatBot

**Repository Path**: captainwilson/code-chat-bot

## Basic Information

- **Project Name**: codeChatBot
- **Description**: Analyzes local code, generates Alpaca fine-tuning datasets, performs LoRA fine-tuning, merges LoRA adapters for vLLM, and includes vLLM inference code
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-16
- **Last Updated**: 2026-01-16

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Finance Code Analysis & Fine-tuning Project

A comprehensive toolkit for code analysis, dataset generation, and language model fine-tuning, with a focus on finance domain applications.

## Overview

This project provides tools for:

- **Code Analysis**: Interactive Q&A with codebases using LLM APIs
- **Dataset Generation**: Creating fine-tuning datasets from code repositories
- **Model Fine-tuning**: LoRA-based fine-tuning of language models for the finance domain
- **Model Serving**: Deploying fine-tuned models with vLLM

## Project Structure

```
finance/
├── scripts/                        # Code analysis and dataset generation tools
│   ├── chatcode.py                 # Interactive codebase Q&A tool
│   ├── generate_alpaca_dataset.py  # Generate fine-tuning datasets from code
│   └── README.md                   # Detailed documentation for scripts
├── finetuning/                     # Model fine-tuning and serving
│   ├── lora_multiple_gpus.py       # Multi-GPU LoRA training script
│   ├── merge_checkpoint.py         # Merge LoRA adapters into base model
│   ├── vllm_serve.sh               # Serve model with vLLM
│   ├── vllm_request.py             # Test vLLM server
│   └── README.md                   # Fine-tuning guide
├── requirements.txt                # Python dependencies
└── README.md                       # This file
```

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure API Access

Create a `.env` file in the `scripts/` directory:

```bash
# scripts/.env
DOUBAO_API_BASE=https://your-doubao-api-endpoint.com
DOUBAO_API_KEY=your-api-key-here
DOUBAO_MODEL=your-model-name
DOUBAO_API_PATH=/chat/completions
```

### 3. Use the Tools

**Chat with your codebase:**

```bash
python scripts/chatcode.py "How does authentication work?"
```

**Generate a fine-tuning dataset:**

```bash
python scripts/generate_alpaca_dataset.py \
    --path . \
    --output scripts/dataset.jsonl \
    --max-examples 1000
```

**Fine-tune a model:**

```bash
cd finetuning
torchrun --nproc_per_node=4 lora_multiple_gpus.py
```

## Components

### 📁 scripts/

Code analysis and dataset generation tools. See [scripts/README.md](scripts/README.md) for detailed documentation.

#### `chatcode.py`

Interactive CLI tool for asking questions about your codebase:

- Keyword-based code search
- Context-aware snippet extraction
- Streaming LLM responses
- Conversation history (up to 50 rounds)
- Support for multiple programming languages

**Features:**

- Interactive chat mode or one-shot queries
- Real-time streaming output
- ESC/Ctrl-C interrupt support
- Configurable search scope and context

#### `generate_alpaca_dataset.py`

Generates Alpaca-format fine-tuning datasets from code repositories:

- Extracts functions and classes from code
- Generates explanation, generation, and refactoring tasks
- Uses an LLM to create high-quality training examples
- Supports multiple programming languages
- Interactive progress feedback

**Usage:**

```bash
python scripts/generate_alpaca_dataset.py \
    --path /path/to/code \
    --output dataset.jsonl \
    --max-examples 1000 \
    --task-types explanation,generation,refactoring
```
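The generated records follow the standard Alpaca schema of `instruction`, `input`, and `output` fields, written one JSON object per line of the output `.jsonl` file. A minimal sketch of what a single record might look like; the prompt wording and the example function are hypothetical, and the script's actual records may carry different text or extra fields:

```python
import json

# Hypothetical Alpaca-format record; generate_alpaca_dataset.py may word
# the instruction differently or include additional metadata.
record = {
    "instruction": "Explain what the following Python function does.",
    "input": (
        "def moving_average(prices, window):\n"
        "    return [sum(prices[i:i + window]) / window\n"
        "            for i in range(len(prices) - window + 1)]"
    ),
    "output": (
        "The function computes a simple moving average over a list of "
        "prices using a fixed-size sliding window."
    ),
}

# Records are appended to the JSONL output file, one JSON object per line.
with open("scripts/dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```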
### 📁 finetuning/

Model fine-tuning and serving infrastructure. See [finetuning/README.md](finetuning/README.md) for detailed documentation.

#### Training Pipeline

1. **Training**: Multi-GPU LoRA fine-tuning with `lora_multiple_gpus.py`
2. **Merging**: Merge LoRA adapters into the base model with `merge_checkpoint.py`
3. **Serving**: Deploy the merged model with vLLM using `vllm_serve.sh`
4. **Testing**: Test the served model with `vllm_request.py`

**Workflow:**

```bash
# 1. Train LoRA adapter
torchrun --nproc_per_node=4 finetuning/lora_multiple_gpus.py

# 2. Merge adapter
python finetuning/merge_checkpoint.py

# 3. Serve model
bash finetuning/vllm_serve.sh

# 4. Test server
python finetuning/vllm_request.py
```
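For step 1, the Tips and Troubleshooting sections below note that the training script enables gradient checkpointing and sets `ddp_find_unused_parameters=False`. A minimal sketch of what such a LoRA training script looks like with `transformers`, `peft`, and `datasets`; the base model name, LoRA hyperparameters, and output path here are placeholders, not the values used in `lora_multiple_gpus.py`:

```python
# LoRA training sketch; model name, hyperparameters, and paths are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "Qwen/Qwen2.5-7B-Instruct"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")
model.enable_input_require_grads()  # needed when gradient checkpointing meets frozen base weights

# Wrap the base model with LoRA adapters; only these small matrices are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Dataset path from the Configuration section below.
dataset = load_dataset("json", data_files="../alpaca_data/finance-alpaca/Cleaned_date.json")

def to_features(example):
    # Collapse the Alpaca fields into one training prompt and tokenize it.
    prompt = f"{example['instruction']}\n{example.get('input', '')}\n{example['output']}"
    return tokenizer(prompt, truncation=True, max_length=1024)

train_ds = dataset["train"].map(to_features, remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="output/lora-finance",
    per_device_train_batch_size=1,     # keep small to avoid OOM
    gradient_accumulation_steps=8,     # raise this instead of the batch size
    gradient_checkpointing=True,       # trade compute for memory
    ddp_find_unused_parameters=False,  # avoids DDP errors with frozen base weights
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```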
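Step 2 folds the trained LoRA weights back into the base model so that vLLM can serve a single standalone checkpoint. A minimal sketch of that merge, assuming the standard `peft` API; the paths are placeholders and the actual `merge_checkpoint.py` may differ:

```python
# Merge a LoRA adapter into its base model (placeholder paths).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_path = "Qwen/Qwen2.5-7B-Instruct"          # placeholder base model
adapter_path = "output/lora-finance/checkpoint-best"  # placeholder adapter checkpoint
merged_path = "output/merged-finance-model"

base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_path)

# merge_and_unload() bakes the adapter weights into the base weights
# and returns a plain transformers model with no PEFT wrapper.
merged = model.merge_and_unload()
merged.save_pretrained(merged_path)

# Save the tokenizer alongside so vLLM can load the directory directly.
AutoTokenizer.from_pretrained(base_model_path).save_pretrained(merged_path)
```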
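For steps 3 and 4, `vllm_serve.sh` starts vLLM's OpenAI-compatible HTTP server, so the served model can be exercised with a plain chat-completions request. A minimal sketch using `requests` (already listed in the requirements); the port, served model name, and prompt are assumptions and may not match `vllm_request.py`:

```python
# Send one chat-completion request to a locally served vLLM model.
import requests

# Assumed defaults: vLLM's OpenAI-compatible server on localhost:8000.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "output/merged-finance-model",  # placeholder served model name
    "messages": [
        {"role": "user", "content": "Explain what a moving average is in finance."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```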
## Requirements

### Python Dependencies

- `python-dotenv==1.2.1`
- `Requests==2.32.5`

### Additional Requirements for Fine-tuning

- PyTorch with CUDA support
- `transformers`
- `peft`
- `datasets`
- `pandas`
- `vllm` (for serving)
- `sseclient` (for streaming requests)

### System Requirements

- Python 3.7+
- CUDA-capable GPUs (for fine-tuning and serving)
- Multiple GPUs recommended for training (4+ GPUs)

## Configuration

### Environment Variables

The tools use environment variables for API configuration. Create a `.env` file in `scripts/`:

```bash
DOUBAO_API_BASE=https://ark.cn-beijing.volces.com/api/v3
DOUBAO_API_KEY=your-api-key
DOUBAO_MODEL=your-model-name
DOUBAO_API_PATH=/chat/completions
```

### Dataset Paths

The fine-tuning script expects datasets in Alpaca format at:

```
../alpaca_data/finance-alpaca/Cleaned_date.json
```

Update the paths in `finetuning/lora_multiple_gpus.py` if your dataset is located elsewhere.

## Use Cases

### 1. Code Understanding

Use `chatcode.py` to quickly understand unfamiliar codebases:

- Ask questions about code structure
- Understand complex functions
- Find related code patterns
- Get explanations with context

### 2. Dataset Generation

Use `generate_alpaca_dataset.py` to create training data:

- Extract code patterns from repositories
- Generate diverse training examples
- Create domain-specific datasets (e.g., finance)
- Prepare data for fine-tuning

### 3. Domain-Specific Fine-tuning

Fine-tune models for the finance domain:

- Train on finance-specific code and documentation
- Create specialized assistants
- Improve model performance on domain tasks
- Deploy with vLLM for production use

## Examples

### Example 1: Analyze a Codebase

```bash
# Interactive mode
python scripts/chatcode.py

# One-shot query
python scripts/chatcode.py "How does the payment processing work?" --path ./src
```

### Example 2: Generate a Training Dataset

```bash
# Generate a dataset from the entire project
python scripts/generate_alpaca_dataset.py \
    --path . \
    --output scripts/finance_dataset.jsonl \
    --max-examples 5000 \
    --min-complexity 2
```

### Example 3: Fine-tune a Model

```bash
# Train on 4 GPUs
cd finetuning
torchrun --nproc_per_node=4 lora_multiple_gpus.py

# Merge and serve
python merge_checkpoint.py
bash vllm_serve.sh
```

## Tips

1. **Code Analysis**: Start with broad questions, then narrow down based on the responses
2. **Dataset Generation**: Use `--min-complexity` to filter out trivial functions
3. **Fine-tuning**: Monitor training loss and select the best checkpoint
4. **Memory**: Use gradient checkpointing and smaller batch sizes if OOM occurs
5. **Multi-GPU**: Always use `torchrun` for distributed training

## Troubleshooting

### Common Issues

**API Connection Errors**

- Verify the API credentials in the `.env` file
- Check network connectivity
- Ensure the API endpoint is correct

**Out of Memory (OOM)**

- Reduce the batch size in the training script
- Increase gradient accumulation steps
- Use gradient checkpointing (already enabled)

**Dataset Not Found**

- Check the dataset path in `lora_multiple_gpus.py`
- Ensure the dataset is in Alpaca format
- Verify file permissions

**Multi-GPU Training Issues**

- Ensure `ddp_find_unused_parameters=False` is set
- Use `torchrun` instead of manual GPU selection
- Check CUDA visibility

## Contributing

When adding new features:

- Update the relevant README files
- Add configuration options to the `.env` template
- Test with multiple codebases
- Document any new dependencies
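New configuration options added to the `.env` template would normally be read through `python-dotenv`, which is pinned in `requirements.txt`. A minimal sketch of loading the existing variables; the exact loading code in `scripts/chatcode.py` may differ:

```python
# Load the DOUBAO_* settings from scripts/.env (sketch; actual scripts may differ).
import os
from dotenv import load_dotenv

load_dotenv("scripts/.env")  # reads key=value pairs into the process environment

API_BASE = os.environ["DOUBAO_API_BASE"]
API_KEY = os.environ["DOUBAO_API_KEY"]
MODEL = os.environ["DOUBAO_MODEL"]
API_PATH = os.getenv("DOUBAO_API_PATH", "/chat/completions")

print(f"Calling {MODEL} at {API_BASE}{API_PATH}")
```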