# codeChatBot

**Repository Path**: captainwilson/code-chat-bot

## Basic Information

- **Project Name**: codeChatBot
- **Description**: Analyzes local code, generates Alpaca fine-tuning datasets, performs LoRA fine-tuning, merges LoRA adapters for vLLM, and includes vLLM inference code
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-16
- **Last Updated**: 2026-01-16

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Finance Code Analysis & Fine-tuning Project

A comprehensive toolkit for code analysis, dataset generation, and language model fine-tuning, with a focus on finance domain applications.

## Overview

This project provides tools for:

- **Code Analysis**: Interactive Q&A with codebases using LLM APIs
- **Dataset Generation**: Creating fine-tuning datasets from code repositories
- **Model Fine-tuning**: LoRA-based fine-tuning of language models for the finance domain
- **Model Serving**: Deploying fine-tuned models with vLLM

## Project Structure

```
finance/
├── scripts/                        # Code analysis and dataset generation tools
│   ├── chatcode.py                 # Interactive codebase Q&A tool
│   ├── generate_alpaca_dataset.py  # Generate fine-tuning datasets from code
│   └── README.md                   # Detailed documentation for scripts
├── finetuning/                     # Model fine-tuning and serving
│   ├── lora_multiple_gpus.py       # Multi-GPU LoRA training script
│   ├── merge_checkpoint.py         # Merge LoRA adapters into base model
│   ├── vllm_serve.sh               # Serve model with vLLM
│   ├── vllm_request.py             # Test vLLM server
│   └── README.md                   # Fine-tuning guide
├── requirements.txt                # Python dependencies
└── README.md                       # This file
```

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure API Access

Create a `.env` file in the `scripts/` directory:

```bash
# scripts/.env
DOUBAO_API_BASE=https://your-doubao-api-endpoint.com
DOUBAO_API_KEY=your-api-key-here
DOUBAO_MODEL=your-model-name
DOUBAO_API_PATH=/chat/completions
```

### 3. Use the Tools

**Chat with your codebase:**

```bash
python scripts/chatcode.py "How does authentication work?"
```

**Generate a fine-tuning dataset:**

```bash
python scripts/generate_alpaca_dataset.py \
    --path . \
    --output scripts/dataset.jsonl \
    --max-examples 1000
```

**Fine-tune a model:**

```bash
cd finetuning
torchrun --nproc_per_node=4 lora_multiple_gpus.py
```

## Components

### 📁 scripts/

Code analysis and dataset generation tools. See [scripts/README.md](scripts/README.md) for detailed documentation.

#### `chatcode.py`

Interactive CLI tool for asking questions about your codebase:

- Keyword-based code search
- Context-aware snippet extraction
- Streaming LLM responses
- Conversation history (up to 50 rounds)
- Support for multiple programming languages

**Features:**

- Interactive chat mode or one-shot queries
- Real-time streaming output
- ESC/Ctrl-C interrupt support
- Configurable search scope and context

#### `generate_alpaca_dataset.py`

Generates Alpaca-format fine-tuning datasets from code repositories:

- Extracts functions and classes from code
- Generates explanation, generation, and refactoring tasks
- Uses an LLM to create high-quality training examples
- Supports multiple programming languages
- Interactive progress feedback

**Usage:**

```bash
python scripts/generate_alpaca_dataset.py \
    --path /path/to/code \
    --output dataset.jsonl \
    --max-examples 1000 \
    --task-types explanation,generation,refactoring
```
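The generated records follow the standard Alpaca schema of `instruction`, `input`, and `output` fields, written one JSON object per line of the output `.jsonl` file. A minimal sketch of what a single record might look like; the prompt wording and the example function are hypothetical, and the script's actual records may carry different text or extra fields:

```python
import json

# Hypothetical Alpaca-format record; generate_alpaca_dataset.py may word
# the instruction differently or include additional metadata.
record = {
    "instruction": "Explain what the following Python function does.",
    "input": (
        "def moving_average(prices, window):\n"
        "    return [sum(prices[i:i + window]) / window\n"
        "            for i in range(len(prices) - window + 1)]"
    ),
    "output": (
        "The function computes a simple moving average over a list of "
        "prices using a fixed-size sliding window."
    ),
}

# Records are appended to the JSONL output file, one JSON object per line.
with open("scripts/dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```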
### 📁 finetuning/

Model fine-tuning and serving infrastructure. See [finetuning/README.md](finetuning/README.md) for detailed documentation.

#### Training Pipeline

1. **Training**: Multi-GPU LoRA fine-tuning with `lora_multiple_gpus.py`
2. **Merging**: Merge LoRA adapters into the base model with `merge_checkpoint.py`
3. **Serving**: Deploy the merged model with vLLM using `vllm_serve.sh`
4. **Testing**: Test the served model with `vllm_request.py`

**Workflow:**

```bash
# 1. Train LoRA adapter
torchrun --nproc_per_node=4 finetuning/lora_multiple_gpus.py

# 2. Merge adapter
python finetuning/merge_checkpoint.py

# 3. Serve model
bash finetuning/vllm_serve.sh

# 4. Test server
python finetuning/vllm_request.py
```
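For step 1, the Tips and Troubleshooting sections below note that the training script enables gradient checkpointing and sets `ddp_find_unused_parameters=False`. A minimal sketch of what such a LoRA training script looks like with `transformers`, `peft`, and `datasets`; the base model name, LoRA hyperparameters, and output path here are placeholders, not the values used in `lora_multiple_gpus.py`:

```python
# LoRA training sketch; model name, hyperparameters, and paths are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "Qwen/Qwen2.5-7B-Instruct"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")
model.enable_input_require_grads()  # needed when gradient checkpointing meets frozen base weights

# Wrap the base model with LoRA adapters; only these small matrices are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Dataset path from the Configuration section below.
dataset = load_dataset("json", data_files="../alpaca_data/finance-alpaca/Cleaned_date.json")

def to_features(example):
    # Collapse the Alpaca fields into one training prompt and tokenize it.
    prompt = f"{example['instruction']}\n{example.get('input', '')}\n{example['output']}"
    return tokenizer(prompt, truncation=True, max_length=1024)

train_ds = dataset["train"].map(to_features, remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="output/lora-finance",
    per_device_train_batch_size=1,     # keep small to avoid OOM
    gradient_accumulation_steps=8,     # raise this instead of the batch size
    gradient_checkpointing=True,       # trade compute for memory
    ddp_find_unused_parameters=False,  # avoids DDP errors with frozen base weights
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```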
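Step 2 folds the trained LoRA weights back into the base model so that vLLM can serve a single standalone checkpoint. A minimal sketch of that merge, assuming the standard `peft` API; the paths are placeholders and the actual `merge_checkpoint.py` may differ:

```python
# Merge a LoRA adapter into its base model (placeholder paths).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_path = "Qwen/Qwen2.5-7B-Instruct"          # placeholder base model
adapter_path = "output/lora-finance/checkpoint-best"  # placeholder adapter checkpoint
merged_path = "output/merged-finance-model"

base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_path)

# merge_and_unload() bakes the adapter weights into the base weights
# and returns a plain transformers model with no PEFT wrapper.
merged = model.merge_and_unload()
merged.save_pretrained(merged_path)

# Save the tokenizer alongside so vLLM can load the directory directly.
AutoTokenizer.from_pretrained(base_model_path).save_pretrained(merged_path)
```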
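For steps 3 and 4, `vllm_serve.sh` starts vLLM's OpenAI-compatible HTTP server, so the served model can be exercised with a plain chat-completions request. A minimal sketch using `requests` (already listed in the requirements); the port, served model name, and prompt are assumptions and may not match `vllm_request.py`:

```python
# Send one chat-completion request to a locally served vLLM model.
import requests

# Assumed defaults: vLLM's OpenAI-compatible server on localhost:8000.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "output/merged-finance-model",  # placeholder served model name
    "messages": [
        {"role": "user", "content": "Explain what a moving average is in finance."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```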
## Requirements

### Python Dependencies

- `python-dotenv==1.2.1`
- `Requests==2.32.5`

### Additional Requirements for Fine-tuning

- PyTorch with CUDA support
- `transformers`
- `peft`
- `datasets`
- `pandas`
- `vllm` (for serving)
- `sseclient` (for streaming requests)

### System Requirements

- Python 3.7+
- CUDA-capable GPUs (for fine-tuning and serving)
- Multiple GPUs recommended for training (4+ GPUs)

## Configuration

### Environment Variables

The tools use environment variables for API configuration. Create a `.env` file in `scripts/`:

```bash
DOUBAO_API_BASE=https://ark.cn-beijing.volces.com/api/v3
DOUBAO_API_KEY=your-api-key
DOUBAO_MODEL=your-model-name
DOUBAO_API_PATH=/chat/completions
```

### Dataset Paths

The fine-tuning script expects datasets in Alpaca format at:

```
../alpaca_data/finance-alpaca/Cleaned_date.json
```

Update the paths in `finetuning/lora_multiple_gpus.py` if your dataset is located elsewhere.

## Use Cases

### 1. Code Understanding

Use `chatcode.py` to quickly understand unfamiliar codebases:

- Ask questions about code structure
- Understand complex functions
- Find related code patterns
- Get explanations with context

### 2. Dataset Generation

Use `generate_alpaca_dataset.py` to create training data:

- Extract code patterns from repositories
- Generate diverse training examples
- Create domain-specific datasets (e.g., finance)
- Prepare data for fine-tuning

### 3. Domain-Specific Fine-tuning

Fine-tune models for the finance domain:

- Train on finance-specific code and documentation
- Create specialized assistants
- Improve model performance on domain tasks
- Deploy with vLLM for production use

## Examples

### Example 1: Analyze a Codebase

```bash
# Interactive mode
python scripts/chatcode.py

# One-shot query
python scripts/chatcode.py "How does the payment processing work?" --path ./src
```

### Example 2: Generate a Training Dataset

```bash
# Generate a dataset from the entire project
python scripts/generate_alpaca_dataset.py \
    --path . \
    --output scripts/finance_dataset.jsonl \
    --max-examples 5000 \
    --min-complexity 2
```

### Example 3: Fine-tune a Model

```bash
# Train on 4 GPUs
cd finetuning
torchrun --nproc_per_node=4 lora_multiple_gpus.py

# Merge and serve
python merge_checkpoint.py
bash vllm_serve.sh
```

## Tips

1. **Code Analysis**: Start with broad questions, then narrow down based on the responses
2. **Dataset Generation**: Use `--min-complexity` to filter out trivial functions
3. **Fine-tuning**: Monitor training loss and select the best checkpoint
4. **Memory**: Use gradient checkpointing and smaller batch sizes if OOM occurs
5. **Multi-GPU**: Always use `torchrun` for distributed training

## Troubleshooting

### Common Issues

**API Connection Errors**

- Verify the API credentials in the `.env` file
- Check network connectivity
- Ensure the API endpoint is correct

**Out of Memory (OOM)**

- Reduce the batch size in the training script
- Increase gradient accumulation steps
- Use gradient checkpointing (already enabled)

**Dataset Not Found**

- Check the dataset path in `lora_multiple_gpus.py`
- Ensure the dataset is in Alpaca format
- Verify file permissions

**Multi-GPU Training Issues**

- Ensure `ddp_find_unused_parameters=False` is set
- Use `torchrun` instead of manual GPU selection
- Check CUDA visibility

## Contributing

When adding new features:

- Update the relevant README files
- Add configuration options to the `.env` template
- Test with multiple codebases
- Document any new dependencies
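New configuration options added to the `.env` template would normally be read through `python-dotenv`, which is pinned in `requirements.txt`. A minimal sketch of loading the existing variables; the exact loading code in `scripts/chatcode.py` may differ:

```python
# Load the DOUBAO_* settings from scripts/.env (sketch; actual scripts may differ).
import os
from dotenv import load_dotenv

load_dotenv("scripts/.env")  # reads key=value pairs into the process environment

API_BASE = os.environ["DOUBAO_API_BASE"]
API_KEY = os.environ["DOUBAO_API_KEY"]
MODEL = os.environ["DOUBAO_MODEL"]
API_PATH = os.getenv("DOUBAO_API_PATH", "/chat/completions")

print(f"Calling {MODEL} at {API_BASE}{API_PATH}")
```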