# unsloth-granite-4.0-h-small-base

**Repository Path**: hf-models/unsloth-granite-4.0-h-small-base

## Basic Information

- **Project Name**: unsloth-granite-4.0-h-small-base
- **Description**: Mirror of https://huggingface.co/unsloth/granite-4.0-h-small-base
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-06
- **Last Updated**: 2025-10-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

---
base_model:
- ibm-granite/granite-4.0-h-small-base
license: apache-2.0
library_name: transformers
tags:
- language
- unsloth
- granite-4.0
---
Unsloth Dynamic 2.0 quantization achieves superior accuracy and outperforms other leading quantization methods.

| Benchmarks | Metric | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|---|
| General Tasks | |||||
| MMLU | 5-shot | 66.47 | 67.43 | 68.90 | 75.85 |
| MMLU-Pro | 5-shot,CoT | 37.16 | 34.03 | 35.47 | 48.94 |
| BBH | 3-shot, CoT | 63.84 | 57.65 | 59.67 | 75.84 |
| AGI EVAL | 3-shot | 54.32 | 54.59 | 53.69 | 62.05 |
| DROP | 5-shot | 66.04 | 67.44 | 64.92 | 74.69 |
| Math Tasks | |||||
| GSM8K | 8-shot | 72.93 | 63.76 | 72.55 | 82.11 |
| Minerva Math | 4-shot | 38.00 | 39.70 | 40.34 | 46.28 |
| Code Tasks | |||||
| HumanEval | pass@1 [StarCoder Prompt] | 76.19 | 73.72 | 77.59 | 83.66 |
| HumanEval | pass@1 | 59.76 | 70.73 | 71.34 | 76.22 |
| HumanEval+ | pass@1 | 54.27 | 67.07 | 64.02 | 69.51 |
| MBPP | pass@1 | 81.48 | 74.87 | 81.48 | 83.07 |
| MBPP+ | pass@1 | 68.25 | 63.23 | 68.78 | 70.37 |
| Multilingual Tasks | |||||
| MMMLU | 5-shot | 56.59 | 58.50 | 62.77 | 71.18 |
| INCLUDE | 5-shot | 51.77 | 52.16 | 53.78 | 66.04 |
| MGSM | 8-shot | 58.48 | 47.04 | 54.64 | 65.20 |

| Benchmarks | # Langs | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |
| MGSM | 5 | en, es, fr, ja, zh |
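The HumanEval and MBPP rows above report pass@1. For context, here is a minimal sketch of the standard unbiased pass@k estimator (from the Codex paper); this is an illustration, not necessarily the exact evaluation harness used to produce the numbers above:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the tests."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = 10 generations and c = 5 correct, pass@1 reduces to c/n.
print(pass_at_k(10, 5, 1))  # → 0.5
```

For k = 1 the estimator is simply the fraction of correct generations, which is why pass@1 can also be read as per-sample accuracy.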

**Model Architecture:**

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Embedding size | 2560 | 2048 | 1536 | 4096 |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| Attention head size | 64 | 64 | 128 | 128 |
| Number of attention heads | 40 | 32 | 12 | 32 |
| Number of KV heads | 8 | 8 | 4 | 8 |
| Mamba2 state size | - | 128 | 128 | 128 |
| Number of Mamba2 heads | - | 64 | 48 | 128 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |
| Num. Experts | - | - | 64 | 72 |
| Num. active Experts | - | - | 6 | 10 |
| Expert hidden size | - | - | 512 | 768 |
| MLP activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU |
| Sequence length | 128K | 128K | 128K | 128K |
| Position embedding | RoPE | NoPE | NoPE | NoPE |
| # Parameters | 3B | 3B | 7B | 32B |
| # Active parameters | 3B | 3B | 1B | 9B |
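The gap between total and active parameters in the H Small MoE column follows from routing only 10 of 72 experts per token. A rough back-of-envelope sketch, using the H Small numbers from the table and assuming a SwiGLU FFN with three weight matrices per expert (ignoring routers, biases, norms, attention, and Mamba2 blocks):

```python
# Assumptions: figures from the H Small MoE column; SwiGLU expert FFN
# counted as 3 weight matrices (gate, up, down). Illustrative only.
d_model = 4096        # embedding size
d_expert = 768        # expert hidden size
n_experts, n_active = 72, 10

per_expert = 3 * d_model * d_expert       # params in one expert FFN
total_expert = n_experts * per_expert     # stored per MoE layer
active_expert = n_active * per_expert     # touched per token

print(f"per expert:   {per_expert / 1e6:.1f}M params")
print(f"per layer:    {total_expert / 1e6:.0f}M stored, "
      f"{active_expert / 1e6:.0f}M active")
print(f"active ratio: {active_expert / total_expert:.2f}")  # ≈ 10/72
```

Summed over the MoE layers, this per-layer gap is what drives the 32B total vs. 9B active parameter counts in the table.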

**Training Stages** (approximate training tokens per stage, in trillions):

| Stage | Characteristics | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|---|
| I | General mixture of training data, warmup, and power scheduler for learning rate. | 10 | 10 | 15 | 15 |
| II | General mixture of training data with higher percentages of code and math with power scheduler for learning rate. | 2 | 5 | 5 | 5 |
| III | High quality training data, exponential decay of learning rate. | 2 | 2 | 2 | 2 |
| IV | High quality training data, linear decay to zero for learning rate. | 0.5 | 0.5 | 0.5 | 0.5 |
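The four stages above combine warmup plus a power scheduler (stages I and II), exponential decay (stage III), and linear decay to zero (stage IV). The following sketch shows one way such a piecewise schedule can be wired together; every constant here (peak rate, stage boundaries, decay exponents) is hypothetical, not the value used for these models:

```python
import math

def lr(step: int, total: int, peak: float = 3e-4) -> float:
    """Illustrative four-stage schedule: linear warmup, power-law decay,
    exponential decay, then linear decay to zero. All constants are
    hypothetical placeholders."""
    warmup_end, stage3_start, stage4_start = 0.02, 0.80, 0.95
    t = step / total
    if t < warmup_end:                        # linear warmup
        return peak * t / warmup_end
    if t < stage3_start:                      # stages I-II: power scheduler
        return peak * (t / warmup_end) ** -0.5
    power_end = peak * (stage3_start / warmup_end) ** -0.5
    if t < stage4_start:                      # stage III: exponential decay
        return power_end * math.exp(-10.0 * (t - stage3_start))
    exp_end = power_end * math.exp(-10.0 * (stage4_start - stage3_start))
    # stage IV: linear decay to zero
    return exp_end * (1.0 - (t - stage4_start) / (1.0 - stage4_start))

total = 10_000
print(lr(total, total))  # ends at exactly 0.0
```

Each stage's starting rate is taken from the end of the previous stage, so the schedule is continuous at the boundaries; only the decay shape changes.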