Conversation Dataset Generator

Generate multi-speaker conversational datasets in ShareGPT format for LLM fine-tuning.

Specify personas, topics, and styles — or just give a creative brief and let the model handle the rest. Supports 2 to N speakers, Docker, and web search enrichment.

$ pip install -r requirements.txt && python generate.py --help

What It Does

Everything you need to build rich, varied conversational training data.

N-Speaker Conversations

Define 2, 3, or more personas inline with --persona or from a YAML file with --personas. No hard limit on participants.

Character Pools & Random Pairings

Load YAML character pools and randomly pair or group them each conversation. Combine with --group-size 3 for N-way pairings at scale.
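A pool file might look something like this (an illustrative sketch only — the field names here are assumptions, so check the repository's example YAML for the exact schema the tool expects):

```yaml
# characters.yaml — hypothetical pool layout
- name: "Iron Man"
  description: "Genius billionaire, rapid-fire wit"
- name: "Captain America"
  description: "Principled, earnest, old-fashioned"
- name: "Thor"
  description: "Boisterous god, Shakespearean flair"
```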

Creative Brief Mode

Give a one-sentence idea. The model brainstorms personas, topic, scenario, and style — then varies each conversation automatically.

Continue Conversations

Extend any existing conversation in a JSONL file with --continue-from. Target a specific one with --conversation-id.

Role Mapping for Training

Use --train-speaker to designate which character's turns become the gpt role. Fine-grained control with --role-mapping.

Docker Support

Build once, run anywhere with GPU passthrough. Supports CUDA 12.x (default) and CUDA 13.x via a build arg for the latest RTX hardware.

5 Generation Modes

Pick the approach that matches your workflow.

Manual

Fully specify topic, personas, scenario, and style. Deterministic — every conversation uses your exact parameters.

Creative Brief

One sentence is all you need. The LLM invents everything else and varies the topic/scenario per conversation. Optionally enrich personas with web search.

Fixed Persona + Variation

Lock in the characters, then add --enable-variation to let the model vary the topic and scenario each time.

Random Pairings

Provide character pool YAML files. Each conversation draws a fresh random pair (or group) from the pool. Great for large-scale diverse datasets.
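The sampling step can be sketched in a few lines of Python (a simplified illustration of the idea, not the tool's actual code; `draw_group` is a hypothetical helper and the list-of-dicts pool layout is an assumption):

```python
import random

def draw_group(pool, group_size=2, seed=None):
    """Draw a fresh random group of personas from a character pool.

    Each conversation calls this again, so large pools yield many
    distinct pairings without repeating the same cast every time.
    """
    rng = random.Random(seed)  # seed only for reproducible demos
    return rng.sample(pool, group_size)

pool = [{"name": "Ada"}, {"name": "Grace"}, {"name": "Alan"}, {"name": "Edsger"}]
print(draw_group(pool, group_size=3, seed=0))
```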

Continue Conversation

Pick up where you left off. Pass an existing JSONL file and the model continues the dialogue naturally from the last turn.

Quick Examples

Each example below shows a different mode in action.

Creative brief

# One sentence is enough
python generate.py \
  --creative-brief "Sherlock Holmes and Watson debate whether AI will replace detectives" \
  --num-examples 10 \
  --output-file sherlock.jsonl

# Enrich with web search context
python generate.py \
  --creative-brief "Linus Torvalds and Tim Cook debate open source" \
  --persona1-search-term "Linus Torvalds" \
  --persona2-search-term "Tim Cook Apple CEO" \
  --num-examples 10 \
  --output-file tech_debate.jsonl

What happens

  1. The model expands the brief into full persona descriptions, topic, scenario, and style.
  2. For each conversation, the topic and scenario are varied automatically to avoid repetition.
  3. Output is written as JSONL — one line per turn, ready for training.

Add --persona1-search-term / --persona2-search-term to pull live web context for real-world personas.

Multi-speaker

# 3-way conversation, inline
python generate.py \
  --persona "Iron Man" "Genius billionaire, rapid-fire wit" \
  --persona "Captain America" "Principled, earnest, old-fashioned" \
  --persona "Thor" "Boisterous god, Shakespearean flair" \
  --topic "who pays for the pizza" \
  --scenario "Avengers break room" \
  --style "comedic argument" \
  --num-examples 10 \
  --output-file avengers_pizza.jsonl

# Or load from a YAML file
python generate.py \
  --personas avengers.yaml \
  --topic "mission planning" \
  --output-file avengers.jsonl

Sample output

Iron Man
Look, I'll cover the pizza. JARVIS, order six large pies. No, make it eight — Thor is here.
Thor
Eight? I require at minimum twelve. I have just returned from Asgard and the realm eternal offers no Neapolitan fare.
Captain America
We should split it fairly. Tony, you can't just buy your way out of every team decision.
Iron Man
Steve, I literally fund this entire operation. "Fairly" is me paying.

Manual

# Specify everything explicitly
python generate.py \
  --persona1 "Tony" \
  --persona1-desc "A passionate Italian chef" \
  --persona2 "Dave" \
  --persona2-desc "A pineapple-on-pizza enthusiast" \
  --topic "best pizza toppings" \
  --scenario "kitchen argument" \
  --style "heated but friendly debate" \
  --num-examples 10 \
  --output-file pizza_debate.jsonl

Notes

  • Every generated conversation uses exactly these parameters — no variation unless you add --enable-variation.
  • Legacy --persona1/--persona2 flags still work for 2-speaker setups.
  • Use --persona (repeatable) to go beyond 2 speakers in the same deterministic style.
  • Default model: Qwen/Qwen2.5-7B-Instruct, up to 4096 tokens per generation.

Continue a conversation

# Extend the last conversation in a file
python generate.py \
  --continue-from conversations.jsonl \
  --output-file more.jsonl

# Continue a specific conversation by ID
python generate.py \
  --continue-from conversations.jsonl \
  --conversation-id 5 \
  --output-file more.jsonl

Train a specific character

# All of Cap's turns become the "gpt" role
python generate.py \
  --persona "Iron Man" "Genius billionaire" \
  --persona "Captain America" "Principled leader" \
  --persona "Thor" "Boisterous god" \
  --train-speaker "Captain America" \
  --topic "mission planning" \
  --output-file cap_training.jsonl

How role mapping works

The role field controls what training frameworks see:

human Input / context — the model sees this
gpt Target — the model learns to generate this

Default: first persona = human, all others = gpt.

Use --train-speaker NAME to designate one character as the gpt target, making everyone else context.

For full control: --role-mapping "Iron Man=human,Captain America=gpt,Thor=human"
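The precedence of these three options can be sketched in Python (a simplified illustration of the rules described above, not the tool's actual implementation; `assign_roles` is a hypothetical helper that takes the mapping as a dict rather than the CLI's comma-separated string):

```python
def assign_roles(speakers, train_speaker=None, role_mapping=None):
    """Map each speaker name to a ShareGPT role ("human" or "gpt").

    Precedence, per the rules above:
      1. an explicit role mapping wins,
      2. a train-speaker makes that one speaker "gpt", everyone else "human",
      3. otherwise the first persona is "human" and all others are "gpt".
    """
    if role_mapping:
        return dict(role_mapping)
    if train_speaker:
        return {s: ("gpt" if s == train_speaker else "human") for s in speakers}
    return {s: ("human" if i == 0 else "gpt") for i, s in enumerate(speakers)}

avengers = ["Iron Man", "Captain America", "Thor"]
print(assign_roles(avengers))                                   # default rule
print(assign_roles(avengers, train_speaker="Captain America"))  # one gpt target
```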

Output Format

ShareGPT-style JSONL — one line per turn, ready for training frameworks.

Single turn (one JSONL line, pretty-printed here for readability)

{
  "conversation_id": 0,
  "turn_number": 2,
  "role": "gpt",
  "speaker_name": "Captain America",
  "topic": "who pays for the pizza",
  "scenario": "Avengers break room",
  "style": "comedic argument",
  "include_points": "",
  "content": "We should split it fairly. Tony, you can't just buy your way out of every team decision."
}

Field reference

Field            Description
conversation_id  Groups turns into one conversation
turn_number      Position within the conversation
role             human or gpt, for training
speaker_name     Actual character name (always set)
content          The spoken dialogue text
Upload directly to Hugging Face Hub with --upload-to-hub REPO.
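Because every line is a self-contained turn, downstream processing stays simple. As a sketch, regrouping turns into nested ShareGPT-style conversation records (field names taken from the example above; the output record shape is an assumption, so adapt it to whatever your training framework expects):

```python
import json
from collections import defaultdict

def group_turns(jsonl_lines):
    """Regroup per-turn JSONL records into ShareGPT-style conversations."""
    convs = defaultdict(list)
    for line in jsonl_lines:
        turn = json.loads(line)
        convs[turn["conversation_id"]].append(turn)
    records = []
    for cid, turns in sorted(convs.items()):
        turns.sort(key=lambda t: t["turn_number"])  # restore turn order
        records.append({
            "id": cid,
            "conversations": [
                {"from": t["role"], "value": t["content"]} for t in turns
            ],
        })
    return records
```

Feed it lines straight from an output file, e.g. `group_turns(open("avengers_pizza.jsonl"))`.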

Get Started

Two paths: pip for local development, Docker for reproducible GPU environments.

pip (local)

# Clone and set up
git clone https://github.com/cahlen/conversation-dataset-generator.git
cd conversation-dataset-generator
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Optional: 4-bit quantization
pip install bitsandbytes

# Optional: web search enrichment
pip install duckduckgo-search

# Optional: HF Hub upload
huggingface-cli login

Requires Python 3.10+ and an NVIDIA GPU with CUDA.

Docker

# Build (CUDA 12.x — 30xx/40xx/50xx)
docker build -t cdg .

# Build for CUDA 13.x (RTX 50xx latest)
docker build \
  --build-arg CUDA_VERSION=13.0.0 \
  -t cdg .

# Run with GPU passthrough
docker run --gpus all \
  -v $(pwd)/output:/app/output cdg \
  --creative-brief "Two scientists argue about time travel" \
  --output-file output/data.jsonl

# Or use docker compose
docker compose run cdg \
  --creative-brief "..." \
  --output-file output/data.jsonl

Requires Docker with NVIDIA Container Toolkit.

Built with Python, PyTorch, Transformers, DuckDuckGo Search, and bitsandbytes. Default model: Qwen/Qwen2.5-7B-Instruct.

121 Tests, No GPU Required

All LLM calls are mocked. Run the full test suite on any machine.

# Install dev deps and run tests
pip install -r requirements-dev.txt
pytest tests/ -v