Generate multi-speaker conversational datasets in ShareGPT format for LLM fine-tuning.
Specify personas, topics, and styles — or just give a creative brief and let the model handle the rest. Supports 2 to N speakers, Docker, and web search enrichment.
Everything you need to build rich, varied conversational training data.
Define 2, 3, or more personas inline with --persona or from a YAML file with --personas. No hard limit on participants.
Load YAML character pools and randomly pair or group speakers for each conversation. Combine with --group-size 3 for N-way groupings at scale.
Give a one-sentence idea. The model brainstorms personas, topic, scenario, and style — then varies each conversation automatically.
Extend any existing conversation in a JSONL file with --continue-from. Target a specific one with --conversation-id.
Use --train-speaker to designate which character's turns become the gpt role. Fine-grained control with --role-mapping.
Build once, run anywhere with GPU passthrough. Supports CUDA 12.x (default) and CUDA 13.x via a build arg for the latest RTX hardware.
Pick the approach that matches your workflow.
Fully specify topic, personas, scenario, and style. Deterministic — every conversation uses your exact parameters.
One sentence is all you need. The LLM invents everything else and varies the topic/scenario per conversation. Optionally enrich personas with web search.
Specify the characters explicitly, then add --enable-variation to let the model vary the topic and scenario each time.
Provide character pool YAML files. Each conversation draws a fresh random pair (or group) from the pool. Great for large-scale diverse datasets.
Pick up where you left off. Pass an existing JSONL file and the model continues the dialogue naturally from the last turn.
Click a tab to see a different mode in action.
# One sentence is enough
python generate.py \
--creative-brief "Sherlock Holmes and Watson \
debate whether AI will replace \
detectives" \
--num-examples 10 \
--output-file sherlock.jsonl
# Enrich with web search context
python generate.py \
--creative-brief "Linus Torvalds and Tim Cook \
debate open source" \
--persona1-search-term "Linus Torvalds" \
--persona2-search-term "Tim Cook Apple CEO" \
--num-examples 10 \
--output-file tech_debate.jsonl
Use --persona1-search-term / --persona2-search-term to pull live web context for real-world personas.
# 3-way conversation, inline
python generate.py \
--persona "Iron Man" "Genius billionaire, rapid-fire wit" \
--persona "Captain America" "Principled, earnest, old-fashioned" \
--persona "Thor" "Boisterous god, Shakespearean flair" \
--topic "who pays for the pizza" \
--scenario "Avengers break room" \
--style "comedic argument" \
--num-examples 10 \
--output-file avengers_pizza.jsonl
# Or load from YAML file
python generate.py \
--personas avengers.yaml \
--topic "mission planning" \
--output-file avengers.jsonl
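The exact schema of the personas YAML file isn't shown here; the sketch below is a plausible layout, with the `personas`, `name`, and `description` field names as assumptions — check the repository docs for the real schema.

```yaml
# avengers.yaml — hypothetical schema for a --personas file
personas:
  - name: "Iron Man"
    description: "Genius billionaire, rapid-fire wit"
  - name: "Captain America"
    description: "Principled, earnest, old-fashioned"
  - name: "Thor"
    description: "Boisterous god, Shakespearean flair"
```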
# Specify everything explicitly
python generate.py \
--persona1 "Tony" \
--persona1-desc "A passionate Italian chef" \
--persona2 "Dave" \
--persona2-desc "A pineapple-on-pizza enthusiast" \
--topic "best pizza toppings" \
--scenario "kitchen argument" \
--style "heated but friendly debate" \
--num-examples 10 \
--output-file pizza_debate.jsonl
Combine with --enable-variation to keep these personas while varying the topic and scenario. The --persona1/--persona2 flags still work for 2-speaker setups; use --persona (repeatable) to go beyond 2 speakers in the same deterministic style. Uses Qwen/Qwen2.5-7B-Instruct, up to 4096 tokens per generation.
# Extend the last conversation in a file
python generate.py \
--continue-from conversations.jsonl \
--output-file more.jsonl
# Continue a specific conversation by ID
python generate.py \
--continue-from conversations.jsonl \
--conversation-id 5 \
--output-file more.jsonl
# All of Cap's turns become "gpt" role
python generate.py \
--persona "Iron Man" "Genius billionaire" \
--persona "Captain America" "Principled leader" \
--persona "Thor" "Boisterous god" \
--train-speaker "Captain America" \
--topic "mission planning" \
--output-file cap_training.jsonl
The role field controls what training frameworks see:
Default: first persona = human, all others = gpt.
Use --train-speaker NAME to designate one character as the gpt target, making everyone else context.
For full control: --role-mapping "Iron Man=human,Captain America=gpt,Thor=human"
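The --role-mapping value is a comma-separated list of NAME=role pairs. As an illustration of that format only (a sketch, not the tool's actual parser), it can be read into a dict like this:

```python
def parse_role_mapping(mapping: str) -> dict:
    """Parse a --role-mapping string like "Iron Man=human,Thor=gpt" into a dict."""
    roles = {}
    for pair in mapping.split(","):
        name, _, role = pair.partition("=")
        roles[name.strip()] = role.strip()
    return roles
```

Speaker names may contain spaces, so the split happens on commas first and on the first `=` within each pair.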
ShareGPT-style JSONL — one line per turn, ready for training frameworks.
{
  "conversation_id": 0,
  "turn_number": 2,
  "role": "gpt",
  "speaker_name": "Captain America",
  "topic": "who pays for the pizza",
  "scenario": "Avengers break room",
  "style": "comedic argument",
  "include_points": "",
  "content": "We should split it fairly. Tony, you can't just buy your way out of every team decision."
}
| Field | Description |
|---|---|
| conversation_id | Groups turns into one conversation |
| turn_number | Position within the conversation |
| role | human or gpt for training |
| speaker_name | Actual character name (always set) |
| content | The spoken dialogue text |
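Because each JSONL line is a single turn, training frameworks that expect whole ShareGPT conversations need the turns regrouped first. A minimal sketch (not part of the tool) that groups lines by conversation_id and orders them by turn_number:

```python
import json
from collections import defaultdict

def to_sharegpt(jsonl_lines):
    """Group per-turn JSONL records into ShareGPT-style conversation lists."""
    convos = defaultdict(list)
    for line in jsonl_lines:
        rec = json.loads(line)
        convos[rec["conversation_id"]].append(rec)
    result = []
    for cid in sorted(convos):
        # Order turns within each conversation, then keep only role + text
        turns = sorted(convos[cid], key=lambda r: r["turn_number"])
        result.append({"conversations": [
            {"from": r["role"], "value": r["content"]} for r in turns
        ]})
    return result
```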
Push finished datasets straight to the Hugging Face Hub with --upload-to-hub REPO.
Two paths: pip for local development, Docker for reproducible GPU environments.
# Clone and set up
git clone https://github.com/cahlen/\
conversation-dataset-generator.git
cd conversation-dataset-generator
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Optional: 4-bit quantization
pip install bitsandbytes
# Optional: web search enrichment
pip install duckduckgo-search
# Optional: HF Hub upload
huggingface-cli login
Requires Python 3.10+ and an NVIDIA GPU with CUDA.
# Build (CUDA 12.x — 30xx/40xx/50xx)
docker build -t cdg .
# Build for CUDA 13.x (RTX 50xx latest)
docker build \
--build-arg CUDA_VERSION=13.0.0 \
-t cdg .
# Run with GPU passthrough
docker run --gpus all \
-v $(pwd)/output:/app/output cdg \
--creative-brief "Two scientists \
argue about time travel" \
--output-file output/data.jsonl
# Or use docker compose
docker compose run cdg \
--creative-brief "..." \
--output-file output/data.jsonl
Requires Docker with NVIDIA Container Toolkit.
All LLM calls are mocked. Run the full test suite on any machine.
# Install dev deps and run tests
pip install -r requirements-dev.txt
pytest tests/ -v