Conversation Dataset Generator

Generate multi-speaker conversational datasets in ShareGPT format for LLM fine-tuning.

Specify personas, topics, and styles — or just give a creative brief and let the model handle the rest. Supports 2 to N speakers, Docker, and web search enrichment.

$ pip install -r requirements.txt && python generate.py --help

What It Does

Everything you need to build rich, varied conversational training data.

N-Speaker Conversations

Define 2, 3, or more personas inline with --persona or from a YAML file with --personas. No hard limit on participants.

Character Pools & Random Pairings

Load YAML character pools and randomly pair or group them each conversation. Combine with --group-size 3 for N-way pairings at scale.
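A pool file might look something like this (an illustrative sketch only — the field names here are assumptions, so check the repository's example YAML for the exact schema the tool expects):

```yaml
# characters.yaml — hypothetical pool layout
- name: "Iron Man"
  description: "Genius billionaire, rapid-fire wit"
- name: "Captain America"
  description: "Principled, earnest, old-fashioned"
- name: "Thor"
  description: "Boisterous god, Shakespearean flair"
```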

Creative Brief Mode

Give a one-sentence idea. The model brainstorms personas, topic, scenario, and style — then varies each conversation automatically.

Continue Conversations

Extend any existing conversation in a JSONL file with --continue-from. Target a specific one with --conversation-id.

Role Mapping for Training

Use --train-speaker to designate which character's turns become the gpt role. Fine-grained control with --role-mapping.

Docker Support

Build once, run anywhere with GPU passthrough. Supports CUDA 12.x (default) and CUDA 13.x via a build arg for the latest RTX hardware.

5 Generation Modes

Pick the approach that matches your workflow.

Manual

Fully specify topic, personas, scenario, and style. Deterministic — every conversation uses your exact parameters.

Creative Brief

One sentence is all you need. The LLM invents everything else and varies the topic/scenario per conversation. Optionally enrich personas with web search.

Fixed Persona + Variation

Lock in the characters, then add --enable-variation to let the model vary the topic and scenario each time.

Random Pairings

Provide character pool YAML files. Each conversation draws a fresh random pair (or group) from the pool. Great for large-scale diverse datasets.
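The sampling step can be sketched in a few lines of Python (a simplified illustration of the idea, not the tool's actual code; `draw_group` is a hypothetical helper and the list-of-dicts pool layout is an assumption):

```python
import random

def draw_group(pool, group_size=2, seed=None):
    """Draw a fresh random group of personas from a character pool.

    Each conversation calls this again, so large pools yield many
    distinct pairings without repeating the same cast every time.
    """
    rng = random.Random(seed)  # seed only for reproducible demos
    return rng.sample(pool, group_size)

pool = [{"name": "Ada"}, {"name": "Grace"}, {"name": "Alan"}, {"name": "Edsger"}]
print(draw_group(pool, group_size=3, seed=0))
```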

Continue Conversation

Pick up where you left off. Pass an existing JSONL file and the model continues the dialogue naturally from the last turn.

Quick Examples

Each example below shows a different mode in action.

Creative brief

# One sentence is enough
python generate.py \
  --creative-brief "Sherlock Holmes and Watson debate whether AI will replace detectives" \
  --num-examples 10 \
  --output-file sherlock.jsonl

# Enrich with web search context
python generate.py \
  --creative-brief "Linus Torvalds and Tim Cook debate open source" \
  --persona1-search-term "Linus Torvalds" \
  --persona2-search-term "Tim Cook Apple CEO" \
  --num-examples 10 \
  --output-file tech_debate.jsonl

What happens

  1. The model expands the brief into full persona descriptions, topic, scenario, and style.
  2. For each conversation, the topic and scenario are varied automatically to avoid repetition.
  3. Output is written as JSONL — one line per turn, ready for training.

Add --persona1-search-term / --persona2-search-term to pull live web context for real-world personas.

Multi-speaker

# 3-way conversation, inline
python generate.py \
  --persona "Iron Man" "Genius billionaire, rapid-fire wit" \
  --persona "Captain America" "Principled, earnest, old-fashioned" \
  --persona "Thor" "Boisterous god, Shakespearean flair" \
  --topic "who pays for the pizza" \
  --scenario "Avengers break room" \
  --style "comedic argument" \
  --num-examples 10 \
  --output-file avengers_pizza.jsonl

# Or load from a YAML file
python generate.py \
  --personas avengers.yaml \
  --topic "mission planning" \
  --output-file avengers.jsonl

Sample output

Iron Man
Look, I'll cover the pizza. JARVIS, order six large pies. No, make it eight — Thor is here.
Thor
Eight? I require at minimum twelve. I have just returned from Asgard and the realm eternal offers no Neapolitan fare.
Captain America
We should split it fairly. Tony, you can't just buy your way out of every team decision.
Iron Man
Steve, I literally fund this entire operation. "Fairly" is me paying.

Manual

# Specify everything explicitly
python generate.py \
  --persona1 "Tony" \
  --persona1-desc "A passionate Italian chef" \
  --persona2 "Dave" \
  --persona2-desc "A pineapple-on-pizza enthusiast" \
  --topic "best pizza toppings" \
  --scenario "kitchen argument" \
  --style "heated but friendly debate" \
  --num-examples 10 \
  --output-file pizza_debate.jsonl

Notes

  • Every generated conversation uses exactly these parameters — no variation unless you add --enable-variation.
  • Legacy --persona1/--persona2 flags still work for 2-speaker setups.
  • Use --persona (repeatable) to go beyond 2 speakers in the same deterministic style.
  • Default model: Qwen/Qwen2.5-7B-Instruct, up to 4096 tokens per generation.

Continue a conversation

# Extend the last conversation in a file
python generate.py \
  --continue-from conversations.jsonl \
  --output-file more.jsonl

# Continue a specific conversation by ID
python generate.py \
  --continue-from conversations.jsonl \
  --conversation-id 5 \
  --output-file more.jsonl

Train a specific character

# All of Cap's turns become the "gpt" role
python generate.py \
  --persona "Iron Man" "Genius billionaire" \
  --persona "Captain America" "Principled leader" \
  --persona "Thor" "Boisterous god" \
  --train-speaker "Captain America" \
  --topic "mission planning" \
  --output-file cap_training.jsonl

How role mapping works

The role field controls what training frameworks see:

human Input / context — the model sees this
gpt Target — the model learns to generate this

Default: first persona = human, all others = gpt.

Use --train-speaker NAME to designate one character as the gpt target, making everyone else context.

For full control: --role-mapping "Iron Man=human,Captain America=gpt,Thor=human"
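The precedence of these three options can be sketched in Python (a simplified illustration of the rules described above, not the tool's actual implementation; `assign_roles` is a hypothetical helper that takes the mapping as a dict rather than the CLI's comma-separated string):

```python
def assign_roles(speakers, train_speaker=None, role_mapping=None):
    """Map each speaker name to a ShareGPT role ("human" or "gpt").

    Precedence, per the rules above:
      1. an explicit role mapping wins,
      2. a train-speaker makes that one speaker "gpt", everyone else "human",
      3. otherwise the first persona is "human" and all others are "gpt".
    """
    if role_mapping:
        return dict(role_mapping)
    if train_speaker:
        return {s: ("gpt" if s == train_speaker else "human") for s in speakers}
    return {s: ("human" if i == 0 else "gpt") for i, s in enumerate(speakers)}

avengers = ["Iron Man", "Captain America", "Thor"]
print(assign_roles(avengers))                                   # default rule
print(assign_roles(avengers, train_speaker="Captain America"))  # one gpt target
```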

Output Format

ShareGPT-style JSONL — one line per turn, ready for training frameworks.

Single turn (one JSONL line, pretty-printed here for readability)

{
  "conversation_id": 0,
  "turn_number": 2,
  "role": "gpt",
  "speaker_name": "Captain America",
  "topic": "who pays for the pizza",
  "scenario": "Avengers break room",
  "style": "comedic argument",
  "include_points": "",
  "content": "We should split it fairly. Tony, you can't just buy your way out of every team decision."
}

Field reference

Field            Description
conversation_id  Groups turns into one conversation
turn_number      Position within the conversation
role             human or gpt, for training
speaker_name     Actual character name (always set)
content          The spoken dialogue text
Upload directly to Hugging Face Hub with --upload-to-hub REPO.
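Because every line is a self-contained turn, downstream processing stays simple. As a sketch, regrouping turns into nested ShareGPT-style conversation records (field names taken from the example above; the output record shape is an assumption, so adapt it to whatever your training framework expects):

```python
import json
from collections import defaultdict

def group_turns(jsonl_lines):
    """Regroup per-turn JSONL records into ShareGPT-style conversations."""
    convs = defaultdict(list)
    for line in jsonl_lines:
        turn = json.loads(line)
        convs[turn["conversation_id"]].append(turn)
    records = []
    for cid, turns in sorted(convs.items()):
        turns.sort(key=lambda t: t["turn_number"])  # restore turn order
        records.append({
            "id": cid,
            "conversations": [
                {"from": t["role"], "value": t["content"]} for t in turns
            ],
        })
    return records
```

Feed it lines straight from an output file, e.g. `group_turns(open("avengers_pizza.jsonl"))`.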

Get Started

Two paths: pip for local development, Docker for reproducible GPU environments.

pip (local)

# Clone and set up
git clone https://github.com/cahlen/conversation-dataset-generator.git
cd conversation-dataset-generator
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Optional: 4-bit quantization
pip install bitsandbytes

# Optional: web search enrichment
pip install duckduckgo-search

# Optional: HF Hub upload
huggingface-cli login

Requires Python 3.10+ and an NVIDIA GPU with CUDA.

Docker

# Build (CUDA 12.x — 30xx/40xx/50xx)
docker build -t cdg .

# Build for CUDA 13.x (RTX 50xx latest)
docker build \
  --build-arg CUDA_VERSION=13.0.0 \
  -t cdg .

# Run with GPU passthrough
docker run --gpus all \
  -v $(pwd)/output:/app/output cdg \
  --creative-brief "Two scientists argue about time travel" \
  --output-file output/data.jsonl

# Or use docker compose
docker compose run cdg \
  --creative-brief "..." \
  --output-file output/data.jsonl

Requires Docker with NVIDIA Container Toolkit.

Built with Python, PyTorch, Transformers, DuckDuckGo Search, and bitsandbytes. Default model: Qwen/Qwen2.5-7B-Instruct.

121 Tests, No GPU Required

All LLM calls are mocked. Run the full test suite on any machine.

# Install dev deps and run tests
pip install -r requirements-dev.txt
pytest tests/ -v