Craft High-Quality Dialogue Data for Your LLMs. Specify personas, topics, scenarios, or just give a creative brief and let the AI handle the rest.
Ever wish you could generate just the right kind of conversational data? Whether you're fine-tuning an LLM for a specific style or persona, need dialogue for a creative project, or want to explore complex topics in a natural flow, the Conversation Dataset Generator is here to help!
This powerful Python script leverages Hugging Face's `transformers` library and operates in several modes, including large-scale batch generation via `batch_generate.py`. The output is a clean JSON Lines (`.jsonl`) file, perfect for downstream tasks.
Best of all, the code is fully open source under the MIT license, giving you the freedom to use, modify, and extend it however you see fit!
```bash
# 1. Clone repo (optional)
git clone https://github.com/cahlen/conversation-dataset-generator.git
cd conversation-dataset-generator

# 2. Set up a virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install base dependencies
pip install -r requirements.txt

# 4. Install PyTorch for GPU support
# Visit: pytorch.org/get-started/locally
# Example: pip3 install torch ...

# 5. Install optional dependencies
# For Brief Mode web search:
pip install duckduckgo-search
# For LoRA examples / 4-bit quantization:
pip install -U peft trl bitsandbytes

# 6. Log in to the HF Hub (optional)
huggingface-cli login
```
There are two main ways to generate data:

- `generate.py`: Generate a single dataset based on CLI arguments or a creative brief.
- `batch_generate.py`: Run multiple generation jobs defined in a YAML config file for large-scale creation.

Single Dataset Generation (`generate.py`)

Use one of these modes:
Manual Mode: provide all parameters explicitly.
```bash
python generate.py \
  --persona1 "Wizard" --persona1-desc "Grumpy" \
  --persona2 "Knight" --persona2-desc "Cheerful" \
  --topic "Polishing armor" \
  --scenario "Dungeon waiting room" \
  --style "Comedic bickering" \
  --num-examples 10 \
  --output-file wiz_knight.jsonl
```
Creative Brief Mode: provide a high-level brief; the script uses the LLM (plus optional web search) to flesh out personas and generate topic variations.
```bash
python generate.py \
  --creative-brief "Pirate orders coffee" \
  --num-examples 15 \
  --persona1-search-term "Pirate captain traits" \
  --output-file pirate_coffee.jsonl
```
Fixed Persona Mode: define fixed personas and an initial context, then generate varied conversations around them.
```bash
python generate.py \
  --enable-variation \
  --fixed-persona1 "Mick Jagger" \
  --fixed-persona1-desc "Rolling Stones frontman..." \
  --fixed-persona2 "Ozzy Osbourne" \
  --fixed-persona2-desc "Prince of Darkness..." \
  --initial-topic "Modern rock & reality TV" \
  --initial-scenario "Awards show backstage" \
  --initial-style "Amusing clash, rambling" \
  --num-examples 20 \
  --output-file jagger_ozzy_fixed.jsonl
```
(See Argument Reference or README for full details.)
Batch Generation (`batch_generate.py`)

Run multiple jobs efficiently using a YAML config. Define your runs, mixing modes freely; see the `examples/` directory for samples.
```yaml
# examples/batch_config.yaml
output_directory: "./batch_runs"
force_upload: true  # Optional global flag
runs:
  - id: "run1_brief_search"
    output_file: "r1_brief.jsonl"
    num_examples: 50
    model_id: "meta-llama/Meta-Llama-3-8B-Instruct"
    creative_brief: "Einstein explains relativity to a cat"
    persona2_search_term: "Typical cat personality"
    upload_repo: "YourUser/EinsteinCat"
    load_in_4bit: true
  - id: "run2_manual"
    output_file: "r2_manual.jsonl"
    num_examples: 25
    model_id: "google/gemma-2b-it"
    manual_args:
      topic: "Best way to store cheese"
      persona1: "Mouse"
```
Execute the script, pointing it at your config file:
```bash
# Activate the environment first!
source venv/bin/activate
python batch_generate.py examples/batch_config.yaml
```
Workflow (`generate.py`)

```text
1. Start generate.py
 |--> If --delete-repo: Confirm & Delete -> Exit.
 |--> If --creative-brief:
 |      [Optional Web Search] -> Gen Base Args (LLM) -> [Optional Image Search]
 |      Loop N (--num-examples):
 |        Gen Topic Variation (LLM) -> Gen Conversation (LLM) -> Parse & Store
 |      -> Save JSONL -> [Optional Upload] -> Exit.
 |--> If --enable-variation (Fixed Persona):
 |      Use fixed personas & initial context -> [Optional Image Search]
 |      Loop N (--num-examples):
 |        Gen Topic Variation (LLM) -> Gen Conversation (LLM) -> Parse & Store
 |      -> Save JSONL -> [Optional Upload] -> Exit.
 |--> Else (Manual Mode):
        Use provided args -> [Optional Image Search]
        Loop N (--num-examples):
          Gen Conversation (LLM) -> Parse & Store
        -> Save JSONL -> [Optional Upload] -> Exit.
```
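The "Loop N -> Parse & Store -> Save JSONL" flow can be sketched in plain Python. This is an illustrative sketch only: the two `gen_*` helpers are placeholders standing in for the script's real LLM calls, not its actual internals.

```python
import json

# Placeholder for an LLM call that varies the topic each iteration.
def gen_topic_variation(base_topic, i):
    return f"{base_topic} (variation {i})"

# Placeholder for an LLM call that produces (role, content) turns.
def gen_conversation(topic):
    return [("human", f"Tell me about {topic}."),
            ("gpt", f"Gladly! Let's discuss {topic}.")]

def generate_dataset(base_topic, num_examples, output_file):
    with open(output_file, "w") as f:
        for conv_id in range(num_examples):          # Loop N (--num-examples)
            topic = gen_topic_variation(base_topic, conv_id)
            for turn, (role, content) in enumerate(gen_conversation(topic)):
                record = {"conversation_id": conv_id, "turn_number": turn,
                          "role": role, "topic": topic, "content": content}
                f.write(json.dumps(record) + "\n")   # Parse & Store -> Save JSONL

generate_dataset("Polishing armor", 3, "demo.jsonl")
```

Each conversation gets its own `conversation_id`, and every turn becomes one JSONL line, matching the output format described below.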
Argument Reference (`generate.py`)

Provide EITHER `--creative-brief` OR the manual/fixed arguments below.

Brief Mode:
- `--creative-brief`: High-level concept.
- `--personaX-search-term`: Web search query for persona details.

Manual Mode:
- `--topic`: Subject.
- `--scenario`: Setting.
- `--style`: Dialogue style.
- `--personaX`: Persona name.
- `--personaX-desc`: Persona description.

Fixed Persona Mode:
- `--enable-variation`: Enable variation.
- `--fixed-personaX`: Persona name.
- `--fixed-personaX-desc`: Persona description.
- `--initial-topic`: Seed topic.
- `--initial-scenario`: Seed scenario.
- `--initial-style`: Seed style.

Common:
- `--include-points`: Keywords to include (Manual/Fixed).
- `--num-examples`: Number of conversations.
- `--output-file`: Output `.jsonl` path.
- `--model-id`: HF generation model.
- `--upload-to-hub`: HF Hub repo ID (optional).
- `--load-in-4bit`: Use 4-bit quantization.
- `--delete-repo`: Danger: deletes the specified HF repo.

Example Use Cases

```bash
python generate.py \
  --num-examples 1000 \
  --topic "Absurdity of errands" \
  --persona1 "Alex" --persona1-desc "Neurotic" \
  --persona2 "Sam" --persona2-desc "Laid-back" \
  --scenario "Post office line" \
  --style "Observational, witty, Seinfeld-esque" \
  --output-file sitcom_style.jsonl
```
```bash
python generate.py \
  --num-examples 500 \
  --topic "Debugging Python IndexError" \
  --persona1 "MentorBot" --persona1-desc "Patient tutor AI" \
  --persona2 "Learner" --persona2-desc "Beginner stuck" \
  --scenario "Online chat session" \
  --style "Supportive, educational" \
  --output-file mentor_persona.jsonl
```

```bash
python generate.py \
  --creative-brief "Debate: da Vinci & Marie Curie" \
  --num-examples 25 \
  --output-file hist_debate.jsonl
```

```bash
python generate.py \
  --creative-brief "MKBHD talks cameras w/ Kubrick" \
  --num-examples 5 \
  --persona1-search-term "Marques Brownlee style" \
  --persona2-search-term "Stanley Kubrick style" \
  --output-file mkbhd_kubrick.jsonl
```

```bash
python generate.py \
  --creative-brief "Existential toaster talks to pigeons" \
  --num-examples 10 \
  --output-file toaster_pigeons.jsonl
```
```bash
python generate.py \
  --enable-variation \
  --fixed-persona1 "Mick Jagger" \
  --fixed-persona1-desc "Rolling Stones frontman..." \
  --fixed-persona2 "Ozzy Osbourne" \
  --fixed-persona2-desc "Prince of Darkness..." \
  --initial-topic "Modern rock & reality TV" \
  --initial-scenario "Awards show backstage" \
  --initial-style "Amusing clash, rambling" \
  --num-examples 20 \
  --output-file jagger_ozzy_fixed.jsonl
```

Leverage Creative Brief mode with web search (`--personaX-search-term`) to generate dialogue grounded in current events.
```bash
python generate.py \
  --creative-brief "Discussion: Mickey Rourke & JoJo Siwa on CBB UK remarks/apologies." \
  --persona1-search-term "Mickey Rourke CBB homophobic comments" \
  --persona2-search-term "JoJo Siwa response apology homophobic remark" \
  --num-examples 100 \
  --output-file trending_Rourke_Siwa.jsonl \
  --load-in-4bit
```

```bash
python generate.py \
  --creative-brief "Discussion: Jason Isaacs & Walton Goggins on 'White Lotus' feud rumors." \
  --persona1-search-term "Jason Isaacs White Lotus arguments on set" \
  --persona2-search-term "Walton Goggins feud rumors White Lotus" \
  --num-examples 100 \
  --output-file trending_Isaacs_Goggins.jsonl \
  --load-in-4bit
```

```bash
python generate.py \
  --creative-brief "Discussion: Katy Perry & Katie Jane Taylor on trademark battle over clothing brand." \
  --persona1-search-term "Katy Perry trademark battle clothing" \
  --persona2-search-term "Katie Jane Taylor trademark dispute Katy Perry" \
  --num-examples 100 \
  --output-file trending_Perry_Taylor.jsonl \
  --load-in-4bit
```
See the full list of examples in the README.
A truly unique capability emerges when you combine Batch Mode with Creative Briefs and an evolving `persona2_search_term` argument: you can generate entire conversational training courses, level by level!
By defining each course module as a run in your batch YAML, you can keep a consistent tutor persona while progressively changing the `persona2_search_term` to reflect the learner's expected knowledge and confusion points at each stage. This creates highly targeted, context-aware dialogue suitable for training sequential educational chatbot LoRAs, such as a multi-level AI curriculum.
This example uses `batch_generate.py` and the `examples/ai_course_curriculum.yaml` file (shown partially below) to create 6 levels of AI learning dialogues between `EnfuseBot` and `Learner`. The key is changing `persona2_search_term` in each run to simulate learner progress.
```yaml
# examples/ai_course_curriculum.yaml (snippet)
output_directory: "ai_course_datasets"
force_upload: true
runs:
  # Level 1: Intro
  - id: "level1_intro"
    output_file: "ai_course_level1_intro.jsonl"
    upload_repo: "cahlen/AICourse-Level1-Intro"
    num_examples: 500
    load_in_4bit: true
    creative_brief: "EnfuseBot introduces AI/ML concepts..."
    persona2_search_term: "Beginner Python programmer confused about AI..."
  # Level 2: Scikit-learn
  - id: "level2_sklearn"
    # ... (similar structure) ...
    creative_brief: "EnfuseBot explains core ML concepts..."
    persona2_search_term: "Learner starting Scikit-learn confused..."
  # ... (Levels 3-6 omitted for brevity) ...
```
Command to run:

```bash
python batch_generate.py examples/ai_course_curriculum.yaml
```
See the full YAML configuration file in the repository.
Output Format (`.jsonl`)

The dataset is saved locally and optionally uploaded to the HF Hub. Each line is a single conversation turn:

```jsonl
{"conversation_id": 0, "turn_number": 0, "role": "human", ...}
{"conversation_id": 0, "turn_number": 1, "role": "gpt", ...}
{"conversation_id": 1, "turn_number": 0, "role": "human", ...}
```
Keys: `conversation_id`, `turn_number`, `role`, `speaker_name`, `topic`, `scenario`, `style`, `include_points`, `content`.
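Because each line is one turn, downstream code usually regroups rows by `conversation_id` and orders them by `turn_number`. A minimal, dependency-free sketch (the sample lines below are invented for illustration):

```python
import json
from collections import defaultdict

# Sample turn-level lines as they appear in the .jsonl output (abbreviated).
lines = [
    '{"conversation_id": 0, "turn_number": 1, "role": "gpt", "content": "Hello!"}',
    '{"conversation_id": 0, "turn_number": 0, "role": "human", "content": "Hi"}',
    '{"conversation_id": 1, "turn_number": 0, "role": "human", "content": "Hey"}',
]

# Group rows into whole conversations.
conversations = defaultdict(list)
for line in lines:
    row = json.loads(line)
    conversations[row["conversation_id"]].append(row)

# Ensure turns are in order within each conversation.
for turns in conversations.values():
    turns.sort(key=lambda t: t["turn_number"])

print(len(conversations))              # 2 conversations
print(conversations[0][0]["content"])  # "Hi" (turn 0 comes first after sorting)
```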
See the full output details in the README.
Easily load your uploaded dataset:

```python
from datasets import load_dataset

# Replace with your repo ID
ds = load_dataset("YourUser/YourDatasetName")

# Access data
print(ds['train'][0])
```
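For chat fine-tuning, each conversation typically needs to be flattened into the `messages` format that chat-tuning tools such as TRL expect. The `human` to `user` and `gpt` to `assistant` mapping below is a common convention, not something the script prescribes, and the sample rows are invented for illustration:

```python
# Map the dataset's role names onto standard chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def rows_to_messages(rows):
    """Turn one conversation's rows into an ordered messages list."""
    ordered = sorted(rows, key=lambda r: r["turn_number"])
    return [{"role": ROLE_MAP[r["role"]], "content": r["content"]}
            for r in ordered]

# Example rows for a single conversation_id (content is made up).
rows = [
    {"conversation_id": 0, "turn_number": 1, "role": "gpt",
     "content": "Arr, a flat white!"},
    {"conversation_id": 0, "turn_number": 0, "role": "human",
     "content": "What'll it be, captain?"},
]
messages = rows_to_messages(rows)
print(messages[0]["role"])  # user
```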
Key details and roadmap:

- Default generation model: `Meta-Llama-3-8B-Instruct` (change via `--model-id`).
- LoRA fine-tuning examples rely on `peft`, `trl`, and `bitsandbytes`.
- Pass `--load-in-4bit` for memory efficiency (requires `bitsandbytes`).
- Batch mode (`batch_generate.py`) runs many jobs from a YAML config.
- `--validate-local-save` checks (placeholder).

See the open issues for more details.
Contributions are welcome! Fork the repo, create a feature branch, commit, push, and open a Pull Request.
Report bugs or suggest enhancements via GitHub Issues.
Don't forget to star the project! ⭐
Distributed under the MIT License.
See the LICENSE file for more information.
Cahlen Humphreys
Project: GitHub Repo
Start crafting high-quality conversational data for your LLMs today!
Get Started on GitHub