Conversation Dataset Generator

Craft High-Quality Dialogue Data for Your LLMs. Specify personas, topics, scenarios, or just give a creative brief and let the AI handle the rest.

About The Project

Ever wish you could generate just the right kind of conversational data? Whether you're fine-tuning an LLM for a specific style or persona, need dialogue for a creative project, or want to explore complex topics in a natural flow, the Conversation Dataset Generator is here to help!

This powerful Python script leverages Hugging Face's transformers library, operating in several modes:

  • Manual Mode: Specify everything – topic, personas, scenario, style, keywords.
  • Creative Brief Mode: Provide a high-level brief; the script brainstorms details, generates variations, and optionally uses web search.
  • Fixed Persona Mode: Define personas once, then generate varied conversations around an initial context.
  • Batch Mode: Run multiple configurations defined in a YAML file using batch_generate.py.

The output is a clean JSON Lines (.jsonl) file, perfect for downstream tasks.

Why Use This Generator?

Style Specialization: Train models for specific nuances (pirate speak, formal news anchor, etc.).
Persona Embodiment: Build believable characters, enhanced by web search.
Topic Fluency: Improve discussion of specific subjects naturally.
Instruction Adherence: Train models to follow constraints (e.g., keywords).
Creative Content: Break writer's block and draft dialogue.
Dialogue Analysis: Study conversation flow with structured output.

Best of all, the code is fully open source under the MIT license, giving you the freedom to use, modify, and extend it however you see fit!

Built With

Powered by cutting-edge libraries and tools:

Python
PyTorch
Transformers
Accelerate
Datasets
Hugging Face Hub
Pandas
DuckDuckGo Search
bitsandbytes
Tailwind CSS

Getting Started

Prerequisites

  • Python 3.8+
  • GPU (Recommended for speed)
  • A capable CPU & sufficient RAM
  • Internet access (for the web search modes)
  • Dependencies (see Installation)

Installation

```bash
# 1. Clone repo (optional)
git clone https://github.com/cahlen/conversation-dataset-generator.git
cd conversation-dataset-generator

# 2. Set up a virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install base dependencies
pip install -r requirements.txt

# 4. Install PyTorch for your GPU
# Visit: pytorch.org/get-started/locally
# Example: pip3 install torch ...

# 5. Install optional dependencies
# For Creative Brief Mode web search:
pip install duckduckgo-search
# For LoRA examples / 4-bit quantization:
pip install -U peft trl bitsandbytes

# 6. Log in to the HF Hub (optional)
huggingface-cli login
```

Usage

Two main ways to generate data:

  • generate.py: Generate a single dataset based on CLI arguments or a creative brief.
  • batch_generate.py: Run multiple generation jobs defined in a YAML config file for large-scale creation.

Single Generation (generate.py)

Use one of these modes:

Manual Mode

Provide all parameters explicitly.

```bash
python generate.py \
  --persona1 "Wizard" --persona1-desc "Grumpy" \
  --persona2 "Knight" --persona2-desc "Cheerful" \
  --topic "Polishing armor" \
  --scenario "Dungeon waiting room" \
  --style "Comedic bickering" \
  --num-examples 10 \
  --output-file wiz_knight.jsonl
```

Creative Brief Mode

Provide a high-level brief; the script uses an LLM (plus optional web search) to flesh out the details and generate per-example variations.

```bash
python generate.py \
  --creative-brief "Pirate orders coffee" \
  --num-examples 15 \
  --persona1-search-term "Pirate captain traits" \
  --output-file pirate_coffee.jsonl
```

Fixed Persona + Variation Mode

Define fixed personas & initial context, then generate varied conversations.

```bash
python generate.py \
  --enable-variation \
  --fixed-persona1 "Mick Jagger" \
  --fixed-persona1-desc "Rolling Stones frontman..." \
  --fixed-persona2 "Ozzy Osbourne" \
  --fixed-persona2-desc "Prince of Darkness..." \
  --initial-topic "Modern rock & reality TV" \
  --initial-scenario "Awards show backstage" \
  --initial-style "Amusing clash, rambling" \
  --num-examples 20 \
  --output-file jagger_ozzy_fixed.jsonl
```

(See Argument Reference or README for full details.)

Batch Generation (batch_generate.py)

Run multiple jobs efficiently using a YAML config.

1. Create YAML Config

Define runs, mixing modes. See examples/.

```yaml
# examples/batch_config.yaml
output_directory: "./batch_runs"
force_upload: true  # Optional global flag
runs:
  - id: "run1_brief_search"
    output_file: "r1_brief.jsonl"
    num_examples: 50
    model_id: "meta-llama/Meta-Llama-3-8B-Instruct"
    creative_brief: "Einstein explains relativity to a cat"
    persona2_search_term: "Typical cat personality"
    upload_repo: "YourUser/EinsteinCat"
    load_in_4bit: true
  - id: "run2_manual"
    output_file: "r2_manual.jsonl"
    num_examples: 25
    model_id: "google/gemma-2b-it"
    manual_args:
      topic: "Best way to store cheese"
      persona1: "Mouse"
```
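A quick sanity check before launching a long batch can save time. The sketch below is a minimal, hypothetical validator (not part of the repo) that checks each run entry for the keys shown above, assuming each run needs an id, an output file, an example count, and either a creative brief or manual arguments:

```python
# Hypothetical pre-flight check for a batch config (not part of the repo).
# Assumes the structure shown above: a top-level "runs" list where each
# entry has an "id", "output_file", "num_examples", and either a
# "creative_brief" or a "manual_args" mapping.

def validate_batch_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    runs = config.get("runs")
    if not isinstance(runs, list) or not runs:
        return ["config must contain a non-empty 'runs' list"]
    problems = []
    for i, run in enumerate(runs):
        label = run.get("id", f"runs[{i}]")
        for key in ("id", "output_file", "num_examples"):
            if key not in run:
                problems.append(f"{label}: missing required key '{key}'")
        if "creative_brief" not in run and "manual_args" not in run:
            problems.append(f"{label}: needs 'creative_brief' or 'manual_args'")
    return problems

config = {
    "output_directory": "./batch_runs",
    "runs": [
        {"id": "run1", "output_file": "r1.jsonl", "num_examples": 50,
         "creative_brief": "Einstein explains relativity to a cat"},
        {"id": "run2", "output_file": "r2.jsonl"},  # broken on purpose
    ],
}
print(validate_batch_config(config))
```

Parse the YAML with PyYAML (`yaml.safe_load`) and run the check before kicking off a multi-hour batch.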

2. Run the Batch Script

Run the script, pointing it at your config file.

```bash
# Activate the environment first!
source venv/bin/activate
python batch_generate.py examples/batch_config.yaml
```
Script Flow & Argument Reference

Script Flow Overview (generate.py)

```text
Start generate.py
 |--> If --delete-repo: Confirm & Delete -> Exit.
 |--> If --creative-brief:
 |      [Optional Web Search] -> Gen Base Args (LLM) -> [Optional Image Search]
 |      Loop N (--num-examples):
 |        Gen Topic Variation (LLM) -> Gen Conversation (LLM) -> Parse & Store
 |      -> Save JSONL -> [Optional Upload] -> Exit.
 |--> If --enable-variation (Fixed Persona):
 |      Use fixed personas & initial context -> [Optional Image Search]
 |      Loop N (--num-examples):
 |        Gen Topic Variation (LLM) -> Gen Conversation (LLM) -> Parse & Store
 |      -> Save JSONL -> [Optional Upload] -> Exit.
 |--> Else (Manual Mode):
 |      Use provided args -> [Optional Image Search]
 |      Loop N (--num-examples):
 |        Gen Conversation (LLM) -> Parse & Store
 |      -> Save JSONL -> [Optional Upload] -> Exit.
```
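In code, the dispatch above boils down to a simple precedence check. The sketch below is illustrative only (the real generate.py is more involved); the dictionary keys mirror the CLI flag names:

```python
# Illustrative sketch of the mode-dispatch logic in the flow chart above.
# Not the actual implementation; keys mirror the CLI flags.

def select_mode(args):
    """Pick the generation mode using the same precedence as the flow chart."""
    if args.get("delete_repo"):
        return "delete"          # confirm & delete the HF repo, then exit
    if args.get("creative_brief"):
        return "brief"           # brainstorm base args, then vary per example
    if args.get("enable_variation"):
        return "fixed_persona"   # fixed personas, varied topics per example
    return "manual"              # use the explicitly provided arguments

print(select_mode({"creative_brief": "Pirate orders coffee"}))  # -> brief
```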

Key Argument Reference (generate.py)

Provide EITHER --creative-brief OR the manual/fixed-persona arguments.

Creative Brief Mode:

  • --creative-brief: High-level concept.
  • --personaX-search-term: Web search query for persona grounding.

Manual Mode:

  • --topic: Conversation subject.
  • --personaX: Persona name.
  • --personaX-desc: Persona description.
  • --scenario: Setting.
  • --style: Dialogue style.

Fixed Persona Mode:

  • --enable-variation: Enable topic/scenario variation.
  • --fixed-personaX: Persona name.
  • --fixed-personaX-desc: Persona description.
  • --initial-topic: Seed topic.
  • --initial-scenario: Seed scenario.
  • --initial-style: Seed style.

Common:

  • --include-points: Keywords to include (Manual/Fixed).
  • --num-examples: Number of conversations to generate.
  • --output-file: Output .jsonl path.
  • --model-id: Hugging Face generation model.
  • --upload-to-hub: HF Hub repo ID (optional).
  • --load-in-4bit: Use 4-bit quantization.
  • --delete-repo: Danger: deletes the given HF repo.

See the full argument reference in the README.

Generation Examples

Sitcom Banter Style LoRA

Style Training
```bash
python generate.py \
  --num-examples 1000 \
  --topic "Absurdity of errands" \
  --persona1 "Alex" --persona1-desc "Neurotic" \
  --persona2 "Sam" --persona2-desc "Laid-back" \
  --scenario "Post office line" \
  --style "Observational, witty, Seinfeld-esque" \
  --output-file sitcom_style.jsonl
```

Helpful Coding Mentor Persona LoRA

Persona Training
```bash
python generate.py \
  --num-examples 500 \
  --topic "Debugging Python IndexError" \
  --persona1 "MentorBot" --persona1-desc "Patient tutor AI" \
  --persona2 "Learner" --persona2-desc "Beginner stuck" \
  --scenario "Online chat session" \
  --style "Supportive, educational" \
  --output-file mentor_persona.jsonl
```

Varied Historical Banter

Topic Variation
```bash
python generate.py \
  --creative-brief "Debate: da Vinci & Marie Curie" \
  --num-examples 25 \
  --output-file hist_debate.jsonl
```

Dialogue with Web Search

Web Context
```bash
python generate.py \
  --creative-brief "MKBHD talks cameras w/ Kubrick" \
  --num-examples 5 \
  --persona1-search-term "Marques Brownlee style" \
  --persona2-search-term "Stanley Kubrick style" \
  --output-file mkbhd_kubrick.jsonl
```

Absurdist Comedy Variations

Creative Prompt
```bash
python generate.py \
  --creative-brief "Existential toaster talks to pigeons" \
  --num-examples 10 \
  --output-file toaster_pigeons.jsonl
```

Rockstar Banter (Jagger & Ozzy)

Persona Consistency
```bash
python generate.py \
  --enable-variation \
  --fixed-persona1 "Mick Jagger" \
  --fixed-persona1-desc "Rolling Stones frontman..." \
  --fixed-persona2 "Ozzy Osbourne" \
  --fixed-persona2-desc "Prince of Darkness..." \
  --initial-topic "Modern rock & reality TV" \
  --initial-scenario "Awards show backstage" \
  --initial-style "Amusing clash, rambling" \
  --num-examples 20 \
  --output-file jagger_ozzy_fixed.jsonl
```

See the full list of examples in the README.

Advanced Use Case: Generating Progressive Course Curricula

A truly unique capability emerges when combining Batch Mode with Creative Briefs and evolving persona2_search_term arguments. This allows you to generate entire conversational training courses, level by level!

By defining each course module as a run in your batch YAML, you can set a consistent tutor persona and progressively change the `persona2_search_term` to reflect the learner's expected knowledge and confusion points at that stage. This creates highly targeted, context-aware dialogue suitable for training sequential chatbot LoRAs for education. Imagine generating data for:

  • K-12 Algebra
  • College History 101
  • Python Programming Basics
  • Advanced Rust Concepts
  • Small Engine Repair
  • Golf Cart Mechanics
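Because each level is just another entry in the runs list, a curriculum config can be generated programmatically. A minimal sketch, with made-up level names and learner profiles for illustration (serialize the result with PyYAML's `yaml.safe_dump` to get a batch-ready file):

```python
# Sketch: build a progressive-curriculum batch config as Python dicts.
# Level IDs and learner profiles below are invented for illustration;
# dump the result with PyYAML (yaml.safe_dump) to produce the YAML file.

LEVELS = [
    ("level1_basics", "Beginner confused about variables and loops"),
    ("level2_functions", "Learner comfortable with loops, confused by functions"),
    ("level3_classes", "Learner who knows functions, confused by classes"),
]

def build_curriculum_config(course, tutor, levels, examples_per_level=500):
    runs = []
    for level_id, learner_profile in levels:
        runs.append({
            "id": level_id,
            "output_file": f"{course}_{level_id}.jsonl",
            "num_examples": examples_per_level,
            "creative_brief": f"{tutor} teaches {course}, module '{level_id}'",
            # The evolving search term simulates learner progress:
            "persona2_search_term": learner_profile,
        })
    return {"output_directory": f"{course}_datasets", "runs": runs}

config = build_curriculum_config("python_basics", "EnfuseBot", LEVELS)
print(len(config["runs"]))  # -> 3
```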

AI Programming Course Curriculum (Batch Example)

Educational Data

This example uses `batch_generate.py` and the `examples/ai_course_curriculum.yaml` file (shown partially below) to create 6 levels of AI learning dialogues between `EnfuseBot` and `Learner`. The key is changing `persona2_search_term` in each run to simulate learner progress.

```yaml
# examples/ai_course_curriculum.yaml (Snippet)
output_directory: "ai_course_datasets"
force_upload: true
runs:
  # Level 1: Intro
  - id: "level1_intro"
    output_file: "ai_course_level1_intro.jsonl"
    upload_repo: "cahlen/AICourse-Level1-Intro"
    num_examples: 500
    load_in_4bit: true
    creative_brief: "EnfuseBot introduces AI/ML concepts..."
    persona2_search_term: "Beginner Python programmer confused about AI..."
  # Level 2: Scikit-learn
  - id: "level2_sklearn"
    # ... (similar structure) ...
    creative_brief: "EnfuseBot explains core ML concepts..."
    persona2_search_term: "Learner starting Scikit-learn confused..."
  # ... (Levels 3-6 omitted for brevity) ...
```

Command to Run:

```bash
python batch_generate.py examples/ai_course_curriculum.yaml
```

See the full YAML configuration file in the repository.

Output Format & Fine-Tuning Notes

Output Format (.jsonl)

Saved locally and optionally uploaded to HF Hub. Each line is a turn:

```json
{"conversation_id": 0, "turn_number": 0, "role": "human", ...}
{"conversation_id": 0, "turn_number": 1, "role": "gpt", ...}
{"conversation_id": 1, "turn_number": 0, "role": "human", ...}
```

Keys: conversation_id, turn_number, role, speaker_name, topic, scenario, style, include_points, content.
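Since each line is a single turn, downstream code usually regroups turns by conversation_id. A small sketch of that step (the inline `rows` list stands in for the parsed lines of an output file; load a real file with `json.loads` per line):

```python
from collections import defaultdict

# Regroup per-turn JSONL rows into whole conversations.
# `rows` stands in for the parsed lines of an output file;
# for a real file: rows = [json.loads(l) for l in open("out.jsonl")]
rows = [
    {"conversation_id": 0, "turn_number": 0, "role": "human", "content": "Hi"},
    {"conversation_id": 0, "turn_number": 1, "role": "gpt", "content": "Hello!"},
    {"conversation_id": 1, "turn_number": 0, "role": "human", "content": "Yo"},
]

def group_conversations(rows):
    convos = defaultdict(list)
    for row in rows:
        convos[row["conversation_id"]].append(row)
    # Turns may arrive out of order; sort each conversation by turn_number.
    return {cid: sorted(turns, key=lambda t: t["turn_number"])
            for cid, turns in convos.items()}

convos = group_conversations(rows)
print(len(convos))  # -> 2
```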

See the full output details in the README.

Loading from Hub

Easily load your uploaded dataset:

```python
from datasets import load_dataset

# Replace with your repo ID
ds = load_dataset("YourUser/YourDatasetName")

# Access data
print(ds['train'][0])
```

Model & Fine-Tuning Notes

  • Default Model: Meta-Llama-3-8B-Instruct (change via --model-id).
  • Ideal for PEFT methods like LoRA for style/persona/topic specialization.
  • Use strong instruction-following base models (Llama 3, Mistral, Qwen2, Gemma Instruct, etc.).
  • LoRA requires: peft, trl, bitsandbytes.
  • Use --load-in-4bit for efficiency (needs bitsandbytes).
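For SFT with trl, the per-turn rows typically need converting into a chat-style messages list first. A minimal sketch of that mapping (the human/gpt role names come from the output format above; mapping them to user/assistant is the common chat-template convention, not something the script does for you):

```python
# Map generator roles to the user/assistant convention most chat
# templates expect, producing one `messages` list per conversation.
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def to_messages(turns):
    """turns: one conversation's rows, already sorted by turn_number."""
    return [{"role": ROLE_MAP[t["role"]], "content": t["content"]}
            for t in turns]

turns = [
    {"turn_number": 0, "role": "human", "content": "What is an IndexError?"},
    {"turn_number": 1, "role": "gpt", "content": "The index is out of range."},
]
print(to_messages(turns)[0]["role"])  # -> user
```

The resulting `messages` lists can be fed to a tokenizer's chat template before LoRA training.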

Roadmap

  • Add batch generation script (batch_generate.py) w/ YAML config.
  • Add diverse generation examples to docs.
  • Implement --validate-local-save checks (placeholder).
  • Explore advanced topic/scenario variation techniques.
  • Add option for different output formats (e.g., conversational JSON).
  • Improve error handling and reporting in batch script.

See the open issues for more details.

Contributing

Contributions are welcome! Fork the repo, create a feature branch, commit, push, and open a Pull Request.

Report bugs or suggest enhancements via GitHub Issues.

Don't forget to star the project! ⭐

License

Distributed under the MIT License.

See the LICENSE file for more information.

Contact

Cahlen Humphreys

GitHub Profile

Project: GitHub Repo

Ready to Generate?

Start crafting high-quality conversational data for your LLMs today!

Get Started on GitHub