Conversation Dataset Generator

Craft High-Quality Dialogue Data for Your LLMs. Specify personas, topics, scenarios, or just give a creative brief and let the AI handle the rest.

About The Project

Ever wish you could generate just the right kind of conversational data? Whether you're fine-tuning an LLM for a specific style or persona, need dialogue for a creative project, or want to explore complex topics in a natural flow, the Conversation Dataset Generator is here to help!

This powerful Python script leverages Hugging Face's transformers library, operating in several modes:

  • Manual Mode: Specify everything – topic, personas, scenario, style, keywords.
  • Creative Brief Mode: Provide a high-level brief; the script brainstorms details, generates variations, and optionally uses web search.
  • Fixed Persona Mode: Define personas once, then generate varied conversations around an initial context.
  • Batch Mode: Run multiple configurations defined in a YAML file using batch_generate.py.

The output is a clean JSON Lines (.jsonl) file, perfect for downstream tasks.

Why Use This Generator?

Style Specialization: Train models for specific nuances (pirate speak, formal news anchor, etc.).
Persona Embodiment: Build believable characters, enhanced by web search.
Topic Fluency: Improve discussion of specific subjects naturally.
Instruction Adherence: Train models to follow constraints (e.g., keywords).
Creative Content: Break writer's block and draft dialogue.
Dialogue Analysis: Study conversation flow with structured output.

Best of all, the code is fully open source under the MIT license, giving you the freedom to use, modify, and extend it however you see fit!

Built With

Powered by cutting-edge libraries and tools:

Python
PyTorch
Transformers
Accelerate
Datasets
Hugging Face Hub
Pandas
DuckDuckGo Search
bitsandbytes
Tailwind CSS

Getting Started

Prerequisites

  • Python 3.8+
  • GPU (Recommended for speed)
  • A capable CPU & sufficient RAM
  • Internet access (for the web search modes)
  • Dependencies (see Installation)

Installation

```bash
# 1. Clone repo (optional)
git clone https://github.com/cahlen/conversation-dataset-generator.git
cd conversation-dataset-generator

# 2. Set up a virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install base dependencies
pip install -r requirements.txt

# 4. Install PyTorch for your GPU
# Visit: pytorch.org/get-started/locally
# Example: pip3 install torch ...

# 5. Install optional dependencies
# For Creative Brief Mode web search:
pip install duckduckgo-search
# For LoRA examples / 4-bit quantization:
pip install -U peft trl bitsandbytes

# 6. Log in to the HF Hub (optional)
huggingface-cli login
```

Usage

Two main ways to generate data:

  • generate.py: Generate a single dataset based on CLI arguments or a creative brief.
  • batch_generate.py: Run multiple generation jobs defined in a YAML config file for large-scale creation.

Single Generation (generate.py)

Use one of these modes:

Manual Mode

Provide all parameters explicitly.

```bash
python generate.py \
  --persona1 "Wizard" --persona1-desc "Grumpy" \
  --persona2 "Knight" --persona2-desc "Cheerful" \
  --topic "Polishing armor" \
  --scenario "Dungeon waiting room" \
  --style "Comedic bickering" \
  --num-examples 10 \
  --output-file wiz_knight.jsonl
```

Creative Brief Mode

Provide a high-level brief; the script uses an LLM (plus optional web search) to flesh out the details and generate per-example variations.

```bash
python generate.py \
  --creative-brief "Pirate orders coffee" \
  --num-examples 15 \
  --persona1-search-term "Pirate captain traits" \
  --output-file pirate_coffee.jsonl
```

Fixed Persona + Variation Mode

Define fixed personas & initial context, then generate varied conversations.

```bash
python generate.py \
  --enable-variation \
  --fixed-persona1 "Mick Jagger" \
  --fixed-persona1-desc "Rolling Stones frontman..." \
  --fixed-persona2 "Ozzy Osbourne" \
  --fixed-persona2-desc "Prince of Darkness..." \
  --initial-topic "Modern rock & reality TV" \
  --initial-scenario "Awards show backstage" \
  --initial-style "Amusing clash, rambling" \
  --num-examples 20 \
  --output-file jagger_ozzy_fixed.jsonl
```

(See Argument Reference or README for full details.)

Batch Generation (batch_generate.py)

Run multiple jobs efficiently using a YAML config.

1. Create YAML Config

Define runs, mixing modes. See examples/.

```yaml
# examples/batch_config.yaml
output_directory: "./batch_runs"
force_upload: true  # Optional global flag
runs:
  - id: "run1_brief_search"
    output_file: "r1_brief.jsonl"
    num_examples: 50
    model_id: "meta-llama/Meta-Llama-3-8B-Instruct"
    creative_brief: "Einstein explains relativity to a cat"
    persona2_search_term: "Typical cat personality"
    upload_repo: "YourUser/EinsteinCat"
    load_in_4bit: true
  - id: "run2_manual"
    output_file: "r2_manual.jsonl"
    num_examples: 25
    model_id: "google/gemma-2b-it"
    manual_args:
      topic: "Best way to store cheese"
      persona1: "Mouse"
```
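A quick sanity check before launching a long batch can save time. The sketch below is a minimal, hypothetical validator (not part of the repo) that checks each run entry for the keys shown above, assuming each run needs an id, an output file, an example count, and either a creative brief or manual arguments:

```python
# Hypothetical pre-flight check for a batch config (not part of the repo).
# Assumes the structure shown above: a top-level "runs" list where each
# entry has an "id", "output_file", "num_examples", and either a
# "creative_brief" or a "manual_args" mapping.

def validate_batch_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    runs = config.get("runs")
    if not isinstance(runs, list) or not runs:
        return ["config must contain a non-empty 'runs' list"]
    problems = []
    for i, run in enumerate(runs):
        label = run.get("id", f"runs[{i}]")
        for key in ("id", "output_file", "num_examples"):
            if key not in run:
                problems.append(f"{label}: missing required key '{key}'")
        if "creative_brief" not in run and "manual_args" not in run:
            problems.append(f"{label}: needs 'creative_brief' or 'manual_args'")
    return problems

config = {
    "output_directory": "./batch_runs",
    "runs": [
        {"id": "run1", "output_file": "r1.jsonl", "num_examples": 50,
         "creative_brief": "Einstein explains relativity to a cat"},
        {"id": "run2", "output_file": "r2.jsonl"},  # broken on purpose
    ],
}
print(validate_batch_config(config))
```

Parse the YAML with PyYAML (`yaml.safe_load`) and run the check before kicking off a multi-hour batch.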

2. Run the Batch Script

Run the script, pointing it at your config file.

```bash
# Activate the environment first!
source venv/bin/activate
python batch_generate.py examples/batch_config.yaml
```
Script Flow & Argument Reference

Script Flow Overview (generate.py)

```text
Start generate.py
 |--> If --delete-repo: Confirm & Delete -> Exit.
 |--> If --creative-brief:
 |      [Optional Web Search] -> Gen Base Args (LLM) -> [Optional Image Search]
 |      Loop N (--num-examples):
 |        Gen Topic Variation (LLM) -> Gen Conversation (LLM) -> Parse & Store
 |      -> Save JSONL -> [Optional Upload] -> Exit.
 |--> If --enable-variation (Fixed Persona):
 |      Use fixed personas & initial context -> [Optional Image Search]
 |      Loop N (--num-examples):
 |        Gen Topic Variation (LLM) -> Gen Conversation (LLM) -> Parse & Store
 |      -> Save JSONL -> [Optional Upload] -> Exit.
 |--> Else (Manual Mode):
 |      Use provided args -> [Optional Image Search]
 |      Loop N (--num-examples):
 |        Gen Conversation (LLM) -> Parse & Store
 |      -> Save JSONL -> [Optional Upload] -> Exit.
```
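In code, the dispatch above boils down to a simple precedence check. The sketch below is illustrative only (the real generate.py is more involved); the dictionary keys mirror the CLI flag names:

```python
# Illustrative sketch of the mode-dispatch logic in the flow chart above.
# Not the actual implementation; keys mirror the CLI flags.

def select_mode(args):
    """Pick the generation mode using the same precedence as the flow chart."""
    if args.get("delete_repo"):
        return "delete"          # confirm & delete the HF repo, then exit
    if args.get("creative_brief"):
        return "brief"           # brainstorm base args, then vary per example
    if args.get("enable_variation"):
        return "fixed_persona"   # fixed personas, varied topics per example
    return "manual"              # use the explicitly provided arguments

print(select_mode({"creative_brief": "Pirate orders coffee"}))  # -> brief
```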

Key Argument Reference (generate.py)

Provide EITHER --creative-brief OR the manual/fixed-persona arguments.

Creative Brief Mode:

  • --creative-brief: High-level concept.
  • --personaX-search-term: Web search query for persona grounding.

Manual Mode:

  • --topic: Conversation subject.
  • --personaX: Persona name.
  • --personaX-desc: Persona description.
  • --scenario: Setting.
  • --style: Dialogue style.

Fixed Persona Mode:

  • --enable-variation: Enable topic/scenario variation.
  • --fixed-personaX: Persona name.
  • --fixed-personaX-desc: Persona description.
  • --initial-topic: Seed topic.
  • --initial-scenario: Seed scenario.
  • --initial-style: Seed style.

Common:

  • --include-points: Keywords to include (Manual/Fixed).
  • --num-examples: Number of conversations to generate.
  • --output-file: Output .jsonl path.
  • --model-id: Hugging Face generation model.
  • --upload-to-hub: HF Hub repo ID (optional).
  • --load-in-4bit: Use 4-bit quantization.
  • --delete-repo: Danger: deletes the given HF repo.

See the full argument reference in the README.

Generation Examples

Sitcom Banter Style LoRA

Style Training
```bash
python generate.py \
  --num-examples 1000 \
  --topic "Absurdity of errands" \
  --persona1 "Alex" --persona1-desc "Neurotic" \
  --persona2 "Sam" --persona2-desc "Laid-back" \
  --scenario "Post office line" \
  --style "Observational, witty, Seinfeld-esque" \
  --output-file sitcom_style.jsonl
```

Helpful Coding Mentor Persona LoRA

Persona Training
```bash
python generate.py \
  --num-examples 500 \
  --topic "Debugging Python IndexError" \
  --persona1 "MentorBot" --persona1-desc "Patient tutor AI" \
  --persona2 "Learner" --persona2-desc "Beginner stuck" \
  --scenario "Online chat session" \
  --style "Supportive, educational" \
  --output-file mentor_persona.jsonl
```

Varied Historical Banter

Topic Variation
```bash
python generate.py \
  --creative-brief "Debate: da Vinci & Marie Curie" \
  --num-examples 25 \
  --output-file hist_debate.jsonl
```

Dialogue with Web Search

Web Context
```bash
python generate.py \
  --creative-brief "MKBHD talks cameras w/ Kubrick" \
  --num-examples 5 \
  --persona1-search-term "Marques Brownlee style" \
  --persona2-search-term "Stanley Kubrick style" \
  --output-file mkbhd_kubrick.jsonl
```

Absurdist Comedy Variations

Creative Prompt
```bash
python generate.py \
  --creative-brief "Existential toaster talks to pigeons" \
  --num-examples 10 \
  --output-file toaster_pigeons.jsonl
```

Rockstar Banter (Jagger & Ozzy)

Persona Consistency
```bash
python generate.py \
  --enable-variation \
  --fixed-persona1 "Mick Jagger" \
  --fixed-persona1-desc "Rolling Stones frontman..." \
  --fixed-persona2 "Ozzy Osbourne" \
  --fixed-persona2-desc "Prince of Darkness..." \
  --initial-topic "Modern rock & reality TV" \
  --initial-scenario "Awards show backstage" \
  --initial-style "Amusing clash, rambling" \
  --num-examples 20 \
  --output-file jagger_ozzy_fixed.jsonl
```

See the full list of examples in the README.

Advanced Use Case: Generating Progressive Course Curricula

A truly unique capability emerges when combining Batch Mode with Creative Briefs and evolving persona2_search_term arguments. This allows you to generate entire conversational training courses, level by level!

By defining each course module as a run in your batch YAML, you can set a consistent tutor persona and progressively change the `persona2_search_term` to reflect the learner's expected knowledge and confusion points at that stage. This creates highly targeted, context-aware dialogue suitable for training sequential chatbot LoRAs for education. Imagine generating data for:

  • K-12 Algebra
  • College History 101
  • Python Programming Basics
  • Advanced Rust Concepts
  • Small Engine Repair
  • Golf Cart Mechanics
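Because each level is just another entry in the runs list, a curriculum config can be generated programmatically. A minimal sketch, with made-up level names and learner profiles for illustration (serialize the result with PyYAML's `yaml.safe_dump` to get a batch-ready file):

```python
# Sketch: build a progressive-curriculum batch config as Python dicts.
# Level IDs and learner profiles below are invented for illustration;
# dump the result with PyYAML (yaml.safe_dump) to produce the YAML file.

LEVELS = [
    ("level1_basics", "Beginner confused about variables and loops"),
    ("level2_functions", "Learner comfortable with loops, confused by functions"),
    ("level3_classes", "Learner who knows functions, confused by classes"),
]

def build_curriculum_config(course, tutor, levels, examples_per_level=500):
    runs = []
    for level_id, learner_profile in levels:
        runs.append({
            "id": level_id,
            "output_file": f"{course}_{level_id}.jsonl",
            "num_examples": examples_per_level,
            "creative_brief": f"{tutor} teaches {course}, module '{level_id}'",
            # The evolving search term simulates learner progress:
            "persona2_search_term": learner_profile,
        })
    return {"output_directory": f"{course}_datasets", "runs": runs}

config = build_curriculum_config("python_basics", "EnfuseBot", LEVELS)
print(len(config["runs"]))  # -> 3
```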

AI Programming Course Curriculum (Batch Example)

Educational Data

This example uses `batch_generate.py` and the `examples/ai_course_curriculum.yaml` file (shown partially below) to create 6 levels of AI learning dialogues between `EnfuseBot` and `Learner`. The key is changing `persona2_search_term` in each run to simulate learner progress.

```yaml
# examples/ai_course_curriculum.yaml (Snippet)
output_directory: "ai_course_datasets"
force_upload: true
runs:
  # Level 1: Intro
  - id: "level1_intro"
    output_file: "ai_course_level1_intro.jsonl"
    upload_repo: "cahlen/AICourse-Level1-Intro"
    num_examples: 500
    load_in_4bit: true
    creative_brief: "EnfuseBot introduces AI/ML concepts..."
    persona2_search_term: "Beginner Python programmer confused about AI..."
  # Level 2: Scikit-learn
  - id: "level2_sklearn"
    # ... (similar structure) ...
    creative_brief: "EnfuseBot explains core ML concepts..."
    persona2_search_term: "Learner starting Scikit-learn confused..."
  # ... (Levels 3-6 omitted for brevity) ...
```

Command to Run:

```bash
python batch_generate.py examples/ai_course_curriculum.yaml
```

See the full YAML configuration file in the repository.

Output Format & Fine-Tuning Notes

Output Format (.jsonl)

Saved locally and optionally uploaded to HF Hub. Each line is a turn:

```json
{"conversation_id": 0, "turn_number": 0, "role": "human", ...}
{"conversation_id": 0, "turn_number": 1, "role": "gpt", ...}
{"conversation_id": 1, "turn_number": 0, "role": "human", ...}
```

Keys: conversation_id, turn_number, role, speaker_name, topic, scenario, style, include_points, content.
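Since each line is a single turn, downstream code usually regroups turns by conversation_id. A small sketch of that step (the inline `rows` list stands in for the parsed lines of an output file; load a real file with `json.loads` per line):

```python
from collections import defaultdict

# Regroup per-turn JSONL rows into whole conversations.
# `rows` stands in for the parsed lines of an output file;
# for a real file: rows = [json.loads(l) for l in open("out.jsonl")]
rows = [
    {"conversation_id": 0, "turn_number": 0, "role": "human", "content": "Hi"},
    {"conversation_id": 0, "turn_number": 1, "role": "gpt", "content": "Hello!"},
    {"conversation_id": 1, "turn_number": 0, "role": "human", "content": "Yo"},
]

def group_conversations(rows):
    convos = defaultdict(list)
    for row in rows:
        convos[row["conversation_id"]].append(row)
    # Turns may arrive out of order; sort each conversation by turn_number.
    return {cid: sorted(turns, key=lambda t: t["turn_number"])
            for cid, turns in convos.items()}

convos = group_conversations(rows)
print(len(convos))  # -> 2
```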

See the full output details in the README.

Loading from Hub

Easily load your uploaded dataset:

```python
from datasets import load_dataset

# Replace with your repo ID
ds = load_dataset("YourUser/YourDatasetName")

# Access data
print(ds['train'][0])
```

Model & Fine-Tuning Notes

  • Default Model: Meta-Llama-3-8B-Instruct (change via --model-id).
  • Ideal for PEFT methods like LoRA for style/persona/topic specialization.
  • Use strong instruction-following base models (Llama 3, Mistral, Qwen2, Gemma Instruct, etc.).
  • LoRA requires: peft, trl, bitsandbytes.
  • Use --load-in-4bit for efficiency (needs bitsandbytes).
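For SFT with trl, the per-turn rows typically need converting into a chat-style messages list first. A minimal sketch of that mapping (the human/gpt role names come from the output format above; mapping them to user/assistant is the common chat-template convention, not something the script does for you):

```python
# Map generator roles to the user/assistant convention most chat
# templates expect, producing one `messages` list per conversation.
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def to_messages(turns):
    """turns: one conversation's rows, already sorted by turn_number."""
    return [{"role": ROLE_MAP[t["role"]], "content": t["content"]}
            for t in turns]

turns = [
    {"turn_number": 0, "role": "human", "content": "What is an IndexError?"},
    {"turn_number": 1, "role": "gpt", "content": "The index is out of range."},
]
print(to_messages(turns)[0]["role"])  # -> user
```

The resulting `messages` lists can be fed to a tokenizer's chat template before LoRA training.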

Roadmap

  • Add batch generation script (batch_generate.py) w/ YAML config.
  • Add diverse generation examples to docs.
  • Implement --validate-local-save checks (placeholder).
  • Explore advanced topic/scenario variation techniques.
  • Add option for different output formats (e.g., conversational JSON).
  • Improve error handling and reporting in batch script.

See the open issues for more details.

Contributing

Contributions are welcome! Fork the repo, create a feature branch, commit, push, and open a Pull Request.

Report bugs or suggest enhancements via GitHub Issues.

Don't forget to star the project! ⭐

License

Distributed under the MIT License.

See the LICENSE file for more information.

Contact

Cahlen Humphreys

GitHub Profile

Project: GitHub Repo

Ready to Generate?

Start crafting high-quality conversational data for your LLMs today!

Get Started on GitHub