armanjscript/Job-Title-Generator

Job Title Generator with DeepSeek and Jobinja Data

Description

The Job Title Generator is a Python-based project that automates the process of scraping job listings from Jobinja.ir, a leading job portal in Iran, and fine-tuning a language model to generate accurate job titles based on job tags and skills. Using Selenium for web scraping and the DeepSeek-R1-Distill-Qwen-1.5B model with LoRA (Low-Rank Adaptation) for efficient fine-tuning, this project creates a dataset of job listings and trains a model to suggest relevant job titles. It’s ideal for HR professionals, job portal developers, and AI enthusiasts looking to explore web scraping and natural language processing in Persian.

Why Use This Project?

  • Automate Job Title Creation: Generate accurate Persian job titles based on tags and skills, streamlining HR processes.
  • Ethical Web Scraping: Includes delays and limits to respect Jobinja.ir’s servers, ensuring responsible data collection.
  • AI-Powered: Leverages a fine-tuned DeepSeek model for high-quality job title suggestions.
  • User-Friendly: Provides a clear Python script for scraping and a Jupyter notebook for model training.
  • Open-Source: Encourages community contributions to enhance functionality and adapt to other use cases.

Features

  • Web Scraping: Extracts job titles, tags, skills, gender requirements, and certifications from Jobinja.ir using Selenium.
  • Data Preprocessing: Cleans and formats scraped data for model training, handling Persian text and special characters.
  • Model Fine-Tuning: Uses LoRA with the PEFT library to efficiently fine-tune the DeepSeek model on job data.
  • Evaluation: Assesses model performance with validation loss and sample predictions.
  • Dataset Creation: Generates a structured CSV dataset for training and analysis.
  • Error Handling: Includes robust error management for scraping and training processes.

Installation

For the Scraper (scraper.py)

  1. Install Python 3.8 or Higher:

    • Download from Python.org.
    • Verify installation:
      python --version
  2. Install Dependencies:

    • Install required libraries:
      pip install selenium pandas
  3. Set Up ChromeDriver:

    • Download ChromeDriver matching your Chrome browser version.
    • Add ChromeDriver to your system PATH or place it in the project directory.
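To confirm that Selenium can find Chrome and its driver, a small check along these lines may help. This is a sketch and not part of scraper.py: `chrome_arguments`, `make_driver`, and `print_page_title` are hypothetical helper names, and running headless is an assumption about how you want to launch the browser. Selenium is imported inside `make_driver` so the flag helper stays usable on a machine without a browser.

```python
def chrome_arguments(headless=True):
    """Return Chrome command-line flags for a scraping session (illustrative defaults)."""
    args = ["--window-size=1920,1080"]
    if headless:
        args.append("--headless=new")  # run Chrome without opening a window
    return args


def make_driver(headless=True):
    """Start Chrome through Selenium; needs Chrome and a matching ChromeDriver."""
    # Lazy import: only required when a browser session is actually started.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


def print_page_title(url):
    """Open a URL, print the page title, and always close the browser."""
    driver = make_driver()
    try:
        driver.get(url)
        print(driver.title)
    finally:
        driver.quit()
```

Calling `print_page_title("https://jobinja.ir")` should print a page title if the driver is set up correctly; a driver or version error at this point usually means ChromeDriver and Chrome do not match.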

For the Notebook (FineTuning_Jobinja.ipynb)

The notebook is designed to run in Google Colab with GPU support, but it can be run locally with additional setup:

  1. Install Additional Libraries (for local execution):

    • Install required libraries:
      pip install torch transformers datasets peft evaluate pandas scikit-learn
    • A GPU is recommended for training; CPU execution works but is considerably slower.
  2. Set Up Google Colab (recommended):

    • No local installation needed; upload the notebook to Colab and ensure a GPU runtime is selected.

Note: Training the model may take several minutes to hours depending on hardware. Use a GPU runtime in Colab for optimal performance. Scraping may take time due to the 60-page limit and 3-second delay between requests; adjust max_pages_to_scrape in scraper.py if needed.
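As a rough guide to how long scraping takes, the time spent waiting can be estimated from the page limit and the delay. The sketch below is an assumption-laden lower bound: `estimate_delay_seconds` is a hypothetical helper, it counts one delay per request, and whether the scraper also opens each job's detail page is not stated here, so that case is modeled with an optional `jobs_per_page` argument.

```python
def estimate_delay_seconds(pages=60, delay_seconds=3.0, jobs_per_page=0):
    """Lower bound on time spent sleeping between requests.

    Counts one delay per listing-page request, plus one per job-detail
    request when jobs_per_page > 0 (an assumption about the scraper).
    Page load and parsing time are not included.
    """
    requests_made = pages * (1 + jobs_per_page)
    return requests_made * delay_seconds


listing_only = estimate_delay_seconds()                   # 60 requests
with_details = estimate_delay_seconds(jobs_per_page=20)   # 60 + 1200 requests
print(f"listing pages only: {listing_only / 60:.0f} min")
print(f"with job detail pages: {with_details / 60:.0f} min")
```

Under these assumptions the delays alone add up to about 3 minutes for listing pages, or about 63 minutes if every job page is visited as well.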

Usage

Step 1: Run the Scraper

  1. Navigate to Project Directory:

    cd Job-Title-Generator
  2. Run the Scraper:

    • Execute:
      python scraper.py
    • This generates jobinja_jobs.csv with job data (title, tags, skills, gender, certification).
    • Note: The script scrapes up to 60 pages, with ~20 jobs per page. Lower max_pages_to_scrape in scraper.py for a shorter run.
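The page limit and delay described above can be pictured as a simple loop. This is a minimal sketch rather than the code in scraper.py: `scrape_pages` and `fetch_page` are hypothetical names, and in the real scraper the fetch step would wrap the Selenium calls. The `sleep` parameter is injectable only so the loop can be tried without waiting.

```python
import time


def scrape_pages(fetch_page, max_pages_to_scrape=60, delay_seconds=3.0, sleep=time.sleep):
    """Collect rows page by page, pausing between requests.

    fetch_page(page_number) is a hypothetical callable returning a list of
    job dicts for one listing page, or an empty list when no jobs are left.
    """
    rows = []
    for page in range(1, max_pages_to_scrape + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            break  # listings ran out before the page limit was reached
        rows.extend(page_rows)
        if page < max_pages_to_scrape:
            sleep(delay_seconds)  # wait before the next request
    return rows


# Dry run with a fake fetcher (no network): three pages of two jobs each.
fake_pages = {n: [{"job_title": f"job {n}-{i}"} for i in range(2)] for n in (1, 2, 3)}
pauses = []
rows = scrape_pages(lambda n: fake_pages.get(n, []), max_pages_to_scrape=5,
                    delay_seconds=3.0, sleep=pauses.append)
print(len(rows), pauses)  # 6 [3.0, 3.0, 3.0]
```

Stopping on the first empty page keeps the run from issuing requests past the last page of results when fewer than `max_pages_to_scrape` pages exist.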

Step 2: Fine-Tune the Model

  1. Open the Notebook:

    • Upload FineTuning_Jobinja.ipynb to Google Colab.
    • Upload jobinja_jobs.csv to Colab.
  2. Run the Notebook:

    • Execute all cells to:
      • Preprocess the CSV data.
      • Fine-tune the DeepSeek model using LoRA.
      • Evaluate the model on a test set.
      • Save the trained model to ./deepseek_persian_job_title_generator.
    • The notebook outputs sample predictions to verify performance.
  3. Generate Job Titles:

    • Use the saved model to generate job titles for new tags and skills, as shown in the notebook’s example outputs.
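Loading the saved model for inference might look like the sketch below. Several things here are assumptions: the prompt template in `build_prompt` is hypothetical and has to match whatever format the notebook used during fine-tuning, the tokenizer is assumed to have been saved alongside the model in ./deepseek_persian_job_title_generator, and `max_new_tokens=20` is an arbitrary choice. If the directory holds only LoRA adapter weights, loading through `peft.AutoPeftModelForCausalLM.from_pretrained` may be needed instead.

```python
def build_prompt(job_tags, job_skills):
    """Format the inputs as a prompt (hypothetical template, adjust to the notebook's)."""
    return f"Tags: {job_tags}\nSkills: {job_skills}\nJob title:"


def generate_title(job_tags, job_skills,
                   model_dir="./deepseek_persian_job_title_generator",
                   max_new_tokens=20):
    """Generate one job title with greedy decoding."""
    # Lazy imports so build_prompt works without the ML stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()

    prompt = build_prompt(job_tags, job_skills)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                    do_sample=False)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

A call such as `generate_title("فروش و بازاریابی", "فروش تلفنی اصول و فنون مذاکره")` would then return the model's suggested title as a string.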

Dataset

The dataset is scraped from Jobinja.ir and includes:

  • job_title: The job’s title (e.g., "استخدام کارشناس فروش تلفنی").
  • job_tags: Job categories (e.g., "فروش و بازاریابی").
  • job_skills: Required skills (e.g., "فروش تلفنی, اصول و فنون مذاکره").
  • gender: Gender requirements, if specified.
  • certification: Required certifications, if any.

The scraper collects up to 60 pages (about 1,200 jobs at ~20 per page) for training. The notebook preprocesses this data, handling Persian text and cleaning special characters.
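One way the cleaning step could work is sketched below. It is an illustration, not the notebook's code: `clean_text` is a hypothetical helper, and the specific choices (mapping Arabic yeh and kaf to their Persian forms, replacing punctuation with spaces, keeping the zero-width non-joiner) are assumptions about what "cleaning special characters" involves.

```python
import re

# Arabic yeh (U+064A) and kaf (U+0643) often appear in Persian web text;
# map them to the Persian code points U+06CC and U+06A9.
_ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def clean_text(text):
    """Normalize letters, drop punctuation, and collapse whitespace."""
    text = text.translate(_ARABIC_TO_PERSIAN)
    # Replace anything that is not a letter, digit, underscore, whitespace,
    # or zero-width non-joiner (U+200C) with a space.
    text = re.sub(r"[^\w\s\u200c]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("فروش تلفنی, اصول و فنون مذاکره"))
```

Note that this pattern also strips characters such as `+` and `#`, so a skill like "C++" would be reduced to "C"; the character class would need widening if that distinction matters.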

Model

The base model is DeepSeek-R1-Distill-Qwen-1.5B, a 1.5B-parameter Qwen-based model distilled from DeepSeek-R1. Fine-tuning with LoRA adapts it to generate Persian job titles; the final validation loss after 5 epochs was 0.5974.
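To see why LoRA is cheaper than full fine-tuning: for a frozen weight matrix of shape d_out × d_in, LoRA trains two small matrices of shapes d_out × r and r × d_in, so it adds r · (d_out + d_in) trainable parameters per adapted matrix. The numbers below are illustrative assumptions (a square 1536 × 1536 projection and rank 8), not values read from the notebook.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters LoRA adds for one weight matrix of shape (d_out, d_in)."""
    return rank * (d_out + d_in)


d = 1536   # illustrative layer width (assumption)
r = 8      # illustrative LoRA rank (assumption)

full = d * d                       # parameters in the frozen matrix
lora = lora_param_count(d, d, r)   # parameters LoRA trains instead
print(f"{lora:,} trainable vs {full:,} frozen ({lora / full:.2%})")
```

With these figures, the adapter trains roughly 1% as many parameters as the matrix it adapts, which is what makes fine-tuning feasible on a single Colab GPU.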

Results

The fine-tuned model was evaluated on a held-out test set. Sample predictions:

| Input Tags | Input Skills | Predicted Title | Actual Title |
| --- | --- | --- | --- |
| فروش و بازاریابی | فروش تلفنی اصول و فنون مذاکره فروش | استخدام کارشناس فروش تلفنی | استخدام کارشناس فروش تلفنی پاکدشت |
| کارگر ساده نیروی خدماتی | نظافت پذیرایی و تشریفات امور خدماتی | استخدام کارشناس پذیرایی | استخدام کارمند خدمات |
| فروش و بازاریابی | اصول و فنون مذاکره فروش فروش و بازاریابی | استخدام کارشناس فروش | استخدام کارشناس فروش اصفهان |
| وب برنامه نویسی و نرم افزار | پشتیبانی نرم افزار Sql Server پشتیبانی | استخدام کارشناس پشتی | استخدام کارشناس پشتیبانی نرم افزار |
| کارشناس حقوقی وکالت | وکالت | استخدام کارشناس وکالت | استخدام وکیل قم |

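These samples show that the model often reproduces the core of the actual title. Predictions tend to omit location details such as city names (پاکدشت, اصفهان, قم), which do not appear in the input tags or skills, and some differ in wording or are cut off mid-word (e.g., "استخدام کارشناس پشتی").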
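A quick way to put numbers on "close to the actual title" is to compare tokens. The sketch below is not the notebook's evaluation, which reports validation loss; `is_token_prefix` and `token_f1` are hypothetical helpers, and the five pairs are the predicted and actual titles from the samples above.

```python
from collections import Counter

# (predicted, actual) title pairs copied from the sample results.
PAIRS = [
    ("استخدام کارشناس فروش تلفنی", "استخدام کارشناس فروش تلفنی پاکدشت"),
    ("استخدام کارشناس پذیرایی", "استخدام کارمند خدمات"),
    ("استخدام کارشناس فروش", "استخدام کارشناس فروش اصفهان"),
    ("استخدام کارشناس پشتی", "استخدام کارشناس پشتیبانی نرم افزار"),
    ("استخدام کارشناس وکالت", "استخدام وکیل قم"),
]


def is_token_prefix(predicted, actual):
    """True when the prediction matches the start of the actual title word for word."""
    p, a = predicted.split(), actual.split()
    return a[:len(p)] == p


def token_f1(predicted, actual):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    p, a = Counter(predicted.split()), Counter(actual.split())
    overlap = sum((p & a).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(a.values())
    return 2 * precision * recall / (precision + recall)


prefix_hits = sum(is_token_prefix(p, a) for p, a in PAIRS)
mean_f1 = sum(token_f1(p, a) for p, a in PAIRS) / len(PAIRS)
print(f"prefix matches: {prefix_hits}/{len(PAIRS)}, mean token F1: {mean_f1:.2f}")
```

On these five pairs, two predictions are exact token prefixes of the actual title; both differ from it only by a trailing city name. Five samples are far too few to characterize the model, so a metric like this is more useful when run over the whole test set.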

Ethical Considerations

  • Responsible Scraping: The scraper includes a 3-second delay between requests and a 60-page limit to minimize server load on Jobinja.ir. Always comply with Jobinja.ir’s terms of service.
  • Data Privacy: Ensure scraped data is used ethically and respects user privacy.
  • Model Usage: Use the fine-tuned model responsibly, avoiding misuse in automated systems without human oversight.

Technologies Used

| Technology | Role |
| --- | --- |
| Python | Primary programming language. |
| Selenium | Web scraping from Jobinja.ir. |
| Pandas | Data manipulation and preprocessing. |
| Scikit-learn | Data splitting for training and testing. |
| Transformers | Loading and fine-tuning the DeepSeek model. |
| Datasets | Managing training and test datasets. |
| PEFT | Efficient fine-tuning with LoRA. |
| Torch | Model training and GPU support. |
| Google Colab | Environment for notebook execution with GPU. |

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository on GitHub.
  2. Create a new branch for your changes.
  3. Submit a pull request with a clear description.
  4. For bug reports or feature requests, open an issue.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or feedback, open an issue on GitHub or email [[email protected]].
