Job Title Generator with DeepSeek and Jobinja Data

Description

The Job Title Generator is a Python-based project that automates the process of scraping job listings from Jobinja.ir, a leading job portal in Iran, and fine-tuning a language model to generate accurate job titles based on job tags and skills. Using Selenium for web scraping and the DeepSeek-R1-Distill-Qwen-1.5B model with LoRA (Low-Rank Adaptation) for efficient fine-tuning, this project creates a dataset of job listings and trains a model to suggest relevant job titles. It’s ideal for HR professionals, job portal developers, and AI enthusiasts looking to explore web scraping and natural language processing in Persian.

Why Use This Project?

Automate Job Title Creation: Generate accurate Persian job titles based on tags and skills, streamlining HR processes.
Ethical Web Scraping: Includes delays and limits to respect Jobinja.ir’s servers, ensuring responsible data collection.
AI-Powered: Leverages a fine-tuned DeepSeek model for high-quality job title suggestions.
User-Friendly: Provides a clear Python script for scraping and a Jupyter notebook for model training.
Open-Source: Encourages community contributions to enhance functionality and adapt to other use cases.

Features

Web Scraping: Extracts job titles, tags, skills, gender requirements, and certifications from Jobinja.ir using Selenium.
Data Preprocessing: Cleans and formats scraped data for model training, handling Persian text and special characters.
Model Fine-Tuning: Uses LoRA with the PEFT library to efficiently fine-tune the DeepSeek model on job data.
Evaluation: Assesses model performance with validation loss and sample predictions.
Dataset Creation: Generates a structured CSV dataset for training and analysis.
Error Handling: Includes robust error management for scraping and training processes.

Installation

For the Scraper (`scraper.py`)

Install Python 3.8 or Higher:
- Download from Python.org.
- Verify installation:
```
python --version
```
Install Dependencies:
- Install required libraries:
```
pip install selenium pandas
```
Set Up ChromeDriver:
- Download ChromeDriver matching your Chrome browser version.
- Add ChromeDriver to your system PATH or place it in the project directory.

For the Notebook (`FineTuning_Jobinja.ipynb`)

The notebook is designed to run in Google Colab with GPU support, but it can be run locally with additional setup:

Install Additional Libraries (for local execution):
- Install required libraries:
```
pip install torch transformers datasets peft evaluate pandas scikit-learn
```
- A GPU is recommended for faster training; CPU execution is supported but slower.
Set Up Google Colab (recommended):
- No local installation needed; upload the notebook to Colab and ensure a GPU runtime is selected.

Note: Training can take anywhere from several minutes to hours depending on hardware; use a GPU runtime in Colab for best performance. Scraping is also slow by design because of the 60-page limit and the 3-second delay between requests; lower `max_pages_to_scrape` in `scraper.py` for a quicker run.
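To put the scraping time in perspective, the short calculation below multiplies the number of requests by the delay. It is a rough sketch: the function name is made up for illustration, and it is an assumption whether the scraper issues one delayed request per listing page or also one per job detail page, so both cases are shown.

```python
def estimate_wait_seconds(n_requests: int, delay_s: float = 3.0) -> float:
    """Lower bound on time spent in delays; page loads and parsing add to it."""
    return n_requests * delay_s

pages = 60            # page limit stated in this README
jobs_per_page = 20    # approximate figure stated in this README

# Case 1 (assumption): one delayed request per listing page.
print(estimate_wait_seconds(pages) / 60, "minutes")

# Case 2 (assumption): one extra delayed request per job detail page.
print(estimate_wait_seconds(pages + pages * jobs_per_page) / 60, "minutes")
```

Under the first assumption the delays add up to about 3 minutes; under the second, to about an hour.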

Usage

Step 1: Run the Scraper

Navigate to Project Directory:
```
cd Job-Title-Generator
```
Run the Scraper:
- Execute:
```
python scraper.py
```
- This generates `jobinja_jobs.csv` with one row per job (title, tags, skills, gender, certification).
- Note: The script scrapes up to 60 pages, with ~20 jobs per page. Lower `max_pages_to_scrape` in `scraper.py` for faster execution.
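Before moving on to training, it can help to sanity-check the generated CSV. The sketch below assumes the header names match the field names listed in the Dataset section (`job_title`, `job_tags`, `job_skills`, `gender`, `certification`) and that the file is UTF-8 encoded. It uses the standard-library `csv` module rather than pandas so it runs without extra dependencies; `summarize_jobs` is a hypothetical helper, not part of `scraper.py`.

```python
import csv
import io

# Assumed header names, taken from the Dataset section of this README.
EXPECTED_COLUMNS = ["job_title", "job_tags", "job_skills", "gender", "certification"]
REQUIRED_FOR_TRAINING = ("job_title", "job_tags", "job_skills")

def summarize_jobs(csv_file):
    """Return (total rows, rows missing a title, tags, or skills)."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    total = incomplete = 0
    for row in reader:
        total += 1
        if not all((row[c] or "").strip() for c in REQUIRED_FOR_TRAINING):
            incomplete += 1
    return total, incomplete

# Tiny in-memory stand-in for jobinja_jobs.csv (second row has no tags).
sample = io.StringIO(
    "job_title,job_tags,job_skills,gender,certification\n"
    'استخدام کارشناس فروش تلفنی,فروش و بازاریابی,"فروش تلفنی, اصول و فنون مذاکره",,\n'
    "استخدام وکیل,,وکالت,,\n"
)
print(summarize_jobs(sample))  # (2, 1)
```

For the real file, pass `open("jobinja_jobs.csv", encoding="utf-8", newline="")` instead of the in-memory sample.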

Step 2: Fine-Tune the Model

Open the Notebook:
- Upload FineTuning_Jobinja.ipynb to Google Colab.
- Upload jobinja_jobs.csv to Colab.
Run the Notebook:
- Execute all cells to:
  - Preprocess the CSV data.
  - Fine-tune the DeepSeek model using LoRA.
  - Evaluate the model on a test set.
  - Save the trained model to ./deepseek_persian_job_title_generator.
- The notebook outputs sample predictions to verify performance.
Generate Job Titles:
- Use the saved model to generate job titles for new tags and skills, as shown in the notebook’s example outputs.
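The preprocessing step turns each CSV row into a text pair the model can learn from, and holds some rows back for testing. The sketch below illustrates one way to do this. The prompt template and both function names are hypothetical (the notebook's actual template is not shown here), and it uses `random` from the standard library in place of scikit-learn's splitter so that it runs without extra dependencies.

```python
import random

def build_example(tags: str, skills: str, title: str = "") -> dict:
    """Format one row as a prompt/completion pair (hypothetical template)."""
    prompt = f"Tags: {tags}\nSkills: {skills}\nJob title:"
    completion = (" " + title) if title else ""
    return {"prompt": prompt, "completion": completion}

def split_examples(examples, test_fraction=0.1, seed=42):
    """Shuffle deterministically and return (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = [("فروش و بازاریابی", "فروش تلفنی", "استخدام کارشناس فروش تلفنی")] * 10
examples = [build_example(*row) for row in rows]
train, test = split_examples(examples, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

At inference time the same template would be used with an empty title, so the model completes the text after "Job title:".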

Dataset

The dataset is scraped from Jobinja.ir and includes:

job_title: The job’s title (e.g., "استخدام کارشناس فروش تلفنی").
job_tags: Job categories (e.g., "فروش و بازاریابی").
job_skills: Required skills (e.g., "فروش تلفنی, اصول و فنون مذاکره").
gender: Gender requirements, if specified.
certification: Required certifications, if any.

The scraper collects up to 60 pages, or roughly 1,200 jobs at about 20 jobs per page. The notebook then preprocesses this data, handling Persian text and cleaning special characters.
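As an illustration of what cleaning Persian text can involve, the sketch below maps Arabic letter variants to their Persian forms and replaces punctuation with spaces. This is an assumed approach, not the notebook's actual code; `clean_text` and the set of preserved symbols are choices made for this example.

```python
import re

# Arabic code points that often appear in Persian text, mapped to Persian forms.
_CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
})

def clean_text(text: str) -> str:
    """Normalize letter variants, replace punctuation with spaces, collapse whitespace."""
    text = text.translate(_CHAR_MAP)
    # Keep word characters, whitespace, and symbols used in skill names (C#, C++, .NET).
    text = re.sub(r"[^\w\s#+.]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("فروش تلفنی, اصول و فنون مذاکره"))
print(clean_text("Sql  Server,  C#"))  # Sql Server C#
```

Note that this regex also replaces the zero-width non-joiner used in Persian compound words with a space, which may or may not be desirable depending on the tokenizer.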

Model

The base model is DeepSeek-R1-Distill-Qwen-1.5B, a 1.5-billion-parameter Qwen-based model distilled from DeepSeek-R1. Fine-tuning with LoRA adapts it to generate Persian job titles; the final validation loss after 5 epochs was 0.5974.
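Two numbers help put this in context. LoRA freezes a weight matrix of shape d_out x d_in and trains a low-rank update B·A, with B of shape d_out x r and A of shape r x d_in, so only r·(d_out + d_in) parameters are trained per adapted matrix. The sketch below works this out for an illustrative matrix size and rank; both values are assumptions, since the notebook's LoRA settings are not listed here. It also converts the reported validation loss to perplexity, assuming the loss is the mean per-token cross-entropy in nats.

```python
import math

def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the update B @ A, with B: d_out x rank and A: rank x d_in."""
    return rank * (d_out + d_in)

d = 1536  # illustrative square projection size (assumption)
r = 8     # illustrative LoRA rank (assumption)

full = d * d
lora = lora_trainable_params(d, d, r)
print(full, lora, f"{lora / full:.2%}")  # 2359296 24576 1.04%

# Perplexity from the reported validation loss (assumes mean token-level loss in nats).
print(f"{math.exp(0.5974):.2f}")  # 1.82
```

With these illustrative values, the adapter trains about 1% of the parameters of the matrix it modifies.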

Results

The fine-tuned model was evaluated on a held-out test set. Sample predictions:

| Input Tags | Input Skills | Predicted Title | Actual Title |
| --- | --- | --- | --- |
| فروش و بازاریابی | فروش تلفنی اصول و فنون مذاکره فروش | استخدام کارشناس فروش تلفنی | استخدام کارشناس فروش تلفنی پاکدشت |
| کارگر ساده نیروی خدماتی | نظافت پذیرایی و تشریفات امور خدماتی | استخدام کارشناس پذیرایی | استخدام کارمند خدمات |
| فروش و بازاریابی | اصول و فنون مذاکره فروش فروش و بازاریابی | استخدام کارشناس فروش | استخدام کارشناس فروش اصفهان |
| وب برنامه نویسی و نرم افزار | پشتیبانی نرم افزار Sql Server پشتیبانی | استخدام کارشناس پشتی | استخدام کارشناس پشتیبانی نرم افزار |
| کارشناس حقوقی وکالت | وکالت | استخدام کارشناس وکالت | استخدام وکیل قم |

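These results show that the model often generates titles close to the actual ones. Several differences come from location details (for example, the city names پاکدشت and اصفهان) that appear in the actual titles but not in the input tags or skills. One prediction is also cut off mid-word (استخدام کارشناس پشتی), and another names a different role than the actual title.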
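One way to make "close to the actual title" measurable is a token-overlap F1 score between predicted and actual titles. This is a suggested metric, not one the notebook is known to report; `token_f1` is a hypothetical helper that splits on whitespace.

```python
from collections import Counter

def token_f1(predicted: str, actual: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred, gold = predicted.split(), actual.split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two (predicted, actual) pairs taken from the samples above.
pairs = [
    ("استخدام کارشناس فروش تلفنی", "استخدام کارشناس فروش تلفنی پاکدشت"),
    ("استخدام کارشناس وکالت", "استخدام وکیل قم"),
]
for predicted, actual in pairs:
    print(round(token_f1(predicted, actual), 2))  # 0.89, then 0.33
```

The first pair scores high because only the city name is missing; the second scores low because only the first word matches.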

Ethical Considerations

Responsible Scraping: The scraper includes a 3-second delay between requests and a 60-page limit to minimize server load on Jobinja.ir. Always comply with Jobinja.ir’s terms of service.
Data Privacy: Ensure scraped data is used ethically and respects user privacy.
Model Usage: Use the fine-tuned model responsibly, avoiding misuse in automated systems without human oversight.
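The responsible-scraping points above can be sketched with the standard library alone: check a site's robots.txt rules before fetching, then walk a capped number of pages with a pause between requests. The robots rules, the `?page=` URL pattern, and both function names below are illustrative assumptions, not Jobinja.ir's actual rules or the logic in `scraper.py`.

```python
import time
import urllib.robotparser

def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

def polite_pages(base_url, max_pages=60, delay_s=3.0, sleep=time.sleep):
    """Yield page URLs, pausing delay_s seconds between consecutive pages."""
    for page in range(1, max_pages + 1):
        if page > 1:
            sleep(delay_s)
        yield f"{base_url}?page={page}"  # hypothetical pagination scheme

# Hypothetical rules for a placeholder site.
rules = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_robots(rules, "https://example.com/jobs"))       # True
print(allowed_by_robots(rules, "https://example.com/private/x"))  # False

for url in polite_pages("https://example.com/jobs", max_pages=2, delay_s=0.1):
    print(url)
```

Passing `sleep` as a parameter keeps the generator easy to test without real waiting.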

Technologies Used

| Technology | Role |
| --- | --- |
| Python | Primary programming language. |
| Selenium | Web scraping from Jobinja.ir. |
| Pandas | Data manipulation and preprocessing. |
| Scikit-learn | Data splitting for training and testing. |
| Transformers | Loading and fine-tuning the DeepSeek model. |
| Datasets | Managing training and test datasets. |
| PEFT | Efficient fine-tuning with LoRA. |
| PyTorch | Model training and GPU support. |
| Google Colab | Environment for notebook execution with GPU. |

Contributing

Contributions are welcome! To contribute:

Fork the repository on GitHub.
Create a new branch for your changes.
Submit a pull request with a clear description.
For bug reports or feature requests, open an issue.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or feedback, open an issue on GitHub or email [[email protected]].
