The Job Title Generator is a Python-based project that automates the process of scraping job listings from Jobinja.ir, a leading job portal in Iran, and fine-tuning a language model to generate accurate job titles based on job tags and skills. Using Selenium for web scraping and the DeepSeek-R1-Distill-Qwen-1.5B model with LoRA (Low-Rank Adaptation) for efficient fine-tuning, this project creates a dataset of job listings and trains a model to suggest relevant job titles. It’s ideal for HR professionals, job portal developers, and AI enthusiasts looking to explore web scraping and natural language processing in Persian.
- Automate Job Title Creation: Generate accurate Persian job titles based on tags and skills, streamlining HR processes.
- Ethical Web Scraping: Includes delays and limits to respect Jobinja.ir’s servers, ensuring responsible data collection.
- AI-Powered: Leverages a fine-tuned DeepSeek model for high-quality job title suggestions.
- User-Friendly: Provides a clear Python script for scraping and a Jupyter notebook for model training.
- Open-Source: Encourages community contributions to enhance functionality and adapt to other use cases.
- Web Scraping: Extracts job titles, tags, skills, gender requirements, and certifications from Jobinja.ir using Selenium.
- Data Preprocessing: Cleans and formats scraped data for model training, handling Persian text and special characters.
- Model Fine-Tuning: Uses LoRA with the PEFT library to efficiently fine-tune the DeepSeek model on job data.
- Evaluation: Assesses model performance with validation loss and sample predictions.
- Dataset Creation: Generates a structured CSV dataset for training and analysis.
- Error Handling: Includes robust error management for scraping and training processes.
- Install Python 3.8 or Higher:
  - Download from Python.org.
  - Verify installation: `python --version`
- Install Dependencies:
  - Install required libraries: `pip install selenium pandas`
- Set Up ChromeDriver:
  - Download ChromeDriver matching your Chrome browser version.
  - Add ChromeDriver to your system PATH or place it in the project directory.
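Before running the scraper, it can help to confirm that ChromeDriver is discoverable in one of the two places mentioned above. The following is a minimal sketch using only the standard library; the helper name `find_chromedriver` is hypothetical and is not part of this project.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_chromedriver(project_dir: str = ".") -> Optional[str]:
    """Return a path to a ChromeDriver binary, or None if none is found.

    Checks the system PATH first, then the given project directory.
    """
    on_path = shutil.which("chromedriver")
    if on_path:
        return on_path
    # Fall back to a binary placed directly in the project directory.
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = Path(project_dir) / name
        if candidate.is_file():
            return str(candidate)
    return None


if __name__ == "__main__":
    path = find_chromedriver()
    print(path or "ChromeDriver not found on PATH or in the project directory")
```

If the check prints a path, Selenium should be able to start Chrome with that driver, provided the driver version matches the installed browser.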
The notebook is designed to run in Google Colab with GPU support, but it can be run locally with additional setup:
- Install Additional Libraries (for local execution):
  - Install required libraries: `pip install torch transformers datasets peft evaluate pandas scikit-learn`
  - Ensure a GPU is available for faster training, though CPU is supported.
- Set Up Google Colab (recommended):
  - No local installation needed; upload the notebook to Colab and ensure a GPU runtime is selected.
Note: Training the model may take several minutes to hours depending on hardware. Use a GPU runtime in Colab for optimal performance. Scraping may take time due to the 60-page limit and 3-second delay between requests; adjust `max_pages_to_scrape` in `scraper.py` if needed.
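To make the scraping time budget concrete, here is a minimal sketch of a rate-limited page loop using the 60-page limit and 3-second delay quoted above. It is not the code in `scraper.py`: the function names, the injectable `fetch_page` callable, and the choice to skip the pause after the last page are all assumptions made for illustration.

```python
import time
from typing import Callable, List

MAX_PAGES_TO_SCRAPE = 60  # page limit quoted in the note above
DELAY_SECONDS = 3.0       # pause between requests quoted in the note above


def minimum_delay_seconds(pages: int, delay: float = DELAY_SECONDS) -> float:
    """Total time spent sleeping: one pause after each page except the last."""
    return max(pages - 1, 0) * delay


def scrape_politely(
    fetch_page: Callable[[int], List[dict]],
    max_pages: int = MAX_PAGES_TO_SCRAPE,
    delay: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call fetch_page(1..max_pages), sleeping between consecutive requests."""
    jobs: List[dict] = []
    for page in range(1, max_pages + 1):
        jobs.extend(fetch_page(page))
        if page < max_pages:
            sleep(delay)
    return jobs


if __name__ == "__main__":
    # Time spent only on waiting, excluding page loads and parsing.
    print(minimum_delay_seconds(MAX_PAGES_TO_SCRAPE), "seconds of delay")
```

Under these assumptions the delays alone account for about three minutes at the full page limit, so lowering `max_pages_to_scrape` shortens a run roughly in proportion.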
- Navigate to Project Directory: `cd Job-Title-Generator`
- Run the Scraper:
  - Execute: `python scraper.py`
  - This generates `jobinja_jobs.csv` with job data (title, tags, skills, gender, certification).
  - Note: The script scrapes up to 60 pages, with ~20 jobs per page. Adjust `max_pages_to_scrape` in `scraper.py` for faster execution.
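For readers who want a rough picture of what a scraping step of this kind involves, the sketch below loads one listing page with headless Chrome and writes rows to `jobinja_jobs.csv`. It is an illustration only, not the contents of `scraper.py`: the listing URL pattern, the CSS selector, and the function names are hypothetical, and the column names are taken from the dataset description in this README.

```python
import csv
from typing import Dict, Iterable, List

FIELDNAMES = ["job_title", "job_tags", "job_skills", "gender", "certification"]


def build_page_url(page: int) -> str:
    """Hypothetical listing URL; check the real pattern used in scraper.py."""
    return f"https://jobinja.ir/jobs?page={page}"


def fetch_titles(page: int, selector: str = "a.job-title") -> List[str]:
    """Return the text of elements matching a (hypothetical) title selector."""
    # Imported lazily so the pure helpers work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(build_page_url(page))
        elements = driver.find_elements(By.CSS_SELECTOR, selector)
        return [el.text.strip() for el in elements]
    finally:
        driver.quit()  # always release the browser, even on errors


def write_jobs_csv(rows: Iterable[Dict[str, str]], path: str = "jobinja_jobs.csv") -> None:
    """Write rows with the five dataset columns; missing values become empty."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        for row in rows:
            writer.writerow({name: row.get(name, "") for name in FIELDNAMES})
```

Writing the file as UTF-8 keeps the Persian text intact when the CSV is later read for training.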
- Open the Notebook:
  - Upload `FineTuning_Jobinja.ipynb` to Google Colab.
  - Upload `jobinja_jobs.csv` to Colab.
- Run the Notebook:
  - Execute all cells to:
    - Preprocess the CSV data.
    - Fine-tune the DeepSeek model using LoRA.
    - Evaluate the model on a test set.
    - Save the trained model to `./deepseek_persian_job_title_generator`.
  - The notebook outputs sample predictions to verify performance.
- Generate Job Titles:
  - Use the saved model to generate job titles for new tags and skills, as shown in the notebook’s example outputs.
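As a rough guide to the last step, the sketch below loads a saved LoRA adapter on top of the base model and generates a title. Several things here are assumptions rather than facts about this project: that the saved directory holds a PEFT adapter (not a merged model), that the base checkpoint is `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` on the Hugging Face Hub, and the prompt layout in `build_prompt`, which is hypothetical and should be replaced with the exact format used during training in the notebook.

```python
def build_prompt(tags: str, skills: str) -> str:
    """Hypothetical prompt layout; reuse the exact format from the notebook."""
    return f"Tags: {tags}\nSkills: {skills}\nJob title:"


def generate_title(
    tags: str,
    skills: str,
    adapter_dir: str = "./deepseek_persian_job_title_generator",
    base_model: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    max_new_tokens: int = 32,
) -> str:
    # Imported lazily so build_prompt can be used without the ML stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)
    # Assumes adapter_dir contains LoRA adapter weights saved via PEFT.
    model = PeftModel.from_pretrained(model, adapter_dir)
    model.eval()

    inputs = tokenizer(build_prompt(tags, skills), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

A call such as `generate_title("فروش و بازاریابی", "فروش تلفنی")` would then return the model's suggestion. If the notebook saved a merged model instead of an adapter, loading `adapter_dir` directly with `AutoModelForCausalLM.from_pretrained` would be the simpler route.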
The dataset is scraped from Jobinja.ir and includes:
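- job_title: The job’s title (e.g., "استخدام کارشناس فروش تلفنی", meaning "Hiring a telesales specialist").
- job_tags: Job categories (e.g., "فروش و بازاریابی", meaning "Sales and marketing").
- job_skills: Required skills (e.g., "فروش تلفنی, اصول و فنون مذاکره", meaning "Telephone sales, principles and techniques of negotiation").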
- gender: Gender requirements, if specified.
- certification: Required certifications, if any.
The scraper collects up to 60 pages (~1200 jobs), creating a robust dataset for training. The notebook preprocesses this data, handling Persian text and cleaning special characters.
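The kind of cleaning described above can be sketched as follows. This is not the notebook's preprocessing code, which uses pandas; it is a standard-library illustration, and the specific rules (mapping Arabic kaf and yeh to their Persian forms, keeping the zero-width non-joiner, replacing other punctuation with spaces) are assumptions about what "handling Persian text" might reasonably include.

```python
import csv
import re
from typing import Dict, List

# Arabic kaf/yeh often appear in place of Persian keheh/yeh in scraped text.
_CHAR_MAP = str.maketrans({"\u0643": "\u06a9", "\u064a": "\u06cc"})
# Keep word characters, whitespace, Latin and Persian commas, and the
# zero-width non-joiner (U+200C) that Persian spelling relies on.
_NOT_ALLOWED = re.compile(r"[^\w\s,\u060c\u200c]")


def clean_text(text: str) -> str:
    """Normalize look-alike letters, drop special characters, collapse spaces."""
    text = text.translate(_CHAR_MAP)
    text = _NOT_ALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def load_jobs(path: str = "jobinja_jobs.csv") -> List[Dict[str, str]]:
    """Read the CSV and return cleaned title/tags/skills rows."""
    rows: List[Dict[str, str]] = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            title = clean_text(row.get("job_title") or "")
            if not title:
                continue  # a row without a title cannot be a training target
            rows.append({
                "job_title": title,
                "job_tags": clean_text(row.get("job_tags") or ""),
                "job_skills": clean_text(row.get("job_skills") or ""),
            })
    return rows
```

Normalizing look-alike letters matters because otherwise the same word can be split into different tokens depending on which keyboard layout the job poster used.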
The model is DeepSeek-R1-Distill-Qwen-1.5B, a distilled language model optimized for natural language tasks. Fine-tuning with LoRA adapts it to generate Persian job titles, achieving a final validation loss of 0.5974 after 5 epochs.
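To show why LoRA makes fine-tuning cheap, the sketch below counts the parameters a low-rank update adds to a single weight matrix and builds a PEFT `LoraConfig`. Every number in it is an assumption for illustration: the rank, alpha, dropout, target module names, and the 1536 × 1536 example matrix are not taken from the notebook, so the notebook's actual settings should be used when reproducing the reported results.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters LoRA adds for one d_out x d_in weight.

    The update is B @ A with B of shape (d_out, rank) and A of shape
    (rank, d_in), so only rank * (d_out + d_in) values are trained.
    """
    return rank * (d_out + d_in)


def build_lora_config(rank: int = 8):
    """Return a PEFT LoraConfig with illustrative (assumed) hyperparameters."""
    # Imported lazily so the arithmetic helper runs without PEFT installed.
    from peft import LoraConfig

    return LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumed attention projections
        task_type="CAUSAL_LM",
    )


if __name__ == "__main__":
    full = 1536 * 1536
    added = lora_trainable_params(1536, 1536, rank=8)
    print(f"full matrix: {full:,} weights, LoRA update: {added:,} ({added / full:.2%})")
```

For the illustrative matrix, the rank-8 update trains 24,576 values against 2,359,296 in the full matrix, which is about 1% of the weights for that layer. The config would typically be applied with `peft.get_peft_model(model, config)` before training.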
The fine-tuned model was evaluated on a test set, producing accurate job title predictions. Examples include:
Input Tags | Input Skills | Predicted Title | Actual Title |
---|---|---|---|
فروش و بازاریابی | فروش تلفنی اصول و فنون مذاکره فروش | استخدام کارشناس فروش تلفنی | استخدام کارشناس فروش تلفنی پاکدشت |
کارگر ساده نیروی خدماتی | نظافت پذیرایی و تشریفات امور خدماتی | استخدام کارشناس پذیرایی | استخدام کارمند خدمات |
فروش و بازاریابی | اصول و فنون مذاکره فروش فروش و بازاریابی | استخدام کارشناس فروش | استخدام کارشناس فروش اصفهان |
وب برنامه نویسی و نرم افزار | پشتیبانی نرم افزار Sql Server پشتیبانی | استخدام کارشناس پشتی | استخدام کارشناس پشتیبانی نرم افزار |
کارشناس حقوقی وکالت | وکالت | استخدام کارشناس وکالت | استخدام وکیل قم |
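These examples show that the model generates titles close to the actual ones. Most predictions omit location-specific details, such as the city names (پاکدشت, اصفهان, قم) that appear in the actual titles and cannot be inferred from tags and skills alone. One prediction (استخدام کارشناس پشتی) is cut off mid-word, and the services example predicts a different role from the actual title.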
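A simple way to put a number on "close to the actual ones" is token-level overlap between the predicted and actual titles. The sketch below computes a token F1 score; it is offered as one possible check, not as the metric used in the notebook, which reports validation loss and sample predictions.

```python
from collections import Counter


def token_f1(predicted: str, actual: str) -> float:
    """Harmonic mean of token precision and recall, splitting on whitespace."""
    pred_tokens = predicted.split()
    true_tokens = actual.split()
    if not pred_tokens or not true_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both titles.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two (predicted, actual) pairs copied from the table above.
    examples = [
        ("استخدام کارشناس فروش تلفنی", "استخدام کارشناس فروش تلفنی پاکدشت"),
        ("استخدام کارشناس وکالت", "استخدام وکیل قم"),
    ]
    for predicted, actual in examples:
        print(f"{token_f1(predicted, actual):.2f}")
```

A prediction that matches the actual title except for a trailing city name keeps perfect precision and loses only recall, so it scores high, while a prediction naming a different role scores much lower.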
- Responsible Scraping: The scraper includes a 3-second delay between requests and a 60-page limit to minimize server load on Jobinja.ir. Always comply with Jobinja.ir’s terms of service.
- Data Privacy: Ensure scraped data is used ethically and respects user privacy.
- Model Usage: Use the fine-tuned model responsibly, avoiding misuse in automated systems without human oversight.
Technology | Role |
---|---|
Python | Primary programming language. |
Selenium | Web scraping from Jobinja.ir. |
Pandas | Data manipulation and preprocessing. |
Scikit-learn | Data splitting for training and testing. |
Transformers | Loading and fine-tuning the DeepSeek model. |
Datasets | Managing training and test datasets. |
PEFT | Efficient fine-tuning with LoRA. |
Torch | Model training and GPU support. |
Google Colab | Environment for notebook execution with GPU. |
Contributions are welcome! To contribute:
- Fork the repository on GitHub.
- Create a new branch for your changes.
- Submit a pull request with a clear description.
- For bug reports or feature requests, open an issue.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or feedback, open an issue on GitHub.