Introduction:
In today’s competitive job market, efficiently matching candidates to job openings is crucial for both job seekers and employers. At Parser, we are continually exploring alternatives to improve our existing processes to address challenges like this one. With this idea in mind, we’ve developed a model using BERT (Bidirectional Encoder Representations from Transformers) and Weights & Biases (WandB) to enhance our matching capabilities and provide insightful results.
This article explores the architecture of our model, the tools used for training, and the strategies we implemented to improve its accuracy.
Understanding BERT Architecture:
BERT is a transformer-based technique for natural language processing developed by Google. It interprets each word by looking at the words that come both before and after it, so the representation of a word reflects all of its surroundings rather than only the text on one side.
Unlike traditional language models, BERT learns context from both the left and right sides of a token simultaneously. This ability to comprehend complex language patterns and contextual nuances is crucial for matching job requirements with candidate qualifications.
Understanding Weights & Biases (WandB):
WandB is a Machine Learning Operations (MLOps) platform that helps with tracking experiments and with managing and visualising machine learning models. It offers several key benefits:
- Experiment Tracking: WandB logs hyperparameters, metrics, and model performance, creating a detailed history of our iterations.
- Visualisation: The platform generates real-time, interactive visualisations of our training progress, helping us quickly identify and address issues like overfitting.
- Model Versioning: We use WandB to maintain version control for our models, making it easy to compare performance across iterations or revert to previous states when needed.
By leveraging these features, we maintain a structured, data-driven approach to developing and refining our CV-Job Description matcher, including logging our training progress, such as loss and evaluation metrics for each epoch.
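As a rough illustration of the per-epoch logging described above, the sketch below builds the dictionary of values for one epoch and hands it to a run object. The metric names (`train/loss`, `eval/precision`, and so on), the `log_epoch` helper, and the `RecordingRun` stand-in are assumptions made for this example rather than our production code, and the numbers are made up.

```python
from typing import Dict, List


class RecordingRun:
    """Stand-in for a WandB run so the sketch runs without an account.

    It mimics only the one method used here: log(dict).
    """

    def __init__(self) -> None:
        self.history: List[Dict[str, float]] = []

    def log(self, data: Dict[str, float]) -> None:
        self.history.append(dict(data))


def log_epoch(run, epoch: int, train_loss: float,
              eval_metrics: Dict[str, float]) -> Dict[str, float]:
    """Send one epoch's training loss and evaluation metrics to the run."""
    payload = {"epoch": epoch, "train/loss": train_loss}
    # Prefix evaluation metrics so related values share a common name prefix.
    payload.update({f"eval/{name}": value for name, value in eval_metrics.items()})
    run.log(payload)
    return payload


run = RecordingRun()
# Made-up numbers, purely to show the shape of the logged data.
log_epoch(run, epoch=1, train_loss=0.61, eval_metrics={"precision": 0.70, "recall": 0.65})
log_epoch(run, epoch=2, train_loss=0.48, eval_metrics={"precision": 0.74, "recall": 0.71})
print(run.history[-1])
```

With the real library, `run` would be the object returned by `wandb.init(...)`, whose `log` method accepts a dictionary like the one built here.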
What Do These Values Mean?
- Accuracy measures the proportion of correct predictions out of all predictions. For example, if your model correctly classified 500 images out of a total of 1000 images, your accuracy would be 500/1000 = 0.5, or 50%.
- Recall, also known as sensitivity or true positive rate, measures the proportion of correct positive predictions over the actual positive instances.
For example, if your task is to classify images as either “cat” or “not a cat”, and your model correctly identifies 500 cat images but misses another 100 cat images by labelling them “not a cat”, your recall would be 500/600 = 0.83 or 83%.
- Epoch refers to a complete cycle of training on a dataset. The number of epochs you train for determines how many times your model has seen the complete dataset and had the opportunity to adjust its parameters based on the error signals.
- Precision measures the proportion of correct positive predictions out of all predicted positives.
Continuing with the animal classification example, if your model confidently predicts 600 images as “cat” but only 250 of those are actually cats, your precision would be 250/600 = 0.4167 or 41.67%.
- F1 score is a measure generally used in the field of binary classification problems. It is the harmonic mean of the precision and recall of the model, calculated as: F1 = 2 * (precision * recall) / (precision + recall).
A high F1 score suggests that the model accurately identifies relevant CVs for a given job description (high precision) while capturing a large proportion of the relevant CVs (high recall).
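The definitions above can be written out as a few small functions. This is a minimal sketch: the function names are our own for illustration, and the counts simply mirror the style of the examples above (the recall figures assume 500 cats found and 100 cats missed).

```python
def accuracy(correct: int, total: int) -> float:
    """Share of all predictions that were correct."""
    return correct / total if total else 0.0


def precision(true_pos: int, false_pos: int) -> float:
    """Share of predicted positives that are actually positive."""
    predicted_pos = true_pos + false_pos
    return true_pos / predicted_pos if predicted_pos else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of actual positives that the model found."""
    actual_pos = true_pos + false_neg
    return true_pos / actual_pos if actual_pos else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * (p * r) / (p + r) if (p + r) else 0.0


acc = accuracy(correct=500, total=1000)       # 500 of 1000 images correct
prec = precision(true_pos=250, false_pos=350)  # 600 predicted cats, 250 real
rec = recall(true_pos=500, false_neg=100)      # 600 real cats, 500 found
f1 = f1_score(prec, rec)

print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a useful single number for a matcher that must be both selective and thorough.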
Initial Implementation (step by step):
Here’s an overview of our process for implementing the model:
- We started with a dataset containing pairs of CVs and job descriptions, labelled as matches or non-matches.
- We used a pre-trained BERT model (‘bert-base-uncased’) and added a classification layer on top for our binary matching task.
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
- We tokenised the CV and job description pairs using BERT’s tokeniser, with a maximum sequence length of 512 tokens.
from transformers import BertTokenizer
max_length = 512
tokeniser = BertTokenizer.from_pretrained('bert-base-uncased')
- Our initial training process ran for 5 epochs, using AdamW optimiser with a learning rate of 2e-5.
AdamW is an optimisation algorithm commonly used to train deep-learning models. It incorporates a technique called decoupled weight decay regularisation, which, in simpler words, “avoids our model getting too complicated.”
This means the BERT model can learn to match CVs and job descriptions accurately, without becoming too complex or specific to the training data.
num_epochs = 5
optimiser = AdamW(model.parameters(), lr=2e-5)
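To make “decoupled weight decay” a little more concrete, here is a minimal sketch of a single AdamW update for one scalar weight, written in plain Python. It follows the commonly described form of the algorithm; the `AdamWState` class, the `adamw_step` function, and the default values for `beta1`, `beta2`, `eps`, and `weight_decay` are assumptions for illustration, not the settings of our training run (only the learning rate of 2e-5 comes from the text above).

```python
import math
from dataclasses import dataclass


@dataclass
class AdamWState:
    """Running statistics AdamW keeps for each weight."""
    m: float = 0.0   # moving average of gradients
    v: float = 0.0   # moving average of squared gradients
    step: int = 0


def adamw_step(param: float, grad: float, state: AdamWState, lr: float = 2e-5,
               beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
               weight_decay: float = 0.01) -> float:
    """Return the updated weight after one AdamW step (state is updated in place)."""
    state.step += 1
    # Decoupled weight decay: shrink the weight directly, independent of the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Standard Adam moment estimates with bias correction.
    state.m = beta1 * state.m + (1 - beta1) * grad
    state.v = beta2 * state.v + (1 - beta2) * grad * grad
    m_hat = state.m / (1 - beta1 ** state.step)
    v_hat = state.v / (1 - beta2 ** state.step)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# With a zero gradient, only the weight decay acts on the weight.
decayed = adamw_step(1.0, grad=0.0, state=AdamWState(), lr=0.1, weight_decay=0.01)
# With weight decay switched off, the first step moves the weight by roughly lr.
stepped = adamw_step(1.0, grad=1.0, state=AdamWState(), lr=0.1, weight_decay=0.0)
print(decayed, stepped)
```

The first call shows the regularising effect in isolation: even without any error signal, the weight is nudged slightly towards zero, which discourages the model from relying on very large weights.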
Initial Results:
While our baseline model showed promising results, we identified several areas for improvement, particularly regarding some false positives—matches that were not necessarily accurate.
As shown in this example, we submitted the profile of a Front-end Engineer for a Cassandra DBA opening. Although the candidate had some database experience, they had none with Cassandra, yet the model returned a match of 38.6%.
For a job description and a CV that are clearly not a good match, that score looked too high. Let’s take a step back to understand how our model calculates the match, so we can think of ways to optimise it.
How is the match calculated?
Our model calculates the match between a CV and a job description by outputting a probability score, from 0 to 1. A score closer to 1 means it thinks it’s a great match, while a score closer to 0 means the opposite.
We check how well our matchmaker is doing using the metrics introduced earlier. Here is how two of them relate to these scores:
Precision: Of the candidates the model flags as matches, how many really are right for the job.
Recall: Of all the good candidates, how many the model finds, so that we don’t miss any.
By adjusting the threshold for what we consider a “match” (e.g., probability > 0.5), we can balance between precision and recall.
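The trade-off can be seen with a handful of scores. The sketch below is illustrative only: the probabilities and labels are invented, and `precision_recall_at` is a hypothetical helper rather than part of our codebase.

```python
from typing import List, Tuple


def precision_recall_at(probs: List[float], labels: List[int],
                        threshold: float) -> Tuple[float, float]:
    """Precision and recall when scores above `threshold` count as a match."""
    tp = fp = fn = 0
    for prob, label in zip(probs, labels):
        predicted_match = prob > threshold
        if predicted_match and label == 1:
            tp += 1
        elif predicted_match and label == 0:
            fp += 1
        elif not predicted_match and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented match probabilities and their true labels (1 = real match).
scores = [0.95, 0.80, 0.62, 0.386, 0.30, 0.10]
labels = [1, 1, 0, 0, 1, 0]

for threshold in (0.25, 0.5, 0.7):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold catches every real match at the cost of more false positives, while raising it removes the false positives but leaves one good candidate behind.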
Experimenting with different thresholds will help you optimise your model. However, during training our model is optimised using binary cross-entropy loss (BCEWithLogitsLoss), which doesn’t involve a threshold directly. What does this mean in simple terms?
Imagine teaching a computer to rate movies on a scale of 0 to 100.
“Binary cross-entropy loss” is like a scoring system that measures how accurate the computer’s ratings are compared to expert critics’ ratings.
We don’t tell the computer “anything above 50 is a good movie” (that would be a threshold).
Instead, we simply inform the computer how close or far its rating is from the expert’s rating, and it keeps adjusting to improve its accuracy.
The threshold is a post-training decision. It’s a tool for interpreting the model’s output rather than a part of the model’s learning process.
More generally, training relies on a loss function. This is like a teacher that tells the AI how wrong it is. During training, the goal is to make this “wrongness score” as small as possible, which in turn helps improve precision and recall.
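For readers who want to see the “wrongness score” as a formula, the sketch below computes binary cross-entropy from a raw model output (a logit) in plain Python, using the usual numerically stable form. It is meant to illustrate the per-example quantity that PyTorch’s `BCEWithLogitsLoss` averages over a batch by default; the function names and the sample logits are our own assumptions.

```python
import math
from typing import List


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy for one example, computed from the raw logit.

    Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]:
    max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits: List[float], targets: List[float]) -> float:
    """Average loss over a batch of examples."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


confident_right = bce_with_logits(3.0, 1.0)   # strongly predicts "match", and it is one
unsure = bce_with_logits(0.0, 1.0)            # sits on the fence (probability 0.5)
confident_wrong = bce_with_logits(3.0, 0.0)   # strongly predicts "match", but it is not

print(confident_right, unsure, confident_wrong)
```

No threshold appears anywhere in the calculation: the loss only measures how far the predicted probability is from the true label, and it grows sharply when the model is confidently wrong.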
Part of the Problems We Detected:
- The model sometimes struggled with longer CVs and job descriptions due to the 512-token limit.
- Initially, we trained for only 5 epochs.
- We used a small dataset for training; we definitely needed more data!
These observations led us to implement several strategies to improve the model’s efficiency and accuracy.
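To make the first problem concrete, the sketch below shows roughly what happens when a CV and a job description together exceed the 512-token budget. It is a simplified illustration, not the tokeniser’s actual code: tokens are represented as plain list items, three positions are reserved for the special tokens that surround a BERT sentence pair, and `truncate_pair` is a hypothetical helper that trims whichever text is currently longer.

```python
from typing import List, Tuple

SPECIAL_TOKENS = 3  # [CLS] cv [SEP] job description [SEP]


def truncate_pair(cv_tokens: List[str], jd_tokens: List[str],
                  max_length: int = 512) -> Tuple[List[str], List[str], int]:
    """Trim the pair to fit max_length; return both texts and the number of dropped tokens."""
    if max_length < SPECIAL_TOKENS:
        raise ValueError("max_length must leave room for the special tokens")
    budget = max_length - SPECIAL_TOKENS
    cv, jd = list(cv_tokens), list(jd_tokens)
    dropped = 0
    while len(cv) + len(jd) > budget:
        # Remove one token from the end of whichever text is currently longer.
        if len(cv) >= len(jd):
            cv.pop()
        else:
            jd.pop()
        dropped += 1
    return cv, jd, dropped


# A long CV (900 tokens) paired with a shorter job description (200 tokens).
long_cv = [f"cv_{i}" for i in range(900)]
job_description = [f"jd_{i}" for i in range(200)]

kept_cv, kept_jd, dropped = truncate_pair(long_cv, job_description)
print(len(kept_cv), len(kept_jd), dropped)
```

In this example, well over half of the CV never reaches the model, so any skills listed towards the end of the document cannot influence the match score.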
Improving Model Accuracy:
Here are some strategies we employed to improve the accuracy of our model:
1. Increased training epochs
2. Tested with larger pre-trained models
For example:
model = BertForSequenceClassification.from_pretrained('bert-large-uncased', num_labels=1)
3. Increased maximum sequence length from 512 to 1024
4. Early stopping to prevent overfitting
As we compare our initial results:
Most Effective Strategies:
While all strategies contributed to improving our model, increasing the number of training epochs and extending the maximum sequence length had the most significant impact on the model’s performance.
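Extending the maximum sequence length from 512 to 1024 tokens was particularly impactful, allowing the model to process longer CVs and job descriptions more effectively and capture more relevant information and context. Note that the standard BERT checkpoints are pre-trained with 512 position embeddings, so inputs longer than that need extra handling, such as extended position embeddings or splitting the text into chunks.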
It’s important to note that while increasing epochs improved performance, we had to be cautious of overfitting, and implementing early stopping was crucial to prevent it.
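Early stopping can be sketched in a few lines. The version below is a minimal, framework-free illustration that watches the validation loss and stops once it has failed to improve for a set number of epochs; the `EarlyStopping` class, the `patience` and `min_delta` parameters, and the loss values are assumptions for this example rather than our exact configuration.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Invented validation losses: improving at first, then drifting upwards.
val_losses = [0.60, 0.48, 0.45, 0.46, 0.47, 0.50]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In practice, the model weights from the best epoch are usually saved as training progresses, so that stopping late does not mean keeping a slightly overfitted model.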
💡 Keep in mind: good data is always the key! Inadequate data preprocessing can introduce noise and bias into the model. Cleaning data, standardising formats, and removing irrelevant information are key to getting good results. Limited or poorly balanced datasets can hinder the model’s ability to generalise.
Conclusion:
The BERT architecture provides a strong foundation for understanding the nuanced language of both CVs and job descriptions, while WandB offers invaluable tools for tracking, visualising, and improving our model’s performance.
Through systematic fine-tuning and various optimisation strategies, such as increased training epochs, larger pre-trained models, and early stopping mechanisms, we’ve significantly enhanced the accuracy and reliability of our matcher. This project showcases the potential of AI in streamlining human resource processes.
As we look to the future, it’s clear that the intersection of advanced NLP techniques and human resource management holds immense promise for creating more efficient, effective, and equitable hiring processes.