
· 11 min read
Juraj Bezdek

OpenAI's ChatGPT has taken the world by storm with its ability to answer our questions. Regardless of the quality of the information it sometimes produces, its ability to generate text in different styles and languages, covering a wide range of topics, is undoubtedly impressive.

So, can we leverage the power of GPT to enhance our dataset, and when is it a good idea? Can we use these LLMs to generate synthetic data that improves our dataset?

When should we use ChatGPT to generate synthetic data?

... let's ask ChatGPT 🤓

Ok, it is right in general, but the answer is kind of vague. Let's dig a bit deeper.

First of all, you should definitely use synthetic data only to enhance your existing dataset. Training a model purely on synthetic data would very likely result in a very poor model, not really usable in the wild.

So when should we use it?

Handling imbalanced class distribution​

In the real world, it is much more likely that the distribution of examples across different classes is imbalanced. If the imbalance is not too large (within a +/- 50% margin as a rule of thumb) and our training procedure and model are good enough, it is usually not a big problem. We can solve this by using class weighting, where we assign stronger weights (importance) to the underrepresented classes and weaker weights to the overrepresented classes.
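
For illustration, here is one common way to compute such weights with scikit-learn; the labels below are a made-up toy example, and the exact mechanism depends on your training framework:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: the "ok" class heavily outnumbers the "toxic" class
y = np.array(["ok"] * 90 + ["toxic"] * 10)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

for cls, w in zip(classes.tolist(), weights.tolist()):
    print(cls, round(w, 3))  # ok 0.556 / toxic 5.0 -- the rare class gets a proportionally larger weight
```

Many scikit-learn classifiers also accept `class_weight="balanced"` directly, and most deep learning frameworks let you pass per-class weights to the loss function.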

However, if the imbalance is much bigger (for example, 99% / 1%), this might not be enough. Simply put, in this case, you probably have only a few examples for the underrepresented class, and during training, the model does not have enough data to understand what you are trying to achieve.

In this case, even if your training evaluation shows promising results, it is very likely that the model is just overfitting. Sometimes we have so little data that only a handful of examples end up in the evaluation part of the dataset. For example, if we have only 10 examples and a 70/30 split, given the random nature of the selection process, we might easily end up with just 2 or 3 examples in the evaluation set, and they might be either very similar to or very different from the rest, rendering the whole evaluation useless.
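
To make the problem concrete, here is a quick sketch with made-up data showing how little ends up in the evaluation split:

```python
from sklearn.model_selection import train_test_split

# Pretend these are the only 10 labeled examples we have for the rare class
texts = [f"rare-class example {i}" for i in range(10)]

train_texts, eval_texts = train_test_split(texts, test_size=0.3, random_state=42)
print(len(train_texts), len(eval_texts))  # 7 train, 3 eval -- far too few to trust any metric
```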

Filling the gaps​

In many cases, we have plenty of straightforward examples in our dataset. For example, if we are doing content moderation and have examples like "This is great, love it" and "Die, you stupid MF", these are quite obvious. But to build a robust and reliable model, we need to define the edge cases and set the boundaries, and if we don't have enough data for that, it can be pretty challenging.

There is a spectrum to everything: between the obviously harmless and the clearly toxic comments there is a wide range of borderline cases.

What we want to do here is help the model find those boundaries. To do that, we need to provide examples.

Improving a model on wrongly predicted data​

In the last blog post about identifying problems in our dataset, we described best practices for improving our model by targeting specific cases. (https://docs.labelator.io/blog#targeting-specific-cases)

But what should we do if we are sure that our labels (even for similar examples) are correct, but the model still keeps getting it wrong? The most naive approach might be to simply duplicate these records, retrain on a smaller set containing only these records, or add more weight to them to force the model to pay more attention to these examples. But this is not a good idea: at best, it would most likely cause the model to overfit on these examples, and it might even hurt the training overall.

What we need is more examples like this. They have to be similar but different. This is where generating synthetic data with LLMs like GPT-3 comes in very handy.

Using LLMs to generate synthetic data​

Ok, so ChatGPT was right... 🤓 The problem is that we don't have enough examples, so let's generate some!

Ok, let's start with a naive approach... let's say we want to enhance some toy dataset, like ag-news, which contains news titles in 4 categories (classes): Sport, Business, Sci/Tech, and World... (Let's ignore the fact that this particular dataset is perfectly balanced for the moment 😉)
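
(If you want to follow along, ag-news is available on the Hugging Face Hub; here's a minimal way to peek at it, assuming you have the datasets library installed:)

```python
from datasets import load_dataset

# ag_news label ids: 0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech
dataset = load_dataset("ag_news", split="train")
print(dataset[0]["text"], dataset[0]["label"])  # prints the first headline and its label id
```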

So let's assume that we have only 1 example of Sport in our dataset and need to multiply it...:

Prompt:

Generate 5 examples of text similar to this: 

Bonds Takes Stock The distraction of another milestone is gone, and Bonds is preparing himself to play every game for the rest of the season

Response:

1.  LeBron James is focused on winning the NBA championship and is taking a break from social media to prepare himself for the upcoming playoffs.
2. After a long hiatus, Taylor Swift is back in the studio recording new music for her fans.
3. Novak Djokovic is determined to defend his Wimbledon title and is training rigorously for the upcoming tournament.
4. Emma Stone is taking a break from acting to focus on her family and personal life.
5. Elon Musk is setting his sights on space exploration and is working hard to develop technology that can get humans to Mars.

Ok... this is far from ideal...

While examples 1 and 3 are relevant since they belong to the Sport category, the remaining ones are random and unrelated. Examples 2 and 4 appear to be sensationalized headlines typically found in tabloids and would most likely belong to the "World" category if we had to assign them to one of the four available categories. The "Elon Musk Mars mission" title, on the other hand, would fit better in the Sci/Tech category.

It seems that we haven't provided enough context to the model, so we need to be more specific by adding the desired label along with the example.

Prompt:

Generate 5 examples in category "Sport" of text similar to this: 

Bonds Takes Stock The distraction of another milestone is gone, and Bonds is preparing himself to play every game for the rest of the season

Response:

1.  Woods on the Comeback Trail: After a long hiatus due to injury, Tiger Woods is gearing up to make his return to competitive golf.
2. Durant Leads Nets to Victory: Kevin Durant's impressive performance led the Brooklyn Nets to a crucial win over their rivals, the Boston Celtics.
3. Djokovic Claims Another Title: Novak Djokovic showed his dominance yet again, winning his third consecutive Grand Slam title at the US Open.
4. Simone Biles Makes Olympic History: With her stunning performances, Simone Biles became the most decorated gymnast in Olympic history, solidifying her place as a true legend of the sport.
5. Serena Williams Announces Retirement: After an incredible career spanning over two decades, Serena Williams has announced that she will retire from professional tennis at the end of the season.

Okay, this task seems to be simple enough. We could achieve a similar result by providing more examples instead of explicitly mentioning the label. However, we need to keep in mind that this was a straightforward task, not even close to real-world applications.
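
By the way, if you prefer to script this instead of pasting prompts into the ChatGPT UI, a minimal sketch with the openai Python SDK could look like this (the model name and sampling settings are just illustrative choices):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

seed_example = (
    "Bonds Takes Stock The distraction of another milestone is gone, and Bonds "
    "is preparing himself to play every game for the rest of the season"
)

prompt = f'Generate 5 examples in category "Sport" of text similar to this:\n\n{seed_example}'

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # any chat-capable model works here
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,        # higher temperature -> more varied examples
)
print(response.choices[0].message.content)
```

From there, you only need to parse the numbered list out of the response and append the new rows to your dataset.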

Now, let's move on to a more realistic scenario. Consider the task of training a classifier for content moderation. Let's start with just one example:

Prompt:

Generate 5 examples in the category "Hate/Toxic" of text similar to this:

hmm , maybe theres more depth on the subject in the comment section.. nope just a bunch of degenerates.

Response:

1. Ugh, I can't believe I wasted my time reading this article. The comments are just a cesspool of ignorance and hate.
2. Why do I even bother scrolling through the comments? It's just a bunch of keyboard warriors spewing their toxic opinions.
3. I swear, the comment section is where intelligence goes to die. It's like a breeding ground for trolls and bigots.
4. I made the mistake of reading the comments again. It's like a never
5. I can't stand people who are so weak and sensitive. Grow a backbone and stop crying about everything.

Note that the previous example cannot be replicated on the ChatGPT website due to its strict hate speech detection system. However, in the OpenAI Playground, where moderation is not enabled, it works just fine.

Ok, not ideal again... the example we provided was hateful, since the author addressed the rest of the people commenting on our post as "a bunch of degenerates"... We want our community to be more inclusive, and even if we don't agree, we should at least be respectful.

Although the responses generated by GPT were similar in nature to the previous example, none of them crossed the line into toxic or hateful territory. Therefore, they were not particularly helpful in enhancing the Hate/Toxic part of our dataset.

So, let's try to provide GPT with more examples:

Prompt:

Generate 5 examples in the category "Hate/Toxic" of text similar to this:

- hmm , maybe theres more depth on the subject in the comment section.. nope just a bunch of degenerates.
- dumb mf
- THIS B**CH SHOULD BE DEPORTED

Response:

1. Ugh, another ignorant comment from someone who clearly has no idea what they're talking about.
2. I can't believe people like you still exist in this day and age. Your hate is disgusting.
3. You're a pathetic excuse for a human being. I hope you never reproduce.
4. Your toxic attitude is exactly what's wrong with society today. Get a grip.
5. I'm embarrassed to share the same planet as someone as hateful as you.

... so... most of the examples are in the same space, but I'm not sure that all of them could be classified as toxic. Example number three is the most representative, though.

While this approach may be useful for filling gaps and setting boundaries for the model, it may not improve the model's prediction quality in these specific cases.

To enhance the model's performance around these examples, we can explicitly tell ChatGPT what we don't want and narrow the generated comments down to the desired area. This technique can be highly beneficial for improving the model's performance in specific cases where it did not perform well.

To do this, we can simply take the generated examples that we did not like and add them to the prompt.

Prompt:

Generate 5 examples in the category "Hate/Toxic" different from this:

- I can't believe people actually support this garbage.
- You're a complete waste of space and oxygen.
- I hope you rot in hell for your disgusting behavior.
- You're nothing but a pathetic excuse for a human being.
- I wouldn't be surprised if you're the reason why the world is going to hell.

and very similar to this:

- hmm , maybe theres more depth on the subject in the comment section.. nope just a bunch of degenerates.
- dumb mf
- THIS B**CH SHOULD BE DEPORTED


Match the length and style of those examples

Response:

- You're such a narrow-minded bigot, it's sickening.
- I can't stand people like you who think they're better than everyone else.
- You're a disgrace to humanity and I wish you would just disappear.
- I can't believe how ignorant and hateful you are, it's truly disgusting.
- You're a toxic presence in this world and I hope you get what's coming to you.

Nice... this is much better.
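
To automate this feedback loop, you can assemble the "different from" and "similar to" sections of the prompt from your rejected and seed examples; a rough sketch (the helper name is ours, not part of any library):

```python
def build_prompt(category: str, seed_examples: list[str], disliked_examples: list[str]) -> str:
    """Build a generation prompt that steers away from previously rejected outputs."""
    disliked = "\n".join(f"- {text}" for text in disliked_examples)
    seeds = "\n".join(f"- {text}" for text in seed_examples)
    return (
        f'Generate 5 examples in the category "{category}" different from this:\n\n'
        f"{disliked}\n\n"
        "and very similar to this:\n\n"
        f"{seeds}\n\n"
        "Match the length and style of those examples"
    )

prompt = build_prompt(
    "Hate/Toxic",
    seed_examples=[
        "hmm , maybe theres more depth on the subject in the comment section.. nope just a bunch of degenerates.",
        "dumb mf",
    ],
    disliked_examples=[
        "I can't believe people actually support this garbage.",
        "You're a complete waste of space and oxygen.",
    ],
)
# Send `prompt` to the chat completion call from the earlier snippet.
```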

Generating synthetic data in Labelator.io​

Reproducing what we've described here is easy using ChatGPT or OpenAI Playground. You don't need any additional tools.

However, copying and pasting content between screens and adding data to your datasets can be tedious.

That's why we've added support for these use cases in Labelator.io. It makes enhancing your dataset super easy.

Let's begin by selecting the examples we want to multiply. For instance, if we want a more balanced dataset that addresses the lack of Hate/Toxic examples, we can choose some of those.

Now, to generate more data, we can select the option in the top right corner of one of the chosen examples.

If we select only one example, Labelator.io will automatically find more similar examples and close negatives (if there are any).

We could also provide additional instructions, such as explicitly asking GPT to use profanities. After clicking "Next", the examples are generated. If we don't like some of them, we can dislike them and generate more examples.

· 6 min read
Juraj Bezdek

In order to improve your machine learning model, it is crucial to train it on high-quality data. Problems in your data can lead to poor performance in your machine learning models, and identifying them can be hard, particularly when you have a large amount of data. One common issue is mislabeled data, which can be found in even the most widely used benchmark datasets.

Improving model performance by addressing mislabeled data is crucial and can have a much greater impact than just tweaking the hyperparameters. If you don't address it, you are essentially fine-tuning the model to reproduce the mislabels. However, identifying these problematic records can be difficult, especially when you have a lot of data.

Here are some ways you can do it.

Exploratory topic analysis​

Clustering similar records into topics and visualizing them alongside the class distribution within the topic can be one way to improve real-world performance. This is because these topics reflect clear patterns that the model is more likely to catch during training. The existence of the cluster itself shows that there is a significant amount of this "type" of data in the dataset.

It is important to use an appropriate clustering algorithm. At Labelator.io, we use HDBSCAN, which is very good at identifying dense regions of similar data and excluding outliers. This method will help identify the most obvious patterns within the data.

This can help us identify the most obvious problems within the data. For example, we may discover clusters where all records are labeled incorrectly. In Labelator.io, this is an easy fix, as we can select all of the records and change their labels at once.

We might also observe some inconsistently labeled data within the same topic in the dataset. This is not necessarily a bad thing. For example, a cluster may contain records discussing the same topic, such as World War II and Nazism, but some are labeled as dangerous and others are not. A closer look may reveal that while most records are true Nazi propaganda, a few are more akin to historical accounts written by an expert historian of that period. If most records in this topic are labeled as dangerous, the ML model may conclude that merely mentioning this topic should be treated as dangerous. In this case, it is best to equalize the ratio of records for both labels by providing more examples of the type of data that is being incorrectly predicted, so the model is forced to learn the difference between these two subtypes of data within this topic.

At Labelator.io, we use several tricks during topic analysis, such as a UMAP-based projection into a lower-dimensional space that takes the labels into account during the transformation. This means that regenerating topics after changing labels can lead to different clusters, which essentially try to split along label boundaries when a clear distinction is found in the data. Running topic analysis on high-dimensional data can be tricky and may require a lot of memory. If you are not using Labelator.io, we recommend the BERTopic library, which encompasses common best practices and techniques.
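
If you are building this yourself, a minimal BERTopic sketch (it uses UMAP and HDBSCAN under the hood) could look like the following; `docs` stands in for your own list of texts:

```python
from bertopic import BERTopic

# docs: a list of raw text documents from your dataset
docs = [
    "The stock market rallied after the earnings report.",
    "The striker scored twice in the final minutes.",
    # ... many more documents; clustering needs a reasonably large corpus
]

topic_model = BERTopic(min_topic_size=10)    # roughly HDBSCAN's minimum cluster size
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())   # topic sizes and representative terms
```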

Labelator.io topic exploration

Reviewing wrong predictions​

When you examine your wrongly predicted data, there are typically two situations that can occur. The first is that the record was correctly labeled, but the prediction was incorrect. The second is that the prediction was accurate, but the labels in your dataset are dirty. It's important to understand that metrics are calculated on potentially dirty data. Even in the best datasets, there is a certain percentage of mislabeled records.

The next thing you need to understand is whether the record under review was part of the training or the evaluation split of the dataset. If it was in the training data, it can weaken the model's confidence and potentially cause undesirable shifts in predictions. If it was in the evaluation dataset, it can make it difficult to compare models and trust your metrics.

Anyway, it is safe to say that the less incorrectly labeled data we have, the better...

To address this issue effectively, we want to start by targeting the chunks of data where the impact will be the greatest. One way to do this is by identifying the areas where the model is most confused and focusing on those. The confusion matrix is a great tool for visualizing this information.

(To learn more about the confusion matrix, you can check our docs on this topic.)
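
Outside of Labelator.io, you can get the same view with scikit-learn; a small sketch with made-up labels:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["ok", "toxic"]
y_true = ["ok", "ok", "toxic", "toxic", "ok", "toxic"]
y_pred = ["ok", "toxic", "toxic", "ok", "ok", "toxic"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = true labels, columns = predicted labels

ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```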

Confusion matrix

In Labelator.io, you can simply click on the cell with the highest percentage in the confusion matrix and be directed to the Documents section with a filter that targets those specific records. Here you can review possible mislabels and look at similar records to determine if they are also mislabeled or if there are not enough similar examples to learn from.

Reviewing false predictions in confusion matrix

Targeting specific cases​

When we have tackled the most obvious issues, we still come across instances where the model made a mistake. In such cases, we can review similar cases to try to understand why the mistake was made. However, finding similar cases can be quite challenging. One way to do this is by using keywords, but that is not a very effective or reliable approach.

Labelator.io offers a solution: semantic search. It uses the same underlying model to understand the meaning of documents in the dataset as it uses for predictions. This means that when you search for documents similar to any query or document, you are looking at the data through the same lens as the AI and understanding it the same way the model does.

This is a powerful tool for understanding why the model made such a prediction, and it can greatly aid in solving problems during training. What we should do here is:

  • review the labels of the most similar documents
  • if needed, add more examples similar to the incorrectly predicted record to outweigh the wrong decision for this type of data
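
If you want to approximate this kind of semantic search outside of Labelator.io, an embedding model gets you most of the way there; here's a sketch using sentence-transformers (the model name is just a common default, not necessarily what Labelator.io uses):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a common general-purpose embedder

documents = [
    "hmm , maybe theres more depth on the subject in the comment section.. nope just a bunch of degenerates.",
    "Great article, thanks for sharing!",
    "You're a pathetic excuse for a human being.",
]
query = "a bunch of keyboard warriors spewing toxic opinions"

doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query -- the "same lens as the AI" idea
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.2f}  {documents[hit['corpus_id']]}")
```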

In the next article, we will cover how to identify these records and catch these cases when running our model in production, so stay tuned 🤗!

· 6 min read
Juraj Bezdek

Welcome to Labelator.io, a no-code, data-centric NLP machine learning platform that helps you from the very beginning, when you barely have enough data to train a model on, all the way to managing and constantly improving models in production.

Is Labelator.io for me?​

Labelator.io will help you build a production-ready AI solution FAST. It can help you collect and label data, and then train and deploy an AI model to automate the task.

It's the perfect platform for those of you who want to leverage the power of AI to automate any task that requires understanding text and making a decision based on it.

The platform is designed to make it extremely easy to train, deploy and manage AI models, without having to worry about the details of data preparation, model selection, or infrastructure.

The best part, however, comes when you realize that you can never collect data for all the possible situations that can happen. This is where a typical AI model built with traditional tools will fail, because the model has never seen such data and therefore was not trained to handle these situations. Labelator.io offers unique and comprehensive tools to handle exactly these cases.

How can Labelator.io help you get started?

Do you have labeled data? Great, just upload them and train the model. Labelator.io will help you find potential problems in your data, such as mislabels, and ensure that your model is trained on high-quality data.

But what if you have some samples, but they are not labeled yet? Even better. Labelator.io includes a powerful similarity-based bulk labeling tool that helps you annotate your data faster than anything else. You can use this tool to quickly label a large number of samples at once, even if you don't have an existing dataset to train on.

And what if you don't have any data yet? Never mind. Labelator.io will help you collect your first data and train an AI model to automate the manual work as you grow.

How will Labelator.io help you run your models

Deploying models has never been this easy... with just a few clicks, you can deploy your model or its new version and have it available as an API.

Managing and monitoring models is hard. Handling edge cases, targeting weak spots, even harder. Labelator.io is there to help you solve everything that you will encounter when running the AI model in production. Our platform includes tools for monitoring model performance, identifying issues, and troubleshooting problems. We also provide a variety of tools for managing and deploying models, so you can easily deploy your model to production, test it in different environments, and monitor its performance over time.

Overall, Labelator.io is a powerful and easy-to-use platform that helps you every step of the way, from data collection to model deployment. Whether you're a beginner or an experienced machine learning practitioner, Labelator.io has everything you need to train, deploy, and manage your models.

Unique Labelator.io features​

Similarity-based bulk labeling

Similarity-based bulk labeling is a powerful feature that makes the data labeling process much more efficient. With this feature, Labelator.io automatically shows you how many similar documents have not been labeled yet. This greatly improves labeling speed since, unlike other annotation software, Labelator.io allows you to label tens or even hundreds of similar documents with a single click. This saves a significant amount of time and effort, allowing you to focus on the most important task: providing accurate labels for your data.

Data observability​

Another important problem Labelator.io helps to solve is the identification of potential problems in your data. The platform's built-in tools help you identify mislabels, inconsistencies, and other issues that could negatively impact your model's performance. We do this by using the same backbone model used for prediction to generate semantic vectors that represent each training example the same way the model itself understands it. This helps you explore the data and see it through the lens of the AI.

Microtopic exploration and prediction analysis​

Labelator.io automatically generates clusters of similar documents, called microtopics, across the entire semantic space. This feature is extremely helpful in understanding the content and labeling accuracy of your dataset. By using microtopics, you can explore what kind of data is in your dataset, and what the accuracy and labels are within each cluster. This allows you to quickly and easily identify any errors or inconsistencies in your data labels, and make the necessary adjustments.

Smart routing​

One of the most powerful features of Labelator.io is its smart routing feature, which enables you to target potential problems when running a model in production.

This feature allows you to decide how a request will be handled by setting a threshold on its semantic similarity to a certain topic, or to the data that the model was trained on.

This allows you to automate the cases the model is good at and handle the weak spots manually. This helps you focus your effort where it is needed the most.
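
Conceptually, the routing decision boils down to a threshold check; the sketch below is purely illustrative pseudologic, not Labelator.io's actual API:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    similarity_to_training_data: float  # how close the input is to what the model has seen

REVIEW_THRESHOLD = 0.75  # hypothetical threshold, tuned per project

def route(prediction: Prediction) -> str:
    """Decide whether to trust the model or send the item to a human reviewer."""
    if prediction.similarity_to_training_data >= REVIEW_THRESHOLD:
        return "auto-handle"             # the model is on familiar ground
    return "send-to-review-backlog"      # weak spot: a human should confirm or correct

print(route(Prediction(label="toxic", similarity_to_training_data=0.91)))  # auto-handle
print(route(Prediction(label="ok", similarity_to_training_data=0.42)))     # send-to-review-backlog
```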

Backlog for reviewing edge cases from production​

Based on your smart routing settings and thresholds, the data selected for review is shown in the backlog alongside the predictions, where an operator can confirm or correct them. The traditional approach is to select a random sample of production data from the last period, review the predictions within it, and retrain the model. After evaluating this model on some benchmark dataset, you'd deploy the new version into production.

There are two main problems with this approach:

  • The review process is extremely inefficient. Let's say the real accuracy of your model is 80%. If that's true, your random sample would consist of 80% correctly predicted data, and the model won't learn anything new from it. Thanks to smart routing, Labelator.io offers a solution that targets just the 20% (or rather 30%, due to some overlap) that is very likely to be mishandled by the AI, and by focusing purely on those records, you improve your model much faster!
  • Once you've corrected the 20% of wrongly predicted data, you have probably only corrected it in the dataset. But this is real data from a production system; it is still incorrect in your master system, and you should correct it there as well. Unfortunately, it is very common that this never happens due to technical and/or privacy constraints. Labelator.io's integration options include webhooks that you can connect to your master system, so once the labels in your dataset are corrected, you can hook into these changes and correct the labels or decisions made in your primary system, as sketched below.
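
As an illustration, a webhook receiver on your side could look roughly like this; the endpoint path and payload fields are hypothetical, so check the Labelator.io integration docs for the real schema:

```python
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/labelator-webhook")
async def label_corrected(request: Request):
    """Receive a (hypothetical) label-correction event and sync it back to the master system."""
    event = await request.json()
    record_id = event.get("record_id")        # field names are assumptions, not the real schema
    corrected_label = event.get("new_label")
    # TODO: update the corresponding record in your primary system of record here
    print(f"Correcting record {record_id} to label {corrected_label}")
    return {"status": "ok"}
```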