Is there a way to do multi-label classification with CLIP? #975
Replies: 8 comments 1 reply
-
Not sure if it would work, but have you by any chance looked at using captions like
-
I am attempting this now, training on captions with multiple labels and then querying with single labels, and it works quite badly compared to any normal multi-label classifier.
If I figure this out, I will let you know.
-
Take a look at this paper: I struggled with this problem for a while and this approach is working for me.
-
@AmericanPresidentJimmyCarter did you find a way to improve the multi-label performance?
-
No, I just trained multi-label classifiers instead and those worked.
-
You can use some sort of anti-text or placeholder text to do multi-label classification. For example, if your objective is checking whether "red" is present in an image of a dress, then use: that will give you a probability distribution, and you take the zero index.
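A minimal sketch of this pairing trick, assuming we already have CLIP cosine similarities for a label prompt and a placeholder ("anti-text") prompt. The prompt strings, similarity values, and the logit scale of 100 are illustrative assumptions, not values from the thread:

```python
import math

def label_probability(sim_label, sim_anti, logit_scale=100.0):
    """Pair the label prompt with a placeholder ('anti-text') prompt and
    softmax the two scaled similarities; index 0 is P(label present)."""
    a = logit_scale * sim_label
    b = logit_scale * sim_anti
    m = max(a, b)  # subtract max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

# Hypothetical cosine similarities (not computed from a real model here):
# sim("a photo of a red dress") = 0.31, sim("a photo of a dress") = 0.27
p_red = label_probability(0.31, 0.27)
```

Note that this only contrasts two prompts per label, which is why the reply below points out it breaks down when the image contains neither.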
-
How does that work? If the image contains neither, your result will be essentially random. I think it only works if you already have a multi-label classifier to identify a dress in the first place.
-
Take a look at this AAAI paper (arXiv): MuMIC: Multimodal Embedding for Multi-label Image Classification with Tempered Sigmoid. I developed the model for Booking.com. It has been serving production traffic for years and outperforms standard multi-label fine-tuning methods in efficiency. If you don't want to train models and want to use CLIP out of the box, you can still use the tips in the paper's
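The core idea behind a per-label ("tempered") sigmoid, as opposed to a softmax that forces labels to compete, can be sketched as follows. The scale and bias values are illustrative assumptions, not MuMIC's trained parameters:

```python
import math

def multilabel_probs(sims, scale=10.0, bias=-1.0):
    """Map each image-text cosine similarity to an independent probability
    with a scaled sigmoid, so labels do not compete as they would under
    softmax. scale and bias are illustrative, not the paper's tuned values."""
    return [1.0 / (1.0 + math.exp(-(scale * s + bias))) for s in sims]

# Hypothetical similarities for three labels of one image:
probs = multilabel_probs([0.31, 0.12, 0.28])
```

Because each label gets its own probability, an image can score high for several labels at once, which is exactly what the softmax-over-classes zero-shot recipe cannot do.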
-
The concrete use case is as follows. I have the classes baby, child, teen, adult. My idea was to use the similarity between text and image features (for the text features I used the prompt 'there is at least one (c) in the photo', with c being one of the 4 classes).
I went through quite a lot of examples, but I am running into the issue that the similarity scores are often very different for a fixed class, and/or that co-occurring classes (like baby and child) end up with very similar scores. For similarity scores I use the cosine similarity multiplied by 2.5 to stretch the score into the interval [0, 1], as is done in the CLIPScore paper.
Setting a per-class threshold in that sense doesn't seem possible.
Does anyone have an idea for this? I feel quite stuck on how I should proceed.
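For reference, the rescaling described above can be sketched like this. The embeddings are random stand-ins for real CLIP outputs, and the clipping follows the CLIPScore convention of w * max(cos, 0) with w = 2.5:

```python
import math
import random

CLASSES = ["baby", "child", "teen", "adult"]
PROMPT = "there is at least one {} in the photo"  # the prompt template from the question

def cosine(u, v):
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style rescaling: w * max(cos, 0), clipped to [0, 1]."""
    return min(max(w * cosine(image_emb, text_emb), 0.0), 1.0)

# Hypothetical pre-computed 512-d embeddings (stand-ins for real CLIP outputs).
random.seed(0)
image_emb = [random.gauss(0, 1) for _ in range(512)]
text_embs = {c: [random.gauss(0, 1) for _ in range(512)] for c in CLASSES}

scores = {c: clip_score(image_emb, text_embs[c]) for c in CLASSES}
```

This makes the thresholding problem concrete: each class ends up with its own score scale, so a single global cutoff over `scores` will not separate present from absent classes reliably.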