Multimodal Perception
Introduction
Multimodal perception unifies vision, language, and robot state into a single representation that a robot policy can act on.
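A minimal sketch of this unification, assuming each modality is already encoded as a fixed-size feature vector (the function name, embedding sizes, and 7-DoF state here are illustrative, not from any specific library):

```python
import numpy as np

def fuse_observation(image_emb, text_emb, robot_state):
    """Concatenate per-modality features into one observation vector.

    image_emb:   visual embedding (e.g. from a vision encoder)
    text_emb:    language embedding (e.g. from a text encoder)
    robot_state: proprioception (joint angles, gripper state, ...)
    """
    return np.concatenate([image_emb, text_emb, robot_state])

# Illustrative sizes: 512-d image, 512-d text, 7-d robot state.
obs = fuse_observation(np.zeros(512), np.zeros(512), np.zeros(7))
print(obs.shape)  # (1031,)
```

Concatenation is the simplest fusion strategy; learned fusion (cross-attention, FiLM layers) replaces this step in larger systems but consumes the same three inputs.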
Vision-Language Models
CLIP
import clip, torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("scene.jpg")).unsqueeze(0)  # add batch dim
text = clip.tokenize(["a red cup"])
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
similarity = image_features @ text_features.T
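To turn raw similarity scores into a distribution over candidate descriptions, CLIP-style pipelines L2-normalize the embeddings (so the dot product is cosine similarity) and apply a softmax. A self-contained sketch with mock embeddings, so no model download is needed; the 100.0 logit scale mirrors CLIP's learned temperature:

```python
import numpy as np

def rank_prompts(image_feat, text_feats, logit_scale=100.0):
    # Normalize so the dot product becomes cosine similarity.
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = logit_scale * text_feats @ image_feat
    # Numerically stable softmax over the candidate prompts.
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
image_feat = rng.normal(size=512)
text_feats = np.stack([image_feat + rng.normal(size=512),  # related prompt
                       rng.normal(size=512)])              # unrelated prompt
probs = rank_prompts(image_feat, text_feats)
print(probs.argmax())  # the related prompt wins
```

The same ranking applies unchanged to the `image_features` and `text_features` produced by the CLIP snippet above.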
OWL-ViT
from transformers import OwlViTProcessor, OwlViTForObjectDetection
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
texts = [["a red cup"]]  # one list of text queries per image
inputs = processor(text=texts, images=image, return_tensors="pt")  # image: a PIL image
outputs = model(**inputs)
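OWL-ViT's raw outputs still need post-processing (Hugging Face's processor offers `post_process_object_detection` for converting logits to scored boxes). The selection step itself reduces to thresholding scores and keeping the best box, which can be sketched without running the model; the threshold value and box format below are illustrative:

```python
def best_detection(scores, boxes, threshold=0.1):
    """Keep detections above threshold; return the highest-scoring box.

    scores: per-box confidence for the queried text
    boxes:  [x0, y0, x1, y1] per candidate
    """
    kept = [(s, b) for s, b in zip(scores, boxes) if s >= threshold]
    if not kept:
        return None  # nothing matched the query confidently
    return max(kept, key=lambda sb: sb[0])[1]

box = best_detection([0.05, 0.62, 0.31],
                     [[0, 0, 10, 10], [40, 20, 90, 80], [5, 5, 50, 50]])
print(box)  # [40, 20, 90, 80]
```

Returning `None` when nothing clears the threshold lets the caller distinguish "object absent" from a low-quality detection.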
Grounding Pipeline
class MultimodalGrounding:
    def __init__(self, owlvit):
        self.owlvit = owlvit  # open-vocabulary detector wrapper

    def ground_command(self, command, image):
        # Map "pick up the red cup" -> "red cup" (language parsing helper),
        # then locate that phrase in the image.
        object_desc = extract_object(command)
        boxes = self.owlvit.detect(image, object_desc)
        return boxes[0]  # highest-scoring box
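An end-to-end runnable version of this pipeline, with a stubbed detector and a deliberately naive `extract_object`; both are illustrative stand-ins for a real language parser and an OWL-ViT wrapper:

```python
def extract_object(command):
    # Naive stand-in: take everything after the last "the ".
    return command.split("the ", 1)[-1]

class StubDetector:
    def detect(self, image, object_desc):
        # Pretend the detector found one box for the queried phrase.
        return [[40, 20, 90, 80]]

class MultimodalGrounding:
    def __init__(self, detector):
        self.owlvit = detector

    def ground_command(self, command, image):
        object_desc = extract_object(command)
        boxes = self.owlvit.detect(image, object_desc)
        return boxes[0]

grounder = MultimodalGrounding(StubDetector())
print(grounder.ground_command("pick up the red cup", image=None))
# -> [40, 20, 90, 80]
```

Swapping `StubDetector` for a real OWL-ViT wrapper leaves the grounding interface unchanged, which is what makes the pipeline easy to test in isolation.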
Benefits
✅ Open-vocabulary ✅ Flexible ✅ Accurate
Continue to: Capstone: Architecture
References
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020.