We’ve all been there: staring at a messy medicine cabinet, wondering which box is for allergies and which one expired in 2022. In the world of Computer Vision and AI Healthcare, digitizing physical assets is a classic challenge. Today, we're building a "Medicine Box Expert"—a sophisticated pipeline that uses YOLOv8 for precision detection and OpenAI CLIP for multimodal understanding to turn a pile of pills into a searchable digital database.
By the end of this tutorial, you'll understand how to bridge the gap between raw pixels and structured medical data. We are moving beyond simple classification; we are building a robust system capable of handling complex lighting, varied angles, and the tiny typography common in pharmaceutical packaging.
To achieve high accuracy, we don't rely on a single model. Instead, we use a "Detect-Extract-Embed" workflow.
graph TD
A[User Uploads Image] --> B[YOLOv8: Box Detection]
B --> C{Box Found?}
C -- Yes --> D[Crop & Preprocess]
C -- No --> E[Error: No Box Detected]
D --> F[Tesseract OCR: Text Extraction]
D --> G[OpenAI CLIP: Visual Embedding]
F & G --> H[SQLite Query: Semantic Search]
H --> I[Result: Drug Info & Dosage]
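Before we dive into the code, ensure you have the following tech stack installed (pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be installed on your system):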
pip install ultralytics transformers torch pytesseract
First, we need to locate the medicine box within the frame. A generic YOLOv8 model (like yolov8n.pt) is trained on COCO, which has no medicine-box class, so it will often label a box as a "book" or "cell phone." That is usually enough for a rough crop, but for the best results you should fine-tune it on box images, for example the "Box" class of the Open Images Dataset or your own labeled photos of medical packaging.
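from ultralytics import YOLO
import cv2

# Load the pretrained nano checkpoint (downloaded on first use)
model = YOLO('yolov8n.pt')

def get_medicine_box(image_path):
    """Return the first detected box as [x1, y1, x2, y2], or None."""
    results = model(image_path)
    for r in results:
        # Move the box coordinates off the GPU (if any) and into NumPy
        boxes = r.boxes.xyxy.cpu().numpy()
        if len(boxes) > 0:
            return boxes[0]  # Returns [x1, y1, x2, y2]
    return None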
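The diagram's "Crop & Preprocess" step needs integer pixel bounds, while the detector returns floats that may sit right on the image edge. The sketch below is one possible way to bridge that gap; the helper name `box_to_crop_region` and the `pad` margin are assumptions for illustration, not part of the original pipeline.

```python
import math

def box_to_crop_region(box, image_width, image_height, pad=0.05):
    """Convert a float [x1, y1, x2, y2] box into integer crop bounds.

    `pad` is a hypothetical margin, given as a fraction of the box size,
    added on every side so text near the box edge is not cut off.
    """
    x1, y1, x2, y2 = (float(v) for v in box)
    pad_x = (x2 - x1) * pad
    pad_y = (y2 - y1) * pad
    # Round outward, then clamp so the padded box never leaves the frame
    left = max(0, math.floor(x1 - pad_x))
    top = max(0, math.floor(y1 - pad_y))
    right = min(image_width, math.ceil(x2 + pad_x))
    bottom = min(image_height, math.ceil(y2 + pad_y))
    return left, top, right, bottom
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it can be passed straight to `image.crop(...)` on the uploaded photo.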
OCR (Optical Character Recognition) often fails when text is stylized or blurred. This is where OpenAI CLIP shines. CLIP creates a shared vector space for images and text, allowing us to compare the visual vibe of a box against a set of known categories.
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_visual_embedding(image_crop):
    # image_crop is a PIL Image of the detected box
    inputs = processor(images=image_crop, return_tensors="pt")
    with torch.no_grad():  # Inference only, so skip gradient tracking
        outputs = model_clip.get_image_features(**inputs)
    return outputs.detach().numpy()
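To make "comparing against known categories" concrete, here is a minimal sketch of the comparison step itself. It assumes you already have one embedding per category label (for instance from `model_clip.get_text_features`, which projects text into the same space as the image features) and a flattened image embedding. The function names, category labels, and toy 3-dimensional vectors are all illustrative assumptions; real embeddings from this checkpoint have 512 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)

def rank_categories(image_embedding, category_embeddings):
    """Sort category labels from most to least similar to the image."""
    scores = {
        label: cosine_similarity(image_embedding, vec)
        for label, vec in category_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy vectors standing in for real CLIP embeddings (hypothetical values)
toy_image = [0.9, 0.1, 0.0]
toy_categories = {
    "allergy relief": [1.0, 0.0, 0.0],
    "pain relief": [0.0, 1.0, 0.0],
    "vitamins": [0.0, 0.0, 1.0],
}
ranking = rank_categories(toy_image, toy_categories)
```

Cosine similarity is the usual choice here because it ignores vector length and only compares direction, which is how CLIP embeddings are typically matched.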
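We combine the text found by Tesseract OCR with our visual embedding to query our local SQLite database. The goal is that even if the OCR misreads "Advil" as "Adv1l," the CLIP embedding can still point us toward the correct record. Note that the starter function below only implements the text half: it matches on the first five OCR characters and accepts the embedding without using it yet, so a misread like "Adv1l" would not match on its own.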
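import pytesseract
import sqlite3

def identify_medicine(crop_img, embedding):
    # OCR output often carries stray whitespace and newlines, so strip it
    text = pytesseract.image_to_string(crop_img).strip()
    if not text:
        return None  # An empty pattern ('%%') would match every row
    conn = sqlite3.connect('pharmacy.db')
    try:
        cursor = conn.cursor()
        # Match on the first five OCR characters; `embedding` is not used yet
        query = "SELECT name, dosage FROM medicines WHERE name LIKE ?"
        cursor.execute(query, (f'%{text[:5]}%',))
        return cursor.fetchone()
    finally:
        conn.close()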
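One way the OCR text and the embedding could actually be combined is to score every record on both signals and keep the best. The sketch below is an assumption-heavy illustration, not the original design: it supposes the `medicines` table has an extra `embedding` column holding a JSON list, uses `difflib` for fuzzy name matching, and blends the two scores with a made-up `text_weight`. The demo rows, the name "Examplol", and the dosage strings are placeholders.

```python
import difflib
import json
import math
import sqlite3

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_medicines(conn, ocr_text, embedding, text_weight=0.5):
    """Return (score, name, dosage) for the best row, or None if empty."""
    best = None
    rows = conn.execute("SELECT name, dosage, embedding FROM medicines")
    for name, dosage, stored in rows:
        # Fuzzy ratio tolerates single-character OCR slips like 'i' -> '1'
        text_score = difflib.SequenceMatcher(
            None, ocr_text.lower(), name.lower()
        ).ratio()
        visual_score = cosine(embedding, json.loads(stored))
        score = text_weight * text_score + (1 - text_weight) * visual_score
        if best is None or score > best[0]:
            best = (score, name, dosage)
    return best

# Demo: in-memory database with toy 3-dimensional embeddings (placeholders)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medicines (name TEXT, dosage TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO medicines VALUES (?, ?, ?)",
    [
        ("Advil", "dosage-A", json.dumps([0.9, 0.1, 0.0])),
        ("Examplol", "dosage-B", json.dumps([0.0, 0.2, 0.9])),
    ],
)
match = rank_medicines(conn, "Adv1l", [0.8, 0.2, 0.1])
```

Scanning every row is fine for a home medicine cabinet; a larger catalog would likely want a dedicated vector index instead of a full table scan.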
While this script works for a local "Learning in Public" project, production-grade vision systems require specialized handling for edge cases like glare, perspective warping, and batch processing.
For a deeper dive into production-grade AI architectures and more advanced multimodal patterns, I highly recommend checking out the technical deep-dives over at the WellAlly Tech Blog. They cover in detail how to scale these pipelines using vector databases and cloud-native inference engines.
Digitizing a home pharmacy is a perfect example of how YOLOv8 and CLIP can work in tandem. YOLO provides the "where," and CLIP/OCR provide the "what." Because each signal can catch the other's mistakes, this hybrid approach can reduce false positives and creates a user experience that feels like magic. 🥑
What’s next?
Happy coding! 💻🔥