METHOD AND APPARATUS WITH OBJECT DETECTION

Inventors: Hyunjeong Lee, Dongwook Lee, and Byung In Yoo. Assignee: Samsung Electronics Co., Ltd.
Advancements in object detection are shaping the future of smart vehicles and intelligent machines. This new patent application presents a fresh approach that blends camera images, LiDAR sensor data, and text prompts for better object detection. Let’s break down what this invention means for the technology world, why it matters, and how it improves upon older ideas.
Background and Market Context
Object detection is the heart of many modern systems. Imagine self-driving cars, delivery drones, and even home robots—they all need to understand their surroundings. To do this, they rely on sensors like cameras and LiDAR. Cameras give us images, much like what we see with our eyes. LiDAR uses lasers to make a 3D map of the world, showing where objects are in space. Together, these tools help machines spot cars, people, signs, and more.
But there’s a catch. Cameras are great at seeing colors and details but struggle with depth and distance. LiDAR is excellent for measuring how far away things are but doesn’t see colors or fine details. By themselves, each sensor has limits. So, companies started combining camera and LiDAR data to get a fuller picture. This is called “sensor fusion.” It’s already used in cutting-edge vehicles and robots.
Yet, even with these tools, there’s still a big problem. Most object detection systems are trained to spot only what they’ve seen before. If a new kind of object appears—say, a new road sign or an unfamiliar truck—the system might miss it. In the real world, this can be dangerous.
Another challenge is accuracy. Detecting objects in 3D space (using LiDAR) is harder than just finding them in a flat image. Small mistakes can mean the difference between safe driving and an accident. The market is hungry for better, safer, and more flexible object detection. Car makers, tech companies, and robotics firms want systems that can spot new objects, work in tough conditions, and adapt fast.
That’s where the idea in this patent fits in. It brings a new tool to the table—text. Think of text as a way to describe what’s happening in the world. By adding words to the mix, like “red sedan” or “bicycle ahead,” machines can understand not just what’s in an image, but what those things mean.
This invention could be a game-changer for factories, delivery robots, and especially self-driving cars. It promises a way for systems to learn about new things quickly, work better in poor lighting or weather, and make smarter decisions. As cities get smarter and vehicles get more independent, the need for robust object detection will only grow.
Scientific Rationale and Prior Art
Let’s look at how object detection has grown and why this new invention matters.
At first, object detection was simple. Early computers looked for basic shapes or colors in images. Later, “deep learning” came along. This allowed computers to learn from thousands or millions of pictures. These systems got very good at finding known objects, like stop signs, people, or cars. But they still had blind spots—they could only spot things they had been trained to see.
To fix some of these gaps, engineers started using different types of data together. The most common mix is using camera images and LiDAR point clouds. Cameras show what things look like. LiDAR tells where things are, using points in 3D space. By combining them, machines can guess both what something is and where it sits in the world. This is sensor fusion. It’s now used in many high-end vehicles and robots.
But even with sensor fusion, there are problems. First, both image and LiDAR data are limited by what the system has seen before. If a new object pops up, the system might ignore it. Second, most current systems don’t use language at all. They can’t understand or use descriptions like “a blue truck” or “a person carrying a bag.” Language is how people share new ideas, so adding it could help computers learn faster and become more flexible.
Recently, “transformer” models have changed how machines process data. Transformers are very powerful at understanding patterns in images, text, and even sound. They can blend information from different sources. Some systems now use transformers to mix camera and LiDAR data, making object detection more accurate.
The idea of mixing text and images is also gaining steam. Big companies have made models that can match a photo with a caption, or find a cat in a picture when you type “cat.” These models can sometimes spot new things—if you tell them what to look for. This is called “open-vocabulary” detection.
But here’s the twist: until now, no one has really combined all three—images, LiDAR, and text—into one smart system for real-world object detection, especially not in a way that helps find new objects on the fly. This patent builds on that idea, using transformers to fuse image features, text clues, and 3D sensor data. The result is a system that can spot new objects, even if they weren’t in the training set, and do it with high accuracy.
Earlier inventions focused on just images, or just LiDAR, or maybe both. Some tried to use text for image search or labeling, but not for live object detection in 3D space. By bringing all these pieces together, this patent stands on top of years of research but moves beyond what came before.
Invention Description and Key Innovations
Now, let’s get into how this invention works and what makes it special.
Imagine a smart car with both a camera and a LiDAR sensor. The car is driving down the street. The camera takes a picture—the image. The LiDAR sends out lasers and makes a point cloud—a 3D map of everything around. At the same time, the system gets a text prompt. This could be a label (“Find all trucks”), a description (“A white van is near the crosswalk”), or even a list of possible objects.
Here’s what happens next:
First, the system uses special encoders. The image encoder takes the picture and turns it into a set of numbers that capture what’s in the image (these are called features). The text encoder does the same for the text—turning words into features. The LiDAR data goes through its own encoder, making features from the point cloud.
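The sketch below shows what these three encoders could look like in PyTorch. The specific architectures (a tiny CNN for the image, a token embedding for the text, and a PointNet-style MLP for the point cloud) are illustrative assumptions, not the actual networks described in the application.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Tiny CNN stand-in for a full image backbone (e.g., a ResNet or ViT)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, img):                # img: (B, 3, H, W)
        return self.net(img)               # image features: (B, feat_dim, H/4, W/4)

class TextEncoder(nn.Module):
    """Token-embedding stand-in for a pretrained text encoder."""
    def __init__(self, vocab_size=30522, feat_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)

    def forward(self, token_ids):          # token_ids: (B, L)
        return self.embed(token_ids)       # text features: (B, L, feat_dim)

class PointCloudEncoder(nn.Module):
    """PointNet-style shared MLP over (x, y, z, intensity) LiDAR points."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, points):             # points: (B, N, 4)
        return self.mlp(points)            # point features: (B, N, feat_dim)

# Dummy inputs: one image, one tokenized text prompt, one point cloud.
img_feats = ImageEncoder()(torch.randn(1, 3, 224, 224))
txt_feats = TextEncoder()(torch.randint(0, 30522, (1, 8)))
pc_feats  = PointCloudEncoder()(torch.randn(1, 2048, 4))
```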
Then comes the magic—fusion. The system blends the image features and the text features. This fusion creates a new kind of feature that “knows” both what’s in the picture and what the text says. For example, if the text says “truck,” and there’s something truck-like in the image, the fusion feature will light up for that spot.
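One common way to implement this kind of fusion is cross-attention, where each image location queries the text tokens and picks up text-informed information. The sketch below assumes that mechanism; the application itself does not commit to a specific fusion layer.

```python
import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    """Fuse an image feature map with text token features via cross-attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, C, H, W) from the image encoder
        # txt_feats: (B, L, C) from the text encoder
        B, C, H, W = img_feats.shape
        img_seq = img_feats.flatten(2).transpose(1, 2)        # (B, H*W, C)
        # Each image location attends to the text tokens, so spots that match
        # the prompt (e.g., "truck") receive text-aware features.
        attended, _ = self.attn(query=img_seq, key=txt_feats, value=txt_feats)
        fused = self.norm(img_seq + attended)                 # residual + norm
        return fused.transpose(1, 2).reshape(B, C, H, W)      # back to map form

fusion = ImageTextFusion()
fused_map = fusion(torch.randn(1, 256, 56, 56), torch.randn(1, 8, 256))
```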
Next, the system looks for regions of interest (ROIs) in the image. These are areas where there might be something important: a car, a person, a sign, and so on. For each ROI, the system uses “ROI pooling” to resample the features into the same fixed size, whether the object is big or small. This helps the model compare objects more easily.
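Here is a minimal ROI pooling example using torchvision’s roi_align on the fused feature map from the previous step. The box coordinates, output size, and scale factor are made-up values for illustration.

```python
import torch
from torchvision.ops import roi_align

fused_map = torch.randn(1, 256, 56, 56)          # (B, C, H, W) fused features
# Two hypothetical regions of interest in image coordinates (x1, y1, x2, y2).
boxes = [torch.tensor([[ 10.,  20., 120., 160.],
                       [130.,  40., 210., 140.]])]
# Every region, large or small, is resampled to the same 7x7 grid so that
# downstream layers always see fixed-size per-object features.
roi_feats = roi_align(fused_map, boxes, output_size=(7, 7),
                      spatial_scale=56 / 224)    # 224-px image, 56-px map
print(roi_feats.shape)                           # torch.Size([2, 256, 7, 7])
```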
But there’s another wrinkle—position. The system needs to know where the camera and LiDAR are in relation to each other. It uses something called “positional embedding” to keep track of where each object is, both in the image and in 3D space. This makes sure the information lines up across sensors.
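The sketch below illustrates the idea under an assumed pinhole camera model: LiDAR points are projected into the image with placeholder calibration matrices, and the combined 2D and 3D position is mapped into an embedding. The calibration values and the small MLP are hypothetical; the application only requires that the relative sensor poses are known.

```python
import torch
import torch.nn as nn

# Placeholder calibration: camera intrinsics K and LiDAR-to-camera extrinsics.
K = torch.tensor([[500.,   0., 112.],
                  [  0., 500., 112.],
                  [  0.,   0.,   1.]])
T_lidar_to_cam = torch.eye(4)

def project_to_image(points_xyz):
    """Project LiDAR points (N, 3) into pixel coordinates (N, 2)."""
    homo = torch.cat([points_xyz, torch.ones(len(points_xyz), 1)], dim=1)
    cam = (T_lidar_to_cam @ homo.T).T[:, :3]     # LiDAR frame -> camera frame
    pix = (K @ cam.T).T                          # camera frame -> pixel plane
    return pix[:, :2] / pix[:, 2:3]              # perspective divide

class PositionalEmbedding(nn.Module):
    """Map a (u, v, x, y, z) position tuple to the shared feature width."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(5, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, pix_uv, points_xyz):
        return self.mlp(torch.cat([pix_uv, points_xyz], dim=1))

points = torch.randn(4, 3) + torch.tensor([0., 0., 20.])   # points in front of the camera
pos_emb = PositionalEmbedding()(project_to_image(points), points)   # (4, 256)
```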
Once all these features are ready—fused features for each object, position info, and LiDAR features—the system creates a “query.” This query is like a question: “Is there an object matching these features in the point cloud?” The query goes into a transformer-based object detection model. The transformer is very good at paying attention to the right parts of the data and making smart predictions.
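A DETR-style sketch of this step: queries built from the fused ROI features plus the positional embedding cross-attend to the LiDAR point features inside a transformer decoder. The layer counts and sizes are arbitrary assumptions for illustration, not the claimed configuration.

```python
import torch
import torch.nn as nn

dim = 256
decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

roi_feats = torch.randn(1, 2, dim)     # fused image+text features, one per ROI
pos_emb   = torch.randn(1, 2, dim)     # positional embedding, one per ROI
pc_feats  = torch.randn(1, 2048, dim)  # per-point LiDAR features (the "memory")

# Each query effectively asks: "is there an object matching these features here?"
queries = roi_feats + pos_emb
decoded = decoder(tgt=queries, memory=pc_feats)   # (1, 2, dim) refined object features
```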
The output is powerful. The system can find objects in the 3D point cloud, draw a box around them (the bounding box), and even say what kind of object it is (the class). Because the text is fused in, the system can detect new object types, even if it never saw them during training. For example, if you tell it to look for “electric scooter,” and there’s one in the scene, it can find it—even if it was never taught what a scooter is.
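Continuing the sketch, the decoder output could feed two heads: a box regressor for 3D bounding boxes and an open-vocabulary classifier that scores each query against text embeddings of the prompted class names. Scoring by similarity to text is an assumed mechanism that is consistent with the description above, not a claimed implementation detail; the tensors here are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
box_head = nn.Linear(dim, 7)            # (x, y, z, w, l, h, yaw) 3D box parameters

decoded = torch.randn(1, 2, dim)        # decoder output for 2 queries (placeholder)
class_texts = torch.randn(3, dim)       # text embeddings for, e.g., "car", "truck", "electric scooter"

boxes = box_head(decoded)               # (1, 2, 7) predicted 3D bounding boxes
# Class scores come from similarity to the text embeddings, so a new label such
# as "electric scooter" only needs a new prompt, not retraining.
scores = F.softmax(decoded @ class_texts.T / dim ** 0.5, dim=-1)   # (1, 2, 3)
```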
This invention is flexible. It can take many types of text inputs—labels, full descriptions, even prompts describing situations. It works with any camera-LiDAR setup, as long as their positions are known. The system can be trained to get better over time, updating its “learnable query” parts as it learns from more data.
Key innovations in this patent include:
1. Multi-modal Fusion: The system doesn’t just add together image and text features. It deeply fuses them so the model understands both what it “sees” and what it’s “told.” This fusion is used right at the start, not just as an afterthought.
2. Query-based Detection: Instead of scanning the whole scene blindly, the transformer uses smart queries built from fused features and position info. This makes detection faster and more accurate.
3. Handling New Objects: By using text prompts, the model can find objects of types it never saw during training. This is huge for safety and flexibility.
4. Position-aware Embedding: The system tracks where each sensor is, making sure that the image and LiDAR data line up perfectly in 3D space. This reduces errors and boosts accuracy.
5. Flexible Architecture: The model can be trained and updated as new data comes in. It can also work with different types of cameras, LiDARs, and even other sensors.
6. Real-World Ready: The system is built for real-time use, such as in vehicles or robots, where decisions need to happen fast and safely.
Put simply, this invention blends the best of all worlds—seeing, describing, and measuring objects—making machines smarter and safer.
Conclusion
This patent application opens a new chapter for object detection. By fusing camera images, LiDAR data, and text descriptions, it helps machines spot more objects, adapt to new situations, and make safer choices. The invention builds on years of research but goes further by combining vision, language, and 3D mapping in a single, smart system. As the world moves toward autonomous vehicles and smarter robots, the need for robust, flexible, and accurate object detection will only grow. This invention offers a practical and innovative path forward—one that’s ready for the real world and tomorrow’s challenges.
To read the full application, visit https://ppubs.uspto.gov/pubwebapp/ and search for publication number 20250218165.