PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model

Abstract

Affordance understanding, the task of identifying actionable regions on 3D objects, is critical for enabling robotic systems to interact with the physical world. Although Vision-Language Models (VLMs) have excelled in high-level reasoning and long-horizon planning for robotic manipulation, they still fall short in grasping the nuanced physical properties required for effective human-robot interaction. In this paper, we introduce PAVLM (Point cloud Affordance Vision-Language Model), a novel framework that leverages the rich multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point clouds. PAVLM integrates a geometric-guided propagation module with hidden embeddings from large language models (LLMs) to enrich visual semantics. On the language side, we prompt Llama-3.1 to generate refined, context-aware text, augmenting the instructional input with deeper semantic cues. Experimental results on the 3D AffordanceNet benchmark demonstrate that PAVLM outperforms baseline methods on both full and partial point clouds, particularly excelling in its generalization to novel, open-world affordance tasks on 3D objects.

Key Features

  • Integration of VLMs and LLMs
  • Geometric-guided propagation module
  • Effective on full and partial point clouds
  • Superior open-world performance

Methodology

PAVLM combines the strengths of visual and language models to enhance 3D affordance understanding. Specifically, our geometric-guided propagation module enriches visual semantics, while Llama-3.1 generates refined, context-aware text instructions to better guide the robot's actions. The following sections detail the architecture of the key components, starting with the Geometric-guided Point Encoder and Decoder. The diagram below illustrates the overall architecture of PAVLM, highlighting the interaction between the visual and language models.

Figure: Overall architecture of PAVLM.
Our PAVLM pipeline is designed to directly encode point cloud data. We first devise a geometric-guided propagation module to extract point features. Meanwhile, Llama-3.1 is tasked with generating richer and more detailed textual instructions. Both visual and textual embeddings are aligned through the 3D Image-Bind approach, with the combined features fed into the multi-modal LLM (Llama-2) for mask label generation. Finally, the per-point feature embeddings are multiplied by the <mask label> token and input into a 3D affordance decoder to generate the final affordance map. The following figure provides a detailed view of the PAVLM pipeline, illustrating each stage, from point cloud encoding to affordance map generation.
Figure: The PAVLM pipeline, from point cloud encoding to affordance map generation.
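To make this data flow concrete, here is a minimal sketch of the forward pass in PyTorch-style code. All module and tensor names (PAVLMSketch, point_encoder, mask_head, and so on) are illustrative assumptions, and the heavy components (the geometric-guided encoder, the 3D Image-Bind alignment, and the multi-modal LLM) are replaced by lightweight stand-ins; this is not the released implementation.

```python
# Minimal sketch of the PAVLM forward pass (assumed module names; heavy
# components are replaced by lightweight stand-ins so the flow is runnable).
import torch
import torch.nn as nn


class PAVLMSketch(nn.Module):
    def __init__(self, feat_dim=384, llm_dim=4096, num_affordances=18):
        super().__init__()
        # Stand-in for the geometric-guided point encoder (per-point features).
        self.point_encoder = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                           nn.Linear(feat_dim, feat_dim))
        # Stand-in for the 3D Image-Bind style projection that aligns point
        # features with the language model's embedding space.
        self.vision_proj = nn.Linear(feat_dim, llm_dim)
        # Stand-in for the multi-modal LLM head that emits a <mask label> embedding.
        self.mask_head = nn.Linear(llm_dim, feat_dim)
        # Stand-in for the 3D affordance decoder (per-point affordance scores).
        self.decoder = nn.Linear(feat_dim, num_affordances)

    def forward(self, points, text_embedding):
        # points: (B, N, 3) point cloud; text_embedding: (B, llm_dim) from the
        # instruction generated by Llama-3.1 (assumed to be pre-computed).
        point_feat = self.point_encoder(points)              # (B, N, feat_dim)
        fused = self.vision_proj(point_feat).mean(dim=1)     # (B, llm_dim)
        mask_label = self.mask_head(fused + text_embedding)  # (B, feat_dim)
        # Per-point features modulated by the <mask label> token, then decoded.
        modulated = point_feat * mask_label.unsqueeze(1)     # (B, N, feat_dim)
        return torch.sigmoid(self.decoder(modulated))        # (B, N, num_affordances)


if __name__ == "__main__":
    model = PAVLMSketch()
    pts = torch.randn(2, 2048, 3)
    txt = torch.randn(2, 4096)
    print(model(pts, txt).shape)  # torch.Size([2, 2048, 18])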
The diagram illustrates the architecture of the proposed Geometric-guided Point Encoder and Decoder within the PAVLM pipeline. It highlights the two core components: the geometric information extraction module and the feature propagation module. As shown in the figure, the point cloud data is divided into patches and processed through a series of transformer blocks to extract refined geometric features, while the decoder handles feature propagation and ensures that point-wise affordance information is accurately distributed across the point cloud.
Figure: Architecture of the Geometric-guided Point Encoder and Decoder.
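A hedged sketch of this encoder/decoder idea is given below: patch centers are sampled, k-nearest-neighbor grouping forms patches, a transformer stack refines per-patch tokens, and the decoder propagates patch features back to every point by inverse-distance interpolation (a PointNet++-style feature propagation step). Random center sampling stands in for farthest point sampling, and all hyperparameters are illustrative choices rather than the paper's settings.

```python
# Sketch of a geometric-guided point encoder/decoder (illustrative choices).
import torch
import torch.nn as nn


def group_patches(points, num_patches=64, k=32):
    """Split a point cloud (B, N, 3) into center-normalized patches."""
    B, N, _ = points.shape
    # Random sampling as a simple stand-in for farthest point sampling (FPS).
    idx = torch.randint(0, N, (B, num_patches), device=points.device)
    centers = torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
    dists = torch.cdist(centers, points)                     # (B, P, N)
    knn_idx = dists.topk(k, largest=False).indices           # (B, P, k)
    neighbors = torch.gather(
        points.unsqueeze(1).expand(-1, num_patches, -1, -1),
        2, knn_idx.unsqueeze(-1).expand(-1, -1, -1, 3))      # (B, P, k, 3)
    return centers, neighbors - centers.unsqueeze(2)


class GeometricEncoderDecoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, points):
        centers, patches = group_patches(points)               # (B, P, 3), (B, P, k, 3)
        tokens = self.patch_embed(patches).max(dim=2).values   # (B, P, dim) patch tokens
        tokens = self.blocks(tokens)                            # refined geometric features
        # Decoder: propagate patch features to every point via inverse-distance
        # interpolation over the 3 nearest patch centers (feature propagation).
        d = torch.cdist(points, centers).clamp(min=1e-8)        # (B, N, P)
        knn = d.topk(3, largest=False)
        w = 1.0 / knn.values
        w = w / w.sum(dim=-1, keepdim=True)                      # (B, N, 3) weights
        gathered = torch.gather(
            tokens.unsqueeze(1).expand(-1, points.shape[1], -1, -1),
            2, knn.indices.unsqueeze(-1).expand(-1, -1, -1, tokens.shape[-1]))
        return (w.unsqueeze(-1) * gathered).sum(dim=2)           # (B, N, dim) per-point features


if __name__ == "__main__":
    enc_dec = GeometricEncoderDecoder()
    print(enc_dec(torch.randn(2, 2048, 3)).shape)  # torch.Size([2, 2048, 256])
```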
We use an augmentation prompt template to enhance the diversity of generated question-answer pairs, improving the model's contextual understanding. This strategy is particularly important for ensuring flexibility in robotic interactions. As demonstrated in the accompanying diagram, the system generates multiple versions of question-answer pairs based on a seed input.
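As a minimal illustration, the snippet below shows how such an augmentation prompt could be assembled from a seed question-answer pair. The template wording is an assumption made for this sketch, not the exact prompt used by PAVLM, and the resulting string would then be passed to Llama-3.1 to obtain the paraphrased pairs.

```python
# Illustrative augmentation prompt builder (the template text is an assumption,
# not the paper's exact prompt). The returned string would be sent to Llama-3.1
# to obtain diversified question-answer pairs.
AUGMENT_TEMPLATE = (
    "You are helping to diversify robot instruction data.\n"
    "Given the seed question and answer below, rewrite them into {n} new "
    "question-answer pairs that keep the same affordance meaning but vary the "
    "phrasing and level of detail.\n\n"
    "Seed question: {question}\n"
    "Seed answer: {answer}\n"
)


def build_augmentation_prompt(question: str, answer: str, n: int = 5) -> str:
    return AUGMENT_TEMPLATE.format(n=n, question=question, answer=answer)


if __name__ == "__main__":
    prompt = build_augmentation_prompt(
        question="Which part of the mug should the robot grasp?",
        answer="The handle region affords grasping.",
    )
    print(prompt)
```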

Experimental Results

We conducted extensive experiments to evaluate PAVLM's performance. Our ablation studies examined the effects of different text prompts and vision encoders, while comparisons with state-of-the-art methods demonstrated PAVLM's superiority in both seen and unseen object categories.
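The tables below report mAP, AUC, aIOU, and MSE over per-point affordance predictions. The sketch that follows shows one way such metrics can be computed with scikit-learn for a single affordance channel; the binarized ground truth and the aIOU threshold sweep are common choices assumed here and may differ from the benchmark's official evaluation protocol.

```python
# Sketch of the per-point evaluation metrics (mAP, AUC, aIOU, MSE) for one
# affordance channel. The aIOU thresholds are an assumption; consult the
# 3D AffordanceNet protocol for the exact settings.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def affordance_metrics(pred, gt, iou_thresholds=np.arange(0.05, 1.0, 0.05)):
    """pred: (N,) predicted scores in [0, 1]; gt: (N,) binary ground truth."""
    ap = average_precision_score(gt, pred)
    auc = roc_auc_score(gt, pred)
    mse = float(np.mean((pred - gt) ** 2))
    ious = []
    for t in iou_thresholds:
        hard = pred >= t
        inter = np.logical_and(hard, gt.astype(bool)).sum()
        union = np.logical_or(hard, gt.astype(bool)).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return {"mAP": ap, "AUC": auc, "aIOU": float(np.mean(ious)), "MSE": mse}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = (rng.random(2048) > 0.8).astype(float)
    pred = np.clip(gt * 0.7 + rng.random(2048) * 0.3, 0, 1)
    print(affordance_metrics(pred, gt))
```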

Ablation Study: Text Prompts

Table 1: Ablation study of different text prompts.
Point cloud        Text Prompt     mAP↑   AUC↑   aIOU↑   MSE↓
Full-shape (Aug.)  Ours            45.7   86.9   16.8    0.45
Full-shape         Ours            46.0   87.1   17.1    0.44
Full-shape         Object, Action  44.5   85.6   16.0    0.67
Full-shape         Action          45.4   87.2   16.7    0.45
Full-shape         Hi              11.4   51.9    0.4    0.90

Ablation Study: Vision Encoders

Table 2: Ablation study of different vision encoders.
Point cloud    Vision encoder   mAP↑   AUC↑   aIOU↑   MSE↓
Full-shape     PointNet++       47.1   86.2   18.1    0.36
Full-shape     DGCNN            31.0   72.9    1.0    0.68
Full-shape     Ours             48.5   86.8   17.7    0.36
Partial-shape  PointNet++       25.3   69.8    3.4    0.73
Partial-shape  DGCNN            26.8   69.7    0.7    0.67
Partial-shape  Ours             40.2   82.2   11.7    0.39

Comparison: Seen Categories

Table 3: Comparison results with baselines on the seen split of the dataset.
Point cloud    Method         mAP↑   AUC↑   aIOU↑   MSE↓
Full-shape     PointCLIP       7.6   49.9    0.9    0.80
Full-shape     PointCLIP V2    7.6   50.0    0.8    0.70
Full-shape     Ours (frozen)  46.3   86.9   17.1    0.44
Full-shape     Ours           48.5   86.8   17.7    0.36
Partial-shape  PointCLIP       9.1   50.1    1.1    0.78
Partial-shape  PointCLIP V2    9.2   49.9    1.4    0.66
Partial-shape  Ours (frozen)  37.2   64.2   11.4    0.36
Partial-shape  Ours           40.2   82.2   11.7    0.39

Comparison: Unseen Categories

Table 4: Comparison results with baselines on the unseen split of the dataset.
Point cloud    Method         mAP↑   AUC↑   aIOU↑   MSE↓
Full-shape     PointCLIP       3.7   49.9   0.5     1.58
Full-shape     PointCLIP V2    3.7   49.3   0.4     0.91
Full-shape     Ours (frozen)  11.8   60.3   1.5     0.78
Full-shape     Ours           12.1   53.9   2.6     0.51
Partial-shape  PointCLIP       4.7   50.2   0.6     1.14
Partial-shape  PointCLIP V2    4.7   50.3   0.38    0.93
Partial-shape  Ours (frozen)  13.8   60.3   2.0     0.60
Partial-shape  Ours           22.3   63.2   2.1     0.82

Our results show that PAVLM consistently outperforms baseline methods across various metrics. The ablation studies reveal that comprehensive question prompts and our proposed geometric-guided point encoder contribute significantly to the model's performance. Furthermore, PAVLM demonstrates strong generalization capabilities, particularly in handling unseen object categories.

Visualization of Results

Visualization of affordance prediction results on multiple unseen objects and categories, showing that our model predicts accurate affordances on the original partial point clouds when guided by different instructions.