[2025-08] IROS Challenge: Vision-Language Manipulation in Open Tabletop Environments (ongoing)

Note: To protect commercial confidentiality, public information is limited.


Project Overview

We are hosting the IROS 2025 Challenge with two tracks: Manipulation and Navigation. The Vision-Language Manipulation in Open Tabletop Environments challenge, featured at the IROS 2025 Workshop, has a submission deadline of September 30th. We warmly welcome all participants!

My Contribution

As the evaluation lead for the manipulation track, my partners and I:

  • Developed the Genmanip benchmark environment
  • Built the system from scratch with InternUtopia
  • Implemented evaluation protocols in InternManip

Environment Components

  1. Tasks
    • 10 manipulation tasks with two categories:
      • Seen (objects in training set)
      • Unseen (novel objects)
    • Evaluation datasets (hosted on Hugging Face; a hedged download sketch follows the listing):
       validation
       ├── IROS_C_V3_Aloha_seen
       │   ├── collect_three_glues/ 
       │   ├── collect_two_alarm_clocks/
       │   ├── collect_two_shoes/
       │   ├── gather_three_teaboxes/
       │   ├── make_sandwich/
       │   ├── oil_painting_recognition/
       │   ├── organize_colorful_cups/
       │   ├── purchase_gift_box/
       │   ├── put_drink_on_basket/
       │   └── sort_waste/
       └── IROS_C_V3_Aloha_unseen
           ├── collect_three_glues/
           ├── collect_two_alarm_clocks/
           ├── collect_two_shoes/
           ├── gather_three_teaboxes/
           ├── make_sandwich/
           ├── oil_painting_recognition/
           ├── organize_colorful_cups/
           ├── purchase_gift_box/
           ├── put_drink_on_basket/
           └── sort_waste/
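
    As a hedged illustration, the validation split could be fetched with huggingface_hub; the repo id below is a placeholder, not the actual dataset id:

      # Hypothetical sketch: download the validation data from Hugging Face.
      # "your-org/IROS_C_V3_Aloha" is a placeholder repo id.
      from huggingface_hub import snapshot_download

      local_dir = snapshot_download(
          repo_id="your-org/IROS_C_V3_Aloha",
          repo_type="dataset",
      )
      print(local_dir)  # contains validation/IROS_C_V3_Aloha_seen and IROS_C_V3_Aloha_unseen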
      
  2. Scenarios
    • 10 seen + 10 unseen scenarios per task
    • USD scene files with accompanying metadata (an inspection sketch follows the listing)
    • Randomly generated
       validation
       ├── IROS_C_V3_Aloha_seen
       │   ├── collect_three_glues
       │   │   ├── 000
       │   │   │   ├── meta_info.pkl
       │   │   │   ├── scene.usd
       │   │   │   └── SubUSDs -> ../SubUSDs
       │   │   ├── 001/
       │   │   ├── 002/
       │   │   ├── 003/
       │   │   ├── 004/
       │   │   ├── 005/
       │   │   ├── 006/
       │   │   ├── 007/
       │   │   ├── 008/
       │   │   ├── 009/
       │   │   └── SubUSDs
       ...
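
    A minimal sketch of inspecting one generated scenario under the layout above; the contents of meta_info.pkl are not documented here, so the snippet only prints what it finds:

      # Hypothetical sketch: peek at one scenario's metadata and scene file.
      import pickle
      from pathlib import Path

      episode = Path("validation/IROS_C_V3_Aloha_seen/collect_three_glues/000")
      with open(episode / "meta_info.pkl", "rb") as f:
          meta = pickle.load(f)

      print(list(meta) if isinstance(meta, dict) else type(meta))  # available keys
      print(episode / "scene.usd")  # the scene to open in Isaac Sim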
      
  3. Robots
    • Supported platforms:
      • Franka Arm + Panda Gripper
      • Franka Arm + Robotiq Gripper
      • Aloha Dual-Arm Robot (used in competition)
  4. Controllers
    • Joint position control
    • Inverse kinematics solver
    • I/O specs: Docs

    Observation Structure (using Franka as an example)

     observations: List[Dict] = [
         {
             "robot": {
                 "robot_pose": Tuple[array, array], # (position, oritention(quaternion: (w, x, y, z)))
                 "joints_state": {
                     "positions": array, # (9,) or (13,) -> panda or robotiq
                     "velocities": array # (9,) or (13,) -> panda or robotiq
                 },
                 "eef_pose": Tuple[array, array], # (position, oritention(quaternion: (w, x, y, z)))
                 "sensors": {
                     "realsense": {
                         "rgb": array, # uint8 (480, 640, 3)
                         "depth": array, # float32 (480, 640)
                     },
                     "obs_camera": {
                         "rgb": array,
                         "depth": array,
                     },
                     "obs_camera_2": {
                         "rgb": array,
                         "depth": array,
                     },
                 },
                 "instruction": str,
                 "metric": {
                     "task_name": str,
                     "episode_name": str,
                     "episode_sr": int,
                     "first_success_step": int,
                     "episode_step": int
                 },
                 "step": int,
                 "render": bool
             }
         },
         ...
     ]
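
    For illustration, a hedged helper that splits one observation dict (structured as above) into typical policy inputs; the field names follow the structure shown, while the helper itself is hypothetical:

     # Hypothetical sketch: unpack one observation into policy inputs.
     def unpack(observation):
         robot = observation["robot"]
         return {
             "rgb": robot["sensors"]["realsense"]["rgb"],      # uint8 (480, 640, 3)
             "depth": robot["sensors"]["realsense"]["depth"],  # float32 (480, 640)
             "joints": robot["joints_state"]["positions"],     # (9,) panda / (13,) robotiq
             "instruction": robot["instruction"],              # language goal for the episode
         }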
    

    Action Space (using Franka as an example)
    You can use any of the following action formats as input (a combined example follows the three formats):

    ActionFormat1:

     List[float] # (9,) or (13,) -> panda or robotiq
    

    ActionFormat2:

     {
         'arm_action': List[float], # (7,)
         'gripper_action': Union[List[float], int], # (2,) or (6,) -> panda or robotiq || -1 or 1 -> open or close
     }
    

    ActionFormat3:

     {
         'eef_position': List[float], # (3,) -> (x, y, z)
         'eef_orientation': List[float], # (4,) -> (quaternion: (w, x, y, z))
         'gripper_action': Union[List[float], int], # (2,) or (6,) -> panda or robotiq || -1 or 1 -> open or close
     }
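
    As a hedged illustration, the same "reach a pose and open the gripper" command could be written in each format; all numeric values are placeholders:

     # Hypothetical sketch: one command in each action format (panda gripper).
     action_fmt1 = [0.0, -0.6, 0.0, -2.3, 0.0, 1.7, 0.8, 0.04, 0.04]  # 7 arm joints + 2 gripper joints

     action_fmt2 = {
         'arm_action': [0.0, -0.6, 0.0, -2.3, 0.0, 1.7, 0.8],  # (7,) joint targets
         'gripper_action': -1,                                  # -1 -> open, 1 -> close
     }

     action_fmt3 = {
         'eef_position': [0.4, 0.0, 0.3],          # (x, y, z) in meters
         'eef_orientation': [1.0, 0.0, 0.0, 0.0],  # quaternion (w, x, y, z)
         'gripper_action': -1,                      # -1 -> open, 1 -> close
     }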
    
  5. Sensors
    • Franka: Tabletop, Gripper-FPV, Rear
    • Aloha: Center FPV, Left/Right Gripper
    (Videos: franka + panda views from camera 1, camera 2, and camera 3)
  6. Metrics
    The primary evaluation metric used is success rate, which is defined in two forms:

     • Soft success: A task counts as partially successful when at least a subset of its subtasks is completed. The soft success rate is the proportion of tasks that meet this partial-success criterion.
    • Hard success: A stricter metric where a task is only considered successful if all subtasks are completed.
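
    Under one reading of these definitions, the two rates could be computed as in this hedged sketch (the benchmark's exact per-subtask scoring may differ):

     # Hypothetical sketch: soft vs. hard success over evaluated episodes.
     # `episodes` maps an episode id to (completed_subtasks, total_subtasks).
     episodes = {"ep0": (2, 3), "ep1": (3, 3), "ep2": (0, 3)}
     n = len(episodes)

     soft_sr = sum(done > 0 for done, _ in episodes.values()) / n           # some subtasks done
     hard_sr = sum(done == total for done, total in episodes.values()) / n  # all subtasks done
     print(soft_sr, hard_sr)  # ~0.67, ~0.33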

    The baseline model is GR00T. The benchmarking results are shown below:

    • Leaderboard: validation_seen
    • Leaderboard: validation_unseen

  7. Additional Components
    • Developed a custom recorder that asynchronously captures frames and logs both state and image data at each timestep, improving runtime efficiency (a minimal sketch follows this list).
    • Implemented support for batch evaluation across multiple environments or parallel instances of Isaac Sim.
    • A comprehensive list of configurable parameters and additional features can be found in the official documentation.
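
    A minimal sketch of the asynchronous-recorder idea, using a background thread and a queue so that simulation stepping is never blocked by disk I/O; the actual recorder's API differs:

     # Hypothetical sketch: log frames off the simulation thread.
     import queue
     import threading

     class AsyncRecorder:
         def __init__(self):
             self._q = queue.Queue()
             threading.Thread(target=self._drain, daemon=True).start()

         def record(self, step, state, image):
             """Called from the sim loop; enqueues and returns immediately."""
             self._q.put((step, state, image))

         def _drain(self):
             while True:
                 step, state, image = self._q.get()
                 # Serialize the state and image to disk here (omitted).
                 self._q.task_done()

         def flush(self):
             self._q.join()  # block until everything enqueued so far is written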

🔧 InternManip Integration

This evaluation environment is integrated as a benchmark module within InternManip. You can explore the implementation details under:

InternManip/internmanip/benchmarks/genmanip

While extending genmanip evaluation within InternManip, I implemented key components including:

  • A wrapper environment
  • A custom evaluator
  • Parallel evaluation using the Ray framework (sketched after this list)
  • Model and agent integration interfaces
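
A hedged sketch of the Ray-based parallelism with a placeholder evaluator; the actual InternManip classes, configs, and episode layout differ:

    # Hypothetical sketch: fan evaluation episodes out across Ray actors.
    import ray

    ray.init()

    @ray.remote
    class EpisodeEvaluator:
        def run(self, episode_path: str) -> dict:
            # Placeholder: load the scene, roll out the agent, score subtasks.
            return {"episode": episode_path, "success": False}

    episodes = [f"collect_three_glues/{i:03d}" for i in range(10)]
    workers = [EpisodeEvaluator.remote() for _ in range(4)]
    futures = [workers[i % len(workers)].run.remote(ep) for i, ep in enumerate(episodes)]
    results = ray.get(futures)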

Because InternManip is still under active development, many components are still being optimized, so I cannot showcase everything at this stage. However, a major refactor is planned once the IROS challenge concludes.


🚀 What’s Next: Vision for InternManip

After the upcoming refactor, InternManip will evolve into a full-fledged, all-in-one training and evaluation framework for robotic manipulation tasks in simulation.

Key goals include:

  • Rich set of algorithm implementations and benchmark examples
  • User-friendly tools for training and evaluation
  • Streamlined development workflow for algorithm researchers

This refactor will lay a solid foundation for future modular and scalable development. Stay tuned for the next release!


Supplementary videos of the other two robot platforms:

(Videos: franka + robotiq views from cameras 1, 2, and 3; aloha top and left camera views)