The “Designer” section of the AI Verse app is where the magic happens. It empowers users to craft and customize scenes with precision, specifying object arrangements, activity placements, camera positioning, and more.
In the context of AI Verse’s dataset designer, “Scene Contents” plays a crucial role in defining what will be randomly added to the generated scenes. It allows users to control the objects and activities that appear in the dataset, tailoring it to their specific use case and requirements.
Located in the right pane of the app, the “Scene Contents” section consists of two tabs: “Objects” and “Activities.” Each tab presents collections of objects or activities, respectively, enabling users to create diverse and realistic random scenes.
The “Objects” tab within “Scene Contents” displays object collections as tiles, each with a single thumbnail and icons representing the categories that constitute the collection. These collections are created using the concept of “Randomized Object Collections” mentioned in the “Content” section, allowing for the placement of random objects interchangeably in scenes.
To add a new randomized object collection to a dataset, click on “Add object” under the “Objects” section of “Scene Contents.” This will open the object catalog, presenting a tree of checkboxes on the left side for selecting categories. Users can expand subcategories to refine their selection, and the right side of the window displays thumbnails reflecting the current selection. Additionally, a search bar enables users to filter the tree based on keywords. Once the desired objects are selected, click “Save collection” to create the new collection.
Collection Parameter Menu
This menu pops up when you click on a collection tile, allowing you to adjust various parameters.
Parameters menu for a given Activity target tile
For each activity selected as Activity POV, you can customize certain parameters to control the camera’s perspective of the activity.
In the Camera Intrinsics menu, you have the ability to adjust the following parameters:
The resolution parameter governs the image dimensions and quality. You can choose from preset options like VGA or XGA, or opt for a custom resolution to cater to specific project requirements. It’s important to note that custom resolutions are subject to pricing adjustments, with the price per image influenced by the number of pixels in each image.
The FOV angle defines the width of the camera’s field of vision. It determines how much of the scene is captured within the image. The FOV angle is connected to the image width and focal length in pixels through the following formulas:

$$\tan\!\left(\frac{\alpha}{2}\right) = \frac{W}{2f} \quad\Longleftrightarrow\quad f = \frac{W}{2\tan(\alpha/2)}$$

where $f$ is the focal length, $W$ the image width, and $\alpha$ the FOV angle.
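For reference, here is a minimal Python sketch of this relationship (the function names are illustrative, not part of the AI Verse tooling):

```python
import math

def focal_from_fov(width_px: float, fov_deg: float) -> float:
    """Focal length in pixels: f = W / (2 * tan(alpha / 2))."""
    return width_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))

def fov_from_focal(width_px: float, focal_px: float) -> float:
    """FOV angle in degrees: alpha = 2 * atan(W / (2 * f))."""
    return math.degrees(2.0 * math.atan(width_px / (2.0 * focal_px)))

# Example: a 640-pixel-wide image with a 60° FOV gives f ≈ 554.3 px.
f = focal_from_fov(640, 60.0)
print(round(f, 1), round(fov_from_focal(640, f), 1))
```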
Use the double slider to set the smallest and largest FOV angles the camera can take; each image is then rendered with a random FOV angle drawn from that range. You can either simulate one specific focal length (or FOV angle), or select a wider range to cover use cases with varying zooms and lenses.
Exposure settings allow you to control the overall brightness of the captured images. You can choose from three exposure levels: “Under-exposed,” “Normal,” and “Over-Exposed,” tailoring the visual appearance of the synthetic data to your preferences.
Depth of field blur simulates the effect of a camera’s aperture, resulting in a focus on certain objects while blurring the rest. You can select from options like “No blur,” “Low,” “Mid,” or “High” to determine the amount of blur applied. When depth of field blur is active, an additional setting becomes available: “Depth of Field Focus.” This setting lets you choose whether the camera’s target (as defined in Camera Positioning parameters) remains in focus or is intentionally blurred, allowing for creative control over the image composition.
Motion blur adds a sense of motion and realism to images by simulating camera movement. Six double sliders set the motion ranges: horizontal, vertical, and forward/backward translation (in centimeters), as well as small tilt, roll, and pan rotations (in degrees). When motion blur is enabled, camera view previews and dataset images are computed with motion blur values randomized within the specified ranges, introducing a dynamic element to the generated data.
A lighting scenario encapsulates the lighting conditions within a scene. It’s a recipe for how light interacts with your synthetic environment, comprising the following parameters:
Crafting and deploying lighting scenarios serves two primary purposes:
To start, access the lighting section by clicking on the lightbulb icon in the toolbar. You’re presented with a dropdown for quick preset selection and a list of lighting scenarios shown as tiles.
At the top of the lighting section, a dropdown offers a set of multi-scenario presets: pre-arranged distributions with meaningful percentages.
Scene Overview: This is akin to the bird’s-eye view, captured from a top corner of your virtual room. It’s your chance to observe the scene’s composition, including objects and activities. However, it doesn’t directly match an image produced in your dataset.
Sample Camera View: Picture this as your camera’s perspective. It’s an image captured from a random camera placement that adheres to the constraints and parameters you’ve set in the Camera Positioning section.
Previewing Lighting Scenarios: Sometimes, you want to peek at specific aspects of your dataset. When a batch includes a distribution of lighting scenarios, you can see how a particular scenario pans out. To do this, head to the Scene Lighting section, hover over a lighting scenario tile, and click on the eye icon. This updates the Scene Overview preview, giving you a glimpse of a random scene under that lighting condition.
Previewing Single Targets: When dealing with multiple targets in Freespace mode, each image captures one random target from the distribution. To assess how a specific target looks, open the Camera Positioning section, hover over the target’s tile, and click the eye icon. This updates the Sample Camera View preview, revealing a random image with the camera aimed at the chosen target.
The generated data includes the following files:

scene_instances.json: A JSON file containing the metadata and annotations for the scene, including information about images, 2D object instances, 3D boxes for instances, and 2D/3D pose for humans.
beauty.XXXX.png: The rendered RGB image of the scene for a specific view, where XXXX represents the view identifier.
beauty_16bit.XXXX.exr: A high dynamic range (HDR) version of the rendered image in OpenEXR format.
depth.XXXX.png: The depth map of the scene for the specific view, indicating the distance of each pixel from the camera. Pixel values are 255*0.5/d, where d is the distance in meters (see the decoding sketch after this list).
depth_16bit.XXXX.exr: A high-precision depth map in OpenEXR format. Pixel values are 255*0.5/d, where d is the distance in meters.
normal.XXXX.png: An image encoding the surface normals of the scene for the specific view.
color.XXXX.png: An albedo image representing the color or reflectance properties of the scene for the specific view. The albedo image provides information about the intrinsic color of objects in the scene, disregarding lighting effects and shadows.
lighting_only.XXXX.png: An image representing the scene with only the lighting information.
subclass.XXXX.png: An image indicating the subclass labels for the objects/humans present in the scene for the specific view.
class.XXXX.png: An image indicating the class labels for the objects/humans present in the scene for the specific view.
superclass.XXXX.png: An image indicating the superclass labels for the objects/humans present in the scene for the specific view.
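For example, metric depth can be recovered from depth.XXXX.png by inverting the encoding above. A minimal sketch, assuming the file is an 8-bit single-channel PNG loaded with Pillow and NumPy (libraries of your choice, not part of the AI Verse tooling):

```python
import numpy as np
from PIL import Image

def depth_in_meters(path: str) -> np.ndarray:
    """Invert pixel = 255 * 0.5 / d to recover the distance d in meters."""
    pixels = np.asarray(Image.open(path), dtype=np.float32)
    # Zero-valued pixels correspond to invalid / out-of-range depth.
    return np.where(pixels > 0, (255.0 * 0.5) / np.maximum(pixels, 1e-6), np.inf)

depth = depth_in_meters("depth.0000.png")  # "0000" stands for the view identifier
print(depth.min(), depth.max())
```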
The scene_instances.json file provides detailed metadata and annotations for the synthetic scene. It consists of the following sections: per-image metadata, 2D instances (“instances”), 3D object instances (“instances_3d”), and 3D human instances (“humans_3d”).
For each image (view), the following fields are provided:
id: A unique identifier for the image.
file_name: The filename of the image.
height: The height of the image in pixels.
width: The width of the image in pixels.
camera_params: Camera parameters for the view, including resolution, intrinsic parameters, pose, and matrix mode.
The camera_params object provides information about the camera used to capture the scene view. It contains the following parameters:
resolution: A 2-element array representing the resolution of the camera in pixels [width, height].
intrinsics: An array containing the intrinsic camera parameters, typically represented as [fx, fy, cx, cy, skew], where fx and fy are the focal lengths of the camera in the x and y directions, cx and cy represent the principal point (the center of the image) in pixel coordinates, and skew represents any skew between the x and y axes of the camera.
pose4x4: A 16-element array representing the camera pose as a 4×4 transformation matrix. The matrix is typically represented in row-major order and contains the rotation and translation components of the camera pose.
pose: A 7-element array representing the camera pose as [tx, ty, tz, qx, qy, qz, qw], where tx, ty, and tz represent the translation components of the camera pose and qx, qy, qz, and qw represent the quaternion components of the camera rotation.
Each entry in the “instances” section describes a 2D object or human instance in a given view and contains the following fields:
image_id: The identifier of the image to which the instance belongs.
path: The path of the object instance in the taxonomy.
area: The area of the instance in the image.
superclass: The superclass label of the instance.
class: The class label of the instance.
subclass: The subclass label of the instance.
bbox: The bounding box coordinates [xmin, ymin, xmax, ymax] of the instance in the image.
segmentation: The COCO-style polygonal segmentation mask of the instance.
box_3d_id: The identifier of the corresponding 3D box in the “instances_3d” section.
activity: A string containing the taxonomy of the activity performed by the person.
pose_2d: An array of 2D joint positions and visibility information. Each joint object contains the following parameters:
joint_name: The name or identifier of the joint.
position: The 2D position coordinates [x, y] of the joint in the image, specifying its pixel location.
visibility: A visibility value indicating the visibility or occlusion status of the joint in the image. It can take values such as 0 (not in the image), 1 (in the image bounds but occluded), or 2 (visible in the image).
human_3d_id: The identifier of the corresponding 3D human instance in the “humans_3d” section.
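A minimal sketch of reading these 2D annotations with Python’s json module follows; the top-level key names used here for the per-image metadata and instance lists (“images” and “instances”) are assumptions to verify against your own file:

```python
import json
from collections import defaultdict

with open("scene_instances.json") as f:
    scene = json.load(f)

# Assumed top-level keys; adjust to the actual layout of your file.
images_by_id = {img["id"]: img for img in scene["images"]}
instances_per_image = defaultdict(list)
for inst in scene["instances"]:
    instances_per_image[inst["image_id"]].append(inst)

# Print the bounding box of every instance, grouped by view.
for image_id, instances in instances_per_image.items():
    file_name = images_by_id[image_id]["file_name"]
    for inst in instances:
        xmin, ymin, xmax, ymax = inst["bbox"]
        print(f"{file_name}: {inst['class']} [{xmin}, {ymin}, {xmax}, {ymax}]")
```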
Each entry in the “instances_3d” section contains the following fields:
id: The unique identifier of the 3D instance. This identifier serves as a reference for the 3D object instance across multiple views. In the “instances” section, for each view where the 3D instance is visible, there will be a corresponding block with a box_3d_id that matches the id of this 3D instance. This allows for easy association and linking of the 2D instances to their corresponding 3D annotations.
path: The path of the object instance in the taxonomy.
translation: The translation coordinates [tx, ty, tz] of the instance centroid in 3D space.
orientation: The orientation quaternion [qx, qy, qz, qw] of the instance.
scale: The scale factors [sx, sy, sz] of the instance.
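To relate a 2D instance to its 3D box, follow box_3d_id back to the matching id, for example as in this sketch (SciPy handles the [qx, qy, qz, qw] quaternion; the top-level key names are again assumptions):

```python
import json
import numpy as np
from scipy.spatial.transform import Rotation

with open("scene_instances.json") as f:
    scene = json.load(f)

# Index 3D boxes by id so that 2D instances can be linked via box_3d_id.
boxes_3d = {box["id"]: box for box in scene["instances_3d"]}

def box_to_matrix(box):
    """4x4 object-to-world transform from translation, orientation, and scale."""
    T = np.eye(4)
    R = Rotation.from_quat(box["orientation"]).as_matrix()  # expects [qx, qy, qz, qw]
    T[:3, :3] = R @ np.diag(box["scale"])
    T[:3, 3] = box["translation"]
    return T

for inst in scene["instances"]:
    box_id = inst.get("box_3d_id")
    if box_id in boxes_3d:
        M = box_to_matrix(boxes_3d[box_id])
        print(inst["class"], "-> centroid", M[:3, 3])
```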
Each entry in the “humans_3d” section contains the following fields:
id: The unique identifier of the 3D human instance. This identifier serves as a reference for the 3D human instance across multiple views. In the “instances” section, for each view where the 3D human instance is visible, there will be a corresponding block with a human_3d_id that matches the id of this 3D human instance. This allows for easy association and linking of the 2D instances to their corresponding 3D annotations.
posture: The posture of the human instance, indicating the general pose or position of the human (e.g., standing, sitting, walking, etc.).
pose_3d: An array of joint positions and visibility information in 3D space. Each joint represents a specific part of the human body, such as the nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. Each joint object contains the following parameters:
joint_name: The name or identifier of the joint.
translation: The translation coordinates [tx, ty, tz] of the joint in 3D space, specifying its position relative to the global coordinate system.
In AI Verse’s dataset designer, you have the power to enrich your datasets with diverse and dynamic content by adding randomized object collections. A collection is a group of 3D models defined by specific categories in our taxonomy, along with a desired count. For instance, you can create a collection that includes ‘4 instances of appetizers or salad bowls from the food category.’
When you create a dataset, our system generates random 3D scenes for you. Within these scenes, objects from your collection are placed in a way that matches your specified count and the chosen taxonomy subtrees. For example, a generated scene might include 3 different appetizers and 1 salad bowl, creating a variety of combinations and enriching your dataset with diverse content.
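Conceptually, the sampling behaves like this toy sketch (this is not the actual generation engine; the category and asset names are made up for illustration):

```python
import random

# A toy randomized object collection: 4 instances drawn from two
# taxonomy subtrees (category and asset names are illustrative only).
collection = {"categories": ["food.appetizer", "food.salad_bowl"], "count": 4}

assets = {
    "food.appetizer": ["appetizer_01", "appetizer_02", "appetizer_03"],
    "food.salad_bowl": ["salad_bowl_01", "salad_bowl_02"],
}

def sample_collection(collection, assets):
    """Pick `count` random models from the union of the selected categories."""
    pool = [model for cat in collection["categories"] for model in assets[cat]]
    return [random.choice(pool) for _ in range(collection["count"])]

print(sample_collection(collection, assets))  # e.g. three appetizers and one salad bowl
```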
In the dataset designer, certain objects possess the ability to act as support surfaces. These support surfaces, such as tables, shelves, bookcases, and even chairs and sofas, can logically hold or host other objects. When adding randomized object collections to your dataset, you have the option to specify one or more random support surfaces for them. This allows you to create more realistic and contextually relevant scenes, or to model a certain use case with your generated data.
For example, you can choose to include ‘4 instances of appetizers with support surface coffee tables.’ This means that the appetizers in the scene will be placed on coffee tables, creating a natural and realistic arrangement.
To simplify this process, each object, superclass, class, or subclass in the taxonomy has default support surfaces (affinities) defined. For instance, ‘prop.decorative’ is a subclass consisting of decorative objects, and it is pre-defined to be placed on support surfaces like ‘furniture.coffee_table’ and ‘furniture.shelves,’ ensuring that objects are placed in suitable and visually coherent positions.
Certain objects have unique characteristics and placement requirements. For example:
Activities in AI Verse’s dataset designer refer to the actions and interactions performed by human actors within scenes, such as working, reading, watching TV, eating, or cooking. They cover a wide range of postures, movements, and scenarios, and can be added to increase realism or to directly model specific use cases involving those activities.
Activity locations define where the actors are positioned in the scene. Each activity location is determined by combining a specific posture with suitable surroundings. For example, “sitting at a desk” combines the “sitting” posture with the “desk” surroundings.
Environments serve as pre-defined types of rooms that provide context and streamline the scene and image generation process. Each environment represents a specific setting, such as living rooms, bedrooms, offices, meeting rooms, kitchens, studios, hallways, and more.
By selecting an environment for a batch, users can quickly generate scenes and images that match the specified context. This avoids the need for users to manually insert collections of objects related to that environment, as the data generation engine will intelligently handle the placement and arrangement of compatible objects and activities.
Environments also play a vital role in filtering out incompatible activities and objects. For instance, it would not be feasible to have “furniture.bed” in a hallway or the activity “food_drink.cook.stove.hold_pan” in a bedroom. The defined environments ensure that only suitable content is combined to create realistic and contextually appropriate scenes.
Estimating the cost of your dataset is made simpler with the help of the price estimator tool. Here’s how you can do it:
With this estimator, you can get a clear idea of what to expect before you proceed to generate your synthetic dataset.