NVIDIA Generative AI Multimodal Certification NCA-GENM Exam Questions:
1. You're training a conditional GAN to generate images of birds based on text descriptions. The GAN generates images, but they lack fine-grained details and often have artifacts. Which of the following techniques are MOST likely to improve the quality and realism of the generated images? (Select TWO)
A) Reducing the size of the input noise vector to the generator.
B) Using a more powerful discriminator architecture (e.g., with attention mechanisms).
C) Implementing spectral normalization in both the generator and discriminator.
D) Using a deeper and wider generator network (e.g., with more layers and channels).
E) Using a simple Multi-Layer Perceptron (MLP) as the generator.
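For readers unfamiliar with the spectral normalization mentioned in option C, the idea can be sketched as follows: estimate the largest singular value of a weight matrix by power iteration, then divide the weights by it so the layer's spectral norm is about 1. This is a minimal plain-Python sketch on a hypothetical 2x2 matrix; the helper names (`spectral_norm`, `spectrally_normalize`) are illustrative assumptions, not any library's API. In PyTorch, `torch.nn.utils.spectral_norm` wraps a layer with a comparable reparameterization.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration."""
    Wt = transpose(W)
    u = l2_normalize([1.0] * len(W))
    for _ in range(n_iters):
        v = l2_normalize(matvec(Wt, u))
        u = l2_normalize(matvec(W, v))
    # sigma is approximated by u^T W v
    return sum(u_i * wv_i for u_i, wv_i in zip(u, matvec(W, v)))

def spectrally_normalize(W, n_iters=50):
    """Rescale W so that its largest singular value is about 1."""
    sigma = spectral_norm(W, n_iters)
    return [[w / sigma for w in row] for row in W]

# Hypothetical weight matrix with singular values 3 and 1.
W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = spectrally_normalize(W)
```

Bounding the spectral norm of each layer limits how sharply the network's output can change with its input, which is why the technique is commonly used to stabilize GAN training.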
2. You are fine-tuning a pre-trained multimodal model for a new task. You have limited computational resources. Which of the following fine-tuning strategies would be the MOST computationally efficient while still achieving good performance?
A) Freeze all layers except the classification head and fine-tune only the classification head.
B) Freeze the lower layers of the model and fine-tune the upper layers and the classification head.
C) Train a new, randomly initialized model from scratch for the task, which avoids the need to load the pre-trained model.
D) Fine-tune all the layers of the model.
E) Randomize the model's weights before training, if doing so improves the training rate.
3. Which of the following techniques is most appropriate for mitigating the vanishing gradient problem in very deep neural networks, particularly when training generative models?
A) Data augmentation
B) Weight decay
C) Dropout
D) Early stopping
E) Residual connections (skip connections)
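The effect of skip connections on gradient flow can be illustrated with a scalar toy model. The sketch below compares the input-to-output derivative of a 50-layer stack of `tanh(w * x)` layers with the same stack where each layer adds its input back (`x + tanh(w * x)`). The depth, weight, and input values are hypothetical choices made only for illustration, and a scalar chain is a simplification of a real network.

```python
import math

def plain_gradient(x, w, depth):
    """d(output)/d(input) for a stack of layers x <- tanh(w * x)."""
    grad = 1.0
    for _ in range(depth):
        y = math.tanh(w * x)
        grad *= w * (1.0 - y * y)        # chain rule through tanh(w * x)
        x = y
    return grad

def residual_gradient(x, w, depth):
    """Same stack, but each layer is x <- x + tanh(w * x)."""
    grad = 1.0
    for _ in range(depth):
        y = math.tanh(w * x)
        grad *= 1.0 + w * (1.0 - y * y)  # the identity path contributes the 1
        x = x + y
    return grad

plain = plain_gradient(x=1.0, w=0.5, depth=50)
residual = residual_gradient(x=1.0, w=0.5, depth=50)
print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {residual:.3e}")
```

In the plain stack every layer multiplies the gradient by a factor of at most 0.5, so the product shrinks toward zero with depth. With the skip connection each factor is at least 1, so the gradient reaching the early layers does not vanish.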
4. Consider the following Python code snippet using PyTorch Lightning and a Hugging Face Transformers model for multimodal classification. Which of the following code snippets is MOST appropriate to perform gradient accumulation in this context, assuming you want to accumulate gradients over 4 batches?
A)
B)
C)
D)
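As background for this question, and independent of the answer options, gradient accumulation over 4 batches means summing the scaled gradients of 4 micro-batches and applying a single optimizer step afterward. In PyTorch Lightning this is typically requested with `Trainer(accumulate_grad_batches=4)`. The sketch below shows the underlying arithmetic in plain Python for a one-parameter linear model; the data, learning rate, and function names are hypothetical and chosen only for illustration.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical dataset: 8 samples generated with a true weight of 3.
data = [(float(i), 3.0 * i) for i in range(1, 9)]
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]  # 4 batches of 2
accum_steps = len(micro_batches)

w0, lr = 0.0, 0.01
full_batch_grad = mse_grad(w0, data)  # reference: gradient over all 8 samples

w = w0
accumulated = 0.0
grad_used = None
for step, batch in enumerate(micro_batches, start=1):
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += mse_grad(w, batch) / accum_steps
    if step % accum_steps == 0:
        grad_used = accumulated
        w -= lr * accumulated         # one optimizer step per 4 micro-batches
        accumulated = 0.0             # reset, like zeroing gradients
```

Because the micro-batches are equally sized, the accumulated gradient matches the gradient of one batch four times as large, while only one micro-batch has to be held in memory at a time.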
5. You are tasked with building a system that generates realistic images based on both textual descriptions and a semantic segmentation map. The segmentation map provides spatial information about the objects present in the scene. Which of the following generative architectures is MOST appropriate for this multimodal task?
A) Variational Autoencoder (VAE)
B) Diffusion model without conditioning
C) Autoregressive model like PixelCNN
D) Vanilla Generative Adversarial Network (GAN)
E) Conditional Generative Adversarial Network (cGAN) with both text and segmentation map as conditions.
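Questions and Answers:
Question #1 correct answer: C, D | Question #2 correct answer: B | Question #3 correct answer: E | Question #4 correct answer: C | Question #5 correct answer: E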