NCP-AII Exam Free Question Set: "NVIDIA AI Infrastructure Certification"
You're deploying a distributed training workload across multiple NVIDIA A100 GPUs connected with NVLink and InfiniBand. What steps are necessary to validate the end-to-end network performance between the GPUs before running the actual training job? (Select all that apply)
Correct answer: B, C, D
You are setting up a virtualized environment (using VMware vSphere) to run GPU-accelerated workloads. You have multiple physical GPUs in your server and want to assign specific GPUs to different virtual machines (VMs) for dedicated access. Which vSphere technology would BEST support this?
Correct answer: C
You're deploying a large language model for inference using NVIDIA Triton Inference Server. You need to validate that the server can handle the expected query load while maintaining acceptable latency. Which tools and metrics are most relevant for this validation?
Correct answer: D
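To make "expected query load" and "acceptable latency" concrete, the following is a minimal sketch of how throughput and latency percentiles might be summarized from a load test. The function name `summarize`, the sample latencies, and the 2-second window are hypothetical values for illustration; they do not come from Triton or from any specific benchmarking tool.

```python
import statistics

def summarize(latencies_ms, window_s):
    """Summarize a load test: requests per second plus p50/p99 latency.

    latencies_ms: per-request latencies in milliseconds (needs >= 2 samples).
    window_s: wall-clock duration of the measurement window in seconds.
    """
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "throughput_qps": len(latencies_ms) / window_s,
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
    }

# Hypothetical measurements, not real benchmark data.
sample_latencies_ms = [12.0, 15.5, 11.2, 40.3, 13.8, 14.1, 12.9, 90.7, 13.3, 12.4]
print(summarize(sample_latencies_ms, window_s=2.0))
```

Tail percentiles such as p99 are usually more informative than the mean here, because a few slow requests can violate a latency target while the average still looks healthy.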
An AI server with 8 GPUs is experiencing random system crashes under heavy load. The system logs indicate potential memory errors, but standard memory tests (memtest86+) pass without any failures. The GPUs are passively cooled. What are the THREE most likely root causes of these crashes?
Correct answer: B, C, D
You are deploying a BlueField-2 DPU-based server in a VMware vSphere environment. Which network virtualization technology is most commonly used in conjunction with the DPU to provide accelerated networking and security features within the virtualized environment?
Correct answer: A
You are troubleshooting a network performance issue in your NVIDIA Spectrum-X based AI cluster. You suspect that the Equal-Cost Multi-Path (ECMP) hashing algorithm is not distributing traffic evenly across available paths, leading to congestion on some links. Which of the following methods would be MOST effective for verifying and addressing this issue?
Correct answer: A
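As a rough illustration of the imbalance this question describes, the sketch below hashes a handful of flow 5-tuples onto equal-cost paths and counts the flows per path. The CRC32 hash, the addresses and ports, and the helper names are assumptions made for illustration only; this is not the hashing algorithm used by Spectrum-X switches.

```python
import zlib
from collections import Counter

def ecmp_path(flow, num_paths):
    """Pick a path index by hashing the flow 5-tuple (illustrative CRC32 hash)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_paths

def path_load(flows, num_paths):
    """Return the number of flows assigned to each path, indexed by path."""
    counts = Counter(ecmp_path(flow, num_paths) for flow in flows)
    return [counts.get(i, 0) for i in range(num_paths)]

def imbalance_ratio(loads):
    """Max load divided by mean load; 1.0 means a perfectly even spread."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0

# Hypothetical flows: (src_ip, dst_ip, src_port, dst_port, protocol)
flows = [("10.0.0.1", "10.0.1.1", 49152 + i, 4791, "udp") for i in range(8)]
loads = path_load(flows, num_paths=4)
print(loads, round(imbalance_ratio(loads), 2))
```

With only a few long-lived, high-bandwidth flows, as is typical for collective communication in training, a per-flow hash can easily place several flows on the same link, which is why per-link counters are worth comparing rather than assuming an even spread.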
You are managing a cluster of GPU servers for deep learning. You observe that one server consistently exhibits high GPU temperature during training, causing thermal throttling and reduced performance. You've already ensured adequate airflow. Which of the following actions would be MOST effective in addressing this issue?
Correct answer: A, D
Your deep learning training job that utilizes NCCL (NVIDIA Collective Communications Library) for multi-GPU communication is failing with "NCCL internal error, unhandled system error" after a recent CUDA update. The error occurs during the 'all reduce' operation.
What is the most likely root cause and how would you address it?
Correct answer: C
You are tasked with installing the NGC CLI on a host that does not have direct internet access. You have downloaded the NGC CLI package to a local repository. Which of the following steps are required to successfully install and configure the NGC CLI in this offline environment?
Correct answer: A, B, D, E
You are using the NVIDIA Container Toolkit in a Kubernetes environment with multiple GPUs per node. You want to ensure that pods can request specific GPUs on a node, rather than simply requesting 'any' GPU. Which Kubernetes feature, in conjunction with the NVIDIA Device Plugin, allows you to achieve this fine-grained GPU resource allocation?
Correct answer: E
Consider a scenario where you are running a CUDA application on an NVIDIA GPU. The application compiles successfully but crashes during runtime with a `CUDA_ERROR_ILLEGAL_ADDRESS` error. You've carefully reviewed your code and can't find any obvious out-of-bounds memory accesses. What advanced debugging techniques could help you pinpoint the source of this error?
Correct answer: C, D, E