Cas interface 3 plus manual


















This is to avoid blocking a CUDA internal shared thread and preventing forward progress. It is legal to signal another thread to perform an API call, as long as the dependency is one way and the thread doing the call cannot block forward progress of CUDA work. There are two version numbers that developers should care about when developing a CUDA application: the compute capability, which describes the general specifications and features of the compute device (see Compute Capability), and the version of the CUDA driver API, which describes the features supported by the driver API and runtime.

It allows developers to check whether their application requires a newer device driver than the one currently installed. This is important because the driver API is backward compatible, meaning that applications, plug-ins, and libraries (including the CUDA runtime) compiled against a particular version of the driver API will continue to work on subsequent device driver releases, as illustrated in the compatibility figure. The driver API is not forward compatible, which means that applications, plug-ins, and libraries (including the CUDA runtime) compiled against a particular version of the driver API will not work on previous versions of the device driver.

It is important to note that there are limitations on the mixing and matching of versions that is supported. The requirements on the CUDA Driver version described here apply to the version of the user-mode components.

On Tesla solutions running Windows Server or Linux, one can set any device in a system in one of the three following modes using NVIDIA's System Management Interface (nvidia-smi), a tool distributed as part of the driver: default mode, in which multiple host threads can use the device; exclusive-process mode, in which only one CUDA context may be created on the device across all processes in the system; and prohibited mode, in which no CUDA context can be created on the device. This means, in particular, that a host thread using the runtime API without explicitly calling cudaSetDevice might be associated with a device other than device 0 if device 0 turns out to be in prohibited mode or in exclusive-process mode and used by another process.

Note also that, for devices featuring the Pascal architecture onwards (compute capability with major revision number 6 and higher), there exists support for Compute Preemption. This allows compute tasks to be preempted at instruction-level granularity, rather than at thread block granularity as in the prior Maxwell and Kepler GPU architectures, with the benefit that applications with long-running kernels can be prevented from either monopolizing the system or timing out.

However, there will be context switch overheads associated with Compute Preemption, which is automatically enabled on those devices for which support exists. The individual attribute query function cudaDeviceGetAttribute with the attribute cudaDevAttrComputePreemptionSupported can be used to determine if the device in use supports Compute Preemption. Users wishing to avoid context switch overheads associated with different processes can ensure that only one process is active on the GPU by selecting exclusive-process mode.
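As a rough sketch of that attribute query (device index 0 and the error handling are illustrative assumptions, not part of the original text):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int device = 0;                     // assumed device index for illustration
        int preemptionSupported = 0;

        // Query whether the device supports Compute Preemption.
        cudaError_t err = cudaDeviceGetAttribute(
            &preemptionSupported, cudaDevAttrComputePreemptionSupported, device);
        if (err != cudaSuccess) {
            printf("Query failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Compute Preemption supported: %s\n",
               preemptionSupported ? "yes" : "no");
        return 0;
    }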

Applications may query the compute mode of a device by checking the computeMode device property (see Device Enumeration).

GPUs that have a display output dedicate some DRAM memory to the so-called primary surface, which is used to refresh the display device whose output is viewed by the user.

When users initiate a mode switch of the display by changing the resolution or bit depth of the display (using NVIDIA control panel or the Display control panel on Windows), the amount of memory needed for the primary surface changes. For example, if the user increases the display resolution, the system must dedicate correspondingly more memory to the primary surface. Full-screen graphics applications running with anti-aliasing enabled may require much more display memory for the primary surface.

If a mode switch increases the amount of memory needed for the primary surface, the system may have to cannibalize memory allocations dedicated to CUDA applications.

Therefore, a mode switch causes any call to the CUDA runtime to fail and return an invalid context error. When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor.

As thread blocks terminate, new blocks are launched on the vacated multiprocessors. A multiprocessor is designed to execute hundreds of threads concurrently. The instructions are pipelined, leveraging instruction-level parallelism within a single thread, as well as extensive thread-level parallelism through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike CPU cores, instructions are issued in order, and there is no branch prediction or speculative execution.

SIMT Architecture and Hardware Multithreading describe the architecture features of the streaming multiprocessor that are common to all devices; the compute capability sections provide the specifics for each device generation. The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.

The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp. When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution.

The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.
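For example, under this partitioning rule the warp index and lane index of a thread in a one-dimensional block can be derived with a division and a remainder by the warp size (the kernel is a hypothetical illustration):

    #include <cstdio>

    __global__ void warpInfo()
    {
        // Threads with consecutive, increasing IDs belong to the same warp,
        // so the warp index and the lane (position within the warp) follow
        // directly from the thread ID and warpSize (32).
        int tid    = threadIdx.x;
        int warpId = tid / warpSize;   // which warp within the block
        int lane   = tid % warpSize;   // position of the thread within its warp

        if (lane == 0)
            printf("Block %d, warp %d starts at thread %d\n",
                   blockIdx.x, warpId, tid);
    }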

If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge.

In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually. Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp.

As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.

Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.

A schedule optimizer determines how to group active threads from the same warp together into SIMT units. Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about the warp-synchronicity of previous hardware architectures. In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with Volta and beyond.
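For instance, an intra-warp reduction that older code wrote without explicit synchronization can name its participating threads explicitly with the *_sync warp primitives; a minimal sketch, assuming all 32 lanes of the warp participate (hence the full 0xffffffff mask):

    __device__ float warpReduceSum(float val)
    {
        // Each step adds the value held by a lane `offset` positions away.
        // The explicit mask makes the set of participating threads part of
        // the program rather than an assumption about hardware scheduling.
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;  // lane 0 ends up holding the sum of all 32 lanes
    }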

See Compute Capability 7.x for further details. The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled).

Threads can be inactive for a variety of reasons including having exited earlier than other threads of their warp, having taken a different branch path than the branch path currently executed by the warp, or being the last threads of a block whose number of threads is not a multiple of the warp size.

If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see the compute capability sections).

The execution context (program counters, registers, and so on) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

In particular, each multiprocessor has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks. The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor.

There is also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Appendix Compute Capabilities.

If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch. Which strategies will yield the best performance gain for a particular portion of an application depends on the performance limiters for that portion; optimizing instruction usage of a kernel that is mostly limited by memory accesses will not yield any significant performance gain, for example.

Optimization efforts should therefore be constantly directed by measuring and monitoring the performance limiters, for example using the CUDA profiler. Also, comparing the floating-point operation throughput or memory throughput (whichever makes more sense) of a particular kernel to the corresponding peak theoretical throughput of the device indicates how much room for improvement there is for the kernel. To maximize utilization, the application should be structured in a way that exposes as much parallelism as possible and efficiently maps this parallelism to the various components of the system to keep them busy most of the time.

At a high level, the application should maximize parallel execution between the host, the devices, and the bus connecting the host to the devices, by using asynchronous function calls and streams as described in Asynchronous Concurrent Execution.

It should assign to each processor the type of work it does best: serial workloads to the host; parallel workloads to the devices. For the parallel workloads, at points in the algorithm where parallelism is broken because some threads need to synchronize in order to share data, there are two cases: either these threads belong to the same block, in which case they can use __syncthreads() and share data through shared memory within the same kernel invocation, or they belong to different blocks, in which case they must share data through global memory using two separate kernel invocations, one for writing to and one for reading from global memory. The second case is much less optimal since it adds the overhead of extra kernel invocations and global memory traffic.

Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.
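As a hedged illustration of keeping such communication within one block (the block size of 256 and the reduction itself are illustrative choices), partial results can be exchanged through shared memory and __syncthreads() instead of a second kernel launch:

    __global__ void blockSum(const float* in, float* blockResults)
    {
        __shared__ float partial[256];             // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        partial[threadIdx.x] = in[i];
        __syncthreads();                           // all partials visible to the block

        // Tree reduction within the block: no global memory round trip needed.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockResults[blockIdx.x] = partial[0]; // one result per block
    }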

At a lower level, the application should maximize parallel execution between the multiprocessors of a device. Multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable enough kernels to execute concurrently as described in Asynchronous Concurrent Execution. At an even lower level, the application should maximize parallel execution between the various functional units within a multiprocessor.

As described in Hardware Multithreading, a GPU multiprocessor primarily relies on thread-level parallelism to maximize utilization of its functional units. Utilization is therefore directly linked to the number of resident warps. At every instruction issue time, a warp scheduler selects an instruction that is ready to execute.

This instruction can be another independent instruction of the same warp, exploiting instruction-level parallelism, or more commonly an instruction of another warp, exploiting thread-level parallelism.

If a ready-to-execute instruction is selected, it is issued to the active threads of the warp. The number of clock cycles it takes for a warp to be ready to execute its next instruction is called the latency, and full utilization is achieved when all warp schedulers always have some instruction to issue for some warp at every clock cycle during that latency period, or in other words, when latency is completely "hidden".

The number of instructions required to hide a latency of L clock cycles depends on the respective throughputs of these instructions (see Arithmetic Instructions for the throughputs of various arithmetic instructions). If we assume instructions with maximum throughput, it is equal to L multiplied by the number of warp instructions the multiprocessor's schedulers can issue per clock cycle; for example, on devices whose four schedulers each issue one instruction per cycle, 4L instructions are required.

The most common reason a warp is not ready to execute its next instruction is that the instruction's input operands are not available yet. If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands are written by previous instructions whose execution has not completed yet. In this case, the latency is equal to the execution time of the previous instruction and the warp schedulers must schedule instructions of other warps during that time.

Execution time varies depending on the instruction. On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles. This means that 16 active warps per multiprocessor (4 cycles, 4 warp schedulers) are required to hide arithmetic instruction latencies, assuming that warps execute instructions with maximum throughput; otherwise fewer warps are needed.

If the individual warps exhibit instruction-level parallelism, i.e., have multiple independent instructions in their instruction stream, fewer warps are needed because multiple independent instructions from a single warp can be issued back to back. If some input operand resides in off-chip memory, the latency is much higher: typically hundreds of clock cycles. The number of warps required to keep the warp schedulers busy during such high latency periods depends on the kernel code and its degree of instruction-level parallelism. In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands (i.e., arithmetic instructions most of the time) to the number of instructions with off-chip memory operands is low.

Another reason a warp is not ready to execute its next instruction is that it is waiting at some memory fence (Memory Fence Functions) or synchronization point (Synchronization Functions). A synchronization point can force the multiprocessor to idle as more and more warps wait for other warps in the same block to complete execution of instructions prior to the synchronization point.

Having multiple resident blocks per multiprocessor can help reduce idling in this case, as warps from different blocks do not need to wait for each other at synchronization points. The number of blocks and warps residing on each multiprocessor for a given kernel call depends on the execution configuration of the call (Execution Configuration), the memory resources of the multiprocessor, and the resource requirements of the kernel as described in Hardware Multithreading.

The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory and the amount of dynamically allocated shared memory. The number of registers used by a kernel can have a significant impact on the number of resident warps.

For example, for devices of compute capability 6.x, whose multiprocessors have 64K 32-bit registers, if a kernel uses 64 registers per thread and each block has 512 threads and requires very little shared memory, then two blocks (i.e., 32 warps) can reside on the multiprocessor, since 2 x 512 x 64 registers exactly matches the number of registers available. But as soon as the kernel uses one more register, only one block (i.e., 16 warps) can be resident, since two blocks would then require more registers than are available. Therefore, the compiler attempts to minimize register usage while keeping register spilling (see Device Memory Accesses) and the number of instructions to a minimum. Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds.
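A brief sketch of the launch bounds mechanism (the bounds chosen here are illustrative, not recommendations):

    // Tell the compiler this kernel is launched with at most 256 threads per
    // block and that at least 4 blocks per multiprocessor are desired, so it
    // can cap register usage accordingly. Alternatively, nvcc --maxrregcount=N
    // applies a file-wide limit.
    __global__ void __launch_bounds__(256, 4)
    boundedKernel(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }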

The register file is organized as 32-bit registers, so each variable stored in a register needs at least one 32-bit register; a double variable, for example, uses two 32-bit registers. The effect of execution configuration on performance for a given kernel call generally depends on the kernel code. Experimentation is therefore recommended. Applications can also parameterize execution configurations based on register file size and shared memory size, which depend on the compute capability of the device, as well as on the number of multiprocessors and memory bandwidth of the device, all of which can be queried using the runtime (see the reference manual).

The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps as much as possible. Several API functions exist to assist programmers in choosing thread block size based on register and shared memory requirements. The following code sample calculates the occupancy of MyKernel and reports the occupancy level as the ratio between concurrent warps and maximum warps per multiprocessor.
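A minimal sketch of such a calculation, based on cudaOccupancyMaxActiveBlocksPerMultiprocessor (the kernel body and the block size of 32 are placeholders):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void MyKernel(int* d, int* a, int* b)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        d[idx] = a[idx] * b[idx];
    }

    int main()
    {
        int numBlocks;        // occupancy in terms of active blocks
        int blockSize = 32;   // illustrative block size

        int device;
        cudaDeviceProp prop;
        cudaGetDevice(&device);
        cudaGetDeviceProperties(&prop, device);

        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel,
                                                      blockSize, 0);

        // Occupancy = active warps / maximum warps per multiprocessor.
        int activeWarps = numBlocks * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("Occupancy: %f%%\n", (double)activeWarps / maxWarps * 100);
        return 0;
    }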

The following code sample configures an occupancy-based kernel launch of MyKernel according to the user input.
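A minimal sketch of such an occupancy-based launch, built on cudaOccupancyMaxPotentialBlockSize (the kernel and the use of the array size as both workload and block-size limit are placeholders):

    #include <cuda_runtime.h>

    __global__ void MyKernel(int* array, int arrayCount)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx < arrayCount)
            array[idx] *= array[idx];
    }

    void launchMyKernel(int* array, int arrayCount)
    {
        int blockSize;    // block size suggested by the occupancy heuristic
        int minGridSize;  // minimum grid size to achieve maximum occupancy
        int gridSize;     // actual grid size, rounded up to cover the array

        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel,
                                           0, arrayCount);

        gridSize = (arrayCount + blockSize - 1) / blockSize;
        MyKernel<<<gridSize, blockSize>>>(array, arrayCount);
        cudaDeviceSynchronize();
    }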

A spreadsheet version of the occupancy calculator is also provided. The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread).

The first step in maximizing overall memory throughput for the application is to minimize data transfers with low bandwidth. That means minimizing data transfers between the host and the device, as detailed in Data Transfer between Host and Device , since these have much lower bandwidth than data transfers between global memory and the device.

That also means minimizing data transfers between global memory and the device by maximizing use of on-chip memory: shared memory and caches (i.e., the L1 and L2 caches, as well as the texture and constant caches). Shared memory is equivalent to a user-managed cache: the application explicitly allocates and accesses it.

As illustrated in CUDA Runtime, a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block load data from device memory into shared memory, synchronize with the other threads of the block so that each thread can safely read locations populated by different threads, process the data in shared memory, synchronize again if necessary, and write the results back to device memory. For some applications (e.g., those whose global memory access patterns are data-dependent), a traditional hardware-managed cache is more appropriate to exploit data locality. As mentioned in the compute capability sections, on some devices the same on-chip memory is used for both L1 cache and shared memory, and how much of it is dedicated to each is configurable per kernel call. The throughput of memory accesses by a kernel can vary by an order of magnitude depending on the access pattern for each type of memory.
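A hedged sketch of that staging pattern (the tile size of 256 and the neighbor-sum computation are illustrative):

    #define TILE 256

    __global__ void neighborSum(const float* in, float* out)
    {
        __shared__ float tile[TILE];                  // staging area for one block
        int i = blockIdx.x * TILE + threadIdx.x;

        tile[threadIdx.x] = in[i];                    // load from device memory
        __syncthreads();                              // ensure the whole tile is loaded

        // Process in shared memory: each thread reads a value loaded by a neighbor.
        float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0.0f;
        out[i] = tile[threadIdx.x] + left;            // write results back to device memory
    }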

The next step in maximizing memory throughput is therefore to organize memory accesses as optimally as possible based on the optimal memory access patterns described in Device Memory Accesses. This optimization is especially important for global memory accesses as global memory bandwidth is low compared to available on-chip bandwidths and arithmetic instruction throughput, so non-optimal global memory accesses generally have a high impact on performance.

Applications should strive to minimize data transfer between the host and the device. One way to accomplish this is to move more code from the host to the device, even if that means running kernels that do not expose enough parallelism to execute on the device with full efficiency. Intermediate data structures may be created in device memory, operated on by the device, and destroyed without ever being mapped by the host or copied to host memory.

Also, because of the overhead associated with each transfer, batching many small transfers into a single large transfer always performs better than making each transfer separately. On systems with a front-side bus, higher performance for data transfers between host and device is achieved by using page-locked host memory as described in Page-Locked Host Memory.
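A minimal sketch of both points, page-locked allocation and a single consolidated copy (the buffer size is illustrative):

    #include <cuda_runtime.h>

    int main()
    {
        const size_t count = 1 << 20;
        float *hostBuf, *devBuf;

        // Page-locked (pinned) host memory gives higher transfer bandwidth
        // and allows asynchronous copies that overlap with kernel execution.
        cudaMallocHost(&hostBuf, count * sizeof(float));
        cudaMalloc(&devBuf, count * sizeof(float));

        // One large transfer instead of many small ones amortizes the
        // per-transfer overhead.
        cudaMemcpy(devBuf, hostBuf, count * sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(devBuf);
        cudaFreeHost(hostBuf);
        return 0;
    }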

In addition, when using mapped page-locked memory (Mapped Memory), there is no need to allocate any device memory and explicitly copy data between device and host memory. Data transfers are implicitly performed each time the kernel accesses the mapped memory.

For maximum performance, these memory accesses must be coalesced as with accesses to global memory see Device Memory Accesses. Assuming that they are and that the mapped memory is read or written only once, using mapped page-locked memory instead of explicit copies between device and host memory can be a win for performance.
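A minimal sketch of mapped page-locked memory (the kernel and sizes are placeholders); the kernel accesses the host allocation directly through the device pointer, so no explicit copy call appears:

    #include <cuda_runtime.h>

    __global__ void increment(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main()
    {
        const int n = 1024;
        float *hostPtr, *devPtr;

        cudaSetDeviceFlags(cudaDeviceMapHost);                 // enable mapping
        cudaHostAlloc(&hostPtr, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer(&devPtr, hostPtr, 0);         // device view of the same memory

        increment<<<(n + 255) / 256, 256>>>(devPtr, n);        // transfers happen implicitly
        cudaDeviceSynchronize();

        cudaFreeHost(hostPtr);
        return 0;
    }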

On integrated systems where device memory and host memory are physically the same, any copy between host and device memory is superfluous and mapped page-locked memory should be used instead. Applications may query whether a device is integrated by checking that the integrated device property (see Device Enumeration) is equal to 1.

An instruction that accesses addressable memory (i.e., global, local, shared, constant, or texture memory) might need to be re-issued multiple times depending on the distribution of the memory addresses across the threads within the warp. How the distribution affects the instruction throughput this way is specific to each type of memory and described in the following sections. For example, for global memory, as a general rule, the more scattered the addresses are, the more reduced the throughput is.

Global memory resides in device memory, and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.

In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. For example, if a 32-byte memory transaction is generated for each thread's 4-byte access, throughput is divided by 8.

How many transactions are necessary and how much throughput is ultimately affected varies with the compute capability of the device. To maximize global memory throughput, it is therefore important to maximize coalescing by following the optimal access patterns described in the compute capability sections, using data types that meet the size and alignment requirement detailed below, and padding data in some cases (for example, when accessing a two-dimensional array as described below). Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing.

It is therefore recommended to use types that meet this requirement for data that resides in global memory. The alignment requirement is automatically fulfilled for the Built-in Vector Types. Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.
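For user-defined structures the alignment can be requested explicitly; a brief sketch (the structure definitions are illustrative):

    // A 2-float structure aligned to 8 bytes so that a thread's access to one
    // element compiles to a single 8-byte global memory instruction.
    struct __align__(8) MyFloat2 {
        float x;
        float y;
    };

    // A 4-float structure aligned to 16 bytes for single 16-byte accesses.
    struct __align__(16) MyFloat4 {
        float x, y, z, w;
    };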

Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types. A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocation of multiple arrays (with multiple calls to cudaMalloc or cuMemAlloc) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block's starting address.

A common global memory access pattern is when each thread of index (tx, ty) accesses one element of a 2D array of width width located at address BaseAddress + width * ty + tx. For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size. In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly.

The cudaMallocPitch and cuMemAllocPitch functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
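A hedged sketch of a pitched allocation and the corresponding row addressing (the dimensions and the single-thread kernel are illustrative):

    #include <cuda_runtime.h>

    __global__ void scaleRows(float* devPtr, size_t pitch, int width, int height)
    {
        for (int r = 0; r < height; ++r) {
            // The pitch is in bytes, so row addressing goes through a char* cast.
            float* row = (float*)((char*)devPtr + r * pitch);
            for (int c = 0; c < width; ++c)
                row[c] *= 2.0f;
        }
    }

    int main()
    {
        const int width = 64, height = 64;
        float* devPtr;
        size_t pitch;

        // The returned pitch (row stride in bytes) satisfies the device's
        // alignment constraints, so each row starts on a well-aligned address.
        cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
        scaleRows<<<1, 1>>>(devPtr, pitch, width, height);
        cudaDeviceSynchronize();
        cudaFree(devPtr);
        return 0;
    }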

Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are arrays for which it cannot determine that they are indexed with constant quantities, large structures or arrays that would consume too much register space, and any variable if the kernel uses more registers than available (register spilling).

Inspection of the PTX assembly code (obtained by compiling with the -ptx or -keep option) will tell if a variable has been placed in local memory during the first compilation phases, as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. Even if it has not, subsequent compilation phases might still decide otherwise if they find it consumes too much register space for the targeted architecture: inspection of the cubin object using cuobjdump will tell if this is the case.

Note that some mathematical functions have implementation paths that might access local memory. The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Device Memory Accesses.

Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., the same index in an array variable, or the same member in a structure variable). On some devices of compute capability 3.x, local memory accesses are cached in L1 and L2 in the same way as global memory accesses. On devices of compute capability 5.x and higher, local memory accesses are always cached in L2 in the same way as global memory accesses. Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory. To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously.

Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.

However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests.

If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts. To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts. This is described in the compute capability sections.
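A classic illustration of avoiding bank conflicts is padding a shared-memory tile by one column so that column-wise accesses map to different banks; a sketch (the 32x32 tile and the transpose use case are illustrative, and the exact conflict behavior depends on the device):

    #define TILE_DIM 32

    // Launched with 32x32 thread blocks over a square matrix whose width is a
    // multiple of TILE_DIM (illustrative assumptions).
    __global__ void transposeTile(const float* in, float* out, int width)
    {
        // The extra "+ 1" column shifts each row by one bank, so threads
        // reading a column of the tile hit different banks instead of the same one.
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        int tx = blockIdx.y * TILE_DIM + threadIdx.x;
        int ty = blockIdx.x * TILE_DIM + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
    }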

The constant memory space resides in device memory and is cached in the constant cache. A request is split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests. The resulting requests are then serviced at the throughput of the constant cache in the case of a cache hit, or at the throughput of device memory otherwise. The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the texture cache.

The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency.

Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory. In this section, throughputs are given in number of operations per clock cycle per multiprocessor. All throughputs are for one multiprocessor. They must be multiplied by the number of multiprocessors in the device to get throughput for the whole device.

Table 3 gives the throughputs of the arithmetic instructions that are natively supported in hardware for devices of various compute capabilities.



