http://stackoverflow.com/questions/12439807/pinned-memory-in-cuda
Pinned memory is just physical RAM in your system that is set aside and not allowed to be paged out by the OS. Once pinned, that amount of memory becomes unavailable to other processes, effectively shrinking the memory pool available to the rest of the OS.
The maximum amount of memory you can pin is therefore determined by what else is competing for system memory. Which processes are concurrently running in Windows or Linux (and, e.g., whether they themselves are pinning memory) determines how much memory is available for you to pin at that particular time.
+1, welcome to StackOverflow, Michael! To add to your answer: oversubscribing pinned memory can reduce the performance of interactive OSes, since it limits the available physical memory space that can be paged (i.e. it will lead the OS to "thrash" virtual memory more).
And just to re-phrase your point in another way: say you manage to pin 80% of total system memory for a CUDA app. That leaves only 20% for the OS to run EVERYTHING else until the CUDA app releases or unpins the memory. This can easily lead to (and I have done it more than once!) an unusable PC, since it's like trying to run Windows 7 + browsers + email etc. on a few GB of RAM!
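A minimal sketch of that caution in practice: check every pinned allocation for failure and release it as soon as the transfers are done. The 1 GiB size here is an arbitrary assumption for illustration.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void  *buf   = nullptr;
    size_t bytes = 1ull << 30;  // 1 GiB of pinned host RAM (arbitrary example size)

    // Pinned allocations can fail long before pageable ones would:
    // every pinned byte is physical RAM withheld from the OS page pool.
    cudaError_t err = cudaHostAlloc(&buf, bytes, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "pinned alloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... use the buffer for host<->device transfers ...

    cudaFreeHost(buf);  // unpin promptly so the OS gets the physical pages back
    return 0;
}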
https://en.wikipedia.org/wiki/CUDA_Pinned_memory
In the framework of accelerating computational codes by parallel computing on graphics processing units (GPUs), the data to be processed must be transferred from system memory to the graphics card's memory, and the results retrieved from graphics memory into system memory. In a computational code accelerated by general-purpose GPUs (GPGPUs), such transactions can occur many times and may affect overall performance, so the problem arises of carrying out those transfers as fast as possible.
To allow programmers to use a larger virtual address space than is actually available in RAM, CPUs (or hosts, in the language of GPGPU) implement a virtual memory system (non-locked memory) in which a physical memory page can be swapped out to disk. When the host needs that page, it loads it back in from the disk. The drawback for CPU⟷GPU memory transfers is that transactions become slower, i.e., the bandwidth of the PCI-E bus connecting CPU and GPU is not fully exploited. Non-locked memory is not necessarily resident in RAM (e.g., it can be in swap), so the driver needs to access every single page of the non-locked memory, copy it into a pinned buffer, and pass it to the Direct Memory Access (DMA) engine (a synchronous, page-by-page copy). Indeed, PCI-E transfers can occur only via DMA. Accordingly, when a “normal” transfer is issued, the driver must allocate a block of page-locked memory, perform a host copy from regular memory to that page-locked block, carry out the transfer, wait for it to complete, and free the page-locked memory. This consumes precious host time, which is avoided when using page-locked memory directly.[1]
However, with today’s memories, virtual memory is no longer necessary for the many applications that fit within the host memory space. In all those cases it is more convenient to use page-locked (pinned) memory, which enables the DMA engine on the GPU to request transfers to and from host memory without the involvement of the CPU. In other words, locked memory stays in physical memory (RAM), so the GPU (or device, in the language of GPGPU) can fetch it without the help of the host (an asynchronous copy).[2]
GPU memory is automatically allocated as page-locked, since GPU memory does not support swapping to disk.[1][2] To allocate page-locked memory on the host in CUDA, one can use cudaHostAlloc.[3]
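To make the difference concrete, the sketch below times the same host-to-device copy from a pageable buffer and from a pinned one. The 256 MiB transfer size and event-based timing are assumptions for the example; absolute numbers depend on the PCI-E generation (the ~3 GB/s pageable vs. ~6 GB/s pinned figures quoted further below are typical of PCI-E 2.0-era hardware).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device cudaMemcpy with CUDA events, in milliseconds.
static float timed_h2d(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB per transfer (example size)
    void *d_buf, *pinned;
    void *pageable = malloc(bytes);   // ordinary pageable host memory
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&pinned, bytes);   // page-locked host memory

    float ms_pageable = timed_h2d(d_buf, pageable, bytes);
    float ms_pinned   = timed_h2d(d_buf, pinned, bytes);
    printf("pageable: %.2f GB/s\n", bytes / ms_pageable / 1e6);
    printf("pinned:   %.2f GB/s\n", bytes / ms_pinned / 1e6);

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d_buf);
    return 0;
}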
The experiment measures the transfer rates of pageable memory versus pinned memory, i.e., it tests the achievable PCI-E bandwidth.
http://blog.csdn.net/zhangpinghao/article/details/21046435
To reduce the overhead of transfers between host memory and device memory in a CUDA program, you can try the following two approaches.
1: Use page-locked memory. Copies between page-locked memory and device memory run at roughly 6 GB/s, while ordinary pageable memory manages only about 3 GB/s to the GPU.
The CUDA runtime provides functions for using page-locked host memory (also known as pinned memory), as distinct from the regular pageable host memory allocated with malloc():
cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;
cudaHostRegister() page-locks a range of memory previously allocated with malloc() (both paths are sketched after this list).
As mentioned earlier (see the articles above), on some devices, copies between device memory and page-locked host memory can proceed concurrently with kernel execution;
on some devices, page-locked host memory can be mapped into the device address space, eliminating explicit copies between host and device;
on systems with a front-side bus, bandwidth between host memory and device memory is higher if the host memory is page-locked, and higher still when combined with the write-combining allocation described in Section 3.2.4.2.
Page-locked host memory is a scarce resource, however: allocations of page-locked memory will start failing long before allocations of pageable memory. In addition, because it reduces the amount of physical memory available to the system for paging, allocating too much page-locked memory degrades overall system performance.
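Both allocation paths from the list above, plus the mapped (zero-copy) variant, could look roughly like the sketch below; the cudaHostAllocMapped part is an assumption-laden illustration, and on older devices one would first check cudaDeviceProp::canMapHostMemory.

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    // (a) Page-lock an existing malloc() region in place.
    void *heap = malloc(bytes);
    cudaHostRegister(heap, bytes, cudaHostRegisterDefault);
    // ... `heap` now behaves like cudaHostAlloc'd memory for transfers ...
    cudaHostUnregister(heap);
    free(heap);

    // (b) Mapped (zero-copy) pinned memory: the device reads host RAM
    // directly through its own pointer, with no explicit cudaMemcpy.
    void *h_ptr, *d_ptr;
    cudaHostAlloc(&h_ptr, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);
    // ... pass d_ptr to a kernel; each access crosses the PCI-E bus ...
    cudaFreeHost(h_ptr);
    return 0;
}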
http://stackoverflow.com/questions/5736968/why-is-cuda-pinned-memory-so-fast
The CUDA driver checks whether the memory range is locked or not, and then uses a different codepath. Locked memory is stored in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, a.k.a. async copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily resident in RAM (e.g., it can be in swap), so the driver needs to access every page of the non-locked memory, copy it into a pinned buffer, and pass it to the DMA engine (a synchronous, page-by-page copy).
As described here http://forums.nvidia.com/index.php?showtopic=164661
the host memory used by an asynchronous memcpy call needs to be page-locked through cudaMallocHost or cudaHostAlloc.
I can also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc entry says that the CUDA driver can detect pinned memory:
The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().
CUDA uses DMA to transfer pinned memory to the GPU. Pageable host memory cannot be used with DMA because it may reside on disk. If the memory is not pinned (i.e., page-locked), it is first copied to a page-locked "staging" buffer and then copied to the GPU through DMA. So by using pinned memory you save the time of copying from pageable host memory to page-locked host memory.
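Putting these pieces together, a minimal sketch of the asynchronous path: the host buffer comes from cudaMallocHost so the DMA engine can address it directly, and the copies are issued on a stream so they can overlap kernel execution on capable devices. The kernel `process` is a hypothetical placeholder.

#include <cuda_runtime.h>

// Hypothetical kernel, just to give the stream something to run.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int    n     = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_data, *d_data;

    cudaMallocHost((void **)&h_data, bytes);  // page-locked: required for truly async copies
    cudaMalloc((void **)&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With pinned memory these calls return immediately; the DMA engine
    // walks the locked physical pages with no staging copy by the driver.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait for the copy-kernel-copy pipeline

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}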
http://blog.csdn.net/ziv555/article/details/52116877
Kernel Virtual Machine (KVM)
The ability to run multiple virtual machines (VMs) on a single server hardware platform delivers cost, system-management, and flexibility advantages in today's IT infrastructure. Hosting multiple VMs on a single hardware platform reduces hardware expenditures and helps minimize infrastructure costs such as power and cooling. Consolidating systems with different operating profiles as VMs on one hardware platform simplifies their administration through management layers such as the open-source virtualization library (libvirt) and the tools built on top of it, such as the graphical Virtual Machine Manager (VMM). Virtualization also provides the operational flexibility that today's service-oriented, high-availability IT operations require, supporting the migration of running VMs from one physical host to another to address hardware or facility issues, to maximize performance through load balancing, or to keep up with growing processor and memory demands.
Xen virtualization environments (see Resources for links) have traditionally delivered the highest-performing open-source virtualization technology on Linux systems. Xen uses a hypervisor to manage virtual machines and their associated resources, and also supports paravirtualization, which provides higher performance in VMs that are "aware" they are virtualized. Xen supplies a dedicated open-source hypervisor that handles resource and virtualization management and scheduling. When a system boots on bare-metal physical hardware, the Xen hypervisor starts a primary VM known as Domain0, or the management domain, which provides central VM management for all the other VMs running on that physical host (known as Domain1 through DomainN, or simply as Xen guests).
Unlike Xen, KVM virtualization uses the Linux kernel itself as its hypervisor. Support for KVM virtualization has been a default part of the mainline Linux kernel since version 2.6.20. Using the Linux kernel as a hypervisor has been a primary point of criticism of KVM, because (by default) the Linux kernel does not meet the traditional definition of a Type 1 hypervisor: "a small operating system." Although that is true of the default kernels shipped by most Linux distributions, the Linux kernel can easily be configured to reduce its compiled size so that it provides only the functionality and drivers needed to run as a Type 1 hypervisor. Red Hat's own Enterprise Virtualization product relies on exactly such a specially configured, relatively lightweight Linux kernel. More importantly, "small" is a relative term, and today's 64-bit servers with many gigabytes of memory can easily spare the few megabytes that a modern Linux kernel requires.
KVM has overtaken Xen as the preferred open-source bare-metal virtualization technology in most enterprise environments for several reasons:
KVM support has been included automatically in every Linux kernel since version 2.6.20. Before Linux kernel version 3.0, integrating Xen support into the Linux kernel required applying a large set of patches, and even then there was no guarantee that every driver for every possible hardware device would work correctly in a Xen environment.
The kernel source patches required for Xen support were available only for specific kernel versions, which prevented Xen virtualization environments from taking advantage of new drivers, subsystems, and kernel fixes and enhancements that were available only in other kernel versions. KVM's integration into the Linux kernel lets it automatically benefit from any improvements in new kernel releases.
Xen requires the physical VM server to run a specially configured Linux kernel that serves as the management domain for all the VMs running on that server. KVM can use the same kernel on the physical server as the Linux VMs running on that system use.
Xen's hypervisor is a separate body of source code with its own potential defects, independent of any defects in the operating systems it hosts. Because KVM is an integrated part of the Linux kernel, only kernel defects can affect its use as the KVM hypervisor.