http://stackoverflow.com/questions/12439807/pinned-memory-in-cuda
Pinned memory is just physical RAM in your system that is set aside and not allowed to be paged out by the OS. Once pinned, that amount of memory becomes unavailable to other processes, effectively shrinking the memory pool available to the rest of the OS.
The maximum amount of memory you can pin is therefore determined by what else is competing for system memory. Which processes are concurrently running in Windows or Linux (and, e.g., whether they themselves are pinning memory) determines how much memory is available for you to pin at that particular time.
+1, welcome to StackOverflow, Michael! To add to your answer: oversubscribing pinned memory can reduce the performance of interactive OSes, since it limits the available physical memory space that can be paged (i.e. it will lead the OS to "thrash" virtual memory more).
And just to re-phrase your point in another way: say you manage to pin 80% of total system memory for a CUDA app. That leaves only 20% for the OS to run EVERYTHING else until the CUDA app releases or unpins the memory. This can easily lead to (and I have done it more than once!) an unusable PC, since it's like trying to run Windows 7 + browsers + email etc. on a few GB of RAM!
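A minimal sketch of that caution in practice: check every pinned allocation for failure and release it as soon as the transfers are done. The 1 GiB size here is an arbitrary assumption for illustration.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void  *buf   = nullptr;
    size_t bytes = 1ull << 30;  // 1 GiB of pinned host RAM (arbitrary example size)

    // Pinned allocations can fail long before pageable ones would:
    // every pinned byte is physical RAM withheld from the OS page pool.
    cudaError_t err = cudaHostAlloc(&buf, bytes, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "pinned alloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... use the buffer for host<->device transfers ...

    cudaFreeHost(buf);  // unpin promptly so the OS gets the physical pages back
    return 0;
}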
https://en.wikipedia.org/wiki/CUDA_Pinned_memory
In the framework of accelerating computational codes by parallel computing on graphics processing units (GPUs), the data to be processed must be transferred from system memory to the graphics card's memory, and the results retrieved from graphics memory into system memory. In a computational code accelerated by general-purpose GPUs (GPGPUs), such transactions can occur many times and may affect overall performance, so the problem arises of carrying out those transfers as fast as possible.
To allow programmers to use a larger virtual address space than is actually available in RAM, CPUs (or hosts, in the language of GPGPU) implement a virtual memory system (non-locked memory) in which a physical memory page can be swapped out to disk. When the host needs that page, it loads it back in from the disk. The drawback for CPU⟷GPU memory transfers is that transactions become slower, i.e., the bandwidth of the PCI-E bus connecting CPU and GPU is not fully exploited. Non-locked memory is not necessarily resident in RAM (e.g., it can be in swap), so the driver needs to access every single page of the non-locked memory, copy it into a pinned buffer, and pass it to the Direct Memory Access (DMA) engine (a synchronous, page-by-page copy). Indeed, PCI-E transfers can occur only via DMA. Accordingly, when a “normal” transfer is issued, the driver must allocate a block of page-locked memory, perform a host copy from regular memory to that page-locked block, carry out the transfer, wait for it to complete, and free the page-locked memory. This consumes precious host time, which is avoided when using page-locked memory directly.[1]
However, with today’s memories, virtual memory is no longer necessary for the many applications that fit within the host memory space. In all those cases it is more convenient to use page-locked (pinned) memory, which enables the DMA engine on the GPU to request transfers to and from host memory without the involvement of the CPU. In other words, locked memory stays in physical memory (RAM), so the GPU (or device, in the language of GPGPU) can fetch it without the help of the host (an asynchronous copy).[2]
GPU memory is automatically allocated as page-locked, since GPU memory does not support swapping to disk.[1][2] To allocate page-locked memory on the host in CUDA, one can use cudaHostAlloc.[3]
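To make the difference concrete, the sketch below times the same host-to-device copy from a pageable buffer and from a pinned one. The 256 MiB transfer size and event-based timing are assumptions for the example; absolute numbers depend on the PCI-E generation (the ~3 GB/s pageable vs. ~6 GB/s pinned figures quoted further below are typical of PCI-E 2.0-era hardware).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device cudaMemcpy with CUDA events, in milliseconds.
static float timed_h2d(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB per transfer (example size)
    void *d_buf, *pinned;
    void *pageable = malloc(bytes);   // ordinary pageable host memory
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&pinned, bytes);   // page-locked host memory

    float ms_pageable = timed_h2d(d_buf, pageable, bytes);
    float ms_pinned   = timed_h2d(d_buf, pinned, bytes);
    printf("pageable: %.2f GB/s\n", bytes / ms_pageable / 1e6);
    printf("pinned:   %.2f GB/s\n", bytes / ms_pinned / 1e6);

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d_buf);
    return 0;
}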
The experiment measures the transfer rates of pageable memory versus pinned memory, i.e., it tests the achievable PCI-E bandwidth.
http://blog.csdn.net/zhangpinghao/article/details/21046435
To reduce the overhead of transfers between host memory and device memory in a CUDA program, you can try the following two approaches.
1: Use page-locked memory. Copies between page-locked memory and device memory run at roughly 6 GB/s, while ordinary pageable memory manages only about 3 GB/s to the GPU.
The CUDA runtime provides functions for using page-locked host memory (also known as pinned memory), as distinct from the regular pageable host memory allocated with malloc():
cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;
cudaHostRegister() page-locks a range of memory previously allocated with malloc() (both paths are sketched after this list).
As mentioned earlier (see the articles above), on some devices, copies between device memory and page-locked host memory can proceed concurrently with kernel execution;
on some devices, page-locked host memory can be mapped into the device address space, eliminating explicit copies between host and device;
on systems with a front-side bus, bandwidth between host memory and device memory is higher if the host memory is page-locked, and higher still when combined with the write-combining allocation described in Section 3.2.4.2.
Page-locked host memory is a scarce resource, however: allocations of page-locked memory will start failing long before allocations of pageable memory. In addition, because it reduces the amount of physical memory available to the system for paging, allocating too much page-locked memory degrades overall system performance.
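Both allocation paths from the list above, plus the mapped (zero-copy) variant, could look roughly like the sketch below; the cudaHostAllocMapped part is an assumption-laden illustration, and on older devices one would first check cudaDeviceProp::canMapHostMemory.

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    // (a) Page-lock an existing malloc() region in place.
    void *heap = malloc(bytes);
    cudaHostRegister(heap, bytes, cudaHostRegisterDefault);
    // ... `heap` now behaves like cudaHostAlloc'd memory for transfers ...
    cudaHostUnregister(heap);
    free(heap);

    // (b) Mapped (zero-copy) pinned memory: the device reads host RAM
    // directly through its own pointer, with no explicit cudaMemcpy.
    void *h_ptr, *d_ptr;
    cudaHostAlloc(&h_ptr, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);
    // ... pass d_ptr to a kernel; each access crosses the PCI-E bus ...
    cudaFreeHost(h_ptr);
    return 0;
}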
http://stackoverflow.com/questions/5736968/why-is-cuda-pinned-memory-so-fast
The CUDA driver checks whether the memory range is locked or not, and then uses a different codepath. Locked memory is stored in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, a.k.a. async copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily resident in RAM (e.g., it can be in swap), so the driver needs to access every page of the non-locked memory, copy it into a pinned buffer, and pass it to the DMA engine (a synchronous, page-by-page copy).
As described here http://forums.nvidia.com/index.php?showtopic=164661
the host memory used by an asynchronous memcpy call needs to be page-locked through cudaMallocHost or cudaHostAlloc.
I can also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc entry says that the CUDA driver can detect pinned memory:
The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().
CUDA uses DMA to transfer pinned memory to the GPU. Pageable host memory cannot be used with DMA because it may reside on disk. If the memory is not pinned (i.e., page-locked), it is first copied to a page-locked "staging" buffer and then copied to the GPU through DMA. So by using pinned memory you save the time of copying from pageable host memory to page-locked host memory.
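Putting these pieces together, a minimal sketch of the asynchronous path: the host buffer comes from cudaMallocHost so the DMA engine can address it directly, and the copies are issued on a stream so they can overlap kernel execution on capable devices. The kernel `process` is a hypothetical placeholder.

#include <cuda_runtime.h>

// Hypothetical kernel, just to give the stream something to run.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int    n     = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_data, *d_data;

    cudaMallocHost((void **)&h_data, bytes);  // page-locked: required for truly async copies
    cudaMalloc((void **)&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With pinned memory these calls return immediately; the DMA engine
    // walks the locked physical pages with no staging copy by the driver.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait for the copy-kernel-copy pipeline

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}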
http://blog.csdn.net/ziv555/article/details/52116877
Kernel Virtual Machine (KVM)
The ability to run multiple virtual machines (VMs) on a single server hardware platform delivers cost, system-management, and flexibility advantages in today's IT infrastructure. Hosting multiple VMs on a single hardware platform reduces hardware expenditures and helps minimize infrastructure costs such as power and cooling. Consolidating systems with different operating profiles as VMs on one hardware platform simplifies their administration through management layers such as the open-source virtualization library (libvirt) and the tools built on top of it, such as the graphical Virtual Machine Manager (VMM). Virtualization also provides the operational flexibility that today's service-oriented, high-availability IT operations require, supporting the migration of running VMs from one physical host to another to address hardware or facility issues, to maximize performance through load balancing, or to keep up with growing processor and memory demands.
Xen virtualization environments (see Resources for links) have traditionally delivered the highest-performing open-source virtualization technology on Linux systems. Xen uses a hypervisor to manage virtual machines and their associated resources, and also supports paravirtualization, which provides higher performance in VMs that are "aware" they are virtualized. Xen supplies a dedicated open-source hypervisor that handles resource and virtualization management and scheduling. When a system boots on bare-metal physical hardware, the Xen hypervisor starts a primary VM known as Domain0, or the management domain, which provides central VM management for all the other VMs running on that physical host (known as Domain1 through DomainN, or simply as Xen guests).
Unlike Xen, KVM virtualization uses the Linux kernel itself as its hypervisor. Support for KVM virtualization has been a default part of the mainline Linux kernel since version 2.6.20. Using the Linux kernel as a hypervisor has been a primary point of criticism of KVM, because (by default) the Linux kernel does not meet the traditional definition of a Type 1 hypervisor: "a small operating system." Although that is true of the default kernels shipped by most Linux distributions, the Linux kernel can easily be configured to reduce its compiled size so that it provides only the functionality and drivers needed to run as a Type 1 hypervisor. Red Hat's own Enterprise Virtualization product relies on exactly such a specially configured, relatively lightweight Linux kernel. More importantly, "small" is a relative term, and today's 64-bit servers with many gigabytes of memory can easily spare the few megabytes that a modern Linux kernel requires.
KVM has overtaken Xen as the preferred open-source bare-metal virtualization technology in most enterprise environments for several reasons:
KVM support has been included automatically in every Linux kernel since version 2.6.20. Before Linux kernel version 3.0, integrating Xen support into the Linux kernel required applying a large set of patches, and even then there was no guarantee that every driver for every possible hardware device would work correctly in a Xen environment.
The kernel source patches required for Xen support were available only for specific kernel versions, which prevented Xen virtualization environments from taking advantage of new drivers, subsystems, and kernel fixes and enhancements that were available only in other kernel versions. KVM's integration into the Linux kernel lets it automatically benefit from any improvements in new kernel releases.
Xen requires the physical VM server to run a specially configured Linux kernel that serves as the management domain for all the VMs running on that server. KVM can use the same kernel on the physical server as the Linux VMs running on that system use.
Xen's hypervisor is a separate body of source code with its own potential defects, independent of any defects in the operating systems it hosts. Because KVM is an integrated part of the Linux kernel, only kernel defects can affect its use as the KVM hypervisor.