  • Cornell Virtual Workshop
    …technologies and broaden the participation of underrepresented groups in science and engineering. Over 100,000 unique visitors have accessed Cornell Virtual Workshop training on programming languages, parallel computing, code improvement, and data analysis. The platform supports national cyberinfrastructure and their learning communities, including XSEDE, Stampede, and Jetstream. To Get Started: Select the Topics tab. Topics are listed in a suggested order. You are welcome to skip to whatever topic you…

    Original URL path: https://cvw.cac.cornell.edu/default (2015-12-27)


  • Home Page
    …the TACC Portal: sign in, then use the Get Started button. Medium Access (Registration): Login through the buttons on the top right of this page. If you do not have access through the XSEDE or TACC portal, you can register and login in this section. Registrants have access to all Virtual Workshop features but will not receive consulting support. Limited Access (Guests): Guest visitors are welcome to view the Virtual…

    Original URL path: https://cvw.cac.cornell.edu/Registration/ (2015-12-27)

  • Cornell Virtual Workshop
    …is the first standard and portable message-passing library that offers good performance.
    MPI Point-to-Point Communication: This module details and differentiates the various types of point-to-point communication available in MPI. Point-to-point communication involves transmission of a message between a pair of processes, as opposed to collective communication, which involves a group of processes. (A brief sketch follows this entry.)
    MPI Collective Communications: The purpose of collective communication is to manipulate a shared piece or set of information. In this module we introduce these routines in three categories: synchronization, data movement, and global computation.
    MPI One-Sided Communication: One-sided communication provides natural access to the Remote Memory Access (RMA) functionality that is provided by low-latency interconnect fabrics such as InfiniBand. In this module we will introduce the various components of MPI RMA and how to use them.
    MPI Advanced Topics: This module will introduce you to some of the advanced capabilities of MPI beyond ordinary message passing, including how to customize your environment in the following areas: derived datatypes, groups of processes and their associated communicators, virtual topologies among processes, and parallel I/O using MPI-IO. Application to specific architectures such as Stampede will be discussed.
    Parallel I/O: This module presents basic concepts and techniques that will allow your application to take advantage of parallel I/O to increase throughput and improve scalability. Emphasis is placed on the Lustre parallel file system and on MPI-IO as a fundamental API.
    OpenMP: In the shared-memory, heterogeneous environment that Stampede has on each node, it is much easier to introduce parallelism into your code with OpenMP than to do pthread programming from scratch or to use MPI. This module introduces OpenMP and describes how to use it.
    Hybrid Programming with OpenMP and MPI: In hybrid programming, the goal is to combine techniques from OpenMP and MPI to create a high-performance parallel code that is better tailored for the non-uniform and heterogeneous memory access characteristics of Stampede. To meet this goal, it is necessary to understand the effects of processor affinity and memory allocation policy, and to exert some control over them. (A second sketch follows this entry.)
    MIC: The Xeon Phi coprocessor is a system on a PCIe card designed to provide high levels of floating-point performance for highly parallel HPC code. Its architecture is known as Many Integrated Core (MIC). This module describes the MIC architecture behind the Xeon Phi, its performance characteristics, and how and when to run code on the coprocessors available within Stampede in order to best take advantage of the resources available.
    How to Exploit MIC: This module focuses on a question of high interest to many HPC users: what changes might I need to make to my program, or even to my algorithm, so my application can make good use of accelerators or coprocessors such as the Intel Xeon Phi on Stampede? To answer this, we consider just the main characteristics of Intel's Many Integrated Core architecture, along with their implications for how a MIC-enabled…

    Original URL path: https://cvw.cac.cornell.edu/topics (2015-12-27)
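    The point-to-point and collective modules described above are easy to contrast in a few lines of C. The sketch below is not taken from the workshop; it is a minimal illustration in which rank 0 sends one integer to rank 1 with MPI_Send/MPI_Recv (a pair of processes), and then every rank contributes to a sum with the collective MPI_Reduce (the whole group). The buffer names and the tag value are arbitrary choices.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Point-to-point: a message travels between exactly two processes. */
            int token = 0;
            if (rank == 0 && size > 1) {
                token = 42;
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }

            /* Collective: every process in the communicator participates. */
            int partial = rank + 1, total = 0;
            MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0)
                printf("sum over %d ranks = %d\n", size, total);

            MPI_Finalize();
            return 0;
        }

    Built with an MPI wrapper compiler (e.g., mpicc) and launched with mpiexec -n 4, rank 0 would report the sum 1 + 2 + 3 + 4 = 10.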
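    The hybrid programming module combines the two models: MPI between tasks, OpenMP threads within a task. A minimal sketch of that structure follows; it assumes the usual arrangement of one MPI rank per node with a thread team filling the node's cores, and it only reports the rank and thread counts. The affinity and memory-policy controls that the module emphasizes are left out.

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, nranks;
            MPI_Init(&argc, &argv);                 /* distributed memory: typically one rank per node */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nranks);

            /* Shared memory: an OpenMP thread team uses the cores within the node. */
            #pragma omp parallel
            {
                #pragma omp single
                printf("rank %d of %d is running %d threads\n",
                       rank, nranks, omp_get_num_threads());
            }

            MPI_Finalize();
            return 0;
        }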

  • Cornell Virtual Workshop
    …chapters such as Compositional C++ and Fortran M are rarer these days, but the concepts have not changed much since the book was written.
    Quinn, Michael J. Parallel Programming in C with MPI and OpenMP. McGraw-Hill, 2004. Quinn's description of the task/channel model of parallel algorithm design might inspire you to more creative ways to parallelize your algorithm.
    MPI
    MPICH: http://www.mpich.org
    Open MPI: http://www.open-mpi.org
    W. Gropp, E. Lusk, and A. Skjellum. Using MPI. MIT Press, 1999. This original book on MPI covers the interface nicely; for more on how to program, try Quinn's book above.
    Message Passing Interface Forum, Nov. 15, 2003. MPI-2: Extensions to the Message-Passing Interface.
    Rolf Rabenseifner. MPI 3.0.
    OpenMP
    OpenMP: http://www.openmp.org; http://en.wikipedia.org/wiki/OpenMP
    Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Academic Press, 2001.
    Hybrid Programming with OpenMP and MPI
    Yun (Helen) He and Chris Ding, Lawrence Berkeley National Laboratory, June 24, 2004. Hybrid OpenMP and MPI Programming and Tuning (NUG2004).
    Texas Advanced Computing Center. Stampede User Guide: Hybrid Model.
    Message Passing Interface Forum. MPI-2: MPI and Threads (specific section of the MPI-2 report).
    Intel Corp. Thread Affinity Interface (Linux and Windows), from the Intel Fortran Compiler User and Reference Guides.
    Optimization and Scalability
    Pete Beckman, Kamil Iskra, Kazutomo Yoshii, Susan Coghlan, and Aroon Nataraj. Benchmarking the effects of operating system interference on extreme-scale parallel machines. Cluster Computing 11, March 2008.
    Lei Chai, Qi Gao, and D. K. Panda. Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System. CCGrid '07.
    Larry P. Davis, Cray J. Henry, Roy L. Campbell Jr., and William A. Ward Jr. High-Performance Computing Acquisitions Based on the Factors that Matter. Computing in Science and Engineering, November/December 2007. On p. 35 they present a simple model equation for scalability as a function of the number of processors.
    J. Hennessy and D. Patterson. Computer Architecture. San Francisco, CA: Morgan Kaufmann, 2002. This is the classic tome on computer architecture. While it is up to date, you will need to look elsewhere for the specifics of the latest chips and characterization of clusters.
    David J. Lilja. Measuring Computer Performance: A Practitioner's Guide. Cambridge University Press, 2000.
    Amith R. Mamidala et al. On Using Connection-Oriented vs. Connection-Less Transport for Performance and Scalability of Collective and One-sided Operations: Trade-offs and Impact. PPoPP '07.
    Partitioned Global Address Space (PGAS) Languages
    These languages incorporate parallelism into the language, enabling the compiler to optimize communication patterns. While MPI is more common, several papers on petascale computing anticipate that PGAS languages will grow in importance.
    Co-Array Fortran: http://www.co-array.org
    Unified Parallel C: at Berkeley (http://upc.lbl.gov) and George Washington University (http://upc.gwu.edu)
    X10: http://x10-lang.org
    Fortress: http://projectfortress.java.net
    Chapel…

    Original URL path: https://cvw.cac.cornell.edu/main/reference (2015-12-27)

  • Cornell Virtual Workshop
    …software components.
    FTP (File Transfer Protocol): A client-server protocol for transferring files over TCP, with a variety of clients. Uses two channels (ports), one for control and one for data.
    functional parallelism: A programming model in which a program is broken down by functions, and parallelism is achieved by assigning each function to a different worker task. Message-passing libraries such as MPI are commonly used for communication among the processors. An alternative to functional parallelism is data parallelism.
    functional programming: A programming paradigm in which functions are treated as mathematical functions, in that they do not have side effects (or have as few side effects as is reasonable) and do not change an internal state after being called. Additionally, functions can be treated as variables that may be operated on by other functions and, like variables, also have a well-defined type, largely based on the arguments and outputs of the function.
    functional unit: Any element in the CPU that performs operations and calculations. It may have its own control unit, registers, and computational units.
    fused multiply-add (FMA): A floating-point combined multiply-add operation performed by a single instruction, with a single rounding at the end. Typically one such operation is performed each cycle through pipelining.
    game design: The creative process for developing a computer game, including facets such as rules, gameplay, and storyline.
    GASNet: A low-level networking layer for PGAS languages, providing network-independent, high-performance communication primitives.
    gateway: A single, unified point of entry into a community-developed suite of networked data, tools, and applications. The components are usually integrated into a customized graphical user interface, such as a Web portal or a suite of applications, which has been tailored to meet the needs of the specific community (e.g., a Science Gateway).
    Gaussian elimination: An algorithm for solving a system of linear equations. It is essentially equivalent to LU decomposition (LU factorization) of the matrix of coefficients.
    Gaussian Mixture Model (GMM): A probabilistic model that assumes all the data points are generated from a weighted, finite sum of Gaussian distributions with unknown parameters (means and variances).
    ghost cells: In domain decomposition, a layer of extra computational cells that surrounds each computational subdomain, usually only 1 or 2 cells thick. The ghost cells are meant to hold border values that are computed in neighboring subdomains and received via message passing. Also called shadow cells.
    global address space: In parallel programming, a way of mapping memory across machines that hides the distinction between shared memory and distributed memory. If remote data are required by a thread, the data are fetched or prefetched from distributed memory via mechanisms invisible to the thread.
    global memory: Main memory that is directly addressable by all processors or CPUs. It can be either shared memory or distributed shared memory (Partitioned Global Address Space, PGAS).
    GPGPU: General-Purpose computing on Graphics Processing Units. The utilization of a GPU for performing numerically intensive computations that would ordinarily be done on one or more central processing units (CPUs).
    GPU: The Graphics Processing Unit of a computer. It is designed to assist the CPU by taking a stream of graphics input and quickly converting it into the series of images that one sees on a computer display. More recently, GPUs have also become accelerators for general floating-point computations, including HPC applications.
    granularity: The relative number and size of independent tasks that could be distributed to multiple CPUs. Fine granularity is illustrated by execution of sets of inner loop iterations as separate tasks; coarse granularity might involve one or more subroutines as separate tasks. The finer the granularity, the more overhead is required to keep track of the tasks. Granularity can also describe the relative duration of tasks. Granularity in number, size, and time profoundly affects the overhead and speedup of a parallel program.
    graph algorithm: An algorithm that takes one or more graphs as inputs and solves a problem in graph theory. Graphs are mathematical structures that model pairwise relations between objects.
    graph partitioning: The problem of subdividing a graph G into smaller components based on specific criteria. For example, a uniform graph partition problem seeks to form components of G that are close to the same size and have few connections between the components.
    graph traversal: Visiting all the nodes of a given graph in order to update or check data values that are associated with the nodes of the graph.
    graphical model: A representation of the conditional dependencies between random variables, used in artificial intelligence (AI) programming.
    Gray code addressing: Constructing addresses for the nodes in a hypercube network topology by using a binary numerical system (Gray code) in which two successive integer values are represented with only one bit difference. For a d-dimensional hypercube, each node is connected to d other nodes whose addresses are 1 bit different.
    greedy algorithm: A strategy where one makes a locally optimal choice at each stage, with the hope of finding the global optimum. While this strategy is not guaranteed to produce the global optimum, it may approximate an optimal solution in a reasonable time.
    grid (mesh): A collection of sample points and their logical links to neighboring points, used to construct a finite representation of a continuum object. A structured grid or mesh has regularity in the selection of sample points; otherwise the grid is said to be unstructured.
    grid computing: The use of aggregated, heterogeneous computer resources from multiple locations to achieve a common goal. The typical workload is non-interactive or very loosely coupled and file-based. A grid can either be devoted to one particular application or used more flexibly for a variety of purposes.
    GUI (Graphical User Interface): A user interface built on graphical elements rather than text, although it may include text. Examples include the Gnome Shell and the default user interfaces of Microsoft Windows and Apple's OSX.
    Gustafson's Law: A reformulation of Amdahl's Law in which the per-process parallel workload becomes the fundamental quantity, rather than the single-process sequential workload. Since the former can grow with the number of PEs, whereas the latter typically does not, scalability becomes a more reachable goal. The law is stated as S = P - a(P - 1), where S is the scaled speedup, P is the number of PEs, and a is the sequential fraction of the parallel workload.
    Hadoop: An open-source software framework that couples the MapReduce algorithm to a distributed file system. It enables data-intensive distributed applications to run on large clusters of commodity hardware.
    hardware latency: In a computer network, the message delay imposed by the physical hardware alone; also called wire time. In present-day systems, the hardware latency is usually negligible compared to the software latency for performing message passing between sending and receiving nodes.
    hash: The value returned by a hash function, an algorithm that creates a small key of fixed length from the information contained in a larger data set.
    heap: An area of main memory used for dynamic memory allocation. Memory requests, like malloc in C or allocate in Fortran, are fulfilled by granting them unclaimed or free portions of heap memory.
    heterogeneous computer: Computer that utilizes multiple types of processors in the system, for example, both CPUs and GPUs. When programming such a system, one strives to take advantage of the strengths of the different processor types, e.g., the low communication latency of the CPUs and the high throughput of the GPUs.
    Hidden Markov Model (HMM): A type of statistical model in which the system being modeled is assumed to be a Markov (history-independent) process whose states are hidden (not observable). It differs from ordinary Markov models in that just some state-dependent output is visible, not the states themselves.
    home: Directory you are in when you first log in.
    hop: A segment of the communication path between source and destination. Data packets often have to travel through multiple routers or switches before reaching their final destination. A hop occurs each time a packet is passed along to the next router or switch.
    hot/cold splitting: Taking a data structure, such as a C struct, and splitting it in two to improve cache behavior. Generally the first struct contains just the frequently used or "hot" fields, plus a pointer to the second struct, which stores the lesser-used or "cold" fields. Similarly, the cold lines in a subprogram can be split into a separate routine, making these lines less likely to clutter up the instruction cache. (Illustrated in the first sketch following this glossary excerpt.)
    HPC: High Performance Computing. HPC refers to the use of a computer capable of concurrent computation to obtain results in the shortest possible time. Users generally turn to HPC when facing extraordinary requirements, e.g., a need for excessive amounts of memory or much shorter wait times than are available on typical computers. HPC therefore involves paying attention to many factors that influence performance, including the speed of computational units, the size of the available memory, and communication capabilities among all components of the computer.
    HPCC: HPC Challenge benchmark, a suite of tests that includes HPL, STREAM, and DGEMM, plus four others.
    HPL: High Performance Linpack, the version of the Linpack benchmarks that is used to rank the HPC systems in the Top 500 list twice per year.
    hybrid programming: A style of programming which combines distributed-memory and shared-memory techniques. In this style, the number of tasks per node is typically smaller than the number of cores per node. To exploit all the available cores, message passing (e.g., MPI) is used for communication between tasks, and multithreading (e.g., OpenMP) is used within each task.
    hyperthreading: Intel's term for simultaneous multithreading (SMT), in which the number of apparent cores exceeds the number of physical cores. This enhances thread-level parallelism (TLP) by better exploiting the multiple functional units that are available to each physical core, resulting in faster parallel execution of computational threads.
    hypervisor: Computer software (or, more rarely nowadays, firmware or hardware) that can create and run virtual machines (VMs).
    I/O: Input/Output, often referring to moving data in and out of the computer's CPU and memory.
    implicit parallelism: A characteristic of a programming language or API in which parallelism is already expressed through the language's built-in constructs, so the compiler or interpreter can automatically exploit the parallelism in the code. Such a language does not require special directives or functions to enable parallel execution.
    inference engine: A tool from artificial intelligence that applies logical if-then rules to a preexisting knowledge base and deduces new knowledge.
    InfiniBand: A high-speed communication link used in HPC. It features high throughput and low latency.
    instruction-level parallelism: The potential for a computer program to execute its low-level operations in parallel. ILP is enabled when a microprocessor has multiple functional units that can operate concurrently. It is assisted through software (the compiler).
    interoperability: The capability of a software application to make use of different kinds of computers, OS's, software components, and interconnecting networks.
    interpreted language: A programming language in which programs are indirectly executed by an interpreter program. This can be contrasted with a compiled language, which is converted into machine code and then directly executed by the host CPU. An interpreted language does not require compilation and is typically run line by line.
    intrinsic function (Fortran): Functions which are built into the Fortran language specification and are expected to be recognized by the compiler.
    iterative method: A computational procedure that generates a sequence of approximate solutions to a problem, until some error tolerance, convergence criterion, or other exit condition is met (GMRES is an iterative method for solving a system of linear equations, e.g.). In contrast, a direct method solves the problem through a prescribed, fixed set of operations (e.g., Gaussian elimination). Parallelization may be possible in the construction of each approximate solution or single direct solution.
    kernel: A key element of operating systems, or a key section within an application program. In operating systems, the kernel is the computer program that sits between and controls the interaction of the software and the computer's resources. The kernel is responsible for scheduling processes on the available hardware, for example. In an application program, a kernel is a compact section of code that expresses some well-defined, repetitive task, making it a good target for performance optimization, or perhaps for GPGPU recoding.
    kernel parallelism: A paradigm of computer programming in which an entire array of similarly structured data objects (the stream) is acted on concurrently by a specially defined function (the kernel). This is the central paradigm behind NVIDIA CUDA and OpenCL. It is also called stream processing.
    keyword: A word or identifier that has a particular meaning to the programming language; common examples are control flow words: for, while, if, else.
    LAPACK: Linear Algebra PACKage, a software library for numerical linear algebra. It has routines for the solution of linear systems, linear least squares, eigenvalue problems, singular value decomposition (SVD), and various matrix factorizations.
    Larrabee: A prototype for Intel's many-core MIC architecture and their Xeon Phi product line.
    latency: The overhead time to initiate a data transfer. In the case of message passing, it is the time to send a zero-length message from one node or core of a parallel computer to another. Latency is therefore the part of the time to complete a data transfer that is independent of the size of the message.
    layered systems pattern: In software architecture, a structural pattern in which the overall system is configured into layers, such that each layer only interacts in a limited way with adjacent layers, using a prescribed protocol such as function calls (Shaw and Garlan, 1996). The implementation of each layer then becomes much simpler and cleaner, and each layer may be tested independently and becomes much more portable.
    lexically scoped synchronization: Synchronization or coordination of multiple threads conferred through a language-specific property of the program text itself, e.g., synchronized methods in Java.
    linear programming: A special case of mathematical optimization in which the function to be optimized (for minimum cost or maximum benefit) is linear in the model variables.
    linear speedup: The ideal or perfect parallel speedup that results when a code runs N times faster on N CPUs or cores. If the speedup S(N) is plotted against N, where S(N) = (sequential time)/(N-way parallel time), the graph is a straight line with slope 1.
    linked list: A data structure in which a sequence is represented by a group of nodes. Each node consists of a datum plus a pointer or link to the next node in the sequence. This data structure allows for the insertion or removal of nodes at any point in the list without disturbing the rest of the list.
    literal: A literal is a constant value of a type, unlike a variable, which can change. Examples are 42, or "answer", a string literal.
    livelock: A situation in which parallel tasks are too busy responding to each other to resume working. They are not blocked as in deadlock; their mutual actions simply do not lead to progress. The phenomenon is analogous to two people in a hallway who can't pass each other because both first dodge one way, then both dodge the other way, etc.
    load balance: A goal for algorithms running on parallel computers, which is achieved if all the workers perform approximately equal amounts of work, so that no worker is idle for a significant amount of time.
    lock: A way to enforce mutual exclusion (mutex) between threads that require synchronized access to the same data, location, or device. A code section that uses locks for this purpose is called a critical section. In OpenMP, the critical clause implicitly creates such locks around the specified code section. (Illustrated in the second sketch following this glossary excerpt.)
    logic gate: In electronics, a circuit element that implements a Boolean function (AND, OR, etc.). Logic gates are primarily built with diodes or transistors acting as electronic switches.
    LogP machine: Model for communication within a parallel computer, based on 4 parameters: Latency, Overhead, communication Gap, and number of Processors.
    loop unrolling: A loop transformation technique that attempts to optimize a program's execution speed at the expense of its size. The transformation can be undertaken manually by the programmer or by an optimizing compiler. In this technique, loops are re-written as a repeated sequence of similar independent statements. One of the advantages gained is that the unrolled statements can potentially be executed in parallel.
    machine learning: The implementation and study of computer systems that can learn from the data that they collect, e.g., recognize patterns or categories in the data, after undergoing a training period. It is a type of artificial intelligence.
    many-core processor: A type of multi-core processor featuring tens or hundreds of cores, typically having smaller cache sizes and reduced capabilities compared to multi-core, but wider vector units. The design goal of many-core is to maximize parallel throughput.
    MapReduce: A programming model for processing large data sets in highly parallel fashion. It is named for its most famous implementation, the one that Google uses to create, sort, and store key-value pairs for its search engine.
    mass storage: An external storage device capable of storing large amounts of data. Online, directly accessible disk space is limited on most systems; mass storage systems provide additional space that is slower and more difficult to access, but can be virtually unlimited in size.
    Master/Worker: A programming approach in which one process, designated the master, assigns tasks to other processes, known as workers.
    matmul: Matrix-matrix multiplication.
    matrix: A two-dimensional array of numbers or variables arranged into rows and columns. A matrix is termed sparse if nearly all of its entries are zero; otherwise it is called dense.
    memory hierarchy: The layers of memory that can be accessed by the processor, arranged in order from fastest/smallest to slowest/largest. Typically registers are at the top of the hierarchy, followed by the L1, L2, and L3 caches, then RAM, then the hard drive.
    memory leak: A situation that arises when memory is unintentionally reserved by a programmer when it should have been freed.
    memory bound: A situation in which the speed for completing a given computational problem is limited by the rate of movement of data in and out of memory.
    memory-level parallelism: The ability of a computer system to process multiple memory operations (e.g., cache misses) simultaneously. MLP may be considered as a form of ILP, or instruction-level parallelism.
    message passing: A communication paradigm in which processes communicate by exchanging messages via communication channels.
    metascheduling: The high-level direction of a computational workload through a metasystem composed of multiple systems, each with its own job scheduler and resources. A given job is queued for execution on these federated resources based on current and static characteristics such as workload, software, and memory. Also known as automatic resource selection or global queue.
    microbenchmark: In HPC, an attempt to measure the performance of some specific task that is performed by the processor. The code that is used in such a test usually performs no I/O, or else it tests a single specific I/O task.
    MIMD: Multiple Instruction, Multiple Data. A type of parallel computing in which distinct processing units can execute different instructions on different pieces of data simultaneously. Examples of MIMD architectures are distributed computer systems, superscalar processors, and multicore processors.
    minimum spanning tree: Given a connected, undirected, weighted graph, the minimum spanning tree of the graph is a subgraph that has the least total weight compared to all the other spanning trees, i.e., subgraphs that have the form of a tree and connect all the vertices together.
    MKL: Intel's Math Kernel Library, a software implementation of BLAS, LAPACK, and other commonly used mathematical and numerical functions. MKL routines are optimized for Intel microprocessor architectures.
    MMX: A SIMD instruction set for Intel CPUs that allows programs to process two 32-bit integers, four 16-bit integers, or eight 8-bit integers concurrently by using special 64-bit-wide integer units.
    model-view-controller: A software architectural pattern for implementing user interfaces. The model component embodies the application's data, logic, and rules; data from the model are provided to the view component, which presents output to the user; and based on the view, the user gives input to the controller component, which sends the user's input to the model and/or view.
    modularity: A characteristic of software that has been separated into independent, self-contained units, each of which fulfills a single, distinct role. The modules interact with one another only through well-defined interfaces.
    module: A command that sets up a basic environment appropriate to the specified compiler, tool, or library.
    Monte Carlo method: Any computational method which uses random sampling to compute an approximation as part of its algorithm.
    Moore's Law: The observation that transistor density in integrated circuits doubles roughly every two years, or the prediction that this historical trend will continue into the future. The law is named after Intel co-founder Gordon E. Moore, who described the trend in a 1965 paper.
    MOSFET: Metal-Oxide-Semiconductor Field-Effect Transistor, a type of transistor in which the metal gate is insulated from the source and drain by an oxide layer. Due to low power consumption and fast switching times, MOSFETs are commonly used as logic gates in microprocessors and in other kinds of digital circuitry.
    MPI: Message Passing Interface, a de facto standard for communication among the nodes running a parallel program on a distributed-memory system. MPI is a library of routines that can be called from both Fortran and C programs. MPI's advantage over older message-passing libraries is that it is both portable (because MPI has been implemented for almost every distributed-memory architecture) and fast (because each implementation is optimized for the hardware it runs on).
    MPI communicator: A data structure that can be associated with a group of MPI processes in order to hold additional attributes about the group's identity, use, means of creation and destruction, and the scope of its communication universe.
    MPI group: An ordered set of MPI processes. Each process in a group is associated with a unique integer rank. Rank values start at zero and go to N - 1, where N is the number of processes in the group.
    MPP: Massively Parallel Processor. A computer system that utilizes a large number of processors to perform parallel computations. An MPP is similar to a very large cluster, but it features a specialized interconnect network. Each CPU possesses its own memory and runs its own OS.
    multi-core processor: A computer chip that holds one or more independent central processing units, called cores, that read and execute program instructions.
    multigrid method: A method for solving elliptic PDEs by starting with a solution on a coarse grid and making successive refinements to the solution on a series of finer grids. This ultimately produces a better approximation on the coarsest grid. The process is then iterated until the solution has converged to some tolerance based on a residual vector.
    multiplexer (MUX): An electronic device that selects one of several input signals and forwards the selected input to a single output.
    multiprocessing: The execution of multiple concurrent software processes in a computer system, especially a system with two or more hardware CPUs.
    multithreading: The hardware-based capability of executing more than one thread per core. From the hardware perspective, multithreading is distinguished from shared-memory multiprocessing, which refers to only 1 thread/core. But from the software perspective, both situations can be termed multithreading.
    mutex: Mutual exclusion, which means ensuring that no two threads can concurrently execute a set of instructions that must be run by one thread at a time (critical section). A lock or other mutex mechanism allows only a single thread to enter the critical section, forcing other threads to wait until the thread has exited from that section.
    NAS Parallel Benchmarks: A small set of programs derived from some computational fluid dynamics (CFD) applications at NASA. The benchmarks are intended to gauge the performance of different large-scale parallel computers.
    Navier-Stokes equations: The fundamental mathematical model of viscous fluid dynamics.
    network topology: Conceptual arrangement of the various elements (links, nodes, etc.) of a computer network, expressing the connectivity properties of the network.
    NIC: Network Interface Controller. A computer hardware device connecting a computer to a network.
    node: One of the individual computers that are linked together by a network to form a parallel system. A single instance of the operating system runs on a node. A node may house several processors, each of which may in turn have multiple cores. All cores on a node have access to all of the memory on the node, but the access may be nonuniform. The term derives from a node on the connectivity graph that represents the computer network.
    node (graph): An object in a graph that is connected to one or more other objects through links. Nodes are also referred to as vertices, and links are also called edges.
    NUMA: Non-Uniform Memory Access. In many computers, memory is divided among subgroups of cores, such that a given subgroup has faster access to local data due to traveling through fewer controllers and/or across wider buses. The performance of a program can often be improved by assigning memory and core usage to ensure that data are stored in the memory locations that are most accessible from the cores that will use the data. Uniform Memory Access (UMA) is the opposite: there is a single memory controller with a unified bus to all of the memory, so that no core is favored in accessing any particular memory location.
    octree: A tree data structure, or branching graph, in which each internal (non-leaf) node has exactly eight children.
    one-sided communication: Message passing that can be initiated by one process or thread acting alone, if RDMA is available. Instead of needing a matching receive or send call, a sender can simply put a message, or a receiver can simply get a message.
    online algorithm: An algorithm that is able to work with input as it arrives, without the complete input being available from the start. In contrast, an offline algorithm requires all the input data to be present from the beginning in order to produce a solution.
    OpenCL: A programming framework for writing codes that can execute across a platform consisting of CPUs, GPUs, and other processor types. The software was originally developed by Apple Inc.
    OpenGL: Open Graphics Library. An API for interacting with the graphics processing unit (GPU), allowing programmers to optimize the 3D rendering performance through hardware acceleration.
    OpenMP: A set of software constructs that are expressed as directives to a compiler that cause sections of code (typically iterations of loops) to be run in parallel on separate threads. This gives a programmer the advantages of a multithreaded application without having to deal with explicitly managing the creation and destruction of threads.
    OpenSHMEM: An API for parallel programming which creates a virtual shared memory space in a distributed-memory system. SHMEM stands for Symmetric Hierarchical MEMory; the symmetric segment is shared, while other memory is private. The shared memory can be accessed via one-sided communication operations from remote PEs. OpenSHMEM attempts to standardize and subsume several previous SHMEM APIs from SGI, Cray, and Quadrics. It can provide a low-level implementation layer for PGAS (Partitioned Global Address Space) languages like UPC (Unified Parallel C).
    optical flow: Computational methodologies from image processing and navigation control that are needed for robotics, including motion detection, object segmentation, and the computation of pixel displacements between consecutive images.
    optimization: The act of tuning a program to achieve the fastest possible performance or consume the least resources on the system where it is running. There are tools available to help with this process, including optimization flags on compiler commands and optimizing preprocessors. You can also optimize a program by hand, using profiling tools to identify hot spots where the program spends most of its execution time. Optimization requires repeated test runs and comparison of timings, so it is usually worthwhile only for production programs that will be rerun many times.
    Our Pattern Language (OPL): A design pattern language for engineering parallel software. This description is also the exact title of OPL's defining document, co-authored by Kurt Keutzer (EECS, UC Berkeley) and Tim Mattson (Intel). OPL is one outcome of an ongoing project centered at UC Berkeley's Par Lab.
    out-of-core algorithm: An algorithm designed to process data that are too large to fit into a computer's main or core memory all at once. Such an algorithm must be very efficient at fetching data from, and storing data to, the slow bulk memory (e.g., a hard drive).
    out-of-order execution: A way to avoid processor stalls by keeping multiple instructions in a queue and permitting the execution of any of them as soon as their operands have been loaded from memory. To preserve program semantics, the results from older operations must be stored ahead of the results of newer operations.
    overhead: Any hidden, extra utilization of CPU time, memory bandwidth, or other resource that is required to achieve a particular computational goal. Examples from parallel computing are communication overhead that is required to initiate message passing in MPI, or OS overhead that is required to fork a thread team in OpenMP.
    overlap: Computing and communicating simultaneously, or communicating while other communications are in progress (a.k.a. pipelining).
    overload: A general mechanism employed by some programming languages that will allow different functions or operators to be called using the same symbol or name; the underlying function called depends on the type of the argument(s) passed to the overloaded symbol.
    packet overhead: The additional cost to transmit data in the form of discrete packets that contain more than just raw data. Packet overhead is typically due to the extra information embedded in the packet header, which is required to be assembled prior to transmission and disassembled after being received.
    padding: Inserting meaningless data entries between the end of the last data structure and the start of the next, in order to produce favorable byte boundary alignment, e.g., padding each row of an array in C.
    page: A fixed-length, contiguous block of virtual memory. One page is the basic unit of memory that is communicated between main memory and disk storage. Its size is often 4KB.
    parallel loops: Loops in which the dependencies have been removed and transformations have been applied so that each iteration may be executed in parallel on multi-core or SIMD resources. The OpenMP omp for (or omp do) and omp simd directives are appropriate for multithreading and vectorizing such loops. Identifying and enabling a program's parallel loops gives an incremental way to parallelize it.
    parallel prefix sum: A parallel algorithm that computes a prefix sum, also called a scan or cumulative sum. Given the sequence x0, x1, x2, ..., the corresponding prefix sum is x0, x0 + x1, x0 + x1 + x2, .... It can be generalized to apply to binary operations other than addition. For n steps, the parallel prefix sum has O(log n) cost.
    parallel processing: Computing with more than one thread of execution in order to process different data or perform different functions simultaneously. The threads may be in a single task, or there may be multiple tasks, each with one or more threads.
    parallel programming: Writing programs that use special languages, libraries, or APIs that enable the code to be run on parallel computers. Parallel programs can be classified by the assumptions they make about the memory architecture: shared memory, distributed memory, or distributed shared memory. For example, OpenMP is an API for shared memory, MPI is an API and library for distributed memory, and both of them can be used together for distributed shared memory.
    parallelism: An inherent property of a code or algorithm that permits it to be run on parallel hardware. There are several types of parallelism that can be identified in a program at different scales: coarse-grained, fine-grained, and loop-level. The matching types of parallelism in computer hardware would be cluster, symmetric multiprocessor (SMP), and vector processor. All types can be exploited at once if the program is parallelized to run on a cluster of SMPs with vector units.
    parallelization: Splitting of program execution among many threads that can perform operations simultaneously in different cores of a computer. A program that has been efficiently parallelized will use all of the cores that are available to it nearly all of the time.
    particle-mesh method: A parallel method for computing far-field forces in a particle system. A regular mesh is superimposed on the system, and far-field forces are computed as follows: (1) each particle is assigned to a nearby mesh point; (2) the mesh problem is solved using a standard parallel algorithm such as FFT or multigrid; (3) the resulting forces are interpolated from the mesh points back to the original particle positions.
    partitioned array: A data structure representing an array that has been divided into subarrays. In the partitioned array, elements may be stored and indexed by their subarray positions in order to improve data locality during subarray operations. This is especially helpful if the subarrays are to be the targets of parallel operations.
    partitioning: Restructuring a program or algorithm in semi-independent computational segments to take advantage of multiple CPUs simultaneously. The goal is to achieve roughly equivalent work in each segment with minimal need for intersegment communication. It is also worthwhile to have fewer segments than CPUs on a dedicated system.
    passthrough: Taken from signal processing, where the term is used to describe logic gates that minimally alter a signal; it is also applied to hypervisors that allow a guest OS to directly (or nearly directly) communicate with the host hardware.
    path: Tells Linux where to find a file. There are two kinds of paths: full and relative. A full path is the complete path to a file using the entire directory structure; it does not rely on where you are in the directory tree. A relative path starts with where you are to define where the file is located. The full path will always be the same; the relative path will depend on where you are.
    PE: A processing element, especially one that takes part in a parallel computation. It can be a machine (node on a network), a processor chip, or a core within a processor. A PE has the ability to run an independent process or thread and access the associated memory.
    permutation: An operation that gives a unique rearrangement of the items in an ordered list, without adding or omitting any items.
    PETSc: Portable, Extensible Toolkit for Scientific Computation (pronounced PETS-cee). A suite of parallelized data structures and routines for the scalable solution of various PDEs that typically occur in scientific applications. Interfaces exist for C, C++, Fortran, and Python.
    PGAS: Partitioned Global Address Space. A parallel programming model based on the assumptions that (1) all processes or threads have access to a shared, global memory address space that is logically partitioned, and (2) a portion of the global memory is local to each process or thread. Examples of languages that incorporate such a model are UPC, Coarray Fortran, Chapel, and Global Arrays.
    PIC method: Particle-in-cell method, a technique for solving certain sets of partial differential equations that arise in physics simulations. In the PIC method, representative fluid particles, or point masses, move through and are influenced by a grid on which…

    Original URL path: https://cvw.cac.cornell.edu/main/glossary (2015-12-27)
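    A couple of the glossary entries above lend themselves to short illustrations. First, hot/cold splitting: the sketch below is invented for illustration (the struct and field names are not from the glossary). The frequently used "hot" fields stay in the array that the inner loop streams through, while the rarely used "cold" fields move behind a pointer so they no longer occupy space in the cache lines the loop touches.

        #include <stdlib.h>

        /* Cold fields: touched rarely, e.g., only when writing a report. */
        struct particle_cold {
            char   label[64];
            double creation_time;
        };

        /* Hot fields: touched every time step, so keep them dense in cache. */
        struct particle {
            double x, y, z;               /* position, updated in the inner loop */
            double vx, vy, vz;            /* velocity, updated in the inner loop */
            struct particle_cold *cold;   /* seldom-used fields live elsewhere   */
        };

        void advance(struct particle *p, size_t n, double dt)
        {
            for (size_t i = 0; i < n; i++) {   /* only hot fields are streamed */
                p[i].x += p[i].vx * dt;
                p[i].y += p[i].vy * dt;
                p[i].z += p[i].vz * dt;
            }
        }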
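    Second, the parallel loops, lock, and mutex entries fit together in one OpenMP fragment. This is a generic sketch, not text from the glossary: omp parallel for divides independent iterations among threads, and omp critical acts as the implicit lock that lets only one thread at a time update the shared accumulator (in practice a reduction clause would be the faster choice).

        #include <omp.h>
        #include <stdio.h>

        int main(void)
        {
            const int n = 1000000;
            double sum = 0.0;

            /* "parallel loops": the iterations are independent, so they may be
               divided among threads. */
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                double term = 1.0 / (double)(i + 1);

                /* "lock"/"mutex": only one thread at a time executes this
                   critical section, so the shared update does not race. */
                #pragma omp critical
                sum += term;
            }

            printf("harmonic sum over %d terms = %f\n", n, sum);
            return 0;
        }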

  • Register

    Original URL path: https://cvw.cac.cornell.edu/Registration/Account/Register (2015-12-27)

  • Log in
    …required. Password: The password field is required. Remember me. Note: If you registered before July 1, 2015, you will need to reset your password using "Forgot your password" below. The software was upgraded, and encrypted passwords could not be migrated.

    Original URL path: https://cvw.cac.cornell.edu/Registration/Account/Login (2015-12-27)

  • Cornell Virtual Workshop

    Original URL path: https://cvw.cac.cornell.edu/ (2015-12-27)