This dissertation focuses on kernel loops (K-loops), which are loop nests that contain hardware-mapped kernels in the loop body. It presents methods for improving the performance of such K-loops by using standard loop transformations, such as unrolling and pipelining, to expose and exploit coarse-grained loop-level parallelism.

The targeted applications are from the signal processing domain, where large kernels in the innermost loop are common. Examples include the JPEG, MJPEG and H.264 image/video compression algorithms and Sobel edge detection.

The targeted architecture is a heterogeneous system consisting of a general-purpose processor and a reconfigurable processor. This work is a step towards automatically deciding how many kernel instances to place in the reconfigurable hardware, in a flexible way that balances area against performance. We propose a general framework that helps determine the optimal degree of parallelism for each hardware-mapped kernel within a K-loop, taking into account area, memory size and bandwidth, and performance considerations. This framework can be easily extended to other architectures, such as multicore.
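As a minimal illustration of the K-loop notion and of unrolling, consider the C sketch below. All names in it (`kernel`, `kloop`, `kloop_unrolled2`, `N`) are hypothetical and chosen only for illustration; `kernel` is an ordinary software function standing in for a hardware-mapped kernel, and the sketch is not the framework proposed in this dissertation.

```c
#define N 7  /* hypothetical trip count; odd, so the remainder loop is exercised */

/* Hypothetical stand-in for a hardware-mapped kernel. */
static int kernel(int x) {
    return 3 * x + 1;
}

/* Original K-loop: a single kernel call per iteration. */
static void kloop(const int *in, int *out) {
    for (int i = 0; i < N; i++) {
        out[i] = kernel(in[i]);
    }
}

/* K-loop unrolled by a factor of 2: each iteration contains two
 * independent kernel calls, which could be served by two kernel
 * instances running in parallel on the reconfigurable hardware. */
static void kloop_unrolled2(const int *in, int *out) {
    int i;
    for (i = 0; i + 1 < N; i += 2) {
        out[i]     = kernel(in[i]);
        out[i + 1] = kernel(in[i + 1]);
    }
    /* Remainder iteration when N is not a multiple of the unroll factor. */
    for (; i < N; i++) {
        out[i] = kernel(in[i]);
    }
}
```

Both functions compute the same result; the unrolled form only makes the independence of neighbouring kernel calls explicit. The unroll factor corresponds to the number of kernel instances, which is the quantity that has to be weighed against area and memory bandwidth.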