
SIMD Optimization in COINS Compiler Infrastructure



Presentation Transcript


  1. SIMD Optimization in COINS Compiler Infrastructure Mitsugu Suzuki (The University of Electro-Communications) Nobuhisa Fujinami (Sony Computer Entertainment Inc.)

  2. Agenda
     • COINS SIMD optimization
     • Two topics on SIMD optimization
       • Data Size Inference
       • SIMD Benchmark
     • Current status and required improvements

  3. SIMD optimization ‥‥ Concept and decision
     • Implemented as an LIR-to-LIR transformer.
     • Requires no additional special extensions for source languages.
     • Matters best optimized at the source level are postponed to the HIR level,
       e.g. vectorization (appropriate loop unrolling), if-peeling, complex if-conversion, etc.

  4. #define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
     short *v1, *v2, *v3;
     /* Assume that all pointers are aligned, and distances of source and
        destination pointers are longer than the size of the vector register. */
     for (i = 0; i < M; i++)          // case-A
         *v1++ = AVE(*v2++, *v3++);
     for (i = 0; i < M; i++)          // case-B
         v1[i] = AVE(v2[i], v3[i]);
     for (i = 0; i < M; i += 4) {     // case-C
         v1[i]   = AVE(v2[i],   v3[i]);
         v1[i+1] = AVE(v2[i+1], v3[i+1]);
         ...
         v1[i+3] = AVE(v2[i+3], v3[i+3]);
     }
     for (i = 0; i < M; i += 4) {     // case-D
         v1[0] = AVE(v2[0], v3[0]);
         v1[1] = AVE(v2[1], v3[1]);
         ...
         v1[3] = AVE(v2[3], v3[3]);
         v1 += 4; v2 += 4; v3 += 4;
     }
     × ○

  5. #define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))
     struct { short r, g, b, a; } *u1, *u2, *u3;
     /* Assume that all pointers are aligned, and distances of source and
        destination pointers are longer than the size of the vector register. */
     for (i = 0; i < M; i++) {        // case-E
         u1[i].r = AVE(u2[i].r, u3[i].r);
         u1[i].g = AVE(u2[i].g, u3[i].g);
         u1[i].b = AVE(u2[i].b, u3[i].b);
         u1[i].a = AVE(u2[i].a, u3[i].a);
     }
     ○

  6. SIMD optimization ‥‥ Processing flow
     • If-conversion
     • Decompose basic blocks into DAGs.
     • Match LIR patterns to specific SIMD operations.
     • Combine identical basic operations (parallelization). (⇒ page 3 of the handout)

  7. Data size inference ‥‥ Why needed?
     Two styles of averaging integers (assumption: both x and y are 8-bit unsigned integers):
     #define AVE(x,y) (((x) + (y) + 1) >> 1)
       ⇒ max 9 bits: zero-extension is needed (normal-instruction-oriented coding)
     #define AVE(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y))&1))
       ⇒ max 8 bits: no extension is needed (SIMD-instruction-oriented coding)
     But the compiler must extend x and y to their integral type (typically 32 bits) ← integral promotion rule

  8. Data size inference ‥‥ Method
     • Get the value range for each node.
     • Get the altering bits from the value range.
     • Get the meaningful bits for each node, given those of its parent node.
     • Value ranges and required bits are obtained from their inference rules.
     • Patterns of the meaningful bits are matched during instruction selection.

  9. [Figure: LIR DAGs for the two averaging expressions, each node annotated with its
      inferred value range (e.g. MEM:I8 0..255, RSHU 0..127, ADD 0..511, CONST 1..1), for
      *a = ((*b>>1) + (*c>>1)) + ((*b | *c) & 1);  and  *a = (*b + *c + 1) >> 1;]


  11. [Figure: the same LIR DAGs, now annotated with the number of meaningful bits per
       node (7, 8, or 9) derived from the value ranges; only the second expression
       produces a 9-bit intermediate.]



  14. SIMD Benchmark ‥‥ Why needed?
      • Existing benchmarks are not suited to tuning SIMD optimization:
        • SIMD-optimizable patterns are buried among non-SIMD-optimizable ones.
        • Existing code is far from SIMD-optimizable form (no "hole-in-one" matching).
      • Step-wise milestones for SIMD optimization were required.

  15. SIMD Benchmark ‥‥ Design
      • SIMD-optimizable code patterns were extracted from real media-processing applications.
      • Multiple versions were crafted by hand for each code pattern:
        • covering a wide range, from an easily SIMD-optimized level down to the original
        • classified by SIMD optimization technique
        • execution times are reported for each version

  16. Original vs. if-peeled (each with and without loop unrolling):

      /* Original */
      int16_t acLevel = data[i];
      if (acLevel < 0) {
          acLevel = (-acLevel) - quant_d_2;
          if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
          acLevel = (acLevel * mult) >> SCALEBITS;
          sum += acLevel;
          coeff[i] = -acLevel;
      } else {
          acLevel = acLevel - quant_d_2;
          if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
          acLevel = (acLevel * mult) >> SCALEBITS;
          sum += acLevel;
          coeff[i] = acLevel;
      }

      /* If-peeled */
      acLevel = ((data[i] < 0) ? -data[i] : data[i]) - quant_d_2;
      acLevel2 = (acLevel * mult) >> SCALEBITS;
      sum += ((acLevel < quant_m_2) ? 0 : acLevel2);
      coeff[i] = ((acLevel < quant_m_2) ? 0 : ((data[i] < 0) ? -acLevel2 : acLevel2));

  17. Original vs. if-converted (each with and without loop unrolling):

      /* Original */
      int16_t acLevel = data[i];
      if (acLevel < 0) {
          acLevel = (-acLevel) - quant_d_2;
          if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
          acLevel = (acLevel * mult) >> SCALEBITS;
          sum += acLevel;
          coeff[i] = -acLevel;
      } else {
          acLevel = acLevel - quant_d_2;
          if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
          acLevel = (acLevel * mult) >> SCALEBITS;
          sum += acLevel;
          coeff[i] = acLevel;
      }

      /* If-converted */
      acMsk1 = (int)data[i] >> 31;
      acLevel = ((data[i] & ~acMsk1) | ((-data[i]) & acMsk1)) - quant_d_2;
      acMsk2 = (acLevel < quant_m_2) ? 0 : 0xffff;
      acLevel = (acLevel * mult) >> SCALEBITS;
      sum += acMsk2 & acLevel;
      coeff[i] = acMsk2 & (((-acLevel) & acMsk1) | (acLevel & (~acMsk1)));

  18. Current status and required improvements
      • The skeleton of the SIMD optimizer has been implemented.
      • The following are required:
        • Enrichment of templates for specific SIMD operations.
        • Isolation of the machine-dependent and machine-independent parts of the SIMD optimizer.
        • A recovery method for failures in SIMD operation matching.
        • Alignment and overlap checks for pointers. ⇒ will be solved in the next release
