How to properly handle UTF-8? (UTF8)

Jun 14, 2011 at 5:23 PM

My application takes input from the user in a text area and saves it to the database.  Some users type their information in MS Word, and then cut/paste it into the textarea on the web application.  If they use quotes or apostrophes, Word will convert them into "smart quotes" or "smart apostrophes".  I am able to store these characters and then display them back on a web page using the page encoding UTF-8. However, when I use this same data to output an MS Word doc using PHPWord, the characters don't show up correctly.

Is there a way to tell PHPWord that my text is already in UTF8 format so it will display correctly?

Jun 22, 2011 at 11:00 AM

I found this problem when using templates. When I read text from an Excel file and then put it into a docx file, UTF-8 characters display incorrectly. The same happens when I write the text (encoded in UTF-8) directly into the docx.

The solution is to remove (or comment out) the following part

if(!is_array($replace)) {
    $replace = utf8_encode($replace);
}

in: public function setValue($search, $replace)

in file: PHPWord/Template.php
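
For anyone wondering why removing that call helps: utf8_encode() assumes its input is ISO-8859-1 and converts every single byte into its own UTF-8 character, so text that is already UTF-8 gets encoded twice. Below is a minimal sketch of that effect. latin1ToUtf8() is a hypothetical stand-in written for this illustration (it is not part of PHPWord); it does the same byte-by-byte conversion so the example does not depend on utf8_encode() itself, which newer PHP versions deprecate.

```php
<?php
// Hypothetical helper: mimics what utf8_encode() does, i.e. treats every
// input byte as an ISO-8859-1 character and re-encodes it as UTF-8.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        $out .= $b < 0x80
            ? chr($b)                                          // ASCII passes through
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F)); // 1 byte becomes 2
    }
    return $out;
}

$smartQuote = "\xE2\x80\x99";          // U+2019, already valid UTF-8 (3 bytes)
$doubled = latin1ToUtf8($smartQuote);  // each byte is now its own character

echo bin2hex($smartQuote), "\n"; // e28099
echo bin2hex($doubled), "\n";    // c3a2c280c299
```

The six bytes on the second line are what Word ends up showing as three strange characters instead of one apostrophe.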

Jun 22, 2011 at 2:05 PM

Thanks for your suggestion, I also found the same sort of solution, but I'm not using templates, so I had to fix it in a different place:

in Section.php addText function:

I did this:

//        $givenText = utf8_encode($text);
    $givenText = $text;

and in the Cell.php addText function

I did this:

//        $text = utf8_encode($text);

There are many more places where this should probably be done, but I'm only using Section.addText and Cell.addText in my application, so those are the only places I needed to change.

 

Searching for utf8_encode found 19 matches in 7 files throughout the codebase.

If the developer could address this in the main codebase it would greatly help people.

The ironic thing about this problem is that the problem characters are coming from Microsoft Word in the first place!

 

Oct 31, 2011 at 9:16 PM

Thanks!  PHPWord was garbling my ASCII encoded French accents; removing utf8_encode from both Section.php and Cell.php's "addText" functions solved my problem!

Mar 16, 2012 at 7:36 PM
Edited Mar 19, 2012 at 1:10 PM

Good to know. 

Jul 18, 2012 at 9:41 PM

Has this been addressed in a later version of the library?

Jul 27, 2012 at 5:58 AM

I tried to encode the Chinese characters but the output produces:

福建省泉州市惠南工业区北一路 / Quanzhou / Fujian / China

 

Instead of:

福建省泉州市惠南工业区北一路 / Quanzhou / Fujian / China

When I display the result on my web page, no problem.

When I generate a docx document, everything is fine except the Chinese characters...

Any other ideas?

Aug 13, 2012 at 10:14 AM

If you have already done this:

  //$givenText = utf8_encode($text);

in the source code, then before you call addText you should convert the text to UTF-8 yourself, like this (assuming the source text is GBK-encoded):

$text = iconv('gbk','utf-8','福建省泉州市惠南工业区北一路');

$section->addText($text);
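
One caveat with the iconv call above: if the text is already UTF-8, converting it "from GBK" will corrupt it or fail. A possible guard is sketched below. ensureUtf8() and its 'GBK' default are assumptions made for this example, not part of PHPWord; the check relies on preg_match() with the /u modifier returning false for byte sequences that are not valid UTF-8.

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting from $assumedSource
// only when the bytes are not already valid UTF-8.
function ensureUtf8(string $text, string $assumedSource = 'GBK'): string
{
    // With the /u modifier, preg_match() returns false on invalid UTF-8.
    if (preg_match('//u', $text) === 1) {
        return $text; // already UTF-8, leave untouched
    }
    $converted = iconv($assumedSource, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException("Input is not valid $assumedSource");
    }
    return $converted;
}
```

It would then be used as $section->addText(ensureUtf8($text)); with the utf8_encode call in addText already commented out. Note that the guard is only a heuristic: some legacy-encoded byte sequences happen to be valid UTF-8 as well, so knowing the real source encoding is always better.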

Sep 25, 2012 at 9:27 PM
Edited Sep 25, 2012 at 9:32 PM

Hello all,

I have another issue concerning UTF8.

I modified Section.php and Cell.php to remove the utf8_encode calls.

That works OK.

But

I use a simple HTML CKEditor and convert the HTML with simple_html_dom to get some support for bold, italic and underline and a simple bullet list.

The parsing by simple_html_dom, however, turns the special chars back into the strange characters that were there before the utf8_encode removal.

Does someone have an idea how to prevent this?

This is the convert function:

 

function convertText($text,$object) {

  $html_dom = new simple_html_dom();
  $html_dom->load('<html><body>' . $text . '</body></html>');
  // Note, we needed to nest the html in a couple of dummy elements

  // Create the dom array of elements which we are going to work on:
  $html_dom_array = $html_dom->find('html',0)->children();

  // Provide some initial settings:
  $initial_state = array(
      'current_style' => array('bold'=>false, 'align'=>'left', 'color'=>'878787', 'name' => 'Arial'),
      'style_sheet' => h2d_styles_example(), // This is an array (the "style sheet") - returned by h2d_styles_Example() here (in styles.inc) - see this function for an example of how to construct this array.
      'parents' => array(0 => 'body'), // Our parent is body
      'list_depth' => 0, // This is the current depth of any current list
      'context' => 'section', // Possible values - section, footer or header
      'base_root' => 'http://www.mywebsite.com', // Required for link elements - change it to your domain
      'base_path' => '/', // Path from base_root to whatever url your links are relative to
      'pseudo_list' => TRUE, // NOTE: Word lists not yet supported (TRUE is the only option at present)
      'pseudo_list_indicator_font_name' => 'Wingdings', // Bullet indicator font
      'pseudo_list_indicator_font_size' => '7', // Bullet indicator size
      'pseudo_list_indicator_character' => 'l ', // Gives a circle bullet point with wingdings
  );

  // Convert the HTML and put it into the PHPWord object
  h2d_insert_html($object, $html_dom_array[0]->nodes, $initial_state);

}

 

I changed the load to:

$html_dom->load('<html><head><meta http-equiv='Content-Type' content='Type=text/html; charset=utf-8'></head><body>' . $text . '</body></html>');


But that is also not working, and nothing is converted anymore.
Does somebody have an idea how I can solve this issue?

Thanks

Sep 26, 2012 at 8:49 AM

I will answer the previous question myself, because I found that the h2d_htmlconverter also uses TextRun.php from the phpWord\Section area.

You have to make the same modification in TextRun.php as in Section.php and Cell.php.

The complete patch for this is listed here:

http://htmltodocx.codeplex.com/SourceControl/changeset/view/f676be705744#htmltodocx/patches/phpword/utf8_encode_090512.patch

Greetz Aren

Nov 6, 2012 at 10:02 AM

The patch does not change the character encoding. Problematic characters (áéíőóöüúű).

What needs to be modified to properly use ?

Jan 29, 2013 at 2:23 PM
Edited Jan 29, 2013 at 2:23 PM

I had the same type of problem; maybe check this post:

http://phpword.codeplex.com/discussions/431281

It works for me.

Jul 11, 2013 at 12:39 AM

@sauronpl Thank you soooo much you saved me a lot of time!! it worked for me using templates

Oct 30, 2013 at 3:46 AM

Just want to share this without modifying the section.php and cell.php...

I've just used utf8_decode to reverse the utf8_encode.

$section->addText( utf8_decode($text) );
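
A word of caution on the utf8_decode() workaround: utf8_decode() converts to ISO-8859-1 and replaces every character that does not exist there with "?", so the round trip only survives for text made up entirely of Latin-1 characters (é, ü and so on). Smart quotes and Chinese characters would be lost. A rough way to check a string beforehand is sketched below; fitsInLatin1() is a hypothetical helper for this example only, not something PHPWord provides.

```php
<?php
// Hypothetical helper: true when every character of the UTF-8 string $text
// is in the range U+0000..U+00FF, i.e. representable in ISO-8859-1.
function fitsInLatin1(string $text): bool
{
    // The /u modifier makes PCRE read $text as UTF-8; on invalid UTF-8
    // preg_match() returns false, which this function also reports as false.
    return preg_match('/^[\x{00}-\x{FF}]*$/u', $text) === 1;
}

var_dump(fitsInLatin1("caf\xC3\xA9"));       // "café"
var_dump(fitsInLatin1("it\xE2\x80\x99s"));   // "it’s" with a smart apostrophe
var_dump(fitsInLatin1("\xE7\xA6\x8F"));      // "福"
```

Only the first string passes the check. For the other two, removing utf8_encode from the library (as described earlier in this thread) looks like the safer route.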

Mar 24, 2014 at 11:13 AM

function utf2win1251 ($s)
{
    $out = "";

    for ($i=0; $i<strlen($s); $i++)
    {
        $c1 = substr ($s, $i, 1);
        $byte1 = ord ($c1);
        if ($byte1>>5 == 6) // 110x xxxx, 110 prefix for 2 bytes unicode
        {
            $i++;
            $c2 = substr ($s, $i, 1);
            $byte2 = ord ($c2);
            $byte1 &= 31; // remove the 3 bit two bytes prefix
            $byte2 &= 63; // remove the 2 bit trailing byte prefix
            $byte2 |= (($byte1 & 3) << 6); // last 2 bits of c1 become first 2 of c2
            $byte1 >>= 2; // c1 shifts 2 to the right

            $word = ($byte1<<8) + $byte2;
            if ($word==1025) $out .= chr(168); // Ё
            elseif ($word==1105) $out .= chr(184); // ё
            elseif ($word>=0x0410 && $word<=0x044F) $out .= chr($word-848); // А-Я а-я
            else
            {
                $a = dechex($byte1);
                $a = str_pad($a, 2, "0", STR_PAD_LEFT);
                $b = dechex($byte2);
                $b = str_pad($b, 2, "0", STR_PAD_LEFT);
                $out .= "&#x".$a.$b.";";
            }
        }
        else
        {
            $out .= $c1;
        }
    }

    return $out;
}