Havock Posted June 5, 2015

Greetings,

I'm in the process of converting my old site to UTF-8. So far things are clear, but one point is bothering me. I've read that with UTF-8 data some functions (such as strlen, substr, strpos ...) may not work properly for "special" (i.e., multibyte) Unicode characters. People then advise replacing these functions with their multibyte equivalents (such as mb_strlen, mb_strpos ...). However, in version 2.3, including the latest 2.3.4, I cannot find these multibyte functions; only the old ones are used. Is there a reason?

Many thanks for your help.
Havock Posted June 8, 2015 (Author)

Well, so far nobody seems to be able to answer :( I'm certainly not the first one to be using UTF-8 :rolleyes: so what do others do? Do you use the mbstring.func_overload setting to automatically replace the standard string functions with their multibyte equivalents? For some functions the multibyte equivalents are not really needed, but functions such as strlen, substr or strpos could cause some trouble. Has this small problem been ignored, or solved somewhere?
MrPhil Posted June 8, 2015

Hmm. An interesting question. Indeed, the current (2.3.4) osC is using substr(), etc. for UTF-8 data. According to the php.net manual, mb_substr() should be used (i.e., UTF-8 isn't the default encoding). Some PHP functions changed their default encoding around 5.4 or so, but 1) osC shouldn't be written to assume that, and 2) I think that only applies to string functions where the encoding is explicitly given.

I wonder if the current osC is indeed occasionally giving errors due to splitting multibyte characters, but people are assuming that they did something wrong and it's not an osC problem.

IIRC, some PHP-based products (e.g. SMF) use a macro (define) for which string functions to use: substr() for non-UTF-8 and mb_substr() for UTF-8. Or maybe they wrap a call around it to choose one or the other -- it's been quite a while since I was in there.
Havock Posted June 9, 2015 (Author)

Thank you for your reply, MrPhil. In most cases the occasional errors should not cause serious problems, but one should never underestimate Murphy's law -_- Maybe nobody really noticed this or took the time to correct it because they use standard English characters (not multibyte ones), or (as you said) they believed they did something wrong in their own code.

Switching to the multibyte functions is not difficult, but it's a long and tedious process. It would be nice to have the advice of people more qualified than me (all the more so since, on the French forum, someone said that the mime.php class should not be modified, so it's a bit confusing).
oscMarket Posted June 9, 2015

Could you please tell us what problems you actually encounter, as you have not mentioned any failures? Once everything is UTF-8 there should be no issues, so I am kind of curious what issues you have.
Havock Posted June 9, 2015 (Author)

Hello wHiTeHaT,

I'm still in the process of converting my site to UTF-8, so I've not had issues yet; but some reading on switching from latin1 to UTF-8 brought up this possible issue. String functions like strlen, substr, strpos ... do not really work on characters, but on bytes; and with UTF-8, a single character uses between one and four bytes (for example: e takes one byte, and é takes two). So strlen('ee') will return 2, but strlen('éé') will return 4. People then advise switching to the multibyte equivalents of these functions.
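The byte-versus-character difference described above can be checked with a minimal sketch like the following. It assumes the mbstring extension is loaded; the variable names are only illustrative, and the accented string is written as byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// "\xC3\xA9" is the two-byte UTF-8 encoding of é (U+00E9).
$ascii    = 'ee';
$accented = "\xC3\xA9\xC3\xA9"; // éé in UTF-8

$bytesAscii    = strlen($ascii);                // counts bytes
$bytesAccented = strlen($accented);             // counts bytes
$charsAccented = mb_strlen($accented, 'UTF-8'); // counts characters

echo "strlen('ee')    = $bytesAscii\n";    // 2
echo "strlen('éé')    = $bytesAccented\n"; // 4
echo "mb_strlen('éé') = $charsAccented\n"; // 2
```

Passing 'UTF-8' explicitly avoids relying on the server's default internal encoding, which may differ between PHP versions and configurations.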
oscMarket Posted June 9, 2015

Could you do a test with:

    $length = strlen(utf8_decode($string));
oscMarket Posted June 9, 2015

Just tested, and it seems to do the job. We use the Greek word (no clue what it says): ΜΠΟΡΕΙΤΕ

This is an 8-character string. So you are right; if we do:

    strlen('ΜΠΟΡΕΙΤΕ'); // gives a length of 16

However, if we do:

    strlen(utf8_decode('ΜΠΟΡΕΙΤΕ')); // gives 8 ... which is what you want ^_^
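A note on the utf8_decode() trick: it converts the string to ISO-8859-1, so characters outside that set (such as Greek) become "?", which still yields the right count but not a usable string. As far as I know the function is also deprecated in recent PHP releases (8.2 and later). The sketch below shows one way to count characters without it and without mbstring, using PCRE's u modifier; count_utf8_chars() is a hypothetical helper name, not an osCommerce or PHP function.

```php
<?php
// ΜΠΟΡΕΙΤΕ written as UTF-8 bytes (each Greek capital takes two bytes).
$word = "\xCE\x9C\xCE\xA0\xCE\x9F\xCE\xA1\xCE\x95\xCE\x99\xCE\xA4\xCE\x95";

// Hypothetical helper: count code points via PCRE in UTF-8 mode.
// Falls back to the byte count if the input is not valid UTF-8.
function count_utf8_chars($string)
{
    $count = preg_match_all('/./su', $string, $matches);
    return ($count === false) ? strlen($string) : $count;
}

echo strlen($word), "\n";           // 16 (bytes)
echo count_utf8_chars($word), "\n"; // 8  (characters)
```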
Havock Posted June 9, 2015 (Author)

Quote: "could you do a test with: $length = strlen(utf8_decode($string));"

Well, I'd like to, but I'm still in the process of configuring my server, so I can't run a live test. Using utf8_decode should solve the problem for the strlen function, but it may not always be possible or recommended to use it for the other functions involved.

I've tried on a local copy of Bootstrap osCommerce, and echo strlen(ee); and strlen(éé); both gave 2. Logically that should not be the case, and I do not understand why I get these results. However, I've seen that even though the database uses UTF-8 tables and the PHP pages have a meta charset="utf-8", the PHP files themselves are not saved as UTF-8 without BOM but as ANSI. Could this mix of UTF-8 and ANSI explain why these functions still work the "old way"? This mix does not seem really safe if we want to avoid mixing data charsets.
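The file encoding is a plausible explanation for that result: a string literal in a PHP file contains exactly the bytes the editor saved. A minimal sketch, assuming "ANSI" here means a single-byte encoding such as ISO-8859-1 or Windows-1252 (where é is the single byte 0xE9); is_valid_utf8() is a hypothetical helper name.

```php
<?php
$latin1 = "\xE9\xE9";         // éé as saved by an ANSI / ISO-8859-1 editor
$utf8   = "\xC3\xA9\xC3\xA9"; // éé as saved by a UTF-8 editor

// Hypothetical helper: true when the bytes form valid UTF-8.
// preg_match() with the u modifier returns false on invalid input.
function is_valid_utf8($string)
{
    return preg_match('//u', $string) === 1;
}

echo strlen($latin1), "\n";       // 2
echo strlen($utf8), "\n";         // 4
var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)
```

So a literal typed into an ANSI-saved file can report 2 bytes while the same text arriving from a UTF-8 form or database reports 4, which is the kind of mixing the post warns about.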
Havock Posted June 9, 2015 (Author)

Quote: "just tested and it seems it do the job: [...] strlen('ΜΠΟΡΕΙΤΕ'); //makes a length of 16 however if we do: strlen(utf8_decode('ΜΠΟΡΕΙΤΕ')); //makes 8.... as what you want ^_^"

So the problem is real. Even if it's not a high-risk problem, it should be taken care of. Using utf8_decode may not be an option for all the functions involved, and even if it were, adding it to every call of these functions is as big a task as replacing them with their multibyte equivalents.
oscMarket Posted June 9, 2015

Well, I tested on my localhost:

    strlen(éé);   // of course gives a warning (missing quotes), however it still counts 4
    // should be tested as:
    strlen('éé'); // returns 4 on mine
    strlen('ee'); // returns 2

So I am trying to understand what you are actually testing. Let's assume you retrieve data from your database. In your database, every field of type varchar/text should use a UTF-8 character encoding. As you mentioned, you ported from an older osCommerce version, so you should first convert your database to UTF-8. You can find this in the osCommerce admin: Tools -> Database Tables
Havock Posted June 9, 2015 (Author)

I'm not using my converted old database yet. So far I've upgraded my PHP code to be compatible with PHP 5.4 and saved all my files as UTF-8 without BOM. I'm currently looking at the server parameters, and when that's done I'll convert my database and import it. However, as I try to do things somewhat methodically :) I'd like to solve this possible issue with the string functions first.

The test you ran shows that with UTF-8 data, the strlen function is not a reliable way to get the length of a string. So we should assume that the other string functions (strpos, substr ...) have the same kind of problem and should not be considered reliable with UTF-8 either.

Hence the real question: why are these functions still used in osCommerce and not their multibyte equivalents: mb_strlen, mb_substr, mb_strpos ...
burt Posted June 9, 2015

Why are you wasting time on solving a possible issue? First get the work done, next see if all is OK, then make changes.
oscMarket Posted June 9, 2015

Quote: "Hence the real question: why are these functions still used in osCommerce and not their multibyte equivalents: mb_strlen, mb_substr, mb_strpos ..."

Simply because there is no need for it. There is no TEXT involved where osCommerce makes use of these functions. So the question has no relevance to the matter. You are stirring a pot that everyone has eaten from, and no one got sick from the food cooked in it. :)
Havock Posted June 9, 2015 (Author)

Quote: "Why are you wasting time on solving a possible issue? First get the works done, next see if all is OK, then make changes."

Greetings burt. It's not just a possible issue. The test made by wHiTeHaT shows that the problem exists, even if it's not a serious one and may not really concern people using only standard English characters.

Quote: "Simply because there is no need of it. There is no TEXT involved where oscommerce makes use of these functions. So the question has no relevance to the matter. You are stirring in a pot from where everyone all eaten from, and no one got sick of the food cooked in. :)"

Let's take the create-account page as an example. On this page you test the number of characters entered by the customer, for example with:

    if (strlen($lastname) < ENTRY_LAST_NAME_MIN_LENGTH) {

Is it not possible to get wrong results there? If ENTRY_LAST_NAME_MIN_LENGTH is set to 4 and the customer enters 'ééé', strlen($lastname) returns 6, so the three-character name is accepted and the error message that should appear is never shown.

I know that a lot of people far more able than me at coding have already investigated the matter, but I like to get answers when I see something weird. (And before eating from a common pot, I like to be sure nobody missed anything that should not have been put into the brew ;) )
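Working through the numbers in the create-account example: with a minimum of 4 and the input 'ééé', the byte-based check sees 6 and lets the name through, while a character-based check sees 3 and rejects it. The sketch below assumes mbstring is loaded and defines the constant locally with an illustrative value; it is not the actual osCommerce code.

```php
<?php
// Illustrative value only; osCommerce defines this constant elsewhere.
define('ENTRY_LAST_NAME_MIN_LENGTH', 4);

$lastname = "\xC3\xA9\xC3\xA9\xC3\xA9"; // ééé in UTF-8: 3 characters, 6 bytes

// Byte-based check, as in the quoted line: 6 < 4 is false, so no error.
$rejectedByBytes = strlen($lastname) < ENTRY_LAST_NAME_MIN_LENGTH;

// Character-based check: 3 < 4 is true, so the name is too short.
$rejectedByChars = mb_strlen($lastname, 'UTF-8') < ENTRY_LAST_NAME_MIN_LENGTH;

var_dump($rejectedByBytes); // bool(false)
var_dump($rejectedByChars); // bool(true)
```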
oscMarket Posted June 9, 2015

@Havock, OK... good catch... now you have a point.
oscMarket Posted June 9, 2015

If you insist on a solution, here is one: http://pageconfig.com/post/portable-utf8

PHP 6/7 is expected to have built-in solutions for the matter; until then, this solution would fit current releases.
burt Posted June 9, 2015

If you need a solution to a problem that you have not yet seen (by your own admission) ...

Sitewide search and replace: strlen( to tep_strlen(

New function:

    function tep_strlen($string) { return mb_strlen($string, 'UTF-8'); }

Done. 1 minute...
MrPhil Posted June 9, 2015

If osC only works with complete strings, and never tries to take a smaller substring, it can probably get away with non-UTF-8-aware strlen(), substr(), etc. However, any attempt to cut a string, such as producing a "teaser" or trying to reformat into shorter lines, risks cutting in the middle of a multibyte character. Enforcing maximum string lengths will give bad results when text that would fit (by character count) is declared too long (by byte count). If osC is in fact using functions that assume 1 character = 1 byte, we could be getting occasional bad results that no one is raising a stink about.
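Both risks mentioned above (splitting a character when cutting, and over-counting against a maximum length) can be reproduced with a short sketch. It assumes mbstring is loaded; the sample text and variable names are illustrative.

```php
<?php
$text = "caf\xC3\xA9 cr\xC3\xA8me"; // "café crème" in UTF-8

// Cutting by bytes stops in the middle of the two-byte é.
$byteCut = substr($text, 0, 4);             // "caf" plus half a character
// Cutting by characters keeps the é whole.
$charCut = mb_substr($text, 0, 4, 'UTF-8'); // "café"

var_dump(preg_match('//u', $byteCut) === 1); // bool(false): no longer valid UTF-8
var_dump($charCut === "caf\xC3\xA9");        // bool(true)

// A 10-character maximum would wrongly reject this text if measured in bytes.
echo strlen($text), "\n";             // 12
echo mb_strlen($text, 'UTF-8'), "\n"; // 10
```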
♥bruyndoncx Posted June 9, 2015

@Havock, you have my sympathy. I wondered the same, but couldn't find a real major issue, just the minor one of the character count being off.

@burt, why tep_xxx, why not directly mb_xxx?

KEEP CALM AND CARRY ON
I do not use the responsive bootstrap version since I coded my responsive version earlier, but I have bought every 28d of code package to support burt's effort and keep this forum alive (albeit more like on life support). So if you are still here? What are you waiting for?!
Find the most frequent unique errors to fix:
grep "PHP" php_error_log.txt | sed "s/^.* PHP/PHP/g" |grep "line" |sort | uniq -c | sort -r > counterrors.txt
♥bruyndoncx Posted June 9, 2015

On a related note, I'm changing my site to modernize the look, and I'm testing my input fields to make use of HTML5 features such as pattern matching to enforce a minimum number of characters, with custom multilingual error messages (not the browser's default language) using the civem.js library (on GitHub):

    <input id="bill_firstname" name="bill_firstname" value="éé" pattern=".{2,}" required="" data-errormessage-value-missing="Dit veld is verplicht" class="validate invalid" data-errormessage-pattern-mismatch="min 2 karakters" type="text">
MrPhil Posted June 9, 2015

Quote: "@burt, why tep_xxx, why not directly mb_xxx ?"

I don't think mb_ is enabled in PHP by default. It has to be explicitly built, and not all servers may have it. Rather than lose osC users on such servers (presumably mostly in English-speaking countries), a soft fallback to the old versions might be better (even if they're otherwise using UTF-8).
burt Posted June 9, 2015

@bruyndoncx I've never liked the idea of a user-made function having the same name as an inbuilt PHP function...

@MrPhil mb_strlen is inbuilt in PHP from 4.0.6, so it should be available on all up-to-date servers...
Havock Posted June 10, 2015 (Author)

@wHiTeHaT: thanks for the link to the portable UTF-8 library. That's a nice solution if the mbstring extension is not already installed on the server.

@burt: your solution has several problems:
- If the mbstring extension is already installed on the server, there's no need to create a new function, as one can use the mb_ functions directly.
- If the extension is not installed, your function will not work.
- In any case, one still has to replace "strlen" in all the PHP files.
- Last but not least, strlen is not the only function involved; there are a lot of functions working with strings that may have to be changed (just have a look there: http://us.php.net/manual/en/book.mbstring.php ).

@MrPhil: You've got the point. There can be a lot of small bugs caused by this bad handling of UTF-8 data.

@bruyndoncx: Thx Carine :) . I agree that there should not be any real major issue, but if there is a known problem, should we not correct it?
burt Posted June 10, 2015

The solution is flexible enough to deal with the scenario of mb_strlen being unavailable:

    function tep_strlen($string) { return strlen($string); }

:thumbsup: and that is precisely why you make a new function.

I just tested a global search/replace; it took 4.3 seconds to find all strlen( and replace them with tep_strlen(, and then another 20 seconds to add the function to the two general.php function files.

So, you now have a solution for your needs that works exactly as intended and will take less than a minute to change your whole store over to. Use at your own risk.

For the other functions, make other new functions and do a global search/replace to update to them (similar to what I have shown above).
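The two variants of tep_strlen() posted in this thread could be folded into one wrapper that picks the implementation at runtime. This is only a sketch: the function_exists() check and the tep_substr() companion are assumptions added for illustration, not code that ships with osCommerce.

```php
<?php
// Use mbstring when it is available, otherwise fall back to the
// byte-based function (exact only for single-byte text).
function tep_strlen($string)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($string, 'UTF-8');
    }
    return strlen($string);
}

// Hypothetical companion wrapper following the same pattern.
function tep_substr($string, $start, $length)
{
    if (function_exists('mb_substr')) {
        return mb_substr($string, $start, $length, 'UTF-8');
    }
    return substr($string, $start, $length);
}

echo tep_strlen('ee'), "\n";               // 2 either way
echo tep_strlen("\xC3\xA9\xC3\xA9"), "\n"; // 2 with mbstring, 4 without
```

Keeping the tep_ prefix means the choice of implementation lives in one place, so a store on a server without mbstring degrades to the old behaviour instead of failing with an undefined-function error.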
This topic is now archived and is closed to further replies.