9 Replies Latest reply: Oct 1, 2012 12:12 PM by etresoft
nicolas michel Level 1 (75 points)

Dear all

 

The question is easier to explain with an example:

 

$ touch français

$ find . -name "*ç*"

 

How can I find | grep | awk file names that contain accented Latin characters such as é à è ç?

 

Many thanks !

 

Nicolas Michel

  • etresoft Level 7 (26,250 points)

    Not easily. These Unicode characters can be encoded in different ways and still be valid. Apple's HFS+ file system specifically uses the decomposed form, which is probably not what Terminal is using. All these tools have powerful abilities to directly use Unicode characters. Play around with them. I don't have time to give specifics now, but I will try to come back later.

  • etresoft Level 7 (26,250 points)

    I've done a little more research.

     

    Your keyboard generates precomposed Unicode characters. The file system stores everything as decomposed Unicode characters. The version of bash (3.2) that ships with any recent version of OS X can automatically translate between the two in some cases. It will not do that with wildcard expansion. A few years ago, Bash changed to the GPLv3 license, which prevents Apple from including it. More recent versions of Bash do slightly better. Apple does include a recent version of zsh, but it is not quite as capable as even Bash 3.2.

     

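    However, because HFS+ stores the decomposed version of these characters, you don't have to work that hard. Just search for "*c*" and it will find the ç (c with cedilla), because the name is stored as a plain "c" followed by a combining cedilla. The same is true for any accented character that decomposes into a base letter plus combining marks.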
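
A minimal sketch of why this works, written in Python rather than shell so that it does not depend on the file system or the locale. It assumes the name comes back from HFS+ in decomposed form (NFD) and uses `fnmatch` as a stand-in for `find -name` matching; the variable names are illustrative only.

```python
import fnmatch
import unicodedata

# What the keyboard sends: precomposed "ç" (U+00E7).
typed = "fran\u00e7ais"

# What HFS+ is assumed to hand back: "c" followed by a combining cedilla (U+0327).
stored_name = unicodedata.normalize("NFD", typed)

# The precomposed pattern does not match the decomposed name...
precomposed_hit = fnmatch.fnmatchcase(stored_name, "*\u00e7*")

# ...but a plain "c" does, because the base letter is stored on its own.
plain_hit = fnmatch.fnmatchcase(stored_name, "*c*")

# Normalizing the name to the same form as the pattern also works.
normalized_hit = fnmatch.fnmatchcase(
    unicodedata.normalize("NFC", stored_name), "*\u00e7*"
)

print(precomposed_hit, plain_hit, normalized_hit)
```

Matching on the bare base letter is the quick fix; normalizing both sides to one form is the more general one.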

  • nicolas michel Level 1 (75 points)

    Dear etresoft

     

    Thanks for your interesting answer.

     

    Honestly, I'm lost on this topic:

    - It changes from one Mac to another (at home I found a trick that doesn't work at my workplace).

    - It should change from one "$LC_ALL" to another.

    - It changes from one terminal to another (X11 writes ç as %E7 and shows ?? in some cases).

    - It changes from one bash to another (as you say).

    - It probably changes from one terminal setting to another as well (strict VT-100 or enabled encoding).

     

    To explain the situation:

    Word crashes on some of the clients I manage.

    In several cases, I found special characters in the name of the crashed file's parent directory

    (a special character could be many things, such as : ; ! > or even a return character; my users are really creative).

    Renaming the directory solved the problem in several cases.

     

    So what I am trying to do is run a "find" on these clients to locate special characters.

    I don't want to list these characters, because they could be any Unicode character and the list would be too long.

     

    So I just want to find all files whose names contain characters that are not one of these:

    [:alnum:] space - _ . é è ê ë á à â ä î ô û ç

     

    The logical way, to my mind, would be to use a Unicode regex like \p{Lm}

    <http://www.regular-expressions.info/unicode.html>

    I tried installing gawk, findutils and bash 4.2-1 with fink,

    but I can't find a way to do it.

     

    How can you tell which encoding is used for a file name?

     

    Many thanks !

     

    Nicolas

  • etresoft Level 7 (26,250 points)

    There is no "encoding" for file or directory names. They are stored as decomposed UTF-16. These names are properly handled throughout the operating system with the exception of Terminal. In the Terminal, all data is just a stream of bytes. The keyboard sends data as precomposed Unicode. There are an infinite number of other possible sources of data and there is no way to identify which parts of those streams are file paths. A few places in the shell can detect when you are working with file or directory names and can do an automatic conversion for you.

     

    Word 2011 has no problem with such directories. I suggest you just upgrade those systems that are crashing. If one particular 3rd party application isn't using the proper operating system facilities and is trying its own thing, and failing at it, there isn't much you can do. This applies to both Word and various shells. Most shells probably assume file names are UTF-8 or whatever Linux is doing. Real Unicode awareness is rare outside of Apple. You may be able to write a Perl script using Perl's Unicode support and File::Find to traverse a directory. It is difficult to write code blindly. Here is what I think you want:

     

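    #!/usr/bin/perl

    use strict;
    use warnings;

    use Encode qw(decode);
    use Unicode::Normalize;
    use File::Find;

    # Search the current directory.
    find(\&wanted, '.');

    # Look at each item.
    sub wanted
      {
      # Grab the file name and decode its UTF-8 bytes into characters.
      my $name = $_;
      $name = decode('UTF-8', $name);

      # Only look at directories. File::Find has already changed into the
      # containing directory, so test $_ rather than $File::Find::name.
      if(-d $_)
        {
        # Strip out all combining marks.
        my $stripped = Unicode::Normalize::NFKD($name);
        $stripped =~ s/\p{NonspacingMark}//g;

        # The path is still raw bytes, so it prints unchanged.
        print "$File::Find::name\n"
          if $name ne $stripped;
        }
      }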

     

    This will find all directories under the current directory whose names contain something that could be causing problems for Office. You could conceivably collect them all and rename them in a script like this.

  • nicolas michel Level 1 (75 points)

    Dear etresoft

     

    Unfortunately, I never learned perl.

    Learning the language for just one script would probably take more time than I have.

    But finally, a friend (patpro) found the "grep -P" option.

    It will do the job :

     

    $ touch ç

    $ touch é

    $ ls . |hexdump

    63 cc a7 0a 65 cc 81 0a           

    $ ls . |grep -P "[\x65][\xcc][\x81]"

    é

    $ ls . |grep -P "[\x63][\xcc][\xa7]"

    ç
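
The bytes in the hexdump can be reproduced without touching the file system. Here is a small sketch in Python, assuming the names are decomposed UTF-8 as the dump suggests; the helper `utf8_hex` is made up for this illustration.

```python
import unicodedata

def utf8_hex(text, form):
    """Return the UTF-8 bytes of text, normalized to form, as spaced hex."""
    data = unicodedata.normalize(form, text).encode("utf-8")
    return " ".join(f"{byte:02x}" for byte in data)

for ch in ("\u00e7", "\u00e9"):  # ç and é as typed (precomposed)
    print(ch, "NFD:", utf8_hex(ch, "NFD"), "NFC:", utf8_hex(ch, "NFC"))
```

The NFD bytes are the ones the grep -P patterns above spell out. The NFC bytes are what a precomposed character typed at the keyboard produces, which is why typing the letter into the pattern directly does not match.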

     

    As for your "Word 2011 has no problem with such directories": I once found a bug report about parent directory names.

    I can't find it again (it should have been corrected by updates), but the drive name is still not safe: <http://support.microsoft.com/kb/2027586>

    Anyway, there are a lot of cases where a return character or a semicolon in a file name causes trouble.

    The worst case I've seen was with MS SFM: you were allowed to use the full Mac naming convention on a Windows server that wasn't able to handle such names. But even without SFM, I prefer to have a tool for diagnosing these problems.

     

    Thank you for your time and your help !

     

    Nicolas Michel

  • etresoft Level 7 (26,250 points)

    nicolas michel wrote:

     

    Unfortunately, I never learned perl.

    That's fine. You probably don't want to see the more in-depth research I did, then.

     

    But finally, a friend (patpro) found the "grep -P" option.

    Then I may have some bad news for you. You must be on Lion. I don't think I ever knew about the "-P" flag in grep, and now it is gone. It seems the -P option is a feature of GNU grep. GNU has been changing its licensing over the years specifically to prevent companies like Apple (or Etresoft) from using GNU software. Apple is switching over to more open and less political licenses, and GNU grep was replaced with BSD grep in Mountain Lion.

  • nicolas michel Level 1 (75 points)

    Well, thanks again.

     

    fink can install gnu grep, so I don't care about Apple vs GNU fight.

    (I have an NFS export of /sw, it works from any mac clients)

     

    But still, your script is really interesting.

    Could you please explain these two lines?

        my $stripped = Unicode::Normalize::NFKD($name);

        $stripped =~ s/\p{NonspacingMark}//g;

     

    In fact, what I want to search for is quite complex, and I wish to start with a fairly short list of allowed characters

    and then be able to update my script with what I find on clients' disks.

    I will probably handle renaming on the fly too.

     

    The characters that are not allowed will probably include at least " ' ` / | \ ; : \n \t < > = + * % &

    as well as spaces at the beginning or end of a name,

    a dot at the end of a name (without an extension or after the extension), and so on

    (for Windows compatibility; we are half Mac and half PC).
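
The rules listed above could be sketched roughly as follows, in Python rather than Perl. The function name `problems` and the exact rule set are assumptions made for illustration: the forbidden characters are the ones listed above, and all control characters are treated like \n and \t.

```python
import re

# Characters from the list above, plus control characters such as \n and \t.
FORBIDDEN = re.compile(r"""["'`/|\\;:<>=+*%&]|[\x00-\x1f]""")

def problems(name):
    """Return a list of reasons why a single file name is considered unsafe."""
    found = []
    if FORBIDDEN.search(name):
        found.append("forbidden character")
    if name != name.strip(" "):
        found.append("leading or trailing space")
    if name.endswith("."):
        found.append("trailing dot")
    return found

for candidate in ("report.doc", " draft.doc", "a:b.doc", "notes.", "two\nlines"):
    print(repr(candidate), problems(candidate))
```

Returning the reasons instead of a yes/no answer makes it easier to extend the list as new cases turn up on client disks.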

     

     

    Many thanks !

     

    Nicolas

  • léonie Level 9 (68,355 points)

    Nicolas,

    this should work in Mt. Lion as well: Use "fgrep"

     

    And this works in Mt. Lion:

     

     

    bash-3.2$ ls

    Hægar                    Møn                    frühe                    frûhstûck          testing

    Mühe                    Mœn                    frúhstùck        

    bash-3.2$ ls -1 | fgrep æ

    Hægar

    bash-3.2$

     

     

    The trick is to use fgrep - literal comparison!

     

    Regards

    Léonie

  • etresoft Level 7 (26,250 points)

    nicolas michel wrote:

     

    fink can install gnu grep, so I don't care about Apple vs GNU fight.

     

     

    To be clear, these "Apple vs ..." fights are just media inventions. GNU released many popular tools and changed their license in 2007. For years Apple just relied on old versions while it worked on some ideology-free free software. Those tools are now mature and Apple is starting to replace the obsolete, circa-2007 GNU tools with more modern equivalents. It is more like "the World vs Apple" with only Apple playing fair.

     

    But still, your script is really interesting.

    Could you please explain these two lines?

        my $stripped = Unicode::Normalize::NFKD($name);

        $stripped =~ s/\p{NonspacingMark}//g;

     

    Basically, it just removes any combining marks. The next step compares the original version with a version that has any combining marks removed. If they are different, then the original had something special in it.
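
For readers who don't know Perl, the same two steps can be sketched in Python. This is an approximation rather than a line-for-line port; the function names are invented, and Unicode category "Mn" is used as the counterpart of \p{NonspacingMark}.

```python
import unicodedata

def strip_marks(name):
    """Decompose name (NFKD) and drop every nonspacing mark (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def has_something_special(name):
    """True when decomposing and stripping marks changes the name."""
    return strip_marks(name) != name

print(strip_marks("fran\u00e7ais"), has_something_special("fran\u00e7ais"))
print(strip_marks("testing"), has_something_special("testing"))
```

Because NFKD is a compatibility decomposition, names containing characters such as ligatures are also reported as different, not only names with accents.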

     

    That was my first attempt and it wasn't very good. A better method is to take any path from the operating system and do:

     

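      use Encode qw(decode);
      use Unicode::Normalize qw(NFC);

      # Decode the UTF-8 bytes of the path into Perl characters.
      my $chars = decode('UTF-8', $file);

      # And compose (NFC), so the name matches precomposed text.
      my $perlName = NFC($chars);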

     

    This will result in a Perl-version of the name that can be used for standard comparisons and matching. It is a long story and very Perl-specific.

     

    In fact, what I want to search for is quite complex, and I wish to start with a fairly short list of allowed characters

    and then be able to update my script with what I find on clients' disks.

    Why not just define a set of allowed characters and check for something not in that set?
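
A rough sketch of that idea, in Python rather than Perl. The allow-list below is only an assumption based on the characters mentioned earlier in the thread (ASCII letters and digits stand in for [:alnum:]), and the names `disallowed_chars` and `scan` are hypothetical.

```python
import os
import re
import unicodedata

# Hypothetical allow-list: ASCII letters and digits, space, - _ . and a few
# accented letters, kept in composed (NFC) form.
ACCENTED = unicodedata.normalize("NFC", "éèêëáàâäîôûç")
DISALLOWED = re.compile("[^A-Za-z0-9 \\-_." + ACCENTED + "]")

def disallowed_chars(name):
    """Return the characters of name that are outside the allow-list."""
    # Compose first, so a decomposed "e" + U+0301 is compared as "é".
    composed = unicodedata.normalize("NFC", name)
    return sorted(set(DISALLOWED.findall(composed)))

def scan(root):
    """Yield (path, bad characters) for every offending entry under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for entry in dirnames + filenames:
            bad = disallowed_chars(entry)
            if bad:
                yield os.path.join(dirpath, entry), bad
```

Looping over `scan(".")` and printing each pair gives a report of offending paths. Composing the name before the comparison means it does not matter whether the file system returns names decomposed or precomposed.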

     

    Still, I think there is a danger in going too far down a "how do I do x, y, and z" path here. I'm not at all convinced you need to. Windows does have some well-known limitations on allowed file and directory names, but it should fully support Unicode too. Word should not be crashing. A better approach is to identify exactly what is causing Word to crash and under what circumstances. Eliminating any non-ASCII characters in a path is a 1973 solution to a problem that should be solved in 2012.