Discussion:
[clisp-list] CANONICALIZE how to use?
Jean Louis
2017-04-11 20:19:03 UTC
Hello,

If I am not mistaken, CANONICALIZE could be used to get an ASCII string
from "Some ČĆŠL ™bigbog file name.html"?

In the documentation I cannot see what the first argument is and what
the second argument is.

Jean
Sam Steingold
2017-04-13 14:28:52 UTC
Permalink
Post by Jean Louis
If I am not mistaken, CANONICALIZE could be used to get an ASCII string
from "Some ČĆŠL ™bigbog file name.html"?

Yes, but you don't really need it.

Here is what you can do:

--8<---------------cut here---------------start------------->8---
(defun drop-non-ascii-chars (string)
  ;; Already pure ASCII: return the very same string object.
  (if (every (lambda (c) (typep c charset:ascii)) string)
      string
      ;; Otherwise round-trip through ASCII bytes; characters that
      ;; cannot be encoded are silently skipped.
      (let ((ascii #.(make-encoding :charset charset:ascii
                                    :input-error-action :ignore
                                    :output-error-action :ignore)))
        (ext:convert-string-from-bytes
         (ext:convert-string-to-bytes string ascii)
         ascii))))
(drop-non-ascii-chars "Some ČĆŠL ™bigbog file name.html")
==> "Some L bigbog file name.html"
--8<---------------cut here---------------end--------------->8---

See http://www.clisp.org/impnotes/encoding.html

Bruno, is this The Right Way?

PS. `:input-error-action :ignore` is not really necessary: the bytes
produced by the first conversion are already pure ASCII, so decoding
them cannot fail.

Post by Jean Louis
In the documentation I cannot see what the first argument is and what
the second argument is.

The doc, http://clisp.org/impnotes/macros3.html#canonicalize, says:

(EXT:CANONICALIZE value functions &KEY (test 'EQL) (max-iter 1024))
will call functions on value until it stabilizes under test (which
should be a valid HASH-TABLE-TEST) and return the stabilized value
and the number of iterations the stabilization required.

In other words, the first argument is the value and the second is a
list of functions:

(ext:canonicalize "Some ČĆŠL ™bigbog file name.html"
                  '(drop-non-ascii-chars))
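
To make the documented behaviour concrete, here is a minimal portable sketch of the same idea: apply the functions, compare with the previous value, and repeat until nothing changes. The name my-canonicalize is hypothetical and this is not the CLISP source. Applying the functions in list order, counting one iteration per full pass, and returning quietly when max-iter is reached are all assumptions made for illustration.

```lisp
;; Hypothetical sketch of the documented behaviour of EXT:CANONICALIZE.
;; NOT the CLISP implementation; the order of application, the way
;; iterations are counted and the MAX-ITER handling are assumptions.
(defun my-canonicalize (value functions &key (test 'eql) (max-iter 1024))
  "Apply FUNCTIONS to VALUE, in order, until a full pass leaves the value
unchanged under TEST.  Return the value and the number of passes made."
  (loop for iteration from 1 to max-iter
        ;; One pass: thread the current value through every function.
        for new = (reduce (lambda (v f) (funcall f v)) functions
                          :initial-value value)
        when (funcall test new value)
          do (return (values new iteration))
        do (setq value new)
        ;; Assumption: give up quietly once MAX-ITER passes were made.
        finally (return (values value max-iter))))
```

With this sketch, (my-canonicalize "abc" (list #'string-upcase) :test 'equal) returns "ABC" and 2: the first pass changes the string and the second pass confirms that it is stable. It also suggests why drop-non-ascii-chars returns its argument itself when the string is already ASCII: under the default EQL test, a stable value has to be the same object, not merely an equal copy.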

Jean, how would you clarify the doc?
--
Sam Steingold (http://sds.podval.org/) on darwin Ns 10.3.1504
http://steingoldpsychology.com http://www.childpsy.net http://www.memritv.org
http://iris.org.il http://islamexposedonline.com http://no2bds.org
Those who value Life above Freedom are destined to lose both.
Jean Louis
2017-04-13 14:31:32 UTC
Post by Sam Steingold
Post by Jean Louis
If I am not mistaken, CANONICALIZE could be used to get an ASCII string
from "Some ČĆŠL ™bigbog file name.html"?
Yes, but you don't really need it.

Thank you very much, I can start from there.

I was under the impression that č could be replaced by c, š by s, and
so on. That is often done for Internet URLs, so that search terms
remain findable. Dropping is one solution; replacement, or downgrading
to ASCII, is better.


Jean
Sam Steingold
2017-04-13 16:59:30 UTC
Post by Jean Louis
I was under the impression that č could be replaced by c, š by s, and
so on.

That would be lovely, but I am not sure how to define such a function on
Unicode characters canonically.

Many Unicode characters have no meaningful ASCII counterpart at all
(think CJK ideographs), so they would have to be dropped or
replaced with hex escapes or something...

(This is a relatively common question, and Google offers "inconclusive"
answers.)
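
For the limited practical case, the two ideas above (replace what has an obvious ASCII counterpart, drop or escape the rest) can be sketched with a small lookup table. The names *ascii-replacements* and downgrade-to-ascii are hypothetical, the table is deliberately tiny and would need extending for real use, and the \uXXXX escape format is just one possible choice. The sketch assumes the source file is read as UTF-8.

```lisp
;; Hypothetical helpers, for illustration only; the table is far from
;; complete and the escape format is an arbitrary choice.
(defparameter *ascii-replacements*
  '((#\č . "c") (#\Č . "C") (#\ć . "c") (#\Ć . "C")
    (#\š . "s") (#\Š . "S") (#\ž . "z") (#\Ž . "Z")
    (#\ß . "ss"))
  "Alist from a non-ASCII character to its ASCII replacement string.")

(defun downgrade-to-ascii (string &key (unknown :drop))
  "Return a copy of STRING in which non-ASCII characters are replaced
according to *ASCII-REPLACEMENTS*.  Characters without an entry are
dropped, or written as \\uXXXX when UNKNOWN is :ESCAPE."
  (with-output-to-string (out)
    (loop for c across string
          for replacement = (cdr (assoc c *ascii-replacements*))
          do (cond ((< (char-code c) 128) (write-char c out))
                   (replacement (write-string replacement out))
                   ((eq unknown :escape)
                    (format out "\\u~4,'0X" (char-code c)))))))
```

With this table, (downgrade-to-ascii "Some ČĆŠL ™bigbog file name.html") returns "Some CCSL bigbog file name.html". The trademark sign has no entry, so it is dropped; with :unknown :escape it is written as \u2122 instead.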
Jean Louis
2017-04-13 19:16:57 UTC
Post by Sam Steingold
Post by Jean Louis
I was under the impression that č could be replaced by c, š by s, and
so on.
That would be lovely, but I am not sure how to define such a function on
Unicode characters canonically.
Many Unicode characters have no meaningful ASCII counterpart at all
(think CJK ideographs), so they would have to be dropped or
replaced with hex escapes or something...
(This is a relatively common question, and Google offers "inconclusive"
answers.)

In practice, it is not that difficult.

I found a solution using iconv:

echo ßöäȧ čekaj malo | iconv -f UTF-8 -t ASCII//TRANSLIT
ssoaa cekaj malo

so I will just write a simple function around it.

Thank you.

Jean
