[clisp-list] CANONICALIZE how to use?

Discussion:

Jean Louis

2017-04-11 20:19:03 UTC

Hello,

If I am not mistaken, CANONICALIZE could be used to get an ASCII string
from "Some ČĆŠL ™bigbog file name.html"?

In the documentation I cannot see what the first argument is and what
the second argument is.

Jean

Sam Steingold

2017-04-13 14:28:52 UTC

Post by Jean Louis
If I am not mistaken, CANONICALIZE could be used to get an ASCII string
from "Some ČĆŠL ™bigbog file name.html"?

Yes, but you don't really need it.

Here is what you can do:

--8<---------------cut here---------------start------------->8---
(defun drop-non-ascii-chars (string)
  (if (every (lambda (c) (typep c charset:ascii)) string)
      string
      (let ((ascii #.(make-encoding :charset charset:ascii
                                    :input-error-action :ignore
                                    :output-error-action :ignore)))
        (ext:convert-string-from-bytes
         (ext:convert-string-to-bytes string ascii)
         ascii))))
(drop-non-ascii-chars "Some ČĆŠL ™bigbog file name.html")
==> "Some L bigbog file name.html"
--8<---------------cut here---------------end--------------->8---

See http://www.clisp.org/impnotes/encoding.html

Bruno, is this The Right Way?

PS. `:input-error-action :ignore` is not really necessary

Post by Jean Louis
In the documentation I cannot see what the first argument is and what
the second argument is.

The doc http://clisp.org/impnotes/macros3.html#canonicalize says:

(EXT:CANONICALIZE value functions &KEY (test 'EQL) (max-iter 1024))
will call functions on value until it stabilizes under test (which
should be a valid HASH-TABLE-TEST) and return the stabilized value
and the number of iterations the stabilization required.

IOW,

(ext:canonicalize "Some ČĆŠL ™bigbog file name.html"
                  '(drop-non-ascii-chars))
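
Note on the default test: drop-non-ascii-chars returns its argument itself (the same object) once the string is all ASCII, so a second call is EQL to its input and the value stabilizes. A function that always builds a fresh string would presumably need :test 'equal instead. The loop described in the doc can be sketched portably as below. This is only an illustration of the described behaviour, not CLISP's source: the name my-canonicalize is hypothetical, TEST is simply funcalled here, and the way passes are counted is an assumption that may differ from what EXT:CANONICALIZE reports.

```lisp
;; Hypothetical portable sketch of the behaviour described in the doc;
;; this is NOT CLISP's implementation of EXT:CANONICALIZE.
(defun my-canonicalize (value functions &key (test 'eql) (max-iter 1024))
  "Apply FUNCTIONS in order to VALUE until a full pass leaves it unchanged
under TEST, or until MAX-ITER passes have run.
Return the value and the number of passes (counting is an assumption)."
  (loop for iteration from 1 to max-iter
        ;; One pass: feed the value through every function in turn.
        for new = (reduce (lambda (v f) (funcall f v)) functions
                          :initial-value value)
        do (when (funcall test new value)
             (return (values new iteration)))
           (setq value new)
        finally (return (values value max-iter))))

;; Fresh strings are produced on every pass, so EQUAL is used as the test.
(my-canonicalize "  Hello  "
                 (list (lambda (s) (string-trim " " s)) #'string-downcase)
                 :test 'equal)
;; => "hello", 2
```

The second pass is what detects stability: the first pass changes the value, the second one reproduces it.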

Jean, how would you clarify the doc?

--
Sam Steingold (http://sds.podval.org/) on darwin Ns 10.3.1504
http://steingoldpsychology.com http://www.childpsy.net http://www.memritv.org
http://iris.org.il http://islamexposedonline.com http://no2bds.org
Those who value Life above Freedom are destined to lose both.

Jean Louis

2017-04-13 14:31:32 UTC

Post by Sam Steingold

Post by Jean Louis
If I am not mistaken, CANONICALIZE could be used to get an ASCII string
from "Some ČĆŠL ™bigbog file name.html"?

Yes, but you don't really need it.
--8<---------------cut here---------------start------------->8---
(defun drop-non-ascii-chars (string)
  (if (every (lambda (c) (typep c charset:ascii)) string)
      string
      (let ((ascii #.(make-encoding :charset charset:ascii
                                    :input-error-action :ignore
                                    :output-error-action :ignore)))
        (ext:convert-string-from-bytes
         (ext:convert-string-to-bytes string ascii)
         ascii))))
(drop-non-ascii-chars "Some ČĆŠL ™bigbog file name.html")
==> "Some L bigbog file name.html"
--8<---------------cut here---------------end--------------->8---

Thank you very much, I can start from there.

I was under the impression that č could be replaced by c, š by s, and
so on. That is often done for Internet URLs, so that search terms
remain findable. Dropping characters is one solution, but replacing
them with (downgrading to) ASCII equivalents is better.

Jean

Sam Steingold

2017-04-13 16:59:30 UTC

Post by Jean Louis
I was under the impression that č could be replaced by c, š by s, and
so on.

That would be lovely, but I am not sure how to define such a function on
UNICODE characters canonically.

Many UNICODE characters have no meaningful ASCII counterpart altogether
(think oriental hieroglyphs), so they would have to be dropped or
replaced with hex escapes or something...

(this is a relatively common question, and google offers "inconclusive"
answers).
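
For a limited set of letters, one pragmatic approach is a hand-written table plus a fallback for everything else. The sketch below assumes exactly that: the table *ascii-fallbacks* and the function downgrade-to-ascii are hypothetical names, the table covers only a few letters mentioned in this thread, the chosen replacements are not a standard mapping, and characters without an entry are simply dropped (hex escapes would be another option). It also assumes that CHAR-CODE returns Unicode code points and that the source file is read as UTF-8.

```lisp
;; Hypothetical, deliberately tiny table; a real one would need many more
;; entries. The replacements are assumptions, not a canonical mapping.
(defparameter *ascii-fallbacks*
  '((#\Č . "C") (#\č . "c") (#\Ć . "C") (#\ć . "c")
    (#\Š . "S") (#\š . "s") (#\Ž . "Z") (#\ž . "z")
    (#\ß . "ss") (#\ö . "o") (#\ä . "a")))

(defun downgrade-to-ascii (string)
  "Return a copy of STRING with table entries replaced by their ASCII
fallback and every other non-ASCII character dropped."
  (with-output-to-string (out)
    (loop for c across string
          for entry = (assoc c *ascii-fallbacks*)
          do (cond (entry (write-string (cdr entry) out))
                   ;; Plain ASCII passes through unchanged.
                   ((< (char-code c) 128) (write-char c out))
                   ;; Anything else (e.g. the trademark sign) is dropped.
                   ))))

(downgrade-to-ascii "Some ČĆŠL ™bigbog file name.html")
;; => "Some CCSL bigbog file name.html"
```

Using strings rather than characters as replacements allows one-to-many mappings such as ß to "ss".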

--
Sam Steingold (http://sds.podval.org/) on darwin Ns 10.3.1504
http://steingoldpsychology.com http://www.childpsy.net http://think-israel.org
http://iris.org.il http://jij.org http://www.memritv.org
Lisp is not dead, it just smells funny.

Jean Louis

2017-04-13 19:16:57 UTC

Post by Sam Steingold

Post by Jean Louis
I was under the impression that č could be replaced by c, š by s, and
so on.

That would be lovely, but I am not sure how to define such a function on
UNICODE characters canonically.
Many UNICODE characters have no meaningful ASCII counterpart altogether
(think oriental hieroglyphs), so they would have to be dropped or
replaced with hex escapes or something...
(this is a relatively common question, and google offers "inconclusive"
answers).

In practice this is not that difficult.

I found a solution, which is to use iconv:

echo ßöäȧ čekaj malo | iconv -f UTF-8 -t ASCII//TRANSLIT
ssoaa cekaj malo

So I will just make a simple function.

Thank you.

Jean
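
One way such a wrapper around iconv might look is sketched below. It is not taken from the thread: the function name iconv-to-ascii is hypothetical, and the sketch assumes that ASDF/UIOP can be loaded with (require "asdf"), that an iconv binary supporting //TRANSLIT is on the PATH, and that the Lisp's default external format for subprocess I/O is UTF-8. The transliteration itself depends on the iconv implementation and locale, so no particular output is promised here.

```lisp
;; Assumption: ASDF (and with it UIOP) is installed for this Lisp.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require "asdf"))

(defun iconv-to-ascii (string)
  "Pipe STRING through `iconv -f UTF-8 -t ASCII//TRANSLIT` and return
whatever iconv prints as a string. Hypothetical helper."
  (with-input-from-string (in string)
    (uiop:run-program '("iconv" "-f" "UTF-8" "-t" "ASCII//TRANSLIT")
                      :input in          ; stream content is fed to iconv
                      :output :string))) ; collect stdout as a Lisp string

;; Usage (result depends on the installed iconv and the locale):
;; (iconv-to-ascii "čekaj malo")
```

Passing the command as a list of strings avoids shell quoting issues with arbitrary file names.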
