Welcome! Please see the About page for a little more info on how this works.

0 votes
in Clojure by
The proposal is that the trim functions (trim, triml, and trimr) would get a second arity with a function {{trim?}}:


[trim? ^CharSequence s]


{{trim?}} comes first to support partial.

New doc string would be:


"Removes characters from both ends of string.
 If trim? is omitted white space is removed. When supplied it accepts
 a character and returns true if the character should be removed."


Example test:


(deftest t-trim
  (is (= "foo" (s/trim "  foo  \r\n")))
  (is (= "bar" (s/trim "\u2000bar\t \u2002")))

  ;; Additional test
  (is (= "bar" (s/trim "$%#\u2000bar\t \u2002%$#"
                       #(or (Character/isWhitespace %) ((set "$#%") %))))))



Similar to Python's strip - https://docs.python.org/2/library/stdtypes.html#str.strip

*Approach:* The proposed solution isn't very DRY but it follows the design guidelines at the top of the file, more exactly point 3:

"3. Functions take advantage of String implementation details to
   write high-performing loop/recurs instead of using higher-order
   functions. (This is not idiomatic in general-purpose application
   code.)"

First I had a solution in which I replaced Character/isWhitespace from the current implementation by calling pred. pred was defaulted to an is-whitespace? function.
That code is of course nicer, even trim-newline could just call into trimr, removing a lot of duplication, but it adds the overhead of always calling a function, instead of calling Character/isWhitespace directly.

The only way I can see to have optimised and DRYer code is to use macros, but I don't think that it will necessary lead to nicer code.

Given the existing design style of the other functions in string.clj I felt that the best solution would be to just simply duplicate in favour of optimised code.

7 Answers

0 votes
by

Comment made by: sztamas

Proposed solution. Code + tests.

0 votes
by

Comment made by: sztamas

Added new patch that renames pred to trim?

0 votes
by

Comment made by: jafingerhut

Note that Java, and thus Clojure/Java, uses UTF-16 encoding for strings in memory. Thus if you wanted to trim a set of Unicode code points from the beginning and/or end of a string, the API of trim? taking a single 16-bit Java character is not enough information to determine whether it should be trimmed or not.

If you want to handle that generality, it would require a more complex implementation, which checks whether the first/last character is one half of a code point that is encoded as 2 16-bit Java characters, and pass a 32-bit int to trim?, or something similar to that.

I have no objections if these API enhancements are made without enabling testing against an arbitrary Unicode code point. In the past, similar suggestions have been rejected in Clojure's built-in lib, e.g. CLJ-945

0 votes
by
_Comment made by: sztamas_

Yes, the UTF-16 encoding and Character representing either a codepoint or a half-codepoint is a bit of a mess, isn't it?

In the Java String and Character API's the methods that accept char, handle only characters in the Basic Multilingual Plane.
trim? accepts a character, so following the same behavior it will work only for removing characters in the Basic Multilingual Plane.

I think even this would be fine, but additionally because the high/low surrogates and the BMP characters are disjoint, you could actually use the same implementation  to remove Unicode code points that aren't in the BMP. You can just say that both the high and low code unit of the codepoint are "unwanted".

Ex:
0 votes
by

Comment made by: jafingerhut

Agreed, but probably better to anti-recommend such an implementation of trimr for removing such things, because it would also remove only one UTF-16 Java character out of 2 high/low surrogates if it matched a member of the set, even if the other surrogate didn't match anything in the set, which would leave behind a malformed UTF-16 string.

Again, probably best to either not include this in the implementation at all, and at most warn about it in the docs, or to handle it in the implementation by checking for high/low surrogates in the loop(s).

0 votes
by

Comment made by: sztamas

Yes, you're right. That solution won't work in all cases, so it can't be recommended.

I am slightly inclined towards having trim? accept chars and work only for removing BMP characters. This will arguably be enough for the majority of the use cases.
The other solution can be used for all use cases, but then trim? will have to accept int, or 2 chars, or a string, so trim? would be less intuitive (although closer to the real world :-)), and writing those trim? functions would be less user friendly.

That being said, I am happy to change the implementation to do that if it is required.

Currently, I'm not even sure if the enhancement will be accepted or rejected or what the process for that is.

0 votes
by
Reference: https://clojure.atlassian.net/browse/CLJ-1889 (reported by sztamas)
...