Re: New strsplit function
From: Ben Abbott
Subject: Re: New strsplit function
Date: Thu, 16 May 2013 14:52:44 +0800
On May 16, 2013, at 2:39 PM, John W. Eaton wrote:
> On 05/16/2013 02:19 AM, Ben Abbott wrote:
>
>> hmmm ... I took a look at Matlab 2013a. It's not clear to me that we'd want
>> to copy this.
>>
>
> Well, Matlab users apparently want compatibility here. That's why I
> received the report.
>
>> matlab> strsplit('', 'a')
>>
>> ans =
>>
>> {''}
>>
>> matlab> strsplit('a', 'a')
>>
>> ans =
>>
>> '' ''
>>
>> matlab> strsplit('aa', 'a')
>>
>> ans =
>>
>> '' ''
>>
>> matlab> strsplit('aaa', 'a')
>>
>> ans =
>>
>> '' ''
>>
>> matlab> strsplit('aaaa', 'a')
>>
>> ans =
>>
>> '' ''
>>
>> matlab> strsplit ('abc', {'a','b','c'})
>>
>> ans =
>>
>> '' ''
>>
>> In case it isn't clear, the output is a cellstring containing two empty
>> strings.
>
> Oh, so collapsedelimiters means that if multiple consecutive delimiters
> appear in the string that is being split, they should be treated as
> one?
That is my understanding. A moment ago, it occurred to me that I should check
how regexp () behaves.
octave> regexp ('aaaaa', '(a)+', 'split')
ans =
{
[1,1] =
[1,2] =
}
octave> strsplit ('aaaaa', 'a', 'delimitertype', 'regularexpression')
ans =
{
[1,1] =
[1,2] =
}
So it looks unlikely that this is a Matlab bug. Instead, it was a
misunderstanding on my part.
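As a cross-check on the cell-array case quoted above, here is a minimal sketch that builds the same kind of pattern by hand. It assumes that joining the delimiters with "|" and wrapping them in "( )+" is a fair model of what collapsing delimiters means. That is my reading, not something taken from the strsplit source, and the variable names are illustrative only.

```octave
## Delimiters from the strsplit ('abc', {'a','b','c'}) example.
delims = {'a', 'b', 'c'};

## Join with "|" (assumes the delimiters contain no regexp metacharacters).
alternation = sprintf ('%s|', delims{:});
alternation(end) = [];            # drop the trailing "|"
pattern = ['(', alternation, ')+'];

## A single run of delimiters spans the whole string, so the split leaves
## one empty piece before the run and one after it.
parts = regexp ('abc', pattern, 'split');
printf ('%d pieces, all empty: %d\n', numel (parts), all (cellfun (@isempty, parts)));
```

If that model is right, a string made up entirely of delimiters always splits into exactly two empty strings, however long it is.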
> Then I think my guess about what was happening was wrong, and the
> behavior above is correct. If the string is 'aa' and the delimiter is
> 'a', then it is the same as strsplit ('a', 'a') and the result should
> be two empty strings (one for before and one for after the
> delimiter). That's the result we used to get for the simpler case of
> strsplit ('a', 'a'). Now we get an empty cell array, which looks
> wrong to me.
ahhh ... ok, that makes sense to me!
> So in this code
>
>     ## Get substring lengths.
>     if (isempty (idx))
>       strlens = length (str);
>     else
>       strlens = [idx(1)-1, diff(idx)-1, numel(str)-idx(end)];
>     endif
>     if (nargout > 1)
>       ## Grab the separators
>       matches = num2cell (str(idx)(:)).';
>       if (args.collapsedelimiters)
>         ## Collapse the consecutive delimiters
>         ## TODO - is there a vectorized way?
>         for m = numel(matches):-1:2
>           if (strlens(m) == 0)
>             matches{m-1} = [matches{m-1:m}];
>             matches(m) = [];
>           endif
>         end
>       endif
>     endif
>     ## Remove separators.
>     str(idx) = [];
>     if (args.collapsedelimiters)
>       ## Omit zero lengths.
>       strlens = strlens(strlens != 0);
>     endif
>
>     ## Convert!
>     result = mat2cell (str, 1, strlens);
>
> it seems like we should be performing the "omit zero lengths" part on
> the output of diff, then tacking on the beginning and ending strings.
> But I don't understand what the "if (nargout > 1)" part in between is
> doing.
The (nargout > 1) part is there to allow the block to be skipped if "matches"
(the 2nd output) isn't requested. I'll take a look at your suggested change.
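For reference, a minimal sketch of how I read the suggestion: drop the zero lengths from the output of diff only, then tack on the leading and trailing lengths unconditionally. The helper name split_collapse is hypothetical and not part of strsplit.m. It assumes a non-empty row-vector string and a single-character delimiter, and it ignores the "matches" output entirely.

```octave
1;  # script-file marker so the function below can be defined inline

## Hypothetical helper (not part of strsplit.m): single-character
## delimiter, delimiters always collapsed, str assumed non-empty.
function result = split_collapse (str, delim)
  idx = find (str == delim);
  if (isempty (idx))
    strlens = length (str);
  else
    ## Lengths of the pieces strictly between delimiters.  A zero here
    ## means two adjacent delimiters, so drop it.
    inner = diff (idx) - 1;
    inner = inner(inner != 0);
    ## Always keep the leading and trailing pieces, even when empty.
    strlens = [idx(1)-1, inner, numel(str)-idx(end)];
  endif
  ## Remove separators, then cut what is left into pieces.
  str(idx) = [];
  result = mat2cell (str, 1, strlens);
endfunction

r1 = split_collapse ('a', 'a');
r2 = split_collapse ('aa', 'a');
r3 = split_collapse ('abaac', 'a');
```

With this ordering, 'a' and 'aa' should both come back as two empty strings, and 'abaac' as an empty string followed by 'b' and 'c'.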
Ben