python - Different results when sorting a character vector

2024-03-24 10:07:46

I would like to know how R's sorting algorithm works when it sorts a character vector.

a = c("aa(150)", "aa(1)S")
sort(a)
# [1] "aa(150)" "aa(1)S" 
a = c("aa(150)", "aa(1)")
sort(a)
# [1] "aa(1)" "aa(150)"

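Doesn't R compare the integer values of the characters one by one, from left to right? Why can appending a character change the result?

I expected the order to be decided by the characters "5" and ")", with everything after them ignored.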

For comparison, in Python:

In [1]: a=["aa(150)","aa(1)"]
In [2]: sorted(a)
Out[2]: ['aa(1)', 'aa(150)']
In [3]: a=["aa(150)","aa(1)S"]
In [4]: sorted(a)
Out[4]: ['aa(1)S', 'aa(150)']
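
The Python result matches the left-to-right expectation because Python compares str values by Unicode code point. The sketch below only makes that comparison explicit for the two strings from the question; the variable name first_diff is introduced here for illustration.

```python
# Python compares str values code point by code point, left to right.
a = ["aa(150)", "aa(1)S"]

# The first four characters "aa(1" are identical, so index 4 decides.
first_diff = next(i for i, (x, y) in enumerate(zip(*a)) if x != y)
print(first_diff, a[0][first_diff], a[1][first_diff])  # 4 5 )
print(ord("5"), ord(")"))                              # 53 41

# ")" (41) is smaller than "5" (53), so "aa(1)S" sorts first.
print(sorted(a))  # ['aa(1)S', 'aa(150)']
```

The trailing "S" never takes part in the comparison, which is exactly the behaviour the question expected from R.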

Solution:

In most cases, setting the collation locale to "C" turns off locale-specific sorting:

Sys.setlocale("LC_COLLATE", "C")
a=c("aa(150)","aa(1)S")
sort(a)
#[1] "aa(1)S"  "aa(150)"
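
For readers coming from the Python side, roughly the same switch exists there. This is a minimal sketch, assuming the standard locale module and that sorting with locale.strxfrm as the key is an acceptable stand-in for what R does internally; it is not a description of R's implementation.

```python
import locale

# Switch only the collation category to the portable "C" locale,
# analogous to Sys.setlocale("LC_COLLATE", "C") in R.
locale.setlocale(locale.LC_COLLATE, "C")

a = ["aa(150)", "aa(1)S"]

# locale.strxfrm builds a sort key that follows the active locale's collation rules.
result = sorted(a, key=locale.strxfrm)
print(result)  # ['aa(1)S', 'aa(150)']
```

With another locale name in place of "C" the order may differ, depending on the collation tables installed on the system.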

Because languages differ, string collation rules have to be locale-specific. From the help page ?sort:

The sort order for character vectors will depend on the collating
sequence of the locale in use: see Comparison.

We can then go to ?Comparison, which says:

Comparison of strings in character vectors is lexicographic within the
strings using the collating sequence of the locale in use: see
locales. The collating sequence of locales such as en_US is normally
different from C (which should use ASCII) and can be surprising.
Beware of making any assumptions about the collation order: e.g. in
Estonian Z comes between S and T, and collation is not necessarily
character-by-character – in Danish aa sorts as a single letter, after
z. In Welsh ng may or may not be a single sorting unit: if it is it
follows g.

As noted above, each language uses letters differently, so the locale is crucial for sorting.
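
To see how a collation that is not character-by-character could produce the R output from the question, here is a toy model in Python. It rests on an assumption: that the locale in use gives punctuation lower priority than letters and digits. The function toy_collation_key is hypothetical and is not the algorithm R or any real locale uses; real collations rely on per-locale weight tables.

```python
import string

def toy_collation_key(s):
    # Hypothetical two-level key: compare letters and digits first,
    # and fall back to the full string only to break ties.
    primary = "".join(ch for ch in s if ch not in string.punctuation)
    return (primary, s)

# Primary keys are "aa150" and "aa1S"; "5" sorts before "S".
print(sorted(["aa(150)", "aa(1)S"], key=toy_collation_key))  # ['aa(150)', 'aa(1)S']

# Primary keys are "aa150" and "aa1"; the shorter prefix sorts first.
print(sorted(["aa(150)", "aa(1)"], key=toy_collation_key))   # ['aa(1)', 'aa(150)']
```

Under this assumption the appended "S" does matter, because once the parentheses drop out of the first comparison it is "5" versus "S" that decides the order.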
