This article collects common idioms and pitfalls in NumPy usage.

Properties of ndarray

axis

Run the following code:

import numpy as np
a = np.array([[1,2],[3,4]])
np.sum(a, axis=0)
np.sum(a, axis=1)

Output:

>>> a
array([[1, 2],
       [3, 4]])
>>> np.sum(a, axis=0)
array([4, 6])
>>> np.sum(a, axis=1)
array([3, 7])
>>>

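axis indicates the axis along which the operation (a sum here, a join for concatenate) is performed, and the length of that axis is what changes. With axis=0, that is the outermost level of the array.
As shown above, axis=0 computes the sum a[:, Y] for every column Y and collapses the X (row) axis. Likewise, axis=1 computes the sum a[X, :] for every row X and collapses the Y (column) axis.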
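
A way to remember the rule: the axis passed to a reduction is the one that disappears from the shape. The sketch below checks this on a 3-D array; the array x and its shape are illustrative choices, not taken from the example above.

```python
import numpy as np

# A 3-D array with distinct axis lengths, so each axis is easy to identify.
x = np.arange(24).reshape(2, 3, 4)

# Reducing along an axis removes that axis from the shape.
shape_axis0 = np.sum(x, axis=0).shape  # axis 0 (length 2) is collapsed
shape_axis1 = np.sum(x, axis=1).shape  # axis 1 (length 3) is collapsed
shape_axis2 = np.sum(x, axis=2).shape  # axis 2 (length 4) is collapsed

print(shape_axis0, shape_axis1, shape_axis2)  # (3, 4) (2, 4) (2, 3)
```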

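Swapping axes

This need comes up often when handling CNN data, for example converting between the NHWC and NCHW layouts. The swapaxes function exchanges two axes. Since NHWC to NCHW permutes three axes, it takes either two swapaxes calls or a single np.transpose with the full permutation.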
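
A minimal sketch of the layout conversion, assuming a hypothetical batch of shape (N, H, W, C) = (2, 4, 5, 3). It contrasts a single swapaxes call, which only exchanges two axes, with np.transpose, which takes the whole permutation.

```python
import numpy as np

# Hypothetical batch: N=2 images, H=4, W=5, C=3 channels (NHWC layout).
nhwc = np.zeros((2, 4, 5, 3))

# swapaxes exchanges exactly two axes: NHWC -> NCWH (H and W end up swapped).
ncwh = np.swapaxes(nhwc, 1, 3)

# transpose takes a full permutation: NHWC -> NCHW.
nchw = np.transpose(nhwc, (0, 3, 1, 2))

# Two swapaxes calls give the same NCHW layout.
nchw_two_swaps = np.swapaxes(np.swapaxes(nhwc, 1, 3), 2, 3)

print(ncwh.shape)  # (2, 3, 5, 4)
print(nchw.shape)  # (2, 3, 4, 5)
```

Both calls return views of the original data, so no pixels are copied until a contiguous array is requested.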

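Flattening

unravel_index maps an index into the flattened array back to the corresponding index in the original multi-dimensional array.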
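
A short sketch of the round trip between a flat index and a multi-dimensional index; the shape and the index 17 are arbitrary example values. np.ravel_multi_index is the inverse operation.

```python
import numpy as np

shape = (2, 3, 4)
x = np.arange(24).reshape(shape)

flat_index = 17
# Flat index -> one index per axis. 17 = 1*12 + 1*4 + 1, so this is (1, 1, 1).
multi_index = np.unravel_index(flat_index, shape)
# Per-axis indices -> flat index again.
back = np.ravel_multi_index(multi_index, shape)

print(multi_index, back)
```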

Order

Arrays in Fortran order and in C order print identically:

aforder = np.array([[1,2,3],[4,5,6]], order="F")
print(aforder, aforder.flags)
acorder = np.array([[1,2,3],[4,5,6]], order="C")
print(acorder, acorder.flags)

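The code above outputs the following (shown as printed by Python 2, which displays the two arguments as a tuple):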

(array([[1, 2, 3],
       [4, 5, 6]]),   C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False)
(array([[1, 2, 3],
       [4, 5, 6]]),   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False)

The axes correspond as well:

print(np.sum(aforder, axis=0), np.sum(acorder, axis=0))
print(np.sum(aforder, axis=1), np.sum(acorder, axis=1))

The code above outputs (again in Python 2 tuple form):

(array([5, 7, 9]), array([5, 7, 9]))
(array([ 6, 15]), array([ 6, 15]))

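Fortran-order and C-order storage are said to differ in read/write performance, but the measurement below shows little difference. One caveat: np.array(..., order="F") on a 1-D list followed by reshape (which defaults to order="C") yields a C-contiguous array, so carray and farray below end up with the same memory layout and the test does not really compare the two orders. A real comparison needs np.asfortranarray(carray) or reshape(..., order="F").

import time

# Compute column sums
R = 1000
C = 1000
carray = np.array([i * 0.001 for i in range(1000000)], order="C").reshape((R, C))
# Caveat: reshape defaults to order="C", so farray is C-contiguous as well
farray = np.array([i * 0.001 for i in range(1000000)], order="F").reshape((R, C))
start = time.time()
s = 0
for t in range(1000):
    s += np.sum(carray[:, 22])
print("C {} elapsed {}".format(s, time.time() - start))
start = time.time()
s = 0
for t in range(1000):
    s += np.sum(farray[:, 22])
print("F {} elapsed {}".format(s, time.time() - start))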

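This gives the results below. Judging by the sums, the first two lines come from the analogous row-sum run (carray[22, :] and farray[22, :], code not shown) and the last two from the column-sum code above: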

C 22499500.0 elapsed 0.00500011444092
F 22499500.0 elapsed 0.00400018692017
C 499522000.0 elapsed 0.00600004196167
F 499522000.0 elapsed 0.00699996948242
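
One thing worth checking in a benchmark like the one above is whether the two arrays really differ in layout. A 1-D array is both C- and F-contiguous, and reshape defaults to order="C", so passing order="F" before reshaping has no lasting effect. The sketch below is a hedged illustration rather than a re-run of the benchmark: it builds a genuinely F-ordered array with np.asfortranarray and inspects flags and strides. The small array named suspect is a hypothetical stand-in for the construction used above.

```python
import numpy as np

R, C = 1000, 1000
carray = np.arange(R * C, dtype=np.float64).reshape((R, C))  # C order
farray = np.asfortranarray(carray)                            # same values, F order

print(carray.flags['C_CONTIGUOUS'], carray.flags['F_CONTIGUOUS'])  # True False
print(farray.flags['C_CONTIGUOUS'], farray.flags['F_CONTIGUOUS'])  # False True

# Strides (bytes to step along each axis) expose the layout directly.
print(carray.strides)  # (8000, 8): stepping along a row is cheap
print(farray.strides)  # (8, 8000): stepping along a column is cheap

# order="F" on a 1-D array, followed by a default reshape, is C order again.
suspect = np.array([0.0] * 6, order="F").reshape((2, 3))
print(suspect.flags['C_CONTIGUOUS'])  # True
```

With arrays built this way, a column slice such as farray[:, 22] is contiguous in memory while carray[:, 22] is strided, which is the situation the timing test was meant to compare.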

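Introduction to dtype

Element types

When passed as an argument to functions such as np.array, dtype specifies the type of the array's elements. NumPy arrays are homogeneous: every element has the same type, which is represented by a dtype object.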

>>> type(np.int8)
<type 'type'>
>>> type(np.dtype(np.int32))
<type 'numpy.dtype'>

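At the same time, np.dtype is callable: it is a class whose constructor builds dtype objects. As the output above shows, scalar types such as np.int8 are ordinary Python types, not instances of np.dtype; np.dtype(np.int8) converts one into a dtype object.

np.dtype(object, align, copy)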

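If align is True, fields are padded so that they are aligned the way a C compiler would lay out a similar struct (it does not seem possible to request a specific byte boundary). If copy is False, the result is a reference to the built-in data-type object.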
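
A small sketch of what align changes; the field names a and b are illustrative. Without alignment the int32 field starts right after the int8 field; with alignment it is moved to a 4-byte boundary and the struct grows accordingly.

```python
import numpy as np

fields = [('a', np.int8), ('b', np.int32)]
packed = np.dtype(fields)               # no padding between fields
aligned = np.dtype(fields, align=True)  # padded like a C struct

print(packed.itemsize, aligned.itemsize)              # 5 8
# fields[name] is a (dtype, byte offset) pair; show the offset of b.
print(packed.fields['b'][1], aligned.fields['b'][1])  # 1 4
```
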
Now for the required parameter object, which is very flexible.

It can be a type:

np.dtype(np.int32)

It can be a string. Here i4 means int32 (a 4-byte integer) and < means little-endian storage:

np.dtype('<i4')

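In addition, a dtype can directly describe arrays or nested structures.

Describing a struct: the dtype below has one field f1, which is itself a struct containing a field f1.

>>> d = np.dtype([('f1', [('f1', np.int16)])])
>>> d.fields
dict_proxy({'f1': (dtype([('f1', '<i2')]), 0)})

Describing a multi-dimensional array:

np.dtype("i4, (2,3)f8")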

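A more interesting case is building a dtype on top of np.void, as the 2-D deduplication trick later in this article does.
Sometimes the elements are still numpy.int64 after astype(int). In that case, following an answer on Stack Overflow, use .item() to obtain a native Python value.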

ndarray methods

Creating/copying arrays

Copying

Both np.array(arr) and np.copy(arr) copy a NumPy array (thanks to the comment section for the correction).

arr = np.array([1,2,3,4,5])
arr2 = np.array(arr)
arr[2] = 999
print(arr2)

The code above outputs:

[1 2 3 4 5]

An array can also be created from its indices with the fromfunction function.

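Creating a random 0/1 array

This is typically used to generate a random mask array. Consider this scenario: we want to apply a transformation f to transform_count randomly chosen positions of the array arr and leave the rest unchanged. One approach is to first apply f to all elements to get the transformed array, then build a 0/1 mask and shuffle it, and finally take transformed wherever the mask is 1 and the original arr elsewhere.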

from functools import reduce
def perform(f, arr, transform_count):
    transformed = f(arr)
    length = reduce(lambda x,y: x * y, arr.shape)
    mask = [1] * transform_count + [0] * (length - transform_count)
    mask = np.array(mask)
    np.random.shuffle(mask)
    mask = mask.reshape(*arr.shape)
    combine = transformed*mask + arr*(1-mask)
    return combine

arr = np.array([1,2,3,4,5,6,7,8])
print(perform(lambda x: x*2, arr, 5))
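
The same idea can be sketched with a boolean mask and np.where, which avoids the arithmetic blend and therefore also works when f returns a different dtype. The names perform_where and rng are hypothetical, introduced only for this illustration; the sketch assumes NumPy 1.17 or later for np.random.default_rng.

```python
import numpy as np

def perform_where(f, arr, transform_count, rng=None):
    """Apply f to exactly transform_count randomly chosen elements of arr."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros(arr.size, dtype=bool)
    # Choose transform_count distinct flat positions to transform.
    chosen = rng.choice(arr.size, size=transform_count, replace=False)
    mask[chosen] = True
    mask = mask.reshape(arr.shape)
    # Take f(arr) where the mask is True and arr elsewhere.
    return np.where(mask, f(arr), arr)

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
result = perform_where(lambda x: x * 2, arr, 5, rng=np.random.default_rng(0))
print(result)
```

Passing a seeded generator makes the choice reproducible, which is handy when the mask is part of a test.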

fromfunction

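This function is rather odd. At first glance, we pass in a function P that takes a coordinate as its argument and returns the value at that coordinate.
But is that what actually happens? Let's run it: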

def P(*X):
    import random
    return random.random()

np.fromfunction(P, (2, 3, 4), dtype=int)

The result is a single random number rather than an array. Printing the arguments that P receives shows why:

def P(*X):
    for x in X:
        print(x)
        print("====")
    return 1

np.fromfunction(P, (2, 3, 4), dtype=int)

Output:

[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]]]
====
[[[0 0 0 0]
  [1 1 1 1]
  [2 2 2 2]]

 [[0 0 0 0]
  [1 1 1 1]
  [2 2 2 2]]]
====
[[[0 1 2 3]
  [0 1 2 3]
  [0 1 2 3]]

 [[0 1 2 3]
  [0 1 2 3]
  [0 1 2 3]]]
====

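So fromfunction calls P only once, passing one whole index array per dimension. How can P be called once per coordinate instead? The answer is to wrap it in np.vectorize: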

def P(*X):
    return X[0] + X[1] + X[2]

np.fromfunction(np.vectorize(P), (2, 3, 4), dtype=int)
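
When the body of P consists only of operations that already work elementwise on arrays, the np.vectorize wrapper is not needed, because the index arrays can be combined directly. The sketch below compares both variants under that assumption.

```python
import numpy as np

def P(i, j, k):
    # i, j, k are whole index arrays of shape (2, 3, 4); + works elementwise.
    return i + j + k

direct = np.fromfunction(P, (2, 3, 4), dtype=int)
wrapped = np.fromfunction(np.vectorize(P), (2, 3, 4), dtype=int)

print(np.array_equal(direct, wrapped))  # True
print(direct[1, 2, 3])                  # 6
```

The vectorized wrapper is only required when P contains logic that cannot handle arrays, such as a Python if on its arguments.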

Adding and removing elements

Joining

This section uses the following definitions:

a = np.array([[1, 2], 
              [3, 4]])
b = np.array([[5, 6]])
c = np.array([[5], [6]])

concatenate

concatenate joins arrays along axis; after joining, the length of that axis changes. If axis is None, the inputs are flattened before being joined.

print(np.concatenate((a, b), axis=0))
print(np.concatenate((a, b), axis=None))
print(len(a), len(np.concatenate((a, b), axis=0)))
print(np.concatenate((a, c), axis=1))
print(np.concatenate((a, c), axis=None))
print(len(a), len(np.concatenate((a, c), axis=1)))

tile

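tile can be seen as syntactic sugar for concatenating an array with itself.
Its second argument gives the number of repetitions along each axis; the following two statements are equivalent:

print(np.concatenate((a, a), axis=0))
print(np.tile(a, (2, 1)))

Both return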

[[1 2]
 [3 4]
 [1 2]
 [3 4]]

The following join on the right instead:

print(np.concatenate((a, a), axis=1))
print(np.tile(a, (1, 2)))

stack

stack adds a new dimension:

np.stack([a, a], axis=0)

This returns

array([[[1, 2],
        [3, 4]],

       [[1, 2],
        [3, 4]]])

Check the shapes:

print(a.shape, np.stack([a, a], axis=0).shape)

The result is

(2, 2) (2, 2, 2)

append

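append resembles push_back on a vector, except that it returns a new array instead of modifying its argument.
As with concatenate, if axis is not specified, the result is always a one-dimensional array.

print(np.append([1, 2, 3], [[[4, 5, 6], [7, 8, 9]]], axis=None))
print(np.append([[1, 2, 3]], [[4, 5, 6], [7, 8, 9]], axis=0))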

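View-related methods

View

Besides copying with np.array and ndarray.copy, NumPy provides a view mechanism for creating another view onto the same data. The dtype argument gives the element type and acts like a reinterpret cast, while type gives the type of the outer container, such as np.ndarray or np.matrix; it must be a subclass of ndarray.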

mat_x = np.array([(1, 2), (3, 4)], dtype=[('a', np.int8), ('b', np.int8)])
mat_y = mat_x.view(dtype=np.int16, type=np.matrix)
mat_y_int8 = mat_x.view(dtype=np.int8, type=np.matrix)
print(mat_y_int8)

The type has become matrix, and when dtype differs from the original, the meaning of the data changes as well. The two views look like this:

>>> mat_y
matrix([[ 513, 1027]], dtype=int16)
>>> mat_y_int8
matrix([[1, 2, 3, 4]], dtype=int8)

A reshaped view still shares memory with the original, so writing through it changes mat_x:

mat_ravel = mat_x.view(dtype=np.int8).reshape(-1)
mat_ravel[0] = 100
print(mat_x)

Output:

[(100, 2) ( 3, 4)]

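flatten, by contrast, creates a copy of the view, so mat_x does not change:

mat_flatten = mat_x.view(dtype=np.int8).flatten()
mat_flatten[0] = 50  # any value that fits in int8; the write goes to the copy
print(mat_x)

Output:

[(100, 2) ( 3, 4)]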

Besides matrix, there is also np.recarray, which gives attribute-style field access similar to namedtuple or pandas:

named_mat_x = mat_x.view(type=np.recarray)
print(named_mat_x.a)

Output:

[100   3]

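Masking

ascontiguousarray

By default an ndarray is in C order, that is, stored row by row, so the elements within a row are contiguous. It follows that slicing can produce an array that is neither C-contiguous nor Fortran-contiguous.

a_cont = np.array([[1,2,3],[4,5,6],[7,8,9]])
a_non_cont = a_cont[:, 1:3]
print(a_non_cont.flags)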

Printing the flags shows that a_non_cont is not contiguous.

C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

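ascontiguousarray returns an array whose underlying storage is contiguous. For a non-contiguous input this is a copy; an input that is already C-contiguous may be returned without copying, so a copy is not guaranteed.

a_non_cont_as_cont = np.ascontiguousarray(a_non_cont)
print(a_non_cont_as_cont.flags)

The flags show that the result is C-contiguous: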

C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

Note also that np.ascontiguousarray always produces C order, not Fortran order. The relevant implementation is shown below; ndmin=1 guarantees that the result has at least one dimension.

# ref: numpy\core\numeric.py
def asarray(a, dtype=None, order=None):
    return array(a, dtype, copy=False, order=order)
def ascontiguousarray(a, dtype=None):
    return array(a, dtype, copy=False, order='C', ndmin=1)

In this case, though, the new array is a copy:

a_non_cont_as_cont[0][1] = 100
print(a_non_cont_as_cont)
print(a_non_cont)

The code above outputs:

[[  2 100]
 [  5   6]
 [  8   9]]
[[2 3]
 [5 6]
 [8 9]]

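Reshape-related methods

arr.shape gives the shape of an array. A one-dimensional array has shape (n,), a two-dimensional array (n, m), and so on. When specifying the shape of a one-dimensional array, a scalar may be passed instead of a 1-tuple.
The size of an ndarray can be changed with the resize function or the resize method, but the method may fail with ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function.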

reshape

a = np.arange(0, 10)
print(a)

Output:

[0 1 2 3 4 5 6 7 8 9]

The new shape must be compatible with the old one; if the total number of elements differs, an error is raised.

a2d = a.reshape((7, 7))

This fails with

ValueError: cannot reshape array of size 10 into shape (7,7)

One dimension may be given as -1, letting NumPy infer its size:

a2d = a.reshape((2, -1))
print(a2d)

Output:

[[0 1 2 3 4]
 [5 6 7 8 9]]

But two or more -1 entries are not allowed:

a3d = a.reshape((2, -1, -1))

This raises

ValueError: can only specify one unknown dimension

Flatten

All of the following flatten an array:

print(a2d.reshape(-1))
print(a2d.ravel())
print(a2d.flatten())
# dtype must be given explicitly, see https://github.com/numpy/numpy/issues/6616
print(np.ndarray(shape=(a2d.size,), buffer=a2d.data, dtype=a2d.dtype))

All of them produce the same output:

[0 1 2 3 4 5 6 7 8 9]

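flatten, ravel, and reshape can all flatten a multi-dimensional array, so what is the difference? The code below shows that flatten returns a copy, while reshape and ravel return views here (they copy only when the memory layout of the input does not allow a view).
The following prints 100, so reshape did not make a copy:

a2d.reshape(-1)[1] = 100
print(a2d[0][1])

The following prints 200, so ravel did not make a copy either:

a2d.ravel()[1] = 200
print(a2d[0][1])

The following still prints 200, so flatten did make a copy:

a2d.flatten()[1] = 300
print(a2d[0][1])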
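
Instead of writing through the result and inspecting the original, np.shares_memory answers the view-or-copy question directly. The sketch below also covers a non-contiguous slice (the name cols is illustrative), where ravel can no longer return a view.

```python
import numpy as np

a2d = np.arange(10).reshape(2, 5)

print(np.shares_memory(a2d, a2d.reshape(-1)))  # True: a view
print(np.shares_memory(a2d, a2d.ravel()))      # True: a view
print(np.shares_memory(a2d, a2d.flatten()))    # False: always a copy

# A column slice is not contiguous, so flattening it has to copy.
cols = a2d[:, 1:3]
print(np.shares_memory(cols, cols.ravel()))    # False
```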

Filter-related methods

Deduplication

Use unique to remove duplicates:

all_points = np.array([1,1,2,2,3,3,4,4])
non_repeated = np.unique(all_points)
print(non_repeated)

Output:

[1 2 3 4]

With return_index=True, unique additionally returns the index of the first occurrence of each unique value:

_, idx = np.unique(all_points, return_index=True)
print(idx)

The result is:

[0 2 4 6]

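2-D deduplication

Note that for a two-dimensional array, deduplication is also element-wise by default:

all_points = np.array([[1,1],[2,2],[2,2],[3,3],[3,3],[4,5]])
non_repeated = np.unique(all_points)
print(non_repeated)

The actual output is:

[1 2 3 4 5]

So what if we want to deduplicate along one dimension, treating each row as a whole? Here is a trick. all_points is a set of points, and we want to remove the duplicate points. The idea is to first reinterpret each row (the second dimension) as one opaque block of bytes:

a = np.ascontiguousarray(all_points.copy())  # copy first and make the memory layout contiguous
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))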

The resulting b has a single column, and each element is the byte encoding of one point.

array([[b'\x01\x00\x00\x00\x01\x00\x00\x00'],  # [1,1]
       [b'\x02\x00\x00\x00\x02\x00\x00\x00'],  # [2,2]
       [b'\x02\x00\x00\x00\x02\x00\x00\x00'],  # [2,2]
       [b'\x03\x00\x00\x00\x03\x00\x00\x00'],  # [3,3]
       [b'\x03\x00\x00\x00\x03\x00\x00\x00'],  # [3,3]
       [b'\x04\x00\x00\x00\x05\x00\x00\x00']], # [4,5]
       dtype='|V8')

Then deduplicate these byte encodings and map the result back:

_, idx = np.unique(b, return_index=True)
all_points = a[idx]

The output is:

array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 5]])
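
Newer NumPy versions (1.13 and later) offer a more direct route: np.unique accepts an axis argument and then treats each slice along that axis as one item. The sketch below assumes such a version and uses the same point set.

```python
import numpy as np

all_points = np.array([[1, 1], [2, 2], [2, 2], [3, 3], [3, 3], [4, 5]])

# axis=0 compares whole rows instead of individual elements.
unique_points, first_idx = np.unique(all_points, axis=0, return_index=True)

print(unique_points)
print(first_idx)  # [0 1 3 5]
```

The rows come back sorted lexicographically, and first_idx gives the position of the first occurrence of each row in the input.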

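Compression

The return_inverse parameter of unique can be used for compression. With it, unique additionally returns an array telling which entry of the unique array each element of the original maps to.

u, idx = np.unique(all_points, return_inverse=True)

After compressing, the following reconstructs an array equal to all_points:

print(u[idx].reshape(all_points.shape))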

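Filtering

Use a boolean mask:

np.array([1,2,3])[np.array([False,True,False])]
np.array([1,2,3])[np.array([0,1,0]).astype(bool)]

Note that an int array gives a different result: it is interpreted as a list of indices, so the following returns array([1, 2, 1]):

np.array([1,2,3])[np.array([0,1,0])]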
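
The difference between the two indexing modes is easier to see with values that differ from their positions. A small sketch, with data and flags as illustrative names:

```python
import numpy as np

data = np.array([10, 20, 30])
flags = np.array([0, 1, 0])

# Integer array: fancy indexing, picks data[0], data[1], data[0].
by_index = data[flags]
# Boolean array: mask, keeps the positions where the flag is True.
by_mask = data[flags.astype(bool)]

print(by_index)  # [10 20 10]
print(by_mask)   # [20]
```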

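Transform-related methods

vectorize

Vectorization is a natural style in functional programming: vectorized code reads intuitively and lends itself to optimization and parallel execution.
NumPy provides the vectorize function to lift a function that is not NumPy-aware so that it can be applied to an ndarray.

class numpy.vectorize(pyfunc, otypes=None, doc=None, excluded=None, cache=False, signature=None)

The vectorized fminus subtracts 100 from every element of the array:

a = np.arange(0, 10)
fminus = lambda x: x - 100
print(np.vectorize(fminus)(a))

Output:

[-100  -99  -98  -97  -96  -95  -94  -93  -92  -91]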

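Adding a dimension: one-hot

We often need to add a dimension, for example one-hot encoding a one-dimensional array into a two-dimensional one. Writing it as below raises an error:

np.vectorize(lambda x: [0,0,0,1,0])(arr_1d)

Instead, the code below does the job; nb_classes is the one-hot depth, that is, the length of the one-hot dimension.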

>>> arr_1d = [1,1,2,3,3]
>>> nb_classes = max(arr_1d) + 1
>>> targets = np.array(arr_1d).reshape(-1)
>>> np.eye(nb_classes)[targets]
array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]])

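A more general way to add dimensions

Is there a more general approach? Someone on Stack Overflow asked whether a one-to-n mapping is possible. vectorize exists mainly to lift non-NumPy-aware functions to NumPy, whereas a function that adds a dimension is necessarily NumPy-aware (it amounts to an ndarray nested in an ndarray). Still, there is a way: the signature parameter.

fminus2order = np.vectorize(lambda x: np.array([x - 100]), signature='()->(n)')
print(fminus2order(a))

Output: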

[[-100]
 [ -99]
 [ -98]
 [ -97]
 [ -96]
 [ -95]
 [ -94]
 [ -93]
 [ -92]
 [ -91]]

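Summation and dimension reduction

On a two-dimensional array, vectorize is also applied element by element. As shown below, applying a scalar function to a two-dimensional array behaves as if the array were flattened, the function applied, and the result reshaped back.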

a = np.arange(0, 10)
fminus = lambda x: x - 100
a2d = a.reshape((5,2))
print(np.vectorize(fminus)(a2d))
'''
[[-100  -99]
 [ -98  -97]
 [ -96  -95]
 [ -94  -93]
 [ -92  -91]]
'''
print(np.vectorize(fminus)(a2d.ravel()).reshape((5,2)))
'''
[[-100  -99]
 [ -98  -97]
 [ -96  -95]
 [ -94  -93]
 [ -92  -91]]
'''

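Sometimes, however, the function needs to operate on a whole row at a time, which signature also supports:

fsum1order = np.vectorize(lambda x: x.sum(), signature='(n)->()')
print(fsum1order(a2d))

The output is the sum of each row:

[ 1  5  9 13 17]

What about sums along other axes? The apply_over_axes function helps here: it lets us specify the axes over which a function is applied.
As shown below, we add up all columns within each row of a two-dimensional array. Passing 1 means that axis 1 is the one being reduced (it is kept with length 1). The result is as expected.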

>>> a
array([[1, 2],
       [3, 4]])
>>> np.apply_over_axes(np.sum, a, [1])
array([[3],
       [7]])

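For a more complex three-dimensional array of shape (2,2,2), suppose we want to keep the middle dimension and aggregate all other dimensions with sum. The result is shown below. See https://pypi.org/project/ipfn/ for a concrete application.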

>>> a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
>>> a.shape
(2, 2, 2)
>>> np.apply_over_axes(np.sum, a, [0,2])
array([[[14],
        [22]]])

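We can also list all axes so that the function is applied over every axis. The function is called once per listed axis, so x+1 is applied twice here: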

>>> np.apply_over_axes(lambda x,axis: x+1, a2d, [0,1])
array([[ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

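Performance of vectorize

np.vectorize is not optimized for performance (it is essentially a Python-level loop), so the second line below is faster than the first. The two agree when arr contains only non-negative integers.

arr = np.vectorize(lambda x: 1 if x > 0 else 0)(arr)
arr[arr > 0] = 1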
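
For inputs that may contain negative values, a comparison or np.where reproduces the vectorize version exactly while staying in compiled code. A minimal sketch; the names slow, fast and also are illustrative:

```python
import numpy as np

arr = np.array([-2, 0, 3, 5])

slow = np.vectorize(lambda x: 1 if x > 0 else 0)(arr)  # Python-level loop
fast = (arr > 0).astype(arr.dtype)                      # comparison, then cast
also = np.where(arr > 0, 1, 0)                          # explicit if/else per element

print(fast)  # [0 0 1 1]
```

np.where is the more general of the two fast forms, since its second and third arguments can be arbitrary arrays rather than the constants 1 and 0.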

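Aggregate-related methods

count

NumPy has no method like np.count(value), which even Python's built-in list provides. Stack Overflow suggests the following approach:

a = np.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts))

The unique function is clearly quite powerful. If you only need the number of non-zero elements, use:

np.count_nonzero(y)

Combined with a mask, it counts the occurrences of a particular value:

np.count_nonzero(y == value)

For large data whose values are non-negative integers in a small range, bincount is another option.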
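
A brief sketch of the bincount variant on the same data as above. Entry i of the result is the number of times the value i occurs, so the result has max(a) + 1 entries.

```python
import numpy as np

a = np.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])

counts = np.bincount(a)            # counts[i] == number of occurrences of i
zeros = np.count_nonzero(a == 0)   # the mask-based count for a single value

print(counts)  # [7 4 1 2 1]
print(zeros)   # 7
```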

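all and any

all with axis specified reduces the dimension. Without arguments it returns a scalar.

a = np.array([[True, True],
              [True, True]])
print(a.all())
print(a.all(axis=0))

If keepdims is specified, the returned array keeps the same number of dimensions as the original.

print(a.all(keepdims=True))
print(a.all(axis=0, keepdims=True))

Consider sum: both statements below add up each row, but one returns a two-dimensional array and the other a one-dimensional array.

print(np.sum(a, axis=1, keepdims=True))
print(np.sum(a, axis=1))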

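argpartition

One application of this function is finding the indices of the want largest elements of an n-dimensional array. The approach: ravel the array to one dimension, use argpartition to partition around the -want-th element, then call unravel_index to recover the indices before flattening.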

def get_top_indices(a, want):
    flat_indices = np.argpartition(a.ravel(), -want)[-want:]
    flat_elements = a.ravel()[flat_indices]
    ans = []
    for i in flat_indices:
        indices = np.unravel_index(i, a.shape)
        ans.append(indices)
    return ans, flat_elements
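
Note that argpartition only guarantees which elements land in the last want positions, not their order. The sketch below is a compact variant under that assumption: it adds an explicit sort of the winners and returns the indices as one array. The name top_k_indices and the sample grid are illustrative, not part of the function above.

```python
import numpy as np

def top_k_indices(a, k):
    """Return a (k, a.ndim) array of indices of the k largest elements, largest first."""
    flat = np.argpartition(a.ravel(), -k)[-k:]
    # argpartition leaves the k winners unordered; sort them by value, descending.
    flat = flat[np.argsort(a.ravel()[flat])[::-1]]
    # unravel_index accepts an array of flat indices and returns one array per axis.
    return np.column_stack(np.unravel_index(flat, a.shape))

grid = np.array([[1, 9, 3],
                 [7, 2, 8]])
top = top_k_indices(grid, 2)
print(top)
# [[0 1]
#  [1 2]]
```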

Broadcasting

print(np.array([3]) * 2)
print(np.array([3]) * np.array([2,4,6]))
print(np.array([3]) * np.array([[2,4,6]]))

But the following code raises an error:

print(np.array([3,7]) * np.array([2,4,6]))

It shows

ValueError: operands could not be broadcast together with shapes (2,) (3,)
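
The rule behind these results: shapes are compared from the last axis backwards, and two lengths are compatible when they are equal or one of them is 1. Lengths 2 and 3 are therefore incompatible, but turning the first operand into a column of shape (2, 1) makes the product well defined. A small sketch, with col and row as illustrative names:

```python
import numpy as np

col = np.array([[3], [7]])   # shape (2, 1)
row = np.array([2, 4, 6])    # shape (3,), treated as (1, 3)

product = col * row          # both operands are broadcast to (2, 3)
print(product)
# [[ 6 12 18]
#  [14 28 42]]
```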

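How ndarray interacts with Python functions

Built-in functions

len
Usually returns the number of rows of the ndarray, that is, the length of its first axis.

zip
Applying zip to two one-dimensional ndarrays and collecting the result with list gives a native list.

z1 = np.array([1,2,3,4,5])
z2 = np.array([6,7,8,9,10])
z = list(zip(z1, z2))
print(z)
print(type(z))

Output:

[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
<class 'list'>

What about zipping two arrays of different dimensionality?

list(zip(np.array([[1,2],[3,4]]), np.array([5,6])))

Output:

[(array([1, 2]), 5), (array([3, 4]), 6)]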

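any and all
Note that NumPy has its own any and all; for example, (A==B).all() tells whether all corresponding elements of A and B are equal.

itertools

import itertools
# itertools.imap exists only in Python 2; starmap over zip is the Python 3 counterpart
print(list(itertools.starmap(lambda x, y: x * y, zip(z1, z2))))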

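Converting to native types

Converting between ndarray and native lists/scalars

Converting to pandas types

See the pandas-related notes.

matplotlib

Within one program, all plots are drawn onto the same figure by default, including calls made through importlib. So use plt.figure(x) to select the figure to draw on, and plt.close(x) to close it.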
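
A minimal sketch of the usual conversions, using only standard ndarray methods: tolist for nested lists, np.asarray for the way back, and item for a single native scalar.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

as_list = arr.tolist()        # nested Python lists of Python ints
back = np.asarray(as_list)    # list -> ndarray again
scalar = arr[0, 1].item()     # single element -> Python int

print(as_list)                # [[1, 2], [3, 4]]
print(type(scalar).__name__)  # int
```

Plain list(arr) is not equivalent to tolist: it only unpacks the first axis, so the elements remain NumPy arrays or NumPy scalars.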
