Artificial Intelligence in Programming (Part 1)


Preface

This article covers some artificial-intelligence topics in Python, touching only the shallowest layer of Bayesian theory. If you already have a solid AI background, this article is probably not for you; after all, it is just one beginner's study notes. We will set the deeper theory of how Bayes works aside and start from the simplest ideas. If you haven't learned basic Python syntax yet, learn that first.

Friendly reminder: unauthorized copying of any kind is forbidden!

Classification and Prediction with Multinomial Naive Bayes

To briefly summarize the shallow-level Bayesian workflow as I have encountered it: the machine keeps reading, learning from, and analyzing data, accumulating experience so that it can classify and predict how something that has not yet happened will unfold.

Importing the libraries

from pre import * # course helper module for standardization and normalization
from sklearn.naive_bayes import * # scikit-learn's naive Bayes classes

Multinomial Naive Bayes

c = MultinomialNB() # NB is short for naive Bayes

Training the model

c.fit(x_train,y_train) # note: fit is a method call with a dot, not an assignment

Evaluating the model

print(c.score(x_test,y_test)) # again a method call; score returns the mean accuracy on the test set

Making predictions

res = c.predict(t_word) # note: predict takes only the features, no labels
show(res) # display the prediction results (show is a helper from the pre module)

The five variables

x_train = x_train_get() # training features
y_train = y_train_get() # training labels
x_test = x_test_get() # test features
y_test = y_test_get() # test labels
t_word = t_word_get() # features to predict on (all five getters come from the pre module)

Full code

from pre import *
from sklearn.naive_bayes import *

x_train = x_train_get()
y_train = y_train_get()
x_test = x_test_get()
y_test = y_test_get()
t_word = t_word_get()

c = MultinomialNB()
c.fit(x_train,y_train)
print(c.score(x_test,y_test))
res = c.predict(t_word)
show(res)
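
Since pre is a course-specific helper module whose internals we never see, here is a self-contained sketch of the same five-variable flow using only scikit-learn. The toy corpus and every name in it are illustrative assumptions, not part of the original course material:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# tiny made-up corpus standing in for the course data
texts = ["great movie", "awful film", "loved it", "hated it",
         "wonderful acting", "terrible plot", "really great", "so awful"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

cv = CountVectorizer()                        # word-count features suit MultinomialNB
x_all = cv.fit_transform(texts)
x_train, x_test, y_train, y_test = train_test_split(x_all, labels, random_state = 0)

c = MultinomialNB()
c.fit(x_train, y_train)                       # train
print(c.score(x_test, y_test))                # evaluate
t_word = cv.transform(["what a great film"])  # encode new text the same way
print(c.predict(t_word))                      # predict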

Analyzing Data with Gaussian Naive Bayes

Classification

Bayes classification is a machine-learning method that computes, for one and the same input, the probability of each possible outcome, and picks the most probable outcome as its answer.
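
In symbols (a general fact about naive Bayes, not specific to this course): given features $x_1,\dots,x_n$, the classifier scores each class $y$ with Bayes' theorem under the feature-independence assumption and returns the highest-scoring class:

$$\hat{y} = \arg\max_{y} \; P(y)\prod_{i=1}^{n} P(x_i \mid y)$$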

Features

A feature is the most representative part of a piece of information, such as size, sign, quantity, or keywords.

A Bayes classifier analyzes the features of the input and computes with them against the outcomes it has already seen.

Gaussian Naive Bayes

c = GaussianNB()

Gaussian naive Bayes suits continuous numeric data; multinomial naive Bayes is better suited to discrete data such as text.
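
Concretely, GaussianNB models each continuous feature within a class as a normal distribution, with per-class mean $\mu_y$ and variance $\sigma_y^2$ estimated from the training data:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

which is why it fits continuous measurements, while MultinomialNB models count-like features such as word frequencies.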

Reading the spreadsheet

df = pd.read_excel('data2.xlsx')

The data-splitting function

data = train_test_split(df[['长度', '宽度', '步长', '步宽', '边缘是否完整', '型号']], df['性别'])
# the Chinese column names match the headers in data2.xlsx
# (roughly: length, width, step length, step width, edge intact, model); the target 性别 is sex

This function shuffles and splits the data randomly; note in particular that it comes from the sklearn.model_selection library.

Storing the split data

x_train = data[0]
x_test = data[1]
y_train = data[2]
y_test = data[3]
# features (x) come before labels (y); within each pair, train comes before test.
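
Equivalently, and a bit more idiomatically, train_test_split can be unpacked in one line (same order, same behavior):

x_train, x_test, y_train, y_test = train_test_split(
    df[['长度', '宽度', '步长', '步宽', '边缘是否完整', '型号']], df['性别'])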

Displaying probabilities

After fitting the model with c.fit(x_train, y_train) (see the full code below), we can ask for the probability of each class for a new sample:

real = [[27.0,10.1250,76.950,17.5,0,2.0]] # one sample, same six features in the same order
pro = c.predict_proba(real) # probability of each class
print(pro)
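
The columns of predict_proba follow the order of c.classes_, so pairing them up makes the output readable (a small usage sketch):

for label, p in zip(c.classes_, pro[0]):   # pro[0]: probabilities for the one sample
    print(label, p)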

Full code

from sklearn.naive_bayes import *
from sklearn.model_selection import *
import pandas as pd
c = GaussianNB()
df = pd.read_excel('data2.xlsx')
data = train_test_split(df[['长度', '宽度', '步长', '步宽', '边缘是否完整', '型号']], df['性别'])
                      # pattern: df[[feature columns]], df[label column]
x_train = data[0]
x_test = data[1]
y_train = data[2]
y_test = data[3]
c.fit(x_train, y_train)
real = [[27.0,10.1250,76.950,17.5,0,2.0]]
res = c.predict(real)
print(res)
pro = c.predict_proba(real)
print(pro)

Data Preprocessing

A single Bayes model cannot digest numeric and text information at the same time, so we first preprocess the data to turn everything into numbers.

\texttt{numpy}

import numpy as np

Reading the data table

df = pd.read_csv("adult.csv", header = None)
# note: see the Smart Visualization section for a detailed explanation of pivot tables

The \texttt{empty} function

data = np.empty((h,l)) # h rows and l columns; note the dot between np and empty, and the two pairs of parentheses
data = np.empty(df.shape) # an equivalent way to get the same shape as df

The preprocessing loop

LE = [] # collects one encoder per column (LabelEncoder comes from sklearn.preprocessing)
for i in df.columns: # header=None gives integer column labels, so i indexes both df and data
    encoder = None
    if df[i].dtype == object: # text column
        encoder = LabelEncoder()
        data[:,i] = encoder.fit_transform(df[i])
    else: # numeric column
        data[:,i] = df[i]
    LE.append(encoder)

Text columns get encoded into numbers; numeric columns are copied through unchanged.
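
As a tiny illustration of what LabelEncoder does (the sample values are made up; note that classes are numbered in sorted order):

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
print(encoder.fit_transform([' Male', ' Female', ' Male']))  # [1 0 1]
print(encoder.classes_)                                      # [' Female' ' Male']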

Splitting the data

split_data = train_test_split(data[:, :-1], data[:, -1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]

train_test_split returns 4 pieces, in this order: training features, test features, training labels, test labels.

Collecting input and splicing it in

info1 = int(input("Enter the person's age: "))
info2 = input("Enter the person's type of work: ")
info3 = input("Enter the person's education level: ")
info4 = input("Enter the person's sex: ")
t_word[0] = info1        # column 0 of the adult data is age
t_word[1] = ' ' + info2  # column 1: workclass
t_word[3] = ' ' + info3  # column 3: education
t_word[9] = ' ' + info4  # column 9: sex (text values in this data carry a leading space, hence ' ' +)
t_word = pd.DataFrame(t_word)
print(t_word.shape)

Mind the DataFrame step: the encoding loop that follows expects tabular (DataFrame) input.

Template: predicting annual income

import pandas as pd
import numpy as np
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.preprocessing import *

df = pd.read_csv("adult.csv", header = None)

LE = [] # holds one encoder per column
data = np.empty(df.shape)

for i in df.columns:
    encoder = None
    if df[i].dtype == object:
        encoder = LabelEncoder()
        data[:, i] = encoder.fit_transform(df[i])
    else:  
        data[:, i] = df[i]
    LE.append(encoder)

split_data = train_test_split(data[:, :-1], data[:,-1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]

c = GaussianNB()
c.fit(x_train, y_train)
print(c.score(x_test, y_test))

t_word = pd.read_csv("pred.csv", header = None)
info1 = int(input("Enter the person's age: "))
info2 = input("Enter the person's type of work: ")
info3 = input("Enter the person's education level: ")
info4 = input("Enter the person's sex: ")
t_word[0] = info1        # adult column 0: age
t_word[1] = ' ' + info2  # column 1: workclass
t_word[3] = ' ' + info3  # column 3: education
t_word[9] = ' ' + info4  # column 9: sex
t_word = pd.DataFrame(t_word)
print(t_word.shape)
t_word2 = np.empty(t_word.shape)
for i in t_word.columns:
    encoder = None
    if t_word[i].dtype == object: 
        encoder = LE[i]
        t_word2[:, i] = encoder.transform(t_word[i])
    else:  
        t_word2[:, i] = t_word[i]
res = c.predict(t_word2)
print(res)
if res[0] == 1:  # res is an array; test its single element
    print("annual income > 50K")
else:
    print("annual income <= 50K")

At this point we have the overall workflow for this kind of problem (the flowchart image is omitted here): read the data, preprocess, split, train, evaluate, predict.

The preprocessing stage can itself be broken down into labeling, merging the data into a table, encoding, and so on; we skip those details here.

Feature Extraction

\texttt{jieba}

import jieba

# split each corpus into a list of words
with open("libai.txt", "r", encoding = 'utf-8') as txt:
    text1 = txt.read()
    list1 = jieba.lcut(text1)

with open("dufu.txt", "r", encoding = 'utf-8') as txt:
    text2 = txt.read()
    list2 = jieba.lcut(text2)

jieba can also be used to build word clouds: its core job is cutting sentences into words and finding the most frequent ones, which usually capture the key points of a text.
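
As a quick sketch of that idea, counting word frequencies only needs the standard library's Counter (the sample line of verse is arbitrary):

import jieba
from collections import Counter

words = jieba.lcut("床前明月光,疑是地上霜,举头望明月,低头思故乡")
print(Counter(words).most_common(3))  # the three most frequent tokens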

Labeling

# label each word from the two poets
lb_features = []
for word in list1:
    lb_features.append((word, "lb"))  # lb = Li Bai

df_features = []
for word in list2:
    df_features.append((word, "df"))  # df = Du Fu

A list can hold all kinds of data structures, including tuples; note the two pairs of parentheses (one for the append call, one for the tuple)!

The result looks like this (each element is a (word, label) pair):

lb_features = [('word 1', 'lb'), ('word 2', 'lb'), ..., ('word n', 'lb')]
df_features = [('word 1', 'df'), ('word 2', 'df'), ..., ('word n', 'df')]

Merging the data

# merge the two lists
words = lb_features + df_features
# convert to a table
df = pd.DataFrame(words)

The \texttt{TfidfVectorizer} encoder

tv = TfidfVectorizer() # create the encoder
x_train = tv.fit_transform(x_train) # learn the vocabulary and IDF weights from the training text
x_test = tv.transform(x_test) # reuse the same vocabulary on the test text

Template: distinguishing Lu You's poems from Su Shi's

import pandas as pd
import numpy as np
import jieba
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.feature_extraction.text import *

with open("sushi.txt", "r", encoding = 'utf-8') as txt:
    text1 = txt.read()
    list1 = jieba.lcut(text1)

with open("luyou.txt", "r", encoding = 'utf-8') as txt:
    text2 = txt.read()
    list2 = jieba.lcut(text2)

ss_features = []
for ss in list1:
    ss_features.append((ss,"ss")) # ss = Su Shi

ly_features = []
for ly in list2:
    ly_features.append((ly,"ly")) # ly = Lu You

words = ss_features + ly_features
df = pd.DataFrame(words) 

split_data = train_test_split(df[0], df[1], random_state = 3)

x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]

tv = TfidfVectorizer()
x_train = tv.fit_transform(x_train)
x_test = tv.transform(x_test)

clt = MultinomialNB()

clt.fit(x_train, y_train)
print(clt.score(x_test, y_test))

poem = "两两归鸿欲破群"
poem_features = jieba.lcut(poem)

print(poem_features)
df2 = pd.DataFrame(poem_features) 
t_word = tv.transform(df2[0])
res = clt.predict(t_word)
print(res)
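
res holds one predicted label per word of the poem. The template stops at printing it; one simple (purely illustrative) way to collapse the per-word labels into a single verdict is a majority vote:

from collections import Counter
print(Counter(res).most_common(1)[0][0])  # the label assigned to the most words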

Smart Visualization

Pivot tables

A pivot table is a tool for analyzing and summarizing multi-dimensional data, with powerful filtering, grouping, aggregation, and display features; it is one of the most useful analysis tools in Excel. For office automation, pandas's pivot_table(), combined with a suitable aggregation function, reproduces most of that power and makes statistical analysis faster and more convenient.

Creating and using a pivot table

df = pd.read_csv("train.csv")
pivot = pd.pivot_table(df,index = ["Sex",...], values = ["Survived"])
print(pivot)
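
To make the aggregation concrete, here is a minimal sketch with made-up data; by default pivot_table takes the mean of the values column within each index group (aggfunc = 'mean'):

import pandas as pd

demo = pd.DataFrame({"Sex": ["male", "female", "male", "female"],
                     "Survived": [0, 1, 1, 1]})
print(pd.pivot_table(demo, index = ["Sex"], values = ["Survived"]))
# female -> 1.0, male -> 0.5 (mean survival rate per group)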

A pivot table's \texttt{head} and \texttt{tail} methods

Note that a pivot table is essentially just a table (a DataFrame), so it has head and tail methods (the argument sets how many rows from the top or bottom to show).

print(pivot.head(20))
print(pivot.tail(20))

The \texttt{drop} function for removing data

Sometimes a table contains junk columns that shouldn't be there; df.drop() removes them.

df = df.drop(['Name', 'Ticket', ...], axis = 1)

Predicting Titanic passenger survival

import pandas as pd
import numpy as np
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.preprocessing import *

df = pd.read_csv("train.csv")
print(df.shape)
print(df)
print(df.columns)
pivot = pd.pivot_table(df, index = ['Age'], values = ['Survived'], margins = True)
print(pivot.head(20))
print(pivot.tail(20))
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
# Put the label last, so that data[:, -1] below really is Survived (otherwise
# the model would be trained to predict Embarked, the original last column)
# and so the encoder list lines up with the prediction columns later.
df = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']]

LE = []
data = np.empty(df.shape)
j = 0
for i in df.columns:
    encoder = None
    if df[i].dtype == object:
        encoder = LabelEncoder()
        # astype(str) turns NaN into the string 'nan' so missing values can be encoded too
        data[:, j] = encoder.fit_transform(df[i].astype(str))
        j = j + 1
    else:
        data[:, j] = df[i]
        j = j + 1
    LE.append(encoder)

split_data = train_test_split(data[:, :-1], data[:,-1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]

gaussianNB = GaussianNB()
gaussianNB.fit(x_train, y_train)
print(gaussianNB.score(x_test, y_test))

t_word = pd.read_csv("pred.csv")
t_word = t_word.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)

info1 = int(input("Enter your cabin class (1, 2 or 3): "))
info2 = input("Enter your sex (male/female): ")
info3 = input("Enter your age: ")
info4 = input("Enter your ticket fare: ")
t_word['Pclass'] = info1
t_word['Sex'] = info2
t_word['Age'] = int(info3)
t_word['Fare'] = float(info4)

t_word2 = np.empty(t_word.shape)
j = 0
for i in t_word.columns:
    encoder = None
    if t_word[i].dtype == object:
        encoder = LE[j]
        t_word2[:, j] = encoder.transform(t_word[i].astype(str))
        j = j + 1
    else:
        t_word2[:, j] = t_word[i]
        j = j + 1

print(gaussianNB.predict(t_word2))

Hands-on Exercise 1

After all that theory, let's work through a very common everyday problem together!

Keywords

Within a whole passage, only the feature words carry the key information.

Labeling

Here we merge the data into one list, giving each item its own label.

# variable names here are carried over from the earlier poem example
for good in list1:
    lb_features.append((good, "good"))
for bad in list2:
    df_features.append((bad, "bad"))

Converting to a table

# concatenate
words = lb_features + df_features
# convert to a table
df = pd.DataFrame(words)

The model can't analyze data sitting in a plain list, so we convert the list into a table.

Training features, test features, training labels, test labels

split_data = train_test_split(df[0], df[1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]

This ordering is the key point; make sure to remember it.

Choosing and training a suitable model

c = MultinomialNB()
c.fit(x_train,y_train)
print(c.score(x_test, y_test))

Note that multinomial naive Bayes is the right choice here, because the features are TF-IDF text features rather than continuous numbers.

Template: detecting abusive comments

import pandas as pd
import numpy as np
import jieba
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.feature_extraction.text import *

with open("good.txt", "r", encoding = 'utf-8') as txt:
    text1 = txt.read()
    list1 = jieba.lcut(text1)

with open("bad.txt", "r", encoding = 'utf-8') as txt:
    text2 = txt.read()
    list2 = jieba.lcut(text2)

lb_features = []  # variable names carried over from the poem example
for word in list1:
    lb_features.append((word, "good"))

df_features = []
for word in list2:
    df_features.append((word, "bad"))

words = lb_features + df_features

df = pd.DataFrame(words) 

split_data = train_test_split(df[0], df[1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]

tv = TfidfVectorizer()
x_train = tv.fit_transform(x_train)
x_test = tv.transform(x_test)

c = MultinomialNB()
c.fit(x_train,y_train)
print(c.score(x_test, y_test))

poem = "这么丑也好意思出来"  # the sample comment to classify
poem_features = jieba.lcut(poem)
print(poem_features)

df2 = pd.DataFrame(poem_features) 
t_word = tv.transform(df2[0])
res = c.predict(t_word)
print(res)

bad_count = 0  # renamed from sum to avoid shadowing the built-in
for i in res:
    if i == "bad":
        bad_count = bad_count + 1
if bad_count / len(res) >= 0.2:  # the author's threshold: at least 20% "bad" words
    print("abusive comment")
else:
    print("normal")

Hands-on Exercise 2

This part only gives the code for a worked example and leaves the analysis to you. It is for reference only and not recommended for real-world use.

Crime-prediction template

import pandas as pd
import numpy as np
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.preprocessing import *
from sklearn.metrics import *

df = pd.read_csv("train2.csv")
print(df.shape)
print(df)
print(df.columns)

pivot = pd.pivot_table(df, index = ['Resolution'], values = ['Crime'])
print(pivot)
df = df.drop(['X', 'Y', 'Category'], axis = 1)

label_encoder = []
data = np.empty(df.shape)
j = 0
for i in df.columns:
    encoder = None
    if df[i].dtype == object:
        encoder = LabelEncoder()
        data[:, j] = encoder.fit_transform(df[i])
        j = j + 1
    else:
        data[:, j] = df[i]
        j = j + 1
    label_encoder.append(encoder)

split_data = train_test_split(data[:, :-1], data[:,-1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]

c = GaussianNB()

c.fit(x_train,y_train)
print(c.score(x_test, y_test))

real = pd.read_csv("test.csv")
real = real.drop(['X', 'Y', 'Category'], axis = 1)

real2 = np.empty(real.shape)
j = 0
for i in real.columns:
    encoder = None
    if real[i].dtype == object:
        encoder = label_encoder[j]
        real2[:, j] = encoder.transform(real[i])
        j = j + 1
    else:
        real2[:, j] = real[i]
        j = j + 1

print(c.predict(real2))
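
One caveat about this template: LabelEncoder.transform raises an error if test.csv contains a label that never appeared during fitting. A hedged, purely illustrative guard (reusing the loop's own variable names) maps unseen values to some known class first:

known = set(encoder.classes_)
safe = real[i].map(lambda v: v if v in known else encoder.classes_[0])
real2[:, j] = encoder.transform(safe)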

The final topic touches on decision trees; this article only skims it, covering the most basic concepts.

Getting to Know Decision Trees

Decision trees

A decision tree is an AI model that ranks the factors in the data by weight, keeping the important ones and discarding the dispensable ones; the ranking criterion is the size of the entropy.

Entropy

Entropy reflects the uncertainty of an event. The more complicated the factors influencing an event, the less certain the outcome and the larger the entropy; the simpler the factors, the smaller the entropy. Entropy can be computed with a mathematical formula.
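
For reference, the quantity scikit-learn computes for criterion = "entropy" is Shannon entropy: for a node whose samples fall into classes with proportions $p_i$,

$$H = -\sum_i p_i \log_2 p_i$$

A split is good when the child nodes have much lower entropy than the parent, i.e. when the information gain is large.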

The decision tree model

# decision tree classifier (from sklearn.tree); criterion = "entropy" makes it
# split by information gain rather than the default Gini impurity
c = DecisionTreeClassifier(criterion = "entropy")

The classic workflow

A decision tree is another machine-learning model, and its problem-solving workflow closely resembles the Bayes workflow used earlier.

Drawing the tree in four steps

Importing \texttt{pydot}

import pydot

Generating the .dot file

# export_graphviz writes the fitted tree to a file named tree.dot
export_graphviz(c, out_file = "tree.dot")

Building the tree graph

# parse the .dot file back into a graph object; (graph,) unpacks the one-element list it returns
(graph,) = pydot.graph_from_dot_file("tree.dot")

Exporting the image

# render the graph to a PNG image (requires Graphviz to be installed)
graph.write_png("tree.png")

Full code

import pandas as pd
import numpy as np
from sklearn.model_selection import *
from sklearn.preprocessing import *
from sklearn.tree import *
import pydot
import os
# Windows-specific: append a local Graphviz bin folder to PATH so pydot can find dot.exe
os.environ['PATH'] = os.environ['PATH'] + (';' + os.getcwd() + '\\bin')

df = pd.read_excel("friend.xlsx")
print(df.columns)
print(df.head(10))
label_encoder = []
data = np.empty(df.shape)
j = 0
for i in df.columns:
    encoder = None
    if df[i].dtype == object:
        encoder = LabelEncoder()
        data[:, j] = encoder.fit_transform(df[i])
        j = j + 1
    else:
        data[:, j] = df[i]
        j = j + 1
    label_encoder.append(encoder)

# the first seven columns are features and the eighth is the label (assumes friend.xlsx has eight columns)
data = train_test_split(data[:,0:7], data[:,7])
x_train = data[0]
x_test = data[1]
y_train = data[2]
y_test = data[3]

c = DecisionTreeClassifier(criterion = 'entropy')
c.fit(x_train, y_train)
print(c.score(x_test, y_test))
print(c.predict(x_test))

export_graphviz(c, out_file = 'tree.dot')
(graph, ) = pydot.graph_from_dot_file('tree.dot')

graph.write_png('tree.png')

There is still some material left to cover; for various reasons this article will not be updated further, so watch for my next installment!