数据分析面试题整理汇总

第一题：活动运营数据分析

表1——订单表orders，要用到的字段有（user_id‘用户编号’, order_pay‘订单金额’ , order_time‘下单时间’）。
表2——活动报名表act_apply，要用到的字段有（act_id‘活动编号’, user_id‘报名用户’,act_time‘报名时间’）
注意这里的表1和表2都还有其他的字段

需求：

统计每个活动对应所有用户在报名后产生的总订单金额，总订单数。（每个用户限报一个活动,题干默认用户报名后产生的订单均为参加活动的订单）。
统计每个活动从开始后到今天平均每天产生的订单数，活动开始时间定义为最早有用户报名的时间。（涉及到时间的数据类型均为：datetime）。

答案：
1.

select t2.act_id,sum(order_pay) as total_cost,count(order_time) as order_num
from 
(select user_id，order_pay，order_time from orders) as t1
inner join
(select act_id,user_id,act_time) as t2
on t1.user_id=t2.user_id
where t1.order_time>=t2.act_time
group by t2.act_id

select t1.act_id,count(order_time)/dateiff(now(),t1.begin_time) as avg_ordercount
from
(select act_id,user_id,act_time,min(act_time) over (partition by act_id) as begin_time from act_apply) as t1
inner join
(select user_id,order_time from orders) as t2
on t1.user_id=t2.user_id
where t1.act_time between t1.begin_time and now()
and t2.order_time >= t1.act_time
group by t1.act_id

第二题涉及到over函数的使用

第二题：用户行为分析

表1——用户行为表tracking_log，大概字段有（user_id‘用户编号’,opr_id‘操作编号’,log_time‘操作时间’）

需求：

1、计算每天的访客数和他们的平均操作次数。

2、统计每天符合以下条件的用户数：A操作之后是B操作，AB操作必须相邻。

答案：
1.

select date(log_time),count(distinct user_id) as user_num,avg(num_ci) as avg_operqationcount
from
(select date(log_time),user_id,count(opr_id) as num_ci from tracking_log group by date(log_time),user_id)
group by date(log_time)

select date(log_time),count(distinct user_id) as user_num
from
(select user_id,date(log_time),opr_id,lead(opr_id,1) over(partition by user_id order by lod_time) as opr_id_2 from tracking_log)
where opr_id='A' and opr_id_2='B'
group by date(log_time)

使用lead() over()实现

第三题：用户新增留存分析

表1——用户登陆表user_log，大概字段有（user_id‘用户编号’，log_time‘登陆时间’）

要求：

每天新增用户数，以及他们第2天、30天的回访比例

如何定义新增用户：用户登陆表中最早的登陆时间所在的用户数为当天新增用户数；

第2天回访用户数：第一天登陆的用户中，第二天依旧登陆的用户；—次日留存率

第30天的回访用户数：第一天登陆用户中，第30天依旧登陆的用户；

select date(t1.user_begin),count(distinct t1.user_id) as new_user,count(distinct t2.user_id) as twodays_retained_users,count(distinct t3.user_id) as thrityday_retained_users
from
(select user_id,min(log_time) as user_begin from user_log group by user_id) t1
left join
(select user_id,log_time from user_log) t2
on t1.user_id=t2.user_id
and date(t2.log_time)=date(t1.user_begin)+1
left join
(select user_id,log_time from user_log) t3
on t1.user_id=t3.user_id
and date(t3.log_time)=date(t1.user_begin)+29
group by date(t1.user_begin)

第四题：数学计算题

已知A,B厂生产的产品的次品率分别是1%和2%，现在由A,B产品分别占60%、40%的样品中随机抽一件，若取到的是次品，求此次品是B厂生产的概率。

已知：P(A)=0.6,P(B)=0.4,P(次/A)=0.01,P(次/B)=0.02

求：P(B/次)

答案：P(B/次)=P(次/B)P(B)/(P(次/B)P(B)+P(次/A)P(A))

贝叶斯公式的应用

第五题：AB test

某网站优化了商品详情页，现在新旧两个版本同时运行，新版页面覆盖了10%的用户，旧版覆盖90%的用户。现在需要了解，新版页面是否能够提高商品详情页到支付页的转化率，并决定是否要覆盖旧版，你能为决策提供哪些信息，需要收集哪些指标，给出统计方法及过程。

解答：

使用A/B测试模型，分析两个版本在一段时间期限内，详情页面到支付页面的转化率变化，并计算转化率变化后引起的的GMV变化。

可选择的决策：①确定发布新版本；②调整分流比例继续测试；③优化迭代方案重新开发。

要统计的指标：期限内新、旧版本商品详情页到支付页转化率，支付金额。

要衡量的指标：转化率变化 t 在是可接受的置信区间内是否显著，同时参考收益提升率。

指标计算方法：转化率=从某详情页到支付页用户数/浏览该商品详情页用户数（取日平均和标准差）

支付金额=从某详情页到支付页到支付成功路径用户的本次支付金额（取日平均）

采用决策①的情况：本次页面改进在显著性水平内，证明了‘转化率提升的假设’。并且收益提升率达到预期水平。

采用决策②的情况：本次页面改进在显著性水平内，无法证明‘转化率提升的假设’。分析原因可能是新版本样本空间不足。

采用决策③的情况：本次页面改进在显著性水平内，证明了‘转化率提升的假设’。但是收益提升率没有达到预期水平。

第六题：从不订购的客户

某网站包含两个表，Customers 表和 Orders 表。编写一个 SQL 查询，找出所有从不订购任何东西的客户。
Customers表：

| ID | Name  |
|----|-------|
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |

Orders表：

| ID | CustomerID |
|----|------------|
| 1  | 3          |
| 2  | 1          |

例如给定上述表格，你的查询应返回：

|Customers|
|---------|
|  Henry  |
|   Max   |

解答：

select c.name 
from customers c
left join (select distinct customerid from orders) o 
on c.id=o.customerid
where o.customerid is null

第七题：删除重复的电子邮箱（使用工具：MySQL）

编写一个 SQL 查询，来删除 Person 表中所有重复的电子邮箱，重复的邮箱里只保留 Id 最小的那个。
注意：Person 表中数据量很大，ID为主键

| ID | Email             |
|----|-------------------|
| 1  | john@example\.com |
| 2  | bob@example\.com  |
| 3  | john@example\.com |

例如，在运行你的查询语句之后，上面的 Person 表应返回以下几行:

| ID | Email             |
|----|-------------------|
| 1  | john@example\.com |
| 2  | bob@example\.com  |

delete from email 
where id not in(
select a.id from (select min(id) id from email group by email) a
)