[Auto-generated transcript. Edits may have been applied for clarity.]
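Okay, um, we can get started. Before we begin, I want to remind you of something.
This Friday, from 4 pm to 6:15 pm, you have your individual case study test. For the test, bring a calculator.
If you are very confident in your calculation skills, you may not need one.
Just write down each step on the screen. I think you can work out the answers easily, because all the answers are whole numbers.
Very simple. But if you need a calculator, bring one.
The reason I don't recommend using the calculator app on your laptop is, you know,
I don't want to create problems for you. We have invigilators, and they have had to deal with these issues in such cases.
Right. Some things are hard to explain.
We won't record your screen, so you just use your usual browser and log in to the Auckland Inspera site.
I think it's the dot-com address. You can try it later if you like, because
the test is already there, but it won't be activated until 4 pm on Friday.
Okay. So bring your ID card, because they have to check your identity.
If you don't have your ID card, you can use your driver's licence or passport.
Any valid ID is fine.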
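Some of the questions may be closely related to your TBL workshop problem sets.
As I mentioned, there are some questions about calculating the market equilibrium.
Right. I think this is also a good chance for you to review, for example,
the week one TBL problem set, because we discussed how to understand these functions, right,
and what happens when things change. I think that material is quite important.
Also, in the lecture I mentioned something related to, uh, the minimum.
So I think this is a good chance for you to revisit that part.
Yes. So that's everything for economics. Do you still have a quiz this week?
No. No quiz. No quiz. Don't worry.
No quiz. So don't worry about the quiz.
You don't have a quiz. Yes.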
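And for the case study, have you all read the news article?
I posted it in the discussion. Good.
If you haven't read it yet, please take some time to read it.
In particular, make sure you understand every word. An economics case is not a reading-comprehension
case study.
You shouldn't just read a question and search the case for the answer. Use your economics knowledge instead.
I set it up this way because I want you to apply your economics knowledge and do the analysis:
what happens in real businesses, in the world of your daily life.
That is what we are looking for.
And, as I mentioned, part of the skill set I'm trying to build for you is critical thinking.
Okay. So even when something seems to make sense, you also have to think about whether your answer is balanced.
Right. Sometimes, you know, in my stage-one course,
I have a question like this.
I ask the students: hey, you know the golden rule is that marginal cost equals marginal revenue, right?
But why does it sometimes not work?
Or why, in day-to-day business practice, do firms usually not really pay attention to MR equals MC?
So I have this question in the final exam.
I'm just trying to motivate students to think critically about what they have learned, right?
Don't just memorise everything. I think it's important to understand the conditions under which MR equals MC holds.
Anyway, this is very important for you, but don't worry.
You won't get this question in the individual case study; it's just something I want to inspire you with.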
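Okay, before we start, do you have any questions about the case study?
No? Bring your laptop and make sure it has enough battery.
Right. And what else?
Oh, by the way, please check Canvas on Friday morning, because I will post an announcement about the test.
The instructions include where your test room is.
Our convention is to release
the room information on the day of the test, to avoid, um, possible cheating.
That's why you can check this information on Friday morning.
Okay. You have to sit the test on campus under invigilation, because that is part of the requirement for accounting students.
You know, this course is accredited. That's why we need some secure test components.
Okay. So let's get started.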
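Today we move into the business analytics part, which is fun.
I quite like this. Last quarter I taught business analytics, but there I taught the students Monte Carlo simulation and optimisation models.
For this course, though,
we are more likely to focus on descriptive statistics, hypothesis testing and regression, which are very useful.
Okay. Some of you want to do a PhD in the future; others are heading into management or marketing.
I'd say if you learn statistics and regression, you can still gain a comparative advantage from this course.
So my plan is, you know, today we have 70 pages of slides to get through.
That's quite a heavy workload. I think I'll split it into two parts.
Before the break, I will show you some useful Excel functions.
Okay. How to filter data, manage data, clean data.
In some cases you can just search for YouTube clips, because there are lots of clips that guide you through it.
So I will just highlight the things that may be important for you.
I will spend more time on the second part, on data visualisation, because it's very important.
Right. For a good assignment report, I need you to create some charts and tables.
So think about how to emphasise the information when you create a chart, and how to avoid mistakes.
I have some examples for you today.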
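Okay. So, data management. We are always talking about data, data, data.
So what is data? Data means information.
Right? But does that mean data has to be
numbers, like 1, 2, 3, 4? Can data be non-numeric?
A non-numeric variable? Yes, it can.
For example, we can have gender, right?
Gender is not a number. We can also have location, right?
Auckland, or Wellington, say. These are places, different locations.
They are not numbers. And for the marketing students:
sometimes you run a survey and ask a student, hey, do you like my product?
Yes. No. Those are not numbers. Do you agree with me?
Right? Recently I have been working on a research project with a colleague in China, and we have
job-posting data from a Chinese recruitment website, which is a lot like Seek.com in New Zealand.
So we have a dataset that includes the job description, the industry and the occupation.
None of these are numbers, right?
In that dataset the only numbers are the salary figures, or the required work experience,
or the working hours.
Those are numbers. So that's why data doesn't all have to be numeric.
It can be non-numeric
information, and I will guide you on how to deal with non-numeric data.
We can convert information such as gender and location, everything, into numbers.
Amazing, right? Then we can use the numbers to do some calculations.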
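Yes. So it's very interesting. We need the data to have some variation,
because it is from variation that we can identify information.
For example, this Sunday I was very happy because one of my primary school classmates came to visit me.
So I showed them around here.
I also took them to Mission Bay. And, you know, some days are quite cold now, even though it's summer.
But ice cream: lots of people still buy ice cream.
So you know there may be a relationship between temperature and ice cream sales.
Do you agree with me? Yes. But there is also a research question.
For example: when the temperature rises by one degree, how many more ice creams can we sell?
That could be very interesting. Or, for you: if you get one extra degree, by how much can your salary increase,
on average? I'd want to know, because it's very important for my decision.
Should I do this master's programme or not? Right?
So this is where we need some variation.
For example, in the dataset we have students who got a master's degree, but we also have students who did not.
Right. With that variation, we can see the difference,
and we can identify how a master's degree affects a person's salary.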
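So that's variation. The last keyword is random.
Yes. When you do business analytics, you will say "random" a lot.
This word comes up all the time. Why? Because we want the sample to be random.
Yes, I mentioned the word sample. It's probably on all my slides.
Let me talk about the difference between a population and a sample.
What does sample mean? What does population mean?
Population means everyone. A sample means a part of the population,
a certain proportion of the population. For example, suppose this whole BUSMGT class is the population.
We have 129 students.
Okay. And how many students are here today? Maybe only 70%.
Right. So the students attending this lecture are a sample, because a sample is a certain proportion of the population.
Can you follow me? Yes. But random means I close my eyes
and grab some students at random. Okay.
Suppose I want a random sample of three students.
So I close my eyes, walk around the room, and pick three students.
Okay. Why close my eyes? Because otherwise I'm lazy.
If I'm lazy, I just pick one, two, three,
the students sitting right in front of me. You are my sample. But can they represent the whole class well?
Maybe not. For example, here we have two boys and one girl, right?
So in this sample the share of male students is about 66%, and the share of female students is only about 33%.
Can you follow me? But suppose that in the whole population we have 50% boys and 50% girls.
That's why my sample may be biased.
It is not a very representative group.
Can you follow me? Yes. And, you know, in our class we have students from many different countries.
Right. But the three students here are more likely to be from China.
So it's not very representative. That's why you have to make sure your dataset is random.
I mean, that your sample is random. Okay. Why do we emphasise the sample?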
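Because, I would say, in most cases you cannot get the population dataset.
Do you know an example of a population dataset?
Any examples? One that includes everyone. Do you know?
Anyone? GDP? Anyone?
Do you know that in your country, or in New Zealand, the government surveys every single person?
Yes. The census; it's called a census. In New Zealand, I think it's roughly every four or five years.
Statistics New Zealand sends a questionnaire to every household.
Okay. Including you. It doesn't mean we only survey citizens.
It means we survey everyone who is here. Okay.
Yes. That's the census. It's about everyone,
everyone in New Zealand. So that is a complete population.
The dataset is huge, right? You know how many people there are in New Zealand: the population is about 5 million.
So the dataset has about 5 million observations. Can you open that in Excel?
No. Impossible. So, first, a census is very expensive,
which is why New Zealand only collects the data every four or five years.
Right. You need a budget to do it. Second, the census gives you about 5 million observations.
Can your computer work with that? No.
Impossible. You know, I downloaded about 7 GB of Chinese job-posting data, and Excel could not open it.
That's why I had to use other software and work out how to cope.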
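So one solution is to select a smaller sample from the population.
For example, if I want to analyse this whole class, I don't need to ask everyone a question or get information from everyone.
What I do is close my eyes, walk around the room, and randomly pick 20, 30, 40 students.
Then I run a survey and do some statistics, and the information I get may be very close to the population case.
Can you follow me? But what's the key? The key, of course, is that the more students I survey,
the more accurate my answer will be. Can you follow me?
Yes. Because when I survey more students, I get more variation, am I right?
As I mentioned, the variation here. Right. So variation matters.
Yes. This also relates to the other reason we want a sample: collecting data on the whole population is too expensive.
Sometimes it's simply not possible.
Suppose you work in a marketing team and you want to do some market research, right?
You want to survey people's preferences, so that your team can design products for consumers, potential consumers.
Would you go out on the street? Right.
You randomly pick someone who looks friendly
and ask them for some information. Right. So that's a sample.
Can you follow me? Yes. Because you can never knock on everyone's door to get the information.
Okay. Can you follow me? Okay.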
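So that's about datasets. I think you have already downloaded the dataset for your group assignment.
Right? Yes. And you may also collect some data yourselves.
Right. So these ideas are very important for you: the population and the sample, as I mentioned, and the sample mean.
When I said I want to select our sample at random, you know, walking around the room,
right, that is about the sample. The way I select the sample is called the sampling method.
This one is random. Yes. Make sure your sample is a random sample.
This is very important, because when the sample is random,
the characteristics of the sample are likely to be very close to those of the population.
That's why the information you get from the sample can be very close to the population information.
So this is a very important factor. Make sure the sample is random, so that the information or insights you get are likely to be valid.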
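The contrast between picking the three nearest students and drawing a random sample can be sketched in Python. This is only an illustration: the class list, the roughly even gender split and the seating order are assumptions made up for the sketch, and `share_male` is a hypothetical helper, not something from the lecture.

```python
import random

# Hypothetical class list: 129 students, roughly half "M" and half "F".
# The split and the ordering are assumptions made for illustration only.
population = [{"id": i, "gender": "M" if i % 2 == 0 else "F"} for i in range(129)]


def share_male(students):
    """Return the fraction of students in the list recorded as 'M'."""
    return sum(s["gender"] == "M" for s in students) / len(students)


# Convenience sample: the first three students in the list ("the front row").
convenience_sample = population[:3]

# Simple random sample: every student has the same chance of being picked.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
random_sample = rng.sample(population, 30)

print(f"population share male:     {share_male(population):.2f}")
print(f"convenience sample (n=3):  {share_male(convenience_sample):.2f}")
print(f"random sample (n=30):      {share_male(random_sample):.2f}")
```

With only three hand-picked students the male share is two out of three, which is far from the roughly even split in the assumed population; a larger random draw will usually land closer to it.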
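For datasets, in general, we have three kinds.
One is called a cross-sectional dataset. One is called a time series dataset.
The other is called a panel, or longitudinal, dataset.
Okay. So let me tell you the difference.
For a cross-sectional dataset, we collect the data at one point in time.
For example, we survey everyone this year only,
and not again next year. Okay. Just once.
That's it. So we collect, let's say, something like this.
I'm just sharing what the grades in a previous course looked like; that is cross-sectional data.
Okay. Because students only take the course once; they won't do it again.
So you have this one year of data. Time series data means, for example, that we observe New Zealand's GDP growth, right?
Only for New Zealand, not for other countries. And we only look at GDP, not other topics.
But we look across different years.
So that's time series data, because we observe something at different points in time.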
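Someone may say, hey, can we
have cross-sectional data plus time series? Because I have the same group of students, and I follow
their performance over the whole 15 or 18 months of the master's programme. I'd say yes, you can do that.
Your trade data is panel data, because we have the same countries, right?
The same group of countries: New Zealand trades with Australia, with India,
with China, with Vietnam, and also with Japan
and Korea. Many different countries, and across different years as well.
Before Covid, after Covid. Can you follow me?
Yes. So that's panel data. The reason we mention the different kinds of dataset is that when you do econometric modelling,
when you run regressions, datasets with different features need different models to handle certain issues.
Okay. I just want you to be aware of this. Of course, for this course I won't set a very high standard or requirement.
I just want you to know about the different models and how we collect data.
We can collect it ourselves, or, you know, you can buy data, right?
You can buy data like I did, or you can run a survey yourself.
Right. Or sometimes you can run an experiment. Yes.
We have some labs, like lab 8 or 9, which are also designed for economics experiments.
Okay. So we can design a questionnaire and ask students how they would respond: say, when I raise
the price of an ice cream, how many would you want to buy?
We ask different questions and collect your responses.
That becomes a kind of data we can use in these analyses.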
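Okay. So this is the first Excel function I want to mention.
This is a fun lecture, because I am going to use Excel a lot today.
Yes. For example, in this dataset we have different brands of vehicle.
It's just like your trade data, where we have different countries.
Right? Someone asked me, hey, should I use the whole dataset?
For regression and hypothesis testing, I would say yes.
You have to use the whole dataset.
But when you describe New Zealand's trade, do you have to mention every single country in your report?
Isn't that too much? Right? Because in your dataset, you know, New Zealand trades with about 200 countries.
Am I going to have a section that describes each of the 200 cases?
No. Which cases are important and interesting?
Some students have asked about that. My advice, of course, is to feel free to ask ChatGPT or Google, right?
That is all part of the research.
Someone may say, hey, suppose you are preparing a report for the Minister of Foreign Affairs and Trade, right?
He or she wants to know what New Zealand should focus on.
So which partners are very important to New Zealand in terms of trade?
Maybe you want to focus on the top ten, right?
Or on some important trade partners you know. New Zealand has free trade agreements with ASEAN,
with Korea and with China.
And with Australia too. So maybe you can focus on these countries as well. Do you agree with me?
So I think you should design this section and think about how to present New Zealand's trade position.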
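That's why we have to filter the data. Take this case: say I only want to look at the information for Toyota.
Yes. So first make sure you select the Home tab.
Then go to Editing, and you will find Sort & Filter.
Can you see the Sort feature here? Can you follow me?
Yes. Click this button and choose Custom Sort,
or Filter. Let's start with a custom sort.
Then we can choose column B. Make sure you tick "My data has headers".
Right. So we sort by manufacturer. I've done that.
So now all the Toyota rows sit together here.
Can you follow me? That's one. The other thing is that we can filter the data.
Go to Sort & Filter again, and you can see the Filter icon.
Click it. Okay. Now you will find a drop-down menu here.
So I select only Toyota, and then choose OK.
Now the screen only shows the Toyota information.
Yes. It's the same with your trade data: maybe you want to focus on Australia, right?
So you filter the data and select only the observations for Australia.
Then you can see what happened in different years.
Can you follow me? Okay. So that's sorting and filtering data.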
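For readers who prefer to see the two operations outside Excel, here is a minimal Python sketch of the same idea. The vehicle rows are invented for illustration and are not the lecture's dataset; sorting reorders every row, while filtering keeps only the rows that match.

```python
# Hypothetical vehicle records standing in for the spreadsheet rows.
rows = [
    {"manufacturer": "Toyota", "model": "Corolla", "year": 2019},
    {"manufacturer": "Honda", "model": "Civic", "year": 2020},
    {"manufacturer": "Toyota", "model": "RAV4", "year": 2021},
    {"manufacturer": "Ford", "model": "Ranger", "year": 2018},
]

# "Custom Sort" on the manufacturer column: reorders rows, keeps all of them.
sorted_rows = sorted(rows, key=lambda r: r["manufacturer"])

# "Filter" on manufacturer == "Toyota": keeps only the matching rows.
toyota_rows = [r for r in rows if r["manufacturer"] == "Toyota"]

print([r["manufacturer"] for r in sorted_rows])
print([r["model"] for r in toyota_rows])
```

The same filter pattern would apply to the trade data, for example keeping only the rows whose country field is Australia.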
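The second one is conditional formatting. Yes.
I'm not going to follow my slides. I just want to walk you through the cases.
It's easy, all very straightforward. Conditional formatting is actually very interesting and useful.
For example, I use conditional formatting a lot, because my other
research project at the moment uses a method called bibliometric analysis.
What does that mean? It means, for example, that I want to do a literature review of research on fintech.
So I go to a database and search for fintech,
all the papers. Right. I read all the abstracts, maybe 1,000 papers.
Then I have to decide whether each paper is relevant to the research I'm doing.
I mark it yes or no. Right. So I have a column that shows yes, no or maybe. But when I just type yes or no,
it is very hard to tell quickly which papers are relevant and which are not.
Do you agree with me? It's better to highlight them with colours, like red, green and orange.
Right? Red means no. Green means yes.
That way I can identify them more easily. Yes, this paper is relevant.
That paper is not. Can you follow me? Yes.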
Or sometimes, for example, as you know, I'm overweight, so I have to check my blood pressure every day, every morning.
So I have a column that shows every morning's blood pressure.
Right? And I can set a threshold. What do I mean by threshold?
For example, when the blood pressure is over 120, then I have to be careful
and take a pill. Right. If I just show you all the numbers, it is very hard for you to tell quickly which numbers
are above 120. How about having Excel automatically highlight the cell when the value is over 120,
say in red, so you are immediately aware of the case?
So, for example, take this one. Say you are a manager.
Right. When the growth rate is below 0%, that is, negative, you have
to be very careful and pay attention to those cases.
If you just use your eyes, I think you can spot some of the negative numbers.
Right. But it's not easy. How can we easily and accurately pick out the cells that are negative?
We can use conditional formatting. Okay. So how do we do that?
Select the whole column and go to Conditional Formatting, and you can see the first option is Highlight Cells Rules.
Right. There you have Greater Than, Less Than, or Equal To.
In my case, I want to highlight the cells where the paper is relevant,
the ones that show yes. Right. So I can set the rule to Equal To "yes".
With quotation marks.
Text you want Excel to treat as text needs quotation marks; otherwise Excel cannot handle it, because
yes and no are text, right? They are not numbers.
So here let's set, you know, Less Than zero.
Okay. Then you choose OK. Of course you can choose a different colour if you want.
Anyway, is it now easy for you to tell which cells are negative?
Yeah. Far easier, right? So this is called conditional formatting.
Do you like it? Yes. Good.
I think this is also very useful in my case, in my research.
Let me maybe use this one directly.
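The logic behind a highlight rule is just a test applied to every cell. A small Python sketch of that idea follows; the numbers and labels are invented, and `cells_to_highlight` is a hypothetical helper name, not an Excel feature.

```python
def cells_to_highlight(values, rule):
    """Return the positions of the values for which the rule is true."""
    return [i for i, v in enumerate(values) if rule(v)]


# Hypothetical growth rates: flag the negative ones ("Less Than 0").
growth_rates = [0.04, -0.02, 0.01, -0.05, 0.00]
negative_cells = cells_to_highlight(growth_rates, lambda g: g < 0)

# Hypothetical morning readings: flag values over the 120 threshold.
readings = [118, 125, 119, 131]
high_cells = cells_to_highlight(readings, lambda r: r > 120)

# Text labels: flag the rows marked "yes" ("Equal To" with a text criterion).
labels = ["yes", "no", "maybe", "yes"]
relevant_cells = cells_to_highlight(labels, lambda s: s == "yes")

print(negative_cells, high_cells, relevant_cells)  # [1, 3] [1, 3] [0, 3]
```

Note that the text rule compares against the quoted string "yes", which mirrors the point about quotation marks for text criteria.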
Another one I want to discuss. Very, very important.
Pay attention to this one. Okay. In your quiz, I'm going to test you on this.
What is it? Data distribution.
We want to know how the data is distributed. What does that mean?
If you don't know how the data is distributed, then you don't know how you got your result.
Sometimes maybe you find that Covid actually didn't negatively impact New Zealand's food exports.
Covid actually helped New Zealand sell more food to the world.
Maybe you will be very shocked by this conclusion, right?
Because from everyday knowledge, you know, during Covid New Zealand also had some lockdowns, right?
So people are not coming to work and, you know, the port is not working.
Right. Chaos everywhere.
So why did New Zealand sell more than before?
Amazing. That's why it helps to know the data distribution: what kind of values you have,
what the mean value is, right in the centre,
what the extreme values look like, how many values sit above the middle,
and how many values sit at the lower end.
Yeah. After that you can have a good understanding:
yeah, that makes sense. Something like this. Okay.
So how do we get the distribution of the data?
We use a histogram. A histogram is a graph;
it visualises the distribution. If I just show you these numbers, again it's hard to get quick insights.
Right. Because we are human beings. You know, thousands of years ago there were no cars, nothing.
Right. Every day we were looking for food.
That's why human eyes are very sensitive to things moving and to changes in colour,
right, not to numbers or text. We are actually not very sensitive to these numbers.
But anyway, let me guide you through how to get the distribution.
To get the distribution, we want to know how many numbers sit at the low end,
how many numbers sit in the middle, and how many sit in the high band. Okay.
It's like sorting your rubbish: for example, general rubbish goes in one bin,
recycling goes in another bin. Can you follow me?
Yeah. So we are going to create bins, like rubbish bins, many bins.
Okay. The more the better, actually. Right.
Then you look at the numbers: which numbers go into the first bin, which numbers go into
the second bin, and then you count how many numbers are in each bin.
Can you follow me? That way we know how the data is distributed.
So let's say I create bins like 10 to 14, 15 to 19, 20 to 24, 25 to 29, 30 to
34. Somebody may ask me, hey, why does every bin cover 10, 11, 12, 13, 14, that is,
five values? Right. Why not sometimes six,
sometimes four? Why can't it change?
Why is the bin width fixed? Do you agree with me?
It has to be fixed. Okay. Because if the width varies, the distribution you get is not correct, right?
Yeah, it's biased. So for the first bin, 10 to 14, you count how many numbers fall in it.
12 is one, 14 is one, there's another 14 here, and a 13 as well.
So in total there are four. So what do we mean by frequency, in plain English?
Can you tell me what frequency means? Any simple words that can replace frequency?
How many times. Yeah. How many times something is repeated, right?
Can you follow me? So frequency means how many times, how many numbers.
You would be fine saying that, but frequency is the more professional term, right?
So I hope you can use it in the future.
We also have the relative frequency, or percentage frequency. It is derived from the frequency. We have 20 numbers.
Do you agree with me? There are four numbers in the first bin, so four divided by 20 is 20%.
Is that right? So we know the proportion of the observations that fall in the first bin. Okay.
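The counting described above can be written out as a short Python sketch. The twenty values are illustrative: they were chosen so that four of them (12, 14, 14, 13) fall in the 10 to 14 bin as in the lecture, but the rest of the numbers are an assumption, not the slide's actual data.

```python
# Twenty illustrative values; four of them fall in the 10-14 bin.
# The individual numbers are an assumption made for this sketch.
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# Equal-width bins as (lower, upper) pairs, both ends inclusive.
bins = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]

# Frequency: how many values fall in each bin.
frequency = {b: sum(b[0] <= x <= b[1] for x in data) for b in bins}

# Relative frequency: each count divided by the number of observations.
relative = {b: count / len(data) for b, count in frequency.items()}

for (low, high), count in frequency.items():
    print(f"{low}-{high}: frequency {count}, relative {relative[(low, high)]:.0%}")
```

For the first bin this gives a frequency of 4 and a relative frequency of 4 / 20 = 20%, matching the arithmetic in the lecture.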
Once you are happy with that, we can draw the graph. Somebody may say, hey, this is a column chart, because it's all columns, right?
Lots of columns. It looks like a column chart. No, it's not a column chart.
It's called a histogram. What is the difference between a column chart and a histogram?
First, what they have in common is that you see columns.
But the difference is this. For a histogram,
have a look at the x-axis. It has to be all numbers. For a column chart,
it doesn't need to be numbers. It can be text.
For example, take your trade case.
I believe some of you may want to create a column chart to show the trade value for different countries.
Do you agree with me? Yeah. China one column, USA one column, India one column.
Maybe the ASEAN countries as another column, for example.
Right. If you want to do that, it's okay. But that is called a column chart.
Why is it called a column chart and not a histogram?
Because for a histogram, first of all, the x-axis needs to be numeric. Okay.
Second, these numbers have an order, because the values in 10 to 14 are less than those in 15 to 19.
Do you agree with me? Yeah. But in your column chart we have a column for China,
a column for USA, a column for India.
Can we say China is less than USA, which is less than India?
No, there's no order. Can you follow me? No order.
Yeah. However, if you say January, February, March, that's okay,
but you have to convert January to one, February to two and March to three.
Then it can be a sort of histogram, sort of.
Okay. But I would say it's not purely a histogram.
Can you follow me? Can you give me some response? Because of asymmetric information,
I don't know whether you understand or can follow me. Give me a second.
Um, okay. So then you can see this chart, right?
Let's have a look at its shape. It's quite funny, because it goes up and down.
Right? It's like a mountain. Yeah, like a mountain.
So how can we describe this?
There are four kinds of shape here.
Actually three, because the last one is a kind of extreme case.
Okay. In general we have three kinds of shape.
The first one is called left skewed.
Why? Because we have more data concentrated on the right-hand side.
The right-hand side means the values are big. Can you follow me?
And there are very few on the left-hand side. So this is called left
skewed, because the left tail is long.
Panel B is called right skewed.
Why? Because we have lots of observations
concentrated on the left-hand side, where the values are small, and very few on the right-hand side. Okay.
And the third one is symmetric. Symmetric means
that if you cut through the centre and fold it, the two halves overlap perfectly.
Right? The left-hand side equals the right-hand side.
So it's symmetric. Can you follow me? Yeah.
The usual symmetric example is the normal distribution.
It's "normal" like in economics,
right, normal profit. In statistics it's called a normal distribution because it's very usual, very common.
Many, many variables actually follow the normal distribution.
Okay. For example IQ: the IQ of the whole class may follow the normal distribution.
If my assessment is well designed, then the grade distribution should follow the normal distribution,
which means very few students get an A-plus, a few more
get an A, more students get an A-minus, the majority get a B-plus, and so on, something like this.
Okay. Yeah. Can you follow me?
Good. And I want to emphasise this slide.
You can see there are two kinds of cases.
One is called negatively skewed,
or left skewed. Next class I'm going to cover the centre of the data,
how to measure the centre of the data. We use three measures of the centre.
Okay. One you are very familiar with,
I trust. It's called the average. Do you know what an average is?
Yeah. You know what an average is, right?
Good. In statistics the average is called the mean, m-e-a-n, the mean value.
We also use the mode. What do we mean by mode?
The mode is, you know, we have the histogram, right?
We check which column is the highest. That is the mode.
Mode means: in the dataset,
which value is repeated most often,
which has the most cases. That is the mode. And then the median.
Do you know what the median is? The median is the middle value.
Is that right? Yeah. If you sort the data from the lowest value to the highest value, it is what sits in the middle.
Okay. That's the median. So for the symmetric case, the normal distribution, mean equals mode equals median.
That's the perfect case. Okay. But these ones are skewed, right?
Skewed means either the right tail is longer or the left tail is longer.
Is that right? Then the three values are different.
When it is left skewed, the mode is higher than the median,
and the median is higher than the mean. Why?
I'm going to tell you later, next class. Don't worry. For the right skewed case,
the median is below the mean.
Okay, so I have one question.
When we do an income survey, when we analyse income survey data, like in New Zealand.
You know, when I was doing my PhD, I used New Zealand's income survey data and analysed New Zealand's income distribution.
Which do you think New Zealand's income distribution is more likely to be: left
skewed or right skewed?
Right skewed. Why?
Because there is a minimum
for income. Okay. There is a guaranteed minimum wage.
Is that right? And also, look at the right skewed case.
What is the feature of right skewed? The right-hand tail is very long and the observations there are very few.
Do you agree with me? What does the right-hand side mean?
It means the income is very high, like millionaires and billionaires.
Right. Do we have lots of millionaires and billionaires?
No. Most people in New Zealand actually have a fairly low salary, right?
So that is why. Many of you are international students,
and maybe after the programme you want to stay here and get permanent residence.
Right. Do you know how Immigration New Zealand assesses your productivity? By your salary,
because your salary is a kind of proxy for your productivity.
Do you agree with me? Yeah. So do they have a requirement for productivity?
Yes. It's a kind of filter. Is that right? So should Immigration New Zealand use the median as the filter, or the mean?
Check New Zealand's income distribution.
It's the right skewed one.
Right skewed means New Zealand's average income is more than the median income.
What does that mean? Let's say New Zealand's average income is about $70,000 per year.
If later, when you get a job, your annual income is less than $70,000,
do you have to be really sad? No, because
the mean is more than the median.
The median marks 50%, right? So more than 50% of New Zealand's population earn less than $70,000.
That's why the mean is not a good measure of your performance, your efficiency.
Right. That's why Immigration New Zealand usually uses the median income, to be fairer in measuring your productivity.
Can you follow me? All of this actually comes from the insights of the distribution.
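The relationship between mean, median and mode in a right skewed distribution can be checked with Python's standard `statistics` module. The income figures below are invented for illustration (in thousands of dollars); they are not New Zealand survey data.

```python
import statistics

# Hypothetical annual incomes in thousands; one very high earner creates
# the long right tail. These figures are made up for illustration.
incomes = [30, 35, 40, 40, 45, 50, 60, 250]

mean_income = statistics.mean(incomes)      # pulled up by the 250
median_income = statistics.median(incomes)  # middle of the sorted values
mode_income = statistics.mode(incomes)      # most frequently repeated value

# Share of people earning less than the mean.
share_below_mean = sum(x < mean_income for x in incomes) / len(incomes)

print(mean_income, median_income, mode_income, share_below_mean)
# 68.75 42.5 40 0.875
```

In this toy example the mean sits well above the median, and seven of the eight earners are below the mean, which is the reason a median threshold is the fairer comparison for a right skewed income distribution.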
Okay. Yeah.
Today I want to spend more time on the meaning of this and less time on Excel, because you can just follow my instructions, right?
It's easy. For working out the distribution, it's easy, because of the way we do it:
first create the bins, then we can, you know, count into the bins, for example like this.
So let me guide you. I know tomorrow you can probably follow the tutors quite well.
Just let me teach you. Okay. First of all, when you create a histogram, you have to decide the bins.
It's up to you, okay? You can make it 10 to 20, or 10 to 40.
When you do 10 to 40 you get one column. Can that tell you the story?
No. In this case maybe something like 10 to 12, 10 to 13 or 10 to 14 is better.
Can you follow me? Yeah. Sometimes you have to make your own judgement.
All right. So let's follow the textbook case.
Say I use 10 to 14, 15 to 19, 20 to 24, 25 to 29,
30 to 34. Can Excel really understand this?
I mean the part I highlighted here. The answer is no.
Excel cannot understand it. So how can we tell Excel
that my bins look like this? Easy.
We only need to keep the bigger number of each bin.
For example, 10 to 14: which one is the bigger number?
14. 15 to 19: which one is the bigger number?
19. Can you follow me? We just need to keep this bigger number.
Then we create another column and call it bins.
Then what happens is you go to File.
I'm not sure about the Mac users, I apologise,
although I'm a Mac user myself. For Mac users,
maybe you have to find another way to
activate the function. For Windows users,
go to File and then Options. Listen carefully.
And here: Add-ins. Can you see Add-ins?
Can everyone see Add-ins here? Yeah.
If you choose Add-ins, then you can see the list.
Make sure Analysis ToolPak has been activated.
Okay. If it's not activated, you can see it listed here.
So you just need to select it. For example, let's treat this one as Analysis ToolPak. Okay.
Select this one and then choose Go. So let me show you.
If I choose Go, then make sure you tick Analysis ToolPak.
All right. And choose OK, and OK again. Then if you go to Data, you can see Data Analysis is here.
If you choose Data Analysis, then you can create a histogram.
You can also run regressions, which we are going to do later in the course.
Okay. So first choose Histogram and choose OK.
Now you choose the input range.
Select the whole range of the data, the 20 numbers, and then the bin range.
Choose the bin range here. Let me only select the numbers, because I'm not going to tick Labels.
That's why I'm not including the label for the bins.
All right. You can feel free to choose where you want to show the result.
Let's say the output range is somewhere underneath. Okay.
For example, I want to show the result here. And I want to show the chart, the histogram,
and also the percentage, though that doesn't matter.
Then choose OK. See, this is how we get a histogram.
Can you follow me? That's pretty simple, right? So this is how you can create a histogram.
I'm not going to suggest that you be lazy and use the automatic one.
For example, if you go to Insert and then Recommended Charts, you can get a histogram.
You see, this is a histogram automatically created by Excel.
It is very different from the histogram I created, right?
Which one is better? Mine is better. Why?
Because it shows more of the dynamics, more variation.
Right. That one is not clear. Can you follow me?
Yeah. Good. Some of you complain because,
see here, the labels are 14, 19, 24.
So what does that mean? Easy, easy.
Let's make it clearer.
Type 10 to 14. Okay. It doesn't matter,
it's just text. Type "10 to 14" and, see, the label becomes 10 to 14.
Can you follow me? Yeah. So feel free to edit the table,
and then the labels on your graph will update. All right.
So this is about how to create a histogram. Very good. I wanted to spend some time on it.
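The "keep only the bigger number" convention can be mimicked in Python with the standard `bisect` module. This is a sketch under the assumption that each upper limit is inclusive and that anything above the last limit goes to an overflow bin; the sample values are invented.

```python
from bisect import bisect_left

# Upper limit of each bin: 10-14, 15-19, 20-24, 25-29, 30-34.
upper_limits = [14, 19, 24, 29, 34]


def bin_index(value, limits):
    """Index of the first upper limit that is >= value.

    A result equal to len(limits) means the value is above every limit
    and belongs in an overflow bin.
    """
    return bisect_left(limits, value)


# One counter per bin, plus one extra slot for the overflow bin.
counts = [0] * (len(upper_limits) + 1)
for x in [12, 14, 15, 19, 20, 33, 40]:  # hypothetical values
    counts[bin_index(x, upper_limits)] += 1

print(counts)  # [2, 2, 1, 0, 1, 1]
```

Here 14 lands in the first bin and 15 in the second, which is why a single upper limit per bin is enough to define the ranges.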
And what about when the data is not numbers? Easy. When the data is not numeric, like Coca Cola, Pepsi Cola,
you use the COUNTIF function. Let me show you.
COUNTIF. Yeah. COUNTIF means you are counting based on a criterion, right?
So you select the range of the dataset, then a comma,
and then the criterion. The criterion is Coca Cola, and Coca Cola is not a number,
so when I type Coca Cola I need quotation marks.
Okay. So: A2 to B26, comma, and in quotation marks, Coca Cola.
Yeah. You will also sometimes want to lock the data range, because you want to be lazy.
Sometimes you want to copy the formula, right? You don't want to type the formula again and
again, but you also don't want the selected data range to change.
That's why you have to put in the dollar signs. Okay.
I think your tutors should guide you on this.
If they don't, just raise your hand and ask, okay?
Because in the lecture I don't have a lot of time to show every single symbol here.
So you can see the count is 19. Can you follow me? Good.
Let me show you what happens if I don't put quotation marks around Coca Cola.
It gives an error. Okay. Because Excel doesn't recognise Coca Cola.
All right. Good. Okay.
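Counting how often a text value appears is easy to sketch in Python. The purchase list is invented and `count_if` is a hypothetical helper that only loosely imitates COUNTIF; one difference worth noting is that this comparison is case-sensitive, while Excel's COUNTIF ignores case.

```python
from collections import Counter

# Hypothetical soft drink purchases (text values, not numbers).
purchases = ["Coca Cola", "Pepsi Cola", "Coca Cola",
             "Sprite", "Coca Cola", "Pepsi Cola"]


def count_if(values, criterion):
    """Count the values that are exactly equal to the criterion."""
    return sum(1 for v in values if v == criterion)


coke_count = count_if(purchases, "Coca Cola")

# Counter builds the whole frequency table for every brand in one pass.
brand_counts = Counter(purchases)

print(coke_count)     # 3
print(brand_counts)
```

As in the spreadsheet formula, the criterion is written as a quoted string; leaving the quotes off in Python would be a syntax error rather than a count.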
I have all the details here so you can follow. The next one is about splitting and combining text.
This is very important. Actually, again last night, and every day I work until about 1 a.m.,
last night I was working on another research project, which uses data on China's high-speed railway.
I want to see how high-speed rail affects labour mobility and also the GDP, the economic growth, of local cities.
Sometimes I get a list of train numbers, let's say G1, G2, right?
But the data my colleague gave me only has 1, 2, 3, 4, and the letter G is missing.
G, in Chinese, stands for high speed,
high-speed something.
Without the G, I cannot use Python to grab the information from the, you know, ticketing web page.
So sometimes I need to combine text.
Sometimes I want to split text. Here, for example, we split the text for addresses,
so we get one column for the detailed address, a second column for the city, a third column for the zip code.
I also used this function in my previous research, which was published in an A* journal, which is really good.
I analysed CEOs in America, and how their pay is spatially correlated.
You know, spatial means geographical area.
So I had to get the city information. I wanted to identify how many CEOs are in the same city, or how many CEOs are within the neighbouring
80 km. Okay. So I had to split the information like this.
I'm going to show you; it's super easy. When you want to split the information, let me see where it is.
Here, we want to split the addresses.
Let's first move this column far away. Okay.
Then select this column, go to Data, and go to Text to Columns.
Can you follow me? Then you can choose, for example, Delimited.
Okay. Choose Next. Then you can choose a criterion to separate the text.
For example, a comma. If you choose Comma, you get the preview.
Can you follow me? Yeah. So Next, and Finish.
Okay. The detailed address is here, the city is here, the state is here, and the zip code is here.
Easy. Can you follow me? Yeah. If you want to combine data, for example, last night I combined the data like this.
Right. Or say you want to combine the names.
What do we do? We can use TEXTJOIN.
Easy. TEXTJOIN. Okay. First the delimiter. For example, we want to have Dr or Miss or Mr,
then a little space, and then the name.
Is that right? So the delimiter is a space,
inside quotation marks.
Okay. Then: do you want to ignore the empty cells?
Because sometimes we have empty cells here, right? So yes,
TRUE. We want to ignore the empty cells. And then text1, text2,
text3, text4, text5.
Okay. That's it. Very smart.
Do you agree with me? Yeah. What if I don't have the delimiter?
If, for that delimiter, I don't put in the space,
you see, everything is stuck together.
No space. Right? And this is not what we are looking for.
Can you tell the difference now? Yeah. Good. So this is about combining and splitting
text. Easy. Okay. Next one.
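Splitting on a delimiter and joining with a delimiter both have direct Python equivalents. In this sketch the address and names are invented, and `split_address` and `text_join` are hypothetical helpers that only loosely imitate Text to Columns and TEXTJOIN.

```python
def split_address(address, delimiter=","):
    """Split one text value into parts at each delimiter, trimming spaces."""
    return [part.strip() for part in address.split(delimiter)]


def text_join(delimiter, ignore_empty, *texts):
    """Join text values with a delimiter, optionally skipping empty ones."""
    parts = [t for t in texts if t or not ignore_empty]
    return delimiter.join(parts)


# Hypothetical address split into street, city, state and zip code.
parts = split_address("12 Example Street, Springfield, IL, 62701")

# Joining a title and names with a space, ignoring the empty middle cell.
full_name = text_join(" ", True, "Dr", "", "Jane", "Doe")

# With an empty delimiter everything runs together.
no_space = text_join("", True, "Dr", "Jane", "Doe")

# Re-attaching the missing letter to a train number.
train_number = text_join("", True, "G", "1")

print(parts)
print(full_name, no_space, train_number)  # Dr Jane Doe DrJaneDoe G1
```

The `no_space` result shows the problem described in the lecture: without a space as the delimiter, the pieces are stuck together.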
Missing data. Missing data is very important.
I want to spend about another five minutes discussing missing data with you,
because your trade data is real data,
and it is bound to have some missing values.
Or when you try to collect information about, let's say, a country's GDP: for some smaller countries you usually have more missing data.
Can you follow me? This is important for many of you, because we need to clean the dataset, right?
How do we handle the missing data? We have to identify
whether the reason the data is missing is random or not random.
If everything is random, okay,
the missing data is fine, because what remains still looks like a random sample.
Can you follow me?
Yeah. I should have 129 students here, but some students are, randomly, not here because of the weather or something, right?
That's a random issue, so it's okay. You can still be good representatives.
Okay. But when the missing data is not random, then it's not okay to ignore it.
All right. So how do we judge? First case: completely random.
For example, while you are entering the data there is a power outage, your
computer goes down and you lose the information. That is random, totally random.
Okay. So that's fine. Take it easy and keep going.
Right. You don't need to worry. Second one.
This is not a completely random, but a some sort of random.
For example, when you try to survey everyone's income.
Okay. But. Young people are more likely to provide this information, but for older people, you know, they are more likely to keep their privacy.
So in that case, we have more observation for young people, but fewer for old people.
This is a random, but it's not a completely random.
Okay, yeah, this is okay because is not a the conversation is not about income, which why is not a okay when I do the surveillance they are income.
If you are in kind of a low you don't want to tell me.
But if your income is high you want to show off. So you know right.
My subjective is talking about your income.
And the reason why I have the missing data is because income somewhat is in the high is on the lowest.
Incomes are low. Can you follow me? In that case is not random.
Okay. And then you have to handle them. Can you follow me?
So how can we handle missing data?
This is something that I'm going to ask you about.
Because if you tell me you are responsible for cleaning the data or handling the data,
this is one possible question that I'm going to ask you in the week ten
overall interactive assessment. Can you follow me?
Because I want to make sure you apply a good method to handle the data.
If you trust that the data is missing because of a random issue,
okay, or at least a partially random issue, the easy way is to delete the data.
What do I mean by delete the data? If you see there's an empty cell, right?
Then what do you do? Delete everything. Delete the whole row.
The whole row, not the whole column. If you delete the whole column, then you lose the information for that variable, right?
But if you delete the whole row, what does it mean? You lose one observation.
For example, say we are missing Australia's GDP in 2025.
Okay, then what are you going to do? You delete the whole row.
That row contains the information for Australia, for example.
Right. You are not going to know Australia's GDP, and also its population or whatever else is on that line, because you deleted the whole row.
Can you follow me? So we will lose the observation.
Is it okay? It's okay. As I mentioned, what is left still looks like a random sample.
Can you follow me? That's easy.
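As a minimal sketch of the row deletion just described (this is not the Excel procedure from class), the Python snippet below drops every row that has an empty cell. The countries and figures are made-up placeholders, not real statistics, and `drop_incomplete_rows` is a hypothetical helper name.

```python
# Made-up records for illustration only; None marks an empty cell.
rows = [
    {"country": "New Zealand", "gdp_2025": 250.0, "population": 5.3},
    {"country": "Australia", "gdp_2025": None, "population": 27.0},
    {"country": "Japan", "gdp_2025": 4200.0, "population": 124.0},
]


def drop_incomplete_rows(records):
    """Keep only the rows in which every field has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = drop_incomplete_rows(rows)
print([r["country"] for r in complete])
```

Note that the Australia row disappears entirely, including its population figure, which is exactly the cost of deleting a whole observation.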
Right? Very naive. The second one: we can use some number to replace the missing data.
Which number? Zero? Okay. Can we use zero?
Now, if Australia's GDP in 2025 is missing, can we use zero?
Of course not. Because if you use zero, then you are saying
the true value is zero. Can you follow me? But which number should we use?
Anybody, before I discuss it next class?
The average is more commonly used.
Average. Okay. We use the average.
Yeah. The reason why, I'm going to tell you later, in the next class.
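A small Python sketch of replacing missing values with the average of the observed ones, under the assumption that missing cells are stored as `None`. The numbers and the helper name `fill_with_mean` are hypothetical.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    average = mean(observed)
    return [average if v is None else v for v in values]


gdp = [100.0, None, 140.0, 120.0]  # made-up figures, one cell missing
filled = fill_with_mean(gdp)
print(filled)
```

Filling with zero instead would drag the column average down, which is why zero is a poor stand-in for an unknown value.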
Or we can predict. We can try to predict.
For example, you know the quantity of sales. This month it is 100.
Next month it is 110. Can you predict what the sales are in the third month?
Yes. Because the first month is 100 and the second month is 110.
What is the growth rate? 10%. Right. Then you assume a growth rate of 10%.
So you can calculate it. Can you follow me? This is called forecasting, or AI.
So yeah, we can have a lot of methods. Don't worry, I'm going to guide you again.
Again, don't worry. Okay. Um, so yeah.
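The growth-rate reasoning above can be sketched in a few lines of Python. This assumes, as in the example, that the most recent growth rate simply repeats; `forecast_next` is a hypothetical name and not a function from the course material.

```python
def forecast_next(previous, current):
    """Project the next value, assuming the latest growth rate repeats."""
    growth_rate = current / previous - 1  # 110 / 100 - 1 is roughly 0.10
    return current * (1 + growth_rate)


month_1, month_2 = 100, 110
month_3 = forecast_next(month_1, month_2)
print(round(month_3, 2))
```

With sales of 100 and then 110, the projected third month is about 121, that is, 110 grown by another 10%.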
Anyway, for missing data we can use the COUNTBLANK function, which is quite like COUNT.
So, okay, you can follow the instructions on the slides to work it out. It is not very, very important for me anyway.
We have to identify the outliers.
What do we mean by outliers? Extreme data. For example, take the income data.
Who are the outliers? Millionaires, trillionaires. They are extreme cases.
Can you follow me? Or somebody who only earns $1,000 per year.
Right. These are extremely big or extremely small values.
These are outliers. Why do we have to identify the outliers?
They are the reason why sometimes the mean is below the median or the mean is higher than the median.
Okay, so outliers are going to influence the mean value.
Can you follow me? Let's say in the whole class the average score is seven.
However, when I ask everyone, the majority of the students only got a five in their quiz.
How come we got a seven? Because one guy got a ten. Okay, so that guy pushes up the class average.
Can you follow me? So you know that the mean value will be influenced by the outliers.
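To make the effect concrete, here is a tiny Python sketch comparing the mean and the median with and without one high score. The scores are made up for illustration and are not the class figures quoted above.

```python
from statistics import mean, median

scores = [5, 5, 5, 5, 10]  # made-up quiz scores; the 10 is the outlier

with_outlier = (mean(scores), median(scores))
without_outlier = (mean(scores[:-1]), median(scores[:-1]))

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

The single 10 lifts the mean from 5 to 6 while the median stays at 5, which is why a mean above the median hints at large outliers.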
So that's why, in most studies, we care about them: because they are going to influence the mean.
And in regression analysis we are going to look at the slope of the line.
Okay. With the outliers, the slope will be steeper or flatter.
So they are going to bias the result. So what do we try to do?
Delete them. Okay.
We try to delete these outliers. Okay.
So in finance research we have what is called winsorizing at 5%, which means the 2.5% of extremely big values and the 2.5% of extremely small values.
Okay. But in your data case, do you want to delete the outliers?
This is another question. Tell me, for the New Zealand trade data, do you want to delete the outliers?
Who are the outliers? As I mentioned, lots of the big numbers, right,
come from China and the USA, right? So when you delete all these numbers, can your report still be correct? No, no.
So what should you do? I don't want to tell you now.
Later. Okay. Yeah. Okay.
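A short Python sketch of handling the two 2.5% tails. One caveat on terminology: deleting the extreme values is usually called trimming, while winsorizing in the strict sense keeps every observation and caps the extremes at the cut-off values, so both variants are shown. The rank-based cut-off, the helper names and the data are simplifying assumptions for illustration.

```python
def tail_size(n, tail_per_mille=25):
    """Number of observations in each tail (25 per mille = 2.5%)."""
    return n * tail_per_mille // 1000


def trim(values, tail_per_mille=25):
    """Delete the smallest and largest 2.5% of the observations."""
    ordered = sorted(values)
    k = tail_size(len(ordered), tail_per_mille)
    return ordered[k:len(ordered) - k]


def winsorize(values, tail_per_mille=25):
    """Cap the extremes at the cut-off values instead of deleting them."""
    ordered = sorted(values)
    k = tail_size(len(ordered), tail_per_mille)
    low, high = ordered[k], ordered[len(ordered) - 1 - k]
    return [min(max(v, low), high) for v in values]


# 40 made-up observations: 1..38 plus one extreme value at each end.
data = [-500] + list(range(1, 39)) + [900]
trimmed = trim(data)
capped = winsorize(data)
print(len(trimmed), min(trimmed), max(trimmed))
print(len(capped), min(capped), max(capped))
```

Whether either step is appropriate depends on the data: if the largest values are genuine and important, as with major trading partners, removing them changes the story.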
So, uh, duplicate records. Easy.
Yeah. If you want to identify duplicate records,
uh, yeah, let me show you, for example. Choose the column where you believe there are some duplicate records.
Then you go to Data and you can see Remove Duplicates.
Can you see that? It shows here, this one, Remove Duplicates.
That's easy. Very easy. Okay. All right.
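For comparison with the Remove Duplicates button, here is a minimal Python sketch that keeps the first copy of each record and drops exact repeats. The records and the helper name `remove_duplicates` are hypothetical.

```python
def remove_duplicates(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Made-up (country, year, value) records; the third repeats the first.
records = [
    ("China", 2024, 100),
    ("USA", 2024, 80),
    ("China", 2024, 100),
    ("Japan", 2024, 40),
]
print(remove_duplicates(records))
```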
So I think, uh, data visualisation. Okay.
Let me finish this and then take a break. So this is what I want to emphasise.
Okay. Data visualisation is very important. This is about how to create very beautiful tables and very beautiful charts for your reports.
And you will benefit from this lecture in your future career.
Because, you know, I'm a program director and I have to prepare reports.
Right. I have to show that the program I am managing and directing is very successful.
So I try to create, you know, a column chart or pie chart to show that enrolments are high and
student performance is good. And how do you attract the attention of your audience?
That's important. Okay. Preattentive attributes are the focus: the colour, the size, the shape, the length.
What does it mean? See here. Can you easily tell me how many sevens there are in the first box?
I mean, on the far left. Can you tell me?
No. If I highlight them in red, can you?
Much easier. Tell me. How about if I make them larger, right?
Super easy. Is that right? So that's why, every time I reply to your discussion, I know I've written a long paragraph, right?
Maybe you don't want to read the whole thing.
So I highlight the keywords. Do you know what I mean?
Yeah, actually, I have done that, right? So, you know, if you want, if you are too busy, just read my keywords.
That's it. So it's about the colour and the size. Okay.
When you create a table, pay attention to the data-ink ratio.
What do we mean by data-ink? Ink is,
you know, the colour you spend to draw your table, right?
So I'm going to show you two examples. One example is
on the left-hand side. Another one is on the right-hand side. The left-hand side is low data-ink.
The right-hand side is high data-ink. Which one do you like, or which is maybe more eye-friendly?
Left-hand side or right-hand side? Right-hand side.
Right. Yeah. Because if we check, uh, tables reproduced here or in research articles,
most people are not going to draw the border lines.
Okay? When you draw the border lines, you spend too much ink on drawing them,
and that's how you create a low data-ink case.
So that's why we try to have this one, because it's more friendly,
and, uh, people can focus more on the data rather than the border lines.
Okay. Take a break. Let's come back at, uh, 12:25.
Okay. Yeah. What?
I didn't do it.
I had to make. The curve?
Yes, because the 9 to 10in.
We have this one, right?
Interest rates, right. So when interest rates are getting, uh, higher in China,
uh, what are Chinese people going to do?
Right. Do you want to send the money out to New Zealand?
So that's why I supply that. Because you don't send the money supply.
You are supplying demand. Ho ho!
Demand in Chinese here. New Zealand. This is because they didn't want to send them back to us right before they can.
Right. Yeah. Because before I plan to spend the $100 in New Zealand, but now I don't.
Said the transaction? Maybe. Uh.
Too complicated. Uh. This one.
Doesn't matter. You can just follow my question, but this one is talking about how it influenced, uh, exploitation in politics in first place.
That's why I'm saying doesn't matter. Simple pressure to go on to say, okay, this is.
Oh, you. That.
I. Go on.
Okay. All right. Um, okay, let's get back to the class.
So the next one is about, uh, the line chart
I want to show you.
So the line chart is good at showing you, you know, the trend. When we want to tell the story of a trend,
the line chart is a good one. Okay. And there is usually a common mistake.
Okay. The common mistake is that, uh,
I guarantee you, every time I mark your group assignment for this course, there must be at least one group
that uses a line chart to show me the data across different countries, like China, India, Japan.
You know, uh, as I mentioned, there's no order, right? When there's no order,
it doesn't show a trend. Okay. Yeah. So here it does show you the trend, because the x-axis is showing you the months.
Right. January, February, March, April, May and June.
Then it is showing you the trend. All right. Yeah. So don't use line charts to show me the trade data across different countries.
That is totally wrong. Okay. So this is something that you have to be very careful about.
Uh, you can use a column chart if you want, but you have to sort the column chart.
Okay. So that is either from the highest column to the lowest or from the lowest to the highest.
All right. Yeah. That is easier and clearer to read. And this is a table with a very nice, uh, data-ink ratio that I can show you.
Uh, as you can see, these are showing you the different cases, right?
So I think, uh, this design is better. Actually, there are other designs, and then the cross tabulation.
This is the one that I want to spend more time showing you.
Okay. For example, we have a data set.
Uh, a very complicated data set. Okay, here.
All right. This data set is very complicated because we have, uh, different restaurants with a number,
an index number, one, two, three, four, and we have a quality rating, and we have the meal price,
and also the wait time.
Right. So if I want to do some stats based on some condition, for example, I want to know how many restaurants are rated as, uh, excellent,
good, or very good, and, uh, in the different price ranges.
Right. If I want to know that, it's very difficult and time-consuming to calculate manually.
Don't do that. There's an easy way. So for example, you can select the whole, uh, area of the data
and go to Insert. And you can see the first one is called, uh, PivotTable.
The pivot table is very powerful. Okay. I'm going to show you.
So if you choose PivotTable, it shows up. Okay.
Now nothing's there. Is that right? So, um, say I want to create a table according to the quality rating categories.
I can see here we have a list of the variables,
and I have Quality Rating as one of the variables.
Right. What can I do? Okay, I move my cursor to Quality Rating and drag it down to Rows.
Drag it down to Rows. See, it is there. But if I drag it to Columns, that's okay too.
It's up to you how you want to design
the layout of the table. Yeah. But anyway, let's be consistent with the
textbook, uh, I mean, the material I prepared for you.
So that is the one. And then, for example, I want to focus on the meal price, on the different meal prices.
So as you can see, I get the different meal prices. Right.
And, uh, I want to count how many restaurants have a meal price of ten and are rated as excellent.
Can you follow me? So I can drag Restaurant to Values.
Does it make sense? Actually, you can see the figures: three, five, nine, nine.
Do we really have that many restaurants here? I don't think so.
We only have about 300 restaurants.
Is that right? I only have 300 restaurants.
But you see the figures? The grand total is much more than 300.
What is wrong with it? Oh, pay attention here.
Because when I put in the restaurant variable,
these numbers are not a number of restaurants, right?
They are like an ID number. Do you know what I mean? It is adding them up, like one plus two plus three.
So we should not do this.
Yeah. Because if you move your cursor to here, you see Sum of Restaurant, right? That is wrong.
I actually need to do a count. So how can I do a count?
Move your cursor here and click, and then you see Value Field Settings.
Okay. Choose this one. Then you choose Count.
Okay. Now choose OK. Then this is correct, because the grand total is 311.
Yeah. Also, suppose you say, hey, uh, the prices are too detailed.
I don't want this. I want to create bigger groups, like in a histogram, 10 to 14.
So I can get histogram-style information. Yeah, that is right.
So what I can do is select 10 to 14, for example, and right-click.
And there's what's called Group. See, if I click on this, uh, minus sign,
you know, I can collapse the group.
Okay. Yeah. So this is one group, as you can see.
Right. So you can keep going and group the others.
So this is called a pivot table, which is really useful and helpful for you to do descriptive statistics.
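The same cross tabulation idea can be sketched in plain Python: count restaurants by quality rating and price band, and see why summing the ID column gives a meaningless total. The six records, the band width of 5 (matching the "10 to 14" group) and the helper name `price_band` are assumptions for illustration, not the course data set.

```python
from collections import Counter

# Made-up records: (restaurant_id, quality_rating, meal_price)
restaurants = [
    (1, "Good", 10),
    (2, "Very Good", 12),
    (3, "Good", 14),
    (4, "Excellent", 16),
    (5, "Very Good", 18),
    (6, "Good", 11),
]


def price_band(price, start=10, width=5):
    """Group a price into a band such as '10-14' or '15-19'."""
    low = start + (price - start) // width * width
    return f"{low}-{low + width - 1}"


# Count of restaurants per (rating, price band): the pivot table with Count.
counts = Counter((rating, price_band(price)) for _, rating, price in restaurants)

sum_of_ids = sum(rid for rid, _, _ in restaurants)  # what Sum of Restaurant adds up
number_of_restaurants = len(restaurants)            # what Count reports

for cell, n in sorted(counts.items()):
    print(cell, n)
print("sum of IDs:", sum_of_ids, "count:", number_of_restaurants)
```

Here the IDs add up to 21 although there are only 6 restaurants, which mirrors the oversized grand total in the pivot table before switching from Sum to Count.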
Okay. All right.
Um. Okay.
This one. Another important graph for this course. I have to spend a lot of time discussing the histogram,
and this one, another very important one for you, is called the scatter chart,
or scatter plot. Okay. What do we mean by a scatter plot?
If you have a look at the data, you see column A is the week number, column B is the number of commercials,
and column C is sales. Okay. So what does that mean?
Um, as you can see, each dot here is showing you a specific combination of the number of commercials and the sales.
Okay.
So for example, um, this one. You can see that at observation number six, the number of commercials is one and the sales value is 38.
So it is here. Can you follow me? Yeah. And also you can see there are actually two dots, right?
Which means, yeah, the upper dot is actually here:
the number of commercials is one and the sales is 41. Okay.
So that means one dot represents one specific observation, one row.
Is that right? Yeah. And this is 2D, two dimensions.
Two dimensions means, you know, the y-axis is showing you sales,
one of the variables, one dimension, and the x-axis is showing you the number of commercials.
Why do we want to do scatter plots? Because a scatter plot is going to show you the relationship between the two variables.
Okay, so pay attention to my later discussion. A scatter plot
only shows you correlation.
Correlation, to put it simply, means relationship, either positive or negative.
This is the very basics of regression. Okay.
If you understand this, then when you do regression it is easier.
All right. So the correlation, the scatter plot, tells you how the two
variables are correlated, according to these dots.
Can you tell me what the relationship is? When one variable increases, does the other
variable increase or decrease? Increase. Increase.
Yeah. Good. Very good. If it's hard to tell, then sometimes people draw a dashed line.
Can you see the dashed line there? Yeah. People draw
a dashed line. This dashed line is called the trend line. Right?
Yeah. So the trend line, remember, the trend line goes
through the centre of the dots. See, this one is also through the centre.
Okay. It must go through the centre. Yeah.
Because this line is actually the regression line.
When we get to regression, I'm going to teach you.
So this is your model, your regression model.
It goes through the centre, okay. Yeah. And that's why, when you see this line is upward sloping, it means the two variables are positively correlated.
Positively correlated, like a supply curve. Right. When price goes up, quantity goes up.
Right. If you see the line is downward sloping, it means negatively correlated.
It's like a demand curve. When price goes down, quantity goes up.
Can you follow me? When this line is horizontal,
is there any relationship? No.
You know what happened in the perfect competition case, right?
The demand is horizontal. Do you still remember? Yeah.
So the quantity is increasing, but the price does not change, right?
No correlation at all. So that's why, when we get to regressions,
yeah, it's very important to check whether your line is upward sloping, downward sloping or horizontal. When it is horizontal, it means something.
Okay. It means your model, uh, has some insights, okay.
Very specific insights. So this is about the, uh, correlation.
And also, another case: say you have, uh, temperature on the y-axis,
right, and the sales of ice cream on the x-axis. It is probably going to match
this case. Do you agree with me? Because when temperatures are getting high, people are eating more ice cream, right,
to keep the body cool. And that's a nice upward slope.
All right. So this is what happened. One key part, very, very important, okay? I am probably going to test you on this in the quiz.
Can this correlation show you a causal effect, which one causes the other one to change?
Can we say the increase in the number of commercials is the cause,
the reason why sales are increasing?
Can we do that? No. No, we can't.
Okay, so say I have the ice cream data. I have the temperature and I have the number of ice creams sold.
Okay. When I see that,
can I say, according to the graph, that when the temperature is getting higher,
it causes people to eat more ice cream? I can't. I can't do that, because this is correlation.
This graph only tells you the correlation between the two variables, okay.
It doesn't tell you the causal effect. All right.
If we want to prove the causal effect, then you have to prove these two variables have that kind of relationship.
Then it might be telling you the story. Can you follow me?
But how do we interpret this graph? We can say: I observe that when the number of sales is increasing, the number of commercials is also increasing.
This is fine, okay. You just try to describe, when one is changing, how the other one changes.
That's okay, because it is more neutral, right? So this is how we describe something in statistics.
Can you follow me? Yeah. Also, I want to share
my own experience, which is very, very interesting. Last year, like this year, Auckland had very extreme weather.
Okay. So I found it very interesting, because last year I had my quizzes conducted on campus in the computer labs for my students.
So every Friday they had to come to campus to do the quiz. For you guys,
you stay home. Is that right? Because I learned from my experience: every time I had the quiz, it had to be a rainy day.
It must be a rainy day. So can I say my quiz causes
the weather? Right or wrong? No, I can't say that, right?
So I can only say, well, it just happens, right?
Somehow, I don't know, I can't explain it. Every time I had the quiz, students were stuck here and couldn't go back.
So that's why we changed and moved the quiz online.
Interesting. So. Okay.
And, uh, there are some recommended charts for you to choose from.
Yeah. So this is about scatter plots.
Okay. And I want to show you that the scatter plot
is actually telling you the correlation. Of course, people want to calculate the correlation, right?
Don't worry. These formulas are not for you.
Don't worry. I just want to leave them here for students who want to learn more.
But to everyone: don't worry about the formula.
Remember, for this course there is no need to memorise the formula, right?
Because we have ChatGPT. We have, uh, Excel.
So your job is far easier. What are we trying to do?
I need you to learn this Excel function: =CORREL(array1, array2).
Array1 means variable one, okay. Array2 means variable two.
So what you need is to know how to use Excel to work out a correlation.
Can you follow me? It's easy. Correlation, how to spell it: C-O-R-R-E-L.
So you have this one. Okay, that's a correlation, and there is no need to worry about the maths.
Yeah, when I was a student I had to memorise this maths.
But for you guys, lucky new generation, no need.
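Excel's `CORREL(array1, array2)` returns the Pearson correlation coefficient. For anyone curious about what sits behind it, here is a small Python sketch of the same calculation. The function name `correl` is chosen only to echo Excel, and the sample figures are made up, apart from the two observations with one commercial and sales of 38 and 41 mentioned earlier.

```python
from math import sqrt


def correl(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return cov / sqrt(var_x * var_y)


commercials = [1, 1, 2, 3, 4, 5]    # mostly hypothetical values
sales = [38, 41, 44, 50, 54, 60]    # mostly hypothetical values
r = correl(commercials, sales)
print(round(r, 3))
```

A value of r close to +1 corresponds to dots lying near an upward-sloping line, and a value close to -1 to a downward-sloping one.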
However, this is something that you have to understand, okay?
Yeah, I show you there are generally three cases.
Upward sloping: positive correlation. Downward sloping:
negative correlation. And then something where I can't tell the pattern.
It looks more like a horizontal line, or very random, okay.
That is the one in the middle: no correlation at all. Can you follow me?
Yeah, you know, I really like cooking.
That's why I feed myself a lot and am getting fatter. So to me, the middle one is like when I put sesame seeds on my bread, right?
Very randomly. No pattern, you know. The other thing is that we use r, the lowercase r, for the correlation coefficient, okay.
So when r is zero, there is no correlation at all.
When r is positive, the line is upward sloping.
When r is negative, it is downward sloping, okay. But also, once you calculate r, you can tell if the relationship is strong, weak or moderate.
Yeah. So these are my, you know, suggested
terms for you to use, okay. So in your analysis please use these terms to describe the correlation.
We can say no correlation. We can say it is weak.
We can say it is strong. We can say it is moderate.
Can you follow me? So when the figure is smaller than 0.3, it is very weak, or when it is 0.1 or 0.2, no correlation.
All right. When it is between 0.3 and 0.5, it is weak. When it is more than 0.5, up to 0.7, moderate. When it is more than 0.7, strong.
Okay. So this is the setting in this course. I'm just making a standard for everyone.
All right. In the future you can use your own standard, but this is the one for this course and for your project.
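The course thresholds can be written down as a small Python helper. Two things here are assumptions rather than something stated in the lecture: the thresholds are applied to the size of r ignoring its sign, and the exact treatment of the boundary values (0.3, 0.5, 0.7) is a guess. The name `describe_strength` is hypothetical.

```python
def describe_strength(r):
    """Label a correlation coefficient using the course thresholds (assumed on |r|)."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)
    if size > 0.7:
        return "strong"
    if size > 0.5:
        return "moderate"
    if size >= 0.3:
        return "weak"
    return "very weak or no correlation"


for value in (0.85, -0.6, 0.4, 0.1):
    print(value, describe_strength(value))
```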
You can use this as a standard. Why do we care? Because when two variables are strongly correlated, we can use one variable to predict the other.
Yes. Can you follow me? Yes. If the temperature has a very strong correlation with the quantity
of ice cream sold, can I use temperature to predict ice cream sales?
Of course I can, right. Because when the temperature increases by one degree, I can tell how many extra ice creams can be sold.
Okay. So this is what happens. So that's why, in your regressions, the first step is to do the correlation matrix.
Do you know what I mean by matrix? It's a table.
It's a table. We just calculate the correlation between any two, you know, between any two variables in our data set.
Okay. And then we can see which variables are strongly correlated
and which are not strongly correlated. Okay. Usually we try to select the variables that are at least weakly correlated, at least weak.
Okay. Do you know what I mean? Which means r needs to be at least 0.3. Otherwise it is not correlated strongly enough,
and it doesn't really increase, you know, the explanatory power of your model.
All right. So this is something that I wanted to, uh, discuss with you.
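A correlation matrix is simply the pairwise correlations laid out as a table. The Python sketch below builds one and then keeps the candidate variables whose correlation with the target reaches 0.3 in size. All data, variable names and helper names are made up for illustration.

```python
from math import sqrt


def correl(xs, ys):
    """Pearson correlation coefficient (assumes both lists vary and match in length)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(data):
    """Return {name_a: {name_b: r}} for every pair of variables in `data`."""
    return {a: {b: correl(data[a], data[b]) for b in data} for a in data}


# Hypothetical observations, one value per day.
data = {
    "temperature": [18, 20, 22, 24, 26, 28],
    "ice_cream_sales": [30, 36, 41, 50, 55, 61],
    "newspaper_sales": [5, 9, 4, 8, 6, 7],
}
matrix = correlation_matrix(data)

target = "ice_cream_sales"
selected = [name for name in data
            if name != target and abs(matrix[name][target]) >= 0.3]

for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
print("selected predictors:", selected)
```

In this made-up data only temperature clears the 0.3 cut-off against ice cream sales, so it is the only candidate kept.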
Can you follow me? Very good. So the line chart, as I mentioned, is showing you the trend.
And also, you see the line chart goes up and down.
Is that right? Sometimes it can be smoother, like your business cycle.
Usually it is like this, right? But we say the government uses, like, policies to make it smooth.
Uh, do you agree with me? One is more fluctuating,
one is less fluctuating. How can we describe the fluctuation?
As I mentioned, when it fluctuates more, it means the values are more different from each other.
Can you follow me? In statistics, which word shows difference?
I mean, you know, when you just say "difference", it looks like you have never learned stats, right?
We want to show off, right? We want to show we have learned stats.
Do you know which word in stats means difference?
Volatility, that's economics. The other word:
deviation. Do you know deviation?
Yeah. So when the line varies up and down a lot,
it means the deviation is big. When it is smooth,
the deviation is small. Okay. Yeah. So that's why the line chart can also tell you about the deviation,
and also the trend. So you can see the North is more deviated, more fluctuating, and the South is less.
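To put a number on "more fluctuating", one option is the standard deviation of each series. The sketch below uses made-up monthly figures for two regions, not the values from the slide.

```python
from statistics import pstdev

# Hypothetical monthly sales for two regions.
north = [90, 120, 85, 130, 95, 125]   # swings up and down
south = [100, 104, 99, 105, 101, 103]  # fairly smooth

north_sd = pstdev(north)
south_sd = pstdev(south)
print(round(north_sd, 2), round(south_sd, 2))
```

The jagged series has the larger standard deviation, which matches what the eye sees in the line chart.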
Yeah. Sparklines are just, uh, very tiny, mini, you know, um, line charts. For example, if you see the data here and, uh, let me see, where is it?
Uh. Insert.
Uh, here, the line chart, uh, the sparkline.
So the data range is, uh, B2 to B14.
And for the location, you can maybe choose here.
So see, this is actually, you know, the sparkline is just like a mini line chart.
Okay. Very cute. Anyway, it is something, uh, handy for you to tell the trend.
Yeah, and the bar chart, it doesn't matter. This is one example.
Do you like this chart? If you were the teacher marking your group assignment,
would you like this chart? Yes. Very beautiful.
Colourful. And it also shows a trend. Very professional.
But, uh, I'm going to critique, um,
the chart. Okay. See, um, we have a lot of countries here with, uh, high-speed railway, and
if we do academic research and I provide this chart, then, you know,
this is something people, uh, would criticise, because, you see, for China the value is 40.474.
Can you see that? Yeah. Uh, why? I guess this was created by some European guy,
so they use a dot where we use a comma. Their dot means a comma here.
Right. So it's actually 40,474 km.
So it should be a comma in New Zealand, but they use a dot. Anyway, that is okay. For Spain it is only 3,661.
Do you think the scale is correct? It is not. For China it is 40,000,
so it should be this long, right, and Spain only this big.
So that is somewhat misleading people. Can you follow me?
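The two issues in this chart, the European thousands separator and the bar scale, can be illustrated with a short Python sketch. It reads "40.474" as 40,474 and then computes bar lengths that are proportional to the values. The helper names and the maximum bar length of 50 characters are assumptions for illustration.

```python
def parse_european_integer(text):
    """Read '40.474' (dot as thousands separator) as the integer 40474."""
    return int(text.replace(".", ""))


def bar_lengths(values, max_length=50):
    """Scale every value so the largest one gets a bar of max_length characters."""
    largest = max(values.values())
    return {name: round(v / largest * max_length) for name, v in values.items()}


km = {"China": parse_european_integer("40.474"), "Spain": 3661}
bars = bar_lengths(km)
for country, length in bars.items():
    print(f"{country:6} {'#' * length} {km[country]:,} km")
```

Drawn to scale, China's bar is roughly eleven times as long as Spain's, so a chart showing them at similar lengths misrepresents the data.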
Yes. Yeah. And also I've seen another, uh, example showing New Zealand's exports to different countries,
right, or imports. New Zealand imports a lot of fish,
actually, some basa fish or some fish,
sorry, I forgot, from Vietnam. Okay.
And then it also shows that we export a lot of fish to different countries of the world, right.
So it's very cute because, um, the graph is showing you a fish fillet and then cuts it in proportion.
This proportion is for New Zealand, and so on. That is okay for commercial journals, okay.
For academic journals, you know, the proportion of the fish is hard to measure, right.
Yeah. For academic research purposes we have to be very accurate.
Do you know what I mean? And also sometimes people use different sizes of national flags to show the size of GDP or something.
Uh, this is not a good idea, actually, for academic purposes.
Okay. Anyway, there are lots of different column charts.
One is called the stacked column chart, and one is called the clustered column chart.
I want to show you the difference. First, the clustered column chart.
You have two categories, North and South, right? And across the different months you have two columns sitting together.
Okay. I want to tell you my own experience.
This year my program needs to be reviewed by the university, and I have to show the program is running very well.
We have, uh, seven cohorts.
So you can imagine, like, uh, the months here are like the cohorts in my case, right, and we have, uh, four majors, where here
there are only two regions. Do you know what I mean? Two regions.
Right. That looks okay. Is that right? In my case, you get four majors, four columns for every cohort, for example.
First of all, the graph is going to be super long.
And second, too colourful,
one colour, you know, for
one major, right,
across the different cohorts. So this is not an ideal case.
But you know, usually the column chart shows you the counts.
Do you agree with me? And also it shows you a kind of trend, right?
So how can I show the information without creating too many columns?
This one: the stacked column chart. Okay.
Yeah. I create one column for one cohort and use the proportions within it to show the different, you know, uh, majors.
Right. And then I can also tell the proportion, the importance, of a major in one semester.
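The numbers behind a stacked column chart are just the shares of each category within a column. A minimal Python sketch, with made-up cohorts, majors and enrolment counts, and a hypothetical helper name `stacked_shares`:

```python
def stacked_shares(groups):
    """For each column (cohort), return each category's share of the column total."""
    return {
        column: {name: count / sum(parts.values()) for name, count in parts.items()}
        for column, parts in groups.items()
    }


# Hypothetical enrolments by major for two cohorts.
enrolments = {
    "Cohort 1": {"Accounting": 20, "Finance": 10, "Marketing": 10, "Management": 10},
    "Cohort 2": {"Accounting": 15, "Finance": 15, "Marketing": 20, "Management": 10},
}
shares = stacked_shares(enrolments)
for cohort, parts in shares.items():
    print(cohort, {major: f"{share:.0%}" for major, share in parts.items()})
```

Each cohort becomes a single column whose segments add up to 100%, so seven cohorts need seven columns rather than twenty-eight.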
Can you follow me? My experience is, because I also use ChatGPT,
right?
I put my data into ChatGPT and I say, hey, please create a column chart for me, or choose the one which is the most beautiful and clear, right?
You write a very good, uh, prompt, and guess what chart?
You know what chart? Every time, it creates a column chart.
To me, it looks like it forgets the stacked column chart every time.
It creates this clustered column chart for me.
Okay. So the lesson from my experience is: don't rely on ChatGPT. Currently it's not so smart, okay.
Yeah, I think that's why you still have to learn this course.
Right? You learn, you know,
what the options are that you have. And then you can guide ChatGPT, you know, to create the stuff for you.
But I will say, be careful when you use ChatGPT to create, uh, data visualisations like column charts and line charts.
Sometimes ChatGPT is going to generate some fake numbers, and they are all fake. I checked.
Yeah. So when I ask ChatGPT, hey, create a cartoon of four people,
it creates five people every time. I don't know why there's another person.
So anyway, it's the training sometimes. So don't rely on it.
And the pie chart. Okay, so the pie chart is usually showing you the different proportions of the groups.
Right. And this is called a bubble chart.
So a bubble chart is like a scatter plot.
The difference between a bubble chart and a scatter plot is that in a bubble chart, for every
dot, it has a size, and the size is showing you another dimension of the information.
Okay, yeah. So the rest is easy.
And thank you very much for today. Um, it's all about data visualisation.
Hope you can use something for your group project. Okay. See you next time.