Time for Education: Data Mining in High School - Yuli Zhan

Yuli Zhan ／English

Key Words:
education, data mining, NumPy, bisecting K-means, regression tree

1. Background
With the growing popularity of “big data” in recent years, data mining has been put into practice in many areas, such as finance and social studies. While we celebrate this practical progress, the field of education seems to be “indifferent” to the trend. Admittedly, traditional school education reflects centuries of refinement and may not seem to need further change. However, now that Massive Open Online Courses (MOOCs) have attracted millions of students almost overnight, we cannot help asking: is there other undiscovered potential in education in the Information Age?

With this question in mind, we came up with the idea of “data mining in school”.

A school generates an abundance of data on a daily basis: academic results from regular examinations, comments from teachers, and records of Co-Curricular Activities (CCAs). We believe that much potential is hidden in this mass of data and that students will be its ultimate beneficiaries.

2. What we have done
As Junior College students, we applied for and obtained anonymous records of past-year students from the school. In the presentation, we will show how a large dataset containing various records from 1000 students is digested and understood by the machine. With NumPy and its strong mathematical functions, we uncover valuable information that helps to improve some long-practiced traditional measures in education.

Two implemented cases will be shown:
1) Big-data-based member allocation for small group study
2) A-level examination result prediction

Small group study gathers students with similar academic performance and study patterns so that teachers can give specific help and instructions. A-level result prediction, on the other hand, is valued by many universities during enrollment, and it makes students more aware of their strengths and weaknesses in different subjects so that they can adopt a more effective revision strategy.
However, both of these important activities rely heavily on teachers’ experience and personal judgement, which can be biased and inaccurate to a certain extent. Thus, our focus is to find a more scientific alternative that improves the quality of these two activities through data mining.

Throughout the data analysis, we adopted clustering to allocate groups and regression to predict continuous data. Bisecting K-means and the regression tree are the two main algorithms implemented for this data mining. The final results are plotted with matplotlib.
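
The clustering idea can be sketched in a few lines of NumPy. The following is a minimal illustration of bisecting K-means under stated assumptions, not the implementation used in the project: the function names, the deterministic choice of starting centroids, and the tiny `scores` array (two made-up subject marks per student) are all hypothetical. It also assumes the requested number of groups does not exceed the number of distinct students.

```python
import numpy as np

def cluster_sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    if len(points) < 2:
        return 0.0
    return float(((points - points.mean(axis=0)) ** 2).sum())

def two_means(points, n_iter=20):
    """Plain K-means with k=2; returns a 0/1 label for every row."""
    # Deterministic start (an assumption made here for reproducibility):
    # the point farthest from the mean, then the point farthest from it.
    first = points[np.linalg.norm(points - points.mean(axis=0), axis=1).argmax()]
    second = points[np.linalg.norm(points - first, axis=1).argmax()]
    centroids = np.stack([first, second]).astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Broadcasting gives an (n, 2) matrix of point-to-centroid distances.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centroids[k] = points[labels == k].mean(axis=0)
    return labels

def bisecting_kmeans(points, n_clusters):
    """Repeatedly split the cluster with the largest squared error in two."""
    points = np.asarray(points, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for new_label in range(1, n_clusters):
        worst = max(range(new_label), key=lambda c: cluster_sse(points[labels == c]))
        idx = np.flatnonzero(labels == worst)
        halves = two_means(points[idx])
        # One half keeps the old label, the other half gets the new one.
        labels[idx[halves == 1]] = new_label
    return labels

# Hypothetical marks (subject A, subject B) for nine students.
scores = np.array([[50, 50], [52, 48], [49, 53],
                   [70, 72], [71, 69], [68, 70],
                   [90, 91], [92, 89], [89, 93]])
groups = bisecting_kmeans(scores, 3)
print(groups)
```

Each student in the same study group receives the same label; the label numbers themselves carry no meaning. Splitting the cluster with the largest squared error is one common selection rule, and other rules (for example, splitting the largest cluster) are equally valid.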

3. Presentation Outline
In the short 25-minute presentation, we hope to cover the following aspects:
1) The practice and outcome of combining Python and data mining in education, showing its potential within the traditional school curriculum.

2) Advantages of Python and NumPy in implementing data mining algorithms. Though the two algorithms covered in this presentation are not very complicated, the use of Python has greatly simplified the implementation. We will first give a brief introduction to the K-means and regression tree algorithms and the statistical principles behind them. Then we will show how these problems are solved elegantly with NumPy (e.g. various matrix calculations and built-in functions). Python’s strengths in data processing and display, such as easy file access, slicing and matplotlib, are also covered.

3) Further plans / call for participation
Due to time constraints, some of our ideas and plans have not been finalized, such as profiling students’ characters using a naive Bayes classifier or psychological crisis intervention using text analysis. With the help of PyCon, we hope to encourage more people to participate in the school data mining project so that it benefits a larger group of students.
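
To illustrate the NumPy slicing and built-in functions mentioned in item 2) above, here is a minimal sketch of a regression tree. It is an illustration under stated assumptions rather than the code from the project: the function names, the `min_leaf` and `max_depth` parameters, and the small mock-exam dataset are all hypothetical. Each split is chosen to minimise the total squared error of the two resulting groups, and each leaf predicts the mean of its training targets.

```python
import numpy as np

def best_split(X, y, min_leaf=2):
    """Return (feature, threshold) with the lowest squared error, or None."""
    best, best_err = None, ((y - y.mean()) ** 2).sum()
    for j in range(X.shape[1]):
        # Every distinct value except the largest is a candidate threshold.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t  # boolean mask used for slicing
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            err = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if err < best_err:
                best, best_err = (j, float(t)), err
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Nested dicts for internal nodes, plain floats for leaves."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return float(y.mean())
    j, t = split
    left = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": build_tree(X[left], y[left], depth + 1, max_depth),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth)}

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, dict):
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical data: one mock-exam mark per student and the final mark.
X = np.array([[40.0], [45.0], [50.0], [80.0], [85.0], [90.0]])
y = np.array([42.0, 44.0, 46.0, 82.0, 84.0, 86.0])
tree = build_tree(X, y)
print(predict(tree, [47.0]), predict(tree, [88.0]))
```

With only six made-up students, the tree makes a single split; a real dataset would need more features and some form of pruning or validation to avoid overfitting.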

4. Selection of Audience Level
We have selected the “beginner” level based on the following reasons:
1) Data mining has long been regarded as difficult to understand. With the help of Python and our presentation, we will try to simplify difficult concepts to reach a larger audience, showing the charm and practical use of data mining.
2) As a high school student, I am still a beginner in both computing and Python myself, so the beginner level may be a more suitable starting point.

About Speaker

Zhan Yuli is a senior high student currently taking A-level Computing at Dunman High School, Singapore. Building on previous competitive programming experience, he started learning Python and Android development two years ago. With a great interest in education, he is now studying data mining in the hope that it will help improve learning experiences under the traditional school curriculum.

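Zhan Yuli, age 19.
He is currently a second-year senior high student at Dunman High School, Singapore, majoring in A-level Computing. His first exposure to computing came from algorithm competitions. He began studying Python and Android development systematically two years ago, and a year ago he encountered data mining and machine learning and developed a strong interest in them. Because both of his parents work in education and he is a student himself, he is very interested in the challenges that traditional school education faces in the information age, and he hopes to contribute what little he can.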