Joseph Azanza
Joseph Matthew R. Azanza
Data Scientist for a US-based cloud communications provider,
with expertise in Machine Learning, Artificial Intelligence,
Data Wrangling, Data Storytelling,
Sales Analytics, Sales Operations (Ops),
Marketing Analytics, Marketing Ops,
Business Intelligence, Strategic Initiatives,
Molecular Biology, and Biotechnology.
MS in Data Science
Asian Institute of Management
BS in Molecular Biology and Biotechnology
University of the Philippines Diliman
Joseph Matthew Azanza | John Christopher Tambago
Asian Institute of Management
Characterizing the types of digital communities that currently exist has many applications, including boosting the digital presence of businesses, helping establishments and brands engage with their consumers, and providing safe spaces where internet users can gather and share common interests. With this motivation, we set out to identify the types of digital communities that exist and the websites that belong to them. To do so, we analyzed the October 2020 Common Crawl dataset, which is publicly available on AWS. This dataset is essentially a snapshot of the web, collected periodically by the Common Crawl Foundation for research purposes. We analyzed an 11.2855 GB subset of the Parquet index of the October 2020 crawl data, which has a total size of ~200 GB. Our analysis consisted of data preprocessing, basic exploratory data analysis, and clustering via k-Means, all run on a cluster created with Amazon's EMR service. Our results show that the websites with English content can be grouped into 10 clusters, or digital communities: Parliament, Tourism, Scientific Data, Environment, Scientific Texts, Medicine, Euskadi, Education, Poland, and NGO. Interestingly, these digital communities are mainly government-related and development-oriented, which suggests that the power of the internet is being leveraged to do good for mankind.
Keywords: big data, unsupervised learning, distributed machine learning
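The abstract names k-Means as the clustering step. As a rough illustration of that algorithm only, the following is a minimal single-machine sketch of Lloyd's iteration using the Python standard library. It is not the distributed implementation run on EMR, and the function name `kmeans`, its parameters, and the toy 2-D points are all assumptions made for this example; the features actually clustered in the project are not described here.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (centroids, labels). Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Initialise centroids by picking k distinct input points at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance; ties go to the lower index).
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Toy data (assumed): two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

In a setting like the one described, the same assignment and update steps would be carried out by a distributed framework over far more rows, and the number of clusters (10 in the reported results) would be passed as `k`.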
Source code can be provided upon request, subject to the approval of all project collaborators.