General Assembly – Data Science Remote Course
“We believe, and have explained in ‘The Timeless Way of Building,’ that the languages which people have today are so brutal, and so fragmented, that most people no longer have any language to speak of at all—and what they do have is not based on human, or natural considerations.” – Christopher Alexander, Sara Ishikawa, Murray Silverstein
I recently completed a Data Science course via General Assembly (GA). In that course, I learned the requisite skills to get started down a path in Data Science, be it for individual curiosity, career pursuits, or both. We were taught the techniques and tools for everything from data gathering and cleansing to model selection, feature tuning, and much more. I am grateful for that course and the baseline knowledge GA gave me in Data Science. I wanted to pursue this course because too many models and algorithms are being created as black boxes with no transparency, and they lack the needed inclusion of minority voices in their design and feedback. I touch on the need for a wide array of voices participating in Data Science in a panel I joined at the Women Who Code CONNECT Conference.
One of the important things we learned is the distinction between Data Science, Data Analytics, and Data Engineering. Data Science encompasses both machine learning and statistical analysis. Data Analytics is generally considered “not data science,” but it should not be considered “less than” Data Science. Often, a Data Analyst will present an answer to a business problem, or a strategy for a campaign, gleaned from the data. Data Engineering is more focused on the tools, applications, and infrastructure that collect data and deploy models out into the world. All three are essential and operate along different parts of the assembly line of Big Data.
Another way to summarize the focus areas of these disciplines is to look at whether they form judgments on the output of the analysis of the collected data.
– Analytics, Visualization, and Data Engineering make no automated judgments. Instead, they surface relevant information to human decision makers, e.g. through reports and dashboards.
– Statistics makes one automated judgment at a time by testing a hypothesis or estimating a parameter (e.g. does making a button red rather than yellow lead to higher click-through rates?). It was invented to make science more rigorous and objective.
– Data Science and Machine Learning produce indefinitely many automated judgments by developing models that can be applied again and again (e.g. predicting prices for houses based on size and location). It was invented as an approach to artificial intelligence.
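To make the “one automated judgment at a time” idea concrete, here is a minimal sketch of the button-color example above as a classical two-proportion z-test. The click and view counts are invented for illustration, and the helper function name is my own:

```python
# Sketch: one automated statistical judgment (red vs. yellow button).
# Counts below are invented purely for illustration.
from math import sqrt, erf

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in click-through rates."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled proportion under the null hypothesis of no difference
    p = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: red button vs. yellow button
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The test yields exactly one judgment: with these made-up numbers, the difference in click-through rates is unlikely to be chance alone.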
As mentioned earlier, we learned and tested ourselves on the key technologies and workflow that make up the toolbox of every Data Scientist. The workflow is clear and succinct; summarized, it is as follows:
1. Frame the question or problem we want to solve and define the source of the data.
2. Prepare the data after collecting it and account for gaps and quality.
3. Analyze the data using the various mathematical, programmatic, and plotting tools that are widely adopted and accepted.
4. Interpret the data for what it does and does not tell us and know when to say we need more data.
5. Communicate our results to various audiences, deploy the model, and collect feedback to refine it.
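The five steps above can be sketched end to end on a toy problem. This is only an illustration in plain Python; the house sizes and prices are invented, and the example happens to be exactly linear so the fit is easy to check by hand:

```python
# Sketch of the five-step workflow on a toy problem:
# predicting house price from size. All data invented for illustration.

# 1. Frame the question: does price grow linearly with size?
sizes  = [50, 80, 100, None, 150, 200]   # square meters
prices = [150, 240, 300, 310, 450, 600]  # in thousands

# 2. Prepare: drop rows with missing measurements
clean = [(s, p) for s, p in zip(sizes, prices) if s is not None]
xs, ys = zip(*clean)

# 3. Analyze: fit a least-squares line, price = slope * size + intercept
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in clean)
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# 4. Interpret: the slope is the price per extra square meter
print(f"price ≈ {slope:.2f} * size + {intercept:.2f}")

# 5. Communicate / deploy: a reusable prediction function
def predict(size):
    return slope * size + intercept

print(f"predicted price for 120 m²: {predict(120):.0f}k")
```

In a real project steps 2–3 would lean on Pandas and scikit-learn rather than hand-rolled arithmetic, but the shape of the process is the same.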
Some of the myriad tools and models we learned to use included:
– Python, Pandas, JupyterLab & Notebooks, scikit-learn, Matplotlib, Seaborn, Linear Regression, Logistic Regression, Decision Trees, Random Forests, Data Cleansing, and many more.
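As one small taste of the models on that list, here is the core idea behind Decision Trees reduced to a single split, sometimes called a decision stump. This is my own illustrative sketch with invented data, not course material:

```python
# Sketch: a decision tree's basic move, choosing the one split that
# best separates the classes. Data invented for illustration.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Find the threshold on a single feature that minimizes the
    weighted Gini impurity of the two resulting groups."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: hours studied -> passed the exam (1) or not (0)
hours  = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(hours, passed)
print(f"split at hours <= {threshold} (impurity {impurity:.2f})")
```

A full tree (as in scikit-learn's `DecisionTreeClassifier`) simply repeats this search recursively on each side of the split, and a Random Forest averages many such trees.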
Our course had a strong focus on Python, modeling techniques, and feature engineering. Overall, I loved the course and can highly recommend it to anyone. I would only change a couple of things about it. I would stretch it from 10 to 12 weeks, and I would add two additional projects. The first would be completed in the first month, with three or four classmates as a team; the second in the following month, with one or two classmates as a team. In the last month, I would keep things the same, with the final project completed individually.
With what I now understand, I went and put together a teach-yourself Data Science curriculum that can make the field more accessible to the novice or the complete technology beginner. The steps below outline a process that can take you from couch to, well, desk, with the skills and understanding necessary to participate in the ever-growing arena of Big Data.
1. Read “Machine Learning for Everyone”
2. Take a Python Course
3. Take a Linear Algebra Course
4. Take a Data Science Course
5. Read “The Hundred-Page Machine Learning Book”
6. Participate in a Kaggle Competition
7. Read “Weapons of Math Destruction”
8. Bonus: Advanced AI / ML Course
9. Optional: Pursue a full-time job in Data Science
The above is just a general outline. However, I am sure that my course, and many others like it in various online and classroom settings, are missing from their curricula the key and essential ideals spelled out in “Weapons of Math Destruction”: the importance of ethics and fairness in building the algorithms and models that are the outputs of most Data Science and Machine Learning endeavors.
Cathy O’Neil’s “Weapons of Math Destruction,” mentioned above, illuminates the problems with poorly considered algorithms and models (Weapons of Math Destruction), which, when unleashed on an unsuspecting public, can deepen divides, falsely reaffirm prejudices, and compound social ills in the pursuit of efficiency. In clear, well-written language, O’Neil spells out, for the layman and the experienced Data Scientist alike, the definition of a Weapon of Math Destruction and its key traits. She also does a great job of providing clear and actionable ways we can spot these models, prevent them, and hold them accountable. I believe this book is essential reading for everyone in our modern data-driven culture: it reminds the builders of their responsibility to uphold fairness in their model design, and it teaches the non-data-scientist how to spot the models and data gathering being applied to them. It is wonderfully written and commanding.