Since rules extracted with rulest can be used directly in tools like hashcat, after generating a substantial number of them I decided to briefly document the process of building highly optimal hashcat rules from that data in this short guide.
From 12 runs of the tool I obtained 12 rule files covering depths 1–10 (up to 10 operator+argument pairs per rule line).
$ tree .
├── found_chains.phase0.txt
├── found_chains.txt
├── rulest_output.phase0.txt
├── rulest_output.txt
├── rulest_raw.rule
├── rulest_rules_ga.rule
├── rulest_rules_no_ga.rule
├── rulest_rules_skull.rule
├── rulest_rules_strip_1p7m.rule
├── rulest_rules_strip_500k.rule
├── rulest_rules_strips.phase0.txt
└── rulest_strip_hashmob.rule

0 directories, 12 files
After sorting by occurrence (frequency) I ended up with a file containing 4,384,527 unique rules. Not all of them are necessarily effective, so to find the most valuable ones I used the tried-and-true method of debugging rules against a large set of 32-hex-character hashes. The original plan was to use 160 million hashes, but an SSH connection dropped mid-session and the full run could not be repeated, so the debug pass was ultimately completed against approximately 70 million hashes. That still provides a solid statistical basis for frequency analysis.
# sort first: uniq -c only counts adjacent duplicate lines
$ cat * | sort | uniq -c | sort -nr | cut -c 9- | awk '!seen[$0]++' > rulest2debug.rule
$ wc -l rulest2debug.rule
4384527 rulest2debug.rule
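As a hedged illustration of the frequency ranking, the sketch below runs the same kind of pipeline on a toy file. The sample rules and the temporary directory are made up for the example. It sorts the input before `uniq -c`, because `uniq -c` only merges adjacent duplicates, and it strips the count column with `sed` instead of `cut -c 9-`, which is a swapped-in choice so the result does not depend on how wide the count column is.

```shell
# Toy input: three hypothetical rule lines with different frequencies.
tmp=$(mktemp -d)
printf '%s\n' 'l' '$1' 'l' 'c' '$1' 'l' > "$tmp/sample.rule"

# 1. sort     groups identical lines so uniq -c sees them as neighbours
# 2. uniq -c  prefixes each distinct line with its occurrence count
# 3. sort -nr puts the highest counts first
# 4. sed      removes the padded count and the single space after it,
#             leaving the rule itself untouched (including inner spaces)
ranked=$(LC_ALL=C sort "$tmp/sample.rule" \
  | uniq -c \
  | LC_ALL=C sort -nr \
  | sed 's/^ *[0-9]* //')

printf '%s\n' "$ranked"
rm -rf "$tmp"
```

With this input the most frequent rule (`l`, three occurrences) comes out first, followed by `$1` and then `c`.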
I used the following script to run the debug pass:
→ https://github.com/A113L/bucket — run_debug.sh
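The key hashcat options are --debug-mode=1, which logs the rule behind each successful crack, and --debug-file=rulest_debug_data.txt, which sets the file those rules are written to.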
$ hashcat -a 0 -m 0 MD5_debug_hashes.txt wordlist.txt \
    -r rulest2debug.rule \
    --debug-mode=1 \
    --debug-file=rulest_debug_data.txt
The resulting debug file was then processed through concentrator, which includes a Pareto-curve analysis mode designed for large sets of repeating debug data. Rules that appear most frequently across the debug output are the most valuable — that is the core principle behind the selection.
To use concentrator on the extracted rule set, simply run it without any arguments — this launches the program in interactive mode. The script begins by counting the total number of rules, so you know upfront how many you are working with.
$ ./concentrator

Input Configuration:
Enter rule files/directories (space-separated): rulest_debug_data.txt

Analysing Input Data...
✅ SUCCESS: Rule file: rulest_debug_data.txt

Quick Analysis:
Files: 1
Sampled rules: 24,299,147
Est. total: 24,299,147
Unique sample: 1,048,061
Max rule len: 39
Recommendation: Low uniqueness → Extraction
Long rules detected → consider functional minimization later.

Processing Mode:
1 – Extract top existing rules
2 – Generate combinatorial rules
3 – Generate Markov rules
Recommended: Mode 1
Select mode (1-3): 1
Top rules to extract [10000]: 24299147
Use statistical sort? [y/N]: n

Global Settings:
Output base name ['concentrator_output']:
Max rule length to process [31]:
Enable GPU acceleration? [Y/n]: y
Process entirely in RAM? [y/N]: n

Output Format:
1 – Standard line
2 – Expanded (space-separated operators)
Select (1-2): 2
Temp directory [system default]: .

Configuration Summary:
Mode: extraction
Input paths: 1 location(s)
Output base: concentrator_output
Max rule len: 31
GPU: Enabled
In-memory: No
Output format: expanded
Top rules: 24,299,147
Stat sort: No
Start processing? [Y/n]: y

Active Mode: EXTRACTION
Output File: concentrator_output_extracted.rule
Output Format: expanded
✅ SUCCESS: OpenCL initialised on: NVIDIA GeForce RTX 3060 Ti
✅ SUCCESS: GPU Acceleration: ENABLED

Collecting Rule Files (recursive, max depth 3)
✅ SUCCESS: Rule file: rulest_debug_data.txt
✅ SUCCESS: Found 1 rule files.
ℹ️ INFO: Using temporary directory: .

Parallel Rule File Analysis
ℹ️ INFO: Parallel analysis of 1 files using 1 processes...

Enter enhanced interactive mode? (Y/n): y

================================================================================
ENHANCED RULE PROCESSING – INTERACTIVE MENU
================================================================================
Initial dataset: 887,270 unique rules
--------------------------------------------------------------------------------
ADVANCED FILTERING OPTIONS:
(1) Filter by MINIMUM OCCURRENCE
(2) Filter by MAXIMUM NUMBER OF RULES (top N)
(3) Filter by FUNCTIONAL REDUNDANCY [RAM intensive]
(4) INVERSE MODE – keep rules BELOW the cut-off rank
(5) HASHCAT CLEANUP – validate (CPU/GPU modes)
(6) LEVENSHTEIN FILTER – remove similar rules
(7) TOGGLE OUTPUT FORMAT (currently: expanded)

ANALYSIS & UTILITIES:
(p) PARETO analysis
(s) SAVE current rules
(r) RESET to original dataset
(i) Dataset information
(q) QUIT
--------------------------------------------------------------------------------
Enter choice: p
Once sorting is complete you will be dropped into a submenu where you can find the Pareto curve analysis along with milestone breakpoints showing exactly how many rules are needed to cover a given percentage of the entire initial dataset.
Pareto milestones from concentrator: rules needed to reach each coverage percentage of the full 887K-rule corpus (21.4M total occurrences).

| Coverage | Rules needed | Share of unique rules |
|---|---|---|
| 10% | 42 | 0.0% |
| 20% | 200 | 0.0% |
| 30% | 574 | 0.1% |
| 40% | 1,447 | 0.2% |
| 50% | 3,678 | 0.4% |
| 60% | 8,803 | 1.0% |
| 70% | 21,505 | 2.4% |
| 80% | 48,834 | 5.5% |
| 90% | 138,936 | 15.7% |
| 95% | 283,945 | 32.0% |
| 99% | 672,981 | 75.8% |
This makes it straightforward to cut the ruleset down to a practical size: in this dataset the top 1% of rules by frequency already accounts for about 60% of all recorded hits, and the top 5.5% for about 80%.
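For readers who want to reproduce this kind of milestone list without concentrator, here is a minimal sketch of the underlying arithmetic in plain shell and awk. It is not concentrator's implementation. The toy debug data, the thresholds (50, 90, 100) and the variable names are assumptions chosen for the example.

```shell
# Toy debug data: one rule per line, repeated once per crack (made up).
tmp=$(mktemp -d)
{
  for i in 1 2 3 4 5 6; do echo 'l'; done
  for i in 1 2 3; do echo 'c'; done
  echo 'u'
} > "$tmp/debug.txt"

# Count every rule, rank by count, then walk the ranking and report the
# first rank at which the running total reaches each coverage threshold.
milestones=$(
  LC_ALL=C sort "$tmp/debug.txt" | uniq -c | LC_ALL=C sort -nr |
  awk '
    BEGIN { n = split("50 90 100", pct, " "); next_m = 1 }
    { count[NR] = $1; total += $1 }   # $1 is the count written by uniq -c
    END {
      for (rank = 1; rank <= NR; rank++) {
        running += count[rank]
        # Integer comparison avoids floating-point rounding at the threshold.
        while (next_m <= n && running * 100 >= pct[next_m] * total) {
          printf "%d%% coverage: top %d rule(s)\n", pct[next_m], rank
          next_m++
        }
      }
    }'
)

printf '%s\n' "$milestones"
rm -rf "$tmp"
```

On the toy input (6, 3 and 1 occurrences out of 10), the single most frequent rule reaches 50% coverage, the top two reach 90%, and all three are needed for 100%.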
Using the Pareto milestones as cut points, four rule files were produced — one for each practical use case, from a large comprehensive set down to a tiny high-precision set:
$ wc -l *.rule
500001 rulest_large.rule
150001 rulest_medium.rule
25001 rulest_small.rule
251 rulest_tiny.rule
675254 total
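How the four files were cut is not shown here, so the following is only a sketch under one assumption: that each tier is simply the first N lines of a frequency-ranked rule file. The file names, the toy rules and the tier sizes are hypothetical and far smaller than the real ones.

```shell
tmp=$(mktemp -d)
# Toy ranked rule file, most frequent rule first (contents are made up).
printf '%s\n' ':' 'l' 'c' 'u' '$1' '$2' 'r' 'd' > "$tmp/ranked.rule"

# Hypothetical tiers as name:size pairs; each tier is a prefix of the ranking.
for spec in tiny:2 small:4 medium:6; do
  name=${spec%%:*}
  size=${spec##*:}
  head -n "$size" "$tmp/ranked.rule" > "$tmp/toy_$name.rule"
done

# tr strips the padding that some wc implementations add.
tiny_n=$(wc -l < "$tmp/toy_tiny.rule" | tr -d ' ')
small_n=$(wc -l < "$tmp/toy_small.rule" | tr -d ' ')
medium_n=$(wc -l < "$tmp/toy_medium.rule" | tr -d ' ')

# Because every tier is cut from the same ranking, the smaller file is
# always the leading part of the larger one.
if head -n "$tiny_n" "$tmp/toy_small.rule" | cmp -s - "$tmp/toy_tiny.rule"; then
  prefix_ok=yes
else
  prefix_ok=no
fi

echo "tiny=$tiny_n small=$small_n medium=$medium_n nested=$prefix_ok"
rm -rf "$tmp"
```

Cutting every tier from one ranking keeps the sets nested, so moving up a tier only ever adds rules and never drops one that a smaller set already contained.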
Results from the rules comparison spreadsheet. Each of the four output files is compared against the best-in-class rule set in its weight category, and against other existing rulest_* variants.
The existing rulest_* family in the benchmark includes rulest_rules_strip_1p7m (44.65%, 1.7M rules), rulest_rules_strip_1p1m (43.98%, 1.1M rules), rulest_rules_strip (38.10%, 332k rules), rulest_rules_ga (31.76%, 643k rules), and rulest_rules_no_ga (31.35%, 329k rules). The new rulest_large at 500k sits right in the middle of that family: better recovery than the strip and GA variants while being a fraction of the size of the 1M+ strip files. A good trade-off for most use cases.
One thing worth highlighting: rulest runs comfortably on cards with less than 4 GB of VRAM. The entire workflow (rule generation, debug pass, concentrator analysis) can be run on modest hardware most people already own. You don't need a flagship GPU to produce a competitive rule set. The debug pass itself is the most time-consuming part, but even that is a one-time cost. The resulting files can then be reused across any number of jobs at no additional overhead.
Combined with the Pareto insight that a few thousand high-frequency rules cover a disproportionate share of real-world hashes, this means the barrier to having solid, personalized rule sets is much lower than it appears.
This entire project — rulest, concentrator, and all the tooling around it — was built purely as a hobby and learning exercise. There was no grand plan, just curiosity about whether frequency analysis on debug output could produce better rules than manual curation or purely random generation. Turns out: yes, noticeably so.
If you find the tools or the methodology useful, great. If you have ideas for improvement, the repos are open. And if you just enjoy poking at this kind of thing for the same reason — have fun.